Can I run AI locally?
- mark_l_watson - 7306 seconds ago
I have spent a HUGE amount of time over the last two years experimenting with local models.
A few lessons learned:
1. Small models like the new qwen3.5:9b can be fantastic for local tool use, information extraction, and many other embedded applications.
2. For coding tools, just use Google Antigravity and gemini-cli, or, Anthropic Claude, or...
Now to be clear, I have spent perhaps 100 hours in the last year configuring local models for coding using Emacs, Claude Code (configured for local), etc. However, I am retired and this time was a lot of fun for me: lots of effort trying to maximize local-only results. I don't recommend it for others.
I do recommend getting very good at using embedded local models in small practical applications. Sweet spot.
- meatmanek - 10605 seconds ago
This seems to be estimating based on memory bandwidth / size of model, which is a really good estimate for dense models, but MoE models like GPT-OSS-20B don't involve the entire model for every token, so they can produce more tokens/second on the same hardware. GPT-OSS-20B has 3.6B active parameters, so it should perform similarly to a 3-4B dense model, while requiring enough VRAM to fit the whole 20B model.
(In terms of intelligence, they tend to score similarly to a dense model that's as big as the geometric mean of the full model size and the active parameters, i.e. for GPT-OSS-20B, it's roughly as smart as a sqrt(20b*3.6b) ≈ 8.5b dense model, but produces tokens 2x faster.)
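The geometric-mean rule of thumb above can be sketched numerically (the function name here is my own; the numbers are the ones from the comment):

```python
import math

def moe_equivalent_dense_size(total_params_b: float, active_params_b: float) -> float:
    """Heuristic: a MoE model scores roughly like a dense model whose size is
    the geometric mean of its total and active parameter counts."""
    return math.sqrt(total_params_b * active_params_b)

# GPT-OSS-20B: 20B total parameters, 3.6B active per token
print(round(moe_equivalent_dense_size(20, 3.6), 1))  # → 8.5
```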
- mopierotti - 5870 seconds ago
This (together with llmfit) is a great attempt, but I've been generally frustrated by how hard it feels to find any guidance on what I would expect to be the most straightforward/common question:
"What is the highest-quality model that I can run on my hardware, with tok/s greater than <x>, and context limit greater than <y>"
(My personal approach has just devolved into guess-and-check, which is time-consuming.) When using TFA/llmfit, I am immediately skeptical because I already know that Qwen 3.5 27B Q6 @ 100k context works great on my machine, but it's buried behind relatively obsolete suggestions like the Qwen 2.5 series.
I'm assuming this is because the tok/s is much higher, but I don't really get much marginal utility out of tok/s speeds beyond ~50 t/s, and there's no way to sort results by quality.
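The question in quotes is easy to express as a query if the underlying data were exposed. A minimal sketch; the model table, names, and numbers here are all invented for illustration:

```python
from dataclasses import dataclass

@dataclass
class ModelEstimate:
    name: str
    quality_rank: int      # lower is better (e.g. a benchmark-derived ordering)
    tokens_per_sec: float  # estimated generation speed on this hardware
    max_context: int       # largest context that fits in memory

def best_model(models, min_toks: float, min_context: int):
    """Return the highest-quality model meeting the speed and context floors."""
    ok = [m for m in models if m.tokens_per_sec >= min_toks and m.max_context >= min_context]
    return min(ok, key=lambda m: m.quality_rank) if ok else None

# Hypothetical per-machine estimates:
catalog = [
    ModelEstimate("big-27b-q6", 1, 55.0, 100_000),
    ModelEstimate("mid-9b-q8", 2, 120.0, 128_000),
    ModelEstimate("tiny-3b-q4", 3, 300.0, 32_000),
]
print(best_model(catalog, min_toks=50, min_context=64_000).name)  # → big-27b-q6
```

The point being: once quality is a sortable column, "fastest thing that fits" stops being the only answer the tool can give.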
- twampss - 14274 seconds ago
Is this just a web version of llmfit?
- gopalv - 675 seconds ago
Chrome runs Gemini Nano if you flip a few feature flags [1].
The model is not great, but it was the "least amount of setup" LLM I could run on someone else's machine.
It even supports structured output, though the context window I could use was tiny.
- paxys - 901 seconds ago
I wish the creators of local model inference tools (LM Studio, Ollama, etc.) would release these numbers publicly, because you can be sure they are sitting on a large dataset of real-world performance.
- LeifCarrotson - 12888 seconds ago
This is missing a whole lot of mobile GPUs. It also does not understand that you can share CPU memory with the GPU, or use various KV cache offloading strategies to work around memory limits.
It says I have an Arc 750 with 2 GB of shared RAM, because that's the GPU that renders my browser...but I actually have an RTX1000 Ada with 6 GB of GDDR6. It's kind of like an RTX 4050 (not listed in the dropdowns) with lower thermal limits. I also have 64 GB of LPDDR5 main memory.
It works - Qwen3 Coder Next, Devstral Small, Qwen3.5 4B, and others can run locally on my laptop in near real-time. They're not quite as good as the latest models, and I've tried some bigger ones (up to 24GB, it produces tokens about half as fast as I can type...which is disappointingly slow) that are slower but smarter.
But I don't run out of tokens.
- andy_ppp - 7303 seconds ago
Is it correct that there's zero performance improvement between the M4 (+Pro/Max) and M5 (+Pro/Max)? The data looks identical. Also, memory size does not seem to improve performance on larger models, when I thought it would.
Love the idea though!
EDIT: Okay, the whole thing is nonsense: just rough guesswork, or an LLM asked to estimate the values. You should use real data (I'm sure people here can help) and put ESTIMATE next to any combinations you are guessing.
- sxates - 15381 seconds ago
Cool thing!
A couple suggestions:
1. I have an M3 Ultra with 256GB of memory, but the options list only goes up to 192GB. The M3 Ultra supports up to 512GB.
2. It'd be great if I could flip this around: choose a model, then see the performance across all the different processors. That would help with buying decisions!
- mmaunder - 3867 seconds ago
OP, can you please make it less dark and slightly larger? Super useful otherwise. Qwen 3.5 9B is going to get a lot of love out of this.
- kpw94 - 1050 seconds ago
People complaining about how hard it is to get a simple answer don't appreciate the complexity of figuring out the optimal model...
There are so many knobs to tweak, it's a non-trivial problem:
- Average/median length of your Prompts
- prompt eval speed (tok/s)
- token generation speed (tok/s)
- Image/media encoding speed for vision tasks
- Total amount of RAM
- Max bandwidth of the RAM (DDR4, DDR5, etc.)
- Total amount of VRAM
- "-ngl" (amount of layers offloaded to GPU)
- Context size needed (you may need sub 16k for OCR tasks for instance)
- Total parameter count (billions)
- Active parameter count for MoE models
- Acceptable level of Perplexity for your use case(s)
- How aggressive a quantization you're willing to accept (to keep perplexity low enough)
- even finer grain knobs: temperature, penalties etc.
Also, tok/s as a metric isn't enough on its own, because there's:
- thinking vs non-thinking: which mode do you need?
- models that are much more "chatty" than others in the same class (I remember testing a few models that max out my modest desktop's specs: Qwen 2.5 non-thinking was so much faster than the equivalent Ministral non-thinking even though they had equivalent tok/s... Qwen would just respond to the point quickly)
In the end, the final questions are: are you satisfied with how long getting an answer took, and was the answer good enough?
The same exercise exists with paid APIs too. There are obviously fewer knobs, but depending on your use case there are still differences between providers and models. You can abstract away a lot of the knobs; just add "are you satisfied with how much it cost" on top of the other two questions.
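Several of the speed knobs in the list above collapse into one first-order estimate: decode tok/s ≈ memory bandwidth ÷ (active parameters × bytes per weight), since every generated token streams all active weights through memory once. A sketch with illustrative hardware numbers (the function name is mine):

```python
def est_decode_tok_s(bandwidth_gb_s: float, active_params_b: float, bytes_per_param: float) -> float:
    """First-order decode speed: each token reads every active weight once,
    so throughput is memory bandwidth divided by bytes read per token."""
    bytes_per_token = active_params_b * 1e9 * bytes_per_param
    return bandwidth_gb_s * 1e9 / bytes_per_token

# Dense 8B model at Q4 (~0.5 bytes/param) on ~400 GB/s memory:
print(round(est_decode_tok_s(400, 8, 0.5)))    # → 100
# MoE with 3.6B active parameters at Q4 on the same hardware:
print(round(est_decode_tok_s(400, 3.6, 0.5)))  # → 222
```

This ignores prompt processing (compute-bound, not bandwidth-bound) and KV cache traffic, which is exactly why tok/s alone, as the comment says, isn't enough.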
- carra - 11626 seconds ago
Having a rating of how well the model will run for you is cool. I'd also like some rating of the models' capabilities (even if that is tricky). There are way too many to choose from, and just looking at the parameter count or the memory used is not always a good indication of actual capability.
- phelm - 14897 seconds ago
This is awesome. It would be great to cross-reference some intelligence benchmarks so that I can understand the trade-off between RAM consumption, token rate, and how good the model is.
- freediddy - 10632 seconds ago
I think perplexity is more important than tokens per second; tokens per second is relatively useless in my opinion. There is nothing worse than getting bad results returned to you very quickly and confidently.
I've been working with quite a few open-weight models for the last year, and especially for things like images, models from 6 months ago would return garbage data quickly, but these days qwen 3.5 is incredible, even the 9b model.
- fraywing - 1399 seconds ago
This is amazing. Still waiting for the "Medusa" class AMD chips to build my own AI machine.
- azmenak - 3630 seconds ago
From my personal testing, running various agentic tasks with a bunch of tool calls on an M4 Max 128GB, I've found that running quantized versions of larger models produces the best results, which this site completely ignores.
Currently, Nemotron 3 Super using Unsloth's UD Q4_K_XL quant is running nearly everything I do locally (replacing Qwen3.5 122b)
- cafed00d - 7940 seconds ago
Open it in multiple browsers (Safari vs. Chrome) to get more accurate, glanceable rankings.
It's using WebGPU as a proxy to estimate system resources. Chrome tends to leverage as much of the resources (compute + memory) as the OS makes available; Safari tends to be more efficient.
Maybe this was obvious to everyone else, but it's worth reiterating for those of us skimming HN :)
- kuon - 2627 seconds ago
I have an AMD 9700 and it is not listed, even though it is great LLM hardware because it has 32GB for a reasonable price. I tried the "custom" option but it didn't seem to work.
The tool is very nice though.
- rcarmo - 6548 seconds ago
This is kind of bogus, since some of the S and A tier models are pretty useless for reasoning or tool calls and can’t run with any sizable system prompt… it seems to be based solely on tokens per second?
- GrayShade - 15212 seconds ago
This feels a bit pessimistic. Qwen 3.5 35B-A3B runs at 38 t/s token generation with llama.cpp (mmap enabled) on my Radeon 6800 XT.
- SXX - 5129 seconds ago
Sorry if this has already been answered, but will there be a metric for latency, aka time to first token?
I've been considering buying an M3 Ultra, and it feels like the most often discussed piece of Apple hardware for running local LLMs. Generation speed might be okay, but prompt processing can take ages.
- am17an - 9432 seconds ago
You can still run larger MoE models by off-loading the expert weights to the CPU for token generation. They are by and large usable; I get ~50 tok/s on a Kimi Linear 48B (3B active) model on a potato PC plus a 3090.
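For reference, one way to do that split with llama.cpp (flag names taken from recent llama.cpp builds; they change between versions, so check `--help` on yours):

```shell
# Offload all layers to the GPU, but override the MoE expert tensors to stay
# in CPU RAM: the small always-active weights live in VRAM, and the experts
# stream from system memory as needed.
llama-server -m model.gguf -ngl 99 -ot ".ffn_.*_exps.=CPU" -c 16384

# Newer builds expose a dedicated flag for the same idea:
#   --n-cpu-moe N   (keep the MoE weights of the first N layers on the CPU)
```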
- orthoxerox - 10971 seconds ago
For some reason it doesn't react to changing the RAM amount in the combo box at the top. If I open it on my Ryzen AI Max 395+ with 32 GB of unified memory, it thinks nothing will fit, because I've set the machine up to reserve 512MB of RAM for the GPU.
- bearjaws - 3010 seconds ago
So many people have vibe-coded these websites; they are posted to Reddit nearly daily.
- reactordev - 3665 seconds ago
This shows no models working with my hardware, but that's far from the truth, as I'm running Qwen3.5…
This isn’t nearly complete.
- amelius - 9530 seconds ago
It would be great if something like this were built into ollama, so you could easily list available models based on your current hardware setup from the CLI.
- John23832 - 16835 seconds ago
The RTX Pro 6000 is a glaring omission.
- AstroBen - 11502 seconds ago
This doesn't look accurate to me. I have an RX 9070 and I've been messing around with Qwen 3.5 35B-A3B. According to this site I can't even run it, yet I'm getting 32 tok/s ^.-
- sdingi - 9617 seconds ago
When running models on my phone - either through the web browser or via an app - is there any chance it uses the phone's NPU, or will these be GPU-only?
I don't really understand how the interface to the NPU chip looks from the perspective of a non-system caller, if it exists at all. This is a Samsung device but I am wondering about the general principle.
- zitterbewegung - 6778 seconds ago
The M4 Ultra doesn't exist, and there are more credible rumors of an M5 Ultra. I wouldn't publish a projection like that without highlighting that this processor doesn't exist yet.
- anigbrowl - 5027 seconds ago
Useful tool, although some of the grey text is so dark that I had to squint to make it out against the background.
- mkagenius - 6967 seconds ago
I literally made the same app two weeks back - https://news.ycombinator.com/item?id=47171499
- sshagent - 12200 seconds ago
I don't see my beloved 5060 Ti. Looks great though.
- vova_hn2 - 13834 seconds ago
It says "RAM - unknown", but doesn't give me an option to specify how much RAM I have. Why?
- tcbrah - 9681 seconds ago
TBH, I stopped caring about "can I run X locally" a while ago. For anything where quality matters (scripting, code, complex reasoning), the local models are just not there yet compared to an API. Where local shines is specific narrow tasks - TTS, embeddings, Whisper for STT, stuff like that. Trying to run a 70B model at 3 tok/s on your gaming GPU when you could just hit an API for like $0.002/req feels like a weird flex, IMO.
- lagrange77 - 4589 seconds ago
Finally! I've been waiting for something like this.
- ge96 - 12923 seconds ago
Raspberry Pi? Say a 4B model with 4GB of RAM.
I also want to run vision like Yocto and basic LLM with TTS/STT
- mrdependable - 14224 seconds ago
This is great; I've been trying to figure this stuff out recently.
One thing I do wonder is what sort of solutions there are for running your own model, but using it from a different machine. I don't necessarily want to run the model on the machine I'm also working from.
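The usual shape of that setup (llama.cpp's `llama-server` shown here, though Ollama and LM Studio can do the same; the hostname, port, and model path are placeholders) is to serve an OpenAI-compatible HTTP endpoint on the box with the hardware and call it over the LAN:

```shell
# On the machine with the GPU: listen on all interfaces, not just localhost.
llama-server -m model.gguf --host 0.0.0.0 --port 8080 -ngl 99

# From any other machine on the network:
curl http://gpu-box.local:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "hello"}]}'
```

Since the endpoint speaks the OpenAI API shape, most editor and agent tooling can be pointed at it by changing a base URL.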
- golem14 - 8386 seconds ago
Has anyone actually built anything with this tool?
The website says that code export is not working yet.
That’s a very strange way to advertise yourself.
- adithyassekhar - 14448 seconds ago
This just reminded me of https://www.systemrequirementslab.com/cyri.
Not sure if it still works.
- havaloc - 13630 seconds ago
Missing the A18 Neo! :)
- debatem1 - 13788 seconds ago
For me, the "can run" filter says "S/A/B" but lists S, A, B, and C, and the "tight fit" filter says "C/D" but lists F.
Just FYI.
- tencentshill - 5876 seconds ago
Missing the laptop versions of all these chips.
- amelius - 10201 seconds ago
What is this S/A/B/C/etc. ranking? Is anyone else using it?
- ipunchghosts - 2469 seconds ago
What is S? Also, the NVIDIA RTX 4500 Ada is missing.
- arjie - 13060 seconds ago
Cool website. The one I'd really like to see there, though, is the RTX 6000 Pro Blackwell 96 GB.
- ryandrake - 7358 seconds ago
Missing the RTX A4000 20GB from the GPU list.
- jrmg - 11560 seconds ago
Is there a reliable guide somewhere to setting up local AI for coding? (Please don’t say ‘just Google it’ - that just results in a morass of AI slop/SEO pages with out-of-date, non-self-consistent, incorrect, or impossible instructions.)
I’d like to be able to use a local model (which one?) to power Copilot in vscode, and run coding agent(s) (not general purpose OpenClaw-like agents) on my M2 MacBook. I know it’ll be slow.
I suspect this is actually fairly easy to set up - if you know how.
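It is fairly easy, for what it's worth. The usual shape (hedged: exact flags and extension settings vary by version and tool, and the model name below is just one plausible pick) is to serve a local model behind an OpenAI-compatible endpoint, then point the editor integration at that URL instead of a cloud provider:

```shell
# 1. Serve a code-oriented model locally. Ollama shown; LM Studio or
#    llama.cpp's llama-server work the same way.
ollama pull qwen2.5-coder:7b   # pick whatever fits your Mac's RAM
ollama serve                    # exposes an OpenAI-compatible API on localhost:11434

# 2. In the editor, set the AI extension's "OpenAI-compatible" / custom
#    endpoint option to http://localhost:11434/v1 and select the model.
```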
- amelius - 10240 seconds ago
Why isn't there some kind of benchmark score in the list?
- bheadmaster - 6008 seconds ago
Missing the 5060 Ti 16GB.
- brcmthrowaway - 9108 seconds ago
If you haven't tried Qwen3.5 on Apple Silicon, I highly suggest you do! Claude-level performance on local hardware. If the Qwen team didn't get fired, I would be bullish on local LLMs.
- S4phyre - 14768 seconds ago
Oh, how cool. I've always wanted a tool like this.
- tristor - 6151 seconds ago
This does not seem accurate based on my recently received M5 Max 128GB MBP. I think there's some estimation/guesswork involved, and it's also discounting that you can move the memory divider on unified-memory devices like Apple Silicon and the AMD AI Max 395+.
- g_br_l - 13843 seconds ago
Could you add the Raspberry Pi to the list, to see which ridiculously small models it can run?
- metalliqaz - 13738 seconds ago
Hugging Face can already do this for you (with a much more up-to-date list of available models), as can LM Studio. However, they don't attempt to estimate tok/s, so that's a cool feature. That said, I don't really trust those numbers much, because the estimate doesn't incorporate information about the CPU, etc. Full GPU offload often isn't possible on consumer PC hardware, and there are different quants available that make a big difference.
- charcircuit - 13808 seconds ago
On mobile, it shows the other stats but not the name of the model.
- polyterative - 6352 seconds ago
Awesome, needed this.
- kylehotchkiss - 10721 seconds ago
My Mac mini rocks qwen2.5 14b at a lightning-fast 11 tokens a second, which is actually good enough for the long-term data processing I make it spend all day doing. It doesn’t lock up the machine or prevent its primary purpose as a webserver from being fulfilled.
- varispeed - 11139 seconds ago
Does it make any sense? I tried a few models at 128GB and they're all pretty much rubbish. Yes, they give coherent answers, and sometimes they are even correct, but most of the time it's just plain wrong. I find it a massive waste of time.
- tkfoss - 5060 seconds ago
Nice UI, but crap data; probably LLM-generated.
- nilslindemann - 10351 seconds ago
1. More title attributes, please ("S 16 A 7 B 7 C 0 D 4 F 34", huh?)
2. Add a 150% size bonus to your site.
Otherwise, cool site, bookmarked.
- unfirehose - 11451 seconds ago
If you do, would you still want to collect the data in a single pane of glass? See my open-source repo for aggregating harness data from multiple machine-learning model harnesses and models into a single place, to discover what you are working on and spending time and money on. There are plans for a scrobble feature like last.fm, but for agent research, code development, and execution.
https://github.com/russellballestrini/unfirehose-nextjs-logg...
Thanks, I'll check back for comments. Feel free to fork, but if you want to contribute you'll have to find me off of GitHub; I develop privately on my own self-hosted GitLab server. Good luck & God bless.