A 10 year old Xeon is all you need

point.free - 524 points - 232 comments - 38116 seconds ago

Comments (54)

cafkafk - 37914 seconds ago
Hi HN. I wrote this post after getting frustrated by the lack of ways to run the new Gemma 4 Drafter models, with mainstream tools not prioritizing this and hiding all the performance levers.
I ended up getting a modern 26B MoE model (Gemma 4) running at reading speed on an old recycled server with a single Xeon E5-2620 v4 and 128GB of DDR3 RAM (and no GPU). It took a lot of work, but it actually worked out somehow.
I've also linked the quants at the end, but they're not gonna run unless you use the ik_llama.cpp fork I mention; see other posts for more details.
I'm not an ML engineer, so I'm by no means an expert, and the server is busy acting as a Nix cache, but if you have any questions, I can try to answer on a best-effort basis.
cmiles8 - 18389 seconds ago
We’re not there yet, but the obvious endgame of the present bubble insanity is that open models running on local hardware and devices are “good enough” for most use cases. That will completely implode what’s going on at the moment in tech.
deng - 24071 seconds ago
Nice post and technically impressive work. I agree we need to understand the build pipeline and be able to do things locally. However, depending on your electricity cost, it might not make sense financially. These old servers are not energy efficient at all (I'm guessing that old Xeon server will easily pull 200W under load), and that model is currently at $0.10/$0.30 per 1M tokens (with 76 tps and 262k context) on OpenRouter. Also, these servers are LOUD.
EDIT: I stand corrected, 200W is apparently way too high of an estimate. I used to run a bunch of old Xeon servers and they slurped watts like crazy, but I can't remember which ones exactly those were.
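The cost comparison above can be made concrete with a few lines of arithmetic. This is a minimal sketch, not a measurement: the 200 W figure is the estimate that the edit above retracts as too high, the 12 tokens/s figure is the one reported elsewhere in this thread, and the 0.30 per kWh electricity price is a purely hypothetical input.

```python
def electricity_cost_per_million_tokens(watts, tokens_per_second, price_per_kwh):
    """Electricity cost (in the currency of price_per_kwh) to generate 1M tokens."""
    seconds = 1_000_000 / tokens_per_second   # wall-clock time for 1M tokens
    kwh = (watts / 1000) * (seconds / 3600)   # energy drawn over that time
    return kwh * price_per_kwh


# Assumed inputs (see the paragraph above): 200 W, 12 tokens/s, 0.30 per kWh.
cost = electricity_cost_per_million_tokens(200, 12, 0.30)
print(f"{cost:.2f} per 1M tokens")

# Halving the power draw halves the cost, so the wattage estimate matters a lot.
print(f"{electricity_cost_per_million_tokens(100, 12, 0.30):.2f} per 1M tokens")
```

With these assumed inputs the electricity alone comes to roughly 1.39 per million generated tokens, so the result depends heavily on the real power draw and the local price per kWh.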
throwaway2027 - 25492 seconds ago
Glad to see other people realizing this. I've been running Gemma 26B-A4B Q4 on a 2012 Xeon with 16GB to 24GB of RAM in a container. It's getting around 8 to 12 tokens per second. Obviously it's not comparable to huge contexts and running it on a GPU and the image decoder in llama.cpp is super slow compared to a GPU but for some small automation tasks and general trivia questions it's decent. The speed is just enough to not have to wait for it to finish so you can read along.
Here's my setup. You may want to figure out what the best optimizations are for your specific CPU, like AVX2, because mine didn't have most of them. I did try MTP briefly but I wasn't getting performance improvements. You could play around with the batch sizes for cache or context, or go even lower to Q2, and don't overcommit on threads either, but I would suggest either defaults or trying out llama-bench. This isn't by any means the best, I assume, but it worked decently for me and I sometimes swap out Gemma for Qwen. You could also lower q8_0 to q4_0 for more context, but some say it could hurt quality, and I have noticed that too on some models.
# Building
cmake -B build -DCMAKE_BUILD_TYPE=Release -DGGML_NATIVE=ON -DGGML_BLAS=ON -DGGML_BLAS_VENDOR=OpenBLAS -DGGML_OPENMP=ON
# Running
export OPENBLAS_NUM_THREADS=4
export OMP_NUM_THREADS=4
OPENBLAS_NUM_THREADS=4 OMP_NUM_THREADS=4 \
llama.cpp/build/bin/llama-server \
  -hf unsloth/gemma-4-26B-A4B-it-GGUF:UD-Q4_K_XL \
  --temp 1.0 --top-p 0.95 --top-k 64 --min-p 0.00 \
  --jinja --host 0.0.0.0 --port 8080 \
  --cache-type-k q8_0 --cache-type-v q8_0 \
  --threads 4 --threads-batch 4 \
  --ctx-size 8192 -n 8192 --batch-size 2048 --ubatch-size 512 \
  --no-mmap --mlock \
  --chat-template-kwargs '{"enable_thinking":false}' \
  --no-mmproj -np 1 -fa 1
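For the "figure out what your specific CPU supports" step, a small sketch like the following can list the relevant instruction-set flags before building. It assumes a Linux x86 system where /proc/cpuinfo has a "flags" line; the list of flags checked here is an illustrative choice, not something taken from the setup above.

```python
def cpu_flags(cpuinfo_text):
    """Return the feature flags from the first 'flags' line of /proc/cpuinfo text."""
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "flags":
            return set(value.split())
    return set()


def report(cpuinfo_text, wanted=("avx", "avx2", "fma", "f16c", "avx512f")):
    """Map each wanted flag name (an assumed, illustrative list) to True/False."""
    flags = cpu_flags(cpuinfo_text)
    return {name: name in flags for name in wanted}


if __name__ == "__main__":
    try:
        with open("/proc/cpuinfo") as f:
            print(report(f.read()))
    except OSError:
        print("could not read /proc/cpuinfo (not a Linux system?)")
```

A CPU that reports avx but not avx2 (as many 2012-era Xeons do) will still run, just with slower code paths.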
bitwize - 206 seconds ago
Successfully ran Gemma4-26B-A4B on my 8yo first-gen Ryzen with a GeForce GTX 1070. It actually ran acceptably well; I was surprised. I even did some coding with it, but the wheels fell abruptly off when it tried several times to use a constant I told it doesn't exist. I only have 32 GiB of RAM in this old bucket, and these results are not worth the RAM consumption, so I put it aside. Maybe if I finish that build with more memory...
montroser - 17360 seconds ago
Result is ~12 tokens per second, as reported by OP down in these comments here.
An impressive effort, and better than I would have thought possible on this hardware -- but still pretty far short of what one needs for a satisfactory interactive session.
phaser - 27099 seconds ago
What intrigues me the most about AI progress, is not AGI or the model du jour by $AI_UNICORN, but rather what can be run locally. I remember having an amusing, but rather useless model in a beefy gaming PC that I had 6 years ago; and now, something that’s a hundred times better on my M5 laptop.
Should the market react to the memory shortage and the progress of Apple silicon continue at the same pace, what we’ll be able to run locally in 6 years will be very exciting. Or frightening.
Also I don’t know what this means for the valuation of the AI companies. I remember asking about this very idea to one of their employees at an event and instead of answering he bailed out to grab a cocktail.
jansommer - 24412 seconds ago
The E5-2620 v4 is great. Have been using it for 10 years now. Wanted to upgrade until I saw current prices. I have 64 GB DDR4. Paired it with an RX 9060 XT 16 GB and games run as fast as ever. Perhaps the CPU is a slight bottleneck in DOOM The Dark Ages, but I'm at 60 fps, so no problem. A light LLM on the GPU is a no-brainer, and it's cool to see that things can be tuned to run OK on the CPU. I bought a 2667 v4 a month ago for $30. I'd expect it to give a decent performance boost, but I just haven't had the need for it yet. If I were pushing into LLMs like in the article, I'd probably upgrade, because the 2667 can handle slightly faster RAM.
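Why slightly faster RAM matters for CPU inference can be sketched with a back-of-the-envelope bound: token generation is usually limited by how fast the active weights can be streamed from memory. The numbers below are assumptions for illustration (quad-channel DDR4-2133 versus DDR4-2400, about 4B active parameters as the "A4B" in the model name suggests, and roughly 0.5 bytes per parameter for a 4-bit quant); real throughput will land well below this ceiling.

```python
def peak_bandwidth_gb_s(channels, mega_transfers_per_s, bus_bytes=8):
    """Theoretical peak DRAM bandwidth in GB/s (a 64-bit channel moves 8 bytes per transfer)."""
    return channels * mega_transfers_per_s * 1e6 * bus_bytes / 1e9


def max_tokens_per_second(bandwidth_gb_s, active_params_billion, bytes_per_param):
    """Upper bound on decode speed if every active weight is read once per token."""
    gb_per_token = active_params_billion * bytes_per_param
    return bandwidth_gb_s / gb_per_token


# Assumed configurations: 4 memory channels at two different DDR4 speeds.
for mts in (2133, 2400):
    bw = peak_bandwidth_gb_s(4, mts)
    bound = max_tokens_per_second(bw, active_params_billion=4, bytes_per_param=0.5)
    print(f"DDR4-{mts}: {bw:.1f} GB/s peak, at most {bound:.1f} tokens/s")
```

Under these assumptions the faster memory raises the ceiling by about 12 percent, which is the same ratio as 2400 to 2133.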
RobotToaster - 15503 seconds ago
Apparently Itanium works quite well for LLMs https://medium.com/@tglozar/running-llama-inference-on-intel...
Which makes sense I suppose.
tomega2134 - 4867 seconds ago
I wish this were somehow tagged with AI, so I would know that it's not about say, general computing or cost-efficiency (e.g. using an old xeon machine from ebay instead of new, in these cost-conscious times.)
As it is, the title is click-bait for me, as 1) it says I need at least a Xeon somehow and 2) as it doesn't say what I actually need it for.
andai - 7168 seconds ago
I want to share something strange. I found a typo or two in the post and this absolutely delighted me, because it implies a human wrote the words. (Or was at least heavily involved in the editing.)
Guess I am a species-ist after all ;)
car - 27039 seconds ago
Similar recent posting with optimizations for older Xeon:
High-Performance AI on a Budget: Optimizing llama.cpp for Qwen3.5 Inference on a Dual-GPU HP Z440
https://news.ycombinator.com/item?id=47320244
Aurornis - 5235 seconds ago
llama.cpp includes a benchmarking tool called llama-bench https://github.com/ggml-org/llama.cpp/blob/master/tools/llam...
ik_llama includes llama-sweep-bench https://github.com/ikawrakow/ik_llama.cpp/blob/main/examples...
When comparing hardware, the output of these tools is very helpful to let others put it into context. The post says the output is "reading speed" but knowing the prefill and token generation speeds would be a lot more helpful.
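To illustrate why both numbers matter, here is a minimal sketch of how prefill and generation speed combine into the wait a user actually experiences. The speeds used (40 tokens/s prefill, 12 tokens/s generation) are hypothetical placeholders, not benchmark results from the post.

```python
def time_to_first_token(prompt_tokens, prefill_tps):
    """Seconds spent processing the prompt before any output appears."""
    return prompt_tokens / prefill_tps


def total_response_time(prompt_tokens, output_tokens, prefill_tps, decode_tps):
    """Prompt processing time plus generation time, in seconds."""
    return time_to_first_token(prompt_tokens, prefill_tps) + output_tokens / decode_tps


# Hypothetical speeds; substitute real pp/tg numbers from a benchmark run.
PREFILL_TPS = 40
DECODE_TPS = 12

for prompt_tokens in (50, 4000):
    ttft = time_to_first_token(prompt_tokens, PREFILL_TPS)
    total = total_response_time(prompt_tokens, 300, PREFILL_TPS, DECODE_TPS)
    print(f"{prompt_tokens} prompt tokens: first token after {ttft:.1f}s, done after {total:.1f}s")
```

With the same generation speed, a short chat prompt answers almost immediately while a few thousand tokens of pasted code can mean a wait of over a minute before the first word, which "reading speed" alone does not reveal.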
ryandrake - 8031 seconds ago
I've got an old HP Z-620 workstation with dual E5-2697 v2 CPUs (24 cores total, 48 threads @ 2.7GHz) and 128GB of DDR3 RAM. The docs say it supports up to 192GB, but I wasn't able to get it to POST with all the RAM slots full.
It's still a "homelab" beast and does great with development and GIS/Mapping applications. I was not able to figure out how to run AI workloads on it with decent performance, however, so I finally broke down and got a dedicated GPU for it. It's pretty great what can still be done with older hardware.
api - 655 seconds ago
Have to point out one boring thing though: this will use a lot more electricity than newer things. So it'll work, but it'll run up your electric bill.
FartyMcFarter - 21516 seconds ago
I may have missed this in the article, but:
What was the net effect of the optimisations? How much faster did it get?
vhaudiquet - 30311 seconds ago
The E5-2620 v4 only supports DDR4.
danbruc - 6305 seconds ago
Did someone try to estimate what it would take to bake inference for a capable large language model into silicon, so that one can pipeline inputs through it and produce outputs at one token per clock cycle?
cbdevidal - 16680 seconds ago
Old hardware is surprisingly effective. I've been considering a side hustle selling offline AI to local businesses who are privacy-sensitive. Medical, legal, places like that.
At the low end, I'd use old Xeons with gobs of DDR3, install some V100s, run a smaller agent for general chat inquiries, and a frontier model for the deeper stuff, with a router that passes between them depending on the complexity.
The frontier model would perform very slowly, but if it's a deep task the user can submit it in a batch in the evening e.g. "Correlate all of these cases and look for patterns" then receive the output with morning coffee.
Of course, AI helped me work out a plan for this. Haha
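The router idea described above could look something like the following sketch. Everything here is hypothetical: the model labels, the keyword hints, and the length threshold are placeholders chosen for illustration, and a real deployment would likely need a better complexity signal than keyword matching.

```python
from dataclasses import dataclass


@dataclass
class Route:
    model: str  # hypothetical model label, not a real model name
    mode: str   # "interactive" for chat, "batch" for overnight jobs


# Hypothetical phrases that suggest a deep, multi-document task.
DEEP_HINTS = ("correlate", "all of these", "look for patterns", "compare every")


def route(prompt, long_prompt_chars=2000):
    """Send long or analysis-style prompts to the slow large model as batch work."""
    text = prompt.lower()
    is_deep = len(prompt) > long_prompt_chars or any(hint in text for hint in DEEP_HINTS)
    if is_deep:
        return Route(model="large-model", mode="batch")
    return Route(model="small-model", mode="interactive")


print(route("What are your opening hours?"))
print(route("Correlate all of these cases and look for patterns"))
```

The design choice is that a misrouted simple question only costs time, while a misrouted deep task gets a weak answer, so a real router would probably err toward the batch path.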
NSUserDefaults - 30052 seconds ago
How about the iMac Pro? Would that work? I was able to put 128GB in it (not as easy as the regular iMac but possible).
lreeves - 18331 seconds ago
Doesn't accepting 100% of the MTP draft tokens mean you should just be using the smaller model? Usually the acceptance rate in Qwen36 at least is around 60-70% and the "wrong" tokens are still filled in entirely by the base model, but when you just accept 100% of the draft tokens it seems kind of self defeating unless I'm wrong.
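The effect of the acceptance rate can be sketched with a common simplification of speculative decoding: assume each drafted token is accepted independently with probability alpha, and that the base model verifies draft_len drafted tokens per pass. Both the independence assumption and the draft lengths below are illustrative, not taken from the post.

```python
def expected_tokens_per_pass(acceptance_rate, draft_len):
    """Expected tokens produced per base-model pass.

    Under the independence assumption this is the geometric sum
    1 + a + a^2 + ... + a^draft_len: the accepted prefix of the draft
    plus the one token the base model always contributes itself.
    """
    if acceptance_rate >= 1.0:
        return draft_len + 1.0
    return (1 - acceptance_rate ** (draft_len + 1)) / (1 - acceptance_rate)


for alpha in (0.6, 0.7, 1.0):
    for draft_len in (1, 3):
        tokens = expected_tokens_per_pass(alpha, draft_len)
        print(f"alpha={alpha}, draft_len={draft_len}: {tokens:.2f} tokens per pass")
```

At alpha = 1 every pass yields draft_len + 1 tokens, which is the best case for throughput; the base model still verifies every token, so the output remains the base model's output.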
Also I feel like everyone leaves off prompt processing/prefill speeds in these articles. If you are using a very small prompt and asking for mostly generated tokens, sure but I'd love to know the time-to-response of asking for an analysis of an image or a few hundred lines of code.
kristjansson - 11483 seconds ago
Noting for reference that Gemma4 MTP work is in progress[0] on llama.cpp; similar work for Qwen3.6 landed recently and has been great thus far.
[0]: https://github.com/ggml-org/llama.cpp/pull/23398
cykros - 26987 seconds ago
Does this mean my 15 year old Phenom is too old? But it has 16 gb of DDR3 RAM!
Admittedly web browsers and it don't get along that well. Literally the only thing that drags though on my Slackware 15 system, and even then usually only when it gets to around 15 or so open tabs.
anon-3988 - 25843 seconds ago
I tried to run gemma 4 on this CPU and it did not go well
https://www.techpowerup.com/cpu-specs/ryzen-7-4800u.c2281
It is way too slow
Liftyee - 13714 seconds ago
Very intriguing. This might be the use for my 2x E5-2430 v2 server that's been lying around. DDR3 is (relatively) cheap now too. Could fit 192GB of RAM in it and play around for much cheaper than a new GPU.
potus_kushner - 35611 seconds ago
@cafkafk got a recommendation for a good model that fits into 64GB and leaves a couple GB free for other tasks?
mv4 - 11329 seconds ago
I have an old 192GB DDR4 Dell Precision with dual Intel Xeon Gold 6130 that I've considered spinning up. What's giving me pause is 250W at idle.
alimbada - 19548 seconds ago
What's the best way to apply this to slightly more modern hardware - i.e. 5800XT 32GB DDR4, 9060XT 16GB?
bombcar - 11703 seconds ago
Is this John Siracusa? It sounds like it could be something he’d say…
(He has a fully maxed out “last Intel” Mac Pro and laments the lack of replacement).
shovas - 15636 seconds ago
I have run llama.cpp on an i7-2600 with a 1050. It's too slow for everyday usage but it's not too slow to make it obvious AI is going to be everywhere and in everything. It's too easy to run.
qingcharles - 11020 seconds ago
Would there be any advantage to running this as dual Xeon? The CPUs are $5 and a dual mobo is $50...
SirMaster - 12170 seconds ago
Either they have an E5-2620 v2 from 13 years ago, or they have DDR4, not DDR3. The v3 and v4 only support DDR4.
haunter - 25970 seconds ago
And this is one of those CPUs which had dual slot motherboards so you can have double the fun (and power bill)
https://pcpartpicker.com/products/motherboard/#s=20028,20029...
asimovDev - 31327 seconds ago
I have an ancient DDR3 Xeon that doesn't support any AVX (dual X5690 and 96GB 1333 MHz RAM). You reckon it would even build / run at all?
robotswantdata - 21413 seconds ago
Granite Rapids or Sapphire Rapids are very underrated for MoE inference loads. But you need a GPU for the KV cache.
Plus many boards also support CXL for RAM expansion over PCIe 5!
Source: building a hybrid inference business for regulated industry workloads.
Hasan121212 - 22465 seconds ago
I think one overlooked advantage of older Xeon systems is their availability. Many people can experiment with local AI deployments at a fraction of the cost of building a brand-new setup.
sperandeo - 11792 seconds ago
I've been doing the same thing. I refactored an old NewTek stream machine. It's my new favorite thing to do! Adding old PCs to my "starcraft" fleet xD
coldcity_again - 19974 seconds ago
This is great work.
I'd love it if anyone knows how I might fare with an old Dell R710 with 2 x Xeon 5600 (12 cores total) and 96GB of DDR3.
Eonexus - 35752 seconds ago
I wonder what the tokens per second actually are. Yes, it does say "reading speed" but that varies for everyone, no?
gigatexal - 27696 seconds ago
What kind of tokens per second did the OP get? I saw nothing written about this.
egorfine - 25989 seconds ago
This and the previous one are insanely good articles. Thank you!
hparadiz - 28986 seconds ago
I'm now staring at a 10 year old 4U with 256 GB of DDR4 and thinking hmmmmm
christkv - 32967 seconds ago
Makes you wonder if it's possible to squeeze more tps out of a Strix Halo system using the 16 Zen 5 cores as well as the GPU.
rvba - 23674 seconds ago
As someone doing this for fun on a Windows 11 machine (96GB RAM, 5090 24GB) I wonder if I need any flags to keep the model in memory and avoid swapping to SSD?
I use LM Studio and qwen3.5 35B - but never figured out if it is swapping or not.
On an unrelated note, does anyone know a model that can help with this use case:
https://news.ycombinator.com/item?id=48301635
ForOldHack - 12815 seconds ago
Well, let's get started. I have 4 of those machines, and two of them are dual processor. They all had 32GB of RAM, so now I have two with 64GB, and two with zero. They all had stock K5000s; now two have two cards. I stripped the uniprocessors' RAM and video cards, and put those into the dual procs. They have 256GB SSDs, and two 1TB disk drives. One machine has 8GB of VRAM across two cards. The dual processors are 8Cx2 and 32 threads. They can easily play 16 videos at once. For AI, I have not found a model that I can get above 3 tokens a second. Not a one.
nurettin - 29266 seconds ago
I also run a Qwen 3.6 MoE A4B on old hardware. I set it up with
numactl --membind=1
so it is constrained to the memory of one NUMA node, which speeds up token generation a little.
ezconnect - 23143 seconds ago
When you use the page up and page down keys when reading that blog, the first line on the screen is obscured by the floating bar or whatever it is. It is not even needed for reading.
shevy-java - 25212 seconds ago
The webpage's layout is just horrible. Scrolling is also non-default - and thus rather annoying; I had to stop after two scroll events. Why do people think they need so many fancy effects or non-standard behaviour, if their alleged goal is to get information across to other people?
bflesch - 28603 seconds ago
Might consider going for even older CPUs which don't have the Intel ME ring -3 thing which is full of backdoors
SXX - 26957 seconds ago
Now we need someone to try running Kimi K2.6 on an old Xeon and DDR3. After all, these platforms do support up to 768GB RAM.
hypfer - 27744 seconds ago
> The argument for speculative decoding is stronger on CPU than on GPU.
Uh. Uuuh.
No?
___
Also
> While a GPU has a massive pool of ultra-fast High-Bandwidth Memory (HBM), a CPU relies on small, lightning-fast “caches” (L1, L2, L3) built directly onto the processor chip.
What purpose does the quoting of "caches" serve there? Is this AI writing written by that model running on that host?