GLM-5.2 is the new leading open weights model on Artificial Analysis
- SwellJoe - 70 seconds ago
I added it to my benchmark based on Mythos-reported bugs, and it's better than GLM 5.1, but still behind several other models, maybe most directly comparable to Qwen 3.7 Max. But several other open models, including small self-hostable ones (Gemma 4 and Qwen 3.6), found the same number of bugs, 3 of 9. It also gets partial credit for reporting one bug in the right spot while kinda misunderstanding the bug. I also added Kimi K2.7-code in the same run, and it did poorly, consistent with 2.6 performance. Anyway, there are better, cheaper models on this particular benchmark.
https://swelljoe.com/post/will-it-mythos/
(This small benchmark doesn't prove anything. It's a limited data set and each model only gets one shot at each file in the corpus. But, I find it useful for quickly sussing out if a model can reason about pretty complicated problems in code.)
- Tiberium - 29049 seconds ago
It seems to really be a nice step-up and is getting quite close to the frontier. I wish they'd start focusing on the reasoning efficiency now, though. I have a (relatively) simple test task to evaluate LLMs: writing a simple math evaluator library in Nim (it's about 400-600 lines total max), and GLM 5.2 (xhigh which maps to max effort) spent over 15 minutes (!) reasoning, spending about 45k tokens, before it finally wrote the first file.
I know it's hard to improve on that, but now that their models are good enough at raw intelligence, I think this should become a higher priority task.
Currently on https://artificialanalysis.ai/#output-tokens GPT 5.5 xhigh spends 16k tokens total on average, GPT 5.5 high is 10k, Fable 5 33k, Opus 4.8 41k, GLM 5.2 is 42k. GPT 5.5 is extremely reasoning efficient.
Of course if you convert those values to actual request cost, GLM 5.2 will probably beat GPT 5.5/Opus 4.8, but speed matters for a lot of people, I think.
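To make that conversion concrete, here is a minimal sketch (not from the original comment) that uses only the average token counts quoted above. It computes a break-even price ratio: how much cheaper per output token one model has to be in order to cost no more per request than another. It assumes cost scales linearly with output tokens and ignores input tokens and caching; the function name is made up for illustration.

```python
# Average output tokens per task, as quoted in the comment above.
AVG_OUTPUT_TOKENS = {
    "GPT 5.5 high": 10_000,
    "GPT 5.5 xhigh": 16_000,
    "Fable 5": 33_000,
    "Opus 4.8": 41_000,
    "GLM 5.2": 42_000,
}


def break_even_price_ratio(model: str, baseline: str) -> float:
    """Return the highest (model price / baseline price) per output token
    at which `model` still costs no more per request than `baseline`.

    Assumption: request cost = output tokens * price per token.
    """
    return AVG_OUTPUT_TOKENS[baseline] / AVG_OUTPUT_TOKENS[model]


if __name__ == "__main__":
    for baseline in ("GPT 5.5 high", "GPT 5.5 xhigh", "Opus 4.8"):
        ratio = break_even_price_ratio("GLM 5.2", baseline)
        print(f"GLM 5.2 vs {baseline}: break-even price ratio {ratio:.2f}")
```

On these numbers, GLM 5.2 is only cheaper per request than GPT 5.5 xhigh if its per-token price is below roughly 38% of GPT 5.5's, while against Opus 4.8 the token counts are nearly equal and the price per token decides almost everything.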
- kristopolous - 23180 seconds ago
I have a script that ranks these based on codingindex from Artificial Analysis.
All it does is pull a JSON from their main table page and parse out the fields I care about (coding).
There used to be a mailing list associated with it but eh ... there wasn't much interest. I use the script every day though.
Current partial output:
    score  age  size   name
    47.1    58  large  Kimi K2.6
    47.5    54  large  DeepSeek V4 Pro (Reasoning, Max Effort)
    47.5    70  -      Muse Spark
    47.6   132  -      Claude Opus 4.6 (Non-reasoning, High Effort)
    47.8   205  -      Claude Opus 4.5 (Reasoning)
    48.1   132  -      Claude Opus 4.6 (Adaptive Reasoning, Max Effort)
    48.6    55  -      GPT-5.5 (Non-reasoning)
    48.7   188  -      GPT-5.2 (xhigh)
    50.1    29  -      Qwen3.7 Max
    50.7     1  large  GLM-5.2 (max)
    50.9   120  -      Claude Sonnet 4.6 (Adaptive Reasoning, Max Effort)
    51.5    92  -      GPT-5.4 mini (xhigh)
    52.1    55  -      GPT-5.5 (low)
    52.5    62  -      Claude Opus 4.7 (Adaptive Reasoning, Max Effort)
    53.1   132  -      GPT-5.3 Codex (xhigh)
    53.1    62  -      Claude Opus 4.7 (Non-reasoning, High Effort)
    55.5   118  -      Gemini 3.1 Pro Preview
    56.2    55  -      GPT-5.5 (medium)
    56.7    20  -      Claude Opus 4.8 (Adaptive Reasoning, Max Effort)
    57.2   104  -      GPT-5.4 (xhigh)
    58.5    55  -      GPT-5.5 (high)
    59.1    55  -      GPT-5.5 (xhigh)
    62       8  -      Claude Fable 5 (Adaptive Reasoning, Max Effort, Opus 4.8 Fallback)
To see everything, run it like so:
    $ curl day50.dev/art-analysis.sh | bash
The repo: https://github.com/day50-dev/aa-eval-email
some key takeaways:
* open models are on about a 4-7 month lag right now depending on how you want to measure it
* if this keeps up, you might see an open-weights model doing claude fable 5 level work before the new year.
if people sign up for the free mailing list (that just does this) I'll go and put it back on ... emails when new model evals drop - it was pretty useful.
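For readers who want to see how a lag figure like "4-7 months" can fall out of this kind of output, here is a rough Python sketch. It is not the script from the repo (that one is run through bash). The rows are a hand-copied subset of the partial output in this comment, and two things are assumptions rather than facts from the comment: that the age column is in days, and that a known size (`large`) marks an open-weights model while `-` marks a closed one.

```python
from typing import NamedTuple, Optional


class Row(NamedTuple):
    score: float
    age_days: int          # assumption: the "age" column is in days
    size: Optional[str]    # assumption: a known size means open weights
    name: str


# Hand-copied subset of the partial output quoted in the comment.
ROWS = [
    Row(47.1, 58, "large", "Kimi K2.6"),
    Row(47.5, 54, "large", "DeepSeek V4 Pro (Reasoning, Max Effort)"),
    Row(50.1, 29, None, "Qwen3.7 Max"),
    Row(50.7, 1, "large", "GLM-5.2 (max)"),
    Row(50.9, 120, None, "Claude Sonnet 4.6 (Adaptive Reasoning, Max Effort)"),
    Row(53.1, 132, None, "GPT-5.3 Codex (xhigh)"),
    Row(55.5, 118, None, "Gemini 3.1 Pro Preview"),
    Row(56.7, 20, None, "Claude Opus 4.8 (Adaptive Reasoning, Max Effort)"),
    Row(62.0, 8, None, "Claude Fable 5 (Adaptive Reasoning, Max Effort, Opus 4.8 Fallback)"),
]


def ranked(rows: list[Row]) -> list[Row]:
    """Sort rows by score, best first."""
    return sorted(rows, key=lambda r: r.score, reverse=True)


def lag_days(open_row: Row, rows: list[Row]) -> Optional[int]:
    """Days between the release of the oldest closed model that already
    scored at least as high and the release of the given open model."""
    closed = [r for r in rows if r.size is None and r.score >= open_row.score]
    if not closed:
        return None
    return max(r.age_days for r in closed) - open_row.age_days


if __name__ == "__main__":
    for row in ranked(ROWS):
        print(f"{row.score:5.1f} {row.age_days:4d} {row.size or '-':6s} {row.name}")
    glm = next(r for r in ROWS if r.name.startswith("GLM-5.2"))
    print("GLM-5.2 lag in days:", lag_days(glm, ROWS))
```

With these rows the lag for GLM-5.2 works out to 131 days, a bit over four months, which sits at the low end of the 4-7 month range; other ways of measuring would give different numbers.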
- mrngld - 24679 seconds ago
Artificial Analysis coding benchmark shows GLM5.1 on high pretty close to GPT5.5 xhigh in cost to run, with GPT5.5 on medium significantly less expensive. Compared to GPT5.5 medium, GLM5.1 xhigh is twice the cost and half the intelligence. They don't have GLM5.2 on there yet, but that'd be a big gap to bridge.
https://artificialanalysis.ai/agents/coding-agents?coding-ag...
I thought I was "holding it wrong" until DeepSWE came along -- personally it seems to match my own experiences pretty well. Really makes me wonder how legitimate some of the internet noise is about open models. There's surely some use cases for them, not everything needs the absolute frontier (GPT5.5 on low is awesome), but if you want to be near the frontier everyone needs to be honest about the fact that we're only talking about Opus, Fable, GPT5.5.
- unrvl22 - 28653 seconds ago
Why aren't more people talking about this? It's literally Opus 4.7 quality at stupidly low prices. I know providers who are offering this at unlimited tokens for $50 a month. Some are even offering API rates 3x lower than the official ZAI api rates, which are already like 10x cheaper than Opus. (Crof and Umans btw)
This is a huge blow to Anthropic/OpenAI/Google and a massive win for the rest of the world. The official API prices and speeds mean nothing for open source models.
- gertlabs - 6609 seconds ago
GLM 5.2 is the first model we've tested that is unambiguously on par with, or better than, Opus 4.6 (although, as usual, we rank GLM 5.2 and most other Chinese models a bit lower than most other benchmarks do, since those have more vulnerable test methodologies).
Data at https://gertlabs.com/rankings
- simonw - 22458 seconds ago
I was surprised that GLM 5.1/5.2 are not vision models - they are text input only.
That's actually pretty uncommon these days. All of the OpenAI/Anthropic/Gemini models accept images, and so do the other leading open weight families - Gemma 4, Qwen 3.6, Kimi 2.x.
In GLM's case image input would be useful because it's a model that scores very highly for tasks like web design, but without image input it can't take a screenshot and output HTML+CSS.
Don't get me wrong, GLM is a phenomenal model, but the image thing is a bit of a gap.
- CuriouslyC - 28576 seconds ago
I've been playing with this model a fair amount over the last 24 hours, and I can confirm it's quite capable, while being a little bit verbose (I've seen it reconsider things 3-4 times in thinking traces before deciding on a path forward), and not being quite as good as GPT5.5 at working through complex abstract requirements.
Honestly it's good enough that I feel comfortable recommending a Z.AI sub + a $20/mo OpenAI sub for all but the most AI pilled multi-orchestrators, or the die hard Claude fans. GLM writing + GPT reviewing/debugging feels pretty unlimited and minimally worse than just doing everything in GPT with the $200/mo plan.
- CubsFan1060 - 25631 seconds ago
Knowing very little about how to run these, how close are we to medium or larger businesses starting to buy hardware to run models like this to keep the models local?
It’s expensive, and not as capable as the frontier models, but would have some pretty big benefits around privacy and agency.
- tensegrist - 27416 seconds ago
> On the Intelligence vs. Cost per Task Pareto Frontier: GLM-5.2 is on the Pareto frontier of the Intelligence vs Cost per Task chart, with the lowest cost per task among models at its intelligence level. GLM-5.2 costs ~$0.46 per task, compared to GLM-5.1 ($0.25), Kimi K2.6 ($0.31), MiniMax-M3 ($0.18) and DeepSeek V4 Pro (max, $0.05)
am i missing something?
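One way to read the quoted claim: a model is on the Pareto frontier when no other model is both at least as intelligent and at most as expensive, with at least one of the two strictly better. The most expensive model in a list can still be on the frontier if nothing cheaper matches its intelligence. The sketch below illustrates that in Python. The costs come from the quote; the intelligence values are placeholder ordinal ranks invented for illustration (the quote gives no scores), and "Hypothetical model X" does not exist.

```python
from typing import NamedTuple


class Model(NamedTuple):
    name: str
    intelligence: float   # PLACEHOLDER rank, not a measured score
    cost_per_task: float  # USD per task, from the quoted text


MODELS = [
    Model("GLM-5.2", 5, 0.46),
    Model("Kimi K2.6", 4, 0.31),
    Model("GLM-5.1", 3, 0.25),
    Model("MiniMax-M3", 2, 0.18),
    Model("DeepSeek V4 Pro (max)", 1, 0.05),
    # Invented entry, added only to show what a dominated model looks like.
    Model("Hypothetical model X", 2.5, 0.40),
]


def dominates(a: Model, b: Model) -> bool:
    """True if `a` is no worse than `b` on both axes and strictly better on one."""
    no_worse = a.intelligence >= b.intelligence and a.cost_per_task <= b.cost_per_task
    strictly_better = a.intelligence > b.intelligence or a.cost_per_task < b.cost_per_task
    return no_worse and strictly_better


def pareto_frontier(models: list[Model]) -> list[Model]:
    """Keep every model that no other model dominates."""
    return [m for m in models if not any(dominates(o, m) for o in models)]


if __name__ == "__main__":
    for m in pareto_frontier(MODELS):
        print(f"{m.name}: rank {m.intelligence}, ${m.cost_per_task:.2f}/task")
```

Under these placeholder ranks every real model listed stays on the frontier, GLM-5.2 included, because each cheaper model is also ranked lower; only the invented model X drops out, since Kimi K2.6 is both cheaper and ranked higher.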
- wongarsu - 24278 seconds ago
It's also third best overall on "AA-Omniscience Non-Hallucination Rate", far higher than DeepSeek, GPT 5.5 or Fable.
That's the one benchmark that allows LLMs to answer "I don't know" and punishes them for trying to bullshit their way through the questions.
- kingstnap - 28124 seconds ago
According to many benchmarks this model is straight up frontier level and Zai seriously cooked. Some of these numbers are incredible.
Excited to see if this turns out to be an Open Weight Opus 4.5 or better.
- XCSme - 25946 seconds ago
In my tests[0] GLM-5.2 is not much better than GLM-5, and overall DeepSeek V4 Flash seems to be the better/more cost-effective choice:
[0]: https://aibenchy.com/compare/deepseek-deepseek-v4-flash-high...
- xiaoyu2006 - 26954 seconds ago
This open source model is quite near SOTA with only 700B/40B MoE. Truly efficient.
- guybedo - 4951 seconds ago
It's probably a good model but they used GLM 5.1 to code their infra.
I signed up to their max plan yesterday, did some light coding work, and i'm at 180M tokens used and 40% weekly quota gone.
Even when tokenmaxxing on the Claude Max or GPT $200 plan, i couldn't get more than 20% quota gone per day.
- daniban - 4370 seconds ago
I'm curious what harness everyone is using for these? I want to start testing some of these open models but don't know what tools people use to get them working "agentically".
- leemoore - 13632 seconds ago
GLM 5.2 feels like Opus 4.6 level. I actually think 4.6 and GLM work better in practice than Opus 4.7 or 4.8, as I find both of those more erratic and they seem to randomly have a super dumb turn. That random bad turn doesn't seem to be hitting the benchmark scores, but it makes 4.7 and 4.8 very hard to use for me. GLM is more stable, like Opus 4.6.
- Pragmata - 25802 seconds ago
So this basically means we will have a near Opus level model able to be run locally in the next couple of months, right?
QWEN 3.6 27b is already pretty good, but it should be possible to get a better option now that runs in the same hardware, right?
- rahidz - 27385 seconds ago
Correct me if I'm wrong, but neither DeepSeek nor GLM has image input modality. This makes them less useful when looking at UIs, photos, screenshots, etc., doesn't it? Or do they have alternate ways of doing so?
- ponyous - 18276 seconds ago
Just ran and scored 63 3D model generations (via code) across high and no reasoning. A 3D modeling benchmark quickly shows the spatial, logic and code performance of the model, so I think it's a very good indicator of quality.
Here are the results compared to Gemini 3.5 Flash:
    Model + config            CodeErr/gen  Cost/gen  Median time  Quality
    gemini-3.5-flash, low     0.71         $0.18     68s          baseline
    GLM 5.2, reasoning high   0.61         $0.18     289s         -6.0%
    GLM 5.2, reasoning off    1.52         $0.10     126s         -13.6%
Although it is cheaper, it is significantly slower, and results are worse overall. Surprisingly, high reasoning produces fewer code errors than Gemini 3.5 Flash, but when I actually look at the models they are worse.
Edit: I recently ran evals with Kimi 2.7 and MiniMax-M3 and this is clearly the open source SOTA model, by far.
- JustSkyfall - 21765 seconds ago
The problem with these benchmarks is that the Chinese models tend to be incredible on paper, and absolutely terrible in practice :/
- davidwritesbugs - 27905 seconds ago
I like their models, super cheap - I'm a Lite plan subscriber, and subjective performance seems to be the same as lower Anthropic models, useful for lots of grunt work. The problem is that Zhipu really __really__ struggle with capacity - everyone is complaining of timeouts or very slow speeds. I can't get direct access to the model though I see it is in OpenRouter so I may play. But the capacity issues mean DeepSeek is my main provider these days.
- m-dot-reviews - 19440 seconds ago
For anyone who's interested, I've put together a simple site for sharing ratings/opinions on models at a task-specific granularity. https://model.reviews/
The idea is that benchmark score comparisons are useful for a large cross-product comparison across models + their settings, but less useful if you're looking for the best model for <your-specific-task>. So I thought having a place to review and comment could be beneficial to people.
I'm not sure how best to get the corpus bootstrapped (i.e. people will likely only visit/post on the site if there's already activity), so posting it here for anyone who'd like to contribute.
- _pdp_ - 26533 seconds ago
I am helpful.
DeepSeek V4 has been quite amazing in our workloads and it operates at a fraction of the cost. I have not tried GLM 5.2 but it seems that it hits a sweet spot.
- ramon156 - 26738 seconds ago
I've made a comment before that 5.1 will sometimes get stuck looping over a simple decision or statement. It will basically contradict itself and then not realize that one option is the definite option. Sometimes it's two statements that aren't even exclusive. Either way, a lot of tokens get wasted on this.
I haven't extensively used 5.2 yet, but it seems a lot better.
- dizhn - 21564 seconds ago
FYI: this is coming with 3mil GLM 5.2 tokens right now. (Needs login. Google SSO fine) https://zcode.z.ai/en
- hereme888 - 12451 seconds ago
Hmmm... GLM insists it's Gemini.
- alansaber - 17639 seconds ago
These open source models need better multi-turn capabilities. They are always lacklustre in "agent mode". Whether it's just less RL, whatever, it's a worse "product". Whereas it feels like the frontier labs have been all-in on "agentic" multi-turn reasoning for a long time now.
- osti - 12509 seconds ago
Fun fact: Zhipu (aka Z.ai, Knowledge Atlas etc.), the company that made GLM, is listed on the Hong Kong stock exchange and is up over 10x since the IPO at the beginning of this year.
- RDTvlokip - 15207 seconds ago
I have a question, as it happens: do you think the models were trained on benchmark datasets to skew the results, even though in real-world applications we realize they're not that great?
- robertwt7 - 16965 seconds ago
What is that moodboard and chart of hypertension in the middle of the article that isn't explained?
This is a great step up in open models; however, the pricing to support z.ai is not much cheaper than a Claude / OpenAI subscription.
- KaoruAoiShiho - 18417 seconds ago
This is really held back by one bench (omniscience accuracy) where it's really very far behind; otherwise I think it would have scored at least a couple of points higher.
- creamyhorror - 27018 seconds ago
It's a real step forward, getting closer to SOTA. It seems to be very epistemically cautious in its reasoning. I hope Deepseek and the other open-weights labs stay in the game and catch up too.
- piterrro - 17742 seconds ago
DeepSeek v4 pro is still 10x cheaper than GLM-5.2 and the quality is still enough for 95% of coding tasks.
- adithyaharish - 7003 seconds ago
Why don't all open source LLMs have open weights like this model?
- zftnb666 - 20432 seconds ago
Open-weight models are winning. The gap with closed models is now measured in months, not years.
- Havoc - 28808 seconds ago
It’s pretty good. More talkative than 5.1. Reminds me of DeepSeek 4.
Their servers are melting though - getting more timeouts etc
- nh43215rgb - 28653 seconds ago
> GLM-5.2 sits off the most attractive quadrant on the Intelligence vs Output Tokens chart.
That is unfortunate...
- lousken - 26946 seconds ago
Cerebras really needs to have this on their API list (if they even still exist).
- hyqzz8 - 13686 seconds ago
It is a very useful model.
- sourcecodeplz - 23225 seconds ago
1m context btw.
- eckelhesten - 23954 seconds ago
Sure, but whatever you do, don't buy their (Z.ai) lite plan.
I feel like i threw 15 dollars in the sea. I'm getting rate limited after 3-4 prompts. You get way less value than just paying 25 dollars for Claude or OpenAI models.
- jayess - 13803 seconds ago
I asked z.ai what z.ai is, and it said "It seems you might be referring to xAI, as "z.ai" isn't a widely known or major AI company or platform at this time."
- Computer0 - 17975 seconds ago
Regrettably I haven’t tried 5.2 yet, but I did not see 5.1 as anything special. In practice I found it to be ~70% as good as Claude Sonnet.
- dsrtslnd23 - 24470 seconds ago
Looks like I need a GB300 workstation.
- Imustaskforhelp - 23821 seconds ago
I have been trying out GLM 5.2 and I am really impressed by it for the most part.
To all people on Hackernews, I am curious which agent harness you are using it with.
Previously I was using opencode, and then I switched to Opencode + obra/superpowers and creating custom skill.md files for it. Things take more time and I have to intervene more, but the result is that it works better.
Now I have also started using oh-my-pi, and I found it to be faster compared to Opencode.
I am unsure how much of the difference is real and how much is placebo, but what is your opinion regarding the best agent harness for GLM 5.2?
- hit8run - 23965 seconds ago
Ok, it is nice to see another great open source model. Not sure what to think of all these benchmarks, but GLM was already quite strong before, so an update is very welcome.
- kissgyorgy - 24636 seconds ago
I tried it today through Openrouter and the API is atrocious. I got multiple rate limit and random errors every turn.
Somebody wrote [1]: "I am never touching Minimax or GLM again. Their APIs had constant outages and I had to restart my runs multiple times — after burning money on the runs that failed midway." and I 100% agree.
The model might be good, but if the API is so bad, it's effectively useless.
[1]: https://kasra.blog/blog/i-spent-1500-seeing-if-llms-could-ha...
- maxothex - 9172 seconds ago
[flagged]
- Asfand3099 - 17462 seconds ago
[flagged]
- mohsen1 - 27655 seconds ago
I don't know if it's the harness or if the model is really not at the level those benchmarks are showing, because based on my own "feelings" after using it I felt it's not Opus 4.5 level. It can't figure things out in my project (https://tsz.dev), or maybe tsz is at a stage where things are getting too difficult even for frontier models to be productive. I had the most productive time in the weekend Fable was available and since then it's been pretty slow to make progress.