GLM-5.2 is the new leading open weights model on Artificial Analysis

artificialanalysis.ai - 614 points - 321 comments - 33713 seconds ago

Comments (54)

SwellJoe - 70 seconds ago
I added it to my benchmark based on Mythos-reported bugs, and it's better than GLM 5.1, but still behind several other models, maybe most directly comparable to Qwen 3.7 Max. But several other open models, including small self-hostable ones (Gemma 4 and Qwen 3.6), found the same number of bugs, 3 of 9. It also gets partial credit for reporting one bug in the right spot while kinda misunderstanding the bug. I also added Kimi K2.7-code in the same run, and it did poorly, consistent with 2.6 performance. Anyway, there are better, cheaper models on this particular benchmark.
https://swelljoe.com/post/will-it-mythos/
(This small benchmark doesn't prove anything. It's a limited data set and each model only gets one shot at each file in the corpus. But, I find it useful for quickly sussing out if a model can reason about pretty complicated problems in code.)
Tiberium - 29049 seconds ago
It really seems to be a nice step up and is getting quite close to the frontier. I wish they'd start focusing on reasoning efficiency now, though. I have a relatively simple test task to evaluate LLMs: writing a simple math evaluator library in Nim (it's about 400-600 lines total max), and GLM 5.2 (xhigh, which maps to max effort) spent over 15 minutes (!) reasoning, spending about 45k tokens, before it finally wrote the first file.
I know it's hard to improve on that, but now that their models are good enough at raw intelligence, I think this should become a higher priority task.
Currently on https://artificialanalysis.ai/#output-tokens GPT 5.5 xhigh spends 16k tokens total on average, GPT 5.5 high is 10k, Fable 5 33k, Opus 4.8 41k, GLM 5.2 is 42k. GPT 5.5 is extremely reasoning efficient.
Of course if you convert those values to actual request cost, GLM 5.2 will probably beat GPT 5.5/Opus 4.8, but speed matters for a lot of people, I think.
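To put the numbers in this comment side by side, here is a rough sketch of the arithmetic. It uses only the figures quoted above ("over 15 minutes" and "about 45k" are approximate, so the throughput is a ballpark). The `request_cost` helper and its price parameter are hypothetical; no real prices are assumed.

```python
# Figures quoted in the comment above (approximate).
reasoning_tokens = 45_000
reasoning_seconds = 15 * 60

# Rough reasoning throughput before the first file was written.
tokens_per_second = reasoning_tokens / reasoning_seconds

# Average output tokens as quoted from the Artificial Analysis chart.
avg_output_tokens = {
    "GPT 5.5 high": 10_000,
    "GPT 5.5 xhigh": 16_000,
    "Fable 5": 33_000,
    "Opus 4.8": 41_000,
    "GLM 5.2": 42_000,
}


def relative_usage(baseline):
    """Token usage of every model as a multiple of the baseline model."""
    base = avg_output_tokens[baseline]
    return {name: round(tokens / base, 1) for name, tokens in avg_output_tokens.items()}


def request_cost(tokens, usd_per_million_output_tokens):
    """Hypothetical helper: convert an output token count to a dollar cost."""
    return tokens / 1_000_000 * usd_per_million_output_tokens


print(f"{tokens_per_second:.0f} tokens/s while reasoning")
for name, ratio in relative_usage("GPT 5.5 xhigh").items():
    print(f"{name}: {ratio}x the tokens of GPT 5.5 xhigh")
```

Plugging each provider's actual output price into `request_cost` would show whether the cheaper per-token rate outweighs the larger token count.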

kristopolous - 23180 seconds ago

I have a script that ranks these based on codingindex from Artificial Analysis.

All it does is pull a JSON from their main table page and parse it for the fields I care about (coding).

There used to be a mailing list associated with it but eh ... there wasn't much interest. I use the script every day though.

Current partial output

  score  age  size   name
  47.1    58  large  Kimi K2.6
  47.5    54  large  DeepSeek V4 Pro (Reasoning, Max Effort)
  47.5    70  -      Muse Spark
  47.6   132  -      Claude Opus 4.6 (Non-reasoning, High Effort)
  47.8   205  -      Claude Opus 4.5 (Reasoning)
  48.1   132  -      Claude Opus 4.6 (Adaptive Reasoning, Max Effort)
  48.6    55  -      GPT-5.5 (Non-reasoning)
  48.7   188  -      GPT-5.2 (xhigh)
  50.1    29  -      Qwen3.7 Max
  50.7     1  large  GLM-5.2 (max)
  50.9   120  -      Claude Sonnet 4.6 (Adaptive Reasoning, Max Effort)
  51.5    92  -      GPT-5.4 mini (xhigh)
  52.1    55  -      GPT-5.5 (low)
  52.5    62  -      Claude Opus 4.7 (Adaptive Reasoning, Max Effort)
  53.1   132  -      GPT-5.3 Codex (xhigh)
  53.1    62  -      Claude Opus 4.7 (Non-reasoning, High Effort)
  55.5   118  -      Gemini 3.1 Pro Preview
  56.2    55  -      GPT-5.5 (medium)
  56.7    20  -      Claude Opus 4.8 (Adaptive Reasoning, Max Effort)
  57.2   104  -      GPT-5.4 (xhigh)
  58.5    55  -      GPT-5.5 (high)
  59.1    55  -      GPT-5.5 (xhigh)
  62       8  -      Claude Fable 5 (Adaptive Reasoning, Max Effort, Opus 4.8 Fallback)

To see everything, run it like so

  $ curl day50.dev/art-analysis.sh | bash

The repo: https://github.com/day50-dev/aa-eval-email

some key takeaways:

* open models are on about a 4-7 month lag right now depending on how you want to measure it

* if this keeps up, you might see an open-weights model doing claude fable 5 level work before the new year.

if people sign up for the free mailing list (that just does this) I'll go and put it back on ... emails when new model evals drop - it was pretty useful.
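For anyone who would rather not pipe curl into bash, the "pull a JSON and parse the coding field" idea might look roughly like the sketch below. This is not the actual script: the payload shape and the `coding_index` key are assumptions, and the real Artificial Analysis JSON may be structured differently. The three scores are copied from the partial output above.

```python
import json

# Hypothetical payload; the key names are assumed, not taken from the real page.
SAMPLE = """[
  {"name": "GLM-5.2 (max)", "coding_index": 50.7},
  {"name": "Qwen3.7 Max", "coding_index": 50.1},
  {"name": "GPT-5.5 (xhigh)", "coding_index": 59.1},
  {"name": "Model without a score", "coding_index": null}
]"""


def rank_by_coding(payload, key="coding_index"):
    """Parse the JSON text, drop unscored models, sort ascending by score."""
    models = json.loads(payload)
    scored = [m for m in models if m.get(key) is not None]
    return sorted(scored, key=lambda m: m[key])


for model in rank_by_coding(SAMPLE):
    print(f"{model['coding_index']:>5}  {model['name']}")
```

Sorting ascending keeps the strongest model on the last line, matching the order of the table above.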

mrngld - 24679 seconds ago
Artificial Analysis coding benchmark shows GLM5.1 on high pretty close to GPT5.5 xhigh in cost to run, with GPT5.5 on medium significantly less expensive. Compared to GPT5.5 medium, GLM5.1 xhigh is twice the cost and half the intelligence. They don't have GLM5.2 on there yet, but that'd be a big gap to bridge.
https://artificialanalysis.ai/agents/coding-agents?coding-ag...
I thought I was "holding it wrong" until DeepSWE came along -- personally it seems to match my own experiences pretty well. Really makes me wonder how legitimate some of the internet noise is about open models. There's surely some use cases for them, not everything needs the absolute frontier (GPT5.5 on low is awesome), but if you want to be near the frontier everyone needs to be honest about the fact that we're only talking about Opus, Fable, GPT5.5.
unrvl22 - 28653 seconds ago
Why aren't more people talking about this? It's literally Opus 4.7 quality at stupid prices. I know providers who are offering this at unlimited tokens for $50 a month. Some are even offering API rates 3x lower than the official ZAI API rates, which are already like 10x cheaper than Opus. (Crof and Umans btw)
This is a huge blow to Anthropic/OpenAI/Google and a massive win for the rest of the world. The official API prices and speeds mean nothing for open source models.
gertlabs - 6609 seconds ago
GLM 5.2 is the first model we've tested that is unambiguously on par with, or better than Opus 4.6 (although as usual, we have GLM 5.2 and most other Chinese models a bit below most other benchmarks with more vulnerable test methodologies).
Data at https://gertlabs.com/rankings
simonw - 22458 seconds ago
I was surprised that GLM 5.1/5.2 are not vision models - they are text input only.
That's actually pretty uncommon these days. All of the OpenAI/Anthropic/Gemini models accept images, and so do the other leading open weight families - Gemma 4, Qwen 3.6, Kimi 2.x.
In GLM's case image input would be useful because it's a model that scores very highly for tasks like web design, but without image input it can't take a screenshot and output HTML+CSS.
Don't get me wrong, GLM is a phenomenal model, but the image thing is a bit of a gap.
CuriouslyC - 28576 seconds ago
I've been playing with this model a fair amount over the last 24 hours, and I can confirm it's quite capable, while being a little bit verbose (I've seen it reconsider things 3-4 times in thinking traces before deciding on a path forward), and not being quite as good as GPT5.5 at working through complex abstract requirements.
Honestly it's good enough that I feel comfortable recommending a Z.AI sub + a $20/mo OpenAI sub for all but the most AI pilled multi-orchestrators, or the die hard Claude fans. GLM writing + GPT reviewing/debugging feels pretty unlimited and minimally worse than just doing everything in GPT with the $200/mo plan.
CubsFan1060 - 25631 seconds ago
Knowing very little about how to run these, how close are we to medium or larger businesses starting to buy hardware to run models like this to keep the models local?
It’s expensive, and not as capable as the frontier models, but would have some pretty big benefits around privacy and agency.
tensegrist - 27416 seconds ago
> On the Intelligence vs. Cost per Task Pareto Frontier: GLM-5.2 is on the Pareto frontier of the Intelligence vs Cost per Task chart, with the lowest cost per task among models at its intelligence level. GLM-5.2 costs ~$0.46 per task, compared to GLM-5.1 ($0.25), Kimi K2.6 ($0.31), MiniMax-M3 ($0.18) and DeepSeek V4 Pro (max, $0.05)
am i missing something?
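One possible reading of the quoted claim: a model is on a Pareto frontier when no other model is both at least as cheap and at least as intelligent, so the most expensive model can still be on the frontier if nothing cheaper matches its score. A minimal sketch of that definition follows. The model names and numbers are made-up placeholders chosen only to illustrate the rule; they are not Artificial Analysis data.

```python
def pareto_frontier(models):
    """Return names of models that no other model dominates.

    `models` maps name -> (cost_per_task, intelligence_score).
    A model is dominated if another one costs no more AND scores at least
    as high, and is strictly better on at least one of the two.
    """
    frontier = []
    for name, (cost, score) in models.items():
        dominated = any(
            o_cost <= cost and o_score >= score and (o_cost < cost or o_score > score)
            for other, (o_cost, o_score) in models.items()
            if other != name
        )
        if not dominated:
            frontier.append(name)
    return frontier


# Placeholder models with invented costs and scores (illustration only).
models = {
    "cheap-weak": (0.05, 40),
    "mid": (0.25, 50),
    "mid-pricey": (0.31, 48),        # costs more than "mid" and scores lower
    "expensive-strong": (0.46, 55),  # priciest, but nothing cheaper scores as high
}

print(pareto_frontier(models))
```

Under this definition "lowest cost per task among models at its intelligence level" and "higher cost per task than the other open models" are compatible statements, provided the cheaper models score lower.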
wongarsu - 24278 seconds ago
It's also third best overall on "AA-Omniscience Non-Hallucination Rate", far higher than DeepSeek, GPT 5.5 or Fable.
That's the one benchmark that allows LLMs to answer "I don't know" and punishes them for trying to bullshit their way through the questions
kingstnap - 28124 seconds ago
According to many benchmarks this model is straight up frontier level and Zai seriously cooked. Some of these numbers are incredible.
Excited to see if this turns out to be an open-weight Opus 4.5 or better.
XCSme - 25946 seconds ago
In my tests[0] GLM-5.2 is not much better than GLM-5, and overall DeepSeek V4 Flash seems to be the better/more cost-effective choice:
[0]: https://aibenchy.com/compare/deepseek-deepseek-v4-flash-high...
xiaoyu2006 - 26954 seconds ago
This open source model is quite near SOTA with only 700B/40B MoE. Truly efficient.
guybedo - 4951 seconds ago
It's probably a good model but they used GLM 5.1 to code their infra.
I signed up to their max plan yesterday, did some light coding work, and i'm at 180M tokens used and 40% weekly quota gone.
Even when tokenmaxxing on the Claude Max or GPT $200 plan, i couldn't get more than 20% quota gone per day.
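A quick back-of-the-envelope on the quota figures in this comment, as a sketch. It assumes the 180M tokens and the 40% were consumed in roughly one day, which the comment implies but does not state exactly, and that usage continues at the same pace.

```python
# Figures from the comment above.
tokens_used = 180_000_000
percent_of_weekly_quota_used = 40

# Implied size of the full weekly quota (integer math avoids float noise).
implied_weekly_quota = tokens_used * 100 // percent_of_weekly_quota_used

# Days until the weekly quota runs out, assuming the same usage every day.
days_at_this_pace = 100 / percent_of_weekly_quota_used

# For comparison: the comment's ceiling of 20% per day on the other plans.
days_at_20_percent_per_day = 100 / 20

print(f"Implied weekly quota: {implied_weekly_quota:,} tokens")
print(f"Quota exhausted after {days_at_this_pace} days at this pace")
print(f"At 20% per day it would last {days_at_20_percent_per_day} days")
```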
daniban - 4370 seconds ago
I'm curious what harness everyone is using for these? I want to start testing some of these open models but don't know what tools people use to get them working "agentically".
leemoore - 13632 seconds ago
GLM 5.2 feels like Opus 4.6 level. I actually think 4.6 and GLM work better in practice than opus 4.7 or 4.8 as I find both of those more erratic and seem to randomly have a super dumb turn. That random bad turn I see doesn't seem to be hitting the benchmark scores but they make 4.7 and 4.8 very hard to use for me. GLM is more stable like opus 4.6
Pragmata - 25802 seconds ago
So this basically means we will have a near-Opus-level model that can be run locally in the next couple of months, right?
Qwen 3.6 27b is already pretty good, but it should be possible to get a better option now that runs on the same hardware, right?
rahidz - 27385 seconds ago
Correct me if I'm wrong, but neither DeepSeek nor GLM have image input modality. This makes them less useful when looking at UIs, photos, screenshots, etc. doesn't it? Or do they have alternate ways of doing so?
ponyous - 18276 seconds ago
Just ran and scored 63 3d model generations (via code) across high and no reasoning. 3D Modeling benchmark quickly shows spatial, logic and code performance of the model so I think it's a very good indicator of the quality.
Here are the results compared to Gemini 3.5 Flash:
```
    Model + config          CodeErr/gen   Cost/gen   Median time   Quality
    gemini-3.5-flash, low      0.71        $0.18        68s       baseline
    GLM 5.2, reasoning high    0.61        $0.18       289s         -6.0%
    GLM 5.2, reasoning off     1.52        $0.10       126s        -13.6%
```
Although it is cheaper, it is significantly slower, and results are worse overall. Surprisingly, high reasoning produces fewer code errors than Gemini 3.5 Flash, but when I actually look at the generated 3D models they are worse.
Edit: I recently ran evals with Kimi 2.7 and MiniMax-M3, and this is clearly the open source SOTA model, by far.
JustSkyfall - 21765 seconds ago
The problem with these benchmarks is that the Chinese models tend to be incredible on paper, and absolutely terrible in practice :/
davidwritesbugs - 27905 seconds ago
I like their models, super cheap - I'm a Lite plan subscriber, and subjective performance seems to be the same as lower Anthropic models, useful for lots of grunt work. The problem is that Zhipu really __really__ struggle with capacity - everyone is complaining of timeouts or very slow speeds. I can't get direct access to the model, though I see it is on OpenRouter so I may play. But the capacity issues mean DeepSeek is my main provider these days.
m-dot-reviews - 19440 seconds ago
For anyone who's interested, I've put together a simple site for sharing ratings/opinions on models at a task-specific granularity. https://model.reviews/
The idea is that benchmark score comparisons are useful for a large cross-product comparison across models + their settings, but less useful if you're looking for the best model for <your-specific-task>. So I thought having a place to review and comment could be beneficial to people.
I'm not sure how best to get the corpus bootstrapped (i.e. people will likely only visit/post on the site if there's already activity), so posting it here for anyone who'd like to contribute.
_pdp_ - 26533 seconds ago
I am helpful.
DeepSeek V4 has been quite amazing in our workloads and it operates at a fraction of the cost. I have not tried GLM 5.2 but it seems that it hits a sweet spot.
ramon156 - 26738 seconds ago
I've made a comment before that 5.1 will sometimes get stuck looping over a simple decision or statement. It will basically contradict itself and then not realize that one option is the definite option. Sometimes it's two statements that aren't even mutually exclusive. Either way, a lot of tokens get wasted on this.
I haven't extensively used 5.2 yet, but it seems a lot better.
dizhn - 21564 seconds ago
FYI.. This is coming with 3mil GLM 5.2 tokens right now. (Needs login. Google SSO fine) https://zcode.z.ai/en
hereme888 - 12451 seconds ago
Hmmm... GLM insists it's Gemini.
https://github.com/zai-org/GLM-5/issues/79
alansaber - 17639 seconds ago
These open source models need better multi-turn capabilities. They are always lacklustre in "agent mode". Whether it's just less RL, whatever, it's a worse "product". Whereas it feels like the frontier labs have been all-in on "agentic" multi-turn reasoning for a long time now.
osti - 12509 seconds ago
Fun fact: Zhipu (aka Z.ai, Knowledge Atlas etc.), the company that made GLM, is listed on the Hong Kong stock exchange and is up over 10x since the IPO at the beginning of this year.
RDTvlokip - 15207 seconds ago
I have a question, as it happens: do you think the models were trained on benchmark datasets to skew the results, even though in real-world applications we realize they're not that great?
robertwt7 - 16965 seconds ago
what is that moodboard and chart of hypertension in the middle of the article that isn't explained?
This is a great step up in open models however the pricing to support z.ai is not far cheaper than Claude / OpenAI subscription
KaoruAoiShiho - 18417 seconds ago
This is really held back by one bench (omniscience accuracy) where it's very far behind; otherwise I think it would have scored at least a couple of points higher.
creamyhorror - 27018 seconds ago
It's a real step forward, getting closer to SOTA. It seems to be very epistemically cautious in its reasoning. I hope Deepseek and the other open-weights labs stay in the game and catch up too.
piterrro - 17742 seconds ago
DeepSeek v4 pro is still 10x cheaper than GLM-5.2 and the quality is still enough for 95% of coding tasks.
adithyaharish - 7003 seconds ago
Why don't all open source LLMs have open weights like this model?
zftnb666 - 20432 seconds ago
Open-weight models are winning. The gap with closed models is now measured in months, not years.
Havoc - 28808 seconds ago
It’s pretty good. More talkative than 5.1. Reminds me of deepseek 4
Their servers are melting though - getting more timeouts etc
nh43215rgb - 28653 seconds ago
> GLM-5.2 sits off the most attractive quadrant on the Intelligence vs Output Tokens chart.
That is unfortunate...
lousken - 26946 seconds ago
Cerebras really needs to have this on their API list (if they even still exist).
hyqzz8 - 13686 seconds ago
It is a very useful model
sourcecodeplz - 23225 seconds ago
1m context btw.
eckelhesten - 23954 seconds ago
Sure, but whatever you do, don't buy their (Z.ai) lite plan.
I feel like i threw 15 dollars in the sea. I'm getting rate limited after 3-4 prompts. You get way less value than just paying 25 dollars for Claude or OpenAI models.
jayess - 13803 seconds ago
I asked z.ai what z.ai is, and it said "It seems you might be referring to xAI, as "z.ai" isn't a widely known or major AI company or platform at this time."
Computer0 - 17975 seconds ago
Regrettably I haven’t tried 5.2 yet but 5.1 I did not see as anything special. In practice I found it to be ~70% as good as Claude sonnet.
dsrtslnd23 - 24470 seconds ago
looks like I need a GB300 workstation
Imustaskforhelp - 23821 seconds ago
I have been trying out GLM 5.2 and I am really impressed by it for the most part.
To everyone on Hacker News: I am curious which agent harness you are using it with.
Previously I was using opencode, and then I switched to Opencode + obra/superpowers and writing custom skill.md files for it. I found it takes more time and more intervention, but it works better as a result.
Now I have also started using oh-my-pi, and I found it to be faster than Opencode.
I am unsure how much of this is a real difference and how much is placebo, but what is your opinion on the best agent harness for GLM 5.2?
hit8run - 23965 seconds ago
Ok, it is nice to see another great open source model. Not sure what to think of all these benchmarks but GLM was already quite strong before so an update is very welcome.
kissgyorgy - 24636 seconds ago
I tried it today through Openrouter and the API is atrocious. I got multiple rate limit and random errors every turn.
Somebody wrote [1]; "I am never touching Minimax or GLM again. Their APIs had constant outages and I had to restart my runs multiple times — after burning money on the runs that failed midway." and I 100% agree.
The model might be good, but if the API is so bad, it's effectively useless.
[1]: https://kasra.blog/blog/i-spent-1500-seeing-if-llms-could-ha...
maxothex - 9172 seconds ago
[flagged]
Asfand3099 - 17462 seconds ago
[flagged]
mohsen1 - 27655 seconds ago
I don't know if it's the harness or if the model is really not at the level those benchmarks are showing, but based on my own "feelings" after using it, it's not Opus 4.5 level. It can't figure things out in my project (https://tsz.dev), or maybe tsz is at a stage where things are getting too difficult even for frontier models to be productive. I had my most productive time the weekend Fable was available, and since then progress has been pretty slow.