GLM 5.2 vs. Opus

techstackups.com - 371 points - 260 comments - 40414 seconds ago

Comments (70)

cultofmetatron - 37826 seconds ago
I seriously don't get all this big hullabaloo about one-shot prompting.
By definition, a single prompt won't capture the complexity of a software project. Ergo, what you'll get is a series of assumptions made by the model based on preexisting code in its training corpus.
I'd rather see a coding agent that can follow steps in a plan file to a T while following guardrails and adhering to the proper coding conventions in the human-reviewed spec.
I'd rather see performance in agent loops against human-defined objectives, where it can be verified to stick to defined guardrails and continue without drift until its objectives are complete.
I'd also like to see it identify bugs and potential performance increases by identifying existing code and suggesting refactors based on context it can pick up about the particular use case you are trying to create.
These are way more valuable metrics than "hey build X".
faxmeyourcode - 6124 seconds ago
I feel like another comparison worth looking at is purely cost.
Capability per dollar is something I care about:
```
    Opus API    $5/$25
    Sonnet API  $5/$15
    Haiku API   $1/$5

    GLM 5.2 API $1.4/$4.4
```
So you're really getting near-Opus-level capability for the price of Haiku.
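To make the per-dollar comparison concrete, here is a minimal sketch that applies the prices listed above to a made-up workload. It assumes the two numbers in each row are USD per million input and output tokens (the usual way API prices are quoted, but the table doesn't say so), and the 2M-input / 0.5M-output job size is purely hypothetical.

```python
# Prices copied from the table above; assumed to be USD per 1M (input, output) tokens.
PRICES = {
    "Opus": (5.0, 25.0),
    "Sonnet": (5.0, 15.0),
    "Haiku": (1.0, 5.0),
    "GLM 5.2": (1.4, 4.4),
}


def job_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost in USD of one job at list price."""
    in_price, out_price = PRICES[model]
    return (input_tokens * in_price + output_tokens * out_price) / 1_000_000


# Hypothetical workload: 2M input tokens, 0.5M output tokens.
for name in PRICES:
    print(f"{name:8s} ${job_cost(name, 2_000_000, 500_000):.2f}")
```

On that made-up workload GLM 5.2 comes to about $5.00, against $4.50 for Haiku and $22.50 for Opus. The ratio shifts with the input/output mix, so plug in your own token counts.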
meander_water - 39178 seconds ago
> So we ran it head-to-head against Claude Opus 4.8: same one-shot prompt, build a 3D platformer in raw WebGL from scratch
Running a single one-shot prompt is not a benchmark, nor is it representative of any sort of real-world usage.
Most agent usage is collaborative, so you need to test things like reliability (when I delegate a task, does it complete it without, e.g., making up test results) and steerability (does it obey my instructions or does it just do what it thinks is best).
lukaslalinsky - 16060 seconds ago
I was never able to get these models to collaborate with me the way Opus does. I'm probably an outlier: I don't one-shot projects, I don't vibe code. I basically use LLMs as if I was working with a coworker, a fairly smart one, but with short memory and often missing the big picture. Sometimes I can delegate more, sometimes less, but I know I always have to stay on top of what's happening, because it WILL create a mess when it hits something hard. With the Anthropic models, this kind of cooperation is easy (with the exception of Opus 4.6, which was bad for some reason).
xlii - 38743 seconds ago
I've been checking out GLM 5.2 on some projects and have a few thoughts on it:
- it takes its sweet time to get code rolling, not the fastest model by any means
- it strays a lot during discovery/planning but then corrects
- it's not steering-friendly, as it hallucinates things that it doesn't follow later on
- its output is quite good
A sample use case: I was optimizing rendering on a Swift+Zig codebase. It choked on 5k data entries.
GLM 5.2 spent 20 minutes building the benchmarks and getting data out, which made me frustrated, so I blocked non-editing tool access and went AFK. After approx. 30 minutes I found that it had used the already-made benchmarks and some "conclusions" to optimize 3 choke points. The output noted that it couldn't validate its suspicions and asked for more data.
The implementation worked well; it was idiomatic and non-intrusive. I would even say that it was more idiomatic than GPT 5.5's results on the same repo.
I would opt into using it more, BUT GPT usually completes the same requests 5x faster.
GLM 5.2 was the spark for preparing and running inside isolated containers with JJ workspaces (so that multiple can be run in parallel).
InsideOutSanta - 6523 seconds ago
One nice thing about GLM is that it has never refused a task. I'm working on a website that renders countries right now, and Anthropic's models regularly give me the old "This request triggered safety guardrails."
I'm not sure what exactly triggers it, but it seems to happen when it has to look at lists of countries. I suspect there must be at least one country name that triggers the safety guardrail.
You'd expect GLM to balk at something like Taiwan, but so far, it hasn't.
toddmorey - 18791 seconds ago
I’m actually amazed at the output since GLM doesn’t have eyes. If GLM 5.2 costs 1/5 as much, seems like it could be set up to reach out to a multimodal model for vision tasks when required. Closer to parity but probably still significantly cheaper.
coreyburnsdev - 9714 seconds ago
People are looking for ways not to burn through their premium subs when in many cases all you have to do is move down to 5.4-mini codex and it will probably solve your issue while barely touching your 5 hour or weekly limits.
stevenhubertron - 17410 seconds ago
No one has really talked about a hybrid setup: using Opus to plan and orchestrate GLM's work through both the initial build and code reviews. That's a true best of both worlds, and there doesn't need to be a winner.
ulrikrasmussen - 37128 seconds ago
> Through an API it costs a fraction of Opus, and you can run it yourself for free if you have the hardware.
I haven't been keeping up on hardware costs for state of the art LLM inference, but this remark made me ask myself how many readers of the article would actually be able to run this model on hardware they own. How much would it cost to acquire such a setup?
postatic - 36091 seconds ago
I've signed up with Ollama to experiment with these open source models. For the past 3 months, it's just been experimenting, trying it out. GLM is the first model that I am using on a daily basis to do my coding work (as well as using Claude). It's good - I've been maxing out my Ollama usage limits every day :)
mellosouls - 8254 seconds ago
> GLM-5.2 cost a fraction as much. Opus finished in half the time and shipped a cleaner game
This implies Opus was potentially much (?) better value.
GLM cost a quarter but Opus was twice as fast. So we are already at GLM actually costing half when you compare on time, without even considering the extra effort and time it would take to get Opus-par results.
It's good to have cheaper options and very impressive to see the Chinese continue to set open standards in this field, but the article is maybe a little over-generous.
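The "costing half" figure above can be reproduced with a one-line calculation. This is only a sketch of one way to read it: it assumes wall-clock time is valued in direct proportion to the bill, which is a modelling choice rather than anything the article states, and the function name is made up.

```python
def time_weighted_cost_ratio(cost_ratio: float, time_ratio: float) -> float:
    """Relative cost of model A vs. model B once A's extra run time is priced in.

    cost_ratio: A's bill divided by B's bill (0.25 = a quarter of the price).
    time_ratio: A's run time divided by B's run time (2.0 = twice as slow).
    Assumption: waiting time is valued in direct proportion to the bill.
    """
    return cost_ratio * time_ratio


glm_vs_opus = time_weighted_cost_ratio(cost_ratio=0.25, time_ratio=2.0)
print(glm_vs_opus)  # 0.5
```

Any extra rework needed to reach Opus-par results would push the ratio higher still.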
xg15 - 33629 seconds ago
So GLM emits fewer tokens and does fewer tool calls, but still takes over twice as long to complete.
Can someone explain to me where that time usage is coming from if not from the model operation itself?
Are the individual tool calls more complex and take more time to complete? Or is the rate of tok/s lower because the model does more compute per token?
js4ever - 27456 seconds ago
"GLM-5.2 hit a problem here, because it can't read images. It isn't multimodal. So instead of looking at a screenshot, it fell back on a hacky workaround: it wrote scripts to read the raw pixel data and check whether the colors came out roughly as expected."
A better way would be to use https://github.com/openbmb/MiniCPM-V
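For readers wondering what the quoted pixel workaround amounts to, here is a rough sketch of the idea. It is not the script GLM-5.2 actually wrote (the quote doesn't show it); the function name, the tolerance and the sample data are all made up for illustration.

```python
from typing import Iterable, Tuple

RGB = Tuple[int, int, int]


def fraction_matching(pixels: Iterable[RGB], expected: RGB, tolerance: int = 30) -> float:
    """Share of pixels whose every channel is within `tolerance` of `expected`."""
    pixels = list(pixels)
    if not pixels:
        return 0.0
    hits = sum(
        all(abs(c - e) <= tolerance for c, e in zip(px, expected))
        for px in pixels
    )
    return hits / len(pixels)


# Hypothetical frame: mostly sky blue with one stray red pixel.
frame = [(100, 150, 240)] * 9 + [(255, 0, 0)]
sky = (110, 160, 235)
print(fraction_matching(frame, sky) >= 0.8)  # True
```

A check like this can tell you that the sky is roughly blue, but not whether the level is playable, which is why handing the screenshot to a vision model is the more robust option.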
XCSme - 12590 seconds ago
Check out my comparison too, it has some not-really-benchmarks too (between any two models actually, SVG generation test and CSS animation test):
https://aibenchy.com/compare/anthropic-claude-opus-4-8-mediu...
somesortofthing - 6010 seconds ago
this comparison seems kind of pointless if one model has vision and the other doesn't. obviously a model that can see is going to beat a blind model at making a video game.
bornfreddy - 7780 seconds ago
I know that running this locally is prohibitively expensive (for now), but what kind of cost would I be looking at if I wanted to rent the hardware and run the model by myself?
wiremine - 19058 seconds ago
I've been using GLM 5.2 extensively for the last few days. It is slower, and the lack of multimodality is a bummer.
But it produces solid results for a fraction of the price. Worth checking out if you have the time.
One of my go-to "tests" of a new frontier model is having it rebuild a programming language from scratch. For GLM 5.2 I had it rebuild the old Rebol language in Rust:
https://github.com/mhs/rebol-clone-glm-5.2
It did a fairly good job roughing in the language for a low token cost.
elliotbnvl - 12412 seconds ago
It is insane that we are comparing locally-hostable models to leading cloud providers, it is wild to me that this article even exists.
We have come a long way, and very clearly have a long way yet to go.
david_shi - 37516 seconds ago
> GLM-5.2 cost a fraction as much. Opus finished in half the time and shipped a cleaner game.
Off topic, but does anyone else instantly pick up on LLMisms like this? It seems like all the models have converged on this style of writing, and improvements aren't really changing it.
Muaz_Ashraf - 6996 seconds ago
There is no comparison between GLM 5.2 and Opus. First, to run GLM 5.2 you need very large hardware resources, and those cost real money, so you might as well buy the Opus subscription and enjoy it.
pietz - 34021 seconds ago
GLM 5.2 has one big issue that will limit its meaningful success, and that's the value of their coding subscription.
Yes, in terms of API pricing, GLM 5.2 outperforms the competition. But the only people that use API billing for their coding work are large corporations, where these highly subsidized subscriptions are being phased out.
At the same time, none of these companies will use a Chinese API for their employees.
For individuals and smaller teams, Z.ai's coding subscription is outperformed by Anthropic and OpenAI. You probably get around the same usage with Claude, but Codex definitely offers more usage for the amount you pay.
We can have a debate how much Z.ai closed the gap to GPT5.5 and Opus 4.8, but if I can freely decide between them in a world where they all cost the same, I simply wouldn't choose GLM.
So the important question becomes: how good will the offering from Z.ai get with GLM 5.3 or 6, and how much will OpenAI and Anthropic cripple their current offerings in the near future?
doe88 - 14803 seconds ago
To me, one-shot prompting is as relevant as Strava's KOM is for cycling. I'm more interested in a good cycling performance after a 3-hour ride than in a straight-up 30-minute record effort.
maxdo - 19723 seconds ago
So the benchmark is: two models with different harnesses produced very different results.
The GLM game was completely broken. The Opus game was OK at first glance but also had bugs.
Different models with different costs produced different, non-perfect results. How is it “close”? :)
Also on costs: GLM burns more tokens on average vs. Opus. GPT 5.5 burns fewer, surprisingly.
stavarotti - 17246 seconds ago
This style of comparison is decent at showing capability, but it doesn't really show me what I truly want: a sounding board and implementer with senior engineer-level execution. When I look back at all the teams that I've been part of, the best outcomes came from white-boarding (sometimes in the metaphorical sense) with one or two people, at times arguing, then finally compromising on a plan. Instead of synthetic benchmarks that try to be objective, I wonder if there's a way to test this, or maybe I'm opining on a way of working that will soon be gone?
jkwang - 38665 seconds ago
GLM-5.2 is quietly becoming the most interesting open model release this year. The coding benchmarks are surprisingly close to frontier models at a fraction of the inference cost.
greyman - 38808 seconds ago
> On output tokens, GLM-5.2 is less than a fifth the price of Opus.
Opus is the most expensive model under pay-as-you-go pricing, but IMO a fair comparison should include the subscription price as well. For example, when one has the $100 Claude Max plan and uses it up through the month, it might not be more expensive than GLM, or at least not 5x.
zkmon - 37092 seconds ago
Cost difference matters most, as cost optimization is the whole point of AI. Time difference (30 min vs 1 hr) is not a deal-breaker. The small precision gap on the first iteration does not matter for 99% of the work that happens in the real world.
TurdF3rguson - 36911 seconds ago
Pretty clearly it's beating Opus at [web dev](https://www.gptbased.com/) - on price, on score.. I mean what else is there?
CuriouslyC - 28061 seconds ago
You should repeat this experiment but with progressively more detail in the initial prompt. Claude's secret sauce is taking weakly specified prompts and making passable things from them, but as the degrees of freedom in the prompt go down Claude starts to disobey while other models close in on the intent.
xrd - 17768 seconds ago
How are people running this locally? I just checked llama.cpp and it appears unsloth has a version but it hacks a bunch of things to make it work and isn't optimal.
https://github.com/ggml-org/llama.cpp/issues/24730
thedreammachine - 26414 seconds ago
I was surprised today by how much better GLM-5.2 was than GPT-5.5 at aesthetic/UI work. I'll keep my Claude/Codex setup via Conductor for now, but this model got me to set up OpenCode, download their desktop app and do most of my work there today.
samsin - 17788 seconds ago
My understanding was that n-shot prompting just referred to the number of examples included in a prompt, not the number of prompts to achieve the desired result.
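A small sketch of that definition, using a made-up `Input:`/`Output:` prompt template and toy translation examples (both are illustrative assumptions, not any particular vendor's format):

```python
from typing import List, Tuple


def build_prompt(task: str, examples: List[Tuple[str, str]]) -> str:
    """Prepend worked input/output examples ("shots") to a task."""
    parts = [f"Input: {inp}\nOutput: {out}" for inp, out in examples]
    parts.append(f"Input: {task}\nOutput:")
    return "\n\n".join(parts)


def shot_count(examples: List[Tuple[str, str]]) -> int:
    """The n in "n-shot" is simply the number of examples in the prompt."""
    return len(examples)


examples = [("chat", "cat"), ("chien", "dog")]
zero_shot = build_prompt("cheval", [])       # no examples: zero-shot
two_shot = build_prompt("cheval", examples)  # two examples: two-shot
print(shot_count([]), shot_count(examples))  # 0 2
```

Under this reading, sending one prompt and sending ten follow-ups are both zero-shot as long as none of them contains a worked example.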
"Build a 3D platformer game from scratch, in raw WebGL, with no game engine or 3D library" would be a zero-shot prompt.
Aozora7 - 37738 seconds ago
I used GLM 5.0/5.1/5.2 for some projects, and for me, the area in which they lag behind frontier models the most is user interfaces. They get really close to Opus when it comes to pure algorithms, but when I need something like a web application or a mobile app that looks and works well, they are very noticeably worse than even Sonnet.
hmokiguess - 21774 seconds ago
I signed up for GLM 5.2 yesterday to try it out because Anthropic kept throwing 529 Overloaded.
I like it, but the lite plan ate 22% of my 5h reset window in a single session after 2 prompts on xhigh with GLM 5.2 [1m].
The result was satisfactory and I think the output is decent. I'm happy to use either; I wish there was a combined subscription plan where I could get both.
leumon - 37430 seconds ago
I've seen GLM 5.2 struggle to write simple, compilable C code. It might be good at web, but its world knowledge is limited due to the small model size, making its use quite limited in my opinion.
orloffm - 19255 seconds ago
> 256 GiB unified RAM.
So, $8000, plus it's unavailable. That's 3 years of a Codex/Opus subscription.
> API prices
Which are irrelevant for $200 Codex/Opus plans that are many times cheaper.
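The "3 years" figure checks out as rough arithmetic if the $8000 hardware estimate is set against the $200 plan price from the same comment. Pairing those two numbers is an assumption, and the sketch ignores electricity, resale value and plan price changes.

```python
def breakeven_months(hardware_cost: float, monthly_plan: float) -> float:
    """Months of subscription that the hardware price would buy."""
    return hardware_cost / monthly_plan


months = breakeven_months(hardware_cost=8000, monthly_plan=200)
print(months, round(months / 12, 1))  # 40.0 3.3
```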
lordforever - 13865 seconds ago
I think inference is the thing, and fast inference at that, so enterprises can just host their own and run it. I guess Vercel does it, and many more would. But as it thinks too much, I don't know how fast we can make it.
poulpy123 - 29619 seconds ago
What would be the best way to use these open source models for a price similar to what I could pay for the cheapest plan with Claude and OpenAI?
I would like to give them a try, but I certainly don't have the money to get a system able to run them, and I don't really want to pay more than for the state of the art.
jofzar - 36141 seconds ago
Great article,
My only, I guess feedback, is that it's not really clear about the price.
Would the 21.92 be the API pricing I guess?
Cost $5.39 (real billed) ~$21.92 (estimate, list pricing)
maccard - 20316 seconds ago
Are these games supposed to be a good example of quality output? If this is the product, I don't really want to play _either_ of them.
wejick - 35277 seconds ago
Totally agree with the general assessment. The biggest problem with Z.ai's models for a long time has been not quality, but inference speed and general capacity availability. Hopefully with this recent hype, there will be more providers on OpenRouter for 5.2.
IronWolve - 38305 seconds ago
I was having issues coding a render for good-looking, realistic smoke coming off burning incense. Opus 4.8 and GPT-5.5 both had code issues; GLM-5.2 did it. Amazing.
The real-time 3D fluid dynamics appear to be the tricky part. I wish I still had Opus access; I would love to see if it can do it.
aykutseker - 23917 seconds ago
The text-only part is the catch for me.
If it builds a UI and can't look at it, it's asking ls whether the app looks right.
close2 - 36639 seconds ago
I wonder how many tokens and how much time were used for the verification part. Maybe GLM 5.2 instantly found the "solution" to read the screen pixel by pixel, but it could also have been a major token and time consumer.
jdright - 17931 seconds ago
Not apples to apples. Comparing official vs. pi.dev+openrouter and seeing slow times is more an OpenRouter issue. Try comparing using official z.ai.
Havoc - 33122 seconds ago
Still on a z.ai legacy plan and their 50% discount for switching to standard plans tips the balance for me. So I guess I’ll reevaluate round about beginning 2028…
efficax - 18917 seconds ago
glm-5.2 is very good if you have a good harness and workflow to use it with. in fact, i'd call it good enough if you are a software engineer who knows what you want. it writes the code. i'm wondering if i need anthropic's models at all at this point, or openai. and surely in a year we won't need them at all. Opus 4.5+ was the turning point for me, and now these open models are just as good. i don't get how you IPO these companies when their only winning product is coding agents and the competition is just as good for 1/4 the price.
wolttam - 20886 seconds ago
Would have run it with GLM on max/xhigh effort. Just for fun.
_pdp_ - 35747 seconds ago
In the name of science we crafted an autonomous AI agent that builds games on a loop. It is based on GLM 5.2.
I am not sure where this is going to lead us but it is fun to watch.
cwoolfe - 17767 seconds ago
The model is 756B parameters, open weights.
linzhangrun - 37821 seconds ago
Just that their Coding Plan is too hard to get. I've been trying to grab it for a week and still can't get it
speedgoose - 37205 seconds ago
While this is interesting, one single sample with different coding harnesses is not very scientific.
taosu_la - 29903 seconds ago
I'm really feeling a bit tired of these models. I feel that since opus 4.1, I haven't been able to clearly feel the intelligence improvement from the model upgrades (except for gpt 5.5 and opus4.6 being able to speak like a human)
yanhangyhy - 33623 seconds ago
I think GLM 5.2 is not cheap, and the coding plan is not easy to get... so even if it's at the Opus level, it's still not attractive.
elzbardico - 16442 seconds ago
If you are a real engineer and use the LLM as a pair programmer instead of delegating everything to it, even GLM 4.7 was already good enough to help you with a lot of work.
I used it with Cerebras inference at a time when it had a good coding plan at a low price, and delivered tons of stuff using it.
NicoJuicy - 18140 seconds ago
For those praising GLM 5.2, can anyone confirm?
Tried with 2 harnesses and it seems bad + slow
_s_a_m_ - 29553 seconds ago
GLM is the most overrated LLM. I tried it and it's not good.
ukprogrammer - 32272 seconds ago
GLM cannot use vision like Opus can. This is not a useful comparison.
sourcecodeplz - 35624 seconds ago
What is this fashion of testing models by giving them one-shot projects? Especially games. This is so stupid.
msejas - 36684 seconds ago
Seeing the results, I don't see how they are even comparable. Opus is clearly far superior in most aspects: smoothness, design, functionality, etc.
At the end of the day, the time saved is more important than the cost for big players.
The ability to spawn 10 Claude agents and rush a project to outcompete someone is more important for big businesses, IMO. Also, the small details that GLM missed would take significantly more time to iron out, considering it already took double the time.
I do hope other (open weight) models catch up, but to act like they are anywhere close for me is a bit disingenuous.
camillomiller - 29876 seconds ago
I swear, if I read forms with “genuinely” one more time I am gonna scream. FUCK LLM WRITING
joshrw - 38412 seconds ago
Chinese models optimize for benchmarks and do poorly in real-world tasks