Anonymous request-token comparisons from Opus 4.6 and Opus 4.7

tokens.billchambers.me - 423 points - 427 comments - 34503 seconds ago

Comments (69)

andai - 25287 seconds ago
For a fair comparison you need to look at the total cost, because 4.7 produces significantly fewer output tokens than 4.6, and seems to cost significantly less on the reasoning side as well.
Here is a comparison for 4.5, 4.6 and 4.7 (Output Tokens section):
https://artificialanalysis.ai/?models=claude-opus-4-7%2Cclau...
4.7 comes out slightly cheaper than 4.6. But 4.5 is about half the cost:
https://artificialanalysis.ai/?models=claude-opus-4-7%2Cclau...
Notably the cost of reasoning has been cut almost in half from 4.6 to 4.7.
I'm not sure what that looks like for most people's workloads, i.e. what the cost breakdown looks like for Claude Code. I expect it's heavy on both input and reasoning, so I don't know how that balances out, now that input is more expensive and reasoning is cheaper.
On reasoning-heavy tasks, it might be cheaper. On tasks which don't require much reasoning, it's probably more expensive. (But for those, I would use Codex anyway ;)
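A rough sketch of the total-cost argument above, in Python. The prices and token counts are hypothetical placeholders chosen only to show the trade-off; they are not Anthropic's actual pricing or measured usage:

```python
def total_cost(input_tokens: int, output_tokens: int,
               input_price: float, output_price: float) -> float:
    """Dollar cost of one request; prices are dollars per million tokens."""
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


# Hypothetical prices (dollars per million tokens), identical for both models.
INPUT_PRICE, OUTPUT_PRICE = 5.0, 25.0

# Hypothetical request on the old model.
old = total_cost(100_000, 20_000, INPUT_PRICE, OUTPUT_PRICE)

# Same request on the new model, assuming 35% more input tokens
# and 30% fewer billed output/reasoning tokens.
new = total_cost(135_000, 14_000, INPUT_PRICE, OUTPUT_PRICE)

print(f"old: ${old:.3f}  new: ${new:.3f}")
```

With these made-up numbers the input inflation and the output saving nearly cancel (about $1.000 versus $1.025 per request). An input-heavy workload would come out more expensive, an output-heavy one cheaper.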
hgoel - 29378 seconds ago
The bump from 4.6 to 4.7 is not very noticeable to me in improved capabilities so far, but the faster consumption of limits is very noticeable.
I hit my 5 hour limit within 2 hours yesterday. Initially I was trying the batched mode for a refactor, but cancelled after seeing it take 30% of the limit within 5 minutes. I then tried a serial approach, which consumed less (took ~50 minutes, xhigh effort, ~60% of the remaining allocation IIRC), but still very clearly consumed much faster than with 4.6.
It feels like every exchange takes ~5% of the 5 hour limit now, when it used to be maybe ~1-2%. For reference I'm on the Max 5x plan.
For now I can tolerate it since I still have plenty of headroom in my limits (used ~5% of my weekly, I don't use claude heavily every day so this is OK), but I hope they either offer more clarity on this or improve the situation. The effort setting is still a bit too opaque to really help.
glerk - 27765 seconds ago
I'd be ok with paying more if results were good, but it seems like Anthropic is going for the Tinder/casino intermittent reinforcement strategy: optimized to keep you spending tokens instead of achieving results.
And yes, Claude models are generally more fun to use than GPT/Codex. They have a personality. They have an intuition for design/aesthetics. Vibe-coding with them feels like playing a video game. But the result is almost always some version of cutting corners: tests removed to make the suite pass, duplicate code everywhere, wrong abstraction, type safety disabled, hard requirements ignored, etc.
These issues are not resolved in 4.7, no matter what the benchmarks say, and I don't think there is any interest in resolving them.
kalkin - 30986 seconds ago
AFAICT this uses a token-counting API so that it counts how many tokens are in the prompt, in two ways, so it's measuring the tokenizer change in isolation. Smarter models also sometimes produce shorter outputs and therefore fewer output tokens. That doesn't mean Opus 4.7 necessarily nets out cheaper, it might still be more expensive, but this comparison isn't really very useful.
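A minimal sketch of what measuring the tokenizer change in isolation could look like. The two counter functions are stand-ins invented for this example; in practice each would wrap a call to a token-counting endpoint for one model version:

```python
from typing import Callable, Iterable


def token_inflation(prompts: Iterable[str],
                    count_old: Callable[[str], int],
                    count_new: Callable[[str], int]) -> float:
    """Total tokens under the new tokenizer divided by total under the old one."""
    prompts = list(prompts)
    old_total = sum(count_old(p) for p in prompts)
    new_total = sum(count_new(p) for p in prompts)
    if old_total == 0:
        raise ValueError("old tokenizer counted zero tokens")
    return new_total / old_total


# Toy counters for demonstration only (NOT real tokenizers).
def count_words(text: str) -> int:
    return len(text.split())


def count_words_and_punct(text: str) -> int:
    return len(text.split()) + sum(text.count(c) for c in ".,!?")


ratio = token_inflation(["Hello, world!", "a b c"],
                        count_words, count_words_and_punct)
print(f"{ratio:.2f}x")
```

This only compares prompt token counts for identical text, so it says nothing about how many output tokens each model goes on to produce.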
rectang - 29616 seconds ago
For now, I'm planning to stick with Opus 4.5 as a driver in VSCode Copilot.
My workflow is to give the agent pretty fine-grained instructions, and I'm always fighting agents that insist on doing too much. Opus 4.5 is the best out of all agents I've tried at following the guidance to do only-what-is-needed-and-no-more.
Opus 4.6 takes longer, overthinks things and changes too much; the high-powered GPTs are similarly flawed. Other models such as Sonnet aren't nearly as good at discerning my intentions from less-than-perfectly-crafted prompts as Opus.
Eventually, I quit experimenting and just started using Opus 4.5 exclusively knowing this would all be different in a few months anyway. Opus cost more, but the value was there.
But now I see that 4.7 is going to replace both 4.5 and 4.6 in VSCode Copilot, and with a 7.5x modifier. Based on the description, this is going to be a price hike for slower performance — and if the 4.5 to 4.6 change is any guide, more overthinking targeted at long-running tasks, rather than fine-grained. For me, that seems like a step backwards.
gsleblanc - 27956 seconds ago
It's increasingly looking naive to assume scaling LLMs is all you need to get to full white-collar worker replacement. The attention mechanism / Hopfield network is fundamentally modeling only a small subset of the full human brain, and all the increasing sustained hype around bolted-on solutions for "agentic memory" is, in my opinion, glaring evidence that these SOTA transformers alone aren't sufficient even when you just limit the space to text. Maybe I'm just parroting Yann LeCun.
someuser54541 - 31784 seconds ago
Should the title here be 4.6 to 4.7 instead of the other way around?
tiffanyh - 30124 seconds ago
I was using Opus 4.7 just yesterday to help implement best practices on a single page website.
After just ~4 prompts I blew past my daily limit. Another ~7 more prompts & I blew past my weekly limit.
The entire HTML/CSS/JS was less than 300 lines of code.
I was shocked how fast it exhausted my usage limits.
vicchenai - 989 seconds ago
ran into this yesterday building a data pipeline that pulls SEC filings. same prompt, same context window, 4.7 chewed through noticeably more of my api budget than 4.6 did. the output wasn't obviously better either, just... more expensive.
what bugs me is the tokenizer change feels like a stealth price hike. if you're charging the same $/token but the same text now costs 35% more tokens, that's just a 35% price increase with extra steps. at least be upfront about it.
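The arithmetic behind that claim, as a small sketch. The function name is made up for illustration, and the 1.35 ratio is simply the 35% figure from the comment:

```python
def effective_increase(token_ratio: float, price_ratio: float = 1.0) -> float:
    """Fractional change in cost for the same text.

    token_ratio: new token count / old token count for identical text.
    price_ratio: new $/token / old $/token (1.0 when the list price is unchanged).
    """
    return token_ratio * price_ratio - 1.0


# Same $/token, 35% more tokens for the same text.
increase = effective_increase(1.35)
print(f"{increase:.0%}")
```

This covers only text whose token count grows by that ratio; if a model also emits fewer output tokens, the overall bill can move differently.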
hereme888 - 24702 seconds ago
> Opus 4.7 (Adaptive Reasoning, Max Effort) cost ~$4,406 to run the Artificial Analysis Intelligence Index, ~11% less than Opus 4.6 (Adaptive Reasoning, Max Effort, ~$4,970) despite scoring 4 points higher. This is driven by lower output token usage, even after accounting for Opus 4.7's new tokenizer. This metric does not account for cached input token discounts, which we will be incorporating into our cost calculations in the near future.
bertil - 23507 seconds ago
My impression is that the quality of the conversation is unexpectedly better: more self-critical, the suggestions are always critical, the default choices consistently the best. I might not have as many harnesses as most people here, so I suspect it’s less obvious, but I would expect this to make it far more valuable for people who haven’t invested as much.
After a few basic operations (retrospective look at the flow of recent reviews, product discussions) I would expect this to act like a senior member of the team, while 4.6 was good, but far more likely to be a foot-gun.
dakiol - 30213 seconds ago
We dropped Claude. It's pretty clear this is a race to the bottom, and we don't want a hard dependency on another multi-billion dollar company just to write software.
We'll be keeping an eye on open models (of which we already make good use). I think that's the way forward. Actually it would be great if everybody would put more focus on open models, perhaps we can come up with something like the "linux/postgres/git/http/etc" of the LLMs: something we all can benefit from without it being monopolized by a single billionaire company. Wouldn't it be nice if we didn't need to pay for tokens? Paying for infra (servers, electricity) is already expensive enough.
npollock - 20235 seconds ago
You can configure the status line to get a feel for token usage:
[Opus 4.6] 3% context | last: 5.2k in / 1.1k out
add this to .claude/settings.json
"statusLine": { "type": "command", "command": "jq -r '\"[\\(.model.display_name)] \\(.context_window.used_percentage // 0)% context | last: \\(((.context_window.current_usage.input_tokens // 0) / 1000 * 10 | floor / 10))k in / \\(((.context_window.current_usage.output_tokens // 0) / 1000 * 10 | floor / 10))k out\"'" }
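For readers who find the jq one-liner hard to parse, here is a rough Python equivalent of its formatting logic. The payload shape is inferred only from the paths the jq filter reads (`model.display_name`, `context_window.used_percentage`, `context_window.current_usage.*`), so treat it as an assumption rather than a documented schema:

```python
def to_k(tokens) -> float:
    """Convert a token count to thousands, truncated to one decimal place."""
    return (int(tokens or 0) // 100) / 10


def status_line(payload: dict) -> str:
    """Build the status line from the JSON fields that the jq filter reads."""
    model = (payload.get("model") or {}).get("display_name")
    window = payload.get("context_window") or {}
    usage = window.get("current_usage") or {}
    used = window.get("used_percentage") or 0
    return (
        f"[{model}] {used}% context | last: "
        f"{to_k(usage.get('input_tokens'))}k in / "
        f"{to_k(usage.get('output_tokens'))}k out"
    )


# Hypothetical payload matching the sample output shown in the comment.
sample = {
    "model": {"display_name": "Opus 4.6"},
    "context_window": {
        "used_percentage": 3,
        "current_usage": {"input_tokens": 5234, "output_tokens": 1100},
    },
}
print(status_line(sample))
```

Missing fields fall back to 0, mirroring the `// 0` defaults in the jq expression. Small formatting differences are possible (for example, Python prints `0.0k` where jq would print `0k`).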
couchdb_ouchdb - 27046 seconds ago
Comments here overall do not reflect my experience -- I'm puzzled how the vast majority are using this technology day to day. 4.7 is absolute fire and an upgrade on 4.6.
autoconfig - 28832 seconds ago
My initial experience with Opus 4.7 has been pretty bad and I'm sticking to Codex. But these results are meaningless without comparing outcomes. Whether the extra token burn is bad or not depends on whether it improves some quality / task completion metric. Am I missing something?
Frannky - 2492 seconds ago
My subscription was up for renewal today. I gave it a shot with OpenCode Go + Xiaomi model. So far, so good—I can get stuff done the same way it seems.
nickvec - 2447 seconds ago
For all intents and purposes, aren't the "token change" and "cost change" metrics effectively the same thing?
templar_snow - 29748 seconds ago
Brutal. I've been noticing that 4.7 eats my Max Subscription like crazy even when I do my best to hand tasks off to Sonnet 4.6 Medium and Haiku (or tell 4.7 to use them as subagents). Would love to know if anybody's found ideal token-saving approaches.
anabranch - 34503 seconds ago
I wanted to better understand the potential impact of the tokenizer change from 4.6 to 4.7.
I'm surprised that it's 45%. Might go down (?) with longer context answers but still surprising. It can be more than 2x for small prompts.
tailscaler2026 - 30875 seconds ago
Subsidies don't last forever.
KellyCriterion - 28975 seconds ago
Yesterday, I killed my weekly limit with just three prompts and went into extra usage for ~18 USD on top.
throwatdem12311 - 24633 seconds ago
Price is now getting to be more in line with the actual cost. The models are dumber, slower and more expensive than what we’ve been paying up until now. OpenAI will do it too, maybe a bit less to avoid pissing people off after seeing backlash to Anthropic’s move here. Or maybe they won’t make it dumber but they’ll increase the price while making a dumber mode the baseline so you’re encouraged to pay more. Free ride is over. Hope you have 30k burning a hole in your pocket to buy a beefy machine to run your own model. I hear Mac Studios are good for local inference.
fathermarz - 25341 seconds ago
I have been seeing this messaging everywhere and I have not noticed this. I have had the inverse with 4.7 over 4.6.
I think people aren’t reading the system cards when they come out. They explicitly explain your workflow needs to change. They added more levels of effort and I see no mention of that in this post.
Did y’all forget Opus 4? That was not that long ago, and Claude was essentially unusable then. We are at peak wizardry right now and no one is talking positively. It’s all doom and gloom around here these days.
atleastoptimal - 19545 seconds ago
The whole version naming for models is very misleading. 4 and 4.1 seem to come from a different "line" than 4.5 and 4.6, and likewise 4.7 seems like a new shape of model altogether. They aren't linear stepwise improvements, but I think overall 4.7 is generally "smarter" just based on conversational ability.
jimkleiber - 28011 seconds ago
I wonder if this is like when a restaurant introduces a new menu to increase prices.
Is Opus 4.7 that significantly different in quality that it should use that much more in tokens?
I like Claude and Anthropic a lot, and hope it's just some weird quirk in their tokenizer or whatnot, just seems like something changed in the last few weeks and may be going in a less-value-for-money direction, with not much being said about it. But again, could just be some technical glitch.
napolux - 28586 seconds ago
Token consumption is huge compared to 4.6 even for smaller tasks. Just by "reasoning" after my first prompt this morning I went over 50% of the 5-hour quota.
bobjordan - 28399 seconds ago
I've spent the past 4+ months building an internal multi-agent orchestrator for coding teams. Agents communicate through a coordination protocol we built, and all inter-agent messages plus runtime metrics are logged to a database.
Our default topology is a two-agent pair: one implementer and one reviewer. In practice, that usually means Opus writing code and Codex reviewing it.
I just finished a 10-hour run with 5 of these teams in parallel, plus a Codex run manager. Total swarm: 5 Opus 4.7 agents and 6 Codex/GPT-5.4 agents.
Opus was launched with:
`export CLAUDE_AUTOCOMPACT_PCT_OVERRIDE=35; claude --dangerously-skip-permissions --model 'claude-opus-4-7[1M]' --effort high --thinking-display summarized`
Codex was launched with:
`codex --dangerously-bypass-approvals-and-sandbox --profile gpt-5-4-high`
What surprised me was usage: after 10 hours, both my Claude Code account and my Codex account had consumed 28% of their weekly capacity from that single run.
I expected Claude Code usage to be much higher. Instead, on these settings and for this workload, both platforms burned the same share of weekly budget.
So from this datapoint alone, I do not see an obvious usage-efficiency advantage in switching from Opus 4.7 to Codex/GPT-5.4.
ausbah - 31069 seconds ago
is it really unthinkable that another oss/local model will be released by deepseek, alibaba, or even meta that once again gives these companies a run for their money?
gck1 - 19380 seconds ago
Anthropic is playing a strange game. It's almost like they want you to cancel the subscription if you're an active user and only subscribe if you only use it once per month to ask what the weather in Berlin is.
First they introduce a policy to ban third party clients, but the way it's written, it affects claude -p too, and 3 months later, it's still confusing with no clarification.
Then they hide model's thinking, introduce a new flag which will still show summaries of thinking, which they break again in the next release, with a new flag.
Then they silently cut the usage limits to the point where the exact same usage that you're used to consumes 40% of your weekly quota in 5 hours, and not only do they stay silent for an entire 2 weeks, they actively gaslight users saying they didn't change anything, only to announce later that they did, indeed, change the limits.
Then they serve a lobotomized model for an entire week before they drop 4.7, again, gaslighting users that they didn't do that.
And then this.
Anthropic has lost all credibility at this point and I will not be renewing my subscription. If they can't provide services under a price point, just increase the price or don't provide them.
EDIT: forgot "adaptive thinking", so add that too. Which essentially means "we decide when we can allocate resources for thinking tokens based on our capacity, or in other words - never".
razodactyl - 29259 seconds ago
If anyone's had 4.7 update any documents so far - notice how concise it is at getting straight to the point. It rewrote some of my existing documentation (using Windsurf as the harness), not sure I liked the decrease in verbosity (removed columns and combined / compressed concepts) but it makes sense in respect to the model outputting less to save cost.
To me this seems more that it's trained to be concise by default which I guess can be countered with preference instructions if required.
What's interesting to me is that they're using a new tokeniser. Does it mean they trained a new model from scratch? Used an existing model and further trained it with a swapped out tokeniser?
The looped model research / speculation is also quite interesting - if done right there's significant speed up / resource savings.
ianberdin - 21505 seconds ago
Opus 4.6 is the main model on https://playcode.io.
Not a secret, the model is the best in the world. Yet it is crazy expensive and this 35% is huge for us. $10,000 becomes $13,500. Don’t forget, Anthropic's tokenizer also shows way more tokens than other providers.
We have experimented a lot with GLM 5.1. It is kinda close, but with downsides: no images, max 100K adequate context size and poor text writing. However, a great designer. So there is no replacement. We pray.
coldtea - 31444 seconds ago
This, the push towards per-token API charging, and the rest are just a sign of things to come when they finally establish a moat and full monopoly/duopoly, which is also what all the specialized tools like Designer and integrations are about.
It's going to be a very expensive game, and the masses will be left with subpar local versions. It would be like if we reversed the democratization of compilers and coding tooling, done in the 90s and 00s, and the polished more capable tools are again all proprietary.
monkpit - 27722 seconds ago
Does this have anything to do with the default xhigh effort?
BrianneLee011 - 13673 seconds ago
We should clarify 'Scaling up' here. Does higher token consumption actually correlate with better accuracy, or are we just increasing overhead?
QuadrupleA - 26228 seconds ago
One thing I don't see often mentioned - OpenAI API's auto token caching approach results in MASSIVE cost savings on agent stuff. Anthropic's deliberate caching is a pain in comparison. Wish they'd just keep the KV cache hot for 60 seconds or so, so we don't have to pay the input costs over and over again, for every growing conversation turn.
aray07 - 28131 seconds ago
Came to a similar conclusion after running a bunch of tests on the new tokenizer.
It was on the higher end of Anthropic's range - closer to 30-40% more tokens
https://www.claudecodecamp.com/p/i-measured-claude-4-7-s-new...
alphabettsy - 27214 seconds ago
I’m trying to understand how this is useful information on its own?
Maybe I missed it, but it doesn’t tell you if it’s more successful for less overall cost?
I can easily make Sonnet 4.6 cost way more than any Opus model because while it’s cheaper per prompt it might take 10x more rounds to solve a problem (or never solve it).
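A toy calculation of that point, with entirely hypothetical per-round costs and round counts (nothing here is measured):

```python
def cost_per_solved_task(cost_per_round: float, rounds_needed: int) -> float:
    """Total spend to get one task solved, given a flat cost per round."""
    if rounds_needed <= 0:
        raise ValueError("rounds_needed must be positive")
    return cost_per_round * rounds_needed


# Hypothetical: the cheaper model costs $0.10 per round but needs 10 rounds,
# the pricier model costs $0.50 per round and solves it in one.
cheap_model = cost_per_solved_task(0.10, 10)
pricey_model = cost_per_solved_task(0.50, 1)
print(f"cheap: ${cheap_model:.2f}  pricey: ${pricey_model:.2f}")
```

Under these assumed numbers the model that is cheaper per round ends up costing twice as much per solved task, which is why per-token comparisons alone are hard to interpret.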
ivanfioravanti - 27323 seconds ago
Probably due to the new tokenizer: https://www.claudecodecamp.com/p/i-measured-claude-4-7-s-new...
nmeofthestate - 24891 seconds ago
Is this a weird way of saying Opus got "cheaper" somehow from 4.6 to 4.7?
ben8bit - 30200 seconds ago
Makes me think the model could actually not even be smarter necessarily, just more token dependent.
l5870uoo9y - 30180 seconds ago
My impression is that the reverse is true when upgrading to GPT-5.4 from GPT-5; it uses fewer tokens(?).
silverwind - 29374 seconds ago
Still worth it imho for important code, but it shows that they are hitting a ceiling while trying to improve the model which they try to solve by making it more token-inefficient.
blahblaher - 29822 seconds ago
Conspiracy time: they released a new version just so they could increase the price so that people wouldn't complain so much (along the lines of "see, this is a new model version, so we NEED to increase the price"), similar to how SaaS companies tack on some shit to the product so that they can increase prices.
eezing - 24383 seconds ago
Not sure if this equates to more spend. Smarter models make fewer mistakes and thus fewer round trips.
cooldk - 21854 seconds ago
Anthropic may have its biases, but its product is undeniably excellent.
axeldunkel - 28942 seconds ago
the better the tokenizer maps text to its internal representation, the better the understanding of the model what you are saying - or coding! But 4.7 is much more verbose in my experience, and this probably drives cost/limits a lot.
erelong - 15141 seconds ago
was shocked to see phone verification roll out like last month as well... yikes
Shailendra_S - 30372 seconds ago
45% is brutal if you're building on top of these models as a bootstrapped founder. The unit economics just don't work anymore at that price point for most indie products.
What I've been doing is running a dual-model setup — use the cheaper/faster model for the heavy lifting where quality variance doesn't matter much, and only route to the expensive one when the output is customer-facing and quality is non-negotiable. Cuts costs significantly without the user noticing any difference.
The real risk is that pricing like this pushes smaller builders toward open models or Chinese labs like Qwen, which I suspect isn't what Anthropic wants long term.
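A minimal sketch of the dual-model routing described in this comment. The model names and the `customer_facing` flag are placeholders invented for illustration; a real setup would pass the chosen name to whatever client it already uses:

```python
from dataclasses import dataclass

# Placeholder identifiers, not real model names.
CHEAP_MODEL = "cheap-fast-model"
PREMIUM_MODEL = "premium-model"


@dataclass(frozen=True)
class Task:
    prompt: str
    customer_facing: bool  # True when output quality is non-negotiable


def pick_model(task: Task) -> str:
    """Route customer-facing work to the premium model, everything else to the cheap one."""
    return PREMIUM_MODEL if task.customer_facing else CHEAP_MODEL


tasks = [
    Task("Classify these support tickets", customer_facing=False),
    Task("Draft the reply the customer will read", customer_facing=True),
]
for task in tasks:
    print(pick_model(task), "<-", task.prompt)
```

Keeping the routing rule in one small function makes it easy to swap the flag for a richer policy later, such as a quality score threshold.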
dackdel - 29650 seconds ago
releases 4.8 and deletes everything else. and now 4.8 costs 500% more than 4.7. i wonder what it would take for people to start using kimi or qwen or other such.
justindotdev - 31459 seconds ago
i think it is quite clear that staying with opus 4.6 is the way to go, on top of the inflation, 4.7 is quite... dumb. i think they have lobotomized this model while they were prioritizing cybersecurity and blocking people from performing potentially harmful security related tasks.
ai_slop_hater - 31443 seconds ago
Does anyone know what changed in the tokenizer? Does it output multiple tokens for things that were previously one token?
gverrilla - 23767 seconds ago
Yeah I'm seriously considering dropping my Max subscription, unless they do something in the next few days - something like dropping Sonnet 4.7 cheap and powerful.
varispeed - 26357 seconds ago
I spent one day with Opus 4.7 to fix a bug. It just ran in circles despite having the problem "in front of its eyes" with all supporting data, thorough description of the system, test harness that reproduces the bug etc. While I still believe 4.7 is much "smarter" than GPT-5.4 I decided to give it a go. It was giving me dumb answers and going off the rails. After accusing it many times of being a fraud and doing it on purpose so that I spend more money, it fixed the bug in one shot.
Having a taste of unnerfed Opus 4.6 I think that they have a conflict of interest - if they let models give the right answer first time, person will spend less time with it, spend less money, but if they make model artificially dumber (progressive reasoning if you will), people get frustrated but will spend more money.
It is likely happening because economics doesn't work. Running comparable model at comparable speed for an individual is prohibitively expensive. Now scale that to millions of users - something gotta give.
DeathArrow - 26565 seconds ago
We (my wallet and I) are pretty happy with GLM 5.1 and MiniMax 2.7.
micromacrofoot - 30130 seconds ago
The latest qwen actually performs a little better for some tasks, in my experience
latest claude still fails the car wash test
QuadrupleA - 26098 seconds ago
Definitely seems like AI money got tight the last month or two - that the free beer is running out and enshittification has begun.
fny - 30973 seconds ago
I'm going to suggest what's going on here is Hanlon's Razor for models: "Never attribute to malice that which is adequately explained by a model's stupidity."
In my opinion, we've reached some ceiling where more tokens lead only to incremental improvements. A conspiracy seems unlikely given all providers are still competing for customers and a 50% token increase drives infra costs up dramatically too.
mvkel - 29944 seconds ago
The cope is real with this model. Needing an instruction manual to learn how to prompt it "properly" is a glaring regression.
The whole magic of (pre-nerfed) 4.6 was how it magically seemed to understand what I wanted, regardless of how perfectly I articulated it.
Now, Anth says that needing to explicitly define instructions is a "feature"?!
bparsons - 28161 seconds ago
Had a pretty heavy workload yesterday, and never hit the limit on claude code. Perhaps they allowed for more tokens for the launch?
Claude design on the other hand seemed to eat through (its own separate usage limit) very fast. Hit the limit this morning in about 45 mins on a max plan. I assume they are going to end up spinning that product off as a separate service.
therobots927 - 31666 seconds ago
Wow this is pretty spectacular. And with the losses anthro and OAI are running, don’t expect this trend to change. You will get incremental output improvements for a dramatically more expensive subscription plan.
alekseyrozh - 25642 seconds ago
Is it just me? I don't feel a difference between 4.6 and 4.7
kziad - 9205 seconds ago
[dead]
chandureddyvari - 28663 seconds ago
[dead]
jeremie_strand - 25918 seconds ago
[dead]
kuzivaai - 23909 seconds ago
[dead]
matt3210 - 30624 seconds ago
[flagged]
monkeydust - 29694 seconds ago
[flagged]