Anonymous request-token comparisons from Opus 4.6 and Opus 4.7

Posted by anabranch 9 hours ago

Anonymous request-token comparisons from Opus 4.6 and Opus 4.7 (tokens.billchambers.me)

420 points | 427 comments

andai 6 hours ago|

For a fair comparison you need to look at the total cost, because 4.7 produces significantly fewer output tokens than 4.6, and seems to cost significantly less on the reasoning side as well.

Here is a comparison for 4.5, 4.6 and 4.7 (Output Tokens section):

https://artificialanalysis.ai/?models=claude-opus-4-7%2Cclau...

4.7 comes out slightly cheaper than 4.6. But 4.5 is about half the cost:

https://artificialanalysis.ai/?models=claude-opus-4-7%2Cclau...

Notably the cost of reasoning has been cut almost in half from 4.6 to 4.7.

I'm not sure what that looks like for most people's workloads, i.e. what the cost breakdown looks like for Claude Code. I expect it's heavy on both input and reasoning, so I don't know how that balances out, now that input is more expensive and reasoning is cheaper.

On reasoning-heavy tasks, it might be cheaper. On tasks which don't require much reasoning, it's probably more expensive. (But for those, I would use Codex anyway ;)

matheusmoreira 4 hours ago||

It thinks less and produces fewer output tokens because it has forced adaptive thinking that even API users can't disable. Same adaptive thinking that was causing quality issues in Opus 4.6 not even two weeks ago. The one bcherny recommended that people disable because it'd sometimes allocate zero thinking tokens to the model.

https://news.ycombinator.com/item?id=47668520

People are already complaining about low quality results with Opus 4.7. I'm also spotting it making really basic mistakes.

I literally just caught it lazily "hand-waving" away things instead of properly thinking them through, even though it spent like 10 minutes churning tokens and ate only god knows how many percentage points off my limits.

> What's the difference between this and option 1.(a) presented before?

> Honestly? Barely any. Option M is option 1.(a) with the lifecycle actually worked out instead of hand-waved.

> Why are you handwaving things away though? I've got you on max effort. I even patched the system prompts to reduce this.

> Fair call. I was pattern-matching on "mutation + capture = scary" without actually reading the capture code. Let me do the work properly.

> You were right to push back. I was wrong. Let me actually trace it properly this time.

> My concern from the first pass was right. The second pass was me talking myself out of it with a bad trace.

It's just a constant stream of self-corrections and doubts. Opus simply cannot be trusted when adaptive thinking is enabled.

Can provide session feedback IDs if needed.

codethief 2 hours ago|||

> > Why are you handwaving things away though? I've got you on max effort. I even patched the system prompts to reduce this.

In my experience, prompts like this one, which 1) ask for a reason behind an answer (when the model won't actually be able to provide one), 2) are somewhat standoff-ish, don't work well at all. You'll just have the model go the other way.

What works much better is to tell the model to take a step back and re-evaluate. Sometimes it also helps to explicitly ask it to look at things from a different angle XYZ, in other words, to add some entropy to get it away from the local optimum it's currently at.

mrandish 23 minutes ago|||

> when the model won't actually be able to provide one

This is key. In my experience, asking an LLM why it did something is usually pointless. In a subsequent round, it generally can't meaningfully introspect on its prior internal state, so it's just referring to the session transcript and extrapolating a plausible sounding answer based on its training data of how LLMs typically work.

That doesn't necessarily mean the reply is wrong because, as usual, a statistically plausible sounding answer sometimes also happens to be correct, but it has no fundamental truth value. I've gotten equally plausible answers just pasting the same session transcript into another LLM and asking why it did that.

matheusmoreira 1 hour ago||||

That's good advice. I managed to get the session back on track by doing that a few turns later. I started making it very explicit that I wanted it to really think things through. It kept asking me for permission to do things, I had to explicitly prompt it to trace through and resolve every single edge case it ran into, but it seems to be doing better now. It's running a lot of adversarial tests right now and the results at least seem to be more thorough and acceptable. It's gonna take a while to fully review the output though.

It's just that Opus 4.6 with CLAUDE_CODE_DISABLE_ADAPTIVE_THINKING=1 doesn't seem to require me to do this at all, or at least not as often. It'd fully explore the code and take into account all the edge cases and caveats without any explicit prompting from me. It's a really frustrating experience to watch Anthropic's flagship subscription-only model burn my tokens only to end up lazily hand-waving away hard questions unless I explicitly tell it not to do that.

I have to give it to Opus 4.7 though: it recovered much better than 4.6.

j-bos 27 minutes ago||||

Yeah, for anyone seriously using these models I highly recommend reading the Mythos system card, esp. the sections on analyzing its internal non-verbalized states. It'll save a lot of head-wall banging.

nelox 46 minutes ago|||

Precisely. I find Grok’s multi-agent approach very useful here. I have a custom agent configured as a validator.

rectang 2 hours ago||||

Are the benchmarks being used to measure these models biased towards completing huge and highly complex tasks, rather than ensuring correctness for less complex tasks?

It seems like they're working hard to prioritize wrapping their arms around huge contexts, as opposed to handling small tasks with precision. I prefer to limit the context and the scope of the task and focus on trying to get everything right in incremental steps.

matheusmoreira 2 hours ago||

I don't think there's a bias here. I'd say my task is of somewhat high complexity. I'm using Claude to assist me in implementing exceptions in my programming language. It's a SICP chapter 5.4 level task. There are quite a few moving parts in this thing. Opus 4.6 once went around in circles for half an hour trying to trace my interpreter's evaluator. As a human, it's not an easy task for me to do either.

I think the problem just comes down to adaptive thinking allowing the model to choose how much effort it spends on things, a power which it promptly abuses to be as lazy as possible. CLAUDE_CODE_DISABLE_ADAPTIVE_THINKING=1 significantly improved Opus 4.6's behavior and the quality of its results. But then what do they do when they release 4.7?

https://code.claude.com/docs/en/model-config

> Opus 4.7 always uses adaptive reasoning.

> The fixed thinking budget mode and CLAUDE_CODE_DISABLE_ADAPTIVE_THINKING do not apply to it.

what 23 minutes ago|||

> Why are you handwaving things away though? I've got you on max effort. I even patched the system prompts to reduce this.

Do you think it knows what max effort or patched system prompts are? It feels really weird to talk to an LLM like it’s a person that understands.

QuantumGood 3 hours ago||

Some have defined "fair" as tests of the same model at different times, as the behavior and token usage of a model changes despite the version number remaining the same. So testing model numbers at different times matters, unfortunately, and that means recent tests might not be what you would want to compare to future tests.

hgoel 7 hours ago||

The bump from 4.6 to 4.7 is not very noticeable to me in improved capabilities so far, but the faster consumption of limits is very noticeable.

I hit my 5 hour limit within 2 hours yesterday. Initially I was trying the batched mode for a refactor but cancelled after seeing it take 30% of the limit within 5 minutes. I then tried a serial approach, which consumed less (took ~50 minutes, xhigh effort, ~60% of the remaining allocation IIRC), but still very clearly consumed much faster than with 4.6.

It feels like every exchange takes ~5% of the 5 hour limit now, when it used to be maybe ~1-2%. For reference I'm on the Max 5x plan.

For now I can tolerate it since I still have plenty of headroom in my limits (used ~5% of my weekly, I don't use claude heavily every day so this is OK), but I hope they either offer more clarity on this or improve the situation. The effort setting is still a bit too opaque to really help.

matheusmoreira 2 hours ago||

The most frustrating part is the quality loss caused by the forced adaptive thinking. It eats 5-10% of my Max 5x usage and churns for ten minutes, only to come back with totally untrustworthy results. It lazily hand-waves issues away in order to avoid reading my actual code and doing real reasoning work on it. Opus simply cannot be trusted if adaptive thinking is enabled.

_blk 7 hours ago|||

From what I understand you shouldn't wait more than 5min between prompts without compacting or clearing or you'll pay for reinitializing the cache. With compaction you still pay but it's less input tokens. (Is compaction itself free?)
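A rough sketch of the effect described above, to make the idle-gap cost concrete. Everything numeric here is an assumption for illustration: the base price, the cache-read and cache-write multipliers, and the simplification that a hit reads the whole context from cache while a miss re-writes all of it. The function name `turn_input_cost` is hypothetical.

```python
def turn_input_cost(context_tokens, idle_seconds, ttl_seconds=300,
                    base_price_per_mtok=5.0, read_mult=0.1, write_mult=1.25):
    """Hypothetical input cost (in dollars) of one turn under a TTL prompt cache.

    Assumptions, not confirmed billing rules:
    - if the previous prompt was sent within the TTL, the whole context is
      billed at the cheaper cache-read rate;
    - otherwise the whole context is billed at the cache-write rate.
    """
    cache_hit = idle_seconds <= ttl_seconds
    multiplier = read_mult if cache_hit else write_mult
    return context_tokens / 1_000_000 * base_price_per_mtok * multiplier


if __name__ == "__main__":
    # Same 200k-token context, prompted after a 1-minute and a 10-minute gap.
    print(turn_input_cost(200_000, idle_seconds=60))
    print(turn_input_cost(200_000, idle_seconds=600))
    # The same 10-minute gap with an assumed 1-hour TTL stays a cache hit.
    print(turn_input_cost(200_000, idle_seconds=600, ttl_seconds=3600))
```

Under these placeholder numbers the only thing that changes between the calls is whether the gap exceeds the TTL, which is why the TTL setting discussed in the replies matters so much for long contexts.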

krackers 3 hours ago|||

>pay for reinitializing the cache

Why can't they save the kv cache to disk then later reload it to memory?

stavros 1 hour ago||

Probably because the costly operation is loading it onto the GPU, doesn't matter if it's from disk or from your request.

zozbot234 1 hour ago||

The point of prompt caching is to save on prefill which for large contexts (common for agentic workloads) is quite expensive per token. So there is a context length where storing that KV-cache is worth it, because loading it back in is more efficient than recomputing it. For larger SOTA models, the KV cache unit size is also much smaller compared to the compute cost of prefill, so caching becomes worthwhile even for smaller context.

gck1 5 hours ago||||

Cache ttl on max subscriptions is 1h, FYI.

bashtoni 3 hours ago|||

Only if you set `ENABLE_PROMPT_CACHING_1H`, which was mentioned in the release notes for a recent Claude Code release but doesn't seem to be in the official docs.

g4cg54g54 2 hours ago|||

Subscription users supposedly get it automatically again after the fix (and now also with `DISABLE_TELEMETRY=1`)

but if you are an API user you must set `ENABLE_PROMPT_CACHING_1H`, as I understood

and when using your own api (via `ANTHROPIC_BASE_URL`) ensure `CLAUDE_CODE_ATTRIBUTION_HEADER=0` is set as well... https://github.com/anthropics/claude-code/issues/50085

and check out the other neckbreakers I've found; the whole thing feels like lots of malicious compliance... :/

[BUG] new sessions will *never* hit a (full)cache #47098 https://github.com/anthropics/claude-code/issues/47098

[BUG] /clear bleeds into the next session (what also breaks cache) #47756 https://github.com/anthropics/claude-code/issues/47756

[BUG] uncachable system prompt caused by includeGitInstructions / CLAUDE_CODE_DISABLE_GIT_INSTRUCTIONS -> git status https://github.com/anthropics/claude-code/issues/47107

andersa 3 hours ago|||

Bruh. It's getting hard to track down all these MAKE_IT_ACTUALLY_WORK settings that default to off for no reason.

_blk 4 hours ago|||

That'd be awesome but it doesn't reflect what I see. Do you have a source for that? What I see is if I take a quick break the session loses ~5% right at the start of the next prompt processing. (I'm currently on max 5x)

gck1 4 hours ago|||

Not at my workstation right now, but simply ask claude to analyze jsonl transcript of any session, there are two cache keys there, one is 5m, another 1h. Only 1h gets set. There are also some entries there that will tell you if request was a cache hit or miss, or if cache rewrite happened. I've had claude test another claude and on max 5x subscription, cache miss only happened if message was sent after 1h, or if session was resumed using /resume or --resume (this is a bug that exists since January - all session resumes will cause a full cache rewrite).
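For anyone who would rather check this without asking Claude, here is a minimal sketch of the same transcript analysis. The field names (`message.usage`, `cache_read_input_tokens`, `cache_creation.ephemeral_5m_input_tokens`, `cache_creation.ephemeral_1h_input_tokens`) are assumptions modeled on the usage object of Anthropic's Messages API; the actual session transcript layout may differ, so treat the keys as placeholders to adjust.

```python
import json


def summarize_cache_usage(jsonl_lines):
    """Sum cache read/write token counts over the lines of a JSONL transcript.

    The key names below are assumed, not taken from official transcript docs.
    Lines that are blank, not valid JSON, or lack a usage object are skipped.
    """
    totals = {"cache_read": 0, "cache_write_5m": 0, "cache_write_1h": 0}
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue
        if not isinstance(entry, dict):
            continue
        message = entry.get("message")
        usage = message.get("usage") if isinstance(message, dict) else None
        if not isinstance(usage, dict):
            continue
        totals["cache_read"] += usage.get("cache_read_input_tokens") or 0
        creation = usage.get("cache_creation")
        if isinstance(creation, dict):
            totals["cache_write_5m"] += creation.get("ephemeral_5m_input_tokens") or 0
            totals["cache_write_1h"] += creation.get("ephemeral_1h_input_tokens") or 0
    return totals
```

Usage would be along the lines of `summarize_cache_usage(open(path, encoding="utf-8"))` with `path` pointing at a session's JSONL file. A non-zero `cache_write_5m` total would contradict the "only 1h gets set" observation for that session.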

However, cache being hit doesn't necessarily mean Anthropic won't just subtract usage from you as if it wasn't hit. It's Anthropic we're talking about. They can do whatever they want with your usage and then blame you for it.

Fabricio20 4 hours ago||||

I have heard that if you have telemetry disabled the cache is 5 minutes, otherwise 1h. No clue how true that is; however, my experience (with telemetry enabled) has been the 1h cache.

HarHarVeryFunny 4 hours ago||

They've acknowledged that as a bug and have fixed it.

ethanj8011 4 hours ago|||

It's true as far as I can tell, just by my own checking using `/status`. You can also tell by when the "clear" reminder hint shows up. Also if you look at the leaked claude code you can see that almost everything in the main thread is cached with 1H TTL (I believe subagents use 5 minute TTL)

conception 6 hours ago||||

Yeah the caching change is probably 90% of “i run out of usage so fast now!” Issues.

hgoel 7 hours ago||||

Ah I can see how my phrasing might be misleading, but these prompts were made within 5 minutes of each other; the timings I mentioned were what Claude spent working.

trueno 5 hours ago|||

is it 5 mins between constant prompting/work or 5 mins as in if i step away from the comp for 5 mins and come back and prompt again im not subject to reinit?

if it's the latter that's crazy. i dont even know what to do there, compactions already feel like a memory wipe

viktorianer 3 hours ago||

[dead]

vicchenai 6 minutes ago||

ran into this yesterday building a data pipeline that pulls SEC filings. same prompt, same context window, 4.7 chewed through noticeably more of my api budget than 4.6 did. the output wasnt obviously better either, just... more expensive.

what bugs me is the tokenizer change feels like a stealth price hike. if you're charging the same $/token but the same text now costs 35% more tokens, thats just a 35% price increase with extra steps. at least be upfront about it.
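The arithmetic behind that claim can be written out in a few lines. This is only a sketch of the reasoning in the comment: the 35% figure is the commenter's, and `effective_price_increase` is a hypothetical helper name.

```python
def effective_price_increase(token_inflation, old_price_per_token=1.0,
                             new_price_per_token=1.0):
    """Fractional change in cost for the *same text* after a tokenizer change.

    token_inflation is the fractional growth in token count for identical
    text (0.35 means 35% more tokens). Prices are in arbitrary units.
    """
    old_cost = old_price_per_token * 1.0
    new_cost = new_price_per_token * (1.0 + token_inflation)
    return new_cost / old_cost - 1.0


if __name__ == "__main__":
    # Same $/token, 35% more tokens for the same text.
    print(f"{effective_price_increase(0.35):.0%}")
```

With an unchanged per-token price the cost increase equals the token inflation exactly; this covers input text only and says nothing about how many output tokens the newer model produces.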

glerk 7 hours ago||

I'd be ok with paying more if results were good, but it seems like Anthropic is going for the Tinder/casino intermittent reinforcement strategy: optimized to keep you spending tokens instead of achieving results.

And yes, Claude models are generally more fun to use than GPT/Codex. They have a personality. They have an intuition for design/aesthetics. Vibe-coding with them feels like playing a video game. But the result is almost always some version of cutting corners: tests removed to make the suite pass, duplicate code everywhere, wrong abstraction, type safety disabled, hard requirements ignored, etc.

These issues are not resolved in 4.7, no matter what the benchmarks say, and I don't think there is any interest in resolving them.

Bridged7756 6 hours ago||

Mirrors my sentiment. Those tools seem mostly useful for a Google alternative, scaffolding tedious things, code reviewing, and acting as a fancy search.

It seems that they got a grip on the "coding LLM" market and now they're starting to seek actual profit. I predict we'll keep seeing 40%+ more expensive models for a marginal performance gain from now on.

danny_codes 6 hours ago|||

I just don’t see how they’ll be able to make a profit. Open models have the same performance on coding tasks now. The incentives are all wrong. Why pay more for a model that’s no better and also isn’t open? It’s nonsense

Bridged7756 3 hours ago|||

I wouldn't say the same but it's pretty close. At this point I'm convinced that they'll continue running the marketing machine and people due to FOMO will keep hopping onto whatever model anthropic releases.

braebo 3 hours ago||||

Which open model has the same performance as Opus 4.7?

3dfd 2 hours ago||

[dead]

3dfd 2 hours ago|||

[dead]

djeastm 1 hour ago||

I think that's precisely why they're paying thousands of people in those other jobs to perform their tasks while collecting new data. Software was easiest because it's already mostly written down, but other jobs can be quantized with enough data points. Just give it time

holoduke 4 hours ago|||

You have to guide an AI, not let it roam freely. If you've got the skills to guide it, you can make it output high quality.

xpe 6 hours ago|||

> ... but it seems like Anthropic is going for the Tinder/casino intermittent reinforcement strategy: optimized to keep you spending tokens instead of achieving results.

This part of the above comment strikes me as uncharitable and overconfident. And, to be blunt, presumptuous. To claim to know a company's strategy as an outsider is messy stuff.

My prior: it is 10X to 20X more likely that Anthropic has done something other than shift to a short-term squeeze-their-customers strategy (which I think is only around ~5%).

What do I mean by "something other"? (1) One possibility is they are having capacity and/or infrastructure problems so the model performance is degraded. (2) Another possibility is that they are not as tuned to what customers want relative to what their engineers want. (3) It is also possible they have slowed their models down due to safety concerns. To be more specific, they are erring on the side of caution (which would be consistent with their press releases about safety concerns of Mythos). Also, the above three possibilities are not mutually exclusive.

I don't expect us (readers here) to agree on the probabilities down to the ±5% level, but I would think a large chunk of informed and reasonable people can probably converge to something close to ±20%. At the very least, can we agree all of these factors are strong contenders: each covers maybe at least 10% to 30% of the probability space?

How short-sighted, dumb, or back-against-the-wall would Anthropic have to be to shift to a "let's make our new models intentionally _worse_ than our previous ones?" strategy? Think on this. I'm not necessarily "pro" Anthropic. They could lose standing with me over time, for sure. I'm willing to think it through. What would the world have to look like for this to be the case?

There are other factors that push back against claims of a "short-term greedy strategy" argument. Most importantly, they aren't stupid; they know customers care about quality. They are playing a longer game than that.

Yes, I understand that Opus 4.7 is not impressing people or worse. I feel similarly based on my "feels", but I also know I haven't run benchmarks nor have I used it very long.

I think most people viewed Opus 4.6 as a big step forward. People are somewhat conditioned to expect a newer model to be better, and Opus 4.7 doesn't match that expectation. I also know that I've been asking Claude to help me with Bayesian probabilistic modeling techniques that are well outside what I was doing a few weeks ago (detailed research and systems / software development), so it is just as likely that I'm pushing it outside its expertise.

glerk 6 hours ago||

> To claim to know a company's strategy as an outsider is messy stuff.

I said "it seems like". Obviously, I have no idea whether this is an intentional strategy or not and it could as well be a side effect of those things that you mentioned.

Models being "worse" is the perceived effect for the end user (subjectively, it seems like the price to achieve the same results on similar tasks with Opus has been steadily increasing). I am claiming that there is no incentive for Anthropic to address this issue because of their business model (maximize the amount of tokens spent and price per token).

3dfd 2 hours ago||

[dead]

kalkin 8 hours ago||

AFAICT this uses a token-counting API so that it counts how many tokens are in the prompt, in two ways, so it's measuring the tokenizer change in isolation. Smarter models also sometimes produce shorter outputs and therefore fewer output tokens. That doesn't mean Opus 4.7 necessarily nets out cheaper, it might still be more expensive, but this comparison isn't really very useful.

h14h 8 hours ago||

For some real data, Artificial Analysis reported that 4.6 (max) and 4.7 (max) used 160M tokens and 100M tokens to complete their benchmark suite, respectively:

https://artificialanalysis.ai/?intelligence-efficiency=intel...

Looking at their cost breakdown, while input cost rose by $800, output cost dropped by $1400. Granted whether output offsets input will be very use-case dependent, and I imagine the delta is a lot closer at lower effort levels.
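To make the "use-case dependent" part concrete, here is a small sketch. Only the $800 and $1400 deltas come from the comment above; the scale factors in the usage note are placeholders, and both function names are hypothetical.

```python
def relative_cost(input_share, input_scale, output_scale):
    """New bill as a multiple of the old bill.

    input_share:  fraction of the OLD bill that was input cost (0..1)
    input_scale:  factor by which input cost changes (e.g. 1.3 = 30% more)
    output_scale: factor by which output cost changes (e.g. 0.6 = 40% less)
    """
    return input_share * input_scale + (1.0 - input_share) * output_scale


def break_even_input_share(input_scale, output_scale):
    """Input share of the old bill at which the new bill equals the old one."""
    if input_scale == output_scale:
        raise ValueError("no single break-even point when both scales are equal")
    return (1.0 - output_scale) / (input_scale - output_scale)


# Net change for the benchmark run quoted above: +$800 input, -$1400 output.
reported_net_change = 800 + (-1400)
```

For example, with made-up scales of 1.3 for input and 0.6 for output, `break_even_input_share(1.3, 0.6)` is about 0.57: workloads whose old bill was more than roughly 57% input would get more expensive, and output-heavy ones would get cheaper.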

theptip 6 hours ago||

This is the right way of thinking end-to-end.

Tokenizer changes are one piece to understand for sure, but as you say, you need to evaluate $/task not $/token or #tokens/task alone.

manmal 8 hours ago|||

Why is it not useful? Input token pricing is the same for 4.7. The same prompt costs roughly 30% more now, for input.

dktp 8 hours ago|||

The idea is that smarter models might use fewer turns to accomplish the same task - reducing the overall token usage

Though, from my limited testing, the new model is far more token hungry overall

manmal 7 hours ago||

Well you‘ll need the same prompt for input tokens?

httgbgg 7 hours ago||

Only the first one. Ideally now there is no second prompt.

manmal 7 hours ago||

Are you aware that every tool call produces output which also counts as input to the LLM?
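A toy model of that point: in an agentic loop each request resends the accumulated context, so tool output is billed as input on every later turn. The sketch below ignores prompt caching and the model's own output being appended to the context, and `total_input_tokens` is a hypothetical name.

```python
def total_input_tokens(prompt_tokens, tool_output_tokens):
    """Total input tokens billed across an agentic loop (no caching).

    prompt_tokens:      size of the initial prompt
    tool_output_tokens: list with the token count of each tool result,
                        in the order the tool calls happen
    """
    context = prompt_tokens
    total = context  # the first request sends only the prompt
    for tokens in tool_output_tokens:
        context += tokens  # the tool result is appended to the context
        total += context   # the next request resends everything so far
    return total


if __name__ == "__main__":
    # One 1,000-token prompt followed by two 500-token tool results.
    print(total_input_tokens(1_000, [500, 500]))
```

Because every term in the sum scales with token count, a tokenizer that inflates both the prompt and the tool results by some factor inflates the total input by the same factor, not just the first prompt.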

kalkin 8 hours ago|||

That's valid, but it's also worth knowing it's only one part of the puzzle. The submission title doesn't say "input".

SkyPuncher 7 hours ago|||

Yes. I actually noticed my token usage go down on 4.6 when I started switching every session to max effort. I got work done faster with fewer steps because thinking corrected itself before it cycled.

I’ve noticed 4.7 cycling a lot more on basic tasks. Though, it also seems a bit better at holding long running context.

the_gipsy 7 hours ago||

With AIs, it seems like there never is a comparison that is useful.

theptip 5 hours ago|||

You can build evals. Look at Harbor or Inspect. It’s just more work than most are interested in doing right now.

jascha_eng 7 hours ago|||

yup it's all vibes. And anthropic is winning on those in my book still

rectang 8 hours ago||

For now, I'm planning to stick with Opus 4.5 as a driver in VSCode Copilot.

My workflow is to give the agent pretty fine-grained instructions, and I'm always fighting agents that insist on doing too much. Opus 4.5 is the best out of all agents I've tried at following the guidance to do only-what-is-needed-and-no-more.

Opus 4.6 takes longer, overthinks things and changes too much; the high-powered GPTs are similarly flawed. Other models such as Sonnet aren't nearly as good at discerning my intentions from less-than-perfectly-crafted prompts as Opus.

Eventually, I quit experimenting and just started using Opus 4.5 exclusively knowing this would all be different in a few months anyway. Opus cost more, but the value was there.

But now I see that 4.7 is going to replace both 4.5 and 4.6 in VSCode Copilot, and with a 7.5x modifier. Based on the description, this is going to be a price hike for slower performance — and if the 4.5 to 4.6 change is any guide, more overthinking targeted at long-running tasks, rather than fine-grained. For me, that seems like a step backwards.

axpy906 3 hours ago||

Why not just use Sonnet?

rectang 3 hours ago||

I've used Sonnet a lot. It is not as good as Opus at understanding what I'm asking for. I have to coach Sonnet more closely, taking more care to be precise in my prompts, and often building up Plan steps when I could just YOLO an Agent instruction at Opus and it would get it right.

I find that Opus is really good at discerning what I mean, even when I don't state it very clearly. Sonnet often doesn't quite get where I'm going and it sometimes builds things that don't make sense. Sonnet also occasionally makes outright mistakes, like not catching every location that needs to be changed; Opus makes nearly every code change flawlessly, as if it's thinking through "what could go wrong" like a good engineer would.

Sonnet is still better than older and/or less-capable models like GPT 4.1, Raptor mini (Preview), or GPT-5 mini, which all fail in the same way as Sonnet but more dramatically... but Opus is much better than Sonnet.

Recent full-powered GPTs (including the Codex variants) are competitive with Opus 4.6, but Opus 4.5 in particular is best in class for my workflow. I speculate that Opus 4.5 dedicates the most cycles out of all models to checking its work and ensuring correctness — as opposed to reaching for the skies to chase ambitious, highly complex coding tasks.

trueno 5 hours ago|||

> 4.7 is going to replace both 4.5 and 4.6

as in 4.5 is no longer going to be avail? F.

ive also been sticking with 4.5 that sucks

rectang 3 hours ago||

https://github.blog/changelog/2026-04-16-claude-opus-4-7-is-...

> Over the coming weeks, Opus 4.7 will replace Opus 4.5 and Opus 4.6 in the model picker for Copilot Pro+[...]

> This model is launching with a 7.5× premium request multiplier as part of promotional pricing until April 30th.

xstas1 54 minutes ago||

Promotional pricing? Are they saying that after the promotion, it will cost more than 7.5x??

benjiro3000 1 hour ago||

[dead]

gsleblanc 7 hours ago||

It's increasingly looking naive to assume scaling LLMs is all you need to get to full white-collar worker replacement. The attention mechanism / hopfield network is fundamentally modeling only a small subset of the full human brain, and all the increasing sustained hype around bolted-on solutions for "agentic memory" is, in my opinion, glaring evidence that these SOTA transformers alone aren't sufficient even when you just limit the space to text. Maybe I'm just parroting Yann LeCun.

ACCount37 6 hours ago||

You probably are.

The "small subset" argument is profoundly unconvincing, and inconsistent with both neurobiology of the human brain and the actual performance of LLMs.

The transformer architecture is incredibly universal and highly expressive. Transformers power LLMs, video generator models, audio generator models, SLAM models, entire VLAs and more. It's not a 1:1 copy of human brain, but that doesn't mean that it's incapable of reaching functional equivalence. Human brain isn't the only way to implement general intelligence - just the one that was the easiest for evolution to put together out of what it had.

LeCun's arguments about "LLMs can't do X" keep being proven wrong empirically. Even on ARC-AGI-3, which is a benchmark specifically designed to be adversarial to LLMs and target the weakest capabilities of off the shelf LLMs, there is no AI class that beats LLMs.

bigyabai 6 hours ago||

> Human brain isn't the only way to implement general intelligence - just the one that was the easiest for evolution to put together out of what it had.

The human brain is not a pretrained system. It's objectively more flexible than transformers and capable of self-modulation in ways that no ML architecture can replicate (that I'm aware of).

ACCount37 5 hours ago||

Human brain's "pre-training" is evolution cramming way too much structure into it. It "learns from scratch" the way it does because it doesn't actually learn from scratch.

I've seen plenty of wacky test-time training things used in ML nowadays, which is probably the closest to how the human brain learns. None are stable enough to go into the frontier LLMs, where in-context learning still reigns supreme. In-context learning is a "good enough" continuous learning approximation, it seems.

bigyabai 5 hours ago||

> In-context learning is a "good enough" continuous learning approximation, it seems.

"it seems" is doing a herculean effort holding your argument up, in this statement. Say, how many "R"s are in Strawberry?

ACCount37 5 hours ago||

If you think that "strawberry" is some kind of own, I don't know what to tell you. It takes deep and profound ignorance of both the technical basics of modern AIs and the current SOTA to do this kind of thing.

LLMs get better release to release. Unfortunately, the quality of humans in LLM capability discussions is consistently abysmal. I wouldn't be seeing the same "LLMs are FUNDAMENTALLY FLAWED because I SAY SO" repeated ad nauseam otherwise.

bigyabai 5 hours ago||

I can ask a nine-year-old human brain to solve that problem with a box of Crayola and a sheet of A4 printer paper.

In-context learning is professedly not "good enough" to approximate continuous learning of even a child.

ACCount37 5 hours ago|||

You're absolutely wrong!

You can also ask an LLM to solve that problem by spelling the word out first. And then it'll count the letters successfully. At a similar success rate to actual nine-year-olds.

There's a technical explanation for why that works, but to you, it might as well be black magic.

And if you could get a modern agentic LLM that somehow still fails that test? Chances are, it would solve it with no instructions - just one "you're wrong".

1. The LLM makes a mistake

2. User says "you're wrong"

3. The LLM re-checks by spelling the word out and gives a correct answer

4. The LLM then keeps re-checking itself using the same method for any similar inquiry within that context
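The "spell the word out first, then count" procedure from this comment looks like this when written as ordinary code. It only illustrates the procedure; it makes no claim about what happens inside a model, and `count_letter` is a hypothetical helper.

```python
def count_letter(word, letter):
    """Count occurrences of a letter by first spelling the word out.

    Step 1: break the word into individual characters (the "spelling out").
    Step 2: compare each character with the target letter, ignoring case.
    """
    spelled_out = list(word.lower())
    target = letter.lower()
    return sum(1 for ch in spelled_out if ch == target)


if __name__ == "__main__":
    print(count_letter("Strawberry", "R"))  # prints 3
```

The point of the intermediate list is that counting happens over individual characters rather than over whatever larger chunks the word was originally represented as.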

In-context learning isn't replaced by anything better because it's so powerful that finding "anything better" is incredibly hard. It's the bread and butter of how modern LLM workflows function.

bigyabai 3 hours ago||

> it's so powerful that finding "anything better" is incredibly hard.

We're back around to the start again. "Incredibly hard" is doing all of the heavy lifting in this statement, it's not all-powerful and there are enormous failure cases. Neither the human brain nor LLMs are a panacea for thought, but nobody in academia or otherwise is seriously comparing GPT to the human brain. They're distinct.

> There's a technical explanation for why that works, but to you, it might as well be black magic.

Expound however much you need. If there's one thing I've learned over the past 12 months it's that everyone is now an expert on the transformer architecture and everyone else is wrong. I'm all ears if you've got a technical argument to make, the qualitative comparison isn't convincing me.

8note 1 hour ago|||

why is the breakdown from words to letters your highest priority thing to add to the training data?

what problem does this allow you to solve that you couldnt otherwise?

aerhardt 7 hours ago|||

> you just limit the space to text

And even then... why can't they write a novel? Or lowering the bar, let's say a novella like Death in Venice, Candide, The Metamorphosis, Breakfast at Tiffany's...?

Every book's in the training corpus...

Is it just a matter of someone not having spent a hundred grand in tokens to do it?

voxl 6 hours ago|||

I know someone who spends basically every day writing personal fan fiction stories using every model you can find. She doesn't want to share it, and does complain about it a lot; it seems like maintaining consistency for something, say, 100 pages long is difficult.

conception 6 hours ago||||

I don’t understand - there are hundreds/thousands of AI written books available now.

aerhardt 6 hours ago||

I've glossed over a few and one can immediately tell they don't meet the average writing level you'd see in a local workshop for writers, and much less that of Mann or Capote.

zozbot234 5 hours ago||||

Never mind novels, it can't even write a good Reddit-style or HN-style comment. agentalcove.ai has an archive of AI models chatting to one another in "forum" style and even though it's a good show of the models' overall knowledge the AIisms are quite glaring.

mh- 4 hours ago||

They definitely can, and do.

It's just that the ones that manage to suppress all the AI writing "tells" go unnoticed as AI. This is a type of survivorship bias, though I feel there must be a better term for it that eludes me.

colechristensen 6 hours ago|||

Who says they can't? What's your bar that needs to be passed in order for "written a novella" to be achieved?

There's a lot of bad writing out there, I can't imagine nobody has used an LLM to write a bad novella.

aerhardt 6 hours ago||

> What's your bar that needs to be passed

I provide four examples in my comment...

colechristensen 6 hours ago||

Your qualification for whether an LLM can write a novella is that it has to be as good as The Metamorphosis?

Yes, those are examples of novellas, surely you believe an LLM could write a bad novella? I'm not sure what your point is. Either you think it can't string the words together in that length or your standard is it can't write a foundational piece of literature that stays relevant for generations... I'm not sure which.

aerhardt 6 hours ago||

I don't think it can write something that's a fraction of the quality of Kafka.

But GP's argument ("limit the space to text") could be taken to imply - and it seems to be a common implication these days - that LLMs have mastered the text medium, or that they will very soon.

> it can't write a foundational piece of literature

Why not, if this is a purely textual medium, the corpus includes all the great stories ever written, and possibly many writing workshops and great literature courses?

colechristensen 6 hours ago||

I don't know what to tell you. It's more than a little absurd to make the qualification of being able to do something to be that the output has to be considered a great work of art for generations.

aerhardt 6 hours ago||

I agree that the argument starts from a reduction to the absurd.

So at least we can agree that AI hasn't mastered the text medium, without further qualification?

And what about my argument, further qualified, which is that I don't think it could even write as well as a good professional writer - not necessarily a generational one?

colechristensen 3 hours ago||

>AI hasn't mastered the text medium

I don't know what this means and I don't know what would qualify it as having "mastered" at all. Seems like a no-true-Scotsman thing where regardless there would always be someone saying that it couldn't actually do a thing because of this and that.

>why can't they write a novel?

This is what I'm disagreeing with. I think an LLM can write a novel well enough that it's recognizably a pretty mediocre novel, no worse than the median human-written novel, which to be fair is pretty bad. You seem to have an unqualified bar something needs to pass before "writing a novel" is accomplished, but it's not clear what that is. At the same time you're switching between the ability to do a thing and the ability to do a thing in a way that's honored as the best of the best for a century. So I don't know, it kind of seems like you just don't like AI and have a different standard for it that adjusts so that it fails. This doesn't match what you'd consider some random Bob's ability to do a thing.

mohamedkoubaa 2 hours ago||

I think they're as good as they're going to get from scaling. They can still get more efficient, and tooling/harnesses around them will improve.

someuser54541 8 hours ago||

Should the title here be 4.6 to 4.7 instead of the other way around?

freak42 8 hours ago||

absolutely!

UltraSane 8 hours ago||

Writing Opus 4.6 to 4.7 does make more sense for people who read left to right.

pixelatedindex 8 hours ago|||

I’m impressed with anyone who can read English right to left.

jlongman 7 hours ago|||

You might like https://en.wikipedia.org/wiki/Boustrophedon

amulyabaral 7 hours ago||

Whoa! TIL! I struggled a bit to read this style at first, but felt it get easier after a few tries.

einpoklum 7 hours ago|||

Right to Left English - read can, who? Anyone with [which] impressed am I.

y1n0 7 hours ago||

Yoda, you that is?

embedding-shape 8 hours ago||||

But the page is not in a language that should be read right to left, doesn't that make that kind of confusing?

usrnm 8 hours ago||

Did you mean "right to left"?

embedding-shape 7 hours ago||

I very much did, it got too confusing even for me. Thanks!

UltraSane 6 hours ago||

I kept mentally verifying that English is written left to right.

bee_rider 8 hours ago|||

Err, how so?

tiffanyh 8 hours ago||

I was using Opus 4.7 just yesterday to help implement best practices on a single page website.

After just ~4 prompts I blew past my daily limit. Another ~7 more prompts & I blew past my weekly limit.

The entire HTML/CSS/JS was less than 300 lines of code.

I was shocked how fast it exhausted my usage limits.

hirako2000 8 hours ago||

I haven't used Claude, because I suspected this sort of thing would come.

With an enterprise subscription, the bill gets bigger, but it's not like a VP can easily send a memo to all their staff that a migration is coming.

Individuals may end their subscriptions; that would ease the DC usage and turn profits up.

fooster 4 hours ago||

Sorry you are missing out. I use claude all day every day with max and what people are reporting here has not been my experience. My current usage is 16% and it resets Thursday.

zaptrem 6 hours ago|||

What's your reasoning effort set to? Max now uses way more tokens and isn't suggested for most use cases. Even the new default (xhigh) uses more than the old default (medium).

nixpulvis 3 hours ago||

That's what I'm wondering. Is it that people are defaulting to xhigh now and that's why it feels like it's consuming a lot more tokens? If people manually set it to medium, would it be comparable?

nixpulvis 46 minutes ago||

Switching back to medium seems to have fixed the issue for me.

sync 7 hours ago|||

Which plan are you on? I could see that happening with Pro (which I think defaults to Sonnet?), would be surprised with Max…

templar_snow 7 hours ago|||

It eats even the Max plan like crazy.

tiffanyh 7 hours ago|||

Pro. It even gave me $20 free credits, and exhausted free credits nearly instantly.

cageface 9 minutes ago||

The pro plan is useless. You need at least the 5x max plan to get any real work done.

That said I find the GPT plans much better value.

tomtomistaken 7 hours ago||

Are you using a Claude subscription? Because that's not how it works there.

hereme888 6 hours ago|

> Opus 4.7 (Adaptive Reasoning, Max Effort) cost ~$4,406 to run the Artificial Analysis Intelligence Index, ~11% less than Opus 4.6 (Adaptive Reasoning, Max Effort, ~$4,970) despite scoring 4 points higher. This is driven by lower output token usage, even after accounting for Opus 4.7's new tokenizer. This metric does not account for cached input token discounts, which we will be incorporating into our cost calculations in the near future.
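As a quick sanity check on the "~11% less" figure in the quote, here is a minimal sketch of the arithmetic. The dollar amounts are the rounded ones quoted above; the constant names and the `relative_saving` helper are hypothetical, made up for illustration, and are not anything Artificial Analysis publishes.

```python
# Rounded benchmark run costs in USD, taken from the quote above.
# The names below are illustrative assumptions, not an official API.
COST_OPUS_4_6 = 4970.0
COST_OPUS_4_7 = 4406.0


def relative_saving(old_cost: float, new_cost: float) -> float:
    """Return the fractional cost reduction going from old_cost to new_cost."""
    if old_cost <= 0:
        raise ValueError("old_cost must be positive")
    return (old_cost - new_cost) / old_cost


absolute = COST_OPUS_4_6 - COST_OPUS_4_7
saving = relative_saving(COST_OPUS_4_6, COST_OPUS_4_7)

print(f"absolute saving: ${absolute:,.0f}")  # absolute saving: $564
print(f"relative saving: {saving:.1%}")      # relative saving: 11.3%
```

So the quoted numbers are internally consistent: $564 saved on a $4,970 baseline is about 11.3%. Per the quote, this does not yet include cached input token discounts, so it may not match what a cache-heavy workload actually pays.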
