Speed up responses with fast mode

Posted by surprisetalk 7 hours ago

Speed up responses with fast mode(code.claude.com)

114 points | 117 comments

kristianp 2 hours ago|

This is gold for Anthropic's profitability. The Claude Code addicts can double their spend to plow through tokens because they need to finish something by a deadline. OpenAI will have a similar product within a week but will only charge 3x the normal rate.

This angle might also be NVidias reason for buying Groq. People will pay a premium for faster tokens.

jweir 1 hour ago|

I switched back to 4.5 Sonnet or Opus yesterday since 4.6 was so slow and often “over thinking” or “over analyzing” the problem space. Tasks which accurately took under an minute in Sonnet 4.5 were still running after 5 minutes in 4.6 (yeah I had them race for a few tasks)

Someone of this could be system overload I suppose.

stefan_ 51 minutes ago||

Yeah, nothing is sped up, their initial deployment of 4.6 is so unbearably slow they are just now offering you the opportunity to pay more for the same experience of 4.5. What's the word for that?

coldtea 23 minutes ago||

Enshittification

OtherShrezzing 1 hour ago||

A useful feature would be slow-mode which gets low cost compute on spot pricing.

I’ll often kick off a process at the end of my day, or over lunch. I don’t need it to run immediately. I’d be fine if it just ran on their next otherwise-idle gpu at much lower cost that the standard offering.

stavros 1 hour ago||

OpenAI offers that, or at least used to. You can batch all your inference and get much lower prices.

guerrilla 1 hour ago||

> I’ll often kick off a process at the end of my day, or over lunch. I don’t need it to run immediately. I’d be fine if it just ran on their next otherwise-idle gpu at much lower cost that the standard offering.

If it's not time sensitive, why not just run it at on CPU/RAM rather than GPU.

weird-eye-issue 47 minutes ago|||

Yeah just run a LLM with over 100 billion parameters on a CPU.

kristjansson 39 minutes ago||

200 GB is an unfathomable amount of main memory for a CPU

(with apologies for snark,) give gpt-oss-120b a try. It’s not fast at all, but it can generate on CPU.

bethekidyouwant 1 hour ago|||

Run what exactly?

all2 50 minutes ago||

I'm assuming GP means 'run inference locally on GPU or RAM'. You can run really big LLMs on local infra, they just do a fraction of a token per second, so it might take all night to get a paragraph or two of text. Mix in things like thinking and tool calls, and it will take a long, long time to get anything useful out of it.

zhyder 2 hours ago||

So 2.5x the speed at 6x the price [1].

Quite a premium for speed. Especially when Gemini 3 Pro is 1.8x the tokens/sec speed (of regular-speed Opus 4.6) at 0.45x the price [2]. Though it's worse at coding, and Gemini CLI doesn't have the agentic strength of Claude Code, yet.

[1] - https://x.com/claudeai/status/2020207322124132504 [2] - https://artificialanalysis.ai/leaderboards/models

legojoey17 13 minutes ago||

What's crazy is the pricing difference given that OpenAI recently reduced latency on some models with no price change - https://x.com/OpenAIDevs/status/2018838297221726482

Nition 6 hours ago||

Note that you can't use this mode to get the most out of a subscription - they say it's always charged as extra usage:

> Fast mode usage is billed directly to extra usage, even if you have remaining usage on your plan. This means fast mode tokens do not count against your plan’s included usage and are charged at the fast mode rate from the first token.

Although if you visit the Usage screen right now, there's a deal you can claim for $50 free extra usage this month.

paxys 5 hours ago||

Looking at the "Decide when to use fast mode", it seems the future they want is:

- Long running autonomous agents and background tasks use regular processing.

- "Human in the loop" scenarios use fast mode.

Which makes perfect sense, but the question is - does the billing also make sense?

AstroBen 47 minutes ago||

This seems like an incredibly bad deal, but maybe they're probing to see if people will pay more

You know if people pay for this en masse it'll be the new default pricing, with fast being another step above

jawon 3 hours ago||

I was thinking about inhouse model inference speeds at frontier labs like Anthropic and OpenAI after reading the "Claude built a C compiler" article.

Having higher inference speed would be an advantage, especially if you're trying to eat all the software and services.

Anthropic offering 2.5x makes me assume they have 5x or 10x themselves.

In the predicted nightmare future where everything happens via agents negotiating with agents, the side with the most compute, and the fastest compute, is going to steamroll everyone.

Aurornis 3 hours ago||

> Anthropic offering 2.5x makes me assume they have 5x or 10x themselves.

They said the 2.5X offering is what they've been using internally. Now they're offering via the API: https://x.com/claudeai/status/2020207322124132504

LLM APIs are tuned to handle a lot of parallel requests. In short, the overall token throughput is higher, but the individual requests are processed more slowly.

The scaling curves aren't that extreme, though. I doubt they could tune the knobs to get individual requests coming through at 10X the normal rate.

This likely comes from having some servers tuned for higher individual request throughput, at the expense of overall token throughput. It's possible that it's on some newer generation serving hardware, too.

crowbahr 3 hours ago|||

Where on earth are you getting these numbers? Why would a SaaS company that is fighting for market dominance withhold 10x performance if they had it? Where are you getting 2.5x?

This is such bizarre magical thinking, borderline conspiratorial.

There is no reason to believe any of the big AI players are serving anything less than the best trade off of stability and speed that they can possibly muster, especially when their cost ratios are so bad.

jawon 3 hours ago|||

Not magical thinking, not conspiratorial, just hypothetical.

Just because you can't afford to 10x all your customers' inference doesn't mean you can't afford to 10x your inhouse inference.

And 2.5x is from Anthropic's latest offering. But it costs you 6x normal API pricing.

jawon 2 hours ago||

Also, from a comment in another thread, from roon, who works at OpenAI:

> codex-5.2 is really amazing but using it from my personal and not work account over the weekend taught me some user empathy lol it’s a bit slow

[0] https://nitter.net/tszzl/status/2016338961040548123

stavros 3 hours ago|||

This makes no sense. It's not like they have a "slow it down" knob, they're probably parallelizing your request so you get a 2.5x speedup at 10x the price.

brookst 1 hour ago||

All of these systems use massive pools of GPUs, and allocate many requests to each node. The “slow it down” knob is to steer a request to nodes with more concurrent requests; “speed it up” is to route to less-loaded nodes.

stavros 1 hour ago||

Right, but that's still not Anthropic adding an intentional delay for the sole purpose of having you pay more to remove it.

falloutx 3 hours ago||

Thats also called slowing down default experience so users have to pay more for the fast mode. I think its the first time we are seeing blatant speed ransoms in the LLMs.

Aurornis 3 hours ago|||

That's not how this works. LLM serving at scale processes multiple requests in parallel for efficiency. Reduce the parallelism and you can process individual requests faster, but the overall number of tokens processed is lower.

falloutx 3 hours ago||

They can now easily decrease the speed for the normal mode, and then users will have to pay more for fast mode.

Aurornis 3 hours ago|||

Do you have any evidence that this is happening? Or is it just a hypothetical threat you're proposing?

These companies aren't operating in a vacuum. Most of their users could change providers quickly if they started degrading their service.

falloutx 2 hours ago||

They have contracts with companies, and those companies wont be able to change quickly. By the time those contracts will come back for renewals it will already be too late, their code becoming completely unreadable by humans. Individual devs can move quickly but companies don't.

kolinko 2 hours ago|||

Are you at all familiar with the architecture of systems like theirs?

The reason people don't jump to your conclusion here (and why you get downvoted) is that for anyone familiar with how this is orchestrated on the backend it's obvious that they don't need to do artificial slowdowns.

falloutx 2 hours ago||

I am familiar with the business model. This is clear indication of what their future plan is.

Also, I just pointed out at the business issue, just raising a point which was not raised here. Just want people to be more cautious

throw310822 3 hours ago|||

Slowing down respect to what?

falloutx 3 hours ago||

Slowing down with respect to original speed of response. Basically what we used to get few months back and what is the best possible experience.

throw310822 3 hours ago||

There is no "original speed of response". The more resources you pour in, the faster it goes.

falloutx 3 hours ago||

Watch them decrease resources for the normal mode so people are penny pinched into using fast mode.

throw310822 2 hours ago||

Seriously, thinking at the price structure of this (6x the price for 2.5x the speed, if that's correct) it seems to target something like real time applications with very small context. Maybe vocal assistants? I guess that if you're doing development it makes more sense to parallelize over more agents rather than paying that much for a modest increase in speed.

rustyhancock 4 hours ago||

At this point why don't we just CNAME HN to the Claude marketing blog?

Kwpolska 1 hour ago||

Because we would miss simonw’s self-promotion blog posts.

iLoveOncall 56 minutes ago||

It gives the same space, if not a lot more, to OpenAI.

It should definitely be renamed to AINews instead of HackerNews, but Claude posts are a lot less frequent than OpenAI's.

IMTDb 6 hours ago|

I’m curious what’s behind the speed improvements. It seems unlikely it’s just prioritization, so what else is changing? Is it new hardware (à la Groq or Cerebras)? That seems plausible, especially since it isn’t available on some cloud providers.

Also wondering whether we’ll soon see separate “speed” vs “cleverness” pricing on other LLM providers too.

kingstnap 4 hours ago||

It comes from batching and multiple streams on a GPU. More people sharing 1 GPU makes everyone run slower but increases overall token throughput.

Mathematically it comes from the fact that this transformer block is this parallel algorithm. If you batch harder, increase parallelism, you can get higher tokens/s. But you get less throughput. Simultaneously there is also this dial that you can speculatively decode harder with fewer users.

Its true for basically all hardware and most models. You can draw this Pareto curve of how much throughput per GPU vs how many tokens per second per stream. More tokens/s less total throughput.

See this graph for actual numbers:

Token Throughput per GPU vs. Interactivity gpt-oss 120B • FP4 • 1K / 8K • Source: SemiAnalysis InferenceMAX™

https://inferencemax.semianalysis.com/

vlovich123 2 hours ago||

> If you batch harder, increase parallelism, you can get higher tokens/s. But you get less throughput. Simultaneously there is also this dial that you can speculatively decode harder with fewer users.

I think you skipped the word “total throughout” there right? Cause tok/s is a measure of throughput, so it’s clearer to say you increase throughput/user at the expense of throughput/gpu.

I’m not sure about the comment about speculative decode though. I haven’t served a frontier model but generally speculative decode I believe doesn’t help beyond a few tokens, so I’m not sure you can “speculatively decode harder” with fewer users.

sothatsit 5 hours ago|||

There are a lot of knobs they could tweak. Newer hardware and traffic prioritisation would both make a lot of sense. But they could also lower batching windows to decrease queueing time at the cost of lower throughput, or keep the KV cache in GPU memory at the expense of reducing the number of users they can serve from each GPU node.

martinald 45 minutes ago||

I think it's just routing to faster hardware:

H100 SXM: 3.35 TB/s HBM3

GB200: 8 TB/s HBM3e

2.4x faster memory - which is exactly what they are saying the speedup is. I suspect they are just routing to GB200 (or TPU etc equivalents).

FWIW I did notice _sometimes_ recently Opus was very fast. I put it down to a bug in Claude Code's token counting, but perhaps it was actually just occasionally getting routed to GB200s.

jstummbillig 5 hours ago|||

> It seems unlikely it’s just prioritization

Why does this seem unlikely? I have no doubt they are optimizing all the time, including inference speed, but why could this particular lever not entirely be driven by skipping the queue? It's an easy way to generate more money.

AnotherGoodName 2 hours ago|||

Yes it's 100% prioritization. Through that it's also likely running on more GPUs at once but that's an artifact of prioritization at the datacenter level. Any task coming into an AI datacenter atm is split into fairly fined grained chunks of work and added to queues to be processed.

When you add a job with high priority all those chunks will be processed off the queue first by each and every GPU that frees up. It probably leads to more parallelism but... it's the prioritization that led to this happening. It's better to think of this as prioritization of your job leading to the perf improvement.

Here's a good blog for anyone interested which talks about prioritization and job scheduling. It's not quite at the datacenter level but the concepts are the same. Basically everything is thought of as a pipeline. All training jobs are low pri (they take months to complete in any case), customer requests are mid pri and then there's options for high pri. Everything in an AI datacenter is thought of in terms of 'flow'. Are there any bottlenecks? Are the pipelines always full and the expensive hardware always 100% utilized? Are the queues backlogs big enough to ensure full utilization at every stage?

https://www.aleksagordic.com/blog/vllm

kgeist 1 hour ago||

>Yes it's 100% prioritization

Amazon Bedrock has a similar feature called "priority tier": you get faster responses at 1.75x the price. And they explicitly say in the docs "priority requests receive preferential treatment in the processing queue, moving ahead of standard requests for faster responses".

singpolyma3 5 hours ago|||

Until everyone buys it. Like fast pass at an amusement park where the fast line is still two hours long

sothatsit 5 hours ago|||

At 6x the cost, and it requiring you to pay full API pricing, I don’t think this is going to be a concern.

servercobra 5 hours ago|||

It's a good way to squeeze extra out of a bunch of people without actually raising prices.

Nition 5 hours ago|||

I wonder if they might have mostly implemented this for themselves to use internally, and it is just prioritization but they don't expect too many others to pay the high cost.

sothatsit 5 hours ago||

Roon said as much here [0]:

> codex-5.2 is really amazing but using it from my personal and not work account over the weekend taught me some user empathy lol it’s a bit slow

[0] https://nitter.net/tszzl/status/2016338961040548123

Nition 2 hours ago||

I see Anthropic says so here as well: https://x.com/claudeai/status/2020207322124132504

re-thc 3 hours ago|||

Nvidia GB300 i.e. Blackwell.

pshirshov 6 hours ago||

> so what else is changing?

Let me guess. Quantization?

More comments...