Speed up responses with fast mode

Posted by surprisetalk 8 hours ago

Speed up responses with fast mode (code.claude.com)

114 points | 117 comments | page 2

throwaway132448 3 hours ago|

Given how little most of us can know about the true cost of inference for these providers (and thus the financial sustainability of their services), this is an interesting signal. Not sure how to interpret it, but it doesn’t feel like it bodes well.

not_math 3 hours ago|

Given that providers of open source models can offer Kimi K2.5 at input $0.60 and output $2.50 per million tokens, I think the cost of inference must be around that. We would still need to compare the tokens per second.
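To make the per-million-token pricing above concrete, here is a minimal sketch that prices a single request at the quoted $0.60 input / $2.50 output rates. The request sizes in the example are hypothetical, and this says nothing about what inference actually costs the provider.

```python
# Prices quoted in the comment above (USD per million tokens).
INPUT_PRICE_PER_MTOK = 0.60
OUTPUT_PRICE_PER_MTOK = 2.50


def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Return the USD cost of one request at the quoted per-MTok prices."""
    return (
        input_tokens * INPUT_PRICE_PER_MTOK
        + output_tokens * OUTPUT_PRICE_PER_MTOK
    ) / 1_000_000


# Hypothetical request: 200k tokens in, 10k tokens out.
print(f"${request_cost(200_000, 10_000):.3f}")
```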

digiown 3 hours ago||

I wouldn't be surprised if the implementation is

- Turn down the thinking token budget to one half

- Multiply the thinking tokens by 2 on the usage stats returned

- Phew! Twice the speed

IMO charging for thinking tokens that you can't see is a scam.

jhack 7 hours ago||

The pricing on this is absolutely nuts.

nick49488171 7 hours ago|

For us mere mortals, how fast does a normal developer go through a MTok? How about a good power user?

snowfield 5 hours ago|||

A developer can blast millions of tokens in minutes. With a context size of 250k, a million tokens is just 4 queries. But with tool usage and subsequent calls etc. it can easily burn through many millions in one request.

But if you just ask a question or something it’ll take a while to spend a million tokens…
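The arithmetic above can be sketched roughly as follows. The sketch assumes every tool call resends the full, fixed-size context with no caching; real agents grow and cache their context, so treat the numbers as a rough upper-end illustration. The function names are made up for this example.

```python
def queries_per_mtok(context_tokens: int) -> float:
    """How many full-context queries fit into one million input tokens."""
    return 1_000_000 / context_tokens


def agent_loop_input_tokens(context_tokens: int, tool_calls: int) -> int:
    """Total input tokens for one user request that triggers `tool_calls`
    follow-up calls.

    Assumption: each call (the initial one plus every tool call) resends
    the same full context, and nothing is cached.
    """
    return context_tokens * (tool_calls + 1)


print(queries_per_mtok(250_000))             # 250k context: 4 queries per MTok
print(agent_loop_input_tokens(250_000, 11))  # 12 calls in total
```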

nick49488171 5 hours ago||

Seems like an opportunity to condense the context into 'documentation' level and only load the full text/code for files that expect to be edited?

snowfield 4 hours ago|||

Yeah, that’s what they try to do with the latest coding agents' sub-agents, which only have the context they need, etc. But atm it’s too much work to manage contexts at that level.

SatvikBeri 2 hours ago|||

I use one Claude instance at a time, roughly full-time (it writes 90% of my code). Generally making small changes, nothing weird. According to ccusage, I spend about $20 of tokens a day, a bit less than 1 MTok of output tokens a day. So the exact same workflow would be about $120 a day at the higher speed.
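The estimate above is a straight multiplication. A minimal sketch, assuming the "6x more expensive" figure mentioned elsewhere in this thread applies uniformly to every token in the workflow (the constant and function name are illustrative, not taken from any pricing page):

```python
# Assumption: fast mode costs 6x the normal rate for all tokens,
# per the "6x more expensive" figure quoted in this thread.
FAST_MODE_PRICE_MULTIPLIER = 6.0


def fast_mode_daily_cost(normal_daily_cost: float) -> float:
    """Scale a normal-mode daily spend (USD) to the assumed fast-mode price."""
    return normal_daily_cost * FAST_MODE_PRICE_MULTIPLIER


print(fast_mode_daily_cost(20.0))  # the $20/day workflow from the comment
```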

clbrmbr 7 hours ago||

I’d love to hear from engineers who find that faster speed is a big unlock for them.

The deadline piece is really interesting. I suppose there’s a lot of people now who are basically limited by how fast their agents can run and on very aggressive timelines with funders breathing down their necks?

CuriouslyC 1 hour ago||

The only time I find faster speed to be a big unlock is when iterating on UI stuff. If you're talking to your agent with hot reload and such, the model can often be the bottleneck in a style-tuning workflow, by a lot.

Aurornis 4 hours ago|||

> I’d love to hear from engineers who find that faster speed is a big unlock for them.

How would it not be a big unlock? If the answers were instant I could stay focused and iterate even faster instead of having a back-and-forth.

Right now even medium requests can take 1-2 minutes and significant work can take even longer. I can usually make some progress on a code review, read more docs, or do a tiny chunk of productive work but the constant context switching back and forth every 60s is draining.

electroly 3 hours ago|||

I won't be paying extra to use this, but Claude Code's feature-dev plugin is so slow that even when running two concurrent Claudes on two different tasks, I'm twiddling my thumbs some of the time. I'm not fast and I don't have tight deadlines, but nonetheless feature-dev is really slow. It would be better if it were fast enough that I wouldn't have time to switch off to a second task and could stick with the one until completion. The mental cost of juggling two tasks is high; humans aren't designed for multitasking.

fragmede 2 hours ago||

Two? I'd estimate twelve (three projects x four tasks) going at peak.

sothatsit 7 hours ago|||

If it could help avoid you needing to context switch between multiple agents, that could be a big mental load win.

throw310822 4 hours ago|||

The idea of development teams bottlenecked by agent speed rather than people, ideas, strategy, etc. gives me some strange vibes.

bananapub 3 hours ago||

it's simpler than that - making it faster means it becomes less of an asynchronous task.

current speeds are "ask it to do a thing and then you the human need to find something else to do for minutes (or more!) while it works". once it's fast enough, you just sit there and tell it to do a thing and it does, and you just constantly work on the one thing.

cerebras is just about fast enough for that already, with the downside of being more expensive and worse at coding than claude code.

it feels like absolute magic to use though.

so, depends how you price your own context switches, really.

simonw 7 hours ago||

The one question I have that isn't answered by the page is how much faster?

Obviously they can't make promises but I'd still like a rough indication of how much this might improve the speed of responses.

scosman 7 hours ago||

Yeah is this cerebras/groq speed, or I just skip the queue?

l1n 7 hours ago|||

2.5x faster or so (https://x.com/claudeai/status/2020207322124132504).

zurfer 7 hours ago||

6x more expensive
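One way to weigh "2.5x faster" against "6x more expensive" is a break-even estimate: how much an hour of your waiting time would need to be worth for the premium to pay off. This is only a sketch. It assumes waiting time shrinks in proportion to the speedup and that the price multiplier applies to the whole spend; the function name and the example inputs are hypothetical.

```python
def break_even_hourly_rate(
    normal_cost: float,
    wait_hours: float,
    speedup: float = 2.5,           # figure quoted in this thread
    price_multiplier: float = 6.0,  # figure quoted in this thread
) -> float:
    """Hourly value of time (USD) at which fast mode breaks even.

    Assumptions: waiting time scales as 1 / speedup, and the whole
    normal-mode spend is billed at `price_multiplier` times the price.
    """
    if speedup <= 1.0:
        raise ValueError("speedup must be greater than 1")
    hours_saved = wait_hours * (1.0 - 1.0 / speedup)
    extra_cost = normal_cost * (price_multiplier - 1.0)
    return extra_cost / hours_saved


# Hypothetical day: $20 of normal-mode usage, 2 hours spent waiting.
print(f"${break_even_hourly_rate(20.0, 2.0):.2f} per hour")
```

Under those assumptions, fast mode only pays for itself if the time saved is worth more than the printed rate.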

l5870uoo9y 6 hours ago||

It doesn’t say how much faster it is, but in my experience with OpenAI’s “service_tier=priority” option on SQLAI.ai, it’s twice as fast.

dmix 4 hours ago||

I really like Anthropic's web design. This doc site looks like it's using gitbook (or a clone of gitbook) but they make it look so nice.

falloutx 4 hours ago||

It's just https://www.mintlify.com/ with a barely customized theme

dmix 29 minutes ago|||

Ah fair enough. Their webdesign on the homepage and other stuff is still great. And the font/colour choice on the Mintlify theme is nice.

deepdarkforest 3 hours ago|||

Mintlify is the best example of a product that is just nice. They don't claim to have a moat, or weird agi vibes, or whatever. It just works and it's pretty. 10m arr right there

treycluff 4 hours ago||

Looks like mintlify to me. Especially the copy page button.

pronik 7 hours ago||

While it's an excellent way to make more money in the moment, I think this might become a standard no-extra-cost feature in several months (see Opus becoming way cheaper and a default model within months). Mental load management while using agents will become even more important it seems.

falloutx 4 hours ago||

Why would they cut a money making feature? In fact I am already imagining them asking for speed ransom every time you are in a pinch, some extra context space will also become buyable. Anthropic is in a penny pincher phase right now and they will try to milk everything. Watch them add micro transactions too.

giancarlostoro 7 hours ago||

Yeah especially once they make an even faster fast mode.

1123581321 7 hours ago||

Could be a use for the $50 extra usage credit. It requires extra usage to be enabled.

> Fast mode usage is billed directly to extra usage, even if you have remaining usage on your plan. This means fast mode tokens do not count against your plan’s included usage and are charged at the fast mode rate from the first token.

minimaxir 7 hours ago||

After exceeding the increasingly shrinking session limit with Opus 4.6, I continued with the extra usage only for a few minutes and it consumed about $10 of the credit.

I can't imagine how quickly this Fast Mode goes through credit.

arcanemachiner 7 hours ago||

It has to be. The timing is just too close.

niobe 5 hours ago|

So fast mode uses more tokens, in direct opposition to Gemini where fast 'mode' means less. One more piece of useless knowledge to remember.

Sol- 5 hours ago||

I don't think this is the case, according to the docs, right? The effort level will use fewer tokens, but the independent fast mode just somehow seems to use some higher priority infrastructure to serve your requests.

Aurornis 4 hours ago||

You're comparing two different things. It's not useless knowledge, it's something you need to understand.

Opus fast mode is routed to different servers with different tuning that prioritizes individual response throughput. Same model served differently. Same response, just delivered faster.

The Gemini fast mode is a different model (most likely) with different levels of thinking applied. Very different response.
