
Posted by surprisetalk 10 hours ago

Speed up responses with fast mode (code.claude.com)
128 points | 139 comments
HardCodedBias 2 hours ago|
If this pricing ratio holds it is going to mint money for Cerebras.

Many suspected a 2x premium for 10x faster. It looks like they may have been incorrect.

1123581321 9 hours ago||
Could be a use for the $50 extra usage credit. It requires extra usage to be enabled.

> Fast mode usage is billed directly to extra usage, even if you have remaining usage on your plan. This means fast mode tokens do not count against your plan’s included usage and are charged at the fast mode rate from the first token.

minimaxir 9 hours ago||
After exceeding the ever-shrinking session limit with Opus 4.6, I continued on extra usage for only a few minutes and it consumed about $10 of the credit.

I can't imagine how quickly this Fast Mode goes through credit.

arcanemachiner 8 hours ago||
It has to be. The timing is just too close.
dmix 6 hours ago||
I really like Anthropic's web design. This doc site looks like it's using GitBook (or a GitBook clone), but they make it look so nice.
falloutx 6 hours ago||
It's just https://www.mintlify.com/ with a barely customized theme
dmix 2 hours ago|||
Ah, fair enough. Their web design on the homepage and elsewhere is still great. And the font/colour choices on the Mintlify theme are nice.
deepdarkforest 4 hours ago|||
Mintlify is the best example of a product that is just nice. They don't claim to have a moat, or weird AGI vibes, or whatever. It just works and it's pretty. $10M ARR right there
treycluff 6 hours ago||
Looks like Mintlify to me. Especially the copy-page button.
niobe 7 hours ago||
So fast mode uses more tokens, in direct opposition to Gemini where fast 'mode' means less. One more piece of useless knowledge to remember.
Sol- 7 hours ago||
I don't think this is the case, according to the docs, right? The effort level will use fewer tokens, but the independent fast mode just seems to use some higher-priority infrastructure to serve your requests.
Aurornis 6 hours ago||
You're comparing two different things. It's not useless knowledge; it's something you need to understand.

Opus fast mode is routed to different servers with different tuning that prioritizes individual response throughput. Same model served differently. Same response, just delivered faster.

The Gemini fast mode is a different model (most likely) with different levels of thinking applied. Very different response.
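A toy sketch of that distinction (hypothetical names only; neither vendor exposes anything like this, it just contrasts the two meanings of "fast"):

    # Hypothetical illustration: "fast" as a serving tier vs. "fast" as a different model.
    from dataclasses import dataclass

    @dataclass
    class ServingChoice:
        model: str           # which weights generate the answer
        priority_pool: bool  # whether the request gets faster, dedicated serving

    def claude_fast_mode() -> ServingChoice:
        # Same model, same output; only the serving tier changes.
        return ServingChoice(model="opus", priority_pool=True)

    def gemini_fast_mode() -> ServingChoice:
        # A different, cheaper model (and/or less thinking), so the
        # answer itself changes, not just how quickly it arrives.
        return ServingChoice(model="flash", priority_pool=False)

    print(claude_fast_mode())
    print(gemini_fast_mode())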

solidasparagus 9 hours ago||
I pay $200 a month and don't get any included access to this? Ridiculous
pedropaulovc 9 hours ago||
Well, you can burn your $50 bonus on it
bakugo 8 hours ago|||
The API price is 6x that of normal Opus, so look forward to a new $1200/mo subscription that gives you the same amount of usage if you need the extra speed.
MuffinFlavored 8 hours ago||
I always wondered about this: is it true, does the math really come out that bad? 6x?

Is the writing on the wall for $100-$200/mo users that it's basically subsidized for now and that $400/mo+ is coming sooner than we think?

Are they getting us all hooked and then going to raise it in the future, or will inference prices go down to offset?

bakugo 3 hours ago||
The writing has been on the wall since day 1. They wouldn't be marketing a subscription being sold at a loss as hard as they are if the intention wasn't to lock you in and then increase the price later.

What I expect to happen is that they'll slowly decrease the usage limits on the existing subscriptions over time, and introduce new, more expensive subscription tiers with more usage. There's a reason AI subscriptions generally don't tell you exactly what the limits are: they're intended to be "flexible" to allow for this.

kingforaday 8 hours ago||
...But it says "Available to all Claude Code users on subscription plans (Pro/Max/Team/Enterprise) and Claude Console."

Is this wrong?

behindsight 8 hours ago|||
It's explicitly called out as excluded in the blue info bubble they have there.

> Fast mode usage is billed directly to extra usage, even if you have remaining usage on your plan. This means fast mode tokens do not count against your plan’s included usage and are charged at the fast mode rate from the first token.

https://code.claude.com/docs/en/fast-mode#requirements

sothatsit 8 hours ago|||
I think this is just worded in a misleading way. It’s available to all users, but it’s not included as part of the plan.
pqdbr 2 hours ago||
I redeemed my 50 USD credit to give it a go. In literally less than 10 minutes I spent 10 USD. Insane. I love Claude Code, but this pricing is madness.
otterley 47 minutes ago|
What would have been the human labor cost equivalent?
maz1b 8 hours ago||
AFAIK, they don't have any deals or partnerships with Groq or Cerebras or any of those kinds of companies... so how did they do this?
tcdent 8 hours ago||
Inference is run on shared hardware already, so they're not giving you the full bandwidth of the system by default. This most likely just allocates more resources to your request.
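Rough intuition, with entirely invented numbers: requests share a batch on the GPU, so per-request speed is roughly the aggregate decode rate divided by how many requests share it.

    # Back-of-the-envelope sketch; all numbers are made up for illustration.
    aggregate_decode_rate = 10_000   # tokens/sec the hardware emits across a batch
    normal_batch = 64                # requests sharing that throughput by default
    fast_batch = 8                   # a smaller batch reserved for fast-mode traffic

    print(aggregate_decode_rate / normal_batch)  # ~156 tokens/sec per request
    print(aggregate_decode_rate / fast_batch)    # 1250 tokens/sec per request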
hendersoon 8 hours ago||
Could well be running on Google TPUs.
pedropaulovc 9 hours ago||
Where is this perf gain coming from? Running on TPUs?
AnotherGoodName 6 hours ago|
AI data centers are a whole lot of pipelines pumping data around via queues. They want those expensive, power-hungry cards near 100% utilization at all times, so each system has a queue of jobs ready to run, feeding into GPU memory as fast as completed jobs are read out (and passed to the next stage), with enough backlog in those queues to keep the pipeline full. You see responses in seconds, but at the data center your request was broken into jobs, passed around through queues, processed in an orderly manner and pieced back together.

With fast mode you're literally skipping the queue. An outcome of all this is that, for the rest of us, responses will become slower the more people use this 'fast' option.

I do suspect they'll also soon have a slow option for those who have Claude working overnight with no real concern for response latency. The ultimate goal is pipelines of data hitting 100% hardware utilization at all times.
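A minimal sketch of the "skip the queue" effect (purely illustrative; nothing here reflects Anthropic's actual scheduler):

    # Illustrative scheduler: fast-mode jobs get a smaller priority number,
    # so they are dequeued before standard jobs that arrived earlier.
    import heapq
    import itertools

    arrival = itertools.count()  # tie-breaker preserves arrival order
    queue = []

    def submit(job, fast=False):
        priority = 0 if fast else 1
        heapq.heappush(queue, (priority, next(arrival), job))

    submit("standard job A")
    submit("standard job B")
    submit("fast job C", fast=True)

    while queue:
        _, _, job = heapq.heappop(queue)
        print(job)  # fast job C, then standard job A, then standard job B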

martinald 4 hours ago||
Hmm, not sure I agree with you there entirely. You're right that there are queues to ensure you max out the hardware with concurrent batches to _start_ inference, but I doubt you'd want to split the same job into multiple pieces and move them around between servers if you could at all avoid it.

It requires a lot of bandwidth to do that: even at 400 Gbit/s it would take a good second to move even a smallish KV cache between racks, even within the same DC.
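The back-of-the-envelope math (the KV-cache size is a guess; real sizes vary a lot with model and context length):

    # Transfer time over a 400 Gbit/s link, with an assumed KV-cache size.
    link_gbit_per_s = 400
    link_gbyte_per_s = link_gbit_per_s / 8     # = 50 GB/s of raw bandwidth
    kv_cache_gbytes = 40                       # hypothetical long-context cache

    print(kv_cache_gbytes / link_gbyte_per_s)  # ~0.8 s, before any protocol overhead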

krm01 9 hours ago||
Will this mean that when cost is more important than latency, replies will now take longer?

I’m not in favor of the ad model ChatGPT proposes. But business models like these suffer from similar traps.

If it works for them, then the logical next step is to convert more users to fast mode, which naturally means slowing things down for those who didn't pick/pay for it.

We’ve seen it with iPhones being slowed down to make the newer model seem faster.

Not saying it’ll happen. I love Claude. But these business models almost always invite dark patterns in order to move the bottom line.

blackqueeriroh 2 hours ago|
No, we’ve actually never seen that in iPhones. There is zero proof of this.
esafak 8 hours ago|
It's a good way to address the price-insensitive segment. As long as they don't slow down the rest, good move.
digiown 4 hours ago|
This sounds like one of those theme-park "skip the queue" tickets. It will absolutely slow down the rest.