Posted by surprisetalk 8 hours ago
- Turn down the thinking token budget to one half
- Multiply the thinking tokens by 2 on the usage stats returned
- Phew! Twice the speed
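To spell out the arithmetic of the joke above, here's a minimal sketch. It is a purely hypothetical scenario, not a claim about how any provider actually meters usage; `serve`, `thinking_budget`, `tokens_per_second`, and `inflate` are made-up names for illustration.

```python
def serve(thinking_budget: int, tokens_per_second: float, inflate: int = 1):
    """Return (latency in seconds, thinking tokens reported to the user).

    Hypothetical model: latency scales linearly with the tokens actually
    generated, and the reported count is the real count times `inflate`.
    """
    latency = thinking_budget / tokens_per_second
    reported = thinking_budget * inflate
    return latency, reported


# Honest serving: 10k thinking tokens, reported as-is.
honest = serve(thinking_budget=10_000, tokens_per_second=100.0)

# The joke: half the real budget, reported count doubled.
joke = serve(thinking_budget=5_000, tokens_per_second=100.0, inflate=2)
```

Under these assumptions both calls report the same token count, but the second one finishes in half the time, and the user has no way to tell the difference from the usage stats alone.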
IMO charging for the thinking tokens that you can't see is a scam.
But if you just ask a question or something it’ll take a while to spend a million tokens…
The deadline piece is really interesting. I suppose there are a lot of people now who are basically limited by how fast their agents can run and who are on very aggressive timelines, with funders breathing down their necks?
How would it not be a big unlock? If the answers were instant I could stay focused and iterate even faster instead of having a back-and-forth.
Right now even medium requests can take 1-2 minutes and significant work can take even longer. I can usually make some progress on a code review, read more docs, or do a tiny chunk of productive work but the constant context switching back and forth every 60s is draining.
current speeds are "ask it to do a thing and then you the human need to find something else to do for minutes (or more!) while it works". at a certain point of it being faster, you just sit there, tell it to do a thing, it does it, and you just constantly work on the one thing.
cerebras is just about fast enough for that already, with the downside of being more expensive and worse at coding than claude code.
it feels like absolute magic to use though.
so, depends how you price your own context switches, really.
Obviously they can't make promises but I'd still like a rough indication of how much this might improve the speed of responses.
> Fast mode usage is billed directly to extra usage, even if you have remaining usage on your plan. This means fast mode tokens do not count against your plan’s included usage and are charged at the fast mode rate from the first token.
I can't imagine how quickly Fast Mode burns through credit.
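A rough sketch of the billing rule quoted above, as I read it. The rate and token counts are placeholders I made up, not real prices, and `Account`, `bill`, and `fast_rate_per_mtok` are hypothetical names; how normal usage overflows past the plan isn't modeled.

```python
from dataclasses import dataclass


@dataclass
class Account:
    plan_tokens_left: int          # included usage remaining on the plan
    extra_usage_cost: float = 0.0  # dollars billed on top of the plan


def bill(account: Account, tokens: int, fast: bool,
         fast_rate_per_mtok: float) -> Account:
    """Apply one request's usage to the account (hypothetical model)."""
    if fast:
        # Fast mode skips the plan entirely: charged from the first token.
        account.extra_usage_cost += tokens / 1_000_000 * fast_rate_per_mtok
    else:
        # Normal mode draws down included usage (overflow not modeled).
        account.plan_tokens_left = max(0, account.plan_tokens_left - tokens)
    return account


# Placeholder numbers: 5M tokens left on the plan, made-up $100 per 1M tokens.
acct = Account(plan_tokens_left=5_000_000)
bill(acct, tokens=1_000_000, fast=True, fast_rate_per_mtok=100.0)
```

With those placeholder numbers, the plan still shows 5M tokens left after the request, while the extra-usage bill is already at $100. That's the part worth keeping in mind: remaining plan usage doesn't cushion fast mode at all.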
Opus fast mode is routed to different servers with different tuning that prioritizes individual response throughput. Same model served differently. Same response, just delivered faster.
The Gemini fast mode is a different model (most likely) with different levels of thinking applied. Very different response.