Posted by fittingopposite 6 hours ago
i.e. intelligence per token, and then tokens per second
My current feel is that if Sonnet 4.6 were 5x faster than Opus 4.6, I'd primarily be using Sonnet 4.6. That wasn't true for me with prior model generations, where the Sonnet-class models didn't feel good enough compared to the Opus-class models. And it might shift again when I'm doing things that feel more intelligence-bottlenecked.
But fast responses have an advantage of their own: they give you faster iteration. It's kind of like how I used to like OpenAI Deep Research, but switched to o3-thinking with web search enabled once that came out, because it gave 80% of the thoroughness in 20% of the time, which tended to be better overall.
Also, I put together a little research paper recently--I think there's probably an underexplored option of "Use frontier AR model for a little bit of planning then switch to diffusion for generating the rest." You can get really good improvements with diffusion models! https://estsauver.com/think-first-diffuse-fast.pdf
Cerebras requires a $3K/year membership to use APIs.
Groq's been dead for about 6 months, even pre-acquisition.
I hope Inception is going well; it's the only real democratic target at this. Gemini 2.5 Flash Lite was promising, but it never really went anywhere, even by the standards of a Google preview.
If you're a poor schmoke like me, you'd be thinking of them as API vendors of ~1000 token/s LLMs.
Especially because Inception v1's been out for a while and we haven't seen a follow-the-leader effect.
Coincidentally, that's one of my biggest questions: why not?
Something about that Nvidia sale smelled funny to me because the # was yuge, yet the software side shut down well before the acquisition.
But that's 100% speculation, wouldn't be shocked if it was:
"We were never looking to become profitable just on API users, but we had to have it to stay visible. So, yeah, once it was clear an Nvidia sale was going through, we stopped working 16 hours a day, and now we're waiting to see what Nvidia wants to do with the API"
But to be clear, 1000 tokens/second is WAY better. Anthropic's Haiku serves at ~50 tokens per second.
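To put rough numbers on that, here's a back-of-envelope sketch. The 2,000-token response length is my own assumption for illustration; the two throughput figures are just the ones quoted above, and time-to-first-token is ignored.

```python
def generation_seconds(num_tokens: int, tokens_per_second: float) -> float:
    """Time to stream num_tokens at a steady decode rate (ignores time-to-first-token)."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return num_tokens / tokens_per_second


RESPONSE_TOKENS = 2_000  # assumed response length, illustration only

slow = generation_seconds(RESPONSE_TOKENS, 50)     # ~50 tok/s figure quoted above
fast = generation_seconds(RESPONSE_TOKENS, 1_000)  # ~1000 tok/s figure quoted above

print(f"50 tok/s: {slow:.0f}s, 1000 tok/s: {fast:.0f}s, speedup: {slow / fast:.0f}x")
```

Under those assumptions, the same response goes from a 40-second wait to about 2 seconds, which is the difference between context-switching away and staying in the loop.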
(I can also see a world where it just doesn't make sense to share most of the layers/infra and you diverge, but curious how you all see the approach.)
At the moment I’m loving Opus 4.6, but I have no idea if its extra intelligence makes it worth using over Sonnet. Some data would be great!
Imagine the quality of life upgrade of getting compaction down to a few second blip, or the "Explore" going 20 times faster! As these models get better, it will be super exciting!
The open question for me is whether the quality ceiling is high enough for cases where the bottleneck is actually reasoning, not iteration speed. volodia's framing of it as a "fast agent" model (comparable tier to Haiku 4.5) is honest -- for the tasks that fit that tier, the 5x speed advantage is genuinely interesting.
There are also more advanced approaches, for example FlexMDM, which essentially predicts length of the "canvas" as it "paints tokens" on it.
https://gist.github.com/nlothian/cf9725e6ebc99219f480e0b72b3...
What causes this?
Is its agentic accuracy good enough to operate, say, coding agents without needing a larger model to do more difficult tasks?
We’re not positioning it as competing with the largest models (Opus 4.5, etc.) on hardest-case reasoning. It’s more of a “fast agent” model (like Composer in Cursor, or Haiku 4.5 in some IDEs): strong on common coding and tool-use tasks, and providing very quick iteration loops.
And a pop-up error of: "The string did not match the expected pattern."
That happened three times, then the interface stopped working.
I was hoping to see how this stacked up against the Taalas demo, which worked well and was so fast every time I've hit it this past week.
Other labs like Google have them, but they have simply trailed the Pareto frontier for the vast majority of use cases.
Here's more detail on how price/performance stacks up
On speed/quality, diffusion has actually moved the frontier. At comparable quality levels, Mercury is >5× faster than similar AR models (including the ones referenced on the AA page). So for a fixed quality target, you can get meaningfully higher throughput.
That said, I agree diffusion models today don’t yet match the very largest AR systems (Opus, Gemini Pro, etc.) on absolute intelligence. That’s not surprising: we’re starting from smaller models and gradually scaling up. The roadmap is to scale intelligence while preserving the large inference-time advantage.
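For anyone who wants to check the "moved the frontier" claim against their own benchmark numbers, here is a minimal sketch of a speed/quality Pareto filter. The `Model` type, the model names, and every number below are hypothetical placeholders, not measurements of Mercury or any other real model.

```python
from typing import NamedTuple


class Model(NamedTuple):
    name: str
    tokens_per_second: float
    quality: float  # any benchmark score where higher is better


def pareto_frontier(models: list[Model]) -> list[Model]:
    """Return the models that no other model beats on both speed and quality."""
    frontier = []
    for m in models:
        dominated = any(
            o.tokens_per_second >= m.tokens_per_second
            and o.quality >= m.quality
            and (o.tokens_per_second > m.tokens_per_second or o.quality > m.quality)
            for o in models
        )
        if not dominated:
            frontier.append(m)
    return sorted(frontier, key=lambda m: m.tokens_per_second)


# Placeholder points, NOT measurements of any real model.
models = [
    Model("large-ar", 50, 90),
    Model("small-ar", 200, 70),
    Model("diffusion", 1000, 70),
    Model("tiny-ar", 300, 40),
]

print([m.name for m in pareto_frontier(models)])  # ['large-ar', 'diffusion']
```

In this toy setup the faster model at equal quality pushes "small-ar" off the frontier, while "large-ar" stays on it because nothing else matches its quality. That is the shape of the argument above: a win at a fixed quality tier, not at the top end.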
It looks like they are offering this in the form of "Mercury Edit" and I'm keen to try it.
Assuming that's what is causing this. They might show some kind of feedback when it actually makes it out of the queue.