
Posted by fittingopposite 6 hours ago

Mercury 2: The fastest reasoning LLM, powered by diffusion(www.inceptionlabs.ai)
137 points | 74 comments
cjbarber 4 hours ago|
It could be interesting to measure intelligence per second.

ie intelligence per token, multiplied by tokens per second

My current feel is that if Sonnet 4.6 was 5x faster than Opus 4.6, I'd be primarily using Sonnet 4.6. But that wasn't true for me with prior model generations, in those generations the Sonnet class models didn't feel good enough compared to the Opus class models. And it might shift again when I'm doing things that feel more intelligence bottlenecked.

But fast responses have an advantage of their own: they give you faster iteration. Kind of like how I used to like OpenAI Deep Research, but then switched to o3-thinking with web search enabled after that came out, because it was 80% of the thoroughness in 20% of the time, which tended to be better overall.
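The metric proposed here is just quality-per-token multiplied by tokens-per-second. A toy sketch, with the caveat that every number below is an illustrative placeholder, not a measured benchmark:

```python
# "Intelligence per second" = (quality per token) * (tokens per second).
# All scores and speeds below are made-up placeholders for illustration.

def intelligence_per_second(quality: float, tokens_per_second: float) -> float:
    return quality * tokens_per_second

models = {
    "big-slow":   {"quality": 0.90, "tps": 50.0},    # smarter, slower
    "small-fast": {"quality": 0.75, "tps": 1000.0},  # weaker, 20x faster
}

for name, m in models.items():
    print(f"{name}: {intelligence_per_second(m['quality'], m['tps']):.0f}")
```

Under these (invented) numbers the fast model wins by a wide margin, which matches the intuition that a modest quality gap can be swamped by a 20x speed gap.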

estsauver 2 hours ago||
I think there's clearly a "speed is a quality of its own" axis. When you use Cerebras (or Groq) to develop an API, the turnaround speed of iterating on jobs is so much faster (and cheaper!) than using frontier high-intelligence labs that it's almost a different product.

Also, I put together a little research paper recently--I think there's probably an underexplored option of "Use frontier AR model for a little bit of planning then switch to diffusion for generating the rest." You can get really good improvements with diffusion models! https://estsauver.com/think-first-diffuse-fast.pdf

refulgentis 2 hours ago||
I'm very worried for both.

Cerebras requires a $3K/year membership to use APIs.

Groq's been dead for about 6 months, even pre-acquisition.

I hope Inception is going well; it's the only real democratic shot at this. Gemini 2.5 Flash Lite was promising, but it never really went anywhere, even by the standards of a Google preview.

behnamoh 10 minutes ago|||
Once again, it's a tech that Google created but never turned into a product. AFAIK, in their demo last year Google showed a special version of Gemini that used diffusion. They were so excited about it (on stage), and I thought that's what they'd use in Google Search and Gmail.
ainch 2 hours ago||||
I don't think it's a good comparison given Inception work on software and Cerebras/Groq work on hardware. If Inception demonstrate that diffusion LLMs work well at scale (at a reasonable price) then we can probably expect all the other frontier labs to copy them quickly, similarly to OpenAI's reasoning models.
refulgentis 2 hours ago||
Definitely depends on what you're buying, maybe some of the audience here was buying Groq and Cerebras chips? I don't think they sold them but can't say for sure.

If you're a poor schmuck like me, you'd be thinking of them as API vendors of ~1000 token/s LLMs.

Especially because Inception v1's been out for a while and we haven't seen a follow-the-leader effect.

Coincidentally, that's one of my biggest questions: why not?

nl 2 hours ago||||
Taalas is interesting. 16,000 TPS for Llama on a chip.

https://taalas.com/

freeqaz 2 hours ago||||
You can call Cerebras APIs via OpenRouter if you specify them as the provider in your request fyi. It's a bit pricier but it exists!
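The routing tip above can be sketched as a request payload. This is a hedged sketch: the `provider` routing object and its `order` / `allow_fallbacks` fields follow my reading of OpenRouter's docs and may have drifted, and the model slug is just an example of a model Cerebras serves.

```python
import json

# Hypothetical OpenRouter chat-completions payload pinned to Cerebras.
# Field names under "provider" are OpenRouter's provider-routing options
# as I understand them; verify against the current docs before relying on this.
payload = {
    "model": "meta-llama/llama-3.3-70b-instruct",   # example Cerebras-served model
    "messages": [{"role": "user", "content": "Hello"}],
    "provider": {
        "order": ["cerebras"],      # try Cerebras first
        "allow_fallbacks": False,   # fail rather than silently use a slower host
    },
}

# This is the JSON body you'd POST to the OpenRouter chat completions endpoint
# with an Authorization: Bearer <key> header.
body = json.dumps(payload)
```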
andai 1 hour ago||
I used their API normally (pay per token) a few weeks ago. Their Coding Plan appears to be permanently sold out though.
estsauver 1 hour ago||||
I am currently using their APIs on a paygo plan, I think it might just be a capacity issue for new sign ups.
7thpower 2 hours ago|||
What do you mean by Groq being dead for about 6 months? Not refuting your point, but I’m curious.
refulgentis 2 hours ago||
No new model since GPT-OSS 120B, er, maybe Kimi K2 non-thinking? Basically there were a couple of models it would normally obviously support, and it didn't.

Something about that Nvidia sale smelled funny to me because the # was yuge, yet, the software side shut down decently before the acquisition.

But that's 100% speculation, wouldn't be shocked if it was:

"We were never looking to become profitable just on API users, but we had to have it to stay visible. So, yeah, once it was clear an Nvidia sale was going through, we stopped working 16 hours a day, and now we're waiting to see what Nvidia wants to do with the API"

bigbuppo 3 hours ago|||
Maybe make that intelligence per token per relative unit of hardware per watt. If you're burning 30 tons of coal to be 0.0000000001% better than the 5 tons of coal option because you're throwing more hardware at it, well, it's not much of a real improvement.
estsauver 1 hour ago||
I think the fast inference options have historically been only marginally more expensive than their slow cousins. There's a whole set of research about optimal efficiency, speed, and intelligence pareto curves. If you can deliver even an outdated low intelligence/old model at high efficiency, everyone will be interested. If you can deliver a model very fast, everyone will be interested. (If you can deliver a very smart model, everyone is obviously the most interested, but that's the free space.)

But to be clear, 1000 tokens/second is WAY better. Anthropic's Haiku serves at ~50 tokens per second.

dmichulke 15 minutes ago|||
Useful for evaluating people as well
volodia 2 hours ago|||
We agree! In fact, there is an emerging class of models aimed at fast agentic iteration (think of Composer, the Flash versions of proprietary and open models). We position Mercury 2 as a strong model in this category.
estsauver 1 hour ago||
Do you guys all think you'll be able to convert open source models to diffusion models relatively cheaply, à la the d1 / LLaDA series of papers? If so, that seems like an extremely powerful story where you get to retool the much, much larger capex of open models into high performance diffusion models.

(I can also see a world where it just doesn't make sense to share most of the layers/infra and you diverge, but curious how you all see the approach.)

josephg 3 hours ago|||
Yeah I agree with this. We might be able to benchmark it soon (if we can’t already) by asking different agentic code models to produce some relatively simple pieces of software. Fast models can iterate faster. Big models will write better code on the first attempt, and need less loop debugging. Who will win?

At the moment I’m loving opus 4.6 but I have no idea if its extra intelligence makes it worth using over sonnet. Some data would be great!

estsauver 1 hour ago||
For what it's worth, most people are already doing this! Some of the subagents in Claude Code (Explore, I think even compaction) default to Haiku, and then you have to manually override it with an env variable if you want to change it.

Imagine the quality of life upgrade of getting compaction down to a few second blip, or the "Explore" going 20 times faster! As these models get better, it will be super exciting!

nubg 3 hours ago||
Interesting perspective. Perhaps the user would also adapt his queries knowing he can only do small (but very fast) steps. I wonder who would win!
vicchenai 28 minutes ago||
The iteration speed advantage is real but context-specific. For agentic workloads where you're running loops over structured data -- say, validating outputs or exploring a dataset across many small calls -- the latency difference between a 50 tok/s model and a 1000+ tok/s one compounds fast. What would take 10 minutes wall-clock becomes under a minute, which changes how you prototype.

The open question for me is whether the quality ceiling is high enough for cases where the bottleneck is actually reasoning, not iteration speed. volodia's framing of it as a "fast agent" model (comparable tier to Haiku 4.5) is honest -- for the tasks that fit that tier, the 5x speed advantage is genuinely interesting.
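The wall-clock claim above is easy to sanity-check. A back-of-envelope sketch, where only the two tok/s rates come from the thread; the call count, tokens per call, and per-call overhead are assumptions:

```python
# Sequential agent loop: N calls, each generating ~T tokens, plus a fixed
# per-call overhead. Assumed workload numbers; only the tok/s rates are
# from the discussion above.

def loop_seconds(calls, tokens_per_call, tok_per_s, overhead_s=0.5):
    return calls * (tokens_per_call / tok_per_s + overhead_s)

slow = loop_seconds(calls=100, tokens_per_call=300, tok_per_s=50)    # ~11 min
fast = loop_seconds(calls=100, tokens_per_call=300, tok_per_s=1000)  # ~1.3 min
```

Note the fixed overhead starts to dominate at 1000+ tok/s, which is why the speedup compounds less than the raw 20x ratio would suggest.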

volodia 3 hours ago||
Co-founder / Chief Scientist at Inception here. If helpful, I’m happy to answer technical questions about Mercury 2 or diffusion LMs more broadly.
nowittyusername 2 hours ago||
How does the whole KV cache situation work for diffusion models? Like, are there latency and computation/monetary savings for caching? Is the curve similar to autoregressive caching options? Or maybe such things don't apply at all and you can just mess with the system prompt and dynamically change it every turn because there's no savings to be had? Or maybe you can make dynamic changes to the head but also get cache savings because of the diffusion-based architecture?... so many ideas...
volodia 2 hours ago||
There are many ways to do it, but the simplest approach is block diffusion: https://m-arriola.com/bd3lms/

There are also more advanced approaches, for example FlexMDM, which essentially predicts length of the "canvas" as it "paints tokens" on it.
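A toy of the block-diffusion control flow linked above: generation proceeds left to right across blocks (so completed blocks can be KV-cached much like an AR model), while positions within a block are denoised in parallel over a few steps. The denoiser here is a fake stand-in, purely to show the loop structure, not anything Inception actually runs:

```python
# Toy block-diffusion decoding loop. Finished blocks are frozen and appended
# to the context (cacheable, like AR prefixes); tokens *within* the current
# block are filled in parallel across iterative denoising passes.

MASK = -1

def denoise_step(block, context):
    # Fake denoiser: fills every still-masked position at once.
    # A real model would unmask only high-confidence positions per step.
    return [len(context) + i if tok == MASK else tok for i, tok in enumerate(block)]

def block_diffusion_generate(num_blocks=3, block_size=4, steps=2):
    context = []                      # finished tokens; KV states reusable
    for _ in range(num_blocks):
        block = [MASK] * block_size
        for _ in range(steps):        # iterative denoising passes
            block = denoise_step(block, context)
        context.extend(block)         # block frozen -> cacheable prefix
    return context

print(block_diffusion_generate())     # tokens generated block by block
```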

nl 2 hours ago|||
I had a very odd interaction somewhat similar to how weak transformer models get into a loop:

https://gist.github.com/nlothian/cf9725e6ebc99219f480e0b72b3...

What causes this?

volodia 2 hours ago||
This looks like an inference glitch that we are working on fixing, thank you for flagging.
kristianp 3 hours ago|||
How big is Mercury 2? How many tokens is it trained on?

Is its agentic accuracy good enough to operate, say, coding agents without needing a larger model for more difficult tasks?

volodia 3 hours ago||
You can think of Mercury 2 as roughly in the same intelligence tier as other speed-optimized models (e.g., Haiku 4.5, Grok Fast, GPT-Mini–class systems). The main differentiator is latency — it’s ~5× faster at comparable quality.

We’re not positioning it as competing with the largest models (Opus 4.5, etc.) on hardest-case reasoning. It’s more of a “fast agent” model (like Composer in Cursor, or Haiku 4.5 in some IDEs): strong on common coding and tool-use tasks, and providing very quick iteration loops.

xanth 2 hours ago|||
Are you dogfooding it on simple tasks? If so what do you use it for regularly and what do you avoid?
nayroclade 2 hours ago|||
Is the approach fundamentally limited to smaller models? Or could you theoretically train a model as powerful as the largest models, but much faster?
techbro92 2 hours ago|||
Do you think you will be moving towards drifting models in the future for even more speed?
volodia 2 hours ago||
Not imminently, but hard to predict where the field will go
CamperBob2 3 hours ago||
Seems to work pretty well, and it's especially interesting to see answers pop up so quickly! It is easily fooled by the usual trick questions about car washes and such, but seems on par with the better open models when I ask it math/engineering questions, and is obviously much faster.
volodia 3 hours ago||
Thanks for trying it and for the thoughtful feedback, really appreciate it. And we’re actively working on improving quality further as we scale the models.
rancar2 40 minutes ago||
My attempt with trying one of their OOTB prompts in the demo https://chat.inceptionlabs.ai resulted in: "The server is currently overloaded. Please try again in a moment."

And a pop-up error of: "The string did not match the expected pattern."

That happened three times, then the interface stopped working.

I was hoping to see how this stacked up against the Taalas demo, which worked well and was fast every time I hit it this past week.

dvt 5 hours ago||
What excites me most about these new four-figure-tokens-per-second models is that you can essentially do multi-shot prompting (+ nudging) without the user even feeling it, potentially fixing some of the weird hallucinatory/non-deterministic behavior we sometimes end up with.
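The multi-shot-plus-nudging idea reduces to a small retry loop. In this sketch, `generate` and `validate` are hypothetical caller-supplied stand-ins for a fast model call and a deterministic output checker; nothing here is a real API:

```python
# Best-of-several-attempts loop: affordable when each attempt decodes at
# ~1000 tok/s. On a failed validation we "nudge" by appending feedback to
# the prompt and trying again.

def multi_shot(prompt, generate, validate, max_attempts=3):
    answer = None
    for _ in range(max_attempts):
        answer = generate(prompt)
        ok, feedback = validate(answer)
        if ok:
            return answer
        prompt = f"{prompt}\n\nPrevious attempt failed: {feedback}. Fix it."
    return answer  # best effort after exhausting attempts
```

At 1000 tok/s, three ~300-token attempts still finish in about a second of decode time, so the retries stay invisible to the user.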
volodia 2 hours ago||
That is also our view! We see Mercury 2 as enabling very fast iteration for agentic tasks. A single shot at a problem might be less accurate, but because the model has a shorter execution time, it enables users to iterate much more quickly.
lostmsu 1 hour ago||
Regular models are very fast if you do batch inference. GPT-OSS 20B gets close to 2k tok/s on a single 3090 at bs=64 (might be misremembering details here).
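Worth separating aggregate from per-stream throughput here: batching raises total tokens/s across requests, but a single request only feels the per-stream rate. Quick arithmetic using the (self-admittedly fuzzy) numbers quoted above:

```python
# Aggregate batched throughput vs. what any one request experiences.
# Figures are from the comment above, which the author flags as approximate.
aggregate_tps = 2000.0                        # ~2k tok/s total at batch size 64
batch_size = 64
per_stream_tps = aggregate_tps / batch_size   # tok/s felt by each request
```

So the batched 3090 setup is great for offline throughput, but each individual stream still decodes at roughly Haiku-class speed, which is a different product than 1000 tok/s on a single request.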
dmix 30 minutes ago||
I tried Mercury 1 in Zed for inline completions and it was significantly slower than Cursor's autocomplete. That's a big reason why I switched back to Cursor (free) + Claude Code.
nylonstrung 3 hours ago||
I'm not sold on diffusion models.

Other labs like Google have them but they have simply trailed the Pareto frontier for the vast majority of use cases

Here's more detail on how price/performance stacks up

https://artificialanalysis.ai/models/mercury-2

volodia 3 hours ago||
I’d push back a bit on the Pareto point.

On speed/quality, diffusion has actually moved the frontier. At comparable quality levels, Mercury is >5× faster than similar AR models (including the ones referenced on the AA page). So for a fixed quality target, you can get meaningfully higher throughput.

That said, I agree diffusion models today don’t yet match the very largest AR systems (Opus, Gemini Pro, etc.) on absolute intelligence. That’s not surprising: we’re starting from smaller models and gradually scaling up. The roadmap is to scale intelligence while preserving the large inference-time advantage.

nylonstrung 1 hour ago|||
I changed my mind: this would be perfect for a fast edit model ala Morph Fast Apply https://www.morphllm.com/products/fastapply

It looks like they are offering this in the form of "Mercury Edit" and I'm keen to try it.

ainch 2 hours ago||
This understates the possible headroom as technical challenges are addressed - text diffusion is significantly less developed than autoregression with transformers, and Inception are breaking new ground.
nylonstrung 1 hour ago||
Very good point. If as much energy/money as has gone into ChatGPT-style transformer LLMs were put into diffusion, there's a good chance it would outperform in every dimension.
ilaksh 4 hours ago||
It seems like the chat demo is really suffering from the effect of everything going into a queue. You can't actually tell that it is fast at all. The latency is not good.

Assuming that's what is causing this. They might show some kind of feedback when it actually makes it out of the queue.

volodia 3 hours ago|
Thank you for your patience. We are working to handle the surge in demand.
serjester 2 hours ago||
There's a potentially amazing use case here around parsing PDFs to markdown. It seems like a task with insane volume requirements, low budget, and the kind of thing that doesn't benefit much from autoregression. Would be very curious if your team has explored this.
nowittyusername 2 hours ago|
Nice, I'm excited to try this for my voice agent, at worst it could be used to power the human facing agent for latency reduction.
volodia 2 hours ago|
Would love to hear about your experience. Send us an email.