Posted by jxmorris12 3 days ago

Defeating Nondeterminism in LLM Inference (thinkingmachines.ai)
326 points | 131 comments | page 2
syntaxing 3 days ago|
Super interesting. For those unaware, this is the company Mira Murati (OpenAI's former CTO) started.
mg 3 days ago||
I really hope we will get deterministic LLMs in the future, even if it means slightly slower response times.

Nondeterminism is what currently keeps me from working with other developers.

As I wrote in "Prompt Coding" [1], these days I am not looking for good code. I am looking for prompts that create good code. But how do you share prompts among developers when they produce different code every time? You cannot simply state "Here, I found a prompt that makes gpt-5-2025-08-07 output a solution with all the desired attributes".

It's similar with images. At the moment, for most image models, you cannot outsource the task of writing prompts that create the desired images, because most image models will not create the same image when given the same prompt and parameters.

[1]: https://www.gibney.org/prompt_coding

p1necone 3 days ago||
Surely if you end up relying on a given prompt to produce the exact same code every time, you should instead just check that code into source control the first time you generate it?

A deterministic LLM isn't going to behave appreciably differently from a non-deterministic one if your input or context varies by even a tiny bit (pun intended) each time.

skybrian 2 days ago||
If nothing has changed, caching the result would certainly be cheaper. But if you're doing that as part of a test, it's not really running the test and it might defeat the purpose of the test.
khimaros 3 days ago|||
I tried to create a Makefile-driven workflow based on this idea and ended up with https://github.com/khimaros/enc -- it suffers from the issues you raised.

I'm hoping that it becomes more useful as models improve and become more reliable at producing working code (though determinism would be great for improving prompts).

xnx 2 days ago||
> most image models will not create the same image when given the same prompt and parameters.

Really? If you include the seed as one of the parameters, most produce pixel-identical output.

E.g. "Generate deterministic images" https://cloud.google.com/vertex-ai/generative-ai/docs/image/...

kybernetikos 3 days ago||
For fun over the last few days, I've built a compressor/decompressor that uses the logits from an LLM: for each token in the input, it takes the token's rank under the model's predicted distribution and exponential-Golomb encodes it. Then you work in reverse to regenerate the original.

It took me ages to get the prediction for the second token after "hello" to match the prediction for the second token when running the model on the string "hello world", despite the fact that I was using a causal model. I tried all kinds of things before discovering that `quantized: false` was the important setting.
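
The encoder core looks roughly like this (a minimal sketch using Hugging Face transformers and GPT-2; the names and details are illustrative, not my exact code):

    # Sketch: rank of each true next token under the model's prediction.
    # Low ranks (the model guessed well) become short codes downstream.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tok = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

    def token_ranks(text):
        ids = tok(text, return_tensors="pt").input_ids   # [1, seq]
        with torch.no_grad():
            logits = model(ids).logits                   # [1, seq, vocab]
        ranks = []
        for pos in range(ids.shape[1] - 1):
            order = logits[0, pos].argsort(descending=True)
            true_id = ids[0, pos + 1]
            ranks.append((order == true_id).nonzero().item())
        return ranks   # the decoder reruns the model and inverts this mapping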

giveita 3 days ago|
What's the Weissman score? Or more seriously :) did it perform well? Sounds like it should. If more and more text is AI slop, it should do well.

I don't fully understand what you said, but I guess higher-probability logits are encoded with fewer bits. If your text is LLM output, then you may only need a bit or two per token?

kybernetikos 3 days ago||
I used exponential-Golomb coding, so the rank-0 logit is encoded with a single bit, ranks 1 and 2 are encoded with three bits, ranks 3-6 are encoded with 5 bits, etc.
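
Concretely, the order-0 exp-Golomb codec is tiny (a rough sketch, not my exact implementation):

    # Order-0 exponential-Golomb: n=0 -> 1 bit, n in 1..2 -> 3 bits, n in 3..6 -> 5 bits, ...
    def eg_encode(n):
        x = n + 1
        prefix = x.bit_length() - 1           # number of leading zeros
        return "0" * prefix + bin(x)[2:]      # zeros, then x in binary

    def eg_decode(bits):
        zeros = 0
        while bits[zeros] == "0":
            zeros += 1
        value = int(bits[zeros:2 * zeros + 1], 2) - 1
        return value, bits[2 * zeros + 1:]    # decoded rank, remaining bitstream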

In terms of performance, I've not done any serious testing, but e.g. the Wikipedia article on volcanoes compresses to about 20% of its original size using GPT-2. I've seen other strings compress even further.

The big issue is that while encoding is not unreasonable, decoding any significant amount of data is incredibly slow, since I'm doing a model run for every token in the output. It's bad enough that the scheme is probably unworkable as it is. I'm thinking about changing my code so that it streams out the tokens as it decodes them, so you're not just left there waiting for ages.

frotaur 2 days ago||
I don't know about Golomb coding, but with arithmetic coding (AC) you can do stream decoding, if I remember correctly.

I supervised a student's project whose goal was exactly that: implement compression with LLMs using AC.

Since AC is essentially optimal, if your LLM has an average cross-entropy of x nats on some dataset, you can expect the compressor to use about x nats per token on average!
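
Back-of-the-envelope version of that bound (hypothetical numbers, just to show the nats-to-bits conversion):

    import math

    # Suppose the model averages 2.5 nats of cross-entropy per token on some text.
    cross_entropy_nats = 2.5
    bits_per_token = cross_entropy_nats / math.log(2)   # ~3.61 bits/token
    print(f"~{bits_per_token:.2f} bits per token with an ideal arithmetic coder")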

kybernetikos 2 days ago||
Arithmetic coding looks like an extremely interesting approach, given that you can use the model at each step to give you the probabilities of each token.
eldenring 3 days ago||
Very impressive! I guess this still wouldn't affect their original example

> For example, you might observe that asking ChatGPT the same question multiple times provides different results.

even with 0.0 temperature, due to MoE models routing at the batch level, and you're very unlikely to get a deterministic batch.

> Not because we’re somehow leaking information across batches — instead, it’s because our forward pass lacks “batch invariance”, causing our request’s output to depend on the batch size of our forward pass.

The router also leaks batch-level information across sequences.

boroboro4 3 days ago|
> even with 0.0 temperature, due to MoE models routing at the batch level, and you're very unlikely to get a deterministic batch.

I don’t think this is correct - MoE routing happens on a per-token basis. It can be non-deterministic and batch-related if you try to balance out your experts’ load within a batch, but that’s a performance optimization (just like everything in the blog post) and not the way models are trained to work.
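
For reference, plain token-choice top-k routing looks something like this (a schematic sketch; real implementations add load-balancing losses, capacity limits, etc.):

    import torch

    def route_tokens(hidden, router_w, k=2):
        # hidden: [num_tokens, d_model], router_w: [d_model, num_experts]
        logits = hidden @ router_w                   # each token is scored independently
        top_vals, top_idx = logits.topk(k, dim=-1)   # per-token expert choice
        gates = torch.softmax(top_vals, dim=-1)      # mixing weights for the chosen experts
        return top_idx, gates   # no dependence on which other tokens share the batch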

eldenring 3 days ago||
Ah interesting, good point. So I guess expert-choice routing leaks across the batch. Now I'm not sure.
gajjanag 2 days ago||
As others have pointed out, these phenomena are well known to many folks across companies in the AI infra space. It doesn't really break new ground. This article is a good exposition of the basic strategies though.

What I would have loved is a discussion of collectives/multi-node setups, showing how to get determinism at a low performance penalty for multi-node reduction collectives.

quantum_state 3 days ago||
At the bottom of LLM inference is sampling the next token from the probability distribution conditioned on the tokens currently in the context window. If the distribution is degenerate, i.e. assigns equal probability to more than one token, the outcome of the sampling will naturally, as it should, be nondeterministic. It should be left alone.
orbital-decay 2 days ago||
By setting the temperature to 0 you get greedy decoding, which does a lot more than just making it predictable, and can degrade outputs. Random sampling exists for a reason! Gemini 2.5 Pro in particular doesn't like temp 0, for example.

Focus on correctness, not determinism.

empiko 2 days ago|
Determinism does not require temperature=0. You can have deterministic behavior even with >0 temperature as long as you fix your random seeds.
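
E.g. with a local model you can pass an explicit generator so repeated runs sample the same tokens (a sketch with made-up logits):

    import torch

    logits = torch.tensor([2.0, 1.0, 0.5, 0.1])   # made-up next-token logits
    temperature = 0.8

    def sample(seed):
        gen = torch.Generator().manual_seed(seed)
        probs = torch.softmax(logits / temperature, dim=-1)
        return torch.multinomial(probs, num_samples=1, generator=gen).item()

    assert sample(42) == sample(42)   # same seed -> same token, even at temperature > 0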
cubefox 3 days ago||
His solution still relies on greedy (temperature 0) sampling, which is probably not optimal for model performance on various tasks. For example, Gemini 2.5 uses temperature 1 by default. But deterministic inference with temperature >0 can still be achieved by using pseudorandom sampling with a fixed seed.
red2awn 3 days ago||
Conceptually, setting temperature >0 doesn't actually introduce any non-determinism. If your sampler is seeded, then it will always choose the same next token. Higher temperature only flattens the logit distribution.
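
Quick illustration that temperature is a deterministic reshaping of the distribution, with all the randomness living in the sampler's RNG (toy logits):

    import torch

    logits = torch.tensor([2.0, 1.0, 0.5, 0.1])
    for t in (0.5, 1.0, 2.0):
        probs = torch.softmax(logits / t, dim=-1)
        # higher t -> flatter distribution, but the same argmax and no randomness
        print(t, [f"{p:.3f}" for p in probs.tolist()])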
mynameismon 3 days ago||
The point of the blog post is that even with supposedly deterministic sampling settings, non-determinism creeps in. This in turn has disastrous effects in very real experiments.
cubefox 3 days ago||
My point is that greedy sampling is not just insufficient but also unnecessary for deterministic inference.
measurablefunc 3 days ago|
I think this means that the results might also be non-deterministic across hardware revisions, because I don't think they verified that the kernels will work the same on different GPU & TPU versions. How do they know that the compiler will not re-order the operations behind their back?
saagarjha 3 days ago||
Yes, there’s usually no guarantee on how different hardware does operations (for example, even if the hardware is correctly rounding intermediate results, different hardware may use different tile sizes). The reproducibility here is for runs on the same machine.

Compilers can also reorder operations, but in practice this is rarely an issue because kernels typically synchronize frequently, which limits the compiler’s ability to reorder things. This isn’t to say it doesn’t happen, but when it does it’s usually because the compiler itself changed; the code a given compiler generates is generally run-to-run identical.
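
A quick way to see the "different blocking, different result" effect on a CPU, no GPU needed (illustrative, unrelated to the article's kernels):

    import numpy as np

    x = np.random.randn(1_000_000).astype(np.float32)
    full = x.sum()                                    # one reduction order
    tiled = x.reshape(1000, 1000).sum(axis=1).sum()   # a different blocking of the same sum
    print(full == tiled, abs(full - tiled))           # typically False, by a tiny amount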

AlotOfReading 3 days ago|||
You can prevent reordering with sufficient amounts of compiler abuse.

Across hardware revisions, you're trying to ensure a consistent floating-point environment where the operations used are deterministic and applied in the same order with the same inputs. The best way to do that is to use operations that adhere to a mostly deterministic standard like IEEE 754.

TimorousBestie 3 days ago|||
Ensuring that the same floating-point workload behaves exactly the same on two distinct workstations is a heck of a lot of work that almost no one is willing to pay for.
measurablefunc 3 days ago||
Not only that, but heterogeneous clusters (inevitable at a large enough scale) will also have non-deterministic outputs. So it's great that they wrote kernels to make the forward pass deterministic, but getting rid of nondeterminism entirely at data-center scale would mean they'd also have to do this type of work across cluster nodes to maintain "cluster" invariance, not just batch invariance.
reliabilityguy 3 days ago||
> will not re-order the operations behind their back?

Valid point. Floating-point summation is not associative, so the order of additions changes the result.
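
A two-line illustration with plain Python doubles:

    a, b, c = 1e16, -1e16, 1.0
    print((a + b) + c)   # 1.0
    print(a + (b + c))   # 0.0 -- same inputs, different grouping, different result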
