Defeating Nondeterminism in LLM Inference

Posted by jxmorris12 3 days ago

Defeating Nondeterminism in LLM Inference(thinkingmachines.ai)

326 points | 131 comments

lsy 3 days ago|

Fixing "theoretical" nondeterminism for a totally closed individual input-output pair doesn't solve the two "practical" nondeterminism problems, where the exact same input gives different results given different preceding context, and where a slightly transformed input doesn't give a correctly transformed result.

Until those are addressed, closed-system nondeterminism doesn't really help except in cases where a lookup table would do just as well. You can't use "correct" unit tests or evaluation sets to prove anything about inputs you haven't tested.

kazinator 3 days ago||

There is no such thing as "exactly the same input, but with different preceding context". The preceding context is input!

If you were to obtain exactly the same output for a given input prompt, regardless of context, then that would mean that the context is being ignored, which is indistinguishable from the session not maintaining any context such that each prompt is in a brand new empty context.

Now what some people want is requirements like:

- The different wording of a prompt with exactly the same meaning should not change anything in the output; e.g. whether you say "What is the capital of France" or "What is France's capital" the answer should be verbatim identical.

- Prior context should not change responses in ways that don't have any interaction with the context. For instance, a prompt is given "what is 2 + 2", then the answer should always be the same, except if the context instructs the LLM that 2 + 2 is to be five.

These kinds of requirements betray a misunderstanding of what these LLMs are.

Zacharias030 2 days ago|||

While I get that this is how LLMs work, I think you should think backwards from the user / from what AI as a field is aiming for and recognize that the „naive“ way of the parent to ask for reliable responses no matter what the „context“ is, is exactly what a good AI system should offer.

„The context is the input“ betrays a misunderstanding of what (artificial) intelligence systems are aiming for.

Helmut10001 2 days ago||

Then we need something else. This is not how LLMs work. They are simple statistical predictors, now universal anwsering machines.

Zacharias030 2 days ago|||

I agree mostly. They are all that you say, but if you think about the conditional distribution that you are learning, there is nothing preventing us in principle from mapping different contexts to the same responses. It is rather a practical limitation that we don’t have sufficient tools of shaping these distributions very soundly. All we can do is throw data at them and hope that they generalize to similar contexts.

We have observed situations where agentic LLM traces on verifiable problems with deterministic (greedy) decoding lead to either completely correct or completely wrong solutions depending on the minutes on the clock which are printed as coincidental output of some tool that the LLM used.

I think there may be some mild fixes to current models available , for example it is worrying that the attention mechanism can never fully disregard any token in the input, because the softmax will always assign a > 0 weight everywhere (and the NN has no way of setting a logit to -infinity). This directly causes that it is extremely difficult for the LLM to fully ignore any part of the context reliably.

However Yann LeCun actually offers some persuasive arguments that autoregressive decoding has some limitations and we may need something better.

virgilp 2 days ago|||

> They are simple statistical predictors, now universal anwsering machines.

I see this a lot. I kinda' doubt the "simple" part, but even beyond that, is there any evidence that statistical predictor can't be a universal answering machine? I think there's plenty of evidence that our thinking is at least partially a statistical predictor (e.g. when you see a black sheep you don't think "at least one side of this sheep is black", you fully expect it to be black on both sides)

I'm not saying that LLMs _are_ universal answering machines. I'm wondering why people question that they are/they can become one, based on the argument that "fundamentally they are statistical predictors". So they are. So what?

enormousness 18 hours ago||

Does your definition of "universal answering machine" include the answers being correct?

If it does, statistical predictors can't help you because they're not always correct or even meaningful (correlation does not imply causation).

If it doesn't then, by all means, enjoy your infinite monkeys

Dylan16807 2 days ago||||

> These kinds of requirements betray a misunderstanding of what these LLMs are.

They do not. Refusing to bend your requirements to a system that can't satisfy them is not evidence of misunderstanding the system.

And if you tack on "with X 9s of reliability" then it is something LLMs can do. And in the real world every system has a reliability factor like that.

stubbornleaf 1 day ago||||

Sure. But the context always starts with the first input, right? And how can you guarantee—or why should you guarantee—that the reply to the first input will always be the same? And if that’s not the case, how can we ensure the preceding context remains consistent?

kjkjadksj 2 days ago||||

If an input along with the context generated some random seed or hash this would certainly be possible. Just paste your seed over to your coworker, they supply it to the model and it contains all contextual information.

skybrian 2 days ago|||

I wonder if there's a way to use an LLM to rewrite the prompt, standardizing the wording when two prompts mean the same thing?

kazinator 2 days ago|||

It's going to backfire. In real scenarios (not regression testing) users don't want to see the exact same thing twice out of the LLM in the same session in spite of trying to refine the result with more context.

There are going to be false positives: text that is subtly different from a previous response is misidentified as a duplicate such that the previous response is substituted for it, frustrating the user.

skybrian 2 days ago||

Google search rewrites misspelled search queries and also lets you override it if that's not what you want. Maybe something similar would work?

d4mi3n 2 days ago||||

Not an expert, but I've been told RAG in combination with a database of facts is one way to get more consistency here. Using one of the previous examples, you might have a knowledge store (usually a vector database of some kind) that contains a mapping of countries to capitols and the LLM would query it whenever it had to come up with an answer rather than relying on whatever was baked into the base model.

taneq 2 days ago||||

Deterministically, you mean? ;)

tonyhart7 2 days ago|||

oh so you want it to be thinking???? now we talking

raincole 2 days ago|||

> where the exact same input gives different results given different preceding context

Why and how is this a problem?

If 'preceding context' doesn't cause different results, it means you can simply discard the context. Why do I want that? It's not how I expect a tool to work (I expect vim responds differently to my input after I switch to the insert mode). It's absolutely not how I expect intelligence to work either. It sounds like the most extreme form of confirmation bias.

qcnguy 2 days ago|||

When the context is auto-generated and may include irrelevant data.

This is a common AI benchmark and has been for years before GPT-2 even existed. LLMs need to not get distracted by irrelevant facts and there are tests that measure this. It's the motivation for attention mechanisms, which are the breakthrough that enabled LLMs to scale up.

edflsafoiewq 2 days ago|||

An example is translation. I MTLed some text recently where the name of a (fictional) city was translated about a dozen different ways. Sometimes you'd get a calque, sometimes you'd get a transliteration (including several wrong ones). Ironically "dumb" MTLs are often much more consistent about this than LLMs.

saagarjha 3 days ago|||

This is really useful in reproducing bugs.

brookst 2 days ago||

I was with you until you said it “doesn’t really help”. Did you mean “doesn’t completely solve the problem “?

daralthus 2 days ago||

I thought this was pretty well known (at least in the JAX/XLA world). I've hit this many times and got batch variance explained to me before: https://github.com/google-deepmind/penzai/issues/82 and https://github.com/jax-ml/jax/issues/20047#issuecomment-1975...

Zacharias030 2 days ago|

should be the top comment.

dns_snek 2 days ago||

Why do you care about determinism in a probabilistic system? What difference does it make to the end user if the input "How do I X?" always produces the same deterministic output when semantically equivalent inputs "how do i x?", "how do I x", and "how do I X??" are bound to produce different answers that often won't even be semantically equivalent.

What LLMs need is the ability to guarantee semantically-equivalent outputs for all semantically-equivalent inputs, but that's very different from "determinism" as we understand it from other algorithms.

helloplanets 2 days ago||

Not all LLM based applications are a user facing free form chat.

If you take an LLM that makes 10 tool calls in a row for an evaluation, any reduction in unpredictable drift is welcome. Same applies to running your prompt through DSPy Optimizer. [0] Countless other examples. Basically any situation where you are in control of the prompt, the token level input to the LLM, so there's no fuzziness.

In this case, if you would've eliminated token level fuzziness and can yourself guarantee that you're not introducing it from your own end, you can basically map out a much more reliable tree or graph structure of your system's behavior.

[0]: https://dspy.ai/#2-optimizers-tune-the-prompts-and-weights-o...

skeezyboy 2 days ago||

> If you take an LLM that makes 10 tool calls in a row for an evaluation, any reduction in unpredictable drift is welcome

why use an ambiguous natural language for a specific technical task? i get that its a cool trick but surely they can come up with another input method by now?

Taek 2 days ago|||

You aren't wrong, but that doesn't mean this level of determinism isn't useful. If you don't even have the level of determinism that the exact same input tokens produce the exact same output tokens, then it's very hard to share reproducible results with peers, which can be useful if you are say, red teaming an LLM to produce a very rare / unreliable output.

stillsut 2 days ago|||

I'm actually working on something similar to this where you can encode information into the outputs of LLM's via steganography: https://github.com/sutt/innocuous

Since I'm really looking to sample the only the top ~10 tokens, and I mostly test on CPU-based inference of 8B models, there's probably not a lot of worries getting a different order of the top tokens based on hardware implementation, but I'm still going to take a look at it eventually, and build in guard conditions against any choice that would be changed by an epsilon of precision loss.

brisky 2 days ago|||

It would be very useful for AI platform customers. You could run prompts with 0 temperature and check if the results are the same making sure that AI provider is not switching the PRO model in the background for a cheap one and ripping you off.

ZeljkoS 2 days ago|||

For "bug" reproduction purposes. It is easier to debug a model if the same string always produces the same incorrect or strange LLM output, not every 100th time you run it.

ontouchstart 2 days ago||

If there is a bug (a behavior defined by whatever criteria), it is just a single path in a very complex systems with high connectivity.

This nonlinear and chaotic behavior regardless of implementation details of the black box makes LLM seem to be nondeterministic. But LLM is just a pseudo random number generator with a probability distribution.

(As I am writing this on my iPhone with text completion, I can see this nondeterministic behavior)

kodablah 2 days ago|||

Deterministic output is needed when LLMs are used for validations. This can be anything from input validation at runtime to a CI check leveraging LLMs. It can be argued this is not an acceptable use of AI, but it will become increasingly common and it will need to be tweaked/tested. You cannot tweak/test a response you don't know you're going to get.

dahcryn 2 days ago||

yeah indeed, regression testing for chatbots that use RAGs would involve making sure the correct response comes from the RAG.

Today we have a extremely hacky workaround by ensuring that at least the desired chunk from the RAG is selected, but it's far from ideal and our code is not well written (a temporary POC written by AI that has been there for quite some months now ...)

gtsop 2 days ago|||

When you do MCP-style applications, an LLM is more like RegEx on steroids, and since you expect your regex to return the same matches on the same input, it is a very desirable attribute for LLMs as well. I would say it is more than desirable, it is necessary.

If i want to covert "how do I x" to `api.howTo("x")` it is very important that i get the exact same result every time.

Ratelman 2 days ago|||

Was my thinking exactly - but also semantically equivalent is also only relevant when it needs to be factual, not necessarily for ALL outputs (if we're aiming for LLM's to present as "human" - or for interactions with LLMs to be natural conversational...). This excludes the world where LLMs act as agents - where you would of course always like the LLM to be factual and thus deterministic.

mingtianzhang 2 days ago|||

I agree that we need stochasticity in a probabilistic system, but I also think it would be good to control it. For example, we need the stochasticity introduced at high temperatures since it is inherent to the model, but we don’t need stochasticity in matrix computations, as it is not required for modeling.

caminanteblanco 2 days ago|||

I don't think the claim is that this is particularly helpful for consumer-facing applications. But from a research perspective, this is invaluable for allowing reproducibility.

redlock 2 days ago||

Easier to debug deterministic inference

jll29 3 days ago||

Sometimes, the reason for non-determinism is implementation-specific. For instance, in GPT-2's source code (I haven't checked other model versions), setting the temperature in the GUI does not lead to a value of 0 but "epsilon" (a very small value larger than 0), to avoid a division by zero error in the code, which makes sense.

For many applications, non-determinism implies "useless". This has been a long standing issue with LDA topic models. In particular in the legal, financial and regulatory domains, if a method is not deterministic, it may be illegal to use it or it may lead to follow-on requirements that one does not want (e.g. all screens shown to humans must be preserved to be able to go back and reconstruct what exactly happened to a particular user in a particular second).

nakamoto_damacy 2 days ago||

"in collaboration with others at Thinking Machines"

If you're old enough, you might remember Danny Hillis' Thinking Machines from the late 80s. I wish they had chosen a different name (I say this for nostalgic reasons, having been in front of one of those cubes glowing with red LEDs back in the late 80s at MIT's AI Lab" (renamed to CSAIL at some point). Feynman did some amazing work on that, too: https://longnow.org/ideas/richard-feynman-and-the-connection...

In the U.S., the “THINKING MACHINES” trademarks were owned by Thinking Machines Corporation (the company Hillis co-founded), not Hillis personally, and those registrations were cancelled in 1998–1999. USPTO Report +1

The company itself went bankrupt in 1994 and its assets were dispersed (e.g., to Sun Microsystems, later Oracle).

There’s a new, pending USPTO application for “THINKING MACHINES” filed in 2025 by Thinking Machines Lab Inc., the company founded by Amira Murati.

Imnimo 2 days ago||

I make this mistake every time I see their name.

skeezyboy 2 days ago||

didnt feynmann shag his colleagues wives? the more i read about him, the more he seems like a cunt

nakamoto_damacy 2 days ago||

I had no idea. But I believe the same deal with Einstein being a dick to his wife and never acknowledging his friend who taught him the math he used in his work (I read about that recently from a respectable source.) I guess that makes sense; no one is void of some deep flaw, it's just selectively hidden.

jasonjmcghee 3 days ago||

I love high quality blog post style research discussion - Anthropic has been leading the charge with this recently and it's great to see it spreading. OpenAI was also doing this during all the RL research days.

riazrizvi 2 days ago||

Natural language is ambiguous. It needs to be. I think the approach here of trying to figure out how to make circles into squares, and argue why circles should be squares, is misguided.

Discussions of this type are going to eventually morph into better understanding of how to accept ambiguity and randomness in language, and further shape it with other larger sub-patterns beyond the little proto-grammars that the QKV projection matrices extract.

atoav 2 days ago|

Yes, but determinism != ambiguity, because determinism means: for this exact input the same exact output needs to follow.

If I ask the same model the same question I should be able to deterministically get the same answer.

Now if we phrase the same question slightly differently we would expect to get a slightly different answer.

Jensson 2 days ago|||

> Now if we phrase the same question slightly differently we would expect to get a slightly different answer.

You wouldn't get this from an LLM though, a tiny change in starting point gets a massive change in output, its a chaotic system.

ares623 2 days ago||

Maybe predictability is what is meant?

riazrizvi 2 days ago|||

Me: What’s an example of a dice roll?

LLM: 1

“Language ambiguity with determinism”? Sure I can juxtapose the terms but if it’s semantically inconsistent, then what we mean by that is not a deterministic, definitive thing. You’re chasing your tail on this ‘goal’.

Nevermark 2 days ago|||

Ambiguity: The request/prompt leaves a lot of room for interpretation. Many qualitatively different answers may be correct, relative to the prompt. Different or non-deterministic models will return highly variance results.

Determinism: If a model is given the exact same request/prompt twice, its two responses will also be identical. Whether or not the consistent response qualifies as correct.

The two concepts are very different.

(Ambiguous vs. precise prompt) x (Deterministic vs. Non-deterministic model) = 4 different scenarios.

A model itself can be non-deterministic without being ambiguous. If you know exactly how it functions, why it is non-deterministic (batch sensitive for instance), that is not an ambiguous model. Its operation is completely characterized. But it is non-deterministic.

An ambiguous model would simply be model whose operation was not characterized. A black box model for instance. A black box model can be deterministic and yet ambiguous.

atoav 2 days ago||

Maybe I got this wrong but I thought ambiguity refered to the input. So in a deterministic system I would assume that a input of "Give an example of a dice roll" Will always output the exact same example (unless the model also gets the context of the message history).

Ambiguity is what happens when you change the prompt slightly, e.g. by adding a word: "Give an example of a single dice roll". Now as a human our expectation would be that this is the same question and should thus (in a deterministic system) receive the same answer. But to an LLM it may not be.

Nevermark 1 day ago||

> Ambiguity: [...] Different or non-deterministic models will return highly variance results.

Yes, and thanks. That was my intended point - but you point out a better example. Slightly different prompts may also produce highly varied responses.

(My subsequent comments on ambiguous models was in case I was misinterpreting the comment I was replying to. I also generally think of ambiguity as a property of input. Either way, ambiguity is not the same as non-deterministic.)

skybrian 2 days ago||||

If you really want that to work while being reproducible, maybe give it a random number tool and set the seed?

raincole 2 days ago|||

> LLM: 1

A perfectly acceptable answer.

If it answers 1 every time it's still a perfectly acceptable answer.

riazrizvi 2 days ago||

So is ‘2’ or ‘3’ or ‘19’ or ‘99’ or ‘a jam sponge cake with gaming dice for frosting’… The point is in natural language there are many perfectly acceptable answers. Usually any particular answer is arbitrary, and it would probably be undesirable to have the same answer everytime. For a majority of use cases.

gond 2 days ago||

I am still irritated by the name of the company.

What is the reasoning behind these schemes? The hope that bits of the properties of legendary companies will rub off onto the new venture?

As if naming the next best venture PARC will inevitably create a breakthrough in networking just by the arrangement of four letters.

ricardobeat 2 days ago||

Are you talking about the “Thinking Machines” company that shut down in 1994? Took me some digging to figure it out, doesn’t seem well-known enough to be the reason - it’s just a nice (and relatively obvious) name.

gond 2 days ago|||

Yes. Danny Hillis’ Thinking Machines Corporation, an AI company which created its own massive parallel processing supercomputer hardware.

“We are building a machine that will be proud of us” was their corporate motto. And that was in 1983.

One of those Machines is on view at the Computer History Museum in Mountain View. Back then, they could be ordered in “Darth Vader Black”, no kidding here. You can also see a couple of them (the CM-5) as the stereotypical supercomputer in the original Jurassic Park.

More here: https://en.m.wikipedia.org/wiki/Thinking_Machines_Corporatio...

kkylin 2 days ago||

And in the original Jurassic Park! https://www.google.com/search?q=jurassic+park+cm-5

kkylin 2 days ago||

[addendum: posted this too quickly & didn't see it in the comment above. duh.]

ewoodrich 2 days ago|||

It may not be a household name like Apple or Microsoft but its flagship product the Connection Machine is somewhat iconic in (super)computing history. The physical design of the machine is cool and unforgettable looking, plus recurring HN favorite Richard Feynman contributed to the original architecture.

random3 2 days ago||

The thinking is free marketing and the same reason trademarks were invented

andy99 23 hours ago||

Deterministic reproducibility is very different from replicability, and imo the latter is more important; even if the details of the reproducibility are interesting I think they're irrelevant.

There's a similar situation in other scientific disciplines. People want source code and data so they can reproduce results - that basically tells you someone didn't cheat and they documented everything. But it does not tell you whether a real phenomenon was observed.

It's much more interesting to know if roughly the same cause and effect relationships exist so we can predict behavior.

Concretely, there are studies that show e.g. randomly capitalizing letters can lead to completely different responses from and LLM. That speaks to a fragility that doesn't have anything to do with deterministic reproduction.

menaerus 2 days ago|

> But why aren’t LLM inference engines deterministic? One common hypothesis is that some combination of floating-point non-associativity and concurrent execution leads to nondeterminism based on which concurrent core finishes first. We will call this the “concurrency + floating point” hypothesis for LLM inference nondeterminism. For example, a recent arXiv preprint writes

I'm honored to see that Mira and co. appreciated my feedback on the very topic I made 7 months ago here :D

> You don't need RNG since the whole transformer is an extremely large floating-point arithmetic unit. A wild guess - how about the source of non-determinism is coming from the fact that, on the HW level, tensor execution order is not guaranteed and therefore (T0 * T1) * T2 can produce slightly different results than T0 * (T1 * T2) due to rounding errors?

https://news.ycombinator.com/item?id=42952605#42960047

More comments...