
Posted by jxmorris12 3 days ago

Defeating Nondeterminism in LLM Inference (thinkingmachines.ai)
328 points | 130 comments
bee_rider 3 days ago|
From their code:

    import torch

    A = torch.randn(2048, 2048, device='cuda', dtype=torch.bfloat16)
    B = torch.randn(2048, 2048, device='cuda', dtype=torch.bfloat16)
    ref = torch.mm(A, B)
    # Identical inputs and kernel, so every run should match the reference bit for bit.
    for _ in range(1000):
        assert (torch.mm(A, B) - ref).abs().max().item() == 0
I’m sort of surprised that Torch doesn’t have some kind of lazy evaluation thing to avoid computing anything here. I thought that was one of the nice things about all these fancy frameworks (if I wanted the computer to actually do silly things when I asked it to, I would use BLAS directly, right?).
nomel 3 days ago|
Maybe I'm missing something, but in this case, wouldn't being lazy be pure overhead? I don't see anything that can be lazy here. The reference is computed once, nanoseconds before it's needed, and the test cases are computed at the time of comparison, then tossed away.

What would one hope to achieve by making this case lazy? If you wanted these to run in parallel on a multi-GPU system, you would use the appropriate parallel interface.

bee_rider 3 days ago||
I mean, if you wait long enough, it is asking for

  .abs().max().item()
of something that can be identified as definitionally zero.
nomel 3 days ago||
I don't understand. Since it's not using the parallel interface, only one operation can happen at a time. This would literally be sequential execution with extra overhead. Again, in this case, what would one hope to achieve by doing things lazily, since the lazy operations would immediately be followed by their evaluation?

The parallel interface, which is async, is probably what you're looking for.

Dylan16807 2 days ago|||
Let's look at the subtraction in this case.

If evaluation is lazy, then the subtraction operator gets fed two unevaluated matrix multiplies.

If it's a dumb subtraction operator, this gives us no benefit. Eventually it evaluates both and then subtracts. And it has some extra overhead like you said.

But if it's a smart subtraction operator, it can realize that both parameters are the same equation, and then it can return all 0s without evaluating anything.

And even better than just skipping the matrix math, "all 0s" can be a stub object that takes O(1) time to set up. And then .abs().max() will be instant too.
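
A toy sketch of that kind of "smart" lazy subtraction (hypothetical plain Python, not an existing PyTorch feature; LazyMM and lazy_sub are made-up names):

    import torch

    class LazyMM:
        """An unevaluated torch.mm(a, b); nothing is computed until .eval()."""
        def __init__(self, a, b):
            self.a, self.b = a, b

        def eval(self):
            return torch.mm(self.a, self.b)

    def lazy_sub(x, y):
        # "Smart" subtraction: if both operands are the same deferred matmul,
        # the result is definitionally zero, so return a cheap stub instead of
        # running two matmuls and then subtracting.
        if isinstance(x, LazyMM) and isinstance(y, LazyMM) and x.a is y.a and x.b is y.b:
            return torch.zeros(x.a.shape[0], x.b.shape[1], dtype=x.a.dtype)
        return x.eval() - y.eval()

    A = torch.randn(64, 64)
    B = torch.randn(64, 64)
    diff = lazy_sub(LazyMM(A, B), LazyMM(A, B))
    assert diff.abs().max().item() == 0  # no matmul was ever executed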

nomel 2 days ago|||
I see now, thank you. I was stuck on the "lazy evaluation" part, rather than the optimization part they were actually suggesting.
bee_rider 3 days ago|||
The Python commands are encountered sequentially. One could imagine a library where the Python commands build the computation under the hood. Then, the library would be able to take advantage of situations like this one (or, more practically, reorder multiplications and/or avoid unnecessary temporaries).
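
For the reordering point, the classic win is matrix-chain ordering: a library that builds the graph first can pick the cheaper parenthesization before doing any work. A rough illustration (shapes are made up for effect):

    import torch

    n, k = 4096, 8
    A = torch.randn(n, k)   # 4096 x 8
    B = torch.randn(k, n)   # 8 x 4096
    v = torch.randn(n, 1)   # 4096 x 1

    # (A @ B) @ v materializes a 4096 x 4096 intermediate: roughly 3e8 FLOPs.
    # A @ (B @ v) never forms anything bigger than a vector: roughly 1.3e5 FLOPs.
    left = (A @ B) @ v
    right = A @ (B @ v)

    print((left - right).abs().max().item())  # same result up to float rounding
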
themeiguoren 2 days ago||
A bit off topic from the technical discussion, but does anyone recognize what blog layout or engine this is? I really like the layout with sidenotes and navigation.
ako 2 days ago|
Seems like a Tufte-inspired style, something like this: https://clayh53.github.io/tufte-jekyll/articles/20/tufte-sty...
simne 2 days ago||
This is an eternal struggle: hardware developers will keep scaling horizontally and making hardware less deterministic in time because of the memory wall, while scientists will keep developing new ways to make calculations deterministic.

So even if progress is achieved right now, I think that for the foreseeable future this will remain a constant dead end.

paulbjensen 3 days ago||
It reminded me of this wonderful talk by the late Joe Armstrong (Erlang's creator): https://www.youtube.com/watch?v=lKXe3HUG2l4

Great post.

PeterStuer 2 days ago||
THANK YOU! Great work and writeup. Hope it finally silences the "concurrency + floating point" crowd and the "LLMs can never be deterministic" zealots.
Noumenon72 2 days ago||
Are the results of the matmuls really so far apart in magnitude that you have to lose significant bits when adding them up in FP32?
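
For intuition on the bit-loss mechanism, a toy example with made-up magnitudes (real matmul partial sums are far less extreme, but the effect is the same in kind):

    import torch

    big = torch.tensor(2.0**24, dtype=torch.float32)  # fp32 has a 24-bit significand
    small = torch.tensor(1.0, dtype=torch.float32)

    print((big + small - big).item())            # 0.0 -- the 1.0 is swallowed entirely
    print((big + small + small - big).item())    # 0.0 -- each 1.0 is lost, one at a time
    print((big + (small + small) - big).item())  # 2.0 -- different order, different answer
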
zacksiri 2 days ago||
This work is extremely consequential. When building agentic systems, determinism will significantly improve reliability.

I hope all the model providers adopt this.

bendoy 3 days ago||
Where this gets really complicated is when you are chaining many LLM calls together (basically any agent). A slight deviation in the call stack can throw off everything else.
lrvick 3 days ago||
Job one is to have every bit of software involved also be deterministic, which stagex takes care of.

I had no problem getting deterministic LLM outputs when I experimented with this 6 months ago.

Run two of these with the same prompts and same seed and you get the same results.

Obviously in GPU clusters with different hardware things get more complicated.

https://git.distrust.co/public/llmshell

spindump8930 3 days ago||
That's not what this is about.

"I had no problem getting deterministic LLM outputs when I experimented with this 6 months ago" looks like you're using llama-cpp in that repo. This is about vllm serving many requests at once, at long sequence lengths.

> As it turns out, our request’s output does depend on the parallel user requests. Not because we’re somehow leaking information across batches — instead, it’s because our forward pass lacks “batch invariance”, causing our request’s output to depend on the batch size of our forward pass.

Your situation isn't really comparable.
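
A minimal way to poke at the batch-size effect described in the quote (whether the difference is nonzero depends on your GPU, dtype, and which kernels get picked, so treat this as a sketch rather than a guarantee):

    import torch

    torch.manual_seed(0)
    A = torch.randn(2048, 2048, device='cuda', dtype=torch.bfloat16)
    B = torch.randn(2048, 2048, device='cuda', dtype=torch.bfloat16)

    # Row 0 computed as part of the full batch vs. on its own.
    full = torch.mm(A, B)[:1]
    single = torch.mm(A[:1], B)

    # A non-batch-invariant kernel may reduce in a different order for the
    # two shapes, so these can disagree in the low bits.
    print((full - single).abs().max().item())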

saagarjha 3 days ago||
What’s stagex?
lrvick 3 days ago||
A supply-chain-security-focused Linux distro that, by design, does not trust its own maintainers.
threeducks 3 days ago|
It should also be noted that PyTorch has a page about reproducibility: https://docs.pytorch.org/docs/stable/notes/randomness.html

TL;DR

Seed your PRNGs and call torch.use_deterministic_algorithms(True) to opt into deterministic kernels (a minimal sketch is below). They may be slightly slower, but in practice you probably will not notice.

Note that results will still differ between different drivers and GPUs. It would be great if NVIDIA tried harder in that regard.
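
A minimal sketch of those settings, roughly following that reproducibility page (the CUBLAS_WORKSPACE_CONFIG value is the one it suggests for newer CUDA versions):

    import os
    # Must be set before cuBLAS is initialized, per the PyTorch notes.
    os.environ['CUBLAS_WORKSPACE_CONFIG'] = ':4096:8'

    import torch

    torch.manual_seed(0)                      # seed CPU and CUDA PRNGs
    torch.use_deterministic_algorithms(True)  # deterministic kernels, or an error if none exists
    torch.backends.cudnn.benchmark = False    # autotuning can otherwise pick different kernels

    x = torch.randn(1024, 1024, device='cuda')
    y = torch.mm(x, x)  # should be bit-identical across runs on the same GPU/driver/library stack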

red2awn 3 days ago|
The blog post is about LLM non-determinism in the context of serving at scale (variable batch size). The page you link is only about run-to-run determinism, implicitly assuming a fixed batch size.