It all feels to me like the guys who make videos of using electric drills to hammer in a nail - Sure, you can do that, but it is the wrong tool for the job. Everyone knows the phrase: "When all you have is a hammer, everything looks like a nail." But we need to also keep in mind the other side of that coin: "When all you have is nails, all you need is a hammer." LLMs are not a replacement for everything that happens to be digital.
Often it seems like tech maximalists are the most against tech reliability.
Imagine that - you got your project done ahead of schedule (which looks great on your OKRs) AND finally achieved your dream of no longer being dependent on those stupid overpaid, antisocial software engineers, and all it cost you was the company's reputation. Boeing management would be proud.
Lots of business leaders will do the math and decide this is the way to operate from now on.
I suggest when their pointer dereferences, it can go a bit forward or backwards in memory as long as it is mostly correct.
Then my job became: I was assigned a larger implementation and, depending on how large it was, I had to design specifications for others to do some or all of the work and validate the final product for correctness. I definitely didn’t pore over every line of code - especially not for front end work, which I stopped doing around the same time.
The same is true for LLMs. I treat them like junior developers, and I am slowly starting to treat them like halfway competent mid-level ticket takers.
No. LLMs are undefined behavior.
But most LLM services introduce randomness on purpose, so you don’t get the same result for the same input, at least not for the inputs you control as a user.
"Deterministic" is not the the right constraint to introduce here. Plenty of software is non-deterministic (such as LLMs! But also, consensus protocols, request routing architecture, GPU kernels, etc) so why not compilers?
What a compiler needs is not determinism, but semantic closure. A system is semantically closed if the meanings of its outputs are fully defined within the system, correctness can be evaluated internally and errors are decidable. LLMs are semantically open. A semantically closed compiler will never output nonsense, even if its output is nondeterministic. But two runs of a (semantically closed) nondeterministic compiler may produce two correct programs, one being faster on one CPU and the other faster on another. Or such a compiler can be useful for enhancing security, e.g. programs behave identically, resist fingerprinting.
Nondeterminism simply means the compiler selects any element of an equivalence class. Semantic closure ensures the equivalence class is well‑defined.
That a compiler might pick among different specific implementations in the same equivalency class is exactly what you want a multi-architecture optimizing compiler to do. You don't want it choosing randomly between different optimization choices within an optimization level, that would be non-deterministic at compile time and largely useless assuming that there is at most one most optimized equivalent. I always want the compiler to choose to xor a register with itself to clear it if that's faster than explicitly setting it to zero if that makes the most sense to do given the inputs/constraints.
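As a toy illustration of that equivalence class (a sketch only; the exact instructions depend on the target, flags, and compiler version):

/* A compiler is free to pick any member of the equivalence class of
   correct outputs for this function. */
int zero(void) {
    return 0;
}

/* On x86-64 at -O1 and above, gcc and clang typically emit
       xor eax, eax
       ret
   rather than "mov eax, 0": the xor form is shorter and is recognized
   as a zeroing idiom. Both are correct; you want the choice driven by
   cost models and constraints, not by a coin flip. */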
There are legitimate compiler use cases, e.g. search-based optimization, superoptimization, diversification, etc., where reproducibility is not the main constraint. It's worth leaving conceptual space for those use cases rather than treating deterministic output as a defining property of all compilers.
You are attempting to hedge and leave room for a non-deterministic compiler, presumably to argue that something like vibe-compilation is valuable. However, you've offered no real use cases for a non-deterministic compiler, and I assert that such a tool would largely be useless in the real world. There is already a huge gap between requirements gathering, the expression of those requirements, and their conversion into software. Adding even more randomness at the layer of translating high level programming languages into low level machine code would be a gross regression.
https://thinkingmachines.ai/blog/defeating-nondeterminism-in...
I am not. To me that describes a debugging fiasco. I don't want "semantic closure," I want correctness and exact repeatability.
Meanwhile, you press the "shuffle" button, and code-gen creates different code. But this isn't necessarily the part that's supposed to be reproducible, and isn't how you actually go about comparing the output. Instead, maybe two different rounds of code-generation are "equal" if the test-suite passes for both. Not precisely the equivalence-class stuff parent is talking about, but it's a simple way of thinking about it that might be helpful.
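A minimal sketch of that coarser, test-defined notion of "equal" (the is_even functions here are hypothetical stand-ins for two rounds of code-gen):

#include <assert.h>

/* Two hypothetical code-gen outputs for "is n even?". */
int is_even_a(int n) { return n % 2 == 0; }
int is_even_b(int n) { return (n & 1) == 0; } /* bitwise variant */

int main(void) {
    /* The shared test suite is the (coarse) equivalence check here:
       both versions pass, so they count as "equal" for this purpose. */
    for (int n = -1000; n <= 1000; n++)
        assert(is_even_a(n) == is_even_b(n));
    return 0;
}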
On a practical level, existing implementations are nondeterministic because they don't take care to always perform mathematically associative operations in the same order every time. Floating-point arithmetic is not associative, so those variations in order change the output. It's absolutely possible to fix this and perform the operations in the same order every time; implementors just don't bother. It's not very useful, especially when almost everything runs with a non-zero temperature.
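A quick C sketch of that non-associativity, using the usual 0.1/0.2/0.3 example (the exact printed digits assume IEEE 754 doubles):

#include <stdio.h>

int main(void) {
    double a = 0.1, b = 0.2, c = 0.3;
    /* Mathematically (a + b) + c == a + (b + c), but not in floating
       point, so changing the summation order changes the result. */
    printf("%.17g\n", (a + b) + c); /* 0.60000000000000009 */
    printf("%.17g\n", a + (b + c)); /* 0.59999999999999998 */
    return 0;
}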
I think the whole nondeterminism thing is overblown anyway. Mathematical nondeterminism and practical nondeterminism aren't the same thing. With a compiler, it's not just that identical input produces identical output. It's also that semantically identical input produces semantically identical output. If I add an extra space somewhere whitespace isn't significant in the language I'm using, this should not change the output (aside from debug info that includes column numbers, anyway). My deterministic JSON decoder should not only decode the same values for two runs on identical JSON; a change in one value in the input should produce the same values in the output, except for the one that changed.
LLMs inherently fail at this regardless of temperature or determinism.
No, a compiler needs determinism. The article is quite correct on this point: if you can't trust that the output of a tool will be consistent, you can't use it as a building block. A stochastic compiler is simply not fit for purpose.
There are even efforts to guarantee this for many packages on Linux - it’s a core security property because it lets you verify, by rebuilding from scratch, that the compilation process or environment wasn’t tampered with illicitly.
Now actually managing to fix all inputs and get deterministic output can be challenging, but that’s less to do with the compiler and more to do with the challenge of controlling the entire environment: the profile you are using for PGO, paths on the build machine being injected into the binary, programs that have something non-deterministic in their source or build system (e.g. incorporating the build time into the binary), and so on.
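The build-time case is easy to reproduce: the standard __DATE__ and __TIME__ macros are expanded at compile time, so two otherwise identical builds of this typically won't diff clean:

#include <stdio.h>

int main(void) {
    /* Expanded when the file is compiled, not when it runs, so the
       embedded string changes from build to build. */
    printf("built on %s at %s\n", __DATE__, __TIME__);
    return 0;
}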
> PGO seems like it ought to have a random element.
PGO should be deterministic based on the runs used to generate the profile. The runs are tracking information that should be deterministic--how many times does the branch get taken versus not taken, etc. HWPGO, which relies on hardware counters to generate profiling information, may be less deterministic because the hardware counters end up having some statistical slip to them.
Hence why it is hard to do benchmarks with various kinds of GC and dynamic compilers.
You can't even expect deterministic code generation for the same source code across various compilers.
or does your binary always come out differently each time you compile the same file??
You can try it: compile the same file 10 times and diff the resultant binaries.
Now try to prompt a bunch of LLMs 10 times and diff the returned rubbish.
There's this really good blog post about how autovectorization is not a programming model https://pharr.org/matt/blog/2018/04/18/ispc-origins
The point is that you want to reliably express semantics in the top level language, tool, API etc. because that's the only way you can build a stable mental model on top of that. Needing to worry about if something actually did something under the hood is awful.
Now of course, that depends on the level of granularity YOU want. When writing plain code, even if it's expressively rich in the logic and semantics (e.g. c++ template metaprogramming), sometimes I don't necessarily care about the specific linker and assembly details (but sometimes I do!)
The issue I think is that building a reliable mental model of an LLM is hard. Note that "reliable" is the key word - consistent. Be it consistently good or bad. The frustrating thing is that it can sometimes deliver great value and sometimes brick horribly and we don't have a good idea for the mental model yet.
To constrain said possibility space, we tether to absolute memes (LLMs are fully stupid or LLMs are a superset of humans).
Idk where I'm going with this
Humans, in all their non deterministic brain glory, long ago realized they don't want their software to behave like their coworkers after a couple of margaritas.
They are designed to be deterministic when temperature=0. Some hardware configurations are known to defy that assumption, but when running on perfect hardware they most definitely are.
What you call compilers are also nondeterministic on 'faulty' hardware, so...
To say the least, this is garbage compared to compilers
When isn't that true?
#include <stdio.h>

int main() {
    printf("Continue?\n");
}

and

#include <stdio.h>

int main() {
    printf("Continue?\n");
    printf("Continue?\n");
}

do not see the compiler produce equivalent outputs, and I am not sure how they ever could. They are not equivalent programs. Adding instructions to a program is expected to change what the compiler does with it. With LLMs the output depends on the phases of the moon.
As with LLMs, unless you ask for the output to be nondeterministic. But any compiler can be made nondeterministic if you ask for it. That's not something unique to LLMs.
> With LLMs the output depends on the phases of the moon.
If you are relying on a third-party service to run the LLM, quite possibly. Without control over the hardware, configuration, etc. then there is all kinds of fuckery that they can introduce. A third-party can make any compiler nondeterministic.
But that's not a limitation of LLMs. By design, they are deterministic.
Not unique as in: no one makes their compilers non-deterministic, and you have to work to make a non-deterministic one. LLMs are non-deterministic by default, and you have to contort them to the point of uselessness to make them deterministic.
> If you are relying on a third-party service to run the LLM, quite possibly. Without control over the hardware, configuration, etc.
Again. Even if you control everything, the only time they produce deterministic output is when they are completely neutered:
- workaround for GPUs with num_thread 1
- temperature set to 0
- top_k to 0
- top_p to 0
- context window to 0 (or always do a single run from a new session)
Go (gc) was designed for reproducible builds by default, so clearly that's not true, but you are right that it isn't the norm.
Even the most widely recognized and used compilers, like gcc, clang, even rustc, are non-deterministic by default. Only if you work hard and control all the variables (e.g. -frandom-seed) can you make these compilers deterministic.
It's fascinating that anyone on HN thinks that compilers converge on always being deterministic or always being non-deterministic. I thought we were supposed to know things about computers around here?
I think it’s more productive to chart all of these systems, LLMs included, on a line of abstraction leakiness. Even disregarding their stochastic nature, I think they’re a much too leaky abstraction to find any use in compilers. There’s a giant mismatch that I think is too big to reconcile.
We have mechanisms for ensuring the output from humans, and those are nothing like the mechanisms for ensuring the output from a compiler. We have checks on people; we have whole industries of people whose whole careers are managing people, who manage other people, who manage other people.
With regard to predictability, LLMs essentially behave like people in this manner. The same kind of checks that we use for people are needed for them, not the same kind of checks we use for software.
The whole benefit of computers is that they don't make stupid mistakes like humans do. If you give a computer the ability to make random mistakes all you have done is made the computer shitty. We don't need checks, we need to not deliberately make our computers worse.
If they are junior developers working in Java, they may just as well build an AbstractFactoryConcurrentSingletonBean because that’s what they learned in school, just as an LLM would from training on code it found on the Internet.
Those checks work for people because humans and most living beings respond well to reward/punishment mechanisms. It’s the whole basis of society.
> not the same kind of checks we use for software.
We do have systems that are non-deterministic (computer vision, various forecasting models…). We judge those by their accuracy and the likelihood of false positives or false negatives (when it’s a classifier). Why not use those metrics?
LLM code completion compares unfavourably to the (heuristic, nigh-instant) picklist implementations we used to use, both at the low-level (how often does it autocomplete the right thing?) and at the high-level (despite many believing they're more effective, the average programmer is less effective when using AI tools). We need reasons to believe that LLMs are great and do all things, therefore we look for measurements that paint it in a good light (e.g. lines of code written, time to first working prototype, inclination to output Doom source code verbatim).
The reason we're all using (or pretending to use) LLMs now is not because they're good. It's almost entirely unrelated.
If you don't like the results or the process, you have to switch targets or add new intermediates. For example, instead of doing description -> implementation, do description -> spec -> plan -> implementation.
This is technically true. But unimportant. When I write code in a higher level language and it gets compiled to machine code, ultimately I am testing statically generated code for correctness. I don’t care what type of weird tricks the compiler did for optimizations.
How is that any different from when someone is testing LLM-generated C code? I’m still testing C code that isn’t going to magically be changed by the LLM without my intervention, any more than my C code is going to be changed without my recompiling it.
On this latest project I was on, the Python code generated by Codex was “correct” on the happy path. But there were subtle bugs in the distributed locking mechanics and some other concurrency controls I specified. Ironically, those were both caught by throwing the code into ChatGPT in thinking mode.
No one is using an LLM to compute whether a number is even or odd at runtime.
you might not, but plenty of others do. on the jvm for example, anyone building a performance sensitive application has to care about what the compiler emits + how the jit behaves. simple things like accidental boxing, megamorphic call preventing inlining, etc. have massive effects.
i've spent many hours benchmarking, inspecting in jitwatch, etc.
Yes, I know every millisecond a company like Google can shave off is multiplied by billions of transactions a day and can save real money on infrastructure. But even at a second-tier company like Salesforce, it probably doesn’t matter.
Over the past decade, part of my job has been to design systems, talk to “stakeholders” and delegate some work and do some myself. I’m neither a web developer nor a mobile developer.
I don’t look at a line of code for those types of implementations. I do make sure they work. From my perspective, those that I delegated to might as well be “human LLMs”.
But even with C, it’s still not completely deterministic with out-of-order execution, branch prediction, cache hits vs misses, etc. Didn’t exactly this cause some of the worst processor-level security issues we had seen in years?
The same thing happens in JavaScript. I debug it using a JavaScript debugger, not with gdb. Even when using a bash script, you don’t debug it by going into the programs’ source code, you just consult the man pages.
When using an LLM, I would expect not to go and verify the code to see if it is actually semantically correct.
Like I said above, I do know to watch out for implementations that “Work on my Machine” but don’t work at scale or involve concurrency. But I have had to check for the same issues when I delegate work to more junior developers.
This is not meant to be an insult toward you. But since I haven’t done front end development for well over a decade, a front end developer might as well be a “human LLM” to me. I’m going to give you the business requirements and constraints and you are going to come back with a website. I am just going to check that it meets the business requirements and not tell you the how. I’m definitely not going to look at the code.
I just had a web project I had to modify for a new project; I used Codex and didn’t look at a line of code. Yeah, I know JavaScript. But I have no idea whether the code from the initial developer, who worked on another project I led, or the Codex changes were idiomatic. I know the developer and Codex met my functional requirements.
This is why I think the better goal is an abstraction layer that differentiates human decisions from default (LLM) decisions. A sweeping "compiler" locks humans out of the decision making process.
This could be a good way to learn how robust your tests are, and also what accidental complexity could be removed by doing a rewrite. But I doubt that the results would be so good that you could ask a coding agent to regenerate the source code all the time, like we do for compilers and object code.
For context, my initial implementation went through the official AWS open source process (no longer there) five years ago, and I’m still getting occasional emails and LinkedIn messages because it’s one of the best publicly available ways to solve the problem. The last couple of times, I basically gave the person the instructions I gave ChatGPT (since I couldn’t give them the code) and told them to have it regenerate the code in Python. It would do much better than what I wrote back when I didn’t know the service as well as I do now, and the service has more features that you have to be concerned about.
There are people playing around with straight machine code generation, or integrating ML into the optimisation backend; and finally, compiling via a translation to an existing language is already a given in vibe coding with agents.
Speaking of which, using agentic runtimes is hardly any different from writing programs, there are some instructions which then get executed just like any other applications, and if it gets compiled before execution or plainly interpreted, becomes a runtime implementation detail.
Are we there yet without hallucinations?
Not yet, however the box is already open, and there are enough people trying to make it happen.
One current idea of mine is to iteratively make things more and more specific; this is the approach I take with psuedocode-expander ([0]) and it has proven generally useful. I think there's a lot of value in the LLM building from the top down with human feedback, for instance, instead of one-shot generating something linearly. I give a lot more examples on the repo for this project, and encourage any feedback or thoughts on LLM-driven code generation in a way that's more sustainable than vibe coding.
[0]: https://github.com/explosion-Scratch/psuedocode-expander/
Well, you can always set temperature to 0, but that doesn't remove hallucinations.