Posted by samrolken 3 days ago
Ultimately these are useless layers of state, and they inevitably complicate the very goal you set out to test for.
In chip-design land we're focused on streamlining the stack down to drawing geometry. Drawing will be faster when the machine doesn't also have to burn cycles on decades of programmer opinions encoded as state management.
When the only decisions are to extend or delete a bit of geometry, we will eliminate more (still not all) hallucinations and false positives than we do by trying to organize syntax that carries subtly different importance for everyone (misunderstanding fosters hallucination).
Most software out there is developer tools and frameworks; they need to do a job.
Most users just want something like an automated Blender that handles 80% of an ask ("make it look like a word processor" or "a video game"), that they can then customize, and that has a "play" mode that switches out of edit mode. That's the future machine and model we intend to ship. Fonts are just geometric coordinates. The memory matrix and pixels are just geometric coordinates. The system state is just geometric coordinates [1].
Text-driven software engineering modeled on 1960s–1970s job routines, layering indirection on top of the machine's mathematical state, is not high tech in 2025 and beyond. If programmers were car people, they would all insist that the Model T is the only real car.
Copy-paste quote about never getting one to understand something when their paycheck depends on them not understanding it.
Intelligence gave rise to language; language does not give rise to intelligence. Memorization, and the vain sense of accomplishment that follows it, is all there is to language.
[1]https://iopscience.iop.org/article/10.1088/1742-6596/2987/1/...
I'm not entirely sure why I had an urge to write this.
When the god rectangle fails, there is literally nobody on earth who can even diagnose the problem, let alone fix it. Reasoning about the system is effectively impossible. And the vulnerability of the system is almost limitless, since it’s possible to coax LLMs into approximations of anything you like: from an admin dashboard to a sentient potato.
“zero UI consistency” is probably the least of your worries, but object permanence is kind of fundamental to how humans perceive the world. Being able to maintain that illusion is table stakes.
Despite all that, it’s a fun experiment.
For me it is predictability. I am a big proponent of AI tools. But even the biggest proponents admit that LLMs are non-deterministic. When you ask a question, you are not entirely sure what kind of answers you will get.
This behavior is acceptable as a developer assistance tool, when a human is in the loop to review and the end goal is to write deterministic code.
Whereas that sort of evaluation is trivial with code (even if at times program execution is non-deterministic), because its mechanics are explainable. Things like only testing boundary conditions hinge on this property, but completely fall apart if it’s all probabilistic.
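To make that concrete, here's a minimal sketch (my own example, not from the thread) of why boundary-condition testing only buys you anything when the thing under test is deterministic:

```go
package clamp

import "testing"

// Clamp limits v to the inclusive range [lo, hi].
func Clamp(v, lo, hi int) int {
	if v < lo {
		return lo
	}
	if v > hi {
		return hi
	}
	return v
}

// Because Clamp is deterministic and its mechanics are explainable, checking
// the boundaries tells us something about every other input. If the
// "implementation" were an LLM call, passing these cases once would say
// nothing about the next invocation.
func TestClampBoundaries(t *testing.T) {
	cases := []struct{ v, want int }{
		{-1, 0},  // just below the lower bound
		{0, 0},   // the lower bound itself
		{10, 10}, // the upper bound itself
		{11, 10}, // just above the upper bound
	}
	for _, c := range cases {
		if got := Clamp(c.v, 0, 10); got != c.want {
			t.Errorf("Clamp(%d, 0, 10) = %d, want %d", c.v, got, c.want)
		}
	}
}
```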
Maybe explainable AI can help here, but to be honest I have no idea what the state of the art is for that.
Kind of like saving a game before taking on a boss. If things go haywire, just reload. Or maybe like cooking? If something went catastrophically wrong, just throw it out and start from the beginning (with the same tools!)
And I think the only way to even halfway mitigate the vulnerability concern is to stipulate that this hypothetical system can only ever serve a single user. Exactly one intent. Totally partitioned/sharded/isolated.
If you were using your own model you could maybe try to retrain/finetune the issues away given a new dataset and different techniques? But at that point you’re just transmuting a difficult problem into a damn near impossible one?
LLMs can be miraculous and inappropriate at the same time. They are not the terminal technology for all computation.
I think part of the issue is that most frameworks really suck. Web programming isn't that complicated at its core, the overengineering is mind boggling at times.
Thinking in the limit, if you have to define some type of logic unambiguously, would you want to do it in English?
Anyway, I'm just thinking out loud, it's pretty cool that this works at all, interesting project!
What these LLMs continue to prove, though, is that they are no substitute for real domain knowledge. To date, I've yet to have a model implement RAFT consensus correctly when testing whether they can build a database.
The way I interact with these models is almost adversarial in nature. I prompt them with the bare minimum that a developer might get in a feature request. I may even have a planning session to populate the context before I set it off on a task.
The bias in these LLMs really shines through, and proves their autocomplete nature, when they show a strong bias toward changing the one snippet of code I wrote because it doesn't fit the shape their training data suggests the code should have. Most models will course-correct when instructed that they are wrong and I am right, though.
One thing I've noted is that if you let it generate choices for you from the start of a project, it will make poor choices in nearly every language. You can be using uv to manage a Python project and it will keep trying to use pip or plain python commands. You can start an Electron app and it will continually botch whether it's using CommonJS or some other module standard. It persistently wants to download Go modules before coding instead of just writing the code and running `go mod tidy` after (it literally doesn't need the module in advance, and it doesn't even have tools to probe the module before writing the code anyway).
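To illustrate the Go point (the imported module below is just an arbitrary pick of mine, not anything from the thread): the import can be written before the dependency exists locally, and `go mod tidy` sorts it out afterwards.

```go
// main.go — the module below doesn't need to be downloaded, or even listed in
// go.mod, before this file is written.
package main

import "github.com/google/uuid" // illustrative third-party dependency

func main() {
	println(uuid.NewString())
}
```

Running `go mod tidy` afterwards adds the requirement to go.mod and fetches it, which is why pre-downloading modules before writing any code is wasted effort.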
RAFT consensus is my go-to test because there is no one-size-fits-all way to implement it. A model might get an in-memory key-store system right, but what if you want it to organize etcd/raft/v3 in a way that supports multi-group RAFT? What if you need RAFT to coordinate some other form of data replication? None of these LLMs can really do it without a lot of prep work.
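For context, the single-group skeleton with go.etcd.io/etcd/raft/v3 looks roughly like the sketch below (adapted from the library's documented usage, simplified and with my own shortcuts; not code from the thread). Multi-group RAFT means running many of these state machines side by side and multiplexing their ticks, storage, and messages, and none of that is given to you by the boilerplate.

```go
package main

import (
	"context"
	"time"

	"go.etcd.io/etcd/raft/v3"
	"go.etcd.io/etcd/raft/v3/raftpb"
)

func main() {
	storage := raft.NewMemoryStorage()
	cfg := &raft.Config{
		ID:              1,
		ElectionTick:    10,
		HeartbeatTick:   1,
		Storage:         storage,
		MaxSizePerMsg:   1024 * 1024,
		MaxInflightMsgs: 256,
	}
	// Single-voter "cluster": this node will elect itself.
	n := raft.StartNode(cfg, []raft.Peer{{ID: 1}})

	// Proposals would normally come from client requests.
	go func() {
		time.Sleep(time.Second)
		_ = n.Propose(context.Background(), []byte("set x=1"))
	}()

	ticker := time.NewTicker(100 * time.Millisecond)
	defer ticker.Stop()

	for {
		select {
		case <-ticker.C:
			n.Tick() // drives elections and heartbeats
		case rd := <-n.Ready():
			// Persist hard state and entries before acting on anything else.
			if !raft.IsEmptyHardState(rd.HardState) {
				_ = storage.SetHardState(rd.HardState)
			}
			_ = storage.Append(rd.Entries)
			// rd.Messages would be sent to peer nodes here; a real transport
			// layer is one of the pieces left entirely to you.
			for _, entry := range rd.CommittedEntries {
				switch entry.Type {
				case raftpb.EntryConfChange:
					var cc raftpb.ConfChange
					_ = cc.Unmarshal(entry.Data)
					n.ApplyConfChange(cc)
				case raftpb.EntryNormal:
					if len(entry.Data) > 0 {
						// Apply entry.Data to the application state machine.
					}
				}
			}
			n.Advance()
		}
	}
}
```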
This is across all the models available from OpenAI, Anthropic, and Google.
Each person gets their own cache. The format of the cache is a git repo tied to their session ID. Each time a request is made, it writes the code, HTML, CSS, and database to the repo and commits them. Over time you build up more and more artifacts and fewer things need to be generated JIT. This should also help with stability.
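A rough sketch of that scheme (my own naming, assuming you simply shell out to git rather than use a library): one repo per session ID, with every generated artifact committed so later requests replay from the repo instead of regenerating.

```go
package sessioncache

import (
	"os"
	"os/exec"
	"path/filepath"
)

// Store writes one generated artifact (code, HTML, CSS, a database dump, ...)
// into the session's repo and commits it as a new snapshot.
func Store(baseDir, sessionID, name string, content []byte) error {
	repo := filepath.Join(baseDir, sessionID)

	// Initialise the per-session repo on first use.
	if _, err := os.Stat(filepath.Join(repo, ".git")); os.IsNotExist(err) {
		if err := os.MkdirAll(repo, 0o755); err != nil {
			return err
		}
		if err := run(repo, "init"); err != nil {
			return err
		}
	}

	if err := os.WriteFile(filepath.Join(repo, name), content, 0o644); err != nil {
		return err
	}
	if err := run(repo, "add", name); err != nil {
		return err
	}
	// Commit with a fixed identity so this works on machines without a global
	// git config; --allow-empty keeps re-saving identical content from failing.
	return run(repo, "-c", "user.name=cache", "-c", "user.email=cache@localhost",
		"commit", "--allow-empty", "-m", "cache "+name)
}

// run executes a git subcommand inside dir.
func run(dir string, args ...string) error {
	cmd := exec.Command("git", args...)
	cmd.Dir = dir
	return cmd.Run()
}
```

Rolling back to a known-good state is then just checking out an earlier commit in that session's repo.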
Let's say, in the future, when AI learns how to build houses, every time I want to sleep, I'll just ask the AI to build a new house for me, so I can sleep. I guess it will have to repurpose the old one, but that isn't my concern, it's just some implementation detail.
Wouldn't that be nice?
Every night, new house?
Once the dust settles, prices will go up. Even if running the models gets cheaper, the providers will need to earn back all the cash they've burned.
I’d much rather vibe-code an app and get the code to run on some server.
I can get GPT-3-level quality with Qwen 8B, even Qwen 4B in some cases.
This project could use something like that. Perhaps ask the LLM to implement a way to store/cache the snapshots of its previous answers. That way, the more you use it, the faster it becomes.
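As a hedged sketch of what that could look like (names and layout are mine, not the project's): key each generated answer by a hash of the prompt and check the cache before asking the model again.

```go
package snapshot

import (
	"crypto/sha256"
	"encoding/hex"
	"os"
	"path/filepath"
)

const dir = "snapshots" // assumed on-disk location for cached answers

// Lookup returns a previously generated answer for this prompt, if any.
func Lookup(prompt string) ([]byte, bool) {
	data, err := os.ReadFile(pathFor(prompt))
	return data, err == nil
}

// Save stores a generated answer so an identical prompt can skip the LLM.
func Save(prompt string, answer []byte) error {
	if err := os.MkdirAll(dir, 0o755); err != nil {
		return err
	}
	return os.WriteFile(pathFor(prompt), answer, 0o644)
}

// pathFor derives a stable filename from the prompt text.
func pathFor(prompt string) string {
	sum := sha256.Sum256([]byte(prompt))
	return filepath.Join(dir, hex.EncodeToString(sum[:]))
}
```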