Posted by samrolken 3 days ago
Ultimately these are useless layers of state, and they inevitably complicate the very goal you set out to test for.
In chip-design land we're focused on streamlining the stack down to drawing geometry. Drawing will be faster when the machine doesn't also have to burn cycles on decades of programmer opinions encoded as state management.
When the only decisions are to extend or delete a bit of geometry, we will eliminate more (still not all) hallucinations and false positives than we do by trying to organize syntax that carries subtly different importance for everyone (misunderstanding fosters hallucination).
Most software out there is developer tools and frameworks; they need to do a job.
Most users just want something like an automated Blender that handles 80% of an ask ("make it look like a word processor" or "a video game"), that they can then customize, and that has a "play" mode that switches out of edit mode. That's the future machine and model we intend to ship. Fonts are just geometric coordinates. The memory matrix and pixels are just geometric coordinates. The system state is just geometric coordinates [1].
Text-driven software engineering modeled on 1960s–1970s job routines, layering indirection on top of the machine's mathematical state, is not high tech in 2025 and beyond. If programmers were car people, they would all insist that the Model T is the only real car.
Copy-paste quote about never getting one to understand something when their paycheck depends on them not understanding it.
Intelligence gave rise to language; language does not give rise to intelligence. Memorization, and the vain sense of accomplishment that follows it, is all there is to language.
[1]https://iopscience.iop.org/article/10.1088/1742-6596/2987/1/...
I'm not entirely sure why I had an urge to write this.
When the god rectangle fails, there is literally nobody on earth who can even diagnose the problem, let alone fix it. Reasoning about the system is effectively impossible. And the vulnerability of the system is almost limitless, since it’s possible to coax LLMs into approximations of anything you like: from an admin dashboard to a sentient potato.
“zero UI consistency” is probably the least of your worries, but object permanence is kind of fundamental to how humans perceive the world. Being able to maintain that illusion is table stakes.
Despite all that, it’s a fun experiment.
For me it is predictability. I am a big proponent of AI tools. But even the biggest proponents admit that LLMs are non-deterministic. When you ask a question, you are not entirely sure what kind of answers you will get.
This behavior is acceptable as a developer assistance tool, when a human is in the loop to review and the end goal is to write deterministic code.
Whereas that sort of evaluation is trivial with code (even if at times program execution is non-deterministic), because its mechanics are explainable. Things like only testing boundary conditions hinge on this property, but completely fall apart if it’s all probabilistic.
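To make that concrete, here's a minimal sketch (my own example, not from the thread) of why boundary-condition testing only buys you anything when the thing under test is deterministic:

```go
package clamp

import "testing"

// Clamp limits v to the inclusive range [lo, hi].
func Clamp(v, lo, hi int) int {
	if v < lo {
		return lo
	}
	if v > hi {
		return hi
	}
	return v
}

// Because Clamp is deterministic and its mechanics are explainable, checking
// the boundaries tells us something about every other input. If the
// "implementation" were an LLM call, passing these cases once would say
// nothing about the next invocation.
func TestClampBoundaries(t *testing.T) {
	cases := []struct{ v, want int }{
		{-1, 0},  // just below the lower bound
		{0, 0},   // the lower bound itself
		{10, 10}, // the upper bound itself
		{11, 10}, // just above the upper bound
	}
	for _, c := range cases {
		if got := Clamp(c.v, 0, 10); got != c.want {
			t.Errorf("Clamp(%d, 0, 10) = %d, want %d", c.v, got, c.want)
		}
	}
}
```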
Maybe explainable AI can help here, but to be honest I have no idea what the state of the art is for that.
Kind of like saving a game before taking on a boss. If things go haywire, just reload. Or maybe like cooking? If something went catastrophically wrong, just throw it out and start from the beginning (with the same tools!)
And I think the only way to even halfway mitigate the vulnerability concern is to stipulate that this hypothetical system can only ever serve a single user. Exactly one intent. Totally partitioned/sharded/isolated.
If you were using your own model you could maybe try to retrain/finetune the issues away given a new dataset and different techniques? But at that point you’re just transmuting a difficult problem into a damn near impossible one?
LLMs can be miraculous and inappropriate at the same time. They are not the terminal technology for all computation.
I think part of the issue is that most frameworks really suck. Web programming isn't that complicated at its core, the overengineering is mind boggling at times.
Thinking in the limit, if you have to define some type of logic unambiguously, would you want to do it in English?
Anyway, I'm just thinking out loud, it's pretty cool that this works at all, interesting project!
What these LLMs continue to prove, though, is that they are no substitute for real domain knowledge. To date, I've yet to have a model implement RAFT consensus correctly when testing whether they can build a database.
The way I interact with these models is almost adversarial in nature. I prompt them with the bare minimum that a developer might get in a feature request. I may even have a planning session to populate the context before I set it off on a task.
The bias in these LLMs really shines through, and proves their autocomplete nature, when they show a strong bias toward changing the one snippet of code I wrote because it doesn't fit the shape their training data suggests the code should have. Most models will course-correct when instructed that they are wrong and I am right, though.
One thing I've noted is that if you let it generate choices for you from the start of a project, it will make poor choices in nearly every language. You can be using uv to manage a Python project and it will keep trying to use pip or plain python commands. You can start an Electron app and it will continually botch whether it's using CommonJS or some other module standard. It persistently wants to download Go modules before coding instead of just writing the code and running `go mod tidy` after (it literally doesn't need the module in advance, and it doesn't even have tools to probe the module before writing the code anyway).
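To illustrate the Go point (the imported module below is just an arbitrary pick of mine, not anything from the thread): the import can be written before the dependency exists locally, and `go mod tidy` sorts it out afterwards.

```go
// main.go — the module below doesn't need to be downloaded, or even listed in
// go.mod, before this file is written.
package main

import "github.com/google/uuid" // illustrative third-party dependency

func main() {
	println(uuid.NewString())
}
```

Running `go mod tidy` afterwards adds the requirement to go.mod and fetches it, which is why pre-downloading modules before writing any code is wasted effort.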
RAFT consensus is my go-to test because there is no one-size-fits-all way to implement it. A model might get an in-memory key-store system right, but what if you want it to organize etcd/raft/v3 in a way that supports multi-group RAFT? What if you need RAFT to coordinate some other form of data replication? None of these LLMs can really do it without a lot of prep work.
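For context, the single-group skeleton with go.etcd.io/etcd/raft/v3 looks roughly like the sketch below (adapted from the library's documented usage, simplified and with my own shortcuts; not code from the thread). Multi-group RAFT means running many of these state machines side by side and multiplexing their ticks, storage, and messages, and none of that is given to you by the boilerplate.

```go
package main

import (
	"context"
	"time"

	"go.etcd.io/etcd/raft/v3"
	"go.etcd.io/etcd/raft/v3/raftpb"
)

func main() {
	storage := raft.NewMemoryStorage()
	cfg := &raft.Config{
		ID:              1,
		ElectionTick:    10,
		HeartbeatTick:   1,
		Storage:         storage,
		MaxSizePerMsg:   1024 * 1024,
		MaxInflightMsgs: 256,
	}
	// Single-voter "cluster": this node will elect itself.
	n := raft.StartNode(cfg, []raft.Peer{{ID: 1}})

	// Proposals would normally come from client requests.
	go func() {
		time.Sleep(time.Second)
		_ = n.Propose(context.Background(), []byte("set x=1"))
	}()

	ticker := time.NewTicker(100 * time.Millisecond)
	defer ticker.Stop()

	for {
		select {
		case <-ticker.C:
			n.Tick() // drives elections and heartbeats
		case rd := <-n.Ready():
			// Persist hard state and entries before acting on anything else.
			if !raft.IsEmptyHardState(rd.HardState) {
				_ = storage.SetHardState(rd.HardState)
			}
			_ = storage.Append(rd.Entries)
			// rd.Messages would be sent to peer nodes here; a real transport
			// layer is one of the pieces left entirely to you.
			for _, entry := range rd.CommittedEntries {
				switch entry.Type {
				case raftpb.EntryConfChange:
					var cc raftpb.ConfChange
					_ = cc.Unmarshal(entry.Data)
					n.ApplyConfChange(cc)
				case raftpb.EntryNormal:
					if len(entry.Data) > 0 {
						// Apply entry.Data to the application state machine.
					}
				}
			}
			n.Advance()
		}
	}
}
```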
This is across all the models available from OpenAI, Anthropic, and Google.
Each person gets their own cache. The format of the cache is a git repo tied to their session ID. Each time a request is made, it writes the code, HTML, CSS, and database to the repo and commits them. Over time you build up more and more artifacts and fewer things need to be generated JIT. This should also help with stability.
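A rough sketch of that scheme (my own naming, assuming you simply shell out to git rather than use a library): one repo per session ID, with every generated artifact committed so later requests replay from the repo instead of regenerating.

```go
package sessioncache

import (
	"os"
	"os/exec"
	"path/filepath"
)

// Store writes one generated artifact (code, HTML, CSS, a database dump, ...)
// into the session's repo and commits it as a new snapshot.
func Store(baseDir, sessionID, name string, content []byte) error {
	repo := filepath.Join(baseDir, sessionID)

	// Initialise the per-session repo on first use.
	if _, err := os.Stat(filepath.Join(repo, ".git")); os.IsNotExist(err) {
		if err := os.MkdirAll(repo, 0o755); err != nil {
			return err
		}
		if err := run(repo, "init"); err != nil {
			return err
		}
	}

	if err := os.WriteFile(filepath.Join(repo, name), content, 0o644); err != nil {
		return err
	}
	if err := run(repo, "add", name); err != nil {
		return err
	}
	// Commit with a fixed identity so this works on machines without a global
	// git config; --allow-empty keeps re-saving identical content from failing.
	return run(repo, "-c", "user.name=cache", "-c", "user.email=cache@localhost",
		"commit", "--allow-empty", "-m", "cache "+name)
}

// run executes a git subcommand inside dir.
func run(dir string, args ...string) error {
	cmd := exec.Command("git", args...)
	cmd.Dir = dir
	return cmd.Run()
}
```

Rolling back to a known-good state is then just checking out an earlier commit in that session's repo.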
Let's say, in the future, when AI learns how to build houses, every time I want to sleep, I'll just ask the AI to build a new house for me, so I can sleep. I guess it will have to repurpose the old one, but that isn't my concern, it's just some implementation detail.
Wouldn't that be nice?
Every night, new house?
Once the dust settles, prices will go up. Even if running the models gets cheaper, the providers will need to earn back all the cash they've burned.
I’d much rather vibe-code an app and get the code to run on some server.
I can get GPT-3-level quality with Qwen 8B, even Qwen 4B in some cases.
This project could use something like that. Perhaps ask the LLM to implement a way to store/cache the snapshots of its previous answers. That way, the more you use it, the faster it becomes.
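As a hedged sketch of what that could look like (names and layout are mine, not the project's): key each generated answer by a hash of the prompt and check the cache before asking the model again.

```go
package snapshot

import (
	"crypto/sha256"
	"encoding/hex"
	"os"
	"path/filepath"
)

const dir = "snapshots" // assumed on-disk location for cached answers

// Lookup returns a previously generated answer for this prompt, if any.
func Lookup(prompt string) ([]byte, bool) {
	data, err := os.ReadFile(pathFor(prompt))
	return data, err == nil
}

// Save stores a generated answer so an identical prompt can skip the LLM.
func Save(prompt string, answer []byte) error {
	if err := os.MkdirAll(dir, 0o755); err != nil {
		return err
	}
	return os.WriteFile(pathFor(prompt), answer, 0o644)
}

// pathFor derives a stable filename from the prompt text.
func pathFor(prompt string) string {
	sum := sha256.Sum256([]byte(prompt))
	return filepath.Join(dir, hex.EncodeToString(sum[:]))
}
```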