Where's the shovelware? Why AI coding claims don't add up (mikelovesrobots.substack.com)
Posted by dbalatero 9/3/2025
762 points | 482 comments | page 2
InCom-0 9/4/2025|
On one hand, I don't understand what all the fuss is about. LLMs are great at all kinds of things: searching for (good) information, summarizing existing text, conceptual discussions where they point you in the right direction very quickly, etc. They are just not great (some might say harmful) at straight-up non-trivial code generation or the design of complex systems, with the added peculiarity that on the surface the models seem almost capable of doing it, but never quite are ... which is sort of their central feature: producing text that seems correct from a statistical perspective, but without actual reasoning.

On the other hand, I do understand that the things LLMs are really great at are not actually all that spectacular to monetize ... and so as a result we have all these snake oil salesmen on every corner boasting about nonsensical vibecoding achievements, because that's where the real money would be ... if it were really true ... but it is not.

larve 9/3/2025||
In case the author is reading this: I have receipts showing a real step function in how much software I build, especially lately. I am not going to put a number on it, because that makes no sense, but I certainly push a lot of code that reasonably seems to work.

The reason it doesn't show up online is that I mostly write software for myself and for work, with the primary goal of making things better, not faster. More tooling, better infra, better logging, more prototyping, more experimentation, more exploration.

Here's my opensource work: https://github.com/orgs/go-go-golems/repositories . These are not just one-offs (although there's plenty of those in the vibes/ and go-go-labs/ repositories), but long-lived codebases / frameworks that are building upon each other and have gone through many many iterations.

nerevarthelame 9/3/2025||
How are you sure it's increasing your productivity if it "makes no sense" to even quantify that? What are the receipts you have?
larve 9/3/2025||
I have linked my github above. I don't know how that fares in the bigger scope of things, but I went from 0 opensource to hundreds of tools and frameworks and libraries. Putting a number on "productivity" makes no sense to me, I would have no idea what that means.

I generate between 10-100k lines of code per day these days. But is that a measure of productivity? Not really...

sarchertech 9/4/2025|||
>I generate between 10-100k lines of code per day these days.

That’s absolute nonsense.

irthomasthomas 9/4/2025|||
He said "generate". This is trivial to do. And probably this is what Amodei meant when he said 90% of code would be AI by now. It doesn't meant that generated code is actually useful and gets checked in.
larve 9/4/2025||
Trivial is a pretty big word in this context. Expanding an idea into some sort of code is indeed just a matter of waiting. But the idea, the prompt, and the design of the overall workflow to leverage the capabilities of LLMs/agents in a professional, long-lived codebase are far from trivial, imo.
larve 9/4/2025|||
You can look at my GitHub, and I stream full unedited sessions on https://youtube.com/@program-with-ai
saulpw 9/4/2025||
I tuned in to a random spot at a random episode, didn't see any coding but did get to hear you say:

"I'm a person who hates art now...I never want to see art again. All I want to see is like, AI stuff. That's how bad it's gotten. Handmade? nuh-uh. Handmade code? ... anything by humans, just over. I'm just gonna watch pixels."

https://www.youtube.com/live/APkR4qRg1vM?si=XLGmH9uEjG08q-6x...

I watched a little more but was, uh, not impressed.

larve 9/4/2025||
I'm always a very serious person while I wait for people to join the stream. I'm sorry you weren't impressed, but tbf that's not really my goal, I just like building things and yapping about it.
saulpw 9/4/2025||
I'm not sure why you bother yapping about it yourself. It's too human. Just give an LLM a list of lowercase bullet points and have an AI voiceover read them. It'll be 10x more efficient.
coffeebeqn 9/4/2025|||
Who’s reviewing 10-100k lines of code per day? This sounds like a slop nightmare
larve 9/4/2025||
I only review what needs to be reviewed; I don't need to fully review every prototype, shell script, dev tool, etc., only what is in the critical path.

But if LLMs show us one thing, it's how bad our code review tools are. I have a set of tree-sitter helpers that let me examine different parts of a PR more easily: one that diffs semantic parts of the code instead of “files” and “lines”, one that gives me stats on which subsystems are touched and how they cross-correlate, one for attaching metadata and relating documents to a commit, and one for managing our design documents, intermediary LLM-coding documents, long-lasting documents, etc. The proper versions of these are for work, but here are the initial yolo versions from Manus: https://github.com/go-go-golems/vibes/tree/main/2025-08-22/p... https://github.com/go-go-golems/vibes/tree/main/2025-08-22/c... https://github.com/go-go-golems/vibes/tree/main/2025-08-15/d... https://github.com/go-go-golems/vibes/tree/main/2025-07-29/p...
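(To make the semantic-diff idea concrete: a minimal sketch of such a helper in Go, assuming the smacker/go-tree-sitter bindings. This is an illustration, not one of the actual work tools; it compares a file by top-level function rather than by lines:)

    package main

    import (
        "context"
        "fmt"

        sitter "github.com/smacker/go-tree-sitter"
        "github.com/smacker/go-tree-sitter/golang"
    )

    // funcDecls maps each top-level function name to its full source text.
    func funcDecls(src []byte) (map[string]string, error) {
        parser := sitter.NewParser()
        parser.SetLanguage(golang.GetLanguage())
        tree, err := parser.ParseCtx(context.Background(), nil, src)
        if err != nil {
            return nil, err
        }
        decls := map[string]string{}
        root := tree.RootNode()
        for i := 0; i < int(root.NamedChildCount()); i++ {
            n := root.NamedChild(i)
            if n.Type() == "function_declaration" {
                decls[n.ChildByFieldName("name").Content(src)] = n.Content(src)
            }
        }
        return decls, nil
    }

    // semanticDiff reports functions added, removed, or changed between two
    // versions of a file, keyed by name rather than by line position.
    func semanticDiff(before, after []byte) error {
        old, err := funcDecls(before)
        if err != nil {
            return err
        }
        cur, err := funcDecls(after)
        if err != nil {
            return err
        }
        for name, body := range cur {
            if prev, ok := old[name]; !ok {
                fmt.Println("added:  ", name)
            } else if prev != body {
                fmt.Println("changed:", name)
            }
        }
        for name := range old {
            if _, ok := cur[name]; !ok {
                fmt.Println("removed:", name)
            }
        }
        return nil
    }

    func main() {
        before := []byte("package x\n\nfunc A() {}\n\nfunc B() {}\n")
        after := []byte("package x\n\nfunc A() { println(1) }\n\nfunc C() {}\n")
        if err := semanticDiff(before, after); err != nil {
            panic(err)
        }
    }

Because it keys on function names instead of line numbers, a review tool shaped like this survives the wholesale file rewrites agents tend to produce.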

I very often put some random idea into the LLM slot machine that is Manus, use the result as a starting point to remold into a proper tool, and extract the relevant pieces as reusable packages. I've got a pretty wide treesitter/lsp/git based set of packages to manage LLM output and assist with better code reviews.

Also, every LLM PR comes with _extensive_ documentation / design documents / changelogs, by the nature of how these things work, which helps both humans and LLM-assisted code review tools.

larve 9/4/2025||
Since I get downvoted because I guess people don't believe me: I'm sitting at breakfast reading a book. I suddenly think about YAML streaming parsing, start a GPT chat, dig a bit deeper into streaming parser approaches, and launch a deep research on streaming parsing, which I will print out tomorrow and go through by hand at breakfast. I then take some of the GPT discussion and paste it into Manus, saying:

“Write a streaming go yaml parser based on the tokenizer (probably use goccy yaml if there is no tokenizer in the standard yaml parser), and provide an event callback to the parser which can then be used to stream and print to the output.

Make a series of test files and verify they are streamed properly.”

This is the slot machine. It might work, it might be 50% jank, it might be entirely jank. It'll be a few thousand lines of code that I will skim and run. In the best case, it's a great foundation to work on more properly. In the worst case, it was an interesting experiment and I will learn something about either prompting Manus, or streaming parsing, or both.

I certainly won’t dedicate my full code review attention to what was generated. Think of it more as a hyper specific google search returning stackoverflow posts that go into excruciating detail.
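(For the curious, a minimal sketch of the token-event shape that prompt asks for, assuming goccy/go-yaml's lexer package. Note that goccy's Tokenize consumes the whole input, so a genuinely streaming version would have to feed the lexer chunk by chunk:)

    package main

    import (
        "fmt"

        "github.com/goccy/go-yaml/lexer"
        "github.com/goccy/go-yaml/token"
    )

    // Event is handed to the callback for every token the lexer produces.
    type Event struct {
        Type  token.Type
        Value string
        Line  int
    }

    // walk tokenizes the document and fires the callback per token. This is
    // event-shaped rather than truly incremental, since Tokenize takes the
    // full input up front.
    func walk(src string, onEvent func(Event)) {
        for _, tk := range lexer.Tokenize(src) {
            onEvent(Event{Type: tk.Type, Value: tk.Value, Line: tk.Position.Line})
        }
    }

    func main() {
        doc := "name: gopher\nitems:\n  - a\n  - b\n"
        walk(doc, func(ev Event) {
            fmt.Printf("line %d: %v %q\n", ev.Line, ev.Type, ev.Value)
        })
    }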

https://chatgpt.com/share/68b98724-a8cc-8012-9bee-b9c4a77fe9...

https://manus.im/share/kmsyzuoRHfn1FNjg5NWz17?replay=1

trenchpilgrim 9/3/2025|||
Same. On many days 90% of my code output by lines is Claude generated and things that took me a day now take well under an hour.

Also, a good chunk of my personal OSS projects are AI assisted. You probably can't tell from looking at them, because I have strict style guides that suppress the "AI style", and I don't really talk about how I use AI in the READMEs. Do you also expect me to mention that I used IntelliSense and syntax highlighting?

droidjj 9/3/2025|||
The author’s main point is that there hasn’t been an uptick in total code shipped, as you would expect if people are 10x-ing their productivity. Whether folks admit to using AI in their workflow is irrelevant.
trenchpilgrim 9/3/2025|||
The bottleneck on how much I ship has never been how fast I can write and deploy code :)
larve 9/3/2025||||
Their main point is "AI coding claims don't add up", as shown by the amount of code shipped. I personally do think some of the more incredible claims about AI coding add up, and am happy to talk about it based on my "evidence", ie the software I am building. 99.99% of my code is ai generated at this point, with the occasional one line I fill in because it'd be stupid to wait for an LLM to do it.

For example, I've built 5-6 iphone apps, but they're kind of one-offs and I don't know why I would put them up on the app store, since they only scratch my own itches.

Gormo 9/4/2025|||
I'd suspect that a very large proportion of code has always been "private code", written for personal or intra-organizational purposes, which never gets released publicly.

But if we expect the ratio of this sort of private code to publicly-released code to remain relatively stable, which I think is a reasonable expectation, then we'd expect there to be a proportional increase in both private and public code as a result of any situation that increased coding productivity generally.

So the absence of a notable increase in the volume of public code either validates the premise that LLMs are not actually creating a general productivity boost for software development, or instead points to its productivity gains being concentrated entirely in projects that never do get released, which would raise the question of why that might be.

trenchpilgrim 9/3/2025|||
Oh yeah, I love building one-off tools with it. I am working on a game mod with a friend: we are hand-writing the code that runs when you play it, but we vibe-code all sorts of dev tools to help us test and iterate on it faster.

Do internal, narrow purpose dev tools count as shipped code?

daxfohl 9/3/2025||
This seems to be a common thread. For personal projects where most details aren't important, LLMs are good at meeting the couple of things that are important to you and filling in the rest with reasonable, mostly-good-enough guesses. But the more detailed the requirements, the less filler code there is and the more each line of code matters. In those situations it's probably faster to type the line of code than to type the English equivalent and hand-hold the assistant through the editing process.
larve 9/3/2025||
I don't think so, although I think at that point experience heavily comes into play. With GPT-5 especially, I can basically point cursor/codex at a repo, say "refactor this to this pattern", and come back 25 minutes later to a pretty much impeccable result. In fact, that's become my favourite pastime lately.

I linked some examples higher up, but I've been maintaining a lot of packages that I started slightly before chatgpt and then refactored and worked on as I progressively moved to the "entirely AI generated" workflow I have today.

I don't think it's an easy skill (not saying that to make myself look good; I spent an ungodly amount of time exploring programming with LLMs and still do). It's akin to thinking at a strategic level vs. at a "code" level.

Certain design patterns also make it much easier to deal with LLM code: state reducers (redux/zustand, for example), event-driven architectures, component-based design systems, and building many CLI tools that the agent can invoke to iterate and correct things. Certain "tools" like sqlite/tmux help too: just by telling the LLM "btw, you can use tmux/sqlite", you let it clear hurdles that would otherwise make it spiral into slop-ratatouille.
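(To make the reducer point concrete, a minimal sketch of the pattern in Go; the State/Action names are illustrative, not from any particular repo. The value for LLM workflows is that every state transition lives in one pure function the agent can extend without touching the rest of the app:)

    package main

    import "fmt"

    // State is the single source of truth the reducer evolves.
    type State struct {
        Count int
        Log   []string
    }

    // Action is a tagged union of everything that can happen.
    type Action struct {
        Kind string // "increment", "reset", ...
        Note string
    }

    // reduce is the one place state transitions live. An agent asked to
    // "add an action" only has to touch this switch and the Action type,
    // which keeps generated diffs small and easy to review.
    func reduce(s State, a Action) State {
        switch a.Kind {
        case "increment":
            s.Count++
        case "reset":
            s.Count = 0
        }
        s.Log = append(s.Log, a.Note)
        return s
    }

    func main() {
        s := State{}
        for _, a := range []Action{
            {Kind: "increment", Note: "first"},
            {Kind: "increment", Note: "second"},
            {Kind: "reset", Note: "start over"},
        } {
            s = reduce(s, a)
        }
        fmt.Printf("%+v\n", s)
    }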

I also think that a language like go was a really good coincidence, because it is so amenable to LLM-ification.

Aeolun 9/3/2025||||
I don’t think this is necessarily true. People that didn’t ship before still don’t ship. My ‘unshipped projects’ backlog is still nearly as large. It’s just got three new entries in the past two months instead of one.
warkdarrior 9/3/2025|||
Maybe people are working less and enjoying life more, while shipping the same amount of code as before.

If someone builds a faster car tomorrow, I am not going to go to the office more often.

leoc 9/3/2025|||
"In this economy?", as the saying goes.
jplusequalt 9/4/2025|||
Jevons paradox.
jplusequalt 9/4/2025|||
>Do you also expect I mention that I used Intellisense and syntax highlighting too?

No, but I expect my software to have been verified for correctness and soundness by a human being with a working mental model of how the code works. But I guess that's not a priority anymore if you're willing to sacrifice $2400 a year to Anthropic.

trenchpilgrim 9/5/2025||
$2400? Mate, I have a free GitHub Copilot subscription (Microsoft hands them out to active OSS developers), and work pays for my Claude Code via our cloud provider backend (and it costs less per working day than my morning Monster can). LLM inference is _cheap_ and _getting cheaper every month_.

> No, but I expect my software to have been verified for correctness, and soundness by a human being with a working mental model of how the code works.

This is not exclusive with AI tools:

- Use AI to write dev tools to help you write and verify your handwritten code. Throw the one-off dev tools in the bin when you're done.

- Handwrite your code, generate test data, review the test data like you would a junior engineer's work.

- Handwrite tests, AI generate an implementation, have the agent run tests in a loop to refine itself. Works great for code that follows a strict spec. Again, review the code like you would a junior engineer's work.
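(A concrete version of that third workflow, with a hypothetical slugify function as the spec target: the table-driven test is handwritten, and the agent re-runs `go test ./...` after each edit until it passes. The implementation shown is just one the loop might converge on:)

    package slug

    import (
        "regexp"
        "strings"
        "testing"
    )

    var nonAlnum = regexp.MustCompile(`[^a-z0-9]+`)

    // slugify is the agent's job: in the workflow above it starts out
    // empty, and this body is one the test loop might converge on.
    func slugify(s string) string {
        return strings.Trim(nonAlnum.ReplaceAllString(strings.ToLower(s), "-"), "-")
    }

    // TestSlugify is the handwritten spec the agent iterates against.
    func TestSlugify(t *testing.T) {
        cases := []struct{ in, want string }{
            {"Hello, World!", "hello-world"},
            {"  spaces  everywhere  ", "spaces-everywhere"},
            {"already-slugged", "already-slugged"},
            {"", ""},
        }
        for _, c := range cases {
            if got := slugify(c.in); got != c.want {
                t.Errorf("slugify(%q) = %q, want %q", c.in, got, c.want)
            }
        }
    }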

jplusequalt 7 days ago||
Writing the tests by hand, but letting the AI write the code sounds horribly dull.
trenchpilgrim 7 days ago||
I'm an infrastructure/platform engineer. If the code is boring, that probably means I'm doing my job well. This isn't hobby coding.
noidesto 9/4/2025|||
Agreed. In the hands of a seasoned dev, not only does productivity improve, but so does the quality of the output.

If I'm working against a deadline, I feel more comfortable spending time on research and design, knowing I can spend less time on implementation. In the end it takes the same amount of time, though hopefully with an increase in reliability, observability, and extensibility. None of these things show up in the author's faulty dataset and experiment.

ryanobjc 9/4/2025|||
The author is pointing out that aggregate productivity hasn't really gone up. The graphs are fairly compelling.

There are many reasons for your experience, and I am glad you are having them! That's great!

But the fact remains: overall, we aren't seeing an exponential, or even step-function, increase in how much software is being delivered!

xenobeb 9/5/2025|||
What is even the point in having this argument?

At this point, you're either gaining with each model release or you're not.

Let's see in 2035 who was right and who was wrong. My bet is that the people who are not gaining right now are not going to like the situation in 2035.

philipwhiuk 9/4/2025||
I mean it's definitely shovelware, I'll give you that.

https://github.com/go-go-golems/ai-in-action-app/blob/main/c...

larve 9/4/2025||
Not sure what you mean? This was a demo in a live session that took about 30 minutes, including UI ideation (see PNGs). It's a reasonably well featured app, and the code is fairly minimal. I wouldn't be able to write something like that in 30 minutes by hand.
benjiro 9/3/2025||
I have to agree with the author, with a caveat: he is a seasoned developer. For somebody like him, churning out good quality code is probably easy.

Where I expect a lot of those "feeling fast" reports to come from is people with less coding experience who, with AI, are coding way above their level.

My brother-in-law asks for a nice product website; I just feed his business plan into an LLM, do some fine-tuning of the results, and have a good-looking website in an hour. If I did it myself manually, just take me behind the barn, as those jobs are so boring and take ages. But I know that website design is a weakness of mine.

That is the power of LLMs: they turn out quick code and maybe offer a suggestion you did not think about. But it also eats time! Crafting your prompts so that the LLM understands, waiting for the result ... waiting ... ok, now check the result, can you use it? Oh no, it got X, Y, Z wrong. Prompt again ... and again. And this is where your productivity goes to die.

So when you compare a pool of developer feedback, you're going to get a broad mix of "it helps a lot", "somewhat", "it's worse than my code", ... mixed in with the prompting, result delays, etc.

It gets even worse with agent / vibe coding, as you just tend to be waiting 5, 10 minutes for changes to be done. You need to review them, test them ... oh no, the LLM screwed something up again. Oh no, it removed 50% of my code. Hey, where did my comments go? And we are back to a loss of time.

LLMs are a tool. But after a lot of working with them, my opinion is to use them when needed, not to depend on them for everything. I sometimes just stare in disbelief when people say they are coding so much with LLMs and spending 200 or more bucks per month.

They can be powerful tools, but I feel that some folks become over-dependent on them. And worst is my feeling that our juniors are going to be in a world of hurt if their skills are more LLM monkey coding (or vibe coding) than actually understanding how to code (and the knowledge behind the actual programming languages and systems).

solatic 9/4/2025||
I'm not sure what to make of these takes because so many people are using such an enormous variety of LLM tooling in such a variety of ways, people are going to get a variety of results.

Let's take the following scenario for the sake of argument: a codebase with well-defined AGENTS.md, referencing good architecture, roadmap, and product documentation, and with good test coverage, much of which was written by an LLM and lightly reviewed and edited by a human. Let's say for the sake of argument that the human is not enjoying 10x productivity despite all this scaffolding.

Is it still worthwhile to use LLM tooling? You know what, I think a lot of companies would say yes. There are way too many companies whose codebases lack testing and documentation, that are too difficult to on-board new engineers into, and that carry too much risk if the original engineers are lost. The simple fact that LLMs, to be effective, force the adoption of proper testing and documentation is a huge win for corporate software.

noodletheworld 9/4/2025|
> people are going to get a variety of results.

Yes, but the point of this article is surely that if it's working on average, there would be obvious signs of it working by now.

Even if there are statistical outliers (i.e. 10x productivity using the tools), if on average it does nothing to the productivity of developers, something isn't working as promised.

ketozhang 9/4/2025||
We need long-running averages, and 2023-2025 is still too early to determine that it's not effective. The barrier to entry in 2023 and 2024, I'd argue, was too high for inexperienced developers to start churning out software. For seasoned developers, there was skepticism, and company adoption wasn't there yet (and still isn't).
raylad 9/4/2025||
I used to be a full-time developer back in the day. Then I was a manager. Then I was a CTO. I stopped doing the day-to-day development and even stopped micro-managing the detailed design.

When I tried to code again, I found I didn't really have the patience for it: having to learn new frameworks, APIs, languages, tricky little details. I used to find it engrossing; it had become annoying.

But with tools like Claude Code and my knowledge about how software should be designed and how things should work, I am able to develop big systems again.

I'm not 20% more productive than I was. I'm not 10x more productive than I was either. I'm infinity times more productive, because realistically I wouldn't be doing it at all otherwise: I'd either hire someone to do it or, if it wasn't important enough to go through the trouble of hiring someone, not do it at all.

Sure, if you are a great developer and spend all day coding and love it, these tools may just be a hindrance. But if you otherwise wouldn't do it at all they are the opposite of that.

ferrous69 9/4/2025||
My grand theory on AI coding tools is that they don't really save time, but they massively save on annoyance. I can save my frustration budget for useful things instead of fiddling with syntax or compiler messages or repetitive tasks. Oftentimes this means I'll take on a task I would otherwise find too frustrating in an already frustrating world, or stay at my desk longer before needing to take a walk or ditch the office for the bar.
jdlshore 9/4/2025|||
If you’re a CTO who can no longer program, the solution isn’t to use AI to program again; the solution is to hire people who can program. The question at hand is whether AI helps your developers, not whether it helps you. You’re the CTO. It’s not your job to program.
raylad 9/4/2025||
Some of the projects I've been doing are for myself in other businesses, automating processes that were time consuming or... annoying.

Others are for start-ups that are pre-money, pre-revenue where I can build things myself without having to deal with hiring people.

In a larger organization, certainly I'd delegate to other people, but if it's just for me or new unfunded start-ups, this is working out very well.

And it's not that I "can no longer program". I can program; it's just that I don't find the nuts and bolts of it as interesting as I used to, and I am more focused on functionality, algorithms, and UI.

kobe_bryant 9/4/2025||
wow, not just one but multiple big systems? well, share the details with us
weweersdfsd 9/4/2025||
The problem with current GenAI is the same as with outsourcing to the lowest bidder in India or wherever. For any non-trivial project you'll get something that may appear to work, but for anything production-ready you'll most likely spend lots of time testing, verifying, cleaning up the code, and making changes for things the AI didn't catch. Then there's requirement gathering, discussing with stakeholders, gathering more feedback, debugging when things fail in production, and so on...

I believe it's a productivity boost, but only for a small part of my job. The boost would be larger if I only had to build proof-of-concepts or hobby projects that don't need to be reliable in prod and don't require feedback and requirements from many other people.

iainctduncan 9/4/2025||
This reminds me of something. I'm a jazz musician when not being a coder, and have studied with and taught a lot of players. One thing advanced improvisers notice is that the student is very frequently not a good judge, in the moment, of what is making them better. Doing long-term analytic tests (as the author did) works, but knowing how well something is working while you're doing it? Not so much. Very, very frequently, that which feels productive isn't, and that which feels painful and slow is.

Just spitballing here, but it sure feels similar.

bjackman 9/3/2025||
There is actually a lot of AI shovelware on Steam. Sort by newest releases and you'll see stuff like a developer releasing 10 puzzle games in one day.

I have the same experience as OP, I use AI every day including coding agents, I like it, it's useful. But it's not transformative to my core work.

I think this comes down to the type of work you're doing. I think the issue is that most software engineering isn't in fields amenable to shovelware.

Most of us either work in areas where the coding is intensely brownfield, where AI is great but not doubling anyone's productivity, or in areas where the productivity bottlenecks are nowhere near the code.

sarchertech 9/4/2025|
If you look at the actual Steam metrics, though, we're barely seeing more game releases than we were last year.

If AI were really making people 10x more productive, given the number of people who want to make games, you’d expect to see more than a few percent increase year over year.

kenjackson 9/3/2025||
Shovelware may not be a good way to track additional productivity.

That said, I'm skeptical that AI is as helpful for commercial software. It's been great at automating my workflow, because I suck at shell scripting and AI is great at it. But for most of the code I write, I honestly halfway don't know what I'm going to write until I write it. The prompt itself is where my thinking would go, so the time savings would be fairly small. But I also think I'm fairly skilled (except at scripting).

NathanKP 9/3/2025|
I think the explanation is simple: there is a direct correlation between being too lazy and demotivated to write your own code, and being too lazy and demotivated to actually finish a project and publish your work online.

The same people who are willing to go through all the steps to release an application online are also willing to go through the extra effort of writing their own code. The code is actually the easy part compared to the rest of it... always has been.
