AI can code, but it can't build software

Posted by nreece 10/27/2025

AI can code, but it can't build software (bytesauna.com)

262 points | 177 comments

simonw 10/27/2025|

This is a good headline. LLMs are remarkably good at writing code. Writing code isn't the same thing as delivering working software.

A human expert needs to identify the need for software, decide what the software should do, figure out what's feasible to deliver, build the first version (AI can help a bunch here), evaluate what they've built, show it to users, talk to them about whether it's fit for purpose, iterate based on their feedback, deploy and communicate the value of the software, and manage its existence and continued evolution in the future.

Some of that stuff can be handled by non-developer humans working with LLMs, but a human expert who understands code will be able to do this stuff a whole lot more effectively.

I guess the big question is if experienced product management types can pick up enough coding technical literacy to work like this without programmers, or if programmers can pick up enough PM skills to work without PMs.

My money is on both roles continuing to exist and benefit from each other, in a partnership that produces results a lot faster because the previously slow "writing the code" part is a lot faster than it used to be.

prmph 10/28/2025||

> LLMs are remarkably good at writing code.

Just this past weekend, I've designed and written code (in TypeScript) that I don't think LLMs can even come close to writing in years. I have a subscription to a frontier LLM, but lately I find myself using it like 25% of the time.

At a certain level the software architecture problems I'm solving, drawing upon decades of understanding about maintainable, performant, and verifiable design of data structures and types and algorithms, are things LLMs cannot even begin to grasp.

At that point, I find that attempting to use an LLM to even draft an initial solution is a waste of time. At best I can use it for initial brainstorming.

The people saying LLMs can code are hard for me to understand. They are good for simple bash scripts and complex refactoring and drafting basic code idioms and that's about it.

And even for these tasks the amount of hand-holding I need to do is substantial. At least Gemini Pro/CLI seems good at one-shot performance, before its context gets poisoned.

Aperocky 10/28/2025|||

I found that mastering LLMs is no less complex than learning a new language, probably somewhere between Python and C++ in terms of mastery.

The learning curve is very different - with other languages, the learning curve is often upfront, with LLM, it seems linear/even rear loaded, maybe because I've not gotten to the other side.

I've been able to make LLMs do more and more. Some of it is undoubtedly due to improvements in the models, but most of it is probably changes in paradigm and in my approach. At the beginning, I ran into all of the same complaints, and I have eventually found workarounds for many of them.

jcelerier 10/28/2025||||

> The people saying LLMs can code are hard for me to understand. They are good for simple bash scripts and complex refactoring and drafting basic code idioms and that's about it

that's like, 90% of the code people are writing

FromTheFirstIn 10/28/2025||

But not 90% of the work people do. It’s solved a task, not a problem.

lan321 10/28/2025||

It's what takes time though. When you need to make a wrapper for some API for example LLMs are incredible. You give it a template, the payload format and the possible methods and it just spits out a 500-1000 line class in 15 seconds. Do it for 20 classes, that's work for a week 'done' in 30 mins. Realistically 2 days since you still have to fix and test a lot but still..

skydhash 10/28/2025||

Or write a lisp macro in one hour and be done. Or install an opengenerator and be done in 10 minutes, 9 of which is configuring the generator.

lan321 10/28/2025|||

If you can get the specific documentation for it. Sadly many companies don't want you using the API so they just give you a generic payload and the methods and leave you to it. LLMs are good in the sense that they can tell what type StartDate, EndDate is (str MSDate), maybe it also somehow catches on that ActualDuration is an int.. It also manages to guess correctly a lot of the fields in that payload that are not necessary for the particular call/get overridden anyway.

theshrike79 10/28/2025|||

Can a Lisp macro automatically search for, and find, the API documentation and apply it to the output?

I've implemented connections to (public) APIs of different services multiple times using LLMs without even looking up the APIs myself.

I just say "Enrich the data about this game from Steam's API" and that's about it.

airstrike 10/28/2025||||

I find LLMs most helpful when I already have half of the answer written and need them to fill in the blanks.

"Take X and Y I've written before, some documentation for Z, an example W from that repo, now smash them together and build the thing I need"

martin1027 10/28/2025||

This is so true. I've had the same experience.

latentsea 10/28/2025||||

I think C# is really going to shine in the LLM coding era. You can write Roslyn Analyzers to fail the build on arbitrary conditions after inspecting the AST. LLMs are great at helping you write these too. If you get a solid architecture well defined you can then use these as guardrails to constrain development to only happen in the manner you intend. You can then get LLMs to implement features and guarantee the code comes out in the shape you expect it to.

This works well for humans too, but custom analysers are abstract and not many devs know how to write them, so they are mostly provided by library authors. However, being able to generate them via LLMs makes them so much more accessible, and IMHO is a game changer for enforcing an architecture.

I've been exploring this direction a lot lately, and it feels very promising.

raddan 10/28/2025|||

Can you expand a little? What you’re suggesting sounds a bit like program verification, or at least program analysis. But what properties are you checking?

I have written many program analyses (though never any for C#; I’ll have to check it out), and my experience is that they are quite challenging to write. Many are research-level CS, so well outside the skill set of your average vibe coder. I’m wondering if you have some insight about LLM generated code that has not occurred to me…

aitchnyu 10/28/2025||||

I'm looking at AST based tools in Python to lint, enforce modularity, ban certain patterns. LLMs allow me to write scripts to find recursive function calls, calling super().method() in overrides etc.

https://pypi.org/project/import-linter/ https://github.com/hchasestevens/astpath
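
A minimal sketch of the two checks mentioned above, using only Python's standard `ast` module. The sample source and the function names `find_direct_recursion` and `find_super_calls` are made up for illustration; a real linter would also need to handle indirect recursion, aliases and nested scopes.

```python
import ast

# Hypothetical code under inspection.
SOURCE = '''
class Base:
    def save(self):
        return "base"

class Child(Base):
    def save(self):
        return super().save()

def countdown(n):
    if n > 0:
        countdown(n - 1)

def helper():
    return 1
'''


def find_direct_recursion(tree: ast.AST) -> list[str]:
    """Return names of functions that call themselves by bare name."""
    found = []
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            for inner in ast.walk(node):
                if (isinstance(inner, ast.Call)
                        and isinstance(inner.func, ast.Name)
                        and inner.func.id == node.name):
                    found.append(node.name)
                    break
    return found


def find_super_calls(tree: ast.AST) -> list[tuple[str, str]]:
    """Return (class, method) pairs where a method calls super().<same method>()."""
    found = []
    for cls in ast.walk(tree):
        if not isinstance(cls, ast.ClassDef):
            continue
        for method in cls.body:
            if not isinstance(method, ast.FunctionDef):
                continue
            for inner in ast.walk(method):
                if (isinstance(inner, ast.Call)
                        and isinstance(inner.func, ast.Attribute)
                        and inner.func.attr == method.name
                        and isinstance(inner.func.value, ast.Call)
                        and isinstance(inner.func.value.func, ast.Name)
                        and inner.func.value.func.id == "super"):
                    found.append((cls.name, method.name))
                    break
    return found


tree = ast.parse(SOURCE)
print(find_direct_recursion(tree))
print(find_super_calls(tree))
```

Both functions only read the syntax tree, so they can run in CI without importing or executing the code being checked.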

torginus 10/28/2025||||

I do quite a bit of coding in C#, and have a lot of experience, and personally I haven't found LLMs to be that great a help at writing C#.

First, LLMs are great at learning new tech stacks, but good ol' ASP.NET has been pretty much stable since forever. Second, I think Rider/Resharper is the greatest piece of autocomplete tech ever made, seriously nothing ever comes close, which means I'd rather do a refactor using them than do something similar by prompting the AI and hoping for the best. Also probably my experience makes me far less accepting of LLMisms, but that might just be on me.

Lastly, AI seems to be focused around its own set of tooling, like Cursor, which is fine for TS but is far worse than Rider for things like C#. I know I could kludge things together, but still.

As for Roslyn...

I have some experience writing codegen/analyzers at my company and it feels like a typical Microsoft tech product, like WPF or Powershell.

Brilliant idea (that's a market first as well) combined with really solid technical fundamentals, but plain confusing and overcomplicated UX that makes it a chore to use. Seriously, the amount of scaffolding you need to make even for a simple analyzer is just nuts.

latentsea 10/30/2025|||

My point is I can make an analyser in like 20 minutes now and it's not a chore at all. I've made like 15 for my current codebase, and when the LLM goes off the rails and generates code in a pattern I don't like, I don't tell it how to write the code much anymore. I tell it to write an analyser that prevents it from writing the code that way, and then it goes about fixing it up because now the build fails.

I do a lot of coding in C# with Rider, and refactoring has been a career speciality of mine. I personally find LLMs to have a tonne of value in this space.

theshrike79 10/28/2025||||

> Lastly, AI seems to be focused around its own set of tooling, like Cursor

Nah, the best coding LLMs are console applications like Claude Code, Codex CLI and the like.

Editor integration mostly brings more tools, like tapping into different validators on VSCode and examining the "problems" view.

Also Rider's autocomplete is at least partially AI powered unless you specifically disable it IIRC.

pjmlp 10/28/2025|||

Why I never bothered writing one is the scaffolding, and the dumb idea to write code with WriteLines instead of a nice experience like T4 templates.

bgrainger 10/28/2025||||

Completely agree, and I've started writing more Roslyn analyzers to provide quick feedback to the LLM (assuming you're using it in something like VS Code that exposes the `problems` tool to the model).

I also want C# semantics even more closely integrated with the LLM. I'm imagining a stronger version of Structured Model Outputs that knows all the valid tokens that could be generated following a "." (including instance methods, extension properties, etc.) and prevents invalid code from even being generated in the first place, rather than needing a roundtrip through a Roslyn analyzer or the compiler to feed more text back to the model. (Perhaps there's some leeway to allow calls to not-yet-written methods to be generated.) Or maybe this idea is just a crutch I'm inventing for current frontier models and future models will be smart enough that they don't need it?

seanmcdirmid 10/28/2025||||

Library authors don’t really provide custom analyzers. Heck, the best we can hope for are some regex based linting rules; anything that involves local data flow analysis is very rare, and anything inter-procedural is non-existent. Program analysis is a dark hole. You are better off just making stronger type systems, but then type inference starts to bite you if you want to support it (and you will, given how annoying type annotations are to write, unless you go with something simple like a purely structural type system so you can use Hindley Milner).

latentsea 10/30/2025||

Analyzers are a first class citizen in C#. You can get access to the AST during compile time and use it to output diagnostics with error or warning level, so it's more robust than just regex.

I've not seen teams personally write them because they're abstract and most devs shy away from it, but working in the C# ecosystem the one place I do see them pop up occasionally is from library authors.

For example xunit has some.

https://github.com/xunit/xunit.analyzers

So far I have found the value in them to be they help you constrain the possible valid moves that can be made in a codebase. This is valuable with teams of human engineers, but even more so with LLMs. It just so happens LLMs are really good at helping you write them too given you know what you want to enforce.

I doubt the tooling is as good in other languages as it is in C# in this respect, but at least for devs working in the C# ecosystem LLMs have unlocked access to writing custom analysers on a whole new level, and with that it's now significantly easier to define and enforce rules regarding what constitutes a valid program, such that the set of valid programs matches your intended architecture.

pjmlp 10/28/2025||||

As someone for whom C# is one of the main work ecosystems, I highly doubt it.

What I am seeing is that LLMs will push current programming languages down the stack, like now you're enjoying C# => MSIL => Machine code.

In my line of work I can already imagine the other side of the tunnel: more low-code/no-code tooling, orchestration agents, and much (much) less manually writing C#, Java and TypeScript.

latentsea 10/30/2025||

I don't disagree with your take. My take on your take is that via what I'm suggesting I can envision that the low-code/no-code tooling can be expanded to produce a wider variety of more flexible programs with robust, consistent C# code underpinning them.

KurSix 10/28/2025|||

Linters are great at catching specific pattern violations, but they’re useless against bad decomposition or a poorly chosen abstraction. An LLM can generate code that passes all 100 linters and still ends up being a logical mess - with business logic in the wrong layer and completely unmaintainable.

latentsea 10/30/2025|||

With Roslyn Analyzers and things like ArchUnit I've found it's possible to actually write linters that enforce a predetermined architecture with established patterns, such as enforcing usage of particular base or framework level classes in specific layers/locations.

I agree with the assessment LLMs aren't great at novel architectural work. I'm merely reporting my experience that using LLMs to write analysers that enforce established patterns takes the output from random to well ordered and provides a nice productivity boost in a constrained, but valuable, set of circumstances. It's not a complete solve, but it's a big improvement over just prompting them and hoping for the best.
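
To make the idea concrete, here is a rough analog in Python rather than Roslyn or ArchUnit: a check that every class in a given layer inherits from a required base class, reporting compiler-style diagnostics. The name `BaseHandler`, the `handlers/` paths and the function `check_layer` are all hypothetical, and a real analyzer would resolve types properly instead of matching names.

```python
import ast

# Hypothetical architectural rule: every class in the handlers layer
# must inherit directly from this base class.
REQUIRED_BASE = "BaseHandler"


def check_layer(source: str, filename: str) -> list[str]:
    """Return one diagnostic per top-level class that breaks the rule."""
    diagnostics = []
    for node in ast.parse(source).body:
        if not isinstance(node, ast.ClassDef):
            continue
        # Only plain-name bases are considered in this sketch.
        bases = {base.id for base in node.bases if isinstance(base, ast.Name)}
        if node.name != REQUIRED_BASE and REQUIRED_BASE not in bases:
            diagnostics.append(
                f"{filename}:{node.lineno}: error: "
                f"class {node.name} must inherit from {REQUIRED_BASE}"
            )
    return diagnostics


GOOD = "class CreateUser(BaseHandler):\n    pass\n"
BAD = "class Rogue:\n    pass\n"

for name, text in [("handlers/create_user.py", GOOD), ("handlers/rogue.py", BAD)]:
    for line in check_layer(text, name):
        print(line)
```

A build script could exit non-zero whenever the diagnostics list is non-empty, which gives a similar "the build fails until the pattern is followed" loop.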

KurSix 11/5/2025||

This approach is perfect for mature projects with a well-established architecture. But it could be counterproductive in the early R&D stages when the architecture is still fluid and the team is constantly experimenting. The rigid constraints of the analyzers might stifle creativity. So it's a powerful tool, but for the right stage of a project's lifecycle

prmph 10/28/2025|||

Exactly this. LLMs may do a passable job of architecture if there are many examples of high quality architecture similar to what you want to do in their training set, but introduce some novel stuff and they are clueless.

CjHuber 10/28/2025||||

Can you maybe give an example you’ve encountered of an algorithm or a data structure that LLMs cannot handle well?

In my experience implementing algorithms from a good comprehensive description and keeping track of data models is where they shine the most.

prmph 10/28/2025|||

Example (expanding on [1]): I want to design a strongly typed fluent API interface to some role/permissions based authorization engine functionality. Even knowing how to shape the fluent interface so that it is powerful but intuitive, as strongly typed as possible but also maintainable, is a deep art.

One reason I know an LLM can't come close to my design is this: I've written something that works (that a typical senior engineer might write), but this is not enough. I have evaluated it critically (drawing on my experience with long lived software), rewritten it again to better meet the targets above, and repeated this process several times. I don't know what would make an LLM go: now that kind of works, but is this the most intuitive, well typed, and maintainable design that there could be?

1. https://news.ycombinator.com/item?id=45728183

simonw 10/28/2025|||

Funny you should use role/permissions as an example here, I spent the weekend using Claude Code to rewrite my own permissions engine to a new design that uses SQL queries to solve the problem "list all of the resources that this actor can perform this action on".

My previous design required looping through all known resources asking "can actor X action Y on this?". The new design gets to generate a very complex but thoroughly tested SQL query instead.

Applying that new design and updating the hundreds of related tests would have taken me weeks. I got it done in two days.

Here's a diff that captures most of the work: https://github.com/simonw/datasette/compare/e951f7e81f038e43...
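
A toy illustration of the difference between the two designs, using Python's built-in `sqlite3`. The `resources` and `grants` tables and all function names are invented for this sketch and are not Datasette's actual schema or API; real permission rules are far more involved than a single grants table.

```python
import sqlite3

# Hypothetical schema: one row per (actor, action, resource) grant.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE resources (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE grants (actor TEXT, action TEXT, resource_id INTEGER);
INSERT INTO resources VALUES (1, 'docs'), (2, 'billing'), (3, 'wiki');
INSERT INTO grants VALUES
    ('alice', 'view', 1), ('alice', 'edit', 1),
    ('alice', 'view', 3), ('bob', 'view', 2);
""")


def can(actor: str, action: str, resource_id: int) -> bool:
    """Per-resource check: 'can actor X do action Y on this?'"""
    row = conn.execute(
        "SELECT 1 FROM grants WHERE actor = ? AND action = ? AND resource_id = ?",
        (actor, action, resource_id),
    ).fetchone()
    return row is not None


def allowed_by_looping(actor: str, action: str) -> list[int]:
    """Old shape: one permission check per known resource."""
    ids = [row[0] for row in conn.execute("SELECT id FROM resources ORDER BY id")]
    return [rid for rid in ids if can(actor, action, rid)]


def allowed_by_query(actor: str, action: str) -> list[int]:
    """New shape: let a single SQL query produce the whole list."""
    rows = conn.execute(
        """SELECT DISTINCT r.id
           FROM resources r
           JOIN grants g ON g.resource_id = r.id
           WHERE g.actor = ? AND g.action = ?
           ORDER BY r.id""",
        (actor, action),
    )
    return [row[0] for row in rows]


print(allowed_by_looping("alice", "view"))
print(allowed_by_query("alice", "view"))
```

The looping version issues one query per resource, while the second version issues one query in total, which is the property that matters once the number of resources grows.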

realusername 10/28/2025||||

Working on permissions for a large SaaS app, I can also confirm that the best LLMs on the market have maybe a 10% success rate writing code in this area.

YZF 10/28/2025|||

What % of the total amount of software (let's say lines of code or time invested) in the world is like that?

manwe150 10/28/2025||||

Converting an algorithm implementation from recursive to iterative: it got the concept broadly right, but was quite bad at making the logic actually match up, often refusing to fix mistakes or reverting fixes two edits later. Still a positive experience though, since the issues were fixable and it reduced the amount of tedious copies I had to type.
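
For reference, the usual shape of this transformation is to replace the call stack with an explicit one. A minimal Python sketch on a toy problem (flattening nested lists), which is an assumption for illustration and not the algorithm discussed above:

```python
def flatten_recursive(items: list) -> list:
    """Depth-first flatten using the call stack."""
    out = []
    for item in items:
        if isinstance(item, list):
            out.extend(flatten_recursive(item))
        else:
            out.append(item)
    return out


def flatten_iterative(items: list) -> list:
    """Same traversal order, with an explicit stack of iterators."""
    out = []
    stack = [iter(items)]
    while stack:
        try:
            item = next(stack[-1])
        except StopIteration:
            # Finished this level; resume the parent where it left off.
            stack.pop()
            continue
        if isinstance(item, list):
            stack.append(iter(item))
        else:
            out.append(item)
    return out


nested = [1, [2, [3, 4], 5], [], 6]
print(flatten_recursive(nested))
print(flatten_iterative(nested))
```

Keeping iterators on the stack, rather than the lists themselves, is what preserves the "resume where the parent left off" behavior that recursion provides for free. It is also the part where the two versions most easily drift apart.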

cadamsdotcom 10/28/2025||

Did you have it write tests and give it the ability to iterate & validate its implementation without you in the loop?

Anything less is setting it up for failure...

manwe150 10/28/2025||

Yes, but it got 99% of those then got stuck on why the others made no sense to it

cadamsdotcom 10/28/2025||

It’s important to understand the tests it’s written yourself.

If you’d like some help I’d be glad to, just drop me an email.

My email’s in my profile.

herbst 10/28/2025||||

Claude added a self re-calling timeout to my Typescript game loop to track time. Manually by adding 1000ms every time it's called.

I removed it and it later just added it again.

It's these small weird things where it can mess up a lot of code.
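
One reading of why that pattern is a problem: a timer callback fires at least the requested delay after it was scheduled, not exactly that delay, so adding a fixed 1000 ms per call falls behind real time. A small sketch in Python (not the original TypeScript) with simulated callback timestamps. The 15 ms lateness per callback is an invented figure, and all function names are hypothetical:

```python
def elapsed_by_counting(callback_times_ms: list[int], step_ms: int = 1000) -> int:
    """Add a fixed step per callback, ignoring when it actually fired."""
    return step_ms * len(callback_times_ms)


def elapsed_by_clock(start_ms: int, callback_times_ms: list[int]) -> int:
    """Derive elapsed time from the clock reading at the latest callback."""
    if not callback_times_ms:
        return 0
    return callback_times_ms[-1] - start_ms


def simulate_callbacks(count: int, step_ms: int = 1000, lateness_ms: int = 15) -> list[int]:
    """Clock readings for callbacks that each fire lateness_ms late."""
    times = []
    now = 0
    for _ in range(count):
        now += step_ms + lateness_ms
        times.append(now)
    return times


times = simulate_callbacks(60)
drift_ms = elapsed_by_clock(0, times) - elapsed_by_counting(times)
print(drift_ms)
```

In a real game loop the clock-based version would read a monotonic clock on every frame instead of trusting the timer interval.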

NicoJuicy 10/28/2025|||

There are severe edge cases. Here are some from the last few days.

Eg. Just updating bootstrap to angular bootstrap. It didn't transfer how I placed the dropdowns ( basically using dropdown-end). So everything was out of view in desktop and mobile.

It forgot the transloco I used everywhere and just used default English ( happens a lot).

Suggested code that fixed 1 bug ( expression property recursion), but now linq to SQL was broken.

Upgrade to angular 17 in an asp.net core app. I knew it used vite now. But it also required a browser folder to deploy. 20 changes down the road, I noticed something on my ui wasn't updated in dev ( fast commits for my side project, I don't build locally), it didn't deploy anything related to angular any more...

I had 2 files named ApplicationDbContext and it took the one from wrong monolith module.

It adds files in the wrong directory sometimes. Eg. Some modules were made with feature folders.

It sometimes forgets to update my ocelot gateway or updates the compressed version. ...

Note: I documented my architecture in eg. cline. But I use multiple agents to experiment with.

Tldr: it's an expert beginner programmer.

simonw 10/28/2025||

Do you have any automated tests for that project?

I'm beginning to suspect a lot of my great experiences with coding agents come from the fact that they can run tests to confirm they haven't broken anything.

solumunus 10/28/2025|||

The test loop is integral.

It’s kind of annoying hearing all this skepticism from people putting in the least effort into optimally using the tool. There is a learning curve. Every month I’ve gotten better results than the last because I’m constantly context building and refining, understanding how, what and when to prompt.

It’s like hearing someone say databases suck but they haven’t bothered to learn about or use indexes or foreign keys.

NicoJuicy 10/28/2025||

Most of the mentioned issues wouldn't be caught by a test loop unless you have 100% automated tests (unit tests, ...)

Which isn't always plausible ( time ). The AI makes different mistakes than humans, and they are sometimes harder to catch.

simonw 10/28/2025||

It's a lot more plausible now you can get LLMs to help write those tests in the first place.

NicoJuicy 10/29/2025||

Most of these examples build ( shallow test in my LLM on the end of the task ) and produced new edge cases

NicoJuicy 10/28/2025|||

Too little.

Things moved as fast as possible to migrate from .net framework to .net core 8, angular 8 to 18 and bootstrap 4.5 to 5.x

saint-evan 10/28/2025||||

Maybe if you had mentioned a more complex, lower-level or niche language than TypeScript, like C, MIPS or some exotic systems language pushing around registers, I'd believe you, with caveats. But with abstract high-level languages like Python, TypeScript and the like? It's highly unlikely that you would've put together syntax in any uniquely surprising combination. Maybe you mean you designed a clever fix to a problem within a larger codebase, so that would mean a context/attention issue for the LLM. But there's no way in hell you wrote up a contained piece of code solving a specific problem, not tied to a larger software environment, that couldn't also have been written by frontier LLMs, provided you could articulate the problem, a course of action and the expected output/behavior. LLMs are very good at writing code in isolation; humans still have deeper intuition, and we're still extremely good at doing the plug-in, wiring and big-picture planning. You over-estimate what you've done with TypeScript or misunderstand what 'LLMs are good at writing code' [in isolation] means.

prmph 10/28/2025||

This is a weird take. Software engineering problem solving and design is not about syntax at all. Syntax can help or hinder some ways of expressing things, but the result of the design process is not clever syntax.

For example, the new shortest path algorithm that eclipses Dijkstra's is a conceptual advance; it can be written in any Turing-complete language, and its discovery had nothing to do with inventing new syntax in any specific language.

Your comment betrays the literal/concrete understanding of coding that is a hallmark of novices. It's like saying as long as LLMs can write any kind of musical notation, there is no way a human can be a better composer.

I have not said an LLM cannot write the same syntax or code patterns I write; I'm saying it, for instance, is poor at figuring out stuff like: How do I write types to enforce which entities and which fields and which roles are allowed for this action at compile-time? Should I use a generator, iterator, or recursive function for such and such functionality? Should this function be generic or not? How do I design my query fluent interface for the best performance? What should be the folder organization for this module that makes it intuitive to navigate and maintain? What is the best name for that function that will make it most intuitive to use? etc.

Anyone saying such concerns have anything to do with whether I'm using Typescript vs C or Haskell does not understand software engineering.

crazygringo 10/28/2025||||

> The people saying LLMs can code are hard for me to understand.

Just today, I spent an hour documenting a function that performs a set of complex scientific simulations. Defined the function input structure, the outputs, and put a bunch of references in the body to function calls it would use.

I then spent 15 minutes explaining to the free version of ChatGPT what the function needs to do both in scientific terms and in computer architecture terms (e.g. what needed to be separated out for unit tests). Then it asked me to answer ~15 questions it had (most were yes/no, it took about 5 min), then it output around 700 lines of code.

It took me about 5 minutes to get it working, since it had a few typos. It ran.

Then I spent another 15 minutes laying out all the categories of unit tests and sanity tests I wanted it to write. It produced ~1500 lines of tests. It took me half an hour to read through them all, adjusting some edge cases that didn't make sense to me and adjusting the code accordingly. And a couple cases where it was testing the right part of the code, but had made valiant but wrong guesses as to what the scientifically correct answer would be. All the tests then passed.

All in all, a little over two hours. And it ran perfectly. In contrast, writing the code and tests myself entirely by hand would have taken at least a couple of entire days.

So when you say they're good for those simple things you list and "that's about it", I couldn't disagree more. In fact, I find myself relying on them more and more for the hardest scientific and algorithmic programming, when I provide the design and the code is relatively self-contained and tests can ensure correctness. I do the thinking, it does the coding.

DougWebb 10/28/2025|||

> Just today, I spent an hour documenting a function that performs a set of complex scientific simulations. Defined the function input structure, the outputs, and put a bunch of references in the body to function calls it would use.

So that's... math. A very well defined problem, defined very well. Any decent programmer should be able to produce working software from that, and it's great that ChatGPT was able to help you get it done much faster than you could have done it yourself. That's also the kind of project that's very well suited for unit testing, because again: math. Functions with well defined inputs, outputs, and no side-effects.

Only a tiny subset of software development projects are like that though.

simonw 10/28/2025||

> Only a tiny subset of software development projects are like that though.

Right: the majority of software development is things like "build a REST API for these three database tables" or "build a contact form with these four fields" or "write unit tests for this new function" or "update my YAML CI configuration to run this extra command".

skydhash 10/28/2025||

You do know that system programming is a thing? Or that desktop applications are software too?

simonw 10/28/2025||

I said "the majority of software development". Those are both relatively niche disciplines in 2025.

somebehemoth 10/28/2025||

Can you please explain? Are you saying all software development outside of the web is "niche"?

simonw 10/28/2025|||

Not necessarily niche, but less common. Take a look at the JetBrains developer survey if you want some numbers: https://www.jetbrains.com/lp/devecosystem-2024/

skydhash 10/28/2025||

I have a much closer relationship with other niches than with web programming, even if web programming is part of my core skill set. I mostly interact with a few sites daily, even though I spend some time there. But I spend a lot of time with software like xterm, emacs, calibre, cmus,... and more with tooling like make, bash. While I'm not working on those, I had to become quite familiar with how they work to troubleshoot some bugs. Emacs is more important to me than AWS and GitHub.

theshrike79 10/28/2025|||

Niche as in for every one systems programmer there are dozens of people writing API Glue.

By hours of work spent and lines of code produced the latter is in a whole different scale than systems programmers (which is a very badly designed term anyway).

somebehemoth 11/2/2025||

Non web programming is not niche by any definition of the word niche.

prmph 10/28/2025|||

> documenting a function that performs a set of complex scientific simulations.

The example you gave sounds like the problem is deterministic, even if composed of many moving parts. That's one way of looking at complexity.

When I talk about complex problems I'm not just talking about intricate problems. I'm talking about problems where the "problem" is design, not just implementing a design, and that is where LLMs struggle a lot.

Example, I want to design a strongly typed fluent API interface to some functionality. Even knowing how to shape the fluent interface so that is powerful, intuitive, well/strongly typed, and maintainable is a deep art.

The intuitive design constraints that I'm designing under would be hard to even explain to an LLM.

simonw 10/28/2025||

For problems like that I consider my role to be the expert designer. I figure out the design, then get the LLM to write the code and the tests for me.

It is a lot faster at typing than I am.

solumunus 10/28/2025||||

That amazing code you’ve written is a tiny proportion of code that’s needed to provide business value. Most of the code delivering business value to customers day in, day out is quite simple and can easily be LLM driven.

ratatougi 10/28/2025||||

Agreed. I often use it when I need to brainstorm which approach I should take for my task, or when I need to refactor or generate a large set of mock data.

veegee 10/28/2025|||

[flagged]

avgDev 10/28/2025|||

This is an unhinged comment. You should take a deep breath and get off the internet. You sound extremely immature calling someone on HN "script kiddie".

veegee 10/28/2025||

[flagged]

wutwutwat 10/28/2025|||

What do you plan to do after your software career is over?

roxolotl 10/28/2025|||

One of the interesting corollaries of the title is that this can also be true of humans. Being able to code is not the same as being a software engineer. It never has been.

echelon 10/28/2025|||

We're also finding this true with media generation.

AI video is an incredible tool, but it can't make movies.

It's almost as if all of these models are an exoskeleton for people that already know what they're doing. But you still need an expert in the loop.

falcor84 10/28/2025|||

> but it can't make movies.

To me this appears to be a very time-dependent assertion. 5 years ago, AI couldn't generate a good movie frame. 2 years ago, AI couldn't generate a good shot, but now in 2025, AI can generate a not-too-shabby scene. If capabilities continue improving at this rate (e.g. as they have with AI being able to generate full musical albums), I wouldn't bet against AI being able to generate a decent feature film in the next decade. It might take longer until it's the sort of thing that we'd present in festivals, but I just don't see a clear barrier any more.

Looking at it from another perspective, if an AI driven task currently requires "an expert in the loop" to navigate things by offering the appropriate prompts, evaluating and iterating on the AI generated content, then there's nothing clear to stop us from training the next generation of AI to include that expert's competency.

Taking it into full extrapolation mode, the thing that current generation AIs really don't have is the human experience that leads to a creative drive, but once we have robotic agents among us, these would arguably be able to start gathering "experiences" that they could then mine to write and produce "their own" stories.

kujjerl7 10/28/2025|||

>it can't make movies

Humans are sharply declining in this ability at the same time. Most of what Hollywood churns out now is superhero slop, forced-diversity spin-offs, awful remakes of classics, and awkward comebacks for yesteryear's leading men.

I know it's not a movie but I could've happily watched "Nothing, Forever" for the rest of my life. That was creative, chaotic, hilarious, and wildly entertaining.

Meanwhile I watched the human-created War Of The Worlds (2025) last weekend... The less said, the better.

bloppe 10/28/2025|||

At least you can teach a human to become a software engineer.

jfim 10/28/2025|||

> I guess the big question is if experienced product management types can pick up enough coding technical literacy to work like this without programmers

I'd argue that they can't, at least on a short timeframe. Not because LLMs can't generate a program or product that works, but because there needs to be enough understanding of how the implementation works to fix any complex issues that come up.

One experience I had is that I had tried to generate a MITM HTTPS proxy that uses Netty using Claude, and while it generated a pile of code that looked good on the surface, it didn't actually work. Not knowing enough about Netty, I wasn't able to debug why it didn't work and trying to fix it with the LLM didn't help either.

Maybe PMs can pick up enough knowledge over time to be able to implement products that can scale, but by that time they'd effectively be a software engineer, minus the writing code part.

ambicapter 10/28/2025|||

LLMs are great for learning though, you can easily ask them questions, and you can evaluate your understanding every step of the way, and gradually build the accuracy of your world model that way. It’s not uncommon for me to ask a general question, drill deeper into a concept, and then either test things manually with some toy code or end up reading the official documentation, this time with at least some exposure to the words that I’m looking for to answer my question.

sodaclean 10/28/2025|||

This is how I use them, but I also use them to write initial UIs (usually very primitive). Because I've got an issue where the UI has to be perfect, and if I can blame somebody/something other than me I can ignore it until the UI becomes important enough.

o11c 10/28/2025|||

If I wanted a confident and simple answer with no regard for veracity, I would just ask a politician.

kaashif 10/28/2025|||

If an LLM can get you 90% of the way there, you need fewer engineers. But the engineer you need probably needs to be a senior engineer who went through the pain of learning all of the details and can function without AI.

If all juniors are using AI, or even worse, no juniors are ever hired, I'm not sure how we can produce those seniors at the scale we currently do. Which isn't even that large a scale.

Bukhmanizer 10/28/2025|||

> the big question is if experienced product management types can pick up enough coding technical literacy to work like this without programmers

I have a strong opinion that AI will boost the importance of people with “special knowledge” more than anyone else regardless of role. So engineers with deep knowledge of a system or PMs with deep knowledge of a domain.

simonw 10/28/2025||

That sounds right to me.

samsolomon 10/28/2025|||

I think you're right, the roles will exist for some time. But I think we'll start to see more and more overlap between engineering, product management and design.

In a lot of ways I think that will lead to stronger delivery teams. As a designer—the best performing teams I've been on have individuals with a core competency, but a lot of overlap in other areas. Product managers with strong engineering instincts, engineers with strong design instincts, etc. When there is less ambiguity in communication, teams deliver better software.

Longer-term I'm unsure. Maybe there is some sort of fusion into all-purpose product people able to do everything?

kakacik 10/28/2025|||

Not happening anytime soon. Those product management types are more expensive than devs in most places, so you would literally be a) increasing the cost per hour worked; and b) stifling the use of that manager's (pricey) management skills by having them do a lower-paid job.

I have no doubt some broken places end up in a similar mode, but en masse it doesn't make any financial sense.

Also when SHTF and you can't avoid going into deep debug with strong management pressure and oversight, it will become glaringly obvious which approach can keep things running. And SHTF always happens, it's only a function of time.

shalmanese 10/28/2025|||

It’s worthwhile reading Fred Brooks's original “No Silver Bullet” paper, where he explicitly covers LLMs under the “Hopes for the Silver” AI/Expert Systems/Automatic Programming sections and explains why they are still not a silver bullet.

https://worrydream.com/refs/Brooks_1986_-_No_Silver_Bullet.p...

adncors 11/1/2025|||

what if the real paradigm shift isn't about replacing engineers in building durable systems, but about making software so cheap and disposable that the concept of "technical debt" becomes irrelevant for a new class of "single-use" or ultra-short-lifespan applications?

colordrops 10/28/2025|||

Once all the context that a typical human engineer has to "build software" is available to the LLM, I'm not so sure that this statement will hold true.

bloppe 10/28/2025||

But it's becoming increasingly clear that LLMs based on the transformer model will never be able to scale their context much further than the current frontier, due mainly to context rot. Taking advantage of greater context will require architectural breakthroughs.

colordrops 10/29/2025||

Will it though? The human mind can hold less context at any one time than even a mediocre LLM. The problem isn't architecture. It's capturing context. Most of it is in a bunch of people's heads and encoded in the physical world. Once it's digitized and accessible through search, RAG, or whatever, the LLM will be able to use it effectively.

prmph 10/29/2025||

Humans hold a lot of implicit context, I think far beyond any LLM. Context is not just what you are consciously thinking about in your head.

colordrops 10/29/2025||

Sure, but so do LLMs. They have a huge subconscious (the model itself).

Recording every conversation a single person ever had, every book or text or site ever read, everything ever seen, is not a huge amount of data. Microsoft attempted this with a digital camera lanyard but they were too early.

prmph 10/29/2025||

Yeah, but the models are all based on explicit data. I'm saying humans have prior wiring that allows them to extract and keep context that LLMs do not have access to.

colordrops 10/29/2025||

So the suggestion here is that RAG, tools, LLM memory, fine tuning, context management etc are not enough to take advantage of all this context? Is there any evidence that these things aren't on a trajectory to be optimized enough to do the job?

vrc 10/28/2025|||

I’m a PM and I’ve been able to do a lot of very interesting near production ready bits of coding recently with an LLM. I say near production ready because I specifically only build functional data processing stuff that I intentionally build with clean I/O requirements to hand to the real engineers on the team to slot in. They still have to fix some things to meet our standards, but I’m basically a “researcher” level coder. Which makes sense — I do have an undergrad and MS in CS, and did a lot of mathy algo stuff. For the last 15+ years I could never use anything in my brain to help the team solve things I was best suited to solve. I am now, and that’s nice.

The one key point is that I am keenly aware of what I can and cannot do. With these new superpowers, I often catch myself doing too much, and I end up doing a lot more rewrites than a real engineer would. But I can see Dunning Kruger playing out everywhere when people say they can vibe code an entire product.

belZaah 10/28/2025|||

Yeah, no. Had Claude 4.5 generate a mock implementation of an OpenAPI spec. Trivial interaction, just a post of a json object. And Claude invented new fields to check for and failed to check for required ones.

It is helpful in reducing the number of keys I have to press and the amount of documentation-diving I need to do. But saying that’s writing code is like saying StackOverflow is writing code along with autocomplete.
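For what it's worth, the failure described above (checks for invented fields, no checks for required ones) is the kind of thing that goes away when the mock's checks are driven by the spec data instead of being hand-written. A minimal sketch in Python, assuming a made-up schema: `SCHEMA`, its field names, and the `validate` helper are all hypothetical and not taken from any real OpenAPI document or library.

```python
# Hypothetical schema for illustration; a real one would be read from the OpenAPI spec.
SCHEMA = {
    "required": ["name", "email"],
    "properties": {"name": str, "email": str, "age": int},
}


def validate(payload: dict, schema: dict) -> list[str]:
    """Return a list of problems; an empty list means the payload passed."""
    problems = []
    # Only fields the schema marks as required are checked for presence.
    for field in schema["required"]:
        if field not in payload:
            problems.append(f"missing required field: {field}")
    # Only fields the schema declares are type-checked; nothing is invented.
    for field, expected in schema["properties"].items():
        if field in payload and not isinstance(payload[field], expected):
            problems.append(f"wrong type for field: {field}")
    return problems


body = {"name": "Ada", "age": "37"}
print(validate(body, SCHEMA))
# ['missing required field: email', 'wrong type for field: age']
```

Fields the schema does not declare are simply ignored in this sketch; whether a mock should reject them depends on the spec.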

simonw 10/28/2025||

What did Claude do when you replied and said "don't add new fields, and make sure you check the required ones"?

Balinares 10/28/2025||

"You're absolutely right!"

IanCal 10/28/2025||

I disagree. Unless you’re focussed on right now, in which case… maybe? Depends on scale.

I have a few scattered thoughts here but I think you’re caught up on how things are done now.

A human expert in a field is the customer.

Do you think, say, gpt5 pro can’t talk to them about a problem and what’s reasonable to try and build in software?

It can build a thing, with tests, run stuff and return to a user.

It can take feedback (talking to people is the key major thing LLMs have solved).

They can iterate (see: codex), deploy, and they can absolutely write copy.

What do you really think in this list they can’t do?

For simplicity reduce it to a relatively basic crud app. We know that they can make these over several steps. We know they can manage the ui pretty well, do incremental work etc. What’s missing?

I think something huge here is that some of the software engineering roles and management become exceptionally fast and cheap. That means you don’t need to have as many users to be worthwhile writing code to solve a problem. Entirely personal software becomes economically viable. I don’t need to communicate value for the problem my app has solved because it’s solved it for me.

Frankly most of the “AI can’t ever do my thing” comments come across as the same as “nobody can estimate my tasks they’re so unique” we see every time something comes up about planning. Most business relevant SE isn’t complex logically, interestingly unique or frankly hard. It’s just a different language to speak.

Disclaimer: a client of mine is working on making software simpler to build and I’m looking at the AI side, but I have these views regardless.

simonw 10/28/2025||

I expect that customers who have those needs would much rather hire somebody to be the intermediary with the LLM writing the code than take on that role themselves.

You'll get the occasional high agency non-technical customer who decides to learn how to get these things done with LLMs but they'll be a pretty rare breed.

IanCal 10/28/2025||

This may be a timeframe issue but I sincerely doubt anyone wants to hire someone to be an intermediary. They just want the thing done.

I know that right now few want to sit in front of claude code, but it's just not that big of a leap to move this up a layer. Workflows do this even without the models getting better.

simonw 10/28/2025||

YouTube can show anyone how to unblock a sink. Most people still choose to call a plumber.

IanCal 10/28/2025||

Most people would probably not do that if they could just say “unblock the sink” into their phone.

jumploops 10/28/2025||

I've been forcing myself to "pure vibe-code" on a few projects, where I don't read a single line of code (even the diffs in codex/claude code).

Candidly, it's awful. There are countless situations where it would be faster for me to edit the file directly (CSS, I'm looking at you!).

With that said, I've been surprised at how far the coding agents are able to go[0], and a lot less surprised about where I need to step in.

Things that seem to help:

1. Always create a plan/debug markdown file

2. Prompt the agent to ask questions/present multiple solutions

3. Use git more than normal (squash ugly commits on merge)

Planning is key to avoid half-brained solutions, but having "specs" for debug is almost more important. The LLM will happily dive down a path of editing as few files as possible to fix the bug/error/etc. This, unchecked, can often lead to very messy code.

Prompting the agent to ask questions/present multiple solutions allows me to stay "in control" over how something is built.

I now basically commit every time a plan or debug step is complete. I've tried having the LLM control git, but I feel that it eats into the context a bit too much. Ideally a 3rd party "agent" would handle this.

The last thing I'll mention is that Claude Code (Sonnet 4.5) is still very token-happy, in that it eagerly goes above and beyond when not always necessary. Codex (gpt-5-codex) on the other hand, does exactly what you ask, almost to a fault. For both cases, this is where planning up-front is super useful.

[0]Caveat: the projects are either Typescript web apps or Rust utilities, can't speak to performance on other languages/domains.

theshrike79 10/28/2025||

Sonnet 4.5 is rebranded Opus 4. That's where it got its token-happiness.

Try asking Opus to generate a simple application and it'll do it. It'll also add thousands of lines of setup scripts and migration systems and Dockerfiles and reports about how it built everything and... Ooof.

Sonnet 4.5 is the same, but at a slightly smaller scale. It still LOVES to generate markdown reports of features it did. No clue why, but by default it's on, you need to specifically tell it to stop doing that.

svachalek 10/28/2025|||

Also, put heavy lint rules in place, and commit hooks to make sure everything compiles, lints, passes tests, etc. You've got to be super, super defensive. But Claude Code will see all those barriers and respond to them automatically which saves you the trouble of being vigilant over so many little things. You just need to watch the big picture, like make sure tests are there to replicate bugs, new features are tested, etc, etc.
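A rough sketch of what such a gate could look like as a script called from a git pre-commit hook (git aborts the commit when the hook exits non-zero). Everything here is an assumption for illustration: `run_checks` is a hypothetical helper, and the demo commands are stand-ins built from `sys.executable` so the example runs anywhere. A real project would list its own compile, lint and test commands instead.

```python
import subprocess
import sys


def run_checks(commands, runner=subprocess.run) -> int:
    """Run each command in order; return the first non-zero exit code, else 0."""
    for cmd in commands:
        result = runner(cmd)
        if result.returncode != 0:
            # Stop at the first failure so the agent (or human) sees one clear error.
            print(f"check failed: {' '.join(cmd)}", file=sys.stderr)
            return result.returncode
    return 0


# Stand-in commands: the first passes, the second fails, the third never runs.
demo = [
    [sys.executable, "-c", "print('lint ok')"],
    [sys.executable, "-c", "import sys; sys.exit(3)"],
    [sys.executable, "-c", "print('never runs')"],
]
print(run_checks(demo))  # 3
```

In a hook, the script would end with `sys.exit(run_checks(...))` so that a failing check blocks the commit.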

theshrike79 10/28/2025||

Same as when coding with humans, better tests and linters will give you a shorter and simpler iteration loop.

LLMs love that.

enraged_camel 10/28/2025|||

>> Codex (gpt-5-codex) on the other hand, does exactly what you ask, almost to a fault.

I've seriously tried gpt-5-codex at least two dozen times since it came out, and every single time it was either insufficient or made huge mistakes. Even with the "have another agent write the specs and then give it to codex to implement" approach, it's just not very good. It also stops after trying one thing and then says "I've tried X, tests still failing, next I will try Y" and it's just super annoying. Claude is really good at iterating until it solves the issue.

jumploops 10/28/2025||

What type of codebase are you working within?

I've spent quite a bit of time with the normal GPT-5 in Codex (med and high reasoning), so my perspective might be skewed!

Oh, one other tip: Codex by default seems to read partial files (~200 lines at a time), so I make sure to add "Always read files in full" to my AGENTS.md file.

asabla 10/28/2025|||

> The last thing I'll mention is that Claude Code (Sonnet 4.5) is still very token-happy, in that it eagerly goes above and beyond when not always necessary. Codex (gpt-5-codex) on the other hand, does exactly what you ask, almost to a fault.

I very much share your experience. For the time being I prefer the experience with codex over claude, just because I find myself in a position where I know much sooner when to step in and just do it manually.

With claude I find myself in a typing exercise much more often. I could probably get better at knowing when to stop, ofc.

throwaway314155 10/28/2025|||

> Candidly, it's awful.

Noting your caveat but I’m doing this with Python and your experience is very different from mine.

jumploops 10/28/2025||

Oh, don't get me wrong, the models are marvelous!

The "it's awful" admission is due to the "don't look at code" aspect of this exercise.

For real work, my split is more like 80% LLM/20% non-LLM, and I read all the code. It's much faster!

tharkun__ 10/28/2025||

    Always create a plan/debug markdown file

Very much necessary. Especially with Claude I find. It auto-compacts so often (Sonnet 4.5) and it instantly goes a-wall stupid after that. I then make it re-read the markdown file, so we can actually continue without it forgetting about 90% of what we just did/talked about.

    Prompt the agent to ask questions/present multiple solutions

I find that only helps marginally. They all output so much text it's not even funny. And that's with one "solution".

I don't get how people can stand reading all that nonsense they spew, especially Claude. Everything is insta-ready to deploy, problem solved, root cause found, go hit the big red button that might destroy the earth in a mushroom cloud. I learned real fast to only skim what it says and ignore all that crap (as in I never tried to "change its personality" for real - I did try to tell it to always use the scientific method and prove its assumptions but just like a junior dev it never does and just tells me stupid things it believes to be true and I have to question it). Again, just like a junior dev, but it's my junior dev that's always on and available when I have time and it does things while I do other stuff. And instead of me having to ask the junior after an hour or two what rabbit hole it went down and get them out of there, Claude and Codex usually visually ping the terminal before I even have time to notice. That's for when I don't have full time focus on what I'm trying to do with the agents, which is why I do like using them.

The times when I am fully attentive, they're just soooo slow. And many many times I could do what they're doing faster or just as fast but without spending extra money and "environment". I've been trying to "only use AI agents for coding" for like a month or two now to see its positives and limitations and form my own opinion(s).

    Prompting the agent to ask questions/present multiple solutions allows me to stay "in control" over the how something is built.

I find Claude's "Plan mode" is actually ideal. I just enable it and I don't have to tell it anything. While Codex "breaks out" from time to time and just starts coding even when I just ask it a question. If these machines ever take over, there's probably some record of me swearing at them and I will get a hitman on me. Unlike junior devs, I have no qualms about telling a model that it again ignored everything I told it.

    Ideally a 3rd party "agent" would handle this.

With sub-agents you can. Simple git interactions are perfect for subagents because not much can get lost in translation in the interface between the main agent and the sub agent. Then again, I'm not sure how you lose that much context. I'd rather use a sub agent for things like running the tests and linter on the whole project in the final steps, which spew a lot of unnecessary output.

Personally, I had a rather bad set of experiences with it controlling git without oversight, so I do that myself, since doing it myself is less taxing than approving everything it wants to do (I automatically allow Claude certain commands that are read only for investigations and reviewing things).

pron 10/28/2025||

> I don’t really know why AI can't build software (for now)

Could be because programming involves:

1. Long chains of logical reasoning, and

2. Applying abstract principles in practice (in this case, "best practices" of software engineering).

I think LLMs are currently bad at both of these things. They may well be among the things LLMs are worst at atm.

Also, there should be a big asterisk next to "can write code". LLMs do often produce correct code of some size and of certain kinds, but they can also fail at that too frequently.

orliesaurus 10/28/2025||

Software engineering has always been about managing complexity, not writing code. Code is just the artifact. No-code, low-code is all code but doesn't make for a good software engineered application

hamasho 10/28/2025||

The problem with vibe coding is that it demoralizes experienced software engineers. I'm developing an MVP with vibes, and the Claude Code and Codex output works in many cases for this relatively new project. But the quality of the code is bad. There is already duplicated or unused logic, and a lot of code is unnecessarily complex (especially React and JSX). And there's little PR review so that "we can keep velocity". I'm paying much less attention to quality now. After all, why bother when AI produces working code? I can't justify, and don't have the energy for, deep-diving into system design or dozens of nitpicking change requests. And it makes me more and more replaceable by an LLM.

bloppe 10/28/2025||

> I'm paying much less attention to quality now. After all, why bother when AI produces working code?

I hear this so much. It's almost like people think code quality is unrelated to how well the product works. As though you can have 1 without the other.

If your code quality is bad, your product will be bad. It may be good enough for a demo right now, but that doesn't mean it really "works".

krackers 10/28/2025|||

Because there's a notion that if any bugs are discovered later on, they can just "be fixed". And generally unless you're the one fixing the bugs, it's hard to understand the asymmetry in effort here. Also, no one ever got as much credit for bug fixes as for adding features.

hamasho 10/28/2025||||

I know how important code quality is. But I can't (or don't have energy to) convince junior engineers and sometimes project managers to submit good quality code instead of vibe-coded garbage anymore.

bloppe 10/28/2025||

I just hope I never have to work at a company like that again

carlosjobim 10/28/2025||||

> If your code quality is bad, your product will be bad.

Why? Modern hardware power allows for extremely inefficient code, so even if some code runs a thousand times slower because it's badly programmed it will still be so fast that it seems instant.

For the rest of the stuff, it has no relevance for the user of the software what the code is doing inside of the chip, as long as the inputs and outputs function as they should. User wants to give input and receive output, nothing else has any significance at all for her.

bloppe 10/28/2025||

Sure. Everyone remembers from Algorithms 101 that a constant multiple ("a thousand times slower") is irrelevant. What matters is the scalability. Something that's O(n) will always scale better than something that's O(n^2), even if the thing that's O(n) has 1000x overhead per unit.
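To put numbers on that, here is a toy comparison in Python. The two cost functions are idealized assumptions (exactly 1000·n versus exactly n² abstract "steps"), not measurements of any real program.

```python
def linear_cost(n: int, overhead: int = 1000) -> int:
    """Steps taken by an O(n) algorithm with a large constant factor per item."""
    return overhead * n


def quadratic_cost(n: int) -> int:
    """Steps taken by an O(n^2) algorithm with no extra overhead."""
    return n * n


for n in (10, 1_000, 100_000):
    print(n, linear_cost(n), quadratic_cost(n))
# 10 10000 100
# 1000 1000000 1000000
# 100000 100000000 10000000000
```

Under these assumptions the two break even at n = 1000. Below that the quadratic version is actually cheaper, and above it the gap grows without bound, which is the sense in which the O(n) version "scales better".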

But that's just a small piece of the puzzle. I agree that the user only cares about what the product does and not how the product works, but the what is always related to how, even if that relationship is imperceptible to the user. A product with terrible code quality will have more frequent and longer outages (because debugging is harder), and it will take longer for new features to be added (because adding things is harder). The user will care about these things.

theshrike79 10/28/2025|||

There is space for a generic tool that defines code quality as code. Something like ast-grep[0] or Roslyn analysers. Linters for some languages like Go do a lot of lifting in this field, but there could be more checks.

With that you could specify exactly what "good code" looks like and prevent the LLM from even committing stuff that doesn't match the rules.

[0] https://ast-grep.github.io
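As a minimal illustration of the "code quality as code" idea, here is a sketch that uses Python's standard `ast` module as a stand-in. It is not ast-grep or a Roslyn analyser and does not use their rule formats; the `find_bare_excepts` helper and the single rule it enforces (no bare `except:`) are assumptions chosen for the example.

```python
import ast


def find_bare_excepts(source: str) -> list[int]:
    """Return line numbers of `except:` clauses that name no exception type."""
    tree = ast.parse(source)
    return [
        node.lineno
        for node in ast.walk(tree)
        if isinstance(node, ast.ExceptHandler) and node.type is None
    ]


sample = """
try:
    risky()
except:
    pass
"""

print(find_bare_excepts(sample))  # [4]
```

A commit gate would run checks like this over the changed files and refuse the commit whenever any of them reports a violation.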

phyzome 10/28/2025||

I find it fascinating that your reaction to that situation is to double down while my reaction would be to kill it with fire.

Calamityjanitor 10/28/2025||

I feel you can apply this to all roles. When models passed high school exam benchmarks, some people talked as if that made the model equivalent to a person passing high school. I may be wrong, but I bet even a state-of-the-art LLM couldn't complete high school. You have to do things like attend classes at the right time/place, take initiative, and keep track of different classes: all of the bigger-picture thinking and soft skills that aren't in a pure exam.

Improving this is what everyone's looking into now. Even larger models, context windows, adding reasoning, or something else might improve this one day.

takoid 10/28/2025|

How would LLMs ever be able to attend classes at the right time/place, assuming the classes are in-person and not remote? Seems like an odd and irrelevant criticism.

Kim_Bruning 10/28/2025||

"On two occasions I have been asked, 'Pray, Mr. Babbage, if you put into the machine wrong figures, will the right answers come out?' I am not able rightly to apprehend the kind of confusion of ideas that could provoke such a question."

   --Charles Babbage

We have now come to the point where you CAN put in the wrong figures and sometimes the right answer comes out (possibly over half the time!). This was and is incredible to me and I feel lucky to be alive to see it.

However, people have taken that to mean that you can ask any old question any old way and have the right answer come out now. I might at one point have almost thought so myself. But LLMs currently are definitely not there yet.

Consider (eg) Claude Code to be your English SHell (Compare: zsh, bash).

Learn what it can and can't do for you. It's messier to learn than straight and/or/not; and I'm not sure there are manuals for it; and any manual will be outdated next quarter anyway; but that's the state of play at this time.

loco5niner 10/28/2025|

Well, the right answers have been put in the knowledgebase. It's just that the prompt may be wrong.

subtlesoftware 10/27/2025||

True for now because models are mainly used to implement features / build small MVPs, which they’re quite good at.

The next step would be to have a model running continuously on a project with inputs from monitoring services, test coverage, product analytics, etc. Such an agent, powered by a sufficient model, could be considered an effective software engineer.

We’re not there today, but it doesn’t seem that far off.

bloppe 10/28/2025||

> We’re not there today, but it doesn’t seem that far off.

What time frame counts as "not that far off" to you?

If you tried to bet me that the market for talented software engineers would collapse within the next 10 years, I'd take it no question. 25 years, I think my odds are still better than yours. 50 years, I might not take the bet.

subtlesoftware 10/28/2025||

Great question. It depends on the product. For niche SaaS products, I’d say in the next few years. For like Amazon.com, on the order of decades.

bloppe 10/28/2025||

If the niche SaaS product never required a talented engineer in the first place, I'd be inclined to agree with you. But even a niche SaaS product requires a decent amount of engineering skill to maintain well.

thomasfromcdnjs 10/27/2025|||

Agreed.

I've played around with agent-only code bases (where I don't code at all), and had an agent hooked up to server logs, which would create an issue whenever it encountered errors, and then an agent would fix the tickets, push to prod, check deployment statuses, etc. It worked well enough to see that this could easily become the future. (I also had claude/codex code that whole setup.)
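The first stage of a loop like that (logs in, issues out) can be sketched in a few lines of Python. This is only an illustration under assumptions: the log format, the `errors_to_issues` helper and the issue fields are all made up here, and nothing below is the actual setup described above.

```python
import re
from collections import Counter

# Assumed log format: "<timestamp> <LEVEL> <message>"; real logs will differ.
LINE = re.compile(r"^\S+ (?P<level>[A-Z]+) (?P<message>.+)$")


def errors_to_issues(log_lines):
    """Group ERROR lines by message and return one issue dict per distinct message."""
    counts = Counter()
    for line in log_lines:
        match = LINE.match(line.strip())
        if match and match.group("level") == "ERROR":
            counts[match.group("message")] += 1
    # Most frequent errors first, so the fixing agent picks those up first.
    return [
        {"title": message, "occurrences": count}
        for message, count in counts.most_common()
    ]


logs = [
    "2025-10-28T10:00:00Z INFO request served",
    "2025-10-28T10:00:01Z ERROR database timeout",
    "2025-10-28T10:00:02Z ERROR database timeout",
    "2025-10-28T10:00:03Z ERROR invalid payload",
]
print(errors_to_issues(logs))
# [{'title': 'database timeout', 'occurrences': 2}, {'title': 'invalid payload', 'occurrences': 1}]
```

Deduplicating by message keeps a noisy error from opening hundreds of tickets; the harder parts (deciding what a correct fix is, and verifying the deploy) are not covered by this sketch.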

Just for semantic nitpicking, I've zero shot heaps of small "software" projects that I use then throw away. Doesn't count as a SAAS product but I would still call it software.

bloppe 10/28/2025||

The article "AI can code, but it can't build software"

An inevitable comment: "But I've seen AI code! So it must be able to build software"

bcrosby95 10/27/2025|||

> The next step would be to have a model running continuously on a project with inputs from monitoring services, test coverage, product analytics, etc. Such an agent, powered by a sufficient model, could be considered an effective software engineer.

An automated system that determines if a system is correct (whatever that means) is harder to build than the coding agents themselves.

pil0u 10/27/2025|||

I agree that tooling is maturing towards that end.

I wonder if that same non-technical person who built the MVP with GenAI and requires (human) technical assistance today will need it tomorrow as well. Will the tooling be mature enough and lower the barrier enough for anyone to have a complete understanding of software engineering (monitoring services, test coverage, product analytics)?

cratermoon 10/28/2025||

> I agree that tooling is maturing towards that end.

That's what every no-programming-needed hyped tool has said. Yet here we are, still hiring programmers.

jahbrewski 10/27/2025||

I’ve heard “we’re not there today, but it doesn’t seem that far off” since the beginning of the AI infatuation. What if it is far off?

bloppe 10/28/2025||

It's telling to me that nobody who actually works in AI research thinks that it's "not that far off".

KurSix 10/28/2025||

This whole situation painfully reminds me of the low-code/no-code boom from like 5–10 years ago.

Back then everyone was saying developers would become obsolete and business analysts would just “click together” enterprise solutions. In the end, we got a mess of clunky non-scalable systems that still had to be fixed and integrated by the same engineers.

LLMs are basically low-code on steroids - they make it easier to build a prototype, but exponentially harder to turn it into something actually reliable.

dreamcompiler 10/28/2025|

I've worked in a few teams where some member of the [human] team could be described as "Joe can code, but he can't build software."

The difference is what we used to call the "ilities": Reliability, inhabitability, understandability, maintainability, securability, scalability, etc.

None of these things are about the primary function of the code, i.e. "it seems to work." In coding, "it seems to work" is good enough. In software engineering, it isn't.
