Posted by nreece 1 day ago
A human expert needs to identify the need for software, decide what the software should do, figure out what's feasible to deliver, build the first version (AI can help a bunch here), evaluate what they've built, show it to users, talk to them about whether it's fit for purpose, iterate based on their feedback, deploy and communicate the value of the software, and manage its existence and continued evolution in the future.
Some of that stuff can be handled by non-developer humans working with LLMs, but a human expert who understands code will be able to do this stuff a whole lot more effectively.
I guess the big question is if experienced product management types can pick up enough technical literacy in coding to work like this without programmers, or if programmers can pick up enough PM skills to work without PMs.
My money is on both roles continuing to exist and benefit from each other, in a partnership that produces results a lot faster because the previously slow "writing the code" part is a lot faster than it used to be.
Just this past weekend, I designed and wrote code (in TypeScript) that I don't think LLMs will come close to writing for years. I have a subscription to a frontier LLM, but lately I find myself using it like 25% of the time.
At a certain level the software architecture problems I'm solving, drawing upon decades of understanding about maintainable, performant, and verifiable design of data structures and types and algorithms, are things LLMs cannot even begin to grasp.
At that point, I find that attempting to use an LLM to even draft an initial solution is a waste of time. At best I can use it for initial brainstorming.
The people saying LLMs can code are hard for me to understand. They are good for simple bash scripts and complex refactoring and drafting basic code idioms and that's about it.
And even for these tasks the amount of hand-holding I need to do is substantial. At least Gemini Pro/CLI seems good at one-shot performance, before its context gets poisoned
The learning curve is very different - with other languages, the learning curve is often upfront, with LLM, it seems linear/even rear loaded, maybe because I've not gotten to the other side.
I've been able to make LLMs do more and more. Some of it is undoubtedly due to improvements in the models, but most of it is probably down to paradigm and changes in my approach. At the beginning, I ran into all of the same complaints, and I have eventually found workarounds for many of them.
that's like, 90% of the code people are writing
I've implemented connections to (public) APIs of different services multiple times using LLMs without even looking up the APIs myself.
I just say "Enrich the data about this game from Steam's API" and that's about it.
"Take X and Y I've written before, some documentation for Z, an example W from that repo, now smash them together and build the thing I need"
This works well for humans too, but custom analysers are abstract and not many devs know how to write them, so they are mostly provided by library authors. However, being able to generate them via LLMs makes them so much more accessible, and IMHO is a game changer for enforcing an architecture.
I've been exploring this direction a lot lately, and it feels very promising.
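To make the idea of a small architecture-enforcing analyser concrete, here is a minimal sketch. The comment above is about C# analysers; this swaps in Python's standard-library `ast` module as an analogue, and the layer names (`app.domain`, `app.web`) and the rule itself are hypothetical, chosen only for illustration.

```python
import ast

# Hypothetical rule: modules in the "domain" layer must not import from the "web" layer.
FORBIDDEN = {"app.domain": ("app.web",)}


def find_violations(module_name: str, source: str) -> list[str]:
    """Return one message per import in `source` that breaks a layering rule."""
    # Collect the packages this module is not allowed to import.
    banned = tuple(
        target
        for layer, targets in FORBIDDEN.items()
        if module_name == layer or module_name.startswith(layer + ".")
        for target in targets
    )
    violations = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            imported = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            imported = [node.module]  # absolute "from x import y" only
        else:
            continue
        for name in imported:
            if any(name == b or name.startswith(b + ".") for b in banned):
                violations.append(f"{module_name}:{node.lineno} imports {name}")
    return violations


sample = "import os\nfrom app.web.views import render\n"
print(find_violations("app.domain.orders", sample))
```

A check like this can run in CI next to the tests, so an agent gets the same feedback as a human when it crosses a layer boundary. Relative imports are ignored here to keep the sketch short.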
I have written many program analyses (though never any for C#; I’ll have to check it out), and my experience is that they are quite challenging to write. Many are research-level CS, so well outside the skill set of your average vibe coder. I’m wondering if you have some insight about LLM generated code that has not occurred to me…
First, LLMs are great at learning new tech stacks, but good ol' ASP.NET has been pretty much stable since forever. Second, I think Rider/Resharper is the greatest piece of autocomplete tech ever made, seriously nothing else comes close, which means I'd rather do a refactor using them than do something similar by prompting the AI and hoping for the best. Also probably my experience makes me far less accepting of LLMisms, but that might just be on me.
Lastly, AI seems to be focused around its own set of tooling, like Cursor, which is fine for TS but is far worse than Rider for things like C#. I know I could kludge things together, but still.
As for Roslyn...
I have some experience writing codegen/analyzers at my company and it feels like a typical Microsoft tech product, like WPF or Powershell.
Brilliant idea (that's a market first as well) combined with really solid technical fundamentals, but plain confusing and overcomplicated UX, that makes it a chore to use. Seriously the amount of scaffolding you need to make even for a simple analyzer is just nuts
Nah, the best coding LLMs are console applications like Claude Code, Codex CLI and the like.
Editor integration mostly brings more tools, like tapping into different validators on VSCode and examining the "problems" view.
Also Rider's autocomplete is at least partially AI powered unless you specifically disable it IIRC.
https://pypi.org/project/import-linter/ https://github.com/hchasestevens/astpath
I also want C# semantics even more closely integrated with the LLM. I'm imagining a stronger version of Structured Model Outputs that knows all the valid tokens that could be generated following a "." (including instance methods, extension properties, etc.) and prevents invalid code from even being generated in the first place, rather than needing a roundtrip through a Roslyn analyzer or the compiler to feed more text back to the model. (Perhaps there's some leeway to allow calls to not-yet-written methods to be generated.) Or maybe this idea is just a crutch I'm inventing for current frontier models and future models will be smart enough that they don't need it?
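As a toy illustration of the "only allow valid members after a dot" idea, the sketch below filters model candidates against a symbol table before choosing one. Everything here is an assumption made for illustration: the `MEMBERS` table, the `pick_member` helper, and the scores are made up, and real constrained decoding would work on token logits with type information from the compiler rather than on whole identifiers.

```python
from typing import Optional

# Hypothetical symbol table: the members a compiler would report for each type.
MEMBERS = {
    "str": {"upper", "lower", "strip"},
    "list": {"append", "pop", "sort"},
}


def pick_member(receiver_type: str, scored_candidates: dict[str, float]) -> Optional[str]:
    """Pick the highest-scoring candidate that is a real member of the type."""
    allowed = MEMBERS.get(receiver_type, set())
    valid = {name: score for name, score in scored_candidates.items() if name in allowed}
    if not valid:
        return None  # nothing legal to emit; a real system would need a fallback
    return max(valid, key=valid.get)


# Made-up model scores: the top choice "push" is not a method on Python lists.
scores = {"push": 0.6, "append": 0.3, "upper": 0.1}
print(pick_member("list", scores))  # append
```

The leeway mentioned above for not-yet-written methods would correspond to relaxing the `allowed` set, for example by also accepting names the model has declared it intends to define.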
What I am seeing is that LLMs will push current programming languages down the stack, like now you're enjoying C# => MSIL => Machine code.
On my line of work I already can imagine the other side of the tunnel, more low-code/no-code tooling, orchestration agents, and much (much) less manually writing C#, Java and TypeScript.
For example, the new shortest path algorithm that eclipses Dijkstra's is a conceptual advance; it can be written in any Turing-complete language, and its discovery had nothing to do with inventing new syntax in any specific language.
Your comment betrays the literal/concrete understanding of coding that is a hallmark of novices. It's like saying as long as LLMs can write any kind of musical notation, there is no way a human can be a better composer.
I have not said an LLM cannot write the same syntax or code patterns I write; I'm saying it, for instance, is poor at figuring out stuff like: How do I write types to enforce which entities and which fields and which roles are allowed for this action at compile-time? Should I use a generator, iterator, or recursive function for such and such functionality? Should this function be generic or not? How do I design my query fluent interface for the best performance? What should be the folder organization for this module that makes it intuitive to navigate and maintain? What is the best name for that function that will make it most intuitive to use? etc.
Anyone saying such concerns have anything to do with whether I'm using Typescript vs C or Haskell does not understand software engineering.
In my experience implementing algorithms from a good comprehensive description and keeping track of data models is where they shine the most.
One reason I know an LLM can't come close to my design is this: I've written something that works (that a typical senior engineer might write), but this is not enough. I have evaluated it critically (drawing on my experience with long lived software), rewritten it again to better meet the targets above, and repeated this process several times. I don't know what would make an LLM go: now that kind of works, but is this the most intuitive, well typed, and maintainable design that there could be?
My previous design required looping through all known resources asking "can actor X action Y on this?". The new design gets to generate a very complex but thoroughly tested SQL query instead.
Applying that new design and updating the hundreds of related tests would have taken me weeks. I got it done in two days.
Here's a diff that captures most of the work: https://github.com/simonw/datasette/compare/e951f7e81f038e43...
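The difference between the two designs described above can be sketched roughly as follows. This is not the code from the linked diff: the `resources` and `grants` tables, the sample rows, and the function names are hypothetical, and the real query is described as far more complex. The sketch only contrasts "ask about each resource in a loop" with "let one SQL statement compute the answer".

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE resources (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE grants (actor TEXT, action TEXT, resource_id INTEGER);
    INSERT INTO resources VALUES (1, 'orders'), (2, 'users'), (3, 'logs');
    INSERT INTO grants VALUES ('alice', 'view', 1), ('alice', 'view', 3), ('bob', 'view', 2);
    """
)


def can(actor: str, action: str, resource_id: int) -> bool:
    """Per-resource check: one query for every resource asked about."""
    row = conn.execute(
        "SELECT 1 FROM grants WHERE actor = ? AND action = ? AND resource_id = ?",
        (actor, action, resource_id),
    ).fetchone()
    return row is not None


def allowed_by_looping(actor: str, action: str) -> list[str]:
    """Old shape: fetch every resource, then ask about each one in turn."""
    rows = conn.execute("SELECT id, name FROM resources ORDER BY id").fetchall()
    return [name for rid, name in rows if can(actor, action, rid)]


def allowed_by_query(actor: str, action: str) -> list[str]:
    """New shape: a single SQL statement computes the whole answer."""
    rows = conn.execute(
        """
        SELECT r.name FROM resources AS r
        JOIN grants AS g ON g.resource_id = r.id
        WHERE g.actor = ? AND g.action = ?
        ORDER BY r.id
        """,
        (actor, action),
    ).fetchall()
    return [name for (name,) in rows]


print(allowed_by_looping("alice", "view"))
print(allowed_by_query("alice", "view"))
```

Both functions return the same list for this data. The looping version issues one query per resource, while the query version issues one in total, which is why keeping the old function around as a test oracle for the new one is a cheap way to build confidence in a refactor like this.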
Anything less is setting it up for failure...
If you’d like some help I’d be glad to, just drop me an email.
My email’s in my profile.
I removed it and it later just added it again.
It's these small weird things where it can mess up a lot of code.
Eg. Just updating bootstrap to angular bootstrap. It didn't transfer how I placed the dropdowns ( basically using dropdown-end). So everything was out of view in desktop and mobile.
It forgot the transloco I used everywhere and just used default English ( happens a lot).
Suggested code that fixed 1 bug ( expression property recursion), but now linq to SQL was broken.
Upgrade to angular 17 in an asp.net core app. I knew it used vite now. But it also required a browser folder to deploy. 20 changes down the road, I noticed something on my ui wasn't updated in dev ( fast commits for my side project, I don't build locally), it didn't deploy anything related to angular any more...
I had 2 files named ApplicationDbContext and it took the one from wrong monolith module.
It adds files in the wrong directory sometimes. Eg. Some modules were made with feature folders.
It sometimes forgets to update my ocelot gateway or updates the compressed version. ...
Note: I documented my architecture in eg. cline. But I use multiple agents to experiment with.
Tldr: it's an expert beginner programmer.
I'm beginning to suspect a lot of my great experiences with coding agents come from the fact that they can run tests to confirm they haven't broken anything.
It’s kind of annoying hearing all this skepticism from people putting in the least effort into optimally using the tool. There is a learning curve. Every month I’ve gotten better results than the last because I’m constantly context building and refining, understanding how, what and when to prompt.
It’s like hearing someone say databases suck but they haven’t bothered to learn about or use indexes or foreign keys.
Which isn't always plausible ( time ). The AI makes different mistakes than humans that are sometimes harder to catch.
Things moved as fast as possible to migrate from .net framework to .net core 8, angular 8 to 18 and bootstrap 4.5 to 5.x
Just today, I spent an hour documenting a function that performs a set of complex scientific simulations. Defined the function input structure, the outputs, and put a bunch of references in the body to function calls it would use.
I then spent 15 minutes explaining to the free version of ChatGPT what the function needs to do both in scientific terms and in computer architecture terms (e.g. what needed to be separated out for unit tests). Then it asked me to answer ~15 questions it had (most were yes/no, it took about 5 min), then it output around 700 lines of code.
It took me about 5 minutes to get it working, since it had a few typos. It ran.
Then I spent another 15 minutes laying out all the categories of unit tests and sanity tests I wanted it to write. It produced ~1500 lines of tests. It took me half an hour to read through them all, adjusting some edge cases that didn't make sense to me and adjusting the code accordingly. And a couple cases where it was testing the right part of the code, but had made valiant but wrong guesses as to what the scientifically correct answer would be. All the tests then passed.
All in all, a little over two hours. And it ran perfectly. In contrast, writing the code and tests myself entirely by hand would have taken at least a couple of entire days.
So when you say they're good for those simple things you list and "that's about it", I couldn't disagree more. In fact, I find myself relying on them more and more for the hardest scientific and algorithmic programming, when I provide the design and the code is relatively self-contained and tests can ensure correctness. I do the thinking, it does the coding.
So that's... math. A very well defined problem, defined very well. Any decent programmer should be able to produce working software from that, and it's great that ChatGPT was able to help you get it done much faster than you could have done it yourself. That's also the kind of project that's very well suited for unit testing, because again: math. Functions with well defined inputs, outputs, and no side-effects.
Only a tiny subset of software development projects are like that though.
Right: the majority of software development is things like "build a REST API for these three database tables" or "build a contact form with these four fields" or "write unit tests for this new function" or "update my YAML CI configuration to run this extra command".
By hours of work spent and lines of code produced, the latter is on a whole different scale than systems programming (which is a very badly designed term anyway).
The example you gave sounds like the problem is deterministic, even if composed of many moving parts. That's one way of looking at complexity.
When I talk about complex problems I'm not just talking about intricate problems. I'm talking about problems where the "problem" is design, not just implementing a design, and that is where LLMs struggle a lot.
For example, I want to design a strongly typed fluent API interface to some functionality. Even knowing how to shape the fluent interface so that it is powerful, intuitive, well/strongly typed, and maintainable is a deep art.
The intuitive design constraints that I'm designing under would be hard to even explain to an LLM.
It is a lot faster at typing than I am.
AI video is an incredible tool, but it can't make movies.
It's almost as if all of these models are an exoskeleton for people that already know what they're doing. But you still need an expert in the loop.
To me this appears to be a very time-dependent assertion. 5 years ago, AI couldn't generate a good movie frame. 2 years ago, AI couldn't generate a good shot, but now in 2025, AI can generate a not-too-shabby scene. If capabilities continue improving at this rate (e.g. as they have with AI being able to generate full musical albums), I wouldn't bet against AI being able to generate a decent feature film in the next decade. It might take longer until it's the sort of thing that we'd present in festivals, but I just don't see a clear barrier any more.
Looking at it from another perspective, if an AI driven task currently requires "an expert in the loop" to navigate things by offering the appropriate prompts, evaluating and iterating on the AI generated content, then there's nothing clear to stop us from training the next generation of AI to include that expert's competency.
Taking it into full extrapolation mode, the thing that current generation AIs really don't have is the human experience that leads to a creative drive, but once we have robotic agents among us, these would arguably be able to start gathering "experiences" that they could then mine to write and produce "their own" stories.
Humans are sharply declining in this ability at the same time. Most of what Hollywood churns out now is superhero slop, forced-diversity spin-offs, awful remakes of classics, and awkward comebacks for yesteryear's leading men.
I know it's not a movie but I could've happily watched "Nothing, Forever" for the rest of my life. That was creative, chaotic, hilarious, and wildly entertaining.
Meanwhile I watched the human-created War Of The Worlds (2025) last weekend... The less said, the better.
I'd argue that they can't, at least on a short timeframe. Not because LLMs can't generate a program or product that works, but because there needs to be enough understanding of how the implementation works to fix any complex issues that come up.
One experience I had is that I had tried to generate a MITM HTTPS proxy that uses Netty using Claude, and while it generated a pile of code that looked good on the surface, it didn't actually work. Not knowing enough about Netty, I wasn't able to debug why it didn't work and trying to fix it with the LLM didn't help either.
Maybe PMs can pick up enough knowledge over time to be able to implement products that can scale, but by that time they'd effectively be a software engineer, minus the writing code part.
If all juniors are using AI, or even worse, no juniors are ever hired, I'm not sure how we can produce those seniors at the scale we currently do. Which isn't even that large a scale.
I have a strong opinion that AI will boost the importance of people with “special knowledge” more than anyone else regardless of role. So engineers with deep knowledge of a system or PMs with deep knowledge of a domain.
In a lot of ways I think that will lead to stronger delivery teams. As a designer—the best performing teams I've been on have individuals with a core competency, but a lot of overlap in other areas. Product managers with strong engineering instincts, engineers with strong design instincts, etc. When there is less ambiguity in communication, teams deliver better software.
Longer-term I'm unsure. Maybe there is some sort of fusion into all-purpose product people able to do everything?
https://worrydream.com/refs/Brooks_1986_-_No_Silver_Bullet.p...
The one key point is that I am keenly aware of what I can and cannot do. With these new superpowers, I often catch myself doing too much, and I end up doing a lot more rewrites than a real engineer would. But I can see Dunning Kruger playing out everywhere when people say they can vibe code an entire product.
It is helpful in reducing the number of keys I have to press and the amount of documentation-diving I need to do. But saying that’s writing code is like saying StackOverflow is writing code along with autocomplete.
I have no doubt some broken places end up in a similar mode, but en masse it doesn't make any financial sense.
Also when SHTF and you can't avoid going into deep debug with strong management pressure and oversight, it will become glaringly obvious which approach can maintain things running. And SHTF always happens, it's only a function of time.
I have a few scattered thoughts here but I think you’re caught up on how things are done now.
A human expert in a field is the customer.
Do you think, say, gpt5 pro can’t talk to them about a problem and what’s reasonable to try and build in software?
It can build a thing, with tests, run stuff and return to a user.
It can take feedback (talking to people is the key major thing LLMs have solved).
They can iterate (see: codex), deploy, and they can absolutely write copy.
What do you really think in this list they can’t do?
For simplicity reduce it to a relatively basic crud app. We know that they can make these over several steps. We know they can manage the ui pretty well, do incremental work etc. What’s missing?
I think something huge here is that some of the software engineering roles and management become exceptionally fast and cheap. That means you don’t need to have as many users to be worthwhile writing code to solve a problem. Entirely personal software becomes economically viable. I don’t need to communicate value for the problem my app has solved because it’s solved it for me.
Frankly most of the “AI can’t ever do my thing” comments come across as the same as “nobody can estimate my tasks they’re so unique” we see every time something comes up about planning. Most business relevant SE isn’t complex logically, interestingly unique or frankly hard. It’s just a different language to speak.
Disclaimer: a client of mine is working on making software simpler to build and I’m looking at the AI side, but I have these views regardless.
You'll get the occasional high agency non-technical customer who decides to learn how to get these things done with LLMs but they'll be a pretty rare breed.
I know that right now few want to sit in front of claude code, but it's just not that big of a leap to move this up a layer. Workflows do this even without the models getting better.
Candidly, it's awful. There are countless situations where it would be faster for me to edit the file directly (CSS, I'm looking at you!).
With that said, I've been surprised at how far the coding agents are able to go[0], and a lot less surprised about where I need to step in.
Things that seem to help: 1. Always create a plan/debug markdown file 2. Prompt the agent to ask questions/present multiple solutions 3. Use git more than normal (squash ugly commits on merge)
Planning is key to avoid half-brained solutions, but having "specs" for debug is almost more important. The LLM will happily dive down a path of editing as few files as possible to fix the bug/error/etc. This, unchecked, can often lead to very messy code.
Prompting the agent to ask questions/present multiple solutions allows me to stay "in control" over how something is built.
I now basically commit every time a plan or debug step is complete. I've tried having the LLM control git, but I feel that it eats into the context a bit too much. Ideally a 3rd party "agent" would handle this.
The last thing I'll mention is that Claude Code (Sonnet 4.5) is still very token-happy, in that it eagerly goes above and beyond when not always necessary. Codex (gpt-5-codex) on the other hand, does exactly what you ask, almost to a fault. For both cases, this is where planning up-front is super useful.
[0]Caveat: the projects are either Typescript web apps or Rust utilities, can't speak to performance on other languages/domains.
Try asking Opus to generate a simple application and it'll do it. It'll also add thousands of lines of setup scripts and migration systems and Dockerfiles and reports about how it built everything and... Ooof.
Sonnet 4.5 is the same, but at a slightly smaller scale. It still LOVES to generate markdown reports of features it did. No clue why, but by default it's on, you need to specifically tell it to stop doing that.
LLMs love that.
I very much share your experience. For the time being I like the experience with codex over claude, just because I find myself in a position where I know much sooner when to step in and just do it manually.
With claude I find myself in a typing exercise much more often. I could probably get better at knowing when to stop, ofc.
I've seriously tried gpt-5-codex at least two dozen times since it came out, and every single time it was either insufficient or made huge mistakes. Even with the "have another agent write the specs and then give it to codex to implement" approach, it's just not very good. It also stops after trying one thing and then says "I've tried X, tests still failing, next I will try Y" and it's just super annoying. Claude is really good at iterating until it solves the issue.
I've spent quite a bit of time with the normal GPT-5 in Codex (med and high reasoning), so my perspective might be skewed!
Oh, one other tip: Codex by default seems to read partial files (~200 lines at a time), so I make sure to add "Always read files in full" to my AGENTS.md file.
Noting your caveat but I’m doing this with Python and your experience is very different from mine.
The "it's awful" admission is due to the "don't look at code" aspect of this exercise.
For real work, my split is more like 80% LLM/20% non-LLM, and I read all the code. It's much faster!
Always create a plan/debug markdown file
Very much necessary. Especially with Claude I find. It auto-compacts so often (Sonnet 4.5) and it instantly goes a-wall stupid after that. I then make it re-read the markdown file, so we can actually continue without it forgetting about 90% of what we just did/talked about.
Prompt the agent to ask questions/present multiple solutions
I find that only helps marginally. They all output so much text it's not even funny. And that's with one "solution". I don't get how people can stand reading all that nonsense they spew, especially Claude. Everything is insta-ready to deploy, problem solved, root cause found, go hit the big red button that might destroy the earth in a mushroom cloud. I learned real fast to only skim what it says and ignore all that crap (as in I never tried to "change its personality" for real). I did try to tell it to always use the scientific method and prove its assumptions, but just like a junior dev it never does and just tells me stupid things it believes to be true and I have to question it. Again, just like a junior dev, but it's my junior dev that's always on and available when I have time and it does things while I do other stuff. And instead of me having to ask the junior after an hour or two what rabbit hole it went down and get them out of there, Claude and Codex usually visually ping the terminal before I even have time to notice. That's for when I don't have full time focus on what I'm trying to do with the agents, which is why I do like using them.
The times when I am fully attentive, they're just soooo slow. And many many times I could do what they're doing faster or just as fast but without spending extra money and "environment". I've been trying to "only use AI agents for coding" for like a month or two now to see its positives and limitations and form my own opinion(s).
Prompting the agent to ask questions/present multiple solutions allows me to stay "in control" over how something is built.
I find Claude's "Plan mode" is actually ideal. I just enable it and I don't have to tell it anything. While Codex "breaks out" from time to time and just starts coding even when I just ask it a question. If these machines ever take over, there's probably some record of me swearing at them and I will get a hitman on me. Unlike junior devs, I have no qualms about telling a model that it again ignored everything I told it.
Ideally a 3rd party "agent" would handle this.
With sub-agents you can. Simple git interactions are perfect for subagents because not much can get lost in translation in the interface between the main agent and the sub agent. Then again, I'm not sure how you lose that much context. I'd rather use a sub agent for things like running the tests and linter on the whole project in the final steps, which spew a lot of unnecessary output. Personally, I had a rather bad set of experiences with it controlling git without oversight, so I do that myself, since doing it myself is less taxing than approving everything it wants to do (I automatically allow Claude certain commands that are read only for investigations and reviewing things).
Could be because programming involves:
1. Long chains of logical reasoning, and
2. Applying abstract principles in practice (in this case, "best practices" of software engineering).
I think LLMs are currently bad at both of these things. They may well be among the things LLMs are worst at atm.
Also, there should be a big asterisk next to "can write code". LLMs do often produce correct code of some size and of certain kinds, but they can also fail at that too frequently.
Improving this is what everyone's looking into now. Even larger models, context windows, adding reasoning, or something else might improve this one day.
The next step would be to have a model running continuously on a project with inputs from monitoring services, test coverage, product analytics, etc. Such an agent, powered by a sufficient model, could be considered an effective software engineer.
We’re not there today, but it doesn’t seem that far off.
What time frame counts as "not that far off" to you?
If you tried to bet me that the market for talented software engineers would collapse within the next 10 years, I'd take it no question. 25 years, I think my odds are still better than yours. 50 years, I might not take the bet.
I've played around with agent-only code bases (where I don't code at all), and had an agent hooked up to server logs, which would create an issue when it encountered errors, and then an agent would fix the tickets, push to prod, check deployment statuses, etc. It worked well enough to see that this could easily become the future. (I also had Claude/Codex code that whole setup.)
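The first step of a loop like the one described, turning server log errors into issues, might look something like the sketch below. The log format (`<timestamp> <LEVEL> <message>`), the sample lines, and the `issues_from_logs` helper are all assumptions for illustration; the setup described above is not public, and filing the issue and handing it to a fixing agent are left out.

```python
import re

# Assumed log format: "<timestamp> <LEVEL> <message>", one entry per line.
ERROR_LINE = re.compile(r"^\S+ ERROR (?P<message>.+)$")


def issues_from_logs(lines: list[str]) -> dict[str, int]:
    """Group ERROR lines by message so each distinct error becomes one issue."""
    issues: dict[str, int] = {}
    for line in lines:
        match = ERROR_LINE.match(line)
        if match:
            message = match.group("message")
            issues[message] = issues.get(message, 0) + 1
    return issues


log = [
    "2025-01-01T10:00:00 INFO started",
    "2025-01-01T10:00:05 ERROR database timeout",
    "2025-01-01T10:00:09 ERROR database timeout",
    "2025-01-01T10:01:00 ERROR missing template home.html",
]
for message, count in issues_from_logs(log).items():
    print(f"issue: {message} (seen {count}x)")
```

Deduplicating before filing matters in a setup like this, since otherwise a single recurring error would open a new ticket, and trigger a new agent run, on every occurrence.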
Just for semantic nitpicking, I've zero shot heaps of small "software" projects that I use then throw away. Doesn't count as a SAAS product but I would still call it software.
An inevitable comment: "But I've seen AI code! So it must be able to build software"
Building an automated system that determines if a system is correct (whatever that means) is harder to build than the coding agents themselves.
I wonder if that same non-technical person who built the MVP with GenAI and requires (human) technical assistance today will need it tomorrow as well. Will the tooling be mature enough and lower the barrier enough for anyone to have a complete understanding about software engineering (monitoring services, test coverage, product analytics)?
That's what every no-programming-needed hyped tool has said. Yet here we are, still hiring programmers.
--Charles Babbage
We have now come to the point where you CAN put in the wrong figures and sometimes the right answer comes out (possibly over half the time!). This was and is incredible to me and I feel lucky to be alive to see it. However, people have taken that to mean that you can ask any old question any old way and have the right answer come out now. I might at one point have almost thought so myself. But LLMs currently are definitely not there yet.
Consider (eg) Claude Code to be your English SHell (Compare: zsh, bash).
Learn what it can and can't do for you. It's messier to learn than straight and/or/not; and I'm not sure there's manuals for it; and any manual will be outdated next quarter anyway; but that's the state of play at this time.
I've generally found the quality of .NET to be quite good. It trips up sometimes when linters ping it for rules not normally enforced, but it does the job reasonably well.
The front-end JavaScript though? It's both an absolute genius and a complete menace at the same time. It'll write reams of code to get things just right but with no regard for human maintainability.
I lost an entire session to the fact that it cheerfully did:
npm install fabric
npm install -D @types/fabric
Now that might look fine, but a human would have realised that the typings library is for a completely different, outdated API, with the package last updated 6 years ago. Claude, however, didn't realise this, and wrote a ton of code that would pass unit tests but fail the type check. It'd check the type checker, rewrite it all to pass the type checker, only for it now to fail the unit tests.
Eventually it semi-gave up typing and did loads of (fabric as any) all over the place, so now it just gave runtime exceptions instead.
I intervened when I realised what it was doing, and found the root cause of its problems.
It was a complete blindspot because it just trusted both the library and the typechecker.
So yeah, if you want to snipe a vibe coder, suggest installing fabricjs with typings!
Instead of just committing more often, make the agent write commits following the conventional commits spec (feat:, fix:, refactor:) and reference a specific item from your plan.md in the commit body. That way you’ll get a self-documenting history - not just of the code, but of the agent’s thought process, which is priceless for debugging and refactoring later on
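One way to hold an agent to that convention is a small check that could run from a commit hook. The sketch below is an assumption-heavy illustration: the `PLAN-n` identifiers for plan.md items and the `check_commit` helper are hypothetical, and only the three commit types named above are accepted even though the Conventional Commits spec allows others.

```python
import re

# Commit types named in the comment above; the Conventional Commits spec allows others too.
ALLOWED_TYPES = ("feat", "fix", "refactor")
# Header shape: "type: description" or "type(scope): description".
HEADER = re.compile(r"^(?P<type>[a-z]+)(\([\w-]+\))?: \S.*$")


def check_commit(message: str, plan_items: set[str]) -> list[str]:
    """Return a list of problems; an empty list means the message passes."""
    problems = []
    # The header is everything before the first blank line, the body everything after.
    header, _, body = message.partition("\n\n")
    match = HEADER.match(header)
    if not match or match.group("type") not in ALLOWED_TYPES:
        problems.append("header must look like 'feat: ...', 'fix: ...' or 'refactor: ...'")
    if not any(item in body for item in plan_items):
        problems.append("body must reference an item from plan.md")
    return problems


plan = {"PLAN-1", "PLAN-3"}  # hypothetical item ids taken from plan.md
good = "fix: handle empty cart\n\nImplements PLAN-3 from plan.md."
bad = "updated stuff"
print(check_commit(good, plan))
print(check_commit(bad, plan))
```

The plan reference is matched by plain substring here, which is enough for a sketch but would treat `PLAN-1` as present in `PLAN-10`; a stricter check would match whole identifiers.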
Back then everyone was saying developers would become obsolete and business analysts would just “click together” enterprise solutions. In the end, we got a mess of clunky non-scalable systems that still had to be fixed and integrated by the same engineers.
LLMs are basically low-code on steroids - they make it easier to build a prototype, but exponentially harder to turn it into something actually reliable.
The human brain learns through mistakes, repetition, breaking down complex problems into simpler parts, and reimagining ideas. The hippocampus naturally discards memories that aren’t strongly reinforced, so if you rely solely on AI, you’re simply not going to remember much.