Posted by yuedongze 6 days ago
More and more often while doing code review, I find I don't understand something, I ask, and the "author" clearly has no idea what the code is doing either.
I find it quite troubling how little actual human thought is going into things. The AI's context window is not nearly large enough to fully understand the entire scope of any decently sized application's ecosystem. It just takes small peeks at bits and makes decisions based on a tiny slice of the world.
It's a powerful tool and as such needs to be guided with care.
I have seen so many projects where the people who understood all of it are just gone. They moved on, did something else, etc.
As soon as this happens, you no longer have anyone who 'gets it'. You have to handle many people adding or changing thin slices across all components, and you can only hope that the original people had enough foresight to add enough unit tests for the core decisions.
So I really don't mind AI here anymore.
“Whatever code you commit - you own it - no matter who (or what) wrote it.”
Make this your top-down directive, and fire people who insist on throwing trash over the fence into your yard.
Does your company not have many retirements, firings, or employees who quit to work elsewhere?
I want to stress that the main point of my article is not really about AI coding; it's about letting AI perform arbitrary tasks reliably. Coding is an interesting case because it's a place where we can exploit structure, abstraction, and approaches like TDD to make verification simpler - it's like spot-checking in places with a very low soundness error.
I'm encouraging people to look for tasks other than coding to see if we can find similar patterns. The more of these cost asymmetries we can find (verifying is easier than doing), the more we can harness AI's real potential.
One that works particularly well in my case is test-driven development followed by pair programming (a rough sketch follows the list):
• “given this spec/context/goal/… make test XYZ pass”
• “now that we have a draft solution, is it in the right component? is it efficient? well documented? any corner cases?…”
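For concreteness, a minimal sketch of the first step, assuming xUnit and a made-up TextUtils.Slugify helper (both just for illustration): the prompt hands the model the spec plus this failing test and asks it to make it pass.

using Xunit;

public class SlugifyTests
{
    // Written (or reviewed) before any implementation exists; the prompt is
    // "given this spec, make SlugifyTests pass".
    [Theory]
    [InlineData("Hello, World!", "hello-world")]
    [InlineData("  spaces   everywhere ", "spaces-everywhere")]
    public void Slugify_lowercases_and_hyphenates(string input, string expected)
    {
        Assert.Equal(expected, TextUtils.Slugify(input));
    }
}

The second prompt in the list is then a review pass over whatever draft makes this green.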
All the type systems (and model-checkers) for Rust, Ada, OCaml, Haskell, TypeScript, Python, C#, Java, ... are based on such research, and these are all rather weak in comparison to what research has created in the last ~30 years (see Rocq, Idris, Lean).
This goes beyond that: some of these mechanisms have been applied to mathematics, but also to some aspects of finance and law (I know of at least a few efforts to formally prove implementations of banking contracts and tax rules correct).
So there is lots to do in the domain. Sadly, like every branch of CS other than AI (and in fact pretty much every branch of science other than AI), this branch of computer science is underfunded. But that can change!
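As a toy illustration of what those proof assistants check (my own example, in Lean 4, not from the comment): the file only compiles if the claims are actually proved, which is a much stronger guarantee than a mainstream type checker gives.

-- n + 0 = n holds by definition of addition, so a bare rfl is accepted.
example (n : Nat) : n + 0 = n := rfl

-- 0 + n = n needs a real induction; the checker rejects anything short of a proof.
theorem zero_add' (n : Nat) : 0 + n = n := by
  induction n with
  | zero => rfl
  | succ n ih => rw [Nat.add_succ, ih]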
You're answering with finding bugs, which is about fixing one issue at a time.
Both are useful, but we're not speaking of the same scale.
I work on a large product with two decades of accumulated legacy, maybe that's the problem. I can see though how generating and editing a simple greenfield web frontend project could work much better, as long as actual complexity is low.
public static double ScoreItem(Span<byte> candidate, Span<byte> target)
{
//TODO: Return the normalized Levenshtein distance between the 2 byte sequences.
//... any additional edge cases here ...
}
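For what it's worth, a rough sketch of the kind of fill-in one might expect back (my own illustration; I'm assuming "normalized" means the edit distance divided by the longer sequence's length):

public static double ScoreItem(Span<byte> candidate, Span<byte> target)
{
    // Edge case: two empty sequences are identical.
    if (candidate.Length == 0 && target.Length == 0) return 0.0;

    // Classic two-row dynamic-programming Levenshtein distance.
    var previous = new int[target.Length + 1];
    var current = new int[target.Length + 1];
    for (int j = 0; j <= target.Length; j++) previous[j] = j;

    for (int i = 1; i <= candidate.Length; i++)
    {
        current[0] = i;
        for (int j = 1; j <= target.Length; j++)
        {
            int cost = candidate[i - 1] == target[j - 1] ? 0 : 1;
            current[j] = Math.Min(Math.Min(current[j - 1] + 1, previous[j] + 1), previous[j - 1] + cost);
        }
        (previous, current) = (current, previous);
    }

    // Normalize by the longer sequence so the score falls in [0, 1].
    return (double)previous[target.Length] / Math.Max(candidate.Length, target.Length);
}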
I think generating more than one method at a time is playing with fire. Individual methods can be generated by the LLM and tested in isolation. You can incrementally build up and trust your understanding of the problem space by going a little bit slower. If the LLM is operating over a whole set of methods at once, it is like starting over each time you have to iterate. Using an agentic system that can at least read the other bits of code is more efficient than copy-pasting snippets to a web page.
This is the point. I don't want it thinking about my entire project. I want it looking at a very specific problem each time.
Most code is about patterns, specific code styles and reusing existing libraries. Without context none of that can be applied to the solution.
If you put a programmer in a room and give them a piece of paper with a function and say OPTIMISE THAT! - is it going to be their best work?
Genuine productivity boost, but I don't feel like it's AI slop; sometimes it feels like it's actually reading my mind and just preventing me from having to type...
I've had net-time-savings with bigger agentic tasks, but I still have to check it line-by-line when it is done, because it takes lazy shortcuts and sometimes just outright gets things wrong.
Big productivity boost, it takes out the worst of my job, but I still can't trust it at much above the micro scale.
I wish I could give a system prompt for the tab complete; there's a couple of things it does over and over that I'm sure I could prompt away but there's no way to feed that in that I know of.
I like to read descriptive variable names, I just don't like to write them all the time.
When I give AI a smaller or more focused project, it's magical. I've been using Claude Code to write code for ESP32 projects and it's really impressive. OTOH, it failed to tell me about a standard device driver I could be using instead of a community device driver I found. I think any human who works on ESP-IDF projects would have pointed that out.
AI's failings are always a little weird.
I find hand-holding Claude a permanent source of frustration, except in the rare case that it helps me discover an error in the code.
E.g. it's great for refactoring now; it often updates the README along with renames without me asking. It's also really good at rebasing quickly, but only by cherry-picking inside a worktree. Churning out small components I don't want to add a new dependency for - those are usually good on the first try.
For implementing whole features, the space of possible solutions is way too big to always hit something that I'll be satisfied with. Once I have an idea of how to implement something in broad strokes, I can give it a very error-ridden first draft as a stream of thoughts, let it read all required files, and have it make an implementation plan. Usually that's not too far off, and it doesn't take that long. Once that's done, Opus 4.5 is pretty good at implementing that plan. Still, I read every line if this will go to production.
Ironically, this would be the best workflow with humans too.
Now I use agentic coding a lot with maybe 80-90% success rate.
I'm on greenfield projects (my startup), and maintaining strict .md files with architecture decisions and examples helps a lot (a hypothetical excerpt is sketched below).
I barely write code anymore; I mostly do code review and maintain the documentation.
In existing pre-AI codebases I think it's near impossible, because I've never worked anywhere that maintained documentation. It was always a chore.
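For what it's worth, a hypothetical excerpt of the kind of architecture file I mean (every name and path in it is invented for illustration):

* All database access goes through the repository classes under src/Data; handlers never call the ORM directly.
* New API endpoints copy the pattern in src/Api/Users: validation -> service -> repository.
* Errors are returned as ProblemDetails JSON via the helper in src/Api/Errors.
* Every new endpoint ships with an integration test next to the existing ones in tests/Api.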
I've tried vibe coding and usually end up with something subtly or horribly broken, with excessive levels of complexity. Once it digs itself a hole, it's very difficult to extricate it even with explicit instruction.
Another good use case is to use it for knowledge searching within a codebase. I find that to be incredibly useful without much context "engineering"
Let's say you want to add new functionality - for example, plugging into the shared user service that already exists in another service in the same monorepo. The AI will be really good at identifying an example and applying it to your service.
* My 5 years old project: monorepo with backend, 2 front-ends and 2 libraries
* 10+ years old company project: about 20 various packages in monorepo
In both cases I successfully give Claude Code or OpenCode instructions either at package level or monorepo level. Usually I prefer package level.
E.g. just now I gave my personal project the instruction: "Invoice styles in /app/settings/invoice should be localized". It figured out that the unlocalized strings come from a library package, added the strings to the code and messages files (including missing translations), but did not clean up the hardcoded strings from the library. Since I know the code, I added an extra prompt, "Maybe INVOICE_STYLE_CONFIGS can be cleaned-up in such case", and it cleaned up what I expected, then ran tests and linting.
Also - Claude (~the best coding agent currently, imo) will make mistakes, sometimes many of them. Tell it to test the code it writes and make sure it's working - I've generally found it's pretty good at debugging, testing, and fixing its own mistakes.
Instead of dealing with the intricacies of directly writing the code, I explain to the AI what we are trying to achieve next and which approach I prefer. This way I am still on top of it, I am able to judge the quality of the code it generated, and I'm the one who integrates everything.
So far I have found the tools that are supposed to edit the whole codebase at once to be useless. I instantly lose perspective when the AI IDE fiddles with multiple code blocks and does some magic. The chatbot interface is superior for me, as control stays with me and I still follow the code writing step by step.
I'm in a similar situation, and for the first time ever I'm actually considering if a rewrite to microservices would make sense, with a microservice being something small enough an AI could actually deal with - and maybe even build largely on its own.
> I work on a large product with two decades of accumulated legacy
You can start there. Does it ever stay that way?
Survey says: No.
Definitely. I've found Claude at least isn't so good at working in large existing projects, but great at greenfielding.
Most of my use these days is having it write specific functions and tests for them, which in fairness, saves me a ton of time.
That’s the typical “claude code writes all my code” setup. That’s my setup.
This does require you to fit your problem to the solution. But when you do, the results are tremendous.
This is not the case for most monoliths, unless they are structured into LLM-friendly components that resemble patterns the models have seen millions of times in their training data, such as React components.
In contrast, a poorly designed microservice can be replaced much more easily. You can identify the worst-performing and most problematic microservices and replace them selectively.
That's exactly my experience. While a well-structured monolith is a good idea in theory, and I'm sure such examples exist in practice, that has never been the case in any of my jobs. Friends working at other companies report similar experiences.
It's hardcoded into the system prompt, which is why your CLAUDE.md approach fails. I ended up intercepting and stripping it via a proxy.
And I think it's less about non-deterministic code (the code is actually still deterministic) and more about this new-fangled tool out there that finally allows non-coders to generate something that looks like it works. And in many cases it does.
Like a movie set. Viewed from the right angle it looks just right. Peek behind the curtain and it's all wood, thinly painted, and it's usually easier to rebuild from scratch than to add a layer on top.
I suspect that we're going to witness a (further) fork within developers. Let's call them the PM-style developers on one side and the system-style developers on the other.
The PM-style developers will be using popular loosely/dynamically-typed languages because they're easy to generate and they'll give you prototypes quickly.
The system-style developers will be using stricter languages and type systems and/or lots of TDD because this will make it easier to catch the generated code's blind spots.
One can imagine that these will be two clearly distinct professions with distinct toolsets.
There is a non-trivial cost in taking apart the AI code to ensure it's correct, even with tests. And I think it's easy to become slower than writing it from scratch.
It doesn't get to generate much of the code I'm shipping, though.
The more important property is that, unlike compilers, type checkers, linters, verifiers and tests, the output is unreliable. It comes with no guarantees.
One could be pedantic and argue that bugs affect all of the above. Or that cosmic rays make everything unreliable. Or that people are non-deterministic. All true, but the rate of failure, measured in orders of magnitude, is vastly different.
Technically you are right… but in practice, no. Ask an LLM to do any reasonably complex task and you will get different results. This is because the model changes periodically and we have no control over the host system's source of entropy. It's effectively non-deterministic.
If it works 85% of the time, how soon do you catch that it is moving in the wrong direction? Are you having a standup every few minutes for it to review its work with you? Are you reviewing hundreds of thousands of lines of code every day?
It feels a bit like pouring cement or molten steel really fast: at best, it works, and you get things done way faster. Get it just a bit wrong, and your work is all messed up, as well as a lot of collateral damage. But I guess if you haven't shipped yet, it's ok to start over? How many different respins can you keep in your head before it all blends?
> A large percentage (at least 50%) of the market for software developers will shift to lower paid jobs focused on managing, inspecting and testing the work that outsourced developers do. If a median software developer job paid $125k before, it'll shift to $65k-$85k type outsourced developer babysitting work after.
This argument is common and facile: Software development has always been about "automating ourselves out of a job", whether in the broad sense of creating compilers and IDEs, or in the individual sense that you write some code and say: "Hey, I don't want to rewrite this again later, not even if I was being paid for my time, I'll make it into a reusable library."
> the same thing
The reverse: What pisses me off is how what's coming is not the same thing.
Customers are being sold a snake-oil product, and its adoption may well ruin things we've spent careers de-crappifying by making them consistent and repeatable and understandable. In the aftermath, some portion of my (continued) career will be diverted to cleaning up the lingering damage from it.
AI is also great at looking for its own quality problems.
Yesterday on an entirely LLM generated codebase
Prompt: > SEARCH FOR ANTIPATTERNS
Found 17 antipatterns across the codebase:
And then what followed was a detailed list: about a third of them I thought were pretty important, a third were debatable, and the rest were either not important or effectively "this project isn't fully functional".
As an engineer, I didn't have to find code errors or fix code errors, I had to pick which errors were important and then give instructions to have them fixed.
The limit of "product manager plus extra technical context", as the context approaches infinity, is "programmer". Because the best, most specific way to specify extra technical context is just plain old code.
(It’s been said that Swift concurrency is too hard for humans as well though)
A good software engineering system built around the top LLMs today is definitely competitive in quality with a mediocre software shop, and 100x faster and 1000x cheaper.
But at least in its theoretical construction the LLM should be deterministic. It outputs a fixed probability distribution across tokens, with no RNG involved.
We then either sample from that fixed distribution non-deterministically for better performance, or use greedy decoding and get slightly worse performance in exchange for full determinism.
Happy to be corrected if I am wrong about something.
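A tiny sketch of the distinction (my own illustration, in C#, assuming System.Random): given the same fixed distribution over tokens, greedy decoding always returns the same index, while sampling can return different ones across runs.

// Greedy decoding: given a fixed probability distribution over tokens,
// the same input always yields the same argmax index - fully deterministic.
static int GreedyPick(double[] probs)
{
    int best = 0;
    for (int i = 1; i < probs.Length; i++)
        if (probs[i] > probs[best]) best = i;
    return best;
}

// Sampling: draws from that same fixed distribution, so repeated runs can
// differ even though the distribution itself never changed.
static int SamplePick(double[] probs, Random rng)
{
    double r = rng.NextDouble(), cumulative = 0.0;
    for (int i = 0; i < probs.Length; i++)
    {
        cumulative += probs[i];
        if (r < cumulative) return i;
    }
    return probs.Length - 1; // guard against floating-point rounding
}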
We're leaving my area of confidence, so take everything I write with a pinch of salt.
As far as I understand, indeed, each layer transforms a set of inputs into a probability distribution. However, if you wanted to compute entirely with probability distributions, you'd need the ability to compose these distributions across layers. Mathematically, it doesn't feel particularly complicated, but computationally, it feels like this adds several orders of magnitude of both space and time.
The writer could be very accomplished when it comes to development - I don't know - but they clearly don't understand a single thing about visual arts or culture. I probably could center those text boxes after fiddling with them for maybe ten seconds - I have studied art since I was a kid. My boyfriend could do it instantly without thinking for a second; he is a graphic designer. You might think that you are able to see what "looks good" since, hey, you have eyes, but no, you can't. There are a million details you will miss, or maybe you'll feel something is off but cannot quite say why. This is why you have graphic designers, who are trained to do exactly that. They can also use generative tools to make something genuinely stunning, unlike most of us. Why? Skills.
This is the same difference as why the guy in the story who can't code can't code even with an LLM, whereas the guy who can is able to code even faster with these new tools. If you use LLMs for basically auto-completion (what transformer models really are for), you can work with a familiar codebase very quickly, I'm sure. I've used it to generate SQL statements, which I can't be bothered to type myself, and it was perfect. If I try to generate something I don't really understand or know how to do, I'm lost, staring at some horrible gobbledygook that is never going to work. Why? Skills.
There is no "verification engineering". There are just people who know how to do things, who have studied their whole lives to get those skills. And no, you will not replace a real hardcore professional with an LLM. LLMs are just tools, nothing else. A tractor replaced the horse in turning the field, but you still need a farmer to drive it.
I'm sure lots of people will reply to you stating the opposite, but for what it's worth, I agree. I am not a visual artist... well, not any more, I was really into it as a kid and had it beaten out of me by terrible art teachers, but I digress... I am creative (music), and have a semblance of understanding of the creative process.
I ran a SaaS company for 20 years and would be constantly amazed at how bad the choices of software engineers would be when it came to visual design. I could never quite understand whether they just didn't care or just couldn't see. I always believed (hoped) it was the latter. Even when I explained basic concepts like consistent borders, grid systems, consistent fonts and font-sizing, less visual clutter, etc. they would still make the same mistakes over and over.
The trained eye immediately sees it - sees what's right and what's wrong. And that's why we still need experts. It doesn't matter what is being generated: if you don't have the expertise to know whether it's good or not, chances are glaring errors will be missed (in code and in visual design).
Before mechanisation, something like 50x more people worked in the agricultural sector compared to today. So tractors certainly left a huge number of people without work. Our society adapted to this change and absorbed these people into the industrial sector.
If LLMs worked like a tractor, they would force 49 out of 50 programmers (or, more generically, blue-collar workers) to leave their industry. Is there a place for them to work instead? I don't know.
But none of this changed how food grows, or the fact that you need somebody who bloody well knows what they are doing to produce it - especially with how mechanised it is today.
However, I do not believe LLM to be a tractor. More like a slightly different hammer. You still need to hit the nail.
But I'm a software engineer by trade, and I don't struggle with telling you that this thing has to move left for reason XY; I would struggle with the various tools capable of doing that particular thing for me.
And it does not matter here how I did it, if the result is the same.
In software engineering this is just not always the case, because often enough you need to verify that what you get is the thing you expect (did the report actually use the right numbers?), or that it is secure. Security is the biggest risk in all AI coding out there. Security is already so hard because people don't see it; they ignore it because they don't know.
You have so many non-functional requirements in software that just don't exist in art. If I need that image, that's it. The most complex things here? Perhaps color calibration, color profiles, and resolution.
If we talk about 3D it gets a little more complicated again, because now we're talking about the right 3D model, the right way to rig, etc.
Also, if someone says "I need a picture for X" and is happy with it, the risk is fewer customers. But if someone needs a new feature and tomorrow all your customer data is exposed, or the company's product stops working because of a basic bug, the company might be gone a week later.
For example, Inkscape has this and it is easy to use.
I'm more of a fan of aligning to an edge anyway. But some designers love to get really deep into these kinds of things, often in ways they can't really articulate.
Point is, even basic visual design is far from intuitive.
No, it neither thinks nor learns. It can give an illusion of thinking, and an AI model itself learns nothing. Instead it can produce a result based on its training data and context.
I think it's important that we do not ascribe human characteristics where they are not warranted. I also believe that understanding this can help us better utilize AI.
Without such automation and guard rails, AI generated code eventually becomes a burden on your team because you simply can't manually verify every scenario.
And I have on occasion found it useful.
If you can make as a rule "no AI for tests", then you can simply make the rule "no AI" or just learn to cope with it.
Sort of a nitpick, because what's written is true in some contexts (I get it, web development is like the ideal context for AI for a variety of reasons), but this is currently totally false in lots of knowledge domains very much like programming. AI is currently terrible at the math niches I'm interested in. Since there's no economic incentive to improve things and no mountain of literature on those topics, unless AI really becomes self-learning / improves in some real way, I don't see the situation ever changing. AI has consistently gotten effectively a 0% score on my personal benchmarks for those topics.
It's just aggravating to see someone write "totally undeniable" when the thing is trivially denied.
You've described AI hype bros in a nutshell, I think.