Posted by nurimamedov 4 hours ago
LLMs do understand codebases, and I've been able to get them to do refactors and clean up code without breaking anything, because they understand what they are doing.
Bugs are being solved faster than before. Crashes from production can be collected and fixed directly by an LLM, with no engineering time needed other than a review.
When I investigated I found the docs and implementation are completely out of sync, but the implementation doesn’t work anyway. Then I went poking on GitHub and found a vibed fix diff that changed the behavior in a totally new direction (it did not update the documentation).
Seems like everyone over there is vibing and no one is reasoning about the whole.
I can’t understand how people would run agents 24/7. The agent is producing mediocre code and is bottlenecked on my review & fixes. I think I’m only marginally faster than I was without LLMs.
And specifically: lots of checks for impossible error conditions, often supplying an incorrect "default value" for those conditions, which would result in completely wrong behavior that would be really hard to debug if a future change ever made those branches actually reachable.
I don’t know where the LLMs are picking up this paranoid tendency to handle every single error case. It’s worth knowing about the error cases, but deciding how they should be handled requires a lot more knowledge and reasoning about the current state of the program. Not something you can figure out just by looking at a snippet.
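A made-up sketch of the kind of thing I mean: the index here can never be out of bounds, yet the "defensive" fallback silently substitutes a wrong value instead of failing loudly.

    // Made-up illustration: `i` comes from enumerate(), so it can never be
    // out of bounds, yet the code "handles" that case with a silent fallback.
    fn scale_prices(prices: &[f64], factor: f64) -> Vec<f64> {
        let mut out = Vec::with_capacity(prices.len());
        for (i, _) in prices.iter().enumerate() {
            // Unreachable today. If a future refactor ever makes it reachable,
            // the 0.0 default silently corrupts the output instead of failing loudly.
            let price = prices.get(i).copied().unwrap_or(0.0);
            out.push(price * factor);
        }
        out
    }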
At the same time, the number of anti-patterns the LLM generates is higher than I am able to manage. No, CLAUDE.md and Skills.md have not fixed the issue.
Building a production-grade system using Claude has been a fool's errand for me. Whatever time/energy I save by not writing code, I end up paying back by reading code that I did not write and fixing anti-patterns left and right.
I rationalized it a bit, deflecting by saying this is the AI's code, not mine. But no, this is my code and it's bad.
This is starting to drive me insane. I was working on a Rust CLI that depends on Docker, and Opus decided to just… keep the CLI going with a warning "Docker is not installed" before jumping into a pile of garbage code that looks like it was written by a lobotomized kangaroo, because it tries to use an Option<Docker> everywhere instead of making sure Docker is installed and quitting with an error if it isn't.
What do I even write in a CLAUDE.md file? The behavior is so stupid I don’t even know how to prompt against it.
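For the record, the behavior I wanted is trivial; a minimal sketch (hypothetical check, not my actual code):

    use std::process::Command;

    // Hypothetical fail-fast check: refuse to start at all if Docker is missing,
    // instead of threading an Option<Docker> through the whole CLI.
    fn require_docker() -> Result<(), String> {
        // `docker version` exits non-zero when the CLI or daemon is unavailable.
        let ok = Command::new("docker")
            .arg("version")
            .output()
            .map(|o| o.status.success())
            .unwrap_or(false);
        if ok {
            Ok(())
        } else {
            Err("Docker is not installed or not running; this tool requires it.".into())
        }
    }

    fn main() {
        if let Err(msg) = require_docker() {
            eprintln!("error: {msg}");
            std::process::exit(1);
        }
        // The rest of the CLI can now assume Docker is present: no Option<Docker>.
    }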
Think about it: they have to work in a very limited context window, essentially just the immediate file where the change is taking place. Having broader knowledge of how the application deals with particular errors (catch them here and wrap? Let them bubble up? Catch and log but don't bubble up?) is outside their purview.
I can hear it now, "well just codify those rules in CLAUDE.md." Yeah but there's always edge cases to the edge cases and you're using English, with all the drawbacks that entails.
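To make that concrete, a sketch with invented names: three conventions that look identical from inside one file, and the agent can't know which one this project uses without the broader context.

    use std::fs;
    use std::io;

    #[derive(Debug)]
    enum ConfigError {
        Io(io::Error),
    }

    // 1. Catch here and wrap: translate the low-level error into a domain error.
    fn load_wrapped(path: &str) -> Result<String, ConfigError> {
        fs::read_to_string(path).map_err(ConfigError::Io)
    }

    // 2. Let it bubble up: the caller decides what an io::Error means.
    fn load_bubbled(path: &str) -> io::Result<String> {
        fs::read_to_string(path)
    }

    // 3. Catch and log, don't bubble up: fall back to a default and keep going.
    fn load_logged(path: &str) -> String {
        fs::read_to_string(path).unwrap_or_else(|e| {
            eprintln!("warning: could not read {path}: {e}; using defaults");
            String::new()
        })
    }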
In particular: writing tests that do nothing, writing tests and then skipping them to resolve test failures, and everybody's favorite: writing a test that greps the source code for a string (which is just insane; how did it get this idea?)
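To illustrate (a made-up example): the grep "test" next to what a behavioral test of the same code might look like.

    // Hypothetical helper under test: retry a fallible operation up to `max` times.
    fn with_retry<T, E>(max: u32, mut op: impl FnMut() -> Result<T, E>) -> Result<T, E> {
        let mut last = op();
        for _ in 1..max {
            if last.is_ok() {
                break;
            }
            last = op();
        }
        last
    }

    #[cfg(test)]
    mod tests {
        use super::*;

        // The "grep test" anti-pattern: passes even if the retry logic is broken.
        #[test]
        fn has_retry_logic_grep() {
            let src = include_str!("lib.rs"); // assumes this file is src/lib.rs
            assert!(src.contains("with_retry"));
        }

        // A behavioral test of the same feature: fails if the logic regresses.
        #[test]
        fn retries_on_transient_failure() {
            let mut attempts = 0;
            let result = with_retry(2, || {
                attempts += 1;
                if attempts == 1 { Err("transient") } else { Ok(42) }
            });
            assert_eq!(result, Ok(42));
            assert_eq!(attempts, 2);
        }
    }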
The assumption is that your test is right. That's TDD. Then you write your code to conform to the tests. Otherwise what's the point of the tests if you're just trying to rewrite them until they pass?
The creator of Claude Code literally brags about running 10 agents in parallel 24/7. It doesn't just seem like it; they've confirmed it, as if it's the most positive thing ever.
Full disclosure: I am a heavy Codex user and I review and understand every line of code. I manually fight the spurious tests it tries to add by pointing out that a similar one already exists and that we can get the same coverage with +1 LOC instead of +50. It's exhausting, but my personal productivity is still way up.
I think the future is bright because training / fine-tuning taste, dialing down agentic frameworks, introducing adversarial agents, and increasing model context windows all seem attainable and stackable.
I'm definitely faster, but there's a lot of LLM overhead to get things done right. I think if you're just using a single agent/session you're missing out on some of the speed gains.
I think a lot of the gains I get from using an LLM come from being able to have multiple agent sessions working on different projects at the same time.
That is not an uncommon occurrence in human-written code as well :-\
> Automation doesn't just allow you to create/fix things faster. It also allows you to break things faster.
https://news.ycombinator.com/item?id=13775966
Edit: found the original comment from NikolaeVarius
The degradation is palpable.
I have been using VS Code GitHub Copilot Chat, mostly with the Claude Opus 4.5 model. The underlying code for VS Code GitHub Copilot Chat has turned to shit. It will continuously make mistakes, no matter what, for 20 minutes at a time. This morning I was researching Claude Code and its pricing, thinking about switching, but this post sounds like it has turned to shit also. I don't mind spending $300-$500 a month for a tool that, a month ago, was accomplishing in a day what would take me 3-4 days to code. However, the days since the last update have been shit.
Clearly the AI companies can't afford to run these models at profit. Do I buy puts?
Then again, the google home page was broken on FF on Android for how long?
I run multiple agents in separate sessions. It starts with one agent, building out features or working on a task/bug fix. Once it gets some progress, I spin up another session and have it just review the code. I explicitly tell it things to look out for. I tell it to let me know about things I'm not thinking of and to make me aware of any blind spots. Whatever it reviews I send back to the agent building out features (I used to also review what the review agent told me about, but now I probably only review it like 20% of the time). I'll also have an agent session started just for writing tests, I tell it to look at the code and see if it's testable, find duplicate code, stale/dead code. And so on and so forth.
Between all of that + deterministic testing it's hard for shit to end up in the code base.
Doesn't mean it's not a useful tool - if you read and think about the output you can keep it in check. But the "100% of my contributions to Claude Code were written by Claude Code" claim by the creator makes me doubt this is being done.
Shaping a codebase is the name of the game - this has always been, and still is, difficult. Build something, add to it, refactor, the abstraction doesn’t sit right, refactor, the semantics change, refactor, etc., etc.
I’m surprised at how few seem to get this. Working in enterprise code, many codebases 10-20 years old could just as well have been produced by LLMs.
We’ve never been good at paying down debt, and you kind of need a bit of OCD to keep a codebase in check. LLMs exacerbate the lack of continuous moulding, as iterations can be massive and quick.
Not that old, big, non-AI software doesn't have similar maintainability issues (I keep posting this example, but I don't actually want to call that company out specifically; the problem is widespread: https://news.ycombinator.com/item?id=18442941).
That's why I'm reluctant to complain about AI code issues too much. The problem of how software is written at the higher level (the teams, the decisions, the rotating programmers) may be bigger than that of any particular technology or person actually writing the code.
I remember a company that offered me a contractor job; they wanted me to fix a lot of code they had received from their Eastern European programmers. They complained about them a lot in our meeting. However, after hearing them out, I was convinced the problem was not the people writing the code but the ones above them, who failed to provide accurate specs and clear guidance and were then surprised at the very end that it did not work as expected.
Similar with AI: it may be hard to disentangle what is project management and what is actually the fault of the AI. I found that you can live with pockets of suboptimal but mostly working code well enough, even adding features and fixing bugs easily, if the overall architecture is solid and the components are well isolated.
That is why I don't worry too much about the complaints here about bad error checks and other small stuff. Even if it is bad, you will have lots of such issues in typical large corporate projects, even with competent people, because programmers keep changing and management focuses on features over anything else (usually customers, internal or external, don't pay for code reorganization, only for new features). The layers above the low-level code matter more in deciding whether the project is and remains viable.
From what the commenters say, it seems to me the problem starts much higher up than the Claude-generated code, so it is hard to say how much the AI-generated code is actually at fault, IMHO. Whether you have inexperienced juniors or an AI producing the code, you need a solid project lead and solid architecture layers above the lines of code, first of all.
I'd much rather make plans based on reality
Other AI agents, I guess. Call Claude in to clean up code written by Gemini, then ChatGPT to clean up the bugs introduced by Claude, then start the cycle over again.
If the code is cheap (and it certainly is), then tossing it out and replacing it can also be cheap.
Similarly, human-in-the-loop use of AI/ML tooling in software development is expected and in fact encouraged.
Any IP that is monetizable and requires significant transformation will continue to see humans-in-the-loop.
Weak hiring in the tech industry is for other reasons (macro changes, crappy/overpriced "talent", foreign subsidies, demanding remote work).
AI+Competent Developer paid $300k TC > Competent Developer paid $400k TC >>> AI+Average Developer paid $30k TC >> Average Developer paid $40k TC >>>>> Average Developer paid $200k TC
Huh?
A Coding copilot subscription paired with a competent developer dramatically speeds up product and feature delivery, and also significantly upskills less competent developers.
That said, truly competent developers are few and far between, and the fact that developers in (e.g.) Durham, or remote, are demanding an SF-circa-2023 base makes the math to offshore more cost effective. Even if the delivered quality is subpar (which isn't necessarily true), it's good enough to release and can be refactored at a later date.
What differentiates a "competent" developer from an "average" developer is the learning mindset. Plenty of people on HN kvetch about being forced to learn K8s, Golang, Cloud Primitives, Prompt Engineering, etc., or about not working in a hub, and then bemoan the job market.
If we are paying you IB Associate level salaries with a fraction of the pedigree and vetting needed to get those roles, upskilling is the least you can do.
We aren't paying mid 6 figure TC for a code monkey - at that point we may as well entirely use AI and an associate at Infosys - we are paying for critical and abstract thinking.
As such, AI in the hands of a truly competent engineer is legitimately transformative.
Tl;dr - Mo' money, Mo' expectations
Edit: And 3 minutes later it is back...
You can assert that something you want to happen is actually happening.
How do you assert all the things it shouldn't be doing? They're endless. And AI WILL mess up.
It's enough if you're actively reviewing the code in depth... but if you're vibe coding? Good luck.
It's not a world where everything produced is immediately verified.
If a human consistently only produced work of the quality Claude Opus 4.5 is capable of, I would expect them to be fired from just about any job in short order. Yes, they'd get some stuff done, but they'd do too much damage to be worth it. Of course humans are much more expensive than LLMs to manage, so this doesn't mean it can't be a useful tool... it's just not that useful a tool yet.
1. Competent humans architecting and leading the system who understand the specs, business needs, have critical thinking skills and are good at their job
2. Automated tests
3. Competent human reviewers
4. QA
5. Angry users
Cutting out 1 and 3 in favor of more tests isn't gunna work
This can be abused because the programmer is both judge and jury, but people tend to handle this paradox much better than LLMs.
The number of times I have to "yell" at the LLM for adding #[allow] statements to silence the linter instead of fixing the code is crazy, and when I point it out it goes "Oops, you caught me, let me fix it the proper way".
So the tests don't necessarily make them produce proper code.
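A typical instance (invented example): the lint suppression it reaches for, versus the one-line fix it should have written.

    // What it keeps doing: silence clippy instead of fixing the code.
    #[allow(clippy::needless_range_loop)]
    fn sum_silenced(xs: &[i64]) -> i64 {
        let mut total = 0;
        for i in 0..xs.len() {
            total += xs[i];
        }
        total
    }

    // What it should do: fix the code so no #[allow] is needed.
    fn sum_fixed(xs: &[i64]) -> i64 {
        xs.iter().sum()
    }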
I spent 20 minutes guiding it because it kept putting the translation in the wrong cells, asking it not to convert the cells into a fancy table, and finally convincing it that it really did have access to alter the document, because at some point it denied it. I wasn't being rude, but it seems I somehow made it evasive.
I had to ask it to translate in the chat, and manually copy-pasted the translations in the proper cells myself. Bonus points because it only translated like ten cells at a time, and truncated the reply with a "More cells translated" message.
I can't imagine how hard it would be to handhold an LLM while working in a complex code base. I guess they are a godsend for prototypes and proofs of concept, but they can't beat a competent engineer yet. It's like that joke where a student answers that 2+2=5, and when questioned, he replies that his strength is speed, not accuracy.
So I have a different experience with Claude Code, but I'm not trying to say you're holding it wrong, just adding a data point. And then again, maybe I got lucky.
And this is not tied to LLMs. It's true of EVERYTHING we do. There are limits everywhere.
And for humans the context window might be smaller, but at least we have developed methods of abstracting different context windows, by making libraries.
Now, as a trade-off of trying to go super fast, changes need to be made in response to your current prompts, and there is no time to validate behavior in cases you haven't considered.
And regardless of whether you have abstractions in libraries, or whether you have inlined code everywhere, you're gonna have issues.
With libraries changes in behavior are going to impact code in places you don't want, but also, you don't necessarily know, as you haven't tested all paths.
With inlined code everywhere you're probably going to miss instances, or code goes on to live its own life and you lose track of it.
They built a skyscraper while shifting out foundational pieces. And now a part of the skyscraper is on the foundation of your backyard shed.
Folks have created software by "vibe coding". It is now time to "face the music" when doing so for production-grade software at scale.
That's a big, slow, and expensive process though.
Will Anthropic actually do that or will they keep throwing AI at it and hope the AI figures this approach out? We shall see...
---
> Just my own observation that the same pattern has occurred at least 3 times now:
> release a model; overhype it; provide max compute; sell it as the new baseline
> this attracts a new wave of users to show exponential growth & bring in the next round of VC funding (they only care about MAU going up, couldn’t care less about existing paying users)
> slowly degrade the model and reduce inference
> when users start complaining, initially ignore them entirely then start gaslighting and make official statements denying any degradation
> then frame it as a tiny minority of users experiencing issues; then, when pressure grows, blame it on “accidentally” misconfigured servers that “unintentionally” reduced quality (which coincidentally happened to save the company tonnes of $).
I cancelled my subscription.
Just because 99% of the things you read are critical and negatively biased doesn't mean the subsequent determination, or the consensus among participants in the public conversation, has anything to do with reality.
Dario is delusional, for this and other reasons.