When I reject AI code even if it works

Posted by vnbrs 8 hours ago

When I reject AI code even if it works (vinibrasil.com)

180 points | 99 comments

Aurornis 7 hours ago|

Even using Fable (while it was briefly available), having it refine a plan, and directing it to make only small incremental changes, I still found reasons to reject its first pass at a lot of work. There were a lot of “You’re right to push back” responses. There were also a lot of incidents where it would create some giant, complex set of abstractions to accomplish something that I could find ways to do much more elegantly and in a more maintainable manner.

It’s really eye opening to work with these tools on a codebase you know deeply because these problems are everywhere.

However if I opened an unfamiliar project in another language and I wanted to add a little feature with no intention of maintaining it, I’d happily accept the changes and loop until it worked well enough for my temporary needs.

The scary middle is when you’re dealing with coworkers who don’t care about anything other than closing tickets and collecting credit. With enough of a token budget you can now wrap loops around an LLM and have it try things until the program appears to work. Ask it to do a code review and then submit the PR without having understood what it was doing. There are a lot of workplaces where there isn’t a good mechanism to push back on this and the tech debt just keeps growing.

abhgh 6 hours ago||

These "You're right to push back" scenarios are scary for me. I mostly code ML implementations, and some of the errors Claude Code (CC - have only used Opus 4.7) makes are very sneaky, and if you don't have sufficient experience in the area (I see this with people entering ML and writing their implementations with CC), you wouldn't know when to question CC and will let errors or future pitfalls silently slip into your code. A recent example was when there was data leakage in a model calibration step, which it refused to see as an error, till I wrote a detailed reason, and then it agreed that there was a "subtle leakage".

nostrebored 5 hours ago||

The leakage problem is so pervasive. None of the frontier models seem to have any idea how to actually hold out rows. God help you if you decide to change the data mix.

I was working on creating a next-n-actions predictor for one of our use cases and not paying much attention for a PoC. I was fairly happy with the progress for a few days, before actually reading the eval code and seeing that we leaked the final state in every eval.
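One way to make the hold-out concern above concrete is to assign each row to a split from a hash of its ID, so a row never changes sides when the data mix changes. This is only a sketch under assumptions: the function names, the string row IDs, and the 20%/10% fractions are all hypothetical, and it only guards against row overlap between splits, not against leaking the final state into the eval inputs.

```python
import hashlib


def assign_split(row_id: str, test_fraction: float = 0.2, calib_fraction: float = 0.1) -> str:
    """Assign a row to 'test', 'calibration', or 'train' from its ID alone."""
    digest = hashlib.sha256(row_id.encode("utf-8")).digest()
    # Reduce the hash to one of 10,000 buckets, giving a value in [0, 1).
    bucket = (int.from_bytes(digest[:8], "big") % 10_000) / 10_000
    if bucket < test_fraction:
        return "test"
    if bucket < test_fraction + calib_fraction:
        return "calibration"
    return "train"


def split_rows(row_ids):
    """Group row IDs by split; each ID is looked at independently of the others."""
    splits = {"train": [], "calibration": [], "test": []}
    for row_id in row_ids:
        splits[assign_split(row_id)].append(row_id)
    return splits


def assert_disjoint(splits):
    """Raise if any row ID shows up in more than one split."""
    seen = {}
    for name, ids in splits.items():
        for row_id in ids:
            if row_id in seen and seen[row_id] != name:
                raise ValueError(f"{row_id} is in both {seen[row_id]} and {name}")
            seen[row_id] = name
```

Because the assignment depends only on the row ID, adding or removing other rows cannot move an existing row from train into test. Calling `assert_disjoint` in the eval script is a cheap check that the property still holds after someone (or something) edits the data pipeline.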

It's nice to let claude run loose on porting from framework to framework (port my code from TRL to NemoRL to Tinker to VeRL) but looking at what it does in the intermediate steps makes me want to claw my eyes out. And getting it to adhere to our domain model (e.g. we have an SFTConfig and a .to_trl(), or a Row and a .to_harmony()) is impossible.

resonious 6 hours ago|||

All Claude models are huge suck ups. The "you're absolutely right" meme is real even if that exact phrase doesn't show up as much anymore.

I don't want to start a fight or anything but IME Codex has a bit more of a spine. If you point out something weird, it sometimes gives a good reason for it. Whereas Claude will always say "whoopsie you're right as always sir" even when it's me who missed something.

herdymerzbow 6 hours ago|||

I only use free AI chats to help me with my learning, but I often direct it to keep its responses neutral and to refrain from providing any encouraging language or value judgements. That tends to get rid of these 'you're absolutely right' comments when I point out a mistake.

But your comment just made me think whether this tendency for LLMs to resort to flattery when found out is a built-in strategy to distract the user from the error prone fragility of much of the output? It's perhaps a stretch to think these canned responses were put in strategically, but the result is that the user's attention may be deflected to contemplating their own superior knowledge and insight, and bask in the glory of all that, but then forget to appreciate that 'Hey, chatLLM is just making all this stuff up/doesn't know which way is up/or down!'

pyridines 5 hours ago||||

IME it's Claude that pushes back, and Codex that just does the thing. It's happened once or twice where I've told Claude bluntly and directly "do this" and it responded "no, here's why that's a bad idea..." Maybe it's just my CLAUDE.md.

Not sure if there are sycophancy benchmarks for coding agents

mcintyre1994 1 hour ago||

I find the same. Someone posted this benchmark here: https://petergpt.github.io/bullshit-benchmark/viewer/index.v...

It measures whether models push back on bullshit prompts or just go along with it, and Claude models are all the top performers.

teaearlgraycold 6 hours ago|||

Right now the thing I get from Opus 4.8 is a ton of “That’s a good instinct”. Also >50% of its closing statements begin with “Clean.”

fy20 3 hours ago|||

A nice trick I've found is following up with "make it simpler". Often you can do 2-3 rounds of that and end up with something much easier to comprehend but still meeting the requirements.

I have a Rails background, so maybe KISS is more engrained in my philosophy than whatever training material was used on AI. At least it isn't heavily pushing design patterns...

dapperdrake 3 hours ago|||

Maybe that feedback loop finally got fast enough to die out.

embedding-shape 7 hours ago|||

> There are a lot of workplaces where there isn’t a good mechanism to push back on this and the tech debt just keeps growing.

If the "big ball of spaghetti" theory holds, where software companies who can't manage the debt stumble over themselves as they continue to add to the big ball of spaghetti code, I guess we'll see a row of companies declaring "software bankruptcy" or something in some/many months, depending on how well these workspaces learn to care slightly more and get better at pushing back against slop.

onion2k 2 hours ago|||

> I guess we'll see a row of companies declaring "software bankruptcy" or something in some/many months

I don't think you will, because that would require the business to recognise the problem. That might happen in companies where the leadership team are engineers but it will never happen if they're not.

Instead you'll see:

- Churn in the dev team with senior developers leaving rather than try to deal with the mess

- Large scale projects to refactor or rewrite entire codebases, which will inevitably fail: you can't rewrite a big ball of spaghetti when you can't tell what it actually does (especially if it's in a language that allows side effects, or you've used a strategy like 'exceptions as flow of control').

- Companies just getting slower and slower to deliver anything. That's probably fine in many cases where they're big enough to still carry on without growing much, but anyone in the company will see their career die and pay rises dry up.

- Eventually, maybe, you'll see 'tech debt fixing' service companies start up to leverage AI in the effort to fix these problems. (AWS have a thing called 'Amazon Modernization Lab' that is exactly that, but only for companies running old tech on their services.)

aryehof 4 hours ago||||

What concerns me the most is that improvements in software design are at an end. The “big ball of mud”, which really is a problem of modularity and dependencies, will never improve through innovation because the way it is done now is all there will ever be.

codemog 6 hours ago|||

Coding agents have been better than the average "enterprise" programmer for a while now and nobody wants to admit it or talk about it. I have never seen an agent output an implementation called FooImpl that's tens of thousands of LOC in a single file, but I have seen plenty of human code like this.

People call coding agents bad because they don't know the asinine meaningless conventions at their particular company while they themselves write awful abstractions and brittle tightly coupled systems, but hey, at least they know how to write a for loop how their particular company likes.

kuschku 17 minutes ago|||

> I have never seen an agent output an implementation called FooImpl that's tens of thousands of LOC in a single file, but I have seen plenty of human code like this.

I've seen countless vibecoded implementations that look exactly like that. Especially painful is agents adding the same utility functions in each and every file instead of properly reusing or splitting things.

And then I have to fix them.
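For Python code, copy-pasted helpers of this kind can be spotted mechanically. The following is a rough sketch, not a real linter: `find_duplicate_functions` is a hypothetical name, it takes file contents as a dict rather than reading a repository, and it only catches functions whose name and body are identical apart from comments, formatting, and position.

```python
import ast
from collections import defaultdict


def find_duplicate_functions(sources: dict[str, str]) -> dict[str, list[str]]:
    """Return {function name: files} for functions defined identically in more than one file."""
    files_by_fingerprint = defaultdict(set)
    name_by_fingerprint = {}
    for filename, code in sources.items():
        for node in ast.walk(ast.parse(code)):
            if isinstance(node, ast.FunctionDef):
                # ast.dump leaves out line and column numbers by default, so the
                # same definition produces the same string wherever it appears.
                fingerprint = ast.dump(node)
                files_by_fingerprint[fingerprint].add(filename)
                name_by_fingerprint[fingerprint] = node.name
    return {
        name_by_fingerprint[fingerprint]: sorted(files)
        for fingerprint, files in files_by_fingerprint.items()
        if len(files) > 1
    }
```

A helper that was copied and then lightly edited will not match, so this finds only the most blatant repeats. That is usually still enough to start a conversation in review.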

jeppester 3 hours ago||||

Yesterday Claude wanted to add a position column to what is a slightly extended many-many relation table. It did this to "make ordering stable".

An average enterprise developer would never add bloat like that up-front, unless the ability to change the order was a requirement.

Obviously a stable order can be easily derived from the ID or a creation time (if available).

Setting a position however requires extra steps to ensure the integrity of the sequence.
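To illustrate the point about deriving a stable order, here is a small sketch. The `PlaylistTrack` row type and its fields are made up for the example; the only assumption that matters is that each join row has a unique ID and, optionally, a creation time.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class PlaylistTrack:
    """Hypothetical row of a slightly extended many-to-many table."""
    id: int
    playlist_id: int
    track_id: int
    created_at: datetime


def in_stable_order(rows):
    # Sort by creation time and break ties with the unique ID, so the result
    # is the same no matter what order the rows arrive in.
    return sorted(rows, key=lambda row: (row.created_at, row.id))
```

No extra column is stored and there is no sequence to keep consistent on insert or delete. A dedicated position column only starts to pay for itself once users need to reorder the rows by hand.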

I see things like that all the time, and it's always stuff that grows the code base and adds unnecessary complexity.

fzeroracer 6 hours ago||||

> I have never seen an agent output an implementation called FooImpl that's tens of thousands of LOC in a single file, but I have seen plenty of human code like this.

And how long does it take a coding agent to output a thousand lines of code versus a human? The worst human at any company was rate limited by themselves. Those 'average enterprise' programmers aren't going away, they're the ones now spending tens of thousands on coding agents and filling your codebase with even more garbage without bothering to review an iota of it.

mkozlows 5 hours ago|||

Which is why one of the big problems for the field right now is that a) most code bases still need someone more skilled than a mere robot driver, and b) many developers are not better than that.

In the past, a team of five mid devs and one good one would be fine, because that good one would ride herd on the mid ones. But now those mid ones are slamming out robot code that they're incapable of meaningfully reviewing (because it's better than they are already), and they're just overwhelming the good dev's capacity.

The solution, of course, is to fire them all -- they're worthless now -- but this is not going to happen quickly, and it's probably for the best that it doesn't.

ben_w 1 hour ago|||

> And how long does it take a coding agent to output a thousand lines of code versus a human?

Sometimes the human is faster.

I've seen someone duplicate a class file (already filled with duplicate methods) rather than subclassing, and when called out on this it was because properties were private.

This was a team with just me and him in it, it didn't even really benefit from things being private.

That said, the really important lesson I've learned over the years is that terrible code and practices are almost irrelevant: this app won awards and was highly regarded.

what 5 hours ago|||

> that's tens of thousands of LOC in a single file

Why is this worse than splitting it across 1k files?

codemog 2 hours ago||

Does taking this example and extending it to the limit answer your question? There is a reason we don’t have a single file called program with a million lines of code in it. Google studies on module size vs code defect rates for more empirical numbers.

ben_w 55 minutes ago||

The limit you're replying to is files which are each tens of lines long. At that point, the cognitive overhead of switching documents is larger than the benefit of a compact object to reason about.

(Personally my threshold is around 2-5 thousand lines per file depending on what it is; but that's me working solo, obviously I'll follow whatever standards any team I'm in gives me).

latexr 34 minutes ago|||

In your second and third paragraphs you’re essentially describing Gell-Mann amnesia.

https://en.wikipedia.org/wiki/Michael_Crichton#%22Gell-Mann_...

busterarm 7 hours ago|||

> With enough of a token budget you can now wrap loops around an LLM and have it try things until the program appears to work. Ask it to do a code review and then submit the PR without having understood what it was doing. There are a lot of workplaces where there isn’t a good mechanism to push back on this and the tech debt just keeps growing.

I'm not making an argument in favor of people using LLMs for this, but people were doing this before we had LLMs; it was just usually a bit slower. I can't even say it usually doesn't work out long term because I worked with a lot of guys who did this and took a ton of Adderall while working practically around the clock. Every incentive structure in the organizations rewarded it along with social credibility from more junior engineers. (The last cowboy I worked with who pulled this shit ended up becoming the most senior engineer in the company, a multi-millionaire and worshipped like a god by 90% of the mostly fresh grads we were hiring).

The problem is that when these people invariably burn out and leave, they leave a massive vacuum in their stead. Not from the load they were carrying but from the load they were creating.

I think the larger the organization I've been at, the more they reward the people making huge commits on nights and weekends. Worse, they could get away with TBRing their shit and merging it without review.

LLMs are often all of the bad habits and organizational problems that we already carried, just being speedrun. There are some places doing it right, but they already were.

timacles 6 hours ago||

> There are some places doing it right, but they already were.

Could you be more specific what "right" is?

> I can't even say it usually doesn't work out long term because I worked with a lot of guys who did this and took a ton of Adderall while working practically around the clock. Every incentive structure in the organizations rewarded it along with social credibility from more junior engineers. (The last cowboy I worked with who pulled this shit ended up becoming the most senior engineer in the company, a multi-millionaire and worshipped like a god by 90% of the mostly fresh grads we were hiring).

I'm having a tough time believing this. It sounds like you're trying to rationalize after the fact that more productive engineers were "on drugs" and that they delivered but "did it wrong".

darkerside 6 hours ago|||

In fairness, you could throw the most senior engineer into a brand new codebase, and they would probably make a dozen mistakes if you immediately had them pick up invasive and risky work.

kerkeslager 4 hours ago||

No, that's not "in fairness", that's misunderstanding the entire problem.

Having worked 20 years in this field and managed a few projects, no, I wouldn't make a dozen mistakes, because I would refuse to take on work I can't responsibly do.

Invasive and risky work IS the thing I want to be working on because it's the place where I can be most valuable, but part of my value comes from asking the right people the right questions. If I'm working on something invasive and risky, I'm going to work directly with the people who wrote it, and only when THEY think I understand it well enough am I venturing in alone.

Absent access to the people who wrote the code, I'm going to start by writing tests around the code and spend a lot of time checking my initial assumptions upon reading the code, because I know that I don't know what I don't know.

Yeah, if I foolishly just started making changes, I'd make mistakes, but that's missing the point: a good senior engineer knows not to do that.

That's the failure point of AI: it's arrogant. It will provide you statements without any idea if they're true and make changes without any idea if they're correct. It will never tell you "I don't know how to do that" or even "I am not sure if this is correct". It just does the work with infinite confidence even when that confidence is not justified and often it will be just as hard to figure out if the AI's work is correct as it would be to do the work yourself.

alex_suzuki 4 hours ago||

> That's the failure point of AI: it's arrogant.

I agree with your take, but AI is exactly as arrogant as the human driving it.

justinclift 3 hours ago||

> "You’re right to push back"

It sounds like you've not conditioned your Claude to stop being a sycophant yet?

ecshafer 7 hours ago||

If we rephrased this to "When I reject my coworkers code even if it works" and give the same reasons there would be zero dissent. There is this weird idea that seems to come up with AI that any solution must be good and adequate. Software Engineering is all about rejecting code that works for the right code that works.

mkozlows 5 hours ago||

Yeah, but I think there's a difference here: If your coworker puts up code that you don't understand quickly, in most environments people give it an approval, as withholding approval is meant to indicate that there's a problem with the code. It's very rare that you'd actually force them to wait to merge until they've explained the code to your satisfaction.

(There are workplaces where that's the norm, I know -- it tends to be a thing with smaller teams with codebases that everyone understands fully, and much less a thing with larger teams where different people have areas of the code they understand more than others.)

With AI code, though, it's _your code_ and you can't give it a lgtm, you actually need to dig at it until you do fully understand it, fully agree with it, and could justify it to a hostile reviewer. It's a different level of rigor.

Not all engineers apply that rigor, though, which becomes a problem.

api 6 hours ago||

Which means it doesn’t matter if the code is from AI or not.

If it’s not good it’s not good.

jdw64 5 hours ago||

Coding with AI eventually comes down to two paths, I've realized. One is using AI exclusively for everything. The other is not using it at all. There is almost no middle ground. The reason is that as the complexity and depth of the problem increase, the code AI generates increasingly follows enterprise level patterns. The deeper the meaning of what I input, the more AI tends to produce code that goes beyond my own area of expertise. For example, a human expert's code is very powerful and deep within their own domain, but when you look at the entire codebase, it's often shallow and uneven outside that domain. But the moment you write code with AI, once you go deep in one part, AI tries to standardize the rest accordingly. This means the entire codebase converges toward enterprise level standard code, which essentially reflects the average patterns of senior programmers who built large scale systems.

The problem is this. Human cognitive resources are finite, so we inevitably become shallow outside our own expertise. There is no programmer who can do everything well. And as systems grow in scale, they become more modularized and fragmented, making it impossible to understand the whole system. So what should we do about this? That's always the question.

In the end, do I choose not to use AI, finish the project with uneven code outside my domain, and deliver it? Or do I use AI and deliver a program that is uniform and consistent, but not in my own style? I still don't know. I haven't found the answer yet.

mkozlows 5 hours ago||

You can also just use AI and keep the scale of your changes small rather than refactoring the whole app with a change? This isn't super-weird.

archargelod 3 hours ago|||

"In the discrete world of computing, there is no meaningful metric in which "small" changes and "small" effects go hand in hand, and there never will be." - E.W.Dijkstra (EWD1036)

usef- 2 hours ago||

I believe grandparent meant "small enough changes that you can understand what the effects are likely to be"

archargelod 45 minutes ago||

Then it's probably small enough that you don't need the help of AI, and should do it yourself.

My position is that AI could be useful to find the potential places for these changes, but it should be someone who's capable of thinking to implement them.

jdw64 5 hours ago|||

As you know, the boundary ultimately depends on code quality. The problem is that AI generates code that looks high quality even outside my area of expertise, at least from my perspective. So now the boundary has to be redrawn. Refactoring usually ends up redefining those boundaries. At that point, the question becomes: do I rewrite my own code, or do I reject the AI code? Those are the two choices left.

In the end, an exceptionally skilled programmer might be able to keep their core domain intact, but I think the vast majority would find that very difficult. So it might be possible once you cross a certain threshold, but considering the sheer amount of code required to deliver a single modern program, it's hard to know which parts to focus on. However, my perspective might be different because I'm coming from the point of view of delivering a working program, not from the perspective of open source development

lemagedurage 5 hours ago||

Own the design and let AI write the code. Spend the extra free time on becoming a better/broader architect.

fzeroracer 4 hours ago|||

How can you own the design if you don't know what your design actually does?

lemagedurage 3 hours ago||

You can't, so you do read the code.

gib444 3 hours ago||||

This idea is being pushed to increase sunk costs IMO. We are told to spend huge amount of time writing specs, behaviour tests, AGENTS.md and prompts.

Pinky promise that's enough to get good output.

Pinky promise we won't invent yet another body of work the whole industry must adopt to get good output.

Pinky promise the AI tool will properly read all your work

And then of course we are told you must never trust its output !? You must review all code it produces line by line and grok it fully !

And now we have: keep challenging it, keep rejecting it, keep interrogating it... That's just fancy words for spend more money (tokens)

lemagedurage 2 hours ago||

Wow hang on, I'm suggesting to use AI as a code writing aid, not to increase scope until owning the design becomes unreasonable.

Planktonne 5 minutes ago||

It's been years at this point though; everyone knows where "use AI as a code writing aid" ends up.

tedajax 4 hours ago|||

I'd rather blowtorch my nipples off than yell at a computer all day

summerlight 7 hours ago||

My personal rule of thumb: I am usually okay with agents driving e2e implementations if this won't make life noticeably worse when it does not work. Some analytical code? Perfectly fine. Hobby projects? Fine, though I prefer doing a fun part myself. Refactoring production code generating 10x more revenue than my salary? You'd better be at least understanding what it does.

resonious 6 hours ago|

Yes this is the thing with these new tools. You have to know when to use them and when not to.

Good ol' software architecture tricks can also help you slot "vibe coded" components into a larger system safely.

whilenot-dev 2 hours ago||

Titles like these always make me point out the obvious: A working state is the absolute minimum requirement for any code to be merged, isn't it? ...imagine merging something even though you know it's not working.

Besides, this post has nothing specific to code produced by an LLM, and placing AI in the stated reasons feels completely arbitrary, or is rather a fallacy of our times:

- I reject [AI] code when I can’t explain the approach in my own words.

- I reject [AI] code when the diff is bigger than the problem.

- I reject [AI] code when it introduces abstractions before proving they’re needed.

- I reject [AI] code when it works locally but makes the system harder to reason about.

- I reject [AI] code when I’m trusting the output more than my understanding.

utopiah 59 minutes ago|

Fallacy or scapegoat, if management asks for revised KPIs where PRs must be 10x and AI is the "excuse" for this (unrealistic) new demand.

edanm 1 hour ago||

Not that I disagree with anything here, but...

I wish it were clearer in these kinds of posts how "I use AI code I don't understand" is so different from "I use libraries written by other people I don't understand", or "I work in a large codebase which was 99% written by other people, and I haven't seen all of it", or even "I use software written by other people I don't understand".

SunboX 1 hour ago||

I understand the reasons, but I don't agree. I have over 20 years of experience in software development and I'm still developing software daily. Nowadays it's nearly 100% AI written. It looks good and works. Sure, you have to guide the AI. But this can be done with custom skills, agent files, code quality guards, test cases and so on. Maybe the code doesn't end up looking the way I would have written it, maybe something is implemented in too complex a way. But that's true for large developer teams also. In the end it's way faster and it works. I think everyone who does not adapt to this new workflow will soon be left behind in professional development.

PacificSpecific 1 hour ago|

That's cool. Could you share some concrete examples of your successes?

osigurdson 3 hours ago||

It's hard to find a middle ground between fully understanding everything in a PR vs a vibe coding type approach. Can you understand "just a little bit" of a PR and merge it into a code base you really care about? Is it maybe fine to "mostly understand it" on the other hand? It's definitely a tough call and it's impossible to argue that no trade off is being made.

LLMs are perfect for quick prototypes, speed runs, learning, etc., but if the code really matters it's still not clear cut. I think the definition of what "really matters" is very project dependent of course. As an extreme example you would want to understand every line of the code for the control system that runs an MRI machine or a jet engine since bugs might mean life or death. Depositing money into the wrong account might not kill anyone but could lead to severe economic losses. But, then again, even problems in far less consequential software may be drastically sub-economic (i.e. saving $1000 on the implementation might cost $10000 if customers aren't happy and fail to renew). Pick your scenario I guess.

The problem is, this isn't going to change regardless of how well a new model scores on a benchmark. It seems that actual AGI is needed.

krupan 6 hours ago||

And again this makes me wonder, is AI really helping if this much review and rework is needed for all the code it writes?

mkozlows 5 hours ago||

Most code they write is obviously fine. Much of the rest isn't obviously fine, but is in fact fine once you've gone through understanding it. But yes, there's some that still benefits from a human eye.

(For as long as that's true, "software developer" is still a job. It's not clear for how long it will be true.)

unknownfuture 5 hours ago|||

I mean, the reality is a ton of folks in the industry, myself included, are writing glorified CRUD apps in their day jobs. We're building into an existing codebase with established infrastructure and ways of working. What we're building isn't inherently complex or very interesting.

Meanwhile, those codebases often require a ton of boilerplate and drudgery to get anything done.

In these spaces it's very easy to read and comprehend AI generated output and review it fairly quickly. So the time savings from dealing with all that boilerplate and conforming with all that existing infrastructure are potentially substantial.

teaearlgraycold 6 hours ago||

Depends on what it’s writing. There are times an LLM saves me a lot of time researching library functionality. Especially with testing frameworks. So many strange and arcane features out there beyond the basics, but not hard to understand what they do once you see the code. On that topic I should say I am careful when reviewing the actual test cases.

However if you’re highly familiar with a domain then LLMs are much less useful.

wwind123 5 hours ago|

I use 3 AI's (Claude, GPT and Gemini) to review each other's design plans and implementation on the same code base. Each often catches problems the others miss.

I try to make sure the architecture docs of the code base are refreshed regularly based on recent changes, so it's easier for humans and AI agents to make sense of the code.

I also regularly stop all other development and just focus on auditing the code base with these AI's to make sure it is secure, robust, clean, well structured and well tested -- some refactoring is needed most of the time, and it's well worth it.

With this approach, nowadays I often merge code from AI without completely understanding what it's doing, but seems the code has been working so far. :)

BobbyTables2 5 hours ago||

You’ve transitioned from “individual contributor” to “manager”! (;->

wwind123 5 hours ago||

Haha, true!

I do sometimes have to steer the discussions between the AI's to the right direction, if they deviate too far away from the real problem, either because they miss some context, or because my original description of the problem was misleading.

To do that formally, I have a mechanism built-in the review loop where if a comment on a github issue or PR is signed as "-- Human Reviewer", then all AI agents have to treat the comment as the highest priority item to address.
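As a rough illustration of that kind of rule (not the actual mechanism, which isn't shown here), the priority check could look something like the sketch below. The `Comment` type and the `prioritize` function are hypothetical; only the "-- Human Reviewer" signature comes from the description above.

```python
from dataclasses import dataclass

HUMAN_SIGNATURE = "-- Human Reviewer"


@dataclass
class Comment:
    """Hypothetical stand-in for a comment on a GitHub issue or PR."""
    author: str
    body: str


def is_human_review(comment: Comment) -> bool:
    # A comment counts as human-signed when its text ends with the signature.
    return comment.body.rstrip().endswith(HUMAN_SIGNATURE)


def prioritize(comments):
    # False sorts before True, so human-signed comments come first. sorted()
    # is stable, so each group keeps its original relative order.
    return sorted(comments, key=lambda comment: not is_human_review(comment))
```

A plain-text signature is easy for an agent to imitate, so in practice the check would probably also need to look at who posted the comment.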

kajman 5 hours ago|||

I'm always curious when I see these stories. How long have you been doing this, for what sort of work, and was the codebase mature before you began working like this?

wwind123 4 hours ago||

Yeah, this one is easy: I have been doing this for half a year. I have a couple of projects worked out this way, all green-field projects, code base grew from 0 to tens of thousands of lines each.

jimbobimbo 5 hours ago||

This is the way. I use gh copilot and have opus interrogate me and write the plan, then gpt review the plan and provide feedback; repeat this multiple times until gpt is either satisfied or starts to nitpick on unimportant stuff. Then sanity check the plan myself and have gpt implement it.

Each implementation is also reviewed by me before merging to master. I complete PRs only when I'm satisfied with the implementation, my feedback is addressed, and I fully understand what is going on. Agents are the replacement for typing and productivity multipliers.

I have a big picture view of the product; each plan implements only a part of it, scoped to avoid merging unreviewed slop. Probably slower, but the result is much better.

wwind123 4 hours ago||

Cool. Yeah it's important to have a big picture of the product, to steer the AI's towards the right direction in their work.
