LLM code generation may lead to an erosion of trust

Posted by CoffeeOnWrite 6/26/2025

LLM code generation may lead to an erosion of trust(jaysthoughts.com)

248 points | 275 comments

gblargg 6/26/2025|

(Works on older browsers and doesn't require JavaScript except to get past CloudSnare).

dirkc 6/26/2025||

I have a friend that always says "innovation happens at the speed of trust". Ever since GPT3, that quote comes to mind over and over.

Verification has a high cost and trust is the main way to lower that cost. I don't see how one can build trust in LLMs. While they are extremely articulate in both code and natural language, they will also happily go down fractal rabbit holes and show behavior I would consider malicious in a person.

acedTrex 6/26/2025||

Author here: I quite like that quote. A very succinct way of saying what took me a few paragraphs.

This new world of having to verify every single thing at all points is quite exhausting and frankly pretty slow.

JackFr 6/26/2025|||

https://www.cs.cmu.edu/~rdriley/487/papers/Thompson_1984_Ref...

The classic on the subject.

tayo42 6/26/2025||||

We do this is in professional environments already with documentation for designs upfront and code reviews though

EGreg 6/26/2025||||

“Freedom of speech” in politics

Herring 6/26/2025|||

So get another LLM to do it. Judging is considerably easier [For LLMs] than writing something from scratch, so LLM judges will always have that edge in accuracy. Equivalently, I also like getting them to write tons of tests to build trust in correct behavior.

acedTrex 6/26/2025|||

> Judging is considerably easier than writing something from scratch

I don't agree with this at all. Writing new code is trivially easy, to do a full in depth review takes significantly more brain power. You have to fully ascertain and insert yourself into someone elses thought process. Thats way more work than utilizing your own thought process.

Herring 6/26/2025||

Sorry, I should have been more specific. I meant LLMs are more reliable and accurate at judging than at generating from scratch.

They basically achieve over 80% agreement with human evaluators [1]. This level of agreement is similar to the consensus rate between two human evaluators, making LLM-as-a-judge a scalable and reliable proxy for human judgment.

[1] https://arxiv.org/abs/2306.05685 (2023)

habinero 6/26/2025|||

80% is a pretty abysmal success rate and means it's very unreliable.

It sounds nice but it means at least 1 in 5 are bad. That's worse odds than rolling 1 on a d6. You'll be tripping over mistakes constantly.

PeterStuer 6/29/2025|||

From the study: " [80% is] the same level of agreement between humans."

So if 80% is abysmal, you are getting the same abysmal level from human code reviewers.

habinero 6/30/2025||

Not the same thing. With human code reviewers, you can talk to them and resolve conflicts. You can even ask them to review code differently in future.

The whole point of a tool is to reduce workload. If you have to frequently correct it, then it's not worthwhile to use.

andrekandre 6/27/2025|||

  > 80% is a pretty abysmal success rate and means it's very unreliable.

EXACTLY. imagine if your car would at best start 80% of the time...

bluedel 6/27/2025||

At least if the car doesn't start, you can immediately tell.

yencabulator 6/27/2025||

Imagine Google Maps driving directions took you in the correct direction 80% of the time.

malfist 6/26/2025|||

LLMs will not have the context behind the lines of code in the CR.

Sure there's no bug with how the logic is defined in the CR or even in the context of the project, ti maybe won't throw an exception.

But the LLM won't know that the query is iterating over an unindexed field in the DB with the table in prod having 10s of millions of rows. The LLM won't know that even though the code says the button should be red and the comments say the button should be red, the corporate style guide says red should be a very specific hex code that it isn't.

inetknght 6/26/2025||||

> So get another LLM to do it.

Oh goodness that's like trusting one kid to tell you whether or not his friend lied.

In matters where trust matters, it's a recipe for disaster.

malfist 6/26/2025|||

LLMs inspecting LLM code is like the police investigating themselves for wrong doing.

Herring 6/26/2025|||

*shrug this kid is growing up fast

Give it another year and HN comments will be very different.

Writing tests already works now. It's usually easier to read tests than to read convoluted logic.

inetknght 6/26/2025|||

> shrug this kid is growing up fast

Mmmhmm. And you think this "growing up" doesn't have biases to lie in circumstances where it matters? Consider politics. Politics matter. It's inconceivable that a magic algorithm would lie to us about various political concerns, right? Right...?

A magic algorithm lying to us about anything would be extremely valuable to liars. Do you think it's possible that liars are guiding the direction of these magic algorithms?

habinero 6/26/2025||||

Sure, and there was a lot of hype about the blockchain a decade ago and how it would take over everything. YC funded a ton of blockchain startups.

I notice a distinct lack of blockchain hegemony.

catlifeonmars 6/26/2025||||

It’s also easy to misread tests FWIW.

dingnuts 6/26/2025|||

they've been saying that for three years and the performance improvement has been asymptotic (logarithmic) for a decade, if you've been following the state of the art that long.

skim1420 6/26/2025|||

0.9 * 0.9 == 0.81

kbelder 6/26/2025||

0.1 * 0.1 == 0.01

lubujackson 6/26/2025|||

We never can have total trust in LLM output, but we can certainly sanitize it and limit it's destructive range. Just like we sanitize user input and defend with pentests and hide secrets in dot files, we will eventually resolve to "best practices" and some "SOC-AI compliance" standard down the road.

It's just too useful to ignore, and trust is always built, brick by brick. Let's not forget humans are far from reliable anyway. Just like with driving cars, I imagine producing less buggy code (along predefined roads) will soon outpace humans. Then it is just blocking and tackling to improve complexity.

bluefirebrand 6/26/2025|||

> We never can have total trust in LLM output, but we can certainly sanitize it and limit it's destructive range

Can we really do this reliably? LLMs are non-deterministic, right, so how do we validate the output in a deterministic way?

We can validate things like shape of data being returned, but how do we validate correctness without an independent human in the loop to verify?

olivermuty 6/27/2025|||

Put four juniors in separate rooms and give them the same task. Do you expect them to produce identical solutions?

If no? Then congrats, you are now in a position where your software development lifecycle needs to handle non-determinism.

This fanatical vibing movement is ridicolous, but this luddite stance that LLMs cannot contribute to software dev because they are «non deterministic» is almost as ludicrus.

bluefirebrand 6/27/2025||

> Then congrats, you are now in a position where your software development lifecycle needs to handle non-determinism.

Sure, except the Juniors producing wrong solutions is the Juniors problem, not mine

If I give four LLM agents tasks and they call come back with slightly wrong solutions, that's me adding four problems to my own workload

I'm not sure how I'm supposed to keep up with that. I'm definitely not sure it makes me overall more productive

lovich 6/26/2025||||

The same way we did it with humans in the loop?

I check AI output for hallucinations and issues as I don’t fully trust it to work, but we also do PRs with humans to have another set of eyes check because humans also make mistakes.

For the soft sciences and arts I’m not sure how to validate anything from AI but for software and hard sciences I don’t see why test suites wouldn’t continue serving their same purpose

aDyslecticCrow 6/26/2025||

Famously, "it's easier to write code than to read it". That goes for humans. So why did we automate the easy part and move the effort over to the hard part?

If we need a human in the loop to check every row of code for the deep logic errors... then we could just get the human to write it no?

buescher 6/27/2025||

We’ve been automating the easy parts since the first compiler, but llms make everything weird.

hdjdbdirbrbtv 6/27/2025||

Respectfully, I disagree. An llm in my mind is a new compiler. Just it takes natural language and produces code.

aDyslecticCrow 6/27/2025||

It feels like we're talking about different technologies sometimes.

I find its a slightly improved google for vague questions. Or a doxygen writer.

Its all use I've found for any ai model since i first started playing with github copilot beta.

Ive been trying the newer models as they arrived, and found they're getting more verbose, more prone to hallucinating functions that dont exist, and more prone to praise me as a god when trying to ask about basic assumptions. (you're cutting to the heart of the matter)

What kind of code do you write where its somehow replacing coding itself? I spent 30 minutes trying to get mistral to write a basic bash script yesterday.

hdjdbdirbrbtv 7/1/2025||

I am playing with open weights models at home and yeah they are like that ... I use Claude 3.7 @ work and yeah it is a lot better ... Sometimes it will flub things but it also can write large amounts of code ... Mostly how I want (the pareto principle comes into play for the parts I don't want though).

So for me, the future will tend towards this ... Currently the tech is early days, we have no way to steer thought.. We have no way to align it to our thought processes... But eventually we will get to I want x pls make and it will be able to do it well.

mk_stjames 6/27/2025|||

I want to point out that LLMs can be completely deterministic if the final sampler is run with 0 temperature (picking the highest probability token), no top-k, fixed seed, etc.

yencabulator 6/27/2025||

Highest probability token can still vary nondeterministically when the computation is essentially racing GPU cores or even separate hosts against each other. Float math evaluation order can change the end result.

ngold 6/27/2025|||

You perfectly said nothing. Well done.

whiplash451 6/26/2025||

> "innovation happens at the speed of trust"

You'll have to elaborate on that. How much trust was there in electricity, flight and radioactivity when we discovered them?

In science, you build trust as you go.

agent281 6/26/2025|||

Have you heard of the War of the Currents?

> As the use of AC spread rapidly with other companies deploying their own systems, the Edison Electric Light Company claimed in early 1888 that high voltages used in an alternating current system were hazardous, and that the design was inferior to, and infringed on the patents behind, their direct current system.

> In the spring of 1888, a media furor arose over electrical fatalities caused by pole-mounted high-voltage AC lines, attributed to the greed and callousness of the arc lighting companies that operated them.

https://en.wikipedia.org/wiki/War_of_the_currents

bori5 6/26/2025||

Tesla is barely mentioned in that article which is somewhat surprising

throw4847285 6/26/2025||

Not surprising at all. He was a minor player in the Current Wars compared to his primary benefactor, George Westinghouse. His image was rehabilitated first by Serbian-Americans and then by webcomic artists and Redditors, who turned him into a secular saint.

Most of what people think they know about Tesla is not actually true if you examine the historical record. But software engineering as a discipline demands business villains and craftsman heroes, and so Edison and Tesla were warped to fit those roles even though in real life there is only evidence of cordial interactions.

reaperducer 6/26/2025||||

How much trust was there in electricity, flight and radioactivity when we discovered them?

Not much.

Plenty of people were against electricity when it started becoming common. They were terrified of lamps, doorbells, telephones, or anything else with an electric wire. If they were compelled to use these things (like for their job) they would often wear heavy gloves to protect themselves. It is very occasionally mentioned in novels from the late 1800's.

(Edit: If you'd like to see this played out visually, watch the early episodes of Miss Fisher's Murder Mysteries on ABC [.oz])

There are still people afraid of electricity today. There is no shortage of information on the (ironically enough) internet about how to shield your home from the harmful effects of electrical wires, both in the house and utility lines.

Flight? I dunno about back then, but today there's plenty of people who are afraid to fly. If you live in Las Vegas for a while, you start to notice private train cars occasionally parked on the siding near the north outlet mall. These belong to celebrities who are afraid to fly, but have to go to Vegas for work.

Radioactivity? There was a plethora of radioactive hysteria in books, magazines, comics, television, movies, and radio. It's not hard to find.

whiplash451 6/26/2025||

That’s exactly my point

dirkc 6/26/2025|||

I use it to mean that the more people trust each other, the quicker things get done. Maybe the statement can be rephrased as "progress happens at the speed of trust" to avoid the specific scientific connotation.

perrygeo 6/26/2025|||

Importantly, there are many business processes today that are already limited by lack of trust. That's not necessarily a bad thing either - checks and balances exist for a reason. But it does strongly suggest that increasing "productivity" by dumping more inputs into the process is counter-productive to the throughput of the overall system.

jnxx 6/27/2025||||

Trust is also essential for any form of symbolic information exchange.

Humans communicate using symbols. That could be patterns of sound waves, gestures, or written characters.

If we can't trust that the communicated symbols match their agreed meaning, what is the use of them? Communication breaks down quickly, and being social individuals which inter-depend on others, we can't live without communication. Nobody likes a liar, and the reason is, the words they give us do not match what they mean. (Perhaps not only humans. I read that dolphins readily come and rescue individuals which are drowning - including humans - but they punish individuals which fake drowning.)

And that goes from every stratum of social interaction, from big treaties to selling a bagel. Would you sell a bagel for a fake dollar bill? The number on it is a symbol as well, it has a meaning.

So it is right down one of the very bases of human cooperation.

reaperducer 6/26/2025||||

I use it to mean that the more people trust each other, the quicker things get done.

True not only in innovation, but in business settings.

I don't think there's anyone who works in any business long enough who doesn't have problems getting their job done simply because someone else with a key part of the project doesn't trust that you know what you're doing.

whiplash451 6/26/2025|||

That's a pretty useless statement in the context of innovation.

The moment a technology reaches trust at scale, it becomes a non-innovation in people's mind.

Happened for TVs, electrical light in homes, AI for chess, and Google. Will happen with LLM-based assistants.

jazzyjackson 6/26/2025||

You're not catching on. It's not the trust in the technology, it's the trust between people. Consider business dealings between entities that do not have high trust - everything becomes mediated through lawyers and nothing happens without a contract. Slow and expensive. Handshake deals and promises kept move things along a lot faster and without the expense of hammering out legal arrangements.

LLM leads to distrust between people. From TFA, That concept is Trust - It underpins everything about how a group of engineers function and interact with each other in all technical contexts. When you discuss a project architecture you are trusting your team has experience and viewpoints to back up their assertions.

blurbleblurble 6/26/2025||

I bumped into this at work but not in the way you might expect. My colleague and I were under some pressure to show progress and decided to rush merging a pretty significant refactor I'd been working on. It was a draft PR but we merged it for momentum's sake. The next week some bugs popped up in an untested area of the code.

As we were debugging, my colleague revealed his assumption that I'd used AI to write it, and expressed frustration at trying to understand something AI generated after the fact.

But I hadn't used AI for this. Sure, yes I do use AI to write code. But this code I'd written by hand and with careful deliberate thought to the overall design. The bugs didn't stem from some fundamental flaw in the refactor, they were little oversights in adjusting existing code to a modified API.

This actually ended up being a trust building experience over all because my colleague and I got to talk about the tension explicitly. It ended up being a pretty gentle encounter with the power of what's happening right now. In hindsight I'm glad it worked out this way, I could imagine in a different work environment, something like this could have been more messy.

Be careful out there.

kldg 6/28/2025|

It can be a pretty a serious and offensive accusation for sure. When a dev voices their own characters in a game and has a flat affect and/or stilted speech pattern, it's inevitable to be called AI by someone. Art I don't understand or appreciate? Likely AI. Unimpressed by a Eurovision entry? Call it AI. Some people toss this around casually, but I wouldn't.

I made myself known to be a big fool ~4 years ago. A local newspaper published an article on a particular person with outrageous claims primarily using photographs as proof. I challenged the editor directly via email, laying out my reasoning for why I was sure the images were manipulated. My arguments relied on misunderstandings on my part and the person claims were levied against showing zero deviation in position and stance while posing with multiple people during a meet-and-greet. The editor was offended and trolled me in response. I didn't let up, and he realized I was an idiot, not an agitator, and shared the full unpublished video from where the photos were taken with me, at which point I apologized deeply and made a donation. My ego was appropriately small for the following year.

Before emailing him, I shared the photos with some level-headed friends for their opinion, specifically because I didn't want to make a false accusation. They came to the same conclusion that the images were most likely manipulated, so I was very confident going in.

Now I trust this paper and people involved implicitly, but this was a lot of work to convince just one person.

stavros 6/26/2025||

I don't understand the premise. If I trust someone to write good code, I learned to trust them because their code works well, not because I have a theory of mind for them that "produces good code" a priori.

If someone uses an LLM and produces bug-free code, I'll trust them. If someone uses an LLM and produces buggy code, I won't trust them. How is this different from when they were only using their brain to produce the code?

acedTrex 6/26/2025||

Author here:

Essentially the premise is that in medium trust environments like very large teams or low trust environments like an open source project.

LLMs make it very difficult to make an immediate snap judgement about the quality of the dev that submitted the patch based solely on the code itself.

In the absence of being able to ascertain the type of person you are dealing with you have to fall back too "no trust" and review everything with a very fine tooth comb. Essentially there are no longer any safe "review shortcuts" and that can be painful in places that relied on those markers to grease the wheels so to speak.

Obviously if you are in an existing competent high trust team then this problem does not apply and most likely seems completely foreign as a concept.

lxgr 6/26/2025|||

> LLMs make it very difficult to make an immediate snap judgement about the quality [...]

That's the core of the issue. It's time to say goodbye to heuristics like "the blog post is written in eloquent, grammatical English, hence the point its author is trying to make must be true" or "the code is idiomatic and following all code styles, hence it must be modeling the world with high fidelity".

Maybe that's not the worst thing in the world. I feel like it often made people complacent.

acedTrex 6/26/2025|||

> Maybe that's not the worst thing in the world. I feel like it often made people complacent.

For sure, in some ways perhaps reverting to a low trust environment might improve quality in that it now forces harsher/more in depth reviews.

That however doesn't make the requirement less exhausting for people previously relying heavily on those markers to speed things up.

Will be very interesting to see how the industry standardizes around this. Right now it's a bit of the wild west. Maybe people in ten years will look back at this post and think "what do you mean you judged people based on the code itself that's ridiculous"

furyofantares 6/26/2025||||

I think you're unfair to the heuristics people use in your framing here.

You said "hence the point its author is trying to make must be true" and "hence it must be modeling the world with high fidelity".

But it's more like "hence the author is likely competent and likely put in a reasonable effort."

When those assumptions hold, putting in a very deep review is less likely to pay off. Maybe you are right that people have been too complacent to begin with, I don't know, but I don't think you've framed it fairly.

lxgr 6/26/2025||

> But it's more like "hence the author is likely competent and likely put in a reasonable effort."

And isn't dyslexic, and is a native speaker etc. Some will gain from this shift, some will lose.

furyofantares 6/26/2025||

Yes! This is part of why I bristle at such reductive takes, we can use more nuance thinking about what we are gaining and what we are losing and how to deal with it.

o11c 6/26/2025||||

That's not how heuristics work.

The heuristic is "this submission doesn't even follow the basic laws of grammar, therefore I can safely assume incompetence and ignore it entirely."

You still have to do verification for what passes the heuristic, but it keeps 90% of the crap away.

tempodox 6/26/2025|||

Anyway, “following all code styles” is just a fancy way of saying “adheres to fashion”. What meaningful conclusions can you draw from that?

rurp 6/26/2025||

It's not about fashion, it's about diligence and consideration. Code formatting is totally different from say clothing fashion. Social fashions are often about being novel or surprising which is the opposite of how good code is written. Code should be as standard, clear and unsurprising as is reasonably possible. If someone is writing code in a way that's deliberately unconventional or overly fancy that's a strong signal that it isn't very good.

When someone follows standard conventions it means that they A) have a baseline level of knowledge to know about them, and B) care to write the code in a clear and approachable way for others.

tempodox 6/26/2025||

> If someone is writing code in a way that's deliberately unconventional or overly fancy that's a strong signal that it isn't very good.

“unconventional” or “fancy” is in the eye of the beholder. Whose conventions are we talking about? Code is bad when it doesn't look the way you want it to? How convenient. I may find code hard to read because it's formatted “conventionally”, but I wouldn't be so entitled as to call it bad just because of that.

kiitos 6/26/2025|||

> “unconventional” or “fancy” is in the eye of the beholder.

Literally not: a language defines its own conventions, they're not defined in terms of individual users/readers/maintainers subjective opinions.

> Whose conventions are we talking about?

The conventions defined by the language.

> Code is bad when it doesn't look the way you want it to?

No -- when it doesn't satisfy the conventions established by the language.

> I may find code hard to read because it's formatted “conventionally”,

If you did this then you'd be wrong, and that'd be a problem with your personal evaluation process/criteria, that you would need to fix.

Capricorn2481 6/26/2025||

> a language defines its own conventions

Where are these mythical languages? I think the word you're looking for is syntax, which is entirely different. Conventions are how code is structured and expected to be read. Very few languages actually enforce or even suggest conventions, hence the many style guides. It's a standout feature of Go to have a format style, and people still don't agree with it.

And it's kinda moot when you can always override conventions. It's more accurate to say a team decides on the conventions of a language.

habinero 6/26/2025|||

No, they're absolutely correct that it's critical in professional and open source environments. Code is written once but read hundreds or thousands of times.

If every rando hire goes in and has a completely different style and formatting -- and then other people come in and rewrite parts in their own style -- code rapidly goes to shit.

It doesn't matter what the style is, as long as there is one and it's enforced.

Capricorn2481 6/26/2025||

> No, they're absolutely correct that it's critical in professional and open source environments. Code is written once but read hundreds or thousands of times

What you're saying is reasonable, but that's not what they said at all. They said there's one way to write cleanly and that's "Standard conventions", whatever that means. Yes, conventions so standard that I've read 10 conflicting books on what they are.

There is no agreed upon definition of "readable code". A team can have a style guide, which is great to follow, but that is just formalizing the personal preference of the people working on a project. It's not anymore divine than the opinion of a "rando."

habinero 6/26/2025||

No, you misunderstood what they said. And I misspoke a little, too.

While it's true that in principle it doesn't matter what style you choose as long as there is one, in practice languages are just communities of people, and every community develops norms and standards. More recent languages often just pick a style and bake it in.

This is a good thing, because again, code is read 1000x more times than it's written. It saves everyone time and effort to just develop a typical style.

And yeah, the code might run no matter how you indent it, but it's not correct, any more than you going to a restaurant and licking the plates.

Capricorn2481 6/28/2025||

> More recent languages often just pick a style and bake it in.

Again, there's a couple examples of languages doing this, and everything else is a free for all.

> No, you misunderstood what they said.

Agree to disagree. Nothing in that comment talks about the conventions of a language, only the conventions of code. Again, I don't disagree with what you say, but the person you replied to was in a completely different argument.

eddd-ddde 6/27/2025||||

> In the absence of being able to ascertain the type of person you are dealing with you have to fall back too "no trust" and review everything with a very fine tooth comb.

Is that not how you review all code? I don't care who wrote the code, just because certain person wrote the code doesn't give them an instant pass to skip my review process.

sim7c00 6/26/2025|||

its about the quality of the code, not the quality of the dev. you might think it's related, but it's not.

a dev can write piece of good, and piece of bad code. so per code, review the code. not the dev!

haswell 6/26/2025|||

> its about the quality of the code, not the quality of the dev. you might think it's related, but it's not.

I could not disagree more. The quality of the dev will always matter, and has as much to do with what code makes it into a project as the LLM that generated it.

An experienced dev will have more finely tuned evaluation skills and will accept code from an LLM accordingly.

An inexperienced or “low quality” dev may not even know what the ideal/correct solution looks like, and may be submitting code that they do not fully understand. This is especially tricky because they may still end up submitting high quality code, but not because they were capable of evaluating it as such.

You could make the argument that it shouldn’t matter who submits the code if the code is evaluated purely on its quality/correctness, but I’ve never worked in a team that doesn’t account for who the person is behind the code. If its the grizzled veteran known for rarely making mistakes, the review might look a bit different from a review for the intern’s code.

NeutralCrane 6/26/2025||

> An experienced dev will have more finely tuned evaluation skills and will accept code from an LLM accordingly. An inexperienced or “low quality” dev may not even know what the ideal/correct solution looks like, and may be submitting code that they do not fully understand. This is especially tricky because they may still end up submitting high quality code, but not because they were capable of evaluating it as such.

That may be true, but the proxy for assessing the quality of the dev is the code. No one is standing over you as you code your contribution to ensure you are making the correct, pragmatic decisions. They are assessing the code you produce to determine the quality of your decisions, and over time, your reputation as a dev is made up of the assessments of the code you produced.

The point is that an LLM in no way changes this. If a dev uses an LLM in a non-pragmatic way that produces bad code, it will erode trust in them. The LLM is a tool, but trust still factors in to how the dev uses the tool.

haswell 6/26/2025||

> That may be true, but the proxy for assessing the quality of the dev is the code.

Yes, the quality of the dev is a measure of the quality of the code they produce, but once a certain baseline has been established, the quality of the dev is now known independent of the code they may yet produce. i.e. if you were to make a prediction about the quality of code produced by a "high quality" dev vs. a "low quality" dev, you'd likely find that the high quality dev tends to produce high quality code more often.

So now you have a certain degree of knowledge even before you've seen the code. In practice, this becomes a factor on every dev team I've worked around.

Adding an LLM to the mix changes that assessment fundamentally.

> The point is that an LLM in no way changes this.

I think the LLM by definition changes this in numerous ways that can't be avoided. i.e. the code that was previously a proxy for "dev quality" could now fall into multiple categories:

1. Good code written by the dev (a good indicator of dev quality if they're consistently good over time)

2. Good code written by the LLM and accepted by the dev because they are experienced and recognize the code to be good

3. Good code written by the LLM and accepted by the dev because it works, but not necessarily because the dev knew it was good (no longer a good indicator of dev quality)

4. Bad code written by the LLM

5. Bad code written by the dev

#2 and #3 is where things get messy. Good code may now come into existence without it being an indicator of dev quality. It is now necessary to assess whether or not the LLM code was accepted because the dev recognized it was good code, or because the dev got things to work and essentially got lucky.

It may be true that you're still evaluating the code at the end of the day, but what you learn from that evaluation has changed. You can no longer evaluate the quality of a dev by the quality of the code they commit unless you have other ways to independently assess them beyond the code itself.

If you continued to assess dev quality without taking this into consideration, it seems likely that those assessments would become less accurate over time as more "low quality" devs produce high quality code - not because of their own skills, but because of the ongoing improvements to LLMs. That high quality code is no longer a trustworthy indicator of dev quality.

> If a dev uses an LLM in a non-pragmatic way that produces bad code, it will erode trust in them. The LLM is a tool, but trust still factors in to how the dev uses the tool.

Yes, of course. But the issue is not that a good dev might erode trust by using the LLM poorly. The issue is that inexperienced devs will make it increasingly difficult to use the same heuristics to assess dev quality across the board.

acedTrex 6/26/2025|||

> you might think it's related, but it's not.

In my experience they very much are related. High quality devs are far more likely to output high quality working code. They test, they validate, they think, ultimately they care.

In that case that you are reviewing a patch from someone you have limited experience with, it previously was feasible to infer the quality of the dev from the context of the patch itself and the surrounding context by which it was submitted.

LLMs make that judgement far far more difficult and when you can not make a snap judgement you have to revert your review style to very low trust in depth review.

No more greasing the wheels to expedite a process.

alganet 6/26/2025|||

> I learned to trust them because their code works well

There's so much more than "works well". There are many cues that exist close to code, but are not code:

I trust more if the contributor explains their change well.

I trust more if the contributor did great things in the past.

I trust more if the contributor manages granularity well (reasonable commits, not huge changes).

I trust more if the contributor picks the right problems to work on (fixing bugs before adding new features, etc).

I trust more if the contributor proves being able to maintain existing code, not just add on top of it.

I trust more if the contributor makes regular contributions.

And so on...

acedTrex 6/26/2025||

Author here:

Spot on, there are so many little things that we as humans use as subtle verification steps to decide how much scrutiny various things require. LLMs are not necessarily the death of that concept but they do make it far far harder.

moffkalast 6/26/2025|||

It's easy to get overconfident and not test the LLM's code enough when it worked fine for a handful of times in a row, and then you miss something.

The problem is often really one of miscommunication, the task may be clear to the person working on it, but with frequent context resets it's hard to make sure the LLM also knows what the whole picture is and they tend to make dumb assumptions when there's ambiguity.

The thing that 4o does with deep research where it asks for additional info before it does anything should be standard for any code generation too tbh, it would prevent a mountain of issues.

stavros 6/26/2025||

Sure, but you're still responsible for the quality of the code you commit, LLM or no.

moffkalast 6/26/2025|||

Of course you are, but it's sort of like how people are responsible their Tesla driving on autopilot, which then suddenly swerves into a wall and disengages two seconds before impact. The process forces you to make mistakes you wouldn't normally ever do or even consider a possibility.

JohnKemeny 6/26/2025||

To add to devs and Teslas, you have journalists using LLMs writing summaries, lawyers using LLMs writing dispositions, doctors using LLMs writing their patient entries, and law enforcement using LLMs writing their forensics report.

All of these make mistakes (there are documented incidents).

And yes, we can counter with "the journalists are dumb for not verifying", "the lawyers are dumb for not checking", etc., but we should also be open for the fact that these are intelligent and professional people who make mistakes because they were mislead by those who sell LLMs.

bluefirebrand 6/26/2025||

I think it's analogous to physical labour

In the past someone might have been physically healthy and strong enough to physically shovel dirt all day long

Nowadays this is rarer because we use an excavator instead. Yes, a professional dirt mover is more productive with an excavator than a shovel, but is likely not as physically fit as someone spending their days moving dirt with a shovel

I think it will be similar with AI. It is absolutely going to offload a lot of people's thinking into the LLMs and their "do it by hand" muscles will atrophy. For knowledge workers, that's our brain

I know this was a similar concern with search engines and Stack Overflow, so I am trying to temper my concern here as best I can. But I can't shake the feeling that LLMs provide a way for people to offload their thinking and go on autopilot a lot more easily than Search ever did

I'm not saying that we were better off when we had to move dirt by hand either. I'm just saying there was a physical tradeoff when people moved out of the fields and into offices. I suspect there will be a cognitive tradeoff now that we are moving away from researching solutions to problems and towards asking the AI to give us solutions to problems

acedTrex 6/26/2025|||

In an ideal world you would think everyone see's it this way. But we are starting to see an uptick in "I don't know the LLMs said do that."

As if that is a somehow exonerating sentence.

NeutralCrane 6/26/2025||

It isn’t, and that is a sign of a bad dev you shouldn’t trust.

LLMs are a tool, just like any number of tools that are used by developers in modern software development. If a dev doesn’t use the tool properly, don’t trust them. If they do, trust them. The way to assess if they use it properly is in the code they produce.

Your premise is just fundamentally flawed. Before LLMs, the proof of a quality dev was in the pudding. After LLMs, the proof of a quality dev remains in the pudding.

acedTrex 6/26/2025||

> Your premise is just fundamentally flawed. Before LLMs, the proof of a quality dev was in the pudding. After LLMs, the proof of a quality dev remains in the pudding.

Indeed it does, however what the "proof" is has changed. In terms of sitting down and doing a full, deep review, tracing every path validating every line etc. Then for sure, nothing has changed.

However, at least in my experience, pre LLM those reviews were not EVERY CASE there were many times I elided parts of a deep review because i saw markers in the code that to me showed competency, care etc. With those markers there are certain failure conditions that can be deemed very unlikely to exist and therefore the checks can be skipped. Is that ALWAYS the correct assumption? Absolutely not but the more experienced you are the less false positives you get.

LLMs make those markers MUCH harder to spot, so you have to fall back to doing a FULL indepth review no matter what. You have to eat ALL the pudding so to speak.

For people that relied on maybe tasting a bit of the pudding then assuming based on the taste the rest of the pudding probably tastes the same its rather jarring and exhausting to now have to eat all of it all the time.

NeutralCrane 6/26/2025||

> However, at least in my experience, pre LLM those reviews were not EVERY CASE there were many times I elided parts of a deep review because i saw markers in the code that to me showed competency, care etc.

That was never proof in the first place.

If anything, someone basing their trust in a submission on anything other than the code itself is far more concerning and trust-damaging to me than if the submitter has used an LLM.

acedTrex 6/26/2025||

> That was never proof in the first place.

I mean, it's not necessarily HARD proof but it has been a reliable enough way to figure out which corners to cut. You can of course say that no corners should ever be cut and while that is true in an ideal sense. In the real world things always get fuzzy.

Maybe the death of cutting corners is a good thing overall for output quality. Its certainly exhausting on the people tasked with doing the reviews however.

breuleux 6/26/2025||

I don't know about that. Cutting corners will never die.

Ultimately I don't think the heuristics would change all that much, though. If every time you review a person's PR, almost everything is great, they are either not using AI or they are vetting what the AI writes themselves, so you can trust them as you did before. It may just take some more PRs until that's apparent. Those who submit unvetted slop will have to fix a lot of things, and you can crank up the heat on them until they do better, if they can. (The "if they can" is what I'm most worried about.)

insane_dreamer 6/26/2025|||

> If someone uses an LLM and produces bug-free code, I'll trust them.

Only because you already trust them to know that the code is indeed bug-free. Some cases are simple and straightforward -- this routine returns a desired value or it doesn't. Other situations are much more complex in anticipating the ways in which it might interact with other parts of the system, edge cases that are not obvious, etc. Writing code that is "bug free" in that situation requires the writer of the code to understand the implications of the code, and if the dev doesn't understand exactly what the code does because it was written by an LLM, then they won't be able to understand the implications of the code. It then falls to the reviewer to understand the implications of the code -- increasing their workload. That was the premise.

somewhereoutth 6/26/2025|||

Because when people use LLMs, they are getting the tool to do the work for them, not using the tool to do the work. LLMs are not calculators, nor are they the internet.

A good rule of thumb is to simply reject any work that has had involvement of an LLM, and ignore any communication written by an LLM (even for EFL speakers, I'd much rather have your "bad" English than whatever ChatGPT says for you).

I suspect that as the serious problems with LLMs become ever more apparent, this will become standard policy across the board. Certainly I hope so.

stavros 6/26/2025|||

Well, no, a good rule of thumb is to expect people to write good code, no matter how they do it. Why would you mandate what tool they can use to do it?

somewhereoutth 6/26/2025||

Because it pertains to the quality of the output - I can't validate every line of code, or test every edge case. So if I need a certain level of quality, I have to verify the process of producing it.

This is standard for any activity where accuracy / safety is paramount - you validate the process. Hence things like maintenance logs for airplanes.

acedTrex 6/26/2025|||

> So if I need a certain level of quality, I have to verify the process of producing it

Precisely this, and this is hardly a unique to software requirement. Process audits are everywhere in engineering. Previously you could infer the process of producing some code by simply reading the patch and that generally would tell you quite a bit about the author itself. Using advanced and niche concepts with imply a solid process with experience backing it. Which would then imply that certain contextual bugs are unlikely so you skip looking for them.

My premise in the blog is basically that "Well now I have go do a full review no matter what the code itself tells me about the author."

badsectoracula 6/26/2025||

> My premise in the blog is basically that "Well now I have go do a full review no matter what the code itself tells me about the author."

Which IMO is the correct approach - or alternatively, if you do actually trust the author, you shouldn't care if they used LLMs or not because you'd trust them to check the LLM output too.

badsectoracula 6/26/2025||||

The false assumption here is that humans will always write better code than LLMs, which is certainly not the case for all humans nor all LLMs.

mexicocitinluez 6/26/2025|||

[flagged]

tranchebald 6/26/2025||||

I’m not seeing a lot of discussion about verification or a stronger quality control process anywhere in the comments here. Is that some kind of unsolvable problem for software? I think if the standard of practice is to use author reputation as a substitute for a robust quality control process, then I wouldn’t be confident that the current practice is much better than AI code-babel.

badsectoracula 6/26/2025||||

> Because when people use LLMs, they are getting the tool to do the work for them, not using the tool to do the work.

You can say that for pretty much any sort of automation or anything that makes things easier for humans. I'm pretty sure people were saying that about doing math by hand around when calculators became mainstream too.

breuleux 6/26/2025||||

I think the main issue is people using LLMs to do things that they don't know how to do themselves. There's actually a similar problem with calculators, it's just a much smaller one: if you never learn how to add or multiply numbers by hand and use calculators for everything all the time, you may sometimes make absurd mistakes like tapping 44 * 3 instead of 44 * 37 and not bat an eye when your calculator tells you the result is a whole order of magnitude less than what you should have expected. Because you don't really understand how it works. You haven't developed the intuition.

There's nothing wrong with using LLMs to save time doing trivial stuff you know how to do yourself and can check very easily. The problem is that (very lazy) people are using them to do stuff they are themselves not competent at. They can't check, they won't learn, and the LLM is essentially their skill ceiling. This is very bad: what plus-value are you supposed to bring over something you don't understand? AGI won't have to improve from the current baseline to surpass humans if we're just going to drag ourselves down to its level.

mexicocitinluez 6/26/2025||||

>Because when people use LLMs, they are getting the tool to do the work for them, not using the tool to do the work.

What? How on god's green earth could you even pretend to know how all people are using these tools?

> LLMs are not calculators, nor are they the internet.

Umm, okay? How does that make them less useful?

I'm going to give you a concrete example of something I just did and let you try and do whatever mental gymnastics you have to do to tell me it wasn't useful:

Medicare requires all new patients receiving home health treatment go through a 100+ question long form. This form changes yearly, and it's my job to implement the form into our existing EMR. Well, part of that is creating a printable version. Guess what I did? I uploaded the entire pdf to Claude and asked it to create a print-friendly template using Cottle as the templating language in C#. It generated the 30 page print preview in a minute. And it took me about 10 more minutes to clean up.

> I suspect that as the serious problems with LLMs become ever more apparent, this will become standard policy across the board. Certainly I hope so.

The irony is that they're getting better by the day. That's not to say people don't use them for the wrong applications, but the idea that this tech is going to be banned is absurd.

> A good rule of thumb is to simply reject any work that has had involvement of an LLM

Do you have any idea how ridiculous this sounds to people who actually use the tools? Are you going to be able to hunt down the single React component in which I asked it to convert the MUI styles to tailwind? How could you possibly know? You can't.

sebmellen 6/26/2025||||

You’re being unfairly downvoted. There is a plague of well-groomed incoherency in half of the business emails I receive today. You can often tell that the author, without wrestling with the text to figure out what they want to say, is a kind of stochastic parrot.

This is okay for platitudes, but for emails that really matter, having this messy watercolor kind of writing totally destroys the clarity of the text and confuses everyone.

To your point, I’ve asked everyone on my team to refrain from writing words (not code) with ChatGPT or other tools, because the LLM invariably leads to more complicated output than the author just badly, but authentically, trying to express themselves in the text.

jimbokun 6/26/2025|||

I find the idea of using LLMs for emails confusing.

Surely it's less work to put the words you want to say into an email, rather than craft a prompt to get the LLM to say what you want to say, and iterate until the LLM actually says it?

fwip 6/26/2025||

My own opinion, which is admittedly too harsh, is that they don't really know what they want to say. That is, the prompt they write is very short, along the lines of `ask when this will be done` or `schedule a followup`, and give the LLM output a cursory review before copy-pasting it.

jimbokun 6/26/2025||

Still funny to me.

`ask when this will be done` -> ChatGPT -> paste answer into email

type: "when will this be done?" Send.

sebmellen 6/27/2025||

Greetings Jimbokun,

I trust this message finds you well.

I am writing to inquire about the projected completion timeline for the HackerNews initiative. In order to optimize our downstream workflows and ensure all dependencies are properly aligned, an estimated delivery date would be highly valuable.

Could you please provide an updated forecast on when we might anticipate the project's conclusion? This data will assist in calibrating our subsequent operational parameters.

Thank you for your continued focus and effort on this task. Please advise if any additional resources or support from my end could help expedite the process.

Best regards, Sebmellen

acedTrex 6/26/2025|||

Yep, I have come to really dislike LLMs for documentation as it just reads wrong to me and I find so often misses the point entirely. There is so much nuance tied up in documentation and much of it is in what is NOT said as much as what is said.

The LLMs struggle with both but REALLY struggle with figuring out what NOT to say.

unsignedint 6/27/2025|||

I definitely see where you're coming from, though I have a slightly different perspective.

I agree that LLMs often fall short when it comes to capturing the nuanced reasoning behind implementations—and when used in an autopilot fashion, things can easily go off the rails. Documentation isn't just about what is said, but also what’s not said, and that kind of judgment is something LLMs do struggle with.

That said, when there's sufficient context and structure, I think LLMs can still provide a solid starting point. It’s not about replacing careful documentation but about lowering the barrier to getting something down—especially in environments where documentation tends to be neglected.

In my experience, that neglect can stem from a few things: personal preference, time pressure, or more commonly, language barriers. For non-native speakers, even when they fully understand the material, writing clear and fluent documentation can be a daunting and time-consuming task. That alone can push it to the back burner. Add in the fact that docs need to evolve alongside the code, and it becomes a compounding issue.

So yes, if someone treats LLM output as the final product and walks away, that’s a real problem. And honestly, this ties into my broader skepticism around the “vibe coding” trend—it often feels more like “fire and forget” than responsible tool usage.

But when approached thoughtfully, even a 60–90% draft from an LLM can be incredibly useful—especially in situations where the alternative is having no documentation at all. It’s not perfect, but it can help teams get unstuck and move forward with something workable.

short_sells_poo 6/26/2025|||

I wonder if this is to a large degree also because when we communicate with humans, we take cues from more than just the text. The personality of the author will project into the text they write, and assuming you know this person at least a little bit, these nuances will give you extra information.

flir 6/26/2025||||

> A good rule of thumb is to simply reject any work that has had involvement of an LLM,

How are you going to know?

bluefirebrand 6/26/2025|||

That's sort of the problem isn't it? There is no real way to know so we sort of just have to assume every bit of work is involving LLMs now so we have to take a lot closer look at everything

jnxx 6/27/2025|||

We already treat spam and unsolicited commercial email that way.

taneq 6/26/2025|||

If you have a long standing, effective heuristic that “people with excellent, professional writing are more accurate and reliable than people with sloppy spelling and punctuation” then the appearance of a semi-infinite group of ‘people’ writing well presented, convincingly worded articles which nonetheless are riddled with misinformation, hidden logical flaws, and inconsistencies, you’re gonna end up trusting everyone a lot less.

It’s like if someone started bricking up tunnel entrances and painting ultra realistic versions of the classic Road Runner tunnel painting on them, all over the place. You’d have to stop and poke every underpass with a stick just to be sure.

stavros 6/26/2025||

Sure, your heuristic no longer works, and that's a bit inconvenient. We'll just find new ones.

sebmellen 6/26/2025|||

Yeah, now you need to be able to demonstrate verbal fluency. The problem is, that inherently means a loss of “trusted anonymous” communication, which is particularly damaging to the fiber of the internet.

acedTrex 6/26/2025|||

Author here:

Precisely, in the age where it is very difficult to ascertain the type or quality of skills you are interacting with say in a patch review or otherwise you frankly have to "judge" someone and fallback to suspicion and full verification.

taneq 6/26/2025|||

Yeah I think "trust for a fluent, seemingly logically coherent anonymous responder" pretty much captures it.

oasisaimlessly 6/26/2025||||

"A bit inconvenient" might be the understatement of the year. If information requires say, 2x the time to validate, the utility of the internet is halved.

jnxx 6/27/2025|||

Too bad that the language we use is also a demonstration of social status. If I think about it, it could have a somewhat corrosive effect on that glue that keeps society in shape.

legacynl 6/27/2025|||

> How is this different from when they were only using their brain to produce the code?

If A = B, higher B means that A is higher too.

Now it's A + Ai = B. Now Higher B doesn't necessarily mean higher A.

Especially since the current state of Ai is pretty much stochastic, and sometimes is worse than nothing at all

JambalayaJimbo 6/28/2025|||

I have never been in a work environment in which I’ve been able to do more than rubber stamp PRs. Performing a deep review of each change is simply impossible with the expectations we were given.

stavros 6/28/2025||

Interesting, I've never been in one where we didn't read PRs.

cipherhood 6/28/2025|||

[dead]

mexicocitinluez 6/26/2025||

It's not.

What you're seeing now is people who once thought and proclaimed these tools as useless now have to start to walk back their claims with stuff like this.

It does amaze me that the people who don't use these tools seem to have the most to say about them.

acedTrex 6/26/2025|||

Author here:

For what it's worth I do actually use the tools albeit incredibly intentionally and sparingly.

I see quite a few workflows and tasks that they can be a value add on, mostly outside of the hotpath of actual code generation but still quite enticing. So much so in fact I'm working on my own local agentic tool with some self hosted ollama models. I like to think that i am at least somewhat in the know on the capabilities and failure points of the latest LLM tooling.

That however doesn't change my thoughts on trying to ascertain if code submitted to me deserves a full indepth review or if I can maybe cut a few corners here and there.

mexicocitinluez 6/26/2025||

> That however doesn't change my thoughts on trying to ascertain if code submitted to me deserves a full indepth review or if I can maybe cut a few corners here and there.

How would you even know? Seriously, if I use Chatgpt to generate a one-off function for a feature I'm working on that searches all classes for one that inherits a specific interface and attribute, are you saying you'd be able to spot the difference?

And what does it even matter it works?

What if I use Bolt to generate a quick screen for a PoC? Or use Claude to create a print-preview with CSS of a 30 page Medicare form? Or converting a component's styles MUI to tailwind? What if all these things are correct?

This whole OS repos will ban LLM-generated code is a bit absurd.

> or what it's worth I do actually use the tools albeit incredibly intentionally and sparingly.

How sparingly? Enough to see how it's constantly improving?

acedTrex 6/26/2025||

> How would you even know? Seriously, if I use Chatgpt to generate a one-off function for a feature I'm working on that searches all classes for one that inherits a specific interface and attribute, are you saying you'd be able to spot the difference?

I don't know, thats the problem. As a result, because I can't know I have to now do full in depth reviews no matter what. Which is the "judging" I tongue in cheek talk about in the blog.

> How sparingly? Enough to see how it's constantly improving?

Nearly daily, to be honest I have not noticed too much improvement year over year in regards to how they fail. They still break in the exact same dumb ways now as they did before. Sure they might generate correct syntactic code reliably now and it might even work. But they still consistently fail to grok the underlying reasoning for things existing.

But I am writing my own versions of these agentic systems to use for some rote menial stuff.

mexicocitinluez 6/26/2025||

So you werent doing in depth reviews before? Are these people you know? And now you just don't trust them because they include a tool on their workflow?

globnomulous 6/26/2025|||

> It does amaze me that the people who don't use these tools seem to have the most to say about them.

You're kidding, right? Most people who don't use the tools and write about it are responding to the ongoing hype train -- a specific article, a specific claim, or an idea that seems to be gaining acceptance or to have gone unquestioned among LLM boosters.

I recently watched a talk by Andrei Karpathy. So much in it begged for a response. Google Glass was "all the rage" in 2013? Please. "Reading text is laborious and not fun. Looking at images is fun." You can't be serious.

Someone recently shared on HN a blog post explaining why the author doesn't use LLMs. The justification for the post? "People keep asking me."

mexicocitinluez 6/26/2025||

Being asked if I'm kidding by the person comparing Google glasses to machine learning algorithms is pretty funny ngl.

And the "I don't use these tools and never will" sentiment is rampant in the tech community right now. So yes, I am serious.

Youre not talking about the blog post that completely ignored agentless uses are you? The one that came to the conclusion LLMs arent useful despite only using a subset of its features?

bluefirebrand 6/26/2025|||

> And the "I don't use these tools and never will" sentiment is rampant in the tech community right now

So is the "These tools are game changers and are going to make all work obsolete soon" sentiment

Don't start pretending that AI boosters aren't everywhere in tech right now

I think the major difference I'm noticing is that many of the Boosters are not people who write any code. They are executives, managers, product owners, team leads, etc. Former Engineers maybe but very often not actively writing software daily

globnomulous 6/26/2025|||

> I think the major difference I'm noticing is that many of the Boosters are not people who write any code.

Plenty of current, working engineers who frequent and comment on Hacker News say they use LLMs and find them useful/'game changers,' I think.

Regardless, I think I agree overall: the key distinction I see is between people who like to read and write programs and people who just want to make some specific product. The former group generally treat LLMs as an unwelcome intrusion into the work they love and value. The latter generally welcome LLMs because the people selling them promise, in essence, that with LLMs you can skip the engineering and just make the product.

I'm part of the former group. I love reading code, thinking about it, and working with it. Meeting-based programming (my term for LLM-assisted programming) sounds like hell on earth to me. I'd rather blow my brains out than continue to work as a software engineer in a world where the LLM-booster dream comes true.

bluefirebrand 6/26/2025||

> I'd rather blow my brains out than continue to work as a software engineer in a world where the LLM-booster dream comes true.

I feel the same way

But please don't. I promise I won't either. There is still a place for people like you and me in this world, it's just gonna take a bit more work to find it

Deal? :)

globnomulous 6/27/2025||

Sounds good, thanks!

mexicocitinluez 6/26/2025|||

> So is the "These tools are game changers and are going to make all work obsolete soon" sentiment

Except we aren't talking about those people, are we? The blog post wans't about that.

> Don't start pretending that AI boosters aren't everywhere in tech right now

PLEASE tell me what I said that made you feel like you need to put words in my mouth. Seriously.

> I think the major difference I'm noticing is that many of the Boosters are not people who write any code

I write code every day. I just asked Claude to convert a Medicare mandated 30 page assessment to a printable version with CSS using Cottle in C# and it did it. I'd love to know why that sort of thing isn't useful.

globnomulous 6/26/2025|||

> Being asked if I'm kidding by the person comparing Google glasses to machine learning algorithms is pretty funny ngl.

I didn't draw the comparison. Karpathy, one of the most prominent LLM proponents on the planet -- the guy who invented the term 'vibe-coding' -- drew the comparison.[1]

> And the "I don't use these tools and never will" sentiment is rampant in the tech community right now. So yes, I am serious.

I think you misunderstood my comment -- or my comment just wasn't clear enough: I quoted the line "It does amaze me that the people who don't use these tools seem to have the most to say about them." and then I asked "You're kidding, right?" In other words, "you can't seriously believe that the nay-sayers 'always have the most to say.'" It's a ridiculous claim. Just about every naysayer 'think piece' -- whether or not it's garbage -- is responding to an overwhelming tidal wave of pro-LLM commentary and press coverage.

> Youre not talking about the blog post that completely ignored agentless uses are you? The one that came to the conclusion LLMs arent useful despite only using a subset of its features?

I'm referring to this one[2]. It's awful, smug, self-important, sanctimonious nonsense.

[1] https://www.youtube.com/watch?si=xF5rqWueWDQsW3FC&v=LCEmiRjP...

[2] https://news.ycombinator.com/item?id=44294633

mexicocitinluez 6/26/2025||

I'm so confused as to why you took that so literally. I didn't literally mean that the nay-sayers are producing more words than the evangelists. It was a hyperbolic expression. And I wasn't JUST talking about the blog posts. I'm talking about ALL comments about it.

globnomulous 6/26/2025||

Sure, that's fair, though tone is difficult both to communicate and to detect in writing. I have just the literal meaning of your words. And I'm a very literal-minded person. :)

mexicocitinluez 6/27/2025||

Agreed. I am, too. So I get it.

satisfice 6/26/2025||

LLMs make bad work— of any kind— look like plausibly good work. That’s why it is rational to automatically discount the products of anyone who has used AI.

I once had a member of my extended family who turned out to be a con artist. After she was caught, I cut off contact, saying I didn’t know her. She said “I am the same person you’ve known for ten years.” And I replied “I suppose so. And now I realized I have never known who that is, and that I never can know.”

We all assume the people in our lives are not actively trying to hurt us. When that trust breaks, it breaks hard.

No one who uses AI can claim “this is my work.” I don’t know that it is your work.

No one who uses AI can claim that it is good work, unless they thoroughly understand it, which they probably don’t.

A great many students of mine have claimed to have read and understand articles I have written, yet I discovered they didn’t. What if I were AI and they received my work and put their name on it as author? They’d be unable to explain, defend, or follow up on anything.

This kind of problem is not new to AI. But it has become ten times worse.

bobjordan 6/26/2025|

I see where you're coming from, and I appreciate your perspective. The "con artist" analogy is plausible, for the fear of inauthenticity this technology creates. However, I’d like to offer a different view from someone who has been deep in the trenches of full-stack software development.

I’m someone who put in my "+10,000 hours" programming complex applications, before useful LLMs were released. I spent years diving into documentation and other people's source code every night, completely focused on full-stack mastery. Eventually, that commitment led to severe burnout. My health was bad, my marriage was suffering. I released my application and then I immediately had to walk away from it for three years just to recover. I was convinced I’d never pick it up again.

It was hearing many reports that LLMs had gotten good at code that cautiously brought me back to my computer. That’s where my experience diverges so strongly from your concerns. You say, “No one who uses AI can claim ‘this is my work.’” I have to disagree. When I use an LLM, I am the architect and the final inspector. I direct the vision, design the system, and use a diff tool to review every single line of code it produces. Just recently, I used it as a partner to build a complex optimization model for my business's quote engine. Using a true optimization model was always the "right" way to do it but would have taken me months of grueling work before, learning all details of the library, reading other people’s code, etc. We got it done in a week. Do I feel like it’s my work? Absolutely. I just had a tireless and brilliant, if sometimes flawed, assistant.

You also claim the user won't "thoroughly understand it." I’ve found the opposite. To use an LLM effectively for anything non-trivial, you need a deeper understanding of the fundamentals to guide it and to catch its frequent, subtle mistakes. Without my years of experience, I would be unable to steer it for complex multi-module development, debug its output, or know that the "plausibly good work" it produced was actually wrong in some ways (like N+1 problems).

I can sympathize with your experience as a teacher. The problem of students using these tools to fake comprehension is real and difficult. In academia, the process of learning, getting some real fraction of the +10,000hrs is the goal. But in the professional world, the result is the goal, and this is a new, powerful tool to achieve better results. I’m not sure how a teacher should instruct students in this new reality, but demonizing LLM use is probably not the best approach.

For me, it didn't make bad work look good. It made great work possible again, all while allowing me to have my life back. It brought the joy back to my software development craft without killing me or my family to do it. My life is a lot more balanced now and for that, I’m thankful.

satisfice 6/26/2025||

Here's the problem, friend: I also have put in my 10,000 hours. I've been coding as part of my job since 1983. I switched to testing from production coding in 1987, but I ran a team that tested developer tools, at Apple and Borland, for eight years. I've been living and breathing testing for decades as a consultant and expert witness.

I do not lightly say that I don't trust the work of someone who uses AI. I'm required to practice with LLMs as part of my job. I've developed things with the help of AI. Small things, because the amount of vigilance necessary to do big things is prohibitive.

Fools rush in, they say. I'm not a fool, and I'm not claiming that you are either. What I know is that there is a huge burden of proof on the shoulders of people who claim that AI is NOT problematic-- given the substantial evidence that it behaves recklessly. This burden is not satisfied by people who say "well, I'm experienced and I trust it."

bobjordan 6/26/2025||

Thank you for sharing your deep experience. It's a valid perspective, especially from an expert in the world of testing.

You're right to call out the need for vigilance and to place the burden of proof on those of us who advocate for this tool. That burden is not met by simply trusting the AI, you're right, that would be foolish. The burden is met by changing our craft to incorporate the necessary oversight to not be reckless in our use of this new tool.

Coming from the manufacturing world, I think of it like the transition in metalwork industry from hand tools to advanced CNC machines and robotics. A master craftsman with a set of metal working files has total, intimate control. When a CNC machine is introduced, it brings incredible speed and capability, but also a new kind of danger. It has no judgment. It will execute a flawed design with perfect, precision.

An amateur using the CNC machine will trust it blindly and create "plausibly good" work that doesn’t meet the specifications. A master, however, learns a new set of skills: CAD design, calibrating the machine, and, most importantly, inspecting the output. Their vigilance is what turns reckless use of a new tool into an asset that allows them to create things they couldn't before. They don't trust the tool, they trust their process for using it.

My experience with LLM use has been the same. The "vigilance" I practice is my new craft. I spend less time on the manual labor of coding and more time on architecture, design, and critical review. That's the only way to manage the risks.

So I agree with your premise, with one key distinction: I don’t believe tools themselves can be reckless, only their users can. Ultimately, like any powerful tool, its value is unlocked not by the tool itself, but by the disciplined, expert process used to control it.

axegon_ 6/26/2025||

That is already the case for me. The amount of times I've read "apologies for the oversight, you are absolutely correct" is staggering: 8 or 9 out of 10 times. Meanwhile I constantly see people mindlessly copy paying llm generated code and subsequently furious when it doesn't do what they expected it to do. Which, btw, is the better option: I'd rather have something obviously broken as opposed to something seemingly working.

autobodie 6/26/2025||

In my experience, LLMs are extremely inclined to modify code just to pass tests instead of meeting requirements.

fwip 6/26/2025||

When they're not modifying the tests to match buggy behavior. :P

devjab 6/26/2025|||

Are you using the LLM's through a browser chatbot? Because the AI-agents we use with direct code-access aren't very chatty. I'd also argue that they are more capable than a lot of junior programmers, at least around here. We're almost at a point where you can feed the agents short specific tasks, and they will perform them well enough to not really require anything outside of a code review.

That being said, the prediction engine still can't do any real engineering. If you don't specifically task them with using things like Python generators, you're very likely to have a piece of code that eats up a gazillion memory. Which unfortunately don't set them appart from a lot of Python programmers I know, but it is an example of how the LLM's are exactly as bad as you mention. On the positive side, it helps with people actually writing the specification tasks in more detail than just "add feature".

Where AI-agents are the most useful for us is with legacy code that nobody prioritise. We have a data extractor which was written in the previous millennium. It basically uses around two hunded hard-coded coordinates to extact data from a specific type of documents which arrive by fax. It's worked for 30ish years because the documents haven't changed... but it recently did, and it took co-pilot like 30 seconds to correct the coordinates. Something that would've likely taken a human a full day of excruciating boredom.

I have no idea how our industry expect anyone to become experts in the age of vibe coding though.

furyofantares 6/26/2025|||

> Because the AI-agents we use with direct code-access aren't very chatty.

Every time I tell claude code something it did is wrong, or might be wrong, or even just ask a leading question about a potential bug it just wrote, it leads with "You're absolutely correct!" before even invoking any tools.

Maybe you've just become used to ignoring this. I mostly ignore it but it is a bit annoying when I'm trying to use the agent to help me figure out if the code it wrote is correct, so I ask it some question it should be capable of helping with and it leads with "you're absolutely correct".

I didn't make a proposition that can be correct or not, and it didn't do any work yet to to investigate my question - it feels like it has poisoned its own context by leading with this.

devjab 6/27/2025|||

It may have to do with workflow. I rarely talk with the AI agent, I task it with a VIBE.md or a specific outlined prompt that relates to inline "COPILOT: ...." comments, and then I review the changes and either keep or dismiss them. When I dismiss them I'll mostly rewrite the promt and do it again in a new context window.

I did get curious though. So I decided to look up some of the times where I did correct it after I dismissed a change. I only looked at a couple of prompts but most of the AI responses looked like this:

"There are two issues...", "The error is because...", "The error persists because...", "A new route, /class_ids/fully_owned, has been added...".

I was feeling confident that it wasn't bullshitting me at that point, but then I get to this one:

"Thank you for the details. The error response..."

Now, that is the AI agent. If I use the browser or one of their "apps" the LLM politeness encouragement bullshit alone will often be longer than the entire chat response in an agent. Like this is the entire response to what it was tasked with in my example:

"To add this route, I'll implement a new endpoint that queries all unique class_id values and checks if all items with that class_id have is_owned == True. The route will return a list of such class_id values.

I'll add this as a new GET route, e.g., /class_ids/fully_owned."

ChatGPT (which is supposedly the same model) will spend those lines telling me what a great question it was.

steveklabnik 6/27/2025|||

I put some stuff in Claude.md to tell it to chill out, and that helps. If you want it to communicate with you in a given style, tell it to do that.

gibspaulding 6/26/2025||||

> Where AI-agents are the most useful for us is with legacy code

I’d love to hear more about your workflow and the code base you’re working in. I have access to Amazon Q (which it looks like is using Claude Sonnet 4 behind the scenes) through work, and while I found it very useful for Greenfield projects, I’ve really struggled using it to work on our older code bases. These are all single file 20,000 to 100,000 line C modules with lots of global variables and most of the logic plus 25 years of changes dumped into a few long functions. It’s hard to navigate for a human, but seems to completely overwhelm Q’s context window.

Do other Agents handle this sort of scenario better, or are there tricks to making things more manageable? Obviously re-factoring to break everything up into smaller files and smaller functions would be great, but that’s just the sort of project that I want to be able to use the AI for.

devjab 6/27/2025||

We use co-pilot through our azure license in VSC. My personal workflow is that I'll write a VIBE.md with very specific information on what I want and what I rexpect. Then in the actual code file I'll add a comment like "COPILOT: this is where I want you to do X". I'll then grant the agent access to the necessary files for the context. With big files it gets trickier because the prediction engine fails to distinguish between relevant and irrelevant context. I have the most success with incremental changes where the agent has to do one task at a time, and you can outline that in the VIBE.md + the comments where you add "COPILOT: This is step X...". In my coordinate example it actually had to change quite a lot of things, but that is still what I consider one task.

Context size matters a lot in my experience, but I'm not sure if it matters whether your 100k lines are in a single or multiple files. I tend to cut down what I feed the agent to the actual context, so if I have a 100k line file, but only 3000 lines matter, then I'll only feed those 3000 lines to the AI. Even in a couple of small files with maybe 200 lines of code in total, I'll only give the AI access to the 40 line which is the context it needs to work on.

English isn't my first language, so when I say context, what I mean is everything which is related to the change I want the agent to do. I will use SQLC as an example. Even though I feed the AI the Go model generated, I'll also give it access to the raw SQL file.

> Obviously re-factoring to break everything up into smaller files and smaller functions would be great, but that’s just the sort of project that I want to be able to use the AI for.

I'm guessing here, but I think part of our success is also our YAGNI approach. AI seem to have an easier time with something like Go where everything is explicit, everything is functions and Go modules live in isolation. Similarily AI will do much better with Python that is build with dataclasses and functions, and struggle with Python that is build upon more traditional OOP hierarchies. We've also had very little success with agents on C#. I have no idea whether that is because of C#'s inherrent implicity and "black magic" or because of the .net > .net core > .net framework > .net + whatever I forgot journey confusing the prediction engine.

> Do other Agents handle this sort of scenario better

I don't know. I've only used the sanctioned co-pilot agent professionally. I believe that is a GPT-4 model, but I'm not exactly sure on the details. For personal projects I use both the free version of GPT-4 in co-pilot and Claude Sonnet 4, and I haven't noticed much of a difference, but I have no hobby projects which are compareable.

teeray 6/26/2025||||

> Because the AI-agents we use with direct code-access aren't very chatty.

So they’re even more confident in their wrongness

devjab 6/27/2025||

When there is less chat they appear less confident, but I think you're pretty spot on in point out why they are dangerous. If you're using them in an area where you're an expert, it's very easy to review the "diff" suggestion they come up with and decide whether it's bullshit or not. If you're using them in an area where you're not an expert, then how will you know?

axegon_ 6/27/2025|||

Of course I don't have any extensions locally, I am not a lunatic. I don't always have access to my personal hardware and I would never trust an extension to pass my code around over http to a server I don't have full control of. Ddosecrets should have been enough of a warning for most people but I suspect countless more will have to learn that lesson the hard way.

mexicocitinluez 6/26/2025||

> 8 or 9 out of 10 times.

Not they don't. This is 100% a made up statistic.

bluefirebrand 6/26/2025||

It isn't even being presented as a statistic it is someone saying what they have experienced

mexicocitinluez 6/27/2025|||

not to respond again but whats even funnier is that IT IS a statistic. saying "90% of the time I get a certain response" is literally a statistic.

mexicocitinluez 6/27/2025|||

nice. i'll add this to the list of "totally useless replies".

HardCodedBias 6/26/2025||

All of this fighting against LLMs is pissing in the wind.

It seems that LLMs, as they work today, make developers more productive. It is possible that they benefit less experienced developers even more than experienced developers.

More productivity, and perhaps very large multiples of productivity, will not be abandoned due roadblocks constructed by those who oppose the technology due to some reason.

Examples of the new productivity tool causing enormous harm (eg: bug that brings down some large service for a considerable amount of time) will not stop the technology if it being considerable productivity.

Working with the technology and mitigating it's weaknesses is the only rational path forward. And those mitigation can't be a set of rules that completely strip the new technology of it's productivity gains. The mitigations have to work with the technology to increase its adoption or they will be worked around.

mjr00 6/26/2025||

> It seems that LLMs, as they work today, make developers more productive.

Think this strongly depends on the developer and what they're attempting to accomplish.

In my experience, most people who swear LLMs make them 10x more productive are relatively junior front-end developers or serial startup devs who are constantly greenfielding new apps. These are totally valid use cases, to be clear, but it means a junior front-end dev and a senior embedded C dev tend to talk past each other when they're discussing AI productivity gains.

> Working with the technology and mitigating it's weaknesses is the only rational path forward.

Or just using it more sensibly. As an example: is the idea of an AI "agent" even a good one? The recent incident with Copilot[0] made MS and AI look like a laughingstock. It's possible that trying to let AI autonomously do work just isn't very smart.

As a recent analogy, we can look at blockchain and cryptocurrency. Love it or hate it, it's clear from the success of Coinbase and others that blockchain has found some real, if niche, use cases. But during peak crypto hype, you had people saying stuff like "we're going to track the coffee bean supply chain using blockchain". In 2025 that sounds like an exaggerated joke from Twitter, but in 2020 it was IBM legitimately trying to sell this stuff[1].

It's possible we'll look back and see AI agents, or other current applications of generative AI, as the coffee blockchain of this bubble.

[0] https://www.reddit.com/r/ExperiencedDevs/comments/1krttqo/my...

[1] https://www.forbes.com/sites/robertanzalone/2020/07/15/big-c...

parineum 6/26/2025||

> In my experience, most people who swear LLMs make them 10x more productive are relatively junior front-end developers or serial startup devs who are constantly greenfielding new apps. These are totally valid use cases, to be clear, but it means a junior front-end dev and a senior embedded C dev tend to talk past each other when they're discussing AI productivity gains.

I agree with this quite a lot. I also think that those greenfield apps quickly become unmanageable by AI as you need to start applying solutions that are unique/tailored for your objective or you want to start abstracting some functionality into building components and base classes that the AI hasn't seen before.

I find AI very useful to get me to a from beginner to intermediate in codebases and domains that I'm not familiar with but, once I get the familiarity, the next steps I take mostly without AI because I want to do novel things it's never seen before.

conartist6 6/26/2025|||

And here it is again. "More productive"

But this doesn't mean that the model/human combo is more effective at serving the needs of users! It means "producing more code."

There are no LLMs shipping changesets that delete 2000 lines of code -- that's how you know "making engineers more productive" is a way of talking about how much code is being created...

eikenberry 6/26/2025||

My wife's company recently hired some contractors and they were touting their productivity with AI by saying how it allowed them (one person) to write 150k lines of code in 3 weeks. They said this without sarcasm. It was funny and scary at the same time that anyone might buy this as a good outcome. Classic lines-of-code metric rearing its ugly head again.

FuckButtons 6/26/2025|||

I think you’re arguing against something the author didn’t actually say.

You seem to be claiming that this is a binary, either we will or won’t use llms, but the author is mostly talking about risk mitigation.

By analogy it seems like you’re saying the author is fundamentally against the development of the motor car because they’ve pointed out that some have exploded whereas before, we had horses which didn’t explode, and maybe we should work on making them explode less before we fire up the glue factories.

scelerat 6/26/2025|||

I didn't see the post as pissing into the wind so much as calling out several caveats of coding with LLMs, especially on teams, and ideas on how to mitigate them.

ge96 6/26/2025||

It is funny (ego) I remember when React was new and I refused to learn it, had I learned it earlier I probably would have entered the market years earlier.

Even now I have this refusal to use GPT where as my coworkers lately have been saying "ChatGPT says" or this code was created by chatGPT idk, for me I take pride writing code myself/not using GPT but I also still use google/stackoverflow which you could say is a slower version of GPT.

anthonypasq 6/26/2025||

this mindset does not work in software. My dad would still be programming with punchcards if he thought this way. instead he using copilot daily writing microservices and isnt some annoying dinosaur

ge96 6/26/2025||

yeah it's pro con, I also hear my coworkers saying "I don't know how it works" or there are methods in the code that don't exist

But anyway I'm at the point in my career where I am not learning to code/can already do it. Sure languages are new/can help there for syntax

edit: other thing I'll add, I can see the throughput thing, it's like a person has never used opensearch before and it's a rabbithole, anything new there's that wall you have to overcome, but it's like we'll get the feature done, but did we really understand how it works... do we need to? Idk. I know this person can barely code but because they use something like chatGPT they're able to crap out walls of code and with tweaking it will work eventually -- I am aware this sounds like gatekeeping from my part

Ultimately personally I don't want to do software professionaly/trying to save/invest enough then get out just because the job part sucks the fun out of development. I've been in it for about 10 years now, should have been plenty of time to save but I'm dumb/too generous.

I think there is healthy skepticism too vs. just jumping on the bandwagon that everyone else is doing and really my problem is just I'm insecure/indecisive, I don't need everyone to accept me especially if I don't need money

Last rant, I will be experimenting with agentic stuff as I do like Jarvis, make my own voice rec model/locally runs.

pu_pe 6/26/2025||

> While the industry leaping abstractions that came before focused on removing complexity, they did so with the fundamental assertion that the abstraction they created was correct. That is not to say they were perfect, or they never caused bugs or failures. But those events were a failure of the given implementation a departure from what the abstraction was SUPPOSED to do, every mistake, once patched led to a safer more robust system. LLMs by their very fundamental design are a probabilistic prediction engine, they merely approximate correctness for varying amounts of time.

I think what the author misses here is that imperfect, probabilistic agents can build reliable, deterministic systems. No one would trust a garbage collection tool based on how reliable the author was, but rather if it proves it can do what it intends to do after extensive testing.

I can certainly see an erosion of trust in the future, with the result being that test-driven development gains even more momentum. Don't trust, and verify.

lbalazscs 6/26/2025||

It's naive to hope that automatic tests will find all problems. There are several types of problems that are hard to detect automatically: concurrency problems, resource management errors, security vulnerabilities, etc.

An even more important question: who tests the tests themselves? In traditional development, every piece of logic is implemented twice: once in the code and once in the tests. The tests checks the code, and in turn, the code implicitly checks the tests. It's quite common to find that a bug was actually in the tests, not the app code. You can't just blindly trust the tests, and wait until your agent finds a way to replicate a test bug in the code.

acedTrex 6/26/2025|||

> I think what the author misses here is that imperfect, probabilistic agents can build reliable, deterministic systems. No one would trust a garbage collection tool based on how reliable the author was, but rather if it proves it can do what it intends to do after extensive testing.

> but rather if it proves it can do what it intends to do after extensive testing.

Author here: Here I was less talking about the effectiveness of the output of a given tool and more so about the tool itself.

To take your garbage collection example, sure perhaps an agentic system at some point can spin some stuff up and beat it into submission with test harnesses, bug fixes etc.

But, imagine you used the model AS the garbage collector/tool, in that say every sweep you simply dumped the memory of the program into the model and told it to release the unneeded blocks. You would NEVER be able to trust that the model itself correctly identifies the correct memory blocks and no amount of "patching" or "fine tuning" would ever get you there.

With other historical abstractions like say jvm, if the deterministic output, in this case the assembly the jit emits is incorrect that bug is patched and the abstraction will never have that same fault again. not so with LLMs.

To me that distinction is very important when trying to point out previous developer tooling that changed the entire nature of the industry. It's not to say I do not think LLMs will have a profound impact on the way things work in the future. But I do think we are in completely uncharted territory with limited historical precedence to guide us.

bluefirebrand 6/26/2025||

> I think what the author misses here is that imperfect, probabilistic agents can build reliable, deterministic systems

That is quite a statement! You're talking about systems that are essentially entropy-machines somehow creating order?

> with the result being that test-driven development gains even more momentum

Why is it that TDD is always put forward as the silver bullet that fixes all issues with building software

The number of times I've seen TDD build the wrong software after starting with the wrong tests is actually embarassing

cheriot 6/26/2025||

> promises that the contributed code is not the product of an LLM but rather original and understood completely.

> require them to be majority hand written.

We should specify the outcome not the process. Expecting the contributor to understand the patch is a good idea.

> Juniors may be encouraged/required to elide LLM-assisted tooling for a period of time during their onboarding.

This is a terrible idea. Onboarding is a lot of random environment setup hitches that LLMs are often really good at. It's also getting up to speed on code and docs and I've got some great text search/summarizing tools to share.

bluefirebrand 6/26/2025|

> Onboarding is a lot of random environment setup hitches

Learning how to navigate these hitches is a really important process

If we streamline every bit of difficulty or complexity out of our lives, it seems trivially obvious that we will soon have no idea what to do when we encounter difficulty or complexity. Is that just me thinking that?

cheriot 6/28/2025|||

Some people find a solution, think about it, and incorporate that into their understanding of the world.

Some people ask for a coworker to do it for them, c/p stackoverflw, etc and never learn.

AI makes the first group that much more effective. Setup should be a learning process not a hazing ritual.

RunningDroid 6/26/2025||||

> > Onboarding is a lot of random environment setup hitches > > Learning how to navigate these hitches is a really important process

To add to this, a barrier to contribution can reduce low quality/spam contributions. The downside is that a barrier to contribution that's too high reduces all contributions.

kmoser 6/26/2025|||

There will always be people who know how to handle the complexity we're trying to automate away. If I can't figure out some arcane tax law when filling out my taxes, I ask my accountant, as it's literally their job to know these things.

bluefirebrand 6/26/2025||

> There will always be people who know how to handle the complexity we're trying to automate away

This is not a given!

If we automated all accounting, why would anyone still take the time to learn to become an accountant?

Yes, there are sometimes people who are just invested in learning traditional stuff for the sake of it, but is that really what we want to rely on as the fallback when AI fails?

kmoser 6/27/2025||

It's highly unlikely that everybody will flock to LLMs, leaving absolutely nobody capable of stringing together a few lines of code on their own. Some devs may enjoy vibe coding, and may even be more productive that way, but there will always be use cases where it is preferable to produced deterministic code via a human dev.

namenotrequired 6/26/2025|

> LLMs … approximate correctness for varying amounts of time. Once that time runs out there is a sharp drop off in model accuracy, it simply cannot continue to offer you an output that even approximates something workable. I have taken to calling this phenomenon the "AI Cliff," as it is very sharp and very sudden

I’ve never heard of this cliff before. Has anyone else experienced this?

gwd 6/26/2025||

I experience it pretty regularly -- once the complexity of the code passes a certain threshold, the LLM can't keep everything in its head and starts thrashing around. Part of my job working with the LLM is to manage the complexity it sees.

And one of the things with current generators is that they tend to make things more complex over time, rather than less. It's always me prompting the LLM to refactor things to make it simpler, or doing the refactoring once it's gotten to complex for the LLM to deal with.

So at least with the current generation of LLMs, it seems rather inevitable that if you just "give LLMs their head" and let them do what they want, eventually they'll create a giant Rube Goldberg mess that you'll have to try to clean up.

ETA: And to the point of the article -- if you're an old salt, you'll be able to recognize when the LLM is taking you out to sea early, and be able to navigate your way back into shallower waters even if you go out a bit too far. If you're a new hand, you'll be out of your depth and lost at sea before you know it's happened.

windward 6/26/2025|||

I've seen it referred to as 'context drunk'.

Imagine that you have your input to the context, 10000 tokens that are 99% correct. Each time the LLM replies it adds 1000 tokens that are 90% correct.

After some back-and-forth of you correcting the LLM, its context window is mostly its own backwash^Woutput. Worse, the error compounds because the 90% that is correct is just correct extrapolation of an argument about incorrect code, and because the LLM ranks more recent tokens as more important.

The same problem also shows up in prose.

Workaccount2 6/26/2025|||

I call it context rot. As the context fills up the quality of output erodes with it. The rot gets even worse or progresses faster the more spurious or tangential discussion is in context.

This is also can be made much worse by thinking models, as their CoT is all in context, and if there thoughts really wander it just plants seeds of poison feeding the rot. I really wish they can implement some form of context pruning, so you can nip irrelevant context when it forms.

In the meantime, I make summaries and carry it to a fresh instance when I notice the rot forming.

bubblyworld 6/26/2025|||

I've only experienced this while vibe coding through chat interfaces, i.e. in the complete absence of feedback loops. This is much less of a problem with agentic tools like claude code/codex/gemini cli, where they manage their own context windows and can run your dev tooling to sanity check themselves as they go.

Paradigma11 6/26/2025|||

If the context gets to big or otherwise poisoned you have to restart the chat/agent. A bit like windows of old. This trains you to document the current state of your work so the new agent can get up to speed.

Kuinox 6/26/2025|||

I'm doing my own procedurally generated benchmark.

I can make the problem input bigger as I want.

Each LLM have a different thresholf for each problem, when crossed the performance of the LLM collapse.

lubujackson 6/26/2025|||

I definitely hit this vibe coding a large-ish backend. Well defined data structures, good modularity, etc. But at a point, Cursor started to lose the plot and rewrite or duplicate functions, recreate or misue data structures, etc.

The solve was to define several Cursor rules files for different views of the codebase - here's the structure, here's the validation logic, etc. That and using o3 has at least gotten me to the next level.

npteljes 6/26/2025|||

I reset "work" AI sessions quite frequently, so I didn't see that there. I experienced it though with storytelling. In my storytelling scenario, context and length was important. And the AI at one late point forgot how my characters should behave in the developing situation, and just had them react to it in a very different way. And there was no going back from that. Very weird experience.

sandspar 6/26/2025|||

I'm not sure. Is he talking about context poisoning?

impure 6/26/2025|||

This sounds a lot like accuracy collapse as discussed in that Apple paper. That paper clearly showed that there is some point where AI accuracy collapses extremely quickly.

I suspect it has something more to do with the model producing too many tokens and becoming fixated on what it said before. You'll often see this in long conversations. The only way to fix it is to start a new conversation.

Syzygies 6/26/2025||

One can find opinions that Claude Code Opus 4 is worth the monthly $200 I pay for Anthropic's Max plan. Opus 4 is smarter; one either can't afford to use it, or can't afford not to use it. I'm in the latter group.

One feature others have noted is that the Opus 4 context buffer rarely "wears out" in a work session. It can, and one needs to recognize this and start over. With other agents, it was my routine experience that I'd be lucky to get an hour before having to restart my agent. A reliable way to induce this "cliff" is to let AI take on a much too hard problem in one step, then flail helplessly trying to fix their mess. Vibe-coding an unsuitable problem. One can even kill Opus 4 this way, but that's no way to run a race horse.

Some "persistence of memory" harness is as important as one's testing harness, for effective AI coding. With the right care having AI edit its own context prompts for orienting new sessions, this all matters less. AI is spectacularly bad at breaking problems into small steps without our guidance, and small steps done right can be different sessions. I'll regularly start new sessions when I have a hunch that this will get me better focus for the next step. So the cliff isn't so important. But Opus 4 is smarter in other ways.

fwip 6/26/2025|||

Sometimes after it flails for a while, but I think it's on the right path, I'll rewind the context to just before it started trying to solve the problem (but keep the code changes). And I'll tell it "I got this other guy to attempt what we just talked about, but it still has some problems."

Snipping out the flailing in this way seems to help.

suddenlybananas 6/26/2025|||

>can't afford not to use it. I'm in the latter group.

People love to justify big expenses as necessary.

Syzygies 6/26/2025||

$200 is a small expense and you don't know why I need AI.

The online dialog about AI is mostly noise, and even at HN it is badly distorted by people who wince at $20 a month, and complain AI isn't that smart.

More comments...