
Posted by rbanffy 1 day ago

LLMs corrupt your documents when you delegate (arxiv.org)
421 points | 166 comments | page 2
meander_water 21 hours ago|
> We find that models are not failing due to “death by a thousand cuts” (i.e., many small errors). Instead, they maintain near-perfect reconstruction in some rounds, and experience critical failures in a few rounds, typically losing 10-30+ points in a single round trip

> We find that weaker models’ degradation originates primarily from content deletion, while frontier models’ degradation is attributable to corruption of content.

I think we largely already knew this. This is why we fudge around with harnesses and temperature etc.
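The paper's round-trip framing is easy to illustrate: score each round's output against the original document and look for a step-shaped drop rather than gradual drift. A minimal sketch, where `corrupt` is a hypothetical stand-in for one LLM round trip (not the paper's method):

```python
import random
from difflib import SequenceMatcher

def corrupt(text: str, p_fail: float = 0.2) -> str:
    """Hypothetical stand-in for one LLM round trip: usually a
    near-perfect copy, occasionally a large contiguous deletion."""
    if random.random() < p_fail:
        words = text.split()
        k = len(words) // 3  # critical failure: drop a third of the doc
        start = random.randrange(max(1, len(words) - k))
        return " ".join(words[:start] + words[start + k:])
    return text  # near-perfect reconstruction

def per_round_scores(doc: str, rounds: int = 10) -> list[float]:
    """Similarity of each round's output to the original document."""
    scores, current = [], doc
    for _ in range(rounds):
        current = corrupt(current)
        scores.append(SequenceMatcher(None, doc, current).ratio())
    return scores

random.seed(0)
doc = "the quick brown fox jumps over the lazy dog " * 50
print(per_round_scores(doc))
```

A flat curve with a sudden cliff matches the "critical failures in a few rounds" pattern; many small dips would be the "thousand cuts" pattern the paper rules out.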

danielvaughn 20 hours ago||
I've spent the last few months reading a lot of AI-generated code. It's extremely difficult.

It's like how psychopaths are eerie because there's nothing behind their eyes. AI-generated code is eerie because there's nothing between the lines. Code is in some sense theory building, and when you read a human's code you can (mostly) feel their theory working in the background. LLMs have no such theory; the code is just facts strewn about. Very weird experience to try and understand it.

glaslong 19 hours ago||
Thank you. I've had trouble articulating this sense, but it's strong. An uncanny valley.
leptons 18 hours ago||
My company is moving to a workflow where we only write Jira tickets, the LLM writes all the code and submits a PR. Then we are supposed to review the code the LLM wrote.

I'm looking for a new job.

8note 17 hours ago||
that doesn't seem particularly horrible, as long as you as the engineer can still go change things in the code package and surrounding infrastructure to improve the output, and make sure that the agent is actually making the right stuff the first time you see the outputs

eg. setting up better feedback loops, improving CI/CD, breaking changes up at the right scale, etc.

i assume you can also then put in more work up front, doing simulations of solutions, lean proofs, and so on?

more engineering, less plumbing

leptons 13 hours ago|||
The change is turning me from someone who writes pretty good reliable code, to someone who has to read and review pretty bad code. If you think this is an improvement, you're nuts.

It inserts an unreliable middle-man, known for errors and hallucinations, that often just goes down and stops working for reasons we can't control, into a workflow that has worked well for a decade, and we're paying extra to barely break even on the time spent creating new code.

Just because "everyone else is doing it". Not because it's proving to be a boon in productivity.

xstas1 8 hours ago||
Turning you into a "reverse centaur", to borrow a term from Cory Doctorow.
nunez 5 hours ago|||
my (gender neutral) dude.

WAKE UP.

Literally anyone can write a Jira ticket. US engineers are expensive. What do you think will happen when the powers that enacted this policy decide that the ticket to merged into prod rate is acceptable to them?

Art9681 8 hours ago||
Remind yourselves that most research papers are written by career students with no real world practical experience. That is all.
LPisGood 5 hours ago|
Spending some time in and around applied research labs and seeing how poorly the sausage looks before it gets made into a paper is quite distressing.

I’m sure there are labs out there doing excellent work (especially those focused on theory), but most of the applied research I’ve seen up close and personal is very poor indeed.

rmwaite 19 hours ago||
What I find fascinating about LLMs is that a lot of their failures seem strikingly similar to the failures that humans struggle with. I’m not sure what this “means” but I think it’s interesting that we can theoretically fix these failures for LLMs but for humans it is much harder. You pretty much need to educate / indoctrinate people for their entire lives and even then it’s messy and unpredictable and prone to failure—just like LLMs.
peter_retief 5 hours ago||
I am surprised that more people don't talk about this. I once had an SSH key deleted; it was so unexpected it took me a while to debug.

We live and learn.

Still a huge fan though.

charlie90 11 hours ago||
Doesn't this apply to humans as well? That's why children play the game "Telephone" and watch as a message gets corrupted. The solution is to provide a single source of truth.
andrewljohnson 20 hours ago||
LLM editing should be done to produce deterministic output.

That is, the LLM should produce a diff, and the user should accept the diff. It seems like a bad pattern to just tell the LLM to edit any long document without that sort of visibility. Same goes for prose as for code.
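The "model proposes, human applies" pattern above can be sketched with the standard library: instead of accepting a rewritten document wholesale, generate a unified diff between the original and the proposed text and surface it for review. Here `propose_edit` is a hypothetical stand-in for the model call:

```python
import difflib

def propose_edit(text: str) -> str:
    """Hypothetical stand-in for an LLM edit of the document."""
    return text.replace("teh", "the")

original = "teh quick brown fox\njumps over teh lazy dog\n"
proposed = propose_edit(original)

# Show the user a reviewable diff instead of silently replacing the file.
diff_text = "".join(difflib.unified_diff(
    original.splitlines(keepends=True),
    proposed.splitlines(keepends=True),
    fromfile="original",
    tofile="proposed",
))
print(diff_text)
```

The document is only mutated by applying an approved diff, never by trusting the model's full rewrite, so any silent corruption outside the intended edit shows up as an unexpected hunk.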

julianlam 18 hours ago||
I always thought it was a little weird that LLMs aren't sophisticated enough to surgically edit files as needed.

For example, if there is a code block that needs to be wrapped within another function call, it'll rewrite the entire function call and you'll just have to pray that the re-written code block wasn't subtly changed.

I _think_ so far it hasn't introduced any changes....

andrewljohnson 18 hours ago||
You can just look at the diff when you do a pull request, no prayer needed, and if you want it to be “surgical” in that way, your prompt (and agents.md) can be specific.

You can also unit test the function to better assure behavior didn’t change.

julianlam 14 hours ago||
Indeed, that's what I do. I inspect the diff, though if it's an indentation change the entire block will be marked changed.

Still not an excuse to not read every line of course...

Unit tests give me the confidence that at least those tested logic paths are unaffected.

Sometimes with older codebases one cannot assume the paths have adequate test coverage.
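For the indentation problem specifically, you can separate layout-only hunks from real changes by diffing stripped lines; a minimal sketch of the idea (the same one behind `git diff -w`):

```python
import difflib

before = [
    "def total(xs):",
    "    s = 0",
    "    for x in xs:",
    "        s += x",
    "    return s",
]
after = [
    "def total(xs):",
    "  s = 0",        # indentation changed
    "  for x in xs:",  # indentation changed
    "      s += x",    # indentation changed
    "  return s",      # indentation changed
]

def real_changes(a: list[str], b: list[str]) -> list[str]:
    """Diff with surrounding whitespace stripped, so
    indentation-only edits produce no hunks."""
    return list(difflib.unified_diff(
        [line.strip() for line in a],
        [line.strip() for line in b],
        lineterm="",
    ))

print(real_changes(before, after))  # → [] : only whitespace moved
```

This is whitespace-insensitive, so it will also hide a change that is semantically meaningful in indentation-sensitive languages like Python; use it as a triage filter, not as the final review.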

alansaber 18 hours ago||
This gets skipped because continual approvals break up user flow, so we let LLMs make a few-hundred-line diff and then the user does a bulk review, and can revert all or part of it. It's naive to assume the user will review every LOC in every instance.
andrewljohnson 18 hours ago||
I’m fine with bulk review, it just has to get reviewed before a merge. You don’t need to review the LLM output as you work except as it aids you to work.
tmaly 17 hours ago||
When AI generates code, we have the ability to easily verify it and test it.

The same is not so easy with free form text. I have been thinking about this mainly around when agents write plans or edit plans, but I think figuring out how to do this in general would be a huge breakthrough.

Logical English was one idea I came across and Runcible https://runcible.com/ was another idea I recently stumbled on.

enrique_mendez 10 hours ago||
I'm making tools for fighting this kind of degradation: https://github.com/JigSpec/JigSpec
pickleRick243 11 hours ago|
With this paper by Microsoft and the infamous paper by Apple last year, it seems the tech giants that don't have their own models are getting a bit insecure.