LLMs corrupt your documents when you delegate

Posted by rbanffy 1 day ago

LLMs corrupt your documents when you delegate(arxiv.org)

433 points | 171 commentspage 4

cyanydeez 1 day ago|

I played around with a local LLM to try and build a wiki like DAG. It made a lot of stupid errors from vague generic things like interpreting based on file names to not following redirects and placing the redirect response in them.

I've also had them convert to markdown something like an excel formatted document. It worked pretty well as long as I was examining the output. But the longer it ran in context, the more likely it was to try in slip things in that seemed related but wasn't part of the break down.

The only way I've found to mitigate some of it is to make every file a small-purpose built doc. This way you can definitely use git to revert changes but also limit the damage every time they touch them to the small context.

Anyone who thinks they're a genius creating docs or updating them isnt actually reading the output.

sebastiennight 1 day ago|

> I've also had them convert to markdown something like an excel formatted document.

This look like a task where the LLM would be best used in writing a deterministic script or program that then does the conversion.

Trusting a LLM to make the change without tools is like telling the smartest person you know to just recite the converted document out loud from memory. At some point they'll get distracted, wrong, or unwittingly inject their own biases and ideas into it whenever the source data is counter-intuitive to them.

trollbridge 1 day ago|||

I see people cut and paste from Excep into a chat, as an image, and ask it to sum up numbers.

somewhatgoated 23 hours ago||

I’ve seen people drink their own recycled piss and inject coffee into their ass - what’s your point?

sebastiennight 20 hours ago||

In the first half, I thought you were an astronaut, but the second half has me double-guessing myself.

somewhatgoated 16 hours ago||

I used to be a connoisseur of weird Facebook groups - I would advise everyone to never look into aged urine, coffee enemas or targeted individuals - makes you lose your faith in humanity

cyanydeez 1 day ago|||

it was, but the formatting was garbage so it ran again to fix thw format.

threethirtytwo 23 hours ago||

This experiment needs to be put in perspective. Let me explain. IF you did this SAME experiment with a human and had a human read an ENTIRE document and then reproduce said document with edits. The DOCUMENT would DEGRADE even more.

The way this experiment is conducted is not inline with how current agentic AI is used OR how even humans edit documents.

Here's how agentic AI currently typically do edits:

1. They read the whole document. 2. They come up with a patch. A diff of the section they want to edit. 3. They change THAT section only.

This is NOT what that experiment was doing. A 25% degradation rate would render the whole industry dead. No one would be using claude code because of that. The reality is... everyone is using claude code.

AI is alien to the human brain, but in many ways it is remarkably. This is one aspect of similarity in that we cannot edit a whole document holistically to produce one edit. It has to be targeted surgical edits rather then a regurgitation of the entire document with said edit.

hedora 19 hours ago||

Benjamin Franklin famously taught himself to write well by doing what you describe: Read a piece of a book, then rewrite it, then compare.

At first his copies were badly degraded. Eventually, he was considered one of the best writers of his time.

I feel like there's probably some way "the copy is better" could be quantified (at least to the point where it fools most of the people most of the time). If so, then expect LLMs to learn the same trick within a generation or two.

jrflowers 18 hours ago|||

>IF you did this SAME experiment with a human and had a human read an ENTIRE document and then reproduce said document with edits. The DOCUMENT would DEGRADE even more.

I like the idea that imagining somebody doing something in a way that nobody does it because it makes no sense for a person to do it like that is helpful here. It is like

IF you made a human eat an ENTIRE IHOP™ Chicken Fajita Omelette in one bite they would CHOKE and the OMELETTE would go UNDIGESTED. It would get everywhere and the OMELETTE would be RUINED.

threethirtytwo 17 hours ago||

That's the point bro. I am saying this Experiment makes no sense.

Humans don't do that. And Claude doesn't edit documents like that. Because it makes no sense. The point is saying that the Experiment itself is not helpful here.

jrflowers 17 hours ago||

It is, in fact, pretty common for people to throw a document at a language model along with a “make it more gooder” prompt.

threethirtytwo 13 hours ago||

That was true maybe 7 months ago. This is no longer the case. Harnesses use all kinds of tooling to edit things now.

rune-dev 13 hours ago|||

I think you’re living in a bubble if you think the average user of AI even knows what a harness is

The vast majority of people are literally going to chatGPT, pasting in their document and asking for edits.

threethirtytwo 11 hours ago||

This will change too man. Maybe I am in a bubble but with how fast things are changing, it won’t be too long before the bubble becomes reality.

Either way we should be doing experiments on the actual capabilities of AI not about the stupidest possible way to use AI because it helps validate your own negative bias against AI.

Additionally as software engineers using agentic AI… which HN basically is… this experiment is not at all relevant in the context of where it is posted. We ALL use agentic ai and we all have the agent use surgical tools for editing. Don’t you find it strange that despite the fact we all do this, HN is full of rabid engineers gobbling this paper up as validation despite complete lack of relevance?

rune-dev 3 hours ago|||

First off, It’s good to study all kinds of things isn’t it? Even if it’s not strictly practical.

Second, and more importantly these AI tools are EVERYWHERE right now. The effects of people using them for work can be seen throughout many industries and workplaces.

So I think studying how these models perform in the vast majority of use cases is not only a good idea, but it’s actually really important.

Even if you’re strictly pro-AI and believe it is the future, a study like this can help you explain to laymen why they need the harnesses you’re so in support of.

jrflowers 9 hours ago|||

> This will change too man. Maybe I am in a bubble but with how fast things are changing, it won’t be too long before the bubble becomes reality.

You can’t get mad at an experiment for not happening in the future.

> Either way we should be doing experiments on the actual capabilities of AI

They simulated common end user behavior

>because it helps validate your own negative bias against AI.

We’ve gone from “this study is flawed because language models don’t do that” to “this study is flawed because while language models do do that, I don’t think that they will in the future” to “data that could support a bias other than my own is bad”

threethirtytwo 6 hours ago||

> You can’t get mad at an experiment for not happening in the future.

I’m more getting mad at this sentence not making any sense. I’m disappointed at this experiment for not testing the actual capabilities of an LLM. Comprende?

> They simulated common end user behavior

Not the way you use it. And not the way it will be used.

You love it because you want it to stay this way so you can forever believe AI will never be better than you.

Bro the reality is unfolding as you speak. It’s like humanity just discovered guns but hasn’t discovered the bullets and your saying guns are useless because most of humanity hasn’t figured out bullets yet.

> We’ve gone from “this study is flawed because language models don’t do that” to “this study is flawed because while language models do do that, I don’t think that they will in the future” to “data that could support a bias other than my own is bad”

This is a flat out lie. Models DO do that. The only fucking argument you have is that non technical and average laymen people edit documents the wrong way while all people who use agentic AI as adepts use it the correct way. Like are you fucking kidding me?

The only change I acknowledge is your grandma copies and pastes essays into ChatGPT while YOU don’t. You go pretend you live in that reality where the bullets will never appear.

jrflowers 4 hours ago||

>You love it because you want it to stay this way so you can forever believe AI will never be better than you.

>Bro the reality is unfolding as you speak

>You go pretend you live in that reality where the bullets will never appear.

It’s too late bro, roko’s basilisk was real and it’s already punishing you

jrflowers 12 hours ago|||

People paste entire documents into gemini and chat gpt’s text boxes on the web and assume it will all turn out great

edit: apparently got beaten to this

threethirtytwo 5 hours ago||

I don’t understand you. We have an AI model. The AI model is obviously capable.

But you want to use pretend that it’s not useful because non technical people haven’t figured out how to properly use it yet?

Do you think that’s a valid argument? This article is making a claim of 25 percent degredation. Do you think that claim is true because a lot of people don’t use it right?

Humans have 99 percent degredation when editing one punctuation point of an entire book when regurgitating that entire book just to change one punctuation point. Does this statement sound reasonable to you? Because that is the statement you and your genius interloper into this thread are standing behind. Just replace human with LLM and it’s the same kind of genius logic.

leptons 20 hours ago||

>IF you did this SAME experiment with a human and had a human read an ENTIRE document and then reproduce said document with edits. The DOCUMENT would DEGRADE even more.

Except that isn't how humans edit documents, and it isn't how LLMs work either.

When a human edits a document, they don't typically "reproduce said document with edits", which I assume you mean read the document and reproduce it from memory. They have the document, either physically printed out, or in a word processor. To make edits they either cross-out and write in the edit, or in a word processor just delete the text and replace it with something better. There's no need to keep the entire document in a human's memory for them to reproduce it from memory.

The same goes for the LLM, it has access to the original document at all times. It can remove sections and replace them.

But the LLM hallucinates.

And if you give a document to a human high on LSD to edit, you might get some weird edits back.

threethirtytwo 17 hours ago||

>Except that isn't how humans edit documents,

Bro. That's my point.

>and it isn't how LLMs work either.

This is also my point. To be more technical about it, the harness around the LLM pushes it to do surgical edits rather then regurgitation, so my point is this experiment is garbage and testing an impractical and rarely used use case.

>When a human edits a document, they don't typically "reproduce said document with edits", which I assume you mean read the document and reproduce it from memory.

No shit sherlock. The point of that sentence was to illustrate the absurdity of doing that which in turn illustrates the absurdity of this scientific paper. You're kind of lost.

bigstrat2003 21 hours ago||

We don't need a study to tell us that LLMs always make mistakes. We already knew that. Anyone with sense is not using LLMs because of that.

chris_explicare 1 hour ago||

[flagged]

toshikatsu-oga 2 hours ago||

[flagged]

Ozzie-D 4 hours ago||

[dead]

30030 6 hours ago||

[flagged]

30030 5 hours ago||

[flagged]

GhostDriftInc 14 hours ago||

[flagged]

xiaosong001 12 hours ago|

[flagged]

More comments...