Posted by rbanffy 1 day ago
It's unsurprising that round-tripping long content through an LLM results in corruption. Frequent LLM users already know not to do that.
They claim that tool use didn't help, which surprised me... but they also said:
> To test this, we implemented a basic agentic harness (Yao et al., 2022) with file reading, writing, and code execution tools (Appendix M). We note this is not an optimized state-of-the-art agent system; future work could explore more sophisticated harnesses.
And yeah, their basic harness consists of read_file() and write_file() - that's just round-tripping with an extra step!
The modern coding agent harnesses put a LOT of work into the design of their tools for editing files. My favorite current example of that is the Claude edit suite described here: https://platform.claude.com/docs/en/agents-and-tools/tool-us...
The str_replace and insert commands are essential for avoiding round-trip risky edits of the whole file.
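As a rough sketch of why that matters (my own illustration, not Anthropic's actual implementation), a str_replace-style tool boils down to something like:

```python
from pathlib import Path

def str_replace(path: str, old: str, new: str) -> str:
    """Replace exactly one occurrence of `old` with `new`, leaving the rest of the file untouched."""
    text = Path(path).read_text()
    count = text.count(old)
    if count == 0:
        return "Error: old string not found; re-read the file and try again."
    if count > 1:
        return f"Error: old string matched {count} times; include more surrounding context."
    Path(path).write_text(text.replace(old, new, 1))
    return "OK"
```

The untouched remainder of the file never passes through the model's output, so there's nothing for it to corrupt.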
They do at least provide a run_python() tool, so it's possible the better models figured out how to run string replacement using that. I'd like to see their system prompt and if it encouraged Python-based manipulation over reading and then writing the file.
Update: found that harness code here https://github.com/microsoft/delegate52/blob/main/model_agen...
The relevant prompt fragment is:
> You can approach the task in whatever way you find most effective: programmatically or directly by writing files
As with so many papers like this, the results reflect more on the design of the harness the authors used than on the models themselves. I'm confident an experienced AI engineer / prompt engineer / pick your preferred title could get better results on this test by iterating on the harness itself.
> Frequent LLM users already know not to do that.
And I think that's the biggest problem. Amid the current push to utilize LLMs across orgs and groups, there's a large share of people (maybe even a majority) who use them every day but have never approached anything as technical as a "harness" before, let alone an entire setup.
For them the behavior mentioned here is a major issue.
So the reasonable response to being told you're holding your scissors wrong is to realize that yes, you most likely are holding your scissors wrong[0], and ask the other person for advice (or just to do the thing), or look up a YouTube video and learn, or sign up to a class, or such.
Expecting mastery in 30 seconds is not a reasonable attitude, but it's unfortunately the lie that the software industry has tried to sell people for the past 15 years or so.
--
[0] - There's much more to it than one would think.
I don’t have an example off hand, but I know that it’s easy to dismiss something an LLM does as trivial if your work is extremely marginal. Most devs aren’t creating their own programming languages. I can’t help but think people who hold this opinion also think the work most software professionals do is “trivial” (“you’re just moving strings around, that’s not impressive/trivial”)
A lathe operator isn’t any good if they don’t frequently operate lathes.
An articulated robot implementer needs frequent experience implementing robots to be any good.
That doesn’t mean lathes or robots are useless. Nor does it mean they have failed as products because they require expertise.
I do think it raises questions as to whether vast swathes of the population will be effective at using LLMs. Are they scissors, or a lathe?
To me learning to use LLMs is the same as doing anything else, you have to practice and put in the hours to get good. Maybe some harnesses will eventually allow LLMs to function more as scissors than lathes. This seems to be what Microsoft is trying to do by embedding Copilot in all their products and saying “choose the UI that works best for you”. If that doesn’t end up working we’ll need another paradigm for “non-technical” users to effectively operate computer assistants
What does one do when a full editor consumes too much bandwidth^H tokens? Use ed, the standard editor!
I think this is closely related to other sources saying that even with a huge context window, the attention mechanism doesn't reliably reference back to earlier material, so any task that depends on a bigger context is prone to errors.
Because I have some preconceptions about this, maybe I'm just assuming that's what they were saying. Am I missing something?
Also, as a person who has been developing agentic code tools since before Claude Code, I'm skeptical that str_replace provides an accuracy improvement over just doing a full rewrite.
Back in the day, when SOTA models would do lazy coding like `// ... rest of the code ...`, full rewrites weren't easy. Search/replace was fast, efficient, and free of the lazy coding. However, it came with a slight accuracy drop.
Today that accuracy drop might be minimal/absent, but I'm not sure if it could lead to improvements like preventing doc corruption.
They've been decent at full rewrite for 2 years. I don't think they were good at search/replace until a year ago, but I'm not so sure.
It's true that the models 2 years ago would sometimes make errors in whole rewrite - e.g removing comments was fairly common. But I've never seen one randomly remove one character or anything like that. These days they're really good.
Main reason agentic harnesses use search/replace is speed and cost, surely! Whole file output is expensive for small changes.
This team is inexperienced and it shows.
The noise to signal ratio will get worse, even in "academia". Brace yourselves. The kids are growing up in this new world.
On editing tasks, one should only allow programmatic editing commands; the text shouldn't flow through the LLM at all. The LLM should analyze the text and emit commands to achieve a feedback-directed goal.
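A minimal sketch of that pattern (my own illustration, with a made-up command format): the model emits structured edit commands and a plain script applies them, so the document text never round-trips through the model's output.

```python
import json

def apply_commands(text: str, commands_json: str) -> str:
    """Apply model-emitted edit commands to a document deterministically."""
    for cmd in json.loads(commands_json):
        if cmd["op"] == "replace":
            assert text.count(cmd["old"]) == 1, "ambiguous or missing target"
            text = text.replace(cmd["old"], cmd["new"], 1)
        elif cmd["op"] == "insert_after":
            assert text.count(cmd["anchor"]) == 1, "ambiguous or missing anchor"
            text = text.replace(cmd["anchor"], cmd["anchor"] + "\n" + cmd["text"], 1)
    return text

# The model's entire output for an edit might be just:
# [{"op": "replace", "old": "recieve", "new": "receive"}]
```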
The fact of the matter is, if you want to edit a document by reading the document and then regurgitating the entire document with said edits... a human will DO worse than a 25% degradation. It's possible for a human to achieve 0% degradation, but the human will have to ingest the document hundreds of times to achieve a state called "memorization". The equivalent in an LLM is called training. If you train a document into an LLM, you can get parity with the memorized human edit in this case.
But the above is irrelevant. The point is LLMs have certain similarities with humans. You need to design a harness such that an LLM edits a document the same way a human would: Search and surgical edits. All coding agents edit this way, so this paper isn't relevant.
OR it could be because their concerns are genuine but are ignored in favour of a good sounding story.
So that is definitively a biased interpretation. This is independent of how accurate my POV or your POV is on whether LLMs degrade documents. I am simply saying the experiment conducted is COMPLETELY DIFFERENT from how LLMs AND humans edit papers.
As I was reading this article, a similar thought occurred to me: "I wonder if that's better or worse than a human?" Unfortunately, there was no human baseline in this study. That said, there are studies that compare LLM to human performance. Usually, humans perform much better (like 5-7x better) at long-running tasks.
In other words, a human would probably do better than an LLM on this task.
Humans lose to LLMs in narrow, well-specified text/symbolic reasoning tasks where the model can exploit breadth, speed, and search. Usually, the LLM performed ~15% better than humans, but I saw studies that were as high as 80%. To my surprise, these studies were usually about "soft skills" like creativity and persuasion.
Show your edit by regurgitating this entire thread by hand on paper. Don't use any additional tools like find and replace.
Boom there's your baseline. I can simulate the result in my head.
Guys, I'm basically saying the experiment is inaccurate to the practical reality of how LLMs are actually used.
Most LLM users who are not touching code are certainly not going to be using a harness. They're going to take all the documents, slam all those tokens into the context window, see they have only used 500k out of their 1M tokens and say "summarize".
1. It's an older model.
2. You're prompting it wrong.
3. That's not what LLMs are for.
4. We knew that already.
It's as if LLMs have no limitations, which of course goes completely against number 1 in the list above, because if LLMs have no limitations, then how are newer models better and why are AI companies constantly releasing new versions?
But the debate has taken on an insidious identitarian character: it's no longer about understanding a technology, its strengths, its limitations, what makes it tick. It's a fractious internet fight between crowds of users who have attached this or that opinion to their very internet persona and will not budge from their entrenched positions.
That is basically the death of curious debate. Obviously there's no point in discussing any research under those conditions: good or bad, flawed or not, we're just not going to get any signal out of the noise on HN anymore.
"Semantic ablation" is my favorite term for it: https://www.theregister.com/software/2026/02/16/semantic-abl...
You can somewhat mitigate this, at the same moment you ask for the new edit, by adding new info or specifying the lost meaning you want to add back. But other things will still get washed out.
Nuances will drift, sharp corners will be ablated. You're doing a Xerox copy of your latest Xerox copy, so even if you add your comments with a sharpie, anything that was there right before will be slightly blurrier in the next version.
Often that thinking itself provides value to the person doing it, beyond the text itself. By letting an LLM do it for you, you rob yourself of that chance to think and of the new findings you may encounter.
Working with LLMs just makes it quicker to get going, but you need to be a ruthless editor.
Occasionally it would report the action, sometimes it would not bother to report it. It never reached into the README on an unrelated doc edit, but if it was touching the README, that line was getting excised.
They are essentially like that one JPEG meme, where each pass of saving as JPEG slightly degrades the quality until by the end it's unrecognizable.
Except with LLMs, the starting point is intent. Each pass of the LLMs degrades the intent, like in the case of a precise scientific paper, just a little bit of nuance, a little bit of precision is lost with a re-wording here and there.
LLMs are mean-reversion machines: the more 'outside of their training' the context/workload they are currently dealing with, the more they will tend to gradually pull it into some homogeneous, abstract equilibrium.
Quality is really important to me in its own right, but I also worry about this exact "repeated compression" problem: when my codebase is clean and I have an up-to-date mental model, an LLM can quickly help me churn out some feature work and still leave the codebase in a reasonable state. But as the LLM dirties up the codebase, its past mistakes or misunderstandings compound, and it's likely to flub more and more things. So I have to go back and "restore" things to a correct state before I feel comfortable using the LLM again.
My takeaway from this is that AI is a temporary phenomenon, the end stage of the Internet age. It's going to destroy the Internet as we know it, as well as much of the technological knowledge of the developed world, and then we're going to have to start fresh and rebuild everything we know. So I'm trying to use AI to identify and download the remaining sources of facts on the Internet: the human-authored stuff that isn't generated for engagement but comes from the era when people were just putting useful stuff online to share information.
[1] https://en.wikipedia.org/wiki/Model_collapse
[2] https://www.nature.com/articles/s41586-024-07566-y
[3] https://cacm.acm.org/blogcacm/model-collapse-is-already-happ...
1. Prototype
2. Initial production implementation
3. Hardening
My experience with LLMs is that they solve “writer’s block” problems in the prototyping phase at the expense of making phases 2+3 slower because the system is less in your head. They also have a mixed effect on ongoing maintenance: small tasks are easier but you lose some of the feel of the system.
And indeed for me, the biggest productivity boost has nothing to do with my "typing speed" or any such nonsense, it's that it can help with writer's block and other kinds of unhelpful inertia.
It kind of reminds me of ADHD medication: it alleviates the "inability to direct attention at one thing" problem, but actually exacerbates the "time blindness" and "hyperfocus" problems.
I think probably a lot of complex tools have these characteristics: useful in some ways, liable to backfire in others, and ultimately context-sensitive (and maybe somewhat unpredictable) in their helpfulness.
Hopefully as LLMs are more widely experimented with by developers, the conversation can continue to move away from thinking about the effects of LLM use in terms of some uniform/fungible "productivity" and towards understanding where it hurts and where it helps, how to tell when it's time to put it away, what kinds of codebases are really hurt by that kind of detached engagement, and what kinds of projects leverage that sort of rapid prototyping the most effectively.
Plausible text generation is an almost magical trick, whether it's generating human language or computer code. But it turns out it's not a silver bullet, no matter how impressive the trick is. It's more interesting than a silver bullet, in fact: it's a system of surprising tradeoffs, even for different phases of the same overall task.
#1 -> #2 is a gap, but it also helps if you ask the LLM to explain its thinking and generate a human-readable design-doc of the approach it took and code organization it used. Then you read the design doc to gain the context, and pick up with #2.
This blog post probably covers my exact headache [0]. In summary, "Etc/GMT+6" actually means UTC-6. I was developing a one-off helper script to bulk-create calendars in a web app via its API, and when trying to validate my CSV + Python script's results, I kept getting confused about when the CSV rows had correct data and when the web app UI did. The LLM probably wrote the Python script in a way that translated this on the fly, but my human-readable "Calendar name" column, which had "Etc/GMT+6", would end up as -6 in the web app. This probably would not have been a problem with explicit locations specified, but my use case would not allow for that.
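For anyone who hasn't hit this before, the sign inversion is easy to verify (a quick sketch using Python's standard zoneinfo module, not taken from the script itself):

```python
from datetime import datetime
from zoneinfo import ZoneInfo  # Python 3.9+

# Counterintuitively, the POSIX-style zone "Etc/GMT+6" is UTC-6, not UTC+6.
dt = datetime(2024, 1, 1, 12, 0, tzinfo=ZoneInfo("Etc/GMT+6"))
print(dt.utcoffset())                  # -1 day, 18:00:00  (i.e. minus 6 hours)
print(dt.astimezone(ZoneInfo("UTC")))  # 2024-01-01 18:00:00+00:00
```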
When it was trying to debug whether something was wrong, the thinking trace kept going in loops trying to figure out whether the "problem" was coming from my directions, the code's bugs, or the CSV having incorrect data.
Learning: when facing problems like this, try the well-known "notepad file" method to track them, so that if the over-eager LLM starts applying quick code fixes – even though YOU were the source of the "problem" – it will be easier to undo or clean up code that was added to the repository during a confusing debug session. For me, it has been difficult to separate "code generated as genuinely more resilient improvements" from "code generated during debugging that quietly changed some specific step of the script".
(Do note that I am not an advanced software engineer, my practices are probably obvious to others. My repos are mainly comprised of sysadmin style shell/python helper code! :-) )
[0]https://blacksheepcode.com/posts/til_etc_timezone_is_backwar...
Yeah, I have definitely hit this as well. Sometimes I've named a function or variable in a way that misuses a term or concept, or I've changed what something does without fully thinking it through. The LLM sees that code, notices an inconsistency, and makes a guess about what I meant. But because I screwed up, only I know what I really meant (or what I "should have meant"). So the LLM ends up writing a fix that breaks assumptions made in other parts of the code— assumptions that fit with my overall original mental picture, but not the misnomer the LLM got snagged on. Or it writes a small-scoped fix but the mistake of mine it stumbled upon actually merits rethinking and redesigning how some parts interact, so even if its fix is better than what I had before, I want to unwind that change so I can redefine my interfaces or whatever.
That's definitely worth calling out: it's not only the LLM's mistakes that make it more likely to commit future mistakes. Any mistakes in the codebase can compound like that. If you want an LLM to do useful work for you, it's more relevant than ever to "tidy first".
Moving a file is a bit easier. LLMs may sometimes try to recite the file from memory. But if you tell them to use "git mv" and fix the compiler errors, they mostly will.
Ordinary editing on the other hand, generally works fine with any reasonable model and tool setup. Even Qwen3.6 27B is fine at this. And for in-place edits, you can review "git diff" for surprises.
I don't let AI touch git anyway, and I always review the diff after it generated stuff. If it modifies my documentation, I always want to check if it messed with the text instead of just added formatting.
I use Magit, and up until I started using LLM agents it was mostly a nice-to-have that I relied on casually. (I was definitely under-utilizing its power.) But for reviewing, selectively staging, and selectively rejecting the changes of an LLM agent? I feel like I'd die without it. Idk how others manage.
The LLM will come up with stupid ways to do things, common sense doesn’t exist for AI.
Latest big change is probably how feasible local models are becoming, like Qwen 3.6 and Gemma 4, they're no longer easily getting stuck in loops and repetition, although on lower quantizations they still pretty much suck for agentic usage.
I think it’s always been obvious where an LLM could be used effectively and where it cannot, if you understand how they work and don’t see them as magical.
The “increase in proficiency” is mostly people coming back to reality and being more intentional about LLM usage. There are no surprise discoveries here. One does not need to use an LLM a lot to get effective with them. A total noob could become effective on day 1 with proper guidance.
So e.g., don't use an LLM to call an API to gather data and produce a report on it, as that's feeding deterministic data through a "bullshit" layer, meaning you can't trust what comes out the other side. Instead use the LLM to help you write the code that will produce a deterministic output from deterministic data.
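To make that concrete (my own sketch; the file name and record schema are hypothetical), the LLM's job is to help write a script like this once, after which the report is just arithmetic on the data:

```python
import json
from collections import Counter

# Hypothetical schema: a dump of API records like {"status": ..., "amount": ...}.
# The numbers in the report come from the data itself, not from a model paraphrasing it.
with open("api_response.json") as f:
    records = json.load(f)

by_status = Counter(r["status"] for r in records)
total = sum(r["amount"] for r in records)

print(f"{len(records)} records, total amount: {total}")
for status, count in by_status.most_common():
    print(f"  {status}: {count}")
```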
I've seen co-workers use LLMs to summarise deterministic data coming from APIs and have reports be wildly off the mark as often as they are accurate. Depending on what they're looking at that can have catastrophic risk.
However, there's a reason pre-computing bureaucracies came with paper trails and written-up meeting minutes, and a reason court cases are increasingly cautious about the reliability of eyewitnesses.
It is ironic: the more AI becomes like us and the less it acts like a traditional computer program, the worse it is at many of the things we want to use it for. But because collectively we're oblivious to our cognitive limitations, we race into completely avoidable failures like this.
This was the comment I was coming in to make: I worked in a pre-computing bureaucracy (the U.S. Navy's) and "staff you delegated work to have consistent trouble following the directions you provide for the delegated work" is just a fact of life.
A lot of it is the telephone game, a lot of it is lack of real familiarity with office software, and a lot of it is the inherent integration challenge of sending the same document out for coordination to dozens of stakeholders.
All those mistakes you made fixes for based on comments in the draft that went out for O-6 review? At least 2 will pop up again at 1-star review because staffers will copy the same text back out from their local copy they had stashed during O-6 review rather than re-reviewing from scratch.
Style guidance to meet the Admiral's preferred format? You can provide it but there's not a chance they'll follow it, formatting is for humanities majors so you'll need to catch and fix all that yourself.
That's not to say the LLMs are foolproof or magically always correct, but a lot of these style of criticisms apply just as much, if not more, to the current status quo. I don't need LLMs to be perfect, I just need them to be better than the current alternatives.
Building structures of dependencies, the interface between each pair seems to collapse to the lesser of the two. So there's a ton of work right now going into TLA+, structured I/O, etc., to force even a bit of reliability back into the LLM/program boundaries - to have any hope of chaining multiple LLM dependencies in a stack without the whole thing toppling chaotically.
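One everyday version of that boundary hardening (my own sketch, assuming pydantic v2 and a made-up schema, not anything from the comment above) is to validate everything crossing the LLM/program interface and reject what doesn't fit:

```python
from pydantic import BaseModel, ValidationError

# Hypothetical extraction schema; the point is the boundary check, not the fields.
class Extraction(BaseModel):
    title: str
    year: int
    confidence: float

def parse_llm_output(raw_json: str) -> Extraction | None:
    """Only let schema-valid data cross the LLM/program boundary."""
    try:
        return Extraction.model_validate_json(raw_json)
    except ValidationError:
        return None  # reject rather than propagate malformed output downstream
```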
I experienced this with resume editing. The LLM removes everything that differentiates my resume from a pile of junior engineers with “average” experience. Anything that was special or unique or different was eventually replaced with generic stuff
Of course I didn’t use what it produced, but it was maddening because the LLM kept insisting this was better than what I had.
I found LLMs to be much more useful in suggesting edits to very small chunks of my resume (a sentence or three) rather than the overall vision of the document.
I came to your blog to read what you had to say. Why are you writing a blog if you aren’t even going to write it?
The fact of the matter is, humans don't edit things the way it was done in the paper, and neither do coding agents like Claude. Think about it: you do not ingest an entire paper and then regurgitate that paper with a single targeted edit... and neither do coding agents.
Also think carefully. A 25% degradation rate is unacceptable in the industry. The AI change that's taking over all of SWE development would not actually exist if there was 25% degradation... that's way too much.
The whole point of creating software to do things used to be getting things done more accurately and consistently.
"More accurately and consistently" was merely downstream from what capabilities were natural for machine logic and hard algorithms.
Now, we're just spoiled for choice. We have hard-algorithm software where we want to do things that benefit from accurate, consistent, highly deterministic behavior - and we have soft-algorithm AI for when we want to do things that simply aren't amenable to hard logic.
Machine translation used to be a horrid mess when we were trying to do it with symbolic systems. Because symbolic systems are "consistent, highly deterministic" but not at all "accurate" on translation tasks. Being able to leverage LLMs for that is a generational leap.
If you differentiate between AI source code and engineer source code, say so. "Getting things done" is a business need. Which things get translated to a deterministic language executable by a computer is code.
There are entire languages dedicated for lesser engineers/domain experts to formulate business requirements.
Anyhow, what's your point? That we received a framework for "soft algorithms" where the output does not need to be correct or deducible? What's even the point of putting it into software? Just forward your input to the reader and let them judge on their own.
It all comes down to hard logic eventually, but that "eventually" has teeth. None of the interesting behaviors of AI systems live in "engine.py".
My point is: there are tasks where the choices are to use AI, use a meatbag, or suck forever. The "use AI" option is going to be flawed, and often in the same ways "use meatbag" is. But it's going to be cheaper, much more scalable, and a lot better than "suck forever". Humanlike flaws are the price you pay for accessing humanlike capabilities.
And I know this because I see it all the time. I use composer-2 and sonnet 4.6 on a regular basis. It's not much better for my colleagues who use Opus or GPT or any of the other frontier models. Most of the time it's fine, but other times it does things simply unforgivable for a human. I have to watch the agent closely so that it doesn't decide to nuke my database; I don't have to do that with any of my juniors, even those with little experience and poor discipline.
> I don’t have to do that with any of my juniors…
For some values of "nuke," I absolutely have had to do that with juniors in the past. Perhaps you're referring to a single rm -r or a hilarious force push or something, but undertrained and unsupervised juniors regularly introduce things like SQL injection, XSS, etc. simply because they don't know any better yet. This isn't saying "AI is better across the board" - I just don't think they're comparable, and I also think AI shouldn't be used to chop the bottom 5 rungs off our career ladder. But let's not pretend juniors can be left alone with a codebase without any worries.
That's it. Once you look at everything through this lens, everything makes sense - especially the fact that there is no underlying understanding of reasoning and creativity. I don't care what boosters say.
If you can come up with a way to do math without reasoning, that would be, in a sense, even more interesting than AI.
A calculator is different because it is not probabilistic; it executes a fixed procedure. One of these models, when doing math, is more like a learned probabilistic system that understands enough structure around mathematics that some of its high probability trajectories seem like genuine reasoning.
The difference is that when a human reasoner goes to solve a problem, they'll think "this kind of proof usually goes this way" - following explicit rules. The model may produce the same output, and may even appear to approach it the same way, but the mechanism is probabilistic pattern selection rather than explicit rule enforcement.
How is this different from "probabilistic pattern selection"?
Perhaps it's best if most admit they don't have the fundamental ways of thinking to even participate in the conversation.
When all nuance is lost, the discussion must end.
Logic is just syntactic manipulation of formulas. By the early 90s logical reasoning was pretty much solved with classical AI (the last building block being constraint logic programming).
If so, what exactly would you call the process by which the intelligent human solves the math problem that he or she does not initially understand?
Whatever you call that process is what a reasoning model does. You don't have to call it "reasoning," of course... unless you want other people to understand what you're talking about.
It's the default, and if we're lucky we harness pieces of it to discern something we're interested in.
What has worked well in practice is giving the agent a directory and telling it to make independent markdown files for the facts/findings it locates - with each file having front-matter for easy searchability.
This de-complects most tasks from "research AND store iteratively in a final document format" into the more cohesive tasks of "research a set of facts and findings which may be helpful for a document" and "assemble the document".
Only a partial mitigation, but I find it leads to more versatile re-use of findings, same as if a human were working.
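For what it's worth, a findings file under that scheme might look something like the one this sketch writes out (the field names and layout are just one hypothetical convention, not from the comment above):

```python
from pathlib import Path

# Hypothetical convention: one small markdown file per finding, with
# front-matter fields that the agent (or a human) can grep or filter on later.
finding = """---
topic: retry-behavior
source: docs/api/errors.md
tags: [rate-limits, retries]
---
The API returns 429 with a Retry-After header; clients should back off
exponentially and give up after 5 attempts.
"""

Path("findings").mkdir(exist_ok=True)
Path("findings/retry-behavior.md").write_text(finding)
```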
The issue happens then if you're updating the individual research files on a regular basis. (Or making a long series of commits on a starting code base.) Every edit has a chance of doing a drive-by cleanup on nearby lines. Over a long enough timeline, it'll ablate your logic into something featureless, like if you compress an image too many times.
It would be interesting to know if the stronger results on Python are not just an artefact of the Python-specific evaluation, if they carry over to other common general-purpose languages, and if they are driven by something specific in the training processes.
That's why harnesses and prompting rituals using dozens of markdown files do not work as advertised and are pretty much snake oil, otherwise known as "agentic engineering".
Also, "agentic engineering" is pretty much so-called prompt engineering, except that the prompt is now spread across dozens of markdown files and directories.
This works well for code regressions but also works for document writing. I've automated it at this point.
A case where using the CLI agent is much better than using the web chat.