Constraint Decay: The Fragility of LLM Agents in Back End Code Generation

Posted by wek 7 hours ago

Constraint Decay: The Fragility of LLM Agents in Back End Code Generation (arxiv.org)

107 points | 56 comments

gkfasdfasdf 5 hours ago

Odd that they used GPT-5.2 and not GPT-5.2-codex, i.e. the one optimized for coding agent tasks.

maleldil 2 hours ago

Considering this is from academia, there's a chance there were limitations on the available models. My research group accesses OpenAI models via Azure, and until recently (last week) the latest model was GPT 5. We just got 5.4.

beering 9 minutes ago

That’s wild. Are you at a university that bans using the OpenAI APIs directly?

leecommamichael 4 hours ago

These things don’t think. We’re going to have to reiterate this for a long time, I fear.

emp17344 4 hours ago

There is now a trillion-dollar industry bent to the task of convincing people these things can think. It’s gonna cause some damage.

suprfnk 3 hours ago

I don't think they think. I still use them a lot despite that, because they are very powerful parameterised code generators.

sheeshkebab 4 hours ago

…but they reason well enough given enough context (using their matmuls).

noosphr 4 hours ago

To this day, frontier models think that "A and not B" means "A and B" when the sentence gets pushed far enough back in their context window. The context length that a model can reason over without obvious errors is much smaller than the advertised context: somewhere between 1/4 and 1/20 of what is advertised on the tin.
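
To put rough numbers on that, here is a minimal sketch of the arithmetic. The 1/20 and 1/4 fractions are the anecdotal bounds from this comment, not measurements, and the window sizes in the loop are hypothetical examples rather than any specific model's spec.

```python
def effective_context_range(advertised_tokens: int) -> tuple:
    """Return (low, high) estimates of the usable context, in tokens.

    Assumption: usable context is between 1/20 and 1/4 of the advertised
    window, per the rough rule of thumb above. These are not measured values.
    """
    low = advertised_tokens // 20
    high = advertised_tokens // 4
    return low, high


# Hypothetical advertised window sizes, for illustration only.
for advertised in (128_000, 200_000, 1_000_000):
    low, high = effective_context_range(advertised)
    print(f"{advertised:,} advertised: roughly {low:,} to {high:,} usable tokens")
```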

antonvs 1 hour ago

Critiques like this tend to focus very hard on what models can't do. It's true, they have limitations.

But they're also superhuman in so many other ways. It's valid to point out limitations, but that doesn't support the conclusion that models are not incredibly powerful and capable of the functional equivalent of reasoning at human or superhuman levels in many scenarios.

Npovview 3 hours ago

Do you also happen to remember what you ate last Thursday?

leecommamichael 3 hours ago

Is that the same gap as what you’re responding to? To me, it seems his critique is about advertised capability and logical statements, and your rhetorical(?) question is about memory.

UncleEntity 1 hour ago

"If you have a question look in the specification for the answer and don't just guess" seems a fairly important thing to remember for more than a couple of minutes...

Npovview 21 minutes ago

I had a coding session where I was working across two repositories, and CC forgot exactly which repository a particular file was in, so it was grepping the parent directory. I asked it to write all the key-value pairs it considers important to a file, and after that it never grepped the parent directory again.
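
A minimal sketch of that kind of notes file, assuming a plain JSON file of key-value pairs that the agent is told to reread. The names `remember` and `recall`, the file name, and the example key are all hypothetical; this is not how CC stores anything internally.

```python
import json
import tempfile
from pathlib import Path
from typing import Optional


def remember(key: str, value: str, path: Path) -> None:
    """Add or update one fact in the JSON notes file at `path`."""
    notes = json.loads(path.read_text()) if path.exists() else {}
    notes[key] = value
    path.write_text(json.dumps(notes, indent=2, sort_keys=True))


def recall(key: str, path: Path) -> Optional[str]:
    """Look up one fact; return None if it was never recorded."""
    if not path.exists():
        return None
    return json.loads(path.read_text()).get(key)


# Demo in a throwaway directory so nothing is left behind.
with tempfile.TemporaryDirectory() as tmp:
    notes_file = Path(tmp) / "agent_notes.json"  # hypothetical file name
    remember("config_loader.py", "lives in repo-b/src", notes_file)
    print(recall("config_loader.py", notes_file))
```

The point of the file is that the fact survives even after it has scrolled out of the part of the context the model attends to reliably.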

akomtu 1 hour ago

There is a movie, Gold (2016), about a fake gold mine. One of its founders is a true believer: he found a few chunks of gold and started digging for more. The other founder is a nihilist: he realised that there is no gold there, but who cares if he makes the investors believe? So he does, and almost sells the company for $300M.

In our story, investors are mining intelligence from GPUs, and they truly believe they are one inch from discovering the biggest goldmine in history. But GPUs, unlike a goldmine, cannot be inspected for traces of gold by independent contractors. To keep the hype up, the nihilists in our story dig up cheap gold-looking metals from time to time and tell investors that with a bit of alchemy - agentic workflows, etc. - those metals can be magically turned into gold.

Investors will keep digging until the end of the age, or until they run out of money.

rbbydotdev 4 hours ago

This is interesting. Anecdotally, in a recent TypeScript project I felt I was having better luck with raw SQLite queries than with an ORM (Drizzle).

oulipo2 3 hours ago

This is exactly why you can't remove humans from the loop: someone has to assess that the solution is not only correct (which LLMs are quite bad at once concurrency, logic, etc. are involved) but also elegant, maintainable, etc.

phrotoma 3 hours ago

"Constraint decay": isn't this just another name for the (already well understood) idea of "context rot"?

spacedoutman 3 hours ago

This research is useless and nearly all other LLM research is too.

GPT-5.2 is the strongest model they tested, and it is nearly 6 months old.

Traditional research cannot keep up.

acgourley 2 hours ago

I disagree; their findings should generalize to the frontier. Even if the latest models can deal with the extra complexity, it stands to reason they will take more tokens to do less. This could be a useful insight for the next generation of evals.