Posted by wek 6 hours ago
One major weakness of this study is that they didn’t fully test frontier models for cost reasons, so the specific performance results should be taken with a grain of salt. But the overall conclusion that models degrade when both behavior and architecture must be correct is interesting, and something to keep an eye on.
If you only have functional requirements, then in effect you're doing some form of program synthesis, and RL can optimize that very hard.
If you have a mixture of functional and non-functional requirements, you are basically giving the model an incomplete specification, and it must in some way guess at the user's intent to fill in the blanks. This is also why adding to the prompt examples of the style of code you want (hats off to antirez for this particular tip ;)) is phenomenally powerful.
To put it in practice: if you point claude/codex to a repository and you ask it to implement feature X using style guide Y, the code will probably work, but you can usually get better results by saying "do it in the style of this file, it was done well there".
It is not great at decision making or judgment calls that don't have a well defined spec or plan in place yet; like unofficial or unapproved tokens if you will. A lot of this stuff simply never has had specs as it has been internal to how companies work and their secret sauce.
The closest things we have are governance and compliance policies, which legal and business needs require, so they are far better documented than the operational practices behind how we work. I guess what I'm saying is that this is more about the how than the what.
But yeah this is why it does great when there are tests, design systems, evals, and other artifacts to mirror. Far more reckless and unpredictable without these things, but still great for exploration and finding the data output you seek.
It's like when I see people feeding it a whole bunch of "best practices" and expect it to follow them. It won't. But you could ask it questions about the best practices all day long.
Ended up pointing Claude at a few sample files from our existing reporting, gave it read-only oauth access to the ERP and said “build a new report showing the cash by project as calculated by xxxx - yyyy + zzzz in the style of the existing reports” and it basically one-shot from there.
Kind of crazy and I built a bunch of redundant check-sums because I honestly didn’t think it would be able to replace like 6 workdays of effort for the 2 FTEs who generate that kind of thing manually every month but so far so good..
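A minimal sketch of what one such redundant check-sum could look like, assuming the report is a list of rows and the formula is applied per project. The row layout, the sample figures, and the helper names `cash_by_project` and `control_total` are all invented for illustration; `xxxx`, `yyyy`, and `zzzz` remain placeholders for the elided fields.

```python
from decimal import Decimal

# Hypothetical rows; xxxx, yyyy, zzzz stand in for the elided ledger fields.
rows = [
    {"project": "A", "xxxx": Decimal("100.00"), "yyyy": Decimal("30.00"), "zzzz": Decimal("5.00")},
    {"project": "B", "xxxx": Decimal("250.00"), "yyyy": Decimal("75.50"), "zzzz": Decimal("0.00")},
]


def cash_by_project(rows):
    """Per-project cash, computed row by row as xxxx - yyyy + zzzz."""
    return {r["project"]: r["xxxx"] - r["yyyy"] + r["zzzz"] for r in rows}


def control_total(rows):
    """Independent path: total each column first, then apply the formula."""
    def total(key):
        return sum((r[key] for r in rows), Decimal("0"))
    return total("xxxx") - total("yyyy") + total("zzzz")


report = cash_by_project(rows)
# The two paths must agree; a mismatch means the report logic is wrong.
assert sum(report.values(), Decimal("0")) == control_total(rows)
```

The point of the second path is that it does not reuse the report's own per-project logic, so a bug in the generated report has a chance of showing up as a mismatch.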
For example, if you apply "guardrails" to an image generator of about a year ago, all the people start looking alike. Story generators start using only a few standard names.
That was last year. Is it happening with the frontier models?
I mean, I spend more tokens having them clean up all the places they didn't follow the plan (if I catch it) or implementing what came out of a 'complete and tested' previous plan where they just stop as soon as all the pathetic new tests pass and you discover half of it isn't even there when trying to implement the next thing on top of it.
Though... I have been conducting an experiment, of sorts, where we've been cooking on these fairly complicated projects and I don't ever touch a single line of code, just yell at them a lot, and with suitable amounts of marijuana (they are very frustrating most of the time) it's been going pretty well. It also helps that they need to explain what they're doing to somebody fairly-baked -- maybe not such an HR friendly plan?
I’m not really interested in analysis of the weaknesses of such models because in my experience many weaknesses disappear entirely as models get stronger and reasoning effort is turned up. Especially if you tell them what you want them to do.
Also, it’s not surprising to learn that when more acceptance criteria are added the failure rate increases.
Even the best models have trouble adhering to stuff as mundane as rules for how to style generated code (indent this much, name things with these patterns, etc.). Even the most die-hard AI-first coder will admit to that kind of stuff being not unheard-of. Yet they still delude themselves into thinking that these models will follow a sufficiently detailed spec to the letter, every time.
For slightly complex changes, I always run codex (5.5-high) in planning mode first. I have linked various docs/{ARCHITECTURE,BACKEND-GUIDELINES,NESTJS-DI,..}.md etc. from AGENTS.md so they can quickly discover relevant docs at planning time, only if they are needed. No need to know react specific stuff when it's dealing with a backend problem for example. I typically blindly approve plans made by the agent with a fresh context, because that's as if I had prompted it. Works the best for me.
Using /goal however, it's really just constantly compacting and doing its thing, of course it gets sloppy. If only there was a state machine that would transform tickets into a Planning Mode Prompt, then use, idk. guardian approvals (somehow a "Product Management Perspective Lens" approving or making changes to the plan) and then letting a less capable or less reasoning agent execute the plan, I think that would work the best.
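A rough sketch of the state machine described above, just to make the shape concrete. The stage names, the `TRANSITIONS` table, and `advance` are all hypothetical, and the actual planning, approval, and execution steps are left out; this only shows the allowed flow, including the reviewer sending a plan back for changes.

```python
from enum import Enum, auto


class Stage(Enum):
    TICKET = auto()
    PLANNING = auto()
    REVIEW = auto()
    EXECUTING = auto()
    DONE = auto()


# Allowed transitions; REVIEW may send a plan back to PLANNING for changes.
TRANSITIONS = {
    Stage.TICKET: {Stage.PLANNING},
    Stage.PLANNING: {Stage.REVIEW},
    Stage.REVIEW: {Stage.PLANNING, Stage.EXECUTING},
    Stage.EXECUTING: {Stage.DONE},
    Stage.DONE: set(),
}


def advance(current: Stage, target: Stage) -> Stage:
    """Move to `target`, refusing any transition the table does not list."""
    if target not in TRANSITIONS[current]:
        raise ValueError(f"illegal transition: {current.name} -> {target.name}")
    return target


# One ticket whose plan is sent back once before being approved.
stage = Stage.TICKET
for step in (Stage.PLANNING, Stage.REVIEW, Stage.PLANNING,
             Stage.REVIEW, Stage.EXECUTING, Stage.DONE):
    stage = advance(stage, step)
print(stage.name)
```

Because execution is only reachable from review, the less capable executor never sees a ticket that skipped planning and approval.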
[1]: https://medium.com/@vishvananda/i-spent-2-billion-tokens-wri...
I've only read the abstract of this one so far but it seems like this paper has zoomed in on programming with greater fidelity and shown a similar phenomenon. But not about long horizon tasks, more like "long style horizons" of larger sets of structural constraints.
[1] https://arxiv.org/abs/2604.15597
Discussion: https://news.ycombinator.com/item?id=48073246
A framework would use static code checking tools to force an architecture on to LLMs instead of trying to do so in markdown.
I don't know exactly what it will look like but for example I could imagine a Java Framework where the LLM could only create subclasses of certain classes.
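A minimal sketch of that idea, in Python rather than Java, assuming a made-up rule that every class in generated code must directly subclass one of a few approved bases. `ALLOWED_BASES`, `Controller`, `Repository`, and `find_violations` are invented names; the only real API used is the standard `ast` module.

```python
import ast

# Hypothetical rule: every class must directly subclass one of these.
ALLOWED_BASES = {"Controller", "Repository"}


def find_violations(source: str) -> list[str]:
    """Return names of classes that do not subclass an allowed base."""
    violations = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.ClassDef):
            # Only simple names like `class Foo(Controller)` are recognised.
            base_names = {b.id for b in node.bases if isinstance(b, ast.Name)}
            if not base_names & ALLOWED_BASES:
                violations.append(node.name)
    return violations


# Stand-in for code an LLM produced; it is parsed, never executed.
generated = """
class UserController(Controller):
    pass

class Helper:
    pass
"""

print(find_violations(generated))
```

Here the check reports `Helper`, since it has no approved base. A check like this could run in CI or as a tool the agent has to pass, so the constraint is enforced rather than merely described in markdown.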
If there's a second thing the generative AI tools have shown beyond any doubt it's that many of the more modern (relatively speaking) "best practices" that have always been over-hyped and questionably-evidenced really do tend to produce worse results. LLMs take these methods to their logical conclusions and show us the end result much sooner. You can't just iterate your way to a solution when you don't even know what problem you're trying to solve. If you don't have a clear spec then you don't know what a correct product looks like. You need to invest time in reviewing code properly. If you don't keep the big picture in mind then the big picture becomes a mess.
Maybe one day the LLMs will leave me out of a job but at least I'll feel validated first!
tasks spanning eight web frameworks
Does anyone else have this experience that LLMs create better pure html+CSS+js than when they work with existing frameworks? The most incredible combo I've seen lately is progressive enhancement of Razor Pages with javascript. With this arrangement the newest models tend to make a really good call on whether something should happen server-side (cshtml) or on the client (js).
When using Codex/Claude Code with Go code I cannot count the times the agent makes some change, runs a build to check for errors, finds some, and fixes them.
https://docs.python.org/3/library/typing.html
"The Python runtime does not enforce function and variable type annotations. They can be used by third party tools such as type checkers, IDEs, linters, etc."
Which third-party enforcement mechanism do you propose become the default?
There are many reasons for this. A big one is that many libraries are only partially typed at best, and dynamic types tend to propagate, weakening the guarantees you get from type checking.
Dynamic idioms in general, including something as common as string-indexed dictionaries, negate type checking. Runtime metaprogramming is the same. All of these things have equivalents in a good statically checked language, but Python doesn't follow those models.
Fundamentally, in Python static typing is an optional analysis layer over a dynamic language, and the consequences of that can't be fully mitigated. The result is a big difference in what types can guarantee.
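A small illustration of both points, as a sketch: annotations are not enforced at runtime, and `Any` from an untyped source propagates. `double` and `load_config` are made-up names, with `load_config` standing in for a call into a partially typed library.

```python
from typing import Any


def double(x: int) -> int:
    # Annotated as int -> int, but the interpreter never checks this.
    return x * 2


def load_config() -> Any:
    # Hypothetical stand-in for a call into an untyped library.
    return {"retries": "3"}


# A type checker such as mypy would flag this call; the interpreter runs it.
wrong = double("ab")

# Any propagates: a checker accepts this too, though the value is a str.
retries: int = load_config()["retries"]

print(repr(wrong), repr(retries))
```

The first mistake is caught only if a checker is actually run; the second passes a checker as well, because indexing an `Any` value yields `Any`, which is assignable to `int`.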
I have exactly the inverse findings on my end. The bigger and more legacy the codebase, the more accurate the patches become.
The harness itself seems to be the most important part. I use a recursive loop that primes the root context based on the user prompt each time. My agent will often make over 100 tool calls to sql and git before it finally decides to apply a patch. If I was greenfield, there would be nothing to query or constrain against.
I usually find I can achieve 90% of the outcome I'm trying to achieve. I use sonnet for planning, qwen for coding, sonnet for review.