Posted by spicypete 12/21/2025
What's interesting is the 50% vs 80% reliability gap. At 50% success rate on a 4-hour task, you're essentially gambling. If it fails, you've potentially wasted the 4 hours plus the time debugging why it failed.
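A rough expected-cost sketch of that gamble (the review and debugging times here are my own assumptions, not benchmark numbers, and it assumes you keep retrying until one run succeeds):

```python
# Rough expected-cost sketch for handing a 4-hour task to an agent.
# review_hours and debug_hours are assumptions, not benchmark figures.
task_hours = 4.0      # nominal task length
review_hours = 0.5    # assumed time to check each attempt
debug_hours = 1.0     # assumed time sunk diagnosing a failed attempt

for p_success in (0.5, 0.8):
    attempts = 1 / p_success                  # expected tries until one succeeds
    failures = attempts - 1
    cost = attempts * (task_hours + review_hours) + failures * debug_hours
    print(f"p={p_success}: ~{cost:.1f} h per successful completion")
# p=0.5: ~10.0 h, p=0.8: ~5.9 h -- the gap compounds with every retry
```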
This is why I think the current "agent" paradigm needs human checkpoints at regular intervals. Let the AI work for 30 minutes, then review progress. Repeat. This way you catch drift early before it compounds.
The other thing missing from these benchmarks: recovery ability. When the AI gets stuck on hour 3 of a 4-hour task, can it recognize the problem and backtrack? Or does it confidently continue down the wrong path?
The problem with this approach is that in 30 minutes an agent can produce a massive amount of stuff. Reviewing all of it is a nightmare: on the surface it seems fine, and it often works, until it doesn't. The bugs introduced are often subtle, and their effects only show up later, if ever.
So, for stuff that matters (to me), I prefer not to use agents at all.
Maybe things will change in a year, or 5, or 10. I will keep giving it a try. But for the moment it's just not worth it, and the upside-down workflow it pushes on me just makes me tired and drains the satisfaction I get from doing my job.
I don't think I have a 50% success rate at month-long tasks.
Anything that exceeds one day is pretty hard.
The Conclusions section is not for making a sales pitch for your article. It is for summarizing any new knowledge the article brings out.
Extrapolating any exponential growth is always dangerous, but over, say, 3 years at this pace, we'd go from 2 hours to 70 hours, or about 8 days' work.
Quite scary. But what does cost do over the same timeline? Does it increase with computational complexity? Is it worse - because, IIRC, a transformer's computational cost is quadratic in context length? Or is it better - some kind of economies of scale?
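A back-of-envelope sketch of both questions. The ~7-month doubling time is just what the 2 h → 70 h numbers above imply, not a figure quoted from the article, and the cost part assumes context grows linearly with task length with no truncation or summarization (which real agent scaffolds do use); all names and numbers are only for the sketch.

```python
import math

# 1) The growth extrapolation above: 2 h -> 70 h over 36 months implies a
#    doubling time of roughly 7 months.
doubling_months = 36 * math.log(2) / math.log(70 / 2)      # ~7.0 months
horizon_hours = 2 * 2 ** (36 / doubling_months)            # back out ~70 h
print(round(doubling_months, 1), round(horizon_hours))     # 7.0 70

# 2) Cost: if the tokens an agent keeps in context grow roughly linearly with
#    task length, the total attention cost of generating n tokens grows roughly
#    quadratically (sum of per-token costs 1..n). Purely illustrative.
def relative_attention_cost(n_tokens: int) -> float:
    return n_tokens * (n_tokens + 1) / 2

# A task 35x longer (2 h -> 70 h) would then cost ~1,200x in attention alone,
# assuming token usage scales with task length and nothing is summarized away.
print(round(relative_attention_cost(35 * 10_000) / relative_attention_cost(10_000)))
```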
I glanced through the article but couldn't find any info on this.
It matches my personal feeling from using progressively better models over time.
An infinitely unscrupulous model provider could double this five-hour result by cutting your output tokens/second in half!
This isn't only a question of gaming the metric: the very strong current small, fast models (4.5 Haiku, Gemini 3 Flash) have no hope of being measured fairly against it - they will succeed or fail much sooner simply because they are much faster.
How about something like total output token count as the "long term horizon" metric instead?
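A toy sketch of that concern, with every number invented for illustration (the token counts and throughputs are not from the article):

```python
# Toy sketch: the same task, finished with the same output, looks twice as
# "long-horizon" on a wall-clock metric if tokens are simply served at half
# the speed. All numbers here are invented for illustration.
tokens_to_finish = 600_000              # hypothetical output tokens to finish the task

for tokens_per_second in (70, 35):      # second row: throughput cut in half
    wall_clock_hours = tokens_to_finish / tokens_per_second / 3600
    print(f"{tokens_per_second} tok/s -> {wall_clock_hours:.1f} h wall clock, "
          f"{tokens_to_finish:,} output tokens either way")
```

A token-count horizon would be unchanged by serving speed, which is the point of the suggestion above.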
My introduction to this type of model measurement came from an interview whose repeatedly hammered-home point was that Sonnet 4.0 nailed a gigantic refactor (converting a large legacy ASP.NET app, or similar, into React server-side components, or similar) in a loop whose runtime was some large number of hours. I mistakenly attributed the same framing here.
And software is rarely one and done: after a few rounds like this, the architecture would have become schizophrenic. Combating this tendency usually requires throwing away a lot of the work from these "long tasks" and more tightly limiting what the AI is trying to do while they happen. The success of one "long task" is not necessarily a good thing!