Posted by spicypete 12/21/2025

Measuring AI Ability to Complete Long Tasks(metr.org)
247 points | 193 comments | page 2
yoan9224 12/21/2025|
The key insight from this benchmark is using "human-equivalent hours" rather than actual AI execution time. It's measuring capability complexity, not speed.

What's interesting is the 50% vs 80% reliability gap. At 50% success rate on a 4-hour task, you're essentially gambling. If it fails, you've potentially wasted the 4 hours plus the time debugging why it failed.
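
To put a rough number on the gamble (the 2-hour debugging cost on failure is just my own assumption for illustration):

    # Expected human time lost per delegation of a "4-hour" task at 50%
    # agent reliability. The debugging cost on failure is an assumed
    # figure, purely for illustration.
    p_success = 0.5
    task_hours = 4        # human-equivalent length of the task
    debug_hours = 2       # assumed triage/debugging cost when the agent fails

    expected_waste = (1 - p_success) * (task_hours + debug_hours)
    print(expected_waste)  # 3.0 expected hours wasted per attempt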

This is why I think the current "agent" paradigm needs human checkpoints at regular intervals. Let the AI work for 30 minutes, then review progress. Repeat. This way you catch drift early before it compounds.
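
Sketched in Python (the agent and review interfaces here are hypothetical placeholders, not any particular framework's API):

    # Checkpointed-agent loop: work in fixed slices, pause for human review,
    # roll back when the reviewer spots drift. `agent` and `review` are
    # placeholders, not a real framework's API.
    def run_with_checkpoints(agent, task, review, slice_minutes=30, max_slices=8):
        for _ in range(max_slices):
            progress = agent.work(task, minutes=slice_minutes)   # placeholder call
            if progress.done:
                return progress.result
            if not review(progress.diff):            # human catches drift early
                agent.rollback_to_last_checkpoint()  # placeholder call
        raise TimeoutError("checkpoint budget exhausted without an accepted result")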

The other thing missing from these benchmarks: recovery ability. When the AI gets stuck on hour 3 of a 4-hour task, can it recognize the problem and backtrack? Or does it confidently continue down the wrong path?

dvfjsdhgfv 12/21/2025|
> This is why I think the current "agent" paradigm needs human checkpoints at regular intervals. Let the AI work for 30 minutes, then review progress. Repeat. This way you catch drift early before it compounds.

The problem with this approach is that in 30 minutes, an agent is able to produce a massive amount of stuff. Reviewing all this is a nightmare, in the sense that on the surface it seems fine and it often works, until it doesn't. The bugs introduced are often subtle and their effects manifest later, if ever.

So, for stuff that matters (to me), I prefer not to use agents at all.

Maybe things will change in a year, or 5, or 10. I'll keep giving it a try. But for the moment it's just not worth it, and the upside-down workflow it pushes on me just makes me tired and takes the satisfaction out of doing my job.

scotty79 12/21/2025||
> As shown above, when we fit a similar trend to just the 2024 and 2025 data, this shortens the estimate of when AI can complete month-long tasks with 50% reliability by about 2.5 years.

I don't think I have 50% success rate at month long tasks.

Anything that exceeds one day is pretty hard.

zkmon 12/21/2025||
> We believe this work has important implications ...
> First, our work demonstrates an approach ...

The Conclusions section is not for making a sales pitch for your article. It is for summarizing any new knowledge the article brings out.

rich_sasha 12/21/2025||
How does "cost" per frontier task change with time?

Extrapolating any exponential growth is always dangerous, but over say 3 years at this pace, we'd go from 2 hours to 70, or about 8 days' work.
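
For the curious, the back-of-the-envelope math, assuming the paper's roughly 7-month doubling time keeps holding:

    # Doubling extrapolation behind the "2 hours -> ~70 hours in 3 years"
    # figure, assuming a ~7-month doubling time continues to hold.
    doubling_months = 7
    start_hours = 2
    months = 36

    horizon = start_hours * 2 ** (months / doubling_months)
    print(round(horizon, 1))      # ~70.7 human-equivalent hours
    print(round(horizon / 8, 1))  # ~8.8 eight-hour working days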

Quite scary. But what does cost do over the same timeline? Does it increase with computational complexity? Is it worse - because, IIRC, transformers' computational cost is quadratic in context length? Or is it better - some kind of economies of scale?
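
A toy model of the quadratic case, just to bound the question (this deliberately ignores KV caching, sparse attention, and actual per-token pricing):

    # Naive cost model: if attention cost grows quadratically with context
    # length and context grows with task length, doubling the task roughly
    # quadruples that component of the cost. Ignores KV caching, sparse
    # attention, and real per-token pricing.
    def relative_attention_cost(context_tokens, base_tokens=8_000):
        return (context_tokens / base_tokens) ** 2

    for tokens in (8_000, 16_000, 32_000, 64_000):
        print(tokens, relative_attention_cost(tokens))  # 1.0, 4.0, 16.0, 64.0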

I glanced through the article but couldn't find any info on this.

yismail 12/21/2025||
Would be interesting to see Gemini 3.0 Pro benchmarked as well.
PunchTornado 12/21/2025||
Exactly. I don't understand how an article like this ignores the best models out there.
cubefox 12/21/2025||
This article was published a long time ago, in March.
yismail 12/21/2025||
That's true, but it looks like it's been updated since then because the benchmarks include Claude Opus 4.5
grim_io 12/21/2025||
This seems like a good way to measure LLM improvement.

It matches my personal feeling from using progressively better models over time.

sshh12 12/21/2025||
For folks interested in some of the nuances of this benchmark, I just posted this deep dive:

https://blog.sshh.io/p/understanding-ai-benchmarks

big-chungus4 12/22/2025||
"Train adversarially robust image model" is not a long task imo
leecommamichael 12/22/2025|
I read their citations (which are actually by the same authors as this paper) and they also count using Python's built-in web server to "build a web server" as a long task.
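
For scale, the standard-library version of "build a web server" is essentially a one-liner (my example, not the paper's exact task spec):

    # Serve the current directory with Python's built-in web server
    # (equivalent to running `python -m http.server 8000`).
    from http.server import HTTPServer, SimpleHTTPRequestHandler

    HTTPServer(("127.0.0.1", 8000), SimpleHTTPRequestHandler).serve_forever()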
NiloCK 12/21/2025||
I appreciate horizon expansion as a fundamental metric, but duration seems like too crude a measure. We used to like it when computers were fast.

An infinitely unscrupulous model provider could double this five-hour result by cutting your output tokens/second in half!

This isn't only a question of gaming the metric: the very strong current small-fast models (4.5 Haiku, Gemini 3 Flash) have no hope of being measured fairly against this - they will succeed or fail much faster just because they are much faster.

How about something like total output token count as the "long term horizon" metric instead?

scellus 12/21/2025||
The time (horizon) here is not that of the model completing the task, but a human completing the task.
NiloCK 12/22/2025|||
Wow that was a garbage comment!

My introduction to this type of model measurement came from an interview where the repeatedly hammered-home point was that Sonnet 4.0 nailed a gigantic refactor (converting a large legacy ASP.NET app or similar into React server-side components or similar) in a loop whose runtime was some large number of hours. I mistakenly attributed the same framing here.

docstryder 12/21/2025||
Task duration is the time it would take for humans to complete the task. The speed of the models and how long they might take to complete the task is not part of this metric.
Aperocky 12/21/2025|
I think the problem here is that an LLM eventually pollutes its context window with so much of the current task that the larger picture, or architectural sanity, is forgotten in favor of the task at hand.

And software is rarely one and done: after a few rounds like this, the architecture would have become schizophrenic. Combating this tendency usually requires throwing away a lot of the work from these "long tasks" and more closely limiting what the AI is trying to do as it goes. The success of one "long task" is not necessarily a good thing!

Leynos 12/21/2025|
This was why server-side compaction in GPT-5.2 was such a big deal. The model is by default provided with a tool that prioritises the initial task and salient updates during context-window compaction, and the new model has been trained to use it.
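
Roughly this idea, sketched generically (this is an illustration of task-pinned compaction, not OpenAI's actual tool or API; summarize() is a placeholder):

    # Generic task-pinned compaction: keep the original task and other
    # pinned messages verbatim, summarize the dropped middle, keep the
    # most recent turns. Not OpenAI's API; summarize() is a placeholder.
    def compact(history, summarize, keep_last=20):
        older = history[:-keep_last]
        pinned = [m for m in older if m.get("pinned")]   # initial task, key decisions
        dropped = [m for m in older if not m.get("pinned")]
        summary = {"role": "system", "content": summarize(dropped)}
        return pinned + [summary] + history[-keep_last:]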