Posted by spicypete 12/21/2025

Measuring AI Ability to Complete Long Tasks (metr.org)
247 points | 193 comments
mkoubaa 12/21/2025|
Ask not what the agent can do for you, ask what you can do for the agent.

If you fail to break up the task into agent sized chunks, you're the problem.

bentobean 12/21/2025||
> We show that this metric has been consistently exponentially increasing over the past 6 years, with a doubling time of around 7 months.

If true, how much of this is a result of:

1. Genuine technical advancement

or:

2. Shoveling trillions of dollars into compute resources in order to service incoming LLM requests in a way that is completely unrealistic over the long term?

In other words… are we talking about genuine, sustainable innovation that we get to take with us moving forward and benefit from? Or are we talking about an “improvement” that is more akin to a mirage that will disappear when the Ponzi scheme eventually collapses?
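
For scale, here is a minimal sketch of what the quoted 7-month doubling time implies if you extrapolate it naively over the 6-year window. The 7-month figure is the paper's quoted claim; the rest is illustrative arithmetic, not data from the paper.

    # Naive extrapolation of a 7-month doubling time over 6 years.
    # The 7-month figure is the quoted claim; everything else is arithmetic.
    months = 6 * 12                      # 6 years, in months
    doubling_time = 7                    # months per doubling (quoted)
    doublings = months / doubling_time   # ~10.3 doublings
    growth = 2 ** doublings              # ~1250x longer task horizons
    print(f"{doublings:.1f} doublings -> ~{growth:.0f}x growth in task horizon")

Whether a growth factor like that reflects durable capability or unsustainable compute spend is exactly what the rest of the thread is debating.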

mediaman 12/21/2025||
Much of this is due to vastly better post-training RL, not models that are much bigger. The idea that most of these gains come from training really big models, or throwing immensely larger amounts of compute at it, is not really true.
emp17344 12/21/2025|||
I wonder how much of this stuff is attributable to true model advancement, or if it’s an improvement in the agentic harness? It’s impossible to separate strict model improvement from improvement in the associated tools.
dghost-dev 12/21/2025||
Good point.
Davidzheng 12/21/2025||
Big error bars, and the METR people are saying the longer end of the benchmark is less accurate right now. I think they mean this is a lower bound!
scellus 12/21/2025|
It's complicated. Opus 4.5 is actually not that good at the 80% threshold but is above the others at the 50% completion threshold. I read there's a single task around 16h that the model completed, and the broad CI comes from that.

METR currently simply runs out of tasks at 10-20h, and as a result you have a small N and lots of uncertainty there. (They fit a logistic to the discrete 0/1 results to get the thresholds you see in the graph.) They need new tasks, then we'll know better.
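
A minimal sketch of the kind of fit described in that parenthetical, using made-up task results (the numbers and the scikit-learn call are illustrative assumptions, not METR's actual code or data):

    # Fit P(success) as a logistic function of log2(human task length),
    # then read off the lengths where the fit crosses 50% and 80%.
    # Synthetic data; not METR's tasks or results.
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    task_minutes = np.array([2, 4, 8, 15, 30, 60, 120, 240, 480, 960])
    success      = np.array([1, 1, 1, 1,  1,  0,  1,   0,   0,   0])

    X = np.log2(task_minutes).reshape(-1, 1)
    clf = LogisticRegression().fit(X, success)
    b, a = clf.intercept_[0], clf.coef_[0][0]  # P(success) = sigmoid(a*log2(t) + b)

    horizon_50 = 2 ** (-b / a)                       # where the fit crosses 0.5
    horizon_80 = 2 ** ((np.log(0.8 / 0.2) - b) / a)  # where the fit crosses 0.8
    print(f"50% horizon ~{horizon_50:.0f} min, 80% horizon ~{horizon_80:.0f} min")

With only a handful of long tasks, the right tail of a fit like this is dominated by one or two 0/1 outcomes, which is where the wide confidence intervals come from.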

JohnnyMarcone 12/21/2025||
Thanks for this comment. I've been trying to find anything about the huge error bars. Do you have any sources you can share for further reading?
nrhrjrjrjtntbt 12/21/2025||
Why measure in minutes and not tokens? Seems you could cheat by slowing the AI down.
wmf 12/21/2025|
They measure the time it takes a human to complete the task. They don't care how long the AI takes (although in practice it's much faster than a human). Measuring tokens isn't a good idea because newer models can complete tasks using fewer tokens.
Dwedit 12/21/2025|
Opus is already the name of an audio codec.
pants2 12/21/2025||
Gemini is already the name of a Greek god, a constellation, a space mission, a crypto exchange, an astrological sign, a car, and a comic villain! How will we ever figure out which one someone is talking about?
GaggiX 12/21/2025|||
Opus: "an artistic work, especially one on a large scale."

The names Haiku, Sonnet, and Opus were not chosen at random.

oidar 12/21/2025||
And so much more intuitive than the OpenAI names for their models. I still don't get their naming scheme.
p1esk 12/21/2025||
Have you been living under a rock?