Posted by spicypete 2 days ago

Measuring AI Ability to Complete Long Tasks(metr.org)
242 points | 191 comments | page 3
mkoubaa 1 day ago|
Ask not what the agent can do for you, ask what you can do for the agent.

If you fail to break up the task into agent sized chunks, you're the problem.

bentobean 2 days ago||
> We show that this metric has been consistently exponentially increasing over the past 6 years, with a doubling time of around 7 months.

If true, how much of this is a result of:

1. Genuine technical advancement

or:

2. Shoveling trillions of dollars into compute resources in order to service incoming LLM requests in a way that is completely unrealistic over the long term?

In other words… are we talking about genuine, sustainable innovation that we get to take with us moving forward and benefit from? Or are we talking about an “improvement” that is more akin to a mirage, one that will disappear when the Ponzi scheme eventually collapses?
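For scale, here's a quick back-of-the-envelope calculation (my own sketch, not from the paper; the only input is the quoted doubling time) of what a 7-month doubling time implies over the 6-year window they measured:

    # rough arithmetic behind the "doubling every ~7 months" claim;
    # illustrative only, not METR's code
    doubling_months = 7
    years = 6
    doublings = years * 12 / doubling_months   # ~10.3 doublings
    growth = 2 ** doublings                    # ~1200x longer task horizon
    print(f"{doublings:.1f} doublings -> ~{growth:.0f}x growth over {years} years")

Whether a curve like that reflects durable capability or an unsustainable compute subsidy is exactly the question.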

mediaman 2 days ago||
Much of this is due to vastly better post-training RL, not models that are much bigger. The idea that most of these gains come from training really big models, or from throwing immensely larger amounts of compute at them, is not really true.
emp17344 2 days ago|||
I wonder how much of this is attributable to true model advancement, and how much to improvements in the agentic harness. It’s impossible to separate strict model improvement from improvement in the associated tools.
dghost-dev 2 days ago||
Good point.
nrhrjrjrjtntbt 2 days ago||
Why measure in minutes and not tokens? Seems you could cheat by slowing the AI down.
wmf 2 days ago|
They measure the time it takes a human to complete the task. They don't care how long the AI takes (although in practice it's much faster than a human). Measuring tokens isn't a good idea because newer models can complete tasks using fewer tokens.
Davidzheng 2 days ago||
Big error bars, and METR people are saying the longer end of the benchmark is less accurate right now. I think they mean this is a lower bound!
scellus 2 days ago|
It's complicated. Opus 4.5 is actually not that good at the 80% threshold but is above the other models at the 50% completion threshold. I read there's a single task of around 16h that the model completed, and the broad CI comes from that.

METR currently simply runs out of tasks at 10-20h, and as a result you have a small N and lots of uncertainty there. (They fit a logistic to the discrete 0/1 results to get the thresholds you see in the graph.) They need new tasks, then we'll know better.
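To make the logistic-fit step concrete, here's a minimal sketch of that kind of calculation (my own illustration with made-up data, not METR's actual code or task set): regress 0/1 task success on log2(human completion time), then read off where the fitted curve crosses 50% and 80%.

    # toy example: fit a logistic to 0/1 outcomes vs. log(human time)
    # and solve for the 50% / 80% "time horizons"; the data is invented
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    human_minutes = np.array([2, 5, 15, 30, 60, 120, 240, 480, 960], dtype=float)
    success       = np.array([1, 1,  1,  1,  1,   0,   1,   0,   0])  # hypothetical

    X = np.log2(human_minutes).reshape(-1, 1)
    clf = LogisticRegression().fit(X, success)
    b0, b1 = clf.intercept_[0], clf.coef_[0][0]

    def horizon(p):
        # solve b0 + b1 * log2(t) = logit(p) for t
        return 2 ** ((np.log(p / (1 - p)) - b0) / b1)

    print(f"50% horizon ~{horizon(0.5):.0f} min, 80% horizon ~{horizon(0.8):.0f} min")

With only a handful of long tasks on the right-hand side, the fitted curve (and the 80% threshold especially) moves a lot when a single task flips, which is where the wide CIs come from.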

JohnnyMarcone 1 day ago||
Thanks for this comment. I've been trying to find anything about the huge error bars. Do you have any sources you can share for further reading?
Dwedit 2 days ago|
Opus is already the name of an audio codec.
pants2 2 days ago||
Gemini is already the name of a Greek god, a constellation, a space mission, a crypto exchange, an astrological sign, a car, and a comic villain! How will we ever figure out which one someone is talking about?
GaggiX 2 days ago|||
Opus: "an artistic work, especially one on a large scale."

The names Haiku, Sonnet, and Opus have not been chosen randomly.

oidar 2 days ago||
And so much more intuitive than the OpenAI names for their models. I still don't get their naming scheme.
p1esk 2 days ago||
Have you been living under a rock?