Posted by spicypete 2 days ago
If you fail to break up the task into agent sized chunks, you're the problem.
If true, how much of this is a result of:
1. Genuine technical advancement
or:
2. Shoveling trillions of dollars into compute resources in order to service incoming LLM requests in a way that is completely unrealistic over the long term?
In other words… are we talking about genuine, sustainable innovation that we get to take with us and benefit from going forward? Or are we talking about an “improvement” that is more akin to a mirage, one that will disappear when the Ponzi scheme eventually collapses?
METR currently just runs out of tasks at 10-20h, so N is small and the uncertainty is large at that end. (They fit a logistic curve to the discrete 0/1 task outcomes to get the time-horizon thresholds you see in the graph.) They need new, longer tasks; then we'll know better.
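For anyone curious what that fitting step looks like: here's a minimal sketch of regressing 0/1 task outcomes on log task length and reading off the 50% time horizon. The data points are made up for illustration, not METR's actual results, and this uses an off-the-shelf logistic regression rather than whatever exact procedure they use.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical (task length in hours, success 0/1) results.
# These numbers are illustrative only, not METR's data.
lengths = np.array([0.05, 0.1, 0.25, 0.5, 1.0, 2.0, 4.0, 8.0, 16.0])
success = np.array([1, 1, 1, 1, 1, 0, 1, 0, 0])

# Fit P(success) as a logistic function of log task length.
X = np.log(lengths).reshape(-1, 1)
model = LogisticRegression(C=1e6).fit(X, success)  # near-unpenalized fit

# The 50% horizon is where the fitted log-odds cross zero:
# b0 + b1 * log(length) = 0  =>  length = exp(-b0 / b1)
b0 = model.intercept_[0]
b1 = model.coef_[0][0]
horizon_50 = np.exp(-b0 / b1)
print(f"fitted 50% time horizon ~= {horizon_50:.2f} h")
```

The point of the thread stands out in code form too: with only a handful of long tasks, the failures past a few hours dominate where that crossing point lands, so the tail estimate is fragile.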
The names Haiku, Sonnet, and Opus have not been chosen randomly.