
Posted by kmdupree 21 hours ago

SWE-bench Verified no longer measures frontier coding capabilities (openai.com)
302 points | 168 comments
neuroelectron 18 hours ago|
It's really naïve to think any of the big AI companies won't cheat.
DeathArrow 19 hours ago||
So Opus 4.7 and Mythos are solving problems that are impossible to solve?
tedsanders 14 hours ago||
Whether a problem is "good" or "bad" is not always objective or simple.

For example, you can have problems that are underspecified, with hardcoded tests for a particular solution (out of multiple possible solutions). If your solution works fine but uses a different function name than the one hardcoded in the tests, you can unfairly score 0.

When an eval has underspecified problems like these, you can still score 100% if you remember the original solution from your training data or if you just have taste similar to the original human authors. And both of these qualities - good memory and good taste - are great, but they'll be rewarded unfairly relative to a model that still did exactly what it was asked but in a different way than the hardcoded tests expected.
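
A minimal sketch of what I mean (the module and function names here are invented, not from any real SWE-bench task). The hidden test patch imports one specific name, so an equally valid solution under a different name fails at import time:

    # Hidden test patch: hardcodes the original author's function name.
    from mylib.paths import normalize_path  # ImportError if the model named
                                            # its helper anything else

    def test_normalize_path():
        assert normalize_path("a//b/../c") == "a/c"

    # A model that implemented identical behavior as, say, clean_path()
    # did exactly what the issue asked, but scores 0 here.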

karmasimida 13 hours ago||
To some extent yes.

It is not impossible to solve in absolute terms, in the sense that all the necessary information is present in the repo plus the problem statement.

But it is impossible in the sense that, unless you have read the ground truth, you are NOT able to solve it the way the test patch demands.

It's simply not plausible to me that a model can read the problem statement so precisely that it nails, like, 100% of what the test suite is trying to test.
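
As a made-up illustration (not a real task): say the issue asks you to "raise a clearer error for negative sizes". The model's fix is behaviorally correct, but the hidden test patch pins one exact message out of many reasonable ones:

    # Model's patch: perfectly reasonable wording for the requested error.
    def make_buffer(size):
        if size < 0:
            raise ValueError("size must be non-negative")
        return bytearray(size)

    # Hidden test patch: pytest.raises(match=...) regex-searches the message,
    # so this fails even though the behavior is exactly what was asked for.
    import pytest

    def test_negative_size():
        with pytest.raises(ValueError, match="negative size not allowed"):
            make_buffer(-1)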

DeathArrow 19 hours ago||
So we need to generate benchmarks after the models finish training. Or we need to keep the solutions to the benchmark problems as closed source.
retinaros 20 hours ago||
it never did
varispeed 20 hours ago||
Another issue with these benchmarks is that they measure a model you are unlikely to be routed to. My experience with Anthropic is that despite using Opus 4.6 and 4.7, most of the time the performance matches a low-B-parameter Qwen. There should be a way to independently verify which model is actually processing your prompts. At the moment it is so bad that you have to ask the model a verification question in the form of a non-trivial problem. If it solves it, there is a chance you actually got Opus and not an impostor, so you can continue the session instead of restarting it and hoping you get routed correctly. But even that does not help if the model is swapped for a cheaper one mid-session. I've lost so much work because of these shenanigans.
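
The verification step can at least be scripted. A rough sketch, assuming the Anthropic Python SDK; the model id and probe strings are placeholders you'd fill in yourself, not real values:

    import anthropic

    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

    PROBE = "<your non-trivial problem>"                      # placeholder
    EXPECTED = "<output only a strong model reliably gives>"  # placeholder

    # Ask the probe question at the start of the session.
    resp = client.messages.create(
        model="claude-opus-4-7",  # placeholder id for the tier you paid for
        max_tokens=512,
        messages=[{"role": "user", "content": PROBE}],
    )
    if EXPECTED not in resp.content[0].text:
        print("Probe failed; possibly routed to a weaker model. Restart the session.")

This only samples the start of a session, though; it can't catch a mid-session swap.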
gruez 19 hours ago||
> My experience with Anthropic is that despite using Opus 4.6 and 4.7, most of the time the performance is matching low B parameter Qwen.

Is this just the next level of the "they're serving quantized models!" theory?

varispeed 13 hours ago||
Not a theory but lived experience. You never know when you get the nerfed session.
alansaber 20 hours ago||
I'm sure some inference providers don't, but most intentionally obfuscate this data. They have the full trace logs; my impression is that they don't share them because it's their competitive advantage, and it would be easier for a competitor to distil their model if they did.