SWE-bench Verified no longer measures frontier coding capabilities

Posted by kmdupree 17 hours ago

SWE-bench Verified no longer measures frontier coding capabilities(openai.com)

272 points | 155 commentspage 2

zachdotai 8 hours ago|

I wrote about this recently here: https://fabraix.com/blog/adversarial-cost-to-exploit

I think the core issue is in static benchmarks and the community needs to start moving beyond measuring pass/fail (which worked when agents were incapable of doing much of the work) to dynamic evals that simulate more how we evaluate humans.

marlburrow 12 hours ago||

The "private benchmarks" suggestion comes up every time, but I think there's a more interesting axis: benchmarks built on top of already-public, already-stable test instruments. SWE-bench is fundamentally a corpus that lives on GitHub — once it ships, it leaks into training data automatically. Benchmarks built on contested qualitative instruments (psych tests, opinion surveys) have a different contamination profile because the correct answer doesn't exist in the training corpus to memorize — only the question does.

That doesn't help for measuring coding ability specifically (you fundamentally need a code-correctness oracle), but for capability axes where the "answer" is a stated position rather than a verifiable fact, public + stable can still be useful. The SWE-bench problem isn't really "public", it's "public + has a fixed correct answer".

1a527dd5 16 hours ago||

This feels very much like "we are now moving the goal posts".

hashmap 12 hours ago||

It does, and it should. With each iteration getting closer to the goalposts exposes the flaws in the goalposts, and then you try to make better goalposts. The problem people seem to have with the goalposts moving is they assume the goalpost makers either made good goalposts or thought they made good goalposts, but the actual process is "do the best we can at the moment and update when we get better information".

neversupervised 16 hours ago|||

But this is the good kind of goalpost moving

iLoveOncall 16 hours ago||

Only if you didn't read the article.

They're saying they need to move on from it because the benchmark is flawed (without bringing in proof) and that's why they can't hit 100%.

It's not a "our models are so good that the benchmark is too easy" thing.

embedding-shape 16 hours ago|||

I feel like they're quite open about why they think the benchmark doesn't work anymore:

> We also found evidence that models that have seen the problems during training are more likely to succeed, because they have additional information needed to pass the underspecified tests.

> This means that improvements on SWE-bench Verified no longer reflect meaningful improvements in models’ real-world software development abilities. Instead, they increasingly reflect how much the model was exposed to the benchmark at training time.

f33d5173 16 hours ago||||

> without bringing in proof

Did we read the same article?

MattRix 15 hours ago|||

How can you say “without bringing in proof” when there is literally proof in the article?

MattRix 15 hours ago||

Only if you didn’t read the article…

languid-photic 13 hours ago||

It’s very hard to encode the properties that matter most in code in tests. [1]

[1] https://voratiq.com/blog/your-workflow-is-the-eval

kimjune01 5 hours ago||

AI labs should compete on a bench that's adversarial, such as go or Starcraft

djoldman 16 hours ago||

> We have incorporated these findings into our recent evaluation efforts. In the last months we’ve chosen to report results from the public split of SWE-Bench Pro. We recommend other model developers do the same. SWE-bench Pro is not perfect, but empirically seems to suffer less from contamination issues.

https://arxiv.org/pdf/2509.16941

parentheses 13 hours ago||

The timing makes me wonder if this is a direct response to Deepseek V4 having performance comparable to SOTA models.

osti 5 hours ago|

This was published two months ago. Even though it was at a time that open source models are publishing comparable swe bench scores.

lmeyerov 13 hours ago||

It's been fun benchmarking AI investigations at botsbench.com . Part of it is checking for these kinds of issues - we recently started seeing contamination in our first generation challenge, and less obvious, agent sandbox escapes for other kinds of cheating. Fun times!

swyx 12 hours ago||

more context in small writeup + we interviewd the team behind this when it was announced: https://www.latent.space/p/swe-bench-dead

eugenekolo 12 hours ago|

Without SWE-Bench though, how will AI models properly game their results to show ~5-10% gain each iteration?

Once a benchmark is known and there's billion of dollars on the line, obviously every company will game them.

More comments...