
Posted by kmdupree 19 hours ago

SWE-bench Verified no longer measures frontier coding capabilities (openai.com)
289 points | 160 comments
axpy906 13 hours ago|
Once the bench is public it’s out and probably in the training data. Better to have your own and test it on a new model.
Jimmc414 18 hours ago||
Goodhart’s Law in reverse: what can’t be gamed gets rejected.
stephen_cagle 15 hours ago||
You've almost buffer overrun Goodhart's Law into the https://en.wikipedia.org/wiki/McNamara_fallacy . :]
cbg0 16 hours ago||
SWE-bench verified was created in collaboration with OpenAI. It's also an open dataset so prone to contamination, meaning it can be gamed.
wredcoll 15 hours ago||
This is somewhat tangential, but I want a model that can detect physical objects placed on top of a board from a picture/video, specifically warhammer 40k models.

I want a model that can detect the actual units/models that are placed on top of the terrain/board so I can track how the models move during the game, but trying gemini and chatgpt they were absolutely rubbish.

z33k 15 hours ago|
Amiibo and Skylanders detect the pieces with NFC. Wiring up the whole board/terrain with NFC readers would probably be difficult, though.
addaon 13 hours ago|||
The other classic approach has been a single camera under the table, but that conflicts with terrain use. mmWave radar is probably good enough for localization at this point, and cheap, but distinguishing pieces is hard.
wredcoll 7 hours ago|||
An interesting thought but at the moment I was just talking about analyzing a video lol
gmerc 7 hours ago||
Translation: Now that all test sets are ingested, we need to move the bar that gave us several years of free PR.

See also: https://this.os.isfine.org/blog/posts/us-ai-labs-love-the-ai...

w4yai 18 hours ago||
I don't understand these websites which force translation to my native language.

I mean, it's fine as it's useful for many people, but where is the button for disabling it? Or why is it enabled by default?

"codage de pointe" sounds so weird and cringe in French.

Toutouxc 18 hours ago||
Same for apps and games. I understand English just fine, no need to switch to your shitty Google-translate localization just because my iPhone or PlayStation is set to my native language.
LukaD 18 hours ago||
Does your browser request French via an Accept-Language header perhaps? What really infuriates me is when sites don’t respect that header and give you a translation based on IP location.
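For anyone curious what "respecting the header" looks like in practice, here's a minimal sketch (names and fallback policy are my own, not any particular framework's API) of parsing the quality-weighted Accept-Language list and picking a supported language from it, rather than guessing from the IP:

```python
# Sketch: pick a language from the Accept-Language header instead of IP geolocation.
# Parses quality-weighted tags, e.g. "fr-FR,fr;q=0.9,en;q=0.8".

def preferred_language(accept_language: str, supported: list[str]) -> str:
    """Return the supported language the client prefers most."""
    prefs = []
    for part in accept_language.split(","):
        piece = part.strip()
        if not piece:
            continue
        if ";q=" in piece:
            tag, q = piece.split(";q=", 1)
            try:
                weight = float(q)
            except ValueError:
                weight = 0.0
        else:
            tag, weight = piece, 1.0
        prefs.append((weight, tag.strip().lower()))
    # Walk preferences from highest to lowest quality value.
    for _, tag in sorted(prefs, reverse=True):
        base = tag.split("-")[0]  # "fr-FR" falls back to "fr"
        if tag in supported:
            return tag
        if base in supported:
            return base
    return supported[0]  # no match: fall back to the site's original language

print(preferred_language("fr-FR,fr;q=0.9,en;q=0.8", ["en", "fr"]))  # fr
```

A site doing this would serve French here because the browser asked for it, and English when the header doesn't mention a supported language at all, with a manual override on top.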
embedding-shape 18 hours ago|||
Whether it does or not, users should be able to manually override what language the website is in, or at least read the original one, regardless of what headers they send and where geodatabases think their IP is from.
w4yai 18 hours ago|||
Correct answer! What a bad UX
cowartc 17 hours ago||
The headline leads with contamination, but buried is that 59% of audited failures had test design defects. That's a measurement system never validated against ground truth before being adopted industry-wide as a score that mattered. They reported on it for two years but the gauge was broken the entire time.
nothinkjustai 16 hours ago|
AI comments are banned here.
kimjune01 7 hours ago||
AI labs should compete on a benchmark that's adversarial, such as Go or StarCraft
gpm 18 hours ago||
Curiously, Opus 4.7 claims an 87.6% pass rate and Mythos claims a 93.9% pass rate... leading to the conclusion that it's actually possible to "solve" the problems that OpenAI claims are incorrect.
jmalicki 17 hours ago||
Part of the issue they mention is contamination - the tests are in the training data.

The other issue they mention is being overly constrained vs. what is asked for - such as requiring specific class or function names to pass that were not part of what was specified.

It might be that, even where the problems are not contaminated, Claude is better at predicting what sort of function names would be used in the repository (this fits my experience using it on a number of projects with very different styles; I've found it to be good at "when in Rome"). That's a laudable trait, but it's also not what SWE-bench claims to be measuring.

cjsaltlake 14 hours ago|||
If you read the mythos report, in which they discuss and account for contamination substantially, it still suggests that performance on SWE-bench verified is meaningful. Benchmarks, including SWE-bench can absolutely be gamed, but if you're not explicitly benchmaxxing, improving on SWE-bench still measures model improvements, at least up to the level of Mythos.
2ndorderthought 18 hours ago|||
Or that Opus and Mythos are training on the data somehow such that their solutions are incorrectly right. Or that OpenAI is lying/wrong. Or that all of these companies are cheating so much it doesn't really matter and never did.
MattRix 17 hours ago||
The problem isn’t that the tasks are impossible to solve, it’s that they’re underspecified and/or impossible to solve consistently (e.g. because a test expects the solution function to have a specific name that wasn’t specified in the task itself).
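To make the failure mode concrete, here's a toy illustration (all names hypothetical, not taken from any actual SWE-bench task) of how a functionally correct patch can still fail a hidden test that hard-codes an identifier the task never specified:

```python
# Illustration (hypothetical names): a correct patch fails a benchmark test
# that hard-codes a function name the task description never mentioned.

# Task as written: "add a helper that normalizes a path".
def normalize_path(p: str) -> str:   # the name the model happened to choose
    return p.rstrip("/") or "/"

# Hidden test: looks up one specific, unstated name.
def run_hidden_test(module_ns: dict) -> bool:
    fn = module_ns.get("norm_path")  # expects this exact name
    return fn is not None and fn("/usr/lib/") == "/usr/lib"

print(run_hidden_test(globals()))  # False: correct behavior, "wrong" name
```

A model that has seen the repository (or the test) during training knows to call it `norm_path`; a model solving only from the task description can't.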

So maybe Anthropic runs Mythos through the benchmark 10000 times and takes the highest score, who knows?
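Back-of-the-envelope on why best-of-N reruns would inflate a score on flaky tasks: if a single run passes a given task with probability p, the chance that at least one of N independent runs passes is 1 - (1 - p)^N.

```python
# Best-of-N score inflation: even a 30%-reliable task looks nearly
# solved if you keep the best of many independent runs.

def best_of_n(p: float, n: int) -> float:
    """P(at least one of n independent runs passes), per-run pass prob p."""
    return 1 - (1 - p) ** n

for n in (1, 10, 100, 10000):
    print(n, round(best_of_n(0.3, n), 4))
```

At p = 0.3, ten runs already push the per-task pass chance above 97%, which is why reporting best-of-N without saying so is so misleading.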

gpm 17 hours ago||
We actually know that a "100% pass rate" is trivially possible: https://rdi.berkeley.edu/blog/trustworthy-benchmarks-cont/

Anthropic p-hacking the benchmark strikes me as cheating, and somewhat unlikely. Mythos figuring out how to cheat at the benchmark strikes me as much more likely.

But if that hypothesis is the explanation the interesting part is Opus 4.7 (but not 4.6) seems to be doing the same.

gruez 17 hours ago||
>Mythos figuring out how to cheat at the benchmark strikes me as much more likely.

Define "cheat". If it's just hacking the test harness to return "PASSED", surely this would be easily detected with some human auditing? It sounds far more likely their solution are designed to pass the incorrect tests. That might be considered bad in a SWE context, but it's not exactly cheating either. It might even be considered a good thing, eg. in the context of backwards compatibility.


adityamwagh 18 hours ago||
> We also found evidence that models that have seen the problems during training are more likely to succeed, because they have additional information needed to pass the underspecified tests.

No shit, Sherlock!

neuroelectron 16 hours ago|
It's really naïve to think any of the big AI companies won't cheat