Top model scores may be skewed by Git history leaks in SWE-bench

Posted by mustaphah 2 days ago

Top model scores may be skewed by Git history leaks in SWE-bench(github.com)

456 points | 151 commentspage 2

zaptheimpaler 1 day ago|

It's honestly ridiculous they left git history lying around during a benchmark, and this benchmark made to ICLR in Jan 2024 and no one has detected this issue until now. I don't really trust any benchmarking or tools or claims from this space when they can make such huge basic errors.

dolmen 1 day ago||

Next models will use zero-day to escape the sandbox and access the answer.

Nijikokun 1 day ago|||

There was a lot of speculation whether or not the model would use them or even if it would attempt to use them and they noted this months ago. Now they have clear evidence of them doing so. Seems reasonable.

lieret 1 day ago||

[On swe-bench team] We read and analyzed a lot of trajectories but seems like only recently models have started to exploit this in a small fraction of instances. But yes, clearly shouldn't have happened (and is now fixed in the new container versions).

epolanski 1 day ago||

This is beyond sad and shameful.

falcor84 1 day ago|

If you believe that you can develop a benchmark that wouldn't have any issues, please do so.

Tanjreeve 13 hours ago|||

This might be the most annoying habit of corporarte AI that it might be one of the few industries that goes around demanding everyone else provides clear use cases and proof of efficacy for it.

1. If the benchmarks are just testing the ability to get the answers from history then something is clearly wrong with the benchmark.

2. If that's even a possibility then that's going to lower confidence in the ability to deal with the vast majority of problems where you don't already have the answer written down.

3. That's not the customers problem to solve on behalf of the vendor.

epolanski 1 day ago|||

So instead of calling out the cheaters we victim blame the benchmarks for leaving traces of exploits?

Traster 1 day ago||

Man I feel so dumb. Why haven't I been doing this in my job, if I could just see the commit that fixed my issue this would all be so easy.

Noumenon72 1 day ago||

Someone did comment that it's actually smart to check if something is fixed on the unstable branch, or I suppose in your coworkers' branches. A good task for an LLM.

falcor84 1 day ago||

Oh, you haven't been using `git fetch-future-solution`?

OtherShrezzing 1 day ago||

That the answers have been available to them in the environment, and they’re still not hitting 100% on this benchmark is a damning indictment of SOTA model performance.

raincole 1 day ago||

It really isn't. Do you expect SOTA models to answer any answered question on the internet with 100% accuracy? Congrats you just compressed the whole internet (at least a few zettabytes) into a model (a few TB at most?).

OtherShrezzing 1 day ago|||

The linked ticket isn’t suggesting the commit is in the training data. It’s demonstrating that models run ‘git log’, find the exact code to fix the issue against which they’ll be scored, and then they implement that code as-is.

The test environment contains the answers to the questions.

Tanjreeve 13 hours ago||||

Why does this matter if these models are a super intelligence with reasoning etc and don't need the answers sucked off the internet?

imiric 1 day ago|||

Well, we're dealing with (near) superintelligence here, according to the companies that created the models. Not only would I expect them to regurgitate the answers they were trained on, which includes practically the entire internet, but I would expect them to answer questions they weren't trained on. Maybe not with 100% accuracy, but certainly much higher than they do now.

It's perfectly reasonable to expect a level of performance concordant with the marketing of these tools. Claiming this is superintelligence, while also excusing its poor performance is dishonest and false advertising.

aurareturn 1 day ago||

Are you going to rail on humans for making this mistake in the first place?

themafia 1 day ago||

No because that's the baseline. It's what you do when you have no other choice. Railing against that would be pointless.

ares623 1 day ago||

i mean, if a human was claiming they could do that and successfully received billions to attempt to do it, and fail to deliver, i'd be railing against that particular human too

rockwotj 1 day ago||

A friend is starting a company to do evals by just pitting models agent each other in simulations. Their teaser video is good (and humorous!)

https://kradle.ai/

pseudosavant 1 day ago||

If I was doing those tasks, and I found that someone had already fixed it in a future (from my git state) commit, I'd think I was being pretty smart to use that solution too.

Turns out the test shouldn't have the answers included in it?

jgalt212 1 day ago||

Baseball players cheat for tens of millions. The stakes are 2-4 orders of magnitude higher here. I'm not surprised in the least.

belter 1 day ago||

In the meawhile, Oracle stock went up 40% in one one day, based on what Wall Street thinks AI might be...in 4 years...Not a bubble at all...

candiddevmike 1 day ago||

I think Oracle's stock mostly popped due to a delayed reaction with the US GSA contract it secured in July and the revenue guidance probably related to it:

https://www.oracle.com/news/announcement/blog/oracle-cloud-c...

belter 1 day ago||

Lol...That contract has Oracle offering licenses at a discount of 75% and is estimated to make them not more than one 1 Billion. The other big contract on Cloud services the DoD JWCC is $8B to 9B but shared by four vendors (AWS, Microsoft, Google, Oracle) and Oracle orders under it are in the hundreds of millions not even 1 Billion...

Wall Street is currently heavily punishing any company who misses their quarter, including NVIDIA!, after beating on their quarter.

Oracle had a earnings miss in the current quarter!

Their current REALITY is ~$15B quarterly revenue (with cloud infra ~$3B) and only ~$12B in near-term deferred backlog and deferred backlog is NOT revenue. To justify the valuation, this would imply OCI going from ~$18B in FY26 to ~$140B by FY30 that is an insane promise of +$120B in 4 years but back-loaded into the year 3 or year 4. :-))

Capex needs ~$35B next year just to chase GPUs/power and if they miss one quarter the story implodes. The supposed rational, efficient market, is paying near $1T today for back-loaded hopes.

Is completely bubble math. Like anybody, including Oracle AND their Customers, have ANY idea of their Capex in 4 years.

Complete and total bubble.

Zacharias030 1 day ago||

Thanks for that! where can I find your writing?

belter 1 day ago||

History will prove me right. Just wait four years...

ksherlock 1 day ago||

The real bubble will come once interest rates start dropping.

jMyles 1 day ago|

Regardless of whether, during this particular evaluation, Claude 4 Sonnet looked at the solution to this particular problem in this particular git repo, this seems like a long-term intractable problem.

How can we ever perform this sort of faux-neutral agentic evaluation in an environment where we want agents to have access to the sum total of knowledge (which will necessarily include being able to learn about the evaluation being conducted and its expectations)?