Posted by pseudolus 4/8/2025

Meta got caught gaming AI benchmarks(www.theverge.com)
347 points | 161 comments
floppiplopp 4/9/2025|
"I tested myself on a subset of my training data and found that I am the greatest!"
codingwagie 4/8/2025||
The truth is that the vast majority of FAANG engineers making high six figures are only good at deterministic work. They can't produce new things, so Meta and Google struggle to compete when actual merit matters and the solutions can't just be brute-forced. Inside these companies, the massive tech systems they've built are generally terrible, but they pile on legions of engineers to fix the problems.

This is the culture at Meta hurting them: they are paying "AI VPs" millions of dollars to go to status meetings and collect dates for when these models will be done. Meanwhile, the DeepSeek R1 team has a flat hierarchy with engineers who actually understand low-level computing.

It's making a mockery of big tech, and it's why startups exist. Big-company employees rise through the ranks by building skill sets other than producing true economic value.

jjani 4/8/2025||
> They can't produce new things, and so Meta and Google are struggling to compete when actual merit matters, and they can't just brute-force the solutions.

You haven't been keeping up. Less than two weeks ago, Google released a model that has crushed the competition, clearly SotA, and it's currently effectively free for personal use.

Gemini 2.0 was already good, people just weren't paying attention. In fact 1.5 pro was already good, and ironically remains the #1 model at certain very specific tasks, despite being set for deprecation in September.

Google just suffered from their completely botched initial launch way back when (remember Bard?), rushed out before the product was anywhere near ready, making them look like a bunch of clowns compared to e.g. OpenAI. That left a lasting impression on those who don't devote significant time to keeping up with newer releases.

codingwagie 4/8/2025||
Gemini 2.5 Pro isn't good, and if you think it is, you aren't using LLMs correctly. The model gets crushed by o1 pro and Sonnet 3.7 thinking. Build a large contextual prompt (> 50k tokens) with a ton of code and see how badly it does. I cancelled my Gemini subscription.
lerchmo 4/8/2025|||
Your experience doesn't align with my experience or with this benchmark: https://aider.chat/docs/leaderboards/. o1 pro is good, but I would rather do 20 cycles on Gemini 2.5 than wait for o1 pro to return.
int_19h 4/8/2025||||
I have just watched Sonnet 3.7 vs Gemini 2.5 solving the same task (fix a bug end-to-end) side by side, and Sonnet hallucinated far worse and repeatedly got stuck in dead-ends requiring manual rescue. OTOH Gemini understood the problem based on bug description and code from the get go, and required minimal guidance to come up with a decent solution and implement it.
jjani 4/8/2025|||
I have, dozens of times, and it's generally better than 3.7. Especially with more context, it's less forgetful. o1-pro is absurdly expensive and slow; good luck using that with tools. Virtually all benchmarks, including less-gamed ones such as Aider's, show the same. WebLM still has 3.7 ahead, with Sonnet always having been particularly strong at web development, but even there 2.5 Pro is miles in front of any OpenAI model.

Gemini subscription? Surely if you're "using LLMs correctly" you'd have been using the APIs for everything anyway. Subscriptions are generally for non-techy consumers.

In any case, just straight up saying "it isn't good" is absurd, even if you personally prefer others.

kylebyte 4/8/2025|||
The problem is less that those high level engineers are only good at deterministic work and more that they're only rewarded for deterministic work.

There is no system to pitch an idea as opening new frontiers - all ideas must be able to optimize some number that leadership has already been tricked into believing is important.

SubiculumCode 4/8/2025|||
A whole lot of opinion there, not a whole lot of evidence.
codingwagie 4/8/2025||
Evidence is a decade inside these companies, watching the circus
danjl 4/8/2025||
"I'm not bitter! No chip on my shoulder."
codingwagie 4/8/2025||
bitter about what? I'm a long time employee
brcmthrowaway 4/8/2025||
What company?
gessha 4/8/2025||
Next on Matt Levine's newsletter: Is Meta fudging stock-valuation-correlated metrics? Is this securities fraud?
benhill70 4/8/2025||
Sarcasm doesn't translate well in text. Please, elaborate.
camjw 4/8/2025||
Matt Levine has a common refrain which is that basically everything is securities fraud. If you were an investor who invested on the premise that Meta was good at AI and Zuck knowingly put out a bad model, is that securities fraud? Matt Levine will probably argue that it could be in a future edition of Money Stuff (his very good newsletter).
jjani 4/8/2025|||
The "everything is securities fraud" meme is really unfortunate, not quite as bad as the "fiduciary duty means execs have to chase short-term profit" myth, but still harmful.

It's only because lying ("puffery") about everything has become the norm in corporate America that almost all listed companies do indeed commit securities fraud. If they went back to being honest businessmen, there would be no more securities fraud: just stop claiming things that aren't true. This is a very real option they could take. If they don't, then they're willingly and knowingly committing securities fraud. But the meme makes it sound as if it's unavoidable, when it's anything but.

NickC25 4/8/2025|||
Is it securities fraud? Sort of.

If Mark, both through Meta and through his own resources, has the capital to hire and retain the best AI researchers and teams, and claims he's doing so, but puts out a model that sucks, he's liable. It's probably not directly fraud, but if he claims he's trying to compete with Google or Microsoft or Apple or whoever, yet doesn't deploy a comparable amount of resources, capital, or people, and doesn't explain why, it could (a stretch) be securities fraud... I think.

genewitch 4/8/2025||
And the fine for that? Probably 0.001% of revenue. If that.
myrmidon 4/8/2025|||
If that's what it takes to get some honesty out of corporate, is it such a bad idea? Why?
grvdrm 4/8/2025|||
Nailed it!
seydor 4/8/2025||
tech companies competing over something that is losing them money is the most bizarre spectacle yet.
bko 4/8/2025||
I think Meta sees AI and VR/AR as a platform. They got left behind on the mobile platform and forever have to contend with Apple semi-monopoly. They have no control and little influence over the ecosystem. It's an existential threat to them.

They have vowed not to make that mistake again so are pushing for an open future that won't be dominated by a few companies that could arbitrarily hurt Meta's business.

That's the stated rationale at least and I think it more or less makes sense

fullshark 4/8/2025|||
Makes sense, except for the fact that they leaked the Llama weights by accident and had to reverse-engineer that explanation.
jsheard 4/8/2025||||
I wouldn't call what Meta is doing with VR/AR an "open future", it's pretty much the exact same playbook that Google and Apple used for their platforms. The only difference is Meta gets to be the landlord this time.
alex1138 4/8/2025||||
I'm in favor of whatever semi-monopoly enables fine-grained permissions so Facebook can't slurp WhatsApp contacts en masse (antitrust?)
esafak 4/8/2025|||
They stated that?
asveikau 4/8/2025|||
Feels very late 90s.

The old joke is they're losing money on every sale but they'll make up for it in volume.

dfedbeef 4/8/2025||
*chef's kiss* perfect
tananaev 4/8/2025|||
The reason is simple. All tech companies have very high valuations. They have to sell investors a dream to justify that valuation. They have to convince people that they have the next big thing around the corner.
FL33TW00D 4/8/2025|||
Plot big tech stock valuations with markers for successful OS model releases.
roughly 4/8/2025|||
https://en.wikipedia.org/wiki/Dollar_auction
abc-1 4/8/2025|||
Is it really losing them money if investors throw fistfuls of cash at them for it?
diggan 4/8/2025|||
Borderline conspiracy theory with an ounce of truth:

None of the models Meta has put out are actually open source (by any measure), and everyone who redistributes Llama models or any derivatives, or uses Llama models for their business, is on the hook for getting sued in the future under the terms and conditions people have been explicitly or implicitly agreeing to when they use or redistribute these models.

The fact that these Llama models have unfavorable proprietary terms that Meta doesn't act on today doesn't mean Meta won't act on them in the future. Maybe this has all been a play to get people into this position, so Meta can later start charging for the models or something else.

recursive 4/8/2025||
You never go full Oracle.
timewizard 4/8/2025|||
This is a signal that the sector is heavily monopolized.

This has happened many times before in US history.

fullshark 4/8/2025|||
Come on, you can do the critical thinking here to understand why these companies would want the best in class (open/closed) weight LLMs.
seydor 4/8/2025||
then why would they cheat?
SubiculumCode 4/8/2025|||
I didn't see evidence of cheating in the article. Having a slightly differently tuned version of Llama 4 is not the most dastardly thing that could be done. Everything else is insinuation.
fullshark 4/8/2025||||
Well, we'll see whether they suffer consequences for this and cheated too hard, but being perceived as best in class is arguably worth even more than actually being best in class, especially if differences in performance are hard to perceive anecdotally.

The goal is long term control over a technology's marketshare, as winner take all dynamics are in play here.

baby 4/8/2025|||
they're all cheating, see grok
nomel 4/8/2025||
Are you referring to this [1]?

> Critics have pointed out that xAI’s approach involves running Grok 3 multiple times and cherry-picking the best output while comparing it against single runs of competitor models.

[1] https://medium.com/@cognidownunder/the-hype-machine-gpt-4-5-...
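The cherry-picking described in that quote is just best-of-N sampling. A toy simulation (the 0.4 per-attempt success rate and everything else here is made up; only the mechanism matters) shows how far it can inflate a pass-rate-style benchmark score for the very same model:

```python
import random

random.seed(0)

P_CORRECT = 0.4  # hypothetical chance a single model run solves a problem

def solve_once():
    # Simulate one model attempt on one benchmark problem.
    return random.random() < P_CORRECT

def best_of_n(n):
    # Cherry-pick: count the problem as solved if ANY of n attempts succeeds.
    return any(solve_once() for _ in range(n))

problems = 2000
single = sum(solve_once() for _ in range(problems)) / problems
best8 = sum(best_of_n(8) for _ in range(problems)) / problems

print(f"single-run score: {single:.2f}")  # hovers near 0.40
print(f"best-of-8 score:  {best8:.2f}")   # hovers near 0.98, same model
```

Analytically the inflated score is 1 - (1 - 0.4)^8 ≈ 0.98, which is why comparing a best-of-N run against competitors' single runs is apples to oranges.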

moralestapia 4/8/2025||
LeCun making up results ... well he comes from Academia, so ...
alittletooraph2 4/8/2025||
I tried to make Studio Ghibli inspired images using presumably their new models. It was ass.
simonw 4/8/2025||
Llama is not an image generating model. Any interface that uses Llama and generates images is calling out to a separate image generator as a tool, like OpenAI used to do with ChatGPT and DALL-E up until a couple of weeks ago: https://simonwillison.net/2023/Oct/26/add-a-walrus/
echelon 4/8/2025||
GPT-4o image generation is the future of all image gen.

Every other player, from Black Forest Labs' Flux and Stability AI's Stable Diffusion to closed models like Ideogram and Midjourney, is on the path to extinction.

Image generation and editing must be multimodal. Full stop.

Google Imagen will probably be the first model to match the capabilities of 4o. I'm hoping one of the open weights labs or Chinese AI giants will release a model that demonstrates similar capabilities soon. That'll keep the race neck and neck.

minimaxir 4/8/2025|||
One very important distinction between image models is the implementation: 4o is autoregressive, slow, and extremely expensive.

Although the Ghibli trend is market validation, I suspect that competitors may not want to copy it just yet.

JamesBarney 4/8/2025|||
Extremely expensive in what sense? In that it costs $0.03 instead of $0.00003? Yeah, it's relatively far more expensive than other solutions, but from an absolute standpoint it's still very cheap for the vast majority of use cases. And it's a LOT better.
svachalek 4/8/2025||
DALL-E is already 4-8 cents per image. AFAIK this isn't in the API yet, but I wouldn't be surprised if it's $1 or more.
echelon 4/8/2025|||
> 4o is autogressive, slow, and extremely expensive.

If you factor in the amount of time wasted with prompting and inpainting, it's extremely well worth it.

HunterX1 4/8/2025|
Impressive results from Meta's Llama adapting to various benchmarks. However, gaming performance seems lackluster compared to specialized models like Alpaca. It raises questions about the viability of large language models for complex, interactive tasks like gaming without more targeted fine-tuning. Exciting progress nonetheless!