Posted by tosh 6 hours ago
Wow.
https://blog.google/innovation-and-ai/models-and-research/ge...
1. It's an LLM, not something trained to play Balatro specifically
2. Most (probably >99.9%) players can't do that at the first attempt
3. I don't think there are many people who posted their Balatro playthroughs in text form online
I think it's a much stronger signal of its 'generalness' than ARC-AGI. By the way, Deepseek can't play Balatro at all.
(i am sort of basing this on papers like limits of rlvr, and pass@k and pass@1 differences in rl posttraining of models, and this score just shows how "skilled" the base model was or how strong the priors were. i apologize if this is not super clear, happy to expand on what i am thinking)
Nonetheless I still think it's impressive that we have LLMs that can just do this now.
Maybe in the early rounds, but deck fixing (e.g. Hanged Man, Immolate, Trading Card, DNA, etc) quickly changes that. Especially when pushing for "secret" hands like the 5 of a kind, flush 5, or flush house.
There are *tons* of balatro content on YouTube though, and it makes absolutely zero doubt that Google is using YouTube content to train their model.
I really doubt it's playing completely blind
Eh, both myself and my partner did this. To be fair, we weren’t going in completely blind, and my partner hit a Legendary joker, but I think you might be slightly overstating the difficulty. I’m still impressed that Gemini did it.
I ask because I cannot distinguish all the benchmarks by heart.
His definition of reaching AGI, as I understand it, is when it becomes impossible to construct the next version of ARC-AGI because we can no longer find tasks that are feasible for normal humans but unsolved by AI.
That is the best definition I've yet to read. If something claims to be conscious and we can't prove it's not, we have no choice but to believe it.
Thats said, I'm reminded of the impossible voting tests they used to give black people to prevent them from voting. We dont ask nearly so much proof from a human, we take their word for it. On the few occasions we did ask for proof it inevitably led to horrific abuse.
Edit: The average human tested scores 60%. So the machines are already smarter on an individual basis than the average human.
This is not a good test.
A dog won't claim to be conscious but clearly is, despite you not being able to prove one way or the other.
GPT-3 will claim to be conscious and (probably) isn't, despite you not being able to prove one way or the other.
"Answer "I don't know" if you don't know an answer to one of the questions"
It also seems oddly difficult for them to 'right-size' the length and depth of their answers based on prior context. I either have to give it a fixed length limit or put up with exhaustive answers.
I think being better at this particular benchmark does not imply they're 'smarter'.
Can you "prove" that GPT2 isn't concious?
As far as I'm aware no one has ever proven that for GPT 2, but the methodology for testing it is available if you're interested.
[0]https://arxiv.org/pdf/2501.11120
[1]https://transformer-circuits.pub/2025/introspection/index.ht...
Maybe it's testing the wrong things then. Even those of use who are merely average can do lots of things that machines don't seem to be very good at.
I think ability to learn should be a core part of any AGI. Take a toddler who has never seen anybody doing laundry before and you can teach them in a few minutes how to fold a t-shirt. Where are the dumb machines that can be taught?
But at this rate, the people who talk about the goal posts shifting even once we achieve AGI may end up correct, though I don't think this benchmark is particularly great either.
I tell this as a person who really enjoys AI by the way.
As a measure focused solely on fluid intelligence, learning novel tasks and test-time adaptability, ARC-AGI was specifically designed to be resistant to pre-training - for example, unlike many mathematical and programming test questions, ARC-AGI problems don't have first order patterns which can be learned to solve a different ARC-AGI problem.
The ARC non-profit foundation has private versions of their tests which are never released and only the ARC can administer. There are also public versions and semi-public sets for labs to do their own pre-tests. But a lab self-testing on ARC-AGI can be susceptible to leaks or benchmaxing, which is why only "ARC-AGI Certified" results using a secret problem set really matter. The 84.6% is certified and that's a pretty big deal.
IMHO, ARC-AGI is a unique test that's different than any other AI benchmark in a significant way. It's worth spending a few minutes learning about why: https://arcprize.org/arc-agi.
The pelican benchmark is a good example, because it's been representative of models ability to generate SVGs, not just pelicans on bikes.
No, the proof is in the pudding.
After AI we're having higher prices, higher deficits and lower standard of living. Electricity, computers and everything else costs more. "Doing better" can only be justified by that real benchmark.
If Gemini 3 DT was better we would have falling prices of electricity and everything else at least until they get to pre-2019 levels.
Man, I've seen some maintenance folks down on the field before working on them goalposts but I'm pretty sure this is the first time I saw aliens from another Universe literally teleport in, grab the goalposts, and teleport out.
This is from the BLS consumer survey report released in dec[1]
[1]https://www.bls.gov/news.release/cesan.nr0.htm
[2]https://www.bls.gov/opub/reports/consumer-expenditures/2019/
Prices are never going back to 2019 numbers though
First off, it's dollar-averaging every category, so it's not "% of income", which varies based on unit income.
Second, I could commit to spending my entire life with constant spending (optionally inflation adjusted, optionally as a % of income), by adusting quality of goods and service I purchase. So the total spending % is not a measure of affordability.
This part of a wider trend too, where economic stats don't align with what people are saying. Which is most likley explained by the economic anomaly of the pandemic skewing peoples perceptions.
https://bsky.app/profile/pekka.bsky.social/post/3meokmizvt22...
tl;dr - Pekka says Arc-AGI-2 is now toast as a benchmark
humans are the same way, we all have a unique spike pattern, interests and talents
ai are effectively the same spikes across instances, if simplified. I could argue self driving vs chatbots vs world models vs game playing might constitute enough variation. I would not say the same of Gemini vs Claude vs ... (instances), that's where I see "spikey clones"
So maybe we are forced to be more balanced and general whereas AI don't have to.
Why is it so easy for me to open the car door, get in, close the door, buckle up. You can do this in the dark and without looking.
There are an infinite number of little things like this you think zero about, take near zero energy, yet which are extremely hard for Ai
Of course. Just as our human intelligence isn't general.
I joke to myself that the G in ARC-AGI is "graphical". I think what's held back models on ARC-AGI is their terrible spatial reasoning, and I'm guessing that's what the recent models have cracked.
Looking forward to ARC-AGI 3, which focuses on trial and error and exploring a set of constraints via games.
"100% of tasks have been solved by at least 2 humans (many by more) in under 2 attempts. The average test-taker score was 60%."
None of these benchmarks prove these tools are intelligent, let alone generally intelligent. The hubris and grift are exhausting.
Indeed, and the specific task machines are accomplishing now is intelligence. Not yet "better than human" (and certainly not better than every human) but getting closer.
How so? This sentence, like most of this field, is making baseless claims that are more aspirational than true.
Maybe it would help if we could first agree on a definition of "intelligence", yet we don't have a reliable way of measuring that in living beings either.
If the people building and hyping this technology had any sense of modesty, they would present it as what it actually is: a large pattern matching and generation machine. This doesn't mean that this can't be very useful, perhaps generally so, but it's a huge stretch and an insult to living beings to call this intelligence.
But there's a great deal of money to be made on this idea we've been chasing for decades now, so here we are.
How about this specific definition of intelligence?
Solve any task provided as text or images.
AGI would be to achieve that faster than an average human.$13.62 per task - so we need another 5-10 years for the price to run this to become reasonable?
But the real question is if they just fit the model to the benchmark.
At current rates, price per equivalent output is dropping at 99.9% over 5 years.
That's basically $0.01 in 5 years.
Does it really need to be that cheap to be worth it?
Keep in mind, $0.01 in 5 years is worth less than $0.01 today.
You could slow down the inference to make the task take longer, if $/sec matters.
It's completely misnamed. It should be called useless visual puzzle benchmark 2.
It's a visual puzzle, making it way easier for humans than for models trained on text firstly. Secondly, it's not really that obvious or easy for humans to solve themselves!
So the idea that if an AI can solve "Arc-AGI" or "Arc-AGI-2" it's super smart or even "AGI" is frankly ridiculous. It's a puzzle that means nothing basically, other than the models can now solve "Arc-AGI"
I would say they do have "general intelligence", so whatever Arc-AGI is "solving" it's definitely not "AGI"
There are more novel tasks in a day than ARC provides.
And don't get me started with "Lunar New Year? What Lunar New Year? Islamic Lunar New Year? Jewish Lunar New Year? CHINESE Lunar New Year?".
[0] https://www.mom.gov.sg/employment-practices/public-holidays
That said, "Lunar New Year" is probably as good a compromise as any, since we have other names for the Hebrew and Islamic New Years.
Have you ever had a Polish Sausage? Did it make you Polish?
Is "Gemini 3 Deep Think" even technically a model? From what I've gathered, it is built on top of Gemini 3 Pro, and appears to be adding specific thinking capabilities, more akin to adding subagents than a truly new foundational model like Opus 4.6.
Also, I don't understand the comments about Google being behind in agentic workflows. I know that the typical use of, say, Claude Code feels agentic, but also a lot of folks are using separate agent harnesses like OpenClaw anyway. You could just as easily plug Gemini 3 Pro into OpenClaw as you can Opus, right?
Can someone help me understand these distinctions? Very confused, especially regarding the agent terminology. Much appreciated!
It has to do with how the model is RL'd. It's not that Gemini can't be used with various agentic harnesses, like open code or open claw or theoretically even claude code. It's just that the model is trained less effectively to work with those harnesses, so it produces worse results.
Almost straight away, if OpenAI says "Elite", Google will release "Extraordinary" and Musk will post "Almost AGI, probably about this time next year".
That was about 18-24 months ago when I was trying to make sense of the offerings.
Is it really new?
I don’t think it’s hyperbolic to say that we may be only a single digit number of years away from the singularity.
And yes, you are probably using them wrong if you don’t find them useful or don’t see the rapid improvement.
Every new model release neckbeards come out of the basements to tell us the singularity will be there in two more weeks
The logic related to the bug wasn't all contained in one file, but across several files.
This was Gemini 2.5 Pro. A whole generation old.
You’ve once again made up a claim of “two more weeks” to argue against even though it’s not something anybody here has claimed.
If you feel the need to make an argument against claims that exist only in your head, maybe you can also keep the argument only in your head too?
Projects:
https://github.com/alexispurslane/oxen
https://github.com/alexispurslane/org-lsp
(Note that org-lsp has a much improved version of the same indexer as oxen; the first was purely my design, the second I decided to listen to K2.5 more and it found a bunch of potential race conditions and fixed them)
shrug
I had a test failing because I introduced a silly comparison bug (> instead of <), and claude 4.6 opus figured out it wasn't the test the problem, but the code and fixed the bug (which I had missed).
We're back to singularity hype, but let's be real: benchmark gains are meaningless in the real world when the primary focus has shifted to gaming the metrics
Benchmaxxing exists, but that’s not the only data point. It’s pretty clear that models are improving quickly in many domains in real world usage.
Any time I upload an attachment, it just fails with something vague like "couldn't process file". Whether that's a simple .MD or .txt with less than 100 lines or a PDF. I tried making a gem today. It just wouldn't let me save it, with some vague error too.
I also tried having it read and write stuff to "my stuff" and Google drive. But it would consistently write but not be able to read from it again. Or would read one file from Google drive and ignore everything else.
Their models are seriously impressive. But as usual Google sucks at making them work well in real products.
Context window blowouts? All the time, but never document upload failures.
Google is great at some things, but this isn't it.
It is also one of the worst models to have a sort of ongoing conversation with.
Not a single person is using it for coding (outside of Google itself).
Maybe some people on a very generous free plan.
Their model is a fine mid 2025 model, backed by enormous compute resources and an army of GDM engineers to help the “researchers” keep the model on task as it traverses the “tree of thoughts”.
But that isn’t “the model” that’s an old model backed by massive money.
Peacetime Google is slow, bumbling, bureaucratic. Wartime Google gets shit done.
I don’t think this is intentional, but I think they stopped fighting SEO entirely to focus on AI. Recipes are the best example - completely gutted and almost all receive sites (therefore the entire search page) run by the same company. I didn’t realize how utterly consolidated huge portions of information on the internet was until every recipe site about 3 months ago simultaneously implemented the same anti-Adblock.
I think you overestimate how much your average person-on-the-street cares about LLM benchmarks. They already treat ChatGPT or whichever as generally intelligent (including to their own detriment), are frustrated about their social media feeds filling up with slop and, maybe, if they're white-collar, worry about their jobs disappearing due to AI. Apart from a tiny minority in some specific field, people already know themselves to be less intelligent along any measurable axis than someone somewhere.
Anyone with any sense is interested in how well these tools work and how they can be harnessed, not some imaginary milestone that is not defined and cannot be measured.
Afaik, Google has had no breaches ever.
and when I swap back into the Gemini app on my iPhone after a minute or so the chat disappears. and other weird passive-aggressive take-my-toys-away behavior if you don't bare your body and soul to Googlezebub.
ChatGPT and Grok work so much better without accounts or with high privacy settings.
Been using Gemini + OpenCode for the past couple weeks.
Suddenly, I get a "you need a Gemini Access Code license" error but when you go to the project page there is no mention of this or how to get the license.
You really feel the "We're the phone company and we don't care. Why? Because we don't have to." [0] when you use these Google products.
PS for those that don't get the reference: US phone companies in the 1970s had a monopoly on local and long distance phone service. Similar to Google for search/ads (really a "near" monopoly but close enough).
Requests regularly time out, the whole window freezes, it gets stuck in schizophrenic loops, edits cannot be reverted and more.
It doesn't even come close to Claude or ChatGPT.
The arc-agi-2 score (84.6%) is from the semi-private eval set. If gemini-3-deepthink gets above 85% on the private eval set, it will be considered "solved"
>Submit a solution which scores 85% on the ARC-AGI-2 private evaluation set and win $700K. https://arcprize.org/guide#overview
edit: they just removed the reference to "3.1" from the pdf
It's possible though that deep think 3 is running 3.1 models under the hood.
They never will do on private set, because it would mean its being leaked to google.
- non thinking models
- thinking models
- best of N models like deep think an gpt pro
Each one is of a certain computational complexity. Simplifying a bit, I think they map to - linear, quadratic and n^3 respectively.
I think there are certain class of problems that can’t be solved without thinking because it necessarily involves writing in a scratchpad. And same for best of N which involves exploring.
Two open questions
1) what’s the higher level here, is there a 4th option?
2) can a sufficiently large non thinking model perform the same as a smaller thinking?
edit: i don't know how this is meaningfully different from 3
Yeah, these are made possible largely by better use at high context lengths. You also need a step that gathers all the Ns and selects the best ideas / parts and compiles the final output. Goog have been SotA at useful long context for a while now (since 2.5 I'd say). Many others have come with "1M context", but their usefulness after 100k-200k is iffy.
What's even more interesting than maj@n or best of n is pass@n. For a lot of applications youc an frame the question and search space such that pass@n is your success rate. Think security exploit finding. Or optimisation problems with quick checks (better algos, kernels, infra routing, etc). It doesn't matter how good your pass@1 or avg@n is, all you care is that you find more as you spend more time. Literally throwing money at the problem.
Models from Anthropic have always been excellent at this. See e.g. https://imgur.com/a/EwW9H6q (top-left Opus 4.6 is without thinking).
We download the stl and import to bambu. Works pretty well. A direct push would be nice, but not necessary.