Posted by tosh 7 hours ago

Gemini 3 Deep Think (blog.google)
https://x.com/GoogleDeepMind/status/2021981510400709092

https://x.com/fchollet/status/2021983310541729894

522 points | 313 comments
anematode 2 hours ago|
It found a small but nice optimization in Stockfish: https://github.com/official-stockfish/Stockfish/pull/6613

Previous models, including Claude Opus 4.6, have generally produced a lot of noise: things that the compiler already reliably optimizes out.

Metacelsus 7 hours ago||
According to benchmarks in the announcement, it's healthily ahead of Claude 4.6. I guess they didn't test ChatGPT 5.3, though.

Google has definitely been pulling ahead in AI over the last few months. I've been using Gemini and finding it's better than the other models (especially for biology where it doesn't refuse to answer harmless questions).

CuriouslyC 5 hours ago||
Google is way ahead in visual AI and world modelling. They're lagging hard in agentic AI and autonomous behavior.
throwup238 7 hours ago|||
The general-purpose ChatGPT 5.3 hasn’t been released yet, just 5.3-codex.
neilellis 6 hours ago|||
It's ahead in raw power but not in function. Like it's got the world's fastest engine but only one gear! Trouble is, some benchmarks only measure horsepower.
NitpickLawyer 6 hours ago||
> Trouble is, some benchmarks only measure horsepower.

IMO it's the other way around. Benchmarks only measure applied horsepower on a set plane, with no friction, and your elephant is a point sphere. Goog's models have always punched above what the benchmarks said in real-world use @ high context. They don't focus on "agentic this" or "specialised that", but the raw models, with good guidance, are workhorses. I don't know any other models where you can throw lots of docs at them and get proper context following, with data extraction from wherever the data sits to wherever you need it.

scarmig 4 hours ago|||
> especially for biology where it doesn't refuse to answer harmless questions

Usually, when you decrease false positive rates, you increase false negative rates.

Maybe this doesn't matter for models at their current capabilities, but if you believe that AGI is imminent, a bit of conservatism seems responsible.
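
A toy sketch of that tradeoff in Python, with made-up risk-score distributions (nothing here is from a real safety system): the filter refuses anything scoring above a threshold, and moving the threshold trades one error rate for the other.

    import random
    random.seed(0)
    # hypothetical risk scores a refusal filter assigns to queries
    harmless = [random.gauss(0.3, 0.15) for _ in range(10_000)]
    harmful = [random.gauss(0.7, 0.15) for _ in range(10_000)]
    for t in (0.4, 0.5, 0.6):
        fpr = sum(s >= t for s in harmless) / len(harmless)  # harmless refused
        fnr = sum(s < t for s in harmful) / len(harmful)     # harmful answered
        print(f"refuse if score >= {t}: FPR={fpr:.1%}, FNR={fnr:.1%}")

Raising the threshold lowers the false-positive rate (fewer refused harmless questions) but raises the false-negative rate; only a better underlying classifier improves both at once.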

Davidzheng 6 hours ago|||
I gather that 4.6's strengths are in long-context agentic workflows? At least compared to Gemini 3 Pro Preview, Opus 4.6 seems to have a lot of advantages.
verdverm 6 hours ago||
It's a giant game of leapfrog; shift or stretch the time window a bit and they all look equivalent.
nkzd 5 hours ago|||
Google's models and CLI harness feel behind in agentic coding compared to OpenAI and Anthropic.
simianwords 7 hours ago||
The comparison should be with GPT 5.2 Pro, which has been used successfully to solve open math problems.
aliljet 4 hours ago||
The problem here is that it looks like this was released with almost no real access. How are people using this without submitting to a $250/mo subscription?
andxor 3 hours ago||
People are paying for the subscriptions.
tootie 3 hours ago||
I gather this isn't intended as a consumer product. It's for academia and research institutions.
siva7 5 hours ago||
I can't shake off the feeling that Google's Deep Think models are not really different models, just the old ones run with a higher number of parallel subagents, something you can do yourself with their base model and opencode.
Davidzheng 5 hours ago|
And after I do that, how do I combine the output of 1000 subagents into one output? (I'm not being snarky here; I think it's a nontrivial problem.)
tifik 5 hours ago|||
The idea is that each subagent is focused on a specific part of the problem and can use its entire context window for a more focused subtask than the overall one. So ideally the results aren't conflicting, they are complementary. And you just have a system that merges them... likely another agent.
mattlondon 5 hours ago||||
You just pipe it to another agent to do the reduce step (i.e., fan-in) of the mapreduce (fan-out).

It's agents all the way down.

jonathanstrange 5 hours ago|||
Start with 1024 and use half the number of agents each turn to distill the final result.
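
A minimal sketch of that halving scheme in Python; run_subagent and merge_pair are hypothetical stand-ins for the actual LLM calls:

    def distill(prompts, run_subagent, merge_pair):
        drafts = [run_subagent(p) for p in prompts]  # fan-out: e.g. 1024 drafts
        while len(drafts) > 1:
            # fan-in: merge adjacent pairs, halving the field each round
            merged = [merge_pair(a, b) for a, b in zip(drafts[0::2], drafts[1::2])]
            if len(drafts) % 2:  # carry an odd leftover into the next round
                merged.append(drafts[-1])
            drafts = merged  # 1024 -> 512 -> ... -> 1
        return drafts[0]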
sinuhe69 6 hours ago||
I'm pretty certain that DeepMind (and all other labs) will try their frontier (and even private) models on First Proof [1].

And I wonder how Gemini Deep Think will fare. My guess is that it will get halfway on some problems. But we will have to take an absence of results as a failure, because nobody wants to publish a negative result, even though negative results are so important for scientific research.

[1] https://1stproof.org/

zozbot234 6 hours ago||
The original First Proof solutions are due to be published in about 24h, AIUI.
octoberfranklin 1 hour ago||
Really surprised that 1stproof.org was submitted three times and never made the front page at HN.

https://hn.algolia.com/?q=1stproof

This is exactly the kind of challenge I would want to judge AI systems based on. It required ten mathematicians doing bleeding-edge research to each publish a problem they've solved while holding back the answer. I appreciate the huge amount of social capital and coordination that must have taken.

I'm really glad they did it.

simonw 6 hours ago||
The pelican riding a bicycle is excellent. I think it's the best I've seen.

https://simonwillison.net/2026/Feb/12/gemini-3-deep-think/

nickthegreek 4 hours ago||
I routinely check out the pelicans you post and I do agree, this is the best yet. It seemed to me that the wings/arms were such a big hangup for these generators.
tasuki 2 hours ago|||
Tbh they'd have to be absolutely useless at benchmarkmaxxing if they didn't include your pelican riding a bicycle...
Manabu-eo 6 hours ago|||
How likely is it that this problem is already in the training set by now?
simonw 6 hours ago|||
If anyone trains a model on https://simonwillison.net/tags/pelican-riding-a-bicycle/ they're going to get some VERY weird looking pelicans.
suddenlybananas 5 hours ago||
Why would they train on that? Why not just hire someone to make a few examples?
simonw 5 hours ago||
I look forward to them trying. I'll know when the pelican riding a bicycle is good but the ocelot riding a skateboard sucks.
suddenlybananas 4 hours ago||
But they could just train on an assortment of animals and vehicles. It's the kind of relatively narrow domain where NNs could reasonably interpolate.
simonw 4 hours ago||
The idea that an AI lab would pay a small army of human artists to create training data for $animal on $transport just to cheat on my stupid benchmark delights me.
suddenlybananas 4 hours ago||
When you're spending trillions on capex, paying a couple of people to make some doodles in SVGs would not be a big expense.
simonw 3 hours ago||
The embarrassment of getting caught doing that would be expensive.
throwup238 6 hours ago||||
For every combination of animal and vehicle? Very unlikely.

The beauty of this benchmark is that it takes all of two seconds to come up with your own unique one. A seahorse on a unicycle. A platypus flying a glider. A man’o’war piloting a Portuguese man of war. Whatever you want.

recursive 6 hours ago||
No, not every combination. The question is about the specific combination of a pelican on a bicycle. It might be easy to come up with another test, but we're looking at the results from a particular one here.
svara 5 hours ago||
More likely you would just train the model to emit SVG for some description of a scene, and create the training data from raster images.
recursive 47 minutes ago||
None of this works if the testers are collaborating with the trainers. The tests ostensibly need to be at arm's length from the training. If the trainers ever start over-fitting to the test, the tester would secretly come up with some new test.
zarzavat 6 hours ago||||
You can always ask for a tyrannosaurus driving a tank.
verdverm 6 hours ago||||
I've heard it posited that the reason the frontier companies are frontier is that they have custom data and evals. This is what I would do too.
enraged_camel 4 hours ago|||
Is there a list of these for each model, that you've catalogued somewhere?
throwup238 6 hours ago|||
The reflection of the sun in the water is completely wrong. LLMs are still useless. (/s)
margalabargala 5 hours ago||
It's not, actually; look up some photos of the sun setting over the ocean. Here's an example:

https://stockcake.com/i/sunset-over-ocean_1317824_81961

throwup238 5 hours ago||
That’s only if the sun is above the horizon entirely.
margalabargala 4 hours ago||
No, it's not.

https://stockcake.com/i/serene-ocean-sunset_1152191_440307

throwup238 2 hours ago||
Yes, it is. In that photo the sun is clearly above the horizon, the bottom half is just obscured by clouds.
deron12 6 hours ago|||
It's worth noting that you mean excellent in terms of prior AI output. I'm pretty sure this wouldn't be considered excellent from a "human-made art" perspective. In other words, it's still got a ways to go!

Edit: someone needs to explain why this comment is getting downvoted, because I don't understand. Did someone's ego get hurt, or what?

gs17 5 hours ago|||
It depends. If you meant a human coding an SVG "manually" the same way, I'd still say this is excellent (minus the reflection issue). If you meant a human using a proper vector editor, then yeah.
fvdessen 5 hours ago||
Maybe you're a pro vector artist, but I couldn't create such a cool one myself in Illustrator, tbh.
dfdsf2 5 hours ago|||
Indeed. And when you factor in the amount invested... yeah, it looks less impressive. The question is how much more money needs to be invested to get this thing closer to reality? And not just in this instance, but for any instance, e.g. a seahorse on a bike.
saberience 6 hours ago|||
Do you really have to keep banging on about this relentlessly?

It was sort of humorous for maybe the first 2 iterations; now it's tacky, cheesy, and just relentless self-promotion.

Again, like I said before, it's also a terrible benchmark.

jeanloolz 3 hours ago|||
I'll agree to disagree. In any thread about a new model, I personally expect the pelican comment to be out there. It's informative, ritualistic, and frankly fun. Your comment, however, is a little harsh. Why mad?
Davidzheng 5 hours ago||||
Eh, I find it more of a not-very-informative but lighthearted commentary.
simonw 5 hours ago|||
It being a terrible benchmark is the bit.
dfdsf2 5 hours ago||
Highly disagree.

I was expecting something more realistic... the true test of what you are doing is how representative the thing is in relation to the real world. E.g., does the pelican look like a pelican as it exists in reality? This cartoon stuff is cute but doesn't pass muster in my view.

If it doesn't relate to the real world, then it most likely will have no real effect on the real economy. Pure and simple.

chriswarbo 5 hours ago|||
I disagree. The task asks for an SVG, which is a vector format associated with line drawings, clipart, and cartoons. I think it's good that models are picking up on that context.

In contrast, the only "realistic" SVGs I've seen are created using tools like potrace, and look terrible.

I also think the prompt itself, of a pelican on a bicycle, is unrealistic and cartoonish; so making a cartoon is a good way to solve the task.

peaseagee 5 hours ago|||
The request is for an SVG, generally _not_ the format for photorealistic images. If you want to start your own benchmark, feel free to ask for a photorealistic JPEG or PNG of a pelican riding a bicycle. Could be interesting to compare and contrast, honestly.
neilellis 6 hours ago||
Less than a year to destroy ARC-AGI-2 - wow.
Davidzheng 6 hours ago||
I unironically believe that ARC-AGI-3 will have an introduction-to-solved time of 1 month.
ACCount37 3 hours ago|||
Not very likely?

ARC-AGI-3 has a nasty combo of spatial reasoning + explore/exploit. It's basically adversarial vs current AIs.

etyhhgfff 6 hours ago||||
The AGI bar has to be set even higher, yet again.
dakolli 5 hours ago|||
Wow, solving useless puzzles, such a useful metric!
esafak 3 hours ago||
How is spatial reasoning useless??
modeless 5 hours ago|||
It's still useful as a benchmark of cost/efficiency.
XCSme 5 hours ago|||
But why only a +0.5% increase for MMMU-Pro?
kingstnap 4 hours ago|||
It's possibly label noise, but you can't tell from a single number.

You would need to check whether everyone is making mistakes on the same 20% or on a different 20%. If it's the same 20%, either those questions are really hard, or they are keyed incorrectly, or they aren't stated with enough context to actually solve the problem.

It happens. The old non-Pro MMLU had a lot of wrong answers. Simple things like MNIST have digits labeled incorrectly or drawn so badly they're not even digits anymore.
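
A minimal sketch of that overlap check, assuming hypothetical per-question correctness arrays for each model:

    import numpy as np

    def miss_overlap(correct_a, correct_b):
        # Jaccard overlap between the sets of questions two models miss
        miss_a = ~np.asarray(correct_a, dtype=bool)
        miss_b = ~np.asarray(correct_b, dtype=bool)
        union = (miss_a | miss_b).sum()
        return (miss_a & miss_b).sum() / union if union else 0.0

Overlap near 1.0 means everyone misses the same questions (hard or mis-keyed items); overlap near the chance level points to independent model errors instead.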

kenjackson 5 hours ago|||
Everyone is already at 80% for that one. Crazy that we were just at 50% with GPT-4o not that long ago.
saberience 5 hours ago||
It's a useless, meaningless benchmark, though; it just got a catchy name, as in, if the models solve it they have "AGI", which is clearly rubbish.

ARC-AGI score isn't correlated with anything useful.

Legend2440 3 hours ago|||
It's correlated with the ability to solve logic puzzles.

It's also interesting because it's very very hard for base LLMs, even if you try to "cheat" by training on millions of ARC-like problems. Reasoning LLMs show genuine improvement on this type of problem.

HDThoreaun 1 hour ago||||
ARC-AGI-2 is an IQ test. IQ tests have been shown over and over to have predictive power in humans. People who score well on them tend to be more successful.
fsh 2 minutes ago||
IQ tests only work if the participants haven't trained for them. If they do similar tests a few times in a row, scores increase a lot. Current LLMs are hyper-optimized for the particular types of puzzles contained in popular "benchmarks".
jabedude 5 hours ago|||
How would we actually objectively measure a model to see if it is AGI, if not with benchmarks like ARC-AGI?
WarmWash 4 hours ago||
Give it a prompt like

>can u make the progm for helps that with what in need for shpping good cheap products that will display them on screen and have me let the best one to get so that i can quickly hav it at home

And get back an automatic coupon code app like the user actually wanted.

Legend2440 3 hours ago||
I'm really interested in the 3D STL-from-photo process they demo in the video.

Not interested enough to pay $250 to try it out though.

ramshanker 6 hours ago|
Do we get any model architecture details, like parameter size etc.? A few months back we used to talk more about this; now it's mostly about model capabilities.
Davidzheng 6 hours ago|
I'm honestly not sure what you mean? The frontier labs have kept their architectures secret since GPT-3.5.
willis936 4 hours ago||
At the very least, Gemini 3's flyer claims 1T parameters.