Posted by tosh 9 hours ago

Gemini 3 Deep Think (blog.google)
https://x.com/GoogleDeepMind/status/2021981510400709092

https://x.com/fchollet/status/2021983310541729894

603 points | 355 comments | page 3
simonw 8 hours ago|
The pelican riding a bicycle is excellent. I think it's the best I've seen.

https://simonwillison.net/2026/Feb/12/gemini-3-deep-think/

nickthegreek 6 hours ago||
I routinely check out the pelicans you post and I do agree, this is the best yet. It seemed to me that the wings/arms were such a big hangup for these generators.
tasuki 4 hours ago|||
Tbh they'd have to be absolutely useless at benchmarkmaxxing if they didn't include your pelican riding a bicycle...
Manabu-eo 8 hours ago|||
How likely is it that this problem is already in the training set by now?
simonw 7 hours ago|||
If anyone trains a model on https://simonwillison.net/tags/pelican-riding-a-bicycle/ they're going to get some VERY weird looking pelicans.
suddenlybananas 7 hours ago||
Why would they train on that? Why not just hire someone to make a few examples?
simonw 6 hours ago||
I look forward to them trying. I'll know when the pelican riding a bicycle is good but the ocelot riding a skateboard sucks.
suddenlybananas 6 hours ago||
But they could just train on an assortment of animals and vehicles. It's the kind of relatively narrow domain where NNs could reasonably interpolate.
simonw 6 hours ago||
The idea that an AI lab would pay a small army of human artists to create training data for $animal on $transport just to cheat on my stupid benchmark delights me.
suddenlybananas 6 hours ago||
When you're spending trillions on capex, paying a couple of people to make some doodles in SVGs would not be a big expense.
simonw 5 hours ago||
The embarrassment of getting caught doing that would be expensive.
throwup238 8 hours ago||||
For every combination of animal and vehicle? Very unlikely.

The beauty of this benchmark is that it takes all of two seconds to come up with your own unique one. A seahorse on a unicycle. A platypus flying a glider. A man’o’war piloting a Portuguese man of war. Whatever you want.
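
As a rough illustration of how cheap it is to mint a fresh variant, here is a minimal Python sketch that picks a random animal/vehicle pairing. The word lists and the "Generate an SVG of ..." template are illustrative assumptions, not the exact wording of any established benchmark, and the function names are hypothetical.

```python
import itertools
import random

# Illustrative word lists (assumptions); articles are stored with the nouns
# so that "an ocelot" and "a unicycle" both come out right.
ANIMALS = ["a seahorse", "a platypus", "an ocelot", "a tyrannosaurus"]

# Each vehicle carries the verb that suits it.
VEHICLES = [
    ("riding", "a unicycle"),
    ("flying", "a glider"),
    ("riding", "a skateboard"),
    ("driving", "a tank"),
]


def make_prompt(animal, verb, vehicle):
    """Build one prompt from the assumed template."""
    return f"Generate an SVG of {animal} {verb} {vehicle}"


def random_svg_prompt(rng=None):
    """Pick one animal and one vehicle at random and build a prompt."""
    rng = rng or random.Random()
    animal = rng.choice(ANIMALS)
    verb, vehicle = rng.choice(VEHICLES)
    return make_prompt(animal, verb, vehicle)


def all_prompts():
    """Enumerate every animal/vehicle combination in the lists."""
    return [
        make_prompt(animal, verb, vehicle)
        for animal, (verb, vehicle) in itertools.product(ANIMALS, VEHICLES)
    ]


if __name__ == "__main__":
    print(random_svg_prompt())
```

Even these two tiny lists give 16 combinations, and the count grows multiplicatively as entries are added, which is why pre-training on every pairing looks impractical.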

recursive 7 hours ago||
No, not every combination. The question is about the specific combination of a pelican on a bicycle. It might be easy to come up with another test, but we're looking at the results from a particular one here.
svara 7 hours ago||
More likely you would just train for emitting svg for some description of a scene and create training data from raster images.
recursive 2 hours ago||
None of this works if the testers are collaborating with the trainers. The tests ostensibly need to be arms-length from the training. If the trainers ever start over-fitting to the test, the tester would come up with some new test secretly.
zarzavat 8 hours ago||||
You can always ask for a tyrannosaurus driving a tank.
verdverm 8 hours ago||||
I've heard it posited that the reason the frontier companies are frontier is that they have custom data and evals. This is what I would do too.
enraged_camel 6 hours ago|||
Is there a list of these for each model that you've catalogued somewhere?
throwup238 8 hours ago|||
The reflection of the sun in the water is completely wrong. LLMs are still useless. (/s)
margalabargala 7 hours ago||
It's not, actually. Look up some photos of the sun setting over the ocean. Here's an example:

https://stockcake.com/i/sunset-over-ocean_1317824_81961

throwup238 7 hours ago||
That’s only if the sun is above the horizon entirely.
margalabargala 6 hours ago||
No, it's not.

https://stockcake.com/i/serene-ocean-sunset_1152191_440307

throwup238 3 hours ago||
Yes, it is. In that photo the sun is clearly above the horizon, the bottom half is just obscured by clouds.
deron12 8 hours ago|||
It's worth noting that you mean excellent in terms of prior AI output. I'm pretty sure this wouldn't be considered excellent from a "human made art" perspective. In other words, it's still got a ways to go!

Edit: someone needs to explain why this comment is getting downvoted, because I don't understand. Did someone's ego get hurt, or what?

gs17 7 hours ago|||
It depends, if you meant from a human coding an SVG "manually" the same way, I'd still say this is excellent (minus the reflection issue). If you meant a human using a proper vector editor, then yeah.
fvdessen 7 hours ago||
Maybe you're a pro vector artist, but I couldn't create such a cool one myself in Illustrator, tbh.
dfdsf2 7 hours ago|||
Indeed. And when you factor in the amount invested... yeah, it looks less impressive. The question is how much more money needs to be invested to get this thing closer to reality? And not just in this instance, but for any instance, e.g. a seahorse on a bike.
saberience 7 hours ago|||
Do you still have to keep banging on about this relentlessly?

It was sort of humorous for maybe the first 2 iterations; now it's tacky, cheesy, and just relentless self-promotion.

Again, like I said before, it's also a terrible benchmark.

jeanloolz 5 hours ago|||
I'll agree to disagree. In any thread about a new model, I personally expect the pelican comment to be out there. It's informative, ritualistic and frankly fun. Your comment, however, is a little harsh. Why mad?
Davidzheng 7 hours ago||||
Eh, I find it more of a not very informative but lighthearted commentary.
simonw 6 hours ago|||
It being a terrible benchmark is the bit.
dfdsf2 7 hours ago||
Highly disagree.

I was expecting something more realistic... the true test of what you are doing is how representative the thing is of the real world. E.g. does the pelican look like a pelican as it exists in reality? This cartoon stuff is cute but doesn't pass muster in my view.

If it doesn't relate to the real world, then it most likely will have no real effect on the real economy. Pure and simple.

chriswarbo 7 hours ago|||
I disagree. The task asks for an SVG, which is a vector format associated with line drawings, clipart and cartoons. I think it's good that models are picking up on that context.

In contrast, the only "realistic" SVGs I've seen are created using tools like potrace, and look terrible.

I also think the prompt itself, of a pelican on a bicycle, is unrealistic and cartoonish, so making a cartoon is a good way to solve the task.

peaseagee 7 hours ago|||
The request is for an SVG, generally _not_ the format for photorealistic images. If you want to start your own benchmark, feel free to ask for a photorealistic JPEG or PNG of a pelican riding a bicycle. Could be interesting to compare and contrast, honestly.
sinuhe69 8 hours ago||
I'm pretty certain that DeepMind (and all other labs) will try their frontier (and even private) models on First Proof [1].

And I wonder how Gemini Deep Think will fare. My guess is that it will get halfway on some problems. But we will have to take an absence as a failure, because nobody wants to publish a negative result, even though it's so important for scientific research.

[1] https://1stproof.org/

zozbot234 7 hours ago||
The 1st proof original solutions are due to be published in about 24h, AIUI.
octoberfranklin 3 hours ago||
Really surprised that 1stproof.org was submitted three times and never made the front page of HN.

https://hn.algolia.com/?q=1stproof

This is exactly the kind of challenge I would want to judge AI systems based on. It required ten bleeding-edge-research mathematicians to publish a problem they've solved but hold back the answer. I appreciate the huge amount of social capital and coordination that must have taken.

I'm really glad they did it.

vessenes 8 hours ago||
Not trained for agentic workflows yet, unfortunately. This looks like it will be fantastic when they have an agent-friendly one. Super exciting.
dakolli 7 hours ago|
It's really weird how you all are begging to be replaced by LLMs. You think if agentic workflows get good enough you're going to keep your job? Or not have your salary reduced by 50%?

If agents get good enough, they're not going to build some profitable startup for you (or whatever people think they're doing with the LLM slot machines), because that implies that anyone else with access to that agent can just copy you. It's what they're designed to do... launder IP/copyright. It's weird to see people get excited for this technology.

None of this is good. We are simply going to have our workforces replaced by assets owned by Google, Anthropic and OpenAI. We'll all be fighting for the same barista jobs, or miserable factory jobs. Take note of how all these CEOs are trying to make it sound cool to "go to trade school" or how we need "strong American workers to work in factories".

BeetleB 4 hours ago|||
> It's really weird how you all are begging to be replaced by LLMs. You think if agentic workflows get good enough you're going to keep your job? Or not have your salary reduced by 50%?

The computer industry (including SW) has been in the business of replacing jobs for decades - since the 70's. It's only fitting that SW engineers finally become the target.

sgillen 5 hours ago||||
I think a lot of people assume they will become highly paid Agent orchestrators or some such. I don't think anyone really knows where things are heading.
dakolli 1 hour ago||
Why would someone get paid well for this skill? It's not valuable at all.
timeattack 4 hours ago||||
I agree with you and have similar thoughts (maybe unfortunately for me). I personally know people who outsource not just their work but also their lives to LLMs, and reading their excited comments makes me feel a mix of cringe, FOMO and dread. But what is the endgame for the likes of me and you once we are finally evicted from our own craft? Stash money while we still can, watch the 'world crash and burn', and then go and try to ascend in some other, not-yet-automated craft?
dakolli 4 hours ago||
Yeah, that's a good question that I can't stop thinking about. I don't really enjoy much else other than building software; it's genuinely my favorite thing to do. Maybe there will be a world where we aren't completely replaced; after all, we still have handmade clothes that are highly coveted. I just worry it's going to uproot more than just software engineering. Theoretically, it shouldn't be hard to replace all the low-hanging fruit in the realm of anything that deals with computer I/O. Previous generations of automation have created new opportunities for humans, but this seems to be mostly just a means of replacement. The advent of mass transportation/vehicles created machines that needed mechanics (and eventually software). I don't see that happening in this new paradigm.

I don't think that's going to make society very pleasant if everyone's fighting over the few remaining ways to make a livelihood. People need to work to eat. I certainly don't see the capitalist class giving everyone UBI and letting us garden or paint for the rest of our lives. I worry we're likely going to end up in trenches or purged through some other means.

vessenes 3 hours ago||||
I’m someone who’d like to deploy a lot more workers than I want to manage.

Put another way, I’m on the capital side of the conversation.

The good news for labor that has experience and creativity is that it just started costing 1/100,000 what it used to to get on that side of the equation.

jimmymcgee73 2 hours ago|||
If LLMs truly cause widespread replacement of labor, you’re screwed just as much as anyone else. If we hit, say, 40% unemployment, do you think people will care whether you own your home or not? Do you think people will care whether you have currency or not? The best case outcome will be universal income and a pseudo utopia where everyone does ok. The “bad” scenario is widespread war.

I am one of the “haves” and am not looking forward to the instability this may bring. Literally no one should.

blibble 1 hour ago|||
> I am one of the “haves” and am not looking forward to the instability this may bring. Literally no one should.

these people always forget capitalism is permitted to exist by consent of the people

if there's 40% unemployment it won't continue to exist, regardless of what the TV/tiktok/chatgpt says

dakolli 58 minutes ago|||
Well, he also thinks $10.00 in LLM tokens is equivalent to a $1mm labor budget. These are the same people who were grifting during the NFT days, claiming they were the future of art.
dakolli 1 hour ago|||
lmao, you are an idealistic moron. If llms can replace labor at 1/100k of the cost (lmfao) why are you looking to "deploy" more workers? So are you trying to say if I have $100.00 in tokens I have the equivalent of $10mm in labor potential.... What kind of statement is this?

This is truly the dumbest statement I've ever seen on this site for too many reasons to list.

You people sound like NFT people in 2021 telling people that they're creating and redefining art.

Oh look peter@capital6.com is a "web3" guy. Its all the same grifters from the NFT days behaving the same way.

ergonaught 6 hours ago||||
Most folks don't seem to think that far down the line, or they haven't caught on to the reality that the people who actually make decisions will make the obvious kind of decisions (ex: fire the humans, cut the pay, etc) that they already make.
blibble 4 hours ago||
they think they're going to be the person making that decision

but forgot there's likely someone above them making exactly the same one about them

newswasboring 5 hours ago||||
You don't hate AI, you hate capitalism. All the problems you have listed are not AI issues; it's this crappy system where efficiency gains always end up with the capital owners.
OtomotO 7 hours ago|||
[flagged]
dakolli 7 hours ago|||
Well I honestly think this is the solution. It's much harder to do French Revolution V2 though if they've used ML to perfect people's recommendation algorithms to psyop them into fighting wars on behalf of capitalists.

I imagine LLM job automation will make people so poor that they beg to fight in wars, and instead of turning that energy against the people who created the problem they'll be met with hours of psyops that direct that energy to Chinese people or whatever.

We will see.

uxhoiuewfhhiu 5 hours ago|||
[flagged]
ramshanker 8 hours ago||
Do we get any model architecture details, like parameter count etc.? A few months back, we used to talk more about this; now it's mostly about model capabilities.
Davidzheng 8 hours ago|
I'm honestly not sure what you mean? The frontier labs have kept their architectures secret since GPT-3.5.
willis936 6 hours ago||
At the very least, Gemini 3's flyer claims 1T parameters.
Legend2440 5 hours ago||
I'm really interested in the 3D STL-from-photo process they demo in the video.

Not interested enough to pay $250 to try it out though.

Dirak 6 hours ago||
Praying this isn't another Llama 4 situation where the benchmark numbers are cooked. 84.6% on ARC-AGI is incredible!
jonathanstrange 9 hours ago|
Unfortunately, it's only available in the Ultra subscription if it's available at all.