
Posted by MallocVoidstar 7 hours ago

Gemini 3.1 Pro (blog.google)
Preview: https://console.cloud.google.com/vertex-ai/publishers/google...

Card: https://deepmind.google/models/model-cards/gemini-3-1-pro/

361 points | 618 comments
spankalee 5 hours ago|
I hope this works better than 3.0 Pro

I'm a former Googler and know some people near the team, so I mildly root for them to at least do well, but Gemini is consistently the most frustrating model I've used for development.

It's stunningly good at reasoning, design, and generating the raw code, but it just falls over a lot when actually trying to get things done, especially compared to Claude Opus.

Within VS Code Copilot, Claude will have a good mix of thinking streams and responses to the user. Gemini will almost completely use thinking tokens, and then just do something but not tell you what it did. If you don't look at the thinking tokens you can't tell what happened, but the thinking token stream is crap. It's all "I'm now completely immersed in the problem...". Gemini also frequently gets twisted around, stuck in loops, and unable to make forward progress. It's bad at using tools and tries to edit files in weird ways instead of using the provided text editing tools. In Copilot it won't stop to ask clarifying questions, though in Gemini CLI it will.

So I've tried to adopt a plan-in-Gemini, execute-in-Claude approach, but while I'm doing that I might as well just stay in Claude. The experience is just so much better.

For as much as I hear that Google's pulling ahead, from a practical POV it's Anthropic that seems to be pulling ahead to me. I hope Googlers on Gemini are actually trying these things out in real projects, not just one-shotting a game and calling it a win.

bluegatty 34 minutes ago||
Yes, this is very true and it speaks strongly to this wayward notion of 'models' - it depends so much on the tuning, the harness, the tools.

I think it speaks to the broader notion of AGI as well.

Claude is definitely trained on the process of coding, not just the code; that much is clear.

Codex has the same limitation but not quite as bad.

This may be a result of Anthropic using 'user cues' about which completions are good and which are not, and feeding that into the tuning, among other things.

Anthropic is winning at coding and related tasks because they're focused on that; Google is probably oriented towards a more general solution, and so it's stuck in 'jack of all trades, master of none' mode.

andai 24 minutes ago||
Tell me more about Codex. I'm trying to understand it better.

I have a pretty crude mental model for this stuff but Opus feels more like a guy to me, while Codex feels like a machine.

I think that's partly the personality and tone, but I think it goes deeper than that.

(Or maybe the language and tone shapes the behavior, because of how LLMs work? It sounds ridiculous but I told Claude to believe in itself and suddenly it was able to solve problems it wouldn't even attempt before...)

karmasimida 4 hours ago|||
Gemini just doesn’t do even mildly well in agentic stuff and I don’t know why.

OpenAI has mostly caught up with Claude in agentic stuff, but Google needs to be there and be there quickly

onlyrealcuzzo 3 hours ago|||
Because Search is not agentic.

Most of Gemini's users are Search converts doing extended-Search-like behaviors.

Agentic workflows are a VERY small percentage of all LLM usage at the moment. As that market becomes more important, Google will pour more resources into it.

Macha 3 hours ago||
> Agentic workflows are a VERY small percentage of all LLM usage at the moment. As that market becomes more important, Google will pour more resources into it.

I do wonder what percentage of revenue they are. I expect it's very outsized relative to usage (e.g. approximately nobody who is receiving them is paying for those summaries at the top of search results)

curly6 2 hours ago|||
> Most agent actions on our public API are low-risk and reversible. Software engineering accounted for nearly 50% of agentic activity, but we saw emerging usage in healthcare, finance, and cybersecurity.

via Anthropic

https://www.anthropic.com/research/measuring-agent-autonomy

this doesn’t answer your question, but maybe Google is comfortable with driving traffic and dependency through their platform until they can do something like this

https://www.adweek.com/media/google-gemini-ads-2026/

onlyrealcuzzo 2 hours ago|||
> (e.g. approximately nobody who is receiving them is paying for those summaries at the top of search results)

Nobody is paying for Search. According to Google's earnings reports - AI Overviews is increasing overall clicks on ads and overall search volume.

bayindirh 1 hour ago||
So, apparently switching to Kagi continues to pay dividends, elegantly.

No ads, no forced AI overview, no profit centric reordering of results, plus being able to reorder results personally, and more.

swftarrow 51 minutes ago||||
I suspect a large part of Google's lag is due to being overly focused on integrating Gemini with their existing product and app lines.
alphabetting 3 hours ago||||
the agentic benchmarks for 3.1 indicate Gemini has caught up. the gains are big from 3.0 to 3.1.

For example the APEX-Agents benchmark for long time horizon investment banking, consulting and legal work:

1. Gemini 3.1 Pro - 33.2%
2. Opus 4.6 - 29.8%
3. GPT 5.2 Codex - 27.6%
4. Gemini Flash 3.0 - 24.0%
5. GPT 5.2 - 23.0%
6. Gemini 3.0 Pro - 18.0%

kakugawa 1 hour ago|||
In mid-2024, Anthropic made the deliberate decision to stop chasing benchmarks and focus on practical value. There was a lot of skepticism at the time, but it's proven to be a prescient decision.
girvo 1 hour ago||||
Benchmarks are basically straight up meaningless at this point in my experience. If they mattered and were the whole story, those Chinese open models would be stomping the competition right now. Instead they're merely decent when you use them in anger for real work.

I'll withhold judgement until I've tried to use it.

metadat 34 minutes ago||||
Ranking Codex 5.2 ahead of plain 5.2 doesn't make sense. Codex is expressly designed for coding tasks. Not systems design, not problem analysis, and definitely not banking, but actually solving specific programming tasks (and it's very, very good at this). GPT 5.2 (non-codex) is better in every other way.
306bobby 20 minutes ago|||
It could be problem specific. There are certain non program things that opus seems better than sonnet at as well
306bobby 19 minutes ago|||
Swapped sonnet and opus on my last reply, oops
blueaquilae 1 hour ago||||
Marketing team agree with benchmark score...
HardCodedBias 2 hours ago|||
LOL come on man.

Let's give it a couple of days since no one believes anything from benchmarks, especially from the Gemini team (or Meta).

If we see on HN that people are willingly switching their coding environment, we'll know "hot damn they cooked"; otherwise this is another whiff by Google.

drivebyhooting 23 minutes ago||
You can’t put Gemini and Meta in the same sentence. Llama 4 was DOA, and Meta has given up on frontier models. Internally they’re using Claude.
hintymad 1 hour ago||||
My guess is that the Gemini team didn't focus on large-scale RL training for the agentic workload, and they are trying to catch up with 3.1.
ionwake 3 hours ago||||
Can you explain what you mean by it being bad at agentic stuff?
karmasimida 3 hours ago||
Accomplishing the task I give it without fighting me on it.

I think this is a classic precision/recall issue: the model needs to stay on task, but also infer what the user might want but did not explicitly state. Gemini seems particularly bad at the recall side, where it goes out of bounds

ionwake 26 minutes ago||
cool thanks for the explanation
renegade-otter 1 hour ago|||
It's like anything Google - they do the cool part and then lose interest with the last 10%. Writing code is easy, building products that print money is hard.
miohtama 52 minutes ago||
One does not need products if you have a monopoly on search
margorczynski 47 minutes ago||
That monopoly is worth less as time goes by and people increasingly use LLMs or similar systems to search for info. In my case I've cut down a lot on Googling since more competent LLMs appeared.
raducu 29 minutes ago|||
> Gemini also frequently gets twisted around, stuck in loops, and unable to make forward progress.

Yes, Gemini loops, but I've found it's almost always just a matter of interrupting and telling it to continue.

Claude is very good until it tries something 2-3 times, can't figure it out, and then tries to trick you by changing your tests instead of your code (if you explicitly tell it not to, maybe it will decide to ask) OR introduces hyper-fine-tuned IFs to fit your tests, EVEN if you tell it NOT to.

RachelF 13 minutes ago||
I haven't used 3.1 yet, but 3.0 Pro has been frustrating for two reasons:

- it is "lazy": I keep having to tell it to finish, or continue, it wants to stop the task early.

- it hallucinates: I have arguments with it about it making up API functions for well-known libraries that just do not exist.

WhitneyLand 16 minutes ago|||
Yeah it’s amazing how it can be the best model on paper, and in some ways in practice, but coding has sucked with it.

Makes you wonder though how much of the difference is the model itself vs Claude Code being a superior agent.

jpcompartir 28 minutes ago|||
Yep, Gemini is virtually unusable compared to Anthropic models. I get it for free with work and use maybe once a week, if that. They really need to fix the instruction following.
s3p 4 hours ago|||
Don't get me started on the thinking tokens. Since 2.5P the thinking has been insane. "I'm diving into the problem", "I'm fully immersed" or "I'm meticulously crafting the answer"
raducu 20 minutes ago|||
> Don't get me started on the thinking tokens.

Claude provides nicer explanations, but when it comes to CoT tokens or just prompting the LLM to explain -- I'm very skeptical of the truthfulness of it.

Not because the LLM lies, but because humans do that also -- when asked how they figured something out, they'll provide a reasonable-sounding chain of thought, but it's not how they actually figured it out.

foz 3 hours ago||||
This is part of the reason I don't like to use it. I feel it's hiding things from me, compared to other models that very clearly share what they are thinking.
dumpsterdiver 1 hour ago||
To be fair, considering that the CoT exposed to users is a sanitized summary of the actual reasoning path, one could argue that a sanitized CoT is closer to hiding things than simply omitting it entirely.
mikestorrent 1 hour ago||
This is something that bothers me. We had a beautiful trend on the Web of the browser also being the debugger - from View Source decades ago all the way up to the modern browser console inspired by Firebug. Everything was visible, under the hood, if you cared to look. Now, a lot of "thinking" is taking place under a shroud, and only so much of it can be expanded for visibility and insight into the process. Where is the option to see the entire prompt that my agent compiled and sent off, raw? Where's the option to see the output, replete with thinking blocks and other markup?
dist-epoch 3 hours ago|||
That's not the real thinking, it's a super summarized view of it.
Oras 4 hours ago|||
Glad I’m not the only one who experienced this. I have a paid antigravity subscription and most of the time I use Claude models due to the exact issues you have pointed out.
stephen_cagle 2 hours ago|||
I also worked at Google (on the original Gemini, when it was still Bard internally) and my experience largely mirrors this. My finding is that Gemini is pretty great for factual information, and it is also the only one where I can reliably (even with the video camera) take a picture of a bird and have it tell me what the bird is. But it is just pretty bad as a model to help with development; myself and everyone I know uses Claude. The benchmarks are always really close, but my experience is that they do not translate to real-world (mostly coding) tasks.

tldr; It is great at search, not so much action.

neves 57 minutes ago|||
Gemini integrating with Google software gives me the best feature of all LLMs. When I receive an invite for an event, I screenshot it, share it with the Gemini app and say: add to my Calendar.

It's not very complex, but a great time saver

stephen_cagle 39 minutes ago|||
Yeah, as evidenced by the birds (above), I think it is probably the best vision model at this time. That is a good idea; I should use it for business cards as well, I guess.
jeffbee 12 minutes ago|||
That's great but it can't add stuff to your calendar unless you throw the master switch for "personalization" giving it access to your GMail, Docs, etc. I tried that and it went off the rails immediately, started yapping in an unrelated context about the 2002 Dodge Ram that I own, which of course I do not own, but some imbecile who habitually uses my email address once ordered parts for one. I found that to be a pretty bad feature so I had to turn it off, and now it can't do the other stuff like make calendars or add my recipes to Keep.
menaerus 1 hour ago|||
I don't know ... as of now I am literally instructing it to solve the chained expression computation problem which incurs a lot of temporary variables, of which some can be elided by the compiler and some cannot. Think linear algebra expressions which yield a lot of intermediate computations for which you don't want to create a temporary. This is production code and not an easy problem.

And yet it happily told me exactly what I wanted it to tell me - rewrite the goddamn thing using (C++) expression templates. And voila, it took "it" 10 minutes to spit out the high-quality code that works.
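
To make the technique concrete for anyone who hasn't seen it: below is a minimal toy sketch of the expression-template idea being described (my own illustration with hypothetical Vec/Sum types, not the code the model actually produced). Chained sums build a lazy expression object, and the elements are only computed in a single loop at assignment time, so no intermediate vectors are materialised:

  // Minimal sketch of the expression-template idea (toy illustration, not the
  // commenter's production code). Chained "+" builds a lazy Sum node; the work
  // happens in one loop when the result is assigned, so no temporary vectors.
  #include <cstddef>
  #include <cstdio>
  #include <type_traits>
  #include <vector>

  struct Expr {};  // marker base so operator+ only applies to our types

  template <typename L, typename R>
  struct Sum : Expr {            // lazy node representing "lhs + rhs"
      const L& lhs;
      const R& rhs;
      Sum(const L& l, const R& r) : lhs(l), rhs(r) {}
      double operator[](std::size_t i) const { return lhs[i] + rhs[i]; }
      std::size_t size() const { return lhs.size(); }
  };

  struct Vec : Expr {
      std::vector<double> data;
      explicit Vec(std::size_t n) : data(n) {}
      double  operator[](std::size_t i) const { return data[i]; }
      double& operator[](std::size_t i)       { return data[i]; }
      std::size_t size() const { return data.size(); }

      // Assigning from any expression evaluates it element by element.
      template <typename E, typename = std::enable_if_t<std::is_base_of_v<Expr, E>>>
      Vec& operator=(const E& e) {
          for (std::size_t i = 0; i < size(); ++i) data[i] = e[i];
          return *this;
      }
  };

  template <typename L, typename R,
            typename = std::enable_if_t<std::is_base_of_v<Expr, L> &&
                                        std::is_base_of_v<Expr, R>>>
  Sum<L, R> operator+(const L& lhs, const R& rhs) { return Sum<L, R>(lhs, rhs); }

  int main() {
      Vec a(3), b(3), c(3), r(3);
      for (std::size_t i = 0; i < 3; ++i) { a[i] = 1; b[i] = 2; c[i] = 3; }
      r = a + b + c;              // one pass over the data, no intermediate Vec
      std::printf("%f\n", r[0]);  // prints 6.000000
  }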

My biggest gripe for now with Gemini is that Antigravity seems to be written by the model and I am experiencing more hiccups than I would like to, sometimes it's just stuck.

stephen_cagle 27 minutes ago|||
Can't argue with that; I'll move my Bayesian priors a little in your direction. With that said, are most other models able to do this? Also, did it write the solution itself or use a library like Eigen?

I have noticed that LLM's seem surprisingly good at translating from one (programming) language to another... I wonder if transforming a generic mathematical expression into an expression template is a similar sort of problem to them? No idea honestly.

cmrdporcupine 31 minutes ago|||
People's objections are not the quality of code or analysis that Gemini produces. It's that it's inept at doing things like editing pieces of files or running various tools.

As an ex-Googler part of me wonders if this has to do with the very ... bespoke ... nature of the developer tooling inside Google. Though it would be crazy for them to be training on that.

agentifysh 2 hours ago|||
Relieved to read this from an ex-Googler; at least we are not the crazy ones we are made out to be whenever we point out issues with Gemini
knollimar 5 hours ago|||
Is the thinking token stream obfuscated?

I'm fully immersed

orbital-decay 4 hours ago||
It's just a summary generated by a really tiny model. I guess it's also an ad-hoc way to obfuscate it, yes. In particular they're hiding prompt injections they're dynamically adding sometimes. Actual CoT is hidden and entirely different from that summary. It's not very useful for you as a user, though (neither is the summary).
ukuina 4 hours ago|||
Agree the raw thought-stream is not useful.

It's likely filled with "Aha!" and "But wait!" statements.

FergusArgyll 3 hours ago||||
They hide the CoT because they don't want competitors to train on it
orbital-decay 2 hours ago||
Training on the CoT itself is pretty dubious since it's reward hacked to some degree (as evident from e.g. GLM-4.7 which tried pulling that with 3.0 Pro, and ended up repeating Model Armor injections without really understanding/following them). In any case they aren't trying to hide it particularly hard.
FergusArgyll 2 hours ago||
> In any case they aren't trying to hide it particularly hard.

What does that mean? Are you able to read the raw CoT? How?

cubefox 1 hour ago|||
The early version of Gemini 2.5 did initially show the actual CoT in AI Studio, and it was pretty interesting in some cases.
slopinthebag 4 hours ago|||
Hmm, interesting..

My workflow is to basically use it to explain new concepts, generate code snippets inline or fill out function bodies, etc. Not really generating code autonomously in a loop. Do you think it would excel at this?

mikestorrent 59 minutes ago||
I think that you should really try to get whatever agent you can to work on that kind of thing for you - guide it with the creation of testing frameworks and code coverage, focus more on the test cases with your human intellect, and let it work to pass them.
slopinthebag 38 minutes ago||
I'm not really interested in that workflow, too far removed from the code imo. I only really do that for certain tasks with a bunch of boilerplate, luckily I simply don't use languages or frameworks that require very much BS anymore.
jbellis 4 hours ago|||
yeah, g3p is as smart as or smarter than the other flagships but it's just not reliable enough, it will go into "thinking loops" and burn 10s of 1000s of tokens repeating itself.

https://blog.brokk.ai/gemini-3-pro-preview-not-quite-baked/

hopefully 3.1 is better.

nicce 3 hours ago||
> it will go into "thinking loops" and burn 10s of 1000s of tokens repeating itself.

Maybe it is just a genius business strategy.

mikestorrent 57 minutes ago||
Similarly, Cursor's "Auto Mode" purports to use whichever model is best for your request, but it's only reasonable to assume it uses whatever model is best for Cursor at that moment
varispeed 2 hours ago||
> stuck in loops

I wonder if there is some form of cheating. Many times I found that after a while Gemini becomes like a Markov chain spouting nonsense on repeat suddenly and doesn't react to user input anymore.

sdeiley 1 hour ago||
People underrate Google's cost effectiveness so much. Half price of Opus. HALF.

Think about ANY other product and what you'd expect from the competition that's half the price. Yet people here act like Gemini is dead weight

____

Update:

Running the AA index, 3.1 was 40% of the cost of Opus Thinking AND Sonnet, beat Opus, and was still 30% faster in output speed.

https://artificialanalysis.ai/?speed=intelligence-vs-speed&m...

bluegatty 31 minutes ago||
You can pay 1 cent for a mediocre answer or 2 cents for a great answer.

So a lot of these things are relative.

Now if that equation plays out 20K times a day, well that's one thing, but if it's 'once a day' then the cost basis becomes irrelevant. Like the cost of staplers for the Medical Device company.

Obviously it will matter, but for development ... it's probably worth it to pay $300/mo for the best model, when the second best is $0.

For consumer AI, the math will be different ... and that will be a big deal in the long run.

vitaflo 4 minutes ago|||
Deepseek is 2% of the cost of Opus. But most people aren't using that for code even tho it's ridiculously cheap.
nu11ptr 42 minutes ago|||
That sounds great, but if Opus generates 20% better code think of the ramifications of that on a real world project. Already $100/month gets you a programmer (or maybe even 2 or 3) that can do your work for you. Insanity. Do I even care if there is something 80% as good for 50% the cost? My answer: no. That said, if it is every bit as good, and their benchmarks suggest it is (but proof will be in testing it out), then sure, a 50% cost reduction sounds really nice.
jstummbillig 1 hour ago|||
It's not half price or cost-effective if it can't do the job that I am happy to pay twice the price to get done.

But I agree: If they can get there (at one point in the past year I felt they were the best choice for agentic coding), their pricing is very interesting. I am optimistic that it would not require them to go up to Opus pricing.

WarmWash 16 minutes ago|||
Gemini is the most paradoxical model because it benchmarks great even in private benchmarks done by regular people, DeepMind is unquestionably full of capable engineers with incredible skill, and personally Gemini has been great for my day job and my coding-for-fun (not for profit) endeavors. Switching between it and 4.6 in Antigravity, I don't see much of a difference; they both do what I ask.

But man, people are really avid about it being an awful model.

Decabytes 58 minutes ago|||
Any tips for working with Gemini through its chat interface? I’ve worked with ChatGPT and Claude and I’ve generally found them pleasant to work with, but every time I use Gemini the output is straight dookie
londons_explore 29 minutes ago||
make sure you use ai studio (not the vertex one), not the consumer gemini interface. Seems to work better for code there.
metadat 22 minutes ago|||
Attention is the new scarce resource. Saving even 50% is nothing if it wastes more of my time.
csmpltn 46 minutes ago|||
> "People underrate Google's cost effectiveness so much. Half price of Opus. HALF."

Google undercutting/subsidizing its own prices to bite into Anthropic's market share (whilst selling at a loss) doesn't automatically mean Google is effective.

sdeiley 38 minutes ago||
Everybody is subsidizing their prices.

But Flash is 1/8 the cost of Sonnet and it's not impressive?

csmpltn 7 minutes ago||
Sure, for the launch. Until they start introducing ads, capping existing subscriptions and raising prices (on all products)
mritchie712 47 minutes ago|||
It's half the price per token. Not all tokens are generated equally.
sdeiley 37 minutes ago||
Neither are cars, but I'll take a Porsche over a Ferrari for a fraction of the price.
ionwake 24 minutes ago||
which model?
sdeiley 21 minutes ago||
For me any, tbh. I wouldn't fit in a Ferrari lol
fastball 33 minutes ago|||
We are not at the moment where price matters. All that matters is performance.
sdeiley 22 minutes ago|||
What did you say? Can't hear you over the $400B in capex spend.

Counterpoint: price will matter before we hit AGI

willis936 26 minutes ago|||
It matters to me. I pay for it and I like using it. I pick my models to keep my spend reined in.
cyanydeez 55 minutes ago|||
Some people like blackjack and a technical edge with card counting, others just say screw it and do slot machines.
sdeiley 20 minutes ago||
This is a decent analogy actually. Kudos
Svoka 55 minutes ago|||
While price is definitely important, results are extremely important. Gemini often falls into the 'didn't do it' part of the spectrum; these days Opus almost always does 'good enough'.

Gemini definitely has its merits but for me it just doesn't do what other models can. I vibe-coded an app which recommends me restaurants. The app uses the Gemini API to recommend restaurants given a bunch of data and a prompt.

App itself is vibe-coded with Opus. Gemini didn't cut it.

sdeiley 31 minutes ago||
The binary distinction you draw on models that haven't been out a quarter is borderline insane.

Opus is absurdly good in Claude Code, but there are a lot of use cases Gemini is great at.

I think Google is further behind with the harness than the model

SV_BubbleTime 56 minutes ago|||
Well, it’s half if the product is equal.

Is it? Honestly, I still chuckle about black Nazis and the female Indian Popes. That was my first impression of Gemini, and first impressions are hard to break. I used Gemini’s VL (vision) for something and it refused to describe it because it assumed it was NSFW imagery, which it was not.

I also question stasis as an obvious follow-up. Is Gemini equal to Opus? Today? Tomorrow? Has Google led the industry thus far and do I expect them to continue?

A counterpoint to that would be that with natural language input and output, LLM-specific tooling is rare and it is easy to switch around if you commoditize the product backend.

varispeed 43 minutes ago||
If something is shit, it doesn't matter it costs half price of something okay.
n4pw01f 1 minute ago||
I created a nice harness and visual workflow builder for my Gemini agent chains, works very well. I did this so it would create code the way I do, that is very editable.

In contrast, the vs code plugin was pretty bad, and did crazy things like mix languages

sheepscreek 39 minutes ago||
If it’s any consolation, it was able to one-shot a UI & data sync race condition that even Opus 4.6 struggled to fix (across 3 attempts).

So far I like how it’s less verbose than its predecessor. Seems to get to the point quicker too.

While it gives me hope, I am going to play it by ear. Otherwise it’s going to be Gemini for world knowledge/general intelligence/R&D and Opus/Sonnet 4.6 to finish it off.

sigmoid10 32 minutes ago|
For me it's Opus 4.6 for researching code/digging through repos, gpt 5.3 codex for writing code, gemini for single hardcore science/math algorithms and grok for things the others refuse to answer or skirt around (e.g. some security/exploitability related queries). Get yourself one of those wrappers that support all models and forget thinking about who has the best model. The question is who has the best model for your problem. And there's usually a correct answer, even if it changes regularly.
xrd 4 hours ago||
These models are so powerful.

It's totally possible to build entire software products in the fraction of the time it took before.

But, reading the comments here, the behaviors from one version to another point version (not major version mind you) seem very divergent.

It feels like we are now able to manage incredibly smart engineers for a month at the price of a good sushi dinner.

But it also feels like you have to be diligent about adopting new models (even same family and just point version updates) because they operate totally differently regardless of your prompt and agent files.

Imagine managing a team of software developers where every month it was an entirely new team with radically different personalities, career experiences and guiding principles. It would be chaos.

I suspect that older models will be deprecated quickly and unexpectedly, or, worse yet, will be swapped out with subtly different behavioral characteristics without notice. It'll be quicksand.

nly 4 minutes ago||
I keep giving the top Anthropic, Google and OpenAI models problems.

They come up with passable solutions and are good for getting juices flowing and giving you a start on a codebase, but they are far from building "entire software products" unless you really don't care about quality and attention to detail.

simonw 4 hours ago|||
I had an interesting experience recently where I ran Opus 4.6 against a problem that o4-mini had previously convinced me wasn't tractable... and Opus 4.6 found me a great solution. https://github.com/simonw/sqlite-chronicle/issues/20

This inspired me to point the latest models at a bunch of my older projects, resulting in a flurry of fixes and unblocks.

small_model 3 hours ago|||
I have a codebase (personal project) and every time there is a new Claude Opus model I get it to do a full code review. Never had any breakages in last couple of model updates. Worried one day it just generates a binary and deletes all the code.
TZubiri 2 hours ago||
No version control?
small_model 2 hours ago||
I was being facetious, I mean one day models might skip the middleman of code and compilation and take your specs and produce an ultra efficient binary.
mikestorrent 50 minutes ago||
Musk was saying that recently but I don't see it being efficient or worthwhile to do this. I could be proven brutally wrong, but code is language; executables aren't. There's also no real reason to bother with this when we have quick-compiling languages.

More realistically, I could see particular languages and frameworks proving out to be more well-designed and apt for AI code creation; for instance, I was always too lazy to use a strongly-typed language, preferring Ruby for the joy of writing in it (obsessing about types is for a particular kind of nerd that I've never wanted to be). But now with AI, everything's better with strong types in the loop, since reasoning about everything is arguably easier and the compiler provides stronger guarantees about what's happening. Similarly, we could see other linguistic constructs come to the forefront because of what they allow when the cost of implementation drops to zero.

TZubiri 23 minutes ago||
You can map tokens to CPU instructions and train a model on that, that's what they do for input images I think.

I think the main limitation of the current models is not that CPU instructions can't be tokens (they can be, with .asm); it's that generation is causal: the model would need to produce a binary entirely from start to finish, sequentially.

If we learned anything over the last 50 years of programming, it's that that's hard, and that's why we invented programming languages. Why would it be simpler to just generate the machine code? Sure, maybe an LLM-to-application path can exist, but my money is on there being a whole toolchain in the middle, and it will probably be the same old toolchain that we are using currently: an OS, probably Linux.

Isn't it more common that stuff builds on the existing infra instead of a super duper revolution that doesn't use the previous tech stack? It's much easier to add onto rather than start from scratch.

poszlem 42 minutes ago||||
This may seem obvious, but many people overlook it. The effect is especially clear when using an AI music model. For example, in Suno AI you can remaster an older AI generated track with a newer model. I do this with all my songs whenever a new model is released. It makes it super easy to see the improvements that were made to the models over time.
jauntywundrkind 3 hours ago||||
From the project description here for your sqlite-chronicle project:

> Use triggers to track when rows in a SQLite table were updated or deleted

Just a note in case it's interesting to anyone, the SQLite-compatible Turso database has CDC, a changes table! https://turso.tech/blog/introducing-change-data-capture-in-t...

petesergeant 4 hours ago|||
I continue to get great value out of having claude and codex bound together in a loop: https://github.com/pjlsergeant/moarcode
apitman 3 hours ago||
They are one, the ring and the dark lord
jama211 4 hours ago|||
Yeah I keep maintaining a specific app I built with gpt 5.1 codex max with that exact model because it continues to work for the requests I send it, and attempts with other models even 5.2 or 5.3 codex seemed to have odd results. If I were superstitious I would say it’s almost like the model that wrote the code likes to work on the code better. Perhaps there’s something about the structure it created though that it finds easier to understand…
seizethecheese 4 hours ago|||
> It feels like we are now able to manage incredibly smart engineers for a month at the price of a good sushi dinner.

In my experience it’s more like idiot savant engineers. Still remarkable.

worldsavior 4 hours ago|||
Sushi dinner? What are you building with AI, a calculator?
WarmWash 4 hours ago|||
I have long suspected that a large part of people's distaste for given models comes from their comfort with their daily driver.

Which I guess feeds back to prompting still being critical for getting the most out of a model (outside of subjective stylistic traits the models have in their outputs).

HardCodedBias 2 hours ago||
"These models are so powerful."

Careful.

Gemini simply, as of 3.0, isn't in the same class for work.

We'll see in a week or two if it really is any good.

Bravo to those who are willing to give up their time to test for Google to see if the model is really there.

(history says it won't be. Ant and OAI really are the only two in this race ATM).

minimaxir 6 hours ago||
Price is unchanged from Gemini 3 Pro: $2/M input, $12/M output. https://ai.google.dev/gemini-api/docs/pricing

Knowledge cutoff is unchanged at Jan 2025. Gemini 3.1 Pro supports "medium" thinking where Gemini 3 did not: https://ai.google.dev/gemini-api/docs/gemini-3

Compare to Opus 4.6's $5/M input, $25/M output. If Gemini 3.1 Pro does indeed have similar performance, the price difference is notable.

amluto 1 hour ago||
Now compare the monthly plans for business users who want the CLI agent but who don’t want the models trained on their data.

OpenAI: no big deal — sign up, pick your number of seats, and you’re all set.

Anthropic: also no big deal but there’s an obnoxious minimum purchase.

Google: first you have to try to figure out what the product is called. Then you need to figure out how to set the correct IAM rules. Then you have to sign up and pay for it. Maybe you succeed. Maybe you give up after an hour or two of cursing. Gemini is, of course, completely unable to help. (OpenAI clearly has not trained their models on how to operate their tools. Google’s models hallucinate Google’s product offerings so outrageously that I’m not sure I could tell. I haven’t asked Claude about Claude Code.)

At least the monthly pricing is similar once you get over the hurdles.

waffletower 44 minutes ago|||
Well some are using Anthropic on AWS Bedrock which is a bit more like the Google paragraph. Perhaps a good thing that Nova models aren't competitive (and many here are asking "What's a Nova model?"). And remember, many businesses aren't flinching at IAM controls and are asking for data privacy contracts.
trvz 4 minutes ago||
Well some are masochists.
miohtama 48 minutes ago|||
I can confirm the products bit, I tried to use Gemini to help with G Suite admin.
rancar2 5 hours ago|||
If we don't see a huge gain on the long-term horizon thinking reflected with the Vendor-Bench 2, I'm not going to switch away from CC. Until Google can beat Anthropic on that front, Claude Code paired with the top long-horizon models will continue to pull away with full stack optimizations at every layer.
jbellis 4 hours ago|||
still no minimal reasoning in G3.1P :(

(this is why Opus 4.6 is worth the price -- turning off thinking makes it 3x-5x faster but it loses only a small amount of intelligence. nobody else has figured that out yet)

sunaookami 2 hours ago||
Thinking is just tacked on for Anthropic's models and always has been, so leaving it off actually produces better results every time.
girvo 1 hour ago||
What about for analysis/planning? Honestly I've been using thinking, but if I don't have to with Opus 4.6 I'm totally keen to turn it off. Faster is better.
agentifysh 2 hours ago|||
Looks like it's cheaper than Codex??? This might be interesting then
TZubiri 2 hours ago||
It's not trained for agentic coding I don't think
oblio 2 hours ago|||
> Knowledge cutoff is unchanged at Jan 2025.

Isn't that a bit old?

minimaxir 2 hours ago||
Old relative to its competitors, but the Search tool can compensate for it.
plaidfuji 5 hours ago||
Sounds like the update is mostly system prompt + changes to orchestration / tool use around the core model, if the knowledge cutoff is unchanged
sigmar 5 hours ago|||
knowledge cutoff staying the same likely means they didn't do a new pre-train. We already knew there were plans from deepmind to integrate new RL changes in the post training of the weights. https://x.com/ankesh_anand/status/2002017859443233017
brokencode 4 hours ago|||
This keeps getting repeated for all kinds of model releases, but isn’t necessarily true. It’s possible to make all kinds of changes without updating the pretraining data set. You can’t judge a model’s newness based on what it knows about.
mijoharas 5 hours ago||
Gemini 3 is still in preview (limited rate limits) and 2.5 is deprecated (still live but won't be for long).[0]

Are Google planning to put any of their models into production any time soon?

Also somewhat funny that some models are deprecated without a suggested alternative (gemini-2.5-flash-lite). Do they suggest people switch to Claude?

[0] https://ai.google.dev/gemini-api/docs/deprecations

andrewmutz 5 hours ago||
I agree completely. I don't know how anyone can be building on these models when all of them are either deprecated or not actually released yet. As someone who has production systems running on the deprecated models, this situation really causes me grief.
irthomasthomas 1 hour ago|||
I don't think any of them really wants API customers in the end. They are only temporarily useful.
moffkalast 1 hour ago|||
Well let me use llama.cpp to run worlds-smallest-violin-Q8.gguf

When you build on something that can be rugpulled at any moment, that's really kind of on you.

NitpickLawyer 3 hours ago|||
You are reading your link wrong. They are deprecating the 2.5-preview models. 2.5 (including lite) is up until at least Sept/Oct '26.
mijoharas 3 hours ago||
gemini-2.5-pro has a listed shutdown date of "June 17, 2026" in the linked table.

(Another commenter pointed out that this is the earliest shutdown date and it won't necessarily be shut down on that date).

Where are you getting sept/Oct from? I see gemini-2.5-flash-image in October, but everything else looks like June/July to me?

vidarh 4 hours ago|||
This feels very Google
drbacon 2 hours ago||
I found the Googler!
vidarh 1 hour ago||
Nope. The closest I've gotten was rejecting Google recruiters several times.

But like everyone else I'm used to Google failing to care about products.

cmrdporcupine 27 minutes ago||
Inside Google we just constantly joked/complained about "old thing is deprecated, new isn't ready yet"

This held for internal APIs, facilities, and systems even more than it did for the outside world. Which is terrible.

axpy906 1 hour ago|||
Have 2.5 in prod. Hope they release 3 lite soon so it will be easier to swap them. Holding my breath as pro pricing is a non starter.
jakejohnson 4 hours ago|||
I haven't seen any deprecation notices for 2.5 yet, just for 2. I'd expect (and hope) the deprecation timeline for 2.5 is longer since 3.0 is still in preview. Maybe they just default to 1 year here?

> Note: The shutdown dates listed in the table indicate the /earliest/ possible dates on which a model might be retired. We will communicate the exact shutdown date to users with advance notice to ensure a smooth transition to a replacement model.

mijoharas 3 hours ago||
I think you're right, it was 2 I think I saw explicitly deprecated, then searched again and saw 2.5 having a shutdown date.

This article[0] talks about 2 being deprecated.

It's still frustrating that they don't have proper production endpoints for 3.0 yet.

[0] https://ai.google.dev/gemini-api/docs/changelog

logicchains 1 hour ago||
It's the Tensorflow strategy; deprecate the old version while the new version still doesn't support all the old one's functionality.
incognito124 52 minutes ago||
welcome to Google, where the only kinds of systems are the deprecated ones, and the not-yet-production-ready ones
1024core 4 hours ago||
It got the car wash question perfectly:

You are definitely going to have to drive it there—unless you want to put it in neutral and push!

While 200 feet is a very short and easy walk, if you walk over there without your car, you won't have anything to wash once you arrive. The car needs to make the trip with you so it can get the soap and water.

Since it's basically right next door, it'll be the shortest drive of your life. Start it up, roll on over, and get it sparkling clean.

Would you like me to check the local weather forecast to make sure it's not going to rain right after you wash it?

1024core 4 hours ago||
Some people are suggesting that this might actually be in the training set. Since I can't rule that out, I tried a different version of the question, with an elephant instead of a car:

> It's a hot and dusty day in Arizona and I need to wash my elephant. There's a creek 300 feet away. Should I ride my elephant there or should I just walk there by myself?

Gemini said:

That sounds like quite the dusty predicament! Given the options, you should absolutely ride (or lead) your elephant to the creek.

Here is the straightforward reality of the situation:

    The Logistical Nightmare of Walking Alone: If you walk to the creek by yourself, your elephant stays at your starting point. Elephants require a massive amount of water for a bath. Unless you are planning an extreme workout carrying hundreds of buckets of water 300 feet back through the Arizona heat, you're going to have a very bad time—and the elephant will stay dusty.

    The Perks of Going Together: 300 feet is a very short, easy stroll for an elephant. Plus, elephants generally love water and mud! Letting your elephant get right into the creek will make the washing process much easier for you and much more enjoyable for them. They can even use their trunk to help hose themselves down.
Grab some elephant-safe soap and head down to the water together!

Would you like some tips on the best methods and tools for scrubbing down an elephant?

kovek 4 hours ago|||
I think that semantically this question is too similar to the car wash one. Changing subjects from car to elephant and car wash to creek does not change the fact that they are subjects. The embeddings will be similar in that dimension.
1024core 4 hours ago||
I understand. But isn't it a sign of "smarts" that one can generalize from analogous tasks?
willis936 4 hours ago||
Sure it is, but it's a different set of smarts than the kind of gotcha logic puzzle trying to be tested with the car wash question.
nebezb 17 minutes ago||
My gut says you’re right, but I don’t know if this is indeed true. It might be the same thing.
irthomasthomas 1 hour ago||||
i would say this is a lower difficulty. the car question primes it to think about stuff like energy and pollution.
proto-n 3 hours ago||||
From Gemini pro:

You should definitely ride the elephant (or at least lead it there)!

Here is the logic:

If you walk there by yourself, you will arrive at the creek, but the dirty elephant will still be 300 feet back where you started. You can't wash the elephant if it isn't with you!

Plus, it is much easier to take the elephant to the water than it is to carry enough buckets of water 300 feet back to the elephant.

Would you like another riddle, or perhaps some actual tips on how to keep cool in the Arizona heat?

jmkni 1 hour ago|||
I like telling it the car wash is 24,901.1 miles away (with the implication being that I'm right beside the car wash)
embedding-shape 4 hours ago|||
GPT-OSS-120b, a (downloadable) model released more than half a year ago also gets that right, I'm not sure this is such a great success.

> Would you like me to check the local weather forecast to make sure it's not going to rain right after you wash it?

Regardless of what I do, the days I decide to wash my car, it ALWAYS rains the day after, even if I look at the weather report and it says sunny. Cute that Gemini thinks it can predict it better than the universe's will.

sushisource 1 hour ago|||
> Regardless of what I do, the days I decide to wash my car, it ALWAYS rains the day after

Undeniable universal truth. I sometimes find myself making plans based on the fact that the most annoying possible outcome is also the most likely one.

Aachen 1 hour ago|||
Non-car person here. Why does that matter? It's not like rain means you didn't have to go to the wash; it rains often enough here that if that were the case there wouldn't be any car wash places left near me, but there are plenty
opinion-is-bad 1 hour ago||
Many people avoid washing cars just before rain to avoid spots, etc. Phoenix, as an extreme example: it rarely rains there, and when it does it leaves everything filthy afterwards.
steve1977 4 hours ago|||
The question of course is, did it get the car wash question right because it is "the car wash question" or because it could actually infer why the car needed to be there?
embedding-shape 4 hours ago|||
Wasn't that "twoot" (or whatever Mastodon calls them) made just a week ago? Unlikely to have been in the training dataset of a model becoming available for public use today, unless Google made some serious advancements on the training front.
jama211 4 hours ago|||
Shouldn’t be too hard to come up with a new unique reasoning question
leumon 2 hours ago|||
Gemini 3 pro and flash already answered this correctly.
upmind 4 hours ago|||
The answer here is why I dislike Gemini, though it gets the correct answer, it's far too verbose.
KoolKat23 1 hour ago|||
I can't stand a model over-explaining, needless fluff and wasting tokens. I asked the question so I know the context.
mikelward 2 hours ago|||
I don't love the verbosity of any of the chatbots when I'm using my phone, but at least it put the answer/tl;dr in the first paragraph.
patrickmcnamara 3 hours ago|||
Truly we are entering the era of AGI.
suddenlybananas 4 hours ago|||
They probably had time to toss that example in the training soup.
AlphaAndOmega0 4 hours ago||
Previous models from competitors usually got that correct, and the reasoning versions almost always did.

This kind of reflexive criticism isn't helpful, it's closer to a fully generalized counter-argument against LLM progress, whereas it's obvious to anyone that models today can do things they couldn't do six months ago, let alone 2 years back.

suddenlybananas 4 hours ago||
I'm not denying any progress, I'm saying that reasoning failures that are simple which have gone viral are exactly the kind of thing that they will toss in the training data. Why wouldn't they? There's real reputational risks in not fixing it and no costs in fixing it.
AlphaAndOmega0 1 hour ago||
Given that Gemini 3 Pro already did solid on that test, what exactly did they improve? Why would they bother?

I double checked and tested on AI Studio, since you can still access the previous model there:

>You should drive. >If you walk there, your car will stay behind, and you won't be able to wash it.

Thinking models consistently get it correct and did when the test was brand new (like a week or two ago). It is the opposite of surprising that a new thinking model continues getting it correct, unless the competitors had a time machine.

buttered_toast 4 hours ago||
I think we need to reevaluate what purpose these sorts of questions serve and why they're important in regards to judging intelligence.

The model getting it correct or not at any given instance isn't the point, the point is if the model ever gets it wrong we can still assume that it still has some semblance of stochasticity in its output, given that a model is essentially static once it is released.

Additionally, they don't learn post-training (except in context, which I think counts as learning to some degree, albeit transient). If, hypothetically, it answers incorrectly 1 in 50 attempts, and I explain in that 1 failed attempt why it is wrong, there will still be a 1-in-50 chance it gets it wrong in a new instance.

This differs from humans. Say, for example, I give an average person the "what do you put in a toaster" trick and they fall for it; I can be pretty confident that if I try that trick again 10 years later they will probably not fall for it. You can't really say that for a given model.

energy123 3 hours ago||
They're important but not as N=1. It's like cherry picking a single question from SimpleQA and going aha! It got it right! Meanwhile it's 8% lower score than some other model when evaluated on all questions.
buttered_toast 3 hours ago||
Makes me wonder what people would consider better, a model that gets 92% of questions right 100% of the time, or a model that gets 95% of the questions right 90% of the time and 88% right the other 10%?

I think that's why benchmarking is so hard for me to fully get behind, even if we do it over say, 20 attempts and average it. For a given model, those 20 attempts could have had 5 incredible outcomes and 15 mediocre ones, whereas another model could have 20 consistently decent attempts and the average score would be generally the same.

We at least see variance in public benchmarks, but in the internal examples that's almost never the case.

nickandbro 6 hours ago||
Does well on SVGs outside of "pelican riding on a bicycle" test. Like this prompt:

"create a svg of a unicorn playing xbox"

https://www.svgviewer.dev/s/NeKACuHj

Still some tweaks to the final result, but I am guessing with the ARC-AGI benchmark jumping so much, the model's visual abilities are allowing it to do this well.

ertgbnm 1 hour ago||
Animated SVGs are one of the examples in the press release. Which is fine, I just think the weird SVG benchmark is now dead. Gemini has beaten the benchmark and now differences just come down to taste.

I don't know if it got these abilities through generalization or if google gave it a dedicated animated SVG RL suite that got it to improve so much between models.

Regardless, we need a new vibe-check benchmark a la the pelican on a bicycle.

pugio 30 minutes ago|||
Unfortunately it still fails my personal SVG benchmark (educational 2d cross section of the human heart), even after multiple iterations and screenshots feedback. Oh well, back to the (human) drawing board.
simonw 6 hours ago|||
Interesting how it went a bit more 3D with the style of that one compared to the pelican I got.
andy12_ 6 hours ago|||
I'm thinking now that as models get better and better at generating SVGs, there could be a point where we can use them to just make arbitrary UIs and interactive media with raw SVGs in realtime (like flash games).
rafark 3 hours ago|||
> there could be a point where we can use them to just make arbitrary UIs and interactive media with raw SVGs

So render ui elements using xml-like code in a web browser? You’re not going to believe me when I tell you this…

nickandbro 5 hours ago|||
Or quite literally a game where SVG assets are generated on the fly using this model
kridsdale3 4 hours ago||
That's one dimension before another long-term milestone: realtime generation of 3D mesh content during gameplay.

Which is the "left brain" approach vs the "right brain" approach of coming at dynamic videogames from the diffusion model direction which the Gemini Genie thing seems to be about.

roryirvine 5 hours ago|||
On the other hand, creation of other vector image formats (eg. "create a postscript file showing a walrus brushing its teeth") hasn't improved nearly so much.

Perhaps they're deliberately optimising for SVG generation.

mclau153 3 hours ago||
can we move on from SVG to 3D models at some point?
knicholes 37 minutes ago|||
Image to model is already a thing, and it's pretty good.
simonw 6 hours ago|
Pretty great pelican: https://simonwillison.net/2026/Feb/19/gemini-31-pro/ - it took over 5 minutes, though I think that's because they're having performance teething problems on launch day.
embedding-shape 6 hours ago||
It's an excellent demonstration of the main issue I have with the Gemini family of models, they always go "above and beyond" to do a lot of stuff, even if I explicitly prompt against it. In this case, most of the SVG ends up consisting not just of a bike and a pelican, but clouds, a sun, a hat on the pelican and so much more.

Exactly the same thing happens when you code, it's almost impossible to get Gemini to not do "helpful" drive-by-refactors, and it keeps adding code comments no matter what I say. Very frustrating experience overall.

mullingitover 5 hours ago|||
> it's almost impossible to get Gemini to not do "helpful" drive-by-refactors

Just asking "Explain what this service does?" turns into

[No response for three minutes...]

+729 -522

cowmoo728 5 hours ago|||
it's also so aggressive about taking out debug log statements and in-progress code. I'll ask it to fill in a new function somewhere else and it will remove all of the half-written code from the piece I'm currently working on.
chankstein38 5 hours ago||
I ended up adding a "NEVER REMOVE LOGGING OR DEBUGGING INFO, OPT TO ADD MORE OF IT" to my user instructions and that has _somewhat_ fixed the problem but introduced a new problem where, no matter what I'm talking to it about, it tries to add logging. Even if it's not a code problem. I've had it explain that I could setup an ESP32 with a sensor so that I could get logging from it then write me firmware for it.
sd9 4 hours ago|||
If it's adding too much logging now, have you tried softening the instruction about adding more?

"NEVER REMOVE LOGGING OR DEBUGGING INFO. If unsure, bias towards introducing sensible logging."

Or just

"NEVER REMOVE LOGGING OR DEBUGGING INFO."

bratwurst3000 4 hours ago
"I've had it explain that I could setup an ESP32 with a sensor so that I could get logging from it then write me firmware for it." lol did you try it? This is so far from everything rational
BartShoot 4 hours ago||||
If you had to ask, it obviously needs to refactor the code for clarity so the next person doesn't need to ask
quotemstr 4 hours ago||||
What. You don't have yours ask for edit approval?
girvo 1 hour ago|||
The depressing truth is most I know just run all these tools in /yolo mode or equivalents.

Because your coworkers definitely are, and we're stack ranked, so it's a race (literally) to the bottom. Just send it...

(All this actually seems to do is push the burden on to their coworkers as reviewers, for what it's worth)

embedding-shape 4 hours ago||||
Who has time for that? This is how I run codex: `codex --sandbox danger-full-access --dangerously-bypass-approvals-and-sandbox --search exec "$PROMPT"`, having to approve each change would effectively destroy the entire point of using an agent, at least for me.

Edit: obviously inside something so it doesn't have access to the rest of my system, but enough access to be useful.

well_ackshually 2 hours ago|||
>Who has time for that?

People that don't put out slop, mostly.

embedding-shape 39 minutes ago||
That's another thing entirely, I still review and manually decide the exact design and architecture of the code, with more care now than before. Doesn't mean I want the UI of the agent to need manual approval of each small change it does.
quotemstr 4 hours ago|||
I wouldn't even think of letting an agent work in that made. Even the best of them produce garbage code unless I keep them on a tight leash. And no, not a skill issue.

What I don't have time to do is debug obvious slop.

kees99 3 hours ago|||
I ended up running codex with all the "danger" flags, but in a throw-away VM with copy-on-write access to code folders.

Built-in approval thing sounds like a good idea, but in practice it's unusable. Typical session for me was like:

  About to run "sed -n '1,100p' example.cpp", approve?
  About to run "sed -n '100,200p' example.cpp", approve?
  About to run "sed -n '200,300p' example.cpp", approve?
Could very well be a skill issue, but that was mighty annoying, and with no obvious fix (options "don't ask again for ...." were not helping).
embedding-shape 38 minutes ago|||
I keep it on a tight leash too, not sure how that's related. What gets edited on disk is very different from what gets committed.
mullingitover 3 hours ago|||
Ask mode exists, I think the models work on the assumption that if you're allowing edits then of course you must want edits.
kylec 5 hours ago||||
"I don't know what did it, but here's what it does now"
moffkalast 1 hour ago||||
I've seen Kimi do this a ton as well, so insufferable.
SignalStackDev 4 hours ago|||
[dead]
Yizahi 31 minutes ago||||
Asking LLM programs to "not do the thing" often results in them tripping and generating output including that "thing", since those are simply the tokens which will enter the input. I always try to rephrase query the way that all my instructions have only "positive" forms - "do only this" or "do it only in that way" or "do it only for those parameters requested" etc. Can't say if that helps much, but it is possible.
h14h 3 hours ago||||
Would be really interesting to see an "Eager McBeaver" bench around this concept. When doing real work, a model's ability to stay within the bounds of a given task has almost become more important than its raw capabilities now that every frontier model is so dang good.

Every one of these models is so great at propelling the ship forward, that I increasingly care more and more about which models are the easiest to steer in the direction I actually want to go.

h14h 54 minutes ago|||
For sure. I imagine it'd be pretty difficult to evaluate the "correct" amount of steer-ability. You'd probably just have to measure a delta in eagerness on a single same task between when given highly-specified prompts, and more open-ended prompts. Probably not dissimilar from how artificialanalysis.ai does their "omniscience index".
cglan 3 hours ago|||
being TOO steerable is another issue though.

Codex is very steerable, to a fault, and will gladly "monkey paw" your requests.

Claude Opus will ignore your instructions and do what it thinks is "right" and just barrel forward.

Both are bad and papering over the actual issue which is these models don't really have the ability to actually selectively choose their behavior per issue (ie ask for followup where needed, ignore users where needed, follow instructions where needed). Behavior is largely global

kees99 3 hours ago||
In my experience Claude gradually stops being opinionated as the task at hand becomes more arcane. I frequently add "treat the above as a suggestion, and don't hesitate to push back" to change requests, and it seems to help quite a bit.
enobrev 5 hours ago||||
I have the same issue. Even when I ask it to do code-reviews and very explicitly tell it not to change files, it will occasionally just start "fixing" things.
mikepurvis 4 hours ago||
I find Copilot leans the other way. It'll myopically focus its work in the exact function I point it at, even when it's clear that adding a new helper would be a logical abstraction to share behaviour with the function right beside it.

Overall, I think it's probably better that it stay focused, and allow me to prompt it with "hey, go ahead and refactor these two functions" rather than the other way around. At the same time, really the ideal would be to have it proactively ask, or even pitch the refactor as a colleague would, like "based on what I see of this function, it would make most sense to XYZ, do you think that makes sense? <sure go ahead> <no just keep it a minimal change>"

Or perhaps even better, simply pursue both changes in parallel and present them as A/B options for the human reviewer to select between.

neya 4 hours ago||||
> it's almost impossible to get Gemini to not do "helpful" drive-by-refactors

This has not been my experience. I do Elixir primarily, and Gemini has helped build some really cool products and handle massive refactors. It would even pick up security issues and potential optimizations along the way.

What HAS been a constant issue, though, is that the model will randomly not respond at all and some random error will occur, which is embarrassing for a company like Google, given the infrastructure they own.

embedding-shape 4 hours ago||
Out of curiosity, do you have any public projects (with public source code) you've made exclusively with Gemini, so one could take a look? I've tried a bunch of times to use Gemini to at least finish something small, but I always end up sufficiently frustrated to abandon it, as the instruction-following seems so bad.
msteffen 3 hours ago||||
> it's almost impossible to get Gemini to not do "helpful" drive-by-refactors

Not like human programmers. I would never do this and have never struggled with it in the past, no...

embedding-shape 3 hours ago||
A fairer comparison would be against other models, which are typically better at instruction following. You say "don't change anything not explicitly mentioned" or "Don't add any new code comments" and they tend to follow that.
apitman 3 hours ago||||
This matches my experience using Gemini CLI to code. It would also frequently get stuck in loops. It was so bad compared to Codex that I feel like I must have been doing something fundamentally wrong.
tyfon 5 hours ago||||
I was using Gemini Antigravity in opencode a few weeks ago, before they started banning everyone for that, and I got into the habit of writing "do x, then wait for instructions".

That helped quite a bit, but it would still go off on its own from time to time.

JLCarveth 4 hours ago||||
Every time I have tried using `gemini-cli` it just thinks endlessly and never actually gives a response.
gavinray 5 hours ago||||
Do you have Personalization Instructions set up for your LLM models?

You can make their responses fairly dry/brief.

embedding-shape 5 hours ago|||
I'm mostly using them via my own harnesses, so I have full control of the system prompts and so on. And no matter what I try, Gemini keeps "helpfully" adding code comments every now and then. With every other model, "- Don't add code comments" tends to be enough, but with Gemini I'm not sure how I could stop the comments from eventually appearing.
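
(For context, "my own harnesses" here just means calling the model API directly with my own system prompt; a rough sketch against the public Gemini REST API as I understand it, with an illustrative model name and the kind of negative instruction in question:)

  # Sketch: the harness just sets its own system prompt on each request
  # (endpoint per my understanding of the Gemini API; model name and wording illustrative).
  curl "https://generativelanguage.googleapis.com/v1beta/models/gemini-2.0-flash:generateContent?key=$GEMINI_API_KEY" \
    -H 'Content-Type: application/json' \
    -d '{
      "system_instruction": {
        "parts": [{"text": "You are a coding assistant. Do not add code comments. Change only files explicitly mentioned."}]
      },
      "contents": [{
        "role": "user",
        "parts": [{"text": "Rename the helper function in utils.py to parse_config."}]
      }]
    }'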
WarmWash 5 hours ago|||
I'm pretty sure it writes comments for itself, not for the user. I always let the models comment as much as they want, because I feel it makes the context more robust, especially when cycling contexts often to keep them fresh.

There is a tradeoff though, as comments do consume context. But I tend to pretty liberally dispose of instances and start with a fresh window.

embedding-shape 5 hours ago||
> I'm pretty sure it writes comments for itself, not for the user

Yeah, that sounds worse than "trying to be helpful". Read the code instead; why add indirection in that way, just to be able to understand what other models understand without comments?

metal_am 5 hours ago|||
I'd love to hear some examples!
gavinray 5 hours ago|||
I use LLM's outside of work primarily for research on academic topics, so mine is:

  Be a proactive research partner: challenge flawed or unproven ideas with evidence; identify inefficiencies and suggest better alternatives with reasoning; question assumptions to deepen inquiry.
zengineer 5 hours ago|||
True, whenever I ask Gemini to help me with a prompt for generating an image of XYZ, it generates the image instead.
jasonjmcghee 3 hours ago|||
What's crazy is you've influenced them to spend real effort ensuring their model is good at generating animated svgs of animals operating vehicles.

The most absurd benchmaxxing.

https://x.com/jeffdean/status/2024525132266688757?s=46&t=ZjF...

simonw 2 hours ago|||
I like how they also did a frog on a penny-farthing and a giraffe driving a tiny car and an ostrich on roller skates and a turtle kickflipping a skateboard and a dachshund driving a stretch limousine.
jasonjmcghee 2 hours ago|||
Ok Google what are some other examples like a pelican riding a bicycle
simultsop 2 hours ago|||
Reminds me of Andor and Luthen: positively reinforcing wasting the Emperor's time.
threatofrain 3 hours ago||||
Animated SVG is huge. People in different professions worry to different degrees about being replaced by ML, but this one matters a lot for digital art.
yieldcrv 1 hour ago||
Yeah, complex SVGs are so much more bandwidth-, computation-, and energy-efficient than raster images (up to a point!), but in general use we are not at that point, and there's so much more we can do with them.

I've been meaning to let coding agents take a stab at using the lottie library https://github.com/airbnb/lottie-web to supercharge the user experience without needing to make it a full time job

eurekin 3 hours ago||||
Can't wait until they finally get to real world CAD
tngranados 3 hours ago||
There's a CAD example in that same thread: https://x.com/JeffDean/status/2024528776856817813
tantalor 3 hours ago||||
He's svg-mogging
gnatolf 3 hours ago||||
So let's put things we're interested in in the benchmarks.

I'm not against pelicans!

ghurtado 3 hours ago||
I think the reason the pelican example is great is that it's bizarre enough that it's unlikely to appear in the training data as one unified picture.

If we picked something more common, like say, a hot dog with toppings, then the training contamination is much harder to control.

troymc 21 minutes ago|||
I think it's now part of their training though, thanks to Simon constantly testing every new model against it, and sharing his results publicly.

There's a specific term for this in education and applied linguistics: the washback effect.

rvnx 2 hours ago|||
It's the most common SVG test; it's the equivalent of Will Smith eating spaghetti, so obviously they benchmax toward it.
casey2 3 hours ago||||
You don't have to benchmax everything, just the benchmarks in the right social circles
UltraSane 3 hours ago|||
It is funny to think that Jeff Dean personally worked to optimize the pelican-riding-a-bike benchmark.
MrCheeze 5 hours ago|||
Does anyone understand why LLMs have gotten so good at this? Their ability to generate accurate SVG shapes seems to greatly outshine what I would expect, given their mediocre spatial understanding in other contexts.
tedsanders 3 hours ago|||
A few thoughts:

- One thing to be aware of is that LLMs can be much smarter than their ability to articulate that intelligence in words. For example, GPT-3.5 Turbo was beastly at chess (1800 elo?) when prompted to complete PGN transcripts, but if you asked it questions in chat, its knowledge was abysmal. LLMs don't generalize as well as humans, and sometimes they can have the ability to do tasks without the ability to articulate things that feel essential to the tasks (like answering whether the bicycle is facing left or right).

- Secondly, what has made AI labs so bullish on future progress over the past few years is that they see how little work it takes to get their results. Often, if an LLM sucks at something that's because no one worked on it (not always, of course). If you directly train a skill, you can see giant leaps in ability with fairly small effort. Big leaps in SVG creation could be coming from relatively small targeted efforts, where none existed before.

emp17344 2 hours ago||
We’re literally at the point where trillions of dollars have been invested in these things and the surrounding harnesses and architecture, and they still can’t do economically useful work on their own. You’re way too bullish here.
dbeardsl 2 hours ago||
Neither could cars until very recently. A tool doesn't have to be unsupervised to be useful.
simonw 4 hours ago||||
My best guess is that the labs put a lot of work into HTML and CSS spatial stuff because web frontend is such an important application of the models, and those improvements leaked through to SVG as well.
mitkebes 2 hours ago||||
All models have improved, but from my understanding, Gemini is the main one that was specifically trained on photos/video/etc in addition to text. Other models like earlier chatgpt builds would use plugins to handle anything beyond text, such as using a plugin to convert an image into text so that chatgpt could "see" it.

Gemini was multimodal from the start, and is naturally better at doing tasks that involve pictures/videos/3d spatial logic/etc.

The newer chatgpt models are also now multimodal, which has probably helped with their svg art as well, but I think Gemini still has an edge here

pknerd 4 hours ago||||
> Does anyone understand why LLMs have gotten so good at this?

Added more IF/THEN/ELSE conditions.

kridsdale3 4 hours ago||
More wires and jumpers on the breadboard.
sam_1421 5 hours ago|||
Models are soon going to start benchmaxxing generating SVGs of pelicans on bikes
cbsks 5 hours ago|||
That’s Simon’s goal. “All I’ve ever wanted from life is a genuinely great SVG vector illustration of a pelican riding a bicycle. My dastardly multi-year plan is to trick multiple AI labs into investing vast resources to cheat at my benchmark until I get one.”

https://simonwillison.net/2025/Nov/13/training-for-pelicans-...

travisgriggs 4 hours ago||
So once that's achieved, I wonder how well it deals with unexpected variations. E.g.

"Give me an illustration of a bicycle riding by a pelican"

"Give me an illustration of a bicycle riding over a pelican"

"Give me an illustration of a bicycle riding under a flying pelican"

So on and so forth. Or will it start to look like the Studio C sketch about Lobster Bisque: https://youtu.be/A2KCGQhVRTE

embedding-shape 5 hours ago||||
Soon? I'd be willing to bet it's been included in the training set for at least 6 months by now. Not so obviously that it always generates perfect pelicans on bikes, but enough for the "minibench" to be less useful today than in the past.
Rudybega 1 hour ago|||
If only there were some way to test it, like swapping the two nouns in the sentence. Alas.
jsheard 5 hours ago||||
Simon's been doing this exact test for nearly 18 months now; if vendors want to benchmaxx it, they've had more than enough time to do so already.
stri8ted 5 hours ago||
Exactly. As far as I'm concerned, the benchmark is useless. It's way too easy and rewarding to train on it.
bonoboTP 3 hours ago|||
It's just an in-joke, he doesn't intend it as a serious benchmark anymore. I think it's funny.
Legend2440 5 hours ago||||
Y'all are way too skeptical, no matter what cool thing AI does you'll make up an excuse for how they must somehow be cheating.
toraway 3 hours ago|||
Jeff Dean literally featured it in a tweet announcing the model. Personally it feels absurd to believe they've put absolutely no thought into optimizing this type of SVG output given the disproportionate amount of attention devoted to a specific test for 1 yr+.

I wouldn't really even call it "cheating" since it has improved models' ability to generate artistic SVG imagery more broadly but the days of this being an effective way to evaluate a model's "interdisciplinary" visual reasoning abilities have long since passed, IMO.

It's become yet another example in the ever growing list of benchmaxxed targets whose original purpose was defeated by teaching to the test.

https://x.com/jeffdean/status/2024525132266688757?s=46&t=ZjF...

arcatech 4 hours ago|||
Or maybe you’re too trusting of companies who have already proven to not be trustworthy?
pixl97 5 hours ago|||
I mean if you want to make your own benchmark, simply don't make it public and don't do it often. If your salamander on skis or whatever gets better with time it likely has nothing to do with being benchmaxxed.
ks2048 3 hours ago|||
Forget the paperclip maximizer - AGI will turn the whole world into pelicans on bikes.
SoKamil 5 hours ago|||
It seems they trained the model to output good SVGs.

In their blog post[1], the first use case they mention is SVG generation. Thus, it might not be any indicator at all anymore.

[1] https://blog.google/innovation-and-ai/models-and-research/ge...

brikym 1 hour ago|||
Another great benchmark would be to convert a raster image of a logo into SVG. I've yet to find a good tool for this that produces accurate smooth lines.
culi 2 hours ago|||
Cost per task has increased 4.2x but their ARC-AGI-2 score went from 33.6% to 77.1%

Cost per task is still significantly lower than Opus. Even Opus 4.5

https://arcprize.org/leaderboard

Arcuru 6 hours ago|||
Did you stop using the more detailed prompt? I think you described it here: https://simonwillison.net/2025/Nov/18/gemini-3/
simonw 4 hours ago||
It seems to be having capacity problems right now but I'll run that as soon as I can get it to work.
simonw 2 hours ago||
Pretty solid: https://gist.github.com/simonw/f5c893203621a7631ff178d9093a8...
WarmWash 5 hours ago|||
Less pretty and more practical, it's really good at outputting circuit designs as SVG schematics.

https://www.svgviewer.dev/s/dEdbH8Sw

InitialLastName 5 hours ago|||
I don't know what of this is the prompt and what was the output, but that's a pretty bad schematic (for both aesthetic and circuit-design reasons).
WarmWash 4 hours ago|||
The prompts were doing the design (reference voltage, hysteresis, output stage, all the maths), and then the SVG came from asking the model to take all that and the current BOM and make an SVG schematic of it. In the past, models would just output totally incoherent messes of lines and shapes.

I did a larger circuit too that this is part of, but it's not really for sharing online.

svnt 5 hours ago|||
Yes but you concede it is a schematic.
tadfisher 3 hours ago||
How far we have come!
0_____0 5 hours ago|||
That's pretty amazing for an LLM, but as an EE, if my intern did this I would sigh inwardly and pull up some existing schematics for some brief guidance on symbol layout.
AmazingTurtle 5 hours ago|||
At this point, the pelican benchmark has become so widely used that there must be high-quality pelicans in the dataset, I presume. What about generating an okapi on a bicycle instead?
ascorbic 3 hours ago|||
Loads of examples here https://x.com/jeffdean/status/2024525132266688757
tromp 4 hours ago|||
Or, even more challenging, an okapi on a recumbent?!
steve_adams_86 5 hours ago|||
Ugh, the gears and chain don't mesh and there's no sprocket on the rear hub

But seriously, I can't believe LLMs are able to one-shot a pelican on a bicycle this well. I wouldn't have guessed this was going to emerge as a capability from LLMs 6 years ago. I see why it does now, but... It still amazes me that they're so good at some things.

emp17344 5 hours ago|||
Is this capability “emergent”, or do AI firms specifically target SVG generation in order to improve it? How would we be able to tell?
steve_adams_86 3 hours ago|||
I asked myself the same thing as I typed that comment, and I'm not sure what the answer is. I don't think models are specifically trained on this (though of course they're trained on how to generate SVGs in general), but I'm prepared to be wrong.

I have a feeling the most 'emergent' aspect was that LLMs have generally been able to produce coherent SVG for quite a while, likely without specific training at first. Since then I suspect there has been more tailored training because improvements have been so dramatic. Of course it makes sense that text-based images using very distinct structure and properties could be manipulated reasonably well by a text-based language model, but it's still fascinating to me just how well it can work.

Perhaps what's most incredible about it is how versatile human language is, even when it lacks so many dimensions as bits on a machine. Yet it's still cool that we can resurrect those bits at rest and transmogrify them back into coherent projections of photons from a screen.

I don't think LLMs are AGI or about to completely flip the world upside down or whatever, but it seems undeniably magical when you break it down.

simonw 4 hours ago|||
Google specifically boast about their SVG performance in the announcement post: https://blog.google/innovation-and-ai/models-and-research/ge...

You can try any combination of animal on vehicle to confirm that they likely didn't target pelicans directly though.

0_____0 5 hours ago||||
next time you host a party, have people try to draw a bicycle on your whiteboard (you have a whiteboard in your house right? you should, anyway...)

human adults are generally quite bad at drawing them, unless they spend a lot of time actually thinking about bicycles as objects

542354234235 5 hours ago|||
They are, and it is very funny.

https://www.behance.net/gallery/35437979/Velocipedia

iammattmurphy 4 hours ago||
Fantastic post, thanks for that.
emp17344 5 hours ago|||
What’s your point? Yes, humans fail sometimes, as do AI models. Are you trying to imply that, in light of this, AI is now as capable as human beings? If so, that conclusion doesn’t follow logically.
0_____0 4 hours ago||
it's not a loaded point, i just think it's funny that humans typically cannot one-shot this. and it will make your friends laugh
HPsquared 5 hours ago|||
And the left leg is straight while the right leg is bent.

EDIT: And the chain should pass behind the seat stay.

bredren 6 hours ago|||
What is that, a snack in the basket?
troymc 17 minutes ago|||
The number of snacks in the basket is a random variable with a Poisson distribution.
sigmar 6 hours ago||||
"integrating a bicycle basket, complete with a fish for the pelican... also ensuring the basket is on top of the bike, and that the fish is correctly positioned with its head up... basket is orange, with a fish inside for fun."

how thoughtful of the ai to include a snack. truly a "thanks for all the fish"

defen 5 hours ago||
A pelican already has an integrated snack-holder, though. It wouldn't need to put it in the basket.
SauntSolaire 2 hours ago||
That one's full too
WarmWash 6 hours ago|||
A fish for the road
tarr11 4 hours ago|||
What do you think this particular prompt is evaluating for?

The more popular these particular evals are, the more likely the model will be trained for them.

Gander5739 4 hours ago||
Sea https://simonwillison.net/2025/Nov/13/training-for-pelicans-...
TZubiri 2 hours ago|||
You think they are able to see their output and iterate on it? Or is it pure token generation?
infthi 5 hours ago|||
Wonder when we'll get something other than a side view.
mikepurvis 5 hours ago||
That would be especially challenging for vector output. I tried just now on ChatGPT 5.2 to jump straight to an image, with this prompt:

"make me a cartoon image of a pelican riding a bicycle, but make it from a front 3/4 view, that is riding toward the viewer."

The result was basically a head-on view, but I expect if you then put that back in and said, "take this image and vectorize it as an SVG" you'd have a much better time than trying to one-shot the SVG directly from a description.

... but of course, if that's so, then what's preventing the model from being smart enough to identify this workflow and follow it on its own to get the task completed?

calny 6 hours ago|||
Great pelican but what’s up with that fish in the basket?
coldtea 5 hours ago|||
It's a pelican. What do you expect a pelican to have in his bike's basket?

It's a pretty funny and coherent touch!

embedding-shape 5 hours ago||
> What do you expect a pelican to have in his bike's basket?

Probably stuff it cannot fit in the gullet, or doesn't want there (think trash). I wouldn't expect a pelican to stash fish there, that's for sure.

kridsdale3 4 hours ago||
You never travel with a snack fish for later on? He's going to be burning calories.
nicr_22 4 hours ago||||
Yeah, why only _one_ fish?

It's obvious that pelican is riding long distance, no way a single fish is sufficiently energy dense for more than a few miles.

Can't the model do basic math???

gavinray 5 hours ago|||
Where else are cycling pelicans meant to keep their fish?
calny 4 hours ago||
I get it, I just meant the fish is poorly done, when I'd have guessed it would be a relatively simple part. Maybe the black dot eye is misplaced, idk.
mohsen1 5 hours ago|||
is there something in your prompt about hats? why is the pelican always wearing a hat recently?!
bigfishrunning 5 hours ago||
At this point, i think maybe they're training on all of the previous pelicans, and one of them decided to put a hat on it?

Disclaimer: This is an unsubstantiated claim that i made up

xnx 6 hours ago|||
Not even animated? This is 2026.
readitalready 5 hours ago||
Jeff Dean just posted an animated version: https://x.com/JeffDean/status/2024525132266688757
benbreen 5 hours ago|||
One underrated thing about the recent frontier models, IMO, is that they are obviating the need for image gen as a standalone thing. Opus 4.6 (and apparently 3.1 Pro as well) doesn't have the ability to generate images but it is so good at making SVG that it basically doesn't matter at this point. And the benefit of SVG is that it can be animated and interactive.

I find this fascinating because it literally just happened in the past few months. Up until ~summer of 2025, the SVG these models made was consistently buggy and crude. By December of 2026, I was able to get results like this from Opus 4.5 (Henry James: the RPG, made almost entirely with SVG): https://the-ambassadors.vercel.app

And now it looks like Gemini 3.1 Pro has vaulted past it.

embedding-shape 5 hours ago|||
> doesn't have the ability to generate images but it is so good at making SVG that it basically doesn't matter at this point

Yeah, since the invention of vector images, suddenly no one cares about raster images anymore.

Obviously not true, but that's how your comment reads right now. An "image" in the raster sense is very different from an "image" in the vector sense, and one doesn't automagically replace the other.

buu700 4 hours ago|||
This reminds me of the time I printed a poster with a blown up version of some image for a high school history project. A classmate asked how I did it, so I started going on about how I used software to vectorize the image. Turned out he didn't care about any of that and just wanted the name of the print shop.
Der_Einzige 4 hours ago|||
You have no idea how badly I want to be teleported to the alternative world where VECTOR COMPUTING was the dominant form of computers.

We had high-framerate (yes, it was variable), bright, beautiful displays in the 1980s with the Vectrex.

cachius 5 hours ago|||
2025 that is
bigfishrunning 5 hours ago|||
That Ostrich Tho
cachius 4 hours ago||
That Tires Tho
DonHopkins 4 hours ago|||
How about STL files for 3d printing pelicans!
baq 3 hours ago||
Harder: the bike must work

Hardest: the pelican must work

benatkin 4 hours ago|||
I used the AI studio link and tried running it with the temperature set to 1.75: https://jsbin.com/locodaqovu/edit?html,output
saberience 5 hours ago||
I hope we keep beating this dead horse some more, I'm still not tired of it.