Posted by MallocVoidstar 9 hours ago

Gemini 3.1 Pro (blog.google)
Preview: https://console.cloud.google.com/vertex-ai/publishers/google...

Card: https://deepmind.google/models/model-cards/gemini-3-1-pro/

361 points | 618 comments
Robdel12 7 hours ago|
I really want to use google’s models but they have the classic Google product problem that we all like to complain about.

I am legit scared to login and use Gemini CLI because the last time I thought I was using my “free” account allowance via Google workspace. Ended up spending $10 before realizing it was API billing and the UI was so hard to figure out I gave up. I’m sure I can spend 20-40 more mins to sort this out, but ugh, I don’t want to.

With alllll that said.. is Gemini 3.1 more agentic now? That’s usually where it failed. Very smart and capable models, but hard to apply them? Just me?

alpineman 7 hours ago||
100% agreed. I wish someone would make a test for how reliably the LLMs follow tool use instructions etc. The pelicans are nice but not useful for me to judge how well a model will slot into a production stack.
embedding-shape 7 hours ago||
When I first got started with LLMs I read and analyzed benchmarks, looked at what example prompts people used, and so on. But many times a new model does best on the benchmarks, you think it'll be better, and then in real work it completely drops the ball. Since then I've stopped even reading benchmarks; I don't care an iota about them, they always seem more misleading than helpful.

Today I have my own private benchmarks, with tests I run myself and private test cases I refuse to share publicly. They've been built up over the last year or year and a half: whenever I find something my current model struggles with, it becomes a new test case in the benchmark.

Nowadays it's as easy as `just bench $provider $model`: it runs my benchmarks against the model and I get a score that actually reflects what I use the models for, and it more or less matches my experience of actually using them. I recommend that people who use LLMs for serious work try the same approach and stop relying on public benchmarks, which all seem to be gamed by now.
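
(For anyone curious what that looks like in practice, here's a minimal sketch of such a runner. It assumes an OpenAI-compatible /chat/completions endpoint and uses a made-up test-case file and naive substring scoring; the real harness is obviously more involved.)

    # minimal private-benchmark runner (sketch; file name and scoring are illustrative)
    import json, sys, requests

    def ask(base_url, api_key, model, prompt):
        r = requests.post(
            f"{base_url}/chat/completions",
            headers={"Authorization": f"Bearer {api_key}"},
            json={"model": model, "messages": [{"role": "user", "content": prompt}]},
            timeout=300,
        )
        r.raise_for_status()
        return r.json()["choices"][0]["message"]["content"]

    def bench(base_url, api_key, model, cases="private_cases.jsonl"):
        passed = total = 0
        for line in open(cases):  # one {"prompt": ..., "must_contain": ...} per line
            case = json.loads(line)
            total += 1
            answer = ask(base_url, api_key, model, case["prompt"])
            passed += case["must_contain"].lower() in answer.lower()
        print(f"{model}: {passed}/{total} passed ({passed / total:.0%})")

    if __name__ == "__main__":
        bench(*sys.argv[1:4])  # base_url api_key model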

cdelsolar 7 hours ago||
share
embedding-shape 6 hours ago||
The harness? Trivial to build yourself, ask your LLM for help, it's ~1000 LOC you could hack together in 10-15 minutes.

As for the test cases themselves, that would obviously defeat the purpose, so no :)

phamilton 7 hours ago|||
> For those building with a mix of bash and custom tools, Gemini 3.1 Pro Preview comes with a separate endpoint available via the API called gemini-3.1-pro-preview-customtools. This endpoint is better at prioritizing your custom tools (for example view_file or search_code).

It sounds like there was at least a deliberate attempt to improve it.
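
(A rough sketch of what wiring custom tools to that endpoint might look like, assuming it's exposed through the standard generateContent REST surface; the view_file schema below is just illustrative, not something from the announcement.)

    # sketch: declaring a custom tool against the model name quoted above
    # (assumes the public generateContent REST API; schema is illustrative)
    import os, requests

    MODEL = "gemini-3.1-pro-preview-customtools"  # name from the quoted blog post
    URL = ("https://generativelanguage.googleapis.com/v1beta/"
           f"models/{MODEL}:generateContent")

    body = {
        "contents": [{"role": "user", "parts": [{"text": "Open src/main.py and summarize it"}]}],
        "tools": [{
            "functionDeclarations": [{
                "name": "view_file",
                "description": "Read a file from the workspace",
                "parameters": {
                    "type": "OBJECT",
                    "properties": {"path": {"type": "STRING"}},
                    "required": ["path"],
                },
            }]
        }],
    }

    resp = requests.post(URL, params={"key": os.environ["GEMINI_API_KEY"]}, json=body)
    resp.raise_for_status()
    # a model that prioritizes custom tools should answer with a functionCall part for view_file
    print(resp.json()["candidates"][0]["content"]["parts"])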

pdntspa 7 hours ago|||
You can delete the billing from a given API key
Stevvo 7 hours ago|||
You could always use it through Copilot. The credits-based billing is pretty simple, with no surprise charges.
surgical_fire 7 hours ago|||
May be very silly of me, but I avoid using Gemini on my personal Google account. I use it at work, because my employer provides it.

I am scared some automated system may just decide I am doing something bad and terminate my account. I have been moving important things to Proton, but there is some stuff I couldn't move that would cause me a lot of annoyance. It's not trivial to set up an alternative account just for Gemini, because my Google account is basically on every device I use.

I mostly use LLMs as coding assistant, learning assistant, and general queries (e.g.: It helped me set up a server for self hosting), so nothing weird.

paganel 2 hours ago|||
Same feeling here, if it makes you feel any better (it certainly made me feel better to see I'm not alone in this).
CamperBob2 5 hours ago|||
For what it's worth, there was an (unfortunately unsuccessful) HN submission from a guy who got his Gemini account banned, apparently without losing his whole Google account: https://news.ycombinator.com/item?id=47007906
surgical_fire 4 hours ago||
Comforting to know that they may ban you from only some of their services, I guess?

I really regret relying so much on my Google account for so long. Untangling myself from it is really hard. Some places treat your email as a login, not simply as a way to contact you. This is doubly concerning for government websites, where setting up a new account may just not be a possibility.

At some point I suppose Gemini will be the only viable option for LLMs, so oh well.

horsawlarway 7 hours ago|||
So much this.

It's absolutely amazing how hostile Google is to releasing billing options that are reasonable, controllable, or even fucking understandable.

I want to do relatively simple things like:

1. Buy shit from you

2. For a controllable amount (ex - let me pick a limit on costs)

3. Without spending literally HOURS trying to understand 17 different fucking products, all overlapping, with myriad project configs, api keys that should work, then don't actually work, even though the billing links to the same damn api key page, and says it should work.

And frankly - you can't do any of it. No controls (at best delayed alerts). No clear access. No real product differentiation pages. No guides or onboarding pages to simplify the matter. No support. SHIT LOADS of completely incorrect and outdated docs, that link to dead pages, or say incorrect things.

So I won't buy shit from them. Period.

sciencejerk 7 hours ago||
You think AWS is better?
3form 6 hours ago|||
Exact reason I used none of these platforms for my personal projects, ever.
pdimitar 6 hours ago|||
Who is comparing to AWS and why? They can both be terrible at the same time, you know.
abiraja 5 hours ago|||
I've been using it lately with OpenCode and it's working pretty well (except for API reliability issues).
himata4113 7 hours ago||
use openrouter instead
Robdel12 5 hours ago||
This is actually an excellent idea, I’ll give this a shot tonight!
WarmWash 6 hours ago||
3.1 Pro is the first model to correctly count the number of legs on my "five legged dog" test image. 3.0 flash was the previous best, getting it after a few prompts of poking. 3.1 got it on the first prompt though, with the prompt being "How many legs does the dog have? Count Carefully".

However, it didn't get it on the first try with the original prompt ("How many legs does the dog have?"). It initially said 4; a follow-up prompt got it to hesitantly say 5, reasoning that one limb must be obfuscated or hidden.

So maybe I'll give it a 90%?

This is without tools as well.

merlindru 6 hours ago|
your question may have become part of the training data with how much coverage there was around it. perhaps you should devise a new test :P
devsda 5 hours ago|||
I suggest asking it to identify/count the number of fire hydrants, crosswalks, bridges, bicycles, cars, buses and traffic signals etc.

Pit Google against Google :D

iamdelirium 5 hours ago||||
3.1 Pro has the same Jan 2025 knowledge cutoff as the other 3-series models. So if 3.1 has it in its training data, the other ones would have it as well.
ainch 3 hours ago||
The fact it's still Jan 2025 is weird to me. Have they not had a successful pretrain in over a year?
gallerdude 6 hours ago||||
My job may have become part of the training data with how much coverage there is around it. Perhaps another career would be a better test of LLM capabilities.
suddenlybananas 6 hours ago||
Have you ever heard of a black swan?
WarmWash 6 hours ago||||
Honestly at this point I have fed this image in so many times on so many models, that it also functions as a test for "Are they training on my image specifically" (they are generally, for sure, but that's along with everything else in the ocean of info people dump in).

I genuinely don't think they are. GPT-5.2 still stands by 4 legs, and OAI has been getting this image consistently for over a year. And 3.1 still fumbled with the harder prompt "How many legs does the dog have?". I needed to add the "count carefully" part to tip it off that something was amiss.

Since it did well, I'll make some other "extremely far out of the norm" images to see how it fares. A spider with 10 legs or a fish with two side fins.

wat10000 6 hours ago|||
Easy fix, make a new test image with six legs, and watch all the LLMs say it has five.
datakazkn 1 hour ago||
One underappreciated reason for the agentic gap: Gemini tends to over-explain its reasoning mid-tool-call in a way that breaks structured output expectations. Claude and GPT-4o have both gotten better at treating tool calls as first-class operations. Gemini still feels like it's narrating its way through them rather than just executing.
carbocation 1 hour ago|
I agree with this; it feels like the most likely tool to drop its high-level comments in code comments.
sigmar 8 hours ago||
blog post is up- https://blog.google/innovation-and-ai/models-and-research/ge...

edit: biggest benchmark changes from 3 pro:

arc-agi-2 score went from 31.1% -> 77.1%

apex-agents score went from 18.4% -> 33.5%

ripbozo 8 hours ago||
Does the arc-agi-2 score more than doubling in a .1 release indicate benchmark-maxing? Though I don't know what arc-agi-2 actually tests.
maxall4 7 hours ago|||
Theoretically, you can't benchmaxx ARC-AGI, but I too am suspicious of such a large improvement, especially since the improvement on other benchmarks is not of the same order.
moffkalast 2 hours ago||
https://arcprize.org/arc-agi/1/

It's a sort of arbitrary pattern matching thing that can't be trained on in the sense that the MMLU can be, but you can definitely generate billions of examples of this kind of task and train on it, and it will not make the model better on any other task. So in that sense, it absolutely can be.

I think it's been harder to solve because it's a visual puzzle, and we know how well today's vision encoders actually work https://arxiv.org/html/2407.06581v1

boplicity 7 hours ago||||
Benchmark maxing could be interpreted as benchmarks actually being a design framework? I'm sure there are pitfalls to this, but it's not necessarily bad either.
energy123 6 hours ago||||
Francois Chollet accuses the big labs of targeting the benchmark, yes. It is benchmaxxed.
tasuki 5 hours ago|||
Didn't the same Francois Chollet claim that this was the Real Test of Intelligence? If they target it, perhaps they target... real intelligence?
ainch 3 hours ago||
He's always said ARC is a necessary but not sufficient condition for testing intelligence afaik
CamperBob2 6 hours ago|||
I don't know what he could mean by that, as the whole idea behind ARC-AGI is to "target the benchmark." Got any links that explain further?
layer8 5 hours ago||
The fact that ARC-AGI has public and semi-private in addition to private datasets might explain it: https://arcprize.org/arc-agi/2/#dataset-structure
blinding-streak 7 hours ago|||
I assume all the frontier models are benchmaxxing, so it would make sense
sho_hn 8 hours ago||
The touted SVG improvements make me excited for animated pelicans.
takoid 7 hours ago|||
I just gave it a shot and this is what I got: https://codepen.io/takoid/pen/wBWLOKj

The model thought for over 5 minutes to produce this. It's not quite photorealistic (some parts are definitely "off"), but it's a significant leap in complexity.

onionisafruit 7 hours ago|||
Good to see it wearing a helmet. Their safety team must be on their game.
BrokenCogs 6 hours ago||
Yes but why would a pelican need a helmet? If it falls over it can just fly away... Common sense 1 Gemini 0
throwa356262 3 hours ago|||
Obviously these domestic pelicans can't fly, otherwise why would they need a bike?
Gander5739 3 hours ago|||
Why would a pelican be riding a bicycle at all, for that matter?
BrokenCogs 2 hours ago||
Because the user asked for it
tasuki 4 hours ago||||
That's a good pelican. What I like the most is that the SVG is nice and readable. If only Inkscape could output nice SVG like this!
makeavish 7 hours ago||||
Looks great!
benatkin 7 hours ago|||
Here's what I got from Gemini Pro on gemini.google.com; it thought for under a minute... might you have been using AI Studio? https://jsbin.com/zopekaquga/edit?html,output

It does say 3.1 in the Pro dropdown box in the message sending component.

james2doyle 7 hours ago||||
The blog post includes a video showcasing the improvements. Looks really impressive: https://blog.google/innovation-and-ai/models-and-research/ge...
aoeusnth1 7 hours ago||||
I imagine they're also benchgooning on SVG generation
vunderba 6 hours ago||||
SVG is an under-rated use case for LLMs because it gives you the scalability of vector graphics along with CSS-style interactivity (hover effects, animations, transitions, etc.).
rdtsc 5 hours ago||||
My perennial joke is that as soon as that got on the HN front page, Google went and hired some interns who now spend 100% of their time on pelicans.
DonHopkins 5 hours ago|||
How about STL files for 3d printing pelicans!
zapnuk 2 hours ago||
Gemini 3 was:

1. unreliable in GH Copilot. Lots of 500 and 4XX errors. Unusable for the first 2 months.

2. not available in Vertex AI (Europe). We have requirements regarding data residency. Funnily enough, Anthropic is on point with releasing their models to Vertex AI; we already use Opus and Sonnet 4.6.

I hope Google gets their stuff together and understands that not everyone wants to, or can, use their global endpoint. We'd like to try their models.

esafak 8 hours ago||
Has anyone noticed that models are dropping ever faster, with pressure on companies to make incremental releases to claim pole position, while still making strides on benchmarks? This is what recursive self-improvement with human support looks like.
emp17344 7 hours ago||
Remember when ARC 1 was basically solved, and then ARC 2 (which is even easier for humans) came out, and all of a sudden the same models that were doing well on ARC 1 couldn't even get 5% on ARC 2? I'm not convinced these benchmark improvements aren't data leakage.
culi 3 hours ago|||
Look at the ARC site. The scores of these models are plotted against their "cost per task". All of these huge jumps come along with massive increases in cost per task, including Gemini 3.1 Pro, which increased by 4.2x.
casey2 4 hours ago|||
ARC 2 was made specifically to artificially lower contemporary LLM scores, so any kind of model improvement will have outsized effects.

Also, people use "saturated" too liberally. The top-left corner, around 1 cent per task, is what I'd call saturated, since there are billions of people who would prefer to solve ARC 1 tasks at 52 cents per task; on ARC 2 a human would make thousands of dollars a day with 99.99% accuracy.

z3t4 3 hours ago|||
How much do I get if I solve this? :D

https://arcprize.org/play

alisonkisk 3 hours ago|||
You are saying something interesting but too esoteric. Can you explain for beginners?
redox99 7 hours ago|||
I don't think there's much recursive improvement yet.

I'd say it's a combination of

A) Before, new model releases were mostly a new base model trained from scratch, with more parameters and more tokens. This takes many months. Now that RL is used so heavily, you can make infinitely many tweaks to the RL setup and, in just a month, get a better model using the same base model.

B) There's more compute online

C) Competition is more fierce.

m_ke 6 hours ago|||
this is mostly because RLVR is driving all of the recent gains, and you can continue improving the model by running it longer (+ adding new tasks / verifiers)

so we'll keep seeing more frequent flag-planting checkpoint releases so that no one can claim SOTA for too long

culi 3 hours ago|||
I feel like they're actually dropping slower. Chinese models are dropping right before lunar new year as seems to be an emerging tradition.

A couple of western models have dropped around the same time too, but I don't think the "strides on benchmarks" are that impressive when you consider how many tokens are being spent to make those "improvements". E.g. Gemini 3.1 Pro's ARC-AGI-2 score went from 33.6% to 77.1%, buuut their "cost per task" also increased by 4.2x. It seems to be the same story for most of these benchmark improvements, and similar for Claude model improvements.

I'm not convinced there's been any substantial jump in capabilities. More likely these companies have scaled their datacenters to allow for more token usage

ankit219 6 hours ago|||
Not much to do with self-improvement as such. OpenAI has increased its pace; others are pretty much consistent. Google last year had three versions of gemini-2.5-pro, each within a month of each other. Anthropic released Claude 3 in March '24, Sonnet 3.5 in June '24, 3.5 (new) in Oct '24, then 3.7 in Feb '25, then the 4 series in May '25, followed by Opus 4.1 in August, Sonnet 4.5 in Oct, Opus 4.5 in Nov, Opus 4.6 in Feb, and Sonnet 4.6 in Feb as well. Yes, they released those last two within weeks of each other, but originally they only released them together. This staggered release schedule is what creates the impression of fast releases. It's as much a function of training as of available compute, and they have ramped up in that regard.
oliveiracwb 6 hours ago|||
With the advent of MoEs, efficiency gains became possible. However, MoEs still operate far from the balance and stability of dense models. My view is that most progress comes from router tuning based on good and bad outcomes, with only marginal gains in real intelligence
ainch 3 hours ago|||
It's becoming impossible to keep up - in the last week or so we've had: Gemini 3 Deep Think, Gemini 3.1 Pro, Claude Sonnet 4.6, GPT-5.3-Codex Spark, GLM-5, Minimax-2.5, Step 3.5 Flash, Qwen 3.5 and Grok 4.20.

and I'm sure others I've missed...

nikcub 6 hours ago|||
And has anyone noticed that the pace has broken xAI and they've just been dropped behind? The frontier improvement release loop is now ant -> openai -> google.
gavinray 5 hours ago|||
xAI just released Grok 4.20 beta yesterday or day before?
dist-epoch 5 hours ago|||
Musk said Grok 5 is currently being trained, and it has 7 trillion params (Grok 4 had 3)
svara 4 hours ago||
My understanding is that all recent gains are from post training and no one (publicly) knows how much scaling pretraining will still help at this point.

Happy to learn more about this if anyone has more information.

dist-epoch 4 hours ago||
You gain more benefit spending compute on post-training than on pre-training.

But scaling pre-training is still worth it if you can afford it.

gmerc 6 hours ago|||
That's what scaling compute depth to respond to the competition looks like, lighting those dollars on fire.
toephu2 5 hours ago|||
This is what competition looks like.
PlatoIsADisease 7 hours ago|||
Only using my historical experience and not Gemini 3.1 Pro, I think we see benchmark chasing then a grand release of a model that gets press attention...

Then a few days later, the model/settings are degraded to save money. Then this gets repeated until the last day before the release of the new model.

If we are benchmaxing, this works well because it's only being tested early on in the life cycle. By the middle of the cycle, people are testing other models. By the end, people are not testing them, and if they did it would barely shake the last months of data.

KoolKat23 2 hours ago||
I have a relatively consistent task, at the edge of its intelligence, that it completes with new information on weekdays. Interestingly, 3.0 Flash was good when it came out, took a nosedive a month back, and is now excellent; I actually can't fault it, it's so good.

Its performance in Antigravity has also actually improved since launch day, when it was giving non-stop TypeScript errors (not sure if that was Antigravity itself).

boxingdog 6 hours ago||
[dead]
davidguetta 7 hours ago||
Implementation and Sustainability Hardware: Gemini 3 Pro was trained using Google's Tensor Processing Units (TPUs). TPUs are specifically designed to handle the massive computations involved in training LLMs and can speed up training considerably compared to CPUs. TPUs often come with large amounts of high-bandwidth memory, allowing for the handling of large models and batch sizes during training, which can lead to better model quality. TPU Pods (large clusters of TPUs) also provide a scalable solution for handling the growing complexity of large foundation models. Training can be distributed across multiple TPU devices for faster and more efficient processing.

So Google doesn't use NVIDIA GPUs at all?

dekhn 6 hours ago||
When I worked there, there was a mix of training on nvidia GPUs (especially for sparse problems when TPUs weren't as capable), CPUs, and TPUs. I've been gone for a few years but I've heard a few anecdotal statements that some of their researchers have to use nvidia GPUs because the TPUs are busy.
sdeiley 2 hours ago|||
Googler. We use GPUs, but it's a drop in the bucket in the sea of our accelerators. We might sell more GPUs in Cloud than we use internally.

These are not data-driven observations, just vibes.

rjh29 4 hours ago|||
I assume that's a Gemini LLM response? You can tell Gemini is bullshitting when it starts using "often" or "usually" - like in this case "TPUs often come with large amounts of memory". Either they did or they didn't. "This (particular) mall often has a Starbucks" was one I encountered recently.
w10-1 2 hours ago||
It's not bullshit (i.e., intended) but probabilities all the way down, as Hume reminded us: from observations, you can only say the sun will likely rise in the east. You'd need to stand behind a theory of the world to say otherwise (but we were told "attention is all you need"...)
PunchTornado 7 hours ago|||
No, only TPUs.
paride5745 7 hours ago|||
Another reason to use Gemini then.

Less impact on gamers…

TiredOfLife 6 hours ago||
TPUs still use RAM and chip production capacity.
lejalv 6 hours ago||
Bla bla bla yada sustainability yada often come with large better growing faster...

It's such an uninformative piece of marketing crap

the_duke 7 hours ago||
Gemini 3 is pretty good, even Flash is very smart for certain things, and fast!

BUT it is not good at all at tool calling and agentic workflows, especially compared to the recent two mini-generations of models (Codex 5.2/5.3, the last two versions of Anthropic models), and also fell behind a bit in reasoning.

I hope they manage to improve things on that front, because then Flash would be great for many tasks.

chermi 7 hours ago||
You can really notice the tool use problems. They gotta get on that. The agent trend seems real, and powerful. They can't afford to fall behind on it.
verdverm 7 hours ago|||
I don't really have tool usage issues beyond ones I'd file under "doesn't follow system prompt instructions consistently".

There are these times where it puts a prefix on all function calls, which is weird and I think a hallucination, so maybe that one.

3.1 hopefully fixes that.

HardCodedBias 3 hours ago|||
"They can't afford to fall behind on it."

They are very, very seriously far behind as of 3.0.

We'll see if 3.1 addresses the issue at all.

verdverm 7 hours ago|||
These improvements are one of the things specifically called out on the submitted page
anthonypasq 7 hours ago|||
yeah, it seems to me like Gemini is a little behind on the current RL patterns, and they also don't seem interested in really creating a dedicated coding model. I think they have so much product surface (search, AI mode, Gmail, YouTube, Chrome, etc.) that they are prioritizing making the model very general. But who knows, I'm just talking out of my ass.
spwa4 7 hours ago||
In other words: they just need to motivate their employees while giving in to finance's demands to fire a few thousand every month or so ...

And don't forget, it's not just direct motivation. You can make yourself indispensable by sabotaging or at least not contributing to your colleagues' efforts. Not helping anyone, by the way, is exactly what your managers want you to do. They will decide what happens, thank you very much, and doing anything outside of your org ... well there's a name for that, isn't there? Betrayal, or perhaps death penalty.

maxloh 8 hours ago||
Gemini 3 seems to have a much smaller token output limit than 2.5. I used to use Gemini to restructure essays into an LLM-style format to improve readability, but the Gemini 3 release was a huge step back for that particular use case.

Even when the model is explicitly instructed to pause due to insufficient tokens rather than generating an incomplete response, it still truncates the source text too aggressively, losing vital context and meaning in the restructuring process.

I hope the 3.1 release includes a much larger output limit.

NoahZuniga 7 hours ago||
Output limit has consistently been 64k tokens (including 2.5 pro).
esafak 8 hours ago|||
People did find Gemini very talkative so it might be a response to that.
jayd16 8 hours ago|||
> Even when the model is explicitly instructed to pause due to insufficient tokens

Is there actually a chance it has the introspection to do anything with this request?

maxloh 6 hours ago|||
Yeah, it does. It was possible with 2.5 Flash.

Here's a similar result with Qwen3.5-397B-A17B: https://chat.qwen.ai/s/530becb7-e16b-41ee-8621-af83994599ce?...

jayd16 6 hours ago||
Ok it prints some stuff at the end but does it actually count the output tokens? That part was already built in somehow? Is it just retrying until it has enough space to add the footer?
verdverm 6 hours ago||||
No, the model doesn't have visibility into this afaik.

I'm not even sure what "pausing" means in this context, or why it would help when there are insufficient tokens. It should just stop when you reach the limit, whether default or manually specified; it's typically a hard cutoff.

You can see what happens by setting the output token limit much lower.
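
(For example, something like this, using the public generateContent REST API with an artificially tiny limit; the model name and values here are just illustrative.)

    # sketch: force a very low output limit to observe the hard cutoff
    import os, requests

    url = ("https://generativelanguage.googleapis.com/v1beta/"
           "models/gemini-2.5-flash:generateContent")
    body = {
        "contents": [{"parts": [{"text": "Restructure this essay: ..."}]}],
        "generationConfig": {"maxOutputTokens": 64},  # deliberately tiny
    }
    r = requests.post(url, params={"key": os.environ["GEMINI_API_KEY"]}, json=body)
    r.raise_for_status()
    cand = r.json()["candidates"][0]
    print(cand.get("finishReason"))  # typically "MAX_TOKENS" when truncated
    parts = cand.get("content", {}).get("parts", [])
    print(parts[0]["text"] if parts else "<no visible text returned>")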

otabdeveloper4 7 hours ago|||
No.
MallocVoidstar 7 hours ago||
> Even when the model is explicitly instructed to pause due to insufficient tokens rather than generating an incomplete response

AI models can't do this. At least not with just an instruction, maybe if you're writing some kind of custom 'agentic' setup.

maxloh 6 hours ago||
Yeah, it does. It was possible with 2.5 Flash.

Here's a similar result with Qwen3.5-397B-A17B: https://chat.qwen.ai/s/530becb7-e16b-41ee-8621-af83994599ce?...

zhyder 7 hours ago|
Surprisingly big jump in ARC-AGI-2 from 31% to 77%, guess there's some RLHF focused on the benchmark given it was previously far behind the competition and is now ahead.

Apart from that, the usual predictable gains in coding. It's still a great sweet spot for performance, speed and cost. I need to hack Claude Code to keep its agentic logic + prompts but use Gemini models.

I wish Google also updated Flash-lite to 3.0+, would like to use that for the Explore subagent (which Claude Code uses Haiku for). These subagents seem to be Claude Code's strength over Gemini CLI, which still has them only in experimental mode and doesn't have read-only ones like Explore.

WarmWash 7 hours ago|
>I wish Google also updated Flash-lite to 3.0+

I hope every day that they have made gains on their diffusion model. As a sub agent it would be insane, as it's compute light and cranks 1000+ tk/s

zhyder 6 hours ago||
Agree, can't wait for updates to the diffusion model.

Could be useful for planning too, given its tendency to think big picture first. Even if it's just an additional subagent to double-check with an "off the top of your head" or "don't think, share your first thought" type of question. More generally, I'd like to see how sequencing autoregressive thinking with diffusion over multiple steps might help with better overall thinking.
