Posted by MallocVoidstar 23 hours ago

Gemini 3.1 Pro (blog.google)
Preview: https://console.cloud.google.com/vertex-ai/publishers/google...

Card: https://deepmind.google/models/model-cards/gemini-3-1-pro/

828 points | 842 comments
dude250711 22 hours ago|
I hereby allow you to release models not at the same time as your competitors.
sigmar 22 hours ago|
It is super interesting that this is the same thing that happened in November (i.e. all labs shipping within the same week, 11/12-11/23).
zozbot234 21 hours ago||
They're just throwing a big Chinese New Year celebration.
vintermann 7 hours ago||
Could that actually be connected? There are a LOT of Chinese engineers and researchers working on all these models, I assume they would like to take some vacation days, and it makes sense to me to time releases around it.
PunchTornado 22 hours ago||
The biggest increase is LiveCodeBench Pro: 2887. The rest are in line with Opus 4.6 or slightly better or slightly worse.
shmoogy 22 hours ago|
but is it still terrible at tool calls in actual agentic flows?
Topfi 22 hours ago||
Appears the only difference from 3.0 Pro Preview is Medium reasoning. Model naming has long since stopped even trying to make sense, but considering 3.0 is itself still in preview, increasing the number for such a minor change is not a move in the right direction.
GrayShade 22 hours ago||
Maybe that's the only API-visible change, saying nothing about the actual capabilities of the model?
xnx 22 hours ago|||
> increasing the number for such a minor change is not a move in the right direction

A .1 version number increase seems reasonable for more than doubling the ARC-AGI 2 score and improving so many other benchmarks.

What would you have named it?

Topfi 20 hours ago||
My issue is that we haven't even gotten the release version of 3.0, which is itself still in Preview, so I may stick with 3.0 until it has been deemed stable.

Basically, what does the word "Preview" mean if newer releases happen before a Preview model is stable? In prior Google models, Preview meant that there would still be updates and improvements to said model prior to full deployment, something we saw with 2.5. Now there is no meaning or reason for this designation to exist if they pass over a 3.0 that is still in Preview when shipping model improvements.

xnx 20 hours ago||
Given the pace at which AI is improving, and that it doesn't give the exact same answers under many circumstances, is the [in]stability of "preview" a concern?

Gmail was in "beta" for 5 years.

Topfi 1 hour ago|||
I should have clarified initially what I meant by stable, especially because it isn't widely known how these terms are defined for Gemini models. I'm not talking about getting consistent output from a non-deterministic model, but stable from a usage perspective, in the way Google uses the word "stable" to describe its model deployments [0]. "Preview" in regard to Gemini models means a few very specific restrictions, including far stricter rate limits and a very tight 14-day deprecation window, making them models one cannot build on.

That is why I'd prefer for them to finish the rollout of an existing model before starting work on a dedicated new version.

[0] https://ai.google.dev/gemini-api/docs/models
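
To illustrate the preview/stable split, here is a minimal sketch using the google-generativeai Python client (assuming a GOOGLE_API_KEY in the environment; checking the model name for "preview" is just a heuristic, not an official status flag):

    # Minimal sketch: list available Gemini models and flag preview builds,
    # which carry tighter rate limits and short deprecation windows.
    import os
    import google.generativeai as genai

    genai.configure(api_key=os.environ["GOOGLE_API_KEY"])

    for m in genai.list_models():
        status = "preview" if "preview" in m.name else "stable"
        print(f"{m.name}: {status}")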

verdverm 19 hours ago|||
ChatGPT 4.5 was never released to the public, but it is widely believed to be the foundation the 5.x series is built on.

Wonder how GP feels about the minor bumps for other model providers?

Topfi 1 hour ago||
Minor version bumps are good, and I want model providers to communicate changes. The issue I am having is that Gemini "preview" class models have different deprecation timelines and rate limits, making them impossible to rely on for professional use cases. That's why I'd prefer they finish the 3.0 rollout prior to putting resources into deploying a second "preview" class model.

For a stable deployment, Google needs a sufficient amount of hardware to guarantee inference, and having two Pro models running makes that even more challenging: https://ai.google.dev/gemini-api/docs/models

argsnd 22 hours ago|||
I disagree. Incrementing the minor number makes so much more sense than “gemini-3-pro-preview-1902” or something.
jannyfer 22 hours ago||
According to the blog post, it should also be great at drawing pelicans riding a bicycle.
naiv 22 hours ago||
Ok, so they are scared that 5.3 (Pro) will be released today/tomorrow and blow it out of the water, and rushed this out while they could still reference 5.2 benchmarks.
PunchTornado 22 hours ago|
I don't think models blow other models out of the water anymore. We have the big 3, which are neck and neck in most benchmarks, and then the rest. I doubt that 5.3 will blow the others away.
scld 21 hours ago||
easy now
LZ_Khan 22 hours ago||
biggest problem is that it's slow. also safety seems overtuned at the moment. getting some really silly refusals. everything else is pretty good.
mustaphah 22 hours ago||
Google is terrible at marketing, but this feels like a big step forward.

As per the announcement, Gemini 3.1 Pro scored 68.5% on Terminal-Bench 2.0, which makes it the top performer on the Terminus 2 harness [1]. That harness is a "neutral agent scaffold," built by the Terminal-Bench researchers to compare different LLMs in the same standardized setup (same tools, prompts, etc.).
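
As an illustration of what a "neutral agent scaffold" means in practice, here is a toy sketch, not the actual Terminal-Bench code: the loop, prompt, and shell tool are identical for every model, and query_model is a hypothetical stand-in for whichever provider API drives a given model.

    # Toy sketch of a neutral agent scaffold: identical loop, prompt, and
    # shell tool for every model; only the model name changes.
    import subprocess

    def query_model(model_name: str, transcript: str) -> str:
        # Hypothetical stand-in: a real harness would call the provider's
        # chat API for `model_name` with a shared system prompt.
        return "echo done"

    def run_episode(model_name: str, task: str, max_steps: int = 10) -> str:
        transcript = f"Task: {task}\n"
        for _ in range(max_steps):
            cmd = query_model(model_name, transcript)
            out = subprocess.run(cmd, shell=True, capture_output=True, text=True)
            transcript += f"$ {cmd}\n{out.stdout}{out.stderr}"
            if "done" in out.stdout:  # naive stopping rule for the sketch
                break
        return transcript

    for model in ["model-a", "model-b"]:  # same scaffold, different models
        print(run_episode(model, "create an empty file named hello.txt"))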

It's also taken the top spot on both the Intelligence Index and the Coding Index from Artificial Analysis [2], but on their Agentic Index it's still lagging behind Opus 4.6, GLM-5, Sonnet 4.6, and GPT-5.2.

---

[1] https://www.tbench.ai/leaderboard/terminal-bench/2.0?agents=...

[2] https://artificialanalysis.ai

saberience 22 hours ago|
Benchmarks aren't everything.

Gemini consistently has the best benchmarks but the worst actual real-world results.

Every time they announce the best benchmarks, I try their tools and products again, and each time I immediately go back to the Claude and Codex models because Google is just so terrible at building actual products.

They are good at research and benchmaxxing, but the day to day usage of the products and tools is horrible.

Try using Google Antigravity and you will not make it an hour before switching back to Codex or Claude Code; it's so incredibly shitty.

mustaphah 21 hours ago|||
That's been my experience too; can't disagree. Still, when it comes to tasks that require deep intelligence (esp. mathematical reasoning [1]), Gemini has consistently been the best.

[1] https://arxiv.org/abs/2602.10177

gregorygoc 22 hours ago|||
What’s so shitty about it?
trilogic 20 hours ago||
Humanity's Last Exam 44%, SciCode 59, that one 80, this one 78, but never 100%.

It would be nice to see one of these models, Plus, Pro, Super, God mode, hit 100% on even one benchmark. Am I missing something here?

kuprel 19 hours ago||
Why don't they show Grok benchmarks?
andxor 18 hours ago|
They've fallen way behind.
kuprel 17 hours ago||
GPT-5.2 loses at everything, but they included that.
andxor 14 hours ago||
Who are they supposed to compare it to? I'm not sure what makes you think that Grok is even remotely comparable to the frontier models right now.
andrewstuart 8 hours ago||
The current Gemini version drops most of the code every time I try to use it.

Useless.

jdthedisciple 18 hours ago|
Why should I be excited?