Posted by MallocVoidstar 10 hours ago

Gemini 3.1 Pro (blog.google)
Preview: https://console.cloud.google.com/vertex-ai/publishers/google...

Card: https://deepmind.google/models/model-cards/gemini-3-1-pro/

467 points | 669 comments
0x110111101 1 hour ago|
Relevant: scanned diaries from 1945 of a USFS ranger. I had them transcribed with Claude.

[1]: https://news.ycombinator.com/item?id=47041836

zapnuk 4 hours ago||
Gemini 3 was:

1. Unreliable in GitHub Copilot: lots of 500 and 4XX errors, unusable for the first two months.

2. Not available in Vertex AI (Europe). We have data-residency requirements. Funnily enough, Anthropic is on point with releasing their models to Vertex AI; we already use Opus and Sonnet 4.6.

I hope Google gets their act together and understands that not everyone wants to or can use their global endpoint. We'd like to try their models.

vnglst 6 hours ago||
I asked Gemini 3.1 Pro to generate some of the modern artworks in my "Pelican Art Gallery". I particularly like the rendition of the Sunflowers: https://pelican.koenvangilst.nl/gallery/category/modern
alwinaugustin 2 hours ago||
I use Gemini when I need to write something in my native language, Malayalam, or for translation. It works very well for writing in Indian regional languages.
qingcharles 9 hours ago||
I've been playing with the 3.1 Deep Think version of this for the last couple of weeks and it was a big step up for coding over 3.0 (which I already found very good).

It's only February...

nubg 8 hours ago|
> I've been playing with the 3.1 Deep Think version of this

How?

verdverm 8 hours ago||
A select few have had early access through various programs Google offers. I believe there was a sentence or two to this effect on the Gemini 3 Deep Think post from DeepMind.
mbh159 5 hours ago||
77.1% on ARC-AGI-2 and still can't stop adding drive-by refactors. ARC-AGI-2 tests novel pattern induction, it's genuinely hard to fake and the improvement is real. But it doesn't measure task scoping, instruction adherence, or knowing when to stop. Those are the capabilities practitioners actually need from a coding agent. We have excellent benchmarks for reasoning. We have almost nothing that measures reliability in agentic loops. That gap explains this thread.
XCSme 7 hours ago||
Gets 10/10 on my potato benchmarks: https://aibenchy.com/model/google-gemini-3-1-pro-preview-med...
XCSme 7 hours ago||
Now I need to write more tests.

It's a bit hard to trick reasoning models, because they explore a lot of angles of a problem and might accidentally have an "a-ha" moment that puts them on the right path. It's a bit like random sampling followed by gradient descent from each sample point: one of those starts happens to stumble onto the right result.
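
For illustration, a minimal random-restart sketch of that analogy (the toy objective, sampling range, and step sizes are all made up):

    import math
    import random

    def objective(x):
        # Toy non-convex function with several local minima.
        return (x ** 2 - 4) ** 2 + 5 * math.sin(3 * x)

    def gradient(f, x, eps=1e-5):
        # A numerical gradient is good enough for a sketch.
        return (f(x + eps) - f(x - eps)) / (2 * eps)

    def descend(f, x, lr=0.001, steps=500):
        for _ in range(steps):
            x -= lr * gradient(f, x)
        return x

    def random_restart_minimize(f, samples=20):
        # "Random sampling" of starting points, then gradient descent from each;
        # keep whichever run stumbles onto the best result.
        best = None
        for _ in range(samples):
            x = descend(f, random.uniform(-3.0, 3.0))
            if best is None or f(x) < f(best):
                best = x
        return best

    print(random_restart_minimize(objective))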

thevinter 5 hours ago|||
Are you intentionally keeping the benchmarks private?
XCSme 5 hours ago||
Yes.

I am trying to figure out the best way to give as much information as possible about how the AI models fail without revealing anything that could help them overfit to those specific tests.

I am planning to add some extra LLM calls to summarize the failure reason without revealing the test.
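
A sketch of what that could look like; `llm` here is a stand-in for whatever completion call the benchmark already makes, and the prompt wording is just an assumption:

    def summarize_failure(llm, task, expected, model_answer):
        # Ask a second model to describe the *kind* of mistake without
        # quoting the hidden task or the expected answer back out.
        prompt = (
            "You are writing a public failure summary for a private benchmark.\n"
            "In one or two sentences, describe what kind of mistake the answer\n"
            "makes (wrong unit, missed edge case, ignored constraint, etc.).\n"
            "Do NOT quote the task, the expected answer, or any values from them.\n\n"
            f"Task (secret): {task}\n"
            f"Expected (secret): {expected}\n"
            f"Model answer: {model_answer}"
        )
        return llm(prompt).strip()

The hidden material still reaches the summarizing model, of course; the prompt only keeps it out of the published failure reason.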

XCSme 5 hours ago||
Added one more test, which, surprisingly, Gemini Flash 3 reasoning passes but Gemini 3.1 Pro does not.
attentive 2 hours ago||
A lot of Gemini bashing here, but Flash 3.0 with opencode is a reasonably good and reliable coder.

I'd rate it between Haiku 4.5 (also pretty good for the price) and Sonnet, closer to Sonnet.

Sure, if I weren't cost-sensitive I'd run everything on Opus 4.6, but alas.

datakazkn 3 hours ago|
One underappreciated reason for the agentic gap: Gemini tends to over-explain its reasoning mid-tool-call in a way that breaks structured output expectations. Claude and GPT-4o have both gotten better at treating tool calls as first-class operations. Gemini still feels like it's narrating its way through them rather than just executing.
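
To make that failure mode concrete, a toy strict parser over two invented responses (neither string is actual Gemini output):

    import json

    clean = '{"tool": "read_file", "args": {"path": "src/main.py"}}'
    narrated = (
        "Okay, I should look at the entry point first, so let me read it:\n"
        '{"tool": "read_file", "args": {"path": "src/main.py"}}\n'
        "Then I can decide what to change."
    )

    def parse_tool_call(response: str) -> dict:
        # A strict agent loop expects the whole message to be the JSON payload,
        # so any narration wrapped around it makes json.loads raise.
        return json.loads(response)

    parse_tool_call(clean)  # parses fine
    try:
        parse_tool_call(narrated)
    except json.JSONDecodeError:
        print("narration broke the structured-output contract")

Agent frameworks can of course try to regex the JSON back out, but every salvage heuristic is another place for the loop to drift.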
carbocation 3 hours ago|
I agree with this; it feels like the model most likely to drop its high-level narration into code comments.