Gemini 3 Flash: Frontier intelligence built for speed

Posted by meetpateltech 12/17/2025

1102 points | 580 comments

samyok 12/17/2025|

Don’t let the “flash” name fool you, this is an amazing model.

I have been playing with it for the past few weeks, and it’s genuinely my new favorite. It’s so fast and has such vast world knowledge that it’s more performant than Claude Opus 4.5 or GPT 5.2 extra high, for a fraction (basically an order of magnitude less!!) of the inference time and price.

thecupisblue 12/17/2025||

Oh wow - I recently tried 3 Pro preview and it was too slow for me.

After reading your comment I ran my product benchmark against 2.5 flash, 2.5 pro and 3.0 flash.

The results are better AND the response times have stayed the same. What an insane gain - especially considering the price compared to 2.5 Pro. I'm about to get much better results for 1/3rd of the price. Not sure what magic Google did here, but I would love to read a more technical deep dive comparing what they do differently in the Pro and Flash models to achieve such performance.

Also wondering, how did you get early access? I'm using the Gemini API quite a lot and have a quite nice internal benchmark suite for it, so would love to toy with the new ones as they come out.

lancekey 12/18/2025|||

Curious to learn what a “product benchmark” looks like. Is it evals you use to test prompts/models? A third party tool?

Examples from the wild are a great learning tool, anything you’re able to share is appreciated.

thecupisblue 12/18/2025|||

It's an internal benchmark that I use to test prompts, models and prompt-tunes, nothing but a dashboard calling our internal endpoints and showing the data, basically going through the prod flow.

For my product, I run a video through a multimodal LLM with multiple steps, combine data and spit out the outputs + score for the video.

I have a dataset of videos that I manually marked for my usecase, so when a new model drops, I run it + the last few best benchmarked models through the process, and check multiple things:

- Diff between the outputted score and the manual one
- Processing time for each step
- Input/Output tokens
- Request time for each step
- Price of request

And the classic stats of average score delta, average time, p50, p90 etc. Plus one fun thing, which is finding the edge cases: even if the average score delta is low (meaning it's spot-on), there are usually some videos where the absolute delta is higher, and these usually indicate niche edge cases the model might struggle with.
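A minimal sketch of how stats like these could be computed, for readers who want to try something similar. The record fields (`video`, `manual_score`, `model_score`, `seconds`), the sample numbers, and the `edge_threshold` cutoff are all hypothetical and not taken from the actual dashboard; the percentile uses the simple nearest-rank method.

```python
import math
import statistics


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list (pct in 0..100)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def summarize(runs, edge_threshold=2.0):
    """Aggregate per-video results; the field names are illustrative assumptions."""
    deltas = [r["model_score"] - r["manual_score"] for r in runs]
    times = [r["seconds"] for r in runs]
    # Videos whose absolute delta is large even when the average looks fine.
    edge_cases = [
        r["video"] for r, d in zip(runs, deltas) if abs(d) >= edge_threshold
    ]
    return {
        "avg_delta": statistics.mean(deltas),
        "avg_time": statistics.mean(times),
        "p50_time": percentile(times, 50),
        "p90_time": percentile(times, 90),
        "edge_cases": edge_cases,
    }


# Made-up sample data: the average delta is zero, yet two videos are far off.
runs = [
    {"video": "a.mp4", "manual_score": 7.0, "model_score": 7.5, "seconds": 10.0},
    {"video": "b.mp4", "manual_score": 5.0, "model_score": 4.5, "seconds": 12.0},
    {"video": "c.mp4", "manual_score": 8.0, "model_score": 5.0, "seconds": 30.0},
    {"video": "d.mp4", "manual_score": 6.0, "model_score": 9.0, "seconds": 14.0},
]
summary = summarize(runs)
print(summary)
```

The sample data is chosen so that the signed deltas cancel out, which is exactly the situation where looking only at the average would hide the two outlier videos.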

Gemini 3 Flash nails it sometimes even better than the Pro version, with nearly the same times as 2.5 Pro does on that usecase. Actually, pushed it to prod yesterday and looking at the data, it seems it's 5 seconds faster than Pro on average, with my cost-per-user going down from 20 cents to 12 cents.

IMO it's pretty rudimentary, so let me know if there's anything else I can explain.

theshrike79 12/18/2025|||

Everyone should have their own "pelican riding a bicycle" benchmark they test new models on.

And it shouldn't be shared publicly so that the models won't learn about it accidentally :)

bluecalm 12/19/2025|||

I ask the models to generate an image where fictional characters play chess or Texas Holdem. None of them can make a realistic chess position or poker game. Something is always off, like too many pawns or too many cards, or some cards being face-up when they shouldn't be.

ggsp 12/18/2025|||

Any suggestions for a simple tool to set up your own local evals?

dimava 12/18/2025|||

Just ask an LLM to write one on top of OpenRouter, AI SDK and Bun, to take your .md input file and save outputs as md files (or whatever you need). Take https://github.com/T3-Content/auto-draftify as an example.

theshrike79 12/18/2025||||

My "tool" is just prompts saved in a text file that I feed to new models by hand. I haven't built a bespoke framework on top of it.

...yet. Crap, do I need to now? =)

ggsp 12/18/2025|||

Yeah I’ve wondered about the same myself… My evals are also a pile of text snippets, as are some of my workflows. Thought I’d have a look to see what’s out there and found Promptfoo and Inspect AI. Haven’t tried either but will for my next round of evals

kedihacker 12/18/2025||||

Well you need to stop them from getting incorporated into its training data

lobsterthief 12/18/2025|||

_Brain backlog project #77 created_

m00dy 12/18/2025|||

May I ask about your internal benchmark? I'm building a new set of benchmarks and a testing suite for agentic workflows using deepwalker [0]. How do you design your benchmark suite? It would be really cool if you could give more details.

[0] https://deepwalker.xyz

thecupisblue 12/18/2025||

Shared a bit more here - https://news.ycombinator.com/item?id=46314047.

But pretty rudimentary, nothing special. Also did not know about deepwalker, looks quite interesting - you building it?

m00dy 12/19/2025||

I personally know the team who builds the product.

lambda 12/17/2025|||

I'm a significant genAI skeptic.

I periodically ask them questions about topics that are subtle or tricky, and somewhat niche, that I know a lot about, and find that they frequently provide extremely bad answers. There have been improvements on some topics, but there's one benchmark question that I have that just about every model I've tried has completely gotten wrong.

Tried it on LMArena recently, got a comparison between Gemini 2.5 flash and a codenamed model that people believe was a preview of Gemini 3 flash. Gemini 2.5 flash got it completely wrong. Gemini 3 flash actually gave a reasonable answer; not quite up to the best human description, but it's the first model I've found that actually seems to mostly correctly answer the question.

So, it's just one data point, but at least for my one fairly niche benchmark problem, Gemini 3 Flash has successfully answered a question that none of the others I've tried have (I haven't actually tried Gemini 3 Pro, but I'd compared various Claude and ChatGPT models, and a few different open weights models).

So, guess I need to put together some more benchmark problems, to get a better sample than one, but it's at least now passing a "I can find the answer to this in the top 3 hits in a Google search for a niche topic" test better than any of the other models.

Still a lot of things I'm skeptical about in all the LLM hype, but at least they are making some progress in being able to accurately answer a wider range of questions.

prettyblocks 12/17/2025|||

I don't think tricky niche knowledge is the sweet spot for genai and it likely won't be for some time. Instead, it's a great replacement for rote tasks where a less than perfect performance is good enough. Transcription, ocr, boilerplate code generation, etc.

lambda 12/17/2025|||

The thing is, I see people use it for tricky niche knowledge all the time; using it as an alternative to doing a Google search.

So I want to have a general idea of how good it is at this.

I found something that was niche, but not super niche; I could easily find a good, human written answer in the top couple of results of a Google search.

But until now, all LLM answers I've gotten for it have been complete hallucinated gibberish.

Anyhow, this is a single data point, I need to expand my set of benchmark questions a bit now, but this is the first time that I've actually seen progress on this particular personal benchmark.

ozim 12/18/2025|||

That’s riding the hype machine and throwing the baby out with the bathwater.

Get API access and try using it for classification of text or images. If you have an Excel file with 10k somewhat random-looking entries that you want to classify or filter down to the 10 that matter to you, use an LLM.

Get it to do audio transcription. You can now just talk and it will take notes for you at a level that was not possible earlier without training on someone's voice; now it can handle anyone’s voice.

Fixing up text is of course also big.

Data classification is easy for an LLM. Data transformation is a bit harder but still great. Creating new data is hard, so when answering questions where it has to generate stuff from thin air it will hallucinate like a madman.

The things LLMs are good at are used in the background by people creating actually useful software on top of LLMs, but those uses are not seen by the general public, who only see a chat box.

illiac786 12/18/2025||||

But people using the wrong tool for a task is nothing new. Using excel as a database (still happening today), etc.

Maybe the scale is different with genAI and there are some painful learnings ahead of us.

mikepurvis 12/18/2025||||

And Google themselves obviously believe that too as they happily insert AI summaries at the top of most serps now.

ComputerGuru 12/18/2025||

Or maybe Google knows most people search inane, obvious things?

coldtea 12/18/2025|||

Or more likely Google couldn't give a rat's arse whether those AI summaries are good or not (except to the degree that people don't flee it), and what it cares about is that they keep users with Google itself, instead of clicking off to other sources.

After all it's the same search engine team that didn't care about its search results - its main draw - actively going to shit for over a decade.

vitorgrs 12/18/2025||||

Google AI Overview often gets obvious things wrong so... lol

They probably use an old Flash Lite model, something super small, and just summarize the search...

mikepurvis 12/18/2025||

Those summaries would be far more expensive to generate than the searches themselves so they're probably caching the top 100k most common or something, maybe even pre-caching it.

katzenversteher 12/18/2025|||

I also use niche questions a lot but mostly to check how much the models tend to hallucinate. E.g. I start asking about rank badges in Star Trek which they usually get right and then I ask about specific (non existing) rank badges shaped like strawberries or something like that. Or I ask about smaller German cities and what's famous about them.

I know that without the ability to search it's very unlikely the model actually has accurate "memories" about these things. I just hope one day they will actually know that their "memory" is bad or non-existent and will tell me so instead of hallucinating something.

Europas 12/18/2025||

I'm waiting for properly adjusted specific LLMs: an LLM trained on so much trustworthy generic data that it is able to understand/comprehend me and different languages, but always talks to a fact database in the background.

I don't need an LLM to have a trillion parameters if I just need it to be a great user interface.

Someone is probably working on this somewhere, or will, but let's see.

ozim 12/17/2025||||

Second this.

Basically making sense of unstructured data is super cool. I can get 20 people to write an answer the way they feel like it and the model can convert it to structured data - something I would otherwise have to spend time on, or I would have to make a form with mandatory fields that annoy the audience.

I am already building useful tools with the help of models. Asking tricky or trivia questions is fun and games. There are much more interesting ways to use AI.

DeathArrow 12/18/2025||||

Well, I used Grok to find information I had forgotten, like product names, films, books and various articles on different subjects. Google search didn't help but putting the LLM to work did the trick.

So I think LLMs can be good for finding niche info.

DrewADesign 12/18/2025|||

Yeah, but tests like that deliberately prod the boundaries of its capability rather than how well it does what it’s good at.

andai 12/17/2025||||

So this is an interesting benchmark, because if the answer is actually in the top 3 google results, then my python script that runs a google search, scrapes the top n results and shoves them into a crappy LLM would pass your benchmark too!

Which also implies that (for most tasks), most of the weights in a LLM are unnecessary, since they are spent on memorizing the long tail of Common Crawl... but maybe memorizing infinite trivia is not a bug but actually required for the generalization to work? (Humans don't have far transfer though... do transformers have it?)

lambda 12/17/2025||

I've tried doing this query with search enabled in LLMs before, which is supposed to effectively do that, and even with that they didn't give very good answers. It's a very physical kind of thing, and it's easy to conflate with other similar descriptions, so they would frequently just conflate various different things and give some horrible mash-up answer that wasn't about the specific thing I'd asked about.

andai 12/18/2025||

So it's a difficult question for LLMs to answer even when given perfect context?

Kinda sounds like you're testing two things at the same time then, right? The knowledge of the thing (was it in the training data and was it memorized?) and the understanding of the thing (can they explain it properly even if you give them the answer in context).

jve 12/18/2025||||

Counter point about general knowledge that is documented/discussed in different spots on the internet.

Today I had to resolve performance problems for some SQL Server statement. I've been doing it for years and know the regular pitfalls, but sometimes I have to find the "right" words to explain to a customer why X is bad and such.

I described the issue to GPT5.2, gave the query, the execution plan and asked for help.

It was spot on: high quality responses with actionable items and explanations of why this or that is bad, how to improve it, and why SQL Server in particular may have generated such a query plan. I could instantly validate the response given my experience in the field. I even answered the customer with some parts of ChatGPT's response because of how well it explained things. However, I did mention that to the customer and told them I approve of the answer.

I asked a high quality question and received a high quality answer. And I am happy that I found out about a SQL Server flag with which I can influence a particular decision. But the suggestion was not limited to that; there were multiple points given that would help.

fragmede 12/17/2025||||

Even the most magical wonderful auto-hammer is gonna be bad at driving in screws. And, in this analogy I can't fault you because there are people trying to sell this hammer as a screwdriver. My opinion is that it's important to not lose sight of the places where it is useful because of the places where it isn't.

pretzellogician 12/17/2025||

Funny, I grew up using what's called a "hand impact screwdriver"... turns out a hammer can be used to drive in screws!

TeodorDyakov 12/17/2025||||

Hi. I am curious what was the benchmark question? Cheers!

Turskarama 12/17/2025|||

The problem with publicly disclosing these is that if lots of people adopt them they will become targeted to be in the model and will no longer be a good benchmark.

lambda 12/17/2025|||

Yeah, that's part of why I don't disclose.

Obviously, the fact that I've done Google searches and tested the models on these means that their systems may have picked up on them; I'm sure that Google uses its huge dataset of Google searches and search index as inputs to its training, so Google has an advantage here. But, well, that might be why Google's new models are so much better: they're actually taking advantage of some of this massive dataset they've had for years.

grog454 12/18/2025|||

This thought process is pretty baffling to me, and this is at least the second time I've encountered it on HN.

What's the value of a secret benchmark to anyone but the secret holder? Does your niche benchmark even influence which model you use for unrelated queries? If LLM authors care enough about your niche (they don't) and fake the response somehow, you will learn on the very next query that something is amiss. Now that query is your secret benchmark.

Even for niche topics it's rare that I need to provide more than 1 correction or knowledge update.

nl 12/18/2025|||

I have a bunch of private benchmarks I run against new models I'm evaluating.

The reason I don't disclose isn't generally that I think an individual person is going to read my post and update the model to include it. Instead it is because if I write "I ask the question X and expect Y" then that data ends up in the train corpus of new LLMs.

However, one set of my benchmarks is a more generalized type of test (think a parlor-game type thing) that actually works quite well. That set is the kind of thing that could be learnt via reinforcement learning very well, and just mentioning it could be enough for a training company or data provider company to try it. You can generate thousands of verifiable tests - potentially with verifiable reasoning traces - quite easily.

grog454 12/18/2025||

Ok, but then your "post" isn't scientific by definition since it cannot be verified. "Post" is in quotes because I don't know what you're trying to do, but you're implying some sort of public discourse.

For fun: https://chatgpt.com/s/t_694361c12cec819185e9850d0cf0c629

eru 12/18/2025|||

I didn't see anyone claiming any 'science'? Did I miss something?

grog454 12/18/2025||

I guess there's two things I'm still stuck on:

1. What is the purpose of the benchmark?

2. What is the purpose of publicly discussing a benchmark's results but keeping the methodology secret?

To me it's in the same spirit as claiming to have defeated alpha zero but refusing to share the game.

nl 12/18/2025||

1. The purpose of the benchmark is to choose what models I use for my own system(s). This is extremely common practice in AI - I think every company I've worked with doing LLM work in the last 2 years has done this in some form.

2. I discussed that up-thread, but https://github.com/microsoft/private-benchmarking and https://arxiv.org/abs/2403.00393 discuss some further motivation for this if you are interested.

> To me it's in the same spirit as claiming to have defeated alpha zero but refusing to share the game.

This is an odd way of looking at it. There is no "winning" at benchmarks, it's simply that it is a better and more repeatable evaluation than the old "vibe test" that people did in 2024.

grog454 12/18/2025||

I see the potential value of private evaluations. They aren't scientific but you can certainly beat a "vibe test".

I don't understand the value of a public post discussing their results beyond maybe entertainment. We have to trust you implicitly and have no way to validate your claims.

> There is no "winning" at benchmarks, it's simply that it is a better and more repeatable evaluation than the old "vibe test" that people did in 2024.

Then you must not be working in an environment where a better benchmark yields a competitive advantage.

eru 12/18/2025||

> I don't understand the value of a public post discussing their results beyond maybe entertainment. We have to trust you implicitly and have no way to validate your claims.

In principle, we have ways: if nl's reports consistently predict how public benchmarks will turn out later, they can build up a reputation. Of course, that requires that we follow nl around for a while.

nl 12/18/2025|||

As ChatGPT said to you:

> A secret benchmark is: Useful for internal model selection

That's what I'm doing.

grog454 12/18/2025||

My question was "What's the value of a secret benchmark to anyone but the secret holder?"

The root of this whole discussion was a post about how Gemini 3 outperformed other models on some presumably informal question benchmark (a "vibe test"?). When asked for the benchmark, the response from the op and someone else was that secrecy was needed to protect the benchmark from contamination. I'm skeptical of the need in the op's case and I'm skeptical of the effectiveness of the secrecy in general. In a case where secrecy has actual value, why even discuss the benchmark publicly at all?

Turskarama 12/18/2025||||

The point is that it's a litmus test for how well the models do with niche knowledge _in general_. The point isn't really to know how well the model works for that specific niche. Ideally of course you would use a few of them and aggregate the results.

akoboldfrying 12/18/2025||||

I actually think "concealing the question" is not only a good idea, but a rather general and powerful idea that should be much more widely deployed (but often won't be, for what I consider "emotional reasons").

Example: You are probably already aware that almost any metric that you try to use to measure code quality can be easily gamed. One possible strategy is to choose a weighted mixture of metrics and conceal the weights. The weights can even change over time. Is it perfect? No. But it's at least correlated with code quality -- and it's not trivially gameable, which puts it above most individual public metrics.
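A rough sketch of what such a concealed weighted mixture could look like. Everything here is an illustrative assumption: the metric names, the idea that each metric is already normalized to the 0..1 range, and the use of a private seed to derive (and later rotate) the weights.

```python
import random

# Hypothetical metric names, each assumed to be normalized to 0..1.
METRICS = ("test_coverage", "lint_score", "doc_coverage")


def draw_secret_weights(seed):
    """Derive weights that sum to 1 from a seed only the evaluator knows."""
    rng = random.Random(seed)
    raw = [rng.random() for _ in METRICS]
    total = sum(raw)
    return {name: w / total for name, w in zip(METRICS, raw)}


def quality_score(metrics, weights):
    """Weighted mixture of the individual metrics."""
    return sum(weights[name] * metrics[name] for name in METRICS)


# The seed stays private; picking a new seed later rotates the weights.
weights = draw_secret_weights(seed=2025)
score = quality_score(
    {"test_coverage": 0.8, "lint_score": 0.6, "doc_coverage": 0.4}, weights
)
print(round(score, 3))
```

Only the final score would be published. Because the weights are unknown and can change, optimizing any single metric gives no guaranteed payoff, which is the property the comment is after.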

grog454 12/18/2025||

It's hard to have any certainty around concealment unless you are only testing local LLMs. As a matter of principle I assume the input and output of any query I run in a remote LLM is permanently public information (same with search queries).

Will someone (or some system) see my query and think "we ought to improve this"? I have no idea since I don't work on these systems. In some instances involving random sampling... probably yes!

This is the second reason I find the idea of publicly discussing secret benchmarks silly.

grog454 12/18/2025||

I learned in another thread there is some work being done to avoid contamination of training data during evaluation of remote models using trusted execution environments (https://arxiv.org/pdf/2403.00393). It requires participation of the model owner.

theshrike79 12/18/2025|||

Because it encompasses the very specific way I like to do things. It's not of use to the general public.

kridsdale3 12/17/2025||||

If they told you, it would be picked up in a future model's training run.

jacobn 12/17/2025|||

Don't the models typically train on their input too? I.e. submitting the question also carries a risk/chance of it getting picked up?

I guess they get such a large input of queries that they can only realistically check and therefore use a small fraction? Though maybe they've come up with some clever trick to make use of it anyway?

nl 12/18/2025|||

OpenAI and Anthropic don't train on your questions if you have pressed the opt-out button and are using their UI. LMArena is a different matter.

jerojero 12/17/2025||||

they probably don't train on inputs from testing grounds.

you don't train on your test data because you need to have that to compare if training is improving or not.

energy123 12/17/2025|||

Given they asked it on LMArena, yes.

lambda 12/17/2025||

Yeah, probably asking on LMArena makes this an invalid benchmark going forward, especially since I think Google is particularly active in testing models on LMArena (as evidenced by the fact that I got their preview for this question).

I'll need to find a new one, or actually put together a set of questions to use instead of just a single benchmark.

_heimdall 12/17/2025|||

Is that an issue if you now need a new question to ask?

Marazan 12/18/2025|||

Here's my old benchmark question and my new variant:

"When was the last time England beat Scotland at rugby union"

new variant "Without using search when was the last time England beat Scotland at rugby union"

It is amazing how bad ChatGPT is at this question and has been for years now across multiple models. It's not that it gets it wrong - no shade, I've told it not to search the web so this is _hard_ for it - but how badly it reports the answer. Starting from the small stuff - it almost always reports the wrong year, wrong location and wrong score - that's the boring facts stuff that I would expect it to stumble on. It often creates details of matches that didn't exist, cool standard hallucinations. But even within the text it generates itself it cannot keep it consistent with how reality works. It often reports draws as wins for England. It frequently states the team that it just said scored most points lost the match, etc.

It is my ur example for when people challenge my assertion LLMs are stochastic parrots or fancy Markov chains on steroids.

arisAlexis 12/18/2025||||

can you give us an example of this niche knowledge? I highly doubt there is knowledge that is not inside some internet training material.

vitaflo 12/18/2025|||

I also have my own tricky benchmark that up til now only Deepseek has been able to answer. Gemini 3 Pro was the second. Every other LLM fails horribly. This is the main reason I started looking at G3pro more seriously.

mips_avatar 12/17/2025|||

OpenAI made a huge mistake neglecting fast inferencing models. Their strategy was gpt 5 for everything, which hasn't worked out at all. I'm really not sure what model OpenAI wants me to use for my applications that require lower latency. If I follow their advice in their API docs about which models I should use for faster responses I get told either use GPT 5 low thinking, or replace gpt 5 with gpt 4.1, or switch to the mini model. Now as a developer I'm doing evals on all three of these combinations. I'm running my evals on gemini 3 flash right now, and it's outperforming gpt5 thinking without thinking. OpenAI should stop trying to come up with ads and make models that are useful.

danpalmer 12/17/2025|||

Hardware is a factor here. GPUs are necessarily higher latency than TPUs for equivalent compute on equivalent data. There are lots of other factors here, but latency specifically favours TPUs.

The only non-TPU fast models I'm aware of are things running on Cerebras can be much faster because of their CPUs, and Grok has a super fast mode, but they have a cheat code of ignoring guardrails and making up their own world knowledge.

nl 12/18/2025|||

> GPUs are necessarily higher latency than TPUs for equivalent compute on equivalent data.

Where are you getting that? All the citations I've seen say the opposite, eg:

> Inference Workloads: NVIDIA GPUs typically offer lower latency for real-time inference tasks, particularly when leveraging features like NVIDIA's TensorRT for optimized model deployment. TPUs may introduce higher latency in dynamic or low-batch-size inference due to their batch-oriented design.

https://massedcompute.com/faq-answers/

> The only non-TPU fast models I'm aware of are things running on Cerebras can be much faster because of their CPUs, and Grok has a super fast mode, but they have a cheat code of ignoring guardrails and making up their own world knowledge.

Both Cerebras and Grok have custom AI-processing hardware (not CPUs).

The knowledge grounding thing seems unrelated to the hardware, unless you mean something I'm missing.

danpalmer 12/18/2025|||

I thought it was generally accepted that inference was faster on TPUs. This was one of my takeaways from the LLM scaling book: https://jax-ml.github.io/scaling-book/ – TPUs just do less work, and data needs to move around less for the same amount of processing compared to GPUs. This would lead to lower latency as far as I understand it.

The citation link you provided takes me to a sales form, not an FAQ, so I can't see any further detail there.

> Both Cerebras and Grok have custom AI-processing hardware (not CPUs).

I'm aware of Cerebras' custom hardware. I agree with the other commenter here that I haven't heard of Grok having any. My point about knowledge grounding was simply that Grok may be achieving its latency with guardrail/knowledge/safety trade-offs instead of custom hardware.

nl 12/18/2025||

Sorry I meant Groq custom hardware, not Grok!

I don't see any latency comparisons in the link

danpalmer 12/18/2025||

The link is just to the book, the details are scattered throughout. That said the page on GPUs specifically speaks to some of the hardware differences and how TPUs are more efficient for inference, and some of the differences that would lead to lower latency.

https://jax-ml.github.io/scaling-book/gpus/#gpus-vs-tpus-at-...

Re: Groq, that's a good point, I had forgotten about them. You're right they too are doing a TPU-style systolic array processor for lower latency.

mips_avatar 12/18/2025|||

I'm pretty sure xAI exclusively uses Nvidia H100s for Grok inference but I could be wrong. I agree that I don't see why TPUs would necessarily explain latency.

danpalmer 12/18/2025||

To be clear I'm only suggesting that hardware is a factor here, it's far from the only reason. The parent commenter corrected their comment that it was actually Groq not Grok that they were thinking of, and I believe they are correct about that as Groq is doing something similar to TPUs to accelerate inference.

jrk 12/18/2025|||

Why are GPUs necessarily higher latency than TPUs? Both require roughly the same arithmetic intensity and use the same memory technology at roughly the same bandwidth.

eru 12/18/2025|||

And our LLMs still have latencies well into the human perceptible range. If there's any necessary, architectural difference in latency between TPU and GPU, I'm fairly sure it would be far below that.

danpalmer 12/18/2025|||

My understanding is that TPUs do not use memory in the same way. GPUs need to do significantly more store/fetch operations from HBM, where TPUs pipeline data through systolic arrays far more. From what I've heard this generally improves latency and also reduces the overhead of supporting large context windows.

andai 12/17/2025||||

Hard to find info but I think the -chat versions of 5.1 and 5.2 (gpt-5.2-chat) are what you're looking for. They might just be an alias for the same model with very low reasoning though. I've seen other providers do the same thing, where they offer a reasoning and non reasoning endpoint. Seems to work well enough.

ComputerGuru 12/18/2025|||

They’re not the same, there are (at least) two different tunes per 5.x

For each you can use it as “instant”, supposedly without thinking (though these are all exclusively reasoning models), or specify a reasoning amount (low, medium, high, and now xhigh - though if you don't specify it defaults to none), OR you can use the -chat version, which is also “no thinking” but in practice performs markedly differently from the regular version with thinking off (not more or less intelligent but has a different style and answering method).

mips_avatar 12/18/2025|||

It's weird they don't document this stuff. Like understanding things like tool call latency and time to first token is extremely important in application development.

eru 12/18/2025||

Humans often answer with fluff like "That's a good question, thanks for asking that, [fluff, fluff, fluff]" to give themselves more breathing room until the first 'token' of their real answer. I wonder if any LLM are doing stuff like that for latency hiding?

mips_avatar 12/18/2025|||

I don't think the models are doing this, time to first token is more of a hardware thing. But people writing agents are definitely doing this, particularly in voice it's worth it to use a smaller local llm to handle the acknowledgment before handing it off.
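As a rough illustration of that agent-side pattern, here is a minimal asyncio sketch. Both "models" are stand-in stubs that just sleep; `quick_ack`, `full_answer`, and the delays are hypothetical and do not call any real LLM API.

```python
import asyncio


async def quick_ack(question):
    """Stand-in for a small, fast local model producing an acknowledgment."""
    await asyncio.sleep(0.01)
    return "Good question, let me check."


async def full_answer(question):
    """Stand-in for the slower, larger model producing the real answer."""
    await asyncio.sleep(0.05)
    return f"Answer to: {question}"


async def respond(question, emit):
    # Start the slow request first so it runs while the ack is generated.
    answer_task = asyncio.create_task(full_answer(question))
    emit(await quick_ack(question))
    emit(await answer_task)


spoken = []
asyncio.run(respond("What is a TPU?", spoken.append))
print(spoken)
```

The user hears the acknowledgment after roughly the fast model's latency, while the total time is still bounded by the slow model, since both run concurrently.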

strangegecko 12/18/2025|||

Do humans really do that often?

Coming up with all that fluff would keep my brain busy, meaning there's actually no additional breathing room for thinking about an answer.

eru 12/18/2025||

People who professionally answer questions do that, yes. Eg politicians or press secretaries for companies, or even just your professor taking questions after a talk.

> Coming up with all that fluff would keep my brain busy, meaning there's actually no additional breathing room for thinking about an answer.

It gets a lot easier with practice: your brain caches a few of the typical fluff routines.

simonw 12/17/2025||||

Yeah, I'm surprised that they've been through GPT-5.1 and GPT-5.1-Codex and GPT-5.1-Codex-Max and now GPT-5.2 but their most recent mini model is still GPT-5-mini.

mips_avatar 12/18/2025||

I cannot comprehend how they do not care about this segment of the market.

yakbarber 12/18/2025||

it's easy to comprehend actually. they're putting everything on "having the best model". It doesn't look like they're going to win, but that's still their bet.

mips_avatar 12/18/2025||

I mean they’re trying to outdo google. So they need to do that.

eru 12/18/2025||

Until recently, Google was the underdog in the LLM race and OpenAI was the reigning champion. How quickly perceptions shift!

mips_avatar 12/18/2025||

I just want a deepseek moment for an open weights model fast enough to use in my app, I hate paying the big guys.

eru 12/18/2025||

Isn't deepseek an open weights model?

mips_avatar 12/18/2025||

yeah but not super fast like flash or grok fast

windexh8er 12/18/2025||||

One can only hope OpenAI continues down the path they're on. Let them chase ads. Let them shoot themselves in the foot now. If they fail early maybe we can move beyond this ridiculous charade of generally useless models. I get it, applied in specific scenarios they have tangible use cases. But ask your non-tech caring friend or family member what frontier model was released this week and they'll not only be confused by what "frontier" means, but it's very likely they won't have any clue. Also ask them how AI is improving their lives on the daily. I'm not sure if we're at the 80% of model improvement as of yet, but given OpenAIs progress this year it seems they're at a very weak inflection point. Start serving ads so the house of cards can get a nudge.

And now with RAM, GPU and boards being a PitA to get based on supply and pricing - double middle finger to all the big tech this holiday season!

behnamoh 12/17/2025||||

> OpenAI made a huge mistake neglecting fast inferencing models.

It's a lost battle. It'll always be cheaper to use an open source model hosted by others like together/fireworks/deepinfra/etc.

I've been maining Mistral lately for low latency stuff and the price-quality is hard to beat.

mips_avatar 12/18/2025||

I'll try benchmarking mistral against my eval, I've been impressed by kimi's importance but it's too slow to do anything useful realtime.

campers 12/18/2025||||

I had wondered if they run their inference at high batch sizes to get better throughput to keep their inference costs lower.

They do have a priority tier at double the cost, but haven't seen any benchmarks on how much faster that actually is.

The flex tier was an underrated feature in GPT5, batch pricing with a regular API call. GPT5.1 using flex priority is an amazing price/intelligence tradeoff for non-latency sensitive applications, without needing the extra plumbing of most batch APIs.

mips_avatar 12/18/2025||

I’m sure they do something like that. I’ve noticed azure has way faster gpt 4.1 than OpenAI

TacticalCoder 12/18/2025||||

> OpenAI should stop trying to come up with ads and make models that are useful.

Turns out becoming a $4 trillion company first with ads (Google), then owning everybody on the AI-front could be the winning strategy.

seunosewa 12/18/2025|||

GPT 5 Mini is supposed to be equivalent to Gemini Flash.

kartayyar 12/18/2025|||

Can confirm. We at Roblox open sourced a new frontier game eval today, and it's beating even Gemini 3 Pro! ( Previous best model ).

https://github.com/Roblox/open-game-eval/blob/main/LLM_LEADE...

seany62 12/18/2025||

Unbelievable

scrollop 12/17/2025|||

Alright so we have more benchmarks including hallucinations and flash doesn't do well with that, though generally it beats gemini 3 pro and GPT 5.1 thinking and gpt 5.2 thinking xhigh (but then, sonnet, grok, opus, gemini and 5.1 beat 5.2 xhigh) - everything. Crazy.

https://artificialanalysis.ai/evaluations/omniscience

tallclair 12/17/2025||

On your Omniscience-Index vs. Cost graph, I think your Gemini 3 pro & flash models might be swapped.

giancarlostoro 12/17/2025|||

I wonder at what point everyone who over-invested in OpenAI will regret their decision (except maybe Nvidia?). Maybe Microsoft doesn't need to care, they get to sell their models via Azure.

toomuchtodo 12/17/2025|||

Amazon Set to Waste $10 Billion on OpenAI - https://finance.yahoo.com/news/amazon-set-waste-10-billion-1... - December 17th, 2025

outside1234 12/17/2025||||

Very soon, because clearly OpenAI is in very serious trouble. They are scaled and have no business model and a competitor that is much better than them at almost everything (ads, hardware, cloud, consumer, scaling).

TacticalCoder 12/18/2025||||

Oracle's stock skyrocketed then took a nosedive. Financial experts warned that companies who bet big on OpenAI like Oracle and Coreweave to pump their stock would go down the drain, and down the drain they went (so far: -65% for Coreweave and nearly -50% for Oracle compared to their OpenAI-hype all-time highs).

Markets seem to be in a "Show me the OpenAI money" mood at the moment.

And even financial commentators who don't necessarily know a thing about AI can realize that Gemini 3 Pro and now Gemini 3 Flash are giving ChatGPT a run for its money.

Oracle and Microsoft have other sources of revenue but for those really drinking the OpenAI koolaid, including OpenAI itself, I sure as heck don't know what the future holds.

My safe bet however is that Google ain't going anywhere and shall keep progressing on the AI front at an insane pace.

eru 12/18/2025||

Financial experts [0] and analysts are pretty much useless. Empirically their predictions are slightly worse than chance.

[0] At least the guys who publish where you or me can read them.

guelo 12/17/2025||||

OpenAI's doom was written when Altman (and Nadella) got greedy, threw away the nonprofit mission, and caused the exodus of talent and funding that created Anthropic. If they had stayed nonprofit the rest of the industry could have consolidated their efforts against Google's juggernaut. I don't understand how they expected to sustain the advantage against Google's infinite money machine. With Waymo Google showed that they're willing to burn money for decades until they succeed.

This story also shows the market corruption of Google's monopolies, but a judge recently gave them his stamp of approval so we're stuck with it for the foreseeable future.

deegles 12/18/2025|||

I think their downfall will be the fact that they don't have a "path to AGI" and have been raising investor money on the promise that they do.

taytus 12/18/2025||

I believe there’s also exponential dislike growing for Altman among most AI users, and that impacts how the brand/company is perceived.

mingusrude 12/18/2025||

Most AI users outside of HN do not have any idea who Altman is. ChatGPT is in many circles synonymous with AI so their brand recognition is huge.

giancarlostoro 12/18/2025||

I agree, I have said it before, ChatGPT is like Photoshop at this point, or Google. Even if you are using Bing you are googling it. Even if you are using MS Paint to edit an image it was photoshopped.

behnamoh 12/17/2025|||

> I don't understand how they expected to sustain the advantage against Google's infinite money machine.

I ask this question about Nazi Germany. They adopted the Blitzkrieg strategy and expanded unsustainably, but it was only a matter of time until powers with infinite resources (US, USSR) put an end to it.

goobatrooba 12/18/2025|||

I know you're making an analogy but I have to point out that there are many points where Nazi Germany could have gone a different route and potentially could have ended up with a stable dominion over much of Western Europe.

Most obvious decision points were betraying the USSR and declaring war on the US (no one has really been able to pinpoint the reason, but presumably it was to get Japan to attack the Soviets from the other side, which then however didn't happen). Another could have been to consolidate after the surrender/supplication of France, rather than continue attacking further.

qcnguy 12/18/2025||||

Lots of plausible alternative histories don't end with the destruction of Nazi Germany. Others already named some, another is if the RAF collapsed during the Battle of Britain and Germany had established air superiority. The Germans would have taken out the Royal Navy and mounted an invasion of Britain soon after; if Britain had fallen there'd have been nowhere for the US to stage D-Day. Hitler could have then diverted all resources to the eastern front and possibly managed to reach Moscow before the winter set in.

eru 12/18/2025|||

Huh? How did the USSR have infinite resources? They were barely kept afloat by western allied help (especially at the beginning). Remember also how Tsarist Russia was the first power to collapse and get knocked out of the war in WW1, long before the war was over. They did worse than even the proverbial 'Sick Man of Europe', the Ottoman Empire.

Not saying that the Nazi strategy was without flaws, of course. But your specific critique is a bit too blunt.

SoftTalker 12/18/2025||

they had more soldiers to throw into the meat grinder

eru 12/18/2025||

They also had more soldiers in WW1.

elbear 12/18/2025||

They withdrew in WW1 after the revolution.

spaceman_2020 12/18/2025||||

Seeing Sergey Brin back in the trenches makes me think Google is really going to win this

They always had the best talent, but with Brin at the helm, they also have someone with the organizational heft to drive them towards a single goal

jack_riminton 12/17/2025|||

But you’re forgetting the Jonny Ive hardware device that totally isn’t like that laughable pin badge thing from Humane

user34283 12/18/2025||

I agree completely. Altman was at some point talking about a screen less device and getting people away from the screen.

Abandoning our most useful sense, vision, is a recipe for a flop.

jack_riminton 12/18/2025||

I'm not entirely sure it will ever see the light of day tbh

The amount of money sloshing around in these acquisitions makes you wonder what they're really for

mmaunder 12/17/2025|||

Thanks, having it walk a hardcore SDR signal chain right now --- oh damn it just finished. The blog post makes it clear this isn't just some 'lite' model - you get low latency and cognitive performance. really appreciate you amplifying that.

yunohn 12/17/2025|||

I love how every single LLM model release is accompanied by pre-release insiders proclaiming how it’s the best model yet…

hexasquid 12/18/2025|||

Makes me think of how every iPhone is the best iPhone yet.

Waiting for Apple to say "sorry folks, bad year for iPhone"

eru 12/18/2025||

Wouldn't you expect that every new iPhone is genuinely the best iPhone? I mean, technology marches on.

OrangeMusic 12/18/2025||

It was sarcasm.

Europas 12/18/2025|||

Thats true though.

All these announcements beat all the other models on most benchmarks and are then the best model yet. They can't see the future yet so they are not aware or care anyway that 2 weeks later someone says "hold my beer" and we get again better benchmark results from someone else.

Exhausting and exciting

yunohn 12/18/2025||

My criticism is more about the fake-sounding pre-release insider hype aspect than the inevitable nature of forward progress.

behnamoh 12/17/2025|||

> Don’t let the “flash” name fool you

I think it's bad naming on google's part. "flash" implies low quality, fast but not good enough. I get less negative feeling looking at "mini" models.

pietz 12/17/2025|||

Interesting. Flash suggests more power to me than Mini. I never use gpt-5-mini in the UI whereas Flash appears to be just as good as Pro just a lot faster.

taytus 12/18/2025||

Im in between :)

Mini - small, incomplete, not good enough

Flash - good, not great, fast, might miss something.

nemonemo 12/17/2025|||

Fair point. Asked Gemini to suggest alternatives, and it suggested Gemini Velocity, Gemini Atom, Gemini Axiom (and more). I would have liked `Gemini Velocity`.

behnamoh 12/18/2025||

I like Anthropic's approach: Haiku, Sonnet, Opus. Haiku is pretty capable still and the name doesn't make me not wanna use it. But Flash is like "Flash Sale". It might still be a great model but my monkey brain associates it with "cheap" stuff.

jauntywundrkind 12/17/2025|||

Just to point this out: many of these frontier models cost isn't that far away from two orders of magnitude more than what DeepSeek charges. It doesn't compare the same, no, but with coaxing I find it to be a pretty capable competent coding model & capable of answering a lot of general queries pretty satisfactorily (but if it's a short session, why economize?). $0.28/m in, $0.42/m out. Opus 4.5 is $5/$25 (17x/60x).

I've been playing around with other models recently (Kimi, GPT Codex, Qwen, others) to try to better appreciate the difference. I knew there was a big price difference, but watching myself feeding dollars into the machine rather than nickels has also instilled in me quite the reverse appreciation too.

I only assume "if you're not getting charged, you are the product" has to be somewhat in play here. But when working on open source code, I don't mind.

happyopossum 12/17/2025|||

Two orders of magnitude would imply that these models cost $28/m in and $42/m out. Nothing is even close to that.

jauntywundrkind 12/17/2025|||

To me as an engineer, 60x for output (which is most of the cost I see, AFAICT) is not that significantly different from 100x.

I tried to be quite clear with showing my work here. I agree that 17x is much closer to a single order of magnitude than two. But 60x is, to me, a bulk enough of the way to 100x that yeah I don't feel bad saying it's nearly two orders (it's 1.78 orders of magnitude). To me, your complaint feels rigid & ungenerous.

My post is showing to me as -1, but I stand by it right now. Arguing over the technicalities here (is 1.78 close enough to 2 orders to count) feels beside the point to me: DeepSeek is vastly more affordable than nearly everything else, putting even Gemini 3 Flash here to shame. And I don't think people are aware of that.

I guess for my own reference, since I didn't do it the first time: at $0.50/$3.00 / M-i/o, Gemini 3 Flash here is 1.8x & 7.1x (1e0.85) more expensive than DeepSeek.
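For anyone who wants to redo this arithmetic, here is a small Python sketch. The prices are the per-million-token figures quoted in this thread; the helper names (`ratio_and_orders`, `compare`) are just made up for illustration.

```python
import math

# Prices in USD per million tokens (input, output), as quoted in this thread.
DEEPSEEK = (0.28, 0.42)
OTHERS = {
    "Opus 4.5": (5.00, 25.00),
    "Gemini 3 Flash": (0.50, 3.00),
}

def ratio_and_orders(price: float, baseline: float) -> tuple[float, float]:
    """Return (price ratio, log10 of that ratio) against the baseline price."""
    ratio = price / baseline
    return ratio, math.log10(ratio)

def compare(prices, baseline=DEEPSEEK):
    """Map each model name to ((in ratio, in orders), (out ratio, out orders))."""
    return {
        name: (ratio_and_orders(p_in, baseline[0]), ratio_and_orders(p_out, baseline[1]))
        for name, (p_in, p_out) in prices.items()
    }

for name, (inp, out) in compare(OTHERS).items():
    print(f"{name}: input {inp[0]:.1f}x (10^{inp[1]:.2f}), output {out[0]:.1f}x (10^{out[1]:.2f})")
```

"Orders of magnitude" here is simply log10 of the price ratio, so a 100x difference would be exactly 2.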

minraws 12/17/2025|||

Gpt 5.2 pro is well beyond that iirc

jauntywundrkind 12/18/2025||

Whoa! I had no idea. $21/$168. That's 75x / 400x (1e1.875/1e2.6). https://platform.openai.com/docs/pricing

KoolKat23 12/17/2025|||

I struggle to see the incentive to do this, I have similar thoughts for locally run models. It's only use case I can imagine is small jobs at scale perhaps something like auto complete integrated into your deployed application, or for extreme privacy, honouring NDA's etc.

Otherwise, if it's a short prompt or answer, a SOTA (state of the art) model will be cheap anyway, and if it's a long prompt/answer, it's way more likely to be wrong and a lot more time/human cost is spent on "checking/debugging" any issue or hallucination, so again SOTA is better.

lukan 12/17/2025||

"or for extreme privacy"

Or for any privacy/IP protection at all? There is zero privacy, when using cloud based LLM models.

Workaccount2 12/17/2025||

Really only if you are paranoid. It's incredibly unlikely that the labs are lying about not training on your data for the API plans that offer it. Breaking trust with outright lies would be catastrophic to any lab right now. Enterprise demands privacy, and the labs will be happy to accommodate (for the extra cost, of course).

mistercheph 12/18/2025||

No, it's incredibly unlikely that they aren't training on user data. It's billions of dollars worth of high quality tokens and preference that the frontier labs have access to, you think they would give that up for their reputation in the eyes of the enterprise market? LMAO. Every single frontier model is trained on torrented books, music, and movies.

user34283 12/18/2025||

Considering that they will make a lot of money with enterprise, yes, that's exactly what I think.

What I don't think is that I can take seriously someone's opinion on enterprise service's privacy after they write "LMAO" in capslock in their post.

lukan 12/18/2025||

I just know many people here complained about the very unclear way Google, for example, communicates what they use for training data and what plan to choose to opt out of everything, or if you (as a normal business) even can opt out. Given the whole volatile nature of this thing, I can imagine an easy "oops, we messed up" from Google if it turns out they were in fact using almost everything for training.

Second thing to consider is the whole geopolitical situation. I know companies in europe are really reluctant to give US companies access to their internal data.

KoolKat23 12/19/2025||

To be fair, we all know googles terms are ambiguous as hell. It would not be a big surprise nor an outright lie if they did use it.

Its different if they proclaimed outright they won't use it and then do.

Not that any of this is right, it wouldn't be a true betrayal.

On a related note, these terms to me are a great example of success for EU GDPR regulations, and regulations on corporates in general. It's clear as day, additional protections are afforded to EU residents in these terms purely due to the law.

esafak 12/17/2025|||

What are you using it for and what were you using before?

tonyhart7 12/17/2025|||

I think google is the only one that still produce general knowledge LLM right now

claude is coding model from the start but GPT is in more and more becoming coding model

Imustaskforhelp 12/17/2025|||

I agree with this observation. Gemini does feel like code-red for basically every AI company like chatgpt,claude etc. too in my opinion if the underlying model is both fast and cheap and good enough

I hope open source AI models catch up to gemini 3 / gemini 3 flash. Or google open sources it but lets be honest that google isnt open sourcing gemini 3 flash and I guess the best bet mostly nowadays in open source is probably glm or deepseek terminus or maybe qwen/kimi too.

Uehreka 12/17/2025|||

I would expect open weights models to always lag behind; training is resource-intensive and it’s much easier to finance if you can make money directly from the result. So in a year we may have a ~700B open weights model that competes with Gemini 3, but by then we’ll have Gemini 4, and other things we can’t predict now.

xbmcuser 12/17/2025|||

There will be diminishing returns though. As the future models won't be that much better, we will reach a point where the open source model will be good enough for most things, and the need for being on the latest model will no longer be so important.

For me the bigger concern which I have mentioned on other AI related topics is that AI is eating all the production of computer hardware so we should be worrying about hardware prices getting out of hand and making it harder for general public to run open source models. Hence I am rooting for China to reach parity on node size and crash the PC hardware prices.

FuckButtons 12/17/2025||

I had a similar opinion, that we were somewhere near the top of the sigmoid curve of model improvement that we could achieve in the near term. But given continued advancements, I’m less sure that prediction holds.

eru 12/18/2025|||

My model is a bit simpler: model quality is something like the logarithm of effort you put into making the model. (Assuming you know what you are doing with your effort.)

So I don't think we are on any sigmoid curve or so. Though if you plot the performance of the best model available at any point in time against time on the x-axis, you might see a sigmoid curve, but that's a combination of the logarithm and the amount of effort people are willing to spend on making new models.

(I'm not sure about it specifically being the logarithm. Just any curve that has rapidly diminishing marginal returns that nevertheless never go to zero, ie the curve never saturates.)

Imustaskforhelp 12/18/2025|||

Yeah I have a similar opinion and you can go back almost a year when claude 3.5 launched and I said on hackernews, that its good enough

And now I am saying the same for gemini 3 flash.

I still feel the same way tho, sure there is an increase but I somewhat believe that gemini 3 is good enough and the returns on training from now on might not be worth that much imo but I am not sure too and i can be wrong, I usually am.

baq 12/17/2025|||

If Gemini 3 flash is really confirmed close to Opus 4.5 at coding and a similarly capable model is open weights, I want to buy a box with an usb cable that has that thing loaded, because today that’s enough to run out of engineering work for a small team.

eru 12/18/2025||

Open weights doesn't mean you can necessarily run it on a (small) box.

If Google released their weights today, it would technically be open weight; but I doubt you'd have an easy time running the whole Gemini system outside of Google's datacentres.

leemoore 12/17/2025||||

Gemini isn't code red for Anthropic. Gemini threatens none of Anthropic's positioning in the market.

ralusek 12/17/2025|||

Yes it does. I never use Claude anymore outside of agentic tasks.

leemoore 12/17/2025|||

What demographic are you in that is leaving Anthropic en masse that they care about retaining? From what I see Anthropic is targeting enterprise and coding.

Claude Code just caught up to cursor (no 2) in revenue and based on trajectories is about to pass GitHub copilot (number 1) in a few more months. They just locked down Deloitte with 350k seats of Claude Enterprise.

In my fortune 100 financial company they just finished crushing open ai in a broad enterprise wide evaluation. Google Gemini was never in the mix, never on the table and still isn’t. Every one of our engineers has 1k a month allocated in Claude tokens for Claude enterprise and Claude code.

There is 1 leader with enterprise. There is one leader with developers. And google has nothing to make a dent. Not Gemini 3, not Gemini cli, not anti gravity, not Gemini. There is no Code Red for Anthropic. They have clear target markets and nothing from google threatens those.

Karrot_Kream 12/18/2025|||

I agree with your overall thesis but:

> Google Gemini was never in the mix, never on the table and still isn’t. Every one of our engineers has 1k a month allocated in Claude tokens for Claude enterprise and Claude code.

Does that mean y'all never evaluated Gemini at all or just that it couldn't compete? I'd be worried that prior performance of the models prejudiced stats away from Gemini, but I am a Claude Code and heavy Anthropic user myself so shrug.

user34283 12/18/2025|||

Enterprise is slow. As for developers, we will be switching to Google unless the competition can catch up and deliver a similarly fast model.

Enterprise will follow.

I don't see any distinction in target markets - it's the same market.

Imustaskforhelp 12/18/2025||

Yeah, this is what I was trying to say in my original comment too.

Also I do not really use agentic tasks but I am not sure that gemini 3/3 flash have mcp support/skills support for agentic tasks

if not, I feel like they are very low hanging fruits and something that google can try to do too to win the market of agentic tasks over claude too perhaps.

user34283 12/18/2025||

I don't use MCP, but I am using agents in Antigravity.

So far they seem faster with Flash, and with less corruption of files using the Edit tool - or at least it recovered faster.

siva7 12/17/2025|||

so? agentic tasks is where the promised agi is for many of us

Workaccount2 12/17/2025|||

Open source models are riding coat tails, they are basically just distilling the giant SOTA models, hence perpetually being 4-6mos behind.

waffletower 12/17/2025|||

If this quantification of lag is anywhere near accurate (it may be larger and/or more complex to describe), soon open source models will be "simply good enough". Perhaps companies like Apple could be 2nd round AI growth companies -- where they market optimized private AI devices via already capable Macbooks or rumored appliances. While not obviating cloud AI, they could cheaply provide capable models without subscription while driving their revenue through increased device sales. If the cost of cloud AI increases to support its expense, this use case will act as a check on subscription prices.

xzjis 12/18/2025||

Google already has dedicated hardware for running private LLMs: just look at what they're doing on the Google Pixel. The main limiting factor right now is access to hardware that's powerful enough, and especially has enough memory, to run a good LLM, which will happen eventually. Normally, by 2031 we should have devices with 400 GB of RAM, but the current RAM crisis could throw off my calculations...

Gigachad 12/17/2025|||

So basically the proprietary models are devalued to almost 0 in about 4-6 months. Can they recover the training costs + profit margin every 4 months?

Workaccount2 12/17/2025|||

Coding is basically an edge case for LLMs too.

Pretty much every person in the first (and second) world is using AI now, and only small fraction of those people are writing software. This is also reflected in OAI's report from a few months ago that found programming to only be 4% of tokens.

int_19h 12/17/2025|||

That may be so, but I rather suspect the breakdown would be very different if you only count paid tokens. Coding is one of the few things where you can actually get enough benefit out of AI right now to justify high-end subscriptions (or high pay-per-token bills).

aleph_minus_one 12/17/2025|||

> Pretty much every person in the first (and second) world is using AI now

This sounds like you live in a huge echo chamber. :-(

chpatrick 12/18/2025|||

All of my non techy friends use it, it's the new search engine. I think at this point people refusing to use it are the echo chamber.

lukan 12/17/2025|||

Depends what you count as AI (just googling makes you use the LLM summary), but also my mother, who is really not tech affine, loved what Google Lens can do after I showed her.

Apart from my very old grandmothers, I don't know anyone not using AI.

pests 12/17/2025|||

How many people do you know? Do you talk to your local shop keeper? Or the clerk at the gas station? How are they using AI? I'm a pretty techy person with a lot of tech friends, and I know more people not using AI (on purpose, or lack of knowledge) than do.

GeneralMaximus 12/18/2025|||

I live in India and a surprising number of people here are using AI.

A lot of public religious imagery is very clearly AI generated, and you can find a lot of it on social media too. "I asked ChatGPT" is a common refrain at family gatherings. A lot of regular non-techie folks (local shopkeepers, the clerk at the gas station, the guy at the vegetable stand) have been editing their WhatsApp profile pictures using generative AI tools.

Some of my lawyer and journalist friends are using ChatGPT heavily, which is concerning. College students too. Bangalore is plastered with ChatGPT ads.

There's even a low-cost ChatGPT plan called ChatGPT Go you can get if you're in India (not sure if this is available in the rest of the world). It costs ₹399/mo or $4.41/mo, but it's completely free for the first year of use.

So yes, I'd say many people outside of tech circles are using AI tools. Even outside of wealthy first-world countries.

lukan 12/17/2025|||

Hm, quite some. Like I said, it depends what you count as AI.

Just googling means you use AI nowadays.

eru 12/18/2025||

Whether Googling something counts as AI has more to do with the shifting definition of AI over time than with Googling itself.

Remember, really back in the day the A* search algorithm was part of AI.

If you had shown anyone in the 1970s a box that, given a query, pinpoints the right document that answers that question (aka Google search in the early 2000s), they definitely would have called it AI.

lukan 12/18/2025||

Google gives you an AI summary, reading that means interacting with LLMs.

pests 12/18/2025||

Google also gives you ads. Some learn to scroll past before reading.

SoftTalker 12/18/2025|||

I'm sort of old but not a grandmother. Not using AI.

epolanski 12/17/2025|||

Gemini 2.0 flash was good already for some tasks of mine long time ago..

kqr 12/18/2025|||

Yes, 2.5 Flash is extremely cost efficient in my favourite private benchmark: playing text adventures[1]. I'm looking forward to testing 3.0 Flash later today.

[1]: https://entropicthoughts.com/haiku-4-5-playing-text-adventur...

freedomben 12/17/2025|||

Cool! I've been using 2.5 flash and it is pretty bad. 1 out of 5 answers it gives will be a lie. Hopefully 3 is better

samyok 12/17/2025||

Did you try with the grounding tool? Turning it on solved this problem for me.

Davidzheng 12/17/2025||

what if the lie is a logical deduction error not a fact retrieval error

rat9988 12/17/2025||

The error rate would still be improved overall and might make it a viable tool for the price depending on the usecase.

unsupp0rted 12/17/2025|||

How good is it for coding, relative to recent frontier models like GPT 5.x, Sonnet 4.x, etc?

jasonjmcghee 12/18/2025|||

My experience so far- much less reliable. Though it’s been in chat not opencode or antigravity etc. you give it a program and say change it in this way, and it just throws stuff away, changes unrelated stuff etc. completely different quality than pro (or sonnet 4.5 / GPT-5.2)

PrayagS 12/18/2025|||

Been thinking of having Opus generate plans and then having Gemini 3 Flash execute. Might be better than using Haiku for the same.

Anyone tried something similar already?

piokoch 12/18/2025|||

So why Flash is so high in LiveCodeBench Pro?

BTW: I have the same impression, Claude was working better for me for coding tasks.

bovermyer 12/17/2025|||

In my own, very anecdotal, experience, Gemini 3 Pro and Flash are both more reliably accurate than GPT 5.x.

I have not worked with Sonnet enough to give an opinion there.

pplonski86 12/18/2025|||

Lately I was trying to ask LLMs to generate SVG pictures, do you have the famous pelican on a bike created by the flash model?

encroach 12/17/2025|||

How did you get early access?

ZuoCen_Liu 12/18/2025|||

What type of question is your one about testing AI inference time?

tonymet 12/17/2025|||

Can you be more specific on the tasks you’ve found exceptional ?

dfsegoat 12/17/2025|||

> it’s more performant than Claude Opus 4.5 or GPT 5.2 extra high

...and all of that done without any GPUs as far as i know! [1]

[1] - https://www.uncoveralpha.com/p/the-chip-made-for-the-ai-infe...

(tldr: afaik Google trained Gemini 3 entirely on tensor processing units - TPUs)

poopiokaka 12/17/2025|||

[dead]

Sincere6066 12/17/2025|||

[flagged]

moffkalast 12/17/2025||

Should I not let the "Gemini" name fool me either?

__jl__ 12/17/2025||

This is awesome. No preview release either, which is great for production.

They are pushing the prices higher with each release though: API pricing is up to $0.5/M for input and $3/M for output

For comparison:

Gemini 3.0 Flash: $0.50/M for input and $3.00/M for output

Gemini 2.5 Flash: $0.30/M for input and $2.50/M for output

Gemini 2.0 Flash: $0.15/M for input and $0.60/M for output

Gemini 1.5 Flash: $0.075/M for input and $0.30/M for output (after price drop)

Gemini 3.0 Pro: $2.00/M for input and $12/M for output

Gemini 2.5 Pro: $1.25/M for input and $10/M for output

Gemini 1.5 Pro: $1.25/M for input and $5/M for output
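To put numbers on the trend, here is a quick Python sketch of the generation-over-generation multipliers for the Flash line, using only the prices listed above. The function name `step_increases` is made up for illustration.

```python
# Flash prices in USD per million tokens (input, output), copied from the list above.
FLASH = {
    "1.5": (0.075, 0.30),
    "2.0": (0.15, 0.60),
    "2.5": (0.30, 2.50),
    "3.0": (0.50, 3.00),
}

def step_increases(prices):
    """Return (from, to, input multiplier, output multiplier) for consecutive versions."""
    versions = list(prices)  # dicts keep insertion order, so this is oldest to newest
    steps = []
    for prev, cur in zip(versions, versions[1:]):
        in_mult = prices[cur][0] / prices[prev][0]
        out_mult = prices[cur][1] / prices[prev][1]
        steps.append((prev, cur, round(in_mult, 2), round(out_mult, 2)))
    return steps

for prev, cur, in_mult, out_mult in step_increases(FLASH):
    print(f"{prev} -> {cur}: input x{in_mult}, output x{out_mult}")
```

These are list prices per token only; they say nothing about how many tokens each generation needs for the same task.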

I think image input pricing went up even more.

Correction: It is a preview model...

mips_avatar 12/17/2025||

I'm more curious how Gemini 3 flash lite performs/is priced when it comes out. Because it may be that for most non coding tasks the distinction isn't between pro and flash but between flash and flash lite.

KoolKat23 12/17/2025|||

Token usage also needs to be factored in specifically when thinking is enabled, these newer models find more difficult problems easier and use less tokens to solve.

srameshc 12/17/2025|||

Thanks, that was a great breakdown of cost. I had just assumed it was the same pricing. The pricing probably comes from the confidence and the buzz around Gemini 3.0 as one of the best-performing models. But competition is hot in this area, and it won't be long before we get similarly performing models for a cheaper price.

YetAnotherNick 12/17/2025|||

For comparison, GPT-5 mini is $0.25/M for input and $2.00/M for output, so Gemini 3 Flash is double the price for input and 50% higher for output.

AuthError 12/17/2025||

flash is closer to sonnet than gpt minis though

martythemaniak 12/17/2025|||

The price increase sucks, but you really do get a whole lot more. They also had the "Flash Lite" series: 2.5 Flash Lite is $0.10/M, so hopefully we see something like 3.0 Flash Lite for $0.20-0.25/M.

sunaookami 12/17/2025|||

This is a preview release.

reed1234 12/18/2025||

https://openrouter.ai/google/gemini-3-flash-preview

uluyol 12/17/2025|||

Are these the current prices or the prices at the time the models were released?

__jl__ 12/17/2025||

Mostly at the time of release except for 1.5 Flash which got a price drop in Aug 2024.

Google has been discontinuing older models after several months of transition period so I would expect the same for the 2.5 models. But that process only starts when the release version of 3 models is out (pro and flash are in preview right now).

misiti3780 12/17/2025||

is there a website where i can compare openai, anthropic and gemini models on cost/token ?

jsnell 12/17/2025|||

There are plenty. But it's not the comparison you want to be making. There is too much variability between the number of tokens used for a single response, especially once reasoning models became a thing. And it gets even worse when you put the models into a variable length output loop.

You really need to look at the cost per task. artificialanalysis.ai has a good composite score, measures the cost of running all the benchmarks, and has a 2D intelligence vs. cost graph.
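As a rough sketch of why per-token price alone can mislead: the per-million prices below are the Gemini 3 Flash and Pro figures quoted elsewhere in this thread, but the token counts are hypothetical and chosen only to illustrate the effect.

```python
def cost_per_task(input_tokens, output_tokens, input_price_per_m, output_price_per_m):
    """USD cost of one task, given token counts and per-million-token prices."""
    total = input_tokens * input_price_per_m + output_tokens * output_price_per_m
    return total / 1_000_000


# Hypothetical token counts: the cheaper model is assumed to emit far more
# reasoning/output tokens for the same task.
cheap_but_verbose = cost_per_task(20_000, 50_000, 0.50, 3.00)
pricey_but_terse = cost_per_task(20_000, 8_000, 2.00, 12.00)

print(f"cheap per token:  ${cheap_but_verbose:.3f} per task")
print(f"pricey per token: ${pricey_but_terse:.3f} per task")
```

With these made-up counts the model that is four times cheaper per token ends up costing more per task, which is the case a per-token comparison table would hide.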

misiti3780 12/17/2025||

thanks

deaux 12/18/2025||

For reference the above completely depends on what you're using them for. For many tasks, the number of tokens used is consistent within 10~20%.

deaux 12/18/2025||||

https://www.helicone.ai/llm-cost

Tried a lot of them and settled on this one, they update instantly on model release and having all models on one page is the best UX.

rrhartjr 12/18/2025||||

https://www.llm-prices.com/

int_19h 12/17/2025|||

https://openrouter.ai/models

RobinL 12/17/2025||

Feels like Google is really pulling ahead of the pack here. A model that is cheap, fast and good, combined with Android and gsuite integration, seems like such a powerful combination.

Presumably a big motivation for them is to be first to get something good and cheap enough they can serve to every Android device, ahead of whatever the OpenAI/Jony Ive hardware project will be, and way ahead of Apple Intelligence. Speaking for myself, I would pay quite a lot for a truly 'AI first' phone that actually worked.

exegete 12/17/2025||

Apple Intelligence is going to be Gemini https://www.macrumors.com/2025/11/05/apple-siri-google-gemin...

willis936 12/18/2025||

That's too bad. Apple's most interesting value proposition is running local inference with big privacy promises. They wouldn't need to be the highest performer to offer something a lot of people might want.

cmckn 12/18/2025|||

My understanding is Apple will be hosting Gemini models themselves on the private compute system they announced a while back.

floundy 12/18/2025|||

Apple’s most interesting value proposition was ignoring all this AI junk and letting users click “not interested” on Apple Intelligence and never see it again.

From a business perspective it’s a smart move (inasmuch as “integrating AI” is the default which I fundamentally disagree with) since Apple won’t be left holding the bag on a bunch of AI datacenters when/if the AI bubble pops.

I don’t want to lose trust in Apple, but I literally moved away from Google/Android to try and retain control over my data and now they’re taking me… right back to Google. Guess I’ll retreat further into self-hosting.

willis936 12/18/2025|||

I also agree with this. Microsoft successfully removed my entire household from ever owning one of their products again after this year. Apple and linux make up the entire delta.

As long as Apple doesn't take any crazy left turns with their privacy policy then it should be relatively harmless if they add in a google wrapper to iOS (and we won't need to take hard right turns with grapheneOS phones and framework laptops).

bitpush 12/18/2025||||

> Apple’s most interesting value proposition was ignoring all this AI junk

Did you forget all the Apple Intelligence stuff? They were never "ignoring" if anything they talked a big talk, and then failed so hard.

The whole iPhone 16 was marketed as AI first phone (including in billboards). They had full length ads running touting AI benefits.

Apple was never "ignoring" or "sitting AI out". They were very much in it. And they failed.

hu3 12/18/2025|||

Sure. If by ignore you mean flaunt about Apple Intelligence only to fail miserably on the expectation they themselves generated.

skerit 12/17/2025|||

Pulling ahead? Depends on the usecase I guess. 3 turns into a very basic Gemini-CLI session and Gemini 3 Pro has already messed up a simple `Edit` tool-call. And it's awfully slow. In 27 minutes it did 17 tool calls, and only managed to modify 2 files. Meanwhile Claude-Code flies through the same task in 5 minutes.

RobinL 12/17/2025|||

Yeah - agree, Anthropic much better for coding. I'm more thinking about the 'average chat user' (the larger potential userbase), most of whom are on chatgpt.

nowittyusername 12/18/2025|||

Knowing Google's MO, it's most likely not the model but their harness system that's the issue. God they are so bad at their UI and agentic coding harnesses...

eldenring 12/18/2025||

I think Claude is genuinely much smarter, and more lucid.

mark_l_watson 12/18/2025|||

My non-tech brother has the latest Google Pixel phone and he enthusiastically uses Gemini for many interactions with his phone.

I almost switched out of the Apple ecosystem a few months ago, but I have an Apple Studio monitor and using it with non-Apple gear is problematic. Otherwise a Pixel phone and a Linux box with a commodity GPU would do it for me.

anukin 12/17/2025||

What will you use the ai in the phone to do for you? I can understand tablets and smart glasses being able to leverage smol AI much better than a phone which is reliant on apps for most of the work.

Workaccount2 12/17/2025|||

I desperately want to be able to real-time dictate actions to take on my phone.

Stuff like:

"Open Chrome, new tab, search for xyz, scroll down, third result, copy the second paragraph, open whatsapp, hit back button, open group chat with friends, paste what we copied and send, send a follow-up laughing tears emoji, go back to chrome and close out that tab"

All while being able to just quickly glance at my phone. There is already a tool like this, but I want the parsing/understanding of an LLM and super fast response times.

KoolKat23 12/17/2025|||

This new model is absurdly quick on my phone and for launch day, wonder if it's additional capacity/lower demand or if this is what we can expect going forward.

On a related note, why would you want to break down your tasks to that level? Surely it should be smart enough to do some of that without you asking, and you can just state your end goal.

pests 12/17/2025||||

This has been my dream for voice control of PC for ages now. No wake word, no button press, no beeping or nagging, just fluently describe what you want to happen and it does.

pylotlight 12/18/2025|||

without a wake word, it would have to listen and process all parsed audio. you really want everything captured near the device/mic to be sent to external servers?

TeMPOraL 12/18/2025||

I might if that's what it takes to make it finally work. The fueling of the previous 15 years was not worth it, but that was then.

nielsbot 12/18/2025|||

Apple tried this ages ago:

https://en.wikipedia.org/wiki/PlainTalk

procaryote 12/17/2025|||

is that faster to say than do, or is it an accessibility or while-driving need?

CamperBob2 12/18/2025|||

I don't understand that use case at all. How can you tell it to do all that stuff, if you aren't sitting there glued to the screen yourself?

TeMPOraL 12/18/2025||

Because typing on mobile is slow, app switching is slow, text selection and copy-paste are torture. Pretty much the only interaction of the ones OP listed is scrolling.

Plus, if the above worked, the higher level interactions could trivially work too. "Go to event details", "add that to my calendar".

FWIW, I'm starting to embrace using Gemini as general-purpose UI for some scenarios just because it's faster. Most common one, "<paste whatever> add to my calendar please."

wiseowise 12/18/2025|||

Analyse e-mails/text/music/videos, edit photos, summarization, etc.

fariszr 12/17/2025||

These flash models keep getting more expensive with every release.

Is there an OSS model that's better than 2.0 flash with similar pricing, speed and a 1m context window?

Edit: this is not the typical flash model, it's actually an insane value if the benchmarks match real world usage.

> Gemini 3 Flash achieves a score of 78%, outperforming not only the 2.5 series, but also Gemini 3 Pro. It strikes an ideal balance for agentic coding, production-ready systems and responsive interactive applications.

The replacement for old flash models will probably be the 3.0 flash lite then.

thecupisblue 12/17/2025||

Yes, but the 3.0 Flash is cheaper, faster and better than 2.5 Pro.

So if 2.5 Pro was good for your usecase, you just got a better model for about 1/3rd of the price, but might hurt the wallet a bit more if you use 2.5 Flash currently and want an upgrade - which is fair tbh.

mark_l_watson 12/18/2025||

I agree, adding one point: a better model can in effect use fewer tokens if you get a higher percentage of successful one-shots to work. I am a ‘retired gentleman scientist’ so take this with a grain of salt (I do a lot of non-commercial, non-production experiments): when I watch the output for tool use, better models have fewer tool ‘re-tries.’

aoeusnth1 12/17/2025|||

I think it's good, they're raising the size (and price) of flash a bit and trying to position Flash as an actually useful coding / reasoning model. There's always lite for people who want dirt cheap prices and don't care about quality at all.

sosodev 12/17/2025|||

Nvidia released Nemotron 3 nano recently and I think it fits your requirements for an OSS model: https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B...

It's extremely fast on good hardware, quite smart, and can support up to 1m context with reasonable accuracy

mark_l_watson 12/18/2025||

I second this: I have spent about five hours this week experimenting with Nemotron 3 nano for both tool use and code analysis: it is excellent! and fast!

Relevant to the linked Google blog: I feel like getting Nemotron 3 nano and Gemini 3 flash in one week is an early Christmas gift. I have lived with the exponential improvements in practical LLM tools over the last three years, but this week seems special.

mips_avatar 12/17/2025|||

For my apps evals Gemini flash and grok 4 fast are the only ones worth using. I'd love for an open weights model to compete in this arena but I haven't found one.

scrollop 12/17/2025|||

This one is more powerful than openai models, including gpt 5.2 (which is worse on various benchmarks than 5.1 which is worse than 5.1, and that's where 5.2 was using XHIGH, whilst the others were on high, e.g.: https://youtu.be/4p73Uu_jZ10?si=x1gZopegCacznUDA&t=582 )

https://epoch.ai/benchmarks/simplebench

fullstackwife 12/17/2025||

cost of e2e task resolution should be cheaper, even if single inference cost is higher, you need fewer loops to solve a problem now

fariszr 12/17/2025||

Sure, but for simple tasks that require a large context window, aka the typical usecase for 2.0 flash, it's still significantly more expensive.

qnleigh 12/18/2025||

This model is breaking records on my benchmark of choice, which is 'the fraction of Hacker News comments that are positive.' Even people who avoid Google products on principle are impressed. Hardly anyone is arguing that ChatGPT is better in any respect (except brand recognition).

ipsum2 12/18/2025||

Chatgpt 5.2 thinking is significantly better quality for most knowledge work, but it trades off in speed.

energy123 12/18/2025|||

That has been my experience. Primarily because it is allowed to expend far more test-time tokens than Gemini 3.0 Pro to solve the same prompt.

eli 12/18/2025|||

And GPT costs 4x as much

Palmik 12/18/2025|||

No offense, but that seems like a poor benchmark. These initial vibe checks are easily swayed by personal brand biases.

awestroke 12/18/2025|||

The brand bias is heavily against Google, not in Googles favor

Palmik 12/18/2025||

In context of AI I'm mostly seeing anti-OpenAI pro-Google bias.

clarkmoreno 12/18/2025||

Facts. These HN threads are half astroturfing and paid shills. Near impossible to decipher authentic takes that are not actual colleagues or people IRL

qnleigh 12/18/2025|||

Fair. No benchmark is perfect.

I do pay special attention to what the most negative comments say (which in this case are unusually positive). And people discussing performance on their own personal benchmarks.

Simon321 12/18/2025||

i don't know, chat gpt seems to hallucinate a lot less

Workaccount2 12/17/2025||

So gemini 3 flash (non thinking) is now the first model to get 50% on my "count the dog legs" image test.

Gemini 3 pro got 20%, and everyone else has gotten 0%. I saw benchmarks showing 3 flash is almost trading blows with 3 pro, so I decided to try it.

Basically it is an image showing a dog with 5 legs, an extra one photoshopped onto its torso. Every model counts 4, and gemini 3 pro, while also counting 4, said the dog had a "large male anatomy". However it failed a follow-up, saying 4 again.

3 flash counted 5 legs on the same image, however I added a distinct "tattoo" to each leg as an assist. These tattoos didn't help 3 pro or other models.

So it is the first out of all the models I have tested to count 5 legs on the "tattooed legs" image. It still counted only 4 legs on the image without the tattoos. I'll give it 1/2 credit.

Valakas_ 12/18/2025|

What if you also number the legs, but with an error like: 1,2,3,5,6. Or 1,2,3, ,4.

simonsarris 12/17/2025||

Even before this release the tools (for me: Claude Code and Gemini for other stuff) reached a "good enough" plateau that means any other company is going to have a hard time making me (I think soon most users) want to switch. Unless a new release from a different company has a real paradigm shift, they're simply sufficient. This was not true in 2023/2024 IMO.

With this release the "good enough" and "cheap enough" intersect so hard that I wonder if this is an existential threat to those other companies.

bgirard 12/17/2025||

Why wouldn't you switch? The cost to switch is near zero for me. Some tools have built-in model selectors. Direct CLI/IDE plug-ins have practically the same UI.

azuanrb 12/17/2025|||

Not OP, but I feel the same way. Cost is just one of the factors. I'm used to Claude Code UX, my CLAUDE.md works well with my workflow too. Unless there's any significant improvement, changing to new models every few months is going to hurt me more.

bgirard 12/17/2025|||

I used to think this way. But I moved to AGENTS.md. Now I use the different UI as a mental context separation. Codex is working on Feature A, Gemini on feature B, Claude on Feature C. It has become a feature.

rolisz 12/17/2025||

You're assuming that different models need the same stuff in AGENTS.md

In my experience, to get the best performance out of different models, they need slightly different prompting.

NamlchakKhandro 12/17/2025||||

just switch to Opencode and stop locking yourself into a particular provider's way of doing things.

There's a plugin for everything that mimics anything the others are doing

azuanrb 12/18/2025||

Being open does not magically make everything better. People are willing to pay for Claude Code for many valid reasons. You are also assuming I have never used OpenCode, which is incorrect. Claude is simply my preference.

I see all of these tools as IDEs. Whether someone locks into VS Code, JetBrains, Neovim, or Sublime Text comes down to personal preference. Everyone works differently, and that is completely fine.

NamlchakKhandro 12/26/2025||

I use claude on opencode.

I'm not sure you even understand what opencode is.

Gasp0de 12/18/2025|||

Does that mean that you also don't switch to newer Anthropic models? Because they would change similarly, wouldn't they?

nevir 12/17/2025||||

I think a big part of the switching cost is the cost of learning a different model's nuances. Having good intuition for what works/doesn't, how to write effective prompts, etc.

Maybe someday future models will all behave similarly given the same prompt, but we're not quite there yet

NamlchakKhandro 12/17/2025|||

Because some people are restricted by company policy to only use providers with which they have a legally binding agreement to not use their chats as training data.

theLiminator 12/17/2025|||

For me, the last wave of models finally started delivering on their agentic coding promises.

orourke 12/17/2025|||

This has been my experience exactly. Even over just the last few weeks I’ve noticed a dramatic drop in having to undo what the agents have done.

inquirerGeneral 12/17/2025|||

[dead]

nprateem 12/17/2025|||

But for me the previous models were routinely wrong time wasters that overall added no speed increase taking the lottery of whether they'd be correct into account.

catigula 12/17/2025|||

Correct. Opus 4.5 'solved' software engineering. What more do I need? Businesses need uncapped intelligence, and that is a very high bar. Individuals often don't.

gaigalas 12/17/2025|||

If Opus is one-size-fits-all, then why does Claude keep the other series? (rhetorical).

Opus and Sonnet are slower than Haiku. For lots of less sophisticated tasks, you benefit from the speed.

All vendors do this. You need smaller models that you can rapid-fire for lots of other reasons than vibe coding.

Personally, I actually use more smaller models than the sophisticated ones. Lots of small automations.

dimitri-vs 12/18/2025||

Yes, all the major CLIs (Claude Code, Codex, etc) and many agentic applications use a large-model main agent with task delegation to small-model sub-agents. For example in CC using Opus4.5 it will delegate an Explore task to a Haiku/Sonnet subagent or multiple subagents.

gaigalas 12/18/2025||

The agent interfaces are for human interaction. Some tasks can be fully unattended though. For those, I find smaller models more capable due to their speed.

Think beyond interfaces. I'm talking about rapid-firing hundreds of small agents and having zero human interaction with them. The feedback is deterministic (non agentic) and automated too.

esperent 12/18/2025|||

> What more do I need?

Much cheaper price and much faster token generation.

At least, that's what I need. I stopped using Anthropic because for their $20 a month offering, I get rate limited constantly, but for Gemini $20/month I've never even once hit a limit.

calflegal 12/17/2025|||

I asked a similar question yesterday:

https://news.ycombinator.com/item?id=46290797

alex1138 12/17/2025|||

I just can't stop thinking though about the vulnerability of training data

You say good enough. Great, but what if I as a malicious person were to just make a bunch of internet pages containing things that are blatantly wrong, to trick LLMs?

calflegal 12/17/2025|||

The internet has already tried this, for a few decades. The garbage is in the corpus; it gets weighted as such

floundy 12/18/2025|||

>a bunch of internet pages containing things that are blatantly wrong

So Reddit?

I’d imagine the AI companies have all the “pre AI internet” data they scraped very carefully catalogued.

szundi 12/17/2025||

[dead]

mmaunder 12/17/2025||

I think about what would be most terrifying to Anthropic and OpenAI, i.e. the absolute scariest thing that Google could do. I think this is it: Release low latency, low priced models with high cognitive performance and a big context window, especially in the coding space because that is direct, immediate, very high ROI for the customer.

Now, imagine for a moment they had also vertically integrated the hardware to do this.

JumpCrisscross 12/17/2025||

> think about what would be most terrifying to Anthropic and OpenAI

The most terrifying thing would be Google expanding its free tiers.

wasabi991011 12/18/2025|||

It's the only model provider that has offered a decent deal to students: a full year of google ai pro.

Granted, this doesn't give api access, only what google calls their "consumer ai products", but it makes a huge difference when chatgpt only allows a handful of document uploads and deep research queries per day.

Davidzheng 12/18/2025|||

on aistudio the free tier limits on all models are decent

mark_l_watson 12/18/2025||

I turned on API billing in AI Studio in the hope of getting the best possible service. As long as you are not using the Gemini thinking and research APIs for long-running computations, the APIs are very inexpensive to use.

avazhi 12/17/2025||

"Now, imagine for a moment they had also vertically integrated the hardware to do this."

Then you realise you aren't imagining it.

iwontberude 12/17/2025||

“And then imagine Google designing silicon that doesn’t trail the industry. While you are there we may as well start to imagine Google figures out how to support a product lifecycle that isn’t AdSense”

Google is great on the data science alone, everything else is an afterthought

avazhi 12/17/2025||

https://blog.google/products/google-cloud/ironwood-google-tp...

"And then imagine Google designing silicon that doesn’t trail the industry."

I'm def not a Google stan generally, but uh, have you even been paying attention?

https://en.wikipedia.org/wiki/Tensor_Processing_Unit

mmaunder 12/17/2025|||

It's not funny when I have to explain the joke.

avazhi 12/17/2025||

Oh I got your joke, sir - but as you can see from the other comment, there are techies who still don't have even a rudimentary understanding of tensor cores, let alone the wider public and many investors. Over the next year or two the gap between Google and everybody else, even those they license their hardware to, is going to explode.

iwontberude 12/17/2025|||

Exactly my point, they have bespoke offerings but when they compete head to head for performance they get smoked. See more: their Tensor processor that they use in the beleaguered Pixel. They are in last place.

TPUs on the other hand are ASICs, we are more than familiar with the limited application, high performance and high barriers to entry associated with them. TPUs will be worthless as the AI bubble keeps deflating and excess capacity is everywhere.

The people who don't have a rudimentary understanding are the wall street boosters that treat it like the primary threat to Nvidia or a moat for Google (hint: it is neither).

kingstnap 12/17/2025||

It has a SimpleQA score of 69%, a benchmark that tests knowledge on extremely niche facts, that's actually ridiculously high (Gemini 2.5 *Pro* had 55%) and reflects either training on the test set or some sort of cracked way to pack a ton of parametric knowledge into a Flash Model.

I'm speculating but Google might have figured out some training magic trick to balance out the information storage in model capacity. That or this flash model has a huge number of parameters or something.

scrollop 12/17/2025||

Also

https://artificialanalysis.ai/evaluations/omniscience

Prepare to be amazed

albumen 12/17/2025|||

I’m amazed by how much Gemini 3 flash hallucinates; it performs poorly in that metric (along with lots of other models). In the Hallucination Rate vs. AA-Omniscience Index chart, it’s not in the most desirable quadrant; GPT-5.1 (high), opus 4.5 and 4.5 haiku are.

Can someone explain how Gemini 3 pro/flash then do so well in the overall Omniscience: Knowledge and Hallucination Benchmark?

wasabi991011 12/18/2025|||

Hallucination rate is hallucination/(hallucination+partial+ignored), while omniscience is correct-hallucination.

One hypothesis is that gemini 3 flash refuses to answer when unsure less often than other models, but when sure is also more likely to be correct. This is consistent with it having the best accuracy score.

Wyverald 12/18/2025|||

I'm a total noob here, but just pointing out that Omniscience Index is roughly "Accuracy - Hallucination Rate". So it simply means that their Accuracy was very high.

> In the Hallucination Rate vs. AA-Omniscience Index chart, it’s not in the most desirable quadrant

This doesn't mean much. As long as Gemini 3 has a high hallucination rate (higher than at least 50% of the others), it's not going to be in the most desirable quadrant by definition.

For example, let's say a model answers 99 out of 100 questions correctly. The 1 wrong answer it produces is a hallucination (i.e. confidently wrong). This amazing model would have a 100% hallucination rate as defined here, and thus not be in the most desirable quadrant. But it should still have a very high Omniscience Index.
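The two metrics as described in this subthread can be sketched in a few lines. The formulas are taken from the comments here (hallucination rate as wrong answers over all non-correct responses, index as correct minus wrong), not checked against the benchmark's own methodology, and the function names are made up.

```python
def hallucination_rate(incorrect, partial, abstained):
    """Share of non-correct responses that were confident wrong answers."""
    not_correct = incorrect + partial + abstained
    return incorrect / not_correct if not_correct else 0.0


def omniscience_index(correct, incorrect, total):
    """Correct minus incorrect, as a percentage of all questions."""
    return 100 * (correct - incorrect) / total


# The 99-out-of-100 example: one confident wrong answer, no abstentions.
bold_rate = hallucination_rate(incorrect=1, partial=0, abstained=0)
bold_index = omniscience_index(correct=99, incorrect=1, total=100)

# A hypothetical cautious model that abstains often.
cautious_rate = hallucination_rate(incorrect=5, partial=0, abstained=35)
cautious_index = omniscience_index(correct=60, incorrect=5, total=100)

print(bold_rate, bold_index)
print(cautious_rate, cautious_index)
```

Under these definitions the first model has the worst possible hallucination rate and still the far higher index, which is why the two axes of that chart can disagree.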

andy12_ 12/18/2025|||

I'm confused about the "Accuracy vs Cost" section. Why is Gemini 3 Pro so cheap? It's basically the cheapest model in the graph (sans Llama 4 and Mistral Large 3) by a wide margin, even compared to Gemini 3 Flash. Is that an error?

noelsusman 12/18/2025||

It's not an error, Gemini 3 Pro is just somehow able to complete the benchmark while using way fewer tokens than any other model. Gemini 3 Flash is way cheaper per token, but it also tends to generate a ton of reasoning tokens to get to its answer.

They have a similar chart that compares results across all their benchmarks vs. cost and 3 Flash is about half as expensive as 3 Pro there despite being four times cheaper per token.
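A back-of-the-envelope reading of those two figures, assuming total cost scales as price per token times tokens used and that the input/output mix is similar for both models (a simplification; the helper name is made up):

```python
def implied_token_ratio(relative_total_cost, relative_price_per_token):
    """Tokens used relative to the baseline, given relative cost and relative price."""
    return relative_total_cost / relative_price_per_token


# 3 Flash vs 3 Pro, per the comment above: about half the total cost
# at a quarter of the per-token price.
ratio = implied_token_ratio(relative_total_cost=0.5, relative_price_per_token=0.25)
print(f"3 Flash uses roughly {ratio:.1f}x the tokens of 3 Pro")
```

So on that chart Flash would be spending roughly twice as many tokens as Pro across the benchmark suite.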

int_19h 12/17/2025|||

> reflects either training on the test set or some sort of cracked way to pack a ton of parametric knowledge into a Flash Model

That's what MoE is for. It might be that with their TPUs, they can afford lots of params, just so long as the activated subset for each token is small enough to maintain throughput.
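For readers unfamiliar with the idea, here is a toy sketch of top-k expert routing in plain Python. Nothing in the thread confirms how Gemini is built; the expert count, k, and all names are hypothetical, and the "active fraction" ignores shared layers such as attention and embeddings.

```python
import math
import random


def top_k_route(router_logits, k):
    """Pick the k highest-scoring experts and softmax-normalise their weights."""
    order = sorted(range(len(router_logits)), key=lambda i: router_logits[i], reverse=True)
    top = order[:k]
    peak = max(router_logits[i] for i in top)
    exps = [math.exp(router_logits[i] - peak) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]


def active_fraction(num_experts, k):
    """Fraction of expert parameters touched per token (equal-sized experts)."""
    return k / num_experts


random.seed(0)
NUM_EXPERTS, K = 64, 4  # hypothetical sizes, not any real model's config
logits = [random.gauss(0, 1) for _ in range(NUM_EXPERTS)]
routes = top_k_route(logits, K)

print(routes)
print(f"active expert fraction per token: {active_fraction(NUM_EXPERTS, K):.4f}")
```

The point of the design is that total parameter count (knowledge capacity) grows with the number of experts, while per-token compute grows only with k.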

tanh 12/17/2025|||

This will be fantastic for voice. I presume Apple will use it

GaggiX 12/17/2025|||

>or some sort of cracked way to pack a ton of parametric knowledge into a Flash Model.

More experts with a lower percentage of active ones -> more sparsity.

leumon 12/17/2025||

Or could it be that it's using tool calls in reasoning (e.g. a google search)?

simonw 12/17/2025|

Quick pricing comparison: https://www.llm-prices.com/#it=100000&ot=10000&sel=gemini-3-...

It's 1/4 the price of Gemini 3 Pro ≤200k and 1/8 the price of Gemini 3 Pro >200k - notable that the new Flash model doesn’t have a price increase after that 200,000 token point.

It’s also twice the price of GPT-5 Mini for input, half the price of Claude 4.5 Haiku.
