Posted by spectraldrift 12 hours ago

Gemini 3.5 Flash(blog.google)
https://ai.google.dev/gemini-api/docs/models/gemini-3.5-flas...
687 points | 495 comments | page 4
aliljet 12 hours ago|
Is there a good benchmark tracking hallucinations? The models are all incredibly good now, even the open ones, and my hope is that the rate of hallucinations is something that's falling off in concert with larger and larger context lengths.
vlmutolo 3 hours ago||
> While OpenAI originally pioneered Codex (which went on to power GitHub Copilot), Google’s direct answer for dedicated, native code completion and natural-language-to-code generation is CodeGemma.

https://g.co/gemini/share/33e7a589a161

WarmWash 11 hours ago|||
People complain about them incessantly, but I can almost never get people to actually post receipts. Every provider allows sharing chats, and anyone can share a prompt that reliably produces hallucinations.

More often than not, the responses that go awry involve images. Which is fair, the models are sold as multi-modal, but image analysis is still at gpt-4.0 text-analysis levels.

Also knowledge cutoff issues, where people forget the models exist months to a year or more in the past.

hibikir 11 hours ago|||
I see constant hallucination in Claude Code when using specific tooling. It thinks it knows the aws cli, for instance, but there are some flags that don't exist which it attempts to use all the time in 4.6 and 4.7. When asked about it, it says that yes, the flag doesn't exist in that command but exists in a different command (which it does), and yet, without extra info, it still attempts to use it.

Claude also believes it knows how AWS' KMS works, quite confidently, while getting things wrong. I have a separate "this is how KMS replication actually works" file just to deal with its misconceptions.

For gemini, I typically use it to query information from large corpuses, but it often web searches and hallucinates instead of reading the actual corpus. On a book series, it will hallucinate chapters and events which, while reasonable and plausible, do not exist. "Go look at the files and see if your reference is correct" shows that it's not correct, and it's a mandatory step. That doesn't prevent hallucination, though; it only makes sure you catch it after the fact, just as a call to a method that doesn't exist gets caught by the compiler. The LLM still hallucinated it.
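
The "go look at the files" step described above can be approximated mechanically. What follows is only a minimal sketch, assuming the corpus is already loaded as plain text and that the model's references can be reduced to short quoted strings; the function names, file name, and sample data are all hypothetical. It catches invented quotes after the fact and does nothing to prevent them.

```python
import re


def _normalize(text: str) -> str:
    """Lowercase and collapse whitespace so line wrapping does not cause misses."""
    return re.sub(r"\s+", " ", text).strip().lower()


def unverified_quotes(quotes: list[str], corpus: dict[str, str]) -> list[str]:
    """Return the quotes that do not appear in any corpus document."""
    normalized_docs = [_normalize(body) for body in corpus.values()]
    return [
        quote
        for quote in quotes
        if not any(_normalize(quote) in doc for doc in normalized_docs)
    ]


# Hypothetical corpus and model-attributed quotes, for illustration only.
corpus = {"chapter_01.txt": "The ship left port at dawn.\nNobody waved."}
claims = ["the ship left port at dawn", "the captain burned the maps"]

missing = unverified_quotes(claims, corpus)
print(missing)
```

Exact substring matching is deliberately strict: a paraphrased but real reference would also be flagged, so the result is a list of things to check by hand, not a verdict.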

asdfasgasdgasdg 10 hours ago||||
https://gemini.google.com/share/9cd8ca68025a

I was trying to understand a game I've been playing, The Last Spell. I asked it for a tier list of omens -- which ones the community considers most important. At least a few of the names it posts are hallucinated ("omen of the sun" does not exist, and the omens that give extra gold are "savings," "fortune," and "great wealth").

Obviously not a critical use case but issues like this do keep me on my toes regarding whether the thing is working at all. I should ask 3.5 flash to do the same job. (I did try and it once again hallucinated the omen names and some of the effects.)

hamdingers 10 hours ago||||
I can reliably produce hallucinations with this genre of prompt: "write a script that does <simple task> with <well known but not too-well-known API>." Even the frontier models will hallucinate the perfect API endpoint that does exactly what I want, regardless of whether it exists.

The fix is easy enough though, a line in my global AGENTS.md instructing agents to search/ask for documentation before working on API integrations.

sapneshnaik 10 hours ago||
Yeah. Better to have more details in your prompt than fewer. For example, I use this:

```

Build a Nango sync that stores Figma projects.

Integration ID: figma

Connection ID for dry run: my-figma-connection

Frequency: every hour

Metadata: team_id

Records: Project with id, name, last_modified

API reference: https://www.figma.com/developers/api#projects-endpoints

```

Note: You do need a Nango account and the Nango Skill installed before this can work.

Corence 9 hours ago||||
https://gemini.google.com/share/3717c8505d6b

Two of the three strip titles are hallucinated and two of the three strips are bad examples. Haley is mute in strip 403 and does nothing. Strip 578 is the start of the arc that shows the behavior Gemini is talking about, but has things going wrong so it's not a good example either.

Claude picks a good strip but also hallucinates the strip title: https://claude.ai/share/56be379d-c3da-443e-b60f-2d33c374eba8

brooksc 10 hours ago||||
I asked gemini 3.1 Pro to search for the linkedin URLs for a list of peers. It generated a plausible list of links -- but they were all hallucinated. On a follow up it confirmed it couldn't actually search, but didn't tell me that without prompting.
rjh29 11 hours ago||||
"People complain about them incessantly, but I can almost never get people to actually post receipts."

...my chats are all pretty long and involve personal conversations, or I've deleted them. It's a lot to ask for someone to post receipts. The number of complaints is enough data.

No matter how big the model is there will be edge cases where it has no data or is out of date. In these cases it just makes stuff up. You can detect it yourself by looking for words like usually or often when it states facts, e.g. "the mall often has a Starbucks." I asked it about a Genshin Impact character released in June 2025 and it consistently interpreted the name (Aino) as my player character because Aino wasn't in its data.

Honestly I'm surprised you haven't encountered it if you're using it more than casually. Pro is much better but not perfect.
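
The "look for words like usually or often" heuristic above is easy to automate as a first pass. This is a rough sketch only: the word list contains just the two words named in the comment, the sentence splitter is naive, and a flagged sentence is a prompt to double-check, not proof of a hallucination.

```python
import re

# Only the two hedge words mentioned in the comment; extend as needed.
HEDGE_WORDS = ("usually", "often")

_HEDGE_PATTERN = re.compile(
    r"\b(" + "|".join(HEDGE_WORDS) + r")\b", re.IGNORECASE
)


def hedged_sentences(answer: str) -> list[str]:
    """Return the sentences in a model answer that contain a hedge word."""
    # Naive split on sentence-ending punctuation followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", answer.strip())
    return [s for s in sentences if _HEDGE_PATTERN.search(s)]


# Sample answer built from the example in the comment.
sample = "The mall often has a Starbucks. It opens at 9am."
flagged = hedged_sentences(sample)
print(flagged)
```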

ls612 10 hours ago||
Claude has gotten good in the past month or two at recognizing when it might need to search the web for updated info rather than saying that it has no idea what I'm talking about or making stuff up.
krupan 10 hours ago||||
Are the knowledge cut off issues well known? I don't remember seeing them prominently displayed.

Also, asking for prompts that reliably produce hallucinations is kind of a hard ask. It's inconsistent. One day the LLM I work with quotes verbatim from the PCIe spec and it's super helpful. The next day it gives me wrong information, and when I ask it what section of the spec that information comes from, it just makes up a section number.

saberience 11 hours ago||||
I see hallucinations ALL the time. It's only obvious when you're prompting about a subject you know well.

And when I say all the time, I mean it, and this is for Opus 4.7 Adaptive.

I often have to say "please do searches and cite sources," because if it doesn't, it will confidently give me wrong or outdated information.

If you're often asking questions about a topic that's not in your specialist knowledge you won't notice them.

droidjj 11 hours ago||
Hallucination is also much better controlled in the context of agentic coding because outputs can be validated by running the code (or linters/LSP). I almost never notice hallucinations when I’m coding with AI, but when using AI for legal work (my real job) it hallucinates constantly and perniciously because the hallucinations are subtle—e.g., making up a crucial fact about a real case.
krupan 10 hours ago||
Yes, you can catch many mistakes that LLMs make while coding, but I wouldn't necessarily call it "controlled." Every now and then the LLM will run into dead ends where it makes a certain mistake, the compiler or unit tests find the mistake, so it tries a different approach that also fails, and then it goes back to the first approach, then tries the second approach again, and gets stuck in an endless loop trying small variations on those two approaches over and over.

If you aren't paying attention it can spend a long time (and a lot of tokens) spinning in that loop. Sometimes there might be more than two approaches in the loop, which makes it even harder to see that it's repeating itself in a loop. It's pretty frustrating to see it working away productively (so you think) for 20 minutes or so only to finally notice what's going on
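
One way to notice this kind of loop without watching the session is to log a short description of each attempt and check for repeats. The sketch below assumes such a log exists as a list of strings, which is a hypothetical setup and not a feature of any particular agent harness. It only catches attempts that are identical after normalizing case and whitespace, so the "small variations" mentioned above would need a fuzzier signature.

```python
from typing import Optional


def _signature(attempt: str) -> str:
    """Normalize case and whitespace so trivially different logs compare equal."""
    return " ".join(attempt.lower().split())


def find_repeat(attempts: list[str]) -> Optional[tuple[int, int]]:
    """Return (first_index, cycle_length) for the first repeated attempt, else None."""
    first_seen: dict[str, int] = {}
    for index, attempt in enumerate(attempts):
        sig = _signature(attempt)
        if sig in first_seen:
            return first_seen[sig], index - first_seen[sig]
        first_seen[sig] = index
    return None


# Hypothetical attempt log: two approaches alternating.
history = ["patch parser", "rewrite lexer", "Patch  parser", "rewrite lexer"]
repeat = find_repeat(history)
print(repeat)
```

Because the cycle length is reported, the same check covers loops with more than two approaches, which are the ones that are hardest to spot by eye.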

vitorgrs 7 hours ago|||
Just ask any real question about stuff. LLM is not about code only...
throawayonthe 12 hours ago|||
well there is https://artificialanalysis.ai/evaluations/omniscience
goldenarm 11 hours ago||
It's a gibberish input detection benchmark, and does not measure output hallucinations.
Sevii 12 hours ago|||
I haven't been bothered by hallucinations in premier models since early last year. Still see it in smaller local models though.
aliljet 12 hours ago||
I'm really running into this deep at the edges of content creation. Take, for example, a need to generate some kind of legal work. The cost of painstakingly checking and rechecking each case cited is reducing the value of these frontier models immensely.

Coding, however, is solved like magic. Easier to add tests, to be fair.

krupan 10 hours ago|||
It really depends what you are asking it. If the answer is in the training data, then the odds of it lying to you are much lower than if you are asking it for something it has never seen before.
majso 11 hours ago|||
maybe something like this? https://petergpt.github.io/bullshit-benchmark/viewer/index.v...
FergusArgyll 11 hours ago|||
As long as the model uses web search, they almost never hallucinate anymore. The fast models (haiku, gpt-instant, flash) still sometimes have the problem where they don't search before answering so they can hallucinate
goldenarm 11 hours ago||
I've seen chatGPT and Gemini hallucinate even from web search; it's better but not sufficient
yieldcrv 12 hours ago||
if last year's models were the ones people got familiar with in late 2022, hallucinations would be an underrepresented rumor, there would be no articles about it because it's so rare. overconfident lawyers wouldn't have messed up dockets in court with fake case law. in other domains that move faster, sources would be only partially outdated, with agentic search and mcp servers filling in the gaps

AI psychosis would be the problem people talk about more, not just outright agreement but subtle ways of making you feel confident in your ideas. "yes, buy that domain name buy these other ones for defensibility"

(the domain name is dumb and completely unmarketable)

jampekka 11 hours ago||
The models still hallucinate badly when called via APIs, especially if web search is not enabled. Gemini hallucinates quite frequently even with the app and search enabled. More recent (e.g. ChatGPT 5.x and Deepseek v4) prompts/harnesses search very aggressively, which does greatly mitigate hallucinations.
ElenaDaibunny 2 hours ago||
but latency in real GUI workflows with 50+ steps is still the elephant in the room for cloud-based agents
eis 12 hours ago||
3.5 Flash was more expensive than 3.1 Pro to run the Artificial Analysis test suite: $1551 for 3.5 Flash [0] vs $892 for 3.1 Pro [1]. That's 74% more cost while ranking lower. It's 2.5x as fast but I don't think the bang for the buck is there anymore like it was with 3.0 Flash. I'm a bit bummed out to be honest.

I did not expect such a huge (3x) price increase from 3.0 Flash and I bet many people will not just blindly upgrade as the value proposition is widely different.

One interesting point to note is that Google marked the model as Stable in contrast to nearly everything else being perpetually set as Preview.

[0] https://artificialanalysis.ai/models/gemini-3-5-flash [1] https://artificialanalysis.ai/models/gemini-3-1-pro-preview
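
For anyone checking the 74% figure, it follows directly from the two totals quoted above. The short calculation below uses only those two numbers; the variable names are just for illustration.

```python
# Totals quoted in the comment for running the benchmark suite, in USD.
flash_35_cost = 1551.0
pro_31_cost = 892.0

# Relative increase of 3.5 Flash over 3.1 Pro, as a percentage.
extra_pct = (flash_35_cost - pro_31_cost) / pro_31_cost * 100
# The same comparison expressed as a multiple.
ratio = flash_35_cost / pro_31_cost

print(f"{extra_pct:.1f}% more, {ratio:.2f}x the cost")
```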

hedora 5 hours ago||
Ouch. That's going in completely the wrong direction.

How many people complain that we have too much low quality AI output for humans to read, let alone evaluate vs. how many people are complaining that they want higher quality, more trustworthy output?

ekojs 11 hours ago|||
Seems like the only good thing about 3.5 Flash is its speed. Not cost-competitive or benchmark-leading by any means.
pingou 11 hours ago|||
How do they calculate that?

3.1 has 57M output tokens from Intelligence Index, 3.5 Flash has 73M, so not a lot more, and 3.5 is a bit cheaper, I don't get how 3.5 can be 74% more expensive.

knollimar 9 hours ago||
Only speculation but cache maybe?
ls_stats 11 hours ago|||
>3.5 Flash was more expensive than 3.1 Pro to run the Artificial Analysis test suite

That's everything I needed to know.

mijoharas 11 hours ago||
That's what I came here to check. Last model release they only put it into preview[0] at first.

Does that mean this model is production ready?

[0] https://news.ycombinator.com/item?id=47076484

jonnyasmar 7 hours ago||
The $1.50/$9.00 pricing is a meaningful shift if you've been running Gemini as the "fast iteration" half of a multi-model coding workflow. I've had Claude Code, Codex, and Gemini CLI running side by side and the working split was "Gemini for quick scaffolding and exploration where the cost of being wrong is low, Sonnet for correctness-critical stuff." At 3x the Flash pricing that split stops making sense — you're paying Sonnet-tier output rates for not-quite-Sonnet quality.

For pure chat that's annoying but tolerable. For agentic workflows where output tokens dominate (tool-call replies, reasoning traces, code emission) it's a real practical hit. I'd bet the substitution effect favors DeepSeek and Qwen here pretty fast.
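
To make the output-token point concrete, here is a back-of-the-envelope sketch. It assumes the $1.50/$9.00 figure means USD per million input and output tokens respectively, which the comment does not spell out, and the token counts are made-up workload numbers, not measurements.

```python
# Assumed interpretation of "$1.50/$9.00": USD per 1M input / output tokens.
INPUT_PRICE_PER_M = 1.50
OUTPUT_PRICE_PER_M = 9.00


def run_cost(input_tokens: int, output_tokens: int) -> tuple[float, float]:
    """Return (total cost in USD, fraction of that cost due to output tokens)."""
    input_cost = input_tokens / 1_000_000 * INPUT_PRICE_PER_M
    output_cost = output_tokens / 1_000_000 * OUTPUT_PRICE_PER_M
    total = input_cost + output_cost
    share = output_cost / total if total else 0.0
    return total, share


# Hypothetical agentic session: twice as many input tokens as output tokens.
total, output_share = run_cost(input_tokens=2_000_000, output_tokens=1_000_000)
print(f"total ${total:.2f}, output share {output_share:.0%}")
```

Under these assumptions, output still accounts for three quarters of the bill even with half as many output tokens as input tokens, which is why the output rate is the number that matters for this kind of workload.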

superchink 5 hours ago|
Out of curiosity, what was your workflow to generate this comment? I’m curious what model (claude?) and process (manual prompt with bullet points?) you used.
mixtureoftakes 12 hours ago||
benchmarks look REALLY good, the price hike is big but it also beats sonnet 4.6 in every discipline?
sigbeta 4 hours ago||
I am interested to see how they will serve demand with the TPU monopoly they have.
bredren 10 hours ago||
Can anyone who has extensive, recent experience with Claude Code and Codex contextualize the current Gemini CLI product experience?
mpalczewski 8 hours ago||
Gemini models have consistently disregarded rules and gone their own way for me. They will finish a task and get it done frequently way above the scope that you gave it, but they take a million shortcuts to get there. e.g. deciding the linter isn't important and disabling the pre commit hook. coding features you didn't ask for.
SwellJoe 10 hours ago|||
I have and use both Claude Code and Gemini CLI, and still don't consider Gemini worth starting for coding except to review Claude's output in critical commits (on a security boundary, maybe broad refactors, etc.), though I try side-by-side every now and then just to see the state of things. I also use Gemini Pro in a security scanning harness to act as a second set of eyes, but Opus is better at finding security bugs than Gemini, so I don't know that it's accomplishing anything beyond just using Opus.

Gemini Pro 3.1 for agentic coding is still clumsy. It chews a lot, has a harder time with tools and interacting with the codebase. I haven't tried any 3.5 version, yet, though. The benchmarks look promising.

I'll note I like the Google models' prose better than any others at the moment, though. Even the small open models (Gemma 4 family) have excellent prose, relatively speaking, that doesn't stink of the LLMisms that I find so annoying about OpenAI (especially) and Anthropic models. So, I'll probably start using Gemini for writing API docs, even if all code is Claude.

nicce 9 hours ago||
I would argue that prose is just a prompt issue. GPT 5.5 output is easier to control than Gemini by prompting. Having better defaults does not necessarily make it better.
SwellJoe 8 hours ago||
I would disagree. I think it'd take a lot of prompting to make GPT 5.5 not have the underlying personality of GPT, which I find awful. They have knobs in ChatGPT to choose a "professional" tone, which improves it somewhat, but even that is still the worst prose of any leading model.

My default AGENTS.md/CLAUDE.md/etc. is a few sentences from Strunk and White, to try to make all the models not suck at writing. It helps keep the models brief, but it doesn't actually make models with shitty prose have good prose. The relevant portion of my agents file is: "Omit needless words. Vigorous writing is concise. A sentence should contain no unnecessary words, a paragraph no unnecessary sentences, for the same reason that a drawing should have no unnecessary lines and a machine no unnecessary parts." Which might add up roughly the same as "be brief" in the weights, I don't know.

If you have a prompt that makes GPT a decent-to-good writer, I would like to see it.

Gemini produces decent-to-good prose without prompting, which improves if instructed to be concise. The other models, even the frontier models, do not have decent-to-good prose without prompting, and even with prompting, rarely elevate to what I would consider Good Enough. Part of this may be that GPT and Claude models get used a lot more heavily, and so I'm highly tuned into their idiosyncrasies. The heavy use of emojis, the click-bait headline style, etc. that they both use unprompted. All of that is repugnant to me, so anything that doesn't do all that by default, or at least not as aggressively, has a huge leg up.

bel8 7 hours ago||
My anecdote: smart but too stubborn to be useful.

I have been trying Gemini since 2.5 for coding.

It is the smartest for creative web stuff like HTML/CSS/JS.

But it has been very stubborn with following instructions like AGENTS.md.

And architecturally for large projects I tested, the code isn't on par with Opus 4.5+ and GPT 5.3+.

I would rather use DeepSeek 4 Flash on High (not max) than Gemini even if they had the same cost.

I currently use GPT 5.5 + DeepSeek 4 Flash.

BUT I didn't test Gemini 3.5 Flash yet. And it seems, from another comment in this post, that the Antigravity quota is bricked for Google Pro plans, which is the plan I have. So I don't have high hopes.

paperwork360 10 hours ago||
Google also updated Antigravity. Version 2.0 is geared more toward conversing with an agent. The previous VS Code-like IDE was much better.
operatingthetan 5 hours ago||
It's been renamed to "antigravity IDE." Updating my old IDE got me the new non-IDE app though, which is strange.
xnx 6 hours ago||
They still have an Antigravity IDE version.
mchusma 5 hours ago||
I have thought about this and I think overall, this was a disappointing release from Google. I'm not sure what the general sentiment is, but this feels like a miss.

What they did do in the keynote was spend a lot of time talking about their distribution advantage, and how they can own the consumer in search. But not a lot that will benefit partners or developers.

Basically, they released something broadly competitive with Sonnet 4.6 and a new Omni model that seems interesting, though it's still unclear. They have completely ceded the frontier to OpenAI / Anthropic, and are saying "look for pro next month".

The best release since nano banana pro from Google has been Gemma.
