GPT-5.5 hallucinates 3x more than MIT-licensed GLM-5.2

Posted by oshrimpton 4 days ago

GPT-5.5 hallucinates 3x more than MIT-licensed GLM-5.2 (arrowtsx.dev)

577 points | 292 comments | page 3

spwa4 4 days ago|

Why is everyone expecting LLMs to be like the Star Trek computer? I wonder if anyone's ever measured what the hallucination rate of a human is.

flexagoon 4 days ago||

Because AI company executives and devoted vibecoders constantly make egregious claims like "programming is fully solved" and even straight up "hallucinations don't exist on frontier models"

verdverm 3 days ago||

We don't have to listen to these people and can form our own perspectives. Following bad leaders is something to avoid

flexagoon 3 days ago||

I agree, but I was responding to the question of why people expect LLMs to be like the Star Trek computer, and the answer is "because people making and promoting those LLMs claim they are like that"

verdverm 3 days ago||

It is unclear if GP is referring to the global we or the HN we. I leaned towards the latter and injected our knowledge and understandings into the basis for my comment. HN recognizes what's going on

flexagoon 3 days ago||

> HN recognizes what's going on

Maybe the pre-2024 users do, but I've seen plenty of those exact "frontier models never hallucinate" comments on HN as well

__natty__ 3 days ago|||

Because this is how LinkedIn “specialists” promote LLMs. The same specialists were shouting about crypto a few years ago, then about NFTs, and now about how coding, architecture, accounting, law, medicine and basically every white-collar job is solved and you just need enough money to pay for Opus/GPT.

master-lincoln 4 days ago|||

Yeah, it has been looked at, e.g. in [0]. They separate that from lying, but I think for the LLM context it should be included. To me the difference is that humans do not bullshit at the same rate, and I can find out over time who tends to bullshit more and exclude that person's info from my pool.

> Why is everyone expecting LLMs to be like the Star Trek computer?

Because they are often marketed as magic AIs, not as mere language models.

[0] https://bpspsychub.onlinelibrary.wiley.com/doi/10.1111/bjso....

glouwbug 3 days ago|||

It’s not a lie if everyone collectively believes it

bravetraveler 4 days ago|||

Marketing, essentially

oshrimpton 4 days ago||

I would be so curious to find a comprehensive benchmark on this. Humans do have an unfortunate (ahem, Dunning-Kruger effect, ahem) tendency to do this

czk 3 days ago||

if you're benchmaxxing then maybe bigger doesn't always mean better, but for general intelligence and big model smell, that couldn't be further from the truth

the oss models are impressive but it's pretty clear how quickly they fall off when you try to use them outside of a narrow set of problems they benchmarked well on when compared to opus/5.5

raincole 4 days ago||

> meaning on questions that it couldn’t figure out, it only stated that it didn’t know around 6% of the time, and the rest it confidently hallucinated an answer.

From how they measure it, a model that simply answers "I don't know." to any prompt would be the one that hallucinates the least. So it's not surprising at all that a smaller model can perform better.
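As a minimal sketch of that scoring, assuming the metric is the share of not-correct responses that were confident wrong answers rather than abstentions (this is inferred from the quoted sentence; the benchmark's actual definition may differ, and the tallies below are made up for illustration):

```python
from dataclasses import dataclass


@dataclass
class Tally:
    """Hypothetical per-model counts; not taken from the benchmark."""
    correct: int
    abstained: int  # answered "I don't know"
    wrong: int      # confident but incorrect answer


def hallucination_rate(t: Tally) -> float:
    """Assumed metric: wrong answers as a share of all not-correct responses."""
    not_correct = t.abstained + t.wrong
    return t.wrong / not_correct if not_correct else 0.0


def accuracy(t: Tally) -> float:
    """Correct answers as a share of all questions."""
    total = t.correct + t.abstained + t.wrong
    return t.correct / total if total else 0.0


# Illustrative numbers only: one model that refuses everything, and one
# that abstains on 6% of the questions it cannot answer.
always_abstain = Tally(correct=0, abstained=100, wrong=0)
big_model = Tally(correct=50, abstained=3, wrong=47)

for name, tally in [("always_abstain", always_abstain), ("big_model", big_model)]:
    print(f"{name}: hallucination={hallucination_rate(tally):.2f} "
          f"accuracy={accuracy(tally):.2f}")
```

Under this assumed definition the always-abstain model scores a perfect 0% while answering nothing correctly, which is why the rate only means much when read next to accuracy.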

orbital-decay 3 days ago||

DS v4 is an undertrained snapshot, which is mentioned in their model card. The full version is supposed to be released later and have multimodal input. That said, hallucination rate likely depends on the training policy and different optimization tradeoffs a lot more than on the scale.

nghnam 3 days ago||

I’d be careful about reading too much into these numbers. The test only looks at cases where the model doesn’t know the answer, so it doesn’t show how often users will actually see hallucinations.
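To make that concrete, here is a rough sketch under the simplifying assumption that hallucinations only happen on questions the model does not know, so the rate a user sees is the conditional rate scaled by how often the model is out of its depth. The function name and both example models are hypothetical, not taken from the benchmark:

```python
def user_facing_rate(knows: float, halluc_given_unknown: float) -> float:
    """Fraction of all answers that are hallucinated.

    knows: probability the model actually knows the answer (0..1).
    halluc_given_unknown: probability it makes something up instead of
        abstaining, given that it does not know (0..1).
    Assumes no hallucinations on questions the model does know.
    """
    for p in (knows, halluc_given_unknown):
        if not 0.0 <= p <= 1.0:
            raise ValueError("probabilities must be between 0 and 1")
    return (1.0 - knows) * halluc_given_unknown


# Hypothetical models: A rarely abstains but rarely needs to;
# B abstains far more readily but is out of its depth far more often.
model_a = user_facing_rate(knows=0.95, halluc_given_unknown=0.94)
model_b = user_facing_rate(knows=0.60, halluc_given_unknown=0.30)

print(f"A: {model_a:.3f} of all answers hallucinated")
print(f"B: {model_b:.3f} of all answers hallucinated")
```

With these invented inputs, model A has the far worse conditional rate yet shows users fewer hallucinations overall, so the two numbers can rank models in opposite orders.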

EbNar 4 days ago||

The fact that a huge number of parameters may lead to worse hallucinations is something I didn't think of. Would this somewhat imply that DeepSeek V4 flash should be less prone to these issues?

verdverm 3 days ago|

small models cannot encode as many facts, so they will hallucinate more out of the box

a key method to help with hallucinations is to provide good sources when asking questions (context engineering / knowledge base)

gcanyon 3 days ago||

> it is clear that actual intelligence has plateaued significantly

N=1, but I disagree strongly. I'm writing a hard-science science fiction story, and the physics of it is at (and frankly, beyond) my skillset. The story's plot has had to change over a dozen times as I realized errors in my application of physics in the story.

Throughout, I've been reviewing the physics with LLMs, mainly Gemini 3.1 Pro Preview, but also with Claude and OpenAI. Often I have the LLMs debate each other -- "My friend [another model] said XYZ about the physics, is that right or wrong?" In almost all cases, Gemini explains why the other models are wrong, and when I send its explanation to them, they concede it is right and they are wrong.

As I said, I did the above checks literally dozens of times as I wrote the story. And everything was dialed in: no further issues claimed by anyone, me or the LLMs.

Not with Fable. I managed to get it to review the story while it was running, and it listed out something like ten issues: some minor, some general knowledge-based, and two that were impressive:

1. It pointed out where Gemini (and I, and other LLMs) had missed a , resulting in values about 152 times larger than they should have been. I sent that to Gemini and it fully conceded that it had been wrong all along.

2. It pointed out a simple inconsistency in the application of special relativity (I thought I had that at least dialed in, but no :-/ ) that affected a very specific plot point. The story is novella-length, about 28,000 words long, and this is a point that was mentioned in the first two pages, and then not again until the very last page. And it's obvious, once you realize it. And I missed it. Gemini missed it. Claude and ChatGPT missed it.

Only Fable found it. Again, N=1, but that was a remarkable run I got out of it in the couple days it was available.

Bolwin 3 days ago|

Hah, I noticed the same thing writing fiction with Fable. Most models seem to go into a sort of "storytelling mode" where they forget their PhD-level smarts. I had a character who is doing repairs on a satellite. Most models would give you a half-baked explanation with some technical terms, half of them right, half of them wrong.

Fable gave a description so deep that even I couldn't figure out what was going on and had to ask it to give me a simpler explanation.

gcanyon 3 days ago||

Nice to hear N=2. I'm really hoping Fable comes back soon.

In my case two people are making very-near-light-speed trips to a star 20-ish light years away. Originally, I had one leaving a month earlier and making the journey with a Lorentz factor of 40, while the protagonist takes the same trip at > 200.

The former experiences a trip of 6 months, the latter something like 25 days. And I wrote it as if that meant that the protagonist would get there months ahead. But both of them will take only hours to a day longer than light does, so the one who leaves a month earlier will get there almost a month before.
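A quick check of that arithmetic, as a minimal sketch assuming a constant speed for the whole trip (no acceleration or deceleration phases), a distance of exactly 20 light years, and a 30-day head start. Those three choices and the function name are assumptions for illustration. The code uses a Lorentz factor of 200 as the lower bound quoted; a 25-day onboard trip would correspond to a factor closer to 290 under the same assumptions.

```python
import math

DAYS_PER_YEAR = 365.25


def trip(distance_ly: float, gamma: float) -> tuple[float, float]:
    """Constant-speed trip as seen from the rest frame of the two stars.

    Returns (lag_days, proper_days):
      lag_days    - how much later than a light signal the ship arrives
      proper_days - trip duration experienced on board
    """
    beta = math.sqrt(1.0 - 1.0 / gamma**2)   # v / c
    rest_frame_years = distance_ly / beta    # trip time for outside observers
    lag_days = (rest_frame_years - distance_ly) * DAYS_PER_YEAR
    proper_days = rest_frame_years / gamma * DAYS_PER_YEAR  # time dilation
    return lag_days, proper_days


HEAD_START_DAYS = 30.0  # assumed length of "a month"

slow_lag, slow_proper = trip(20.0, gamma=40.0)
fast_lag, fast_proper = trip(20.0, gamma=200.0)

# Arrival times measured from the first departure, minus the shared
# light-travel time, which cancels out of the comparison.
arrival_gap_days = (HEAD_START_DAYS + fast_lag) - slow_lag

print(f"gamma=40:  lags light by {slow_lag:.2f} days, "
      f"{slow_proper:.0f} days on board")
print(f"gamma=200: lags light by {fast_lag * 24:.1f} hours, "
      f"{fast_proper:.0f} days on board")
print(f"first traveler arrives {arrival_gap_days:.1f} days ahead")
```

Under these assumptions the slower traveler trails light by roughly two days and the faster one by a couple of hours, so the month's head start survives almost intact even though the onboard clocks differ by months.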

That error sat in my manuscript for two months of back and forth with other models. Fable found it on the first go.

LMK if you want to trade manuscripts!

nextaccountic 4 days ago||

>GPT-5.5 and DeepSeek V4 Pro are two of the clearest hallucination leaders, despite being absolutely huge. Because of their immense size they simply did not learn how to say “I don’t know” or recognize intricate logical and technical fallacies. While it is true that a multi-trillion parameter model will always beat a lightweight consumer model on paper (today at least), the commoditization of these huge models is blurring the line between benchmark performance and actual real-world truthfulness and accuracy.

What about using two models, with a smaller model used for this kind of negative reasoning?

bastawhiz 4 days ago|

Now you need a third model to decide if the two other models disagree

firemelt 1 day ago||

my experience with GPT is that the model tends to mention files that don't even exist

stevenhubertron 3 days ago|

The more I have been using 5.2 the more I have been impressed with it. And I’ve just been using the usually neutered ollama version.
