GPT-5.5 hallucinates 3x more than MIT-licensed GLM-5.2

Posted by oshrimpton 4 days ago

GPT-5.5 hallucinates 3x more than MIT-licensed GLM-5.2 (arrowtsx.dev)

577 points | 292 comments | page 2

nathan_compton 3 days ago|

Synthesizing a bunch of stuff I've read here lately, it seems like if OpenAI and Claude have actually found product-market fit (generating code), then the question of hallucination is going to get less attention in the future. If the real money is in code generation (where there is a relatively clear acceptance criterion of at least "it runs and does what I wanted as far as I can tell"), then there doesn't seem to be a lot of juice in pulling one's hair out over hallucination of facts.

It seems like for agentic coding, just making sure the AI can find the relevant documentation to establish a ground truth is probably sufficient.

Note that I'm distinguishing here between hallucination of what you might call "free facts" and hallucination of material which deviates from what is in the context itself. The latter seems both a tractable problem and one which will improve coding agent functionality. But the former seems like it's no longer on the critical path, probably because it's hard.

hyperpape 3 days ago||

> A shift is happening among major AI labs, who are becoming increasingly skeptical of endless parameter count and training data scaling. The limits of this paradigm were put on the world’s stage when Claude Fable 5 was restricted by the US government just three days after its release, marking the first US AI ban stemming from national security. One of the biggest models in the world was banned because a single jailbreak was too much of a risk.

Such a weird thing to start with. The legal status of Fable does not mean that it's not intelligent. If anything, the problem is the opposite, someone thinks it's too intelligent (and/or that Anthropic wouldn't share its last gen intelligent models on the terms the government demanded).

giancarlostoro 3 days ago||

I wonder if this is what a “Minimally Viable LLM” looks like. I often wonder how much of an LLM you need before you can just give it a bigger context window and feed it any dynamic knowledge content, like a PDF or markdown file, to give it knowledge outside of its training data. I feel like LLMs don’t need more data; they just need to be refined.

x3cca 3 days ago|

You might be interested in this model. It's densely trained on math, which lets it punch way higher than it should: https://github.com/WeiboAI/VibeThinker

giancarlostoro 2 days ago||

Can't open the link without an account. Is it private, or is that just GitHub being annoying?

cwillu 4 days ago||

Please don't editorialize titles unless the original title is misleading.

aubanel 4 days ago||

> Bigger is not better

The article uses the example of GLM being smaller than DeepSeek, yet better on hallucinations, as evidence that "smaller can be good too".

But the GLM family itself is scaling up fast: the GLM-5.x family is 754B, double the previous generation, GLM-4.x.

> comes within just 4 points of GPT-5.5 and 9 points of Fable 5

9 percentage points IS a big difference

CuriouslyC 4 days ago|

If we're hand-waving away the 9% difference between an open source model from a Chinese lab, which you can use a nearly unlimited amount for <100/mo, and the premier American frontier model, which is unavailable and was expensive when it was available, we've already lost.

wiether 4 days ago||

Purely anecdotal, but when OpenAI removed Codex-5.3 from the ChatGPT sub and forced me to move to GPT-5.5, the result was far worse than what I was enjoying with Codex.

And, of course, it was burning 10 times more tokens for this output.

fvv 4 days ago||

I have the opposite experience. With codex 5.3 I had to use 5.2 to design and 5.3-codex to execute, while 5.4 was better at both, and 5.5 (all used at xhigh) is even better.

oshrimpton 4 days ago||

Yeah, they are 100% in the wrong for removing the fine-tuned codex models. It makes sense why they wouldn't want to allocate so many resources towards fine-tuning, but still, the enshittification of GPT models is real.

embedding-shape 4 days ago|||

Huh, the fine-tuned "codex" variants always seemed like "quick specific edit" prototypes that weren't meant for real use. They worked OK when you were very specific, but besides that, nowhere close to GPT5.X and the other "real" models.

wiether 4 days ago||

Since Codex-5.3 came out it was my daily driver for everything: quick scripting, greenfield projects, new features on old projects...

Idk if it was the harness (OpenCode), my AGENT or my prompts, but I was getting exactly what I wanted, and quickly.

With GPT-5.5 it tries to play smart, takes much more time and is often stuck on basic stuff that DeepSeek solves in one shot.

embedding-shape 4 days ago||

> With GPT-5.5 it tries to play smart, takes much more time and is often stuck on basic stuff that DeepSeek solves in one shot.

Do you have any session logs or similar that show this? Never once, since I started using the codex TUI when it became available, have GPT models gotten stuck on something another model breezes through. I quite literally run every prompt through multiple providers, so this would be visible very quickly for me.

I remember trying every -codex variant of the models and could never get them to be productive for tasks taking longer than 5-10 minutes, compared to GPT 5.5, which quite literally worked through the night (with the /goal feature) and actually had something valuable and useful at the end this morning that wasn't exploding in LOC and complexity. I don't think any of the -codex variants would have been able to do this at all, based on how they worked when I last used them.

chazeon 3 days ago||

GPT-5.5 must have serious issues; it is fast, but quality-wise, it is just not good. It read one LaTeX paper (which is not long) and still managed to spell my name wrong. This is GPT-5.5-high.

EbNar 4 days ago|

The fact that a huge number of parameters may lead to worse hallucinations is something I didn't think of. Would this somewhat imply that DeepSeek V4 Flash should be less prone to these issues?

oshrimpton 4 days ago||

Surprisingly not! It is the biggest hallucinator on the AA Omniscience Index, just 2pp away from V4 Pro. I think this is partially due to the fact that Flash was trained on >32T tokens, just like Pro, despite being almost 10x smaller, so it seems somewhat likely it was overfit.
