Posted by oshrimpton 4 days ago
It seems like for agentic coding, just making sure the AI can find the relevant documentation to establish a ground truth is probably sufficient.
Note that I'm distinguishing here between hallucination of what you might call "free facts" and hallucination of material which deviates from what is in the context itself. The latter seems both a tractable problem and one which will improve coding agent functionality. But the former seems like its no longer on the critical path, probably because its hard.
Such a weird thing to start with. The legal status of Fable does not mean that it's not intelligent. If anything, the problem is the opposite, someone thinks it's too intelligent (and/or that Anthropic wouldn't share its last gen intelligent models on the terms the government demanded).
The article uses the example of GLM being smaller than DeepSeek, yet better on hallucinations as "smaller can be good too"
But the GLM family itself is scaling up fast: GLM-5.x family is 754B, double the previous generation of GLM-4.x
> comes within just 4 points of GPT-5.5 and 9 points of Fable 5
9 percentage points IS a big difference
And, of course, it was burning 10 times more tokens for this output.
Idk if it was the harness (OpenCode), my AGENT or my prompts, but I was getting exactly what I wanted, and quickly.
With GPT-5.5 it tries to play smart, takes much more times and is often stuck on basic stuff that DeepSeek solves oneshot.
You have any session logs or similar that shows this thing? Never once, since I started using the codex TUI when it became available, has GPT models gotten stuck on something another model breeze through, I quite literally run every prompt I do through multiple providers, this would be very visible very quickly for me.
I remember trying every -codex variant of the models and could never get them to be productive for tasks taking longer than 5-10 minutes, compared to GPT 5.5 which quite literally worked through the night day (with the /goal feature), and actually had something valuable and useful in the end this morning that wasn't exploding in LOC and complexity. I don't think any of the -codex variants would have been able to do this at all, based on how they worked when I last used them.