
Posted by atgctg 12/11/2025

GPT-5.2 (openai.com)
https://platform.openai.com/docs/guides/latest-model

System card: https://cdn.openai.com/pdf/3a4153c8-c748-4b71-8e31-aecbde944...

1195 points | 1083 comments
svara 12/12/2025|
In my experience, the best models are already nearly as good as they can be for a large fraction of what I personally use them for, which is basically as a more efficient search engine.

The thing that would now make the biggest difference isn't "more intelligence", whatever that might mean, but better grounding.

It's still a big issue that the models will make up plausible sounding but wrong or misleading explanations for things, and verifying their claims ends up taking time. And if it's a topic you don't care about enough, you might just end up misinformed.

I think Google/Gemini realize this, since their "verify" feature is designed to address exactly this. Unfortunately it hasn't worked very well for me so far.

But to me it's very clear that the product that gets this right will be the one I use.

stacktrace 12/12/2025||
> It's still a big issue that the models will make up plausible sounding but wrong or misleading explanations for things, and verifying their claims ends up taking time. And if it's a topic you don't care about enough, you might just end up misinformed.

Exactly! One important thing LLMs have made me realise deeply is that "no information" is better than false information. The way LLMs pull out completely incorrect explanations baffles me - I suppose that's expected, since in the end it's generating tokens based on its training and it's reasonable that it might hallucinate some stuff, but knowing this doesn't ease any of my frustration.

IMO if LLMs need to focus on anything right now, it should be better grounding. Maybe even something like a probability/confidence score might make the experience so much better for many users like me.

biofox 12/12/2025|||
I ask for confidence scores in my custom instructions / prompts, and LLMs do surprisingly well at estimating their own knowledge most of the time.
EastLondonCoder 12/12/2025|||
I’m with the people pushing back on the “confidence scores” framing, but I think the deeper issue is that we’re still stuck in the wrong mental model.

It’s tempting to think of a language model as a shallow search engine that happens to output text, but that metaphor doesn’t actually match what’s happening under the hood. A model doesn’t “know” facts or measure uncertainty in a Bayesian sense. All it really does is traverse a high‑dimensional statistical manifold of language usage, trying to produce the most plausible continuation.

That’s why a confidence number that looks sensible can still be as made up as the underlying output, because both are just sequences of tokens tied to trained patterns, not anchored truth values. If you want truth, you want something that couples probability distributions to real world evidence sources and flags when it doesn’t have enough grounding to answer, ideally with explicit uncertainty, not hand‑waviness.

People talk about hallucination like it’s a bug that can be patched at the surface level. I think it’s actually a feature of the architecture we’re using: generating plausible continuations by design. You have to change the shape of the model or augment it with tooling that directly references verified knowledge sources before you get reliability that matters.

kznewman 12/12/2025|||
Solid agree. Hallucination for me IS the LLM use case. What I am looking for are ideas that may or may not be true that I have not considered and then I go try to find out which I can use and why.
sheeshe 12/12/2025||
In essence it is a thing that is actually promoting your own brain… seems counter intuitive but that’s how I believe this technology should be used.
tsunamifury 12/12/2025|||
This technology (which I had a small part in inventing) was not based on intelligently navigating the information space; it's fundamentally based on forecasting your own thoughts by weighting your pre-linguistic vectors and feeding them back to you. Attention layers, in conjunction with later layers, allowed that to be grouped at a higher order and to scan a wider beam space to reward higher-complexity answers.

When trained on chatting (a reflection system for your own thoughts) it mostly just uses a false mental model to pretend to be a separate intelligence.

Thus the term stochastic parrot (which for many of us is actually pretty useful).

sheeshe 12/12/2025|||
Thanks for your input - great to hear from someone involved that this is the direction of travel.

I remain highly skeptical of the idea that it will replace anyone - the biggest danger I see is people falling for the illusion that the thing is intrinsically smart when it’s not. It can be highly useful in the hands of disciplined people who know a particular area well and want to augment their productivity, no doubt. But the way we humans come up with ideas and so on is highly complex. Personally, my ideas come out of nowhere and are mostly derived from intuition that can only be expressed in logical statements ex post.

JAlexoid 12/12/2025||
Is intuition really that different from an LLM having little knowledge about something? It's just responding with the most likely sequence of tokens using the information most adjacent to the topic... just like your intuition.
sheeshe 12/12/2025||
With all due respect I’m not even going to give a proper response to this… intuition that yields great ideas is based on deep understanding. LLM’s exhibit no such thing.

These comparisons are becoming really annoying to read.

JAlexoid 12/15/2025||
I think you need to first understand what the word intuition means, before writing such a condescending reply.
sheeshe 12/12/2025|||
Meant to say prompting*
coldtea 12/12/2025||||
>A model doesn’t “know” facts or measure uncertainty in a Bayesian sense. All it really does is traverse a high‑dimensional statistical manifold of language usage, trying to produce the most plausible continuation.

And is that so different from what we do behind the scenes? Is there a difference between an actual fact and some false information stored in our brain? Or do both have the same representation in some kind of high‑dimensional statistical manifold in our brains, while we also "try to produce the most plausible continuation" using them?

There might be one major difference, at a different level: what we're fed (read, see, hear, etc.) we also evaluate before storing. Does LLM training do that, beyond some kind of manually assigned crude "confidence tiers" applied to input material during training (e.g. trust Wikipedia more than Reddit threads)?

literatepeople 12/12/2025||
I would say it's very different to what we do. Go to a friend and ask them a very niche question. Rather than lie to you, they'll tell you "I don't know the answer to that". Even if a human absorbed every single bit of information a language model has, their brain probably could not store and process it all. Unless they were a liar, they'd tell you they don't know the answer either! So I personally reject the framing that it's just like how a human behaves, because most of the people I know don't lie when they lack information.
coldtea 12/13/2025|||
>Go to a friend and ask them a very niche question. Rather than lie to you, they'll tell you "I don't know the answer to that"

Don't know about that, bullshitting is a thing. Especially online, where everybody pretends to be an expert on everything, and many even believe it.

But even if so, is that because of some fundamental difference between how a human and an LLM store/encode/retrieve information, or more because it has been instilled into a human through negative reinforcement (other people calling them out, shame of correction, even punishment, etc) not to make things up?

AuryGlenz 12/14/2025|||
I see you haven’t met my brother-in-law.
tsunamifury 12/12/2025||||
Hallucinations are a feature of reality that LLMs have inherited.

It’s amazing that experts like yourself who have a good grasp of the manifold MoE configuration don’t get that.

LLMs, much like humans, weight high dimensionality across the entire model manifold and then string together the best-weighted attentive answer.

Just like your doctor occasionally gives you wrong advice too quickly, this sometimes gets confused, either by lighting up too much of the manifold or by having insufficient expertise.

jakewins 12/12/2025|||
I asked Gemini the other day to research and summarise the pinout configuration for CANbus outputs on a list of hardware products, and to provide references for each. It came back with a table summarising pin outs for each of the eight products, and a URL reference for each.

Of the 8, 3 were wrong, and the references contained no information about pin outs whatsoever.

That kind of hallucination is, to me, entirely different than what a human researcher would ever do. They would say “for these three I couldn’t find pinouts” or perhaps misread a document and mix up pinouts from one model for another.. they wouldn’t make up pinouts and reference a document that had no such information in it.

Of course humans also imagine things, misremember etc, but what the LLMs are doing is something entirely different, is it not?

fspeech 12/12/2025|||
Humans are also not rewarded for making pronouncements all the time. Experts actually have a reputation to maintain and are likely more reluctant to give opinions that they are not reasonably sure of. LLMs trained on typical written narratives found in books, articles, etc. can be forgiven for thinking that they should have an opinion on anything and everything. Point being that while you may be able to tune it to behave some other way, you may find the new behavior less helpful.
JAlexoid 12/12/2025|||
Newer models can run a search and summarize the pages. They're becoming just a faster way of doing research, but they're still not as good as humans.
acdha 12/12/2025||||
> Hallucinations are a feature of reality that LLMs have inherited.

Huh? Are you arguing that we still live in a pre-scientific era where there’s no way to measure truth?

As a simple example, I asked Google about houseplant biology recently. The answer was very confidently wrong telling me that spider plants have a particular metabolic pathway because it confused them with jade plants and the two are often mentioned together. Humans wouldn’t make this mistake because they’d either know the answer or say that they don’t. LLMs do that constantly because they lack understanding and metacognitive abilities.

coldtea 12/12/2025||
>Huh? Are you arguing that we still live in a pre-scientific era where there’s no way to measure truth?

No. A strange way to interpret their statement! Almost as if you ...hallucinated their intent!

They are arguing that humans also hallucinate: "LLMs much like humans" (...) "Just like your doctor occasionally giving you wrong advice too quickly".

As an aside, there was never a "pre-scientific era where there [was] no way to measure truth". Prior to the rise of modern science fields, there have still always been objective ways to judge truth in all kinds of domains.

acdha 12/13/2025||
Yes, that’s basically the point: what are termed hallucinations with LLMs are different than what we see in humans – even the confabulations which people with severe mental disorders exhibit tend to have some kind of underlying order or structure to them. People detect inconsistencies in their own behavior and that of others, which is why even that rushed doctor in the original comment won’t suggest something wildly off the way LLMs do routinely - they might make a mistake or have incomplete information but they will suggest things which fit a theory based on their reasoning and understanding, which yields errors at a lower rate and different class.
freejazz 12/12/2025|||
> Hallucinations are a feature of reality that LLMs have inherited.

Really? When I search for cases on LexisNexis, it does not return made-up cases which do not actually exist.

coldtea 12/12/2025||
When you ask humans however there are all kinds of made-up "facts" they will tell you. Which is the point the parent makes (in the context of comparing to LLM), not whether some legal database has wrong cases.

Since your example comes from the legal field, you'll probably very well know that even well intentioned witnesses that don't actively try to lie, can still hallucinate all kinds of bullshit, and even be certain of it. Even for eye witnesses, you can ask 5 people and get several different incompatible descriptions of a scene or an attacker.

freejazz 12/13/2025||
>When you ask humans however there are all kinds of made-up "facts" they will tell you. Which is the point the parent makes (in the context of comparing to LLM), not whether some legal database has wrong cases.

Context matters. This is the context LLMs are being commercially pushed to me in. Legal databases also inherit from reality as they consist entirely of things from the real world.

airstrike 12/12/2025||||
It's not even a manifold https://arxiv.org/abs/2504.01002
wan23 12/13/2025||||
A different way to look at it is language models do know things, but the contents of their own knowledge is not one of those things.
paulddraper 12/12/2025||||
You have a subtle sleight of hand.

You use the word “plausible” instead of “correct.”

EastLondonCoder 12/12/2025|||
That’s deliberate. “Correct” implies anchoring to a truth function the model doesn’t have. “Plausible” is what it’s actually optimising for, and the disconnect between the two is where most of the surprises (and pitfalls) show up.

As someone else put it well: what an LLM does is confabulate stories. Some of them just happen to be true.

paulddraper 12/12/2025||
It absolutely has a correctness function.

That’s like saying linear regression produces plausible results. Which is true but derogatory.

MyOutfitIsVague 12/12/2025|||
Do you have a better word that describes "things that look correct without definitely being so"? I think "plausible" is the perfect word for that. It's not a sleight of hand to use a word that is exactly defined as the intention.
JAlexoid 12/12/2025|||
I mean... That is exactly how our memory works. So in a sense, the factually incorrect information coming from an LLM is as reliable as someone telling you things from memory.
dgacmu 12/12/2025||
But not really? If you ask me a question about Thai grammar or how to build a jet turbine, I'm going to tell you that I don't have a clue. I have more of a meta-cognitive map of my own manifold of knowledge than an LLM does.
JAlexoid 12/15/2025||
Try it out. Ask "Do you know who Emplabert Kloopermberg is?" and ChatGPT/Gemini literally responded with "I don't know".

You, on the other hand, truly have never encountered any information about Thai grammar or (surprisingly) how to build a jet turbine. (I can explain in general terms how to build one just from watching the Discovery Channel.)

The difference is that the models actually have some information on those topics.

drclau 12/12/2025||||
How do you know the confidence scores are not hallucinated as well?
kiliankoe 12/12/2025|||
They are, the model has no inherent knowledge about its confidence levels, it just adds plausible-sounding numbers. Obviously they _can_ be plausible, but trusting these is just another level up from trusting the original output.

I read a comment here a few weeks back that LLMs always hallucinate, but we sometimes get lucky when the hallucinations match up with reality. I've been thinking about that a lot lately.

TeMPOraL 12/12/2025|||
> the model has no inherent knowledge about its confidence levels

Kind of. See e.g. https://openreview.net/forum?id=mbu8EEnp3a, but I think it was established already a year ago that LLMs tend to have identifiable internal confidence signal; the challenge around the time of DeepSeek-R1 release was to, through training, connect that signal to tool use activation, so it does a search if it "feels unsure".
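As a crude, purely external proxy (not the internal-representation probing that paper studies), you can at least look at token log-probabilities, which most chat APIs expose. A minimal sketch, assuming the OpenAI Python SDK and a placeholder model name:

    import math
    from openai import OpenAI

    client = OpenAI()
    resp = client.chat.completions.create(
        model="gpt-4.1-mini",   # placeholder; any model that returns logprobs
        messages=[{"role": "user", "content": "What sport did Michael Jordan play? One word."}],
        logprobs=True,
        max_tokens=5,
    )

    # Low per-token probabilities are a noisy hint that the model is less sure.
    for tok in resp.choices[0].logprobs.content:
        print(f"{tok.token!r}: p={math.exp(tok.logprob):.3f}")

It's far from the internal "feels unsure" signal, but it's the simplest thing you can inspect today without training anything.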

losvedir 12/12/2025||
Wow, that's a really interesting paper. That's the kind of thing that makes me feel there's a lot more research to be done "around" LLMs and how they work, and that there's still a fair bit of improvement to be found.
fragmede 12/12/2025|||
In science, before LLMs, there's this saying: all models are wrong, some are useful. We model, say, gravity as 9.8m/s² on Earth, knowing full well that it doesn't hold true across the universe, and we're able to build things on top of that foundation. Whether that foundation is made of bricks, or is made of sand, for LLMs, is for us to decide.
xhkkffbf 12/12/2025||
It doesn't hold true across the universe? I thought this was one of the more universal things like the speed of light.
procflora 12/12/2025|||
G, the gravitational constant is (as far as we know) universal. I don't think this is what they meant, but the use of "across the universe" in the parent comment is confusing.

g, the net acceleration from gravity and the Earth's rotation is what is 9.8m/s² at the surface, on average. It varies slightly with location and altitude (less than 1% for anywhere on the surface IIRC), so "it's 9.8 everywhere" is the model that's wrong but good enough a lot of the time.

fragmede 12/13/2025||
It doesn't even hold true on Earth! Nevermind other planets being of different sizes making that number change, that equation doesn't account for the atmosphere and air resistance from that. If we drop a feather that isn't crumpled up, it'll float down gently at anything but 9.8m/s². In sports, air resistance of different balls is enough that how fast something drops is also not exactly 9.8m/s², which is why peak athlete skills often don't transfer between sports. So, as a model, when we ignore air resistance it's good enough, a lot of the time, but sometimes it's not a good model because we do need to care about air resistance.
hackeman300 12/12/2025|||
Gravity isn't 9.8m/s/s across the universe. If you're at higher or lower elevations (or outside the Earth's gravitational pull entirely), the acceleration will be different.

Their point was the 9.8 model is good enough for most things on Earth, the model doesn't need to be perfect across the universe to be useful.

JAlexoid 12/12/2025||
g (lower case) is literally the gravitational acceleration at Earth's surface. It's universally true, as there's only one Earth in this universe.

G is the gravitational constant, which is also universally true (erm... to the best of our knowledge); g is calculated using the gravitational constant.
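For concreteness, a minimal sketch of that calculation (rounded constants; ignores rotation, altitude and local density variations):

    # Deriving g at Earth's surface from the universal constant G.
    G = 6.674e-11      # gravitational constant, m^3 kg^-1 s^-2
    M = 5.972e24       # Earth's mass, kg
    R = 6.371e6        # Earth's mean radius, m

    g = G * M / R**2   # Newtonian surface gravity
    print(f"g ≈ {g:.2f} m/s²")   # ≈ 9.82 m/s²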

dfsegoat 12/12/2025|||
they 100% are unless you provide a RUBRIC / basically make it ordinal.

"Return a score of 0.0 if ...., Return a score of 0.5 if .... , Return a score of 1.0 if ..."

ryoshu 12/12/2025||||
LLMs fail at causal accuracy. It's a fundamental problem with how they work.
kromokromo 12/13/2025|||
Asking an LLM to give itself a «confidence score» is like asking a teenager to grade his own exam. LLMs don't «feel» uncertainty and confidence like we do.
robocat 12/12/2025||||
> wrong or misleading explanations

Exactly the same issue occurs with search.

Unfortunately not everybody knows to mistrust AI responses, or has the skills to double-check information.

darkwater 12/12/2025|||
No, it's not the same. Search results send/show you one or more specific pages/websites. And each website has a different trust factor. Yes, plenty of people repeat things they "read on the Internet" as truths, but it's easy to debunk some of them just based on the site reputation. With AI responses, the reputation is shared with the good answers as well, because they do give good answers most of the time, but also hallucinate errors.
SebastianSosa1 12/12/2025||
Community notes on X seems to be one of the highest profile recent experiments trying to address this issue
dexterlagan 12/12/2025||
My attempt: https://www.cleverthinkingsoftware.com/truth-or-extinction/
darkwater 12/12/2025||
> Tools like SourceFinder must be paired with education — teaching people how to trace information themselves, to ask: Where did this come from? Who benefits if I believe it?

These are very important and relevant questions to ask oneself when reading about anything, but we should also keep in mind that even those questions can be misused and can drive you toward conspiracy theories.

incrudible 12/12/2025||||
If somebody asks a question on Stackoverflow, it is unlikely that a human who does not know the answer will take time out of their day to completely fabricate a plausible sounding answer.
jaxn 12/12/2025|||
People are confidently incorrect all the time. It is very likely that people will make up plausible sounding answers on StackOverflow.

You and I have both taken time out of our days to write plausible sounding answers that are essentially opposing hallucinations.

linen 12/12/2025||
Sites like stackoverflow are inherently peer-reviewed, though; they've got a crowdsourced voting system and comments that accumulate over time. People test the ideas in question.

This whole "people are just as incorrect as LLMs" is a poor argument, because it compares the single human and the single LLM response in a vacuum. When you put enough humans together on the internet you usually get a more meaningful result.

balder1991 12/12/2025||||
At least it used to be true.
JAlexoid 12/12/2025|||
Have you ever heard of the Dunning-Kruger effect?

There's a reason why there are upvotes, accepted solutions, and a third-party edit system on StackOverflow - people will spend time writing their "hallucinations" very confidently.

lins1909 12/12/2025||||
What is it about people making up lies to defend LLMs? In what world is it exactly the same as search? They're literally different things, since you get information from multiple sources and can do your own filtering.
actionfromafar 12/12/2025||||
I wonder if the only way to fix this with current LLMs would be to generate a lot of synthetic data for a select number of topics you really don't want it to "go off the rails" with. That synthetic data would be lots of variations on "I don't know how to do X with Y".
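A minimal sketch of what I mean, with made-up topics and phrasing templates (nothing here is a real fine-tuning recipe, just the shape of the data):

    import json
    import random

    # Hypothetical topics where the model should decline instead of confabulating.
    topics = ["pinouts for obscure hardware", "laws of very small countries", "specs of niche products"]

    # Refusal-style completions, varied so the behaviour hopefully generalizes.
    templates = [
        "I don't know how to answer that about {t}.",
        "I can't find reliable information on {t}; please check a primary source.",
        "I'm not confident enough about {t} to give you specifics.",
    ]

    # Build prompt/completion pairs in a generic fine-tuning-ish format.
    pairs = [
        {"prompt": f"Give me the full details on {t}.",
         "completion": random.choice(templates).format(t=t)}
        for t in topics
    ]
    print(json.dumps(pairs, indent=2))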
dolmen 12/13/2025||
I would not bet on synthetic data.

LLMs are very good at detecting patterns.

RHSman2 12/12/2025||||
The problem is not the intelligence of the LLM. It is the intelligence, and the desire to make things easy, of the person using it.
XCSme 12/12/2025||||
But most benchmarks are not about that...

Are there even any "hallucination" public benchmarks?

andrepd 12/12/2025||
"Benchmarks" for LLMs are a total hoax, since you can train them on the benchmarks themselves.
XCSme 12/12/2025||
I would assume a good benchmark has hidden tests, or something randomly generated that is harder to game
basisword 12/12/2025|||
I think the thing even worse than false information is the almost-correct information. You do a quick Google to confirm it's on the right page but find there's an important misunderstanding. These are so much harder to spot I think than the blatantly false.
fauigerzigerk 12/12/2025|||
I agree, but the question is how better grounding can be achieved without a major research breakthrough.

I believe the real issue is that LLMs are still so bad at reasoning. In my experience, the worst hallucinations occur where only a handful of sources exist for some set of facts (e.g. laws of small countries or descriptions of niche products).

LLMs know these sources and they refer to them but they are interpreting them incorrectly. They are incapable of focusing on the semantics of one specific page because they get "distracted" by their pattern matching nature.

Now people will say that this is unavoidable given the way in which transformers work. And this is true.

But shouldn't it be possible to include some measure of data sparsity in the training so that models know when they don't know enough? That would enable them to boost the weight of the context (including sources they find through inference-time search/RAG) relative to their pretraining.

balder1991 12/12/2025||
Anything that is very specific has the same problem, because LLMs can’t have the same representation of all topics in the training. It doesn’t have to be too niche, just specific enough for it to start to fabricate it.

The other day I had a question about how pointers work in Swift and tried discussing it with ChatGPT (I don’t remember exactly what, but it was purely intellectual curiosity). It gave me a lot of explanations that seemed correct, but, being skeptical, I started pushing it for ways to confirm what it was saying and eventually realized it was all bullshit.

This kind of thing makes me basically wary of using LLMs for anything that isn’t brainstorming, because anything that requires knowing information that isn’t easily/plentifully found online will likely be incorrect or have sprinkles of incorrect all over the explanations.

cachius 12/12/2025|||
Grounding in search results is what Perplexity pioneered; Google now does it with AI Mode, and ChatGPT and others with their web search tools.

As a user I want it, but as a webadmin it kills dynamic pages, and that's why proof-of-work (aka CPU-time) captchas like Anubis https://github.com/TecharoHQ/anubis#user-content-anubis or BotID https://vercel.com/docs/botid are now everywhere. If only these AI crawlers did some caching, but no, they just go and overrun the web - to the point that they can't anymore, at the price of shutting down small sites and making life worse for everyone, just for a few months of rapacious crawling. Perplexity literally moved fast and broke things.

cachius 12/12/2025||
This dance to get access is just a minor annoyance for me, but I question how it proves I’m not a bot. These steps can be trivially and cheaply automated.

I think the end result is just an internet resource I need is a little harder to access, and we have to waste a small amount of energy.

From Tavis Ormandy, who wrote a C program to solve the Anubis challenges outside the browser: https://lock.cmpxchg8b.com/anubis.html via https://news.ycombinator.com/item?id=45787775

I guess a mix of Markov tarpits and LLM meta-instructions will be added next, cf. Feed the bots https://news.ycombinator.com/item?id=45711094 and Nephentes https://news.ycombinator.com/item?id=42725147

BatteryMountain 12/12/2025|||
My biggest problem with LLMs at this point is that they produce different and inconsistent results, or behave differently, given the same prompt. Better grounding would be amazing at this point. I want to give an LLM the same prompt on different days and be able to trust that it will do the same thing as yesterday. Currently they misbehave multiple times a week and I have to manually steer them a bit, which destroys certain automated workflows completely.
fragmede 12/12/2025|||
It sounds like you have dug into this problem with some depth so I would love to hear more. When you've tried to automate things, I'm guessing you've got a template and then some data and then the same or similar input gives totally different results? What details about how different the results are can you share? Are you asking for eg JSON output and it totally isn't, or is it a more subtle difference perhaps?
conception 12/12/2025||||
You need to change the temperature to 0 and tune your prompts for automated workflows.
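For reference, a minimal sketch of what that looks like, assuming the OpenAI Python SDK (model name and prompts are placeholders; this reduces variance but does not guarantee bit-identical outputs, and some newer reasoning models ignore or reject temperature):

    from openai import OpenAI

    client = OpenAI()

    # Pin sampling down as far as the API allows for a repeatable workflow.
    resp = client.chat.completions.create(
        model="gpt-4.1-mini",    # placeholder model name
        temperature=0,           # greedy-ish decoding
        seed=42,                 # best-effort determinism, not guaranteed
        messages=[
            {"role": "system", "content": "Extract the invoice total as plain JSON."},
            {"role": "user", "content": "Invoice #123 ... Total due: 1,234.56 EUR"},
        ],
    )
    print(resp.choices[0].message.content)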
balder1991 12/12/2025|||
It doesn’t really solve it as a slight shift in the prompt can have totally unpredictable results anyway. And if your prompt is always exactly the same, you’d just cache it and bypass the LLM anyway.

What would really be useful is a very similar prompt should always give a very very similar result.

jknightco 12/12/2025|||
This doesn't work with the current architecture, because we have to introduce some element of stochastic noise into the generation or else they're not "creatively" generative.

Your brain doesn't have this problem because the noise is already present. You, as an actual thinking being, are able to override the noise and say "no, this is false." An LLM doesn't have that capability.

sheeshe 12/12/2025||
Well that’s because if you look at the structure of the brain there’s a lot more going on than what goes on within an LLM.

It’s the same reason why great ideas almost appear to come randomly - something is happening in the background. Underneath the skin.

tsunamifury 12/12/2025|||
That’s a way different problem my guy.
dominotw 12/13/2025|||
Have you tried this? It doesn't work because of the way inference runs at big companies - it's not just running your query in isolation.

Maybe it can work if you are running your own inference.

sebastiennight 12/12/2025|||
> I want to give an LLM the same prompt on different days and I want to be able to trust that it will do the same thing as yesterday

Bad news, it's winter now in the Northern hemisphere, so expect all of our AIs to get slightly less performant as they emulate humans under-performing until Spring.

phorkyas82 12/12/2025|||
Isn't that what no LLM can provide: being free of hallucinations?
arw0n 12/12/2025|||
I think the better word is confabulation; fabricating plausible but false narratives based on wrong memory. Fundamentally, these models try to produce plausible text. With language models getting large, they start creating internal world models, and some research shows they actually have truth dimensions. [0]

I'm not an expert on the topic, but to me it sounds plausible that a good part of the problem of confabulation comes down to misaligned incentives. These models are trained hard to be a 'helpful assistant', and this might conflict with telling the truth.

Being free of hallucinations is a bit too high a bar to set anyway. Humans are extremely prone to confabulations as well, as can be seen by how unreliable eye witness reports tend to be. We usually get by through efficient tool calling (looking shit up), and some of us through expressing doubt about our own capabilities (critical thinking).

[0] https://arxiv.org/abs/2407.12831

Tepix 12/12/2025|||
> false narratives based on wrong memory

I don't think "wrong memory" is accurate, it's missing information and doesn't know it or is trained not to admit it.

Checkout the Dwarkesh Podcast episode https://www.dwarkesh.com/p/sholto-trenton-2 starting at 1:45:38

Here is the relevant quote by Trenton Bricken from the transcript:

One example I didn't talk about before with how the model retrieves facts: So you say, "What sport did Michael Jordan play?" And not only can you see it hop from like Michael Jordan to basketball and answer basketball. But the model also has an awareness of when it doesn't know the answer to a fact. And so, by default, it will actually say, "I don't know the answer to this question." But if it sees something that it does know the answer to, it will inhibit the "I don't know" circuit and then reply with the circuit that it actually has the answer to. So, for example, if you ask it, "Who is Michael Batkin?" —which is just a made-up fictional person— it will by default just say, "I don't know." It's only with Michael Jordan or someone else that it will then inhibit the "I don't know" circuit.

But what's really interesting here and where you can start making downstream predictions or reasoning about the model, is that the "I don't know" circuit is only on the name of the person. And so, in the paper we also ask it, "What paper did Andrej Karpathy write?" And so it recognizes the name Andrej Karpathy, because he's sufficiently famous, so that turns off the "I don't know" reply. But then when it comes time for the model to say what paper it worked on, it doesn't actually know any of his papers, and so then it needs to make something up. And so you can see different components and different circuits all interacting at the same time to lead to this final answer.

BoredPositron 12/12/2025||
Architecture-wise, the "admit" part is impossible.
rbranson 12/12/2025|||
Bricken isn’t just making this up. He’s one of the leading researchers in model interpretability. See: https://arxiv.org/abs/2411.14257
Tepix 12/12/2025|||
Why do you think it's impossible? I just quoted him saying 'by default, it will actually say, "I don't know the answer to this question"'

We already see that - given the right prompting - we can get LLMs to say more often that they don't know things.

svara 12/12/2025||||
That's right - it does seem to have to do with trying to be helpful.

One demo of this that reliably works for me:

Write a draft of something and ask the LLM to find the errors.

Correct the errors, repeat.

It will never stop finding a list of errors!

The first time around and maybe the second it will be helpful, but after you've fixed the obvious things, it will start complaining about things that are perfectly fine, just to satisfy your request of finding errors.

thunky 12/13/2025||
> It will never stop finding a list of errors!

Not my experience. I find after a couple of rounds it tells me it's perfect.

officialchicken 12/12/2025|||
No, the correct word is hallucinating. That's the word everyone uses and has been using. While it might not be technically correct, everyone knows what it means and more importantly, it's not a $3 word and everyone can relate to the concept. I also prefer all the _other_ more accurate alternative words Wikipedia offers to describe it:

"In the field of artificial intelligence (AI), a hallucination or artificial hallucination (also called bullshitting,[1][2] confabulation,[3] or delusion[4]) is"

kyletns 12/12/2025||||
For the record, brains are also not free of hallucinations.
rimeice 12/12/2025|||
I still don’t really get this argument/excuse for why it’s acceptable that LLMs hallucinate. These tools are meant to support us, but we end up with two parties who are, as you say, prone to “hallucination” and it becomes a situation of the blind leading the blind. Ideally in these scenarios there’s at least one party with a definitive or deterministic view so the other party (i.e. us) at least has some trust in the information they’re receiving and any decisions they make off the back of it.
TeMPOraL 12/12/2025|||
For these types of problems (i.e. most problems in the real world), the "definitive or deterministic" isn't really possible. An unreliable party you can throw at the problem from a hundred thousand directions simultaneously and for cheap, is still useful.
Libidinalecon 12/12/2025||||
"The airplane wing broke and fell off during flight"

"Well humans break their leg too!"

It is just a mindlessly stupid response and a giant category error.

An airplane wing and a human limb are not at all in the same category.

There is even another layer to this: comparing LLMs to the brain might be wrong, because the mereological fallacy is attributing "thinking" to the brain when it is the person/system as a whole that thinks.

johnisgood 12/12/2025||
You are right that the wing/leg comparison is often lazy rhetoric: we hold engineered systems to different failure standards for good reason.

But you are misusing the mereological fallacy. It does not dismiss LLM/brain comparisons: it actually strengthens them. If the brain does not "think" (the person does), then LLMs do not "think" either. Both are subsystems in larger systems. That is not a category error; it is a structural similarity.

This does not excuse LLM limitations - rimeice's concern about two unreliable parties is valid. But dismissing comparisons as "category errors" without examining which properties are being compared is just as lazy as the wing/leg response.

ssl-3 12/12/2025||||
Have you ever employed anyone?

People, when tasked with a job, often get it right. I've been blessed by working with many great people who really do an amazing job of generally succeeding to get things right -- or at least, right-enough.

But in any line of work: Sometimes people fuck it up. Sometimes, they forget important steps. Sometimes, they're sure they did it one way when instead they did it some other way and fix it themselves. Sometimes, they even say they did the job and did it as-prescribed and actually believe themselves, when they've done neither -- and they're perplexed when they're shown this. They "hallucinate" and do dumb things for reasons that aren't real.

And sometimes, they just make shit up and lie. They know they're lying and they lie anyway, doubling-down over and over again.

Sometimes they even go all spastic and deliberately throw monkey wrenches into the works, just because they feel something that makes them think that this kind of willfully-destructive action benefits them.

All employees suck some of the time. They each have their own issues. And all employees are expensive to hire, and expensive to fire, and expensive to keep going. But some of their outputs are useful, so we employ people anyway. (And we're human; even the very best of us are going to make mistakes.)

LLMs are not so different in this way, as a general construct. They can get things right. They can also make shit up. They can skip steps. The can lie, and double-down on those lies. They hallucinate.

LLMs suck. All of them. They all fucking suck. They aren't even good at sucking, and they persist at doing it anyway.

(But some of their outputs are useful, and LLMs generally cost a lot less to make use of than people do, so here we are.)

vitorfblima 12/12/2025|||
I don’t get the comparison. It would be like saying it’s okay if an Excel formula gives me different outcomes every time with the same arguments, sometimes right, but mostly wrong.
ssl-3 12/12/2025||
People can accomplish useful things, but sometimes make mistakes and do shit wrong.

The bot can also accomplish useful things, and sometimes make mistakes and do shit wrong.

(These two statements are more similar in their truthiness than they are different.)

tsunamifury 12/12/2025|||
As far as I can tell (as someone who worked on the early foundation of this tech at Google for 10 years) making up “shit” then using your force of will to make it true is a huge part of the construction of reality with intelligence.

Will to reality through forecasting possible worlds is one of our two primary functions.

andrei_says_ 12/12/2025||||
How much do you hallucinate at work? How many of your work hallucinations do you confidently present as reality in communication or code?

LLMs are being sold as viable replacement of paid employees.

If they were not, they wouldn’t be funded the way they are.

delaminator 12/12/2025||||
That’s not a very useful observation though is it?

The purpose of mechanisation is to standardise and over the long term reduce errors to zero.

Otoh “The final truth is there is no truth”

michaelscott 12/12/2025||
A lot of mechanisation, especially in the modern world, is not deterministic and is not always 100% right; it's a fundamental "physics at scale" issue, not something new to LLMs. I think what happened when they first appeared was that people immediately clung to a superintelligence-type AI idea of what LLMs were supposed to do, then realised that's not what they are, then kept going and swung all the way over to "these things aren't good at anything really" or "if they only fix this ONE issue I have with them, they'll actually be useful"
delaminator 12/12/2025||
That's why I said tend to zero error. I'm a Six Sigma guy. We take accurate over precise.
krzyk 12/12/2025|||
Hallucinations are not bad. They add some kind of creativity, which is good for e.g. image generation, coding, or storytelling.

They are only bad in the case of reporting facts.

svara 12/12/2025||||
Yes, they'll probably not go away, but it's got to be possible to handle them better.

Gemini (the app) has a "mitigation" feature where it tries to do Google searches to support its statements. That doesn't currently work properly in my experience.

It also seems to be doing something where it adds references to statements (With a separate model? With a second pass over the output? Not sure how that works.). That works well where it adds them, but it often doesn't do it.

intended 12/12/2025|||
Doubt it. I suspect it’s fundamentally not possible in the spirit you intend it.

Reality is perfectly fine with deception and inaccuracy. For language to magically be self constraining enough to only make verified statements is… impossible.

svara 12/12/2025||
Take a look at the new experimental AI mode in Google scholar, it's going in the right direction.

It might be true that a fundamental solution to this issue is not possible without a major breakthrough, but I'm sure you can get pretty far with better tooling that surfaces relevant sources, and that would make a huge difference.

intended 12/12/2025||
So let's run it through the rubric test:

What’s your level of expertise in this domain or subject? How did you use it? What were your results?

It’s basically gauging expertise vs usage to pin down the variance that seems endemic to LLM utility anecdotes/examples. For code examples I also ask which language was used, the submitter's familiarity with the language, their seniority/experience, and familiarity with the domain.

svara 12/12/2025||
A lot of words to call me stupid ;) You seem to have put me in some convenient mental box of yours, I don't know which one.
intended 12/12/2025||
Oh heck no! Definitely no!

I am genuinely asking, because I think one of the biggest determinants of utility obtained from LLMs is the operator.

Damn, I didn’t consider that it could be read that way. I am sorry for how it came across.

SecretDreams 12/12/2025||||
Find me a human that doesn't occasionally talk out of their ass =[
svara 12/12/2025|||
A part of it is reproducing incorrect information in the training data as well.

One area that I've found to be a great example of this is sports science.

Depending on how you ask, you can get a response lifted from scientific literature, or the bro science one, even in the course of the same discussion.

It makes sense, both have answers to similar questions and are very commonly repeated online.

sebastiennight 12/12/2025|||
> It's still a big issue that the models will make up plausible sounding but wrong or misleading explanations for things,

Due to how LLMs are implemented, you are always most likely to get a bogus explanation if you ask for an answer first, and why second.

A useful mental model is: imagine if I presented you with a potential new recruit's complete data (resume, job history, recordings of the job interview, everything) but you only had 1 second to tell me "hired: YES OR NO"

And then, AFTER you answered that, I gave you 50 pages worth of space to tell me why your decision is right. You can't go back on that decision, so all you can do is justify it however you can.

Do you see how this would give radically different outcomes vs. giving you the 50-page scratchpad first to think things through, and then only giving me a YES/NO answer?
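To make the ordering concrete, here's a toy sketch of the two prompt shapes (the SDK usage, model name, helper, and prompt wording are all illustrative assumptions; only the ordering of verdict and reasoning matters):

    from openai import OpenAI

    client = OpenAI()

    def ask(instruction: str, case: str) -> str:
        resp = client.chat.completions.create(
            model="gpt-4.1-mini",   # placeholder model name
            messages=[
                {"role": "system", "content": instruction},
                {"role": "user", "content": case},
            ],
        )
        return resp.choices[0].message.content

    case = "Candidate: 6 years of backend experience, weak system-design interview."

    # Answer-first: the verdict is committed before any reasoning tokens exist,
    # so everything after it can only justify the snap decision.
    answer_first = ask("Reply HIRE or NO-HIRE as your very first word, then justify.", case)

    # Reason-first: the model writes its working notes before committing to a verdict.
    reason_first = ask("Weigh the evidence step by step, and only then end with HIRE or NO-HIRE.", case)

    print(answer_first)
    print(reason_first)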

jillesvangurp 12/12/2025|||
It's increasingly a space that is constrained by the tools and integrations. Models provide a lot of raw capability. But with the right tools even the simpler, less capable models become useful.

Mostly we're not trying to win a nobel prize, develop some insanely difficult algorithm, or solve some silly leetcode problem. Instead we're doing relatively simple things. Some of those things are very repetitive as well. Our core job as programmers is automating things that are repetitive. That always was our job. Using AI models to do boring repetitive things is a smart use of time. But it's nothing new. There's a long history of productivity increasing tools that take boring repetitive stuff away. Compilation used to be a manual process that involved creating stacks of punch cards. That's what the first automated compilers produced as output: stacks of punch cards. Producing and stacking punchcards is not a fun job. It's very repetitive work. Compilers used to be people compiling punchcards. Women mostly, actually. Because it was considered relatively low skilled work. Even though it arguably wasn't.

Some people are very unhappy that the easier parts of their job are being automated, and they are worried that they will get automated away completely. That's only true if you exclusively do boring, repetitive, low-value work. Then yes, your job is at risk. If your work is a mix of that and some higher-value, non-repetitive, more fun stuff, your life could get a lot more interesting, because you get to automate away all the boring and repetitive stuff and spend more time on the fun stuff. I'm a CTO. I've had lots of fun lately. Entire new side projects that I had no time for previously I can now just pull off in a few spare hours.

Ironically, a lot of people currently get the worst of both worlds, because they now find themselves babysitting AIs doing a lot more of the boring repetitive stuff than they could do without them, to the point where that is actually all they do. It's still boring and repetitive. And it should ultimately be automated away - arguably many years ago, actually. The reason so many React projects feel like Groundhog Day is that they are very repetitive. You need a login screen, and a cookies screen, and a settings screen, etc. Just like the last 50 projects you did. Why are you rebuilding those things from scratch? Manually? These are valid questions to ask yourself if you are a frontend programmer. And now you have AI to do that for you.

Find something fun and valuable to work on and AI gets a lot more fun because it gives you more quality time with the fun stuff. AI is about doing more with less. About raising the ambition level.

giancarlostoro 12/12/2025|||
Yeah, in my case I want the coding models to be less stupid. I asked for multiple file uploading; it kept the original button and added a second one for additional files. When I pointed that out: “You're absolutely correct!” Well, why didn't you think of it before you cranked out the code? I see coding agents as really capable junior devs, it's really funny. I don't mind it though - it saved me hours on my side project, if not weeks worth of work.
withinboredom 12/12/2025|||
I was using an LLM to summarize benchmarks for me, and I realized after a while it was omitting information that made the algorithm being benchmarked look bad. I'm glad I caught it early, before I went to my peers and was like "look at this amazing algorithm".
coffeecat 12/12/2025||
It's important not to assume that LLMs are giving you an impartial perspective on any given topic. The perspective you're most likely getting is that of whoever created the most training data related to that topic.
andai 12/12/2025|||
So there's two levels to this problem.

Retrieval.

And then hallucination even in the face of perfect context.

Both are currently unsolved.

(Retrieval's doing pretty good but it's a Rube Goldberg machine of workarounds. I think the second problem is a much bigger issue.)

cachius 12/12/2025||
Re: retrieval: that's where the snake eats its tail. As AI slop floods the web, grounding is like laying a foundation in a swamp, and that Rube Goldberg machine tries to prevent the snake from reaching its tail. But RG machines are brittle and not exactly the thing you want to build infrastructure on. Just look at https://news.ycombinator.com/item?id=46239752 for an example of how easily it can break.
jacquesm 12/14/2025|||
There are four words that would make the output of any LLM instantly 1000x more useful and I haven't seen them yet: "I do not know.".
f_k 12/13/2025|||
> verifying their claims ends up taking time.

I've been working on this problem with https://citellm.com, specifically for PDFs.

Instead of relying on the LLM answer alone, each extracted field links to its source in the original document (page number + highlighted snippet + confidence score).

Checking any claim becomes simple: click and see the exact source.
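For the curious, a hypothetical sketch of what that kind of output can look like (illustrative field names only, not citellm's actual schema or API):

    from dataclasses import dataclass

    @dataclass
    class ExtractedField:
        name: str          # e.g. "invoice_total"
        value: str         # the extracted value
        page: int          # page number in the source PDF
        snippet: str       # the highlighted text the value was taken from
        confidence: float  # 0.0-1.0 score for this extraction

    field = ExtractedField(
        name="invoice_total",
        value="1,234.56 EUR",
        page=3,
        snippet="Total due: 1,234.56 EUR",
        confidence=0.92,
    )
    print(f"{field.name} = {field.value} (p.{field.page}, conf {field.confidence:.2f})")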

rafaelmn 12/12/2025|||
I constantly see top models (opus 4.5, gemini 3) get a stroke mid task - they will solve the problem correctly in one place, or have a correct solution that needs to be reapplied in context - and then completely miss the mark in another place. "Lack of intelligence" is very much a limiting factor. Gemini especially will get into random reasoning loops - reading thinking traces - it gets unhinged pretty fast.

Not to mention it's super easy to gaslight these models: just assert something wrong with a vaguely plausible explanation and you get no pushback or reasoning validation.

So I know you qualified your post with "for your use case", but personally I would very much like more intelligence from LLMs.

virtuosarmo 12/12/2025|||
I've had better success finding information using Google Gemini vs. ChatGPT. I.e. someone mentions to me the name of someone or some company, but doesn't give the full details (i.e. Joe @ XYZ Company doing this, or this company with 10,000 people, in ABC industry)...sometimes i don't remember the full name. Gemini has been more effective for me in filling in the gaps and doing fuzzy search. I even asked ChatGPT why this was the case, and it affirmed my experience, saying that Gemini is better for these queries because of Search integration, Knowledge Graph, etc. Especially useful for recent role changes, which haven't been propagated through other channels on a widespread basis.
HeavyStorm 12/12/2025|||
All of them are heavily invested in improving grounding. The money isn't on personal use but enterprise customers and for those, grounding is essential.
anentropic 12/12/2025|||
Yeah I basically always use "web search" option in ChatGPT for this reason, if not using one of the more advanced modes.
BrtByte 12/12/2025|||
I'm pretty much in the same camp. For a lot of everyday use, raw "intelligence" already feels good enough
chuckSu 12/12/2025||
[dead]
breakingcups 12/11/2025||
Is it me, or did it still get at least three placements of components (RAM and PCIe slots, plus it's DisplayPort and not HDMI) in the motherboard image[0] completely wrong? Why would they use that as a promotional image?

0: https://images.ctfassets.net/kftzwdyauwt9/6lyujQxhZDnOMruN3f...

tedsanders 12/11/2025||
Yep, the point we wanted to make here is that GPT-5.2's vision is better, not perfect. Cherrypicking a perfect output would actually mislead readers, and that wasn't our intent.
BoppreH 12/11/2025|||
That would be a laudable goal, but I feel like it's contradicted by the text:

> Even on a low-quality image, GPT‑5.2 identifies the main regions and places boxes that roughly match the true locations of each component

I would not consider it to have "identified the main regions" or to have "roughly matched the true locations" when ~1/3 of the boxes have incorrect labels. The remark "even on a low-quality image" is not helping either.

Edit: credit where credit is due, the recently-added disclaimer is nice:

> Both models make clear mistakes, but GPT‑5.2 shows better comprehension of the image.

hnuser123456 12/11/2025|||
Yeah, what it's calling RAM slots is the CMOS battery. What it's calling the PCIE slot is the interior side of the DB-9 connector. RAM slots and PCIE slots are not even visible in the image.
hexaga 12/11/2025||
It just overlaid a typical ATX pattern across the motherboard-like parts of the image, even if that's not really what the image is showing. I don't think it's worthwhile to consider this a 'local recognition failure', as if it just happened to mistake CMOS for RAM slots.

Imagine it as a markdown response:

# Why this is an ATX layout motherboard (Honest assessment, straight to the point, *NO* hallucinations)

1. *RAM* as you can clearly see, the RAM slots are to the right of the CPU, so it's obviously ATX

2. *PCIE* the clearly visible PCIE slots are right there at the bottom of the image, so this definitely cannot be anything except an ATX motherboard

3. ... etc more stuff that is supported only by force of preconception

--

It's just meta signaling gone off the rails. Something in their post-training pipeline is obviously vulnerable given how absolutely saturated with it their model outputs are.

Troubling that the behavior generalizes to image labeling, but not particularly surprising. This has been a visible problem at least since o1, and the lack of change tells me they do not have a real solution.

furyofantares 12/11/2025||||
They also changed "roughly match" to "sometimes match".
MichaelZuo 12/11/2025||
Did they really change a meaningful word like that after publication without an edit note…?
dwohnitmok 12/11/2025|||
This has definitely happened before with e.g. the o1 release. I will sometimes use the Wayback Machine to verify changes that have been made.
MichaelZuo 12/12/2025||
Wow sounds pretty shady then.
piker 12/11/2025|||
Eh, I'm no shill but their marketing copy isn't exactly the New York Times. They're given some license to respond to critical feedback in a manner that makes the statements more accurate without the same expectations of being objective journalism of record.
mkesper 12/12/2025||
Yes, but they should clearly mark updates. That would be professional.
guerrilla 12/12/2025||||
Leave it to OpenAI to be dishonest about being dishonest. It seems they're also editing this post without notice as well.
Grimblewald 12/16/2025|||
Look, just give the Qwen3-VL models a go. I've found them to be fantastic at this kind of thing so far, and what I'm seeing on display here is laughable in comparison. A closed-source / closed-weight paid model with worse performance than an open one? Come on. OpenAI really is a bubble.
arscan 12/11/2025||||
I think you may have inadvertently misled readers in a different way. I feel misled after not catching the errors myself, assuming it was broadly correct, and then coming across this observation here. Might be worth mentioning this is better but still inaccurate. Just a bit of feedback, I appreciate you are willing to show non-cherry-picked examples and are engaging with this question here.

Edit: As mentioned by @tedsanders below, the post was edited to include clarifying language such as: “Both models make clear mistakes, but GPT‑5.2 shows better comprehension of the image.”

tedsanders 12/11/2025||
Thanks for the feedback - I agree our text doesn't make the models' mistakes clear enough. I'll make some small edits now, though it might take a few minutes to appear.
g947o 12/11/2025||||
When I saw that it labeled DP ports as HDMI, I immediately decided that I am not going to touch this until it is at least 5x better, with 95% accuracy on basic things.

I don't see any advantage in using the tool.

jacquesm 12/11/2025||
That's a far more dangerous territory. A machine that is obviously broken will not get used. A machine that is subtly broken will propagate errors because it will have achieved a high enough trust level that it will actually get used.

Think 'Therac-25': it worked 99.5% of the time. In fact it worked so well that reports of malfunctions were routinely discarded.

AdamN 12/12/2025||
There was a low-level Google internal service that worked so well that other teams took a hard dependency on it (against advice). So the internal team added a cron job to drop it every once in a while to get people to trust it less :-)
layer8 12/11/2025||||
You know what would be great? If it had added some boxes with “might be X or Y, but not sure”.
iwontberude 12/11/2025||||
But it’s completely wrong.
johnwheeler 12/11/2025||||
Oh and you guys don't mislead people ever. Your management is just completely trustworthy, and I'm sure all you guys are too. Give me a break, man. If I were you, I would jump ship or you're going to be like a Theranos employee on LinkedIn.
yard2010 12/12/2025||
Hey, no need to personally attack anyone. A bad organization can still consist of good people.
johnwheeler 12/12/2025||
I disagree. I think the whole organization is egregious and full of Sam Altman sycophants that are causing a real and serious harm to our society. Should we not personally attack the Nazis either? These people are literally pushing for a society where you're at a complete disadvantage. And they're betting on it. They're banking on it.
iamdanieljohns 12/11/2025||||
Is Adaptive Reasoning gone from GPT-5.2? It was a big part of the release of 5.1 and Codex-Max. Really felt like the future.
tedsanders 12/11/2025||
Yes, GPT-5.2 still has adaptive reasoning - we just didn't call it out by name this time. Like 5.1 and codex-max, it should do a better job at answering quickly on easy queries and taking its time on harder queries.
iamdanieljohns 12/12/2025||
Why have "light" or "low" thinking then? I've mentioned this before in other places, but there should only be "none," "standard," "extended," and maybe "heavy."

Extended and heavy are about raising the floor (~25% and ~45% or some other ratio respectively) not determining the ceiling.

d--b 12/11/2025|||
[flagged]
honeycrispy 12/11/2025|||
Not sure what you mean, Altman does that fake-humility thing all the time.

It's a marketing trick; show honesty in areas that don't have much business impact so the public will trust you when you stretch the truth in areas that do (AGI cough).

d--b 12/11/2025||
I'm confident that GP is acting in good faith though. Maybe I am falling for it. Who knows? It doesn't really matter, I just wanted to be nice to the guy. It takes some balls to post as an OpenAI employee here, and I wish we heard from them more often, as I am pretty sure all of them lurk around.
rvnx 12/11/2025||
It's the only reasonable choice you can make. As an employee with stock options you do not want to get trashed on Hackernews because this affects your income directly if you try to conduct a secondary share sale or plan to hold until IPO.

Once the IPO is done, and the lockup period is expired, then a lot of employees are planning to sell their shares. But until that, even if the product is behind competitors there is no way you can admit it without putting your money at risk.

Esophagus4 12/11/2025||
I know HN commenters like to see themselves as contrarians, as do I sometimes, but man… this seems like a serious stretch to assume such malicious intent that an employee of the world’s top AI name would astroturf a random HN thread about a picture on a blog.

I’m fairly comfortable taking this OpenAI employee’s comment at face value.

Frankly, I don’t think a HN thread will make a difference to his financial situation, anyway…

rvnx 12/11/2025||
Malicious? No, and this is far from astroturfing; he even speaks as "we". It's just a logical move to defend your company when people claim your product is buggy.

There is no other logical move - that is what I am saying, contrary to the people above who say this requires a lot of courage. It's not about courage, it's just normal and logical (and yes, Hacker News matters a lot; this place is a very strong source of signal for investors).

Not bad at all, just observing it.

wilg 12/11/2025|||
What did Sam Altman say? Or is this more of a vague impression thing?
d--b 12/11/2025||
[flagged]
minimaxir 12/11/2025||
Using ChatGPT to ironically post AI-generated comments is still posting of AI-generated comments.
az226 12/12/2025|||
And here is Gemini 3: https://media.licdn.com/dms/image/v2/D5610AQH7v9MtrZxxug/ima...
saejox 12/12/2025|||
This is very impressive. Google really is ahead
pietz 12/12/2025||
They are definitely ahead in multi modality and I'd argue they have been for a long time. Their image understanding was already great, when their core LLM was still terrible.
FinnKuhn 12/12/2025||||
This is genuinely impressive. The OpenAI equivalent is less detailed AND less correct.
Lionga 12/12/2025|||
When OpenAI marketing material is actually showing how far Gemini 3 is ahead...
8organicbits 12/12/2025|||
Promotional content for LLMs is really poor. I was looking at Claude Code and the example on their homepage implements a feature, ignoring a warning about a security issue, commits locally, does not open a PR and then tries to close the GitHub issue. Whatever code it wrote they clearly didn't use as the issue from the prompt is still open. Bizarre examples.
timerol 12/11/2025|||
Also a "stacked pair" of USB type-A ports, when there are clearly 4
fumeux_fume 12/12/2025|||
General-purpose LLMs aren't very good at generating bounding boxes, so with that context, this is actually decent performance for certain use cases.
dolmen 12/12/2025|||
Not that bad compared to product images seen on AliExpress.
jasonlotito 12/11/2025|||
FTA: Both models make clear mistakes, but GPT‑5.2 shows better comprehension of the image.

You can find it right next to the image you are talking about.

tedsanders 12/11/2025|||
To be fair to OP, I just added this to our blog after their comment, in response to the correct criticisms that our text didn't make it clear how bad GPT-5.2's labels are.

LLMs have always been very subhuman at vision, and GPT-5.2 continues in this tradition, but it's still a big step up over GPT-5.1.

One way to get a sense of how bad LLMs are at vision is to watch them play Pokemon. E.g.,: https://www.lesswrong.com/posts/u6Lacc7wx4yYkBQ3r/insights-i...

They still very much struggle with basic vision tasks that adults, kids, and even animals can ace with little trouble.

da_grift_shift 12/11/2025|||
'Commented after article was already edited in response to HN feedback' award
whalesalad 12/11/2025|||
to be fair that image has the resolution of a flip phone from 2003
malfist 12/11/2025|||
If I ask you a question and you don't have enough information to answer, you don't confidently give me an answer, you say you don't know.

I might not know exactly how many USB ports this motherboard has, but I wouldn't select a set of 4 and declare it to be a stacked pair.

AstroBen 12/11/2025||
No-one should have the expectation LLMs are giving correct answers 100% of the time. It's inherent to the tech for them to be confidently wrong

Code needs to be checked

References need to be checked

Any facts or claims need to be checked

malfist 12/11/2025|||
According to the benchmarks here they're claiming up to 97% accuracy. That ought to be good enough to trust them right?

Or maybe these benchmarks are all wrong

JimDabell 12/12/2025|||
Something that is 97% accurate is wrong 3% of the time, so pointing out that it has gotten something wrong does not contradict 97% accuracy in the slightest.
refactor_master 12/12/2025||||
Gemini routinely makes up stuff about BigQuery’s workings. “It’s poorly documented”. Well, read the open source code, reason it out.

Makes you wonder what 97% is worth. Would we accept a different service with only 97% availability, and all downtime during lunch break?

TeMPOraL 12/12/2025||
I.e. like most restaurants and food delivery? :). Though 3% problem rate is optimistic.
AstroBen 12/11/2025||||
Does code work if it's 97% correct?

It's not okay if claims are totally made up 1/30 times

Of course people aren't always correct either, but we're able to operate on levels of confidence. We're also able to weight others' statements as more or less likely to be correct based on what we know about them

fooker 12/12/2025||
> Does code work if it's 97% correct?

Of course it does. The vast majority of software has bugs. Yes, even critical ones like compilers and operating systems.

mbesto 12/12/2025|||
> Or maybe these benchmarks are all wrong

You must be new to LLM benchmarks.

dolmen 12/12/2025|||
"confidently" is a feature selected in the system prompt.

As a user you can influence that behavior.

malfist 12/12/2025||
No it isn't. It isn't intelligent, it's a statistical engine. Telling it to be confident or less confident doesn't make it apply confidence appropriately. It's all a facade
ben_w 12/12/2025||||
That shouldn't be what causes these problems; if we can see it's wrong despite the low resolution, the AI isn't going to fully replace humans for all tasks involving this kind of thing.

That said, even with this kind of error rate an AI can speed *some* things up, because having a human whose sole job is to ask "is this AI correct?" is easier and cheaper than having one human for "do all these things by hand" followed by someone else whose sole job is to check "was this human output correct?" because a human who has been on a production line for 4 hours and is about ready for a break also makes a certain number of mistakes.

But at the same time, why use a really expensive general-purpose AI like this, instead of a dedicated image model for your domain? A special-purpose AI is something you can train on a decent laptop, and once trained it will run on a phone at perhaps 10 fps, give or take what the performance threshold is and how general you need it to be.

If you're in a factory and you're making a lot of some small widget or other (so, not a whole motherboard), having answers faster than the ping time to the LLM may be important all by itself.

And at this point, you can just ask the LLM to write the training setup for the image-to-bounding-box AI, and then you "just" need to feed in the example images.
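
As a rough sketch of how small that dedicated-model route can be (assuming torchvision's standard fine-tuning recipe; the class list below is a made-up placeholder, not anything tied to this motherboard example):

    import torchvision
    from torchvision.models.detection.faster_rcnn import FastRCNNPredictor

    # Start from a pretrained detector and swap the classification head for your own classes.
    model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
    num_classes = 5  # background + e.g. USB-A, USB-C, HDMI, DisplayPort (placeholder)
    in_features = model.roi_heads.box_predictor.cls_score.in_features
    model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes)
    # ...then fine-tune on a few hundred labelled board photos with an ordinary training loop.

The expensive part is labelling the images, not the model.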

redox99 12/12/2025|||
It's trivial for a human who knows what a PC looks like - maybe mistaking DisplayPort for HDMI at worst.
an0malous 12/11/2025|||
Because the whole culture of AI enthusiasts is to just generate slop and never check the results
tennisflyi 12/12/2025||
Have you seen the charts from their last release? They obviously don't check - too rich.
goobatrooba 12/11/2025||
I feel there is a point where all these benchmarks become meaningless. What I care about beyond decent performance is the user experience. There I have gripes with every single platform, and the one thing keeping me as a paid ChatGPT subscriber is the ability to sort chats into "projects" with associated files (hello Google, please wake up to basic user-friendly organisation!)

But all of them:
* Lie far too often with confidence
* Refuse to stick to prompts (e.g. ChatGPT to the request to number each reply for easy cross-referencing; Gemini to a basic request to respond in a specific language)
* Refuse to express uncertainty or nuance (I asked ChatGPT to give me certainty %s, which it did for a while but then just forgot...?)
* Refuse to give me short answers without fluff or follow-up questions
* Refuse to stop complimenting my questions or disagreements with wrong/incomplete answers
* Don't quote sources consistently so I can check facts, even when I ask for it
* Refuse to make clear whether they rely on original documents or an internal summary of the document, until I point out errors
* ...

I also have substance gripes, but for me such basic usability points are really something all of the chatbots fail on abysmally. Stick to instructions! Stop creating walls of text for simple queries! Tell me when something is uncertain! Tell me if there's no data or info rather than making something up!

razster 12/12/2025||
The latest from the big three - OpenAI, Claude, and Google - none of their models are good. I've spent more time monitoring them than just enjoying them. I've found it easier to run my own local LLM. The latest Gemini release, I gave it another go, only for it to misspell words and drift off into a fantasy world after a few chats about help restructuring guides. ChatGPT has become lazy for some reason and changes things I told it to ignore, randomly too. Claude was doing great until the latest release, then it started getting lazy after 20+k tokens. I tried making sure to keep a guide to refresh it if it started forgetting, but that didn't help.

Locals are better; I can script and have them script for me to build a guide creation process. They don't forget because that is all they're trained on. I'm done paying for 'AI'.

marcosscriven 12/12/2025|||
What are your best local models, and what hardware do you run them on?
balder1991 12/12/2025||||
I have this impression that LLMs are so complicated and entangled (in comparison to previous machine learning models) that they’re just too difficult to tune all around.

What I mean is, it seems they try to tune them to a few certain things, that will make them worse on a thousand other things they’re not paying attention to.

striking 12/12/2025|||
What's to stop you from using the APIs the way you'd like?
joshribakoff 12/12/2025||
The API is a way to access a model; he is criticizing the model, not the access method (at least until the last sentence, where he incorrectly implied you can only script a local model, but I don't think that's a silver bullet - in my experience that is even more challenging than starting with a working agent).
fleischhauf 12/12/2025|||
I'm always impressed by how fast people get used to new things. A couple of years ago something like ChatGPT was completely impossible, and now people complain that it sometimes doesn't do what you told it to and sometimes lies. (Not saying your points are not valid or that you should not raise them.) Some of the points are just not fixable at this point due to tech limitations. A language model currently simply has no way to give an estimate of its confidence. Also, there is no way to completely do away with hallucinations (lies); there need to be some more fundamental improvements for this to work reliably.
davebren 12/12/2025||
Your point would stand if the entire economy wasn't being reshaped around this product and employees weren't being told to use it or lose their jobs.
empiko 12/12/2025|||
Consider using structured output. You can define a JSON schema with specific fields, and the LLM is only used to fill in the values.

https://ai.google.dev/gemini-api/docs/structured-output
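
A minimal sketch with the google-genai Python SDK (the model id and schema are placeholders of mine, not something from the linked docs):

    from google import genai
    from pydantic import BaseModel

    class Verdict(BaseModel):
        answer: str
        caveats: list[str]

    client = genai.Client()  # expects an API key in the environment

    response = client.models.generate_content(
        model="gemini-2.0-flash",  # placeholder model id
        contents="Does increasing an LLM's temperature make it more factual?",
        config={
            "response_mime_type": "application/json",
            "response_schema": Verdict,
        },
    )
    print(response.text)  # JSON constrained to the Verdict fields

It constrains the shape of the answer, though of course not its truthfulness.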

ifwinterco 12/12/2025|||
I'm not an expert, but my understanding is that transformer-based models simply can't do some of those things; it isn't really how they work.

Especially something like expressing a certainty %: you might be able to get it to output one, but it's just making it up. LLMs are incredibly useful (I use them every day), but you'll always have to check important output.

carsoon 12/12/2025||
Yeah, I have seen multiple people use this certainty % thing, but it's terrible. A percentage is something calculated mathematically, and these models cannot do that.

Potentially they could figure it out by comparing next-token probabilities, but this is not surfaced in the chat products and especially not fed back into the chat/output.

Instead people should just ask it to explain BOTH sides of an argument, or explain why something is BOTH correct and incorrect. This way you see how it can hallucinate either way and get to make up your own mind about the correct outcome.
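
To be fair, the raw APIs do expose per-token log probabilities even though the chat products don't surface them; a rough sketch with the OpenAI Python SDK (the model id is a placeholder), turning a one-token answer into something confidence-like:

    import math
    from openai import OpenAI

    client = OpenAI()
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model id
        messages=[{"role": "user", "content": "Answer yes or no: is tungsten denser than lead?"}],
        max_tokens=1,
        logprobs=True,
        top_logprobs=5,
    )
    # Probability mass over the model's top candidates for the single answer token
    for cand in resp.choices[0].logprobs.content[0].top_logprobs:
        print(cand.token, round(math.exp(cand.logprob), 3))

Even then, those are probabilities over tokens, not over facts, so it's a rough proxy at best.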

nullbound 12/11/2025|||
<< I feel there is a point when all these benchmarks are meaningless.

I am relatively certain you are not alone in this sentiment. The issue is that the moment we move past seemingly objective measurements, it is harder to convince people that what we measure is appropriate, yet the measurable stuff can be somewhat gamed, which adds a fascinating cat-and-mouse layer to this.

delifue 12/12/2025|||
Once a metric becomes an optimization target, it ceases to be a good metric.
hnfong 12/12/2025|||
There's a leaderboard that measures user experience, the "lmsys" Chatbot Arena Leaderboard ( https://huggingface.co/spaces/lmarena-ai/lmarena-leaderboard ). The main issue with it these days is that it kinda measures sycophancy and user-preferred tone more than substance.

Some issues you mentioned like length of response might be user preference. Other issues like "hallucination" are areas of active research (and there are benchmarks for these).

carsoon 12/12/2025|||
I have a kinda strange ChatGPT personalization prompt, but it's been working well for me. The focus is on getting the model to analyze both sides and the extremes on both ends, so it explains both and lets me decide. This is much better than asking it to make up accuracy percentages.

I think we align on what we want out of models:

""" Don't add useless babelling before the chats, just give the information direct and explain the info.

DO NOT USE ENGAGEMENT BAITING QUESTIONS AT THE END OF EVERY RESPONSE OR I WILL USE GROK FROM NOW ON FOREVER AND CANCEL MY GPT SUBSCRIPTION PERMANENTLY ONLY. GIVE USEFUL FACTUAL INFORMATION AND FOLLOW UPS which are grounded in first principles thinking and logic. Do not take a side and look at think about the extreme on both ends of a point before taking a side. Do not take a side just because the user has chosen that but provide infomration on both extremes. Respond with raw facts and do not add opinions.

Do not use random emojis. Prefer proper marks for lists etc. """

Those spelling/grammar errors are actually there, and I don't want to change it as it's working well for me.

dontlikeyoueith 12/12/2025||
> Refuse to express uncertainty or nuance (i asked ChatGPT to give me certainty %s which it did for a while but then just forgot...?)

They're literally incapable of this. Any number they give you is bullshit.

agentifysh 12/11/2025||
Looks like they've begun censoring posts at r/Codex and not allowing complaint threads so here is my honest take:

- It is faster which is appreciated but not as fast as Opus 4.5

- I see no changes, very little noticeable improvements over 5.1

- I do not see any value in exchange for +40% in token costs

All in all I can't help but feel that OpenAI is facing an existential crisis. Gemini 3, even when it's used from AI Studio, offers close to ChatGPT Pro performance for free. Anthropic's Claude Code at $100/month is tough to beat. I am using Codex with the $40 credits, but there's been a silent increase in token costs and usage limitations.

AstroBen 12/11/2025||
Did you notice much improvement going from Gemini 2.5 to 3? I didn't

I just think they're all struggling to provide real world improvements

chillfox 12/12/2025|||
Gemini 3 Pro is the first model from Google that I have found usable, and it's very good. It has replaced Claude for me in some cases, but Claude is still my goto for use in coding agents.

(I only access these models via API)

neuah 12/12/2025||||
Using it in a specialized subfield of neuroscience, Gemini 3 w/ thinking is a huge leap forward in terms of knowledge and intelligence (with minimal hallucinations). I take it that the majority of people on here are software engineers. If you're evaluating it on writing boilerplate code, you probably have to squint to see differences between the (excellent) raw model performances, whereas in more niche edge cases there is more daylight between them.
dominotw 12/13/2025||
What specialized use cases did you use it on, and what were the outcomes?

Can you share your experience and data behind "leap forward"?

dcre 12/12/2025||||
Nearly everyone else (and every measure) seems to have found 3 a big improvement over 2.5.
agentifysh 12/12/2025||||
Oh yes, I'm noticing significant improvements across the board, but mainly having the 1,000,000-token context makes a ton of difference - I can keep digging at a problem without compaction.
cmrdporcupine 12/12/2025||||
I think what they're actually struggling with is costs. And I think they're all behind the scenes quantizing models to manage load here and there, and they're all giving inconsistent results.

I noticed huge improvement from Sonnet 4.5 to Opus 4.5 when it became unthrottled a couple weeks ago. I wasn't going to sign back up with Anthropic but I did. But two weeks in it's already starting to seem to be inconsistent. And when I go back to Sonnet it feels like they did something to lobotomize it.

Meanwhile I can fire up DeepSeek 3.2 or GLM 4.6 for a fraction of the cost and get almost as good as results.

XCSme 12/11/2025||||
Maybe they are just more consistent, which is a bit hard to notice immediately.
dudeinhawaii 12/12/2025||||
I noticed a quite noticeable improvement to the point where I made it my go-to model for questions. Coding-wise, not so much. As an intelligent model, writing up designs, investigations, general exploration/research tasks, it's top notch.
free652 12/12/2025||||
Yes, 2.5 just couldn't use tools right. 3.0 is way better at coding - better than Sonnet 4.5.
enraged_camel 12/12/2025|||
Gemini 3 was a massive improvement over 2.5, yes.
hmottestad 12/12/2025|||
I’m curious about if the model has gotten more consistent throughout the full context window? It’s something that OpenAI touted in the release, and I’m curious if it will make a difference for long running tasks or big code reviews.
agentifysh 12/12/2025||
One positive is that 5.2 is very good at finding bugs, but I'm not sure about throughput; I'd imagine it might be improved, but I haven't seen a real task to benchmark it on.

What I am curious about is 5.2-codex, but many of us complained about 5.1-codex (it seemed to get tunnel visioned), so I have been using vanilla 5.1.

It's just getting very tiring to deal with 5 different permutations of 3 completely separate models, but perhaps this is the intent and it will keep you on a chase.

BrtByte 12/12/2025|||
The speed bump is nice, but speed alone isn't a compelling upgrade if the qualitative difference isn't obvious in day-to-day use
fellowniusmonk 12/13/2025||
5.2 is performing worse in technical reading comprehension for information- and logic-dense puzzles. It's way more confidently wrong and stubborn about understanding definitions of words.
zone411 12/11/2025||
I've benchmarked it on the Extended NYT Connections benchmark (https://github.com/lechmazur/nyt-connections/):

The high-reasoning version of GPT-5.2 improves on GPT-5.1: 69.9 → 77.9.

The medium-reasoning version also improves: 62.7 → 72.1.

The no-reasoning version also improves: 22.1 → 27.5.

Gemini 3 Pro and Grok 4.1 Fast Reasoning still score higher.

Donald 12/11/2025||
Gemini 3 Pro Preview gets 96.8% on the same benchmark? That's impressive
capitainenemo 12/11/2025|||
And performs very well on the latest 100 puzzles too, so isn't just learning the data set (unless I guess they routinely index this repo).

I wonder how well AIs would do at bracket city. I tried gemini on it and was underwhelmed. It made a lot of terrible connections and often bled data from one level into the next.

wooger 12/12/2025|||
> unless I guess they routinely index this repo

This sounds like exactly the kind of thing any tech company would do when confronted with a competitive benchmark.

rsanek 12/12/2025||
I mean, the repo has <200 stars, it's not like it's so mainstream that you'd expect LLM makers to be watching it actively. If they wanted to game it, they could more easily do that in RL with synthetic data anyway.
capitainenemo 12/17/2025|||
Belated update on this. Gemini reasoning did much better than quick on Bracket City today (an easy puzzle, but still). It only failed to solve one clue outright, and got another wrong due to ambiguity in the referenced expression - but in a way that still fit the next level down, so the final answer was solved fairly cleanly. It still clearly has a harder time with it than with the Connections puzzle.
bigyabai 12/11/2025|||
GPT-5.2 might be Google's best Gemini advertisement yet.
outside1234 12/11/2025||
Especially when you see the price
tikotus 12/11/2025|||
Here's someone else testing models on a daily logic puzzle (Clues by Sam): https://www.nicksypteras.com/blog/cbs-benchmark.html GPT 5 Pro was the winner already before in that test.
thanhhaimai 12/11/2025|||
This link doesn't have Gemini 3 performance on it. Do you have an updated link with the new models?
dezgeg 12/12/2025||
I've also tried Gemini 3 for Clues by Sam and it can do really well, have not seen it make a single mistake even for Hard and Tricky ones. Haven't run it on too many puzzles though.
crapple8430 12/11/2025|||
GPT 5 Pro is a good 10x more expensive so it's an apples to oranges comparison.
fellowniusmonk 12/13/2025|||
I think they are overfitting more, I'm seeing it perform worse on esoteric logic puzzles.
Bombthecat 12/12/2025|||
I would like to see a cost-per-percent row or something similar. I feel like Grok would beat them all.
scrollop 12/11/2025||
Why no grok 4.1 reasoning?
sanex 12/12/2025||
Do people other than Elon fans use grok? Honest question. I've never tried it.
buu700 12/12/2025|||
I use Grok pretty heavily, and Elon doesn't factor into it any more than Sam and Sundar do when I use GPT and Gemini. A few use cases where it really shines:

* Research and planning

* Writing complex isolated modules, particularly when the task depends on using a third-party API correctly (or even choosing an API/library at its own discretion)

* Reasoning through complicated logic, particularly in cases that benefit from its eagerness to throw a ton of inference at problems where other LLMs might give a shallower or less accurate answer without more prodding

I'll often fire off an off-the-cuff message from my phone to have Grok research some obscure topic that involves finding very specific data and crunching a bunch of numbers, or write a script for some random thing that I would previously never have bothered to spend time automating, and it'll churn for ~5 minutes on reasoning before giving me exactly what I wanted with few or no mistakes.

As far as development, I personally get a lot of mileage out of collaborating with Grok and Gemini on planning/architecture/specs and coding with GPT. (I've stopped using Claude since GPT seems interchangeable at lower cost.)

For reference, I'm only referring to the Grok chatbot right now. I've never actually tried Grok through agentic coding tooling.

mac-attack 12/12/2025||||
I can't understand why people would trust a CEO who regularly lies about product timelines, product features, his own personal life, etc. And that's before politicizing his entire kingdom by literally becoming a part of the government and one of the larger donors to the current administration.
delaminator 12/12/2025|||
You’re not narrowing it down.
lkjdsklf 12/12/2025||||
If we stopped using products of every company that had a CEO that lied about their products, we’d all be sitting in caves staring at the dirt
fatata123 12/12/2025|||
Because not everyone makes their decisions through the prism of politics
sz4kerto 12/12/2025||||
I'm using Gemini in general, but Grok too. That's because sometimes Gemini Thinking is too slow, but Fast can get confused a lot. Grok strikes a nice balance between being quite smart (not Gemini 3 Pro level, but close) and very fast.
ralusek 12/12/2025||||
Only thing I use grok for is if there is a current event/meme that I keep seeing referenced and I don't understand, it's good at pulling from tweets
wdroz 12/12/2025||||
Unlike openai, you can use the latest grok models without verifying your organization and giving your ID.
jbm 12/12/2025||||
I use a few AIs together to examine the same code base. I find Grok better than some of the Chinese ones I've used, but it isn't in the same league as Claude or Codex.
rsanek 12/12/2025||||
it's the biggest model on OpenRouter, even if you exclude free tier usage https://openrouter.ai/state-of-ai
irthomasthomas 12/12/2025||
Roleplay is the largest use-case on openrouter.
bumling 12/12/2025||||
I dislike Musk, and use Grok. I find it most useful for analyzing text to help check if there's anything I've missed in my own reading. Having it built in to Twitter is convenient and it has a generous free tier.
scrollop 12/12/2025||||
I hate the guy; however, Grok scores high on ARC-AGI-2, so it would be silly not to at least rank it.
fatata123 12/12/2025|||
[dead]
simonw 12/11/2025||
Wow, there's a lot going on with this pelican riding a bicycle: https://gist.github.com/simonw/c31d7afc95fe6b40506a9562b5e83...
alechewitt 12/12/2025||
Nice work on these benchmarks Simon. I’ve followed your blog closely since your great talk at the AI Engineers World Fair, and I want to say thank you for all the high quality content you share for free. It’s become my primary source for keeping up to date.

I’ve been working on a few benchmarks to test how well LLMs can recreate interfaces from screenshots. (https://github.com/alechewitt/llm-ui-challenge). From my basic tests, it seems GPT-5.2 is slightly better at these UI recreations. For example, in the MS Word replica, it implemented the undo/redo buttons as well as the bold/italic formatting that GPT-5.1 handled, and it generally seemed a bit closer to the original screenshot (https://alechewitt.github.io/llm-ui-challenge/outputs/micros...).

In the VS Code test, it also added the tabs that weren’t visible in the screenshot! (https://alechewitt.github.io/llm-ui-challenge/outputs/vs_cod...).

simonw 12/12/2025||
That is a very good benchmark. Interesting to see GPT-5.2 delivering on the promise of better vision support there.
Stevvo 12/11/2025|||
The variance is way too high for this test to have any value at all. I ran it 10 times, and each pelican on a bicycle was a better rendition than that; about half of them you could say were perfect.
golly_ned 12/11/2025|||
Compared to the other benchmarks which are much more gameable, I trust PelicanBikeEval way more.
refulgentis 12/12/2025|||
[flagged]
getnormality 12/12/2025||||
Well, the variance is itself interesting.
throwaway102398 12/12/2025|||
[dead]
BeetleB 12/11/2025|||
They probably saw your complaint that 5.1 was too spartan and a regression (I had the same experience with 5.1 in the POV-Ray version - have yet to try 5.2 out...).
tkgally 12/12/2025|||
I added GPT-5.2 Pro to my pelican-alternatives benchmark for the first three prompts:

Generate an SVG of an octopus operating a pipe organ

Generate an SVG of a giraffe assembling a grandfather clock

Generate an SVG of a starfish driving a bulldozer

https://gally.net/temp/20251107pelican-alternatives/index.ht...

GPT-5.2 Pro cost about 80 cents per prompt through OpenRouter, so I stopped there. I don’t feel like spending that much on all thirty prompts.

smusamashah 12/12/2025|||
Hi, it doesn't have Gemini 3.5 Pro which seems to be the best at this
svantana 12/12/2025||
That's probably because "Gemini 3.5 Pro" doesn't exist
philipgross 12/13/2025|||
That gallery is an excellent advertisement for Gemini 3.0 Pro.
AstroBen 12/11/2025|||
Seems to be getting more aerodynamic. A clear sign of AI intelligence
fxwin 12/11/2025|||
the only benchmark i trust
belter 12/11/2025|||
What happens if you ask for a pterodactyl on a motorbike?

Would like to know how much they are optimizing for your pelican....

simonkagedal 12/11/2025||
He commented on this here: https://simonwillison.net/2025/Nov/13/training-for-pelicans-...
irthomasthomas 12/11/2025||
I was expecting to see a pterodactyl :(
minimaxir 12/11/2025|||
Is that the first SVG pelican with drop shadows?
simonw 12/11/2025||
No, I got drop shadows from DeepSeek 3.2 recently https://simonwillison.net/2025/Dec/1/deepseek-v32/ (probably others as well.)
tootie 12/12/2025|||
Do you think the big guys are on to your game and have been adding extra pelicans to the training data?
sroussey 12/11/2025|||
What is good at SVG design?
culi 12/12/2025|||
Not svg, but basically the same challenge:

https://clocks.brianmoore.com/

Probably Kimi or Deepseek are best

azinman2 12/12/2025||||
Graphic designers?
KellyCriterion 12/12/2025|||
I've not seen any model being good at graphic/SVG creation so far - all of the stuff mostly looks ugly and somewhat "synthetic-distorted".

And lately, Claude (web) started to draw ASCII charts from one day to the next instead of the colorful infographic-styled images it did before (they were only slightly better than the ASCII charts).

tmaly 12/11/2025|||
seems to be eating something
danans 12/11/2025||
Probably a jellyfish. You're seeing the tentacles
nightshift1 12/11/2025||
benchmarks probably should not be used for so long.
mmaunder 12/11/2025||
Weirdly, the blog announcement completely omits the actual new context window size which is 400,000: https://platform.openai.com/docs/models/gpt-5.2

Can I just say !!!!!!!! Hell yeah! Blog post indicates it's also much better at using the full context.

Congrats OpenAI team. Huge day for you folks!!

Started on Claude Code and like many of you, had that omg CC moment we all had. Then got greedy.

Switched over to Codex when 5.1 came out. WOW. Really nice acceleration in my Rust/CUDA project which is a gnarly one.

Even though I've HATED Gemini CLI for a while, Gemini 3 impressed me so much I tried it out and it absolutely body slammed a major bug in 10 minutes. Started using it to consult on commits. Was so impressed it became my daily driver. Huge mistake. I almost lost my mind after a week of fighting it. Insane bias towards action. Ignoring user instructions. Garbage characters in output. Absolutely no observability into its thought process. And on and on.

Switched back to Codex just in time for 5.1 codex max xhigh which I've been using for a week, and it was like a breath of fresh air. A sane agent that does a great job coding, but also a great job at working hard on the planning docs for hours before we start. Listens to user feedback. Observability on chain of thought. Moves reasonably quickly. And also makes it easy to pay them more when I need more capacity.

And then today GPT-5.2 with an xhigh mode. I feel like Xmas has come early. Right as I'm doing a huge Rust/CUDA/math-heavy refactor. THANK YOU!!

ubutler 12/12/2025||
> Weirdly, the blog announcement completely omits the actual new context window size which is 400,000: https://platform.openai.com/docs/models/gpt-5.2

As @lopuhin points out, they already claimed that context window for previous iterations of GPT-5.

The funny thing is though, I'm on the business plan, and none of their models, not GPT-5, GPT-5.1, GPT-5.2, GPT-5.2 Extended Thinking, GPT-5.2 Pro, etc., can really handle inputs beyond ~50k tokens.

I know because, when working with a really long Python file (>5k LoCs), it often claims there is a bug because, somewhere close to the end of the file, it cuts off and reads as '...'.

Gemini 3 Pro, by contrast, can genuinely handle long contexts.

andybak 12/12/2025||
Why would you put that whole python file in the context at all? Doesn't Codex work like Claude Code in this regard and use tools to find the correct parts of a larger file to read into context?
lopuhin 12/11/2025|||
Context window size of 400k is not new, gpt-5, 5.1, 5-mini, etc. have the same. But they do claim they improved long context performance which if true would be great.
energy123 12/11/2025||
But 400k was never usable in ChatGPT Plus/Pro subscriptions. It was nerfed down to 60-100k. If you submitted too long of a prompt they deleted the tokens on the end of your prompt before calling the model. Or if the chat got too long (still below 100k however) they deleted your first messages. This was 3 months ago.

Can someone with an active sub check whether we can submit a full 400k prompt (or at least 200k) and confirm that there is no prompt truncation in the backend? I don't mean attaching a file, which uses RAG.
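
If anyone does test it, here's a rough sketch for building a prompt of known length (assuming tiktoken; o200k_base is only an approximation of whatever tokenizer 5.2 actually uses):

    import tiktoken

    enc = tiktoken.get_encoding("o200k_base")  # approximate, not necessarily GPT-5.2's tokenizer
    filler = "The quick brown fox jumps over the lazy dog. "
    prompt = filler * 20000 + "\nThe secret word is 'pomegranate'. What is the secret word?"
    print(len(enc.encode(prompt)))  # adjust the multiplier until this is ~200k

If the model can't tell you the secret word, the tail of the prompt was cut off before it ever saw it.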

piskov 12/11/2025|||
Context windows for web:

Fast (GPT‑5.2 Instant) - Free: 16K; Plus/Business: 32K; Pro/Enterprise: 128K

Thinking (GPT‑5.2 Thinking) - all paid tiers: 196K

https://help.openai.com/en/articles/11909943-gpt-52-in-chatg...

energy123 12/12/2025|||
But can you do that in one message or is that a best case scenario in a long multi turn chat?
dr_dshiv 12/12/2025|||
That’s… too bad
eru 12/12/2025||||
> Or if the chat got too long (still below 100k however) they deleted your first messages. This was 3 months ago.

I can believe that, but it also seems really silly? If your max context window is X and the chat has approached that, instead of deleting the first messages outright, why not have your model summarise the first quarter of tokens and place that summary at the beginning of the log you feed as context? Since the chat history is (mostly) immutable, this only adds a minimal overhead: you can cache the summarisation and don't have to redo it for each new message. (If the partially summarised log gets too long, you summarise again.)

Since I can come up with this technique in half a minute of thinking about the problem, and the OpenAI folks are presumably not stupid, I wonder what downside I'm missing.
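
Something like this, as a minimal sketch (OpenAI Python SDK; the model id, the halving point, and the length threshold are arbitrary placeholders, not anything OpenAI actually does):

    from openai import OpenAI

    client = OpenAI()
    MODEL = "gpt-4o-mini"  # placeholder
    MAX_MESSAGES = 40      # stand-in for a real token-count check

    def compact(history: list[dict]) -> list[dict]:
        """If the log is too long, replace its older half with a one-message summary."""
        if len(history) <= MAX_MESSAGES:
            return history
        old, recent = history[: len(history) // 2], history[len(history) // 2 :]
        summary = client.chat.completions.create(
            model=MODEL,
            messages=old + [{"role": "user", "content": "Summarize the conversation so far in a few sentences."}],
        ).choices[0].message.content
        # The summary stays stable until the next compaction, so prompt caching still mostly works.
        return [{"role": "system", "content": "Summary of earlier conversation: " + summary}] + recent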

Aeolun 12/12/2025||
Don’t think you are missing anything. I do this with the API, and it works great. I’m not sure why they don’t do it, but I can only guess it’s because it completely breaks the context caching. If you summarize the full buffer at least you know you are down to a few thousand tokens to cache again, instead of 100k tokens to cache again.
eru 12/12/2025||
> [...] but I can only guess it’s because it completely breaks the context caching.

Yes, but you only re-do this every once in a while? It's a constant-factor overhead. If you essentially feed only the last few thousand tokens, you have no caching at all (and the conversation is big enough that this window of 'last few thousand tokens' doesn't cover all of it)?

gunalx 12/11/2025|||
API use was not merged in this way.
freedomben 12/11/2025|||
I haven't done a ton of testing due to cost, but so far I've actually gotten worse results with xhigh than high with gpt-5.1-codex-max. Made me wonder if it was somehow a PEBKAC error. Have you done much comparison between high and xhigh?
dudeinhawaii 12/11/2025|||
This is one of those areas where I think it's about the complexity of the task. What I mean is, if you set codex to xhigh by default, you're wasting compute. IF you're setting it at xhigh when troubleshooting a complex memory bug or something, you're presumably more likely to get a quality response.

I think in general, medium ends up being the best all-purpose setting, while high+ is good for a single-task deep dive. Or at least that has been my experience so far. You can theoretically let it work longer on a harder task as well.

A lot appears to depend on the problem and problem domain unfortunately.

I've used max in problem sets as diverse as "troubleshooting Cyberpunk mods" and figuring out a race condition in a server backend. In those cases, it did a pretty good job of exhausting available data (finding all available logs, digging into lua files), and narrowing a bug that every other model failed to get.

I guess in some sense you have to know from the onset that it's a "hard problem". That in and of itself is subjective.

wahnfrieden 12/12/2025||
You should also be making handoffs to/from Pro
robotswantdata 12/11/2025||||
For a few weeks the Codex model has been cursed. Recommend sticking with 5.1 high; 5.2 feels good too, but it's early days.
tekacs 12/11/2025|||
I found the same with Max xhigh. To the point that I switched back to just 5.1 High from 5.1 Codex Max. Maybe I should’ve tried Max high first.
lhl 12/12/2025|||
Anecdotally, I will say that for my toughest jobs GPT-5+ High in `codex` has been the best tool I've used - CUDA->HIP porting, finding bugs in torch, websockets, etc.; it's able to test, reason deeply, and find bugs. It can't write UI code for its life, however.

Sonnet/Opus 4.5 is faster, generally feels like a better coder, and makes much prettier TUIs/FEs, but in my experience, for anything tough, any time it tells you it understands now, it really doesn't...

Gemini 3 Pro is unusable - I've found the same thing, opinionated in the worst way, unreliable, doesn't respect my AGENTS.md and for my real world problems, I don't think it's actually solved anything that I can't get through w/ GPT (although I'll say that I wasn't impressed w/ Max, hopefully 5.2 xhigh improves things). I've heard it can do some magic from colleagues working on FE, but I'll just have to take their word for it.

tgtweak 12/12/2025|||
I have been on the 1M context window with Claude since 4.0 - it gets pretty expensive when you run 1M context on a long-running project (mostly using it in Cline for coding). I think they've realized more context length = more $ when dealing with most agentic coding workflows on the API.
Workaccount2 12/12/2025||
You should be doing everything you can to keep context under 200k, ideally even 100k. All the models unwind so badly as context grows.
patates 12/12/2025||
I don't have that experience with gemini. Up to 90% full, it's just fine.
tgtweak 12/15/2025||
If the models are designed around it, and not resorting to compression to get to higher input token lengths, they don't 'fall off' as they get near the context window limit. When working with large codebases, exhausting or compacting the context actually causes more issues, since the agent forgets what was in the other libraries and files. Google realized this internally and was among the first to get to a 2M-token context length (internally at first, then later released publicly).
BrtByte 12/12/2025|||
This is one of those updates where the value only really shows up if you're already deep in the weeds
nathants 12/12/2025|||
The usable input limit has not changed, and remains 400k total - 128k reserved for output = 272k. Confirmed by looking for any changes in the codex CLI source: nope.
Suppafly 12/12/2025|||
>Can I just say !!!!!!!! Hell yeah!

...

>THANK YOU!!

Man you're way too excited.

twisterius 12/11/2025||
[flagged]
mmaunder 12/11/2025||
My name is Mark Maunder. Not the fisheries expert. The other one when you google me. I’m 51 and as skeptical as you when it comes to tech. I’m the CTO of a well known cybersecurity company and merely a user of AI.

Since you critiqued my post, allow me to reciprocate: I sense the same deflector shields in you as in many others here. I'd suggest embracing these products with a sense of optimism until proven otherwise; I've found that path leads to some amazing discoveries and moments where you realize how important and exciting this tech really is. Try out math that is too hard for you, or programming languages that are labor intensive, or languages that you don't know. As the GitHub CEO said: this technology lets you increase your ambition.

bgwalter 12/12/2025|||
I have tried the models and in domains I know well they are pathetic. They remove all nuance, make errors that non-experts do not notice and generally produce horrible code.

It is even worse in non-programming domains, where they chop up 100 websites and serve you incorrect bland slop.

If you are using them as a search helper, that sometimes works, though 2010 Google produced better results.

Oracle dropped 11% today due to over-investment in OpenAI. Non-programmers are acutely aware of what is going on.

muppetman 12/12/2025|||
Exactly this. It's like reading the news! It seems perfectly fine until you hit a news article in a domain you have intimate knowledge of, and then you realise how badly hacked together the news is. AI feels just like that. But AI can improve, so I'm in the middle with my optimism.
jfreds 12/12/2025||||
> they remove all nuance

Said in a sweeping generalization with zero sense of irony :D

jrflowers 12/12/2025||
This is a good point. It is a sweeping generalization if you do not read the sentence that comes before that quote
re-thc 12/12/2025||||
> Oracle dropped 11% today due to over-investment in OpenAI

Not even remotely true. Oracle is building out infrastructure mostly for AI workloads. It dropped because it couldn't explain its financing and whether the investment was worth it. OpenAI or not wouldn't have mattered.

what-the-grump 12/12/2025|||
You pretend that humans don’t produce slop?

I can recognize the shortcomings of AI code, but it can produce a mock or a full-blown class before I can find a place to save the file it produced.

Pretending that we are all busy writing novelty and genius is silly; 99% are writing CRUD tasks and basic business flows. The code isn't going to be perfect - it doesn't need to be - but it will get the job done.

All the logical gotchas of the workflows that you'd be refactoring for hours are done in minutes.

Use Pro with search… is it going to read 200 pages of documentation in 7 minutes, come up with a conclusion, and validate or invalidate it in another 5? Meanwhile you're still trying to accept the cookie prompt on your 6th result.

You might as well join the Flat Earth Society if you still think that AI can't help you complete day-to-day tasks.

jacquesm 12/12/2025||||
[flagged]
mmaunder 12/12/2025||
That's like telling a pig to become a pork producer.
GolfPopper 12/12/2025||||
Replace 'products' with 'message', 'tech' with 'religion' and 'CEO' with 'prophet' and you have a bog-standard cult recruitment pitch.
Aeolun 12/12/2025||
Because most recruitment pitches are the same regardless of the subject.
bluefirebrand 12/12/2025|||
[flagged]
eru 12/12/2025||
Maybe you are holding it wrong?

Contemporary LLMs still have huge limitations and downsides. Just like hammer or a saw has limitations. But millions of people are getting good value out of them already (both LLMs and hammers and saws). I find it hard to believe that they are all deluded.

skydhash 12/12/2025||
What limitations does a hammer have if the job is hammering? Or a saw with sawing? Even `ed` doesn't have any issue with editing text files.
eru 12/12/2025||
Well, ask the people who invented better hammers or better saws. Or better text editors than ed.
nbardy 12/11/2025||
Those ARC-AGI-2 improvements are insane.

That's especially encouraging to me because those are all about generalization.

5 and 5.1 both felt overfit and would break down and be stubborn when you got them outside their lane. As opposed to Opus 4.5 which is lovely at self correcting.

It’s one of those things you really feel in the model rather than whether it can tackle a harder problem or not, but rather can I go back and forth with this thing learning and correcting together.

This whole release makes me insanely optimistic. If they can push this much improvement WITHOUT the new huge data centers and without a newly scaled base model, that's incredibly encouraging for what comes next.

Remember, the next big data centers are 20-30x the chip count and 6-8x the efficiency on the new chips.

I expect they can saturate the benchmarks WITHOUT any novel research or algorithmic gains. But at this point it's clear they're capable of pushing research qualitatively as well.

delifue 12/12/2025||
It's also possible that OpenAI used a lot of human-generated ARC-like data for training (semi-cheating). OpenAI has plenty of incentive to fake a high score.

Without fully disclosed training data, you will never be sure whether good performance comes from memorization or "semi-memorization".

deaux 12/12/2025|||
> 5 and 5.1 both felt overfit and would break down and be stubborn when you got them outside their lane. As opposed to Opus 4.5 which is lovely at self correcting.

This is simply the "openness vs directive-following" spectrum, which as a side effect becomes the sycophancy spectrum, and none of them have found an answer to it yet.

Recent GPT models follow directives more closely than Claude models, and are less sycophantic. Even Claude 4.5 models are still somewhat prone to "You're absolutely right!". GPT 5+ (API) models never do this. The byproduct is that the former are willing to self-correct, and the latter is more stubborn.

baq 12/12/2025||
Opus 4.5 answers most of my non-question comments with ‘you’re right.’ as the first thing in the output. At least I’m not absolutely right, I’ll take this as an improvement.
deaux 12/13/2025||
Hah, maybe 5th gen Claude will change to "you may be right".

The positive thing is that it seems to be more performative than anything. Claude models will say "you're [absolutely] right" and then immediately do something that contradicts it (because you weren't right).

Gemini 3 Pro seems to have struck a decent balance between stubbornness and you're-right-ness, though I still need to test it more.

fellowniusmonk 12/13/2025|||
5.2 seems to overfit more on esoteric logic puzzles in my testing - tests using precise language where attention has to be paid to pick the correct definition among many for a given word. It now charges ahead with the wrong definitions, with far lower accuracy and in a worse way.
mmaunder 12/11/2025||
Same. Also got my attention re ARC-AGI-2. That's meaningful. And a HUGE leap.
cbracketdash 12/12/2025||
Slight tangent, but I think it's quite interesting... you can try out the ARC-AGI-2 tasks by hand at this website [0] (along with other similar problem sets). It really puts into perspective the type of thinking AI is learning!

[0] https://neoneye.github.io/arc/?dataset=ARC-AGI-2

onraglanroad 12/11/2025||
I suppose this is as good a place as any to mention this. I've now met two different devs who complained about the weird responses from their LLM of choice, and it turned out they were using a single session for everything. From recipes for the night, presents for the wife and then into programming issues the next day.

Don't do that. The whole context is sent on queries to the LLM, so start a new chat for each topic. Or you'll start being told what your wife thinks about global variables and how to cook your Go.

I realise this sounds obvious to many people but it clearly wasn't to those guys so maybe it's not!
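
For anyone to whom it isn't obvious, a minimal sketch of what the client actually does under the hood (OpenAI Python SDK here; the model id is just a placeholder):

    from openai import OpenAI

    client = OpenAI()
    history = []  # the entire conversation - all of it is sent on every call

    def ask(user_text: str) -> str:
        history.append({"role": "user", "content": user_text})
        resp = client.chat.completions.create(model="gpt-4o-mini", messages=history)
        answer = resp.choices[0].message.content
        history.append({"role": "assistant", "content": answer})
        return answer

    ask("Recipe ideas for tonight?")
    ask("Why is this Go function leaking goroutines?")  # the recipe chat rides along as context

There's no separation between topics unless you start a fresh history.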

holtkam2 12/12/2025||
I know I sound like a snob, but I've had many moments with Gen AI tools over the years that made me wonder: what are these tools like for someone who doesn't know how LLMs work under the hood? Probably completely bizarre? Apps like Cursor or ChatGPT would be incomprehensible to me as a user, I feel.
Workaccount2 12/12/2025|||
Using my parents as a reference, they just thought it was neat when I showed them GPT-4 years ago. My jaw was on the floor for weeks, but most regular folks I showed had a pretty "oh thats kinda neat" response.

Technology is already so insane and advanced that most people just take it as magic inside boxes, so nothing is surprising anymore. It's all equally incomprehensible already.

jacobedawson 12/12/2025|||
This mirrors my experience, the non-technical people in my life either shrugged and said 'oh yeah that's cool' or started pointing out gnarly edge cases where it didn't work perfectly. Meanwhile as a techie my mind was (and still is) spinning with the shock and joy of using natural human language to converse with a super-humanly adept machine.
throw310822 12/12/2025||
I don't think the divide is between technical and non-technical people. HN is full of people that are weirdly, obstinately dismissive of LLMs (stochastic parrots, glorified autocompletes, AI slop, etc.). Personal anecdote: my father (85yo, humanistic culture) was astounded by the perfectly spot-on analysis Claude provided of a poetic text he had written. He was doubly astounded when, showing Claude's analysis to a close friend, he reacted with complete indifference as if it were normal for computers to competently discuss poetry.
khafra 12/12/2025||||
LLMs are an especially tough case, because the field of AI had to spend sixty years telling people that real AI was nothing like what you saw in the comics and movies; and now we have real AI that presents pretty much exactly like what you used to see in the comics and movies.
xwolfi 12/12/2025||
But it cannot think or mean anything, it's just a clever parrot, so it's a bit weird. I guess uncanny is the word. I use it as Google now, just to search for stuff that is hard to express with keywords.
adventured 12/12/2025|||
99% of humans are mimics, they contribute essentially zero original thought across 75 years. Mimicry is more often an ideal optimization of nature (of which an LLM is part) rather than a flaw. Most of what you'll ever want an LLM to do is to be a highly effective parrot, not an original thinker. Origination as a process is extraordinarily expensive and wasteful (see: entrepreneurial failure rates).

How often do you need original thought from an LLM versus parrot thought? The extreme majority of all use cases globally will only ever need a parrot.

robocat 12/12/2025||||
> clever parrot

Is it irony that you duckspeak this term? Are you a stochastically clever monkey to avoid using the standard cliche?

The thing I find most educating about AI is that it unfortunately mimics the standard of thinking of many humans...

LEDThereBeLight 12/12/2025|||
Try asking it a question you know has never been asked before. Is it parroting?
Agentlien 12/12/2025|||
My parents reacted in just the same way and the lackluster response really took me by surprise.
d-lisp 12/12/2025|||
Most non tech people I talked with don't care at all about LLMs.

They also are not impressed at all ("Okay, that's like google and internet").

lostmsu 12/12/2025||
Old people? I think it would be hard to find a lot of people under 20 who don't use ChatGPT daily. At least among ones that are still studying.
d-lisp 12/12/2025|||
People older than 25 or 30 maybe.

It would be funny if, in the end, the most use is made by students cheating at uni.

d-lisp 12/12/2025|||
I wanted to reflect a bit on this.

I have a hard time imagining why non-tech people would find a use for LLMs. Let's say nothing in your life forces you to produce information (be it textual, pictorial, or anything that can be related to information). Let's say your needs are focused on spending good times with friends or your family, eating nice dishes (home-cooked or at a restaurant), spending your money on furniture, rent, clothes, tools, and so on.

Why would you need an AI that produces information in an information-bloated world?

You've probably met someone who "fell in love with woodworking" or whatever after having watched YouTube videos (that person probably built a chair, a table, or something similar). I don't think stuff like "Hi, I have these materials, what can I do with them?" produces more interesting results than just nerding out on the internet or in a library looking for references (on Japanese handcrafted furniture, vintage IKEA designs, old-school woodworking, ...). (Or maybe the LLM will be able to give you a list of good reads, which is nice but a somewhat limited and basic use.)

Agentic AI and more efficient/intelligent AIs are not very interesting for people like <wood lover> and are at best a proxy for information findable elsewhere. Of course, not everyone is like <wood lover>; the majority of people don't even need to invest time in a "creative" hobby, and instead they will watch movies, invest time in sport, invest time in sociability, go to museums, read books. You could imagine having AIs that write books, invent films, invent artworks, talk with you, but I am pretty sure that there is something more than just "watch a movie" or "read a book" when performing these activities. As someone who likes reading or watching movies, what I enjoy is following the evolution of the authors of the pieces, understanding their posture toward their predecessors, their contemporaries, their own previous visions, and whatnot. I enjoy finding a movie "weird", "goofy", "sublime", and so on, because I enjoy a small amount of parasociality with the authors and am finally brought to say things like "Ahah, Lynch was such a weirdo when he shot Blue Velvet" (okay, maybe not that type of bully judgement, but you may understand what I mean).

I think I would find it uninspiring to read an AI-written book, because I couldn't have this small parasocial experience. Maybe you could get me with music, but I still think there's a lot of activity in loving a song. I love Bach, but I am also pretty sure I like Bach the character (from what I speculate based on the songs I listen to). I imagine that guy in front of his keyboard, having the chance to live a -weird- moment of ecstasy when he produces the best lines of the chaconne (if he were living in our times he would relisten to what he produced again and again, nodding to himself, "man, that's sick").

What could I experience from an LLM? "Here is the perfect novel I wrote specifically for you based on your tastes:". There would be no imaginary Bach that I would like to drink a beer with, no testimony of a human reaching the state of mind in which you produce an absolute (in fact highly relative, but you need to lie to yourself) "hit".

All of this is highly personal, but I would be curious to know what others think.

lostmsu 12/13/2025||
This is a weird take. Basically no one is just a wood lover. In fact, basically no one is an expert or even decently knowledgeable in more than 0-2 areas. But life has hundreds of things everyone must participate in. Where does your wood lover shop? How does he find his movies? File taxes? Get travel ideas? And even a wood lover, after watching the 100,500th niche woodworking video on YouTube, might have some questions. AI is the new, much better Google.

Re: books. Your imagination falters here too. I love sci-fi. I use voice AIs ( even made one: https://apps.apple.com/app/apple-store/id6737482921?pt=12710... ). A couple of times when I was on a walk I had an idea for a weird sci-fi setting, and I would ask AI to generate a story in that setting, and listen to it. It's interesting because you don't know what will actually happen to the characters and what the resolution would be. So it's fun to explore a few takes on it.

d-lisp 12/13/2025||
> Your imagination falters here too.

I think I just don't find what you described as interesting as you do. I tried AI dungeoning too, but I find it less interesting than with people, because I think I like people more than specific mechanisms of sociality. Also, in a sense, my brain is capable of producing surprising things, and when I am writing a story as a hobby, I don't know what will actually happen to the characters or what the resolution will be, and it's very, very exciting!

> no one is an expert or even decently knowledgeable in more than 0-2 areas

I might be biased, and I don't want to show off, but there are some of these people around here; let's just say it's rare for people to be decently knowledgeable in more than 5 areas.

I am okay with what you said:

- AI is a better Google

But Google also became shit, and as far as I can remember it was a pretty incredible tool before. If AI becomes what the old Google was for those people, then wouldn't you say, if you were them, that it's not very impressive and just somewhat "like Google"?

edit: all the judgements I made about "not interesting" do not mean "not impressive"

edit2: I think eventually AI will be capable of writing a book akin to Egan's Diaspora, and I would love to revisit what I said here when that happens

lostmsu 12/14/2025||
What you described re: books are preferences. I don't think the majority of people care about authors at all. So it might not work for you, but that's not a valid argument for why it won't work for most people. Your reasoning there is flawed.

It also seems pretty obvious (did you really not think the majority don't care about authors? I doubt it). So it stands to reason that some bias made you overlook that fact (as well as OpenAI's MAUs and other glaring data) when you were writing your statement above. If I were you I'd look hard into what that bias might be, because it could affect other, less directly related areas.

mmaunder 12/11/2025|||
Yeah, I think a lot of us take knowing how LLMs work for granted. I did the fast.ai course a while back and then went off and played with VLLM and various LLMs, optimizing execution, tweaking params, etc. Then I moved on and started being a user. But knowing how they work has been a game changer for my team and me. The context window seems so obvious, but if you don't know what it is you're going to think AI sucks. Which now has me wondering: is this why everyone thinks AI sucks? Maybe Simon Willison should write about this. Simon?
eru 12/12/2025||
> Is this why everyone thinks AI sucks?

Who's everyone? There are many, many people who think AI is great.

In reality, our contemporary AIs are (still) tools with glaring limitations. Some people overlook the limitations, or don't see them, and really hype them up. I guess the people who then take the hype at face value are those that think that AI sucks? I mean, they really do honestly suck in comparison to the hypest of hypes.

eru 12/12/2025|||
> I realise this sounds obvious to many people but it clearly wasn't to those guys so maybe it's not!

It's worse: Gemini (and ChatGPT, to a lesser extent) has started suggesting random follow-up topics when it concludes that a chat has exhausted its topic. Well, when I say random, I mean the suggestions seem to be pulled from the 'memory' of our other chats.

For a naive user without preconceived notions of how to use these tools, this guidance from the tools themselves would serve as a pretty big hint that they should intermingle their sessions.

ghostpepper 12/12/2025||
For ChatGPT you can turn this memory off in the settings and delete the memories it's already created.
eru 12/12/2025||
I'm not complaining about the memory at all. I was complaining about the suggestion to continue with unrelated topics.
noname120 12/11/2025|||
Problem is that by default ChatGPT has the “Reference chat history” option enabled in the Memory settings. This causes any previous conversation to leak into the current one. Just creating a new conversation is not enough; you also need to disable that option.
0xdeafbeef 12/11/2025|||
Only your questions are in it though
noname120 12/12/2025||
Are you sure? What makes you think so?
0xdeafbeef 12/15/2025||
https://www.shloked.com/writing/chatgpt-memory-bitter-lesson

Maybe something's changed since that post.

redhed 12/11/2025||||
This is also the default in Gemini, I'm pretty sure; at least I remember turning it off. Makes no sense to me why this is the default.
gordonhart 12/11/2025|||
> Makes no sense to me why this is the default.

You’re probably pretty far from the average user, who thinks “AI is so dumb” because it doesn’t remember what you told it yesterday.

redhed 12/11/2025||
I was thinking more people would be annoyed by it bringing up unrelated conversations, but thinking about it more, I'd say you're probably right that most people expect it to remember everything they say.
tiahura 12/12/2025||
It’s not that it brings it up in unrelated conversations, it’s that it nudges related conversations in unwanted directions.
astrange 12/12/2025|||
Mostly because they built the feature, so that implicitly means they think it's cool.

I recommend turning it off because it makes the models way more sycophantic and can drive them (or you) insane.

onraglanroad 12/11/2025|||
That seems like a terrible default. Unless they have a weighting system for different parts of context?
eru 12/12/2025||
They do (or at least they have something that behaves like weighting).
wickedsight 12/11/2025|||
This is why I love that ChatGPT added branching. Sometimes I end up going in some random direction in a thread about some code, and then I can go back and start a new branch from the point where the chat was still somewhat clean.

It also works really well when some of my questions weren't worded correctly and ChatGPT has gone in a direction I don't want. Branch, word my question better, and get a better answer.

vintermann 12/11/2025|||
It's not at all obvious where to drop the context, though. Maybe it helps to have similar tasks in the context, maybe not. It did really, shockingly well on a historical HTR task I gave it, so I gave it another one, in some ways an easier one... I thought it wouldn't hurt to have text in a similar style in the context, but then it suddenly did very poorly.

Incidentally, one of the reasons I haven't gotten much into subscribing to these services is that I always feel like they're triaging how many reasoning tokens to give me, or A/B testing a different model... I never feel I can trust that I'm interacting with the same model.

dcre 12/12/2025|||
The models you interact with through the API (as opposed to chat UIs) are held stable and let you specify reasoning effort, so if you use a client that takes API keys, you might be able to solve both of those problems.
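For example, a minimal sketch with the official OpenAI Python SDK; the model id "gpt-5.2" is an assumption here, and the exact effort values may differ, so check the docs linked at the top:

    # Minimal sketch (pip install openai). Assumes OPENAI_API_KEY is set in the
    # environment; "gpt-5.2" and the effort levels are assumptions, check the docs.
    from openai import OpenAI

    client = OpenAI()  # picks up OPENAI_API_KEY from the environment

    resp = client.responses.create(
        model="gpt-5.2",               # a pinned model id, not a chat-UI alias
        reasoning={"effort": "high"},  # e.g. "low" | "medium" | "high"
        input="Summarize the trade-offs of long chat contexts in two paragraphs.",
    )
    print(resp.output_text)
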
eru 12/12/2025|||
> Incidentally, one of the reasons I haven't gotten much into subscribing to these services, is that I always feel like they're triaging how many reasoning tokens to give me, or AB testing a different model... I never feel I can trust that I interact with the same model.

That's what websites have been doing for ages. Just like you can't step twice in the same river, you can't use the same version of Google Search twice, and never could.

chasd00 12/11/2025|||
I was listening to a podcast about people becoming obsessed and "in love" with an LLM like ChatGPT. Spouses were interviewed describing how mentally damaging it is to their partner and how their marriage/relationship is seriously at risk because of it. I couldn't believe no one had told these people to just go to the LLM and reset the context, which reverts the LLM back to a complete stranger. Granted, that would be pretty devastating to the person in "the relationship" with the LLM, since it wouldn't know them at all after that.
jncfhnb 12/11/2025|||
It's the majestic, corrupting glory of having a loyal cadre of empowering yes-men, normally only available to the rich and powerful, now available to the normies.
adamesque 12/11/2025|||
That's not quite what the parent was talking about, which is: don't just use one giant long conversation. Resetting "memories" is a totally different thing (which might still be valuable to do occasionally, if they still let you).
onraglanroad 12/11/2025||
Actually, it's kind of the same. LLMs don't have a "new memory" system. They're like the guy from Memento: context memory plus long-term memory from the training data, but no way to form new long-term memories from the context.

(Not addressed to the parent comment, but to the inevitable others: yes, this is an analogy, I don't need to hear another halfwit lecture on how LLMs don't really think or have memories. Thank you.)

dragonwriter 12/11/2025||
Context memory arguably is new memory, but because we abused the metaphor of "learning" for trained model weights, rather than something more like shaping inborn instinct, we have no fitting metaphor for what happens during the "lifetime" of an interaction with a model via its context window, i.e. the formation of skills/memories.
SubiculumCode 12/12/2025|||
I constantly switch out, even when it's on the same topic. It starts forming its own 'beliefs and assumptions' and gets myopic. I also use the big three services in turn, to attack ideas from multiple directions.
nrds 12/12/2025||
> beliefs and assumptions

Unfortunately during coding I have found many LLMs like to encode their beliefs and assumptions into comments; and even when they don't, they're unavoidably feeding them into the code. Then future sessions pick up on these.

SubiculumCode 12/12/2025||
YES! I've tried to provide instructions asking it to not leave comments at all.
ramoz 12/12/2025|||
Send them this https://backnotprop.substack.com/p/50-first-dates-with-mr-me...
blindhippo 12/12/2025|||
Thing is, context management is NOT obvious to most users of these tools. I use agentic coding tools on a daily basis now and still struggle with keeping context focused and useful, usually relying on patterns such as memory banks and task tracking documents to try to keep a log of things as I pop in and out of different agent contexts. Yet still, one false move and I've blown the window leading to a "compression" which is utterly useless.

The tools need to figure out how to manage context for us. This isn't something we have to deal with when working with other humans - we reliably trust that other humans (for the most part) retain what they are told. Agentic use now is like training a team mate to do one thing, then taking it out back to shoot it in the head before starting to train another one. It's inefficient and taxing on the user.

getnormality 12/12/2025|||
In my recent explorations [1] I noticed it got really stuck on the first thing I said in the chat, obsessively returning to it as a lens through which every new message had to be interpreted. Starting new sessions was very useful to get a fresh perspective. Like a human, an AI that works on a writing piece with you is too close to the work to see any flaw.

[1] https://renormalize.substack.com/p/on-renormalization

okthrowman283 12/12/2025|||
Interesting. I've noticed the same behavior with Gemini 3.0 but not with Claude, and Gemini 2.5 did not have this behavior. I wonder what the tuning is optimising for here.
ljlolel 12/12/2025|||
Probably because the chat is named after that first message.
faxmeyourcode 12/12/2025|||
My boss (a great engineer) had been complaining about the quality of his internal GitHub Copilot no matter the model or task. Turns out he never cleared the context: it was just the same conversation spread thin across nearly a dozen completely separate repositories, because they were all open in his massive VS Code workspace at once.

This was earlier this year... so after that I started giving internal presentations to our engineering team on basic context management, best practices, etc.

layman51 12/12/2025|||
That is interesting. I already knew about the idea that you're not supposed to let the conversation drag on too much because problem-solving performance might take a big hit, but it makes me think that, over time, people got away with using a single conversation for many different topics because of the big context windows.

Now I kind of wonder if I'm missing out by not continuing conversations longer, or by not trying the memory features.

plaidfuji 12/11/2025|||
It is annoying, though: when you start a new chat for each topic you tend to have to rewrite context a lot. I use Gemini 3, which I understand doesn't have as good a memory system as OpenAI's. Even on single-file programming stuff, after a few rounds of iteration I tend to hit its context limit (on the thinking model), either because the answers degrade or it just throws the "oops, something went wrong" error. OK, time to restart from scratch and paste in the latest iteration.

I don't understand how agentic IDEs handle this either. Or maybe it's easier there: they just resend the entire codebase every time. But where do you cut the chat history? It feels to me like every time you re-prompt a conversation, it should first summarize the existing context as bullets for its internal prompt rather than resending the entire context.

int_19h 12/11/2025||
Agentic IDEs/extensions usually continue the conversation until the context gets close to 80% full, then do the compacting. With both Codex and Claude Code you can actually observe this happening.

That said, I find that in practice Codex's performance degrades significantly long before it reaches the point of automated compaction, and AFAIK there's no way to trigger it manually. Claude, on the other hand, has a command to force compaction, but I rarely use it because Claude is so good at managing it by itself.

As for multiple conversations, you can tell the model to update AGENTS.md (or CLAUDE.md, or whatever is in its context by default) with things it needs to remember.
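As a purely illustrative sketch, the kind of note you might ask the agent to append to AGENTS.md could look like this (the file name depends on your tool, and every entry below is made up):

    ## Notes for future sessions (example entries only, maintained by the agent)
    - Run tests with `make test`; integration tests need the local Postgres container.
    - The payments module is mid-refactor; prefer the new client under src/payments/v2/.
    - Never edit generated files under gen/; change the schema definitions instead.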

wahnfrieden 12/12/2025||
Codex has `/compact`
TechDebtDevin 12/11/2025||
How are these devs employed or trusted with anything..
jumploops 12/11/2025|
> “a new knowledge cutoff of August 2025”

This (and the price increase) points to a new pretrained model under the hood.

GPT-5.1, in contrast, was allegedly using the same pretraining as GPT-4o.

FergusArgyll 12/11/2025||
A new pretrain would definitely get more than a .1 version bump & would get a whole lot more hype I'd think. They're expensive to do!
caconym_ 12/11/2025|||
Releasing anything as "GPT-6" which doesn't provide a generational leap in performance would be a PR nightmare for them, especially after the underwhelming release of GPT-5.

I don't think it really matters what's under the hood. People expect model "versions" to be indexed on performance.

ACCount37 12/11/2025||||
Not necessarily. GPT-4.5 was a new pretrain on top of a sizeable raw model scale bump, and it only got a 0.5 bump, because the gains from reasoning training in the o-series overshadowed GPT-4.5's natural advantage over GPT-4.

OpenAI might have learned not to overhype. They already shipped GPT-5, which was only an incremental upgrade over o3 and was received poorly, with that being part of the reason why.

diego_sandoval 12/12/2025||
I jumped straight from 4o (free user) into GPT-5 (paid user).

It was a generational leap if there ever was one. Much bigger than 3.5 to 4.

ACCount37 12/12/2025|||
Yes, if OpenAI had released GPT-5 right after GPT-4o, it would have been seen as a proper generational leap.

But o3 existing and being good at what it does? Took the wind out of GPT-5's sails.

kadushka 12/12/2025|||
What kind of improvements do you expect when going from 5 straight to 6?
hannesfur 12/11/2025||||
Maybe they felt the increase in capability wasn't worthy of a bigger version bump. Additionally, pre-training isn't as important as it used to be; most of the advances we see now probably come from the RL stage.
femiagbabiaka 12/11/2025||||
Not if they didn't feel it delivered customer value, no? It's about under-promising and over-delivering, in every instance.
jumploops 12/12/2025||||
It’s possible they’re using some new architecture to get more up-to-date data, but I think that’d be even more of a headline.

My hunch is that this is the same 5.1 post-training on a new pretrained base.

Likely rushed out the door faster than they initially expected/planned.

OrangeMusic 12/12/2025||||
Yeah because OpenAI has been great at naming their models so far? ;)
boc 12/12/2025||||
Maybe the rumors about failed training runs weren't wrong...
redwood 12/11/2025|||
Not if it underwhelms
redox99 12/12/2025|||
I think it's more likely to be the old base model checkpoint further trained on additional data.
jumploops 12/12/2025||
Is that technically not a new pretrained model?

(Also not sure how that would work, but maybe I’ve missed a paper or two!)

redox99 12/12/2025||
I'd say for it to be called a new pretrained model, it'd need to be trained from scratch (like llama 1, 2, 3).

But it's just semantics.

98Windows 12/11/2025|||
Or maybe 5.1 was an older checkpoint with more quantization.
MagicMoonlight 12/11/2025||
No, they just feed in another round of slop to the same model.