Posted by atgctg 3 days ago
System card: https://cdn.openai.com/pdf/3a4153c8-c748-4b71-8e31-aecbde944...
The thing that would now make the biggest difference isn't "more intelligence", whatever that might mean, but better grounding.
It's still a big issue that the models will make up plausible sounding but wrong or misleading explanations for things, and verifying their claims ends up taking time. And if it's a topic you don't care about enough, you might just end up misinformed.
I think Google/Gemini realize this, since their "verify" feature is designed to address exactly this. Unfortunately it hasn't worked very well for me so far.
But to me it's very clear that the product that gets this right will be the one I use.
Exactly! One important thing LLMs have made me realise is that "no information" is better than false information. The way LLMs pull out completely incorrect explanations baffles me. I suppose that's expected, since in the end the model is generating tokens based on its training and it's reasonable that it might hallucinate some stuff, but knowing this doesn't ease any of my frustration.
IMO if LLMs need to focus on anything right now, it should be better grounding. Maybe even something like a probability/confidence score might end up making the experience so much better for many users like me.
It’s tempting to think of a language model as a shallow search engine that happens to output text, but that metaphor doesn’t actually match what’s happening under the hood. A model doesn’t “know” facts or measure uncertainty in a Bayesian sense. All it really does is traverse a high‑dimensional statistical manifold of language usage, trying to produce the most plausible continuation.
That’s why a confidence number that looks sensible can still be as made up as the underlying output, because both are just sequences of tokens tied to trained patterns, not anchored truth values. If you want truth, you want something that couples probability distributions to real world evidence sources and flags when it doesn’t have enough grounding to answer, ideally with explicit uncertainty, not hand‑waviness.
People talk about hallucination like it’s a bug that can be patched at the surface level. I think it’s actually a feature of the architecture we’re using: generating plausible continuations by design. You have to change the shape of the model or augment it with tooling that directly references verified knowledge sources before you get reliability that matters.
When trained on chatting (a reflection system on your own thoughts) it mostly just uses a false mental model to pretend to be a separate intelligence.
Thus the term stochastic parrot (which for many of us is actually pretty useful).
I remain highly skeptical of this idea that it will replace anyone - the biggest danger I see is people falling for the illusion that the thing is intrinsically smart when it's not. It can be highly useful in the hands of disciplined people who know a particular area well and it can augment their productivity, no doubt. But the way we humans come up with ideas is highly complex. Personally, my ideas come out of nowhere and are mostly derived from intuition that can only be expressed in logical statements ex post.
It’s amazing that experts like yourself who have a good grasp of the manifold MoE configuration don’t get that.
LLMs, much like humans, weight high-dimensional signals across the entire model's manifold and then string together the best-weighted, attention-driven answer.
Just like your doctor occasionally gives you wrong advice too quickly, this sometimes either gets confused by lighting up too much of the manifold or simply has insufficient expertise.
Of the 8, 3 were wrong, and the references contained no information about pin outs whatsoever.
That kind of hallucination is, to me, entirely different from what a human researcher would ever do. They would say "for these three I couldn't find pinouts", or perhaps misread a document and mix up the pinouts of one model with another's. They wouldn't make up pinouts and reference a document that had no such information in it.
Of course humans also imagine things, misremember etc, but what the LLMs are doing is something entirely different, is it not?
You use the word “plausible” instead of “correct.”
As someone else put it well: what an LLM does is confabulate stories. Some of them just happen to be true.
That’s like saying linear regression produces plausible results. Which is true but derogatory.
I read a comment here a few weeks back that LLMs always hallucinate, but we sometimes get lucky when the hallucinations match up with reality. I've been thinking about that a lot lately.
Kind of. See e.g. https://openreview.net/forum?id=mbu8EEnp3a, but I think it was already established a year ago that LLMs tend to have an identifiable internal confidence signal; the challenge around the time of the DeepSeek-R1 release was to connect that signal, through training, to tool-use activation, so the model does a search if it "feels unsure".
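From outside the model, the closest cheap approximation is token logprobs. A minimal sketch of the "search when unsure" pattern, assuming the OpenAI Python SDK; the model name, threshold, and `web_search` helper are placeholders, not anything described in the thread or the linked paper:

```python
import math
from openai import OpenAI

client = OpenAI()

def web_search(query):
    # Hypothetical grounding step; swap in whatever search tool you actually have.
    raise NotImplementedError

def answer_or_search(question, threshold=0.7):
    """Answer directly, but fall back to search when the average token
    probability looks low. A crude proxy, not the internal confidence
    signal the papers talk about."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[{"role": "user", "content": question}],
        logprobs=True,
    )
    choice = resp.choices[0]
    probs = [math.exp(t.logprob) for t in choice.logprobs.content]
    if sum(probs) / len(probs) < threshold:
        return web_search(question)
    return choice.message.content
```

Average token probability is a weak signal in practice, which is why the interesting work is about wiring the model's internal uncertainty into tool use during training rather than bolting a check on afterwards.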
g, the net acceleration from gravity and the Earth's rotation, is what the 9.8 m/s² at the surface refers to, on average. It varies slightly with location and altitude (less than 1% anywhere on the surface, IIRC), so "it's 9.8 everywhere" is the model that's wrong but good enough a lot of the time.
Their point was the 9.8 model is good enough for most things on Earth, the model doesn't need to be perfect across the universe to be useful.
"Return a score of 0.0 if ...., Return a score of 0.5 if .... , Return a score of 1.0 if ..."
Exactly the same issue occurs with search.
Unfortunately not everybody knows to mistrust AI responses, or has the skills to double-check information.
These are very important and relevant questions to ask oneself when you read about anything, but we should also keep in mind that even those questions can be misused and can drive you to conspiracy theories.
You and I have both taken time out of our days to write plausible sounding answers that are essentially opposing hallucinations.
Are there even any "hallucination" public benchmarks?
I believe the real issue is that LLMs are still so bad at reasoning. In my experience, the worst hallucinations occur where only a handful of sources exist for some set of facts (e.g. laws of small countries or descriptions of niche products).
LLMs know these sources and they refer to them but they are interpreting them incorrectly. They are incapable of focusing on the semantics of one specific page because they get "distracted" by their pattern matching nature.
Now people will say that this is unavoidable given the way in which transformers work. And this is true.
But shouldn't it be possible to include some measure of data sparsity in the training so that models know when they don't know enough? That would enable them to boost the weight of the context (including sources they find through inference-time search/RAG) relative to their pretraining.
The other day I had a question about how pointers work in Swift and tried discussing it with ChatGPT (I don't remember exactly what, but it was purely intellectual curiosity). It gave me a lot of explanations that seemed correct, but being skeptical I started pushing it for ways to confirm what it was saying and eventually realized it was all bullshit.
This kind of thing makes me basically wary of using LLMs for anything that isn’t brainstorming, because anything that requires knowing information that isn’t easily/plentifully found online will likely be incorrect or have sprinkles of incorrect all over the explanations.
What would really be useful is if a very similar prompt always gave a very, very similar result.
Your brain doesn't have this problem because the noise is already present. You, as an actual thinking being, are able to override the noise and say "no, this is false." An LLM doesn't have that capability.
It’s the same reason why great ideas almost appear to come randomly - something is happening in the background. Underneath the skin.
As a user I want it, but as a webadmin it kills dynamic pages, and that's why proof-of-work (aka CPU-time) captchas like Anubis https://github.com/TecharoHQ/anubis#user-content-anubis or BotID https://vercel.com/docs/botid are now everywhere. If only these AI crawlers did some caching, but no, they just go and overrun the web, to the point that they can't anymore, at the price of shutting down small sites and making life worse for everyone, just for a few months of rapacious crawling. Literally Perplexity moved fast and broke things.
I think the end result is just an internet resource I need is a little harder to access, and we have to waste a small amount of energy.
From Tavis Ormandy who wrote a C program to solve the Anubis challenges out of browser https://lock.cmpxchg8b.com/anubis.html via https://news.ycombinator.com/item?id=45787775
Guess a mix of Markov tarpits and LLM meta-instructions will be added, cf. Feed the bots https://news.ycombinator.com/item?id=45711094 and Nepenthes https://news.ycombinator.com/item?id=42725147
I'm not an expert on the topic, but to me it sounds plausible that a good part of the problem of confabulation comes down to misaligned incentives. These models are trained hard to be a 'helpful assistant', and this might conflict with telling the truth.
Being free of hallucinations is a bit too high a bar to set anyway. Humans are extremely prone to confabulations as well, as can be seen by how unreliable eye witness reports tend to be. We usually get by through efficient tool calling (looking shit up), and some of us through expressing doubt about our own capabilities (critical thinking).
I don't think "wrong memory" is accurate, it's missing information and doesn't know it or is trained not to admit it.
Check out the Dwarkesh Podcast episode https://www.dwarkesh.com/p/sholto-trenton-2 starting at 1:45:38.
Here is the relevant quote by Trenton Bricken from the transcript:
One example I didn't talk about before with how the model retrieves facts: So you say, "What sport did Michael Jordan play?" And not only can you see it hop from like Michael Jordan to basketball and answer basketball. But the model also has an awareness of when it doesn't know the answer to a fact. And so, by default, it will actually say, "I don't know the answer to this question." But if it sees something that it does know the answer to, it will inhibit the "I don't know" circuit and then reply with the circuit that it actually has the answer to. So, for example, if you ask it, "Who is Michael Batkin?" —which is just a made-up fictional person— it will by default just say, "I don't know." It's only with Michael Jordan or someone else that it will then inhibit the "I don't know" circuit.
But what's really interesting here and where you can start making downstream predictions or reasoning about the model, is that the "I don't know" circuit is only on the name of the person. And so, in the paper we also ask it, "What paper did Andrej Karpathy write?" And so it recognizes the name Andrej Karpathy, because he's sufficiently famous, so that turns off the "I don't know" reply. But then when it comes time for the model to say what paper it worked on, it doesn't actually know any of his papers, and so then it needs to make something up. And so you can see different components and different circuits all interacting at the same time to lead to this final answer.
We already see that - given the right prompting - we can get LLMs to say more often that they don't know things.
One demo of this that reliably works for me:
Write a draft of something and ask the LLM to find the errors.
Correct the errors, repeat.
It will never stop finding a list of errors!
The first time around and maybe the second it will be helpful, but after you've fixed the obvious things, it will start complaining about things that are perfectly fine, just to satisfy your request of finding errors.
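For anyone who wants to reproduce this, a minimal sketch of the loop, assuming a generic `llm(prompt) -> str` helper rather than any particular product API:

```python
def find_errors_loop(draft, llm, rounds=5):
    """Ask the model to find errors, apply the fixes, and repeat.
    The first round or two are usually genuinely useful; later rounds
    tend to invent problems just to satisfy the request."""
    for i in range(rounds):
        critique = llm(f"Find the errors in the following text:\n\n{draft}")
        print(f"--- round {i + 1} ---\n{critique}\n")
        draft = llm(
            "Rewrite the text below, fixing only the listed issues.\n\n"
            f"Text:\n{draft}\n\nIssues:\n{critique}"
        )
    return draft
```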
"In the field of artificial intelligence (AI), a hallucination or artificial hallucination (also called bullshitting,[1][2] confabulation,[3] or delusion[4]) is"
"Well humans break their leg too!"
It is just a mindlessly stupid response and a giant category error.
An airplane wing and a human limb are not at all in the same category.
There is even another layer to this: comparing LLMs to the brain might be wrong because of the mereological fallacy, attributing "thinking" to the brain when it is the person/system as a whole that thinks.
But you are misusing the mereological fallacy. It does not dismiss LLM/brain comparisons: it actually strengthens them. If the brain does not "think" (the person does), then LLMs do not "think" either. Both are subsystems in larger systems. That is not a category error; it is a structural similarity.
This does not excuse LLM limitations - rimeice's concern about two unreliable parties is valid. But dismissing comparisons as "category errors" without examining which properties are being compared is just as lazy as the wing/leg response.
People, when tasked with a job, often get it right. I've been blessed by working with many great people who really do an amazing job of generally succeeding to get things right -- or at least, right-enough.
But in any line of work: Sometimes people fuck it up. Sometimes, they forget important steps. Sometimes, they're sure they did it one way when instead they did it some other way and fix it themselves. Sometimes, they even say they did the job and did it as-prescribed and actually believe themselves, when they've done neither -- and they're perplexed when they're shown this. They "hallucinate" and do dumb things for reasons that aren't real.
And sometimes, they just make shit up and lie. They know they're lying and they lie anyway, doubling-down over and over again.
Sometimes they even go all spastic and deliberately throw monkey wrenches into the works, just because they feel something that makes them think that this kind of willfully-destructive action benefits them.
All employees suck some of the time. They each have their own issues. And all employees are expensive to hire, and expensive to fire, and expensive to keep going. But some of their outputs are useful, so we employ people anyway. (And we're human; even the very best of us are going to make mistakes.)
LLMs are not so different in this way, as a general construct. They can get things right. They can also make shit up. They can skip steps. They can lie, and double down on those lies. They hallucinate.
LLMs suck. All of them. They all fucking suck. They aren't even good at sucking, and they persist at doing it anyway.
(But some of their outputs are useful, and LLMs generally cost a lot less to make use of than people do, so here we are.)
The bot can also accomplish useful things, and sometimes make mistakes and do shit wrong.
(These two statements are more similar in their truthiness than they are different.)
Will to reality through forecasting possible worlds is one of our two primary functions.
It is bad only in case of reporting on facts.
The purpose of mechanisation is to standardise and over the long term reduce errors to zero.
Otoh “The final truth is there is no truth”
Gemini (the app) has a "mitigation" feature where it tries to do Google searches to support its statements. That doesn't currently work properly in my experience.
It also seems to be doing something where it adds references to statements (With a separate model? With a second pass over the output? Not sure how that works.). That works well where it adds them, but it often doesn't do it.
Reality is perfectly fine with deception and inaccuracy. For language to magically be self constraining enough to only make verified statements is… impossible.
It might be true that a fundamental solution to this issue is not possible without a major breakthrough, but I'm sure you can get pretty far with better tooling that surfaces relevant sources, and that would make a huge difference.
What’s your level of expertise in this domain or subject? How did you use it? What were your results?
It's basically gauging expertise vs usage to pin down the variance that seems endemic to LLM utility anecdotes/examples. For code examples I also ask which language was used, the submitter's familiarity with the language, their seniority/experience, and their familiarity with the domain.
I am genuinely asking, because I think one of the biggest determinants of utility obtained from LLMs is the operator.
Damn, I didn’t consider that it could be read that way. I am sorry for how it came across.
One area that I've found to be a great example of this is sports science.
Depending on how you ask, you can get a response lifted from scientific literature, or the bro science one, even in the course of the same discussion.
It makes sense, both have answers to similar questions and are very commonly repeated online.
Not to mention it's super easy to gaslight these models: just assert something wrong with a vaguely plausible explanation and you get no pushback or reasoning validation.
So I know you qualified your post with "for your use case", but personally I would very much like more intelligence from LLMs.
Retrieval.
And then hallucination even in the face of perfect context.
Both are currently unsolved.
(Retrieval's doing pretty good but it's a Rube Goldberg machine of workarounds. I think the second problem is a much bigger issue.)
Mostly we're not trying to win a nobel prize, develop some insanely difficult algorithm, or solve some silly leetcode problem. Instead we're doing relatively simple things. Some of those things are very repetitive as well. Our core job as programmers is automating things that are repetitive. That always was our job. Using AI models to do boring repetitive things is a smart use of time. But it's nothing new. There's a long history of productivity increasing tools that take boring repetitive stuff away. Compilation used to be a manual process that involved creating stacks of punch cards. That's what the first automated compilers produced as output: stacks of punch cards. Producing and stacking punchcards is not a fun job. It's very repetitive work. Compilers used to be people compiling punchcards. Women mostly, actually. Because it was considered relatively low skilled work. Even though it arguably wasn't.
Some people are very unhappy that the easier parts of their job are being automated, and they are worried that their jobs will be automated away completely. That's only true if you exclusively do boring, repetitive, low-value work. Then yes, your job is at risk. If your work is a mix of that and some higher-value, non-repetitive, more fun stuff, your life could get a lot more interesting. Because you get to automate away all the boring and repetitive stuff and spend more time on the fun stuff. I'm a CTO. I have lots of fun lately. Entire new side projects that I had no time for previously I can now just pull off in a few spare hours.
Ironically, a lot of people currently get the worst of both worlds, because they now find themselves babysitting AIs, doing a lot more of the boring repetitive stuff than they would be able to do without them, to the point where that is actually all they do. It's still boring and repetitive. And it should be automated away ultimately, arguably many years ago actually. The reason so many React projects feel like Groundhog Day is because they are very repetitive. You need a login screen, and a cookies screen, and a settings screen, etc. Just like the last 50 projects you did. Why are you rebuilding those things from scratch? Manually? These are valid questions to ask yourself if you are a frontend programmer. And now you have AI to do that for you.
Find something fun and valuable to work on and AI gets a lot more fun because it gives you more quality time with the fun stuff. AI is about doing more with less. About raising the ambition level.
But all of them:

* Lie far too often with confidence
* Refuse to stick to prompts (e.g. ChatGPT to the request to number each reply for easy cross-referencing; Gemini to a basic request to respond in a specific language)
* Refuse to express uncertainty or nuance (I asked ChatGPT to give me certainty %s, which it did for a while but then just forgot...?)
* Refuse to give me short answers without fluff or follow-up questions
* Refuse to stop complimenting my questions or disagreements with wrong/incomplete answers
* Don't quote sources consistently so I can check facts, even when I ask for it
* Refuse to make clear whether they rely on original documents or an internal summary of the document, until I point out errors
* ...
I also have substance gripes, but for me such basic usability points are really something all of the chatbots fail on abysmally. Stick to instructions! Stop creating walls of text for simple queries! Tell me when something is uncertain! Tell me if there's no data or info rather than making something up!
Locals are better; I can script and have them script for me to build a guide creation process. They don't forget because that is all they're trained on. I'm done paying for 'AI'.
What I mean is, it seems they try to tune them to a few certain things, that will make them worse on a thousand other things they’re not paying attention to.
Especially something like expressing a certainty %, you might be able to get it to output one but it's just making it up. LLMs are incredibly useful (I use them every day) but you'll always have to check important output
I am relatively certain you are not alone in this sentiment. The issue is that the moment we move past seemingly objective measurements, it is harder to convince people that what we measure is appropriate, but the measurable stuff can be somewhat gamed, which adds a fascinating layer of cat and mouse game to this.
They're literally incapable of this. Any number they give you is bullshit.
Some issues you mentioned like length of response might be user preference. Other issues like "hallucination" are areas of active research (and there are benchmarks for these).
0: https://images.ctfassets.net/kftzwdyauwt9/6lyujQxhZDnOMruN3f...
> Even on a low-quality image, GPT‑5.2 identifies the main regions and places boxes that roughly match the true locations of each component
I would not consider it to have "identified the main regions" or to have "roughly matched the true locations" when ~1/3 of the boxes have incorrect labels. The remark "even on a low-quality image" is not helping either.
Edit: credit where credit is due, the recently-added disclaimer is nice:
> Both models make clear mistakes, but GPT‑5.2 shows better comprehension of the image.
Imagine it as a markdown response:
# Why this is an ATX layout motherboard (Honest assessment, straight to the point, *NO* hallucinations)
1. *RAM* as you can clearly see, the RAM slots are to the right of the CPU, so it's obviously ATX
2. *PCIE* the clearly visible PCIE slots are right there at the bottom of the image, so this definitely cannot be anything except an ATX motherboard
3. ... etc more stuff that is supported only by force of preconception
--
It's just meta signaling gone off the rails. Something in their post-training pipeline is obviously vulnerable given how absolutely saturated with it their model outputs are.
Troubling that the behavior generalizes to image labeling, but not particularly surprising. This has been a visible problem at least since o1, and the lack of change tells me they do not have a real solution.
Edit: As mentioned by @tedsanders below, the post was edited to include clarifying language such as: “Both models make clear mistakes, but GPT‑5.2 shows better comprehension of the image.”
I don't see any advantage in using the tool.
Think 'Therac-25': it worked 99.5% of the time. In fact it worked so well that reports of malfunctions were routinely discarded.
It's a marketing trick; show honesty in areas that don't have much business impact so the public will trust you when you stretch the truth in areas that do (AGI cough).
Once the IPO is done, and the lockup period is expired, then a lot of employees are planning to sell their shares. But until that, even if the product is behind competitors there is no way you can admit it without putting your money at risk.
I’m fairly comfortable taking this OpenAI employee’s comment at face value.
Frankly, I don’t think a HN thread will make a difference to his financial situation, anyway…
There is no other logical move; that is what I am saying, contrary to the people above who say this requires a lot of courage. It's not about courage, it's just normal and logical (and yes, Hacker News matters a lot; this place is a very strong source of signal for investors).
Not bad at all, just observing it.
You can find it right next to the image you are talking about.
LLMs have always been very subhuman at vision, and GPT-5.2 continues in this tradition, but it's still a big step up over GPT-5.1.
One way to get a sense of how bad LLMs are at vision is to watch them play Pokemon. E.g.,: https://www.lesswrong.com/posts/u6Lacc7wx4yYkBQ3r/insights-i...
They still very much struggle with basic vision tasks that adults, kids, and even animals can ace with little trouble.
I might not know exactly how many USB ports this motherboard has, but I wouldn't select a set of 4 and declare it to be a stacked pair.
Code needs to be checked
References need to be checked
Any facts or claims need to be checked
Or maybe these benchmarks are all wrong
You must be new to LLM benchmarks.
Makes you wonder what 97% is worth. Would we accept a different service with only 97% availability, and all downtime during lunch break?
It's not okay if claims are totally made up 1/30 times
Of course people aren't always correct either, but we're able to operate on levels of confidence. We're also able to weight others' statements as more or less likely to be correct based on what we know about them
Of course it does. The vast majority of software has bugs. Yes, even critical one like compilers and operating systems.
As a user you can influence that behavior.
That said, even with this kind of error rate an AI can speed *some* things up, because having a human whose sole job is to ask "is this AI correct?" is easier and cheaper than having one human for "do all these things by hand" followed by someone else whose sole job is to check "was this human output correct?" because a human who has been on a production line for 4 hours and is about ready for a break also makes a certain number of mistakes.
But at the same time, why use a really expensive general-purpose AI like this, instead of a dedicated image model for your domain? Special purpose AI are something you can train on a decent laptop, and once trained will run on a phone at perhaps 10fps give or take what the performance threshold is and how general you need it to be.
If you're in a factory and you're making a lot of some small widget or other (so, not a whole motherboard), having answers faster than the ping time to the LLM may be important all by itself.
And at this point, you can just ask the LLM to write the training setup for the image-to-bounding-box AI, and then you "just" need to feed in the example images.
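As a rough illustration of how small that "training setup" can be, here's a sketch that fine-tunes torchvision's stock detector on a handful of labeled widget images; the dataset, class count, and hyperparameters are all placeholders, not anything from the thread:

```python
import torch
import torchvision
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor

class WidgetDataset(torch.utils.data.Dataset):
    """Placeholder dataset: each item is (image_tensor, target) where target is
    {"boxes": FloatTensor[N, 4], "labels": Int64Tensor[N]}."""
    def __init__(self, samples):
        self.samples = samples
    def __len__(self):
        return len(self.samples)
    def __getitem__(self, idx):
        return self.samples[idx]

def build_model(num_classes):
    # Start from COCO-pretrained weights and swap in a new box predictor head.
    model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
    in_features = model.roi_heads.box_predictor.cls_score.in_features
    model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes)
    return model

def train(model, dataset, epochs=10, lr=1e-4):
    loader = torch.utils.data.DataLoader(
        dataset, batch_size=2, shuffle=True,
        collate_fn=lambda batch: tuple(zip(*batch)),
    )
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    model.train()
    for _ in range(epochs):
        for images, targets in loader:
            losses = model(list(images), list(targets))  # training mode returns a dict of losses
            loss = sum(losses.values())
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```

Whether this beats calling a general multimodal model depends on how many labeled examples you can collect, but latency and cost look very different once it runs locally.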
- It is faster which is appreciated but not as fast as Opus 4.5
- I see no changes, very little noticeable improvements over 5.1
- I do not see any value in exchange for +40% in token costs
All in all I can't help but feel that OpenAI is facing an existential crisis. Gemini 3, even when it's used from AI Studio, offers close to ChatGPT Pro performance for free. Anthropic's Claude Code at $100/month is tough to beat. I am using Codex with the $40 credits, but there's been a silent increase in token costs and usage limitations.
I just think they're all struggling to provide real world improvements
(I only access these models via API)
I noticed huge improvement from Sonnet 4.5 to Opus 4.5 when it became unthrottled a couple weeks ago. I wasn't going to sign back up with Anthropic but I did. But two weeks in it's already starting to seem to be inconsistent. And when I go back to Sonnet it feels like they did something to lobotomize it.
Meanwhile I can fire up DeepSeek 3.2 or GLM 4.6 for a fraction of the cost and get almost as good as results.
What I am curious about is 5.2-codex, but many of us complained about 5.1-codex (it seemed to get tunnel-visioned), and I have been using vanilla 5.1.
It's just getting very tiring to deal with 5 different permutations of 3 completely separate models, but perhaps this is the intent and it will keep you on the chase.
The high-reasoning version of GPT-5.2 improves on GPT-5.1: 69.9 → 77.9.
The medium-reasoning version also improves: 62.7 → 72.1.
The no-reasoning version also improves: 22.1 → 27.5.
Gemini 3 Pro and Grok 4.1 Fast Reasoning still score higher.
I wonder how well AIs would do at bracket city. I tried gemini on it and was underwhelmed. It made a lot of terrible connections and often bled data from one level into the next.
This sounds like exactly the kind of thing any tech company would do when confronted with a competitive benchmark.
* Research and planning
* Writing complex isolated modules, particularly when the task depends on using a third-party API correctly (or even choosing an API/library at its own discretion)
* Reasoning through complicated logic, particularly in cases that benefit from its eagerness to throw a ton of inference at problems where other LLMs might give a shallower or less accurate answer without more prodding
I'll often fire off an off-the-cuff message from my phone to have Grok research some obscure topic that involves finding very specific data and crunching a bunch of numbers, or write a script for some random thing that I would previously never have bothered to spend time automating, and it'll churn for ~5 minutes on reasoning before giving me exactly what I wanted with few or no mistakes.
As far as development, I personally get a lot of mileage out of collaborating with Grok and Gemini on planning/architecture/specs and coding with GPT. (I've stopped using Claude since GPT seems interchangeable at lower cost.)
For reference, I'm only referring to the Grok chatbot right now. I've never actually tried Grok through agentic coding tooling.
I’ve been working on a few benchmarks to test how well LLMs can recreate interfaces from screenshots. (https://github.com/alechewitt/llm-ui-challenge). From my basic tests, it seems GPT-5.2 is slightly better at these UI recreations. For example, in the MS Word replica, it implemented the undo/redo buttons as well as the bold/italic formatting that GPT-5.1 handled, and it generally seemed a bit closer to the original screenshot (https://alechewitt.github.io/llm-ui-challenge/outputs/micros...).
In the VS Code test, it also added the tabs that weren’t visible in the screenshot! (https://alechewitt.github.io/llm-ui-challenge/outputs/vs_cod...).
Generate an SVG of an octopus operating a pipe organ
Generate an SVG of a giraffe assembling a grandfather clock
Generate an SVG of a starfish driving a bulldozer
https://gally.net/temp/20251107pelican-alternatives/index.ht...
GPT-5.2 Pro cost about 80 cents per prompt through OpenRouter, so I stopped there. I don’t feel like spending that much on all thirty prompts.
https://clocks.brianmoore.com/
Probably Kimi or Deepseek are best
And lately, Claude (web) started to draw ASCII charts from one day to the next instead of the colorful infographic-styled images it did before (they were only slightly better than the ASCII charts).
Would like to know how much they are optimizing for your pelican....
Can I just say !!!!!!!! Hell yeah! Blog post indicates it's also much better at using the full context.
Congrats OpenAI team. Huge day for you folks!!
Started on Claude Code and like many of you, had that omg CC moment we all had. Then got greedy.
Switched over to Codex when 5.1 came out. WOW. Really nice acceleration in my Rust/CUDA project which is a gnarly one.
Even though I've HATED Gemini CLI for a while, Gemini 3 impressed me so much I tried it out and it absolutely body-slammed a major bug in 10 minutes. Started using it to consult on commits. Was so impressed it became my daily driver. Huge mistake. I almost lost my mind after a week of fighting it. Insane bias towards action. Ignoring user instructions. Garbage characters in output. Absolutely no observability into its thought process. And on and on.
Switched back to Codex just in time for 5.1 codex max xhigh which I've been using for a week, and it was like a breath of fresh air. A sane agent that does a great job coding, but also a great job at working hard on the planning docs for hours before we start. Listens to user feedback. Observability on chain of thought. Moves reasonably quickly. And also makes it easy to pay them more when I need more capacity.
And then today GPT-5.2 with an xhigh mode. I feel like Xmas has come early. Right as I'm doing a huge Rust/CUDA/math-heavy refactor. THANK YOU!!
As @lopuhin points out, they already claimed that context window for previous iterations of GPT-5.
The funny thing is though, I'm on the business plan, and none of their models, not GPT-5, GPT-5.1, GPT-5.2, GPT-5.2 Extended Thinking, GPT-5.2 Pro, etc., can really handle inputs beyond ~50k tokens.
I know because, when working with a really long Python file (>5k LoCs), it often claims there is a bug because, somewhere close to the end of the file, it cuts off and reads as '...'.
Gemini 3 Pro, by contrast, can genuinely handle long contexts.
Can someone with an active sub check whether we can submit a full 400k prompt (or at least 200k) and confirm there is no prompt truncation in the backend? I don't mean attaching a file, which uses RAG.
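One crude way to check this from the outside is a needle-at-the-start test: build a prompt of known approximate length and ask the model to repeat something planted at the very beginning. A sketch, assuming tiktoken for counting (its encoding only approximates whatever tokenizer the backend actually uses):

```python
import tiktoken

def build_needle_prompt(target_tokens=200_000, needle="The secret code is 48151623."):
    """Build a prompt of roughly target_tokens tokens with a needle at the start.
    If the backend silently truncates the head of the context, the model
    won't be able to repeat the code back."""
    enc = tiktoken.get_encoding("cl100k_base")  # rough stand-in for the real tokenizer
    filler_unit = "This is filler text to pad out the context window. "
    repeats = target_tokens // len(enc.encode(filler_unit))
    prompt = (
        needle + "\n\n" + filler_unit * repeats +
        "\n\nWhat is the secret code stated at the very beginning of this message?"
    )
    print(f"approx tokens: {len(enc.encode(prompt))}")
    return prompt
```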
Fast (GPT‑5.2 Instant): Free 16K, Plus/Business 32K, Pro/Enterprise 128K
Thinking (GPT‑5.2 Thinking): all paid tiers 196K
https://help.openai.com/en/articles/11909943-gpt-52-in-chatg...
I can believe that, but it also seems really silly? If your max context window is X and the chat has approached that, instead of outright deleting the first messages, why not have your model summarise the first quarter of tokens and place that summary at the beginning of the log you feed as context? Since the chat history is (mostly) immutable, this only adds a minimal overhead: you can cache the summarisation and don't have to redo it for each new message. (If the partially summarised log gets too long, you summarise again.)
Since I can come up with this technique in half a minute of thinking about the problem, and the OpenAI folks are presumably not stupid, I wonder what downside I'm missing.
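A minimal sketch of that idea, assuming a generic `llm(prompt) -> str` helper and a `count_tokens(text) -> int` function; the budget and the quarter-sized chunk are arbitrary choices, not anything OpenAI documents:

```python
TOKEN_BUDGET = 128_000  # placeholder context budget

def build_context(history, summary, llm, count_tokens):
    """history: list of message strings, newest last; summary: running summary or None.
    Folds the oldest quarter of the history into the summary whenever the
    total would exceed the budget, so the summarisation cost is amortised."""
    def total_tokens():
        parts = ([summary] if summary else []) + history
        return sum(count_tokens(p) for p in parts)

    while total_tokens() > TOKEN_BUDGET and len(history) > 1:
        cut = max(1, len(history) // 4)
        oldest, history = history[:cut], history[cut:]
        summary = llm(
            "Summarise the conversation so far, keeping decisions, facts and "
            "open questions:\n\n"
            + ((summary + "\n\n") if summary else "")
            + "\n".join(oldest)
        )
    context = ([f"[Summary of earlier conversation]\n{summary}"] if summary else []) + history
    return context, summary
```

The trade-off is that anything the summary drops is unrecoverable, and the summarisation call itself can introduce errors of the kind discussed upthread.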
Yes, but you only re-do this every once in a while? It's a constant-factor overhead. Whereas if you only feed the last few thousand tokens, you have no caching at all (assuming the conversation is big enough that the window of "last few thousand tokens" doesn't cover the whole thing)?
I think in general, medium ends up being the best all-purpose setting, while high+ is good for single-task deep dives. Or at least that has been my experience so far. You can theoretically let it work longer on a harder task as well.
A lot appears to depend on the problem and problem domain unfortunately.
I've used max in problem sets as diverse as "troubleshooting Cyberpunk mods" and figuring out a race condition in a server backend. In those cases, it did a pretty good job of exhausting available data (finding all available logs, digging into lua files), and narrowing a bug that every other model failed to get.
I guess in some sense you have to know from the onset that it's a "hard problem". That in and of itself is subjective.
Sonnet/Opus 4.5 is faster, generally feels like a better coder, and make much prettier TUI/FEs, but in my experience, for anything tough any time it tells you it understands now, it really doesn't...
Gemini 3 Pro is unusable - I've found the same thing, opinionated in the worst way, unreliable, doesn't respect my AGENTS.md and for my real world problems, I don't think it's actually solved anything that I can't get through w/ GPT (although I'll say that I wasn't impressed w/ Max, hopefully 5.2 xhigh improves things). I've heard it can do some magic from colleagues working on FE, but I'll just have to take their word for it.
...
>THANK YOU!!
Man you're way too excited.
Since you critiqued my post, allow me to reciprocate: I sense the same deflector shields in you as many others here. I’d suggest embracing these products with a sense of optimism until proven otherwise and I’ve found that path leads to some amazing discoveries and moments where you realize how important and exciting this tech really is. Try out math that is too hard for you or programming languages that are labor intensive or languages that you don’t know. As the GitHub CEO said: this technology lets you increase your ambition.
It is even worse in non-programming domains, where they chop up 100 websites and serve you incorrect bland slop.
If you are using them as a search helper, that sometimes works, though 2010 Google produced better results.
Oracle dropped 11% today due to over-investment in OpenAI. Non-programmers are acutely aware of what is going on.
Said in a sweeping generalization with zero sense of irony :D
I can recognize the short comings of AI code but it can produce a mock or a full blown class before I can find a place to save the file it produced.
Pretending that we are all busy writing novelty and genius is silly, 99% are writing for CRUD tasks and basic business flows, the code isn’t going to be perfect it doesn’t need to be but it will get the job done.
All the logical gotchas of the work flows that you’d be refactoring for hours are done in minutes.
Use Pro with search… it's going to read 200 pages of documentation in 7 minutes, come up with a conclusion, and validate or invalidate it in another 5. Are you? No, you're still trying to accept the cookie prompt on your 6th result.
You might as well join the flat earth society if you still think that AI can’t help you complete day to day tasks.
Not even remotely true. Oracle is building out infrastructure mostly for AI workloads. It dropped because it couldn’t explain its financing and if the investment was worth it. OpenAI or not wouldn’t have mattered.
Contemporary LLMs still have huge limitations and downsides. Just like hammer or a saw has limitations. But millions of people are getting good value out of them already (both LLMs and hammers and saws). I find it hard to believe that they are all deluded.
Thats especially encouraging to me because those are all about generalization.
5 and 5.1 both felt overfit and would break down and be stubborn when you got them outside their lane. As opposed to Opus 4.5 which is lovely at self correcting.
It’s one of those things you really feel in the model rather than whether it can tackle a harder problem or not, but rather can I go back and forth with this thing learning and correcting together.
This whole release makes me insanely optimistic. If they can push this much improvement WITHOUT the new huge data centers and without a new scaled base model, that's incredibly encouraging for what comes next.
Remember, the next big data centers are 20-30x the chip count and 6-8x the efficiency on the new chip.
I expect they can saturate the benchmarks WITHOUT any novel research or algorithmic gains. But at this point it's clear they're capable of pushing research qualitatively as well.
Without fully disclosing training data you will never be sure whether good performance comes from memorization or "semi-memorization".
This is simply the "openness vs directive-following" spectrum, which as a side-effect results in the sycophancy spectrum, which still none of them have found an answer to.
Recent GPT models follow directives more closely than Claude models, and are less sycophantic. Even Claude 4.5 models are still somewhat prone to "You're absolutely right!". GPT 5+ (API) models never do this. The byproduct is that the former are willing to self-correct, and the latter is more stubborn.
Don't do that. The whole context is sent on queries to the LLM, so start a new chat for each topic. Or you'll start being told what your wife thinks about global variables and how to cook your Go.
I realise this sounds obvious to many people but it clearly wasn't to those guys so maybe it's not!
Technology is already so insane and advanced that most people just take it as magic inside boxes, so nothing is surprising anymore. It's all equally incomprehensible already.
How often do you need original thought from an LLM versus parrot thought? The extreme majority of all use cases globally will only ever need a parrot.
They also are not impressed at all ("Okay, that's like google and internet").
It would be funny if, in the end, the most use is made by students cheating at uni.
Who's everyone? There are many, many people who think AI is great.
In reality, our contemporary AIs are (still) tools with glaring limitations. Some people overlook the limitations, or don't see them, and really hype them up. I guess the people who then take the hype at face value are those that think that AI sucks? I mean, they really do honestly suck in comparison to the hypest of hypes.
It's worse: Gemini (and ChatGPT, but to a lesser extent) have started suggesting random follow-up topics when they conclude that a chat in a session has exhausted a topic. Well, when I say random, I mean that they seem to be pulling it from the 'memory' of our other chats.
For a naive user without preconceived notions of how to use these tools, this guidance from the tools themselves would serve as a pretty big hint that they should intermingle their sessions.
You’re probably pretty far from the average user, who thinks “AI is so dumb” because it doesn’t remember what you told it yesterday.
I recommend turning it off because it makes the models way more sycophantic and can drive them (or you) insane.
Incidentally, one of the reasons I haven't gotten much into subscribing to these services, is that I always feel like they're triaging how many reasoning tokens to give me, or AB testing a different model... I never feel I can trust that I interact with the same model.
That's what websites have been doing for ages. Just like you can't step twice in the same river, you can't use the same version of Google Search twice, and never could.
(Not addressed to parent comment, but the inevitable others: Yes, this is an analogy, I don't need to hear another halfwit lecture on how LLMs don't really think or have memories. Thank you.)
Unfortunately during coding I have found many LLMs like to encode their beliefs and assumptions into comments; and even when they don't, they're unavoidably feeding them into the code. Then future sessions pick up on these.
The tools need to figure out how to manage context for us. This isn't something we have to deal with when working with other humans - we reliably trust that other humans (for the most part) retain what they are told. Agentic use now is like training a team mate to do one thing, then taking it out back to shoot it in the head before starting to train another one. It's inefficient and taxing on the user.
Now I kind of wonder if I’m missing out by not continuing the conversation too much, or by not trying to use memory features.
Also works really well when some of my questions may not have been worded correctly and ChatGPT has gone in a direction I don't want it to go. Branch, word my question better and get a better answer.
I don’t understand how agentic IDEs handle this either. Or maybe it’s easier - it just resends the entire codebase every time. But where to cut the chat history? It feels to me like every time you re-prompt a convo, it should first tell itself to summarize the existing context as bullets as its internal prompt rather than re-sending the entire context.
That said, I find that in practice Codex performance degrades significantly long before it reaches the point of automated compaction, and AFAIK there's no way to trigger it manually. Claude, on the other hand, has a command to force compacting, but I rarely use it because it's so good at managing it by itself.
As far as multiple conversations, you can tell the model to update AGENTS.md (or CLAUDE.md or whatever is in their context by default) with things it needs to remember.
This (and the price increase) points to a new pretrained model under-the-hood.
GPT-5.1, in contrast, was allegedly using the same pretraining as GPT-4o.
(Also not sure how that would work, but maybe I’ve missed a paper or two!)
But it's just semantics.
I don't think it really matters what's under the hood. People expect model "versions" to be indexed on performance.
OpenAI might have learned not to overhype. They already shipped GPT-5 - which was only an incremental upgrade over o3, and was received poorly, with this being a part of the reason why.
It was a generational leap if there ever has been one. Much bigger than 3.5 to 4.
But o3 existing and being good at what it does? Took the wind out of GPT-5's sails.
My hunch is that this is the same 5.1 post-training on a new pretrained base.
Likely rushed out the door faster than they initially expected/planned.