
Posted by atgctg 12/11/2025

GPT-5.2 (openai.com)
https://platform.openai.com/docs/guides/latest-model

System card: https://cdn.openai.com/pdf/3a4153c8-c748-4b71-8e31-aecbde944...

1195 points | 1083 comments | page 3
doctoboggan 12/11/2025|
This seems like another "better vibes" release. With the number of benchmarks exploding, random luck means you can almost always find a couple showing what you want to show. I didn't see much concrete evidence this was noticeably better than 5.1 (or even 5.0).

Being a point release though I guess that's fair. I suspect there is also some decent optimizations on the backend that make it cheaper and faster for OpenAI to run, and those are the real reasons they want us to use it.

sebzim4500 12/11/2025||
>I suspect there is also some decent optimizations on the backend that make it cheaper and faster for OpenAI to run, and those are the real reasons they want us to use it.

I doubt it, given it is more expensive than the old model.

rat9988 12/11/2025|||
> I didn't see much concrete evidence this was noticeably better than 5.1

Did you test it?

doctoboggan 12/11/2025||
No, I would like to but I don't see it in my paid ChatGPT plan or in the API yet. I based my comment solely off of what I read in the linked announcement.
BrtByte 12/12/2025||
At this point the benchmark soup is so dense that it's hard to tell signal from selective framing
flkiwi 12/11/2025||
I gave up my OpenAI subscription a few days ago in favor of Claude. My quality of life (and quality of results) has gone up substantially. Several of our tools at work have GPT-5x as their backend model, and it is incredible how frustrating they are to use, how predictable their AI-isms are, and how inconsistent their output is. OpenAI is going to have to do a lot more than an incremental update to convince me they haven't completely lost the thread.
brisket_bronson 12/11/2025||
You are absolutely right!
flkiwi 12/11/2025||
Someone didn't think so, lol. I debated not saying anything because the AI partisans are just so awful.
jpkw 12/12/2025|||
I think the above comment was a joke (Claude frequently says that whenever you challenge it, whether you are right or wrong)
jstummbillig 12/12/2025||
At least this once the AI-ism was not spotted.
flkiwi 12/12/2025||
Goodness no, I chuckled.
petesergeant 12/12/2025||
I have found Codex to be a phenomenal code-review tool, fwiw. Shitty at writing code, _great_ at reviewing it.
Tiberium 12/11/2025||
The only table where they showed comparisons against Opus 4.5 and Gemini 3:

https://x.com/OpenAI/status/1999182104362668275

https://i.imgur.com/e0iB8KC.png

varenc 12/11/2025|
100% on the AIME (assuming it's not in the training data) is pretty impressive. I got like 4/15 when I was in HS...
hellojimbo 12/11/2025||
The no-tools part is impressive; with tools, every model gets 100%
varenc 12/11/2025||
If I recall, AIME answers are always integers from 0 to 999. And most of the problems are of the type where, if you have a candidate number, it's reasonable to validate its correctness. So it's easy to brute force all the candidate integers with code.

tl;dr: humans would do much better too if they could use programming tools :)
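The brute-force approach can be sketched in a few lines, assuming a small bounded answer range (AIME answers fall in 0–999) and a cheap per-problem checker. `satisfies` below is invented purely for illustration; each real problem would need its own checker:

```python
# Minimal sketch of the brute-force idea: AIME answers are small integers,
# so if a problem admits a cheap way to *check* a candidate answer, you can
# simply try every possibility. `satisfies` is a hypothetical checker for
# one invented problem: the smallest non-negative n whose square ends in 444.

def satisfies(n):
    return n ** 2 % 1000 == 444

def brute_force(check, limit=1000):
    # Try every candidate answer in the allowed range.
    for n in range(limit):
        if check(n):
            return n
    return None

answer = brute_force(satisfies)
print(answer)  # 38, since 38**2 = 1444
```

The hard part, of course, is that writing a correct checker usually requires understanding the problem, which is where a human with a laptop would still spend most of their time.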

Davidzheng 12/12/2025||
Uh, no, it's not solved by looping over candidate numbers when it uses tools
blitz_skull 12/12/2025||
Again I just tap the sign.

All of your benchmarks mean nothing to me until you include Claude Sonnet on them.

In my experience, GPT hasn’t been able to compete with Claude in years for the daily “economically valuable” tasks I work on.

jstummbillig 12/12/2025||
Since, as per Anthropic's own benchmarks, Sonnet 4.5 is beaten by Opus 4.5, would it not suffice to infer the rest?

https://x.com/OpenAI/status/1999182104362668275

nextworddev 12/12/2025||
Claude is pretty trash for anything besides coding
wyre 12/12/2025|||
What are you basing that on? Between Sonnet and Opus I don't think I'm reaching for Gemini 3 at all.
romanovcode 12/12/2025||||
Yeah, but that is the whole point of Claude. And that's why we are interested in the comparison.
timmg 12/12/2025|||
That hasn't been my experience at all. I always wondered if we just get used to how to prompt a given model and that it's hard to transition to another.
ComputerGuru 12/11/2025||
Wish they would include or leak more info about what this is, exactly. 5.1 was just released, yet they are claiming big improvements (on benchmarks, obviously). Did they purposely hold back the best they had, to keep some cards to play in case Gemini 3 succeeded, or is this a tweak that uses more time/tokens to get better output, or what?
eldenring 12/11/2025||
I'm guessing they were waiting to figure out more efficient serving before a release, and have decided to eat the inference cost temporarily to stay at the frontier.
famouswaffles 12/11/2025|||
OpenAI sat on GPT-4 for 8 months and even released GPT-3.5 months after GPT-4 was trained. While I don't expect such big lag times anymore, it's generally a given that the public is behind whatever models they have internally at the frontier. By all indications, they did not want to release this yet, and only did so because of Gemini 3 Pro.
nathan-wall 12/12/2025|||
If you look at their own chart[1], it shows 5.1 was lagging behind Gemini 3 Pro in almost every score listed there, sometimes significantly. They needed to come out with something to stay ahead. I'm guessing they threw what they had at their disposal together to keep the lead as long as they can.

It sounds like 5.2 has a more recent knowledge cutoff; a reasonable guess is that they already had that but were trying to squeeze bigger improvements out of it for a more major 5.5 release before Gemini 3 Pro came out, and then they had to rush something out.

Also, 5.2 has a new "Extended Thinking" option for Pro. I'm guessing they just turned up a lever that tells it to think even longer, which helps them score higher, even if it does take a long time. (One thing about Gemini 3 Pro is that it's very fast relative to even ChatGPT 5.1 Pro Thinking. The scores they're putting out to show they're staying ahead aren't showing that piece.)

[1] https://imgur.com/e0iB8KC

dalemhurley 12/11/2025||
My guess is they develop multiple models in parallel.
youngermax 12/12/2025||
Isn't it interesting how this incremental release includes so many testimonials from companies who claim the model has improved? It also focuses on "economically valuable tasks." There was nothing of this sort in GPT-5.1's release. Looks like OpenAI is feeling the pressure from investors now.
sfmike 12/11/2025||
Everything is still based on 4o, right? Is training a new model just too expensive? Maybe they can consult the DeepSeek team about cost-constrained new models.
elgatolopez 12/11/2025||
Where did you get that from? The cutoff date says August 2025. Looks like a newly pretrained model.
FergusArgyll 12/11/2025|||
> This stands in sharp contrast to rivals: OpenAI’s leading researchers have not completed a successful full-scale pre-training run that was broadly deployed for a new frontier model since GPT-4o in May 2024, highlighting the significant technical hurdle that Google’s TPU fleet has managed to overcome.

- https://newsletter.semianalysis.com/p/tpuv7-google-takes-a-s...

It's also plainly obvious from using it. The "broadly deployed" qualifier is presumably referring to GPT-4.5.

ric2b 12/13/2025||
How is that a technical hurdle if they obviously were able to do it before?

It's probably just a question of cost/benefit analysis, it's very expensive to do, so the benefits need to be significant.

SparkyMcUnicorn 12/11/2025|||
If the pretraining rumors are true, they're probably using continued pretraining on the older weights. Right?
verdverm 12/11/2025|||
Apparently they have not had a successful pre training run in 1.5 years
fouronnes3 12/11/2025|||
I want to read a short sci-fi story set in 2150 about how, mysteriously, no one has been able to train a better LLM for 125 years. The binary weights are studied with unbelievably advanced quantum computers, but no one can really train a new AI from scratch. This sparks cults, wars, and legends, and ultimately (by the third book) leads to the main protagonist learning to code by hand, something that no human left alive still knows how to do. Could this be the secret to making a new AI from scratch, more than a century later?
WhyOhWhyQ 12/11/2025|||
There's a scifi short story about a janitor who knows how to do basic arithmetic and becomes the most important person in the world when some disaster happens. Of course after things get set up again due to his expertise, he becomes low status again.
bradfitz 12/11/2025||
I had to go look that up! I assume that's https://en.wikipedia.org/wiki/The_Feeling_of_Power ? (Not a janitor, but "a low grade Technician"?)
WhyOhWhyQ 12/11/2025||
Hmm, it could be a false memory, since this was almost 15 years ago, but I really do remember it differently than the text of 'The Feeling of Power'.
YouAreWRONGtoo 12/12/2025||
[dead]
verdverm 12/11/2025||||
You can ask a 2025 AI to write such a book; it's happy to comply and may or may not actually write it.

https://www.pcgamer.com/software/ai/i-have-been-fooled-reddi...

ssl-3 12/11/2025||||
Sounds good.

Might sell better with the protagonist learning iron age leatherworking, with hides tanned from cows that were grown within earshot, as part of a process of finding the real root of the reason for why any of us ever came to be in the first place. This realization process culminates in the formation of a global, unified steampunk BDSM movement and a wealth of new diseases, and then: Zombies.

(That's the end. Zombies are always the end.)

astrange 12/12/2025|||
This is somewhat similar to a Piers Anthony series that I suspect no one has ever read except for me.

What was with that guy anyway.

wafflemaker 12/11/2025|||
Sorry, but compared with the parent, my money is on you, ssl-3. Do you get better results from prompting by being more poetic?
ssl-3 12/12/2025||
> Do you get better results from prompting by being more poetic?

Is that yet-another accusation of having used the bot?

I don't use the bot to write English prose. If something I write seems particularly great or poetic or something, then that's just me: I was in the right mood, at the right time, with the right idea -- and with the right audience.

When it's bad or fucked-up, then that's also just me. I most-assuredly fuck up plenty.

They can't all be zingers. I'm fine with that.

---

I do use the hell out of the bot for translating my ideas (and the words that I use to express them) into languages that I can't speak well, like Python, C, and C++. But that's very different. (And at least so far I haven't shared any of those bot outputs with the world at all, either.)

So to take your question very literally: No, I don't get better results from prompting being more poetic. The responses to my prompts don't improve by those prompts being articulate or poetic.

Instead, I've found that I get the best results from the bot fastest by carrying a big stick, and using that stick to hammer and welt it into compliance.

Things can get rather irreverent in my interactions with the bot. Poeticism is pretty far removed from any of that business.

wafflemaker 12/12/2025||
No. I just genuinely liked your style, and didn't notice previous posts by you. I haven't yet learned to look at names on hn, it's mostly anonymous posts for me. No snark here. And was also genuinely curious if better writing style yields better results.

I've observed that using proper grammar gives slightly better answers. And using more "literacy"(?) kind of language in prompts sometimes gives better answers and sometimes just more interesting ones, when bots try to follow my style.

Sorry for using the word poetic, I'm travelling and sleep deprived and couldn't find the proper word, but didn't want to just use "nice" instead either.

ssl-3 12/12/2025||
It's all good. I'm largely "face-blind", myself, in that I don't often recognize others in person or online -- which is certainly not to say that I think I'm particularly memorable myself.

As to the bot: Man, I beat the bot to death. It's pretty brutal.

I'm profane and demanding because that's the most terse language I know how to construct in English.

When I set forth to have the bot do a thing for me, the slowest part of the process that I can improve on my part is the quantity of the words that I use.

I can type fast and think fast, but my one-letter-at-a-time response to the bot is usually the only part that I can make a difference with. So I tend to be very terse.

"a+b=c, you fuck!" is certainly terse, unambiguous, and fast to type, so that's my usual style.

Including the emphatic "you fuck!" appendage seems to stir up the context more than without. Its inclusion or omission is a dial that can be turned.

Meanwhile: "I have some reservations about the proposed implementation. Might it be possible for you to revise it so as to be in a different form? As previously discussed, it is my understanding that a+b=c. Would you like to try again to implement a solution that incorporates this understanding?" is very slow to write.

They both get similar results. One method is faster for me than the other, just because I can only type so fast. The operative function of the statement is ~the same either way.

(I don't owe the bot anything. It isn't alive. It is just a computer running a program. I could work harder to be more polite, empathetic, or cordial, but: It's just code running on a box somewhere in a datacenter that is raising my electric rate and making the RAM for my next system upgrade very expensive. I don't owe it anything, much less politeness or poeticism.

Relatedly, my inputs at the bash prompt on my home computer are also very terse. For instance I don't have any desire or ability to be polite to bash; I just issue commands like ls and awk and grep without any filler-words or pleasantries. The bot is no different to me.

When I want something particularly poetic or verbose as output from the bot, I simply command it to be that way.

It's just a program.)

georgefrowny 12/11/2025||||
A software version of Asimov's Holmes-Ginsbook device? https://sfwritersworkshop.org/node/1232

I feel like there was a similar one about software, but it might have been mathematics (also Asimov: The Feeling of Power)

barrenko 12/11/2025||||
Monsieur, if I may offer a vaaaguely similar story on how things may progress https://www.owlposting.com/p/a-body-most-amenable-to-experim...
armenarmen 12/11/2025|||
I’d read it!
ijl 12/11/2025|||
What kind of issues could prevent a company with such resources from doing that?
verdverm 12/11/2025||
Drama if I had to pick the symptom most visible from the outside.

A lot of talent left OpenAI around that time, most notably in this regard would be Ilya in May '24. Remember that time Ilya and the board ousted Sam only to reverse it almost immediately?

https://arstechnica.com/information-technology/2024/05/chief...

Wowfunhappy 12/11/2025|||
I thought whenever the knowledge cutoff increased that meant they’d trained a new model, I guess that’s completely wrong?
rockinghigh 12/11/2025|||
They add new data to the existing base model via continued pre-training. You save on the full pre-training run (the next-token prediction task), but still have to re-run mid- and post-training stages like context-length extension, supervised fine-tuning, reinforcement learning, safety alignment ...
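As a toy sketch of the distinction (nothing to do with OpenAI's actual pipeline; the one-parameter model, data, and learning rates are all made up): continued pre-training keeps the existing weights and simply keeps optimizing on new data, usually at a lower learning rate, instead of re-initializing from scratch.

```python
# Toy "continued pre-training": reuse learned weights instead of restarting.

def sgd(w, data, lr, steps=200):
    # Minimize mean squared error of y ~ w*x by gradient descent.
    for _ in range(steps):
        grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
        w -= lr * grad
    return w

old_data = [(1.0, 2.0), (2.0, 4.0)]       # consistent with w = 2.0
new_data = [(1.0, 2.1), (3.0, 6.3)]       # slight drift: now w = 2.1

w = sgd(0.0, old_data, lr=0.1)            # original pre-training run
w = sgd(w, new_data, lr=0.01, steps=500)  # continued pre-training on new data
print(round(w, 3))  # ~2.1: old weights nudged toward the new data
```

In a real LLM the "drift" is new world knowledge, and the hard part is everything this toy omits: not forgetting the old data while fitting the new.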
astrange 12/12/2025||
Continuous pretraining has issues because it starts forgetting the older stuff. There is some research into other approaches.
brokencode 12/11/2025|||
Typically I think, but you could pre-train your previous model on new data too.

I don’t think it’s publicly known for sure how different the models really are. You can improve a lot just by improving the post-training set.

catigula 12/11/2025||
The irony is that Deepseek is still running with a distilled 4o model.
blovescoffee 12/11/2025||
Source?
tpurves 12/11/2025||
Undoubtedly each new model from OpenAI has numerous training and orchestration improvements, etc.

But how much of each product they release is also just a factor of how much they are willing to spend on inference per query in order to stay competitive?

I always wonder how much is technical change vs. turning a knob up and down on hardware and power consumption.

GPT-5.0, for example, seemed like a lot of changes more for OpenAI's internal benefit (terser responses, dynamic 'auto' mode to scale down thinking when not required, etc.)

Wondering if GPT-5.2 is also a case of them, in 'code red' mode, just turning what they already have up to 11 as the fastest way to respond to fiercer competition.

simonsarris 12/12/2025|
I always liked the definition of technology as "doing more with less". 100 oxen replaced by 1 gallon of diesel, etc.

That it costs more does suggest it's "doing more with more", at least.

psychoslave 12/12/2025||
Good luck reproducing and eating diesel the way you can with oxen and related species.

Humanity won't always be able to tap into this highly compressed energy stock, which was generated through processes taking literally geological time scales to be achieved.

That is, technology is more about what alternative tradeoffs can we leverage on to organize differently with resources at hand.

Frugality can definitely be a possible way to shape the technologies we want to deploy. But it's not all possible technologies, just a subset.

Also, better technology doesn't necessarily bring societies to moral and well-being excellence. Improving technology for efficient genocide, for example, brings human disaster as the obvious outcome, even if it's done in the most green, zero-carbon-emissions manner and grows more forests than the specifications ever promised.

sigmar 12/11/2025||
Are there any specifics about how this was trained, especially when 5.1 is only a month old? I'm a little skeptical of benchmarks these days and wish they'd put this up on LMArena.

edit: noticed 5.2 is ranked in the WebDev Arena (#2, tied with gemini-3.0-pro), but not yet in the text arena (last update 22 hrs ago)

emp17344 12/11/2025||
I’m extremely skeptical because of all those articles claiming OpenAI was freaking out about Gemini - now it turns out they just casually had a better model ready to go? I don’t buy it.
Workaccount2 12/11/2025|||
I (and others) have a strong suspicion that they can modulate model intelligence in almost real time by adjusting quantization and thinking time.

It seems if anyone wants, they can really gas a model up in the moment and back it off after the hype wave.

qeternity 12/12/2025|||
Quantization is not some magical dial you can just turn. In practice you basically have three choices: fp16, fp8, and fp4.

Also, more thinking time means more tokens, which cost more, especially at the API level where you are paying per token and it would be trivially observable.

There is basically no evidence that either of these are occurring in the way you suggest (boosting up and down).
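The coarseness of those steps shows up directly in the memory arithmetic. A quick sketch, using a hypothetical 100B-parameter model and counting weights only (ignoring KV cache and activations):

```python
# Weight memory per precision level; each quantization step halves the
# footprint, which is why the choices are coarse rather than a smooth dial.

PARAMS = 100e9  # hypothetical parameter count

def weight_gb(bits_per_param):
    return PARAMS * bits_per_param / 8 / 1e9

for name, bits in [("fp16", 16), ("fp8", 8), ("fp4", 4)]:
    print(f"{name}: {weight_gb(bits):.0f} GB")  # 200, 100, 50 GB
```

There's no practical setting between those halvings, which is the point: you can't quietly shave off 10% of a model's "intelligence" by nudging precision.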

Workaccount2 12/12/2025||
API users probably wouldn't be affected since they are paying in full. Most people complaining are free users, followed by $20/mo users.
bamboozled 12/12/2025|||
Yeah, I've noticed with Claude, around the time of the Opus 4.5 release, that for at least a few days Sonnet 4.5 was just dumb, though it seems temporary. I suspect they redirected resources to Opus.
tempaccount420 12/11/2025||||
They had to rush it out, I'm sure the internal safety folks are not happy about it.
robots0only 12/12/2025||||
How do you know this is a better model? I wouldn't take any of the numbers at face value, especially when all they have done is more/better post-training and thus the base pre-trained model's capabilities are still the same. The model may just elicit some of the benchmark capabilities better. You really need to spend time using the model to come to any reliable conclusions.
bamboozled 12/12/2025|||
It's very in line with their PR strategy, or lack thereof.
kouteiheika 12/11/2025||
Unfortunately there are never any real specifics about how any of their models were trained. It's OpenAI we're talking about after all.
nezaj 12/12/2025|
We saw it do better at making Counter-Strike! https://x.com/instant_db/status/1999278134504620363?s=20