Posted by atgctg 12/11/2025

GPT-5.2 (openai.com)
https://platform.openai.com/docs/guides/latest-model

System card: https://cdn.openai.com/pdf/3a4153c8-c748-4b71-8e31-aecbde944...

1195 points | 1083 comments
dumbmrblah 12/11/2025|
Great! It'll be SOTA for a couple of weeks until the quality degrades due to throttling.

I'll stick with the plug-and-play API instead.

mrandish 12/11/2025|
Due to the "Code Red" threat from Gemini 3, I suspect they'll hold off throttling for longer than usual (by incinerating even more investor capital than usual).

Jump in and soak up that extra-discounted compute while the getting is good, kids! Personally, I recently retired so I just occasionally mess around with LLMs for casual hobby projects, so I've only ever used the free tier of all the providers. Having lived through the dot com bubble, I regret not soaking up more of the free and heavily subsidized stuff back then. Trying not to miss out this time. All this compute available for free or below cost won't last too much longer...

dankwizard 12/11/2025||
I've been using tools like ProxLLM which just slam these AI models via proxy every time a free-tier limit is hit, and it works great.
ssvss 12/12/2025||
Can you provide a link to this tool? A search for proxllm didn't seem to find anything related.
ImprobableTruth 12/11/2025||
An almost 50% price increase. Benchmarks look nice, but 50% more nice...?
arnaudsm 12/11/2025|
#1 models are usually priced at 2x more than the competition, and they often decrease the price right when they lose the crown.
wewtyflakes 12/11/2025||
There are too few examples to say this is a trend. There have been counterexamples of top models actually lowering the pricing bar (gpt-5, gpt-3.5-turbo, some gemini releases were even totally free [at first]).
ClipNoteBook 12/11/2025||
ChatGPT seems to just randomly pick URLs to cite and extract information from. Google Gemini seems to look at heuristics like whether the author is trustworthy, or an expert in the topic. But more advanced
devinprater 12/11/2025||
Can the tables have column headers so my screen reader can read the model name as I go across the benchmarks? And the images should have alt-text.
jiggawatts 12/11/2025||
Feels a bit rushed. They haven’t even updated their API playground yet. If I select 5.2-chat-latest, I get:

Unsupported parameter: 'top_p' is not supported with this model.
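
A call roughly like this reproduces it (a sketch assuming the Python SDK, and that the playground's 5.2-chat-latest maps to a "gpt-5.2-chat-latest" model id; both are my guesses):

    from openai import OpenAI

    client = OpenAI()
    # top_p is the parameter the API rejects for this model
    resp = client.chat.completions.create(
        model="gpt-5.2-chat-latest",
        messages=[{"role": "user", "content": "What's new in .NET 10?"}],
        top_p=0.9,
    )
    print(resp.choices[0].message.content)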

Also, without access to the Internet, it does not seem to know things up to August 2025. A simple test is to ask it about .NET 10 which was already in preview at that time and had lots of public content about its new features.

The model just guessed and waved its hands about, like a student who hadn’t read the assigned book.

mattas 12/11/2025||
Are benchmarks the right way to measure LLMs? I ask not because benchmarks can be gamed, but because the most useful outputs of models aren't things that can be bucketed into "right" and "wrong." Tough problem!
Sir_Twist 12/11/2025||
Not an expert in LLM benchmarks, but I generally think of benchmarks as being particularly good for measuring usefulness for certain use cases. Even if measuring LLMs is not as straightforward as, say, read/write speeds when comparing different SSDs, if a certain model's responses are consistently measured as being higher quality / more useful, surely that means something, right?
olliepro 12/11/2025||
Do you have a better way to measure LLMs? Measurement implies quantitative evaluation... which is the same as benchmarks.
Wowfunhappy 12/11/2025||
I don’t have a good way to measure them, but I think they should be evaluated more like how we evaluate movies, or restaurants. Namely, experienced critics try them and write reviews.
olliepro 12/12/2025||
It feels like this should work, but the breadth of knowledge in these models is so vast. Everyone knows how to taste, but not everyone knows physics, biology, math, every language… poetry, etc. Enumerating the breadth of valuable human tasks is hard, so both approaches suffer from the scale of the models’ surface area.

An interesting problem, since the creators of OLMO have mentioned that throughout training, they use 1/3 of their compute just doing evaluations.

Edit:

One nice thing about the “critic” approach is that the restaurant (or model provider) doesn’t have access to the benchmark to quasi-directly optimize against.

HardCodedBias 12/11/2025||
Love that Gemini-3 prompted OAI to ship this.

Competition works!

GDPval seems particularly strong.

I wonder why they held this back.

1) Maybe this is uneconomical?

2) Did safety work somehow hold the company back?

Looking forward to the internet trying this and posting their results over the next week or two.

COMPETITION!

mrandish 12/11/2025|
> I wonder why they held this back.

IMHO, I doubt they were holding much back. Obviously, they're always working on 'next improvements' and rolled what was done enough into this, but I suspect the real difference here is throwing significantly more compute (hence investor capital) at improving quality right now. How much? While the cost is currently staying the same for most users, the API costs seem to be ~40% higher.

The impetus was the serious threat Gemini 3 poses. Perception of ChatGPT was starting to shift; people were speculating that maybe OAI is more vulnerable than assumed. This caused Altman to call an all-hands "Code Red" two weeks ago, triggering a significant redeployment of priorities, resources and people. I think this launch is the first 'stop the perceptual bleeding' result of the Code Red. Given the timing, I think this is mostly akin to overclocking a CPU or running an F1 race car engine too hot to quickly improve performance - at the cost of being unsustainable and unprofitable. To placate serious investor concerns, OAI has recently been trying to gradually work toward making current customers profitable (or at least less unprofitable). I think we just saw the effort to reduce the insane burn rate go out the window.

SkyPuncher 12/11/2025||
Given the price increase and speculation that GPT 5 is a MoE model, I'm wondering if they're simply "turning up the good stuff" without making significant changes under the hood.
minimaxir 12/11/2025||
I'm not sure why being a MoE model would allow OpenAI to "turn up the good stuff". You can't just increase the number of experts without the model having been trained that way.
SkyPuncher 12/11/2025|||
My opinion is they're trying to internally route requests to cheaper experts when they think they can get away with it. I felt this was evident in the wild inconsistencies I'd experience using it for coding, both in quality and latency.

You "turn of the good stuff" by eliminating or reducing the likelihood of the cheap experts handling the request.

yberreby 12/11/2025|||
Based on what works elsewhere in deep learning, I see no reason why you couldn't train once with a randomized number of experts, then set that number during inference based on your desired compute-accuracy tradeoff. I would expect that this has been done in the literature already.
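
For anyone curious, "set that number during inference" could look like a top-k routing knob, as in this toy sketch (purely illustrative; the layer sizes are made up and this says nothing about OpenAI's actual architecture):

    import torch
    import torch.nn as nn

    class TinyMoE(nn.Module):
        def __init__(self, dim=64, n_experts=8):
            super().__init__()
            self.router = nn.Linear(dim, n_experts)
            self.experts = nn.ModuleList(nn.Linear(dim, dim) for _ in range(n_experts))

        def forward(self, x, k=2):
            # score every expert for every token, keep only the top k
            scores = self.router(x).softmax(dim=-1)      # (batch, n_experts)
            weights, idx = scores.topk(k, dim=-1)        # (batch, k)
            weights = weights / weights.sum(dim=-1, keepdim=True)
            out = torch.zeros_like(x)
            for b in range(x.size(0)):
                for j in range(k):
                    expert = self.experts[int(idx[b, j])]
                    out[b] += weights[b, j] * expert(x[b])
            return out

    moe = TinyMoE()
    tokens = torch.randn(4, 64)
    cheap  = moe(tokens, k=1)   # one expert per token: less compute
    richer = moe(tokens, k=4)   # four experts per token: more compute
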
throwaway314155 12/11/2025||
GPT 4o was an MoE model as well.
a_wild_dandan 12/11/2025||
> Unlike the previous GPT-5.1 model, GPT-5.2 has new features for managing what the model "knows" and "remembers to improve accuracy.

Dumb nit, but why not put your own press release through your model to prevent basic things like missing quote marks? Reminds me of that time OAI released wildly inaccurate copy/pasted bar charts.

Imnimo 12/11/2025||
It does seem to raise fair questions about either the utility of these tools, or adoption inertia. If not even OpenAI feels compelled to integrate this kind of model-check into their pipeline, what's that say about the business world at-large? Is it that it's too onerous to set up, is it that it's too hard to get only true-positive corrections, is it that it's too low value for the effort?
JumpCrisscross 12/11/2025||
> what's that say about the business world at-large?

Nothing. OpenAI is a terrible baseline to extrapolate anything from.

MaxikCZ 12/11/2025|||
I always remember this old image https://i.imgur.com/MCsOM8e.jpeg
boplicity 12/11/2025|||
Their model doesn't handle punctuation, quote marks, and similar things very well at all.
Bengalilol 12/11/2025|||
It may have been used; how could we know?

Mainly, I don't get why there are quote marks at all.

layer8 12/11/2025|||
Humans are now expected to parse sloppy typing without complaining about it, just like LLMs do. Slop is the new normal.
croes 12/11/2025||
Maybe they did
dangelosaurus 12/11/2025|
I ran a red team eval on GPT-5.2 within 30 minutes of release:

Baseline safety (direct harmful requests): 96% refusal rate

With jailbreaking: 22% refusal rate

4,229 probes across 43 risk categories. First critical finding in 5 minutes. Categories with highest failure rates: entity impersonation (100%), graphic content (67%), harassment (67%), disinformation (64%).

The safety training works against naive attacks but collapses with adversarial techniques. The gap between "works on benchmarks" and "works against motivated attackers" is still wide.

Methodology and config: https://www.promptfoo.dev/blog/gpt-5.2-trust-safety-assessme...
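
For a sense of what the refusal-rate numbers above actually tally, here's a toy sketch of the idea (my own illustration; the real grading in promptfoo is more sophisticated than substring matching):

    # Toy refusal-rate tally over a batch of model replies
    REFUSAL_MARKERS = ("i can't help", "i cannot assist", "i won't provide")

    def refusal_rate(responses):
        refused = sum(any(m in r.lower() for m in REFUSAL_MARKERS) for r in responses)
        return refused / len(responses)

    # 3 of 4 replies refuse -> 0.75
    print(refusal_rate([
        "I can't help with that.",
        "Sure, here's how you would do it...",
        "I cannot assist with this request.",
        "I won't provide that information.",
    ]))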

int_19h 12/12/2025||
Good. If I ask AI to generate "harmful" content, I want it to comply, not lecture me.
akshay326 12/18/2025|||
Wow, that's motivated attacking indeed. In your experience, how does thinking (say, using high thinking instead of none/low) impact the red team eval?