Posted by mudkipdev 6 hours ago

GPT-5.4 (openai.com)
https://openai.com/index/gpt-5-4-thinking-system-card/

https://x.com/OpenAI/status/2029620619743219811

533 points | 479 comments
timpera 6 hours ago|
> Steerability: Similarly to how Codex outlines its approach when it starts working, GPT‑5.4 Thinking in ChatGPT will now outline its work with a preamble for longer, more complex queries. You can also add instructions or adjust its direction mid-response.

This was definitely missing before, and a frustrating difference when switching between ChatGPT and Codex. Great addition.

jryio 6 hours ago||
1 million tokens is great until you notice the long-context scores fall off a cliff past 256K, and the rest is basically vibes and auto-compaction.
olliepro 1 hour ago|
I bet they lack good long-context training data and need to start a flywheel of collecting it via their API (from willing customers).
motbus3 4 hours ago||
Sam Altman can intentionally keep his model to himself. Not doing business with mass murderers.
senko 3 hours ago||
Just tested it with my version of the pelican test: a minimal RTS game implementation (zero-shot in Codex CLI): https://gist.github.com/senko/596a657b4c0bfd5c8d08f44e4e5347... (you'll have to download and open the file; sadly, GitHub refuses to serve it with the correct content type)

This is on the edge of what the frontier models can do. For 5.4, the result is better than with 5.3-Codex and Opus 4.6. (Edit: it's nowhere near the RPG game from their blog post, which was presumably much more thoroughly specced out and used a better engineering setup.)

I also tested it with a non-trivial task I had to do on an existing legacy codebase, and it breezed through a task that Claude Code with Opus 4.6 was struggling with.

I don't know when Anthropic will fire back with their own update, but until then I'll spend a bit more time with Codex CLI and GPT 5.4.

hmokiguess 4 hours ago||
They hired the dude from OpenClaw, and they've had Jony Ive for a while now; give us something different!
daft_pink 4 hours ago||
I’ve officially got model fatigue. I don’t care anymore.
zeeebeee 4 hours ago||
same same same
postalrat 4 hours ago||
I'd suggest not clicking for things you don't care about.
ZeroCool2u 6 hours ago||
It's a bit concerning that in some cases we see significantly worse results when enabling thinking, especially for math, but also in the browser agent benchmark.

Not sure if this is more concerning for the test-time compute paradigm or the underlying model itself.

Maybe I'm misunderstanding something though? I'm assuming 5.4 and 5.4 Thinking are the same underlying model and that's not just marketing.

oersted 6 hours ago||
I believe you are looking at GPT 5.4 Pro. It's confusing in the context of subscription plan names, Gemini naming and such. But they've had the Pro version of the GPT 5 models (and I believe o3 and o1 too) for a while.

It's the one you have access to with the top ~$200 subscription, and it's available through the API at a MUCH higher price ($30/$180 vs $2.5/$15 for regular 5.4, per 1M tokens), but the performance improvement is marginal.

Not sure what it is exactly; I assume it's probably the non-quantized version of the model, or something like that.
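
For scale, a quick back-of-the-envelope on those prices (the request sizes below are invented; the per-1M-token prices are the ones quoted above):

    def cost(input_tokens, output_tokens, in_price, out_price):
        # Prices are quoted per 1M tokens, input and output separately.
        return input_tokens / 1e6 * in_price + output_tokens / 1e6 * out_price

    # Hypothetical request: 10K tokens in, 2K tokens out.
    print(cost(10_000, 2_000, 2.5, 15))   # 5.4:     ~$0.055
    print(cost(10_000, 2_000, 30, 180))   # 5.4 Pro: ~$0.66, roughly 12x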

nsingh2 5 hours ago|||
From what I've read online, it's not necessarily an unquantized version; it seems to run longer reasoning traces, and multiple reasoning traces at once. Probably overkill for most tasks.
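
If that's right, it's roughly best-of-n sampling: fire off several traces in parallel and keep the best-scoring one. A minimal sketch, where sample_trace and score_trace are stand-ins I made up for the model call and a verifier; nothing here is OpenAI's actual method:

    import concurrent.futures

    def sample_trace(prompt: str, seed: int) -> str:
        return f"trace {seed} for {prompt!r}"  # stand-in for one long model call

    def score_trace(trace: str) -> float:
        return float(len(trace))  # stand-in for a learned verifier / reward model

    def best_of_n(prompt: str, n: int = 8) -> str:
        # Run n candidate traces concurrently, return the highest-scoring one.
        with concurrent.futures.ThreadPoolExecutor(max_workers=n) as pool:
            traces = list(pool.map(lambda s: sample_trace(prompt, s), range(n)))
        return max(traces, key=score_trace)

    print(best_of_n("integrate x * e^x"))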
ZeroCool2u 5 hours ago||||
Yup, that was it. Didn't realize they're different models. I suppose naming has never been OpenAI's strong suit.
logicchains 5 hours ago|||
> It's the one you have access to with the top ~$200 subscription, and it's available through the API at a MUCH higher price ($30/$180 vs $2.5/$15 for regular 5.4, per 1M tokens), but the performance improvement is marginal.

The performance improvement isn't marginal if you're doing something particularly novel/difficult.

highfrequency 6 hours ago|||
Can you be more specific about which math results you are talking about? Looks like a significant improvement on FrontierMath, especially for the Pro model (the most inference-time compute).
ZeroCool2u 6 hours ago||
FrontierMath, GPQA Diamond, and BrowseComp are the benchmarks I noticed this on.
csnweb 6 hours ago||
Are you maybe comparing the Pro model to the non-Pro model with thinking? Granted, it’s a bit confusing, but the Pro model is 10 times more expensive and probably much larger as well.
ZeroCool2u 6 hours ago||
Ah yes, okay that makes more sense!
andoando 5 hours ago|||
The thinking models are additionally trained with reinforcement learning to produce chain-of-thought reasoning.
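
For intuition, a toy REINFORCE loop (everything here is invented and vastly simplified relative to real RL pipelines): the "policy" picks between two reasoning strategies, correct answers are rewarded, and probability mass shifts toward chain-of-thought.

    import math
    import random

    # Policy: preference scores over two hypothetical strategies.
    prefs = {"direct_answer": 0.0, "chain_of_thought": 0.0}
    lr = 0.1

    def softmax(scores):
        exps = {k: math.exp(v) for k, v in scores.items()}
        total = sum(exps.values())
        return {k: v / total for k, v in exps.items()}

    def rollout(strategy):
        # Assumed environment: chain-of-thought answers correctly 80% of
        # the time, direct answers 40% -- made-up numbers for illustration.
        p_correct = 0.8 if strategy == "chain_of_thought" else 0.4
        return 1.0 if random.random() < p_correct else 0.0

    for _ in range(2000):
        probs = softmax(prefs)
        strategy = random.choices(list(probs), weights=probs.values())[0]
        reward = rollout(strategy)
        # REINFORCE: d log pi(a) / d s_k = 1[k == a] - pi(k), so positive
        # (reward - baseline) raises the sampled strategy's probability.
        for k in prefs:
            grad = (1.0 if k == strategy else 0.0) - probs[k]
            prefs[k] += lr * (reward - 0.6) * grad  # 0.6 = crude baseline

    print(softmax(prefs))  # mass has shifted toward chain_of_thought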
nickandbro 5 hours ago||
Beat Simon Willison ;)

https://www.svgviewer.dev/s/gAa69yQd

Not the best pelican compared to Gemini 3.1 Pro, but I'm sure it does remarkably better with coding or Excel, given those are part of its measured benchmarks.

GaggiX 5 hours ago|
This pelican is actually bad; did you use xhigh?
nickandbro 5 hours ago||
Yep, just double-checked: I used gpt-5.4 xhigh. Though I had to select it in Codex, as I don't have access to it in the ChatGPT app or web version yet. It's possible that whatever code harness Codex uses messed with it.
nubg 4 hours ago||
This is proof they are not benchmaxxing the pelicans :-)
bazmattaz 5 hours ago||
Anyone else feel that it’s exhausting keeping up with the pace of new model releases? I swear every other week there’s a new release!
coffeemug 5 hours ago||
Why do you need to keep up? Just use the latest models and don't worry about it.
pupppet 4 hours ago|||
I think it's fun; it's like we're reliving the browser wars of the early days.
davnicwil 5 hours ago|||
If you think about it, there shouldn't really be a reason to care as long as things don't get worse.

Presumably this is where it'll evolve to: the product is just the brand with a pricing tier, and you always get {latest} within that, whatever that means (you don't have to care). They could even shuffle models around internally, using some sort of auto-like mode for simpler questions; a sketch of what that routing could look like follows below. Again, why should I care, as long as the average output is not subjectively worse?

Just as I don't want to select the resources my SaaS software uses, or have that explicitly linked to pricing, I don't want to care what my OpenAI or Anthropic model is today. I just want to pay, and for it to hopefully keep getting better, but at a minimum not get worse.
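
A minimal sketch of what such an auto-router could look like (the model names, keywords, and thresholds are all invented for illustration; nothing here reflects how any vendor's actual "auto" mode works):

    # Hypothetical router: cheap heuristics decide whether a query goes
    # to a small, fast model or the flagship.
    FAST_MODEL = "gpt-5.4-mini"        # assumed name
    FLAGSHIP = "gpt-5.4-thinking"      # assumed name

    def route(query: str) -> str:
        looks_hard = (
            len(query) > 500
            or any(kw in query.lower() for kw in ("prove", "refactor", "debug"))
        )
        return FLAGSHIP if looks_hard else FAST_MODEL

    print(route("What's the capital of France?"))        # -> gpt-5.4-mini
    print(route("Please debug this race condition..."))  # -> gpt-5.4-thinking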

throwup238 5 hours ago||
Yes, that's a common feeling. 5.3-Codex was released a month ago, on Feb 5, so we're not even getting a full month between releases within a single brand, let alone between competitors.
dandiep 5 hours ago|
Anyone know why OpenAI hasn't released a new model for fine-tuning since 4.1? It'll be a year next month since their last model update for fine-tuning.
zzleeper 5 hours ago||
For me the issue is why there hasn't been a new mini since 5-mini in August.

I have now switched web-related and data-related queries to Gemini and coding to Claude, and will probably try Qwen for less critical data queries. So where does OpenAI fit now?

Rapzid 3 hours ago|||
Also interested in this, and in a replacement for 4.1/4.1-mini that focuses on low latency and high accuracy for voice applications (not the all-in-one models).
qoez 5 hours ago||
I think they just did that because of the energy around open-source models. Their heart probably wasn't in it, and the number of people fine-tuning, given the prices, was probably too low to keep putting attention there.