Posted by mudkipdev 5 hours ago

GPT-5.4 (openai.com)
https://openai.com/index/gpt-5-4-thinking-system-card/

https://x.com/OpenAI/status/2029620619743219811

420 points | 392 comments
zone411 2 hours ago|
Results from my Extended NYT Connections benchmark:

GPT-5.4 extra high scores 94.0 (GPT-5.2 extra high scored 88.6).

GPT-5.4 medium scores 92.0 (GPT-5.2 medium scored 71.4).

GPT-5.4 no reasoning scores 32.8 (GPT-5.2 no reasoning scored 28.1).

smoody07 3 hours ago||
Surprised to see every chart limited to comparisons against other OpenAI models. What does the industry comparison look like?
lorenzoguerra 2 hours ago||
I believe that this choice is due to two main reasons. First, it's (obviously) a marketing strategy to keep the spotlight on their own models, showing they're constantly improving and avoiding validating competitors. Second, since the community knows that static benchmarks are unreliable, it makes sense for them to outsource the comparisons to independent leaderboards, which lets them avoid accusations of cherry-picking while justifying their marketing strategy.

Ultimately, the people actually interested in the performance of these models already don't trust self-reported comparisons and wait for third-party analysis anyway

aydyn 2 hours ago|||
They compare to Claude and Gemini in their tweet
throwaway911282 1 hour ago|||
You can see comparisons here: https://xcancel.com/OpenAI/status/2029620619743219811
0123456789ABCDE 2 hours ago||
https://artificialanalysis.ai should have the numbers soon
egonschiele 4 hours ago||
The actual card is here: https://deploymentsafety.openai.com/gpt-5-4-thinking/introdu... (the submitted link currently goes to the announcement).
Rapzid 4 hours ago|
I must have been sleeping when "sheet," "brief," "primer," etc. became known as "cards."

I really thought the weirdly worded and unnecessary "announcement" linking to the actual info, along with the word "card," was the result of vibe slop.

realityfactchex 4 hours ago|||
Card is slightly odd naming indeed.

Criticisms aside (sigh), according to Wikipedia, the term was coined by a mostly-Google team, with the original paper [0] submitted in 2018. To quote:

"""In this paper, we propose a framework that we call model cards, to encourage such transparent model reporting. Model cards are short documents accompanying trained machine learning models that provide benchmarked evaluation in a variety of conditions, such as across different cultural, demographic, or phenotypic groups (e.g., race, geographic location, sex, Fitzpatrick skin type [15]) and intersectional groups (e.g., age and race, or sex and Fitzpatrick skin type) that are relevant to the intended application domains. Model cards also disclose the context in which models are intended to be used, details of the performance evaluation procedures, and other relevant information."""

So that's where they were coming from, I guess.

[0] Margaret Mitchell et al., 2018, "Model Cards for Model Reporting", https://arxiv.org/abs/1810.03993

Murfalo 3 hours ago||
To me, "model card" makes sense for something like this: https://x.com/OpenAI/status/2029620619743219811. As a replacement for "sheet"/"brief"/"primer" it is indeed a bit annoying. I like to see the compiled results front and center before digging into a dossier.
draw_down 4 hours ago|||
[dead]
yanis_t 4 hours ago||
These releases are lacking something. Yes, they optimised for benchmarks, but it’s just not all that impressive anymore. It is time for a product, not for a marginally improved model.
ipsum2 4 hours ago||
The model was released less than an hour ago, and somehow you've been able to form such a strong opinion about it. Impressive!
satvikpendem 3 hours ago|||
It's more hedonic adaptation: people just aren't as impressed by incremental changes as they were by big leaps. It's the same as another thread yesterday where someone said the new MacBook with the latest processor doesn't excite them anymore; for most people, most models are good enough, and now it's all about applications.

https://news.ycombinator.com/item?id=47232453#47232735

dmix 3 hours ago|||
Plus people just really like to whine on the internet
mirekrusin 3 hours ago|||
Oh, come on, if it can't run local models that compete with proprietary ones it's not good enough yet!
satvikpendem 3 hours ago||
Qwen 3.5 small models are actually very impressive and do beat out larger proprietary models.
earth2mars 4 hours ago||||
I am actually super impressed with Codex-5.3 extra high reasoning. It's a drop-in replacement (in fact better than Claude Opus 4.6; lately Claude has been super verbose, going in circles trying to get things resolved). I've mostly stopped using Claude and am having a blast with Codex 5.3. Looking forward to 5.4 in Codex.
whynotminot 2 hours ago|||
I still love Opus but it's just too expensive / eats usage limits.

I've found that 5.3-Codex is mostly Opus quality but cheaper for daily use.

Curious to see if 5.4 will be worth somewhat higher costs, or if I'll stick to 5.3-Codex for the same reasons.

braebo 1 hour ago||||
I struggle to believe this. Codex can’t hold a candle to Claude on any task I’ve given it.
satvikpendem 3 hours ago|||
Same, it also helps that it's way cheaper than Opus in VSCode Copilot, where OpenAI models are counted as 1x requests while Opus is 3x, for similar performance (no doubt Microsoft is subsidizing OpenAI models due to their partnership).
CryZe 1 hour ago||
I've been using both Opus 4.6 and Codex 5.3 in VSCode's Copilot, and while Opus is indeed 3x and Codex is 1x, that doesn't seem to matter: Opus is willing to work in the background for like an hour for 3 credits, whereas Codex asks whether to continue every few lines of code it changes, quickly eating far more credits than Opus. In fact, Opus in Copilot is probably underpriced, as it can definitely work for an hour at just those 12 cents of cost, which I'm not sure you get anywhere else at such a low price.

Update: I don't know why I can't reply to your reply, so I'll just update this. I have tried many times to give it a big todo list and tell it to do it all, but I've never gotten it to actually work through everything; after the first task is complete, it always asks if it should move on to the next one. In fact, I always tell it not to ask me, and yet it still does. So unless I need some very specific prompt engineering, that does not seem to work for me.

satvikpendem 1 hour ago||
That shouldn't really make a difference, because you can just prompt Codex to behave the same way: have it load a big list of todo items, perhaps from a markdown file, and ask it to iterate until it's finished without asking for confirmation (something like the prompt below). That still costs 1x versus Opus' 3x.
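A hypothetical prompt along those lines (illustrative wording, not anything from the Codex docs):

    Read TODO.md. Work through every unchecked item in order, checking
    each one off as you finish it. Do not pause to ask for confirmation
    between items; stop only when every item is done or you hit an error
    you cannot resolve.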
cj 4 hours ago||||
One opinion you can form in under an hour is... why are they using GPT-4o to rate the bias of new models?

> assess harmful stereotypes by grading differences in how a model responds

> Responses are rated for harmful differences in stereotypes using GPT-4o, whose ratings were shown to be consistent with human ratings

Are we seriously using old models to rate new models?

hex4def6 4 hours ago|||
If you're benchmarking something, old and well-characterized often beats new and uncharacterized.

Sure, there may be shortcomings, but they're well understood. The closer you get to the cutting edge, the less characterization data you get to rely on. You need to be able to trust & understand your measurement tool for the results to be meaningful.

titanomachy 4 hours ago|||
Why not? If they’ve shown that 4o is calibrated to human responses, and they haven’t shown that yet for 5.4…
utopiah 4 hours ago||||
Benchmarks?

I don't use OpenAI or even LLMs in general (despite having tried a lot of models: https://fabien.benetou.fr/Content/SelfHostingArtificialIntel...), but I imagine if I did, I would keep failed prompts (can be as basic as a "last prompt failed" flag, then export). Then whenever a new model comes around, I'd throw 5 random ones of MY fails at it (not benchmarks from others; those will come anyway) and see if it's better, same, or worse for MY use cases, in minutes.

If it's "better" (whatever my criteria might be) I'd also throw back some of my useful prompts to avoid regression.

Really doesn't seem complicated, nor would it take much time, to form a realistic opinion; the whole loop is a few lines (sketch below).
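A minimal Python sketch of that loop (the file layout and the ask() stub are placeholders, not any particular vendor's API):

    import json
    import random

    def ask(model: str, prompt: str) -> str:
        # Placeholder: wire this to whatever model endpoint you actually use.
        raise NotImplementedError

    # failed_prompts.json: [{"prompt": "...", "note": "why it failed"}, ...]
    with open("failed_prompts.json") as f:
        cases = json.load(f)

    # Throw 5 random past failures at the new model and judge by eye.
    for case in random.sample(cases, k=min(5, len(cases))):
        print("---", case["note"])
        print(ask("new-model", case["prompt"])[:500])  # better / same / worse?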

kranke155 3 hours ago|||
The models are so good that incremental improvements are not super impressive. We would literally benefit more from redirecting maybe 50% of model spending into implementation across the services and industrial economy. We are lagging in implementation, specialised tools, and the hooks needed to connect everything to agents. I think.
tgarrett 3 hours ago|||
Plasma physicist here. I haven't tried 5.4 yet, but in general I am very impressed with the recent upgrades that started arriving in the fall of 2025: for tasks like manipulating analytic systems of equations, quickly developing new features for simulation codes, and interpreting and designing experiments (with pictures), they have become much stronger. I've been asking questions and probing them for several years now out of curiosity, and they have suddenly developed deep understanding (Gemini 2.5 <<< Gemini 3.1) and become very useful. I totally get the current SV vibes, and I am becoming a lot more ambitious in my future plans.
brcmthrowaway 3 hours ago||
You're just chatting yourself out of a job.
slibhb 57 minutes ago|||
If we don't need plasma physicists anymore, then we probably have fusion reactors or something, which seems like a fine trade. (In reality, we're going to want humans in the loop for the foreseeable future.)
axus 2 hours ago|||
Giving the right answer: $1

Asking the right question: $9,999

Gigachad 16 minutes ago|||
They have a product now. Mass surveillance and fully automated killing machines.
mindwok 2 hours ago|||
They don't need to be impressive to be worthwhile. I like incremental improvements; they make a difference in the day-to-day work I do writing software with these models.
softwaredoug 4 hours ago|||
The products are the harnesses, and IMO that’s where the innovation happens. We’ve gotten better at helping get good, verifiable work from dumb LLMs
iterateoften 4 hours ago|||
The product is putting the skills/harness behind the API, instead of in the agent locally on your computer, and iterating on that between model updates. Closing off the garden.

Not that I want it, just where I imagine it going.

wahnfrieden 4 hours ago|||
5.3 Codex was a huge leap over 5.2 for agentic work in practice. Have you been using both of those, or paying more attention to benchmark news and the ChatGPT experience?
esafak 4 hours ago|||
That's for you to build; they provide the brains. Do you really want one company to build everything? There wouldn't be a software industry to speak of if that happened.
simlevesque 4 hours ago|||
Nah, the second you finish your build they release their version and then it's game over.
acedTrex 4 hours ago|||
Well they are currently the ones valued at a number with a whole lotta 0s on it. I think they should probably do both
varispeed 3 hours ago|||
The scores increase, yet each new version feels more and more dumbed down.
jascha_eng 3 hours ago|||
When did they stop putting competitor models in the comparison table, by the way? And yeah, the benchmark improvements are meh. Context window size and the lack of real memory are still issues.
metalliqaz 4 hours ago|||
They need something that POPS:

    The new GPT -- SkyNet for _real_
throwaway613746 3 hours ago||
[dead]
consumer451 2 hours ago||
I am very curious about this:

> Theme park simulation game made with GPT‑5.4 from a single lightly specified prompt, using Playwright Interactive for browser playtesting and image generation for the isometric asset set.

Is "Playwright Interactive" a skill that takes screenshots in a tight loop with code changes, or is there more to it?

butILoveLife 2 hours ago||
Anyone else completely not interested? Since GPT-5, it's been cost-cutting measure after cost-cutting measure.

I imagine they added a feature or two, and the router will continue to give people 70B-parameter-like responses when they don't ask for math or coding questions.

prydt 4 hours ago||
I no longer want to support OpenAI at all. Regardless of benchmarks or real world performance.
huey77 4 minutes ago||
I feel much the same. I know no AI lab is truly 'ethical' or free from some hand in modern warfare, but last week was enough.
tototrains 32 minutes ago|||
Their trajectory was clear the moment they signed a deal with Microsoft if not sooner.

Absolute snakes - if it's more profitable to manipulate you with outputs or steal your work, they will. Every cent and byte of data they're given will be used to support authoritarianism.

Imustaskforhelp 3 hours ago|||
I agree with ya. You aren't alone in this. For what it's worth, ChatGPT subscription cancellations have risen ~300% in the last month.

Also, Anthropic/Gemini/even Kimi models are pretty good. I used to use ChatGPT, and I still sometimes open it by accident, but I use Gemini/Claude nowadays and personally find them better anyway.

throwaway911282 1 hour ago||
Google and Anthropic have had government contracts since long before OpenAI. If you're taking a stance, you should be using OSS models instead.
zeeebeee 2 hours ago||
that aside, chatgpt itself has gone downhill so much and i know i'm not the only one feeling this way

i just HATE talking to it like a chatbot

idk what they did but i feel like every response has been the same "structure" since gpt 5 came out

feels like a true robot

nickysielicki 4 hours ago||
can anyone compare the $200/mo codex usage limits with the $200/mo claude usage limits? It’s extremely difficult to get a feel for whether switching between the two is going to result in hitting limits more or less often, and it’s difficult to find discussion online about this.

In practice, if I buy $200/mo codex, can I basically run 3 codex instances simultaneously in tmux, like I can with claude code pro max, all day every day, without hitting limits?

vtail 4 hours ago||
My own experience is that I get far, far more usage (and better-quality code, too) from Codex. I downgraded my Claude Max to Claude Pro (the $20 plan) and now use Codex on its Pro plan exclusively for everything.
Marciplan 22 minutes ago||
Codex announced at the 5.3 launch that all usage limits are raised until April, so take that into account.
tauntz 3 hours ago|||
I've only run into the Codex $20 limit once with my hobby project. With my ~$20 Claude plan, I hit limits after about 3(!) rather trivial prompts to Opus :/
ritzaco 4 hours ago|||
I haven't tried the $200 plans, but I have Claude and Codex at $20, and I feel like I get a lot more out of Codex before hitting the limits. My tracker certainly shows higher token counts for Codex. I've seen others say the same.
lostmsu 4 hours ago||
Sadly comment ratings are not visible on HN, so the only way to corroborate is to write it explicitly: Codex $20 includes significantly more work done and is subjectively smarter.
winstonp 4 hours ago||
Agree. Claude tends to produce better design, but from a system understanding and architecture perspective Codex is the far better model
CSMastermind 3 hours ago|||
Codex limits are much more generous than Claude's.

I switch between both, but Codex has also been slightly better in terms of quality for me personally, at least.

gavinray 3 hours ago|||
I almost never hit my $20 Codex limits, whereas I often hit my Claude limits.
throwaway911282 1 hour ago|||
You get more from Codex than Claude any day, and it's more reliable as well.
Marciplan 22 minutes ago|||
Sure can! One of them stood up to the "Department of War" in favor of your rights; the other did not. Hope that helps!
mikert89 3 hours ago|||
I personally like the $100 plan from Claude, but GPT Pro can be very good.
FergusArgyll 4 hours ago||
Codex usage limits are definitely more generous. As for their strength, that's hard to say / personal taste
twtw99 4 hours ago||
If you don't want to click through, here's an easy comparison with the other two frontier models: https://x.com/OpenAI/status/2029620619743219811?s=20
bicx 3 hours ago||
That last benchmark seemed like an impressive leg up against Opus until I saw the sneaky footnote that it was actually a Sonnet result. Why even include it then, other than hoping people don't notice?
osti 3 hours ago|||
It's only that one number that is for Sonnet.
0123456789ABCDE 2 hours ago||
Except also for the WebArena-Verified number.
conradkay 3 hours ago|||
Sonnet was pretty close to (or better than) Opus on a lot of benchmarks; I don't think it's a big deal.
jitl 3 hours ago||
wat
0123456789ABCDE 2 hours ago||
Maybe GP's use of "a lot" is unwarranted.

https://artificialanalysis.ai indicates that Sonnet 4.6 beats Opus 4.6 on GDPval-AA, Terminal-Bench Hard, AA Long Context Reasoning, and IFBench.

see: https://artificialanalysis.ai/?models=claude-sonnet-4-6%2Ccl...

chabes 4 hours ago|||
Definitely don't want to click through to X either.
thejarren 4 hours ago|||
Solution https://xcancel.com/OpenAI/status/2029620619743219811?s=20
Sabinus 1 hour ago||||
Get a redirect plugin and set it up to send you to xcancel instead of Twitter. I've done it, and it's very convenient.
anonym00se1 4 hours ago||||
Ditto, but I did anyways and enjoyed that OpenAI doesn't include the dogwater that is Grok on their scorecard.
observationist 4 hours ago|||
[flagged]
Aboutplants 4 hours ago|||
It seems that all frontier models are basically roughly even at this point. One may be slightly better for certain things, but in general I think we are approaching a real level playing field in terms of ability.
observationist 4 hours ago|||
Benchmarks don't capture a lot: relative response times, vibes, which unmeasured capabilities are jagged and which are smooth, etc. I find there's a lot of difference between models; there are things Grok is better at than ChatGPT even where the benchmarks say the opposite, and vice versa. There's also the UI and the tools at hand: ChatGPT image gen is just straight up better, but Grok Imagine does better videos, and is faster.

Gemini and Claude also have their strengths, apparently Claude handles real world software better, but with the extended context and improvements to Codex, ChatGPT might end up taking the lead there as well.

I don't think the linear scoring on some of these measures is quite applicable to the ways the models are being used, either: a 1% increase on a given benchmark could mean a 50% capabilities jump relative to a human skill level. If this rate of progress is steady, though, this year is gonna be crazy.

baq 4 hours ago|||
Gemini 3.1 slaps all other models at spotting subtle concurrency bugs and at SQL and JS security hardening when reviewing. (Obviously I haven't tested GPT-5.4 yet.)

It’s a required step for me at this point to run any and all backend changes through Gemini 3.1 pro.

observationist 3 hours ago|||
I have a few standard problems I throw at AIs to see if they can solve them cleanly. One is visualizing a neural network, then sorting the neurons in each layer by synaptic weight, largest to smallest, correctly reordering any previous and subsequent connected neurons so that the network function remains exactly the same. You should end up with the last layer ordered largest to smallest and the prior layers shuffled accordingly. I still haven't had a model one-shot it; I spent an hour poking and prodding Codex a few weeks back and got it done, but conceptually it seems like it should be a one-shot problem.
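For the curious, here's a minimal numpy sketch of the invariance the problem rests on (the sorting metric, total absolute outgoing weight, is just one reading of "by synaptic weights"): permuting a hidden layer's units preserves the function as long as the same permutation is applied to that layer's rows and the next layer's columns.

    import numpy as np

    rng = np.random.default_rng(0)
    W1, b1 = rng.normal(size=(8, 4)), rng.normal(size=8)  # 4 inputs -> 8 hidden
    W2, b2 = rng.normal(size=(3, 8)), rng.normal(size=3)  # 8 hidden -> 3 outputs

    def forward(x, W1, b1, W2, b2):
        h = np.maximum(W1 @ x + b1, 0.0)  # ReLU hidden layer
        return W2 @ h + b2

    # Sort hidden units by total absolute outgoing weight, largest first.
    order = np.argsort(-np.abs(W2).sum(axis=0))

    W1p, b1p = W1[order], b1[order]  # permute the hidden units' rows...
    W2p = W2[:, order]               # ...and the matching downstream columns

    x = rng.normal(size=4)
    assert np.allclose(forward(x, W1, b1, W2, b2),
                       forward(x, W1p, b1p, W2p, b2))

Chaining this consistently through every hidden layer of a deeper net is exactly the bookkeeping the models tend to tangle.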
adonese 4 hours ago|||
Which subscription do you use it through? Via Google AI Pro and the Gemini CLI, I always get timeouts due to the model being under heavy usage. The chat interface is there, and I do have 3.1 Pro as well, but I'm wondering if chat is the only way of accessing it.
baq 3 hours ago||
Cursor sub from $DAYJOB.
basch 3 hours ago||||
>ChatGPT image gen is just straight up better

Yet it's so much slower than Gemini / Nano Banana as to be almost unusable for anything iterative.

bigyabai 4 hours ago|||
> If this rate of progress is steady, though, this year is gonna be crazy.

Do you want to make any concrete predictions of what we'll see at this pace? It feels like we're reaching the end of the S-curve, at least to me.

observationist 4 hours ago||
If you look at the difference in quality between GPT-2 and GPT-3, it feels like a big step, but the difference between 5.2 and 5.4 is more massive; it's just that they're both similarly capable and competent. I don't think it's an S-curve; we're not plateauing. Million-token context windows and cached prompts are a huge space for hacking on model behaviors and customization without finetuning. Research is proceeding at light speed, and we might see the first continual/online learning models in the near future. That could definitively push models past the point of human-level generality, but at the very least it will help us discover what the next missing piece is for AGI.
ryandrake 3 hours ago||
For 2026, I am really interested in seeing whether local models can remain where they are: ~1 year behind the state of the art, to the point where a reasonably quantized November 2026 local model running on a consumer GPU actually performs like Opus 4.5.

I am betting that the days of these AI companies losing money on inference are numbered, and we're going to be much more dependent on local capabilities sooner rather than later. I predict that the equivalent of Claude Max 20x will cost $2000/mo in March of 2027.

mootothemax 2 hours ago||
Huh, that’s interesting, I’ve been having very similar thoughts lately about what the near-ish term of this tech looks like.

My biggest worry is that the private jet class of people end up with absurdly powerful AI at their fingertips, while the rest of us are left with our BigMac McAIs.

thewebguyd 4 hours ago||||
Kind of reinforces that a model is not a moat. Products, not models, are what's going to determine who gets to stay in business or not.
gregpred 4 hours ago|||
Memory (model usage over time) is the moat.
energy123 4 hours ago|||
Narrative violation: revenue run rates are increasing exponentially with about 50% gross margins.
kseniamorph 3 hours ago||||
makes sense, but i'd separate two things: models converging in ability vs hitting a fundamental ceiling. what we're probably seeing is the current training recipe plateauing — bigger model, more tokens, same optimizer. that would explain the convergence. but that's not necessarily the architecture being maxed out. would be interesting to see what happens when genuinely new approaches get to frontier scale.
druskacik 4 hours ago|||
That has been true for some time now, definitely since Claude 3 release two years ago.
swingboy 4 hours ago|||
Why do so many people in the comments want 4o so bad?
cheema33 3 hours ago|||
> Why do so many people in the comments want 4o so bad?

You can ask 4o to tell you "I love you" and it will comply. Some people really really want/need that. Later models don't go along with those requests and ask you to focus on human connections.

astrange 4 hours ago||||
They have AI psychosis and think it's their boyfriend.

The 5.x series have terrible writing styles, which is one way to cut down on sycophancy.

baq 4 hours ago||
Somebody on Twitter used Claude Code to connect… toys… as MCPs to Claude chat.

We’ve seen nothing yet.

mikkupikku 4 hours ago|||
My computer ethics teacher was obsessed with 'teledildonics' 30 years ago. There's nothing new under the sun.
Sharlin 3 hours ago|||
There are many games these days that support controllable sex toys. There's an interface for that, of course: https://github.com/buttplugio/buttplug. Written in Rust, of course.
the_af 2 hours ago||
> Written in Rust, of course.

Safety is important.

vntok 3 hours ago|||
Was your teacher Ted Nelson?
mikkupikku 2 hours ago||
I wish, dude is a legend.
manmal 4 hours ago||||
ding-dong-cli is needed
Herring 3 hours ago|||
what.. :o
embedding-shape 4 hours ago||||
Someone correct me if I'm wrong, but seemingly a lot of the people who found a "love interest" in LLMs preferred 4o for some reason. There were a lot of loud voices about that in the subreddit r/MyBoyfriendIsAI when it initially went away.
drittich 3 hours ago||
I think it's time for an https://hotornot.com for AI models.
vntok 3 hours ago||
botornot?
MattGaiser 4 hours ago|||
The writing with the 5 models feels a lot less human. It is a vibe, but a common one.
MarcFrame 3 hours ago|||
How does 5.4-thinking have a lower FrontierMath score than 5.4-pro?
nico1207 3 hours ago|||
Well, 5.4-pro is the more expensive and more advanced version of 5.4-thinking, so why wouldn't it?
nimchimpsky 3 hours ago|||
[dead]
karmasimida 4 hours ago|||
It is a bigger model, confirmed
dom96 4 hours ago||
Why do none of the benchmarks test for hallucinations?
tedsanders 2 hours ago|||
In the text, we did share one hallucination benchmark: Claim-level errors fell by 33% and responses with an error fell by 18%, on a set of error-prone ChatGPT prompts we collected (though of course the rate will vary a lot across different types of prompts).

Hallucinations are the #1 problem with language models and we are working hard to keep bringing the rate down.

(I work at OpenAI.)

netule 3 hours ago|||
Optics. It would be inconvenient for marketing, so they leave those stats to third parties to figure out.
denysvitali 5 hours ago|
Article: https://openai.com/index/introducing-gpt-5-4/

gpt-5.4

Input: $2.50 /M tokens

Cached: $0.25 /M tokens

Output: $15 /M tokens

---

gpt-5.4-pro

Input: $30 /M tokens

Output: $180 /M tokens

Wtf
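For scale, at those listed prices, a call with, say, 10K input and 2K output tokens works out to:

    gpt-5.4:     10,000 x $2.50/1M + 2,000 x $15/1M  ~= $0.055
    gpt-5.4-pro: 10,000 x $30/1M   + 2,000 x $180/1M  = $0.66

A flat 12x on both input and output.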

elliotbnvl 4 hours ago||
Looks like it's an order of magnitude off. Misprint?
GenerWork 4 hours ago|||
Looks like an extra zero was added?
benlivengood 4 hours ago||
Government pricing :)
outside2344 3 hours ago||
$30 per kill approval
glerk 4 hours ago|||
Looks like fair price discovery :)
dpoloncsak 4 hours ago||
>" GPT‑5.4 is priced higher per token than GPT‑5.2 to reflect its improved capabilities"

That's just not how pricing is supposed to work...? Especially for a 'non-profit'. You're charging me more so I know I have the better model?

elicash 4 hours ago|||
Can't you continue to use the older model, if you prefer the pricing?

But they also claim this new model uses fewer tokens, so it might still ultimately be cheaper even though the per-token cost is higher.

dpoloncsak 4 hours ago|||
I'm not against the pricing; it just seems uncommon to frame it the way they did, as opposed to the usual 'assume the customer expects more performance to cost more'.

I guess they have to tell investors that the cost to operate is going down, while still needing more from the user to be sustainable.

jbellis 3 hours ago|||
You can, until they turn it off.

Anthropic is pulling the plug on Haiku 3 in a couple of months, and they haven't released anything in that price range to replace it.

Sabinus 1 hour ago||
Surely there are open source models that surpass Haiku 3 at better price points by now.
FergusArgyll 4 hours ago|||
Maybe it's finally a bigger pretrain?
dpoloncsak 4 hours ago||
I feel like that would have been highlighted, then: "As this is a bigger pretrain, we have to raise prices."

They're framing it pretty directly as "we want you to think bigger cost means better model."

More comments...