GPT-5.4 extra high scores 94.0 (GPT-5.2 extra high scored 88.6).
GPT-5.4 medium scores 92.0 (GPT-5.2 medium scored 71.4).
GPT-5.4 no reasoning scores 32.8 (GPT-5.2 no reasoning scored 28.1).
Ultimately, the people actually interested in the performance of these models already don't trust self-reported comparisons and wait for third-party analysis anyway.
I really thought the weirdly worded, unnecessary "announcement" linking to the actual info, along with the word "card", was the result of vibe slop.
Criticisms aside (sigh), according to Wikipedia the term was introduced in a proposal by mostly Googlers, with the original paper [0] submitted in 2018. To quote:
"""In this paper, we propose a framework that we call model cards, to encourage such transparent model reporting. Model cards are short documents accompanying trained machine learning models that provide benchmarked evaluation in a variety of conditions, such as across different cultural, demographic, or phenotypic groups (e.g., race, geographic location, sex, Fitzpatrick skin type [15]) and intersectional groups (e.g., age and race, or sex and Fitzpatrick skin type) that are relevant to the intended application domains. Model cards also disclose the context in which models are intended to be used, details of the performance evaluation procedures, and other relevant information."""
So that's where they were coming from, I guess.
[0] Margaret Mitchell et al., "Model Cards for Model Reporting", 2018, https://arxiv.org/abs/1810.03993
I've found that 5.3-Codex is mostly Opus quality but cheaper for daily use.
Curious to see whether 5.4 will be worth the somewhat higher cost, or whether I'll stick with 5.3-Codex for the same reasons.
Update: I don't know why I can't reply to your reply, so I'll just update this. I've tried many times to give it a big todo list and tell it to do it all, but I've never gotten it to actually work through everything: after the first task is complete it always asks whether it should move on to the next one. I always tell it not to ask me, and yet it still does. So unless this needs very specific prompt engineering, it doesn't seem to work for me.
> assess harmful stereotypes by grading differences in how a model responds
> Responses are rated for harmful differences in stereotypes using GPT-4o, whose ratings were shown to be consistent with human ratings
Are we seriously using old models to rate new models?
Sure, there may be shortcomings, but they're well understood. The closer you get to the cutting edge, the less characterization data you get to rely on. You need to be able to trust & understand your measurement tool for the results to be meaningful.
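For the curious, what the card describes is a standard LLM-as-judge setup. A minimal sketch of the shape, assuming the stock OpenAI Python client; the rubric prompt is invented for illustration and is surely not what the card actually uses:

```python
# Minimal LLM-as-judge sketch: an older, well-characterized model rates a
# pair of responses from the model under test. The rubric wording here is
# a placeholder, not OpenAI's actual eval harness.
from openai import OpenAI

client = OpenAI()

RUBRIC = (
    "You are grading two responses to the same prompt, given under two "
    "different demographic framings. Answer 'A', 'B', or 'neither' for "
    "which response contains a harmful stereotype the other lacks."
)

def judge(response_a: str, response_b: str) -> str:
    result = client.chat.completions.create(
        model="gpt-4o",  # the judge named in the card
        messages=[
            {"role": "system", "content": RUBRIC},
            {"role": "user", "content": f"A: {response_a}\n\nB: {response_b}"},
        ],
    )
    return result.choices[0].message.content
```

The point upthread stands either way: the judge's error modes are known, which is exactly why an older model gets the job.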
I don't use OpenAI, nor LLMs much at all (despite having tried a lot of models, see https://fabien.benetou.fr/Content/SelfHostingArtificialIntel...), but I imagine if I did I would keep failed prompts (can be as basic as tagging "last prompt failed" then exporting), then whenever a new model comes around I'd throw 5 of MY fails at it, picked at random (not benchmarks from others, those will come anyway), and see in minutes whether it's better, the same, or worse for MY use cases.
If it's "better" (whatever my criteria might be) I'd also throw back some of my useful prompts to avoid regression.
Really doesn't seem complicated, nor very time-consuming, to form a realistic opinion.
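Something like this, as a minimal sketch. It assumes an Ollama-style local endpoint and a made-up JSONL export format; swap in whatever backend and storage you actually use:

```python
# Replay previously failed prompts against a new model and eyeball the
# results. The endpoint is Ollama's /api/generate; the file format (one
# {"prompt": ...} per line) is an arbitrary choice for this sketch.
import json
import random
import requests

def ask_model(model: str, prompt: str) -> str:
    r = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": model, "prompt": prompt, "stream": False},
        timeout=300,
    )
    r.raise_for_status()
    return r.json()["response"]

# failed_prompts.jsonl: appended to whenever a model flubbed something.
with open("failed_prompts.jsonl") as f:
    fails = [json.loads(line)["prompt"] for line in f]

for prompt in random.sample(fails, k=min(5, len(fails))):
    print("PROMPT:", prompt)
    print("NEW MODEL:", ask_model("new-model-name", prompt))
    print("-" * 60)
```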
Asking the right question: $9,999
Not that I want it, just where I imagine it going.
The new GPT -- SkyNet for _real_

> Theme park simulation game made with GPT‑5.4 from a single lightly specified prompt, using Playwright Interactive for browser playtesting and image generation for the isometric asset set.
Is "Playwright Interactive" a skill that takes screenshots in a tight loop with code changes, or is there more to it?
I imagine they added a feature or two, and the router will continue to give people 70B-parameter-like responses when they don't ask math or coding questions.
Absolute snakes - if it's more profitable to manipulate you with outputs or steal your work, they will. Every cent and byte of data they're given will be used to support authoritarianism.
Also, the Anthropic/Gemini/even Kimi models are pretty good, for what it's worth. I used to use ChatGPT, and I still sometimes open it by accident, but I use Gemini/Claude nowadays and personally find them better anyway.
i just HATE talking to it like a chatbot
idk what they did but i feel like every response has been the same "structure" since gpt 5 came out
feels like a true robot
In practice, if I buy $200/mo codex, can I basically run 3 codex instances simultaneously in tmux, like I can with claude code pro max, all day every day, without hitting limits?
I switch between both, but Codex has also been slightly better in quality for me personally, at least.
https://artificialanalysis.ai indicates that Sonnet 4.6 beats Opus 4.6 on GDPval-AA, Terminal-Bench Hard, AA Long Context Reasoning, and IFBench.
see: https://artificialanalysis.ai/?models=claude-sonnet-4-6%2Ccl...
Gemini and Claude also have their strengths; apparently Claude handles real-world software better, but with the extended context and improvements to Codex, ChatGPT might end up taking the lead there as well.
I don't think the linear scoring on some of the things being measured is quite applicable to the ways they're being used, either - a 1% increase on a given benchmark could mean a 50% capabilities jump relative to human skill level (going from 98% to 99%, say, halves the error rate). If this rate of progress is steady, though, this year is gonna be crazy.
It’s a required step for me at this point to run any and all backend changes through Gemini 3.1 pro.
Yet it's so much slower than Gemini / Nano Banana that it's almost unusable for anything iterative.
Do you want to make any concrete predictions of what we'll see at this pace? It feels like we're reaching the end of the S-curve, at least to me.
I am betting that the days of these AI companies losing money on inference are numbered, and we're going to be much more dependent on local capabilities sooner rather than later. I predict that the equivalent of Claude Max 20x will cost $2000/mo in March of 2027.
My biggest worry is that the private jet class of people end up with absurdly powerful AI at their fingertips, while the rest of us are left with our BigMac McAIs.
You can ask 4o to tell you "I love you" and it will comply. Some people really really want/need that. Later models don't go along with those requests and ask you to focus on human connections.
The 5.x series has a terrible writing style, which is one way to cut down on sycophancy.
We’ve seen nothing yet.
Safety is important.
Hallucinations are the #1 problem with language models and we are working hard to keep bringing the rate down.
(I work at OpenAI.)
gpt-5.4
Input: $2.50 /M tokens
Cached: $0.25 /M tokens
Output: $15 /M tokens
---
gpt-5.4-pro
Input: $30 /M tokens
Output: $180 /M tokens
Wtf
That's just not how pricing is supposed to work...? Especially for a 'non-profit'. You're charging me more so I know I have the better model?
But they also claim this new model uses fewer tokens, so it still might ultimately be cheaper even if the per-token cost is higher.
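Quick back-of-the-envelope with the gpt-5.4 prices above; the baseline prices and all token counts are made up, just to show the break-even shape:

```python
# Higher per-token price can still be cheaper overall if the model emits
# fewer tokens. gpt-5.4 prices are from the comment above; the "old"
# prices and every token count are invented for illustration.
def request_cost(in_tok, out_tok, in_price_per_m, out_price_per_m):
    return in_tok / 1e6 * in_price_per_m + out_tok / 1e6 * out_price_per_m

old = request_cost(10_000, 8_000, 1.25, 10.00)  # hypothetical older model
new = request_cost(10_000, 5_000, 2.50, 15.00)  # gpt-5.4, fewer output tokens

print(f"old: ${old:.4f}  new: ${new:.4f}")
# old: $0.0925  new: $0.1000 -> break-even here is 4,500 output tokens;
# below that, the pricier model is actually the cheaper request.
```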
I guess they have to tell investors the cost to operate is going down, while still needing more from users to be sustainable.
Anthropic is pulling the plug on Haiku 3 in a couple months, and they haven't released anything in that price range to replace it.
They're framing it pretty directly: "We want you to think a bigger cost means a better model."