
Posted by mudkipdev 9 hours ago

GPT-5.4 (openai.com)
https://openai.com/index/gpt-5-4-thinking-system-card/

https://x.com/OpenAI/status/2029620619743219811

642 points | 556 comments
atkrad 5 hours ago|
What is the main difference between this version and the previous one?
alpineman 8 hours ago||
No thanks. Already cancelled my sub.
butILoveLife 6 hours ago||
Anyone else completely uninterested? Since GPT-5, it's been cost-cutting measure after cost-cutting measure.

I imagine they added a feature or two, and the router will continue to give people 70B-parameter-like responses when they don't ask math or coding questions.

XCSme 8 hours ago||
Seems to be quite similar to 5.3-codex, but somehow almost 2x more expensive: https://aibenchy.com/compare/openai-gpt-5-4-medium/openai-gp...
smusamashah 7 hours ago||
I only want to see how it performs on the Bullshit-benchmark https://petergpt.github.io/bullshit-benchmark/viewer/index.v...

GPT is not even close to Claude in terms of responding to BS.

mistercow 3 hours ago|
My current hunch is that that benchmark captures most of the relevant gap between Anthropic and the rest. “Can’t distinguish truth from fiction” has long been one of the deeper complaints about LLMs, and the bullshit benchmark seems like a clever approach to testing at least some of that.
jstummbillig 7 hours ago||
Inline poll: What reasoning levels do you work with?

This has become less and less clear to me, because the more interesting work is the agent going off for 30+ minutes on high / extra high (it's usually one of the two), and that's a long time to wait and an infeasible amount of code to A/B test.

newtwilly 4 hours ago|
For directed coding (implementing an already specified plan) or asking questions about a codebase, I use 5.3-codex with medium reasoning effort. It feels relatively quick.

I like Sonnet 4.6 a lot too at medium reasoning effort, but at least in Cursor it is sometimes quite slow because it will start "thinking" for a long time.

7777777phil 9 hours ago||
83% win rate over industry professionals across 44 occupations.

I'd believe it on those specific tasks. Near-universal adoption in software still hasn't moved DORA metrics. The model gets better every release; the output doesn't follow. Just had a closer look at those productivity metrics this week: https://philippdubach.com/posts/93-of-developers-use-ai-codi...

NiloCK 8 hours ago||
This March 2026 blog post is citing a 2025 study based on Sonnet 3.5 and 3.7 usage.

Given that the organization that ran the study [1] has a terrifying exponential as its landing page, I think they'd prefer that its results be interpreted as a snapshot of something moving rather than as a constant.

[1] - https://metr.org/

7777777phil 8 hours ago||
Good catch, thanks (I really did write that myself). Added a note to the post acknowledging that the models used were Claude 3.5 and 3.7 Sonnet.
twitchard 8 hours ago||
Not sure DORA is that much of an indictment. "Change Failure Rate", for instance, is subject to tradeoffs. Organizations likely have a tolerance level for change failure rate: if changes fail too often, they slow down and invest; if changes aren't failing that much, they speed up. So saying "change failure rate hasn't decreased, obviously AI must not be working" is a little silly.

"Change Lead Time" I would expect to have sped up, although I can tell stories for why AI-assisted coding would have an indeterminate effect here too. Right now, at a lot of orgs, the bottleneck is the review process, because AI is so good at producing complete draft PRs quickly. Because reviews are scarce (and not just reviews; manual testing passes are scarce too), this ironically creates an incentive to group changes into larger batches. So the definition of what a "change" is has grown too.

Aldipower 6 hours ago||
So did they raise the ridiculously small "per tool call token limit" when working with MCP servers? It makes Chat useless... I don't care, but my users do.
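For what it's worth, the usual client-side workaround is to clip tool results before they hit the limit. A minimal sketch, where the budget and the 4-chars-per-token heuristic are both assumptions for illustration, not OpenAI's documented behavior:

```python
# Clip an MCP tool result to fit a per-call token budget.
# TOKEN_BUDGET and the chars-per-token estimate are illustrative
# assumptions; real tokenizers and real limits vary.

TOKEN_BUDGET = 2_000     # hypothetical per-tool-call limit
CHARS_PER_TOKEN = 4      # crude estimate of characters per token

def clip_tool_output(text: str, budget: int = TOKEN_BUDGET) -> str:
    """Truncate tool output, leaving a marker so the model
    knows the result was cut off rather than complete."""
    max_chars = budget * CHARS_PER_TOKEN
    if len(text) <= max_chars:
        return text
    marker = "\n[output truncated]"
    return text[: max_chars - len(marker)] + marker
```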
strongpigeon 9 hours ago|
It's interesting that they charge more for the > 200k token window, but the benchmark score seems to go down significantly past that. That's judging from the Long Context benchmark score they posted, but perhaps I'm misunderstanding what that implies.
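The tradeoff can be sketched with a toy cost model. All numbers below are illustrative assumptions, not OpenAI's actual rates; the 2x surcharge past 200k tokens is the pricing scheme being discussed:

```python
# Toy model of tiered context pricing: tokens beyond a threshold
# are billed at a multiple of the base rate. Every number here is
# an illustrative assumption, not actual GPT-5.4 pricing.

BASE_RATE = 1.25 / 1_000_000   # hypothetical $ per input token
THRESHOLD = 200_000            # context size where the higher tier starts
MULTIPLIER = 2.0               # assumed surcharge past the threshold

def input_cost(tokens: int) -> float:
    """Cost of a single prompt under the two-tier scheme."""
    cheap = min(tokens, THRESHOLD)
    expensive = max(tokens - THRESHOLD, 0)
    return cheap * BASE_RATE + expensive * BASE_RATE * MULTIPLIER

# A 300k-token prompt costs twice as much as a 200k one under
# these assumptions, even if benchmark scores drop past 200k.
print(f"${input_cost(200_000):.2f}")
print(f"${input_cost(300_000):.2f}")
```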
_heitoo 4 hours ago||
It makes sense in scenarios where a model needs >200k tokens to answer a single prompt. You're shackled to a single session, and if the model hits compaction limits, it'll get lobotomized and give a shitty answer, so higher limits, even with degraded performance, are still an improvement.
Tiberium 9 hours ago|||
They don't actually seem to charge more for the >200k tokens on the API. OpenRouter and OpenAI's own API docs do not have anything about increased pricing for >200k context for GPT-5.4. I think the 2x limit usage for higher context is specific to using the model over a subscription in Codex.
simianwords 9 hours ago||
[flagged]
strongpigeon 7 hours ago||
I guess you pay more, for worse quality, to unlock use cases that could maybe be solved by better context management instead.