Posted by maheshrijal 4/14/2025

GPT-4.1 in the API(openai.com)
680 points | 492 comments | page 2
taikahessu 4/14/2025|
> They feature a refreshed knowledge cutoff of June 2024.

As opposed to Gemini 2.5 Pro having cutoff of Jan 2025.

Honestly this feels underwhelming and surprising. Especially if you're coding with frameworks with breaking changes, this can hurt you.

forbiddenvoid 4/14/2025||
It's definitely an issue. Even the simplest use case of "create React app with Vite and Tailwind" is broken with these models right now because they're not up to date.
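
(To make the "not up to date" part concrete: if I recall the v4 changes correctly, the current Vite + Tailwind setup looks roughly like the sketch below, while stale models keep emitting the old tailwind.config.js + PostCSS directives approach.)

    // vite.config.ts - rough sketch of a current Vite + React + Tailwind v4 setup
    // (assumes the @tailwindcss/vite plugin; treat as illustrative, not canonical)
    import { defineConfig } from "vite";
    import react from "@vitejs/plugin-react";
    import tailwindcss from "@tailwindcss/vite";

    export default defineConfig({
      plugins: [react(), tailwindcss()],
    });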
lukev 4/14/2025|||
Time to start moving back to Java & Spring.

100% backwards compatibility and well represented in 15 years worth of training data, hah.

speedgoose 4/14/2025||
Write once, run nowhere.
aledalgrande 4/15/2025||
LOOOOL you have my upvote

(I did use Spring, once, ages ago, and we deployed the app to a local Tomcat server in the office...)

int_19h 4/14/2025||||
Maybe LLMs will be the forcing function to finally slow down the crazy pace of changing (and breaking) things in JavaScript land.
yokto 4/14/2025||||
Whenever an LLM struggles with a particular library version, I use Cursor Rules to auto-include migration information and that generally worked well enough in my cases.
tengbretson 4/14/2025||||
A few weeks back I couldn't even get ChatGPT to output TypeScript code that correctly used the OpenAI SDK.
seuros 4/14/2025||
You should give it documentation; it can't guess.
Zambyte 4/14/2025||||
By "broken" you mean it doesn't use the latest and greatest hot trend, right? Or does it literally not work?
dbbk 4/14/2025|||
Periodically I keep trying these coding models in Copilot and I have yet to have an experience where it produced working code with a pretty straightforward TypeScript codebase. Specifically, it cannot for the life of it produce working Drizzle code. It will hallucinate methods that don't exist despite throwing bright red type errors. Does it even check for TS errors?
dalmo3 4/14/2025||
Not sure about Copilot, but the Cursor agent runs both eslint and tsc by default and fixes the errors automatically. You can tell it to run tests too, and whatever other tools. I've had a good experience writing drizzle schemas with it.
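
For reference, the schemas involved are tiny; a minimal sketch using drizzle-orm's pg-core (table and columns made up):

    // schema.ts - minimal Drizzle schema sketch (hypothetical "users" table)
    import { pgTable, serial, text, timestamp } from "drizzle-orm/pg-core";

    export const users = pgTable("users", {
      id: serial("id").primaryKey(),
      email: text("email").notNull().unique(),
      createdAt: timestamp("created_at").defaultNow(),
    });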
taikahessu 4/14/2025|||
It has been really frustrating learning Godot 4.4.x (or any new technology you are not familiar with) with GPT-4o or, even worse, with custom GPTs which use the older GPT-4 Turbo.

As you are new in the field, it kinda doesn't make sense to pick an older version. It would be better if there was no data than incorrect data. You literally have to include the version number on every prompt and even that doesn't guarantee a right result! Sometimes I have to play truth or dare three times before we finally find the right names and instructions. Yes I have the version info on all custom information dialogs, but it is not as effective as including it in the prompt itself.

Searching the web feels like an on-going "I'm feeling lucky" mode. Anyway, I still happen to get some real insights from GPT4o, even though Gemini 2.5 Pro has proven far superior for larger and more difficult contexts / problems.

The best storytelling ideas have come from GPT 4.5. Looking forward to testing this new 4.1 as well.

jonfw 4/14/2025||
hey- curious what your experience has been like learning godot w/ LLM tooling.

are you doing 3d? The 3D tutorial ecosystem is very GUI heavy and I have had major problems trying to get godot to do anything 3D

taikahessu 4/15/2025||
I'm afraid I'm only doing 2d ... Yes, GUI related LLM instructions have been exceptionally bad, with multiple prompts of me saying "no there is no such thing"... But as I commented earlier, GPT has had its moments.

I strongly recommend giving Gemini 2.5 Pro a shot. Personally I don't like their bloated UI, but you can set the temperature value, which is especially helpful: when you are more certain of what you want and how, just lower that value. If you want to get some wilder ideas, turn it up. Also highly recommend reading the thought process it outputs! That was actually key in getting very complex ideas working. Just spotting a couple of lines there that seem too vague or even just a little bit inaccurate ... then pasting them back, with your own comments, has helped me a ton.

Is there a specific part in which you struggle? And FWIW, I've been on a heavy learning spree for 2 weeks. I feel like I'm starting to see glimpses of the barrel's bottom ... it's not so deep, you just gotta hang in there and bombard different LLMs with different questions, different angles, stripping away most and trying the simplest variation, for both prompt and Godot. Or sometimes by asking more general advice like "what is current Godot best practice in doing x".

And YouTube has also been a helpful source, listening to how more experienced users make their stuff. You can mostly skim through the videos at double speed and just focus on how they are doing the basics. Best of luck!

alangibson 4/14/2025||||
Try getting then to output Svelte 5 code...
division_by_0 4/14/2025||
Svelte 5 is the antidote to vibe coding.
asadm 4/14/2025|||
Enabling "Search" sometimes fixes it, as they fetch the newer methods.
TIPSIO 4/14/2025|||
It is annoying. The bigger, cheaper context windows help this a little though:

E.g.: If context windows get big and cheap enough (as things are trending), hopefully you can just dump the entire docs, examples, and more in every request.
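
Rough sketch of what that looks like with the Node SDK (the docs file and prompts here are made up; the model name is the one from the announcement):

    // dump-docs.ts - stuff the library docs into the prompt and ask away
    import OpenAI from "openai";
    import { readFileSync } from "node:fs";

    const client = new OpenAI(); // reads OPENAI_API_KEY from the environment
    const docs = readFileSync("./drizzle-docs.md", "utf8"); // hypothetical docs dump

    const response = await client.chat.completions.create({
      model: "gpt-4.1",
      messages: [
        { role: "system", content: `Answer using only the documentation below.\n\n${docs}` },
        { role: "user", content: "Write a Drizzle schema for a users table." },
      ],
    });

    console.log(response.choices[0].message.content);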

czk 4/15/2025||
sometimes it feels like openai keeps serving the same base dish—just adding new toppings. sure, the menu keeps changing, but it all kinda tastes the same. now the menu is getting too big.

nice to see that we aren't stuck in october of 2023 anymore!

runako 4/14/2025||
ChatGPT currently recommends I use o3-mini-high ("great at coding and logic") when I start a code conversation with 4o.

I don't understand why the comparison in the announcement talks so much about comparing with 4o's coding abilities to 4.1. Wouldn't the relevant comparison be to o3-mini-high?

4.1 costs a lot more than o3-mini-high, so this seems like a pertinent thing for them to have addressed here. Maybe I am misunderstanding the relationship between the models?

zamadatix 4/14/2025||
4.1 is a pinned API variant with the improvements from the newer iterations of 4o you're already using in the app, so that's why the comparison focuses on those two.

Pricing wise the per token cost of o3-mini is less than 4.1 but keep in mind o3-mini is a reasoning model and you will pay for those tokens too, not just the final output tokens. Also be aware reasoning models can take a long time to return a response... which isn't great if you're trying to use an API for interactive coding.
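
Back-of-the-envelope illustration (all numbers made up, just to show the shape of the math):

    // cost-sketch.ts - why a cheaper per-token price can still cost more
    const pricePerOutputToken = 4.4 / 1_000_000; // hypothetical $/output token
    const reasoningTokens = 3_000;               // hidden chain-of-thought, still billed
    const answerTokens = 500;                    // what you actually get back
    const cost = (reasoningTokens + answerTokens) * pricePerOutputToken;
    console.log(cost.toFixed(4)); // ~7x the cost of the visible output alone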

ac29 4/14/2025||
> I don't understand why the comparison in the announcement talks so much about comparing with 4o's coding abilities to 4.1. Wouldn't the relevant comparison be to o3-mini-high?

There are tons of comparisons to o3-mini-high in the linked article.

comex 4/14/2025||
Sam Altman wrote in February that GPT-4.5 would be "our last non-chain-of-thought model" [1], but GPT-4.1 also does not have internal chain-of-thought [2].

It seems like OpenAI keeps changing its plans. Deprecating GPT-4.5 less than 2 months after introducing it also seems unlikely to be the original plan. Changing plans isn't necessarily a bad thing, but I wonder why.

Did they not expect this model to turn out as well as it did?

[1] https://x.com/sama/status/1889755723078443244

[2] https://github.com/openai/openai-cookbook/blob/6a47d53c967a0...

observationist 4/14/2025||
Anyone making claims with a horizon beyond two months about structure or capabilities will be wrong - it's sama's job to show confidence and vision and calm stakeholders, but if you're paying attention to the field, the release and research cycles are still contracting, with no sense of slowing any time soon. I've followed AI research daily since GPT-2, the momentum is incredible, and even if the industry sticks with transformers, there are years left of low hanging fruit and incremental improvements before things start slowing.

There doesn't appear to be anything that these AI models cannot do, in principle, given sufficient data and compute. They've figured out multimodality and complex integration, self play for arbitrary domains, and lots of high-cost longer term paradigms that will push capabilities forwards for at least 2 decades in conjunction with Moore's law.

Things are going to continue getting better, faster, and weirder. If someone is making confident predictions beyond those claims, it's probably their job.

sottol 4/14/2025|||
Maybe that's true for absolute arm-chair-engineering outsiders (like me) but these models are in training for months, and training data is probably being prepared year(s) in advance. These models have a knowledge cut-off in 2024 - so they have been in training for a while. There's no way sama did not have a good idea that this non-COT model was in the pipeline 2 months ago. It was probably finished training then and undergoing evals.

Maybe

1. he's just doing his job and hyping OpenAI's competitive advantages (afair most of the competition didn't have decent COT models in Feb), or

2. something changed and they're releasing models now that they didn't intend to release 2 months ago (maybe because a model they did intend to release is not ready and won't be for a while), or

3. COT is not really as advantageous as it was deemed to be 2+ months ago and/or computationally too expensive.

fragmede 4/14/2025||
With the new hardware Nvidia has announced coming out, those months turn into weeks.
sottol 4/14/2025||
I doubt it's going to be weeks, the months were already turning into years despite Nvidia's previous advances.

(Not to say that it takes openai years to train a new model, just that the timeline between major GPT releases seems to double... be it for data gathering, training, taking breaks between training generations, ... - either way, model training seems to get harder not easier).

    GPT Model | Release Date | Months since previous model
    GPT-1     | 11.06.2018   |
    GPT-2     | 14.02.2019   |  8.16
    GPT-3     | 28.05.2020   | 15.43
    GPT-4     | 14.03.2023   | 33.55

[1] https://www.lesswrong.com/posts/BWMKzBunEhMGfpEgo/when-will-...

observationist 4/14/2025|||
The capabilities and general utility of the models are increasing on an entirely different trajectory than model names - the information you posted is 99% dependent on internal OAI processes and market activities as opposed to anything to do with AI.

I'm talking more broadly, as well, including consideration of audio, video, and image modalities, general robotics models, and the momentum behind applying some of these architectures to novel domains. Protocols like MCP and automation tooling are rapidly improving, with media production and IT work rapidly being automated wherever possible. When you throw in the chemistry and materials science advances, protein modeling, etc - we have enormously powerful AI with insufficient compute and expertise to apply it to everything we might want to.

We have research being done on alternate architectures, and optimization being done on transformers that are rapidly reducing the cost/performance ratio. There are models that you can run on phones that would have been considered AGI 10 years ago, and there doesn't seem to be any fundamental principle decreasing the rate of improvement yet. If alternate architectures like RWKV get funded, there might be several orders of magnitude improvement with relatively little disruption to production model behaviors, but other architectures like text diffusion could obsolete a lot of the ecosystem being built up around LLMs right now.

There are a million little considerations pumping transformer LLMs right now because they work and there's every reason to expect them to continue improving in performance and value for at least a decade. There aren't enough researchers and there's not enough compute to saturate the industry.

fragmede 4/15/2025|||
Fair point, I guess my question is how long it would take them to train GPT-2 on the absolute bleedingest generation of Nvidia chips vs what they had in 2019, with the budget they have to blow on Nvidia supercomputers today.
authorfly 4/14/2025||||
> the release and research cycles are still contracting

Not necessarily progress, though, or the benchmarks you would look at for the broader picture (MMLU etc.).

GPT-3 was an amazing step up from GPT-2, something scientists in the field really thought was at least 10-15 years out, done in 2. Instruct/RHLF for GPTs was a similar massive splash, making the second half of 2021 equally amazing.

However nothing since has really been that left field or unpredictable from then, and it's been almost 3 years since RHLF hit the field. We knew good image understanding as input, longer context, and improved prompting would improve results. The releases are common, but the progress feels like it has stalled for me.

What really has changed since Davinci-instruct or ChatGPT to you? When making an AI-using product, do you construct it differently? Are agents presently more than APIs talking to databases with private fields?

hectormalot 4/14/2025|||
In some dimensions I recognize the slow down in how fast new capabilities develop, but the speed still feels very high:

Image generation suddenly went from gimmick to useful now that prompt adherence is so much better (eagerly waiting for that to be in the API)

Coding performance continues to improve noticeably (for me). Claude 3.7 felt like a big step from 4o/3.5, Gemini 2.5 in a similar way. Compared to just 6 months ago I can give bigger and more complex pieces of work to it and get relatively good output back. (Net acceleration)

Audio-2-audio seems like it will be a big step as well. I think this has much more potential than the STT-LLM-TTS architecture commonly used today (latency, quality)

kadushka 4/14/2025||||
I see a huge progress made since the first gpt-4 release. The reliability of answers has improved an order of magnitude. Two years ago, more than half of my questions resulted in incorrect or partially correct answers (most of my queries are about complicated software algorithms or phd level research brainstorming). A simple “are you sure” prompt would force the model to admit it was wrong most of the time. Now with o1 this almost never happens and the model seems to be smarter or at least more capable than me - in general. GPT-4 was a bright high school student. o1 is a postdoc.
liamwire 4/14/2025|||
Excuse the pedantry; for those reading, it’s RLHF rather than RHLF.
moojacob 4/14/2025|||
> Things are going to continue getting better, faster, and weirder.

I love this. Especially the weirder part. This tech can be useful in every crevice of society and we still have no idea what new creative use cases there are.

Who would’ve guessed phones and social media would cause mass protests because bystanders could record and distribute videos of the police?

staunton 4/14/2025||
> Who would’ve guessed phones and social media would cause mass protests because bystanders could record and distribute videos of the police?

That would have been quite far down on my list of "major (unexpected) consequences of phones and social media"...

ewoodrich 4/15/2025||
Yep, it’s literally just a slightly higher tech version of (for example) the 1992 Los Angeles riots over Rodney King but with phones and Facebook instead of handheld camcorders and television.
wongarsu 4/14/2025|||
Maybe that's why they named this model 4.1, despite coming out after 4.5 and supposedly outperforming it. They can pretend GPT-4.5 is the last non-chain-of-thought model by just giving all non-chain-of-thought models version numbers below 4.5.
chrisweekly 4/14/2025||
Ok, I know naming things is hard, but 4.1 comes out after 4.5? Just, wat.
CamperBob2 4/14/2025||
For a long time, you could fool models with questions like "Which is greater, 4.10 or 4.5?" Maybe they're still struggling with that at OpenAI.
ben_w 4/14/2025||
At this point, I'm just assuming most AI models — not just OpenAI's — name themselves. And that they write their own press releases.
Cheer2171 4/14/2025|||
Why do you expect to believe a single word Sam Altman says?
sigmoid10 4/14/2025||
Everyone assumed malice when the board fired him for not always being "candid" - but it seems more and more that he's just clueless. He's definitely capable when it comes to raising money as a business, but I wouldn't count on any tech opinion from him.
zitterbewegung 4/14/2025|||
I think that people balked at the cost of 4.5 and really wanted just a slightly more improved 4o. Now it almost seems that they will have separate product lines for non-chain-of-thought and chain-of-thought models, which actually makes sense because some want a cheap model and some don't.
freehorse 4/14/2025|||
> Deprecating GPT-4.5 less than 2 months after introducing it also seems unlikely to be the original plan.

Well, they actually already hinted at possible deprecation in their initial announcement of GPT-4.5 [0]. Also, as others said, this model was already offered in the API as chatgpt-latest, but there was no checkpoint, which made it unreliable for actual use.

[0] https://openai.com/index/introducing-gpt-4-5/#:~:text=we%E2%...

resource_waste 4/14/2025|||
When I saw them say 'no more non COT models', I was minorly panicked.

While their competitors have made fantastic models, at the time I perceived ChatGPT4 was the best model for many applications. COT was often tricked by my prompts, assuming things to be true, when a non-COT model would say something like 'That isn't necessarily the case'.

I use both COT and non when I have an important problem.

Seeing them keep a non-COT model around is a good idea.

adamgordonbell 4/14/2025||
Perhaps it is a distilled 4.5, or based on its lineage, as some suggested.
vinhnx 4/14/2025||
• Flagship GPT-4.1: top‑tier intelligence, full endpoints & premium features

• GPT-4.1-mini: balances performance, speed & cost

• GPT-4.1-nano: prioritizes throughput & low cost with streamlined capabilities

All share a 1 million‑token context window (vs 120–200k on 4o/o3/o1), excelling in instruction following, tool calls & coding.

Benchmarks vs prior models:

• AIME ’24: 48.1% vs 13.1% (~3.7× gain)

• MMLU: 90.2% vs 85.7% (+4.5 pp)

• Video‑MME: 72.0% vs 65.3% (+6.7 pp)

• SWE‑bench Verified: 54.6% vs 33.2% (+21.4 pp)

ZeroCool2u 4/14/2025||
The lack of benchmark comparisons to other models, especially Gemini 2.5 Pro, is telling.
dmd 4/14/2025||
Gemini 2.5 Pro gets 64% on SWE-bench Verified. Sonnet 3.7 gets 70%.

They are reporting that GPT-4.1 gets 55%.

egeozcan 4/14/2025|||
Very interesting. For my use cases, Gemini's responses beat Sonnet 3.7's like 80% of the time (gut feeling, didn't collect actual data). It beats Sonnet 100% of the time when the context gets above 120k.
int_19h 4/14/2025||
As usual with LLMs. In my experience, all those metrics are useful mainly to tell which models are definitely bad, but they don't tell you much about which ones are good, and especially not how the good ones stack up against each other in real world use cases.

Andrej Karpathy famously quipped that he only trusts two LLM evals: Chatbot Arena (which has humans blindly compare and score responses), and the r/LocalLLaMA comment section.

ezyang 4/14/2025||
Lmarena isn't that useful anymore lol
int_19h 4/15/2025||
I actually agree with that, but it's generally better than other scores. Also, the quote is like a year old at this point.

In practice you have to evaluate the models yourself for any non-trivial task.

hmottestad 4/14/2025|||
Are those with «thinking» or without?
sanxiyn 4/14/2025|||
Sonnet 3.7's 70% is without thinking, see https://www.anthropic.com/news/claude-3-7-sonnet
aledalgrande 4/15/2025||||
The thinking tokens (even just 1024) make a massive difference in real world tasks with 3.7 in my experience
chaos_emergent 4/14/2025||||
based on their release cadence, I suspect that o4-mini will compete on price, performance, and context length with the rest of these models.
hecticjeff 4/14/2025||
o4-mini, not to be confused with 4o-mini
energy123 4/14/2025|||
With
poormathskills 4/14/2025|||
Go look at their past blog posts. OpenAI only ever benchmarks against their own models.

This is pretty common across industries. The leader doesn’t compare themselves to the competition.

christianqchung 4/14/2025|||
Okay, it's common across other industries, but not this one. Here is Google, Facebook, and Anthropic comparing their frontier models to others[1][2][3].

[1] https://blog.google/technology/google-deepmind/gemini-model-...

[2] https://ai.meta.com/blog/llama-4-multimodal-intelligence/

[3] https://www.anthropic.com/claude/sonnet

poormathskills 4/14/2025||
Right. Those labs aren’t leading the industry.
comp_throw7 4/15/2025||
Confusing take - Gemini 2.5 is probably the best general purpose coding model right now, and before that it was Sonnet 3.5. (Maybe 3.7 if you can get it to be less reward-hacky.) OpenAI hasn't had the best coding model for... coming up on a year, now? (o1-pro probably "outperformed" Sonnet 3.5 but you'd be waiting 10 minutes for a response, so.)
oofbaroomf 4/14/2025||||
Leader is debatable, especially given the actual comparisons...
dimitrios1 4/14/2025||||
There is no uniform tactic for this type of marketing. They will compare against whomever they need to to suit their marketing goals.
kweingar 4/14/2025||||
That would make sense if OAI were the leader.
awestroke 4/14/2025||||
Except they are far from the lead in model performance
poormathskills 4/14/2025||
Who has a (publicly released) model that is SOTA is constantly changing. It’s more interesting to see who is driving the innovation in the field, and right now that is pretty clearly OpenAI (GPT-3, first multi-modal model, first reasoning model, etc.).
swyx 4/14/2025|||
also sometimes if you get it wrong you catch unnecessary flak
kristianp 4/14/2025||
Looks like the Quasar and Optimus stealth models on Openrouter were in fact GPT-4.1. This is what I get when I try to access the openrouter/optimus-alpha model now:

    {"error":
        {"message":"Quasar and Optimus were stealth models, and 
        revealed on April 14th as early testing versions of GPT 4.1. 
        Check it out: https://openrouter.ai/openai/gpt-4.1","code":404}
osigurdson 4/14/2025||
Sam made a strange statement imo in a recent Ted Talk. He said (something like) models come and go but they want to be the best platform.

For me, it was jaw dropping. Perhaps he didn't mean it the way it sounded, but seemed like a major shift to me.

mrieck 4/15/2025||
Before everyone caught up:

    We are in a race to make a new God, and the company that wins the race will have omnipotent power beyond our comprehension. 
After everyone else caught up:

    The models come and go, some are SOTA in evals and some not.  What matters is our platform and market share.
mvkel 4/15/2025||
OpenAI has been a product company ever since ChatGPT launched.

Their value is firmly rooted in how they wrap ux around models.

clbrmbr 4/15/2025||
The deprecation of GPT-4.5 makes me sad. It's an amazing model with great world-knowledge and subtlety. It KNOWS THINGS that, on a quick experiment, 4.1 just does not. 4.5 could tell me what I would see from a random street corner in New Jersey, or how to use minor features of my niche API (well, almost), and it could write remarkably. But 4.1 doesn't hold a candle to it. Please, continue to charge me $150/1M tokens. Sometimes you need a Big Model. Which tells me it was costing more than $150/1M to serve (!).
miki123211 4/15/2025||
Most of the improvements in this model (basically everything except the longer context, image understanding, and better pricing) are things that reinforcement learning (without human feedback) should be good at.

Getting better at code is something you can verify automatically, same for diff formats and custom response formats. Instruction following is also either automatically verifiable, or can be verified via LLM as a judge.
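
To make "automatically verifiable" concrete, a purely illustrative toy reward: checking that a response follows a required format needs no human grader at all.

    // toy verifiable reward: 1 if the completion is valid JSON with a string
    // "answer" field, 0 otherwise - checkable by a script, no human feedback
    function jsonFormatReward(completion: string): number {
      try {
        const parsed = JSON.parse(completion);
        return typeof parsed.answer === "string" ? 1 : 0;
      } catch {
        return 0;
      }
    }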

I strongly suspect that this model is a GPT-4.5 (or GPT-5???) distill, with the traditional pretrain -> SFT -> RLHF pipeline augmented with an RLVR stage, as described in Lambert et al[1], and a bunch of boring technical infrastructure improvements sprinkled on top.

[1] https://arxiv.org/abs/2411.15124

clbrmbr 4/15/2025|
If so, the loss of fidelity versus 4.5 is really noticeable and a loss for numerous applications. (Finding a vegan restaurant in a random city neighborhood, for example.)
weird-eye-issue 4/15/2025||
In your example the LLM should not be responsible for that directly. It should be calling out to an API or search results to get accurate and up-to-date information (relatively speaking) and then use that context to generate a response.
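
A sketch of that pattern (the search helper and endpoint are stand-ins, not a real API; the model name is from the announcement):

    // lookup.ts - fetch real data first, then let the model reason over it
    import OpenAI from "openai";

    const client = new OpenAI();

    // Hypothetical search helper; in practice this would call Google Places,
    // Yelp, a web-search API, etc.
    async function searchPlaces(query: string): Promise<string> {
      const res = await fetch(`https://example.com/search?q=${encodeURIComponent(query)}`);
      return res.text();
    }

    const results = await searchPlaces("vegan restaurants near a given street corner");

    const answer = await client.chat.completions.create({
      model: "gpt-4.1",
      messages: [
        { role: "system", content: "Recommend restaurants using only the search results provided." },
        { role: "user", content: `Search results:\n${results}\n\nWhich would suit a vegan?` },
      ],
    });

    console.log(answer.choices[0].message.content);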
clbrmbr 4/15/2025||
You should actually try it. The really big models (4 and 4.5, sadly not 4o) have a truly breathtaking ability to dig up hidden gems that have a really low profile on the internet. The recommendations also seem to cut through all the SEO and review manipulation and deliver quality recommendations. It really all can be in one massive model.
muzani 4/15/2025|
The real news for me is GPT 4.5 being deprecated and the creativity is being brought to "future models" and not 4.1. 4.5 was okay in many ways but it was absolutely a genius in production for creative writing. 4o writes like a skilled human, but 4.5 can actually write a 10 minute scene that gives me goosebumps. I think it's the context window that allows for it to actually build up scenes to hammer it down much later.
oezi 4/15/2025|
Cool to hear that you got something out of it, but for most users 4.5 might have just felt less capable on their solution-oriented questions. I guess this is why they are deprecating it.

It is just such a big failure of OpenAI not to include smart routing on each question and hide the complexity of choosing a model from users.
