As opposed to Gemini 2.5 Pro, which has a cutoff of Jan 2025.
Honestly this feels underwhelming and surprising. Especially if you're coding with frameworks that have breaking changes, this can hurt you.
100% backwards compatibility and well represented in 15 years worth of training data, hah.
(I did use Spring, once, ages ago, and we deployed the app to a local Tomcat server in the office...)
When you are new to the field, it kinda doesn't make sense for the model to pick an older version. It would be better if there were no data than incorrect data. You literally have to include the version number in every prompt, and even that doesn't guarantee a right result! Sometimes I have to play truth or dare three times before we finally find the right names and instructions. Yes, I have the version info in all my custom instruction fields, but it is not as effective as including it in the prompt itself.
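A minimal sketch of what that per-prompt pinning looks like over the API (the framework, version, and model id here are just placeholders for whatever you're actually on):

  from openai import OpenAI

  client = OpenAI()

  # Restate the exact version in the system message on every call;
  # saved custom instructions alone tend to get ignored.
  response = client.chat.completions.create(
      model="gpt-4.1",
      messages=[
          {"role": "system", "content": "Target Godot 4.4 exactly. Never use "
           "Godot 3.x node names, signals, or GDScript syntax."},
          {"role": "user", "content": "How do I react to a body entering an Area3D?"},
      ],
  )
  print(response.choices[0].message.content)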
Searching the web feels like an ongoing "I'm feeling lucky" mode. Anyway, I still happen to get some real insights from GPT-4o, even though Gemini 2.5 Pro has proven far superior for larger and more difficult contexts/problems.
The best storytelling ideas have come from GPT-4.5. Looking forward to testing this new 4.1 as well.
Are you doing 3D? The 3D tutorial ecosystem is very GUI-heavy, and I have had major problems trying to get Godot to do anything 3D.
I strongly recommend giving Gemini 2.5 Pro a shot. Personally I don't like their bloated UI, but you can set the temperature value, which is especially helpful when you are more certain about what you want and how: just lower that value. If you want to get some wilder ideas, turn it up. Also, I highly recommend reading the thought process it produces! That was actually key to getting very complex ideas working. Just spotting a couple of lines there that seem too vague or even a little bit inaccurate, then pasting them back with your own comments, has helped me a ton.
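For reference, a minimal sketch of setting the temperature through the API instead of the UI (model id and key handling are assumptions; adjust to your setup):

  import google.generativeai as genai

  genai.configure(api_key="YOUR_API_KEY")
  model = genai.GenerativeModel("gemini-2.5-pro")

  # Low temperature when you know exactly what you want;
  # raise it toward 1.0+ when you're fishing for wilder ideas.
  response = model.generate_content(
      "Suggest a twist for my roguelike's final boss.",
      generation_config=genai.GenerationConfig(temperature=0.2),
  )
  print(response.text)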
Is there a specific part you're struggling with? And FWIW, I've been on a heavy learning spree for 2 weeks. I feel like I'm starting to see glimpses of the barrel's bottom ... it's not so deep, you just gotta hang in there and bombard different LLMs with different questions, from different angles, stripping away most of it and trying the simplest variation, for both the prompt and Godot. Or sometimes ask for more general advice: "what is the current Godot best practice for doing x".
YouTube has also been a helpful source, for hearing how more experienced users make their stuff. You can mostly skim through the videos at double speed and just focus on how they do the basics. Best of luck!
E.g.: if context windows get big and cheap enough (as things are trending), hopefully you can just dump the entire docs, examples, and more into every request.
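Something like this, say (paths and model id are placeholders, and this assumes the docs actually fit in the window):

  from pathlib import Path
  from openai import OpenAI

  client = OpenAI()

  # Naively concatenate every docs page into the prompt; with a
  # 1M-token window this is (in principle) viable for many projects.
  docs = "\n\n".join(p.read_text() for p in Path("docs").rglob("*.md"))

  response = client.chat.completions.create(
      model="gpt-4.1",
      messages=[
          {"role": "system", "content": "Answer strictly from these docs:\n\n" + docs},
          {"role": "user", "content": "What changed in the signals API this release?"},
      ],
  )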
Nice to see that we aren't stuck in October of 2023 anymore!
I don't understand why the announcement spends so much time comparing 4o's coding abilities to 4.1's. Wouldn't the relevant comparison be to o3-mini-high?
4.1 costs a lot more than o3-mini-high, so this seems like a pertinent thing for them to have addressed here. Maybe I am misunderstanding the relationship between the models?
Pricing-wise, the per-token cost of o3-mini is less than 4.1's, but keep in mind o3-mini is a reasoning model and you will pay for those reasoning tokens too, not just the final output tokens. Also be aware that reasoning models can take a long time to return a response... which isn't great if you're trying to use the API for interactive coding.
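You can see the hidden cost in the usage object the API returns. Roughly (a sketch, assuming the current OpenAI Python SDK; reasoning models expose a reasoning_tokens breakdown):

  from openai import OpenAI

  client = OpenAI()

  resp = client.chat.completions.create(
      model="o3-mini",
      messages=[{"role": "user", "content": "Refactor this function to be iterative: ..."}],
  )

  usage = resp.usage
  reasoning = usage.completion_tokens_details.reasoning_tokens
  # All completion tokens are billed at the output rate,
  # including the reasoning tokens you never see in the answer.
  print("visible output tokens:", usage.completion_tokens - reasoning)
  print("hidden reasoning tokens:", reasoning)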
There are tons of comparisons to o3-mini-high in the linked article.
It seems like OpenAI keeps changing its plans. Deprecating GPT-4.5 less than 2 months after introducing it also seems unlikely to have been the original plan. Changing plans isn't necessarily a bad thing, but I wonder why.
Did they not expect this model to turn out as well as it did?
There doesn't appear to be anything that these AI models cannot do, in principle, given sufficient data and compute. They've figured out multimodality and complex integration, self-play for arbitrary domains, and lots of high-cost longer-term paradigms that will push capabilities forward for at least 2 decades in conjunction with Moore's law.
Things are going to continue getting better, faster, and weirder. If someone is making confident predictions beyond those claims, it's probably their job.
Maybe
1. he's just doing his job and hyping OpenAI's competitive advantages (AFAIR most of the competition didn't have decent CoT models in Feb), or
2. something changed and they're releasing models now that they didn't intend to release 2 months ago (maybe because a model they did intend to release is not ready and won't be for a while), or
3. CoT is not really as advantageous as it was deemed to be 2+ months ago, and/or is computationally too expensive.
(Not to say that it takes OpenAI years to train a new model, just that the timeline between major GPT releases seems to double... be it for data gathering, training, taking breaks between training generations, ... Either way, model training seems to get harder, not easier.)
GPT Model | Release Date | Months Since Previous Model
GPT-1 | 11.06.2018
GPT-2 | 14.02.2019 | 8.16
GPT-3 | 28.05.2020 | 15.43
GPT-4 | 14.03.2023 | 33.55
[1] https://www.lesswrong.com/posts/BWMKzBunEhMGfpEgo/when-will-...
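The doubling is easy to check from the dates above (a quick sketch; month lengths are approximated):

  from datetime import date

  releases = {
      "GPT-1": date(2018, 6, 11),
      "GPT-2": date(2019, 2, 14),
      "GPT-3": date(2020, 5, 28),
      "GPT-4": date(2023, 3, 14),
  }

  names = list(releases)
  for prev, cur in zip(names, names[1:]):
      months = (releases[cur] - releases[prev]).days / 30.44  # mean month length
      print(f"{prev} -> {cur}: {months:.1f} months")
  # Gaps: ~8 -> ~15 -> ~34 months, i.e. roughly doubling each generation.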
I'm talking more broadly, as well, including audio, video, and image modalities, general robotics models, and the momentum behind applying some of these architectures to novel domains. Protocols like MCP and automation tooling are rapidly improving, with media production and IT work being automated wherever possible. When you throw in the chemistry and materials science advances, protein modeling, etc., we have enormously powerful AI with insufficient compute and expertise to apply it to everything we might want to.

We have research being done on alternate architectures, and optimization being done on transformers, that is rapidly reducing the cost/performance ratio. There are models you can run on phones that would have been considered AGI 10 years ago, and there doesn't seem to be any fundamental principle decreasing the rate of improvement yet. If alternate architectures like RWKV get funded, there might be several orders of magnitude of improvement with relatively little disruption to production model behaviors, but other architectures like text diffusion could obsolete a lot of the ecosystem being built up around LLMs right now.
There are a million little optimizations driving transformer LLMs forward right now, because they work and there's every reason to expect them to keep improving in performance and value for at least a decade. There aren't enough researchers, and there's not enough compute, to saturate the industry.
Not necessarily in progress, or in the benchmarks you'd look at for the broader picture (MMLU etc.).
GPT-3 was an amazing step up from GPT-2, something scientists in the field really thought was at least 10-15 years out, done in 2. Instruct/RLHF for GPTs made a similarly massive splash, making the second half of 2021 equally amazing.
However, nothing since has really been that left-field or unpredictable, and it's been almost 3 years since RLHF hit the field. We knew good image understanding as input, longer context, and improved prompting would improve results. The releases are common, but the progress feels like it has stalled for me.
What really has changed since Davinci-instruct or ChatGPT to you? When making an AI-using product, do you construct it differently? Are agents presently more than APIs talking to databases with private fields?
Image generation suddenly went from gimmick to useful now that prompt adherence is so much better (eagerly waiting for that to be in the API)
Coding performance continues to improve noticeably (for me). Claude 3.7 felt like a big step up from 4o/3.5, and Gemini 2.5 in a similar way. Compared to just 6 months ago, I can give it bigger and more complex pieces of work and get relatively good output back. (Net acceleration.)
Audio-to-audio seems like it will be a big step as well. I think it has much more potential than the STT-LLM-TTS architecture commonly used today (latency, quality).
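For contrast, the conventional chain looks roughly like this: three sequential round-trips, so the latencies stack (a sketch; model names are illustrative):

  from openai import OpenAI

  client = OpenAI()

  def voice_turn(audio_path: str) -> bytes:
      # 1) speech-to-text
      with open(audio_path, "rb") as f:
          heard = client.audio.transcriptions.create(model="whisper-1", file=f).text
      # 2) text reply from the LLM
      reply = client.chat.completions.create(
          model="gpt-4.1",
          messages=[{"role": "user", "content": heard}],
      ).choices[0].message.content
      # 3) text-to-speech; returns audio bytes to play back
      return client.audio.speech.create(model="tts-1", voice="alloy", input=reply).content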
I love this. Especially the weirder part. This tech can be useful in every crevice of society and we still have no idea what new creative use cases there are.
Who would’ve guessed phones and social media would cause mass protests because bystanders could record and distribute videos of the police?
That would have been quite far down on my list of "major (unexpected) consequences of phones and social media"...
Well, they actually already hinted at possible deprecation in their initial announcement of GPT-4.5 [0]. Also, as others said, this model was already offered in the API as chatgpt-4o-latest, but there was no checkpoint, which made it unreliable for actual use.
[0] https://openai.com/index/introducing-gpt-4-5/#:~:text=we%E2%...
While their competitors have made fantastic models, at the time I perceived ChatGPT-4 to be the best model for many applications. CoT models were often tricked by my prompts, assuming things to be true, when a non-CoT model would say something like 'That isn't necessarily the case'.
I use both CoT and non-CoT models when I have an important problem.
Seeing them keep a non-CoT model around is a good idea.
• GPT-4.1-mini: balances performance, speed & cost
• GPT-4.1-nano: prioritizes throughput & low cost with streamlined capabilities
All share a 1 million-token context window (vs 128k on 4o and 200k on o3/o1), excelling in instruction following, tool calls & coding.
Benchmarks vs prior models:
• AIME ’24: 48.1% vs 13.1% (~3.7× gain)
• MMLU: 90.2% vs 85.7% (+4.5 pp)
• Video‑MME: 72.0% vs 65.3% (+6.7 pp)
• SWE‑bench Verified: 54.6% vs 33.2% (+21.4 pp)
They are reporting that GPT-4.1 gets 55%.
Andrej Karpathy famously quipped that he only trusts two LLM evals: Chatbot Arena (which has humans blindly compare and score responses), and the r/LocalLLaMA comment section.
In practice you have to evaluate the models yourself for any non-trivial task.
This is pretty common across industries. The leader doesn’t compare themselves to the competition.
{"error":
{"message":"Quasar and Optimus were stealth models, and
revealed on April 14th as early testing versions of GPT 4.1.
Check it out: https://openrouter.ai/openai/gpt-4.1","code":404}For me, it was jaw dropping. Perhaps he didn't mean it the way it sounded, but seemed like a major shift to me.
We are in a race to make a new God, and the company that wins the race will have omnipotent power beyond our comprehension.
After everyone else caught up: "The models come and go, some are SOTA in evals and some not. What matters is our platform and market share." Their value is firmly rooted in how they wrap UX around models.
Getting better at code is something you can verify automatically, same for diff formats and custom response formats. Instruction following is also either automatically verifiable, or can be verified via LLM as a judge.
I strongly suspect that this model is a GPT-4.5 (or GPT-5???) distill, with the traditional pretrain -> SFT -> RLHF pipeline augmented with an RLVR stage, as described in Lambert et al. [1], and a bunch of boring technical infrastructure improvements sprinkled on top.
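The "verifiable" part is the key trick: rewards come from actually running the output, not from human preference. A toy sketch of such a reward check (sandboxing and resource limits omitted; this is my illustration, not from the paper):

  import os
  import subprocess
  import tempfile

  def verifiable_reward(generated_code: str, test_code: str) -> float:
      # Binary RLVR-style reward: 1.0 if the model's code passes the
      # unit tests, 0.0 otherwise. Real pipelines sandbox this heavily.
      with tempfile.TemporaryDirectory() as tmp:
          with open(os.path.join(tmp, "solution.py"), "w") as f:
              f.write(generated_code)
          test_path = os.path.join(tmp, "test_solution.py")
          with open(test_path, "w") as f:
              f.write(test_code)
          result = subprocess.run(
              ["python", "-m", "pytest", test_path, "-q"],
              cwd=tmp, capture_output=True, timeout=30,
          )
          return 1.0 if result.returncode == 0 else 0.0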
It just seems like such a big failure of OpenAI not to include smart routing for each question and hide the complexity of choosing a model from users.
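Even a crude router would go a long way. Something like this hypothetical two-step triage (the labels and routing table are made up for illustration):

  from openai import OpenAI

  client = OpenAI()
  ROUTES = {"simple": "gpt-4.1-nano", "moderate": "gpt-4.1-mini", "complex": "gpt-4.1"}

  def route_and_answer(user_prompt: str) -> str:
      # A cheap model triages the request first...
      label = client.chat.completions.create(
          model="gpt-4.1-nano",
          messages=[
              {"role": "system", "content": "Reply with one word: simple, moderate, or complex."},
              {"role": "user", "content": user_prompt},
          ],
      ).choices[0].message.content.strip().lower()
      # ...then the matching model answers for real.
      answer = client.chat.completions.create(
          model=ROUTES.get(label, "gpt-4.1"),
          messages=[{"role": "user", "content": user_prompt}],
      )
      return answer.choices[0].message.content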