Posted by maheshrijal 4/14/2025

GPT-4.1 in the API (openai.com)
680 points | 492 comments
oofbaroomf 4/14/2025|
I'm not really bullish on OpenAI. Why would they only compare with their own models? The only explanation I can see is that they aren't as competitive with other labs as they were before.
greenavocado 4/14/2025||
See figure 1 for up-to-date benchmarks https://github.com/KCORES/kcores-llm-arena

(Direct Link) https://raw.githubusercontent.com/KCORES/kcores-llm-arena/re...

gizmodo59 4/14/2025|||
Apple compares against its own products most of the time.
kcatskcolbdi 4/14/2025|||
I don't mind what they benchmark against as long as, when I use the model, it continues to give me better results than their competition.
poormathskills 4/14/2025||
Go look at their past blog posts. OpenAI only ever benchmarks against their own models.
oofbaroomf 4/14/2025||
Oh, ok. But it's still quite telling of their attitude as an organization.
rvnx 4/14/2025||
It's the same organization that kept repeating that sharing GPT's weights would be "too dangerous for the world". Thankfully, DeepSeek eventually did something like that, even though they're supposed to be the evil guys.
jmkni 4/14/2025||
The increased context length is interesting.

It would be incredible to be able to feed an entire codebase into a model and say "add this feature" or "we're having a bug where X is happening, tell me why", but then you are limited by the output token length

As others have pointed out, the more tokens you use, the less accurate the model gets and the more it gets confused. I've noticed this too.

We are still a ways away from being able to input an entire codebase and have it give you back an updated version of that codebase.
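
One way to sanity-check the "entire codebase" idea is to count tokens before sending anything. A minimal sketch, assuming the o200k_base encoding used by recent OpenAI models (the glob pattern and the 1M figure from the announcement are illustrative):

```python
# Minimal sketch: estimate whether a codebase fits in GPT-4.1's
# advertised 1M-token context window. Assumes the o200k_base
# encoding used by recent OpenAI models; GPT-4.1's actual
# tokenizer may differ slightly.
import pathlib

import tiktoken

enc = tiktoken.get_encoding("o200k_base")

total = 0
for path in pathlib.Path(".").rglob("*.py"):  # adjust the glob for your repo
    try:
        text = path.read_text(encoding="utf-8")
    except (UnicodeDecodeError, OSError):
        continue  # skip binary or unreadable files
    total += len(enc.encode(text, disallowed_special=()))

print(f"~{total:,} input tokens")
print("fits in a 1M context" if total < 1_000_000 else "won't fit in one call")
```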

starchild3001 4/15/2025||
I feel there's some "benchmark-hacking" going on with the GPT-4.1 model, as its metrics on livebench.com aren't all that exciting.

- It's basically GPT-4o level on average.

- More optimized for coding, but slightly inferior in other areas.

It seems to be a better model than 4o for coding tasks, but I'm not sure if it will replace the current leaders -- Gemini 2.5 Pro, o3-mini / o1, Claude 3.7/3.5.

elAhmo 4/15/2025||
A company worth hundreds of billions of dollars, on paper at least, has one of the worst product naming schemes in recent history.

Sam acknowledged this a few months ago, but with yet another release bringing no clarity, it's getting ridiculous now.

ComputerGuru 4/14/2025||
The benchmarks and charts they have up are frustrating because they don't include o3-mini(-high), which they've been pushing as the low-latency, low-cost smart model to use for coding challenges instead of 4o and 4o-mini. Why won't they include that in the charts?
bartkappenburg 4/14/2025||
By leaving out the scale or prior models, they are effectively manipulating the perceived improvement. If going from 3 to 4 took a score from 10 to 80, and going from 4 to 4o took it from 80 to 82, leaving out 3 shows us a steep line instead of a steep decrease in growth.

Lies, damn lies and statistics ;-)
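
To make that concrete, here's a throwaway matplotlib sketch using the hypothetical scores above (10, 80, 82); the numbers are the comment's illustration, not real benchmarks:

```python
# Two views of the same made-up scores (10, 80, 82): with the full
# history the curve is visibly flattening; drop the earliest model
# and the autoscaled axis makes 80 -> 82 look like a steep climb.
import matplotlib.pyplot as plt

models = ["GPT-3", "GPT-4", "GPT-4o"]
scores = [10, 80, 82]

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
ax1.plot(models, scores, marker="o")
ax1.set_title("Full history: growth flattening")

ax2.plot(models[1:], scores[1:], marker="o")
ax2.set_title("GPT-3 omitted: looks like a climb")

for ax in (ax1, ax2):
    ax.set_ylabel("benchmark score")

plt.tight_layout()
plt.show()
```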

asdev 4/14/2025||
> We will also begin deprecating GPT‑4.5 Preview in the API, as GPT‑4.1 offers improved or similar performance on many key capabilities at much lower cost and latency.

why would they deprecate when it's the better model? too expensive?

ComputerGuru 4/14/2025||
> why would they deprecate when it's the better model? too expensive?

Too expensive, but not for them - for their customers. The only reason they'd deprecate it is if it wasn't seeing enough usage to be worth keeping up, and that probably stems from it being insanely more expensive and slower than everything else.

simianwords 4/14/2025|||
Where did you find that 4.5 is a better model? Everything from the video told me that 4.5 was largely a mistake and 4.1 beats 4.5 at everything. There's no point keeping 4.5 at this point.
rob 4/14/2025||
Bigger numbers are supposed to mean better. 3.5, 4, 4.5. Going from 4 to 4.5 to 4.1 seems weird to most people. If it's better, it should have been GPT-4.6 or 5.0 or something else, not a downgraded number.
HDThoreaun 4/14/2025||
OpenAI has decided to troll via crappy naming conventions as a sort of in-joke. Sam Altman tweets about it pretty often.
tootyskooty 4/14/2025||
They're sitting on too many GPUs; they mentioned it during the stream.

I'm guessing the (API) demand isn't there to saturate them fully

lsaferite 4/15/2025||
Is there an API endpoint at OpenAI that gives the information on this page as structured data?

https://platform.openai.com/docs/models/gpt-4.1

As far as I can tell there's no way to discover the details of a model via the API right now.

Given the announced adoption of MCP and MCP's ability to perform model selection for Sampling based on a ranking for speed and intelligence, it would be great to have a model discovery endpoint that came with all the details on that page.
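
For reference, the closest thing today is the models endpoint, which only returns basic metadata; the sparse output below is exactly the gap the comment describes. A minimal sketch with the official Python SDK:

```python
# What the existing endpoint returns today: just id, object, created,
# and owned_by -- none of the context-window, pricing, or capability
# details shown on the docs page linked above.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

model = client.models.retrieve("gpt-4.1")
print(model)
# -> Model(id='gpt-4.1', created=..., object='model', owned_by='...')
```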

XCSme 4/14/2025||
I tried 4.1-mini and 4.1-nano. The responses are a lot faster, but for my use-case they seem to be a lot worse than 4o-mini (they fail to complete the task when 4o-mini could do it). Maybe I have to update my prompts...
XCSme 4/14/2025||
Even after updating my prompts, 4o-mini still seems to do better than 4.1-mini or 4.1-nano for a data-processing task.
BOOSTERHIDROGEN 4/14/2025||
Mind sharing your system prompt?
XCSme 4/14/2025||
It's quite complex, but the task is to parse some HTML content, or to choose which URL from a list is the best.

I will check the prompt again; maybe 4o-mini ignores some instructions that 4.1 doesn't (instructions which might result in the LLM returning zero data).
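
Without the actual prompt it's hard to compare directly, but a stripped-down version of the "choose the best URL" task might look something like the sketch below. Everything here (prompt wording, URLs, model choice) is illustrative, not the commenter's actual setup:

```python
# Hypothetical reconstruction of the "pick the best URL" task;
# prompt wording, URLs, and success criteria are all made up here.
from openai import OpenAI

client = OpenAI()

urls = [
    "https://example.com/blog/announcing-widget",
    "https://example.com/about",
    "https://example.com/products/widget",
]

resp = client.chat.completions.create(
    model="gpt-4.1-mini",  # swap in "gpt-4o-mini" to compare behaviour
    messages=[
        {
            "role": "system",
            "content": (
                "Pick the single URL most likely to be the product page. "
                "Reply with the URL only, or NONE if none qualify."
            ),
        },
        {"role": "user", "content": "\n".join(urls)},
    ],
)
print(resp.choices[0].message.content)
```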

jjani 4/15/2025||
That sounds incredibly disappointing given how high their benchmark scores are; it suggests they might be overtuned for those benchmarks, similar to Llama 4.
XCSme 4/15/2025||
Yeah, I think so too. They seem to be better at specific tasks, but worse at broader tasks overall.
Ninjinka 4/14/2025|
I've been using it in Cursor for the past few hours and prefer it to Sonnet 3.7. It's much faster and doesn't seem to make the sort of stupid mistakes Sonnet has been making recently.