Posted by maheshrijal 4/14/2025

GPT-4.1 in the API(openai.com)
680 points | 492 comments
lxgr 4/14/2025|
As a ChatGPT user, I'm weirdly happy that it's not available there yet. I already have to make a conscious choice between

- 4o (can search the web, use Canvas, evaluate Python server-side, generate images, but has no chain of thought)

- o3-mini (web search, CoT, canvas, but no image generation)

- o1 (CoT, maybe better than o3, but no canvas or web search and also no images)

- Deep Research (very powerful, but I have only 10 attempts per month, so I end up using roughly zero)

- 4.5 (better in creative writing, and probably warmer sound thanks to being vinyl based and using analog tube amplifiers, but slower and request limited, and I don't even know which of the other features it supports)

- 4o "with scheduled tasks" (why on earth is that a model and not a tool that the other models can use!?)

Why do I have to figure all of this out myself?

throwup238 4/14/2025||
> - Deep Research (very powerful, but I have only 10 attempts per month, so I end up using roughly zero)

Same here, which is a real shame. I've switched to DeepResearch with Gemini 2.5 Pro over the last few days where paid users have a 20/day limit instead of 10/month and it's been great, especially since now Gemini seems to browse 10x more pages than OpenAI Deep Research (on the order of 200-400 pages versus 20-40).

The reports are too verbose but having it research random development ideas, or how to do something particularly complex with a specific library, or different approaches or architectures to a problem has been very productive without sliding into vibe coding territory.

qingcharles 4/15/2025|||
Wow, I wondered what the limit was. I never checked, but I've been using it hesitantly since I burn up OpenAI's limit as soon as it resets. Thanks for the clarity.

I'm all-in on Deep Research. In minutes it can conduct research on niche historical topics that have no central articles, something that typically took me days or weeks to delve into.

namaria 4/15/2025|||
I like Deep Research, but as a historian I have to tell you: I've used it on history topics to calibrate my expectations, and it is a nice tool but... it can easily brush over nuanced discussions and just return folk wisdom from blogs.

What I love most about history is it has lots of irreducible complexity and poring over the literature, both primary and secondary sources, is often the only way to develop an understanding.

fullofbees 4/15/2025|||
I read Being and Time recently and it has a load of concepts that are defined iteratively. There's a lot wrong with how it's written, but it's an unfinished book written a hundred years ago, so I can't complain too much.

Because it's quite long, if I asked Perplexity* to remind me what something meant, it would very rarely return something helpful. But, to be fair, I can't really fault it for being a bit useless with a very difficult-to-comprehend text, where there are several competing styles of reading, many of whose proponents are convinced they are correct.

But I started to notice a pattern of it pulling answers from some weird spots, especially when I asked it to do deep research. Like, a paper from a university's server that uses concepts in the book to ground qualitative research, which is fine (practical explications are often useful ways into a dense concept), but it's kind of a weird place to be the first academic source. It'll draw on Reddit a weird amount too, or it'll somehow pull a page of definitions from a handout for some university tutorial. And it won't default to the peer-reviewed, free philosophy encyclopedias that are online and well known.

It's just weird. I was just using it to try and reinforce my actual reading of the text, but I mostly came away thinking that in certain domains, this end of AI is allowing people to conflate having access to information with learning about something.

*it's just what I have access to.

laggyluke 4/16/2025||
If you're asking an LLM about a particular text, even if it's a well-known text, you might get significantly better results if you provide said text as part of your prompt (context) instead of asking a model to "recall it from memory".

So something like this: "Here's a PDF file containing Being and Time. Please explain the significance of anxiety (Angst) in the uncovering of Being."
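
Or through the API: here's a rough sketch, assuming the OpenAI Python SDK and a plain-text copy of the book on disk (the file name, model choice, and question are just placeholders):

    from openai import OpenAI

    client = OpenAI()

    # Load the full text so the model answers from the provided context
    # rather than from memory.
    with open("being_and_time.txt", encoding="utf-8") as f:
        book_text = f.read()

    response = client.chat.completions.create(
        model="gpt-4.1",  # any long-context model
        messages=[
            {"role": "system", "content": "Answer using only the provided text."},
            {"role": "user", "content": book_text},
            {"role": "user", "content": "Explain the significance of anxiety (Angst) in the uncovering of Being."},
        ],
    )
    print(response.choices[0].message.content)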

tekacs 4/15/2025||||
When I've wanted it to not do things like this, I've had good luck directing it to... not look at those sources.

For example when I've wanted to understand an unfolding story better than the news, I've told it to ignore the media and go only to original sources (e.g. speech transcripts, material written by the people involved, etc.)

namaria 4/15/2025|||
Deep Search is pretty good for current news stories. I've had it analyze some legal developments in a European nation recently and it gave me a great overview.
iamacyborg 4/15/2025|||
That use case seems pretty self-defeating when a good news source will usually try to at least validate first-party materials, which an LLM cannot do.
taurath 4/15/2025|||
LLMs seem fantastic at generalizing broad thought and not great at outliers. They sort of smooth over the knowledge curve confidently, which is a bit like psychology, where only CBT is widely accepted even though other methodologies can be much more effective for individuals, just not at the population level.
antman 4/15/2025|||
Interesting use case. My problem is that for niche subjects the crawled pages probably haven't captured the information, and the response becomes irrelevant. Perhaps Gemini will produce better results just because it takes many more pages into account.
chrisshroba 4/15/2025|||
I also like Perplexity’s 3/day limit! If I use them up (which I almost never do) I can just refresh the next day
behnamoh 4/15/2025||
I've only ever had to use DeepResearch for academic literature review. What do you guys use it for which hits your quotas so quickly?
jml78 4/15/2025|||
I use it for mundane shit that I don’t want to spend hours doing.

My son and I go to a lot of concerts and collect patches. Unfortunately we started collecting long after we started going to concerts.

I had a list of about 30 bands I wanted patches for.

I was able to give precise instructions on what I wanted. Deep research came back with direct links for every patch I wanted.

It took me two minutes to write up the prompt and it did all the heavy lifting.

sunnybeetroot 4/15/2025||||
Write a comparison between X and Y
szundi 4/15/2025|||
[dead]
resters 4/14/2025|||
I use them as follows:

o1-pro: anything important involving accuracy or reasoning. Does the best at accomplishing things correctly in one go even with lots of context.

deepseek R1: anything where I want high quality non-academic prose or poetry. Hands down the best model for these. Also very solid for fast and interesting analytical takes. I love bouncing ideas around with R1 and Grok-3 bc of their fast responses and reasoning. I think R1 is the most creative yet also the best at mimicking prose styles and tone. I've speculated that Grok-3 is R1 with mods and think it's reasonably likely.

4o: image generation, occasionally something else but never for code or analysis. Can't wait till it can generate accurate technical diagrams from text.

o3-mini-high and grok-3: code or analysis that I don't want to wait for o1-pro to complete.

claude 3.7: occasionally for code if the other models are making lots of errors. Sometimes models will anchor to outdated information in spite of being informed of newer information.

gemini models: occasionally I test to see if they are competitive, so far not really, though I sense they are good at certain things. Excited to try 2.5 Deep Research more, as it seems promising.

Perplexity: discontinued subscription once the search functionality in other models improved.

I'm really looking forward to o3-pro. Let's hope it's available soon as there are some things I'm working on that are on hold waiting for it.

rushingcreek 4/14/2025|||
Phind was fine-tuned specifically to produce inline Mermaid diagrams for technical questions (I'm the founder).
underlines 4/15/2025|||
I really loved Phind and always think of it as the OG perplexity / RAG search engine.

Sadly, I stopped my subscription when you removed the ability to weight my own domains...

Otherwise, the fine-tune of your output format for technical questions is great, with the options, the pros/cons, and the mermaid diagrams. Just way better for technical searches than what the generic services can provide.

bsenftner 4/15/2025|||
Have you been interviewed anywhere? Curious to read your story.
shortcord 4/14/2025||||
Gemini 2.5 Pro is quite good at code.

It has become my go-to for use in Cursor. Claude 3.7 needs to be restrained too much.

artdigital 4/15/2025|||
Same here, 2.5 Pro is very good at coding. But it's also cocky and blames everything but itself when something doesn't work, e.g. "the linter must be wrong, you should reinstall it", "looks to be a problem with the Go compiler", "this function HAS to exist, that's weird that we're getting an error".

And it often just stops like “ok this is still not working. You fix it and tell me when it’s done so I can continue”.

But for coding: Gemini Pro 2.5 > Sonnet 3.5 > Sonnet 3.7

valenterry 4/15/2025||||
Weird. For me, Sonnet 3.7 is much more focused and in particular works much better at finding the places that need changes and at using other tooling. I guess the integration in Cursor is just much better and more mature.
behnamoh 4/15/2025||||
This. sonnet 3.7 is a wild horse. Gemini 2.5 Pro is like a 33 yo expert. o1 feels like a mature, senior colleague.
benhurmarcel 4/15/2025|||
I find that Gemini 2.5 Pro tends to produce working but over-complicated code more often than Claude 3.7.
torginus 4/15/2025||
Which might be a side-effect of the reasoning.

In my experience whenever these models solve a math or logic puzzle with reasoning, they generate extremely long and convoluted chains of thought which show up in the solution.

In contrast a human would come up with a solution with 2-3 steps. Perhaps something similar is going on here with the generated code.

motoboi 4/14/2025||||
You probably know this but it can already generate accurate diagrams. Just ask for the output in a diagram language like mermaid or graphviz
bangaladore 4/14/2025|||
My experience is that it often produces terrible diagrams. Things clearly overlap, lines make no sense. I'm not surprised: if you told me to lay out a diagram in XML/YAML, there would be obvious mistakes and layout issues.

I'm not really certain a text output model can ever do well here.

resters 4/14/2025|||
FWIW I think a multimodal model could be trained to do extremely well with it given sufficient training data. A combination of textual description of the system and/or diagram, source code (mermaid, SVG, etc.) for the diagram, and the resulting image, with training to translate between all three.
bangaladore 4/14/2025||
Agreed. Even done simply, I'm sure a service like this already exists (or could easily exist) where the workflow is something like:

1. User provides information

2. LLM generates structured output for whatever modeling language

3. Same or other multimodal LLM reviews the generated graph for styling / positioning issues and ensures it matches the user request.

4. LLM generates structured output based on the feedback.

5. etc...

But you could probably fine-tune a multimodal model to do it in one shot, or way more effectively.
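
For what it's worth, here's a rough sketch of that loop in Python. It assumes the OpenAI Python SDK, the mermaid CLI (`mmdc`) for rendering, and a multimodal model reviewing the rendered PNG; the prompts, model name, and file names are placeholders, not a real service:

    import base64
    import subprocess
    from openai import OpenAI

    client = OpenAI()

    def generate_mermaid(request: str, feedback: str = "") -> str:
        # Steps 2/4: LLM produces (or revises) the diagram source.
        resp = client.chat.completions.create(
            model="gpt-4.1",
            messages=[
                {"role": "system", "content": "Output only Mermaid source, no fences."},
                {"role": "user", "content": request + ("\nFix these issues: " + feedback if feedback else "")},
            ],
        )
        return resp.choices[0].message.content

    def review_render(png_path: str, request: str) -> str:
        # Step 3: a multimodal pass checks the rendered image against the request.
        with open(png_path, "rb") as f:
            b64 = base64.b64encode(f.read()).decode()
        resp = client.chat.completions.create(
            model="gpt-4.1",
            messages=[{
                "role": "user",
                "content": [
                    {"type": "text", "text": f"Does this diagram match: {request}? List layout problems, or reply OK."},
                    {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{b64}"}},
                ],
            }],
        )
        return resp.choices[0].message.content

    request = "Sequence diagram of a user logging in via OAuth"  # step 1: user input
    feedback = ""
    for _ in range(3):  # a few generate -> render -> review rounds
        with open("diagram.mmd", "w") as f:
            f.write(generate_mermaid(request, feedback))
        subprocess.run(["mmdc", "-i", "diagram.mmd", "-o", "diagram.png"], check=True)
        feedback = review_render("diagram.png", request)
        if feedback.strip() == "OK":
            break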

behnamoh 4/15/2025|||
I had a LaTeX TikZ diagram problem which Sonnet 3.7 couldn't handle even after 10 attempts. Gemini 2.5 Pro solved it on the second try.
gunalx 4/15/2025||
Had the same experience. o3-mini failing miserably, Claude 3.7 as well, but Gemini 2.5 Pro solved it perfectly. (Image of a diagram, without source, to a TikZ diagram.)
resters 4/14/2025||||
I've had mixed and inconsistent results and it hasn't been able to iterate effectively when it gets close. Could be that I need to refine my approach to prompting. I've tried mermaid and SVG mostly, but will also try graphviz based on your suggestion.
antman 4/15/2025|||
Plantuml (action) diagrams are my go to
wavewrangler 4/15/2025||||
You probably know this and are looking for consistency, but a little trick I use is to feed in the original data of what I need as a diagram and ask it to re-imagine it as an image "ready for print". Not native, but still a time saver, and it handles unstructured data surprisingly well. Naive, yes; native, not yet. Be sure to double- and triple-check as always; give it the ol' OCD treatment.
barrkel 4/15/2025||||
Gemini 2.5 is very good. Since you have to wait for reasoning tokens, it takes longer to come back, but the responses are high quality IME.
czk 4/15/2025|||
Re: "Grok-3 is R1 with mods": do you mean you believe they distilled DeepSeek R1? That was my assumption as well; though I thought of it more jokingly at first, it would make a lot of sense. I actually enjoy Grok 3 quite a lot, it has some of the most entertaining thinking traces.
StephenAshmore 4/15/2025|||
> 4.5 (better in creative writing, and probably warmer sound thanks to being vinyl based and using analog tube amplifiers

Ha! That's the funniest and best description of 4.5 I've seen.

cafeinux 4/14/2025|||
> 4.5 (better in creative writing, and probably warmer sound thanks to being vinyl based and using analog tube amplifiers, but slower and request limited, and I don't even know which of the other features it supports)

Is that an LLM hallucination?

cheschire 4/14/2025|||
It’s a tongue in cheek reference to how audiophiles claim to hear differences in audio quality.
SadTrombone 4/14/2025||||
Pretty dark times on HN, when a silly (and obvious) joke gets someone labeled as AI.
netdevphoenix 4/15/2025||
Obvious to you, perhaps, but not to everyone. Self-awareness goes a long way.
lxgr 4/15/2025||||
Possibly, but it's running on 100% wetware, I promise!
divan 4/15/2025|||
Looks like NDA violation )
SweetSoftPillow 4/15/2025|||
Switch to Gemini 2.5 Pro, and be happy. It's better in every aspect.
exadeci 4/17/2025|||
It's somehow not; I've been asking it the same questions as ChatGPT and the answers feel off.
miroljub 4/15/2025|||
Warning to potential users: it's Google.
tomalbrc 4/15/2025||
Not sure how or why OpenAI would be any better?
miroljub 4/15/2025||
It's not. It's closed source. But Google is still the worst when it comes to privacy.

I prefer to use only open source models that don't have the possibility to share my data with a third party.

jrk 4/15/2025||
The notion that Google is worse at carefully managing PII than a Wild West place like OpenAI (or Meta, or almost any major alternative) is…not an accurate characterization, in my experience. Ad tech companies (and AI companies) obsessively capture data, but Google internally has always been equally obsessive about isolating and protecting that data. Almost no one can touch it; access is highly restricted and carefully managed; anything that even smells adjacent to ML on personal data has gotten high-level employees fired.

Fully private and local inference is indeed great, but of the centralized players, Google, Microsoft, and Apple are leagues ahead of the newer generation in conservatism and care around personal data.

miroljub 4/22/2025||
I'm not convinced Google is the gold standard for protecting PII. Data breaches can still happen despite internal controls, and their ad-based business model incentivizes data collection. The "high-level employees getting fired" story sounds like PR - how often does that actually happen? I'm not buying that they're leagues ahead of everyone else in data protection.
cr4zy 4/15/2025|||
For code it's actually quite good so far IME. Not quite as good as Gemini 2.5 Pro but much faster. I've integrated it into polychat.co if you want to try it out and compare with other models. I usually ask 2 to 5 models the same question there to reduce the model overload anxiety.
rockwotj 4/15/2025|||
My thought is that this model release is driven by this year's agentic app push. To my knowledge, all the big agentic apps (Cursor, Bolt, Shortwave) use Claude 3.7 because it's so much better at instruction following and tool calling than GPT-4o, so this model feels like GPT-4o (or a distilled 4.5?) with some post-training focused on what these agentic workloads need most.
anshumankmr 4/15/2025|||
Hey, also try out Monday, it did something pretty cool. It's a version of 4o which switches between reasoning and plain token generation on the fly. My guess is that is what GPT V will be.
lucaskd 4/15/2025|||
I'm also very curious about the limit for each model. I never thought about limits before upgrading my plan.
youssefabdelm 4/15/2025|||
Disagree. It's really not complicated at all to me. Not sure why people make a big fuss over this. I don't want an AI automating which AI it chooses for me. I already know through lots of testing intuitively which one I want.

If they abstract all this away into one interface I won't know which model I'm getting. I prefer reliability.

yousif_123123 4/15/2025|||
I do like the vinyl and analog amplifiers. I certainly hear the warmth in this case.
xnx 4/15/2025|||
This sounds like whole lot of mental overhead to avoid using Gemini.
guillaume8375 4/15/2025|||
What do you mean when you say that 4o doesn’t have chain-of-thought?
fragmede 4/14/2025|||
What's hilarious to me is that I asked ChatGPT about the model names and approaches and it did a better job than they have.
chrisandchris 4/15/2025|||
Just ask the first AI that comes to mind which one you could ask.
konart 4/15/2025||
Must be weird to not have an "AI router" in this case.
modeless 4/14/2025||
Numbers for SWE-bench Verified, Aider Polyglot, cost per million output tokens, output tokens per second, and knowledge cutoff month/year:

             SWE  Aider Cost Fast Fresh
 Claude 3.7  70%  65%   $15  77   8/24
 Gemini 2.5  64%  69%   $10  200  1/25
 GPT-4.1     55%  53%   $8   169  6/24
 DeepSeek R1 49%  57%   $2.2 22   7/24
 Grok 3 Beta ?    53%   $15  ?    11/24
I'm not sure this is really an apples-to-apples comparison as it may involve different test scaffolding and levels of "thinking". Tokens per second numbers are from here: https://artificialanalysis.ai/models/gpt-4o-chatgpt-03-25/pr... and I'm assuming 4.1 is the speed of 4o given the "latency" graph in the article putting them at the same latency.

Is it available in Cursor yet?

anotherpaulg 4/14/2025||
I just finished updating the aider polyglot leaderboard [0] with GPT-4.1, mini and nano. My results basically agree with OpenAI's published numbers.

Results, with other models for comparison:

    Model                       Score   Cost

    Gemini 2.5 Pro Preview 03-25 72.9%  $ 6.32
    claude-3-7-sonnet-20250219   64.9%  $36.83
    o3-mini (high)               60.4%  $18.16
    Grok 3 Beta                  53.3%  $11.03
  * gpt-4.1                      52.4%  $ 9.86
    Grok 3 Mini Beta (high)      49.3%  $ 0.73
  * gpt-4.1-mini                 32.4%  $ 1.99
    gpt-4o-2024-11-20            18.2%  $ 6.74
  * gpt-4.1-nano                  8.9%  $ 0.43
Aider v0.82.0 is also out with support for these new models [1]. Aider wrote 92% of the code in this release, a tie with v0.78.0 from 3 weeks ago.

[0] https://aider.chat/docs/leaderboards/

[1] https://aider.chat/HISTORY.html

pzo 4/15/2025|||
Did you benchmark the combo DeepSeek R1 + DeepSeek V3 (0324)? There is a combo in 3rd place, DeepSeek R1 + claude-3-5-sonnet-20241022, and the new V3 beats Claude 3.5, so in theory R1 + V3 should even be in 2nd place. Just curious if that would be the case.
purplerabbit 4/15/2025|||
What model are you personally using in your aider coding? :)
anotherpaulg 4/15/2025||
Mostly Gemini 2.5 Pro lately.

I get asked this often enough that I have a FAQ entry with automatically updating statistics [0].

  Model               Tokens     Pct

  Gemini 2.5 Pro   4,027,983   88.1%
  Sonnet 3.7         518,708   11.3%
  gpt-4.1-mini        11,775    0.3%
  gpt-4.1             10,687    0.2%
[0] https://aider.chat/docs/faq.html#what-llms-do-you-use-to-bui...
jsnell 4/14/2025|||
https://aider.chat/docs/leaderboards/ shows 73% rather than 69% for Gemini 2.5 Pro?

Looks like they also added the cost of the benchmark run to the leaderboard, which is quite cool. Cost per output token is no longer representative of the actual cost when the number of tokens can vary by an order of magnitude for the same problem just based on how many thinking tokens the model is told to use.

anotherpaulg 4/14/2025|||
Aider author here.

Based on some DMs with the Gemini team, they weren't aware that aider supports a "diff-fenced" edit format. And that it is specifically tuned to work well with Gemini models. So they didn't think to try it when they ran the aider benchmarks internally.

Beyond that, I spend significant energy tuning aider to work well with top models. That is in fact the entire reason for aider's benchmark suite: to quantitatively measure and improve how well aider works with LLMs.

Aider makes various adjustments to how it prompts and interacts with most every top model, to provide the very best possible AI coding results.

BonoboIO 4/14/2025|||
Thank you for providing such amazing tools for us. Aider is a godsend when working with a large codebase to get an overview.
modeless 4/14/2025|||
Thanks, that's interesting info. It seems to me that such tuning, while making Aider more useful, and making the benchmark useful in the specific context of deciding which model to use in Aider itself, reduces the value of the benchmark in evaluating overall model quality for use in other tools or contexts, as people use it for today. Models that get more tuning will outperform models that get less tuning, and existing models will have an advantage over new ones by virtue of already being tuned.
jmtulloss 4/15/2025||
I think you could argue the other side too... All of these models do better and worse with subtly different prompting that is non-obvious and unintuitive. Anybody using different models for "real work" are going to be tuning their prompts specifically to a model. Aider (without inside knowledge) can't possibly max out a given model's ability, but it can provide a reasonable approximation of what somebody can achieve with some effort.
modeless 4/14/2025|||
There are different scores reported by Google for "diff" and "whole" modes, and the others were "diff" so I chose the "diff" score. Hard to make a real apples-to-apples comparison.
jsnell 4/14/2025|||
The 73% on the current leaderboard is using "diff", not "whole". (Well, diff-fenced, but the difference is just the location of the filename.)
modeless 4/14/2025||
Huh, seems like Aider made a special mode specifically for Gemini[1] some time after Google's announcement blog post with official performance numbers. Still not sure it makes sense to quote that new score next to the others. In any case Gemini's 69% is the top score even without a special mode.

[1] https://aider.chat/docs/more/edit-formats.html#diff-fenced:~...

jsnell 4/14/2025||
The mode wasn't added after the announcement, Aider has had it for almost a year: https://aider.chat/HISTORY.html#aider-v0320

This benchmark has an authoritative source of results (the leaderboard), so it seems obvious that it's the number that should be used.

modeless 4/14/2025||
OK but it was still added specifically to improve Gemini and nobody else on the leaderboard uses it. Google themselves do not use it when they benchmark their own models against others. They use the regular diff mode that everyone else uses. https://blog.google/technology/google-deepmind/gemini-model-...
tcdent 4/14/2025|||
They just pick the best performer out of the built-in modes they offer.

Interesting data point about the model's behavior, but even more so it's a recommendation of which way to configure the model for optimal performance.

I do consider this to be an apples-to-apples benchmark since they're evaluating real-world performance.

meetpateltech 4/14/2025|||
Yes, it is available in Cursor[1] and Windsurf[2] as well.

[1] https://twitter.com/cursor_ai/status/1911835651810738406

[2] https://twitter.com/windsurf_ai/status/1911833698825286142

cellwebb 4/14/2025||
And free on windsurf for a week! Vibe time.
tomjen3 4/14/2025|||
It's available for free in Windsurf so you can try it out there.

Edit: Now also in Cursor

ilrwbwrkhv 4/15/2025|||
Yup, GPT-4.1 isn't good at all compared to the others. I tried a bunch of different scenarios; for me, the winners:

- DeepSeek for general chat and research

- Claude 3.7 for coding

- Gemini 2.5 Pro Experimental for deep research

In terms of price Deepseek is still absolutely fire!

OpenAI is in trouble honestly.

torginus 4/15/2025||
One task I do is I feed the models the text of entire books, and ask them various questions about it ('what happened in Chapter 4', 'what did character X do in the book' etc.).

GPT-4.1 is the first model that has provided a human-quality answer to these questions. It seems to be the first model that can follow plotlines and character motivations accurately.

I'd say since text processing is a very important use case for LLMs, that's quite noteworthy.

soheil 4/14/2025|||
Yes on both Cursor and Windsurf.

https://twitter.com/cursor_ai/status/1911835651810738406

swyx 4/14/2025||
don't miss that OAI also published a prompting guide WITH RECEIPTS for GPT 4.1 specifically for those building agents... with a new recommendation for:

- telling the model to be persistent (+20%)

- don't self-inject/parse tool calls (+2%)

- prompted planning (+4%)

- JSON BAD - use XML or arxiv 2406.13121 (GDM format)

- put instructions + user query at TOP -and- BOTTOM - bottom-only is VERY BAD

- no evidence that ALL CAPS or Bribes or Tips or threats to grandma work

source: https://cookbook.openai.com/examples/gpt4-1_prompting_guide#...

pton_xd 4/14/2025||
As an aside, one of the worst aspects of the rise of LLMs, for me, has been the wholesale replacement of engineering with trial-and-error hand-waving. Try this, or maybe that, and maybe you'll see a +5% improvement. Why? Who knows.

It's just not how I like to work.

zoogeny 4/14/2025|||
I think trial-and-error hand-waving isn't all that far from experimentation.

As an aside, I was working in the games industry when multi-core was brand new. Maybe Xbox-360 and PS3? I'm hazy on the exact consoles but there was one generation where the major platforms all went multi-core.

No one knew how best to use the multi-core systems for gaming. I attended numerous tech talks by teams that had tried different approaches and gave similar "maybe do this and maybe see x% improvement?" advice. There was a lot of experimentation. It took a few years before things settled and best practices became even somewhat standardized.

Some people found that era frustrating and didn't like to work in that way. Others loved the fact it was a wide open field of study where they could discover things.

jorvi 4/14/2025|||
Yes, it was the generation of the X360 and PS3. X360 was 3 core and the PS3 was 1+7 core (sort of a big.little setup).

Although it took many, many more years until games started to actually use multi-core properly. With rendering being on a 16.67 ms / 8.33 ms budget and rendering tied to world state, it was just really hard not to tie everything into each other.

Even today you'll usually only see 2-4 cores actually getting significant load.

Nullabillity 4/15/2025||||
Performance optimization is different, because there's still some kind of baseline truth. Everyone knows what FPS is, and +5% FPS is +5% FPS. Even the tricky cases have some kind of boundary (+5% FPS on this hardware but -10% on this other hardware, +2% on scenes meeting these conditions but -3% otherwise, etc).

Meanwhile, nobody can agree on what a "good" LLM is, let alone how to measure it.

hackernewds 4/15/2025|||
There probably was still a structured way to test this through cross-hatching, but yeah, blind guessing might take longer and arrive at the same solution.
barrkel 4/15/2025||||
The disadvantage is that LLMs are probabilistic, mercurial, unreliable.

The advantage is that humans are probabilistic, mercurial and unreliable, and LLMs are a way to bridge the gap between humans and machines that, while not wholly reliable, makes the gap much smaller than it used to be.

If you're not making software that interacts with humans or their fuzzy outputs (text, images, voice etc.), and have the luxury of well defined schema, you're not going to see the advantage side.

pclmulqdq 4/14/2025||||
Software engineering has involved a lot of people doing trial-and-error hand-waving for at least a decade. We are now codifying the trend.
brokencode 4/14/2025||||
Out of curiosity, what do you work on where you don’t have to experiment with different solutions to see what works best?
FridgeSeal 4/14/2025|||
Usually when we’re doing it in practice there’s _somewhat_ more awareness of the mechanics than just throwing random obstructions in and hoping for the best.
RussianCow 4/14/2025||
LLMs are still very young. We'll get there in time. I don't see how it's any different than optimizing for new CPU/GPU architectures other than the fact that the latter is now a decades-old practice.
th0ma5 4/15/2025|||
Not to pick on you, but this is exactly the objectionable handwaving. What makes you think we'll get there? The kinds of errors that these technologies make have not changed, and anything that anyone learns about how to make them better changes dramatically from moment to moment and no one can really control that. It is different because those other things were deterministic ...
Closi 4/15/2025||
In comp sci it’s been deterministic, but in other science disciplines (eg medicine) it’s not. Also in lots of science it looks non-deterministic until it’s not (eg medicine is theoretically deterministic, but you have to reason about it experimentally and with probabilities - doesn’t mean novel drugs aren’t technological advancements).

And while the kind of errors hasn’t changed, the quantity and severity of the errors has dropped dramatically in a relatively short span of time.

th0ma5 4/15/2025||
The problem has always been that every token is suspect.
Closi 4/17/2025||
It's the whole answer being correct that's the important thing, and if you compare GPT 3 vs where we are today only 5 years later the progress in accuracy, knowledge and intelligence is jaw dropping.
th0ma5 4/18/2025||
I have no idea what you're talking about because they still screw up in the exact same way as gpt3.
Closi 4/20/2025||
The hallucination quantity and severity is way less in new frontier models.
th0ma5 4/21/2025||
But not more predictable or regular.
girvo 4/14/2025|||
> I don't see how it's any different than optimizing for new CPU/GPU architectures

I mean that seems wild to say to me. Those architectures have documentation and aren't magic black boxes that we chuck inputs at and hope for the best: we do pretty much that with LLMs.

If that's how you optimise, I'm genuinely shocked.

swyx 4/14/2025||
i bet if we talked to a real low level hardware systems/chip engineer they'd laugh and take another shot at how we put them on a pedestal
girvo 4/15/2025||
Not really, in my experience. There's still fundamental differences between designed systems and trained LLMs.
greenchair 4/14/2025|||
Most people are building straightforward CRUD apps. No experimentation required.
RussianCow 4/14/2025|||
[citation needed]

In my experience, even simple CRUD apps generally have some domain-specific intricacies or edge cases that take some amount of experimentation to get right.

brokencode 4/14/2025|||
Idk, it feels like this is what you’d expect versus the actual reality of building something.

From my experience, even building on popular platforms, there are many bugs or poorly documented behaviors in core controls or APIs.

And performance issues in particular can be difficult to fix without trial and error.

karn97 4/15/2025||
Not helpful when the LLM's knowledge cutoff is a year out of date and the API and library have changed since.
muzani 4/15/2025||||
One of the major advantages and disadvantages of LLMs is they act a bit more like humans. I feel like most "prompt advice" out there is very similar to how you would teach a person as well. Teachers and parents have some advantages here.
moffkalast 4/15/2025||||
Yeah this is why I don't like statistical and ML solutions in general. Monte Carlo sampling is already kinda throwing bullshit at the wall and hoping something works with absolutely zero guarantees and it's perfectly explainable.

But unfortunately for us, clean and logical classical methods suck ass in comparison so we have no other choice but to deal with the uncertainty.

make3 4/15/2025||||
prompt tuning is a temporary necessity
kitsunemax 4/14/2025|||
I feel like this is a common pattern with people who work in STEM. As someone who is used to working with formal proofs, equations, and math, having a startup taught me how to rewire myself to work with unknowns, imperfect solutions, and messy details. I'm going on a tangent, but just wanted to share.
minimaxir 4/14/2025|||
> no evidence that ALL CAPS or Bribes or Tips or threats to grandma work

Challenge accepted.

That said, the exact quote from the linked notebook is "It’s generally not necessary to use all-caps or other incentives like bribes or tips, but developers can experiment with this for extra emphasis if so desired.", but the demo examples OpenAI provides do like using ALL CAPS.

swyx 4/14/2025|||
references for all the above + added more notes here on pricing https://x.com/swyx/status/1911849229188022278

and we'll be publishing our 4.1 pod later today https://www.youtube.com/@latentspacepod

simonw 4/14/2025|||
I'm surprised and a little disappointed by the result concerning instructions at the top, because it's incompatible with prompt caching: I would much rather cache the part of the prompt that includes the long document and then swap out the user question at the end.
mmoskal 4/14/2025|||
The way I understand it: if the instructions are at the top, the KV entries computed for the content can be influenced by the instructions; the model can "focus" on what you're asking it to do and perform some computation while it's "reading" the content. Otherwise, you're completely relying on attention to find the information in the content, leaving it much less token space to "think".
zaptrem 4/14/2025||||
Prompt on bottom is also easier for humans to read as I can have my actual question and the model’s answer on screen at the same time instead of scrolling through 70k tokens of context between them.
jeeeb 4/15/2025||||
Wouldn’t it be the other way around?

If the instructions are at the top, the KV cache entries can be precomputed and cached.

If they’re at the bottom the entries at the lower layers will have a dependency on the user input.

a2128 4/15/2025||
It's placing instructions AND user query at top and bottom. So if you have a prompt like this:

    [Long system instructions - 200 tokens]
    [Very long document for reference - 5000 tokens]
    [User query - 32 tokens]
The key-values for first 5200 tokens can be cached and it's efficient to swap out the user query for a different one, you only need to prefill 32 tokens and generate output.

But the recommendation is to use this, where in this case you can only cache the first 200 tokens and need to prefill 5264 tokens every time the user submits a new query.

    [Long system instructions - 200 tokens]
    [User query - 32 tokens]
    [Very long document for reference - 5000 tokens]
    [Long system instructions - 200 tokens]
    [User query - 32 tokens]
jeeeb 4/15/2025||
Ahh I see. Thank you for the explanation. I didn't realise there was user input straight after the system prompt.
swyx 4/14/2025|||
yep. we address it in the podcast. presumably this is just a recent discovery and can be post-trained away.
aoeusnth1 4/14/2025||
If you're skimming a text to answer a specific question, you can go a lot faster than if you have to memorize the text well enough to answer an unknown question after the fact.
kristianp 4/14/2025|||
The size of that SWE-bench Verified prompt shows how much work has gone into the prompt to get the highest possible score for that model. A third party might go to a model from a different provider before going to that extent of fine-tuning of the prompt.
Havoc 4/14/2025|||
>- don't self-inject/parse tool calls (+2%)

What is meant by this?

intalentive 4/14/2025||
Use the OpenAI API/SDK for function calling instead of rolling your own inside the prompt.
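
i.e. declare the tools via the SDK and let it hand back structured tool calls, rather than parsing them out of the text yourself. A minimal sketch (the weather tool is just a placeholder):

    from openai import OpenAI

    client = OpenAI()

    tools = [{
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the current weather for a city",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }]

    resp = client.chat.completions.create(
        model="gpt-4.1",
        messages=[{"role": "user", "content": "Weather in Tokyo?"}],
        tools=tools,
    )
    # Structured tool calls come back on the message; no hand-parsing needed.
    print(resp.choices[0].message.tool_calls)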
behnamoh 4/14/2025|||
> - JSON BAD - use XML or arxiv 2406.13121 (GDM format)

And yet, all function calling and MCP is done through JSON...

swyx 4/14/2025|||
JSON is just MCP's transport layer. you can reformat to xml to pass into model
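
e.g. a trivial sketch of that reformatting step (the element names are arbitrary):

    import json

    result = json.loads('{"city": "Tokyo", "temp_c": 18, "condition": "cloudy"}')
    xml = "<tool_result>\n" + "\n".join(
        f"  <{k}>{v}</{k}>" for k, v in result.items()
    ) + "\n</tool_result>"
    print(xml)  # paste this into the prompt instead of the raw JSON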
CSMastermind 4/14/2025|||
Yeah anyone who has worked with these models knows how much they struggle with JSON inputs.
cedws 4/15/2025||
Why XML over JSON? Are they just saying that because XML is more tokens so they can make more money?
omneity 4/14/2025||
I have been trying GPT-4.1 for a few hours now through Cursor on a fairly complicated codebase. For reference, my gold standard for a coding agent is Claude Sonnet 3.7, despite its tendency to diverge and lose focus.

My take aways:

- This is the first model from OpenAI that feels relatively agentic to me (o3-mini sucks at tool use, 4o just sucks). It seems to be able to piece together several tools to reach the desired goal and follows a roughly coherent plan.

- There is still more work to do here. Despite OpenAI's cookbook[0] and some prompt engineering on my side, GPT-4.1 stops quickly to ask questions, getting into a quite useless "convo mode". Its tool calls fail way too often as well, in my opinion.

- It's also able to handle significantly less complexity than Claude, resulting in some comical failures. Where Claude would create server endpoints, frontend components and routes and connect the two, GPT-4.1 creates simplistic UI that calls a mock API despite explicit instructions. When prompted to fix it, it went haywire and couldn't handle the multiple scopes involved in that test app.

- With that said, within all these parameters, it's much less unnerving than Claude and it sticks to the request, as long as the request is not too complex.

My conclusion: I like it, and I totally see where it shines: narrow, targeted work, alongside Claude 3.7 for creative work and Gemini 2.5 Pro for deep, complex tasks. GPT-4.1 does feel like a smaller model compared to these last two, but maybe I just need to use it for longer.

0: https://cookbook.openai.com/examples/gpt4-1_prompting_guide

ttul 4/14/2025||
I feel the same way about these models as you conclude. Gemini 2.5 is where I paste whole projects for major refactoring efforts or building big new bits of functionality. Claude 3.7 is great for most day to day edits. And 4.1 okay for small things.

I hope they release a distillation of 4.5 that uses the same training approach; that might be a pretty decent model.

sreeptkid 4/15/2025||
I completely agree. On initial takeaway I find 3.7 sonnet to still be the superior coding model. I'm suspicious now of how they decide these benchmarks...
marsh_mellow 4/14/2025||
From OpenAI's announcement:

> Qodo tested GPT‑4.1 head-to-head against Claude Sonnet 3.7 on generating high-quality code reviews from GitHub pull requests. Across 200 real-world pull requests with the same prompts and conditions, they found that GPT‑4.1 produced the better suggestion in 55% of cases. Notably, they found that GPT‑4.1 excels at both precision (knowing when not to make suggestions) and comprehensiveness (providing thorough analysis when warranted).

https://www.qodo.ai/blog/benchmarked-gpt-4-1/

arvindh-manian 4/14/2025||
Interesting link. Worth noting that the pull requests were judged by o3-mini. Further, I'm not sure that 55% vs 45% is a huge difference.
marsh_mellow 4/14/2025|||
Good point. They said they validated the results by testing with other models (including Claude), as well as with manual sanity checks.

55% to 45% definitely isn't a blowout but it is meaningful — in terms of ELO it equates to about a 36 point difference. So not in a different league but definitely a clear edge

servercobra 4/15/2025||||
Maybe not as much to us, but for people building these tools, 4.1 being significantly cheaper than Claude 3.7 is a huge difference.
elAhmo 4/15/2025|||
I first read it as 55% better, which sounds significantly higher than ~22% which they report here. Sounds misleading.
jsnell 4/14/2025|||
That's not a lot of samples for such a small effect, I don't think it's statistically significant (p-value of around 10%).
swyx 4/14/2025|||
is there a shorthand/heuristic to calculate a p-value given n samples and effect size?
tedsanders 4/14/2025||
There are no great shorthands, but here are a few rules of thumb I use:

- for N=100, worst case standard error of the mean is ~5% (it shrinks parabolically the further p gets from 50%)

- multiply by ~2 to go from standard error of the mean to 95% confidence interval

- to adjust for other sample sizes, note that the interval shrinks like 1/sqrt(N)

So:

- N=100: +/- 10%

- N=1000: +/- 3%

- N=10000: +/- 1%

(And if comparing two independent distributions, multiply by sqrt(2). But if they’re measured on the same problems, then instead multiply by between 1 and sqrt(2) to account for them finding the same easy problems easy and hard problems hard - aka positive covariance.)
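
A quick sanity check of those rules in Python (a rough sketch; scipy is assumed, and the 110/200 split below just takes 55% of the 200 head-to-head comparisons):

    import math
    from scipy.stats import binomtest

    def ci95(p_hat, n):
        # standard error of a proportion, then ~95% interval (about 2 SE)
        return 2 * math.sqrt(p_hat * (1 - p_hat) / n)

    print(ci95(0.50, 100))    # ~0.10 -> +/- 10%
    print(ci95(0.50, 1000))   # ~0.03 -> +/- 3%
    print(ci95(0.55, 200))    # ~0.07 for the 200-PR comparison

    # exact one-sided binomial test on 110 wins out of 200
    print(binomtest(110, 200, 0.5, alternative="greater").pvalue)  # ~0.089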

marsh_mellow 4/14/2025|||
p-value of 7.9% — so very close to statistical significance.

the p-value for GPT-4.1 having a win rate of at least 49% is 4.92%, so we can say conclusively that GPT-4.1 is at least (essentially) evenly matched with Claude Sonnet 3.7, if not better.

Given that Claude Sonnet 3.7 has been generally considered to be the best (non-reasoning) model for coding, and given that GPT-4.1 is substantially cheaper ($2/million input, $8/million output vs. $3/million input, $15/million output), I think it's safe to say that this is significant news, although not a game changer

jsnell 4/14/2025||
I make it 8.9% with a binomial test[0]. I rounded that to 10%, because any more precision than that was not justified.

Specifically, the results from the blog post are impossible: with 200 samples, you can't possibly have the claimed 54.9/45.1 split of binary outcomes. Either they didn't actually make 200 tests but some other number, they didn't actually get the results they reported, or they did some kind of undocumented data munging like excluding all tied results. In any case, the uncertainty about the input data is larger than the uncertainty from the rounding.

[0] In R, binom.test(110, 200, 0.5, alternative="greater")

jacobsenscott 4/15/2025|||
That's a marketing page for something called qodo that sells ai code reviews. At no point were the ai code reviews judged by competent engineers. It is just ai generated trash all the way down.
InkCanon 4/14/2025||
>4.1 Was better in 55% of cases

Um, isn't that just a fancy way of saying it is slightly better

>Score of 6.81 against 6.66

So very slightly better

wiz21c 4/14/2025|||
"they found that GPT‑4.1 excels at both precision..."

They didn't say it is better than Claude at precision etc. Just that it excels.

Unfortunately, AI has still not concluded that manipulation by the marketing dept is a plague...

kevmo314 4/14/2025||||
A great way to upsell 2% better! I should start doing that.
neuroelectron 4/14/2025||
Good marketing if you're selling a discount all purpose cleaner, not so much for an API.
marsh_mellow 4/14/2025|||
I don't think the absolute score means much — judge models have a tendency to score around 7/10 lol

55% vs. 45% equates to about a 36 point difference in ELO. in chess that would be two players in the same league but one with a clear edge

kevmo314 4/14/2025||
Rarely are two models put head-to-head though. If Claude Sonnet 3.7 isn't able to generate a good PR review (for whatever reason), a 2% better review isn't all that strong of a value proposition.
swyx 4/14/2025||
the point is oai is saying they have a viable Claude Sonnet competitor now
pbmango 4/14/2025||
I think an under appreciated reality is that all of the large AI labs and OpenAI in particular are fighting multiple market battles at once. This is coming across in both the number of products and the packaging.

1. To win consumer growth, they have continued to benefit from hyper-viral moments; lately that was image generation in 4o, which was likely technically possible long before it launched.

2. For enterprise workloads and large API use they seem to have focused less lately, but the pricing of 4.1 is clearly an answer to Gemini, which has been winning on ultra-high volume and consistency.

3. For full frontier benchmarks, they pushed out 4.5 to stay SOTA and attract the best researchers.

4. On top of all that, they had to (and did) quickly answer the reasoning promise and the DeepSeek threat with faster and cheaper o-series models.

They are still winning many of these battles but history highlights how hard multi front warfare is, at least for teams of humans.

spiderfarmer 4/14/2025||
On that note, I want to see benchmarks for which LLMs are best at translating between languages. To me, it's an entire product category.
pbmango 4/14/2025|||
There are probably many more small battles being fought or emerging. I think voice and PDF parsing are growing battles too.
oezi 4/15/2025|||
I would love to see a stackexchange-like site where humans ask questions and we get to vote on the reply by various LLMs.
anotherengineer 4/15/2025||
is this like what you're thinking of? https://lmarena.ai
oezi 4/15/2025||
Kind of. But lmarena.ai has no way to see the results for questions other people asked, and it only lets you look at two responses side by side.
kristianp 4/14/2025||
I agree. 4.1 seems to be a release that addresses shortcomings of 4o in coding compared to Claude 3.7 and Gemini 2.0 and 2.5
simonw 4/14/2025||
Here's a summary of this Hacker News thread created by GPT-4.1 (the full sized model) when the conversation hit 164 comments: https://gist.github.com/simonw/93b2a67a54667ac46a247e7c5a2fe...

I think it did very well - it's clearly good at instruction following.

Total token cost: 11,758 input, 2,743 output = 4.546 cents.

Same experiment run with GPT-4.1 mini: https://gist.github.com/simonw/325e6e5e63d449cc5394e92b8f2a3... (0.8802 cents)

And GPT-4.1 nano: https://gist.github.com/simonw/1d19f034edf285a788245b7b08734... (0.2018 cents)
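
For anyone checking the arithmetic, a quick sketch (assuming the GPT-4.1 pricing of $2 per million input tokens and $8 per million output tokens quoted elsewhere in this thread):

    input_tokens, output_tokens = 11_758, 2_743
    cost_usd = input_tokens * 2 / 1e6 + output_tokens * 8 / 1e6
    print(f"${cost_usd:.5f}")  # $0.04546, i.e. 4.546 cents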

krat0sprakhar 4/15/2025||
Hey Simon, I love how you generate these summaries and share them on every model release. Do you have a quick script that allows you to do that? Would love to take a look if possible :)
jimmySixDOF 4/15/2025|||
He has a couple of nifty plugins for the LLM utility [1], so I would guess it's something as simple as ```llm -t fabric:some_prompt_template -f hn:1234567890```, which applies a template (in this case from a fabric library) and then appends a 'fragment' block from the HN plugin that gets the comments, strips everything but the author and text, adds an index number (1.2.3.x), and inserts it into the prompt (+ SQLite).

[1] https://llm.datasette.io/en/stable/plugins/directory.html#fr...

simonw 4/15/2025|||
I use this one: https://til.simonwillison.net/llms/claude-hacker-news-themes
ilrwbwrkhv 4/15/2025||
Now try Deepseek V3 and see the magic!
elashri 4/14/2025||
Are there any benchmarks, or has anyone tested the performance of these long-max-token models in scenarios where you actually use most of the token limit?

I found from my experience with Gemini models that after ~200k the quality drops and it basically doesn't keep track of things. But I don't have any numbers or a systematic study of this behavior.

I think all providers who announce increased max token limit should address that. Because I don't think it is useful to just say that max allowed tokens are 1M when you basically cannot use anything near that in practice.

kmeisthax 4/14/2025||
The problem is that while you can train a model with the hyperparameter of "context size" set to 1M, there's very little 1M data to train on. Most of your model's ability to follow long context comes from the fact that it's trained on lots of (stolen) books; in fact I believe OpenAI just outright said in court that they can't do long context without training on books.

Novels are usually measured in words, and there's a rule of thumb that four tokens make up about three words. So that 200k-token wall you're hitting is right where most authors stop writing. 150k is already considered long for a novel, and to train 1M properly, you'd need not only a 750k-word book, but many of them. Humans just don't write or read that much text at once.

To get around this, whoever is training these models would need to change their training strategy to either:

- Group books in a series together as a single, very long text to be trained on

- Train on multiple unrelated books at once in the same context window

- Amplify the gradients by the length of the text being trained on so that the fewer long texts that do exist have greater influence on the model weights as a whole.

I suspect they're doing #2, just to get some gradients onto the longer end of the context window, but that also is going to diminish long-context reasoning because there's no reason for the model to develop a connection between, say, token 32 and token 985,234.

omneity 4/14/2025|||
I'm not sure to what extent this opinion is accurately informed. It is well known that nobody trains on 1M-token-long content. It wouldn't work anyway, as the dependencies span too far apart and you end up with vanishing gradients.

RoPE (Rotary Positional Embeddings; think modulo or periodic arithmetic) scaling is key, whereby the model is trained on 16k-token-long content and then scaled up to 100k+ [0]. Qwen 1M (which has near-perfect recall over the complete window [1]) and Llama 4 10M pushed the limits of this technique, with Qwen reliably training with a much higher RoPE base, and Llama 4 coming up with iRoPE, which claims scaling to extremely long contexts, up to infinity.

[0]: https://arxiv.org/html/2310.05209v2

[1]: https://qwenlm.github.io/blog/qwen2.5-turbo/#passkey-retriev...
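
For intuition, a tiny numeric sketch of the idea (not any particular model's implementation; the dimension and base values are made up):

    import numpy as np

    def rope_angles(pos, dim=64, base=10000.0):
        # RoPE rotates each pair of dimensions by pos * theta_i,
        # with theta_i = base ** (-2 * i / dim)
        i = np.arange(dim // 2)
        return pos * base ** (-2 * i / dim)

    # Train at 16k context with the standard base, then raise the base
    # ("RoPE scaling"): the rotation periods stretch, so positions far
    # beyond the training length rotate more slowly and look closer to
    # what the model saw during training.
    print(rope_angles(16_000)[:4])
    print(rope_angles(1_000_000, base=5_000_000.0)[:4])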

christianqchung 4/14/2025|||
But Llama 4 Scout does badly on long context benchmarks despite claiming 10M. It scores 1 slot above Llama 3.1 8B in this one[1].

[1] https://github.com/adobe-research/NoLiMa

omneity 4/14/2025||
Indeed, but it does not take away the fact that long context is not trained through long content but by scaling short content instead.
kmeisthax 4/14/2025|||
Is there any evidence that GPT-4.1 is using RoPE to scale context?

Also, I don't know about Qwen, but I know Llama 4 has severe performance issues, so I wouldn't use that as an example.

omneity 4/14/2025||
I am not sure about public evidence. But the memory requirements alone to train on 1M long windows would make it a very unrealistic proposition compared to RoPE scaling. And as I mentioned RoPE is essential for long context anyway. You can't train it in the "normal way". Please see the paper I linked previously for more context (pun not intended) on RoPE.

Re: Llama 4, please see the sibling comment.

killerstorm 4/15/2025||||
No, there's a fundamental limitation of Transformer architecture:

  * information from the entire context has to be squeezed into an information channel of a fixed size; the more information you try to squeeze the more noise you get
  * selection of what information passes through is done using just dot-product
Training data isn't the problem.

In principle, as you scale the transformer you get more heads and more dimensions in each vector, so the bandwidth of the attention data bus goes up and thus the precision of recall goes up too.

wskish 4/14/2025||||
Codebases of high-quality open source projects and their major dependencies are probably another good source. Also: "transformative fair use", not "stolen".
crimsoneer 4/14/2025||||
Isn't the problem more that the "needle in a haystack" eval ("I said word X once; where?") is really not relevant to most long-context LLM use cases like code, where you need context from all the stuff simultaneously rather than identifying a single, quite separate relevant section?
omneity 4/14/2025||
What you're describing as "needle in a haystack" is a necessary requirement for the downstream ability you want. The distinction is really how many "things" the LLM can process in a single shot.

LLMs process tokens sequentially, first in a prefilling stage, where it reads your input, then in the generation stage where it outputs response tokens. The attention mechanism is what allows the LLM as it is ingesting or producing tokens to "notice" that a token it has seen previously (your instruction) is related with a token it is now seeing (the code).

Of course this mechanism has limits (correlated with model size), and if the LLM needs to take the whole input into consideration to answer the question, the results won't be too good.

roflmaostc 4/14/2025||||
What about old books? Wikipedia? Law texts? Programming language documentation?

How many tokens is a 100 pages PDF? 10k to 100k?

arvindh-manian 4/14/2025|||
For reference, I think a common approximation is one token being 0.75 words.

For a 100 page book, that translates to around 50,000 tokens. For 1 mil+ tokens, we need to be looking at 2000+ page books. That's pretty rare, even for documentation.

It doesn't have to be text-based, though. I could see films and TV shows becoming increasingly important for long-context model training.

handfuloflight 4/14/2025||
What about the role of synthetic data?
throwup238 4/14/2025||
Synthetic data requires a discriminator that can select the highest quality results to feed back into training. Training a discriminator is easier than a full blown LLM, but it still suffers from a lack of high quality training data in the case of 1M context windows. How do you train a discriminator to select good 2,000 page synthetic books if the only ones you have to train it with are Proust and concatenated Harry Potter/Game of Thrones/etc.
jjmarr 4/14/2025|||
Wikipedia does not have many pages that are 750k words. According to Special:LongPages[1], the longest page right now is a little under 750k bytes.

https://en.wikipedia.org/wiki/List_of_chiropterans

Despite listing all presently known bats, the majority of "list of chiropterans" byte count is code that generates references to the IUCN Red List, not actual text. Most of Wikipedia's longest articles are code.

[1] https://en.wikipedia.org/wiki/Special:LongPages

nneonneo 4/14/2025|||
I mean, can't they just train on some huge codebases? There are lots of 100 KLOC codebases out there which would probably get close to 1M tokens.
enginoid 4/14/2025|||
There are some benchmarks such as Fiction.LiveBench[0] that give an indication and the new Graphwalks approach looks super interesting.

But I'd love to see one specifically for "meaningful coding." Coding has specific properties that are important such as variable tracking (following coreference chains) described in RULER[1]. This paper also cautions against Single-Needle-In-The-Haystack tests which I think the OpenAI one might be. You really need at least Multi-NIAH for it to tell you anything meaningful, which is what they've done for the Gemini models.

I think something a bit more interpretable like `pass@1 rate for coding turns at 128k` would be much more useful than "we have 1M context" (with the acknowledgement that good-enough performance is often domain dependent).

[0] https://fiction.live/stories/Fiction-liveBench-Mar-25-2025/o...

[1] https://arxiv.org/pdf/2404.06654

daemonologist 4/14/2025|||
I ran NoLiMa on Quasar Alpha (GPT-4.1's stealth mode): https://news.ycombinator.com/item?id=43640166#43640790

Updated results from the authors: https://github.com/adobe-research/NoLiMa

It's the best known performer on this benchmark, but still falls off quickly at even relatively modest context lengths (85% perf at 16K). (Cutting edge reasoning models like Gemini 2.5 Pro haven't been evaluated due to their cost and might outperform it.)

jbentley1 4/14/2025|||
https://fiction.live/stories/Fiction-liveBench-Mar-25-2025/o...

IMO this is the best long context benchmark. Hopefully they will run it for the new models soon. Needle-in-a-haystack is useless at this point. Llama-4 had perfect needle in a haystack results but horrible real-world-performance.

dr_kiszonka 4/15/2025|||
As much as I enjoy Gemini models, I have to agree with you. At some point, interactions with them start resembling talking to people with short-term memory issues, and answers become increasingly unreliable. Now, there are also reports of AI Studio glitching out and not loading these longer conversations.

Is there a reliable method for pruning, summarizing, or otherwise compressing context to overcome such issues?

consumer451 4/15/2025|||
This is a paper which echoes your experience, in general. I really wish that when papers like this one were created, someone took the methodology and kept running with it for every model:

> For instance, the NoLiMa benchmark revealed that models like GPT-4o experienced a significant drop from a 99.3% performance rate at 1,000 tokens to 69.7% at 32,000 tokens. Similarly, Llama 3.3 70B's effectiveness decreased from 97.3% at 1,000 tokens to 42.7% at 32,000 tokens, highlighting the challenges LLMs face with longer contexts.

https://arxiv.org/abs/2502.05167

gymbeaux 4/14/2025||
I’m not optimistic. It’s the Wild West and comparing models for one’s specific use case is difficult, essentially impossible at scale.
minimaxir 4/14/2025||
It's not the point of the announcement, but I do like the use of the (abs) subscript to demonstrate the improvement in LLM performance since in these types of benchmark descriptions I never can tell if the percentage increase is absolute or relative.
999900000999 4/14/2025|
Have they implemented "I don't know" yet?

I probably spend $100 a month on AI coding, and it's great at small, straightforward tasks.

Drop it into a larger codebase and it'll get confused. Even if the same tool built it in the first place due to context limits.

Then again, the way things are rapidly improving I suspect I can wait 6 months and they'll have a model that can do what I want.

mianos 4/14/2025||
I agree. I use it a lot but there is endless frustration when the C++ code I am working on gets both complex and largish. Once it gets to a certain size and the context gets too long they all pretty much lose the plot and start producing complete rubbish. It would be great for it to give some measure so I know to take over and not have it start injecting random bugs or deleting functional code. It even starts doing things like returning locally allocated pointers lately.
energy123 4/15/2025|||
> Then again, the way things are rapidly improving I suspect I can wait 6 months and they'll have a model that can do what I want.

I believe this. I've been having the forgetting problem happen less with Gemini 2.5 Pro. It does hallucinate, but I can get far just pasting all the docs and a few examples, and asking it to double check everything according to the docs instead of relying on its memory.

cheschire 4/14/2025|||
I wonder if documentation would help: create a carefully and intentionally tokenized overview of the system, and maximize the amount of routine, larger-scope information provided in minimal tokens in order to leave room for more immediate context.

Similar to the function documentation provides to developers today, I suppose.

yokto 4/14/2025||
It does, shockingly well in my experience. Check out this blog post outlining such an approach, called Literate Development by the author: https://news.ycombinator.com/item?id=43524673
paradite 4/15/2025|||
Have you tried using a tool like 16x Prompt to send only relevant code to the model?

This helps the model focus on a subset of the codebase that is relevant to the current task.

https://prompt.16x.engineer/

(I built it)

sunnybeetroot 4/15/2025||
Just some tiny feedback if you don't mind: in the free version, "10 prompts/day" is unticked, which sort of hints that there isn't a 10 prompts/day limit, but I'm guessing that's not what you want to say?
paradite 4/15/2025||
Ah I see what you mean. I was trying to convey that this is a limitation, hence not a tick symbol.

But I guess it could be interpreted differently like you said.

dev1ycan 4/15/2025||
bahahaha spoken like someone who spends $100 to do the task a single semi decent software developer (yourself) should be able to do for... $0
999900000999 4/15/2025||
It's a matter of time.

The promise of AI is that I can spend $100 to get 40 hours or so of work done.
