Posted by robotswantdata 6/30/2025

The new skill in AI is not prompting, it's context engineering (www.philschmid.de)
915 points | 518 comments
simonw 6/30/2025|
I wrote a bit about this the other day: https://simonwillison.net/2025/Jun/27/context-engineering/

Drew Breunig has been doing some fantastic writing on this subject - coincidentally at the same time as the "context engineering" buzzword appeared but actually unrelated to that meme.

How Long Contexts Fail - https://www.dbreunig.com/2025/06/22/how-contexts-fail-and-ho... - talks about the various ways in which longer contexts can start causing problems (also known as "context rot")

How to Fix Your Context - https://www.dbreunig.com/2025/06/26/how-to-fix-your-context.... - gives names to a bunch of techniques for working around these problems including Tool Loadout, Context Quarantine, Context Pruning, Context Summarization, and Context Offloading.

the_mitsuhiko 6/30/2025||
Drew Breunig's posts are a must read on this. This is not only important for writing your own agents, it is also critical when using agentic coding right now. These limitations/behaviors will be with us for a while.
outofpaper 6/30/2025||
They might be good reads on the topic but Drew makes some significant etymological mistakes. For example, loadout doesn't come from gaming but from military terminology. It's essentially the same as kit or gear.
simonw 6/30/2025|||
Drew isn't using that term in a military context, he's using it in a gaming context. He defines what he means very clearly:

> The term “loadout” is a gaming term that refers to the specific combination of abilities, weapons, and equipment you select before a level, match, or round.

In the military you don't select your abilities before entering a level.

xarope 7/1/2025|||
the military definitely do use the term loadout. It can be based on mission parameters e.g. if armored vehicles are expected, your loadout might include more MANPATS. It can also refer to the way each soldier might customize their gear, e.g. cutaway knife in boot or on vest, NODs if extended night operations expected (I know, I know, gamers would like to think you'd bring everything, but in real life no warfighter would want to carry extra weight unnecessarily!), or even the placement of gear on their MOLLE vests (all that velcro has a reason).
simonw 7/1/2025||
Nobody is disputing that. We are saying that the statement "The term 'loadout' is a gaming term" can be true at the same time.
GuinansEyebrows 6/30/2025|||
i think that software engineers using this terminology might be envisioning themselves as generals, not infantry :)
coldtea 7/1/2025||||
>Drew makes some significant etymological mistakes. For example loadout doesn't come from gaming but military terminology

Does he pretend to give the etymology and ultimate origin of the term, or just where he or other AI discussions found it? Because if it's the latter, he is entitled to call it a "gaming" term, because that's what it is to him and those in the discussion. He didn't find it in some military manual or learn it at boot camp!

But I would mostly challenge the idea that this mistake, if we admit it as such, is "significant" in any way.

The origin of loadout is totally irrelevant to the point he makes and the subject he discusses. It's just a useful term he adopted; its history is not really relevant.

DiggyJohnson 6/30/2025||||
This seems like a rather unimportant type of mistake, especially because the definition is still accurate; it's just that the etymology isn't complete.
scubbo 6/30/2025||||
It _is_ a gaming term - it is also a military term (from which the gaming term arose).
ZYbCRq22HbJ2y7 6/30/2025||||
> They might be good reads on the topic but Drew makes some significant etymological mistakes. For example, loadout doesn't come from gaming but from military terminology. It's essentially the same as kit or gear.

Doesn't seem that significant?

Not that those blog posts say much that any "prompt engineer" (someone who uses LLMs frequently) doesn't already know, but maybe it is useful to some at such an early stage of these things.

luckydata 7/1/2025|||
this is textbook pointless pedantry. I'm just commenting to find it again in the future.
pbhjpbhj 7/1/2025||
Click on the 'time' part of the comment header, then you can 'favorite' the comment. That way you can avoid adding such comments in the future.
Daub 7/1/2025|||
For visual art I feel that the existing approaches in context engineering are very much lacking. An AI understands well enough such simple things as content (bird, dog, owl, etc.), color (blue, green, etc.) and has a fair understanding of foreground/background. However, the really important stuff is not addressed.

For example: in form, things like negative shape and overlap. In color contrast, things like ratio contrast and dynamic range contrast. Or how manipulating neighboring regional contrast produces tone wrap. I could go on.

One reason for this state of affairs is that artists and designers lack the consistent terminology to describe what they are doing (though this does not stop them from operating at a high level). Indeed, many of the terms I have used here we (my colleagues and I) had to invent ourselves. I would love to work with an AI guru to address this developing problem.

skydhash 7/1/2025||
> artists and designers lack the consistent terminology to describe what they are doing

I don't think they do. It may not be completely consistent, but open any art book and you find the same thing being explained again and again. Just for drawing humans, you will find emphasis on the skeleton and muscle volume for forms and poses, planes (especially the head) for values and shadows, some abstract things like stability and line weight, and some more concrete things like foreshortening.

Several books and courses have gone over those concepts. They are not difficult to explain, they are just difficult to master. That's because you have to apply judgement for every single line or brush stroke, deciding which factors matter most and whether you even want to make the stroke. Then there's the whole matter of hand-eye coordination.

So unless you can solve judgement (which styles derive from), there's not a lot of hope there.

ADDENDUM

And when you do a study of another's work, it's not copying the data, extracting colors, or comparing labels... it's studying judgement. You know the complete formula, a more basic version of which is being used for the work, and you only want to learn the parameters. Whereas machine training is mostly going for the wrong formula with completely different variables.

Daub 7/2/2025||
I concur that there is, on some matters, a general agreement in art books. However, it certainly does not help that there is so much inconsistency of terminology. For example: the way that hue and color are so frequently used interchangeably; likewise lightness, brightness, tone and value.

What bothers me more is that so much truly important material is not being addressed as explicitly as it should be. For example: the exaggeration of contrast on which so much art relies exists in two dimensions: increase of difference and decrease of difference.

This application of contrast/affinity is a general principle that runs through the entirety of art. Indeed, I demonstrate it to my students by showing its application in Korean TV dramas. The only explicit mention I can find of this in art literature is in the work of Ruskin, nearly 200 years ago!

Even worse is that so much very important material is not being addressed at all. For example, a common device that painters employ is to configure the neighboring regional contrast of a form so that it is light against dark on one edge and dark against light on the opposing edge. In figurative paintings and in classic portrait photography this device is almost ubiquitous, yet as far as I am able to determine no one has named it or even written about it. We were obliged to name it ourselves (tone wrap).

> They are not difficult to explain, they are just difficult to master.

Completely agree that they can be difficult to master. However, a thing cannot be satisfactorily explained unless there is consistent (or even existent) terminology for that thing.

> So unless you can solve judgement (which styles derive from)

Nicely put.

skydhash 7/2/2025||
> For example, a common device that painters employ is to configure the neighboring regional contrast of a form so that it is light against dark on one edge and dark against light on the opposing edge.

I'm not fully sure what you mean. If we take the following example, are you talking about the neck and the collar of the girl?

https://i.pinimg.com/originals/ea/70/0b/ea700b6a0b366c13187e...

https://fr.pinterest.com/pin/453596993695189968/

I think the name of the concept is "edge control" (not really original). You can find some explanation here

https://www.youtube.com/watch?v=zpSlGmbUB08

To keep it short, there's no line in reality. So while you can use them when sketching, they are pretty crude, kinda like a piano with only 2 keys. The better tool is edges, meaning the delimitation between two contrasting areas. If you're doing grayscale, your areas are values (light and shadow) and it's pretty easy. Once you add color, there are more dimensions to play with and it becomes very difficult (warm and cold colors, atmospheric colors, brush strokes that give the illusion of detail...).

Again, this falls under the things that are easy to explain but take a while to learn to observe, and longer still to reproduce.

There's a book called "Color and Light" by James Gurney that goes in depth on all of these. There are a lot of parameters that go into a brush stroke in a specific area of a painting.

Daub 7/2/2025||
> I'm not fully sure what you mean. If we take the following example, are you talking about the neck and the collar of the girl?

Yes... that's exactly it. It is also described in our teaching material here (halfway down the page):

https://rmit.instructure.com/courses/87565/pages/structural-...

Rembrandt was an avid user of this technique. In his portraits, one little trick he almost always used was to ensure that there was no edge contrast whatsoever in at least one region, usually located near the bottom of the figure. This served to blend the figure into the background and avoid the flat effect that would have happened had he not used it. In class I call this 'edge loss'. An equivalent in drawing is the notion of 'open lines' whereby silhouette lines are deliberately left open at select points.

> I think the name of the concept is "edge control" (not really original). You can find some explanation here.

I am aware of the term 'edge control' though I have not heard it used in this context. I feel that the term is too general to describe what is happening in the (so-called) tone wrap.

To extend the principle, wrap is an important concept in spatial rendering (painting, photography, filmmaking etc) and is a cousin of overlap. Simply... both serve to enhance form.

> To keep it short, there's no line in reality.

True that. I learned a lot about lines from reading about non-photorealistic rendering in 3D. There are some great papers on this subject (below) though I feel there is still work to be done.

Cole, Forrester, et al. "How well do line drawings depict shape?" ACM SIGGRAPH 2009 Papers, 2009, 1-9.

Cole, Forrester, et al. "Where do people draw lines?" ACM SIGGRAPH 2008 Papers, 2008, 1-11.

I made a stab at summarizing their wisdom here:

https://rmit.instructure.com/courses/87565/pages/drawing-lin...

> There's a book called "Color and Light" by James Gurney that goes in depth on all of these. There are a lot of parameters that go into a brush stroke in a specific area of a painting.

Looking at it now. Any writer who references the Hudson River School is a friend of mine.

daxfohl 7/1/2025|||
I'm surprised there isn't already an ecosystem of libraries that just do this. When building agents you either have to roll your own or copy an algorithm out of some article.

I'd expect this to be a lot more plug and play, and as swappable as LLMs themselves by EOY, along with a bunch of tooling to help with observability, A/B testing, cost and latency analysis (since changing context kills the LLM cache), etc.

daxfohl 7/1/2025||
Or maybe it's that each of these things is pretty simple in itself. Clipping context is one line of code, summarizing could be a couple of lines to have an LLM summarize it for you, etc. So nothing is substantial enough for a formal library. Whereas the combination of these techniques is very application dependent, so not reusable enough to warrant separating out as an independent library.
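
To make that concrete, here's roughly what I mean, sketched against an OpenAI-style chat client (the model name and message shapes are placeholders, not a recommendation):

  from openai import OpenAI

  client = OpenAI()

  def clip_context(messages, max_messages=20):
      # "clipping" really is one line: keep only the most recent turns
      return messages[-max_messages:]

  def summarize_overflow(messages, max_messages=20, model="gpt-4o-mini"):
      # summarize everything older than the last max_messages turns
      overflow = messages[:-max_messages]
      if not overflow:
          return messages
      summary = client.chat.completions.create(
          model=model,
          messages=[{"role": "user",
                     "content": "Summarize this conversation so far:\n"
                                + "\n".join(m["content"] for m in overflow)}],
      ).choices[0].message.content
      return ([{"role": "system", "content": "Summary of earlier turns: " + summary}]
              + messages[-max_messages:])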

Or maybe it just hasn't matured yet and we'll see more of it in the future. We'll see.

daxfohl 7/2/2025||
Though in a way, this feels similar to things like garbage collection, disk defragmentation, or even query planning. Yes, you could build libraries that do these sorts of things for you, but in all likelihood the LLM providers will embed custom-built versions of them that have been battle tested and trained thoroughly to interop well with the corresponding LLM. So while there could still be an ecosystem, it would likely be a fairly niche thing for very specific use cases or home-grown LLMs.

Maybe something like the equivalent of AWS Firecracker for whatever the equivalent of AWS Lambda is in the future LLM world.

risyachka 6/30/2025|||
“A month-long skill” after which it won't be a thing anymore, like so many others.
simonw 6/30/2025|||
Most of the LLM prompting skills I figured out ~three years ago are still useful to me today. Even the ones that I've dropped are useful because I know that things that used to be helpful aren't helpful any more, which helps me build an intuition for how the models have improved over time.
refulgentis 6/30/2025|||
I agree with you, but I would echo OP's concern, in a way that makes me feel like a party pooper but is open about what I see us all expressing squeamishness about.

It is somewhat bothersome to have another buzz phrase. I don't know why we are doing this, other than there was a Xeet from the Shopify CEO, QT'd approvingly by Karpathy, then it's written up at length and tied to another set of blog posts.

To wit, it went from "buzzphrase" to "skill that'll probably be useful in 3 years still" over the course of this thread.

Has it even been a week since the original tweet?

There doesn't seem to be a strong foundation here, but due to the reach potential of the names involved, and their insistence on this being a thing while also indicating they're sheepish it is a thing, it will now be a thing.

Smacks of a self-aware version of Jared Friedman's tweet re: watching the invention of "Founder Mode" being like a startup version of the Potsdam Conference (which sorted out Earth post-WWII, and he was not kidding). I could not even remember the phrase for the life of me. Lasted maybe 3 months?

dbreunig 6/30/2025|||
Sometimes buzzwords turn out to be mirages that disappear in a few weeks, but often they stick around.

I find they take off when someone crystallizes something many people are thinking about internally and don't realize everyone else is having similar thoughts. In this example, I think the way agent and app builders are wrestling with LLMs is fundamentally different from how chatbot users do (it's closer to programming), and this phrase resonates with that crowd.

Here’s an earlier write up on buzzwords: https://www.dbreunig.com/2020/02/28/how-to-build-a-buzzword....

refulgentis 7/1/2025||
I agree - what distinguishes this is how rushed and self-aware it is. It is being pushed top down, sheepishly.

EDIT: Ah, you also wrote the blog posts tied to this. It gives 0 comfort that you have a blog post re: building buzz phrases in 2020; rather, it enhances the awkward inorganic rush people are self-aware of.

dbreunig 7/1/2025||
I studied linguistic anthropology, in addition to CS. Been at it since 2002.

And I wrote the first post before the meme.

refulgentis 7/1/2025||
I've read these ideas a thousand times; I thought it was the most beautiful core of the "Sparks of AGI" paper. (6.2)

We should be able to name the source of this sheepishness and have fun with the fact that we are all things at once: you can be a viral hit 2002 super PhD with expertise in all areas involved in this topic that has brought pop attention onto something important, and yet the hip topic you feel centered on can also make people's eyes roll temporarily. You're doing God's work. The AI = F(C) thing is really important. It's just that, in the short term, it will feel like a buzzword.

This is much more about me playing with, what we can reduce to, the "get off my lawn!" take. I felt it interesting to voice because it is a consistent undercurrent in the discussion and also leads to observable absurdities when trying to describe it. It is not questioning you, your ideas, or work. It has just come about at a time when things become hyperreal hyperquickly and I am feeling old.

simonw 6/30/2025|||
The way I see it we're trying to rebrand because the term "prompt engineering" got redefined to mean "typing prompts full of stupid hacks about things like tipping and dead grandmas into a chatbot".
joe5150 7/1/2025|||
It helps that the rebrand may lead some people to believe that there are actually new and better inputs into the system rather than just more elaborate sandcastles built in someone else's sandbox.
Dylan16807 7/1/2025|||
If that's what it takes to make good results, then it's respectable work even if the details are stupid.
dbreunig 6/30/2025|||
While researching the above posts Simon linked, I was struck by how many of these techniques came from the pre-ChatGPT era. NLP researchers have been dealing with this for a while.
orbital-decay 6/30/2025||||
Many people figured it out two or three years ago, when AI-assisted coding basically wasn't a thing, and it's still relevant and will stay relevant. These are fundamental principles; all big models work similarly, not just transformers and not just LLMs.

However, many fundamental phenomena are missing from the "context engineering" scope, so neither context engineering nor prompt engineering are useful terms.

coldtea 7/1/2025||||
What month-long AI skills from 2023, exactly, are obsolete now?

Surely not prompt engineering itself, for example.

bird0861 7/2/2025||
Persona prompting. (Unless the persona is the point as in role-playing.)
tptacek 7/1/2025|||
If you're not writing your own agents, you can skip this skill.
anilgulecha 7/1/2025||
Are you sure? Looking forward - AI is going to be so pervasively used that understanding what information is to be input will be a general skill. What we've been calling "prompt engineering" - the better practitioners were actually doing context engineering.
tptacek 7/1/2025||
If you're doing context engineering, you're writing an agent. It's mostly not the kind of stuff you can do from a web chat textarea.
storus 6/30/2025|||
Those issues are considered artifacts of the current crop of LLMs in academic circles; there is already research allowing LLMs to use millions of different tools at the same time, and stable long contexts, likely reducing the number of agents to one for most use cases outside interfacing with different providers.

Anyone basing their future agentic systems on current LLMs would likely face LangChain's fate - built for GPT-3, made obsolete by GPT-3.5.

simonw 6/30/2025|||
Can you link to the research on millions of different tools and stable long contexts? I haven't come across that yet.
storus 6/30/2025||
You can look at AnyTool, 2024 (16,000 tools) and start looking at newer research from there.

https://arxiv.org/abs/2402.04253

For long contexts start with activation beacons and RoPE scaling.

simonw 6/30/2025|||
I would classify AnyTool as a context engineering trick. It's using GPT-4 function calls (what we would call tool calls today) to find the best tools for the current job based on a 3-level hierarchy search.

Drew calls that one "Tool Loadout" https://www.dbreunig.com/2025/06/26/how-to-fix-your-context....
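
The flat version of that trick (skip the hierarchy, just pick the k most relevant tool schemas by embedding similarity before each request) is only a few lines. A hedged sketch, with the tool-registry shape invented for illustration:

  from openai import OpenAI
  import numpy as np

  client = OpenAI()

  def embed(texts):
      resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
      return np.array([d.embedding for d in resp.data])

  def select_tools(task, tools, k=5):
      # tools: list of {"name": ..., "description": ..., "schema": ...}
      tool_vecs = embed([t["description"] for t in tools])
      task_vec = embed([task])[0]
      scores = tool_vecs @ task_vec / (
          np.linalg.norm(tool_vecs, axis=1) * np.linalg.norm(task_vec))
      # only the top-k tool schemas go into the request
      return [tools[i]["schema"] for i in np.argsort(scores)[-k:]]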

timr 7/1/2025||
So great. We have not one, but two different ways of saying "use text search to find tools".

This field, I swear...it's the PPAP [1] of engineering.

[1] https://www.youtube.com/watch?v=NfuiB52K7X8

"I have a toool...I have a seeeeearch...unh! Now I have a Tool Loadout!" *dances around in leopard print pyjamas*

Art9681 7/1/2025||||
RoPE scaling is not an ideal solution since all LLMs in general start degrading at around 8k tokens. You also have the problem of cost from yolo'ing long context per task turn, even if the LLM were capable of crunching 1M tokens. If you self-host, then you have the problem of prompt processing time. So it doesn't matter in the end if the problem is solved and we can invoke n number of tools per task turn. It will be a quick way to become poor as long as providers are charging per token. The only viable solution is to use a smart router so only the relevant tools and their descriptions are appended to the context per task turn.
nyrikki 6/30/2025|||
Thanks for the link. It finally explained why I was getting hit up by recruiters for a job at a data broker looking to do what seemed like silly things with this.

Cloud API recommender systems must seem like a gift to that industry.

Not my area anyway, but I couldn't see a profit model in a human searching for an API when what they wanted is already well covered by most core libraries in Python etc...

ZYbCRq22HbJ2y7 6/30/2025||||
How would "a million different tool calls at the same time" work? For instance, MCP is HTTP based, even at low latency in incredibly parallel environments that would take forever.
nkohari 7/1/2025|||
There's a difference between discovery (asking an MCP server what capabilities it has) and use (actually using a tool on the MCP server).

I think the comment you're replying to is talking about discovery rather than use; that is, offering a million tools to the model, not calling a million tools simultaneously.

kiitos 7/3/2025||||
HTTP is an implementation detail, and doesn't represent any kind of unavoidable bottleneck vs. any other transport protocol one might use to do these kinds of request/response interactions.
Art9681 7/1/2025||||
It wouldn't. There is a difference between theory and practicality. Just because we could doesn't mean we should, especially when costs per token are considered. Capability and scale are often at odds.
Jarwain 7/1/2025|||
MCPs aren't the only way to embed tool calls into an LLM
coldtea 7/1/2025||
Doesn't change the argument.
tptacek 7/1/2025||
It obviously does.
Art9681 7/1/2025||
It does not. Context is context no matter how you process it. You can configure tools without MCP or with it. No matter. You still have to provide that as context to an LLM.
tptacek 7/1/2025||
If you're using native tool calls and not MCP, the latency of calls is a nonfactor; that was the concern raised by the root comment.
Foreignborn 6/30/2025||||
yes, but those aren’t released and even then you’ll always need glue code.

you just need to knowingly resource what glue code is needed, and build it in a way it can scale with whatever new limits that upgraded models give you.

i can’t imagine a world where people aren’t building products that try to overcome the limitations of SOTA models

storus 6/30/2025||
My point is that newer models will have those baked in, so instead of supporting ~30 tools before falling apart they will reliably support 10,000 tools defined in their context. That alone would dramatically change the need for more than one agent in most cases, as the architectural split into multiple agents is often driven by the inability to reliably run many tools within a single agent. Now you can hack around it today by turning tools on/off depending on the agent's state, but at some point in the future you might be able to afford not to bother and just dump all your tools into a long stable context, maybe cache it for performance, and that will be it.
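
(For reference, the state-gating hack I mean is tiny - every name below is invented:)

  # only expose the tools this state is allowed to see on the current turn
  TOOLS_BY_STATE = {
      "planning":  ["search_docs", "list_tickets"],
      "executing": ["run_query", "write_file"],
      "reviewing": ["run_tests", "post_comment"],
  }

  def tools_for_turn(agent_state, registry):
      # registry: {tool_name: json_schema}
      allowed = set(TOOLS_BY_STATE.get(agent_state, []))
      return [schema for name, schema in registry.items() if name in allowed]
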
ZYbCRq22HbJ2y7 6/30/2025||
There will likely be custom, large, and expensive models at an enterprise level in the near future (some large entities and governments already have them (niprgpt)).

With that in mind, what would be the business sense in siloing a single "Agent" instead of using something like a service discovery service that all benefit from?

storus 6/30/2025||
My guess is the main issue is latency and accuracy; a single agent, without all the routing/evaluation sub-agents around it that introduce cumulative errors, lead to infinite loops and slow it down, would likely be much faster and more accurate, and could be cached at the token level on a GPU, reducing token preprocessing time further. Now different companies would run different "monorepo" agents and those would need something like MCP to talk to each other at the business boundary, but internally all this won't be necessary.

Also, the current LLMs still have too many issues because they are autoregressive and heavily biased towards the first few generated tokens. They also still don't have full bidirectional awareness of certain relationships due to how they are masked during training. Discrete diffusion looks interesting, but I am not sure how that class of model deals with tools, as I've never seen one using any tools.

dinvlad 6/30/2025|||
> already research allowing LLMs to use millions of different tools

Hmm first time hearing about this, could you share any examples please?

simonw 6/30/2025||
See this comment https://news.ycombinator.com/item?id=44428548
dosnem 7/1/2025|||
Providing context makes sense to me, but do you have any examples of providing context and then getting the AI to produce something complex? I am quite a proponent of AI, but even I find myself failing to produce significant results on complex problems, even when I have clone + memory bank, etc. It ends up being a time sink of trying to get the AI to do something, only to have me eventually take over and do it myself.
simonw 7/1/2025|||
Quite a few times, I've been able to give it enough context to write me an entire working piece of software in a single shot. I use that for plugins pretty often, eg this:

  llm -m openai/o3 \
    -f https://raw.githubusercontent.com/simonw/llm-hacker-news/refs/heads/main/llm_hacker_news.py \
    -f https://raw.githubusercontent.com/simonw/tools/refs/heads/main/github-issue-to-markdown.html \
    -s 'Write a new fragments plugin in Python that registers issue:org/repo/123 which fetches that issue
      number from the specified github repo and uses the same markdown logic as the HTML page to turn that into a fragment'
Which produced this: https://gist.github.com/simonw/249e16edffe6350f7265012bee9e3...
AnotherGoodName 7/1/2025|||
I had a series of prompts like “Using Manim, create an animation for formula X rearranging into formula Y with a graph of values of the function”.

Beautiful one-shot results, and I now have really nice animations of some complex maths to help others understand. (I'll put them up on YouTube soon.)

I don't know the manim library at all, so this saved me about a week of work learning and implementing it.

old_man_cato 6/30/2025|||
[flagged]
jknoepfler 6/30/2025|||
Oh, and don't forget to retain the artist to correct the ever-increasingly weird and expensive mistakes made by the context when you need to draw newer, fancier pelicans. Maybe we can just train product to draw?
d0gsg0w00f 6/30/2025||||
This hits too close to home.
_carbyau_ 6/30/2025|||
[flagged]
crsv 7/1/2025|||
And then the AI doesn’t handle the front end caching properly for the 100th time in a row so you edit the owl and nothing changes after you press save.
NomDePlum 6/30/2025|||
[flagged]
TrainedMonkey 6/30/2025|||
Hire a context engineer to define the task of drawing an owl as drawing two owls.
zdw 7/1/2025|||
[flagged]
arbitrary_name 7/1/2025|||
From the first link: "Read large enough context to ensure you get what you need."

How does this actually work, and how can one better define this to further improve the prompt?

This statement feels like the 'draw the rest of the fucking owl' referred to elsewhere in the thread

simonw 7/1/2025||
I'm not sure how you ended up on that page... my comment above links to https://simonwillison.net/2025/Jun/27/context-engineering/

The "Read large enough context to ensure you get what you need" quote is from a different post entirely, this one: https://simonwillison.net/2025/Jun/30/vscode-copilot-chat/

That's part of the system prompts used by the GitHub Copilot Chat extension for VS Code - from this line: https://github.com/microsoft/vscode-copilot-chat/blob/40d039...

The full line is:

  When using the {ToolName.ReadFile} tool, prefer reading a
  large section over calling the {ToolName.ReadFile} tool many
  times in sequence. You can also think of all the pieces you
  may be interested in and read them in parallel. Read large
  enough context to ensure you get what you need.
That's a hint to the tool-calling LLM that it should attempt to guess which area of the file is most likely to include the code that it needs to review.

It makes more sense if you look at the definition of the ReadFile tool:

https://github.com/microsoft/vscode-copilot-chat/blob/40d039...

  description: 'Read the contents of a file. Line numbers are
  1-indexed. This tool will truncate its output at 2000 lines
  and may be called repeatedly with offset and limit parameters
  to read larger files in chunks.'
The tool takes three arguments: filePath, offset and limit.
JoeOfTexas 6/30/2025|||
So who will develop the first Logic Core that automates the context engineer?
igravious 6/30/2025||
The first rule of automation: that which can be automated will be automated.

Observation: this isn't anything that can't be automated.

TZubiri 7/1/2025||
Rediscovering encapsulation
benreesman 6/30/2025||
The new skill is programming, same as the old skill. To the extent these things are comprehensible, you understand them by writing programs: programs that train them, programs that run inference, programs that analyze their behavior. You get the most out of LLMs by knowing how they work in detail.

I had one view of what these things were and how they work, and a bunch of outcomes attached to that. And then I spent a bunch of time training language models in various ways and doing other related upstream and downstream work, and I had a different set of beliefs and outcomes attached to it. The second set of outcomes is much preferable.

I know people really want there to be some different answer, but it remains the case that mastering a programming tool involves implementing such, to one degree or another. I've only done medium-sophistication ML programming, and my understanding is therefore kinda medium, but like compilers, even doing a medium one is the difference between getting good results from a high-complexity one and guessing.

Go train an LLM! How do you think Karpathy figured it out? The answer is on his blog!

pyman 6/30/2025||
Saying the best way to understand LLMs is by building one is like saying the best way to understand compilers is by writing one. Technically true, but most people aren't interested in going that deep.
benreesman 6/30/2025|||
I don't know, I've heard that meme too but it doesn't track with the number of cool compiler projects on GitHub or on the HN front page, and while the LLM thing is a lot newer, you see a ton of useful/interesting stuff at the "an individual could do this on their weekends and it would mean they fundamentally know how all the pieces fit together" level.

There will always be a crowd that wants the "master XYZ in 72 hours with this ONE NEAT TRICK" course, and there will always be a..., uh, group of people serving that market need.

But most people? Especially in a place like HN? I think most people know that getting buff involves going to the gym, especially in a place like this. I have a pretty high opinion of the typical person. We're all tempted by the "most people are stupid" meme, but that's because bad interactions are memorable, not because most people are stupid or lazy or whatever. Most people are very smart if they apply themselves, and most people will work very hard if the reward for doing so is reasonably clear.

https://www.youtube.com/shorts/IQmOGlbdn8g

wickedsight 7/1/2025||||
The best way to understand a car is to build a car. Hardly anyone is going to do that, but we still all use them quite well in our daily lives. In large part because the companies who build them spend time and effort to improve them and take away friction and complexity.

If you want to be an F1 driver it's probably useful to understand almost every part of a car. If you're a delivery driver, it probably isn't, even if you use one 40+ hours a week.

benreesman 7/1/2025|||
Your example / analogy is useful in the sense that it's usually useful to establish the thought experiment with the boundary conditions.

But in between someone commuting in a Toyota and an F1 driver are many, many people; the best example from inside the extremes is probably a car mechanic, and even there, there's the oil change place with the flat fee painted in the window, and the Koenigsegg dealership that orders the part from Europe. The guy who tunes those up can afford one himself.

In the use case segment where just about anyone can do it with a few hours training, yeah, maybe that investment is zero instead of a week now.

But I'm much more interested in the one where F1 cars break the sound barrier now.

eclecticfrank 7/1/2025||||
It might make sense to split the car analogy into different users:

1. For the majority of regular users the best way to understand the car is to read the manual and use the car.

2. For F1 drivers the best way to understand the car is to consult with engineers and use the car.

3. For a mechanic / engineer the best way to understand the car is to build and use the car.

Davidzheng 7/1/2025|||
Yes, except intelligence isn't like a car; there's no way to break the complicated emergent behaviors of these models into simple abstractions. You can understand an LLM by training one about as much as you can understand a brain by dissection.
LtWorf 7/1/2025||
I think making one would help you understand that they're not intelligent.
benreesman 7/1/2025|||
Your reply is enough of a zinger that I'll chuckle and not pile on, but there is a very real and very important point here, which is that it is strictly bad to get mystical about this.

There are interesting emergent behaviors in computationally feasible scale regimes, but it is not magic. The people who work at OpenAI and Anthropic worked at Google and Meta and Jump before, they didn't draw a pentagram and light candles during onboarding.

And LLMs aren't even the "magic. Got it." ones anymore, the zero shot robotics JEPA stuff is like, wtf, but LLM scaling is back to looking like a sigmoid and a zillion special cases. Half of the magic factor in a modern frontier company's web chat thing is an uncorrupted search index these days.

Davidzheng 7/1/2025|||
OK I, like the other commenter, also feel stupid to reply to zingers--but here goes.

First of all, I think a lot of the issue here is this sense of baggage over this word intelligence--I guess because believing machines can be intelligent goes against this core belief that people have that humans are special. This isn't meant as a personal attack--I just think it clouds thinking.

Intelligence of an agent is a spectrum, it's not a yes/no. I suspect most people would not balk at me saying that ants and bees exhibit intelligent behavior when they look for food and communicate with one another. We infer this from some of the complexity of their route planning, survival strategies, and ability to adapt to new situations. Now, I assert that those same strategies can not only be learned by modern ML but are indeed often even hard-codable! As I view intelligence as a measure of an agent's behaviors in a system, such a measure should not distinguish the bee from my hard-wired agent. This for me means hard-coded things can be intelligent, as they can mimic bees (and, with enough code, humans).

However, the distribution of behaviors which humans inhabit is prohibitively difficult to code by hand. So we rely on data-driven techniques to search for such distributions in a space which is rich enough to support complexities at the level of the human brain. As such I certainly have no reason to believe, just because I can train one, that it must be less intelligent than humans. On the contrary, I believe in every verifiable domain RL must drive the agent to be the most intelligent (relative to RL reward) it can be under the constraints--and often it must become more intelligent than humans in that environment.

LtWorf 7/1/2025|||
So according to your extremely broad definition of intelligence, a Casio calculator is also intelligent?

Sure, if we define anything as intelligent, AI is intelligent.

Is this definition somehow helpful though?

Davidzheng 7/1/2025||
It's not binary...
benreesman 7/1/2025|||
Eh...kinda. The RL in RLHF is a very different animal than the RL in a Waymo car training pipeline, which is sort of obvious when you see that the former can be done by anyone with some clusters and some talent, and the latter is so hard that even Waymo has a marked preference for operating in July in Chandler AZ: everyone else is in the process of explaining why they didn't really want Level 5 per se anyways: all brakes no gas if you will.

The Q summations that are estimated/approximated by deep policy networks are famously unstable/ill-behaved under descent optimization in the general case, and it's not at all obvious that "point RL at it" is like, going to work at all. You get stability and convergence issues, you get stuck in minima, it's hard and not a mastered art yet, lot of "midway between alchemy and chemistry" vibes.

The RL in RLHF is more like Learning to Rank in a newsfeed optimization setting: it's (often) ranked-choice over human-rating preferences with extremely stable outcomes across humans. This phrasing is a little cheeky but gives the flavor: it's Instagram where the reward is "call it professional and useful" instead of "keep clicking".

When the Bitter Lesson essay was published, it was contrarian and important and most of all aimed at an audience of expert practitioners. The Bitter Bitter Lesson in 2025 is that if it looks like you're in the middle of an exponential process, wait a year or two and the sigmoid will become clear, and we're already there with the LLM stuff. Opus 4 is taking 30 seconds on the biggest cluster that billions can buy and they've stripped off like 90% of the correctspeak alignment to get that capability lift, we're hitting the wall.

Now this isn't to say that AI progress is over, new stuff is coming out all the time, but "log scale and a ruler" math is marketing at this point, this was a sigmoid.

Edit: don't take my word for it, this is LeCun (who I will remind everyone has the Turing) giving the Gibbs Lecture on the mathematics 10k feet view: https://www.youtube.com/watch?v=ETZfkkv6V7Y

Davidzheng 7/1/2025||
I'm in agreement--RLHF won't lead to massively more intelligent beings than humans. But I said RL, not RLHF.
benreesman 7/1/2025||
Well what you said is:

"On the contrary, I believe in every verifiable domain RL must drive the agent to be the most intelligent (relative to RL award) it can be under the constraints--and often it must become more intelligent than humans in that environment."

And I said it's not that simple, in no way demonstrated, unlikely with current technology, and basically, nope.

Davidzheng 7/2/2025||
Ah, you're worried about convergence issues? My (bad) understanding was that the self-driving car stuff is more about inadequacies of the models in which you simulate training and data collection than about convergence of the algorithms, but I could be wrong. That statement was just me saying that I think you can get RL to converge to close to optimum--which I agree is a bit of a stretch, as RL is famously finicky. But I don't see why one shouldn't expect this to happen as we tune the algorithms.
hackable_sand 7/2/2025|||
It's not that deep
Davidzheng 7/1/2025||
I highly, highly doubt that training an LLM like GPT-2 will help you use something the size of GPT-4. And I guess most people can't afford to train something like GPT-4. I trained some NNs back before the ChatGPT era, and I don't think any of it helps in using ChatGPT/alternatives.
benreesman 7/1/2025||
With modern high-quality datasets and the plummeting H100 rental costs it is 100% a feasible undertaking for an individual to train a model with performance far closer to gpt-4-1106-preview than to GPT-2. In fact, it's difficult to train a model that performs as badly as GPT-2 without carefully selecting for datasets like OpenWebText with the explicit purpose of replicating runs of historical interest: modern datasets will do better than that by default.

GPT-4 is a 1.75-teraweight MoE (the rumor has it), and that's probably pushing it for an individual's discretionary budget unless they're very well off, but you don't need to match that exactly to learn how these things fundamentally work.

I think you underestimate how far the technology has come. torch.distributed works out of the box now; deepspeed and other strategies that are both data and model parallel are weekend projects to spin up on an 8xH100 SXM2 interconnected cluster that you can rent from Lambda Labs; HuggingFace has extremely high quality curated datasets (the fineweb family I was alluding to from Karpathy's open stuff is stellar).
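
For a taste of the dataset side, streaming a slice of the fineweb family is a couple of lines (the "sample-10BT" config name is whichever sample subset is current; adjust as needed):

  from datasets import load_dataset

  ds = load_dataset("HuggingFaceFW/fineweb-edu", name="sample-10BT",
                    split="train", streaming=True)
  for i, row in enumerate(ds):
      print(row["text"][:200])  # peek at a few documents without downloading the whole set
      if i == 2:
          break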

In just about any version of this you come to understand how tokenizers work (which makes a whole class of failure modes go from baffling to intuitive), how models behave and get evaled after pretraining, after instruct training / SFT rounds, how convergence does and doesn't happen, how tool use and other special tokens get used (and why they are abundant).

And no, doing all that doesn't make Opus 4 completely obvious in all aspects. But it's about 1000x more effective as a learning technique than doing prompt engineer astrology. Opus 4 is still a bit mysterious if you don't work at a frontier lab; there's very interesting stuff going on there and I'm squarely speculating if I make claims about it.

Models that look and act a lot like GPT-4 while having dramatically lower parameter counts are just completely understood in open source now. The more advanced ones require resources of a startup rather than an individual, but you don't need to eval the same as 1106 to take all the mystery out of how it works.

The "holy shit" models are like 3-4 generations old now.

Davidzheng 7/1/2025||
OK, I'm open (and happy to hear!) to being wrong on this. You are saying I can find tutorials with which I can train something like a GPT-3.5-level model (like a 7B model?) from scratch for under 1000 USD of cloud compute? Is there a guide on how to do this?
benreesman 7/1/2025||
The "literally watch it on a live stream" version does in fact start with the GPT-2 arch (but evals way better): https://youtu.be/l8pRSuU81PU

Lambda Labs full-metal-jacket accelerated interconnect clusters: https://lambda.ai/blog/introducing-lambda-1-click-clusters-a...

FineWeb-2 has versions with Llama-range token counts: https://huggingface.co/datasets/HuggingFaceFW/fineweb-2

Ray Train is one popular choice for going distributed; RunHouse and a buncha other stuff too (and probably new versions since I last was doing this): https://docs.ray.io/en/latest/train/train.html

tiktokenizer is indispensable for gaining an intuition about tokenization, and it does cl100k: https://tiktokenizer.vercel.app/

Cost comes into it, and doing things more cheaply (e.g. vast.ai) is harder. Doing a phi-2 / phi-3 style pretrain is, like I said, more like the resources of a startup.

But in the video Karpathy evals better than gpt-2 overnight for 100 bucks and that will whet anyone's appetite.

If you get bogged down building FlashAttention from source or whatever, b7r6@b7r6.net

Davidzheng 7/1/2025||
Thanks for the links! Hopefully this doesn't come across as confrontational (this is really something I would like to try myself), but I don't think a GPT-2 arch will get close to GPT-3.5-level intelligence? I feel like there was some boundary around GPT-3.5 where the stuff started to feel slightly magical for me [maybe it was only the RLHF effect]. Do you think models at GPT-2 size now are getting to that capability? I know sub-10B models have been getting really smart recently.
benreesman 7/1/2025||
I think you'll be surprised if you see the lift karpathy demonstrates from `fineweb.edu` vs `webtext` (he went back later and changed the `nanogpt` repository to use `openwebtext` because it was different enough that it wasn't a good replication of GPT-2).

But from an architecture point of view, you might be surprised at how little has changed. Rotary and/or ALiBi embeddings are useful, and there's a ton on the inference efficiency side (MHA -> GQA -> MLA), but you can fundamentally take a llama and start it tractably small, and then make it bigger.

You can also get checkpoint weights for tons of models that are trivially competitive, and tune heads on them for a fraction of the cost.

This leaked Google memo is a pretty good summary (and remarkably prescient in terms of how it's played out): https://semianalysis.com/2023/05/04/google-we-have-no-moat-a...

I hope I didn't inadvertently say or imply that you can make GPT-4 in a weekend, that's not true. But you can make models with highly comparable characteristics based on open software, weights, training sets, and other resources that are basically all on HuggingFace: you can know how it works.

GPT-2 is the one you can do completely by yourself starting from knowing a little Python in one day.

JohnMakin 6/30/2025||
> Building powerful and reliable AI Agents is becoming less about finding a magic prompt or model updates.

Ok, I can buy this

> It is about the engineering of context and providing the right information and tools, in the right format, at the right time.

when the "right" format and "right" time are essentially, and maybe even necessarily, undefined, then aren't you still reaching for a "magic" solution?

If the definition of "right" information is "information which results in a sufficiently accurate answer from a language model" then I fail to see how you are doing anything fundamentally differently than prompt engineering. Since these are non-deterministic machines, I fail to see any reliable heuristic that is fundamentally indistinguishable than "trying and seeing" with prompts.

v3ss0n 7/1/2025||
At this point, due to the non-deterministic nature and hallucination, context engineering is pretty much magic. But here are our findings.

1 - LLMs tend to pick up and understand context that comes in the top 7-12 lines. Mostly the first 1k tokens are best understood by LLMs (tested on Claude and several open-source models), so the most important context, like parsing rules, needs to be placed there.

2 - Need to keep context short. Whatever context limit they claim is not true. They may have a long context window of 1 mil tokens, but only around the first 10k tokens on average have good accuracy and recall; the rest is just bunk, just ignore it. Write the prompt and try compressing/summarizing it without losing key information, either manually or with an LLM.

3 - If you build agent-to-agent orchestration, don't build agents with long context and multiple tools; break them down into several agents with different sets of tools and then put a planning agent on top which solely does handover.

4 - If all else fails, write the agent handover logic in code - as it always should be (see the sketch after this comment).

From building 5+ agent-to-agent orchestration projects in different industries using autogen + Claude - that is the result.
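
A rough skeleton of point 4, with the Agent class and routing table invented for illustration (this is not autogen's API):

  from dataclasses import dataclass

  @dataclass
  class Agent:
      name: str
      tools: list

      def run(self, context):
          # in real life: call the LLM with a short, task-specific context and this agent's few tools
          return f"{self.name} handled: {context['task']}"

  ROUTE = {"planner": "researcher", "researcher": "writer", "writer": None}

  def run_pipeline(task, agents):
      context = {"task": task, "history": []}
      current = "planner"
      while current is not None:
          summary = agents[current].run(context)         # each agent: small context, few tools
          context["history"].append((current, summary))  # hand over summaries, not full transcripts
          current = ROUTE[current]                       # deterministic handover lives in plain code
      return context

  agents = {name: Agent(name, tools=[]) for name in ROUTE}
  print(run_pipeline("summarize this thread", agents)["history"])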

zacksiri 7/1/2025|||
Based on my testing, the larger the model, the better it is at handling larger context.

I tested with 8B model, 14B model and 32B model.

I wanted it to create structured JSON, and the context was quite large, like 60k tokens.

The 8B model failed miserably despite supporting 128k context, the 14B did better, and the 32B one almost got everything correct. However, when jumping to a really large model like grok-3-mini, it got it all perfect.

The 8B, 14B, and 32B models I tried were Qwen 3. I disabled thinking on all the models I tested.

Now for my agent workflows I use small models for most workflows (it works quite nicely) and only use larger models when the problem is harder.

v3ss0n 7/1/2025||
That is true too. But I found Qwen3 14B with 8-bit quant fares better than 32B with 4-bit quant. Both with kv-cache at 8-bit. (I enabled thinking, I will try with /nothink.)
lblume 7/1/2025||||
I have uploaded entire books to the latest Gemini and had the model reliably and accurately answer specific questions requiring knowledge of multiple chapters.
FeepingCreature 7/1/2025|||
I think it works for info but not so well for instructions/guidance. That's why the standard advice is instructions at the start and repeated at the end.
grogenaut 7/1/2025|||
Or, under the covers, they're just putting all the text you fed it into a RAG database, doing embedding search to retrieve relevant snippets, and answering your questions when asked directly. Which is a different approach than recalling instructions.
raybb 7/1/2025|||
I wonder if the serial-position effect is happening with LLMs.

https://en.wikipedia.org/wiki/Serial-position_effect

potatolicious 7/1/2025||
Something like it, definitely, though not exactly. We also know that recall improves with proximity of related bits within the context.

Adherence to context is lossy in a way reminiscent of human behavior but also different in crucial ways.

HSO 7/1/2025||||
I wonder if those books were already in the training set, i.e. in a way "hardcoded" before you even steered the model that way.
jimbokun 7/1/2025||
Should be easy to test: ask the question without the book in the context window, ask again with the book in the context window.
fwn 7/1/2025||||
That’s pretty typical, though not especially reliable. (Allthough in my experience, Gemini currently performs slightly better than ChatGPT for my case.)

In one repetitive workflow, for example, I process long email threads, large Markdown tables (which is a format from hell), stakeholder maps, and broader project context, such as roles, mailing lists, and related metadata. I feed all of that into the LLM, which determines the necessary response type (out of a given set), selects appropriate email templates, drafts replies, generates documentation, and outputs a JSON table.

It gets it right on the first try about 75% of the time, easily saving me an hour a day - often more.

Unfortunately, 10% of the time, the responses appear excellent but are fundamentally flawed in some way. Just so it doesn't get boring.

simonw 7/1/2025|||
Try reformatting the data from the markdown table into a JSON or YAML list of objects. You may find that repeating the keys for every value gives you more reliable results.
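
Something like this rough sketch is what I mean (pyyaml assumed, table contents invented):

  import yaml

  def markdown_table_to_records(md):
      rows = [r.strip().strip("|").split("|") for r in md.strip().splitlines()]
      header = [h.strip() for h in rows[0]]
      # rows[1] is the |---|---| separator line
      return [dict(zip(header, (c.strip() for c in row))) for row in rows[2:]]

  table = """| name | role |
  | --- | --- |
  | Ada | reviewer |
  | Bob | approver |"""

  # every value now carries its key, which tends to survive long contexts better
  print(yaml.safe_dump(markdown_table_to_records(table), sort_keys=False))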
fwn 7/2/2025||
Thanks for the suggestion! I’ll start benchmarking my current md table setup against one using YAML. It's apparently slightly less verbose than JSON.
v3ss0n 7/1/2025|||
Gemini does a lot better at long context.
yahoozoo 7/2/2025||||
Mind if I ask how you’re doing this? I have uploaded short stories of <40,000 words in .txt format and when I ask questions like “How many chapters are there?” or “What is the last sentence in the story?” it gets it wrong. If I paste a chapter or two at a time then ask, it works better, but that’s tedious…
v3ss0n 7/1/2025|||
Try multi-turn and agent-to-agent; it will break down, but Gemini is a lot better at larger context.
zvitiate 7/1/2025|||
Claude’s system prompt is SO long though that the first 1k lines might not be as relevant for Gemini, GPT, or Grok.
mentalgear 6/30/2025|||
It's magical thinking all the way down. Whether they now call it "prompt" or "context" engineering, it's the same tinkering to find something that "sticks" in non-deterministic space.
nonethewiser 7/1/2025|||
>Whether they now call it "prompt" or "context" engineering, it's the same tinkering to find something that "sticks" in non-deterministic space.

I don't quite follow. Prompts and contexts are different things. Sure, you can get things into contexts with prompts, but that doesn't mean they are entirely the same.

You could have a long running conversation with a lot in the context. A given prompt may work poorly, whereas it would have worked quite well earlier. I don't think this difference is purely semantic.

For whatever it's worth I've never liked the term "prompt engineering." It is perhaps the quintessential example of overusing the word engineering.

Turskarama 7/1/2025|||
Both the context and the prompt are just part of the same input. To the model there is no difference; the only difference is the way the user feeds that input to the model. You could in theory feed the context into the model as one huge prompt.
__loam 7/1/2025||
Sometimes I wonder if LLM proponents even understand their own bullshit.

It's all just tokens in the context window right? Aren't system prompts just tokens that stay appended to the front of a conversation?

They're going to keep dressing this up six different ways to Sunday but it's always just going to be stochastic token prediction.

simonw 7/1/2025|||
System prompts don't even have to be appended to the front of the conversation. For many models they are actually modeled using special custom tokens - so the token stream looks a bit like:

  <system-prompt-starts>
  translate to English
  <system-prompt-ends>
  An explanation of dogs: ...
The models are then trained to (hopefully) treat the system prompt delimited tokens as more influential on how the rest of the input is treated.
throwdbaaway 7/1/2025||
> The models are then trained to (hopefully) treat the system prompt delimited tokens as more influential on how the rest of the input is treated.

I can't find any study that compares putting the same initial prompt in the system role versus in the user role. It is probably just position bias, i.e. the models can better follow the initial input, regardless of whether it is system prompt or user prompt.

StevenWaterman 7/1/2025||||
Yep, every AI call is essentially just asking it to predict what the next word is after:

  <system>
  You are a helpful assistant.
  </system>
  <user>
  Why is the sky blue?
  </user>
  <assistant>
  Because of Rayleigh scattering. The blue light refracts more.
  </assistant>
  <user>
  Why is it red at sunset then?
  </user>
  <assistant>
And we keep repeating that until the next word is `</assistant>`, then extract the bit in between the last assistant tags, and return it. The AI has been trained to look at `<user>` differently to `<system>`, but they're not physically different.

It's all prompt, it can all be engineered. Hell, you can even get a long way by pre-filling the start of the Assistant response. Usually works better than a system message. That's prompt engineering too.
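
If it helps, the whole "it's all one prompt" point fits in a few lines - a toy assembler for the made-up tags above, prefill included:

  def build_prompt(system, turns, prefill=""):
      parts = [f"<system>\n{system}\n</system>"]
      for role, text in turns:
          parts.append(f"<{role}>\n{text}\n</{role}>")
      # leave the final assistant tag open; the model just continues from the prefill
      parts.append(f"<assistant>\n{prefill}")
      return "\n".join(parts)

  print(build_prompt("You are a helpful assistant.",
                     [("user", "Why is the sky blue?")],
                     prefill="In one sentence: "))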

Terr_ 7/2/2025|||
Yeah, ultimately it's a Make Document Longer machine, and in many cases it's a hidden mad-libs script behind the scenes, where your question becomes "Next the User said", and some regular code is looking for "Next the Computer said" and "performing" it at you.

In other words, there's a deliberate illusion going on where we are encouraged to believe that generating a document about a character is the same as that character being a real entity.

phkahler 7/1/2025|||
This is why I enjoy calling AI "autocomplete" when people make big claims about it - because that's where it came from and exactly what it is.
mat_b 7/1/2025|||
AI is not autocomplete. LLMs are autocomplete.
phkahler 7/1/2025||
Yes. That's what I meant.
smokel 7/1/2025|||
Depending on what you mean exactly, "autocomplete" and big claims are not mutually exclusive.
ToucanLoucan 7/1/2025||||
> Sometimes I wonder if LLM proponents even understand their own bullshit.

Categorically, no. Most are not software engineers, in fact most are not engineers of any sort. A whole lot of them are marketers, the same kinds of people who pumped crypto way back.

LLMs have uses. Machine learning has a ton of uses. AI art is shit, LLM writing is boring, code generation and debugging is pretty cool, information digestion is a godsend some days when I simply cannot make my brain engage with whatever I must understand.

As with most things, it's about choosing the right tool for the right task, and people like AI hype folk are carpenters with a brand new, shiny hammer, and they're gonna turn every fuckin problem they can find into a nail.

Also for the love of god do not have ChatGPT draft text messages to your spouse, genuinely what the hell is wrong with you?

tilne 7/1/2025||
Leaving the “g” off the f word at the end made me re-read this in Fat Tony’s voice. It was an awesome touch.
buffzebra 7/3/2025|||
“It’s all just tokens in the context window” = “it’s all just fundamental particles,” I think. True, but reductive. Seems key that dude is talking about agentic AI not just chat. I’d revisit the email example in the post.
pennaMan 7/1/2025||||
I always used "prompting" to mean "providing context" in general, not necessarily just clever instructions, which is how people seem to be using the term.

And yes, I view clever instructions like "great grandma's last wish" still as just providing context.

>A given prompt may work poorly, whereas it would have worked quite well earlier.

The context is not the same! Of course the "prompt" (clever last sentence you just added to the context) is not going to work "the same". The model has a different context now.

ffsm8 7/1/2025||||
Yeah, if anything it should be called an art.

The term engineering makes little sense in this context, but really... Did it make sense for e.g. "QA Engineer" and all the other jobs we've tacked it onto? I don't think so, so it's kinda late to argue about it after we've been misusing the term for well over 10 yrs

groestl 7/1/2025||
Well, getting the right thing into the context in a performant way when you're dealing with a huge dataset is definitely engineering.
shakna 7/1/2025||
Engineering tends to mean "the application of scientific and mathematical principles to practical ends".

I'm not sure there's much scientific or mathematical about guessing how a non-deterministic system will behave.

SonOfLilit 7/1/2025|||
The moment you start building evaluation pipelines and running experiments to validate your ideas it stops being guessing
simonw 7/1/2025|||
Right: for me that's when "prompt engineering"/"context engineering" start to earn the "engineering" suffix: when people start being methodical and applying techniques like evals.
passwordqwe 7/1/2025||
Relevant XKCD: https://xkcd.com/397/ On whether it's science or not: the difference is testing it through experiment.
ModernMech 7/1/2025||||
You've heard of science versus pseudo-science? Well..

Engineering: "Will the bridge hold? Yes, here's the analysis, backed by solid science."

Pseudo-engineering: "Will the bridge hold? Probably. I'm not really sure; although I have validated the output of my Rube Goldberg machine, which is supposedly an expert in bridges, and it indicates the bridge will be fine. So we'll go with that."

"prompt engineer" or "context engineer" to me sounds a lot closer to "paranormal investigator" than anything else. Even "software engineer" seems like proper engineering in comparison.

groestl 7/2/2025||
Engineering: "Will the bridge hold? Yes, with a confidence of 99.95%"
grugagag 7/1/2025|||
It’s validated and filtered but isn’t it still guessing at the core? Should we call it validated guessing?
shakna 7/2/2025|||
If it's actually validated, according to rigorous principles, it's not a guess, but a system of predictions with a known confidence interval, that allows you to know if you can be sure of something.

Right now, you cannot get that far. And if you happen to... Tomorrow it will be different.

Predicting tides is possible, but it requires enormous amounts of data and processing to be sure of it. With LLMs right now, we've got the tides, but we don't have the data from the satellites, because the owner is constantly shifting the prompt, for good reasons of their own. So we can't be confident - or we can only be so blindly.

groestl 7/2/2025|||
I think a validated guess is exactly what a prediction is.
groestl 7/2/2025|||
Funny how you use a scientific term to discredit applied statistics. I've built useful non-deterministic systems many times and they had nothing to do with AI. Also, particle physics would like to have a word with you.
shakna 7/2/2025||
Guessing, how a non-deterministic system would behave.

Statistics isn't guessing. But it is guessing when the confidence interval is unknowable and constantly shifting. We're not talking relativity, we're talking about throwing pancakes at a wall to tell if there's a person behind it.

sethammons 7/1/2025|||
"Context Crafting"
belter 7/1/2025||||
Got it...updating CV to call myself a VibeOps Engineer in a team of Context Engineers...A few of us were let go last quarter, as they could only do Prompt Engineering.
tootie 7/1/2025||||
You say "magic" I say "heuristic"
ironmagma 7/1/2025||||
What is all software but tinkering?

I mean this not as an insult to software dev but to work generally. It’s all play in the end.

8n4vidtmkvmk 7/1/2025||
I don't buy this. With software engineering you can generally make incremental progress towards your goal. Yes, sometimes you have to scrap stuff, but usually not the entire thing because an LLM spat out pure nonsense.
surecoocoocoo 7/1/2025|||
We used to define a specification.

In other words: context.

But that was like old man programming.

As if the laws of physics changed between 1970 and 2009.

prmph 7/1/2025||
Is this Haiku?
edwardbernays 6/30/2025|||
The state of the art theoretical frameworks typically separates these into two distinct exploratory and discovery phases. The first phase, which is exploratory, is best conceptualized as utilizing an atmospheric dispersion device. An easily identifiable marker material, usually a variety of feces, is metaphorically introduced at high velocity. The discovery phase is then conceptualized as analyzing the dispersal patterns of the exploratory phase. These two phases are best summarized, respectively, as "Fuck Around" followed by "Find Out."
Aeolun 7/1/2025|||
There is only so much you can do with prompts. To go from the 70% accuracy you can achieve with that to the 95% accuracy I see in Claude Code, the context is absolutely the most important, and it’s visible how much effort goes into making sure Claude retrieves exactly the right context, often at the expense of speed.
majormajor 7/1/2025||
Why are we drawing a difference between "prompt" and "context" exactly? The linked article is a bit of puffery that redefines a commonly-used term - "context" - to mean something different than what it's meant so far when we discuss "context windows." It seems designed mainly to generate new hype.

When you play with the APIs the prompt/context all blurs together into just stuff that goes into the text fed to the model to produce text. Like when you build your own basic chatbot UI and realize you're sending the whole transcript along with every step. Using the terms from the article, that's "State/History." Then "RAG" and "Long term memory" are ways of working around the limits of context window size and the tendency of models to lose the plot after a huge number of tokens, to help make more effective prompts. "Available tools" info also falls squarely in the "prompt engineering" category.

The reason prompt engineering is going the way of the dodo is that tools are doing more of the drudgery of making a good prompt themselves. E.g., finding relevant parts of a codebase. They do this with a combination of chaining multiple calls to a model together to progressively build up a "final" prompt, plus various other less-LLM-native approaches (like plain old "find").

So yeah, if you want to build a useful LLM-based tool for users you have to write software to generate good prompts. But... it ain't really different than prompt engineering other than reducing the end user's need to do it manually.

It's less that we've made the AI better and more that we've made better user interfaces than just-plain-chat. A chat interface on a tool that can read your code can do more, more quickly, than one that relies on you selecting all the relevant snippets. A visual diff inside of a code editor is easier to read than a markdown-based rendering of the same in a chat transcript. Etc.

arugulum 7/1/2025|||
Because the author is artificially shrinking the scope of one thing (prompt engineering) to make its replacement look better (context engineering).

Never mind that prompt engineering goes back to pure LLMs before ChatGPT was released (i.e. before the conversation paradigm was even the dominant one for LLMs), and includes everything from few-shot prompting (including question-answer pairs) and providing tool definitions and examples to retrieval augmented generation and conversation history manipulation. In academic writing, LLMs are often defined as a distribution P(y|x), where x is not infrequently referred to as the prompt. In other words, anything that comes before the output is considered the prompt.

But if you narrow the definition of "prompt" down to "user instruction", then you get to ignore all the work that's come before and talk up the new thing.

simonw 7/1/2025||||
One crucial difference between the prompt and the context: the prompt is just content that is provided by a user. The context also includes text that was output by the bot - in conversational interfaces the context incorporates the system prompt, then the user's first prompt, the LLM's reply, the user's next prompt, and so on.
majormajor 7/1/2025||
Here, even making that distinction of prompt-as-most-recent-user-input-only, if we use "context" as it's generally been defined in "context window", then RAG and such are not themselves part of the context. They are just things that certain applications might use to enrich the context.

But personally I think a focus on "prompt" that refers to a specific text box in a specific application vs using it to refer to the sum total of the model input increases confusion about what's going on behind the scenes. At least when referring to products built on the OpenAI Chat Completions APIs, which is what I've used the most.

Building a simple dummy chatbot UI is very informative here for de-mystifying things and avoiding misconceptions about the model actually "learning" or having internal "memory" during your conversation. You're just supplying a message history as the model input prompt. It's your job to keep submitting the history - and you're perfectly able to change it if you like (such as rolling up older messages to keep a shorter context window).
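
A minimal sketch of that kind of dummy chatbot loop against the Chat Completions API; the model name and the crude trimming policy are illustrative only:

  from openai import OpenAI

  client = OpenAI()
  history = [{"role": "system", "content": "You are a helpful assistant."}]

  def chat(user_text, max_turns=20):
      history.append({"role": "user", "content": user_text})
      # The model has no memory of its own; the transcript is re-sent every call.
      # Crude context management: keep the system prompt plus the newest turns.
      trimmed = history[:1] + history[1:][-max_turns:]
      resp = client.chat.completions.create(model="gpt-4o-mini", messages=trimmed)
      reply = resp.choices[0].message.content
      history.append({"role": "assistant", "content": reply})
      return reply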

Aeolun 7/1/2025|||
> Why are we drawing a difference between "prompt" and "context" exactly?

Because they’re different things? The prompt doesn’t dynamically change. The context changes all the time.

I’ll admit that you can just call it all ‘context’ or ‘prompt’ if you want, because it’s essentially a large chunk of text. But it’s convenient to be able to distinguish between the two so you can talk about the same thing.

__loam 7/1/2025||
It's all the same blob of text in the api call
chestervonwinch 7/1/2025|||
There is a conceptual difference between a blob of text drafted by a person and a dynamically generated blob of text initiated by a human, generated through multiple LLM calls that pull in information from targeted resources. Perhaps "dynamically generated prompts" is more fitting than "context", but nevertheless, there is a difference to be teased out, whatever the jargon we decide to use.
FeepingCreature 7/1/2025|||
There's always been a distinction between prompt and data.
simonw 7/1/2025|||
LLM's can't distinguish between instruction prompts and data prompts - that's why prompt injection attacks exist.
FeepingCreature 7/3/2025||
I agree, and that's a problem. It doesn't mean the distinction doesn't exist, in fact it shows the opposite.
oblio 7/1/2025|||
Spoken like a non Lisp programmer.
dinvlad 6/30/2025|||
> when the "right" format and "right" time are essentially, and maybe even necessarily, undefined, then aren't you still reaching for a "magic" solution?

Exactly the problem with all "knowing how to use AI correctly" advice out there rn. Shamans with drums, at the end of the day :-)

thomastjeffery 7/1/2025|||
Models are Biases.

There is no objective truth. Everything is arbitrary.

There is no such thing as "accurate" or "precise". Instead, we get to work with "consistent" and "exhaustive". Instead of "calculated", we get "decided". Instead of "defined" we get "inferred".

Really, the whole narrative about "AI" needs to be rewritten from scratch. The current canonical narrative is so backwards that it's nearly impossible to have a productive conversation about it.

andy99 6/30/2025|||
It's called over-fitting, that's basically what prompt engineering is.
evjan 7/1/2025||
That doesn't sound like how I understand over-fitting, but I'm intrigued! How do you mean?
felipeerias 7/1/2025|||
If someone asked you about the usages of a particular element in a codebase, you would probably give a more accurate answer if you were able to use a code search tool rather than reading every source file from top to bottom.

For that kind of task (and there are many of those!), I don't see why you would expect something fundamentally different in the case of LLMs.

bostik 7/1/2025|||
In my previous job I repeatedly told people that "git grep is a superpower". Especially in a monorepo, but works well in any big repository, really.

To this day I think the same. With the addition that knowing about "git log -S" grants you necromancy in addition to the regular superpowers. The ability to do rapid code search, and especially code history search, makes you look like a wizard without the funny hat.

manishsharan 7/1/2025||||
I provided 'grep' as a tool to an LLM (deepseek) and it does a better job of finding usages. This is especially true if the code is obfuscated JavaScript.
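
Roughly, "giving the model grep" can look like the sketch below: a tool definition in the common JSON-schema function-calling style plus a local handler whose output goes straight back into the context (the schema shape, flags, and truncation limit are just illustrative):

  import subprocess

  GREP_TOOL = {
      "type": "function",
      "function": {
          "name": "grep",
          "description": "Search the repository for a pattern and return matching lines.",
          "parameters": {
              "type": "object",
              "properties": {
                  "pattern": {"type": "string", "description": "Regex to search for"},
                  "path": {"type": "string", "description": "Directory to search"},
              },
              "required": ["pattern"],
          },
      },
  }

  def run_grep(pattern, path="."):
      # -r: recurse into subdirectories, -n: include line numbers.
      out = subprocess.run(["grep", "-rn", pattern, path],
                           capture_output=True, text=True)
      return out.stdout[:10_000]  # truncate so one call can't flood the context
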
skydhash 7/1/2025|||
But why not provide the search tool instead of being an imperfect interface between it and the person asking? The only reason for the latter is that you have more applied knowledge in the context and can use the tool better. For any other case, the answer should be “use this tool”.
gpm 7/1/2025|||
Because the LLM is faster at typing the input, and faster at reading the output, than I am... the amount of input I have to give the LLM is less than what I have to give the search tool invocations, and the amount of output I have to read from the LLM is less than the amount of output from the search tool invocations.

To be fair it's also more likely to mess up than I am, but for reading search results to get an idea of what the code base looks like the speed/accuracy tradeoff is often worth it.

And if it was just a search tool this would be barely worth it, but the effects compound as you chain more tools together. For example: reading and running searches + reading and running compiler output is worth more than double just reading and running searches.

It's definitely an art to figure out when it's better to use an LLM, and when it's just going to be an impediment, though.

(Which isn't to agree that "context engineering" is anything other than "prompt engineering" rebranded, or has any staying power)

skydhash 7/2/2025||
So instead of building a better tool, we're patching the last one with another tool that isn't even reliable, just so we can use it faster.

That reminds me of the first chapter in "The Programmer Brain" by Felienne Hermans. There's an explanation there that confusion when reading code is caused by three things:

- Lack of knowledge: When you don't have the faintest idea of the notation or symbol being used, aka the WHAT.

- Lack of information: When you know the WHAT, but you can't figure out the WHY.

- Lack of processing power: When you have an idea of the WHY, but can't grasp the HOW.

We already have methods and tooling for all the above and they work fine without having to do shamanic rituals.

__loam 7/1/2025|||
The uninformed would rather have a natural language interface rather than learn how to actually use the tools.
skydhash 7/1/2025||
The reason to have the expert in this case (for an uninformed person who wants to solve a problem) is that the expert can use metaphors as a bridge for understanding. Just like in most companies, there's the business world (which is heterogeneous) and the software engineering world. A huge part of a software engineer's time is spent translating concepts across the two. And the most difficult part of that is asking questions and knowing which questions to ask, as natural language is so ambiguous.
autobodie 7/1/2025|||
The problem is that "right" is defined circularly
ninetyninenine 7/1/2025|||
Yeah but do we have to make a new buzz word out of it? "Context engineer"
FridgeSeal 6/30/2025|||
It’s just AI people moving the goalposts now that everyone has realised that “prompt engineering” isn’t a special skill.
coliveira 7/1/2025|||
In other words, "if AI doesn't work for you the problem is not IA, it is the user", that's what AI companies want us to believe.
shermantanktop 7/1/2025||
That’s a good indicator of an ideology at work: no-true-Scotsman deployed at every turn.
j45 7/1/2025|||
Everything is new to someone and the terms of reference will evolve.
colordrops 7/1/2025|||
> Since these are non-deterministic machines, I fail to see any reliable heuristic that is fundamentally indistinguishable than "trying and seeing" with prompts

There are many sciences involving non-determinism that still have laws and patterns, e.g. biology and maybe psychology. It's not all or nothing.

Also, LLMs are deterministic, just not predictable. The non-determinism is injected by providers.

Anyway is there an essential difference between prompt engineering and context engineering? They seem like two names for the same thing.

simonw 7/1/2025||
They arguably are two names for the same thing.

The difference is that "prompt engineering" as a term has failed, because to a lot of people the inferred definition is "a laughably pretentious term for typing text into a chatbot" - it's become indistinguishable from end-user prompting.

My hope is that "context engineering" better captures the subtle art of building applications on top of LLMs through carefully engineering their context.

phyalow 7/1/2025|||
“non-deterministic machines“

Not correct. They are deterministic as long as a static seed is used.

kazga 7/1/2025||
That's not true in practice. Floating point arithmetic is not commutative due to rounding errors, and the parallel operations introduce non-determinism even at temperature 0.
SetTheorist 7/1/2025|||
Nitpick: I think you mean that FP arithmetic is not _associative_ rather than non-commutative.

Commutative: A+B = B+A
Associative: A+(B+C) = (A+B)+C

zorked 7/1/2025||||
That's basically a bug though, not an important characteristic of the system. Engineering tradeoff, not math.
e12e 7/1/2025||
It's pretty important when discussing concrete implementations though, just like when using floats as coordinates in a space/astronomy simulator and getting decreasing accuracy as your objects move away from your chosen origin.
phyalow 7/1/2025|||
What? You can get consistent output on local models.

I can train large nets deterministically too (CUBLAS flags). What you're saying isn't true in practice. Hell, I can also go on the anthropic API right now and get verbatim static results.

simonw 7/1/2025|||
"Hell I can also go on the anthropic API right now and get verbatim static results."

How?

Setting temperature to 0 won't guarantee the exact same output for the exact same input, because - as the previous commenter said - floating point arithmetic is non-commutative, which becomes important when you are running parallel operations on GPUs.

sva_ 7/2/2025||
Shouldn't it be the fact that they're non-associative? Because the reduction kernels will combine partial results (like the dot‑products in a GEMM or the sum across attention heads) in a way where the order of operations may change (non-associative), which can lead to the individual floats being rounded off differently.
oxidi 7/1/2025|||
I think lots of people misunderstand that the "non-deterministic" nature of LLMs comes from sampling the token distribution, not from the model itself.
simonw 7/1/2025||
It's also the way the model runs. Setting temperature to zero and picking a fixed seed would ideally result in deterministic output from the sampler, but in parallel execution of matrix arithmetic (eg using a GPU) the order of floating point operations starts to matter, so timing differences can produce different results.
oxidi 7/1/2025||
Good point. Though sampling generally happens on the CPU in a linear way. What you describe might influence the raw output logits from a single LLM step, but since the differences are only tiny, a well designed sampler could still make the output deterministic (so same seed = same text output). With a very high temperature these small differences might influence the output though, since the ranking of two tokens might be swapped.

I think the usual misconception is to think that LLM outputs are random "by default". IMHO this apparent randomness is more of a feature rather than a bug, but that may be a different conversation.
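
As a toy sketch of where that randomness lives (numpy, made-up logits): given identical logits, greedy decoding, or any sampler run with a fixed seed, picks the same token every time; temperature and sampling are where the apparent randomness enters:

  import numpy as np

  logits = np.array([2.0, 1.9, 0.3])      # scores for three candidate tokens
  temperature = 0.8

  probs = np.exp(logits / temperature)    # temperature scaling
  probs /= probs.sum()                    # softmax -> a probability distribution

  greedy = int(np.argmax(logits))         # deterministic: always token 0 here
  rng = np.random.default_rng(seed=42)    # seeded sampling is also repeatable
  sampled = int(rng.choice(len(probs), p=probs))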

pbreit 7/1/2025|||
What's the difference?
PeterStuer 7/1/2025|||
"these are non-deterministic machines"

Only if you choose so by allowing some degree of randomness with the temperature setting.

pegasus 7/1/2025|||
They are usually nondeterministic even at temperature 0 - due to things like parallelism and floating point rounding errors.
Gracana 7/1/2025|||
This is dependent on configuration, you can get repeatable results if you need them. I know at least llama.cpp and vllm v0 are deterministic for a given version and backend, and vllm v1 is deterministic if you disable multiprocessing.
PeterStuer 7/1/2025|||
Floating point rounding errors are still deterministic. Parallelism dynamics can impact results, but those are not specific to LLMs.
simonw 7/1/2025||
Here's something that isn't deterministic:

   a = 0.1, b = 0.2, c = 0.3
   a * (b * c) = 0.006
   (a * b) * c = 0.006000000000000001
If you are running these operations in parallel you can't guarantee which of those orders the operations will complete in.

When you're running models on a GPU (or any other architecture that runs a whole bunch of matrix operations in parallel) you can't guarantee the order of the operations.

zelphirkalt 7/1/2025||
The order of completion doesn't necessarily influence the overall result of a parallelized computation. This depends on how the results are aggregated. For example for reducing floating point error in calculating a sum of floating point numbers, you could have a sorting step before calculating the sum and then start summing up from the lowest values to the higher ones. Then it doesn't matter at all which of the values is calculated first, because you need them all anyway, to sort them and once they are sorted, the result will always be the same, given same input values.

So you can see, completion time is a completely orthogonal issue, or can be made one.

And even libraries like tensorflow can be made to give reproducible results, when setting the corresponding seeds for the underlying libraries. Have done that myself, speaking from experience in a machine learning setting.
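
A small plain-Python sketch of that point (toy numbers chosen so the rounding is visible): the float sum depends on the order of the additions, but imposing a fixed order before reducing, e.g. by sorting first, makes the result independent of which task happened to finish first:

  import random

  values = [0.1] * 10 + [1e16, -1e16]   # the big pair makes order effects obvious

  shuffled = values[:]
  random.seed(0)
  random.shuffle(shuffled)              # stand-in for "whichever task finished first"

  a = sum(values)                       # left-to-right over the original order
  b = sum(shuffled)                     # same numbers, possibly a different float

  # The same multiset sorts to the same sequence, so these two always agree:
  c = sum(sorted(values))
  d = sum(sorted(shuffled))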

edflsafoiewq 7/1/2025||||
In the strict sense, sure, but the point is they depend not only on the seed but on seemingly minor variations in the prompt.
zelphirkalt 7/1/2025|||
This is what irks me so often when reading these comments. This is just software inside an ordinary computer; it always does the same thing with the same input, which includes hidden and global state. Stating that they are "non-deterministic machines" sounds like throwing in the towel and thinking "it's magic!". I am not even sure what people want to actually express when they make these false statements.

If one wants to make something give the same answers every time, one needs to control all the variables of input. This is like any other software including other machine learning algorithms.

csallen 7/1/2025||
This is like telling a soccer player that no change in practice or technique is fundamentally different than another, because ultimately people are non-deterministic machines.
niemandhier 7/1/2025||
LLM agents remind me of the great Nolan movie „Memento“.

The agents cannot change their internal state, hence they change the encompassing system.

They do this by injecting information into it in such a way that the reaction that is triggered in them compensates for their immutability.

For this reason I call my agents „Sammy Jankis“.

StochasticLi 7/1/2025|
I think we can reasonably expect they will become stateful in the next few years.
tdaltonc 7/2/2025|||
If agents are stateful a few years from now, it will be because they've accreted a layer of context engineering.
roflyear 7/1/2025|||
Why?
StochasticLi 7/1/2025||
That's where the research is going.
baxtr 6/30/2025||
>Conclusion

Building powerful and reliable AI Agents is becoming less about finding a magic prompt or model updates. It is about the engineering of context and providing the right information and tools, in the right format, at the right time. It’s a cross-functional challenge that involves understanding your business use case, defining your outputs, and structuring all the necessary information so that an LLM can “accomplish the task."

That’s actually also true for humans: the more context (aka the right info at the right time) you provide, the better they can solve tasks.

root_axis 6/30/2025||
I am not a fan of this banal trend of superficially comparing aspects of machine learning to humans. It doesn't provide any insight and is hardly ever accurate.
furyofantares 6/30/2025|||
I've seen a lot of cases where, if you look at the context you're giving the model and imagine giving it to a human (just not yourself or your coworker, someone who doesn't already know what you're trying to achieve - think mechanical turk), the human would be unlikely to give the output you want.

Context is often incomplete, unclear, contradictory, or just contains too much distracting information. Those are all things that will cause an LLM to fail that can be fixed by thinking about how an unrelated human would do the job.

EricMausler 6/30/2025|||
Alternatively, I've gotten exactly what I wanted from an LLM by giving it information that would not be enough for a human to work with, knowing that the llm is just going to fill in the gaps anyway.

It's easy to forget that the conversation itself is what the LLM is helping to create. Humans will ignore or deprioritize extra information. They also need the extra information to get an idea of what you're looking for in a loose sense. The LLM is much more easily influenced by any extra wording you include, and loose guiding is likely to become strict guiding.

furyofantares 7/1/2025||
Yeah, it's definitely not a human! But it is often the case in my experience that problems in your context are quite obvious once looked at through a human lens.

Maybe not very often in a chat context, my experience is in trying to build agents.

0xdeafcafe 7/3/2025||||
Totally agree. We've found that a lot of "agent failures" trace back to assumptions, bad agent-decisions, or bloat buried in the context, stuff that makes perfect sense to the dev who built it when following the happy path, but can so easily fall apart in real-world scenarios.

We've been working on a way to test this more systematically by simulating full conversations with agents and surfacing the exact point where things go off the rails. Kind of like unit tests, but for context, behavior, and other ai jank.

Full disclosure, I work at the company building this, but the core library is open source, free to use, etc. https://github.com/langwatch/scenario

root_axis 7/1/2025|||
I don't see the usefulness of drawing a comparison to a human. "Context" in this sense is a technical term with a clear meaning. The anthropomorphization doesn't enlighten our understanding of the LLM in any way.

Of course, that comment was just one trivial example, this trope is present in every thread about LLMs. Inevitably, someone trots out a line like "well humans do the same thing" or "humans work the same way" or "humans can't do that either". It's a reflexive platitude most often deployed as a thought-terminating cliche.

furyofantares 7/1/2025||
I agree with you completely about the trend which has been going on for years. And it's usually used to trivialize the vast expanse between humans and LLMs.

In this case though it's a pretty weird and hard job to create a context dynamically for a task, cobbling together prompts, tool outputs, and other LLM outputs. This is hard enough and weird enough that you can often end up failing to make text that even a human could make sense of to produce the desired output. And there is practical value to taking a context the LLM failed at and checking if you'd expect a human to succeed.

stefan_ 6/30/2025||||
There's all these philosophers popping up everywhere. This is also another one of these topics that featured in people's favorite scifi hyperfixation, so all discussions inevitably get ruined with scifi fanfic (see also: room temperature superconductivity).
ModernMech 6/30/2025||||
I agree, however I do appreciate comparisons to other human-made systems. For example, "providing the right information and tools, in the right format, at the right time" sounds a lot like a bureaucracy, particularly because "right" is decided for you, it's left undefined, and may change at any time with no warning or recourse.
baxtr 7/1/2025|||
Without my note I wouldn’t have seen this comment, which is very insightful to me at least.

https://news.ycombinator.com/item?id=44429880

layer8 7/1/2025|||
The difference is that humans can actively seek to acquire the necessary context by themselves. They don't have to passively sit there and wait for someone else to do the tedious work of feeding them all necessary context upfront. And we value humans who are able to proactively do that seeking by themselves, until they are satisfied that they can do a good job.
simonw 7/1/2025||
> The difference is that humans can actively seek to acquire the necessary context by themselves

These days, so can LLM systems. The tool calling pattern got really good in the last six months, and one of the most common uses of that is to let LLMs search for information they need to add to their context.

o3 and o4-mini and Claude 4 all do this with web search in their user-facing apps and it's extremely effective.

The same pattern is increasingly showing up in coding agents, giving them the ability to search for relevant files or even pull in official documentation for libraries.
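
A minimal sketch of that tool-calling loop, in the OpenAI-style function-calling shape; the search_docs tool and its handler are invented for illustration:

  import json
  from openai import OpenAI

  client = OpenAI()

  def search_docs(query: str) -> str:
      return "...relevant documentation snippets for: " + query  # placeholder retrieval

  TOOLS = [{"type": "function", "function": {
      "name": "search_docs",
      "description": "Search the project documentation.",
      "parameters": {"type": "object",
                     "properties": {"query": {"type": "string"}},
                     "required": ["query"]}}}]

  def run(messages):
      # Keep calling the model until it answers instead of requesting a tool.
      while True:
          resp = client.chat.completions.create(
              model="gpt-4o-mini", messages=messages, tools=TOOLS)
          msg = resp.choices[0].message
          if not msg.tool_calls:
              return msg.content
          messages.append(msg)  # keep the tool request itself in the context
          for call in msg.tool_calls:
              args = json.loads(call.function.arguments)
              messages.append({"role": "tool",
                               "tool_call_id": call.id,
                               "content": search_docs(**args)})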

mentalgear 6/30/2025|||
Basically, finding the right buttons to push within the constraints of the environment. Not so much different from what (SW) engineering is, only non-deterministic in the outcomes.
QuercusMax 6/30/2025|||
Yeah... I'm always asking my UX and product folks for mocks, requirements, acceptance criteria, sample inputs and outputs, why we care about this feature, etc.

Until we can scan your brain and figure out what you really want, it's going to be necessary to actually describe what you want built, and not just rely on vibes.

therealdrag0 7/1/2025|||
Ya reminds me of social engineering. Like we’re seeing “How to Win Programming and Influence LLMs”.
fergal 7/1/2025|||
This... I was about to make a similar point; this conclusion reads like a job description for a technical lead role, where they manage and define work for a team of human devs who execute the implementation.
eviks 7/1/2025|||
Right info at the right time is not "more", and with humans it's pretty easy to overwhelm, so do the opposite - convert "more" into "wrong"
lupire 6/30/2025|||
Not "more" context. "Better" context.

(X-Y problem, for example.)

Davidzheng 7/1/2025||
I think too much context is harmful
zaptheimpaler 7/1/2025||
I feel like this is incredibly obvious to anyone who's ever used an LLM or has any concept of how they work. It was equally obvious before this that the "skill" of prompt-engineering was a bunch of hacks that would quickly cease to matter. Basically they have the raw intelligence, you now have to give them the ability to get input and the ability to take actions as output and there's a lot of plumbing to make that happen.
imiric 7/1/2025||
That might be the case, but these tools are marketed as having close to superhuman intelligence, with the strong implication that AGI is right around the corner. It's obvious that engineering work is required to get them to perform certain tasks, which is what the agentic trend is about. What's not so obvious is the fact that getting them to generate correct output requires some special skills or tricks. If these tools were truly intelligent and capable of reasoning, surely they would be able to inform human users when they lack contextual information instead of confidently generating garbage, and their success rate would be greater than 35%[1].

The idea that fixing this is just a matter of providing better training and contextual data, more compute or plumbing, is deeply flawed.

[1]: https://www.theregister.com/2025/06/29/ai_agents_fail_a_lot/

skort 7/1/2025||
Yeah, my reaction to this was "Big deal? How is this news to anyone"

It reads like articles put out by consultants at the height of SOA. Someone thought for a few minutes about something and figured it was worth an article.

crystal_revenge 6/30/2025||
Definitely mirrors my experience. One heuristic I've often used when providing context to a model is "is this enough information for a human to solve this task?". Building some text2SQL products in the past, it was very interesting to see how often, when the model failed, a real data analyst would reply something like "oh yea, that's an older table we don't use any more, the correct table is...". This means the model was likely making a mistake that a real human analyst would have made without the proper context.

One thing that is missing from this list is: evaluations!

I'm shocked how often I still see large AI projects being run without any regard to evals. Evals are more important for AI projects than test suites are for traditional engineering ones. You don't even need a big eval set, just one that covers your problem surface reasonably well. However without it you're basically just "guessing" rather than iterating on your problem, and you're not even guessing in a way where each guess is an improvement on the last.

edit: To clarify, I ask myself this question. It's frequently the case that we expect LLMs to solve problems without the necessary information for a human to solve them.
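
For what it's worth, the harness really can start tiny. A sketch, where run_agent() stands in for whatever you're iterating on (a prompt, a context-assembly step, a whole pipeline) and the cases and scoring rule are placeholders:

  EVAL_SET = [
      {"input": "total revenue last month", "must_mention": "orders_v2"},
      {"input": "active users by region",   "must_mention": "dim_region"},
      # ...a few dozen cases covering your problem surface is already useful
  ]

  def score(case, output: str) -> bool:
      return case["must_mention"] in output

  def run_evals(run_agent) -> float:
      passed = sum(score(c, run_agent(c["input"])) for c in EVAL_SET)
      return passed / len(EVAL_SET)

  # Re-run after every prompt or context change, so each "guess" is measured:
  #   baseline  = run_evals(old_agent)
  #   candidate = run_evals(new_agent)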

adiabatichottub 6/30/2025||
A classic law of computer programming:

"Make it possible for programmers to write in English and you will find that programmers cannot write in English."

It's meant to be a bit tongue-in-cheek, but there is a certain truth to it. Most human languages fail at being precise in their expression and interpretation. If you can exactly define what you want in English, you probably could have saved yourself the time and written it in a machine-interpretable language.

hobs 6/30/2025|||
The thing is, all the people cosplaying as data scientists don't want evaluations, and that's why you saw so few of them in fake C-level projects, because telling people the emperor has no clothes doesn't pay.

For those actually using the products to make money - well, hey, all of those have evaluations.

shermantanktop 7/1/2025||
I know this proliferation of excited wannabes is just another mark of a hype cycle, and there’s real value this time. But I find myself unreasonably annoyed by people getting high on their own supply and shouting into a megaphone.
kevin_thibedeau 6/30/2025|||
Asking yes/no questions will get you a lie 50% of the time.
adriand 6/30/2025||
I have pretty good success with asking the model this question before it starts working as well. I’ll tell it to ask questions about anything it’s unsure of and to ask for examples of code patterns that are in use in the application already that it can use as a template.
zacharyvoase 6/30/2025||
I love how we have such a poor model of how LLMs work (or more aptly don't work) that we are developing an entire alchemical practice around them. Definitely seems healthy for the industry and the species.
simonw 6/30/2025||
The stuff that's showing up under the "context engineering" banner feels a whole lot less alchemical to me than the older prompt engineering tricks.

Alchemical is "you are the world's top expert on marketing, and if you get it right I'll tip you $100, and if you get it wrong a kitten will die".

The techniques in https://www.dbreunig.com/2025/06/26/how-to-fix-your-context.... seem a whole lot more rational to me than that.

zacharyvoase 7/1/2025||
As it gets more rigorous and predictable I suppose you could say it approaches psychology.
hackable_sand 7/2/2025|||
This is offensive to alchemy.
__MatrixMan__ 7/1/2025||
Reminds me of quantum mechanics
munificent 7/1/2025||
All of these blog posts to me read like nerds speedrunning "how to be a tech lead for a non-disastrous internship".

Yes, if you have an over-eager but inexperienced entity that wants nothing more than to please you by writing as much code as possible, then, as the entity's lead, you have to architect a good space where they have all the information they need but can't easily get distracted by nonessential stuff.

tptacek 7/1/2025|
Just to keep some clarity here, this is mostly about writing agents. In agent design, LLM calls are just primitives, a little like how a block cipher transform is just a primitive and not a cryptosystem. Agent designers (like cryptography engineers) carefully manage the inputs and outputs to their primitives, which are then composed and filtered.
dinvlad 6/30/2025|
I feel like ppl just keep inventing concepts for the same old things, which come down to dancing with the drums around the fire and screaming shamanic incantations :-)
viccis 6/30/2025|
When I first used these kinds of methods, I described it along those lines to my friend. I told him I felt like I was summoning a demon and that I had to be careful to do the right incantations with the right words and hope that it followed my commands. I was being a little disparaging with the comment because the engineer in me that wants reliability, repeatability, and rock solid testability struggles with something that's so much less under my control.

God bless the people who give large scale demos of apps built on this stuff. It brings me back to the days of doing vulnerability research and exploitation demos, in which no matter how much you harden your exploits, it's easy for something to go wrong and wind up sputtering and sweating in front of an audience.
