The new skill in AI is not prompting, it's context engineering

Posted by robotswantdata 9 hours ago

The new skill in AI is not prompting, it's context engineering(www.philschmid.de)

501 points | 267 comments

JohnMakin 8 hours ago|

> Building powerful and reliable AI Agents is becoming less about finding a magic prompt or model updates.

Ok, I can buy this

> It is about the engineering of context and providing the right information and tools, in the right format, at the right time.

when the "right" format and "right" time are essentially, and maybe even necessarily, undefined, then aren't you still reaching for a "magic" solution?

If the definition of "right" information is "information which results in a sufficiently accurate answer from a language model" then I fail to see how you are doing anything fundamentally differently than prompt engineering. Since these are non-deterministic machines, I fail to see any reliable heuristic that is fundamentally indistinguishable than "trying and seeing" with prompts.

mentalgear 7 hours ago||

It's magical thinking all the way down. Whether they call it now "prompt" or "context" engineering because it's the same tinkering to find something that "sticks" in non-deterministic space.

nonethewiser 2 hours ago||

>Whether they call it now "prompt" or "context" engineering because it's the same tinkering to find something that "sticks" in non-deterministic space.

I dont quite follow. Prompts and contexts are different things. Sure, you can get thing into contexts with prompts but that doesn't mean they are entirely the same.

You could have a long running conversation with a lot in the context. A given prompt may work poorly, whereas it would have worked quite well earlier. I don't think this difference is purely semantic.

For whatever it's worth I've never liked the term "prompt engineering." It is perhaps the quintessential example of overusing the word engineering.

ffsm8 1 hour ago||

Yeah, if anything it should be called an art.

The term engineering makes little sense in this context, but really... Did it make sense for eg "QA Engineer" and all the other jobs we tacked it on, too? I don't think so, so it's kinda arguing after we've been misusing the term for well over 10 yrs

Aeolun 2 hours ago|||

There is only so much you can do with prompts. To go from the 70% accuracy you can achieve with that to the 95% accuracy I see in Claude Code, the context is absolutely the most important, and it’s visible how much effort goes into making sure Claude retrieves exactly the right context, often at the expense of speed.

majormajor 1 hour ago||

Why are we drawing a difference between "prompt" and "context" exactly? The linked article is a bit of puffery that redefines a commonly-used term - "context" - to mean something different than what it's meant so far when we discuss "context windows." It seems to just be some puffery to generate new hype.

When you play with the APIs the prompt/context all blurs together into just stuff that goes into the text fed to the model to produce text. Like when you build your own basic chatbot UI and realize you're sending the whole transcript along with every step. Using the terms from the article, that's "State/History." Then "RAG" and "Long term memory" are ways of working around the limits of context window size and the tendency of models to lose the plot after a huge number of tokens, to help make more effective prompts. "Available tools" info also falls squarely in the "prompt engineering" category.

The reason prompt engineering is going the way of the dodo is because tools are doing more of the drudgery to make a good prompt themselves. E.g., finding relevant parts of a codebase. They do this with a combination of chaining multiple calls to a model together to progressively build up a "final" prompt plus various other less-LLM-native approaches (like plain old "find").

So yeah, if you want to build a useful LLM-based tool for users you have to write software to generate good prompts. But... it ain't really different than prompt engineering other than reducing the end user's need to do it manually.

It's less that we've made the AI better and more that we've made better user interfaces than just-plain-chat. A chat interface on a tool that can read your code can do more, more quickly, than one that relies on you selecting all the relevant snippets. A visual diff inside of a code editor is easier to read than a markdown-based rendering of the same in a chat transcript. Etc.

arugulum 56 minutes ago|||

Because the author is artifically shrinking the scope of one thing (prompt engineering) to make its replacement look better (context engineering).

Never mind that prompt engineering goes back to pure LLMs before ChatGPT was released (i.e. before the conversation paradigm was even the dominant one for LLMs), and includes anything from few-shot prompting (including question-answer pairs), providing tool definitions and examples, retrieval augmented generation, and conversation history manipulation. In academic writing, LLMs are often defined as a distribution P(y|x) where X is not infrequently referred to as the prompt. In other words, anything that comes before the output is considered the prompt.

But if you narrow the definition of "prompt" down to "user instruction", then you get to ignore all the work that's come before and talk up the new thing.

simonw 1 hour ago|||

One crucial difference between prompt and the context: the prompt is just content that is provided by a user. The context also includes text that was output by the bot - in conversational interfaces the context incorporates the system prompt, then the user's first prompt, the LLMs reply, the user's next prompt and so-on.

majormajor 1 hour ago||

Here, even making that distinction of prompt-as-most-recent-user-input-only, if we use context as how it's generally been defined in "context window" then RAG and such are not then part of the context. They are just things that certain applications might use to enrich the context.

But personally I think a focus on "prompt" that refers to a specific text box in a specific application vs using it to refer to the sum total of the model input increases confusion about what's going on behind the scenes. At least when referring to products built on the OpenAI Chat Completions APIs, which is what I've used the most.

Building a simple dummy chatbot UI is very informative here for de-mystifying things and avoiding misconceptions about the model actually "learning" or having internal "memory" during your conversation. You're just supplying a message history as the model input prompt. It's your job to keep submitting the history - and you're perfectly able to change it if you like (such as rolling up older messages to keep a shorter context window).

felipeerias 3 hours ago|||

If someone asked you about the usages of a particular element in a codebase, you would probably give a more accurate answer if you were able to use a code search tool rather than reading every source file from top to bottom.

For that kind of tasks (and there are many of those!), I don't see why you would expect something fundamentally different in the case of LLMs.

dinvlad 7 hours ago|||

> when the "right" format and "right" time are essentially, and maybe even necessarily, undefined, then aren't you still reaching for a "magic" solution?

Exactly the problem with all "knowing how to use AI correctly" advice out there rn. Shamans with drums, at the end of the day :-)

autobodie 52 minutes ago|||

Tha problem is that "right" is defined circularly

andy99 7 hours ago|||

It's called over-fitting, that's basically what prompt engineering is.

edwardbernays 8 hours ago|||

The state of the art theoretical frameworks typically separates these into two distinct exploratory and discovery phases. The first phase, which is exploratory, is best conceptualized as utilizing an atmospheric dispersion device. An easily identifiable marker material, usually a variety of feces, is metaphorically introduced at high velocity. The discovery phase is then conceptualized as analyzing the dispersal patterns of the exploratory phase. These two phases are best summarized, respectively, as "Fuck Around" followed by "Find Out."

ninetyninenine 1 hour ago|||

Yeah but do we have to make a new buzz word out of it? "Context engineer"

FridgeSeal 6 hours ago|||

It’s just AI people moving the goalposts now that everyone has realised that “prompt engineering” isn’t a special skill.

coliveira 4 hours ago|||

In other words, "if AI doesn't work for you the problem is not IA, it is the user", that's what AI companies want us to believe.

shermantanktop 3 hours ago||

That’s a good indicator of an ideology at work: no-true-Scotsman deployed at every turn.

j45 2 hours ago|||

Everything is new to someone and the tends of reference will evolve.

csallen 4 hours ago||

This is like telling a soccer player that no change in practice or technique is fundamentally different than another, because ultimately people are non-deterministic machines.

simonw 8 hours ago||

I wrote a bit about this the other day: https://simonwillison.net/2025/Jun/27/context-engineering/

Drew Breunig has been doing some fantastic writing on this subject - coincidentally at the same time as the "context engineering" buzzword appeared but actually unrelated to that meme.

How Long Contexts Fail - https://www.dbreunig.com/2025/06/22/how-contexts-fail-and-ho... - talks about the various ways in which longer contexts can start causing problems (also known as "context rot")

How to Fix Your Context - https://www.dbreunig.com/2025/06/26/how-to-fix-your-context.... - gives names to a bunch of techniques for working around these problems including Tool Loadout, Context Quarantine, Context Pruning, Context Summarization, and Context Offloading.

the_mitsuhiko 8 hours ago||

Drew Breunig's posts are a must read on this. This is not only important for writing your own agents, it is also critical when using agentic coding right now. These limitations/behaviors will be with us for a while.

outofpaper 8 hours ago||

They might be good reads on the topic but Drew makes some significant etymological mistakes. For example loadout doesn't come from gaming but military terminology. It's essentially the same as kit or gear.

simonw 8 hours ago|||

Drew isn't using that term in a military context, he's using it in a gaming context. He defines what he means very clearly:

> The term “loadout” is a gaming term that refers to the specific combination of abilities, weapons, and equipment you select before a level, match, or round.

In the military you don't select your abilities before entering a level.

xarope 30 minutes ago|||

the military definitely do use the term loadout. It can be based on mission parameters e.g. if armored vehicles are expected, your loadout might include more MANPATS. It can also refer to the way each soldier might customize their gear, e.g. cutaway knife in boot or on vest, NODs if extended night operations expected (I know, I know, gamers would like to think you'd bring everything, but in real life no warfighter would want to carry extra weight unnecessarily!), or even the placement of gear on their MOLLE vests (all that velcro has a reason).

GuinansEyebrows 6 hours ago|||

i think that software engineers using this terminology might be envisioning themselves as generals, not infantry :)

DiggyJohnson 7 hours ago||||

This seems like a rather unimportant type of mistake, especially because the definition is still accurate, it’s just the etymology isn’t complete.

coldtea 4 hours ago||||

>Drew makes some significant etymological mistakes. For example loadout doesn't come from gaming but military terminology

Does he pretend to give the etymology and ultimately origin of the term, or just where he or other AI-discussions found it? Because if it's the latter, he is entitled to call it a "gaming" term, because that's what it is to him and those in the discussion. He didn't find it in some military manual or learned it at boot camp!

But I would mostly challenge this mistake, if we admit it as such, is "significant" in any way.

The origin of loadout is totally irrelevant to the point he makes and the subject he discusses. It's just a useful term he adopted, it's history is not really relevant.

ZYbCRq22HbJ2y7 8 hours ago||||

> They might be good reads on the topic but Drew makes some significant etymological mistakes. For example loadout doesn't come from gaming but military terminology. It's essentially the same as kit or gear.

Doesn't seem that significant?

Not to say those blog posts say anything much anyway that any "prompt engineer" (someone who uses LLMs frequently) doesn't already know, but maybe it is useful to some at such an early stage of these things.

scubbo 6 hours ago||||

It _is_ a gaming term - it is also a military term (from which the gaming term arose).

luckydata 3 hours ago|||

this is textbook pointless pedantry. I'm just commenting to find it again in the future.

Daub 3 hours ago|||

For visual art I feel that the existing approaches in context engineering are very much lacking. An Ai understands well enough such simple things as content (bird, dog, owl etc), color (blue green etc) and has a fair understanding of foreground/background. However, the really important stuff is not addressed.

For example: in form, things like negative shape and overlap. In color contrast things like Ratio contrast and dynamic range contrast. Or how manipulating neighboring regional contrast produces tone wrap. I could go on.

One reason for this state of affairs is that artists and designers lack the consistent terminology to describe what they are doing (though this does not stop them from operating at a high level). Indeed, many of the terms I have used here we (my colleagues and I) had to invent ourselves. I would love to work with an AI guru to address this developing problem.

skydhash 3 hours ago||

> artists and designers lack the consistent terminology to describe what they are doing

I don't think they do. It may not be completely consistent, but open any art book and you find the same thing being explained again and again. Just for drawing humans, you will find emphasis on the skeleton and muscle volume for forms and poses, planes (especially the head) for values and shadows, some abstract things like stability and line weight, and some more concrete things like foreshortening.

Several books and course have gone over those concepts. They are not difficult to explain, they are just difficult to master. That's because you have to apply judgement for every single line or brush stroke deciding which factors matter most and if you even want to do the stroke. Then there's the whole hand eye coordination.

So unless you can solve judgement (which styles derive from), there's not a lot of hope there.

ADDENDUM

And when you do a study of another's work, it's not copying the data, extracting colors, or comparing labels,... It's just studying judgement. You know the complete formula from which a more basic version is being used for the work, and you only want to know the parameters. Whereas machine training is mostly going for the wrong formula with completely different variables.

arbitrary_name 1 hour ago|||

From the first link:Read large enough context to ensure you get what you need.

How does this actually work, and how can one better define this to further improve the prompt?

This statement feels like the 'draw the rest of the fucking owl' referred to elsewhere in the thread

simonw 1 hour ago||

I'm not sure how you ended up on that page... my comment above links to https://simonwillison.net/2025/Jun/27/context-engineering/

The "Read large enough context to ensure you get what you need" quote is from a different post entirely, this one: https://simonwillison.net/2025/Jun/30/vscode-copilot-chat/

That's part of the system prompts used by the GitHub Copilot Chat extension for VS Code - from this line: https://github.com/microsoft/vscode-copilot-chat/blob/40d039...

The full line is:

  When using the {ToolName.ReadFile} tool, prefer reading a
  large section over calling the {ToolName.ReadFile} tool many
  times in sequence. You can also think of all the pieces you
  may be interested in and read them in parallel. Read large
  enough context to ensure you get what you need.

That's a hint to the tool-calling LLM that it should attempt to guess which area of the file is most likely to include the code that it needs to review.

It makes more sense if you look at the definition of the ReadFile tool:

https://github.com/microsoft/vscode-copilot-chat/blob/40d039...

  description: 'Read the contents of a file. Line numbers are
  1-indexed. This tool will truncate its output at 2000 lines
  and may be called repeatedly with offset and limit parameters
  to read larger files in chunks.'

The tool takes three arguments: filePath, offset and limit.

dosnem 4 hours ago|||

Providing context makes sense to me, but do you have any examples of providing context and then getting the AI to produce something complex? I am quite a proponent of AI but even I find myself failing to produce significant results on complex problems, even when I have clone + memory bank, etc. it ends up being a time sink of trying to get the ai to do something only to have me eventually take over and do it myself.

simonw 4 hours ago|||

Quite a few times, I've been able to give it enough context to write me an entire working piece of software in a single shot. I use that for plugins pretty often, eg this:

  llm -m openai/o3 \
    -f https://raw.githubusercontent.com/simonw/llm-hacker-news/refs/heads/main/llm_hacker_news.py \
    -f https://raw.githubusercontent.com/simonw/tools/refs/heads/main/github-issue-to-markdown.html \
    -s 'Write a new fragments plugin in Python that registers issue:org/repo/123 which fetches that issue
      number from the specified github repo and uses the same markdown logic as the HTML page to turn that into a fragment'

Which produced this: https://gist.github.com/simonw/249e16edffe6350f7265012bee9e3...

AnotherGoodName 3 hours ago|||

I had a series of “Using Manim create an animation for formula X rearranging into formula Y with a graph of values of the function”

Beautiful one shot results and i now have really nice animations of some complex maths to help others understand. (I’ll put it up on youtube soon).

I don't know the manim library at all so saved me about a week of work learning and implementing

old_man_cato 7 hours ago|||

[flagged]

d0gsg0w00f 7 hours ago|||

This hits too close to home.

_carbyau_ 6 hours ago||||

[flagged]

crsv 5 hours ago|||

And then the AI doesn’t handle the front end caching properly for the 100th time in a row so you edit the owl and nothing changes after you press save.

NomDePlum 6 hours ago|||

[flagged]

TrainedMonkey 6 hours ago|||

Hire a context engineer to define the task of drawing an owl as drawing two owls.

zdw 2 hours ago|||

[flagged]

jknoepfler 6 hours ago|||

Oh, and don't forget to retain the artist to correct the ever-increasingly weird and expensive mistakes made by the context when you need to draw newer, fancier pelicans. Maybe we can just train product to draw?

storus 8 hours ago|||

Those issues are considered artifacts of the current crop of LLMs in academic circles; there is already research allowing LLMs to use millions of different tools at the same time, and stable long contexts, likely reducing the amount of agents to one for most use cases outside interfacing different providers.

Anyone basing their future agentic systems on current LLMs would likely face LangChain fate - built for GPT-3, made obsolete by GPT-3.5.

simonw 8 hours ago|||

Can you link to the research on millions of different terms and stable long contexts? I haven't come across that yet.

storus 7 hours ago||

You can look at AnyTool, 2024 (16,000 tools) and start looking at newer research from there.

https://arxiv.org/abs/2402.04253

For long contexts start with activation beacons and RoPE scaling.

simonw 7 hours ago|||

I would classify AnyTool as a context engineering trick. It's using GPT-4 function calls (what we would call tool calls today) to find the best tools for the current job based on a 3-level hierarchy search.

Drew calls that one "Tool Loadout" https://www.dbreunig.com/2025/06/26/how-to-fix-your-context....

timr 5 hours ago||

So great. We have not one, but two different ways of saying "use text search to find tools".

This field, I swear...it's the PPAP [1] of engineering.

[1] https://www.youtube.com/watch?v=NfuiB52K7X8

I have a toool...I have a seeeeearch...unh! Now I have a Tool Loadout!" *dances around in leopard print pyjamas*

Art9681 3 hours ago||||

RoPE scaling is not an ideal solution since all LLMs in general start degrading at around 8k. You also have the problem of cost by yolo'ing long context per task turn even if the LLM were capable of crunching 1M tokens. If you self host then you have the problem of prompt processing time. So it doesnt matter in the end if the problem is solved and we can invoke n number of tools per task turn. It will be a quick way to become poor as long as providers are charging per token. The only viable solution is to use a smart router so only the relevant tools and their descriptions are appended to the context per task turn.

nyrikki 6 hours ago|||

Thanks for the link. It finally explained why I was getting hit up by recruiters for a job that was for a data broker looking to do what seemed like silly uses.

Cloud API recommender systems must seem like a gift to that industry.

Not my area anyways but I couldn't see a profit model for a human search for an API when what they wanted is well covered by most core libraries in Python etc...

ZYbCRq22HbJ2y7 7 hours ago||||

How would "a million different tool calls at the same time" work? For instance, MCP is HTTP based, even at low latency in incredibly parallel environments that would take forever.

Art9681 3 hours ago|||

It wouldn't. There is a difference between theory and practicality. Just because we could, doesnt mean we should, especially when costs per token are considered. Capability and scale are often at odds.

Jarwain 5 hours ago|||

MCPs aren't the only way to embed tool calls into an LLM

coldtea 4 hours ago||

Doesn't change the argument.

tptacek 3 hours ago||

It obviously does.

Art9681 3 hours ago||

It does not. Context is context no matter how you process it. You can configure tools without MCP or with it. No matter. You still have to provide that as context to an LLM.

tptacek 3 hours ago||

If you're using native tool calls and not MCP, the latency of calls is a nonfactor; that was the concern raised by the root comment.

dinvlad 7 hours ago||||

> already research allowing LLMs to use millions of different tools

Hmm first time hearing about this, could you share any examples please?

simonw 7 hours ago||

See this comment https://news.ycombinator.com/item?id=44428548

Foreignborn 7 hours ago|||

yes, but those aren’t released and even then you’ll always need glue code.

you just need to knowingly resource what glue code is needed, and build it in a way it can scale with whatever new limits that upgraded models give you.

i can’t imagine a world where people aren’t building products that try to overcome the limitations of SOTA models

storus 7 hours ago||

My point is that newer models will have those baked in, so instead of supporting ~30 tools before falling apart they will reliably support 10,000 tools defined in their context. That alone would dramatically change the need for more than one agent in most cases as the architectural split into multiple agents is often driven by the inability to reliably run many tools within a single agent. Now you can hack around it today by turning tools on/off depending on the agent's state but at some point in the future you might afford not to bother and just dump all your tools to a long stable context, maybe cache it for performance, and that will be it.

ZYbCRq22HbJ2y7 7 hours ago||

There will likely be custom, large, and expensive models at an enterprise level in the near future (some large entities and governments already have them (niprgpt)).

With that in mind, what would be the business sense in siloing a single "Agent" instead of using something like a service discovery service that all benefit from?

storus 7 hours ago||

My guess is the main issue is latency and accuracy; a single agent without all the routing/evaluation sub-agents around it that introduce cumulative errors, lead to infinite loops and slow it down would likely be much faster, accurate and could be cached at the token level on a GPU, reducing token preprocessing time further. Now different companies would run different "monorepo" agents and those would need something like MCP to talk to each other at the business boundary, but internally all this won't be necessary.

Also the current LLMs have still too many issues because they are autoregressive and heavily biased towards the first few generated tokens. They also still don't have full bidirectional awareness of certain relationships due to how they are masked during the training. Discrete diffusion looks interesting but I am not sure how does that one deal with tools as I've never seen a model from that class using any tools.

JoeOfTexas 7 hours ago|||

So who will develop the first Logic Core that automates the context engineer.

igravious 7 hours ago||

The first rule of automation: that which can be automated will be automated.

Observation: this isn't anything that can't be automated /

risyachka 7 hours ago|||

“A month-long skill” after which it won’t be a thing anymore, like so many other.

simonw 7 hours ago|||

Most of the LLM prompting skills I figured out ~three years ago are still useful to me today. Even the ones that I've dropped are useful because I know that things that used to be helpful aren't helpful any more, which helps me build an intuition for how the models have improved over time.

dbreunig 6 hours ago|||

While researching the above posts Simon linked, I was struck by how many of these techniques came from the pre-ChatGPT era. NLP researchers have been dealing with this for awhile.

refulgentis 6 hours ago|||

I agree with you, but would echo OP's concern, in a way that makes me feel like a party pooper, but, is open about what I see us all expressing squeamish-ness about.

It is somewhat bothersome to have another buzz phrase. I don't why we are doing this, other than there was a Xeet from the Shopify CEO, QT'd approvingly by Karpathy, then its written up at length, and tied to another set of blog posts.

To wit, it went from "buzzphrase" to "skill that'll probably be useful in 3 years still" over the course of this thread.

Has it even been a week since the original tweet?

There doesn't seem to be a strong foundation here, but due to the reach potential of the names involved, and their insistence on this being a thing while also indicating they're sheepish it is a thing, it will now be a thing.

Smacks of a self-aware version of Jared Friedman's tweet re: watching the invention of "Founder Mode" was like a startup version of the Potsdam Conference. (which sorted out Earth post-WWII. and he was not kidding. I could not even remember the phrase for the life of me. Lasted maybe 3 months?)

dbreunig 6 hours ago|||

Sometimes buzzwords turn out to be mirages that disappear in a few weeks, but often they stick around.

I find they takeoff when someone crystallizes something many people are thinking about internally, and don’t realize everyone else is having similar thoughts. In this example, I think the way agent and app builders are wrestling with LLMs is fundamentally different than chatbots users (it’s closer to programming), and this phrase resonates with that crowd.

Here’s an earlier write up on buzzwords: https://www.dbreunig.com/2020/02/28/how-to-build-a-buzzword....

refulgentis 6 hours ago||

I agree - what distinguishes this is how rushed and self-aware it is. It is being pushed top down, sheepishly.

EDIT: Ah, you also wrote the blog posts tied to this. It gives 0 comfort that you have a blog post re: building buzz phrases in 2020, rather, it enhances the awkward inorganic rush people are self-aware of.

dbreunig 5 hours ago||

I studied linguistic anthropology, in addition to CS. Been at it since 2002.

And I wrote the first post before the meme.

refulgentis 4 hours ago||

I've read these ideas a 1000 times, I thought it was the most beautiful core of the "Sparks of AGI" paper. (6.2)

We should be able to name the source of this sheepishness and have fun with that we are all things at once: you can be a viral hit 2002 super PhD with expertise in all areas involved in this topic that has brought pop attention onto something important, and yet, the hip topic you feel centered on can also make people's eyes roll temporarily. You're doing God's work. The AI = F(C) thing is really important. Its just, in the short term, it will feel like a buzzword.

This is much more about me playing with, what we can reduce to, the "get off my lawn!" take. I felt it interesting to voice because it is a consistent undercurrent in the discussion and also leads to observable absurdities when trying to describe it. It is not questioning you, your ideas, or work. It has just come about at a time when things become hyperreal hyperquickly and I am feeling old.

simonw 6 hours ago|||

The way I see it we're trying to rebrand because the term "prompt engineering" got redefined to mean "typing prompts full of stupid hacks about things like tipping and dead grandmas into a chatbot".

joe5150 5 hours ago||

It helps that the rebrand may lead some people to believe that there are actually new and better inputs into the system rather than just more elaborate sandcastles built in someone else's sandbox.

orbital-decay 6 hours ago||||

Many people figured it out two-three years ago when AI-assisted coding basically wasn't a thing, and it's still relevant and will stay relevant. These are fundamental principles, all big models work similarly, not just transformers and not just LLMs.

However, many fundamental phenomena are missing from the "context engineering" scope, so neither context engineering nor prompt engineering are useful terms.

tptacek 3 hours ago||||

If you're not writing your own agents, you can skip this skill.

anilgulecha 2 hours ago||

Are you sure? Looking forward - AI is going to be so pervasively used, that understanding what information is to be input will be a general skill. What we've been calling "prompt engineering" - the better ones were actually doing context engineering.

tptacek 1 hour ago||

If you're doing context engineering, you're writing an agent. It's mostly not the kind of stuff you can do from a web chat textarea.

coldtea 4 hours ago|||

What exactly month-long AI skills of 2023 AI are obsolete now?

Surely not prompt engineering itself, for example.

TZubiri 4 hours ago||

Rediscovering encapsulation

bgwalter 6 hours ago||

These discussions increasingly remind me of gamers discussing various strategies in WoW or similar. Purportedly working strategies found by trial and error and discussed in a language that is only intelligible to the in-group (because no one else is interested).

We are entering a new era of gamification of programming, where the power users force their imaginary strategies on innocent people by selling them to the equally clueless and gaming-addicted management.

matkoniecz 12 minutes ago||

> only intelligible to the in-group (because no one else is interested)

that applies to basically any domain-specific terminology, from WoW raids through cancer research to computer science and say OpenStreetMap

dysoco 2 hours ago|||

> Purportedly working strategies found by trial and error and discussed in a language that is only intelligible to the in-group

This really does sound like Computer Science since it's very beginnings.

The only difference is that now it's a much more popular field, and not restricted to a few nerds sharing tips over e-mail or bbs.

dawnofdusk 1 hour ago||

>This really does sound like Computer Science since it's very beginnings.

Except in actual computer science you can prove that your strategies, discovered by trial and error, are actually good. Even though Dijkstra invented his eponymous algorithm by writing on a napkin, it's phrased in the language of mathematics and one can analyze quantitatively its effectiveness and trade-offs, and one can prove if it's optimal (as was done recently).

coderatlarge 6 hours ago|||

i tend to share your view. but then your comment describes a lot of previous cycles of enterprise software selling. it’s just that this time is reaching a little uncomfortably into the builder’s /developer’s traditional areas of influence/control/workflow. how devs feel now is probably how others (ex csr, qa, sre) felt in the past when their managers pushed whatever tooling/practice was becoming popular or sine qua non in previous “waves”.

sarchertech 4 hours ago||

This has been happening to developers for years.

25 years ago it was object oriented programming.

coliveira 4 hours ago|||

The difference is that with OO there was at least hope that a well trained programmer could make it work. Nowadays, any person who understands how AI knows that's near impossible.

coderatlarge 4 hours ago|||

or agile and scrums.

Madmallard 2 hours ago|||

There's quite a lot science that goes into WoW strategizing at this point.

People are using their thinking caps and modelling data.

mrits 3 hours ago||

Tuning the JVM, compiler optimizations, design patterns, agile methodologies, seo , are just a few things that come to mind

benreesman 6 hours ago||

The new skill is programming, same as the old skill. To the extent these things are comprehensible, you understand them by writing programs: programs that train them, programs that run inferenve, programs that analyze their behavior. You get the most out of LLMs by knowing how they work in detail.

I had one view of what these things were and how they work, and a bunch of outcomes attached to that. And then I spent a bunch of time training language models in various ways and doing other related upstream and downstream work, and I had a different set of beliefs and outcomes attached to it. The second set of outcomes is much preferable.

I know people really want there to be some different answer, but it remains the case that mastering a programming tool involves implemtenting such, to one degree or another. I've only done medium sophistication ML programming, and my understand is therefore kinda medium, but like compilers, even doing a medium one is the difference between getting good results from a high complexity one and guessing.

Go train an LLM! How do you think Karpathy figured it out? The answer is on his blog!

pyman 6 hours ago|

Saying the best way to understand LLMs is by building one is like saying the best way to understand compilers is by writing one. Technically true, but most people aren't interested in going that deep.

benreesman 6 hours ago||

I don't know, I've heard that meme too but it doesn't track with the number of cool compiler projects on GitHub or that frontpage HN, and while the LLM thing is a lot newer, you see a ton of useful/interesting stuff at the "an individual could do this on their weekends and it would mean they fundamentally know how all the pieces fit together" type stuff.

There will always be a crowd that wants the "master XYZ in 72 hours with this ONE NEAT TRICK" course, and there will always be a..., uh, group of people serving that market need.

But most people? Especially in a place like HN? I think most people know that getting buff involves going to the gym, especially in a place like this. I have a pretty high opinion of the typical person. We're all tempted by the "most people are stupid" meme, but that's because bad interactions are memorable, not because most people are stupid or lazy or whatever. Most people are very smart if they apply themselves, and most people will work very hard if the reward for doing so is reasonably clear.

https://www.youtube.com/shorts/IQmOGlbdn8g

baxtr 8 hours ago||

>Conclusion

Building powerful and reliable AI Agents is becoming less about finding a magic prompt or model updates. It is about the engineering of context and providing the right information and tools, in the right format, at the right time. It’s a cross-functional challenge that involves understanding your business use case, defining your outputs, and structuring all the necessary information so that an LLM can “accomplish the task."

That’s actually also true for humans: the more context (aka right info at the right time) you provide the better for solving tasks.

root_axis 7 hours ago||

I am not a fan of this banal trend of superficially comparing aspects of machine learning to humans. It doesn't provide any insight and is hardly ever accurate.

furyofantares 7 hours ago|||

I've seen a lot of cases where, if you look at the context you're giving the model and imagine giving it to a human (just not yourself or your coworker, someone who doesn't already know what you're trying to achieve - think mechanical turk), the human would be unlikely to give the output you want.

Context is often incomplete, unclear, contradictory, or just contains too much distracting information. Those are all things that will cause an LLM to fail that can be fixed by thinking about how an unrelated human would do the job.

EricMausler 6 hours ago|||

Alternatively, I've gotten exactly what I wanted from an LLM by giving it information that would not be enough for a human to work with, knowing that the llm is just going to fill in the gaps anyway.

It's easy to forget that the conversation itself is what the LLM is helping to create. Humans will ignore or depriotitize extra information. They also need the extra information to get an idea of what you're looking for in a loose sense. The LLM is much more easily influenced by any extra wording you include, and loose guiding is likely to become strict guiding

furyofantares 5 hours ago||

Yeah, it's definitely not a human! But it is often the case in my experience that problems in your context are quite obvious once looked at through a human lens.

Maybe not very often in a chat context, my experience is in trying to build agents.

root_axis 1 hour ago|||

I don't see the usefulness of drawing a comparison to a human. "Context" in this sense is a technical term with a clear meaning. The anthropomorphization doesn't enlighten our understanding of the LLM in any way.

Of course, that comment was just one trivial example, this trope is present in every thread about LLMs. Inevitably, someone trots out a line like "well humans do the same thing" or "humans work the same way" or "humans can't do that either". It's a reflexive platitude most often deployed as a thought-terminating cliche.

baxtr 1 hour ago||||

Without my note I wouldn’t have seen this comment, which is very insightful to me at least.

https://news.ycombinator.com/item?id=44429880

stefan_ 6 hours ago||||

Theres all these philosophers popping up everywhere. This is also another one of these topics that featured in peoples favorite scifi hyperfixation so all discussions inevitably get ruined with scifi fanfic (see also: room temperature superconductivity).

ModernMech 7 hours ago|||

I agree, however I do appreciate comparisons to other human-made systems. For example, "providing the right information and tools, in the right format, at the right time" sounds a lot like a bureaucracy, particularly because "right" is decided for you, it's left undefined, and may change at any time with no warning or recourse.

layer8 4 hours ago|||

The difference is that humans can actively seek to acquire the necessary context by themselves. They don't have to passively sit there and wait for someone else to do the tedious work of feeding them all necessary context upfront. And we value humans who are able to proactively do that seeking by themselves, until they are satisfied that they can do a good job.

simonw 4 hours ago||

> The difference is that humans can actively seek to acquire the necessary context by themselves

These days, so can LLM systems. The tool calling pattern got really good in the last six months, and one of the most common uses of that is to let LLMs search for information they need to add to their context.

o3 and o4-mini and Claude 4 all do this with web search in their user-facing apps and it's extremely effective.

The same patterns is increasingly showing up in coding agents, giving them the ability to search for relevant files or even pull in official document documentation for libraries.

fergal 1 hour ago|||

THis.. I was about to make a similar point; this conclusion reads like a job description for a technical lead role where they managed and define work for a team of human devs who execute implementation.

mentalgear 7 hours ago|||

Basically, finding the right buttons to push within the constraints of the environment. Not so much different from what (SW) engineering is, only non-deterministic in the outcomes.

QuercusMax 8 hours ago|||

Yeah... I'm always asking my UX and product folks for mocks, requirements, acceptance criteria, sample inputs and outputs, why we care about this feature, etc.

Until we can scan your brain and figure out what you really want, it's going to be necessary to actually describe what you want built, and not just rely on vibes.

lupire 7 hours ago||

Not "more" context. "Better" context.

(X-Y problem, for example.)

zaptheimpaler 5 hours ago||

I feel like this is incredibly obvious to anyone who's ever used an LLM or has any concept of how they work. It was equally obvious before this that the "skill" of prompt-engineering was a bunch of hacks that would quickly cease to matter. Basically they have the raw intelligence, you now have to give them the ability to get input and the ability to take actions as output and there's a lot of plumbing to make that happen.

munificent 4 hours ago||

All of these blog posts to me read like nerds speedrunning "how to be a tech lead for a non-disastrous internship".

Yes, if you have an over-eager but inexperienced entity that wants nothing more to please you by writing as much code as possible, as the entity's lead, you have to architect a good space where they have all the information they need but can't get easily distracted by nonessential stuff.

tptacek 3 hours ago|

Just to keep some clarity here, this is mostly about writing agents. In agent design, LLM calls are just primitives, a little like how a block cipher transform is just a primitive and not a cryptosystem. Agent designers (like cryptography engineers) carefully manage the inputs and outputs to their primitives, which are then composed and filtered.

ozim 8 hours ago||

Finding a magic prompt was never “prompt engineering” it was always “context engineering” - lots of “AI wannabe gurus” sold it as such but they never knew any better.

RAG wasn’t invented this year.

Proper tooling that wraps esoteric knowledge like using embeddings, vector dba or graph dba becomes more mainstream. Big players improve their tooling so more stuff is available.

crystal_revenge 8 hours ago||

Definitely mirrors my experience. One heuristic I've often used when providing context to model is "is this enough information for a human to solve this task?". Building some text2SQL products in the past it was very interesting to see how often when the model failed, a real data analyst would reply something like "oh yea, that's an older table we don't use any more, the correct table is...". This means the model was likely making a mistake that a real human analyst would have without the proper context.

One thing that is missing from this list is: evaluations!

I'm shocked how often I still see large AI projects being run without any regard to evals. Evals are more important for AI projects than test suites are for traditional engineering ones. You don't even need a big eval set, just one that covers your problem surface reasonably well. However without it you're basically just "guessing" rather than iterating on your problem, and you're not even guessing in a way where each guess is an improvement on the last.

edit: To clarify, I ask myself this question. It's frequently the case that we expect LLMs to solve problems without the necessary information for a human to solve them.

adiabatichottub 6 hours ago||

A classic law of computer programming:

"Make it possible for programmers to write in English and you will find that programmers cannot write in English."

It's meant to be a bit tongue-in-cheek, but there is a certain truth to it. Most human languages fail at being precise in their expression and interpretation. If you can exactly define what you want in English, you probably could have saved yourself the time and written it in a machine-interpretable language.

kevin_thibedeau 8 hours ago|||

Asking yes no questions will get you a lie 50% of the time.

adriand 7 hours ago|||

I have pretty good success with asking the model this question before it starts working as well. I’ll tell it to ask questions about anything it’s unsure of and to ask for examples of code patterns that are in use in the application already that it can use as a template.

hobs 7 hours ago||

The thing is, all the people cosplaying as data scientists don't want evaluations, and that's why you saw so little in fake C level projects, because telling people the emperor has no clothes doesn't pay.

For those actually using the products to make money well, hey - all of those have evaluations.

shermantanktop 3 hours ago||

I know this proliferation of excited wannabes is just another mark of a hype cycle, and there’s real value this time. But I find myself unreasonably annoyed by people getting high on their own supply and shouting into a megaphone.

dinvlad 7 hours ago|

I feel like ppl just keep inventing concepts for the same old things, which come down to dancing with the drums around the fire and screaming shamanic incantations :-)

viccis 6 hours ago|

When I first used these kinds of methods, I described it along those lines to my friend. I told him I felt like I was summoning a demon and that I had to be careful to do the right incantations with the right words and hope that it followed my commands. I was being a little disparaging with the comment because the engineer in me that wants reliability, repeatability, and rock solid testability struggles with something that's so much less under my control.

God bless the people who give large scale demos of apps built on this stuff. It brings me back to the days of doing vulnerability research and exploitation demos, in which no matter how much you harden your exploits, it's easy for something to go wrong and wind up sputtering and sweating in front of an audience.

More comments...