OpenAI are quietly adopting skills, now available in ChatGPT and Codex CLI

Posted by simonw 12/12/2025

OpenAI are quietly adopting skills, now available in ChatGPT and Codex CLI (simonwillison.net)

587 points | 324 comments

_pdp_ 12/13/2025|

LLMs need prompts. Prompts can get very big very quickly. The so-called "skills", which exist in other forms on other platforms outside of Anthropic and OpenAI, are simply a mechanism to extend the prompt dynamically. The tools (scripts) that are part of the skill are no different than simply having the tools already installed in the OS where the agent operates.

The idea behind skills is sound because context management matters.

However, skills are different from MCP. Skills have nothing to do with tool calling at all!

You can implement your own version of skills easily and there is absolutely zero need for any kind of standard or a framework of sorts. The way to do it is to register a tool / function to load and extend the base prompt and presto - you have implemented your own version of skills.

In ChatBotKit AI Widget we even have our own version of that for both the server and when building client-side applications.

With client-side applications the whole thing is implemented with a simple react hook that adds the necessary tools to extend the prompt dynamically. You can easily come up with your own implementation of that with 20-30 lines of code. It is not complicated.
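A minimal sketch of that pattern, in Python rather than a React hook: one registered tool whose only job is to append a skill's instructions to the base prompt. The `SKILLS` table, the `Agent` class, and the `load_skill` tool name are hypothetical names for illustration, not ChatBotKit's or any vendor's API.

```python
from dataclasses import dataclass, field

# Hypothetical skill registry: name -> full instructions.
SKILLS = {
    "integration-tests": "When adding integration tests, use the fixtures in tests/conftest.py.",
    "release-notes": "Summarize merged changes as short bullet points grouped by component.",
}


@dataclass
class Agent:
    base_prompt: str
    loaded: list[str] = field(default_factory=list)

    def tool_specs(self) -> list[dict]:
        """Tool description advertised to the model (the shape is illustrative)."""
        return [{
            "name": "load_skill",
            "description": "Load extra instructions. Available: " + ", ".join(sorted(SKILLS)),
            "parameters": {"name": "string"},
        }]

    def load_skill(self, name: str) -> str:
        """Tool handler: remember the skill so it is added to the prompt, once."""
        if name not in SKILLS:
            return f"unknown skill: {name}"
        if name not in self.loaded:
            self.loaded.append(name)
        return f"loaded skill: {name}"

    def system_prompt(self) -> str:
        """Base prompt plus the body of every skill loaded so far."""
        parts = [self.base_prompt] + [f"## Skill: {n}\n{SKILLS[n]}" for n in self.loaded]
        return "\n\n".join(parts)
```

The model only ever sees the skill names in the tool description until it calls `load_skill`, at which point the next request is built from `system_prompt()`.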

Very often people latch onto some idea thinking this is the next big thing, hoping that it will explode. It is not new and it won't explode! It is just part of a suite of tools that already exist in various forms. The mechanic is so simple at its core that it makes practically no sense to call it a standard, and there is absolutely zero need to have it for most types of applications. It does make sense for coding assistants, though, as they work with quite a bit of data, so there it matters. But skills are not fundamentally different from *.instruction.md prompt in Copilot or AGENT.md and its variations.

electric_muse 12/13/2025||

> But skills are not fundamentally different from *.instruction.md prompt in Copilot or AGENT.md and its variations.

One of the best patterns I’ve seen is having an /ai-notes folder with files like ‘adding-integration-tests.md’ that contain specialized knowledge suitable for specific tasks. These “skills” can then be inserted/linked into prompts where I think they are relevant.

But these skills can’t be static. For best results, I observe what knowledge would make the AI better at the skill the next time. Sometimes I ask the AI to propose new learnings to add to the relevant skill files, and I adopt the sensical ones while managing length carefully.

Skills are a great concept for specialized knowledge, but they really aren’t a groundbreaking idea. It’s just context engineering.

tedivm 12/13/2025|||

Back in my day we referred to this as "documentation". It turns out it's actually useful for developers too, not just agents.

abirch 12/13/2025||

Wait developers RTFM?

itsafarqueue 12/13/2025||

Only after exhausting every other avenue

CuriouslyC 12/13/2025||||

Pro tip, just add links in code comments/readmes with relevant "skills" for the code in question. It works for both humans and agents.

_pdp_ 12/13/2025|||

This is exactly what I do. It works super well. Who would have thought that documenting your code helps both other developers and AI agents? I'm being sarcastic.

smoe 12/13/2025||

I would argue that many engineering “best practices” have become much more important much earlier in projects. Personally, I can deal with a lot of jank and lack of documentation in an early-stage codebase, but LLMs get lost so quickly, or they just multiply the jank faster than anyone ever could have in the past, making it much, much worse for both LLMs and humans.

Documentation, variable naming, automated tests, specs, type checks, linting. Anything the agent can bang its proverbial head against in a loop for a while without involving you every step of the way.

scottlamb 12/13/2025|||

This might be one of the best things about the current AI boom. The agents give quick, frequent, cheap feedback on how effective the comments, code structure, and documentation are at helping a "new" junior engineer get started.

I like to think I'm above average in terms of having design docs alongside my code, having meaningful comments, etc. But playing with agents recently has pointed out several ways I could be doing better.

Leynos 12/15/2025||

If I see an LLM having trouble with a library, I can feed its transcript into another agent and ask for actionable feedback on how to make the library easier to use. Which of course gets fed into a third agent to implement. It works really well for me. Nothing more satisfying than a satisfied customer.

CuriouslyC 12/16/2025||

I've done something similar. I ask agents to use CLIs, then I give them an "exit survey" on their experience along with feedback on improvements. Feels pretty meta.

pbronez 12/13/2025|||

I’ve seen some dev agents do this pretty well.

cube2222 12/13/2025|||

The general idea is not very new, but the current chat apps have added features that are big enablers.

That is, skills make the most sense when paired with a Python script or cli that the skill uses. Nowadays most of the AI model providers have code execution environments that the models can use.

Previously, you could only use such skills with locally running agent clis.

This is imo the big enabler, which may totally mean that “skills will go big”. And yeah, having implemented multiple MCP servers, I think skills are a way better approach for most use-cases.

DonHopkins 12/13/2025|||

I like the focus on python cli tools, using the standard argparse module, and writing good help and self documentation.

You can develop skills incrementally, starting with just one md file describing how to do something, and no code at first.

As you run through it for the first several times, testing and debugging it, you accumulate a rich history of prompts, examples, commands, errors, recovery, backing up and branching. But that chat history is ephemeral, so you need to scoop it up and fold it back into the md instructions.

While the experience is still fresh in the chat, have it uplift knowledge from the experience into the md instructions, refine the instructions with more details, give concrete examples of input and output, add more detailed and explicit instructions, handle exceptions and prerequisites, etc.

Then after you have a robust reliable set of instructions and examples for solving a problem (with branches and conditionals and loops to handle different conditions, like installing prerequisite tools, or checking and handling different cases), you can have it rewrite the parts that don't require "thought" into python, as a self documenting cli tool that an llm, you, and other scripts can call.

It's great to end up with a tangible well documented cli tool that you can use yourself interactively, and build on top of with other scripts.

Often the whole procedure can be rewritten in python, in which case the md instructions only need to tell how to use the python cli tool you've generated, which cli.py --help will fully document.
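As a rough sketch of what such a self-documenting tool might look like with the standard `argparse` module: the line-counting task, the `--skip-blank` flag, and the function names are placeholders invented for illustration, standing in for whatever deterministic step was lifted out of the md instructions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """All user-facing documentation lives here, so `--help` is the manual."""
    parser = argparse.ArgumentParser(
        prog="cli.py",
        description="Count lines in text files (placeholder for the deterministic step).",
        epilog="Example: cli.py notes.md --skip-blank",
    )
    parser.add_argument("paths", nargs="+", help="files to read")
    parser.add_argument("--skip-blank", action="store_true", help="ignore empty lines")
    return parser


def count_lines(text: str, skip_blank: bool = False) -> int:
    """The part that needs no "thought": plain deterministic code."""
    lines = text.splitlines()
    if skip_blank:
        lines = [line for line in lines if line.strip()]
    return len(lines)


def main(argv=None) -> int:
    """Entry point; pass argv explicitly from other scripts or tests."""
    args = build_parser().parse_args(argv)
    for path in args.paths:
        with open(path, encoding="utf-8") as handle:
            print(f"{path}\t{count_lines(handle.read(), args.skip_blank)}")
    return 0
```

Calling `main()` from an `if __name__ == "__main__":` guard makes it a script, and the md instructions can then shrink to "run `cli.py --help` and follow it".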

But if it requires a mix of llm decision making or processing plus easily automated deterministic procedures, then the art is in breaking it up into one or more cli tools and file formats, and having the llm orchestrate them.

Finally you can take it all the way into one tool, turn it outside in, and have the python cli tool call out to an llm, instead of being called by an llm, so it can run independently outside of cursor or whatever.

It's a lot like a "just in time" compiler from md instructions to python code.

Anyone can write up (and refine) this "Self Optimizing Skills" approach in another md file of meta instructions for incrementally bootstrapping md instructions into python clis.

jmalicki 12/13/2025|||

MCP servers are really just skills paired with python scripts, it's not really that different, MCP just lets you package them together for distribution.

cube2222 12/13/2025||

But then those work only locally - not in the web ui’s, unless you make it a remote MCP, and then it’s back to being something somewhat different.

Skills also have a nicer way of working with the context, by default (and in the main web uis), with their overview-driven lazy loading.

lxgr 12/13/2025|||

Many useful inventions seem blindingly obvious in hindsight.

Yes, in the end skills are just another way to manage prompts and avoid cluttering the context of a model, but they happen to be one that works really well.

_pdp_ 12/13/2025||

It is not an invention if it is common sense and there is plenty of prior art. How would you otherwise dynamically extend the prompt? You will have some kind of function that, based on the selected preferences, adds more prompt to the base prompt. That is basically what this is, except that Anthropic added it as a built-in tool.

skydhash 12/13/2025||

Open the python REPL

Type `import math`

You now have more skills (symbols)

_pdp_ 12/13/2025||

???

wordpad 12/13/2025|||

He is making the point that something extremely powerful can be simple and obvious. Importing libraries is an obvious way to manage code complexity and dependencies.

Skills do that for prompts.

butlike 12/15/2025||

But I need skills ~4.3 cause 19,383 deps depend on it. Should I bump to ^4.0 in the llm-composer.json?

bg24 12/13/2025|||

With a little bit of experience, I realized that it makes sense even for an agent to run commands/scripts for deterministic tasks. For example, to find a particular app out of a list of N (can be 100) with complex filtering criteria, the best option is to run a shell command to get specific output.

Like this, you can divide a job to be done into blocks of reasoning and deterministic tasks. The latter are scripts/commands. The whole package is called skills.
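To make the deterministic half concrete, here is a small sketch using a Python script as a stand-in for the shell command. The app records, field names, and filter criteria are all made up for illustration.

```python
import json


def find_apps(apps, owner, min_version, tag):
    """Return names of apps matching every criterion, sorted alphabetically."""
    return sorted(
        app["name"]
        for app in apps
        if app["owner"] == owner
        and tuple(app["version"]) >= tuple(min_version)
        and tag in app["tags"]
    )


# Hypothetical inventory; in practice this would come from a file or an API.
APPS_JSON = """[
  {"name": "billing", "owner": "payments", "version": [2, 1], "tags": ["prod", "pci"]},
  {"name": "ledger", "owner": "payments", "version": [1, 9], "tags": ["prod", "pci"]},
  {"name": "search", "owner": "discovery", "version": [3, 0], "tags": ["prod"]},
  {"name": "refunds", "owner": "payments", "version": [2, 4], "tags": ["staging", "pci"]}
]"""

matches = find_apps(json.loads(APPS_JSON), owner="payments", min_version=(2, 0), tag="prod")
print(matches)  # ['billing']
```

The agent decides which criteria matter (the reasoning block) and reads back only the short result, instead of scanning the full list itself.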

btown 12/13/2025|||

> [The] way to do is to register a tool / function to load and extend the base prompt and presto - you have implemented your own version of skills.

So are they basically just function tool calls whose return value is a constant string? Do we know if that’s how they’re implemented, or is the string inserted into the new input context as something other than a function_call_output?

_pdp_ 12/13/2025||

No. You basically call a function to temporarily or permanently extend the base prompt. But of course you can think of other patterns to do more interesting things depending on your use-case. The prompt selection is a RAG.

btown 12/13/2025||

Did some research and it's a bit more nuanced than this, though still RAG at its core: each skill has a name and brief description that's included verbatim into every prompt, and a Bash "cat" is triggered as a standard tool call to load the full skill specification from disk.

https://platform.claude.com/docs/en/agents-and-tools/agent-s...

And as implemented in Codex: https://github.com/openai/codex/pull/7412/changes#diff-35647...
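A rough sketch of that two-step flow, not the actual Claude or Codex implementation. It assumes each skill lives in its own folder with a `SKILL.md` file that starts with a simple `key: value` metadata header between `---` lines; the parser below is deliberately naive and the function names are hypothetical.

```python
from pathlib import Path


def read_front_matter(text: str) -> dict:
    """Parse simple `key: value` lines between the leading '---' fences."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}
    meta = {}
    for line in lines[1:]:
        if line.strip() == "---":
            break
        key, sep, value = line.partition(":")
        if sep:
            meta[key.strip()] = value.strip()
    return meta


def skill_index(root: Path) -> str:
    """One line per skill: the only part that goes into every prompt."""
    entries = []
    for skill_file in sorted(root.glob("*/SKILL.md")):
        meta = read_front_matter(skill_file.read_text(encoding="utf-8"))
        entries.append(f"- {meta.get('name', skill_file.parent.name)}: "
                       f"{meta.get('description', '')} (read {skill_file} for details)")
    return "\n".join(entries)


def load_skill(skill_file: Path) -> str:
    """Stand-in for the `cat` tool call: the full body, read only on demand."""
    return skill_file.read_text(encoding="utf-8")
```

Only the output of `skill_index` is paid for on every request; `load_skill` runs when the model decides a description matches the task.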

valstu 12/13/2025|||

> However, skills are different from MCP. Skills have nothing to do with tool calling at all

Although skills require that you have certain tools available, like basic file system operations, so the model can read the skill files. Usually this is implemented as an ephemeral "sandbox environment" where the LLM has access to the file system and can also execute python, run bash commands etc.

PythonicNinja 12/14/2025|||

added blog post about skills in AI and references to

Dotprompt / Claude / Dia browser skills - "Skills Everywhere: Portable Playbooks for Codex, Claude, and Dia"

https://pythonic.ninja/blog/2025-12-14-codex-skills/

kelvinjps10 12/13/2025||

Isn't it the simplicity of the concept that will make it "explode"?

extr 12/13/2025||

It’s crazy how Anthropic keeps coming up with sticky “so simple it seems obvious” product innovations and OpenAI plays catch up. MCP is barely a protocol. Skills are just md files. But they seem to have a knack for framing things in a way that just makes sense.

Jimmc414 12/13/2025||

Skills are lazy loaded prompt engineering. They are simple, but powerful. Claude sees a one line index entry per skill. You can create hundreds. The full instructions only load when invoked.

Those instructions can reference external scripts that Claude executes without loading the source. You can package them with hooks and agents in plugins. You pay tokens for the output, not the code that calls it.

Install five MCPs and you've burned a large chunk of tokens before typing a prompt. With skills, you only pay for what you use.

You can call deterministic code (pipelines, APIs, domain logic) with a non-deterministic model, triggered by plain language, without the context bloat.
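The "pay for the output, not the code" point can be illustrated with a short sketch. It assumes the agent runs bundled scripts as subprocesses and feeds only stdout back to the model; the `run_skill_script` helper and the generated `total.py` script are hypothetical.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_skill_script(script: Path, *args: str) -> str:
    """Run a bundled script and return only its stdout for the model to read."""
    result = subprocess.run(
        [sys.executable, str(script), *args],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()


# Hypothetical bundled script: many lines of source, one line of output.
source = "# padding comment\n" * 200 + "import sys\nprint(sum(int(a) for a in sys.argv[1:]))\n"
script = Path(tempfile.mkdtemp()) / "total.py"
script.write_text(source, encoding="utf-8")

output = run_skill_script(script, "19", "23")
print(len(source), "characters of source stayed out of the context")
print(len(output), "characters of output went in:", output)
```

The script could be arbitrarily long (a pipeline, an API client, domain logic) and the context cost would stay the size of whatever it prints.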

robrenaud 12/13/2025|||

They are the LLM whisperers.

In the same way Nagel knew what it was like to be a bat, Anthropic has the highest fraction of people who approximately know what it's like to be a frontier ai model.

gabaix 12/13/2025|||

Nagel's point is that he could not know what it was like to be a bat.

01HNNWZ0MV43FF 12/13/2025||||

Huh https://en.wikipedia.org/wiki/What_Is_It_Like_to_Be_a_Bat%3F

uoaei 12/13/2025|||

It's surprising to me that Anthropic's CEO is the only one getting real recognition for their advances. The people around him seem to be as or more crucial for their mission.

ACCount37 12/13/2025|||

Is that really true?

I can name OpenAI CEO but not Anthropic CEO off the top of my head. And I actually like Anthropic's work way more than what OpenAI is doing right now.

uoaei 12/13/2025||

Pick up the newest edition of Time.

blueblisters 12/13/2025||||

Amanda Askell, Sholto Douglas have somewhat of a fan following on twitter

adastra22 12/13/2025|||

That’s always the case.

altmanaltman 12/13/2025|||

> https://www.anthropic.com/news/donating-the-model-context-pr...

This is a prime example of what you're saying. Creating a "foundation" for a protocol created a year ago that's not even a protocol

Has the Gavin Belson tecthics energy

sigmoid10 12/13/2025|||

Anthropic is in a bit of a rough spot if you look at the raw data points we have available. Their valuation is in the same order of magnitude as OpenAI, but they have orders of magnitude fewer users. And current leaderboards for famous unsolved benchmarks like ARC AGI and HLE are also dominated by Google and OpenAI. Announcements like the one you linked are the only way for Anthropic to stay in the news cycle and justify its valuation to investors. Their IPO rumours are yet another example of this. But I really wonder how long that strategy can keep working.

ramraj07 12/13/2025|||

Those benchmarks mean nothing. Anthropic still makes the models that get real work done in enterprise. We want to move but are unable to.

If anyone disagrees, I would like to see their long running deep research agents built on gemini or openai.

sigmoid10 12/13/2025|||

I have built several agents based on OpenAI now that are running real life business tasks. OpenAI's tool calling integration still beats everyone else (in fact it did from the very beginning), which is what actually matters in real world business applications. And even if some small group of people prefer Anthropic for very specific tasks, the numbers are simply unfathomable. Their business strategy has zero chance of working long-term.

dotancohen 12/13/2025||

In writing code, from what I've seen, Anthropic's models are still the most widely used. I would venture that over 50% of vibe coded apps, garbage though they are, are written by Claude Code. And they capture the most market in real coding shops as well, from what I've seen.

sigmoid10 12/15/2025||

What data are you basing your assumption on? OpenRouter? That itself is only used by a tiny fraction of people. According to the latest available numbers, OpenAI has ~800x more monthly active users than OpenRouter. So even if only 0.5% of them use it for code, it will dwarf everything that Anthropic's models produce.
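Spelling out that arithmetic as a quick sketch, using only the two figures given in the comment (neither is verified here, and the 0.5% share is the comment's own hypothetical): the implied number of OpenAI coding users comes out at roughly four times OpenRouter's entire user base.

```python
# Figures taken from the comment above; neither is independently verified here.
mau_ratio = 800        # OpenAI monthly active users per OpenRouter user
coding_share = 0.005   # hypothetical 0.5% of OpenAI users who use it for code

# OpenAI users writing code, expressed as a multiple of all OpenRouter users.
coders_per_openrouter_user = mau_ratio * coding_share
print(coders_per_openrouter_user)
```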

taylorius 12/13/2025|||

Just out of interest, why do you want to move? What's wrong with Claude and Anthropic in your view? (I use it, and it works really well.)

biorach 12/13/2025||||

> Their valuation is in the same order of magnitude as OpenAI, but they have orders of magnitude fewer users.

it's an open question how many of OpenAI's users are monetizable.

There's an argument to be made that your brand being what the general public identifies with AI is a medium term liability in light of the vast capital and operating costs involved.

It may well be that Anthropic focusing on an order of magnitude smaller, but immediately monetizable, market will play out better.

sigmoid10 12/19/2025||

I wouldn't count on it being immediately monetizable. At least not to the point where training foundation models becomes fundamentally profitable. And from what we're seeing right now, you have to do that or you will get left behind fast. But with a billion active users, you are approaching Facebook levels of market penetration and thereby advertising-potential. So in the mid to long term, this is certainly more valuable.

robrenaud 12/13/2025||||

Low scores on HLE and ARC AGI might be a good sign. They didn't goodhart their models. ARC AGI in particular doesn't mean much, IMO. It's just some weird hard geometry induction. I don't think it correlates well with real world problem solving.

AFAICT, claude code is the biggest engineering mind share. An Apple software engineer friend of mine says he sometimes uses $100/day of claude code tokens at work and gets sad, because that's the budget.

Also, look at costs and revenue. OpenAI is bleeding way more than Anthropic.

losvedir 12/13/2025||||

Not sure how relevant it is, but I finally decided to dip my toes in last night and write my first agent. Despite paying for ChatGPT Pro, Claude Pro, etc, you still have to load up credits to use the API version of them. I started with Claude, but there was a bug on the add credit form and I couldn't submit (I'm guessing they didn't test on MacOS Safari, maybe?). So I gave up and moved on to OpenAI's developer thing.

Maybe they should do less vibe coding on their checkout flow and they might have more users.

bfuller 12/14/2025||||

Anthropic has fewer users, but I think their value per user is higher due to claude mostly producing code. I know my shop is just gonna keep paying for $200 max subscriptions until one of these open source clients with a chinese LLM can beat sonnet 4.5 (which may be now, but not worth it for me to explore until it's solid enough for my uses)

extr 12/13/2025||||

Hard to believe you could be so misinformed. Anthropic is not far behind OAI on revenue and has a much more stable position with most of it coming from enterprise/business customers.

andy99 12/13/2025|||

I’d argue openAI has put their cards on the table and they don’t have anything special, while Anthropic has not.

Their valuations come from completely different calculus: Anthropic looks much more like a high potential early startup still going after PMF while OpenAI looks more like a series B flailing to monetize.

The cutting edge has largely moved past benchmarks, beyond a certain performance threshold that all these models have reached, nobody really cares about scores anymore, except people overfitting to them. They’re going for models that users like better, and Claude has a very loyal following.

TLDR, OpenAI has already peaked, Anthropic hasn’t, thus the valuation difference.

DonHopkins 12/13/2025||||

I just re-binge-watched Silicon Valley in its entirety, with the benefit of a decade of hindsight, so I could get all the interconnected characters and sub-plots and cultural references together in my head better than the first time I watched it in real time at one episode per month.

It really should be required viewing for anyone in the industry, it has so much spot-on social commentary, it's just not "tecthical" not to be fully aware of it, even if it stings.

https://silicon-valley.fandom.com/wiki/Tethics

>Meanwhile, Gavin Belson (Matt Ross) comes up with a code of ethics for tech, which he lamely calls "tethics", and urges all tech CEOs to sign a pledge to abide by the tethics code. Richard refuses to sign, he considers the pledge to be unenforceable and meaningless.

>Belson invites Richard to the inauguration of the Gavin Belson Institute for Tethics. Before Belson's speech, Richard confronts the former Hooli CEO with the fact that the tethics pledge is a stream of brazenly plagiarized banalities, much like Belson's novel Cold Ice Cream & Hot Kisses.

>Once at the podium, Belson discards his planned speech and instead confesses to his misdeeds when he was CEO of Hooli. Belson urges California's attorney general to open an investigation.

>Richard mistakenly thinks that Belson is repentant for all his past bad behavior. But, as Ron LaFlamme (Ben Feldman) explains, Belson's contrite act is just another effort to sandbag Richard. If the attorney general finds that Belson acted unethically during his tenure as Hooli CEO, the current Hooli CEO would be the one who has to pay the fine. And since Pied Piper absorbed Hooli, it would be Pied Piper that has to pay the fine.

beng-nl 12/13/2025|||

Tethics, Denpok.

mhalle 12/13/2025|||

Skills are not just markdown files. They are markdown files combined with code and data, which only work universally when you have a general purpose cloud-based code execution environment.

Out of the box Claude skills can call python scripts that load modules from Pypi or even GitHub, potentially ones that include data like sqlite files or parquet tables.

Not just in Claude Code. Anywhere, including the mobile app.

rcarmo 12/13/2025||

They’re not alone in that.

lacy_tinpot 12/13/2025|||

Their name is Anthropic. Their entire schtick is a weird humanization of AIs.

MCP/Tool use, Skills, and I'm sure others that I can't think of.

This might be because of some core direction that is more coherent than other labs.

JoshuaDavid 12/13/2025||

... I am pretty sure that the name "Anthropic" is as in "principle" not as in "pertaining to human beings".

kaashif 12/13/2025|||

The anthropic principle is named as such because it is "pertaining to human beings".

This is like saying McDonald's is named after the McDonald's happy meal rather than the McDonald brothers.

lacy_tinpot 12/15/2025||||

Just look at their aesthetic/branding, or the way they've trained their models. Very Anthrop-ic.

GlitchInstitute 12/13/2025||||

anthropic is derived from the Greek word anthropos (human)

https://en.wikipedia.org/wiki/Anthropic_principle

yunohn 12/13/2025|||

Really? Anthropic is /the/ AI company known for anthropomorphizing their models, giving them ethics and “souls”, considering their existential crises, etc.

JoshuaDavid 12/13/2025|||

Anthropic was founded by a group of 7 former OpenAI employees who left over differences in opinions about AI Safety. I do not see any public documentation that the specific difference in opinion was that that group thought that OpenAI was too focused on scaling and that there needed to be a purely safety-focused org that still scaled, though that is my impression based on conversations I've had.

But regardless anthropic reasoning was extremely in the intellectual water supply of the Anthropic founders, and they explicitly were not aiming at producing a human-like model.

simonw 12/13/2025||

They tried to fire Sam Altman and left to form their own company when that didn't work. https://simonwillison.net/2023/Nov/22/before-altmans-ouster-...

rcarmo 12/13/2025|||

“you are totally right!” does feel like a very human behavior in some respects…

losvedir 12/13/2025|||

MCP is a terribly designed (and I assume vibe-designed) protocol. Give me the requirements that an LLM needs to be able to load tools dynamically from another server and invoke them like an RPC, and I could give you a much simpler, better solution.

The modern HTTP Streamable version is light-years better, but took a year and was championed by outside engineers faced with the real problem of integrating it, and I imagine was designed by a human.

OpenAI was there first, but unfortunately the models weren't quite good enough yet, so their far superior approach unfortunately didn't take off.

smokel 12/13/2025|||

Also, MCP is a serious security disaster. Too simple, I'd wager.

valzam 12/13/2025|||

I'd argue that this isn't so much a fault of the MCP spec but of how 95% of AI 'engineers' have no engineering background. MCP is just an OpenAPI spec. It's the same as any other API. If you are exposing sensitive data without any authz/n that's on the developer.

sam_lowry_ 12/13/2025||||

complex is a synonym of insecure

brazukadev 12/13/2025|||

MCP's biggest problem is not being simple

nl 12/13/2025|||

Also `CLAUDE.md` (which is `AGENTS.md` everywhere else now?)

msy 12/13/2025|||

I get the impression the innovation drivers at OpenAI have all moved on and the people that have moved in were the ones chasing the money, the rest is history.

nrhrjrjrjtntbt 12/13/2025|||

The RSS of AI

uoaei 12/13/2025||

I like this line of analogy. The next obvious step would be IRC (or microservices?) of AI (for co-reasoning) which could offer the space for specialized LLMs rather than the current approach of monoliths.

jbgt 12/13/2025||

Oh wow, co-reasoning through an IRC-like chat. That's a great idea.

Would be cool (sci fi) for LLMs of different users to chat and discuss approaches to what the humans are talking about etc.

exe34 12/13/2025||

omg that's how crystal society starts and then it goes downhill! highly recommended series in this space.

speakspokespok 12/13/2025|||

I noticed something like this earlier, in the android app you can have it rewrite a paragraph, and then and only then do you have the option to send that as a text message. It's just a button that pops up. Claude has an elegance to it.

ivape 12/13/2025|||

It’s the only AI company that isn’t “monetize at all costs”. I’m curious how deep their culture goes as it’s remarkable they even have any discernible value system in today’s business world.

rcarmo 12/13/2025|||

Well, my MCP servers only really started working when I implemented the prompt endpoints, so I’m happy I’ll never have to use MCP again if this sticks.

blitzar 12/13/2025|||

Anthropic are using AI beyond the chat window. Without external information, context and tools the "magic" of AI evaporates after a few minutes.

baxtr 12/13/2025|||

A good example of:

Build things and then talk about them in a way that people remember and share it with friends.

I guess some call it clever product marketing.

extr 12/13/2025|||

Oh yeah I forgot the biggest one. Claude fucking code. Lol

baby 12/13/2025||

I was very skeptical about anything not OpenAI for a while, and then discovered Claude code, Anthropic blogposts, etc. It's basically the coolest company in the field.

mh- 12/13/2025|||

Claude Code and its ecosystem is what made me pick Anthropic over OpenAI for our engineers, when we decided to do seat licensing for everyone last week.

It's a huge asset.

joemazerino 12/13/2025||||

I appreciate Claude not training on my data by default. ChatGPT through the browser does not give you that option.

skeptic_ai 12/13/2025|||

Same here. Until I read more about them, and they actually seem sketchy too. All about “safety” reasons not to do certain things.

_pdp_ 12/13/2025|||

I hate to be that guy but skills are not an invention of sorts. It's a simple mechanism that exists already in many places.

The biggest unlock was tool calling, which was invented at OpenAI.

simonw 12/13/2025||

I'd credit tool calling to the ReAct paper, which was Princeton CA and Google DeepMind: https://arxiv.org/abs/2210.03629

_pdp_ 12/13/2025||

Oh nice. I did not know. Thanks for the link.

CuriouslyC 12/13/2025||

Anthropic has good marketing, but ironically their well marketed mediocre ideas retard development of better standards.

energy123 12/13/2025||

A public warning about OpenAI's Plus chat subscription as of today.

They advertise 196k tokens context length[1], but you can't submit more than ~50k tokens in one prompt. If you do, the prompt goes through, but they chop off the right-hand-side of your prompt (something like _tokens[:50000]) before calling the model.

This is the same "bug" that existed 4 months ago with GPT-5.0 which they "fixed" only after some high-profile Twitter influencers made noise about it. I haven't been a subscriber for a while, but I re-subscribed recently and discovered that the "bug" is back.

Anyone with a Plus sub can replicate this by generating > 50k tokens of noise then asking it "what is 1+1?". It won't answer.
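A toy illustration of why the question goes unanswered, assuming the truncation really does behave like `_tokens[:50000]` as described. Whitespace-split words stand in for real tokenizer output, and `keep_prefix` is a hypothetical name.

```python
def keep_prefix(tokens: list[str], limit: int) -> list[str]:
    """Keep the first `limit` tokens and silently drop everything after them."""
    return tokens[:limit]


LIMIT = 50_000
noise = ["lorem"] * 60_000            # more than LIMIT tokens of filler
question = "what is 1+1 ?".split()    # the actual request, placed last
prompt_tokens = noise + question

seen_by_model = keep_prefix(prompt_tokens, LIMIT)
question_survived = seen_by_model[-len(question):] == question
print(len(prompt_tokens), len(seen_by_model), question_survived)  # 60004 50000 False
```

Because the question sits at the right-hand end of the prompt, it is exactly the part that gets cut, so the model only ever sees noise.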

[1] https://help.openai.com/en/articles/11909943-gpt-52-in-chatg...

hu3 12/13/2025||

Well this explains the weird behaviour of GPT-5 often ignoring a large part of my prompt when I attached many code/csv files despite keeping total token count under control. That is with Github Copilot inside VSCode.

The fix was to just switch to Claude 3.5 and now to 4.5 in VSCode.

wrcwill 12/13/2025|||

ugh this is so amateurish. i swear since the release of o3 this has been happening on and off.

scrollop 12/13/2025|||

And the Xhigh version is only available via API, not chatgpt.

noname120 12/13/2025||

Are you sure that the “extended thinking” option from the ChatGPT web client is something different?

seunosewa 12/13/2025|||

Probably high.

ismailmaj 12/13/2025|||

"Oh sorry guys, we made the mistake again that saves us X% in compute cost, we will fix it soon!"

DANmode 12/15/2025||

Is this via the API, or only webUI?

simonw 12/12/2025||

I had a bunch of fun writing about this one, mainly because it was a great excuse to highlight the excellent news about Kākāpō breeding season this year.

(I'm not just about pelicans.)

quinncom 12/13/2025||

Awww. If there weren’t only 237 of them, I would want to bring one of them home.

> Kākāpō can be up to 64 cm (25 in) long. They have a combination of unique traits among parrots: finely blotched yellow-green plumage, a distinct facial disc, owl-style forward-facing eyes with surrounding discs of specially-textured feathers, a large grey beak, short legs, large blue feet, relatively short wings and a short tail. It is the world's only flightless parrot, the world's heaviest parrot, and also is nocturnal, herbivorous, visibly sexually dimorphic in body size, has a low basal metabolic rate, and does not have male parental care. It is the only parrot to have a polygynous lek breeding system. It is also possibly one of the world's longest-living birds, with a reported lifespan of up to 100 years.

https://en.wikipedia.org/wiki/K%C4%81k%C4%81p%C5%8D

ajcp 12/13/2025|||

And so the Kākāpō Benchmark was born

swyx 12/13/2025||

and this is my excuse to talk about the :partyparrot: emoji being from an actual real life documentary https://www.youtube.com/watch?v=9T1vfsHYiKY&pp=ygUSa2FrYXBvI...

KK7NIL 12/13/2025|||

TIL about a large moss green flightless parrot :)

uoaei 12/13/2025|||

I'm impressed you have never encountered :partyparrot: in your work Slack.

mkl 12/13/2025|||

They're also nocturnal!

joshtbradley 12/13/2025|||

Will Kākāpō be riding bicycles soon?

OrsonSmelles 12/13/2025|||

They already ride British nature photographers—what do they need bikes for?

throwup238 12/13/2025||

https://youtube.com/watch?v=Jlk9u8MIv7o

The foreplay starts around the 1 minute mark.

pineaux 12/13/2025|||

as an svg you mean? cause nano banana rides circles around the pelicans

bilekas 12/13/2025||

> Skills are a keeper #

Good thinking, I agree actually, however..

> Skills are based on a very light specification, if you could even call it that, but I still think it would be good for these to be formally documented somewhere.

Like a lot of posts around AI, and I hope OP can speak to this: surely you can agree that while this can be used for a good, cool idea, it can also be used for the inverse, and probably to more detrimental effect. Why would they document an unmanageable feature that may be consumed?

Shareholder value might not go up if they learnt that the major product is learning bad things.

Have you or would you try this on a local LLM instead ?

simonw 12/13/2025||

These work well with local LLMs that are powerful enough to run a coding agent environment with a decent amount of context over longer loops.

The OpenAI GPT OSS models can drive Codex CLI, so they should be able to do this.

I have high hopes for Mistral's Devstral 2 but I've not run that locally yet.

bilekas 12/13/2025|||

> These work well with local LLMs that are powerful enough to run a coding agent environment with a decent amount of context over longer loops.

That's actually super interesting, maybe something I'll try to investigate and find the minimum requirements for, because as cool as they seem, personalized 'skills' might be a more useful use of AI overall.

Nice article, and thanks for answering.

Edit: My thinking is consumer grade could be good enough to run this soon.

ipaddr 12/13/2025|||

Something that powerful requires some rewiring of the house.

Local LLMs are better for long batch jobs not things you want immediately or your flow gets killed.

lacker 12/13/2025||

I'm not sure if I have the right mental model for a "skill". It's basically a context-management tool? Like a skill is a brief description of something, and if the model decides it wants the skill based on that description, then it pulls in the rest of whatever amorphous stuff the skill has, scripts, documents, what have you. Is this the right way to think about it?

simonw 12/13/2025||

It's a folder with a markdown file in it plus optional additional reference files and executable scripts.

The clever part is that the markdown file has a section in it like this: https://github.com/datasette/skill/blob/a63d8a2ddac9db8225ee...

  ---
  name: datasette-plugins
  description: "Writing Datasette plugins using Python and the pluggy plugin system. Use when Claude needs to: (1) Create a new Datasette plugin, (2) Implement plugin hooks like prepare_connection, register_routes, render_cell, etc., (3) Add custom SQL functions, (4) Create custom output renderers, (5) Add authentication or permissions logic, (6) Extend Datasette's UI with menus, actions, or templates, (7) Package a plugin for distribution on PyPI"
  ---

On startup Claude Code / Codex CLI etc scan all available skills folders and extract just those descriptions into the context. Then, if you ask them to do something that's covered by a skill, they read the rest of that markdown file on demand before going ahead with the task.
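
The startup scan described above can be sketched in a few lines of Python. This is a minimal illustration, not how Claude Code or Codex CLI actually implement it: `read_frontmatter` and `skill_index` are hypothetical names, the `<skills_root>/<skill>/SKILL.md` layout is an assumption, and the parser only handles simple `key: value` lines rather than full YAML.

```python
from pathlib import Path


def read_frontmatter(skill_md: Path) -> dict[str, str]:
    """Parse simple `key: value` lines between the leading `---` markers."""
    lines = skill_md.read_text(encoding="utf-8").splitlines()
    if not lines or lines[0].strip() != "---":
        return {}
    meta: dict[str, str] = {}
    for line in lines[1:]:
        if line.strip() == "---":
            break  # end of frontmatter; the body is never read into the index
        # Split on the first colon only, so descriptions may contain colons.
        key, sep, value = line.partition(":")
        if sep:
            meta[key.strip()] = value.strip().strip('"')
    return meta


def skill_index(skills_root: Path) -> str:
    """Build the short listing that would be placed in the context at startup."""
    entries = []
    for skill_md in sorted(skills_root.glob("*/SKILL.md")):
        meta = read_frontmatter(skill_md)
        if "name" in meta and "description" in meta:
            entries.append(f"- {meta['name']}: {meta['description']} (file: {skill_md})")
    return "\n".join(entries)
```

Only the name, description and file path end up in the listing; the rest of each SKILL.md stays on disk until the agent decides to open it.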

spike021 12/13/2025|||

Apologies for not reading all of your blogs on this, but a follow-up question. Are models still prone to reading these and disregarding them even if they should be used for a task?

Reason I ask is because a while back I had similar sections in my CLAUDE.md and it would either acknowledge and not use or just ignore them sometimes. I'm assuming that's more of an issue of too much context and now skill-level files like this will reduce that effect?

jrecyclebin 12/13/2025||

Skill descriptions get dumped in your system prompt - just like MCP tool definitions and agent descriptions before them. The more you have, the more the LLM will be unable to focus on any one piece of it. You don't want a bunch of irrelevant junk in there every time you prompt it.

Skills are nice because they offload all the detailed prompts to files that the LLM can ask for. It's getting even better with Anthropic's recent switchboard operator (tool search tool) that doesn't clutter the system prompt but tries to cut the tool list down to those the LLM will need.

ithkuil 12/13/2025|||

Can I organize skills hierarchically? If, when many skills are defined, Claude Code loads all definitions into the prompt, potentially diluting its ability to identify relevant skills, then I'd like a system where only broad skill group summaries load initially, with detailed descriptions loaded on-demand when Claude detects a matching skill group might be useful.

simonw 12/13/2025||

There's a mechanism for that built into skills already: a skill folder can also include additional reference markdown files, and the skill can tell the coding agent to selectively read those extra files only when that information is needed on top of the skill.

There's an instruction about that in the Codex CLI skills prompt: https://simonwillison.net/2025/Dec/13/openai-codex-cli/

  If SKILL.md points to extra folders such as references/, load only the specific files needed for the request; don't bulk-load everything.

ithkuil 12/15/2025||

yes but those are not quite new skills right?

can those markdown files in the references also in turn tell the model to lazily load more references only if the model deems they are useful?

simonw 12/15/2025||

Yes, using regular English prompting:

  If you need to write tests that mock
  an HTTP endpoint, also go ahead and
  read the pytest-mock-httpx.md file

greymalik 12/13/2025|||

> Anthropic's recent switchboard operator

I don’t know what this is and Google isn’t finding anything. Can you clarify?

Maxious 12/13/2025||

https://platform.claude.com/docs/en/agents-and-tools/tool-us...

https://www.anthropic.com/engineering/advanced-tool-use talks more about the why

behnamoh 12/13/2025||||

why did this simple idea take so long to become available? I remember even in llama 2 days I was doing this stuff, and that model didn't even function call.

simonw 12/13/2025|||

Skills only work if you have a full blown code execution environment with a model that can run ls and cat and execute scripts and suchlike.

The models are really good at driving those environments now which makes skills the right idea at the right time.

jstummbillig 12/13/2025|||

Why do you need code execution envs? Could the skill not just be a function over a business process, do a then b then c?

steilpass 12/13/2025||

Turns out that basic shell commands are really powerful for context management. And you get tools which run in shells for free.

But yes. Other agent platforms will adopt this pattern.

true2octave 12/13/2025||

I prefer to provide CLIs to my agent

I find it powerful how it can leverage and self-discover the best way to use a CLI and its parameters to achieve its goals

It feels more powerful than providing a pre-defined set of functions as MCP, which will have less flexibility than a CLI

NiloCK 12/13/2025|||

I still don't really understand `skills` as ... anything? You said yourself that you've been doing this since llama 2 days - what do you mean by "become available"?

It is useful in a user-education sense to communicate that it's good to actively document useful procedures like this, and it is likely a performance / utilization boost that the models are tuned or prompt-steered toward discovering this stuff in a conventional location.

But honestly reading about skills mostly feels like reading:

> # LLM provider has adopted a new paradigm: prompts

> What's a prompt?

> You tell the LLM what you'd like to do, and it tries to do it. OR, you could ask the LLM a question and it will answer to the best of its ability.

Obviously I'm missing something.

baq 12/13/2025||

It’s so simple there isn’t really more to understand. There’s a markdown doc with a summary/abstract section and a full manual section. Summary is always added to the context so the model is aware that there’s something potentially useful stored here and can look up details when it decides the moment is right. IOW it’s a context length management tool which every advanced LLM user had a version of (mine was prompt pieces for special occasions in Apple notes.)

kswzzl 12/13/2025||||

> On startup Claude Code / Codex CLI etc scan all available skills folders and extract just those descriptions into the context. Then, if you ask them to do something that's covered by a skill, they read the rest of that markdown file on demand before going ahead with the task.

Maybe I still don't understand the mechanics - this happens "on startup", every time a new conversation starts? Models go through the trouble of doing ls/cat/extraction of descriptions to bring into context? If so it's happening lightning fast and I somehow don't notice.

Why not just include those descriptions within some level of system prompt?

simonw 12/13/2025||

Yes, it happens on startup of a fresh Claude Code / Codex CLI session. They effectively get pasted into the system prompt.

Reading a few dozen files takes on the order of a few ms. They add enough tokens per skill to fit the metadata description, so probably less than 100 for each skill.

raybb 12/13/2025||

So when it says:

> The body can contain any Markdown; it is not injected into context.

It just means it's not injected into the context until the skill is used or it's never injected into the context?

https://github.com/openai/codex/blob/main/docs/skills.md

simonw 12/13/2025||

Yeah, that means that the body of that file will not be injected into the context on startup.

I had thought that once the skill is selected the whole file would be read, but it looks like that's not the case: https://github.com/openai/codex/blob/ad7b9d63c326d5c92049abd...

  1) After deciding to use a skill, open its `SKILL.md`. Read only enough to follow the workflow.

So you could have a skill file that's thousands of lines long but if the first part of the file provides an outline Codex may stop reading at that point. Maybe you could have a skill that says "see migrations section further down if you need to alter the database table schema" or similar.

wahnfrieden 12/13/2025|||

Knowing Codex, I wonder if it might just search for text in the skill file and read around matches, instead of always reading a bit from the top first.

debugnik 12/13/2025|||

Can models actually stream the file in as they see fit, or is "read only enough" just an attention trick? I suspect the latter.

true2octave 12/13/2025||

Depends on the agent; they can read in chunks (e.g. 500 lines at a time)
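
A chunked read of this kind might look like the sketch below. `read_chunk` is a hypothetical helper, not any particular agent's read tool; the 500-line default simply mirrors the figure mentioned above.

```python
from itertools import islice
from pathlib import Path


def read_chunk(path: Path, offset: int = 0, limit: int = 500) -> list[str]:
    """Return up to `limit` lines starting at zero-based line `offset`."""
    with path.open(encoding="utf-8") as handle:
        # islice skips the first `offset` lines and stops after `limit` more,
        # so the remainder of a long file is never read.
        return [line.rstrip("\n") for line in islice(handle, offset, offset + limit)]
```

With a tool like this the model really can stop after the first chunk, or jump to a later offset if the top of the file points it there.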

kridsdale1 12/13/2025||||

So it’s a header file. In English.

throwaway314155 12/13/2025||||

Do skills get access to the current context or are they a blank slate?

simonw 12/13/2025||

They execute within the current context - it's more that the content of the skill gets added to that context when it is needed.

leetrout 12/13/2025|||

Have you used AWS bedrock? I assume these get pretty affordable with prompt caching...

prescriptivist 12/13/2025|||

Skills have a lot of uses, but one in particular I like is replacing one-off MCP server usage. You can use (or write) an MCP server for your CI system and then add the instructions to your AGENTS.md to query the CI MCP for build results for the current branch. Then you need to find a way to distribute the MCP server so the rest of the team can use it or cook it into your dev environment setup. But all you really care about is one tool in the MCP server, the build result. Or...

You can hack together a shell, python, whatever script that fetches build results from your CI server, dumps them to stdout in a semi-structured format like markdown, then add a 10-15 line SKILL.md and you have the same functionality -- the skill just executes the one-off script and reads the output. You package the skill with the script, usually in a directory in the project you are working on, but you can also distribute them as plugins (bundles) that Claude Code can install from a "repository", which can just be a private git repo.
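
As a rough sketch of the "dump to stdout as markdown" half of such a script: the fetch step is left out because it depends entirely on the CI system, and `format_builds` plus the `branch`, `number`, `status` and `failed_tests` fields are hypothetical, chosen only for illustration.

```python
def format_builds(builds: list[dict]) -> str:
    """Render build records as a short Markdown report for the agent to read."""
    lines = ["# Build results", ""]
    for build in builds:
        lines.append(f"- **{build['branch']}** #{build['number']}: {build['status']}")
        # Failed tests are optional; list them as nested bullets when present.
        for failure in build.get("failed_tests", []):
            lines.append(f"  - failed: {failure}")
    return "\n".join(lines)
```

The script would print the result of `format_builds(...)`, and the SKILL.md only needs to say when to run it and how to read the report.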

It's a little UNIX-y in a way, little tools that pipe output to another tool and they are useful in a standalone context or in a chain of tools. Whereas MCP is a full blown RPC environment (that has its uses, where appropriate).

wiether 12/13/2025||

How do you manage the credentials for requests to your CI server in this case? Are they hardcoded in the script associated with your skill?

true2octave 12/13/2025||

Credentials are tied to the service principal of the user

It’s straightforward for cloud services

delaminator 12/13/2025|||

Claude Code is not very good at “remembering” its skills.

Maybe they get compacted out of the context.

But you can call upon them manually. I often do something like “using your Image Manipulation skill, make the icons from image.png”

Or “use your web design skill to create a design for the front end”

Tbh i do like that.

I also get Claude to write its own skills. “Using what we learned about from this task, write a skill document called /whatever/using your writing skills skill”

I have a GitHub template including my skills and commands, if you want to see them.

https://github.com/lawless-m/claude-skills

jorl17 12/13/2025|||

I'm so excited for the future, because _clearly_ our technology has loads to improve. Even if new models don't come out, the tooling we build upon them, and the way we use them, is sure to improve.

One particular way I can imagine this is with some sort of "multipass makeshift attention system" built on top of the mechanisms we have today. I think for sure we can store the available skills in one place and look only at the last part of the query, asking the model the question: "Given this small, self-contained bit of the conversation, do you think any of these skills is a prime candidate to be used?" or "Do you need a little bit more context to make that decision?". We then pass along that model's final answer as a suggestion to the actual model creating the answer. There is a delicate balance between "leading the model on" with imperfect information (because we cut the context), and actually "focusing it" on the task at hand and the skill selection. Well, and, of course, there's the issue of time and cost.

I actually believe we will see several solutions make use of techniques such as this, where some model determines what the "big context" model should be focusing on as part of its larger context (in which it may get lost).

In many ways, this is similar to what modern agents already do. Cursor doesn't keep files in the context: it constantly re-reads only the parts it believes are important. But I think it might be useful to keep the files in the context (so we don't make an egregious mistake) at the same time that we also find what parts of the context are more important and re-feed them to the model or highlight them somehow.

Sammi 12/13/2025|||

I'm kinda confused about why this even is something that we need an extra feature for when it's basically already built in to the agentic development feature. I just keep a folder of md files and I add whatever one is relevant when it's relevant. It's kinda straight forward to do...

Just like you I don't edit much in these files on my own. Mostly just ask the model to update an md file whenever I think we've figured out something new, so the learning sticks. I have files for test writing, backend route writing, db migration writing, frontend component writing etc. Whenever a section gets too big to live in agents.md it gets its own file.

jorl17 12/13/2025|||

Because the concept of skills is not tied to code development :) Of course if that's what you're talking about, you are already very close to the "interface" that skills are presented in, and they are obvious (and perhaps not so useful)

But think of your dad or grandma using a generic agent, and simply selecting that they want to have certain skills available to it. Don't even think of it as a chat interface. This is just some option that they set in their phone assistant app. Or, rather, it may be that they actually selected "Determine the best skills based on context", and the assistant has "skill packs" which it periodically determines it needs to enable based on key moments in the conversation or latest interactions.

These are all workarounds for the problems of learning, memory...and, ultimately, limited context. But they for sure will be extremely useful.

delaminator 12/13/2025|||

It’s a formalisation of the method, and it’s in your global ~/.claude and also per project.

I have mine in a GitHub template so I can even use them in Claude Code for the web. And synchronise them across my various machines (which is about 6 machines atm).

marwamc 12/13/2025|||

My understanding is this: A skill is made up of SKILL.md which is what tells claude how and when to use this skill. I'm a bit of a control freak so I'll usually explicitly direct claude to "load the wireframe-skill" and then do X.

Now SKILL.md can have references to more finegrained behaviors or capabilities of our skill. My skills generally tend to have a reference/{workflows,tools,standards,testing-guide,routing,api-integration}.md. These references are what then gets "progressively loaded" into the context.

Say I asked claude to use the wireframe-skill to create profileView mockup. While creating the wireframe, claude will need to figure out what API endpoints are available/relevant for the profileView and the response types etc. It's at this point that claude reads the references/api-integration.md file from the wireframe skill.

After a while I found I didn't like the progressive loading so I usually direct claude to load all references in the skill before proceeding - this usually takes up maybe 20k to 30k tokens, but the accuracy and precision (imagined or otherwise ha!) is worth it for my use cases.

kxrm 12/13/2025|||

> I'm a bit of a control freak so I'll usually explicitly direct claude to "load the wireframe-skill" and then do X.

You shouldn't do this, it's generally considered bad practice.

You should be optimizing your skill description. Oftentimes if I am working with Claude Code and it doesn't load a skill, I ask it why it missed the skill. It will guide me to improving the skill description so that it is picked up properly next time.

This iteration on skill description has allowed skills to stay out of context until they are needed rather predictably for me so far.

adastra22 12/13/2025|||

There are different ways to use the tool. If you chat with the model, you want it to naturally pick the right tool to use based on vibes and context so you don’t have to repeat yourself. If you are plugging a call to Claude Code into a larger, structured workflow, you want the tool selection to be deterministic.

rane 12/13/2025|||

It's not enough. Sometimes skills just randomly won't be invoked.

chrisweekly 12/13/2025|||

My understanding is that use of "description" frontmatter is essential, bc Claude Code can read just the description without loading the entire file into context.

taytus 12/13/2025|||

Easy, let me try to explain: You want to achieve X, so you ask your AI companion, "How do I do X?" Your companion thinks and tries a couple of things, and they eventually work. So you say, "You know what, next time, instead of figuring it out, just do this"... that is a skill. A recipe for how to do things.

jmalicki 12/13/2025|||

Yes. I find these very useful for enforcing e.g. skills like debugging, committing code, making PRs, responding to PR feedback from AI review agents, etc. without constantly polluting the context window.

So when it's time to commit, make sure you run these checks, write a good commit message, etc.

Debugging is especially useful since AI agents can often go off the rails and go into loops rewriting code - so it's in a skill I can push for "read the log messages. Insert some more useful debug assertions to isolate the failure. Write some more unit tests that are more specific." Etc.

canadiantim 12/13/2025||

I think it’s also important to think of skills in the context of tasks, so when you want an agent to perform a specialized task, then this is the context, the resources and scripts it needs to perform the task.

hadlock 12/13/2025||

I'm excited to use this with the Ghidra cli mode to rapidly decompile physics engines from various games. Do I want my flight simulator to behave in the air like the Cessna in Flight Simulator 3.0? Codex can already do that. Do I want the plane to handle like Yoshi from Mario Kart 64 when taxiing? It hasn't been done yet but Claude Code is apparently pretty good at pulling apart N64 ROMs so that seems within the realm of possibility.

mbesto 12/13/2025||

From a purely technical view, skills are just an automated way to introduce user and system prompt stuffing into the context right? Not to belittle this, but rather that seems like a way of reducing the need for AI wrapper apps since most AI wrappers just do systematic user and system prompt stuffing + potentially RAG + potentially MCP.

simonw 12/13/2025|

Yeah, there are a whole lot of AI wrapper applications that could be a folder with a markdown file in at this point!

ctoth 12/13/2025||

@simonw Thank you for always setting alt text in your images. I really appreciate it.

GaggiX 12/13/2025|

When there is no alt text do you have like a solution for that? Like VLMs are really powerful, I imagine they can be used to parse through the unlabeled images automatically if needed.

mkagenius 12/13/2025||

If anyone wants to use skills with any other model or tool like Gemini CLI etc. I had created open-skills, which lets you use skills for any other llm.

Caveat: needs mac to run

Bonus: it runs it locally in a container, not on cloud nor directly on mac

1. Open-Skills: https://GitHub.com/BandarLabs/open-skills

swyx 12/13/2025||

we just released Anthropic's Skills talk for those who want to find more info on the design thinking / capabilities: https://www.youtube.com/watch?v=CEvIs9y1uog&t=2s

jumploops 12/13/2025|

I think the future is likely one that mixes the kitchen-sink style MCP resources with custom skills.

Services can provide an MCP-like layer that provides semantic definitions of everything you can do with said service (API + docs).

Skills can then be built that combine some subset of the 3rd party interfaces, some bespoke code, etc. and then surface these more context-focused skills to the LLM/agent.

Couldn’t we just use APIs?

Yes, but not every API is documented in the same way. An “MCP-like” registry might be the right abstraction for 3rd parties to expose their services in a semantic-first way.

prescriptivist 12/13/2025||

Agree. I'd add that an aha moment with skills is that AI agents are pretty good at writing skills. Let's say you have developed an involved prompt that explains how to hit an API (possibly with the complexity of reading credentials from an env var or config file) or run a tool locally to get some output you want the agent to analyze (for example, downloading two versions of Python packages and diffing them to analyze changes). Usually the agent reading the prompt is going to leverage local tools to do it (curl, shell + stdout, git, whatever) every single time. Every time you execute that prompt there is a lot of thinking spent on deciding to run these commands and you are burning tokens (and time!). As an eng you know that this is a relatively consistent and deterministic process to fetch the data. And if you were consuming it yourself, you'd write a script to automate it.

So you read about skills (prompt + scripts) to make this more repeatable and reduce time spent thinking. At that point there are two paths you can go down -- write the skill and prompt yourself for the agent to execute -- or better -- just tell the agent to write the skill and prompt and then you lightly edit it and commit it.

This may seem obvious to some, but I've seen engineers create skills from scratch because they have a mental model around skills being something that people must build for the agent, whereas IMO skills are you just bridging a productivity gap that the agent can't figure out itself (for now), which is instructing it to write tools to automate its own day to day tedium.

simonw 12/13/2025|||

The example Datasette plugin authoring skill I used in my article was entirely written by Claude Opus 4.5 - I uploaded a zip file to it with the Datasette repo in it (after it failed to clone that itself for some weird environment reason) and had it use its skill-writing skill to create the rest: https://claude.ai/share/0a9b369b-f868-4065-91d1-fd646c5db3f4

prescriptivist 12/13/2025||

That's awesome and I have a few similar conversations with Claude. I wasn't quite an AI luddite a couple months ago, but close. I joined a new company recently that is all in on AI and I have a comically huge token budget so I jumped all the way in myself. I have my choice of tools I can use and once I tried Claude Code it all clicked. The topology they are creating for AI tooling and concepts is the best of all the big LLMs, by far. If they can figure out the remote/cloud agent piece with the level of thoughtfulness they have given to Code, it'd be amazing. Cursor Cloud has that area locked down right now, but I'm looking forward to how Anthropic approaches it.

TypeDeck 12/13/2025|||

Completely agree with both points. Skills replacing one-off microservices and agents writing their own skills feel like two sides of the same coin to me. I’m a solo developer building a markdown-first slide editing app. The core format is just Markdown with --- slide separators, but it has custom HTML comment directives for layouts (, , etc.) and content-type detection for tables, code blocks, and Mermaid diagrams. It’s a small DSL, but enough that an LLM without context will generate slides that don’t render optimally. Right now my app is designed for copy-paste from external LLMs, which means users have to manually include the format spec in their prompts every time. But your comment about agents writing skills made me realize the better path: I could just ask Claude Code to read my parser and layout components, then generate a Slide_Syntax_Guide skill for me. The agent already understands the codebase—it can write the definitive spec better than I could document it manually.

dkdcio 12/13/2025|||

CLIs are really good when you can use them. self-documenting, agents already have shell tools, they tend to solve fine-grained auth, etc.

feels like the right layer of abstraction for remote APIs

esafak 12/13/2025||

If only there was a way to progressively disclose the API in MCP instead of presenting the full laundry list up front.

simonw 12/13/2025||

That is effectively what this proposal is about: https://www.anthropic.com/engineering/code-execution-with-mc...
