Posted by simonw 2 days ago
The idea behind skills is sound because context management matters.
However, skills are different from MCP. Skills have nothing to do with tool calling at all!
You can implement your own version of skills easily and there is absolutely zero need for any kind of standard or a framework of sorts. The way to do it is to register a tool / function to load and extend the base prompt and presto - you have implemented your own version of skills.
In ChatBotKit AI Widget we even have our own version of that for both the server and when building client-side applications.
With client-side applications the whole thing is implemented with a simple react hook that adds the necessary tools to extend the prompt dynamically. You can easily come up with your own implementation of that with 20-30 lines of code. It is not complicated.
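A minimal sketch of the "register a tool that extends the base prompt" idea described above. Everything here is hypothetical and for illustration only: the `SKILLS` store, the `Conversation` class and the `load_skill` tool name are assumptions, not ChatBotKit's API or any vendor's implementation.

```python
from dataclasses import dataclass, field

# Hypothetical in-memory skill store; a real one might read markdown files.
SKILLS = {
    "integration-tests": "When adding integration tests, start the test server first.",
    "release-notes": "Write release notes as a bulleted list grouped by component.",
}


@dataclass
class Conversation:
    base_prompt: str
    loaded: list[str] = field(default_factory=list)

    def system_prompt(self) -> str:
        """Base prompt plus the text of every skill loaded so far."""
        parts = [self.base_prompt]
        parts += [f"## Skill: {name}\n{SKILLS[name]}" for name in self.loaded]
        return "\n\n".join(parts)

    def tool_catalog(self) -> str:
        """Short listing the model sees up front (names only, no bodies)."""
        return "Available skills: " + ", ".join(sorted(SKILLS))

    def load_skill(self, name: str) -> str:
        """The tool the model would call; it extends the prompt for later turns."""
        if name not in SKILLS:
            return f"Unknown skill: {name}"
        if name not in self.loaded:
            self.loaded.append(name)
        return f"Loaded skill: {name}"
```

The model only sees the catalog until it calls `load_skill`; after that, the skill body is part of the system prompt for subsequent turns.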
Very often people latch onto some idea thinking this is the next big thing, hoping that it will explode. It is not new and it won't explode! It is just part of a suite of tools that already exist in various forms. The mechanic is so simple at its core that it practically makes no sense to call it a standard, and there is absolutely zero need to have it for most types of applications. It does make sense for coding assistants though, as they work with quite a bit of data, so there it matters. But skills are not fundamentally different from the *.instruction.md prompt in Copilot or AGENT.md and its variations.
One of the best patterns I’ve seen is having an /ai-notes folder with files like ‘adding-integration-tests.md’ that contain specialized knowledge suitable for specific tasks. These “skills” can then be inserted/linked into prompts where I think they are relevant.
But these skills can’t be static. For best results, I observe what knowledge would make the AI better at the skill the next time. Sometimes I ask the AI to propose new learnings to add to the relevant skill files, and I adopt the sensical ones while managing length carefully.
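One way the /ai-notes pattern above might be wired up, as a rough sketch. The `build_prompt` helper and the `<note>` wrapper format are assumptions made for illustration; the commenter does not say how the notes are actually inserted.

```python
from pathlib import Path


def build_prompt(task: str, notes_dir: Path, note_names: list[str]) -> str:
    """Prepend the chosen notes files to the task description."""
    sections = []
    for name in note_names:
        path = notes_dir / name
        # A missing note raises FileNotFoundError instead of being skipped silently.
        text = path.read_text(encoding="utf-8").strip()
        sections.append(f'<note file="{name}">\n{text}\n</note>')
    sections.append(f"Task: {task}")
    return "\n\n".join(sections)
```

Calling `build_prompt("Add a test for login", Path("ai-notes"), ["adding-integration-tests.md"])` would put that one note ahead of the task and leave every other note out of the context.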
Skills are a great concept for specialized knowledge, but they really aren’t a groundbreaking idea. It’s just context engineering.
Documentation, variable naming, automated tests, specs, type checks, linting. Anything the agent can bang its proverbial head against in a loop for a while without involving you every step of the way.
I like to think I'm above average in terms of having design docs alongside my code, having meaningful comments, etc. But playing with agents recently has pointed out several ways I could be doing better.
That is, skills make the most sense when paired with a Python script or cli that the skill uses. Nowadays most of the AI model providers have code execution environments that the models can use.
Previously, you could only use such skills with locally running agent clis.
This is imo the big enabler, which may totally mean that “skills will go big”. And yeah, having implemented multiple MCP servers, I think skills are a way better approach for most use-cases.
You can develop skills incrementally, starting with just one md file describing how to do something, and no code at first.
As you run through it for the first several times, testing and debugging it, you accumulate a rich history of prompts, examples, commands, errors, recovery, backing up and branching. But that chat history is ephemeral, so you need to scoop it up and fold it back into the md instructions.
While the experience is still fresh in the chat, have it uplift knowledge from the experience into the md instructions, refine the instructions with more details, give concrete examples of input and output, add more detailed and explicit instructions, handle exceptions and prerequisites, etc.
Then after you have a robust reliable set of instructions and examples for solving a problem (with branches and conditionals and loops to handle different conditions, like installing prerequisite tools, or checking and handling different cases), you can have it rewrite the parts that don't require "thought" into python, as a self documenting cli tool that an llm, you, and other scripts can call.
It's great to end up with a tangible well documented cli tool that you can use yourself interactively, and build on top of with other scripts.
Often the whole procedure can be rewritten in python, in which case the md instructions only need to tell how to use the python cli tool you've generated, which cli.py --help will fully document.
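To make the "self documenting cli tool" step concrete, here is a small sketch using argparse, where `--help` output is generated from the same definitions the tool runs on. The task (checking that prerequisite tools are on PATH) and all names are hypothetical, chosen to echo the prerequisite-checking case mentioned above.

```python
import argparse
import shutil


def check_tools(tools: list[str]) -> dict[str, bool]:
    """Map each tool name to whether it is found on PATH."""
    return {tool: shutil.which(tool) is not None for tool in tools}


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(
        prog="cli.py",
        description="Check that prerequisite command-line tools are installed.",
    )
    parser.add_argument("tools", nargs="+",
                        help="Names of executables to look for on PATH.")
    parser.add_argument("--quiet", action="store_true",
                        help="Print only the missing tools.")
    return parser


def main(argv=None) -> int:
    """Return 0 if every tool was found, 1 otherwise."""
    args = build_parser().parse_args(argv)
    results = check_tools(args.tools)
    for tool, found in results.items():
        if found and args.quiet:
            continue
        print(f"{tool}: {'ok' if found else 'MISSING'}")
    return 0 if all(results.values()) else 1
```

Saved as cli.py with the usual `if __name__ == "__main__": raise SystemExit(main())` guard at the bottom, the md instructions can shrink to "run `cli.py --help`", and the exit code gives the LLM or another script something deterministic to branch on.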
But if it requires a mix of llm decision making or processing plus easily automated deterministic procedures, then the art is in breaking it up into one or more cli tools and file formats, and having the llm orchestrate them.
Finally you can take it all the way into one tool, turn it outside in, and have the python cli tool call out to an llm, instead of being called by an llm, so it can run independently outside of cursor or whatever.
It's a lot like a "just in time" compiler from md instructions to python code.
Anyone can write up (and refine) this "Self Optimizing Skills" approach in another md file of meta instructions for incrementally bootstrapping md instructions into python clis.
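The "turn it outside in" step mentioned above could look roughly like this: plain Python does the deterministic part and calls out to a model only for the part that needs judgment. The `llm` parameter is a placeholder for whatever client you use; no particular provider API is assumed.

```python
from typing import Callable


def summarize_failures(log_lines: list[str], llm: Callable[[str], str]) -> str:
    """Filter a test log in Python, then ask a model to summarize the failures."""
    # Deterministic step: no model needed to find the failing lines.
    failures = [line for line in log_lines if "FAIL" in line]
    if not failures:
        return "No failures."
    # Judgment step: only the filtered lines are handed to the model.
    prompt = "Summarize these test failures in one sentence:\n" + "\n".join(failures)
    return llm(prompt)
```

Because the model is passed in as a function, the same tool runs outside Cursor or any agent, and can be tested with a stub.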
Skills also have a nicer way of working with the context, by default (and in the main web uis), with their overview-driven lazy loading.
Yes, in the end skills are just another way to manage prompts and avoid cluttering the context of a model, but they happen to be one that works really well.
Type `import math`
You now have more skills (symbols)
Skills do that for prompts.
Like this, you can divide a job to be done into blocks of reasoning and deterministic tasks. The latter are scripts/commands. The whole package is called skills.
So are they basically just function tool calls whose return value is a constant string? Do we know if that’s how they’re implemented, or is the string inserted into the new input context as something other than a function_call_output?
https://platform.claude.com/docs/en/agents-and-tools/agent-s...
And as implemented in Codex: https://github.com/openai/codex/pull/7412/changes#diff-35647...
Although skills require that you have certain tools available, like basic file system operations, so the model can read the skills files. Usually this is implemented as an ephemeral "sandbox environment" where the LLM has access to the file system and can also execute Python, run bash commands, etc.
Dotprompt / Claude / Dia browser skills - "Skills Everywhere: Portable Playbooks for Codex, Claude, and Dia"
Those instructions can reference external scripts that Claude executes without loading the source. You can package them with hooks and agents in plugins. You pay tokens for the output, not the code that calls it.
Install five MCPs and you've burned a large chunk of tokens before typing a prompt. With skills, you only pay for what you use.
You can call deterministic code (pipelines, APIs, domain logic) with a non-deterministic model, triggered by plain language, without the context bloat.
In the same way Nagel knew what it was like to be a bat, Anthropic has the highest fraction of people who approximately know what it's like to be a frontier ai model.
I can name OpenAI CEO but not Anthropic CEO off the top of my head. And I actually like Anthropic's work way more than what OpenAI is doing right now.
This is a prime example of what you're saying. Creating a "foundation" for a protocol created a year ago that's not even a protocol
Has the Gavin Belson tethics energy
If anyone disagrees, I would like to see their long running deep research agents built on gemini or openai.
it's an open question how many of OpenAI's users are monetizable.
There's an argument to be made that your brand being what the general public identifies with AI is a medium term liability in light of the vast capital and operating costs involved.
It may well be that Anthropic focusing on an orders-of-magnitude smaller, but immediately monetizable market will play out better.
AFAICT, Claude Code has the biggest engineering mind share. An Apple software engineer of mine says he sometimes uses $100/day of Claude Code tokens at work and gets sad, because that's the budget.
Also, look at costs and revenue. OpenAI is bleeding way more than Anthropic.
Maybe they should do less vibe coding on their checkout flow and they might have more users.
Their valuations come from completely different calculus: Anthropic looks much more like a high potential early startup still going after PMF while OpenAI looks more like a series B flailing to monetize.
The cutting edge has largely moved past benchmarks. Beyond a certain performance threshold that all these models have reached, nobody really cares about scores anymore, except people overfitting to them. They’re going for models that users like better, and Claude has a very loyal following.
TLDR, OpenAI has already peaked, Anthropic hasn’t, hence the valuation difference.
It really should be required viewing for anyone in the industry, it has so much spot-on social commentary, it's just not "tecthical" not to be fully aware of it, even if it stings.
https://silicon-valley.fandom.com/wiki/Tethics
>Meanwhile, Gavin Belson (Matt Ross) comes up with a code of ethics for tech, which he lamely calls "tethics", and urges all tech CEOs to sign a pledge to abide by the tethics code. Richard refuses to sign, he considers the pledge to be unenforceable and meaningless.
>Belson invites Richard to the inauguration of the Gavin Belson Institute for Tethics. Before Belson's speech, Richard confronts the former Hooli CEO with the fact that the tethics pledge is a stream of brazenly plagiarized banalities, much like Belson's novel Cold Ice Cream & Hot Kisses.
>Once at the podium, Belson discards his planned speech and instead confesses to his misdeeds when he was CEO of Hooli. Belson urges California's attorney general to open an investigation.
>Richard mistakenly thinks that Belson is repentant for all his past bad behavior. But, as Ron LaFlamme (Ben Feldman) explains, Belson's contrite act is just another effort to sandbag Richard. If the attorney general finds that Belson acted unethically during his tenure as Hooli CEO, the current Hooli CEO would be the one who has to pay the fine. And since Pied Piper absorbed Hooli, it would be Pied Piper that has to pay the fine.
Out of the box Claude skills can call python scripts that load modules from Pypi or even GitHub, potentially ones that include data like sqlite files or parquet tables.
Not just in Claude Code. Anywhere, including the mobile app.
MCP/Tool use, Skills, and I'm sure others that I can't think of.
This might be because of some core direction that is more coherent than other labs.
This is like saying McDonald's is named after the McDonald's happy meal rather than the McDonald brothers.
But regardless anthropic reasoning was extremely in the intellectual water supply of the Anthropic founders, and they explicitly were not aiming at producing a human-like model.
The modern HTTP Streamable version is light-years better, but took a year and was championed by outside engineers faced with the real problem of integrating it, and I imagine was designed by a human.
OpenAI was there first, but unfortunately the models weren't quite good enough yet, so their far superior approach unfortunately didn't take off.
Would be cool (sci fi) for LLMs of different users to chat and discuss approaches to what the humans are talking about etc.
Build things and then talk about them in a way that people remember and share it with friends.
I guess some call it clever product marketing.
It's a huge asset.
The biggest unlock was tool calling that was invented at OpenAI.
They advertise 196k tokens context length[1], but you can't submit more than ~50k tokens in one prompt. If you do, the prompt goes through, but they chop off the right-hand-side of your prompt (something like _tokens[:50000]) before calling the model.
This is the same "bug" that existed 4 months ago with GPT-5.0 which they "fixed" only after some high-profile Twitter influencers made noise about it. I haven't been a subscriber for a while, but I re-subscribed recently and discovered that the "bug" is back.
Anyone with a Plus sub can replicate this by generating > 50k tokens of noise then asking it "what is 1+1?". It won't answer.
[1] https://help.openai.com/en/articles/11909943-gpt-52-in-chatg...
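A toy illustration of the right-hand truncation described above, showing why a question placed after ~50k tokens of noise would never reach the model. Whole words stand in for tokens here, and `truncate_right` is a hypothetical helper mirroring the `_tokens[:50000]` guess in the comment, not anything known about the actual service.

```python
def truncate_right(tokens: list[str], limit: int) -> list[str]:
    """Keep the first `limit` tokens and drop everything after them."""
    return tokens[:limit]


LIMIT = 50_000  # approximate cutoff reported in the comment above
noise = ["noise"] * 60_000
question = ["what", "is", "1+1?"]
prompt = noise + question

sent = truncate_right(prompt, LIMIT)
print(len(sent))        # 50000
print("1+1?" in sent)   # False: the question sat past the cutoff
```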
The fix was to just switch to Claude 3.5 and now to 4.5 in VSCode.
(I'm not just about pelicans.)
> Kākāpō can be up to 64 cm (25 in) long. They have a combination of unique traits among parrots: finely blotched yellow-green plumage, a distinct facial disc, owl-style forward-facing eyes with surrounding discs of specially-textured feathers, a large grey beak, short legs, large blue feet, relatively short wings and a short tail. It is the world's only flightless parrot, the world's heaviest parrot, and also is nocturnal, herbivorous, visibly sexually dimorphic in body size, has a low basal metabolic rate, and does not have male parental care. It is the only parrot to have a polygynous lek breeding system. It is also possibly one of the world's longest-living birds, with a reported lifespan of up to 100 years.
The foreplay starts around the 1 minute mark.
Good thinking, I agree actually, however..
> Skills are based on a very light specification, if you could even call it that, but I still think it would be good for these to be formally documented somewhere.
Like a lot of posts around AI, and I hope OP can speak to it, surely you can agree that while this can be used for a good, cool idea, it can also be used for the inverse, and probably to more detrimental effect. Why would they document an unmanageable feature that may be consumed?
Shareholder value might not go up if they learnt that the major product is learning bad things.
Have you tried, or would you try, this on a local LLM instead?
The OpenAI GPT OSS models can drive Codex CLI, so they should be able to do this.
I have high hopes for Mistral's Devstral 2 but I've not run that locally yet.
That's actually super interesting, maybe something I'll try to investigate to find the minimum requirements, because as cool as they seem, personalized 'skills' might be a more useful use of AI overall.
Nice article, and thanks for answering.
Edit: My thinking is consumer grade could be good enough to run this soon.
Local LLMs are better for long batch jobs, not things you want immediately, or your flow gets killed.
The clever part is that the markdown file has a section in it like this: https://github.com/datasette/skill/blob/a63d8a2ddac9db8225ee...
---
name: datasette-plugins
description: "Writing Datasette plugins using Python and the pluggy plugin system. Use when Claude needs to: (1) Create a new Datasette plugin, (2) Implement plugin hooks like prepare_connection, register_routes, render_cell, etc., (3) Add custom SQL functions, (4) Create custom output renderers, (5) Add authentication or permissions logic, (6) Extend Datasette's UI with menus, actions, or templates, (7) Package a plugin for distribution on PyPI"
---
On startup Claude Code / Codex CLI etc scan all available skills folders and extract just those descriptions into the context. Then, if you ask them to do something that's covered by a skill, they read the rest of that markdown file on demand before going ahead with the task.
Reason I ask is because a while back I had similar sections in my CLAUDE.md and it would either acknowledge and not use or just ignore them sometimes. I'm assuming that's more of an issue of too much context and now skill-level files like this will reduce that effect?
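The startup scan described above can be approximated in a few lines. This is a sketch under stated assumptions, not how Claude Code or Codex CLI actually implement it: the `<root>/<skill>/SKILL.md` layout is assumed, and the frontmatter parser only handles flat `key: value` lines rather than full YAML.

```python
from pathlib import Path


def read_frontmatter(text: str) -> dict[str, str]:
    """Parse simple `key: value` lines between the leading --- markers."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}
    meta = {}
    for line in lines[1:]:
        if line.strip() == "---":
            break
        # Split on the first colon only; descriptions may contain more colons.
        key, sep, value = line.partition(":")
        if sep:
            meta[key.strip()] = value.strip().strip('"')
    return meta


def skill_index(skills_root: Path) -> str:
    """One line per skill: name, description, and where the full file lives."""
    entries = []
    for skill_md in sorted(skills_root.glob("*/SKILL.md")):
        meta = read_frontmatter(skill_md.read_text(encoding="utf-8"))
        if "name" in meta and "description" in meta:
            entries.append(f"- {meta['name']}: {meta['description']} (file: {skill_md})")
    return "\n".join(entries)
```

Only the index goes into the context up front; the body of each SKILL.md stays on disk until the model decides to read it.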
Skills are nice because they offload all the detailed prompts to files that the LLM can ask for. It's getting even better with Anthropic's recent switchboard operator (tool search tool) that doesn't clutter the system prompt but tries to cut the tool list down to those the LLM will need.
There's an instruction about that in the Codex CLI skills prompt: https://simonwillison.net/2025/Dec/13/openai-codex-cli/
If SKILL.md points to extra folders such as references/, load only the specific files needed for the request; don't bulk-load everything.
Can those markdown files in the references also in turn tell the model to lazily load more references only if the model deems they are useful?
If you need to write tests that mock
an HTTP endpoint, also go ahead and
read the pytest-mock-httpx.md file
I don’t know what this is and Google isn’t finding anything. Can you clarify?
The models are really good at driving those environments now which makes skills the right idea at the right time.
But yes. Other agent platforms will adopt this pattern.
I find it powerful how it can leverage and self-discover the best way to use a CLI and its parameters to achieve its goals.
It feels more powerful than providing a pre-defined set of functions as MCP, which will have less flexibility than a CLI.
It is useful in a user-education sense to communicate that it's good to actively document useful procedures like this, and it is likely a performance / utilization boost that the models are tuned or prompt-steered toward discovering this stuff in a conventional location.
But honestly reading about skills mostly feels like reading:
> # LLM provider has adopted a new paradigm: prompts
> What's a prompt?
> You tell the LLM what you'd like to do, and it tries to do it. OR, you could ask the LLM a question and it will answer to the best of its ability.
Obviously I'm missing something.
Maybe I still don't understand the mechanics - this happens "on startup", every time a new conversation starts? Models go through the trouble of doing ls/cat/extraction of descriptions to bring into context? If so it's happening lightning fast and I somehow don't notice.
Why not just include those descriptions within some level of system prompt?
Reading a few dozen files takes on the order of a few ms. They add enough tokens per skill to fit the metadata description, so probably less than 100 for each skill.
> The body can contain any Markdown; it is not injected into context.
It just means it's not injected into the context until the skill is used or it's never injected into the context?
I had thought that once the skill is selected the whole file would be read, but it looks like that's not the case: https://github.com/openai/codex/blob/ad7b9d63c326d5c92049abd...
1) After deciding to use a skill, open its `SKILL.md`. Read only enough to follow the workflow.
So you could have a skill file that's thousands of lines long but if the first part of the file provides an outline Codex may stop reading at that point. Maybe you could have a skill that says "see migrations section further down if you need to alter the database table schema" or similar.
You can hack together a shell, python, whatever script that fetches build results from your CI server, dumps them to stdout in a semi structured format like markdown, then add a 10-15 line SKILL.md and you have the same functionality -- the skill just executes the one-off script and reads the output. You package the skill with the script, usually in a directory in the project you are working on, but you can also distribute them as plugins (bundles) that Claude Code can install from a "repository", which can just be a private git repo.
It's a little UNIX-y in a way, little tools that pipe output to another tool and they are useful in a standalone context or in a chain of tools. Whereas MCP is a full blown RPC environment (that has its uses, where appropriate).
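A sketch of the kind of one-off script described above: it turns build results into markdown that a skill can read from stdout. The JSON shape (`id`, `branch`, `status`, `failed_tests`) is invented for the example; a real script would fetch from your CI server's own API and adapt to its format.

```python
import json


def builds_to_markdown(payload: str) -> str:
    """Render a JSON list of build records as a short markdown report."""
    builds = json.loads(payload)
    lines = ["# Recent builds", ""]
    for build in builds:
        mark = "PASS" if build["status"] == "passed" else "FAIL"
        lines.append(f"- [{mark}] #{build['id']} on `{build['branch']}`")
        for test in build.get("failed_tests", []):
            lines.append(f"  - failed: `{test}`")
    return "\n".join(lines)


# Hypothetical sample data standing in for a CI server response.
SAMPLE = json.dumps([
    {"id": 41, "branch": "main", "status": "passed"},
    {"id": 42, "branch": "fix-login", "status": "failed",
     "failed_tests": ["test_login_redirect"]},
])

print(builds_to_markdown(SAMPLE))
```

The accompanying SKILL.md then only needs to say when to run the script and how to read its output.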
It’s straightforward for cloud services
Maybe they get compacted out of the context.
But you can call upon them manually. I often do something like “using your Image Manipulation skill, make the icons from image.png”
Or “use your web design skill to create a design for the front end”
Tbh i do like that.
I also get Claude to write its own skills. “Using what we learned about from this task, write a skill document called /whatever/using your writing skills skill”
I have a GitHub template including my skills and commands, if you want to see them.
One particular way I can imagine this is with some sort of "multipass makeshift attention system" built on top of the mechanisms we have today. I think for sure we can store the available skills in one place and look only at the last part of the query, asking the model the question: "Given this small, self-contained bit of the conversation, do you think any of these skills is a prime candidate to be used?" or "Do you need a little bit more context to make that decision?". We then pass along that model's final answer as a suggestion to the actual model creating the answer. There is a delicate balance between "leading the model on" with imperfect information (because we cut the context), and actually "focusing it" on the task at hand and the skill selection. Well, and, of course, there's the issue of time and cost.
I actually believe we will see several solutions make use of techniques such as this, where some model determines what the "big context" model should be focusing on as part of its larger context (in which it may get lost).
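The pre-pass idea above might be sketched like this: a cheap model sees only the tail of the conversation plus the skill catalog, and its answer is passed along as a suggestion. `suggest_skill`, the `window` parameter and the `ask_model` callable are all hypothetical; any real system would need to handle the "I need more context" branch as well.

```python
from typing import Callable, Optional


def suggest_skill(
    conversation: list[str],
    skills: dict[str, str],
    ask_model: Callable[[str], str],
    window: int = 2,
) -> Optional[str]:
    """Ask a small model whether a skill fits the tail of the conversation."""
    tail = "\n".join(conversation[-window:])
    catalog = "\n".join(f"- {name}: {desc}" for name, desc in skills.items())
    prompt = (
        "Given this small, self-contained bit of the conversation, is any of "
        "these skills a prime candidate to be used? Reply with the skill name "
        "or NONE.\n\n"
        f"Skills:\n{catalog}\n\nConversation:\n{tail}"
    )
    answer = ask_model(prompt).strip()
    # Anything that is not a known skill name counts as "no suggestion".
    return answer if answer in skills else None
```

The trade-off the comment mentions shows up in `window`: a smaller tail is cheaper and more focused, but risks a suggestion made on imperfect information.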
In many ways, this is similar to what modern agents already do. Cursor doesn't keep files in the context: it constantly re-reads only the parts it believes are important. But I think it might be useful to keep the files in the context (so we don't make an egregious mistake) at the same time that we also find what parts of the context are more important and re-feed them to the model or highlight them somehow.
Just like you I don't edit much in these files on my own. Mostly just ask the model to update an md file whenever I think we've figured out something new, so the learning sticks. I have files for test writing, backend route writing, db migration writing, frontend component writing etc. Whenever a section gets too big to live in agents.md it gets its own file.
But think of your dad or grandma using a generic agent, and simply selecting that they want to have certain skills available to it. Don't even think of it as a chat interface. This is just some option that they set in their phone assistant app. Or, rather, it may be that they actually selected "Determine the best skills based on context", and the assistant has "skill packs" which it periodically determines it needs to enable based on key moments in the conversation or latest interactions.
These are all workarounds for the problems of learning, memory...and, ultimately, limited context. But they for sure will be extremely useful.
I have mine in a GitHub template so I can even use them in Claude Code for the web. And synchronise them across my various machines (which is about 6 machines atm).
Now SKILL.md can have references to more fine-grained behaviors or capabilities of our skill. My skills generally tend to have a references/{workflows,tools,standards,testing-guide,routing,api-integration}.md. These references are what then gets "progressively loaded" into the context.
Say I asked claude to use the wireframe-skill to create profileView mockup. While creating the wireframe, claude will need to figure out what API endpoints are available/relevant for the profileView and the response types etc. It's at this point that claude reads the references/api-integration.md file from the wireframe skill.
After a while I found I didn't like the progressive loading so I usually direct claude to load all references in the skill before proceeding - this usually takes up maybe 20k to 30k tokens, but the accuracy and precision (imagined or otherwise ha!) is worth it for my use cases.
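The difference between progressive loading and loading all references up front, as a small sketch. The `references/` directory name follows the comments above; the `load_references` helper is hypothetical and only models which files end up in the context, not how an agent decides to read them.

```python
from pathlib import Path


def load_references(skill_dir: Path, wanted=None) -> dict[str, str]:
    """Return {filename: text} for a skill's reference files.

    wanted=None reads every .md file (load everything up front);
    otherwise only the named files are read (progressive loading).
    """
    ref_dir = skill_dir / "references"
    if wanted is None:
        paths = sorted(ref_dir.glob("*.md"))
    else:
        paths = [ref_dir / name for name in wanted]
    return {p.name: p.read_text(encoding="utf-8") for p in paths}
```

`load_references(skill, ["api-integration.md"])` corresponds to the profileView case above, while `load_references(skill)` is the load-everything variant that costs more tokens.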
You shouldn't do this, it's generally considered bad practice.
You should be optimizing your skill description. Oftentimes if I am working with Claude Code and it doesn't load a skill, I ask it why it missed the skill. It will guide me to improving the skill description so that it is picked up properly next time.
This iteration on skill description has allowed skills to stay out of context until they are needed rather predictably for me so far.
So when it's time to commit, make sure you run these checks, write a good commit message, etc.
Debugging is especially useful since AI agents can often go off the rails and go into loops rewriting code - so it's in a skill I can push for: "Read the log messages. Insert some more useful debug assertions to isolate the failure. Write some more unit tests that are more specific." Etc.
Caveat: needs mac to run
Bonus: it runs it locally in a container, not on cloud nor directly on mac
1. Open-Skills: https://GitHub.com/BandarLabs/open-skills
Services can provide an MCP-like layer that provides semantic definitions of everything you can do with said service (API + docs).
Skills can then be built that combine some subset of the 3rd party interfaces, some bespoke code, etc. and then surface these more context-focused skills to the LLM/agent.
Couldn’t we just use APIs?
Yes, but not every API is documented in the same way. An “MCP-like” registry might be the right abstraction for 3rd parties to expose their services in a semantic-first way.
So you read about skills (prompt + scripts) to make this more repeatable and reduce time spent thinking. At that point there are two paths you can go down -- write the skill and prompt yourself for the agent to execute -- or better -- just tell the agent to write the skill and prompt and then you lightly edit it and commit it.
This may seem obvious to some, but I've seen engineers create skills from scratch because they have a mental model around skills being something that people must build for the agent, whereas IMO skills are you just bridging a productivity gap that the agent can't figure out itself (for now), which is instructing it to write tools to automate its own day to day tedium.
feels like the right layer of abstraction for remote APIs