Posted by simonw 12/31/2025

2025: The Year in LLMs (simonwillison.net)
940 points | 599 comments | page 2
the_mitsuhiko 1/1/2026|
> The (only?) year of MCP

I'd like to believe that, but MCP is quickly turning into an enterprise thing, so I think it will stick around for good.

MitziMoto 1/1/2026||
MCP isn't going anywhere. Some developers can't seem to see past their terminal or dev environment when it comes to MCP. Skills, etc. do not replace MCP, and MCP is far more than just documentation searching.

MCP is a great way for an LLM to connect to an external system in a standardized way and immediately understand what tools it has available, when and how to use them, what their inputs and outputs are, etc.

For example, we built a custom MCP server for our CRM. Now our voice and chat agents that run on ElevenLabs infrastructure can connect to our system with one endpoint, understand what actions they can take, and what information they need to collect from the user to perform those actions.

I guess this could maybe be done with webhooks or an API spec with a well-crafted prompt? Or if ElevenLabs provided an executable environment with tool calling? But at some point you're just reinventing a lot of the functionality you get for free from MCP, and all major LLMs seem to know how to use MCP already.
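The "one endpoint, understand what actions it can take" part comes from MCP's discovery step: a client sends a single JSON-RPC `tools/list` request and gets back machine-readable tool descriptions. A minimal sketch of that request (the CRM endpoint URL and tool names here are hypothetical, not from the comment above):

```shell
# Sketch of MCP tool discovery over the streamable-HTTP transport:
# one JSON-RPC "tools/list" call. The URL and tool below are made up.
cat > tools_list_request.json <<'EOF'
{"jsonrpc": "2.0", "id": 1, "method": "tools/list"}
EOF

# Against a real server you would POST it, roughly:
#   curl -s -X POST https://crm.example.com/mcp \
#     -H 'Content-Type: application/json' \
#     -d @tools_list_request.json
# and the (abridged) result lists each tool with a JSON Schema for its inputs:
#   {"tools": [{"name": "create_ticket",
#               "description": "Open a support ticket for a customer",
#               "inputSchema": {"type": "object",
#                               "properties": {"customer_id": {"type": "string"},
#                                              "summary": {"type": "string"}},
#                               "required": ["customer_id", "summary"]}}]}
grep -q '"method": "tools/list"' tools_list_request.json && echo "request written"
```

The `inputSchema` is what lets the agent know which fields it still needs to collect from the user before calling a tool.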

simonw 1/1/2026||
Yeah, I don't think I was particularly clear in that section.

I don't think MCP is going to go away, but I do think it's unlikely to ever achieve the level of excitement it had in early 2025 again.

If you're not building inside a code execution environment it's a very good option for plugging tools into LLMs, especially across different systems that support the same standard.

But code execution environments are so much more powerful and flexible!

I expect that once we come up with a robust, inexpensive way to run a little Bash environment - I'm still hoping WebAssembly gets us there - there will be much less reason to use MCP even outside of coding agent setups.

brabel 1/1/2026||
I disagree. MCP will remain the best way to do most things for the same reason REST APIs are the main way to access non local services: they provide a way to secure and audit access to systems in a way that a coding environment cannot. And you can authorize actions depending on the well defined inputs and outputs. You can’t do that using just a bash script unless said script actually does SSO and calls REST APIs but then you just have a worse MCP client without any interoperability.
the_mitsuhiko 1/1/2026||
I find it very hard to pick winners and losers in this environment where everything changes so quickly. Right now a lot of people are using bash as a glue environment for agents, even if they are not for developers.
simonw 1/1/2026|||
I think it will stick around, but I don't think it will have another year where it's the hot thing it was back in January through May.
Alex-Programs 1/1/2026||
I never quite got what was so "hot" about it. There seems to be an entire parallel ecosystem of corporates that are just begging to turn AI into PowerPoint slides so that they can mould it into a shape that's familiar.
9dev 1/1/2026||
One reason may be that it makes it a lot easier to open up a product to AI. Instead of adding a bad ChatGPT UI clone into your app, you invert control and let external AI tools interact with your application and its data, thus giving your customers immediate benefits while simultaneously sating your investors'/founders'/managers' desire to somehow add AI.
nrhrjrjrjtntbt 1/1/2026|||
MCP or skills? Can a skill negate the need for MCP? In addition, there's a YC startup looking at searching docs for LLMs, or something similar. I think MCP may be less needed once you have skills, OpenAPI specs, and other things that LLMs can call directly.
cloudking 1/1/2026||
For connecting agents to third-party systems I prefer CLI tools: less context bloat, and faster. You can define the CLI usage in your agent instructions. If the MCP server you're using doesn't exist as a CLI, build one with your agent.
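A minimal sketch of the kind of CLI wrapper this describes, written so an agent can be told "run `tickets <command>`" in its instructions. The `tickets` command and its subcommands are hypothetical; a real version would replace the stub `echo` lines with actual API calls:

```shell
#!/usr/bin/env sh
# Hypothetical CLI wrapper an agent could be pointed at instead of an MCP
# server. Subcommands map to fixed, known-good API operations.
tickets() {
  case "$1" in
    list)
      # A real implementation would call the backing API, e.g.:
      #   curl -s "$TICKETS_API/issues"
      echo "OPEN-1  Fix login bug"
      ;;
    show)
      echo "OPEN-$2  (details would be fetched here)"
      ;;
    *)
      echo "usage: tickets {list|show <id>}" >&2
      return 1
      ;;
  esac
}

tickets list
```

The usage string doubles as documentation: pasting it into the agent's instructions is usually enough for the model to call the tool correctly.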
martinald 1/1/2026||
Totally agree - wrote this over the holidays, which sums it all up pretty well https://martinalderson.com/posts/why-im-building-my-own-clis...
timonoko 1/1/2026||
OpenSCAD coding has improved significantly on all models. Now the syntax is always right and they understand the concept of negative space.

The only problem is that they don't see the connection between form and function. They may model a teapot perfectly but don't understand that the form is supposed to contain liquid.

fullstackchris 1/1/2026||
> The reason I think MCP may be a one-year wonder is the stratospheric growth of coding agents. It appears that the best possible tool for any situation is Bash—if your agent can run arbitrary shell commands, it can do anything that can be done by typing commands into a terminal.

I push back strongly against this. In the case of the solo, one-machine coder, this is likely true. But if you're exposing workflows or fixed tools to customers/colleagues/the web at large via an API or similar, then MCP is still the best way to expose them, IMO.

Think about a GitHub or Jira MCP server - via the command line alone, LLMs are sure to make mistakes with REST requests, API schemas, etc. With MCP the proper known commands are already baked in. Remember always that LLMs will be better with natural language than code.

simonw 1/1/2026|
The solution to that is Anthropic's Skills.

Create a folder called skills/how-to-use-jira

Add several Bash scripts with the right curl commands to perform specific actions

Add a SKILL.md file with some instructions on how to use those scripts

You've effectively flattened that MCP server into some Markdown and Bash, only the thing you have now is more flexible (the coding agent can adapt those examples to cover new things you hadn't thought to tell it) and much more context-efficient (it only reads the Markdown the first time you ask it to do something with JIRA).
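The three steps above can be sketched concretely. Everything here is illustrative: the directory layout follows the comment, but the script name, the `JIRA_URL`/`JIRA_TOKEN` variables, and the exact curl invocation are assumptions, not a canonical skill format:

```shell
# Sketch of flattening a Jira MCP server into Markdown + Bash,
# per the steps described above. Names and env vars are hypothetical.
mkdir -p skills/how-to-use-jira

cat > skills/how-to-use-jira/create-issue.sh <<'EOF'
#!/bin/sh
# Create a Jira issue: create-issue.sh PROJECT "Summary text"
# Assumes JIRA_URL and JIRA_TOKEN are set in the environment.
curl -s -X POST "$JIRA_URL/rest/api/2/issue" \
  -H "Authorization: Bearer $JIRA_TOKEN" \
  -H "Content-Type: application/json" \
  -d "{\"fields\": {\"project\": {\"key\": \"$1\"},
                    \"summary\": \"$2\",
                    \"issuetype\": {\"name\": \"Task\"}}}"
EOF
chmod +x skills/how-to-use-jira/create-issue.sh

cat > skills/how-to-use-jira/SKILL.md <<'EOF'
# How to use Jira
- To create an issue: `./create-issue.sh PROJECT "Summary"`
- Adapt the curl commands in these scripts for actions not covered here.
EOF
```

The context-efficiency point falls out of the layout: the agent only reads SKILL.md (a few lines) on first use, rather than loading every tool definition up front the way an MCP `tools/list` response does.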

aflukasz 1/1/2026||
But that moves the burden of maintenance from the provider of the service to its users (and/or partially to an intermediary in the form of a "skills registry" of sorts, which apparently is a thing now).

So maybe a hybrid approach would make more sense? Something like /.well-known/skills/README.md exposed and owned by the providers?

That is assuming that the whole idea of "skills" makes sense in practice.

simonw 1/1/2026||
Yeah that's true, skill distribution isn't a solved problem yet - MCPs have a URL, which is a great way of making them available for people to start using without extra steps.
rr808 1/1/2026||
What happened to Devin? 2024 it was a leading contender now it isn't even included in the big list of coding agents.
fullstackchris 1/1/2026||
Wasn't it basically revealed as a scam? I remember some article about their fancy demo video being sped up / unfairly cut and sliced etc.
simonw 1/1/2026|||
To be honest that's more because I've never tried it myself, so it isn't really on my radar.

I don't hear much buzz about it from the people I pay attention to. I should still give it a go though.

monkeydust 1/1/2026|||
https://cognition.ai/blog/devin-annual-performance-review-20...
ColinEberhardt 1/1/2026||
It’s still around, and tends to be adopted by big enterprises. It’s generally a decent product, but is facing a lot of equally powerful competition and is very expensive.
zerocool86 1/2/2026||
The "local models got good, but cloud models got even better" section nails the current paradox. Simon's observation that coding agents need reliable tool calling that local models can't yet deliver is accurate - but it frames the problem purely as a capability gap.

There's a philosophical angle being missed: do we actually want our coding agents making hundreds of tool calls through someone else's infrastructure? The more capable these systems become, the more intimate access they have to our codebases, credentials, and workflows. Every token of context we send to a frontier model is data we've permanently given up control of.

I've been working on something addressing this directly - LocalGhost.ai (https://www.localghost.ai/manifesto) - hardware designed around the premise that "sovereign AI" isn't just about capability parity but about the principle that your AI should be yours. The manifesto articulates why I think this matters beyond the technical arguments.

Simon mentions his next laptop will have 128GB RAM hoping 2026 models close the gap. I'm betting we'll need purpose-built local inference hardware that treats privacy as a first-class constraint, not an afterthought. The YOLO mode section and "normalization of deviance" concerns only strengthen this case - running agents in insecure ways becomes less terrifying when "insecure" means "my local machine" rather than "the cloud plus whoever's listening."

The capability gap will close. The trust gap won't unless we build for it.

apolloartemis 1/1/2026||
Thank you for your warning about the normalization of deviance. Do you think there will be an AI agent software worm like NotPetya which will cause a lot of economic damage?
simonw 1/1/2026|
I'm expecting something like a malicious prompt injection which steals API keys and crypto wallets and uses additional tricks to spread itself further.

Or targeted prompt injections - like spear phishing attacks - against people with elevated privileges (think root sysadmins) who are known to be using coding agents.

losvedir 1/1/2026||
I predict 2026 will be the year of the first AI Agent "worm" (or virus?). Kind of like the Morris worm running amok as an experiment gone wrong, I think we will sometime soon have someone set up an AI agent whose core loop is to try to propagate itself, either as an experiment or just for the lulz.

The actual Agent payload would be very small, likely just a few hundred line harness plus system prompt. It's just a question of whether the agent will be skilled enough to find vulnerabilities to propagate. The interesting thing about an AI worm is that it can use different tricks on different hosts as it explores its own environment.

If a pure agent worm isn't capable enough, I could see someone embedding it on top of a more traditional virus. The normal virus would propagate as usual, but it would also run an agent to explore the system for things to extract or attack, and to find easy additional targets on the same internal network.

A main difference here is that the agents have to call out to a big SotA model somewhere. I imagine the first worm will simply use Opus or ChatGPT with an acquired key, and part of it will be trying to identify (or generate) new keys as it spreads.

Ultimately, I think this worm will be shut down by the model vendor, but it will have to have made a big enough splash beforehand to catch their attention and create a team to identify and block keys making certain kinds of requests.

I'd hope OpenAI, Anthropic, etc have a team and process in place already to identify suspicious keys, eg, those used from a huge variety of IPs, but I wouldn't be surprised if this were low on their list of priorities (until something like this hits).

agentifysh 1/1/2026||
What amazing progress in such a short time. The future is bright! Happy New Year, y'all!
aussieguy1234 1/1/2026||
> The year of YOLO and the Normalization of Deviance #

On this topic, including AI agents deleting home folders: I was able to run agents in Firejail by isolating vscode (most of my agents are vscode-based ones, like Kilo Code).

I wrote a little guide on how I did it https://softwareengineeringstandard.com/2025/12/15/ai-agents...

It took a bit of tweaking (vscode crashed a bunch of times, unable to read its config files), but I got there in the end. Now it can only write to my projects folder. All of my projects are backed up in git.
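The "writable projects folder, read-only everything else" setup can be expressed as a Firejail profile. The exact options the commenter used aren't shown, so this is an assumption-laden sketch; the profile name and the extra read-write line for vscode's config directory (which addresses the crash described above) are illustrative:

```shell
# Sketch of a Firejail profile in the spirit of the setup described above.
# Written to a file here; launching requires firejail installed.
cat > vscode-agent.profile <<'EOF'
# Home is read-only except the projects folder...
read-only ${HOME}
read-write ${HOME}/projects
# ...and vscode's own config dir, so it can start without crashing.
read-write ${HOME}/.config/Code
# Drop extra privileges inside the sandbox.
caps.drop all
noroot
EOF

# Launch command (commented out; needs firejail):
#   firejail --profile=vscode-agent.profile code
```

`read-only`/`read-write`, `caps.drop all`, and `noroot` are standard Firejail profile directives; a stricter variant could add `net none` to cut network access, though most coding agents need the network to reach their model API.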

NitpickLawyer 1/1/2026|
I have a bunch of tabs open on this exact topic, so thank you for sharing. So far I've been using devcontainers with vscode, and mostly having a blast with it. It is a bit awkward since some extensions need to be installed in the remote env, but they seem to play nicely after you have it set up, and the keys and stuff get populated, so things like kilocode, cline, and roo work fine.
mark_l_watson 1/1/2026|
Thanks Simon, great writeup.

It has been an amazing year, especially around tooling (search, code analysis, etc.) and surprisingly capable smaller models.
