Posted by simonw 12/31/2025

2025: The Year in LLMs (simonwillison.net)
940 points | 599 comments | page 3
lopatin 1/1/2026|
The "pelicans on a bike" challenge is pretty wide spread now. Are we sure it's still not being trained on?
simonw 1/1/2026|
See https://simonwillison.net/2025/nov/13/training-for-pelicans-... (also in the pelicans section of the post).
lopatin 1/1/2026||
> All I’ve ever wanted from life is a genuinely great SVG vector illustration of a pelican riding a bicycle.

:)

lukaslalinsky 1/1/2026||
Speaking of asynchronous agents, what do people use? Claude Code for web is extremely limited, because you have no custom tools. Claude Code in GitHub Actions is vastly more useful, due to the custom environment, but awkward to use interactively. Are there any good alternatives?
simonw 1/1/2026||
I use Claude Code for web with an environment allowing full internet access, which means it can install extra tools as and when it needs them. I don't run into limits with it very often.
fullstackchris 1/1/2026|||
I just use a couple of custom MCP tools with the standard Claude desktop app:

https://chrisfrew.in/blog/two-of-my-favorite-mcp-tools-i-use...

IMO this is the best balance of getting agentic work done while having immediate access to anything else you may need in your development process.
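
For anyone curious, a minimal custom tool with the official MCP Python SDK looks roughly like this - a sketch, and the run_tests tool is a hypothetical example rather than one of the tools from my post:

    # minimal custom MCP tool server (pip install "mcp[cli]");
    # the run_tests tool here is a hypothetical example
    import subprocess

    from mcp.server.fastmcp import FastMCP

    mcp = FastMCP("my-dev-tools")

    @mcp.tool()
    def run_tests(path: str = ".") -> str:
        """Run the project's pytest suite and return the combined output."""
        result = subprocess.run(["pytest", path], capture_output=True, text=True)
        return result.stdout + result.stderr

    if __name__ == "__main__":
        mcp.run()  # stdio transport, which Claude Desktop connects to

Register it in claude_desktop_config.json and the desktop app can call it like any built-in tool.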

ehsanu1 1/1/2026|||
What exactly do you mean by custom tools here? Just CLI tools accessible to the agent?
lukaslalinsky 1/1/2026||
Development environment needed to build and test the project.
jes5199 1/1/2026|||
I'm running Claude Code in tmux on a VPS, and I'm working on setting up a meta-agent that can talk to me over text messages.
absoluteunit1 1/1/2026||
Hey - this sounds like a really interesting setup!

Would you be open to providing more details? Would love to hear more about your workflows, etc.

jimmySixDOF 1/1/2026||
Pretty sure next year's wrap-up will have "Year of the sub-agent".
npalli 1/1/2026||
Great summary of the year in LLMs. Is there a predictions (for 2026) blogpost as well?
simonw 1/1/2026||
Given how badly my 2025 predictions aged, I'm probably going to sit that one out! https://simonwillison.net/2025/Jan/10/ai-predictions/
zahlman 1/1/2026|||
Making predictions is useful even when they turn out very wrong. Consider also giving confidence levels, so that you can calibrate going forward.
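One simple way to calibrate: record a confidence for each prediction and compute a Brier score at year end. A minimal sketch in Python, with placeholder predictions:

    # track a confidence per prediction, then compute the Brier score:
    # the mean squared gap between confidence and outcome (lower is
    # better; always guessing 50% scores 0.25); entries are placeholders
    predictions = [
        ("prediction A comes true", 0.8, True),
        ("prediction B comes true", 0.6, False),
        ("prediction C comes true", 0.9, True),
    ]

    brier = sum(
        (conf - (1.0 if outcome else 0.0)) ** 2
        for _, conf, outcome in predictions
    ) / len(predictions)
    print(f"Brier score: {brier:.3f}")
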
jjude 1/1/2026||
I use predictions to prepare rather than to plan.

Planning depends on a deterministic view of the future. I used to plan (esp. annual plans) until about 5 years ago. Now I scan for trends and prepare myself for different scenarios that could unfold. Even if you get it approximately right, you stand apart.

For tech trends, I read Simon, Benedict Evans, Mary Meeker, etc. Simon is in a better position to make these predictions than anyone else, having closely analyzed these trends over the last few years.

Here I wrote about my approach: https://www.jjude.com/shape-the-future/

DANmode 1/1/2026|||
Don’t be a bad sport, now!!
websiteapi 1/1/2026||
I'm curious how all of the progress will be seen if it does indeed result in mass unemployment (but not eradication) of professional software engineers.
ori_b 1/1/2026||
My prediction: If we can successfully get rid of most software engineers, we can get rid of most knowledge work. Given the state of robotics, manual labor is likely to outlive intellectual labor.
BobbyJo 1/1/2026|||
I would have agreed with this a few months ago, but something I've learned is that the ability to verify an LLM's output is paramount to its value. In software, you can review its output and add tests, on top of other adversarial techniques, to verify the output immediately after generation.

With most other knowledge work, I don't think that is the case. Maybe actuarial or accounting work, but most knowledge work exists at a cross-section of function and taste, and the latter isn't an automatically verifiable output.

throw1235435 1/1/2026||
I also believe this - I think it will probably just disrupt software engineering and any other digital medium with mass internet publication (i.e. things RLVR can use). For the short-term future it seems to need a lot of data to train on, and no other profession has posted the same amount of verifiable material. Open source altruism has disrupted the profession in the end, just not in the way people first predicted. I don't think it will disrupt most knowledge work, for a number of reasons. Most knowledge professions have "credentials" (i.e. gatekeeping), they can see what is happening to SWEs, and they are acting accordingly. I'm hearing it firsthand, at least locally, in things like law, even accounting, etc. Society will ironically respect these professions more for doing so.

Any data, verifiability, rules of thumb, tests, etc. are being kept secret. You pay for the result, but don't know the means.

coffeebeqn 1/1/2026||
I mean, law and accounting usually have a "right" answer that you can verify against. I can see a test data set being built for most professions. I'm sure open source helps with programming data, but I doubt that's even the majority of their training. If you have a company like Google, you could collect data on decades of software work in all its dimensions from your workforce.
District5524 1/1/2026|||
It's not about invalidating your conclusion, but I'm not so sure about law having a right answer. At a very basic level - like the hypothetical conduct used in basic legal training materials or MCQs, or in criminal/civil code based situations in well-abstracting Roman law-based jurisdictions - definitely. But the actual work, at least for most lawyers, is to build on many layers of such abstractions to support your/the client's viewpoint. And that level is already about persuasion of other people, not having the "right" legal argument or applying the most correct case found. And this part is not documented well, and approaches change a lot, even if the law remains the same. Think of family law or the law of succession - it does not change much over centuries, yet every day, worldwide, millions of people spend huge amounts of money and energy on finding novel ways to turn those same paragraphs to their advantage and put their "loved" ones and relatives in a worse position.
throw1235435 1/1/2026|||
Not really. I used to think more generally with the first generation of LLMs, but given that all progress since o1 is RL-based, I'm thinking most disruption will happen in open productive domains, not closed domains. Speaking to people in these professions, they don't think SWEs have any self-respect, and so in your example of law:

* Context is debatable/result isn't always clear: The way to interpret that/argue your case is different (i.e. you are paying for a service, not a product)

* Access to vast training data: It's very unlikely that they will train you and give you the data of their practice, especially as they are already in a union-like structure/accreditation. It's like paying for a binary (a non-decompilable one) without source code (the result), rather than the source and the validation the practitioner used to get there.

* Variability of real world actors: There will be novel interpretations that invalidate the previous one as new context comes along.

* Velocity vs ability to make judgement: As a lawyer I prefer to be paid more for less velocity, since it means less judgement/less liability/less risk overall for myself and the industry. Why would I change that, even at an individual level? Less of a commons problem here.

* Tolerance to failure is low: You can't iterate, get feedback and try again until "the tests pass" in a courtroom, unlike with "code in a text file". You need to have the right argument the first time. AI/ML generally only works where the end cost of failure is low (i.e. you can try again and again to iron out error terms/hallucinations). It's also why I'm skeptical AI will do much in the real economy even with robots soon - failure has bigger consequences in the real world ($$$, lives, etc.).

* Self-employment: There is no tension between, say, Google shareholders and its employees, as per your example - especially for professions where you must trade in your own name. Why would I disrupt myself? The cost I charge is my profit.

TL;DR: Gatekeeping, changing context, and arms-race behavior between participants/clients. Unfortunately I do think software, art, videos, translation, etc. are unique in that there are numerous examples online and they have the property "if I don't like it, just re-roll" -> to me RLVR isn't that efficient - it needs volumes of data to build its view. Software, sadly for us SWEs, is the perfect domain for this; and we as practitioners made it that way through things like open source, TDD, etc., and by giving it away free on public platforms in huge quantities.

beardedwizard 1/1/2026||||
"Given the state of robotics" reminds me a lot of what was said about llms and image/video models over the past 3 years. Considering how much llms improved, how long can robotics be in this state?

I have to think 3 years from now we will be having the same conversation about robots doing real physical labor.

"This is the worst they will ever be" feels more apt.

chii 1/1/2026|||
but robotics already has the means to do the majority of physical labour - it's just not worth the money to replace humans, as human labour is cheap (and more flexible than robots).

With knowledge work paying less, the supply of physical labour should increase as well, which drops its price. This means it's actually less likely that the advent of LLMs will make physical labour more automated.

Davidzheng 1/1/2026||||
Robotics is coming FAST. Faster than LLM progress in my opinion.
wh0knows 1/1/2026|||
Curious if you have any links about the rapid progression of robotics (as someone who is not educated on the topic).

It was my feeling with robotics that the more challenging aspect will be making them economically viable, rather than simply the difficulty of the task itself.

beardedwizard 1/1/2026||
I mentioned the military in my reply to the sibling comment - that is the most ready example. What Anduril and others are doing today may be sloppy, but it's moving very quickly.
throw1235435 1/1/2026|||
The question is how rapid the adoption will be. The price of failure in the real world is much higher ($$$, environmental, physical risks) vs. just "rebuild/regenerate" in the digital realm.
beardedwizard 1/1/2026||
Military adoption is probably a decent proxy indicator - and they are ready to hand the kill switch to autonomous robots.
throw1235435 1/1/2026||
Maybe. There, again, the cost of failure is low. It's easier to destroy than to create. Economic disruption to workers will take a bit longer, I think.

Don't get me wrong; I hope that we do see it in physical work as well. There is more value to society there, and it consists of work that is risky and/or hard to do - and usually needed (food, shelter, etc.). It also means that the disruption is an "everyone" problem rather than something that just affects those "intellectual" types.

9dev 1/1/2026||||
That's the deep irony of technology, IMHO: innovation follows Conway's law on a meta layer. White-collar workers inevitably shaped high technology after themselves, and instead of finally ridding humanity of hard physical labour - as was the promise of the Industrial Revolution - we built technology that imitates artists, scientists, and knowledge workers.

We can now use natural language to instruct computers to generate stock photos and illustrations that would have required a professional artist a few years ago, discover new molecule shapes, beat the best Go players, build the code for entire applications, or write documents of various shapes and lengths - but painting a wall? An insurmountable task that requires a human to execute reliably, not even talking about the economics.

JumpCrisscross 1/1/2026|||
> If we can successfully get rid of most software engineers, we can get rid of most knowledge work

Software, by its nature, is practically comprehensively digitized, both in its code history and in its requirements.

simonw 1/1/2026|||
I nearly added a section about that. I wanted to contrast the thing where many companies are reducing junior engineering hires with the thing where Cloudflare and Shopify are hiring 1,000+ interns. I ran out of time and hadn't figured out a good way to frame it though so I dropped it.
legulere 1/1/2026|||
Even if it makes software engineering drastically more productive, it's questionable whether this will lead to unemployment. Efficiency gains translate to lower prices. Sometimes this leads to very little additional demand, as can be seen with the masses of typesetters who lost their jobs. Sometimes it leads to dramatically higher demand, as in the classic Jevons paradox examples of coal and light bulbs. I highly suspect software falls into the latter category.
kingstnap 1/1/2026||
Software demand is philosophically limited by the question of "What can your computer do for you?"

You can describe that somewhat formally as:

{What your computer can do} intersect {What you want done (consciously or otherwise)}

Well, a computer can technically calculate any computable task that fits in bounded memory. That is an enormous set, so its real limitations are its interfaces - in which case it can send packets, make noises, and display images.

How many human desires are things that can be solved by making noises, displaying images, and sending packets? Turns out quite a few, but it's not everything.
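
In toy Python terms (the sets here are invented purely for illustration):

    # the framing above as literal set intersection, with made-up members
    computer_can_do = {"send packets", "display images", "make noises"}
    you_want_done = {"display images", "make noises", "paint a wall"}
    print(you_want_done & computer_can_do)  # the demand software can serve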

Basically I'm saying we should hope more sorts of physical interfaces come around (like VR and robotics) so we cover more human desires. Robotics is a really general physical interface (like how IP packets are an extremely general interface), so it's pretty promising if it pans out.

Personally, I find it very hard to even articulate what desires I have. I have this feeling that I might be substantially happier just sitting around a campfire, eating food and chatting with people, instead of enjoying whatever infinite stuff a superintelligent computer and robots could do for me. At least some of the time.

Madmallard 1/1/2026|||
Why would it?

The ability to accurately describe what you want, with all constraints managed and with proactive design, is the actual skill. Not programming. The day PMs can do that, and have LLMs that can code to it, is the day software engineers will disappear en masse. But that day is likely never coming.

The non-technical people I've worked for were hopelessly terrible at attention to detail. They're hiring me primarily for that anyway.

fullstackchris 1/1/2026||
This overly discussed thesis is already laughable - decent LLMs have been out for 3 years now and unemployment (using the US as an example) is up around 1% over the same time frame - and attributing even that small percentage change completely to AI is also laughable.
vanderZwan 1/1/2026||
Speaking of the new year and AI: my phone just suggested "Happy Birthday!" as the quick reply to every "Happy New Year!" notification I got in the last few hours.

I'm not too worried about my job just yet.

gverrilla 1/1/2026||
This year I had a Spotify and a YouTube thing to "recall my year", and it was absolute garbage (30% truth, to be exact). I think they're doing it more as an exercise to build up systems, infra, processes, people, etc. - it's already clear they don't actually care about users.
pants2 1/1/2026||
It won't help to point out the worst examples. You're not competing with an outdated Apple LLM running on a phone. You're competing with Anthropic's frontier models running on a multimillion-dollar rack of servers.
vanderZwan 1/1/2026|||
Sounds like I'm much more affordable with better ROI
sanreau 1/1/2026||
> Vendor-independent options include GitHub Copilot CLI, Amp, OpenHands CLI, and Pi

...and the best of them all, OpenCode[1] :)

[1]: https://opencode.ai

simonw 1/1/2026||
Good call, I'll add that. I think I mentally scrambled it with OpenHands.
the_mitsuhiko 1/1/2026||
Thanks for adding pi to it though :)
d4rkp4ttern 1/1/2026|||
Can OpenCode be used with the Claude Max or ChatGPT Pro subscriptions, i.e., without per-token API charges?
simonw 1/1/2026|||
Apparently it does work with Claude Max: https://opencode.ai/docs/providers/#anthropic

I don't see a similar option for ChatGPT Pro. Here's a closed issue: https://github.com/sst/opencode/issues/704

williamstein 1/1/2026||
There's a plugin that evidently supports ChatGPT Pro with Opencode: https://github.com/sst/opencode/issues/1686#issuecomment-349...
ewoodrich 1/1/2026|||
Yes, I use it with a regular Claude Pro subscription. It also supports using GitHub Copilot subscriptions as a backend.
logicprog 1/1/2026|||
I don't know why you're downvoted, OpenCode is by far the best.
nineteen999 1/1/2026||
How did I miss this until now? Thank you for sharing.
icapybara 1/1/2026||
It was the year of Claude Code
Gud 1/1/2026||
What about self hosting?
simonw 1/1/2026|
I talked about that in this section https://simonwillison.net/2025/Dec/31/the-year-in-llms/#the-... - and touched on it a bit in the section about Chinese AI labs: https://simonwillison.net/2025/Dec/31/the-year-in-llms/#the-...
_pdp_ 1/1/2026|
With everything that we have done so far (at our company), I believe that by the end of 2026 our software will be self-improving all the time.

And no, it is not AI slop, and we don't vibe code. There are a lot of practical aspects of running software and maintaining/improving code that can be done well with AI if you have the right setup. It is hard to formulate what "right" looks like at this stage, as we are still iterating on this as well.

However, in our own experiments we can clearly see dramatic increases in automation. I mean, we have agents working overnight as we sleep, and this is not even pushing the limits. We are now wrapping up major changes that will allow us to run AI agents all the time, as long as we can afford them.

I can even see most of these materialising in Q1 2026.

Fun times.

papacj657 1/1/2026|
What exactly are your agents doing overnight? I often hear folks talk about their agents running for long periods of time, but they rarely talk about the outcomes they're driving from those agents.
_pdp_ 1/1/2026||
We have a lot of grunt work scheduled overnight: finding bugs, creating tests where we don't have good coverage or where we can improve, integrations, documentation work, etc.

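To give a flavour of the shape of it - a simplified sketch, not our actual pipeline, assuming Claude Code's non-interactive -p mode and with invented task prompts:

    # nightly job (cron/systemd timer): run one headless agent per task
    # and stash the transcripts for human review in the morning
    import subprocess
    from datetime import date
    from pathlib import Path

    TASKS = [  # hypothetical prompts
        "Find one flaky test and fix it on a new branch.",
        "Add tests for modules with poor coverage.",
        "Update the API docs to match the current handlers.",
    ]

    outdir = Path(f"agent-runs/{date.today()}")
    outdir.mkdir(parents=True, exist_ok=True)

    for i, task in enumerate(TASKS):
        result = subprocess.run(
            ["claude", "-p", task], capture_output=True, text=True
        )
        (outdir / f"task-{i}.log").write_text(result.stdout + result.stderr)
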
Not everything gets accepted. There is a lot of work that is discarded and much more pending verification and acceptance.

Frankly, and I hope I don't come across as alarmist (judge for yourself from my previous comments on HN and Reddit), we cannot keep up with the output! And a lot of it is actually good, and we should incorporate it, even partially.

At the moment we are figuring out how to make things more autonomous while keeping safety and guardrails in place.

The biggest issue I see at this stage is how to make sense of it all, as I do not believe we really understand what is happening - we have just a general notion of it.

I truly believe that we will reach the point where ideas matter more than execution, which is what I would expect to be the case with more advanced and better-applied AI.
