The last six months in LLMs in five minutes

Posted by yakkomajuri 7 hours ago

The last six months in LLMs in five minutes (simonwillison.net)

350 points | 218 comments | page 2

ionwake 1 hour ago|

Why is there no talk about how the world is already run by AI by proxy? I.e. bureaucrats using ChatGPT to make their speeches, decisions, shopfront designs, etc. I just don't seem to read about this; instead it's more about this nebulous specific date in the future.

pr337h4m 3 hours ago||

Something that’s largely been ignored: DeepSeek has made context caching virtually free with V4-Flash.

grey-area 4 hours ago||

Haven’t noticed much significant progress in LLMs myself in 6 months (significant as in new or vastly improved capabilities or understanding, not new releases, there are plenty of those).

I feel like if anything people started to realise the significant limitations of LLMs when you try to use them as ‘agents’ which was the big direction LLM companies tried to push recently.

Best use of LLMs so far IMO is finding vulnerabilities (with human help) and pattern matching in other domains. For generating code and prose they are still mediocre and somewhat unreliable and for use as personal assistant agents I wouldn’t trust them.

So what’s happening with openclaw, the biggest experiment in agentic, vibe coded by the agents themselves? The thing that was so hot a few months ago.

https://github.com/openclaw/openclaw/pulse?period=daily

279 commits to main from 77 authors in the last 24 hours.

Why is there so much churn and how could you trust it with your data? These are changes from ONE day!

If these are useful changes, surely it’d be superhuman by now given months of this pace.

What are people using this for?

inglor_cz 1 hour ago||

"there’s zero chance any AI lab would train a model for such a ridiculous task"

Hmm, given how small the nerd community is and how often I met that task either on Hacker News, or on various Substacks, I am not so sure that the AI labs would ignore it completely.

ramon156 2 hours ago||

> Google released the Gemma 4 series of models, which are the most capable open weight models I’ve seen from a US company.

Implying another country has a better model? I'm being pokey here because I'm very curious! I know Gemma is efficient, but I also remember Qwen and Kiwi being referred to as optimized. The difference being that Gemma uses fewer tokens, but maybe Qwen/Kiwi's quality is higher? I don't know.

pferdone 2 hours ago||

The consensus right now is that Qwen3.6 in its 27B and 35B-A3B versions is better for coding, whereas Gemma4 is stronger when it comes to OCR, audio transcription and the like. Margins are slim though, and the harness is the most important factor at these model sizes.

0xCMP 2 hours ago||

In my experience the qwen models are best locally, but gemma ones have always been good. gemma4 is a notable improvement.

LarsDu88 2 hours ago||

My goal post for "AI will definitely replace most SWEs" was to reproduce a particular 90s programming game one shot and then add multiplayer support with minimal prompting.

Opus 4.5 hit that point in November.

grey-area 2 hours ago|

I tried this a while ago, haven’t tried again recently. The models were producing code that was clearly lifted from stuff in their training data, and what I ended up with was a fairly decent game in html and js after a bit of tidy up, though it felt like several code paradigms smooshed together rather than a coherent whole, but it mostly worked. Not something I’d want to maintain but it was impressive at the time.

They were able to one-shot famous games (like asteroid or pong), I suspect because they had been trained on multiple versions of that game. So like producing Harry Potter, with the right prompt it was able to produce a license stripped version of code it had seen. I tried another arcade game like frogger and it failed really badly and took a lot longer, never got it working.

The whole exercise left me feeling they have a long way to go, I don’t see how anyone could think they would replace SWE unless they didn’t look at the code produced, even now.

vishal_new 4 hours ago||

What are your thoughts on software engineer replacement? My team has already seen big reductions. The QA team is gone. Software engineers reduced by a third. Scared for the future.

simonw 3 hours ago||

Ditching the QA team when the single highest challenge is verifying that vibe-coded systems do what they're meant to is extraordinarily short-sighted.

Personally, the more time I spend working with coding agents the less worried I am for my career. Getting the best results out of them is really hard. They amplify existing skills and experience, so the more experience you have the better.

kenloef 3 hours ago|||

I believe that many of those saying that they "never write code anymore" or are experiencing "10x productivity," are heavily underestimating (or outright misrepresenting) how much they are guiding the model, and ignoring everything else that goes into shipping fit for purpose software. I frequently see zero measurements or factual arguments supplied to support such claims. I also see many people say that they are "vibe coding," when they are almost certainly reviewing, editing, or otherwise steering the output.

I wonder why there is such a mad dash to trump up the capabilities of coding agents. And why such loose terminology and lack of rigor? I thought programmers were supposed to be rational people (har har!)

koonsolo 3 hours ago|||

Have you seen the automated tests that QA members deliver? My experience is that they are horrible, and it's not so hard to beat that low quality bar with an LLM.

I have a theory: if they were good at writing automated tests, they would have been developers instead of QA engineers.

Not saying that there aren't any high quality QA engineers, I worked with some. But LLMs have raised the bar in a way that most QA engineers can't reach.

Mashimo 2 hours ago||

Huh, never thought about QA writing unit tests.

In my limited experience they write test cases, test each story, do regression test, verify bugs from customers. All by hand.

At my current job I don't want to miss them.

ShinyLeftPad 4 hours ago|||

If you're famous, you'll be fine. If you're in retiring age, you don't care. Otherwise, good luck! We put ourselves on the street by not protesting what is happening.

vanuatu 3 hours ago||

I think there will be larger markets, more companies, more jobs than before due to AI, but also a very painful transition period

AI reduces the cost of producing software (and other intellectual tasks), which greatly improves the viability of more and more ambitious projects. As far as we know, the number of problems software (and humanity) can solve is unbounded

It feels like the market has shifted in SWE yet again to heavily prioritize a new set of skills, of which those in the top quartile are desired more than ever

asdff 2 hours ago|||

The problems in any domain are infinite. But, alas, money is not.

trojans1290 3 hours ago|||

What are these skills?

stuxnet79 2 hours ago|||

This is the magic question that I'm very eager to hear the answer to.

Fundamentally, steering LLMs requires the same structured, logical thought process that is required to write code, regardless of abstraction level. Unlike what HN would have you believe this is not a skill that is equally distributed across the population.

But given the rapid pace at which this technology is evolving, "steering" may very well be ceded to the clankers. LLM agents are fantastic at logical reasoning & any inefficiencies relative to human experts can be circumvented by sheer compute.

koonsolo 2 hours ago|||

Being able to work with an infinite amount of dumb interns that work super fast and have a vast amount of knowledge.

rTX5CMRXIfFG 5 hours ago||

Am I crazy, or are these differences between the best models so marginal that you’d get roughly the same performance if you use the same high-quality harness (ie preloaded instructions from md files, including custom skills)?

bluegatty 5 hours ago||

You will immediately notice the difference if you use it at the threshold.

It's like most people just watching a 'starting NBA player' (not a superstar, just a starting player) vs one that sits on the bench.

If you were to just watch them play, work out, shoot - you'd never notice the difference.

Put them head to head and it's 98-54 and you start to see the patterns.

It's pretty interesting actually, someone tell me what the 'science' for this is, I'm sure there is some kind of information theory at work here.

Software has innumerable kinds of problems at varying levels of complexity, and so it provides the perfect testbed for seeing how far models can go in practice.

Should add: you're very right to hint that harness, tooling, and models tuned to both the harness and the kinds of things people do on the harness, as well as some other things, do make an enormous difference.

By and large, SOTA Codex/Claude Code are substantially better - at least for now. That may change.

dnnddidiej 4 hours ago|||

Head to head is interesting. I had not tried 2 agents on the same task simultaneously with 2 models.

nothinkjustai 2 hours ago|||

[flagged]

Hfuffzehn 2 hours ago|||

You have correctly identified that getting a "high-quality harness (ie preloaded instructions from md files, including custom skills)" is the (or at least a) hard part.

Because you have to adjust the harness to your problem space and provide that so you can say it is high-quality.

Many people will stop that discussion at the claude code vs. codex vs. opencode level and then merge that with discussing model performance.

And that is also why "Generate an SVG of a pelican riding a bicycle" is still a benchmark worth discussing. Because at least it is a defined problem space.

Sparkyte 5 hours ago|||

No, you're not wrong. Many people will see what you see. Enthusiasts will see squeezing out that last drop of performance as monumental. I think it is okay for enthusiasts to feel that way. I'm just satisfied with getting a tool as an aid.

Personal opinion: we need to focus more on efficiency instead of how large or complex a model can get, as that model creeps into more resource requirements. If the goal is to cost a billion dollars to operate, then we've really lost the idea of what models are supposed to be achieving.

minimaxir 5 hours ago|||

To an extent. I've had GPT 5.5 solve problems that Opus 4.7 struggled with, using an identical AGENTS.md/CLAUDE.md and no skills.

nl 4 hours ago|||

The difference is very noticeable as your codebase gets bigger and you give higher and higher level tasks.

I've certainly had things that Opus fixed using some kind of workaround that GPT-5.5 actually solved.

And the difference between the Sonnet/Gemini/DeepSeek tier and the Opus/GPT-5.5 tier is immediately obvious.

raincole 4 hours ago||

By definition the differences between "best models" are small. It's a tautology. If a model is significantly dumber than the others then it's not one of the best models.

dnnddidiej 4 hours ago||

Also LinkedIn wars of people trying to claim throne as most AI-pilled, throwing down strawmen stories of luddites yelling at data centres who'll lose their job to a single person doing 100x work.

bunzee 3 hours ago|

Spot on. Building our tool, we found AI is magic at scraping competitor data, but terrible at market validation. The 'why' is strictly human.

2001zhaozhao 3 hours ago||

So, the best way to use LLMs is to wait for your competitors to do market validation and then scrape their data.

Hmmm......

tardedmeme 1 hour ago||

It's always been much easier to copy an existing product than to make a new one nobody's thought of before.

kkarpkkarp 3 hours ago||

Sorry, but how does this comment relate to the post being commented on?
