The last six months in LLMs in five minutes

Posted by yakkomajuri 8 hours ago

The last six months in LLMs in five minutes (simonwillison.net)

400 points | 266 comments | page 3

hansmayer 56 minutes ago

TL;DR:

"Coding agents got really good - here, a bunch of irrelevant slop-pictures of pelicans riding bikes as a key benchmark AND a couple of hardly relevant edge-case demo-projects of mine to prove it right!"

Come on man, where is the AI writing all the code in 6 months? We're close to June, and Amodei's latest statement from January does not look like it will be fulfilled over the next few weeks, does it now?

inglor_cz 1 hour ago

"there’s zero chance any AI lab would train a model for such a ridiculous task"

Hmm, given how small the nerd community is and how often I have come across that task, either on Hacker News or on various Substacks, I am not so sure that the AI labs would ignore it completely.

bunzee 4 hours ago

Spot on. Building our tool, we found AI is magic at scraping competitor data, but terrible at market validation. The 'why' is strictly human.

2001zhaozhao 4 hours ago

So, the best way to use LLMs is to wait for your competitors to do market validation and then scrape their data.

Hmmm......

tardedmeme 2 hours ago

It's always been much easier to copy an existing product than to make a new one nobody's thought of before.

kkarpkkarp 3 hours ago

Sorry, but how does this comment relate to the post?

ex-aws-dude 5 hours ago

Is RLVR the key breakthrough for the uplift or is there more to it?

Does that suggest the uplift was only for things that are easily verifiable like code?

vanuatu 4 hours ago

Yes, with good RLVR at scale you can greatly improve performance, especially on benchmarks.

The hope was that good RLVR on relatively contrived datasets (like benchmarks) would generalize to good software taste, which has somewhat succeeded, though the models still fail in horrible ways.

And the hope beyond that is that good skills in fundamental problem-solving tasks (coding, math) would generalize to tasks beyond math and code, which did happen, but less so.

rdedev 5 hours ago

I would say that most improvements are in easily verifiable things like code or math. At least that's where all the amazing results seem to be coming from.

For other domains I am not sure, but I've heard from people like Cal Newport that the rate of improvement outside of code and math is not as impressive.

4b11b4 5 hours ago

We're gonna find out that RL will get abandoned cuz we don't even know what is getting "aligned". Just my naive gut feeling, don't take it seriously.

bradley13 2 hours ago

As someone who uses AI daily (not in agent mode, just user-interactive), I have definitely noticed major quality improvements over the past few months. And that's surprising, because when you use something daily, you tend to overlook the big jumps.

I haven't looked into any sort of "agent" mode, just because I don't yet quite trust the AI not to do something dumb. Also, I don't use M365, where Copilot is integrated, so I suppose I would have to set it up myself.

DeathArrow 5 hours ago

Apart from GLM 5.1 and Qwen 3.6, there are other Chinese models that are noteworthy: Kimi K2.6, Xiaomi MiMo V2.5 Pro, Deepseek v4 and MiniMax M2.7.

simonw 5 hours ago

100% true - I only had five minutes so I had to edit it down to just a couple, but all of those models are excellent and keep leap-frogging each other.

rahimnathwani 4 hours ago

Looking forward to next time, hoping you mention speculative decoding and MTP :)

It would support your point about the performance of 20GB local models.

bluegatty 5 hours ago

'Producing Images' or even 'Some Code that is Valid and Compiles' is in some ways one of the most misleading ways we assess quality of the AI.

It is getting very good at producing code that compiles - at the algorithmic level.

This is definitely noteworthy - and the AI is crossing a critical 'productivity threshold'.

But 'Drawing of a Proper Duck' is almost arbitrary because it may have nothing to do with the 'Specific Duck You Wanted'.

Everyone has tried to get AI to 'Draw The Thing They Want' and you notice immediately how it's almost impossible to 'adjust the image' along the vector you want - because ... and this is key:

-> the AI doesn't really understand what a Duck is, its components, or fully how it made the duck <-

It just knows how to 'incant' the duck.

This becomes very clear when you try to get the AI to write proper documentation - it fails so miserably, even with direct guidance.

This is really strong evidence of how poorly the AI is generalizing, and that it is not 'understanding'; rather, it's 'synthesizing' from patterns.

We already kind of knew that - but we have not yet built an intuition for that until now.

Only now can we see 'how amazing the pattern synthesis' is - it's almost magic - and yet how it falls off a cliff otherwise.

This has deep implications for the 'road ahead' and the kinds of things we're going to be able to do with AI.

In short: the AI is 'Wizard Level Code Helper, Researcher, and Worker' - but it very clearly lacks capabilities even one level of abstraction above the code itself.

LLMs were first trained by 'text' and now ... they are 'trained by our compilers'. Basically g++, javac, tsc are the 'Verifiable Rewards' in the post-training and reinforcement learning - and the AI is getting extremely good at producing 'code that compiles', but that's definitely an indirection from 'code that does what we want'.
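A minimal sketch of that gap between 'code that compiles' and 'code that does what we want', under loud assumptions: the reward functions `compiles_reward` and `behaviour_reward`, the toy candidates, and the test cases are all hypothetical illustrations, not any lab's actual pipeline, and Python's built-in `compile()` merely stands in for g++, javac or tsc.

```python
# Hypothetical reward functions for illustration only.
# NOTE: exec() on untrusted model output is unsafe; a real setup would sandbox it.

def compiles_reward(source: str) -> float:
    """Return 1.0 if the source byte-compiles, else 0.0 (the 'compiler' signal)."""
    try:
        compile(source, "<candidate>", "exec")
    except (SyntaxError, ValueError):
        return 0.0
    return 1.0


def behaviour_reward(source: str, func_name: str, cases) -> float:
    """Return the fraction of (args, expected) cases that the candidate passes."""
    namespace = {}
    try:
        exec(compile(source, "<candidate>", "exec"), namespace)
        func = namespace[func_name]
    except Exception:
        return 0.0
    passed = 0
    for args, expected in cases:
        try:
            if func(*args) == expected:
                passed += 1
        except Exception:
            pass  # a crashing case simply earns no credit
    return passed / len(cases) if cases else 0.0


# A candidate that compiles fine but subtracts instead of adding.
WRONG_ADD = "def add(a, b):\n    return a - b\n"
RIGHT_ADD = "def add(a, b):\n    return a + b\n"
CASES = [((1, 2), 3), ((0, 0), 0), ((5, 5), 10)]

if __name__ == "__main__":
    print(compiles_reward(WRONG_ADD))                 # full marks from the "compiler"
    print(behaviour_reward(WRONG_ADD, "add", CASES))  # only the (0, 0) case passes
    print(behaviour_reward(RIGHT_ADD, "add", CASES))
```

The wrong candidate gets the maximum compile-only reward while passing just one of the three behavioural cases, which is the indirection described above in miniature.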

It's astonishing that it took us all this time to internalize and start to discover what I think will be in hindsight a very obvious 'threshold' of its capabilities.

We are constantly 'amazed' at the work that it can do, and therefore over-project its capabilities.

I have no doubt that even with these limitations - the AI will unlock a lot more as it gets better - and - that it will 'creep up' the layers of abstraction of its understanding.

But I strongly believe that the AI is going to get much 'wider' (pattern matching dominance) before it gets 'higher' (intrinsic understanding) - and - that this may be a fundamental limitation.

This may be 'the LeCun' insight - when he talks about the limitations of LLMs in detail - I believe this is that insight writ large.

Even the term AI - or certainly 'AGI' may be a misleading metaphor - were we to have always called it 'Stochastic Algorithms' or something along those lines, it's possible that our intuition would be framed a bit better.

The most interesting thing is how it is definitely amazing, world changing, novel and powerful in some ways - and obviously useless in others at the same time. That's the 'threshold' we need to better understand.

nl 5 hours ago

> But 'Drawing of a Proper Duck' is almost arbitrary because it may have nothing to do with the 'Specific Duck You Wanted'.

That might be the case, but Simon's case "Generate an SVG of a pelican riding a bicycle" is very different.

The model actually has to understand what parts of a pelican and bicycle come together in something like an anatomically plausible way. That's a higher level of abstraction than something like passing the same prompt to Stable Diffusion etc

(The new Nano Banana/GPT Image 2.0 models are different though - they have significant world knowledge baked in)

bluegatty 5 hours ago

"That's a higher level of abstraction"

No, it's not, because it's seen 'anatomy' for Pelicans, Animals - even how it's represented in Animals.

If you try to get the AI to actually decompose it and start to 'draw pelicans' in very obscure ways, it will immediately fail.

Try to get the AI to draw the pelican from a very odd angle - like underneath, to the right, one wing extended, one wing not ... 0% chance.

Precisely because it does not understand those things.

FYI it's a slightly unfair case because it does not have a 'world model' yet, which will actually solve that problem, but even then not through very much abstracting.

We're a long way away - but in the meantime, there's lots to unpack.

nl 4 hours ago

> Try to get the AI to draw the pelican from a very odd angle - like underneath, to the right, one wing extended, one wing not ... 0% chance.

Proof by existence?

https://gist.github.com/nlothian/50241d34a654fcf0caa280d4475...

Looks pretty good to me. ChatGPT in "Thinking" mode.

Edit: I've added the Opus version on the same link.

squeaky-clean 2 hours ago

Those are just awful compared to the side view of a pelican on a bike.

IanCal 4 hours ago

Are we a long way away?

https://chatgpt.com/share/e/6a0bf28b-e198-8012-9a88-c777d965...

nl 4 hours ago

Link doesn't work - maybe not public?

koonsolo 3 hours ago

> That might be the case, but Simon's case "Generate an SVG of a pelican riding a bicycle" is very different.

When it was new, sure. Right now, models can be trained on that because everybody uses it as a benchmark.

viking123 1 hour ago

Wow! Actually a sensible comment under all the astroturfing that even this place is so full of now.

gib444 4 hours ago

Starting from zero today, how would someone quickly get up to speed with the latest and greatest AI tooling on an extremely limited budget?

Is the only choice to pay for the "max" plans?

Or just read so much about it that you bs your way through an interview and then use the company's resources?

Simon, I'm curious too how much you invest each month researching all the latest and greatest AI tech?

x86cherry 3 hours ago

Opencode has free access to Qwen 3.6 and Deepseek v4 Flash right now.

They're on par with Claude and Codex imo - when you still design architecture and know what the output should be. Claude and GPT 5.5 need less guidance with vibe coding, but we're not yet at a point where that's sustainable anyway even with those models.

DeathArrow 5 minutes ago

> Starting from zero today, how would someone quickly get up to speed with the latest and greatest AI tooling on an extremely limited budget?

Z.AI, Moonshot.AI, Xiaomi, Minimax, Alibaba all have coding plans that allow massive usage of GLM 5.1, Kimi k2.6, Minimax M2.7, Qwen 3.6 Plus, Xiaomi MiMo v2.5 Pro for cheap.

Pair those coding plans with your harness of choice, including Claude Code, and you are good to go.

Also, Nvidia is offering free access to top models through NIM - but there is a 40 RPM limit. https://blog.kilo.ai/p/nvidia-nim-kilo-code-free-kimi-k25

RobinL 3 hours ago

The $20 ChatGPT pro plan gives pretty generous usage of both Codex and general chat.

gib444 3 hours ago

Ah, I'd read so much about the downgrading of that plan that I didn't think that was still true.

azuanrb 2 hours ago

It depends on what you’re comparing it against. For $20, OpenAI is still probably the best value for SOTA models. In terms of limits, you can use GPT-5.4 instead of 5.5. The intelligence feels similar, but it’s cheaper. You can also experiment with other harnesses like pi. It’s lightweight but capable enough, and its token usage is definitely much more efficient.

tayo42 5 hours ago

The claw thing really came and went fast lol

yieldcrv 4 hours ago

I just started a new job and the person I report to was just excited to tell me about it, here in mid-May

"and then you have to get a mac mini, and then, and then"

smile and nod, it pays weekly

viking123 1 hour ago

I mean yeah? It was a marketing campaign to boost the model providers and give Steinberger a cozy job at OpenAI. Hook, line and sinker.

Wake me up when we have an agent with constant learning and changing weights that I can have personally, not some LLM that can always fall prey to jailbreak and context injection attacks.

You think most of this stuff here is organic? Oh boy..

DeathArrow 5 hours ago

I think that there's a lot to be improved in harnesses and the way the models are interacting with harnesses. For example, the harness should be able to steer the model when thinking.
