Posted by yakkomajuri 8 hours ago
"Coding agents got really good - here, a bunch of irrelevant slop pictures of pelicans riding bikes as a key benchmark AND a couple of hardly relevant edge-case demo projects of mine to prove it!"
Come on man, where is the AI writing all the code in 6 months? We're close to June, and Amodei's latest statement from January does not look like it will be fulfilled in the next few weeks, does it?
Hmm, given how small the nerd community is and how often I've come across that task, either on Hacker News or on various Substacks, I'm not so sure that the AI labs would ignore it completely.
Hmmm......
Does that suggest the uplift was only for things that are easily verifiable like code?
The hope was that good RLVR on relatively contrived datasets (like benchmarks) would generalize to good software taste, which has somewhat succeeded, but the models still fail in horrible ways.
And the hope beyond that is that good skills in fundamental problem-solving tasks (coding, math) would generalize to tasks beyond math and code, which did happen, but less so.
Other domains I'm not sure about, but I've heard from people like Cal Newport that the rate of improvement outside of code and math is not as impressive.
I haven't looked into any sort of "agent" mode, just because I don't yet quite trust the AI not to do something dumb. Also, I don't use M365, where Copilot is integrated, so I suppose I would have to set it up myself.
It would support your point about the performance of 20GB local models.
It is getting very good at producing code that compiles - at the algorithmic level.
This is definitely noteworthy - and the AI is crossing a critical 'productivity threshold'.
But 'Drawing of a Proper Duck' is almost arbitrary because it may have nothing to do with the 'Specific Duck You Wanted'.
Everyone has tried to get AI to 'Draw The Thing They Want' and you notice immediately how it's almost impossible to 'adjust the image' along the vector you want - because ... and this is key:
-> the AI doesn't really understand what a Duck is, its components, or fully how it made the duck <-
It just knows how to 'incant' the duck.
This becomes very clear when you try to get the AI to write proper documentation - it fails so miserably, even with direct guidance.
This is really strong evidence of how poorly the AI is generalizing, and that it is not 'understanding'; rather, it's 'synthesizing' from patterns.
We already kind of knew that - but we have not yet built an intuition for that until now.
Only now can we see how amazing 'the pattern synthesis' is - it's almost magic - and yet how it falls off a cliff otherwise.
This has deep implications for the 'road ahead' and the kinds of things we're going to be able to do with AI.
In short: the AI is 'Wizard Level Code Helper, Researcher, and Worker' - but it very clearly lacks capabilities even one level of abstraction above the code itself.
LLMs were first trained on 'text' and now ... they are 'trained by our compilers'. Basically g++, javac, tsc are the 'Verifiable Rewards' in the post-training and reinforcement learning - and the AI is getting extremely good at producing 'code that compiles', but that's definitely an indirection from 'code that does what we want'.
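To make the gap between 'code that compiles' and 'code that does what we want' concrete, here is a minimal sketch in Python. It is not how any lab actually computes rewards; `compiles`, `passes_tests`, and the `add` task are hypothetical names made up for illustration, and Python's built-in `compile()` stands in for g++/javac/tsc.

```python
def compiles(source: str) -> bool:
    """Reward signal 1: does the candidate parse at all?

    Python's built-in compile() is a stand-in for a real compiler here.
    """
    try:
        compile(source, "<candidate>", "exec")
        return True
    except SyntaxError:
        return False


def passes_tests(source: str, cases, func_name: str = "add") -> bool:
    """Reward signal 2: does the candidate behave as intended?

    Runs the candidate and checks a hypothetical function against
    (args, expected) pairs. Any crash counts as a failure.
    """
    namespace: dict = {}
    try:
        exec(source, namespace)  # only safe for trusted toy inputs
        func = namespace[func_name]
        return all(func(*args) == expected for args, expected in cases)
    except Exception:
        return False


# Two candidates for "write add(a, b)": both compile, only one is right.
WRONG = "def add(a, b):\n    return a - b\n"
RIGHT = "def add(a, b):\n    return a + b\n"
CASES = [((1, 2), 3), ((0, 0), 0), ((-1, 1), 0)]

for label, candidate in (("wrong", WRONG), ("right", RIGHT)):
    print(label, compiles(candidate), passes_tests(candidate, CASES))
```

A reward built only on `compiles` cannot tell the two candidates apart; the behavioural check can, but only for whatever the test cases happen to cover.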
It's astonishing that it took us all this time to internalize and start to discover what I think will be in hindsight a very obvious 'threshold' of its capabilities.
We are constantly 'amazed' at the work that it can do, and therefore over-project its capabilities.
I have no doubt that even with these limitations - the AI will unlock a lot more as it gets better - and - that it will 'creep up' the layers of abstraction of its understanding.
But I strongly believe that the AI is going to get much 'wider' (pattern matching dominance) before it gets 'higher' (intrinsic understanding) - and - that this may be a fundamental limitation.
This may be 'the LeCun insight' - when he talks about the limitations of LLMs in detail - I believe this is that insight writ large.
Even the term 'AI' - or certainly 'AGI' - may be a misleading metaphor. Had we always called it 'Stochastic Algorithms' or something along those lines, it's possible that our intuition would be framed a bit better.
The most interesting thing is how it is definitely amazing, world changing, novel and powerful in some ways - and obviously useless in others at the same time. That's the 'threshold' we need to better understand.
That might be the case, but Simon's case "Generate an SVG of a pelican riding a bicycle" is very different.
The model actually has to understand which parts of a pelican and a bicycle come together in something like an anatomically plausible way. That's a higher level of abstraction than something like passing the same prompt to Stable Diffusion etc.
(The new Nano Banana/GPT Image 2.0 models are different though - they have significant world knowledge baked in)
No, it's not, because it's seen 'anatomy' for Pelicans, Animals - even how it's represented in Animals.
If you try to get the AI to actually decompose it and start to 'draw pelicans' in very obscure ways, it will immediately fail.
Try to get the AI to draw the pelican from a very odd angle - like underneath, to the right, one wing extended, one wing not ... 0% chance.
Precisely because it does not understand those things.
FYI it's a slightly unfair case because it does not have a 'world model' yet, which will actually solve that problem, but even then not through much abstraction.
We're a long way away - but in the meantime, there's lots to unpack.
Proof by existence?
https://gist.github.com/nlothian/50241d34a654fcf0caa280d4475...
Looks pretty good to me. ChatGPT in "Thinking" mode.
Edit: I've added the Opus version on the same link.
https://chatgpt.com/share/e/6a0bf28b-e198-8012-9a88-c777d965...
When it was new, sure. Right now, models can be trained on that because everybody uses it as a benchmark.
Is the only choice to pay for the "max" plans?
Or just read so much about it that you bs your way through an interview and then use the company's resources?
Simon, I'm curious too how much you invest each month researching all the latest and greatest AI tech?
They're on par with Claude and Codex imo - when you still design architecture and know what the output should be. Claude and GPT 5.5 need less guidance with vibe coding, but we're not yet at a point where that's sustainable anyway even with those models.
Z.AI, Moonshot.AI, Xiaomi, Minimax, Alibaba all have coding plans that allow heavy usage of GLM 5.1, Kimi k2.6, Minimax M2.7, Qwen 3.6 Plus, Xiaomi MiMo v2.5 Pro for cheap.
Pair those coding plans with the harness of choice including Claude Code and you are good to go.
Also, Nvidia is offering free access to top models through NIM - but you have a 40 RPM limit. https://blog.kilo.ai/p/nvidia-nim-kilo-code-free-kimi-k25
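If your harness doesn't already back off for you, staying under a 40 RPM cap can be handled client-side. A minimal sketch, assuming the simplest possible policy of spacing calls at least 60/40 = 1.5 seconds apart; `MinIntervalThrottle` is a made-up helper, not part of any NIM or harness API, and how the limit is actually measured on the server side is not something I know.

```python
import time


class MinIntervalThrottle:
    """Space calls evenly so they never exceed requests_per_minute.

    Even spacing is stricter than a rolling-window limit, so it errs
    on the safe side. The clock and sleep functions are injectable
    to make the class testable without real waiting.
    """

    def __init__(self, requests_per_minute=40, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 60.0 / requests_per_minute
        self._clock = clock
        self._sleep = sleep
        self._next_allowed = 0.0  # earliest time the next call may start

    def wait(self) -> float:
        """Block until the next call is allowed; return seconds slept."""
        now = self._clock()
        delay = max(0.0, self._next_allowed - now)
        if delay > 0:
            self._sleep(delay)
        # Reserve the following slot relative to when this call starts.
        self._next_allowed = max(now, self._next_allowed) + self.min_interval
        return delay
```

Usage would be calling `throttle.wait()` right before each request, with one shared `throttle = MinIntervalThrottle(40)` per API key.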
"and then you have to get a mac mini, and then, and then"
smile and nod, it pays weekly
Wake me up when we have an agent with continual learning and changing weights that I can have personally, not some LLM that can always fall prey to jailbreaks and context injection attacks.
You think most of this stuff here is organic? Oh boy..