Posted by fesens 18 hours ago
(1) Let the LLM randomly perturb the system.
(2) Measure the system's performance.
(3a) If the perturbation improved performance, keep the change.
(3b) Otherwise, don't.
(4) Repeat
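In code, that's just stochastic hill climbing with the LLM as the mutation operator. A minimal sketch of the loop above; the `perturb` and `measure` callables stand in for whatever model call and benchmark you actually use:

```python
import shutil
import tempfile
from typing import Callable

def hill_climb(system_dir: str,
               perturb: Callable[[str], None],   # LLM edits the tree in place
               measure: Callable[[str], float],  # benchmark, higher is better
               steps: int = 100) -> float:
    best = measure(system_dir)                   # baseline score
    for _ in range(steps):                       # (4) repeat
        trial = tempfile.mkdtemp()
        shutil.copytree(system_dir, trial, dirs_exist_ok=True)
        perturb(trial)                           # (1) random LLM perturbation
        score = measure(trial)                   # (2) measure
        if score > best:                         # (3a) improvement: keep it
            best = score
            shutil.copytree(trial, system_dir, dirs_exist_ok=True)
        shutil.rmtree(trial)                     # (3b) otherwise just discard
    return best
```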
[1] https://github.com/karpathy/autoresearch

AlphaEvolve from Google is an evolutionary algorithm that uses LLMs for idea generation, following a very similar loop:
- https://deepmind.google/blog/alphaevolve-a-gemini-powered-co...
- Open source implementation of the algorithm: https://github.com/algorithmicsuperintelligence/openevolve
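The population-based shape is only a small step up from the greedy loop above. A rough sketch of that shape, not AlphaEvolve's actual implementation:

```python
import random
from typing import Callable

def evolve(seed: str,
           mutate: Callable[[str], str],    # LLM: parent program -> child program
           fitness: Callable[[str], float],
           pop_size: int = 20,
           generations: int = 200) -> str:
    population = [(fitness(seed), seed)]
    for _ in range(generations):
        parent = random.choice(population)[1]       # sample a parent
        child = mutate(parent)                      # LLM-generated variation
        population.append((fitness(child), child))
        population.sort(key=lambda p: p[0], reverse=True)
        population = population[:pop_size]          # keep only the fittest
    return population[0][1]
```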
* Gödel Machine (2006-2007) [1]
* Optimal Ordered Problem Solver (2002) [2]
* Meta-Learning and Artificial Curiosity (1990s onward) [3]
[1] https://arxiv.org/html/2505.22954v3
[2] https://arxiv.org/abs/cs/0207097
[3] https://evolution.ml/pdf/schmidhuber.pdf
I don't see both ingredients in Karpathy's proposed scheme.
What's next, “Karpathy Investing”, where an AI in a loop builds a portfolio?
> (1) Let the LLM randomly perturb the system.

Instead of this, I ask the LLM what's least likely to improve performance, and then measure that.

Sometimes big gains come from the places you thought were least likely.
Why should throwing ideas at the wall be any different when it comes to optimizing code, as long as you can measure and verify the result, are okay with the added complexity, and can keep the code itself from being crap by the end of it?
If an approach is found that improves how well something works, you can even treat the AI slop as a draft and iterate upon it yourself further.
At the time I dismissed it as potentially being incredibly expensive for the improvement you actually get, and as running into the typical pitfalls of evolutionary algorithms: in the same way evolution doesn't let an organism grow a wheel, your LLM evolution algorithm will never come up with anything that requires a far bigger leap than what you allow the LLM to perturb in a single step. The genetic algorithm will also probably produce a vibecoded mess of short-sighted decisions, just as evolution produces a spaghetti genome in real life.
I'll definitely need to look into how people have improved the idea and whether it is practical now.
> The same observation had previously also been made by many others.
I think hyperparameter tuning may actually be a kind of genetic algorithm.
Hyperparameter tuning is usually done with Bayesian optimization, though.
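For comparison, a minimal sketch with Optuna, whose default sampler (TPE) is a Bayesian-style method; `train_and_validate` is a hypothetical stand-in for your training run:

```python
import optuna

def objective(trial: optuna.Trial) -> float:
    # TPE proposes new points from a probabilistic model of past trials,
    # rather than by mutating or crossing over a population.
    lr = trial.suggest_float("lr", 1e-5, 1e-1, log=True)
    depth = trial.suggest_int("depth", 2, 12)
    return train_and_validate(lr, depth)  # hypothetical trainer, returns val score

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=100)
print(study.best_params)
```

That said, evolutionary samplers like CMA-ES are also used for hyperparameter search, so the parent's intuition isn't far off.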
https://publicityreform.github.io/findbyimage/readings/lem.p...
Nice detail on the encountered failures. Very similar experiences with my own loops against test suites.
Great post. A snapshot in time.
> The agent did not know that would also halve the LUT count. It found out by doing it and watching the synthesizer.
So I guess this is an example of an LLM anthropomorphizing and making wild conjectures about the internal workings of a different LLM.
Pretty much what I did to let Codex with gpt5.4xhigh improve my fairly complex CUDA kernel, which resulted in a 20x throughput improvement.
Big difference between a working model that needs to be optimized, vs nothing working at all.
OP's post basically points out what many others have surely discovered independently: your agent-based dev operation is only as good as the test rituals and guardrails you give the agents.
However, there isn't really a "correct" answer that's easy to define in code (I could manually label a training set, but wanted to avoid that), so I had the LLM just analyse the results itself and decide whether they are better or not. It wrote deterministic rules for a few things, but overall it just reviewed the results of each round and decided whether they were better or not.
Reviewing the before and after results, I would say yes, it's a big improvement in quality. It also optimised the prompt size to reduce input tokens by 25% and switched to a smaller/cheaper model.
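A sketch of what that judging step can look like, assuming an OpenAI-style client; the model name and prompt are illustrative, not from the parent comment:

```python
from openai import OpenAI

client = OpenAI()

def judge(before: str, after: str) -> bool:
    """Ask an LLM whether the new output beats the old one."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # any cheap judge model works here
        messages=[{
            "role": "user",
            "content": (
                "You are grading two versions of the same output.\n"
                f"BEFORE:\n{before}\n\nAFTER:\n{after}\n\n"
                "Reply with exactly one word: BETTER or WORSE."
            ),
        }],
    )
    return resp.choices[0].message.content.strip().upper().startswith("BETTER")
```

In practice you'd also want to swap the order of the two candidates across calls, since LLM judges have a measurable position bias.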
I have a recursive agent that finds trading strategies after recreating academic research and probing the model using its training on everything. It works really well, but I have to force it to write out every line and write a proof that no data from after the wall-clock time entered the system. Even then, some stupid thing like not converting a timezone with daylight savings will let it peek one hour into the future. These types of bugs are almost impossible to find. Now there needs to be another agent whose only purpose is to write out every line, explaining that the timezone for that line of code was correct.
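One way to make that proof mechanical instead of prose: refuse naive timestamps outright, since a naive timestamp forces a timezone guess, and that guess is exactly where the DST one-hour leak hides. A minimal pandas sketch (function name is mine):

```python
import pandas as pd

def assert_no_lookahead(df: pd.DataFrame, ts_col: str, as_of: pd.Timestamp) -> None:
    """Raise if any row is stamped after the as-of wall clock."""
    ts = df[ts_col]
    if ts.dt.tz is None:  # naive timestamps are where DST bugs sneak in
        raise ValueError(f"{ts_col} must be timezone-aware")
    if as_of.tz is None:
        raise ValueError("as_of must be timezone-aware")
    future = ts.dt.tz_convert("UTC") > as_of.tz_convert("UTC")
    if future.any():
        raise ValueError(f"{int(future.sum())} rows in {ts_col} lie after {as_of}")
```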
a fantastic opportunity to become the next next big thing and write a verifier verifier.
At the hypothesized inflection point where AI instantly performs exactly as commanded, what happens to heavily regulated industries like medicine? Do we get huge leaps and bounds everywhere EXCEPT where it matters, or is regulation going to be handed over to a verifier verifier?
The devil is in the details. There are an amazing number of details in a good [thing]. Someone somewhere has to say exactly what this [thing] being built actually is.
Read almost any story about wishes from a genie. Simple statements don't work.