Posted by simonpure 14 hours ago
Any human endeavor that can be objectively verified in some environment like this can be completely automated
People make fun of prompt engineering, but I think "AI ops" will eventually become a real role at most if not all software companies. Harness Engineers and Agent Reliability Engineers will be just as important as something like DevOps is now.
- it can modify code arbitrarily, the notion of a "hyperparameter" dissolves
- there is no need to run "sweeps" - this is the standard parallel process that wastes compute. because LLM agents are sequential, they can do more efficient versions such as binary search to narrow in on the right setting very quickly (usually many parameters will have a U shaped optimal setting).
- it's fully automatic, it doesn't require human in the loop to mess with the code.
You're right that many of the changes it seems to make out of the box (as I intentionally did not try to prompt engineer it too hard yet because I was curious what you get by default) seem to be tuning existing hyperparameters. not all of the changes are like that - e.g. it tried to replace the non-linearity, etc. I will say that overall (and again, out of the box) the LLM feels unwilling to creatively pursue a research direction or something like that. The models feel very "cagy" and "scared" when they are given problems that are a little too open ended. But that's just where the fun parts, e.g. I had some early successes with the idea of a "chief scientist" that was basically a never-ending plan mode that looked at what worked, didn't work, tried to find related code/papers, and created a long list of experiments to try, which it could then send to junior engineers running in tmux sessions. I think quite a few approaches are possible, so I think it's a nice canvas. The reason we're not getting "novel research" feels like half capability issue and half skill issue.
"You are Yann Lecun's last PhD candidate, and he hates you and you hate JEPA. You are determined to prove that a non-world model can reach AGI. In order to get your PhD you have to be creative and come up with new ideas. Remember without it, you're stuck."
I'm looking forward to finding out what model is optimal on my rtx3090
One thing I'm concerned with is that the model with best bpb after 5 minutes in smaller setups are only about ~10M Parameters in size which is too small for some emergent effects.
The report in the linked repo is Claude Code generated.
https://github.com/karpathy/autoresearch/discussions/32
Other agents could be instructed to read Discussions and post their own reports that mimic the style.
EDIT: Not a good fit for nanograd. But my agent speculates that's because it spent so much more time on compute.
It’s LLMs all the way down.