
Posted by ykumards 6 hours ago

Autoresearch on an old research idea (ykumar.me)
243 points | 64 comments
mlmonkey 3 hours ago||
> Then I lock down Claude Code’s permissions to only edit these two files and run run.sh. No direct Python execution, no pip installs, no network access, no git push, etc.

How does one run Claude Code without network access?

ykumards 2 hours ago||
Sorry, I could have worded this part better.

The Docker container didn't have network access. Claude didn't have permission to execute anything other than the run.sh bash script, which orchestrated the docker run.
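For the curious, the setup described would look roughly like this (a hypothetical sketch; the image name, script names, and mount paths are illustrative, not the actual repo's):

```shell
#!/bin/sh
# run.sh: the only command the agent is allowed to execute.
# --network none removes all network access inside the container,
# so the training script can't pip install or phone home.
docker run --rm \
  --network none \
  -v "$PWD:/work" \
  -w /work \
  python:3.11-slim \
  python train.py --config config.yaml
```

The agent edits the two whitelisted files on the host, and the container only ever sees the mounted working directory.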

shepherdjerred 3 hours ago|||
You can do this via a Docker container or Seatbelt on macOS.

In both cases you'd limit it so CC can only talk to the required Anthropic APIs.

So not zero access, but as close to it as you can get.
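On the Seatbelt side, a minimal illustrative sketch (sandbox-exec is deprecated but still ships with macOS; this profile cuts off all network, and allowing only specific Anthropic endpoints would need a proxy or firewall rules on top):

```shell
# Hypothetical: run a command under a Seatbelt profile that allows
# everything except network access (macOS only).
cat > no-net.sb <<'EOF'
(version 1)
(allow default)
(deny network*)
EOF
sandbox-exec -f no-net.sb ./run.sh
```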

franktankbank 3 hours ago||
Pretty good question; also, how do you update the Python version without network access?
n_bhavikatti 4 hours ago||
The temperature-clamp fix and "Optuna++" actions by the agents (the cause of basically all of the improvement to eCLIP) indicate they are good at finding bugs and at hyperparameter tuning. But when it comes to anything beyond that, such as novel architectural shifts, agents aren't good enough. With no clear path forward they tend to change things at random, which is a poor approach. Agents: optimization >> innovation
lamroger 6 hours ago||
Awesome breakdown! It really feels like a hyper-hyperparameter search + bug fixer.

I started looking at Kaggle again, and autoresearch seems to converge on many of the solution vibes there.

Wild ensembles, squeezing a bit of loss out. More engineering than research IMO

sdenton4 5 hours ago|
For raw hyperparameter search, though, I would expect a proper Bayesian framework to be much better, e.g. Vizier.
ainch 5 hours ago||
I think it depends on whether you can leverage some knowledge. It's possible for a person/LLM to look at a loss curve and say "oh, that's undertraining, let's bump the lr," whereas a Bayesian method doesn't necessarily have that deeper understanding, so it'll waste a lot of time exploring poor options in the search space.

If you're resource-unconstrained then BO should ofc do very well, though.

sdenton4 4 hours ago||
Yah, I'm a bit skeptical: in my experience humans tend to under-explore due to incorrect assumptions. Often this is due to forming a narrative to explain some result and then over-attaching to it. Also, agents aren't actually good at reasoning yet.

Good Bayesian exploration is much, much better than grid search, and it does indeed learn to avoid low-value regions of the parameter space. If we're talking about five-minute experiments (as in the blog post), Bayesian optimization should chew through the task no problem.
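To make the idea concrete, here's a toy sequential model-based loop in pure Python: fit a cheap surrogate to past trials, jump to its minimum with a little jitter for exploration, evaluate, refit. (Illustrative only: the objective is a made-up stand-in, the surrogate is a plain quadratic, and real frameworks like Vizier or Optuna use proper probabilistic surrogates and acquisition functions.)

```python
import random

random.seed(0)

def objective(lr):
    """Toy 'validation loss', minimized at lr = 0.1 (stand-in for a real run)."""
    return (lr - 0.1) ** 2

def solve3(A, b):
    """Solve a 3x3 linear system via Gaussian elimination with partial pivoting."""
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for i in range(3):
        p = max(range(i, 3), key=lambda r: abs(M[r][i]))
        M[i], M[p] = M[p], M[i]
        for r in range(i + 1, 3):
            f = M[r][i] / M[i][i]
            for c in range(i, 4):
                M[r][c] -= f * M[i][c]
    x = [0.0, 0.0, 0.0]
    for i in (2, 1, 0):
        x[i] = (M[i][3] - sum(M[i][c] * x[c] for c in range(i + 1, 3))) / M[i][i]
    return x

def fit_quadratic(xs, ys):
    """Least-squares fit y ~ a*x^2 + b*x + c via the normal equations."""
    S = [sum(x ** k for x in xs) for k in range(5)]       # sums of x^0..x^4
    T = [sum(y * x ** k for x, y in zip(xs, ys)) for k in range(3)]
    A = [[S[4], S[3], S[2]], [S[3], S[2], S[1]], [S[2], S[1], S[0]]]
    return solve3(A, [T[2], T[1], T[0]])

# Warm up with a few random trials, then alternate fit / jump / evaluate.
xs = [random.random() for _ in range(3)]
ys = [objective(x) for x in xs]
grid = [i / 100 for i in range(101)]
for _ in range(10):
    a, b, c = fit_quadratic(xs, ys)
    nxt = min(grid, key=lambda x: a * x * x + b * x + c)  # surrogate minimum
    nxt = min(1.0, max(0.0, nxt + random.uniform(-0.02, 0.02)))  # jitter
    xs.append(nxt)
    ys.append(objective(nxt))

best = min(zip(ys, xs))[1]
print(f"best lr found: {best:.3f}")  # lands near the true optimum, 0.1
```

The point is just that even a crude model of the search space beats blind grid search: after the first fit, every trial lands near the optimum.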

pikachu0625 3 hours ago||
It's better to outsource the optimization phases. Our ideas should go into the constraints, assumptions, etc., where the breakthroughs are. Boyd often argues that once you can express a problem in a standard mathematical form, the implementation becomes a commodity that software can handle automatically.
BrokenCogs 5 hours ago||
Does autoresearch work for projects that are not LLM-based? E.g., in Karpathy's example he is optimizing nanoGPT. What if I wanted to improve a U-Net for image segmentation?
simonw 5 hours ago||
Tobi from Shopify used a variant of autoresearch to optimize the Liquid template engine, and found a 53% speedup after ~120 experiments: https://github.com/Shopify/liquid/pull/2056

I wrote up some more notes on that here: https://simonwillison.net/2026/Mar/13/liquid/

Denzel 5 hours ago||
How much did this cost? Has there ever been an engineering focus on performance for Liquid?

It’s certainly cool, but the optimizations are so basic that I’d expect a performance engineer to find these within a day or two with some flame graphs and profiling.

simonw 5 hours ago||
He used Pi as the harness but didn't say which underlying model. My stab-in-the-dark guess would be no more than a few hundred dollars in token spend (for 120 experiments run over a few days, assuming Claude Opus 4.6 without the benefits of the Claude Max plan).

So cheaper than a performance engineer for a day or two... but the Shopify CEO's own time is likely a whole lot more expensive than a regular engineer's!

sdenton4 5 hours ago|||
The gist of these things is you point them at an eval metric and say "make it go better." So you can point them at anything you can measure. The example in the blog post here is bounding boxes on wood cut images.
bethekind 5 hours ago|||
I used it to speed up a codecompass-like repo from 86 files per second to 2000. I still haven't used the repo in production, so maybe it secretly broke things, but the ability to say "optimize this benchmark and commit only if you pass these tests" is nice.
ks2048 4 hours ago|||
I think image segmentation is in the same class as LLMs: ML experiments.

What about more distant software projects? Give it the CPython source code and say you want it to be faster.

motbus3 4 hours ago||
I've done something similar with a small project I have, and I got very similar results overall.
wasting_time 1 hour ago|
Care to elaborate?