
Posted by ykumards 5 hours ago

Autoresearch on an old research idea (ykumar.me)
243 points | 64 comments
the_arun 4 hours ago|
Try this if the main link is not responsive - https://archive.is/6xLiU
datsci_est_2015 4 hours ago||
I often use LLMs to explore prior art and maybe find some alternative ways of thinking of problems. About 90% of what it tells me is useless or inapplicable to my domain due to a technicality it could not have known, but the other 10% is nice and has helped me learn some great new things.

I can’t imagine letting an agent try everything that the LLM chatbot had recommended ($$$). Often coming up in recommendations are very poorly maintained / niche libraries that have quite a lot of content written about them but what I can only imagine is very limited use in real production environments.

On the other hand, we have domain expert “consultants” in our leadership’s ears making equally absurd recommendations that we constantly have to disprove. Maybe an agent can occupy those consultants and let us do our work in peace.

andy12_ 4 hours ago||
I think the main value lies in allowing the agent to try many things while you aren't working (when you are sleeping or doing other activities), so even if many tests are not useful, with many trials it can find something nice without any effort on your part.

This is, of course, only applicable if doing a single test is relatively fast. In my work a single test can take half a day, so I'd rather not let an agent spend a whole night doing a bogus test.

M4v3R 3 hours ago|||
Even if your tests take a long time, you can always (if hardware permits) run multiple tests in parallel. This would enable you to explore many approaches at the same time.
genxy 2 hours ago||||
> single test can take half a day

Why is that?

I don't doubt you, but when Shigeo Shingo created SMED (Single Minute Exchange of Die), die changes were an hours long process.

datsci_est_2015 3 hours ago|||
Experiments for us cost on the order of tens of dollars, so doing 100 of them every night quickly becomes the price of an entire new employee. And that’s not even including the cost of letting agents run all night.

Definitely not in the budget for non-VC-backed companies who aren’t in the AI bubble.

lukebechtel 5 minutes ago|||
What is your domain?
Eufrat 4 hours ago|||
I find LLMs useful in regurgitating one-liners that I can’t be bothered to remember or things where even being flat out wrong is okay and you just do it yourself.

For all the folks spending a lot of time and energy setting up MCP servers, AGENTS.md, etc.: I think this shows that the LLM cannot do what AI boosters sell it as, and needs extreme amounts of guidance to reach a desired goal, if it even can. This is not an argument that the tech has no value. It clearly can be useful in certain situations, but this is not what OpenAI/Anthropic/Perplexity are selling, and I don’t think the actual use cases have a sustainable business model.

People who spend the energy to tailor the LLMs to their specific workflows and get it to be successful, amazing. Does this scale? What’s going to happen if you don’t have massive amounts of money subsidizing the training and infrastructure? What’s the actual value proposition without all this money propping it up?

M4v3R 3 hours ago|||
> I find LLMs useful in regurgitating one-liners

This was the case for me a year ago. Now Claude or Codex are routinely delivering finished & tested complete features in my projects. I move much, much faster than before and I don’t have an elaborate setup - just a single CLAUDE.md file with some basic information about the project and that’s it.

Eufrat 2 hours ago||
People keep saying this and I agree Claude has gotten a lot better even in my own experience, but I think the value is questionable.

What’s the point of adding features that are inscrutable? I have gotten Claude to make a feature, and it mostly works, but if it doesn’t work quite right I spend a massive amount of time trying to understand what is going on. For things that don’t matter too much, like prototyping, I think it’s great to just be able to get a working demo out faster, but it’s kind of terrifying when people start doing this for production stuff, especially if their domain knowledge is limited. I can personally attest to seeing multiple insane things that are clearly vibe coded by people who don’t understand things. In one case, I saw API keys exposed because they were treating database users as regular user accounts for website login auth.

> I move much, much faster than before

This is a bad metric as has been attested multiple times in unrelated situations. Moving faster is not necessarily productivity nor is it value.

foobarian 4 hours ago|||
> I find LLMs useful in regurgitating one-liners that I can’t be bothered to remember

I found LLMs make a fabulous frontend for git :-D

electroglyph 2 hours ago||
ah, you've found the danger zone!
MattGaiser 4 hours ago||
> agent try everything that the LLM chatbot had recommended ($$$)

A lot depends on whether it is expensive to you. I use Claude Code for the smallest of whims and rarely run out of tokens on my Max plan.

datsci_est_2015 3 hours ago||
Our experiments aren’t free. We use cloud infrastructure. An experiment costs on the order of tens of dollars, so massively parallelizing “spaghetti at wall” simulators is costly before we even talk about LLMs.
victorbjorklund 1 hour ago||
If it is an experiment, can’t you just make a POC for it that doesn’t need to use half of AWS just to run? And if the experiment is actually positive, you can then bring it to the real application and test it there (spending the 10-100 USD it costs to test it live)?
datsci_est_2015 1 hour ago|||
I wouldn’t want the LLM-based agent to hyperspecialize its solution to a subset of the data. That’s a basic tenet of machine learning.

Steelmanning your question though, I guess you could come up with some sort of tiered experimentation scheme where you slowly expose it to more data and more compute based on prior success or failures.

carlsborg 4 hours ago||
> “The agent acted like a hyperparameter optimization algorithm with some basic reasoning baked in.”

Good lens.

The crux of the autoresearch repo is basically one file, program.md, a system prompt that can be summarized as “do this in a loop: improve train.py, run the training, run evals, record the result. Favor simplicity.” The other files are an arbitrary ML model that is being trained.
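As a hedged sketch of what a loop like that reduces to (this is illustrative, not the repo's actual code: `propose_change` and `run_training_and_eval` are hypothetical stand-ins for the LLM call and for shelling out to train.py/evals):

```python
# Toy version of the "improve, train, eval, record" loop. In the real repo
# an LLM reads the history and edits train.py; here a naive rule stands in.

def propose_change(history):
    # stand-in for the LLM step: given past results, propose new params
    # (here: just halve the learning rate each round)
    lr = history[-1]["lr"] / 2 if history else 0.1
    return {"lr": lr}

def run_training_and_eval(params):
    # stand-in for `python train.py && python eval.py`;
    # a toy loss that bottoms out at lr = 0.0125
    return abs(params["lr"] - 0.0125)

def research_loop(n_iters=5):
    history = []
    for _ in range(n_iters):
        params = propose_change(history)
        loss = run_training_and_eval(params)
        history.append({**params, "loss": loss})  # record the result
    return min(history, key=lambda r: r["loss"])

best = research_loop()
```

The interesting part is that everything clever lives in `propose_change`; the rest is bookkeeping.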

_pdp_ 3 hours ago||
Take some working code. Ask an LLM to fix bugs. Measure performance and test coverage. Feed the results back into the LLM. Repeat.

This has been the standard approach for more complex LLM deployments for a while now in our shop.

Using different models across iterations is also something I've found useful in my own experiments. It's like getting a fresh pair of eyes.
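A minimal sketch of that fix-measure-repeat loop, rotating models for the "fresh eyes" effect (everything here is invented for illustration: `ask_llm_to_fix` is a stub, the model names are made up, and the "score" stands in for real perf/coverage numbers):

```python
from itertools import cycle

def measure(code):
    # stand-in for running the test suite and benchmarks;
    # here "code" is just a defect score, and lower is better
    return code

def ask_llm_to_fix(model, code, score):
    # stub for the LLM call; pretend each model shaves the score differently
    step = {"model-a": 3, "model-b": 2}[model]
    return max(code - step, 0)

def repair_loop(code, rounds=6):
    models = cycle(["model-a", "model-b"])  # fresh pair of eyes each round
    for model, _ in zip(models, range(rounds)):
        score = measure(code)
        if score == 0:
            break  # nothing left to fix
        code = ask_llm_to_fix(model, code, score)  # feed results back in
    return code

result = repair_loop(12)
```

The structure is the point: measurement gates the loop, and the model only ever sees code plus the latest results.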

cyanydeez 3 hours ago|
Can we modify this approach to get LLMs that are good at specific programming languages or frameworks? That seems to be where local LLMs could really shine.
nico 3 hours ago|||
Would love to have a small local model that only knows about rails and mvc web development

Alternatively, a modular model with multiple “experts” that I could mix and match for my specific stack

I don’t need the model to know all of the Internet plus 20 different human languages. I just want it to be really good with the stack of the project

barrenko 3 hours ago|||
It's just RL-everything.
jpcompartir 4 hours ago||
There are better techniques for hyper-parameter optimisation, right? I fear I have missed something important, why has Autoresearch blown up so much?

The bottleneck in AI/ML/DL is always data (volume & quality) or compute.

Does/can Autoresearch help improve large-scale datasets? Is it more compute efficient than humans?

bonoboTP 2 hours ago||
There is a field called AutoML, with its own specialized academic literature and libraries, that has tried to achieve this type of thing but hasn't worked very well in practice.

Years ago there were big hopes for Bayesian hyperparameter optimization: predicting performance with Gaussian processes, the hyperopt library, and so on. But it often launched wasteful experiments because it really had no idea what the parameters did. People mostly just do grid search and random search over a configuration set up by intuition and experience. Meanwhile an LLM can see what each hyperparameter does, can see what techniques and settings have worked in the literature, and can apply something approximating common sense about what has a big enough effect. It's surprisingly difficult to precisely define when a training curve has really flattened, for example.
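For reference, the random-search baseline mentioned above can be this small (a toy sketch; the objective, the ranges, and the "best" values are all made up for illustration):

```python
import random

def objective(lr, batch_size):
    # stand-in for a real training run; pretend lr=1e-2, batch=64 is optimal
    return (lr - 0.01) ** 2 + (batch_size - 64) ** 2 / 1e4

random.seed(0)
best = None
for _ in range(200):
    trial = {
        "lr": 10 ** random.uniform(-4, -1),         # log-uniform, as is typical
        "batch_size": random.choice([16, 32, 64, 128]),
    }
    loss = objective(**trial)
    if best is None or loss < best[0]:
        best = (loss, trial)
```

Note that every trial here is blind: the search has no notion of what "lr" means, which is exactly the gap the LLM-in-the-loop approach is supposed to close.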

So in theory there are many non-LLM approaches but they are not great. Maybe this is also not so great yet. But maybe it will be.

nextos 4 hours ago|||
AFAIK, it's a bit more than hyper-parameter tuning as it can also make non-parametric (structural) changes.

Non-parametric optimization is not a new idea. I guess the hype is partly because people hope it will be less brute force now.

coppsilgold 3 hours ago|||
Perhaps LLM-guided Superoptimization: <https://en.wikipedia.org/wiki/Superoptimization>

I recall reading about a stochastic one years ago: <https://github.com/StanfordPL/stoke>

gwerbin 3 hours ago|||
It's an LLM-powered evolutionary algorithm.
ainch 3 hours ago|||
I'd like to see a system like this take more inspiration from the ES literature, similar to AlphaEvolve. Let's see an archive of solutions, novelty scoring, and some crossover rather than purely mutating the same file in a linear fashion.
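A toy sketch of that archive-plus-crossover shape (purely illustrative: the genome, the fitness function, and the operators are made up, and random mutation stands in for LLM-proposed edits):

```python
import random

def fitness(genome):
    # stand-in for "run the experiment and read the eval metric";
    # higher is better, optimum at every gene equal to 0.5
    return -sum((g - 0.5) ** 2 for g in genome)

def mutate(genome):
    child = list(genome)
    i = random.randrange(len(child))
    child[i] += random.uniform(-0.1, 0.1)
    return child

def crossover(a, b):
    # uniform crossover: each gene comes from either parent
    return [random.choice(pair) for pair in zip(a, b)]

def evolve(generations=300, archive_size=8, genome_len=4):
    random.seed(0)
    archive = [[random.random() for _ in range(genome_len)]
               for _ in range(archive_size)]
    initial_best = max(map(fitness, archive))
    for _ in range(generations):
        a, b = random.sample(archive, 2)
        child = mutate(crossover(a, b))
        # keep an archive: replace the worst member only if the child beats it
        worst = min(range(archive_size), key=lambda i: fitness(archive[i]))
        if fitness(child) > fitness(archive[worst]):
            archive[worst] = child
    return initial_best, max(archive, key=fitness)

initial_best, best = evolve()
```

The contrast with linear mutation is that bad branches don't poison the run: the archive keeps several lineages alive at once.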
nextos 3 hours ago||
Exactly, that's the way forward.

There are lots of old ideas from evolutionary search worth revisiting given that LLMs can make smarter proposals.

UncleOxidant 2 hours ago|||
That was my impression, including evolutionary programming: it would normally happen at the AST level, but with the LLM it can happen at the source level.
frumiousirc 3 hours ago|||
> There are better techniques for hyper-parameter optimisation, right?

Yes, for example "swarm optimization".

The difference with "autoresearch" (restricting just to the HPO angle) is that the LLM may (at least we hope) beat conventional algorithmic optimization by making better guesses for each trial.

For example, perhaps the problem has an optimization manifold that has been studied in the past and the LLM either has that study in its training set or finds it from a search and learns the relative importance of all the HP axes. Given that, it "knows" not to vary the unimportant axes much and focus on varying the important ones. Someone else did the hard work to understand the problem in the past and the LLM exploits that (again, we may hope).

janalsncm 1 hour ago|||
> The bottleneck in AI/ML/DL is always data (volume & quality) or compute.

Not true at all. The whole point of ML is to find better mappings from X to Y, even for the same X.

Many benchmarks can’t be solved by just throwing more compute at the problem. They need to learn better functions which traditionally requires humans.

And sometimes an algorithm lets you tap into more data. For example transformers had better parallelism than LSTMs -> better compute efficiency.

hun3 4 hours ago||
> There are better techniques for hyper-parameter optimisation, right?

There always are. You need to think about what those would be, though. Autoresearch outsources the thinking to LLMs.

love2read 4 hours ago||
So... It did work. It found bugs (that he didn't know about) and it did optimization (that he hadn't done).
trcf23 1 hour ago|
From what I understood, not so much.

Most of the gains came from fixing a bug plus tuning hyperparameters with Optuna, which is supposed to be quite automatic already (you set the list of all the variables with the values you want to try, and voilà). I guess a simple Claude Code session would fix that in a few minutes instead of a full day.

To me, the main value of Autoresearch would be to test different kinds of architectures. It's sometimes hard to know what to choose, and it would probably give a nice overview.

Anyone used it for exploratory modeling?

dvt 4 hours ago||
Ok, so looking at the commit log[1], I was mostly interested in seeing what the "moonshot ideas" implementations looked like, but basically everything is just hyperparameter tuning. Which is nice, but likely not worth the $$$ spent on the tokens. Am I missing something here?

[1] https://github.com/ykumards/eCLIP/commits/main/autoresearch

DoctorOetker 4 hours ago||
It would seem wise to modify the autoresearch instructions to first estimate the computational cost of each proposal rigorously, then sort and compare the proposals for human review, and to feed the actual computational cost of each executed attempt back with a LoRA adapter.

i.e. perhaps minimal changes to autoresearch would be enough for cost-effective research to occur.

mandevil 3 hours ago||
Optuna or skopt are open source and won't take any GPU time at all to do it.
janalsncm 1 hour ago||
Optuna requires exploring the hyperparameter space which means running the experiments with those hyperparameters.

For a fixed search space it will almost certainly be better though.

1970-01-01 3 hours ago||
> The original paper used several medical X-ray datasets which I don’t have access to anymore, so I needed a new dataset with spatial annotations to test the expert attention mechanism. I picked the Ukiyo-eVG dataset: ~11K Japanese woodblock prints

That's such a weird switch. There's lots of free medical imaging online. Example: https://www.cancerimagingarchive.net/

ykumards 1 hour ago|
That’s true! It felt a bit flippant to give medical data to an agent. Also, I wanted to see if the model would work in other domains!
saidnooneever 1 hour ago||
pretty cool experiment, i thought about someone maybe doing this and am happy you did it in this way. nice writeup too. this made me giggle a bit: "At one point it got tired of waiting for training to finish and just ended the conversation. I wouldn’t give it full autonomy just yet :)"

thanks for sharing your results and the road to them!

ykumards 1 hour ago|
Thank you, glad you liked it!
lucasay 3 hours ago|
This feels less like automated research and more like structured trial and error with a decent feedback loop. Still useful, but I think the real bottleneck is how good your eval metric is. If that’s weak, the whole loop just optimizes for the wrong thing faster.
Almondsetat 1 hour ago||
Designing a good fitness function, a tale as old as time...
kridsdale1 3 hours ago||
I mean, isn’t that “the scientific method”?
lucasay 2 hours ago||
Partially—but science also questions the hypothesis and the metric. This mostly assumes both are correct and just optimizes within that box.