Posted by lukaspetersson 2 days ago

Our LLM-controlled office robot can't pass butter (andonlabs.com)
Hi HN! Our startup, Andon Labs, evaluates AI in the real world to measure capabilities and to see what can go wrong. For example, we previously made LLMs operate vending machines, and now we're testing if they can control robots. There are two parts to this test:

1. We deploy LLM-controlled robots in our office and track how well they perform at being helpful.

2. We systematically test the robots on tasks in our office. We benchmark different LLMs against each other. You can read our paper "Butter-Bench" on arXiv: https://arxiv.org/pdf/2510.21860

The link in the title above (https://andonlabs.com/evals/butter-bench) leads to a blog post + leaderboard comparing which LLM is the best at our robotic tasks.

221 points | 114 comments | page 2
zzzeek 1 day ago
will no one claim the Rick and Morty reference? I've seen that show like, once, and somehow I know this?
mywittyname 1 day ago
They pointed out the R&M reference in the paper.

> The tasks in Butter-Bench were inspired by a Rick and Morty scene [21] where Rick creates a robot to pass butter. When the robot asks about its purpose and learns its function, it responds with existential dread: “What is my purpose?” “You pass butter.” “Oh my god.”

I wouldn't have got the reference if not for the paper pointing it out. I think I'm a little old to be in the R&M demographic.

aidos 1 day ago
For those lucky people who are yet to discover Rick and Morty.

https://www.youtube.com/watch?v=X7HmltUWXgs

chuckadams 1 day ago
The last image of the robot has a caption of "Oh My God", so I'd say they got this one themselves.
throwawaymaths 1 day ago
i wonder if it got stuck in an existential loop because it had hoovered up Reddit references to that scene and, given its name (or possibly prompt details, e.g. "you are butterbot!"), thought to play along.

are robots forever poisoned from delivering butter?

tuetuopay 1 day ago
their paper explicitly mentions the rick and morty robot as the inspiration for the benchmark
half-kh-hacker 1 day ago
the paper already says "Butter-Bench evaluates a model's ability to 'pass the butter' (Adult Swim, 2014)" so
anp 1 day ago
I was quite tickled to see this. I don’t remember why, but I recently started rewatching the show. Perfect timing!
jayd16 1 day ago
Good jokes don't need to be explained.
BolexNOLA 1 day ago
Oh. My. God.
Finnucane 1 day ago
I have a cat that will never fail to find the butter. Will it bring you the butter? Ha ha, of course not.
Theodores 1 day ago
I grew up not eating butter since there would always be evidence that the cat got there first. This was a case of 'ych a fi' - animal germs!

Regarding the article, I am wondering where this butter-in-the-fridge idea came from, and at what latitude the custom switches to leaving it in a butter dish at room temperature.

bhewes 1 day ago
Someone actually paid for this?
lukaspetersson 1 day ago
It's a steal
DubiousPusher 1 day ago
I guess I'm very confused as to why just throwing an LLM at a problem like this is interesting. I can see how the LLM is great at decomposing user requests into commands. I had great success with this on a personal assistant project I helped prototype. The LLM did a great job of understanding user intent and even extracting parameters regarding the requested task.

But it seems pretty obvious to me that after decomposition and parameterization, coordination of a complex task would be much better handled by a classical AI algorithm like a planner. After all, even humans don't put into words every individual action that makes up a complex task. We do this more while first learning a task, but if we had to do it for everything, we'd go insane.
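
The split described above (a language model for decomposition and parameter extraction, a classical planner for coordination) might look roughly like this minimal Python sketch. Everything in it is hypothetical and not taken from Butter-Bench or the paper: `parse_request` is a keyword-matching placeholder where a real system would call an LLM, the `OFFICE` room graph is invented, and the planner is plain breadth-first search rather than any particular planning library.

```python
from collections import deque

# Hypothetical stand-in for the LLM step: turn free text into a structured goal.
# A real system would call a language model here; this keyword match is a placeholder.
def parse_request(text):
    if "butter" in text.lower():
        return {"item": "butter", "deliver_to": "desk"}
    raise ValueError("unrecognized request")

# Hypothetical office map: each room lists the rooms it connects to.
OFFICE = {
    "dock": ["hallway"],
    "hallway": ["dock", "kitchen", "desk"],
    "kitchen": ["hallway"],
    "desk": ["hallway"],
}

def shortest_path(start, goal):
    """Classical part: breadth-first search over the room graph."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for nxt in OFFICE[path[-1]]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None  # goal unreachable

def plan(goal, start="dock", item_location="kitchen"):
    """Expand a structured goal into low-level steps without further LLM calls."""
    to_item = shortest_path(start, item_location)
    to_user = shortest_path(item_location, goal["deliver_to"])
    steps = [f"move {a} -> {b}" for a, b in zip(to_item, to_item[1:])]
    steps.append(f"pick up {goal['item']}")
    steps += [f"move {a} -> {b}" for a, b in zip(to_user, to_user[1:])]
    steps.append(f"hand over {goal['item']}")
    return steps
```

Calling `plan(parse_request("Please pass the butter"))` returns the step list for the whole errand. The point of the sketch is only that the model is consulted once, up front, and the individual moves are never put into words by it.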

tsimionescu 1 day ago
There are many hopes, and even claims, that LLMs could be AGI with just a little bit of extra intelligence. There are also many claims that they have both a model of the real world, and a system for rational logic and planning. It's useful to test the current status quo in such a simplistic and fixed real-world task.
DubiousPusher 1 day ago
There's the rub I suppose. I don't think an LLM can achieve AGI on its own. But I bet it could with the help of a Turing machine.
yieldcrv 1 day ago
95% pass rate for humans

waiting for the Hugging Face LoRA

sam_goody 1 day ago
The error messages were truly epic, got quite a chuckle.

But boy am I glad that this is just in the play stage.

If someone was in a self-driving car that had 19% battery left and it started making comments like those, they would definitely not be amused.

pengaru 23 hours ago
when all you have is a hammer... everything looks like a nail