Posted by lukaspetersson 2 days ago
1. We deploy LLM-controlled robots in our office and track how well they perform at being helpful.
2. We systematically test the robots on tasks in our office. We benchmark different LLMs against each other. You can read our paper "Butter-Bench" on arXiv: https://arxiv.org/pdf/2510.21860
The link in the title above (https://andonlabs.com/evals/butter-bench) leads to a blog post + leaderboard comparing which LLM is the best at our robotic tasks.
> The tasks in Butter-Bench were inspired by a Rick and Morty scene [21] where Rick creates a robot to pass butter. When the robot asks about its purpose and learns its function, it responds with existential dread: “What is my purpose?” “You pass butter.” “Oh my god.”
I wouldn't have got the reference if not for the paper pointing it out. I think I'm a little old to be in the R&M demographic.
Are robots forever poisoned from delivering butter?
Regarding the article, I wonder where the butter-in-the-fridge idea came from, and at what latitude the custom switches to leaving it in a butter dish at room temperature.
But it seems pretty obvious to me that after decomposition and parameterization, coordination of a complex task would be much better handled by a classical AI algorithm like a planner. After all, even humans don't put into words every individual action that makes up a complex task. We do this more while first learning a task, but if we had to do it for everything, we'd go insane.
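To make the planner idea concrete, here is a minimal sketch of what "classical" coordination could look like: a breadth-first search over STRIPS-style actions. Everything in it is an assumption for illustration. The `Action` type, the `plan` function, and the fact and action names (`robot_at_dock`, `pick_up_butter`, and so on) are hypothetical and are not taken from Butter-Bench or the paper. The LLM's job would be limited to producing the facts and the goal; the search handles the sequencing.

```python
from collections import deque
from typing import NamedTuple


class Action(NamedTuple):
    """A STRIPS-style action over a set of string facts (hypothetical)."""
    name: str
    preconditions: frozenset
    add_effects: frozenset
    delete_effects: frozenset


def plan(initial, goal, actions):
    """Breadth-first search over world states.

    Returns the shortest list of action names that reaches a state
    containing every goal fact, or None if no such sequence exists.
    """
    start = frozenset(initial)
    goal = frozenset(goal)
    frontier = deque([(start, [])])
    seen = {start}
    while frontier:
        state, steps = frontier.popleft()
        if goal <= state:
            return steps
        for action in actions:
            if action.preconditions <= state:
                next_state = (state - action.delete_effects) | action.add_effects
                if next_state not in seen:
                    seen.add(next_state)
                    frontier.append((next_state, steps + [action.name]))
    return None


# Hypothetical butter-passing domain, invented for this example.
ACTIONS = [
    Action("go_to_kitchen",
           frozenset({"robot_at_dock"}),
           frozenset({"robot_at_kitchen"}),
           frozenset({"robot_at_dock"})),
    Action("pick_up_butter",
           frozenset({"robot_at_kitchen", "butter_in_kitchen"}),
           frozenset({"robot_has_butter"}),
           frozenset({"butter_in_kitchen"})),
    Action("go_to_user",
           frozenset({"robot_at_kitchen"}),
           frozenset({"robot_at_user"}),
           frozenset({"robot_at_kitchen"})),
    Action("hand_over_butter",
           frozenset({"robot_at_user", "robot_has_butter"}),
           frozenset({"user_has_butter"}),
           frozenset({"robot_has_butter"})),
]

steps = plan({"robot_at_dock", "butter_in_kitchen"}, {"user_has_butter"}, ACTIONS)
print(steps)
```

Breadth-first search is only the simplest choice here; a real system would more likely use a heuristic planner, and none of this addresses the perception and navigation parts of the task.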
Waiting for the Hugging Face LoRA.
But boy am I glad that this is just in the play stage.
If someone were in a self-driving car that had 19% battery left and it started making comments like those, they would definitely not be amused.