Posted by lukaspetersson 10/28/2025

Our LLM-controlled office robot can't pass butter (andonlabs.com)
Hi HN! Our startup, Andon Labs, evaluates AI in the real world to measure capabilities and to see what can go wrong. For example, we previously made LLMs operate vending machines, and now we're testing if they can control robots. There are two parts to this test:

1. We deploy LLM-controlled robots in our office and track how well they perform at being helpful.

2. We systematically test the robots on tasks in our office. We benchmark different LLMs against each other. You can read our paper "Butter-Bench" on arXiv: https://arxiv.org/pdf/2510.21860

The link in the title above (https://andonlabs.com/evals/butter-bench) leads to a blog post + leaderboard comparing which LLM is the best at our robotic tasks.

229 points | 117 comments | page 3
JEFFREYBURKE 10/30/2025
[dead]
hidelooktropic 10/28/2025
How can I get early access to this "Human" model on the benchmarks? /s
throwawayffffas 10/29/2025
It feels misguided to me.

I think the real value of LLMs for robotics is in human language parsing.

That means turning "pass the butter" into a list of tasks the rest of the system is trained to perform: locate an object, pick up the object, locate a target area, drop off the object.
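
A rough sketch of that split, in Python, might look like the following. Everything here is an assumption for illustration: the `Task` type, the primitive names, and `parse_command` are hypothetical, and the string matching only stands in for where an LLM would do the language parsing. It is not how Butter-Bench or Andon Labs' robot works.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    action: str  # one of the primitives the rest of the system is assumed to handle
    target: str  # object or area the primitive applies to


# Hypothetical primitive set, named after the four steps in the comment above.
PRIMITIVES = ("locate_object", "pick_up", "locate_target_area", "drop_off")


def plan_delivery(obj: str, destination: str) -> list[Task]:
    """Expand a parsed 'bring obj to destination' request into primitive tasks."""
    return [
        Task("locate_object", obj),
        Task("pick_up", obj),
        Task("locate_target_area", destination),
        Task("drop_off", obj),
    ]


def parse_command(text: str) -> list[Task]:
    """Placeholder for the LLM step: only understands 'pass the <object>'."""
    words = text.lower().strip().rstrip(".!").split()
    if len(words) >= 3 and words[0] == "pass" and words[1] == "the":
        obj = " ".join(words[2:])
        # Assumption: "pass" means deliver to whoever issued the command.
        return plan_delivery(obj, "requester")
    raise ValueError(f"unsupported command: {text!r}")


if __name__ == "__main__":
    for task in parse_command("Pass the butter"):
        print(task.action, task.target)
```

The point of the design is that the language model only produces the task list, while each primitive is carried out by whatever perception and control stack the robot already has.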

fsckboy 10/28/2025
>Our LLM-controlled office robot can't pass butter

was the script of Last Tango in Paris part of the training data? maybe it's just scared...