Our LLM-controlled office robot can't pass butter

Posted by lukaspetersson 10/28/2025


Hi HN! Our startup, Andon Labs, evaluates AI in the real world to measure capabilities and to see what can go wrong. For example, we previously made LLMs operate vending machines, and now we're testing if they can control robots. There are two parts to this test:

1. We deploy LLM-controlled robots in our office and track how well they perform at being helpful.

2. We systematically test the robots on tasks in our office. We benchmark different LLMs against each other. You can read our paper "Butter-Bench" on arXiv: https://arxiv.org/pdf/2510.21860

The link in the title above (https://andonlabs.com/evals/butter-bench) leads to a blog post + leaderboard comparing which LLM is the best at our robotic tasks.

229 points | 117 comments

amelius 10/28/2025

> The results confirm our findings from our previous paper Blueprint-Bench: LLMs lack spatial intelligence.

But I suppose that if you can train an LLM to play chess, you can also train it to have spatial awareness.

tracerbulletx 10/28/2025

Probably not optimal for it. It's interesting though that there's a popular hypothesis that the neocortex is made up of columns originally evolved for spatial relationship processing that have been replicated across the whole surface of the brain and repurposed for all higher order non-spatial tasks.

SrslyJosh 10/28/2025

The key word here is "if".

https://www.linkedin.com/posts/robert-jr-caruso-23080180_ai-...

root_axis 10/28/2025

I don't see why that would be the case. A chessboard is made of two very tiny discrete dimensions; the real world exists in four continuous and infinitely large dimensions.

ge96 10/28/2025

Funny I was looking at the chart like "what model is Human?"

Animats 10/29/2025

Using an LLM for robot actuator control seems like pounding a screw. Wrong tool for the job.

Someday, and given the billions being thrown at the problem, not too far out, someone will figure out what the right tool is.

sam_goody 10/28/2025

The error messages were truly epic, got quite a chuckle.

But boy am I glad that this is just in the play stage.

If someone were in a self-driving car that had 19% battery left and it started making comments like those, they would definitely not be amused.

yieldcrv 10/29/2025

95% pass rate for humans

waiting for the Hugging Face LoRA

bhewes 10/28/2025

Someone actually paid for this?

lukaspetersson 10/28/2025

It's a steal

pengaru 10/29/2025

when all you have is a hammer... everything looks like a nail
