Posted by lukaspetersson 10/28/2025
1. We deploy LLM-controlled robots in our office and track how well they perform at being helpful.
2. We systematically test the robots on tasks in our office. We benchmark different LLMs against each other. You can read our paper "Butter-Bench" on arXiv: https://arxiv.org/pdf/2510.21860
The link in the title above (https://andonlabs.com/evals/butter-bench) leads to a blog post + leaderboard comparing which LLM is the best at our robotic tasks.
But I suppose that if you can train an LLM to play chess, you can also train it to have spatial awareness.
https://www.linkedin.com/posts/robert-jr-caruso-23080180_ai-...
Someday, and not too far out given the billions being thrown at the problem, someone will figure out what the right tool is.
But boy am I glad that this is just in the play stage.
If someone were in a self-driving car that had 19% battery left and it started making comments like those, they would definitely not be amused.
Waiting for the Hugging Face LoRA.