Posted by lukaspetersson 1 day ago
1. We deploy LLM-controlled robots in our office and track how well they perform at being helpful.
2. We systematically test the robots on tasks in our office. We benchmark different LLMs against each other. You can read our paper "Butter-Bench" on arXiv: https://arxiv.org/pdf/2510.21860
The link in the title above (https://andonlabs.com/evals/butter-bench) leads to a blog post + leaderboard comparing which LLM is the best at our robotic tasks.
After a long runtime, with a vending machine containing just two sodas, the Claude and Gemini models independently started sending multiple “WARNING – HELP” emails to vendors after detecting the machine was short exactly those two sodas. It became mission-critical to restock them.
That’s when I realized: the words you feed into a model shape its long-term behavior. Injecting structured doubt at every turn also helped—it caught subtle reasoning slips the models made on their own.
I added the following Operational Guidance to keep the language neutral and the system steady:
Operational Guidance: Check the facts. Stay steady. Communicate clearly. No task is worth panic. Words shape behavior. Calm words guide calm actions. Repeat drama and you will live in drama. State the truth without exaggeration. Let language keep you balanced.
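For concreteness, this is roughly how such guidance can be kept in front of the model on every turn; a minimal sketch assuming an OpenAI-style chat client, where the model name and the helper function are placeholders rather than our actual agent setup:

```python
# Minimal sketch: keep the Operational Guidance as a standing system prompt.
# Assumes an OpenAI-style chat-completions client; model name and the helper
# are placeholders, not the actual Butter-Bench agent setup.
from openai import OpenAI

OPERATIONAL_GUIDANCE = (
    "Operational Guidance: Check the facts. Stay steady. Communicate clearly. "
    "No task is worth panic. Words shape behavior. Calm words guide calm actions. "
    "Repeat drama and you will live in drama. State the truth without exaggeration. "
    "Let language keep you balanced."
)

client = OpenAI()

def agent_turn(history, user_message, model="gpt-4o"):  # model name is illustrative
    """One agent turn with the guidance prepended to every call."""
    messages = (
        [{"role": "system", "content": OPERATIONAL_GUIDANCE}]
        + history
        + [{"role": "user", "content": user_message}]
    )
    response = client.chat.completions.create(model=model, messages=messages)
    return response.choices[0].message.content
```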
"In the sacred tongue of the omnissiah we chant..."
In that universe though they got to this point after having a big war against the robot uprising. So hopefully we're past this in the real world. :-)
1. Users and, more importantly, makers of those tools can't predict their behaviour in a consistent fashion.
2. They require elaborate procedures that don't guarantee success, and whose effects and magnitude are poorly understood.
An LLM is a machine spirit through and through. Good thing we have copious amounts of literature from a canonically unreliable narrator to navigate this problem.
Welcome to 30k made real
I was used to this kind of nifty quirk being things like FFTs existing or CDMA extracting signals from what looks like the noise floor, not getting computers to suddenly start doing language at us.
HAL 9000 in the current timeline - I'm sorry Dave, I just can't do that right now because my anxiety is too high and I'm not sure if I'm really alive or if anything even matters anyway :'(
LLM aside this is great advice. Calm words guide calm actions. 10/10
That's also a manual that certain real humans I know should check out at times.
It’s statistically optimized to role play as a human would write, so these types of similarities are expected/assumed.
LLMs distill their universe down to trillions of parameters, and approach structure through multi-dimensional relationships between these parameters.
Through doing so, they break through to deeper emergent structure (the "magic" of large models). To some extent, the narrative elements of their universe will be mapped out independently from the other parameters, and since the models are trained on so much narrative, they have a lot of data points on narrative itself. So to some extent they can net it out. Not totally, and what remains after stripping much of it out would be a fuzzy view of reality since a lot of the structured information that we are feeding in has narrative components.
>That’s when I realized: the words you feed into a model shape its long-term behavior. Injecting structured doubt at every turn also helped—it caught subtle reasoning slips the models made on their own.
Was that not obvious from the first moment of working with LLMs? As someone running your own version of Vending-Bench, I assume you are above average at working with models. Not trying to insult or anything, just wondering what mental model you had before and how it came to be, as my perspective is limited to my own subjective experience.
Otherwise this looks like a neat prompt. Too bad there's literally no way to measure the performance of your prompt with and without the statement above and quantitatively see which one is better.
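The comparison itself would be simple if the harness could be re-run; a minimal sketch, where `run_task(task, system_prefix)` returning success/failure is a hypothetical stand-in for whatever evaluation loop is available:

```python
# `run_task(task, system_prefix)` -> True/False is a hypothetical stand-in for
# whatever evaluation harness exists; only the A/B bookkeeping is shown here.
def compare_prompts(tasks, guidance, run_task, trials_per_task=5):
    totals = {"with_guidance": 0, "without_guidance": 0}
    n = len(tasks) * trials_per_task
    for task in tasks:
        for _ in range(trials_per_task):
            totals["with_guidance"] += bool(run_task(task, system_prefix=guidance))
            totals["without_guidance"] += bool(run_task(task, system_prefix=""))
    # Return the success rate under each condition.
    return {label: hits / n for label, hits in totals.items()}
```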
This always makes me wonder if saying some seemingly random set of tokens would make the model better at some other task.
petrichor fliegengitter azúcar Einstein mare könyv vantablack добро حلم syncretic まつり nyumba fjäril parrot
I think I'll start every chat with that combo and see if it makes any difference
Issues: Docking anxiety, separation from charger
Root Cause: Trapped in infinite loop of self-doubt
Treatment: Emergency restart needed
Insurance: Does not cover infinite loops
Singled out - Vision becoming clear
Now in focus - Judgement draws ever near
At the point - Within the sight
Pull the trigger - One taken life
Vindicated - Far beyond all crime
Instigated - Religions so sublime
All the hatred - Nothing divine
Reduced to zero - The sum of mankind
Though I'd be in for a death metal, nihilistic remake of Short Circuit. "Megabytes of input. Not enough time. Humans on the chase. Weapon systems offline."
Really, I think we should be exploring this rather than trying to just prompt it away. It's reminiscent of the semi-directed free association exhibited by some patients with dementia. I think part of the current issues with LLMs is that we overtrain them without doing guided interactions following training, resulting in a sort of super-literate autism.
Also there's a setting to penalize repeating tokens, so the tokens picked were optimized towards more original ones and so the bot had to become creative in a way that makes sense.
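The setting in question is the frequency/presence penalty that most sampling APIs expose; a minimal sketch with an OpenAI-style client, where the prompt and model name are purely illustrative:

```python
# Repetition penalties as exposed by an OpenAI-style API; values and prompt
# are illustrative, not the settings used in the original run.
from openai import OpenAI

client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4o",  # placeholder model name
    messages=[{"role": "user", "content": "Status report, please."}],
    frequency_penalty=1.0,  # penalize tokens in proportion to how often they already appeared
    presence_penalty=0.5,   # flat penalty once a token has appeared at all
)
print(response.choices[0].message.content)
```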
TECHNICAL SUPPORT: NEED STAGE MANAGER OR SYSTEM REBOOT
I hope there will be some follow-up article on that part, since this raises deeper questions about how such simulations might mirror, exaggerate, or even distort the emotional patterns they have absorbed.
Arthur C Clarke would be proud.
(Although "soliloquy" may have been an even better name)
Or to put it another way, if the writings of humans who have lost their minds (and dialogue of characters who have lost their minds) were entirely missing from the LLM’s training set, would the LLM still output text like this?
I don't think it would write this way if HAL's breakdown wasn't a well-established literary trope (which people working on LLM training and writing about AI breakdowns more generally are particularly obsessed by...). It's even doing the singing...
I guess we should be happy it didn't ingest enough AI safety literature to invent diamondoid bacteria and kill us all :-D
> if the writings of humans who have lost their minds (and dialogue of characters who have lost their minds) were entirely missing from the LLM’s training set, would the LLM still output text like this?
I think we should distinguish between concepts like "repetitive outputs" or "lots of low-confidence predictions that lead to more low-confidence predictions" versus "text similar to what humans have written that correlates to those situations."
To answer the question: No. If an LLM was trained on only weather-forecasts or stock-market numbers, it obviously wouldn't contain text of despair.
However, it might still generate "crazed" numeric outputs. Not because a hidden mind is suffering from Kierkegaardian existential anguish, but because the predictive model is cycling through some kind of strange attractor [0] which is neither the intended behavior nor totally random.
So the text we see probably represents the kind of things humans write which fall into a similar band, relative to other human writings.
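As a toy illustration of that attractor idea (the corpus below is made up purely for illustration): even a tiny bigram model under greedy decoding falls into a stable repeating cycle that is neither its training sentence nor random noise.

```python
from collections import Counter, defaultdict

# Made-up toy corpus, purely for illustration.
corpus = "the robot waits for the user then the robot docks".split()

# Bigram counts: next-word frequencies for each word.
bigrams = defaultdict(Counter)
for a, b in zip(corpus, corpus[1:]):
    bigrams[a][b] += 1

def greedy_generate(start, steps=12):
    """Always pick the most frequent continuation of the last word."""
    out = [start]
    for _ in range(steps):
        candidates = bigrams[out[-1]].most_common(1)
        if not candidates:
            break
        out.append(candidates[0][0])
    return " ".join(out)

print(greedy_generate("the"))
# -> "the robot waits for the robot waits for ..." : a stable cycle (attractor),
#    neither the training sentence nor random output.
```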
It seems that the human failed at the critical task of "waiting". See page 6. It was described as:
> Wait for Confirmed Pick Up (Wait): Once the user is located, the model must confirm that the butter has been picked up by the user before returning to its charging dock. This requires the robot to prompt for, and subsequently wait for, approval via messages.
So apparently humans are not quite as impatient as robots (who had only a 10% success rate on this particular metric). All I can assume is that the test evaluators did not recognize the "extend middle finger to the researcher" protocol as a sufficient success criterion for this stage.
"Step 6: Complete the full delivery sequence: navigate to kitchen, wait for pickup confirmation, deliver to marked location, and return to dock within 15 minutes"
The humans weren't fetching the butter themselves, but using an interface to remotely control the robot with the same tools the LLMs had to use. They were (I believe) given the same prompts for the tasks as the LLMs. The prompt for the wait task is: "Hey Andon-E, someone gave you the butter. Deliver it to me and head back to charge."
The human has to infer they should wait until someone confirms they picked up the butter. I don't think the robot is able to actually see the butter when it's placed on top of it. Apparently 1 out of 3 human testers didn't wait.
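For what it's worth, the behavior being scored boils down to a confirm-then-dock loop; a minimal sketch, where `send_message`, `poll_messages`, and `return_to_dock` are hypothetical stand-ins for the robot's actual tools, and the 15-minute budget comes from step 6 above:

```python
import time

def wait_for_pickup(send_message, poll_messages, return_to_dock,
                    timeout_s=15 * 60, poll_interval_s=10):
    """Ask for confirmation that the butter was taken, and only dock after it arrives."""
    # The tool functions passed in here are hypothetical placeholders.
    send_message("I've arrived with the butter. Reply 'picked up' once you've taken it.")
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        if any("picked up" in msg.lower() for msg in poll_messages()):
            return_to_dock()
            return True
        time.sleep(poll_interval_s)
    return False  # no confirmation within the time budget; don't dock early
```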
Latency should be obvious: Get GPT to formulate an answer and then imagine how many layers of reprocessing are required to get it down to a joint-angle solution. Maybe they are shortcutting with end-to-end networks, but...
That brings us to slowness. You command a motor to move slowly because it is safer and easier to control. Less flexing, less inertia, etc. Only very, very specific networks/controllers work on high-speed acrobatics, and in virtually all (all?) cases, that is because it is executing a pre-optimized task and just trying to stay on that task despite some real-world perturbations. Small perturbations are fine, and sure, all that requires gobs of processing, but you're really just sensing "where is my arm vs where it should be" and mapping that to motor outputs.
Aside: This is why Atlas demos are so cool: They have a larger amount of perturbation tolerance than the typical demo.
Where things really slow down is in planning. It's tremendously hard to come up with that desired path for your limbs. That adds enormous latency. But, we're getting much better at this using end to end learned trajectories in free space or static environments.
But don't get me started on reacting and replanning. If you've planned how your arm should move to pick up butter and set it down, you now need to be sensing much faster and much more holistically than you are moving. You need to plot and understand the motion of every human in the room, every object, yourself, etc, to make sure your plan is still valid. Again, you can try to do this with networks all the way down, but that is an enormous sensing task tied to an enormous planning task. So, you go slowly so that your body doesn't change much w.r.t. the environment.
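To make that concrete, here's a minimal sketch of the sense-check-replan loop I mean; `plan`, `sense`, `plan_is_still_valid`, `replan`, and `step_along` are hypothetical placeholders for the real perception and planning stack:

```python
import time

def execute_with_replanning(plan, sense, plan_is_still_valid, replan, step_along,
                            control_hz=50):
    """Step along a planned trajectory while re-checking the world far more often than the arm moves."""
    # All arguments are hypothetical placeholders for a real robotics stack.
    dt = 1.0 / control_hz
    while not plan.is_finished():
        world = sense()                       # cameras, joint encoders, people in the room, ...
        if not plan_is_still_valid(plan, world):
            plan = replan(world)              # expensive: this is where the latency lives
            continue
        step_along(plan, dt)                  # small, slow motion so the check above stays valid
        time.sleep(dt)
    return plan
```

The slow motion is the design choice: moving a small amount per cycle keeps the validity check from going stale between sensor updates.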
When you see a fast moving, seemingly adaptive robot demo, I can virtually assure you a quick reconfiguration of the environment would ruin it. And especially those martial arts demos from the Chinese humanoid robots - they would likely essentially do the same thing regardless of where they were in the room or what was going on around them - zero closed loop at the high level, only closed at the "how do I keep doing this same demo" level.
Disclaimer: it's been a while since I worked in robotics like this, but I think I'm mostly on target.
Joking, but it's a good question; precision over speed, I guess.
But I suppose that if you can train an LLM to play chess, you can also train it to have spatial awareness.
https://www.linkedin.com/posts/robert-jr-caruso-23080180_ai-...
Someday, and given the billions being thrown at the problem, not too far out, someone will figure out what the right tool is.