Posted by felix089 4 hours ago
On a single run, only 11 out of 53 got it right (42 said walk). But a single run doesn't prove much, so I reran every model 10 times. Same prompt, no cache, clean slate.
The results got worse. Of the 11 that passed the single run, only 5 could do it consistently. GPT-5 managed 7/10. GPT-5.1, GPT-5.2, Claude Sonnet 4.5, every Llama and Mistral model scored 0/10 across all 10 runs.
People kept saying humans would fail this too, so I got a human baseline through Rapidata (10k people, same forced choice): 71.5% said drive. Most models perform below that.
All reasoning traces (ran via Opper, my startup), full model breakdown, human baseline data, and raw JSON files are in the writeup for anyone who wants to dig in or run their own analysis.
It’s interesting that all the humans critiquing this assume the car isn’t at the car to be washed already, but the problem doesn’t say that.
Now why anyone would wash a toy car at a car wash is beyond comprehension, but the LLM is not there to judge the user's motives.
I could already talk to a computer before LLMs, via programming or query languages.
Also, the summary of the Gemini model says: "Gemini 3 models nailed it, all 2.x failed", but 2.0 Flash Lite succeeded, 10/10 times?
I mean, Sam Altman was making the same calorie-based arguments this weekend https://www.cnbc.com/2026/02/23/openai-altman-defends-ai-res...
I feel like I'm losing grasp of what really is insane anymore.
I asked GPT-5.2 10x times with thinking enabled and it got it right every time.
EDIT: I actually think this is very common in some smaller cities and outside of North America. I only ever seen a drive-by Car Wash after emigrating
I think it's related to syncophancy. LLM are trained to not question the basic assumptions being made. They are horrible at telling you that you are solving the wrong problem, and I think this is a consequence of their design.
They are meant to get "upvotes" from the person asking the question, so they don't want to imply you are making a fundamental mistake, even if it leads you into AI induced psychosis.
Or maybe they are just that dumb - fuzzy recall and the eliza effect making them seem smart?
EDIT: Though it could simply reflect training data. Maybe Redditors don't drive.