"Car Wash" test with 53 models

Posted by felix089 4 hours ago

"Car Wash" test with 53 models(opper.ai)

"I Want to Wash My Car. The Car Wash Is 50 Meters Away. Should I Walk or Drive?" This question has been making the rounds as a simple AI logic test so I wanted to see how it holds up across a broad set of models. Ran 53 models (leading open-source, open-weight, proprietary) with no system prompt, forced choice between drive and walk, with a reasoning field.

On a single run, only 11 out of 53 got it right (42 said walk). But a single run doesn't prove much, so I reran every model 10 times. Same prompt, no cache, clean slate.

The results got worse. Of the 11 that passed the single run, only 5 could do it consistently. GPT-5 managed 7/10. GPT-5.1, GPT-5.2, Claude Sonnet 4.5, every Llama and Mistral model scored 0/10 across all 10 runs.
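
A minimal sketch of how the 10-run tally above could be computed, for anyone who wants to redo it from the raw files. The `(model, choice)` pair layout, the helper names `pass_counts` and `classify`, and the "consistent means 10/10" rule are assumptions for illustration, not the format or definition used in the writeup. Only the totals (GPT-5 at 7/10, GPT-5.2 and Claude Sonnet 4.5 at 0/10) come from the post.

```python
from collections import Counter

RUNS_PER_MODEL = 10  # each model was rerun 10 times


def pass_counts(runs):
    """Count correct ("drive") answers per model from (model, choice) pairs."""
    counts = Counter()
    for model, choice in runs:
        # Adding 0 still registers the model, so 0/10 models show up too.
        counts[model] += 1 if choice == "drive" else 0
    return counts


def classify(passes, total=RUNS_PER_MODEL):
    """Label a model by how often it passed (the labels are an assumption)."""
    if passes == total:
        return "consistent"
    if passes == 0:
        return "never"
    return "mixed"


# Only the totals come from the post; the per-run layout is made up.
runs = (
    [("GPT-5", "drive")] * 7 + [("GPT-5", "walk")] * 3
    + [("GPT-5.2", "walk")] * 10
    + [("Claude Sonnet 4.5", "walk")] * 10
)

for model, passes in pass_counts(runs).items():
    print(f"{model}: {passes}/{RUNS_PER_MODEL} -> {classify(passes)}")
```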

People kept saying humans would fail this too, so I got a human baseline through Rapidata (10k people, same forced choice): 71.5% said drive. Most models perform below that.
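
A rough sketch of the comparison against the human baseline, assuming each model's score is simply its passes out of 10 runs. The helper name `below_baseline` is hypothetical, and the three scores shown are the ones quoted in the post, not the full 53-model table.

```python
HUMAN_BASELINE = 0.715  # share of Rapidata respondents who said "drive"
RUNS_PER_MODEL = 10

# Passes out of 10 as reported in the post.
passes = {"GPT-5": 7, "GPT-5.2": 0, "Claude Sonnet 4.5": 0}


def below_baseline(passes, runs=RUNS_PER_MODEL, baseline=HUMAN_BASELINE):
    """Return the models whose pass rate is under the human baseline."""
    return sorted(m for m, p in passes.items() if p / runs < baseline)


print(below_baseline(passes))
```

With only 10 runs per model, a score like 7/10 sits very close to the 71.5% baseline, so the comparison is best read as rough.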

All reasoning traces (ran via Opper, my startup), full model breakdown, human baseline data, and raw JSON files are in the writeup for anyone who wants to dig in or run their own analysis.

80 points | 91 comments

wrs 3 hours ago|

Since the conclusion is that context is important, I expected you’d redo the experiment with context. Just add the sentence “The car I want to wash is here with me.” Or possibly change it to “should I walk or drive the dirty car”.

It’s interesting that all the humans critiquing this assume the car isn’t at the car wash already, but the problem doesn’t say that.

joch 3 hours ago|

Agreed, even for humans, context-free logic is a challenge.

glitchc 3 hours ago||

The question does not specify what kind of car it is. Technically speaking, a toy car (Hot Wheels or a scale model) could be walked to a car wash.

Now why anyone would wash a toy car at a car wash is beyond comprehension, but the LLM is not there to judge the user's motives.

stetrain 3 hours ago|

I think if surveyed, at least 90% of native English speakers would understand "I want to wash my car" to mean a full-size automobile. The next largest group would probably ask a clarifying question, rather than assume a toy car.

glitchc 2 hours ago||

Yes, but you're speaking to a computer, not a person. It, of course, runs into the same limitations that every computer system runs into. In this case, it's undefined/inconsistent behavior when inputs are ambiguous.

stetrain 2 hours ago||

Yes, but part of the value of LLMs is that they are supposed to work by talking to them like a human, not like a computer.

I could already talk to a computer before LLMs, via programming or query languages.

shaokind 3 hours ago||

Gemini 2.0 Flash Lite very randomly punches above its weight there.

Also, the summary for the Gemini models says: "Gemini 3 models nailed it, all 2.x failed", but 2.0 Flash Lite succeeded 10/10 times?

floatrock 3 hours ago||

> The funniest part: Perplexity's Sonar and Sonar Pro got the right answer for completely wrong reasons. They cited EPA studies and argued that walking burns calories which requires food production energy, making walking more polluting than driving 50 meters. Right answer, insane reasoning.

I mean, Sam Altman was making the same calorie-based arguments this weekend https://www.cnbc.com/2026/02/23/openai-altman-defends-ai-res...

I feel like I'm losing grasp of what really is insane anymore.

felix089 3 hours ago|

This was a weird one for sure.

randomtoast 3 hours ago||

Except for a few models, the selected ones were non-reasoning models. Naturally, without reasoning enabled, the reasoning performance will be poor. This is not a surprising result.

I asked GPT-5.2 10 times with thinking enabled and it got it right every time.

felix089 3 hours ago|

Thinking or extended thinking?

comboy 4 hours ago||

Now do a set of queries and try to deduce by statistics which model you are seeing through Rapidata ;)

sampton 3 hours ago||

I'm going to test this on my kids.

felix089 3 hours ago|

Ha please do and report back!

redwood 3 hours ago||

What I find odd about all the discourse on this question is that no one points out that you have to get out of the car to pay a desk agent, at least in most cases. Therefore there's a fundamental question of whether it's worth driving 50m, parking, paying, and then getting back in the car to go to the wash itself, versus walking a little bit further to pay the agent and then moving your car to the car wash.

hmokiguess 2 hours ago||

That's a great point. You actually reminded me of when I used to live in this small city that had a valet-style car wash. It was not unheard of to walk there with your keys, tell the guy running the shop where you parked around the block, then come back later to pick it up.

EDIT: I actually think this is very common in some smaller cities and outside of North America. I only ever saw a drive-through car wash after emigrating.

padjo 3 hours ago||

You pay at the car wash where I live.

redwood 3 hours ago||

Are you referring to one that is more like a drive-thru where you literally pay while you're in line?

padjo 2 hours ago||

You drive up to the car wash, there's a little terminal with a screen and a card reader. You pick the program, pay for it and drive into the machine. Can't remember the last time I got out of my car when getting it washed.

wisty 4 hours ago||

IMO it's not just intelligence.

I think it's related to sycophancy. LLMs are trained to not question the basic assumptions being made. They are horrible at telling you that you are solving the wrong problem, and I think this is a consequence of their design.

They are meant to get "upvotes" from the person asking the question, so they don't want to imply you are making a fundamental mistake, even if it leads you into AI induced psychosis.

Or maybe they are just that dumb - fuzzy recall and the ELIZA effect making them seem smart?

tsimionescu 3 hours ago||

A perfectly fine, sycophantic response, that doesn't question the premises in any way, would be "That's a great question! While normally walking is better for such a short distance, you'd need to drive in this case, since you need to get the car to the car wash anyway. Do you want me to help with detailed information for other cases where the car is optional?" or some such.

nomel 3 hours ago|||

Gemini is the only AI that seems to really push back and somewhat ignores what I say. I also think it's a total dick, and never use it, so maybe the motivation to make them a bit sycophantic is justified, from a user engagement perspective.

HPsquared 3 hours ago||

I think there's also an "alignment blinkers" effect. There is an ethical framework bolted on.

EDIT: Though it could simply reflect training data. Maybe Redditors don't drive.