Posted by optimalsolver 2 days ago
Nobody knows.
Moreover, nobody talks about that, because it's boring and non-polarizing. Instead, supposedly smart people post stupid comments that prevent anyone from understanding that this paper is worthless.
The paper is worthless because it has a clickbait title. Blog posts get voted down for that, so why not this one?
The implicit claim is worthless: failure to navigate a synthetic graph == failure to solve real-world problems. False.
Absolutely no connection to real-world examples, just losing the model in endless graphs.
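To make the complaint concrete, here's a rough sketch of the kind of synthetic graph-navigation puzzle being described: a random edge list, a start, a goal, and a checker that grades the model's answer. The sizes, node labels, and prompt wording are my own guesses, not taken from the paper, and a real generator would also ensure the goal is reachable from the start:

    import random

    # Rough sketch of a synthetic graph-navigation puzzle: random edge
    # list, a start node, a goal node, and a checker for the answer.
    # Sizes, labels, and prompt wording are guesses, not from the paper.
    def make_puzzle(n_nodes=8, n_edges=16, seed=0):
        rng = random.Random(seed)
        nodes = [f"N{i}" for i in range(n_nodes)]
        edges = set()
        while len(edges) < n_edges:
            a, b = rng.sample(nodes, 2)   # random directed edge, no self-loops
            edges.add((a, b))
        start, goal = rng.sample(nodes, 2)
        edge_list = ", ".join(f"{a}->{b}" for a, b in sorted(edges))
        prompt = (f"Edges: {edge_list}. "
                  f"Give a path from {start} to {goal} as a list of nodes.")
        return prompt, edges, start, goal

    # Grade the model's answer: every hop must be a real edge.
    def is_valid_path(path, edges, start, goal):
        if not path or path[0] != start or path[-1] != goal:
            return False
        return all((a, b) in edges for a, b in zip(path, path[1:]))

    prompt, edges, start, goal = make_puzzle()
    print(prompt)  # this string goes to the model; its reply is checked with is_valid_path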
This statement is the dictionary definition of attacking a strawman.
Every new model that is sold to us is sold on the basis that it performs better than the old model on synthetic benchmarks. This paper presents a different benchmark that those same LLMs perform much worse on.
You can certainly criticize the methodology if the authors have erred in some way, but I'm not sure why it's hard to understand the relevance of the topic itself. If benchmarks are so worthless then go tell that to the LLM companies.
I also believe the problem is we don't know what we want: https://news.ycombinator.com/item?id=45509015
If we could make LLMs apply a modest set of logic rules consistently, it would be a win.
When I prompt an RLM, I can see that it spits out reasoning steps. But I don't take that as evidence that RLMs are capable of reasoning.
Imo the paper itself should have touched on the lack of papers discussing what's in the black box that makes them Reasoning LMs. It does mention some tree algorithm that's supposedly key to reasoning capabilities.
I'm by no means attacking the paper, as its intent is to demonstrate the lack of success at solving puzzles that are simple to formulate yet complex.
I was not making a point, I was genuinely asking in case someone knows of papers I could read that make claims, with evidence, that these RLMs actually reason, and how.
Don't ask how it works cuz it's called a "Mind reading language model", duh.
There's no evidence to be had when we only know the inputs and outputs of a black box.
Pattern matching is a component of reason. Not === reason.
They simulate reasoning through matching patterns.
Up next: "Lawn mowers are good at cutting grass until they aren't"
Also, the bottom 10% feels like a bad comparison; the median human would be better. And unlike "specialized" things like programming, game playing is something almost all of us have done.
It is. It's very common for socially adept people to bullshit their way through things they don't know, or outright want to hide.
No, it’s not - you don’t even need to be literate to count symbols - but also consider the complexity of the second task and how many skills each requires: unlike counting letters, lying isn’t simple confabulation and requires a theory of mind and some kind of goal. A child who lies to avoid trouble is doing that because they have enough of a world model to know they are going to get in trouble for something even if they haven’t worked out yet that this is unlikely to work.
The Pirahã language doesn't even have numerals - that's an extreme case, but there are quite a few languages where people stop counting beyond a certain small number and just say "a lot". Those same people, though, have no trouble lying to one another. Let that sink in for a while - fully grown-ass adults, fully capable of functioning in their society, not capable of counting one-two-three because the concept is beyond them.
What I'm trying to say is that all of those "requires theory of mind" statements are probably true but completely irrelevant, because humans (and LLMs) have "hardware acceleration" for whatever it takes to lie, while counting is an abstract idea that requires using the brain in a way it didn't evolve to be used. Similarly, LLMs cannot count if they aren't connected to a math engine - not because they're stupid, but because counting is really difficult.
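For what it's worth, the usual workaround is exactly that delegation: have the model emit a tool call and let plain code do the counting. A minimal sketch, with the tool name and dispatch wiring made up for illustration (not any particular vendor's API):

    # Sketch of delegating counting to ordinary code instead of asking the
    # model to do it "in its head". The tool name and dispatch wiring are
    # illustrative assumptions, not a real API.
    def count_letter(text: str, letter: str) -> int:
        return text.lower().count(letter.lower())

    # Pretend the model, instead of answering directly, emitted this tool call:
    tool_call = {"name": "count_letter", "args": {"text": "strawberry", "letter": "r"}}

    TOOLS = {"count_letter": count_letter}
    result = TOOLS[tool_call["name"]](**tool_call["args"])
    print(result)  # 3 - the exact bit of arithmetic a bare model tends to fumble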
We (developers) do this because it's what we've always done with our own code. Everyone's encountered a bug that they just couldn't figure out. So they search the Internet, try different implementations of the same thing, etc., but nothing works. Usually, we finally solve such problems when we take a step back and look at them through a different lens.
For example, just the other day—after spending far too long trying to get something working—I realized, "Fuck it! The users don't really need this feature." :thumbsup:
The extent to which this is true is a rough measure of how derivative your work is, no?
(Slams the door angrily)
(stomps out angrily)
(touches the grass angrily)
That said, the input space of supported problems is quite large, and you can configure the problem parameters quite flexibly.
I guess the issue is that what the model _actually_ provides you is this idiot savant who has pre-memorized everything without offering a clear index that would disambiguate well-supported problems from "too difficult" (i.e. novel) ones.
(tips fedora)
(does something)