Posted by optimalsolver 2 days ago
Nobody knows.
Moreover, nobody talks about that, because it's boring and non-polarizing. Instead, supposedly smart people post stupid comments that prevent anyone from understanding that this paper is worthless.
The paper is worthless because it has a clickbait title. Blog posts get voted down for that, so why not this one?
The implicit claim is worthless: failure to navigate a synthetic graph == failure to solve real-world problems. False.
Absolutely no connection to real-world examples, just losing the model in endless graphs.
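To make the complaint concrete, here's a rough sketch of the kind of synthetic graph-navigation puzzle being described: a random edge list, a start, a goal, and a checker that grades the model's answer. The sizes, node labels, and prompt wording are my own guesses, not taken from the paper, and a real generator would also ensure the goal is reachable from the start:

    import random

    # Rough sketch of a synthetic graph-navigation puzzle: random edge
    # list, a start node, a goal node, and a checker for the answer.
    # Sizes, labels, and prompt wording are guesses, not from the paper.
    def make_puzzle(n_nodes=8, n_edges=16, seed=0):
        rng = random.Random(seed)
        nodes = [f"N{i}" for i in range(n_nodes)]
        edges = set()
        while len(edges) < n_edges:
            a, b = rng.sample(nodes, 2)   # random directed edge, no self-loops
            edges.add((a, b))
        start, goal = rng.sample(nodes, 2)
        edge_list = ", ".join(f"{a}->{b}" for a, b in sorted(edges))
        prompt = (f"Edges: {edge_list}. "
                  f"Give a path from {start} to {goal} as a list of nodes.")
        return prompt, edges, start, goal

    # Grade the model's answer: every hop must be a real edge.
    def is_valid_path(path, edges, start, goal):
        if not path or path[0] != start or path[-1] != goal:
            return False
        return all((a, b) in edges for a, b in zip(path, path[1:]))

    prompt, edges, start, goal = make_puzzle()
    print(prompt)  # this string goes to the model; its reply is checked with is_valid_path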
This statement is the dictionary definition of attacking a strawman.
Every new model that is sold to us is sold on the basis that it performs better than the old model on synthetic benchmarks. This paper presents a different benchmark that those same LLMs perform much worse on.
You can certainly criticize the methodology if the authors have erred in some way, but I'm not sure why it's hard to understand the relevance of the topic itself. If benchmarks are so worthless then go tell that to the LLM companies.
I also believe the problem is we don't know what we want: https://news.ycombinator.com/item?id=45509015
If we could make LLMs apply a modest set of logic rules consistently, it would be a win.
When I prompt an RLM, I can see that it spits out reasoning steps. But I don't take that as evidence that RLMs are capable of reasoning.
Imo the paper itself should have touched on the lack of papers discussing what's in the black box that makes them Reasoning LMs. It does mention some tree algorithm that's supposedly key to reasoning capabilities.
I'm by no means attacking the paper, as its intent is to demonstrate the lack of success at solving puzzles that are simple to formulate yet complex.
I was not making a point, I was genuinely asking in case someone knows of papers I could read that make claims, with evidence, that these RLMs actually reason, and how.
Don't ask how it works cuz it's called a "Mind reading language model", duh.
There's no evidence to be had when we only know the inputs and outputs of a black box.
Pattern matching is a component of reason. Not === reason.
They simulate reasoning through matching patterns.
Up next: "Lawn mowers are good at cutting grass until they aren't"
Also, the bottom 10% feels like a bad comparison; the median human would be better. And unlike "specialized" things like programming, game playing is something almost all of us have done.
It is. It's very common for socially adept people to bullshit their way through things they don't know, or outright want to hide.
No, it’s not - you don’t even need to be literate to count symbols - but also consider the complexity of the second task and how many skills each requires: unlike counting letters, lying isn’t simple confabulation and requires a theory of mind and some kind of goal. A child who lies to avoid trouble is doing that because they have enough of a world model to know they are going to get in trouble for something even if they haven’t worked out yet that this is unlikely to work.
The Pirahã language doesn't even have numerals - that's an extreme case, but there are quite a few languages where people stop counting beyond a certain small number and just say "a lot". Those same people, though, have no trouble lying to one another. Let that sink in for a while - fully grown-ass adults, fully capable of functioning in their society, not capable of counting one-two-three because the concept is beyond them.
What I'm trying to say is that all of those "requires theory of mind" statements are probably true but completely irrelevant, because humans (and LLMs) have "hardware acceleration" for whatever it takes to lie, while counting is an abstract idea that requires using the brain in a way it didn't evolve to be used. Similarly, LLMs cannot count if they aren't connected to a math engine - not because they're stupid, but because counting is really difficult.
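For what it's worth, the usual workaround is exactly that delegation: have the model emit a tool call and let plain code do the counting. A minimal sketch, with the tool name and dispatch wiring made up for illustration (not any particular vendor's API):

    # Sketch of delegating counting to ordinary code instead of asking the
    # model to do it "in its head". The tool name and dispatch wiring are
    # illustrative assumptions, not a real API.
    def count_letter(text: str, letter: str) -> int:
        return text.lower().count(letter.lower())

    # Pretend the model, instead of answering directly, emitted this tool call:
    tool_call = {"name": "count_letter", "args": {"text": "strawberry", "letter": "r"}}

    TOOLS = {"count_letter": count_letter}
    result = TOOLS[tool_call["name"]](**tool_call["args"])
    print(result)  # 3 - the exact bit of arithmetic a bare model tends to fumble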
We (developers) do this because it's what we've always done with our own code. Everyone's encountered a bug that they just couldn't figure out. So they search the Internet, try different implementations of the same thing, etc., but nothing works. Usually, we finally solve such problems when we take a step back and look at them through a different lens.
For example, just the other day—after spending far too long trying to get something working—I realized, "Fuck it! The users don't really need this feature." :thumbsup:
The extent to which this is true is a rough measure of how derivative your work is, no?
(Slams the door angrily)
(stomps out angrily)
(touches the grass angrily)
That said, the input space of supported problems is quite large, and you can configure the problem parameters quite flexibly.
I guess the issue is that what the model _actually_ provides you is this idiot savant who has pre-memorized everything without offering a clear index that would disambiguate well-supported problems from "too difficult" (i.e. novel) ones.
(tips fedora)
(does something)