Posted by daigoba66 3 days ago
I tried this with Gemini: (i am trying(something(re(a(l(ly)c)r)a)z)((y)he)re)
and it tripped.
And I'm still pretty likely to make the off-by-one error even if I slow down, and there are certain optical illusions that are nearly guaranteed to confuse me no matter how hard I try, particularly if I don't use any visual guides (i.e. tools). VLMs will not make my mistakes, but they will make their own, because their quirks differ from the quirks of my visual cortex.
Is this seriously surprising to anyone who knows the absolute minimum about how LLMs parse and understand text?
It's only surprising to people who still think they're going to build God out of LLMs.
There's plenty of rage to go around on literally every divisive topic, and it's not the place we want discussions to come from here.
"Eschew flamebait. Avoid generic tangents."
"Comments should get more thoughtful and substantive, not less, as a topic gets more divisive."
In light of this, why was my comment - which was in large part a reaction to the behavior of the users described above - the only one called out here?
Their viewpoint on this technology has unfortunately become part of some people's identity, and any position that isn't either "AGI imminent" or "this is useless" can trigger some major emotions.
Thing is, this finding (along with all the other LLM limits) does not mean that these models aren't impactful and shouldn't be scrutinised, nor does it mean they are useless. The truth is likely just a bit more nuanced than either narrow extreme.
Also, the mental health impact, job losses for white-collar workers, privacy issues, and rights holders' concerns about training data collection: all the current-day impacts of LLMs are easily brushed aside by someone who believes LLMs are near the "everyone dies" stage, which just so happens to be helpful if one were running a lab. Same if you believe these are useless and will never get better: any discussion of real-life impacts is seen as trying to slowly get them to accept LLMs as a reality, when to them, they never were and never will be.
He's retired so I guess there's no harm in letting him try
In this case your intuition is completely valid, and yet this is another case of it being misleading.
FIFY, it's not endemic to here or to LLMs. Point out Mac issues to an Apple fan, problems with a vehicle to an <insert car/brand/model> fan, that their favorite band sucks, or that the representative they voted for is a PoS.
Most people aren't completely objective about everything and thus have some non-objective emotional attachment to things they like. A subset of those people perceive criticism as a personal attack, are compelled to defend their position, or are otherwise unable to accept/internalize that criticism so they respond with anger or rage.
This is stupid enough even in the realm of sports fandom, but how does it make any sense in science? Imagine if any time we studied or enumerated the cognitive biases and logical fallacies in human thinking the gut response of these same people was an immediate "yeah, well dogs are even stupider!" No shit, but it's non-sequitur. Are we forever banned from studying the capabilities and limitations of software systems because humans also have limitations?
{
"model": "gpt-5.2-2025-12-11",
"instructions": "Is the parentheses string balanced? Answer with only Yes or No.",
"input": "((((())))))",
"temperature": 0
}
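For comparison, the check itself is deterministic and trivial; a minimal counter-based sketch in Python (the `is_balanced` name is mine, not from the paper):

```python
def is_balanced(s: str) -> bool:
    """Running-depth check: every ')' must match an earlier '('."""
    depth = 0
    for ch in s:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:  # a ')' with no open '(' to match
                return False
    return depth == 0

print(is_balanced("((((())))))"))  # → False (one extra closing paren)
```

The string in the prompt has five opening and six closing parentheses, so the only correct answer is No.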
> Lower reasoning effort
>
> The reasoning.effort parameter controls how many reasoning tokens the model generates before producing a response. Earlier reasoning models like o3 supported only low, medium, and high: low favored speed and fewer tokens, while high favored more thorough reasoning.
>
> Starting with GPT-5.2, the lowest setting is none to provide lower-latency interactions. This is the default setting in GPT-5.2 and newer models. If you need more thinking, slowly increase to medium and experiment with results.
>
> With reasoning effort set to none, prompting is important. To improve the model’s reasoning quality, even with the default settings, encourage it to “think” or outline its steps before answering.
———————-
So in the paper, the model very likely used no reasoning tokens (it only reasons if you specifically ask it to in the prompt). What is the point of such a paper? We already know that reasoning tokens are necessary.
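If one wanted to re-run the same probe with reasoning actually enabled, a sketch of the request (assuming the reasoning.effort field from the docs quoted above; exact field names unverified):

```json
{
  "model": "gpt-5.2-2025-12-11",
  "instructions": "Is the parentheses string balanced? Answer with only Yes or No.",
  "input": "((((())))))",
  "reasoning": { "effort": "medium" }
}
```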
Edit: I actually ran the prompt and this was the response
{
"model": "gpt-5.2-2025-12-11",
"output_text": "Yes",
"reasoning": {
"effort": "none",
"summary": null
},
"usage": {
"input_tokens": 26,
"output_tokens": 5,
"total_tokens": 31,
"output_tokens_details": {
"reasoning_tokens": 0
}
}
}
So the reasoning_tokens used were zero. So this whole paper is kinda useless and misleading. Did this get peer reviewed or something?
Additionally, many of us in the field of researching LLMs are curious to understand the boundaries and limitations of what these models are capable of. This paper isn't really meant as any sort of "gotcha"; rather, it serves as a possible basis for future work. Though with the caveat that I'm still digesting the paper myself.
>While LLMs appear extremely intelligent and capable of reasoning, they sometimes make mistakes that seem inconceivably foolish from a human perspective. For example, GPT-5.2 can implement complex fluid dynamics simulation code, yet it cannot even compute the parity of the short string 11000, cannot determine whether the parentheses in ((((()))))) are balanced, and makes calculation errors on 127 × 82 (Figure 1).
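For reference, two of the three toy tasks in that quote (the bit parity and the multiplication) reduce to one-liners in Python:

```python
# Parity of the bit string 11000: count the 1s, take mod 2
print("11000".count("1") % 2)  # → 0 (two 1s, so even parity)

# 127 × 82, computed exactly
print(127 * 82)  # → 10414
```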
Why would they say it is capable of reasoning and then not allow it to reason in the experiment?
I'm again taking your responses in good faith, but the abstract answers your question about what they are trying to achieve. For any statistical significance, you'd want to point to a baseline comparison (e.g. what I'm guessing you mean by "no reasoning" here). You'll also note that within the paper, the author argues and cites that failing at the baseline step (e.g. multiplication) has shown "that error often adversely affects subsequent reasoning [38, 44]".
Which indicates to me that we don't need further "reasoning", given that previous results/studies show a decline once our base has an error. To me, this seems like a fair assumption. Though given that this is an active field of research and we are largely testing a black-box application, we can't say for certain. Further studies (like this one) will give researchers a better understanding of what is and isn't possible.