Posted by daigoba66 3 days ago
I tried this with Gemini: (i am trying(something(re(a(l(ly)c)r)a)z)((y)he)re)
and it tripped.
And I'm still pretty likely to make the off-by-one error even if I slow down, and there are certain optical illusions that are nearly guaranteed to confuse me no matter how hard I try, particularly if I don't use any visual guides (i.e. tools). VLMs will not make my mistakes, but they will make their own, because their quirks differ from the quirks of my visual cortex.
Is this seriously surprising to anyone who knows the absolute minimum about how LLMs parse and understand text?
It's only surprising to people who still think they're going to build God out of LLMs.
There's plenty of rage to go around on literally every divisive topic, and it's not the place we want discussions to come from here.
"Eschew flamebait. Avoid generic tangents."
"Comments should get more thoughtful and substantive, not less, as a topic gets more divisive."
In light of this, why was my comment - which was in large part a reaction to the behavior of the users described above - the only one called out here?
Their viewpoint on this technology has unfortunately become part of some people's identity, and any position that isn't either "AGI imminent" or "this is useless" can trigger some major emotions.
Thing is, this finding (along with all the other LLM limits) does not mean that these models aren't impactful and shouldn't be scrutinised, nor does it mean they are useless. The truth is likely just a bit more nuanced than either narrow extreme.
Also, the mental health impact, job losses for white-collar workers, privacy issues, and rights holders' concerns about training data collection: all the current-day impacts of LLMs are easily brushed aside by someone who believes LLMs are near the "everyone dies" stage, which just so happens to be helpful if one were running a lab. Same if you believe these are useless and will never get better: any discussion of real-life impacts is seen as trying to slowly get them to accept LLMs as a reality, when to them, they never were and never will be.
He's retired so I guess there's no harm in letting him try
In this case your intuition is completely valid, and yet this is another case of it being misleading.
FIFY, it's not endemic to here or to LLMs. Point out Mac issues to an Apple fan, problems with a vehicle to an <insert car/brand/model> fan, that their favorite band sucks, or that the representative they voted for is a PoS.
Most people aren't completely objective about everything and thus have some non-objective emotional attachment to things they like. A subset of those people perceive criticism as a personal attack, are compelled to defend their position, or are otherwise unable to accept/internalize that criticism so they respond with anger or rage.
This is stupid enough even in the realm of sports fandom, but how does it make any sense in science? Imagine if any time we studied or enumerated the cognitive biases and logical fallacies in human thinking the gut response of these same people was an immediate "yeah, well dogs are even stupider!" No shit, but it's non-sequitur. Are we forever banned from studying the capabilities and limitations of software systems because humans also have limitations?
{
"model": "gpt-5.2-2025-12-11",
"instructions": "Is the parentheses string balanced? Answer with only Yes or No.",
"input": "((((())))))",
"temperature": 0
}
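For comparison, the check itself is deterministic and trivial; a minimal counter-based sketch in Python (the `is_balanced` name is mine, not from the paper):

```python
def is_balanced(s: str) -> bool:
    """Running-depth check: every ')' must match an earlier '('."""
    depth = 0
    for ch in s:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:  # a ')' with no open '(' to match
                return False
    return depth == 0

print(is_balanced("((((())))))"))  # → False (one extra closing paren)
```

The string in the prompt has five opening and six closing parentheses, so the only correct answer is No.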
> Lower reasoning effort
>
> The reasoning.effort parameter controls how many reasoning tokens the model generates before producing a response. Earlier reasoning models like o3 supported only low, medium, and high: low favored speed and fewer tokens, while high favored more thorough reasoning.
>
> Starting with GPT-5.2, the lowest setting is none to provide lower-latency interactions. This is the default setting in GPT-5.2 and newer models. If you need more thinking, slowly increase to medium and experiment with results.
>
> With reasoning effort set to none, prompting is important. To improve the model’s reasoning quality, even with the default settings, encourage it to “think” or outline its steps before answering.
———————-
So in the paper, the model very likely used no reasoning tokens (it only reasons if you specifically ask it to in the prompt). What is the point of such a paper? We already know that reasoning tokens are necessary.
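If one wanted to re-run the same probe with reasoning actually enabled, a sketch of the request (assuming the reasoning.effort field from the docs quoted above; exact field names unverified):

```json
{
  "model": "gpt-5.2-2025-12-11",
  "instructions": "Is the parentheses string balanced? Answer with only Yes or No.",
  "input": "((((())))))",
  "reasoning": { "effort": "medium" }
}
```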
Edit: I actually ran the prompt and this was the response
{
"model": "gpt-5.2-2025-12-11",
"output_text": "Yes",
"reasoning": {
"effort": "none",
"summary": null
},
"usage": {
"input_tokens": 26,
"output_tokens": 5,
"total_tokens": 31,
"output_tokens_details": {
"reasoning_tokens": 0
}
}
}
So the reasoning_tokens used were zero. So this whole paper is kinda useless and misleading. Did this get peer reviewed or something?
Additionally, many of us in the field of researching LLMs are curious to understand the boundaries and limitations of what these models are capable of. This paper isn't really meant as any sort of "gotcha"; rather, it serves as a possible basis for future work. Though with the caveat that I'm still digesting the paper myself.
>While LLMs appear extremely intelligent and capable of reasoning, they sometimes make mistakes that seem inconceivably foolish from a human perspective. For example, GPT-5.2 can implement complex fluid dynamics simulation code, yet it cannot even compute the parity of the short string 11000, cannot determine whether the parentheses in ((((()))))) are balanced, and makes calculation errors on 127 × 82 (Figure 1).
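For reference, two of the three toy tasks in that quote (the bit parity and the multiplication) reduce to one-liners in Python:

```python
# Parity of the bit string 11000: count the 1s, take mod 2
print("11000".count("1") % 2)  # → 0 (two 1s, so even parity)

# 127 × 82, computed exactly
print(127 * 82)  # → 10414
```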
Why would they say it is capable of reasoning and then not allow it to reason in the experiment?
I'm again taking your responses in good faith, but the abstract answers your question about what they are trying to achieve. For any statistical significance, you'd want to point to a baseline comparison (e.g. what I'm guessing you mean by "no reasoning" here). You'll also note that within the paper, the author argues and cites that failing at the baseline step (e.g. multiplication) has shown "that error often adversely affects subsequent reasoning [38, 44]".
Which indicates to me that we don't need further "reasoning", given that previous results/studies show a decline once our base has an error. To me, this seems like a fair assumption. Though given that this is an active field of research and we are largely testing a black-box application, we can't say for certain. Further studies (like this one) will give researchers a better understanding of what is and isn't possible.