
Posted by trq_ 10/25/2024

Detecting when LLMs are uncertain (www.thariq.io)
283 points | 165 comments
nhlx2 10/25/2024|
On two occasions I have been asked, 'Pray, Mr. Babbage, if you put into the machine wrong figures, will the right answers come out?' I am not able rightly to apprehend the kind of confusion of ideas that could provoke such a question. — Charles Babbage
astrange 10/26/2024||
That's just autocorrect. (Or generative AI.)
adrian_b 10/26/2024|||
Except that autocorrect is frequently wrong, so many authors of hilariously wrong messages have to apologize that the messages must have been messed up by autocorrect (which may or may not be true).

When autocorrect is wrong, it is usually because it chooses words believed to be used more frequently in that context, so authors of scientific or technical texts are especially affected by its wrong guesses, because they use less common words.

TeMPOraL 10/26/2024|||
Or error correction. Or statistical analysis.

"Right" and "wrong" aren't binary states. In many cases, if the data is at least in small part correct, that small part can be used to improve correctness in an automated way.

kylebenzle 10/26/2024|||
So well put!

People think they understand what "AI" is supposed to do, then "AI" turns out to not do what they expect and they call it broken.

DonHopkins 10/26/2024||
[flagged]
TeMPOraL 10/26/2024|||
Honestly, I always thought this was a perfectly legitimate question, and that it's Babbage who is failing to comprehend it, or being obtuse for show.
bobbylarrybobby 10/26/2024|||
Maybe I am failing to comprehend it. But to me the question reads “is your analytical engine, which you've described as merely a mechanical calculator, also psychic or otherwise magical, so that it may subvert its own mechanical design in order to produce the answer I want instead of what I asked for?”.
darepublic 10/27/2024||||
I understood it as "if I entered 1+2 but actually I had meant 2+2 will the machine correctly give me 4 despite my error?"
tomtom1337 10/26/2024||||
I guess the unspoken assumption Babbage makes here is «if I put only the wrong figures into the machine». Then it is completely unreasonable to expect correct output. In ML context an LLM has been trained on much data, some «wrong» and some (hopefully more) «correct», which is why asking something incorrectly can still give you the correct answer.
nyrikki 10/26/2024||
For ML it goes deeper, but unfortunately discussions about it devolve into an approximation of the Brouwer–Hilbert controversy.

If you think about it through the VC-dimension lens, with respect to learnability, set shattering is simply a choice function, and that framing can help.

Most of us have serious cognitive dissonance with dropping the principle of the excluded middle, as Aristotle's and Plato's assumptions are baked into our minds.

You can look at why ZFC asserts that some sets are non-constructible, or at how Type or Category theory differs from classical logic.

But the difference between RE and coRE using left and right in place of true and false seems to work for many.

While we can build on that choice function, significantly improving our abilities to approximate or numerical stability, the limits of that original trinity of laws of thought are still underlying.

The intersection of RE and coRE is the recursive set, and is where "p or not p" and "not not p implies p" hold.

There is a reason constructivist logic, lambda calculus, and category theory are effectively the same thing.

But for most people it is a challenging path to figure out why.

As single layer perceptrons depend on linearly separable sets, and multilayer perceptrons are not convex, I personally think the constructivist path is the best way to understand the intrinsic limits despite the very real challenges with moving to a mindset that doesn't assume PEM and AC.

There are actually stronger forms of choice in that path, but they simply cannot be assumed.

More trivial examples, even with perfect training data.

An LLM will never be able to tell you unknowable unknowns like 'will it rain tomorrow', or answer underspecified questions like 'should I drive on the left side of the road'.

But it also won't be able to reliably shatter sets for problems that aren't in R with next token prediction, especially with problems that aren't in RE, as even coRE requires 'for any' universal quantification on the right side.

An LLM will never be total, so the above question applies but isn't sufficient to capture the problem.

While we can arbitrarily assign tokens to natural numbers, that is not unique and is a forgetful functor, which is why compression is considered equivalent to the set shattering I used above for learnability.

The above question's framing with just addition and with an assumption of finite precision is why there is a disconnect for some people.

chipsrafferty 10/26/2024|||
Can you help me understand the question (and context)?

Like, the "machine" is a calculator, and I want to ask 5+5, but I put in the "wrong figures", e.g. 4+4 - is the "right answer" 8 or 10? Is the right answer the answer you want to the question you meant to ask, or the answer to the question you actually asked?

d1sxeyes 10/27/2024||
Imagine it’s not a computer, it’s a piece of paper. And the paper is a bit dirty and you can’t quite tell if it’s a 4 or a 5. You guess it’s 4, but the print-out says 5. Do you pass the exam?

Imagine you ask your friend “hey, what’s twenty divided by five?”, and they say “four” and then you realise you misspoke and meant to say “what’s twenty divided by four?” Is your friend wrong?

Of course not, in both cases.

raindear 10/26/2024||
[dead]
zby 10/26/2024||
These sampling-based techniques are a rare occasion where experimenting with consumer hardware can let you improve on SOTA models. I don't think it will last - the end game surely will be a trainable sampler. But for now - enjoy tinkering: https://github.com/codelion/optillm implements a few of these techniques.

The optillm authors suggest that the additional computations in Entropix don't bring any better results in comparison with simple CoT decoding (but I am not sure if they also checked efficiency): https://x.com/asankhaya/status/1846736390152949966

It looks to me that many problems with LLMs come from something like semantic leaking, or distraction by irrelevant information (like in the GSM Symbolic paper) - maybe there is some space for improving attention too.

I wrote a couple of blog posts on these subjects: https://zzbbyy.substack.com/p/semantic-leakage-quick-notes, https://zzbbyy.substack.com/p/llms-and-reasoning, https://zzbbyy.substack.com/p/o1-inference-time-turing-machi...

NitpickLawyer 10/26/2024||
The problem that I see with all these different sampling techniques is the way people usually judge them. There are people who claim they work better, but no rigorous benchmarks to prove it. Lots of "it writes better" or "the prose is fresh", but that is one argument where I think LeCun is 100% right - you can't judge a generalist model by "it works on poetry" or "prose", because that's the definition of bias, and you're shooting yourself in the foot with personal anecdotes.

I'd like to see this applied to coding or math. Show the samplers working better on, say, olympiad math problems, with thorough benchmarks before and after.

ninetyninenine 10/26/2024|||
If the objective is to make a better poet or a better storybook writer, then this flawed metric is the only form of measure.

It’s the same measure we judge human writers on so it’s not necessarily the worst.

Der_Einzige 10/26/2024|||
The min_p paper and many other papers are doing exactly that.
NitpickLawyer 10/26/2024||
Is this [1] the paper you're referring to?

Unless I'm reading Table 2 (page 7, PDF version) wrong, on math, min_p is shown to score worse than top_p.

For temp 0.7 it scores 1 point lower than top_p. And from temps 1.0 and up, while scoring higher than top_p at the same temp, it scores way lower (6 points and up) than top_p at 0.7. So overall, if you want accurate answers (and for math you kinda do), min_p is worse? Unless I misunderstand something.

I agree with the authors that if you want a tradeoff between accuracy and diversity, min_p might help, but if you're looking for precise answers, the results will be slightly worse. It's a tradeoff, but as I said above, people often fail to mention it as such, and instead proclaim it to be "better" across the board.

[1] - https://arxiv.org/pdf/2407.01082
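
For anyone unfamiliar, the min_p rule itself is tiny. A rough numpy sketch of my understanding of it (the 0.1 floor and the renormalization defaults are illustrative, not the paper's tuned settings):

    import numpy as np

    def min_p_sample(probs, min_p=0.1, rng=None):
        """min_p sampling: keep tokens whose probability is at least min_p times
        the most likely token's probability, then renormalize and sample."""
        rng = rng or np.random.default_rng()
        keep = np.flatnonzero(probs >= min_p * probs.max())   # dynamic cutoff
        kept = probs[keep] / probs[keep].sum()
        return int(rng.choice(keep, p=kept))

The cutoff scales with the top token's probability, so more tokens survive when the distribution is flat and fewer when one token dominates - which is exactly the accuracy/diversity tradeoff being discussed here.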

scellus 10/26/2024|||
Semantic leakage could just be a weakness of the model, and related to claims that they don't _really_ reason. Maybe more training could help.

Or maybe it's a more fundamental weakness of the attention mechanism? (There are alternatives to that now.)

trq_ 10/26/2024||
This is incredible! I haven't seen that repo yet, thank you for pointing it out, and for the writing.
tylerneylon 10/25/2024||
I couldn't figure out if this project is based on an academic paper or not — I mean some published technique to determine LLM uncertainty.

This recent work is highly relevant: https://learnandburn.ai/p/how-to-tell-if-an-llm-is-just-gues...

It uses an idea called semantic entropy which is more sophisticated than the standard entropy of the token logits, and is more appropriate as a statistical quantification of when an LLM is guessing or has high certainty. The original paper is in Nature, by authors from Oxford.

vark90 10/25/2024||
The idea behind semantic entropy (estimating entropy of distribution over semantic units, instead of individual sequences in the output space) is great, but it's somewhat naive in the sense that it considers these semantic units to be well-defined partitions of output space. There is further generalization of this approach [1] which performs soft clustering of sampled outputs based on a similar notion of semantic equivalence between them.

But even with this in mind, there are caveats. We have recently published [2] a comprehensive benchmark of SOTA approaches to estimating uncertainty of LLMs, and have reported that while in many cases these semantic-aware methods do perform very well, in other tasks simple baselines, like the average entropy of token distributions, perform on par with or better than complex techniques.

We have also developed an open-source python library [3] (which is still in early development) that offers implementations of all modern UE techniques applicable to LLMs, and allows easy benchmarking of uncertainty estimation methods as well as estimating output uncertainty for deployed models in production.

[1] https://arxiv.org/abs/2307.01379

[2] https://arxiv.org/abs/2406.15627

[3] https://github.com/IINemo/lm-polygraph
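
To make the idea concrete, here is a minimal sketch of the discrete variant of semantic entropy: sample several answers for the same prompt, cluster them by meaning, and compute entropy over the clusters rather than over raw strings. The same_meaning callable is a stand-in for the bidirectional-entailment check the Nature paper uses; everything here is illustrative rather than a reference implementation.

    import math

    def semantic_entropy(answers, same_meaning):
        """Discrete semantic entropy: entropy over meaning-clusters of sampled answers.

        answers:      completions sampled (at temperature > 0) for the same prompt
        same_meaning: callable(a, b) -> bool; stand-in for an entailment-based
                      equivalence check
        """
        clusters = []                                   # each cluster: list of answer indices
        for i, answer in enumerate(answers):
            for cluster in clusters:
                if same_meaning(answer, answers[cluster[0]]):
                    cluster.append(i)
                    break
            else:
                clusters.append([i])
        n = len(answers)
        # cluster probability ~ fraction of samples landing in that cluster
        return -sum((len(c) / n) * math.log(len(c) / n) for c in clusters)

With answers like ["Paris", "It's Paris.", "Lyon"], two meaning-clusters can emerge even though all three strings differ, which is the point of working at the semantic level rather than on token sequences.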

mikkom 10/25/2024|||
This is based on work done by this anonymous twitter account:

https://x.com/_xjdr

I have been following this quite closely, it has been very interesting as it seems smaller models can be more efficient with this sampler. Worth going through the posts if someone is interested in this. I kind of have a feeling that this kind of sampling is a big deal.

weitendorf 10/25/2024|||
I don't believe it is, because I'd hope that academicians would better understand the distinction between token-uncertainty and semantic-uncertainty/semantic-correctness (or at least endeavor to establish a data-backed correlation between the two before making claims about their relation). As I noted in my other comment, I believe that the author of this is making a fundamental misunderstanding, which per their note at the top, is probably why they haven't been able to actually yield practical results.

I don't say that to be a hater or discourage them because they may well be on to something, and it's good for unique approaches like this to be tried. But I'm also not surprised there aren't academic papers about this approach because if it had no positive effects for the reasons I mention, it probably wouldn't get published.

trq_ 10/25/2024|||
It's not an academic paper as far as I know, which is why I wanted to write this up. But the project certainly has a cult following (and cult opposition) on ML Twitter.
tylerneylon 10/25/2024||
PS My comment above is aimed at hn readers who are curious about LLM uncertainty. To the authors of the post / repo: looks cool! and I'd be interested to see some tests on how well it works in practice to identify uncertainty.
cchance 10/25/2024||
When that entropy is high, I feel like models should have an escape hatch to flag that the answer's overall certainty was low, and hell, add it up and score it so at the end the user can see if during the generation the certainty of the answer was shit and it should be thrown out or replaced with an "i'm not sure".
vark90 10/25/2024||
Yep, usually it's called abstention or rejection.

When people in this field compare various methods of quantifying model uncertainty, they often perform what is called rejection verification. Basically, you continuously reject data points where uncertainty is high, and see how average quality of the remaining outputs increases. A good uncertainty estimate is highly correlated with output quality, and thus low-uncertainty outputs should have higher average quality.

We use exactly this approach in our recent benchmark of uncertainty estimation approaches for LLMS [1] and have an open-source library under development [2] which allows for such benchmarking. It also can produce uncertainty scores for a given model output, so ppl in industry can integrate it into their applications as well.

[1] https://arxiv.org/abs/2406.15627

[2] https://github.com/IINemo/lm-polygraph
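
For readers who want the shape of that evaluation, here is a minimal sketch of a rejection curve (just the idea, not lm-polygraph's actual API): sort outputs by estimated uncertainty, drop the most uncertain fraction, and watch the average quality of what remains.

    import numpy as np

    def rejection_curve(uncertainty, quality, fractions=(0.0, 0.1, 0.2, 0.3, 0.5)):
        """Average quality of the outputs kept after rejecting the most-uncertain fraction.

        A good uncertainty estimate should make this curve increase as more is rejected.
        """
        uncertainty = np.asarray(uncertainty)
        quality = np.asarray(quality)
        order = np.argsort(uncertainty)                  # most certain first
        curve = {}
        for frac in fractions:
            keep = order[: max(1, int(len(order) * (1 - frac)))]
            curve[frac] = float(quality[keep].mean())
        return curve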

radarsat1 10/25/2024|||
The problem is that deep net classifiers in general are not well statistically calibrated by default. So while the entropy is often high when they are "not sure", models can very often also be "confidently wrong". So using entropy of the logits as an indicator of confidence can easily be very misleading.

I'm not an expert in LLMs though, this is just my understanding of classifiers in general. Maybe with enough data this consideration no longer applies? I'd be interested to know.
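
One standard way to quantify that mismatch is expected calibration error: bin outputs by stated confidence and compare each bin's average confidence to its empirical accuracy. A rough sketch, assuming you already have a per-answer confidence in [0, 1] (e.g. exp of the mean token logprob) and a correctness label:

    import numpy as np

    def expected_calibration_error(confidences, correct, n_bins=10):
        """Gap between stated confidence and observed accuracy, weighted over bins."""
        confidences = np.asarray(confidences, dtype=float)
        correct = np.asarray(correct, dtype=float)
        bins = np.linspace(0.0, 1.0, n_bins + 1)
        ece = 0.0
        for lo, hi in zip(bins[:-1], bins[1:]):
            mask = (confidences > lo) & (confidences <= hi)
            if mask.any():
                gap = abs(confidences[mask].mean() - correct[mask].mean())
                ece += mask.mean() * gap                 # weight by bin occupancy
        return float(ece)

A "confidently wrong" model shows up here as high-confidence bins with low accuracy, which is exactly the failure mode that makes raw logit entropy misleading.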

mumblemumble 10/25/2024|||
I'm not an expert, either, but I've poked at this a little. From what I've seen, token logprobs are correlated enough with correctness of the answer to serve as a useful signal at scale, but it's a weak enough correlation that it probably isn't great for evaluating any single output.

My best guess is that somewhere close to the root of the problem is that language models still don't really distinguish syntagmatic and paradigmatic relationships. The examples in this article are a little bit forced in that respect because the alternatives it shows in the illustrations are all paradigmatic alternatives but roughly equivalent from a syntax perspective.

This might relate to why, within a given GPT model generation, the earlier versions with more parameters tend to be more prone to hallucination than the newer, smaller, more distilled ones. At least for the old non-context-aware language models (the last time I really spent any serious time digging deep into language models), it was definitely the case that models with more parameters would tend to latch onto syntagmatic information so firmly that it could kind of "overwhelm" the fidelity of representation of semantics. Kind of like a special case of overfitting just for language models.

singularity2001 10/26/2024||
maybe this signal needs to be learned in the final step of reinforcement learning where people decide whether "I don't know" is the right answer
trq_ 10/25/2024||||
I want to build intuition on this by building a logit visualizer for OpenAI outputs. But from what I've seen so far, you can often trace down a hallucination.

Here's an example of someone doing that for 9.9 > 9.11: https://x.com/mengk20/status/1849213929924513905
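
You don't need a full visualizer to start poking at this: the Chat Completions API can return per-token logprobs and the top alternatives. A minimal sketch assuming the current OpenAI Python SDK (the model name is just an example, and OPENAI_API_KEY must be set):

    from openai import OpenAI

    client = OpenAI()

    resp = client.chat.completions.create(
        model="gpt-4o-mini",                      # any model that returns logprobs
        messages=[{"role": "user", "content": "Which is bigger, 9.9 or 9.11?"}],
        logprobs=True,
        top_logprobs=5,
    )

    # Print each generated token with its logprob and the runner-up candidates.
    for tok in resp.choices[0].logprobs.content:
        alts = ", ".join(f"{a.token!r}:{a.logprob:.2f}" for a in tok.top_logprobs)
        print(f"{tok.token!r:>12}  logprob={tok.logprob:.2f}  alternatives: {alts}")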

z3t4 10/25/2024||
I'm thinking versioning: 9.9, 9.10, 9.11 etc., because in my native language we use the comma for decimal separation: 9,11 9,22 9,90
modeless 10/25/2024|||
My understanding is that base models are reasonably well calibrated but the RLHF and other tuning that turns them into chat assistants screws up the calibration.
scottmf 10/25/2024||
There’s much that is lost but imo gpt-4-base would be borderline unusable for most of us compared to its descendants — perhaps even more so than GPT-3 davinci, at least relative to its time.

4 can be an absolute demonic hallucinating machine.

tkellogg 10/25/2024|||
Entropix gives you a framework for doing that sort of thing. The architecture is essentially to detect the current state, and then adjust sampler settings or swap in an entirely new sampler strategy.

You absolutely could experiment with pushing it into a denial, and I highly encourage you to try it out. The smollm-entropix repo[1] implements the whole thing in a Jupyter notebook, so it's easier to try out ideas.

[1]: https://github.com/SinatrasC/entropix-smollm
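
The core signal Entropix keys off is the entropy and varentropy of the next-token distribution, with the sampler strategy switched depending on which region you land in. A much-simplified sketch of that idea (the thresholds and branch names below are made up for illustration, not the repo's actual values):

    import numpy as np

    def entropy_varentropy(logits):
        """Entropy and varentropy (variance of surprisal) of a next-token distribution."""
        logits = logits - logits.max()                  # numerical stability
        probs = np.exp(logits) / np.exp(logits).sum()
        surprisal = -np.log(probs + 1e-12)
        ent = float((probs * surprisal).sum())
        varent = float((probs * (surprisal - ent) ** 2).sum())
        return ent, varent

    def choose_strategy(logits, ent_thresh=1.0, varent_thresh=1.0):
        """Rough quadrant logic in the spirit of Entropix (thresholds are illustrative)."""
        ent, varent = entropy_varentropy(logits)
        if ent < ent_thresh and varent < varent_thresh:
            return "argmax"              # confident and stable: take the top token
        if ent < ent_thresh:
            return "branch"              # low entropy, high varentropy: explore alternatives
        if varent < varent_thresh:
            return "insert_pause"        # high entropy, low varentropy: let the model "think"
        return "resample_high_temp"      # genuinely uncertain: sample more diversely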

edwdt 10/26/2024||
you can also try https://github.com/EdwardDali/EntropixLab
danielmarkbruce 10/25/2024|||
We are almost certainly going to see lots of additional tokens added to vocabularies (like the thinking token, but also could be a "<LOGIC FAIL>" token), lots of sophisticated decoding strategies etc. Just need to generate the data.
nopinsight 10/25/2024|||
The new Claude Sonnet 3.5 does something like that in my experience.
trq_ 10/25/2024||
Yeah wouldn't be surprised if the big labs are doing more than just arg max in the sampling.
throwawaymaths 10/25/2024|||
That's not really trivially compatible with the transformer scheme used to pick tokens and generate results.

Transformers are generative AI, not classifiers. They throw out a lot of statistics in the service of forward progress and completing the generative task. This project is a rudimentary attempt to regenerate those stats

trq_ 10/25/2024||
Yeah that's been my thinking as well.

There are definitely times when entropy can be high without the model actually being uncertain (again, synonyms are the best example), but it seems promising. I want to build a visualizer using the OpenAI endpoints.

benreesman 10/26/2024||
A modern GPT of any serious size outputs logits from a big-ass classifier over the token vocabulary. These exist in a space where one can not only posit but empirically calculate a manifold with some nontrivial convexity properties; it's a well-posed if not outright solved problem to determine which LLM wrote something (up to telling it to use a certain manner).

This was a problem not only studied but in which fast and impressive progress was happening until they just turned it off.

It’s a fucking gigantic business to be the best at this. And it’s exactly what a startup should be: unlikely to have a well-heeled incumbent competitor, not because well-heeled firms ignore the market, but because they actively don’t want it to exist.

digdugdirk 10/26/2024|
Can you explain more about this and why it would be useful? From your description it seems like a huge percentage of requests would alter the output enough to prevent specific LLM detection. Also, with so many new LLMs using synthetic and generated data, I'd imagine that throws a wrench in things too.
jawns 10/25/2024||
The way this is being described is almost like a maze-traversal algorithm, where compute time is "how far I'm willing to go down a path to test whether it's a possible solution." I wonder what other parallels we might find. For instance, are some of the maze-solving algorithms relevant to apply to LLMs?
radarsat1 10/25/2024||
Sampling sequentially to find the highest joint probability over the sequence is definitely a search problem. That's why you see algorithms like beam search often used for sampling.
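
For reference, beam search is exactly that search framing: keep the few partial sequences with the highest joint log-probability and extend them step by step. A toy sketch, where step_logprobs stands in for a real model call:

    import numpy as np

    def beam_search(step_logprobs, prompt_ids, beam_width=4, max_len=32, eos_id=0):
        """Keep the beam_width highest joint-logprob prefixes at each step.

        step_logprobs(ids) -> np.ndarray of next-token log-probabilities is a
        stand-in for a real model; eos_id marks the end of a sequence.
        """
        beams = [(0.0, list(prompt_ids))]                # (joint logprob, token ids)
        for _ in range(max_len):
            candidates = []
            for score, ids in beams:
                if ids[-1] == eos_id and len(ids) > len(prompt_ids):
                    candidates.append((score, ids))      # finished beam passes through
                    continue
                logprobs = step_logprobs(ids)
                for tok in np.argsort(logprobs)[::-1][:beam_width]:
                    candidates.append((score + float(logprobs[tok]), ids + [int(tok)]))
            beams = sorted(candidates, key=lambda c: c[0], reverse=True)[:beam_width]
            if all(ids[-1] == eos_id for _, ids in beams):
                break
        return beams[0][1]
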
jpfed 10/29/2024|||
I also ask about approaching LLM decoding in terms of navigation, although from a different angle, in this reddit post: https://www.reddit.com/r/MachineLearning/comments/1dw2pqo/d_...
trq_ 10/25/2024||
Yes that's right, it seems like an area of more research.

Honestly it goes counter to the Bitter Lesson (http://www.incompleteideas.net/IncIdeas/BitterLesson.html), which stems from getting too fancy about maze traversal in Chess. But at the scale LLMs are at right now, the improvements might be worth it.

menhguin 10/25/2024|||
Hi, contributor to Entropix here. This is just my opinion, but I don't think it goes counter to the Bitter Lesson at all, because it's meant to leverage model computation capabilities. Several papers have suggested that models internally compute certainty (https://arxiv.org/abs/2406.16254), and in my view our method simply leverages this computation and factors it explicitly into decoding.

This is as opposed to pure sampling + next token prediction which basically randomly chooses a token. So if a model does 1274 x 8275 and it's not very sure of the answer, it still confidently gives an answer even though it's uncertain and needs to do more working.

danielmarkbruce 10/25/2024||
100%. It's in line with bitter lesson learnings. Good going.
danielmarkbruce 10/25/2024|||
Yeah i don't think it's counter at all. The bitter lesson calls out the fact that more computation/search wins.
petsounds 10/25/2024||
When I read about potential optimizations like this, I can't believe that people trust LLMs enough to do things with minimal oversight. Do people really believe that "AI" products that use LLMs are capable enough to do things like control a computer, or write accurate code? By design, isn't _everything_ a "hallucination" or a guess? Is it really possible to overcome that?
Workaccount2 10/25/2024||
I have written (overseen?) a few programs that we use in our production test systems using chatgpt and python. A program that sends actions to machines, queries them for results/errors/outputs, and then stores all that in a .csv which it later translates into a nicely formatted excel file. It also provides a start-up guide to show the technician how to hook-up things for a given test.

I am not a programmer. No one at my company is a programmer. It writes code that works and does exactly what we asked it to do. When the code choked while I was "developing" it, I just fed it back into chatgpt to figure out. And it eventually solved everything. Took a day or so, whereas it would probably take me a month or a contractor $10,000 and a week.

LLM's might be bad for high level salary grade programming projects. But for those of us who use computers to do stuff, but can't get past the language barrier preventing us from telling the computer what to do, it's a godsend.

lll-o-lll 10/25/2024|||
Really interesting. We programmers live in a bit of a bubble, so it’s good to get this perspective. Perhaps with LLM’s we’ve finally reached the early dreams of the “programmable computer for everyone”, that seemed to slip out of reach after the 80’s.
starbugs 10/26/2024|||
In other words: Your problem was simple enough and well enough represented in the training corpus and you were a bit lucky. Also, the problem is not important enough for there to be a requirement for the code to be updatable/fixable at short notice, because effectively now nobody in your org knows how the solution actually works.

For this very constrained subset of a problem domain LLMs are indeed very suitable but this doesn't scale at all.

danielmarkbruce 10/25/2024|||
How do you overcome it as a human? If you think through it... you'll come to the conclusion that LLMs can be used to do all kinds of things. Humans don't write down code and then shove it into production, for example.
Kiro 10/26/2024|||
> Do people really believe that "AI" products that use LLMs are capable enough to do things like control a computer, or write accurate code?

Of course. It's not a hypothetical question. Almost all of my code is written by Claude 3.5 Sonnet. It's much more robust and accurate than my regular code and I've been programming for 20 years.

OtomotO 10/25/2024||
No it's not, but when humans have invested too much (emotions or money) they do not retreat easily. They rather go all in.

It's just another hype, people. Just like Client/Server, Industry 4.0, Machine Learning, Microservices, Cloud, Crypto ...

badsandwitch 10/25/2024||
Has anyone tried to see what the output looks like if the model is never allowed to be uncertain?

For example, whenever certainty drops below a threshold the sampler backtracks and chooses different tokens. Such that at the end every single token had an above threshold certainty.

I doubt it would entirely eliminate undesirable outputs, but it would be interesting.
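
A toy version of that is easy to write against any model that exposes next-token probabilities: decode greedily, and whenever the best allowed token falls below the threshold, step back one position and forbid the previous choice. Everything below (the step callable, the threshold, the retry cap) is made up for illustration:

    import numpy as np

    def confident_decode(step, prompt_ids, threshold=0.2, max_len=64, max_backtracks=20):
        """Greedy decoding that backtracks when the chosen token's probability is too low.

        step(ids) -> np.ndarray of next-token probabilities is a stand-in for a real model.
        banned[pos] holds tokens already tried and rejected at that position.
        """
        ids = list(prompt_ids)
        banned = {}
        backtracks = 0
        while len(ids) - len(prompt_ids) < max_len:
            pos = len(ids)
            probs = step(ids).copy()
            for tok in banned.get(pos, ()):              # don't repeat earlier failed choices
                probs[tok] = 0.0                         # (no renormalization, for simplicity)
            tok = int(probs.argmax())
            if probs[tok] >= threshold:
                ids.append(tok)
            elif pos > len(prompt_ids) and backtracks < max_backtracks:
                bad = ids.pop()                          # step back and forbid the previous token
                banned.setdefault(len(ids), set()).add(bad)
                backtracks += 1
            else:
                break                                    # can't satisfy the threshold; give up
        return ids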

eddd-ddde 10/25/2024||
Couldn't that just, never get an answer?

Or maybe just says "i don't know" with full certainty.

zbentley 10/25/2024||
That would be extremely useful in some domains.
mumblemumble 10/25/2024||
Perhaps only if you can also be very certain that the output is correct whenever the logprobs don't trigger the filter.

If that's not the case then it might just trigger bad risk compensation behavior in the model's human operators.

Jerrrrrrry 10/26/2024||
You used to get purely deterministic near-quotes, but still affected by floating point inaccuracies.
bjourne 10/25/2024||
There are billions of sampling strategies for language models. The problem is that it is very difficult to empirically show that one sampling strategy is better than standard top-k or top-p sampling. Minimizing perplexity is not enough to demonstrate superiority of a particular method. The strategy suggested in the blog post has the same issue. An innovation that sounds plausible in theory, but is unproven in practice.
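
For readers who haven't looked at them, the two standard baselines are only a few lines each, which is part of why new variants keep appearing. A sketch in numpy (defaults are illustrative):

    import numpy as np

    def top_k_sample(probs, k=50, rng=None):
        """Keep only the k most likely tokens, renormalize, sample."""
        rng = rng or np.random.default_rng()
        keep = np.argsort(probs)[::-1][:k]
        kept = probs[keep] / probs[keep].sum()
        return int(rng.choice(keep, p=kept))

    def top_p_sample(probs, p=0.9, rng=None):
        """Nucleus sampling: keep the smallest prefix of tokens whose mass reaches p."""
        rng = rng or np.random.default_rng()
        order = np.argsort(probs)[::-1]
        cutoff = int(np.searchsorted(np.cumsum(probs[order]), p)) + 1
        keep = order[:cutoff]
        kept = probs[keep] / probs[keep].sum()
        return int(rng.choice(keep, p=kept))
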
danielmarkbruce 10/25/2024|
Proof isn't required.

It's difficult to prove because it's difficult to state clearly what is "better" and it's expensive to collect preference data (or similar).

You could use common sense after looking at lots of samples and say "this method seems to work better if you are trying to optimize for X".

joe_the_user 10/25/2024|
The problem is that the limits to LLM answers have more dimensions than just "uncertainty". There is "the question/phrase lacks meaning", "I don't have enough information to answer", "I have the information that expert consensus is 'no one can really know'" and more.

I think there's a human tendency to reduce the problem one has answering a given question to a question of just "uncertainty", and so we look at LLM answers as involving just a single level of uncertainty. But that's anthropomorphism.

AI images (and photography before them) showed us new, unimagined ways an image can be wrong (or rather, real-seeming but wrong). AI language interactions do this too, but in a more subtle way.

trq_ 10/25/2024||
Definitely, but if you can detect when you might be in one of those states, you could reflect to see exactly which state you're in.

So far this has mostly been done using Reinforcement Learning, but catching it and handling it at inference time seems like it could be interesting to explore. And much more approachable for open source, since only the big ML labs can do this sort of RL.

TZubiri 10/25/2024||
Right. The uncertainty will be high when responding to garbage inputs and it will be distributed along many tokens.

If probability(sum(tokens[:5])) < 0.5: Respond("I'm sorry I don't quite understand what you mean.")
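
In runnable form (the five-token window and 0.5 cutoff are arbitrary, just to make the idea concrete), assuming you already have the per-token logprobs of the reply:

    import math

    def answer_or_abstain(token_logprobs, reply_text, window=5, min_prob=0.5):
        """Abstain if the joint probability of the first few generated tokens is low."""
        joint_prob = math.exp(sum(token_logprobs[:window]))
        if joint_prob < min_prob:
            return "I'm sorry, I don't quite understand what you mean."
        return reply_text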

melenaboija 10/25/2024|||
As anthropomorphic as calling the model's inaccuracies "hallucinations".

I feel anthropomorphism is part of the marketing strategy for LLMs

jazzyjackson 10/25/2024|||
Having an oracle to chat with is a good product, but a bad framing for the tech. IMO all the broken expectations come from viewing the output as something that comes from "an other", a thing other than yourself with knowledge and experience, when really it's more of a mirror, reflecting your words back to you, enlarged or squeezed like funhouse mirrors (back in my day we didn't have skinny filters, we had to walk uphill to the pier and stand in front of a distorted piece of mercury glass! ;).
MobiusHorizons 10/25/2024||
Did you live under water? How was the pier uphill;)
cpeterso 10/25/2024||
The inland area could be lower than the waterfront.
jazzyjackson 10/25/2024||
Somehow I just knew a few of you'se would consider the implications of walking uphill to a pier
botanical76 10/25/2024||||
What other word would you suggest?

I've seen "bullshitting" suggested, but this of course still implies intent, which AIs do not have in any typical sense of the word.

I think we as a community have settled on hallucination as the best English word that approximately conveys the idea. I've seen folks on here making up words to describe it, as if that is any more useful to the victim here. The victim being the uninformed (w.r.t AI tech) layperson.

atoav 10/25/2024|||
LLMs give you a plausible chain of words; the word "hallucination" assumes an intentionality that doesn't exist — as if the LLM had a "clear" state of mind and one where it felt a bit dizzy — but none of that describes what is going on.
CooCooCaCha 10/25/2024|||
Hallucination does not imply intentionality, in fact the opposite.
atoav 10/25/2024||
which was my point.
CooCooCaCha 10/25/2024||
Your point is misusing a word? The word “hallucination” in no way implies intentionality.
atoav 10/26/2024||
Granted, maybe it was a bit unclear, so let me clarify my point:

In humans hallucination is about a loss of a relationship with an underlying physical world. A physical world whose model we have in our heads and interact with in intentional ways if we are not hallucinating.

That means using the word hallucinating implies that the thing could also not be hallucinating and have a grip on reality. And this was my criticism: an LLM spits out plausible phrases; if the graph wouldn't consider an output plausible, it wouldn't return it. That means for the LLM there is no difference between plausible bogus and a factually correct statement; that difference is something humans interpret into the output from the outside.

joe_the_user 10/25/2024||||
The thing about "hallucination" (or confabulation or anything describing having false ideas) is that it captures the LLM behavior of not just making a statement but "standing behind it", making a continuing argument for their (false) idea when questioned.

Humans do this too, of course. The LLMs are simply emulating this human behavior.

haccount 10/25/2024|||
The word confabulation is used in situations where human beings unintentionally pad whatever they say with falsehoods.
paulddraper 10/25/2024||||
Hallucinating is descriptive but superlative.

Wrong or inaccurate are alternatives.

codetrotter 10/25/2024||||
“Confabulations” is sometimes mentioned as an alternative to “hallucinations”.

It’s a better alternative than “bullshitting”, because “confabulating” does not have that kind of connotation of intent.

Semiapies 10/26/2024|||
Illusion. Mirage.
stavros 10/25/2024|||
A more apt word is "confabulation".
vark90 10/25/2024|||
You are right that uncertainty is a kinda loosely defined term. Usually people mean that it's a kind of proxy to the probability that the output of the model is correct in some sense.

It's also true that uncertainty can be decomposed into "flavours". The simplest and most discussed decomposition is into aleatoric and epistemic kinds of uncertainty. Epistemic uncertainty (or model-based uncertainty) usually refers to the case when poor output is a result of the model being presented with a kind of input it never saw before and should not be expected to handle correctly. Aleatoric uncertainty, on the other hand, is thought to be intrinsic to the data itself - think of the natural ambiguity of the task, or noisy labelling.

People in the field of uncertainty estimation are very much concerned with developing methods of quantifying these different types of uncertainty, and different methods can be more sensitive to one or the other.

glaugh 10/25/2024|||
Fwiw this feels deeply relevant to my usage of LLMs to structure data. I’d like exactly that: a good indicator of uncertainty for each bit of data.
CooCooCaCha 10/25/2024||
Aren’t those different flavors of uncertainty?
trq_ 10/25/2024|||
Yeah, I think the idea of finding out what flavor of uncertainty you have is very interesting.
ben_w 10/25/2024|||
I think that's the point?
danielmarkbruce 10/25/2024||
No, the comment reflects a misunderstanding of uncertainty. Uncertainty could be caused by all kinds of things (ie, there are flavors). That's different than saying "there are more dimensions than uncertainty".
ben_w 10/26/2024||
The mathematical use of the term is as you say.

The article itself is about uncertainty at the level of the next token rather than of the entire response, which is different: "Capital of Germany is" followed by "Berlin" is correct, but it would also have been valid for the full answer to have been ", since reunification in 1990, Berlin; before this…" - correct at the conceptual level, uncertain at the token level.

Most of the users aren't aware of the maths and use words in more every-day manners, to the annoyance of those of us who care about the precise technical definitions.

The listed types of uncertainty can and do have different uses in different cases.

Especially the difference between "I don't know the answer" and "I do know absolutely that the answer is that nobody knows".

As a chatbot it's also important to say "I don't understand your question" when appropriate, rather than to say "dunno" in response to e.g. "how do I flopragate my lycanthrope?"

danielmarkbruce 10/29/2024||
RLHF (and DPO) are used and aren't doing token level scoring.

The article is talking about inference. Most models people are actually using have gone through RLHF or DPO. So the uncertainty at inference includes all dimensions of uncertainty. A token choice can effectively be a branch from a conceptual perspective.
