Posted by ibobev 2 days ago
The great philosophical question is why CLT applies so universally. The article explains it well as a consequence of the averaging process.
Alternatively, I’ve read that natural processes tend to exhibit Gaussian behaviour because there is a tendency towards equilibrium: forces, homeostasis, central potentials and so on and this equilibrium drives the measurable into the central region.
For processes such as prices in financial markets, with complicated feedback loops and reflexivity (in the Soros sense), the probability mass tends to end up in the non-central region, where the CLT does not apply.
In finance, the effects of random factors tend to multiply. So you get a log-normal curve.
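A quick sketch of the multiplicative story (the shock range below is a made-up illustrative choice): the log of a product of many small positive factors is a sum, so the CLT pushes the log toward Gaussian, i.e. the product toward log-normal.

```python
import numpy as np

rng = np.random.default_rng(0)

# A price built from many small multiplicative shocks:
# P_n = P_0 * r_1 * ... * r_n, each r_i > 0 (the range is arbitrary).
n_steps, n_paths = 200, 100_000
shocks = rng.uniform(0.99, 1.02, size=(n_paths, n_steps))
prices = shocks.prod(axis=1)

# log(P_n) is a *sum* of log-shocks, so the CLT applies to the log:
log_prices = np.log(prices)

# Skewness of a Gaussian sample is ~0; near-zero here means the log is
# roughly normal, i.e. the price itself is roughly log-normal.
z = (log_prices - log_prices.mean()) / log_prices.std()
skew = float(np.mean(z**3))
print(round(skew, 3))
```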
As Taleb points out, though, the underlying assumptions behind the log-normal break down in large market movements, because in large movements things that were uncorrelated become correlated. The result is fat tails, where extreme combinations of events (aka "black swans") become far more likely than naively expected.
a) the CLT requires samples drawn from a distribution with finite mean and variance
and b) the Gaussian is the maximum entropy distribution for a particular mean and variance
I’d be curious about what happens if you start making assumptions about higher-order moments in the distro
The most interesting assumptions to relax are the independence assumptions. They're way more permissive than the textbook version suggests. You need dependence to decay fast enough, and mixing conditions (α-mixing, strong mixing) give you exactly that: correlations that die off let the CLT go through essentially unchanged. Where it genuinely breaks is long-range dependence: fractionally integrated processes, Hurst parameter above 0.5, where autocorrelations decay hyperbolically instead of exponentially. There the √n normalization is wrong, you get different scaling exponents, and sometimes non-Gaussian limits.
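A toy check of the short-range case, assuming an AR(1) process (the parameters are arbitrary): geometrically decaying correlations leave the CLT intact, but the limiting variance becomes the long-run variance 1/(1−φ)² rather than the marginal variance 1/(1−φ²).

```python
import numpy as np

rng = np.random.default_rng(1)

# AR(1): x_t = phi * x_{t-1} + eps_t with N(0,1) noise. Correlations decay
# geometrically (strong mixing), so the CLT survives -- but the limiting
# variance of sqrt(n) * mean is 1/(1-phi)^2, not Var(x) = 1/(1-phi^2).
phi, n, n_paths = 0.5, 1000, 4000
x = np.zeros((n_paths, n))
for t in range(1, n):
    x[:, t] = phi * x[:, t - 1] + rng.standard_normal(n_paths)

scaled_means = np.sqrt(n) * x.mean(axis=1)
print(round(float(scaled_means.var()), 2))  # ~4, i.e. 1/(1-0.5)^2
```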
There are also interesting higher order terms. The √n is specifically the rate that zeroes out the higher-order cumulants. Skewness (third cumulant) decays at 1/√n, excess kurtosis at 1/n, and so on up. Edgeworth expansions formalize this as an asymptotic series in powers of 1/√n with cumulant-dependent coefficients. So the Gaussian is the leading term of that expansion, and Edgeworth tells you the rate and structure of convergence to it.
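The 1/√n skewness decay is easy to see empirically; a minimal sketch using sums of Exponential(1) draws (my choice of example), whose sum is Gamma(n) with skewness exactly 2/√n:

```python
import numpy as np

rng = np.random.default_rng(2)

def sample_skewness(s):
    z = (s - s.mean()) / s.std()
    return float(np.mean(z**3))

# Sums of n Exponential(1) draws form a Gamma(n) variable whose skewness
# is exactly 2/sqrt(n) -- the 1/sqrt(n) decay of the third-cumulant term.
skews = {}
for n in (4, 16, 64):
    sums = rng.exponential(size=(200_000, n)).sum(axis=1)
    skews[n] = sample_skewness(sums)
    print(n, round(skews[n], 3), round(2 / np.sqrt(n), 3))
```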
(I know it is very easy to do "maths" this way).
If I'm remembering it correctly, it's interesting to think about the ramifications of that for the moments.
To me it results from two factors: 1. the Gaussian is the max-entropy distribution for a given variance, and 2. variance is the model of energy-limited behavior, whereas physical processes are always under some energy limit. Basically it is the 2nd law.
BUT for the exceptional world, causes multiply or cascade: earthquake magnitudes, network connectivity, etc. So you get log-normal or fat-tailed distributions.
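Factor 1 can be checked against closed-form differential entropies; a small sketch comparing three unit-variance distributions (the choice of comparison distributions is mine):

```python
import numpy as np

# Closed-form differential entropies (in nats) of three unit-variance
# distributions; the Gaussian wins, as max-entropy-at-fixed-variance predicts.
sigma = 1.0
h_gauss = 0.5 * np.log(2 * np.pi * np.e * sigma**2)  # ~1.419
h_laplace = 1 + np.log(np.sqrt(2) * sigma)           # scale b = sigma/sqrt(2), ~1.347
h_uniform = np.log(np.sqrt(12) * sigma)              # width sqrt(12)*sigma, ~1.242

print(round(float(h_gauss), 3), round(float(h_laplace), 3), round(float(h_uniform), 3))
print(bool(h_gauss > h_laplace > h_uniform))  # True
```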
https://en.wikipedia.org/wiki/Galton_board
I saw one of these at the (I think) Boston Science Museum when I was a kid. They have some pretty cool videos on YouTube if you're curious.
Edit: see e.g. John Baez's write-up "What is Entropy?" on the entropy-maximization principle, where Gaussians make an entrance.
All summation roads lead to normal curves. (There might be an exception for weird probability distributions that do not have a mean, such as the Cauchy distribution; I was surprised when I learned these exist.)
Life is full of sums. Height? That's a sum of genetics and nutrition, and both of those can be broken down into other sums. How long the treads last on a tire? That's a sum of all the times the tire has been driven, and all of those times driving are just sums of every turn and acceleration.
I'm not a data scientist. I'm just a programmer that works with piles of poorly designed business logic.
How did I do in my interview? (I am looking for a job.)
If I had made the extra condition that the random variables had finite variance, you'd be correct. Without the finite variance condition, the distribution is Levy stable.
Levy stable distributions can have finite mean but infinite variance. They can also have infinite mean and infinite variance. Only in the finite mean and finite variance case does it imply a Gaussian.
Levy stable distributions are also called "fat-tailed", "heavy-tailed" or "power law" distributions. In some sense, Levy stable distributions are more normal than the normal distribution. It might be tempting to dismiss the infinite variance condition but, practically, this just means you get larger and larger numbers as you draw from the distribution.
This was one of Mandelbrot's main positions, that power laws were much more common than previously thought and should be adopted much more readily.
As an aside, if you do ever get asked this in an interview, don't expect to get the job if you answer correctly.
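The "larger and larger numbers" point above is easy to see with a running mean; a minimal sketch contrasting standard Cauchy draws (an α = 1 stable law with no mean) against a normal sample:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 1_000_000
ks = np.arange(1, n + 1)

# Running mean of standard Cauchy draws (alpha = 1 stable, no mean):
# a single huge draw can dwarf everything seen so far, so it never settles.
cauchy_mean = np.cumsum(rng.standard_cauchy(n)) / ks
print(cauchy_mean[999], cauchy_mean[99_999], cauchy_mean[-1])

# A normal sample's running mean, by contrast, converges to 0:
normal_mean = np.cumsum(rng.standard_normal(n)) / ks
print(bool(abs(normal_mean[-1]) < 0.01))  # True
```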
But the counterintuitive thing about the CLT is that it applies to distributions that are not normal.
For simplicity, take N identically distributed random variables that are uniform on the interval [-1/2, 1/2], so the probability density function f(x) is 1 on that interval.
The Fourier transform of f(x), F(w), is essentially sin(w)/w. Taking only the first few terms of the Taylor expansion, ignoring constants, gives (1-w^2).
Convolution is multiplication in Fourier space, so you get (1-w^2)^n. Squinting, (1-w^2)^n ~ (1-n w^2 / n)^n ~ exp(-n w^2). The Fourier transform of a Gaussian is a Gaussian, so the result holds.
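A numerical sanity check of the argument (n = 12 is an arbitrary choice): the variance of the sum is n/12, and the excess kurtosis shrinks like 1/n toward the Gaussian value of 0.

```python
import numpy as np

rng = np.random.default_rng(4)

# Sum of n = 12 uniforms on [-1/2, 1/2]: variance n/12 = 1, and the shape
# is already close to Gaussian (keeping the constants the argument ignores,
# the transform of the sum is ~exp(-n w^2 / 24)).
n = 12
sums = rng.uniform(-0.5, 0.5, size=(500_000, n)).sum(axis=1)
print(round(float(sums.var()), 3))  # ~1.0

# Excess kurtosis of one uniform is -6/5, so the sum's is (-6/5)/n = -0.1,
# shrinking toward the Gaussian value of 0 as n grows.
z = (sums - sums.mean()) / sums.std()
excess = float(np.mean(z**4) - 3)
print(round(excess, 3))  # ~ -0.1
```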
Unfortunately I haven't worked it out myself, but I've been told that if you fiddle with the exponent of 2 (presumably choosing it to be in the range (0,2]), this gives the motivation for Levy stable distributions, which is another way to see why fat-tailed/Levy stable distributions are so ubiquitous.
Uniform distributions with different widths and centers still have characteristic functions that are quadratic near the origin, so the above argument only needs to be minimally changed.
The added bonus is that if the (1-w^2)^n is replaced by (1-w^a)^n, you can sort of see how to get at the Levy stable distribution (see the characteristic function definition [0]).
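For the α = 1 case this is checkable by hand: numerically inverting the characteristic function exp(−|w|) (plain trapezoid rule; the grid is an arbitrary choice) recovers the Cauchy density 1/(π(1+x²)), a stable law with the promised fat tails.

```python
import numpy as np

# Stable characteristic function exp(-|w|^alpha) with alpha = 1: inverting
# it numerically recovers the Cauchy density 1 / (pi * (1 + x^2)).
w = np.linspace(0.0, 50.0, 200_001)
dw = w[1] - w[0]

density = {}
for x in (0.0, 1.0, 3.0):
    # f(x) = (1/pi) * integral_0^inf cos(w x) exp(-w) dw, trapezoid rule
    integrand = np.cos(w * x) * np.exp(-w)
    density[x] = float((integrand.sum() - 0.5 * (integrand[0] + integrand[-1])) * dw / np.pi)
    print(x, round(density[x], 5), round(1 / (np.pi * (1 + x**2)), 5))
```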
The point is that this gives a simple, high-level motivation as to why it's so common. Aside from seeing this flavor of proof in "An Invitation to Modern Number Theory" [1], I haven't really seen it elsewhere (though, to be fair, I'm not a mathematician). I had also never heard the connection of this method to Levy stable distributions except from someone who explained it to me personally.
I disagree about the audience for Quanta. They tend to be exposed to higher level concepts even if they don't have a lot of in depth experience with them.
[0] https://en.wikipedia.org/wiki/Stable_distribution#Parametriz...
[1] https://www.amazon.com/Invitation-Modern-Number-Theory/dp/06...
Unfortunately, many "researchers" blindly assume that real-life phenomena follow a Gaussian when many of them don't... so their models end up skewed.
The causal chain is: the math is simple -> teachers teach simple things -> students learn what they're taught -> we see the world in terms of concepts we've learned.
The central limit theorem generalizes beyond simple math to hard math: Levy alpha-stable distributions when variance is not finite, and the Fisher-Tippett-Gnedenko theorem with its Gumbel/Fréchet/Weibull distributions for extreme values. Those curves are also everywhere, but we don't see them because we weren't taught them, because the math is tough.
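The extreme-value side is just as easy to demo as the CLT; a sketch of Fisher-Tippett-Gnedenko for Exponential(1) maxima (sample sizes are arbitrary), which converge to the Gumbel law exp(−exp(−x)) after shifting by log n:

```python
import numpy as np

rng = np.random.default_rng(5)

# Maxima of n Exponential(1) draws, shifted by log(n), converge to the
# Gumbel distribution exp(-exp(-x)) (Fisher-Tippett-Gnedenko).
n, n_trials = 500, 20_000
maxima = rng.exponential(size=(n_trials, n)).max(axis=1) - np.log(n)

# Compare the empirical CDF with the Gumbel limit at a few points.
empirical = {}
for x in (-1.0, 0.0, 2.0):
    empirical[x] = float(np.mean(maxima <= x))
    print(x, round(empirical[x], 3), round(float(np.exp(-np.exp(-x))), 3))
```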
We can use Calculus to do so much but also so little…
In practice, when modeling, you are almost always better off not assuming normality, and you want to test models that allow the possibility of heavy tails. The CLT is an approximation, and modern robust methods, or Bayesian methods that don't assume Gaussian priors, are almost always better models. But this of course calls into question the very universality of the CLT (i.e. it is natural in math, but not really in nature).
Statisticians love averages, so anything that could plausibly be sampled as a normal distribution will be presented as one
The median is actually more descriptive, and power laws are equally pervasive, if not more so
* excluding bizarre degenerates like constants or impulse functions
He has several other related videos also.
https://www.youtube.com/@3blue1brown/search?query=convolutio...