Posted by speckx 3 days ago
In my professional life I’ve never personally worked on a problem that I felt wasn’t adequately approached with frequentist methods. I’m sure other people’s experiences are different depending on the problems they gravitate towards.
In fact, I tend to get pretty frustrated with Bayesian approaches, because when I do turn to them it tends to be in situations that are already quite complex and large. In basically every such instance I’ve never been able to make the Bayesian approach work: either it won’t converge, or the sampler says it will take days and days to run. I can almost always just resort to some resampling method that might take a few hours, but it runs and gives me sensible results.
I realize this is heavily biased by my basically only attempting Bayesian methods on super-complex problems, but it has sort of soured me on even trying anymore.
To be clear I have no issue with Bayesian methods. Clearly they work well and many people use them with great success. But I just haven’t encountered anything in several decades of statistical work that I found really required Bayesian approaches, so I’ve really lost any motivation I had to experiment with it more.
Multilevel models are one example of a problem where Bayesian methods are hard to avoid, as inference is otherwise unstable, particularly when available observations are not abundant. Multilevel models should be used more often, as shrinkage of effect sizes is important for making robust estimates.
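As a minimal illustration of that shrinkage (a toy empirical-Bayes sketch with synthetic numbers of my own choosing, not a full multilevel model): noisy per-group means pulled toward the grand mean beat the raw means in mean-squared error.

```python
import numpy as np

rng = np.random.default_rng(42)
n_groups, n_per = 50, 5
true_effects = rng.normal(0.0, 1.0, n_groups)             # group-level effects
data = true_effects[:, None] + rng.normal(0.0, 2.0, (n_groups, n_per))

raw = data.mean(axis=1)                  # no pooling: raw per-group means
sigma2 = 4.0 / n_per                     # sampling variance of each mean
tau2 = max(raw.var(ddof=1) - sigma2, 0)  # estimated between-group variance
shrink = tau2 / (tau2 + sigma2)          # empirical-Bayes shrinkage factor
pooled = raw.mean() + shrink * (raw - raw.mean())

mse = lambda est: np.mean((est - true_effects) ** 2)
print(mse(raw), mse(pooled))  # shrinkage lowers mean-squared error
```

With few observations per group the raw means are dominated by noise, and pulling them toward the grand mean is exactly the stabilisation the comment is describing.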
Lots of flashy results published in Nature Medicine and similar journals turn out to be statistical noise when you look at them from a rigorous perspective with adequate shrinkage. I often review for these journals, and it's a constant struggle to try to inject some rigor.
From a more general perspective, many frequentist methods fall prey to Lindley's Paradox. In simple terms, their inference is poorly calibrated for large sample sizes. They often mistake a negligible deviation from the null for a "statistically significant" discovery, even when the evidence actually supports the null. This is quite typical in clinical trials. (Spiegelhalter et al, 2003) is a great read to learn more even if you are not interested in medical statistics [1].
[1] https://onlinelibrary.wiley.com/doi/book/10.1002/0470092602
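Lindley's point can be seen in a few lines. The numbers below are a toy illustration of my own (not from the book): a million coin flips with 50.1% heads are "significant" at the 5% level, yet the Bayes factor under a uniform alternative strongly favours the null.

```python
from scipy.stats import binom, binomtest

n, x = 1_000_000, 501_000           # 50.1% heads in a million flips
# Frequentist: two-sided exact test of H0: theta = 0.5
pval = binomtest(x, n, 0.5).pvalue  # below 0.05 -> "significant" discovery
# Bayesian: BF01 with a uniform prior over theta under H1.
# Under that prior the marginal is P(x | H1) = 1 / (n + 1) for every x.
bf01 = binom.pmf(x, n, 0.5) * (n + 1)
print(pval, bf01)  # p < 0.05, yet BF01 >> 1: the evidence favours the null
```

The deviation is negligible in practical terms, but at this sample size the p-value flags it anyway, which is exactly the miscalibration described above.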
Ironically, in my personal experience with the problems I have worked on, this has been one of the primary examples where frequentist mixed and random-effects models have worked just fine. On rare occasions I have encountered a situation where the data was particularly complex, or where I wanted to use an unusual compound probability distribution and thought Bayesian approaches would save me. Instead, I have routinely ended up with models that never converge or take impractical amounts of time to run. Maybe it’s my lack of experience, jumping into Bayesian methods only on super-hard problems. That’s totally possible.
But I have found many frequentist approaches to multilevel modeling perfectly adequate. That does not, of course, mean that will hold true for everyone or all problems.
One of my hot takes is that people seriously underestimate the diversity of data problems such that many people can just have totally different experiences with methods depending on the problems they work on.
If you have experienced problems with convergence, give Stan a try. Stan is really robust, polished, and simple. Besides, models are statically typed and it warns you when you do something odd.
Personally, I think once you start doing multilevel modeling to shrink estimates, there's no way back. At least in my case, I now see it everywhere. Thanks to efficient variational Bayes methods built on top of JAX, it is doable even on high-dimensional models.
In a Bayesian analysis, the result of an inference, e.g. about the fairness of a coin as in Lindley's paradox, depends completely on the distribution of the alternative specified in the analysis. The frequentist analysis, for better and worse, doesn't need to specify a distribution for the alternative.
The classic Lindley's paradox uses a uniform alternative, but there is no justification for this at all. It's not as though a coin is either perfectly fair or has a totally random heads probability. A realistic bias will be subtle and the prior should reflect that. Something like this is often true of real-world applications too.
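To see how much the alternative's prior matters, here is a toy sketch (numbers of my own choosing): the same data yield a Bayes factor that favours the null under a uniform Beta(1, 1) alternative, but favours the alternative under a tight Beta(a, a) prior that encodes a subtle bias near 0.5.

```python
import numpy as np
from scipy.special import betaln, gammaln

n, x = 1_000_000, 501_000  # 50.1% heads in a million flips
log_binom = gammaln(n + 1) - gammaln(x + 1) - gammaln(n - x + 1)
log_pmf0 = log_binom + n * np.log(0.5)  # likelihood under H0: theta = 0.5

def bf01(a):
    # Marginal likelihood under H1 with theta ~ Beta(a, a), via log-space
    # beta functions for numerical stability.
    log_m1 = log_binom + betaln(x + a, n - x + a) - betaln(a, a)
    return np.exp(log_pmf0 - log_m1)

print(bf01(1))        # uniform alternative: BF01 >> 1, "evidence for the null"
print(bf01(125_000))  # tight prior (sd ~ 0.001 around 0.5): BF01 < 1, flips
```

Same data, opposite conclusions, purely from the choice of alternative, which is the point being made above.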
Bayesian supporters often like to say they are just using more information by encoding it in priors, but if they had data to support their priors, they would be frequentists.
---
Dear ChatGPT, are there priors in frequentist statistics? (Please answer with a single sentence.)
No — unlike Bayesian statistics, frequentist statistics do not use priors, as they treat parameters as fixed and rely solely on the likelihood derived from the observed data.
I studied stats at Duke, which is a Bayesian academy. Almost every problem there comes from regimes with small sample sizes. Given that Duke houses the largest academic clinical research organization globally, having stats and biostats departments with this bent is useful: samples in clinical trials are tiny compared to most big-data settings.
The biggest problem with the whole Bayesian regime, IMO, is that as the data gets larger its selling point vanishes. If your data is big, or is normal (mean-based statistics), a frequentist/bootstrapped CI approximates the Bayesian CI anyway.
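A quick toy check of that claim (my own synthetic numbers, assuming a large i.i.d. normal sample): the bootstrap percentile CI for the mean and the large-sample flat-prior Bayesian credible interval land on essentially the same endpoints.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=10.0, scale=2.0, size=5_000)  # "big data" regime

# Frequentist: bootstrap percentile CI for the mean
boots = rng.choice(data, size=(2_000, data.size), replace=True).mean(axis=1)
ci_boot = np.percentile(boots, [2.5, 97.5])

# Bayesian: with a flat prior, the posterior for the mean is approximately
# N(xbar, s^2 / n) for large n, so the 95% credible interval is:
xbar, se = data.mean(), data.std(ddof=1) / np.sqrt(data.size)
ci_bayes = np.array([xbar - 1.96 * se, xbar + 1.96 * se])

print(ci_boot, ci_bayes)  # the two intervals nearly coincide
```

This is the Bernstein-von Mises effect in miniature: with enough data the prior washes out and the two intervals converge.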
Furthermore, many of us work in settings where we're trying to sell toothpaste: we don't need the Bayesian guarantees that an insurer might.
I've found Bayesian methods shine in cases of an "intractable partition function".
Cases such as language models, where the cardinality of your discrete probability distribution is extremely large, to the point of intractability.
Bayesians tend to immediately go to things like Monte Carlo estimation. Is that fundamentally Bayesian and anti-frequentist? Not really... it's just that being open to Bayesian ways of thinking leads you towards that more.
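For what it's worth, the Monte Carlo move is a few lines and carries no Bayesian commitment. A toy sketch (the density and proposal are my own choices): estimate the intractable normaliser of exp(-x^4) by importance sampling from a Gaussian.

```python
import numpy as np

rng = np.random.default_rng(7)

# Unnormalised density exp(-x^4); its normaliser Z has no elementary form
log_p_tilde = lambda x: -x**4

# Importance sampling: draw from a tractable proposal q = N(0, 1) and
# average the importance weights p_tilde(x) / q(x).
xs = rng.normal(size=200_000)
log_q = -0.5 * xs**2 - 0.5 * np.log(2 * np.pi)
Z_hat = np.mean(np.exp(log_p_tilde(xs) - log_q))

print(Z_hat)  # close to 2 * Gamma(5/4) ~ 1.813, the true normaliser
```

The weights here are bounded (the target's tails are lighter than the Gaussian's), so the estimator is well behaved; nothing about it requires a Bayesian interpretation.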
Reinforcement learning also feels much more naturally Bayesian. I mean, Thompson sampling, the granddaddy of RL, was developed through a frequentist lens. But it feels very Bayesian as well.
In the modern era, we have Stein's paradox, and it all feels the same.
Hardcore Bayesians that seem to deeply hate the Kolmogorov measure theoretic approach to probability are always interesting to me as some of the last true radicals.
I feel like for 99% of the world today, these are all just tools and we use them where they're useful.
In the very first example, a practitioner would consciously have to decide (i.e. make the assumption) whether the number of sides on the die (n) is known and deterministic. Once that decision is made, the framework with which observations are evaluated and statistical reasoning applied will forever be conditional on that assumption, unless it is revised. Practitioners are generally OK with that, whether it leads to ‘Bayesian’ or ‘frequentist’ analysis, and move on.
Obviously the analogy isn't perfect (priors are explicit and interpretable, pre-trained weights are not), but I think it's a useful mental model for anyone coming from an ML background who finds Bayesian stats unintuitive. Regularization being secretly Bayesian was the other thing that made it click for me. If you've ever tuned a Ridge regression lambda, you were doing informal prior selection.
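The ridge-as-prior point can be made exact: with Gaussian noise of variance sigma^2, the ridge solution with penalty lambda is the MAP estimate under a zero-mean Gaussian prior with variance sigma^2 / lambda. A toy sketch on synthetic data of my own:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))
w_true = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
y = X @ w_true + rng.normal(scale=0.5, size=200)

lam, sigma2 = 2.0, 0.25  # ridge penalty; Gaussian noise variance

# Ridge estimate: argmin ||y - Xw||^2 + lam * ||w||^2
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(5), X.T @ y)

# MAP estimate under the prior w ~ N(0, tau2 * I) with tau2 = sigma2 / lam:
# the posterior mode solves (X'X + (sigma2 / tau2) I) w = X'y -- same system.
tau2 = sigma2 / lam
w_map = np.linalg.solve(X.T @ X + (sigma2 / tau2) * np.eye(5), X.T @ y)

print(np.allclose(w_ridge, w_map))  # identical: lambda encodes a prior scale
```

So tuning lambda by cross-validation really is choosing a prior variance, just with the Bayesian bookkeeping left implicit.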
After Stein's paradox it became super hard to be a pure frequentist if you didn't have your head in the sand.
What does that have to do with anything? If one cares about that, using a shrinkage estimator is an option that maintains frequentist purity.
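And the shrinkage option really does pay off in purely frequentist terms. A toy sketch of the James-Stein estimator (my own synthetic setup): theta is a fixed unknown vector, the risk is an average over repeated sampling, and shrinkage still beats the MLE.

```python
import numpy as np

rng = np.random.default_rng(3)
p = 20                               # dimension (James-Stein needs p >= 3)
theta = rng.normal(0.0, 1.0, p)      # fixed unknown mean vector
reps = 2_000
x = theta + rng.normal(size=(reps, p))  # one N(theta, I) draw per repetition

# James-Stein estimator: shrink the MLE x toward zero
norms2 = np.sum(x**2, axis=1, keepdims=True)
js = (1 - (p - 2) / norms2) * x

# Frequentist risk: expected squared error over repeated sampling, theta fixed
risk_mle = np.mean(np.sum((x - theta) ** 2, axis=1))
risk_js = np.mean(np.sum((js - theta) ** 2, axis=1))
print(risk_mle, risk_js)  # James-Stein risk is strictly lower
```

No prior appears anywhere in the evaluation, which is why this counts as "maintaining frequentist purity" while still shrinking.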
However, from an engineering lead's perspective, I find that while students might have a 'Bayesian intuition,' our industry-standard observability tools (Prometheus, etc.) are fundamentally frequentist. We define SLAs based on tail latency percentiles (p99), which are frequentist estimators.
The cognitive shift I'm referring to is moving from 'here is a threshold' to 'here is a distribution of possible truths' when building adaptive systems, like agentic orchestrators. In those cases, the overhead of a Bayesian approach (defining priors for every microservice latency, etc.) often loses out to the pragmatism of 'is the p99 stable?'. We trade theoretical correctness for operational speed and simplicity.
Haskell is a little more complicated to learn but also more expressive than other programming languages; this is where the comparison works.
But where it breaks down is safety. If your Haskell code runs, it's more likely to be correct because of all the type system goodness.
That's the reverse of the situation with Bayesian statistics, which is more like C++. It has all kinds of cool features, but they all come with superpowered footguns.
Frequentist statistics is more like Java. No one loves it but it allows you to get a lot of work done without having to track down one of the few people who really understand Haskell.
The entire generative concept implicitly assumes that parameters have probability distributions themselves that naturally give rise to generative models...
You could do frequentist inference on a generative model, sure, but generative modelling seems fundamentally alien to frequentist thinking?
Though if you think about it, a diffusion model is somewhat (partially) frequentist.
But while it's a probability distribution, to a frequentist what is being estimated are the fixed parameters of a distribution.
The distribution isn't generative, it just represents uncertainty - and I think that's a bit of the deep core philosophical divide between frequentists and Bayesians - you might use all the same math, but you cannot possibly think of it as being generative.
https://arxiv.org/pdf/2510.18777
But that doesn't mean a frequentist views a VAE as a generative model!
Putting it another way, Gaussian processes originated as a frequentist technique! But to a frequentist they are not generative.
To be more precise, in Bayesian statistics a parameter is a random variable. But what does that mean? A parameter is a characteristic of a population (as opposed to a characteristic of a sample, which is called a statistic). A quantity, such as the average number of cars per household right now: that's a parameter. To think of a parameter as a random variable is like regarding reality as just one realisation of an infinite number of alternate realities that could have been. The problem is we only observe our reality. All the data samples that we can ever study come from this reality. As a result, it's impossible to infer anything about the probability distribution of the parameter. The whole Bayesian approach to statistical inference is nonsensical.