Posted by gidellav 9/8/2025

Anscombe's Quartet (en.wikipedia.org)
133 points | 26 comments
flpm 9/9/2025|
And check this one, which is a generalization of the Datasaurus where you can define your own shapes :D

https://github.com/stefmolin/data-morph

moi2388 9/9/2025|
From now on I won’t trust any statistic unless I can transform it into a panda.
djoldman 9/9/2025||
A classic.

See also:

https://en.wikipedia.org/wiki/Datasaurus_dozen

sunrunner 9/9/2025||
Content warning: This is a baker’s dozen not a regular dozen, in case anyone clicks through expecting to find twelve and is mildly and briefly perturbed.
djoldman 9/9/2025||
The scary thing is that, yes, we can see these in 2D and maybe 3D. But ...

usually there are more than 2 or 3 columns in our data :(

imurray 9/9/2025||
It's clearly hard, but there are tools for exploratory visualization of high-dimensional data: GGobi http://ggobi.org/ and all the methods that arrange points while trying to get local neighborhoods correct (t-SNE, UMAP, et al.).
lamename 9/9/2025||
Yeah, but it's still "scary" because even with those algorithms you have to pay attention and be really careful not to fool yourself. For example, a good demonstration with t-SNE: https://distill.pub/2016/misread-tsne/?hl=cs
jihadjihad 9/9/2025||
Often there is little or no substitute for plotting the data to see how it is distributed. A scatter plot, histogram, density plot, etc. is almost always going to tell you a "story" about the data that the summary stats will have compressed.

But sometimes you are at the mercy of the data and your visualization of choice. Box plots, for example, are great at showing more than just how the data is centered, but it is possible to encounter situations where the box plots of the data remain static while the underlying data is clearly changing [0].

As always it is good to know about these things and continue to add to the arsenal (violin plots, in the example above) of tools and intuition needed to tease out the story behind the data.

0: https://www.research.autodesk.com/publications/same-stats-di...
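A minimal sketch of the box plot point, using only Python's standard library. The two samples are hypothetical, made up for illustration (they are not from the linked publication), and `five_number_summary` is just an illustrative helper. It computes the five numbers a box plot is drawn from, assuming no points get flagged as outliers:

```python
from statistics import quantiles


def five_number_summary(data):
    """Return (min, Q1, median, Q3, max), the values a basic box plot shows."""
    ordered = sorted(data)
    # "inclusive" treats the sample as containing its own min and max.
    q1, q2, q3 = quantiles(ordered, n=4, method="inclusive")
    return (ordered[0], q1, q2, q3, ordered[-1])


# Hypothetical samples, invented for this example.
EVEN = [0, 1, 2, 3, 4, 5, 6, 7, 8]     # evenly spread
CLUMPED = [0, 2, 2, 2, 4, 6, 6, 6, 8]  # clustered around 2 and 6

print("even:   ", five_number_summary(EVEN))
print("clumped:", five_number_summary(CLUMPED))
```

Both samples give the same five numbers, so their box plots would look identical even though one is evenly spread and the other is bimodal. A histogram or violin plot would show the difference.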

__mharrison__ 9/9/2025||
I teach curve fitting with this dataset and recently added a fifth dataset. It illustrates Simpson's paradox.

https://www.linkedin.com/posts/panela_loved-adding-ancombes-...

aleyan 9/9/2025|
That's an amazing addition! Once I read about Simpson's paradox[0], I couldn't help seeing it or suspecting it everywhere. Luckily, it is not a true paradox, and it can be resolved if the underlying data is available and not just summary statistics.

I recommend putting the Quintet together in one image, so that the original 4 charts plus the new one are all visible and interpretable together. It will be a learning aid for decades to come.

[0] https://en.wikipedia.org/wiki/Simpson's_paradox
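A toy sketch of how the paradox shows up and goes away once the underlying data is available. The numbers are hypothetical, invented for illustration (this is not the fifth dataset from the parent comment), and `slope` is just an illustrative helper:

```python
from statistics import mean


def slope(points):
    """Least-squares slope of y on x for a list of (x, y) pairs."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in points)
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx


# Hypothetical groups: within each one, y rises with x.
GROUP_A = [(1, 6), (2, 7), (3, 8)]
GROUP_B = [(6, 1), (7, 2), (8, 3)]
POOLED = GROUP_A + GROUP_B  # the group label is thrown away here

print("slope within A:", slope(GROUP_A))
print("slope within B:", slope(GROUP_B))
print("slope pooled:  ", slope(POOLED))
```

Each group has a positive slope, but the pooled slope is negative because group B sits at higher x and lower y. With only the pooled summary you would draw the opposite conclusion; with the group label the contradiction disappears.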

__mharrison__ 9/9/2025||
Yes, not saying the data dinosaur isn't cool. But for real-world applications, the quartet with the addition of this fifth dataset is more useful for pedagogical purposes.
joshdavham 9/9/2025||
During my statistics degree, Anscombe’s Quartet was used as an example of why you should always try to visualize your dataset and not just run your calculations blindly. I’m a bit odd in that I don’t care much for data viz, but Anscombe’s Quartet really shows how important it is in practice.
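For anyone who wants to check the "same statistics" claim, here is a minimal sketch using only Python's standard library. The values are the quartet as published by Anscombe (1973); the helper name `summarize` and the dictionary layout are illustrative choices:

```python
from statistics import mean, variance

# Sets I-III share the same x values.
X123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
QUARTET = {
    "I": (X123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (X123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (X123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def summarize(x, y):
    """Means, sample variances, Pearson correlation, and least-squares line."""
    mx, my = mean(x), mean(y)
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    return {
        "mean_x": mx,
        "mean_y": my,
        "var_x": variance(x),
        "var_y": variance(y),
        "corr": sxy / (sxx * syy) ** 0.5,
        "slope": slope,
        "intercept": my - slope * mx,
    }


SUMMARIES = {name: summarize(x, y) for name, (x, y) in QUARTET.items()}
for name, stats in SUMMARIES.items():
    print(name, {key: round(value, 2) for key, value in stats.items()})
```

The four rows agree to within about 0.01 on every statistic, which is exactly why the scatter plots are needed to tell the sets apart.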
jkyrlach 9/9/2025||
This dataset is definitely a treasure, and I love visualizing data. That said, I think what's missed when this is used as an argument for visual analysis is the idea of quantitatively identified outliers. If you compute a descriptive statistic such as the 99th percentile (p99), it will most definitely not be the same across these four sets. Visual analysis is a valuable dimension for data exploration, but it's a bit of a strawman to infer that "quantitative analysis could go no further, only visual analysis could figure this out".
divbzero 9/9/2025||
I know this is against the main point of Anscombe’s Quartet but just curious: Could skewness or other summary statistics differentiate the four distributions?
dccsillag 9/10/2025|
Take enough moments and you'll be able to differentiate any distributions.
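As a quick check on the quartet itself, here is a sketch that computes the third standardized moment (skewness) of the y values in each set, using only the standard library. The y values are Anscombe's published ones; `skewness` is an illustrative helper using the population formula, which is one of several conventions:

```python
from statistics import mean

# y values of Anscombe's quartet, sets I-IV.
Y = {
    "I": [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68],
    "II": [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74],
    "III": [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73],
    "IV": [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89],
}


def skewness(values):
    """Population skewness: third central moment divided by variance ** 1.5."""
    m = mean(values)
    n = len(values)
    m2 = sum((v - m) ** 2 for v in values) / n
    m3 = sum((v - m) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


SKEW = {name: skewness(ys) for name, ys in Y.items()}
for name, value in SKEW.items():
    print(name, round(value, 2))
```

The means and variances of y match across the sets, but the skewness does not: set II comes out left-skewed and sets III and IV right-skewed, so the third moment is already enough to separate them.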
padraigf 9/9/2025||
I love it. I was introduced to it by Edward Tufte's book, 'https://www.amazon.co.uk/Visual-Display-Quantitative-Informa...'.

And was just thinking about it the other day. I had a bug aggregating sleep-data from an iPhone, which comes in the form of sleep-samples.

I was trying to fix it, both by prodding Claude Code to fix the problem and by looking at debug logs of the sleep-samples, but we weren't getting anywhere. I asked Claude Code to graph the samples, and BAM, saw it right away. (The problem was that HealthKit returns you sleep-samples from ALL devices, not just the priority one.)

Maybe not exactly the same thing as Anscombe/Tufte were getting at, but I was reminded of it, and the value of visualising data.

dejj 9/9/2025||
“The Datasaurus Dozen”:

https://blog.revolutionanalytics.com/2017/05/the-datasaurus-...

Mithriil 9/9/2025|
Relevant: Simpson's paradox. https://en.wikipedia.org/wiki/Simpson%27s_paradox