Posted by nl 8 hours ago
SELECT a.province, COUNT(DISTINCT b.id_num) FROM registry a INNER JOIN national_id b ON a.nat_id_num = b.id_num WHERE timeframe = '2026-01-01' GROUP BY 1
Why even do a census if you're just going to synthesize random data as the last step?
I was a big fan of differential privacy, but now I think it might be doing more harm than good. I haven't seen a single case where it was applied successfully to a problem where it actually mattered. It also contributed strongly to discrediting and preventing a lot of work on other anonymization techniques, because the research community deemed it the only way to preserve privacy; showing up with enhancements to k-anonymity or any other noise mechanism not rooted in it was a sure way to get ridiculed and ignored. And it's just not a practical mechanism: even when it works for a single disclosure, you always end up having to blow up the privacy budget to a ridiculous amount in order to keep disclosing statistics, as otherwise, for almost all real-world data, you would run out of budget after a few publications.
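To make the budget point concrete, here is a rough sketch of how a fixed budget gets diluted across releases. It assumes basic sequential composition (the epsilons of successive releases on the same data add up), an even split of the budget, and Laplace noise on a counting query with sensitivity 1. The function names and the total budget of 1.0 are made up for illustration and are not taken from any real release process.

```python
def per_release_epsilon(total_epsilon: float, releases: int) -> float:
    """Split a fixed total budget evenly across releases.

    Assumes basic sequential composition, where the epsilons of
    successive releases on the same data simply add up.
    """
    if releases < 1:
        raise ValueError("need at least one release")
    return total_epsilon / releases


def laplace_scale(epsilon: float, sensitivity: float = 1.0) -> float:
    """Scale b of the Laplace noise for one release.

    The expected absolute error of Laplace(0, b) noise equals b.
    """
    return sensitivity / epsilon


if __name__ == "__main__":
    total = 1.0  # hypothetical total budget, chosen only for illustration
    for k in (1, 10, 100):
        eps = per_release_epsilon(total, k)
        print(f"{k:>3} releases: epsilon each = {eps:.3f}, "
              f"noise scale = {laplace_scale(eps):.0f}")
```

Under these assumptions the noise per statistic grows linearly with the number of releases, so the only ways to keep publishing are to accept noisier numbers or to raise the total budget. Tighter composition theorems soften this, but the trade-off does not go away.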
So, for me it's a technique that works in the areas where it doesn't really matter (publishing highly aggregated statistics that pose almost zero privacy risk even without differential privacy) and doesn't work in other areas where it would actually matter (publishing fine-grained data about individuals or small groups). There are some niche use cases but in my view the privacy community has really overblown the importance of differential privacy by portraying it as the only way to reliably anonymize data.
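A small sketch of why the same noise is harmless on big aggregates and ruinous on small cells. It assumes the plain Laplace mechanism on a count with sensitivity 1; the epsilon of 0.1 and the two example counts are hypothetical values picked for illustration, and production systems use more elaborate mechanisms and post-processing.

```python
import random


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Draw Laplace(0, scale) noise.

    The difference of two independent exponentials with mean `scale`
    follows a Laplace distribution with that scale.
    """
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noisy_count(true_count: float, epsilon: float, rng: random.Random,
                sensitivity: float = 1.0) -> float:
    """Release a count with Laplace noise of scale sensitivity / epsilon."""
    return true_count + laplace_noise(sensitivity / epsilon, rng)


def expected_relative_error(true_count: float, epsilon: float,
                            sensitivity: float = 1.0) -> float:
    """Expected absolute noise (equal to the Laplace scale) over the true count."""
    return (sensitivity / epsilon) / true_count


if __name__ == "__main__":
    rng = random.Random(0)
    epsilon = 0.1  # hypothetical per-release epsilon
    for label, count in (("whole province", 1_000_000), ("small cell", 5)):
        rel = expected_relative_error(count, epsilon)
        sample = noisy_count(count, epsilon, rng)
        print(f"{label}: true={count}, one noisy draw={sample:.1f}, "
              f"expected relative error={rel:.4%}")
```

With these numbers the expected error is a tiny fraction of the large count but twice the size of the small one, which is the asymmetry described above.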
BTW, the German census bureau has an interesting approach to anonymization which they have used for several decades already, and so far I haven't heard of any cases of successful de-anonymization of the data. Maybe the US bureau should have a look at that for their own needs.
As the article says, anytime you want to enforce privacy, the data becomes somewhat less useful; there is just no way around that.
The point of rights is that we have them and that they should not be trampled upon when they become slightly inconvenient to someone in power.
1: https://pmc.ncbi.nlm.nih.gov/articles/PMC8494446/?utm_source...
They weren't prepared for data that was obviously noisy. The data has always been inherently inaccurate, and folks just chose to ignore that previously.
1: https://www.aeaweb.org/articles?id=10.1257%2Fpandp.20191107&...
2: https://www.science.org/doi/10.1126/sciadv.abk3283?utm_sourc...
3: https://www.nationalacademies.org/read/27150/chapter/14
You can of course disagree about what should actually be part of a transparent public record. (Though I suspect a lot of people post-date what was generally available in a "phone book.")
"...for the next 950 days"
every time you read some politically spiteful news like this, because the next two years are going to become insanely miserable
in 950 days there will be several hundred warehouses concentrating over a million people in this country including many thousands of children costing a quarter trillion dollars (already funded)
and the Iran War will still be happening despite over a hundred declared "deals"
and the US will be running Cuba (forcing millions to return there)
statistical noise or the lack of it will be the least of our problems
* I want to accurately report the finances of our company to the best of my ability.
* But that report would allow people to reconstruct private data about the terms of our contracts with various counterparties. I'd really like to avoid that; there's no rule that says we're supposed to release that data. In fact, some of those contracts probably came with nondisclosure agreements!
* So here's what I'm going to do. I'm going to calculate our results to the best of my ability, and then I'm going to add random values to them and report only the randomized ones. Any reconstruction people try to do will be wrong because of the randomness.
* If the SEC says "no, you need to report your actual numbers", I will explain to them that there's no such thing as an actual number because all data is noisy.
I can't get behind it.