Posted by nl 8 hours ago

US bans differential privacy in Census data (desfontain.es)
543 points | 288 comments | page 3
lokar 6 hours ago|
Can anyone share how other countries handle this?
simonw 6 hours ago||
A lot of countries are really bad at running their census. https://asteriskmag.com/issues/11/why-governments-cant-count
ghaff 5 hours ago||
And a lot of countries have things like national IDs that a lot of Americans, rightly or wrongly given things like RealID and passports, just don't like on principle.
pelorat 5 hours ago||
Sure, in Europe we don't, because we already have databases of all citizens. Also, recording attributes like race, skin color, religious affiliation or political leaning in a database is highly illegal, both for the government and for private use.
throw-the-towel 4 hours ago||
Wait, are you saying Europe doesn't have censuses?
brainwad 3 hours ago|||
In my European country (Switzerland), it's mandatory to notify the government whenever you move. There is thus no universal census and also no voter registration. There are still subsampled surveys, though, e.g. for economic data, that might come by mail (addressed to you by name, because they know where you live!).
drysine 2 hours ago||
Just like in the USSR
brainwad 2 hours ago|||
As I understand it, in the Soviet Union you had to get government permission before moving. Whereas here the right to move is guaranteed, you just have to update your details.
Peanuts99 2 hours ago|||
Perhaps on that one axis. On the other hand, Switzerland is probably about as far opposed to the USSR as you can get.
generj 3 hours ago||||
Many countries effectively have a live registry of where people live, updated to within a few weeks. A door-to-door census isn’t needed because they can do something like:

SELECT a.province, COUNT(DISTINCT b.id_num) FROM registry a INNER JOIN national_id b ON a.nat_id_num = b.id_num WHERE timeframe = '2026-01-01' GROUP BY 1
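
A minimal runnable sketch of that kind of registry query, using Python's built-in sqlite3 with an in-memory database. The schema, the province names and the rows are all made up for illustration (only the table and column names come from the query above), and the date is passed as a bound string parameter, since an unquoted 2026-01-01 would be evaluated as integer arithmetic.

```python
import sqlite3


def count_residents_by_province(conn, as_of):
    """Count distinct registered residents per province for one snapshot date."""
    query = """
        SELECT a.province, COUNT(DISTINCT b.id_num) AS residents
        FROM registry a
        INNER JOIN national_id b ON a.nat_id_num = b.id_num
        WHERE a.timeframe = ?
        GROUP BY a.province
        ORDER BY a.province
    """
    return conn.execute(query, (as_of,)).fetchall()


def build_demo_db():
    """Build a tiny in-memory database. Schema and rows are hypothetical."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE national_id (id_num INTEGER PRIMARY KEY);
        CREATE TABLE registry (nat_id_num INTEGER, province TEXT, timeframe TEXT);
        """
    )
    conn.executemany("INSERT INTO national_id VALUES (?)", [(1,), (2,), (3,)])
    conn.executemany(
        "INSERT INTO registry VALUES (?, ?, ?)",
        [
            (1, "North", "2026-01-01"),
            (2, "North", "2026-01-01"),
            (3, "South", "2026-01-01"),
            (3, "North", "2025-01-01"),  # older snapshot, filtered out by the WHERE clause
        ],
    )
    return conn


if __name__ == "__main__":
    print(count_residents_by_province(build_demo_db(), "2026-01-01"))
    # -> [('North', 2), ('South', 1)]
```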

gpvos 3 hours ago|||
Correct. At least, not that I know of. On the other hand, when you move, you must deregister with your old municipality and register with your new one. The exact system differs a bit per country.
thisisaman408 2 hours ago||
I was going to build something cool with fable, and now it's banned, feeling disappointed
watersb 6 hours ago||
The better to sell the data, all your privates are belong to us.
Bratmon 3 hours ago||
This is a rare occasion of the Trump administration getting something right.

Why even do a census if you're just going to synthesize random data as the last step?

ThePhysicist 5 hours ago||
I think it should be noted that, as far as I know, there was a lot of dissatisfaction among users of the census data. So it hasn't been banned just for politics' sake or because they hate privacy... Some people I talked to in the privacy field even called the whole thing a total disaster and weren't shy about putting the blame on John Abowd, who apparently pushed this through despite a lot of internal opposition and concerns. Not sure if that's true, but what is definitely true is that the way the data was released produced serious issues downstream, as most researchers and statisticians who ingested the data weren't prepared to receive noisy data values. Differential privacy was applied in such a way that many invariants data users cared about weren't preserved. That was expected, since you can't preserve all invariants and at the same time add meaningful noise to the data. The thing is, with such a differentially private data release you need to adapt all of the downstream analyses to take into account the exact mechanism by which the data was altered. And since the Census Bureau used a very intricate mechanism that didn't just add Laplace noise to data values but instead relied on a multi-stage process that preserved some invariants but not others, it was very difficult to even write routines to account for the changes made to the data. They essentially asked every data user to rewrite their whole analysis pipeline around the exact disclosure mechanism, which contained a large number of bespoke choices regarding which data invariants to preserve and basically produced a mix of noisy, synthesized data that was just really hard to reason about.

I don't know if there even would've been a way to do this better, but the fact is that not every small county or school district has top-tier statisticians at hand who can just read a whole monograph on differentially private synthesized census data and then hotpatch their existing analysis systems to work with that data.
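
To make the invariants point concrete, here is a minimal sketch assuming the simplest possible mechanism: independent Laplace noise on each count. This is explicitly not the Bureau's multi-stage mechanism, and the county names, counts and epsilon are made up. Even in this toy case the noisy county counts no longer add up to the noisy state total, and small counts can come out fractional or negative, which is exactly the kind of thing a downstream pipeline has to be rewritten to handle.

```python
import random


def laplace_noise(scale, rng):
    # The difference of two i.i.d. exponential draws with mean `scale`
    # is distributed as Laplace(0, scale).
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noisy_count(true_count, epsilon, rng):
    # A counting query has sensitivity 1, so the Laplace scale is 1 / epsilon.
    return true_count + laplace_noise(1.0 / epsilon, rng)


def release(county_counts, epsilon, rng):
    """Noise every county count and the state total independently.

    Counties are disjoint, so all county counts together cost epsilon; the
    state total costs another epsilon (2 * epsilon under basic composition).
    """
    noisy_counties = {
        name: noisy_count(count, epsilon, rng)
        for name, count in county_counts.items()
    }
    noisy_state = noisy_count(sum(county_counts.values()), epsilon, rng)
    return noisy_counties, noisy_state


if __name__ == "__main__":
    rng = random.Random(0)
    counties = {"county_a": 12, "county_b": 3, "county_c": 40_000}  # made-up values
    noisy_counties, noisy_state = release(counties, epsilon=0.5, rng=rng)
    print(noisy_counties)
    print("sum of noisy counties:", sum(noisy_counties.values()))
    print("noisy state total:    ", noisy_state)
```

Restoring the "counties sum to the state" invariant requires a post-processing step on top of this, and every such step changes the error distribution that analysts have to model.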

I was a big fan of differential privacy, but now I think it might be doing more harm than good. I haven't seen a single case where it was applied successfully to a problem where it actually mattered, and it contributed strongly to discrediting and preventing a lot of work on other anonymization techniques: the research community deemed it the only way to preserve privacy, so showing up with enhancements to k-anonymity or any other noise mechanism not rooted in it was a sure way to get ridiculed and ignored. And it's just not a practical mechanism. Even when it works for a single disclosure, you always end up having to blow up the privacy budget to a ridiculous amount in order to keep disclosing statistics, as otherwise, for almost all real-world data, you would run out of budget after a few publications.
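
A rough sketch of the budget problem, assuming basic sequential composition, where the epsilons of successive releases on the same data simply add up. Tighter accounting methods exist, so treat this as the pessimistic textbook version; the class, function names and numbers are hypothetical.

```python
class PrivacyBudget:
    """Tracks cumulative epsilon under basic sequential composition."""

    def __init__(self, total_epsilon):
        self.total_epsilon = total_epsilon
        self.spent = 0.0

    def spend(self, epsilon):
        """Charge one release; raise once the total budget would be exceeded."""
        if epsilon <= 0:
            raise ValueError("epsilon must be positive")
        # Small tolerance so floating-point rounding does not reject the last release.
        if self.spent + epsilon > self.total_epsilon + 1e-12:
            raise RuntimeError("privacy budget exhausted")
        self.spent += epsilon
        return max(0.0, self.total_epsilon - self.spent)


def releases_until_exhausted(total_epsilon, epsilon_per_release):
    """Count how many equally priced releases fit into the total budget."""
    budget = PrivacyBudget(total_epsilon)
    count = 0
    while True:
        try:
            budget.spend(epsilon_per_release)
        except RuntimeError:
            return count
        count += 1


if __name__ == "__main__":
    # With a total budget of 1.0 and 0.25 per table, only four tables can be published.
    print(releases_until_exhausted(1.0, 0.25))  # -> 4
```

The only ways to publish more are to raise the total epsilon or to make each release noisier, which is the trade-off described above.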

So, for me it's a technique that works in the areas where it doesn't really matter (publishing highly aggregated statistics that pose almost zero privacy risk even without differential privacy) and doesn't work in other areas where it would actually matter (publishing fine-grained data about individuals or small groups). There are some niche use cases but in my view the privacy community has really overblown the importance of differential privacy by portraying it as the only way to reliably anonymize data.

BTW the German census bureau has an interesting approach to anonymization which they have used for several decades already, and so far I haven't heard of any cases of successful de-anonymization of the data. Maybe the US bureau should have a look at that for their own needs.

hristov 4 hours ago||
Of course there will be dissatisfaction from users of the data. Anyone that wants to use census data will prefer less privacy in the data. And anytime privacy is enforced the data becomes less useful. It would certainly be very convenient for both advertisers and gerrymandering political consultants to have detailed data on every citizen.

As the article says, anytime you want to enforce privacy, the data becomes somewhat less useful; there is just no way around that.

The point of rights is that we have them and that they should not be trampled upon when they become slightly inconvenient to someone in power.

ThePhysicist 3 hours ago||
Are you sure about that? You are saying that differentially private census data couldn't be used for gerrymandering and advertising while non-differentially-private data could? Hard to believe. I'm not an advertising or gerrymandering expert, but I would assume people running ads or cutting up districts are mostly interested in aggregate statistics, i.e. they won't care about single households? And I would assume they can rely on voter files, party databases etc... And to the contrary, there are reports [1] that indicate differential privacy actually makes gerrymandering analysis more difficult or impossible. So that's not really an argument for differential privacy: discriminatory action can be taken equally well based on differentially private data, as the government cares about groups, not individuals, and groups aren't protected by differential privacy. It seems people really fundamentally misunderstand what this technique can achieve and what it won't do.

1: https://pmc.ncbi.nlm.nih.gov/articles/PMC8494446/?utm_source...

swiftcoder 5 hours ago||
> serious issues downstream as most researchers and statisticians that ingested the data weren't prepared for receiving noisy data values

They weren't prepared for data that was obviously noisy. The data has always been inherently inaccurate, and folks just chose to ignore that previously

ThePhysicist 5 hours ago||
No, there are dozens of articles discussing the mechanism and explaining the impact it had in different areas, e.g. [1,2,3]. And the release mechanism wasn't just "add noise", far from it. You may read the original paper [4] to see how intricate it was; anyone wanting to make real use of the resulting data would have needed to understand that approach in detail. The report of the National Academies [3] is probably the most comprehensive analysis of the mechanism and the complications it introduced, so writing "it has always been inherently inaccurate" is just wrong: this new mechanism was way worse than just introducing unbiased sampling noise.

1: https://www.aeaweb.org/articles?id=10.1257%2Fpandp.20191107&...

2: https://www.science.org/doi/10.1126/sciadv.abk3283?utm_sourc...

3: https://www.nationalacademies.org/read/27150/chapter/14

4: https://hdsr.mitpress.mit.edu/pub/7evz361i/release/2

wnc3141 6 hours ago||
Stalin's demographic researchers kept disappearing until they came up with the numbers he wanted.
delichon 6 hours ago||
The dueling political demands of accuracy and privacy are simply incompatible at some level. After reading this, maybe Hanlon's Razor isn't the right standard. Besides malice and stupidity, there is impossibility. Some problems just aren't solvable under certain constraints. I don't envy the statisticians tasked with finding a politically palatable solution to a math problem.
Sol- 6 hours ago||
But the strength of differential privacy is that you can now make this tradeoff explicit and quantify it. I always liked it because it offers a mathematical solution to a policy problem, but then of course it's up to us to decide what parameters and tradeoff to choose. Also, some data might just not get published at all if the privacy implications are too problematic, so differential privacy might buy you more signal!
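
As a back-of-the-envelope illustration of what "explicit and quantifiable" can mean, here is a small sketch assuming the plain Laplace mechanism on a single count with sensitivity 1. That is the textbook case, not the Census Bureau's actual mechanism, and the function names are made up. Epsilon directly fixes both the typical error and a tail bound.

```python
import math


def laplace_scale(epsilon, sensitivity=1.0):
    # The Laplace mechanism uses noise with scale b = sensitivity / epsilon.
    return sensitivity / epsilon


def expected_abs_error(epsilon, sensitivity=1.0):
    # For Laplace(0, b), the expected absolute value E|X| equals b.
    return laplace_scale(epsilon, sensitivity)


def error_bound(epsilon, confidence=0.95, sensitivity=1.0):
    # P(|X| > t) = exp(-t / b), so the error stays below
    # t = b * ln(1 / (1 - confidence)) with the given probability.
    return laplace_scale(epsilon, sensitivity) * math.log(1.0 / (1.0 - confidence))


if __name__ == "__main__":
    for eps in (0.1, 0.5, 1.0, 5.0):
        print(
            f"epsilon={eps:<4} mean |error|={expected_abs_error(eps):6.2f} "
            f"95% bound={error_bound(eps):6.2f}"
        )
```

Smaller epsilon means stronger privacy and proportionally larger error; picking the value is the policy decision, the formula only makes the price visible.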
thatfrenchguy 6 hours ago|||
Yeah, the main issue with differential privacy is that you need competent government officials making decisions who understand math beyond a high school level.
tbrownaw 6 hours ago|||
It offers a mathematical description of a policy tradeoff, and the policy makers are apparently setting one of the parameters to zero.
ghaff 6 hours ago||
There's a ton of information in the US that is accessible to various degrees--especially through the deep web, not to mention background investigations. Unless you're a wealthy person who can set up various levels of trusts, you can't really hide it.

You can of course disagree about what should actually be part of a transparent public record. (Though I suspect a lot of people post-date what was generally available in a "phone book.")

ck2 5 hours ago||
if you want to keep your sanity, I suggest silently adding the phrase

     "...for the next 950 days" 
every time you read some politically spiteful news like this

because the next two years are going to become insanely miserable

layer8 5 hours ago|
It’s highly uncertain what will happen in 950 days.
ck2 56 minutes ago||
I think it's easy to predict some things that will happen in 950 days

in 950 days there will be several hundred warehouses concentrating over a million people in this country including many thousands of children costing a quarter trillion dollars (already funded)

and the Iran War will still be happening despite over a hundred declared "deals"

and the US will be running Cuba (forcing millions to return there)

statistical noise or the lack of it will be the least of our problems

SpicyLemonZest 4 hours ago||
I really have to take the anti-noise side here. I get why it's a hard problem, and I get why the Census Bureau thought this was a neat solution. But I'm imagining an accountant stepping through a similar chain of logic:

* I want to accurately report the finances of our company to the best of my ability.

* But that report would allow people to reconstruct private data about the terms of our contracts with various counterparties. I'd really like to avoid that, there's no rule that says we're supposed to release that data. In fact some of those contracts probably came with nondisclosure agreements!

* So here's what I'm going to do. I'm going to calculate our results to the best of my ability, and then I'm going to add random values to them and report only the randomized ones. Any reconstruction people try to do will be wrong because of the randomness.

* If the SEC says "no, you need to report your actual numbers", I will explain to them that there's no such thing as an actual number because all data is noisy.

I can't get behind it.

mikelitoris 4 hours ago|
But why?? Differential privacy works? It's not even "woke" or whatever these people perceive. It's just math man...