  • Disclaimer: This video was produced in collaboration with the US Census Bureau and fact-checked

  • by Census Bureau scientists; any opinions and errors are my own.

  • Every ten years the US Census Bureau surveys the American population - the ambitious goal

  • is to count every person currently living in the entire United States of America and

  • collect information about them like age, sex, race and ethnicity.

  • The whole purpose of doing surveys like the census (and many other big medical or demographic

  • surveys) is to be able to get an overall, quantitative picture of a particular population

  • - how many people live in Minnesota?

  • Or Mississippi?

  • What’s their average age?

  • And how do these things differ in different places, or by sex, or race?

  • The results of the US Census are of particular political relevance since they’re used to

  • determine the numbers of seats that different states get in the US House of Representatives

  • as well as the boundaries of legislative districts from Congress down to city councils, but big

  • surveys are also useful for understanding lots of other issues, too. The problem, of

  • course, is that the Census (like many other medical and demographic studies) is supposed

  • to be private.

  • Like, no one outside the Census Bureau is supposed to be able to look at just the published

  • statistics about the US population demographics and definitively figure out that there’s

  • a white married 31-year-old male with no kids living in my neighborhood (that’s me).

  • The Census Bureau is supposed to keep my information confidential.

  • And they’re supposed to keep the information of every single other person living in the

  • United States confidential, too.

  • Which is a tall order, because how can you keep everyone’s information entirely confidential

  • while still saying anything at all based on that information?

  • The short answer is that you can’t.

  • There’s an inherent tradeoff between publishing something you learn from a survey and maintaining

  • the privacy of the participants.

  • It might seem like you could just remove people’s names from the spreadsheet, or only publish

  • summaries like averages and totals.

  • But it’s easy to reconnect names to datasets using powerful computers, and there’s a

  • mathematical theorem that guarantees that if you do a study, every single piece of accurate

  • information that you release, however small it seems, will inherently violate the privacy

  • of the participants in that study to some degree.

  • And the more information you publicly release, the more you violate the individual privacies

  • of the participants.

  • But how do you quantitatively measure something nebulous like loss of privacy, and then how

  • do you protect it?

  • To understand how to measure privacy, it’s helpful to start by imagining how somebody

  • would try to use published results (from a study) and piece together the private information

  • of the people surveyed.

  • They could just try to steal or gain direct access to the private information itself, which,

  • of course, can’t be protected against mathematically - it requires good computer security, or physical

  • defenses, so we won’t consider it here!

  • The kind of privacy attack we can defend against mathematically is an attack that looks at

  • publicly published statistics and then applies brute force computational power to imagine

  • all possible combinations of answers the participants could have given to see which ones are the

  • most plausible - that is, which ones fit the published statistics the best.

  • Imagine checking all possible combinations of letters and numbers for a password until

  • one of them works, except instead of letters and numbers it’s checking all possible combinations of the answers that 330 million people could give on their census questionnaires

  • to see which combinations come closest to the publicly published figures for average

  • age, racial breakdown, and so on.

  • The more closely a potential combination of answers matches the published figures, the

  • more promising a candidate it is (from the attacker’s perspective).

  • The more poorly it matches, the lower their level of certainty.

  • As a small example, if there are 7 people living in a particular area and you tell me

  • that four are female, four like ice cream, four are married adults, three of the ice

  • cream lovers are female, and if you also give me the mean and median ages for all of these

  • categories, then I can perfectly reconstruct the exact ages, sex, and ice cream preference

  • of everyone involved.

  • I would start with the 3 ice cream loving females; even though there are hundreds of

  • thousands of possible combinations of ages for three people, only a small fraction of

  • those - 36, in fact - are plausible - they’re in the right combination to give a median

  • age of 36 and a mean age of 36 and two thirds.
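
As a rough check of that count, the search over the three ice-cream-loving females can be brute-forced in a few lines of Python. This is only a sketch: the function name and the assumed age range of 1 to 115 are illustrative choices, not something stated in the video, and a different age range would change the totals.

```python
from itertools import combinations_with_replacement

def plausible_age_combos(n_people, median, total_age, min_age=1, max_age=115):
    """Return every sorted tuple of ages matching the published median and sum.

    The age range is an assumption made for this sketch.
    Only odd group sizes are handled, so the median is the middle value.
    """
    assert n_people % 2 == 1
    ages = range(min_age, max_age + 1)
    middle = n_people // 2
    return [
        combo
        for combo in combinations_with_replacement(ages, n_people)
        if combo[middle] == median and sum(combo) == total_age
    ]

# Three people with a mean age of 36 2/3 have ages summing to 110.
all_combos = sum(1 for _ in combinations_with_replacement(range(1, 116), 3))
matches = plausible_age_combos(3, median=36, total_age=110)
print(all_combos, len(matches))  # 260130 36
```

Under these assumptions, a few hundred thousand candidate combinations shrink to 36 once the published median and mean are applied.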

  • And the same thing works for the four females overall - there are almost 10 million possible

  • combinations of ages they could have, but only 24 age combinations that are consistent

  • with a median of 30, a mean of 33.5, AND with at least one of the plausible age combinations

  • for the three ice-cream lovers.

  • Continuing on with this kind of deduction leads to a single plausible (and perfect)

  • reconstruction of all of the ages, sexes, and ice-cream preferences of the people involved;

  • a 100% violation of privacy.

  • If, however, you didn’t list how many of the ice cream lovers were female, there would

  • instead be two plausible possibilities, so I would be less certain which was the true

  • combination of ages and genders and ice cream preferences.

  • And the potential level of certainty of an attacker is precisely how we measure the loss

  • of privacy from publishing results of a study.

  • If all possible combinations of ages and sexes and so on are similarly plausible, then an

  • attacker can’t distinguish between them very well and so privacy is well protected.

  • But if a small number of the possibilities are significantly more plausible than the

  • rest, they stand out - and precisely because they stand out on plausibility, they’re

  • also likely to be close to the truth.

  • So to protect privacy, all possibilities need to seem similarly plausible, or at least there

  • can’t be plausibility peaks that are too conspicuous.

  • The potential for plausibility peaks is quantified mathematically by measuring the maximum slope

  • of the graph - if the slope never gets too steep, then you can’t have any sharp peaks

  • of highly plausible possibilities that stand out. But how do we publish statistics in a

  • way that limits the maximum slope (and possible peaks) on the plausibilities plot?

  • In practice, the best way to limit an attacker’s ability to confidently choose one scenario

  • over the other is to randomly change, or “jitter”, the published values.

  • Like, for example, rolling a die and adding that number to the average age reported for

  • ice-cream lovers.

  • Jittering the published results in a mathematically rigorous way puts a limit on the slope of

  • the plausibility graph, and thus makes it harder for any particular possibilities to

  • stand out above the rest.
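
One standard way to make this limit concrete is Laplace noise. That choice is an assumption here, since the video does not say which noise distribution is used. The sketch below compares how plausible a published number is under two candidate true values that differ by one, and checks that the ratio never exceeds a fixed bound. The function names and the `scale` value are illustrative.

```python
import math

def laplace_density(x, center, scale):
    """Probability density of Laplace noise centered on `center`."""
    return math.exp(-abs(x - center) / scale) / (2 * scale)

def max_plausibility_ratio(true_a, true_b, scale, published_values):
    """Largest ratio between the plausibilities of two candidate true values."""
    return max(
        laplace_density(x, true_a, scale) / laplace_density(x, true_b, scale)
        for x in published_values
    )

scale = 2.0                                   # hypothetical noise scale
published = [v / 10 for v in range(0, 801)]   # possible published ages 0.0 to 80.0
ratio = max_plausibility_ratio(36, 37, scale, published)
bound = math.exp(abs(36 - 37) / scale)        # the bound e^(difference / scale)
print(ratio <= bound + 1e-9)
```

A larger `scale` means more jitter and a bound closer to 1, so neighboring possibilities look more alike to an attacker.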

  • Jittering results might also seem like lying, but as long as the size of the adjustment

  • isn’t big enough to make any significant changes to conclusions people draw from the

  • survey, then it’s considered worth it for the privacy protection.

  • For example, imagine I want to give you a sense of my age while keeping my true age

  • secret.

  • If I just told you my age, obviously there’s just one plausible possibility - 31!

  • But suppose instead that I secretly pulled a number between minus 5 and 5 out of a hat

  • and added it to my age before telling you. In this case, all you know is that my true

  • age is somewhere within 5 years of the number I told you, but you don’t know my age exactly.

  • My privacy has been preserved, though only to a certain degree because you can be confident

  • I’m not 20 and not 40.

  • To protect my age more, I’d have to pull a number between, say, -10 and 10 out of a

  • hat and add it to my age - this increases the number of plausible possibilities - that

  • is, the possible true ages that COULD have resulted in the number I told you.

  • It also increases your uncertainty about my actual age - the tradeoff for privacy is inaccuracy.
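
The hat example can be sketched as follows, assuming the hat holds whole numbers that are all equally likely. The function names are illustrative, and the fixed seed is there only to make the run repeatable.

```python
import random

def jitter_age(true_age, max_jitter, rng):
    """Add a whole number drawn uniformly from -max_jitter..max_jitter."""
    return true_age + rng.randint(-max_jitter, max_jitter)

def plausible_true_ages(reported_age, max_jitter):
    """All true ages that could have produced the reported number."""
    return list(range(reported_age - max_jitter, reported_age + max_jitter + 1))

rng = random.Random(0)  # fixed seed only so the sketch is repeatable
for max_jitter in (1, 5, 10):
    reported = jitter_age(31, max_jitter, rng)
    candidates = plausible_true_ages(reported, max_jitter)
    # Wider jitter leaves more candidate ages: 3, then 11, then 21.
    print(max_jitter, reported, len(candidates))
```

The true age of 31 is always among the candidates, but the list of candidates grows with the size of the jitter.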

  • If I wanted you to know my age within a year, I could only pull a number between -1 and

  • 1 out of the hat. In general, the idea is this: more privacy means you get less accuracy. Less

  • privacy means you can have more accuracy. When you publish results, hopefully there’s a

  • sweet spot where you can share something useful while still sufficiently maintaining people’s

  • privacy.

  • And simultaneously maintaining decent privacy and decent accuracy gets easier and easier

  • with larger datasets.

  • Like how as I add more noise to this image, you can still get the general picture even

  • once you’ve lost any hope of telling the true original value of a particular pixel.

  • So, to protect people’s privacy, we can and should randomly jitter published statistics

  • (which the US Census, for example, has been doing since the 1970s).

  • However, there’s a subtlety - you can’t just add any old random noise however frequently

  • you want - if I simply add different random noise to this picture a bunch of different

  • times, once you take the average of all of the noisy images you basically get back the

  • original clean image - you don’t want this happening to your data.
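
The averaging effect is easy to reproduce with a toy example. The sketch below uses a hypothetical 8-pixel "image" and uniform noise. Both are stand-ins chosen for illustration, not the picture or the noise used in the video.

```python
import random

def add_noise(pixels, max_noise, rng):
    """Return a copy of the image with fresh uniform noise on every pixel."""
    return [p + rng.uniform(-max_noise, max_noise) for p in pixels]

def average_images(images):
    """Average several images pixel by pixel."""
    return [sum(column) / len(images) for column in zip(*images)]

def mean_abs_error(a, b):
    """Average absolute difference between two images."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

rng = random.Random(42)  # fixed seed only so the sketch is repeatable
original = [10, 50, 90, 130, 170, 210, 250, 30]  # hypothetical 8-pixel image
noisy_copies = [add_noise(original, 20, rng) for _ in range(2000)]

single_error = mean_abs_error(noisy_copies[0], original)
averaged_error = mean_abs_error(average_images(noisy_copies), original)
print(single_error, averaged_error)
```

A single noisy copy is off by several units per pixel, while the average of many independently noised copies lands very close to the original.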

  • So, there’s a whole field of computer science dedicated to figuring out how to add the least

  • possible amount of noise to get both the most privacy and the most accuracy, and to future-proof

  • the publication of data so that when you publish multiple jittered statistics about people,

  • those statistics can’t be combined in a clever way to reconstruct people’s data.

  • But up through the 2010 census, the Census Bureau couldn’t promise this - sure, they

  • were jittering data published in Census Bureau tables and charts, but not in a mathematically

  • rigorous way, and so the Census Bureau couldn’t mathematically promise anything about how

  • much they were protecting our privacy (or say how badly it’s been violated).

  • Until now!

  • The US 2020 Census will, for the first time, be using mathematically rigorous privacy protections.

  • One of the biggest benefits of the mathematically rigorous definition of privacy is that it

  • reliably compounds over multiple pieces of information - like, if we have a group of

  • people and publish both their average age and median age, each with a privacy loss factor

  • of 3, then the privacy loss factor for having released both pieces of information is at

  • most 6.
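
The additive rule just described can be sketched as a simple running tally. The class below is a hypothetical illustration, not a description of the Census Bureau's system. It treats the sum as the total privacy loss, which per the rule above is an upper bound, and the spending cap is an assumption added to show how such a tally might be used.

```python
class PrivacyLossTally:
    """Running total of privacy loss, assuming losses simply add up."""

    def __init__(self, budget):
        self.budget = budget   # hypothetical cap on cumulative privacy loss
        self.spent = 0.0

    def release(self, name, privacy_loss):
        """Record one published statistic, refusing it if over budget."""
        if self.spent + privacy_loss > self.budget:
            raise ValueError(f"releasing {name!r} would exceed the budget")
        self.spent += privacy_loss
        return self.spent

tally = PrivacyLossTally(budget=6)
tally.release("average age", 3)
tally.release("median age", 3)
print(tally.spent)  # 6.0
```

Once the tally reaches the cap, any further release is refused, which is the kind of guarantee that unstructured jittering cannot offer.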

  • So you can decide on a total cumulative amount of privacy loss you’re willing to suffer,

  • and then decide whether you want to release, say, 10 pieces of information each with 1/10th

  • that total privacy loss (and less accuracy), or if you want to release 1 piece of information

  • with the full privacy loss and a higher level of accuracy. But how much privacy we need is

  • a really hard question to answer.

  • First, it involves weighing how much we as society collectively value the possible benefits

  • from accurately knowing stuff about the group we’re surveying vs the possible drawbacks

  • of releasing some amount of private information.

  • And second, even though those benefits and drawbacks can be mathematically measured as

  • “accuracy” and “privacy loss”, we still have to translate the mathematical ideas

  • of “accuracy” and “privacy loss” into something that’s understandable and relatable

  • to people in our society.

  • That’s partly a goal of this video, in fact!

  • So let’s give it one more shot at a translation. First and foremost: it is in principle impossible

  • to publish useful statistics based on private data without in some way violating the privacy

  • of the individuals in question.

  • And if you want to provide a mathematically guaranteed limit on the amount of privacy

  • violation, you have to randomly jitter the statistics to protect the private data. The

  • accuracy of the information after being jittered is generally described probabilistically,

  • by saying something like “if we randomly jittered the true population of this town

  • a bunch of times, 98% of the time our jittered statistic would be within 10 people of the