Disclaimer: This video was produced in collaboration with the US Census Bureau and fact-checked by Census Bureau scientists; any opinions and errors are my own.
Every ten years the US Census Bureau surveys the American population - the ambitious goal is to count every person currently living in the entire United States of America and collect information about them like age, sex, race and ethnicity. The whole purpose of doing surveys like the census (and many other big medical or demographic surveys) is to be able to get an overall, quantitative picture of a particular population - how many people live in Minnesota? Or Mississippi? What’s their average age? And how do these things differ in different places, or by sex, or race? The results of the US Census are of particular political relevance since they’re used to determine the number of seats that different states get in the US House of Representatives as well as the boundaries of legislative districts from Congress down to city councils, but big surveys are also useful for understanding lots of other issues, too. The problem, of course, is that the Census (like many other medical and demographic studies) is supposed to be private. Like, no one outside the Census Bureau is supposed to be able to look at just the published statistics about the US population demographics and definitively figure out that there’s a white, married, 31-year-old male with no kids living in my neighborhood (that’s me). The Census Bureau is supposed to keep my information confidential. And they’re supposed to keep the information of every single other person living in the United States confidential, too. Which is a tall order, because how can you keep everyone’s information entirely confidential while still saying anything at all based on that information? The short answer is that you can’t.
There’s an inherent tradeoff between publishing something you learn from a survey and maintaining the privacy of the participants. It might seem like you could just remove people’s names from the spreadsheet, or only publish summaries like averages and totals. But it’s easy to reconnect names to datasets using powerful computers, and there’s a mathematical theorem that guarantees that if you do a study, every single piece of accurate information that you release, however small it seems, will inherently violate the privacy of the participants in that study to some degree. And the more information you publicly release, the more you violate the individual privacy of the participants. But how do you quantitatively measure something nebulous like loss of privacy, and then how do you protect it? To understand how to measure privacy, it’s helpful to start by imagining how somebody would try to use published results (from a study) to piece together the private information of the people surveyed. They could just try to steal or gain direct access to the private information itself, which, of course, can’t be protected against mathematically - it requires good computer security, or physical defenses, so we won’t consider it here! The kind of privacy attack we can defend against mathematically is an attack that looks at publicly published statistics and then applies brute force computational power to imagine all possible combinations of answers the participants could have given to see which ones are the most plausible - that is, which ones fit the published statistics the best.
Imagine checking all possible combinations of letters and numbers for a password until one of them works, except instead of letters and numbers it’s checking all possible “combinations-of-the-answers-that-330-million-people-could-give-on-their-census-questionnaires” to see which combinations come closest to the publicly published figures for average age, racial breakdown, and so on. The more closely a potential combination of answers matches the published figures, the more promising a candidate it is (from the attacker’s perspective). The more poorly it matches, the lower their level of certainty. As a small example, if there are 7 people living in a particular area and you tell me that four are female, four like ice cream, four are married adults, three of the ice cream lovers are female, and if you also give me the mean and median ages for all of these categories, then I can perfectly reconstruct the exact ages, sex, and ice cream preference of everyone involved. I would start with the 3 ice-cream-loving females; even though there are hundreds of thousands of possible combinations of ages for three people, only a small fraction of those - 36, in fact - are plausible, meaning they’re in the right combination to give a median age of 36 and a mean age of 36 and two thirds. And the same thing works for the four females overall - there are almost 10 million possible combinations of ages they could have, but only 24 age combinations that are consistent with a median of 30, a mean of 33.5, AND with at least one of the plausible age combinations for the three ice-cream lovers. Continuing on with this kind of deduction leads to a single plausible (and perfect) reconstruction of all of the ages, sexes, and ice-cream preferences of the people involved; a 100% violation of privacy.
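To make that first step concrete, here is a minimal Python sketch of the brute-force search for the three ice-cream-loving females. It assumes ages are whole numbers from 1 to 125, which is my assumption and not a range stated above; the function name `consistent_age_sets` is also just illustrative. A mean of 36 and two thirds for three people means the ages sum to 110, so the search keeps only the sorted age triples with that sum and a median of 36.

```python
from itertools import combinations_with_replacement
from statistics import median

# Assumption: ages are whole numbers from 1 to 125 (the range is not given in the text).
AGES = range(1, 126)


def consistent_age_sets(group_size, target_median, target_sum):
    """Return every sorted age tuple whose sum and median match the published figures."""
    return [
        ages
        for ages in combinations_with_replacement(AGES, group_size)
        if sum(ages) == target_sum and median(ages) == target_median
    ]


# Three ice-cream-loving females: median 36, mean 36 2/3, so the ages sum to 110.
candidates = consistent_age_sets(3, 36, 110)

# Count every possible (unordered) combination of three ages for comparison.
total = sum(1 for _ in combinations_with_replacement(AGES, 3))

print(len(candidates), "plausible combinations out of", total)
```

Under this assumed age range the search leaves 36 plausible combinations out of 333,375, in line with the numbers quoted above. The same filter, run on each published subgroup and intersected across groups, is what narrows the candidates down further.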
If, however, you didn’t list how many of the ice cream lovers were female, there would instead be two plausible possibilities, so I would be less certain which was the true combination of ages and sexes and ice cream preferences. And the potential level of certainty of an attacker is precisely how we measure the loss of privacy from publishing results of a study. If all possible combinations of ages and sexes and so on are similarly plausible, then an attacker can’t distinguish between them very well and so privacy is well protected. But if a small number of the possibilities are significantly more plausible than the rest, they stand out - and precisely because they stand out on plausibility, they’re also likely to be close to the truth. So to protect privacy, all possibilities need to seem similarly plausible, or at least there can’t be plausibility peaks that are too conspicuous. The potential for plausibility peaks is quantified mathematically by measuring the maximum slope of the graph - if the slope never gets too steep, then you can’t have any sharp peaks of highly plausible possibilities that stand out.
But how do we publish statistics in a way that limits the maximum slope (and possible peaks) on the plausibilities plot? In practice, the best way to limit an attacker’s ability to confidently choose one scenario over the other is to randomly change, or “jitter”, the published values. Like, for example, rolling a die and adding that number to the average age reported for ice-cream lovers. Jittering the published results in a mathematically rigorous way puts a limit on the slope of the plausibility graph, and thus makes it harder for any particular possibilities to stand out above the rest. Jittering results might also seem like lying, but as long as the size of the adjustment isn’t big enough to make any significant changes to conclusions people draw from the survey, then it’s considered worth it for the privacy protection.
For example, imagine I want to give you a sense of my age while keeping my true age secret. If I just told you my age, obviously there’s just one plausible possibility - 31! But suppose instead that I secretly pulled a number between minus 5 and 5 out of a hat and added it to my age before telling you. In this case, all you know is that my true age is somewhere within 5 years of the number I told you, but you don’t know my age exactly. My privacy has been preserved, though only to a certain degree because you can be confident I’m not 20 and not 40. To protect my age more, I’d have to pull a number between, say, -10 and 10 out of a hat and add it to my age - this increases the number of plausible possibilities - that is, the possible true ages that COULD have resulted in the number I told you. It also increases your uncertainty about my actual age - the tradeoff for privacy is inaccuracy. If I wanted you to know my age within a year, I could only pull a number between -1 and 1 out of the hat.
In general, the idea is this: more privacy means you get less accuracy. Less privacy means you can have more accuracy. When you publish results, hopefully there’s a sweet spot where you can share something useful while still sufficiently maintaining people’s privacy. And simultaneously maintaining decent privacy and decent accuracy gets easier and easier with larger datasets. Like how as I add more noise to this image, you can still get the general picture even once you’ve lost any hope of telling the true original value of a particular pixel. So, to protect people’s privacy, we can and should randomly jitter published statistics (which the US Census, for example, has been doing since the 1970s).
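The numbers-in-a-hat idea can be sketched in a few lines of Python. This is only an illustration under my own assumptions: the hat holds whole numbers, each equally likely, and the helper names `jittered_age` and `plausible_ages` are made up for the example. It is not the noise scheme the Census Bureau actually uses.

```python
import random


def jittered_age(true_age, spread, rng=random):
    """Report true_age plus a whole number drawn uniformly from -spread..spread."""
    # Assumption: every whole number in the hat is equally likely.
    return true_age + rng.randint(-spread, spread)


def plausible_ages(reported, spread):
    """List every true age that could have produced the reported number."""
    return list(range(reported - spread, reported + spread + 1))


rng = random.Random(0)  # fixed seed so repeated runs print the same thing
for spread in (1, 5, 10):
    reported = jittered_age(31, spread, rng)
    options = plausible_ages(reported, spread)
    print(f"spread {spread}: reported {reported}, {len(options)} plausible true ages")
```

A wider hat gives a longer list of plausible true ages (3, 11, and 21 for spreads of 1, 5, and 10), which is more privacy, and a reported number that can land further from 31, which is less accuracy.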
However, there’s a subtlety - you can’t just add any old random noise however frequently you want - if I simply add different random noise to this picture a bunch of different times, once you take the average of all of the noisy images you basically get back the original clean image - you don’t want this happening to your data. So, there’s a whole field of computer science dedicated to figuring out how to add the least possible amount of noise to get both the most privacy and the most accuracy, and to future-proof the publication of data so that when you publish multiple jittered statistics about people, those statistics can’t be combined in a clever way to reconstruct people’s data. But up through the 2010 census, the Census Bureau couldn’t promise this - sure, they were jittering data published in Census Bureau tables and charts, but not in a mathematically rigorous way, and so the Census Bureau couldn’t mathematically promise anything about how much they were protecting our privacy (or say how badly it’s been violated). Until now! The US 2020 Census will, for the first time, be using mathematically rigorous privacy protections. One of the biggest benefits of the mathematically rigorous definition of privacy is that it reliably compounds over multiple pieces of information - like, if we have a group of people and publish both their average age and median age, each with a privacy loss factor of 3, then the privacy loss factor for having released both pieces of information is at most 6. So you can decide on a total cumulative amount of privacy loss you’re willing to suffer, and then decide whether you want to release, say, 10 pieces of information each with 1/10th that total privacy loss (and less accuracy), or if you want to release 1 piece of information with the full privacy loss and a higher level of accuracy.
But how much privacy we need is a really hard question to answer.
First, it involves weighing how much we as society collectively value the possible benefits from accurately knowing stuff about the group we’re surveying vs the possible drawbacks of releasing some amount of private information. And second, even though those benefits and drawbacks can be mathematically measured as “accuracy” and “privacy loss”, we still have to translate the mathematical ideas of “accuracy” and “privacy loss” into something that’s understandable and relatable to people in our society. That’s partly a goal of this video, in fact! So let’s give it one more shot at a translation.
First and foremost: it is in principle impossible to publish useful statistics based on private data without in some way violating the privacy of the individuals in question. And if you want to provide a mathematically guaranteed limit on the amount of privacy violation, you have to randomly jitter the statistics to protect the private data.
The accuracy of the information after being jittered is generally described probabilistically, by saying something like “if we randomly jittered the true population of this town a bunch of times, 98% of the time our jittered statistic would be within 10 people of the