By studying the risk of re-identification more thoroughly, researchers were able to better articulate the fundamental requirements for information to be anonymous. They realized that a robust definition of anonymity should not depend on what side information an attacker might possess. This insight led to the definition of differential privacy in 2006 by Cynthia Dwork, then a researcher at Microsoft. It quickly became the gold standard for privacy and has been used in global technology products like Chrome, the iPhone, and LinkedIn. The US Census Bureau even used it for the 2020 census.
Differential privacy solves the problem of side information by considering the most powerful attacker possible: one who knows everything about everyone in a population except for a single individual. Let's call that individual Alice. When releasing information to such an attacker, how can you protect Alice's privacy? If you release exact aggregate information for the whole population (e.g., the average age of the population), the attacker can compute the same aggregate over everyone but Alice, which they already know exactly, and take the difference. You have just revealed something personal about Alice.
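To make this differencing attack concrete, here is a minimal sketch in Python. All of the numbers (the known ages and the released average) are hypothetical values invented purely for illustration.

```python
# Ages the attacker already knows: everyone in the population except Alice.
known_ages = [34, 29, 51, 42, 38, 47, 25, 60, 33]

# The data holder releases the exact average age of the full population
# (everyone above, plus Alice).
population_size = len(known_ages) + 1
released_average = 39.0  # exact aggregate published by the data holder

# The attacker inverts the average to recover Alice's exact age:
# sum(all ages) = released_average * population_size
alice_age = released_average * population_size - sum(known_ages)
print(alice_age)  # 31.0 -- Alice's age, fully revealed
```

Because the attacker already knows every value except Alice's, releasing the exact aggregate leaks her data completely; aggregation alone offers no protection.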
The only way out is to not share the exact aggregate, but to add a bit of random noise to it and release only the slightly noisy result. With enough noise, even the most well-informed attacker cannot confidently deduce what value Alice contributed. Note that while we have discussed simple insights like aggregations and averages, the same re-identification risks apply to more sophisticated outputs like machine learning or AI models, and the same differential privacy techniques protect privacy there by adding noise when training models. We now have the right tools to find the optimal tradeoff: adding more noise makes it harder for a would-be attacker to re-identify Alice's information, but at a greater loss of data fidelity for the data analyst. Fortunately, in practice, there is a natural alignment between differential privacy and statistical significance.
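As a sketch of how that noise is typically added, here is the classic Laplace mechanism applied to an average, written in Python with NumPy. The `dp_average` helper, the clipping bounds, and the epsilon value are illustrative assumptions, not any particular library's API.

```python
import numpy as np

def dp_average(values, lower, upper, epsilon):
    """Return a differentially private average via the Laplace mechanism.

    A minimal sketch, assuming the number of values is public and each
    value is clipped into the range [lower, upper].
    """
    clipped = np.clip(values, lower, upper)
    # If one person's value changes, the clipped sum moves by at most
    # (upper - lower); that is the sensitivity the noise is calibrated to.
    sensitivity = upper - lower
    noisy_sum = clipped.sum() + np.random.laplace(loc=0.0, scale=sensitivity / epsilon)
    return noisy_sum / len(values)

# Hypothetical ages; epsilon controls the privacy/fidelity tradeoff.
ages = [34, 29, 51, 42, 38, 47, 25, 60, 33, 31]
print(dp_average(ages, lower=0, upper=100, epsilon=1.0))
```

A smaller epsilon means more noise and stronger privacy. On a large population the noise becomes tiny relative to the true average, which is exactly the alignment between differential privacy and statistical significance noted above.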