Google’s differential privacy library will give organizations a way to study their data while protecting people’s information.
Data security has become one of the world’s most prominent issues as we reach the end of the decade. Countries and companies across the globe are ingesting the data of billions of users, exposing people to a seemingly endless stream of concerning hacks or data breaches.
Since the first days of the US Census in 1790, scientists have searched for ways to study troves of data while protecting the identities of those behind the numbers.
In 2006, award-winning researchers Cynthia Dwork, Frank McSherry, Kobbi Nissim and Adam D. Smith introduced the idea of differential privacy to the world, setting off more than a decade of research that would lead to Google’s newly released open-source differential privacy library.
Using the open source library, developers will be able to perform common statistical functions like sums, averages or medians and “other functionalities such as additional mechanisms, aggregation functions, or privacy budget management.”
They’ve even thrown in a lengthy study on their work, a way to check for mistakes and a PostgreSQL extension with common recipes to get you started.
“Differentially-private data analysis is a principled approach that enables organizations to learn from the majority of their data while simultaneously ensuring that those results do not allow any individual’s data to be distinguished or re-identified,” said Google Privacy and Data Protection Product Manager Miguel Guevara.
In the end, the goal of differential privacy is to provide anonymity while preserving access to boatloads of useful information. Google said differential privacy “provides formal guarantees that the output of a database query does not reveal too much information about any individual present in the database.”
It randomizes parts of the information in a way that would make cybersecurity breaches less damaging.
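To make the idea of randomizing a query result concrete, here is a minimal Python sketch of the Laplace mechanism, a standard textbook technique for differentially private counts. It is not taken from Google's library, and the function names, the visit log and the epsilon value are all hypothetical. It also assumes each person contributes at most one record, and it uses naive floating-point sampling, which is fine for illustration but not something to rely on in production.

```python
import random


def laplace_noise(scale, rng):
    # The difference of two independent exponential draws with mean `scale`
    # follows a Laplace distribution centred at 0 with that scale.
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)


def private_count(records, predicate, epsilon, rng):
    """Return a noisy count of the records matching `predicate`.

    Assumption: each person appears in `records` at most once, so adding
    or removing one person changes the true count by at most 1.
    """
    true_count = sum(1 for record in records if predicate(record))
    sensitivity = 1
    return true_count + laplace_noise(sensitivity / epsilon, rng)


rng = random.Random(42)
visits = ["lunch"] * 120 + ["dinner"] * 80  # hypothetical visit log
noisy_lunch_count = private_count(visits, lambda v: v == "lunch", epsilon=1.0, rng=rng)
print(round(noisy_lunch_count, 1))
```

A smaller epsilon adds more noise and gives stronger privacy; a larger epsilon keeps the answer closer to the true count.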
If you’ve ever looked at Google Maps and seen that fun chart of times when a business will be the most busy, you can thank differential privacy for it. Randomizing the data in this way made it easy for Google to anonymously track when most people eat at a certain restaurant or shop at a popular store.
In 2014, Google also used it to improve its Chrome browser and has since upgraded Google Fi with it. Dozens of companies including Apple and Uber use versions of differential privacy to optimize their services while protecting the data of users.
After more than a decade of using differential privacy within their own businesses, Google has decided to create an easy tool to help developers incorporate differential privacy with their own data.
“OK so why am I so excited about this release? So many reasons. First, the code is the same one we use internally. It powers massive-scale tools and major use cases,” Google privacy software engineer Damien Desfontaines wrote on Twitter last week.
“It also means that we have a high bar in terms of code quality, testing, scalability, robustness. The methods developed inside are quite simple. We’re not coming up with super fancy new algorithms. We’re mostly bringing intuitions from prior work together in a nice way. This shows that differential privacy should be approachable & usable by anyone, given the right tools,” Desfontaines wrote.
Desfontaines added that there weren’t many resources for organizations or developers looking to use differential privacy with their data. Google hopes that open-sourcing the differential privacy library its core products already benefit from will help, and the team prioritized ease of use when building it.
They did something similar in 2014, when they unveiled RAPPOR (Randomized Aggregatable Privacy-Preserving Ordinal Response), which was built under the guidelines of differential privacy. In Google’s blog post about RAPPOR, they break down exactly how differential privacy would work in a real-life situation.
Úlfar Erlingsson, Google’s security research tech lead manager, wrote about the topic using the basic example of a person trying to figure out how many of his friends online were dogs.
“To do this, you could ask each friend to answer the question ‘Are you a dog?’ in the following way. Each friend should flip a coin in secret, and answer the question truthfully if the coin came up heads; but, if the coin came up tails, that friend should always say, ‘Yes,’ regardless,” he wrote.
“Then you could get a good estimate of the true count from the greater-than-half fraction of your friends that answered, ‘Yes’. However, you still wouldn’t know which of your friends was a dog: each answer, ‘Yes,’ would most likely be due to that friend’s coin flip coming up tails.”
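Erlingsson's coin-flip survey can be simulated in a few lines of Python. The sketch below is an illustration of the scheme as he describes it, not RAPPOR itself; the function names and the population of 1,000 friends, 300 of them dogs, are made up for the example. Because a friend says "Yes" either truthfully on heads or automatically on tails, the chance of a "Yes" is 0.5 × p + 0.5, where p is the true share of dogs, so the estimate is recovered as p = 2 × (share of "Yes" answers) − 1.

```python
import random


def randomized_answer(is_dog, rng):
    """Heads: answer truthfully. Tails: always answer 'Yes' (True)."""
    heads = rng.random() < 0.5
    return is_dog if heads else True


def estimate_true_count(answers):
    """Invert the coin-flip noise to estimate how many friends are dogs.

    P(yes) = 0.5 * p + 0.5, so p = 2 * P(yes) - 1.
    """
    n = len(answers)
    yes_fraction = sum(answers) / n
    return max(0.0, n * (2 * yes_fraction - 1))


rng = random.Random(0)
friends = [True] * 300 + [False] * 700  # hypothetical: 300 of 1,000 friends are dogs
answers = [randomized_answer(is_dog, rng) for is_dog in friends]
estimate = estimate_true_count(answers)
print(f"{sum(answers)} said yes; estimated dogs: {estimate:.0f}")
```

The estimate gets closer to the real count as the number of friends grows, while any single "Yes" stays ambiguous.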
In the detailed technical paper released on September 5, Desfontaines, Royce Wilson, Celia Zhang, William Lam, Daniel Simmons-Marengo and Bryant Gipson explain the math and thinking behind their library. Differential privacy is so complicated—and easy to screw up—that Google is hoping their tool will simplify the process a bit.
“By using differential privacy when analyzing data, organizations can minimize the disclosure risk of sensitive information about their users. By releasing components of our system as open-source software after we validated its viability on internal use-cases, we hope to encourage further adoption and research of differentially private data analysis,” Google researchers wrote in the report.
“The algorithms presented in this work are relatively simple, but empirical evidence show that this approach is useful, robust and scalable. Many services collect sensitive data about individuals. These services must balance the possibilities offered by analyzing, sharing, or publishing this data with their responsibility to protect the privacy of the individuals present in their data.”
In Google’s announcement about the open-sourced library, Guevara said the tool would be useful for small businesses, software developers and healthcare researchers that need complex personal data to make changes and improvements.
Companies and organizations have to exhaust every opportunity to improve while ensuring strong privacy protections. They risk losing the trust of “citizens, customers, and users” if they don’t, Guevara said. The overall goal of differential privacy, Google said, is to give users the anonymity they deserve.
“The main focus of the paper is to explain how to protect *users* with differential privacy, as opposed to individual records. So much of the existing literature implicitly assumes that each user is associated to only one record. It’s rarely true in practice!” Desfontaines said on Twitter.
Digital information differs from census figures and other kinds of data because one user is often counted more than once. Google employees Lea Kissner and Gipson told WIRED that their tool was innovative because it lets people add data to a set multiple times.
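One common way to protect users rather than individual records is to cap how much any single user can contribute before noise is added. The Python sketch below illustrates that general idea under stated assumptions; it is not the library's implementation. The function name, the parameters and the sample rows are hypothetical, the sketch simply keeps each user's first few rows, and it reuses naive Laplace noise that is only suitable for illustration.

```python
import random
from collections import defaultdict


def bounded_private_sum(rows, max_rows_per_user, max_value, epsilon, rng):
    """Noisy sum over (user_id, value) rows with a per-user contribution cap."""
    kept = defaultdict(list)
    for user_id, value in rows:
        if len(kept[user_id]) < max_rows_per_user:
            # Clamp each value to [0, max_value] so no single row dominates.
            kept[user_id].append(min(max(value, 0.0), max_value))
    true_sum = sum(sum(values) for values in kept.values())
    # After capping, one user can shift the sum by at most this much.
    sensitivity = max_rows_per_user * max_value
    scale = sensitivity / epsilon
    # Difference of two exponential draws gives Laplace noise with this scale.
    noise = rng.expovariate(1 / scale) - rng.expovariate(1 / scale)
    return true_sum + noise


# Hypothetical data: one heavy user with 50 rows and two light users.
rows = [("ann", 5.0)] * 50 + [("bob", 3.0), ("cat", 4.0)]
result = bounded_private_sum(rows, max_rows_per_user=2, max_value=10.0,
                             epsilon=1.0, rng=random.Random(7))
print(round(result, 1))
```

Without the cap, the heavy user's 50 rows would dominate the sum, and noise calibrated to a single record would not hide that user's presence.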
Google has made privacy tools a priority this year, releasing TensorFlow Privacy, TensorFlow Federated as well as Private Join and Compute. Desfontaines and others involved in the process said the tool would only improve as more people used it on a variety of data sets.
“What I find most exciting isn’t what’s there now, it’s what I hope comes next!” Desfontaines wrote.
“My personal hope is that it enables new use cases and collaboration opportunities. Let’s see what happens.”