The term "differential privacy" has popped into public consciousness after Apple announced it was using this mathematical technique to ensure that user information it collects through iOS devices is kept private. It's a complex statistical science concept that involves large datasets, analytics, adding noise to the data and maths. Maths. And now I have a headache. It's hard to find a simple way of explaining Apple's application of differential privacy to people with limited knowledge of mathematics and technology. But it's something all iOS users should know, especially when it concerns their own data. Here's our layman's guide to Apple's differential privacy.
Before we dive into differential privacy, it's worth noting why technology vendors are collecting user data in the first place. The likes of Apple and Google want to know how customers use their products and services so that they can make incremental improvements. Google and Facebook, which do have advertising businesses, accumulate data on users and analyse it to paint a picture of what users like, then display online ads and content that are relevant to the individual. The data also reveals trends which can help these organisations create new products with the best chances for success.
The problem with doing this is that it can be a huge invasion of privacy. Even if a company does anonymise the data it gathers, there are still ways to match bits and pieces of supposedly non-identifiable data and deduce who it came from.
Enter differential privacy. Differential privacy is a mathematical technique that came into existence in 2006. Starting from iOS 10, Apple is adopting it in a bid to protect user privacy when information is collected about users' app usage habits.
Differential privacy is a research topic in the areas of statistics and data analytics that uses hashing, subsampling and noise injection to enable…crowdsourced learning while keeping the data of individual users completely private. Apple has been doing some super-important work in this area to enable differential privacy to be deployed at scale.
Long story short, Apple will be collecting more data from users, but they will add "noise" to the data so that it's near impossible to identify who the data is from. It's easy to get that part, but the actual concept of differential privacy can be a bit confusing. Cryptographer and professor at Johns Hopkins University Matthew Green has written a lengthy piece that breaks down the mathematics behind it, which you can find here, but it's a bit much to digest for the average Joe.
So here's a very simplified example of differential privacy that, hopefully, will help make explaining the concept a lot easier. Try not to nitpick and question the example. Just take it as it is for now:
You're part of a group of 20 people who work in an office. You all know each other and one day somebody comes along and wants to do a survey about bad office habits. All 20 of you participate and you all write down, truthfully, your worst habits in the office and put them into the collection box. The person running the survey then takes five random entries from a book called 1000 Bad Office Habits, writes them down and puts them in the collection box. The box is then given a good shake and passed to the person responsible for recording the responses. That person has no way of telling which responses were from the group and which were from the book. The test is repeated on a monthly basis for a year, and each time five different random responses are pulled from the book. At the end, the survey managers have a pretty good idea of the most common bad habits that exist in that office without knowing exactly who has them, even if someone in the group leaves and is replaced.
Differential privacy is more complex than that, but the example gets the main points across. The responses from the 1000 Bad Office Habits book are the 'noise' in this concept, and it's critical that they change constantly. Repeatedly gathering data and comparing it is vital in extracting trends and insight into user behaviour as well, which is why in the example the survey is done every month for a year.
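For readers who like to tinker, the office survey can be mimicked in a few lines of Python. This is only a sketch of the analogy, not Apple's actual method, and it is not a formal differential privacy mechanism. The function name run_survey, the made-up habit names and the 9/6/5 split of habits among the 20 colleagues are all assumptions for illustration.

```python
import random
from collections import Counter

def run_survey(true_habits, book, decoys_per_round=5, rounds=12, seed=0):
    """Tally survey slips over several rounds, mixing in random decoys each time."""
    rng = random.Random(seed)
    tally = Counter()
    for _ in range(rounds):
        box = list(true_habits)                     # one honest slip per person
        box += rng.sample(book, decoys_per_round)   # fresh decoys drawn from the book
        rng.shuffle(box)                            # give the box a good shake
        tally.update(box)                           # the recorder only sees the mix
    return tally

# Hypothetical data: a 1000-entry book and 20 colleagues sharing three habits.
book = [f"habit {i}" for i in range(1000)]
true_habits = ["habit 1"] * 9 + ["habit 2"] * 6 + ["habit 3"] * 5

tally = run_survey(true_habits, book)
print(tally.most_common(3))
```

Over the 12 rounds the three genuine habits should dominate the tally, yet the person recording the slips cannot tell whether any single one came from a colleague or from the book.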
The technique also requires a fine balance of 'real' and 'false' data. If you add too much noise, the accuracy of the insight goes down; add too little and you'll be compromising the privacy of users. To quote Green again:
The total allowed leakage is often referred to as a 'privacy budget', and it determines how many queries will be allowed (and how accurate the results will be). The basic lesson of DP is that the devil is in the budget. Set it too high, and you leak your sensitive data. Set it too low, and the answers you get might not be particularly useful.
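To see the trade-off in numbers, here is a rough Python sketch. It assumes Laplace noise, a common textbook choice for noisy counts that the article does not confirm Apple uses, and it assumes the privacy cost of successive queries simply adds up. The budget is conventionally called epsilon; the names PrivateCounter and average_error, the true count of 120 and the budget of 1.0 are hypothetical.

```python
import random

def laplace_noise(scale, rng):
    """Draw Laplace(0, scale) noise as the difference of two exponential draws."""
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)

class PrivateCounter:
    """Answer count queries with noise until a total budget (epsilon) is spent."""

    def __init__(self, true_count, total_epsilon, seed=0):
        self.true_count = true_count
        self.total_epsilon = total_epsilon
        self.spent = 0.0
        self.rng = random.Random(seed)

    def query(self, epsilon):
        # Assumption: the epsilons of successive queries simply add up.
        if self.spent + epsilon > self.total_epsilon + 1e-12:
            raise RuntimeError("privacy budget exhausted")
        self.spent += epsilon
        # Assumes one person changes the count by at most 1, so the noise
        # scale is 1/epsilon: a smaller epsilon means more noise.
        return self.true_count + laplace_noise(1 / epsilon, self.rng)

def average_error(epsilon, trials=2000, seed=1):
    """Average absolute error of a single query answered at the given epsilon."""
    rng = random.Random(seed)
    total = sum(abs(laplace_noise(1 / epsilon, rng)) for _ in range(trials))
    return total / trials

# A bigger epsilon gives more accurate answers but leaks more.
for epsilon in (0.1, 1.0, 10.0):
    print(f"epsilon={epsilon}: average error about {average_error(epsilon):.2f}")

# With a total budget of 1.0, only so many 0.25-epsilon queries are allowed.
counter = PrivateCounter(true_count=120, total_epsilon=1.0)
answered = 0
for _ in range(10):
    try:
        counter.query(0.25)
        answered += 1
    except RuntimeError:
        break
print(f"queries answered before the budget ran out: {answered}")
```

The first loop shows accuracy improving as the budget per query grows; the second shows the counter refusing further questions once the total budget has been used up.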
Differential privacy can be a difficult concept to wrap your head around but if you're a user of an iOS device, it's important to at least try and understand it. It's your data at stake here. Apple seems to be making the right moves in order to protect customer privacy, but it's up to you to decide whether they're doing a good job.