How to unlock valuable personal data for analysis: shedding light on the byzantine world of privacy-enhancing technology
At the heart of privacy preserving data analysis lies a fundamental paradox: privacy preservation aims to hide, while data analysis aims to reveal. The two concepts may seem completely irreconcilable at first, but – using the right approach – they need not be. To help you find this right approach for your specific use case, I discuss both the potential and the limitations of different solution concepts, ranging from de-identification to encryption.
First things first: why data privacy matters
In this day and age, the significance of data privacy for individuals is widely agreed upon. It is worth noting, however, that privacy is an important issue not only for individuals, but also for businesses. Firstly, businesses are obliged by law to protect the personal data with which they have been entrusted. This need for compliance with legal frameworks makes data privacy a crucial topic in business. Many of the legal frameworks which regulate businesses’ use of personal data have long been in place. In Europe, they have recently undergone a substantial upgrade due to the GDPR. Partly derived from this legal obligation is the second reason why businesses care about data privacy: reputation. Non-compliance with data protection law and careless handling of personal data can cause severe reputational damage that may depress a company’s revenues and share price. The third reason for businesses to care about data privacy is the need to protect their own sensitive data rather than that of their customers. Naturally, protecting intellectual property and confidential business information is inherently important to all players in a market economy.
Privacy and data science: challenges and solutions
As digital transformation has been gathering pace, data science has become an indispensable discipline across all industries. The simultaneous need for data privacy and data-driven decision making calls for solutions to reconcile the two. Several technologies are being developed to ease the trade-off between data utility and data privacy. Their applicability varies greatly between different data analysis tasks. Generally speaking, those tasks that require high data granularity, such as machine learning, are more difficult to conduct in a privacy-preserving manner. Similarly, more open-ended investigations that require high flexibility, such as exploratory data analysis, become virtually impossible when privacy protection is high. Basic analytics and data sharing, on the other hand, can often be facilitated quite well in a privacy-preserving way by employing de-identification technologies such as anonymization and pseudonymization.
Taming the PET tangle
The universe of Privacy Enhancing Technologies (PETs) is messy and vague, as terms and concepts are often used interchangeably. The general lack of structure in this sphere makes PETs frustratingly difficult for non-experts to understand. To avoid this kind of confusion, I have categorized the PETs used in data science into two main buckets: de-identification and encryption. The figure below gives an overview of that categorization, listing well-known techniques in the respective buckets. The list is by no means complete, but it provides an intuition of my approach to PET structuring.
The ins and outs of data de-identification
The two main concepts in the de-identification arena are anonymization and pseudonymization. Both terms are ubiquitous, but the difference between them is often unclear. The main distinction between anonymization and pseudonymization lies in the reversibility of the process. True anonymization implies an irreversible modification of personally identifiable information (PII), be it by suppressing, generalizing or perturbing the data. Pseudonymization on the other hand constitutes the reversible masking of PII, thus allowing for subsequent re-identification of individuals. Both concepts have ample justification for different use cases and neither is per se better than the other.
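To make the distinction concrete, here is a minimal Python sketch under stated assumptions. The record fields, the ten-year age bands, the two-digit postcode prefix and the `Pseudonymizer` class are all hypothetical choices for illustration, not a recommended scheme. The anonymization function suppresses and generalizes, so the original values cannot be recovered from its output; the pseudonymizer replaces the name with a keyed token and keeps a lookup table, so whoever holds that table can reverse the process.

```python
import hashlib
import hmac
import secrets


def generalize_age(age: int, width: int = 10) -> str:
    """Coarsen an exact age into a band such as '30-39' (not reversible)."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


def anonymize(record: dict) -> dict:
    """Suppress the direct identifier and generalize the quasi-identifiers."""
    return {
        # 'name' is suppressed entirely
        "age_band": generalize_age(record["age"]),
        "zip_prefix": record["zip"][:2] + "***",
        "diagnosis": record["diagnosis"],
    }


class Pseudonymizer:
    """Hypothetical helper: keyed tokens plus a lookup table for re-identification."""

    def __init__(self) -> None:
        self._key = secrets.token_bytes(32)
        # token -> original value; in practice stored separately and access-controlled
        self._lookup: dict[str, str] = {}

    def pseudonymize(self, value: str) -> str:
        digest = hmac.new(self._key, value.encode("utf-8"), hashlib.sha256)
        token = digest.hexdigest()[:16]
        self._lookup[token] = value
        return token

    def reidentify(self, token: str) -> str:
        return self._lookup[token]


record = {"name": "Alice Example", "age": 34, "zip": "10115", "diagnosis": "asthma"}
anonymous_record = anonymize(record)

pseudonymizer = Pseudonymizer()
pseudonymous_record = {**record, "name": pseudonymizer.pseudonymize(record["name"])}
```

Note that the pseudonymized record keeps the exact age and postcode, which is precisely what preserves its analytical value and what leaves it open to re-identification.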
Concepts such as Differential Privacy (DP) and k-anonymity have become prominent ‘buzzwords’ in media and academic literature alike. Strictly speaking, however, both DP and k-anonymity are privacy models rather than de-identification techniques. Both models make it possible to define the ‘degree of privacy’ of a given data set after different anonymization techniques have been applied. While they are useful for describing the properties of anonymized data, neither DP nor k-anonymity is itself used to anonymize that data in the first place.
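The idea of a privacy model as a yardstick can be sketched for k-anonymity: a data set is k-anonymous if every combination of quasi-identifier values is shared by at least k records. The function below only measures k for an already de-identified table; it does not anonymize anything. The function name, the column names and the sample rows are assumptions made up for this illustration.

```python
from collections import Counter


def k_anonymity(records: list[dict], quasi_identifiers: list[str]) -> int:
    """Return the size of the smallest group of records that share
    the same quasi-identifier values (0 for an empty data set)."""
    if not records:
        return 0
    groups = Counter(
        tuple(record[column] for column in quasi_identifiers) for record in records
    )
    return min(groups.values())


# Hypothetical, already generalized data set
records = [
    {"age_band": "30-39", "zip_prefix": "10***", "diagnosis": "flu"},
    {"age_band": "30-39", "zip_prefix": "10***", "diagnosis": "asthma"},
    {"age_band": "40-49", "zip_prefix": "10***", "diagnosis": "flu"},
    {"age_band": "40-49", "zip_prefix": "10***", "diagnosis": "diabetes"},
    {"age_band": "40-49", "zip_prefix": "10***", "diagnosis": "flu"},
]

k = k_anonymity(records, ["age_band", "zip_prefix"])
```

Here the smallest group (age band 30-39) contains two records, so the table is 2-anonymous with respect to these two columns. Treating the diagnosis as a quasi-identifier as well would lower k to 1.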
Anonymization: more privacy, less utility
Generally speaking, anonymization is the stronger of the two de-identification approaches. True anonymization guarantees strong privacy and rules out re-identification of individual data subjects. Anonymized data is not considered personal data and is not subject to the strict GDPR regulations, allowing for relatively stress-free data handling. However, this high degree of data privacy comes at the cost of reduced data utility. Truly anonymized data is often perturbed or aggregated to an extent that renders certain data science tasks (such as machine learning) impossible.
A potential exception to this may be synthetic data. Synthetic data retains the key statistical properties of the original data set, while having no actual overlap with it. Synthetic data preserves the structure and granularity of the original data, while being completely ‘artificial’ and thus fully anonymous. The process of generating synthetic data, however, is very complex and difficult. While synthetic data may be useful e.g. for functional testing of data products and some analytics applications, it is generally less useful whenever ‘the real thing’ is needed, such as in fraud detection and precision medicine.
Pseudonymization: less privacy, more utility
Pseudonymization, on the other hand, can do without aggregation and perturbation, thus enabling data scientists to preserve the granularity of the data they are working on. Even so, pseudonymization is no exception to the trade-off between data utility and data privacy. While pseudonymized data retains high granularity and utility, it is also more vulnerable to re-identification, for example through linkage attacks. In a linkage attack, the attacker combines several (pseudonymized) data sources and exploits different quasi-identifiers, such as date of birth, postcode and sex, to uncover an individual’s true identity.
So the trade-off between utility and privacy seems inevitable yet again. But what if it weren’t?
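A linkage attack can be illustrated with a few lines of Python. This is a deliberately simplified sketch: both data sets, the choice of birth year, postcode and sex as quasi-identifiers, and the function name `linkage_attack` are invented for illustration. The attacker joins a pseudonymized table with a public register and keeps every pseudonym whose quasi-identifiers match exactly one named person.

```python
def linkage_attack(pseudonymized: list[dict], public: list[dict],
                   quasi_identifiers: list[str]) -> dict[str, str]:
    """Map pseudonyms to names wherever the quasi-identifiers
    single out exactly one person in the public data."""
    index: dict[tuple, list[str]] = {}
    for person in public:
        key = tuple(person[column] for column in quasi_identifiers)
        index.setdefault(key, []).append(person["name"])

    matches = {}
    for row in pseudonymized:
        key = tuple(row[column] for column in quasi_identifiers)
        candidates = index.get(key, [])
        if len(candidates) == 1:  # unique match: identity uncovered
            matches[row["pseudonym"]] = candidates[0]
    return matches


# Hypothetical pseudonymized medical data
pseudonymized = [
    {"pseudonym": "a1f3", "birth_year": 1985, "zip": "10115", "sex": "F", "diagnosis": "asthma"},
    {"pseudonym": "9c2e", "birth_year": 1990, "zip": "10117", "sex": "M", "diagnosis": "flu"},
    {"pseudonym": "77b0", "birth_year": 1990, "zip": "10117", "sex": "M", "diagnosis": "diabetes"},
]

# Hypothetical public register containing names
public = [
    {"name": "Alice Example", "birth_year": 1985, "zip": "10115", "sex": "F"},
    {"name": "Bob Example", "birth_year": 1990, "zip": "10117", "sex": "M"},
    {"name": "Carl Example", "birth_year": 1990, "zip": "10117", "sex": "M"},
]

reidentified = linkage_attack(pseudonymized, public, ["birth_year", "zip", "sex"])
```

In this toy data, the first pseudonym is linked to a single person, while the other two remain ambiguous because two people share the same quasi-identifiers. The more granular the retained attributes, the more often such a match is unique.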
The best is yet to come
What if there were a way to resolve the paradox of data privacy and data science? That is to say, what if there were a way to simultaneously hide and selectively reveal the information value in personal data? Fully homomorphic encryption (FHE), still a rather young branch of cryptography, offers this possibility – in theory. Put simply, performing a data science task on encrypted data using FHE and then decrypting the output yields the same result as performing the same task on the plain data. Data scientists can use the encrypted data to generate insights. This is a revolutionary concept, as it has the potential to completely break the trade-off between data utility and data privacy. Provided that the decryption key is well protected, the encrypted data is private, while its information content remains unchanged. Note, however, that while encrypted data may be private, it is still considered personal data and thus remains subject to GDPR regulation. To understand this, we have to consider the reversibility of the process: given the decryption key, all personal information can be easily reconstructed and used to identify individuals. Though this may seem somewhat counterintuitive, encrypted data is thus not anonymous.
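The homomorphic property itself can be shown with a toy example. The sketch below is not FHE: it uses textbook RSA, which happens to be homomorphic for multiplication only, with tiny hard-coded primes and no padding, so it is insecure and serves purely as an illustration. All names and parameters are my own choices. It requires Python 3.8 or later for the modular inverse via `pow`.

```python
# Toy textbook RSA parameters (illustration only, far too small to be secure)
P, Q = 61, 53
N = P * Q                            # public modulus
E = 17                               # public exponent
D = pow(E, -1, (P - 1) * (Q - 1))    # private exponent (modular inverse of E)


def encrypt(message: int) -> int:
    """Encrypt an integer 0 <= message < N with the public key."""
    return pow(message, E, N)


def decrypt(ciphertext: int) -> int:
    """Decrypt with the private key."""
    return pow(ciphertext, D, N)


def multiply_encrypted(c1: int, c2: int) -> int:
    """Multiply two values without ever seeing them in the clear."""
    return (c1 * c2) % N


a, b = 7, 6
encrypted_product = multiply_encrypted(encrypt(a), encrypt(b))
result = decrypt(encrypted_product)   # equals a * b, as long as a * b < N
```

The party that multiplies the ciphertexts never learns `a` or `b`, yet the decrypted result equals the product of the plain values. A fully homomorphic scheme extends this idea to both addition and multiplication, and therefore to arbitrary computations, which is where the heavy computational cost comes from.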
As promising as FHE sounds, it is unfortunately too early to view it as a panacea for the many privacy challenges in data science. The technology is still relatively young and hugely impractical at this point. Excessively long runtimes render FHE useless for more computationally complex tasks, as computations on encrypted data can still be on the order of a million times slower than plaintext computations. This is not to say that FHE can’t have useful applications in data science, but according to industry experts it will likely take at least another 5 years before efficiency improvements allow for wider use of the technology. FHE is often regarded as the holy grail of encryption. A clear case of: the best is yet to come.
Less is more
Even if FHE becomes widely applicable, that doesn’t necessarily imply that it should be our default go-to solution for all privacy challenges in data science. Just because technology could be used to solve a problem, that doesn’t always mean that it should. At times, methodological solutions or changes to the set-up of a privacy challenge may be more practicable and in fact more cost-effective than burying the problem in sophisticated technology. Depending on the particular use case, a ‘methodological solution’ may for example be a trusted third party. ‘Changes to the set-up’ may refer to aligning the incentives of all parties involved to counteract moral hazard and resolve trust issues.
Simple multiparty computations like benchmarking are a good example of such a situation. Benchmarking is in fact one of the tasks for which FHE can already be used today to facilitate privacy-preserving analysis. Different parties can encrypt their information and, thanks to FHE, the joint result can be computed using the encrypted inputs. The result is subsequently decrypted to yield the benchmark for all parties. At this point you may be thinking: that sounds like a fun gimmick, but why don’t they all just send their inputs to a trusted third party, who then calculates and distributes the benchmarking result? That question is valid, and the answer is not clear-cut. Reasons to prefer FHE over a trusted third party often lie at the intersection of data privacy and data security. While the trusted third party may be able to provide privacy, it still holds all inputs in the clear and thus becomes a target itself; FHE is able to provide privacy and security at once. Yet again, the right approach depends on the specific use case at hand.
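The benchmarking flow described above can be sketched in Python. As a stand-in for FHE, this sketch uses the Paillier cryptosystem, which is only additively homomorphic but sufficient for computing a sum or an average. The primes are small, hard-coded Mersenne primes, the contribution values are invented, and the roles (an aggregator who sees only ciphertexts, a separate key holder who decrypts only the total) are assumptions of this illustration, not a production protocol. Python 3.8 or later is assumed.

```python
import math
import secrets

# Toy key material: two known Mersenne primes (far too small for real use)
P = 2**31 - 1
Q = 2**61 - 1
N = P * Q                     # public key
N_SQ = N * N
LAM = (P - 1) * (Q - 1) // math.gcd(P - 1, Q - 1)   # private: lcm(P-1, Q-1)
MU = pow(LAM, -1, N)                                 # private: inverse of LAM mod N


def encrypt(value: int) -> int:
    """Paillier encryption with generator g = N + 1."""
    while True:
        r = secrets.randbelow(N - 1) + 1             # random r in 1..N-1
        if math.gcd(r, N) == 1:
            break
    return ((1 + value * N) * pow(r, N, N_SQ)) % N_SQ


def decrypt(ciphertext: int) -> int:
    """Paillier decryption; only the key holder can do this."""
    x = pow(ciphertext, LAM, N_SQ)
    return (((x - 1) // N) * MU) % N


def add_encrypted(ciphertexts: list[int]) -> int:
    """Multiplying ciphertexts adds the underlying plaintexts."""
    total = 1
    for c in ciphertexts:
        total = (total * c) % N_SQ
    return total


# Each party encrypts its own figure (hypothetical values)
contributions = [52_000, 61_000, 58_000]
encrypted_inputs = [encrypt(value) for value in contributions]

# The aggregator combines ciphertexts without seeing any input
encrypted_total = add_encrypted(encrypted_inputs)

# The key holder decrypts only the aggregate
benchmark = decrypt(encrypted_total) / len(contributions)
```

Because encryption is randomized, the same input yields a different ciphertext each time, so the aggregator cannot even tell whether two parties submitted the same figure. The sketch glosses over the question of who holds the private key; in practice that party must not see the individual ciphertexts, or the key would have to be shared among the participants.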
The art of the possible
There are three key points for you to take away from this article. First, the right approach is highly use-case dependent. Second, the right approach may consist of technological solutions or methodological solutions or both. Third, this approach may not be perfect, but it is crucial to tackle data privacy head-on and work with what is possible now. Don’t become paralyzed in the face of complex privacy challenges!
In light of the current technological possibilities, businesses must trade off data utility against data privacy. It’s important to strike a balance between specific data needs and privacy concerns – a difficult process that does not have a ‘one size fits all’ solution. Artificial intelligence is built on data: the race is on to capture and unlock valuable information that the next wave of cognitive automation will depend on.