Can data be analysed to the full without compromising individuals’ privacy? Lisa Martin, strategy associate at idalab, assesses current methods and future solutions
At the heart of privacy-preserving data analysis lies a fundamental paradox: privacy preservation aims to hide, while data analysis aims to reveal. The two concepts may seem completely irreconcilable at first, but – using the right approach – need not be.
The two main methods of handling personal data – de-identification and encryption – both have strengths and limitations, but which will work best for you? And might there be a third way? First things first, however …
Why data privacy matters
Data privacy is important not only for individuals, but also for businesses. Firstly, since businesses are obliged by law to protect the personal data with which they are entrusted, any failure to comply with the legal frameworks protecting personal data can lead to legal action.
Secondly, although many of these frameworks have long been in place, the General Data Protection Regulation (GDPR) – which has applied in Europe since 2018 and obliges companies to inform individuals of breaches that put them at high risk – has forced businesses to consider the reputational risk associated with careless handling of personal data. The Facebook/Cambridge Analytica fiasco – in which the data of 50 million users was sold, then used to influence the outcome of the 2016 US election – led to CEO Mark Zuckerberg being called to testify before Congress, while Facebook’s repeated share price nosedives at the time reflected the market’s disdain for this behaviour.
The third reason businesses should care about data privacy is purely motivated by self-interest: they need to secure their own sensitive data. Any failure to protect intellectual property and confidential business information can be fatal to players in a market economy.
Privacy and data science: challenges and solutions
As digital transformation has gathered pace, data science has become an indispensable discipline across all industries. The simultaneous needs for data privacy and data-driven decision making call for a way to reconcile the two, and several technologies are being developed to manage the trade-off between data utility and data privacy. Different data-analysis tasks, however, call for different solutions.
Generally speaking, any tasks that require high data granularity, such as machine learning, are more difficult to conduct in a privacy-preserving manner. In similar fashion, more open-ended investigations that require high flexibility, such as exploratory data analysis, become virtually impossible when privacy protection levels are high. Basic analytics and data sharing, on the other hand, can work effectively in a privacy-preserving environment by employing de-identification technologies, such as anonymization and pseudonymization.
Taming the PET tangle
The universe of privacy-enhancing technologies (PETs) is messy and vague, as terms and concepts are often used interchangeably. The general lack of structure in this sphere makes PETs frustratingly difficult for non-experts to understand. However, the PETs used in data science can be divided into two main buckets: de-identification and encryption.
Figure: idalab’s approach to PET structuring
The ins and outs of data de-identification
Within the de-identification arena are two main concepts: anonymization and pseudonymization. The main distinction between the two lies in the reversibility (or otherwise) of the process. True anonymization requires an irreversible modification of personally identifiable information (PII), be it by suppressing, generalizing or perturbing the data. Pseudonymization, on the other hand, involves the reversible masking of PII, thus allowing for subsequent re-identification of individuals. Both methods have their uses and neither is per se better than the other.
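The difference between the two approaches can be sketched in a few lines of Python. This is a minimal illustration rather than a production recipe: the records, field names and key are hypothetical, generalization into age bands and postcode areas stands in for the wider family of anonymization techniques, and a keyed hash plus lookup table is only one possible way to pseudonymize.

```python
import hashlib
import hmac

# Hypothetical toy records; the field names are illustrative only.
records = [
    {"name": "Alice Example", "age": 34, "postcode": "10115", "diagnosis": "asthma"},
    {"name": "Bob Example", "age": 37, "postcode": "10117", "diagnosis": "diabetes"},
]

def anonymize(record):
    """Irreversibly modify PII: suppress the name, generalize age and postcode."""
    decade = record["age"] // 10 * 10
    return {
        "age_band": f"{decade}-{decade + 9}",              # generalization
        "postcode_area": record["postcode"][:3] + "**",    # generalization
        "diagnosis": record["diagnosis"],                  # the name is dropped (suppression)
    }

def pseudonymize(record, secret_key, lookup):
    """Mask the name with a keyed token; the lookup table allows re-identification."""
    token = hmac.new(secret_key, record["name"].encode(), hashlib.sha256).hexdigest()[:12]
    lookup[token] = record["name"]   # whoever holds this table can reverse the masking
    masked = dict(record)
    masked["name"] = token
    return masked

secret_key = b"demo-key-kept-separately"   # assumption: stored apart from the data
lookup = {}
anonymized = [anonymize(r) for r in records]
pseudonymized = [pseudonymize(r, secret_key, lookup) for r in records]
```

The anonymized rows carry no name and only coarse attributes, whereas the pseudonymized rows keep full granularity and can be mapped back to individuals through the lookup table.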
Although concepts such as differential privacy (DP) and k-anonymity have become prominent buzzwords in the media and academic literature alike, they are, strictly speaking, privacy models rather than de-identification techniques.
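To make the idea of a privacy model concrete: k-anonymity asks that every combination of quasi-identifier values in a table be shared by at least k records. The Python sketch below measures k for a small table; the column names and rows are hypothetical, and the helper is an illustration rather than any particular tool's API.

```python
from collections import Counter

def k_anonymity(rows, quasi_identifiers):
    """Return the size of the smallest group of rows sharing the same quasi-identifier values."""
    groups = Counter(tuple(row[q] for q in quasi_identifiers) for row in rows)
    return min(groups.values())

# Hypothetical, already generalized rows.
rows = [
    {"age_band": "30-39", "postcode_area": "101**", "diagnosis": "asthma"},
    {"age_band": "30-39", "postcode_area": "101**", "diagnosis": "diabetes"},
    {"age_band": "40-49", "postcode_area": "101**", "diagnosis": "asthma"},
]

k = k_anonymity(rows, ["age_band", "postcode_area"])
```

Here the single record in the 40-49 band forms a group of one, so the table as a whole is only 1-anonymous; the model describes a property of the data, while techniques such as further generalization or suppression are what would raise k.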
Anonymization: more privacy, less utility
Generally speaking, anonymization is the stronger of the two de-identification approaches. True anonymization guarantees strong privacy and rules out re-identification of individual data subjects. Anonymized data is not considered personal data and is, therefore, not subject to the GDPR, enabling relatively stress-free data handling. However, this high degree of data privacy comes at the cost of reduced data utility. Truly anonymized data is often perturbed or aggregated to an extent that renders certain data science tasks (such as machine learning) impossible.
A potential exception to this may be synthetic data. Synthetic data retains the key statistical properties of the original data set, while having no actual overlap with it; it preserves the structure and granularity of the original data, while being completely “artificial” and thus fully anonymous. The process of generating synthetic data, however, is very complex and difficult. While synthetic data may be useful, such as in the functional testing of data products and some analytics applications, it is generally less useful whenever “the real thing” is needed, such as in fraud detection and precision medicine.
Pseudonymization: less privacy, more utility
Pseudonymized data, on the other hand, needs no aggregation or perturbation, enabling data scientists to preserve the granularity of the data they are working with. But this high utility and granularity renders it vulnerable to re-identification, such as via linkage attacks (in which the attacker combines several data sources and exploits different quasi-identifiers to uncover an individual’s true identity).
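A linkage attack of this kind can be sketched in Python as a simple join on quasi-identifiers. Everything in the example is hypothetical: the two data sets, the column names and the `link` helper are illustrative assumptions, not a description of any real attack tool.

```python
def link(pseudonymized_rows, public_rows, quasi_identifiers):
    """Re-identify rows whose quasi-identifier values match exactly one public record."""
    index = {}
    for person in public_rows:
        key = tuple(person[q] for q in quasi_identifiers)
        index.setdefault(key, []).append(person["name"])

    matches = {}
    for row in pseudonymized_rows:
        candidates = index.get(tuple(row[q] for q in quasi_identifiers), [])
        if len(candidates) == 1:   # a unique match reveals the identity
            matches[row["pseudonym"]] = candidates[0]
    return matches

# Hypothetical pseudonymized data set with full granularity.
medical = [
    {"pseudonym": "a91f", "birth_year": 1986, "postcode": "10115", "sex": "F", "diagnosis": "asthma"},
    {"pseudonym": "c07d", "birth_year": 1983, "postcode": "10117", "sex": "M", "diagnosis": "diabetes"},
]

# Hypothetical second source that contains names, e.g. a public register.
public_register = [
    {"name": "Alice Example", "birth_year": 1986, "postcode": "10115", "sex": "F"},
    {"name": "Bob Example", "birth_year": 1983, "postcode": "10117", "sex": "M"},
    {"name": "Carol Example", "birth_year": 1983, "postcode": "10117", "sex": "F"},
]

reidentified = link(medical, public_register, ["birth_year", "postcode", "sex"])
```

With all three quasi-identifiers both pseudonyms resolve to a name; dropping `sex` leaves two candidates for the second record, which shows why coarser data is harder to link.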
In short, the trade-off between utility and privacy seems unavoidable. But what if it weren’t?
The best is yet to come
What if there were a way to resolve the paradox of data privacy and data science? That is to say, what if there were a way to simultaneously hide and selectively reveal the information value in personal data? Fully homomorphic encryption (FHE), still a rather young branch of cryptography, dangles this possibility – in theory. Put simply, performing a data science task on encrypted data using FHE yields, once the output is decrypted, the same result as performing the same task on the plain data. Data scientists can therefore use the encrypted data to generate insights without ever seeing the underlying values.
This is a revolutionary concept, as it has the potential to completely break the trade-off between data utility and data privacy. Provided that the decryption key is well protected, the encrypted data is private, while its information content remains unchanged. As promising as FHE sounds, it is unfortunately too early to view it as a panacea for the many privacy challenges in data science. One drawback is that FHE data is considered personal (rather than anonymized) – on the basis that the encryption could be reversed by anybody possessing the decryption key – and thus remains subject to the GDPR. The technology is also still impractical: excessively long runtimes render FHE useless for more computationally complex tasks, as computations on encrypted data are still approximately 1 million times slower than plaintext computations.
This is not to say FHE can’t have useful applications in data science, but industry experts suggest it will likely be another three years before efficiency improvements enable practical, wider use of the technology. Often lauded as the holy grail of encryption, FHE is currently a rather impractical, but hugely promising technology.
Less is more
Even if FHE becomes widely applicable, it shouldn’t necessarily become our default solution for all privacy challenges in data science. Methodological solutions, or changes to the set-up of a privacy challenge, may well be more feasible – and cost-effective – than burying the problem in sophisticated technology. That ‘methodological solution’ could take the form of a trusted third party, for example, while ‘changes to the set-up’ might refer to aligning the incentives of all parties involved to counteract moral hazards and resolve trust issues.
Simple multiparty computations, such as benchmarking, are a good example. Benchmarking is, in fact, one of the tasks for which FHE can already be used today to facilitate privacy-preserving analysis. Different parties can encrypt their information and, thanks to FHE, the joint result can be computed using the encrypted inputs. The result is subsequently decrypted to yield the benchmark for all parties.
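The benchmarking flow can be illustrated with a toy Python sketch. Note the substitution: the code uses the Paillier scheme, which is only additively homomorphic and therefore not FHE, because adding encrypted inputs is all a simple average needs and the scheme fits in a few lines. The primes are hard-coded and far too small for real use, the input figures are made up, and the assumption that a separate key holder decrypts only the final total is ours.

```python
import math
import secrets

def keygen(p=1789, q=1861):
    """Toy Paillier keys from two small primes (insecure; for illustration only)."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)   # lcm(p - 1, q - 1)
    mu = pow(lam, -1, n)   # modular inverse (Python 3.8+); valid for generator g = n + 1
    return n, (lam, mu)

def encrypt(n, m):
    """Encrypt an integer 0 <= m < n under the public key n."""
    n_sq = n * n
    while True:
        r = secrets.randbelow(n - 1) + 1   # random blinding factor in [1, n - 1]
        if math.gcd(r, n) == 1:
            break
    return (pow(n + 1, m, n_sq) * pow(r, n, n_sq)) % n_sq

def add_encrypted(n, c1, c2):
    """Multiplying ciphertexts adds the underlying plaintexts."""
    return (c1 * c2) % (n * n)

def decrypt(n, private_key, c):
    lam, mu = private_key
    x = pow(c, lam, n * n)
    return ((x - 1) // n * mu) % n

# Each party encrypts its own figure (hypothetical values) with the shared public key.
n, private_key = keygen()
inputs = [52000, 61000, 58000]
ciphertexts = [encrypt(n, value) for value in inputs]

# The aggregator combines ciphertexts without seeing any individual input.
total_ct = ciphertexts[0]
for ct in ciphertexts[1:]:
    total_ct = add_encrypted(n, total_ct, ct)

# Only the joint result is decrypted and shared as the benchmark.
total = decrypt(n, private_key, total_ct)
benchmark = total / len(inputs)
```

The sum must stay below `n` for decryption to be correct, which is one reason real deployments use much larger keys; tasks that also need multiplication on encrypted values are where fully homomorphic schemes come in.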
Is this anything more than a fun gimmick, given that they could just send their inputs to a trusted third party to calculate and distribute the benchmarking result? Reasons to prefer FHE over a trusted third party often lie at the intersection of data privacy and data security: while the trusted third party is able to provide privacy, FHE is able to provide privacy and security. Yet again, the right approach depends on the specific use case at hand.
The art of the possible
When considering which data approach to take, there are three key points to bear in mind. First, the right approach depends on the situation – there is no magic bullet. Second, the right approach may consist of technological solutions or methodological solutions or both. Third, this approach may not be perfect, but it is crucial to tackle data privacy head on and work with what is possible now. Don’t become paralyzed in the face of complex privacy challenges.
Businesses must balance usability and privacy – in a way that compromises neither the task at hand nor individuals’ personal data – to ensure the continued availability of data; the next wave of cognitive automation depends on it.