Learning about the key tools, applications, and caveats of releasing privacy-preserving data sets.
Given the increasingly popular promise of the enormous potential that sharing data at scale would unlock for businesses and citizens, one reads a lot about the vast array of applications that could be enabled by the legislative projects under the European Strategy for Data (2020), such as the Data Governance Act (DGA).
Hand in hand with the rising share of the world population with access to the internet[1] and the expanding base of regular users, the amount of data that becomes available every day has skyrocketed. New applications in a myriad of domains such as health, mobility, environment, agriculture, the public sector, and logistics are proclaimed to make our lives and work more efficient[2][3].
And that is not without reason – most of us know of compelling use-cases that would profit from the means to use that data, and even more so from an increasingly liberal data-sharing ecosystem.
Nevertheless, while the potential upsides of sharing anonymized, protected data at scale are clear, researchers assessing the privacy bounds of various kinds of data have repeatedly pointed to the shortcomings of widely used anonymization tools, such as performing simple de-identification and sampling before sharing[4][5]. So let’s take a step back and consider the key challenges in releasing protected data with the help of so-called privacy-enhancing technologies (PETs).
In order to understand where the existing caveats with PETs lie, we first have to distinguish between microdata, which stores information at the individual level, and population-level data, which enables analysts to conduct statistical analysis and learn more about the underlying distribution of the data.
For the latter, privacy researchers have come up with valuable solutions for those who want to release the results of a one-off statistical analysis. Among others, these solutions comprise technologies such as homomorphic encryption[6], differential privacy[7], and secure multi-party computation[8]. While offering a promising choice for those interested in sharing their data for population-level analysis, they still do not solve the conundrum most important for practitioners: how to share high-quality individual-level data in a manner that preserves privacy but allows analysts to extract a dataset’s full value[9].
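To make one of these technologies a little more concrete, here is a minimal sketch of the Laplace mechanism, the textbook building block of differential privacy, applied to a one-off counting query. The function name and the example numbers are purely illustrative, not taken from any of the systems cited above, and real deployments add careful privacy-budget accounting on top of this idea.

```python
import numpy as np

def laplace_count(true_count: int, epsilon: float) -> float:
    """Release a count with epsilon-differential privacy via the Laplace mechanism.

    A counting query has sensitivity 1 (adding or removing one individual
    changes the result by at most 1), so noise drawn from Laplace(0, 1/epsilon)
    suffices.
    """
    sensitivity = 1.0
    noise = np.random.laplace(loc=0.0, scale=sensitivity / epsilon)
    return true_count + noise

# Example: publish how many records in a (hypothetical) cohort carry a given diagnosis.
true_count = 412
print(laplace_count(true_count, epsilon=1.0))  # close to 412 on average
print(laplace_count(true_count, epsilon=0.1))  # much noisier, stronger privacy
```

The smaller the privacy parameter epsilon, the stronger the guarantee and the noisier the released statistic – the privacy–utility tradeoff in its simplest form.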
Early attempts to overcome this issue of releasing individual-level data were to a large extent rooted in the idea of removing those attributes that might be combined to form a unique identifier, thereby creating a privacy-preserving dataset. To identify the combinations of attributes that make the data vulnerable, these combinations first have to be predicted and then neutralized by mechanisms such as perturbation or generalization. Only after these steps have been applied can the transformed dataset be published.
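As an illustration of what generalization can look like in practice, here is a small, hypothetical sketch using pandas: ages are coarsened into ten-year bands and ZIP codes are truncated so that combinations of these attributes no longer single out individuals. The column names and values are invented for the example.

```python
import pandas as pd

# Hypothetical microdata with quasi-identifiers (age, zip) and a sensitive column.
df = pd.DataFrame({
    "age":       [34, 36, 52, 53, 29],
    "zip":       ["10115", "10117", "20095", "20097", "10115"],
    "diagnosis": ["A", "B", "A", "C", "B"],
})

# Generalization: coarsen age into ten-year bands and truncate ZIP codes.
lower = df["age"] // 10 * 10
df["age"] = lower.astype(str) + "-" + (lower + 9).astype(str)
df["zip"] = df["zip"].str[:3] + "**"

print(df)
```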
The caveat with this strategy lies in the high dimensionality of modern datasets: they contain a myriad of features and attributes, which makes it computationally intractable to anticipate all possible combinations of features that could serve as unique identifiers[10]. Data holders are therefore required either to accurately predict which attributes might become available to adversaries and remove only those, or to drop all identified possible combinations in the first place – neither strategy provides both the required utility benefits and the desired privacy guarantees.
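A quick back-of-the-envelope calculation illustrates why this becomes intractable: the number of candidate attribute combinations grows exponentially with the number of attributes. The attribute counts below are illustrative only.

```python
from math import comb

# All non-empty attribute subsets a data holder would have to consider as
# potential quasi-identifiers in a dataset with d attributes.
for d in (10, 30, 100):
    print(f"{d} attributes -> {2**d - 1:,} possible attribute combinations")

# Even restricting attention to combinations of at most 5 out of 100 attributes
# still leaves roughly 79 million candidates to check.
print(f"{sum(comb(100, k) for k in range(1, 6)):,}")
```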
Prominent examples of supposedly anonymous datasets that were released as such but could be re-identified include an anonymized browsing-history dataset of 3 million German citizens in which journalists re-identified politicians, uncovering their medical information and sexual preferences[11], and medical records for 10% of the population released by the Australian Department of Health, which were re-identified only weeks later[12].
One potential and highly debated novel data-sharing solution that has been proposed in recent years is synthetic data, which has quickly been put into practice for a vast array of use-cases, such as machine learning for healthcare and medicine[13], and is expected by some to revolutionize AI as we know it[14].
But what does ‘synthetic data’ mean?
According to the European Data Protection Supervisor, synthetic data describes artificial data that is generated from original data and a model that is trained to reproduce the characteristics and structure of the original data. This means that synthetic data and original data should deliver very similar results when undergoing the same statistical analysis[15].
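To give a flavor of what this means in code, here is a deliberately simple toy sketch: a multivariate Gaussian is fitted to hypothetical numeric patient data and entirely new records are sampled from it. Real generators rely on far richer models such as GANs, copulas, or Bayesian networks, but the principle is the same – learn a model of the original data, then release samples from the model instead of the data itself.

```python
import numpy as np

def fit_and_sample(original: np.ndarray, n_samples: int, seed: int = 0) -> np.ndarray:
    """Toy synthetic-data generator: fit a multivariate Gaussian to the
    original numeric data and sample new records from it."""
    rng = np.random.default_rng(seed)
    mean = original.mean(axis=0)
    cov = np.cov(original, rowvar=False)
    return rng.multivariate_normal(mean, cov, size=n_samples)

# Hypothetical original data: (age, systolic blood pressure) for 1,000 patients.
rng = np.random.default_rng(42)
original = rng.multivariate_normal([50.0, 130.0], [[100.0, 40.0], [40.0, 225.0]], size=1000)

synthetic = fit_and_sample(original, n_samples=1000)

# The synthetic sample reproduces the original's summary statistics ...
print(original.mean(axis=0), synthetic.mean(axis=0))
# ... while containing no record of any real individual.
print(np.cov(original, rowvar=False))
print(np.cov(synthetic, rowvar=False))
```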
While there has been a significant surge of interest in using synthetic data as a “smarter data anonymization solution”[16], it remains an open question whether it really provides better protection against adversaries than traditional mechanisms. Researchers from EPFL in Lausanne, for instance, have found that synthetic data does not provide a better tradeoff between privacy and utility than traditional anonymization techniques for the release of microdata[17]. They argue that synthetic data can only protect the individuals most vulnerable to privacy breaches (i.e. outliers) if the full promised value of the original dataset is not retained – so, again, the same issue as with traditional anonymization techniques.
Another point to consider is the potential biases and fairness issues that even successfully anonymized and shared data can introduce when used in decision-making systems and other applications. Studies examining the impact of privacy-preserving data on the outcomes of the models it feeds suggest that AI engineers should be aware that the statistical noise added as part of many privacy mechanisms can cause disparate impacts and unfair misallocations of resources[18][19][20].
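One intuition for why noise can have such disparate effects, sketched below under assumed, purely illustrative subgroup sizes: the same absolute amount of noise that is negligible for a large group can dominate the statistics of a small one, skewing whatever allocation decisions are based on them.

```python
import numpy as np

rng = np.random.default_rng(0)
epsilon = 1.0  # the same privacy budget is spent on every subgroup

# Hypothetical subgroup sizes: one large majority group, one small minority group.
subgroups = {"majority": 50_000, "minority": 200}

for name, true_count in subgroups.items():
    # Simulate many noisy releases of the subgroup count (Laplace mechanism).
    noisy = true_count + rng.laplace(scale=1.0 / epsilon, size=10_000)
    rel_error = np.abs(noisy - true_count) / true_count
    print(f"{name}: median relative error {np.median(rel_error):.4%}")

# The absolute noise is identical for both groups, but relative to the small
# group's size it is hundreds of times larger.
```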
A recent article examining real-world synthetic healthcare data found that the generation models used in these examples may be severely biased towards certain subgroups, so that subsequent analysis and decisions based on the synthetic data may not be fair[21]. This is just one example among many studies, but it shows why it is important to be aware of this issue and the caveats that need to be addressed when deploying large-scale synthetic data in sensitive domains.
Considering the work that has been published on privacy-enhancing technologies, and the years of thought that went into overcoming the central tradeoff between privacy and utility, a consensus seems to be emerging that these two concepts must be acknowledged as antagonists which, when releasing tabular microdata, cannot both be guaranteed at a high level simultaneously[22][23].
Nevertheless, building on these findings can still lead to a comprehensive solution to the privacy conundrum. By acknowledging that the set of use-cases for which strict privacy guarantees can be given is limited, Theresa Stadler and Carmela Troncoso argue, researchers can start developing tools that allow businesses to identify use-cases for which good privacy and utility can be achieved at the same time. They further recommend that, in order to empower data-driven business models linked to sharing protected data, data holders should be assisted in navigating the complex landscape of PETs through guidelines that match use-cases with the appropriate technology for sharing data.