Privacy Risk Assessment - Privacy-aware Ecosystem for Data Sharing

We live in a time of unprecedented opportunities to sense and analyse data describing human activities in extreme detail and resolution. Unfortunately, this comes with unprecedented risks, since such data can be related to personal and sensitive information about individuals. Data are typically stored in the databases of companies (e.g., telecom, insurance and retail companies), which cite legal constraints on privacy as a reason for not sharing them with science and society at large. Clearly, this blocks a new wave of knowledge-based services, as well as new scientific discoveries.

Indeed, a goal of KDD Lab is to design a privacy-aware ecosystem for data sharing that ensures no privacy violation occurs during data analysis and the knowledge discovery process.
Our intuition is that raw data are rarely needed to develop a service (for instance, raw trajectories are typically not needed for most personal or social mobility services). Thus, instead of blindly applying privacy-preserving transformations, we rely on service-specific pre-processing of user data, which may significantly alter the real privacy risk compared with the original raw data.
The proposed privacy-aware ecosystem is a systematic implementation of the Privacy-by-Design principle [1] and is also compliant with the data minimisation principle [2].

The framework we propose supports a Data Provider in assessing the empirical risk of re-identification inherent in the data to be transferred to a Service Developer. To enable data sharing, we examine a repertoire of possible transformations of the raw data (i.e., aggregations, selections and filtering), with the purpose of selecting the one that yields an adequate trade-off between data quality and privacy risk. The systematic exploration of this search space of transformations is precisely the scope of our proposed privacy risk ecosystem. To do this, we need to specify and define: 1) metrics capable of evaluating the privacy risk, such as the risk of re-identification or the risk of inference; 2) attack models that describe the method an adversary uses to re-identify a specific user in the released data and to infer new information about them; and 3) metrics capable of evaluating the quality of the shared data. The general idea is that, once specific assumptions are made on the personal data and on the target analytical questions to be answered with the data, it is conceivable to design a privacy-preserving data transformation able to: 1) transform the source data into an anonymous or obfuscated version with a quantifiable privacy guarantee (measured, e.g., as the probability of re-identification), and 2) guarantee that the target analytical questions can be answered correctly (within a quantifiable approximation) using the transformed data instead of the original ones.
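As an illustration, the simplest such re-identification metric can be sketched as follows. This is a hypothetical simplification (function and variable names are invented for illustration, not taken from our system): a user's risk is the inverse of the number of released records compatible with the adversary's background knowledge.

```python
def reidentification_risk(records, background_knowledge):
    """Probability of re-identifying a user whose released record is
    compatible with the adversary's background knowledge: 1 / (number
    of matching users).  Returns 0.0 if no record matches."""
    matches = sum(1 for record in records.values()
                  if background_knowledge <= record)  # subset test
    return 1.0 / matches if matches else 0.0

# toy released data: user -> set of (location, time-slot) observations
records = {
    "u1": {("A", "am"), ("B", "pm")},
    "u2": {("A", "am"), ("C", "pm")},
    "u3": {("A", "am"), ("B", "pm")},
}

# the adversary knows the target visited location A in the morning
print(reidentification_risk(records, {("A", "am")}))  # 1/3: three candidates
# the adversary also knows a visit to C in the afternoon
print(reidentification_risk(records, {("C", "pm")}))  # 1.0: only u2 matches
```

The same pattern underlies k-anonymity-style guarantees: the larger the set of indistinguishable users, the lower the risk.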

Clearly, this methodology is useful in a wide range of circumstances; KDD Lab has already applied this privacy risk assessment framework to two different contexts, relying on real datasets:
1) Using Call Detail Record (CDR) data (i.e., records of events produced by telecommunication network elements), we estimated the distribution of a population, classifying people as residents, commuters or visitors on the basis of their call activities. Here, the transformation is a spatio-temporal aggregation of calls at the municipality level, since information as specific as raw CDR logs is not required to develop this service [3]. For example, to recognise a resident we only need to know that he/she places calls on both working days and holidays, both in the morning and in the evening.
Assuming an adversary knows the exact phone activity of a user in 3 of the 4 weeks, we found that less than 0.1% of users have a re-identification risk of 50%, while for more than 99.9% of users the risk is at most 33%; moreover, 99% of users have a privacy risk below about 7%.
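The resident/commuter/visitor classification above can be sketched on aggregated call profiles alone. The following is a deliberately crude, hypothetical simplification (the actual rules used in [3] are richer), in which each call is reduced to two booleans: working day vs. holiday, and morning vs. evening.

```python
def classify_user(calls):
    """calls: iterable of (working_day, morning) boolean pairs, one per
    call, already aggregated from the CDR log.  A resident calls on both
    working days and holidays, in both mornings and evenings."""
    day_types = {working for working, _ in calls}
    time_slots = {morning for _, morning in calls}
    if day_types == {True, False} and time_slots == {True, False}:
        return "resident"
    # users who call only on working days are treated as commuters,
    # the remaining ones as occasional visitors (a crude illustrative rule)
    return "commuter" if day_types == {True} else "visitor"

print(classify_user([(True, True), (True, False),
                     (False, True), (False, False)]))  # resident
print(classify_user([(True, True), (True, False)]))    # commuter
```

Note that this classifier never needs the called numbers, cell towers, or exact timestamps: the municipality-level aggregation is enough, which is exactly why the raw CDR log need not be shared.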
2) Using Global Positioning System (GPS) data, we extracted from raw trajectories three different examples of pre-processing that can be used for specific mobility services: presence data in a territory (enabling parking assistance and context-aware advertising), trajectory data (related to navigation and car-pooling services), and collectively frequent road segments (supporting the identification of strategic locations for new facilities, such as franchise stores and fuel stations, and routing optimisation to avoid congestion).
For the first kind of pre-processed data, in the worst case 20% of users have a re-identification risk less than or equal to 50%, 7% have a risk less than or equal to 33%, and 4% have a risk less than or equal to 20%. The second kind of data yields the worst re-identification risk, since its level of detail is very close to that of the original raw data: knowing only a single road segment, an adversary can completely re-identify 75% of users (100% risk); knowing any two road segments, the percentage of users at 100% risk rises to roughly 94%. The third data format gives the best results, since it is computed at the collective level: in the worst case, about 90% of users have a risk less than or equal to 20%, and more than 80% of users have a risk less than or equal to 10%.
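The road-segment attack above can be sketched as an exhaustive evaluation: for each user, take the worst-case risk over all subsets of k segments the adversary might know. This is a minimal sketch with invented toy data, not our actual evaluation pipeline.

```python
from itertools import combinations

def worst_case_risks(user_segments, k):
    """For each user, the worst-case re-identification risk assuming
    the adversary knows any k road segments of that user's trajectory:
    the maximum over k-subsets of 1 / (number of users whose
    trajectory contains that subset)."""
    risks = {}
    for user, segments in user_segments.items():
        worst = 0.0
        for known in combinations(sorted(segments), k):
            matches = sum(1 for other in user_segments.values()
                          if set(known) <= other)
            worst = max(worst, 1.0 / matches)
        risks[user] = worst
    return risks

# toy trajectories: user -> set of traversed road segments
data = {"u1": {"a", "b"}, "u2": {"a", "c"}, "u3": {"a", "b"}}
# u2 traverses segment "c" alone, so one known segment re-identifies it
print(worst_case_risks(data, k=1))  # {'u1': 0.5, 'u2': 1.0, 'u3': 0.5}
```

Aggregating these per-user risks into a cumulative distribution yields exactly the kind of statistics reported above (e.g., the fraction of users at 100% risk given k known segments).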

These examples show how solutions that avoid external access to raw data, providing instead the minimum information needed to perform the target analysis correctly, can be achieved in real scenarios. In particular, they highlight how this kind of data can be analysed by our system to give meaningful measures of users' risk and of data quality as the privacy guarantee varies.

[1] Ann Cavoukian. Privacy design principles for an integrated justice system - working paper, 2000.
[2] Directive (EU) 2016/680 of the European Parliament and of the Council of 27 April 2016. Official Journal of the European Union, 2016.
[3] Anna Monreale, Salvatore Rinzivillo, Francesca Pratesi, Fosca Giannotti, and Dino Pedreschi. Privacy-by-design in big data analytics and social mining. EPJ Data Science, 3(1):10, 2014.
[4] Francesca Pratesi, Anna Monreale, Roberto Trasarti, Fosca Giannotti, Dino Pedreschi, and Tadashi Yanagihara. PRISQUIT: a system for assessing privacy risk versus quality in data sharing. Technical Report 2016-TR-043, ISTI-CNR, Pisa, Italy, 2016.