Statistical Disclosure Control meets Recommender Systems: A practical approach

Similar documents
Protecting Against Maximum-Knowledge Adversaries in Microdata Release: Analysis of Masking and Synthetic Data Using the Permutation Model

A COMPARATIVE STUDY ON MICROAGGREGATION TECHNIQUES FOR MICRODATA PROTECTION

Disclosure Control by Computer Scientists: An Overview and an Application of Microaggregation to Mobility Data Anonymization

Ordinal, Continuous and Heterogeneous k-anonymity Through Microaggregation

Fuzzy methods for database protection

Partition Based Perturbation for Privacy Preserving Distributed Data Mining

Survey of Statistical Disclosure Control Technique

n-confusion: a generalization of k-anonymity

Anonymization Algorithms - Microaggregation and Clustering

A Disclosure Avoidance Research Agenda

Improving Results and Performance of Collaborative Filtering-based Recommender Systems using Cuckoo Optimization Algorithm

CS573 Data Privacy and Security. Li Xiong

Privacy-preserving Publication of Trajectories Using Microaggregation

CERIAS Tech Report

RESEARCH REPORT SERIES (Statistics # ) Masking and Re-identification Methods for Public-use Microdata: Overview and Research Problems

An efficient hash-based algorithm for minimal k-anonymity

Abstract. Index Terms

Transforming Data to Satisfy Privacy Constraints

The Two Dimensions of Data Privacy Measures

A Recommender System Based on Improvised K- Means Clustering Algorithm

Privacy Preserving Data Sharing in Data Mining Environment

k-anonymization May Be NP-Hard, but Can it Be Practical?

Comparative Analysis of Anonymization Techniques

Knowledge Discovery and Data Mining 1 (VO) ( )

Review on Techniques of Collaborative Tagging

ARGUS version 4.0. User s Manual. Statistics Netherlands Project: CASC-project P.O. Box 4000

Collaborative Filtering using Euclidean Distance in Recommendation Engine

Data Management Glossary

ARGUS version 4.1. User s Manual. Statistics Netherlands Project: CENEX-project P.O. Box 4000

Utility-Preserving Differentially Private Data Releases Via Individual Ranking Microaggregation

A Survey on Various Techniques of Recommendation System in Web Mining

Overview. Data-mining. Commercial & Scientific Applications. Ongoing Research Activities. From Research to Technology Transfer

Privacy-Preserving. Introduction to. Data Publishing. Concepts and Techniques. Benjamin C. M. Fung, Ke Wang, Chapman & Hall/CRC. S.

Comparative Evaluation of Synthetic Dataset Generation Methods

Machine Learning (BSMC-GA 4439) Wenke Liu

Privacy Preserving Data Mining: An approach to safely share and use sensible medical data

Record Linkage using Probabilistic Methods and Data Mining Techniques

Recommendation with Differential Context Weighting

Self-Organizing Maps for cyclic and unbounded graphs

Privacy Preserving Data Publishing for Recommender System

Comparison of Recommender System Algorithms focusing on the New-Item and User-Bias Problem

Randomized Response Technique in Data Mining

Business microdata dissemination at Istat

Differential Privacy. Seminar: Robust Data Mining Techniques. Thomas Edlich. July 16, 2017

Rotation Perturbation Technique for Privacy Preserving in Data Stream Mining

Recommendation Algorithms: Collaborative Filtering. CSE 6111 Presentation Advanced Algorithms Fall Presented by: Farzana Yasmeen

Privacy Preserving Data Mining, Evaluation Methodologies. Igor Nai Fovino and Marcelo Masera

Engineering Privacy by Design Reloaded

Improving Privacy And Data Utility For High- Dimensional Data By Using Anonymization Technique

Recommender Systems using Graph Theory

Accumulative Privacy Preserving Data Mining Using Gaussian Noise Data Perturbation at Multi Level Trust

A FUZZY BASED APPROACH FOR PRIVACY PRESERVING CLUSTERING

Weighted Alternating Least Squares (WALS) for Movie Recommendations) Drew Hodun SCPD. Abstract

An Efficient Clustering Method for k-anonymization

Property1 Property2. by Elvir Sabic. Recommender Systems Seminar Prof. Dr. Ulf Brefeld TU Darmstadt, WS 2013/14

Towards the Use of Graph Summaries for Privacy Enhancing Release and Querying of Linked Data

Security Control Methods for Statistical Database

Clustering Algorithms for Data Stream

Unsupervised learning on Color Images

Clustering. Robert M. Haralick. Computer Science, Graduate Center City University of New York

A Survey on: Privacy Preserving Mining Implementation Techniques

Hierarchical and Ensemble Clustering

Hybrid Recommendation System Using Clustering and Collaborative Filtering

Triangulation-Based Multivariate Microaggregation

Introduction to Data Mining

Know your neighbours: Machine Learning on Graphs

Web Service Recommendation Using Hybrid Approach

Data Distortion for Privacy Protection in a Terrorist Analysis System

Prowess Improvement of Accuracy for Moving Rating Recommendation System

A Review on Privacy Preserving Data Mining Approaches

Privacy-Enhanced Collaborative Filtering

k-degree Anonymity And Edge Selection: Improving Data Utility In Large Networks

The Two Dimensions of Data Privacy Measures

Privacy, Security & Ethical Issues

KEYWORDS: Clustering, RFPCM Algorithm, Ranking Method, Query Redirection Method.

Social Voting Techniques: A Comparison of the Methods Used for Explicit Feedback in Recommendation Systems

Novelty Detection from an Ego- Centric Perspective

Location Privacy in Location-Based Services: Beyond TTP-based Schemes

Privacy-Preserving Collaborative Filtering using Randomized Perturbation Techniques

Implementation of Privacy Mechanism using Curve Fitting Method for Data Publishing in Health Care Domain

A new shiny GUI for sdcmicro

Database and Knowledge-Base Systems: Data Mining. Martin Ester

A Hybrid Peer-to-Peer Recommendation System Architecture Based on Locality-Sensitive Hashing

Challenges in Ubiquitous Data Mining

Artificial Neuron Modelling Based on Wave Shape

Survey Result on Privacy Preserving Techniques in Data Publishing

Co-clustering for differentially private synthetic data generation

8/3/2017. Contour Assessment for Quality Assurance and Data Mining. Objective. Outline. Tom Purdie, PhD, MCCPM

Collaborative Filtering Based on Iterative Principal Component Analysis. Dohyun Kim and Bong-Jin Yum*

Towards Traffic Anomaly Detection via Reinforcement Learning and Data Flow

Analysis of Dendrogram Tree for Identifying and Visualizing Trends in Multi-attribute Transactional Data

arxiv: v4 [cs.ir] 28 Jul 2016

Use of KNN for the Netflix Prize Ted Hong, Dimitris Tsamis Stanford University

COLD-START PRODUCT RECOMMENDATION THROUGH SOCIAL NETWORKING SITES USING WEB SERVICE INFORMATION

Chapter 5: Outlier Detection

Available online at ScienceDirect. Procedia Technology 17 (2014 )

Hotel Recommendation Based on Hybrid Model

CS435 Introduction to Big Data Spring 2018 Colorado State University. 3/21/2018 Week 10-B Sangmi Lee Pallickara. FAQs. Collaborative filtering

Multi-Level Trust Privacy Preserving Data Mining to Enhance Data Security and Prevent Leakage of the Sensitive Data

Clustering. Chapter 10 in Introduction to statistical learning

Transcription:

Research Group Statistical Disclosure Control meets Recommender Systems: A practical approach Fran Casino and Agusti Solanas {franciscojose.casino, agusti.solanas}@urv.cat Smart Health Research Group Universitat Rovira i Virgili

Background Outline Recommender Systems and Collaborative Filtering Limitations and Countermeasures Statistical Disclosure Control and Privacy-Preserving Collaborative Filtering Evaluation Tools Contributions to Privacy-Preserving Collaborative Filtering Evaluated Methods Experiments and Comparisons Conclusions 2

Recommender Systems Recommender Systems evolve from the Knowledge Discovery in Databases field. In a typical recommender system, people provide opinions/evaluations as inputs, which the system then aggregates and directs to appropriate recipients [Resnick et. al.]. The main advantage of Recommender Systems (RS) is that they help us to deal with/overcome information overload. P. Resnick, H. Varian, Recommender Systems Communications of the ACM 40(3), 56 (1997) 3

Collaborative Filtering Collaborative Filtering (CF) is a crowdsourcingbased recommender system which aims to make suggestions on items (books, music, movies or routes) based on preferences of users that have already acquired and/or rated these items. 4

CF Philosophy The recommendations provided by CF methods are based on the assumption that similar users will be interested in the same items. Users collaborate in order to obtain more quality recommendations. 5

CF Families Collaborative Filtering Memory Model Hybrid 6

Limitations & Privacy Bribing Shilling Sparseness CF limitations Synonymy Scalability Privacy Cold start Black sheep 7

Collaborative Filtering & Privacy Recommender Systems Collaborative Filtering Privacy-preserving Collaborative Filtering Collaborative Filtering Statistical Disclosure Control 8

Statistical Disclosure Control Statistical Disclosure Control (SDC, [Hunderpool et. al.]), seeks to anonymise microdata sets (i.e. datasets consisting of multiple records corresponding to individual respondents) in order to prevent their disclosure. Types of disclosure Identity Disclosure Identification of an entity (person, institution). Attribute Disclosure The intruder finds something new about the target entity. A. Hundepool, et al. Statistical Disclosure Control. Wiley, 2012. 9

Data Anonymisation Techniques Overview Top/bottom coding Rounding Sampling Suppression Generalisation Limitation of detail Anatomisation Data swapping Noise addition Microaggregation 10

Microaggregation Microaggregation is a family of SDC algorithms for datasets used to prevent against reidentification, which works in two stages: In1. thethe caseset of of RS records in a dataset is clustered in such a way that: We i) consider each cluster all ratings contains asat quasi-identifiers. least k records; Therefore, ii) recordswe within anonymise a cluster are allasratings similar as in possible. order to achieve k-anonymity. 2. Records within each cluster are replaced by a representative of the cluster, typically the centroid record (i.e. the average of the cluster). 11

Evaluation Tools Evaluation Tools SDC Metrics RS Metrics 12

SDC Information Loss The quantity of information which exist in the initial microdata and because of disclosure control methods does not occur in masked microdata [Willemborg et. al.]. Willemborg L., Waal T. Elements of Statistical Disclosure Control. Springer Verlag. 13

SDC Disclosure Risk The risk that a given form of disclosure will arise if a masked microdata is released [Chen et. al.]. Value/attribute disclosure Identity disclosure Individual measures - The risk per record or the probability of correctly re-identifying a unit. [Willemborg et. al.] Global measures - The risk for the entire dataset. Number of correct re-identifications according to a linking measure. [Domingo-Ferrer et. al.] Chen G., Keller-McNulty S. Estimation of Deidentification Disclosure Risk in Microdata. Journal of Official Statistics, Vol 14. No. 1, 79-95. Willemborg L. Waal T. Elements of Statistical Disclosure Control, Springer Verlag. Domingo-Ferrer J. Torra V. Disclosure Risk Assessment in Statistical Microdata Protection Via Advanced Record Linkage Statistics and Computing, vol 13, no 4, pp- 343-354 14

RS Metrics Prediction Match Slight Match Slight Reversal Reversal Real Value Ratings Range 15

Background Outline Recommender Systems and Information Overload Limitations of Collaborative Filtering and Countermeasures Statistical Disclosure Control and Privacy-Preserving Collaborative Filtering Evaluation Tools Contributions to Privacy-Preserving Collaborative Filtering Evaluated Methods Experiments and Comparisons Conclusions 16

PPCF Methods Gaussian Noise Addition with zero mean. Maximum Distance to Average Vector (MDAV) [Domingo-Ferrer et. al.] Variable MDAV (V-MDAV) [Solanas et. al.] J. Domingo-Ferrer and J. M. Mateo-Sanz. Practical data-oriented microaggregation for statistical disclosure control, IEEE Transactions on Knowledge and data Engineering, 2002. A. Solanas and A. Martínez-Ballesté. V-MDAV : A Multivariate Microaggregation With Variable Group Size. Seventh COMPSTAT Symposium of the IASC, 2006. 17

MDAV Fixed-size groups & k-anonymity 18

V-MDAV After each iteration, a heuristic evaluates whether to include a new record r to a group: If r is closer to the actual group than to the rest of records, according to its distance and a gain factor. If the actual group size is < 2k-1, because the optimal k-partition is achieved when groups consists of k to 2k-1 records [Domingo- Ferrer et. al.]. The gain factor can be tuned in order to fit the data distribution. Variable-sized Groups & k-anonymity J. Domingo-Ferrer and V. Torra. Ordinal, continuous and heterogenerous k-anonymity through microaggregation. Data Mining and Knowledge Discovery, 11(2):195 212, 2005. 19

Data Preprocessing Matrices are filled and stantardised (z-scores). where x i is the i-th value of item x and µ and σ are the mean and the standard deviation of item x, respectively. Next, the corresponding method is applied. Comparison between methods in terms of data utility and privacy using well-known metrics. 20

GNA & MDAV Movielens 100k Jester 21

MDAV & V-MDAV (I) 22

MDAV & V-MDAV (II) 23

Behavioural Precision B/A 24

Conclusions - Highlights Despite the great advantages of using CF, we have highlighted its downside regarding users privacy. We have analysed/discussed how V-MDAV obtains better results and provides both more privacy and data usability than wellknown methods such as MDAV and Gaussian noise addition. Both microaggregation-based proposals achieve k-anonymity, which guarantees privacy by design, a feature not offered by GNA. Moreover, for low cardinality values, recommendations were more accurate than these obtained when using data without obfuscation, showing the efficacy of our proposal. The use of behavioural measures allowed us to better analyse data and increase its usability. 25

Research Group Statistical Disclosure Control meets Recommender Systems: A practical approach Fran Casino and Agusti Solanas {franciscojose.casino, agusti.solanas}@urv.cat Smart Health Research Group Universitat Rovira i Virgili