PRIVACY IN LARGE DATASETS

Similar documents
PRIVACY IN LARGE DATASETS

Prateek Mittal Princeton University

De-anonymization. A. Narayanan, V. Shmatikov, De-anonymizing Social Networks. International Symposium on Security and Privacy, 2009

An Efficient and Robust Social Network De-anonymization Attack

De-anonymizing Social Networks

Differential Privacy. Seminar: Robust Data Mining Techniques. Thomas Edlich. July 16, 2017

Differential Privacy. CPSC 457/557, Fall 13 10/31/13 Hushiyang Liu

Privacy in Statistical Databases

Michelle Hayes Mary Joel Holin. Michael Roanhouse Julie Hovden. Special Thanks To. Disclaimer

Privacy. CS Computer Security Profs. Vern Paxson & David Wagner

Privacy Preserving Data Publishing: From k-anonymity to Differential Privacy. Xiaokui Xiao Nanyang Technological University

De-anonymizing Social Networks and Inferring Private Attributes Using Knowledge Graphs

Solution of Exercise Sheet 11

Use of Synthetic Data in Testing Administrative Records Systems

An Empirical Analysis of Communities in Real-World Networks

An Architectural Approach to Inter-domain Measurement Data Sharing

SOFIA: Social Filtering for Niche Markets

Privacy-Enhancing Technologies & Applications to ehealth. Dr. Anja Lehmann IBM Research Zurich

Data Anonymization. Graham Cormode.

Goal-Based Assessment for the Cybersecurity of Critical Infrastructure

WHITE PAPER: USING AI AND MACHINE LEARNING TO POWER DATA FINGERPRINTING

Learning Methods for Similarity Handling in Phonebook-centric Social Networks

Automated Website Fingerprinting through Deep Learning

Data Sources for Cyber Security Research

SMART DEVICES: DO THEY RESPECT YOUR PRIVACY?

Anonymous Communications

Achieving k-Anonymity Privacy Protection Using Generalization and Suppression

Anonymity. Assumption: If we know IP address, we know identity

Dandelion: Privacy-Preserving Transaction Propagation in Bitcoin's P2P Network

Spy vs. spy: Anonymous messaging over networks. Giulia Fanti, Peter Kairouz, Sewoong Oh, Kannan Ramchandran, Pramod Viswanath

DE-ANONYMIZING SOCIAL NETWORKS AND MOBILITY TRACES

Structured Light II. Thanks to Ronen Gvili, Szymon Rusinkiewicz and Maks Ovsjanikov

Can't You Hear Me Knocking: Security and Privacy Threats from ML Application to Contextual Information

CS435 Introduction to Big Data Spring 2018 Colorado State University. 3/21/2018 Week 10-B Sangmi Lee Pallickara. FAQs. Collaborative filtering

A Practical Attack to De-Anonymize Social Network Users

CNT Computer and Network Security: Privacy/Anonymity

Vulnerability Disclosure in the Age of Social Media: Exploiting Twitter for Predicting Real-World Exploits

Sanitization Techniques against Personal Information Inference Attack on Social Network

Private & Anonymous Communication. Peter Kairouz ECE Department University of Illinois at Urbana-Champaign

Can't you hear me knocking

3D Computer Vision. Structured Light II. Prof. Didier Stricker. Kaiserlautern University.

Privacy Preserving Probabilistic Record Linkage

Anonymity Tor Overview

OSN: when multiple autonomous users disclose another individual s information

Section 4 General Factorial Tutorials

BBC Tor Overview. Andrew Lewman, March 7

communication. Claudia Díaz, Katholieke Universiteit Leuven, Dept. Electrical Engineering ESAT/COSIC. October 9, 2007

Being Prepared In A Sparse World: The Case of KNN Graph Construction. Antoine Boutet DRIM LIRIS, Lyon

this security is provided by the administrative authority (AA) of a network, on behalf of itself, its customers, and its legal authorities

Efficient Private Set Intersection for a Decentralised Web of Trust

A Review on Privacy Preserving Data Mining Approaches

Symmetric Cryptography

COS 318: Operating Systems. File Systems. Topics. Evolved Data Center Storage Hierarchy. Traditional Data Center Storage Hierarchy

Big Data - Security with Privacy

FBI Tor Overview. Andrew Lewman January 17, 2012

0x1A Great Papers in Computer Security

Scalable Trigram Backoff Language Models

Ambiguity Handling in Mobile-capable Social Networks

Data Mining Lecture 2: Recommender Systems

The Tor Network. Cryptography 2, Part 2, Lecture 6. Ruben Niederhagen. June 16th, / department of mathematics and computer science

Mapping Internet Sensors with Probe Response Attacks

Anonymizing Social Networks

Online Anonymity. Open Source, Open Network. Community of researchers, developers, users and relay operators. U.S. 501(c)(3) nonprofit organization

Survey Result on Privacy Preserving Techniques in Data Publishing

Graph Analytics in the Big Data Era

Storage and File System

Network Security: Anonymity. Tuomas Aura T Network security Aalto University, Nov-Dec 2010

Data Security and Privacy. Topic 18: k-anonymity, l-diversity, and t-closeness

A Theory of Privacy and Utility for Data Sources

Recommender Systems - Introduction. Data Mining Lecture 2: Recommender Systems

Graph Algorithms using Map-Reduce. Graphs are ubiquitous in modern society. Some examples: The hyperlink structure of the web

Instructions 1. Elevation of Privilege Instructions. Draw a diagram of the system you want to threat model before you deal the cards.

K ANONYMITY. Xiaoyong Zhou

ELEC6910Q Analytics and Systems for Social Media and Big Data Applications Lecture 4. Prof. James She

and Semantic Information

An Approach for Privacy Preserving in Association Rule Mining Using Data Restriction

A Case For OneSwarm. Tom Anderson University of Washington.

Network Working Group Request for Comments: 1984 Category: Informational August 1996

Istat s Pilot Use Case 1

INSE Lucky 13 attack - continued from previous lecture. Scribe Notes for Lecture 3 by Prof. Jeremy Clark (January 20th, 2014)

Sampling Large Graphs for Anticipatory Analysis

Avoiding The Man on the Wire: Improving Tor s Security with Trust-Aware Path Selection

NDoT: Nearest Neighbor Distance Based Outlier Detection Technique

Technical Solutions Novel Challenges to Privacy Privacy Enhancing Technologies Examples

Security and privacy in the smartphone ecosystem: Final progress report

Exploring re-identification risks in public domains

Basic Concepts and Definitions. CSC/ECE 574 Computer and Network Security. Outline

DETERMINISTIC VARIATION FOR ANTI-TAMPER APPLICATIONS

Distributed Data Anonymization with Hiding Sensitive Node Labels

Evaluating Record Linkage Software Using Representative Synthetic Datasets

Tor: The Second-Generation Onion Router. Roger Dingledine, Nick Mathewson, Paul Syverson

Jeffrey Friedberg. Chief Trust Architect Microsoft Corporation. July 12, 2010 Microsoft Corporation

Wireless Network Security Spring 2014

HYDRA Large-scale Social Identity Linkage via Heterogeneous Behavior Modeling

Revolver: Vertex-centric Graph Partitioning Using Reinforcement Learning

A Virtual Laboratory for Study of Algorithms

Mining and Analyzing Online Social Networks

The Table Privacy Policy Last revised on August 22, 2012

2ND GENERATION ONION ROUTER

Cooperative Computing for Autonomous Data Centers

Transcription:

PRIVACY IN LARGE DATASETS. 2013.10.14, Budapest. Gábor György Gulyás, Dept. of Networked Systems and Services. gulyas@crysys.hu http://gulyas.info // @GulyasGG

MOTIVATION: WHY DO WE NEED PRIVACY AND ANONYMITY?

We live in information societies. The development of infocommunication technologies creates and reshapes the information society. Government: efficiency, ease of communication, greater control, ... Commercial parties: new technology & data, great opportunities, ... Society itself: comfort, changes in social interaction, ...

Anyway, who cares? Who would abuse my privacy? I have nothing to hide! Think again. Isn't it all already lost? Nope. Effect on a personal level? Freedom of speech?

What can we do? Legal action! A bit laggy... surveillance societies? Social action! Slow, but usually OK: raise awareness, demystify false beliefs, act as a rational consumer, boycott some services, ... Think of the other possibilities. In contrast to how tech people think, mass violation of privacy is basically a non-technical problem. Use privacy technology! Symptomatic treatment.

PRIVACY IN LARGE DATASETS http://www.flickr.com/photos/t_gregorius/5839399412/

Natural sources of big data in (social) technology, e.g.: social networks & media, recommender systems, web tracking databases (profiling), document indexing & search. Why? $$: predicting user behavior, exposing trends.

How identifiable are we? Sweeney, 2000: 87% of the US population (216 million of 248 million) is identifiable by {5-digit ZIP, gender, date of birth}. Revisiting study (Golle, 2006): 64% of the US population is identifiable by {ZIP code, gender, date of birth}.
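To make the uniqueness figures concrete, here is a minimal sketch (not from the talk) of how such a study can be reproduced on a census-like table: group the records by the quasi-identifier combination and count the fraction that lands in a group of size one. The column names and the toy data are hypothetical.

```python
# Minimal sketch: estimating how identifying a set of quasi-identifiers is.
# The column names ("zip", "gender", "dob") and the toy records are assumptions;
# the idea mirrors the Sweeney/Golle uniqueness studies, not their exact data.
import pandas as pd

def uniqueness(df, quasi_identifiers):
    """Fraction of records that are unique for the given attribute combination."""
    group_sizes = df.groupby(quasi_identifiers).size()
    unique_records = group_sizes[group_sizes == 1].sum()
    return unique_records / len(df)

population = pd.DataFrame({
    "zip":    ["08540", "08540", "08541", "08542"],
    "gender": ["F", "M", "F", "F"],
    "dob":    ["1990-01-01", "1990-01-01", "1985-06-15", "1990-01-01"],
})
print(uniqueness(population, ["zip", "gender", "dob"]))  # 1.0 in this toy example
```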

How identifiable are we? (2) Work/home location pairs as identifying information (US): on average ~1,500 persons per location cell; 5% are totally identifiable; the average anonymity set size is ca. 20. Location-based services?! Golle & Partridge, 2009

How identifiable are we? (3) Anonymized Netflix dataset vs. public IMDb ratings: rarely used features are identifying; only 8 ratings identify 99% of users, even if 2 ratings are erroneous and dates are known only within a 2-week timeframe. Using big data, things can get worse. Narayanan & Shmatikov, 2008

How identifiable are we? (4) An experiment on Xing indicates that group memberships are identifying: ~8M users at the time; ca. 42% uniquely identified; extremely small anonymity sets otherwise (2,912 collisions for 90% of users!). Wondracek et al., 2010

How identifiable are we? (5) Fingerprinting evolves: 2010: browser fingerprint from simple characteristics such as installed fonts (Arial, sans-serif, Comic Sans, ...), browser version (e.g., Firefox 23.0), timezone (-60), screen resolution (1280x1024); accuracy e.g. 94.2% (Eckersley, 2010). 2011: system fingerprint, works well on Windows (Boda et al., 2011). 2012: connecting personal devices. Future: biometric fingerprinting? Billions of (device) fingerprints in databases, based on simple characteristics.
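The slide's example attributes (browser version, timezone, screen resolution, fonts) can be combined into a device fingerprint simply by hashing them in a canonical order; the sketch below illustrates only the idea, it is not the Panopticlick implementation, and the attribute names and values are made up.

```python
# Minimal sketch of attribute-based device fingerprinting (in the spirit of
# Eckersley's study); attribute names and values below are illustrative only.
import hashlib

def fingerprint(attributes: dict) -> str:
    """Concatenate attribute values in a fixed order and hash them."""
    canonical = "|".join(f"{k}={attributes[k]}" for k in sorted(attributes))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

device = {
    "user_agent": "Firefox 23.0",
    "timezone_offset": "-60",
    "screen": "1280x1024",
    "fonts": "Arial,sans-serif,Comic Sans",
}
print(fingerprint(device)[:16])  # stable identifier as long as the attributes are stable
```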

How identifiable are we? (6) Unstructured data is also at stake! Writing style has structure: e.g., inspecting the relative frequency of "since" vs. "because"; many such features enable stylometric profiling. Results on searching for the author of a few posts (100,000 blogs, cross-context validation): 20% correct identification from 3 posts. Improvements: manual inspection of the top 20 results gives a 35% success rate; at lower recall (10%), 80%+ precision; 30-35% correct identification with 20 posts per author on average. Narayanan et al., 2012
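As a toy illustration of one stylometric feature named on the slide, the sketch below computes the relative frequency of "since" versus "because" in a text; a real attack such as Narayanan et al.'s combines hundreds of lexical and syntactic features, so this is only a pointer to the idea.

```python
# Toy illustration of a single stylometric feature: relative frequency of
# "since" vs. "because". Real stylometric profiling uses many such features.
import re

def since_because_ratio(text: str) -> float:
    words = re.findall(r"[a-z']+", text.lower())
    since = words.count("since")
    because = words.count("because")
    total = since + because
    return since / total if total else 0.5  # 0.5 = no evidence either way

post = "I stayed home because it rained, and since then I have been reading."
print(since_because_ratio(post))  # 0.5 for this one-sentence example
```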

How identifiable are we? (7) Network alignment on temporal location information and social networks, with ca. 80% TPR. Srivatsa & Hicks, 2012

Sum of these problems. Basic problem: a population of 7 billion can be indexed with ~33 bits of information. Low similarity of items, heavy-tail distribution of the used attributes, easy feature selection! k-anonymity fails because of sparsity. Narayanan & Shmatikov, 2008
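The "33 bits" figure is just the base-2 logarithm of the world population; the short calculation below checks it and shows how the surprisal of individual attributes adds up toward it. The attribute frequencies used are illustrative assumptions, not census figures.

```python
# Back-of-the-envelope check of the "33 bits" claim: bits needed to single out
# one person among 7 billion, and how much a single attribute reveals.
import math

population = 7_000_000_000
print(math.log2(population))          # ~32.7 bits needed for unique identification

def surprisal(p: float) -> float:
    """Information (in bits) revealed by an attribute shared by a fraction p of people."""
    return -math.log2(p)

print(surprisal(1 / 2))      # gender: ~1 bit
print(surprisal(1 / 365.25)) # birthday (day of year): ~8.5 bits
print(surprisal(1 / 30000))  # a ZIP code with ~30,000 residents: ~14.9 bits
```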

Sum of these problems (2) Pros: publishing (anonymized) databases is good for research; we have types and sizes of data never seen before; we have some ideas, but we are not there yet (privacy vs. usability). Cons: breakability of anonymization schemes? Provability? What about wholesale surveillance? One should prepare for attackers with strong auxiliary data!

STRUCTURAL DE-ANONYMIZATION IN SOCIAL NETWORKS

Data perturbation and sanitization. Original data: named profiles with attributes, e.g., Emőke (female, 14), Bea (female, 45), Hajni (female, 16), Bence (male, 48), Gergő (male, 17), Zalán (male, 45), Kati (female, 41). Sanitized data: names replaced by pseudonyms, e.g., #135 (female, 14), #16 (female, 45), #12 (female, 16), #1 (male, 48), #97 (male, 45), #7 (male, 17), #20 (female, 41). Perturbed data: the same pseudonymized profiles, #135 (female, 14), #16 (female, 45), #12 (female, 16), #1 (male, 48), #97 (male, 45), #7 (male, 17), #20 (female, 41).

Attacker model. Auxiliary information, G_src (a public crawl, e.g., Flickr). Anonymized graph, G_tar (anonymized export, e.g., Twitter) carrying sensitive attributes (in the example: Democrat, Republican).

Attacker model (2) Auxiliary information, G_src (a public crawl, e.g., Flickr); anonymized graph, G_tar (anonymized export, e.g., Twitter). Global match: 1. seed identification; 2. propagation, i.e., relative match (local re-identification). Narayanan & Shmatikov, 2009

Large-scale re-identification. The underlying concepts work on large social networks. Auxiliary data: Flickr (3.3M nodes, 53M edges). Target (anonymized) data: Twitter (224k nodes, 8.5M edges). Ground truth: 27k nodes (matched by name/username/location). Results: 30% TP, only 12% FP (initialized with 150 high-degree seeds).

Global re-identification (a.k.a. seed identification). Originally 4-cliques with intact structure, matched by node degree and the common-neighbor counts of the members, allowing error within a factor of 1 ± ε (e.g., ε = 0.05).
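A minimal sketch of this seed step, assuming networkx graphs and a brute-force search (feasible only for small targets): a 4-clique from the auxiliary graph is matched against target 4-cliques by node degrees and pairwise common-neighbor counts, each allowed to differ within a relative error eps. The helper names and the exact error test are illustrative, not the authors' implementation.

```python
# Sketch of clique-based seed identification (after Narayanan & Shmatikov, 2009):
# a 4-clique from the auxiliary graph is searched for in the target graph by
# matching node degrees and pairwise common-neighbor counts within error eps.
from itertools import combinations, permutations
import networkx as nx

def within(a, b, eps):
    """Illustrative relative-error check (assumption, not the paper's exact test)."""
    return abs(a - b) <= eps * max(a, b, 1)

def match_seed_clique(G_src, clique, G_tar, eps=0.05):
    for cand in combinations(G_tar.nodes, 4):
        if not all(G_tar.has_edge(u, v) for u, v in combinations(cand, 2)):
            continue  # candidate is not a clique in the target graph
        for perm in permutations(cand):
            deg_ok = all(within(G_src.degree(s), G_tar.degree(t), eps)
                         for s, t in zip(clique, perm))
            cn_ok = all(within(len(list(nx.common_neighbors(G_src, a, b))),
                               len(list(nx.common_neighbors(G_tar, x, y))), eps)
                        for (a, b), (x, y) in zip(combinations(clique, 2),
                                                  combinations(perm, 2)))
            if deg_ok and cn_ok:
                return dict(zip(clique, perm))  # candidate seed mapping
    return None
```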

Global re-identification (2). LiveJournal subgraph, ca. 10k nodes; seeding works better in dense networks. Compared seed selection methods: NS09 (4-cliques on high-degree nodes), SNG (random; not even displayed!), IdSep (random from the top 25% by degree). A strong start even with one clique; the NS09 seeding is surprisingly weak here, but efficient in large networks! (Under publication.)

Details on the propagation phase. For each v_i ∈ V_SRC, repeat until convergence:
1. Take the already identified neighbors {v_1, ..., v_k} ⊆ V_SRC, mapped to {v'_1, ..., v'_k} ⊆ V_TAR, e.g., μ(v_1) = v'_1.
a. Select the candidate set N = {v'_u1, ..., v'_um} ⊆ V_TAR from nbrs({v'_1, ..., v'_k}).
b. Calculate the scores S = {s_u1, ..., s_um}.
2. If v_i has an outstanding candidate in S, do a reverse match check by swapping the datasets G_TAR and G_SRC (and the mapping).
3. If v_i is also the reverse best match, set μ(v_i) to that candidate.
[Slide figure: small example graphs G_SRC and G_TAR illustrating one propagation step.]

Details on the propagation phase (2). Score calculation. Cosine similarity: CosSim(v_i, v_j) = |V_i ∩ V_j| / sqrt(|V_i| · |V_j|), where V_i and V_j are the (mapped) neighbor sets. Eccentricity check: Eccentricity(S) = (max(S) - max(S \ {max(S)})) / σ(S), i.e., the gap between the best and the second-best score measured in standard deviations of S. [Slide figure: the same example graphs G_SRC and G_TAR.]
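Putting the two slides together, here is a simplified sketch of the propagation loop with the cosine-similarity score, the eccentricity test and the reverse-match check. It assumes graphs given as networkx graphs or dicts of neighbor sets; the threshold theta and all helper names are assumptions, not the paper's exact parameters or code.

```python
# Sketch of an NS09-style propagation step: score candidate targets by the
# cosine similarity of already-mapped neighbors, accept only outstanding
# candidates (eccentricity test) that also win the reverse match.
import math
import statistics

def cos_sim(a, b):
    """|intersection| / sqrt(|A| * |B|) over two node sets."""
    if not a or not b:
        return 0.0
    return len(a & b) / math.sqrt(len(a) * len(b))

def eccentricity(scores):
    """Gap between best and second-best score, in standard deviations."""
    if len(scores) < 2:
        return float("inf")
    top, second = sorted(scores, reverse=True)[:2]
    sd = statistics.pstdev(scores)
    return (top - second) / sd if sd > 0 else float("inf")

def best_match(v, G_src, G_tar, mapping, theta=0.5):
    """Best target candidate for source node v, or None if no outstanding one."""
    mapped_nbrs = {mapping[u] for u in G_src[v] if u in mapping}
    if not mapped_nbrs:
        return None
    # candidates: target nodes adjacent to the image of at least one mapped neighbor
    candidates = {w for m in mapped_nbrs for w in G_tar[m]}
    scores = {c: cos_sim(mapped_nbrs, set(G_tar[c])) for c in candidates}
    if not scores or eccentricity(list(scores.values())) < theta:
        return None
    return max(scores, key=scores.get)

def propagate(G_src, G_tar, seed_mapping, theta=0.5):
    mapping = dict(seed_mapping)
    reverse = {v: k for k, v in mapping.items()}
    changed = True
    while changed:
        changed = False
        for v in G_src:
            if v in mapping:
                continue
            cand = best_match(v, G_src, G_tar, mapping, theta)
            if cand is None or cand in reverse:
                continue
            # reverse match check: swap the roles of the two graphs and the mapping
            if best_match(cand, G_tar, G_src, reverse, theta) == v:
                mapping[v], reverse[cand] = cand, v
                changed = True
    return mapping
```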

MEASURING ANONYMITY

Measuring anonymity by node degree (with an error factor). Anonymity sets by degree d(v_i): {A, C}, {B, D, E, G}, {F, H}. Let the anonymity of v_i be A(v_i) = 1 - P(v_i), where P(v_i) is the probability of singling out v_i within its anonymity set.
node   d(v_i)  A(v_i)
Alice  1       0.50
Bob    4       0.75
Carol  1       0.50
Dave   4       0.75
Ed     4       0.75
Fred   2       0.50
Greg   4       0.75
Harry  2       0.50
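A small sketch of this degree-based measure; the toy graph below is only chosen to be consistent with the slide's degree table (the slide's actual edges are not recoverable from the transcript).

```python
# Minimal sketch of the degree-based anonymity measure from the slide:
# nodes with the same degree form an anonymity set, and A(v) = 1 - 1/|set|.
from collections import defaultdict

def degree_anonymity(graph):
    """graph: dict mapping node -> set of neighbors (undirected)."""
    sets_by_degree = defaultdict(set)
    for node, nbrs in graph.items():
        sets_by_degree[len(nbrs)].add(node)
    return {node: 1 - 1 / len(sets_by_degree[len(nbrs)])
            for node, nbrs in graph.items()}

g = {  # toy graph with the slide's degree distribution (edges are assumed)
    "Alice": {"Bob"},
    "Carol": {"Dave"},
    "Bob":   {"Alice", "Ed", "Greg", "Fred"},
    "Dave":  {"Carol", "Ed", "Greg", "Harry"},
    "Ed":    {"Bob", "Dave", "Greg", "Fred"},
    "Greg":  {"Bob", "Dave", "Ed", "Harry"},
    "Fred":  {"Bob", "Ed"},
    "Harry": {"Dave", "Greg"},
}
print(degree_anonymity(g))  # Alice/Carol/Fred/Harry: 0.50, Bob/Dave/Ed/Greg: 0.75
```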

Local Topological Anonymity. Principle: how structurally hidden v_i is in its 2-neighborhood, i.e., how similar v_i is to its neighbors of neighbors. Three metrics are proposed (Gulyás & Imre, 2012), all built from a similarity measure sim(v_i, v_k) evaluated over the 2-neighborhood V_i²: LTA_A averages sim(v_i, v_k) over all v_k ∈ V_i², while LTA_B and LTA_C use different normalizations (e.g., by the maximum degree within V_i²). Proposed similarity: CosSim(v_i, v_k), matching the NS09 attack; it is replaceable with respect to the given attack.
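Because the transcript garbles the exact LTA formulas, the sketch below implements only the LTA_A-style variant as it is described in words: the average cosine similarity between v_i and the nodes exactly two hops away. Treat it as an interpretation under that assumption, not the paper's definition.

```python
# Sketch of an LTA_A-style metric: how structurally hidden v is among the
# nodes of its 2-neighborhood, using cosine similarity of neighbor sets
# (the similarity also used by the NS09 attack).
import math

def cos_sim(a: set, b: set) -> float:
    return len(a & b) / math.sqrt(len(a) * len(b)) if a and b else 0.0

def two_neighborhood(graph, v):
    """Nodes exactly two hops from v (neighbors of neighbors, excluding v and its neighbors)."""
    nbrs = graph[v]
    return {w for u in nbrs for w in graph[u]} - nbrs - {v}

def lta_a(graph, v):
    hood = two_neighborhood(graph, v)
    if not hood:
        return 0.0
    return sum(cos_sim(graph[v], graph[w]) for w in hood) / len(hood)
```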

Simulation evaluation of LTA. Scoring: +1 for a true positive, -1 for a false positive.

Simulation evaluation of LTA (2) 83.9% of the overlapping nodes!

Simulation evaluation of LTA (3) avg(LTA_A) = -0.42133, avg(LTA_B) = -0.34466, avg(LTA_C) = -0.26988.

TACKLING STRUCTURAL DE-ANONYMIZATION

The friend-in-the-middle model (Beato et al., 2013). Basic principle: some nodes act as a proxy for their friends, hiding edges. Cooperative: users choose the proxy nodes (both must be trusted). Results: shows that 10% of users are enough (perhaps fewer), on a quite sparse network (easier to defend); requires cooperation: 3 nodes need to agree per edge.

Idea: using identity management? Clauß et al., 2005

Idea: using identity management? (2) Auxiliary information, G_src (a public crawl, e.g., Flickr); anonymized graph, G_tar (anonymized export, e.g., Twitter). Identity separation. Gulyás & Imre, 2013

How to model identity separation? Modeled as splitting vertices into partial identities. Number of new identities: a random variable Y. Edge sorting distribution: can connections overlap? Can connections be deleted? (No real-life data to fit these choices.) The four resulting model variants:
              Edge deletion    No edge deletion
Overlap       realistic model  worst model
No overlap    best model       basic model
Gulyás & Imre, 2011

Measuring sensitivity to the number of identities. Basic model with uniform edge sorting probability: creating Y = 2 new vertices from one, and sorting each edge with probability ½ to each. Recall rate: percentage of correctly re-identified nodes.
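A minimal sketch of this basic model on a dict-of-sets graph: the chosen node is split into Y partial identities and each incident edge is assigned to one of them uniformly at random (no deletion, no overlap). The function and label names are arbitrary.

```python
# Sketch of the "basic model" of identity separation: split node v into Y
# partial identities and sort each of its edges uniformly to one of them.
import random

def separate_identity(graph, v, Y=2, rng=random):
    """graph: dict node -> set of neighbors (undirected); returns a modified copy."""
    g = {u: set(nbrs) for u, nbrs in graph.items()}
    nbrs = g.pop(v)
    identities = [f"{v}#{i}" for i in range(Y)]
    for ident in identities:
        g[ident] = set()
    for u in nbrs:
        g[u].discard(v)
        ident = rng.choice(identities)   # uniform edge sorting, no deletion/overlap
        g[ident].add(u)
        g[u].add(ident)
    return g
```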

Measuring sensitivity to the number of identities (2). Basic model with uniform edge sorting probability. Disclosure rate: what the attacker learns (i.e., the fraction of edges currently recovered), measured both over all nodes and over nodes with identity separation.

Measuring sensitivity to the number of identities (3). Interesting finding: for Y = 2 only, nodes with identity separation had a higher recall rate than others, caused by using non-identity-separated nodes for seeding. Conclusion: the natural choice (Y = 2) has bad implications on privacy; use Y > 2.

Measuring sensitivity to deletion of edges. Models used: realistic model with minimal deletion / random deletion; basic model with random deletion.

In search of privacy-enhancing methods. Tackling the attack at the network level? Best model, Y = 5, random deletion.

In search of privacy-enhancing methods (2). Tackling the attack at the network level? What if only a few users care? Best model, Y = 5, random deletion.

Multiple models present in parallel. ~33%, with Y = 2, in the LiveJournal network.

Using decoy identities. Goal: to control what the adversary can discover. Decoy identity: a public profile with most of the connections (90%). Hidden identity: has few connections (20%, with 10% overlap). Measured over all nodes and only on hidden nodes.

What is next? Cooperative identity separation: goal of a few nodes defeating the attack completely, lowering the number of nodes that need to be involved; global and local strategies; evaluation with different attacker settings. Enhancing the decoy method: proposing user strategies for different attackers; provable privacy under different criteria. What strategies to use if identity separation is enabled in both networks?

Conclusions. Technology providing vast amounts of data is here, but we are not ready. How do we detect privacy leakages? How do we design privacy-friendly services (and how do we convince business people to do so)? How do we protect privacy? How can we evaluate protection schemes? ... Can we handle big data technology somehow, or have we already passed the point of safe return?

Questions? THANK YOU FOR YOUR ATTENTION! Gábor György Gulyás, Dept. Networked Systems and Services. gulyas@crysys.hu http://gulyas.info // @GulyasGG

References
Latanya Sweeney: Uniqueness of Simple Demographics in the US Population. LIDAP-WP4, Carnegie Mellon University, Laboratory for International Data Privacy, Pittsburgh, PA, 2000.
Philippe Golle: Revisiting the Uniqueness of Simple Demographics in the US Population. Proceedings of the 5th ACM Workshop on Privacy in the Electronic Society, ACM, 2006.
Philippe Golle, Kurt Partridge: On the Anonymity of Home/Work Location Pairs. Pervasive Computing, Springer Berlin Heidelberg, 2009, pp. 390-397.
Arvind Narayanan, Vitaly Shmatikov: Robust De-anonymization of Large Sparse Datasets. 2008 IEEE Symposium on Security and Privacy, IEEE, 2008.
Arvind Narayanan, Vitaly Shmatikov: De-anonymizing Social Networks. 2009 30th IEEE Symposium on Security and Privacy, IEEE, 2009.
Gilbert Wondracek et al.: A Practical Attack to De-anonymize Social Network Users. 2010 IEEE Symposium on Security and Privacy, IEEE, 2010.

References (2)
Peter Eckersley: How Unique Is Your Web Browser? Privacy Enhancing Technologies, Springer Berlin Heidelberg, 2010.
Károly Boda et al.: User Tracking on the Web via Cross-Browser Fingerprinting. Information Security Technology for Applications, Springer Berlin Heidelberg, 2012, pp. 31-46.
Mudhakar Srivatsa, Mike Hicks: Deanonymizing Mobility Traces: Using Social Network as a Side-Channel. Proceedings of the 2012 ACM Conference on Computer and Communications Security, ACM, 2012.
Arvind Narayanan et al.: On the Feasibility of Internet-Scale Author Identification. 2012 IEEE Symposium on Security and Privacy, IEEE, 2012.
Filipe Beato, Mauro Conti, Bart Preneel: Friend in the Middle (FiM): Tackling De-Anonymization in Social Networks. 2013.
Gábor György Gulyás, Sándor Imre: Hiding Information in Social Networks from De-anonymization Attacks by Using Identity Separation. 2013.

References (3)
Clauß et al.: Privacy Enhancing Identity Management: Protection Against Re-identification and Profiling. 2005.
Gábor György Gulyás, Sándor Imre: Analysis of Identity Separation Against a Passive Clique-Based Deanonymization Attack. 2011.