PRIVACY IN LARGE DATASETS 2013.10.14. Budapest Gábor György Gulyás Dept. Networked Systems and Services gulyas@crysys.hu http://gulyas.info // @GulyasGG
MOTIVATION: WHY DO WE NEED PRIVACY AND ANONYMITY? 2
We live in information societies Development of infocommunication technologies Creating and reshaping information society Government: efficiency, ease of communication, greater control,... Commercial parties: new technology & data, great opportunities,... Society itself: comfort, changes in social interaction,... 3
Anyway, who cares? Who would abuse my privacy? I have nothing to hide! Think again. Isn't it all already lost? Nope. Effect on a personal level? Freedom of speech? 4
What can we do? Legal actions! Bit laggy... Surveillance societies? Social activities! Slow, but usually OK: raise awareness, demystify false beliefs, act as a rational consumer, boycott some services,... Think of the other possibilities: in contrast to how tech people think, mass violation of privacy is basically a non-technical problem. Use privacy technology! Symptomatic treatment. 5
PRIVACY IN LARGE DATASETS http://www.flickr.com/photos/t_gregorius/5839399412/ 6
Natural sources of big data in (social) technology (e.g.) Social networks & media Recommender systems Web tracking dbs (profiling) Doc indexing & search? $$ Predicting user behavior Exposing trends 7
How identifiable are we? Sweeney, 2000 (1990 census data): 87% of the US population (216 million of 248 million) is identifiable by {5-digit ZIP, gender, date of birth}. Revisiting study (Golle, 2006): 64% of the US population is identifiable by {ZIP code, gender, date of birth}. 8
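The uniqueness measurements above reduce to a simple computation: group the population by its quasi-identifier tuple and count how many records are alone in their group. A minimal sketch on hypothetical toy records (the data below is illustrative, not the census figures from the slide):

```python
from collections import Counter

# Hypothetical population records: each tuple is a quasi-identifier
# (ZIP, gender, date of birth).
population = [
    ("1117", "F", "1990-01-01"),
    ("1117", "F", "1990-01-01"),   # shares its quasi-identifier with the row above
    ("1111", "M", "1985-06-30"),
    ("1111", "F", "1985-06-30"),
    ("1052", "M", "1970-12-24"),
]

def unique_fraction(records):
    """Fraction of records whose quasi-identifier is unique in the dataset."""
    counts = Counter(records)
    unique = sum(1 for r in records if counts[r] == 1)
    return unique / len(records)

print(unique_fraction(population))  # 3 of 5 records are unique -> 0.6
```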
How identifiable are we? (2) Work-home location pairs as identifying information (US): avg. 1,500 persons per location cell; 5% are uniquely identifiable; avg. anonymity set size is ca. 20. Location-based services?! Golle & Partridge, 2009 9
How identifiable are we? (3) Anonymized Netflix dataset vs. public IMDb ratings: rarely used features are identifying; only 8 ratings identify 99% of users, even if 2 ratings are erroneous and dates are only known within a 2-week timeframe. Using big data, things can get worse. Narayanan & Shmatikov, 2008 10
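The core of the Netflix attack is a fuzzy match between an anonymized rating profile and public profiles: ratings may be slightly off and dates only approximately known. A hedged sketch on hypothetical mini-datasets (the data, names, and the simple counting score are my own; the real Narayanan-Shmatikov scoring also weights rare movies more heavily):

```python
from datetime import date

# Hypothetical datasets: {user: {movie: (rating, date)}}
anon = {"u1": {"A": (5, date(2005, 3, 1)), "B": (2, date(2005, 3, 5)),
               "C": (4, date(2005, 4, 1))}}
public = {"alice": {"A": (5, date(2005, 3, 3)), "B": (2, date(2005, 3, 4)),
                    "C": (4, date(2005, 4, 2))},
          "bob":   {"A": (1, date(2004, 1, 1))}}

def match_score(r1, r2, max_days=14):
    """Count ratings that roughly agree in value and fall within a 2-week
    window, mirroring the slide's error tolerance."""
    score = 0
    for movie, (val, d) in r1.items():
        if movie in r2:
            val2, d2 = r2[movie]
            if abs(val - val2) <= 1 and abs((d - d2).days) <= max_days:
                score += 1
    return score

best = max(public, key=lambda u: match_score(anon["u1"], public[u]))
print(best)  # "alice": all three ratings match within tolerance
```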
How identifiable are we? (4) An experiment on Xing indicates that group memberships are identifying: ~8M users at the time; ca. 42% uniquely identified; extremely small anonymity sets: at most 2,912 collisions for 90% of users! (Group memberships are probed via browser history: visited vs. unvisited links.) Wondracek et al., 2010 11
How identifiable are we? (5) Fingerprinting evolves: 2010: browser fingerprint (e.g., accuracy: 94.2%); 2011: system fingerprint (works well on Windows); 2012: connecting personal devices; future: biometric fingerprinting? Billions of (device) fingerprints in databases, based on simple characteristics (e.g., fonts: Arial, sans-serif, Comic Sans,...; browser: Firefox 23.0; timezone: -60; screen: 1280x1024). Eckersley, 2010; Boda et al., 2011 12
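A browser fingerprint is essentially a hash over such simple characteristics. A minimal sketch (the attribute set below is a tiny assumed subset; real fingerprinters collect many more features, e.g., plugin lists and canvas rendering):

```python
import hashlib

# Hypothetical attribute set, echoing the examples on the slide.
attributes = {
    "user_agent": "Firefox 23.0",
    "screen": "1280x1024",
    "timezone_offset": "-60",
    "fonts": "Arial,sans-serif,Comic Sans",
}

def fingerprint(attrs):
    """Concatenate attributes in a fixed key order and hash the result,
    so the same configuration always yields the same identifier."""
    canonical = "|".join(f"{k}={attrs[k]}" for k in sorted(attrs))
    return hashlib.sha256(canonical.encode()).hexdigest()

fp = fingerprint(attributes)
print(fp[:16])  # stable per configuration; changes if any attribute changes
```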
How identifiable are we? (6) Unstructured data is also at stake! Writing style is structured: e.g., inspecting the relative frequency of "since" vs. "because"; many such features enable stylometric profiling. Results on searching for the author of a few posts (100,000 blogs, cross-context validation): 20% correct identification (3 posts). Improvements: manual inspection of the top 20 results gives a 35% success rate; at lower recall (10%), 80%+ precision; 30-35% correct identification with 20 posts/author on avg. Narayanan et al., 2012 13
How identifiable are we? (7) Network alignment of temporal location traces and social networks achieves ca. 80% TPR. Srivatsa & Hicks, 2012 14
Sum of these problems Basic problem: a population of 7 billion means ~33 bits of information suffice to identify someone. Low similarity of items; heavy-tailed distribution of used attributes; easy feature selection! K-anonymity fails because of sparsity. Narayanan & Shmatikov, 2008 15
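The "33 bits" figure follows directly from the population size: singling out one person among ~7 billion requires log2(7 × 10^9) bits. A quick check:

```python
import math

# Bits of information needed to uniquely identify one individual
# in a population of 7 billion.
bits = math.log2(7_000_000_000)
print(round(bits, 1))  # -> 32.7, i.e., about 33 bits
```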
Sum of these problems (2) Pros: publishing (anonymized) databases is good for research; we have types and sizes of data never seen before; we have some ideas, but are not there yet (privacy vs. usability). Cons: breakability of anonymization schemes? Provability? What about wholesale surveillance? One should prepare for attackers with strong auxiliary data! 16
STRUCTURAL DE-ANONYMIZATION IN SOCIAL NETWORKS 17
Data perturbation and sanitization (illustration). Original data: Emőke (F, 14), Bea (F, 45), Hajni (F, 16), Bence (M, 48), Gergő (M, 17), Zalán (M, 45), Kati (F, 41). Sanitized data: names replaced by pseudonyms: #135 (F, 14), #16 (F, 45), #12 (F, 16), #1 (M, 48), #7 (M, 17), #97 (M, 45), #20 (F, 41). Perturbed data: the same pseudonymous records after attribute perturbation. 18
Attacker model Auxiliary information, G src (a public crawl, e.g., Flickr); anonymized graph, G tar (anonymized export, e.g., Twitter) with sensitive attributes (e.g., Democratic / Republican). 19
Attacker model (2) Auxiliary information, G src (a public crawl, e.g., Flickr); anonymized graph, G tar (anonymized export, e.g., Twitter). Global match: 1. Seed identification 2. Propagation. Relative match (local re-identification). Narayanan & Shmatikov, 2009 20
Large-scale re-identification Underlying concepts work on large social networks. Auxiliary data: Flickr (3.3M nodes, 53M edges). Target (anonymized) data: Twitter (224k nodes, 8.5M edges). Ground truth: 27k nodes (name/username/location). Results: 30% TP, only 12% FP (initialized with 150 high-degree seeds). 21
Global re-identification (a.k.a. seed identification) Originally: 4-cliques with intact structure, matched by node degree and the common-neighbor counts of members, with error allowed within a factor of 1±ε (e.g., ε=0.05). 22
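The error-factor check at the heart of seed matching can be sketched as follows (a simplified sketch with a hypothetical helper name; the full NS09 seed search also compares common-neighbor counts and enumerates candidate 4-cliques):

```python
# A source-side clique is matched to a target-side candidate clique only if
# each member's degree agrees within a multiplicative error factor eps.
def degrees_match(src_degs, tar_degs, eps=0.05):
    """src_degs/tar_degs: degree sequences of the two cliques, member by member."""
    return all(abs(d1 - d2) <= eps * d1 for d1, d2 in zip(src_degs, tar_degs))

print(degrees_match([100, 80, 60, 40], [102, 79, 61, 41]))  # True: all within 5%
print(degrees_match([100, 80, 60, 40], [120, 79, 61, 41]))  # False: 120 is 20% off
```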
Global re-identification (2) LiveJournal subgraph, ca. 10k nodes. NS09: 4-cliques on high-degree nodes; better in dense networks. SNG: random (not even displayed!). IdSep: random from the top 25% by degree; strong start even with one clique. Surprisingly weak! But efficient in large networks! Under submission! 23
Details on the propagation phase For each v i ∈ V SRC, until convergence: 1. Take the identified neighbors {v 1,...,v k} ⊆ V SRC, mapped to {v' 1,...,v' k} ⊆ V TAR, e.g., μ(v 1)=v' 1. a. Select candidates N={v u1,...,v um} ⊆ V TAR from nbrs({v' 1,...,v' k}). b. Calculate scores S={s u1,...,s um}. 2. If v i has an outstanding candidate in S, do a reverse match check by swapping the datasets G TAR and G SRC (and the mapping). 3. If v i is also the reverse best match, set μ(v i)=v' i. 24
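The propagation phase can be sketched in Python as follows. This is a simplified, hedged sketch: the outstanding-candidate (eccentricity) test and the reverse-match check of the full NS09 algorithm are replaced by a plain score threshold, and the helper names (`match_score`, `propagate`) are my own:

```python
# Graphs are adjacency dicts {node: set(neighbors)}; mu maps source -> target.

def match_score(v, c, mu, g_src, g_tar):
    """Fraction of v's already-mapped neighbors whose images are adjacent to c."""
    mapped = [n for n in g_src[v] if n in mu]
    if not mapped:
        return 0.0
    return sum(1 for n in mapped if mu[n] in g_tar[c]) / len(mapped)

def propagate(g_src, g_tar, seeds, threshold=0.5):
    """Greedily extend the seed mapping until no more nodes can be mapped."""
    mu = dict(seeds)
    changed = True
    while changed:
        changed = False
        for v in g_src:
            if v in mu:
                continue
            mapped = [n for n in g_src[v] if n in mu]
            if not mapped:
                continue
            # Candidates: target-side neighbors of the mapped neighbors' images
            # that are not already used as an image themselves.
            cands = set().union(*(g_tar[mu[n]] for n in mapped)) - set(mu.values())
            if not cands:
                continue
            best = max(cands, key=lambda c: match_score(v, c, mu, g_src, g_tar))
            if match_score(v, best, mu, g_src, g_tar) >= threshold:
                mu[v] = best
                changed = True
    return mu

# Toy isomorphic graphs with two seed nodes already mapped.
g_src = {"A": {"B", "C", "D"}, "B": {"A", "C"}, "C": {"A", "B"}, "D": {"A"}}
g_tar = {1: {2, 3, 4}, 2: {1, 3}, 3: {1, 2}, 4: {1}}
print(propagate(g_src, g_tar, {"A": 1, "B": 2}))  # maps C -> 3 and D -> 4
```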
Details on the propagation phase (2) Score calculation uses cosine similarity on neighbor sets: CosSim(V i, V j) = |V i ∩ V j| / sqrt(|V i| · |V j|), and an eccentricity check on the score set S that tests whether the best candidate stands out: Eccentricity(S) = (max(S) − max(S \ {max(S)})) / σ(S). 25
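Both quantities on this slide are a few lines of Python (a hedged sketch; set-based cosine similarity, and eccentricity normalized by the standard deviation of the score set, as in NS09):

```python
import statistics

def cos_sim(set_i, set_j):
    """Cosine similarity of two neighbor sets: |A ∩ B| / sqrt(|A| * |B|)."""
    if not set_i or not set_j:
        return 0.0
    return len(set_i & set_j) / (len(set_i) * len(set_j)) ** 0.5

def eccentricity(scores):
    """(max - second max) / stddev: how much the best candidate stands out."""
    s = sorted(scores, reverse=True)
    sd = statistics.pstdev(scores)
    if sd == 0:
        return 0.0
    return (s[0] - s[1]) / sd

print(round(cos_sim({1, 2, 3}, {2, 3, 4}), 3))  # 2/3 -> 0.667
print(eccentricity([0.9, 0.2, 0.1]) > 1)        # True: one clear best match
```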
MEASURING ANONYMITY 26
Measuring anonymity by node degree Anonymity sets by degree: d=1: {A, C}; d=4: {B, D, E, G}; d=2: {F, H}. Let the anonymity of v i be A(v i) = 1 − P(v i), where P(v i) = 1/|anonymity set of v i|. Example: Alice: d=1, A=0.50; Bob: d=4, A=0.75; Carol: d=1, A=0.50; Dave: d=4, A=0.75; Ed: d=4, A=0.75; Fred: d=2, A=0.50; Greg: d=4, A=0.75; Harry: d=2, A=0.50. 27
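The degree-based measure reduces to grouping nodes by degree and computing 1 − 1/|group|. A sketch reproducing the slide's example values (the function name is my own):

```python
from collections import defaultdict

# Degrees of the example nodes from the slide.
degrees = {"Alice": 1, "Bob": 4, "Carol": 1, "Dave": 4,
           "Ed": 4, "Fred": 2, "Greg": 4, "Harry": 2}

def degree_anonymity(degs):
    """A(v) = 1 - 1/|anonymity set of v|, grouping nodes of equal degree."""
    sets = defaultdict(list)
    for v, d in degs.items():
        sets[d].append(v)
    return {v: 1 - 1 / len(sets[d]) for v, d in degs.items()}

anonymity = degree_anonymity(degrees)
print(anonymity["Bob"])    # 0.75: four nodes share degree 4
print(anonymity["Alice"])  # 0.5: only Alice and Carol have degree 1
```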
Local Topological Anonymity Principle: how well v i is structurally hidden in its 2-neighborhood, i.e., how similar v i is to its neighbors of neighbors (V i 2). Proposed metrics: LTA A(v i) = Σ over v k ∈ V i 2 of sim(v i, v k) / |V i 2|; LTA B and LTA C are variants with different normalization (e.g., by the maximum degree found in V i 2). Proposed similarity: CosSim(v i, v k) (matching the NS09 attack); replaceable w.r.t. the given attack. Gulyas & Imre, 2012 28
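The LTA_A variant can be sketched as the average structural similarity between a node and its 2-hop neighbors (a hedged sketch of the averaged metric; function names are my own). A star graph illustrates the intuition: the unique center is poorly hidden, the interchangeable leaves are well hidden:

```python
def cos_sim(a, b):
    """Set-based cosine similarity of two neighbor sets."""
    return len(a & b) / (len(a) * len(b)) ** 0.5 if a and b else 0.0

def lta_a(g, v):
    """LTA_A sketch: average similarity between v and the nodes two hops
    away (its neighbors' neighbors, excluding v itself)."""
    second = set().union(*(g[n] for n in g[v])) - {v}
    if not second:
        return 0.0
    return sum(cos_sim(g[v], g[k]) for k in second) / len(second)

# Toy star graph: "c" is the center, "a"/"b"/"d" are leaves.
g = {"c": {"a", "b", "d"}, "a": {"c"}, "b": {"c"}, "d": {"c"}}
print(lta_a(g, "a"))  # 1.0: each leaf looks exactly like the other leaves
print(lta_a(g, "c"))  # 0.0: the center has no node it can hide among
```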
Simulation evaluation of LTA Scoring: +1 for a TP -1 for a FP 29
Simulation evaluation of LTA (2) 83.9% of the overlapping nodes! 30
Simulation evaluation of LTA (3) avg(LTA A) = −0.42133; avg(LTA B) = −0.34466; avg(LTA C) = −0.26988. 31
TACKLING STRUCTURAL DE-ANONYMIZATION 32
The friend-in-the-middle model Beato et al., 2013. Basic principle: some nodes act as a proxy (hiding edges). Cooperative: users choose proxy nodes (both trusted). Results: shows that 10% of users (perhaps fewer) acting as proxies are enough, on a quite sparse network (easier to defend). Requires cooperation: 3 nodes need to agree per edge. 33
Idea: using identity management? Clauß et al., 2005 34
Idea: using identity management? (2) Auxiliary information, G src (a public crawl, e.g., Flickr); anonymized graph, G tar (anonymized export, e.g., Twitter); identity separation. Gulyas & Imre, 2013 35
How to model identity separation? Modeled as splitting vertices into partial identities. Number of new identities: random variable Y. Edge sorting distribution: can connections overlap? Can connections be deleted? Variants: overlap + edge deletion = realistic model; overlap + no deletion = worst model; no overlap + edge deletion = best model; no overlap + no deletion = basic model. No real-life data. Gulyas & Imre, 2011 36
Measuring sensitivity to the number of identities Basic model with uniform edge sorting probability Creating Y=2 new vertices from one, and sorting edges with ½ probability to each. Recall rate: percent of correctly re-identified nodes. 37
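The basic model with uniform edge sorting can be sketched as follows (a hedged sketch; the function name is my own; no overlap and no deletion, as on this slide: each edge goes to exactly one of the Y new partial identities):

```python
import random

def split_identity(g, v, y=2, rng=random):
    """Basic-model sketch: replace v with y partial identities v_0..v_{y-1};
    each edge of v is assigned to exactly one identity uniformly at random."""
    parts = [f"{v}_{i}" for i in range(y)]
    g = {u: set(ns) for u, ns in g.items()}  # work on a copy
    neighbors = g.pop(v)
    for p in parts:
        g[p] = set()
    for n in neighbors:
        g[n].discard(v)
        p = rng.choice(parts)  # uniform edge sorting, probability 1/y each
        g[p].add(n)
        g[n].add(p)
    return g

random.seed(0)
g = {"v": {"a", "b", "c", "d"}, "a": {"v"}, "b": {"v"}, "c": {"v"}, "d": {"v"}}
g2 = split_identity(g, "v")
print(sorted(g2["v_0"] | g2["v_1"]))  # -> ['a', 'b', 'c', 'd']: no edge is lost
```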
Measuring sensitivity to the number of identities (2) Basic model with uniform edge sorting probability. Disclosure rate: what the attacker learns (i.e., the fraction of edges recovered). Measured over all nodes, and over nodes with identity separation only! 38
Measuring sensitivity to the number of identities (3) Interesting finding (only for Y=2): nodes with identity separation had a higher recall rate than others, caused by using non-idsep nodes for seeding. Conclusion: the natural choice (Y=2) has bad implications on privacy; use Y > 2. 39
Measuring sensitivity to deletion of edges Used models: realistic model with minimal deletion / random deletion basic model with random deletion 40
In search of privacy-enhancing methods Tackling the attack on a network level? Best model, Y=5, random deletion 41
In search of privacy-enhancing methods (2) Tackling the attack on a network level? What if only a few users care? Best model, Y=5, random deletion 42
Multiple models present in parallel: ~33% with Y=2 in the LiveJournal network. 43
Using decoy identities Goal: to control what the adversary can discover. Decoy identity: a public profile with most connections (90%). Hidden identity: having few connections (20%, with 10% overlap). Measured over all nodes, and only on hidden nodes! 44
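The decoy split can be sketched as sampling two edge subsets of the stated sizes (a hedged sketch; the function name and the independent sampling of the two subsets are my own simplifications, so the decoy/hidden overlap is random rather than fixed at 10%):

```python
import random

def make_decoy(neighbors, decoy_frac=0.9, hidden_frac=0.2, rng=random):
    """Decoy sketch: the public decoy identity keeps most connections,
    the hidden identity a small, possibly overlapping subset."""
    ns = sorted(neighbors)  # fixed order so sampling is reproducible
    decoy = set(rng.sample(ns, round(decoy_frac * len(ns))))
    hidden = set(rng.sample(ns, round(hidden_frac * len(ns))))
    return decoy, hidden

random.seed(1)
decoy, hidden = make_decoy(set(range(10)))
print(len(decoy), len(hidden))  # 9 2
```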
What is next? Cooperative identity separation? Goal: few nodes defeating the attack completely Lower number of nodes to be involved Global and local strategies Evaluation with different attacker settings Enhancing the decoy method Proposing certain user strategies for different attackers Provable privacy under different criteria What strategies to use if identity separation is enabled in both networks? 45
Conclusions Technology providing vast amounts of data is here, but we are not ready. How do we detect privacy leakages? How do we design privacy-friendly services (and how do we convince businessmen to do so)? How do we protect privacy? How can we evaluate protection schemes?... Can we handle big data technology somehow? Or have we already passed the point of safe return? 46
Questions? THANK YOU FOR YOUR ATTENTION! Gábor György Gulyás Dept. Networked Systems and Services gulyas@crysys.hu http://gulyas.info // @GulyasGG 47
References Latanya Sweeney: Uniqueness of simple demographics in the US population. LIDAP-WP4. Carnegie Mellon University, Laboratory for International Data Privacy, Pittsburgh, PA (2000). Philippe Golle: Revisiting the uniqueness of simple demographics in the US population. Proceedings of the 5th ACM workshop on Privacy in electronic society. ACM, 2006. Philippe Golle, Kurt Partridge: On the anonymity of home/work location pairs. Pervasive Computing. Springer Berlin Heidelberg, 2009. 390-397. Arvind Narayanan, Vitaly Shmatikov: Robust de-anonymization of large sparse datasets. Security and Privacy, 2008. SP 2008. IEEE Symposium on. IEEE, 2008. Arvind Narayanan, Vitaly Shmatikov: De-anonymizing social networks. Security and Privacy, 2009 30th IEEE Symposium on. IEEE, 2009. Gilbert Wondracek et al.: A practical attack to de-anonymize social network users. Security and Privacy (SP), 2010 IEEE Symposium on. IEEE, 2010. 48
References (2) Peter Eckersley: How unique is your web browser? Privacy Enhancing Technologies. Springer Berlin Heidelberg, 2010. Károly Boda et al.: User tracking on the Web via cross-browser fingerprinting. Information Security Technology for Applications. Springer Berlin Heidelberg, 2012. 31-46. Mudhakar Srivatsa, Mike Hicks: Deanonymizing mobility traces: Using social network as a side-channel. Proceedings of the 2012 ACM conference on Computer and communications security. ACM, 2012. Arvind Narayanan et al.: On the feasibility of internet-scale author identification. Security and Privacy (SP), 2012 IEEE Symposium on. IEEE, 2012. Filipe Beato, Mauro Conti, Bart Preneel: Friend in the Middle (FiM): Tackling De-Anonymization in Social Networks, 2013. Gábor György Gulyás, Sándor Imre: Hiding Information in Social Networks from De-anonymization Attacks by Using Identity Separation, 2013. 49
References (3) Clauß et al.: Privacy enhancing identity management: protection against re-identification and profiling, 2005. Gábor György Gulyás, Sándor Imre: Analysis of Identity Separation Against a Passive Clique-Based Deanonymization Attack, 2011. 50