Further Understanding the Intersection of Technology and Privacy to Ensure and Protect Client Data

Latanya Sweeney, PhD
latanya@privacy.cs.cmu.edu
privacy.cs.cmu.edu

We can provably know where domestic violence shelter clients have been without knowing who they are.

Special Thanks To
Michelle Hayes, Mary Joel Holin, Michael Roanhouse, Julie Hovden

Disclaimer
The views and opinions in this presentation represent my own and are not necessarily those of HUD, Abt, or any affiliates (or my cats or dogs). Known side effects include shock and applause.

Privacy Technology
1. Example: tracking people
2. Example: anonymizing data
3. Example: distributed surveillance
4. Example: trails of dots
5. Example: learning who you know
6. Example: identity theft
7. Example: fingerprint capture
8. Example: bio-terrorism surveillance
9. Example: privacy-preserving surveillance
10. Example: DNA privacy
11. Example: identity theft protections
12. Example: k-anonymity
13. Example: webcam surveillance
14. Example: text de-identification
15. Example: face de-identification
16. Example: fraudulent spam

Technology And Privacy
Traditional Belief System: Technology Or Privacy
This Work: Technology And Privacy
[Diagram axes: Privacy, Usefulness]

Question in this Work
How can Shelters construct UIDs without risk of re-identification while still achieving an accurate unduplicated accounting?
This talk will examine old approaches and introduce a new solution with provable properties.

Copyright (c) 1998-2006 Dr. Sweeney.

This Talk
1. The Setting
2. Technology Survey
3. A Provable Privacy Solution

The Big Goal
Perform local unduplicated accountings of homeless visit patterns without identifying clients.

Homeless Management Information Systems (HMIS)
Goal: a local unduplicated accounting of Client Visit Patterns
[Diagram: Client1 through Client5 visit Shelters 1, 2, and 3. Flow labels: Client Personal Information, Shelter Universal Data Elements, HUD Aggregate Information.]

Universal Data Elements
Unique Identifier (UID)
Name
Social Security Number
Date of Birth
Ethnicity and Race
Gender
Veteran Status
Disabling Condition
Residence Prior to Program Entry
ZIP Code of Last Permanent Address
Program Entry Date
Program Exit Date
Unique Person Identification Number
Program Identification Number
Household Identification Number

HUD Reporting (Sample)
AHAR Questions: Emergency Shelter - Individuals

Question #   Question
1            How many people used emergency shelters at time?
2            What is the distribution of family sizes using emergency shelters?
3            What are the demographics of individuals using emergency shelters?
3              distribution by gender?
3              distribution by race and ethnicity?
3              distribution by age group?
3              distribution by household size?
3              distribution by veteran status? By disabling condition?
4            What was the living arrangement the night before entering the emergency shelter?
4              within/outside geographical jurisdiction?
5            What is the distribution of the number of nights in an emergency shelter?
5              distribution by gender?
5              distribution by age group?

Intimate Stalker Threat
Knows detailed information about a targeted client.
Is highly motivated.
Can compromise a Shelter or the Planning Office to find the location of the targeted client ("re-identification").

Linking Threat
Has lots of other information that may contain the client.
Is motivated to learn information about clients generally.
Can link data on clients specifically ("re-identification").

Re-identification occurs when explicit client identifiers (e.g., name or address) can be reasonably associated with the client's de-identified information.

Linking to Re-identify Clients
[Diagram: Alice gives Personal Information to a Shelter, which holds the record 9/19/60, F, 372. External Information holds Alice, 9/19/60, 1 Main St. Matching the shared fields attaches the name Alice to the record 9/19/60, F, 372.]

Linking to Re-identify HMIS Data
[Diagram: two overlapping data sets.
HMIS Dataset only: Ethnicity, Visit date, PIN, Shelter ID
Shared fields: ZIP, Birth date, Sex
Voter List only: Name, Address, Date registered, Party affiliation, Date last voted]
L. Sweeney. Weaving technology and policy together to maintain confidentiality. Journal of Law, Medicine and Ethics. 1997, 25:98-110.

Re-identification Results
Share of the population that is uniquely identified by gender, a geographic field, and a birth field:

Geography (with Gender)   Date of Birth   Mon/Yr Birth   Year of Birth
County                    18.1%           0.04%          0.00004%
Town/Place                58.4%           3.6%           0.04%
ZIP 5-digit               87.1%           3.7%           0.04%
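
The linking attack on the slides above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the records, the field names, and the `link` helper are all hypothetical and do not come from the talk. It shows that a de-identified record is re-identified whenever its combination of birth date, sex, and ZIP matches exactly one named record in an external list.

```python
from collections import defaultdict

# Hypothetical records, invented for illustration only.
hmis_records = [
    {"pin": "A17", "shelter_id": "S1", "birth_date": "9/19/60", "sex": "F", "zip": "372"},
    {"pin": "B42", "shelter_id": "S2", "birth_date": "3/02/71", "sex": "M", "zip": "152"},
]
voter_list = [
    {"name": "Alice", "address": "1 Main St", "birth_date": "9/19/60", "sex": "F", "zip": "372"},
    {"name": "Carol", "address": "5 Oak Ave", "birth_date": "7/30/58", "sex": "F", "zip": "372"},
    {"name": "Dan", "address": "8 Elm Rd", "birth_date": "3/02/71", "sex": "M", "zip": "152"},
    {"name": "Ed", "address": "9 Elm Rd", "birth_date": "3/02/71", "sex": "M", "zip": "152"},
]

# Fields present in both data sets (the overlap in the linking diagram).
QUASI_IDENTIFIERS = ("birth_date", "sex", "zip")


def link(deidentified, identified, keys=QUASI_IDENTIFIERS):
    """Return {pin: name} for records whose shared fields match exactly one name."""
    index = defaultdict(list)
    for person in identified:
        index[tuple(person[k] for k in keys)].append(person["name"])

    result = {}
    for record in deidentified:
        matches = index.get(tuple(record[k] for k in keys), [])
        if len(matches) == 1:  # a unique match is a re-identification
            result[record["pin"]] = matches[0]
    return result


reidentified = link(hmis_records, voter_list)
print(reidentified)
```

In this toy data the first record links uniquely to one name, while the second matches two voters and so stays ambiguous. Coarsening a shared field (for example, year of birth instead of full date) makes unique matches rarer, which is the effect the results table quantifies.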

Thwarting Linking
Using re-identification analysis, we can quantify linking risks associated with data elements and make changes accordingly.
We can thwart linking.
The remainder of this talk assumes linking precautions have been taken.

Question in this Work
How can Shelters construct UIDs without risk of re-identification while still achieving an accurate unduplicated accounting?
This talk will examine old approaches and introduce a new solution with provable properties.

This Talk
1. The Setting
2. Technology Survey
3. A Provable Privacy Solution

Minimal Risk v. Provable Privacy
Minimal risk technologies use a combination of technology, practices and policy to show that there is a minimal re-identification risk.
Provable privacy technology provides guarantees against re-identification.

Minimal Risk Technologies
Homeless Shelters. Carnegie Mellon Tech Report CMU-ISRI-05-3 Pittsburgh: November 2005.

Concatenate parts of source information into a UID.
Example: Using {date of birth, gender, ZIP}
021960F372 (date of birth, sex, ZIP)
Problem: this provides explicitly sensitive source information. Need to use non-sensitive source information.

Hashing
Compute a UID based on part of the source information.
Example: Using {date of birth, gender, ZIP}
8126r29ws
986s594652
The hash function must be strong. It can be examined publicly. It is fast to compute but infeasible to reverse.

Problems with Consistent Hashing (1)
If the same hash value is broadly used with Clients, then it may lead to re-identifications through linking.
If the intimate stalker compromises a Shelter or the Planning Office, the hashed UID could be learned and used to locate the targeted Client.

Problems with Consistent Hashing (2)
If the source information is SSN or demographics, then an attacker could re-identify all UIDs by exhaustively computing all UIDs.

UIDs in the Dataset: 149875, 072, 976526

Try (Social Security Number)   Computed UID
000-00-0000                    8563
000-00-0001                    962656
000-00-0002                    072
104-51-2572                    976526
104-51-2573                    149875
999-99-9999                    (last value tried)

Problems with Consistent Hashing (3)
Size of source information matters.

bits   seconds
28     1
29     3
30     7
31     15
32     31
33     62
34     124
35     249
36     499
37     998
38     1996
39     (garbled in source)
40     7986
41     15963
42     31926
43     63888
44     127725
45     255463
46     510774
47     1021463

[Chart: Time to Exhaust Count (seconds) versus Number of Bits]

Exhaust all SSNs in 4 seconds!
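
The exhaustive attack on consistent hashing can be sketched as follows. This is a minimal illustration, assuming SHA-256 as the publicly known hash and a toy space of five-digit identifiers so that it runs instantly; the function names `consistent_uid` and `dictionary_attack` and the sample identifiers are hypothetical. A nine-digit SSN has at most 10^9 possible values, roughly 30 bits, so the same loop remains feasible at that scale.

```python
import hashlib


def consistent_uid(source: str) -> str:
    """Unkeyed hash of the source information: same input, same UID, everywhere."""
    return hashlib.sha256(source.encode("utf-8")).hexdigest()


def dictionary_attack(observed_uids, candidates):
    """Hash every candidate and report those whose UID appears in the data set."""
    observed = set(observed_uids)
    recovered = {}
    for candidate in candidates:
        uid = consistent_uid(candidate)
        if uid in observed:
            recovered[uid] = candidate
    return recovered


# Toy stand-ins for SSNs (assumption: 5 digits instead of 9, for speed).
secret_ids = ["10451", "25723", "00002"]
dataset = [consistent_uid(s) for s in secret_ids]

# The attacker knows only the hash function and the format of the source field.
all_candidates = (f"{n:05d}" for n in range(100_000))
recovered = dictionary_attack(dataset, all_candidates)

for uid in dataset:
    print(uid[:12], "->", recovered[uid])
```

The attack needs no secret: because the hash is public and the input space is small, every UID in the data set is reversed by enumeration.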

Like hashing, but has a key to reverse the result.
Example: Using {date of birth, gender, ZIP}
8126r29ws
8126r29ws + key = 9/12/1960, F, 372
The person with the key can reveal the sensitive source information.

Scan Cards / RFID Tags
Issue a card containing a UID to each client, who presents it for service.
Example: #57817
The card can be lost or given away!
It should not contain personal information or Shelter information.

Use something always present with the client and that typically does not change.
Example: fingerprint
UID: 968c5z9
Fingerprints can often be linked to law-enforcement databases and re-identify clients.

Ask each client for permission to share data in exchange for services.
Disclose uses of data and circumstances of sharing.
They may say no.

Identifiable information can be shared.
Forwarding identifiable information is not good.

This Talk
1. The HMIS Setting
2. Technology Survey
3. A Provable Privacy Solution

Question in this Work
How can Shelters construct UIDs without risk of re-identification while still achieving an accurate unduplicated accounting?
This talk will examine old approaches and introduce a new solution with provable properties.

The Big Idea in 3 Steps
1. Shelters assign UIDs. A Client has the same UID at the same Shelter, and a different UID at other Shelters.
2. Shelters securely ship data (UIDs and Universal Data Elements) to the Planning Office. [Image: Fedex]
3. Planning Office and Shelters de-duplicate UIDs (described over the next slides).

UID Assignment
Each Shelter has a private value.
Each Client has a private value.
Strong hashing is used to combine the Shelter and Client values to produce a UID for the Client.

De-Duplication
Each Shelter re-hashes the UIDs from all other Shelters.
All re-hashed values that are the same represent the same Client.

The Commutative Property of Strong Hashing
There exist strong hash functions such that, when all Shelters re-hash all UIDs, the re-hashed values will only be the same for Clients whose source information was the same.
J. Benaloh and M. de Mare. One-way accumulators: a decentralized alternative to digital signatures. In Proceedings of Advances in Cryptology - EUROCRYPT '93, Lecture Notes in Computer Science, v 765, pages 274-285, Lofthus, Norway, 1994.

Simplified Multiplication Example, 1
Each Shelter has its own private value.

Simplified Multiplication Example, 2
Multiply Client and Shelter private values to get UIDs.
Mult(3, [Shelter 1 private value]) = [UID]
Mult(7, [Shelter 1 private value]) = [UID]

Simplified Multiplication Example, 3
Mult(3, [Shelter 1 private value]) = [UID]
Mult(7, [Shelter 1 private value]) = [UID]
The Planning Office stores the UIDs of Clients from Shelter 1.
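
The commutative property can be illustrated with modular exponentiation, a well-known operation for which the order of re-hashing does not matter: (x^a)^b = (x^b)^a mod p. The sketch below is an assumption made for illustration, not the construction used in the talk. The modulus `P`, the helpers `to_group_element`, `new_private_exponent`, and `rehash`, and the client strings are all hypothetical, and the 127-bit modulus is far too small for real use.

```python
import hashlib
import math
import secrets

# Toy modulus (assumption): the Mersenne prime 2^127 - 1. Too small for real use.
P = 2**127 - 1


def to_group_element(source: str) -> int:
    """Map a Client private value to an integer in [2, P - 2]."""
    digest = hashlib.sha256(source.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % (P - 3) + 2


def new_private_exponent() -> int:
    """Pick a Shelter private value that makes exponentiation one-to-one mod P."""
    while True:
        e = secrets.randbelow(P - 3) + 2
        if math.gcd(e, P - 1) == 1:
            return e


def rehash(value: int, private_exponent: int) -> int:
    """Commutative step: rehash(rehash(x, a), b) == rehash(rehash(x, b), a)."""
    return pow(value, private_exponent, P)


shelter1 = new_private_exponent()
shelter2 = new_private_exponent()

client_a = to_group_element("client A private value")
client_b = to_group_element("client B private value")

# UID assignment: the same Client gets an unrelated-looking UID at each Shelter.
uid_a_at_1 = rehash(client_a, shelter1)
uid_a_at_2 = rehash(client_a, shelter2)
uid_b_at_1 = rehash(client_b, shelter1)

# De-duplication: each UID is re-hashed by the other Shelter.
full_a_via_1 = rehash(uid_a_at_1, shelter2)
full_a_via_2 = rehash(uid_a_at_2, shelter1)
full_b_via_1 = rehash(uid_b_at_1, shelter2)

print(full_a_via_1 == full_a_via_2)  # same Client, same fully re-hashed value
print(full_a_via_1 == full_b_via_1)  # different Clients, different values
```

Choosing each exponent coprime to P - 1 makes every re-hash a permutation of the values, so two fully re-hashed values can only coincide when the underlying Client values were the same.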

Simplified Multiplication Example, 4 and 5
Multiply Client and Shelter private values to get UIDs.
Mult(3, [Shelter 2 private value]) = [UID]
Mult(11, [Shelter 2 private value]) = [UID]
The Planning Office now knows there are 4 visits, but how many Clients?

Simplified Multiplication Example, 6 and 7
The Planning Office sends UIDs from Shelter 2 to Shelter 1 for re-hashing.
Mult([Shelter 2 UID], [Shelter 1 private value]) = 897
Mult([Shelter 2 UID], [Shelter 1 private value]) = 3289
The Planning Office stores the re-hashed values: 897, 3289.

Simplified Multiplication Example, 8 and 9
The Planning Office sends UIDs from Shelter 1 to Shelter 2 for re-hashing.
Mult([Shelter 1 UID], [Shelter 2 private value]) = 897
Mult([Shelter 1 UID], [Shelter 2 private value]) = 2093
The Planning Office stores the re-hashed values.
From Shelter 1: 897, 2093
From Shelter 2: 897, 3289

Simplified Multiplication Example, 10
Re-hashed values that are the same represent the same Client. Which are the same?
897, 2093, 897, 3289

Simplified Multiplication Example, 11
The re-hashed value 897 appears twice.
The Planning Office learns that the Client at Shelter 1 is the same Client as at Shelter 2.
897, 2093, 897, 3289

Planning Office Learns
Simplified Multiplication Example
(3 * [Shelter 1 private value]) * [Shelter 2 private value] = 897
(3 * [Shelter 2 private value]) * [Shelter 1 private value] = 897
Completely Re-hashed UIDs for Client1, Client2, Client3: 897, 2093, 3289

The Big Idea in 3 Steps
1. Shelters assign UIDs. A Client has the same UID at the same Shelter, and a different UID at other Shelters.
2. Shelters securely ship data (UIDs and Universal Data Elements) to the Planning Office. [Image: Fedex]
3. Planning Office and Shelters de-duplicate UIDs. Re-hash UIDs to reveal which UIDs belong to the same Client.

Note
The UIDs are not to be used for any purpose other than this reporting and de-duplication.
Shelters use different private values at each reporting period. This results in different hashes for the same Clients over different reporting periods.
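
The simplified multiplication example can be replayed in code. The Client values 3, 7, and 11 and the re-hashed results 897, 2093, and 3289 come from the slides. The Shelter private values 13 and 23 are an assumption chosen to be consistent with those results (897 = 3 × 13 × 23); the slides' own values are not recoverable from the text. Plain multiplication is only a teaching stand-in: it is commutative but not one-way, since division reverses it.

```python
from collections import defaultdict


def mult(a: int, b: int) -> int:
    """Stand-in for a commutative strong hash (simplified, NOT one-way)."""
    return a * b


# Assumed Shelter private values, consistent with the slides' results.
SHELTER_PRIVATE = {"Shelter 1": 13, "Shelter 2": 23}

# Client private values seen at each Shelter (from the slides).
VISITS = {"Shelter 1": [3, 7], "Shelter 2": [3, 11]}

# Step 1: each Shelter assigns UIDs with its own private value.
uids = {
    shelter: [mult(client, SHELTER_PRIVATE[shelter]) for client in clients]
    for shelter, clients in VISITS.items()
}


def fully_rehash(uid: int, origin: str) -> int:
    """Step 3: every Shelter other than the origin re-hashes the UID in turn."""
    value = uid
    for shelter, private in SHELTER_PRIVATE.items():
        if shelter != origin:
            value = mult(value, private)
    return value


rehashed = {
    shelter: [fully_rehash(uid, shelter) for uid in shelter_uids]
    for shelter, shelter_uids in uids.items()
}

# The Planning Office groups equal re-hashed values: equal value, same Client.
groups = defaultdict(list)
for shelter, values in rehashed.items():
    for value in values:
        groups[value].append(shelter)

visit_count = sum(len(shelters) for shelters in groups.values())
client_count = len(groups)

print(rehashed)
print(f"{visit_count} visits by {client_count} distinct Clients")
```

The value 897 is reported from both Shelters, so the Planning Office counts four visits by three distinct Clients without ever seeing a Client's private value.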

A Provable Claim
Theorem. If the re-hashed values are the same, the Clients representing the original UIDs provided the same source information.

A Provable Claim
A dictionary attack by the Planning Office will not yield reliable re-identifications.

UIDs in the Dataset: 149875, 072, 976526

Try (Social Security Number)   Computed UID
000-00-0000                    8563
000-00-0001                    962656
000-00-0002                    072
104-51-2572                    976526
104-51-2573                    149875
999-99-9999                    (last value tried)

A Provable Claim
Compromising a Shelter will not help the intimate stalker learn where a targeted Client is (or has been) at another Shelter.

A Provable Claim
Compromising the Planning Office will not help the intimate stalker learn where a targeted Client is (or has been).
Completely Re-hashed UIDs for Client1, Client2, Client3: 897, 2093, 3289

A Provable Claim
Even if the Planning Office pads the UIDs with known values, the Planning Office does not learn the source information of Clients.
[Diagram: sample UIDs b3s7, ghre, H2732, 0yfh02 passing through the Planning Office]

Over the Limit
If the intimate stalker compromises both the Planning Office and a Shelter the targeted Client visited, the intimate stalker can learn the locations of all Shelters the Client visited.
[Diagram: sample UIDs ax4, 1804, H2732, nw450]

Technologies for HMIS
[Comparison table of technologies; the cell contents are not recoverable. Surviving labels: This provable solution, Distributed Query.]
* If enough parties are compromised, information can be learned.
** Shown is the worst case; it can be improved by source information selection.

Question in this Work
How can Shelters construct UIDs without risk of re-identification while still achieving an accurate unduplicated accounting?
First, use strong hashing, inconsistently across Shelters, to assign UIDs.
Second, provide accounting information to the Planning Office through a secure means.
Then, have each Shelter re-hash the UIDs of all other Shelters, in turn, to de-duplicate UIDs.

This Talk
1. The Setting
2. Technology Survey
3. A Provable Privacy Solution

privacy.cs.cmu.edu
latanya@privacy.cs.cmu.edu