Publishing CitiSense Data: Privacy Concerns and Remedies
Kapil Gupta
Advisor: Prof. Bill Griswold
Location Based Services
- Location-based services data has great utility: traffic control, mobility management, urban planning, etc.
- It is critical to preserve the privacy of the users involved.
- Anonymity cannot be assured by simply replacing users' real identifiers with pseudonyms.
- This project deals with these issues for the CitiSense dataset.
CSE Dept., University of California, San Diego
CitiSense: Introduction
- Portable pollution monitoring system
- Real-time air quality readings on a phone
CitiSense: Objective
- Deliver air quality estimation to individuals and public health agencies.
- Understand the behavior of air pollutants within urban areas.
- So far so good! Where is the problem?
CitiSense: Data Publishing
Solutions?
- Strip the location information: utility is completely lost!
- Add noise to the location data: this hurts utility, and the added noise is not privacy-aware.
- Does stripping user identifiers from the dataset still have privacy implications? Yes! There is an underlying linear relation between the temporal and spatial movement of an individual.
Rest of the presentation:
- CitiSense data overview
- Data preprocessing
  - Why
  - Steps
- Privacy breaches
  - What will an attacker do
  - What information is compromised
- Location anonymization
  - How
  - Utility
CitiSense: Data Overview
- Spatio-temporal data, also called moving object data, trajectory data, or mobility data
- Readings from 30 users over a period of five weeks (Jul 30 - Sep 7)
- 21.5 million readings (data points)
- 7 sensors
CitiSense: Visualization?
- If 1 marker = 1 pixel, 21.5M markers => whole screen covered!
CitiSense: Visualization
- Millions of points
- Browser crashes after 600 MB of memory
- First try: Quality Threshold clustering?
CitiSense: Preprocessing
- Filter outlier/noise/speed smoothing
- Trip segmentation
- Trajectory smoothing
- Trajectory compression
- Error measure for trajectory compression
CitiSense: Side Note
- Location data => Earth's coordinates
- Mercator projection: a cylindrical map projection
- Earth's radius = 6378100 m
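The projection in this side note can be sketched as follows. This is a minimal sketch that assumes the spherical form of the Mercator projection and uses the radius quoted on the slide; the function name `mercator` is hypothetical and the slides do not say which variant was actually used.

```python
import math

EARTH_RADIUS_M = 6378100.0  # radius quoted on the slide, in meters


def mercator(lat_deg, lon_deg, radius=EARTH_RADIUS_M):
    """Project latitude/longitude in degrees to spherical Mercator (x, y) in meters."""
    lat = math.radians(lat_deg)
    lon = math.radians(lon_deg)
    x = radius * lon
    y = radius * math.log(math.tan(math.pi / 4 + lat / 2))
    return x, y
```

Working in projected meters lets the later steps use plain Euclidean distances between points.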
CitiSense: Filtering
- Duplication filter
  - Multiple sensors lead to multiple readings with the same time and location information.
- Speed and acceleration filter
  - A few readings indicate a speed of 546 km/hr with 52 m/s^2 acceleration.
  - Neighboring data points need to be smoothed accordingly.
CitiSense: Filtering Results
- Data points are combined within a threshold of 30 seconds when there is no location change.
- Speed limit of 150 km/hr and acceleration limit of 10 m/s^2
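The two filters can be sketched roughly as below. This is a sketch under stated assumptions, not the project's code: points are assumed to be time-sorted `(t, x, y)` tuples in seconds and projected meters, the speed limit is read as 150 km/hr, the first point is assumed to start at rest, and offending points are simply dropped rather than smoothed. The function names are hypothetical.

```python
import math

MAX_SPEED = 150 / 3.6  # 150 km/hr expressed in m/s
MAX_ACCEL = 10.0       # m/s^2


def drop_duplicates(points):
    """Keep one reading per (t, x, y); several sensors can report the same fix."""
    seen, out = set(), []
    for p in points:
        if p not in seen:
            seen.add(p)
            out.append(p)
    return out


def speed_accel_filter(points):
    """Drop points whose implied speed or acceleration exceeds the limits."""
    if not points:
        return []
    kept = [points[0]]
    prev_speed = 0.0  # assumption: the trace starts at rest
    for t, x, y in points[1:]:
        t0, x0, y0 = kept[-1]
        dt = t - t0
        if dt <= 0:
            continue  # repeated or out-of-order timestamp
        speed = math.hypot(x - x0, y - y0) / dt
        accel = abs(speed - prev_speed) / dt
        if speed > MAX_SPEED or accel > MAX_ACCEL:
            continue  # treat as a GPS outlier
        kept.append((t, x, y))
        prev_speed = speed
    return kept
```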
CitiSense: Trip Segmentation
- Extract trips from:
  - Change in speed
  - Time gap between consecutive positions
  - Length of the trip
- Values set are 300 seconds for the time gap and 100 m for the trip length.
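A minimal sketch of the segmentation, assuming time-sorted `(t, x, y)` points in seconds and meters. It covers only the time-gap and trip-length criteria with the values from the slide; the change-in-speed criterion is left out, and the function names are hypothetical.

```python
import math

MAX_GAP_S = 300     # time-gap threshold from the slide
MIN_LENGTH_M = 100  # minimum trip length from the slide


def path_length(trip):
    """Sum of straight-line distances between consecutive points."""
    return sum(math.hypot(b[1] - a[1], b[2] - a[2]) for a, b in zip(trip, trip[1:]))


def segment_trips(points, max_gap=MAX_GAP_S, min_length=MIN_LENGTH_M):
    """Split points into trips at large time gaps, then drop very short trips."""
    trips, current = [], []
    for p in points:
        if current and p[0] - current[-1][0] > max_gap:
            trips.append(current)
            current = []
        current.append(p)
    if current:
        trips.append(current)
    return [trip for trip in trips if path_length(trip) >= min_length]
```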
CitiSense: Trajectory Smoothing
- Smooth out noise by applying a median filter, although it suffers from lag.
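One way to picture the median filter is the sketch below. It assumes a trailing window (the current point plus the points before it), which is one plausible source of the lag mentioned above; the window size of 5 and the function name are assumptions, not values from the slides.

```python
from statistics import median


def median_smooth(points, window=5):
    """Replace x and y of each (t, x, y) by the median over a trailing window."""
    out = []
    for i, (t, _, _) in enumerate(points):
        recent = points[max(0, i - window + 1): i + 1]
        out.append((t, median(p[1] for p in recent), median(p[2] for p in recent)))
    return out
```

Because the window only looks backwards, an isolated spike is suppressed but a real change in position shows up a few samples late.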
CitiSense: Trajectory Compression
- Error measure: Euclidean distance
- Synchronous Euclidean distance (SED)
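The difference between the two error measures is easiest to see in code. The sketch below computes SED under the usual definition: the approximating segment is sampled at the timestamp of the original point, and the distance is taken to that time-synchronized position. Points are assumed to be `(t, x, y)` tuples; the function name `sed` is hypothetical.

```python
import math


def sed(p, start, end):
    """Synchronous Euclidean distance of p from the segment start -> end."""
    (t, x, y), (ts, xs, ys), (te, xe, ye) = p, start, end
    if te == ts:
        return math.hypot(x - xs, y - ys)
    ratio = (t - ts) / (te - ts)
    x_sync = xs + ratio * (xe - xs)  # where the segment is at time t
    y_sync = ys + ratio * (ye - ys)
    return math.hypot(x - x_sync, y - y_sync)
```

For example, a point that lies exactly on the segment but was reached too early has a perpendicular distance of zero and a non-zero SED, so SED also penalizes timing errors.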
CitiSense: Trajectory Compression
- Similar to the line generalization problem
- Uniform sampling algorithm?
- Douglas-Peucker
- Curve matching
- Top-down time-ratio (TD-TR)
- GTC trajectory compression algorithm
  - Greedy solution
  - Uses the farthest point with an approximated SED less than the given error tolerance.
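As an illustration of one algorithm in this list, the sketch below implements TD-TR in its common form: the Douglas-Peucker recursion with SED as the error measure. It is not the GTC algorithm, and the names `sed` and `td_tr` are hypothetical. Points are assumed to be time-sorted `(t, x, y)` tuples.

```python
import math


def sed(p, start, end):
    """Synchronous Euclidean distance of p from the segment start -> end."""
    (t, x, y), (ts, xs, ys), (te, xe, ye) = p, start, end
    if te == ts:
        return math.hypot(x - xs, y - ys)
    ratio = (t - ts) / (te - ts)
    return math.hypot(x - (xs + ratio * (xe - xs)), y - (ys + ratio * (ye - ys)))


def td_tr(points, tolerance):
    """Keep the end points; split at the worst point while its SED exceeds tolerance."""
    if len(points) <= 2:
        return list(points)
    start, end = points[0], points[-1]
    worst_i, worst_d = 0, 0.0
    for i in range(1, len(points) - 1):
        d = sed(points[i], start, end)
        if d > worst_d:
            worst_i, worst_d = i, d
    if worst_d <= tolerance:
        return [start, end]
    left = td_tr(points[:worst_i + 1], tolerance)
    right = td_tr(points[worst_i:], tolerance)
    return left[:-1] + right  # the split point appears in both halves
```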
CitiSense: Preprocessing Summary
CitiSense: Privacy Breaches
- Region of Interest (ROI)
- Behavior mining
- Predictive query
- Regular routes mining
- Recognizing travel modes
CitiSense: Region of Interest (ROI)
- Stops: the semantically important parts of a trajectory
- Algorithms:
  - IB-SMoT (Intersection-Based Stops and Moves of Trajectories)
  - CB-SMoT (Clustering-Based Stops and Moves of Trajectories): depends on speed variation
- If stops are repeated frequently => ROI
- Examples: the user's home, office, gym location, etc.
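To make the notion of a stop concrete, here is a deliberately simplified, speed-based stop detector. It is a stand-in for illustration only and is not CB-SMoT or IB-SMoT: it marks a stop wherever the speed stays below a threshold for long enough. The thresholds (0.5 m/s, 300 s) and the function name are assumptions; points are time-sorted `(t, x, y)` tuples.

```python
import math


def find_stops(points, max_speed=0.5, min_duration=300):
    """Return (t_start, t_end, x_mean, y_mean) for each sufficiently long slow run."""
    stops, run = [], []

    def flush():
        # Record the current slow run if it lasted long enough.
        if len(run) >= 2 and run[-1][0] - run[0][0] >= min_duration:
            xs = [p[1] for p in run]
            ys = [p[2] for p in run]
            stops.append((run[0][0], run[-1][0], sum(xs) / len(xs), sum(ys) / len(ys)))

    for prev, cur in zip(points, points[1:]):
        dt = cur[0] - prev[0]
        speed = math.hypot(cur[1] - prev[1], cur[2] - prev[2]) / dt if dt > 0 else 0.0
        if speed <= max_speed:
            if not run:
                run.append(prev)
            run.append(cur)
        else:
            flush()
            run.clear()
    flush()
    return stops
```

Stops whose centers recur across many days would then be the candidates for ROIs.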
CitiSense: ROIs
CitiSense: Regular Route Mining
- Route similarity
- Route grouping
CitiSense: Recognizing Travel Modes
- Walk: 2-3 miles/hr
- Bike: 4-5 miles/hr
- Car: 55-70 miles/hr
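An attacker could apply these speed ranges with something as simple as the sketch below. It assumes trips are `(t, x, y)` points in seconds and meters and labels a trip by its average speed; anything outside the three ranges is reported as unknown. The function name and the "unknown" label are hypothetical.

```python
import math

MPH_PER_MPS = 2.23694  # 1 m/s expressed in miles per hour

# Speed ranges in miles/hr, taken from the slides
MODE_RANGES_MPH = {"walk": (2, 3), "bike": (4, 5), "car": (55, 70)}


def travel_mode(trip):
    """Label a trip by its average speed over the whole trip."""
    duration = trip[-1][0] - trip[0][0]
    if duration <= 0:
        return "unknown"
    length = sum(math.hypot(b[1] - a[1], b[2] - a[2]) for a, b in zip(trip, trip[1:]))
    mph = length / duration * MPH_PER_MPS
    for mode, (low, high) in MODE_RANGES_MPH.items():
        if low <= mph <= high:
            return mode
    return "unknown"
```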
Demo
CitiSense: Trajectory Anonymization
- Clustering based anonymization
- ROI based anonymization
- Temporal cloaking
CitiSense: Clustering based Anonymization
- Also called NWA (Never Walk Alone)
- Based on the inherent uncertainty of the GPS system: a trajectory is not a line, it is a cylinder.
- Uses the uncertainty of trajectory data to group k co-localized trajectories within the same time period into a k-anonymized aggregate trajectory.
CitiSense: Clustering based Anonymization
- 3 main steps:
  - Pre-processing step: group all trajectories that have the same starting and ending times; trajectories are trimmed if necessary.
  - Clustering step: cluster each trajectory with its k-1 nearby trajectories; the cluster radius is bounded by a threshold.
  - Space transformation step: take the arithmetic mean of the cluster.
- See next figure.
CitiSense: Clustering based Anonymization
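The last of the three steps can be illustrated with a small sketch. It assumes the cluster's trajectories have already been trimmed to the same timestamps, and it covers only the arithmetic mean; the grouping and clustering steps are not shown. The function name `cluster_mean` is hypothetical.

```python
def cluster_mean(trajectories):
    """Point-wise arithmetic mean of trajectories sampled at the same timestamps."""
    length = len(trajectories[0])
    if any(len(tr) != length for tr in trajectories):
        raise ValueError("trajectories must be trimmed to the same length")
    k = len(trajectories)
    mean = []
    for samples in zip(*trajectories):
        t = samples[0][0]  # assumption: all samples in a column share this time
        mean.append((t, sum(p[1] for p in samples) / k, sum(p[2] for p in samples) / k))
    return mean
```

Publishing the mean in place of the k individual trajectories is what makes the members of a cluster indistinguishable from one another.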
CitiSense: Utility
- Before proceeding further, let's analyze the utility of CitiSense data.
- The utility of CitiSense data:
  - is not hurt by changing the temporal dimension by a small amount.
  - does not depend on the number of points in the database, but rather on the number of points in different regions.
CitiSense: ROI based Anonymization
- Information is revealed from ROIs, so remove frequent stops from the trajectory data!
- Simply removing stops won't work: an attacker can still extrapolate.
- Solution: also remove all points in the neighboring regions of the stops.
- Parameters: average duration and frequency of a stop to qualify, and the area to be removed
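The removal step might look like the sketch below. It assumes the qualifying stops have already been found and are given as `(x, y)` centers in meters, and it models the area to be removed as a circle of a chosen radius around each stop; the circular shape, the radius, and the function name are assumptions.

```python
import math


def remove_near_stops(points, stops, radius):
    """Drop every (t, x, y) point within radius meters of any stop center."""
    return [
        p for p in points
        if all(math.hypot(p[1] - sx, p[2] - sy) > radius for sx, sy in stops)
    ]
```

A larger radius makes extrapolation back to the stop harder but discards more readings, which is the privacy versus utility trade-off examined in the results.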
CitiSense: ROI based Anonymization
- A user trajectory on a particular day
- Stops in the trajectory
- Trajectory after removal of stops
CitiSense: ROI based Anonymization
- Further improvements:
  - Semantic analysis: tag public and private places for each user.
  - Remove private ROIs.
  - Increase utility if a privately tagged location contains more than k users.
CitiSense: Temporal Cloaking
- Privacy breaches depend on the successful creation of trips.
- Trip segmentation depends heavily on the temporal pattern.
- Idea: blur the user's presence at a location at a particular time by inserting Gaussian noise into the time values, so that the linear relation between distance and time is disrupted.
- Small noise does not hurt the utility of the CitiSense data.
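A minimal sketch of the idea, assuming `(t, x, y)` points with timestamps in seconds. Each timestamp gets an independent Gaussian offset while the spatial part is left untouched. The defaults follow the (mean, sigma) values quoted on the results slide; the function name and the `seed` parameter are hypothetical, and whether the output should be re-sorted by time is left open here.

```python
import random


def cloak_times(points, mean=600.0, sigma=1.0, seed=None):
    """Add independent Gaussian noise to each timestamp; x and y are unchanged."""
    rng = random.Random(seed)
    return [(t + rng.gauss(mean, sigma), x, y) for t, x, y in points]
```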
CitiSense Anonymization Results
CitiSense: Clustering based Anonymization
- Only 48% of the points are anonymized by NWA.
- These points belong to a dense region, all located near CSE, UCSD.
- For k-anonymity, NWA needs regions with more than one CitiSense user present at approximately the same time.
CitiSense: ROI Results
- Side note: concentrated data points => trips = 0, stops = 1
- The table suggests that anonymization leads to a huge loss of data and utility. Is it the right measure?
CitiSense: ROI Results, Coverage
- The coverage of a data point can be defined as the area within which the readings from the sensor can be considered the same.
- The definition depends on the diameter of the cluster and on the coverage parameter.
CitiSense: Results
CitiSense: ROI Results
- Note: the area covered does not take overlapping trajectories into account. Why? NWA can take care of that.
CitiSense: Temporal Cloaking Results
- Gaussian parameters (mean, sigma) are set to 600 seconds and 1, respectively.
- Preprocessing the transformed data results in 54% fewer trips.
- Points are discarded as outliers.
CitiSense: Anonymization Order
- In what order should we apply these 3 techniques?
  1. NWA
  2. ROI based
  3. Temporal cloaking
- Why?
CitiSense: Results Summary
- 48% of the points are anonymized by NWA, mainly in dense regions.
- ROI based anonymization protects users' personal information at the cost of a 6% loss in utility.
- Temporal cloaking => 54% fewer trips from trip segmentation, which implies that less information can be extracted by data mining (e.g., finding regular routes or the mode of transportation).
Conclusion
- The major privacy concern is resolved at the cost of a 6% loss in utility.
- NWA will work better on dense data.
- Temporal cloaking needs more analysis.
- Can we find mathematical guarantees for immunity against attackers?
Questions?