Publishing CitiSense Data: Privacy Concerns and Remedies


University of California, San Diego
Master's Project
Publishing CitiSense Data: Privacy Concerns and Remedies
Author: Kapil Gupta
Supervisor: Prof. Bill Griswold
March 15, 2013

Abstract

Publishing original spatial trajectories obtained from a location-based service (LBS) to the public or a third party for data analysis could result in serious privacy breaches. CitiSense generates huge collections of spatio-temporal data, variously called moving object data, trajectory data, or mobility data. In the first part of this report we study the privacy violations an individual could suffer, such as identity revelation, if the CitiSense data is made public. We then apply an existing methodology for privacy-preserving data publication called (k, δ)-anonymity and demonstrate its effectiveness on the CitiSense dataset. This technique exploits the inherent uncertainty of location to decrease the distortion required to anonymize the data.

Location-based services data have great utility in data-analysis applications such as city traffic control, mobility management, urban planning, and location-based service advertisements, to mention a few. An extensive amount of research has therefore been done on these data, as indicated by the large number of spatio-temporal data mining techniques developed in the recent past [28, 29, 27, 42, 43, 49, 36, 37, 44, 9]. It is thus critical to develop techniques that transform a database of trajectories of moving objects so that it satisfies some concept of anonymity while retaining most of its original utility. Anonymity cannot be assured by simply replacing users' real identifiers (e.g., name, age, date of birth) with pseudonyms. As demonstrated in [1], using pseudonyms does not guarantee anonymity, since location is itself a property that can be used to identify an individual.
For example, if a person is known to follow a certain route every day, it is highly likely that the end-points of the route are the workplace (or school) and the home of that person. Also, due to the existence of quasi-identifier locations, i.e., a set of locations that can be linked to external information to re-identify individuals, anonymous location data may be traced back to personally identifying information with the help of additional data sources [10]. Contemporary techniques for trajectory data mining and knowledge discovery have concentrated both on the geometrical properties of trajectories and on their background geographic information (semantic trajectory mining). We cannot simply strip the location information from a reading in the CitiSense data, as doing so would hurt the utility of the published data. Adding noise to the location data to anonymize it would also hurt the utility of the sensor readings. On the other hand, if we simply publish the location information of the readings as it is, we risk exposing the many forms of sensitive information that trajectories are likely to contain. Therefore trajectories cannot be released for public use before they are properly anonymized. The problem of location privacy has been well studied in the context of location-based services [39, 46, 31, 22, 47]. The focus is both on on-line, service-centric anonymity and on off-line, data-centric anonymity (as in the context of data publishing). In this report we focus on the latter and study the problem of anonymity-preserving publication of the CitiSense data. We use the NWA algorithm [6], which extends the concept of k-anonymity [2] to handle the type of data we have and to utilize its inherent uncertainty [3], [4], [5].
Please note that the extent to which the location of an individual represents vulnerable information, and what exactly constitutes private and sensitive information, are philosophical, social and individual concerns beyond the scope of this project.

Paper content and organization: The rest of the paper is organized as follows. Section I gives an overview of the CitiSense project and its dataset. Section II describes the preprocessing of the CitiSense dataset to remove outliers and to compress the data. Section III discusses privacy concerns on the data publication and the information that can be extracted from the published data. This section also presents privacy breaches on the CitiSense data. Section IV examines existing anonymization techniques and proposes Region of Interest based anonymization and temporal cloaking on the CitiSense data. Section V evaluates the dataset after applying anonymization and discusses the findings. Finally, Section VI concludes the paper and suggests some ideas for extensions to this work.

I. CitiSense Dataset

CitiSense is a portable pollution monitoring system that allows one to get real-time air quality readings for one's surroundings on a smartphone. The CitiSense system includes small sensors carried by users, the users' Android mobile phones and a backend infrastructure that stores the collected data. CitiSense devices can estimate air quality in the area where they're deployed, providing information to everyone, not just those carrying sensors. For publishing this dataset to individuals and public health agencies, providing only the sensor reading, date sampled and location information is sufficient. A sample of the publicly available dataset is shown in Table 1. The sensorid can take 7 different values. The dataset used in this project contains readings from 30 users over a period of five weeks (Jul 30 to Sep 7). The total number of rows in this dataset is more than 21.5 million. The sampling interval for sensor readings in the CitiSense system is very aggressive, usually a few seconds. Processing data at such a high rate would be computationally very challenging.

Table 1: Sample reading of publicly available CitiSense sensor data (columns: sensorid, reading, datesampled, Latitude, Longitude, locationaccuracy)
Also, due to the high sampling rate, the database has enormous privacy implications: even for only 30 days, the data for an individual contains hundreds of thousands of data points, leading to identification of his/her home, office, spatial patterns, etc.

II. Preprocessing

Choosing high sampling rates for acquiring sensor readings from individuals leads to massive data collections. Thus, it is imperative to apply data compression during trajectory preprocessing. Filtering also helps in diminishing noise and in assessing higher-level properties such as speed and direction. Since trajectories are normally measured by a sensor, they inevitably have some error, including occasional outliers. Simple techniques like mean and median filtering can reduce these errors. In addition to error reduction, certain filters like the Kalman filter and the particle filter can also give error estimates and inferences on speed and direction. Because we acquire data using a sampling-based approach, object trajectories are represented in discrete form even though the object movement is continuous. However, object movements display predictable patterns due to the linear properties of the underlying transportation framework. Consequently, much of the redundant and erroneous data can be eliminated from the trajectory without compromising much of the useful information [8]. These preprocessing steps are also what an attacker needs in order to mine the underlying hidden information.

A. Trajectory Filtering & Smoothing

Due to the uncertainty of the data obtained from GPS devices, outliers need to be removed before behavior mining or region of interest extraction can be done. Filtering is particularly essential when one intends to deduce other properties from the data, such as speed or direction. In this project, we discuss two filters that eliminate outliers, and then segment the trajectory data into trips.
All the calculations (speed, acceleration, compression, data mining, etc.) are done after converting location data (latitude, longitude, altitude) into Earth's coordinates using the Mercator projection [7]. It is a cylindrical map projection which specifies how the geographic detail is transferred from the globe to a cylinder tangential to it at the equator. The cylinder is then unrolled to give the planar map (see Figure 1).

Figure 1: A cylindrical map projection to find coordinates in the frame of reference of Earth's center using latitude and longitude. Figure taken from [7].

Duplication filter: If the distance between two consecutive positions is smaller than a threshold, the duplication filter removes the second position. The CitiSense dataset contains multiple sensor types, leading to multiple readings with the same location and time stamp. To make computation on the dataset efficient, it is essential to remove these duplicate entries. Table 2 presents the results of applying the duplication filter to the CitiSense data.

Speed and Acceleration filter: It is assumed that individuals move at a plausible speed between two consecutive positions, and that there is a reasonable speed range for individuals (for different means of transportation like walking, biking, car, bus, etc.). The speed and acceleration filter removes the second position if the speed and/or acceleration between two consecutive positions is unreasonable. For example, a few readings in the CitiSense data indicate an impossible speed of 546 km/h with an acceleration of 52 m/s². These invalid readings need to be removed for trajectory analysis and their neighboring data points need to be smoothed accordingly. Table 2 presents the results after applying the speed and acceleration filter, with a speed limit of 150 km/h and an acceleration limit of 10 m/s², to the CitiSense data. Figure 2 shows the variation of speed for a user in a given trip and the smoothed speed after removal of outliers.

B. Trip Segmentation

Route pattern mining requires recorded data to be segmented into trips. However, asking users to manually turn their GPS devices on and off several times a day for the purpose of trip segmentation would drastically decrease the usability of the system and the reliability of the data. This section discusses how to extract trips from GPS data using the concept of moves and stops [11].
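The duplication filter and the speed and acceleration filter of Section II.A can be sketched as follows. This is a minimal illustration, not the project's implementation: it assumes positions are already projected to planar coordinates in metres with timestamps in seconds, the function names and the 1 m duplicate threshold are hypothetical, the speed limit is read as 150 km/h, and the first position is assumed to be at rest. The smoothing of neighbouring points is not shown.

```python
import math
from typing import List, Tuple

# (x, y, t): planar coordinates in metres and a timestamp in seconds (assumed layout).
Point = Tuple[float, float, float]


def duplication_filter(points: List[Point], min_dist: float = 1.0) -> List[Point]:
    """Drop a position that lies closer than min_dist to the last kept position."""
    kept: List[Point] = list(points[:1])
    for x, y, t in points[1:]:
        px, py, _ = kept[-1]
        if math.hypot(x - px, y - py) < min_dist:
            continue  # second position of a near-duplicate pair is removed
        kept.append((x, y, t))
    return kept


def speed_accel_filter(points: List[Point],
                       max_speed: float = 150 / 3.6,  # 150 km/h in m/s
                       max_accel: float = 10.0) -> List[Point]:
    """Drop a position if reaching it from the last kept position would need
    an implausible speed or acceleration."""
    kept: List[Point] = list(points[:1])
    prev_speed = 0.0  # assumption: the track starts at rest
    for x, y, t in points[1:]:
        px, py, pt = kept[-1]
        dt = t - pt
        if dt <= 0:
            continue  # non-increasing timestamps cannot yield a speed
        speed = math.hypot(x - px, y - py) / dt
        accel = abs(speed - prev_speed) / dt
        if speed > max_speed or accel > max_accel:
            continue
        kept.append((x, y, t))
        prev_speed = speed
    return kept
```

Comparing each candidate against the last kept position, instead of the last raw position, keeps a single outlier from also causing the valid point that follows it to be rejected.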
Figure 2: Variation of speed before and after applying the speed and acceleration filter.

The basic criterion for splitting GPS data is the time gap between two consecutive positions, since a stop indicates the end of a trip. Algorithm A is used to segment the trips. In this algorithm, T is the array containing all recorded trips of a person, λ_time_gap is the time threshold used to segment trips, λ_trj_len is the threshold used to remove short trips, and Funct() is one of the data filtering functions described above. Funct() returns true if the positions comply with the restriction of the data filter; otherwise it returns false, and the corresponding positions are removed.

procedure Trip Segmentation(A)
    Input: T, λ_time_gap, λ_trj_len, Funct()
    Output: T_tmp
    T_tmp ← φ
    for each route r_i in T do
        r_tmp = φ
        for each position p_j in r_i do
            if Funct(p_j, r_i) returns true then
                r_tmp = Append(r_tmp, p_j)
            else if Time(p_j) − Time(p_{j−1}) > λ_time_gap then
                if Size(r_tmp) > λ_trj_len then
                    T_tmp = Append(T_tmp, r_tmp)
                end if
            end if
        end for
    end for
    Return T_tmp
end procedure

The data filtering process can remove the noisy raw data and greatly reduce the amount of original trip data. Applying trip segmentation to the CitiSense data gives the number of trips per person over a period of 30 days. Note that a trip count of 0 indicates a stationary node in the CitiSense dataset. This implies that simple filters and trip segmentation can identify stationary users in the dataset. λ_time_gap is chosen to be 300 seconds and λ_trj_len is 100 meters. Figure 3 shows all the trips made by a user on 10th Aug.

Figure 3: Various trips made by a user on 10th Aug. Different colors represent different trips.

C. Trajectory Smoothing

A simple method to smooth noise is to apply a mean filter. For a measured point p_i, the estimate of the (unknown) true value is the mean of p_i and its n − 1 predecessors in time. The mean filter can be thought of as a sliding window covering n temporally adjacent values of p_i. A major drawback of the mean filter is its sensitivity to outliers. This can be alleviated by using a median filter instead, which is identical to the mean filter except that the mean is replaced with a median [8]:

$\hat{x}_i = \mathrm{median}\{p_{i-n+1}, p_{i-n+2}, \ldots, p_{i-1}, p_i\}$    (1)

Figure 4: Example of the median filter for trajectory smoothing. Figure taken from [8].

See Figure 4 for the median filter in effect on a sample trajectory with outliers. For smoothing a trajectory, both the mean filter and the median filter are simple and effective techniques, but both suffer from lag. The Kalman filter and the particle filter are two more advanced techniques that reduce lag and can be designed to estimate more than just location. Though they are not used in this project, they are worth exploring.

D. Error Measure for Trajectory Compression

In this section, we discuss two error measures for the deviation of an approximate trajectory from its original trajectory: perpendicular Euclidean distance and time synchronized Euclidean distance. An estimate of the accuracy of the approximated location values can be obtained from the distance between a location on the original trajectory and the estimated location on the approximated trajectory. The shortest distance from a sampled location point in the original trajectory to the approximated trajectory is the perpendicular Euclidean distance. A measure of the error can be obtained by averaging the perpendicular Euclidean distance over all sampled location points. Figure 5(a) illustrates the computation of the error measure based on the perpendicular Euclidean distance between the original trajectory acquired by a moving object and an approximated trajectory generated by applying one of the trajectory data reduction algorithms. However, projecting each point of the original trajectory onto the segments of the approximated trajectory takes into consideration only the geometric characteristics of the trajectories. The temporal component of object movement is not accounted for [8]. Notice that a sampled data point <x, y, t> in the original trajectory denotes the time t when the moving object is located at (x, y). Thus, there is a need to also consider the temporal factor in the projection. To take the temporal factor into account, time synchronized Euclidean distance was proposed [8] as a new error measure for approximated trajectories generated by trajectory data reduction algorithms [24, 25].

Table 2: Filtering of the CitiSense Dataset (columns: Raw, After Duplication filter, After Duplication + Speed & Acceleration filter; row: #Rows)

Figure 5: Error Measure for Trajectory Compression. (a) Error measure based on perpendicular Euclidean distance. This error measure takes into account the geometric relationship of the trajectories. (b) Error measure based on time synchronized Euclidean distance. This error measure takes into account both the geometric relationship and the temporal factor of the trajectories. Figure taken from [8].
This error measure recognizes that the projected movement on the approximated trajectory should be time-synchronized with the real movement on the actual trajectory. Figure 5(b) illustrates the idea of time synchronized Euclidean distance. As shown, the location points on the approximated trajectory, i.e. p_0, p_5 and p_16, are already synchronized by time. The other sampled location points, e.g. p_1, p_2, p_3 and p_4, are projected to the time synchronized location points p'_1, p'_2, p'_3 and p'_4 on the line segment p_0 p_5.

E. Trajectory Compression

Our aim here is to produce an approximate trajectory from the actual trajectory by eliminating some location points while making sure that the error introduced is negligible. This problem closely resembles the well-studied line generalization problem in computer graphics and cartography [8]. A very simple approximation technique is uniform sampling, where every i-th location point (e.g. the 10th, 20th, 30th, etc.) is retained and the other points are rejected [27]. This approach works poorly when the location points of the original trajectory carry different amounts of the information needed to represent the trajectory. Douglas-Peucker (DP), a renowned algorithm, can be employed for the approximation of the original trajectory [9, 15]. Given a curve composed of line segments, this algorithm finds a similar curve with fewer points. The objective is to use an approximate line segment to replace the actual trajectory. If the replacement does not comply with the specified error condition, the original problem is partitioned into two sub-problems by choosing the location point responsible for the maximum error as the split point. This partitioning is recursive and continues until the stopping condition is met.
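To make the recursive split concrete, the sketch below pairs a Douglas-Peucker-style partition with the time synchronized Euclidean distance of Section II.D as the error measure. It is an illustration under stated assumptions rather than the compression code used in this project: points are assumed to be (x, y, t) tuples in metres and seconds, and the names `sed` and `dp_sed` are hypothetical.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float, float]  # (x, y, t), assumed planar metres and seconds


def sed(p: Point, a: Point, b: Point) -> float:
    """Time synchronized Euclidean distance of p from the segment a -> b.

    p is compared with the position a uniformly moving object would have on
    a -> b at p's own timestamp, not with its perpendicular projection.
    """
    (x, y, t), (ax, ay, at), (bx, by, bt) = p, a, b
    if bt == at:
        return math.hypot(x - ax, y - ay)
    ratio = (t - at) / (bt - at)
    sx = ax + ratio * (bx - ax)
    sy = ay + ratio * (by - ay)
    return math.hypot(x - sx, y - sy)


def dp_sed(points: List[Point], tolerance: float) -> List[Point]:
    """Keep the end points; split at the worst point while the error exceeds tolerance."""
    if len(points) <= 2:
        return list(points)
    first, last = points[0], points[-1]
    worst_index, worst_error = 0, 0.0
    for i in range(1, len(points) - 1):
        error = sed(points[i], first, last)
        if error > worst_error:
            worst_index, worst_error = i, error
    if worst_error <= tolerance:
        return [first, last]  # stopping condition: one segment is good enough
    left = dp_sed(points[:worst_index + 1], tolerance)
    right = dp_sed(points[worst_index:], tolerance)
    return left[:-1] + right  # the split point appears in both halves
```

The point (1, 0) at t = 5 illustrates why the temporal factor matters: it lies exactly on the segment from (0, 0) at t = 0 to (10, 0) at t = 10, so its perpendicular distance is zero, yet `sed` reports 4 m because the object should have been at (5, 0) by then.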

Table 3: Variation of each user's data after preprocessing (columns: User Id, Raw, After Filtering, After Compression, Compression %; final row: Total)

The stopping condition is that the error between the approximate and original trajectories falls below the given error threshold. A modified DP algorithm, the top-down time-ratio (TD-TR) algorithm [24], which uses synchronous Euclidean distance (SED) instead of the perpendicular Euclidean distance, is also a very popular algorithm for trajectory compression.

In this project we have used the GTC trajectory compression algorithm [14], a greedy solution for trajectory approximation. It starts from the first point and finds the farthest point whose approximated SED is less than the given error tolerance. The pseudocode is shown in Figure 6. The rest of the analysis in this paper is done on the dataset compressed with SED = 5 m unless otherwise specified.

Figure 6: Pseudocode of proposed GPS trajectory approximation process

Figure 7: Variation of % compression vs SED

Figure 7 shows the variation of the percentage of points/rows left for different values of SED

used. Table 3 shows the variation in the number of readings for each user after filtering and compression.

III. Privacy Breaches

There are many real-life situations in which attackers exploit location-detection technologies to gain access to private location information and other sensitive information about victims [16, 17, 18, 19]. The following are some of the techniques that can be applied to LBS data to mine information about individuals.

A. Region of Interest (ROI)

In 2008, Spaccapietra et al. proposed the first data model that looks at trajectories from a conceptual point of view and supports robust semantic analysis, called stops and moves [11]. A stop is a semantically important part of a trajectory that is relevant for an application, and where the object has stayed for a minimal amount of time. For instance, on weekdays a stop could be an office or workplace, and on weekends or holidays a stop could be a tourist attraction, a restaurant, a movie theater, etc. Figure 8 describes this idea pictorially.

Figure 8: Identifying stops and moves from GPS data points. Figure taken from [23]

To extract stops and moves from trajectory points, Alvares et al. introduced an algorithm called IB-SMoT (Intersection Based Stops and Moves of Trajectories) [12]. While IB-SMoT searches for intersections among trajectories, there are other approaches, such as the speed-based spatio-temporal clustering approach CB-SMoT, to find important points of interest [13]. In this project we have used Weka-STPM [21] to do the IB-SMoT and CB-SMoT analysis. Weka-STPM is an extension of Weka for spatio-temporal data. Figure 9 shows some of the stops taken by an individual over a period of 15 days. It can easily be inferred that if a stop is repeated more than a particular number of times, it is a region of interest for an attacker. Taking this notion a step further, plotting stops over time can lead to identification of regions of interest for an attacker, such as the victim's home, office, gym, preferred shopping mall, etc. Although this interpretation requires manual effort, there exist semantic trajectory frameworks that perform it automatically [20].

Figure 9: Visualization of the stops of an individual over a period of 30 days. The markers are color-coded to emphasize the frequency of a stop taken by the user.

B. Behavior Mining

For most purposes, we can assume that individuals adhere (approximately) to the same paths over regular intervals in time. For instance, people usually follow a fixed routine throughout the day; they wake up at the same time, take just about the same route to work and follow daily or weekly chores in a regular way. Therefore, trajectory patterns most likely represent summaries of repeated behavior, in terms of both space (i.e., the regions of space visited during movements) and time (i.e., the duration of movements) [8]. The discovery of hidden periodic movement patterns in spatiotemporal data may violate the privacy of users. Figures 10 and 11 provide examples of the revelation of hidden information about an individual.
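The notion of a stop from Section III.A can be illustrated with a simple stay-point heuristic. This is neither IB-SMoT nor CB-SMoT as run through Weka-STPM; it is a simplified stand-in that treats a stop as a run of consecutive points staying within a radius of the run's first point for a minimum duration. The function name, the 50 m radius and the 300 s duration are assumptions chosen for illustration, and points are assumed to be (x, y, t) tuples in metres and seconds.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float, float]  # (x, y, t), assumed planar metres and seconds
Stop = Tuple[float, float, float, float]  # (centre x, centre y, arrival t, departure t)


def find_stops(points: List[Point], radius: float = 50.0,
               min_duration: float = 300.0) -> List[Stop]:
    """Return runs of consecutive points that stay within `radius` of the
    run's first point for at least `min_duration` seconds."""
    stops: List[Stop] = []
    i, n = 0, len(points)
    while i < n:
        j = i + 1
        while j < n and math.hypot(points[j][0] - points[i][0],
                                   points[j][1] - points[i][1]) <= radius:
            j += 1
        run = points[i:j]  # points[i] and everything close to it that follows
        if run[-1][2] - run[0][2] >= min_duration:
            cx = sum(p[0] for p in run) / len(run)
            cy = sum(p[1] for p in run) / len(run)
            stops.append((cx, cy, run[0][2], run[-1][2]))
            i = j  # continue after the stop
        else:
            i += 1
    return stops
```

Counting how often the returned centres recur across days is then enough to rank candidate regions of interest, which is the inference described above.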


Figure 10: Loss of privacy. It can be inferred that the user is a faculty member at CSE, UCSD and uses faculty parking to park his/her car.

Figure 11: The trajectory paths on weekdays, from 8am to 11am and from 4pm to 9pm, for a user. It can be inferred that the user is a student at CSE, UCSD who commutes by bike and takes the same route most of the time.

C. Predictive Query

Given the recent movements of an individual and the current time, predictive queries ask for the probable location of the individual at some future time. The methods in [30, 32] accurately forecast locations when the forecast time is far away from the current time. Long-term prediction uses previously extracted movement patterns named Trajectory Patterns, which are a concise representation of the behaviors of moving objects as sequences of regions frequently visited within a typical travel time. It has been shown that prediction based on the trajectory patterns of an object is a powerful method [35].

D. Regular Routes Mining

This technique is useful for mining regular (or frequently repeated) routes from users' route sets. It involves the following steps [33]:

Trajectory Similarity: An estimate of the similarity between two trajectories can be obtained by some form of aggregation of distances between trajectory points. Building on this idea, several similarity functions have been developed for different purposes, including Closest-Pair Distance, Sum-of-Pairs Distance [34], Dynamic Time Warping (DTW) [38], Longest Common Subsequence (LCSS) [37], Edit Distance with Real Penalty (ERP) [40], and Edit Distance on Real Sequences (EDR) [41]. Even though some of these similarity functions were initially put forth for time series data, they can also be employed for trajectory data, as trajectories can be viewed as a distinctive type of time series in multi-dimensional space. Figure 12 shows the basic step of breaking a route into frequent directed edges (FDE) to compare two trajectories [33].

Figure 12: Steps involved in converting a route into FDE to convert it into time series data for further computation. Figure taken from [33].

Routes Grouping: We group the routes that are followed by someone at approximately the same times of day and that have high trajectory similarity (as defined above).

Finding Regular Routes: We then mine regular routes from each set of routes. To qualify as a regular route, a route must have been traveled frequently at approximately the same hours. Figure 13 shows regular routes taken by a particular user. The open source code T-Pattern [48] is used to mine the regular trajectory patterns; it uses the algorithm proposed in [23].
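As an illustration of the trajectory similarity step, the sketch below implements Dynamic Time Warping, one of the similarity functions listed above, for two routes given as sequences of planar (x, y) points. It is not the FDE-based comparison of [33]; the function name and the input layout are assumptions made for the example.

```python
import math
from typing import List, Sequence, Tuple

XY = Tuple[float, float]  # planar position in metres (assumed layout)


def dtw_distance(route_a: Sequence[XY], route_b: Sequence[XY]) -> float:
    """Dynamic Time Warping cost between two routes.

    cost[i][j] is the cheapest way to align the first i points of route_a
    with the first j points of route_b; a smaller total means more similar routes.
    """
    n, m = len(route_a), len(route_b)
    inf = float("inf")
    cost: List[List[float]] = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            gap = math.hypot(route_a[i - 1][0] - route_b[j - 1][0],
                             route_a[i - 1][1] - route_b[j - 1][1])
            cost[i][j] = gap + min(cost[i - 1][j],      # route_a advances
                                   cost[i][j - 1],      # route_b advances
                                   cost[i - 1][j - 1])  # both advance
    return cost[n][m]
```

Because points may be matched one-to-many, a route sampled twice as densely as another along the same path still scores as nearly identical, which suits GPS traces recorded at uneven rates. Routes whose pairwise cost falls under a chosen threshold could then be placed in the same group.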

Figure 13: Regular routes taken by a user. Red denotes the most common route taken by the user, followed by green and blue respectively. On a side note, all the stops of the user are localized and point to his/her home (Rita Atkinson Residence) and office (CSE, UCSD) locations.

E. Recognizing Travel Modes

The different travel modes of a route can be recognized. It is observed that public transport stops frequently, and also stops periodically at fixed positions. Therefore, the fixed stop rate (FSR) can be used along with speed variation to recognize the different travel modes. Figure 14 compares the speed variation of 3 users commuting by walking, bike and car. From this figure we can also see the FSR.

Figure 14: Speed variations for different modes of transportation. The top graph shows walking, with a speed of 2-4 miles/hr; the middle graph shows biking, with a speed of 3-7 miles/hr; the bottom graph shows a car, with speeds up to 70 miles/hr.

IV. Trajectory Anonymization

This section discusses privacy-preserving trajectory data publication algorithms. With regard to the difficulties of privacy protection, trajectory data publication differs from continuous LBS data publication in the following ways [45]: (1) Privacy-protecting mechanisms need to be far more scalable for continuous LBS than for trajectory data publication. This is because the anonymization module of a continuous LBS handles an enormous number of real-time location updates at high rates, whereas trajectory data publication can carry out the anonymization process offline. (2) Global optimization techniques can be implemented for trajectory data publication, as its anonymization process can scrutinize the entire (static) trajectory data for optimization possibilities. On the other hand, attaining global optimization is very hard for continuous LBS, because its data arrive at run time and are driven by extremely dynamic, unpredictable user movements.
In the literature there are four major trajectory anonymization techniques for static trajectory data publication, namely clustering-based [6], generalization-based [50], suppression-based [51] and grid-based [30] anonymization approaches. In this project we have used a combination of three techniques, namely clustering-based anonymization, temporal cloaking and ROI anonymization.

A. Clustering based Anonymization

The clustering-based approach [6] utilizes the uncertainty of trajectory data to group k co-localized trajectories within the same time period to form a k-anonymized aggregate trajectory. Given a trajectory T between times t_1 and t_n, i.e., [t_1, t_n], and an uncertainty threshold δ, each location sample in T, p_i = (x_i, y_i, t_i), is modeled by a horizontal disk with radius δ centered at (x_i, y_i). The union of all such disks constitutes the trajectory volume of T, as shown in Figure 15. Two trajectories T_p and T_q defined in [t_1, t_n] are said to be co-localized with respect to δ if the Euclidean distance between each pair of points in T_p and T_q at time t ∈ [t_1, t_n] is less than or equal to δ. An anonymity set of k trajectories is defined as a set of at least k co-localized trajectories. The cluster of k co-localized trajectories is then transformed into an aggregate trajectory in which each location point is the arithmetic mean of the location samples at the same time. The clustering-based anonymization algorithm consists of three main steps, as described in [6]:

1. Pre-processing step. The main task of this phase is to group all trajectories that have the same starting and ending times, i.e., that are in the same equivalence class with respect to time span. To increase the number of trajectories in an equivalence class, given an integer parameter π, all trajectories are trimmed if necessary such that only one timestamp every π can be the starting or ending point of a trajectory.

2. Clustering step. This phase clusters trajectories based on a greedy clustering scheme. For each equivalence class, a set of appropriate pivot trajectories are selected as cluster centers.
For each cluster center, its nearest k − 1 trajectories are assigned to the cluster, such that the radius of the bounding trajectory volume of the cluster is not larger than a certain threshold (e.g., δ/2).

3. Space transformation step. Each cluster is transformed into a k-anonymized aggregate trajectory by moving all points at the same time to the corresponding arithmetic mean of the cluster.

Figure 15: Uncertain trajectory: uncertainty area, trajectory volume and possible motion curve. Figure taken from [6]

Figure 16 shows the trajectory volumes of T_p and T_q, represented by grey dotted lines. The trajectory volume drawn with black lines is a bounding trajectory volume for T_p and T_q. The bounding trajectory volume is then transformed into an aggregate trajectory, represented by the sequence of square markers.

Figure 16: A (2, δ)-anonymity set formed by two co-localized trajectories, their respective uncertainty volumes, and the central cylindrical volume of radius δ/2 that contains both trajectories. Figure taken from [6]

B. ROI based Anonymization

As mentioned earlier, ROIs are regions where a large number of moving objects remain for at least

a given time interval. As shown in previous sections, the main threat in publishing the CitiSense data is the revelation of the home and office locations of the users. Since this information is revealed by analyzing the stops and moves of trajectory data, the easiest way to remedy this kind of privacy threat is to remove from the trajectory data the neighboring regions of stops that satisfy certain criteria, such as the duration of the stop being greater than 30 minutes. For this kind of analysis, two parameter values need to be decided upon: 1) the radius λ_r of the circular area around a stop, and 2) the duration λ_t that qualifies a stop for anonymization. Figure 17 shows the result of applying ROI based anonymization to a user's trip.

Adding semantic analysis: The previous approach can be improved by taking into account the semantics of geographical location. It performs geographical semantic analysis on the stops and tags all the locations as public (like highways, shopping malls, parkways, etc.) or private (residential places, offices, etc.). For publishing the data, we can selectively choose the location data tagged as public.

Increasing utility: To further decrease the amount of data lost by discarding private location data in dense regions, we can take advantage of the notion of k-anonymity. If a sufficient number of data points is available from k or more users within a circular area of radius δ, we can average the readings for that circular area into buckets of minutes or hours and publish them. Publishing private location data in this way preserves our notion of (k, δ)-anonymity and maintains the utility of the CitiSense data for places tagged as private.

C. Temporal Cloaking

All trajectory pattern mining and behavior mining algorithms depend on the successful creation of trips from the raw GPS data. If the GPS data does not contain a user identifier (as in the case of the publicly available CitiSense data), trip segmentation is heavily dependent on temporal patterns.
The idea of temporal cloaking is to blur the user's presence at a location at a particular time by inserting Gaussian noise into the timestamps, so that the linear relation between distance and time no longer holds. Gaussian noise is statistical noise whose probability density function is equal to that of the normal distribution, which is also known as the Gaussian distribution [53]. In other words, the values that the noise can take on are Gaussian-distributed.

P(x) = \frac{1}{\sigma \sqrt{2\pi}} \, e^{-\frac{(x - \mu)^2}{2\sigma^2}}    (2)

Temporal cloaking can drastically reduce the number of trips recovered by trip segmentation and, hence, the information revealed by trajectory data. It is noteworthy that the utility of the CitiSense data is not much affected by introducing an uncertainty of a few minutes in time.

D. Results

Figure 17: Application of ROI based anonymization on a user's trip.

All the analysis done in this section uses the filtered dataset ( readings) and not the compressed dataset. Data compression was needed for making the data mining algorithms computationally efficient. As mentioned in the previous section, the inherent uncertainty parameter (δ) in the NWA algorithm is set to 50 m, while k is set to 2. The percentage of points changed by the NWA algorithm is only about 48%. On further analysis it is seen that most of these points are located in one major region (CSE, UCSD). Hence, NWA does not anonymize the entire dataset. The reason behind this concentration of data points is the co-existence of different users at a given time. This shows a problem in the CitiSense data, i.e., data points are sparsely distributed and there is hardly any other region where CitiSense users coexist. This drawback of NWA is addressed by ROI based anonymization, which is discussed next.

Table 4: Variation in number of stops, trips and active days for each user. This information is further used in ROI based anonymization.
userid    # trips    # stops    # active days

Table 4 shows the number of trips and stops taken by users. Also, the number of stops can be peculiarly greater than the number of trips, as data points sampled in a period may be concentrated in a dense area, leading to no actual movement (see Figure 18).

Figure 18: Concentrated stop, leading to 0 trips.

After recognizing the points that belong to stops, we need to remove them from the dataset for anonymization. Simply removing the points that belong to a stop cluster can still pose a privacy threat, as the removed ROIs can still be extrapolated from the surrounding points that survive. To circumvent this possibility, data points from all stops are first clustered using DBSCAN (density-based spatial clustering of applications with noise). The advantage of using DBSCAN over other clustering algorithms is that it can find arbitrarily shaped clusters. Once the diameter (d) of each cluster is found, all the data points present in the circular area with radius (d/2 + γ), centered at the mean position of the cluster members, are removed. Figures 19 and 20 depict this idea pictorially. The two parameters required by the DBSCAN algorithm are set to ɛ = 100 (i.e., the neighborhood radius: the maximum distance between two points for them to be considered neighbors) and minpts = 200 (i.e., the minimum number of points within that radius required to form a cluster). Also, γ is set to 50 m. Table 5 presents the result after removing the stops in this manner.

Table 5: Loss of data in terms of number of rows.
# rows    # rows after Anonymization    data loss (in %)

From the table, it appears that this approach discards a large share of the data points in the original dataset and hurts the original utility. However, this is not the complete picture. To fully understand why the situation is not as bad as it appears from Table 5, we introduce the concept of coverage.

Coverage: The coverage of a data point can be defined as the area over which the readings from the sensor can be considered the same. For example, in the CitiSense dataset, the CO2 reading at a location x can be treated as the same as the reading at location x + γ for a very small value of γ. Hence the surrounding area can be said to be covered by a single data point. The coverage of the clustered stop data points can be thought of as the circular area with radius (d/2 + ɛ), centered at the mean calculated from the cluster members. Similarly, for each pair of adjacent moving points, the rectangular area spanned by those points can be thought of as the covered area. This is

Figure 19: DBSCAN clustering performed on the stop data points.

Figure 20: To find the coverage loss, areas spanned by stops (circular area) are removed. Total area is calculated by using the notion that if a reading is present at a point, it covers some surrounding area.
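The stop clustering and circular suppression illustrated in Figures 19 and 20 can be sketched as follows. This is a minimal illustration rather than the project's implementation. It assumes that points are already projected to planar coordinates in meters, it takes the cluster diameter d to be the largest pairwise distance between cluster members, and it uses a small brute-force DBSCAN so that the example is self-contained. The names `dbscan`, `removal_zones` and `suppress`, and the sample points, are hypothetical.

```python
import math
from collections import deque


def dbscan(points, eps, min_pts):
    """Label each (x, y) point with a cluster id (0, 1, ...) or -1 for noise."""
    def neighbors(i):
        return [j for j, q in enumerate(points)
                if math.dist(points[i], q) <= eps]

    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = -1            # provisional noise; may become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = deque(seeds)
        while queue:
            j = queue.popleft()
            if labels[j] == -1:
                labels[j] = cluster   # noise reachable from a core point joins it
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nbrs = neighbors(j)
            if len(nbrs) >= min_pts:  # j is a core point, keep expanding
                queue.extend(nbrs)
    return labels


def removal_zones(stop_points, labels, gamma):
    """For each cluster return (center, radius) with radius = d/2 + gamma."""
    zones = []
    for c in sorted(set(labels) - {-1}):
        members = [p for p, l in zip(stop_points, labels) if l == c]
        cx = sum(x for x, _ in members) / len(members)
        cy = sum(y for _, y in members) / len(members)
        # Assumption: d is the largest pairwise distance within the cluster.
        d = max(math.dist(p, q) for p in members for q in members)
        zones.append(((cx, cy), d / 2 + gamma))
    return zones


def suppress(points, zones):
    """Drop every point that falls inside any removal zone."""
    return [p for p in points
            if all(math.dist(p, c) > r for c, r in zones)]


if __name__ == "__main__":
    stops = [(0, 0), (10, 0), (0, 10), (10, 10), (5, 5), (1000, 1000)]
    labels = dbscan(stops, eps=100, min_pts=3)
    zones = removal_zones(stops, labels, gamma=50)
    trajectory = stops + [(40, 5), (300, 300)]
    print(suppress(trajectory, zones))  # -> [(1000, 1000), (300, 300)]
```

In the example, the point (40, 5) lies outside the stop cluster itself but inside the extra margin γ, so it is removed as well. The margin exists so that the suppressed region cannot be easily extrapolated from the points that survive around it.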

also shown pictorially in Figure 22. Further, Figure 21 shows this concept on a real trajectory.

Figure 21: The rectangular boxes around the trajectory path depict areas covered by the data points.

Figure 22: The circular area and rectangular boxes around the trajectory path depict areas covered by the data points.

Note that in the CitiSense data, utility is directly related to coverage, not to the number of data points. Using this notion of coverage, the approximate coverage loss for ROI based anonymization is calculated and shown in Table 6. Hence ROI based anonymization hurts utility by only 6%.

Table 6: Loss of data in terms of area coverage. Area is in m².
Area covered    Area covered after Anonymization    coverage loss (in %)

Interestingly, the coverage calculated above does not need to take into account the overlap of different users' trajectories. This is because, if there are overlapping trajectories, we can apply the NWA algorithm on those points, thus keeping the utility unaltered.

We applied temporal cloaking on the dataset obtained by applying the above anonymization techniques. The Gaussian parameters µ and σ for temporal cloaking are set to 600 seconds and 1 respectively. Preprocessing this transformed dataset created 54% fewer trips than our previous analysis on the non-anonymized dataset. This further reduces the information that can be gained (for example, regular routes, mode of transportation, etc.).

V. Conclusion

Lately, it has been recognized in [7] and in many other works that k-anonymity alone does not put us on the safe side: although an individual is hidden in a group, if the group does not have enough diversity in its sensitive attributes, an attacker can still associate the individual with sensitive information. However, in the context of moving object data the problem is very challenging, because location is a particular kind of information that can be considered both sensitive and a quasi-identifier at the same time. Moreover, the major privacy concern, the identification of locations private to the user, is resolved by the ROI based anonymization method with a mere 6% loss in coverage. Another concern, the lack of effectiveness of the clustering based anonymization technique noted in the results, will disappear when the data becomes denser (more precisely, when each region has more than one CitiSense user present at approximately the same time). Temporal cloaking needs more analysis in order to derive rigorous mathematical guarantees of immunity against attackers. Another interesting area to explore is the continuous publication of real-time CitiSense data. This is a relatively new field and worth exploring in the context of CitiSense.

VI. Acknowledgement

I would like to thank Prof. Bill Griswold, Department of Computer Science, for his constant support and guidance throughout the course of this project. I would also like to thank Prof. Sanjoy Dasgupta and Prof. Hovav Shacham for their valuable inputs. Last but certainly not least, I am grateful for the kind assistance and cooperation of Nima Nikzad and Celal Ziftci in helping me obtain the CitiSense data.

References

[1] C. Bettini, X. S. Wang, and S. Jajodia, "Protecting Privacy Against Location-Based Personal Identification." in Proc. of the Second VLDB Workshop on Secure Data Management (SDM 05).
[2] P. Samarati and L. Sweeney, "Generalizing data to provide anonymity when disclosing information (abstract)," in Proc. of the 17th ACM Symp. on Principles of Database Systems (PODS 98).
[3] O. Wolfson, S. Chamberlain, S. Dao, L. Jiang, and G.
Mendez, "Cost and imprecision in modeling the position of moving objects." in Proc. of the 14th IEEE Int. Conf. on Data Engineering (ICDE 98).
[4] G. Trajcevski, O. Wolfson, K. Hinrichs, and S. Chamberlain, "Managing uncertainty in moving objects databases." ACM Trans. Database Syst., vol. 29, no. 3, pp ,
[5] D. Pfoser and C. S. Jensen, "Capturing the uncertainty of moving-object representations." in Proc. of the 6th International Symp. on Advances in Spatial Databases (SSD 99).
[6] Osman Abul, Francesco Bonchi, Mirco Nanni, Never Walk Alone: Uncertainty for Anonymity in Moving Objects Databases, Proceedings of the 2008 IEEE 24th International Conference on Data Engineering, p , April 07-12, 2008
[7] _projection
[8] Y. Zheng, X. Zhou, Computing with spatial trajectories. Springer ISBN:
[9] Douglas, D., Peucker, T.: Algorithms for the Reduction of the Number of Points Required to Represent a Line or its Caricature. The Canadian Cartographer 10(2), (1973)
[10] Francesco Bonchi, Laks V.S. Lakshmanan, Hui (Wendy) Wang, Trajectory anonymity in publishing personal mobility data, ACM SIGKDD Explorations Newsletter, v.13 n.1, June 2011
[11] Spaccapietra, S., Parent, C., Damiani, M. L., Macedo, J. A., Porto, F., Vangenot, C.: A Conceptual View on Trajectories. Data and Knowledge Engineering (DKE)
[12] L. O. Alvares, V. Bogorny, B. Kuijpers, J. A. F. de Macedo, B. Moelans, and A. Vaisman. A model for enriching trajectories with semantic geographical information. In ACM-GIS, pages , New York, NY, USA, ACM Press
[13] Nanni, M., Pedreschi, D.: Time-focused clustering of trajectories of moving objects. Journal of Intelligent Information Systems 27(3) (2006)
[14] M. Chen, M. Xu and P. Franti, "Compression of GPS trajectories", Proc. IEEE Data Compression Conf., pp
[15] Hershberger, J., Snoeyink, J.: Speeding up the Douglas-Peucker Line simplification Algorithm. In: International Symposium on Spatial Data Handling, pp (1992)
[16] Dateline NBC: Tracing a stalker. (2007)
[17] FoxNews: Man accused of stalking ex-girlfriend with GPS. story/0,2933,131487,00.html (2004)
[18] USAToday: Authorities: GPS system used to stalk woman. com/tech/news/ gps-stalker_x.htm (2002)
[19] Voelcker, J.: Stalked by satellite: An alarming rise in GPS-enabled harassment. IEEE Spectrum 47(7), (2006)
[20] Yan, Z., (2009), "Towards Semantic Trajectory Data Analysis: A Conceptual and Computational Approach". VLDB 09, Lyon, France.
[21] L.O. Alvares, A. Palma, G. Oliveira, and V. Bogorny, "Weka-STPM: From Trajectory Samples to Semantic Trajectories", Proceedings of the XI Workshop de Software Livre, WSL 10, Porto Alegre, Brazil, 2010, pp
[22] Gedik, B., and Liu, L. Location Privacy in Mobile Systems: A Personalized Anonymization Model. In Proc. of the 25th Int. Conf. on Distributed Computing Systems (ICDCS 05).
[23] Norma Saiph Savage, Shoji Nishimura, Norma Elva Chavez, and Xifeng Yan. Frequent trajectory mining on GPS data. In Proceedings of the 3rd International Workshop on Location and the Web (LocWeb 10). ACM, New York, NY, USA.
[24] Meratnia, N., de By, R.: Spatio-Temporal Compression Techniques for Moving Point Objects. In: International Conference on Extending Database Technology (EDBT), pp (2004)
[25] Potamias, M., Patroumpas, K., Sellis, T.: Sampling Trajectory Streams with Spatio-Temporal Criteria. In: International Conference on Scientific and Statistical Database Management (SSDBM), pp (2006)
[26] Ye Qian, Chen Ling, Chen Gencai. Personal continuous route pattern mining [J]. Journal of Zhejiang University, 2009, 10(2):
[27] Lee, J.-G., and Han, J. Trajectory clustering: A partition-and-group framework. In Proc. of the 2007 ACM SIGMOD Int. Conf. on Management of Data (SIGMOD 07) (2007), pp
[28] Lee, J.-G., Han, J., and Li, X. Trajectory outlier detection: A partition-and-detect framework. In Proc. of the 24th IEEE International Conference on Data Engineering (ICDE 08) (2008).
[29] Lee, J.-G., Han, J., Li, X., and Gonzalez, H. TraClass: Trajectory classification using hierarchical region-based and trajectory-based clustering. In Proc. of the 34th Int. Conf. on Very Large Databases (VLDB 08) (2008).
[30] Gidofalvi, G., Huang, X., Pedersen, T.B.: Privacy-preserving data mining on moving object trajectories. In: Proceedings of the International Conference on Mobile Data Management (2007)
[31] Gruteser, M., and Grunwald, D. Anonymous Usage of Location-Based Services Through Spatial and Temporal Cloaking. In Proc. of the First Int. Conf. on Mobile Systems, Applications, and Services (MobiSys 2003).
[32] Freudiger, J., Raya, M., Felegyhazi, M., Papadimitratos, P., Hubaux, J.P.: Mix-zones for location privacy in vehicular networks. In: Proceedings of the International Workshop on Wireless Networking for Intelligent Transportation Systems (2007)
[33] Wen He, Deyi Li, Tianlei Zhang, Mu Guo, Lifeng An. Mining Regular Routes from GPS Data for Ridesharing Recommendation.
[34] Agrawal, R., Faloutsos, C., Swami, A.N.: Efficient similarity search in sequence databases. FODO pp (1993)
[35] Anna Monreale, Fabio Pinelli, Roberto Trasarti, Fosca Giannotti, WhereNext: a location predictor on trajectory pattern mining, Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining, June 28-July 01, 2009, Paris, France
[36] Jeung, H., Liu, Q., Shen, H. T., and Zhou, X. A hybrid prediction model for moving objects. In Proc. of the 24th IEEE International Conference on Data Engineering (ICDE 08) (2008).
[37] Zheng, Y., Zhang, L., Xie, X., Ma, W.Y.: Mining interesting locations and travel sequences from GPS trajectories. WWW (2009)
[38] Yi, B.K., Jagadish, H., Faloutsos, C.: Efficient retrieval of similar time sequences under time warping. ICDE (1998)
[39] Kido, H., Yanagisawa, Y., and Satoh, T. An Anonymous Communication Technique using Dummies for Location-based Services. In Proc. of the Third Int. Conf. on Pervasive Computing (Pervasive 2005) (2005), pp
[40] Chen, Z., Shen, H.T., Zhou, X., Zheng, Y., Xie, X.: Searching trajectories by locations - an efficiency study. SIGMOD (2010)
[41] Chen, L., Ozsu, M.T., Oria, V.: Robust and fast similarity search for moving object trajectories. SIGMOD (2005)
[42] Li, X., Han, J., Kim, S., and Gonzalez, H. Anomaly detection in moving object.
[43] Li, X., Han, J., Lee, J.-G., and Gonzalez, H. Traffic density-based discovery of hot routes in road networks.
[44] Mamoulis, N., Cao, H., Kollios, G., Hadjieleftheriou, M., Tao, Y., and Cheung, D. W.: Mining, indexing, and querying historical spatiotemporal data.
[45] Chow, Chi-Yin: Trajectory Privacy in Location-based Services and Data. In: ACM SIGKDD Explorations Newsletter 13 (2011), Nr. 1,
[46] Mokbel, M. F., Chow, C.-Y., and Aref, W. G. Casper: Query processing for location services without compromising privacy. In Proceeding of the 32nd International Conference on Very Large Databases (VLDB 06)
[47] Mokbel, M. F., Chow, C.-Y., and Aref, W. G. The new casper: A privacy-aware location-based database server. In Proc. of the 23rd IEEE International Conference on Data Engineering (ICDE 07).
[48]
[49] Nanni, M., and Pedreschi, D. Time-focused clustering of trajectories of moving objects. Journal of Intelligent Information Systems 27, 3 (2006),
[50] Nergiz, M.E., Atzori, M., Saygin, Y., Güç, B.: Towards trajectory anonymization: A generalization-based approach. Transactions on Data Privacy 2(1), (2009)
[51] Terrovitis, M., Mamoulis, N.: Privacy preservation in the publication of trajectories. In: Proceedings of the International Conference on Mobile Data Management (2008)
[52] Ester, M., Kriegel, H.P., Sander, J., Xu, X.: A density-based algorithm for discovering clusters in large spatial databases with noise. In: Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp (1996)
[53] _distribution


Publishing CitiSense Data: Privacy Concerns and Remedies Kapil Gupta Advisor : Prof. Bill Griswold 1 Location Based Services Great utility of location based services data traffic control, mobility management,