Mining Frequent Trajectory Using FP-tree in GPS Data


Journal of Computational Information Systems 9: 16 (2013)

Junhuai LI 1,*, Jinqin WANG 1, Hailing LIU 2, Lei YU 1, Jing ZHANG 1
1 School of Computer Science & Engineering, Xi'an University of Technology, Xi'an, China
2 College of Electronic Information Engineering, Chongqing University of Science and Technology, Chongqing, China

Project supported by the National Nature Science Foundation of China (No ), the Science and Research Plan Project of Shaanxi Province (No. 2011NXC01-12), and the Science and Research Plan Project of Shaanxi Province Department of Education (No. 2010JC15).
* Corresponding author. Email address: lijunhuai@xaut.edu.cn (Junhuai LI).
Copyright 2013 Binary Information Press. DOI: /jcisP0732. August 15, 2013.

Abstract

The pervasiveness of location-acquisition technologies makes it convenient to collect the movement data of moving objects, and the spatial-temporal information contained implicitly in historical trajectories reveals important knowledge about movement behaviors. This paper presents a novel frequent trajectory mining method using FP-Tree. Most existing approaches transform trajectories into sequences of popular region-ids using a statically predefined grid of equally sized cells, and then merge popular cells into larger popular regions. However, because the size of these popular regions is not limited, the movements of objects inside a region may be lost, and a predefined grid may lack adaptability. This study defines a Boundary Function to limit the maximum size of the popular regions and selects the size of the grid dynamically by defining a distance threshold d. Then, an improved FP-Tree algorithm is proposed to mine frequent trajectories. The experimental results show our method is efficient.

Keywords: Trajectory; Frequent Trajectory Mining; FP-tree; GPS Data

1 Introduction

The fast development of tracking technologies and the rapid improvement of location-based services have made it convenient to collect a large amount of time-stamped location data of moving objects. Moving objects include people carrying GPS-enabled mobile phones or PDAs and vehicles with navigation equipment. For example, GPS-equipped portable devices can record their latitude-longitude position at fixed time intervals and transmit it to a collecting server.

The spatial-temporal information contained in the historical trajectories of moving objects can succinctly provide useful information. By analyzing the historical travel data of tourists over a long time, we can find the places of interest (the places tourists visit most) or the most popular routes, so that recommendations can be given. The frequent movements of vehicles during a period of time can help drivers select the best route. Such knowledge is also valuable for urban planning and urban computing. As a result, mining frequent trajectories has attracted increasing attention recently.

In this paper, we present an FP-tree-based method for mining frequent trajectories in GPS data. In general, due to noise and the limitations of location-acquisition equipment, the location information (e.g., coordinate values in 2D) recorded for a fixed place is not exactly the same each time. Therefore, although two GPS sequences represent the same route, their values will rarely be identical. To solve this problem, we first define a distance threshold d and divide the trajectory area into cells of equal size $d \times d$. According to the density of these cells, and given a minimum support $S_{min}$, the popular cells can be extracted. Then, we merge these popular cells into larger popular regions under the constraint of a Boundary Function. Finally, the original trajectories can be transformed into sequences of popular region-ids.

The other problem is detecting frequent trajectories. Most studies adopt the sequential pattern mining paradigm, but pattern discovery techniques for transactional databases are not readily applicable to finding trajectory patterns. Because it avoids candidate generation, the FP-tree proposed in [1] is highly efficient at mining frequent patterns. In this paper, we take the continuity of trajectories into account and make some improvements to adapt the FP-tree to trajectory mining. Overall, frequent trajectory mining in this paper contains three phases: (1) create an index structure for each frequent spatial region; (2) traverse these segments to construct an FP-tree; (3) extract frequent trajectories.
The remainder of the paper is organized as follows: Section 2 discusses the related work. Section 3 describes trajectory preprocessing. Section 4 describes the proposed algorithm in detail. Section 5 illustrates the results and performance of our method. Section 6 presents the conclusion.

2 Related Works

Recently, much research has been done on trajectory analysis. Giannotti et al. introduced the concept of Region-of-Interest (RoI) in [2]. To compute popular points efficiently, they discretize the trajectory space through a regular grid with cells of small size. The density of the cells is computed by taking each single trajectory and incrementing the density of all the cells that contain any of its points. Popular cells are then detected and merged into RoIs, and trajectory patterns are detected using TAS (Temporally Annotated Sequences). Kang et al. first approximated the original trajectories by simplified line segments, transformed them into sequences of spatio-temporal regions by incorporating temporal constraints, and finally proposed a prefix-projection approach to extract frequent spatio-temporal patterns [3]. A framework was put forward in [4] to analyze, manage, and query frequent periodic patterns in spatio-temporal data. It adopts a density-based algorithm to find the pattern regions and proposes two methods to find longer patterns: a bottom-up, level-wise technique and a faster top-down approach. Lee et al. presented a GBM (graph-based mining) algorithm for mining frequent trajectory patterns [5]. They first scan the database once to generate a mapping graph and trajectory information lists (TI-lists), and then traverse the mapping graph in a depth-first manner to mine all frequent trajectory patterns. However, they do not address the preprocessing of trajectories. Savage et al. selected the K most frequent edges and combined them to create a list of the most frequent paths [6]. Akasapu et al. use Apriori-TFP, where the T-tree finally contains all frequent sets with their complete support-counts [7].

3 Trajectory Preprocessing

A trajectory of a moving object is a temporally ordered sequence of triples $T = \langle (x_0, y_0, t_0), (x_1, y_1, t_1), \ldots, (x_n, y_n, t_n) \rangle$, where $t_i$ $(i = 0, \ldots, n)$ is a time stamp with $t_i < t_{i+1}$ for all $0 \le i < n$, and $(x_i, y_i)$ are coordinates in two-dimensional space.

Due to noise and the limitations of location-acquisition equipment, even though two GPS sequences represent the same route, it is highly unlikely that they have identical location values. In addition, matching the items of a sequence in standard sequential pattern mining requires simple equality tests between symbols. Therefore, in order to use approaches based on sequential pattern mining, the coordinate values of a trajectory should be discretized prior to the mining process.

To discretize the trajectory data, we take a method similar to that in [2]. First, we discretize the trajectory space through a dynamically sized grid with cells of small size. All cells have the same fixed width and height. Then, we map each GPS point of a trajectory to the cell that contains it. As a result, a trajectory is converted to a sequence of cell ids, and the density of every cell that contains a GPS point of the trajectory is incremented. Note that a trajectory touching a cell multiple times is counted only once. After all the trajectories have been matched, we know the density of each cell. If the density exceeds a user-defined threshold, we call the cell a popular cell.

Generally, the cell size is very small, so the number of popular cells can be extremely large. To solve this problem, the method in [2] merges cells into larger regions. Because it considers only the average density of the cells of each region, the final results may contain very large regions, as shown in Figure 1a. Therefore, the movements inside such a region may be lost.
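The discretization step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names (`cell_of`, `cell_densities`, `popular_cells`) are hypothetical, the grid origin is assumed to be the minimum x and y over all points, and a cell is treated as popular when its density is at least the threshold, in line with the $\ge S_{min}$ condition used in the definitions.

```python
from collections import Counter
from math import floor


def cell_of(x, y, x_min, y_min, d):
    """Map a point to the (column, row) index of the d-by-d cell containing it."""
    return (floor((x - x_min) / d), floor((y - y_min) / d))


def cell_densities(trajectories, d):
    """Count, for every cell, how many trajectories touch it.

    Each trajectory is a list of (x, y, t) triples. A trajectory that visits
    the same cell several times is counted once for that cell.
    """
    # Assumption: the grid is anchored at the smallest coordinates in the data.
    x_min = min(x for traj in trajectories for x, _, _ in traj)
    y_min = min(y for traj in trajectories for _, y, _ in traj)
    density = Counter()
    for traj in trajectories:
        touched = {cell_of(x, y, x_min, y_min, d) for x, y, _ in traj}
        density.update(touched)  # one increment per distinct cell
    return density


def popular_cells(density, s_min):
    """Cells whose density reaches the minimum support (assumed to be >=)."""
    return {cell for cell, count in density.items() if count >= s_min}


# Three toy trajectories (coordinates are made up for illustration).
toy = [
    [(0, 0, 0), (150, 20, 1), (160, 30, 2)],
    [(10, 10, 0), (140, 90, 1)],
    [(250, 250, 0)],
]
toy_density = cell_densities(toy, d=100)
toy_popular = popular_cells(toy_density, s_min=2)
```

Because the grid dimensions follow from the extent of the data and from d, nothing about the grid has to be fixed in advance, which is the adaptivity the text refers to.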
To remedy this deficiency, we introduce a Boundary Function, which is defined in the following way:

$B(m, n):\; x_{max} - x_{min} \le m,\quad y_{max} - y_{min} \le n \qquad (1)$

The Boundary Function is used to limit the maximum size of a popular region. The definition of the popular region set can then be restated as follows:

Definition 1 (Popular region set) Given a trajectory database and a distance threshold d, the trajectory space can be divided into a grid $\zeta$ of $n \times m$ cells, each cell with its density $\zeta(i, j)$ $(1 \le i \le n,\ 1 \le j \le m)$, and let $S_{min}$ be a minimum support. A popular region set for $\zeta$ is a collection R of sets of cells from $\zeta$, such that:
(i) each $r \in R$ forms a rectangular region;
(ii) sets in R are pairwise disjoint;
(iii) all popular cells in $\zeta$ are contained in some set $r \in R$;
(iv) all $r \in R$ have $\mathrm{avg}_{(i, j) \in r}\, \zeta(i, j) \ge S_{min}$;
(v) assuming that $r \in R$ has size $h \times k$, all its rectangular supersets $r' \supset r$ of size $(h + 1) \times k$ or $h \times (k + 1)$ violate (iv), or $r$ and $r'$ contain exactly the same number of popular cells;
(vi) given the Boundary Function $B(m, n)$, $h \le m$ and $k \le n$.

It should be noted that the grid $\zeta$ of $n \times m$ cells is not statically pre-defined. Figure 1b shows the results with the Boundary Function. We can see that the large regions are divided into several small ones.
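Conditions (iv) and (vi) of Definition 1 can be checked for a candidate rectangle with a few lines of code. The sketch below is illustrative only: `region_ok` is a hypothetical helper, a region is assumed to be given as inclusive cell-index bounds `(i_min, i_max, j_min, j_max)`, the first index is assumed to run along the x axis, and cells absent from the density map are taken to have density 0.

```python
def region_ok(density, region, s_min, m, n):
    """Check the size bound (vi) and the average-density bound (iv).

    density: dict mapping (i, j) cell indices to trajectory counts.
    region:  (i_min, i_max, j_min, j_max), inclusive cell-index bounds.
    """
    i_min, i_max, j_min, j_max = region
    h = i_max - i_min + 1  # extent along x, in cells
    k = j_max - j_min + 1  # extent along y, in cells
    if h > m or k > n:     # Boundary Function B(m, n): h <= m and k <= n
        return False
    total = sum(
        density.get((i, j), 0)
        for i in range(i_min, i_max + 1)
        for j in range(j_min, j_max + 1)
    )
    return total / (h * k) >= s_min


# A row of three dense cells; every other cell is empty.
row_density = {(0, 0): 4, (1, 0): 4, (2, 0): 4}
within_bound = region_ok(row_density, (0, 2, 0, 0), s_min=3, m=5, n=5)
too_wide = region_ok(row_density, (0, 2, 0, 0), s_min=3, m=2, n=5)
too_sparse = region_ok(row_density, (0, 1, 0, 1), s_min=3, m=5, n=5)
```

In this toy case the 3-by-1 row is accepted under B(5, 5), rejected under a bound of 2 cells in x, and the 2-by-2 block is rejected because its average density falls below the minimum support. A region-growing procedure would call such a check before each expansion step, so that a dense area is covered by several bounded regions instead of one large one.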

(a) the results with method in [2]   (b) the results with Boundary Function
Fig. 1: Extracted RoIs

After all the RoIs have been found and each has been assigned an id, the original trajectory sequences can be transformed into lists of RoI-id sequence segments. The rules are as follows:
(1) Split a trajectory at every point that is not covered by any RoI, and delete that point.
(2) Delete the repeated trajectory segments of each trajectory.

Definition 2 (Frequent Trajectories Mining (FTM)) Given a database of input trajectories D, a distance threshold d, a minimum support $S_{min}$, and a Boundary Function $B(m, n)$, the FTM problem consists of finding all frequent trajectories T such that $Support(T) \ge S_{min}$, where $Support(T)$ is the support value of T in D.

4 FP-Tree-Based Frequent Trajectory Mining

The FP-Tree proposed in [1] is highly efficient at mining frequent patterns. However, pattern discovery techniques for transactional databases are not readily applicable to finding trajectory patterns. In this paper, we take the continuity of trajectories into account and make some improvements. In this section, we discuss the proposed FTTBM (FP-Tree-based mining) algorithm for mining frequent trajectories. FTTBM comprises three phases: (1) create an index structure for each frequent spatial region; (2) traverse these segments to construct an FP-Tree; (3) extract frequent trajectories.

4.1 FP-tree construction

In the FTTBM method, each node of the FP-tree contains six fields: the parent field, which points to the parent node; the data contained in the node; the support of the node; the object id that most recently generated the node; all the object ids that have generated the node; and the refe field, which points to the next node that contains the same data. In addition, in order to facilitate the traversal of the tree, we create a header table Htable, which consists of two fields: itemname and itemhead.
Each RoI is linked, via its itemhead, to its first occurrence in the tree.

Let us give an example to illustrate the construction of the FP-Tree. Suppose Table 1 is the trajectory database after clustering, and the minimum support threshold is 3. From the database shown in Table 1, we obtain all RoIs: < (1 : 3), (2 : 3), (3 : 3), (4 : 3), (5 : 3) >, where the field before the colon is the data and the field after it is the support value.

Table 1: The trajectories consist of popular region-ids

Object id    Trajectory
1            1,2,3,4,5,6,7,8,1,3,5,3,2
2            2,4,5,7,6,1,2,3,4
3            1,2,3,4,5,8,2,3

Through the data preprocessing described in Section 3, we obtain the results in Table 2. Next, we create an Htable for all RoIs (i.e., < (1), (2), (3), (4), (5) >), as shown in Table 3. Then a root node is created and initialized to null. The scan of the first trajectory segment leads to the construction of the first branch of the tree: < (1 : 1), (2 : 1), (3 : 1), (4 : 1), (5 : 1) > of object 1. For the second segment < 1, 3, 5, 3, 2 >, although it shares the common prefix < 1 > with the existing path < 1, 2, 3, 4, 5 >, the support of node < 1 > is not increased, because the segment also belongs to object 1. For the next segment < 2, 4, 5 >, since it shares no common prefix with the existing tree, a new branch is created. For the segment < 1, 2, 3, 4 >, since it shares the common prefix < 1, 2, 3, 4 > with the segment < 1, 2, 3, 4, 5 > of object 1, the support of each node along the prefix is increased by 1. For the last segment < 1, 2, 3, 4, 5 >, since it is identical to the first segment, the support of each node along the segment is increased by 1.
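The walk-through above can be reproduced with a short sketch. It is not the authors' code: the names `Node`, `segment`, `insert_segment` and `build_tree` are hypothetical, the `children` map is an added lookup helper that is not one of the six node fields, and rule (2) of Section 3 is read as "drop a segment that already occurs as a consecutive run inside an earlier segment of the same trajectory", which is an assumption chosen because it matches the segments listed for the three objects.

```python
class Node:
    """Tree node with the six fields of Section 4.1 plus a child lookup map."""

    def __init__(self, data, parent=None):
        self.data = data      # RoI id stored in the node
        self.parent = parent  # parent node
        self.support = 0      # number of distinct objects that reached the node
        self.last_id = None   # object id that most recently passed through
        self.ids = set()      # all object ids that generated the node
        self.ref = None       # next node with the same data (the paper's refe)
        self.children = {}    # helper for lookup, not one of the six fields


def contains(long, short):
    """True if the tuple `short` occurs in the tuple `long` as a consecutive run."""
    n = len(short)
    return any(long[i:i + n] == short for i in range(len(long) - n + 1))


def segment(trajectory, rois):
    """Rule (1): split at ids outside the RoI set. Rule (2): drop repeats (assumed reading)."""
    segments, current = [], []
    for rid in trajectory:
        if rid in rois:
            current.append(rid)
        elif current:
            segments.append(tuple(current))
            current = []
    if current:
        segments.append(tuple(current))
    kept = []
    for seg in segments:
        if not any(contains(other, seg) for other in kept):
            kept.append(seg)
    return kept


def insert_segment(root, seg, obj_id, htable):
    """Insert one segment; a node's support grows only for a new object id."""
    node = root
    for rid in seg:
        child = node.children.get(rid)
        if child is None:
            child = Node(rid, parent=node)
            node.children[rid] = child
            if htable[rid] is None:      # first occurrence: set the itemhead
                htable[rid] = child
            else:                        # otherwise append to the refe chain
                tail = htable[rid]
                while tail.ref is not None:
                    tail = tail.ref
                tail.ref = child
        if obj_id not in child.ids:
            child.ids.add(obj_id)
            child.support += 1
        child.last_id = obj_id
        node = child


def build_tree(trajectories, rois):
    """Build the tree and header table from {object id: region-id sequence}."""
    root = Node(None)
    htable = {rid: None for rid in sorted(rois)}
    for obj_id, trajectory in trajectories.items():
        for seg in segment(trajectory, rois):
            insert_segment(root, seg, obj_id, htable)
    return root, htable


table1 = {
    1: [1, 2, 3, 4, 5, 6, 7, 8, 1, 3, 5, 3, 2],
    2: [2, 4, 5, 7, 6, 1, 2, 3, 4],
    3: [1, 2, 3, 4, 5, 8, 2, 3],
}
root, htable = build_tree(table1, rois={1, 2, 3, 4, 5})
```

On the data of Table 1, the nodes along the path 1, 2, 3, 4 end with support 3 and the node 5 below them with support 2, while the branch 2, 4, 5 contributed by object 2 has support 1 throughout, which agrees with the walk-through.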
Table 2: The trajectories after preprocessing

Object id    Trajectory segments
1            (1,2,3,4,5) (1,3,5,3,2)
2            (2,4,5) (1,2,3,4)
3            (1,2,3,4,5)

Table 3: Htable (columns: Itemname, Itemhead)

The algorithm for constructing the FP-Tree is as follows:

Algorithm: Construct T-tree
Input: The trajectory segments after preprocessing and a minimum support threshold
Output: T-tree T
(1) Create an Htable for all RoIs;
(2) Create a root node, and initialize it to Null;
(3) Scan the trajectory segments [p | P] of each object, where p is the first element of the segment and P is the remaining part;
(4) Call insert_tree([p | P], T, ids), which is implemented as follows. If T has a child node N such that N.data = p, then, provided the current object id is not included in ids, the support value of N is increased by 1 and ids = ids + id.toString(); if the id is already included, N is left unchanged. Otherwise, create a new node N with N.data = p, let N.parent point to its previous node in the segment, let N.refe point to the next node that has the same data, and set N.ids to the current object id. If P != Null, recursively call insert_tree(P, N, ids).

The construction process is illustrated in Figure 2.

[Figure 2, panels (a) to (e): the tree after the insertion of each trajectory segment in turn.]
Fig. 2: FP-Tree construction

4.2 Frequent trajectory mining

For each itemname in Htable, we scan the tree to find the set of trajectory sequences that end with it. Since the support of each sequence is known after the construction of the tree, we consider the consecutive subsequences of each trajectory sequence and calculate their support within the set. The key observation is that the support value of the last symbol represents the support of the whole sequence.

Let us consider the example above again. After obtaining the tree in Figure 2, we set $S_{min} = 3$ and initialize the frequent trajectory set FT to null. Suppose itemname is 5; we traverse the tree to find the set < (1, 2, 3, 4, 5), (2, 4, 5) >. For the first segment (1, 2, 3, 4, 5), we consider the subsequence (4, 5). Its support is 2, since objects 1 and 3 contain it, but the segment (2, 4, 5) of object 2 also contains it, so its final support value is 3 and (4, 5) is added to FT. For the second subsequence (3, 4, 5), the support value is 2, according to the support of symbol 5 at the end of the segment. The subsequences (2, 3, 4, 5) and (1, 2, 3, 4, 5) also have support 2. For the next subsequence (4, 5) of (2, 4, 5), a scan of FT shows that the frequent trajectory (4, 5) already exists. Therefore, we consider the next subsequence (2, 4, 5), whose support is 1.
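The support values in this example can be cross-checked with a brute-force count over the preprocessed segments. The sketch below is deliberately not the tree traversal described in the paper: it enumerates every consecutive subsequence directly and counts each object at most once. The names `support_counts` and `frequent_trajectories` are hypothetical, and restricting trajectories to at least two regions is an assumption.

```python
def support_counts(segments_by_object):
    """Support of every consecutive subsequence of length >= 2.

    segments_by_object maps an object id to its list of trajectory segments.
    An object contributes at most 1 to the support of a subsequence, however
    many of its segments contain it.
    """
    support = {}
    for segments in segments_by_object.values():
        seen = set()
        for seg in segments:
            for i in range(len(seg)):
                for j in range(i + 2, len(seg) + 1):
                    seen.add(tuple(seg[i:j]))
        for sub in seen:
            support[sub] = support.get(sub, 0) + 1
    return support


def frequent_trajectories(segments_by_object, s_min):
    """Subsequences whose support is at least s_min."""
    counts = support_counts(segments_by_object)
    return {sub: s for sub, s in counts.items() if s >= s_min}


# The preprocessed segments of the three objects in the running example.
table2 = {
    1: [(1, 2, 3, 4, 5), (1, 3, 5, 3, 2)],
    2: [(2, 4, 5), (1, 2, 3, 4)],
    3: [(1, 2, 3, 4, 5)],
}
counts = support_counts(table2)
frequent = frequent_trajectories(table2, s_min=3)
```

The counts agree with the values derived in the text: (4, 5) has support 3, (3, 4, 5) and (1, 2, 3, 4, 5) have support 2, and (2, 4, 5) has support 1. The brute-force version touches every subsequence of every segment, which is the work the tree-based procedure is meant to reduce by sharing common prefixes.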
The algorithm FTM for extracting frequent trajectories is as follows:

Algorithm: FTM()
Input: T-tree, minimum support
Output: the set of frequent trajectories FT
(1) FT = $\emptyset$; // the set of frequent trajectories
(2) For each item name in Htable, scan the T-tree to find the set of all the sequences that end with the item name. Denote this set SeriesList(item name).
(3) For each element in SeriesList(item name), get its consecutive subsequences that end with the item name. Scan SeriesList(item name) to calculate the support of each subsequence.
(4) The rule for calculating the support is: if a sequence $s_1$ is a consecutive subsequence of another sequence $s_2$ in SeriesList(item name), then call the function Dis($s_1$, $s_2$). If it returns n, then the support of $s_1$ is increased by n.
(5) If the support of $s_1$ is at least the given minimum support and $s_1$ is not already contained in FT, then FT = FT $\cup$ {$s_1$}.

Dis($s_1$, $s_2$)
(1) Get the ids of the first element in $s_1$ and $s_2$ respectively.
(2) Compare the two id sets, count the ids of $s_2$ that are not among the ids of $s_1$, and return that number.

5 Experiments

In this section we summarize the results of a set of experiments on real data. The data describe the trajectories of 2 school buses collecting (and delivering) students around the Athens metropolitan area in Greece on 108 distinct days; the data of one day of each bus forms a trajectory, as shown in Figure 3a. In order to make the results more intuitive, we use only the compacted data.

It should be noted that the grid $\zeta$ of $n \times m$ cells is not statically pre-defined. The values of n and m depend on the spatial extent of the trajectory database and on the distance threshold d. This property makes the grid adaptive. To evaluate our approach, we set the parameters as follows: distance threshold d = 100, minimum support $S_{min}$ = 10, Boundary Function B(5, 5).
The frequent trajectories extracted with the two trajectory preprocessing methods discussed in Section 3 are shown in Figures 3b and 3c. With the Boundary Function, the movements of objects inside a large popular region can be extracted.

(a) Trajectory of a bus   (b) Frequent trajectories extracted by method in [2]   (c) Frequent trajectories extracted by our method
Fig. 3: The original trajectory database and frequent trajectories

6 Conclusion

In this paper, we have discussed the problem of mining frequent trajectories. We introduce a distance threshold d to dynamically divide the trajectory space into a grid of cells of equal size $d \times d$. Because the movements of objects inside a popular region may be lost when the size of the region is not limited, we define a Boundary Function to limit the maximum size of the popular regions. As a result, the original trajectories can be transformed into sequences of popular region-ids. Then, we propose an improved FP-Tree algorithm to mine frequent trajectories.

Acknowledgement

This work was supported by a grant from the Natural Science Foundation of China (No ), the Science & Research Plan Project of Shaanxi Province (No. 2011NXC01-12) and the Science & Research Plan Project of Shaanxi Province Department of Education (No. 2010JC15). The authors are grateful to the anonymous reviewers for their constructive comments.

References

[1] Han, J., J. Pei and Y. Yin. Mining frequent patterns without candidate generation[C]. ACM SIGMOD Record, 29:
[2] Giannotti, F., M. Nanni, F. Pinelli and D. Pedreschi. Trajectory pattern mining[C]. Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, August 12-15, ACM, New York, USA,
[3] Kang, J.Y. and H.S. Yong. Mining trajectory patterns by incorporating temporal properties[C]. Proceedings of the 1st International Conference on Emerging Database, August 27-28, 2009, Busan, Korea, pp
[4] Mamoulis, N., H. Cao, G. Kollios, M. Hadjieleftheriou, U.Y. Tao and D. W. Cheung. Mining, indexing and querying historical spatiotemporal data[C]. Proceedings of the 10th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, August 22-25, 2004, ACM, New York, USA, pp
[5] Lee, A.J.T., Y.A. Chen and W.C. Ip. Mining frequent trajectory patterns in spatial-temporal databases[J]. Journal Information Science, 179:
[6] Savage, N.S., S. Nishimura, N.E. Chavez and X. Yan. Frequent trajectory mining on GPS data[C]. In: Proceedings of the 3rd International Workshop on Location and the Web, Tokyo, Japan, November 29, 2010, ACM, pp
[7] Akasapu, A., L.K. Sharma and G. Ramarkrishan. Efficient trajectory pattern mining for both sparse and dense dataset[J]. International Journal Computer Application, 9:
[8] Han, J., G. Dong and Y. Yin. Efficient mining of partial periodic patterns in time series database[C]. Proceedings of the 1999 International Conference on Data Engineering, March 23-26, Sydney, Australia, pp
