Incremental Sub-Trajectory Clustering of Large Moving Object Databases


Information Management Lab (InfoLab), Department of Informatics, University of Piraeus
Nikos Pelekis, Panagiotis Tampakis, Marios Vodas, Yannis Theodoridis
August 2013

Table of Contents
1. INTRODUCTION
2. RELATED WORK
3. THE RETRATREE STRUCTURE
4. RETRATREE ALGORITHMS
REFERENCES

1. INTRODUCTION

Huge volumes of location information are available nowadays due to the rapid growth of positioning devices (GPS-enabled smartphones and tablets, on-board navigation systems in vehicles, vessels and planes, smart chips for animals, etc.). In the near future, this explosion will inevitably contribute to what is called the Big Data era, raising significant challenges for the data management research community. Assume, for example, the following scenario: Location-based Services (LBS) users transmit their location to a central LBS server asynchronously and in batch mode; on the server side, a Moving Object Database (MOD) system is responsible for organizing the users' traces as trajectories; to provide high-quality services, the server executes extensive querying and mining processes on the trajectory database stored in the MOD engine.

Clear challenges arise from the above scenario:

Challenge 1: Since users (the data producers) transmit their location information asynchronously and in batch mode, the underlying index that supports MOD query processing should be able to handle this kind of information transmission.

Challenge 2: Mining operations in a MOD should be treated as first-class citizens, on a par with querying operators. To achieve this, the mining functionality should be provided in the form of query operators in the MOD engine.

Challenge 3: Unlike relational data, for which algorithms such as BIRCH and CURE exist, the MOD domain lacks an efficient incremental clustering algorithm. The incremental characteristic is essential in the above scenario, since updates in the database are frequent and clustering results could quickly degrade. Such an incremental clustering approach should also support PAUSE/RESUME operations, due to the big volume of data to be handled.
In this report, we address the above issues and propose a novel indexing scheme for large MODs. The scheme is based on abstractions of trajectory data, the so-called Representative Trajectories (hence the term ReTraTree), and is able to (a) efficiently process both querying and mining operations in the MOD, (b) work in distributed mode, and (c) support trajectory databases that are fed in asynchronous and batch mode.

The rest of the report is organized as follows. Section 2 reviews related work. Section 3 presents the ReTraTree structure, and Section 4 provides the algorithms for maintaining the ReTraTree and exploiting it for querying and mining operations, along with a cost analysis. Section 5 discusses the settings and results of our empirical performance study.

2. RELATED WORK

In this section, we review related work on mobility data mining (focusing on clustering) and related access methods for mobility data.

Recently, several approaches have tried to make well-known mining algorithms applicable to trajectories. The common building block of these approaches is the use of different similarity functions as the means to group trajectories into clusters. An interesting approach, which is also adopted by our work, is proposed in [10] for the efficient processing of most-similar trajectory (MST) queries. A similar distance function is used in the T-OPTICS algorithm [32] (and its variant TF-OPTICS, which focuses on discovering the temporal intervals that lead to the best clustering results), where the OPTICS [4] clustering algorithm is made applicable to trajectory data. These temporal intervals are given by the user, which is a limitation of the approach: TF-OPTICS reapplies T-OPTICS on portions of trajectories that all live in exactly the same temporal period. The best of the possible clusterings is chosen by applying some qualitative measures.
Our approach can be viewed as a generalization of this approach, as we automatically identify patterns of sub-trajectories in an unsupervised way, and these patterns may have various non-predefined lifespans. In [7] the authors proposed probabilistic techniques based on the EM algorithm for clustering short trajectories using regression mixture models. This approach also aims at global clustering of whole trajectories; the notions of segmentation, sampling and sub-trajectory pattern mining are out of its scope. In [39] the authors proposed a variant of the FCM algorithm for MODs, called CenTR-I-FCM. The approach makes use of local patterns in the time dimension as the base to identify global clusters of whole approximate/symbolic trajectories. Again, the discovered local patterns are predefined with respect to their lifespan.

In [28] the authors proposed a partition-and-group framework for clustering 2-D trajectories, which enables the grouping of similar sub-trajectories based on a trajectory partitioning algorithm that uses the minimum description length principle. At its core it uses a variant of the DBSCAN algorithm that operates on the partitioned directed line segments. This work was the first to tackle the problem of identifying sub-patterns in mobility data and, although similar in principle to our approach, it presents certain limitations, as discussed earlier on the example of Fig. 1.

An interesting line of research includes works that aim to discover several types of collective behavior among moving objects, such as flocks, leadership, convergence, encounter and sub-trajectory cluster patterns [5][16][26][15], moving clusters [23], convoys [22], and swarms [30]. Although these approaches provide lucid definitions of the mined patterns, their main limitation is that they are rather rigid and sensitive to parameters, while their computation raises efficiency issues.

Our approach also has commonalities with well-known clustering algorithms for point (vector) data [47][43], which first sample the dataset and then start the clustering process (aiming at high efficiency). Not only are these vector-based algorithms inapplicable to MODs (due to the complex structure and properties of mobility data), but there is also an essential difference between those techniques and our approach: while they rely on random sampling, in our approach the clustering is driven by a sample that results from an optimization formula, thus leading to a deterministic solution of the sub-trajectory clustering problem.
The previous algorithms usually handle the issue of efficiency by employing a general-purpose access method (e.g. an R-tree-like structure), which, however, is implemented ad hoc and outside a DBMS or a specialized MOD. This means that concurrency and recovery are left out of the requirements, which limits the usability of the algorithms in real-world applications. Extending MOD engines, and commercial ORDBMSs in general (e.g. Informix Dynamic Server [21], Oracle's Extensible Indexing Interface [35] and DB2 UDB's table functions [8]), does not reduce the complexity of understanding their concurrency and recovery protocols. As such, it does not reduce the implementation effort of an external access method compared to a built-in one, if identical levels of concurrency, robustness and integration are desired [25], as in our case. This complexity is the main reason why, although literally dozens of efficient access methods for mobility data have been proposed in the literature (to name but a few representatives, [41][44][19][34]), none of them has been integrated in a real DBMS under the aforementioned specifications.

To handle this problem the GiST [20] structure has been proposed, which, to the best of our knowledge, has not been used in the context of mobility data. More interestingly, not all research proposals can be realized as GiST instances. For example, the TB-tree [41] cannot be reproduced, due to the doubly linked list in the leaves of the tree. In this report, our proposal is to simulate the well-known 3D-Rtree [44] on PostgreSQL's GiST extensibility interface, applied on appropriate data types that allow mobility data representation. Our choice was driven by the fact that the 3D-Rtree has been used as the access method by several diverse algorithms, while it has exhibited balanced performance in a variety of benchmark queries [9].

3. THE RETRATREE STRUCTURE

Hierarchical splitting of the time domain

The idea of hierarchical partitioning of the time domain is to first partition each trajectory into p ≪ L_k equi-sized disjoint temporal periods (first-level partitioning into so-called chunks), and then to organize the sub-trajectories inside each chunk into possibly overlapping equivalence classes according to their lifespans (second-level partitioning into so-called sub-chunks).
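As a rough illustration of the first-level partitioning, the following Python sketch computes the p equi-sized periods and cuts a trajectory at the shared borders. The function names (`chunk_periods`, `split_trajectory`), the `(x, y, t)` tuple layout, and the linear interpolation of a point on each crossed border are assumptions made for this example; they are not taken from the ReTraTree implementation.

```python
def chunk_periods(t_min, lifespan, p):
    """Return the p equi-sized half-open periods [start, end) covering the lifespan."""
    step = lifespan / p
    return [(t_min + i * step, t_min + (i + 1) * step) for i in range(p)]


def chunk_index(t, t_min, lifespan, p):
    """Index (0-based) of the chunk whose period contains timestamp t."""
    step = lifespan / p
    return min(int((t - t_min) // step), p - 1)


def split_trajectory(points, t_min, lifespan, p):
    """Split a trajectory, given as time-ordered (x, y, t) points, into
    per-chunk lists of line segments. A segment crossing a chunk border is
    cut there (assumption: linear interpolation between its endpoints)."""
    step = lifespan / p
    pieces = {}
    for (x0, y0, t0), (x1, y1, t1) in zip(points, points[1:]):
        start = (x0, y0, t0)
        i = chunk_index(t0, t_min, lifespan, p)
        # Walk over every splitting timestamp strictly inside the segment.
        while i < p - 1 and t_min + (i + 1) * step < t1:
            border = t_min + (i + 1) * step
            f = (border - t0) / (t1 - t0)
            cut = (x0 + f * (x1 - x0), y0 + f * (y1 - y0), border)
            pieces.setdefault(i, []).append((start, cut))
            start = cut
            i += 1
        pieces.setdefault(i, []).append((start, (x1, y1, t1)))
    return pieces
```

For instance, a single segment from `(0, 0, 0)` to `(10, 0, 10)` with a lifespan of 10 and `p = 2` is cut at `t = 5`, so each of the two chunks receives one half of it. Every trajectory is cut at the same borders, which is the property the chunking strategy relies on.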

More formally, given a trajectory T_k as a sequence of L_k − 1 (3D) line segments e_{k,i}, the lifespan l of all trajectories in the trajectory database D, and a target partitioning granularity p ≪ L_k, the chunked sub-trajectory ST_{k,i} of trajectory T_k is the one that results from T_k when it is restricted inside the temporal period p_i:

$$p_i = \left[\, t.D + (i-1)\,\frac{l}{p},\;\; t.D + i\,\frac{l}{p} \,\right), \quad 1 \le i \le p \qquad (1)$$

where l/p is the length of each time interval (i.e. the lifespan of each chunk), t.D is the minimum timestamp of D, and the timestamps t.D + (i−1)·l/p, 2 ≤ i ≤ p, are called splitting timestamps. As such, each trajectory is split into multiple sub-trajectories using the same p−1 splitting timestamps. Note that this strategy is different from that in [18], [19], [42], where each trajectory selects different splitting timestamps, while it is similar to that in [34]. Fig. 1 shows the splitting of the MOD into two chunks, each corresponding to the data of one day (mauve and green colored sub-trajectories, respectively).

The chunking process is applied incrementally whenever a batch of recordings from a moving object arrives. The algorithm tries to fit the batch in the existing chunks, taking into consideration the already created chunk borders. If the given trajectory cannot be fitted in the existing temporal range, the set of chunks is extended suitably in order to fit the trajectory.

At the second level, each chunk produced by the first phase is subdivided into (possibly) overlapping equivalence classes. Specifically, we partition each chunk into smaller sub-chunks by grouping the sub-trajectories contained in the chunk by their lifespan, i.e. by their starting and ending timepoints. In this phase the temporal borders of each sub-chunk are therefore defined by the data rather than by the user: the sub-trajectories of a sub-chunk have the same or similar borders w.r.t. a user-defined temporal tolerance parameter tau. This parameter implies that two sub-trajectories are considered temporally similar if their starting (ending) timepoints do not differ by more than tau/2, respectively.
Moreover, this parameter implies that two sub-trajectories cannot be considered spatio-temporally similar (as such, they will not be in the same cluster and their distance function will not be calculated) if the union of their non-common lifespans is bigger than tau (e.g. a trajectory of 20 minutes duration cannot be similar to a trajectory of 30 minutes duration when tau = 10 minutes). Obviously, setting tau = 0 minutes results in a large number of equivalence classes, as in real-world data it is rare to have many trajectories starting their route at exactly the same time. In Fig. 1 the chunk corresponding to the first day (i.e. the mauve colored sub-trajectories) is subdivided into two sub-chunks, containing the sets of sub-trajectories <T_1, T_2, T_3, T_4> and <T_5, T_6>, respectively.

[Figure: six trajectories T_1 to T_6 drawn over the axes x, y and t, with the time axis split into Day 1 and Day 2]
Fig. 1: A MOD consisting of six trajectories
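The two roles of tau described above can be sketched in Python as follows. This is only an illustration: lifespans are assumed to be `(start, end)` pairs in the same time unit as tau, the function names are hypothetical, and the strict inequalities follow the tests in the listing of Fig. 5 rather than any released implementation.

```python
def temporally_similar(a, b, tau):
    """True if the starting and the ending timepoints of the two
    lifespans each differ by less than tau/2."""
    return abs(a[0] - b[0]) < tau / 2 and abs(a[1] - b[1]) < tau / 2


def non_common_lifespan(a, b):
    """Total time covered by exactly one of the two lifespans."""
    overlap = max(0, min(a[1], b[1]) - max(a[0], b[0]))
    return (a[1] - a[0]) + (b[1] - b[0]) - 2 * overlap


def may_be_compared(a, b, tau):
    """True if the pair passes the temporal filter, so that the (more
    expensive) spatio-temporal similarity is worth computing."""
    return non_common_lifespan(a, b) < tau
```

With tau = 10 minutes, the lifespans `(0, 20)` and `(0, 30)` from the example in the text have a non-common lifespan of 10 minutes, so `may_be_compared` returns `False` and no distance computation takes place.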

Table 1 summarizes the definitions of the symbols used in this report.

Table 1: Symbol table.
Symbol: Definition
D: Given MOD, D = {T_1, T_2, ..., T_N}
T_k: k-th trajectory of D
L_k: Number of points of T_k
p_{k,i}: i-th (3D) point of trajectory T_k, p_{k,i} = (x_{k,i}, y_{k,i}, t_{k,i})
e_{k,i}: i-th (3D) line segment of trajectory T_k, e_{k,i} = [p_{k,i}, p_{k,i+1}]
l_{k,i}: Lifespan of e_{k,i}, calculated as l_{k,i} = t_{k,i+1} − t_{k,i}
LP_k: Number of sub-trajectories partitioning trajectory T_k
P_k: Set of the sub-trajectories partitioning trajectory T_k
P_{k,i}: i-th sub-trajectory of trajectory T_k
V: Set of all voting descriptors in dataset D
V_k: The voting descriptor of trajectory T_k
VP_{k,i}: The voting descriptor of sub-trajectory P_{k,i}
Nl_{k,i}: The descriptor of sub-trajectory P_{k,i} w.r.t. the normalized lifespan of its line segments
S: Sampling set of representatives, S = {R_1, ..., R_M}
M: The cardinality of S, also the number of clusters in the resulting clustering
SR(S): Representativeness function of S
V(P_{k,i}, P_{m,n}): Voting descriptor of P_{k,i} ∈ D − S w.r.t. P_{m,n} ∈ S
C: Clustering of sub-trajectories in M clusters, C = {C_1, ..., C_M}
Out: Set of sub-trajectories not belonging to C (i.e., outliers)
t.x: Minimum timestamp of object x
T.x: Maximum timestamp of object x
l: Lifespan of all T_k in D
p: Number of equi-time disjoint intervals (i.e. chunks)
CK_i: i-th chunk of D, corresponding to period p_i, 1 ≤ i ≤ p
ST_{k,i}: Sub-trajectory of T_k that belongs to CK_i
S_nCK_i: n-th sub-chunk of the i-th chunk
tau: Temporal threshold (tolerance)
S_nCK_i.per: Temporal period for S_nCK_i
S_nCK_i.S: Representatives for S_nCK_i
S_nCK_i.Out: Outliers for S_nCK_i

The ReTraTree data structure

The previous discussion regarding the hierarchical splitting of the time domain implicitly describes the first two levels of the ReTraTree. In detail, the root of the ReTraTree consists of entries corresponding to chunks sorted by time.
Note that for each chunk CK_i there is no need to maintain the temporal periods in the index nodes, as these correspond to equal-length splitting intervals. Each entry CK_i only maintains a pointer to the respective set of sub-chunks S_nCK_i, n ≥ 1, which form the second level of the ReTraTree. The sub-chunks of a chunk are kept as a sequence of triplets <S_nCK_i.per, S_nCK_i.S, S_nCK_i.Out>, where per is the temporal period of the sub-chunk, while S (Out) is a pointer to the set of representative (outlier) sub-trajectories of S_nCK_i. The sequence of triplets is ordered first by the starting timepoint and then by the ending timepoint of per. The entries of the set S are pairs <R_j, C_{R_j}>, each of which includes the representative sub-trajectory R_j and a pointer C_{R_j} to the subset of sub-trajectories that form a cluster around R_j. Similarly, S is ordered by the time period of R_j. The set Out contains the outlier sub-trajectories of the current sub-chunk. The sets S and Out (whose utility and role will be discussed in the subsequent section) form a third level of partitioning in the ReTraTree, while the actual data corresponding to all clusters C_{R_j} form the fourth level of the structure. Let us refer to this subset of data as D_{n,i}.

Note that all the sub-trajectories of all C_{R_j} in a sub-chunk, namely D_{n,i}, are organized in a relation whose column holding the sub-trajectories is indexed by a pg3dr-tree, while the column holding the cluster identifier is indexed by a B+ tree. The relation further includes the identifiers of the trajectories, also indexed by a B+ tree. These indices enable us to apply spatio-temporal queries to sub-chunk D_{n,i} and facilitate direct access to the data of a specific cluster C_{R_j}. Subsequently, we formalize the above discussion, while Fig. 2 and Fig. 3 depict the structure of the ReTraTree and its instantiation for the MOD of Fig. 1, respectively:

root = {..., CK_i, ...}, 1 ≤ i ≤ p
CK_i = {S_nCK_i, ...}, n ≥ 1
S_nCK_i = <S_nCK_i.per, S_nCK_i.S, S_nCK_i.Out>
S_nCK_i.S = {<R_j, C_{R_j}>}, j ≥ 1
Fig. 2: The structure of the ReTraTree

[Figure: ReTraTree instance for the MOD of Fig. 1]
Fig. 3: A ReTraTree (omitting the fourth (data) level) built from the MOD of Fig. 1
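The levels formalized in Fig. 2 can be mirrored by a few in-memory Python classes. This is a minimal sketch under stated assumptions: the class and field names are invented for the example, and plain lists stand in for the indexed relation that holds the fourth (data) level in the report.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

Point = Tuple[float, float, float]   # (x, y, t)
SubTrajectory = List[Point]


@dataclass
class ClusterEntry:
    """A pair <R_j, C_Rj>: a representative and the members clustered around it."""
    representative: SubTrajectory
    members: List[SubTrajectory] = field(default_factory=list)


@dataclass
class SubChunk:
    """A triplet <per, S, Out> of the second level."""
    per: Tuple[float, float]
    S: List[ClusterEntry] = field(default_factory=list)
    Out: List[SubTrajectory] = field(default_factory=list)


@dataclass
class Chunk:
    """A root entry CK_i; its period is implicit (equal-length intervals)."""
    sub_chunks: List[SubChunk] = field(default_factory=list)

    def add_sub_chunk(self, sc: SubChunk) -> None:
        # Keep the triplets ordered by start, then by end, of their period.
        self.sub_chunks.append(sc)
        self.sub_chunks.sort(key=lambda s: s.per)


@dataclass
class ReTraTree:
    """The root: chunks sorted by time."""
    chunks: List[Chunk] = field(default_factory=list)
```

Sorting on the `per` tuple gives the start-then-end ordering for free, because Python compares tuples lexicographically.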

4. RETRATREE ALGORITHMS

Below we provide a technical description of the sampling-based sub-trajectory clustering algorithm, which is listed in Fig. 4.

Algorithm S²T-Clustering
Input: MOD D = {T_1, T_2, ..., T_N}, w, ε
Output: Sampling set S, clusters C_i, i ∈ {1, ..., M}, outliers Out
1. V ← GVA(D, ε)
2. for each V_k ∈ V do
3.     P_k ← TSA(V_k, w)
4. S ← SSA(V, ∪_{k=1..N} P_k)
5. (C, Out) ← SCA(S, ∪_{k=1..N} P_k, ε)
6. return (S, C, Out)
Fig. 4: Algorithm for Sampling-based Sub-Trajectory Clustering

Incremental maintenance of ReTraTree

Recall that our goal is to incrementally maintain the ReTraTree whenever a batch of recordings of a moving object (i.e. a trajectory T_k) arrives. This methodology is described in the IS²T-Clustering algorithm (Fig. 5). In the previous discussion we have described how our method incrementally performs the first phase of partitioning in the time dimension (line 1). The update_root function returns the set of chunks CK and the respective set of sub-trajectories ST that correspond to the input trajectory T_k. Briefly, the rest of the methodology assigns each sub-trajectory to an appropriate sub-chunk (lines 4-6). If there is no matching sub-chunk w.r.t. time, a new sub-chunk is created, which is initialized with an empty representative set S and an outlier set Out that includes the unmatched sub-trajectory (lines 25-28). If there is an appropriate sub-chunk for the sub-trajectory under processing, the algorithm tries to assign it to an existing cluster (lines 7-11). If this attempt fails, the algorithm adds the sub-trajectory to the outlier set (lines 12-13), which acts as a temporary relation upon which sampling-based sub-trajectory clustering (i.e. the S²T-Clustering algorithm) is applied whenever the size of the relation exceeds a user-defined threshold (e.g. > α Mb). When this process takes place, a resulting new representative sub-trajectory extends the existing set of representatives only if it is different from them.
Subsequently, each of the resulting new outlier sub-trajectories is handled in one of two ways. If its size is smaller than w, which means that it cannot be clustered in a future clustering round, we delete it (i.e. store it in a permanent outliers relation). Otherwise, we re-insert the sub-trajectory from the top of the ReTraTree structure. This implies that we recursively apply the procedure to that sub-trajectory (until it is either clustered or partitioned into smaller pieces, due to successive applications of the S²T-Clustering algorithm), in order to search for other sub-chunks in which it could be clustered, or to form a new sub-chunk (lines 14-21).
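The matching part of this procedure (Phase 2) can be sketched in Python as follows. The sketch is deliberately simplified and rests on several assumptions: sub-chunks are plain dictionaries, `toy_similarity` is a hypothetical stand-in for the voting-based similarity V, the sub-trajectory joins only the first matching cluster, and the re-clustering of the outlier set (Phase 3) is left out.

```python
import math


def lifespan(st):
    """(start, end) timestamps of a sub-trajectory given as [(x, y, t), ...]."""
    return st[0][2], st[-1][2]


def non_common_lifespan(a, b):
    """Time covered by exactly one of the two (start, end) lifespans."""
    overlap = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    return (a[1] - a[0]) + (b[1] - b[0]) - 2 * overlap


def toy_similarity(a, b):
    """Hypothetical similarity in (0, 1]; NOT the report's voting function."""
    n = min(len(a), len(b))
    mean_dist = sum(math.dist(p[:2], q[:2]) for p, q in zip(a, b)) / n
    return 1.0 / (1.0 + mean_dist)


def insert_sub_trajectory(sub_chunks, st, tau, eps, similarity=toy_similarity):
    """Assign st to a cluster, to an outlier set, or to a new sub-chunk.
    sub_chunks: list of {"per": (s, e), "S": [(rep, members)], "Out": [...]}."""
    s, e = lifespan(st)
    for sc in sub_chunks:
        per_s, per_e = sc["per"]
        # Temporal match: both borders within tau/2 of the sub-chunk period.
        if abs(s - per_s) < tau / 2 and abs(e - per_e) < tau / 2:
            for rep, members in sc["S"]:
                if (non_common_lifespan((s, e), lifespan(rep)) < tau
                        and similarity(st, rep) >= eps):
                    members.append(st)
                    return "clustered"
            sc["Out"].append(st)   # Phase 3 would be triggered from here
            return "outlier"
    # No sub-chunk matches in time: open a new one holding st as an outlier.
    sub_chunks.append({"per": (s, e), "S": [], "Out": [st]})
    return "new sub-chunk"
```

Returning a status string is only a convenience for the example; a fuller version would check the size of `Out` after each outlier insertion and invoke the sampling-based clustering once the threshold is exceeded.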

Algorithm IS²T-Clustering
Input: ReTraTree root, trajectory T_k, tau, w, ε
Output: Updated ReTraTree
//PHASE 1: Chunking in the time domain
1. (CK, ST) ← update_root(root, T_k)
2. for each pair (CK_i, ST_{k,i}) ∈ (CK, ST) do
   //PHASE 2: Data-Driven Incremental Sub-Chunking and Clustering
3.     clustered ← false; match ← false
4.     SCK_i ← {S_nCK_i : |t.ST_{k,i} − t.S_nCK_i| < tau/2}
5.     for each S_nCK_i ∈ SCK_i do
6.         if (|T.ST_{k,i} − T.S_nCK_i| < tau/2) then
7.             for each R_j ∈ S_nCK_i.S do
8.                 if (non_common_lifespan(ST_{k,i}, R_j) < tau) then
9.                     if (V(ST_{k,i}, R_j) ≥ ε) then
10.                        C_{R_j} ← C_{R_j} ∪ ST_{k,i}
11.                        clustered ← true
12.            if (clustered = false) then
13.                S_nCK_i.Out ← S_nCK_i.Out ∪ ST_{k,i}
   //PHASE 3: Sampling-based Sub-Trajectory Clustering
14.                if |S_nCK_i.Out| > α_Mb then
15.                    (S, C, Out) ← S²T-Clustering(S_nCK_i.Out, w, ε)
16.                    S_nCK_i.S ← S_nCK_i.S ∪ {S' ⊆ S : NOT ε-join(S_nCK_i.S, S')}
17.                    for each outlier O in Out do
18.                        if |O| < w then
19.                            delete O
20.                        else
21.                            IS²T-Clustering(root, O, tau, w, ε)
22.            match ← true
23.        if (match = true) then
24.            break
25.    if (match = false) then
26.        SCK_i ← SCK_i ∪ S_{n+1}CK_i  // i.e. create new sub-chunk
27.        S_{n+1}CK_i.S ← ∅
28.        S_{n+1}CK_i.Out ← S_{n+1}CK_i.Out ∪ ST_{k,i}
29. return
Fig. 5: Algorithm for Inserting a trajectory in the ReTraTree structure

Query-based T-Clustering on ReTraTree

The above algorithm incrementally maintains the ReTraTree structure, whose leaves contain already clustered sub-trajectories. However, given a temporal period, it is not enough to retrieve the clusters (i.e. the sub-trajectories following the representatives) that overlap this period, because the sub-trajectory clustering of overlapping sub-chunks may produce clusters, namely representatives, that: (a) are almost identical (in which case a merge process should take place, in order to report only one cluster as the union of the two clusters built around the two similar representatives), and/or (b) continue one another (in which case an append process should take place to identify maximal clusters). In other words, what we require is a methodology that takes the ReTraTree structure as input and searches it so as to identify maximal patterns w.r.t. the user requirements, while at the same time identifying places where internal re-organization could improve the effectiveness and efficiency of the ReTraTree.

Such a user requirement could be the discovery of all the valid clusters during a specific period of time (eventually this period could be the whole lifespan of the MOD, thus also providing a solution to the whole- (vs. sub-) trajectory clustering problem). This is a reasonable requirement in the big mobility data setting that we envision, given that state-of-the-art clustering algorithms cannot be applied to the currently available MOD sizes. Put differently, the proposed methodology implies an algorithm that acts as a query operator in a MOD engine, retrieves already clustered data according to user parameters, and performs the aforementioned merge and append refinements on the query results. To the best of our knowledge, such a query-based clustering approach is novel in the mobility data management and mining literature.

The following algorithm proposes such a solution on top of the ReTraTree. The user gives the period of interest as a parameter, and the algorithm traverses the tree and returns the clusters that are valid in this period. More specifically, the algorithm initially filters the chunks that overlap the given period and, for each of them, the corresponding valid sub-chunks (lines 1-3).
These sub-chunks are organized in a priority queue, which at this step (line 4) partitions the sub-chunks into equivalence classes according to whether the representatives that have been discovered inside these sub-chunks temporally overlap or not. To illustrate this, Fig. 7 shows only the representative sub-trajectories (not the outliers) of one chunk. Note that, for simplicity, the y-dimension has been omitted and the specific borders of sub-chunks are not depicted, while the representatives form two equivalence classes, i.e. the blue and the red one. Subsequently, the algorithm pops the equivalence classes one by one and sorts all representatives w.r.t. the time dimension, similarly to sub-chunks, by interleaving the already sorted representatives of each sub-chunk (line 6). In Fig. 7, representatives coming from different sub-chunks are distinguished as dashed vs. continuous polylines.

Then, the algorithm sweeps the temporally interleaved representatives along the time dimension (line 7), and for each pair of overlapping sub-trajectories it only checks whether the two representatives either have the same lifespan (line 8) or one ends when the next one starts (line 11), in both cases w.r.t. the tau threshold. In the first case, if the two representatives are similar (which means that they certainly come from different sub-chunks), the first (in order) is annotated with the MERGE flag, as a hint that a merge process should take place at this step. This implies that the ReTraTree should be appropriately updated (shrunk) to keep only one of them. Note that such a re-organization is not performed at query time; instead, the query results provide the required hints, so that it can be applied whenever applications allow it. Such a merging hint is depicted in Fig. 7 between sub-trajectories R_1 and R_2. Obviously, representatives like R_5 and R_6 will both be maintained in the final outcome, although they have similar lifespans.
In the second case, if the last point of the first representative is close to the first point of the second representative (i.e. their Euclidean distance is below a distance threshold d) and a sufficient number of the same moving objects are represented by both representatives (w.r.t. a percentage threshold γ), the latter is appended to the first one (lines 12-13). This case is depicted in Fig. 7 between sub-trajectories R_3 and R_4. In any other case (line 15) the algorithm does nothing and continues to the next pair, thus keeping both representatives in the sorted list. At the end of each sweep, the algorithm keeps for the next round only those representatives that end at most tau seconds before the border of the current chunk (e.g. R_7), as candidates for merging with subsequent representatives (lines 16-18). The rest of the representatives are part of the final outcome of the algorithm.
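The pairwise sweep described above can be sketched as follows. This is an illustration rather than the report's operator: representatives are assumed to be dictionaries with time-ordered `points` and a set of object `ids`, the similarity function is supplied by the caller, and the share of common objects is computed here as the size of the intersection over the size of the union, which is only one possible reading of the percentage threshold γ.

```python
import math


def non_common_lifespan(a, b):
    """Time covered by exactly one of the two (start, end) lifespans."""
    overlap = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    return (a[1] - a[0]) + (b[1] - b[0]) - 2 * overlap


def annotate(reps, tau, eps, d, gamma, similarity):
    """Return one flag ("MERGE", "APPEND" or None) per consecutive pair of
    representatives; reps must already be sorted by starting timepoint."""
    flags = []
    for cur, nxt in zip(reps, reps[1:]):
        cur_span = (cur["points"][0][2], cur["points"][-1][2])
        nxt_span = (nxt["points"][0][2], nxt["points"][-1][2])
        flag = None
        if non_common_lifespan(cur_span, nxt_span) < tau:
            # Same lifespan (up to tau): merge only if also similar.
            if similarity(cur["points"], nxt["points"]) >= eps:
                flag = "MERGE"
        elif abs(cur_span[1] - nxt_span[0]) < tau:
            # One ends where the next starts: check spatial and id continuity.
            gap = math.dist(cur["points"][-1][:2], nxt["points"][0][:2])
            union = cur["ids"] | nxt["ids"]
            shared = len(cur["ids"] & nxt["ids"]) / len(union) if union else 0.0
            if gap < d and shared > gamma:
                flag = "APPEND"
        flags.append(flag)
    return flags
```

The flags are hints only, in line with the text: nothing is merged or appended inside the structure at query time.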

Algorithm QuT-Clustering
Input: ReTraTree root, temporal period tp = [s, e), tau, d, γ
Output: Clusters C valid inside tp
1. CK ← {CK_i : overlap(root.CK_i, tp)}
2. for each CK_i ∈ CK do
3.     SCK_i ← {S_nCK_i : overlap(tp, CK_i.S_nCK_i.per)}
4.     TEQ_PQ ← bulk_push_2_teq(TEQ_PQ, SCK_i)
5.     while TEQ_PQ ≠ Ø do
6.         S ← temporal_interleaving(TEQ_PQ.pop())
7.         for each R_j ∈ S do
8.             if (non_common_lifespan(R_j, R_{j+1}) < tau) then
9.                 if (V(R_j, R_{j+1}) ≥ ε) then
10.                    annotate R_j with MERGE flag
11.            else if (|T.R_j − t.R_{j+1}| < tau) then
12.                if (euclidean_dist(p(T.R_j), p(t.R_{j+1})) < d) AND (common_ids(R_j, R_{j+1}) > γ) then
13.                    annotate R_j with APPEND flag
14.            else
15.                continue
16.        S' ← {R_j ∈ S : |T.R_j − T.CK_i| > tau}
17.        S ← S − S'
18.        C ← C ∪ S'
19. return C
Fig. 6: Algorithm for trajectory search in the ReTraTree structure

[Figure: representative sub-trajectories of one chunk over the time axis, grouped into a blue and a red equivalence class]
Fig. 7: Representatives of a chunk organized in a temporal equivalence class
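The grouping into temporal equivalence classes performed in line 4 can be approximated with a single sweep over sorted lifespans. The sketch below assumes that a class is a maximal set of transitively overlapping lifespans and that lifespans touching at a single timepoint count as overlapping; both are assumptions made for this example, since the report does not spell out these details, and the function name is hypothetical.

```python
def temporal_equivalence_classes(lifespans):
    """Group (start, end) lifespans into classes of transitively
    overlapping intervals, returned in temporal order."""
    classes = []
    current, current_end = [], None
    for s, e in sorted(lifespans):
        if current and s <= current_end:
            # Overlaps the running class: extend it.
            current.append((s, e))
            current_end = max(current_end, e)
        else:
            # Gap in time: close the running class and start a new one.
            if current:
                classes.append(current)
            current, current_end = [(s, e)], e
    if current:
        classes.append(current)
    return classes
```

Because the input is sorted once and then scanned linearly, the cost is dominated by the sort, which keeps the grouping cheap compared to the similarity checks of the subsequent sweep.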
