TrajStore: an Adaptive Storage System for Very Large Trajectory Data Sets Philippe Cudré-Mauroux Eugene Wu Samuel Madden Computer Science and Artificial Intelligence Laboratory Massachusetts Institute of Technology
Overview Introduction and Motivation Large-Scale GPS Data Mining Conventional approach R-Trees & Trajectory-Segmentation TrajStore Architecture Sparse Spatial Index Adaptivity Compression Performance Conclusions
Introduction - Problems The rise of GPS and broadband-speed wireless devices is driving demand for trajectory data Users' activities and movement patterns across locations MIT CarTel project Current database storage systems are inadequate for manipulating very large, dynamic spatio-temporal data sets Extremely slow when retrieving data Also inadequate for processing or computing over large numbers of trajectories simultaneously
Motivation Explosion of position-aware devices & apps MIT's CarTel project: CarTel is a mobile sensor computing system designed to collect, process, deliver, and visualize data from sensors located on mobile units such as automobiles. It collected live data from cars in Boston and combined it with historical data to provide efficient route planning.
MIT CarTel
Motivation CarTel Massive amounts of GPS data Real-time, high insert rates Large spatiotemporal queries New class of applications Live feeds from large fleets of mobile objects Current solutions (e.g., PostGIS) failed Designed for (relatively) sparse data
Outline Motivation Large-Scale GPS Data Mining Conventional approach R-Trees & Trajectory-Segmentation TrajStore Architecture Sparse Spatial Index Adaptivity Compression Performance Conclusions
Inserting [Conventional Approach: R-Tree] (figure: a trajectory is split into bounding rectangles R1-R19, each inserted as an entry into the R-tree)
Querying [Conventional Approach: R-Tree] (figure: a spatial query descends the R-tree into every overlapping rectangle R1-R19; each matching leaf entry of the form {Rn, TrajID, (x1, y1, t1), (x2, y2, t2), ...} is fetched separately)
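A minimal sketch (assumed structure, not the paper's code) of why R-tree queries over dense trajectory data are slow: the search must descend into every child whose bounding rectangle overlaps the query box, and long trajectories produce many large, overlapping rectangles, so many subtrees (and hence disk pages) are visited per query.

```python
def overlaps(a, b):
    """Axis-aligned rectangles as (xmin, ymin, xmax, ymax)."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

class Node:
    def __init__(self, mbr, children=None, points=None):
        self.mbr = mbr              # minimum bounding rectangle
        self.children = children    # inner node: list of Node
        self.points = points        # leaf: list of (traj_id, x, y, t)

def query(node, box, visited):
    """Collect points inside box; visited[0] counts node (page) accesses."""
    visited[0] += 1                 # each node visit ~ one disk page read
    if node.points is not None:     # leaf: filter stored points
        return [p for p in node.points
                if overlaps((p[1], p[2], p[1], p[2]), box)]
    out = []
    for c in node.children:         # descend into EVERY overlapping child
        if overlaps(c.mbr, box):
            out.extend(query(c, box, visited))
    return out
```

Even this toy query touches both leaves when their rectangles overlap the query box, although only one of them holds a matching point, which is exactly the overhead TrajStore's non-overlapping cells avoid.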
Some Improvements on R-tree TB-Trees & SEB-Trees Do not deal well with very long trajectories, which tend to have very large bounding rectangles and can incur a high number of I/Os per lookup Neither explicitly discusses how to cluster data, and both are non-adaptive
Issues with Current Systems Efficient for sparse data only Catastrophic for large, dense, overlapping data Slow inserts Bounding-box creation Multiple index updates per new trajectory Slow queries Index considers a very high number of overlapping objects Inefficient record selection Complex index maintenance & look-up One disk seek per trajectory sub-segment Several minutes/hours to resolve aggregate queries
Objective Efficient for non-sparse data E.g., trajectory data that is dense in space and time Works for large, dense, overlapping data Efficiently retrieve all trajectories in a particular geospatial/temporal region Fast inserts for large amounts of data with low indexing cost Fast queries with few I/O operations
Outline Motivation Large-Scale GPS Data Mining Conventional approach R-Trees & Trajectory-Segmentation TrajStore Architecture Sparse Spatial Index Adaptivity Compression Performance Conclusions
TrajStore Adaptive system to store & query very large trajectory data sets Sparse, non-overlapping spatial index Chunk-based data organization co-location, dense-packing & compression Buffered, amortized I/O operations Minimization of total I/O cost
Architecture
Index & Storage TrajStore storage structures are optimized for spatial queries over specific regions with relatively large time bounds, rather than finding just one or a few trajectories that pass through a region at an exact point in time. Storage is primarily organized according to a spatial index, with temporal predicates applied on the data retrieved from this spatial index, as we expect spatial predicates to be generally more selective than temporal predicates.
Index & Storage Spatial index: quadtree Sub-trajectories indexed and ordered in time [new] Optimal quadtree construction [new] Adaptive, index-driven data storage
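A hedged sketch (assumed structure, not TrajStore's actual code) of the sparse, non-overlapping quadtree index: each leaf cell maps a square region to one on-disk chunk, so a spatial query resolves to a set of leaf cells and costs roughly one seek per cell.

```python
class Cell:
    def __init__(self, x0, y0, size, chunk_id=None):
        self.x0, self.y0, self.size = x0, y0, size
        self.chunk_id = chunk_id    # leaf: id of the dense-packed chunk on disk
        self.children = None        # inner node: 4 quadrant children

    def split(self):
        """Replace this leaf with four equal quadrants."""
        h = self.size / 2
        self.children = [Cell(self.x0,     self.y0,     h),
                         Cell(self.x0 + h, self.y0,     h),
                         Cell(self.x0,     self.y0 + h, h),
                         Cell(self.x0 + h, self.y0 + h, h)]
        self.chunk_id = None

def cells_for_query(cell, box):
    """Return the leaf cells intersecting box = (xmin, ymin, xmax, ymax)."""
    x1, y1 = cell.x0 + cell.size, cell.y0 + cell.size
    if box[0] >= x1 or box[2] <= cell.x0 or box[1] >= y1 or box[3] <= cell.y0:
        return []                   # no overlap: prune this subtree
    if cell.children is None:
        return [cell]               # leaf cell: one chunk, one seek
    return [c for q in cell.children for c in cells_for_query(q, box)]
```

Because leaf cells tile space without overlap, the index look-up is trivial and the temporal predicate is applied afterwards, to the time-ordered sub-trajectories inside each retrieved chunk.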
TrajStore Inserts (figure: an incoming trajectory at times t1-t5 is segmented across quadtree cells C1-C10; each cell's data is appended to a dense-packed / compressed chunk)
TrajStore Queries (figure: the query region maps to quadtree cells C1-C10 via a trivial index look-up, with one seek per cell of dense-packed / compressed chunks)
Sparse Spatial Index (1/3) Cost-based, optimal spatial partitioning Efficient, hierarchical partitioning
Sparse Spatial Index (2/3) Basic idea Cost model for query execution times based on the number of cells accessed Optimal quadtree construction based on the cost model, query workload, local density & page size Optimal balance between Oversized cells potentially retrieve data that is not queried Undersized cells incur seeks that are not amortized if too little data is read, plus unnecessary seeks when data is dense and queries are relatively large
Sparse Spatial Index (3/3) Algorithm Split: Input: A cell cell that will be split Output: The number of cells nbnewcells that have been inserted into the quadtree to replace this cell Algorithm Merge: Input: A cell cell that will be merged with its neighbors Output: The number of cells nbnewcells that have been merged and replaced by a new cell
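A hedged sketch of the split decision under an assumed linear I/O cost model: reading a cell costs one seek plus transfer time proportional to the bytes stored in it. Splitting pays extra seeks but avoids reading unqueried data; merging trades the other way. The cost constants and the per-child query model below are illustrative, not the paper's.

```python
import math

SEEK_COST = 10.0        # ms per disk seek (assumed)
TRANSFER_COST = 0.01    # ms per KB transferred (assumed)

def read_cost(data_kb, n_cells=1):
    """Cost of reading data_kb kilobytes spread over n_cells chunks."""
    return n_cells * SEEK_COST + TRANSFER_COST * data_kb

def should_split(cell_kb, queried_fraction):
    """Split if reading only the queried fraction of the data, at the price
    of extra seeks into sub-cells, beats one seek over the whole cell.
    Assumes the typical query touches ceil(4 * queried_fraction) children."""
    cost_whole = read_cost(cell_kb, n_cells=1)
    touched = max(1, math.ceil(4 * queried_fraction))
    cost_split = read_cost(cell_kb * queried_fraction, n_cells=touched)
    return cost_split < cost_whole
```

Merge() is the mirror image: four sibling cells are collapsed back into one when the saved seeks outweigh the extra data transferred per query.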
System Adaptivity Data evolution Adapt the index & storage with every incoming trajectory No-op / Split() / Merge() Very fast, incremental operations Query evolution Highly-skewed queries in practice Per-cell query statistics EWMA-based re-clustering
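A minimal sketch (assumed form) of the per-cell EWMA query statistic: every query updates an exponentially weighted moving average per cell, so recently hot cells dominate re-clustering decisions while old activity decays away.

```python
ALPHA = 0.2  # smoothing factor (illustrative choice)

def update_ewma(prev, hit):
    """hit is 1.0 if the cell was touched by the current query, else 0.0."""
    return ALPHA * hit + (1 - ALPHA) * prev

# A cell touched by three queries, then ignored: the statistic decays.
ewma = 0.0
for hit in [1, 1, 1, 0, 0]:
    ewma = update_ewma(ewma, hit)
```

A larger ALPHA makes the system react faster to a shifting query workload at the cost of more churn from transient spikes.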
Compression Unique opportunities due to high spatial redundancy Intra-segment redundancy High sampling rate, bounded speed Delta encoding (lossless) Linear interpolation (lossy/lossless) Inter-segment redundancy Repeated trips Spatially constrained by roads, paths Online cluster detection Cluster compression (lossy) Combination of approaches based on user needs Bounded total error
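A hedged sketch of lossless delta encoding for one trajectory segment: with a high sampling rate and bounded vehicle speed, consecutive deltas are small integers that compress far better than absolute coordinates.

```python
def delta_encode(points):
    """points: list of (x, y, t) integer tuples (e.g., scaled coordinates).
    Stores the first point absolutely, then per-dimension differences."""
    if not points:
        return []
    out = [points[0]]
    for prev, cur in zip(points, points[1:]):
        out.append(tuple(c - p for c, p in zip(cur, prev)))
    return out

def delta_decode(encoded):
    """Inverse of delta_encode: cumulative sums restore the points."""
    if not encoded:
        return []
    pts = [encoded[0]]
    for d in encoded[1:]:
        pts.append(tuple(p + dd for p, dd in zip(pts[-1], d)))
    return pts
```

The small deltas would then be bit-packed or fed to a general-purpose compressor; the lossy variant additionally drops points recoverable by linear interpolation within the error bound.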
Compression Algorithm for forming cluster groups: Input: A cell containing a set of trajectories Output: A list of cluster groups {G1,..., Gn}, where all trajectories in a group Gi are within a distance threshold of each other Algorithm for eliminating extraneous points: Input: A list of points (pi,..., pj) Output: Returns True if the points between pi and pj can be linearly interpolated; else returns False
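A hedged sketch (not the paper's exact algorithm) of the extraneous-point test: the interior points of a run can be dropped if each lies within an error bound of the position linearly interpolated, by timestamp, between the run's endpoints.

```python
def can_interpolate(points, eps):
    """points: list of (x, y, t). True if every interior point lies within
    eps of the position linearly interpolated between the two endpoints."""
    (x0, y0, t0), (x1, y1, t1) = points[0], points[-1]
    for x, y, t in points[1:-1]:
        # fraction of the time span elapsed at this sample
        f = (t - t0) / (t1 - t0) if t1 != t0 else 0.0
        ix, iy = x0 + f * (x1 - x0), y0 + f * (y1 - y0)
        if ((x - ix) ** 2 + (y - iy) ** 2) ** 0.5 > eps:
            return False
    return True
```

A compressor can greedily extend the run while this test holds and store only the endpoints, which bounds the reconstruction error of every dropped point by eps.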
Experimental Setup Query answering on 40-200M GPS readings CarTel data Large queries (0.1% / 1% / 10%) Approaches compared PostGIS Optimal trajectory segmentation TrajStore TrajStore variants Fixed grid Capacity-bound quadtree Compression schemes
Experimental Results
Approaches Compared Adaptive Adaptive clustering approach Grid Segment trajectories onto a fixed-size grid ClustSplit Each trajectory is split into sub-trajectories NoSplit Store each whole trajectory in an R-Tree CapacityQuad Use a capacity-bound quadtree as index
Experimental Results Blazing-fast query execution 1-2 orders of magnitude faster than existing approaches Superior indexing scheme [query size = 1%] adaptivity & compression turned off
Experimental Results Further results High insert rate 100K GPS points / s on average Scalable Very resilient to data & query evolution, unlike the fixed grid Compression (1 m error bound) 1:8 compression ratio 2.5x performance improvement
Conclusions Explosion of location-aware devices & applications Urgent need to support very large-scale GPS analytics TrajStore: rethinks both index & storage layers in combination to provide Sparse, adaptive, non-overlapping index, optimal w.r.t. an I/O cost model Index-driven data co-location High compression ratios via intra- + inter-segment compression System of choice for analytical queries over very large collections of trajectories