TrajStore: an Adaptive Storage System for Very Large Trajectory Data Sets


TrajStore: an Adaptive Storage System for Very Large Trajectory Data Sets
Philippe Cudré-Mauroux, Eugene Wu, Samuel Madden
Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology

Overview
- Introduction and Motivation
  - Large-Scale GPS Data Mining
- Conventional approach
  - R-Trees & Trajectory Segmentation
- TrajStore
  - Architecture
  - Sparse Spatial Index
  - Adaptivity
  - Compression
- Performance
- Conclusions

Introduction - Problems
- The rise of GPS and broadband-speed wireless devices is driving demand for trajectory data
  - Users' activities and movement patterns in different locations
  - MIT CarTel project
- Current database storage systems are inadequate for manipulating very large, dynamic spatio-temporal data sets
  - Extremely slow at retrieving data
  - Also inadequate for processing or computing over large numbers of trajectories simultaneously

Motivation
- Explosion of position-aware devices & apps
- MIT's CarTel project: CarTel is a mobile sensor computing system designed to collect, process, deliver, and visualize data from sensors located on mobile units such as automobiles.
- It collected live data from cars in Boston and processed it together with historical data to provide efficient route plans.

MIT CarTel


Motivation
- CarTel
  - Massive amounts of GPS data
  - Real-time, high insert rates
  - Large spatiotemporal queries
- New class of applications
  - Live feeds from large fleets of mobile objects
- Current solutions (e.g., PostGIS) failed
  - Designed for (relatively) sparse data


Inserting [Conventional Approach: R-Tree]
[Figure: overlapping bounding rectangles R1-R19 and the R-tree nodes that index them]

Querying [Conventional Approach: R-Tree]
[Figure: bounding rectangles R1-R19 and the R-tree nodes that index them; each leaf entry has the form {R11, TrajID, (x1, y1, t1), (x2, y2, t2), (x3, y3, t3), (x4, y4, t4), ...}, shown for R11, R12 and R14]

Some Improvements on R-tree
- TB-Trees & SEB-Trees
  - Do not deal well with very long trajectories, which tend to have very large bounding rectangles and can incur a high number of I/Os per lookup
  - Do not explicitly discuss how to cluster data, and both are non-adaptive

Issues with Current Systems
- Efficient for sparse data only
  - Catastrophic for large, dense, overlapping data
- Slow inserts
  - Bounding box creation
  - Multiple index updates per new trajectory
- Slow queries
  - Index considers a very high number of overlapping objects
  - Inefficient selection of records
  - Complex index maintenance & look-up
  - One disk seek for each trajectory sub-segment
  - Several minutes/hours to resolve aggregate queries

Objective
- Efficient for non-sparse data
  - E.g., trajectory data that is dense in space and time
- Works for large, dense, overlapping data
- Efficient retrieval of all trajectories in a particular geospatial/temporal region
- Fast inserts of large amounts of data with low index-maintenance cost
- Fast queries with fewer I/O operations


TrajStore
- Adaptive system to store & query very large trajectory data sets
- Sparse, non-overlapping spatial index
- Chunk-based data organization
  - Co-location, dense-packing & compression
- Buffered, amortized I/O operations
- Minimization of total I/O cost

Architecture

Index & Storage
TrajStore's storage structures are optimized for spatial queries over specific regions with relatively large time bounds, rather than for finding just one or a few trajectories that pass through a region at an exact point in time. Storage is primarily organized according to a spatial index, with temporal predicates applied to the data retrieved through this spatial index, as we expect spatial predicates to be generally more selective than temporal predicates.

Index & Storage
- Sub-trajectories indexed and ordered in time
- Spatial index: quadtree
  - [new] Optimal quadtree construction
  - [new] Adaptive, index-driven data storage
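The "spatial index first, temporal predicate second" organization described above can be illustrated with a minimal sketch. Everything here is hypothetical and not taken from the TrajStore code: the `Cell` class, the `range_query` function, and the flat list of cells standing in for the quadtree leaves are assumptions made only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Cell:
    """Hypothetical index cell covering [x0, x1) x [y0, y1)."""
    x0: float
    y0: float
    x1: float
    y1: float
    # Sub-trajectories stored with the cell: (traj_id, [(x, y, t), ...]).
    subtrajs: list = field(default_factory=list)

    def overlaps(self, qx0, qy0, qx1, qy1):
        return self.x0 < qx1 and qx0 < self.x1 and self.y0 < qy1 and qy0 < self.y1


def range_query(cells, qx0, qy0, qx1, qy1, t0, t1):
    """Ids of trajectories with a point in the query box during [t0, t1]."""
    hits = set()
    for cell in cells:
        # Spatial predicate first: whole cells are pruned without being read.
        if not cell.overlaps(qx0, qy0, qx1, qy1):
            continue
        # Temporal predicate second, applied to the data the cell returned.
        for traj_id, points in cell.subtrajs:
            for x, y, t in points:
                if qx0 <= x < qx1 and qy0 <= y < qy1 and t0 <= t <= t1:
                    hits.add(traj_id)
                    break
    return hits
```

A real system would locate the overlapping cells through the quadtree rather than by scanning a list; the scan is used here only to keep the sketch short.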

TrajStore Inserts
[Figure: trajectories t1-t5 over quadtree cells C1-C10, with the corresponding dense-packed / compressed chunks C1-C10]

TrajStore Queries
[Figure: trajectories t1-t3 over quadtree cells C1-C10, with the corresponding dense-packed / compressed chunks C1-C10]
- Trivial index look-up
- One seek per cell

Sparse Spatial Index (1/3)
- Cost-based, optimal spatial partitioning
- Efficient, hierarchical partitioning

Sparse Spatial Index (2/3)
- Basic idea
  - Cost model for query execution times based on the number of cells accessed
  - Optimal quadtree construction based on cost model, query workload, local density & page size
- Optimal balance between
  - Oversized cells: potentially retrieve data that is not queried
  - Undersized cells: seek not amortized if too little data is read; unnecessary seeks if the data is dense and the query relatively large
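The trade-off between oversized and undersized cells can be made concrete with a toy cost function. This is a sketch under loud assumptions, not the cost model from the TrajStore paper: it assumes a uniform grid of square cells, uniform point density, one seek per touched cell, and full reads of every touched cell. The names `query_cost` and `best_cell_side` and the default values for `seek_s`, `bytes_per_point` and `bandwidth` are all hypothetical.

```python
import math


def query_cost(cell_side, query_side, density,
               seek_s=0.01, bytes_per_point=24, bandwidth=50e6):
    """Estimated seconds to answer a square query of side `query_side`.

    Assumptions: one seek per touched cell, and every touched cell is
    read in full at `bandwidth` bytes per second.
    """
    # Worst-case number of cells a query can straddle along one axis.
    cells_per_axis = math.ceil(query_side / cell_side) + 1
    n_cells = cells_per_axis ** 2
    bytes_read = n_cells * cell_side ** 2 * density * bytes_per_point
    return n_cells * seek_s + bytes_read / bandwidth


def best_cell_side(candidates, query_side, density):
    """Candidate cell side with the lowest estimated query cost."""
    return min(candidates, key=lambda s: query_cost(s, query_side, density))
```

With these assumed constants, very small cells are dominated by the seek term and very large cells by the transfer term, so the minimum falls at an intermediate cell size, which is the balance the slide describes.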

Sparse Spatial Index (3/3)
- Algorithm Split
  - Input: a cell, cell, that will be split
  - Output: the number of cells, nbnewcells, that have been inserted into the quadtree to replace this cell
- Algorithm Merge
  - Input: a cell, cell, that will be merged with its neighbors
  - Output: the number of cells, nbnewcells, that have been merged and replaced by a new cell
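The mechanical part of Split and Merge on a quadtree can be sketched as follows. This is only an illustration: the slides do not give the algorithm bodies, and the cost-based decision of when to split or merge is deliberately left out. The `QuadCell` class and the `split` and `merge` functions are hypothetical names.

```python
from dataclasses import dataclass, field


@dataclass
class QuadCell:
    x0: float
    y0: float
    x1: float
    y1: float
    points: list = field(default_factory=list)    # (x, y, t), leaves only
    children: list = field(default_factory=list)  # empty for a leaf

    def is_leaf(self):
        return not self.children


def split(cell):
    """Replace a leaf by its four quadrants; return the number of new cells."""
    mx, my = (cell.x0 + cell.x1) / 2, (cell.y0 + cell.y1) / 2
    cell.children = [
        QuadCell(cell.x0, cell.y0, mx, my),
        QuadCell(mx, cell.y0, cell.x1, my),
        QuadCell(cell.x0, my, mx, cell.y1),
        QuadCell(mx, my, cell.x1, cell.y1),
    ]
    # Redistribute the points to the quadrant that contains them.
    for x, y, t in cell.points:
        idx = (1 if x >= mx else 0) + (2 if y >= my else 0)
        cell.children[idx].points.append((x, y, t))
    cell.points = []
    return len(cell.children)


def merge(parent):
    """Collapse four leaf children into their parent; return cells removed."""
    if parent.is_leaf() or not all(c.is_leaf() for c in parent.children):
        return 0
    removed = len(parent.children)
    merged = [p for c in parent.children for p in c.points]
    parent.points = sorted(merged, key=lambda p: p[2])  # keep time order
    parent.children = []
    return removed
```

Both operations touch only one cell and its immediate children, which is consistent with the slides' description of them as incremental.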

System Adaptivity
- Data evolution
  - Adapt the index & storage with every incoming trajectory
  - No-op / Split() / Merge()
  - Very fast, incremental operations
- Query evolution
  - Highly skewed queries in practice
  - Per-cell query statistics
  - EWMA-based re-clustering
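An exponentially weighted moving average (EWMA) is a cheap way to keep per-cell query statistics that follow a drifting workload. The sketch below assumes the tracked statistic is the side length of the queries hitting a cell; the slides do not say which statistic is used, so that choice, the `CellQueryStats` class and the `alpha` default are all assumptions.

```python
class CellQueryStats:
    """Hypothetical per-cell statistic maintained as an EWMA."""

    def __init__(self, alpha=0.2):
        self.alpha = alpha          # weight given to the newest observation
        self.avg_query_side = None  # no queries seen yet

    def observe(self, query_side):
        """Fold one query into the running average and return it."""
        if self.avg_query_side is None:
            self.avg_query_side = float(query_side)
        else:
            self.avg_query_side = (self.alpha * query_side
                                   + (1 - self.alpha) * self.avg_query_side)
        return self.avg_query_side
```

A larger `alpha` reacts faster to a change in the workload but is noisier; a cell whose average drifts far from the value it was built for would be a candidate for re-clustering.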

Compression
- Unique opportunities due to high spatial redundancy
- Intra-segment redundancy
  - High sampling rate, bounded speed
  - Delta encoding (lossless)
  - Linear interpolation (lossy/lossless)
- Inter-segment redundancy
  - Repeated trips
  - Spatially constrained by roads, paths
  - Online cluster detection
  - Cluster compression (lossy)
- Combination of approaches based on user needs
  - Bounded total error
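Delta encoding exploits the high sampling rate and bounded speed noted above: consecutive points are close, so their differences are small numbers that compress well. A minimal sketch, assuming coordinates and timestamps are already stored as integers (for example fixed-point degrees) so that the round trip is exactly lossless; `delta_encode` and `delta_decode` are hypothetical names.

```python
def delta_encode(points):
    """Keep the first (x, y, t) point and store the rest as differences."""
    if not points:
        return []
    encoded = [points[0]]
    for prev, cur in zip(points, points[1:]):
        encoded.append(tuple(c - p for c, p in zip(cur, prev)))
    return encoded


def delta_decode(encoded):
    """Rebuild the original points by accumulating the differences."""
    if not encoded:
        return []
    points = [encoded[0]]
    for delta in encoded[1:]:
        points.append(tuple(p + d for p, d in zip(points[-1], delta)))
    return points
```

The small deltas would then be handed to a variable-length or general-purpose compressor; that step is omitted here.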

Compression
- Algorithm for forming cluster groups
  - Input: a cell containing a set of trajectories
  - Output: a list of cluster groups {G1, ..., Gn}, where all trajectories in a group Gi are within a distance threshold of each other
- Algorithm for eliminating extraneous points
  - Input: list of points (pi, ..., pj)
  - Output: returns True if the points between pi and pj can be linearly interpolated; otherwise returns False
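The point-elimination test can be sketched as a check that every interior point lies close to the position obtained by interpolating linearly in time between the two endpoints. The slides do not give the algorithm body, so this is one plausible reading: the `max_error` parameter, the Euclidean error measure, and the greedy `simplify` driver are assumptions.

```python
import math


def can_interpolate(points, max_error):
    """True if all interior (x, y, t) points are within max_error of the
    straight-line, constant-speed path between the first and last point."""
    (x0, y0, t0), (x1, y1, t1) = points[0], points[-1]
    for x, y, t in points[1:-1]:
        f = (t - t0) / (t1 - t0) if t1 != t0 else 0.0
        ex, ey = x0 + f * (x1 - x0), y0 + f * (y1 - y0)
        if math.hypot(x - ex, y - ey) > max_error:
            return False
    return True


def simplify(points, max_error):
    """Greedy pass: extend each run while it stays interpolable, then cut."""
    if len(points) <= 2:
        return list(points)
    kept = [points[0]]
    start = 0
    for end in range(2, len(points)):
        if not can_interpolate(points[start:end + 1], max_error):
            # The run start..end-1 was the last one that passed the test.
            kept.append(points[end - 1])
            start = end - 1
    kept.append(points[-1])
    return kept
```

Setting `max_error` to zero keeps the scheme lossless and drops only points that lie exactly on the interpolated path, which matches the "lossy/lossless" label on the earlier compression slide.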

Experimental Setup
- Query answering on 40-200M GPS readings
  - CarTel data
  - Large queries (0.1% / 1% / 10%)
- Approaches compared
  - PostGIS
  - Optimal trajectory segmentation
  - TrajStore
- TrajStore variants
  - Fixed grid
  - Capacity-bound quadtree
  - Compression schemes

Experimental Result

Approaches Compared
- Adaptive: adaptive clustering approach
- Grid: trajectories segmented on a fixed-size grid
- ClustSplit: each trajectory is split into sub-trajectories
- NoSplit: each trajectory is stored in an R-tree
- CapacityQuad: a capacity-bound quadtree is used as the index

Experimental Results
- Fast query execution
  - 1-2 orders of magnitude faster than existing approaches
- Superior indexing scheme
[Figure: query size = 1%, adaptivity & compression turned off]


Experimental Results: Further results
- High insert rate
  - 100K GPS points/s on average
- Scalable
- Very resilient to data & query evolution (vs. fixed grid)
- Compression (1m)
  - 1:8 compression ratio
  - 2.5x performance improvement

Conclusions
- Explosion of location-aware devices & applications
  - Urgent need to support very large-scale GPS analytics
- TrajStore: rethinks both index & storage layers in combination to provide
  - A sparse, adaptive, non-overlapping index, optimal w.r.t. the I/O cost model
  - Index-driven data co-location
  - High compression ratios through intra- + inter-segment compression
- System of choice for analytical queries over very large collections of trajectories