TrajStore: an Adaptive Storage System for Very Large Trajectory Data Sets

Similar documents
TrajStore: an Adaptive Storage System for Very Large Trajectory Data Sets

HYRISE In-Memory Storage Engine

Mobility Data Management & Exploration

Spatio-temporal Range Searching Over Compressed Kinetic Sensor Data. Sorelle A. Friedler Google Joint work with David M. Mount

The CarTel Project. Lewis Girod. M.I.T. Computer Science & Artificial Intelligence Lab cartel.csail.mit.edu

Sandor Heman, Niels Nes, Peter Boncz. Dynamic Bandwidth Sharing. Cooperative Scans: Marcin Zukowski. CWI, Amsterdam VLDB 2007.

Clustering Algorithm (DBSCAN) VISHAL BHARTI Computer Science Dept. GC, CUNY

Tackling the Challenges of Big Data! Tackling The Challenges of Big Data. This Module. Samuel Madden. Samuel Madden. Visualizing Twitter

Fosca Giannotti et al,.

data parallelism Chris Olston Yahoo! Research

Buffered Co-scheduling: A New Methodology for Multitasking Parallel Jobs on Distributed Systems

In-Memory Data Management

Constructing Popular Routes from Uncertain Trajectories

ADR and DataCutter. Sergey Koren CMSC818S. Thursday March 4 th, 2004

COLUMN-STORES VS. ROW-STORES: HOW DIFFERENT ARE THEY REALLY? DANIEL J. ABADI (YALE) SAMUEL R. MADDEN (MIT) NABIL HACHEM (AVANTGARDE)

Processing of Very Large Data

TrajAnalytics: A software system for visual analysis of urban trajectory data

Track Join. Distributed Joins with Minimal Network Traffic. Orestis Polychroniou! Rajkumar Sen! Kenneth A. Ross

Column Stores vs. Row Stores How Different Are They Really?

Algorithm Engineering Applied To Graph Clustering

Dynamic Spatial Partitioning for Real-Time Visibility Determination. Joshua Shagam Computer Science

Hyrise - a Main Memory Hybrid Storage Engine

Introduction to Geographic Information Science. Some Updates. Last Lecture 4/6/2017. Geography 4103 / Raster Data and Tesselations.

Visual Traffic Jam Analysis based on Trajectory Data

Workloads Programmierung Paralleler und Verteilter Systeme (PPV)

COMP 430 Intro. to Database Systems. Indexing

pcube: Update-Efficient Online Aggregation with Progressive Feedback and Error Bounds

PARALLEL AND DISTRIBUTED PLATFORM FOR PLUG-AND-PLAY AGENT-BASED SIMULATIONS. Wentong CAI

Spatial Scattering for Load Balancing in Conservatively Synchronized Parallel Discrete-Event Simulations

Data Model and Management

On the Scalability of Hierarchical Ad Hoc Wireless Networks

Descrambling Privacy Protected Information for Authenticated users in H.264/AVC Compressed Video

Was ist dran an einer spezialisierten Data Warehousing platform?

Administração e Optimização de Bases de Dados 2012/2013 Index Tuning

Graph-Based Synopses for Relational Data. Alkis Polyzotis (UC Santa Cruz)

Bigtable: A Distributed Storage System for Structured Data By Fay Chang, et al. OSDI Presented by Xiang Gao

Extended R-Tree Indexing Structure for Ensemble Stream Data Classification

Differentially Private H-Tree

C-STORE: A COLUMN- ORIENTED DBMS

Clustering Part 4 DBSCAN

Job Re-Packing for Enhancing the Performance of Gang Scheduling

SciSpark 201. Searching for MCCs

Scalable Compression and Transmission of Large, Three- Dimensional Materials Microstructures

Contact: Ye Zhao, Professor Phone: Dept. of Computer Science, Kent State University, Ohio 44242

ROUTING ALGORITHMS Part 2: Data centric and hierarchical protocols

Low Latency Data Grids in Finance

Cost Models for Query Processing Strategies in the Active Data Repository

Accelerating Analytical Workloads

DS504/CS586: Big Data Analytics Data Management Prof. Yanhua Li

ScalaIOTrace: Scalable I/O Tracing and Analysis

FGDEFRAG: A Fine-Grained Defragmentation Approach to Improve Restore Performance

Distributed computing: index building and use

A Distributed Approach to Fast Map Overlay

Approximately Uniform Random Sampling in Sensor Networks

DATA MINING AND WAREHOUSING

Optimal Linear Interpolation Coding for Server-based Computing

Publishing CitiSense Data: Privacy Concerns and Remedies

Architecture and Implementation of Database Systems (Winter 2014/15)

Semantic-Based Surveillance Video Retrieval

Bigtable. A Distributed Storage System for Structured Data. Presenter: Yunming Zhang Conglong Li. Saturday, September 21, 13

1 (eagle_eye) and Naeem Latif

Evolution of Database Systems

Clustering from Data Streams

Outline. Database Management and Tuning. Index Data Structures. Outline. Index Tuning. Johann Gamper. Unit 5

Data Management Issues in Disconnected Sensor Networks

School of Computing & Singapore-MIT Alliance, National University of Singapore {cuibin, lindan,

Data Centric Computing

Extracting Rankings for Spatial Keyword Queries from GPS Data

Scalable Selective Traffic Congestion Notification

Efficient Computation of Data Cubes. Network Database Lab

Delay Tolerant Networks

class 5 column stores 2.0 prof. Stratos Idreos

A Work-Efficient GPU Algorithm for Level Set Segmentation. Mike Roberts Jeff Packer Mario Costa Sousa Joseph Ross Mitchell

MauveDB: Statistical Modeling inside Database Systems

XML Data Stream Processing: Extensions to YFilter

Using the SDACK Architecture to Build a Big Data Product. Yu-hsin Yeh (Evans Ye) Apache Big Data NA 2016 Vancouver

Coding for the Network: Scalable and Multiple description coding Marco Cagnazzo

Citation for published version (APA): Ydraios, E. (2010). Database cracking: towards auto-tunning database kernels

Automatic Tuning of Sparse Matrix Kernels

Chapter 12: Indexing and Hashing. Basic Concepts

Database Replication in Tashkent. CSEP 545 Transaction Processing Sameh Elnikety

Mind the Gap: Large-Scale Frequent Sequence Mining

A Comparison of Two Fully-Dynamic Delaunay Triangulation Methods

DS595/CS525: Urban Network Analysis --Urban Mobility Prof. Yanhua Li

University of Florida CISE department Gator Engineering. Clustering Part 4

4G WIRELESS VIDEO COMMUNICATIONS

Data Warehousing & Data Mining

The Effectiveness of Deduplication on Virtual Machine Disk Images

Architecture-Conscious Database Systems

PF-OLA: A High-Performance Framework for Parallel Online Aggregation

Graph Partitioning for Scalable Distributed Graph Computations

Chapter 12: Indexing and Hashing

Bigtable: A Distributed Storage System for Structured Data. Andrew Hon, Phyllis Lau, Justin Ng

Large Scale Complex Network Analysis using the Hybrid Combination of a MapReduce Cluster and a Highly Multithreaded System

BigTable: A Distributed Storage System for Structured Data (2006) Slides adapted by Tyler Davis

Suggested Topics for Written Project Report. Traditional Databases:

Databricks Delta: Bringing Unprecedented Reliability and Performance to Cloud Data Lakes

Chapter 17: Parallel Databases

Summary. 4. Indexes. 4.0 Indexes. 4.1 Tree Based Indexes. 4.0 Indexes. 19-Nov-10. Last week: This week:

Efficient Computation of Radial Distribution Function on GPUs

Transcription:

TrajStore: an Adaptive Storage System for Very Large Trajectory Data Sets Philippe Cudré-Mauroux Eugene Wu Samuel Madden Computer Science and Artificial Intelligence Laboratory Massachusetts Institute of Technology ICDE 2010, March 2 Long Beach, CA, USA

Motivation (1/2) Explosion of position-aware devices & apps MIT s CarTel project (Balakrishnan, Madden) 2

Motivation (2/2) CarTel Massive amounts of GPS data Real-time, high insert rates Large spatiotemporal queries New class of applications Live feeds from large fleets of mobile objects Current solutions (e.g., PostGIS) failed Designed for (relatively) sparse data 3

Outline 4 Motivation Large-Scale GPS Data Mining Conventional approach R-Trees & Trajectory-Segmentation TrajStore Architecture Sparse Spatial Index Adaptivity Compression Performance Conclusions

Inserting [Conventional Approach] R1 R4 R11 R3 R8 R10 R9 R5 R14 R13 R12 R2 R6 R16 R7 R17 R19 R15 R18 R1 R2 R3 R4 R5 R6 R7 R8 R9 R10 R11 R12 R13 R14 R15 R16 R17 R18 R19 5

Querying [Conventional Approach] R1 R4 R11 R3 R9 R5 R13 R2 R6 R15 R8 R10 R14 R12 R7 R17 R16 R18 R1 R2 R3 R4 R5 R6 R7 R19......... {R11, TrajID, (x1, y1, t1), (x2, y2, t2), (x3, y3,t3), (x4, y4, t4)...}... {R12, TrajID, (x1, y1, t1), (x2, y2, t2)(x3,y3,t3),...}... {R14, TrajID, (x1, y1, t1), (x2, y2, t2)(x3,y3,t3),...}...... R8 R9 R10 R11 R12 R13 R14 R15 R16 R17 R18 R19 6

Issues with Current Systems Efficient for sparse data only Catastrophic for large, dense, overlapping data Slow inserts Bounding boxes creation Multiple index updates per new trajectory Slow queries Index considers a very high number of overlapping objects Inefficient selects of records Complex index maintenance & look-up One disk seek for each trajectory sub-segment Several minutes/hours to resolve aggregate queries 7

TrajStore Adaptive system to store & query very large trajectory data sets Sparse, non-overlapping spatial index Chunk-based data organization co-location, dense-packing & compression Buffered, amortized IO operations Minimization of total IO cost 8

Architecture Workload Mix of queries and trajectory insertions Storage Manager densepacking/ compression schemes Spatial QuadTree Cells info Split/ Merge Cell Clusterer 1 ----- 2 ----- 3 ----- Query Processor Data retrieval/ insertion Pages w/ segments + temporal indices Buffered operations disks 9

Index & Storage subtrajectories indexed and ordered in time 10 Spatial index: quadtree [new] Optimal quadtree construction [new] Adaptive, index-driven data storage

TrajStore Inserts C2 C3 Densed-Packed / Compressed chunks C1 C5 C6 t1 t2 t3 C4 C8 C7 C9 C10 t4 t5 C1 C2 C3 C4 C5 C6 11 C7 C7 C9 C10

TrajStore Queries C2 C3 Densed-Packed / Compressed chunks C1 C5 C6 t1 t2 t3 C4 C8 C7 C9 C10 C1 C2 C3 Trivial index look-up One seek per cell C4 C5 C6 12 C7 C7 C9 C10

Sparse Spatial Index (1/2) Cost-based, optimal spatial partitioning Efficient, hierarchical partitioning 13

Sparse Spatial Index (2/2) Basic idea Cost-model for query execution times based on #cells accessed Optimal quadtree construction based on cost-model, query workload, local density & page size cellsize opt (Q, D, pagesize) Optimal balance between Oversized cells potentially retrieves data that is not queried Undersized cells seek not amortized if too little data read unnecessary seeks if dense data and relatively large query 14

System Adaptivity Data evolution Adapt the index & storage with every incoming trajectory No-op / Split() / Merge() Very fast, incremental operations Query evolution Highly-skewed queries in practice Per-cell query statistics EWMA-based re-clustering 15

Compression Unique opportunities due to high spatial redundancy Intra-segment redundancy High-sampling rate, bounded speed Delta encoding (lossless) Linear interpolation (lossy/lossless) δ Inter-segments redundancy Repeated trips Spatially constraint by roads, paths Online cluster-detection Cluster compression (lossy) Combination of approaches based on user needs Bounded total error 16

Experimental Setup Query answering on 40-200M GPS readings CarTel data Large queries (0.1% / 1% / 10%) Approaches compared PostGIS Optimal trajectory segmentation TrajStore TrajStore variants Fixed grid Capacity-bound quadtree Compression schemes 17

Experimental Results (1/2) Blazing fast query execution 1-2 orders of magnitude faster than existing approaches Superior indexing scheme 18!"#$%&'#%#()*+#,*'-# :FFF' @FFF'?FFF' =FFF' ;FFF' >FFF' -FFF' +FFF' F'!"#$%&' ()*"+,+' ()*"+,-'./01%&' 2$/*%' 342$/*%'.#$#5*%6 70#"' 34&'8,91' +:-' ;<=' >-?' @@;' >++-' =-;' A0BCDE' ;>>' +-+<' @<F' @-:+' ==?@' >->+' 34&'8,91' A0BCDE' [query size = 1%] adaptivity & compression turned off

Experimental Results (2/2) Further results High-insert rate 100K GPS points / s on average Scalable Very resilient to data & query evolution fixed grid Compression (1m) 1:8 compression ratio 2.5 performance improvement See paper for full results 19

Conclusions Explosion of location-aware devices & applications Urgent need to support very large-scale GPS analytics TrajStore: rethink both index & storage layers in combination to provide Sparse, adaptive, non-overlapping index optimal w.r.t. IO cost-model Index-driven data co-location High compression ratios intra + inter-segments compression 20 System of choice for analytical queries over very large collections of trajectories