Overview of Data Exploration Techniques. Stratos Idreos, Olga Papaemmanouil, Surajit Chaudhuri

Similar documents
Data Systems that are Easy to Design, Tune and Use. Stratos Idreos

NoDB: Querying Raw Data. --Mrutyunjay

class 5 column stores 2.0 prof. Stratos Idreos

complex plans and hybrid layouts

Query processing on raw files. Vítor Uwe Reus

basic db architectures & layouts

Bridging the Processor/Memory Performance Gap in Database Applications

Column Stores vs. Row Stores How Different Are They Really?

class 6 more about column-store plans and compression prof. Stratos Idreos

column-stores basics

Holistic Indexing in Main-memory Column-stores

class 17 updates prof. Stratos Idreos

column-stores basics

data systems 101 prof. Stratos Idreos class 2

Weaving Relations for Cache Performance

Toward timely, predictable and cost-effective data analytics. Renata Borovica-Gajić DIAS, EPFL

H2O: A Hands-free Adaptive Store

Parallel In-situ Data Processing with Speculative Loading

Data Exploration. Heli Helskyaho Seminar on Big Data Management

Course Outline. Performance Tuning and Optimizing SQL Databases Course 10987B: 4 days Instructor Led

Citation for published version (APA): Ydraios, E. (2010). Database cracking: towards auto-tunning database kernels

systems & research project

Data Modeling and Databases Ch 10: Query Processing - Algorithms. Gustavo Alonso Systems Group Department of Computer Science ETH Zürich

[MS10987A]: Performance Tuning and Optimizing SQL Databases

Data Modeling and Databases Ch 9: Query Processing - Algorithms. Gustavo Alonso Systems Group Department of Computer Science ETH Zürich

data systems 101 prof. Stratos Idreos class 2

DBMS Data Loading: An Analysis on Modern Hardware. Adam Dziedzic, Manos Karpathiotakis*, Ioannis Alagiannis, Raja Appuswamy, Anastasia Ailamaki

class 8 b-trees prof. Stratos Idreos

Performance Tuning & Optimizing SQL Databases Microsoft Official Curriculum (MOC 10987)

NoDB: Efficient Query Execution on Raw Data Files

Segregating Data Within Databases for Performance Prepared by Bill Hulsizer

Benchmarking Adaptive Indexing

How Achaeans Would Construct Columns in Troy. Alekh Jindal, Felix Martin Schuhknecht, Jens Dittrich, Karen Khachatryan, Alexander Bunte

DB2 Performance Essentials

Storage hierarchy. Textbook: chapters 11, 12, and 13

MIS Database Systems.

BIS Database Management Systems.

SQL Server Administration 10987: Performance Tuning and Optimizing SQL Databases. Upcoming Dates. Course Description.

Triangle SQL Server User Group Adaptive Query Processing with Azure SQL DB and SQL Server 2017

Column-Stores vs. Row-Stores. How Different are they Really? Arul Bharathi

6.830 Lecture 8 10/2/2017. Lab 2 -- Due Weds. Project meeting sign ups. Recap column stores & paper. Join Processing:

Hyrise - a Main Memory Hybrid Storage Engine

Andrew Pavlo, Erik Paulson, Alexander Rasin, Daniel Abadi, David DeWitt, Samuel Madden, and Michael Stonebraker SIGMOD'09. Presented by: Daniel Isaacs

Chapter 9. Cardinality Estimation. How Many Rows Does a Query Yield? Architecture and Implementation of Database Systems Winter 2010/11

Cost Models. the query database statistics description of computational resources, e.g.

Adaptivity. Luca Schroeder & Thomas Lively

Self Tuning Databases

Databasesystemer, forår 2005 IT Universitetet i København. Forelæsning 8: Database effektivitet. 31. marts Forelæser: Rasmus Pagh

Datenbanksysteme II: Modern Hardware. Stefan Sprenger November 23, 2016

7. Query Processing and Optimization

Performance in the Multicore Era

Advanced Database Systems

DATABASE PERFORMANCE AND INDEXES. CS121: Relational Databases Fall 2017 Lecture 11

Outline. Database Management and Tuning. Outline. Join Strategies Running Example. Index Tuning. Johann Gamper. Unit 6 April 12, 2012

Hybrid Shipping Architectures: A Survey

CSE 544 Principles of Database Management Systems

CSE 544: Principles of Database Systems

Chapter 17: Parallel Databases

Column-Oriented Database Systems. Liliya Rudko University of Helsinki

Parallel In-situ Data Processing Techniques

Column-Stores vs. Row-Stores: How Different Are They Really?

Introduction to Database Systems CSE 344

Advanced Databases. Lecture 15- Parallel Databases (continued) Masood Niazi Torshiz Islamic Azad University- Mashhad Branch

B.H.GARDI COLLEGE OF ENGINEERING & TECHNOLOGY (MCA Dept.) Parallel Database Database Management System - 2

Anastasia Ailamaki. Performance and energy analysis using transactional workloads

Optimization Overview

Jignesh M. Patel. Blog:

Database Systems CSE 414

HYRISE In-Memory Storage Engine

! Parallel machines are becoming quite common and affordable. ! Databases are growing increasingly large

Chapter 20: Parallel Databases

Chapter 20: Parallel Databases. Introduction

Performance Tuning and Optimizing SQL Databases (10987)

Systems Infrastructure for Data Science. Web Science Group Uni Freiburg WS 2014/15

Module 4. Implementation of XQuery. Part 0: Background on relational query processing

CSE 344 FEBRUARY 14 TH INDEXING

Introduction to Query Processing and Query Optimization Techniques. Copyright 2011 Ramez Elmasri and Shamkant Navathe

Chapter 18: Parallel Databases

Chapter 18: Parallel Databases. Chapter 18: Parallel Databases. Parallelism in Databases. Introduction

I am: Rana Faisal Munir

DASlab: The Data Systems Laboratory

Indexing. Week 14, Spring Edited by M. Naci Akkøk, , Contains slides from 8-9. April 2002 by Hector Garcia-Molina, Vera Goebel

Crescando: Predictable Performance for Unpredictable Workloads

class 13 scans vs indexes prof. Stratos Idreos

Database Systems CSE 414

class 12 b-trees 2.0 prof. Stratos Idreos

Database Systems CSE 414

Administrivia Carnegie Mellon Univ. Dept. of Computer Science /615 - DB Applications

Weaving Relations for Cache Performance

Principles of Data Management. Lecture #9 (Query Processing Overview)

Column-Stores vs. Row-Stores: How Different Are They Really?

CSC 261/461 Database Systems Lecture 19

Lecture 12. Lecture 12: The IO Model & External Sorting

INSTITUTO SUPERIOR TÉCNICO Administração e optimização de Bases de Dados

Performance Monitoring

Goals for Today. CS 133: Databases. Relational Model. Multi-Relation Queries. Reason about the conceptual evaluation of an SQL query

Hash table example. B+ Tree Index by Example Recall binary trees from CSE 143! Clustered vs Unclustered. Example

ReCache: Reactive Caching for Fast Analytics over Heterogeneous Data

Jyotheswar Kuricheti

Oracle Database 10g The Self-Managing Database

Transcription:

Overview of Data Exploration Techniques Stratos Idreos, Olga Papaemmanouil, Surajit Chaudhuri

data exploration not always sure what we are looking for (until we find it) data has always been big volume velocity variety veracity

content structure user interaction middleware database kernel

user interaction visualization 45 min interfaces prefetching approximation sampling 45 min middleware adaptive[ indexing, loading, storage] kernel 45 min

Part 3* *for Part1 and 2 please look at the websites of the tutorial co-authors

applications sql algorithms/operators database kernel cpu caches memory gpu data data index disk

too many preparation options lead to complex installation schema storage load indexes query timeline expert users - idle time - workload knowledge

users/applications declarative interface ask what you want DBA db system

users/applications need to choose the proper system db administrator 1 db administrator 2 data system 1 data system 2

how can we prepare if we do not know what we are up against? (loading, indexing, storage, )

how can we prepare if we do not know what we are up against? (loading, indexing, storage, ) data systems kernels tailored for data exploration no preparation - easy to use - fast

adaptive loading adaptive indexing adaptive storage vision declarative and flexible systems sampling dbtouch

indexing load tune query tune= create proper indices offline performance 10-100X

indexing load tune query tune= create proper indices offline performance 10-100X but it depends on the workload! which indices to build? on which data parts? and when to build them?

big data V s volume velocity variety veracity not enough space to index all data what can go wrong? not enough idle time to finish proper tuning by the time we finish tuning, the workload changes not enough money - energy - resources

big data V s volume velocity variety veracity not enough space to index all data what can go wrong? not enough idle time to finish proper tuning by the time we finish tuning, the workload changes not enough money - energy - resources

database cracking

database cracking idle time workload knowledge external tools human control

database cracking auto-tuning database kernels incremental, adaptive, partial indexing idle time workload knowledge external tools human control

initialization querying indexing database cracking auto-tuning database kernels incremental, adaptive, partial indexing idle time workload knowledge external tools human control

initialization querying indexing database cracking auto-tuning database kernels incremental, adaptive, partial indexing idle time workload knowledge external tools human control

initialization querying indexing database cracking auto-tuning database kernels incremental, adaptive, partial indexing every query workload is treated external as an advice human idle time knowledge tools control on how data should be stored

column-store database a fixed-width and dense array per attribute relation1/table1...... Database Cracking CIDR 2007

column-store database a fixed-width and dense array per attribute relation1/table1...... Database Cracking CIDR 2007

column A Q1: select R.A from R where R.A>10 and R.A<14 13 16 4 9 2 12 7 1 19 3 14 11 8 6 Database Cracking CIDR 2007

column A Q1: select R.A from R where R.A>10 and R.A<14 13 16 4 9 2 12 7 1 19 3 14 11 8 6 Database Cracking CIDR 2007

column A Q1: select R.A from R where R.A>10 and R.A<14 13 16 4 9 2 12 7 1 19 3 14 11 8 6 sort 1 2 3 4 6 7 8 9 11 12 13 14 16 19 Database Cracking CIDR 2007

column A Q1: select R.A from R where R.A>10 and R.A<14 13 16 4 9 2 12 7 1 19 3 14 11 8 6 sort binary search 1 2 3 4 6 7 8 9 11 12 13 14 16 19 Database Cracking CIDR 2007

column A Q1: select R.A from R where R.A>10 and R.A<14 13 16 4 9 2 12 7 1 19 3 14 11 8 6 sort binary search 1 2 3 4 6 7 8 9 11 12 13 14 16 19 result Database Cracking CIDR 2007

column A Q1: select R.A from R where R.A>10 and R.A<14 13 16 4 9 2 12 7 1 19 3 14 11 8 6 sort binary search 1 2 3 4 6 7 8 9 11 12 13 14 16 19 result time + knowledge Database Cracking CIDR 2007

Q1: select R.A from R where R.A>10 and R.A<14 column A 13 16 4 9 2 12 7 1 19 3 14 11 8 6 Database Cracking CIDR 2007

Q1: select R.A from R where R.A>10 and R.A<14 column A 13 16 4 9 2 12 7 1 19 3 14 11 8 6 4 9 2 7 1 3 8 6 13 12 11 16 19 14 piece1: A<=10 Database Cracking CIDR 2007

Q1: select R.A from R where R.A>10 and R.A<14 column A 13 16 4 9 2 12 7 1 19 3 14 11 8 6 4 9 2 7 1 3 8 6 13 12 11 16 19 14 piece1: A<=10 piece2: 10<A<14 Database Cracking CIDR 2007

Q1: select R.A from R where R.A>10 and R.A<14 column A 13 16 4 9 2 12 7 1 19 3 14 11 8 6 4 9 2 7 1 3 8 6 13 12 11 16 19 14 piece1: A<=10 piece2: 10<A<14 piece3: A>=14 Database Cracking CIDR 2007

column A Q1: select R.A from R where R.A>10 and R.A<14 13 16 4 9 2 12 7 1 19 3 14 11 8 6 4 9 2 7 1 3 8 6 13 12 11 16 19 14 piece1: A<=10 piece2: 10<A<14 piece3: A>=14 result Database Cracking CIDR 2007

gain knowledge on how data is organized column A Q1: select R.A from R where R.A>10 and R.A<14 13 16 4 9 2 12 7 1 19 3 14 11 8 6 4 9 2 7 1 3 8 6 13 12 11 16 19 14 piece1: A<=10 piece2: 10<A<14 piece3: A>=14 result Database Cracking CIDR 2007

gain knowledge on how data is organized column A Q1: select R.A from R where R.A>10 and R.A<14 13 16 4 9 2 12 7 1 19 3 14 11 8 6 4 9 2 7 1 3 8 6 13 12 11 16 19 14 piece1: A<=10 piece2: 10<A<14 piece3: A>=14 result dynamically/on-the-fly within the select-operator Database Cracking CIDR 2007

Q1: select R.A from R where R.A>10 and R.A<14 Q2: select R.A from R where R.A>7 and R.A<=16 column A 13 16 4 9 2 12 7 1 19 3 14 11 8 6 4 9 2 7 1 3 8 6 13 12 11 16 19 14 piece1: A<=10 piece2: 10<A<14 piece3: A>=14 dynamically/on-the-fly within the select-operator Database Cracking CIDR 2007

Q1: select R.A from R where R.A>10 and R.A<14 Q2: select R.A from R where R.A>7 and R.A<=16 column A 13 16 4 9 2 12 7 1 19 3 14 11 8 6 4 9 2 7 1 3 8 6 13 12 11 16 19 14 piece1: A<=10 piece2: 10<A<14 piece3: A>=14 dynamically/on-the-fly within the select-operator Database Cracking CIDR 2007

column A Q1: select R.A from R where R.A>10 and R.A<14 Q2: select R.A from R where R.A>7 and R.A<=16 13 16 4 9 2 12 7 1 19 3 14 11 8 6 4 9 2 7 1 3 8 6 13 12 11 16 19 14 piece1: A<=10 piece2: 10<A<14 piece3: A>=14 4 2 1 3 6 7 9 8 13 12 11 14 16 19 piece1: A<=7 piece2: 7<A<=10 dynamically/on-the-fly within the select-operator Database Cracking CIDR 2007

column A Q1: select R.A from R where R.A>10 and R.A<14 Q2: select R.A from R where R.A>7 and R.A<=16 13 16 4 9 2 12 7 1 19 3 14 11 8 6 4 9 2 7 1 3 8 6 13 12 11 16 19 14 piece1: A<=10 piece2: 10<A<14 piece3: A>=14 4 2 1 3 6 7 9 8 13 12 11 14 16 19 piece1: A<=7 piece2: 7<A<=10 piece3: 10<A<14 dynamically/on-the-fly within the select-operator Database Cracking CIDR 2007

column A Q1: select R.A from R where R.A>10 and R.A<14 Q2: select R.A from R where R.A>7 and R.A<=16 13 16 4 9 2 12 7 1 19 3 14 11 8 6 4 9 2 7 1 3 8 6 13 12 11 16 19 14 piece1: A<=10 piece2: 10<A<14 piece3: A>=14 4 2 1 3 6 7 9 8 13 12 11 14 16 19 piece1: A<=7 piece2: 7<A<=10 piece3: 10<A<14 piece4: 14<=A<=16 piece5: A>16 dynamically/on-the-fly within the select-operator Database Cracking CIDR 2007

column A Q1: select R.A from R where R.A>10 and R.A<14 Q2: select R.A from R where R.A>7 and R.A<=16 13 16 4 9 2 12 7 1 19 3 14 11 8 6 4 9 2 7 1 3 8 6 13 12 11 16 19 14 piece1: A<=10 piece2: 10<A<14 piece3: A>=14 4 2 1 3 6 7 9 8 13 12 11 14 16 19 piece1: A<=7 piece2: 7<A<=10 piece3: 10<A<14 piece4: 14<=A<=16 piece5: A>16 dynamically/on-the-fly within the select-operator Database Cracking CIDR 2007

the more we crack, the more we learn column A Q1: select R.A from R where R.A>10 and R.A<14 Q2: select R.A from R where R.A>7 and R.A<=16 13 16 4 9 2 12 7 1 19 3 14 11 8 6 4 9 2 7 1 3 8 6 13 12 11 16 19 14 piece1: A<=10 piece2: 10<A<14 piece3: A>=14 4 2 1 3 6 7 9 8 13 12 11 14 16 19 piece1: A<=7 piece2: 7<A<=10 piece3: 10<A<14 piece4: 14<=A<=16 piece5: A>16 dynamically/on-the-fly within the select-operator Database Cracking CIDR 2007

Database Cracking CIDR 2007

select [15,55] Database Cracking CIDR 2007

select [15,55] Database Cracking CIDR 2007

select [15,55] 10 20 30 40 50 60 Database Cracking CIDR 2007

select [15,55] 10 20 30 40 50 60 select [15,55] Database Cracking CIDR 2007

select [15,55] 10 20 30 40 50 60 select [15,55] Database Cracking CIDR 2007

touch at most two pieces at a time pieces become smaller and smaller select [15,55] 10 20 30 40 50 60 select [15,55] Database Cracking CIDR 2007

continuous adaptation set-up 100K random selections random selectivity random value ranges in a 10 million integer column Response time (secs) 1000 100 10 1 0.1 Scan Crack 0.01 Full Index 0.001 1 10 100 1000 10000 100000 Query sequence (x1000) Database Cracking CIDR 2007

continuous adaptation set-up 100K random selections random selectivity random value ranges in a 10 million integer column almost no initialization overhead Response time (secs) 1000 100 10 1 0.1 0.01 Full Index Scan Crack 0.001 1 10 100 1000 10000 100000 Query sequence (x1000) Database Cracking CIDR 2007

continuous adaptation set-up 100K random selections random selectivity random value ranges in a 10 million integer column almost no initialization overhead Response time (secs) 1000 100 10 1 0.1 0.01 Full Index Scan Crack continuous improvement 0.001 1 10 100 1000 10000 100000 Query sequence (x1000) Database Cracking CIDR 2007

continuous adaptation set-up 100K random selections random selectivity random value ranges in a 10 million integer column almost no initialization overhead Response time (secs) 1000 100 10 1 0.1 0.01 Full Index Scan Crack continuous improvement 0.001 1 10 100 1000 10000 100000 Query sequence (x1000) Database Cracking CIDR 2007

continuous adaptation set-up 10K random selections selectivity 10% random value ranges in a 30 million integer column Cumulative average response time (secs) 200 100 10 1 0.1 0.01 0.004 Full Index Scan Crack 0.001 1 10 100 1000 10000 Query sequence Database Cracking CIDR 2007

continuous adaptation set-up 10K random selections selectivity 10% random value ranges in a 30 million integer column Cumulative average response time (secs) 200 100 10 1 0.1 0.01 0.004 Full Index Scan Crack 0.001 1 10 100 1000 10000 Query sequence Database Cracking CIDR 2007

continuous adaptation set-up 10K random selections selectivity 10% random value ranges in a 30 million integer column 10K queries later, Full Index still has not amortized the initialization costs Cumulative average response time (secs) 200 100 10 1 0.1 0.01 0.004 0.001 Full Index Scan Crack 1 10 100 1000 10000 Query sequence Database Cracking CIDR 2007

A table1

table1......

select R.A from R where R.A>10 and R.A<14 table1......

select R.A from R where R.A>10 and R.A<14 table1 select max(r.a),max(r.b),max(s.a),max(s.b) from R,S where v1 <R.C<v2 and v3 <R.D<v4 and v5 <R.E<v6 and k1 <S.C<k2 and k3 <S.D<k4 and k5 <S.E<k and R.F = S.F......

select R.A from R where R.A>10 and R.A<14 table1 select max(r.a),max(r.b),max(s.a),max(s.b) from R,S where v1 <R.C<v2 and v3 <R.D<v4 and v5 <R.E<v6 and k1 <S.C<k2 and k3 <S.D<k4 and k5 <S.E<k and R.F = S.F updates... joins concurrency control......

basics (CIDR07) cracking databases updates (SIGMOD07) >1 columns >1 columns (SIGMOD09) storagerestrictions (SIGMOD09) robustness robustne (PVLDB12) algorithms (PVLDB11) benchmarking (TPCTC10) concurrency control (PVLDB12) adaptive storage (SIGMOD14) time-series (SIGMOD14) hadoop (Yale/Saarland) multi-cores (SIGMOD15) b-trees (HP Labs)

cracking tangram base data table 1 as queries arrive... table 2

cracking tangram base data table 1 as queries arrive... table 1 table 2 table 2

cracking tangram base data table 1 as queries arrive... table 1 partial materialization table 2 table 2

cracking tangram base data table 1 as queries arrive... table 1 partial materialization partial indexing table 2 table 2

cracking tangram base data table 1 as queries arrive... table 1 partial materialization partial indexing continuous adaptation table 2 table 2

cracking tangram base data table 1 as queries arrive... table 1 x partial materialization partial indexing continuous adaptation storage adaptation table 2 table 2 x x

cracking tangram base data table 1 as queries arrive... table 1 partial materialization partial indexing continuous adaptation storage adaptation table 2 table 2

cracking tangram base data table 1 table 2 as queries arrive... table 1 table 2 partial materialization partial indexing continuous adaptation storage adaptation no tuple reconstruction

cracking tangram base data table 1 table 2 as queries arrive... table 1 table 2 partial materialization partial indexing continuous adaptation storage adaptation no tuple reconstruction adaptive alignment

cracking tangram base data table 1 table 2 as queries arrive... table 1 table 2 partial materialization partial indexing continuous adaptation storage adaptation no tuple reconstruction adaptive alignment

cracking tangram base data table 1 table 2 as queries arrive... table 1 table 2 partial materialization partial indexing continuous adaptation storage adaptation no tuple reconstruction adaptive alignment sort in caches

cracking tangram base data table 1 table 2 as queries arrive... table 1 table 2 partial materialization partial indexing continuous adaptation storage adaptation no tuple reconstruction adaptive alignment sort in caches crack joins

cracking tangram base data table 1 table 2 as queries arrive... table 1 table 2 q1 q2 partial materialization partial indexing continuous adaptation storage adaptation no tuple reconstruction adaptive alignment sort in caches crack joins lightweight locking

cracking tangram base data table 1 table 2 as queries arrive... table 1 table 2 query random partial materialization partial indexing continuous adaptation storage adaptation no tuple reconstruction adaptive alignment sort in caches crack joins lightweight locking stochastic cracking

adaptive loading adaptive indexing adaptive storage vision declarative and flexible systems sampling dbtouch

loading load tune query copy data inside the database database now has full control slow process not all data might be needed all the time Adaptive Loading, CIDR 11

database vs. unix tools single query cost (secs) 1 file, 4 attributes, 1 billion tuples DB Awk Adaptive Loading, CIDR 11

database vs. unix tools single query cost (secs) 1 file, 4 attributes, 1 billion tuples DB Awk break down db cost Adaptive Loading, CIDR 11

database vs. unix tools single query cost (secs) 1 file, 4 attributes, 1 billion tuples DB Awk break down db cost loading is a major bottleneck Adaptive Loading, CIDR 11

database vs. unix tools single query cost (secs) 1 file, 4 attributes, 1 billion tuples DB Awk break down db cost loading is a major bottleneck but writing/maintaining scripts does not scale Adaptive Loading, CIDR 11

adaptive loading load/touch only what is needed and only when it is needed Adaptive Loading, CIDR 11

but raw data access is expensive tokenizing - parsing - no indexing - no statistics challenge: fast raw data access Adaptive Loading, CIDR 11

query plan Adaptive Loading, CIDR 11

query plan scan Adaptive Loading, CIDR 11

query plan scan db Adaptive Loading, CIDR 11

query plan scan loading Adaptive Loading, CIDR 11

query plan scan files loading access raw data adaptively on-the-fly Adaptive Loading, CIDR 11

query plan scan files access raw data adaptively on-the-fly cache loading Adaptive Loading, CIDR 11

selective parsing file indexing file splitting online statistics query plan scan files access raw data adaptively on-the-fly cache loading Adaptive Loading, CIDR 11

three graphs from paper NoDB, SIGMOD 2012

three graphs from paper NoDB, SIGMOD 2012

three graphs from paper NoDB, SIGMOD 2012

three graphs from paper NoDB, SIGMOD 2012

three graphs from paper NoDB, SIGMOD 2012

reducing data-to-query time three graphs from paper NoDB, SIGMOD 2012

adaptive loading adaptive indexing adaptive storage vision declarative and flexible systems sampling dbtouch

rows & columns row-store column-store H20, SIGMOD 14

no fixed optimal solution Execution Time (sec) 40 30 20 10 0 DBMS-C DBMS-R 2 10 20 30 40 50 60 70 80 90 100 Attributes Accessed (%) H20, SIGMOD 14

rows & columns row-store column-store H20, SIGMOD 14

rows & columns row-store column-store hybrid-store H20, SIGMOD 14

which layout is best? H20, SIGMOD 14

which layout is best? H20, SIGMOD 14

which layout is best? H20, SIGMOD 14

which layout is best? too many combinations to maintain in parallel H20, SIGMOD 14

query cost q(l)= L Â i=1 max(costi IO,costi CPU ) for a given query we can know which layout is best the one that will cause the fewer cache misses H20, SIGMOD 14

if we know all queries up front we can choose the layouts adaptive storage: continuously adapt layouts based on incoming queries H20, SIGMOD 14

but computing all possible combinations is expensive query select A+B+C+D from R where A<10 and E>10 1. deal only with attributes referenced in queries 2. handle select clause separately from where clause 3. start from pure column-store and build up 4. stop when no improvement possible H20, SIGMOD 14

Query&Response&Time&(sec)& 8" 7" 6" 5" 4" 3" 2" 1" 0" Row.store" Column.store" Best.Hybrid" H2O" 0" 5" 10" 15" 20" 25" 30" 35" 40" 45" 50" Query&Sequence& H20, SIGMOD 14

adaptive loading adaptive indexing adaptive storage vision declarative and flexible systems sampling dbtouch

querying load tune query SQL interface correct and complete answers dbtouch, CIDR 2013

querying load tune query complex and slow - not fit for exploration SQL interface correct and complete answers dbtouch, CIDR 2013

just touch the data you need dbtouch, CIDR 2013

just touch the data you need this is not about query building it is about query processing dbtouch, CIDR 2013

cracking example dbtouch CIDR 2013

what does this mean for db kernels? dbtouch, CIDR 2013

db select R.a from R what does this mean for db kernels? dbtouch, CIDR 2013

db select R.a from R what does this mean for db kernels? dbtouch 56 38 45 2 process only what you touch dbtouch, CIDR 2013

from touch to query processing touch location (x) v size(width) row ID=(tuples*x)/size storage dbtouch, CIDR 2013

select avg(r.c) where R.a=S.b and S.b<20 S.b R.c R.a avg(r.c) R.a=S.b σ(<20) S.b R.a dbtouch, CIDR 2013

select avg(r.c) where R.a=S.b and S.b<20 S.b R.c R.a avg(r.c) R.a=S.b σ(<20) S.b R.a dbtouch, CIDR 2013

select avg(r.c) where R.a=S.b and S.b<20 S.b R.c R.a avg(r.c) R.a=S.b σ(<20) S.b R.a dbtouch, CIDR 2013

visual objects samples hierarchy dbtouch, CIDR 2013

visual objects samples hierarchy base data initial sample dbtouch, CIDR 2013

visual objects samples hierarchy base data initial sample create incrementally and on demand dbtouch, CIDR 2013

dbtouch, CIDR 2013

dbtouch, CIDR 2013

adaptive loading adaptive indexing adaptive storage vision declarative and flexible systems sampling dbtouch

sampling outside the engine application sampling logic/query rewrite Kernel Aqua, VLDB 1999 BlinkDB, Eurosys 2013

sampling with SciBORG queries data Kernel maintain hierarchy of biased samples SciBORG, CIDR 2011

sampling with SciBORG impression 1 impression 2 impression 3 base column continuously reorganized based on the workload SciBORG, CIDR 2011

adaptive loading adaptive indexing adaptive storage vision declarative and flexible systems sampling dbtouch

building systems declaratively abstraction performance vision: being able to define system components in a higher level language without significant performance penalty RodentStore, CIDR 2009 Abstraction without regrets, IEEE Data Engin. Bulletin/PVLDB 2014

data systems today allow us to answer queries fast db data systems for exploration should allow us to find fast which queries to ask db explore + approximate processing techniques

data systems today allow us to answer queries fast db data systems for exploration should allow us to find fast which queries to ask db explore + approximate processing techniques thank you!