MISO: Souping Up Big Data Query Processing with a Multistore System


1 MISO: Souping Up Big Data Query Processing with a Multistore System
Jeff LeFevre, UC Santa Cruz*
Jagan Sankaranarayanan, NEC Labs
Hakan Hacıgümüş, NEC Labs
Junichi Tatemura, NEC Labs
Neoklis Polyzotis, UC Santa Cruz
Michael J. Carey, UC Irvine
Appeared in SIGMOD 2014
*Now at HP Vertica R&D

2 Analytical Systems in Today's Organizations
Big Data / Exploratory Queries (HDFS, Big Data)
  Data that you collect because there may be a use for it
  Unknown business value
  Formatted on the fly as needed (schema-on-read)
  Evolutionary workload (i.e., ad hoc)
Business Reporting Queries (RDBMS, Business Data)
  Data important to your organization
  High business value
  Well formatted (schema-on-write)
  Well-known workload

3 System Characteristics
Big Data Store offers the ability to store massive amounts of data and begin querying right away
  Executes SQL + UDFs
  Slow query performance [Pavlo et al. 2009, Dean et al. 2010]
RDBMS offers the ability to achieve high performance for queries
  Executes SQL
  High data loading time [Simitsis et al. 2009]
Emerging system architectures combine both
  Create a multistore system for workload co-processing

4 Multistore Architecture to Speed Up Big Data Processing
[Diagram: Big Data / Exploratory Queries and Business Reporting Queries; a query plan spans HV (HDFS, Big Data) and DW (RDBMS)]
DW has limited spare capacity to speed up big data queries


5 Multistore System Architecture
Multistore query plan is split across both stores
Transfers data and computation during query processing
Two related works [Simitsis et al. 2012, DeWitt et al. 2013] focus on optimizing multistore plans
Let us look at the performance of all possible splits for a single plan
[Diagram: query q1 enters the Query Optimizer, which produces a Multistore Plan for the Execution Layer over HV (Big Data) and DW]

6 Execution Profile for All Possible Splits
[Figure: stacked bars of execution time (10^3 sec) per multistore plan, broken down into DW-EXE, TRANSFER, DUMP, and HV-EXE. X-axis labels include B, S, and H; one bar is annotated 28k. Annotations mark on-the-fly data loading and up-front data loading.]
Each plan represents a different split of the plan
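
To make the idea of "all possible splits" concrete, here is a minimal Python sketch, not the authors' optimizer. It assumes a simplified linear plan (real plans are generally DAGs) in which a prefix of operators runs in HV, the intermediate result is dumped and transferred, and the remaining operators run in DW. The `Op` type, the per-GB dump and transfer rates, and all numbers are hypothetical placeholders.

```python
from typing import List, NamedTuple, Tuple


class Op(NamedTuple):
    name: str
    hv_secs: float  # assumed cost of running this operator in HV
    dw_secs: float  # assumed cost of running this operator in DW
    out_gb: float   # assumed size of the operator's output


def split_costs(ops: List[Op], input_gb: float,
                dump_secs_per_gb: float,
                transfer_secs_per_gb: float) -> List[Tuple[int, float]]:
    """Return (k, total_secs) for every split point k.

    Operators ops[:k] run in HV and ops[k:] run in DW.
    k == 0 moves the base data (DW-only style execution);
    k == len(ops) moves nothing (HV-only execution).
    """
    results = []
    for k in range(len(ops) + 1):
        hv = sum(o.hv_secs for o in ops[:k])
        dw = sum(o.dw_secs for o in ops[k:])
        if k == len(ops):
            moved_gb = 0.0
        elif k == 0:
            moved_gb = input_gb
        else:
            moved_gb = ops[k - 1].out_gb
        move = moved_gb * (dump_secs_per_gb + transfer_secs_per_gb)
        results.append((k, hv + move + dw))
    return results


# Hypothetical three-operator plan; the numbers are illustrative only.
plan = [Op("extract", 100, 40, 10), Op("join", 200, 20, 1), Op("agg", 50, 5, 0.1)]
costs = split_costs(plan, input_gb=100, dump_secs_per_gb=2, transfer_secs_per_gb=3)
best_split = min(costs, key=lambda kc: kc[1])
```

With these placeholder numbers the cheapest split runs only the first operator in HV, because the intermediate result is much smaller than the base data while the later operators are cheaper in DW.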

7 Our Approach: Multistore Design
Problem is at the core of multistore architectures
  Up-front data loading is too time consuming
  On-the-fly data loading results in redundant work
  Need to decide what subset of data will be useful to load into DW
Multistore physical design problem is a data placement problem that determines:
  What data to materialize (which views)
  Where to materialize the data (which store)
We introduce MISO, our MultIStore Online tuner

8 Multistore System with MISO
MISO is our secret sauce to tune the multistore
[Diagram: queries q1, q2 enter MISO (Tuner, Rewriter, Meta-data Store); plans are split across stores and run by the Execution Layer over HV (LOG DATA) and DW]
MISO keeps track of opportunistic views
Q2 benefits from the opportunistic views of Q1 [SIGMOD 2014 paper on Wednesday]
Execution produces opportunistic views
Periodically moves the useful views to DW [Today's talk]

9 Our MultIStore Online Tuner (MISO)
Periodically reorganizes the data in each store
  Transfers views between the stores
  Adapts the design as the workload changes dynamically (system observes a stream of queries)
High-level goals:
  1. Facilitate earlier splits for each query
  2. Most of the query processing runs on DW, even bypassing HV sometimes
  3. Limit the impact on the Data Warehouse (DW)
Challenges
  Have to solve 2 physical design problems
  Problems are tied together in a unique way


10 Problem Statement
[Diagram: queries q1, q2 enter the Query Optimizer; the MISO Tuner uses the query History and What-If calls to the optimizer to plan a new design for the Execution Layer. HV (HDFS) holds Big Data and views Vh under a view storage budget; DW (RDBMS) holds views Vd under a view storage budget; a view transfer budget limits movement between the stores.]
Given an observed query stream, the current view placement, and budget constraints, compute a new placement of views that minimizes the expected future workload cost.

11 MISO Tuner Solution
Budgets: HV storage budget B_h (views V_h), DW storage budget B_d (views V_d), view transfer budget B_t
Solve DW design FIRST: use as much of the transfer budget as needed
Solve HV design SECOND: use the leftover transfer budget
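
The ordering on this slide (DW design first, HV design second with whatever transfer budget is left) can be sketched in Python as below. This is an illustrative sketch under stated assumptions, not the MISO implementation: a simple benefit-per-GB greedy pass stands in for the per-store knapsack solver, each view is assumed to live in at most one store, and the `View` fields, benefit values, and budgets are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class View:
    name: str
    size_gb: float     # assumed > 0
    benefit_dw: float  # hypothetical estimated benefit if placed in DW
    benefit_hv: float  # hypothetical estimated benefit if placed in HV
    location: str      # store that currently holds the view: "HV" or "DW"


def greedy_pick(cands: List[View], storage: float, transfer: float,
                target: str, benefit: Callable[[View], float]
                ) -> Tuple[List[View], float]:
    """Greedy stand-in for the per-store packing step.

    Picks views by benefit density; a view already in the target store
    consumes storage but no transfer budget. Returns the chosen views
    and the unused transfer budget.
    """
    chosen = []
    ranked = sorted(cands, key=lambda v: benefit(v) / v.size_gb, reverse=True)
    for v in ranked:
        move = 0.0 if v.location == target else v.size_gb
        if v.size_gb <= storage and move <= transfer:
            chosen.append(v)
            storage -= v.size_gb
            transfer -= move
    return chosen, transfer


def tune(views: List[View], b_d: float, b_h: float, b_t: float
         ) -> Tuple[List[View], List[View]]:
    # Phase 1: DW design first, with access to the whole transfer budget.
    dw, leftover = greedy_pick(views, b_d, b_t, "DW", lambda v: v.benefit_dw)
    # Phase 2: HV design second, limited to the leftover transfer budget.
    rest = [v for v in views if v not in dw]
    hv, _ = greedy_pick(rest, b_h, leftover, "HV", lambda v: v.benefit_hv)
    return dw, hv


# Hypothetical views and budgets (GB); values are illustrative only.
views = [
    View("a", 4, 20, 4, "HV"),
    View("b", 6, 12, 6, "HV"),
    View("c", 2, 1, 3, "DW"),
    View("d", 1, 0.1, 5, "DW"),
]
dw_design, hv_design = tune(views, b_d=6, b_h=8, b_t=5)
```

In this example the DW phase spends 4 GB of the 5 GB transfer budget on view `a`, and the HV phase can still move the 1 GB view `d` with the remainder.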


12 Selecting Views for Each Store
MISO solves a multi-dimensional 0-1 knapsack problem for each store
  Inputs are the set of views and budget constraints
  Dimensions are the storage and transfer budgets
Knapsack complication: view benefits are not independent
  A pair of views (x, y) can interact in two ways [Schnaitter et al. 2009]
    Positive interaction
    Negative interaction
  [Schnaitter et al. 2012] Packing heuristics
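
As a point of reference for the knapsack formulation, the following Python sketch solves a plain two-dimensional 0-1 knapsack by dynamic programming over integer storage and transfer budgets. It is a textbook formulation, not the packing heuristic used by MISO: it assumes view benefits are independent, which the slide explicitly says is not the case in practice. The `View` fields and all numbers are hypothetical.

```python
from typing import List, NamedTuple, Tuple


class View(NamedTuple):
    name: str
    size: int       # storage cost in integer units (assumed, e.g. GB)
    transfer: int   # transfer cost; 0 if the view is already in this store
    benefit: float  # hypothetical estimated benefit, treated as independent


def pack_views(views: List[View], storage_budget: int,
               transfer_budget: int) -> Tuple[float, Tuple[str, ...]]:
    """Classic 2-D 0-1 knapsack.

    best[s][t] holds (benefit, chosen view names) achievable with at most
    s units of storage and t units of transfer.
    """
    best = [[(0.0, ()) for _ in range(transfer_budget + 1)]
            for _ in range(storage_budget + 1)]
    for v in views:
        # Iterate budgets downward so each view is used at most once.
        for s in range(storage_budget, v.size - 1, -1):
            for t in range(transfer_budget, v.transfer - 1, -1):
                prev_benefit, prev_names = best[s - v.size][t - v.transfer]
                if prev_benefit + v.benefit > best[s][t][0]:
                    best[s][t] = (prev_benefit + v.benefit, prev_names + (v.name,))
    return best[storage_budget][transfer_budget]


# Hypothetical candidate views for one store; values are illustrative only.
candidates = [
    View("v1", size=4, transfer=4, benefit=10.0),
    View("v2", size=3, transfer=0, benefit=6.0),
    View("v3", size=5, transfer=5, benefit=9.0),
]
total_benefit, chosen = pack_views(candidates, storage_budget=8, transfer_budget=5)
```

Handling positive and negative interactions between views would require adjusting the benefit of a view depending on which other views are already packed, which this independent-benefit formulation does not attempt.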

13 Experimental Setup
2 independent clusters connected via 1 GbE
Hadoop (HV)
  15 nodes (1 head + 14 slaves)
  Latest version
DW
  9 nodes (1 head + 8 slaves)
  Commercial parallel data warehouse
All machines have identical hardware
  2x Xeon CPUs, 16 GB RAM, 2 TB disk

14 Experimental Setup (2)
Workload (see our SIGMOD DANAC 2013 paper)
  Mimics data analyst/scientist exploration to find value
  32 queries: 8 analysts each writing 4 versions of a query
  Complex queries with UDFs
  Average query runtime on Hadoop cluster is about 10k secs
Data sets
  1 TB Twitter tweet stream (social data)
  1 TB Foursquare check-in stream (social data)
  12 GB Landmarks data (static/historic data)


15 System Performance Comparison
[Figure: stacked bars of TTI (10^3 sec) for HV-ONLY, DW-ONLY, MS-BASIC, HV-OP, and MS-MISO, broken down into DW-EXE, TRANSFER, HV-EXE, TUNE, and ETL]
TTI: Time To Insight. Holistic metric that includes everything: ETL, tuning, transfer, and execution time.
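
Read as a formula, TTI is the sum of the components shown in the chart legend. The small Python sketch below only illustrates that bookkeeping; the class name, field names, and numbers are hypothetical placeholders and are not measurements from the talk.

```python
from dataclasses import dataclass


@dataclass
class TtiBreakdown:
    """Time To Insight components, all in the same unit (e.g. 10^3 sec)."""
    etl: float = 0.0       # up-front data loading
    tune: float = 0.0      # time spent by the tuner
    transfer: float = 0.0  # moving data between the stores
    hv_exe: float = 0.0    # query execution in HV
    dw_exe: float = 0.0    # query execution in DW

    def total(self) -> float:
        return self.etl + self.tune + self.transfer + self.hv_exe + self.dw_exe


# Placeholder values only, chosen to make the arithmetic easy to follow.
example = TtiBreakdown(etl=10.0, tune=2.0, transfer=3.0, hv_exe=40.0, dw_exe=5.0)
example_tti = example.total()
```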


16 Breakdown of Execution Time
[Figure: TTI (10^3 sec) for HV-ONLY, DW-ONLY, MS-BASIC, HV-OP, and MS-MISO, broken down into DW-EXE, TRANSFER, HV-EXE, TUNE, and ETL]
[Figure: percentage of execution time (%HV-EXE, %TRANSFER, %DW-EXE) by query rank for (a) MS-BASIC, (b) MS-MISO with small storage budget, (c) MS-MISO with large storage budget]

17 MultiStore Tuning Methods
[Figure: TTI (10^3 sec) for MS-BASIC, MS-OFF, MS-LRU, MS-MISO, and MS-ORA, broken down into DW-EXE, TRANSFER, and TUNE]


18 Varying Storage Budgets (B_t = 10 GB)
[Figure: TTI (x 10^3 seconds) for MS-LRU, MS-OFF, and MS-MISO at storage budgets 0.125X, 0.5X, 1X, and 2X]

19 Impact on DW
Execute a background workload on DW
  TPC-DS reporting queries consuming IO and CPU resources
  Consider DW with 20% spare capacity and with 40% spare capacity, for both IO and CPU
Avg. impact (slowdown) on DW queries < 2%

20 Summary
Tuning the multistore is important to obtaining good performance
Our MISO Tuner algorithm periodically reorganizes views in each store
This improves upon both up-front data loading and on-the-fly data loading
Outperforms multistore query processing without tuning (MS-BASIC) and naïve tuning approaches such as LRU

