Accelerating Analytical Workloads

Similar documents
Track Join: Distributed Joins with Minimal Network Traffic. Orestis Polychroniou, Rajkumar Sen, Kenneth A. Ross

Accelerating Foreign-Key Joins using Asymmetric Memory Channels

Things To Know. When Buying for an! Alekh Jindal, Jorge Quiané, Jens Dittrich

Announcements. Database Systems CSE 414. Why compute in parallel? Big Data 10/11/2017. Two Kinds of Parallel Data Processing

HyPer on Cloud 9. Thomas Neumann. February 10, Technische Universität München

A Linear Algebra Approach to OLAP

Database Group Research Overview. Immanuel Trummer

Column Stores vs. Row Stores How Different Are They Really?

BIG DATA AND HADOOP ON THE ZFS STORAGE APPLIANCE

Huge market -- essentially all high performance databases work this way

CIS 601 Graduate Seminar. Dr. Sunnie S. Chung Dhruv Patel ( ) Kalpesh Sharma ( )

What We Have Already Learned. DBMS Deployment: Local. Where We Are Headed Next. DBMS Deployment: 3 Tiers. DBMS Deployment: Client/Server

Systems Infrastructure for Data Science. Web Science Group Uni Freiburg WS 2014/15

CSE 344 MAY 2 ND MAP/REDUCE

Morsel-Driven Parallelism: A NUMA-Aware Query Evaluation Framework for the Many-Core Age. Presented by Dennis Grishin

MapReduce. Stony Brook University CSE545, Fall 2016

SoftFlash: Programmable Storage in Future Data Centers Jae Do Researcher, Microsoft Research

Improving the MapReduce Big Data Processing Framework

arxiv: v1 [cs.db] 15 Sep 2017

HyPer-sonic Combined Transaction AND Query Processing

Introduction to Hadoop. Owen O Malley Yahoo!, Grid Team

Impala. A Modern, Open Source SQL Engine for Hadoop. Yogesh Chockalingam

Jignesh M. Patel. Blog:

Data Transformation and Migration in Polystores

Hash Joins for Multi-core CPUs. Benjamin Wagner

Column-Stores vs. Row-Stores. How Different are they Really? Arul Bharathi

Introduction to Column Stores with MemSQL. Seminar Database Systems Final presentation, 11. January 2016 by Christian Bisig

Camdoop Exploiting In-network Aggregation for Big Data Applications Paolo Costa

GLADE: A Scalable Framework for Efficient Analytics. Florin Rusu (University of California, Merced) Alin Dobra (University of Florida)

R & G Chapter 13. Implementation of single Relational Operations Choices depend on indexes, memory, stats, Joins Blocked nested loops:

Introduction to Data Management CSE 344

Evaluation of Relational Operations

CSE 190D Spring 2017 Final Exam Answers

Introduction to Data Management CSE 344

Red Fox: An Execution Environment for Relational Query Processing on GPUs

New Developments in Spark

From Single Purpose to Multi Purpose Data Lakes. Thomas Niewel Technical Sales Director DACH Denodo Technologies March, 2019

Similarity Joins in MapReduce

Hadoop 2.x Core: YARN, Tez, and Spark. Hortonworks Inc All Rights Reserved

Red Fox: An Execution Environment for Relational Query Processing on GPUs

A Comparison of Memory Usage and CPU Utilization in Column-Based Database Architecture vs. Row-Based Database Architecture

Service Oriented Performance Analysis

A Fast and High Throughput SQL Query System for Big Data

Leveraging Lock Contention to Improve Transaction Applications. University of Washington

Introduction to Data Management CSE 344

Parallel DBMS. Parallel Database Systems. PDBS vs Distributed DBS. Types of Parallelism. Goals and Metrics Speedup. Types of Parallelism

Chapter 18: Parallel Databases

CSE 190D Spring 2017 Final Exam

CS-2510 COMPUTER OPERATING SYSTEMS

GLADE: A Scalable Framework for Efficient Analytics. Florin Rusu University of California, Merced

Chapter 18: Parallel Databases

Chapter 18: Parallel Databases. Chapter 18: Parallel Databases. Parallelism in Databases. Introduction

Sort vs. Hash Join Revisited for Near-Memory Execution. Nooshin Mirzadeh, Onur Kocberber, Babak Falsafi, Boris Grot

4th National Conference on Electrical, Electronics and Computer Engineering (NCEECE 2015)

HyPer-sonic Combined Transaction AND Query Processing

Load Balancing for Entity Matching over Big Data using Sorted Neighborhood

April Copyright 2013 Cloudera Inc. All rights reserved.

Staying FIT: Efficient Load Shedding Techniques for Distributed Stream Processing

ECE 7650 Scalable and Secure Internet Services and Architecture ---- A Systems Perspective

Concurrent execution of an analytical workload on a POWER8 server with K40 GPUs A Technology Demonstration

A Graph-based Database Partitioning Method for Parallel OLAP Query Processing

HAWQ: A Massively Parallel Processing SQL Engine in Hadoop

The Stratosphere Platform for Big Data Analytics

Developing MapReduce Programs

Programming Models MapReduce

CAVA: Exploring Memory Locality for Big Data Analytics in Virtualized Clusters

High-Performance ACID via Modular Concurrency Control

Whitepaper / Benchmark

COLUMN-STORES VS. ROW-STORES: HOW DIFFERENT ARE THEY REALLY? DANIEL J. ABADI (YALE) SAMUEL R. MADDEN (MIT) NABIL HACHEM (AVANTGARDE)

data parallelism Chris Olston Yahoo! Research

Workload Characterization and Optimization of TPC-H Queries on Apache Spark

Introduction to Database Systems CSE 414. Lecture 16: Query Evaluation

On Smart Query Routing: For Distributed Graph Querying with Decoupled Storage

HYRISE In-Memory Storage Engine

Evolution of Big Data Facebook. Architecture Summit, Shenzhen, August 2012 Ashish Thusoo

Parallel machines are becoming quite common and affordable. Databases are growing increasingly large

Chapter 20: Parallel Databases

Chapter 20: Parallel Databases. Introduction

CS542. Algorithms on Secondary Storage Sorting Chapter 13. Professor E. Rundensteiner. Worcester Polytechnic Institute

The Evolution of Big Data Platforms and Data Science

Revisiting Aggregation for Data Intensive Applications: A Performance Study

Advanced Databases. Lecture 15- Parallel Databases (continued) Masood Niazi Torshiz Islamic Azad University- Mashhad Branch

Welcome to the presentation. Thank you for taking your time for being here.

Data Processing on Emerging Hardware

A Comparison of MapReduce Join Algorithms for RDF

Introduction to Database Systems CSE 414

DBMS Data Loading: An Analysis on Modern Hardware. Adam Dziedzic, Manos Karpathiotakis*, Ioannis Alagiannis, Raja Appuswamy, Anastasia Ailamaki

Shark: SQL and Rich Analytics at Scale. Yash Thakkar ( ) Deeksha Singh ( )

Chisel++: Handling Partitioning Skew in MapReduce Framework Using Efficient Range Partitioning Technique

Time Complexity and Parallel Speedup to Compute the Gamma Summarization Matrix

Analytics in the cloud

Can We Trust SQL as a Data Analytics Tool?

Implementation of Relational Operations

Chapter 17: Parallel Databases

Shark: SQL and Rich Analytics at Scale. Michael Xueyuan Han Ronny Hajoon Ko

Track Join: Distributed Joins with Minimal Network Traffic

Advanced Databases: Parallel Databases A.Poulovassilis

Shark: Hive (SQL) on Spark

CompSci 516: Database Systems

Advanced Database Systems

Transcription:

Accelerating Analytical Workloads
Thomas Neumann, Technische Universität München
April 15, 2014

Scale Out in Big Data Analytics
- Big Data usually means data is distributed
- Scale out to process very large inputs
- But for analytics, data has to be combined and aggregated
- Typically map/reduce-based (Hadoop/Hive etc.): data is copied to processing nodes for aggregation; not very smart, dominated by network traffic
- Smart data movement can speed up processing significantly
[Figures: query plan for TPC-H Query 12 (sort, group by, join of lineitem and orders); main-memory cluster of six nodes connected by a switch]

Running Example (1)
- Focus on analytical query processing in this talk
- TPC-H query 12 used as running example
- Runtime dominated by the join of orders and lineitem
- Example from a well-known benchmark, but applicable to all distributed joins
[Figure: query plan for TPC-H Query 12 (sort, group by, join of lineitem and orders)]

Running Example (2)
- Relations are equally distributed across the nodes
- We make no assumptions about the data distribution
- Thus, tuples may join with tuples on remote nodes
- Communication over the network is required
[Figure: example orders tuples (key, priority) and lineitem tuples (key, shipmode) spread over nodes 1-3]

CPU vs. Network
- CPU speed has grown much faster than network bandwidth
[Figure: computing power in MIPS (80286, 80386, 80486, Pentium, Pentium Pro, Pentium 4, Core 2, Core i7, per core) vs. LAN bandwidth in Mbit/s (Token Ring, Fast Ethernet, Gigabit Ethernet, 10 Gigabit forecast), 1980-2020, both on log scales]

Scale Out: Network is the Bottleneck
- Single node: performance is bound algorithmically
- Cluster: the network is the bottleneck for query processing
- Investing time and effort in decreasing network traffic pays off
- Goal: increase local processing to close the performance gap
[Figure: join performance in M tuples/s, distributed vs. local]

Neo-Join: Network-optimized Join [ICDE14]
1. Open Shop Scheduling: efficient network communication
2. Optimal Partition Assignment: increase local processing
3. Selective Broadcast: handle value skew

Bandwidth Sharing
- Simultaneous use of a single link creates a bottleneck
- Reduces bandwidth by at least a factor of 2
[Figure: nodes 1-3 connected by a switch; when two transfers use the same link, that link becomes the bottleneck]

Naïve Schedule
- Nodes 2 and 3 send to node 1 at the same time
- Bandwidth sharing increases the network duration significantly
[Figure: transfer timelines for nodes 1-3 with and without bandwidth sharing]

Open Shop Scheduling (1)
Avoiding bandwidth sharing translates to open shop scheduling:
- A sender has one transfer per receiver
- A receiver should receive at most one transfer at a time
- A sender should send at most one transfer at a time
[Figure: bipartite graph of transfers between senders (nodes 1-3) and receivers (nodes 1-3)]

Open Shop Scheduling (2)
Compute the optimal schedule:
- Edge weights represent the total transfer duration
- The scheduler repeatedly finds perfect matchings
- Each matching specifies one communication phase
- Transfers in a phase will never share bandwidth
[Figure: weighted bipartite graph of transfers between senders and receivers (nodes 1-3)]
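The Python sketch below illustrates this phase-building idea under simplifying assumptions: durations[s][r] is the amount sender s must ship to receiver r, each phase is a bipartite matching (every node sends to at most one node and receives from at most one), and a phase runs until its shortest transfer finishes. Neo-Join's actual scheduler chooses its matchings more carefully to provably reach the lower bound max(max row sum, max column sum); this sketch just takes any maximum matching per round.

```python
def max_matching(adj, n_right):
    """Kuhn's augmenting-path algorithm for maximum bipartite matching.
    adj[s] lists the receivers that sender s still has data for."""
    match_right = [-1] * n_right  # receiver -> sender

    def try_augment(s, visited):
        for r in adj[s]:
            if r in visited:
                continue
            visited.add(r)
            if match_right[r] == -1 or try_augment(match_right[r], visited):
                match_right[r] = s
                return True
        return False

    for s in range(len(adj)):
        try_augment(s, set())
    return [(match_right[r], r) for r in range(n_right) if match_right[r] != -1]


def open_shop_schedule(durations):
    """Split an all-to-all transfer matrix into communication phases.
    Within a phase no two transfers share a link; a phase lasts until
    its shortest transfer completes (a preemptive simplification)."""
    n = len(durations)
    remaining = [row[:] for row in durations]
    phases = []
    while True:
        adj = [[r for r in range(n) if remaining[s][r] > 0] for s in range(n)]
        matching = max_matching(adj, n)
        if not matching:
            break
        step = min(remaining[s][r] for s, r in matching)
        for s, r in matching:
            remaining[s][r] -= step
        phases.append((step, matching))
    return phases


# Example: every node must ship 4 units to every other node (zero diagonal).
example = [[0, 4, 4],
           [4, 0, 4],
           [4, 4, 0]]
phases = open_shop_schedule(example)
for step, matching in phases:
    print(step, matching)
print("network duration:", sum(step for step, _ in phases))
```

On this symmetric example the sketch needs two phases of length 4, matching the lower bound; with skewed transfer sizes the simple matching choice can be worse than the optimal open shop schedule.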

Optimal Schedule
- The open shop schedule achieves the minimal network duration
- The schedule duration is determined by the maximum straggler
[Figure: transfer timelines for nodes 1-3 under bandwidth sharing vs. the open shop schedule; the maximum straggler determines the duration]

Distributed Join
- Tuples may join with tuples on remote nodes
- Repartition and redistribute both relations for a local join
- Tuples will join only with the corresponding partition
- Using a hash, range, radix, or other partitioning scheme
- In any case: decide how to assign partitions to nodes
[Figure: relations O and L, initially fragmented across the nodes, redistributed into partitions O1-O3 and L1-L3]

Running Example: Hash Partitioning
[Figure: orders and lineitem tuples on nodes 1-3, hash-partitioned on the join key with (x + 2) mod 3 into partitions P1-P3]
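As a rough illustration of this repartitioning step, the sketch below hash-partitions a handful of made-up orders and lineitem tuples on the join key, using the (x + 2) mod 3 function from the figure, so that matching tuples land in the same partition and can be joined locally. The sample data and helper names are illustrative, not taken from the talk.

```python
from collections import defaultdict

def hash_partition(tuples, key_of, num_partitions):
    """Assign every tuple to a partition based on its join key, so that
    tuples with equal keys from both relations land in the same partition."""
    partitions = defaultdict(list)
    for t in tuples:
        # (x + 2) mod 3, as in the running example's figure
        partitions[(key_of(t) + 2) % num_partitions].append(t)
    return partitions

# Tiny samples in the spirit of the running example: (key, payload).
orders = [(1, "1-URGENT"), (2, "2-HIGH"), (6, "1-URGENT"), (13, "1-URGENT")]
lineitem = [(1, "MAIL"), (2, "SHIP"), (6, "SHIP"), (13, "MAIL"), (13, "SHIP")]

o_parts = hash_partition(orders, lambda t: t[0], 3)
l_parts = hash_partition(lineitem, lambda t: t[0], 3)

# After redistribution, partition p of both relations is joined locally
# on whichever node partition p was assigned to.
for p in range(3):
    local_join = [(o, l) for o in o_parts[p] for l in l_parts[p] if o[0] == l[0]]
    print(f"partition {p}: {len(local_join)} joined tuples")
```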

Assign Partitions to Nodes (1)
Option 1: Minimize network traffic
- Assign each partition to the node that owns its largest fragment
- Only the small fragments of a partition are sent over the network
- A schedule with minimal network traffic may still have a high duration
[Figure: hash partitioning (x mod 3) into P1-P3 and the resulting open shop schedule; traffic: 26, time: 26]
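A minimal sketch of this min-traffic heuristic, assuming a small matrix h[i][j] of fragment sizes; the numbers are illustrative, not the exact figures from the slide.

```python
def assign_min_traffic(h):
    """Option 1: assign every partition to the node that already owns its
    largest fragment, so only the smaller fragments travel over the network.
    h[i][j] is the size of the fragment of partition j stored on node i.
    Returns the assignment and the resulting total network traffic."""
    n, p = len(h), len(h[0])
    assignment = [max(range(n), key=lambda i: h[i][j]) for j in range(p)]
    traffic = sum(h[i][j] for j in range(p) for i in range(n) if i != assignment[j])
    return assignment, traffic

# Illustrative fragment sizes h[node][partition].
h = [[5, 6, 5],
     [4, 2, 3],
     [4, 2, 3]]
print(assign_min_traffic(h))
```

Note that on this input every partition ends up on node 0, which minimizes traffic but turns node 0 into a receive straggler, exactly the problem the next slide addresses.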

Assign Partitions to Nodes (2)
Option 2: Minimize response time
- Query response time is the time from request to result
- Query response time is dominated by the network duration
- To minimize the network duration, minimize the maximum straggler
[Figure: the same hash partitioning (x mod 3) with two partition assignments; minimizing traffic yields traffic 26 and time 26, minimizing the maximum straggler yields traffic 28 and time 10]

Minimize Maximum Straggler (Phase 2)
- Formalized as a mixed-integer linear program: decide how to assign partitions to nodes
- Optimal partition assignment is NP-hard in the worst case
- But in practice fast enough using CPLEX or Gurobi (< 0.5 % overhead for 32 nodes with 200 partitions each)
- Works with any partitioning scheme

From the paper: the send cost of node i sums all local fragments of partitions that are not assigned to it; likewise, the receive cost adds the size of remote fragments of partitions that were assigned to node i. MILPs require a linear objective, which minimizing a maximum is not. Fortunately, we can rephrase the objective and instead minimize a new variable w; additional constraints take care that w assumes the maximum over the send/receive costs (OPT-ASSIGN):

minimize w, subject to
  w - \sum_{j=0}^{p-1} h_{ij} (1 - x_{ij}) \geq 0                         for 0 \leq i < n
  w - \sum_{j=0}^{p-1} x_{ij} \sum_{k=0,\, k \neq i}^{n-1} h_{kj} \geq 0   for 0 \leq i < n
  \sum_{i=0}^{n-1} x_{ij} = 1                                             for 0 \leq j < p

One can obtain an optimal solution to the partition assignment problem (OPT-ASSIGN) by passing this mixed-integer linear program to an optimizer such as Gurobi (http://www.gurobi.com) or IBM CPLEX. These solvers can be linked as a library to create and solve linear programs via API calls.
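For intuition, here is a brute-force Python sketch of the same objective: it enumerates every partition-to-node assignment and keeps the one with the smallest maximum per-node send/receive cost. This is only viable for tiny instances and the fragment sizes are made up; the actual system hands the MILP above to a solver such as Gurobi or CPLEX.

```python
from itertools import product

def opt_assign_bruteforce(h):
    """Exhaustively search for the partition-to-node assignment that
    minimizes the maximum straggler, i.e. the largest per-node send or
    receive cost. h[i][j] is the size of the fragment of partition j
    stored on node i."""
    n, p = len(h), len(h[0])
    best_assignment, best_makespan = None, float("inf")
    for assignment in product(range(n), repeat=p):  # assignment[j] = node of partition j
        # Send cost: local fragments of partitions NOT assigned to node i.
        send = [sum(h[i][j] for j in range(p) if assignment[j] != i)
                for i in range(n)]
        # Receive cost: remote fragments of partitions assigned to node i.
        recv = [sum(h[k][j] for j in range(p) if assignment[j] == i
                    for k in range(n) if k != i)
                for i in range(n)]
        makespan = max(send + recv)
        if makespan < best_makespan:
            best_assignment, best_makespan = assignment, makespan
    return best_assignment, best_makespan

# Illustrative fragment sizes h[node][partition].
h = [[5, 6, 5],
     [4, 2, 3],
     [4, 2, 3]]
assignment, makespan = opt_assign_bruteforce(h)
print("partition -> node:", assignment, "max straggler:", makespan)
```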

Running Example: Locality
[Figure: orders and lineitem tuples on nodes 1-3, radix-partitioned on the most significant bits of the key into partitions P1-P3; almost all tuples fall into the partition that is local to their node]

Locality
- The running example exhibits time-of-creation clustering
- Radix repartitioning on the most significant bits retains this locality
- Partition assignment can exploit locality
- Significantly reduces the query response time
[Figure: radix partitioning (MSB) into P1-P3 and the resulting open shop schedule; traffic: 5, time: 3]
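A small sketch of MSB radix partitioning, assuming 5-bit keys and 2 radix bits (both choices are illustrative): consecutive key ranges map to the same partition, so keys that were created around the same time, and therefore tend to sit on the same node, mostly stay local after repartitioning.

```python
def radix_partition_msb(keys, key_bits, partition_bits):
    """Partition keys by their most significant bits. Consecutive key
    ranges map to the same partition, so relations whose keys are
    clustered by time of creation keep most tuples local and little
    data has to cross the network."""
    shift = key_bits - partition_bits
    partitions = {}
    for k in keys:
        partitions.setdefault(k >> shift, []).append(k)
    return partitions

# Assume 5-bit keys (0..31) and 2 radix bits, i.e. key ranges
# 0-7, 8-15, 16-23, 24-31 form the four partitions.
keys_node1 = [1, 1, 2, 2, 6, 6]      # "old" tuples with small keys
keys_node3 = [17, 18, 18, 19, 20]    # "new" tuples with large keys
print(radix_partition_msb(keys_node1, key_bits=5, partition_bits=2))
print(radix_partition_msb(keys_node3, key_bits=5, partition_bits=2))
```

Each node's keys end up almost entirely in a single partition, which the partition assignment can then leave on that node.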

Broadcast
- Alternative to data repartitioning
- Replicate the smaller relation O to all nodes
- The larger relation L remains fragmented across the nodes
[Figure: relation O is broadcast to every node and joined locally with the resident fragment of L]

Selective Broadcast
- Decide per partition whether to assign or to broadcast
- Broadcast orders for P2, let the line items remain fragmented
- Assign the other partitions taking locality into account
- Improves performance for high skew and many duplicates
[Figure: hash partitioning (mod 3) of O1-O3 and L1-L3 with and without selective broadcast; traffic: 14, time: 6]
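The sketch below captures the per-partition decision with a simple traffic-only comparison: broadcasting a partition's orders fragments costs roughly (n - 1) times their total size but leaves the line items in place, whereas assigning the partition ships both relations' remote fragments to one node. The fragment sizes are invented for illustration, and Neo-Join makes this choice as part of its scheduling optimization rather than with a standalone heuristic like this.

```python
def selective_broadcast_plan(o_frag, l_frag):
    """For every partition decide whether to broadcast the (small) orders
    fragments to all nodes or to co-locate both relations on one node.
    o_frag[i][j] / l_frag[i][j]: bytes of partition j of orders / lineitem
    stored on node i."""
    n, p = len(o_frag), len(o_frag[0])
    plan = []
    for j in range(p):
        o_total = sum(o_frag[i][j] for i in range(n))
        # Broadcast: every node needs all orders of partition j except its
        # own copy; the lineitem fragments stay where they are.
        broadcast_cost = sum(o_total - o_frag[i][j] for i in range(n))
        # Assign: ship both relations' remote fragments to the best node.
        assign_cost = min(
            sum(o_frag[k][j] + l_frag[k][j] for k in range(n) if k != i)
            for i in range(n))
        plan.append(("broadcast" if broadcast_cost < assign_cost else "assign", j))
    return plan

# Illustrative sizes: partition 1 has many duplicate lineitem keys (skew),
# so broadcasting its few orders tuples is cheaper than moving the lineitems.
o_frag = [[2, 1, 1], [1, 1, 1], [1, 1, 2]]
l_frag = [[6, 1, 1], [1, 6, 1], [1, 5, 2]]
print(selective_broadcast_plan(o_frag, l_frag))
```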

Experimental Setup
- Cluster of 4 nodes: Core i7, 4 cores, 3.4 GHz, 32 GB RAM
- Gigabit Ethernet
- Tuples consist of a 64-bit key and a 64-bit payload

Locality
- Vary locality from 0 % (uniform distribution) to 100 % (range partitioning)
- Neo-Join improves join performance from 29 M to 156 M tuples/s (> 500 %)
- Setup: 3 nodes, 600 M tuples
[Figure: join performance in tuples/s over locality (0 %-100 %) for Neo-Join, Hadoop Hive, DBMS-X, and MySQL Cluster]

TPC-H Results (scale factor 100)
- Results for three selected TPC-H queries
- Broadcast outperforms hash for large differences in relation size
- Neo-Join always performs better due to selective broadcast and locality
- Setup: 4 nodes, ca. 100 GB of data
[Figure: execution time in seconds of hash, broadcast, and Neo-Join for TPC-H queries Q12, Q14, and Q19]

Further Optimizations
- Network-aware joining is only one ingredient
- All query processing steps are important
  - parallel, network-aware, maximize locality [PVLDB12]
  - group by, sort, cube, ... [DEBUL14, SIGMOD13, PVLDB11]
  - also: smart loading/parsing [PVLDB13]
- Query optimization has a huge impact
  - reformulate the query into a more efficient form [EDBT14, ICDE12]
  - involves algebraic optimization, exploiting statistics, etc. [ICDE11]
  - can improve runtimes by orders of magnitude!
- The result is much faster than a naive map/reduce approach

Conclusion
- Analyzing Big Data is challenging
  - very large volume, distributed
  - many operations require joining data
  - the network is a bottleneck
- We can use optimization techniques to speed up the analysis
  - maximize bandwidth utilization
  - exploit data characteristics (locality, skew, etc.)
  - smart scheduling of operations
- Improves over commonly used approaches like Hive by orders of magnitude