Track Join: Distributed Joins with Minimal Network Traffic. Orestis Polychroniou, Rajkumar Sen, Kenneth A. Ross



Local Joins
- Algorithms: hash join, sort-merge join, index join, nested-loop join
- Spilling to disk: bounded by disk bandwidth
- RAM resident: scales with the number of cores, bounded by RAM bandwidth
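
A RAM-resident hash join from the list above can be sketched in a few lines (an illustrative Python sketch, not the paper's implementation): build a hash table on the smaller input and probe it with the larger one.

```python
# Minimal in-memory hash join: build on the smaller table, probe with the
# larger -- the kind of local join the slide says is bounded by RAM bandwidth.
def hash_join(r, s):
    # r, s: lists of (key, payload) tuples
    build, probe, swapped = (r, s, False) if len(r) <= len(s) else (s, r, True)
    table = {}
    for key, payload in build:
        table.setdefault(key, []).append(payload)   # build phase
    out = []
    for key, payload in probe:                      # probe phase
        for match in table.get(key, ()):
            # keep R's payload first in the output regardless of build side
            out.append((key, match, payload) if not swapped
                       else (key, payload, match))
    return out
```

Building on the smaller input keeps the hash table (and its cache footprint) as small as possible, which is why real systems pick the build side by estimated cardinality.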

RAM > Network
RAM bandwidth? An example:
- 2-channel 1333 MHz RAM = ~18 GB/s
- Add 4-channel RAM = ~30 GB/s; add x4 CPUs = ~120 GB/s
- Partition = ~1/3 of bandwidth; partition = copy [Satish et al., SIGMOD 2010; Wassenberg et al., Euro-Par 2011]
Network bandwidth? Measure (partition) all-to-all:
- 1 Gbit Ethernet < 1 GB/s
- QDR InfiniBand 4X < 3 GB/s
(Bar charts: bandwidth in GB/s for the 2-channel, 4-channel, and x4-CPU RAM configurations vs. 1 Gbit Ethernet and QDR InfiniBand 4X.)

Broadcast Join
(Diagram: tables R and S spread over nodes 1-4; the smaller table is broadcast.)
- Network cost: transfer min(|R|, |S|) * 3
- Schedule the transfers optimally

Hash Join
(Diagram: tables R and S hash-partitioned across nodes 1-4.)
- Network cost: transfer (|R| + |S|) * 3/4
- Distribution of (almost) equal partitions
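
The two cost formulas above generalize from 4 nodes to n and can be compared directly; a small sketch (function names are mine): broadcast ships the smaller table to the other n-1 nodes, while hash partitioning re-shuffles (n-1)/n of every tuple on average.

```python
# Network-cost comparison for an n-node cluster, per the slides.
def broadcast_cost(r_bytes: int, s_bytes: int, nodes: int) -> int:
    # smaller table is copied to each of the other nodes-1 machines
    return min(r_bytes, s_bytes) * (nodes - 1)

def hash_join_cost(r_bytes: int, s_bytes: int, nodes: int) -> float:
    # on average (nodes-1)/nodes of every tuple hashes to a remote node
    return (r_bytes + s_bytes) * (nodes - 1) / nodes

# With 4 nodes, broadcast moves min(|R|, |S|) * 3 bytes and hash join
# moves (|R| + |S|) * 3/4 bytes, matching the two slides.
print(broadcast_cost(100, 400, 4))   # 300
print(hash_join_cost(100, 400, 4))   # 375.0
```

As the slides note, broadcast wins only when one table is much smaller than the other; otherwise the (|R| + |S|) term of hash join is cheaper.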

Hash Join Pros & Cons
- Broadcast join can be expensive: useful only if |R| << |S|
- Good for load balancing: hashing randomizes the keys
- Bad in locality awareness: (again) hashing randomizes the keys
- Real datasets have locality: deliberate clustering (an optimization), time-based locality due to appends
(Diagram: tuples with key X and values Y and Z on nodes 1-3, before and after partitioning.)

Track Join (2-phase)
Tracking:
1) Partition the unique keys
2) Distribute the unique keys
- Hash-distribute the join keys, eliminating duplicates
(Diagram: tables R and S on nodes 1-4 exchanging unique keys.)

Track Join (2-phase): Selective Broadcast
- Selective broadcast is the last step, shown here for a single join key
(Diagram: key X is tracked to nodes 1, 3, and 4; the R tuples with key X are sent only to those nodes.)

Track Join (2-phase and 3-phase)
2-phase track join:
- Moves R tuples to the S tuple locations
- S payloads stay in place: they never move over the network
- Cost: tracking + min(|R|, |S|) * repeats, where min(|R|, |S|) is decided by tuple width (= payload width)
3-phase track join:
- Decides the tuple direction dynamically: which table to move and which to keep in place
- Augments tracking with counts per unique key
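
The two phases above can be simulated end to end; the following is a toy sketch (the data layout and names are my own illustration, not the paper's code): phase 1 tracks, per join key, which nodes hold matching S tuples; phase 2 sends each R tuple only to those nodes, so S payloads never move.

```python
from collections import defaultdict

def track_join_2phase(r_by_node, s_by_node):
    # r_by_node / s_by_node: {node: [(key, payload), ...]}
    s_locations = defaultdict(set)            # key -> nodes holding S tuples
    for node, tuples in s_by_node.items():    # "tracking" phase
        for key, _ in tuples:
            s_locations[key].add(node)
    transfers = 0                             # R tuples moved over the network
    output = []
    for node, tuples in r_by_node.items():
        for key, r_payload in tuples:
            for dest in s_locations[key]:     # selective broadcast per key
                if dest != node:
                    transfers += 1
                for s_key, s_payload in s_by_node[dest]:
                    if s_key == key:
                        output.append((key, r_payload, s_payload))
    return output, transfers
```

In a real system the tracking map would itself be hash-partitioned across the nodes rather than materialized in one place; the sketch centralizes it only for readability.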

Track Join (3-phase)
Tracking:
1) Count-aggregate the unique keys
2) Partition the unique keys & counts
3) Distribute the unique keys & counts
- Count-aggregate the keys locally, then hash-distribute the keys & counts
(Diagram: tables R and S on nodes 1-4 exchanging keys and counts.)
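
The metadata the count-aggregate step gathers might look like this (a hedged sketch; the per-key map layout is my own choice): per join key and per node, how many R and S tuples are present, which is what lets the tracker weigh which direction is cheaper to move.

```python
from collections import Counter, defaultdict

def track_with_counts(r_by_node, s_by_node):
    # meta: key -> node -> [r_count, s_count]
    meta = defaultdict(lambda: defaultdict(lambda: [0, 0]))
    for side, data in ((0, r_by_node), (1, s_by_node)):
        for node, tuples in data.items():
            # local count-aggregation before anything hits the network
            for key, cnt in Counter(k for k, _ in tuples).items():
                meta[key][node][side] += cnt
    return {k: dict(v) for k, v in meta.items()}
```

Shipping (key, node, count) triples instead of raw keys is what makes the extra phase cheap: the tracking traffic grows with the number of distinct keys, not with the number of tuples.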

Schedules / Algorithm
- Hash join: cost = 10
- 2-phase track join: cost = 12
- 3-phase track join: cost = 8
- 4-phase track join: cost = 6
(Diagram: the transfer schedule each algorithm produces for the same R and S fragments on four nodes.)

Track Join (4-phase)
- Computes the optimal Cartesian-product join schedule
- Tracks using keys & counts, as in 3-phase track join
- Optimizes both directions: an R-to-S broadcast and an S-to-R broadcast
- Allows migration of S tuples for R-to-S, and of R tuples for S-to-R
- Provably optimal, in linear time
- Picks the best (optimized) direction for migrate & broadcast
- Executes the optimal schedule: first migrate tuples from one table, then broadcast tuples from the other

Schedule Optimization
Example: R fragments of sizes 4, 8, 9, 6 and S fragments of sizes 2, 5, 3, 1 on nodes 1-4; S is broadcast to every node that holds R tuples, and migrating an R fragment away shrinks the set of broadcast receivers.
- Broadcast: cost = 0 + 33
- Migrate 4? Yes (cost = 4 + 24 < 33)
- Migrate 9? No (cost = 13 + 16 > 28)
- Migrate 6? Yes (cost = 10 + 14 < 28)
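
The migrate-or-broadcast decision above can be reproduced with a simple cost model (a sketch under my reading of the example, not the paper's exact linear-time algorithm): each S fragment is sent to every other node still holding R tuples, migrating an R fragment costs its size, and a migration is accepted when it lowers the total.

```python
def s_broadcast_cost(r, s):
    # each S fragment goes to every *other* node that still holds R tuples
    receivers = {i for i, cnt in enumerate(r) if cnt > 0}
    return sum(cnt * len(receivers - {i}) for i, cnt in enumerate(s))

def optimize_schedule(r, s):
    r, migrated = list(r), 0
    for i in sorted(range(len(r)), key=r.__getitem__):  # small fragments first
        if r[i] == 0:
            continue
        trial = list(r)
        trial[i] = 0                                    # migrate fragment i away
        if migrated + r[i] + s_broadcast_cost(trial, s) \
                < migrated + s_broadcast_cost(r, s):
            migrated += r[i]                            # accept the migration
            r = trial
    return migrated + s_broadcast_cost(r, s)

# The slide's example: R fragments (4, 8, 9, 6), S fragments (2, 5, 3, 1).
print(s_broadcast_cost([4, 8, 9, 6], [2, 5, 3, 1]))  # 33, the starting cost
print(optimize_schedule([4, 8, 9, 6], [2, 5, 3, 1])) # 24, after migrating 4 and 6
```

The greedy order differs from the slide's (it tries 6 before 9) but lands on the same final schedule: migrate the fragments of size 4 and 6, for a total cost of 10 + 14 = 24.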

Network Cost Approximation
When to use track join instead of hash join? Use standard statistics:
- # tuples, # distinct keys
- Distinguish classes of correlation (= similar Cartesian products); use correlated sampling [Yu et al., SIGMOD 2013]
Use track join:
- 2-phase if at least one table has unique keys
- 4-phase if many key repeats or locality is expected
Use hash join:
- If payloads are small (e.g. key & record id only) and no locality exists
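
The decision rule above could be sketched as a function of standard statistics (the function name and the payload-size cutoff are my own placeholders, not values from the talk):

```python
def choose_join(r_tuples, s_tuples, r_distinct, s_distinct,
                payload_bytes, expect_locality):
    """Pick a join strategy from table statistics (illustrative sketch)."""
    unique_r = r_distinct == r_tuples           # is either side key-unique?
    unique_s = s_distinct == s_tuples
    # tiny payloads (e.g. key + record id) with no locality: tracking
    # overhead is not worth it, per the slide
    if payload_bytes <= 12 and not expect_locality:
        return "hash join"
    if unique_r or unique_s:
        return "2-phase track join"
    return "4-phase track join"                 # many repeats or locality
```

A real optimizer would compare estimated byte counts from the cost formulas rather than use a fixed byte threshold; the sketch only encodes the slide's qualitative rules.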

Track / Hash / Semi-Joins
- Track join is a form of semi-join: tracking generates schedules for valid Cartesian products only, and is non-approximate, unlike Bloom-filter-based semi-join (Bloom join)
- Cost (of tracking) = distributing the unique join keys (& counts)
- A semi-join can still be used on top of track join (where Bloom filtering costs less than tracking), but unlike hash join it may also be skipped (where tracking costs less than Bloom filtering)
- Hash join can become tracking-aware: use record ids (rids) to track the joining payloads; in the best case this is as good as 2-phase track join
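
For contrast, the Bloom-join baseline mentioned above ships a compact filter of one table's join keys so the other side can drop non-matching tuples before shuffling; unlike tracking, the filter admits false positives. An illustrative sketch (filter size and hash count are arbitrary choices of mine):

```python
import hashlib

class Bloom:
    """Tiny Bloom filter: k hash positions carved out of one SHA-256 digest."""
    def __init__(self, bits=1 << 16, hashes=3):
        self.bits, self.hashes = bits, hashes
        self.v = bytearray(bits // 8)
    def _positions(self, key):
        d = hashlib.sha256(repr(key).encode()).digest()
        return [int.from_bytes(d[4 * i:4 * i + 4], "big") % self.bits
                for i in range(self.hashes)]
    def add(self, key):
        for b in self._positions(key):
            self.v[b >> 3] |= 1 << (b & 7)
    def might_contain(self, key):
        # no false negatives; rare false positives
        return all(self.v[b >> 3] & (1 << (b & 7)) for b in self._positions(key))

def bloom_semi_join_filter(r_keys, s_tuples):
    # build a filter of R's join keys, drop S tuples that cannot match
    f = Bloom()
    for k in r_keys:
        f.add(k)
    return [t for t in s_tuples if f.might_contain(t[0])]
```

The filter is much smaller than the exact key set, which is why Bloom filtering can be cheaper than tracking; the price is that some non-matching tuples still travel.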

Network Traffic Simulations
- Unique-keys join: 1 billion vs. 1 billion tuples
- Payload configurations: R 2 bytes vs. S 6 bytes; R 4 bytes vs. S 6 bytes; R 6 bytes vs. S 6 bytes
(Bar charts: network traffic in GB for each configuration, broken down into S tuples, R tuples, keys & nodes, and keys & counts.)

Simulating Locality
Simulate locality patterns and the degree of locality:
- Experiment 1: 1 vs. 5 keys per Cartesian product
- Experiment 2: 5 vs. 5 keys (= 25 tuples in the result), intra-table collocated
- Experiment 3: 5 vs. 5 keys, intra-table & inter-table collocated
(Bar charts: network traffic in GB for key-placement distributions 5,0,0,0,0,0,0; 2,2,1,0,0,0,0; and 1,1,1,1,1,0,0, broken down into S tuples, R tuples, keys & nodes, and keys & counts.)

Simulating Locality (continued)
(Bar charts: network traffic in GB for the locality experiments.)

Real Workloads
- Real commercial vendor workloads, profiled using a commercial DBMS
- 4 nodes x 2 CPUs (2.9 GHz) x 8 cores; QDR InfiniBand 4X
- Extracted the most expensive queries, and the most expensive join from each; executed in the DBMS as a hash join
- Simulated track join with multiple encoding schemes, variable-length types, and optimal compression schemes

Real Workload 1: Traffic Simulation
- Most expensive query of the workload: joins 7 relations and aggregates
- The most expensive join takes 23% of the time; almost entirely unique keys
(Bar charts: network traffic in GB for fixed-byte encoding, variable-byte encoding, and dictionary compression, broken down into S tuples, R tuples, keys & nodes, and keys & counts.)

Real Workload 1: Traffic Simulation (shuffled)
- The query exhibited significant locality
- Shuffle the data randomly: no locality is possible now
(Bar charts: network traffic in GB for fixed-byte, variable-byte, and dictionary encodings on the shuffled data.)

Real Workload 2: Traffic Simulation
- For workload 1, 2-phase sufficed given its unique keys; 3-phase / 4-phase were redundant
- Workload 2 is different: no unique keys, very high selectivity
- R: ~4 million tuples; S: ~2 million tuples; R join S: >1 billion tuples
(Bar charts: network traffic in GB with variable-byte encoding, for the original and shuffled orders, broken down into S tuples, R tuples, keys & nodes, and keys & counts.)

Real Workload Experiments
Implementation:
- Sort-based in-memory join; de-pipelined operators
- De-coupled network & CPU measurement: the experiments are invariant of network speed
- Run on a small private cluster: 4 nodes x 2 CPUs (2.66 GHz) x 4 cores; accurately project any network speed
- Evaluate the real workloads: the same cases we simulated, on the same expensive join

Real Workload Time Experiments
- Projected (accurately) to 1 Gbit Ethernet
- CPU vs. network ratio analogous to commercial platforms; the DBMS profiling platform had ~2.8x the network and ~2.2x the CPU
- Schedule generation is fast (insignificant in workload 2)
(Bar charts: time in seconds, split into network and CPU, for the original and shuffled versions of workloads 1 and 2, comparing join algorithms including 2-phase track join, labeled 2TJ.)

Conclusions
We introduced Track Join:
- For distributed joins: not a hash join, not a broadcast join
- Optimizes network traffic: tracks matching keys at join-key granularity (not at hash groups)
- Generates optimal Cartesian join schedules fast (and in linear time)
Experimental results:
- Reduces network traffic significantly
- The CPU time penalty is modest
- Even better with data locality

Questions?