Accelerating Foreign-Key Joins using Asymmetric Memory Channels

Similar documents
Architecture-Conscious Database Systems

Accelerating Analytical Workloads

Hash Joins for Multi-core CPUs. Benjamin Wagner

Data Modeling and Databases Ch 10: Query Processing - Algorithms. Gustavo Alonso Systems Group Department of Computer Science ETH Zürich

Data Modeling and Databases Ch 9: Query Processing - Algorithms. Gustavo Alonso Systems Group Department of Computer Science ETH Zürich

GPU ACCELERATED DATABASE MANAGEMENT SYSTEMS

Sort vs. Hash Join Revisited for Near-Memory Execution. Nooshin Mirzadeh, Onur Kocberber, Babak Falsafi, Boris Grot

Datenbanksysteme II: Modern Hardware. Stefan Sprenger November 23, 2016

PacketShader: A GPU-Accelerated Software Router

8 Introduction to Distributed Computing

Performance in the Multicore Era

Column-Stores vs. Row-Stores. How Different are they Really? Arul Bharathi

Using Graphics Chips for General Purpose Computation

POST-SIEVING ON GPUs

Column Stores vs. Row Stores How Different Are They Really?

Hardware Acceleration for Database Systems using Content Addressable Memories

Track Join. Distributed Joins with Minimal Network Traffic. Orestis Polychroniou! Rajkumar Sen! Kenneth A. Ross

DATABASE CRACKING: Fancy Scan, not Poor Man s Sort! Don. Holger Pirk Eleni Petraki Strato Idreos

Query Processing Models

Storage. Holger Pirk

GPU Architecture. Alan Gray EPCC The University of Edinburgh

Hybrid Implementation of 3D Kirchhoff Migration

Query Processing. Introduction to Databases CompSci 316 Fall 2017

GPGPUs in HPC. VILLE TIMONEN Åbo Akademi University CSC

SoftFlash: Programmable Storage in Future Data Centers Jae Do Researcher, Microsoft Research

Hammer Slide: Work- and CPU-efficient Streaming Window Aggregation

GPU > CPU. FOR HIGH PERFORMANCE COMPUTING PRESENTATION BY - SADIQ PASHA CHETHANA DILIP

Introduction II. Overview

On the efficiency of the Accelerated Processing Unit for scientific computing

Main-Memory Databases 1 / 25

PARALLEL & DISTRIBUTED DATABASES CS561-SPRING 2012 WPI, MOHAMED ELTABAKH

FLAT DATACENTER STORAGE. Paper-3 Presenter-Pratik Bhatt fx6568

Performance Issues and Query Optimization in Monet

A Distributed Hash Table for Shared Memory

MapReduce. Stony Brook University CSE545, Fall 2016

HG-Bitmap Join Index: A Hybrid GPU/CPU Bitmap Join Index Mechanism for OLAP

RELATIONAL OPERATORS #1

Martin Dubois, ing. Contents

Parallelizing Inline Data Reduction Operations for Primary Storage Systems

A Multi Join Algorithm Utilizing Double Indices

Weaving Relations for Cache Performance

CS307 Operating Systems Main Memory

Systems Infrastructure for Data Science. Web Science Group Uni Freiburg WS 2014/15

Advanced Database Systems

The Role of Database Aware Flash Technologies in Accelerating Mission- Critical Databases

Caribou: Intelligent Distributed Storage

Hardware-Accelerated Memory Operations on Large-Scale NUMA Systems

Was ist dran an einer spezialisierten Data Warehousing platform?

LegUp: Accelerating Memcached on Cloud FPGAs

University of Waterloo Midterm Examination Sample Solution

Hardware Acceleration of Database Operations

Storage and File Structure

Introduction to Multicore Programming

ColumnStore Indexes. מה חדש ב- 2014?SQL Server.

Sorting & Aggregations

BDCC: Exploiting Fine-Grained Persistent Memories for OLAP. Peter Boncz

TECHNOLOGY BRIEF. Compaq 8-Way Multiprocessing Architecture EXECUTIVE OVERVIEW CONTENTS

Sorting. Overview. External sorting. Warm up: in memory sorting. Purpose. Overview. Sort benchmarks

Experts in Application Acceleration Synective Labs AB

Real-Time Support for GPU. GPU Management Heechul Yun

Database Systems CSE 414

CSE 591/392: GPU Programming. Introduction. Klaus Mueller. Computer Science Department Stony Brook University

GPGPU introduction and network applications. PacketShaders, SSLShader

Data Processing on Emerging Hardware

Data-Parallel Algorithms on GPUs. Mark Harris NVIDIA Developer Technology

Query Processing with Indexes. Announcements (February 24) Review. CPS 216 Advanced Database Systems

On Level Scheduling for Incomplete LU Factorization Preconditioners on Accelerators

Anatomy of AMD s TeraScale Graphics Engine

Hash table example. B+ Tree Index by Example Recall binary trees from CSE 143! Clustered vs Unclustered. Example

Lecture 4, 04/08/2015. Scribed by Eric Lax, Andreas Santucci, Charles Zheng.

Introduction to Multicore Programming

Big Data Management and NoSQL Databases

Welcome to CO 572: Advanced Databases

1 (eagle_eye) and Naeem Latif

Database Systems CSE 414

Introduction to Database Systems CSE 414. Lecture 26: More Indexes and Operator Costs

vs. GPU Performance Without the Answer University of Virginia Computer Engineering g Labs

Column-Stores vs. Row-Stores: How Different Are They Really?

Bridging the Processor/Memory Performance Gap in Database Applications

Camdoop Exploiting In-network Aggregation for Big Data Applications Paolo Costa

CUDA Optimizations WS Intelligent Robotics Seminar. Universität Hamburg WS Intelligent Robotics Seminar Praveen Kulkarni

Declarative Partitioning Has Arrived!

Last Class Carnegie Mellon Univ. Dept. of Computer Science /615 - DB Applications

SSDs vs HDDs for DBMS by Glen Berseth York University, Toronto

Motivation for Sorting. External Sorting: Overview. Outline. CSE 190D Database System Implementation. Topic 3: Sorting. Chapter 13 of Cow Book

Deep Learning Performance and Cost Evaluation

7. Query Processing and Optimization

HyperTransport Technology

CUDA OPTIMIZATIONS ISC 2011 Tutorial

Evaluation of Relational Operations

CSCI 402: Computer Architectures. Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI.

CSE 544 Principles of Database Management Systems

COURSE 12. Parallel DBMS

A Study of Data Partitioning on OpenCL-based FPGAs. Zeke Wang (NTU Singapore), Bingsheng He (NTU Singapore), Wei Zhang (HKUST)

Lecture 13: March 25

Fundamental CUDA Optimization. NVIDIA Corporation

CIS 601 Graduate Seminar. Dr. Sunnie S. Chung Dhruv Patel ( ) Kalpesh Sharma ( )

JPEG decoding using end of block markers to concurrently partition channels on a GPU. Patrick Chieppe (u ) Supervisor: Dr.

Chapter 8 Main Memory

Video. What's New. Cisco WebEx Training Center Release Notes (version WBS29.13) 1

Transcription:

Accelerating Foreign-Key Joins using Asymmetric Memory Channels Holger Pirk Stefan Manegold Martin Kersten holger@cwi.nl manegold@cwi.nl mk@cwi.nl

Why? Trivia: Joins are important But: Many Joins are (Indexed) Foreign Key Joins Important example: Data Warehouse Star Joins Large Fact Table (many Gigabytes) with Foreign Keys on Small Dimension Tables (hundreds of Megabytes) Joins are Performed for Projection or Condition Evaluation

A History of In Memory Join Processing (and incidentally of this talk) Foreign-Key Joins on CPUs Naïve (Indexed Hash) Foreign-Key Joins Clustered Foreign-Key Joins Foreign-Key Joins on GPGPUs t Clustered Foreign-Key Joins Naïve (Asymmetric) Foreign-Key Joins

Premises for this talk Focus is on Indexed Foreign Key Joins We assume an unclustered Foreign Key Index Column i.e., Pointers to Matching Tuples of the Target Table

Foreign-Key Joins on CPUs

The Problem: Bandwidth Bound CPUs Core 1 Core 2... Core 4 Memory Controller 8.5 GB/s 8.5 GB/s Memory Memory Memory Bottleneck for almost all relational (Memory Resident) DBMS operations Valuable Target for Optimization

Naïve Foreign-Key Joins on CPUs

Naïve Foreign-Key Joins on CPUs Foreign-Key Index is scanned sequentially Target Table is randomly accessed

Naïve Foreign-Key Joins on CPUs Optimization Targets: Foreign-Key Index is scanned sequentially Index Cache lines accessed once, i.e., not much to optimize

Naïve Foreign-Key Joins on CPUs Large Target Tables cause heavy Cache Thrashing Assume a (theoretical) 1-Slot Cache with 2-Value Cache Lines How many Cache Misses do you get? 80% of are Capacitive Misses } } 6 6 Target Table Lookups are rewarding optimization Targets

Clustered Foreign-Key Joins on CPUs

Clustered Foreign-Key Joins on CPUs Require a Value-partitioned Foreign-Key Index Foreign-Key Index Partitions are scanned Target Table Random Access Locality is improved

Clustered Foreign-Key Joins on CPUs Require a Value-partitioned Foreign-Key Index Foreign-Key Index Partitions are scanned [ ][ ] Target Table Random Access Locality is improved [ ][ ]

} } Clustered Foreign-Key Joins on CPUs What is the number of Target Table Cache Misses? So Cache behavior is greatly improved [ ][ ] Also: (disjoint) Partitions can be processed in parallel Parallelism is generally good for GPUs 1 But: The Index has to be Value (e.g., Radix)-partitioned first 1

Clustered Foreign-Key Joins on CPUs Processing Time in seconds 6.0 4.5 3.0 1.5 Unpartitioned Join (CPU) Partition (CPU) Join (CPU) Radix Join (CPU) about 30 % (consistent with [Blanas11]) 0 1 10 100 1000 10000 Number of Partitions in Clustering Step (Target Table Size: 64M)

Clustered Foreign-Key Joins on CPUs 25.00 Partition time in Seconds 18.75 12.50 6.25 CPU 11.040 GPU 6.492860 20.512 0 1.695271 256 Clusters 65536 Clusters

Foreign-Key Joins on GPUs

GPU Architecture Overview Processing Unit 1 Processing Unit 16 Core Core Core Core Core Core Core Core Core Core Core Core Core Core Core Core Local Memory Instruction Dispatcher..... Core Core Core Core Core Core Core Core Core Core Core Core Core Core Core Core Local Memory Instruction Dispatcher Expensive Atomic Operations Global Memory

The Problem: Bandwidth Bound GPUs Core 1 Core 2... Core 4 GPU GPU PCI-E x 16 (8 GB/s) 177.4 GB/s Memory Controller PCI-E Bottleneck is a big Problem for GPUs 8.5 GB/s 8.5 GB/s I/O QPI Hub (25.6 GB/s) Memory Memory Memory Graphics Memory Graphics Memory

Clustered Foreign-Key Joins on GPUs

Clustered Foreign-Key Joins on GPUs Port of the Radix-Join from CPU to GPU [He08] Improved Cache Performance is Important on GPUs too Allows efficient parallel processing of Partitions

Clustered Foreign-Key Joins on GPUs Processing Time in seconds 6.0 4.5 3.0 1.5 0 Join (GPU) Partition (CPU) Radix Join (CPU/GPU) 1 10 100 1000 10000 Number of Partitions in Clustering Step about 11 % about 70 %

Naïve Foreign-Key Joins on GPUs

Naïve Foreign-Key Joins on GPUs Primary Bottleneck is PCI-E, not GPU Memory Idea: Accelerate FK Joins by exploiting the Strength of GPUs: Random Access to target table Limit the PCI-E traffic to a minimum: Stream only the Index

Asymmetric Foreign-Key Joins on GPUs Primary Bottleneck is PCI-E, not GPU Memory Idea: Accelerate FK Joins by exploiting the Strength of GPUs: Random Access to target table Limit the PCI-E traffic to a minimum: Stream only the Index

Asymmetric Foreign-Key Joins on GPUs 6.0 Processing Time in seconds 4.5 3.0 1.5 Join (GPU) Partition (CPU) Radix Join (CPU/GPU) 0 1 10 100 1000 10000 Number of Partitions in Clustering Step

Asymmetric Foreign-Key Joins on GPUs Processing Time in seconds 6.0 4.5 3.0 1.5 Unpartitioned Join (GPU) Join (GPU) Partition (CPU) Radix Join (CPU/GPU) about factor 3.5 0 1 10 100 1000 10000 Number of Partitions in Clustering Step

Asymmetric Foreign-Key Joins on GPUs Message: Clustering improves performance on GPUs But: Cluster Costs don t pay off in Cache Friendliness But what about parallel processing of Partitions?

But what about Parallelization Processing time in Seconds 1.00 0.75 0.50 0.25 Total Costs Data transfer Costs about 41 % (not enough) 0 1 10 100 1000 Number of Utilized Cores per Processor (Work Group Size)

Foreign-Key Joins on GPUs vs CPUs Processing Time in Seconds 1.100 0.825 0.550 0.275 Intel i7 860 ATI Radeon HD 5850 PCI-E Data Transfer Costs 0 1K 10K 100K 1000K 10000K 100000K Dimension Table Size

Foreign-Key Joins on GPUs vs CPUs Processing Time in Seconds 1.100 0.825 0.550 0.275 Intel i7 860 ATI Radeon HD 5850 (processing only) PCI-E Data Transfer Costs 0 1K 10K 100K 1000K 10000K 100000K Dimension Table Size

Conclusion Unclustered FK-Joins Cause Heavy Cache Thrashing Thrashing can be offloaded to fast GPU memory PCI-E Bandwidth can be saturated without Clustering Naïve FK-Joins on GPUs outperform Clustered Joins on CPUs

Thank you for Listening

Questions? Feedback?