
When MPPDB Meets GPU: An Extendible Framework for Acceleration
Laura Chen, Le Cai, Yongyan Wang

Background: Heterogeneous Computing

Hardware trend: CPU performance stops growing with Moore's Law, while GPUs develop fast. Since 1993, GPU performance has sped up 2.8X per year; GPU computation power and bandwidth can be 39~50X those of the CPU:
- Computation (TFLOPS): Intel: 1; NVIDIA (8x): 52.8
- Bandwidth (GB/s): Intel: 68; NVIDIA (8x): 2692

GPU-accelerated DB positioning: a differentiator for the high-end market, for workloads that are sensitive to response time; supports real-time computing and decision making.

State-of-the-art research: join, sort, group-by, etc. End-to-end speedup (for complicated analytical queries):
- Mainstream GPU: 5~10X
- High-end GPU: 10~20X

HISILICON HUAWEI TECHNOLOGIES SEMICONDUCTOR CO., LTD. Page 2
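The 39~50X claim follows from the throughput figures above; a quick sanity check of the ratios (the numbers are the slide's, the script itself is only illustrative):

```python
# Throughput figures from the slide: one Intel CPU socket vs. 8x NVIDIA GPUs.
cpu_tflops, gpu_tflops = 1.0, 52.8
cpu_gbs, gpu_gbs = 68.0, 2692.0

compute_ratio = gpu_tflops / cpu_tflops   # ~52.8x more compute
bandwidth_ratio = gpu_gbs / cpu_gbs       # ~39.6x more bandwidth

print(f"compute: {compute_ratio:.1f}x, bandwidth: {bandwidth_ratio:.1f}x")
```

The bandwidth ratio (~39.6x) matches the lower end of the quoted 39~50X range.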

Background: GPU vs. CPU

Features comparison (GPU: NVIDIA TITAN X vs. CPU: Intel Xeon E7-8890 v4):
- Architecture: Pascal vs. Broadwell
- Launch: Aug-2016 vs. Jun-2016
- GP compute unit: mostly computation units vs. includes cache and control
- Parallelism: fine-grained massive data parallelism (thousands of cores, hides memory latency) vs. multi-threaded coarse-grained task parallelism plus SIMD fine-grained data parallelism (max. 512-bit SIMD width)
- # of cores: 3584 (simple) vs. 24 (functional)
- Core clock: 1.5 GHz vs. 2.2 GHz (up to 3.4 GHz)
- Peak FLOPS (single precision): 11 TFLOPS vs. 1689.6 GFLOPS (with AVX2)
- DRAM size: 12 GB GDDR5X (384-bit bus) vs. 768 GB/socket DDR4
- Memory bandwidth: 480 GB/s vs. 85 GB/s
- Power consumption: 250 W vs. 165 W
- Price at launch: $1,200 vs. $7,174

GPU: massively parallel cores, much higher DRAM bandwidth, better price/performance ratio.
- Efficient for: parallel floating-point processing, SIMD ops, high computation per memory access, agility in multi-threading.
- Not efficient for: complex control logic (branch divergence), logic operations, random memory-intensive access.
- Other concerns: slow PCI-E bandwidth, relatively small memory size.

Optimization strategies comparison (GPU vs. CPU):
- Memory access pattern: GPU needs coalesced access for high memory bandwidth (with high latency), e.g., < 10 GB/s for random vs. > 100 GB/s for coalesced; CPU prefers sequential access.
- Cache usage: GPU has a programmable L1 cache (up to 48 KB) on each multiprocessor, the key to improving data locality; CPU caches are (almost) transparent to programmers.
- Global sync: not supported on GPU; feasible on CPU.

Page 3
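The coalesced-vs-random gap above can be illustrated with a toy transaction-counting model. The 32-thread warp and 128-byte transaction granularity are common NVIDIA figures, but the model is a deliberate simplification, not vendor documentation:

```python
# Toy model: memory transactions issued per warp when thread i reads
# element i*stride, assuming 32-thread warps, 4-byte elements, and
# 128-byte memory transactions.
WARP, ELEM, LINE = 32, 4, 128

def transactions_per_warp(stride):
    # Distinct 128-byte segments touched by one warp.
    lines = {(i * stride * ELEM) // LINE for i in range(WARP)}
    return len(lines)

print(transactions_per_warp(1))   # coalesced: 1 transaction per warp
print(transactions_per_warp(32))  # strided:  32 transactions per warp
```

A 32x difference in transactions per warp is roughly why the slide quotes < 10 GB/s random vs. > 100 GB/s coalesced.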

When MPPDB Meets GPU

MPPDB: shared-nothing architecture, massively parallel processing; fully utilizes multi-core (SMP, etc.).
Hybrid approach: is it worth offloading compute-intensive operators to the GPU? How?

Page 4

Extendible Framework for Hardware Acceleration

[Architecture diagram: SQL -> SQL parser -> resource-aware optimizer (original plan vs. GPU plan) -> query execution engine (InitPlan/ExecPlan) -> parallelized GPU kernel execution via a GPU task queue; data is fed from columnar/row storage through host memory and DMA into GPU memory.]

Key pieces: a resource-aware optimizer, parallelized kernel execution, and feeding data to the GPU.

Page 5

Resource-Aware Optimizer

[Diagram: the parser's original plan (Scan, Agg) beside the GPU plan, where Scan and Agg are wrapped as GpuPlan (GpuScan) and GpuPlan (GpuAgg) nodes executed by the query execution engine.]

Get the best out of both CPU and GPU.

Plan time:
- An enhanced cost model determines whether to use the CPU or GPU plan, and which operators to push to the GPU.
- GPU plan nodes are wrapped in a unified GpuPlan node.
- A plan hook hooks it up to other parts of the plan; the interface between plan nodes stays the same.
- GpuPlan is the connector from the GPU to the original runtime engine.
- The same API can be applied to operators such as Scan, Agg, Join, and Sort, and can connect to other hardware (FPGA).

Page 6
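A minimal sketch of the cost decision described above: push an operator to the GPU only when the estimated transfer-plus-kernel cost beats the CPU cost. The function name and cost constants are illustrative assumptions, not MPPDB's actual cost model:

```python
# Hedged sketch of a resource-aware plan choice. Costs are in seconds;
# cpu/gpu_cost_per_gb are hypothetical per-operator processing rates.
def choose_plan(bytes_in, cpu_cost_per_gb, gpu_cost_per_gb, pcie_gbs=16.0):
    gb = bytes_in / 1e9
    cpu_cost = gb * cpu_cost_per_gb
    # GPU must pay the PCI-E transfer before (or overlapped with) the kernel.
    gpu_cost = gb / pcie_gbs + gb * gpu_cost_per_gb
    return "GpuPlan" if gpu_cost < cpu_cost else "CpuPlan"

# Compute-heavy aggregation: the GPU wins despite the PCI-E hop.
print(choose_plan(10e9, cpu_cost_per_gb=1.0, gpu_cost_per_gb=0.1))   # GpuPlan
# Cheap operator: transfer cost dominates, keep it on the CPU.
print(choose_plan(10e9, cpu_cost_per_gb=0.05, gpu_cost_per_gb=0.1))  # CpuPlan
```

The point of the sketch is that the optimizer must account for the PCI-E transfer, not just kernel speed, which is why only compute-intensive operators are worth pushing down.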

Parallelized GPU Kernel Execution

[Diagram: the query execution engine drives GpuPlan through a runtime API: (1) code gen, (2) chunked data prepared in host memory, (3) DMA, (4) GPU task queue and kernel launch, (5) write-back from GPU memory.]

Kernel algorithms on GPU: fine-grained massive data parallelism, coalesced access pattern, JIT compilation.

Exec time:
1. Code gen generates a CUDA program on-the-fly (JIT compile).
2. Prepare a chunk-sized batch of data for the GPU tasks and load it into the DMA buffer.
3. Kick off asynchronous DMA over PCI-E.
4. Launch the GPU kernel program.
5. Write back results.

Page 7
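Steps 2-5 form a chunked pipeline, and the payoff of the asynchronous DMA in step 3 is that the transfer of chunk i+1 can overlap the kernel on chunk i. A back-of-envelope model (the times are made up; the formula is standard double buffering):

```python
# Hedged sketch: total time for n chunks, with and without overlapping the
# DMA of the next chunk with the kernel on the current one.
def pipeline_time(n_chunks, dma_ms, kernel_ms, overlap=True):
    if not overlap:
        return n_chunks * (dma_ms + kernel_ms)
    # Only the first DMA is exposed; each subsequent step costs the slower
    # of the two overlapped stages.
    return dma_ms + n_chunks * max(dma_ms, kernel_ms)

print(pipeline_time(8, dma_ms=5, kernel_ms=7, overlap=False))  # 96 ms
print(pipeline_time(8, dma_ms=5, kernel_ms=7))                 # 61 ms
```

With overlap, the transfer cost is almost entirely hidden behind the kernels whenever the kernel is the slower stage.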

Feeding Data to GPU

[Diagram: the same framework as the previous slides, highlighting the data path from columnar/row storage through host memory and DMA into GPU memory.]

Data flow to GPU: assemble chunk-sized data, then transfer
- through the PCI-E bus (16 GB/s): HDD/SSD -> host memory -> GPU memory, or SSD -> GPU memory directly;
- through high-speed interconnects: NVLink (5~12x), CAPI, CCIX.

Page 8
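To put the interconnect numbers in perspective, a rough timing estimate for moving a 1 GB chunk. The 16 GB/s PCI-E figure is from the slide; the NVLink number assumes the low end of the quoted 5~12x range:

```python
# Illustrative transfer-time estimate; ignores latency and setup overhead.
def transfer_ms(gigabytes, gbs):
    return gigabytes / gbs * 1000

print(f"PCI-E  (16 GB/s): {transfer_ms(1, 16):.1f} ms")      # 62.5 ms
print(f"NVLink (5x PCIe): {transfer_ms(1, 16 * 5):.1f} ms")  # 12.5 ms
```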

Example: TPC-H Q1

TPC-H Q1 at 100x scale (with 1 NVIDIA GTX 1080).

[Bar chart: cold-run and warm-run execution times for the base-row, base-col, and GPU plans; annotated speedups of 12X, 8X, 3X, and 3X.]

TPC-H Q1 plan shapes, GPU plan vs. CPU plan:
- GPU plan: Stream (Gather) -> Sort -> HashAgg -> Stream (Redistribute) -> HashAgg -> GpuPlan (GpuPreAgg + GpuProjection + Outer Scan + GPU Filter)
- CPU plan: Stream (Gather) -> Sort -> HashAgg -> Stream (Redistribute) -> HashAgg -> SeqScan

QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------
Streaming (type: GATHER) (cost=440338.85..440335.24 rows=6 width=36) (actual time=8534.439..8534.440 rows=4 loops=1)
  Node/s: All datanodes
  ->  Sort (cost=440334.85..440334.86 rows=6 width=36) (actual time=[8205.003,8205.003]..[8205.321,8205.321], rows=4)
        Sort Key: l_returnflag, l_linestatus
        Sort Method: quicksort  Memory: 25kB
        ->  HashAggregate (cost=440334.27..440334.83 rows=6 width=36) (actual time=[8204.954,8204.955]..[8205.269,8205.270], rows=4)
              Group By Key: l_returnflag, l_linestatus
              ->  Streaming (type: REDISTRIBUTE) (cost=220167.14..220167.59 rows=12 width=36) (actual time=[8128.316,8204.796]..[8163.807,8205.152], rows=8)
                    Spawn on: All datanodes
                    ->  HashAggregate (cost=220167.14..220167.24 rows=12 width=36) (actual time=[5906.548,5906.552]..[5932.072,5932.076], rows=8)
                          Group By Key: l_returnflag, l_linestatus
                          ->  GpuPlan (GpuPreAgg) on lineitem (cost=10671.20..108150.40 rows=0 width=72) (actual time=[5824.801,5906.320]..[5851.165,5931.877], rows=336)
                                Reduction: Local + Global
                                GPU Projection: l_returnflag, l_linestatus, l_quantity, l_extendedprice, l_discount, l_tax
                                Outer Scan: lineitem (never executed)
                                GPU Filter: (l_shipdate <= '1998-11-28 00:00:00'::timestamp without time zone)

Page 9

More Challenges for GPU on MPPDB

How to share GPU resources among multiple concurrent tasks?
- Abstract GPU memory to enable context switching.
- A resource pool of logical GPU units.

How to optimize performance when MPPDB shuffles data between nodes?
- Overlap data transfer with computation via enhanced pre-fetching.
- Enable faster network protocols to lower network latency, bypass …

Page 10
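One way to read the "resource pool of logical GPU units" idea is as an admission-control layer in front of kernel launches: concurrent tasks must acquire a unit before using the device. A minimal sketch, assuming a fixed unit count (the class and method names are hypothetical, not MPPDB APIs):

```python
import threading

# Hedged sketch: N logical GPU units shared by concurrent query tasks.
class LogicalGpuPool:
    def __init__(self, units):
        self._sem = threading.Semaphore(units)

    def acquire(self):
        # Blocks when all logical units are in use by other tasks.
        self._sem.acquire()

    def release(self):
        self._sem.release()

pool = LogicalGpuPool(units=4)
pool.acquire()   # a task takes one logical unit before launching kernels
# ... launch GPU work here ...
pool.release()   # unit returns to the pool for the next task
```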

GPU on Cloud

- How to deploy GPUs in a cloud database?
- How does data transfer overlap with computation in a cloud environment?
- What about GPU clusters?

Page 11

Take-Away

- GPU is for massive data parallelism.
- GPU can be used to speed up analytical queries.
- An extendible acceleration framework requires:
  - a resource-aware optimizer,
  - parallelized GPU kernels,
  - overlapping data transfer with computation.

Page 12

Thank you! Page 13