Red Fox: An Execution Environment for Relational Query Processing on GPUs


Red Fox: An Execution Environment for Relational Query Processing on GPUs Haicheng Wu 1, Gregory Diamos 2, Tim Sheard 3, Molham Aref 4, Sean Baxter 2, Michael Garland 2, Sudhakar Yalamanchili 1 1. Georgia Institute of Technology 2. NVIDIA 3. Portland State University 4. LogicBlox Inc.

System Diversity Today. Hardware diversity is mainstream: Amazon EC2 GPU instances, mobile platforms (DSPs, GPUs), the Keeneland system (GPUs), and Cray Titan (GPUs). 2

Relational Queries on Modern GPUs. The opportunity: significant potential data parallelism; if the data fits in GPU memory, 2x to 27x speedups have been shown 1. The problem: workloads need to process 1 to 50 TBs of data 2, the computation is fine grained, and 15% to 90% of the total time is spent moving data between CPU and GPU 1. 1 B. He, M. Lu, K. Yang, R. Fang, N. K. Govindaraju, Q. Luo, and P. V. Sander. Relational query co-processing on graphics processors. In TODS, 2009. 2 Independent Oracle Users Group. A New Dimension to Data Warehousing: 2011 IOUG Data Warehousing Survey. 3

The Challenge: new accelerator architectures meet new applications and software stacks. Candidate application domains: large graphs, and relational computations over massive unstructured data sets, e.g. LargeQty(p) <- Qty(q), q > 1000. Goal: sustain 10X to 100X throughput over multicore. 4

Goal and Strategy. Goal: build a compilation chain to bridge the semantic gap between relational queries and GPU execution models, delivering a 10X to 100X speedup for relational queries over multicore. Strategy: 1. Optimized primitive design: the fastest published GPU RA primitive implementations (PPoPP 2013). 2. Minimize data movement cost (MICRO 2012), both between CPU and GPU and between GPU cores and GPU memory. 3. Query-level compilation and optimizations (CGO 2014). 5

The Big Picture. LogiQL queries enter the LogicBlox runtime (RT), which parcels out work units and manages out-of-core data across the CPU cores. Red Fox extends the LogicBlox environment to support GPUs. 6

LogicBlox Domain Decomposition Policy: Sand, Not Boxes. Fitting boxes into a shipping container is hard (NP-complete); pouring sand into a dump truck is dead easy. A large query is therefore partitioned into very fine-grained work units, each sized to fit in GPU memory (the GPU work-unit size will be larger than the CPU size). Many problems remain, e.g. caching data in the GPU. Red Fox: make the GPU(s) look like very high performance cores! 7
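The work-unit sizing idea above can be sketched as follows; the function name and parameters are hypothetical, and a real policy would also have to budget for intermediate results:

```python
# Hypothetical sketch of the "sand, not boxes" policy: a large input
# relation is cut into fixed-size work units that each fit GPU memory.
def partition_work_units(num_rows, row_bytes, gpu_mem_bytes):
    """Yield (start, end) row ranges whose payload fits in gpu_mem_bytes."""
    rows_per_unit = max(1, gpu_mem_bytes // row_bytes)
    for start in range(0, num_rows, rows_per_unit):
        yield (start, min(start + rows_per_unit, num_rows))

# One million 32-byte rows against an 8 MB budget -> four work units.
units = list(partition_work_units(num_rows=10**6, row_bytes=32,
                                  gpu_mem_bytes=8 * 2**20))
```

Because the units are uniform and independent, the runtime can hand them to whichever device (CPU core or GPU) is free, which is exactly why fine-grained decomposition is "easy" compared with bin packing.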

Domain Specific Compilation: Red Fox. First things first: mapping the computation to the GPU. LogiQL queries pass through the LogiQL-to-RA front-end (the language front-end) to produce a query plan; the RA-to-GPU compiler (nvcc + RA-Lib, the RA-primitives translation layer) lowers the plan to Harmony IR, which the Harmony Runtime* executes (the machine back-end). * G. Diamos and S. Yalamanchili. Harmony: An Execution Model and Runtime for Heterogeneous Many-Core Processors. In HPDC, 2008. 8

Source Language: LogiQL. LogiQL is based on Datalog, a declarative programming language, and extends Datalog with aggregations, arithmetic, etc. Find more at http://www.logicblox.com/technology.html. Example (note the recursive definition): ancestor(x,y) <- parent(x,y). ancestor(x,y) <- ancestor(x,t), ancestor(t,y). 9
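The semantics of the recursive ancestor example can be illustrated with a naive fixpoint evaluation in Python; this is a sketch of Datalog semantics in general, not of the LogicBlox engine:

```python
# Sketch of the ancestor example's meaning: compute the transitive
# closure of parent by iterating the two rules until nothing changes.
def ancestors(parent):
    anc = set(parent)                  # ancestor(x,y) <- parent(x,y).
    while True:
        # ancestor(x,y) <- ancestor(x,t), ancestor(t,y).
        new = {(x, y) for (x, t) in anc for (t2, y) in anc if t == t2}
        if new <= anc:                 # fixpoint reached
            return anc
        anc |= new

facts = {("alice", "bob"), ("bob", "carol")}
```

A production engine would use semi-naive evaluation (only joining newly derived facts each round), but the fixpoint it computes is the same.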

Language Front-End Compilation Flow. The LogicBlox flow provides industry-strength optimization: LogiQL queries go through the LogicBlox parser (parsing and type checking), the resulting AST is optimized, and the AST is translated to relational algebra (RA). The Red Fox flow then takes the query plan through a pass manager that performs common (sub)expression elimination and dead code elimination; more optimizations are needed. 10

Structure of the Two IRs. Both the query plan module and the Harmony IR module consist of variables (with types and data) and basic blocks of operators with inputs and outputs; the RA-to-GPU compiler (nvcc + RA-Lib) translates the RA primitives of the former into the CUDA operators of the latter. 11

Two IRs Enable More Choices. The design supports extensions to other language front-ends (e.g. an SQL-to-RA front-end for SQL queries) and to other back-ends (a CUDA library, an OpenCL library, or synthesized RA operators). 12

Primitive Library: Data Structures. Key-value store: arrays of densely packed tuples, with support for tuples up to 1024 bits and for int, float, string, and date types. Example layout: an id key (4 bytes) with a value holding price (8 bytes) and tax (16 bytes). 13
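As an illustration only (the field widths below are chosen for the example, not taken from Red Fox), densely packed fixed-width tuples can be modeled with Python's struct module:

```python
import struct

# Hypothetical packed layout: a 4-byte int id key, an 8-byte double
# price, and a 4-byte float tax, with no padding ("<" disables
# alignment), mirroring the array-of-packed-tuples idea.
ROW = struct.Struct("<i d f")   # id:int32, price:float64, tax:float32

def pack_rows(rows):
    """Pack an iterable of (id, price, tax) tuples into one dense buffer."""
    buf = bytearray(ROW.size * len(rows))
    for i, row in enumerate(rows):
        ROW.pack_into(buf, i * ROW.size, *row)
    return bytes(buf)

packed = pack_rows([(1, 9.99, 0.5), (2, 4.25, 0.25)])
```

Dense fixed-width rows are what make GPU-friendly access patterns possible: thread k can compute its tuple's byte offset as k * ROW.size with no pointer chasing.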

Primitive Library: When Storing Strings. Variable-length strings are stored in separate string tables of fixed widths (Len = 32, 64, 128), and tuples hold pointers (P) into these tables as their key/value fields. 14
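A minimal sketch of this bucketing scheme, with hypothetical names (`intern`, `TABLE_WIDTHS`): each string is padded to the smallest fixed width that fits it, and the tuple stores a (table, slot) pointer instead of the bytes:

```python
# Hypothetical sketch of length-bucketed string tables: strings go
# into fixed-width tables (32, 64, 128 bytes), tuples keep pointers.
TABLE_WIDTHS = (32, 64, 128)

def intern(tables, s):
    """Store s in the narrowest fitting table; return its (width, slot)."""
    data = s.encode()
    width = next(w for w in TABLE_WIDTHS if len(data) <= w)
    table = tables.setdefault(width, [])
    table.append(data.ljust(width, b"\0"))   # null-pad to fixed width
    return (width, len(table) - 1)           # pointer stored in the tuple

tables = {}
p = intern(tables, "urgent")        # short string -> 32-byte table
q = intern(tables, "x" * 40)        # longer string -> 64-byte table
```

Fixed widths keep every table a dense array, so GPU threads can index string slots directly, at the cost of padding; this trade-off is one reason string-heavy queries behave differently from numeric ones.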

Primitive Library: Performance. The library stores the GPU implementations of the following primitives. Relational algebra: PROJECT, PRODUCT, SELECT, JOIN, SET. Math: arithmetic (+ - * /) and aggregation. Built-in: string and datetime operations. Others: sort and unique. RA performance on the GPU (PPoPP 2013)* was measured on a Tesla C2050 with random integers as inputs. * G. Diamos, H. Wu, J. Wang, A. Lele, and S. Yalamanchili. Relational Algorithms for Multi-Bulk-Synchronous Processors. In PPoPP, 2013. 15
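For reference, the sequential semantics of three of these primitives on lists of key-value tuples can be written as follows (the GPU versions in the PPoPP 2013 paper are data-parallel; this only pins down what each operator computes):

```python
# Reference semantics for SELECT, PROJECT, and JOIN over tuple lists.
def select(rel, pred):
    return [t for t in rel if pred(t)]

def project(rel, cols):
    return [tuple(t[c] for c in cols) for t in rel]

def join(r, s):
    """Equi-join on the first field (the key) of each tuple."""
    index = {}
    for t in s:
        index.setdefault(t[0], []).append(t)
    return [rt + st[1:] for rt in r for st in index.get(rt[0], [])]
```

Note this sketch uses a hash index for the join purely for clarity; the library's measured implementation is sort-merge based, as the next slide's color coding shows.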

Forward Compatibility: Primitive Library. Today the library uses the best implementations from the state of the art and can easily integrate improved algorithms designed by 3rd parties. Relational algebra: PROJECT, PRODUCT, SELECT, JOIN (sort-merge join), SET. Math: arithmetic (+ - * /) and aggregation. Built-in: string and datetime. Others: merge sort, radix sort, unique. Color coding on the slide: red = Thrust library, green = ModernGPU library 1, purple = Back40Computing 2, black = Red Fox library. 1 S. Baxter. Modern GPU, http://nvlabs.github.io/moderngpu/index.html 2 D. Merrill. Back40Computing, https://code.google.com/p/back40computing/ 16

Harmony Runtime. The scheduler dispatches GPU commands from the Harmony IR onto the available GPUs through the GPU driver APIs. The current scheduling method attempts to minimize memory footprint: for example, for a PROJECT consuming j_1 and producing p_1, the runtime allocates j_1, allocates p_1, and frees j_1 as soon as it is no longer needed. More complex scheduling, such as speculative execution*, is also possible. * G. Diamos and S. Yalamanchili. Speculative Execution On Multi-GPU Systems. In IPDPS, 2010. 17
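The footprint-minimizing idea can be sketched as a last-use analysis over the operator list; the operator representation here is hypothetical, not the actual Harmony IR:

```python
# Sketch of footprint-minimizing scheduling: after each operator runs,
# free every buffer that no later operator reads (last-use analysis).
def schedule_with_frees(ops):
    """ops: list of (name, inputs, output). Returns ordered actions."""
    last_use = {}
    for i, (_, inputs, _) in enumerate(ops):
        for buf in inputs:
            last_use[buf] = i          # later ops overwrite earlier uses
    actions = []
    for i, (name, inputs, out) in enumerate(ops):
        actions.append(("alloc", out))
        actions.append(("run", name))
        for buf in inputs:
            if last_use[buf] == i:     # this was buf's final reader
                actions.append(("free", buf))
    return actions

ops = [("JOIN", ["A", "B"], "j_1"), ("PROJECT", ["j_1"], "p_1")]
actions = schedule_with_frees(ops)
```

On this example the schedule frees A and B right after the JOIN and frees j_1 right after the PROJECT, matching the allocate/free pattern shown on the slide.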

Benchmarks: TPC-H Queries. A popular decision support benchmark suite comprised of 22 queries analyzing data from 6 big tables and 2 small tables. A Scale Factor (SF) parameter controls the database size; SF=1 corresponds to a 1GB database. Courtesy: O'Neil, O'Neil, Chen. Star Schema Benchmark. 18

Experimental Environment. Red Fox: Intel i7-4771 CPU @ 3.50GHz connected over PCIe 3.0 x16 to a GeForce GTX Titan GPU (2688 cores, $1000 USD); Ubuntu 12.04, G++/GCC 4.6, NVCC 5.5, Thrust 1.7. Baseline: LogicBlox 4.0 on an Amazon EC2 cr1.8xlarge instance, 32 threads running on 16 cores (CPU cost: $3000 USD). 19

Red Fox TPC-H (SF=1) Comparison with CPU [bar chart: per-query speedups for q1–q22 and the average, with and without PCIe transfer time; y-axis 0–90]. Result: >10x faster with 1/3 the price. On average (geometric mean), GPU w/ PCIe vs. parallel CPU = 11x, and GPU w/o PCIe vs. parallel CPU = 15x. This performance is viewed as a lower bound; more improvements are coming. Find the latest performance numbers and query plans at http://gpuocelot.gatech.edu/projects/red-fox-a-compilation-environment-for-data-warehousing/ 20

Red Fox TPC-H (SF=1) Comparison with CPU [same chart, annotated]. The highest speedups come from string-centric queries: LogicBlox uses a string library, while Red Fox re-implements the string operations with one thread managing one string, so performance depends on string contents (branch/memory divergence). The lowest speedups are due to poor query plans. 21

Time Performance of Primitives [histogram: fraction of execution time (0% to 40%) spent in project, select, product, join, sort_join, diff, unique, sort_unique, union, agg, sort_agg, and others; an arrow marks the most time-consuming primitive]. Solutions: a better ordering of primitives; new join algorithms, e.g. hash join and multi-predicate join; and more optimizations, e.g. kernel fusion and a better runtime scheduling method. 22
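The kernel-fusion optimization listed among the solutions can be illustrated sequentially: fusing SELECT and PROJECT avoids materializing a temporary result and scanning it a second time (on a GPU, that means one kernel launch and no intermediate buffer):

```python
# Unfused pipeline: SELECT materializes tmp, then PROJECT rescans it.
def select_then_project(rel, pred, cols):
    tmp = [t for t in rel if pred(t)]                       # pass 1 + buffer
    return [tuple(t[c] for c in cols) for t in tmp]         # pass 2

# Fused pipeline: one pass, no temporary buffer.
def fused_select_project(rel, pred, cols):
    return [tuple(t[c] for c in cols) for t in rel if pred(t)]

rel = [(1, 'a', 10), (2, 'b', 20)]
keep = lambda t: t[2] > 10
```

Both versions compute the same relation; fusion changes only the execution plan, which is why it belongs in the runtime/compiler rather than in the query semantics.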

Next Steps: Running Faster, Smarter, Bigger. Running faster: additional query optimizations, improved RA algorithms, and improved run-time load distribution. Running smarter: extension to single-node multi-GPU and to multi-node multi-GPU. Running bigger: from in-core to out-of-core processing. 23

The Future is Acceleration [images: large graphs; credits: topnews.net.tz, waterexchange.com]. Thank You. 24