Red Fox: An Execution Environment for Relational Query Processing on GPUs



1 Red Fox: An Execution Environment for Relational Query Processing on GPUs
Georgia Institute of Technology: Haicheng Wu, Ifrah Saeed, Sudhakar Yalamanchili
LogicBlox Inc.: Daniel Zinn, Martin Bravenboer, Molham Aref
NVIDIA: Gregory Diamos, Sean Baxter, Michael Garland
Portland State University: Tim Sheard
NEC Laboratories America: Srihari Cadambi, Srimat Chakradhar

2 System Diversity Today
Hardware diversity is mainstream:
  Amazon EC2 GPU instances
  Mobile platforms (DSP, GPUs)
  Keeneland system (GPUs)
  Cray Titan (GPUs)

3 The Challenge
New accelerator architectures
New applications and software stacks
Candidate application domains:
  LargeQty(p) <- Qty(q), q >
  Large graphs
Relational computations over massive unstructured data sets: sustain 10X-100X throughput over multicore


4 Opportunities and Problems
The Opportunity
  Significant potential data parallelism
The Problems
  Need to process 1-50 TBs of data*
  Small memory capacity and small PCIe bandwidth
  Fine-grained computation
* Independent Oracle Users Group. A New Dimension to Data Warehousing: IOUG Data Warehousing Survey.

5 Goal and Strategy
Goal
  Build a compilation chain to bridge the semantic gap between relational queries and GPU execution models
  10x-100x speedup for relational queries over multicore
Strategy
  1. Optimized primitive design: fast GPU RA primitive implementations (PPoPP 2013)
  2. Minimize data movement cost (MICRO 2012): between CPU and GPU, and between GPU cores and GPU memory
  3. Query-level compilation and optimizations (CGO 2014)

6 The Big Picture
The LogicBlox RT parcels out work units and manages out-of-core data.
Red Fox extends the LogicBlox environment to support GPUs.
[Figure labels: LogiQL Queries, RT, CPUs, GPU, CPU Cores]

7 LogicBlox Domain Decomposition Policy
Sand, Not Boxes
  Fitting boxes into a shipping container => hard (NP-complete)
  Pouring sand into a dump truck => dead easy
A large query is partitioned into very fine-grained work units
Work unit size should fit GPU memory
GPU work unit size will be larger than CPU work unit size
Still many problems ahead, e.g. caching data in the GPU
Red Fox: Make the GPU(s) look like very high performance cores!
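The "sand, not boxes" policy can be illustrated with a minimal Python sketch. It assumes a relation is a flat run of fixed-width rows and that a work unit is simply a contiguous row range sized to a memory budget; the function name `split_into_work_units` and the byte budgets are hypothetical and not taken from LogicBlox or Red Fox.

```python
def split_into_work_units(num_rows, row_bytes, memory_budget_bytes):
    """Split rows [0, num_rows) into contiguous ranges that each fit the budget.

    Hypothetical helper: real work-unit sizing also has to account for
    intermediate results, which this sketch ignores.
    """
    rows_per_unit = memory_budget_bytes // row_bytes
    if rows_per_unit < 1:
        raise ValueError("memory budget is smaller than a single row")
    return [
        (start, min(start + rows_per_unit, num_rows))
        for start in range(0, num_rows, rows_per_unit)
    ]


# Assumed budgets: the device with the larger budget gets larger work units.
cpu_units = split_into_work_units(num_rows=10, row_bytes=8, memory_budget_bytes=32)
gpu_units = split_into_work_units(num_rows=10, row_bytes=8, memory_budget_bytes=64)
print(cpu_units)
print(gpu_units)
```

Because every unit is just a row range, any unit can be handed to any free device, which is the property the slide contrasts with box packing.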

8 Domain Specific Compilation: Red Fox [1]
First things first: mapping the computation to the GPU
LogiQL Queries -> LogiQL-to-RA Frontend -> Query Plan -> RA-to-GPU Compiler (nvcc + RA-Lib) -> Harmony IR -> Harmony Runtime [2]
[Other figure labels: RA Primitives, Kernel Weaver; Language Front-End, Translation Layer, Machine Back-End]
1. H. Wu, G. Diamos, T. Sheard, M. Aref, S. Baxter, M. Garland, S. Yalamanchili. Red Fox: An Execution Environment for Relational Query Processing on GPUs. In CGO.
2. G. Diamos and S. Yalamanchili. Harmony: An Execution Model and Runtime for Heterogeneous Many-Core Processors. In HPDC.

9 Source Language: LogiQL
LogiQL is based on Datalog
  A declarative programming language
  Extended Datalog with aggregations, arithmetic, etc.
Find more about LogiQL in
Example
  ancestor(x,y)<-parent(x,y).
  ancestor(x,y)<-ancestor(x,t),ancestor(t,y).   (recursive definition)
Executed by the LogicBlox platform. Find more about LogicBlox:
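To make the recursive ancestor rules concrete, here is a minimal Python sketch of naive bottom-up (fixpoint) evaluation. It only illustrates the declarative semantics; it is not how the LogicBlox platform or Red Fox evaluates LogiQL, and the `parent` facts are made up.

```python
def ancestor_closure(parent):
    """Evaluate the two ancestor rules until no new facts can be derived."""
    # ancestor(x,y) <- parent(x,y).
    ancestor = set(parent)
    while True:
        # ancestor(x,y) <- ancestor(x,t), ancestor(t,y).
        derived = {
            (x, y)
            for (x, t) in ancestor
            for (t2, y) in ancestor
            if t == t2
        }
        new_facts = derived - ancestor
        if not new_facts:
            return ancestor  # fixpoint reached
        ancestor |= new_facts


# Hypothetical facts: ann -> bob -> cat -> dan
parent = {("ann", "bob"), ("bob", "cat"), ("cat", "dan")}
result = ancestor_closure(parent)
print(sorted(result))
```

The inner comprehension is a self-join of `ancestor` on the shared variable `t`, which is the kind of operation the later slides map onto relational algebra primitives.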


10 Language Front-end
Red Fox Compilation Flow: Translating LogiQL Queries to Relational Algebra (RA)
LogicBlox flow (industry-strength optimization)
  LogiQL Queries -> LogicBlox Parser (Parsing, Type Checking, AST Optimization) -> AST
Red Fox
  RA Translation of the AST into a Query Plan
  Pass Manager:
    common (sub)expression elimination
    dead code elimination
    more optimizations are needed
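The two passes named on this slide can be sketched in a few lines of Python. The plan representation below (a list of `(result, operator, operands)` triples in execution order) and all relation names are assumptions for illustration; the actual Red Fox query plan format is not shown in the slides.

```python
def eliminate_common_subexpressions(plan):
    """Drop operators that recompute an (operator, operands) pair seen earlier."""
    seen = {}    # (operator, operands) -> name of the first result computing it
    rename = {}  # removed result name -> surviving result name
    optimized = []
    for name, op, operands in plan:
        operands = tuple(rename.get(o, o) for o in operands)
        key = (op, operands)
        if key in seen:
            rename[name] = seen[key]
        else:
            seen[key] = name
            optimized.append((name, op, operands))
    return optimized, rename


def eliminate_dead_code(plan, outputs):
    """Keep only operators whose results are needed for the requested outputs."""
    live = set(outputs)
    kept = []
    for name, op, operands in reversed(plan):
        if name in live:
            live.update(operands)
            kept.append((name, op, operands))
    kept.reverse()
    return kept


# Hypothetical plan: t2 and t4 duplicate t1 and t3, and t6 is never used.
plan = [
    ("t1", "SELECT price>10", ("items",)),
    ("t2", "SELECT price>10", ("items",)),
    ("t3", "PROJECT id", ("t1",)),
    ("t4", "PROJECT id", ("t2",)),
    ("t5", "JOIN", ("t3", "orders")),
    ("t6", "SORT", ("t4",)),
]
deduplicated, rename = eliminate_common_subexpressions(plan)
final = eliminate_dead_code(deduplicated, outputs=["t5"])
print([name for name, _, _ in final])
```

Both passes treat operators as side-effect free, which holds for relational algebra operators but would need checking for anything else added to a plan.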

11 Structure of the Two IRs
[Figure: the Query Plan and the Harmony IR are each drawn as a Module containing Variables (Types, Data) and Basic Blocks of Operators (Input, Output). The Query Plan side is labeled RA Primitives, the Harmony IR side is labeled CUDA, and the RA-to-GPU Compiler (nvcc + RA-Lib) sits between them.]


12 Two IRs Enable More Choices
LogiQL Queries -> LogiQL-to-RA Frontend -> Query Plan -> RA-to-GPU (nvcc + RA-Lib) -> Harmony IR
Design supports extensions to:
  Other language front-ends: SQL Queries -> SQL-to-RA Frontend
  Other back-ends: CUDA Library, OpenCL Library, synthesized RA operators

13 Primitive Library: Data Structures
Key-value store
Arrays of densely packed tuples
Support for up to 1024-bit tuples
Support for int, float, string, date
[Figure: a packed tuple with fields id, price, tax (labels: 4 bytes, 8 bytes, 16 bytes), divided into Key and Value]

14 Primitive Library: Performance
Stores the GPU implementations of the following primitives:
  Relational Algebra: PROJECT, PRODUCT, SELECT, JOIN, SET
  Math: Arithmetic (+ - * /), Aggregation
  Built-in: String, Datetime
  Others: Sort, Unique
RA performance on GPU (PPoPP 2013)*
  Measured on Tesla C2050
  Random integers as inputs
* G. Diamos, H. Wu, J. Wang, A. Lele, and S. Yalamanchili. Relational Algorithms for Multi-Bulk-Synchronous Processors. In PPoPP, 2013.
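For readers unfamiliar with the relational algebra primitives listed above, the following sequential Python sketch shows what SELECT, PROJECT and a sort-merge JOIN compute. It assumes relations are lists of tuples sorted by their first field (the key); it says nothing about how the GPU versions in RA-Lib are implemented, and the sample data is invented.

```python
def select(relation, predicate):
    """SELECT: keep the tuples that satisfy the predicate."""
    return [t for t in relation if predicate(t)]


def project(relation, columns):
    """PROJECT: keep only the listed column positions of each tuple."""
    return [tuple(t[c] for c in columns) for t in relation]


def merge_join(left, right):
    """JOIN on the first field; both inputs must be sorted by that field."""
    out = []
    i = j = 0
    while i < len(left) and j < len(right):
        key = left[i][0]
        if key < right[j][0]:
            i += 1
        elif key > right[j][0]:
            j += 1
        else:
            # Find the run of equal keys on each side, then pair them up.
            i_end = i
            while i_end < len(left) and left[i_end][0] == key:
                i_end += 1
            j_end = j
            while j_end < len(right) and right[j_end][0] == key:
                j_end += 1
            for l in left[i:i_end]:
                for r in right[j:j_end]:
                    out.append(l + r[1:])
            i, j = i_end, j_end
    return out


# Hypothetical relations keyed by id.
items = [(1, 5.0), (2, 12.0), (3, 20.0)]   # (id, price)
taxes = [(2, 0.1), (3, 0.2), (3, 0.3)]     # (id, tax)
expensive = select(items, lambda t: t[1] > 10)
joined = merge_join(expensive, taxes)
ids = project(joined, [0])
print(joined)
```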

15 Forward Compatibility: Primitive Library
Today: use the best implementations from the state of the art
Easily integrate improved algorithms designed by third parties
  Relational Algebra: PROJECT, PRODUCT, SELECT, JOIN, SET
  Math: Arithmetic (+ - * /), Aggregation
  Built-in: String, Datetime
  Others: Merge Sort, Radix Sort, Unique
Color key:
  Red: Thrust library
  Green: ModernGPU library [1] (Merge Sort, Sort-Merge Join)
  Purple: Back40Computing [2]
  Black: Red Fox library
1 S. Baxter. Modern GPU.
2 D. Merrill. Back40Computing.

16 Kernel Weaver*: Automatically Fusing Kernels
[Figure: Kernel A and Kernel B, with steps A1-A3, combined into one Fused Kernel]
Inspired by loop fusion
Increase the granularity of kernel computation
Reduce data movement throughout the hierarchy
Compile-time automation
Input is an optimized query plan
* H. Wu, G. Diamos, S. Cadambi, and S. Yalamanchili. Kernel Weaver: Automatically Fusing Database Primitives for Efficient GPU Computation. In MICRO.
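The benefit of fusion can be shown with a small Python analogy, assuming two back-to-back SELECT operators. The unfused version materializes a temporary between the two "kernels"; the fused version makes one pass and writes no temporary. This is only an analogy for the data-movement argument, not the Kernel Weaver transformation itself, which operates on GPU kernels at compile time.

```python
def select_unfused(values, pred_a, pred_b):
    """Two separate passes; the intermediate list stands in for GPU memory traffic."""
    intermediate = [v for v in values if pred_a(v)]   # "kernel A" writes a temporary
    result = [v for v in intermediate if pred_b(v)]   # "kernel B" reads it back
    return result, len(intermediate)


def select_fused(values, pred_a, pred_b):
    """One pass applying both predicates; no temporary is materialized."""
    result = [v for v in values if pred_a(v) and pred_b(v)]
    return result, 0


values = list(range(10))
is_even = lambda v: v % 2 == 0
is_large = lambda v: v > 4

unfused_result, unfused_temp = select_unfused(values, is_even, is_large)
fused_result, fused_temp = select_fused(values, is_even, is_large)
print(unfused_result, unfused_temp)
print(fused_result, fused_temp)
```

Both versions return the same tuples; only the amount of intermediate data differs.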

17 Harmony Runtime
Managing data movements
Schedule GPU commands on available GPUs
[Figure labels: Harmony IR, Scheduler..., Runtime, GPU Driver APIs]
Current scheduling method attempts to minimize memory footprint
  Program: j_1:= ; p_1:= PROJECT j_1
  Schedule: Allocate j_1; Allocate p_1; Free j_1
Complex scheduling such as speculative execution* is also possible
* G. Diamos and S. Yalamanchili. Speculative Execution On Multi-GPU Systems. In IPDPS.
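One simple way to keep the memory footprint low is to free each intermediate right after its last use. The Python sketch below does that for a straight-line plan. The plan format and the function name `schedule_with_frees` are assumptions, and the operator producing `j_1` is not shown on the slide, so `JOIN a b` here is a placeholder; the actual Harmony scheduler is not described in this deck.

```python
def schedule_with_frees(plan, outputs):
    """Emit allocate/compute commands and free each temporary after its last use.

    plan: list of (result, operator, operands) in execution order.
    outputs: names that must stay allocated at the end.
    """
    last_use = {}
    for index, (_, _, operands) in enumerate(plan):
        for name in operands:
            last_use[name] = index
    produced = {result for result, _, _ in plan}

    commands = []
    for index, (result, op, operands) in enumerate(plan):
        commands.append(f"allocate {result}")
        commands.append(f"{result} := {op} {' '.join(operands)}")
        # dict.fromkeys removes duplicate operands while keeping their order.
        for name in dict.fromkeys(operands):
            if name in produced and name not in outputs and last_use[name] == index:
                commands.append(f"free {name}")
    return commands


# Placeholder plan: the JOIN is an assumption, the PROJECT is from the slide.
plan = [
    ("j_1", "JOIN", ("a", "b")),
    ("p_1", "PROJECT", ("j_1",)),
]
commands = schedule_with_frees(plan, outputs={"p_1"})
for command in commands:
    print(command)
```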

18 Benchmarks: TPC-H Queries
A popular decision making benchmark suite
Comprises 22 queries analyzing data from 6 big tables and 2 small tables
Scale Factor (SF) parameter controls database size
SF=1 corresponds to a 1GB database
Courtesy: O'Neil, O'Neil, Chen. Star Schema Benchmark.

19 Experimental Environment
Red Fox
  CPU: Intel 3.50GHz
  GPU: GeForce GTX Titan (2688 cores, $1000 USD)
  PCIe: 3.0 x16
  OS: Ubuntu
  G++/GCC: 4.6
  NVCC: 5.5
  Thrust: 1.7
LogicBlox 4.0
  Amazon EC2 instance cr1.8xlarge
  32 threads run on 16 cores
  CPU cost: $3000 USD

20 Red Fox TPC-H (SF=1) Comparison with CPU
[Figure: speedup for q1-q22 and the average, with series "w/ PCIe" and "w/o PCIe"]
>10x faster at 1/3 the price
On average (geometric mean):
  GPU w/ PCIe : Parallel CPU = 11x
  GPU w/o PCIe : Parallel CPU = 15x
This performance is viewed as a lower bound; more improvements are coming
Find latest performance and query plans in
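The averages on this slide are geometric means, which are the usual choice for averaging ratios such as speedups. A minimal Python sketch of the computation follows; the per-query values in it are hypothetical placeholders, not the measured TPC-H results.

```python
import math


def geometric_mean(speedups):
    """Geometric mean: the n-th root of the product, computed via logarithms."""
    return math.exp(sum(math.log(s) for s in speedups) / len(speedups))


# Hypothetical speedups for two queries (not the measured data).
example_speedups = [2.0, 8.0]
average = geometric_mean(example_speedups)
print(round(average, 6))
```

Unlike the arithmetic mean (5.0 for this input), the geometric mean is not dominated by a single large speedup.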

21 Performance of Kernel Weaver
[Figure: speedup, Fused vs. Not Fused (both on GPU), for cases a-e]
Measured on Tesla C2075
Random integers as inputs
Average additional 2.89x speedup over execution without fusion


22 Next Steps: Running Faster, Smarter, Bigger
Running Faster
  Additional query optimizations
  Improved RA algorithms
  Improved run-time load distribution
Running Smarter
  Extension to single-node multi-GPU
  Extension to multi-node multi-GPU
Running Bigger
  From in-core to out-of-core processing

23 Current Work: Implementing Leapfrog Triejoin* in GPU
Leapfrog Triejoin
  A simple worst-case optimal join algorithm in CPU
  A multiple-predicate join algorithm
  Benefits: no sort; less temporary result storage
Testing rule: Triangles(a,b,c) <- Edge(a,b),Edge(b,c),Edge(a,c).
  Searching for triangles in a large graph
  30M random edges (edge nodes are 64-bit ints)
Current performance: 1.5x faster than Red Fox using pairwise joins
* T. Veldhuizen. Leapfrog Triejoin: A Simple Worst-Case Optimal Join Algorithm. In ICDT.
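The triangle rule can be evaluated without materializing a pairwise join by binding a, then b, then intersecting sorted neighbor lists to find c. The Python sketch below is a simplified, CPU-only illustration in the spirit of Leapfrog Triejoin: it uses a seek-based intersection of two sorted lists, but it is not the full trie-iterator algorithm from the paper and not the GPU implementation mentioned on the slide. The edge list is invented.

```python
from bisect import bisect_left


def leapfrog_intersect(xs, ys):
    """Intersect two sorted duplicate-free lists, seeking past gaps with binary search."""
    out = []
    i = j = 0
    while i < len(xs) and j < len(ys):
        if xs[i] == ys[j]:
            out.append(xs[i])
            i += 1
            j += 1
        elif xs[i] < ys[j]:
            i = bisect_left(xs, ys[j], i + 1)  # jump to first x >= ys[j]
        else:
            j = bisect_left(ys, xs[i], j + 1)  # jump to first y >= xs[i]
    return out


def triangles(edges):
    """Triangles(a,b,c) <- Edge(a,b), Edge(b,c), Edge(a,c)."""
    adjacency = {}
    for a, b in sorted(set(edges)):
        adjacency.setdefault(a, []).append(b)  # each list ends up sorted
    result = []
    for a in sorted(adjacency):
        for b in adjacency[a]:
            # c must be a successor of both a and b.
            for c in leapfrog_intersect(adjacency[a], adjacency.get(b, [])):
                result.append((a, b, c))
    return result


# Hypothetical directed edge list.
edges = [(1, 2), (2, 3), (1, 3), (3, 4), (1, 4)]
found = triangles(edges)
print(found)
```

No intermediate Edge-join-Edge relation is built here; candidate values for c are produced directly by the intersection, which is the "less temporary result storage" benefit listed above.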

24 The Future is Acceleration
[Images: topnews.net.tz, Waterexchange.com; Large Graphs]
Thank You
