Scaling Data Warehousing Applications using GPUs


Sudhakar Yalamanchili
School of Electrical and Computer Engineering
Georgia Institute of Technology, Atlanta, GA

Sponsors: National Science Foundation, LogicBlox Inc., NVIDIA, Intel

Outline
- New Rules
  - Scaling and energy efficiency
  - Data movement costs
  - Thermal issues and processor physics
- Scaling Relational Database Performance with GPUs
  - Optimized primitives
  - Optimization of Data Movement
  - DRAM memory aggregation in clusters


Scaling Computing Performance
- Data Movement Costs
- Thermal Limits
- Energy Limits
Figure: Cray Titan: Heterogeneous Computing

Moore's Law
Goal: Sustain Performance Scaling
- Performance scaled with number of transistors
- Dennard scaling: power scaled with feature size
Figures: from wikipedia.org and from R. Dennard, et al., "Design of ion-implanted MOSFETs with very small physical dimensions," IEEE Journal of Solid State Circuits, vol. SC-9, no. 5, pp , Oct

Post Dennard Architecture Performance Scaling
- Power Delivery
- Cooling

$$\text{Perf}\left(\frac{\text{ops}}{\text{s}}\right) = \text{Power}(\text{W}) \times \text{Efficiency}\left(\frac{\text{ops}}{\text{joule}}\right)$$
(W. J. Dally, Keynote IITC 2012)

You can hide latency but you cannot hide energy!

Data_movement_cost
- Three operands x 64 bits/operand
- Moving 1 bit of data 1 mm at 22 nm = ~1 pJ [1]

$$\text{Energy} = \#\text{bits} \times \text{dist}_{\text{mm}} \times \frac{\text{energy}}{\text{bit} \cdot \text{mm}}$$

[1] HiPEAC Roadmap, hipeacvision.pdf

Scaling Performance: Cost of Data Movement
- Embedded Platforms: Goal: GOps/W
- Big Science: To Exascale: Goal: 20 MW/Exaflop
Figure: Cost of Data Movement. Courtesy: Sandia National Labs, R. Murphy.
- Sustain performance scaling through massive concurrency
- Data movement becomes more expensive than computation
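The data movement relation above can be evaluated directly. The following is a minimal sketch, assuming the slide's figure of roughly 1 pJ per bit per mm at 22 nm; the function name and the 10 mm distance are hypothetical values chosen for illustration, not numbers from the talk.

```python
def data_movement_energy_pj(num_bits, dist_mm, pj_per_bit_mm=1.0):
    """Energy = #bits x distance (mm) x energy per bit per mm, in picojoules."""
    return num_bits * dist_mm * pj_per_bit_mm


# Three 64-bit operands, as on the slide.
operand_bits = 3 * 64

# Moving the operands 1 mm at ~1 pJ/bit/mm.
short_hop = data_movement_energy_pj(operand_bits, dist_mm=1.0)

# Hypothetical 10 mm trip across a die (distance is an assumption).
long_hop = data_movement_energy_pj(operand_bits, dist_mm=10.0)

print(f"1 mm:  {short_hop:.0f} pJ")
print(f"10 mm: {long_hop:.0f} pJ")
```

Because the cost is linear in distance, keeping operands close to the functional units is what reduces the data movement term.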


Post Dennard Architecture Performance Scaling

$$\text{Perf}\left(\frac{\text{ops}}{\text{s}}\right) = \text{Power}(\text{W}) \times \text{Efficiency}\left(\frac{\text{ops}}{\text{joule}}\right)$$
(W. J. Dally, Keynote IITC 2012)

Operator_cost + Data_movement_cost
- Specialization → heterogeneity and asymmetry
- Three operands x 64 bits/operand

$$\text{Energy} = \#\text{bits} \times \text{dist}_{\text{mm}} \times \frac{\text{energy}}{\text{bit} \cdot \text{mm}}$$

Scaling Performance: Simplify, Diversify & Multiply
Figures: AMD Bulldozer Core, ARM A7 Core (arm.com), NVIDIA Fermi
- Extracting single thread performance costs energy
  - Out-of-order execution
  - Branch prediction
  - Scheduling etc.
  - Still important!
- Multithread performance exploits parallelism
  - Simpler pipelines
  - Core scaling

Asymmetry vs. Heterogeneity
Figure columns: Performance Asymmetry, Functional Asymmetry, Heterogeneous (each drawn with memory controllers, MC); figure labels: Uniform ISA, Multi-ISA
- Multiple voltage and frequency islands
- Different memory technologies
  - STT-RAM, PCM, Flash
- Complex cores and simple cores
- Shared instruction set architecture (ISA)
- Subset ISA
- Distinct microarchitecture
- Fault and migrate model of operation [1]
- Multi-ISA
- Microarchitecture
- Memory & Interconnect hierarchy

[1] T. Li, et al., "Operating system support for shared ISA asymmetric multi-core architectures," in WIOSCA, 2008.

The Challenge: The Memory System
Figures: Xeon Phi, Hybrid Memory Cube
- What should the memory hierarchy look like?
- Parallelism vs. locality tradeoffs
- Minimize data movement → Processor in Memory?


Thermal Capacity
- Exploit package physics
- Temperature changes on the order of milliseconds
- Workload behaviors change on the order of microseconds
- Impact on device behavior?
Figure: thermal capacity and a time-varying workload (instructions/cycle vs. time). Figures: psdgraphics.com and wikipedia.org
Power-Performance Management!

Summary: New Performance Scaling Rules
- Energy efficiency: Scale performance by scaling energy efficiency → diversify → programming models?
- Parallelism: Scale the number of cores rather than the performance of a single core → multiply → programming models
- Data Movement: The energy cost of data movement exceeds the energy cost of computation → communication-centric
- Physics Capacity: Scaling is limited by thermal/power capacity → power/thermal management


Outline
- New Rules
  - Scaling and energy efficiency
  - Data movement costs
  - Thermal issues and processor physics
- Scaling Relational Database Performance with GPUs
  - Optimized primitives
  - Optimization of Data Movement
  - DRAM memory aggregation in clusters

System Diversity
Hardware Diversity is Mainstream
- Amazon EC2 GPU Instances
- Mobile Platforms (DSP, GPUs)
- Keeneland System (GPUs)
- Cray Titan (GPUs)

System Model
Figure labels: Large Graphs; Programming Models; Data Movement Optimizations; System Abstractions (e.g. GAS, Virtual DIMMs, etc.); Domain Specific Languages; Compiler and Run-Time Support; Cluster Wide Hardware Consolidation; Hardware Customization

Databases: Not a Traditional Domain of GPUs
  LargeQty(p) <- Qty(q), q >
Relational Computations Over Massive Data Sets


Data Warehousing Applications on GPUs
- The Opportunity
  - Significant potential data parallelism
  - If data fits in GPU memory, 2x-27x speedup has been shown [1]
- The Challenge
  - Need to process 1-50 TBs of data [2]
  - 15-90% of the total time is spent moving data between CPU and GPU
  - Fine grained computation

[1] B. He, M. Lu, K. Yang, R. Fang, N. K. Govindaraju, Q. Luo, and P. V. Sander. Relational query co-processing on graphics processors. In TODS,
[2] Independent Oracle Users Group. A New Dimension to Data Warehousing: 2011 IOUG Data Warehousing Survey.

Red Fox: Goal and Status (Haicheng Wu)
- Goal
  - Build a compiler/runtime framework to accelerate Datalog LB queries using GPUs
  - Understand the Good, the Bad and the Ugly!
- Status
  - Capable of running all TPC-H queries in full on GPUs
  - Requires that data fits in the GPU memory → move to fusion parts
  - Focus to date: correctness and performance
  - Moving forward → performance and scale

Domain Specific Compilation: Red Fox
Joint with LogicBlox Inc.
Figure (compilation flow) labels, in order: Datalog LB Queries; LogicBlox Front-End; Language Front-End; src-src Optimization; Kernel Weaver; IR Optimization; RA-To-PTX (nvcc + RA-Lib); Red Fox RT; Query Plan; Kernel IR; RA Primitives; Translation Layer; Machine Neutral Back-End
- Targeting Accelerator Clouds for meeting the demands of data warehousing applications
- In-core databases

Datalog LB Query and Front-end

Example Datalog LB Query (the even and odd rules form a recursive definition):

    number(n)->int32(n).
    number(0).
    // other number facts elided for brevity
    next(n,m)->int32(n), int32(m).
    next(0,1).
    // other next facts elided for brevity

    even(n)->int32(n).
    even(0).
    even(n)<-number(n),next(m,n),odd(m).

    odd(n)->int32(n).
    odd(n)<-next(m,n),even(m).

Harmony IR (CFG) produced by the front-end:

    BB1:
      COPY(pre_odd,odd){PTX}
      COPY(pre_even,even){PTX}
      JOIN_PARTITION(next,even){PTX}
      JOIN_COMPUTE(next,even){PTX}
      JOIN_GATHER(temp_odd){PTX}
      PROJECT(odd,temp_odd){PTX}
    BB2:
      PROJECT(m_1,next){PTX}
      JOIN_PARTITION(number,m_1){PTX}
      JOIN_COMPUTE(number,m_1){PTX}
      JOIN_GATHER(temp_j_1){PTX}
      PROJECT(j_1,temp_j_1){PTX}
      JOIN_PARTITION(j_1,odd){PTX}
      JOIN_COMPUTE(j_1,odd){PTX}
      JOIN_GATHER(temp_even){PTX}
      PROJECT(even,temp_even){PTX}
    BB3: if pre_odd == odd?    (branches: Y, N)
    BB4: pre_even == even?     (branches: Y, N)
    BB5: HALT
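The recursive even/odd query and its control-flow graph amount to a fixpoint iteration: apply both rules, then stop once neither relation changed. Below is a minimal CPU-side sketch in Python, not the Red Fox implementation. The slide elides the `number` and `next` facts, so the ranges built from `limit` are assumptions made only so that the example runs.

```python
def even_odd_fixpoint(limit):
    """Evaluate the even/odd rules by iterating until nothing changes."""
    # Assumed facts (elided on the slide): number(0..limit), next(n, n+1).
    number = set(range(limit + 1))
    next_rel = {(n, n + 1) for n in range(limit)}

    even = {0}   # even(0).
    odd = set()

    while True:
        # Snapshot the relations, like COPY(pre_odd,odd) / COPY(pre_even,even).
        pre_even, pre_odd = set(even), set(odd)

        # odd(n) <- next(m,n), even(m).
        odd |= {n for (m, n) in next_rel if m in even}

        # even(n) <- number(n), next(m,n), odd(m).
        even |= {n for (m, n) in next_rel if m in odd and n in number}

        # Halt when both relations are unchanged by the last pass.
        if pre_odd == odd and pre_even == even:
            return even, odd


even, odd = even_odd_fixpoint(6)
print(sorted(even), sorted(odd))
```

Each set comprehension plays the role of a JOIN followed by a PROJECT in the IR; on the GPU these become the partition, compute and gather kernels listed in BB1 and BB2.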

Research Thrusts
- I: Optimized implementations of primitives
  - Relational algebra
  - Data management within the GPU memory hierarchy
- II: Data movement optimizations
  - Between hosts and (local or remote) accelerators
  - Within an accelerator
- III: In-core processing
  - Cluster wide memory aggregation techniques
  - Change the ratio of host memory size to accelerator memory size

Primitives
Map operators to GPU implementations:
- From RA Library: PROJECT, PRODUCT, SELECT, JOIN
- From Thrust Library: SORT, UNIQUE, AGGREGATION, SET Family
Data Structure: weakly sorted arrays of densely packed tuples
Figure: example tuple with fields id, price, tax and zero padding; size labels 4 bytes, 8 bytes, 16 bytes; Key and Value portions marked
- Tuple fields can be integer, float, datetime, string, etc.
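One reason sorted arrays of key/value tuples are a convenient storage format is that a JOIN can then be computed with a single coordinated pass over both inputs. The sketch below is a sequential merge join in Python, shown only to illustrate that property; it is an assumption for illustration and is not taken from the RA library, whose GPU JOIN is a multi-stage parallel algorithm.

```python
def merge_join(left, right):
    """Join two lists of (key, value) tuples that are sorted by key."""
    out = []
    i = j = 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][0], right[j][0]
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            # Find the run of right-hand tuples that share this key.
            j_end = j
            while j_end < len(right) and right[j_end][0] == lk:
                j_end += 1
            # Pair every left tuple with this key against that run.
            while i < len(left) and left[i][0] == lk:
                for k in range(j, j_end):
                    out.append((lk, left[i][1], right[k][1]))
                i += 1
            j = j_end
    return out


# Hypothetical relations keyed by an integer id.
prices = [(1, "a"), (2, "b"), (2, "c"), (4, "d")]
taxes = [(2, "x"), (3, "y"), (4, "z")]
print(merge_join(prices, taxes))
```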

RA Primitives Library: Multistage Algorithms
- Hybrid multi-stage algorithm (partition, compute, gather) to make trade-offs between computation complexity and memory access efficiency
- Strategy: Increase core utilization until the computation becomes memory bound, and then achieve near peak utilization of the memory interface
Figure: Example of SELECT *

RA Primitives Library: Example of JOIN
- Most complicated, JOIN: 57%~72% of peak performance
- Most efficient, PRODUCT, PROJECT and SELECT: 86%~92% of peak performance
- Measured on Tesla C2050
- Random integers as inputs

* G. Diamos, H. Wu, J. Wang, A. Lele, and S. Yalamanchili. Relational Algorithms for Multi-Bulk-Synchronous Processors. In PPoPP,
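The three stages named above (partition, compute, gather) can be sketched for SELECT as follows. This is a sequential Python illustration under stated assumptions, not the library's CUDA code: each chunk stands in for the work of one thread block, and `num_partitions` and the function name are hypothetical.

```python
from itertools import accumulate


def multistage_select(tuples, predicate, num_partitions=4):
    """SELECT as partition -> compute -> gather (stream compaction)."""
    # Stage 1 (partition): split the input into contiguous chunks,
    # one per hypothetical thread block.
    size = -(-len(tuples) // num_partitions) if tuples else 0
    parts = [tuples[i:i + size] for i in range(0, len(tuples), size)] if size else []

    # Stage 2 (compute): each chunk filters independently into a local buffer.
    local = [[t for t in part if predicate(t)] for part in parts]

    # Stage 3 (gather): an exclusive prefix sum over the match counts gives
    # each chunk its offset in the dense output array.
    counts = [len(matches) for matches in local]
    offsets = [0] + list(accumulate(counts))[:-1]
    out = [None] * sum(counts)
    for off, matches in zip(offsets, local):
        out[off:off + len(matches)] = matches
    return out


print(multistage_select(list(range(10)), lambda x: x % 3 == 0))
```

The gather stage is what keeps the result densely packed, so a later operator can read it with contiguous memory accesses.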


Research Thrusts
- I: Optimized implementations of primitives
  - Relational algebra
  - Data management within the GPU memory hierarchy
- II: Data movement optimizations
  - Between hosts and (local or remote) accelerators
  - Within an accelerator
- III: In-core processing
  - Cluster wide memory aggregation techniques
  - Change the ratio of host memory size to accelerator memory size

Data Movement in Kernel Execution
Figure: a Thread Block or Cooperative Thread Array (CTA) with three numbered steps: 1 Input, 2 Execute, 3 Result; memory bandwidth labeled ~250 GB/s; other labels: T, M, N


Kernel Fusion: A Data Movement Optimization
- Increase the granularity of kernel computation
- Reduce data movement throughout the hierarchy
- Inspired by loop fusion
- Compile-time automation
  - Input is an optimized query plan

Kernel Weaving and Fusion
- Interweaving and fusing individual stages (CUDA kernels)
- Use registers or shared memory to store temporary results
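The effect of fusion on intermediate data can be shown with two back-to-back SELECT operators. The following Python sketch is an illustration only: the function names and the returned count of materialized tuples are assumptions, and a real fused CUDA kernel would hold the temporary in registers or shared memory rather than in a Python variable.

```python
def select(rows, pred):
    """One unfused operator: reads its input and writes a full output array."""
    return [r for r in rows if pred(r)]


def unfused(rows, pred_a, pred_b):
    """Two separate passes; the intermediate result is materialized."""
    temp = select(rows, pred_a)        # written out, then read back
    result = select(temp, pred_b)
    return result, len(temp)           # len(temp) = intermediate tuples stored


def fused(rows, pred_a, pred_b):
    """One pass; each row flows through both predicates before any store."""
    result = [r for r in rows if pred_a(r) and pred_b(r)]
    return result, 0                   # no intermediate array is stored


rows = list(range(20))
is_even = lambda r: r % 2 == 0
is_large = lambda r: r >= 10

print(unfused(rows, is_even, is_large))
print(fused(rows, is_even, is_large))
```

Both versions return the same tuples; the fused one avoids writing and re-reading the temporary relation, which is the data movement the slide refers to.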

Kernel Weaver: Major Benefits
- Reduce data footprint
  - Reduction in accesses to global memory
  - Access to common data across kernels improves temporal locality
  - Reduction in PCIe transfers
- Expand optimization scope of the compiler
  - Data re-use
  - Increase textual scope of optimizers
Figure: unfused, inputs A1 and A2 feed Kernel A, which writes Temp; Temp and A3 feed Kernel B, which writes Result. Fused: A1, A2 and A3 feed Fused Kernel A, B, which writes Result.

* H. Wu, G. Diamos, S. Cadambi, and S. Yalamanchili. Kernel Weaver: Automatically Fusing Database Primitives for Efficient GPU Computation. In MICRO

Kernel Weaver: Micro-benchmarks
- Fusing the operator combinations (a) to (e) on Tesla C2070
- Average 2.89x speedup
Figure: speedup, fused vs. not fused, for cases a, b, c, d, e

Resource Usage & Occupancy
Table (values not preserved): PTX Reg #, Shared MEM (Byte) and Occupancy (%) for the individual primitives (PROJECT, SELECT, JOIN / Multiply) and after kernel fusion for cases (a) to (e)
- Kernel fusion may increase resource usage and thus decrease occupancy
- Retains other benefits

TPC-H Queries
- A popular decision making benchmark suite
- Has 22 queries analyzing data from 6 big tables
- Scale Factor parameter controls the database size
- Red Fox can run SF=1 for all 22 queries
- GPU benchmark suite being generated (Summer 2013)

Experimental Environment
- CPU: Xeon 2.80GHz
- GPU: 1 Tesla C2075 (6GB GDDR5 memory)
- OS: Ubuntu Server
- Software: GCC, NVCC 4.2, Thrust

TPC-H Performance (SF = 1)
- Total time for all 22 queries: seconds
- Compared with a MySQL implementation on a 4 node CPU cluster*, Red Fox is 59x faster on average
Example: Q22
- Input Size: 192MB
- Operator #: 92
- CUDA Kernel #: 205
- Query Plan: (figure)

* Ngamsuriyaroj, Pornpattana, Performance Evaluation of TPC-H Queries on MySQL Cluster. WAINA

Where is the time spent?
Figure: execution time breakdown by category (project, select, product, join, diff, sort, unique, merge, agg, arith, conv, others, copy, pcie); labeled shares: 48.82%, 38.94%
- Most of the time is spent in JOIN and SORT
- PCIe transfer time is less than 10%
- PROJECT is used most frequently, but takes less than 5%

Future Improvements
- Optimized query plan
  - Reduce tuple size
  - Common operator reduction
  - Reorder operators
- More RA implementations
  - Hash Join
  - Radix Sort
- Pipeline the execution
- Expect 10x-100x speedup from the above techniques
- Increase scale factor → Oncilla

Research Thrusts
- I: Optimized implementations of primitives
  - Relational algebra
  - Data management within the GPU memory hierarchy
- II: Data movement optimizations
  - Between hosts and (local or remote) accelerators
  - Within an accelerator
- III: In-core processing
  - Cluster wide memory aggregation techniques
  - Change the ratio of host memory size to accelerator memory size

III. In-Core Processing
Figure: four nodes, each with a GPU (~2K cores), GPU memory (~6GB), main memory (~128GB) and a multi-core CPU (2-16 cores)
- Cluster-based memory aggregation
- Hardware support for a global, non-coherent, physical address space system
- Change the ratio of host memory : GPU memory
- Joint project with the University of Heidelberg


Oncilla: Fabrics for Accelerator Clouds (Jeff Young)
- Goal: Efficient memory aggregation for accelerators in data centers
- Solution: Use Global Address Spaces (GAS) and commodity fabrics (HT, QPI, PCIe, 10GE, IB)
- Support in-core databases using software from the Red Fox project

Oncilla TPC-H Microbenchmarks (Preliminary Results)
Figure: Using Disk vs. Using Aggregation


EXTOLL Network Adapter and Fabric
Courtesy, Prof. H. Fröning, the University of Heidelberg
- Provides RDMA transfer (RMA), MMIO-based put/get operations for GAS (SMFU), and support for efficient, small messages (VELO)
- Current V6 prototype: 300 ns latency per hop, 24 Gbps bandwidth, very low overhead (64 B per packet) [1]
- ASIC projected to have bandwidth of 8-12 GB/s

[1] H. Fröning, On Achieving High Message Rates, CCGRID

Oncilla Infrastructure
- Two node cluster prototypes
- GB of DRAM
- NVIDIA C2070 GPUs
- EXTOLL cluster
- Network adapters and fabric developed by University of Heidelberg, Germany
- AIC custom blades
- Galibier Virtex 6 prototypes
- IB cluster based on KIDS
- Mellanox QDR IB adapter
- Dual-socket Intel Xeon X

Figure labels: System Software; Scaling Rules; Applications; Technology; Architecture

Thank You
Questions?
