Why GPUs?
Robert Strzodka (MPII), Dominik Göddeke (TUDo), Dominik Behr (AMD)
Conference on Parallel Processing and Applied Mathematics, Wroclaw, Poland, September 13-16, 2009
www.gpgpu.org/ppam2009
Overview: Computation / Bandwidth / Power; GPU Characteristics
Data Processing in General: the memory wall and the lack of parallelism
(diagram: data flowing IN/OUT between memory and the processor)
Old and New Wisdom in Computer Architecture
Old: Power is free, transistors are expensive.
New: Power wall. Power is expensive, transistors are free (we can put more transistors on a chip than we can afford to turn on).
Old: Multiplies are slow, memory access is fast.
New: Memory wall. Multiplies are fast, memory is slow (about 200 clocks to DRAM, 4 clocks for an FP multiply).
Old: Increase instruction-level parallelism via compilers and hardware innovation (out-of-order execution, speculation, VLIW, ...).
New: ILP wall. Diminishing returns on more ILP hardware; explicit thread and data parallelism must be exploited.
New: Power Wall + Memory Wall + ILP Wall = Brick Wall
(slide courtesy of Christos Kozyrakis)
Uniprocessor Performance (SPECint)
(chart: performance relative to VAX-11/780, 1978-2006; growth of 25%/year, then 52%/year, then ??%/year; roughly a 3x gap vs. the earlier trend)
Sea change in chip design: multiple cores or processors per chip.
From Hennessy and Patterson, Computer Architecture: A Quantitative Approach, 4th edition, 2006; slide courtesy of Christos Kozyrakis
SW Performance: FeatFlow 1993-2008
(chart: best and average speedup per year, 1993-2008)
80x speedup in 16 years from HW for free.
But: HW peak performance had a 1000x speedup.
And: since 2006, stagnation.
No future for serial code: parallelism is indispensable.
Instruction-Stream-Based Processing
(diagram: the processor fetches instructions and data from memory through a cache)
Instruction- and Data-Streams
Addition of 2D arrays: C = A + B

Instruction stream processing the data:
for(y=0; y<height; y++)
  for(x=0; x<width; x++) {
    C[y][x] = A[y][x] + B[y][x];
  }

Data streams undergoing a kernel operation:
inputstreams(a, b);
outputstream(c);
kernelprogram(op_add);
processstreams();
Data-Stream-Based Processing
(diagram: data streams from memory through configured processor pipelines and back to memory)
Architectures: Data - Processor Locality
Field Programmable Gate Array (FPGA): compute by configuring Boolean functions and local memory.
Processor Array / Multi-core Processor: assemble many (simple) processors and memories on one chip.
Processor-in-Memory (PIM): insert processing elements directly into RAM chips.
Stream Processor: create data locality through a hierarchy of memories.
Graphics Processor Unit (GPU): hide data access latencies by keeping 1000s of threads in-flight.
GPUs often excel in the performance/price ratio.
Overview: Computation / Bandwidth / Power; GPU Characteristics
The GPU is a Fast, Highly Multi-Threaded Processor
Input arrays: n-dimensional; output arrays: n-dimensional.
Thousands of parallel threads are started, in groups of m, e.g. 32.
Each group operates in a SIMD fashion, with predication if necessary.
In general threads are independent, but certain collections of groups may use on-chip memory to exchange data.
Input and Output Arrays
Single-threaded: input and output arrays may overlap.
Multi-threaded: input and output arrays should not overlap.
(diagram: overlapping vs. separate input/output arrays)
Native Memory Layout: Data Locality
General memory: 1D input, 1D output; other dimensions handled with offsets.
Texture memory: 2D input, 2D output; other dimensions handled with offsets.
(diagram: color-coded locality, red = near, blue = far)
GPUs are Optimized for Local Data Access
Memory access types: cached, sequential, random.
CPU (Pentium 4): large cache, few processing elements, optimized for spatial and temporal data reuse.
GPU (GeForce 7800 GTX): small cache, many processing elements, optimized for sequential (streaming) data access.
(chart courtesy of Ian Buck)
Configuration Overhead
(chart: small workloads are configuration-limited, large workloads are computation-limited; chart courtesy of Ian Buck)
Bandwidth in a CPU-GPU System
Sparse MatVec on a Tensor Product Grid
13 GFLOP/s (single precision) with GPGPU on a GeForce 8800 GTX.
46 GFLOP/s (single precision), 140 GB/s with CUDA on a GeForce GTX 280.
Conclusions
Parallelism is now indispensable for further performance increases.
For most applications, data - processor locality plays an important role.
GPUs offer a fast, inexpensive solution, but understanding the parallel trade-offs is crucial.