GPU for HPC
Simone Melchionna (EPFL/EDMX) simone.melchionna@epfl.ch
Jonas Latt (EPFL/EDMX) jonas.latt@epfl.ch
Francis Lapique (EPFL/DIT) francis.lapique@epfl.ch
October 2010
Moore's law: in the old days, compute power increased exponentially.
The free lunch is over: no further increase of clock rate.
A limit to clock rate: power consumption and cost.
Another limit: memory access.
Checkpoint: The free lunch is over: no further automatic increase of CPU frequency. Our only chance to keep up with Moore's law: parallel programming.
Needs
- Must use all cores efficiently
- Careful data and memory management
- Must rethink software design
- Must rethink algorithms
- Must learn new skills!
GPU (Graphics Processing Unit)
- PC hardware dedicated to 3D graphics
- Massively parallel SIMD processor
- Performance pushed by the game industry
Games and Graphics
Computer Games
- PC games business: an $11 billion/year market (2008)
- 111 million GPUs shipped in 2008
- 1/3 of all PCs have more than one GPU
- High-end GPUs sell for around $300
GPGPU (General-Purpose computing on the GPU)
- Started in the computer graphics research community
- Maps computational problems onto the graphics rendering pipeline
Speed-ups
Why GPU computing?
- GPU is fast: massively parallel
  - CPU: ~4 cores @ 3.2 GHz (Intel Quad Core)
  - GPU: ~30 cores @ 1.3 GHz (NVIDIA GT200)
- Programmable: NVIDIA CUDA, DirectX Compute Shader, OpenCL
- High-precision floating-point support: 64-bit floating point (IEEE 754)
- Inexpensive desktop supercomputer: NVIDIA Tesla C1060: ~1 Tflops @ $1000
NVIDIA: Company History
- 1993: NVIDIA is founded by Jen-Hsun Huang, Chris Malachowsky, and Curtis Priem.
- 1995: NVIDIA introduces NV1, the first mainstream multimedia processor.
- 1997: NVIDIA introduces the RIVA 128 (Real-time Interactive Video Animation) 3-D graphics chip, the first high-performance, 128-bit Direct3D processor.
- 1999: NVIDIA goes public in January.
- 2000: Microsoft Corporation selects NVIDIA to provide the graphics processors for its forthcoming gaming console, the Xbox.
- 2001: NVIDIA introduces GeForce3, the industry's first programmable graphics processor.
- 2006: The CUDA project is announced together with the G80 in November; the public beta version of the CUDA SDK was released in February 2007.
CPU vs GPU: FLOPS
CPU vs GPU: Memory Bandwidth
CPU vs GPU Power Consumption: Flops per Watt. The Green500 list ranks computers by the rate of computation delivered for every watt of power consumed.
To understand this difference between CPU and GPU, let's investigate the architecture of a CPU.
Example: AMD Opteron
Why are CPUs so complicated? Instruction-level parallelism ("superscalar" processors): more than one instruction is executed during a clock cycle by simultaneously dispatching multiple instructions to redundant functional units on the processor.
Instruction-level parallelism
- Compiler extracts the best performance, reordering instructions if necessary.
- Out-of-order CPU execution avoids delays while waiting for reads/writes or earlier operations.
- Branch prediction minimises delays due to conditional branching (loops, if-then-else).
- Memory hierarchy delivers data to registers fast enough to feed the processor.
These all limit the number of pipelines that can be used and increase chip complexity; perhaps 90% of an Intel chip is devoted to control and data movement.
Comparison: GPU is much simpler than CPU
Intel Core 2 / Xeon / i7:
- 4 MIMD cores
- few registers, multilevel caches
- 5-10 GB/s bandwidth to main memory
NVIDIA GTX 280:
- 240 cores, arranged as 30 units each with 8 SIMD cores
- lots of registers, almost no cache
- 5 GB/s bandwidth to host processor (PCIe x16 gen 2)
- 140 GB/s bandwidth to graphics memory
Comparison: GPU is much simpler than CPU
GPU:
- Up to 240 cores on a single chip
- Simplified logic (minimal caching, no out-of-order execution, no branch prediction)
- Most of the chip is devoted to floating-point computation
- Usually arranged as multiple units, with each unit being effectively a vector unit
- Very high bandwidth (up to 140 GB/s) to graphics memory (up to 4 GB)
Multi-threaded parallelism on CPU: two completely independent instruction streams. 2 cores = 2 simultaneous instruction streams.
Thread-level parallelism on GPU: common instruction stream for groups of functional units.
NVIDIA GeForce GTX 285 core
- Groups of 32 threads share an instruction stream (called warps)
- Up to 32 groups are simultaneously interleaved
- Up to 1024 fragment contexts can be stored
NVIDIA GeForce GTX 285: there are 30 of these cores on the GTX 285, i.e. over 30,000 threads!
SIMD vs MIMD
- MIMD (Multiple Instruction / Multiple Data): each core operates independently; each can run different code, performing different operations on entirely different data.
- SIMD (Single Instruction / Multiple Data): all cores execute the same instruction at the same time, but work on different data; only one instruction decoder is needed to control all cores; functions like a vector unit.
Summary: two ways of handling parallelism
- CPU: instruction-level parallelism, with branch prediction.
- GPU: thread-level parallelism across cores (MIMD across the units, SIMD within each unit); simplified hardware, no branch prediction; the processor is packed full of ALUs by sharing an instruction stream across groups of threads; SIMD execution model.
CPU-style memory: CPU cores run efficiently when data is resident in cache (caches reduce latency and provide high bandwidth).
GPU-style memory: more ALUs, no traditional cache hierarchy, so a high-bandwidth connection to memory is needed.
GPU-style memory. On a high-end GPU: 11x the compute performance of a high-end CPU, and 6x the bandwidth to feed it, with no complicated cache hierarchy. The GPU memory system is designed for throughput: a wide bus (150 GB/sec), with memory requests repacked/reordered/interleaved to maximize use of the memory bus.
Data Throughput
What is CUDA? CUDA (Compute Unified Device Architecture) is NVIDIA's unified hardware and software specification for parallel computation. As an enabling hardware and software technology, CUDA makes it possible to use the many computing cores in a graphics processor to perform general-purpose mathematical calculations, achieving dramatic speedups in computing performance.
Books, links
- CUDA 2.x Programming Guide, NVIDIA
- GPU Gems 3, Hubert Nguyen (ed.), 2007
- Introduction to Parallel Computing (2nd Edition), Ananth Grama, George Karypis, Vipin Kumar, and Anshul Gupta, 2003
- CUDA Zone: Education
GPGPU/CUDA Application Fields
Performance/Development: Streaming SIMD Extensions