Parallel Systems Course: Chapter IV. GPU Programming. Jan Lemeire Dept. ETRO November 6th 2008


1 Parallel Systems Course: Chapter IV GPU Programming Jan Lemeire Dept. ETRO November 6th 2008 GPU Programming with CUDA Jan Lemeire

2 Overview 1. CUDA-enabled GPU architecture 2. Programming for GPUs 3. How a CUDA program runs 4. Optimizing CUDA programs 5. Analysis & Conclusions GPU Programming with CUDA Jan Lemeire

3 Overview 1. CUDA-enabled GPU architecture 2. Programming for GPUs 3. How a CUDA program runs 4. Optimizing CUDA programs 5. Analysis & Conclusions GPU Programming with CUDA Jan Lemeire

4 Utilization of Graphics Card (Link 1) A processor (GPU) for High-Performance Computing, via NVIDIA's CUDA API. The PC graphics market largely subsidizes the development of these GPGPUs (General-Purpose computation on GPUs). Cards that support CUDA: 8, 9, 200 series. GPU Programming with CUDA Jan Lemeire

5 Goal of chapter Understand the benefits & disadvantages of the technology, so that if you ever have to decide whether or not a new technology should be introduced, you understand the consequences! GPU Programming with CUDA Jan Lemeire

6 Why Are GPUs So Fast? GPU specialized for math-intensive, highly parallel computation, so more transistors can be devoted to data processing rather than data caching and flow control. [Diagram: CPU die with large Control and Cache areas and few ALUs vs. GPU die that is mostly ALUs; each with its own DRAM.] Commodity industry: provides economies of scale. Competitive industry: fuels innovation.

7 G80 GPU Computing Processors execute computing threads. Thread Execution Manager issues threads. 128 Thread Processors. Parallel Data Cache accelerates processing. [Block diagram: Host feeds an Input Assembler and Thread Execution Manager, which dispatch to groups of Thread Processors, each group with its own Parallel Data Cache; load/store to Global Memory.] NVIDIA Corporation 2007

8 Goal: Scaling the Architecture Same program, scalable performance. [Diagram: two GPUs running the same program, one with many Thread Processor / Parallel Data Cache groups and one with few, both with load/store to Global Memory.] NVIDIA Corporation 2007

9 Graphics Programming Model Graphics Application -> Vertex Program -> Rasterization -> Fragment Program -> Display NVIDIA Corporation 2007

10 What's Wrong With GPGPU? [Diagram: Application -> Vertex Program -> Rasterization -> Pixel Program -> Display, with the Pixel Program limited to Input Registers, Texture, Constants, Temp Registers and Output Registers.] NVIDIA Corporation 2007

11 What's Wrong With GPGPU? APIs are specific to graphics. Fragment program limitations: limited instruction set; no thread communication; limited texture size and dimension; limited local storage; limited shader outputs; no scatter. NVIDIA Corporation 2007

12 Building a Better Pixel Features: millions of instructions; full integer and bit instructions; no limits on branching, looping; 1D, 2D, or 3D thread ID allocation. [Diagram: thread program with Thread Number, Texture, Constants, Registers and Output Registers.]

13 Global Memory Features: fully general load/store to GPU memory; untyped, not fixed texture types; pointer support. [Diagram: thread program with Thread Number, Texture, Constants, Registers, reading and writing Global Memory.]

14 Parallel Data Cache Features: dedicated on-chip memory; shared between threads for inter-thread communication; explicitly managed; as fast as registers. [Diagram: thread program with Thread Number, Texture, Constants, Registers, Parallel Data Cache and Global Memory.]

15 Hardware Implementation: Memory Architecture The local, global, constant, and texture spaces are regions of device memory. Each multiprocessor has: a set of 32-bit registers per processor; on-chip shared memory, where the shared memory space resides; a read-only constant cache to speed up access to the constant memory space; a read-only texture cache to speed up access to the texture memory space. [Diagram: device with Multiprocessors 1..N, each containing Processors 1..M with registers, an Instruction Unit, Shared Memory, Constant Cache and Texture Cache, all backed by Device Memory.] NVIDIA Corporation 2007

16 Example Fluid Algorithm GPU Computing with CUDA [Diagram comparing three execution models for Pn = P1 + P2 + P3 + P4: CPU, a single thread works out of cache on data in DRAM; GPGPU, shader units make multiple passes through video memory; CUDA, the Thread Execution Manager runs many threads that share P1..P5 through the Parallel Data Cache, with program/control in DRAM: parallel execution through cache.] NVIDIA Corporation 2007

17 Overview 1. CUDA-enabled GPU architecture 2. Programming for GPUs 3. How a CUDA program runs 4. Optimizing CUDA programs 5. Analysis & Conclusions GPU Programming with CUDA Jan Lemeire

18 CUDA: Programming GPU in C Philosophy: provide the minimal set of extensions necessary to expose power.

Declaration specifiers to indicate where things live:

    __global__ void KernelFunc(...);  // kernel callable from host
    __device__ void DeviceFunc(...);  // function callable on device
    __device__ int GlobalVar;         // variable in device memory
    __shared__ int SharedVar;         // shared in PDC by thread block

Extended function invocation syntax for parallel kernel launch:

    KernelFunc<<<500, 128>>>(...);    // launch 500 blocks w/ 128 threads each

Special variables for thread identification in kernels:

    dim3 threadIdx; dim3 blockIdx; dim3 blockDim; dim3 gridDim;

Intrinsics that expose specific operations in kernel code:

    __syncthreads();                  // barrier synchronization within kernel

19 CUDA: Runtime support Explicit memory allocation returns pointers to GPU memory: cudaMalloc(), cudaFree(). Explicit memory copy for host <-> device, device <-> device: cudaMemcpy(), cudaMemcpy2D(), ... Texture management: cudaBindTexture(), cudaBindTextureToArray(), ... OpenGL & DirectX interoperability: cudaGLMapBufferObject(), cudaD3D9MapVertexBuffer(), ... NVIDIA Corporation 2007

20 Example: Vector Addition Kernel

    // Compute vector sum C = A + B
    // Each thread performs one pair-wise addition
    __global__ void vecAdd(float* A, float* B, float* C)
    {
        int i = threadIdx.x + blockDim.x * blockIdx.x;
        C[i] = A[i] + B[i];
    }

NVIDIA Corporation 2007

21 Example: Host code for memory

    // allocate host (CPU) memory
    float* h_A = (float*) malloc(N * sizeof(float));
    float* h_B = (float*) malloc(N * sizeof(float));
    // ... initialize h_A and h_B ...

    // allocate device (GPU) memory
    float *d_A, *d_B, *d_C;
    cudaMalloc((void**) &d_A, N * sizeof(float));
    cudaMalloc((void**) &d_B, N * sizeof(float));
    cudaMalloc((void**) &d_C, N * sizeof(float));

    // copy host memory to device
    cudaMemcpy(d_A, h_A, N * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, h_B, N * sizeof(float), cudaMemcpyHostToDevice);

    // execute the kernel on N/256 blocks of 256 threads each
    vecAdd<<<N/256, 256>>>(d_A, d_B, d_C);

NVIDIA Corporation 2007
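The slide stops at the kernel launch. A minimal sketch of the remaining steps, not shown on the slide (h_C is assumed to be a host buffer of N floats):

    // copy the result back to the host, then release device and host memory
    cudaMemcpy(h_C, d_C, N * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
    free(h_A); free(h_B);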

22 CUDA SDK [Diagram: integrated CPU and GPU C source code is split by the NVIDIA C compiler into NVIDIA assembly for computing (run on the GPU via the CUDA driver, with debugger and profiler) and CPU host code (built with a standard C compiler). Libraries: FFT, BLAS; example source code.] NVIDIA Corporation 2007

23 Example program

    __global__ void matrixMultiplicationInOneBlock(float *inputA, float *inputB,
                                                   float *output, int size)
    {
        // allocate memory for maximal matrix size
        __shared__ float matrixA[512], matrixB[512];
        float result = 0.;
        const int tx = threadIdx.x, ty = threadIdx.y;
        int position = ty * size + tx;
        matrixA[position] = inputA[position];  // each thread loads one element
        matrixB[position] = inputB[position];  // of each input matrix
        __syncthreads();
        for (int i = 0; i < size; i++)
            result += matrixA[ty*size+i] * matrixB[i*size+tx];
        output[position] = result;
    }

GPU Programming with CUDA Jan Lemeire
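A hedged launch sketch for this kernel: one block of size x size threads computes the whole product, so size*size must fit in the 512-element shared arrays (and in one block). The d_ names are hypothetical device buffers:

    dim3 block(size, size);   // one thread per output element
    matrixMultiplicationInOneBlock<<<1, block>>>(d_inputA, d_inputB, d_output, size);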

24 Overview 1. CUDA-enabled GPU architecture 2. Programming for GPUs 3. How a CUDA program runs 4. Optimizing CUDA programs 5. Analysis & Conclusions GPU Programming with CUDA Jan Lemeire

25 Threads: grouped in blocks & warps A block of threads is executed on the same multiprocessor; its threads use the same shared memory (16KB) and can be synchronized. A block is divided into warps, which are run together. One multiprocessor can run 4 thread blocks in parallel. Warp size is 32: 32 threads are executed in a SIMD fashion on the 8 cores of the multiprocessor, to keep the deep pipelines of the FPUs full. It takes 4 cycles for a memory or arithmetic operation. Use of a 32-bit ActiveMask: a bit for every running thread in a warp. GPU Programming with CUDA Jan Lemeire

26 CUDA Scalable Execution Model A hierarchy of threads: threads execute a kernel in blocks, blocks are organized in a grid. Threads within a block cooperate: they share on-chip memory in the PDC and can barrier-synchronize. Blocks within a grid are independent: blocks run to completion in unspecified order; no global sync, no per-block mutex. Guarantees scalable execution! [Diagram: the host launches Kernel 1 on Grid 1 (a 3x2 array of blocks) and Kernel 2 on Grid 2; Block (1, 1) is expanded into a 5x3 array of threads.]
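As a hedged illustration of the hierarchy (the kernel and size names are hypothetical), a 2D problem is typically launched as a 2D grid of 2D blocks:

    dim3 block(16, 16);                                  // 256 threads per block
    dim3 grid((width + 15) / 16, (height + 15) / 16);    // enough blocks to cover the data
    myKernel<<<grid, block>>>(d_data, width, height);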

27 How thread blocks are partitioned Thread blocks are partitioned into warps. Thread IDs within a warp are consecutive and increasing; warp 0 starts with Thread ID 0. For a 2D block: ThreadID = threadIdx.x + blockDim.x * threadIdx.y. Partitioning is always the same, thus you can use this knowledge in control flow (covered next). However, DO NOT rely on any ordering between warps. If there are any dependencies between threads, you must __syncthreads() to get correct results. NVIDIA Corporation 2006
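A small sketch of that mapping inside a kernel (warp size 32):

    int linearId = threadIdx.x + blockDim.x * threadIdx.y;  // flattened thread ID
    int warpId   = linearId / 32;   // consecutive IDs land in the same warp
    int lane     = linearId % 32;   // position within the warp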

28 A quick review device = GPU = set of multiprocessors. Multiprocessor = set of processors & shared memory. Kernel = GPU program. Grid = array of thread blocks that execute a kernel. Thread block = group of SIMD threads that execute a kernel and can communicate via shared memory.

    Memory     Location   Cached   Access       Who
    Local      Off-chip   No       Read/write   One thread
    Shared     On-chip    N/A      Read/write   All threads in a block
    Global     Off-chip   No       Read/write   All threads + host
    Constant   Off-chip   Yes      Read         All threads + host
    Texture    Off-chip   Yes      Read         All threads + host

NVIDIA Corporation 2007

29 Quick terminology review Thread: concurrent code and associated state executed on the CUDA device (in parallel with other threads); the unit of parallelism in CUDA. Note the difference from CPU threads: the creation cost, resource usage, and switching cost of GPU threads are much smaller. Warp: a group of threads executed physically in parallel (SIMD). Half-warp: the first or second half of a warp of threads. Thread Block: a group of threads that are executed together and can share memory on a single multiprocessor. Grid: a group of thread blocks that execute a single CUDA program logically in parallel. NVIDIA Corporation 2006

30 Device Runtime Component: Synchronization Function void __syncthreads(); Synchronizes all threads in a block. Once all threads have reached this point, execution resumes normally. Used to avoid RAW / WAR / WAW hazards when accessing shared or global memory. Allowed in conditional code only if the conditional is uniform across the entire thread block. NVIDIA Corporation 2007
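A minimal sketch of the RAW hazard the barrier prevents (in and out are hypothetical global arrays; one 256-thread block reverses a buffer):

    __shared__ float buf[256];
    buf[threadIdx.x] = in[threadIdx.x];         // every thread writes one element
    __syncthreads();                            // all writes complete before any read
    out[threadIdx.x] = buf[255 - threadIdx.x];  // reads another thread's element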

31 Thread divergences in a SIMD Thread divergence: supported by the hardware! For example:

    if (x < 5) y = 5; else y = -5;

SIMD performs the 3 steps (the test and both branches): y = 5; is only executed on threads for which x < 5; y = -5; is executed on all others. Only when threads in the same warp do the same thing => effective parallelism. Even more general: instruction predication. GPU Programming with CUDA Jan Lemeire

32 Control Flow Instructions Main performance concern with branching is divergence: threads within a single warp take different paths, and the different execution paths must be serialized. Avoid divergence when the branch condition is a function of the thread ID. Example with divergence: if (threadIdx.x > 2) { } (branch granularity < warp size). Example without divergence: if (threadIdx.x / WARP_SIZE > 2) { } (branch granularity is a whole multiple of warp size). NVIDIA Corporation 2006

33 Instruction Predication Comparison instructions set condition codes (CC). Instructions can be predicated to write results only when CC meets a criterion (CC != 0, CC >= 0, etc.). The compiler tries to predict if a branch condition is likely to produce many divergent warps: if guaranteed not to diverge, it only predicates if < 4 instructions; if not guaranteed, it only predicates if < 7 instructions. It may replace branches with instruction predication. ALL predicated instructions take execution cycles; those with false conditions don't write their output or invoke memory loads and stores. This saves branch instructions, so it can be cheaper than serializing divergent paths. NVIDIA Corporation 2006

34 Memory Instruction Latency Memory instructions take 4 cycles per warp to issue: global and local memory loads / stores (not cached), constant and texture loads (cached), shared memory reads / writes. Example:

    __shared__ float shared[];
    __device__ float global[];
    shared[threadIdx.x] = global[threadIdx.x];

4 cycles to issue the read from global (device) memory, 4 cycles to issue the write to shared memory, but 400-600 cycles to actually read a float from global (device) memory. This can be hidden by scheduling independent math instructions or even other loads / stores if there are enough active threads. NVIDIA Corporation 2006

35 Arithmetic Instruction Latency int and float add, shift, min, max and float mul, mad: 4 cycles per warp. int multiply (*) is by default 32-bit and requires multiple cycles per warp; use the __mul24() / __umul24() intrinsics for 4-cycle 24-bit int multiply. Integer divide and modulo are more expensive: the compiler will convert literal power-of-2 divides to shifts, but we have seen it miss some cases. Be explicit in cases where the compiler can't tell that the divisor is a power of 2! Useful trick: foo % n == foo & (n-1) if n is a power of 2. NVIDIA Corporation 2006
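A quick sketch of that trick (256 is chosen arbitrarily; any power of 2 works):

    int r1 = foo % 256;         // may compile to an expensive modulo
    int r2 = foo & (256 - 1);   // same result, a single AND instruction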

36 Arithmetic Instruction Latency Reciprocal, reciprocal square root, sin/cos, log, exp: 16 cycles per warp. These are the versions prefixed with __, for example __rcp(), __sin(), __exp(). Other functions are combinations of the above: y / x == __rcp(x) * y takes 20 cycles per warp; sqrt(x) == __rcp(__rsqrt(x)) takes 32 cycles per warp. NVIDIA Corporation 2006

37 Latency Hiding for Memory Accesses During global to shared memory copying and during shared memory reads: keep the multiprocessors busy with a huge number of threads. One multiprocessor can simultaneously execute multiple thread blocks of at most 512 threads each; this is limited by the amount of shared and register memory needed by each thread. Note: the GPU communicates with the CPU via the relatively slow PCI Express bus (500 Mb/s). GPU Programming with CUDA Jan Lemeire

38 Overview 1. CUDA-enabled GPU architecture 2. Programming for GPUs 3. How a CUDA program runs 4. Optimizing CUDA programs 5. Analysis & Conclusions GPU Programming with CUDA Jan Lemeire

39 Optimizing CUDA Mark Harris AstroGPU 2007

40 CUDA is fast and efficient CUDA enables efficient use of the massive parallelism of NVIDIA GPUs: direct execution of data-parallel programs without the overhead of a graphics API. Using CUDA on Tesla GPUs can provide large speedups on data-parallel computations straight out of the box! Even higher speedups are achievable by understanding and tuning for the GPU architecture. This presentation covers general performance, common pitfalls, and useful strategies.

41 CUDA Optimization Strategies Optimize Algorithms for the GPU Optimize Memory Access Coherence Take Advantage of On-Chip Shared Memory Use Parallelism Efficiently

42 Optimize Algorithms for the GPU Maximize independent parallelism. Maximize arithmetic intensity (math/bandwidth). Sometimes it's better to recompute than to cache: the GPU spends its transistors on ALUs, not memory. Do more computation on the GPU to avoid costly data transfers: even low-parallelism computations can sometimes be faster than transferring back and forth to the host.

43 Optimize Memory Coherence Coalesced vs. non-coalesced = an order of magnitude difference for global/local device memory. Optimize for spatial locality in cached texture memory. In shared memory, avoid high-degree bank conflicts.
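A small sketch of the difference (input is a hypothetical global array of floats):

    float a = input[threadIdx.x];       // coalesced: consecutive threads read consecutive addresses
    float b = input[threadIdx.x * 2];   // non-coalesced (stride 2): can be an order of magnitude slower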

44 Coalesced Access: Reading floats [Diagram: threads t0..t15 reading consecutive aligned addresses. The access is coalesced both when all threads participate and when some threads do not participate.]

45 Uncoalesced Access: Reading floats [Diagram: threads t0..t15 with permuted access by threads, and with a misaligned starting address (not a multiple of 64); both patterns are uncoalesced.]

46 Take Advantage of Shared Memory Hundreds of times faster than global memory. Threads can cooperate via shared memory. Use one / a few threads to load / compute data shared by all threads. Use it to avoid non-coalesced access: stage loads and stores in shared memory to re-order non-coalesceable addressing (matrix transpose example below).
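A hedged sketch of that staging idea for a matrix transpose (assuming n is a multiple of 16; the kernel name and the 16x16 tile size are illustrative):

    __global__ void transpose(float *in, float *out, int n)
    {
        __shared__ float tile[16][17];              // padded to 17 columns to avoid bank conflicts
        int x = blockIdx.x * 16 + threadIdx.x;
        int y = blockIdx.y * 16 + threadIdx.y;
        tile[threadIdx.y][threadIdx.x] = in[y * n + x];   // coalesced read
        __syncthreads();
        x = blockIdx.y * 16 + threadIdx.x;                // swap block coordinates
        y = blockIdx.x * 16 + threadIdx.y;
        out[y * n + x] = tile[threadIdx.x][threadIdx.y];  // coalesced write
    }

Both the read and the write touch consecutive addresses per warp; the re-ordering happens inside the fast on-chip tile.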

47 Use Parallelism Efficiently Partition your computation to keep the GPU multiprocessors equally busy: many threads, many thread blocks. Keep resource usage (registers, shared memory) low enough to support multiple active thread blocks per multiprocessor.

48 Optimizing threads per block Choose threads per block as a multiple of warp size: avoid wasting computation on under-populated warps. More threads per block == better memory latency hiding, but more threads per block == fewer registers per thread; kernel invocations can fail if too many registers are used. Heuristics: minimum 64 threads per block, and only if there are multiple concurrent blocks; 128 to 256 threads is a better choice, usually still enough registers to compile and invoke successfully. This all depends on your computation, so experiment!

49 Occupancy Thread instructions are executed sequentially, so executing other warps is the only way to hide latencies and keep the hardware busy. Occupancy = the number of warps running concurrently on a multiprocessor divided by the maximum number of warps that can run concurrently. Minimize occupancy requirements by minimizing latency. Maximize occupancy by optimizing threads per multiprocessor.
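A worked sketch of that ratio, assuming G80-era limits (24 warps, i.e. 768 threads, per multiprocessor; the block count is illustrative):

    int warpsPerBlock = 256 / 32;                              // 8 warps for a 256-thread block
    int blocksPerSM   = 3;                                     // set by register/shared-memory use
    float occupancy   = (warpsPerBlock * blocksPerSM) / 24.0f; // = 1.0, i.e. 100%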

50 Parameterize Your Application Parameterization helps adaptation to different GPUs. GPUs vary in many ways: # of multiprocessors, memory bandwidth, shared memory size, register file size, threads per block. You can even make apps self-tuning (like FFTW and ATLAS): an experiment mode discovers and saves the optimal configuration.

51 Wavefront algorithm About wavefront parallelism: see exercises. A 512x512 image divided into 8x8 blocks => 64 x 64 blocks. On a GTX280: 240 cores => 30 multiprocessors. Conclusion: keep all multiprocessors busy. GPU Programming with CUDA Jan Lemeire

52 Parallel Memory Architecture In a parallel machine, many threads access memory; therefore, memory is divided into banks. Essential to achieve high bandwidth. Each bank can service one address per cycle, so a memory can service as many simultaneous accesses as it has banks. Multiple simultaneous accesses to a bank result in a bank conflict; conflicting accesses are serialized. [Diagram: banks 0 through 15.]

53 Shared memory bank conflicts Shared memory is as fast as registers if there are no bank conflicts. The fast case: if all threads of a half-warp access different banks, there is no bank conflict; if all threads of a half-warp read the identical address, there is no bank conflict (broadcast). The slow case: a bank conflict occurs when multiple threads in the same half-warp access the same bank, and the accesses must be serialized. Cost = max # of simultaneous accesses to a single bank.
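A short sketch of how the access stride determines conflicts (16 banks of 32-bit words):

    __shared__ float data[256];
    float a = data[threadIdx.x];        // stride 1: each thread hits a different bank
    float b = data[threadIdx.x * 2];    // stride 2: 2-way bank conflict
    float c = data[threadIdx.x * 8];    // stride 8: 8-way bank conflict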

54 Bank Addressing Examples No bank conflicts: linear addressing, stride == 1 (thread i -> bank i). No bank conflicts: random 1:1 permutation. [Diagram: threads 0-15 each mapped to a distinct bank 0-15.]

55 Bank Addressing Examples 2-way bank conflicts: linear addressing, stride == 2 (two threads per bank). 8-way bank conflicts: linear addressing, stride == 8 (eight threads per bank). [Diagram: threads 0-15 mapped onto a subset of the banks.]

56 Unrolling Last Steps Only one warp is active during the last few steps. Unroll them and remove unneeded __syncthreads():

    for (unsigned int s = bd/2; s > 32; s >>= 1)
    {
        if (t < s)
            data[t] += data[t + s];
        __syncthreads();
    }
    if (t < 32) data[t] += data[t + 32];
    if (t < 16) data[t] += data[t + 16];
    if (t < 8)  data[t] += data[t + 8];
    if (t < 4)  data[t] += data[t + 4];
    if (t < 2)  data[t] += data[t + 2];
    if (t < 1)  data[t] += data[t + 1];

This works because the final 32 threads form a single warp and execute in lockstep, so no barrier is needed between the unrolled steps.

57 CUDA Optimization Priorities Memory coalescing is the #1 priority: the highest bang-for-the-buck optimization. Optimize for locality. Take advantage of shared memory: very high bandwidth, and threads can cooperate to save work. Use parallelism efficiently: keep the GPU busy at all times; high arithmetic / bandwidth ratio; many threads & thread blocks. Leave bank conflicts for last! 4-way and smaller conflicts are not usually worth avoiding if avoiding them will cost more instructions. NVIDIA Corporation 2006

58 Overview 1. CUDA-enabled GPU architecture 2. Programming for GPUs 3. How a CUDA program runs 4. Optimizing CUDA programs 5. Analysis & Conclusions GPU Programming with CUDA Jan Lemeire

59 Strategy Light-weight threads, supported by the hardware: up to 96 threads per thread processor, and a context switch can happen in 1 cycle! No caching mechanism, no branch prediction: the GPU does not try to be efficient for every program and does not spend transistors on optimization. Simple, straightforward sequential programming should be abandoned. Less high-level memory: the GPU has 16KB shared memory per SIMD multiprocessor, while a CPU's L2 cache contains several MBs. Massive floating-point computation power. Transparent system organization, whereas modern (sequential) CPUs are based on the simple Von Neumann architecture. GPU Programming with CUDA Jan Lemeire

60 Strategy II (Link 1: white paper) Don't write explicitly threaded code: the compiler handles it => no chance of deadlocks or race conditions. Think differently: analyze the data instead of the algorithm. This is in contrast with modern superscalar CPUs, where the programmer writes sequential (single-threaded) code and the processor tries to execute it in parallel through pipelining etc. (instruction parallelism). But because of data and resource dependencies, no more speedup can be reached with > 4-way superscalar CPUs; 1.5 instructions per cycle seems to be the maximum. GPU Programming with CUDA Jan Lemeire

61 Results Performance doubling every 6 months! 1000s of threads possible! High bandwidth; the PCI Express bus (the GPU-CPU connection) is the bottleneck. Enormous possibilities for latency hiding. Matrix multiplication is 13 times faster on a standard GPU (GeForce 8500GT) compared to a state-of-the-art CPU (Intel Dual Core); 200 times faster on a high-end GPU, 50 times compared to a quad-core. Low threshold: C, good documentation, many examples, easy to install, automatic card detection, easy compilation. GPU Programming with CUDA Jan Lemeire

62 How to get maximal performance, or call it... limitations Create many threads, make them aggressively parallel. Keep threads busy within a warp. Align memory reads. Global memory <> shared memory: use shared memory. Limited memory per thread. Close to the hardware architecture: the hardware is made for exploiting data parallelism. GPU Programming with CUDA Jan Lemeire

63 When to use CUDA? Special, computationally intensive programs. Keep it simple. GPU Programming with CUDA Jan Lemeire

64 Disadvantages Maintenance: CUDA = NVIDIA-only. Alternatives: OpenCL, a standard language for writing code for GPUs and multicores, supported by ATI, NVIDIA, Apple; RapidMind's Multicore Development platform, which supports multiple architectures and makes you less dependent on one vendor. AMD, IBM, Intel, Microsoft and others are working on standard parallel-processing extensions to C/C++. Larrabee: combining the processing power of GPUs with the programmability of x86 processors. Links in the Scientific Study section. CUDA promises an abstract, scalable hardware model, but is it true? (Link 1: white paper) GPU Programming with CUDA Jan Lemeire

65 Heterogeneous Chip Designs Augment a standard CPU with attached processors performing the compute-intensive portions: Graphics Processing Units (GPUs); Field Programmable Gate Arrays (FPGAs); Cell processors, designed for video games. Parallel Systems: Introduction Jan Lemeire

66 Cell processor 8 Synergistic Processing Elements (SPEs). 128-bit wide data paths for vector instructions. 256K on-chip RAM. No memory coherence: performance and simplicity. Programmers should carefully manage data movement. Parallel Systems: Introduction Jan Lemeire
