Massively Parallel Architectures

1 Massively Parallel Architectures: A Take on Cell Processor and GPU Programming. Joel Falcou - LRI, joel.falcou@lri.fr, Bat Bureau, January 2009

3 Motivation. Harder, Better, Faster, Stronger (famous tune): scientific computing demands a great deal of computing power; faster computation = more results now; biology and health care; oil and finance; video game industry. The Silent Revolution: computing power, 400 GFLOPS vs 32 GFLOPS; memory bandwidth, GB/s vs 10 GB/s; GPUs are in everyday PCs; the Cell went from server blades to the game industry (PS3).

5 Motivation. When Video Games Ruled the World: Game design has become ever more sophisticated. Fast GPUs lead to complex shaders for real-time effects. In turn, the demand for speed has led to ever-increasing innovation in card design. The gaming industry has overtaken the defense, finance, oil and healthcare industries as the main driving factor for high-performance processors. The NV40 architecture has 225 million transistors, compared to about 175 million for the Pentium 4 EE 3.2 GHz chip.

6 Motivation

8 Objectives. Theory: hardware architecture of the GPU and the Cell processor; pros and cons of those architectures. ... and Practice: tools and languages; sample code.

10 Motivation. Less is More: general-purpose CPUs keep growing in complexity; peak performance gains are slowing down; build more with less complex processing units. The CELL Processor: heterogeneous multi-core; DSP-like coprocessors; high memory bandwidth (200 GB/s).

11 Where to find it???

12 The CELL Processor. Structure: 1 PowerPC Processor Element (PPE); 8 Synergistic Processor Elements (SPEs); 1 XDR DRAM interface; 1 four-way DMA bus. Sources of parallelism: thread-level parallelism (TLP) on the PPE; TLP across the SPEs; instruction-level parallelism (ILP) inside each SPE.

13 The CELL Processor

15 Available Tools. ... that work: GCC/G++ for the Cell; GFORTRAN for the Cell; both use a dual-source compilation process. ... that don't work: OpenMP (bad scaling, huge executables); task-based MPI (huge latency, low bandwidth).

17 Separate development. Specificities of the PPE: all the features of a PPC core; supports up to two threads; full-fledged AltiVec SIMD extension. Specificities of the SPEs: specialized AltiVec SIMD extension; no scalar ALU; no cache and no branch predictor.

19 Memory and Communications. Communicating between PPE and SPEs: each SPE local store (LS) is mapped into the PPE's virtual address space; PPE and SPE code share the same process space; SPE code must be loaded into the SPEs when the application starts. Handling the SPE local store: the LS holds only 256 KB for code and data; SPE memories aren't shared; explicit data transfer primitives are needed.
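The 256 KB limit means a large array has to be streamed through the local store in pieces. The following is a minimal C++ sketch of that planning arithmetic, not Cell SDK code: `plan_chunks`, `ChunkPlan` and the reserved-bytes parameter are hypothetical names introduced for illustration, and the 16-byte rounding follows the DMA size rule given on the DMA Transfers slide.

```cpp
#include <cassert>
#include <cstddef>

// From the slide: one SPE local store holds 256 KB for code and data together.
constexpr std::size_t kLocalStoreBytes = 256 * 1024;

struct ChunkPlan {
    std::size_t chunk_bytes;  // size of one data buffer in the local store
    std::size_t chunk_count;  // number of chunks needed to stream the whole array
};

// Hypothetical helper: split `total_bytes` of main-memory data into chunks that
// fit in the local store next to `reserved_bytes` of code and stack, using
// `buffers` data buffers (2 for double buffering).
inline ChunkPlan plan_chunks(std::size_t total_bytes,
                             std::size_t reserved_bytes,
                             std::size_t buffers) {
    assert(buffers > 0 && reserved_bytes < kLocalStoreBytes);
    std::size_t per_buffer = (kLocalStoreBytes - reserved_bytes) / buffers;
    per_buffer -= per_buffer % 16;  // keep each chunk a multiple of 16 bytes
    assert(per_buffer > 0);
    std::size_t count = (total_bytes + per_buffer - 1) / per_buffer;  // ceiling
    return {per_buffer, count};
}
```

For instance, with 64 KB assumed for code and stack and two buffers, `plan_chunks(1 << 20, 64 * 1024, 2)` leaves 96 KB per buffer, so a 1 MB array is streamed in 11 chunks.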

21 Memory and Communications. Mailbox: transfers small data (32 bits) between an SPE and the PPE; two mailboxes per SPE (in and out); two modes, waiting or polling; useful for simple synchronization (thread pool pattern); primitives: spe_in_mbox_write and spe_out_mbox_read. Signal: transfers small data (32 bits) between SPEs; two general-purpose signal slots per SPE; useful for emulating message passing on top of DMA transfers; primitives: mfc_sndsig and spu_read_signal1/spu_read_signal2.
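To make the two mailbox modes concrete, here is a small CPU-only emulation in C++. It is a sketch under stated assumptions, not the libspe2 API: the `Mailbox` class, its method names and the single-slot depth are all hypothetical, and it is only safe for one writer and one reader.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for one hardware mailbox carrying 32-bit messages.
// Assumes a single producer and a single consumer, and a depth of one slot.
class Mailbox {
public:
    // Polling write: returns false when the previous message is still unread.
    bool try_write(std::uint32_t value) {
        if (full_.load(std::memory_order_acquire)) return false;
        value_ = value;
        full_.store(true, std::memory_order_release);
        return true;
    }

    // Polling read: returns false immediately when no message is pending.
    bool try_read(std::uint32_t& out) {
        if (!full_.load(std::memory_order_acquire)) return false;
        out = value_;
        full_.store(false, std::memory_order_release);
        return true;
    }

    // Waiting read: spins until a message arrives, then returns it.
    std::uint32_t read() {
        std::uint32_t value = 0;
        while (!try_read(value)) {
            // busy-wait; a real implementation would block or yield here
        }
        return value;
    }

private:
    std::atomic<bool> full_{false};
    std::uint32_t value_ = 0;
};
```

In a thread pool arrangement, the controlling side would write a 32-bit command word into each worker's inbound mailbox and wait on the worker's outbound mailbox for a completion code.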

23 DMA Transfers. Principles: keep the SPU from blocking during memory transfers; used to load SPE code into the SPE LS; up to 4 transfers can run in parallel over the SPE bus; up to one upload and one download in parallel over the PPE bus; primitives: mfc_get, mfc_put and mfc_read_tag_status_all. Traps and Pitfalls: data to send or receive must be aligned on a 128-bit boundary; data size should be 1, 2, 4, 8 or any multiple of 16 bytes; the number of DMA channels is limited; double buffering must be considered.
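The size and alignment rules above are easy to check before a transfer is issued. The helpers below are a plain C++ sketch of those two rules as the slide states them; `dma_size_ok`, `dma_aligned` and `dma_padded_size` are hypothetical names and are not part of the Cell SDK.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Size rule from the slide: 1, 2, 4, 8 or any non-zero multiple of 16 bytes.
inline bool dma_size_ok(std::size_t bytes) {
    return bytes == 1 || bytes == 2 || bytes == 4 || bytes == 8 ||
           (bytes != 0 && bytes % 16 == 0);
}

// Alignment rule from the slide: the address sits on a 128-bit (16-byte) boundary.
inline bool dma_aligned(const void* address) {
    return reinterpret_cast<std::uintptr_t>(address) % 16 == 0;
}

// Round a payload size up to the next multiple of 16 bytes so it can be sent
// as one legal transfer (the buffer must then be allocated with the padding).
inline std::size_t dma_padded_size(std::size_t bytes) {
    return (bytes + 15) / 16 * 16;
}
```

A 100-byte record, for example, fails `dma_size_ok` but can be transferred as 112 bytes if its buffer is declared with `alignas(16)` and padded accordingly.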

25 Motivation. GPU beyond 3D graphics: data-parallel algorithms leverage GPU attributes; large data arrays, streaming throughput; fine-grain SIMD parallelism; low-latency floating-point (FP) computation. Back in the days of OpenGL GPGPU: limited texture size and dimension; limited outputs; lack of integer and bitwise operators; limited communications.

27 The NVIDIA Products. GeForce series: separate HW interface; works as an external MPM. Tesla machines: 8-series GPUs; 200 GFLOPS; stand-alone or 1U rackable unit.

29 Inside a GPU. Hierarchical memory: global memory; shared memory; local memory. Processors: high-density SMP; support 4-way SIMD.

31 Global View. Kernels: a GPGPU application is made of CPU computation and GPU kernels. Grids and Blocks: kernel = grid of thread blocks; all threads share the data memory space; a thread block is a batch of threads that can cooperate.
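What "a batch of threads that can cooperate" means in practice can be sketched on the CPU. The C++ function below is a serial emulation, not GPU code: each emulated block loads its slice of the input into a block-private buffer (standing in for shared memory) and reduces it to one partial sum. The name `block_partial_sums` is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Serial CPU emulation of a grid of blocks: every block reduces its own slice
// of `data` to a single partial sum that the host can combine afterwards.
inline std::vector<float> block_partial_sums(const std::vector<float>& data,
                                             std::size_t threads_per_block) {
    assert(threads_per_block > 0);
    const std::size_t blocks =
        (data.size() + threads_per_block - 1) / threads_per_block;
    std::vector<float> partial(blocks, 0.0f);
    for (std::size_t block = 0; block < blocks; ++block) {
        // Stand-in for per-block shared memory: visible to this block only.
        std::vector<float> shared(threads_per_block, 0.0f);
        for (std::size_t thread = 0; thread < threads_per_block; ++thread) {
            const std::size_t i = block * threads_per_block + thread;
            if (i < data.size()) shared[thread] = data[i];  // one load per thread
        }
        // On a GPU, a barrier would separate the load phase from the reduction.
        for (std::size_t thread = 0; thread < threads_per_block; ++thread)
            partial[block] += shared[thread];
    }
    return partial;
}
```

Threads of different blocks never touch each other's buffer here, which mirrors the rule that cooperation is limited to threads of the same block.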

33 Block and Thread IDs. Threads and blocks have IDs: each thread decides which data to process; block ID: 1D or 2D; thread ID: 1D, 2D, or 3D. Memory Access: depends on the domain; image: 2D; physics: 3D.
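For a 2D image, the usual way a thread picks its data is to combine its block ID, the block size and its thread ID into a pixel coordinate. The C++ sketch below replays that arithmetic serially on the CPU and counts how often each pixel is selected; `Dim2`, `global_index` and `coverage` are hypothetical names, and the bounds guard is an assumption for images whose size is not a multiple of the block size.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct Dim2 { unsigned x, y; };

// Same arithmetic a kernel would do with its block ID, block size and thread ID.
inline unsigned global_index(unsigned block_id, unsigned block_dim,
                             unsigned thread_id) {
    return block_id * block_dim + thread_id;
}

// Visit every (block, thread) pair serially and count the hits per pixel.
inline std::vector<int> coverage(Dim2 grid, Dim2 block,
                                 unsigned width, unsigned height) {
    std::vector<int> hits(static_cast<std::size_t>(width) * height, 0);
    for (unsigned by = 0; by < grid.y; ++by)
        for (unsigned bx = 0; bx < grid.x; ++bx)
            for (unsigned ty = 0; ty < block.y; ++ty)
                for (unsigned tx = 0; tx < block.x; ++tx) {
                    const unsigned x = global_index(bx, block.x, tx);
                    const unsigned y = global_index(by, block.y, ty);
                    if (x < width && y < height)  // guard for partial blocks
                        ++hits[static_cast<std::size_t>(y) * width + x];
                }
    return hits;
}
```

With a 4 x 4 grid of 16 x 16 blocks over a 64 x 64 image, every pixel is selected by exactly one thread.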

35 Memory Access Patterns. Each thread can: R/W per-thread registers; R/W per-thread local memory; R/W per-block shared memory; R/W per-grid global memory; read only per-grid constant memory. The host can: R/W constant memory; R/W texture memory; R/W global memory.

37 Global, Constant, and Texture Memories. Global Memory: main means of communicating between host and device; contents visible to all threads. Texture and Constant: constants initialized by the host; contents visible to all threads.

38 CUDA Processing Flow

41 Copy Processing Data. Create data on Host: cudaMallocHost() allocates memory on the host; cudaMalloc() allocates memory in the device global memory. Copy to Device: cudaMemcpy() copies memory between host and device; asynchronous since CUDA 1.1; works 4-way: (host, device) x (host, device). Example:

    float *host, *device;
    // Allocate a 64x64 float matrix on the host and in device global memory
    cudaMallocHost((void**)&host, sizeof(float)*64*64);
    cudaMalloc((void**)&device, sizeof(float)*64*64);
    // Destination comes first, then source: copy from host to device
    cudaMemcpy(device, host, sizeof(float)*64*64, cudaMemcpyHostToDevice);

44 Instruct the Processing. Define the device mapping: CUDA provides built-in types for dimensions; define a block grid; define a thread grid. Run the kernel: CUDA provides a syntax extension for calling a given function over a given grid. Example:

    dim3 dimBlock(16, 16);                             // threads per block
    dim3 dimGrid(64 / dimBlock.x, 64 / dimBlock.y);    // blocks per grid
    device_kernel<<<dimGrid, dimBlock>>>(device, 64);  // pass the device pointer, not the host one
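The grid size in the example is computed with a plain integer division, which only covers the data when the problem size is a multiple of the block size. A common remedy, shown here as a small C++ sketch with the hypothetical helper `blocks_for`, is to round the division up; a kernel launched this way is then assumed to check its indices against the real size.

```cpp
#include <cassert>

// Number of blocks needed to cover `n` elements with `block` threads per block,
// rounding up so that a trailing partial block is still launched.
inline unsigned blocks_for(unsigned n, unsigned block) {
    assert(block > 0);
    return (n + block - 1) / block;
}
```

For the 64 x 64 case of the slide this gives the same 4 blocks per dimension, while a 65-element dimension would get 5 blocks instead of 4.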

45 Build a Parallel kernel. kernel.cu:

    #define BLOCK_SIZE 16  // assumed value, matching the 16x16 thread blocks of the launch

    __global__ void device_kernel(float* data, size_t size)
    {
        // Block index
        int bx = blockIdx.x;
        int by = blockIdx.y;

        // Thread index
        int tx = threadIdx.x;
        int ty = threadIdx.y;

        // Index of the first sub-matrix of A processed by the block
        int begin = size * BLOCK_SIZE * by;
        // Index of the last sub-matrix of A processed by the block
        int end = begin + size - 1;
        // Step size used to iterate through the sub-matrices of A
        int step = BLOCK_SIZE;

        // Placeholder body: each visited element is copied onto itself
        for (int a = begin; a <= end; a += step)
            data[a + size * ty + tx] = data[a + size * ty + tx];
    }
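One way to sanity-check index arithmetic like this without a GPU is to replay it serially on the CPU. The C++ sketch below does that for the kernel's indexing, assuming a square `size` x `size` matrix, a block size that divides `size`, and the 2D grid from the launch example; `emulate_touches` is a hypothetical helper, and `at()` is used so that any out-of-range index would throw.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Serial replay of the kernel's indexing: returns, for each matrix element,
// how many (block, thread, loop step) combinations access it.
inline std::vector<int> emulate_touches(int size, int block_size) {
    assert(block_size > 0 && size % block_size == 0);
    std::vector<int> touches(static_cast<std::size_t>(size) * size, 0);
    const int grid = size / block_size;  // blocks per grid dimension
    for (int by = 0; by < grid; ++by)
        for (int bx = 0; bx < grid; ++bx)
            for (int ty = 0; ty < block_size; ++ty)
                for (int tx = 0; tx < block_size; ++tx) {
                    const int begin = size * block_size * by;
                    const int end = begin + size - 1;
                    for (int a = begin; a <= end; a += block_size)
                        ++touches.at(static_cast<std::size_t>(a + size * ty + tx));
                }
    return touches;
}
```

For a 64 x 64 matrix with 16 x 16 blocks, every index stays in range and each element is visited four times: the index never depends on `bx`, so all four blocks of a grid row repeat the same accesses.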

46 Sample Code: see mmul.*

48 As a... Some research topics... High-level tools are needed. Work in progress includes: algorithmic skeletons for the Cell; Bulk Synchronous Parallelism for GPU; architecture-independent algebra library. Some untapped domains: operations research; cryptography and compression; artificial intelligence.
