Massively Parallel Architectures

A Take on Cell Processor and GPU Programming
Joel Falcou - LRI
joel.falcou@lri.fr
Building 490 - Office 104
20 January 2009

Motivation
  Harder, Better, Faster, Stronger (famous tune)
    - Scientific computing demands a great deal of computing power
    - Faster computation = more results now
    - Biology and health care
    - Oil and finance
    - Video game industry
  The Silent Revolution
    - Computing power: 400 GFLOPS vs 32 GFLOPS
    - Memory bandwidth: 100-200 GB/s vs 10 GB/s
    - GPUs are in everyday PCs
    - The Cell went from server blades to the game industry (PS3)

Motivation
  When Video Games Ruled the World
    - Game design has become ever more sophisticated.
    - Fast GPUs lead to complex shaders for real-time effects.
    - In turn, the demand for speed has led to ever-increasing innovation in card design.
    - The gaming industry has overtaken the defense, finance, oil and healthcare industries as the main driving factor for high-performance processors.

Objectives
  Theory!
    - Hardware architecture of the GPU and the Cell processor
    - Pros and cons of those architectures
  ... and Practice
    - Tools and languages
    - Sample code

Motivation
  Less is More
    - General-purpose CPUs keep increasing in complexity
    - Peak performance gains are slowing down
    - Build more with less complex processing units
  The CELL Processor
    - Heterogeneous multi-core
    - DSP-like coprocessors
    - High memory bandwidth (200 GB/s)

Where to find it?

The CELL Processor
  Structure
    - 1 PowerPC Processor Element (PPE)
    - 8 Synergistic Processor Elements (SPEs)
    - 1 XDR DRAM interface
    - 1 4-way DMA bus
  Sources of parallelism
    - TLP over the PPE
    - TLP over the SPEs
    - ILP inside each SPE

Available Tools
  ... that work
    - GCC/G++ for the Cell
    - GFORTRAN for the Cell
    - Both use a dual-source compilation process
  ... that don't work
    - OpenMP: bad scaling, huge executables
    - Task-based MPI: huge latency, low bandwidth

Separate development
  Specificities of the PPE
    - All the features of a PPC core
    - Supports up to two threads
    - Full-fledged AltiVec SIMD extension
  Specificities of the SPEs
    - Specialized AltiVec-like SIMD extension
    - No scalar ALU
    - No cache and no branch predictor

Memory and Communications
  Communicating between the PPE and the SPEs
    - SPE local stores (LS) are mapped into the PPE virtual memory
    - PPE and SPE code share the same process space
    - SPE code must be downloaded when the application starts
  Handling the SPE Local Store
    - The SPE LS is only 256 KB for code + data
    - SPE memories aren't shared
    - Explicit data transfer primitives are needed

Memory and Communications
  Mailbox
    - Transfers small data (32 bits) between an SPE and the PPE
    - Two mailboxes per SPE (in and out)
    - Two modes: waiting or polling
    - Useful for simple synchronization (thread pool pattern)
    - Primitives: spe_in_mbox_write (PPE side) and spu_read_in_mbox (SPU side)
  Signal
    - Transfers small data (32 bits) between SPEs
    - Two signal slots per SPE (general purpose)
    - Useful for message-passing emulation with DMA transfers
    - Primitives: mfc_sndsig (sender) and spu_read_signal1 / spu_read_signal2 (receiver)
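
The polling mode of a mailbox can be pictured with a small host-side model in plain C++. This is only a sketch, not Cell SDK code: `Mailbox`, `try_write`, and `try_read` are hypothetical names, the model holds a single 32-bit word, and it is not thread-safe, whereas real mailboxes are hardware queues.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical one-slot mailbox carrying a single 32-bit word.
struct Mailbox {
    std::uint32_t word = 0;
    bool full = false;

    // Polling write: returns false instead of blocking when the slot is taken.
    bool try_write(std::uint32_t w) {
        if (full) return false;
        word = w;
        full = true;
        return true;
    }

    // Polling read: returns false when there is nothing to read.
    bool try_read(std::uint32_t& out) {
        if (!full) return false;
        out = word;
        full = false;
        return true;
    }
};
```

A waiting-mode variant would loop on `try_read` (or block) until a word arrives, which is the pattern a worker in a thread pool would typically use to receive its next command.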

DMA Transfers
  Principles
    - Keep the SPU from blocking during memory transfers
    - Used to download SPE code into the SPE LS
    - Up to 4 transfers can be done in parallel over the SPE bus
    - Up to one upload and one download in parallel over the PPE bus
    - Primitives: mfc_get, mfc_put and mfc_read_tag_status_all
  Traps and Pitfalls
    - Data to send/receive must be aligned on a 128-bit (16-byte) boundary
    - Data size should be 1, 2, 4, 8 or any multiple of 16 bytes
    - Limited number of DMA channels
    - Double buffering must be considered
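
The size and alignment rules, together with double buffering, can be sketched on the host in plain C++. This is not Cell SDK code: `dma_ok`, `fake_dma_get`, and `sum_double_buffered` are hypothetical names, and `std::memcpy` stands in for `mfc_get`, so the copies here are synchronous. On a real SPE the fetch of the next chunk would overlap with the computation on the current one, and a tag-status wait would precede each use of a buffer.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Rules from the slide: size is 1, 2, 4, 8 or a multiple of 16 bytes, and
// both addresses sit on a 128-bit (16-byte) boundary.
// Simplification: 16-byte alignment is required even for the small sizes.
bool dma_ok(const void* src, const void* dst, std::size_t bytes) {
    bool size_ok = bytes == 1 || bytes == 2 || bytes == 4 || bytes == 8 ||
                   (bytes != 0 && bytes % 16 == 0);
    bool align_ok = reinterpret_cast<std::uintptr_t>(src) % 16 == 0 &&
                    reinterpret_cast<std::uintptr_t>(dst) % 16 == 0;
    return size_ok && align_ok;
}

// Hypothetical stand-in for mfc_get: checks the rules, then copies synchronously.
void fake_dma_get(float* local, const float* remote, std::size_t bytes) {
    assert(dma_ok(remote, local, bytes));
    std::memcpy(local, remote, bytes);
}

constexpr std::size_t CHUNK = 64;  // floats per chunk: 256 bytes, a multiple of 16

// Sums `chunks` chunks of `data` using two local buffers.
// While buffer `cur` is processed, buffer `nxt` receives the next chunk.
float sum_double_buffered(const float* data, std::size_t chunks) {
    alignas(16) float buf[2][CHUNK];
    float total = 0.0f;
    if (chunks == 0) return total;
    fake_dma_get(buf[0], data, CHUNK * sizeof(float));  // prefetch the first chunk
    for (std::size_t c = 0; c < chunks; ++c) {
        std::size_t cur = c % 2, nxt = 1 - cur;
        if (c + 1 < chunks)  // start fetching the next chunk
            fake_dma_get(buf[nxt], data + (c + 1) * CHUNK, CHUNK * sizeof(float));
        for (std::size_t i = 0; i < CHUNK; ++i) total += buf[cur][i];
    }
    return total;
}
```

Two buffers of 256 bytes are a tiny fraction of a local store; in practice the chunk size is a trade-off between transfer efficiency and the space left for code and other data.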

Motivation
  GPU beyond 3D graphics
    - Data-parallel algorithms leverage GPU attributes
    - Large data arrays, streaming throughput
    - Fine-grain SIMD parallelism
    - Low-latency floating point (FP) computation
  Back in the days of OpenGL GPGPU
    - Limited texture size/dimension
    - Limited outputs
    - Lack of integer and bitwise operators
    - Limited communications

The NVIDIA Products
  GeForce series
    - Separate HW interface
    - Works as an external MPM
  Tesla machines
    - 8-series GPUs: 200 GFLOPS
    - Stand-alone or 1U rack-mountable unit

Inside a GPU
  Hierarchical Memory
    - Global memory
    - Shared memory
    - Local memory
  Processors
    - High-density SMP
    - Support 4-way SIMD

Global View
  Kernels
    - A GPGPU application is made of:
      - CPU computation
      - GPU kernels
  Grids and Blocks
    - Kernel = grid of thread blocks
    - All threads share the data memory space
    - A thread block is a batch of threads that can cooperate

Block and Thread IDs
  Threads and blocks have IDs
    - Each thread decides which data to process
    - Block ID: 1D or 2D
    - Thread ID: 1D, 2D, or 3D
  Memory Access
    - Depends on the domain
    - Image: 2D
    - Physics: 3D
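
How a thread picks its data from its block and thread IDs can be sketched on the host for the 2D (image) case. The names `Idx2` and `global_index_2d` are hypothetical; in a real kernel the three index values would come from the built-in `blockIdx`, `threadIdx`, and `blockDim` variables.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical 2D index, mimicking the .x/.y fields of CUDA's built-in IDs.
struct Idx2 { std::size_t x, y; };

// Row-major offset of the element owned by thread `thread` of block `block`,
// for an image of `width` elements per row and blocks of `block_dim` threads.
std::size_t global_index_2d(Idx2 block, Idx2 thread, Idx2 block_dim,
                            std::size_t width) {
    std::size_t col = block.x * block_dim.x + thread.x;
    std::size_t row = block.y * block_dim.y + thread.y;
    assert(col < width);  // the thread must fall inside the row
    return row * width + col;
}
```

For example, with 16x16 blocks over a 64-wide image, thread (0, 0) of block (1, 0) owns offset 16, the first element of the second tile in the top row.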

Memory Access Patterns
  Each thread can
    - R/W per-thread registers
    - R/W per-thread local memory
    - R/W per-block shared memory
    - R/W per-grid global memory
    - Read-only per-grid constant memory

Global, Constant, and Texture Memories
  Global Memory
    - Main means of communicating between host and device
    - Contents visible to all threads
  Texture and Constant
    - Constants initialized by host
    - Contents visible to all threads

CUDA Processing Flow

Copy Processing Data
  Create data on Host
    - cudaMallocHost(): allocate memory on the host
    - cudaMalloc(): allocate memory in the device global memory
  Copy to Device
    - cudaMemcpy(): copy memory between host and device
    - Asynchronous variant (cudaMemcpyAsync()) since CUDA 1.1
    - Works 4-way: (host, device) x (host, device)
  Example
    float *host, *device;
    cudaMallocHost((void**)&host, sizeof(float)*64*64);
    cudaMalloc((void**)&device, sizeof(float)*64*64);
    // Destination comes first, then source
    cudaMemcpy(device, host, sizeof(float)*64*64, cudaMemcpyHostToDevice);

Instruct the Processing
  Define the device mapping
    - CUDA provides built-in types for dimensions
    - Define a block grid
    - Define a thread grid
  Run the kernel
    - CUDA provides a syntax extension for calling a given function over a given grid
  Example
    dim3 dimBlock(16, 16);
    dim3 dimGrid(64 / dimBlock.x, 64 / dimBlock.y);
    // The kernel must receive the device pointer, not the host one
    device_kernel<<<dimGrid, dimBlock>>>(device, 64);
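
The example divides 64 by the block size, which works because 64 is a multiple of 16. When the data size is not a multiple of the block size, the usual approach is to round the block count up. A minimal host-side sketch, where `blocks_for` is a hypothetical helper name:

```cpp
#include <cassert>

// Number of blocks needed to cover n elements with `block` threads per block.
// Rounds up, so a partial last block is still launched.
unsigned blocks_for(unsigned n, unsigned block) {
    assert(block > 0);
    return (n + block - 1) / block;
}
```

With rounding up, some threads of the last block fall outside the data, so the kernel would then need a bounds check before touching memory.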

Build a Parallel kernel
  kernel.cu
    #define BLOCK_SIZE 16  // must match the 16x16 block used at launch

    __global__ void device_kernel(float* data, size_t size)
    {
      // Block index
      int bx = blockIdx.x;
      int by = blockIdx.y;
      // Thread index
      int tx = threadIdx.x;
      int ty = threadIdx.y;
      // Index of the first element of the sub-matrix processed by the block
      int begin = size * BLOCK_SIZE * by + BLOCK_SIZE * bx;
      // Each thread inverts one element of its block's sub-matrix
      data[begin + size * ty + tx] = 255 - data[begin + size * ty + tx];
    }
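
A host-side emulation of a launch is a cheap way to check the indexing of such a kernel without a GPU. The sketch below assumes a one-element-per-thread variant in which thread (tx, ty) of block (bx, by) inverts exactly one element; `kernel_body` and `emulate_launch` are hypothetical names, and the threads run one after another instead of in parallel.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

constexpr int BLOCK_SIZE = 16;  // assumed to match a 16x16 launch configuration

// Body of one thread: inverts the single element it owns.
void kernel_body(float* data, std::size_t size, int bx, int by, int tx, int ty) {
    // First element of the sub-matrix processed by block (bx, by)
    std::size_t begin = size * BLOCK_SIZE * by + BLOCK_SIZE * bx;
    std::size_t i = begin + size * ty + tx;
    data[i] = 255 - data[i];
}

// Hypothetical emulation: runs every thread of every block in turn.
// `size` is the matrix width and must be a multiple of BLOCK_SIZE.
void emulate_launch(std::vector<float>& data, std::size_t size) {
    assert(size % BLOCK_SIZE == 0 && data.size() == size * size);
    int blocks = static_cast<int>(size / BLOCK_SIZE);
    for (int by = 0; by < blocks; ++by)
        for (int bx = 0; bx < blocks; ++bx)
            for (int ty = 0; ty < BLOCK_SIZE; ++ty)
                for (int tx = 0; tx < BLOCK_SIZE; ++tx)
                    kernel_body(data.data(), size, bx, by, tx, ty);
}
```

If an index formula made two threads hit the same element, that element would be inverted twice and return to its original value, which a simple comparison after the emulated launch would reveal.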

Sample Code
  see mmul.*

As a...
  Some research topics...
    - High-level tools are needed. Work in progress includes:
      - Algorithmic skeletons for the Cell
      - Bulk Synchronous Parallelism for GPU
      - Architecture-independent algebra library
  Some untapped domains
    - Operations research
    - Cryptography/Compression
    - Artificial Intelligence