Numerical Algorithms on Multi-GPU Architectures

1 Numerical Algorithms on Multi-GPU Architectures. Dr.-Ing. Harald Köstler. 2nd International Workshops on Advances in Computational Mechanics, Yokohama, Japan

4 Contents. Motivation: Applications; GPU Architectures and Programming Paradigms; Numerical Algorithms on GPU: Multigrid; Software concepts for Multi-GPU; Lattice Boltzmann Method; Future Work

5 APPLICATIONS: System Simulation Group in Erlangen (LSS)

6 MRI Reconstruction on ATI GPUs. ACS: autocalibration signal. Siemens Healthcare, Erlangen

7 FEM multigrid solver on HLRB II at Leibniz-Rechenzentrum Garching: 9728 processor cores (1.6 GHz Intel Itanium2); 62 TFlop/s peak performance; 39 Tbytes of main memory. Compare to 10^9 unknowns in 1 s on one GPU

8 waLBerla: parallel LBM framework for CFD applications

9 Free Surface Flows. Gas phase only modeled by boundary conditions. Examples: rising bubble, metal foam, sinking box, PEM (proton exchange membrane) fuel cell

10 Fluid-structure interaction: particle segregation. 64.3 million objects in an LBM grid with 151 billion lattice cells. Runs on cores of a Blue Gene/P at Jülich Supercomputing Centre

11 GPU ARCHITECTURE AND PROGRAMMING PARADIGMS: Current Trends

12 Nvidia GeForce GTX 295. Costs: 450; Interface: PCI-E 2.0 x16; Shader Clock: 1242 MHz; Memory Clock: 999 MHz; Memory Bandwidth: 2x112 GB/s; FLOPS: 2x894 GFLOPS; Max Power Draw: 289 W; Framebuffer: 2x896 MB; Memory Bus: 2x448 bit; Shader Processors: 2x240

13 ATI Radeon HD 4870. Costs: 150; Interface: PCI-E 2.0 x16; Shader Clock: 750 MHz; Memory Clock: 900 MHz; Memory Bandwidth: 115 GB/s; FLOPS: 1200 GFLOPS; Max Power Draw: 160 W; Framebuffer: 1024 MB; Memory Bus: 256 bit; Shader Processors:

14 HLRS Stuttgart NEC Nehalem Cluster. Peak Performance: 62 TFlops; No. Nodes: 700 dual-socket quad-core; Processor: Intel Xeon (X5560) GHz, 8 MB cache; Memory/Node: 12 GB; Disk: 60 TB; Node-node interconnect: InfiniBand, GigE; Accelerators: 32 nodes provide Nvidia Tesla S

15 Why GPUs? Graphics Processing Units (GPUs) are a massively parallel computer architecture. GPUs offer high computational performance at low cost. GPGPU (general-purpose computing on a GPU): using graphics hardware for non-graphics computations. Programming tools: CUDA or OpenCL. But: without explicitly parallel algorithms, the performance potential cannot be used.

16 CPU vs. GPU. CPUs are great for task parallelism: fast caches; branching adaptability. GPUs are great for data parallelism: multiple ALUs; fast onboard memory; high throughput on parallel tasks; executes program on each fragment. Think of the GPU (device) as a massively threaded coprocessor.

17 Memory Architecture: constant memory, shared memory, texture memory, device memory

18 Data-parallel Programming. What is CUDA? Compute Unified Device Architecture: a software architecture for managing data-parallel programming. Write kernels (functions) that execute on the device and process multiple data elements in parallel. Massive threading; local memory.

19 Paradigm of the CUDA toolkit I. OpenMP: divide the domain into huge chunks; distribute them equally to the threads; parallelize the outermost loop to minimize overhead. (Figure: domain split among Thread 0, Thread 1, Thread 2.)

20 Paradigm of the CUDA toolkit II. CUDA: divide the domain into small pieces. But: the data mapping differs from the en bloc distribution of OpenMP; alignment constraints must be met; no cache and cache lines need to be considered. (Figure: Block 1 and Block 2, each subdivided into threads.)
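
The contrast between the two decompositions can be sketched in plain Python, with no GPU required. This is only an illustration under assumptions: the function names `openmp_static_chunks` and `cuda_global_indices` and the sizes are made up for this sketch, and the CUDA side merely emulates the usual index computation `blockIdx.x * blockDim.x + threadIdx.x`.

```python
def openmp_static_chunks(n, num_threads):
    """Emulate OpenMP-style static scheduling: one large contiguous chunk per thread."""
    base, extra = divmod(n, num_threads)
    chunks, start = [], 0
    for t in range(num_threads):
        size = base + (1 if t < extra else 0)  # the first `extra` threads get one more element
        chunks.append(range(start, start + size))
        start += size
    return chunks


def cuda_global_indices(n, block_dim):
    """Emulate a 1D CUDA launch: many small blocks, one element per thread."""
    num_blocks = (n + block_dim - 1) // block_dim  # round up so all n elements are covered
    mapping = {}
    for block_idx in range(num_blocks):
        for thread_idx in range(block_dim):
            i = block_idx * block_dim + thread_idx
            if i < n:  # guard: threads past the end of the domain do nothing
                mapping[(block_idx, thread_idx)] = i
    return mapping


chunks = openmp_static_chunks(10, 3)   # 3 threads, each with a contiguous chunk
mapping = cuda_global_indices(10, 4)   # 3 blocks of 4 threads, one element each
```

In the OpenMP-style split each thread walks through its own chunk, whereas in the CUDA-style split neighbouring threads touch neighbouring elements, which is why data layout and alignment matter more on the GPU.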

21 OpenCL. OpenCL (Open Computing Language) is an open standard for heterogeneous parallel computing, managed by the non-profit technology consortium Khronos Group. It is analogous to the open industry standard OpenGL for 3D graphics. OpenCL exploits task-based and data-based parallelism. Similar to CUDA, OpenCL includes a language (based on C99) for writing kernels that execute on devices such as GPUs, Cell, or multicore processors. Open question: performance of high-level heterogeneous tools?

22 NUMERICAL ALGORITHMS ON GPU: A Multigrid Solver for Complex Diffusion

23 Multigrid Idea. Multigrid methods are based on two principles: 1. Smoothing property: smooth the error on the fine grid.

24 Multigrid Idea. Multigrid methods are based on two principles: 1. Smoothing property. 2. Coarse grid principle: approximate the smooth error on coarser grids.
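
The two principles can be illustrated with a small V-cycle for the 1D Poisson model problem -u'' = f with zero boundary values. This is a minimal sketch under assumptions, not the solver from the talk: it uses a linear correction scheme rather than FAS, a 1D grid with 2^k + 1 points, and all function names are invented for the example.

```python
import math


def residual(u, f, h):
    """r = f - A u for the 1D Poisson operator A u = (2u_i - u_{i-1} - u_{i+1}) / h^2."""
    r = [0.0] * len(u)
    for i in range(1, len(u) - 1):
        r[i] = f[i] - (2.0 * u[i] - u[i - 1] - u[i + 1]) / (h * h)
    return r


def damped_jacobi(u, f, h, sweeps, omega=2.0 / 3.0):
    """Principle 1: a few damped Jacobi sweeps remove the oscillatory error."""
    for _ in range(sweeps):
        old = u[:]
        for i in range(1, len(u) - 1):
            u[i] = (1.0 - omega) * old[i] + omega * 0.5 * (old[i - 1] + old[i + 1] + h * h * f[i])
    return u


def restrict(r):
    """Full-weighting restriction to the grid with twice the mesh width."""
    nc = (len(r) - 1) // 2 + 1
    rc = [0.0] * nc
    for i in range(1, nc - 1):
        rc[i] = 0.25 * r[2 * i - 1] + 0.5 * r[2 * i] + 0.25 * r[2 * i + 1]
    return rc


def prolong(ec, nf):
    """Linear interpolation of the coarse-grid correction to the fine grid."""
    e = [0.0] * nf
    for i, value in enumerate(ec):
        e[2 * i] = value
    for i in range(1, nf, 2):
        e[i] = 0.5 * (e[i - 1] + e[i + 1])
    return e


def v_cycle(u, f, h, pre=2, post=2):
    """One V(pre, post) cycle; the grid has 2^k + 1 points including the boundary."""
    if len(u) == 3:
        u[1] = 0.5 * (u[0] + u[2] + h * h * f[1])  # single unknown: solve exactly
        return u
    damped_jacobi(u, f, h, pre)
    rc = restrict(residual(u, f, h))
    # Principle 2: the remaining smooth error is approximated on the coarser grid.
    ec = v_cycle([0.0] * len(rc), rc, 2.0 * h, pre, post)
    e = prolong(ec, len(u))
    for i in range(1, len(u) - 1):
        u[i] += e[i]
    damped_jacobi(u, f, h, post)
    return u


def norm(v):
    return math.sqrt(sum(x * x for x in v))


n = 65
h = 1.0 / (n - 1)
x = [i * h for i in range(n)]
f = [math.pi ** 2 * math.sin(math.pi * xi) for xi in x]  # exact solution: sin(pi x)
u = [0.0] * n
history = [norm(residual(u, f, h))]
for _ in range(10):
    v_cycle(u, f, h)
    history.append(norm(residual(u, f, h)))
max_error = max(abs(ui - math.sin(math.pi * xi)) for ui, xi in zip(u, x))
```

The list `history` records the residual norm after each V(2,2) cycle, so the reduction per cycle can be inspected directly.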

25 Noise model. Assumption (white additive Gaussian noise): the relation between an original, unknown image $u : \Omega \subset \mathbb{R}^d \to \mathbb{R}$ and an observed image $u_0$ can be expressed by $u_0 = u + \eta$, where $\eta$ stands for the noise.

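26 Denoising by Diffusion. Idea: use a nonlinear anisotropic diffusion process to denoise the image $u_0$ in the domain $\Omega$, i.e. solve the time-dependent PDE $\frac{\partial u}{\partial t} = \operatorname{div}(g \nabla u)$ in $\Omega \times T$, with $\langle g \nabla u, n \rangle = 0$ on $\partial\Omega \times T$ and $u(x, 0) = u_0(x)$ in $\Omega$.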

27 Complex Diffusion. Complex diffusivity function: $g(u) = \frac{e^{i\theta}}{1 + \left( \frac{\operatorname{Im}(u)}{k\theta} \right)^2}$. The parameter $\theta$ denotes a small angle, $k$ is a soft threshold. The solution of the anisotropic diffusion PDE becomes complex: the real part of the solution is the denoised image, and its imaginary part acts as an edge detector.
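
To see how the imaginary part steers the diffusion, the diffusivity can be evaluated for a few sample values. The sketch below assumes the form g(u) = exp(i*theta) / (1 + (Im(u) / (k*theta))^2); the function name and the values chosen for `theta` and `k` are illustrative only.

```python
import cmath


def complex_diffusivity(u, k, theta):
    """g(u) = exp(i*theta) / (1 + (Im(u) / (k*theta))**2) for a complex value u."""
    return cmath.exp(1j * theta) / (1.0 + (u.imag / (k * theta)) ** 2)


theta = 0.01  # small angle in radians (illustrative value)
k = 2.0       # soft threshold (illustrative value)

g_flat = complex_diffusivity(5.0 + 0.0j, k, theta)            # Im(u) = 0: smooth region
g_half = complex_diffusivity(5.0 + 1j * k * theta, k, theta)  # Im(u) = k*theta
g_edge = complex_diffusivity(5.0 + 1.0j, k, theta)            # large |Im(u)|: near an edge
```

Where the imaginary part vanishes the diffusivity has modulus one, so smoothing is unrestricted; as |Im(u)| grows past the threshold k*theta the modulus drops and diffusion across the edge is suppressed.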

28 Image Denoising Example. Denoising of a test image with added Gaussian noise. (Figure panels: Noisy Image, Denoised Image, Imaginary Part.)

29 Multigrid for Complex Diffusion. Time-dependent, complex, nonlinear diffusion PDE: spatial discretization by finite volumes; semi-implicit time discretization; nonlinear diffusion handled by inexact lagged diffusivity; cell-based FAS (full approximation scheme) multigrid; damped Jacobi smoother; standard transfers.
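
One common way to write a semi-implicit time step with lagged diffusivity is given below. It is a sketch under assumptions and not necessarily the exact scheme of the talk: $\tau$ denotes the time step, $u^{n}$ the solution at time level $n$, and the diffusivity $g$ is evaluated at the previous time level so that each step only requires the solution of a linear system.

```latex
\frac{u^{n+1} - u^{n}}{\tau} = \operatorname{div}\!\left( g(u^{n}) \, \nabla u^{n+1} \right)
\quad\Longleftrightarrow\quad
\left( I - \tau \, \operatorname{div}\, g(u^{n}) \, \nabla \right) u^{n+1} = u^{n}
```

The linear system on the right is the kind of problem a multigrid cycle would be applied to in every time step.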

30 Complex Diffusion on Nvidia GeForce GTX 295. (Figure: runtime of V(2,2) in ms versus image size, up to 4096 x 4096.) Runtime on 2 x C2 Penryn (8 cores @ 2.8 GHz, 8 x 22.4 GFLOPS) is 1100 ms for 4096 x 4096. Speedup factor: 14.

31 Multigrid Multi-GPU Challenges: PCIe bus; advanced multigrid techniques; block smoothers; CCA coarsening; matrix-dependent transfers; treatment of coarser grids; parallel direct solver; hybrid approach CPU + GPU; more coarse grids; shared memory for temporal or spatial blocking.

32 MULTI-GPU: Software concepts

33 Integrating Multi-GPU support into waLBerla. Main design goal: a massively parallel software framework for various engineering applications. Fast kernels, high-level C++ management structures. Software quality factors: usability; reliability; maintainability and expandability; portability; efficiency and scalability. Patch, sweep, and MPI parallelization concepts.

34 Patch concept I

35 Patch concept II

36 Sweep Concept

37 Communication Concept

38 NUMERICAL ALGORITHMS ON GPU: Lattice Boltzmann Method (LBM)

39 Boltzmann equation. Mesoscopic approach to solving the Navier-Stokes equations. The Boltzmann equation describes the statistical distribution of one particle in a fluid: $\frac{\partial f}{\partial t} + \zeta \cdot \nabla f = \Omega(f)$, where $f$ is the probability distribution function (PDF), $\zeta$ the particle velocity, and $\Omega(f)$ the change due to collisions. It models the behavior of fluids in statistical physics.

40 Lattice Boltzmann Method. The lattice Boltzmann method solves the discrete Boltzmann equation. The fluid domain is split into simple cells that are treated independently in each time step. Microscopic numerical fluid solver with simple updating rules. Handles complex boundary conditions for static and moving objects. Inherently parallel scheme, suited to hardware acceleration.
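
The cell-local update rule can be sketched on a tiny one-dimensional lattice. The example below rests on assumptions that are not taken from the slides: it uses a D1Q3 lattice instead of D3Q19, a BGK (single relaxation time) collision operator, periodic boundaries, and invented names such as `lbm_step`.

```python
WEIGHTS = (2.0 / 3.0, 1.0 / 6.0, 1.0 / 6.0)  # D1Q3 lattice weights
VELOCITIES = (0, 1, -1)                      # discrete particle velocities


def equilibrium(rho, u):
    """Second-order equilibrium distribution for lattice sound speed c_s^2 = 1/3."""
    return [w * rho * (1.0 + 3.0 * c * u + 4.5 * (c * u) ** 2 - 1.5 * u * u)
            for w, c in zip(WEIGHTS, VELOCITIES)]


def lbm_step(f, tau):
    """One time step: local BGK collision, then streaming to the neighbour cells."""
    n = len(f)
    new = [[0.0] * 3 for _ in range(n)]
    for x, cell in enumerate(f):  # every cell is updated independently of the others
        rho = sum(cell)
        u = sum(c * fi for c, fi in zip(VELOCITIES, cell)) / rho
        feq = equilibrium(rho, u)
        for q, c in enumerate(VELOCITIES):
            relaxed = cell[q] - (cell[q] - feq[q]) / tau
            new[(x + c) % n][q] = relaxed  # push to the neighbour, periodic domain
    return new


def total_mass(f):
    return sum(sum(cell) for cell in f)


# Density bump on a periodic domain of 32 cells, fluid initially at rest.
cells = 32
f = [equilibrium(1.0 + (0.1 if 12 <= x < 20 else 0.0), 0.0) for x in range(cells)]
mass_before = total_mass(f)
for _ in range(100):
    f = lbm_step(f, tau=0.8)
mass_after = total_mass(f)
```

Because collision only reads the values of one cell and streaming only writes to direct neighbours, every cell can be handled by its own thread, which is what makes the scheme attractive for GPUs.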

41 D3Q19 LBM Cell

42 Distribution of work to threads. Each thread treats one cell. One block treats one stripe along the x-direction with its threads; thus, all cells of a block have the same y and z index.

43 LBM Algorithm (single precision)
    f(0:xmax+1, 0:ymax+1, 0:zmax+1, 0:18, 0:1)
    x = threadIdx.x;      // set i index of current cell
    y = blockIdx.x + 1;   // set j index of current cell
    z = blockIdx.y + 1;   // set k index of current cell
    if ( fluidcell(x,y,z) ) then
        LOAD f(x, y, z, 0:18, t)
        Relaxation (complex computations)
        SAVE f(x,   y,   z,   0,  t+1)
        SAVE f(x+1, y+1, z,   1,  t+1)
        SAVE f(x,   y+1, z,   2,  t+1)
        SAVE f(x-1, y+1, z,   3,  t+1)
        ...
        SAVE f(x,   y-1, z-1, 18, t+1)
    endif
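
The thread-to-cell mapping and the array bounds above can be checked with a small index calculation. The sketch assumes a column-major layout with x as the fastest-varying index, which the Fortran-style array notation suggests but the slide does not state; `pdf_index` and `cell_of_thread` are hypothetical helper names.

```python
def pdf_index(x, y, z, q, t, xmax, ymax, zmax):
    """Flat offset into f(0:xmax+1, 0:ymax+1, 0:zmax+1, 0:18, 0:1), x varying fastest."""
    nx, ny, nz, nq = xmax + 2, ymax + 2, zmax + 2, 19
    return x + nx * (y + ny * (z + nz * (q + nq * t)))


def cell_of_thread(thread_idx_x, block_idx_x, block_idx_y):
    """Thread-to-cell mapping from the slide: x from the thread, y and z from the block."""
    return thread_idx_x, block_idx_x + 1, block_idx_y + 1


xmax = ymax = zmax = 30

# Two neighbouring threads of the same block load the same direction q.
a = pdf_index(*cell_of_thread(4, 0, 0), q=5, t=0, xmax=xmax, ymax=ymax, zmax=zmax)
b = pdf_index(*cell_of_thread(5, 0, 0), q=5, t=0, xmax=xmax, ymax=ymax, zmax=zmax)

# Number of stored values: both time levels, 19 directions, ghost layers included.
total = pdf_index(xmax + 1, ymax + 1, zmax + 1, 18, 1, xmax, ymax, zmax) + 1
```

Under this layout the threads of one block read consecutive memory locations for each direction, while the last index (0 or 1) selects the source or destination copy of the distributions for time levels t and t+1.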

44 Single GPU Node Performance on Tesla C

45 Weak Scaling on Tesla C1070 and HLRS cluster

46 Summary and Future Work. When to use GPGPU: Suitable algorithms? Get the promised performance? Get it at low effort and cost? Future work: heterogeneous CPU-GPU computing; multi-GPU multigrid; OpenCL.

47 Acknowledgements. Nvidia Corporation (developer.nvidia.com/cuda): Technical Brief Architecture Overview, CUDA Programming Guide. AMD/ATI (pdf): ATI Stream SDK User Guide.
