Finite Element Multigrid Solvers for PDE Problems on GPUs and GPU Clusters

Size: px

Start display at page:

Download "Finite Element Multigrid Solvers for PDE Problems on GPUs and GPU Clusters"

Harvey Cameron
5 years ago
Views:

Finite Element Multigrid Solvers for PDE Problems on GPUs and GPU Clusters Robert Strzodka Integrative Scientific Computing Max Planck Institut

1 Finite Element Multigrid Solvers for PDE Problems on GPUs and GPU Clusters Robert Strzodka Integrative Scientific Computing Max Planck Institut Informatik ~strzodka Dominik Göddeke Institute for Applied Mathematics Technical University of Dortmund ~goeddeke

2 Structure of Double Lecture 2 x 90 min PART 1 PART 2 Parallelism Grid Discretization Multigrid & Smoothers Mixed-Precision Data Layout FEM on GPU Clusters New MPI Application (Rewrite) Legacy MPI Code (Accelerate) INRIA Sophia Antipolis Jun

3 Overview Levels of Parallelism Grid Discretizations of PDEs Multigrid and Strong Smoothers Mixed Precision Iterative Refinement Layout of Multi-valued Data INRIA Sophia Antipolis Jun

4 Parallelism Sequential execution is an illusionary software concept All transistors always do something in parallel! Billions of transistors in modern CPUs(>0.5) & GPUs(>2) Old: Implicit parallelism with caches, ILP, speculation diminishing returns, power constraints New: Explicit parallelism on multiple levels much more efficient & natural INRIA Sophia Antipolis Jun

5 SIMD Parallelism instructions Unit 0 Unit 1 Unit 2 Unit 3 Unit 4 Unit 5 Unit 6 Unit 7 Data 0 Data 1 Data 2 Data 3 Data 4 Data 5 Data 6 Data 7 Unit Data Unit 15 Data 15 It is impossible to execute just one instruction a= b + c; Actually means execute (add, nop, nop, nop, ) Penalty for ignoring SIMD 4x on current CPUs (SSE) 8-16x on future CPUs (AVX, LRB) 16-80x on GPUs INRIA Sophia Antipolis Jun 2011

6 Many-Core Parallelism Unit 0 Unit 1 Unit 2 Unit 3 Unit 4 Unit 5 Unit 6 Unit 7 Unit 15 Data 0 Data 1 Data 2 Data 3 Data 4 Data 5 Data 6 Data 7 Data 15 Host Input Assembler Execution Manager Local Memory Local Memory Local Memory Local Memory Local Memory Local Memory Local Memory Local Memory Cache Cache Cache Cache Cache Cache Cache Cache Load/store Penalty Load/store for ignoring Load/store many-cores Load/store Load/store Load/store 4-8x on current CPUs Global Memory 32-48x on future CPUs (LRB) 10-30x on GPUs INRIA Sophia Antipolis Jun

7 Intra-Node Parallelism (multiple CPUs/GPUs per PC) CPU1 CPU1 CPU3 CPU4 Multi-CPU system GPU1 GPU2 GPU3 GPU4 Multi-GPU system Penalty for ignoring intra-node parallelism 4-8x for multi-cpu systems 4-8x for multi-gpu systems INRIA Sophia Antipolis Jun

8 Inter-Node Parallelism in a Cluster PC PC PC Penalty for ignoring inter-node parallelism Depends on the number of nodes in the cluster Capability computing INRIA Sophia Antipolis Jun

9 Bandwidth in a CPU-GPU System CPU socket (4-12 cores) CPU CPU core core CPU CPU core ALU core ALU on-chip on-chip FPUALU on-chip FPU memory GB/s on-chip FPU memory 40 GB/s FPU memory memory 40 GB/s high-end GPU socket ALU ALU FPU FPU 2 TB/s on-chip on-chip memory memory 2 GB/s 20 GB/s 200 GB/s system memory 4 GB/s device memory INRIA Sophia Antipolis Jun

10 Overview Levels of Parallelism Grid Discretizations of PDEs Multigrid and Strong Smoothers Mixed Precision Iterative Refinement Layout of Multi-valued Data INRIA Sophia Antipolis Jun

11 Generalized Poisson Problem We seek a function u( x) : Ω R m, Ω R d which satisfies interior boundary div( G u=f u) f ν u=bn or in Ω u= b D on Ω We consider the scalar (m=1) 2D case (d=2) with operator anisotropies. Given a vector field ( v 1 (x,y), v 2 (x,y) ) we define: INRIA Sophia Antipolis Jun

12 Discretization Grids Equidistant grid topology: implicit geometry: implicit access: direct Generalized tensor-product grid topology: implicit geometry: explicit access: direct INRIA Sophia Antipolis Jun 2011

13 Discretization Grids Adaptive grid topology: explicit geometry: implicit/explicit access: hash, tree or page table Unstructured grid topology: explicit geometry: explicit access: index array INRIA Sophia Antipolis Jun 2011

14 nd Arrays Generalized tensor-product grid Pros Cons topology: implicit geometry: implicit/explicit access: direct simple implementation implicit topology saves precious global bandwidth allows efficient on-chip stencil operations good SIMD utilization topology constraints make object modeling more difficult element form may require a more powerful solver INRIA Sophia Antipolis Jun 2011

15 Deformation Adaptivity This grid is a tensor-product! Easier to accelerate in hardware than resolution adaptive grids Anisotropy level determines optimal solver INRIA Sophia Antipolis Jun 2011

16 nd Arrays Adaptive/Unstructured grid Pros Cons topology: explicit geometry: implicit/explicit access: indirect general scheme for arbitrary node arrangements only one indirection for local data access clever node numberings preserve some data locality expensive encoding of topology/connectivity no global view of the structure difficult to handle dynamic changes in parallel INRIA Sophia Antipolis Jun 2011

17 Hash Adaptive/Unstructured grid Pros Cons topology: explicit geometry: implicit/explicit access: indirect general scheme for arbitrary node arrangements only one indirection for arbitrary global data access perfect hashes have no collisions, thus good SIMD use expensive encoding of topology/connectivity hashes tend to produce fine-grained random data access perfect hash generation too expensive for dynamic changes INRIA Sophia Antipolis Jun 2011

18 Tree Adaptive grid Pros Cons topology: explicit geometry: implicit/explicit access: indirect allows refinement of arbitrary depths compact encoding of global topology/connectivity allows dynamic changes in parallel several memory indirection in data access difficult to extract SIMD parallelism INRIA Sophia Antipolis Jun 2011

19 Structured and Unstructured Sparse MatVec Larger is better chart from [Bell and Garland SC 2009] INRIA Sophia Antipolis Jun

20 Overview Levels of Parallelism Grid Discretizations of PDEs Multigrid and Strong Smoothers Mixed Precision Iterative Refinement Layout of Multi-valued Data INRIA Sophia Antipolis Jun

21 Generalized Poisson Problem We seek a function u( x) : Ω R m, Ω R d which satisfies interior boundary div( G u=f u) f ν u=bn or in Ω u= b D on Ω We consider the scalar (m=1) 2D case (d=2) with operator anisotropies. Given a vector field ( v 1 (x,y), v 2 (x,y) ) we define: INRIA Sophia Antipolis Jun

22 Discretization Approach Finite Differences Finite Volumes Ax=b For 2D linear FEM A is a 9-band matrix. Finite Elements INRIA Sophia Antipolis Jun

23 Geometric Multigrid Method Linear equation system after discretization Ax=b Observation: Basic solvers quickly reduce the high frequency error components, but struggle with low frequencies Idea: Solve the system on a pyramid of grids, thus dealing with different frequencies one after another d k = b-ax k Fine grid Ac k = d k Coarse grid x k+1 = x k +c k Back on fine grid INRIA Sophia Antipolis Jun

24 Multigrid Transfers Restriction Interpolate values from fine into coarse array Local weighted gather operation 2i-1 2i 2i+1 i fine adjust index to read neighbors output region coarse coarse result array INRIA Sophia Antipolis Jun

25 Multigrid Transfers Prolongation Scatter values from fine to coarse with weighting stencil Local weighted scatter operation INRIA Sophia Antipolis Jun

26 Preconditioners x = Ax=b Damped, preconditioned defect correction: x + ωc ( b Ax k+ 1 k 1 k ) A = ( LL+ LD+ LU) + ( DL+ DD+ DU) + ( UL+ UD+ UU) JACOBI GSROW TRIDI TRIGSROW C=DD C = ( LL+ LD+ LU) + ( DL+ DD) C = ( DL+ DD+ DU) C = ( LL+ LD+ LU) + ( DL+ DD+ DU) INRIA Sophia Antipolis Jun

27 CPU Numerical and Runtime Efficiency Iter.Ref.(double) MG(float) V(2,2) CG x = 0,34?????? INRIA Sophia Antipolis Jun

28 Gauss-Seidel Preconditioner Ax=b Damped, preconditioned defect correction: x = x + ωc ( b Ax k+ 1 k 1 k ) GSROW C = ( LL+ LD+ LU) + ( DL+ DD) INRIA Sophia Antipolis Jun 2011

29 Multi-Colored Gauss-Seidel INRIA Sophia Antipolis Jun

30 ADI-TRIDI Preconditioner x Ax=b Damped, preconditioned defect correction: = x + ωc ( b Ax k+ 1 k 1 k ) TRIDI-ROW C = ( DL+ DD+ DU) TRIDI-COLUMN INRIA Sophia Antipolis Jun 2011

31 SIMD Parallelism: Cyclic Reduction 3*8 3*4-1 3*2-1 *2 *4 *8 3*2-1 3*4-1 3*9-2 O(N) *8 *4 *2 O(N*logN) INRIA Sophia Antipolis Jun

32 Memory Friendly Cyclic Reduction [Göddeke et al. Cyclic Reduction Tridiagonal Solvers on GPUs Applied to Mixed Precision Multigrid, TPDS 2011] INRIA Sophia Antipolis Jun

33 ADI-TRIGS Preconditioner x Ax=b Damped, preconditioned defect correction: = x + ωc ( b Ax k+ 1 k 1 k ) TRIGS-ROW C = ( LL+ LD+ LU) + ( DL+ DD+ DU) TRIGS-COLUMN INRIA Sophia Antipolis Jun

34 Many-Core Parallelism Host Input Assembler full/serial coupling Local Memory Local Memory Local Memory Local Memory Local Memory Local Memory Local Memory Local Memory Cache Cache Cache Cache Cache Cache Cache Cache Load/store Load/store Load/store Load/store Load/store Load/store Global Memory 4-way coupling 2-way coupling INRIA Sophia Antipolis Jun

35 CPU vs. GPU Numerical Efficiency Iter.Ref.(double) MG(float) V(2,2) CG INRIA Sophia Antipolis Jun

36 CPU vs. GPU Runtime Efficiency Core2Duo E6750 GTX280 Tesla C2070 Westmere-EP i7 X980 INRIA Sophia Antipolis Jun

37 Multigrid on Refined Unstructured Grid FE-gMG Unstructured grid Regular refinement Restriction & Prolongation as MatVec SPAI preconditioner Order & Storage (ELL) INRIA Sophia Antipolis Jun

38 FE-gMG Results with SPAI Preconditioner INRIA Sophia Antipolis Jun

39 Overview Levels of Parallelism Grid Discretizations of PDEs Multigrid and Strong Smoothers Mixed Precision Iterative Refinement Layout of Multi-valued Data INRIA Sophia Antipolis Jun

40 Precision Comparison Numerical Algorithms GPU Hardware long 64bit double s52e11 Precision int 32bit float s23e8 Comparison Bandwidth Storage Operator+ Operator* 1 word 2 words 1 word 2 words 1 adder 2 adders 1 multiplier 4 multipliers INRIA Sophia Antipolis Jun 2011

41 Hardware Precision float s23e8 double s52e11 23 bit 52 bit Data Error 1/2 = fl 0.5 1/3 = fl /2 = db 0.5 1/3 = db Roundoff Error * = fl e-8 = fl 1 f(a, b) = fl f fl (a, b) * = db e-15 = db 1 f(a, b) = db f db (a, b) INRIA Sophia Antipolis Jun 2011

42 The Erratic Roundoff Error Roundoff error for: 0 = f(a):= (1+a)^3 - (1+3a^2) - (3a+a^3) single precision double precision Smaller is better y = log2( f(a) ), 0 --> 2^ x = log2( 1/a ), a = 1 / 2^x INRIA Sophia Antipolis Jun 2011

43 Numerical Accuracy float s23e8 double s52e11 ( A x x A fl ε ) x fl 23 bit = b b = c(a) x ε ε Condition of Ax = b ( A x x Discretization Error A db δ ) x 52 bit db = b b = c( A) x δ δ div( G u) = f Ax=b err= u x err c(a) = u x INRIA Sophia Antipolis Jun 2011

44 Mixed Precision Iterative Refinement float s23e8 double s52e11 23 bit 52 bit ( A x x A fl ε ) x = c(a) x Condition of Ax = b = b b Iterative Refinement for Ax = b fl ε ε x d k = b-ax k Compute in high precision (cheap) Ac k = d k Solve in low precision (fast) x k+1 = x k +c k Correct in high precision (cheap) k = k+1 Iterate until convergence in high precision x x fl l+ 1 l+ 1 = F( Ab,, x fl l+ 1 fl l ) = c( F) x ε INRIA Sophia Antipolis Jun 2011

45 Mixed Precision Multigrid on GPU Core2Duo Smaller is better G80-CPU GT200-CPU GT200-db GT200-GPU INRIA Sophia Antipolis Jun 2011

46 Overview Levels of Parallelism Grid Discretizations of PDEs Multigrid and Strong Smoothers Mixed Precision Iterative Refinement Layout of Multi-valued Data INRIA Sophia Antipolis Jun

47 Multi-Valued Data Multi-valued data is ubiquitous Class 1: mathematical properties, e.g. multiple derivatives or moments Class 2: discrete features, e.g. colors of a pixel, multiple scores Class 3: per-dimension properties, e.g. coordinates, velocities on a 3D grid INRIA Sophia Antipolis Jun 2011

48 AoS and SoA Array of Structs (AoS) struct NormalStruct { Type1 comp1; Type2 comp2; Type3 comp3; }; Struct of Arrays (SoA) struct SoAContainer { Type1 comp1[size]; Type2 comp2[size]; Type3 comp3[size]; }; typedef NormalStruct AoSContainer[SIZE]; AoSContainer container; container[5].comp3++; SoAContainer container; container.comp3[5]++; INRIA Sophia Antipolis Jun 2011

49 Parallel Access in AoS and SoA Array of Structs (AoS) container[1].comp3 container[2].comp3 container[3].comp3 container[4].comp3 Struct of Arrays (SoA) container.comp3[1] container.comp3[2] container.comp3[3] container.comp3[4] container[1].comp{1,2,3} container[5].comp{1,2,3} container.comp{1,2,3}[1] container.comp{1,2,3}[5] INRIA Sophia Antipolis Jun 2011

50 Operating on Container Elements struct NormalStruct { Type1 comp1; Type2 comp2; Type3 comp3; }; typedef NormalStruct AoSContainer[SIZE]; NormalStruct single; AoSContainer container; In-place update of single and indexed structs void update( NormalStruct& s ) { s.comp3 += s.comp1; } update( single ); // OK update( container[5] ); // OK This is not possible with standard SoA/C++ syntax: container.comp3(5); INRIA Sophia Antipolis Jun 2011

51 Abstraction: AoS + SoA = ASA Array of Structs (AoS) struct NormalStruct { Type1 comp1; Type2 comp2; Type3 comp3; }; typedef NormalStruct Container[SIZE]; NormalStruct single Container container; void update(normalstruct& s); container[index].comp3++; update( container[5] ); Array of Structs of Arrays (ASA) template <ID t_id=id_value> struct FlexibleStruct { typedef ASAGroup<Type1,t_id> ASX_ASA; union{ Type1 comp1; ASX_ASA d1; }; union{ Type2 comp2; ASX_ASA d2; }; union{ Type3 comp3; ASX_ASA d3; }; }; typedef ASX::Array<FlexibleStruct, SIZE,??? > Container; FlexibleStruct<> single Container container; template <ID t_id> void update(flexiblestruct<t_id>& s); container[index].comp3++; update( container[5] ); [Strzodka. Abstraction for AoS and SoA layout in C++. GCG 2011] INRIA Sophia Antipolis Jun 2011??? = ASX::AOS or ASX::SOA

52 Overview Levels of Parallelism Explicit exploitation of all levels: SIMD, core, socket, cluster Grid Discretizations of PDEs Regularity of memory access Multigrid and Strong Smoothers Balancing numerical and hardware requirements Mixed Precision Iterative Refinement Same accuracy with faster computation Layout of Multi-valued Data Choice of layout determines memory access patterns INRIA Sophia Antipolis Jun

53 Questions? Robert Strzodka Integrative Scientific Computing Max Planck Institut Informatik ~strzodka Dominik Göddeke Institute for Applied Mathematics Technical University of Dortmund ~goeddeke

Structures for Heterogeneous Computing. The Best of Both Worlds: Flexible Data. Integrative Scientific Computing Max Planck Institut Informatik

The Best of Both Worlds: Flexible Data Structures for Heterogeneous Computing Robert Strzodka Integrative Scientific Computing Max Planck Institut Informatik My GPU Programming Challenges 2000: Conjugate