Finite Element Multigrid Solvers for PDE Problems on GPUs and GPU Clusters
1 Finite Element Multigrid Solvers for PDE Problems on GPUs and GPU Clusters
Robert Strzodka, Integrative Scientific Computing, Max Planck Institut Informatik, ~strzodka
Dominik Göddeke, Institute for Applied Mathematics, Technical University of Dortmund, ~goeddeke
2 Structure of Double Lecture (2 x 90 min)
PART 1: Parallelism, Grid Discretization, Multigrid & Smoothers, Mixed-Precision, Data Layout
PART 2: FEM on GPU Clusters: New MPI Application (Rewrite), Legacy MPI Code (Accelerate)
INRIA Sophia Antipolis, Jun 2011
3 Overview
- Levels of Parallelism
- Grid Discretizations of PDEs
- Multigrid and Strong Smoothers
- Mixed Precision Iterative Refinement
- Layout of Multi-valued Data
4 Parallelism
Sequential execution is an illusory software concept: all transistors always do something in parallel!
Billions of transistors in modern CPUs (>0.5) and GPUs (>2)
Old: implicit parallelism with caches, ILP, speculation; diminishing returns, power constraints
New: explicit parallelism on multiple levels; much more efficient and natural
5 SIMD Parallelism
[Figure: one instruction stream drives Unit 0 to Unit 15, each operating on its own item Data 0 to Data 15]
It is impossible to execute just one instruction: a = b + c; actually means execute (add, nop, nop, nop, ...)
Penalty for ignoring SIMD:
- 4x on current CPUs (SSE)
- 8-16x on future CPUs (AVX, LRB)
- 16-80x on GPUs
6 Many-Core Parallelism
[Figure: Host, Input Assembler and Execution Manager feed several cores, each with Unit 0 to Unit 15 on Data 0 to Data 15, Local Memory and Cache; Load/store units connect the cores to Global Memory]
Penalty for ignoring many-cores:
- 4-8x on current CPUs
- 32-48x on future CPUs (LRB)
- 10-30x on GPUs
7 Intra-Node Parallelism (multiple CPUs/GPUs per PC)
[Figure: multi-CPU system with four CPUs; multi-GPU system with GPU1 to GPU4]
Penalty for ignoring intra-node parallelism:
- 4-8x for multi-CPU systems
- 4-8x for multi-GPU systems
8 Inter-Node Parallelism in a Cluster
[Figure: several PCs connected in a cluster]
Penalty for ignoring inter-node parallelism: depends on the number of nodes in the cluster
Capability computing
9 Bandwidth in a CPU-GPU System
[Figure: bandwidth diagram of a CPU socket (4-12 cores, each with ALU, FPU and on-chip memory), system memory, a high-end GPU socket (ALUs, FPUs and on-chip memory) and device memory. Bandwidth labels appearing in the figure: 40 GB/s, 2 TB/s, 2 GB/s, 20 GB/s, 200 GB/s, 4 GB/s; the assignment of labels to links is not recoverable from the transcription.]
10 Overview
- Levels of Parallelism
- Grid Discretizations of PDEs
- Multigrid and Strong Smoothers
- Mixed Precision Iterative Refinement
- Layout of Multi-valued Data
11 Generalized Poisson Problem
We seek a function $u(x) : \Omega \to \mathbb{R}^m$, $\Omega \subset \mathbb{R}^d$, which satisfies
interior: $\operatorname{div}(G \nabla u) = f$ in $\Omega$
boundary: $\partial_\nu u = b_N$ or $u = b_D$ on $\partial\Omega$
We consider the scalar ($m=1$) 2D case ($d=2$) with operator anisotropies. Given a vector field $(v_1(x,y), v_2(x,y))$ we define:
[definition not captured in the transcription]
12 Discretization Grids
Equidistant grid: topology implicit, geometry implicit, access direct
Generalized tensor-product grid: topology implicit, geometry explicit, access direct
13 Discretization Grids
Adaptive grid: topology explicit, geometry implicit/explicit, access via hash, tree or page table
Unstructured grid: topology explicit, geometry explicit, access via index array
14 nD Arrays
Generalized tensor-product grid: topology implicit, geometry implicit/explicit, access direct
Pros:
- simple implementation
- implicit topology saves precious global bandwidth
- allows efficient on-chip stencil operations
- good SIMD utilization
Cons:
- topology constraints make object modeling more difficult
- element form may require a more powerful solver
15 Deformation Adaptivity
[Figure: deformed grid] This grid is a tensor-product!
Easier to accelerate in hardware than resolution adaptive grids
Anisotropy level determines optimal solver
16 nD Arrays
Adaptive/Unstructured grid: topology explicit, geometry implicit/explicit, access indirect
Pros:
- general scheme for arbitrary node arrangements
- only one indirection for local data access
- clever node numberings preserve some data locality
Cons:
- expensive encoding of topology/connectivity
- no global view of the structure
- difficult to handle dynamic changes in parallel
17 Hash
Adaptive/Unstructured grid: topology explicit, geometry implicit/explicit, access indirect
Pros:
- general scheme for arbitrary node arrangements
- only one indirection for arbitrary global data access
- perfect hashes have no collisions, thus good SIMD use
Cons:
- expensive encoding of topology/connectivity
- hashes tend to produce fine-grained random data access
- perfect hash generation too expensive for dynamic changes
18 Tree
Adaptive grid: topology explicit, geometry implicit/explicit, access indirect
Pros:
- allows refinement of arbitrary depths
- compact encoding of global topology/connectivity
- allows dynamic changes in parallel
Cons:
- several memory indirections in data access
- difficult to extract SIMD parallelism
19 Structured and Unstructured Sparse MatVec
[Chart from [Bell and Garland SC 2009]; larger is better]
20 Overview
- Levels of Parallelism
- Grid Discretizations of PDEs
- Multigrid and Strong Smoothers
- Mixed Precision Iterative Refinement
- Layout of Multi-valued Data
21 Generalized Poisson Problem
We seek a function $u(x) : \Omega \to \mathbb{R}^m$, $\Omega \subset \mathbb{R}^d$, which satisfies
interior: $\operatorname{div}(G \nabla u) = f$ in $\Omega$
boundary: $\partial_\nu u = b_N$ or $u = b_D$ on $\partial\Omega$
We consider the scalar ($m=1$) 2D case ($d=2$) with operator anisotropies. Given a vector field $(v_1(x,y), v_2(x,y))$ we define:
[definition not captured in the transcription]
22 Discretization Approach
Finite Differences, Finite Volumes or Finite Elements all lead to Ax=b
For 2D linear FEM A is a 9-band matrix.
23 Geometric Multigrid Method
Linear equation system after discretization: Ax=b
Observation: basic solvers quickly reduce the high-frequency error components, but struggle with low frequencies
Idea: solve the system on a pyramid of grids, thus dealing with different frequencies one after another
- $d_k = b - A x_k$ (fine grid)
- $A c_k = d_k$ (coarse grid)
- $x_{k+1} = x_k + c_k$ (back on fine grid)
24 Multigrid Transfers: Restriction
Interpolate values from fine into coarse array
Local weighted gather operation
[Figure: coarse node i in the output region reads fine nodes 2i-1, 2i, 2i+1 (adjust index to read neighbors) and writes to the coarse result array]
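The gather on this slide can be sketched in a few lines. This is a minimal sketch, assuming a 1D grid with 2*nc - 1 fine nodes and the common full-weighting stencil (1/4, 1/2, 1/4); the function name `restrict_full_weighting` and the weights are illustrative assumptions, not taken from the slides.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Full-weighting restriction on a 1D grid (assumed stencil: 1/4, 1/2, 1/4).
// The fine grid has 2*nc - 1 nodes, the coarse grid has nc nodes.
std::vector<double> restrict_full_weighting(const std::vector<double>& fine) {
    assert(fine.size() >= 3 && fine.size() % 2 == 1);
    const std::size_t nc = (fine.size() + 1) / 2;
    std::vector<double> coarse(nc);
    // Boundary nodes coincide on both grids and are copied.
    coarse[0] = fine[0];
    coarse[nc - 1] = fine[fine.size() - 1];
    // Each interior coarse node i gathers from the fine nodes 2i-1, 2i, 2i+1.
    for (std::size_t i = 1; i + 1 < nc; ++i) {
        coarse[i] = 0.25 * fine[2 * i - 1] + 0.5 * fine[2 * i] + 0.25 * fine[2 * i + 1];
    }
    return coarse;
}
```

Every coarse node writes only its own output entry, so the loop iterations are independent and map naturally onto one thread per coarse node.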
25 Multigrid Transfers: Prolongation
Scatter values from coarse to fine with weighting stencil
Local weighted scatter operation
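The weighted scatter can be sketched as follows. This is a minimal sketch, assuming a 1D grid and linear interpolation weights (1/2, 1, 1/2); the function name `prolongate_linear` and the weights are illustrative assumptions, not taken from the slides.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Linear prolongation on a 1D grid, written as a scatter:
// coarse node i contributes to the fine nodes 2i-1, 2i and 2i+1.
std::vector<double> prolongate_linear(const std::vector<double>& coarse) {
    assert(coarse.size() >= 2);
    const std::size_t nc = coarse.size();
    std::vector<double> fine(2 * nc - 1, 0.0);
    for (std::size_t i = 0; i < nc; ++i) {
        fine[2 * i] += coarse[i];                            // coinciding node, weight 1
        if (i > 0) fine[2 * i - 1] += 0.5 * coarse[i];       // left fine neighbor
        if (i + 1 < nc) fine[2 * i + 1] += 0.5 * coarse[i];  // right fine neighbor
    }
    return fine;
}
```

In this scatter form two coarse nodes write to the same odd fine node, so a parallel version needs atomic updates or a reformulation as a gather over the fine nodes.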
26 Preconditioners
Ax=b
Damped, preconditioned defect correction: $x_{k+1} = x_k + \omega C^{-1} (b - A x_k)$
$A = (LL+LD+LU) + (DL+DD+DU) + (UL+UD+UU)$
JACOBI: $C = DD$
GSROW: $C = (LL+LD+LU) + (DL+DD)$
TRIDI: $C = (DL+DD+DU)$
TRIGSROW: $C = (LL+LD+LU) + (DL+DD+DU)$
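The simplest choice on this slide, JACOBI with C equal to the diagonal of A, can be sketched as follows. This is a minimal sketch, assuming the 1D model matrix A = tridiag(-1, 2, -1) instead of the 9-band FEM matrix of the lecture; the function name `jacobi_step` is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// One damped Jacobi step x_{k+1} = x_k + omega * C^{-1} (b - A x_k)
// for the assumed model matrix A = tridiag(-1, 2, -1), where C = diag(A) = 2 I.
std::vector<double> jacobi_step(const std::vector<double>& x,
                                const std::vector<double>& b, double omega) {
    assert(x.size() == b.size());
    const std::size_t n = x.size();
    std::vector<double> next(n);
    for (std::size_t i = 0; i < n; ++i) {
        // Row i of A applied to x.
        double ax = 2.0 * x[i];
        if (i > 0) ax -= x[i - 1];
        if (i + 1 < n) ax -= x[i + 1];
        const double defect = b[i] - ax;
        next[i] = x[i] + omega * defect / 2.0;  // apply C^{-1} = 1/2
    }
    return next;
}
```

Each entry of the new iterate depends only on the old iterate, which is why Jacobi parallelizes trivially; the stronger preconditioners on the slide trade some of this independence for better smoothing.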
27 CPU Numerical and Runtime Efficiency
[Chart: Iter.Ref.(double) MG(float) V(2,2) CG; remaining labels lost in the transcription]
28 Gauss-Seidel Preconditioner
Ax=b
Damped, preconditioned defect correction: $x_{k+1} = x_k + \omega C^{-1} (b - A x_k)$
GSROW: $C = (LL+LD+LU) + (DL+DD)$
29 Multi-Colored Gauss-Seidel
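The idea of multi-coloring can be illustrated with the two-color (red-black) case. This is a minimal sketch, assuming the 1D model matrix A = tridiag(-1, 2, -1), for which two colors suffice; the 9-band matrix of the lecture needs more colors, and the function name `red_black_gauss_seidel_sweep` is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// One red-black Gauss-Seidel sweep for the assumed model matrix
// A = tridiag(-1, 2, -1). Even nodes form one color, odd nodes the other.
void red_black_gauss_seidel_sweep(std::vector<double>& x, const std::vector<double>& b) {
    assert(x.size() == b.size());
    const std::size_t n = x.size();
    for (std::size_t color = 0; color < 2; ++color) {
        // Nodes of one color depend only on nodes of the other color,
        // so all updates inside this inner loop are independent.
        for (std::size_t i = color; i < n; i += 2) {
            double sum = b[i];
            if (i > 0) sum += x[i - 1];
            if (i + 1 < n) sum += x[i + 1];
            x[i] = sum / 2.0;
        }
    }
}
```

The coloring replaces the strictly sequential dependency chain of lexicographic Gauss-Seidel with a small number of fully parallel phases.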
30 ADI-TRIDI Preconditioner
Ax=b
Damped, preconditioned defect correction: $x_{k+1} = x_k + \omega C^{-1} (b - A x_k)$
TRIDI-ROW: $C = (DL+DD+DU)$
TRIDI-COLUMN
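Applying the TRIDI preconditioner means solving one tridiagonal system per grid row (or column). As a serial reference, here is a minimal sketch of the classical Thomas algorithm; it is a swapped-in textbook method for illustration, not the GPU solver of the lecture, and the function name `solve_tridiagonal` is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Thomas algorithm for a tridiagonal system with sub-diagonal a, diagonal d and
// super-diagonal c. a[0] and c[n-1] are unused. No pivoting is performed, so the
// sketch assumes a diagonally dominant matrix.
std::vector<double> solve_tridiagonal(const std::vector<double>& a, std::vector<double> d,
                                      const std::vector<double>& c, std::vector<double> rhs) {
    const std::size_t n = d.size();
    assert(n > 0 && a.size() == n && c.size() == n && rhs.size() == n);
    // Forward elimination removes the sub-diagonal.
    for (std::size_t i = 1; i < n; ++i) {
        const double m = a[i] / d[i - 1];
        d[i] -= m * c[i - 1];
        rhs[i] -= m * rhs[i - 1];
    }
    // Back substitution, from the last unknown to the first.
    rhs[n - 1] /= d[n - 1];
    for (std::size_t i = n - 1; i-- > 0;) {
        rhs[i] = (rhs[i] - c[i] * rhs[i + 1]) / d[i];
    }
    return rhs;
}
```

Both loops carry a dependency from one unknown to the next, so a single system offers no SIMD parallelism in this form; parallelism comes from solving many rows at once or from a different algorithm.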
31 SIMD Parallelism: Cyclic Reduction
[Figure: cyclic reduction data flow with forward reduction and back substitution over strides 2, 4, 8; complexity labels O(N) and O(N*logN)]
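The data flow of cyclic reduction can be sketched serially. This is a minimal sketch, assuming n = 2^k - 1 unknowns and no pivoting; it is a textbook formulation for illustration, not the memory-friendly GPU variant of the lecture, and the function name `cyclic_reduction` is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Serial reference of cyclic reduction for n = 2^k - 1 unknowns.
// a: sub-diagonal (a[0] == 0), b: diagonal, c: super-diagonal (c[n-1] == 0),
// d: right-hand side. All inner loops have independent iterations.
std::vector<double> cyclic_reduction(std::vector<double> a, std::vector<double> b,
                                     std::vector<double> c, std::vector<double> d) {
    const std::size_t n = b.size();
    assert(n > 0 && ((n + 1) & n) == 0);  // n must be 2^k - 1
    assert(a.size() == n && c.size() == n && d.size() == n);

    // Forward reduction: eliminate the neighbors at distance s from every
    // equation i with (i + 1) divisible by 2s, doubling s each level.
    std::size_t s = 1;
    for (; 2 * s - 1 < n; s *= 2) {
        for (std::size_t i = 2 * s - 1; i < n; i += 2 * s) {
            const double k1 = a[i] / b[i - s];
            const double k2 = c[i] / b[i + s];
            b[i] -= c[i - s] * k1 + a[i + s] * k2;
            d[i] -= d[i - s] * k1 + d[i + s] * k2;
            a[i] = -a[i - s] * k1;
            c[i] = -c[i + s] * k2;
        }
    }

    // Back substitution: start with the middle unknown (stride s), then halve
    // the stride and fill in the unknowns between already known ones.
    std::vector<double> x(n, 0.0);
    for (std::size_t t = s; t >= 1; t /= 2) {
        for (std::size_t i = t - 1; i < n; i += 2 * t) {
            double rhs = d[i];
            if (i >= t) rhs -= a[i] * x[i - t];
            if (i + t < n) rhs -= c[i] * x[i + t];
            x[i] = rhs / b[i];
        }
    }
    return x;
}
```

The number of independent equations halves on every reduction level, so the available parallelism shrinks toward the middle of the algorithm and grows again during back substitution.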
32 Memory Friendly Cyclic Reduction
[Göddeke et al. Cyclic Reduction Tridiagonal Solvers on GPUs Applied to Mixed Precision Multigrid, TPDS 2011]
33 ADI-TRIGS Preconditioner
Ax=b
Damped, preconditioned defect correction: $x_{k+1} = x_k + \omega C^{-1} (b - A x_k)$
TRIGS-ROW: $C = (LL+LD+LU) + (DL+DD+DU)$
TRIGS-COLUMN
34 Many-Core Parallelism
[Figure: Host and Input Assembler feed cores with Local Memory and Cache that reach Global Memory through Load/store units; the rows are coupled in three variants: full/serial coupling, 4-way coupling, 2-way coupling]
35 CPU vs. GPU Numerical Efficiency
[Chart: Iter.Ref.(double) MG(float) V(2,2) CG]
36 CPU vs. GPU Runtime Efficiency
[Chart comparing Core2Duo E6750, GTX280, Tesla C2070, Westmere-EP and i7 X980]
37 Multigrid on Refined Unstructured Grid
FE-gMG:
- Unstructured grid
- Regular refinement
- Restriction & Prolongation as MatVec
- SPAI preconditioner
- Order & Storage (ELL)
38 FE-gMG Results with SPAI Preconditioner
39 Overview
- Levels of Parallelism
- Grid Discretizations of PDEs
- Multigrid and Strong Smoothers
- Mixed Precision Iterative Refinement
- Layout of Multi-valued Data
40 Precision Comparison
[Figure labels: Numerical Algorithms, GPU Hardware; long 64bit, int 32bit]
Comparison   float s23e8    double s52e11
Bandwidth    1 word         2 words
Storage      1 word         2 words
Operator+    1 adder        2 adders
Operator*    1 multiplier   4 multipliers
41 Hardware Precision
float s23e8: 23 bit; double s52e11: 52 bit
Data Error: 1/2 =_fl 0.5, 1/3 =_fl [digits lost]; 1/2 =_db 0.5, 1/3 =_db [digits lost]
Roundoff Error: [operands lost] =_fl [digits lost]e-8, f(a, b) =_fl f_fl(a, b); [operands lost] =_db [digits lost]e-15, f(a, b) =_db f_db(a, b)
42 The Erratic Roundoff Error
Roundoff error for: $0 = f(a) := (1+a)^3 - (1+3a^2) - (3a+a^3)$
[Plot: single precision and double precision curves; smaller is better; y = log2( f(a) ), 0 --> 2^ [exponent lost]; x = log2( 1/a ), a = 1 / 2^x]
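The experiment behind this plot is easy to reproduce. Mathematically f(a) is zero for every a, so any nonzero result is pure roundoff. This is a minimal sketch; the names `roundoff_probe` and `inverse_power_of_two` are hypothetical, and the exact nonzero values depend on compiler and hardware (for example on fused multiply-add contraction), so none are quoted here.

```cpp
#include <cassert>

// Evaluates f(a) = (1+a)^3 - (1+3a^2) - (3a+a^3) in the precision of T.
// The exact value is 0; the returned value is the accumulated roundoff error.
template <typename T>
T roundoff_probe(T a) {
    const T one_plus_a = T(1) + a;
    return one_plus_a * one_plus_a * one_plus_a
         - (T(1) + T(3) * a * a)
         - (T(3) * a + a * a * a);
}

// Builds the sample point a = 1 / 2^x used on the x-axis of the plot.
double inverse_power_of_two(int x) {
    assert(x >= 0 && x < 62);
    return 1.0 / static_cast<double>(1ULL << x);
}
```

Calling `roundoff_probe(static_cast<float>(inverse_power_of_two(x)))` and `roundoff_probe(inverse_power_of_two(x))` for increasing x gives the single and double precision curves; for small x all intermediate values are exactly representable and the result is exactly zero.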
43 Numerical Accuracy
float s23e8 (23 bit, perturbation ε), double s52e11 (52 bit, perturbation δ)
Condition of Ax = b: the float (ε) and double (δ) perturbations of the system are amplified by the condition number c(A) [exact formulas garbled in the transcription]
Discretization Error: $\operatorname{div}(G \nabla u) = f$ is discretized as Ax=b, err = ||u - x||
44 Mixed Precision Iterative Refinement
float s23e8 (23 bit), double s52e11 (52 bit)
Condition of Ax = b: [formulas garbled in the transcription]
Iterative Refinement for Ax = b:
- $d_k = b - A x_k$: compute in high precision (cheap)
- $A c_k = d_k$: solve in low precision (fast)
- $x_{k+1} = x_k + c_k$: correct in high precision (cheap)
- $k = k+1$: iterate until convergence in high precision
[Garbled formula relating successive float iterates through an operator F with condition c(F)]
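The refinement loop on this slide can be sketched as follows. This is a minimal sketch, assuming the 1D model matrix A = tridiag(-1, 2, -1) and a float Thomas solve as the low-precision inner solver; the lecture uses a float multigrid solver instead, and the names `solve_model_float` and `iterative_refinement` are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Low-precision inner solver: Thomas algorithm in float for the assumed
// model matrix A = tridiag(-1, 2, -1).
std::vector<float> solve_model_float(std::vector<float> rhs) {
    const std::size_t n = rhs.size();
    assert(n > 0);
    std::vector<float> diag(n, 2.0f);
    for (std::size_t i = 1; i < n; ++i) {
        const float m = -1.0f / diag[i - 1];  // sub-diagonal entry / pivot
        diag[i] += m;                         // diag[i] -= m * (-1)
        rhs[i] -= m * rhs[i - 1];
    }
    rhs[n - 1] /= diag[n - 1];
    for (std::size_t i = n - 1; i-- > 0;) {
        rhs[i] = (rhs[i] + rhs[i + 1]) / diag[i];  // super-diagonal entry is -1
    }
    return rhs;
}

// Mixed precision iterative refinement: defect and update in double,
// correction solve in float.
std::vector<double> iterative_refinement(const std::vector<double>& b, int iterations) {
    assert(!b.empty() && iterations > 0);
    const std::size_t n = b.size();
    std::vector<double> x(n, 0.0);
    for (int k = 0; k < iterations; ++k) {
        // d_k = b - A x_k, computed in high precision.
        std::vector<float> defect(n);
        for (std::size_t i = 0; i < n; ++i) {
            double ax = 2.0 * x[i];
            if (i > 0) ax -= x[i - 1];
            if (i + 1 < n) ax -= x[i + 1];
            defect[i] = static_cast<float>(b[i] - ax);
        }
        // A c_k = d_k, solved in low precision.
        const std::vector<float> c = solve_model_float(defect);
        // x_{k+1} = x_k + c_k, corrected in high precision.
        for (std::size_t i = 0; i < n; ++i) x[i] += static_cast<double>(c[i]);
    }
    return x;
}
```

A fixed iteration count keeps the sketch short; a practical version would stop once the double precision defect norm falls below a tolerance.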
45 Mixed Precision Multigrid on GPU
[Chart comparing Core2Duo, G80-CPU, GT200-CPU, GT200-db and GT200-GPU; smaller is better]
46 Overview
- Levels of Parallelism
- Grid Discretizations of PDEs
- Multigrid and Strong Smoothers
- Mixed Precision Iterative Refinement
- Layout of Multi-valued Data
47 Multi-Valued Data
Multi-valued data is ubiquitous
- Class 1: mathematical properties, e.g. multiple derivatives or moments
- Class 2: discrete features, e.g. colors of a pixel, multiple scores
- Class 3: per-dimension properties, e.g. coordinates, velocities on a 3D grid
48 AoS and SoA
Array of Structs (AoS)
    struct NormalStruct {
        Type1 comp1;
        Type2 comp2;
        Type3 comp3;
    };
    typedef NormalStruct AoSContainer[SIZE];
    AoSContainer container;
    container[5].comp3++;
Struct of Arrays (SoA)
    struct SoAContainer {
        Type1 comp1[SIZE];
        Type2 comp2[SIZE];
        Type3 comp3[SIZE];
    };
    SoAContainer container;
    container.comp3[5]++;
49 Parallel Access in AoS and SoA
One component of several elements:
    AoS: container[1].comp3, container[2].comp3, container[3].comp3, container[4].comp3
    SoA: container.comp3[1], container.comp3[2], container.comp3[3], container.comp3[4]
All components of single elements:
    AoS: container[1].comp{1,2,3}, container[5].comp{1,2,3}
    SoA: container.comp{1,2,3}[1], container.comp{1,2,3}[5]
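The difference between the two access patterns shows up in the memory stride of `comp3`. This is a minimal sketch, assuming all three component types are `float` and `SIZE` is 8; the helper names `sum_comp3` and `comp3_stride_bytes` are hypothetical.

```cpp
#include <cassert>
#include <cstddef>

// Assumptions for this sketch: all component types are float, SIZE is 8.
typedef float Type1;
typedef float Type2;
typedef float Type3;
const std::size_t SIZE = 8;

struct NormalStruct { Type1 comp1; Type2 comp2; Type3 comp3; };
typedef NormalStruct AoSContainer[SIZE];
struct SoAContainer { Type1 comp1[SIZE]; Type2 comp2[SIZE]; Type3 comp3[SIZE]; };

// AoS: consecutive comp3 values are sizeof(NormalStruct) bytes apart.
Type3 sum_comp3(const AoSContainer& c) {
    Type3 sum = 0;
    for (std::size_t i = 0; i < SIZE; ++i) sum += c[i].comp3;
    return sum;
}

// SoA: consecutive comp3 values are adjacent in memory (unit stride).
Type3 sum_comp3(const SoAContainer& c) {
    Type3 sum = 0;
    for (std::size_t i = 0; i < SIZE; ++i) sum += c.comp3[i];
    return sum;
}

// Byte distance between comp3 of element 0 and comp3 of element 1.
std::size_t comp3_stride_bytes(const AoSContainer& c) {
    const std::ptrdiff_t diff = reinterpret_cast<const char*>(&c[1].comp3) -
                                reinterpret_cast<const char*>(&c[0].comp3);
    assert(diff > 0);
    return static_cast<std::size_t>(diff);
}

std::size_t comp3_stride_bytes(const SoAContainer& c) {
    const std::ptrdiff_t diff = reinterpret_cast<const char*>(&c.comp3[1]) -
                                reinterpret_cast<const char*>(&c.comp3[0]);
    assert(diff > 0);
    return static_cast<std::size_t>(diff);
}
```

With unit stride, neighboring SIMD lanes or GPU threads that read `comp3` of neighboring elements touch one contiguous memory block, whereas the AoS stride spreads the same reads over a region three times as large in this example.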
50 Operating on Container Elements
    struct NormalStruct {
        Type1 comp1;
        Type2 comp2;
        Type3 comp3;
    };
    typedef NormalStruct AoSContainer[SIZE];
    NormalStruct single;
    AoSContainer container;
In-place update of single and indexed structs
    void update( NormalStruct& s ) { s.comp3 += s.comp1; }
    update( single ); // OK
    update( container[5] ); // OK
This is not possible with standard SoA/C++ syntax:
    container.comp3(5);
51 Abstraction: AoS + SoA = ASA
Array of Structs (AoS)
    struct NormalStruct {
        Type1 comp1;
        Type2 comp2;
        Type3 comp3;
    };
    typedef NormalStruct Container[SIZE];
    NormalStruct single;
    Container container;
    void update(NormalStruct& s);
    container[index].comp3++;
    update( container[5] );
Array of Structs of Arrays (ASA)
    template <ID t_id=id_value>
    struct FlexibleStruct {
        typedef ASAGroup<Type1,t_id> ASX_ASA;
        union{ Type1 comp1; ASX_ASA d1; };
        union{ Type2 comp2; ASX_ASA d2; };
        union{ Type3 comp3; ASX_ASA d3; };
    };
    typedef ASX::Array<FlexibleStruct, SIZE, ??? > Container;
    FlexibleStruct<> single;
    Container container;
    template <ID t_id> void update(FlexibleStruct<t_id>& s);
    container[index].comp3++;
    update( container[5] );
??? = ASX::AOS or ASX::SOA
[Strzodka. Abstraction for AoS and SoA layout in C++. GCG 2011]
52 Overview
- Levels of Parallelism: explicit exploitation of all levels: SIMD, core, socket, cluster
- Grid Discretizations of PDEs: regularity of memory access
- Multigrid and Strong Smoothers: balancing numerical and hardware requirements
- Mixed Precision Iterative Refinement: same accuracy with faster computation
- Layout of Multi-valued Data: choice of layout determines memory access patterns
53 Questions?
Robert Strzodka, Integrative Scientific Computing, Max Planck Institut Informatik, ~strzodka
Dominik Göddeke, Institute for Applied Mathematics, Technical University of Dortmund, ~goeddeke
Large Scale FEA on the GPU Krishnan Suresh Associate Professor Mechanical Engineering High-Performance Trick Computations (i.e., 3.4*1.22): essentially free Memory access determines speed of code Pick
More informationNumerical Modelling in Fortran: day 6. Paul Tackley, 2017
Numerical Modelling in Fortran: day 6 Paul Tackley, 2017 Today s Goals 1. Learn about pointers, generic procedures and operators 2. Learn about iterative solvers for boundary value problems, including
More informationBandwidth Avoiding Stencil Computations
Bandwidth Avoiding Stencil Computations By Kaushik Datta, Sam Williams, Kathy Yelick, and Jim Demmel, and others Berkeley Benchmarking and Optimization Group UC Berkeley March 13, 2008 http://bebop.cs.berkeley.edu
More informationContents. I The Basic Framework for Stationary Problems 1
page v Preface xiii I The Basic Framework for Stationary Problems 1 1 Some model PDEs 3 1.1 Laplace s equation; elliptic BVPs... 3 1.1.1 Physical experiments modeled by Laplace s equation... 5 1.2 Other
More informationEfficient Tridiagonal Solvers for ADI methods and Fluid Simulation
Efficient Tridiagonal Solvers for ADI methods and Fluid Simulation Nikolai Sakharnykh - NVIDIA San Jose Convention Center, San Jose, CA September 21, 2010 Introduction Tridiagonal solvers very popular
More informationParallel Implementations of Gaussian Elimination
s of Western Michigan University vasilije.perovic@wmich.edu January 27, 2012 CS 6260: in Parallel Linear systems of equations General form of a linear system of equations is given by a 11 x 1 + + a 1n
More informationsmooth coefficients H. Köstler, U. Rüde
A robust multigrid solver for the optical flow problem with non- smooth coefficients H. Köstler, U. Rüde Overview Optical Flow Problem Data term and various regularizers A Robust Multigrid Solver Galerkin
More informationEFFICIENT SOLVER FOR LINEAR ALGEBRAIC EQUATIONS ON PARALLEL ARCHITECTURE USING MPI
EFFICIENT SOLVER FOR LINEAR ALGEBRAIC EQUATIONS ON PARALLEL ARCHITECTURE USING MPI 1 Akshay N. Panajwar, 2 Prof.M.A.Shah Department of Computer Science and Engineering, Walchand College of Engineering,
More informationOptimization of HOM Couplers using Time Domain Schemes
Optimization of HOM Couplers using Time Domain Schemes Workshop on HOM Damping in Superconducting RF Cavities Carsten Potratz Universität Rostock October 11, 2010 10/11/2010 2009 UNIVERSITÄT ROSTOCK FAKULTÄT
More informationGPU Implementation of Elliptic Solvers in NWP. Numerical Weather- and Climate- Prediction
1/8 GPU Implementation of Elliptic Solvers in Numerical Weather- and Climate- Prediction Eike Hermann Müller, Robert Scheichl Department of Mathematical Sciences EHM, Xu Guo, Sinan Shi and RS: http://arxiv.org/abs/1302.7193
More informationSoftware and Performance Engineering for numerical codes on GPU clusters
Software and Performance Engineering for numerical codes on GPU clusters H. Köstler International Workshop of GPU Solutions to Multiscale Problems in Science and Engineering Harbin, China 28.7.2010 2 3
More informationHPC Algorithms and Applications
HPC Algorithms and Applications Dwarf #5 Structured Grids Michael Bader Winter 2012/2013 Dwarf #5 Structured Grids, Winter 2012/2013 1 Dwarf #5 Structured Grids 1. dense linear algebra 2. sparse linear
More informationOptimizing Data Locality for Iterative Matrix Solvers on CUDA
Optimizing Data Locality for Iterative Matrix Solvers on CUDA Raymond Flagg, Jason Monk, Yifeng Zhu PhD., Bruce Segee PhD. Department of Electrical and Computer Engineering, University of Maine, Orono,
More information14MMFD-34 Parallel Efficiency and Algorithmic Optimality in Reservoir Simulation on GPUs
14MMFD-34 Parallel Efficiency and Algorithmic Optimality in Reservoir Simulation on GPUs K. Esler, D. Dembeck, K. Mukundakrishnan, V. Natoli, J. Shumway and Y. Zhang Stone Ridge Technology, Bel Air, MD
More informationThe Many-Core Revolution Understanding Change. Alejandro Cabrera January 29, 2009
The Many-Core Revolution Understanding Change Alejandro Cabrera cpp.cabrera@gmail.com January 29, 2009 Disclaimer This presentation currently contains several claims requiring proper citations and a few
More informationScalability of Elliptic Solvers in NWP. Weather and Climate- Prediction
Background Scaling results Tensor product geometric multigrid Summary and Outlook 1/21 Scalability of Elliptic Solvers in Numerical Weather and Climate- Prediction Eike Hermann Müller, Robert Scheichl
More informationHigh-Order Finite-Element Earthquake Modeling on very Large Clusters of CPUs or GPUs
High-Order Finite-Element Earthquake Modeling on very Large Clusters of CPUs or GPUs Gordon Erlebacher Department of Scientific Computing Sept. 28, 2012 with Dimitri Komatitsch (Pau,France) David Michea
More informationA GPU-based High-Performance Library with Application to Nonlinear Water Waves
Downloaded from orbit.dtu.dk on: Dec 20, 2017 Glimberg, Stefan Lemvig; Engsig-Karup, Allan Peter Publication date: 2012 Document Version Publisher's PDF, also known as Version of record Link back to DTU
More informationImplicit Low-Order Unstructured Finite-Element Multiple Simulation Enhanced by Dense Computation using OpenACC
Fourth Workshop on Accelerator Programming Using Directives (WACCPD), Nov. 13, 2017 Implicit Low-Order Unstructured Finite-Element Multiple Simulation Enhanced by Dense Computation using OpenACC Takuma
More informationMultigrid Pattern. I. Problem. II. Driving Forces. III. Solution
Multigrid Pattern I. Problem Problem domain is decomposed into a set of geometric grids, where each element participates in a local computation followed by data exchanges with adjacent neighbors. The grids
More informationMultigrid Solvers in CFD. David Emerson. Scientific Computing Department STFC Daresbury Laboratory Daresbury, Warrington, WA4 4AD, UK
Multigrid Solvers in CFD David Emerson Scientific Computing Department STFC Daresbury Laboratory Daresbury, Warrington, WA4 4AD, UK david.emerson@stfc.ac.uk 1 Outline Multigrid: general comments Incompressible
More informationDominik Göddeke* and Hilmar Wobker
254 Int. J. Computational Science and Engineering, Vol. 4, No. 4, 2009 Co-processor acceleration of an unmodified parallel solid mechanics code with FEASTGPU Dominik Göddeke* and Hilmar Wobker Institute
More informationGPU Computation Strategies & Tricks. Ian Buck NVIDIA
GPU Computation Strategies & Tricks Ian Buck NVIDIA Recent Trends 2 Compute is Cheap parallelism to keep 100s of ALUs per chip busy shading is highly parallel millions of fragments per frame 0.5mm 64-bit
More informationReconstruction of Trees from Laser Scan Data and further Simulation Topics
Reconstruction of Trees from Laser Scan Data and further Simulation Topics Helmholtz-Research Center, Munich Daniel Ritter http://www10.informatik.uni-erlangen.de Overview 1. Introduction of the Chair
More informationA Comparison of Algebraic Multigrid Preconditioners using Graphics Processing Units and Multi-Core Central Processing Units
A Comparison of Algebraic Multigrid Preconditioners using Graphics Processing Units and Multi-Core Central Processing Units Markus Wagner, Karl Rupp,2, Josef Weinbub Institute for Microelectronics, TU
More informationRealization of a low energy HPC platform powered by renewables - A case study: Technical, numerical and implementation aspects
Realization of a low energy HPC platform powered by renewables - A case study: Technical, numerical and implementation aspects Markus Geveler, Stefan Turek, Dirk Ribbrock PACO Magdeburg 2015 / 7 / 7 markus.geveler@math.tu-dortmund.de
More informationAlgorithms, System and Data Centre Optimisation for Energy Efficient HPC
2015-09-14 Algorithms, System and Data Centre Optimisation for Energy Efficient HPC Vincent Heuveline URZ Computing Centre of Heidelberg University EMCL Engineering Mathematics and Computing Lab 1 Energy
More informationBLAS and LAPACK + Data Formats for Sparse Matrices. Part of the lecture Wissenschaftliches Rechnen. Hilmar Wobker
BLAS and LAPACK + Data Formats for Sparse Matrices Part of the lecture Wissenschaftliches Rechnen Hilmar Wobker Institute of Applied Mathematics and Numerics, TU Dortmund email: hilmar.wobker@math.tu-dortmund.de
More informationHow to Optimize Geometric Multigrid Methods on GPUs
How to Optimize Geometric Multigrid Methods on GPUs Markus Stürmer, Harald Köstler, Ulrich Rüde System Simulation Group University Erlangen March 31st 2011 at Copper Schedule motivation imaging in gradient
More informationGradient Free Design of Microfluidic Structures on a GPU Cluster
Gradient Free Design of Microfluidic Structures on a GPU Cluster Austen Duffy - Florida State University SIAM Conference on Computational Science and Engineering March 2, 2011 Acknowledgements This work
More informationIntroduction to Parallel Computing
Introduction to Parallel Computing W. P. Petersen Seminar for Applied Mathematics Department of Mathematics, ETHZ, Zurich wpp@math. ethz.ch P. Arbenz Institute for Scientific Computing Department Informatik,
More informationParallelism. CS6787 Lecture 8 Fall 2017
Parallelism CS6787 Lecture 8 Fall 2017 So far We ve been talking about algorithms We ve been talking about ways to optimize their parameters But we haven t talked about the underlying hardware How does
More informationCUDA. Fluid simulation Lattice Boltzmann Models Cellular Automata
CUDA Fluid simulation Lattice Boltzmann Models Cellular Automata Please excuse my layout of slides for the remaining part of the talk! Fluid Simulation Navier Stokes equations for incompressible fluids
More informationHYPERDRIVE IMPLEMENTATION AND ANALYSIS OF A PARALLEL, CONJUGATE GRADIENT LINEAR SOLVER PROF. BRYANT PROF. KAYVON 15618: PARALLEL COMPUTER ARCHITECTURE
HYPERDRIVE IMPLEMENTATION AND ANALYSIS OF A PARALLEL, CONJUGATE GRADIENT LINEAR SOLVER AVISHA DHISLE PRERIT RODNEY ADHISLE PRODNEY 15618: PARALLEL COMPUTER ARCHITECTURE PROF. BRYANT PROF. KAYVON LET S
More information