Finite Element Multigrid Solvers for PDE Problems on GPUs and GPU Clusters

1 Finite Element Multigrid Solvers for PDE Problems on GPUs and GPU Clusters Robert Strzodka Integrative Scientific Computing Max Planck Institut Informatik ~strzodka Dominik Göddeke Institute for Applied Mathematics Technical University of Dortmund ~goeddeke

2 Structure of Double Lecture (2 x 90 min)
PART 1: Parallelism, Grid Discretization, Multigrid & Smoothers, Mixed-Precision, Data Layout
PART 2: FEM on GPU Clusters, New MPI Application (Rewrite), Legacy MPI Code (Accelerate)
INRIA Sophia Antipolis Jun 2011

3 Overview Levels of Parallelism Grid Discretizations of PDEs Multigrid and Strong Smoothers Mixed Precision Iterative Refinement Layout of Multi-valued Data INRIA Sophia Antipolis Jun

4 Parallelism
Sequential execution is an illusionary software concept: all transistors always do something in parallel! Billions of transistors in modern CPUs (>0.5 billion) and GPUs (>2 billion).
Old: implicit parallelism with caches, ILP, speculation; diminishing returns, power constraints.
New: explicit parallelism on multiple levels; much more efficient and natural.
INRIA Sophia Antipolis Jun 2011

5 SIMD Parallelism
[Figure: one instruction stream drives SIMD Units 0-15, each operating on its own Data 0-15]
It is impossible to execute just one instruction: a = b + c; actually means execute (add, nop, nop, nop, ...).
Penalty for ignoring SIMD: 4x on current CPUs (SSE), 8-16x on future CPUs (AVX, LRB), 16-80x on GPUs.
INRIA Sophia Antipolis Jun 2011
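
A minimal C++ sketch of the point above (my own illustration, assuming an x86 compiler with SSE via <immintrin.h>; not part of the lecture material): the scalar statement a = b + c occupies only one SIMD lane, whereas one vector instruction adds four floats at once.

    #include <immintrin.h>  // SSE intrinsics

    // Scalar version: one addition per loop iteration, remaining SIMD lanes idle
    // unless the compiler vectorizes the loop on its own.
    void add_scalar(const float* b, const float* c, float* a, int n) {
        for (int i = 0; i < n; ++i)
            a[i] = b[i] + c[i];
    }

    // Explicit SIMD version: _mm_add_ps processes 4 floats per instruction.
    void add_sse(const float* b, const float* c, float* a, int n) {
        int i = 0;
        for (; i + 4 <= n; i += 4) {
            __m128 vb = _mm_loadu_ps(b + i);
            __m128 vc = _mm_loadu_ps(c + i);
            _mm_storeu_ps(a + i, _mm_add_ps(vb, vc));
        }
        for (; i < n; ++i)  // scalar remainder
            a[i] = b[i] + c[i];
    }

On wider units (AVX, GPUs) the same pattern applies with 8, 16 or more lanes per instruction, which is where the quoted penalties come from.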

6 Many-Core Parallelism
[Figure: many-core architecture: host, input assembler, execution manager; many units, each with local memory and cache, connected via load/store to global memory]
Penalty for ignoring many-cores: 4-8x on current CPUs, 32-48x on future CPUs (LRB), 10-30x on GPUs.
INRIA Sophia Antipolis Jun 2011
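
As a small illustration of explicit many-core parallelism (a sketch with std::thread, not taken from the slides): the work of a vector update is split into chunks, one per hardware core.

    #include <algorithm>
    #include <cstddef>
    #include <thread>
    #include <vector>

    // y += alpha * x, split across all available hardware threads.
    void parallel_axpy(double alpha, const std::vector<double>& x, std::vector<double>& y) {
        const unsigned num_threads = std::max(1u, std::thread::hardware_concurrency());
        const std::size_t n = y.size(), chunk = (n + num_threads - 1) / num_threads;
        std::vector<std::thread> workers;
        for (unsigned t = 0; t < num_threads; ++t) {
            const std::size_t lo = t * chunk, hi = std::min(n, lo + chunk);
            workers.emplace_back([&, lo, hi] {
                for (std::size_t i = lo; i < hi; ++i)   // each thread owns one contiguous chunk
                    y[i] += alpha * x[i];
            });
        }
        for (auto& w : workers) w.join();               // wait for all cores
    }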

7 Intra-Node Parallelism (multiple CPUs/GPUs per PC)
[Figure: multi-CPU system with CPU1-CPU4; multi-GPU system with GPU1-GPU4]
Penalty for ignoring intra-node parallelism: 4-8x for multi-CPU systems, 4-8x for multi-GPU systems.
INRIA Sophia Antipolis Jun 2011

8 Inter-Node Parallelism in a Cluster
[Figure: cluster of PCs]
Penalty for ignoring inter-node parallelism: depends on the number of nodes in the cluster (capability computing).
INRIA Sophia Antipolis Jun 2011

9 Bandwidth in a CPU-GPU System
[Figure: bandwidth hierarchy. CPU socket (4-12 cores): cores with ALUs/FPUs and on-chip memories, roughly 40 GB/s to system memory. High-end GPU socket: ALUs/FPUs with about 2 TB/s of on-chip memory bandwidth and about 200 GB/s to device memory. System memory and device memory are linked by a comparatively slow bus of only a few GB/s.]
INRIA Sophia Antipolis Jun 2011

10 Overview Levels of Parallelism Grid Discretizations of PDEs Multigrid and Strong Smoothers Mixed Precision Iterative Refinement Layout of Multi-valued Data INRIA Sophia Antipolis Jun

11 Generalized Poisson Problem
We seek a function u(x): Ω → R^m, Ω ⊂ R^d, which satisfies
  div(G ∇u) = f in Ω (interior),
  ν·∇u = b_N or u = b_D on ∂Ω (boundary).
We consider the scalar (m=1) 2D case (d=2) with operator anisotropies. Given a vector field (v_1(x,y), v_2(x,y)) we define:
INRIA Sophia Antipolis Jun 2011

12 Discretization Grids
Equidistant grid: topology implicit, geometry implicit, access direct.
Generalized tensor-product grid: topology implicit, geometry explicit, access direct.
INRIA Sophia Antipolis Jun 2011

13 Discretization Grids
Adaptive grid: topology explicit, geometry implicit/explicit, access via hash, tree or page table.
Unstructured grid: topology explicit, geometry explicit, access via index array.
INRIA Sophia Antipolis Jun 2011

14 nD Arrays: Generalized Tensor-Product Grid
Topology implicit, geometry implicit/explicit, access direct.
Pros: simple implementation; implicit topology saves precious global bandwidth; allows efficient on-chip stencil operations; good SIMD utilization.
Cons: topology constraints make object modeling more difficult; element form may require a more powerful solver.
INRIA Sophia Antipolis Jun 2011
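
A short sketch of what "topology implicit, access direct" buys on a tensor-product grid (illustrative, not lecture code): neighbours are reached by index arithmetic alone, so no connectivity arrays need to be stored or read.

    #include <cstddef>

    // Node (i,j) of an nx-by-ny tensor-product grid lives at position j*nx + i.
    inline std::size_t node(std::size_t i, std::size_t j, std::size_t nx) {
        return j * nx + i;
    }

    // For an interior node, the stencil neighbours follow without any lookup:
    //   west  = node(i-1, j, nx)    east  = node(i+1, j, nx)
    //   south = node(i, j-1, nx)    north = node(i, j+1, nx)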

15 Deformation Adaptivity This grid is a tensor-product! Easier to accelerate in hardware than resolution adaptive grids Anisotropy level determines optimal solver INRIA Sophia Antipolis Jun 2011

16 Index Arrays: Adaptive/Unstructured Grid
Topology explicit, geometry implicit/explicit, access indirect.
Pros: general scheme for arbitrary node arrangements; only one indirection for local data access; clever node numberings preserve some data locality.
Cons: expensive encoding of topology/connectivity; no global view of the structure; difficult to handle dynamic changes in parallel.
INRIA Sophia Antipolis Jun 2011

17 Hash: Adaptive/Unstructured Grid
Topology explicit, geometry implicit/explicit, access indirect.
Pros: general scheme for arbitrary node arrangements; only one indirection for arbitrary global data access; perfect hashes have no collisions, thus good SIMD use.
Cons: expensive encoding of topology/connectivity; hashes tend to produce fine-grained random data access; perfect hash generation too expensive for dynamic changes.
INRIA Sophia Antipolis Jun 2011

18 Tree: Adaptive Grid
Topology explicit, geometry implicit/explicit, access indirect.
Pros: allows refinement of arbitrary depth; compact encoding of global topology/connectivity; allows dynamic changes in parallel.
Cons: several memory indirections in data access; difficult to extract SIMD parallelism.
INRIA Sophia Antipolis Jun 2011

19 Structured and Unstructured Sparse MatVec
[Chart: larger is better; chart from [Bell and Garland SC 2009]]
INRIA Sophia Antipolis Jun 2011

20 Overview Levels of Parallelism Grid Discretizations of PDEs Multigrid and Strong Smoothers Mixed Precision Iterative Refinement Layout of Multi-valued Data INRIA Sophia Antipolis Jun

21 Generalized Poisson Problem
We seek a function u(x): Ω → R^m, Ω ⊂ R^d, which satisfies
  div(G ∇u) = f in Ω (interior),
  ν·∇u = b_N or u = b_D on ∂Ω (boundary).
We consider the scalar (m=1) 2D case (d=2) with operator anisotropies. Given a vector field (v_1(x,y), v_2(x,y)) we define:
INRIA Sophia Antipolis Jun 2011

22 Discretization Approach
Finite Differences, Finite Volumes or Finite Elements: each leads to a linear system Ax = b.
For 2D linear FEM, A is a 9-band matrix.
INRIA Sophia Antipolis Jun 2011
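
To make the 9-band structure concrete, a hedged sketch (my own illustration): on an nx-by-ny tensor-product grid, y = A x can be applied as a 9-point stencil whose coefficients are exactly the 9 matrix bands.

    #include <vector>

    // y = A*x for a 9-band matrix stored as 9 coefficient arrays, one value per node and band
    // (lower-left, lower, lower-right, left, centre, right, upper-left, upper, upper-right).
    // Interior nodes only; boundary rows omitted for brevity.
    void matvec_9band(const std::vector<std::vector<double>>& band,  // band[0..8], each nx*ny long
                      const std::vector<double>& x, std::vector<double>& y, int nx, int ny) {
        for (int j = 1; j < ny - 1; ++j) {
            for (int i = 1; i < nx - 1; ++i) {
                const int k = j * nx + i;
                y[k] = band[0][k] * x[k - nx - 1] + band[1][k] * x[k - nx] + band[2][k] * x[k - nx + 1]
                     + band[3][k] * x[k - 1]      + band[4][k] * x[k]      + band[5][k] * x[k + 1]
                     + band[6][k] * x[k + nx - 1] + band[7][k] * x[k + nx] + band[8][k] * x[k + nx + 1];
            }
        }
    }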

23 Geometric Multigrid Method
Linear equation system after discretization: Ax = b.
Observation: basic solvers quickly reduce the high-frequency error components, but struggle with low frequencies.
Idea: solve the system on a pyramid of grids, thus dealing with different frequencies one after another.
  d_k = b - A x_k       (fine grid)
  A c_k = d_k           (coarse grid)
  x_{k+1} = x_k + c_k   (back on fine grid)
INRIA Sophia Antipolis Jun 2011

24 Multigrid Transfers: Restriction
Interpolate values from the fine into the coarse array: a local weighted gather operation.
[Figure: coarse node i gathers from fine nodes 2i-1, 2i, 2i+1 (adjust index to read neighbors) into the output region of the coarse result array]
INRIA Sophia Antipolis Jun 2011

25 Multigrid Transfers: Prolongation
Scatter values from the coarse into the fine array with a weighting stencil: a local weighted scatter operation.
INRIA Sophia Antipolis Jun 2011
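
Slides 23-25 combined into one self-contained sketch (my own illustrative code, not the lecture's solver, which works on 2D 9-band FEM matrices): a V(2,2) cycle for the 1D Poisson model problem with damped Jacobi smoothing, full-weighting restriction and linear prolongation.

    #include <vector>
    using Vec = std::vector<double>;

    // 1D Poisson model problem: A_h = (1/h^2) * tridiag(-1, 2, -1), n interior nodes, h = 1/(n+1).
    static Vec residual(const Vec& x, const Vec& b) {                 // r = b - A_h x
        const int n = (int)x.size(); const double h2 = 1.0 / ((n + 1.0) * (n + 1.0));
        Vec r(n);
        for (int i = 0; i < n; ++i) {
            const double left  = (i > 0)     ? x[i - 1] : 0.0;
            const double right = (i < n - 1) ? x[i + 1] : 0.0;
            r[i] = b[i] - (2.0 * x[i] - left - right) / h2;
        }
        return r;
    }

    static void jacobi(Vec& x, const Vec& b, int sweeps, double omega = 2.0 / 3.0) {
        const int n = (int)x.size(); const double h2 = 1.0 / ((n + 1.0) * (n + 1.0));
        for (int s = 0; s < sweeps; ++s) {                            // damped Jacobi smoothing
            const Vec r = residual(x, b);
            for (int i = 0; i < n; ++i) x[i] += omega * 0.5 * h2 * r[i];   // D^{-1} = h^2/2
        }
    }

    static Vec restrict_fw(const Vec& f) {                            // full weighting: fine -> coarse
        const int nc = ((int)f.size() - 1) / 2;
        Vec c(nc);
        for (int i = 0; i < nc; ++i)
            c[i] = 0.25 * (f[2 * i] + 2.0 * f[2 * i + 1] + f[2 * i + 2]);
        return c;
    }

    static Vec prolongate(const Vec& c, int nf) {                     // linear interpolation: coarse -> fine
        Vec f(nf, 0.0);
        for (int i = 0; i < (int)c.size(); ++i) {
            f[2 * i] += 0.5 * c[i];  f[2 * i + 1] += c[i];  f[2 * i + 2] += 0.5 * c[i];
        }
        return f;
    }

    void v_cycle(Vec& x, const Vec& b) {                              // one V(2,2) cycle, n = 2^k - 1
        const int n = (int)x.size();
        if (n < 3) { jacobi(x, b, 50); return; }                      // "coarsest grid": iterate to convergence
        jacobi(x, b, 2);                                              // pre-smoothing
        Vec d  = residual(x, b);                                      // d_k = b - A x_k      (fine grid)
        Vec dc = restrict_fw(d);
        Vec cc(dc.size(), 0.0);
        v_cycle(cc, dc);                                              // A c_k = d_k          (coarse grid)
        Vec corr = prolongate(cc, n);
        for (int i = 0; i < n; ++i) x[i] += corr[i];                  // x_{k+1} = x_k + c_k  (back on fine grid)
        jacobi(x, b, 2);                                              // post-smoothing
    }

Calling v_cycle repeatedly on the finest level (n = 2^k - 1 interior nodes) reduces the defect by a grid-independent factor per cycle.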

26 Preconditioners
Damped, preconditioned defect correction for Ax = b:  x_{k+1} = x_k + ω C^{-1} (b - A x_k)
Splitting of A:  A = (LL+LD+LU) + (DL+DD+DU) + (UL+UD+UU)
  JACOBI:    C = DD
  GSROW:     C = (LL+LD+LU) + (DL+DD)
  TRIDI:     C = (DL+DD+DU)
  TRIGSROW:  C = (LL+LD+LU) + (DL+DD+DU)
INRIA Sophia Antipolis Jun 2011
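
A hedged sketch of one such step for the simplest choice C = DD (Jacobi), with A in compressed sparse row form (illustrative code, not the lecture's implementation; the stronger smoothers on the next slides replace the diagonal solve by a Gauss-Seidel or tridiagonal solve):

    #include <vector>

    // One step of damped, preconditioned defect correction
    //   x_{k+1} = x_k + omega * C^{-1} (b - A x_k)   with   C = DD = diag(A).
    void defect_correction_step(const std::vector<int>& row_ptr, const std::vector<int>& col_idx,
                                const std::vector<double>& val,          // CSR storage of A
                                const std::vector<double>& diag,         // DD: diagonal of A
                                const std::vector<double>& b, std::vector<double>& x, double omega) {
        const int n = (int)x.size();
        std::vector<double> d(n);
        for (int i = 0; i < n; ++i) {                                    // defect d = b - A x
            double ax = 0.0;
            for (int k = row_ptr[i]; k < row_ptr[i + 1]; ++k) ax += val[k] * x[col_idx[k]];
            d[i] = b[i] - ax;
        }
        for (int i = 0; i < n; ++i)                                      // x += omega * C^{-1} d
            x[i] += omega * d[i] / diag[i];
    }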

27 CPU Numerical and Runtime Efficiency
[Chart: Iter.Ref.(double) MG(float) V(2,2) vs. CG]
INRIA Sophia Antipolis Jun 2011

28 Gauss-Seidel Preconditioner
Damped, preconditioned defect correction for Ax = b:  x_{k+1} = x_k + ω C^{-1} (b - A x_k)
GSROW:  C = (LL+LD+LU) + (DL+DD)
INRIA Sophia Antipolis Jun 2011

29 Multi-Colored Gauss-Seidel INRIA Sophia Antipolis Jun
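
The slide's illustration is not reproduced here; as a sketch of the multi-coloring idea (using a 5-point Laplacian stencil instead of the lecture's 9-band FEM matrices), a red-black Gauss-Seidel sweep updates all nodes of one color in parallel, because same-colored nodes do not depend on each other:

    #include <vector>

    // One red-black Gauss-Seidel sweep for -u_W - u_E - u_S - u_N + 4*u_C = f
    // on an nx-by-ny grid. Within each color, all updates are independent and
    // could be one GPU thread per node. Boundary values are held fixed.
    void redblack_gs_sweep(std::vector<double>& u, const std::vector<double>& f, int nx, int ny) {
        for (int color = 0; color < 2; ++color) {
            for (int j = 1; j < ny - 1; ++j) {
                for (int i = 1; i < nx - 1; ++i) {
                    if (((i + j) & 1) != color) continue;   // only nodes of the current color
                    const int k = j * nx + i;
                    u[k] = 0.25 * (f[k] + u[k - 1] + u[k + 1] + u[k - nx] + u[k + nx]);
                }
            }
        }
    }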

30 ADI-TRIDI Preconditioner
Damped, preconditioned defect correction for Ax = b:  x_{k+1} = x_k + ω C^{-1} (b - A x_k)
TRIDI-ROW:  C = (DL+DD+DU);  TRIDI-COLUMN: the same splitting along the other grid direction.
INRIA Sophia Antipolis Jun 2011

31 SIMD Parallelism: Cyclic Reduction
[Figure: per-level operation counts for cyclic reduction, O(N) total work, versus a fully parallel variant with O(N*logN) work]
INRIA Sophia Antipolis Jun 2011

32 Memory Friendly Cyclic Reduction [Göddeke et al. Cyclic Reduction Tridiagonal Solvers on GPUs Applied to Mixed Precision Multigrid, TPDS 2011] INRIA Sophia Antipolis Jun
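
For reference, a compact serial sketch of plain cyclic reduction for a single tridiagonal system (my own illustration of the algorithm behind the TRIDI preconditioner; the memory-friendly GPU variant of the paper above is considerably more involved):

    #include <vector>

    // Solve a[i]*x[i-1] + b[i]*x[i] + c[i]*x[i+1] = d[i] for 0 <= i < n with n = 2^q - 1
    // and a[0] = c[n-1] = 0. The coefficient arrays are taken by value and overwritten.
    // On a GPU, each inner loop runs in parallel (one thread per reduced equation).
    std::vector<double> cyclic_reduction(std::vector<double> a, std::vector<double> b,
                                         std::vector<double> c, std::vector<double> d) {
        const int n = (int)b.size();
        for (int half = 1; half < (n + 1) / 2; half *= 2) {        // forward reduction
            for (int i = 2 * half - 1; i < n; i += 2 * half) {
                const int im = i - half, ip = i + half;
                const double alpha = -a[i] / b[im], beta = -c[i] / b[ip];
                b[i] += alpha * c[im] + beta * a[ip];
                d[i] += alpha * d[im] + beta * d[ip];
                a[i]  = alpha * a[im];
                c[i]  = beta  * c[ip];
            }
        }
        std::vector<double> x(n, 0.0);
        x[(n - 1) / 2] = d[(n - 1) / 2] / b[(n - 1) / 2];          // single remaining unknown
        for (int half = (n + 1) / 4; half >= 1; half /= 2) {       // backward substitution
            for (int i = half - 1; i < n; i += 2 * half) {
                const double xm = (i - half >= 0) ? x[i - half] : 0.0;
                const double xp = (i + half <  n) ? x[i + half] : 0.0;
                x[i] = (d[i] - a[i] * xm - c[i] * xp) / b[i];
            }
        }
        return x;
    }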

33 ADI-TRIGS Preconditioner
Damped, preconditioned defect correction for Ax = b:  x_{k+1} = x_k + ω C^{-1} (b - A x_k)
TRIGS-ROW:  C = (LL+LD+LU) + (DL+DD+DU);  TRIGS-COLUMN: the same splitting along the other grid direction.
INRIA Sophia Antipolis Jun 2011

34 Many-Core Parallelism
[Figure: many-core architecture (host, input assembler, execution manager, units with local memories and caches, global memory) with labels: full/serial coupling, 4-way coupling, 2-way coupling]
INRIA Sophia Antipolis Jun 2011

35 CPU vs. GPU Numerical Efficiency
[Chart: Iter.Ref.(double) MG(float) V(2,2) vs. CG]
INRIA Sophia Antipolis Jun 2011

36 CPU vs. GPU Runtime Efficiency
[Chart: runtimes on Core2Duo E6750, GTX280, Tesla C2070, Westmere-EP i7 X980]
INRIA Sophia Antipolis Jun 2011

37 Multigrid on Refined Unstructured Grid: FE-gMG
Unstructured grid, regular refinement, restriction & prolongation as MatVec, SPAI preconditioner, order & storage (ELL).
INRIA Sophia Antipolis Jun 2011
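
As background for the "order & storage (ELL)" point, a minimal sketch of a sparse matrix-vector product in ELLPACK (ELL) format (illustrative, not the FE-gMG code): every row stores the same padded number of entries, so consecutive rows can be processed in lockstep with contiguous memory access.

    #include <vector>

    // y = A*x with A in ELL format: each of the num_rows rows holds exactly max_nnz entries,
    // padded with zero values. Column-major layout (entry e of row r at e*num_rows + r) makes
    // the accesses of consecutive rows contiguous, which suits SIMD lanes and GPU threads.
    void spmv_ell(int num_rows, int max_nnz,
                  const std::vector<int>& col, const std::vector<double>& val,   // both max_nnz*num_rows
                  const std::vector<double>& x, std::vector<double>& y) {
        for (int r = 0; r < num_rows; ++r) {
            double sum = 0.0;
            for (int e = 0; e < max_nnz; ++e) {
                const int k = e * num_rows + r;
                sum += val[k] * x[col[k]];
            }
            y[r] = sum;
        }
    }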

38 FE-gMG Results with SPAI Preconditioner INRIA Sophia Antipolis Jun

39 Overview Levels of Parallelism Grid Discretizations of PDEs Multigrid and Strong Smoothers Mixed Precision Iterative Refinement Layout of Multi-valued Data INRIA Sophia Antipolis Jun

40 Precision Comparison: Numerical Algorithms vs. GPU Hardware
  float s23e8 (int 32bit)  vs.  double s52e11 (long 64bit):
  Bandwidth:  1 word        vs.  2 words
  Storage:    1 word        vs.  2 words
  Operator+:  1 adder       vs.  2 adders
  Operator*:  1 multiplier  vs.  4 multipliers
INRIA Sophia Antipolis Jun 2011

41 Hardware Precision
float s23e8 (23 bit mantissa) vs. double s52e11 (52 bit mantissa).
Data error: 1/2 = 0.5 is represented exactly in both formats; 1/3 can only be stored rounded (to roughly 7 decimal digits in float, roughly 16 in double).
Roundoff error: each operation returns the rounded exact result, f_fl(a,b) = fl(f(a,b)), with relative error on the order of 1e-8 in float and 1e-15 in double.
INRIA Sophia Antipolis Jun 2011

42 The Erratic Roundoff Error
Roundoff error for 0 = f(a) := (1+a)^3 - (1+3a^2) - (3a+a^3), evaluated in single and double precision.
[Plot: y = log2(|f(a)|) over x = log2(1/a) with a = 1/2^x; smaller is better]
INRIA Sophia Antipolis Jun 2011
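
The plot is easy to reproduce; a small program evaluating the expression in single and double precision (mathematically f(a) = 0 for every a, so everything printed is pure roundoff error):

    #include <cmath>
    #include <cstdio>

    // f(a) = (1+a)^3 - (1+3a^2) - (3a+a^3) is identically zero in exact arithmetic.
    int main() {
        for (int e = 1; e <= 24; ++e) {
            const double a  = std::ldexp(1.0, -e);                        // a = 1 / 2^e
            const float  af = (float)a;
            const float  ff = (1.0f + af) * (1.0f + af) * (1.0f + af)
                            - (1.0f + 3.0f * af * af) - (3.0f * af + af * af * af);
            const double fd = (1.0 + a) * (1.0 + a) * (1.0 + a)
                            - (1.0 + 3.0 * a * a) - (3.0 * a + a * a * a);
            std::printf("a = 2^-%-2d   float: %+.3e   double: %+.3e\n", e, (double)ff, fd);
        }
        return 0;
    }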

43 Numerical Accuracy
float s23e8 (23 bit) vs. double s52e11 (52 bit).
Condition of Ax = b: the float solution satisfies a nearby system A x_fl = b + Δb_ε, with relative error ||x - x_fl|| / ||x|| ≈ c(A)·ε; in double, A x_db = b + Δb_δ with ||x - x_db|| / ||x|| ≈ c(A)·δ, where ε and δ are the machine precisions of the two formats.
Discretization error: div(G ∇u) = f is discretized as Ax = b, and err = ||u - x|| comes from the discretization itself.
INRIA Sophia Antipolis Jun 2011

44 Mixed Precision Iterative Refinement
float s23e8 (23 bit), double s52e11 (52 bit). Condition of Ax = b: solving entirely in float limits the relative accuracy to about c(A)·ε.
Iterative refinement for Ax = b:
  d_k = b - A x_k       compute in high precision (cheap)
  A c_k = d_k           solve in low precision (fast)
  x_{k+1} = x_k + c_k   correct in high precision (cheap)
  k = k+1               iterate until convergence in high precision
Fixed point view: x_{l+1} = F(A, b, x_l).
INRIA Sophia Antipolis Jun 2011
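
The loop of this slide written out as a hedged sketch (A kept as a small dense matrix for brevity; on the slides the low-precision solve is a float multigrid cycle, for which a few damped Jacobi sweeps stand in here):

    #include <vector>

    // Mixed precision iterative refinement for Ax = b.
    // Defect and correction in double (cheap, accurate); approximate inner solve in float (fast).
    void mixed_precision_solve(const std::vector<std::vector<double>>& A,
                               const std::vector<double>& b, std::vector<double>& x,
                               int outer_iters, int inner_sweeps) {
        const int n = (int)b.size();
        for (int k = 0; k < outer_iters; ++k) {
            std::vector<double> d(n);                                 // d_k = b - A x_k (double)
            for (int i = 0; i < n; ++i) {
                double ax = 0.0;
                for (int j = 0; j < n; ++j) ax += A[i][j] * x[j];
                d[i] = b[i] - ax;
            }
            std::vector<float> c(n, 0.0f);                            // solve A c_k = d_k approximately (float)
            for (int s = 0; s < inner_sweeps; ++s) {
                std::vector<float> cn(n);
                for (int i = 0; i < n; ++i) {
                    float sum = (float)d[i];
                    for (int j = 0; j < n; ++j)
                        if (j != i) sum -= (float)A[i][j] * c[j];
                    cn[i] = 0.8f * sum / (float)A[i][i] + 0.2f * c[i];   // damped Jacobi sweep
                }
                c = cn;
            }
            for (int i = 0; i < n; ++i) x[i] += (double)c[i];         // x_{k+1} = x_k + c_k (double)
        }
    }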

45 Mixed Precision Multigrid on GPU
[Chart: smaller is better; Core2Duo, G80-CPU, GT200-CPU, GT200-db, GT200-GPU]
INRIA Sophia Antipolis Jun 2011

46 Overview Levels of Parallelism Grid Discretizations of PDEs Multigrid and Strong Smoothers Mixed Precision Iterative Refinement Layout of Multi-valued Data INRIA Sophia Antipolis Jun

47 Multi-Valued Data Multi-valued data is ubiquitous Class 1: mathematical properties, e.g. multiple derivatives or moments Class 2: discrete features, e.g. colors of a pixel, multiple scores Class 3: per-dimension properties, e.g. coordinates, velocities on a 3D grid INRIA Sophia Antipolis Jun 2011

48 AoS and SoA
Array of Structs (AoS):
  struct NormalStruct {
    Type1 comp1;
    Type2 comp2;
    Type3 comp3;
  };
  typedef NormalStruct AoSContainer[SIZE];
  AoSContainer container;
  container[5].comp3++;
Struct of Arrays (SoA):
  struct SoAContainer {
    Type1 comp1[SIZE];
    Type2 comp2[SIZE];
    Type3 comp3[SIZE];
  };
  SoAContainer container;
  container.comp3[5]++;
INRIA Sophia Antipolis Jun 2011

49 Parallel Access in AoS and SoA
Accessing one component across elements:  AoS: container[1].comp3, container[2].comp3, container[3].comp3, container[4].comp3 (strided)  vs.  SoA: container.comp3[1], container.comp3[2], container.comp3[3], container.comp3[4] (contiguous).
Accessing all components of one element:  AoS: container[1].comp{1,2,3}, container[5].comp{1,2,3} (contiguous)  vs.  SoA: container.comp{1,2,3}[1], container.comp{1,2,3}[5] (scattered across arrays).
INRIA Sophia Antipolis Jun 2011
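
To make the difference concrete, a small sketch (with Type1..Type3 replaced by float for illustration): updating only comp3 for all elements touches contiguous memory in the SoA layout, but strided memory in the AoS layout.

    constexpr int SIZE = 1024;

    struct NormalStruct { float comp1, comp2, comp3; };                   // AoS element
    struct SoAContainer { float comp1[SIZE], comp2[SIZE], comp3[SIZE]; };

    void scale_comp3_aos(NormalStruct* container, float s) {
        for (int i = 0; i < SIZE; ++i)
            container[i].comp3 *= s;       // jumps sizeof(NormalStruct) bytes per element
    }

    void scale_comp3_soa(SoAContainer& container, float s) {
        for (int i = 0; i < SIZE; ++i)
            container.comp3[i] *= s;       // unit-stride floats: SIMD- and coalescing-friendly
    }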

50 Operating on Container Elements
  struct NormalStruct {
    Type1 comp1;
    Type2 comp2;
    Type3 comp3;
  };
  typedef NormalStruct AoSContainer[SIZE];
  NormalStruct single;
  AoSContainer container;
In-place update of single and indexed structs:
  void update( NormalStruct& s ) { s.comp3 += s.comp1; }
  update( single );        // OK
  update( container[5] );  // OK
This is not possible with standard SoA/C++ syntax: container.comp3(5);
INRIA Sophia Antipolis Jun 2011

51 Abstraction: AoS + SoA = ASA
Array of Structs (AoS):
  struct NormalStruct {
    Type1 comp1;
    Type2 comp2;
    Type3 comp3;
  };
  typedef NormalStruct Container[SIZE];
  NormalStruct single;
  Container container;
  void update(NormalStruct& s);
  container[index].comp3++;
  update( container[5] );
Array of Structs of Arrays (ASA):
  template <ID t_id=id_value>
  struct FlexibleStruct {
    typedef ASAGroup<Type1,t_id> ASX_ASA;
    union{ Type1 comp1; ASX_ASA d1; };
    union{ Type2 comp2; ASX_ASA d2; };
    union{ Type3 comp3; ASX_ASA d3; };
  };
  typedef ASX::Array<FlexibleStruct, SIZE, ???> Container;   // ??? = ASX::AOS or ASX::SOA
  FlexibleStruct<> single;
  Container container;
  template <ID t_id> void update(FlexibleStruct<t_id>& s);
  container[index].comp3++;
  update( container[5] );
[Strzodka. Abstraction for AoS and SoA layout in C++. GCG 2011]
INRIA Sophia Antipolis Jun 2011

52 Overview
Levels of Parallelism: explicit exploitation of all levels (SIMD, core, socket, cluster).
Grid Discretizations of PDEs: regularity of memory access.
Multigrid and Strong Smoothers: balancing numerical and hardware requirements.
Mixed Precision Iterative Refinement: same accuracy with faster computation.
Layout of Multi-valued Data: choice of layout determines memory access patterns.
INRIA Sophia Antipolis Jun 2011

53 Questions? Robert Strzodka Integrative Scientific Computing Max Planck Institut Informatik ~strzodka Dominik Göddeke Institute for Applied Mathematics Technical University of Dortmund ~goeddeke
