Massively Parallel Finite Element Simulations with deal.II

Timo Heister, Texas A&M University
SIAM PP2012
Joint work with: Wolfgang Bangerth, Carsten Burstedde, Thomas Geenen, Martin Kronbichler

Content
1. Introduction
2. Data structures and algorithms: Overview; Triangulation; Distributing the Degrees of Freedom; Linear Algebra, Postprocessing
3. Numerical Results: scalability tests; Mantle Convection
4. Conclusions

Motivation: Mantle Convection
- material (rock) is mostly solid
- slow velocities (cm/year)
- driven by temperature differences
(figure source: Wikipedia)

Motivation
(Figure: temperature snapshot, 700 000 degrees of freedom, 2d simulation)
Requirements:
- complex coupled equations and material laws
- 3d simulations, adaptive meshes
- large number of unknowns (100 million or more)

Goals and Background
Goals:
1. Finite elements with adaptive mesh refinement
2. scalable: cores, 1 billion+ unknowns
3. flexible: higher order, multiphysics, reuse existing software
Done in deal.II and available today, but described in a generic way.
References:
- Bangerth and Kanschat. deal.II Differential Equations Analysis Library, Technical Reference.
- Bangerth, Burstedde, Heister, and Kronbichler. Algorithms and Data Structures for Massively Parallel Generic Finite Element Codes. ACM Trans. Math. Softw., 38(2).

Requirements for Scalability
- distributed data storage everywhere: needs special data structures
- efficient algorithms: must not depend on the total problem size
- localize and hide communication: point-to-point MPI, nonblocking sends and receives

Status of Parallelization
Often (in deal.II before, and in many other libraries):
- only the linear algebra is parallel (matrix, vector, solvers)
- duplicated data: mesh, degrees of freedom, ...
- algorithms that do not scale: mesh handling, DoFs, ...
Consequence: not efficient for > 100 cores or larger problems

Overview of data structures and algorithms
What needs to be parallelized:
1. Triangulation (mesh with all associated data). Hard: distributed storage, new algorithms
2. DoFHandler (manages degrees of freedom, global numbering, ...). Hard: find a global numbering of DoFs
3. Linear Algebra (matrices, vectors, solvers, preconditioners). Easy: use an existing library
4. Postprocessing (error estimation, solution transfer, output, ...). Medium: do work on the local mesh, communicate
(Diagram: Triangulation; unit cell; DoFHandler; linear algebra; post processing; Finite Element, Quadratures, Mapping, ...)

Triangulation
p4est library: parallel quad-/octrees
- stores refinement flags from a base mesh
- based on space-filling curves
- very good scalability
Reference: Burstedde, Wilcox, and Ghattas. p4est: Scalable algorithms for parallel adaptive mesh refinement on forests of octrees. SIAM J. Sci. Comput., 33(3), 2011.

Triangulation
- partitioning is cheap and simple (figure: partitions #1 and #2)
Then:
- take the p4est refinement information
- recreate the rich deal.II Triangulation only for local cells (stores coordinates, connectivity, faces, materials, ...)
- how? recursive queries to p4est
- also create a ghost layer (one layer of cells around the locally owned ones)
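The slide does not spell out the curve or the splitting rule, so the following is only a minimal sketch of why partitioning along a space-filling curve is cheap. It assumes a Morton (z-order) curve on a uniform 2d grid; the names `morton_index` and `partition_cells` are hypothetical and are not part of the p4est or deal.II API.

```python
def morton_index(x, y, bits=16):
    """Interleave the bits of integer cell coordinates (x, y) into a z-order key."""
    key = 0
    for b in range(bits):
        key |= ((x >> b) & 1) << (2 * b)       # x bits go to even positions
        key |= ((y >> b) & 1) << (2 * b + 1)   # y bits go to odd positions
    return key


def partition_cells(cells, n_ranks):
    """Sort cells along the curve, then cut the list into contiguous pieces.

    Rank r receives the slice [r*n // n_ranks, (r+1)*n // n_ranks), so the
    piece sizes differ by at most one cell.
    """
    ordered = sorted(cells, key=lambda c: morton_index(*c))
    n = len(ordered)
    return [ordered[r * n // n_ranks:(r + 1) * n // n_ranks]
            for r in range(n_ranks)]


# Illustration: a 4 x 4 grid of cells split among 4 ranks.
cells = [(x, y) for x in range(4) for y in range(4)]
parts = partition_cells(cells, 4)
```

Because each piece is a contiguous range of the sorted curve, a rank only needs the first key of every piece to know who owns a given cell. For this 4 x 4 grid, each rank ends up with one 2 x 2 quadrant.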

Example: Distributed Mesh Storage
(Figure; color: owned by CPU id)

Distributing the Degrees of Freedom (DoFs)
Sketch: create a global numbering for all DoFs
- reason: identify shared ones
- problem: no knowledge about the whole mesh
1. decide on ownership of DoFs on the interface (no communication!)
2. enumerate locally (only own DoFs)
3. shift indices to make them globally unique (only communicate local quantities)
4. exchange indices to ghost neighbors
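The four steps can be illustrated with a small single-process simulation. This is a hedged sketch, not the deal.II implementation: it assumes a 1d mesh with linear elements, at least one cell per rank, and a tie-break in which an interface vertex belongs to the lower rank. The function name `distribute_dofs_1d` is hypothetical. In a real MPI code, the shift in step 3 would typically come from an exclusive prefix sum over the per-rank counts.

```python
from itertools import accumulate


def distribute_dofs_1d(cells_per_rank):
    """Number vertex DoFs of a 1d mesh whose cells are split across ranks.

    cells_per_rank[r] is the number of consecutive cells held by rank r
    (assumed >= 1). Returns (owned, ghosts): the global indices each rank
    owns, and the indices it must receive from its left neighbor.
    """
    n_ranks = len(cells_per_rank)

    # Step 1: ownership, decided without communication. Assumed rule: the
    # vertex shared by ranks r-1 and r belongs to rank r-1, so rank 0 owns
    # all of its vertices and every other rank skips its leftmost vertex.
    n_owned = [c + 1 if r == 0 else c for r, c in enumerate(cells_per_rank)]

    # Step 2: each rank enumerates its own DoFs locally as 0 .. n_owned[r]-1.
    # Step 3: shift by the number of DoFs owned by all lower ranks
    # (an exclusive prefix sum of the local counts).
    offsets = [0] + list(accumulate(n_owned))[:-1]
    owned = [list(range(off, off + n)) for off, n in zip(offsets, n_owned)]

    # Step 4: exchange with neighbors. Here rank r only needs the index of
    # its left interface vertex, which is the last DoF owned by rank r-1.
    ghosts = [[] if r == 0 else [owned[r - 1][-1]] for r in range(n_ranks)]
    return owned, ghosts


owned, ghosts = distribute_dofs_1d([2, 3, 2])
```

Only the per-rank counts cross process boundaries in step 3, which is why the numbering never needs knowledge of the whole mesh.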

Distributing the Degrees of Freedom
1. local numbering (figure)

Distributing the Degrees of Freedom
2. shift indices (figure)

Distributing the Degrees of Freedom
3. transfer to neighbor (figure; view: green)

Distributing the Degrees of Freedom
4. second transfer needed for some cells (figure)

Linear Algebra, Postprocessing
Linear Algebra:
- use distributed matrices and vectors (PETSc or Trilinos)
- assemble local parts (some communication on interfaces)
- solve (preconditioners!)
Not covered today:
- error estimation
- deciding on refinement and coarsening (communication!)
- handling hanging nodes
- solution transfer (after refinement and repartitioning)
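To make "assemble local parts (some communication on interfaces)" concrete, here is a minimal single-process sketch. It assumes a 1d Laplace problem with linear elements and assumes that matrix rows are stored on the rank that owns the corresponding DoF, with interface DoFs owned by the lower rank. The name `assemble_1d_laplace` is hypothetical, and the sketch does not use the PETSc or Trilinos APIs.

```python
from collections import defaultdict


def assemble_1d_laplace(cells_per_rank, h=1.0):
    """Assemble the 1d stiffness matrix cell by cell, one rank at a time.

    Returns (rows, n_remote): rows[r] maps (i, j) to the matrix entry for
    every row i stored on rank r, and n_remote counts contributions that
    had to be sent to a different rank.
    """
    n_ranks = len(cells_per_rank)
    first_cell = [sum(cells_per_rank[:r]) for r in range(n_ranks)]

    def owner_of_row(v):
        # Assumed rule: vertex v belongs to the rank that holds the cell to
        # its left (cell v - 1); vertex 0 belongs to the rank holding cell 0.
        cell = max(v - 1, 0)
        for r in range(n_ranks):
            if cell < first_cell[r] + cells_per_rank[r]:
                return r
        raise ValueError("vertex outside the mesh")

    rows = [defaultdict(float) for _ in range(n_ranks)]
    n_remote = 0
    for r in range(n_ranks):
        for c in range(first_cell[r], first_cell[r] + cells_per_rank[r]):
            dofs = (c, c + 1)  # global indices of the cell's two vertices
            for i, gi in enumerate(dofs):
                for j, gj in enumerate(dofs):
                    value = (1.0 if i == j else -1.0) / h  # element matrix
                    dest = owner_of_row(gi)
                    if dest != r:
                        n_remote += 1  # would be an MPI message in practice
                    rows[dest][(gi, gj)] += value
    return rows, n_remote


rows, n_remote = assemble_1d_laplace([2, 2])
```

With two ranks holding two cells each, only the two entries in the row of the shared vertex travel between ranks. Everything else is assembled without communication.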

Strong Scaling: 2d adaptive Poisson problem
(Figure: wall clock times for a problem of fixed size, 335M. Axes: wall time [seconds] vs. number of processors. Curves: linear solver, copy to deal.II, error estimation, assembly, init matrix, sparsity pattern, coarsen and refine.)

Weak Scaling
(Figure: weak scaling, 512 processors. Axes: wall time [seconds] vs. number of degrees of freedom, ticks at 1e+07, 1e+08, 1e+09. Curves: Linear Solver, Preconditioner, Setup Matrices, Assembly, Assembly (T), Refinement, SolutionTransfer, Setup DoFs, Distribute DoFs.)
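The plotted values are not recoverable from this transcript, so the following only sketches how such curves are commonly turned into an efficiency number. The formulas are the usual textbook definitions, assumed rather than taken from the talk, and both function names are hypothetical. The second function matches the setting of this slide, where the processor count is fixed and the problem size grows, so the ideal run time grows linearly with the number of unknowns.

```python
def strong_scaling_efficiency(p_ref, t_ref, p, t):
    """Fixed problem size: ideally the wall time falls like 1/p.

    Returns 1.0 for perfect scaling relative to the reference run
    (p_ref processors, t_ref seconds).
    """
    return (t_ref * p_ref) / (t * p)


def size_scaling_efficiency(n_ref, t_ref, n, t):
    """Fixed processor count: ideally the wall time grows linearly with n.

    Compares the time per unknown of a run with n unknowns against the
    reference run with n_ref unknowns.
    """
    return (t_ref / n_ref) / (t / n)


# Hypothetical timings, chosen only to illustrate the arithmetic.
e_strong = strong_scaling_efficiency(p_ref=16, t_ref=100.0, p=64, t=50.0)
e_size = size_scaling_efficiency(n_ref=2e6, t_ref=10.0, n=8e6, t=50.0)
```

An efficiency well below 1.0 for one component, such as the linear solver, points to the part of the program that limits scalability.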

Test: memory consumption
(Figure: average and maximum memory consumption (VmPeak). Axes: mem / MB vs. #CPUs. Curves: avg, max.)
- 3D, weak scalability from 8 to 1000 processors at a roughly fixed number of DoFs per processor (4 million up to 500 million DoFs in total)
- constant memory usage with increasing #CPUs and problem size

Test: memory consumption
(Figure: memory in MB vs. # CPUs, broken down by Triangulation, p4est, DoFHandler, Constraints, Matrix, Vector.)
- 3D, weak scalability; better for more complicated problems

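Mantle Convection
Flow driven by temperature differences. Boussinesq model:
\begin{align*}
-\nabla\cdot\bigl(2\eta\,\varepsilon(u)\bigr) + \nabla p &= -\rho\,\beta\,T\,g, \\
\nabla\cdot u &= 0, \\
\frac{\partial T}{\partial t} + u\cdot\nabla T - \nabla\cdot\kappa\nabla T &= \gamma.
\end{align*}
(velocity u, pressure p, temperature T)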

Mantle Convection
(Figure: solution and partition of a 3d simulation with roughly 54 million unknowns on 1.4 million cells, running on 512 cores.)

Strong Scaling: coarse 3d case
(Figure: strong scaling, 4 million DoFs. Axes: time/s vs. #CPUs. Curves: Setup DoFs, Assemble T RHS, Assemble Stokes, Refine mesh, Build preconditioner, Solve T, Solve Stokes, and a linear reference line.)

Weak Scaling
(Figure: 2d, 512 cores, adaptive refinement for a fixed time step, 2 million to 160 million unknowns. Axes: time/s vs. # DoFs. Curves: Error Est & SolutionTransfer, setup DoFs, Assemble Stokes, Assemble Temp., recreate local mesh, p4est balance, p4est partition, refine and coarsen, and a linear reference line.)

Thanks for your attention!
