Efficient Algorithmic Approaches for Flow Simulations on Cartesian Grids

1 Efficient Algorithmic Approaches for Flow Simulations on Cartesian Grids
M. Bader, H.-J. Bungartz, B. Gatzhammer, M. Mehl, T. Neckel, T. Weinzierl
TUM, Department of Informatics, Chair of Scientific Computing, Munich, Germany
DEISA PRACE Symposium 2009: HPC Infrastructures for Petascale Applications, Amsterdam, May 11-13, 2009

2 Prologue or a first Challenge
Some people from science and engineering solve great problems using somewhat strange methods, whereas some people from scientific computing develop great methods to solve somewhat strange problems. [Think of the world-famous stationary Laplacian on the 1D unit cube ...]
Principally, we belong to the second group, so be prepared to see the application only at the end of the talk!

3 Computational Challenges
It's getting tougher:
- From simulation to optimisation
- From one-way batch jobs to user interaction
- From parameter assumptions to identification & estimation
- From forward problems to inverse problems
- From single-physics problems to multi-physics scenarios
- From hacker's delight to complex workflows
- From island fun to embedding & integration

4 Multi-this and multi-that
- Domains & Education: multi-disciplinary
- Models: multi-physics
- Models: multi-scale
- Systems: multi-core
- Numerics: multi-dimensional
- Numerics: multi-level

5 Hence the Motivation for Computational Algorithms
- Tackling the memory wall: cache-awareness via sophisticated traversal strategies; cache-oblivious vs. cache-conscious
- Tackling on-chip parallelism (multi-core): multi-threading, fine-grain parallelism; no more sequential kernel; non-standard hardware: accelerators such as GPGPU, Cell, FPGA
- Tackling scalability: hybrid concepts, sophisticated & cheap load balancing; heterogeneous scenarios (non-standard geometry, multi-level schemes, ...) require dynamic load balancing; intra- and inter-system (from hybrid systems to the Grid)
A promising paradigm: space-filling curves [SFC: continuous and surjective mapping from the unit interval onto the unit square/cube]
- Lebesgue: the classical one (Morton, octree)
- Hilbert: the most famous one
- Peano: our favourite for Cartesian grids
- Sierpinski: the newcomer for triangles & Co.
Annual gain in last years (avg.): CPU performance 60%, memory bandwidth 23%, memory latency 5%

6 Contents
- The Scope of Space-Filling Curves
- The Peano Project
- Proof of Concept
- Application: The Drift Ratchet

7 SFC #1 Lebesgue: Hierarchical Spatial Organisation
Lebesgue's space-filling curve, known as Morton ordering or quad-/octrees
Applications in a CSE & HPC context:
- Test of geometric consistency of building models
- Decomposition and meshing of domains
- Spatial organisation of FEM: identify range of modifications
- Spatial organisation of particle methods (fast multipole)
- Integration of location-aware simulation tasks
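The link between Morton ordering and quadtrees can be illustrated with a minimal Python sketch. It assumes 2D integer cell coordinates; the function names `morton_encode`, `morton_decode` and `quadtree_path` are illustrative and not taken from any code mentioned in the talk.

```python
def morton_encode(x, y, bits=16):
    """Interleave the low `bits` bits of x and y into one Z-order (Morton) key."""
    key = 0
    for i in range(bits):
        key |= ((x >> i) & 1) << (2 * i)        # x bits go to even positions
        key |= ((y >> i) & 1) << (2 * i + 1)    # y bits go to odd positions
    return key


def morton_decode(key, bits=16):
    """Inverse of morton_encode: recover (x, y) from the interleaved key."""
    x = y = 0
    for i in range(bits):
        x |= ((key >> (2 * i)) & 1) << i
        y |= ((key >> (2 * i + 1)) & 1) << i
    return x, y


def quadtree_path(key, levels):
    """Child index (0..3) taken at each quadtree level, root first.

    Each pair of key bits selects one of the four children, which is why
    sorting cells by Morton key is the same as a depth-first quadtree walk.
    """
    return [(key >> (2 * (levels - 1 - level))) & 3 for level in range(levels)]
```

Sorting the cells of a 2x2 grid by `morton_encode` yields the familiar Z pattern (0,0), (1,0), (0,1), (1,1); the same idea extends to octrees by interleaving three coordinates.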

8 SFC #2 Peano: Numerical Linear Algebra
TifaMMy:
- cache-efficient matrix multiplication
- Peano-based traversal with high locality (dense or sparse matrices)
- block-structured data structure and algorithm
- parallel @ multicore: HW-conscious kernel, OpenMP
- parallel @ clusters: distributed caches, MPI
- application: quantum control (states via matrices)
[Plots: AMD (2 x quad), Xeon (4 x quad)]
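The idea of a block-structured, locality-preserving matrix multiplication can be sketched in a few lines of Python. Note that this sketch swaps in plain recursive halving (2x2 blocks) instead of the Peano-curve block ordering used by TifaMMy, and it assumes square matrices whose size is a power of two; all names are illustrative.

```python
def matmul_recursive(A, B, C, n, ai=0, aj=0, bi=0, bj=0, ci=0, cj=0, cutoff=2):
    """Accumulate the product of two n x n sub-blocks into C.

    (ai, aj), (bi, bj), (ci, cj) are the top-left corners of the blocks of
    A, B and C that are being multiplied. Recursing on blocks keeps the
    working set small, independent of the actual cache size.
    """
    if n <= cutoff:
        # Small block: plain triple loop in i-k-j order.
        for i in range(n):
            for k in range(n):
                a = A[ai + i][aj + k]
                for j in range(n):
                    C[ci + i][cj + j] += a * B[bi + k][bj + j]
        return
    h = n // 2
    for i in (0, 1):
        for j in (0, 1):
            for k in (0, 1):
                matmul_recursive(A, B, C, h,
                                 ai + i * h, aj + k * h,
                                 bi + k * h, bj + j * h,
                                 ci + i * h, cj + j * h, cutoff)


def matmul(A, B):
    """Multiply two square matrices given as lists of lists."""
    n = len(A)
    assert n & (n - 1) == 0, "this sketch assumes n is a power of two"
    C = [[0] * n for _ in range(n)]
    matmul_recursive(A, B, C, n)
    return C
```

A Peano-based scheme would use 3x3 blocks and visit them in an order in which consecutive block operations always share an operand block; the recursion above only conveys the blocking idea.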

9 SFC #3 Sierpinski: Tsunami Simulations
- Sierpinski space-filling curves
- FEM with strong adaptive refinement & coarsening
- structured, but triangular / tetrahedral
- high locality and HW-/cache-efficiency
- Sierpinski-based traversal, newest vertex bisection
- discontinuous Galerkin discretization
- application: Tsunami simulation (shallow water eqs.)
Cooperation with Jörn (AWI)

11 Objectives
- General PDE framework, with focus on CFD/FSI
- Discretization: FE (strictly conservative)
- Cartesian grids (at least logically)
- Straightforward grid generation & adaptation
- Direct support of multi-level solvers and parallelisation
- General dimensionality
- High efficiency

12 Grid Organisation: Adaptive Spacetree
- Cartesian grid: cells are squares/cubes
- recursive refinement by tri-partitioning
- tree structure
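The spacetree idea (square/cube cells, recursive tri-partitioning, tree structure) can be sketched as follows. This is a minimal illustration under stated assumptions, not Peano's actual data structure: the class `Cell`, the function `build_spacetree` and the refinement callback are hypothetical names.

```python
from dataclasses import dataclass, field
from itertools import product


@dataclass
class Cell:
    origin: tuple                 # lower corner of the cell, one entry per dimension
    size: float                   # edge length
    level: int = 0
    children: list = field(default_factory=list)

    def refine(self):
        """Tri-partition along every axis: 3^d children in d dimensions."""
        d = len(self.origin)
        h = self.size / 3.0
        self.children = [
            Cell(tuple(o + k * h for o, k in zip(self.origin, idx)), h, self.level + 1)
            for idx in product(range(3), repeat=d)
        ]

    def leaves(self):
        """Yield the unrefined cells, i.e. the cells of the adaptive grid."""
        if not self.children:
            yield self
        else:
            for child in self.children:
                yield from child.leaves()


def build_spacetree(cell, needs_refinement, max_level):
    """Refine recursively wherever the user-supplied criterion asks for it."""
    if cell.level < max_level and needs_refinement(cell):
        cell.refine()
        for child in cell.children:
            build_spacetree(child, needs_refinement, max_level)
    return cell
```

For example, `build_spacetree(Cell((0.0, 0.0), 1.0), lambda c: True, 2)` gives a regular 9x9 grid, while a criterion that only accepts cells near a boundary gives an adaptive grid with few cells elsewhere.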

13 Approximation
- geometric adaptivity, grid hierarchy
- Eulerian approach (marker-and-cell)
- [Figures: sphere, d = 2, 3, 4]

14 Traversal for Iterations: Stack Concept
- cell-oriented operator evaluation
- ordering of cells along a Peano curve
- stacks as non-persistent data structure
- adaptivity & generating systems: multi-level
- high spatial and time locality of data access

15 Traversal for Iterations: Stack Concept
2d+2 stacks in d dimensions
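The ordering of cells along a Peano curve can be sketched for a regular 2D grid of 3^level x 3^level cells. The sketch below follows Peano's classical ternary-digit construction; the function names are illustrative, and the stack-based data handling of the actual framework is not shown.

```python
def peano_index_to_xy(t, level):
    """Map position t along the Peano curve to cell coordinates (x, y).

    t is read as 2*level ternary digits t1 t2 t3 ... (most significant first).
    Odd-numbered digits feed x, even-numbered digits feed y; a digit is
    mirrored (d -> 2 - d) when the sum of the preceding digits of the other
    coordinate is odd. This mirroring makes consecutive cells share an edge.
    """
    digits = []
    for _ in range(2 * level):
        digits.append(t % 3)
        t //= 3
    digits.reverse()

    x = y = 0
    sum_x_digits = 0   # running sum of t1, t3, ... (controls mirroring of y)
    sum_y_digits = 0   # running sum of t2, t4, ... (controls mirroring of x)
    for j in range(level):
        tx = digits[2 * j]
        xd = 2 - tx if sum_y_digits % 2 else tx
        sum_x_digits += tx
        ty = digits[2 * j + 1]
        yd = 2 - ty if sum_x_digits % 2 else ty
        sum_y_digits += ty
        x = 3 * x + xd
        y = 3 * y + yd
    return x, y


def peano_order(level):
    """All cells of the 3^level x 3^level grid in Peano-curve order."""
    n = 3 ** level
    return [peano_index_to_xy(t, level) for t in range(n * n)]
```

Because successive cells are always face neighbours, data touched in one cell is typically reused in the next one, which is the locality property the stack concept relies on.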

16 Fast Linear Solvers: Multigrid
Steps during the traversal: dehierarchisation, compute residual, smooth, restrict residual
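The interplay of smoothing, residual computation and restriction can be illustrated with a textbook V-cycle for the 1D Poisson problem -u'' = f with zero boundary values. This is a generic sketch (Gauss-Seidel smoother, full weighting, linear interpolation), not the generating-system formulation used in Peano; it assumes the number of intervals is a power of two, and all names are illustrative.

```python
def smooth(u, f, h, sweeps=2):
    """Gauss-Seidel sweeps for -u'' = f; boundary entries stay fixed."""
    for _ in range(sweeps):
        for i in range(1, len(u) - 1):
            u[i] = 0.5 * (u[i - 1] + u[i + 1] + h * h * f[i])


def residual(u, f, h):
    """r = f - A u for the standard 3-point stencil."""
    r = [0.0] * len(u)
    for i in range(1, len(u) - 1):
        r[i] = f[i] - (2.0 * u[i] - u[i - 1] - u[i + 1]) / (h * h)
    return r


def restrict(r):
    """Full weighting onto the grid with half as many intervals."""
    nc = (len(r) - 1) // 2
    rc = [0.0] * (nc + 1)
    for i in range(1, nc):
        rc[i] = 0.25 * r[2 * i - 1] + 0.5 * r[2 * i] + 0.25 * r[2 * i + 1]
    return rc


def prolong(ec, n_fine):
    """Linear interpolation of the coarse correction to the fine grid."""
    e = [0.0] * (n_fine + 1)
    for i in range(len(ec)):
        e[2 * i] = ec[i]
    for i in range(1, n_fine, 2):
        e[i] = 0.5 * (e[i - 1] + e[i + 1])
    return e


def v_cycle(u, f, h):
    """One V-cycle, updating u in place."""
    n = len(u) - 1
    if n <= 2:
        smooth(u, f, h, sweeps=10)   # at most one unknown: solved exactly
        return
    smooth(u, f, h)                              # pre-smoothing
    rc = restrict(residual(u, f, h))             # coarse-grid right-hand side
    ec = [0.0] * len(rc)
    v_cycle(ec, rc, 2.0 * h)                     # coarse-grid correction
    e = prolong(ec, n)
    for i in range(1, n):
        u[i] += e[i]
    smooth(u, f, h)                              # post-smoothing
```

Calling `v_cycle(u, f, h)` repeatedly on a zero initial guess reduces the residual by a roughly constant factor per cycle, independent of the grid resolution.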

17 Parallel Grid Traversal, Dynamic Load Distribution

18 FSI Coupling Environments: FSI❄ce and preCICE
Partitioned approach to FSI [clip]
Simulation program:

    FSI_Init()
    while (FSI_Is_running())
        if (FSI_Is_new_interface_values())
            Read coupling data from com. mesh
        Set time step length
        Compute values of next time step
        Write coupling data to com. mesh
        FSI_Data_exchange()
        if (FSI_Is_implicit_converged())
            Store values of next time step
    end while
    FSI_Finalize()
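The control flow of such a solver-side coupling loop can be mimicked with a small runnable Python sketch. Everything here is an assumption for illustration: `MockCoupling` is a hypothetical stand-in, not the FSI❄ce or preCICE API, and the "solver" is a toy update. The point is only that a time step is recomputed until the implicit coupling reports convergence, and only then stored.

```python
class MockCoupling:
    """Hypothetical stand-in for a coupling library (not a real API)."""

    def __init__(self, steps, iterations_per_step):
        self.steps_left = steps
        self.iterations_per_step = iterations_per_step
        self.iteration = 0
        self.interface_value = 0.0

    def is_running(self):
        return self.steps_left > 0

    def has_new_interface_values(self):
        return True

    def read(self):
        return self.interface_value

    def write(self, value):
        self.interface_value = value

    def exchange(self):
        self.iteration += 1          # pretend to talk to the other solver

    def is_implicit_converged(self):
        # Toy criterion: declare convergence after a fixed number of iterations.
        if self.iteration >= self.iterations_per_step:
            self.iteration = 0
            self.steps_left -= 1
            return True
        return False


def run_solver(coupling, dt=0.1):
    """Solver-side loop: repeat a time step until the coupling has converged."""
    state = 0.0
    accepted = []
    while coupling.is_running():
        load = coupling.read() if coupling.has_new_interface_values() else 0.0
        candidate = state + dt * (1.0 + load)    # toy "compute next time step"
        coupling.write(candidate)
        coupling.exchange()
        if coupling.is_implicit_converged():
            state = candidate                    # store values of next time step
            accepted.append(state)
    return accepted
```

With `MockCoupling(steps=3, iterations_per_step=2)`, `run_solver` performs six solver evaluations but stores only three time steps.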

20 Parallel Grid Traversal

21 Parallelisation: Memory Overhead
[Plot: parallel/serial vertex number ratio over number of nodes]
Successive subdivision, data duplication at subdomain boundaries, worst case; measured on JUGENE

22 Cache Efficiency
[Table, values not recoverable. Columns: Scenario | Vertices | L2 refs | L2 misses | Bus data cycles | Bus load [%]. Rows: cube, regular; cube, adaptive; L-shape, regular; L-shape, adaptive; sphere, regular; sphere, adaptive]
Example scenario: 2D Poisson on cube, L-shaped domain, sphere
Itanium2 2x DualCore, 1.3 GHz, 256 KB L2, 3 MB L3 (shared), 8 GB; single-thread application
Messages of the measurements:
- L2 hit rate > 99.9%
- low bus traffic (hence well suited for many-core systems, Cell, ...)

23 Memory Requirements per DoF
[Table, mostly not recoverable. Columns: bytes/cell | bytes/vertex. Rows for 2D and 3D: grid only; Poisson solver, sequential; Poisson solver, parallel; flow solver. Surviving entries: 2D: 6 and 2; 3D: 10 and 2]
Multigrid (Poisson, cube, adaptive, F-cycle)
[Table, mostly not recoverable. Columns: Threshold | Vertices | Flop/Cycle | L2 hit rate | t/dof, for d=2 and d=3. Surviving complete t/dof entries: 4.26 * 10^-4 (d=2), 9.52 * 10^-4 (d=3)]

24 Memory & Runtime
Sequential code with hard-disc streaming
Pressure-Poisson equation, V-(1/0)-cycle
- laptop: 1.8 GHz Intel Centrino, 1 GB RAM
- atsccs: 3.4 GHz Intel Pentium 4, 2 GB RAM

26 Application: Drift Ratchet
Scenario [Matthias and Müller, Asymmetric pores in a silicon membrane acting as massively parallel Brownian ratchets, Letters to Nature, 424, 2003]; the application scenario is a cooperation with the physics dept. of Univ. of Augsburg.
Ratchets or Brownian motors are used for sorting macromolecules or other particles (think of a sieve). Due to the pore geometry, (symmetric) periodic pressure b.c. may induce a size-dependent drift.

27 Drift Ratchet: Starting Point
- CFD scenario involving complex geometries, FSI
- Need for longer time intervals
- Physics not yet completely understood
- Simplified models to start with
- High technological relevance: need for microdevices [sorting macromolecules such as proteins or DNA]

28 Simulation Scenario
[Snapshots: Peano & preCICE, 2D and 3D]

29 Results
- One chamber: clip, Re = 0.1, f = 7 kHz
- Two chambers (transit): clip, Re = 0.1, f = 10 kHz
- FSI: partitioned approach (fluid: Cartesian grid; particle(s): triangulated surface)
- Explicit coupling with divergence correction
- Yet incomplete model: no Brownian motion, no collisions, no thermodynamical effects

30 First Results
- Simulations of several cycles
- Simplified analytical solution vs. simulation (one cycle)

31 First Results
- Re = 0.1, f = 0.1 kHz
- One pore with two chambers
- 30x30x126 = 113,400 cells
- Oscillating pressure b.c. (grey), particle position (blue) and velocity (red)
[Plot: boundary pressure, particle position and particle velocity over time [s]]

32 Acknowledgements
- DFG
- DEISA project Drift Ratchet: computations & support by LRZ, München (D); JSC, Jülich (D); EPCC, Edinburgh (UK)
- Theoretical Physics @ Universität Augsburg (Peter Hänggi) ... physics again, but in an engineering-driven code development
- All people contributing to Peano: core components; CFD & FSI applications

33 Communication: Optimizing Packet Sizes
[Plots for Infinicluster, HLRB II and Jugene: time [s] over number of messages per message exchange, 2D and 3D; O(1M) dof]
