Enzo-P / Cello. Scalable Adaptive Mesh Refinement for Astrophysics and Cosmology. San Diego Supercomputer Center. Department of Physics and Astronomy

1 Enzo-P / Cello: Scalable Adaptive Mesh Refinement for Astrophysics and Cosmology
James Bordner (1), Michael L. Norman (1), Brian O'Shea (2)
(1) University of California, San Diego, San Diego Supercomputer Center
(2) Michigan State University, Department of Physics and Astronomy
Blue Waters / XSEDE Extreme Scaling Workshop 2012, 16 July 2012

2 Enzo overview
- Parallel astrophysics and cosmology
  - implemented in C++ / Fortran
  - approximately 150K SLOC
  - parallelized using MPI / OpenMP
- Vast range of scales
  - astrophysical fluid dynamics
  - hydrodynamic cosmology
- Adaptive mesh refinement (AMR)
- Growing development community
[Norman et al.]

3 Enzo's physics and algorithms
- Eulerian hydrodynamics
  - piecewise-parabolic method (PPM)
- Lagrangian dark matter
  - particle-mesh method (PM)
- Self-gravity
  - FFTs on root-level grid
  - multigrid on refinement patches
- Local physics
  - heating, cooling, chemistry, etc.
- MHD, RHD (ray-tracing or implicit FLD)
[John Wise]

4 Enzo's pursuit of scalability
- Enzo born in early 1990s
  - "extreme" meant 100 processors
- Continual scalability improvements
  - MPI/OpenMP parallelism
  - neighbor-finding algorithm
  - I/O optimizations
- Further improvement getting harder
  - increasing scalability requirements
  - easy improvements already made
- Motivates concurrent rewriting
  - Enzo-P: petascale Enzo fork
  - Cello: AMR framework
[Sam Skillman, Matt Turk]

5 Enzo's AMR data structure
- Patch-based SAMR
- Each patch is a C++ object
- Patches assigned to processes
  - N_P root patches, grid size 64^3, 128^3
  - Refinement patches generally smaller
  - Refinement patches initially local to parent
  - Load balancing relocates refinement patches
- Patch data (grids, particles) are distributed
- AMR hierarchy structure is replicated
[Figure labels: root grid, root patch, refinement patch]
[Tom Abel, John Wise, Ralf Kaehler]

6 Enzo's timestepping
- Adaptive timestepping by level
  - parallel within level
  - less computation by O(N_L)
  - reduced parallel efficiency

EvolveLevel(L, dt_{L-1})
  1. refresh level L ghosts
  2. compute timestep dt_L
  3. advance level L by dt_L
  4. EvolveLevel(L+1, dt_L)
  5. correct fluxes

[Figure: space-time (x, t) diagrams of the timesteps dt_L taken on levels L = 0, 1, 2, ...]
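
The EvolveLevel recursion above can be sketched in a few lines of Python. This is an illustration only, not Enzo's code: the names `evolve_level`, `level_dt` and `count_advances` are hypothetical, the per-level timesteps are supplied as fixed numbers rather than computed from the solution, and the loop that repeats steps 1 to 5 until a level catches up with its parent is an assumption (the slide lists a single pass).

```python
def evolve_level(level, dt_parent, max_level, level_dt, log):
    """Advance `level` by a total of dt_parent, recursing into finer levels."""
    elapsed = 0.0
    while elapsed < dt_parent - 1e-12:
        log.append((level, "refresh ghosts"))               # step 1
        # Step 2: pick the timestep, never overshooting the parent's time.
        dt = min(level_dt[level], dt_parent - elapsed)
        log.append((level, "advance", dt))                  # step 3
        if level < max_level:
            # Step 4: the finer level must cover the same interval dt.
            evolve_level(level + 1, dt, max_level, level_dt, log)
        log.append((level, "correct fluxes"))               # step 5
        elapsed += dt
    return log


def count_advances(log):
    """Count how many 'advance' steps each level took."""
    counts = {}
    for entry in log:
        if entry[1] == "advance":
            counts[entry[0]] = counts.get(entry[0], 0) + 1
    return counts


# Hypothetical timesteps: each level's stable timestep is half its parent's.
log = evolve_level(0, 1.0, 2, [1.0, 0.5, 0.25], [])
print(count_advances(log))
```

With these assumed timesteps, level 0 advances once, level 1 twice and level 2 four times. The coarse level waits while the finer levels subcycle, which is the synchronization between levels that the later slides identify as a scaling issue.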

7 Enzo's scaling issues
- Memory usage
  - AMR structure is non-scalable
  - ghost zone layer three zones deep
  - memory fragmentation
- Data locality
  - disrupted by load balancing
- Parallel task definition
  - widely varying patch sizes
  - granularity determined by AMR
- Parallel task scheduling
  - parallel within a level
  - synchronization between levels
[Elizabeth Tasker]

8 Talk outline
1. Enzo
   1. overview
   2. design
   3. scaling issues
2. Talk outline
3. Enzo-P / Cello
   1. overview
   2. design
   3. scaling solutions
   4. implementation using Charm++
   5. recursively generated parallel data structures

9 Enzo-P / Cello overview
- Enzo-P intended to be "petascale Enzo"
- Cello is a scalable AMR framework
- Parallelism using Charm++
  - MPI as a backup
- 25K SLOC
- Work in progress
  - PPM HD / PPML MHD on distributed Cartesian grids
  - prototyping Charm++ AMR implementations

10 Cello's AMR data structure
- Tree-based SAMR
- Patches define unit of refinement
  - cubical, varying size
- Blocks define parallel tasks
  - flexible size and shape
- One Block per Charm++ chare
- Tree structure optionally distributed
[Figure labels: Forest, Tree, Patch, Block]
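
As a rough illustration of the Patch / Block split described above, the sketch below divides a cubical Patch into Blocks of a chosen shape. The class layout, field names and sizes are hypothetical and are not taken from Cello; the sketch only shows that one Patch can be cut into parallel tasks of different shapes.

```python
from dataclasses import dataclass, field
from itertools import product


@dataclass
class Block:
    index: tuple   # (ix, iy, iz) position of the block within its patch
    shape: tuple   # cells per axis in this block


@dataclass
class Patch:
    cells: int           # cubical patch: cells per axis
    block_shape: tuple   # cells per axis in each block
    blocks: list = field(default_factory=list)

    def __post_init__(self):
        counts = []
        for n in self.block_shape:
            if self.cells % n != 0:
                raise ValueError("block shape must evenly divide the patch")
            counts.append(self.cells // n)
        # One Block per position in the block grid; each would be one task.
        self.blocks = [Block(idx, self.block_shape)
                       for idx in product(*(range(c) for c in counts))]


cubes = Patch(64, (16, 16, 16))   # 4 x 4 x 4 cubical blocks
slabs = Patch(64, (64, 8, 8))     # 1 x 8 x 8 pencil-shaped blocks
print(len(cubes.blocks), len(slabs.blocks))
```

Both choices yield 64 tasks for the same Patch, so task granularity can be tuned separately from the unit of refinement.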

11 Cello's timestepping
- Optional adaptive timestepping
  - including by Patch or Block
  - local synchronization
  - parallelism between levels
- Quantized timesteps
  - avoids "sliver" timesteps (dt near 0)
  - but dt's smaller than optimal
  - reduced numerical roundoff
[Figure: space-time (x, t) diagrams of the timesteps taken on levels L = 0, 1, 2, ...]
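
One way to picture quantized timesteps is to round each Block's stable timestep down to the root timestep divided by a power of two. The slide does not state which quantization Cello uses, so the power-of-two rule and the name `quantize_timestep` below are assumptions made for illustration.

```python
def quantize_timestep(dt_stable, dt_root):
    """Largest dt_root / 2**k (k >= 0) that does not exceed dt_stable."""
    if dt_stable <= 0 or dt_root <= 0:
        raise ValueError("timesteps must be positive")
    dt = dt_root
    while dt > dt_stable:
        dt /= 2.0   # halving keeps every step aligned with the coarser steps
    return dt


# A block that could stably take dt = 0.3 is held to 0.25 instead.
print(quantize_timestep(0.3, 1.0))
```

Because every quantized step divides the root step exactly, a block never needs a tiny leftover step to line up with its neighbors, at the cost of taking a step somewhat smaller than the stable limit allows.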

13 Cello's improvements to scaling: Reducing replicated AMR structure
- Fewer replicated patches
  - patch-merging technique
  - truncates "full" subtrees
  - 2x to 3x reduction
- Smaller replicated patches
  - Enzo: grid = 1544 bytes
  - Cello: Node = 16 bytes
  - 100x reduction!
[Figure labels: original vs. patch-merged; Enzo grid vs. Cello Node]
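
The per-patch sizes quoted above make the memory argument easy to check. The byte counts come from the slide; the patch count of one million is a hypothetical figure chosen only to give a sense of scale.

```python
ENZO_GRID_BYTES = 1544   # per replicated patch, from the slide
CELLO_NODE_BYTES = 16    # per replicated tree node, from the slide


def replicated_bytes(num_patches, bytes_per_patch):
    """Memory each process spends on the replicated hierarchy metadata."""
    return num_patches * bytes_per_patch


reduction = ENZO_GRID_BYTES / CELLO_NODE_BYTES
num_patches = 10**6      # hypothetical hierarchy size

print(reduction)
print(replicated_bytes(num_patches, ENZO_GRID_BYTES) / 1e9, "GB per process")
print(replicated_bytes(num_patches, CELLO_NODE_BYTES) / 1e6, "MB per process")
```

The exact ratio is 96.5, which the slide rounds to 100x. Because the structure is replicated, this cost is paid on every process regardless of how many processes share the actual grid data.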

15 Cello's improvements to scaling: Distributed AMR structure
- Forest of Trees
  - each assigned a process range
  - simple indexing
  - load balancing issues
- Space-filling curve
  - improved load balancing
  - requires global scan
  - greater surface area

16 Cello's improvements to scaling: Dynamically load balancing AMR block data
- Dynamic load balancing
  - use Charm++ load balancing
- Space-filling curves
  - equally distribute load
  - maintain data locality
  - no parent-child communication
- Use measured performance
  - computation
  - memory usage
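
A minimal sketch of space-filling-curve load balancing follows. It orders 2D blocks along a Morton (Z-order) curve and cuts the curve into contiguous segments of roughly equal weight. The slides do not say which curve Cello uses, so Morton ordering is an assumption, and the function names and the uniform weights are hypothetical. The running sum over block weights plays the role of the global scan mentioned on the previous slide.

```python
def morton_index(ix, iy, bits):
    """Interleave the bits of (ix, iy) into a 2D Morton (Z-order) index."""
    code = 0
    for b in range(bits):
        code |= ((ix >> b) & 1) << (2 * b)
        code |= ((iy >> b) & 1) << (2 * b + 1)
    return code


def partition_by_curve(blocks, weights, num_procs, bits):
    """Assign blocks to processes as contiguous curve segments of similar weight."""
    order = sorted(blocks, key=lambda b: morton_index(b[0], b[1], bits))
    total = sum(weights[b] for b in order)
    assignment, running = {}, 0.0
    for b in order:
        # The midpoint of the block's weight interval selects its owner.
        mid = running + 0.5 * weights[b]
        assignment[b] = min(num_procs - 1, int(mid * num_procs / total))
        running += weights[b]
    return assignment


blocks = [(x, y) for x in range(4) for y in range(4)]
weights = {b: 1.0 for b in blocks}   # in practice: measured cost per block
owner = partition_by_curve(blocks, weights, num_procs=4, bits=2)
print(sorted(b for b in blocks if owner[b] == 0))
```

With uniform weights each of the four processes receives one 2 x 2 quadrant, so neighboring blocks tend to share a process. Replacing the weights with measured computation or memory cost shifts the cut points without changing the ordering.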

18 Cello's improvements to scaling: Parallel tasks
- Task definition
  - flexible choice of size / shape
  - reduced size variability
    - constant grid Block sizes
    - variable subcycling, # particles
- Task scheduling
  - Charm++: asynchronous, data-driven
  - Blocks advance when ghosts refreshed
[Figure: space-time (x, t) diagrams comparing Enzo and Cello on levels L = 0 to L = 3]
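
The rule "Blocks advance when ghosts refreshed" can be imitated in a single process with a message queue. The sketch below is not Charm++ code and uses no Charm++ API; the `Block` class, the `run` function and the 1D chain of blocks are hypothetical stand-ins for chares exchanging ghost-zone messages.

```python
from collections import deque


class Block:
    def __init__(self, index, neighbors):
        self.index = index
        self.neighbors = neighbors   # indices of adjacent blocks
        self.cycle = 0               # number of completed timesteps
        self.received = {}           # cycle -> ghost messages received so far

    def receive_ghosts(self, cycle):
        self.received[cycle] = self.received.get(cycle, 0) + 1

    def ready(self):
        return self.received.get(self.cycle, 0) == len(self.neighbors)


def run(num_blocks, num_cycles):
    """Advance a 1D chain of blocks (num_blocks >= 2) with no global barrier."""
    blocks = [Block(i, [j for j in (i - 1, i + 1) if 0 <= j < num_blocks])
              for i in range(num_blocks)]
    # Each queued message is (destination block, cycle of the ghost data).
    queue = deque((n, 0) for b in blocks for n in b.neighbors)
    history = []
    while queue:
        dest, cycle = queue.popleft()
        block = blocks[dest]
        block.receive_ghosts(cycle)
        # A block advances as soon as all ghosts for its current cycle are in.
        while block.cycle < num_cycles and block.ready():
            history.append((block.index, block.cycle))
            block.received.pop(block.cycle, None)
            block.cycle += 1
            if block.cycle < num_cycles:
                queue.extend((n, block.cycle) for n in block.neighbors)
    return blocks, history


blocks, history = run(num_blocks=4, num_cycles=3)
print(history)
```

Each block waits only on its own neighbors, so different blocks can be on different cycles at the same moment. That is the contrast with level-parallel scheduling, where every block on a level waits at the same synchronization point.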

19 Cello's implementation using Charm++: Charm++ program structure
- Charm++ program
  - Charm++ objects are chares
  - invoke entry methods
  - communicate via messages
- Charm++ runtime system
  - maps chares to processors
  - schedules entry methods
  - migrates chares to load balance
- Additional scalability features
  - fault tolerance: checkpoint/restart
  - dynamic load balancing
[Figure: a Charm++ program, showing chares Main, ChareA, ChareB and ChareC and the entry methods Main(), px(), py(), pw(), pz(), pv()]

22 Cello's implementation using Charm++: Charm++ collections of chares
- Chare Arrays
  - distributed array of chares
  - migratable elements
  - flexible indexing
- Chare Groups
  - one chare per processor (non-migratable)
- Chare Nodegroups
  - one chare per node (non-migratable)

25 Cello's implementation using Charm++: Three implementation strategies
1. Single chare array
   - efficient: single access
   - restricted Tree depth
2. Composite chare arrays
   - tree of chare arrays
   - less restricted Tree depth
3. Singleton chares
   - no depth restrictions
   - possible performance issues
   - how to generate...?
[Figure: one hierarchy diagram per strategy; H = Hierarchy, F = Forest, T = Tree, P = Patch, B = Block]

26 Recursively generated parallel data structures: Generating a software network
1. Start with single Seed
   - conceptually complete parallel data structure (p.d.s.) with processor range
2. grow() spawns remote Seeds
   - conceptually partitioned p.d.s. with processor subranges
   - Seeds interlinked
3. Recurse to individual elements
   - final Seeds: complete p.d.s.
   - scaffolding: previous Seeds
   - scaffolding: data structure topology
[Figure label: root seed]

27 Recursively generated parallel data structures: Using the software network
- Three types of Seed links
  - parent: reductions
  - neighbor: collaborations
  - child: distributions
- Scalable
  - links per node: O(1)
  - generation: O(log N)
- Load balancing
  - space-filling curves
  - hierarchical
- Usable in Cello
  - grids / octrees easy
  - SeedGrid, SeedTree, etc.
- Usable with MPI
  - send encoded seed creation
  - link: process rank + pointer
[Figure labels: Seed, parent Seed, neighbor Seeds, child Seeds; reduce link, neighbor links, broadcast links]
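
The Seed idea from the last two slides can be sketched as a recursive split of a processor range. This is a single-process illustration under stated assumptions: the `Seed` class below, its fields, the default fan-out of 2 and the helper functions are all hypothetical, neighbor links are omitted, and a real grow() would create its child Seeds on remote processes.

```python
class Seed:
    def __init__(self, first, last, parent=None, depth=0):
        self.first, self.last = first, last   # inclusive processor range
        self.parent = parent                  # link used for reductions
        self.children = []                    # links used for distributions
        self.depth = depth

    def grow(self, fanout=2):
        """Split the processor range into subranges, down to single processors."""
        count = self.last - self.first + 1
        if count == 1:
            return
        parts = min(fanout, count)
        start = self.first
        for i in range(parts):
            size = count // parts + (1 if i < count % parts else 0)
            child = Seed(start, start + size - 1, parent=self, depth=self.depth + 1)
            self.children.append(child)
            child.grow(fanout)
            start += size


def leaves(seed):
    """Final Seeds, one per processor."""
    if not seed.children:
        return [seed]
    return [leaf for c in seed.children for leaf in leaves(c)]


def reduce_sum(seed, values):
    """Sum one value per processor by combining results along the tree links."""
    if not seed.children:
        return values[seed.first]
    return sum(reduce_sum(c, values) for c in seed.children)


root = Seed(0, 7)
root.grow()
print([(s.first, s.last) for s in leaves(root)])
print(max(s.depth for s in leaves(root)))
```

For 8 processors the final Seeds sit at depth 3, matching the O(log N) generation count on the slide when the children at each level grow concurrently. Each Seed holds one parent link and a fixed number of child links, in line with the O(1) links per node.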

28 Summary
                   Enzo               Enzo-P / Cello
Parallelization    MPI                Charm++
SAMR               patch-based        tree-based
AMR structure      replicated         optionally distributed
Remote patches     1544 bytes         16 bytes
Timestepping       level-adaptive     block-adaptive
Block sizes        1000x variation    constant
Task scheduling    level-parallel     dependency-driven
Load balancing     patch migration    space-filling curves
Data locality      LB conflict        no LB conflict

NSF PHY, AST
