A Scalable Adaptive Mesh Refinement Framework for Parallel Astrophysics Applications

James Bordner, Michael L. Norman
San Diego Supercomputer Center
University of California, San Diego

15th SIAM Conference on Parallel Processing for Scientific Computing
16 February 2012
Outline

- Introduction
  - project overview
  - motivation
- Cello AMR
  - SAMR review
  - patch merging
  - dual-decomposition
  - message-driven execution
- Conclusions
Introduction

Cello began as a project to provide Enzo with highly scalable AMR.

Enzo: astrophysics / cosmology application
- patch-based SAMR
- MPI or MPI / OpenMP
- 18 years of development; 150K SLOC

Enzo-P / Cello: petascale fork of Enzo
- modified tree-based SAMR
- MPI or CHARM++
- 2 years of development; 25K SLOC
- work in progress!

(SLOC counts generated using David A. Wheeler's SLOCCount)
Motivation: Enzo's Strengths [image: John Wise]

- Multiple application domains: astrophysical fluid dynamics, hydrodynamic cosmology
- Rich multi-physics capabilities: fluid, particle, gravity, radiation
- Extreme resolution range: 3 levels of refinement by 2
- Hybrid MPI / OpenMP
- Active global development community: 25 developers
Motivation: Enzo's Struggles [image: Tom Abel, John Wise, Ralf Kaehler]

- Grid patch meta-data is large: 1.5 KB/patch (MPI/OpenMP helps)
- Memory fragmentation
- Mesh quality: the 2-to-1 constraint can be violated; asymmetric meshes for symmetric problems
- Load balancing: difficulty maintaining parent-child locality
- Parallel scaling: AMR overhead dominates computation
Patch-based or Tree-based SAMR? Some advantages of patch-based AMR

- Flexible patch size and shape: improved refinement efficiency
- Larger patches: smaller surface/volume ratio, reduced communication, amortized loop overhead
- Fewer patches: smaller trees, reduced meta-data
Patch-based or Tree-based SAMR? Some advantages of tree-based AMR

- Fixed block size and shape: simplified load balancing, dynamic memory reuse
- More blocks: more parallelism available
- Smaller nodes: less meta-data
- Compute only on leaf nodes: less communication
Cello AMR Overview

Cello uses a modified tree-based SAMR approach; the modifications primarily address large tree sizes.

- Patch merging to reduce node count
- Dual-decomposition to maintain parallelism
- Targeted refinement for deep hierarchies
- Message-driven execution to address many issues: dynamic scheduling, latency tolerance, overlapped communication / computation, automatic load balancing, ...
Patch Merging: The Basic Idea

[figure: a 25-leaf mesh merges to 13 leaf nodes, then splits back to 25 leaf nodes]

1. Merge patches into larger ones when possible
2. Split patches into smaller ones when necessary
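The merge step can be sketched on a quadtree (2D for brevity). This is an illustrative model, not Cello's actual data structures or merging criteria: here a `flagged` leaf is one that must stay at its refinement level, and a parent absorbs its children whenever all four are unflagged leaves.

```cpp
// Sketch of patch merging on a quadtree. Node, flagged, and merge are
// illustrative names, not the real Cello classes.
#include <array>
#include <memory>

struct Node {
    bool flagged = false;                        // leaf must stay refined here
    std::array<std::unique_ptr<Node>, 4> child;  // all null => leaf
    bool is_leaf() const { return child[0] == nullptr; }
    void refine() { for (auto& c : child) c = std::make_unique<Node>(); }
};

// Count leaf nodes -- the patches computation actually runs on.
int count_leaves(const Node& n) {
    if (n.is_leaf()) return 1;
    int sum = 0;
    for (const auto& c : n.child) sum += count_leaves(*c);
    return sum;
}

// Merge four unflagged leaf children back into their parent.
// Returns true if the subtree rooted at n could itself be merged away.
bool merge(Node& n) {
    if (n.is_leaf()) return !n.flagged;
    bool all_mergeable = true;
    for (auto& c : n.child)
        if (!merge(*c)) all_mergeable = false;   // no short-circuit: visit all
    if (all_mergeable)
        for (auto& c : n.child) c.reset();       // absorb children into parent
    return all_mergeable && !n.flagged;
}
```

For example, a root with one refined child and one flagged child starts with 7 leaves; after merging, the unneeded refined subtree collapses and 4 leaves remain.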
Patch Merging: A More Realistic Example

Cosmological structure formation: L = 38.Mpc box, 10^2 to 10^7 range in mass density

- Octree refined to L = 8 has 22737 nodes
- Balanced tree has 29617 nodes
- Patch-merging tree has 1057 nodes
Patch Merging: Summary

Could reduce AMR meta-data by 2X to 3X (even including a 25% to 50% increase in node size).

However, there are disadvantages:
- fewer patches: less available parallelism
- variable sizes: difficult to load-balance

How can we regain the advantages lost? Decompose large Patches into smaller Blocks...
Dual-Decomposition: The Basic Idea

- Hierarchy: octree-like container of distributed Patches
- Patch: distributed array of Blocks
- Block: local arrays of data (fields, particles, etc.)
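The containment relationship can be sketched as plain C++ containers (2D, serial, with illustrative names -- a real Patch is distributed, and these are not the actual Cello classes). The key property is that every Block has the same fixed size, so locating a cell within a Patch is simple index arithmetic:

```cpp
// Sketch of the dual decomposition: a Patch is a regular array of
// fixed-size Blocks; each Block owns local field data.
#include <utility>
#include <vector>

struct Block {
    int nx, ny;                  // cells per block (fixed for all blocks)
    std::vector<double> field;   // local field data
    Block(int nx, int ny) : nx(nx), ny(ny), field(nx * ny, 0.0) {}
};

struct Patch {
    int px, py;                  // blocks per patch in each dimension
    std::vector<Block> block;    // distributed in Cello; local here
    Patch(int px, int py, int nx, int ny) : px(px), py(py) {
        for (int i = 0; i < px * py; ++i) block.emplace_back(nx, ny);
    }
    // Map a patch-global cell (gx,gy) to (owning block index,
    // cell index within that block's field array).
    std::pair<int,int> locate(int gx, int gy) const {
        const Block& b = block[0];
        int ib = (gy / b.ny) * px + (gx / b.nx);    // which block
        int ic = (gy % b.ny) * b.nx + (gx % b.nx);  // which cell in it
        return {ib, ic};
    }
};
```

Because block size and shape are fixed, this arithmetic needs no per-block meta-data, which is what keeps the intra-patch case as cheap as a uniform mesh.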
Dual-Decomposition: Communication Patterns

Intra-patch block update
- neighbor blocks in the same patch
- a distributed uniform-mesh problem
- regular communication patterns: efficient and scalable

Inter-patch block update
- neighbor blocks in neighboring patches
- standard coarse/fine interface update
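The distinction between the two update paths reduces to a bounds check on the neighbor's block coordinates. A minimal sketch (illustrative names; Cello's actual refresh machinery is richer):

```cpp
// Classify a face update as intra-patch (regular uniform-mesh exchange)
// or inter-patch (coarse/fine interface through the hierarchy).
enum class Update { intra_patch, inter_patch };

// (bx,by): block coordinates within a patch of px-by-py blocks;
// (dx,dy): face direction, each component in {-1, 0, +1}.
Update classify(int bx, int by, int px, int py, int dx, int dy) {
    int nbx = bx + dx, nby = by + dy;
    bool inside = (0 <= nbx && nbx < px) && (0 <= nby && nby < py);
    return inside ? Update::intra_patch : Update::inter_patch;
}
```

Interior blocks take only the cheap intra-patch path; only blocks on a patch boundary ever pay for the coarse/fine interface logic.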
Dual-Decomposition: Summary

- Regains the parallelism lost in patch merging
- Maintains the same underlying computational mesh
- Replaces some subtrees with arrays
- Embeds unigrid efficiency in uniformly refined subregions
Message-driven Execution: What is CHARM++?

CHARM++ is a parallel language and runtime system.

Provides processor virtualization:
- multiple objects per physical processor
- the runtime system schedules object methods

Important advantages for AMR:
- asynchronous: latency-tolerant
- well-suited to complex, dynamic applications

Also provides:
- fault tolerance
- dynamic load balancing
- checkpoint/restart
Message-driven Execution: What is CHARM++? (continued)

The programmer sees a collection of objects:
- CHARM++ objects are called chares
- chares send messages to each other
- messages are remote function calls: entry methods

The CHARM++ runtime system:
- maps chares to physical processors
- schedules entry methods for execution
- migrates chares to balance load
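The execution model can be mimicked in a few lines of plain C++: objects never call each other directly, they enqueue messages, and a scheduler delivers each message by invoking an entry method. This is a toy model of the CHARM++ style, not the CHARM++ API:

```cpp
// Toy message-driven runtime: a queue of pending entry-method invocations
// and objects ("chares") that communicate only by sending messages.
#include <deque>
#include <functional>
#include <utility>

struct Scheduler {
    std::deque<std::function<void()>> queue;   // pending entry-method calls
    void send(std::function<void()> msg) { queue.push_back(std::move(msg)); }
    void run() {                               // deliver until quiescence
        while (!queue.empty()) {
            auto msg = std::move(queue.front());
            queue.pop_front();
            msg();                             // execute the entry method
        }
    }
};

struct Chare {
    int received = 0;
    // Entry method: handle a message, then asynchronously message a peer.
    void entry(Scheduler& s, Chare& peer, int hops) {
        ++received;
        if (hops > 0)                          // send returns immediately
            s.send([this, &s, &peer, hops] { peer.entry(s, *this, hops - 1); });
    }
};
```

Because `send` returns immediately, the sender never blocks waiting for the receiver -- the scheduler is free to run whatever work is ready, which is the source of the latency tolerance noted above.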
Message-driven Execution: CHARM++ Entities

CHARM++ supports collections of chares:
- Chare Arrays: distributed arrays of chares; elements are migratable
- Chare Groups: one chare per processor (non-migratable)
- Chare Nodegroups: one chare per node (non-migratable)
Message-driven Execution: CHARM++ Entities in Cello

[figure: mainchare, Simulation, Patch, and Block chares mapped to processes 0 through P-1]

- The mainchare creates a Simulation chare group
- Each Simulation contains some Patch chares
- Each Patch contains a chare array of Blocks
Control Flow in Enzo-P / Cello

Current (unigrid) Enzo-P control flow:
1. Initialize: create chares and chare arrays; set initial conditions
2. Refresh ghost zones
3. Calculate timestep: currently involves a global reduction (this should be avoidable)
4. Compute!

[figure: control flow through the Main, Simulation, Patch, and Block chares]
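The cycle above can be sketched serially (illustrative names; in Enzo-P these steps are driven through CHARM++ entry methods, and the refresh step involves real communication). Step 3 is the global reduction the slide flags: the shared timestep is the minimum of every block's locally stable timestep.

```cpp
// Serial sketch of one unigrid Enzo-P cycle: refresh, reduce dt, compute.
#include <algorithm>
#include <vector>

struct SimBlock {
    double dt_local;     // largest timestep stable for this block
    double t = 0.0;      // simulation time advanced so far
};

// Step 3: global min-reduction over all blocks' local timesteps.
double reduce_timestep(const std::vector<SimBlock>& blocks) {
    double dt = blocks[0].dt_local;
    for (const auto& b : blocks) dt = std::min(dt, b.dt_local);
    return dt;
}

void cycle(std::vector<SimBlock>& blocks) {
    // Step 2: refresh ghost zones (communication; omitted here)
    // Step 3: every block must agree on dt before any block can compute
    double dt = reduce_timestep(blocks);
    // Step 4: compute -- advance every block by the shared dt
    for (auto& b : blocks) b.t += dt;
}
```

The reduction is a global synchronization point, which is why the slide notes it should be avoidable: one slow block stalls every other block's compute phase.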
Conclusions

A beta version of Enzo-P / Cello is available at cello-project.org.

- Uniform Cartesian mesh (AMR ETA: 3-6 months)
- CHARM++ or MPI
- Blocks contain arrays of field variables: controllable precision, ordering, padding, alignment, etc.
- PPM hydrodynamics, PPML MHD
- HDF5 I/O on every k processes, 1 <= k <= P
Enzo-P / Cello

- Enzo-P: petascale astrophysics and cosmology application
- Cello: scalable tree-based AMR framework
- Patch merging + dual-decomposition: suited to embedded uniformly refined regions; estimated 3X fewer nodes
- Message-driven execution using CHARM++: especially suitable for huge, complex, dynamic problems; latency-tolerant, automatic load balancing, checkpointing, etc.
- Targeted refinement for deep AMR

Website: http://cello-project.org/
Listserv: https://mailman.ucsd.edu/mailman/listinfo/cello-l
Email: jobordner@ucsd.edu