A Scalable Adaptive Mesh Refinement Framework For Parallel Astrophysics Applications

James Bordner, Michael L. Norman
San Diego Supercomputer Center, University of California, San Diego

15th SIAM Conference on Parallel Processing for Scientific Computing, 16 February 2012

Outline

- Introduction: project overview, motivation
- Cello AMR: SAMR review, patch merging, dual-decomposition, message-driven execution
- Conclusions

Introduction

Cello began as a project to provide Enzo with highly scalable AMR.

- Enzo: astrophysics / cosmology application
  - patch-based SAMR
  - MPI or MPI / OpenMP
  - 18 years of development; 150K SLOC*
- Enzo-P / Cello: petascale fork of the Enzo code
  - modified tree-based SAMR
  - MPI or CHARM++
  - 2 years of development; 25K SLOC*
  - work in progress!

* SLOC counts generated using David A. Wheeler's SLOCCount

Motivation: Enzo's Strengths [image credit: John Wise]

- Multiple application domains: astrophysical fluid dynamics, hydrodynamic cosmology
- Rich multi-physics capabilities: fluid, particle, gravity, radiation
- Extreme resolution range: many levels of refinement by 2
- Hybrid MPI / OpenMP
- Active global development community: ~25 developers

Motivation: Enzo's Struggles [image credits: Tom Abel, John Wise, Ralf Kaehler]

- Grid patch meta-data is large: 1.5 KB/patch (MPI/OpenMP helps)
- Memory fragmentation
- Mesh quality: the 2-to-1 constraint can be violated; asymmetric mesh for a symmetric problem
- Load balancing: difficulty maintaining parent-child locality
- Parallel scaling: AMR overhead dominates computation

Patch-based or Tree-based SAMR? Some advantages of patch-based AMR

- Flexible patch size and shape: improved refinement efficiency
- Larger patches: smaller surface/volume ratio, reduced communication, amortized loop overhead (see the sketch below)
- Fewer patches: smaller trees, reduced meta-data
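To make the surface/volume argument concrete, here is a minimal C++ sketch (not from the talk) that computes the fraction of a cubic patch occupied by ghost zones, for N^3 active cells with g ghost layers on each face; the fraction shrinks as N grows, which is the communication advantage of larger patches.

```cpp
#include <cstdio>

// Ghost-zone fraction of a padded patch: ((N + 2g)^3 - N^3) / (N + 2g)^3.
double ghost_fraction(int N, int g) {
  double padded = double(N + 2 * g);
  double total  = padded * padded * padded;
  double active = double(N) * N * N;
  return (total - active) / total;
}

int main() {
  const int g = 3;  // assume 3 ghost layers per face (typical for PPM-like stencils)
  const int sizes[] = {8, 16, 32, 64, 128};
  for (int N : sizes)
    std::printf("N = %3d: %.0f%% of stored cells are ghost zones\n",
                N, 100.0 * ghost_fraction(N, g));
  return 0;
}
```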

Patch-based or Tree-based SAMR? Some advantages of tree-based AMR

- Fixed block size and shape: simplified load balancing, dynamic memory reuse (a minimal node structure is sketched below)
- More blocks: more parallelism available
- Smaller nodes: less meta-data
- Compute only on leaf nodes: less communication
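As an illustration of why fixed-size blocks keep tree nodes small, here is a minimal C++ sketch (hypothetical, not the Cello data structures) of an octree whose leaves each carry a fixed N^3 block of field data, so a node needs little more than child pointers and a refinement level.

```cpp
#include <array>
#include <memory>
#include <vector>

constexpr int kBlockSize = 16;  // fixed cells per axis in every leaf block (assumed)

// Field data owned by a leaf: one fixed-size N^3 array per field.
struct Block {
  std::vector<std::array<double, kBlockSize * kBlockSize * kBlockSize>> fields;
};

// Tree node: either a leaf with a Block, or an interior node with 8 children.
// Because block size and shape are fixed, the per-node meta-data is tiny.
struct Node {
  int level = 0;                               // refinement level
  std::array<std::unique_ptr<Node>, 8> child;  // all null => leaf
  std::unique_ptr<Block> block;                // non-null only on leaves

  bool is_leaf() const { return !child[0]; }

  // Refine a leaf: create 8 children one level deeper; only leaves compute,
  // so the parent's block can be released and its storage reused.
  void refine() {
    for (auto& c : child) {
      c = std::make_unique<Node>();
      c->level = level + 1;
      c->block = std::make_unique<Block>();
    }
    block.reset();
  }
};
```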

Cello AMR Overview

Cello uses a modified tree-based SAMR approach; the modifications primarily address large tree sizes.

- Patch merging to reduce node count
- Dual-decomposition to maintain parallelism
- Targeted refinement for deep hierarchies
- Message-driven execution to address many issues: dynamic scheduling, latency tolerance, overlap of communication and computation, automatic load balancing, ...

Patch Merging: The Basic Idea

[Figure: example meshes with 25, 13, and 25 leaf nodes, illustrating merging and splitting.]

1. Merge patches into larger ones when possible (see the sketch below)
2. Split patches into smaller ones when necessary
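As a toy illustration of step 1, here is a minimal C++ sketch, not taken from Cello, that counts leaf patches in a quadtree with and without representing each uniformly refined subtree by a single merged patch; merging is what reduces the leaf count in the figure above.

```cpp
#include <array>
#include <memory>

// Toy quadtree node (2D for brevity; Cello's trees are octrees).
// Interior nodes are assumed to have all four children populated.
struct Node {
  std::array<std::unique_ptr<Node>, 4> child;
  bool is_leaf() const { return !child[0]; }
};

// Depth of the subtree if all of its leaves lie at the same depth
// (i.e. the region is uniformly refined); -1 otherwise.
int uniform_depth(const Node& n) {
  if (n.is_leaf()) return 0;
  int depth = -2;
  for (const auto& c : n.child) {
    int d = uniform_depth(*c);
    if (d < 0) return -1;            // child region is not uniformly refined
    if (depth == -2) depth = d;
    else if (d != depth) return -1;  // siblings refined to different depths
  }
  return depth + 1;
}

// Count leaf patches; with merging, a uniformly refined subtree counts as one
// merged patch instead of 4^depth small leaves.
int count_leaves(const Node& n, bool merge) {
  if (n.is_leaf()) return 1;
  if (merge && uniform_depth(n) > 0) return 1;
  int count = 0;
  for (const auto& c : n.child) count += count_leaves(*c, merge);
  return count;
}
```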

Patch Merging: A More Realistic Example

Cosmological structure formation: box size L ≈ 38 Mpc, mass density spanning roughly 10^2 to 10^7.

- Octree refined to level 8 has 22737 nodes
- Balanced tree has 29617 nodes
- Patch-merging tree has 1057 nodes

Patch Merging: Summary

- Could reduce AMR meta-data by a factor of 2 to 3 (including a 25% to 50% increase in node size)
- However, there are disadvantages:
  - fewer patches: less available parallelism
  - variable sizes: difficult to load-balance
- How can we regain the lost advantages? Decompose large Patches into smaller Blocks...

Dual-Decomposition: The Basic Idea

[Figure: Hierarchy → Patch → Block]

- Hierarchy: octree-like container of distributed Patches (sketched below)
- Patch: distributed array of Blocks
- Block: local arrays of data (fields, particles, etc.)
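A minimal C++ sketch (hypothetical names, not the Cello classes) of the containment relationship described on this slide: a Hierarchy of Patches, each Patch a regular array of fixed-size Blocks.

```cpp
#include <array>
#include <memory>
#include <vector>

constexpr int kBlockCells = 16;  // assumed fixed cells per axis in a Block

// Block: local arrays of data (a single field here, for brevity).
struct Block {
  std::array<double, kBlockCells * kBlockCells * kBlockCells> field{};
};

// Patch: a regular nb[0] x nb[1] x nb[2] array of Blocks covering one merged
// tree node.  In the real framework the Blocks are distributed; here they are local.
struct Patch {
  std::array<int, 3> nb;                        // number of Blocks per axis
  std::vector<std::unique_ptr<Block>> blocks;   // size nb[0]*nb[1]*nb[2]

  Block& block(int ix, int iy, int iz) {
    return *blocks[(iz * nb[1] + iy) * nb[0] + ix];
  }
};

// Hierarchy: octree-like container of Patches; each tree node that survives
// patch merging owns one Patch.
struct HierarchyNode {
  int level = 0;
  std::array<std::unique_ptr<HierarchyNode>, 8> child;  // all null => leaf
  std::unique_ptr<Patch> patch;                          // populated on leaves
};
```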

Dual-Decomposition: Communication Patterns

[Figure: intra-patch block update vs. inter-patch block update]

- Intra-patch block update: neighbor blocks in the same patch; a distributed uniform-mesh problem with regular communication patterns; efficient and scalable (see the classification sketch below)
- Inter-patch block update: neighbor blocks in neighboring patches; standard coarse/fine interface update
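A minimal sketch (hypothetical types) of how a block's ghost update could be classified as intra- or inter-patch: a neighbor index that stays within the patch's block array is a regular same-level exchange, while one that falls outside crosses into a neighboring patch and needs the coarse/fine interface machinery.

```cpp
#include <array>

// Index of a Block within its Patch's nb[0] x nb[1] x nb[2] block array.
struct BlockIndex { std::array<int, 3> i; };

enum class Update { IntraPatch, InterPatch };

// Classify the ghost-zone exchange of block `b` with its neighbor in direction
// `dir` (each component in {-1, 0, +1}).  If the neighbor index stays inside
// the patch, this is a regular intra-patch update; otherwise the neighbor
// lives in an adjacent patch, possibly at another refinement level.
Update classify(const std::array<int, 3>& nb, const BlockIndex& b,
                const std::array<int, 3>& dir, BlockIndex* neighbor) {
  BlockIndex n = b;
  for (int a = 0; a < 3; ++a) {
    n.i[a] += dir[a];
    if (n.i[a] < 0 || n.i[a] >= nb[a]) return Update::InterPatch;
  }
  if (neighbor) *neighbor = n;
  return Update::IntraPatch;
}
```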

Dual-Decomposition: Summary

- Regains the parallelism lost in patch merging
- Maintains the same underlying computational mesh
- Replaces some subtrees with arrays
- Embeds unigrid efficiency in uniformly refined subregions

[Figure: Hierarchy → Patch → Block]

Message-driven Execution: What is CHARM++?

CHARM++ is a parallel language and runtime system.

- Provides processor virtualization: multiple objects per physical processor; the runtime system schedules object methods
- Important advantages for AMR: asynchronous, latency-tolerant, well suited to complex, dynamic applications
- Also provides fault tolerance, dynamic load balancing, checkpoint/restart

Message-driven Execution: What is CHARM++? (continued)

- The programmer sees a collection of objects
  - CHARM++ objects are called chares
  - chares send messages to each other via remote function calls: entry methods
- The CHARM++ runtime system
  - maps chares to physical processors
  - schedules entry methods for execution
  - migrates chares to balance load

Message-driven Execution: CHARM++ Entities

CHARM++ supports collections of chares (declared in the interface sketch below):

- Chare Arrays: distributed arrays of chares; migratable elements
- Chare Groups: one chare per processor (non-migratable)
- Chare Nodegroups: one chare per node (non-migratable)
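For concreteness, here is a minimal CHARM++ interface-file (.ci) sketch, with a hypothetical module name and chare names, showing how the three kinds of collections on this slide are declared; the actual Cello interface files will differ.

```
// entities.ci -- hypothetical Charm++ interface file (not from Cello)
mainmodule entities {
  // Mainchare: program entry point, created once at startup.
  mainchare Main {
    entry Main(CkArgMsg* msg);
  };

  // Chare array: a distributed, migratable 3D array of chares.
  array [3D] Block {
    entry Block();
    entry void refresh();      // e.g. exchange ghost zones
  };

  // Chare group: exactly one (non-migratable) chare per processor.
  group Simulation {
    entry Simulation();
  };

  // Chare nodegroup: exactly one (non-migratable) chare per node.
  nodegroup NodeState {
    entry NodeState();
  };
};
```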

Message-driven Execution: CHARM++ Entities in Cello

[Figure: Main, Simulation, Patch, and Block chares laid out across processes 0 through P-1]

- The mainchare creates a Simulation chare group
- Each Simulation contains some Patch chares
- Each Patch contains a chare array of Blocks (see the sketch below)
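A minimal CHARM++ C++ sketch of the wiring described on this slide; it assumes a declaration header generated from a hypothetical cello_sketch.ci declaring Main, Simulation, and Block (similar to the previous sketch, minus the nodegroup). Names and structure are illustrative, not the actual Cello source.

```cpp
#include "cello_sketch.decl.h"  // hypothetical generated declarations

// Mainchare: program entry point; creates the Simulation chare group.
class Main : public CBase_Main {
 public:
  Main(CkArgMsg* msg) {
    delete msg;
    CProxy_Simulation::ckNew();   // one Simulation chare per processor
  }
};

// Simulation group chare: in this sketch, processor 0 creates a single 3D
// chare array of Blocks; in Cello, Blocks are organized under Patch chares.
class Simulation : public CBase_Simulation {
 public:
  Simulation() {
    if (CkMyPe() == 0) {
      CProxy_Block blocks = CProxy_Block::ckNew(4, 4, 4);  // 4x4x4 Blocks (assumed)
      blocks.refresh();   // broadcast an entry method to every Block
    }
  }
};

// Block array element: owns local field arrays and performs the computation.
class Block : public CBase_Block {
 public:
  Block() { /* allocate local field data */ }
  void refresh() { /* exchange ghost zones with neighboring Blocks */ }
};

#include "cello_sketch.def.h"
```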

Control Flow in Enzo-P / Cello

Current (unigrid) Enzo-P control flow:

1. Initialize: create chares and chare arrays; set initial conditions
2. Refresh ghost zones
3. Calculate timestep (currently involves a global reduction; this should be avoidable; see the reduction sketch below)
4. Compute!

[Figure: control flow across the Main, Simulation, Patch, and Block chares]
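Step 3's global reduction maps naturally onto CHARM++'s asynchronous reductions. Here is a hedged sketch, continuing the hypothetical cello_sketch module above and not taken from the Enzo-P source: each Block contributes its local timestep, and the runtime delivers the minimum to a reduction target. courant_timestep(), p_timestep(), p_set_dt(), p_compute(), main_proxy, and blocks_ are all assumed names.

```cpp
// p_timestep() and p_set_dt() would be declared as entry methods in
// cello_sketch.ci, with p_set_dt marked [reductiontarget]; main_proxy is
// assumed to be a readonly CProxy_Main.
void Block::p_timestep() {
  double dt_local = courant_timestep();   // hypothetical local CFL estimate
  CkCallback cb(CkReductionTarget(Main, p_set_dt), main_proxy);
  contribute(sizeof(double), &dt_local, CkReduction::min_double, cb);
}

void Main::p_set_dt(double dt_global) {
  // dt_global is the minimum timestep over all Blocks; start the compute phase
  // by broadcasting to the Block chare array (hypothetical proxy member).
  blocks_.p_compute(dt_global);
}
```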

Conclusions

A beta version of Enzo-P / Cello is available at cello-project.org.

- Uniform Cartesian mesh (AMR ETA 3-6 months)
- CHARM++ or MPI
- Blocks contain arrays of field variables: controllable precision, ordering, padding, alignment, etc.
- PPM hydrodynamics, PPML MHD
- HDF5 I/O on k of the P processes, 1 ≤ k ≤ P

Enzo-P / Cello: Summary

- Enzo-P: petascale astrophysics and cosmology application
- Cello: scalable tree-based AMR framework
- Patch merging + dual-decomposition: suitable for embedded uniformly refined regions; an estimated 3x fewer nodes
- Message-driven execution using CHARM++: especially suitable for huge, complex, dynamic problems; latency tolerant, automatic load balancing, checkpointing, etc.
- Targeted refinement for deep AMR

Website: http://cello-project.org/
Listserv: https://mailman.ucsd.edu/mailman/listinfo/cello-l
Email: jobordner@ucsd.edu