Fault tolerant issues in large scale applications

Fault tolerant issues in large scale applications
Romain Teyssier
George Lake, Ben Moore, Joachim Stadel and the other members of the project «Cosmology at the petascale»
SPEEDUP 2010

Outline
Computational cosmology: an overview
Our 2 massively parallel applications:
- RAMSES
- PKDGRAV
Fault-tolerant issues in large scale simulations
Some solutions at the application level:
- efficient checkpoint restart
- «divide and conquer»

Cosmological simulations
From Gaussian random fields to galaxies: nonlinear dynamics of gravitational instability with N-body and hydrodynamics codes.

Cosmological simulations
Springel et al., Nature, 2006

Hubble Ultra Deep Field

Adaptive Mesh Refinement with Octree

RAMSES: a parallel AMR code
Graded octree structure: the Cartesian mesh is refined on a cell-by-cell basis.
Full connectivity: each oct has direct access to the neighboring parent cells and to the children octs (memory overhead: 2 integers per cell). This optimizes the mesh adaptivity to complex geometry, but the CPU overhead can be as large as 50%.
N-body module: Particle-Mesh method on the AMR grid (similar to the ART code). The Poisson equation is solved using a multigrid solver.
Hydro module: unsplit second-order Godunov method with various Riemann solvers and slope limiters.
Time integration: single time step or sub-cycling of the fine levels.
Other: radiative cooling and heating, star formation and feedback.
MPI-based parallel computing using a time-dependent domain decomposition based on Peano-Hilbert cell ordering.
Download at http://irfu.cea.fr/projets/site_ramses
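
A minimal C sketch of what such a fully connected oct could look like; the field names and layout are illustrative, not RAMSES's actual Fortran data structures.

#include <stdio.h>

#define NVAR 5   /* density, 3 momenta, total energy (illustrative) */

/* A fully connected oct: a group of 2x2x2 cells at one refinement level.
 * Indices refer to hypothetical global cell/oct arrays. */
typedef struct {
    int    parent_cell;       /* coarser cell hosting this oct                 */
    int    neighbor_cell[6];  /* the 6 neighboring parent cells, for direct    */
                              /* stencil access across level boundaries        */
    int    child_oct[8];      /* finer oct inside each of the 8 cells, or -1   */
    int    level;             /* refinement level of this oct                  */
    double var[8][NVAR];      /* conserved variables of the 8 cells            */
} oct_t;

int main(void) {
    /* (1 + 6 + 8 + 1) = 16 integers of connectivity per oct of 8 cells,
     * i.e. the ~2 integers per cell of overhead quoted on the slide. */
    printf("connectivity integers per cell: %d\n", (1 + 6 + 8 + 1) / 8);
    printf("sizeof(oct_t) = %zu bytes\n", sizeof(oct_t));
    return 0;
}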

Domain decomposition for parallel computing
Parallel computing using the MPI library, with a domain decomposition based on the Peano-Hilbert curve for adaptive, tree-based data structures.
A Peano-Hilbert binary hash key is used for the domain decomposition (MPI).
Hilbert ordering for optimal data placement in local memory (OpenMP).
Data compression based on 6D Hilbert indexing.
Implemented in our 2 codes:
- PKDGRAV (TREE + SPH) by J. Stadel
- RAMSES (PIC + AMR) by R. Teyssier
Weak scaling up to 20,000 cores. Dynamical load balancing.
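
A toy C sketch of the space-filling-curve decomposition described above. A Morton (Z-order) key is used as a simpler stand-in for the Peano-Hilbert key, and the work per cell is assumed uniform, so the sorted key range is simply cut into equal-count chunks; this is not the actual RAMSES/PKDGRAV implementation.

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* Interleave the bits of (ix, iy, iz) into a 1D key (Morton order). */
static uint64_t morton_key(uint32_t ix, uint32_t iy, uint32_t iz) {
    uint64_t key = 0;
    for (int b = 0; b < 21; b++) {
        key |= ((uint64_t)(ix >> b) & 1) << (3*b + 0);
        key |= ((uint64_t)(iy >> b) & 1) << (3*b + 1);
        key |= ((uint64_t)(iz >> b) & 1) << (3*b + 2);
    }
    return key;
}

static int cmp_key(const void *a, const void *b) {
    uint64_t ka = *(const uint64_t *)a, kb = *(const uint64_t *)b;
    return (ka > kb) - (ka < kb);
}

int main(void) {
    enum { NCELL = 1000, NDOMAIN = 4 };
    uint64_t key[NCELL];
    for (int i = 0; i < NCELL; i++)
        key[i] = morton_key(rand() & 1023, rand() & 1023, rand() & 1023);

    qsort(key, NCELL, sizeof key[0], cmp_key);

    /* Equal-count split: domain d owns the keys in [key[lo], key[hi]]. */
    for (int d = 0; d < NDOMAIN; d++) {
        int lo = d * NCELL / NDOMAIN, hi = (d + 1) * NCELL / NDOMAIN - 1;
        printf("domain %d: keys %016llx .. %016llx\n",
               d, (unsigned long long)key[lo], (unsigned long long)key[hi]);
    }
    return 0;
}

In the real codes the cut positions are derived from a measured work estimate per cell and move in time, which is the dynamical load balancing mentioned above.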

Load-balancing issue
Scaling depends on problem size and complexity. A large dynamic range in density implies a large dynamic range in time steps.
Main source of load imbalance: multiple time steps and multiple species (stars, gas, DM). Strong-scaling example.
Problem: load balancing is performed globally, so during the intermediate time steps many particles are idle.
Solution: multiple trees, each individually load balanced.

The Horizon simulation
Billion-dollar experiments need support from HPC. http://www.projet-horizon.fr
70 billion dark matter particles
140 billion AMR cells
RAMSES code
6144 cores, 2.4 GB per core
Wall-clock time < 2 months
Performed in 2007 on the CCRT BULL Novascale 3045
Teyssier et al. 2009; Sousbie et al. 2009; Pichon et al. 2010
(Figure scale labels: 6 billion light years, 13 billion light years.)

Cosmological N-body simulations
Moore's law (courtesy of V. Springel)

The MareNostrum simulation
1 billion dark matter particles
3 billion AMR cells
100 million «star» particles
100,000 galaxies
2048 cores, 2 GB per core
Wall-clock time: 4 weeks
RAMSES code
Performed in 2006/2007 on the IBM PowerPC 970 at BSC
Dekel et al. 2009; Ocvirk et al. 2009; Devriendt et al. 2010

Zoom-in simulations
Zoom-in strategy: focus the computational resources on a particular Region-Of-Interest and degrade the rest of the box. Much more demanding than full periodic-box simulations.
N = 100,000, 1,000,000, 10,000,000
From the overmerging problem to the missing satellites problem (Moore et al. 1999).

The GHALO project
1 billion particles within the virial radius
1024 cores, 2 GB per core, on MareNostrum
PKDGRAV code
Stadel et al. 2009; Zemp et al. 2009; Kuhlen et al. 2010

At the heart of the halo: galaxy formation

Fault-tolerant issues for our large scale simulations
Barcelona simulations (MareNostrum and GHALO):
- time to failure = 1 day
- checkpoint restart file = 100 to 200 GB
- strong GPFS file-system contention: time to write = 3 hours
Horizon simulation (pre-commissioning mode):
- time to failure = 1 hour
- checkpoint restart file = 3 TB
- very efficient LUSTRE file system: time to write = 5 minutes
Requirement for a fault-tolerant application: time to write < 10% of the time to failure. The time to write depends on the number of I/O nodes.
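
A back-of-the-envelope check of this requirement, as a tiny C sketch; the figures are the ones quoted above.

#include <stdio.h>

int main(void) {
    struct { const char *run; double mttf_min, write_min; } cfg[] = {
        { "MareNostrum/GHALO (GPFS)", 24.0 * 60.0, 3.0 * 60.0 },  /* 1 day, 3 h   */
        { "Horizon (LUSTRE)",          60.0,        5.0       },  /* 1 h,   5 min */
    };
    for (int i = 0; i < 2; i++) {
        double ratio = cfg[i].write_min / cfg[i].mttf_min;
        printf("%-28s checkpoint overhead = %4.1f%%  -> %s\n",
               cfg[i].run, 100.0 * ratio, ratio < 0.10 ? "OK" : "too high");
    }
    return 0;
}

With these numbers the Barcelona runs spend about 12.5% of the time to failure writing checkpoints, above the 10% target, while the Horizon run stays near 8%.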

What is a failure in our applications?
From very basic to very subtle failures:
- file system failure (hardware or software) during a write
- hardware node failure
- MPI communication failure
- MPI communication completed, but data corrupted
How to detect a failure?
- log file size not increasing for more than X minutes (see the watchdog sketch below)
- size of the checkpoint restart file not correct
- NaN
- energy conservation errors and very small time step sizes
When do we need to detect a failure?
- detection should be more frequent than checkpoints
- run a daemon with an alarm
- stop and restart
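
A minimal C watchdog sketch for the "log file not growing" test in the list above; the file name and the stall threshold are placeholders, and a real daemon would also check checkpoint file sizes and scan for NaNs or energy errors.

#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <time.h>
#include <unistd.h>

int main(void) {
    const char *logfile = "run.log";      /* hypothetical log file name   */
    const int   stall_s = 30 * 60;        /* alarm if idle for 30 minutes */
    off_t  last_size   = -1;
    time_t last_change = time(NULL);

    for (;;) {
        struct stat st;
        if (stat(logfile, &st) == 0 && st.st_size != last_size) {
            last_size   = st.st_size;     /* log is still growing */
            last_change = time(NULL);
        } else if (time(NULL) - last_change > stall_s) {
            fprintf(stderr, "watchdog: %s has not grown for %d s, "
                            "stopping and restarting from the last checkpoint\n",
                    logfile, stall_s);
            /* here: kill the job and resubmit it */
            exit(EXIT_FAILURE);
        }
        sleep(60);                        /* poll once per minute */
    }
}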

Large scale applications and fault tolerance
The time to failure goes down as system and problem size scale up (Schroeder & Gibson 2005).
Massively parallel applications use a flat model of message exchange: a failure in one node results in the failure of the entire computation.
Because gravity is a long-range force, present-day simulations need to access the whole computational volume (fully-coupled mode).
2 solutions:
- efficient checkpoint restart writes of the global data (dense parallel I/O system)
- divide the volume into independent compute elements

Speeding up the checkpoint restart dump (1/2)
(Diagram: a blade center with Blades #1 to #4, compute cores and I/O cores; the compute cores call MPI_Send, the I/O cores copy from /scratch to /gpfs.)
Output files are written several times per run (one every 2 hours). We need to minimize the computational downtime due to parallel file system contention (competing disc accesses).
Compute group: 2048 compute cores: MPI_COMM_WORLD_COMP
I/O group: 1 I/O core per 32 compute cores: MPI_COMM_WORLD_IOGRP (64 dedicated I/O cores)
Compute core:
- the write is replaced by an MPI_Send
- computation resumes once all data has been received
I/O core:
- wait for the compute cores to send their data
- dump the data to the local scratch disc
- transfer the data from local scratch to GPFS
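
A stripped-down C/MPI sketch of this compute-group / I/O-group scheme. The communicator split, the 32:1 ratio and the MPI_Send replacing the write follow the slide; the payload size, file paths and error handling are placeholders.

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* 1 I/O core per 32 compute cores: every 33rd rank becomes an I/O core
     * (assumes size is a multiple of 33; 2048 compute + 64 I/O = 2112). */
    int is_io = (rank % 33 == 32);
    MPI_Comm subcomm;   /* MPI_COMM_WORLD_COMP or MPI_COMM_WORLD_IOGRP */
    MPI_Comm_split(MPI_COMM_WORLD, is_io, rank, &subcomm);

    enum { NBYTES = 1 << 20 };            /* checkpoint payload (placeholder) */
    char *buf = malloc(NBYTES);

    if (!is_io) {
        /* Compute core: the file write is replaced by an MPI_Send to the
         * I/O core serving this block of 32 ranks; computation then resumes. */
        int io_rank = (rank / 33) * 33 + 32;
        MPI_Send(buf, NBYTES, MPI_BYTE, io_rank, 0, MPI_COMM_WORLD);
    } else {
        /* I/O core: receive from its 32 compute cores, dump each buffer to
         * the local scratch disc, then copy everything to GPFS afterwards. */
        for (int i = 0; i < 32; i++) {
            MPI_Status st;
            MPI_Recv(buf, NBYTES, MPI_BYTE, MPI_ANY_SOURCE, 0,
                     MPI_COMM_WORLD, &st);
            char path[256];
            snprintf(path, sizeof path, "/scratch/ckpt_rank%05d.dat",
                     st.MPI_SOURCE);
            FILE *f = fopen(path, "wb");
            if (f) { fwrite(buf, 1, NBYTES, f); fclose(f); }
        }
        /* e.g. system("cp /scratch/ckpt_*.dat /gpfs/run/"); */
    }

    free(buf);
    MPI_Comm_free(&subcomm);
    MPI_Finalize();
    return 0;
}

Writing to the local scratch disc decouples the compute cores from the GPFS contention; the copy to the parallel file system is then done by the I/O cores while the computation has already resumed.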

Speeding up the checkpoint restart dump (2/2)
Data compression of the simulation snapshots. Each particle belongs to a time step group: slowly varying particles are stored less frequently, while a small, rapidly varying subset of the data is stored more frequently. When restarting the application from a checkpoint, the missing data needs to be reconstructed (error control).
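
A small C sketch of what such a per-time-step-group dump policy could look like. The group convention and the factor of two between groups are assumptions for illustration; only the idea of writing the slowly varying particles less often comes from the slide.

#include <stdio.h>

/* Particles in the fastest group (g = GMAX) are written at every dump;
 * each slower group is written half as often as the next faster one. */
enum { GMAX = 3 };

static int write_group(int dump, int g) {
    return dump % (1 << (GMAX - g)) == 0;  /* g = GMAX: every dump; g = 0: every 8th */
}

int main(void) {
    for (int dump = 0; dump < 8; dump++) {
        printf("dump %d stores groups:", dump);
        for (int g = 0; g <= GMAX; g++)
            if (write_group(dump, g)) printf(" %d", g);
        printf("\n");
    }
    return 0;
}

At restart, the groups that were not written at the latest dump have to be reconstructed from their last stored state, which is where the error control mentioned above comes in.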

Brute-force checkpoint restart induces overhead
DARPA Exascale Report on Resilience

Divide and conquer: multiple zooms
Fast multipole method: using an octree decomposition, distant particles are grouped together on coarser nodes into a pseudo-particle (multipole approximation of the force).
Zoom-in method: distant particles are actually replaced by the pseudo-particle. The error is controlled by the «opening angle».
(Figure panels: decoupled mode vs. fully-coupled mode.)
Idea: use the zoom-in technique to segment the computational volume into independent zoom simulations. Distant particles are grouped together into massive particles and evolved locally: this maximizes data locality at the price of degraded accuracy and some overhead.
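
A minimal C illustration of the opening-angle criterion in its simplest (Barnes-Hut, monopole-only) form; this is not PKDGRAV's actual multipole acceptance test.

#include <math.h>
#include <stdio.h>

typedef struct { double x[3], mass, size; } node_t;  /* tree node / pseudo-particle */

/* Return 1 if node n, seen from position pos, may be used as a single
 * pseudo-particle rather than being opened into its children. */
static int accept_as_pseudo_particle(const node_t *n, const double pos[3],
                                     double theta) {
    double d2 = 0.0;
    for (int k = 0; k < 3; k++) {
        double dx = n->x[k] - pos[k];
        d2 += dx * dx;
    }
    return n->size < theta * sqrt(d2);   /* i.e. size / distance < theta */
}

int main(void) {
    node_t distant_halo = { {100.0, 0.0, 0.0}, 1.0e12, 5.0 };
    double here[3] = {0.0, 0.0, 0.0};
    printf("use pseudo-particle: %s\n",
           accept_as_pseudo_particle(&distant_halo, here, 0.3) ? "yes" : "no");
    return 0;
}

In the fully-coupled FMM mode the pseudo-particle is only a force approximation; in the decoupled zoom mode the distant particles are physically replaced by it, which is what makes the individual zoom simulations independent.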

Divide and conquer: multiple zooms
Increase the particle mass gradually to minimize errors.

Divide and conquer: multiple zooms
Initial and final particle distribution for one single halo.

Divide and conquer: multiple zooms
Check a posteriori the mass distribution errors (<1%) and the contamination (none).

Divide and conquer: multiple zooms
Divide a large scale cosmological application running on 10^4+ cores into many (>1000) independent small scale zooms (<32 cores each). Each individual zoom uses small checkpoint restart files. In case of failure, resubmit to another node. Work in «embarrassingly parallel» mode.
Caveats:
- very large spectrum of halo sizes: the larger halos won't fit
- need for a redundant, highly available storage device
- selection of the halo sample based on user priors: a biased approach?
Tests:
- deploy our multiple-zoom scripts in a fault simulation framework
- use the available queue system to perform «idle time» computation
- use grid infrastructures

Fault-tolerant grid computing
Grid computing as a laboratory for fault-tolerant computing. We used the DIET grid middleware to run a large scale experiment on Grid5000, the French research grid. We obtained an 80% success rate on 3200 cores deployed over 13 sites. The main cause of failure was file-system related (2 sites lost).
Caniou et al., Fourth HPGC, 2007

Conclusion
- Computational cosmology is still limited by computational power.
- Our large scale applications are massively parallel and very memory demanding.
- We have had some (unpleasant) experience with system, disc and MPI failures.
- Our present-day applications are fault-tolerant thanks to frequent checkpoint restart dumps: this requires a very efficient file system and/or an optimal «parallel write» strategy.
- A radical modification of our applications is being considered, using a multiple-zoom approach.
- Other approaches are also possible and most welcome (RAID computing, system resilience).

The HP2C Cosmology Team
George Lake, Romain Teyssier, Ben Moore, Joachim Stadel, Jonathan Coles, Markus Wetzstein, Doug Potter, Christine Moran