Experiences with ENZO on the Intel Many Integrated Core Architecture

1 Experiences with ENZO on the Intel Many Integrated Core Architecture
Dr. Robert Harkness
National Institute for Computational Sciences
April 10th, 2012

2 Overview
ENZO applications at petascale
ENZO and the Intel MIC
Multi-level parallelism with MIC

4 NSF and DOE Open Resources
Columns: System, Proc, Cores, Cores/node, Memory/node (GB), Memory/core (GB), Aggr. Memory (TB)
Kraken  AMD IB  114,
Titan   AMD IL  300,
BlueW   AMD IL  375,
Mira    BGQ     750, (18)

5 Petascale Science Applications
Complex science
  Multi-scale, multi-physics, coupled models
  Local and global interactions
Complex algorithms
  Many algorithms in a single application
  Extensive use of third-party libraries & packages
Complex parallelism
  Different decompositions, types at each scale
Vectorization is a good match for physical science
  If the processor can manage non-unit-stride and CIGS
  Vector model may interact poorly with caches

6 The Standard Cosmological Model, 2012
Lambda = 1 (flat) Cold Dark Matter Universe
73% Dark Energy
23% Dark Matter
4% Baryons
Consistent with Inflation
WMAP7 parameters
Planck parameters in 2013

7 The ENZO Astrophysics Code(s)
A general-purpose Adaptive Mesh Refinement (AMR) code
Hybrid physics capability for cosmology
  PPM Eulerian hydro and collisionless dark matter (particles)
  Grey radiation diffusion, coupled chemistry and RHD
Extreme AMR to > 35 levels deep, > 500,000 subgrids
AMR load-balancing and MPI task-to-processor mapping
Ultra large-scale non-AMR applications at full scale on NICS XT5
High performance I/O using HDF5
LLNL Hypre pre-conditioners and solvers for RHD
C, C++ and Fortran90, > 200,000 LOC

9 ENZO - One code, several different modes
ENZO-C
  Conventional hybrid cosmology code
  Eulerian hydrodynamics with PPM
  Collisionless particles with PIC
  MPI and OpenMP hybrid, AMR and non-AMR
ENZO-R
  ENZO-C plus grey flux-limited radiation diffusion
  Coupled chemistry and radiation hydrodynamics
  AMR under development by Dan Reynolds (SMU)
  MPI and OpenMP hybrid (in ENZO and HYPRE)

10 Hybrid ENZO on the NICS Cray XT5
ULTRA Simulation: 6400^3 80 Mpc model
Largest practical hydrodynamic cosmology simulation
Designed to fit on the upgraded NICS XT5 Kraken
268 billion zones, 268 billion dark matter particles
15,625 (25^3) MPI tasks, 256^3 root grid tiles
6 OpenMP threads per task, 1 MPI task per socket
93,750 cores, 125 TB memory
30 TB per checkpoint/re-start/data dump
I/O rates: >15 GB/sec read, >7 GB/sec write, non-dedicated
1500 TB of output to Z=10
40 million core-hours to Z=10
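
The task, tile and core counts quoted for this run follow from one another, which a short calculation can confirm. The sketch below is illustrative only: the function name `decomposition` and its parameters are assumptions made for this example and are not part of ENZO. It assumes a cubic task layout and one core per OpenMP thread.

```python
def decomposition(tasks_per_dim, tile_edge, threads_per_task):
    """Return (root grid edge, MPI tasks, cores) for a cubic tiling.

    Hypothetical helper: assumes one MPI task per tile and one core
    per OpenMP thread.
    """
    root_edge = tasks_per_dim * tile_edge      # zones along one axis
    mpi_tasks = tasks_per_dim ** 3             # one task per root grid tile
    cores = mpi_tasks * threads_per_task       # hybrid MPI x OpenMP
    return root_edge, mpi_tasks, cores


# ULTRA run: 25 tasks per dimension, 256^3 tiles, 6 threads per task
ultra = decomposition(25, 256, 6)
print(ultra)  # (6400, 15625, 93750)
```

The same call with `(25, 128, 3)` returns `(3200, 15625, 46875)`, matching the 3200^3 INCITE run on Jaguar described on slide 11.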

11 2011/12 INCITE: Re-Ionizing the Universe
Non-AMR 3200^3 RHD with ENZO-R
Hybrid MPI and OpenMP on NCCS Jaguar XT5
SMT and SIMD tuning
25^3 MPI tasks x 128^3 root grid tiles
3 OpenMP threads per task (2 or 4 on Titan)
46,875 cores on Jaguar
Over 9 TBytes per checkpoint/re-start/data dump (HDF5)
I/O rates: 27 GB/sec in, 9 GB/sec out
200 TBytes in ORNL HPSS by Z=8
64-bit arithmetic, 64-bit integers and pointers
35 M hours on Jaguar, 32 M hours on Titan
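
A back-of-envelope estimate shows how long a single dump takes at the quoted I/O rates. This is a rough sketch, not a measurement: the helper `transfer_seconds` is a hypothetical name, decimal units (1 TB = 1000 GB) are assumed, and the slides give the 6400^3 rates and the 3200^3 dump size only as lower bounds.

```python
def transfer_seconds(size_tb, rate_gb_per_s):
    """Seconds to move size_tb terabytes at rate_gb_per_s gigabytes/second.

    Assumes decimal units (1 TB = 1000 GB).
    """
    return size_tb * 1000.0 / rate_gb_per_s


# 6400^3 ULTRA run on Kraken: 30 TB per dump, >7 GB/s write, >15 GB/s read
ultra_write = transfer_seconds(30, 7)     # roughly 4286 s, a little over an hour
ultra_read = transfer_seconds(30, 15)     # 2000 s

# 3200^3 INCITE run on Jaguar: ~9 TB per dump, 9 GB/s out, 27 GB/s in
incite_write = transfer_seconds(9, 9)     # 1000 s
incite_read = transfer_seconds(9, 27)     # roughly 333 s
```

Because the quoted rates for the larger run are lower bounds, its times are upper bounds on the transfer time alone; they ignore metadata and file-system contention.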

15 Near-term Developments
Enhancements to OpenMP threading
Preparing for at least 8 threads per MPI task
RHD Hybrid ENZO + Hypre
Hand tuning of OpenMP for multigrid
PGAS with UPC on Cray Gemini
Multiple UPC development paths
Function and Scalability
8192^3 HD, 4096^3 RHD and 2048^3 AMR
All practical with ORNL Titan in 2012

16 Large Scale and Accelerators
Large scale implies large memory & node count
6400^3 ~ 120TB (sp) or 7500 nodes at 16GB/node
3200^3 ~ 40TB (dp) or 2500 nodes at 16GB/node
State per processor >> Accelerator memory
Entire state cycles through Accelerator memory
Host-to-accelerator bw (PCI) limits offload
Offload the entire app if memory available?
Offload of just regions still leaves I/O and MPI
MIC-resident on cluster approximates this today
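
The node counts above are the aggregate state divided by the memory per node. The sketch below reproduces that arithmetic and also derives the average memory per zone implied by the quoted totals. The function names are hypothetical, decimal units are assumed, and the per-zone figure is only an implied average covering all fields, particles and solver workspace, not a number stated in the talk.

```python
import math


def nodes_required(state_tb, node_memory_gb):
    """Minimum node count to hold state_tb terabytes (1 TB = 1000 GB assumed)."""
    return math.ceil(state_tb * 1000 / node_memory_gb)


def bytes_per_zone(state_tb, edge):
    """Average bytes per zone implied by a total state of state_tb terabytes."""
    return state_tb * 1e12 / edge ** 3


print(nodes_required(120, 16))            # 7500 nodes for the 6400^3 model
print(nodes_required(40, 16))             # 2500 nodes for the 3200^3 model
print(round(bytes_per_zone(120, 6400)))   # 458 bytes per zone (single precision run)
print(round(bytes_per_zone(40, 3200)))    # 1221 bytes per zone (double precision RHD run)
```

Either way the state per node is the full 16 GB, far above the 1 GB available on a single Knights Ferry card in native mode, which is the mismatch the slide points to.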

17 NICS MICS
Rook: Intel MIC Knights Ferry Software Development Platform
  Workstation, 2 Westmere CPUs & 2 KNF MICs
Bishop: Cray CX1 cluster, 6 Westmere CPUs & 2 KNF MICs
  1 Head node: 2 Westmere CPUs
  2 Compute nodes: 2 Westmere CPUs & 1 KNF MIC / node
Knight: Appro cluster, 8 Westmere CPUs & 8 KNF MICs
  4 Compute nodes: 2 Westmere CPUs & 2 KNF MIC / node
Beacon (coming soon): Appro cluster
  2 service nodes & up to 32 compute nodes
  up to 64 Sandybridge CPUs & up to 64 KNF MICs

18 [Diagram: Westmere (W) and Knights Ferry (K) processors connected by Infiniband and PCIe2]

19 Getting Started on MIC
Essential library software for Xeon and MIC
Cross-compiles for MIC require extensive mods to configure scripts to prevent local run-time tests
MPI is the critical priority for almost all large apps
  MPICH 1.2.7p1: obsolete but easy
  MPICH 2: current but more complex to build
  MVAPICH soon
HDF5 HDF with serial C interface, no hi-level
Mathematical Libraries
  PETSc, Hypre, LAPACK, FFTW etc.

20 Design Considerations
Inside out or Outside in?
  Inside out on RoadRunner
  Inside out driven by hybrid MPI/OpenMP on MIC
Attached processor bandwidths
  Socket ~50 GB/s, Accelerator > 100 GB/s
  PCI ~5 GB/s
PGAS: CAF and UPC support in hardware?
OpenMP 4.x unified accelerator directives
MPI-3 support, especially non-blocking globals
Extra HW: barriers, eurekas, E-registers?
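
The bandwidth figures above explain why the PCI link dominates any offload design. The following sketch compares the time to stream the same data volume over each path. The dictionary and function names are hypothetical, the approximate slide values are used as-is, and the accelerator figure is a lower bound, so the real gap may be larger.

```python
# Approximate bandwidths from the slide, in GB/s (accelerator is a lower bound)
BANDWIDTH_GB_PER_S = {"socket": 50.0, "accelerator": 100.0, "pci": 5.0}


def stream_seconds(volume_gb, path):
    """Seconds to stream volume_gb gigabytes once over the named path."""
    return volume_gb / BANDWIDTH_GB_PER_S[path]


def pci_penalty(path):
    """How many times slower the PCI link is than the named memory path."""
    return BANDWIDTH_GB_PER_S[path] / BANDWIDTH_GB_PER_S["pci"]


# Moving 1 GB, an illustrative volume chosen for this example
print(stream_seconds(1.0, "pci"))          # 0.2
print(stream_seconds(1.0, "accelerator"))  # 0.01
print(pci_penalty("accelerator"))          # 20.0
print(pci_penalty("socket"))               # 10.0
```

Under these assumptions, data shipped to the accelerator must be reused many times there before an offload pays for its transfer, which favours keeping the whole task resident on the MIC.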

21 Inside Out
Native single MIC application
Multiple MICs, single node
Multiple MICs, multiple nodes via MPI
Full MPI 3 support with dynamic processes
Possible reverse offload to Xeon

22 Running ENZO on KF
Initial focus on function, correctness & scalability
ENZO is already hybrid MPI & OpenMP
Naturally maps to:
  MPI tasks offload to MIC
  MPI hybrid tasks native to a single MIC
  MPI hybrid tasks communicating between MICs
Since ENZO is massively parallel and hybrid, the MIC-resident MPI/hybrid task is the preferred approach

23 Native mode ENZO on the MIC
Using native mode on KF
  HDF
  MPICH 1.2.7p1 (ch_shmem)
  LLNL Hypre 2.6.0b
  ENZO-C and ENZO-R with pure MPI, hybrid MPI/OpenMP
  ENZO initial conditions generator in OpenMP
Builds with configure require manual workaround where native-mode execution is required to build
No application source code modifications required
All these codes ported in 1 week by 1 person

24 ENZO Tests & Scaling on KF
ENZO applications currently constrained by 1GB memory available on a single KF in native mode
  Many KF processors required for realistic testing
  Checkpointing suppressed to limit memory use
ENZO-C tested with 128^3 non-AMR HD model
  2-, 4-, 8-, 16- and 32-task pure MPI runs
  32 bit floating point sufficient for most arithmetic
ENZO-R tested with 80^3 non-AMR RHD model
  8- and 16-task pure MPI runs
  64 bit floating point arithmetic required for RHD
ENZO-C tested with 64^3 3-level AMR

25 ENZO-R Scaling, Single KF, Native Mode MPI
[Plot: Relative Performance vs. Number of MPI Tasks; series: Ideal, Actual]
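
A scaling plot of this kind compares measured speedup against the ideal line. The sketch below shows one common way to compute relative performance and parallel efficiency from wall-clock timings. It is an assumption about the method, not the procedure used in the talk, and the timings in the usage example are placeholders invented for illustration, not results from ENZO-R.

```python
def scaling_table(timings):
    """Map task count -> (speedup, ideal speedup, efficiency).

    timings: dict of {mpi_tasks: wall_clock_seconds}. The run with the
    fewest tasks is taken as the baseline (an assumption of this sketch).
    """
    base_tasks = min(timings)
    base_time = timings[base_tasks]
    table = {}
    for tasks in sorted(timings):
        speedup = base_time / timings[tasks]   # measured gain over baseline
        ideal = tasks / base_tasks             # perfect strong scaling
        table[tasks] = (speedup, ideal, speedup / ideal)
    return table


# Placeholder timings in seconds, NOT measurements from the talk
placeholder = {8: 100.0, 16: 60.0}
for tasks, (speedup, ideal, efficiency) in scaling_table(placeholder).items():
    print(tasks, round(speedup, 2), ideal, round(efficiency, 2))
```

An efficiency near 1.0 means the curve tracks the ideal line; values well below 1.0 indicate communication or memory-bandwidth limits.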

26 Outside In
Native XEON application
Multi-node via MPI
Intel Offload directives vs. OpenACC vs. OpenMP 4
Is this just a temporary measure?

27 Offload Mode ENZO on the MIC
Initial conditions generator using offload
  3D FFT of random number fields
  Mixed C++, C and F90 OpenMP, 10K LOC
  Offloads confined to F90 OpenMP so far
H_TRACE invaluable in checking offload traffic
Successful port after ~2 weeks effort
Fragile!
  Easy to crash the MIC
  Inconsistent translation of offload regions
Serves as a prototype but not for production

28 ENZO-Inits Scaling, Single KF
[Plot: Relative Performance vs. Number of OpenMP Threads; series: Ideal, MMIC mode, OFFLOAD mode, XEON mode]

29 The Road Ahead
Aggregate memory limits what you could do
Cost decides what you can do
  ~100M hrs/sim?
Weak scaling era will end soon for many apps
Multi-level parallelism is required
Data locality will be essential
Memory bandwidth determines the bottom line
Source code investment cannot be abandoned
I/O for data and benchmarking is now critical
Traditional checkpointing will be impossible at exascale

30 Conclusion
Intel MIC is the best way forward for large-scale codes which cannot use the existing GPGPU model (even with directives)
Intel MIC advantages
  Vector registers, masks, gather-scatter
  Traditional vectorization / compilers
  No restrictions on stride or alignment
  X86 code + AVX
  Flexible programming models
  Exposes maximum application parallelism
  Fully portable standards-compliant source code
