Experiences with ENZO on the Intel® Many Integrated Core (Intel® MIC) Architecture

Robert Harkness
National Institute for Computational Sciences
Oak Ridge National Laboratory

1 Introduction

The National Institute for Computational Sciences (NICS) has deployed several Intel® Many Integrated Core (MIC) Knights Ferry (KNF) platforms in cooperation with Intel® and Cray Inc. NICS' initial focus has been on demonstrating several large-scale physics and chemistry codes on the Intel® KNF architecture and exploring different models of execution. Here we describe the porting and testing process using ENZO-R, a version of the ENZO astrophysics code. ENZO applications running on petascale systems are approaching the limits of weak scaling, mainly because of the limits of reasonable cost and the need to complete simulations within the time scale of a funding cycle. The Intel® MIC architecture opens up new possibilities for improving strong scaling by increasing internal parallelism without the need to rewrite the entire application code.

2 Programming Models for the ENZO Code

ENZO is a complex astrophysics code for multi-scale and multi-physics applications using large (Eulerian) fixed meshes or multi-level adaptive mesh refinement (AMR) [3, 4, 5, 6]. Many of the largest ENZO applications are in cosmology, where dark matter is treated as collisionless but gravitating particles. ENZO thus contains particle-in-mesh computations as well as standard hydrodynamical methods (PPM) [1] for the gas dynamics of normal matter. Cosmology simulations also require the global gravitational field due to the normal matter and dark matter, which is computed with a 3D FFT on the top-level mesh in both AMR and non-AMR simulations. ENZO-R incorporates 3D flux-limited radiation diffusion coupled to ENZO hydrodynamics and chemistry. The radiation package depends on solvers and preconditioners from the LLNL HYPRE package [2].
The complete ENZO-R source code consists of approximately 250,000 lines of C, C++ and Fortran90, excluding MPI, OpenMP, HDF5, HYPRE and other third-party libraries. A full-scale cosmology simulation involves local computations, mainly of atomic physics processes, and global computations that require a considerable amount of long-range communication with MPI. Per-node memory requirements can be strongly imbalanced (since the dark matter particles move freely through the domain under gravity), and local CPU load imbalances arise from AMR or nonlinear local interactions.

In a cosmology application, ENZO-R requires approximately 25 physical fields defined on each mesh point, plus 12 attributes per dark matter or star particle. At the largest scales, using up to [...] meshes (usually with one dark matter particle per mesh point), these state data amount to tens of TBytes and the entire working set can be four times larger still, resulting in aggregate memory requirements of more than 100 TB. Additional physics capabilities such as MHD or frequency-dependent RHD bring further increases in the aggregate memory requirement.

Clearly, this is the defining issue with multi-dimensional, multi-scale, multi-physics codes in general: at the present time, no accelerator technology can provide enough local memory to conduct a full-scale calculation without relying on the host system to accommodate as much as 90 per cent of the working set. In principle, parts of the calculation can be offloaded from the host system, but the efficiency of this depends on the granularity of the tasks, the performance of the PCIe connection between the Intel® Xeon® host and the KNF card, and the extent to which some part of the working set can reside in the local KNF memory. In practice, this is difficult to achieve.

With a memory-bound code like ENZO, the clear preference is to make the entire working set resident on the Intel® MIC co-processor and to use the host merely for inter-node communication and for I/O to disk. The aggregate memory bandwidth available on accelerators is generally greater than that on the host, making a very strong argument in favor of keeping the entire working set resident on the accelerator. The Knights Ferry cards in NICS systems have approximately 1 GByte of usable memory each, which severely restricts the scale of ENZO test problems compared to production systems, which typically have 16 to 64 GBytes of memory per node.

The vector model of computation is extremely well suited to codes like ENZO. Most of the core physics components of ENZO are derived from codes originally developed for true vector multiprocessors such as the Cray YMP/C90/T90. Consequently, most of the computational work is already cleanly vectorized with moderate vector lengths (up to 256 elements). The essentially 3D data structures provide multiple opportunities for simultaneous vectorization and threading at more than one level. Strided and indexed memory loads and stores are unavoidable, however, and these operations can sometimes interact poorly with caching mechanisms.
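To make these access patterns concrete, the sketch below shows how unit-stride, strided and indexed (gather) accesses arise when a 3D field is stored as a flat array. It is a minimal illustration and not ENZO code: the x-fastest memory layout, the function names `flat_index`, `pencil_indices` and `gather`, and the tiny 4x3x2 grid are all assumptions made for the example.

```python
def flat_index(i, j, k, nx, ny):
    """Offset of cell (i, j, k) in a flat array, x varying fastest (assumed layout)."""
    return i + nx * (j + ny * k)


def pencil_indices(axis, nx, ny, nz, fixed=(0, 0, 0)):
    """Flat offsets visited by a 1D sweep along `axis` (0=x, 1=y, 2=z).

    The two coordinates that are not swept keep the values given in `fixed`.
    """
    i0, j0, k0 = fixed
    if axis == 0:
        return [flat_index(i, j0, k0, nx, ny) for i in range(nx)]
    if axis == 1:
        return [flat_index(i0, j, k0, nx, ny) for j in range(ny)]
    if axis == 2:
        return [flat_index(i0, j0, k, nx, ny) for k in range(nz)]
    raise ValueError("axis must be 0, 1 or 2")


def gather(field, offsets):
    """Indexed load: read the field at an arbitrary list of flat offsets."""
    return [field[o] for o in offsets]


# Hypothetical tiny grid, used only to expose the strides.
nx, ny, nz = 4, 3, 2
field = [float(n) for n in range(nx * ny * nz)]

x_sweep = pencil_indices(0, nx, ny, nz)  # consecutive offsets: stride 1
y_sweep = pencil_indices(1, nx, ny, nz)  # offsets separated by nx
z_sweep = pencil_indices(2, nx, ny, nz)  # offsets separated by nx * ny

# An indexed access, such as looking up the cells that hold a set of
# particles, has no fixed stride at all.
cells = [(3, 2, 1), (0, 0, 0), (1, 2, 0)]
values = gather(field, [flat_index(i, j, k, nx, ny) for i, j, k in cells])
```

Only the sweep along the fastest-varying axis touches consecutive memory locations; sweeps along the other two axes and particle-style lookups are the strided and indexed operations referred to in the text.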
One of the many advantages of the Intel® MIC co-processor is that it can perform all of these indirect operations.

3 Migrating ENZO-R to the Intel® MIC Architecture

ENZO-R and one of its supporting codes have been migrated to the Intel® MIC architecture in both native and offload modes. The main emphasis has been on getting the codes and supporting libraries operational and generating correct results. ENZO-R is used for some of the largest simulations run on NSF and DOE supercomputers today (Cray XT5, XE6), so production-scale models are impossible on the KNF test systems. The goal is to gain some insight into how to implement ENZO on future very large scale systems that use Intel® MIC components.

The fundamental choice is between the outside-in method, where the ENZO state is mainly resident on the Intel® Xeon® host and parallel regions are offloaded to the Intel® MIC using directives, and the inside-out method, where the bulk of the ENZO state is resident on the Intel® MIC and the host is only peripherally involved, handling communication between (single or multiple) Intel® MICs on multiple nodes. ENZO is a large code with over 1,000 routines and it depends on several major third-party libraries for numerical methods, communications and data handling. No single component of ENZO dominates the execution profile. Furthermore, in real applications, ENZO spends a significant fraction of the execution time in MPI communication and I/O to disk. These factors suggest that the standard CPU plus accelerator offload model stands little chance of success.

The Intel® MIC software distribution supports the Intel® compilers for native and offload mode but currently does not provide any of the essential third-party library components needed to support large-scale scientific applications. Large-scale scientific applications are almost all critically dependent on MPI for inter-node communication on distributed-memory systems. Any truly large-scale application is likely to be designed to scale to tens of thousands of MPI tasks, and it is often the case that MPI communication accounts for a major fraction of the overall simulation run time. Compilation of MPI for the Intel® Xeon® host using the offload mechanism is trivial, but compiling MPI for a single Intel® MIC, multiple Intel® MICs per node, and single or multiple Intel® MICs on multiple nodes requires cross-compilation. MPI uses configure scripts which require manual modification to achieve this. MPICH was used for the initial porting effort because, even though it is now considered obsolete, it is relatively simple to build and it presented a lower risk. ENZO also depends on the Hierarchical Data Format library (HDF [...] or later), and again the main challenge is manual modification of the necessary configure scripts to enable cross-compilation. Other essential third-party libraries (SPRNG 2.0 and HYPRE 2.8.0) require similar manual modifications to their configure scripts.

With all these components in place for offload and native builds, the compilation of ENZO-R itself is straightforward. No ENZO-R source code modifications are required to build a functional ENZO-R binary in native mode although, clearly, some source code changes will be necessary to achieve full optimization on the Intel® MIC co-processor. Although ENZO-R is already a hybrid code using OpenMP directives throughout, extending all of these parallel regions to offload regions is far from trivial in such a large code, and the conversion to offload mode has not been completed at the time of writing.

ENZO requires initial conditions for a simulation, and for cosmology cases these are generated with ENZO-Inits. Like ENZO-R, this code contains a mix of C, C++ and Fortran90 but contains only 12,000 lines of code.
ENZO-Inits has been implemented as a shared-memory OpenMP-threaded code as well as a threaded hybrid MPI/OpenMP code. The shared-memory OpenMP-threaded variant provides a reasonably complex test case for offload compilation. In comparison to native mode, the offload mode has proven to be quite difficult to use. Small errors in the syntax or placement of offload directives tended to result in a system hang and occasionally even a crash of the host system. As expected, simple use of offload directives is relatively ineffective compared to native mode execution on a single KNF card because of the costs of startup and of data transfer between the KNF and the host, even though the OpenMP parallel code is quite efficient on a multicore Intel® Xeon® when running at the same scale.

4 NICS Test Systems and Preliminary Results

NICS has three KNF systems covering most of the choices in basic cluster configurations:

- An Intel® Development Workstation with 2 Intel® Xeon® processor 5600s and 2 KNF cards.
- A Cray CX-1 system consisting of a single head node with 2 Intel® Xeon® processor 5600s and two compute nodes, each containing 2 Intel® Xeon® processor 5600s plus one KNF card.
- An Appro cluster consisting of 4 compute nodes, each of which contains 2 Intel® Xeon® processor 5600s and 2 KNF cards, for a total of 8 KNF cards.

Each of these systems is configured to support outside-in and inside-out programming models and can support native mode MPI running across the nodes within each cluster. Every KNF card has a 32-core processor with 2 GBytes of on-board GDDR5 memory. Preliminary results are available for small-scale tests run on the various NICS systems.

4.1 ENZO-R in KNF Native Mode

Figure 1 shows the strong scaling of a non-AMR cosmology model using MPI in native mode on a single KNF card. This non-RHD model uses 32-bit arithmetic for the physical fields, with the exception of the dark matter particle positions, which require 64-bit precision. The scaling behavior is remarkable given that decomposing such a small model results in parallel tasks which are far smaller than would be used in a full-scale simulation (i.e. for 32 tasks the model is decomposed into 32 regions, arranged as 4x4x2 tiles of size 32x32x64, compared to [...] or [...] tiles in production simulations).

[Figure 1: Unoptimized ENZO-R scaling in native mode on KNF. Relative performance versus number of MPI tasks; curves: ideal and actual.]

AMR and RHD models require full 64-bit precision throughout. Consequently, for RHD a single KNF can run only an 80^3 model. AMR models require an increasing amount of memory as the refinement progresses, and the test case with a 64^3 root grid and three levels of refinement exhausts all available memory, about 1 GByte, long before the AMR is fully developed. Larger models using up to 8 times as much memory can run using multiple KNF cards on the NICS cluster, although the cost of inter-node communication becomes excessive given the present method of indirect routing.
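The decomposition and memory limits described above can be checked with some simple arithmetic. The sketch below is only a rough estimate under stated assumptions: it takes the tile layout quoted for the 32-task run, and for the memory estimate it reuses the counts from Section 2 (about 25 fields per mesh point and 12 attributes per particle), assuming 64-bit values throughout and one particle per mesh point. The function names are hypothetical, and the surface-to-volume ratio is used here merely as a crude proxy for the relative cost of boundary exchange; the paper does not report it.

```python
def global_mesh(tiles, tile_shape):
    """Global mesh dimensions implied by a tile layout and a per-tile shape."""
    return tuple(t * s for t, s in zip(tiles, tile_shape))


def surface_to_volume(shape):
    """Boundary cells per interior cell for a box of the given shape."""
    nx, ny, nz = shape
    return 2.0 * (nx * ny + ny * nz + nx * nz) / (nx * ny * nz)


def state_bytes(n_cells, n_particles, bytes_per_value=8,
                fields_per_cell=25, attrs_per_particle=12):
    """Rough size of the state data; the default counts are assumptions
    taken from the approximate figures quoted in Section 2."""
    values = n_cells * fields_per_cell + n_particles * attrs_per_particle
    return values * bytes_per_value


# Layout quoted in the text for the 32-task native-mode run.
tiles = (4, 4, 2)
tile_shape = (32, 32, 64)
mesh = global_mesh(tiles, tile_shape)
n_tasks = tiles[0] * tiles[1] * tiles[2]

# Small tiles expose proportionally more boundary than the whole mesh does.
tile_ratio = surface_to_volume(tile_shape)
mesh_ratio = surface_to_volume(mesh)

# State estimate for an 80^3 model with one particle per mesh point (assumed).
n = 80 ** 3
rhd_state = state_bytes(n, n)
```

Under these assumptions the 4x4x2 layout corresponds to a 128^3 mesh, and the 80^3 case needs roughly 150 MBytes for the state data alone. If the working set is several times larger than the state, as Section 2 suggests it can be, the total is of the same order as the usable KNF memory.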

4.2 ENZO-Inits in KNF Offload Mode

Figure 2 shows the scaling of part of the ENZO-Inits code running in native mode, in offload mode and purely on the Intel® Xeon® processor 5600 front-end. The code is identical in all three cases and all arithmetic uses 64-bit precision. ENZO-Inits reads a file containing a double precision pseudo-random number sequence. For the test problem this file is 85 MBytes in size. This random number data is the input to the major OpenMP parallel region, and the timing data is given for this region only. In native Intel® Xeon® and native KNF modes this data is already resident in memory at the start of the parallel region. For the offload case it is automatically copied in to the KNF card together with a double precision complex field (32 MBytes in size), which is also returned to the host on completion.

[Figure 2: Comparison of scaling of main parallel region in ENZO-Inits in native mode and offload mode on KNF and execution on the Intel® Xeon® host. Relative performance versus number of OpenMP threads; curves: ideal, MMIC mode, OFFLOAD mode, XEON mode.]

The results for native mode show reasonable efficiency, and the behavior of the pure Intel® Xeon® run is comparable up to 8 OpenMP threads. The results for offload presumably demonstrate the impact of the data transfers to and from the host over PCIe. The results for two cores show almost no improvement over using a single core, even on the Intel® Xeon® processor. The reason for this is unknown at the time of writing. Beyond two cores the scaling of the Intel® Xeon® and KNF resident cases is close to ideal. Although this parallel region represents most of the workload, ENZO-Inits must also write several fields to disk. Since all I/O uses KNF resources, the need to write these files restricts the use of the KNF to small test cases only.
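One way to see why a fixed transfer cost limits offload scaling is a simple timing model, sketched below. Only the data volumes (85 MBytes copied in, plus a 32 MByte field copied in and returned) come from the text. The single-thread compute time, the PCIe bandwidth and the assumption of ideal compute scaling are hypothetical values chosen for illustration, not measurements from the NICS systems, and the function names are likewise invented for this example.

```python
def transfer_mbytes(input_mb=85.0, field_mb=32.0):
    """Total PCIe traffic: input copied in, complex field copied in and back."""
    return input_mb + 2.0 * field_mb


def offload_time(n_threads, t_compute_1, t_transfer):
    """Run time if compute scales ideally but the transfer cost is fixed."""
    return t_compute_1 / n_threads + t_transfer


def relative_performance(n_threads, t_compute_1, t_transfer):
    """Speedup relative to the single-thread run of the same configuration."""
    return (offload_time(1, t_compute_1, t_transfer)
            / offload_time(n_threads, t_compute_1, t_transfer))


# Hypothetical parameters, for illustration only.
T_COMPUTE_1 = 10.0          # seconds on one thread (assumed)
BANDWIDTH_MB_PER_S = 1000.0  # effective PCIe bandwidth (assumed)

t_transfer = transfer_mbytes() / BANDWIDTH_MB_PER_S

threads = [1, 2, 4, 8, 16, 32]
resident = [relative_performance(n, T_COMPUTE_1, 0.0) for n in threads]
offload = [relative_performance(n, T_COMPUTE_1, t_transfer) for n in threads]
```

In this model the resident case follows the ideal line, while the offload curve falls further below it as the thread count grows, because the transfer time becomes a larger share of the total. Startup costs, which the text also mentions, would add to the fixed term in the same way.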

5 Future Developments

The next steps in development will be to complete the migration of ENZO-R to offload mode and to investigate the use of two separate but communicating MPI domains, resident on the cooperating Intel® Xeon® and KNF components respectively, to perform reverse offload of communication and I/O while the ENZO state remains KNF-resident. The forward offload model is close to the standard hybrid CPU/GPGPU model and similar constraints apply. Reverse offload preserves a model which can be used on any MPP platform.

References

[1] P. Colella and P. R. Woodward. The Piecewise Parabolic Method (PPM) for Gas-Dynamical Simulations. Journal of Computational Physics, 54: [...], September [...].
[2] HYPRE project site: [...]
[3] ENZO project site: [...]
[4] M. L. Norman, J. Bordner, D. Reynolds, R. Wagner, G. L. Bryan, R. Harkness and B. W. O'Shea. Simulating Cosmological Evolution with Enzo. In Petascale Computing: Algorithms and Applications, pp. [...], Ed. D. A. Bader. Chapman & Hall/CRC, [...].
[5] M. L. Norman, G. L. Bryan, et al. Simulating Cosmological Evolution with Enzo. In Petascale Computing: Algorithms and Applications, Ed. D. Bader. CRC Press LLC, 2007.
[6] B. W. O'Shea, G. Bryan, J. Bordner, M. L. Norman, T. Abel, R. Harkness and A. Kritsuk. Introducing Enzo, an AMR Cosmology Application. In Adaptive Mesh Refinement - Theory and Applications, Eds. T. Plewa, T. Linde and V. G. Weirs. Springer Lecture Notes in Computational Science and Engineering, 2004.
