AACE: Applications



1 AACE: Applications
R. Glenn Brook, Director, Application Acceleration Center of Excellence, National Institute for Computational Sciences, glenn-
Ryan C. Hulguin, Computational Science Associate, National Institute for Computational Sciences, ryan-


2 Codes Investigated by AACE on the Intel Xeon Phi Coprocessor
Science codes ported or optimized through the Beacon Project:
Chemistry: NWChem (ported)
Astrophysics: Enzo (ported and optimized)
Magnetospheric Physics: H3D (ported and optimized)
Other codes of interest:
Electronic Structures: Elk FP-LAPW (ported)
Computational Fluid Dynamics (CFD): Euler and BGK Boltzmann Solver (ported and optimized)
Linear Algebra: SGEMM and DGEMM routines (ported)


3 Enzo
Community code for computational astrophysics and cosmology
More than 1 million lines of code
Uses powerful adaptive mesh refinement
Highly vectorized, with a hybrid MPI + OpenMP programming model
Utilizes the HDF5 and HYPRE libraries
Multiple MPI tasks per coprocessor and many threads per MPI task
Enzo was ported and optimized for the Intel Xeon Phi Coprocessor by Dr. Robert Harkness harkness@sdsc.edu

4 Preliminary Scaling Study: Native ENZO-C
128^3 mesh (non-AMR), pure MPI, native mode
[Figure: speedup versus number of threads; series: Observed, Ideal]
Results were generated on the Intel Knights Ferry software development platform.

5 Hybrid3d (H3D)
Provides breakthrough kinetic simulations of the Earth's magnetosphere
Models the complex solar wind-magnetosphere interaction using both an electron fluid and kinetic ions, unlike magnetohydrodynamics (MHD), which completely ignores ion kinetic effects
Contains the following HPC innovations:
1. multi-zone (asynchronous) algorithm
2. dynamic load balancing
3. code adaptation and optimization to large number of cores
Hybrid3d (H3D) was provided for porting to the Intel Xeon Phi Coprocessor by Dr. Homa Karimabadi hkarimabadi@ucsd.edu

6 Hybrid3d (H3D) Performance
H3D Speedup on the Intel Xeon Phi Coprocessor (codename Knights Corner)
[Figure: relative speedup versus number of MPI processes; series: Observed, Ideal]
Optimizations were provided by Intel senior software engineer Rob Van der Wijngaart.
Results were generated on a Pre-Production Intel Xeon Phi coprocessor with B0 HW and Beta SW GHz and 8 GB of GDDR GHz

7 Elk FP-LAPW
Paramount to extracting functionality from these advanced materials is having a detailed understanding of their electronic, magnetic, vibrational, and optical properties.
Elk is a software platform which allows for the understanding of these properties from a first-principles approach. It employs electronic structure techniques such as density functional theory, Hartree-Fock theory, and Green's function theory for the calculation of relevant properties from first principles.
Fortran 90
Efficient hybrid MPI + OpenMP parallelization
[Figure: Antiferromagnetic structure of Sr2CuO3]
Elk was ported to the Intel Xeon Phi Coprocessor by W. Scott Thornton wsttiger@gmail.com

8 Elk FP-LAPW Performance
Elk uses master-slave parallelism where orbitals for different momenta are computed semi-independently.
In this test 27 and 64 different crystal momenta were used. The test case was bulk silicon.
Results were generated on a Pre-Production Intel Xeon Phi coprocessor with A0 HW and Beta SW GHz and 8 GB of GDDR GHz

9 Computational Fluid Dynamics (CFD)
Two CFD solvers were developed in-house at NICS:
1st solver is based on the Euler equations
2nd solver is based on Model Boltzmann equations
[Figure: Unsteady solution of a Sod shock using the Euler equations]
[Figure: Steady-state solution of a Couette flow using the Boltzmann equation with BGK collision approximation]
The above CFD solvers were developed for the Intel Xeon Phi Coprocessor by Ryan C. Hulguin ryan- hulguin@tennessee.edu

10 Impact of Various Optimizations on the Model Boltzmann Equation Solver
The Model Boltzmann Equation solver was optimized by Intel software engineer Rob Van der Wijngaart. He took a baseline solver in which all loops but one were vectorized and applied the following optimizations to get the most performance out of the Intel Xeon Phi Coprocessor (codename Knights Corner):
Set I, Loop Vectorization: stack variable pulled out of the loop; class member turned into a regular structure
Set II, Data Access: arrays linearized using macros; data aligned for more efficient access
Set III, Parallel Overhead: number of parallel sections reduced
Set IV, Dependency: reduction removed from the computational loop by saving the value into a private variable
Set V, Precision: medium precision used for math function calls (-fimf-precision=medium)
Set VI, Precision: single-precision constants and intrinsics used
Set VII, Compiler Hints: #pragma SIMD used instead of #pragma IVDEP
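Two of the optimizations listed above, array linearization through macros (Set II) and taking a reduction out of the computational loop by way of a private variable (Set IV), can be sketched in C as follows. This is a minimal illustration under stated assumptions, not the solver's actual code: the names IDX, NX, NY, alloc_field, and scale_and_sum are hypothetical, the 64-byte alignment is an assumed choice, and the private-accumulator pattern is only one reading of what Set IV describes.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical grid dimensions for this example. */
#define NX 8
#define NY 16

/* Set II: address a 2-D field stored in one contiguous (linearized) array. */
#define IDX(i, j) ((size_t)(i) * NY + (size_t)(j))

/* Set II: allocate the field on a 64-byte boundary (an assumed alignment).
   C11 aligned_alloc requires the size to be a multiple of the alignment;
   NX * NY * sizeof(double) = 1024 bytes satisfies that here.
   Returns NULL on failure; release with free(). */
static double *alloc_field(void)
{
    return aligned_alloc(64, sizeof(double) * NX * NY);
}

/* Set IV (one possible reading): every thread accumulates into its own
   private variable, and the shared total is updated once per thread,
   outside the computational loop. Without OpenMP the pragmas are ignored
   and the function runs serially with the same result. */
static double scale_and_sum(double *f, double factor)
{
    double total = 0.0;
    #pragma omp parallel
    {
        double local = 0.0;              /* private to each thread */
        #pragma omp for
        for (int i = 0; i < NX; ++i) {
            for (int j = 0; j < NY; ++j) {
                f[IDX(i, j)] *= factor;
                local += f[IDX(i, j)];
            }
        }
        #pragma omp atomic
        total += local;                  /* one shared update per thread */
    }
    return total;
}
```

A caller would allocate the field with alloc_field(), fill it through IDX(i, j), and call scale_and_sum(f, factor) once per step; the inner loop then carries no shared-variable dependency.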

11 Optimization Results from the Model Boltzmann Equation Solver
[Figure: relative speedup by optimization set, starting from Loop Vectorization; series: balanced, scatter]
Results were generated on a Pre-Production Intel Xeon Phi coprocessor with B0 HW and Beta SW GHz and 8 GB of GDDR GHz

12 Model Boltzmann Equation Solver Performance
Relative Speedup of two 8-core 3.5 GHz Intel Xeon E Processors Versus an Intel Xeon Phi Coprocessor
[Figure: relative speedup versus number of OpenMP threads; series: Dual Intel Xeon E Compiler Hints; Intel Xeon Phi - Precision II - Balanced; Intel Xeon Phi - Compiler Hints - Balanced; Intel Xeon Phi - Precision II - Scatter; Intel Xeon Phi - Compiler Hints - Scatter]
Results were generated on a Pre-Production Intel Xeon Phi coprocessor with B0 HW and Beta SW GHz and 8 GB of GDDR GHz

13 Porting to the Intel Xeon Phi Coprocessor
No major code rewrites were needed to start running on an Intel Xeon Phi coprocessor.
The previous applications were run in native mode and simply required a recompile using the -mmic flag.
Parallelism is achieved using OpenMP, MPI, or both.
The transition from the Intel Xeon Phi software development platform (codename Knights Ferry) to the Intel Xeon Phi coprocessor (codename Knights Corner) is seamless.
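As a hedged illustration of the point that native execution needs only a recompile, the OpenMP sketch below contains nothing specific to the coprocessor. The function name count_threads and the file name threads.c are hypothetical, and the build lines in the comment are assumptions based on the Intel compiler options of that period (-openmp together with -mmic, the recompile flag referred to above).

```c
#ifdef _OPENMP
#include <omp.h>
#endif

/* Returns the number of threads used by a parallel region,
   or 1 when the file is compiled without OpenMP support. */
int count_threads(void)
{
    int n = 1;
#ifdef _OPENMP
    #pragma omp parallel
    {
        /* One thread records the team size for everyone. */
        #pragma omp single
        n = omp_get_num_threads();
    }
#endif
    return n;
}

/* Assumed build lines (Intel compiler; the source is the same in both cases):
     host:    icc -openmp       -c threads.c
     native:  icc -openmp -mmic -c threads.c                               */
```

The same source serves the host and the coprocessor; only the compile line changes, which is the sense in which no major rewrite is needed to get started.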

14 Custom SGEMM and DGEMM Routines for the Intel Xeon Phi Coprocessor
Custom General Matrix-Matrix Multiply routines using single and double precision (SGEMM and DGEMM, respectively) were developed for the Intel Xeon Phi coprocessor.
Square matrix sizes were used (m = n = k).
Intel Xeon Phi coprocessor results were obtained with 240 threads and compared against Intel Xeon E processors.
The above SGEMM and DGEMM routines were developed for the Intel Xeon Phi Coprocessor by Jonathan Peyton jpeyton1@utk.edu
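For readers unfamiliar with the operation, a GEMM call computes C <- alpha * A * B + beta * C. The sketch below is a plain reference version for the square case (m = n = k), useful only for checking results; it is not the custom routine described above. The names ref_dgemm and gemm_flops are hypothetical, and row-major storage is an assumption made for readability (BLAS itself uses column-major storage by convention).

```c
#include <stddef.h>

/* Reference DGEMM for square, row-major n x n matrices:
   C <- alpha * A * B + beta * C. Untuned triple loop. */
void ref_dgemm(size_t n, double alpha, const double *A, const double *B,
               double beta, double *C)
{
    for (size_t i = 0; i < n; ++i) {
        for (size_t j = 0; j < n; ++j) {
            double acc = 0.0;
            for (size_t p = 0; p < n; ++p)
                acc += A[i * n + p] * B[p * n + j];
            C[i * n + j] = alpha * acc + beta * C[i * n + j];
        }
    }
}

/* Conventional operation count for a square GEMM: 2 * n^3
   (one multiply and one add per inner-loop iteration). */
double gemm_flops(size_t n)
{
    double d = (double)n;
    return 2.0 * d * d * d;
}
```

Dividing gemm_flops(n) by the measured run time gives the achieved floating-point rate, which is the usual way GEMM results are compared across processors.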

15 Custom SGEMM Performance Results

16 Custom DGEMM Performance Results

17 Contact Information
R. Glenn Brook, Ph.D., Director, Application Acceleration Center of Excellence, National Institute for Computational Sciences, glenn-
