CONTINUED EFFORTS IN ADAPTING THE GEOS-5 AGCM TO ACCELERATORS: SUCCESSES AND CHALLENGES


9/20/2013
Matt Thompson, matthew.thompson@nasa.gov

Accelerator Conversion Aims
- Code rewrites will probably be necessary; preserve bit-identical results on the CPU whenever possible
- Minimize disruption to end users: checkout, build, etc. should look the same, and accelerated code should be a compile-time decision
- Single code base: no irrad_cpu, irrad_gpu, irrad_mic; just irrad
- Make it faster, not slower
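A minimal sketch (not the actual GEOS-5 source) of the single-code-base aim: one routine name, with the accelerated body chosen at compile time. It assumes a preprocessor macro such as _CUDA that is defined only in accelerated builds (PGI's CUDA Fortran compiler sets a macro of this kind); argument names are illustrative.

    subroutine irrad(np, ta, flx)
       implicit none
       integer, intent(in)  :: np
       real,    intent(in)  :: ta(np)   ! layer temperatures (illustrative)
       real,    intent(out) :: flx(np)  ! longwave fluxes (illustrative)
    #ifdef _CUDA
       ! device path: copy inputs, launch the CUDA Fortran kernel, copy back
       flx = 0.0                        ! placeholder for the GPU result
    #else
       flx = 0.01 * ta                  ! stand-in for the unchanged CPU path
    #endif
    end subroutine irrad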

GEOS-5 AGCM

GEOS-5 AGCM Targets

GEOS-5 AGCM Converted To GPU

GPU Results: Physics Kernels

  Kernel        Speedup (vs. socket)
  GWD           5.1x
  TURBULENCE    3.0x
  CLOUD         2.3x
  IRRAD         3.5x / 4.4x
  SORAD         4.6x / 6.4x

Using only 1 core to 1 GPU. System: 1 node, 1 CPU (8-core E5-2670), 1 GPU (K20x). Model run: 4 hours, ½-degree. Includes allocation, deallocation, and data transfer times. All speedups versus 1 socket (8 cores).

GPU Results: Full Gridded Components

  Gridded Component   Speedup (vs. socket)
  GWD                 5.1x
  TURBULENCE          1.2x
  MOIST               0.3x
  RADIATION           1.0x / 1.0x
  PHYSICS             0.5x
  GCM                 0.3x

Using only 1 core to 1 GPU. System: 1 node, 1 CPU (8-core E5-2670), 1 GPU (K20x). Model run: 4 hours, ½-degree. Includes cost of all host code pre- and post-GPU. Note: highly variable parameterized chemistry disk read time subtracted from PHYSICS and GCM.

GPU Challenges: MPI
- The GEOS-5 GCM is fully MPI-decomposed, but GPUs (pre-K20) and MPI don't mix well
- Choice 1: Run one CPU core per socket for each GPU card
  - Pro: Easy to implement
  - Con: 7/8ths of your cores per socket are wasted! Amdahl would not be proud
- Choice 2: Multiple MPI ranks per GPU
  - Pro: Gain back some CPU cores
  - Con: The GPU pipelines kernel calls if it can; usually it will crash as you overload the card

GPU Challenges: MPI. Possible Solution: Gather/Scatter
- Before each GPU call, gather all work onto one process on each CPU socket
- One core calls the GPU
- Scatter the resulting data back to the MPI processes
- Not impossible, but difficult/ponderous: new communicators, extra communication load
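A minimal sketch (illustrative names; equal column counts per rank assumed) of the gather/compute/scatter idea: ranks sharing a socket gather their columns to a socket-root rank, that rank alone drives the GPU, and the results are scattered back. socket_comm is assumed to be one of the new communicators mentioned above, e.g. built with MPI_Comm_split.

    subroutine gathered_gpu_call(socket_comm, ncol, q_in, q_out)
       use mpi
       implicit none
       integer, intent(in)  :: socket_comm, ncol
       real,    intent(in)  :: q_in(ncol)
       real,    intent(out) :: q_out(ncol)
       real, allocatable :: q_all(:), q_all_out(:)
       integer :: rank, nranks, ierr

       call MPI_Comm_rank(socket_comm, rank,   ierr)
       call MPI_Comm_size(socket_comm, nranks, ierr)
       allocate(q_all(ncol*nranks), q_all_out(ncol*nranks))

       ! Gather every rank's columns onto rank 0 of the socket
       call MPI_Gather(q_in, ncol, MPI_REAL, q_all, ncol, MPI_REAL, &
                       0, socket_comm, ierr)

       if (rank == 0) then
          q_all_out = q_all   ! placeholder: the socket-root core would launch
                              ! the GPU kernel on the gathered columns here
       end if

       ! Scatter the results back to the owning ranks
       call MPI_Scatter(q_all_out, ncol, MPI_REAL, q_out, ncol, MPI_REAL, &
                        0, socket_comm, ierr)
    end subroutine gathered_gpu_call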

GPU Challenges: MPI. Possible Solution: Hyper-Q
- Install some K20s
- Add a few environment variables:
    setenv CUDA_VISIBLE_DEVICES 0
    setenv CUDA_MPS_CLIENT 1
    setenv CUDA_MPS_PIPE_DIRECTORY /tmp/mps_0
    setenv CUDA_MPS_LOG_DIRECTORY /tmp/mps_log_0
- Start up a proxy server:
    /usr/bin/nvidia-cuda-mps-control -d
- But is it that easy? Answer: seems to be yes
  (Adapted from: http://cudamusing.blogspot.com/2013/07/enabling-cuda-multi-process-service-mps.html)

GPU Hyper-Q: Physics Kernels

  Kernel        1 Core to 1 GPU   8 Cores to 1 GPU
  GWD           5.1x              6.4x
  TURBULENCE    3.0x              3.9x
  CLOUD         2.3x              3.1x
  IRRAD         3.5x / 4.4x       4.5x / 5.0x
  SORAD         4.6x / 6.4x       4.9x / 7.3x

System: 1 node, 1 CPU (8-core E5-2670), 1 GPU (K20x). Model run: 4 hours, ½-degree. Includes allocation, deallocation, and data transfer times. All speedups versus 1 socket (8 cores).

GPU Hyper-Q: Full Gridded Components

  Grid Comp     1 Core to 1 GPU   8 Cores to 1 GPU
  GWD           5.1x              6.4x
  TURBULENCE    1.2x              2.7x
  MOIST         0.3x              1.8x
  RADIATION     1.0x / 1.0x       2.5x / 2.7x
  PHYSICS       0.5x              2.1x
  GCM           0.3x              1.3x

System: 1 node, 1 CPU (8-core E5-2670), 1 GPU (K20x). Model run: 4 hours, ½-degree. Includes cost of all host code pre- and post-GPU. Note: highly variable parameterized chemistry disk read time subtracted from PHYSICS and GCM.

GPU Hyper-Q Results
- Physics kernels: kernel run timings are the same for 1 core vs. 8 cores, but data transfer times are lower with 8 cores even though the amount of data is the same -> Hyper-Q overlaps communication?
- Dynamics: cubed-sphere dynamics not included here due to PGI limitations (compiler errors); expect results similar to the physics
- Full gridded components: with Hyper-Q, GPU acceleration of a highly MPI-decomposed GEOS-5 becomes viable for the first time

GPU Challenges: Code Readability
- CUDA Fortran code can be ugly and intrusive, especially the way GEOS-5 (i.e., the way Matt) implements it
- Valid complaints from other developers: an #ifdef extravaganza
- Every addition to the CPU code means an addition to the GPU code
  - Usually means "Call Matt"
  - Slows down work
- Possible cleanup schemes can make the code even more unreadable!

GPU Future Directions: OpenACC
- Try converting working CUDA Fortran kernels and code to OpenACC, starting with Turbulence
- Turbulence is a mix of one long kernel and many smaller inlined segments that became kernel subroutines
- We can use ASSOCIATE blocks to simplify the CUDA Fortran section and use OpenACC elsewhere, returning the small segments to inline code as before
- Eventually use OpenACC on the main kernel, which has many subroutine calls!
- We know the data movement, so the data pragmas should be easy to write
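A minimal sketch (illustrative names and physics, not the GEOS-5 turbulence code) of the ASSOCIATE-plus-OpenACC pattern: short local names are bound to the arrays so the loop body reads like the original inline code, while an OpenACC directive handles the offload and the compiler manages the implicit copies.

    subroutine turb_update(ncol, dt, kdiff, qv, t)
       implicit none
       integer, intent(in)    :: ncol
       real,    intent(in)    :: dt, kdiff, qv(ncol)
       real,    intent(inout) :: t(ncol)
       integer :: i
       associate (temp => t, vapor => qv)   ! short, readable names for the loop body
       !$acc parallel loop
       do i = 1, ncol
          temp(i) = temp(i) + dt * kdiff * vapor(i)   ! stand-in for the real tendency
       end do
       end associate
    end subroutine turb_update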

GPU Future Directions: OpenACC Pros
- OpenACC is a standard: it should look the same for any supported accelerator
- It looks (sort of) like OpenMP, and most scientific programmers have seen OpenMP
- Practice/learning for Xeon Phi
- The pragmas are pretty readable by others:
  - copy: variables copied in and out
  - copyin: variables just copied in
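A minimal sketch of that data-clause distinction (variable names are illustrative, not from GEOS-5): pres and qv are only read on the device, so copyin suffices, while t is both read and written, so it uses copy.

    subroutine column_update(ncol, c0, pres, qv, t)
       implicit none
       integer, intent(in)    :: ncol
       real,    intent(in)    :: c0, pres(ncol), qv(ncol)
       real,    intent(inout) :: t(ncol)
       integer :: i
       ! pres and qv go to the device only; t comes back out as well
       !$acc data copyin(pres, qv) copy(t)
       !$acc parallel loop
       do i = 1, ncol
          t(i) = t(i) - c0 * qv(i) * log(pres(i))   ! stand-in column update
       end do
       !$acc end data
    end subroutine column_update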

GPU Future Directions: OpenACC Cons
- OpenACC 1.0 is not designed for large, multi-nested codes
- It requires manual inlining (pretty much a no-go) or inlining by the compiler
  - Every attempt at compiler inlining has led to ACON or other compiler errors
  - GEOS-5 might require a dedicated PGI engineer just to solve these!
- Pro: OpenACC 2.0 is almost here: !$acc routine
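A minimal sketch (illustrative physics, not the real deledd) of what !$acc routine enables in OpenACC 2.0: device code can call a real subroutine instead of relying on manual or compiler inlining.

    module two_stream_mod
       implicit none
    contains
       subroutine two_stream(tau, ssa, refl)
       !$acc routine seq
          real, intent(in)  :: tau, ssa
          real, intent(out) :: refl
          refl = ssa * tau / (1.0 + tau)   ! stand-in for the real scalar physics
       end subroutine two_stream

       subroutine driver(ncol, tau, ssa, refl)
          integer, intent(in)  :: ncol
          real,    intent(in)  :: tau(ncol), ssa(ncol)
          real,    intent(out) :: refl(ncol)
          integer :: i
          !$acc parallel loop
          do i = 1, ncol
             call two_stream(tau(i), ssa(i), refl(i))   ! no inlining needed
          end do
       end subroutine driver
    end module two_stream_mod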

Xeon Phi: Shortwave Radiation Kernel
- First step: construction of an offline tester of the shortwave radiation kernel (sorad)
  - Actual CPU/GPU code from our model
  - Uses actual data inputs from our model
- Start by using the offload model with Intel directives
- Any modifications must preserve CPU and GPU capabilities
- Strive for bit-identical results if possible
- Transition the offload knowledge to the full model

Xeon Phi: Shortwave Radiation Kernel
In the main sorad.f90:
- !DIR$ ATTRIBUTES OFFLOAD : mic :: sorad in both the caller and the sorad kernel; similar for deledd within sorad
- !DIR$ OMP OFFLOAD TARGET(mic) for the main subroutine
- The usual !$omp directives to enable OpenMP
- In all, about 5 changes
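A minimal sketch (illustrative arguments; the real sorad interface is much larger) of how those directives sit in the code: the attribute marks the routine for MIC compilation on both sides, and the OMP OFFLOAD directive sends the OpenMP loop to the coprocessor. The in/out clauses shown are an assumption about how the data would be named, not the actual GEOS-5 change.

    ! Caller side (sketch):
    subroutine call_sorad(nblocks, ncol, ta, flx)
       implicit none
       integer, intent(in)  :: nblocks, ncol
       real,    intent(in)  :: ta(ncol, nblocks)
       real,    intent(out) :: flx(ncol, nblocks)
       integer :: j
    !DIR$ ATTRIBUTES OFFLOAD : mic :: sorad
    !DIR$ OMP OFFLOAD TARGET(mic) in(ta) out(flx)
    !$omp parallel do
       do j = 1, nblocks
          call sorad(ncol, ta(:, j), flx(:, j))
       end do
    !$omp end parallel do
    end subroutine call_sorad

    ! In sorad.f90 (sketch):
    subroutine sorad(ncol, ta, flx)
    !DIR$ ATTRIBUTES OFFLOAD : mic :: sorad
       implicit none
       integer, intent(in)  :: ncol
       real,    intent(in)  :: ta(ncol)
       real,    intent(out) :: flx(ncol)
       flx = 0.001 * ta   ! stand-in for the real shortwave column physics
    end subroutine sorad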

Xeon Phi: Shortwave Radiation Kernel
- Constant data modules: add !DIR$ ATTRIBUTES OFFLOAD : mic :: <constant>
- Make build-system changes for MIC:
  - Add a new MICOPT = -openmp variable, added to FOPT only in MIC-enabled code
    (some parts of GEOS-5 have OpenMP we don't want to enable)
  - Use xiar -qoffload-build when linking (only in the full model)
  - If not building for MIC, must pass -no-offload
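A sketch of the constant-data change (module, names, and values are illustrative, not the real sorad constants): the offload attribute makes the module data visible to MIC-side code as well as the host.

    module sorad_constants
       implicit none
    !DIR$ ATTRIBUTES OFFLOAD : mic :: solar_const, co2_vmr
       real, parameter :: solar_const = 1365.0     ! W m-2, illustrative
       real, parameter :: co2_vmr     = 390.0e-6   ! volume mixing ratio, illustrative
    end module sorad_constants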

Xeon Phi: Shortwave Radiation Kernel
Comparison of the SW radiation kernel on various platforms

Xeon Phi: Shortwave Radiation in the Full Model
CPU-MIC difference of dT/dt from shortwave radiation in K/day after 6 years; contours 0-17 K/day (top)

Xeon Phi: Future Directions
- We know the data movement from CUDA, which makes the OpenMP easier to implement
- OpenMP 4.0 is available soon/now; all Intel directives will be moved to OpenMP 4.0
- More benchmarking with the full model: test scatter vs. balanced vs. compact affinity, and other options
- With the tester: use VTune to look at the code in detail and examine alignment, etc.
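A minimal sketch (illustrative names and physics) of what the Intel offload directives could become once moved to OpenMP 4.0 target directives: the map clauses carry the same data-movement knowledge gained from the CUDA work.

    subroutine sorad_columns(ncol, s0, albedo, flx)
       implicit none
       integer, intent(in)  :: ncol
       real,    intent(in)  :: s0, albedo(ncol)
       real,    intent(out) :: flx(ncol)
       integer :: i
       !$omp target map(to: albedo) map(from: flx)
       !$omp parallel do
       do i = 1, ncol
          flx(i) = s0 * (1.0 - albedo(i))   ! stand-in for the real column physics
       end do
       !$omp end parallel do
       !$omp end target
    end subroutine sorad_columns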

Xeon Phi: Challenges
- Currently using one core per MIC, like the GPU days; how do we duplicate the Hyper-Q success?
  - Could use 8-16 cores to one MIC, but that means running 15-30 threads per CPU core
  - Srinath showed this might be useful to explore in native mode as well
- Enabling MIC offload in the model requires turning off the solar load-balancer
  - "offload error: offload data transfer supports only a single contiguous memory range per variable"; yet only one MPI rank is talking to one MIC, right?
- Intel rightly prefers Native over Offload for MIC, but GEOS-5 is, at its heart, an I/O-heavy model
  - How do we handle outputting GBs of data per step?
  - Full Native seems impossible: Symmetric will be needed

Xeon Phi: Challenges - Libraries
- Just to compile, the GEOS-5 AGCM requires: JPEG, ZLIB, szlib, curl, HDF4, HDF5, NetCDF, NetCDF-Fortran, UDUNITS2, ESMF
- Not impossible, but there are issues with autotools and libraries (cross-compiling)
- And what about "make test"?
- GEOS-5 is also a DAS: SciPy, CMOR, HDFEOS, HDFEOS5, SDPToolkit

Sandy Bridge: Shortwave Radiation Kernel
- GPUs and Xeon Phis are not the only multi-core architectures out there that can be accelerated
- What about good ol' vectorization and AVX? Can the compiler do something for us with just a bit of work?
- Working with Dan Kokron of NAS, the shortwave radiation kernel was modified to better expose vectorization

Sandy Bridge: Shortwave Radiation Kernel
Step 1: Compiler options
- Before: -O3
- After: -O3 -axAVX -xSSE4.1 -align array32byte
  - -ip seems to help in a stand-alone tester, but in the full model it seems to not do much
- Also: -fpe0 can limit the compiler; lowering to -fpe1 or -fpe3 can help, but beware: you might depend on that behavior and expose issues!
Step 2: Use -vec-report<n> to examine what the compiler is doing

Sandy Bridge: Shortwave Radiation Kernel
Step 3: Code changes
- !DIR$ ATTRIBUTES FORCEINLINE for the expensive, scalarized subroutine deledd, which is called multiple times
- For flux integrations, create per-band accumulators that are local to each column, then at the end sum them into the final fluxes in a SIMD fashion
- Remove loop-bound dependencies preventing vectorization (see the vec-report)
- Use !DIR$ SIMD directives to help the compiler identify true SIMD loops
  (but be careful: you are telling the compiler you're smarter than it is!)
- Use !DIR$ NOVECTOR and !DIR$ NOFUSION directives to make sure the compiler isn't too aggressive
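A minimal sketch (illustrative loop, not the real sorad code) of two of the directive uses named above: forcing an inline of a small routine and asserting a true SIMD loop over a per-band accumulator.

    subroutine add_band_flux(ncol, wgt, fdn, fup, flx_band)
    !DIR$ ATTRIBUTES FORCEINLINE :: add_band_flux   ! inline at every call site
       implicit none
       integer, intent(in)    :: ncol
       real,    intent(in)    :: wgt, fdn(ncol), fup(ncol)
       real,    intent(inout) :: flx_band(ncol)
       integer :: k
       ! Assert there is no loop-carried dependence so the compiler vectorizes
       ! the per-band accumulation.
    !DIR$ SIMD
       do k = 1, ncol
          flx_band(k) = flx_band(k) + wgt * (fdn(k) - fup(k))
       end do
    end subroutine add_band_flux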

Sandy Bridge: Shortwave Radiation Kernel
- In all, about 40 lines of additional code in a 1944-line file (excluding code cleanup)
- Results (c360, 1-day, 384 cores, Xeon E5-2670):
  - Original code: 282.101 s
  - Vector code: 144.150 s
  - ~2x speedup! Not quite zero-diff (investigating)
- Haven't looked at alignment in depth yet
- Vectorization success on Sandy Bridge doesn't seem to translate directly to Xeon Phi; needs more investigation

Sandy Bridge: Shortwave Radiation Kernel
Is sorad a unique success?
- The code was already nearly fully inlined due to the heritage of the PGI Accelerator pragma work
- Only one very expensive, all-scalar internal subroutine to inline
- Other GPU kernels have been highly scalarized and localized (like sorad, with a main loop over columns)
- Preliminary work by Dan Kokron on the longwave code shows benefits to a careful analysis
- But it is more complex (hoisting calculations, etc.), not just adding a few lines/directives

Thanks
- Max Suarez & Bill Putman of GMAO
- Dan Kokron of NAS
- Carl Ponder of NVIDIA
- Mike Greenfield & Indraneil Gokhale of Intel
- GMAO, NCCS, and NAS Computing Support
- NVIDIA/PGI Support

Questions?