PERFORMANCE PORTABILITY WITH OPENACC. Jeff Larkin, NVIDIA, November 2015


TWO TYPES OF PORTABILITY

FUNCTIONAL PORTABILITY: The ability for a single code to run anywhere.
PERFORMANCE PORTABILITY: The ability for a single code to run well anywhere.

Many programming models provide functional portability, but can we provide performance portability too?

OPENACC
Directives for Portable Parallelism

OPEN
SINGLE SOURCE

    main()
    {
      <serial code>
      #pragma acc kernels
      {
        <parallel code>
      }
    }

PORTABLE
HIGH PERFORMANCE
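The slide's skeleton leaves the serial and parallel regions as placeholders. A minimal sketch of what a filled-in version might look like follows. The `saxpy` routine, its parameters, and the data clauses are illustrative assumptions, not taken from the talk.

```c
/* Hypothetical example (not from the slides): y = a*x + y. */
void saxpy(int n, float a, const float *restrict x, float *restrict y)
{
    /* Serial host code could run here. */

    /* The kernels directive asks the compiler to find and parallelize the
       loops inside the block. The copyin/copy clauses (an assumption of this
       sketch) state how much of each array the region touches. A compiler
       without OpenACC support ignores the pragma and runs the loop serially. */
    #pragma acc kernels copyin(x[0:n]) copy(y[0:n])
    {
        for (int i = 0; i < n; i++) {
            y[i] = a * x[i] + y[i];
        }
    }
}
```

The same source builds with or without OpenACC enabled, which is the "single source" property the slide highlights.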

"[The] focus on mythical Performance Portability misses the point. The issue is maintainability."
Source: Enabling Application Portability across HPC Platforms: An Application Perspective, OpenMPCon 2015

"Having a common code base using a portable programming environment, even if you must fill the code with ifdefs or have architecture-specific versions of key kernels, is the only way to support maintainability."
Source: Enabling Application Portability across HPC Platforms: An Application Perspective, OpenMPCon 2015

MAINTAINABLE CODE
Portable parallelism is easier to maintain.

    #if defined(cpu)
    #pragma omp parallel for simd
    #elif defined(mic)
    #pragma omp target teams distribute \
            parallel for simd
    #elif defined(omp_gpu)
    #pragma omp target teams distribute \
            parallel for schedule(static,1)
    #elif defined(something_else)
    #pragma omp target...
    #endif
    for(int i = 0; i < N; i++)

    #pragma acc parallel loop
    for(int i = 0; i < N; i++)
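The slide shows only the loop headers. As a rough sketch of the single-directive variant applied to a complete loop, consider the function below. The `dot` function, its arguments, and the `reduction` and `copyin` clauses are assumptions made for illustration; the slides do not show a loop body.

```c
/* Hypothetical example (not from the slides): dot product of a and b. */
double dot(int n, const double *restrict a, const double *restrict b)
{
    double sum = 0.0;

    /* One directive, no preprocessor branches per target. The reduction
       clause tells the compiler that sum is accumulated across iterations.
       Without OpenACC support the pragma is ignored and the loop runs
       serially with the same result. */
    #pragma acc parallel loop reduction(+:sum) copyin(a[0:n], b[0:n])
    for (int i = 0; i < n; i++) {
        sum += a[i] * b[i];
    }
    return sum;
}
```

Where the loop runs is chosen at compile time by compiler options rather than by `#if` branches in the source.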

Performance portability enables maintainable code.

But is performance portability a myth?

OPENACC DELIVERS PERFORMANCE PORTABILITY

[Bar chart: speedup vs. single CPU core (axis 0 to 12, with one bar labeled 30X) for miniGhost (Mantevo), NEMO (Climate & Ocean), and CLOVERLEAF (Physics). Series: CPU: MPI + OpenMP*; CPU: MPI + OpenACC; CPU + GPU: MPI + OpenACC.]

Compiler: PGI 15.10
miniGhost: CPU: Intel Xeon E5-2698 v3, 2 sockets, 32 cores total; GPU: Tesla K80 (single GPU)
NEMO: each socket CPU: Intel Xeon E5-2698 v3, 16 cores; GPU: NVIDIA K80 (both GPUs)
CLOVERLEAF: CPU: dual-socket Intel Xeon E5-2690 v2, 20 cores total; GPU: Tesla K80 (both GPUs)
* NEMO run used all-MPI

WHY OPENACC?
The right programming model for performance portability.

DESCRIPTIVE: Directives inform and enable the compiler to effectively parallelize on any architecture.
INCREMENTAL: Programmer can tune the code when needed without ifdefs or loss of portability.
MODERN: Removes scalability blockers by design to support modern processors.
MATURE: Multiple mature implementations on multiple platforms.
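One way the "tune without ifdefs" point might look in practice is OpenACC's `device_type` clause, which limits the clauses that follow it to one kind of target. The sketch below is an illustration under assumptions: the `scale` function, the value `256`, and the device name `nvidia` are not from the talk, and accepted device names are implementation-defined.

```c
/* Hypothetical example (not from the slides): v = alpha * v. */
void scale(int n, double alpha, double *restrict v)
{
    /* Clauses before device_type apply to every target. The vector_length
       hint after device_type(nvidia) applies only when compiling for that
       target; other targets keep the compiler's defaults. The number 256 is
       an arbitrary illustrative value, not a recommendation. */
    #pragma acc parallel loop copy(v[0:n]) \
        device_type(nvidia) vector_length(256)
    for (int i = 0; i < n; i++) {
        v[i] *= alpha;
    }
}
```

The tuning hint sits in the one directive, so the loop itself stays free of preprocessor branches and still compiles as plain serial C when OpenACC is disabled.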

OpenACC is enabling scientific discoveries

LS-DALTON
Large-scale application for calculating high-accuracy molecular energies

"OpenACC makes GPU computing approachable for domain scientists. Initial OpenACC implementation required only minor effort, and more importantly, no modifications of our existing CPU implementation."
Janus Juul Eriksen, PhD Fellow, qLEAP Center for Theoretical Chemistry, Aarhus University

Minimal Effort
Lines of code modified: <100 lines
Number of weeks required: 1 week
Number of codes to maintain: 1 source

Big Performance
LS-DALTON CCSD(T) module, benchmarked on the Titan supercomputer (AMD CPU vs. Tesla K20X)
[Bar chart: speedup vs. CPU for ALANINE-1 (13 atoms), ALANINE-2 (23 atoms), and ALANINE-3 (33 atoms); bar labels read 7.9x, 8.9x, and 11.7x.]

NUMECA
OpenACC Accelerates Commercial CFD Application

"OpenACC enabled us to target routines for GPU acceleration without rewriting code, allowing us to maintain portability on a code that is 20-year old."
David Gutzwiller, Head of HPC, NUMECA International

CHALLENGE
Accelerate 20-year-old, highly optimized code on GPUs

SOLUTION
Accelerated computationally intensive routines with OpenACC

RESULTS
Achieved 10x or higher speed-ups on key routines
Full app speedup of up to 2x on the Oak Ridge Titan supercomputer
Total time spent on optimizing various routines with OpenACC was just five person-months

MRI RECONSTRUCTION
OpenACC Accelerates Advanced MRI Reconstruction Model

"Now that we've seen how easy it is to program the GPU using OpenACC and the PGI compiler, we're looking forward to translating more of our projects."
Brad Sutton, Associate Professor of Bioengineering and Technical Director of the Biomedical Imaging Center, University of Illinois at Urbana-Champaign

CHALLENGE
Produce detailed and accurate brain images by applying computationally intensive algorithms to MRI data
Within short reconstruction time for diagnostic use

SOLUTION
Accelerated MRI reconstruction application with OpenACC using NVIDIA GPUs

RESULTS
Reduced reconstruction time for a single high-resolution MRI scan from 40 days to a couple of hours
Scaled on Blue Waters at NCSA to reconstruct 3000 images in under 24 hours

INCOMP3D
3D incompressible Navier-Stokes fully implicit CFD solver

[Bar chart: SD7003 URANS test run, speedup vs. single CPU core (axis 0 to 120) for LHS, RHS, Linear Solver, and Ghost cell packing. Series: Xeon E5645 (MPI)*; NVIDIA K40 (OpenACC).]

"OpenACC is a highly effective tool for programming fully implicit CFD solvers on GPU to achieve 3X speedup."
Lixiang Luo, Researcher, Aerospace Engineering Computational Fluid Dynamics Laboratory, North Carolina State University (NCSU)

CHALLENGE
Accelerate a complex memory bounded implicit CFD solver on GPU

SOLUTION
Accelerated mathematical formulations one loop at a time with OpenACC

RESULTS
Achieved average speedups of 3X on Tesla GPU over parallel MPI implementation on x86 multicore processor
Structured approach to parallelism provided by OpenACC allowed better algorithm design without worrying about GPU internals

* CPU speedup on 6 cores of Xeon E5645; with additional cores, performance reduces due to partitioning and MPI overheads

Can a single programming model provide both functional and performance portability?

TWO TYPES OF PORTABILITY

FUNCTIONAL PORTABILITY: The ability for a single code to run anywhere.
PERFORMANCE PORTABILITY: The ability for a single code to run well anywhere.

Many programming models provide functional portability, but OpenACC can provide performance portability too!