OpenACC 2 vs. OpenMP 4. James Lin 1,2 and Satoshi Matsuoka 2


2014 @ San Jose. Shanghai Jiao Tong University / Tokyo Institute of Technology. OpenACC 2 vs. OpenMP 4: The Strong, the Weak, and the Missing to Develop Performance Portable Applications on GPU and Xeon Phi. James Lin 1,2 and Satoshi Matsuoka 2. 1 Center for High Performance Computing, 2 Satoshi MATSUOKA Laboratory

Outline
- OpenACC 2 and OpenMP 4
- Systematic Optimizations for Portable Performance on GPU and Xeon Phi
- Summary

Why Directive-based Programming? We may have many reasons to use directive-based programming, but for me, it can keep the code readable and maintainable from the application developers' point of view. When CUDA experts port Version 1 to CUDA, the CUDA version becomes unmaintainable to the application developers, who keep developing their own Version 2 based on Version 1.

Directive-based Programming for Accelerators [1]
- Standards: OpenACC, OpenMP >= 4.0
- Products: PGI Accelerators, HMPP
- Research projects: R-Stream, HiCUDA, OpenMPC/OpenMP for accelerators

[1] S. Lee and J. S. Vetter, "Early evaluation of directive-based GPU programming models for productive exascale computing," presented at the International Conference for High Performance Computing, Networking, Storage and Analysis (SC), 2012, pp. 1-11.

OpenACC 2: the standard evolves much faster than OpenMP and MPI (roughly one version every ~1.5 years)

Standard          v1.0       v2.0       v3.0       v4.0
OpenACC           Nov 2011   July 2013  -          -
OpenMP-Fortran    Oct 1997   Nov 2000   May 2008   July 2013
MPI               June 1994  July 1997  Sept 2012  -

Supported by the PGI/CAPS/Cray compilers

OpenMP 4.0 for Accelerators: released in July 2013, it supports directive-based programming on accelerators such as GPU and Xeon Phi. Directives for parallelism: target/parallel. Data: target data. Two levels of parallelism: teams/distribute.

OpenACC 2              OpenMP 4
Parallel               Target
Kernels                N/A
Data                   Target Data
Loop                   Distribute/Do/for/SIMD
Host data              N/A
Cache                  N/A
Update                 Target Update
Wait                   N/A
Declare                Declare Target
{enter, exit} data     N/A
Routine                Declare target
Async wait             N/A
Tile                   N/A
Device_type            N/A
Atomic                 Atomic
N/A                    Critical sections
N/A                    Barrier

OpenACC 2 has richer features than OpenMP 4

Experimental Setup
- 2 target devices (accelerators used in supercomputers, so no AMD GPU):
  - GPU: Kepler K40
  - Xeon Phi: KNC 5110P
- 3 compilers:
  - OpenACC: CAPS Compiler v3.4.1, supporting OpenACC 2.0 (OpenARC is not publicly available yet)
  - OpenMP 4: HOMP (ROSE-based) for GPU, Intel Compiler for Xeon Phi
- 2 test cases:
  - HydroBench, a mini-app: https://github.com/hydrobench/hydro
  - Copy benchmark

HOMP, an OpenMP compiler for CUDA. Developed by LLNL, it is built on ROSE [1], a source-to-source compiler. It is an early research implementation [2] of OpenMP 4.0 on GPU, targeting CUDA 5.0, and supports C/C++ only for now. [1] http://rosecompiler.org [2] C. Liao, Y. Yan, B. R. de Supinski, D. J. Quinlan, and B. Chapman, "Early Experiences with the OpenMP Accelerator Model," IWOMP'13.

Tools used (toolchain diagram): OpenACC via the CAPS 3.4 compiler, lowered either to CUDA 5.5 (nvcc, based on LLVM; PTX is open) or to OpenCL 1.2 (ICC, SPIR); OpenMP 4.0 via HOMP for GPU or ICC 13.0 offload for Xeon Phi. PTX is assembled by ptxas/Ocelot into GPU assembly code for Kepler K20/K20X/K40; the offload path produces x86 assembly code for KNC 3/5/7110P.


[Figure] Performance impact of K40 and 5110P, measured with data transfer included.

Outline
- OpenACC 2 and OpenMP 4
- Systematic Optimizations for Portable Performance on GPU and Xeon Phi
- Summary

Performance Portability: includes source code portability of a kernel implementation, and performance that is commensurate with a device-specific implementation of that kernel [Kokkos]. Portability for accelerators used in HPC means covering:
- Different accelerators: GPU vs. Xeon Phi
- Different generations: Kepler vs. Fermi
- Different versions: K20 vs. K40

Portable Performance Optimizations
- Ninja version: best optimization, ideal performance
- Frog version: achieving reasonable performance on both GPU and Xeon Phi, via algorithm changes and compiler technology
- Naive version: simple directives, zero optimization

Optimization for Portable Performance: systematic accelerator optimizations connect applications (SGEMM, FFT, LBM, stencil, SpMV) and directives (data directives, loop directives) to each level of the memory hierarchy:
- SIMD: SIMD-friendly algorithms, arithmetic intensity
- Registers (REG): register-level data locality
- Shared memory (SHMEM): SHMEM-level data locality
- Global memory (GM): GM-level data locality
- Host memory (HM)
combined with algorithm changes and compiler technology.

Global Memory (GM) Level
- Memory coalescing [ISCA09]
- Thread-grid mapping [MuCoCoS13]

Shared Memory/L2 (SHMEM) Level
- SHMEM blocking (data tiling)
- Data layout: AoS -> SoA
- Avoid SHMEM bank conflicts

Register Level
- Prefetching
- Loop unrolling and jam
- Undocumented register bank conflicts in Kepler [CGO13]

SIMD Friendly Algorithm [ISCA12]

Compiler Technology
- Fast math (-fastmath): for sin(), replaces several FMAD (floating-point multiply-add) instructions with SFU intrinsic functions; faster, but single-precision only and loses some accuracy
- Replace FMAD with FMA (-fmad=false): FMAD -> RND(RND(a*b)+c); FMA (fused multiply-add) -> RND(a*b+c), improving both accuracy and performance
- Asynchronous execution using async clauses (OpenACC only)

Opportunities for Auto-tuning with a Single Source Code Base
- Code level: self-embedded code
- Compiler level: support for vendor intrinsics
- Tool level: CAPS auto-tuning tool
- Library level: libraries for portable performance (Kokkos); drop-in support for the CUDA math library and MKL

Outline
- OpenACC 2 and OpenMP 4
- Systematic Optimizations for Portable Performance on GPU and Xeon Phi
- Summary

Summary
- Portable performance will be an important feature for the directive-based programming approach
- OpenACC 2 is at an early stage: gridification is recommended; the gang/worker/vector directives are weak; the tile directive is weak; cache directive support is still missing
- OpenMP 4 is not ready yet: teams/distribute support on GPU is still missing
- Systematic optimization will be needed