Towards an Efficient CPU-GPU Code Hybridization: a Simple Guideline for Code Optimizations on Modern Architecture with OpenACC and CUDA

Towards an Efficient CPU-GPU Code Hybridization: a Simple Guideline for Code Optimizations on Modern Architecture with OpenACC and CUDA
L. Oteski, G. Colin de Verdière, S. Contassot-Vivier, S. Vialle, J. Ryan
Acknowledgements: CEA/DIF, IDRIS, GENCI, NVIDIA, Région Lorraine.

CFD at ONERA
One simulation = thousands of CPU hours. (Illustration from http://www.onera.fr)

Why optimize?
Cost of a CPU hour: ~$0.05-$0.10 [Walker (2009)]. Example: 1000 cores for 1 month costs ~$36,000-$72,000.
Reducing the application runtime allows one to:
- be more cost-efficient,
- get results faster,
- increase the problem size.

CPU-GPU differences
CPU: few cores, vector units, register usage hard to control.
GPU: huge number of cores, vector machine, register usage easy to inspect and control (--ptxas-options=-v with nvcc).
Question: given that both rely on vector execution, can the GPU and the CPU be treated as similar devices? Can the GPU be used to optimize the CPU code?

A prototype flow
Yee's vortex with the full Navier-Stokes equations, on a 2-dimensional structured mesh.
Numerical method:
- Space: 2nd-order Discontinuous Galerkin,
- Time: 3rd-order TVD Runge-Kutta,
- Boundaries: periodic in X and Y.

Test case configuration
2001x2001 mesh, 100 time-steps.
CPU: Intel Xeon E5-1650v3 (6 cores). Compiler: icc 2016, -O3 -march=native.
GPU: K20Xm (2688 CUDA cores). Compilers: nvcc CUDA-8.0, -O3; pgc++ 16.10, -O3 -acc -ta=tesla:cuda8.0,nollvm,nordc.

Experimental protocol
Developments are done in the full CUDA version first; the kernel bodies are then copied and pasted into the other versions.

GPU-CUDA (source of the copy):
int tidx = threadIdx.x + ...;
int tidy = threadIdx.y + ...;
if (tidx < nx && tidy < ny) {
    ... // kernel body
}

CPU-OpenMP (paste of the same body):
#pragma omp for
for (int tidy = 0; tidy < ny; tidy++) {
    #pragma simd
    for (int tidx = 0; tidx < nx; tidx++) {
        ... // same body
    }
}

Experimental protocol
Developments are done in the full CUDA version first; the kernel bodies are then copied and pasted into the other versions.

GPU-CUDA (source of the copy):
int tidx = threadIdx.x + ...;
int tidy = threadIdx.y + ...;
if (tidx < nx && tidy < ny) {
    ... // kernel body
}

GPU-OpenACC (paste of the same body):
#pragma acc parallel
{
    #pragma acc loop gang independent
    for (int tidy = 0; tidy < ny; tidy++) {
        #pragma acc loop vector independent
        for (int tidx = 0; tidx < nx; tidx++) {
            ... // same body
        }
    }
}
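
To make the protocol concrete, here is a minimal compilable illustration of the copy-paste idea; the array names u and rhs, the function names and the one-line update are assumptions for the example, not the actual DG kernels:

// The one-line body is written once in the CUDA kernel and pasted unchanged
// into the OpenMP loop nest.
__global__ void update_cuda(double *u, const double *rhs, double dt,
                            int nx, int ny)
{
    int tidx = threadIdx.x + blockIdx.x * blockDim.x;
    int tidy = threadIdx.y + blockIdx.y * blockDim.y;
    if (tidx < nx && tidy < ny)
        u[tidy * nx + tidx] += dt * rhs[tidy * nx + tidx];   // kernel body
}

// Called from inside the "#pragma omp parallel" region of Step 0.
void update_openmp(double *u, const double *rhs, double dt, int nx, int ny)
{
    #pragma omp for
    for (int tidy = 0; tidy < ny; tidy++) {
        #pragma simd
        for (int tidx = 0; tidx < nx; tidx++)
            u[tidy * nx + tidx] += dt * rhs[tidy * nx + tidx]; // same body pasted
    }
}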

Step 0: Code adaptation

GPU-CUDA:
    Host -> Device transfers: cudaMemcpy(...)
    Time loop + CUDA kernels
    Device -> Host transfers: cudaMemcpy(...)

CPU-OpenMP:
    #pragma omp parallel
    {
        Time loop + omp pragmas
    } // end of the omp region

GPU-OpenACC:
    #pragma acc data copy(...) copyin(...) create(...)
    {
        Time loop + acc pragmas
    } // end of the acc region

(A minimal OpenACC sketch of this structure follows.)
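
A minimal sketch of the OpenACC structure above, assuming a single array u, a scratch array rhs and a dummy update (none of these come from the actual solver):

#include <stdio.h>

int main(void)
{
    const int n = 1 << 20, nIter = 100;
    static double u[1 << 20], rhs[1 << 20];
    for (int i = 0; i < n; i++) u[i] = 1.0;

    // Step 0: keep the data on the device for the whole time loop.
    #pragma acc data copy(u[0:n]) create(rhs[0:n])
    {
        for (int it = 0; it < nIter; it++) {       // time loop stays on the host
            #pragma acc parallel loop gang vector  // kernel 1
            for (int i = 0; i < n; i++)
                rhs[i] = 0.5 * u[i];               // dummy right-hand side
            #pragma acc parallel loop gang vector  // kernel 2
            for (int i = 0; i < n; i++)
                u[i] += rhs[i];                    // dummy update
        }
    } // u is copied back to the host only here
    printf("u[0] = %f\n", u[0]);
    return 0;
}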

Step 0: Code adaptation (performance chart). Metric: P (cell updates per second [cus]) = number of points x number of iterations / time (s). CPU: E5-1650v3 (6 cores), GPU: K20Xm.

Step 1: Reduce remote accesses
- Load reused remote variables into registers (temporal locality),
- Use the restrict keyword on pointers.

Before:
__global__ void func(double *results, int N)
{
    int tidx = threadIdx.x + ...;
    for (int i = 0; i < N; i++) {
        results[tidx] += ...;   // accumulates directly in global memory
    }
}

After:
__global__ void func(double * __restrict__ results, int N)
{
    int tidx = threadIdx.x + ...;
    double loc_result = results[tidx];   // kept in a register
    for (int i = 0; i < N; i++) {
        loc_result += ...;
    }
    results[tidx] = loc_result;          // single write back to global memory
}

Step 1: Reduce remote accesses (performance chart, same metric and hardware as above).

Step 2: Merge similar kernels
Merge kernels which share similar memory access patterns.

Before:
1. Compute the time-step
2. For each Runge-Kutta step s:
   a. Convective fluxes in X,
   b. Convective fluxes in Y,
   c. Convective integral,
   d. Compute local viscosity,
   e. Viscous fluxes in X,
   f. Viscous fluxes in Y,
   g. Viscous integral,
   h. Runge-Kutta propagation.

After:
1. Compute the time-step
2. For each Runge-Kutta step s:
   a. Compute local viscosity,
   b. Fluxes in X,
   c. Fluxes in Y,
   d. Integral + Runge-Kutta.

(A minimal CUDA sketch of the merging follows.)
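
A minimal CUDA sketch of the idea; the array names and the flux formulas are placeholders, not the actual DG fluxes:

// Before: two kernels that both read u[i] from global memory.
__global__ void fluxX(const double *u, double *fx, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) fx[i] = 0.5 * u[i];           // placeholder flux in X
}
__global__ void fluxY(const double *u, double *fy, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) fy[i] = 0.25 * u[i];          // placeholder flux in Y
}

// After: one merged kernel, so u[i] is loaded from global memory only once.
__global__ void fluxXY(const double *u, double *fx, double *fy, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        double ui = u[i];                    // single load, kept in a register
        fx[i] = 0.5  * ui;
        fy[i] = 0.25 * ui;
    }
}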

Step 2: Merge similar kernels (performance chart, same metric and hardware as above).

Step 3: Factor computations
- Transform divisions into multiplications,
- Control whether or not a loop should be unrolled.

Before:
#pragma unroll
for (int i = 0; i < size; i++) {
    double val = ...;
    double c0 = 1 / (sqrt(0.5) * val) * ...;
    double c1 = 1 / (3 * val) * ...;
    ...
}

After:
double isqrt0p5 = 1.0 / sqrt(0.5);   // constants hoisted out of the loop
double inv3 = 1.0 / 3.0;
#pragma unroll   // Keep it?
for (int i = 0; i < size; i++) {
    double val = ...;
    double ival = 1.0 / val;         // one division reused by all coefficients
    double c0 = isqrt0p5 * ival * ...;
    double c1 = inv3 * ival * ...;
    ...
}

Step 3: Factor computations (performance chart, same metric and hardware as above).

Step 4: Align data (+ tiling for CPU)
- GPU: align data on the warp size so that accesses coalesce.
- CPU: tiling (blocking) algorithm with a fixed block size; the tiles are distributed over the OpenMP threads (a sketch follows).
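
A minimal sketch of such a tiled OpenMP loop nest; the grid size, the block size BS and the one-line cell update are assumptions for the example:

#include <stdio.h>

#define NX 2048
#define NY 2048
#define BS 64   // fixed block (tile) size, assumed for the example

static double a[NY][NX], b[NY][NX];

int main(void)
{
    // Tiles are distributed over the OpenMP threads; each tile fits in cache.
    #pragma omp parallel for collapse(2)
    for (int jb = 0; jb < NY; jb += BS)
        for (int ib = 0; ib < NX; ib += BS)
            for (int j = jb; j < jb + BS; j++) {
                #pragma omp simd
                for (int i = ib; i < ib + BS; i++)
                    b[j][i] = 2.0 * a[j][i];   // placeholder cell update
            }
    printf("b[0][0] = %f\n", b[0][0]);
    return 0;
}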

Step 4: Align data (+ tiling for CPU) (performance chart, same metric and hardware as above).

Step 5: Miscellaneous
- OpenACC, PGI 16.10: compiling with -O1/-O2/-O3/-O4 makes the registers overflow; -O0 gives +82% performance (corrected in v17).
- CPU: introduce MPI to support several nodes and to improve data locality on each one.
- #pragma simd on huge loops creates huge register pressure and is not suitable for CPU vectorization: split the SIMD loops into smaller loops and adjust vectorization with #pragma vector always (sketch below).
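
A minimal sketch of the loop splitting; the loop bodies are placeholders, the point being that each smaller loop keeps fewer temporaries alive:

// Illustrative only: the bodies are placeholders, not the DG kernels.
void big_loop(const double *a, const double *b, double *tmp, double *out,
              double c0, double c1, int n)
{
    // Before: one large SIMD loop holding many temporaries alive at once.
    #pragma simd
    for (int i = 0; i < n; i++) {
        double t0 = a[i] * c0;
        double t1 = b[i] * c1;
        out[i] = t0 + t1;
    }

    // After: the same work split into smaller loops over the same range,
    // each vectorized explicitly, so the register pressure per loop drops.
    #pragma vector always
    for (int i = 0; i < n; i++)
        tmp[i] = a[i] * c0;
    #pragma vector always
    for (int i = 0; i < n; i++)
        out[i] = tmp[i] + b[i] * c1;
}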

Step 5: Miscellaneous (performance chart, same metric and hardware as above).

Running on various devices (performance chart across CPUs, GPUs and a Xeon Phi, same metric as above). *TDP evaluated from the manufacturers' documentation.

Vectorized OpenACC code
Run on 2x E5-2680v4 (CPU):
- OpenACC code written for the GPU, compiled with pgc++ -ta=multicore -tp=x64 (CPU execution): 2.72 x 10^6 cus, ~25% of the performance of the Intel version.
- CPU code compiled with pgc++ -ta=multicore -tp=x64 (CPU execution): 5.4 x 10^6 cus, ~50% of the performance of the Intel version.
Therefore: OpenACC for CPUs must be developed with smaller loops than GPU kernels, and tiling algorithms should be added to improve CPU cache efficiency.

What's next?
GPUs can be used to efficiently optimize CPU codes, and this optimization strategy leads to highly similar kernels. Would it be possible to use the same kernels for both CPU and GPU within a single version of the code?

A unified development approach
Use C++ templates and preprocessor variables to define which version should be built:
- compiler directives (ignored when not using the right compiler),
- a template parameter to control the vector size (VSIZ = 1 for GPU, VSIZ > 1 for CPU),
- the inline keyword to avoid conflicts in function names.
A minimal sketch is given below.
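
A minimal sketch of how such a unified kernel could look; the DEVHOST macro, the VSIZ values and the one-line update are assumptions for illustration, not the actual solver code:

#ifdef __CUDACC__
#define DEVHOST __host__ __device__   // same body compiled for both targets
#else
#define DEVHOST
#endif

// Unified kernel body: VSIZ = 1 on the GPU (one cell per thread),
// VSIZ > 1 on the CPU (inner vector loop over VSIZ cells).
template <int VSIZ>
DEVHOST inline void cell_update(double *u, const double *rhs,
                                double dt, int i0, int n)
{
#ifndef __CUDA_ARCH__
    #pragma omp simd                    // honored by the CPU compiler only
#endif
    for (int v = 0; v < VSIZ; v++) {
        int i = i0 + v;
        if (i < n) u[i] += dt * rhs[i]; // placeholder update
    }
}

#ifdef __CUDACC__
__global__ void update_gpu(double *u, const double *rhs, double dt, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    cell_update<1>(u, rhs, dt, i, n);      // VSIZ = 1
}
#else
void update_cpu(double *u, const double *rhs, double dt, int n)
{
    #pragma omp parallel for
    for (int i0 = 0; i0 < n; i0 += 8)
        cell_update<8>(u, rhs, dt, i0, n); // VSIZ = 8
}
#endif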

Sketch of the hybridized code (diagram): the GPU part is managed by MPI process 0 of the node; the CPU part runs on all other MPI processes of the node.
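
A minimal MPI sketch of this split; the node-local communicator and the do_gpu_part/do_cpu_part functions are illustrative assumptions:

#include <mpi.h>

void do_gpu_part(void) { /* launch the CUDA/OpenACC kernels */ }
void do_cpu_part(void) { /* run the OpenMP/vectorized CPU kernels */ }

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    // Group the ranks that share a node into a node-local communicator.
    MPI_Comm node;
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                        MPI_INFO_NULL, &node);
    int node_rank;
    MPI_Comm_rank(node, &node_rank);

    if (node_rank == 0)
        do_gpu_part();   // GPU part: MPI process 0 of the node
    else
        do_cpu_part();   // CPU part: all other MPI processes of the node

    MPI_Comm_free(&node);
    MPI_Finalize();
    return 0;
}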

Work-balancing
Poor balance between the performances of CPUs and GPUs. A micro-benchmark inside the code balances the work between processes so that each process receives a share of the total work proportional to its measured performance:

    W_i = W_tot * P_i / (sum over j of P_j)

where P_i is the performance of MPI process i with the current work-balance, W_i the work assigned to MPI process i, W_tot the total work to be done, and the denominator the sum of the performances with the current work-balance.
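
A minimal sketch of this redistribution, assuming the proportional formula above; the measured performances and the cell count are placeholder values:

#include <stdio.h>

int main(void)
{
    // Performances measured by the in-code micro-benchmark
    // (placeholder values; rank 0 drives the GPU).
    double perf[4] = { 60.0e6, 5.0e6, 5.0e6, 5.0e6 };
    int    n_tot   = 2001 * 2001;              // total cells to distribute
    double sum = 0.0;
    for (int i = 0; i < 4; i++) sum += perf[i];

    for (int i = 0; i < 4; i++) {
        // W_i = W_tot * P_i / (sum over j of P_j)
        int w_i = (int)(n_tot * perf[i] / sum);
        printf("rank %d gets %d cells\n", i, w_i);
    }
    return 0;
}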

Performances (results chart).

Strong scalability (chart). Efficiency (%) = (effective performance / maximum performance) x 100.

Conclusion
GPUs can be used to efficiently optimize CPU codes. Four steps to optimize:
- reduce accesses to remote memories,
- merge kernels with similar data access patterns,
- factor your computations,
- align your data.
Future work: move to unstructured meshes; the performance ranking of the engines may change. To go deeper into GPU optimization: [Woolley (2013)].

Thank you for your attention. Contact: ludomir.oteski@onera.fr

Cache roofline (Intel Advisor 2017), single-threaded run on an E5-2680v4 (chart). The application reaches ~11 GFlops; the storage function is highlighted.

Optimization steps (chart). Reference case: 4785 s.