Lecture: Manycore GPU Architectures and Programming, Part 4 -- Introducing OpenMP and HOMP for Accelerators

Lecture: Manycore GPU Architectures and Programming, Part 4 -- Introducing OpenMP and HOMP for Accelerators. CSCE 569 Parallel Computing, Department of Computer Science and Engineering. Yonghong Yan, yanyh@cse.sc.edu, https://passlab.github.io/csce569/

Manycore GPU Architectures and Programming: Outline
- Introduction: GPU architectures, GPGPUs, and CUDA
- GPU execution model
- CUDA programming model
- Working with memory in CUDA: global memory, shared and constant memory
- Streams and concurrency
- CUDA instruction intrinsics and libraries
- Performance, profiling, debugging, and error handling
- Directive-based high-level programming models: OpenMP and OpenACC

HPC Systems with Accelerators
Accelerator architectures have become popular (GPUs and Xeon Phi), and multiple accelerators per node are common (2, 4, or 8). Ways of programming NVIDIA GPUs:
1. CUDA and OpenCL (low-level)
2. Libraries, e.g. cuBLAS, cuFFT, cuDNN
3. OpenMP, OpenACC, and others (rely on compiler support)
4. Application frameworks: TensorFlow, etc.
https://www.anandtech.com/show/12587/nvidias-dgx2-sixteen-v100-gpus-30-tb-of-nvme-only-400k

OpenMP 4.0 for Accelerators

AXPY Example with OpenMP: Multicore
y = α*x + y, where x and y are vectors of size n and α is a scalar. The data (x, y and α) are shared, so parallelization is relatively easy. Other examples: sum (a reduction) and stencil (halo-region exchange and synchronization).
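The transcript does not preserve the slide's code. As a minimal sketch (my reconstruction, not the original slide code; the function name and initialization values are illustrative), a multicore AXPY in C with OpenMP could look like the following, compiled with e.g. gcc -fopenmp:

    #include <stdio.h>
    #include <stdlib.h>

    /* y = a*x + y on the multicore host: the loop iterations are
       divided among the CPU threads of the parallel region */
    void axpy(int n, float a, float *x, float *y) {
        #pragma omp parallel for
        for (int i = 0; i < n; i++)
            y[i] = a * x[i] + y[i];
    }

    int main(void) {
        int n = 1 << 20;
        float a = 2.0f;
        float *x = malloc(n * sizeof(float));
        float *y = malloc(n * sizeof(float));
        for (int i = 0; i < n; i++) { x[i] = 1.0f; y[i] = 2.0f; }
        axpy(n, a, x, y);
        printf("y[0] = %f\n", y[0]);   /* expect 4.0 */
        free(x); free(y);
        return 0;
    }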

AXPY Offloading to a GPU using CUDA
1. Memory allocation on the device
2. Memcpy from host to device
3. Launch parallel execution (kernel)
4. Memcpy from device to host
5. Deallocation of device memory

AXPY Example with OpenMP: Single Device
y = α*x + y, where x and y are vectors of size n and α is a scalar. The target directive annotates an offloading code region; the map clause maps data between the host and the device, with to, from, and tofrom as the mapping directions. Array regions are used to specify which parts of the arrays to map.
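The slide's code is not preserved in the transcript. A minimal sketch of a single-device offloaded AXPY (my reconstruction, assuming an OpenMP 4.0+ compiler with target support; the function name is illustrative):

    /* y = a*x + y offloaded to the default device */
    void axpy_target(int n, float a, float *x, float *y) {
        /* x is only read on the device; y is copied in and back out */
        #pragma omp target map(to: x[0:n]) map(tofrom: y[0:n])
        #pragma omp parallel for
        for (int i = 0; i < n; i++)
            y[i] = a * x[i] + y[i];
    }

In OpenMP 4.5 and later, scalars such as n and a are firstprivate on the target construct by default, so they need no explicit map clause.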

OpenMP Computation and Data Offloading
#pragma omp target device(id) map(...) if(...)
  target: creates a device data environment and offloads the computation to the device
  device(int_exp): specifies a target device
  map([to | from | tofrom | alloc:] var_list): data mapping between the current (host) data environment and the device data environment
#pragma omp target data device(id) map(...) if(...)
  Creates a device data environment that can be reused/inherited by enclosed target regions.
[Diagram: application data in main memory is copied in to the accelerator, the target task is offloaded to the accelerator cores (omp target followed by omp parallel executed by accelerator threads while the CPU thread waits), and results are copied out back to the host.]
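As an illustration of the device() and if() clauses (my own sketch, not from the slides; the function name and the 100000 threshold are arbitrary):

    /* Offload to device 0 only when n is large enough to amortize the
       data-transfer cost; otherwise the if() clause keeps execution on the host. */
    void axpy_maybe_offload(int n, float a, float *x, float *y) {
        #pragma omp target device(0) map(to: x[0:n]) map(tofrom: y[0:n]) if(n > 100000)
        #pragma omp parallel for
        for (int i = 0; i < n; i++)
            y[i] = a * x[i] + y[i];
    }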

target and map Examples

Accelerator: Explicit Data Mapping
There are relatively few truly shared-memory accelerators so far, so the user is required to explicitly map data to and from the device memory, using array regions.

long a = 0x858;
long b = 0;
int anarray[100];

#pragma omp target data map(to:a) \
                        map(tofrom:b,anarray[0:64])
{
  /* a, b and anarray are mapped to the device */

  /* work on the device */
  #pragma omp target
  {
    ...
  }
} /* b and anarray are mapped back to the host */

target data Example
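The example on this slide is not preserved in the transcript. A minimal sketch of the idea (my own, reusing the AXPY arrays from earlier; the function name and second kernel are illustrative) shows a device data environment shared by two target regions:

    /* x and y stay resident on the device for both target regions;
       y is copied back to the host only once, at the end of the data region */
    void axpy_then_square(int n, float a, float *x, float *y) {
        #pragma omp target data map(to: x[0:n]) map(tofrom: y[0:n])
        {
            #pragma omp target
            #pragma omp parallel for
            for (int i = 0; i < n; i++)
                y[i] = a * x[i] + y[i];     /* first kernel: y = a*x + y */

            #pragma omp target
            #pragma omp parallel for
            for (int i = 0; i < n; i++)
                y[i] = y[i] * y[i];         /* second kernel reuses device-resident y */
        }
    }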

Accelerator: Hierarchical Parallelism
Organize the massive number of threads into teams of threads, e.g. mapped to a CUDA grid/block, and distribute loops over the teams.

#pragma omp target
#pragma omp teams num_teams(2) thread_limit(8)
{
  //-- creates a league of teams
  //-- only local (team-level) barriers are permitted
  #pragma omp distribute
  for (int i = 0; i < n; i++) {
    ...
  }
}

teams and distribute Loop Example
Doubly nested loops are mapped to the two levels of the thread hierarchy (league and team).
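The slide's code is not in the transcript. A minimal sketch of the mapping (my own, assuming a row-major n-by-m matrix A and an illustrative function name): the outer loop is distributed across teams, and the inner loop is workshared among the threads of each team.

    void add_one(int n, int m, float *A) {
        #pragma omp target map(tofrom: A[0:n*m])
        #pragma omp teams distribute
        for (int i = 0; i < n; i++) {        /* rows spread over the league of teams */
            #pragma omp parallel for
            for (int j = 0; j < m; j++)      /* columns spread over threads in a team */
                A[i*m + j] += 1.0f;
        }
    }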

Jacobi Example: The Impact of Compiler Transformation on Performance

#pragma omp target data device(gpu0) map(to: n, m, omega, ax, ay, b, f[0:n][0:m]) \
                        map(tofrom: u[0:n][0:m]) map(alloc: uold[0:n][0:m])
while ((k <= mits) && (error > tol)) {
  // a loop copying u[][] to uold[][] is omitted here
  #pragma omp target device(gpu0) map(to: n, m, omega, ax, ay, b, f[0:n][0:m], uold[0:n][0:m]) \
                     map(tofrom: u[0:n][0:m])
  #pragma omp parallel for private(resid,j,i) reduction(+:error)
  for (i = 1; i < (n-1); i++)
    for (j = 1; j < (m-1); j++) {
      resid = (ax*(uold[i-1][j] + uold[i+1][j]) \
             + ay*(uold[i][j-1] + uold[i][j+1]) + b*uold[i][j] - f[i][j]) / b;
      u[i][j] = uold[i][j] - omega * resid;
      error = error + resid*resid;
    }
  // the rest of the code is omitted...
}

[Figure: Jacobi execution time (s) versus matrix size (float, 128x128 to 2048x2048) for several versions: first version, target-data, loop collapse using linearization with static-even scheduling, loop collapse using 2-D mapping (16x16 block), loop collapse using 2-D mapping (8x32 block), and loop collapse using linearization with round-robin scheduling.]

Early Experiences With The OpenMP Accelerator Model; Chunhua Liao, Yonghong Yan, Bronis R. de Supinski, Daniel J. Quinlan and Barbara Chapman; International Workshop on OpenMP (IWOMP) 2013, September 2013

Mapping Nested Loops to GPUs
Need to achieve coalesced memory access on GPUs.

#pragma acc loop gang(2) vector(2)
for (i = x1; i < X1; i++) {
    #pragma acc loop gang(3) vector(4)
    for (j = y1; j < Y1; j++) {
        ...
    }
}

[Fig. 9: Double nested loop mapping. Fig. 10: Triple nested loop mapping.]

Compiler Transformation of Nested Loops for GPGPUs; Xiaonan Tian, Rengan Xu, Yonghong Yan, Sunita Chandrasekaran, and Barbara Chapman; Journal of Concurrency and Computation: Practice and Experience, August 2015

Compiler vs Hand-Written Applications
For each application: domain, the OpenACC directive combinations used, lines of code added relative to the serial version (OpenMP / OpenACC ports), and the OpenACC speedup over the sequential, OpenMP, and CUDA versions.

Needleman-Wunsch (Bioinformatics): data copy, copyin; kernels present; loop gang, vector, private. LOC added 6 / 5; speedup 2.98 (seq), 1.28 (OpenMP), 0.24 (CUDA).
Stencil (Cellular Automata): data copyin, copy, deviceptr; kernels present; loop collapse, independent. LOC added 1 / 3; speedup 40.55 (seq), 15.87 (OpenMP), 0.92 (CUDA).
Computational Fluid Dynamics, CFD (Fluid Mechanics): data copyin, copy, deviceptr; data present, deviceptr; kernels deviceptr; kernels loop, gang, vector, private; loop gang, vector; acc_malloc(), acc_free(). LOC added 8 / 46; speedup 35.86 (seq), 4.59 (OpenMP), 0.38 (CUDA).
2D Heat, grid size 4096x4096 (Heat Conduction): data copyin, copy, deviceptr; kernels present; loop collapse, independent. LOC added 1 / 3; speedup 99.52 (seq), 28.63 (OpenMP), 0.90 (CUDA).
Clever, 10 Ovals (Data Mining): data copyin; kernels present, create, copyin, copy; loop independent. LOC added 10 / 3; speedup 4.25 (seq), 1.22 (OpenMP), 0.60 (CUDA).
FeldKamp, FDK (Image Processing): kernels copyin, copyout; loop collapse, independent. LOC added 1 / 2; speedup 48.30 (seq), 6.51 (OpenMP), 0.75 (CUDA).

NAS Parallel Benchmarks for GPGPUs using a Directive-based Programming Model; Rengan Xu, Xiaonan Tian, Sunita Chandrasekaran, Yonghong Yan and Barbara Chapman; 27th International Workshop on Languages and Compilers for Parallel Computing (LCPC 2014)

AXPY Example with OpenMP: Multiple Devices
Use a parallel region with one CPU thread per device. Manually partition the arrays x and y; each thread offloads its subregion of x and y and chunks the loop in alignment with the data partition.
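The slide's code is not preserved in the transcript. A minimal sketch of the approach (my reconstruction; the function name is illustrative, and it assumes n divides evenly by the number of devices):

    #include <omp.h>

    /* y = a*x + y split across all available devices,
       with one host thread driving each device */
    void axpy_multidev(int n, float a, float *x, float *y) {
        int ndev = omp_get_num_devices();
        if (ndev == 0) ndev = 1;          /* fall back to the host */
        int chunk = n / ndev;             /* assume an even partition */

        #pragma omp parallel num_threads(ndev)
        {
            int dev   = omp_get_thread_num();
            int start = dev * chunk;

            /* each thread maps and offloads only its own subregion */
            #pragma omp target device(dev) \
                    map(to: x[start:chunk]) map(tofrom: y[start:chunk])
            #pragma omp parallel for
            for (int i = start; i < start + chunk; i++)
                y[i] = a * x[i] + y[i];
        }
    }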

Hybrid OpenMP (HOMP) for Multiple Accelerators
HOMP: Automated Distribution of Parallel Loops and Data in Highly Parallel Accelerator-Based Systems; Yonghong Yan, Jiawen Liu, and Kirk W. Cameron; IEEE International Parallel & Distributed Processing Symposium (IPDPS) 2017

Three Challenges to Implement HOMP
1. Load balancing when distributing loop iterations across different computational devices (CPU, GPU, and MIC). We developed 7 loop-distribution algorithms, and the runtime selects among them based on computation/data intensity.
2. Copying to each device only the data needed for the loop chunks assigned to that device. Runtime support for the ALIGN interface moves or shares data between memory spaces.
3. Selecting which devices to use for a computation for optimal performance, because using more devices does not always mean better performance. A CUTOFF ratio is used to select devices.

Offloading Execution Time (ms) on 2 CPUs + 4 GPUs + 2 MICs, with and without CUTOFF_RATIO

[Figure: total offloading time (ms) for six benchmarks (stencil2d-256, sum-300m, axpy-10b, bm2d-256, matul-6144, matvec-48k) under the loop-distribution algorithms BLOCK, SCHED_DYNAMIC, SCHED_GUIDED, MODEL_1_AUTO, MODEL_2_AUTO, SCHED_PROFILE_AUTO, and MODEL_PROFILE_AUTO, plus variants using a 15% CUTOFF. For each benchmark the chart highlights the algorithm that delivers the best performance without CUTOFF and the best algorithm with the 15% CUTOFF_RATIO, and reports the latter's speedup against the former; the per-benchmark speedups are tabulated on the next slide.]

Speedup From CUTOFF
Applying the 15% CUTOFF ratio to modeling and profiling: only devices that are predicted to compute more than 15% of the total iterations are used. (With 8 devices, an even share would be 1/8 = 12.5%.)

Benchmark       Devices used        CUTOFF speedup
axpy-10b        2 CPUs + 4 GPUs     1.35
bm2d-256        2 CPUs + 4 GPUs     1.01
matul-6144      4 GPUs              2.68
matvec-48k      4 GPUs              0.56
stencil2d-256   4 GPUs              3.43
sum-300m        2 CPUs + 4 GPUs     2.09