GPGPU Offloading with OpenMP 4.5 in the IBM XL Compiler

GPGPU Offloading with OpenMP 4.5 in the IBM XL Compiler. Taylor Lloyd (University of Alberta), Jose Nelson Amaral (University of Alberta), Ettore Tiotto (IBM Canada).

Why?

Supercomputer Power/Performance. GPUs exhibit much better power-scaling characteristics, and the upcoming CORAL Sierra supercomputer will deliver 150 petaflops. Cooling is becoming a serious issue at scale. (A Step towards Energy Efficient Computing: Redesigning a Hydrodynamic Application on CPU-GPU, Dong et al., 2014)

Agenda:
1. GPU Programming Models
2. GPU Architecture
3. Mapping OpenMP to GPUs
4. OpenMP Performance

GPU Programming Models

GPU Parallel Programming Models.
CUDA (2006): NVidia GPUs.
OpenCL (2009): NVidia GPUs, AMD (ATI) GPUs, Xeon Phi (2014), CPU parallelism.
OpenACC (2012): NVidia GPUs, AMD (ATI) GPUs, Xeon Phi (2014), CPU parallelism.
OpenMP 4.0 (2013): NVidia GPUs, AMD (ATI) GPUs, Xeon Phi (2014), CPU parallelism.

Language Philosophies: CUDA. Exploit NVidia GPUs. Maximize performance by exposing hardware details (warps, shared memory). Ease debugging and performance tuning through tooling and profiling.

Language Philosophies: OpenCL. Run anywhere: CPU, GPU, specialized accelerators. A runtime library provides access to parallel primitives, and the compiler can build kernel functions at runtime.

Language Philosophies: OpenACC. Annotate existing programs with pragmas (acc kernels, acc parallel, acc data). Default to hierarchical parallelism. The compiler knows best (loose spec).

Language Philosophies: OpenMP. Annotate existing programs with pragmas (omp target, omp parallel, omp target data). Hierarchical parallelism is available, with simple parallelism by default. The programmer knows best (tight spec).

Which one is best? "The lines of code needed to parallelize for CUDA C was about 50 percent higher than for OpenACC, and OpenCL was about three times higher." NextPlatform, "Is OpenACC the Best Thing to Happen to OpenMP?" (2015). "OpenMP is richer and has more features than OpenACC, we make no apologies about that. OpenACC is targeting scalable parallelism, OpenMP is targeting more general parallelism." Michael Wolfe, OpenACC Technical Chair (2015).

Example: Vector Addition. A highly parallel problem; we want to parallelize it on the GPU as much as possible.

int* vecadd(int* a, int* b, size_t len) {
  size_t len_bytes = len * sizeof(int);
  int* c = malloc(len_bytes);
  for (int i = 0; i < len; i++) {
    c[i] = a[i] + b[i];
  }
  return c;
}

Example: Vector Addition (CUDA). Parallelism is explicit, the index is derived from thread indices, and memory must be copied back and forth between host and device.

__global__ void vecaddkernel(int* a, int* b, int* c, size_t len);

int* vecadd(int* a, int* b, size_t len) {
  int *a_d, *b_d, *c_d;
  size_t len_bytes = len * sizeof(int);
  cudaMalloc(&a_d, len_bytes);
  cudaMalloc(&b_d, len_bytes);
  cudaMalloc(&c_d, len_bytes);
  cudaMemcpy(a_d, a, len_bytes, cudaMemcpyHostToDevice);
  cudaMemcpy(b_d, b, len_bytes, cudaMemcpyHostToDevice);
  vecaddkernel<<<256,256>>>(a_d, b_d, c_d, len);
  int* c = (int*)malloc(len_bytes);
  cudaMemcpy(c, c_d, len_bytes, cudaMemcpyDeviceToHost);
  return c;
}

__global__ void vecaddkernel(int* a, int* b, int* c, size_t len) {
  int index = blockIdx.x * blockDim.x + threadIdx.x;
  int grid = gridDim.x * blockDim.x;
  while (index < len) {
    c[index] = a[index] + b[index];
    index += grid;
  }
}

Example: Vector Addition (OpenACC). Memory must still be copied back and forth, but the compiler determines the parallelism.

int* vecadd(int* a, int* b, size_t len) {
  size_t len_bytes = len * sizeof(int);
  int* c = malloc(len_bytes);
  #pragma acc data copyin(a[0:len], b[0:len]) copyout(c[0:len])
  #pragma acc kernels
  for (int i = 0; i < len; i++) {
    c[i] = a[i] + b[i];
  }
  return c;
}

Example: Vector Addition (OpenMP 4.5). Memory must still be copied back and forth, and the programmer specifies the parallelism.

int* vecadd(int* a, int* b, size_t len) {
  size_t len_bytes = len * sizeof(int);
  int* c = malloc(len_bytes);
  #pragma omp target data map(to: a[0:len], b[0:len]) map(from: c[0:len])
  #pragma omp target teams
  #pragma omp distribute parallel for simd
  for (int i = 0; i < len; i++) {
    c[i] = a[i] + b[i];
  }
  return c;
}

GPU Architecture

[Diagram: GPU hardware. Multiple SMs (SM 0 through SM N), each with 64 cores, a register file, shared memory (SMEM), and an L1 cache; all SMs share an L2 cache and global memory. Within an SM, each core has its own registers and executor behind a shared decoder/issuer. Source: GP100 Pascal Whitepaper, NVidia, 2016.]

The Grid. [Diagram: a grid is made up of blocks (Block 0, Block 1, Block 2, ...); each block contains warps (Warp 0, Warp 1, ...) of 32 threads each.]
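
To make the hierarchy concrete, here is a small illustrative CUDA kernel (not from the talk; the kernel and array names are hypothetical) in which each thread works out which block, warp, and lane it belongs to:

// Illustrative kernel: each thread records its position in the
// grid/block/warp hierarchy (warp size is 32 on NVidia GPUs).
__global__ void where_am_i(int *block_of, int *warp_of, int *lane_of) {
  int global = blockIdx.x * blockDim.x + threadIdx.x;  // unique id in the grid
  block_of[global] = blockIdx.x;        // which block of the grid
  warp_of[global]  = threadIdx.x / 32;  // which warp within the block
  lane_of[global]  = threadIdx.x % 32;  // which lane within the warp
}
// Example launch: where_am_i<<<3, 64>>>(...) creates 3 blocks of 2 warps each.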

[Diagram: mapping the grid onto hardware. Blocks and their 32-thread warps are scheduled onto SMs; each SM provides its cores, registers, L1 cache, and shared memory (SMEM) to the blocks resident on it, with one shared decoder/issuer driving the per-core executors.]

Data Transfer. CPU-GPU data transfer can occur in parallel with GPU computation (the OpenMP nowait clause on target constructs). Pascal brings NVLink, with bonded unidirectional 20 GB/s links. GPUs can exchange data with each other independently over NVLink, and supported CPUs (Power8, Power9) can also communicate over NVLink. (GP100 Pascal Whitepaper, NVidia, 2016)
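
As a hedged sketch of what the nowait clause enables (illustrative only; the function, chunking, and dependence objects below are hypothetical, not from the talk), transfers and kernels become asynchronous tasks ordered by depend clauses, so the transfer of one chunk can overlap the computation of another:

// Sketch: overlap host-device transfers with computation using
// OpenMP 4.5 asynchronous target tasks (nowait) ordered by depend.
void scale_in_two_chunks(float *a, long n) {
  long half = n / 2;

  #pragma omp target enter data map(to: a[0:half]) nowait depend(out: a[0])
  #pragma omp target enter data map(to: a[half:half]) nowait depend(out: a[half])

  #pragma omp target teams distribute parallel for nowait depend(inout: a[0])
  for (long i = 0; i < half; i++)
    a[i] *= 2.0f;

  #pragma omp target teams distribute parallel for nowait depend(inout: a[half])
  for (long i = half; i < n; i++)
    a[i] *= 2.0f;

  #pragma omp target exit data map(from: a[0:half]) nowait depend(in: a[0])
  #pragma omp target exit data map(from: a[half:half]) nowait depend(in: a[half])

  #pragma omp taskwait  // wait for all outstanding target tasks
}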

GPU Memory Accesses (numbers for NVidia Tesla K20; Wong et al., Demystifying GPU Microarchitecture through Microbenchmarking, ISPASS 2010):

Access Type              Visibility          Cache Hit    Cache Miss
Local Memory Access      Thread-Local        30 cycles    300 cycles
Shared Memory Access     Block-Local         33 cycles    N/A
Global Memory Access     Globally Visible    175 cycles   300 cycles
Constant Memory Access   Globally Visible    42 cycles    215 cycles

GPU Global Memory Coalescing. Warp execution means 32 memory accesses issue in the same cycle, but the global memory system can handle only 2 accesses per cycle, so warp memory accesses are coalesced whenever possible. Memory coalescing is critical to performance. [Diagram: threads 0-7 each issue an 8-byte LOAD to consecutive addresses 0x8000 0000 through 0x8000 0038; the request coalescer merges them into a single 64-byte request in the request queue.]
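
The contrast below is an illustrative sketch (the kernels are hypothetical, not from the talk): in the first kernel a warp's loads hit consecutive words and coalesce into a few transactions, while in the second a stride spreads them across many cache lines and defeats the coalescer.

// Coalesced: consecutive threads read consecutive words.
__global__ void copy_coalesced(const float *in, float *out, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n)
    out[i] = in[i];
}

// Strided: consecutive threads read words 'stride' apart.
__global__ void copy_strided(const float *in, float *out, int n, int stride) {
  int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
  if (i < n)
    out[i] = in[i];
}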

GPU Occupancy. GPUs use context switches to hide instruction and memory latency, but only so many resources are available on each SM. Occupancy is limited by registers, shared memory, thread count, and block count. Shared memory does not suffer from coalescing constraints, but is very limited. The compiler must balance these resources to maintain performance. (Achieved Occupancy, NVidia, docs.nvidia.com/gameworks/content/developertools/desktop/analysis/report/cudaexperiments/kernellevel/achievedoccupancy.htm)
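
A minimal sketch of inspecting this balance with the CUDA occupancy API (the kernel and block size are placeholders):

#include <cstdio>

__global__ void some_kernel(float *x) {
  x[blockIdx.x * blockDim.x + threadIdx.x] *= 2.0f;
}

int main() {
  int block_size = 256;
  int blocks_per_sm = 0;
  // How many blocks of some_kernel can be resident on one SM, given its
  // register and shared-memory usage at this block size?
  cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocks_per_sm, some_kernel,
                                                block_size, 0 /* dynamic smem */);
  printf("Resident blocks per SM at %d threads/block: %d\n",
         block_size, blocks_per_sm);
  return 0;
}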

Mapping OpenMP to GPUs

XL Compiler GPU Architecture. [Diagram: the XL frontend emits annotated WCode to the Toronto Portable Optimizer (TPO), which produces outlined WCode for the WCode partitioner. Host WCode flows through the XL backend into host objects; device WCode flows through WC2LLVM to device LLVM IR, through LibNVVM to device PTX, and into a CUDA fatbinary of device objects. The linker combines host and device objects into the final executable. The legend distinguishes pre-existing XL components, XL GPU components, and NVidia CUDA components, and marks CPU code, GPU code, and combined CPU/GPU code.]

Code Generation. OpenMP target regions map to GPU kernel functions, teams map to thread-blocks, and OpenMP threads map to GPU threads.
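
A conceptual hand-written correspondence (not the code the XL compiler actually emits): a target teams region with explicit team and thread counts becomes a single kernel launch in which each team is a thread-block and each OpenMP thread is a CUDA thread.

// OpenMP: #pragma omp target teams num_teams(64) thread_limit(256)
// CUDA:   one kernel, launched as a grid of 64 thread-blocks of 256 threads.
__global__ void target_region(int *team_of, int n) {
  int tid = blockIdx.x * blockDim.x + threadIdx.x;  // global OpenMP thread id
  if (tid < n)
    team_of[tid] = blockIdx.x;                      // team number == block index
}

void launch_target_region(int *team_of_dev, int n) {
  target_region<<<64, 256>>>(team_of_dev, n);
  cudaDeviceSynchronize();  // a target construct without nowait is synchronous
}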

Code Generation: A Motivating Example.

void f() {
  #pragma omp target teams
  #pragma omp distribute parallel for
  for (int i = 0; i < 1024; i++)
    g(i);
}

void g(int i) {
  #pragma omp parallel for
  for (int j = 0; j < 1024; j++)
    dowork(i, j);
}

Code Generation: Dynamic Parallelism. NVidia GPUs support parent/child kernel relationships, so one option is to extract each nested parallel region into a subkernel. However, a "CDP implementation can achieve 1.13x-2.73x potential speedup but the huge kernel launching overhead could negate the performance benefit."
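
Applied to the motivating example above, the dynamic-parallelism option would look roughly like the sketch below (hypothetical, requires compute capability 3.5+ and nvcc -rdc=true; dowork is assumed to be a device function):

__device__ void dowork(int i, int j);  // assumed to exist elsewhere

// Child kernel: the inner "parallel for" over j.
__global__ void inner_region(int i) {
  int j = blockIdx.x * blockDim.x + threadIdx.x;
  if (j < 1024)
    dowork(i, j);
}

// Parent kernel: the outer loop over i; each iteration launches a
// child kernel from the device (CUDA dynamic parallelism).
__global__ void outer_region() {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < 1024)
    inner_region<<<4, 256>>>(i);
}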

Code Generation: State Machine. Start all threads immediately; master threads manipulate state in shared memory and release worker threads when a parallel region is reached, synchronizing before each state change. This was implemented in the original XL OpenMP 4 beta. The paper characterizes occupancy as the limiting factor: "There is a large difference in the amount of registers per thread required to execute the OpenMP versions and the CUDA C/C++ one."

Code Generation: Cooperative Multithreading. The kernel is launched with all threads. Worker threads are released for outer parallel regions; at inner parallel regions, all threads in a warp perform each thread's work in sequence. [Diagram: timeline of the master, worker 0, and worker 1 through a teams region, an outer parallel region, and an inner parallel region; the master runs the sequential work while workers wait at synchronization points, state is broadcast to the workers for the outer region, and each worker's inner-region work is then executed in turn.]

Optimizations: Shared Memory Promotion. Local accesses are sorted by usage, the GPU model determines the available shared memory through a cost model, and locals are promoted until the available shared memory is full. Future work: better ordering criteria for local accesses. [Table: locals ranked by usage, e.g. idx0: 14, tmp62: 8, arrbase2: 3, ...]
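
As a rough hand-written illustration of the idea (not the XL transformation itself), a heavily used per-thread array that would otherwise spill to slow local memory can instead be given a per-thread slice of shared memory:

#define THREADS_PER_BLOCK 128  // kernels below assume this block size
#define BINS 16

// Before promotion: the per-thread 'counts' array may spill to local memory.
__global__ void histogram_local(const unsigned char *data, int n, int *out) {
  int counts[BINS];
  for (int b = 0; b < BINS; b++) counts[b] = 0;
  for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
       i += gridDim.x * blockDim.x)
    counts[data[i] % BINS]++;
  for (int b = 0; b < BINS; b++) atomicAdd(&out[b], counts[b]);
}

// After promotion: each thread's counters live in a slice of shared memory.
__global__ void histogram_promoted(const unsigned char *data, int n, int *out) {
  __shared__ int counts[THREADS_PER_BLOCK][BINS];
  for (int b = 0; b < BINS; b++) counts[threadIdx.x][b] = 0;
  for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
       i += gridDim.x * blockDim.x)
    counts[threadIdx.x][data[i] % BINS]++;
  for (int b = 0; b < BINS; b++) atomicAdd(&out[b], counts[threadIdx.x][b]);
}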

Optimizations: Multi-Hierarchy Parallelism. Runtime initialization is required to emulate the OpenMP fork-join parallelism model, but when certain patterns are detected, more performant CUDA equivalents can be used. For example, omp distribute parallel for maps to a CUDA multiblock loop, as sketched below.
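
A hand-written sketch of that example pattern (illustrative, not the generated code): distribute gives each block a contiguous chunk of the iteration space, and parallel for splits that chunk across the block's threads.

// Roughly equivalent to:
//   #pragma omp distribute parallel for
//   for (int i = 0; i < n; i++) x[i] += 1.0f;
__global__ void multiblock_loop(float *x, int n) {
  int chunk = (n + gridDim.x - 1) / gridDim.x;    // iterations per team/block
  int begin = blockIdx.x * chunk;
  int end = (begin + chunk < n) ? (begin + chunk) : n;
  for (int i = begin + threadIdx.x; i < end; i += blockDim.x)
    x[i] += 1.0f;                                 // loop body
}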

Performance: Conway's Game of Life. An iterative 2D 9-point stencil, tested with CUDA, OpenMP, and CPU implementations. Runtime overhead accounts for most of the optimized OpenMP overhead, and 64-bit pointer arithmetic accounts for the remainder. [Chart: reported execution times of 2.4 s and 3.0 s.]

Performance: Conway's Game of Life (global memory throughput). Optimized OpenMP and CUDA have the same ratio of loads to stores. Naive OpenMP uses global memory for runtime state. The lower throughput of optimized OpenMP is an artifact of unnecessary 64-bit operations.

Taylor Lloyd, MSc at the University of Alberta. Want to discuss GPU computing further? Come talk to me in the Expo, poster A19.