THE FUTURE OF GPU DATA MANAGEMENT
Michael Wolfe, May 9, 2017


CPU CACHE
Hardware managed:
- What data to cache?
- Where to store the cached data?
- What data to evict when the cache fills up?
- When to store data back to system DRAM? (coherence)
- What address to use to access cached data?
- How to optimize programs to increase locality?

CPU CACHE
[Diagram: core, cache, and memory, annotated with the same questions: what data, where to store, what to evict, coherence, what address to use, how to optimize]

SCRATCHPAD MEMORY
Software managed:
- What data to cache?
- Where to store the cached data?
- What data to evict when the cache fills up?
- When to store data back to system DRAM? (if ever)
- What address to use to access cached data?
- How to increase use of the scratchpad?

SCRATCHPAD MEMORY
[Diagram: core, scratchpad, and memory, annotated with the same questions]
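
One place this software-managed model surfaces in OpenACC is the cache directive. Below is a minimal, hedged sketch (function name and window size are illustrative, not from the talk): the directive hints that a small slice of the input array should be staged in fast on-chip memory for the loop body.

    /* Illustrative sketch only: the OpenACC cache directive hints that
       a[i-1..i+1] should live in fast on-chip (scratchpad) memory. */
    void smooth(float *restrict b, const float *restrict a, int n)
    {
        #pragma acc parallel loop copyin(a[0:n]) copyout(b[1:n-2])
        for (int i = 1; i < n - 1; ++i) {
            #pragma acc cache(a[i-1:3])   /* what data to cache: a 3-element window */
            b[i] = (a[i-1] + a[i] + a[i+1]) / 3.0f;
        }
    }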

GPU DEVICE MEMORY
[Diagram: CPU with DDR system memory, GPU with GDDR or HBM device memory, annotated with the same questions]

GPU DEVICE MEMORY
Software managed, for correctness as well as performance:
- What data to cache?
- Where to store the cached data?
- What data to evict when the cache fills up?
- When to store data back to system DRAM?
- What address to use to access cached data?
- How to increase use of device memory?

OPENACC DEVICE MEMORY
[Diagram: CPU system memory and GPU device memory]
What: Data directives
Where: Anywhere
Evict: Error
Coherence: Update
Address: Device address, but same name
Optimize: Correctness

GPU DEVICE MEMORY: OPENACC
- What data to cache? Data directives and clauses.
- Where to store the cached data? Anywhere in device memory.
- What data to evict when the cache fills up? Program error.
- When to store data back to system DRAM? Update directives, or the end of the data region.
- What address to use to access cached data? Device address != host address.
- How to increase use of device memory? Must use device memory.
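
A minimal sketch of these answers in code (function and array names are illustrative): the data clauses say what to cache in device memory, and the update directive says when to store results back to system DRAM.

    #include <stdlib.h>

    /* Sketch: explicit OpenACC data management. copyin/create answer
       "what data to cache"; update answers "when to store back". */
    void scale(const float *x, float *y, float a, int n)
    {
        #pragma acc data copyin(x[0:n]) create(y[0:n])
        {
            #pragma acc parallel loop
            for (int i = 0; i < n; ++i)
                y[i] = a * x[i];

            #pragma acc update self(y[0:n])   /* copy results back on demand */
        }
    }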

OPENACC USING MANAGED MEMORY
-ta=tesla:managed on K80
[Diagram: CPU system memory and GPU device memory]
What: Allocated data
Where: Driver managed
Evict: Error
Coherence: Driver
Address: Host address
Optimize: Locality works
Costs: malloc cost, synchronization

CUDA UNIFIED (MANAGED) MEMORY
Driver managed, on K80:
- What data to cache? All dynamically allocated data.
- Where to store the cached data? Driver managed.
- What data to evict when the cache fills up? Runtime error.
- When to store data back to system DRAM? Driver managed.
- What address to use to access cached data? Host address.
- How to increase use of device memory? Locality works.

CUDA UNIFIED (MANAGED) MEMORY
OpenACC, -ta=tesla:managed on K80:
- What data to cache? All dynamically allocated data (malloc, new, allocate).
- Where to store the cached data? Driver managed.
- What data to evict when the cache fills up? Runtime error.
- When to store data back to system DRAM? Driver managed.
- What address to use to access cached data? Host address.
- How to increase use of device memory? Locality works.
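
Under this model no data directives are needed at all. A hedged sketch (built with, e.g., pgcc -ta=tesla:managed): the runtime turns plain malloc into a managed allocation, and the kernel uses the same host address.

    #include <stdlib.h>

    /* Sketch: with -ta=tesla:managed, malloc becomes a managed allocation
       and the driver migrates data behind the same pointer, so the
       parallel loop needs no data clauses. */
    int main(void)
    {
        int n = 1 << 20;
        float *x = malloc(n * sizeof *x);
        float *y = malloc(n * sizeof *y);
        for (int i = 0; i < n; ++i) { x[i] = 1.0f; y[i] = 2.0f; }

        #pragma acc parallel loop   /* same host address is valid on the GPU */
        for (int i = 0; i < n; ++i)
            y[i] = 2.0f * x[i] + y[i];

        free(x); free(y);
        return 0;
    }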

CUDA UNIFIED (MANAGED) MEMORY
Driver managed, on K80:
- Allocation is expensive: it allocates both device memory and host pinned memory
- Can only allocate up to the size of device memory
- All data is migrated to the device for each kernel launch (and paged back)
- The host must not access managed data while the GPU is active
- Data movement is fast (host pinned memory)
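
The same constraints seen directly through the CUDA runtime C API, as a hedged sketch:

    #include <cuda_runtime.h>

    /* Sketch of the K80-era managed model: one pointer for host and
       device, but the host must synchronize before touching data the
       GPU has been using. */
    int main(void)
    {
        size_t n = 1 << 20;
        float *x;
        /* Expensive call: backs the allocation with device + host pinned memory */
        cudaMallocManaged((void **)&x, n * sizeof *x, cudaMemAttachGlobal);
        for (size_t i = 0; i < n; ++i) x[i] = 1.0f;   /* host initializes */
        /* ... kernel launches migrate the whole allocation to the device ... */
        cudaDeviceSynchronize();   /* required before the host reads x again */
        cudaFree(x);
        return 0;
    }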

OPENACC USING MANAGED MEMORY
-ta=tesla:managed on P100
[Diagram: CPU system memory and GPU device memory]
What: Allocated data
Where: Driver managed
Evict: Driver managed
Coherence: Driver
Address: Host address
Optimize: Locality works
Costs: malloc cost

CUDA UNIFIED (MANAGED) MEMORY
Driver managed, on P100:
- What data to cache? All dynamically allocated data.
- Where to store the cached data? Driver managed.
- What data to evict when the cache fills up? Driver managed.
- When to store data back to system DRAM? Driver managed.
- What address to use to access cached data? Host address.
- How to increase use of device memory? Locality works.

CUDA UNIFIED (MANAGED) MEMORY
Driver managed, on P100:
- Allocation is expensive: it allocates host pinned memory
- Can allocate more memory than the GPU has (oversubscription)
- Data is paged to the GPU (and paged back)
- The host may access managed data while the GPU is active
- The user can set preferences and prefetching policies (sketched below)
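
A hedged sketch of those knobs through the CUDA 8 runtime C API (the advice choice and sizes are illustrative): cudaMemAdvise expresses a preference, and cudaMemPrefetchAsync moves pages ahead of time in large transfers instead of relying on demand paging.

    #include <cuda_runtime.h>

    /* Sketch: Pascal-era preferences and prefetching for managed memory. */
    int main(void)
    {
        float *x;
        size_t bytes = (size_t)(1 << 20) * sizeof *x;
        cudaMallocManaged((void **)&x, bytes, cudaMemAttachGlobal);

        int dev;
        cudaGetDevice(&dev);
        /* Preference: treat x as read-mostly so copies can live on both sides */
        cudaMemAdvise(x, bytes, cudaMemAdviseSetReadMostly, dev);
        /* Prefetch: migrate pages to the device before kernels need them */
        cudaMemPrefetchAsync(x, bytes, dev, 0);

        /* ... kernel launches; on P100 the host may also touch x meanwhile ... */
        cudaDeviceSynchronize();
        cudaFree(x);
        return 0;
    }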

SPECACCEL OPENACC
[Bar chart: performance ratio, managed memory vs. data directives, per SPEC ACCEL benchmark]
POWER8+NVLink and P100 GPU, PGI 17.1 compilers. OpenACC SPEC ACCEL 1.1 performance measured March 2017. SPEC and the benchmark name SPEC ACCEL are registered trademarks of the Standard Performance Evaluation Corporation.

356.SP PERFORMANCE ANOMALY
[Bar charts: 356.SP performance ratio, managed memory vs. data directives, first on POWER8 alone, then on POWER8 and Haswell]

356.SP PERFORMANCE W/ POOL ALLOCATOR
[Bar charts: the same 356.SP ratio on POWER8 and Haswell with a pool allocator substituted for per-call managed allocation]
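
The pool-allocator fix is worth spelling out, since allocation was flagged above as the expensive part of managed memory. A minimal, illustrative sketch (all names hypothetical): pay for one large managed allocation up front, then serve small requests by bumping an offset.

    #include <cuda_runtime.h>
    #include <stddef.h>

    /* Illustrative pool allocator over managed memory: one expensive
       cudaMallocManaged call, then cheap bump-pointer sub-allocations. */
    static char  *pool;
    static size_t pool_size, pool_used;

    int pool_init(size_t bytes)
    {
        pool_size = bytes;
        pool_used = 0;
        return cudaMallocManaged((void **)&pool, bytes, cudaMemAttachGlobal)
               == cudaSuccess;
    }

    void *pool_alloc(size_t bytes)
    {
        bytes = (bytes + 255) & ~(size_t)255;      /* keep 256-byte alignment */
        if (pool_used + bytes > pool_size) return NULL;
        void *p = pool + pool_used;
        pool_used += bytes;
        return p;
    }

    void pool_reset(void) { pool_used = 0; }       /* bulk free: no per-call cost */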

CUDA UNIFIED MEMORY ON PASCAL & BEYOND
Future HMM OS option:
- Works for static data, global data, and dynamically allocated data
- All memory is treated as managed
- Data is paged to the GPU (and paged back)
- The host may access managed data while the GPU is active
- The user can set preferences and prefetching policies

SUFFICIENT?
Why use OpenACC data directives at all?
- Portability to environments where Unified Memory isn't supported
- Useful to the runtime as hints to set preferences or to prefetch data (sketched below)
- Moving large blocks is much more efficient than demand paging
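
A hedged sketch of that hint role (assuming a PGI-style managed-memory build): the unstructured data directives below are not needed for correctness, but they can tell the runtime to stage the whole array in one large transfer rather than demand page it.

    /* Sketch: with managed memory these directives are correctness no-ops,
       but they can serve as prefetch hints for bulk block transfers. */
    void step(float *a, int n)
    {
        #pragma acc enter data copyin(a[0:n])    /* hint: stage a[] on the device */
        #pragma acc parallel loop present(a[0:n])
        for (int i = 0; i < n; ++i)
            a[i] *= 0.5f;
        #pragma acc exit data copyout(a[0:n])    /* hint: bring a[] back in bulk */
    }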

THE FUTURE
GPU device memory as a cache:
- Compile for multicore (let the hardware cache manage data)
- Compile for GPU (let the driver manage data)
- Add data directives as needed (for correctness and performance)