Unified Memory: Notes on GPU Data Transfers
Andreas Herten, Forschungszentrum Jülich, 24 April 2017
Handout Version

Overview, Outline

Unified Memory enables easy access to GPU development, but some tuning might be needed for best performance.

Contents:
- Background on Unified Memory: history of Unified Memory; Unified Memory on Pascal; Unified Memory on Kepler
- Practical Differences: revisiting the scale_vector_um example; hints for performance; task

Background on Unified Memory: History

CPU and GPU: Location, location, location. At the Beginning

CPU and GPU memory are very distinct, each with its own address space.

[Diagram: CPU with L2 cache and CPU memory (DRAM); GPU with scheduler and device memory; the two connected by an interconnect]

CPU and GPU: Location, location, location. The Path to Unified Memory

- CUDA 4.0, Unified Virtual Addressing: pointers come from the same address pool, but data copies remain manual
- CUDA 6.0, Unified Memory*: data copied by the driver, but the whole data set at once
- CUDA 8.0, Unified Memory (truly): data copied by the driver; page faults initiate data migrations on demand (Pascal)

[Diagram: CPU and GPU sharing a Unified Memory address space spanning CPU memory (DRAM) and GPU memory, connected by an interconnect]
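As an aside (not from the slides), the CUDA 4.0 stage can be made concrete with a small sketch: under Unified Virtual Addressing, host and device pointers share one address space, so cudaMemcpy can infer the transfer direction via cudaMemcpyDefault, but the copy itself is still issued by hand:

    #include <cuda_runtime.h>

    int main() {
        const size_t bytes = (1 << 20) * sizeof(float);
        float *h, *d;
        cudaMallocHost(&h, bytes);  // pinned host memory
        cudaMalloc(&d, bytes);      // device memory
        // One address space since CUDA 4.0: the direction is inferred
        // from the pointers, but the transfer remains manual.
        cudaMemcpy(d, h, bytes, cudaMemcpyDefault);
        cudaFree(d);
        cudaFreeHost(h);
    }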

Unified Memory in Code

Explicit memory management:

    void sortfile(FILE *fp, int N) {
        char *data;
        char *data_d;
        data = (char *)malloc(N);
        cudaMalloc(&data_d, N);

        fread(data, 1, N, fp);
        cudaMemcpy(data_d, data, N, cudaMemcpyHostToDevice);

        kernel<<<...>>>(data_d, N);

        cudaMemcpy(data, data_d, N, cudaMemcpyDeviceToHost);
        host_func(data);

        cudaFree(data_d);
        free(data);
    }

Unified Memory:

    void sortfile(FILE *fp, int N) {
        char *data;
        cudaMallocManaged(&data, N);

        fread(data, 1, N, fp);

        kernel<<<...>>>(data, N);
        cudaDeviceSynchronize();

        host_func(data);
        cudaFree(data);
    }
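For reference, a complete, runnable variant of the Unified Memory pattern; the scale kernel, sizes, and launch configuration are illustrative stand-ins, not the course's scale_vector_um source:

    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void scale(float a, float *x, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) x[i] *= a;  // each thread scales one element
    }

    int main() {
        const int n = 1 << 20;
        float *x;
        cudaMallocManaged(&x, n * sizeof(float));  // one pointer, valid on CPU and GPU

        for (int i = 0; i < n; ++i) x[i] = 1.0f;   // first touch on the CPU

        scale<<<(n + 255) / 256, 256>>>(2.0f, x, n);
        cudaDeviceSynchronize();                   // wait before the CPU reads results

        printf("x[0] = %f\n", x[0]);               // no explicit cudaMemcpy anywhere
        cudaFree(x);
    }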

Implementation Details (on Pascal): Under the Hood

    cudaMallocManaged(&ptr, ...);  // empty! No pages anywhere yet (like malloc())
    *ptr = 1;                      // CPU page fault: data allocates on the CPU
    kernel<<<...>>>(ptr);          // GPU page fault: data migrates to the GPU

- Pages populate on first touch
- Pages migrate on demand
- GPU memory over-subscription is possible
- Concurrent access from CPU and GPU to memory (at page granularity)
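The over-subscription point can be made concrete with a hedged sketch (not from the slides; sizes illustrative): it allocates 1.5× the physical GPU memory, which on Pascal simply means pages are faulted in and evicted on demand, while on pre-Pascal GPUs the allocation itself would fail.

    #include <cuda_runtime.h>

    // Touch every byte with a grid-stride loop; each access to a
    // non-resident page raises a GPU page fault and migrates that page.
    __global__ void touch(char *p, size_t n) {
        for (size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
             i < n; i += (size_t)gridDim.x * blockDim.x)
            p[i] = 1;
    }

    int main() {
        size_t freeB, totalB;
        cudaMemGetInfo(&freeB, &totalB);
        size_t n = totalB + totalB / 2;  // 1.5x the physical GPU memory
        char *p;
        if (cudaMallocManaged(&p, n) != cudaSuccess) return 1;
        touch<<<4096, 256>>>(p, n);      // pages migrate and evict as faults occur
        cudaDeviceSynchronize();
        cudaFree(p);
    }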

On-Demand Migration Flow on Pascal

[Animated diagram, collapsed here into one frame: CPU memory and GPU memory (0.7 TB/s) are connected by an interconnect. Accesses to pages that are not resident raise page faults on either side, and each fault triggers an on-demand migration of the faulting page across the interconnect; pages can also be mapped to system memory instead of being migrated.]

Only the needed page is copied (~4 kB)!

Migration on Kepler

[Animated diagram, collapsed here into one frame: CPU memory and GPU memory (0.3 TB/s) are connected by PCI-Express. CPU-side page faults migrate pages to the host; GPU page faults are not supported, so on kernel launch all managed data migrates to the GPU in bulk.]

Implementation before Pascal: Kepler (JURECA), Maxwell

- Pages populate on the GPU with cudaMallocManaged()
- Might migrate to the CPU if touched there first
- Pages migrate in bulk to the GPU on kernel launch
- No over-subscription possible
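One practical consequence, well known but not spelled out on the slide: before Pascal, the CPU must not touch managed memory while a kernel may still be running; a cudaDeviceSynchronize() is required first. A minimal sketch (the kernel is illustrative):

    #include <cuda_runtime.h>

    __global__ void increment(int *x) { *x += 1; }

    int main() {
        int *x;
        cudaMallocManaged(&x, sizeof(int));
        *x = 41;                  // CPU touch is fine: no kernel in flight

        increment<<<1, 1>>>(x);   // all managed pages migrate to the GPU in bulk
        // Reading *x here would crash on Kepler/Maxwell: no concurrent access
        cudaDeviceSynchronize();  // now the CPU may access x again
        int result = *x;          // result == 42

        cudaFree(x);
        return result == 42 ? 0 : 1;
    }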

Practical Differences: Revisiting the scale_vector_um Example

Comparing UM on Pascal & Kepler: Different Scales

Who profiled scale_vector_um on JURON, who on JURECA? What are the run times of the kernel?

JURON:
    ==109924== Profiling result:
    Time(%)      Time  Calls       Avg       Min       Max  Name
    100.00%  519.36us      1  519.36us  519.36us  519.36us  scale(float, float*, flo

JURECA:
    ==12922== Profiling result:
    Time(%)      Time  Calls       Avg       Min       Max  Name
    100.00%  13.216us      1  13.216us  13.216us  13.216us  scale(float, float*, flo

Why?! Shouldn't the P100 be about 3× faster than the K80?

Comparing UM on Pascal & Kepler: What Happens?

JURON: The kernel is launched, the data is needed by the kernel, so the data migrates host → device during kernel execution. The kernel's run time incorporates the time for the data transfers.

JURECA: The data will be needed by the kernel, so the data migrates host → device before the kernel launch. The kernel's run time includes no transfers.

The implementation on Pascal is the more convenient one. The total run time of the whole program does not change in principle (except that it gets shorter because of the faster architecture), but data transfer time is sometimes attributed to the kernel rather than shown separately. What can we do about this?

Performance Hints for UM: General Hints

- Keep data local: prevent migrations altogether if the data is processed by a nearby processor
- Minimize thrashing: constant back-and-forth migrations hurt performance
- Minimize page fault overhead: fault handling costs O(10 µs) and stalls execution

Performance Hints for UM: New API Routines

New API calls augment the runtime's knowledge of data location:

cudaMemPrefetchAsync(data, length, device, stream)
    Prefetches data to device (on stream) asynchronously.

cudaMemAdvise(data, length, advice, device)
    Advises the runtime about the usage of the given data. Possible advice:
    - cudaMemAdviseSetReadMostly: data is mostly read and only occasionally written to
    - cudaMemAdviseSetPreferredLocation: sets the preferred location to avoid migrations; the first access will establish the mapping
    - cudaMemAdviseSetAccessedBy: data is accessed by this device; pre-maps the data to avoid page faults

Use cudaCpuDeviceId as device to target the CPU, or use cudaGetDevice() as usual to retrieve the current GPU device id (default: 0).
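A possible sketch of the two advice flags not shown on the next slide, keeping data resident in CPU memory while letting the GPU access it without faulting (the allocation size is illustrative):

    #include <cuda_runtime.h>

    int main() {
        const size_t bytes = (1 << 20) * sizeof(float);
        float *data;
        int device;
        cudaGetDevice(&device);  // current GPU device id (default: 0)
        cudaMallocManaged(&data, bytes);

        // Keep the pages resident in CPU memory; GPU accesses will not
        // migrate them away from the host.
        cudaMemAdvise(data, bytes, cudaMemAdviseSetPreferredLocation,
                      cudaCpuDeviceId);

        // Pre-map the range for the GPU so its accesses do not page-fault;
        // it reads and writes over the interconnect instead of migrating.
        cudaMemAdvise(data, bytes, cudaMemAdviseSetAccessedBy, device);

        // ... launch kernels using data ...
        cudaFree(data);
    }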

Hints in Code

    void sortfile(FILE *fp, int N) {
        char *data;
        // ...
        cudaMallocManaged(&data, N);

        fread(data, 1, N, fp);

        // A read-only copy of data is created on the GPU during prefetch;
        // CPU and GPU reads will not fault
        cudaMemAdvise(data, N, cudaMemAdviseSetReadMostly, device);
        // Prefetch data to avoid expensive GPU page faults
        cudaMemPrefetchAsync(data, N, device);

        kernel<<<...>>>(data, N);
        cudaDeviceSynchronize();

        host_func(data);
        cudaFree(data);
    }

Tuning scale_vector_um: Express Data Movement (TASK)

Location of code: Unified_Memory/exercises/tasks/scale/
Look at Instructions.rst for instructions.

1. Show the runtime that data should be migrated to the GPU before the kernel call
2. Build with make (CUDA needs to be loaded!)
3. Run with make run, or: bsub -I -R "rusage[ngpus_shared=1]" ./scale_vector_um
4. Generate a profile to study your progress, see make profile

See also the CUDA C Programming Guide for details on data usage.
Finished early? There's one more task in the appendix!

Conclusions: What We've Learned

- Unified Memory is implemented differently on Pascal (JURON) and on Kepler (JURECA)
- With CUDA 8.0, there are new API calls to express data locality

Thank you for your attention!
a.herten@fz-juelich.de

Appendix: Jacobi Task, Glossary

Jacobi Task: One More Time (TASK)

Location of code: Unified_Memory/exercises/tasks/jacobi/
See Jiri Kraus' slides on Unified Memory from last year at Unified_Memory/exercises/slides/jkraus-unified_memory-2016.pdf.

Short instructions:
- Avoid data migrations in the while loop of the Jacobi solver: apply the boundary conditions with the provided GPU kernel; try to avoid the remaining migrations
- Build with make (CUDA needs to be loaded!)
- Run with make run
- Look at the profile, see make profile
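The heart of the task, sketched with hypothetical names (the exercise provides its own kernels and arrays): applying the boundary conditions on the GPU keeps the solution array resident in device memory across iterations, instead of a CPU-side boundary update pulling pages back to the host every sweep.

    #include <algorithm>
    #include <cuda_runtime.h>

    // Hypothetical boundary kernel; zeroes the top and bottom rows on the GPU.
    __global__ void apply_boundary(float *a, int nx, int ny) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < nx) {
            a[i] = 0.0f;                  // top row
            a[(ny - 1) * nx + i] = 0.0f;  // bottom row
        }
    }

    int main() {
        const int nx = 512, ny = 512;
        float *a, *a_new;
        cudaMallocManaged(&a, nx * ny * sizeof(float));
        cudaMallocManaged(&a_new, nx * ny * sizeof(float));
        for (int iter = 0; iter < 100; ++iter) {
            // jacobi_step<<<...>>>(a_new, a, nx, ny);  // interior update (provided)
            apply_boundary<<<(nx + 255) / 256, 256>>>(a_new, nx, ny);
            std::swap(a, a_new);  // pointer swap on the host: no page migration
        }
        cudaDeviceSynchronize();  // only now would the CPU inspect the result
        cudaFree(a);
        cudaFree(a_new);
    }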

Glossary I

API: A programmatic interface to software via well-defined functions. Short for application programming interface.
ATI: Canada-based GPU manufacturer; bought by AMD in 2006.
CUDA: Computing platform for GPUs from NVIDIA. Provides, among others, CUDA C/C++.
GCC: The GNU Compiler Collection, a collection of open-source compilers, among others for C and Fortran.

Glossary II

LLVM: An open-source compiler infrastructure providing, among others, Clang for C.
NVIDIA: US technology company creating GPUs.
NVLink: NVIDIA's communication protocol connecting CPU↔GPU and GPU↔GPU with 80 GB/s (PCI-Express: 16 GB/s).
OpenACC: Directive-based programming, primarily for many-core machines.
OpenCL: The Open Computing Language. Framework for writing code for heterogeneous architectures (CPU, GPU, DSP, FPGA). The alternative to CUDA.

Glossary III

OpenGL: The Open Graphics Library, an API for rendering graphics across different hardware architectures.
OpenMP: Directive-based programming, primarily for multi-threaded machines.
P100: A large GPU with the Pascal architecture from NVIDIA. It employs NVLink as its interconnect and has fast HBM2 memory.
SAXPY: Single-precision A·X + Y. A simple code example scaling a vector and adding an offset.
Tesla: NVIDIA's GPU product line for general-purpose computing.

Glossary IV

Thrust: A parallel algorithms library for (among others) GPUs. See https://thrust.github.io/.