CSE 599 I Accelerated Computing - Programming GPUs. Advanced Host / Device Interface

Objective

- Take a slightly lower-level view of the CPU / GPU interface
- Learn about different CPU / GPU communication techniques

A Prototypical High-Level PC Architecture

[Block diagram: the CPU and its memory connect over PCI-e to the GPU and its memory; the CPU also connects to I/O devices, peripherals, etc.]

Peripheral Component Interconnect Express (PCI-e)

A high-throughput serial expansion bus. Transfers data over 1, 2, 4, 8, 16, or 32 lanes (GPUs generally use PCI-e x16).

PCI-e Version   Year              x16 Throughput
1.0             2003              4 GB/s
2.0             2007              8 GB/s
3.0             2010              15.8 GB/s
4.0             2017 (expected)   31.5 GB/s

GTX 1080 Ti

[Photo of the card's PCI-e 3.0 x16 edge connector: cudaMemcpy traffic comes through here!]

Review: Physical Addressing

Physical addressing assigns consecutive numbers (i.e. addresses) to consecutive memory locations, from 0 up to the size of the memory. One-byte addressing is typical. [Diagram: a 64 KB physical memory whose bytes are assigned addresses 0-65535.]

Review: Virtual Addressing

A virtual address points into a virtual address space that does not have a fixed 1-to-1 mapping to physical memory. Virtual address spaces can be much larger than the available physical memory. The virtual address space is divided into pages of at least 4096 bytes.

Review: Virtual Addressing

[Diagram: a virtual address is split into a page number and an offset; a page table maps virtual pages to frames of a 64 KB physical memory. Some virtual pages are not backed by a real memory location.]

Why Virtual Addressing?

- Can support larger memories than are physically available
- Page-level permission information makes sandboxing of separate processes easier
- Multiple physical memories, files, and I/O devices can be mapped into one virtual address space

Review: Demand Paging

[Diagram: accessing a virtual page that is not resident in physical memory triggers a page fault; the page is read in from disk, and another page may be written out to disk to make room.]

CUDA Virtual Addressing

CUDA supports virtual addressing, so CUDA processes cannot accidentally read each other's data. CUDA does not support demand paging: every byte of virtual memory must be backed by a byte of physical memory. Traditionally, CUDA and CPU address spaces were completely separate.

Direct Memory Access (DMA)

[Block diagram: the same PC architecture as before, with a DMA engine that moves data between CPU memory and GPU memory over PCI-e without occupying the CPU.]

DMA Danger

[Diagram: while a DMA transfer from a 64 KB physical memory to a section of device memory is in flight for cudaMemcpy(..., N, cudaMemcpyHostToDevice), the OS pages the source data out to disk and pages other data in; the DMA engine, working with physical addresses, copies the wrong contents. Oops!]

Pinned Memory

Pinned memory (a.k.a. page-locked memory) is a part of the virtual address space that is pinned to a part of the physical address space. Pinned memory cannot be paged out. The DMA engine uses pinned memory for data transfers.

cudaMemcpy Uses Pinned Memory

[Diagram: cudaMemcpy(..., N, cudaMemcpyHostToDevice) first copies the source data into a pinned staging buffer in physical memory; the DMA engine then transfers the staging buffer to the section of device memory, unaffected by paging to or from disk.]

Cutting out the Middleman

You can allocate a host buffer in pinned memory directly:

cudaHostAlloc(&hBuffer, size, cudaHostAllocDefault);
cudaMemcpy(dBuffer, hBuffer, size, cudaMemcpyHostToDevice);
cudaFreeHost(hBuffer);

This eliminates the extra copy to pinned memory. Keep in mind that pinned memory is a limited resource.
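
As a concrete illustration, here is a minimal sketch (not from the lecture; buffer size and names are illustrative, error checking omitted) that times the same host-to-device copy from a pageable buffer and from a pinned buffer allocated with cudaHostAlloc:

#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

int main(void) {
    const size_t N = 1 << 24;
    const size_t bytes = N * sizeof(float);

    float *hPageable = (float *) malloc(bytes);                      // ordinary pageable allocation
    float *hPinned;
    cudaHostAlloc((void **) &hPinned, bytes, cudaHostAllocDefault);  // page-locked allocation

    float *dBuffer;
    cudaMalloc((void **) &dBuffer, bytes);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    float msPageable = 0.0f, msPinned = 0.0f;

    cudaEventRecord(start);
    cudaMemcpy(dBuffer, hPageable, bytes, cudaMemcpyHostToDevice);   // runtime stages this through a pinned buffer
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&msPageable, start, stop);

    cudaEventRecord(start);
    cudaMemcpy(dBuffer, hPinned, bytes, cudaMemcpyHostToDevice);     // DMA copies directly from the pinned buffer
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&msPinned, start, stop);

    printf("pageable: %.2f ms, pinned: %.2f ms\n", msPageable, msPinned);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(dBuffer);
    cudaFreeHost(hPinned);
    free(hPageable);
    return 0;
}

On most discrete GPUs the pinned copy should be noticeably faster, since it skips the staging copy.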

Zero-Copy Memory

- Introduced as part of CUDA 2.2 in 2009
- Allows mapping pinned host memory into the CUDA address space
- Values are copied from the pinned memory to the device as they are requested
- There is no caching
- Writing to zero-copy memory simultaneously from both host and device leads to undefined behavior

Zero-Copy Memory Setup

cudaSetDeviceFlags(cudaDeviceMapHost);
cudaSetDevice(deviceNumber);

cudaDeviceProp deviceProperties;
cudaGetDeviceProperties(&deviceProperties, deviceNumber);
if (!deviceProperties.canMapHostMemory) {
    printf("Device %d does not support mapped memory\n", deviceNumber);
    exit(EXIT_FAILURE);
}

Vector Addition with Zero-Copy Memory

cudaHostAlloc(&h_a, N * sizeof(float), cudaHostAllocMapped);
cudaHostAlloc(&h_b, N * sizeof(float), cudaHostAllocMapped);

// copy input data to h_a and h_b

cudaHostGetDevicePointer(&d_a, h_a, 0);
cudaHostGetDevicePointer(&d_b, h_b, 0);
cudaMalloc(&d_c, N * sizeof(float));

// set up grid

vectorAdditionKernel<<<grid, block>>>(d_a, d_b, d_c, N);

// use results in d_c

cudaFreeHost(h_a);
cudaFreeHost(h_b);
cudaFree(d_c);
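
The slides reference vectorAdditionKernel without defining it. The following is a plausible sketch, assuming the signature (const float *, const float *, float *, int) implied by the launches in these examples:

// Sketch of the vector addition kernel used on this and later slides.
__global__ void vectorAdditionKernel(const float *A, const float *B, float *C, int N) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // one thread per element
    if (i < N) {
        C[i] = A[i] + B[i];
    }
}

// One possible grid setup for the launches shown above:
// dim3 block(256);
// dim3 grid((N + block.x - 1) / block.x);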

An Alternative: Register Pre-Allocated Memory

One can also page-lock an existing buffer:

float * h_a;  // assume h_a points to an existing allocation of N floats
cudaHostRegister(h_a, N * sizeof(float), cudaHostRegisterDefault);
cudaHostUnregister(h_a);

The flag could also be cudaHostRegisterMapped.
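
A minimal end-to-end sketch of this pattern (illustrative size and names, d_a is a hypothetical device buffer, no error checking):

#include <cuda_runtime.h>
#include <stdlib.h>

int main(void) {
    const size_t N = 1 << 20;
    float *h_a = (float *) malloc(N * sizeof(float));   // ordinary pageable allocation
    float *d_a;
    cudaMalloc((void **) &d_a, N * sizeof(float));

    cudaHostRegister(h_a, N * sizeof(float), cudaHostRegisterDefault);  // pin the existing pages
    cudaMemcpy(d_a, h_a, N * sizeof(float), cudaMemcpyHostToDevice);    // DMA copies directly from h_a
    cudaHostUnregister(h_a);                                            // unpin once transfers are done

    cudaFree(d_a);
    free(h_a);
    return 0;
}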

Zero-Copy Memory Tradeoffs

- Simplifies code by removing explicit data transfer
- Allows dynamic overlap of transfer and computation
- Can be extremely efficient for integrated graphics
- Can be used to process more data than fits in device memory
- No caching
- Memory writes must be synchronized with the host
- Overuse of pinned memory can slow down the host

When to Use Zero-Copy Memory on Discrete GPUs

- When the data will be read or written only once
- When the reads will all be coalesced

Unified Virtual Addressing (UVA)

- Introduced as part of CUDA 4 in 2011
- Supported on 64-bit systems and devices with compute capability at least 2.0
- All addresses are in the same address space
- The runtime API call cudaPointerGetAttributes() can be used to determine whether a pointer points to host or device memory (see the sketch below)
- You still can't dereference a host pointer on the device, or vice versa (unless it is mapped memory)
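
A minimal sketch of such a query follows. Note that the cudaPointerAttributes field names have changed across CUDA versions; this sketch assumes the older memoryType field (CUDA 10+ uses attr.type and adds managed/unregistered values), and describePointer is a hypothetical helper:

#include <cuda_runtime.h>
#include <stdio.h>

// Classify a pointer under UVA.
void describePointer(const void *ptr) {
    cudaPointerAttributes attr;
    cudaError_t status = cudaPointerGetAttributes(&attr, ptr);
    if (status != cudaSuccess) {
        cudaGetLastError();  // clear the error state
        // older runtimes report an error for plain pageable host pointers
        printf("%p: pageable host memory (unknown to the CUDA runtime)\n", ptr);
        return;
    }
    if (attr.memoryType == cudaMemoryTypeDevice) {
        printf("%p: device memory\n", ptr);
    } else {
        printf("%p: host memory\n", ptr);
    }
}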

Unified Virtual Addressing (UVA)

[Diagram: before UVA, the CPU and each GPU had separate address spaces, each starting at 0. With UVA they share one address space: CPU addresses occupy 0 to N-1, GPU 0 addresses occupy N to N+M-1, and GPU 1 addresses occupy N+M to N+2M-1.]

Unified Memory

- Introduced as part of CUDA 6 in 2013
- Dependent on Unified Virtual Addressing, but not the same thing
- Extends UVA with the ability to have managed memory, which is automatically migrated between host and device

Vector Addition with Unified Memory

cudaMallocManaged(&A, N * sizeof(float));
cudaMallocManaged(&B, N * sizeof(float));
cudaMallocManaged(&C, N * sizeof(float));

// copy input data to A and B
// set up grid

vectorAdditionKernel<<<grid, block>>>(A, B, C, N);
cudaDeviceSynchronize();  // wait for the kernel before touching managed memory on the host

// C can now be dereferenced from the host

cudaFree(A);
cudaFree(B);
cudaFree(C);

Managed Static Memory

Static memory can also be managed:

__device__ __managed__ int sum;

This can be useful for seamlessly transferring results back from, e.g., a reduction.
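
For example, a minimal (and deliberately naive) reduction sketch can use a managed scalar to read the result directly on the host; the kernel, sizes, and names here are illustrative:

#include <cuda_runtime.h>
#include <stdio.h>

__device__ __managed__ int sum;   // visible by name on both host and device

__global__ void sumKernel(const int *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        atomicAdd(&sum, data[i]);  // naive reduction, for illustration only
    }
}

int main(void) {
    const int n = 1024;
    int *data;
    cudaMallocManaged((void **) &data, n * sizeof(int));
    for (int i = 0; i < n; ++i) data[i] = 1;

    sum = 0;                                    // managed: writable directly from the host
    sumKernel<<<(n + 255) / 256, 256>>>(data, n);
    cudaDeviceSynchronize();                    // required before reading managed memory on the host

    printf("sum = %d\n", sum);                  // no explicit copy back needed
    cudaFree(data);
    return 0;
}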

Advantages of Unified Memory

- Much simpler code
- No longer any need for duplicate pointers
- No longer any need for explicit memory transfers
- More complicated data structures can be used naturally
- Deep copies are no longer needed for pointers to structures containing pointers
- For example, you can construct a linked list on the host and traverse it on the device

Linked List without Unified Memory

[Diagram: hHead points to a chain of nodes (data, next) allocated in host memory with cudaHostAlloc(); dHead points to a node allocated in device memory with cudaMalloc(). We can allocate the device node with cudaMalloc, but we need to launch a kernel just to set its pointer, and dHead->next is inaccessible from the host.]

Linked List with Unified Memory

- Nodes are allocated in unified memory
- All pointers (head and all next pointers) are accessible on both host and device
- We can construct the list on the host, pass head to the device, and simply traverse it

[Diagram: head points to a chain of nodes (data, next), each allocated in unified memory with cudaMallocManaged().]
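
A minimal sketch of this idea, assuming an illustrative Node layout and list length: the list is built on the host with cudaMallocManaged and traversed by a single device thread.

#include <cuda_runtime.h>
#include <stdio.h>

struct Node {
    int data;
    Node *next;
};

__global__ void traverseKernel(Node *head) {
    // a single thread walks the list; every pointer is valid on the device
    int total = 0;
    for (Node *n = head; n != NULL; n = n->next) {
        total += n->data;
    }
    printf("device sum of list: %d\n", total);
}

int main(void) {
    const int length = 4;
    Node *head = NULL, *tail = NULL;

    // construct the list on the host, one managed allocation per node
    for (int i = 0; i < length; ++i) {
        Node *node;
        cudaMallocManaged((void **) &node, sizeof(Node));
        node->data = i;
        node->next = NULL;
        if (tail) tail->next = node; else head = node;
        tail = node;
    }

    traverseKernel<<<1, 1>>>(head);   // pass only the head pointer; no deep copy needed
    cudaDeviceSynchronize();

    for (Node *n = head; n != NULL; ) {  // free the managed nodes
        Node *next = n->next;
        cudaFree(n);
        n = next;
    }
    return 0;
}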

Performance of Unified Memory

- Can be nearly as fast as explicitly managed memory
- Until very recently, kernel launches caused all host-resident managed pages to be flushed to the GPU
- The Pascal architecture (2016) includes support for GPU page faults
- A kernel can now be launched with managed memory still resident on the host; page faults cause memory to be migrated to the GPU as needed

Conclusion / Takeaways

- cudaMemcpy requires the data on the host to be in pinned memory; if it is not already pinned, the CUDA runtime copies it to a pinned buffer first and then to the device
- There are a number of options for transferring data from CPU to GPU, with different ease of use and performance
- The CUDA CPU / GPU interface has been (and still is) evolving over time

Sources

- https://www.wikipedia.org/
- CUDA C Best Practices Guide. http://docs.nvidia.com/cuda/cuda-c-best-practices-guide/#axzz4fvwqomth
- Cheng, John, Max Grossman, and Ty McKercher. Professional CUDA C Programming. John Wiley & Sons, 2014.
- Hwu, Wen-mei, and David Kirk. "Programming Massively Parallel Processors." Special Edition 92 (2009).
- Wilt, Nicholas. The CUDA Handbook: A Comprehensive Guide to GPU Programming. Pearson Education, 2013.