Efficient Data Transfers

Efficient Data Transfers. Slide credit: Slides adapted from David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2016

PCIE

Review: Typical Structure of a CUDA Program
- Global variables declaration
- Function prototypes: __global__ void kernelOne(...)
- main()
  - allocate memory space on the device: cudaMalloc(&d_GlblVarPtr, bytes)
  - transfer data from host to device: cudaMemcpy(d_GlblVarPtr, h_Gl...)
  - execution configuration setup
  - kernel call (repeat as needed): kernelOne<<<execution configuration>>>(args...);
  - transfer results from device to host: cudaMemcpy(h_GlblVarPtr, ...)
  - optional: compare against golden (host-computed) solution
- Kernel: __global__ void kernelOne(type args, ...)
  - variables declaration: local, shared
  - automatic variables transparently assigned to registers or local memory
  - __syncthreads()
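A minimal, self-contained sketch of this structure (a hedged example: the kernel body, the problem size, and the golden-check step are illustrative, not from the slides):

#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

__global__ void kernelOne(float *d_out, const float *d_in, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // automatic variables live in registers
    if (i < n) d_out[i] = 2.0f * d_in[i];            // illustrative element-wise work
}

int main(void) {
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);
    float *h_in = (float *) malloc(bytes), *h_out = (float *) malloc(bytes);
    for (int i = 0; i < n; i++) h_in[i] = (float) i;

    float *d_in, *d_out;
    cudaMalloc((void **) &d_in, bytes);                       // allocate device memory
    cudaMalloc((void **) &d_out, bytes);
    cudaMemcpy(d_in, h_in, bytes, cudaMemcpyHostToDevice);    // host-to-device transfer

    kernelOne<<<(n + 255) / 256, 256>>>(d_out, d_in, n);      // execution configuration + kernel call

    cudaMemcpy(h_out, d_out, bytes, cudaMemcpyDeviceToHost);  // device-to-host transfer
    // optional: compare h_out against a host-computed golden result here

    cudaFree(d_in); cudaFree(d_out);
    free(h_in); free(h_out);
    return 0;
}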

Bandwidth Gravity of Modern Computer Systems. The bandwidth between key components ultimately dictates system performance. This is especially true for massively parallel systems processing massive amounts of data. Tricks like buffering, reordering, and caching can temporarily defy the rules in some cases, but ultimately performance falls back to what the speeds and feeds dictate.

History: Classic PC Architecture. The Northbridge connects three components that must communicate at high speed: CPU, DRAM, and video. Video also needs first-class access to DRAM. Older NVIDIA cards were connected to AGP, with up to 2 GB/s transfers. The Southbridge serves as a concentrator for slower I/O devices. (Diagram: CPU and core logic chipset.)

(Original) PCI Bus Specification. Connected to the Southbridge. Originally 33 MHz, 32-bit wide, with a 132 MB/s peak transfer rate; more recently 66 MHz, 64-bit, 528 MB/s peak. Upstream bandwidth remains slow for devices (~256 MB/s peak). It is a shared bus with arbitration: the winner of arbitration becomes bus master and can connect to the CPU or DRAM through the Southbridge and Northbridge.

PCI as Memory-Mapped I/O. PCI device registers are mapped into the CPU's physical address space and accessed through loads/stores (kernel mode). Addresses are assigned to the PCI devices at boot time, and all devices listen for their addresses.

PCI Express (PCIe). Switched, point-to-point connection: each card has a dedicated link to the central switch, with no bus arbitration. Packet-switched messages form virtual channels, and packets can be prioritized for QoS, e.g., real-time video streaming.

PCIe 2 Links and Lanes. Each link consists of one or more lanes. Each lane is 1-bit wide (4 wires; each 2-wire pair can transmit 2.5 Gb/s in one direction), and upstream and downstream are now simultaneous and symmetric. Each link can combine 1, 2, 4, 8, 12, or 16 lanes (x1, x2, etc.). Each data byte is 8b/10b encoded into 10 bits with an equal number of 1s and 0s, giving a net data rate of 2 Gb/s per lane each way. Thus, the net data rates are 250 MB/s (x1), 500 MB/s (x2), 1 GB/s (x4), 2 GB/s (x8), and 4 GB/s (x16), each way.
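As a quick check, the per-lane numbers above follow directly from the raw signaling rate and the 8b/10b overhead:

$$\underbrace{2.5~\text{Gb/s}}_{\text{raw, per lane}} \times \tfrac{8}{10} = 2~\text{Gb/s} = 250~\text{MB/s per lane each way}, \qquad 16 \times 250~\text{MB/s} = 4~\text{GB/s for an x16 link}.$$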

8b/10b Encoding. The goal is to maintain DC balance while having sufficient state transitions for clock recovery. The difference between the number of 1s and 0s in a 20-bit stream should be at most 2, and there should be no more than 5 consecutive 1s or 0s in any stream. For example, 00000000, 00000111, and 11000001 are bad; 01010101 and 11001100 are good. Find 256 good patterns among the 1024 total 10-bit patterns to encode 8-bit data: 20% overhead.
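A rough sketch of testing a candidate 10-bit codeword against the two rules above (illustrative only; real 8b/10b also tracks running disparity across consecutive words, so the count printed here is not exactly 256):

#include <stdio.h>

/* Check one 10-bit pattern: roughly balanced 1s/0s and no run longer than 5. */
int is_good_pattern(unsigned pattern) {
    int ones = 0, run = 1, max_run = 1;
    for (int i = 0; i < 10; i++) ones += (pattern >> i) & 1;
    for (int i = 1; i < 10; i++) {
        if (((pattern >> i) & 1) == ((pattern >> (i - 1)) & 1)) run++;
        else run = 1;
        if (run > max_run) max_run = run;
    }
    int disparity = ones - (10 - ones);          /* +2, 0, or -2 is acceptable */
    return (disparity >= -2 && disparity <= 2) && max_run <= 5;
}

int main(void) {
    int good = 0;
    for (unsigned p = 0; p < 1024; p++) good += is_good_pattern(p);
    printf("candidate 10-bit patterns passing the rules: %d\n", good);
    return 0;
}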

PCIe PC Architecture. PCIe forms the interconnect backbone; the Northbridge and Southbridge are both PCIe switches. Some Southbridge designs have a built-in PCI-PCIe bridge to allow old PCI cards, and some PCIe I/O cards are PCI cards with a PCI-PCIe bridge. Source: Jon Stokes, PCI Express: An Overview, http://arstechnica.com/articles/paedia/hardware/pcie.ars

GeForce 7800 GTX Board Details. SLI connector, single-slot cooling, s-video TV out, DVI x 2, 16x PCI-Express, 256 MB / 256-bit DDR3 at 600 MHz (8 pieces of 8Mx32).

PCIe 3. A total of 4 Giga Transfers per second in each direction. There is no more 8b/10b encoding; instead, it uses a polynomial transformation (scrambling) at the transmitter and its inverse at the receiver to achieve the same effect. So the effective bandwidth is double that of PCIe 2.

PINNED HOST MEMORY

CPU-GPU Data Transfer Using DMA. DMA (Direct Memory Access) hardware is used by cudaMemcpy() for better efficiency; it frees the CPU for other tasks. A DMA engine is a hardware unit specialized to transfer a number of bytes requested by the OS between physical memory address space regions (some of which can be memory-mapped I/O locations). It uses the system interconnect, typically PCIe in today's systems. (Diagram: CPU and main memory (DRAM) connected over PCIe to the global memory of the GPU card, or other I/O cards.)

Virtual Memory Management. Modern computers use virtual memory management: many virtual memory spaces are mapped into a single physical memory, and virtual addresses (pointer values) are translated into physical addresses. Not all variables and data structures are always in physical memory. Each virtual address space is divided into pages that are mapped into and out of the physical memory; virtual memory pages can be mapped out of the physical memory (paged out) to make room. Whether or not a variable is in physical memory is checked at address translation time.

Data Transfer and Virtual Memory. DMA uses physical addresses. When cudaMemcpy() copies an array, it is implemented as one or more DMA transfers. Addresses are translated and page presence is checked for the entire source and destination regions at the beginning of each DMA transfer; there is no address translation for the rest of the same DMA transfer, so high efficiency can be achieved. However, the OS could accidentally page out the data being read or written by a DMA and page in another virtual page into the same physical location.

Pinned Memory and DMA Data Transfer. Pinned memory consists of virtual memory pages that are specially marked so that they cannot be paged out; it is allocated with a special system API function call. It is also known as page-locked memory, locked pages, etc. CPU memory that serves as the source or destination of a DMA transfer must be allocated as pinned memory.

CUDA Data Transfer Uses Pinned Memory. The DMA used by cudaMemcpy() requires that any source or destination in host memory be allocated as pinned memory. If a source or destination of a cudaMemcpy() in host memory is not allocated in pinned memory, it first needs to be copied to a pinned staging buffer, which adds extra overhead. cudaMemcpy() is therefore faster if the host memory source or destination is allocated in pinned memory, since no extra copy is needed.

Allocate/Free Pinned Memory. cudaHostAlloc() takes three parameters: the address of the pointer to the allocated memory, the size of the allocated memory in bytes, and an option (use cudaHostAllocDefault for now). cudaFreeHost() takes one parameter: the pointer to the memory to be freed.
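A minimal allocate/free sketch using these two calls (the buffer name and size are illustrative, and the error-handling pattern shown is just one common choice):

#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

int main(void) {
    const int N = 1 << 20;                  // illustrative size
    float *h_buf = NULL;

    // Allocate pinned (page-locked) host memory
    cudaError_t err = cudaHostAlloc((void **) &h_buf, N * sizeof(float), cudaHostAllocDefault);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaHostAlloc failed: %s\n", cudaGetErrorString(err));
        return EXIT_FAILURE;
    }

    // ... use h_buf exactly like memory returned by malloc() ...

    cudaFreeHost(h_buf);                    // free pinned memory with cudaFreeHost(), not free()
    return 0;
}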

Using Pinned Memory in CUDA. Use the allocated pinned memory and its pointer the same way as those returned by malloc(); the only difference is that the allocated memory cannot be paged out by the OS. The cudaMemcpy() function should be about 2X faster with pinned memory. Pinned memory is a limited resource; over-subscription can have serious consequences.
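One way to see the difference on a given system is to time the same host-to-device copy from a pageable and a pinned buffer with CUDA events (a rough sketch; the buffer size is illustrative and the observed speedup depends on the platform):

#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

// Time a host-to-device copy of 'bytes' bytes from h_src into d_dst, in milliseconds
static float time_h2d(float *d_dst, const float *h_src, size_t bytes) {
    cudaEvent_t start, stop;
    float ms = 0.0f;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    cudaMemcpy(d_dst, h_src, bytes, cudaMemcpyHostToDevice);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}

int main(void) {
    const size_t bytes = (size_t) 64 * 1024 * 1024;   // 64 MB, illustrative
    float *h_pageable = (float *) malloc(bytes);
    float *h_pinned, *d_buf;
    cudaHostAlloc((void **) &h_pinned, bytes, cudaHostAllocDefault);
    cudaMalloc((void **) &d_buf, bytes);

    printf("pageable: %.2f ms\n", time_h2d(d_buf, h_pageable, bytes));
    printf("pinned:   %.2f ms\n", time_h2d(d_buf, h_pinned, bytes));

    cudaFree(d_buf);
    cudaFreeHost(h_pinned);
    free(h_pageable);
    return 0;
}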

Putting It Together - Vector Addition Host Code Example

int main() {
    float *h_a, *h_b, *h_c;
    cudaHostAlloc((void **) &h_a, N * sizeof(float), cudaHostAllocDefault);
    cudaHostAlloc((void **) &h_b, N * sizeof(float), cudaHostAllocDefault);
    cudaHostAlloc((void **) &h_c, N * sizeof(float), cudaHostAllocDefault);
    // cudaMemcpy() on these buffers runs about 2X faster
}

CUDA STREAMS

Serialized Data Transfer and Computation. So far, the way we use cudaMemcpy() serializes data transfer and GPU computation for VecAddKernel(). (Timeline: copy A, copy B, compute, copy C; during each copy only one PCIe direction is used and the GPU is idle, and during the compute the PCIe link is idle.)

Device Overlap. Some CUDA devices support device overlap: they can simultaneously execute a kernel while copying data between device and host memory.

int dev_count;
cudaDeviceProp prop;
cudaGetDeviceCount(&dev_count);
for (int i = 0; i < dev_count; i++) {
    cudaGetDeviceProperties(&prop, i);
    if (prop.deviceOverlap)
        ...   // this device can overlap kernel execution with copies
}

Ideal, Pipelined Timing. Divide large vectors into segments and overlap the transfer and compute of adjacent segments. (Timeline: while segment 0 is being computed (C.0 = A.0 + B.0), A.1 and B.1 are being copied in; while segment 1 is being computed, C.0 is copied out and A.2/B.2 are copied in; and so on.)

CUDA Streams. CUDA supports parallel execution of kernels and cudaMemcpy() with streams. Each stream is a queue of operations (kernel launches and cudaMemcpy() calls). Operations (tasks) in different streams can go in parallel: task parallelism.

Streams. Requests made from the host code are put into first-in-first-out queues. The queues are read and processed asynchronously by the driver and device. The driver ensures that commands in a queue are processed in sequence; e.g., memory copies end before the kernel launch, etc. (Diagram: the host thread issues cudaMemcpy(), kernel launch, device sync, cudaMemcpy() into a FIFO that is read by the device driver and device.)

Streams (cont.). To allow concurrent copying and kernel execution, use multiple queues, called streams. CUDA events allow the host thread to query and synchronize with individual queues (i.e., streams). (Diagram: the host thread issues work and events into Stream 0 and Stream 1, which are read by the device driver.)
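A small sketch of using a CUDA event to synchronize the host with one particular stream (the buffer names and size are illustrative):

#include <cuda_runtime.h>

int main(void) {
    const int N = 1 << 20;
    float *h_a, *d_a;
    cudaHostAlloc((void **) &h_a, N * sizeof(float), cudaHostAllocDefault);   // pinned source
    cudaMalloc((void **) &d_a, N * sizeof(float));

    cudaStream_t stream0;
    cudaEvent_t copied;
    cudaStreamCreate(&stream0);
    cudaEventCreate(&copied);

    // Enqueue an asynchronous copy in stream0, then record an event after it
    cudaMemcpyAsync(d_a, h_a, N * sizeof(float), cudaMemcpyHostToDevice, stream0);
    cudaEventRecord(copied, stream0);

    // ... the host can do other work here ...

    cudaEventSynchronize(copied);    // block the host until the copy in stream0 has finished
    // (cudaEventQuery(copied) could instead be polled without blocking)

    cudaEventDestroy(copied);
    cudaStreamDestroy(stream0);
    cudaFree(d_a);
    cudaFreeHost(h_a);
    return 0;
}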

Conceptual View of Streams. (Diagram: a copy engine and a kernel engine; Stream 0 holds MemCpy A.0, MemCpy B.0, Kernel 0, MemCpy C.0, and Stream 1 holds MemCpy A.1, MemCpy B.1, Kernel 1, MemCpy C.1. Operations are kernel launches and cudaMemcpy() calls moving over the PCIe up/down directions.)

OVERLAPPING DATA TRANSFER W/ COMPUTATION

Simple Multi-Stream Host Code

cudaStream_t stream0, stream1;
cudaStreamCreate(&stream0);
cudaStreamCreate(&stream1);

float *d_A0, *d_B0, *d_C0;   // device memory for stream 0
float *d_A1, *d_B1, *d_C1;   // device memory for stream 1

// cudaMalloc() calls for d_A0, d_B0, d_C0, d_A1, d_B1, d_C1 go here

Simple Multi-Stream Host Code (Cont.)

for (int i = 0; i < n; i += SegSize * 2) {
    cudaMemcpyAsync(d_A0, h_A+i, SegSize*sizeof(float), cudaMemcpyHostToDevice, stream0);
    cudaMemcpyAsync(d_B0, h_B+i, SegSize*sizeof(float), cudaMemcpyHostToDevice, stream0);
    vecAdd<<<SegSize/256, 256, 0, stream0>>>(d_A0, d_B0, ...);
    cudaMemcpyAsync(h_C+i, d_C0, SegSize*sizeof(float), cudaMemcpyDeviceToHost, stream0);

    cudaMemcpyAsync(d_A1, h_A+i+SegSize, SegSize*sizeof(float), cudaMemcpyHostToDevice, stream1);
    cudaMemcpyAsync(d_B1, h_B+i+SegSize, SegSize*sizeof(float), cudaMemcpyHostToDevice, stream1);
    vecAdd<<<SegSize/256, 256, 0, stream1>>>(d_A1, d_B1, ...);
    cudaMemcpyAsync(h_C+i+SegSize, d_C1, SegSize*sizeof(float), cudaMemcpyDeviceToHost, stream1);
}

A View Closer to Reality in Previous GPUs. (Diagram: a single copy engine queue holds MemCpy A.0, B.0, C.0, A.1, B.1, C.1 in issue order, and the kernel engine queue holds Kernel 0 and Kernel 1; operations from Stream 0 and Stream 1 are serialized within each engine.)

Not quite the overlap we want in some GPUs: C.0 blocks A.1 and B.1 in the copy engine queue. (Timeline: A.0, B.0, compute C.0 = A.0 + B.0, copy C.0 out, and only then A.1, B.1, compute C.1 = A.1 + B.1, copy C.1 out.)

Better Multi-Stream Host Code

for (int i = 0; i < n; i += SegSize * 2) {
    cudaMemcpyAsync(d_A0, h_A+i, SegSize*sizeof(float), cudaMemcpyHostToDevice, stream0);
    cudaMemcpyAsync(d_B0, h_B+i, SegSize*sizeof(float), cudaMemcpyHostToDevice, stream0);
    cudaMemcpyAsync(d_A1, h_A+i+SegSize, SegSize*sizeof(float), cudaMemcpyHostToDevice, stream1);
    cudaMemcpyAsync(d_B1, h_B+i+SegSize, SegSize*sizeof(float), cudaMemcpyHostToDevice, stream1);

    vecAdd<<<SegSize/256, 256, 0, stream0>>>(d_A0, d_B0, ...);
    vecAdd<<<SegSize/256, 256, 0, stream1>>>(d_A1, d_B1, ...);

    cudaMemcpyAsync(h_C+i, d_C0, SegSize*sizeof(float), cudaMemcpyDeviceToHost, stream0);
    cudaMemcpyAsync(h_C+i+SegSize, d_C1, SegSize*sizeof(float), cudaMemcpyDeviceToHost, stream1);
}

C.0 no longer blocks A.1 and B.1. (Diagram: the copy engine queue now holds MemCpy A.0, B.0, A.1, B.1, and then C.0, C.1; the kernel engine queue holds Kernel 0 and Kernel 1.)

Better, but not quite the best overlap: C.1 blocks the next iteration's A.0 and B.0 in the copy engine queue. (Timeline: within iteration n the A.1/B.1 copies overlap with the C.0 = A.0 + B.0 compute, but iteration n+1's A.2/B.2 copies and C.2 = A.2 + B.2 compute must wait for C.1 to be copied out.)

Ideal, Pipelined Timing. Achieving this requires at least three buffers for each of the original A, B, and C, and the code is more complicated. (Timeline: the copy-in of segment k+1, the compute of segment k, and the copy-out of segment k-1 all proceed at the same time.) A sketch of one way to structure such a loop follows.
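One possible structure for the fully pipelined loop: a small pool of streams, each with its own buffer set, cycled round-robin so that a buffer set is only reused once its previous segment has drained (a sketch only; it assumes pinned h_A/h_B/h_C, that SegSize divides n, and a vecAdd kernel taking (A, B, C, count); error handling is omitted):

#define NUM_STREAMS 3                    // at least 3 buffer sets, as noted above

cudaStream_t stream[NUM_STREAMS];
float *d_A[NUM_STREAMS], *d_B[NUM_STREAMS], *d_C[NUM_STREAMS];

for (int s = 0; s < NUM_STREAMS; s++) {
    cudaStreamCreate(&stream[s]);
    cudaMalloc((void **) &d_A[s], SegSize * sizeof(float));
    cudaMalloc((void **) &d_B[s], SegSize * sizeof(float));
    cudaMalloc((void **) &d_C[s], SegSize * sizeof(float));
}

// Segment j reuses the buffers of stream j % NUM_STREAMS; in-stream ordering
// guarantees the previous segment using those buffers has finished first.
for (int i = 0, j = 0; i < n; i += SegSize, j = (j + 1) % NUM_STREAMS) {
    cudaMemcpyAsync(d_A[j], h_A + i, SegSize * sizeof(float), cudaMemcpyHostToDevice, stream[j]);
    cudaMemcpyAsync(d_B[j], h_B + i, SegSize * sizeof(float), cudaMemcpyHostToDevice, stream[j]);
    vecAdd<<<SegSize / 256, 256, 0, stream[j]>>>(d_A[j], d_B[j], d_C[j], SegSize);
    cudaMemcpyAsync(h_C + i, d_C[j], SegSize * sizeof(float), cudaMemcpyDeviceToHost, stream[j]);
}
cudaDeviceSynchronize();                 // wait for all segments before using h_C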

Hyper Queues (Hyper-Q). Provide multiple hardware work queues for each engine, allowing more concurrency: some streams can make progress on an engine while others are blocked. (Diagram: Stream 0 (A--B--C), Stream 1 (P--Q--R), and Stream 2 (X--Y--Z) each map to their own hardware work queue.)

Wait Until All Tasks Have Completed. cudaStreamSynchronize(stream_id) is used in host code, takes one parameter (the stream identifier), and waits until all tasks in that stream have completed. For example, cudaStreamSynchronize(stream0) in host code ensures that all tasks in the queues of stream0 have completed. This is different from cudaDeviceSynchronize(), which is also used in host code, takes no parameter, and waits until all tasks in all streams have completed for the current device.
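A short snippet contrasting the two calls, assuming work has already been queued into stream0 and stream1 as in the multi-stream example above:

// ... cudaMemcpyAsync() calls and kernel launches queued into stream0 and stream1 ...

cudaStreamSynchronize(stream0);   // host blocks until everything queued in stream0 is done;
                                  // work in stream1 may still be in flight

cudaDeviceSynchronize();          // host blocks until all streams on the current device are done;
                                  // after this it is safe to read all of the results on the host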