CS/EE 217 GPU Architecture and Parallel Programming. Lecture 17: Data Transfer and CUDA Streams


Objective
To learn more advanced features of the CUDA API for data transfer and kernel launch:
- Task parallelism for overlapping data transfer with kernel computation
- CUDA streams

Serialized Data Transfer and GPU Computation
So far, the way we use cudaMemcpy serializes data transfer and GPU computation.
[Timeline figure: Transfer A, Transfer B, Vector Add kernel, Transfer C run back-to-back. During each transfer only one PCIe direction is used and the GPU is idle; during the kernel the PCIe bus is idle.]
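For reference, a minimal sketch of this serialized pattern (h_A, h_B, h_C, d_A, d_B, d_C, n, and the vecAdd kernel are illustrative names assumed to be declared and allocated elsewhere):

cudaMemcpy(d_A, h_A, n * sizeof(float), cudaMemcpyHostToDevice); // GPU idle
cudaMemcpy(d_B, h_B, n * sizeof(float), cudaMemcpyHostToDevice); // GPU idle
vecAdd<<<(n + 255) / 256, 256>>>(d_A, d_B, d_C, n);              // PCIe idle
cudaMemcpy(h_C, d_C, n * sizeof(float), cudaMemcpyDeviceToHost); // GPU idle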

Device Overlap
Some CUDA devices support device overlap: they can execute a kernel while simultaneously performing a copy between device and host memory.

int dev_count;
cudaDeviceProp prop;
cudaGetDeviceCount(&dev_count);
for (int i = 0; i < dev_count; i++) {
    cudaGetDeviceProperties(&prop, i);
    if (prop.deviceOverlap) {
        // device i can overlap a copy with kernel execution
    }
}
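A side note not on the slide: on newer CUDA versions the deviceOverlap field is deprecated in favor of asyncEngineCount, which also reports how many copy engines the device has. A hedged sketch of the equivalent check:

cudaGetDeviceProperties(&prop, i);
// asyncEngineCount == 1: one copy can overlap kernel execution
// asyncEngineCount == 2: copies in both directions can overlap kernels
if (prop.asyncEngineCount > 0) {
    printf("device %d: %d copy engine(s)\n", i, prop.asyncEngineCount);
}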

Overlapped (Pipelined) Timing
Divide large vectors into segments; overlap the transfer and compute of adjacent segments.
[Pipeline figure: while segment 1 is computed (C.1 = A.1 + B.1), A.2 and B.2 are transferred in; while C.1 is transferred out, C.2 = A.2 + B.2 is computed, and so on for A.3/B.3 and A.4/B.4.]

Using CUDA Streams and Asynchronous MemCpy
CUDA supports parallel execution of kernels and cudaMemcpy through streams.
Each stream is a queue of operations (kernel launches and cudaMemcpys).
Operations (tasks) in different streams can execute in parallel: task parallelism.
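A minimal sketch of creating a stream and issuing an asynchronous copy (N and the array names are illustrative). Note that cudaMemcpyAsync only overlaps with other work when the host buffer is pinned, e.g. allocated with cudaMallocHost:

float *h_a, *d_a;
cudaMallocHost(&h_a, N * sizeof(float)); // pinned host memory, required for true async copies
cudaMalloc(&d_a, N * sizeof(float));

cudaStream_t stream;
cudaStreamCreate(&stream);
cudaMemcpyAsync(d_a, h_a, N * sizeof(float), cudaMemcpyHostToDevice, stream);
// control returns to the host immediately; the copy is queued in 'stream'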

Streams
Device requests made from the host code are put into a queue.
The queue is read and processed asynchronously by the driver and the device.
The driver ensures that commands in a queue are processed in sequence: a memory copy ends before the kernel launch that follows it, etc.
[Figure: host thread pushes cudaMemcpy, kernel launch, and sync commands into a FIFO consumed by the device driver.]

Streams (cont.)
To allow concurrent copying and kernel execution, you need to use multiple queues, called streams.
CUDA events allow the host thread to query and synchronize with the individual queues.
[Figure: host thread feeding two queues, Stream 1 and Stream 2, consumed by the device driver; an event marks a point within a stream.]
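A hedged sketch of the event API (d_a, h_a, nbytes, the kernel, and the stream names are illustrative):

cudaEvent_t ev;
cudaEventCreate(&ev);
cudaMemcpyAsync(d_a, h_a, nbytes, cudaMemcpyHostToDevice, stream1);
cudaEventRecord(ev, stream1);            // mark this point in stream1
cudaStreamWaitEvent(stream2, ev, 0);     // stream2 waits until the copy completes
kernel<<<1, 256, 0, stream2>>>(d_a);     // safe: runs after the copy
cudaEventSynchronize(ev);                // the host can also block on the event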

Conceptual View of Streams
[Figure: PCIe up and PCIe down channels feed a Copy Engine and a Kernel Engine. Stream 0 queues MemCpy A.1, MemCpy B.1, Kernel 1, MemCpy C.1; Stream 1 queues MemCpy A.2, MemCpy B.2, Kernel 2, MemCpy C.2. The operations are kernels and MemCpys.]

A Simple Multi-Stream Host Code

cudaStream_t stream0, stream1;
cudaStreamCreate(&stream0);
cudaStreamCreate(&stream1);

float *d_A0, *d_B0, *d_C0; // device memory for stream 0
float *d_A1, *d_B1, *d_C1; // device memory for stream 1

// cudaMalloc calls for d_A0, d_B0, d_C0, d_A1, d_B1, d_C1 go here

A Simple Multi-Stream Host Code (cont.)

for (int i = 0; i < n; i += SegSize * 2) {
    // stream 0: copy in, compute, copy out one segment
    cudaMemcpyAsync(d_A0, h_A + i, SegSize * sizeof(float), cudaMemcpyHostToDevice, stream0);
    cudaMemcpyAsync(d_B0, h_B + i, SegSize * sizeof(float), cudaMemcpyHostToDevice, stream0);
    vecAdd<<<SegSize / 256, 256, 0, stream0>>>(d_A0, d_B0, d_C0);
    cudaMemcpyAsync(h_C + i, d_C0, SegSize * sizeof(float), cudaMemcpyDeviceToHost, stream0);

    // stream 1: the same for the adjacent segment
    cudaMemcpyAsync(d_A1, h_A + i + SegSize, SegSize * sizeof(float), cudaMemcpyHostToDevice, stream1);
    cudaMemcpyAsync(d_B1, h_B + i + SegSize, SegSize * sizeof(float), cudaMemcpyHostToDevice, stream1);
    vecAdd<<<SegSize / 256, 256, 0, stream1>>>(d_A1, d_B1, d_C1);
    cudaMemcpyAsync(h_C + i + SegSize, d_C1, SegSize * sizeof(float), cudaMemcpyDeviceToHost, stream1);
}
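Two practical notes not spelled out on the slides: the host arrays must be pinned for the cudaMemcpyAsync calls above to actually run asynchronously, and the streams should be drained and destroyed when done. A sketch with the same illustrative names:

// pinned host allocations (required for async copies to overlap)
cudaMallocHost(&h_A, n * sizeof(float));
cudaMallocHost(&h_B, n * sizeof(float));
cudaMallocHost(&h_C, n * sizeof(float));

// ... the two-stream loop above runs here ...

cudaStreamSynchronize(stream0); // wait for stream 0 to drain
cudaStreamSynchronize(stream1); // wait for stream 1 to drain
cudaStreamDestroy(stream0);
cudaStreamDestroy(stream1);
cudaFreeHost(h_A); cudaFreeHost(h_B); cudaFreeHost(h_C);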

A View Closer to Reality
[Figure: the Copy Engine has a single queue holding MemCpy A.1, B.1, C.1, A.2, B.2, C.2 in issue order; the Kernel Engine queues Kernel 1 and Kernel 2. Operations from Stream 0 and Stream 1 are multiplexed into these two engine queues.]

Not Quite the Overlap We Want
C.1 blocks A.2 and B.2 in the copy engine queue.
[Timeline figure: transfer A.1, B.1; compute C.1 = A.1 + B.1; transfer C.1 out; only then transfer A.2, B.2 and compute C.2 = A.2 + B.2.]

A Better Multi-Stream Host Code
Issuing all input copies before any output copy keeps C.1 from blocking A.2 and B.2 in the copy engine queue.

for (int i = 0; i < n; i += SegSize * 2) {
    // issue all input copies first, for both streams
    cudaMemcpyAsync(d_A0, h_A + i, SegSize * sizeof(float), cudaMemcpyHostToDevice, stream0);
    cudaMemcpyAsync(d_B0, h_B + i, SegSize * sizeof(float), cudaMemcpyHostToDevice, stream0);
    cudaMemcpyAsync(d_A1, h_A + i + SegSize, SegSize * sizeof(float), cudaMemcpyHostToDevice, stream1);
    cudaMemcpyAsync(d_B1, h_B + i + SegSize, SegSize * sizeof(float), cudaMemcpyHostToDevice, stream1);

    // then both kernels, then both output copies
    vecAdd<<<SegSize / 256, 256, 0, stream0>>>(d_A0, d_B0, d_C0);
    vecAdd<<<SegSize / 256, 256, 0, stream1>>>(d_A1, d_B1, d_C1);
    cudaMemcpyAsync(h_C + i, d_C0, SegSize * sizeof(float), cudaMemcpyDeviceToHost, stream0);
    cudaMemcpyAsync(h_C + i + SegSize, d_C1, SegSize * sizeof(float), cudaMemcpyDeviceToHost, stream1);
}

A View Closer to Reality
[Figure: the copy engine queue now holds MemCpy A.1, B.1, A.2, B.2 ahead of C.1 and C.2, so the segment-2 input transfers overlap with Kernel 1; the kernel engine queues Kernel 1 and Kernel 2 as before.]

Overlapped (Pipelined) Timing (revisited)
With the better issue order we approach the pipelined timing from before: transfer and compute of adjacent segments overlap.
[Pipeline figure: while C.1 = A.1 + B.1 is computed, A.2 and B.2 are transferred in; while C.1 is transferred out, C.2 = A.2 + B.2 is computed, and so on for A.3/B.3 and A.4/B.4.]

Hyper-Q
Provides multiple real hardware queues for each engine.
Allows much more concurrency: some streams can make progress on an engine while others are blocked.
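A practical note not on the slide: the number of hardware connections a process uses is capped by the CUDA_DEVICE_MAX_CONNECTIONS environment variable (default 8), so with more streams than connections some streams still share a queue. A sketch of fanning work out across streams to exploit Hyper-Q (someKernel and d_data are hypothetical):

const int NSTREAMS = 8; // illustrative; capped by CUDA_DEVICE_MAX_CONNECTIONS
cudaStream_t streams[NSTREAMS];
for (int s = 0; s < NSTREAMS; s++)
    cudaStreamCreate(&streams[s]);
for (int s = 0; s < NSTREAMS; s++)
    someKernel<<<1, 256, 0, streams[s]>>>(d_data, s); // kernels may run concurrently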

Fermi (and Older) Concurrency
[Figure: Stream 1 (A--B--C), Stream 2 (P--Q--R), and Stream 3 (X--Y--Z) all multiplex into a single hardware work queue.]
Fermi allows 16-way concurrency: up to 16 grids can run at once.
But CUDA streams multiplex into a single hardware work queue, so overlap occurs only at stream edges.

Kepler Improved Concurrency
[Figure: Stream 1 (A--B--C), Stream 2 (P--Q--R), and Stream 3 (X--Y--Z) each feed their own hardware work queue.]
Kepler allows 32-way concurrency:
- One work queue per stream
- Concurrency at full-stream level
- No inter-stream dependencies

ANY QUESTIONS?

Synchronization
cudaStreamSynchronize(stream_id)
- Used in host code
- Takes a stream identifier parameter
- Waits until all tasks in that stream have completed

This is different from cudaDeviceSynchronize():
- Also used in host code
- Takes no parameter
- Waits until all tasks in all streams on the current device have completed
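A minimal sketch contrasting the two (the kernels, arguments, and stream names are illustrative):

kernelA<<<1, 256, 0, stream0>>>(d_x); // hypothetical kernels
kernelB<<<1, 256, 0, stream1>>>(d_y);

cudaStreamSynchronize(stream0); // blocks until kernelA finishes; kernelB may still be running
cudaDeviceSynchronize();        // blocks until everything on the current device finishes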