Programming with CUDA WS 08/09. Lecture 7 Thu, 13 Nov, 2008

1 Programming with CUDA WS 08/09 Lecture 7 Thu, 13 Nov, 2008

2 Previously
  CUDA Runtime
    Common
      Built-in vector types
      Math functions
      Timing
      Textures
        Texture fetch
        Texture reference
        Texture read modes
        Normalized texture coordinates
        Linear texture filtering

3 Today
  CUDA Runtime
    Common
    Device
    Host

4 CUDA Runtime
  Common
  Device
  Host

5 Device Runtime
  Can only be used in device code
  Math functions
    Faster, less accurate versions of functions from the common component
    Named __<common_function_name>, e.g. __logf is the fast version of logf
    Listed in Appendix B of the Programming Guide
    To use the fast versions by default, compile with the nvcc option -use_fast_math
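
The accuracy trade-off between a common-component function and its device intrinsic can be checked directly. The following is a minimal sketch, not taken from the lecture: the kernel name logBoth and the host helper maxLogDifference are hypothetical, and the input range 1..n is an arbitrary choice. Note that when the file is compiled with -use_fast_math, logf itself maps to the intrinsic and the reported difference becomes zero.

```cuda
#include <cmath>
#include <vector>
#include <cuda_runtime.h>

// Each thread evaluates the accurate logf and the fast intrinsic __logf
// for one input value.
__global__ void logBoth(const float* in, float* accurate, float* fast, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        accurate[i] = logf(in[i]);    // common-component version
        fast[i]     = __logf(in[i]);  // device-only intrinsic
    }
}

// Hypothetical helper (assumes n > 0): returns the largest
// |logf(x) - __logf(x)| over x = 1..n, or -1.0f on a CUDA error.
float maxLogDifference(int n)
{
    std::vector<float> hIn(n), hAcc(n), hFast(n);
    for (int i = 0; i < n; ++i) hIn[i] = static_cast<float>(i + 1);

    size_t bytes = n * sizeof(float);
    float *dIn = nullptr, *dAcc = nullptr, *dFast = nullptr;
    cudaError_t err = cudaMalloc(&dIn, bytes);
    if (err == cudaSuccess) err = cudaMalloc(&dAcc, bytes);
    if (err == cudaSuccess) err = cudaMalloc(&dFast, bytes);
    if (err == cudaSuccess)
        err = cudaMemcpy(dIn, hIn.data(), bytes, cudaMemcpyHostToDevice);

    if (err == cudaSuccess) {
        int threads = 256;
        int blocks = (n + threads - 1) / threads;
        logBoth<<<blocks, threads>>>(dIn, dAcc, dFast, n);
        err = cudaGetLastError();  // catches launch-configuration errors
    }
    // The blocking copies also wait for the kernel to finish.
    if (err == cudaSuccess)
        err = cudaMemcpy(hAcc.data(), dAcc, bytes, cudaMemcpyDeviceToHost);
    if (err == cudaSuccess)
        err = cudaMemcpy(hFast.data(), dFast, bytes, cudaMemcpyDeviceToHost);

    cudaFree(dIn);
    cudaFree(dAcc);
    cudaFree(dFast);
    if (err != cudaSuccess) return -1.0f;

    float maxDiff = 0.0f;
    for (int i = 0; i < n; ++i)
        maxDiff = std::fmax(maxDiff, std::fabs(hAcc[i] - hFast[i]));
    return maxDiff;
}
```

Calling maxLogDifference(1024) from main and printing the result gives a rough feel for how much accuracy the intrinsic gives up on a particular device.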

6 Device Runtime
  Synchronization function: __syncthreads()
    Synchronizes all threads in a block
    Avoids read-after-write, write-after-read, and write-after-write hazards on shared memory accessed by several threads
    Dangerous to use in conditionals: if not all threads of the block reach the call, the code may hang or produce unwanted effects
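
A read-after-write hazard on shared memory is easiest to see in a block-level reversal, where every thread reads an element written by a different thread. This is a minimal sketch under stated assumptions: the kernel reverseInBlock, the helper reverseMatches, and the block size of 256 are illustrative choices, not part of the lecture.

```cuda
#include <cuda_runtime.h>

#define BLOCK_SIZE 256

// Reverses BLOCK_SIZE integers in place using a shared-memory tile.
// Must be launched with exactly one block of BLOCK_SIZE threads.
__global__ void reverseInBlock(int* data)
{
    __shared__ int tile[BLOCK_SIZE];
    int t = threadIdx.x;

    tile[t] = data[t];   // each thread writes one shared element
    __syncthreads();     // barrier reached by ALL threads, outside any conditional
    data[t] = tile[BLOCK_SIZE - 1 - t];  // reads an element written by another thread
}

// Hypothetical host helper: returns true if the device result is the
// reversed input, false on mismatch or CUDA error.
bool reverseMatches()
{
    int h[BLOCK_SIZE];
    for (int i = 0; i < BLOCK_SIZE; ++i) h[i] = i;

    int* d = nullptr;
    if (cudaMalloc(&d, sizeof(h)) != cudaSuccess) return false;

    cudaError_t err = cudaMemcpy(d, h, sizeof(h), cudaMemcpyHostToDevice);
    if (err == cudaSuccess) {
        reverseInBlock<<<1, BLOCK_SIZE>>>(d);
        err = cudaGetLastError();
    }
    if (err == cudaSuccess)
        err = cudaMemcpy(h, d, sizeof(h), cudaMemcpyDeviceToHost);
    cudaFree(d);
    if (err != cudaSuccess) return false;

    for (int i = 0; i < BLOCK_SIZE; ++i)
        if (h[i] != BLOCK_SIZE - 1 - i) return false;
    return true;
}
```

Without the barrier, a thread could read tile[BLOCK_SIZE - 1 - t] before the owning thread has written it, which is exactly the read-after-write hazard the slide warns about.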

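7 Device Runtime
  Atomic functions
    Guaranteed to be performed without interference from other threads
    The memory address is locked for the duration of the operation
    Supported by devices of compute capability 1.1 and above
    Mostly operate on integers only
    Listed in Appendix C of the Programming Guide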
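
A shared counter is a typical use of an atomic function. The sketch below counts even numbers with atomicAdd on an int in global memory; the kernel countEvenKernel and the helper countEven are hypothetical names, and the input 0..n-1 is chosen only so the expected count is easy to reason about.

```cuda
#include <vector>
#include <cuda_runtime.h>

// Every thread that sees an even value increments one global counter.
__global__ void countEvenKernel(const int* values, int n, int* counter)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && values[i] % 2 == 0)
        atomicAdd(counter, 1);  // read-modify-write without interference
}

// Hypothetical helper (assumes n > 0): counts the even numbers in 0..n-1
// on the device. Returns -1 on a CUDA error.
int countEven(int n)
{
    std::vector<int> h(n);
    for (int i = 0; i < n; ++i) h[i] = i;

    int *dValues = nullptr, *dCounter = nullptr;
    cudaError_t err = cudaMalloc(&dValues, n * sizeof(int));
    if (err == cudaSuccess) err = cudaMalloc(&dCounter, sizeof(int));
    if (err == cudaSuccess)
        err = cudaMemcpy(dValues, h.data(), n * sizeof(int),
                         cudaMemcpyHostToDevice);
    if (err == cudaSuccess) err = cudaMemset(dCounter, 0, sizeof(int));

    if (err == cudaSuccess) {
        int threads = 128;
        int blocks = (n + threads - 1) / threads;
        countEvenKernel<<<blocks, threads>>>(dValues, n, dCounter);
        err = cudaGetLastError();
    }

    int result = -1;
    if (err == cudaSuccess)
        err = cudaMemcpy(&result, dCounter, sizeof(int),
                         cudaMemcpyDeviceToHost);

    cudaFree(dValues);
    cudaFree(dCounter);
    return err == cudaSuccess ? result : -1;
}
```

Replacing atomicAdd(counter, 1) with a plain (*counter)++ would let concurrent threads overwrite each other's updates, so the count would usually come out too low.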

8 Device Runtime
  Warp vote functions
    Supported by devices of compute capability 1.2 and above
    Check a condition on all threads in a warp
    int __all(int predicate)
      True (non-zero) if predicate is true for all threads of the warp
    int __any(int predicate)
      True (non-zero) if predicate is true for any thread of the warp
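
The behaviour of the two vote functions can be tried with a single warp. One caveat: this sketch deliberately swaps in __all_sync and __any_sync, the mask-taking variants available since CUDA 9, because the unsuffixed __all and __any named on the slide are not available on recent architectures. The kernel voteKernel, the helper warpVote, and its bit-encoded return value are hypothetical.

```cuda
#include <cuda_runtime.h>

// Launched with one warp (32 threads). The predicate is
// "my thread index is below threshold".
__global__ void voteKernel(int threshold, int* results)
{
    int predicate = threadIdx.x < threshold;
    int all = __all_sync(0xffffffffu, predicate);  // non-zero if true for every lane
    int any = __any_sync(0xffffffffu, predicate);  // non-zero if true for some lane
    if (threadIdx.x == 0) {
        results[0] = all;
        results[1] = any;
    }
}

// Hypothetical helper: returns 2*(all != 0) + (any != 0), or -1 on error.
int warpVote(int threshold)
{
    int* d = nullptr;
    if (cudaMalloc(&d, 2 * sizeof(int)) != cudaSuccess) return -1;

    voteKernel<<<1, 32>>>(threshold, d);
    cudaError_t err = cudaGetLastError();

    int h[2] = {0, 0};
    if (err == cudaSuccess)
        err = cudaMemcpy(h, d, sizeof(h), cudaMemcpyDeviceToHost);
    cudaFree(d);
    if (err != cudaSuccess) return -1;

    return (h[0] ? 2 : 0) + (h[1] ? 1 : 0);
}
```

With this encoding, a threshold of 32 makes the predicate true on every lane (both votes set), a threshold of 1 makes it true on lane 0 only (any set, all clear), and a threshold of 0 makes it false everywhere.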

9 Device Runtime
  Texture functions: fetching textures, or texturing
    Texture data may be stored in linear memory or in CUDA arrays
    Texturing from linear memory
      template<class Type>
      Type tex1Dfetch(texture<Type, 1, cudaReadModeElementType> texRef, int x);
      float tex1Dfetch(texture<Type, 1, cudaReadModeNormalizedFloat> texRef, int x);

10 Device Runtime
  Texture functions: fetching textures, or texturing
    Texturing from linear memory
      Type can be any of the supported 1-, 2-, or 4-component vector types
      template<class Type>
      Type tex1Dfetch(texture<Type, 1, cudaReadModeElementType> texRef, int x);
      float4 tex1Dfetch(texture<uchar4, 1, cudaReadModeNormalizedFloat> texRef, int x);

11 Device Runtime
  Texture functions: fetching textures, or texturing
    Texturing from linear memory
      No addressing modes supported
      No texture filtering supported

12 Device Runtime
  Texture functions: fetching textures, or texturing
    Texturing from CUDA arrays
      template<class Type, enum cudaTextureReadMode readMode>
      Type tex1D(texture<Type, 1, readMode> texRef, float x);
      template<class Type, enum cudaTextureReadMode readMode>
      Type tex2D(texture<Type, 2, readMode> texRef, float x, float y);
      template<class Type, enum cudaTextureReadMode readMode>
      Type tex3D(texture<Type, 3, readMode> texRef, float x, float y, float z);

13 Device Runtime
  Texture functions: fetching textures, or texturing
    Texturing from CUDA arrays
      Run-time attributes determine
        Coordinate normalization
        Addressing mode (clamp/wrap)
        Filtering

14 CUDA Runtime
  Common
  Device
  Host

15 Host Runtime
  Can only be used by host functions
  Composed of 2 APIs
    High-level CUDA runtime API, which runs on top of the
    Low-level CUDA driver API
  No mixing: an application should use either one or the other

16 Each API provides functions for
  Device management
  Context management
  Memory management
  Code module management
  Execution control
  Texture reference management
  OpenGL/Direct3D interoperability

17 The CUDA runtime API implicitly provides
  Initialization
  Context management
  Module management
The CUDA driver API does not, and is harder to program

18 Recall: nvcc parses an input source file
  Separates device and host code
  Device code is compiled to a cubin object
  Generated host code, in C, is compiled by an external tool

19 Generated host code
  Is in C format
  Includes the cubin object
Applications may
  Ignore the host code and run the cubin object directly using the low-level CUDA driver API
  Link to the generated host code and launch it using the high-level CUDA runtime API

20 The CUDA driver API
  Is harder to program
  Offers greater control
  Does not depend on C
  Does not offer device emulation

21 Naming
  CUDA runtime functions and other entry points are prefixed by cuda
  CUDA driver functions and other entry points are prefixed by cu

22 Detour
  Device memory is always allocated as either of
    Linear memory
    CUDA arrays

23 Detour
  Linear memory in device
    Contiguous segment of memory
    32-bit addresses
    Can be referenced using pointers

24 Detour
  CUDA arrays
    Opaque memory layout
    1D/2D/3D arrays of 1-, 2-, or 4-component vectors of 8/16/32-bit integers or 16/32-bit floats
      16-bit floats from the driver API only
    Optimized for texture fetching
    Accessible from kernels through texture fetches only

25 Both the CUDA runtime and CUDA driver APIs
  Can access device information
  Enable the host to read/write to linear memory/CUDA arrays
    With support for pinned memory

26 Both the CUDA runtime and CUDA driver APIs
  Can access device information
  Enable the host to read/write to linear memory/CUDA arrays
    With support for pinned memory
  Provide OpenGL/Direct3D interoperability
  Provide management for asynchronous execution

27 Asynchronous functions
  Kernel launches, and some others
    Async memory copies
    Device <-> device memory copies
    Memory setting
  Concurrent execution of functions is managed through streams

28 Streams
  A queue of operations
  An application may have multiple stream objects simultaneously
  A kernel can be scheduled to execute on a stream
    kernel<<<ng, nb, ns, s>>>
    (grid size, block size, bytes of dynamic shared memory, stream)
  Some memory copy functions can also be queued on a stream

29 Streams
  If no stream is specified, stream 0 is used by default
  Operations within a stream are executed in order
    The previous operation in the stream has to end before the next one begins

30 CUDA runtime and driver APIs provide execution control through stream management
  <cu/cuda>StreamQuery()
    Is the stream free?
  <cu/cuda>StreamSynchronize()
    Wait for the stream's operations to end

31 CUDA runtime and driver APIs provide execution control through stream management
  cudaThreadSynchronize() / cuCtxSynchronize()
    Wait for all streams to be free
  <cu/cuda>StreamDestroy()
    Wait for the stream to get free
    Destroy the stream
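
The stream operations above can be combined as in the following runtime-API sketch: two streams each queue a host-to-device copy, a kernel, and a device-to-host copy, and the host then waits on each stream. The kernel addOne and the helper runTwoStreams are hypothetical names; pinned host buffers from cudaMallocHost are used because asynchronous copies need page-locked memory to be truly asynchronous.

```cuda
#include <cuda_runtime.h>

__global__ void addOne(int* data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1;
}

// Hypothetical helper (assumes n > 0): runs copy-in, kernel, copy-out on
// two streams and checks the results. Returns false on any error.
bool runTwoStreams(int n)
{
    const int kStreams = 2;
    cudaStream_t stream[kStreams] = {};
    int* hBuf[kStreams] = {nullptr, nullptr};
    int* dBuf[kStreams] = {nullptr, nullptr};
    size_t bytes = n * sizeof(int);
    bool ok = true;
    int created = 0;  // number of streams that must be destroyed

    for (int s = 0; s < kStreams && ok; ++s) {
        ok = cudaStreamCreate(&stream[s]) == cudaSuccess;
        if (ok) ++created;
        ok = ok && cudaMallocHost(&hBuf[s], bytes) == cudaSuccess;  // pinned
        ok = ok && cudaMalloc(&dBuf[s], bytes) == cudaSuccess;
        for (int i = 0; ok && i < n; ++i) hBuf[s][i] = i + s;
    }

    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    for (int s = 0; s < kStreams && ok; ++s) {
        // All three operations are queued on stream[s] and run in order there.
        cudaMemcpyAsync(dBuf[s], hBuf[s], bytes, cudaMemcpyHostToDevice, stream[s]);
        addOne<<<blocks, threads, 0, stream[s]>>>(dBuf[s], n);
        cudaMemcpyAsync(hBuf[s], dBuf[s], bytes, cudaMemcpyDeviceToHost, stream[s]);
    }

    for (int s = 0; s < kStreams && ok; ++s) {
        ok = cudaStreamSynchronize(stream[s]) == cudaSuccess;   // wait for stream
        ok = ok && cudaStreamQuery(stream[s]) == cudaSuccess;   // stream is now free
        for (int i = 0; ok && i < n; ++i) ok = (hBuf[s][i] == i + s + 1);
    }

    for (int s = 0; s < kStreams; ++s) {
        if (dBuf[s]) cudaFree(dBuf[s]);
        if (hBuf[s]) cudaFreeHost(hBuf[s]);
        if (s < created) cudaStreamDestroy(stream[s]);
    }
    return ok;
}
```

Whether the two streams actually overlap depends on the device; the sketch only demonstrates the queuing and synchronization calls, not a measured speed-up.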

32 Accurate timing using events
  CUevent / cudaEvent_t start, stop;
  <cu/cuda>EventCreate(&start);
  <cu/cuda>EventCreate(&stop);
  Events have to be recorded
    <cu/cuda>EventRecord(start, 0); // asynchronous
    // stuff to time
    <cu/cuda>EventRecord(stop, 0); // asynchronous
  The second argument is the stream
    Stream 0: the event is recorded after all preceding operations from all streams
    Stream N: the event is recorded after the preceding operations in stream N

33 Accurate timing using events
  <cu/cuda>EventRecord(start, 0); // asynchronous
  // stuff to time
  <cu/cuda>EventRecord(stop, 0); // asynchronous
  <cu/cuda>EventSynchronize(stop);
  float time;
  <cu/cuda>EventElapsedTime(&time, start, stop);
    As the call to record is asynchronous, the event has to be synchronized before timing
  <cu/cuda>EventDestroy(start);
  <cu/cuda>EventDestroy(stop);
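
Written out for the runtime API only, the event sequence from these two slides looks roughly as follows. This is a sketch with assumed names: busyKernel stands in for "stuff to time", and timeKernelMs is a hypothetical helper. cudaEventElapsedTime reports the time in milliseconds.

```cuda
#include <cuda_runtime.h>

// Placeholder for the work to be timed.
__global__ void busyKernel(float* data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = data[i] * 2.0f + 1.0f;
}

// Hypothetical helper (assumes n > 0): returns the kernel's elapsed time
// in milliseconds, or -1.0f on a CUDA error.
float timeKernelMs(int n)
{
    float* d = nullptr;
    if (cudaMalloc(&d, n * sizeof(float)) != cudaSuccess) return -1.0f;
    cudaMemset(d, 0, n * sizeof(float));

    cudaEvent_t start, stop;
    if (cudaEventCreate(&start) != cudaSuccess) { cudaFree(d); return -1.0f; }
    if (cudaEventCreate(&stop) != cudaSuccess) {
        cudaEventDestroy(start);
        cudaFree(d);
        return -1.0f;
    }

    int threads = 256;
    int blocks = (n + threads - 1) / threads;

    cudaEventRecord(start, 0);             // asynchronous
    busyKernel<<<blocks, threads>>>(d, n); // stuff to time
    cudaEventRecord(stop, 0);              // asynchronous

    // Block the host until the stop event has actually been recorded.
    cudaError_t err = cudaEventSynchronize(stop);

    float ms = -1.0f;
    if (err == cudaSuccess) err = cudaEventElapsedTime(&ms, start, stop);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d);
    return err == cudaSuccess ? ms : -1.0f;
}
```

Reading the elapsed time without the cudaEventSynchronize call would fail with a "not ready" error whenever the kernel is still running, which is why the slide insists on synchronizing first.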

34 Asynchronous execution can get confusing
  Can be switched off
    Useful for debugging
    Set the environment variable CUDA_LAUNCH_BLOCKING to 1

35 Device Initialization
  CUDA runtime API
    Automatically with the first function call
  CUDA driver API
    cuInit() MUST be called before calling any other API function
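
For the driver API, explicit initialization might look like the sketch below, which calls cuInit before anything else and then lists the available devices. The helper name driverDeviceCount is hypothetical; the program has to be linked against the driver library (for example with -lcuda).

```cuda
#include <cstdio>
#include <cuda.h>

// Hypothetical helper: initializes the driver API, prints the name of
// every device, and returns the device count (-1 on error).
int driverDeviceCount()
{
    // Must precede every other driver API call; the flags argument must be 0.
    if (cuInit(0) != CUDA_SUCCESS) return -1;

    int devCount = 0;
    if (cuDeviceGetCount(&devCount) != CUDA_SUCCESS) return -1;

    for (int dev = 0; dev < devCount; ++dev) {  // device ordinals start at 0
        CUdevice device;
        char name[256];
        if (cuDeviceGet(&device, dev) == CUDA_SUCCESS &&
            cuDeviceGetName(name, static_cast<int>(sizeof(name)), device) == CUDA_SUCCESS) {
            std::printf("Device %d: %s\n", dev, name);
        }
    }
    return devCount;
}
```

The equivalent runtime-API program needs no initialization call at all, which illustrates the slide's point that the runtime API initializes implicitly.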

36 Device Management
  cudaDeviceProp / CUdevice device;
  int devCount;
  cudaGetDeviceCount(&devCount) / cuDeviceGetCount(&devCount)
  for dev = 0 to devCount - 1 do
    cudaGetDeviceProperties(&device, dev) / cuDeviceGet(&device, dev)

37 Device Management
  cudaSetDevice()
    Sets the device to be used
    MUST be set before calling any __global__ function
    Device 0 is used by default
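
The runtime-API side of device management can be sketched as follows: count the devices, print a few properties of each, and select one explicitly. The helper name listAndSelectDevice is hypothetical, and selecting device 0 simply mirrors the default mentioned above.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: prints name and compute capability of every device,
// selects device 0 if one exists, and returns the device count
// (0 if none is available or the query fails).
int listAndSelectDevice()
{
    int devCount = 0;
    if (cudaGetDeviceCount(&devCount) != cudaSuccess) return 0;

    for (int dev = 0; dev < devCount; ++dev) {  // device indices start at 0
        cudaDeviceProp prop;
        if (cudaGetDeviceProperties(&prop, dev) == cudaSuccess) {
            std::printf("Device %d: %s, compute capability %d.%d\n",
                        dev, prop.name, prop.major, prop.minor);
        }
    }

    // Select the device before launching any __global__ function.
    if (devCount > 0) cudaSetDevice(0);
    return devCount;
}
```

The compute capability printed here is what decides whether features such as atomic functions (1.1 and above) or warp vote functions (1.2 and above) are available on a given device.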

38 Stream Management
  CUstream / cudaStream_t st;
  cudaStreamCreate(&st); / cuStreamCreate(&st, 0);
  cudaStreamDestroy(st); // takes the stream itself, not its address

40 Event management
  CUevent / cudaEvent_t start, stop;
  <cu/cuda>EventCreate(&start);
  <cu/cuda>EventCreate(&stop);
  <cu/cuda>EventRecord(start, 0); // asynchronous
  // stuff to time
  <cu/cuda>EventRecord(stop, 0); // asynchronous
  <cu/cuda>EventSynchronize(stop);
  float time;
  <cu/cuda>EventElapsedTime(&time, start, stop);
  <cu/cuda>EventDestroy(start);
  <cu/cuda>EventDestroy(stop);

41 All for today
  Next time
    More on the host runtime APIs
      Memory, stream, event, texture management
      Debug mode for runtime API
      Context, module, execution control for driver API
    Performance & Optimization

42 See you next week!

Programming with CUDA, WS09

Programming with CUDA, WS09 Programming with CUDA and Parallel Algorithms Waqar Saleem Jens Müller Lecture 7.5 Thursday, 19 November, 2009 Recap CUDA texture memory commands Today CUDA driver API Runtime and Driver APIs Two interfaces

More information

Programming with CUDA

Programming with CUDA Programming with CUDA Jens K. Mueller jkm@informatik.uni-jena.de Department of Mathematics and Computer Science Friedrich-Schiller-University Jena Tuesday 19 th April, 2011 Today s lecture: Synchronization

More information

CMPSCI 691AD General Purpose Computation on the GPU

CMPSCI 691AD General Purpose Computation on the GPU CMPSCI 691AD General Purpose Computation on the GPU Spring 2009 Lecture 5: Quantitative Analysis of Parallel Algorithms Rui Wang (cont. from last lecture) Device Management Context Management Module Management

More information

NVIDIA CUDA Compute Unified Device Architecture

NVIDIA CUDA Compute Unified Device Architecture NVIDIA CUDA Compute Unified Device Architecture Programming Guide Version 0.8 2/12/2007 ii CUDA Programming Guide Version 0.8 Table of Contents Chapter 1. Introduction to CUDA... 1 1.1 The Graphics Processor

More information

NVIDIA GPU CODING & COMPUTING

NVIDIA GPU CODING & COMPUTING NVIDIA GPU CODING & COMPUTING WHY GPU S? ARCHITECTURE & PROGRAM MODEL CPU v. GPU Multiprocessor Model Memory Model Memory Model: Thread Level Programing Model: Logical Mapping of Threads Programing Model:

More information

CS/EE 217 GPU Architecture and Parallel Programming. Lecture 17: Data Transfer and CUDA Streams

CS/EE 217 GPU Architecture and Parallel Programming. Lecture 17: Data Transfer and CUDA Streams CS/EE 217 GPU Architecture and Parallel Programming Lecture 17: Data fer and CUDA Streams Objective To learn more advanced features of the CUDA APIs for data transfer and kernel launch Task parallelism

More information

Lecture 10. Efficient Host-Device Data Transfer

Lecture 10. Efficient Host-Device Data Transfer 1 Lecture 10 Efficient Host-Device Data fer 2 Objective To learn the important concepts involved in copying (transferring) data between host and device System Interconnect Direct Memory Access Pinned memory

More information

CUDA Optimizations WS Intelligent Robotics Seminar. Universität Hamburg WS Intelligent Robotics Seminar Praveen Kulkarni

CUDA Optimizations WS Intelligent Robotics Seminar. Universität Hamburg WS Intelligent Robotics Seminar Praveen Kulkarni CUDA Optimizations WS 2014-15 Intelligent Robotics Seminar 1 Table of content 1 Background information 2 Optimizations 3 Summary 2 Table of content 1 Background information 2 Optimizations 3 Summary 3

More information

CS179 GPU Programming: CUDA Memory. Lecture originally by Luke Durant and Tamas Szalay

CS179 GPU Programming: CUDA Memory. Lecture originally by Luke Durant and Tamas Szalay : CUDA Memory Lecture originally by Luke Durant and Tamas Szalay CUDA Memory Review of Memory Spaces Memory syntax Constant Memory Allocation Issues Global Memory Gotchas Shared Memory Gotchas Texture

More information

Tesla Architecture, CUDA and Optimization Strategies

Tesla Architecture, CUDA and Optimization Strategies Tesla Architecture, CUDA and Optimization Strategies Lan Shi, Li Yi & Liyuan Zhang Hauptseminar: Multicore Architectures and Programming Page 1 Outline Tesla Architecture & CUDA CUDA Programming Optimization

More information

CUDA Performance Optimization. Patrick Legresley

CUDA Performance Optimization. Patrick Legresley CUDA Performance Optimization Patrick Legresley Optimizations Kernel optimizations Maximizing global memory throughput Efficient use of shared memory Minimizing divergent warps Intrinsic instructions Optimizations

More information

Data Transfer and CUDA Streams Data Transfer and CUDA Streams

Data Transfer and CUDA Streams Data Transfer and CUDA Streams Data fer and CUDA Streams Data fer and CUDA Streams Data fer and CUDA Streams Objective Ø To learn more advanced features of the CUDA APIs for data transfer and kernel launch Ø Task parallelism for overlapping

More information

Lecture 6: odds and ends

Lecture 6: odds and ends Lecture 6: odds and ends Prof. Mike Giles mike.giles@maths.ox.ac.uk Oxford University Mathematical Institute Oxford e-research Centre Lecture 6 p. 1 Overview synchronicity multiple streams and devices

More information

Tuning CUDA Applications for Fermi. Version 1.2

Tuning CUDA Applications for Fermi. Version 1.2 Tuning CUDA Applications for Fermi Version 1.2 7/21/2010 Next-Generation CUDA Compute Architecture Fermi is NVIDIA s next-generation CUDA compute architecture. The Fermi whitepaper [1] gives a detailed

More information

Debugging and Optimization strategies

Debugging and Optimization strategies Debugging and Optimization strategies Philip Blakely Laboratory for Scientific Computing, Cambridge Philip Blakely (LSC) Optimization 1 / 25 Writing a correct CUDA code You should start with a functional

More information

CUDA. Schedule API. Language extensions. nvcc. Function type qualifiers (1) CUDA compiler to handle the standard C extensions.

CUDA. Schedule API. Language extensions. nvcc. Function type qualifiers (1) CUDA compiler to handle the standard C extensions. Schedule CUDA Digging further into the programming manual Application Programming Interface (API) text only part, sorry Image utilities (simple CUDA examples) Performace considerations Matrix multiplication

More information

CUDA Programming. Week 5. Asynchronized execution, Instructions, and CUDA driver API

CUDA Programming. Week 5. Asynchronized execution, Instructions, and CUDA driver API CUDA Programming Week 5. Asynchronized execution, Instructions, and CUDA driver API Outline Asynchronized Transfers Instruction optimization CUDA driver API Homework ASYNCHRONIZED TRANSFER Asynchronous

More information

CUDA%Asynchronous%Memory%Usage%and%Execu6on% Yukai&Hung& Department&of&Mathema>cs& Na>onal&Taiwan&University

CUDA%Asynchronous%Memory%Usage%and%Execu6on% Yukai&Hung& Department&of&Mathema>cs& Na>onal&Taiwan&University CUDA%Asynchronous%Memory%Usage%and%Execu6on% Yukai&Hung& a0934147@gmail.com Department&of&Mathema>cs& Na>onal&Taiwan&University Page8Locked%Memory% Page8Locked%Memory%! &Regular&pageable&and&pageIlocked&or&pinned&host&memory&

More information

Overview. Lecture 6: odds and ends. Synchronicity. Warnings. synchronicity. multiple streams and devices. multiple GPUs. other odds and ends

Overview. Lecture 6: odds and ends. Synchronicity. Warnings. synchronicity. multiple streams and devices. multiple GPUs. other odds and ends Overview Lecture 6: odds and ends Prof. Mike Giles mike.giles@maths.ox.ac.uk Oxford University Mathematical Institute Oxford e-research Centre synchronicity multiple streams and devices multiple GPUs other

More information

CUDA programming. CUDA requirements. CUDA Querying. CUDA Querying. A CUDA-capable GPU (NVIDIA) NVIDIA driver A CUDA SDK

CUDA programming. CUDA requirements. CUDA Querying. CUDA Querying. A CUDA-capable GPU (NVIDIA) NVIDIA driver A CUDA SDK CUDA programming Bedrich Benes, Ph.D. Purdue University Department of Computer Graphics CUDA requirements A CUDA-capable GPU (NVIDIA) NVIDIA driver A CUDA SDK Standard C compiler http://www.nvidia.com/cuda

More information

NVIDIA CUDA Compute Unified Device Architecture

NVIDIA CUDA Compute Unified Device Architecture NVIDIA CUDA Compute Unified Device Architecture Programming Guide Version 2.0 6/7/2008 ii CUDA Programming Guide Version 2.0 Table of Contents Chapter 1. Introduction...1 1.1 CUDA: A Scalable Parallel

More information

Introduction to CUDA

Introduction to CUDA Introduction to CUDA Overview HW computational power Graphics API vs. CUDA CUDA glossary Memory model, HW implementation, execution Performance guidelines CUDA compiler C/C++ Language extensions Limitations

More information

Pinned-Memory. Table of Contents. Streams Learning CUDA to Solve Scientific Problems. Objectives. Technical Issues Stream. Pinned-memory.

Pinned-Memory. Table of Contents. Streams Learning CUDA to Solve Scientific Problems. Objectives. Technical Issues Stream. Pinned-memory. Table of Contents Streams Learning CUDA to Solve Scientific Problems. 1 Objectives Miguel Cárdenas Montes Centro de Investigaciones Energéticas Medioambientales y Tecnológicas, Madrid, Spain miguel.cardenas@ciemat.es

More information

NVIDIA CUDA Compute Unified Device Architecture

NVIDIA CUDA Compute Unified Device Architecture NVIDIA CUDA Compute Unified Device Architecture Programming Guide Version 1.0 6/23/2007 ii CUDA Programming Guide Version 1.0 Table of Contents Chapter 1. Introduction to CUDA... 1 1.1 The Graphics Processor

More information

CUDA PROGRAMMING MODEL. Carlo Nardone Sr. Solution Architect, NVIDIA EMEA

CUDA PROGRAMMING MODEL. Carlo Nardone Sr. Solution Architect, NVIDIA EMEA CUDA PROGRAMMING MODEL Carlo Nardone Sr. Solution Architect, NVIDIA EMEA CUDA: COMMON UNIFIED DEVICE ARCHITECTURE Parallel computing architecture and programming model GPU Computing Application Includes

More information

CUDA Odds and Ends. Joseph Kider University of Pennsylvania CIS Fall 2011

CUDA Odds and Ends. Joseph Kider University of Pennsylvania CIS Fall 2011 CUDA Odds and Ends Joseph Kider University of Pennsylvania CIS 565 - Fall 2011 Sources Patrick Cozzi Spring 2011 NVIDIA CUDA Programming Guide CUDA by Example Programming Massively Parallel Processors

More information

CSE 591: GPU Programming. Programmer Interface. Klaus Mueller. Computer Science Department Stony Brook University

CSE 591: GPU Programming. Programmer Interface. Klaus Mueller. Computer Science Department Stony Brook University CSE 591: GPU Programming Programmer Interface Klaus Mueller Computer Science Department Stony Brook University Compute Levels Encodes the hardware capability of a GPU card newer cards have higher compute

More information

CUDA Programming Model

CUDA Programming Model CUDA Xing Zeng, Dongyue Mou Introduction Example Pro & Contra Trend Introduction Example Pro & Contra Trend Introduction What is CUDA? - Compute Unified Device Architecture. - A powerful parallel programming

More information

Programming with CUDA, WS09

Programming with CUDA, WS09 Programming with CUDA and Parallel Algorithms Waqar Saleem Jens Müller Lecture 3 Thursday, 29 Nov, 2009 Recap Motivational videos Example kernel Thread IDs Memory overhead CUDA hardware and programming

More information

CUDA OPTIMIZATIONS ISC 2011 Tutorial

CUDA OPTIMIZATIONS ISC 2011 Tutorial CUDA OPTIMIZATIONS ISC 2011 Tutorial Tim C. Schroeder, NVIDIA Corporation Outline Kernel optimizations Launch configuration Global memory throughput Shared memory access Instruction throughput / control

More information

CS 179: GPU Computing. Lecture 2: The Basics

CS 179: GPU Computing. Lecture 2: The Basics CS 179: GPU Computing Lecture 2: The Basics Recap Can use GPU to solve highly parallelizable problems Performance benefits vs. CPU Straightforward extension to C language Disclaimer Goal for Week 1: Fast-paced

More information

CUDA Odds and Ends. Administrivia. Administrivia. Agenda. Patrick Cozzi University of Pennsylvania CIS Spring Assignment 5.

CUDA Odds and Ends. Administrivia. Administrivia. Agenda. Patrick Cozzi University of Pennsylvania CIS Spring Assignment 5. Administrivia CUDA Odds and Ends Patrick Cozzi University of Pennsylvania CIS 565 - Spring 2011 Assignment 5 Handed out Wednesday, 03/16 Due Friday, 03/25 Project One page pitch due Sunday, 03/20, at 11:59pm

More information

Fundamental CUDA Optimization. NVIDIA Corporation

Fundamental CUDA Optimization. NVIDIA Corporation Fundamental CUDA Optimization NVIDIA Corporation Outline Fermi/Kepler Architecture Kernel optimizations Launch configuration Global memory throughput Shared memory access Instruction throughput / control

More information

Fundamental CUDA Optimization. NVIDIA Corporation

Fundamental CUDA Optimization. NVIDIA Corporation Fundamental CUDA Optimization NVIDIA Corporation Outline! Fermi Architecture! Kernel optimizations! Launch configuration! Global memory throughput! Shared memory access! Instruction throughput / control

More information

GPU Fundamentals Jeff Larkin November 14, 2016

GPU Fundamentals Jeff Larkin November 14, 2016 GPU Fundamentals Jeff Larkin , November 4, 206 Who Am I? 2002 B.S. Computer Science Furman University 2005 M.S. Computer Science UT Knoxville 2002 Graduate Teaching Assistant 2005 Graduate

More information

Dynamic Cuda with F# HPC GPU & F# Meetup. March 19. San Jose, California

Dynamic Cuda with F# HPC GPU & F# Meetup. March 19. San Jose, California Dynamic Cuda with F# HPC GPU & F# Meetup March 19 San Jose, California Dr. Daniel Egloff daniel.egloff@quantalea.net +41 44 520 01 17 +41 79 430 03 61 About Us! Software development and consulting company!

More information

Practical Introduction to CUDA and GPU

Practical Introduction to CUDA and GPU Practical Introduction to CUDA and GPU Charlie Tang Centre for Theoretical Neuroscience October 9, 2009 Overview CUDA - stands for Compute Unified Device Architecture Introduced Nov. 2006, a parallel computing

More information

Efficient Data Transfers

Efficient Data Transfers Efficient Data fers Slide credit: Slides adapted from David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2016 PCIE Review Typical Structure of a CUDA Program Global variables declaration Function prototypes global

More information

CS 179: GPU Programming LECTURE 5: GPU COMPUTE ARCHITECTURE FOR THE LAST TIME

CS 179: GPU Programming LECTURE 5: GPU COMPUTE ARCHITECTURE FOR THE LAST TIME CS 179: GPU Programming LECTURE 5: GPU COMPUTE ARCHITECTURE FOR THE LAST TIME 1 Last time... GPU Memory System Different kinds of memory pools, caches, etc Different optimization techniques 2 Warp Schedulers

More information

CUDA Performance Optimization Mark Harris NVIDIA Corporation

CUDA Performance Optimization Mark Harris NVIDIA Corporation CUDA Performance Optimization Mark Harris NVIDIA Corporation Outline Overview Hardware Memory Optimizations Execution Configuration Optimizations Instruction Optimizations Summary Optimize Algorithms for

More information

Mathematical computations with GPUs

Mathematical computations with GPUs Master Educational Program Information technology in applications Mathematical computations with GPUs CUDA Alexey A. Romanenko arom@ccfit.nsu.ru Novosibirsk State University CUDA - Compute Unified Device

More information

ECE 574 Cluster Computing Lecture 15

ECE 574 Cluster Computing Lecture 15 ECE 574 Cluster Computing Lecture 15 Vince Weaver http://web.eece.maine.edu/~vweaver vincent.weaver@maine.edu 30 March 2017 HW#7 (MPI) posted. Project topics due. Update on the PAPI paper Announcements

More information

CUDA Performance Considerations (2 of 2) Varun Sampath Original Slides by Patrick Cozzi University of Pennsylvania CIS Spring 2012

CUDA Performance Considerations (2 of 2) Varun Sampath Original Slides by Patrick Cozzi University of Pennsylvania CIS Spring 2012 CUDA Performance Considerations (2 of 2) Varun Sampath Original Slides by Patrick Cozzi University of Pennsylvania CIS 565 - Spring 2012 Agenda Instruction Optimizations Mixed Instruction Types Loop Unrolling

More information

OpenACC Course. Office Hour #2 Q&A

OpenACC Course. Office Hour #2 Q&A OpenACC Course Office Hour #2 Q&A Q1: How many threads does each GPU core have? A: GPU cores execute arithmetic instructions. Each core can execute one single precision floating point instruction per cycle

More information

GPU Programming Using CUDA. Samuli Laine NVIDIA Research

GPU Programming Using CUDA. Samuli Laine NVIDIA Research GPU Programming Using CUDA Samuli Laine NVIDIA Research Today GPU vs CPU Different architecture, different workloads Basics of CUDA Executing code on GPU Managing memory between CPU and GPU CUDA API Quick

More information

Zero-copy. Table of Contents. Multi-GPU Learning CUDA to Solve Scientific Problems. Objectives. Technical Issues Zero-copy. Multigpu.

Zero-copy. Table of Contents. Multi-GPU Learning CUDA to Solve Scientific Problems. Objectives. Technical Issues Zero-copy. Multigpu. Table of Contents Multi-GPU Learning CUDA to Solve Scientific Problems. 1 Objectives Miguel Cárdenas Montes 2 Zero-copy Centro de Investigaciones Energéticas Medioambientales y Tecnológicas, Madrid, Spain

More information

Fundamental Optimizations in CUDA Peng Wang, Developer Technology, NVIDIA

Fundamental Optimizations in CUDA Peng Wang, Developer Technology, NVIDIA Fundamental Optimizations in CUDA Peng Wang, Developer Technology, NVIDIA Optimization Overview GPU architecture Kernel optimization Memory optimization Latency optimization Instruction optimization CPU-GPU

More information

An Introduction to GPGPU Pro g ra m m ing - CUDA Arc hitec ture

An Introduction to GPGPU Pro g ra m m ing - CUDA Arc hitec ture An Introduction to GPGPU Pro g ra m m ing - CUDA Arc hitec ture Rafia Inam Mälardalen Real-Time Research Centre Mälardalen University, Västerås, Sweden http://www.mrtc.mdh.se rafia.inam@mdh.se CONTENTS

More information

Administrative. L8: Writing Correct Programs, cont. and Control Flow. Questions/comments from previous lectures. Outline 2/10/11

Administrative. L8: Writing Correct Programs, cont. and Control Flow. Questions/comments from previous lectures. Outline 2/10/11 Administrative L8 Writing Correct Programs, cont. and Control Flow Next assignment available Goals of assignment simple memory hierarchy management block-thread decomposition tradeoff Due Thursday, Feb.

More information

CUDA Programming. Week 1. Basic Programming Concepts Materials are copied from the reference list

CUDA Programming. Week 1. Basic Programming Concepts Materials are copied from the reference list CUDA Programming Week 1. Basic Programming Concepts Materials are copied from the reference list G80/G92 Device SP: Streaming Processor (Thread Processors) SM: Streaming Multiprocessor 128 SP grouped into

More information

CUDA 2.2 Pinned Memory APIs

CUDA 2.2 Pinned Memory APIs CUDA 2.2 Pinned Memory APIs July 2012 July 2012 ii Table of Contents Table of Contents... 1 1. Overview... 2 1.1 Portable pinned memory : available to all contexts... 3 1.2 Mapped pinned memory : zero-copy...

More information

Class. Windows and CUDA : Shu Guo. Program from last time: Constant memory

Class. Windows and CUDA : Shu Guo. Program from last time: Constant memory Class Windows and CUDA : Shu Guo Program from last time: Constant memory Windows on CUDA Reference: NVIDIA CUDA Getting Started Guide for Microsoft Windows Whiting School has Visual Studio Cuda 5.5 Installer

More information

CUDA Memories. Introduction 5/4/11

CUDA Memories. Introduction 5/4/11 5/4/11 CUDA Memories James Gain, Michelle Kuttel, Sebastian Wyngaard, Simon Perkins and Jason Brownbridge { jgain mkuttel sperkins jbrownbr}@cs.uct.ac.za swyngaard@csir.co.za 3-6 May 2011 Introduction

More information

Fundamental Optimizations

Fundamental Optimizations Fundamental Optimizations Paulius Micikevicius NVIDIA Supercomputing, Tutorial S03 New Orleans, Nov 14, 2010 Outline Kernel optimizations Launch configuration Global memory throughput Shared memory access

More information

Supercomputing, Tutorial S03 New Orleans, Nov 14, 2010

Supercomputing, Tutorial S03 New Orleans, Nov 14, 2010 Fundamental Optimizations Paulius Micikevicius NVIDIA Supercomputing, Tutorial S03 New Orleans, Nov 14, 2010 Outline Kernel optimizations Launch configuration Global memory throughput Shared memory access

More information

CSC266 Introduction to Parallel Computing using GPUs Introduction to CUDA

CSC266 Introduction to Parallel Computing using GPUs Introduction to CUDA CSC266 Introduction to Parallel Computing using GPUs Introduction to CUDA Sreepathi Pai October 18, 2017 URCS Outline Background Memory Code Execution Model Outline Background Memory Code Execution Model

More information

Kernel optimizations Launch configuration Global memory throughput Shared memory access Instruction throughput / control flow

Kernel optimizations Launch configuration Global memory throughput Shared memory access Instruction throughput / control flow Fundamental Optimizations (GTC 2010) Paulius Micikevicius NVIDIA Outline Kernel optimizations Launch configuration Global memory throughput Shared memory access Instruction throughput / control flow Optimization

More information

NVIDIA CUDA. Fermi Compatibility Guide for CUDA Applications. Version 1.1

NVIDIA CUDA. Fermi Compatibility Guide for CUDA Applications. Version 1.1 NVIDIA CUDA Fermi Compatibility Guide for CUDA Applications Version 1.1 4/19/2010 Table of Contents Software Requirements... 1 What Is This Document?... 1 1.1 Application Compatibility on Fermi... 1 1.2

More information

Introduction to OpenCL!

Introduction to OpenCL! Lecture 6! Introduction to OpenCL! John Cavazos! Dept of Computer & Information Sciences! University of Delaware! www.cis.udel.edu/~cavazos/cisc879! OpenCL Architecture Defined in four parts Platform Model

More information

Introduction to CUDA C

Introduction to CUDA C Introduction to CUDA C What will you learn today? Start from Hello, World! Write and launch CUDA C kernels Manage GPU memory Run parallel kernels in CUDA C Parallel communication and synchronization Race

More information

Lecture 6. Programming with Message Passing Message Passing Interface (MPI)

Lecture 6. Programming with Message Passing Message Passing Interface (MPI) Lecture 6 Programming with Message Passing Message Passing Interface (MPI) Announcements 2011 Scott B. Baden / CSE 262 / Spring 2011 2 Finish CUDA Today s lecture Programming with message passing 2011

More information

CS : Many-core Computing with CUDA

CS : Many-core Computing with CUDA CS4402-9535: Many-core Computing with CUDA Marc Moreno Maza University of Western Ontario, London, Ontario (Canada) UWO-CS4402-CS9535 (Moreno Maza) CS4402-9535: Many-core Computing with CUDA UWO-CS4402-CS9535

More information

CUDA-GDB: The NVIDIA CUDA Debugger

CUDA-GDB: The NVIDIA CUDA Debugger CUDA-GDB: The NVIDIA CUDA Debugger User Manual Version 2.2 Beta 3/30/2009 ii CUDA Debugger User Manual Version 2.2 Beta Table of Contents Chapter 1. Introduction... 1 1.1 CUDA-GDB: The NVIDIA CUDA Debugger...1

More information

simcuda: A C++ based CUDA Simulation Framework

simcuda: A C++ based CUDA Simulation Framework Technical Report simcuda: A C++ based CUDA Simulation Framework Abhishek Das and Andreas Gerstlauer UT-CERC-16-01 May 20, 2016 Computer Engineering Research Center Department of Electrical & Computer Engineering

More information

Outline Overview Hardware Memory Optimizations Execution Configuration Optimizations Instruction Optimizations Multi-GPU Summary

Outline Overview Hardware Memory Optimizations Execution Configuration Optimizations Instruction Optimizations Multi-GPU Summary Optimizing CUDA Outline Overview Hardware Memory Optimizations Execution Configuration Optimizations Instruction Optimizations Multi-GPU Summary NVIDIA Corporation 2009 2 Optimize Algorithms for the GPU

More information

CS 179: GPU Programming. Lecture 7

CS 179: GPU Programming. Lecture 7 CS 179: GPU Programming Lecture 7 Week 3 Goals: More involved GPU-accelerable algorithms Relevant hardware quirks CUDA libraries Outline GPU-accelerated: Reduction Prefix sum Stream compaction Sorting(quicksort)

Advanced CUDA Programming. Dr. Timo Stich

Advanced CUDA Programming. Dr. Timo Stich (tstich@nvidia.com). Outline: SIMT Architecture, Warps; Kernel optimizations; Global memory throughput; Launch configuration; Shared memory access; Instruction throughput

High-Performance Data Loading and Augmentation for Deep Neural Network Training

High-Performance Data Loading and Augmentation for Deep Neural Network Training. Trevor Gale (tgale@ece.neu.edu), Steven Eliuk (steven.eliuk@gmail.com), Cameron Upright (c.upright@samsung.com). Roadmap: 1. The General-Purpose

High Performance Computing on GPUs using NVIDIA CUDA

High Performance Computing on GPUs using NVIDIA CUDA. Slides include some material from GPGPU tutorial at SIGGRAPH2007 (http://www.gpgpu.org/s2007). Outline: Motivation, Stream programming, Simplified HW and

CUDA Advanced Techniques 2 Mohamed Zahran (aka Z)

CSCI-GA.3033-004 Graphics Processing Units (GPUs): Architecture and Programming. CUDA Advanced Techniques 2. Mohamed Zahran (aka Z), mzahran@cs.nyu.edu, http://www.mzahran.com Alignment Memory Alignment Memory

Memory concept. Grid concept, Synchronization. GPU Programming. Szénási Sándor.

Memory concept. Grid concept, Synchronization. GPU Programming, http://cuda.nik.uni-obuda.hu Szénási Sándor (szenasi.sandor@nik.uni-obuda.hu), GPU Education Center of Óbuda University. MEMORY CONCEPT Off-chip

CUDA programming model. N. Cardoso & P. Bicudo. Física Computacional (FC5)

CUDA programming model. N. Cardoso & P. Bicudo, Física Computacional (FC5). Outline: 1 CUDA qualifiers 2 CUDA Kernel Thread hierarchy Kernel, configuration

Addressing the Increasing Challenges of Debugging on Accelerated HPC Systems. Ed Hinkel Senior Sales Engineer

Addressing the Increasing Challenges of Debugging on Accelerated HPC Systems. Ed Hinkel, Senior Sales Engineer. Agenda: Overview - Rogue Wave & TotalView, GPU Debugging with TotalView, NVIDIA CUDA, Intel Phi

ECE 574 Cluster Computing Lecture 17

ECE 574 Cluster Computing Lecture 17. Vince Weaver, http://web.eece.maine.edu/~vweaver, vincent.weaver@maine.edu, 28 March 2019. HW#8 (CUDA) posted. Project topics due. Announcements. CUDA installing On Linux

CUDA. Kenjiro Taura

CUDA. Kenjiro Taura. Contents: 1 Overview, 2 CUDA Basics, 3 Kernels, 4 Threads and thread blocks, 5 Moving data between host and device, 6 Data sharing among threads in the device

Advanced CUDA Optimizations. Umar Arshad ArrayFire

Advanced CUDA Optimizations. Umar Arshad (@arshad_umar), ArrayFire (@arrayfire). ArrayFire: World's leading GPU experts. In the industry since 2007. NVIDIA Partner. Deep experience working with thousands of customers

Lecture 11: GPU programming

Lecture 11: GPU programming. David Bindel, 4 Oct 2011. Logistics: Matrix multiply results are ready. Summary on assignments page. My version (and writeup) on CMS. HW 2 due Thursday. Still working on project 2!

Scalable Cluster Computing with NVIDIA GPUs Axel Koehler NVIDIA. NVIDIA Corporation 2012

Scalable Cluster Computing with NVIDIA GPUs. Axel Koehler, NVIDIA. Outline: Introduction to Multi-GPU Programming; Communication for Single Host, Multiple GPUs; Communication for Multiple Hosts, Multiple GPUs

Parallel Programming Principle and Practice. Lecture 9 Introduction to GPGPUs and CUDA Programming Model

Parallel Programming Principle and Practice. Lecture 9: Introduction to GPGPUs and CUDA Programming Model. Outline: Introduction to GPGPUs and CUDA Programming Model; The CUDA Thread Hierarchy / Memory Hierarchy

CS179 GPU Programming Introduction to CUDA. Lecture originally by Luke Durant and Tamas Szalay

Introduction to CUDA. Lecture originally by Luke Durant and Tamas Szalay. Today: CUDA - Why CUDA? - Overview of CUDA architecture - Dense matrix multiplication with CUDA. Shader GPGPU - Before current generation,

Platform Support: Additional OS support: Windows Vista 32-bit, Windows Vista 64-bit

NVIDIA CUDA Windows XP and Vista Release Notes, Version 2.0. New Features. Hardware Support: Additional hardware support: GeForce GTX 280, GeForce GTX 260, GeForce 9800 GX2, GeForce 9800 GTX, GeForce

Advanced Topics: Streams, Multi-GPU, Tools, Libraries, etc.

CSC 391/691: GPU Programming, Fall 2011. Advanced Topics: Streams, Multi-GPU, Tools, Libraries, etc. Copyright 2011 Samuel S. Cho. Streams: Until now, we have largely focused on massively data-parallel execution

Introduction to CUDA C

NVIDIA GPU Technology. Introduction to CUDA C. Samuel Gateau, Seoul, December 16, 2010. Who should you thank for this talk? Jason Sanders, Senior Software Engineer, NVIDIA, Co-author of CUDA by Example. What is

MIC-GPU: High-Performance Computing for Medical Imaging on Programmable Graphics Hardware (GPUs)

MIC-GPU: High-Performance Computing for Medical Imaging on Programmable Graphics Hardware (GPUs). CUDA API. Klaus Mueller, Ziyi Zheng, Eric Papenhausen, Stony Brook University. Function Qualifiers Device Global,

CUDA Toolkit 4.1 CUSPARSE Library. PG-05329-041_v01 January 2012

CUDA Toolkit 4.1 CUSPARSE Library, PG-05329-041_v01, January 2012. Contents: 1 Introduction; 1.1 New and Legacy CUSPARSE API; 1.2 Naming Convention

CSE 599 I Accelerated Computing - Programming GPUS. Parallel Patterns: Graph Search

CSE 599 I Accelerated Computing - Programming GPUS. Parallel Patterns: Graph Search. Objective: Study graph search as a prototypical graph-based algorithm. Learn techniques to mitigate the memory-bandwidth-centric

CUDA Lecture 2. Manfred Liebmann. Technische Universität München Chair of Optimal Control Center for Mathematical Sciences, M17

CUDA Lecture 2. Manfred Liebmann, Technische Universität München, Chair of Optimal Control, Center for Mathematical Sciences, M17, manfred.liebmann@tum.de. December 15, 2015. CUDA Programming Fundamentals CUDA

GPU Programming Using CUDA. Samuli Laine NVIDIA Research

GPU Programming Using CUDA. Samuli Laine, NVIDIA Research. Today: GPU vs CPU: Different architecture, different workloads. Basics of CUDA: Executing code on GPU, Managing memory between CPU and GPU. CUDA API Quick

Lecture 2: CUDA Programming

CS 515 Programming Language and Compilers I. Lecture 2: CUDA Programming. Zheng (Eddy) Zhang, Rutgers University, Fall 2017, 9/12/2017. Review: Programming in CUDA. Let's look at a sequential program in C first:

GPU programming: CUDA basics. Sylvain Collange Inria Rennes Bretagne Atlantique

GPU programming: CUDA basics. Sylvain Collange, Inria Rennes Bretagne Atlantique, sylvain.collange@inria.fr. This lecture: CUDA programming. We have seen some GPU architecture. Now how to program it? Outline

Blocks, Grids, and Shared Memory

Blocks, Grids, and Shared Memory. GPU Course, Fall 2012. Last week: ax+b Homework. Threads, Blocks, Grids: CUDA threads are organized into blocks. Threads operate in SIMD(ish) manner -- each executing same

NVIDIA Fermi Architecture

Administrivia. NVIDIA Fermi Architecture. Patrick Cozzi, University of Pennsylvania, CIS 565 - Spring 2011. Assignment 4 grades returned. Project checkpoint on Monday. Post an update on your blog beforehand. Poster

NVIDIA CUDA DEBUGGER CUDA-GDB. User Manual

CUDA DEBUGGER CUDA-GDB User Manual, PG-00000-004_V2.3, June, 2009. Published by Corporation, 2701 San Tomas Expressway, Santa Clara, CA 95050. Notice: ALL DESIGN SPECIFICATIONS, REFERENCE

Atomic Operations. Atomic operations, fast reduction. GPU Programming. Szénási Sándor.

Atomic Operations. Atomic operations, fast reduction. GPU Programming, http://cuda.nik.uni-obuda.hu Szénási Sándor (szenasi.sandor@nik.uni-obuda.hu), GPU Education Center of Óbuda University. ATOMIC OPERATIONS

Concurrent Kernels and Multiple GPUs

Concurrent Kernels and Multiple GPUs. 1 Page Locked Host Memory: host memory that is page locked or pinned; executing a zero copy. 2 Concurrent Kernels: streams and concurrency; squaring numbers with concurrent

ECE/ME/EMA/CS 759 High Performance Computing for Engineering Applications

ECE/ME/EMA/CS 759 High Performance Computing for Engineering Applications. The NVIDIA GPU Memory Ecosystem. Atomic operations in CUDA. The thrust library. October 7, 2015. Dan Negrut, 2015 ECE/ME/EMA/CS 759

Tesla GPU Computing A Revolution in High Performance Computing

Tesla GPU Computing: A Revolution in High Performance Computing. Gernot Ziegler, Developer Technology (Compute) (Material by Thomas Bradley). Agenda: Tesla GPU Computing, CUDA, Fermi. What is GPU Computing? Introduction

GPU programming. Dr. Bernhard Kainz

GPU programming. Dr. Bernhard Kainz. Overview: About myself, Motivation, GPU hardware and system architecture, GPU programming languages, GPU programming paradigms, Pitfalls and best practice, Reduction and tiling

GPU Computing Workshop CSU Getting Started. Garland Durham Quantos Analytics

GPU Computing Workshop CSU 2013: Getting Started. Garland Durham, Quantos Analytics. nvidia-smi: At command line, run command nvidia-smi to get/set GPU properties. nvidia-smi Options: -q query, -L list attached

CUDA Memory Hierarchy

CUDA Memory Hierarchy. Piotr Danilewski, October 2012, Saarland University. Memory (GTX 690): host memory, main GPU memory (global memory), shared memory, caches, registers. Memory: host memory, GPU global
