CUDA: Schedule, API, Language Extensions, nvcc, Function Type Qualifiers


Schedule
- CUDA: digging further into the programming manual
- Application Programming Interface (API), text-only part, sorry
- Image utilities (simple CUDA examples)
- Performance considerations
- Matrix multiplication example
- SDK browse-through

API
- A minimal set of extensions to the C language that allow the programmer to target portions of the source code for execution on the device
- Runtime library:
  - A host component that provides functions to control and access one or more compute devices from the host, e.g. memory allocation
  - A device component that provides device-specific functions, e.g. thread synchronization
  - A common component that provides built-in vector types and a subset of the C standard library supported in both host and device code

Language extensions
Roughly speaking, only four additions to standard C:
- Function type qualifiers to specify whether a function executes on the host or on the device
- Variable type qualifiers to specify the memory location on the device
- A new directive to specify how a kernel is executed on the device
- Four built-in variables that specify the grid and block dimensions and the block and thread indices

nvcc
- The CUDA compiler handles these extensions to standard C
- Each source file containing these extensions must be compiled with the CUDA compiler nvcc
- Easily integrated with Visual Studio on Windows or makefiles on Linux

Function type qualifiers (1)
__device__
The __device__ qualifier declares a function that is:
- Executed on the device
- Callable from the device only
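The following minimal sketch illustrates the __device__ qualifier described above; the function and kernel names are invented for illustration, and the kernel uses the built-in index variables covered later.

// Sketch (invented names): a __device__ helper, executed on the device and
// callable only from device code, used here from a kernel.
__device__ float square(float x)
{
    return x * x;
}

__global__ void squareAll(float* data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] = square(data[i]);    // device-to-device call
}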

Function type qualifiers (2)
__global__
The __global__ qualifier declares a function as being a kernel:
- Executed on the device
- Callable from the host only

Function type qualifiers (3)
__host__
The __host__ qualifier declares a function that is:
- Executed on the host
- Callable from the host only
- The default if no function qualifier is specified
- Can be used in combination with __device__

Function type qualifiers - restrictions
- __device__ functions are always inlined
- __device__ and __global__ do not support recursion
- __device__ and __global__ cannot declare static variables inside their bodies
- __device__ and __global__ cannot have a variable number of arguments
- __device__ functions cannot have their address taken, but function pointers to __global__ functions are supported
- The __global__ and __host__ qualifiers cannot be used together
- __global__ functions must have a void return type
- Any call to a __global__ function must specify its execution configuration
- A call to a __global__ function is asynchronous, meaning it returns before the device has completed its execution
- __global__ function parameters are currently passed via shared memory to the device and are limited to 256 bytes

Variable type qualifiers (1)
__device__
The __device__ qualifier declares a variable that resides on the device:
- Resides in global memory space
- Has the lifetime of the application
- Is accessible from all the threads within the grid and from the host through the runtime library

Variable type qualifiers (2)
__constant__
The __constant__ qualifier declares a variable that resides on the device:
- Resides in constant memory space
- Has the lifetime of the application
- Is accessible from all the threads within the grid and from the host through the runtime library

Variable type qualifiers (3)
__shared__
The __shared__ qualifier declares a variable that resides on the device:
- Resides in the shared memory space of a thread block
- Has the lifetime of the block
- Is only accessible from threads within the block
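A minimal sketch, with invented names, combining the function and variable type qualifiers above; the host is assumed to set the __device__ and __constant__ variables through the runtime library (e.g. cudaMemcpyToSymbol).

// Sketch (invented names) of the qualifiers above.
__device__   float d_scale;     // global memory, lifetime of the application
__constant__ float c_offset;    // constant memory, written only from the host

__host__ __device__ float affine(float x, float s, float o)
{
    return s * x + o;           // compiled for both host and device
}

__global__ void transform(float* data, int n)
{
    __shared__ float tile[256];                      // per-block shared memory
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // assumes 256 threads per block
    if (i < n) {
        tile[threadIdx.x] = data[i];                 // each thread uses its own slot
        data[i] = affine(tile[threadIdx.x], d_scale, c_offset);
    }
}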

Variable type qualifiers - restrictions
- __shared__ and __constant__ cannot be used in combination with each other
- __shared__ and __constant__ variables have implied static storage
- __device__ and __constant__ variables are only allowed at file scope
- __constant__ variables cannot be assigned to from the device, only from the host
- __shared__ variables cannot have an initialization as part of their declaration

Variable type qualifiers - the default case
- An automatic variable declared in device code without any of these qualifiers generally resides in a register
- However, in some cases the compiler might choose to place it in local memory
- This is often the case for large structures or arrays that would consume too much register space, and for arrays that the compiler cannot determine are indexed with constant quantities

Execution configuration
Any call to a __global__ function must specify the execution configuration for that call. The execution configuration defines the dimension of the grid and blocks that will be used to execute the function on the device. It is specified by inserting an expression of the form <<< Dg, Db, Ns >>> between the function name and the parenthesized argument list, where:
- Dg is of type dim3 and specifies the dimension and size of the grid, such that Dg.x*Dg.y equals the number of blocks being launched
- Db is of type dim3 and specifies the dimension and size of each block, such that Db.x*Db.y*Db.z equals the number of threads per block
- Ns is of type size_t and specifies the number of bytes in shared memory that is dynamically allocated per block for this call in addition to the statically allocated memory; Ns is an optional argument which defaults to 0
The arguments to the execution configuration are evaluated before the actual function arguments.

Execution configuration example
A function declared as
__global__ void Func(float* parameter);
must be called like this:
Func<<< Dg, Db, Ns >>>(parameter);

Built-in variables
- gridDim: of type dim3, contains the dimensions of the grid
- blockIdx: of type uint3, contains the block index within the grid
- blockDim: of type dim3, contains the dimensions of the block
- threadIdx: of type uint3, contains the thread index within the block
Restrictions:
- It is not allowed to take the address of any of the built-in variables
- It is not allowed to assign values to any of the built-in variables

Common runtime component
Built-in vector types, for both host and device:
- char1, char2, char3, char4; uchar1, uchar2, uchar3, uchar4
- short1, short2, short3, short4; ushort1, ushort2, ushort3, ushort4
- int1, int2, int3, int4; uint1, uint2, uint3, uint4
- long1, long2, long3, long4; ulong1, ulong2, ulong3, ulong4
- float1, float2, float3, float4
- Doubles expected in the next generation of GPUs
- Constructors of the form make_<type_name>(); for example int2 make_int2(int x, int y);
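A short sketch, with invented names, tying together the execution configuration, the built-in variables, and a built-in vector type; it assumes n elements and 256 threads per block.

// Sketch (invented names): kernel using built-in variables and float2,
// plus the host-side execution configuration.
__global__ void scale(float2* v, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;           // built-in variables
    if (i < n)
        v[i] = make_float2(2.0f * v[i].x, 2.0f * v[i].y);    // built-in vector type
}

void launchScale(float2* d_v, int n)
{
    dim3 Db(256);                        // threads per block
    dim3 Dg((n + Db.x - 1) / Db.x);      // blocks in the grid
    scale<<< Dg, Db >>>(d_v, n);         // Ns omitted, defaults to 0
}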

Mathematical functions
<math.mdi>

Textures
- Tiled memory format with cache
- Hardware interpolation
- Read-only (at present)
- Read from kernels through texture fetches from texture references
- Linear memory vs. CUDA arrays? Linear memory: one-dimensional only, no filtering

Device runtime components
- Synchronization function: void __syncthreads();
- Type conversion:
  unsigned int __float2uint_[rn,rz,ru,rd](float);
  float __int2float_[rn,rz,ru,rd](int);
  float __uint2float_[rn,rz,ru,rd](unsigned int);
- Type casting:
  float __int_as_float(int);
  int __float_as_int(float);
- Atomic functions: perform a read-modify-write operation on one 32-bit word residing in global memory, e.g.
  unsigned int atomicAdd(unsigned int* address, unsigned int val);
  At present, limited support for integer types only!
- (A short code sketch of __syncthreads() and atomicAdd() follows after this group of slides)

Host runtime components
- Device: multiple devices supported
- Memory: linear memory or CUDA arrays
- OpenGL and DirectX interoperability, for visualization of results
- Asynchronicity: __global__ functions and most runtime functions return to the application before the device has completed the requested task

Image utilities
<image_utilities.mdi>

Performance Guidelines
How to get the most out of the device
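As referenced above, a minimal sketch (invented names) of the device runtime components: a per-block reduction synchronized with __syncthreads(), finished with one integer atomicAdd() on a 32-bit word in global memory. It assumes a power-of-two block size of 256 threads.

// Sketch (invented names): block-level sum with __syncthreads(), then one
// atomicAdd() per block into global memory.
__global__ void blockSum(const unsigned int* in, unsigned int* total, int n)
{
    __shared__ unsigned int partial[256];
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    partial[threadIdx.x] = (i < n) ? in[i] : 0u;
    __syncthreads();

    // Tree reduction in shared memory; each step ends with a barrier.
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (threadIdx.x < stride)
            partial[threadIdx.x] += partial[threadIdx.x + stride];
        __syncthreads();
    }

    if (threadIdx.x == 0)
        atomicAdd(total, partial[0]);    // read-modify-write on a 32-bit word
}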

Instruction performance
To process an instruction for a warp of threads, a multiprocessor must:
- Read the instruction operands for each thread of the warp
- Execute the instruction
- Write the result for each thread of the warp

Instruction performance
The effective instruction throughput depends on the nominal instruction throughput as well as the memory latency and bandwidth. It is maximized by:
- Minimizing the use of instructions with low throughput
- Maximizing the use of the available memory bandwidth for each category of memory
- Allowing the thread scheduler to overlap memory transactions with mathematical computations as much as possible, which requires that:
  - The program executed by the threads is of high arithmetic intensity, i.e. a high number of arithmetic operations per memory operation
  - There are many threads that can be run concurrently

Instruction throughput
To issue one instruction for a warp, a multiprocessor takes 4 clock cycles for:
- floating-point add, floating-point multiply, floating-point multiply-add
- integer add, bitwise operations, compare, min, max
- type conversion instructions

Instruction throughput
To issue one instruction for a warp, a multiprocessor takes 16 clock cycles for:
- reciprocal
- reciprocal square root
- 32-bit integer multiplication

Instruction throughput
- Integer division and modulo operations are particularly costly and should be avoided if possible (NVIDIA doesn't say how costly, though)
- Other functions are implemented as combinations of several instructions:
  - Floating-point square root is implemented as a reciprocal square root followed by a reciprocal, so it takes 32 clock cycles for a warp
  - Floating-point division takes 36 clock cycles

Control flow instructions
- Any flow control instruction (if, switch, do, for, while) can significantly impact the effective instruction throughput by causing threads of the same warp to diverge
- Serialization vs. parallelization
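To make the divergence point above concrete, here is a hedged sketch with invented names: in the first kernel the branch splits threads within a warp, so both paths are serialized; in the second the condition is uniform across each 32-thread warp.

// Sketch (invented names): warp-divergent vs. warp-uniform branching.
__global__ void divergent(float* data)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (threadIdx.x % 2 == 0)            // even and odd threads of a warp diverge
        data[i] *= 2.0f;
    else
        data[i] += 1.0f;
}

__global__ void warpUniform(float* data)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if ((threadIdx.x / 32) % 2 == 0)     // whole warps take the same branch
        data[i] *= 2.0f;
    else
        data[i] += 1.0f;
}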

Memory instructions
- Memory instructions include any instruction that reads from or writes to shared or global memory
- A multiprocessor takes 4 clock cycles to issue one memory instruction for a warp
- When accessing global memory, there are, in addition, 400 to 600 clock cycles of memory latency!

Global memory
- Uncached
- One can fetch 32-bit, 64-bit, or 128-bit words into registers in a single instruction
- Use coalescing whenever possible

Constant memory
- Cached
- Only on cache misses does it cost a read from device memory

Texture memory
- Cached
- Only on cache misses does it cost a read from device memory
- The texture cache is optimized for 2D spatial locality
- BUT there is a cost associated with copying into a CUDA array

Shared memory
- On-chip
- As fast as registers when there are no bank conflicts
- Avoid bank conflicts:
  - Shared memory is divided into equally-sized memory modules, called banks, which can be accessed simultaneously
  - Any memory read or write request made of n addresses that fall in n distinct memory banks can be serviced simultaneously
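A compact sketch, with invented names, of the two memory rules above: coalesced global memory accesses and bank-conflict-free shared memory, shown for a 16x16 tile transpose of a square matrix whose width is assumed to be a multiple of the tile size.

// Sketch (invented names): coalesced loads/stores plus a padded shared-memory
// tile; the +1 column puts consecutive rows in different banks.
#define TILE 16

__global__ void transposeTile(const float* in, float* out, int width)
{
    __shared__ float tile[TILE][TILE + 1];

    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    tile[threadIdx.y][threadIdx.x] = in[y * width + x];    // coalesced read

    __syncthreads();

    x = blockIdx.y * TILE + threadIdx.x;
    y = blockIdx.x * TILE + threadIdx.y;
    out[y * width + x] = tile[threadIdx.x][threadIdx.y];   // coalesced write
}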

2-way and 8-way bank conflicts

Registers
- On-chip
- Generally, accessing a register costs zero extra clock cycles per instruction

Number of threads per block
- There should be at least as many blocks as there are multiprocessors in the device
- Running only one block per multiprocessor can force the multiprocessor to idle during thread synchronization and device memory reads
- The shared memory available to a block decreases with the number of active blocks
- The number of threads per block should be chosen as a multiple of the warp size!

Number of threads per block
- Allocating more threads per block is better for efficient time slicing, but the more threads per block, the fewer registers are available per thread
- CUDA occupancy calculator
- (A small launch-configuration sketch follows after these slides)

Current generation hardware
<G8x_specs.mdi>

Matrix multiplication
An example from the CUDA programming guide
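As referenced above, before the matrix example, a small sketch (invented names, untuned numbers) of the thread-count guidance: a block size that is a multiple of the warp size and at least one block per multiprocessor.

// Sketch (invented names): pick a launch configuration following the advice
// above. The kernel itself must bounds-check, since blocks may be rounded up.
void chooseConfig(int n, int numMultiprocessors, dim3* grid, dim3* block)
{
    int threadsPerBlock = 256;                        // multiple of the warp size (32)
    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
    if (blocks < numMultiprocessors)                  // keep every multiprocessor busy
        blocks = numMultiprocessors;
    *block = dim3(threadsPerBlock);
    *grid  = dim3(blocks);
}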

Ax = b

Split matrix into blocks
- Computing the product C of two matrices A and B of dimensions (wA, hA) and (wB, wA)
- Each thread block is responsible for computing one square sub-matrix Csub of C
- Each thread within the block is responsible for computing one element of Csub
- See the CUDA programming guide, Fig. 6-1

Matrix multiplication code
<mat_mult.mdi>

Browse-through SDK
As time permits
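A condensed sketch along the lines of the programming-guide example described above (not the verbatim <mat_mult.mdi> code): each block computes one BLOCK_SIZE x BLOCK_SIZE sub-matrix of C, and each thread computes one of its elements; all dimensions are assumed to be multiples of BLOCK_SIZE.

// Condensed sketch of blocked matrix multiplication, C = A * B,
// where A is hA x wA and B is wA x wB.
#define BLOCK_SIZE 16

__global__ void matMul(const float* A, const float* B, float* C, int wA, int wB)
{
    __shared__ float As[BLOCK_SIZE][BLOCK_SIZE];
    __shared__ float Bs[BLOCK_SIZE][BLOCK_SIZE];

    int row = blockIdx.y * BLOCK_SIZE + threadIdx.y;
    int col = blockIdx.x * BLOCK_SIZE + threadIdx.x;
    float Csub = 0.0f;

    // Walk over the tiles of A and B that contribute to this sub-matrix of C.
    for (int t = 0; t < wA / BLOCK_SIZE; ++t) {
        As[threadIdx.y][threadIdx.x] = A[row * wA + t * BLOCK_SIZE + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = B[(t * BLOCK_SIZE + threadIdx.y) * wB + col];
        __syncthreads();

        for (int k = 0; k < BLOCK_SIZE; ++k)
            Csub += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();
    }

    C[row * wB + col] = Csub;
}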