Introduction to CUDA


Overview
- HW computational power
- Graphics API vs. CUDA
- CUDA glossary
- Memory model, HW implementation, execution
- Performance guidelines
- CUDA compiler
- C/C++ language extensions
- Limitations & advantages

Computational power
FLOPS = number of floating-point operations per second.

HW                                  Intel Core 2 Extreme QX6850   IBM CELL/BE    Nvidia GeForce 8800 GTS
Clock speed (GHz)                   3.00                          3.20           1.2
Processors                          4                             8              96
Bits per register                   128                           128            32
32-bit floats per register          4                             4              1
Parallel operations per processor   1                             1/2*           1/3*
Parallel 32-bit floats              16                            32/64*         96/288*
GFLOPS                              48.0                          102.4/204.8*   115.2/345.6**

* can perform a load, store, shuffle, channel or branch operation in parallel with a computation
** parallel MAD (a = b*c + d) and ADD (a = a + b)
8800 Ultra: 1.5 GHz, 128 processors = 192.0/576.0** GFLOPS
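
The last row is just arithmetic on the rows above it: clock rate times the number of 32-bit float operations retired per clock. For the 8800 GTS, for example:

    115.2 GFLOPS = 1.2 GHz x  96 parallel floats (1 op per processor per clock)
    345.6 GFLOPS = 1.2 GHz x 288 parallel floats (MAD + ADD: 3 ops per processor per clock)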

Compute Unified Device Architecture
- HW & SW architecture; no need for a graphics API
- HW: GeForce 8 series (G8x), Quadro FX 5600/4600
- SW: CUDA Driver, CUDA Runtime, CUDA Libraries, CUDA Compiler (nvcc)
[Figure: the SW stack on the CPU side, Application -> CUDA Libraries -> CUDA Runtime -> CUDA Driver, sitting above the GPU]

Graphics API vs. CUDA
Graphics API:
- program written in a language external to the API (GLSL, Cg, HLSL; C/C++ like)
- program types are function-like: Geometry Shader, Vertex Shader, Fragment Shader
CUDA:
- program written in a language internal to the API (a C/C++ extension)
- program types are program-like: Kernel

Graphics API vs. CUDA
Graphics API:
- fragment shader limited to outputting 32 floats at a prespecified location
- data must be stored in textures (packing long arrays into 2D textures, extra addressing math)
CUDA:
- scattered writes (unlimited number of stores, to any address)
- loads from any address
- fast shared memory (currently 16 KB per multiprocessor), accessible in parallel by blocks of threads
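
What "scattered writes" buys you, as a minimal sketch (illustrative names, not from the slides): a fragment shader can only write to its prespecified output location, while this kernel stores to a data-dependent address:

    __global__ void histogram(const unsigned int *input, unsigned int *bins)
    {
        unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
        // Scattered write: the store address depends on the data itself.
        // Caveat: without atomic operations (absent on G8x-era CUDA 1.0),
        // concurrent increments of the same bin can collide; illustrative only.
        bins[input[i]] += 1;
    }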

CUDA glossary
- Thread: a computation running on a single scalar processor
- Kernel: a program running on the device
- Thread Block: a 3D block of cooperating threads that share fast memory
- Grid of Thread Blocks: a 2D grid of thread blocks of the same dimensionality, all executing the same kernel
- No communication between thread blocks!
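
To make the glossary concrete, a minimal sketch (names like scale and d_data are placeholders, not from the slides): one kernel, launched as a grid of blocks, where each thread handles one array element:

    // Kernel: runs on the device, once per thread.
    __global__ void scale(float *data, float s)
    {
        // Each thread derives the index of "its" element from its
        // thread index and its block's position in the grid.
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        data[i] = data[i] * s;
    }

    // Host side: launch a grid of 64 thread blocks, 256 threads each.
    scale<<<64, 256>>>(d_data, 2.0f);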

Memory model
- Per thread: registers, local memory
- Per thread block: shared memory
- Per grid (all blocks): global memory, constant memory, texture memory
[Figure: a grid of blocks; each block has its shared memory, each thread its registers and local memory, all above the global/constant/texture memories]

HW implementation
- Device = set of multiprocessors (GTS: 12, GTX: 16)
- Multiprocessor = set of 8 scalar processors + shared memory (16 KB)
- SIMD architecture: every processor executes the same instruction on different data
- A multiprocessor processes one or more thread blocks
- Warp: a SIMD group of 32 threads from the same block

Execution
- The grid of thread blocks is distributed across the device: one or more thread blocks per multiprocessor
- A thread block runs on exactly one multiprocessor -> effective usage of shared memory
- A block is split into SIMD groups of threads (warps), with an equal number of threads (the warp size) in each warp
- Each of the block's warps is executed by the multiprocessor in a SIMD fashion
- The thread scheduler periodically switches between warps

Execution model
[Figure: a multiprocessor with 8 processors and its shared memory; the thread scheduler interleaves the warps of the resident thread blocks (Block (0,0) warp 1, Block (0,1) warp 2, ...) across the processors]

Execution model
- 1 multiprocessor = 8 processors -> 8 parallel threads
- Thread block = n threads (n <= 512); warp = 32 threads
- One SIMD instruction takes 4 clock cycles (4 x 8 processors = a whole warp)
- Next warp(s), then next thread block(s), but only if enough registers & shared memory are available
- The other multiprocessors concurrently process their own blocks from the grid
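
A quick sanity check on these numbers (derived, not from the slides): a maximal 512-thread block is 512/32 = 16 warps, so issuing one SIMD instruction for the whole block occupies the multiprocessor for 16 x 4 = 64 clock cycles.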

Memory access
- Shared memory is organized as 16 banks of 32-bit words; successive words fall into successive banks
- Bank conflict: threads from the same half-warp (16 threads) access the same bank in the same clock cycle, e.g. thread A accesses address N*4 while thread B accesses address (N + k*16)*4
[Figure: byte addresses 0-16383 of shared memory mapped onto banks 0-15, wrapping every 64 bytes]
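
A sketch of the difference between a conflict-free and a conflicting access pattern (illustrative code, not from the slides; assume a launch with 16 threads, one half-warp, for clarity):

    __global__ void bankDemo(float *out)
    {
        __shared__ float buf[16 * 16];     // 1 KB of the 16 KB shared memory
        int t = threadIdx.x;
        buf[t] = (float)t;                 // the values are irrelevant here,
        __syncthreads();                   // only the access pattern matters

        float a = buf[t];                  // conflict-free: thread t hits bank t
        float b = buf[t * 16];             // (t*16) mod 16 == 0 for every t, so all
                                           // 16 threads hit bank 0: a 16-way conflict
        out[t] = a + b;
    }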

Performance guidelines
- Threads per block: a multiple of 64; 64 is the minimum (combined with multiple blocks per multiprocessor); 192-256 recommended
- Use shared memory, avoid bank conflicts
- Number of blocks: 100+; 1000+ to scale to future devices; at least 2 per multiprocessor
- Avoid wasting shared memory per block / registers per thread (8192 registers per multiprocessor)
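
A launch configuration following these rules, as a sketch (reusing the hypothetical scale kernel from the glossary slide; N is an assumed element count):

    const int threadsPerBlock = 256;                           // multiple of 64, in the recommended range
    int blocks = (N + threadsPerBlock - 1) / threadsPerBlock;  // round up so every element is covered
    scale<<<blocks, threadsPerBlock>>>(d_data, 2.0f);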

CUDA Compiler
- C/C++ like source code containing both 'device' specific and 'host' specific code
- NVCC compilation:
  - separates the 'device' specific and 'host' specific code
  - compiles the 'device' specific code to a CUDA binary
  - merges the CUDA binary with the 'host' code
  - calls the standard C/C++ compiler and linker on the merged code

C/C++ Language 'extensions'
Function type qualifiers
- __device__ : executed on the device, callable from the device only
- __global__ : executed on the device, callable from the host
- __host__ : executed on the host, callable from the host
Variable type qualifiers
- __device__ : resides on the device
- __constant__ : resides in constant memory
- __shared__ : resides in the shared memory of a thread block
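
A small sketch putting the qualifiers together (all names are illustrative; assumes blocks of 256 threads):

    __constant__ float coeff[4];          // constant memory, set from the host

    __device__ float poly(float x)        // runs on the device, callable from device code only
    {
        return coeff[0] + x * (coeff[1] + x * (coeff[2] + x * coeff[3]));
    }

    __global__ void evalPoly(const float *in, float *out)   // callable from the host
    {
        __shared__ float tile[256];       // shared by the threads of one block
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        tile[threadIdx.x] = in[i];
        __syncthreads();
        out[i] = poly(tile[threadIdx.x]);
    }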

C/C++ Language 'extensions'
Built-in vector types
- {(u)integer types, float}{1,2,3,4} (e.g. int2, uint4, float4)
Built-in variables
- gridDim (dim3): dimensions of the grid
- blockIdx (uint3): block index within the grid
- blockDim (dim3): dimensions of the block
- threadIdx (uint3): thread index within the block
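
A common idiom built from these variables (illustrative, not from the slides): recovering a global 2D index in an image-style kernel launched with a 2D grid of 2D blocks:

    __global__ void clearPixel(float *img, int width)
    {
        int x = blockIdx.x * blockDim.x + threadIdx.x;   // global column
        int y = blockIdx.y * blockDim.y + threadIdx.y;   // global row
        img[y * width + x] = 0.0f;                       // row-major addressing
    }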

Execution configuration
Function declaration:
    __global__ void Func(float parameter);
Function execution:
    Func<<< Dg, Db, Ns >>>(parameter);
- Dg: dimension and size of the grid; Dg.x * Dg.y = number of blocks launched
- Db: dimension and size of each block; Db.x * Db.y * Db.z = number of threads per block
- Ns: optional, number of bytes of shared memory dynamically allocated per block
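
A concrete instance of this syntax (the sizes are assumptions); the Ns bytes show up in the kernel as an extern __shared__ array:

    __global__ void Func(float parameter)
    {
        extern __shared__ float dynBuf[];            // sized at launch time via Ns
        int t = threadIdx.y * blockDim.x + threadIdx.x;
        dynBuf[t] = parameter;
    }

    dim3 Dg(16, 16);                                 // 16*16 = 256 blocks
    dim3 Db(16, 16, 1);                              // 16*16*1 = 256 threads per block
    Func<<<Dg, Db, 256 * sizeof(float)>>>(1.0f);     // Ns = 1 KB of dynamic shared memory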

Memory management
Device memory (Runtime API):
- cudaMalloc(), cudaMallocPitch(), cudaFree()
Host memory:
- pageable: standard functions (malloc(), free(), ...)
- page-locked (faster copies): cudaMallocHost(), cudaFreeHost()
Copying:
- cudaMemcpy(), with a cudaMemcpyKind of cudaMemcpyHostToHost, cudaMemcpyHostToDevice, cudaMemcpyDeviceToHost, or cudaMemcpyDeviceToDevice
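
The usual host-side round trip with these calls (N and the scale kernel are carried over from the earlier sketches; error checking omitted):

    float *h_data = (float *)malloc(N * sizeof(float));   // pageable host memory
    float *d_data;
    cudaMalloc((void **)&d_data, N * sizeof(float));      // device memory

    cudaMemcpy(d_data, h_data, N * sizeof(float), cudaMemcpyHostToDevice);
    scale<<<(N + 255) / 256, 256>>>(d_data, 2.0f);
    cudaMemcpy(h_data, d_data, N * sizeof(float), cudaMemcpyDeviceToHost);

    cudaFree(d_data);
    free(h_data);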

CUDA Libraries
CUFFT
- Fast Fourier Transform: 1D, 2D, 3D
- Real -> Complex, Complex -> Complex, Complex -> Real
- emulation mode
CUBLAS
- Basic Linear Algebra Subroutines: scalar, vector, matrix operations
- emulation mode
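
A minimal CUFFT sketch, a 1D in-place complex-to-complex transform (N is an assumed signal length; error checking omitted):

    #include <cufft.h>

    cufftComplex *d_signal;                                 // device buffer for N complex samples
    cudaMalloc((void **)&d_signal, N * sizeof(cufftComplex));

    cufftHandle plan;
    cufftPlan1d(&plan, N, CUFFT_C2C, 1);                    // one 1D C2C transform of N points
    cufftExecC2C(plan, d_signal, d_signal, CUFFT_FORWARD);  // in-place forward FFT
    cufftDestroy(plan);
    cudaFree(d_signal);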

Current limitations
SW limits
- no streaming support
G8x limits
- memory copying is synchronous
- doubles are converted to float (no double precision)
- only 24-bit integer multiply has HW support; a 32-bit integer multiply is compiled into multiple instructions
Advantages
- GPU & CPU can run in parallel: as of CUDA 1.0, kernel invocation is asynchronous
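
Since a kernel launch returns control immediately, the host can overlap CPU work with the GPU; a sketch using the CUDA 1.0-era synchronization call (doCpuWork is a hypothetical placeholder):

    scale<<<blocks, 256>>>(d_data, 2.0f);   // returns immediately; the GPU starts working
    doCpuWork();                            // CPU work proceeds in parallel (hypothetical)
    cudaThreadSynchronize();                // block the host until the kernel has finished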

Future
HW
- double precision
- native 32-bit integer multiply
- more multiprocessors
SW
- CUDA 1.1: shipped as part of the display drivers; asynchronous execution; double precision support

References
- NVIDIA CUDA Programming Guide 1.0, http://developer.download.nvidia.com/compute/cuda/1_0/nvidia_cuda_programming_guide_1.0.pdf
- NVIDIA CUDA 1.0 FAQ, http://forums.nvidia.com/index.php?showtopic=36286
- Technical Brief: NVIDIA GeForce 8800 GPU Architecture Overview, http://www.nvidia.com/object/io_37100.html
- Beta CUDA 1.1, http://forums.nvidia.com/index.php?showtopic=51026