Practical Introduction to CUDA and GPU
Charlie Tang, Centre for Theoretical Neuroscience
October 9, 2009

Overview CUDA stands for Compute Unified Device Architecture. Introduced in Nov. 2006, it is a parallel computing architecture for using NVIDIA GPUs, with a new parallel programming model and instruction set architecture. Supports C, with FORTRAN, C++, OpenCL, and DirectX Compute to come. Low learning curve for C programmers. See pg. 12 of the Programming Guide.

CUDA All the cool kids are using it! CUDA Zone. Support base: the CUDA Forum. From OpenCV to OpenVIDIA Computer Vision - it's like Saturday morning before all the football games start (if you like football).

Performance Comparison See pg. 10 of the Programming Guide.

CUDA Programming Three key abstractions: a hierarchy of thread groups; shared memories; barrier synchronization. nvcc (the CUDA compiler) compiles a .cu file (a C program with extended syntax) into host (CPU) code and device (GPU) code.

CUDA Programming A .cu file contains host code and device code. Host code is run by the CPU; device code is run by the GPU. Show pg. 19-22 of the CS775 gpu slides. Show pg. 15 Programming Guide. Grid of thread blocks - show pg. 18 Programming Guide. Mapping to hardware - show pg. 80 Programming Guide. deviceQuery program.

Execution A scalable array of multithreaded Streaming Multiprocessors (SMs). The host CPU calls a kernel, launching the kernel grid with a 2D layout of thread blocks. Each thread block is allocated to one SM (the GTX 280 has 30 SMs). Each SM contains 8 Scalar Processor (SP) cores, 2 special math units, a multithreaded instruction unit, and on-chip shared memory. This allows for fine-grain parallelism, e.g. each thread processes one element of a matrix or one pixel of an image. Each SM supports synchronization via the __syncthreads() barrier function.
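As a sketch of this fine-grain mapping (the kernel and variable names here are illustrative, not from the talk), a kernel where each thread handles one pixel of an image, launched as a 2D grid of thread blocks, might look like:

```cuda
// Illustrative sketch: one thread per pixel of a width x height image.
__global__ void brighten(float *img, int width, int height, float amount)
{
    // Recover this thread's 2D pixel coordinates from the block/thread indices.
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < width && y < height)           // guard against partial edge blocks
        img[y * width + x] += amount;
}

void launch(float *d_img, int width, int height)
{
    // Host side: a 2D grid of 16x16 thread blocks covering the whole image.
    dim3 block(16, 16);
    dim3 grid((width + block.x - 1) / block.x,
              (height + block.y - 1) / block.y);
    brighten<<<grid, block>>>(d_img, width, height, 0.1f);
}
```

Each block of 256 threads lands on one SM, and the rounded-up grid dimensions ensure every pixel is covered even when the image size is not a multiple of the block size.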

Warps Each SM uses a Single Instruction Multiple Threads (SIMT) architecture. It maps every thread onto an SP core and manages, schedules, and carries out thread execution in warps of 32 threads. If threads within a warp diverge via data-dependent branching, the warp serially executes each branch path in turn, disabling the threads on all other paths. Therefore, maximum efficiency occurs when all threads in a warp agree on the execution path.

Memory of an SM
(really fast) 32-bit registers
(fast) shared memory - shared by all SP cores and therefore all threads in a block
(fast, read-only) constant memory cache
(fast, read-only) texture memory cache
(slow, read-write) global memory - not cached
(slow, read-write) local memory - not cached (used when variables are too big to fit in a kernel's registers)
See pg. 83 Programming Guide, pg. 23 gpu slides.
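A minimal sketch of how these spaces appear in code (kernel and names are made up for illustration): local scalars live in registers, __shared__ arrays in a block's on-chip memory, __constant__ data in the cached constant space, and pointer arguments refer to slow global memory.

```cuda
// Illustrative sketch of the memory spaces (not from the talk).
__constant__ float coeff;                 // constant memory: cached, read-only in kernels

__global__ void scale(const float *in, float *out, int n)
{
    __shared__ float tile[256];           // on-chip shared memory, one copy per block
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // i is held in a register
    tile[threadIdx.x] = (i < n) ? in[i] : 0.0f;      // 'in' points into global memory
    __syncthreads();                      // barrier before other threads read the tile
    if (i < n)
        out[i] = coeff * tile[threadIdx.x];
}
```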

Specifics - C extensions Function type qualifiers:
__device__ : executes on the device, callable from the device only
__global__ : a kernel function; executes on the device, callable from the host only
__host__ : executes on the host, callable from the host only (the default if a function has no qualifier)
Builtin variables: gridDim (e.g. gridDim.x, gridDim.y), blockIdx (e.g. blockIdx.x, blockIdx.y), blockDim (e.g. blockDim.x, blockDim.y, blockDim.z), threadIdx (e.g. threadIdx.x, threadIdx.y, threadIdx.z). gridDim and blockDim are dim3, an integer vector type; blockIdx and threadIdx are uint3.
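Putting the qualifiers together (function names here are invented for illustration):

```cuda
// A __device__ helper: GPU code, callable only from other GPU code.
__device__ float square(float x)
{
    return x * x;
}

// A __global__ kernel: GPU code, launched from the host with <<<...>>>.
__global__ void squares(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // builtin index variables
    if (i < n)
        data[i] = square(data[i]);
}

// __host__ is the default and is usually omitted.
__host__ void run(float *d_data, int n)
{
    squares<<<(n + 255) / 256, 256>>>(d_data, n);
}
```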

Compiling Two interfaces: C for CUDA (easier) and the CUDA driver API (harder). nvcc compiles C or PTX language code (within a .cu file) into binary code. It splits the .cu code into host and device code; host code is compiled by a standard compiler (gcc), device code is compiled into PTX code or cubin objects. cubin objects can be executed by the CUDA driver API directly, or linked together with the host executable and launched as kernels. There is full C++ support for host code but only a subset of C for device code, which warrants splitting the device and host code into separate .cu files. The -arch=sm_13 flag enables double precision, only on GPUs with compute capability 1.3 or later. Slow!! Fermi is 8x faster.

First Example! Show code, common.mk, and Makefile. cudaMalloc(), cudaMemcpy(), cudaFree()
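The shape of the host side of such a first example (a sketch, with error checking omitted) is just these three runtime calls around a kernel launch:

```cuda
#include <cuda_runtime.h>

int main(void)
{
    const int n = 1024;
    float h_x[1024];                           // host buffer
    for (int i = 0; i < n; ++i) h_x[i] = (float)i;

    float *d_x;
    cudaMalloc((void **)&d_x, n * sizeof(float));           // allocate GPU memory
    cudaMemcpy(d_x, h_x, n * sizeof(float),
               cudaMemcpyHostToDevice);                     // copy host -> device

    // ... launch kernels operating on d_x here ...

    cudaMemcpy(h_x, d_x, n * sizeof(float),
               cudaMemcpyDeviceToHost);                     // copy results back
    cudaFree(d_x);                                          // release GPU memory
    return 0;
}
```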

Shared Memory Example Show pg. 28 Programming Guide

Useful APIs CUBLAS - Basic Linear Algebra Subprograms (BLAS) implementation, e.g. matrix-vector and matrix-matrix operations. CULA - LAPACK for CUDA, e.g. LU, QR, SVD, least squares. CUFFT - 14x faster than the Fastest Fourier Transform in the West (FFTW). CUDPP - CUDA Data Parallel Primitives Library, e.g. parallel prefix sum, parallel sorting, sparse matrix-vector multiply, parallel reduction.

Example - Matrix Multiplication Uses CUBLAS, the Basic Linear Algebra Subprograms implementation in C for CUDA. Show MATLAB and .cu code. Compare to MATLAB. Run example.
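A sketch of C = A*B through the (original, pre-cublas_v2) CUBLAS interface of that era; matrices are column-major, which conveniently matches MATLAB's layout. Error checking is omitted.

```cuda
#include <cublas.h>

// Multiply an m x k matrix A by a k x n matrix B into the m x n matrix C,
// all stored column-major on the host.
void gemm(const float *A, const float *B, float *C, int m, int k, int n)
{
    float *dA, *dB, *dC;
    cublasInit();
    cublasAlloc(m * k, sizeof(float), (void **)&dA);
    cublasAlloc(k * n, sizeof(float), (void **)&dB);
    cublasAlloc(m * n, sizeof(float), (void **)&dC);
    cublasSetMatrix(m, k, sizeof(float), A, m, dA, m);   // upload operands
    cublasSetMatrix(k, n, sizeof(float), B, k, dB, k);
    // C = 1.0 * A * B + 0.0 * C, no transposes.
    cublasSgemm('n', 'n', m, n, k, 1.0f, dA, m, dB, k, 0.0f, dC, m);
    cublasGetMatrix(m, n, sizeof(float), dC, m, C, m);   // download result
    cublasFree(dA); cublasFree(dB); cublasFree(dC);
    cublasShutdown();
}
```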

CUDA debugging using CUDA-GDB nvcc -g -G first_example.cu -o first_example. A newly introduced hardware debugger based on GDB!!

CUDA profiling Find out which kernel call is dragging you behind. Demo.

Example - Sum How do we perform MATLAB's sum() in parallel? This is a parallel reduction problem. Show instructive white paper.
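In the spirit of that white paper, one block-level reduction step can be sketched as follows (a simplified version: a full sum needs a second pass, or a second kernel launch, over the per-block partial sums):

```cuda
// Tree reduction of up to blockDim.x (= 256 here) elements per block.
__global__ void block_sum(const float *in, float *out, int n)
{
    __shared__ float buf[256];
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    buf[threadIdx.x] = (i < n) ? in[i] : 0.0f;   // pad the tail with zeros
    __syncthreads();

    // Halve the active stride each step: 128, 64, ..., 1.
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s)
            buf[threadIdx.x] += buf[threadIdx.x + s];
        __syncthreads();                         // all adds done before next step
    }
    if (threadIdx.x == 0)
        out[blockIdx.x] = buf[0];                // one partial sum per block
}
```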

Example - Wake-Sleep Algorithm for DBN 10x speedup over the fine-tuning algorithm for a Deep Belief Network. Show MATLAB and .cu code. Run example.

Example - Backpropagation Nonlinear Neighborhood Component Analysis learning algorithm. Show MATLAB and .cu code. Run example.

Thank You A supercomputer on your desktop! No reason why you shouldn't attack algorithm bottlenecks with CUDA.