CUDA Programming Model


CUDA
Xing Zeng, Dongyue Mou

Outline: Introduction, Example, Pro & Contra, Trend

Introduction

What is CUDA?
- Compute Unified Device Architecture.
- A powerful parallel programming model for issuing and managing computations on the GPU without mapping them to a graphics API.
- Heterogeneous: mixed serial-parallel programming
- Scalable: hierarchical thread execution model
- Accessible: minimal but expressive changes to C

Software stack (CUDA SDK):
- Libraries: CUFFT & CUBLAS
- Runtime: common component, device component, host component
- Driver: Driver API

From GPU for graphics to GPGPU to CUDA.

GPGPU: trick the GPU into general-purpose computing by casting the problem as graphics:
- Turn data into images ("texture maps")
- Turn algorithms into image synthesis ("rendering passes")

Drawbacks:
- Tough learning curve
- Potentially high overhead of the graphics API
- Highly constrained memory layout & access model
- The need for many passes drives up bandwidth consumption

GPGPU programming just to compute A + B is painful. What's wrong with GPGPU?
- APIs are specific to graphics
- Limited texture size and dimension
- Limited instruction set
- No thread communication
- Limited local storage
- Limited shader outputs
- No scatter

CUDA: unified design. Advantages:
- HW: a fully general data-parallel architecture
  o general thread launch
  o global load-store
  o parallel data cache
  o scalar architecture
  o integers, bit operations
- SW: program the GPU in C
  o scalable data-parallel execution/memory model
  o C with minimal yet powerful extensions

From GPGPU to CUDA:
- Feature 1: threads, not pixels
  o full integer and bit instructions
  o no limits on branching, looping
  o 1D, 2D, 3D thread ID allocation
- Feature 2: fully general load/store to GPU memory
  o untyped, not fixed texture types
  o pointer support
- Feature 3: dedicated on-chip memory
  o shared between threads for inter-thread communication
  o explicitly managed
  o as fast as registers

Important concepts:
- Device: the GPU, viewed as a co-processor.
- Host: the CPU.
- Kernel: a data-parallel, compute-intensive portion of the application that runs on the device.
- Thread: the basic execution unit.
- Thread block: a batch of threads. Threads in a block cooperate and efficiently share data. Threads and blocks have unique IDs.
- Grid: a batch of thread blocks that execute the same kernel. Threads in different blocks of the same grid cannot directly communicate with each other.

Simple example (matrix addition): CPU C program vs. CUDA program (the listings did not survive transcription; a sketch follows below).

Hardware implementation: a set of SIMD multiprocessors with on-chip shared memory.
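Since the slide's two listings were lost, here is a minimal reconstruction of the comparison it describes. The names (matAddCPU, matAddKernel) and the size N are ours, not the original deck's:

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

#define N 256  /* hypothetical matrix size; the slide's value is unknown */

/* CPU C version: a loop nest visits every element in turn. */
void matAddCPU(const float *A, const float *B, float *C)
{
    for (int i = 0; i < N; ++i)
        for (int j = 0; j < N; ++j)
            C[i * N + j] = A[i * N + j] + B[i * N + j];
}

/* CUDA version: one thread per element; the loops disappear. */
__global__ void matAddKernel(const float *A, const float *B, float *C)
{
    int j = blockIdx.x * blockDim.x + threadIdx.x;  /* column */
    int i = blockIdx.y * blockDim.y + threadIdx.y;  /* row */
    if (i < N && j < N)
        C[i * N + j] = A[i * N + j] + B[i * N + j];
}

int main(void)
{
    size_t bytes = N * N * sizeof(float);
    float *hA = (float *)malloc(bytes);
    float *hB = (float *)malloc(bytes);
    float *hC = (float *)malloc(bytes);
    for (int k = 0; k < N * N; ++k) { hA[k] = 1.0f; hB[k] = 2.0f; }

    float *dA, *dB, *dC;
    cudaMalloc(&dA, bytes); cudaMalloc(&dB, bytes); cudaMalloc(&dC, bytes);
    cudaMemcpy(dA, hA, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB, bytes, cudaMemcpyHostToDevice);

    dim3 block(16, 16);                   /* 256 threads per block */
    dim3 grid(N / block.x, N / block.y);  /* blocks covering N x N */
    matAddKernel<<<grid, block>>>(dA, dB, dC);
    cudaMemcpy(hC, dC, bytes, cudaMemcpyDeviceToHost);  /* waits for the kernel */

    printf("C[0] = %f\n", hC[0]);         /* expect 3.0 */
    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    free(hA); free(hB); free(hC);
    return 0;
}
```

One thread per element replaces the CPU loop nest; the bounds check lets the grid safely overshoot the matrix dimensions.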

G80 example: 16 multiprocessors, 128 thread processors, up to 12,288 parallel threads active. Per-block shared memory accelerates processing.

Streaming Multiprocessor (SM):
- Processing elements
  o 8 scalar thread processors
  o 32 GFLOPS peak at 1.35 GHz
  o 8192 32-bit registers (32 KB)
  o usual ops: float, int, branch, ...
- Hardware multithreading
  o up to 8 blocks resident at once
  o up to 768 active threads in total
- 16 KB on-chip memory, shared amongst the threads of a block, supports thread communication

Execution model: Single Instruction, Multiple Thread (SIMT)
- Groups of 32 threads are formed into warps
  o always executing the same instruction
  o share instruction fetch/dispatch
  o some lanes become inactive when the code path diverges
  o hardware automatically handles divergence
- Warps are the primitive unit of scheduling: pick 1 of 24 warps for each instruction slot; all warps from all active blocks are time-sliced
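To make the divergence rule concrete, a tiny sketch of our own (not from the deck). The first kernel splits every warp at an even/odd branch, so the hardware runs both paths serially with half the lanes masked; the second branches on the warp index, keeps each warp uniform, and avoids the serialization:

```cuda
// Sketch: warp divergence. A warp is 32 consecutive threads of a block.
__global__ void divergentBranch(int *out)
{
    int tid = threadIdx.x;
    if (tid % 2 == 0)         // even/odd lanes disagree inside every warp:
        out[tid] = 1;         // both sides execute, lanes masked in turn
    else
        out[tid] = 2;
}

__global__ void uniformBranch(int *out)
{
    int tid = threadIdx.x;
    if ((tid / 32) % 2 == 0)  // all 32 lanes of a warp take the same side:
        out[tid] = 1;         // no divergence, no serialization
    else
        out[tid] = 2;
}
```

Both kernels produce the same kind of result; only the mapping of the branch onto warps differs.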

Memory spaces:
- Registers
  o on chip, fast access
  o per thread
  o limited amount, 32-bit
- Local memory
  o in DRAM, slow, non-cached
  o per thread
  o relatively large
- Shared memory
  o on chip, fast access
  o per block, 16 KByte
  o synchronize between threads
- Global memory
  o in DRAM, slow, non-cached
  o per grid
  o communicate between grids
- Constant memory
  o in DRAM, cached
  o per grid, read-only
- Texture memory
  o in DRAM, cached
  o per grid, read-only
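A short sketch of our own tying these spaces to declarations (texture memory, which needs the texture API, is left out):

```cuda
#include <cuda_runtime.h>

__constant__ float coeff[16];   // constant memory: cached, read-only per grid;
                                // the host fills it with cudaMemcpyToSymbol()

__device__ float scratch;       // global memory: per grid, persists across kernels

__global__ void spaces(const float *in, float *out)  // in/out point into global memory
{
    __shared__ float tile[256]; // shared memory: on chip, one copy per block
    int tid = threadIdx.x;      // built-in index, held in a register
    float v = in[tid] * coeff[0];  // v sits in a register; spills land in local memory
    tile[tid] = v;
    __syncthreads();            // wait until the whole block has written its element
    out[tid] = tile[255 - tid]; // read a value another thread staged: block-level communication
}
```

Launched as `spaces<<<1, 256>>>(d_in, d_out);`, so one block owns the whole 256-element tile.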

Summary:
- Registers, shared memory: on chip
- Local, global, constant, and texture memory: in device memory
- Global, constant, and texture memory: managed by host code, persistent across kernel launches

CUDA provides an easy path for users to write programs for the GPU device. It consists of:
- A minimal set of extensions to C/C++
  o type qualifiers
  o call syntax
  o built-in variables
- A runtime library to support the execution
  o host component
  o device component
  o common component

CUDA C/C++ extensions: new function type qualifiers
  __host__ void HostFunc(...);     // executable on host
  __global__ void KernelFunc(...); // callable from host
  __device__ void DeviceFunc(...); // callable from device only
- Restrictions for device code (__global__ / __device__):
  o no recursive calls
  o no static variables
  o no function pointers
  o a __global__ function is invoked asynchronously
  o a __global__ function must have a void return type

CUDA C/C++ extensions: new variable type qualifiers
  __device__ int GlobalVar;    // in global memory, lifetime of app
  __constant__ int ConstVar;   // in constant memory, lifetime of app
  __shared__ int SharedVar;    // in shared memory, lifetime of block
- Restrictions:
  o no extern usage, file scope only
  o no combination with struct or union
  o no initialization for __shared__
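A small sketch of our own combining the three function qualifiers; the synchronization call reflects the asynchronous-launch restriction above:

```cuda
#include <cuda_runtime.h>

__device__ float square(float x)    // compiled for the GPU, callable only from device code
{
    return x * x;
}

__global__ void squareAll(float *data, int n)  // a kernel: void return, launched from the host
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] = square(data[i]);  // device-to-device call is allowed
}

__host__ void runSquareAll(float *d_data, int n)  // ordinary CPU function (__host__ is the default)
{
    squareAll<<<(n + 127) / 128, 128>>>(d_data, n);
    cudaDeviceSynchronize();        // the launch itself returns immediately
}
```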

CUDA C/C++ extensions: new syntax to invoke the device code
  KernelFunc<<< Dg, Db, Ns, S >>>(...);
  o Dg: dimension of the grid
  o Db: dimension of each block
  o Ns: optional, shared memory for extern variables
  o S: optional, associated stream

New built-in variables for indexing the threads (a sketch exercising the full launch form follows below):
  o gridDim: dimension of the whole grid
  o blockIdx: index of the current block
  o blockDim: dimension of each block in the grid
  o threadIdx: index of the current thread

CUDA runtime library:
- Common component
  o vector/texture types
  o mathematical/time functions
- Device component
  o mathematical/time/texture functions
  o synchronization function: __syncthreads()
  o type conversion/casting functions
- Host component
  o structure: Driver API and Runtime API
  o functions: device, context, memory, module, and texture management; execution control; interoperability with OpenGL and Direct3D

A CUDA source file uses .cu as its extension and contains both host and device source code. The CUDA compiler driver nvcc compiles it and generates CPU binary and PTX code (PTX: Parallel Thread Execution, a device-independent virtual machine code). PTX code may be further translated for a specific GPU architecture.

Simple example (matrix addition), revisited: CPU C program vs. CUDA program (see the sketch earlier).
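A sketch of our own exercising the full <<< Dg, Db, Ns, S >>> form together with all four built-in variables; the extern __shared__ array is what the optional Ns argument sizes:

```cuda
#include <cuda_runtime.h>

__global__ void scale(float *data, float s, int n)
{
    extern __shared__ float buf[];   // sized by the Ns launch argument
    // Grid-stride loop: gridDim, blockIdx, blockDim, and threadIdx together
    // give each thread a unique starting index and a stride over the data.
    int stride = gridDim.x * blockDim.x;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride) {
        buf[threadIdx.x] = data[i] * s;  // stage the scaled value in shared memory
        data[i] = buf[threadIdx.x];
    }
}

void run(float *d_data, int n)
{
    cudaStream_t stream;
    cudaStreamCreate(&stream);

    dim3 Dg((n + 255) / 256);           // Dg: dimension of the grid
    dim3 Db(256);                       // Db: dimension of each block
    size_t Ns = 256 * sizeof(float);    // Ns: bytes of shared memory for buf[]
    scale<<<Dg, Db, Ns, stream>>>(d_data, 2.0f, n);  // S: the stream to run in

    cudaStreamSynchronize(stream);
    cudaStreamDestroy(stream);
}
```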

Pro & Contra

Pro: CUDA allows massively parallel computing at a relatively low price:
- highly integrated solution
- personal supercomputing
- eco-friendly production
- easy to learn

Contra: problems...
- slightly low precision
- limited support for IEEE-754
- no recursive function calls
- hard to use for irregular join/fork logic
- no concurrency between jobs

Trend:
- more cores on chip
- better support for floating point
- more flexible configuration & control/data flow
- lower prices
- support for higher-level programming languages

References:
[1] CUDA Programming Guide, NVIDIA Corp.
[2] The CUDA Compiler Driver, NVIDIA Corp.
[3] Parallel Thread Execution, NVIDIA Corp.
[4] CUDA: A Heterogeneous Parallel Programming Model for Manycore Computing, ASPLOS 2008, gpgpu.org

Questions?