Introduction to CUDA Algoritmi e Calcolo Parallelo. Daniele Loiacono


References. This set of slides is mainly based on: CUDA Technical Training, Dr. Antonino Tumeo, Pacific Northwest National Laboratory; slides of Applied Parallel Programming (ECE498@UIUC), http://courses.engr.illinois.edu/ece498/al/. Useful references: Programming Massively Parallel Processors: A Hands-on Approach, David B. Kirk and Wen-mei W. Hwu; http://www.gpgpu.it/ (CUDA tutorial).

GPGPU

What is (Historical) GPGPU? General-purpose computation using the GPU and the graphics API in applications other than 3D graphics: the GPU accelerates the critical path of the application. Data-parallel algorithms leverage GPU attributes: large data arrays and streaming throughput, fine-grain SIMD parallelism, low-latency floating-point (FP) computation. Applications (see GPGPU.org): game effects (FX), physics, image processing, physical modeling, computational engineering, matrix algebra, convolution, correlation, sorting.

GPGPU Constraints. Dealing with the graphics API means working with its corner cases: addressing modes (limited texture size/dimension), shader capabilities (limited outputs), instruction sets (lack of integer and bit operations), and communication limited to between pixels.

Why GPUs? The GPU has evolved into a very flexible and powerful processor: it is programmable using high-level languages, it now supports 32-bit and 64-bit IEEE-754 floating-point precision, it offers lots of GFLOPS, and there is a GPU in every PC and workstation.

What is behind such an evolution? The GPU is specialized for compute-intensive, highly parallel computation (exactly what graphics rendering is about), so more transistors can be devoted to data processing rather than data caching and flow control. [Figure: CPU vs. GPU die layout; the CPU devotes area to control logic and cache, the GPU to ALUs, and both attach to DRAM.] The fast-growing video game industry exerts strong economic pressure that forces constant innovation.

Application Domains. [Figure: the GPU (parallel computing) targets massive data parallelism and larger data sets, while the CPU (sequential computing) targets instruction-level parallelism and data that fits in cache; example domains include graphics, oil & gas, finance, medical, biophysics, numerics, audio, video, and imaging.]

GPUs. Each NVIDIA GPU has up to 448 parallel cores. Within each core: a floating-point unit, a logic unit (add, sub, mul, madd), a move/compare unit, and a branch unit. Cores are managed by a thread manager, which can spawn and manage 12,000+ threads per core with zero-overhead thread switching. [Figure: NVIDIA Fermi architecture.]

CUDA

CUDA Parallel Computing Architecture. CUDA (Compute Unified Device Architecture) is a parallel computing architecture and programming model. It includes a C compiler plus support for OpenCL and DX11 Compute, and it is architected to natively support all computational interfaces (standard languages and APIs). The NVIDIA GPU architecture accelerates CUDA: hardware and software are designed together for computing, to expose the computational horsepower of NVIDIA GPUs and enable general-purpose GPU computing.

CUDA is C for Parallel Processors. CUDA is industry-standard C with minimal extensions: you write a program for one thread and instantiate it on many parallel threads, using a familiar programming model and language. CUDA is a scalable parallel programming model: a program runs on any number of processors without recompiling. CUDA parallelism applies to both CPUs and GPUs: the same program source can be compiled to run on platforms with widely different parallelism, mapping CUDA threads to GPU threads or to CPU vectors.

A Highly Multithreaded Coprocessor. The GPU is a highly parallel compute coprocessor: it serves as a coprocessor for the host CPU and has its own device memory with a high-bandwidth interconnect.

CUDA Uses Extensive Multithreading. CUDA threads express fine-grained data parallelism: they map to GPU threads and virtualize the processors, and you must rethink your algorithms to be aggressively parallel. CUDA thread blocks express coarse-grained parallelism: blocks hold arrays of GPU threads, define shared-memory boundaries, and allow scaling between smaller and larger GPUs. GPUs execute thousands of lightweight threads (in graphics, each thread computes one pixel); one CUDA thread computes one result (or several results), with hardware multithreading and zero-overhead scheduling.

CUDA Kernels and Threads. Parallel portions of an application are executed on the device as kernels: one kernel is executed at a time, and many threads execute each kernel. Differences between CUDA and CPU threads: CUDA threads are extremely lightweight, with very little creation overhead and instant switching; CUDA uses thousands of threads to achieve efficiency, whereas multi-core CPUs can use only a few. Definitions: Device = GPU; Host = CPU; Kernel = function called from the host that runs on the device.

Arrays of Parallel Threads. A CUDA kernel is executed by an array of threads. All threads run the same program, SIMT (Single Instruction, Multiple Threads); each thread uses its ID to compute addresses and make control decisions: float x = input[threadID]; float y = func(x); output[threadID] = y;
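As a minimal sketch, the per-thread pattern above written as an actual kernel; the names input, output and func are placeholders from the slide, not a real API, and the bounds check is an added safety assumption.

// Each thread computes its own global ID and processes one element.
__device__ float func(float x) { return 2.0f * x; }   // example per-element operation (assumed)

__global__ void apply_func(const float *input, float *output, int n)
{
    int threadID = blockIdx.x * blockDim.x + threadIdx.x;  // global thread ID
    if (threadID < n) {                                     // guard against extra threads
        float x = input[threadID];
        float y = func(x);
        output[threadID] = y;
    }
}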

CUDA Programming Model. A kernel is executed by a grid, which contains blocks; these blocks contain our threads. A thread block is a batch of threads that can cooperate by sharing data through shared memory and by synchronizing their execution. Threads from different blocks operate independently. [Figure: the host launches Kernel 1 on Grid 1 and Kernel 2 on Grid 2 on the device; each grid is a 2D array of blocks, and each block, e.g. Block (1, 1), is a 2D array of threads.]
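A hypothetical host-side launch of the kernel sketched above, showing how the grid and block sizes are chosen; N, blockSize, d_input and d_output are assumptions for illustration, not names from the slides.

int N = 1 << 20;                                     // number of elements (assumed)
int blockSize = 256;                                 // threads per block
int numBlocks = (N + blockSize - 1) / blockSize;     // blocks in the grid (rounded up)
apply_func<<<numBlocks, blockSize>>>(d_input, d_output, N);  // launch the grid of blocks
cudaDeviceSynchronize();                             // wait for the kernel to finish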

Thread Blocks: Scalable Cooperation. Divide the monolithic thread array into multiple blocks: threads within a block cooperate via shared memory, while threads in different blocks cannot cooperate. This enables programs to transparently scale to any number of processors! [Figure: Block 0, Block 1, ..., Block N-1, each holding threads with IDs 0..7 that run the same body: float x = input[threadID]; float y = func(x); output[threadID] = y;]

Thread Cooperation. Thread cooperation is a powerful feature of CUDA: threads can cooperate via on-chip shared memory and synchronization. The on-chip shared memory within one block allows threads to share memory accesses (a drastic memory bandwidth reduction) and to share intermediate results (thus saving computation). It makes porting algorithms to GPUs a lot easier (vs. legacy GPGPU and its strict stream-processor model).
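A sketch of block-level cooperation through shared memory: each block cooperatively loads its elements into on-chip memory and reduces them to one partial sum. The kernel assumes it is launched with 256 threads per block; the names in, partial and block_sum are illustrative, not taken from the slides.

__global__ void block_sum(const float *in, float *partial)
{
    __shared__ float buf[256];                   // one shared-memory slot per thread in the block
    int tid = threadIdx.x;
    int gid = blockIdx.x * blockDim.x + tid;

    buf[tid] = in[gid];                          // cooperative load into shared memory
    __syncthreads();                             // make all loads visible to the whole block

    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (tid < stride)
            buf[tid] += buf[tid + stride];       // share intermediate results on-chip
        __syncthreads();
    }
    if (tid == 0)
        partial[blockIdx.x] = buf[0];            // one result per block, written to global memory
}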

Transparent Scalability. The hardware is free to schedule thread blocks on any processor, so kernels scale to any number of parallel multiprocessors. [Figure: the same kernel grid of Blocks 0-7 runs on Device A, which executes two blocks at a time, and on Device B, which executes four blocks at a time.]

Memory model seen from a CUDA kernel. Registers: per thread. Shared memory: shared among the threads of a single block; on-chip, small, as fast as registers. Global memory: kernel inputs and outputs reside here; off-chip, large, uncached (use coalescing). Note: the host can read and write global memory, but not shared memory. [Figure: a grid of blocks, each with per-thread registers and per-block shared memory, all connected to global memory and to the host.]
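A sketch of the host's view of this memory model: the host allocates global (device) memory, copies inputs in, launches a kernel, and copies results back; it never touches shared memory. The names h_in, h_out, N, numBlocks and blockSize are assumptions for illustration.

size_t bytes = N * sizeof(float);
float *d_in, *d_out;
cudaMalloc(&d_in,  bytes);                                // allocate global memory (off-chip)
cudaMalloc(&d_out, bytes);
cudaMemcpy(d_in, h_in, bytes, cudaMemcpyHostToDevice);    // host writes global memory
apply_func<<<numBlocks, blockSize>>>(d_in, d_out, N);     // kernel reads and writes it
cudaMemcpy(h_out, d_out, bytes, cudaMemcpyDeviceToHost);  // host reads the results back
cudaFree(d_in);
cudaFree(d_out);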

Execution Model. Kernels are launched in grids, and one kernel executes at a time. A block executes on one multiprocessor and does not migrate. Several blocks can reside concurrently on one multiprocessor; their number is limited by multiprocessor resources: the register file is partitioned among all resident threads, and shared memory is partitioned among all resident thread blocks.

Heterogeneous programming in CUDA. An application alternates serial code on the host with parallel kernels on the device: serial code, then the parallel kernel KernelA<<< nblk, ntid >>>(args); executed by many threads, then more serial code, then the parallel kernel KernelB<<< nblk, ntid >>>(args);.
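A sketch of that serial/parallel alternation in code. KernelA, KernelB, nblk, ntid and args are the placeholder names used on the slide; the host helper functions are assumptions for illustration.

prepare_data_on_host(args);               // serial code on the CPU (assumed helper)
KernelA<<<nblk, ntid>>>(args);            // parallel kernel: many GPU threads run it
cudaDeviceSynchronize();                  // wait before the next serial phase
postprocess_on_host(args);                // more serial code on the CPU (assumed helper)
KernelB<<<nblk, ntid>>>(args);            // second parallel kernel
cudaDeviceSynchronize();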

CUDA Advantages over Legacy GPGPU. Random-access, byte-addressable memory: a thread can access any memory location. Unlimited access to memory: a thread can read and write as many locations as needed. Shared memory (per block) and thread synchronization: threads can cooperatively load data into shared memory, and any thread can then access any shared-memory location. Low learning curve: just a few extensions to C, and no knowledge of graphics is required.

Compiling C for CUDA Applications. [Figure: the key kernels of the C CUDA source go through NVCC, producing CUDA object files, while the rest of the C application is compiled to CPU object files; the linker combines them into a single CPU-GPU executable.]

Compiling CUDA. Any source file containing CUDA language extensions must be compiled with NVCC. NVCC is a compiler driver: it works by invoking all the necessary tools and compilers, such as cudacc, g++, cl, ... NVCC outputs C code (host CPU code), which must then be compiled with the rest of the application using another tool, and PTX: either object code directly, or PTX source interpreted at runtime.

Linking. Any executable with CUDA code requires two dynamic libraries: the CUDA core library (cuda) and the CUDA runtime library (cudart).

Debugging Using the Device Emulation Mode. An executable compiled in device emulation mode (nvcc -deviceemu) runs completely on the host using the CUDA runtime: no device or CUDA driver is needed, and each device thread is emulated with a host thread. Running in device emulation mode, one can: use host native debug support (breakpoints, inspection, etc.); access any device-specific data from host code and vice versa; call any host function from device code (e.g. printf) and vice versa; and detect deadlock situations caused by improper usage of __syncthreads().

Device Emulation Mode Pitfalls. Emulated device threads execute sequentially, so simultaneous accesses of the same memory location by multiple threads could produce different results than on the real device. Dereferencing device pointers on the host, or host pointers on the device, can produce correct results in device emulation mode but will generate an error in device execution mode.