CUDA GPGPU Workshop 2012: CUDA/GPGPU Architecture & Programming


CUDA GPGPU Workshop 2012: CUDA/GPGPU Architecture & Programming. Yip, Wichita State University, 7/11/2012

GPU Hardware Perspective
- GPU as a PCI device
- Original PCI
- PCIe
- Inside the GPU architecture

GPU as PCI Device
- A traditional PC connects to devices through two nodes (the North and South Bridges)
- The Southbridge serves as a concentrator for slower I/O devices
- The GPU was originally connected as a peripheral device

GPU as PCI Device
- PCI devices are mapped into the CPU's physical address space
- Addresses are assigned to the PCI devices at boot time
- Devices are polled for input

PCIe Bus
- PCIe now serves as the backbone
- The Northbridge/Southbridge are PCIe switches
- Some PCIe cards are PCI cards with a PCI-to-PCIe bridge

GPU and PCIe
- Devices connect to the same switch; each card links to a central switch
- Dedicated links (buses) for PCIe devices
- Messages are packet-switched over virtual channels

GPU Architecture
The GPU (the chip itself) consists of a group of Streaming Multiprocessors (SMs). Inside each SM:
- 32 cores (sharing the same instruction)
- 64KB shared memory (shared among the 32 cores)
- 32K 32-bit registers
- 2 warp schedulers (to schedule instructions)
- 4 special function units

GPU Architecture
Each core supports:
- Logic operations
- Move, Compare, Branch
- Fused Multiply-Add for single (32-bit) and double (64-bit) precision

GPU Architecture
Therefore, one SM can perform:
- 32 x 32-bit floating-point operations/clock
- 16 x 64-bit floating-point operations/clock
- 32 x 32-bit integer operations/clock
Each SM shares the same instruction. Each GPU chip has a different number of SMs; on high-end consumer cards, the GTX 480 has 15 SMs and the GTX 680 has 8 (larger) SMX units. This translates directly into performance, and the per-card values can be queried at run time (see the sketch below).
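A minimal sketch (an illustration, not from the slides) that queries the SM count and related per-device limits through the standard CUDA runtime call cudaGetDeviceProperties():

#include <cstdio>
#include <cuda_runtime.h>

int main(void)
{
    int count = 0;
    cudaGetDeviceCount(&count);                 // number of CUDA-capable GPUs
    for (int i = 0; i < count; ++i) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, i);      // fill in per-device properties
        printf("Device %d: %s\n", i, prop.name);
        printf("  SMs:                  %d\n",  prop.multiProcessorCount);
        printf("  Shared mem per block: %zu bytes\n", prop.sharedMemPerBlock);
        printf("  Registers per block:  %d\n",  prop.regsPerBlock);
        printf("  Warp size:            %d\n",  prop.warpSize);
    }
    return 0;
}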

CPU and GPU Comparison
CPU: Low Latency
- Fast access to memory through caching
- Control logic for out-of-order and speculative execution
- High clock speed
GPU: High Throughput
- High latency, high throughput
- Relatively simple control logic
- Massively parallel ALUs

CPU and GPU Comparison
- The CPU architecture reduces memory access time within each thread to enhance throughput
- The GPU architecture increases the number of concurrent threads to enhance throughput
- Optionally, think of CPU threads as roughly equivalent to warps (a warp being a collection of threads); each warp processes 32 CUDA threads concurrently (see the sketch below)
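A minimal device-side sketch (an illustration, not from the slides) showing how a thread can derive its warp and lane from the built-in threadIdx and warpSize variables; the kernel and array names are made up:

__global__ void show_warp_layout(int *warp_of, int *lane_of)
{
    int tid  = blockIdx.x * blockDim.x + threadIdx.x; // global thread index
    int warp = threadIdx.x / warpSize;                // warp within the block
    int lane = threadIdx.x % warpSize;                // position within the warp (0..31)
    warp_of[tid] = warp;
    lane_of[tid] = lane;
}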

Threads and Blocks
- GPU threads are lightweight
- The GPU executes a massive number of threads concurrently
- Branching performance is low

Threads and Blocks
- CUDA threads are grouped into blocks; this optimizes the use of memory
- The instruction sent by the host to the GPU is called a kernel
- The GPU sees a kernel as a grid of blocks of threads (see the sketch below)
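A minimal sketch (an illustration, not from the slides) of the grid/block/thread hierarchy: the host launches a grid of blocks, and each thread computes its own global index from its block and thread coordinates. The kernel name and the block size of 256 are assumptions for illustration:

__global__ void fill(int *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index in the grid
    if (i < n)                                      // guard against extra threads
        out[i] = i;
}

// Host side: N elements handled by a grid of 256-thread blocks
// int threadsPerBlock = 256;
// int blocks = (N + threadsPerBlock - 1) / threadsPerBlock;
// fill<<<blocks, threadsPerBlock>>>(dev_out, N);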

Execution of Threads and Blocks
- Each CUDA thread executes on one core
- Depending on the memory requirements of a kernel, multiple blocks may execute on each SM
- Each kernel can only be executed by one device (unless the programmer intervenes)
- Multiple kernels may be executed at one time

Execution of Kernel
The host (CPU) executes a kernel on the GPU in 4 steps.
1. The CPU allocates memory on the GPU and copies data to it. CUDA API: cudaMalloc(), cudaMemcpy()

Execution of Kernel
2. The CPU sends function parameters and instructions to the GPU. CUDA API: myfunc<<<Blocks, Threads>>>(parameters)

Execution of Kernel
3. The GPU executes the instructions as scheduled in warps.
4. Results are copied back to host memory (RAM) using cudaMemcpy().

Questions? References (including images): http://www.cc.gatech.edu/~vetter/keeneland/tutorial-2012-02-20/06-cuda-overview.pdf

CUDA Memory
Memory hierarchy of an NVIDIA GPU.
Speed: Registers > Shared Memory > Constant Cache > Texture Memory > Device Memory
Memory sizes:
- Registers (per SM): 32K 32-bit registers
- Shared memory (per SM): 64KB
- Constant
- Texture
- Device memory (per card): up to 6GB

CUDA Memory
- Recall: each block consists of a set of threads
- Each thread is given a small piece of local memory
- Blocks do not span more than one SM
- Each block owns a shared memory

Memory Hierarchy
- Local memory for each thread is private; its lifespan is only the thread's execution
- Shared memory is accessible among the threads of each block; its lifespan is only the kernel call
- In the CUDA API: __shared__ int a[size]; (see the sketch below)
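A minimal sketch (an illustration, not from the slides) of declaring and using shared memory inside a kernel. The block size, kernel name, and reversal operation are assumptions; __syncthreads() ensures every thread has written its element before any thread reads another's:

#define BLOCK 256

__global__ void reverse_in_block(int *d)
{
    __shared__ int s[BLOCK];                    // one copy per block, visible to its threads
    int t = threadIdx.x;
    s[t] = d[blockIdx.x * BLOCK + t];           // stage this thread's element in shared memory
    __syncthreads();                            // wait until every thread in the block has written
    d[blockIdx.x * BLOCK + t] = s[BLOCK - 1 - t]; // read a value written by another thread
}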

Memory Hierarchy
- Lifespan of global memory: global memory stays reserved until the programmer calls cudaFree()
- Global memory is accessible among blocks
- Higher latency; off-chip memory
- cudaMalloc() allocates memory here

Memory Hierarchy
- Note that every GPU board has its own memory
- In CUDA, the programmer is responsible for memory allocation and free operations
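Since each board carries its own device memory, a program can select a device and check how much of that memory is currently free. A minimal sketch (an illustration, not from the slides) using the standard cudaSetDevice() and cudaMemGetInfo() calls:

#include <cstdio>
#include <cuda_runtime.h>

int main(void)
{
    size_t free_bytes = 0, total_bytes = 0;
    cudaSetDevice(0);                           // select the first GPU
    cudaMemGetInfo(&free_bytes, &total_bytes);  // query its device memory
    printf("Device 0: %zu of %zu bytes free\n", free_bytes, total_bytes);
    return 0;
}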

A CUDA Example
Using a CUDA kernel involves two parts:
1. add<<<1, N>>>(dev_a, dev_b, dev_c);
   N tells CUDA how many threads execute this function; in this case it generates N threads in each block.
2. __global__ void add(int *a, int *b, int *c) { c[threadIdx.x] = a[threadIdx.x] + b[threadIdx.x]; }
   This is the CUDA kernel (GPU code).
The number of thread blocks in a grid and the number of threads in a thread block are controlled by the programmer through the variables passed to the kernel launch command.

A CUDA Example
With add() running in parallel, let's do vector addition.
- A kernel can refer to its thread's index with the variable threadIdx.x
- Each thread adds a value from a[] and b[], storing the result in c[]:
  __global__ void add(int *a, int *b, int *c) { c[threadIdx.x] = a[threadIdx.x] + b[threadIdx.x]; }
- By using threadIdx.x to index the arrays, each thread handles different indices

A CUDA Example
We write this code:
__global__ void add(int *a, int *b, int *c) { c[threadIdx.x] = a[threadIdx.x] + b[threadIdx.x]; }
add<<<1, N>>>(dev_a, dev_b, dev_c);
With N = 4, this is what runs in parallel on the device:
Thread 0: c[0] = a[0] + b[0];
Thread 1: c[1] = a[1] + b[1];
Thread 2: c[2] = a[2] + b[2];
Thread 3: c[3] = a[3] + b[3];

A CUDA Example
__global__ void add(int *a, int *b, int *c) { c[threadIdx.x] = a[threadIdx.x] + b[threadIdx.x]; }

int main(void)
{
    int a, b, c;                    // host copies of a, b, c
    int *dev_a, *dev_b, *dev_c;     // device copies of a, b, c
    int size = sizeof(int);         // we need space for an integer

    // allocate device copies of a, b, c
    cudaMalloc((void**)&dev_a, size);
    cudaMalloc((void**)&dev_b, size);
    cudaMalloc((void**)&dev_c, size);

    a = 2;
    b = 7;

A CUDA Example (continued)
    // copy inputs to device
    cudaMemcpy(dev_a, &a, size, cudaMemcpyHostToDevice);
    cudaMemcpy(dev_b, &b, size, cudaMemcpyHostToDevice);

    // launch add() kernel on the GPU, passing parameters
    add<<<1, 1>>>(dev_a, dev_b, dev_c);

    // copy device result back to host copy of c
    cudaMemcpy(&c, dev_c, size, cudaMemcpyDeviceToHost);

    cudaFree(dev_a);
    cudaFree(dev_b);
    cudaFree(dev_c);
    return 0;
}

Another Example: Matrix Multiplication
Problem: recall that each block can hold a maximum of 512 threads. What if the matrix contains more than 512 elements?
Solution: group the threads into blocks.

Another Example
In this example, we attempt to group a 16x16 matrix into 16 4x4 tiles.

// CUDA Kernel
__global__ void matrixMul(float *C, float *A, float *B, int wA, int wB)
{
    // 1. 2D thread ID
    int tx = blockIdx.x * TILE_SIZE + threadIdx.x;
    int ty = blockIdx.y * TILE_SIZE + threadIdx.y;
    ..
}
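The slide elides the rest of the kernel body. For reference, a minimal naive (untiled) sketch of what such a kernel can look like; the loop, the TILE_SIZE value, and the indexing scheme are illustrative assumptions, not the workshop's actual code, and dimensions are assumed to be multiples of TILE_SIZE:

#define TILE_SIZE 4   // assumed tile edge for this sketch

// Naive sketch: each thread computes one element of C = A x B.
// A is (hA x wA), B is (wA x wB), C is (hA x wB), stored row-major.
__global__ void matrixMulNaive(float *C, float *A, float *B, int wA, int wB)
{
    int tx = blockIdx.x * TILE_SIZE + threadIdx.x;   // column of C
    int ty = blockIdx.y * TILE_SIZE + threadIdx.y;   // row of C

    float sum = 0.0f;
    for (int k = 0; k < wA; ++k)                     // dot product of row ty of A
        sum += A[ty * wA + k] * B[k * wB + tx];      // with column tx of B

    C[ty * wB + tx] = sum;
}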

Another Example
Calling the kernel:

// setup execution parameters
dim3 threads(BLOCK_SIZE, BLOCK_SIZE);
dim3 grid(WC / threads.x, HC / threads.y);   // BLOCK_SIZE = 16, WC = 16, HC = 16

// execute the kernel
matrixMul<<<grid, threads>>>(d_C, d_A, d_B, WA, WB);
