
ECE 574 Cluster Computing Lecture 17 Vince Weaver http://web.eece.maine.edu/~vweaver vincent.weaver@maine.edu 28 March 2019

Announcements: HW#8 (CUDA) posted. Project topics due.

Installing CUDA: on Linux you need to install the proprietary NVIDIA drivers; on Debian you have to enable the non-free repository. There have been debates over the years about whether NVIDIA can legally ship proprietary drivers; no one has sued yet. (It depends on whether they count as a derived work or not. Linus refuses to weigh in.)

GPU hardware for the class: NVIDIA Quadro hardware, workstation rather than gaming, so higher reliability, FP? ECC RAM (maybe?)
NVIDIA Quadro P400 in the Haswell-EP machine: 2GB GDDR5, 64-bit, up to 32 GB/s; 256 cores, Pascal architecture; 30W, OpenGL 4.5, DirectX 12.0
NVIDIA Quadro K2200 in the Quadro machine: 4GB GDDR5, 128-bit, 80 GB/s; 640 cores, Maxwell architecture; 68W, OpenGL 4.5, DirectX 11.2

Programming a GPGPU: Create a kernel, which is a small GPU program that runs on a single thread; it will be run on many cores at a time. Allocate memory on the GPU and copy the input data to it. Launch the kernel to run many times in parallel. The threads operate in lockstep, all executing the same instruction in each thread. How is conditional execution handled? A lot like on ARM: if/then/else. If a particular thread does not meet the condition, it just does nothing until the other path finishes executing. If more threads are needed than are available on the GPU, you may need to break the problem up into smaller batches of threads. Once computing is done, copy the results back to the CPU.

GPU vs GPGPU concepts: GPUs only handle fixed-point or floating-point values. You can use textures as arrays, though they are typically read-only. Some GPUs can render to a texture, which is the only way the GPU can share RAM without going through the CPU. In general, data is not written back until the entire chunk is done; the fragment processor can read memory as often as it wants, but cannot write back until it is done. Analogies: textures == arrays, kernels == inner loops, render-to-texture == feedback, geometry-rasterization == computation (done as a simple grid/quadrilateral), texture coordinates == domain, vertex coordinates == range (usually).

Flow Control, Branches: only recently added to GPUs, and at a performance penalty. Often handled a lot like ARM conditional execution.
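As a hedged illustration (not from the original slides; the kernel name and data layout are made up), a data-dependent branch like the one below causes warp divergence: threads in the same warp that take different paths are serialized, with the threads not on the current path idling until it finishes.

/* Hypothetical kernel illustrating warp divergence: threads in a warp
   take different branches, so the hardware runs each path in turn.   */
__global__ void divergent(int *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        if (data[i] % 2 == 0) {
            data[i] = data[i] * 2;    /* even elements: one path  */
        } else {
            data[i] = data[i] + 1;    /* odd elements: other path */
        }
    }
}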

NVIDIA Terminology (CUDA). Thread: a chunk of code running on the GPU. Warp: a group of threads running at the same time in parallel; AMD calls this a wavefront. Block: a group of threads that need to run. Grid: a group of thread blocks that need to finish before the next can be started.

Terminology (cores): confusing. NVIDIA would say the GTX285 had 240 stream processors; what they mean is 30 cores with 8 SIMD units per core.

CUDA Programming: around since 2006. Use nvcc to compile .cu files. Note it is technically C++, so watch for things like new. *host* vs *device*: host code runs on the CPU, device code runs on the GPU. Host code is compiled by the host compiler (gcc), device code by a custom NVIDIA compiler.

The __global__ qualifier on a function means it is passed to the CUDA compiler. cudaMalloc() allocates device memory, and the resulting pointers can be passed in to the kernel. Call a __global__ function like this: add<<<1,1>>>(args), where the first number inside the brackets is the number of blocks and the second is the threads per block. cudaFree() at the end. You can get the block number with blockIdx.x and the thread index with threadIdx.x. Can have 65536 blocks and 512 threads per block (at least as of 2010). Why threads vs blocks? Shared memory is block specific; use __shared__ to specify it. __syncthreads() is a barrier to make sure all threads finish before continuing.
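A minimal sketch (not from the slides; the kernel name and the reversal operation are just illustrative) of how __shared__ and __syncthreads() are typically used together: each thread loads one element into the block's shared memory, the barrier guarantees all loads are finished, and then threads can safely read each other's values.

#define BLOCKSIZE 256

/* Hypothetical example: reverse the elements within each block, using
   shared memory as a scratchpad.  Assumes n is a multiple of BLOCKSIZE. */
__global__ void reverse_block(int *d) {
    __shared__ int s[BLOCKSIZE];        /* one copy per block          */
    int t = threadIdx.x;
    int i = blockIdx.x * blockDim.x + t;

    s[t] = d[i];                        /* each thread loads one item  */
    __syncthreads();                    /* barrier: all loads are done */
    d[i] = s[blockDim.x - 1 - t];       /* read a neighbor's element   */
}

This would be launched as something like reverse_block<<<n/BLOCKSIZE, BLOCKSIZE>>>(dev_d) for an n-element device array dev_d.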

CUDA Programming: see the NVIDIA CUDA C Programming Guide. CUDA stands for Compute Unified Device Architecture. (The following is from the CUDA C Programming Guide from NVIDIA.) CUDA was introduced in 2006. Heterogeneous programming: there is a host executing a main body of code (a CPU) and it dispatches code to run on a device (a GPU). CUDA assumes the host and device each have their own separate DRAM memory (newer cards can share an address space via VM tricks). CUDA C extends C: you define C functions, called kernels, that are executed N times in parallel by N CUDA threads.

CUDA Debugging: you can download a special cuda-gdb from NVIDIA. Plain printf debugging doesn't really work.

CUDA Coding: version compliance, you can check the version number. New versions support more hardware but sometimes drop old. nvcc is a wrapper around gcc. __global__ code is compiled into the PTX (parallel thread execution) ISA. You can code in PTX directly, which is sort of like assembly language; NVIDIA won't give out the actual assembly language. Why? CUDA C has a mix of host and device code. nvcc compiles the __global__ stuff to PTX and compiles the <<<...>>> into code that can launch the GPU code. PTX code is JIT compiled into native code by the device driver. You can control the JIT with environment variables. Only a subset of C/C++ is supported in the device code.

CUDA Hardware: the GPU is an array of Streaming Multiprocessors (SMs). A program is partitioned into blocks of threads, and blocks execute independently from each other. The SM manages/schedules/executes threads in groups of 32 parallel threads called warps (a weaving term; no relation). Threads have their own PC, registers, etc., and can execute independently. When an SM is given a thread block, it partitions it into warps and each warp gets scheduled, one common instruction at a time. If the threads diverge in control flow, each path is executed and a thread not taking that path just waits. The full context is stored with each warp; if a warp is not ready (e.g. waiting for memory) then it may be stopped and another warp that's ready can be run.

CUDA Threads: a kernel is defined using the __global__ declaration. When called, use <<<...>>> to specify the number of threads. Each thread that is called is assigned a unique thread ID; use threadIdx to find what thread you are and act accordingly.

__global__ void VecAdd(float *A, float *B, float *C) {
    int i = threadIdx.x;
    if (i < N)                  /* don't execute out of bounds */
        C[i] = A[i] + B[i];
}

int main(int argc, char **argv) {
    ...
    /* Invoke N threads */
    VecAdd<<<1, N>>>(A, B, C);
}

threadIdx is a 3-component vector, so the thread block can be seen as 1, 2, or 3 dimensional. Much like our sobel code, you can look at it as 1D (just x), 2D (thread id is (y*xsize)+x), or 3D (thread id is (z*xsize*ysize)+(y*xsize)+x). There is a slightly weird syntax for doing 2 or 3 dimensions:

__global__ void MatAdd(float A[N][N], float B[N][N], float C[N][N]) {
    int i = threadIdx.x;
    int j = threadIdx.y;
    C[i][j] = A[i][j] + B[i][j];
}

int numBlocks = 1;
dim3 threadsPerBlock(N, N);
MatAdd<<<numBlocks, threadsPerBlock>>>(A, B, C);

Each block is made up of the threads. You can have multiple levels of blocks too, and can get the block number with blockIdx. Thread blocks operate independently, in any order; that way they can be scheduled across an arbitrary number of cores (depending on how fancy your GPU is).
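When more than one block is used, the global index combines blockIdx, blockDim, and threadIdx. A minimal sketch (illustrative names, not code from the slides) of MatAdd extended to a grid of blocks:

/* Illustrative variant: launch a 2D grid of 16x16-thread blocks so the
   matrix can be larger than a single block.                            */
__global__ void MatAddBlocks(float A[N][N], float B[N][N], float C[N][N]) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    if (i < N && j < N)
        C[i][j] = A[i][j] + B[i][j];
}

dim3 threadsPerBlock(16, 16);
dim3 numBlocks((N + 15) / 16, (N + 15) / 16);   /* round up to cover N */
MatAddBlocks<<<numBlocks, threadsPerBlock>>>(A, B, C);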

CUDA Memory: per-thread private local memory; shared memory visible to the whole block (with the lifetime of the block), which is like a scratchpad and also faster; and global memory. There are also constant and texture spaces, which have special rules (texture can do some filtering and such). Global, constant, and texture memory are persistent across kernel launches by the same app.
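As a hedged sketch of how the constant space is typically used (the symbol name, size, and kernel are made up for illustration), a small read-only table can be placed in __constant__ memory and filled from the host with cudaMemcpyToSymbol():

__constant__ float coeffs[16];          /* read-only table in constant memory */

__global__ void scale(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= coeffs[i % 16];
}

/* host side: copy the table into the constant space before launching */
float host_coeffs[16] = { /* ... */ };
cudaMemcpyToSymbol(coeffs, host_coeffs, sizeof(host_coeffs));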

More Coding: there is no explicit initialization; it is done automatically the first time you do something (keep that in mind if timing). Global memory can be linear or arrays (arrays are textures). Linear memory is allocated with cudaMalloc() and freed with cudaFree(); to transfer, use cudaMemcpy(). It can also be allocated with cudaMallocPitch() or cudaMalloc3D() for alignment reasons. Access by symbol (?). Shared memory is declared with __shared__ and is faster than global; there is also __device__. You manually break your problem into smaller sizes.
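A minimal sketch (with made-up sizes; host_img is assumed to be a tightly-packed host array) of the pitched allocation mentioned above: cudaMallocPitch() pads each row so rows stay aligned, and the returned pitch, not the logical width, is used to step between rows.

/* Allocate a 2D array of width x height floats with padded rows. */
float *dev_img;
size_t pitch;                 /* actual row size in bytes, >= width*sizeof(float) */
int width = 500, height = 300;

cudaMallocPitch((void **)&dev_img, &pitch, width * sizeof(float), height);

/* Copy a tightly-packed host image into the pitched device allocation. */
cudaMemcpy2D(dev_img, pitch,
             host_img, width * sizeof(float),        /* host rows are unpadded */
             width * sizeof(float), height,
             cudaMemcpyHostToDevice);

/* In a kernel, row y then starts at (float *)((char *)dev_img + y * pitch). */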

Misc: you can lock host memory with cudaHostAlloc(). The memory is pinned and can't be paged out; in that case the device can load/store it while a kernel is running. Only so much is available. It can be marked write-combining: not cached, so it is slow for the host to read (the host should only write it) but it speeds up the PCI transaction.
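A hedged sketch of the pinned and write-combining allocations described above (the buffer size is arbitrary); cudaFreeHost() releases memory obtained this way:

float *pinned, *wc;
size_t bytes = 1024 * 1024 * sizeof(float);

/* Pinned host memory: cannot be paged out, enables faster/async copies. */
cudaHostAlloc((void **)&pinned, bytes, cudaHostAllocDefault);

/* Write-combining pinned memory: fast for the host to write and for PCI
   transfers, but very slow for the host to read back.                   */
cudaHostAlloc((void **)&wc, bytes, cudaHostAllocWriteCombined);

cudaFreeHost(pinned);
cudaFreeHost(wc);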

Async Concurrent Execution: instead of a serial/parallel/serial/parallel model, you want CUDA and the host running at the same time, or overlapping with memory transfers. Concurrent host/device: calls are async and return to the host before the device is done. Concurrent kernel execution: newer devices can run multiple kernels at once (a problem if they use lots of memory). Overlap of data transfer and kernel execution. Streams: a sequence of commands that execute in order, but that can be interleaved with other streams; there is a somewhat complicated way to set them up. Synchronization and callbacks are available.
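A minimal sketch of two streams (the kernel, array names, and launch parameters are illustrative): each stream's copy and kernel execute in order, but the two streams may overlap with each other. The async copies need pinned host memory to actually overlap with execution.

cudaStream_t s0, s1;
cudaStreamCreate(&s0);
cudaStreamCreate(&s1);

/* Queue copy + kernel + copy-back in each stream; the streams may overlap. */
cudaMemcpyAsync(dev_a, host_a, bytes, cudaMemcpyHostToDevice, s0);
process<<<blocks, threads, 0, s0>>>(dev_a);
cudaMemcpyAsync(host_a, dev_a, bytes, cudaMemcpyDeviceToHost, s0);

cudaMemcpyAsync(dev_b, host_b, bytes, cudaMemcpyHostToDevice, s1);
process<<<blocks, threads, 0, s1>>>(dev_b);
cudaMemcpyAsync(host_b, dev_b, bytes, cudaMemcpyDeviceToHost, s1);

cudaStreamSynchronize(s0);      /* wait for each stream to finish */
cudaStreamSynchronize(s1);
cudaStreamDestroy(s0);
cudaStreamDestroy(s1);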

Events: you can create performance events to monitor timing. PAPI can read out performance counters on some boards; often it's for a full synchronous stream, so you can't get values mid-operation. NVML can measure power and temperature on some boards?
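A hedged sketch of timing a kernel with CUDA events (the kernel and launch parameters are placeholders): events are recorded into the stream before and after the work, and the elapsed time between them is read back in milliseconds.

cudaEvent_t start, stop;
float ms;

cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, 0);              /* record into the default stream */
my_kernel<<<blocks, threads>>>(dev_data);
cudaEventRecord(stop, 0);

cudaEventSynchronize(stop);             /* wait until the stop event completes */
cudaEventElapsedTime(&ms, start, stop); /* elapsed time in milliseconds */

cudaEventDestroy(start);
cudaEventDestroy(stop);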

Multi-device systems: you can switch the active device. More advanced systems can access each other's device memory.
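A brief hedged sketch of what that can look like (assumes at least two GPUs): cudaSetDevice() selects the active device, and peer access between devices can be enabled where the hardware supports it.

int count, can_access;
cudaGetDeviceCount(&count);                 /* how many GPUs are present      */

cudaSetDevice(0);                           /* subsequent calls use GPU 0     */
cudaDeviceCanAccessPeer(&can_access, 0, 1); /* can GPU 0 reach GPU 1's memory? */
if (can_access)
    cudaDeviceEnablePeerAccess(1, 0);       /* second argument: flags, must be 0 */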

Other features: unified virtual address space (on 64-bit machines), interprocess communication, error checking.
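For the error-checking piece, a common hedged pattern (shown as a fragment, with placeholder names) is to check the cudaError_t returned by API calls and to query cudaGetLastError() after a kernel launch:

cudaError_t err;

err = cudaMalloc((void **)&dev_a, bytes);
if (err != cudaSuccess)
    fprintf(stderr, "cudaMalloc failed: %s\n", cudaGetErrorString(err));

my_kernel<<<blocks, threads>>>(dev_a);
err = cudaGetLastError();                   /* launch/configuration errors */
if (err != cudaSuccess)
    fprintf(stderr, "kernel launch failed: %s\n", cudaGetErrorString(err));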

Texture Memory: complex.

3D Interop: you can make the results go to an OpenGL or Direct3D buffer, and then use the CUDA results in your graphics program.

Code Example

#include <stdio.h>

#define N 10

__global__ void add(int *a, int *b, int *c) {
    int tid = blockIdx.x;
    if (tid < N) {
        c[tid] = a[tid] + b[tid];
    }
}

int main(int argc, char **argv) {
    int a[N], b[N], c[N];
    int *dev_a, *dev_b, *dev_c;
    int i;

    /* Allocate memory on GPU */
    cudaMalloc((void **)&dev_a, N * sizeof(int));
    cudaMalloc((void **)&dev_b, N * sizeof(int));
    cudaMalloc((void **)&dev_c, N * sizeof(int));

    /* Fill the host arrays with values */
    for (i = 0; i < N; i++) {
        a[i] = -i;
        b[i] = i * i;
    }

    cudaMemcpy(dev_a, a, N * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(dev_b, b, N * sizeof(int), cudaMemcpyHostToDevice);

    add<<<N, 1>>>(dev_a, dev_b, dev_c);

    cudaMemcpy(c, dev_c, N * sizeof(int), cudaMemcpyDeviceToHost);

    /* results */
    for (i = 0; i < N; i++) {
        printf("%d+%d=%d\n", a[i], b[i], c[i]);
    }

    cudaFree(dev_a);
    cudaFree(dev_b);
    cudaFree(dev_c);

    return 0;
}

Code Examples: go through the examples. Also show off nvidia-smi.