ECE 574 Cluster Computing Lecture 15


Vince Weaver
http://web.eece.maine.edu/~vweaver
vincent.weaver@maine.edu
30 March 2017

Announcements
HW#7 (MPI) posted.
Project topics due.
Update on the PAPI paper.

Raspberry Pi Cluster Construction
Imaging disks is slow: writing a 4GB image to an SD card takes 40 minutes or so.
It's not quite a commodity cluster, as it has a fairly complicated power distribution system (an ATX power supply feeds power boards that provide a measured 5V to the USB power sockets).
A bit time consuming to wire up all the cables.

Power distribution issues
An ATX power supply runs best when it has a PC-like power draw. Draw too much on the 5V rail without a corresponding 12V load and the 5V line droops low enough that the Pis won't boot.
Uses DHCP instead of hard-coding an IP address in the image. Why? It allows one common disk image for all nodes.
NFS filesystem: for MPI to work you need an identical file layout (including the executable) on all nodes. Using a cluster filesystem makes this easier.
ganglia: provides cluster stats via a web browser. Getting it working has been a huge issue.

Latency vs. Throughput
CPUs: low latency, low throughput.
GPUs: high latency, high throughput.
CPUs are optimized to get the lowest latency (caches); with no parallelism, you have to get memory back as soon as possible.
GPUs are optimized for throughput: the best throughput for all is better than low latency for one.

GPGPUs
Interfaces are needed, as GPU companies do not like to reveal what their chips do at the assembly level.
CUDA (Nvidia)
OpenCL (everyone else): can in theory take parallel code and map it to CPU, GPU, FPGA, DSP, etc.
OpenACC?

Programming
Typically textures are read-only. Some GPUs can render to a texture, which is the only way the GPU can share RAM without going through the CPU.
In general data is not written back until an entire chunk is done. The fragment processor can read memory as often as it wants, but cannot write back until done.
Only handles fixed-point or floating-point values.
Analogies:
Textures == arrays
Kernels == inner loops
Render-to-texture == feedback
Geometry/rasterization == computation, usually done as a simple grid (quadrilateral)
Texture coordinates == domain
Vertex coordinates == range

Flow Control, Branches
Only recently added to GPUs, and at a performance penalty.
Often a lot like ARM conditional execution.

Terminology (CUDA)
Thread: chunk of code running on the GPU.
Warp: group of threads running at the same time, in parallel.
Block: group of threads that need to run.
Grid: a group of thread blocks that need to finish before the next can be started.

Terminology (cores)
Confusing. Nvidia would say the GTX285 had 240 stream processors; what they mean is 30 cores with 8 SIMD units per core.

CUDA Programming
Since 2007. Use nvcc to compile.
*host* vs *device*: host code runs on the CPU, device code runs on the GPU.
Host code is compiled by the host compiler (gcc), device code by a custom NVidia compiler.

The __global__ qualifier on a function means it is passed to the CUDA compiler.
cudaMalloc() allocates device memory and gives back pointers that can be passed in.
Call a __global__ function like this: add<<<1,1>>>(args), where the first value inside the brackets is the number of blocks and the second is threads per block.
cudaFree() at the end.
Can get the block number with blockIdx.x and the thread index with threadIdx.x.
Can have 65536 blocks and 512 threads per block (at least in 2010).
Why threads vs. blocks? Shared memory is block specific; use __shared__ to declare it.
__syncthreads() is a barrier to make sure all threads in the block finish before continuing.
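As a rough illustration of block-shared memory and the barrier (a minimal sketch, not from the slides; the kernel name and sizes are invented):

#define BLOCK_SIZE 256

/* Each block reverses BLOCK_SIZE elements through __shared__ memory.
   __syncthreads() makes sure every thread's load has finished before
   any thread reads a neighbor's value. */
__global__ void reverse_block(int *d) {
        __shared__ int tmp[BLOCK_SIZE];        /* visible to the whole block */
        int i = threadIdx.x;
        tmp[i] = d[blockIdx.x * BLOCK_SIZE + i];
        __syncthreads();                       /* barrier */
        d[blockIdx.x * BLOCK_SIZE + i] = tmp[BLOCK_SIZE - 1 - i];
}

/* launched with, e.g., reverse_block<<<numblocks, BLOCK_SIZE>>>(dev_ptr); */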

CUDA Programming
See the NVIDIA CUDA C Programming Guide (the following material is from it).
CUDA = Compute Unified Device Architecture, introduced in 2006.
Heterogeneous programming: there is a host executing the main body of code (a CPU), and it dispatches code to run on a device (a GPU).
CUDA assumes the host and device each have their own separate DRAM memory.
CUDA C extends C: you define C functions, called kernels, that are executed N times in parallel by N CUDA threads.

CUDA Coding
Version compliance: you can check the version number. New versions support more hardware but sometimes drop old hardware.
nvcc is a wrapper around gcc. The __global__ code is compiled into the PTX (parallel thread execution) ISA.
You can code in PTX directly, which is sort of like assembly language. NVidia won't give out the actual assembly language. Why?
CUDA C has a mix of host and device code. nvcc compiles the __global__ code to PTX and compiles the <<< ... >>> calls into code that can launch the GPU code.
PTX code is JIT-compiled into native code by the device driver. You can control the JIT with environment variables.
Only a subset of C/C++ is supported in the device code.

CUDA Hardware
The GPU is an array of Streaming Multiprocessors (SMs).
A program is partitioned into blocks of threads that execute independently from each other.
The SM manages/schedules/executes threads in groups of 32 parallel threads called warps (weaving terminology, no relation).
Threads have their own PC, registers, etc., and can execute independently.
When an SM is given a thread block, it partitions it into warps and each warp gets scheduled.
One common instruction at a time. If threads diverge in control flow, each path is executed in turn and the threads not taking that path just wait.
Full context is stored with each warp; if a warp is not ready (waiting for memory) then it may be stopped and another warp that is ready can be run.
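A small sketch of the divergence case described above (illustrative only; the condition and the work done are made up):

__global__ void diverge(float *x) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        /* Threads within one warp split between the two branches, so the
           warp runs both paths one after the other, with the threads not
           on the current path sitting idle. */
        if (i % 2 == 0)
                x[i] = x[i] * 2.0f;
        else
                x[i] = x[i] + 1.0f;
}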

CUDA Threads
A kernel is defined using the __global__ declaration. When called, use <<<...>>> to specify the number of threads. Each thread that is launched is assigned a unique thread ID. Use threadIdx to find out what thread you are and act accordingly.

__global__ void VecAdd(float *A, float *B, float *C) {
        int i = threadIdx.x;
        C[i] = A[i] + B[i];
}

int main(int argc, char **argv) {
        ...
        /* Invoke N threads */
        VecAdd<<<1, N>>>(A, B, C);
}

threadIdx is a 3-component vector, so the threads can be seen as a 1, 2, or 3 dimensional block of threads (a thread block). Much like our Sobel code, you can treat the data as 1D (just x), 2D (thread id is (y*xsize)+x), or 3D (thread id is (z*xsize*ysize)+(y*xsize)+x). There is a slightly odd syntax for doing 2D or 3D:

__global__ void MatAdd(float A[N][N], float B[N][N], float C[N][N]) {
        int i = threadIdx.x;
        int j = threadIdx.y;
        C[i][j] = A[i][j] + B[i][j];
}

int numblocks = 1;
dim3 threadsperblock(N, N);
MatAdd<<<numblocks, threadsperblock>>>(A, B, C);

Each block is made up of the threads. You can have multiple levels of blocks too, and can get the block number with blockIdx.
Thread blocks operate independently, in any order. That way they can be scheduled across an arbitrary number of cores (depending on how fancy your GPU is).
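For problems bigger than one block, the usual pattern (a sketch, not from the slides; the kernel name is invented) combines the block number and the thread index into a global element index:

__global__ void VecAddBig(float *A, float *B, float *C, int n) {
        /* global index = block number * threads per block + thread index */
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)              /* guard: n may not be a multiple of blockDim.x */
                C[i] = A[i] + B[i];
}

/* e.g. VecAddBig<<<(n + 255) / 256, 256>>>(dA, dB, dC, n); */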

CUDA Memory
Per-thread private local memory.
Shared memory visible to the whole block (lifetime of the block).
Global memory.
Also constant and texture spaces. These have special rules; texture can do some filtering and such.
Global, constant, and texture memory are persistent across kernel launches by the same app.
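One way to picture the different spaces (an illustrative sketch; the variable names and sizes are made up):

__device__   float d_global[1024];    /* global memory: persists across kernel launches */
__constant__ float d_coeff[16];       /* constant memory: read-only from device code    */

__global__ void spaces(void) {
        __shared__ float s_buf[256];  /* shared memory: visible to the whole block      */
        float local = d_coeff[0];     /* per-thread local variable (usually a register) */
        s_buf[threadIdx.x] = local;
        __syncthreads();
        d_global[threadIdx.x] = s_buf[threadIdx.x];
}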

More Coding
No explicit initialization; it is done automatically the first time you do something (keep this in mind if timing).
Global memory: linear or arrays. Arrays are used for textures.
Linear memory is allocated with cudaMalloc() and freed with cudaFree(). To transfer data use cudaMemcpy().
Memory can also be allocated with cudaMallocPitch() or cudaMalloc3D(), for alignment reasons.
Access by symbol (?).
Shared memory: __shared__. Faster than global. There is also __device__.
Manually break your problem into smaller sizes.
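"Access by symbol" presumably refers to copying to variables declared at file scope (__constant__ or __device__ globals) without a separate cudaMalloc(); a hedged sketch with invented names:

__constant__ float d_filter[9];              /* device symbol at file scope */

void setup_filter(const float *h_filter) {
        /* copy host data directly to the device symbol */
        cudaMemcpyToSymbol(d_filter, h_filter, 9 * sizeof(float));
}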

Misc
You can lock host memory with cudaHostAlloc(). It is pinned and can't be paged out. The device can load/store it while a kernel is running, in some cases. Only so much is available.
It can be marked write-combining. That memory is not cached, so it is slow for the host to read (the host should only write to it), but it speeds up the PCI transaction.
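A sketch of allocating pinned, write-combined host memory (the size and flag choice are just for illustration):

float *h_buf;
/* pinned allocation; write-combined memory is fast for the host to
   write and for PCIe transfers, but slow for the host to read back */
cudaHostAlloc((void **)&h_buf, 1024 * sizeof(float), cudaHostAllocWriteCombined);
/* ... fill h_buf, cudaMemcpy() it to the device, launch kernels ... */
cudaFreeHost(h_buf);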

Async Concurrent Execution
Instead of the serial/parallel/serial/parallel model, you want CUDA work and the host running at the same time, or overlapping with memory transfers.
Concurrent host/device: calls are asynchronous and return to the host before the device is done.
Concurrent kernel execution: newer devices can run multiple kernels at once. This is a problem if they use lots of memory.
Overlap of data transfer and kernel execution.
Streams: a sequence of commands that execute in order, but which can be interleaved with other streams. There is a complicated way to set them up.
Synchronization and callbacks.
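A hedged sketch of two streams overlapping copies and kernel launches (dev, host, CHUNK, and my_kernel are placeholders; the host buffer should be pinned for the copies to actually overlap):

cudaStream_t s[2];
int j;

for (j = 0; j < 2; j++)
        cudaStreamCreate(&s[j]);

for (j = 0; j < 2; j++) {
        /* copy and kernel are queued on stream j; work queued in
           different streams may be interleaved by the hardware */
        cudaMemcpyAsync(dev + j * CHUNK, host + j * CHUNK,
                        CHUNK * sizeof(float), cudaMemcpyHostToDevice, s[j]);
        my_kernel<<<CHUNK / 256, 256, 0, s[j]>>>(dev + j * CHUNK);
}
cudaDeviceSynchronize();        /* wait for both streams to finish */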

Events
You can create performance events to monitor timing.
PAPI can read out performance counters on some boards.
Often it is for a full synchronous stream; you can't get values mid-operation.
NVML can measure power and temperature on some boards?
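Timing a kernel with CUDA events, roughly (a sketch; the launch shown is the add kernel from the code example at the end of the lecture):

cudaEvent_t start, stop;
float ms;

cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, 0);
add<<<N, 1>>>(dev_a, dev_b, dev_c);
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);              /* wait until the stop event completes */

cudaEventElapsedTime(&ms, start, stop);  /* elapsed time in milliseconds */
cudaEventDestroy(start);
cudaEventDestroy(stop);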

Multi-device Systems
You can switch between active devices.
More advanced systems can access each other's device memory.
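Switching the active device might look roughly like this (sketch only):

int count, d;

cudaGetDeviceCount(&count);
for (d = 0; d < count; d++) {
        cudaSetDevice(d);       /* later allocations/launches target device d */
        /* ... allocate memory and launch work for this device ... */
}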

Other Features
Unified virtual address space (on 64-bit machines).
Interprocess communication.
Error checking.
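Error checking in the runtime API is done by looking at return codes, and by polling after a kernel launch; a minimal sketch using the calls from the code example at the end of the lecture:

cudaError_t err;

err = cudaMalloc((void **)&dev_a, N * sizeof(int));
if (err != cudaSuccess)
        printf("cudaMalloc failed: %s\n", cudaGetErrorString(err));

add<<<N, 1>>>(dev_a, dev_b, dev_c);
err = cudaGetLastError();               /* picks up kernel-launch errors */
if (err != cudaSuccess)
        printf("launch failed: %s\n", cudaGetErrorString(err));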

Texture Memory
Complex.

3D Interop
Can make results go to an OpenGL or Direct3D buffer.
Can then use CUDA results in your graphics program.

Code Example

#include <stdio.h>

#define N 10

__global__ void add(int *a, int *b, int *c) {
        int tid = blockIdx.x;

        if (tid < N) {
                c[tid] = a[tid] + b[tid];
        }
}

int main(int argc, char **argv) {
        int a[N], b[N], c[N];
        int *dev_a, *dev_b, *dev_c;
        int i;

        /* Allocate memory on GPU */
        cudaMalloc((void **)&dev_a, N * sizeof(int));
        cudaMalloc((void **)&dev_b, N * sizeof(int));
        cudaMalloc((void **)&dev_c, N * sizeof(int));

        /* Fill the host arrays with values */
        for (i = 0; i < N; i++) {
                a[i] = -i;
                b[i] = i * i;
        }

        cudaMemcpy(dev_a, a, N * sizeof(int), cudaMemcpyHostToDevice);
        cudaMemcpy(dev_b, b, N * sizeof(int), cudaMemcpyHostToDevice);

        add<<<N, 1>>>(dev_a, dev_b, dev_c);

        cudaMemcpy(c, dev_c, N * sizeof(int), cudaMemcpyDeviceToHost);

        /* results */
        for (i = 0; i < N; i++) {
                printf("%d+%d=%d\n", a[i], b[i], c[i]);
        }

        cudaFree(dev_a);
        cudaFree(dev_b);
        cudaFree(dev_c);

        return 0;
}