Cartoon parallel architectures; CPUs and GPUs


Cartoon parallel architectures; CPUs and GPUs. CSE 6230, Fall 2014, Thu Sep 11. Thanks to Jee Choi (a senior PhD student) for a big assist.


[Slides 14-15 (figures): a cartoon CPU/GPU comparison. Roughly, a GPU corresponds to a socket, an SMX to a core, and SIMT to hardware multithreading plus SIMD. Intel E5-2687W (Sandy Bridge-EP) vs. NVIDIA K20X (Kepler): ~500 GF/s vs. ~4 TF/s single precision, ~50 GB/s vs. ~250 GB/s memory bandwidth, with a ~6 GB/s link between host and device.]

System Comparison: Intel Xeon E5-2687W vs. NVIDIA K20X

                                Intel Xeon E5-2687W             NVIDIA K20X                   Difference
  # cores / SMXs                8                               14                            1.75x
  Clock frequency (max)         3.8 GHz                         735 MHz                       0.20x
  SIMD width / thread procs.    256 bits                        2688 SP + 896 DP              --
  Performance (single prec.)    8 cores x 3.8 GHz x             2688 x 735 MHz x 2 (FMA)      8.12x
                                (8 Add + 8 Mul) = 486 GF/s      = 3951 GF/s
  Performance (double prec.)    8 cores x 3.8 GHz x             896 x 735 MHz x 2 (FMA)       5.42x
                                (4 Add + 4 Mul) = 243 GF/s      = 1317 GF/s
  Memory bandwidth              51.2 GB/s                       250 GB/s                      4.88x
  TDP                           150 W                           235 W                         1.57x
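For reference, the peak rates in the table follow from the table's own entries (this is just the arithmetic written out, not an extra measurement):

  peak FLOP/s = (# floating-point lanes) x (clock frequency) x (flops per lane per cycle)

  CPU, single precision: 8 cores x (8 Add + 8 Mul) lanes x 3.8 GHz  ~ 486 GF/s
  GPU, single precision: 2688 thread processors x 2 (FMA) x 735 MHz ~ 3951 GF/s, about 8.1x the CPU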


CUDA is NVIDIA's implementation of this execution model.

Thread hierarchy: single instruction, multiple threads (SIMT)

An example to compare models

  Naïve:   for (i = 0; i < n; i++) A[i] += 2;

  OpenMP:  #pragma omp parallel for
           for (i = 0; i < n; i++) A[i] += 2;

  CUDA, with N threads:  int i = f(global ID);
                         A[i] += 2;
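As a concrete version of the CUDA column (a minimal sketch; the kernel name is mine, and f(global ID) is taken to be the usual block/thread index computation):

__global__ void add2(int *A, int n)
{
    // global ID: one thread per array element
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)           // guard in case n is not a multiple of the block size
        A[i] += 2;
}

// Launched with enough blocks to cover n elements, e.g.:
//   int tb = 256;
//   add2<<<(n + tb - 1) / tb, tb>>>(d_A, n);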

Global IDs (figure): four blocks of four threads each; blockIdx.x runs 0..3 and threadIdx.x runs 0..3 within each block, so the global ID threadIdx.x + blockIdx.x * blockDim.x runs 0..15 and indexes the array A.

Thread hierarchy
  Given a 3-D grid of blocks:
    there are (gridDim.x * gridDim.y * gridDim.z) blocks in the grid
    each block's position is identified by blockIdx.x, blockIdx.y, and blockIdx.z
  Similarly, for a 3-D block:
    its dimensions are blockDim.x, blockDim.y, blockDim.z
    each thread's position within the block is threadIdx.x, threadIdx.y, threadIdx.z
  The thread-to-data mapping depends on how the work is divided amongst the threads.
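For example, one common thread-to-data mapping flattens the 3-D indices into a single global linear index. A sketch (not from the slides):

__device__ int globalLinearId(void)
{
    // linear block index within the 3-D grid
    int block  = blockIdx.x + gridDim.x * (blockIdx.y + gridDim.y * blockIdx.z);
    // linear thread index within the 3-D block
    int thread = threadIdx.x + blockDim.x * (threadIdx.y + blockDim.y * threadIdx.z);
    int threadsPerBlock = blockDim.x * blockDim.y * blockDim.z;
    return block * threadsPerBlock + thread;
}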

Memory hierarchy
  thread variables : local memory
  block            : shared memory
  grid             : global memory
  constant memory (read-only)
  texture memory (read-only)
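In CUDA C these levels correspond to declaration qualifiers; a minimal illustration (the names are made up, and the kernel assumes a 1-D block of at most 256 threads):

__constant__ float coeff[16];            // constant memory: read-only inside kernels

__global__ void levels(float *g)         // g points to global memory
{
    __shared__ float tile[256];          // shared memory: visible to the whole block
    float x = g[threadIdx.x];            // x lives in a register / local memory (per thread)
    tile[threadIdx.x] = x * coeff[0];
    __syncthreads();
    g[threadIdx.x] = tile[threadIdx.x];
}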

CUDA by example: basic CUDA code

__global__ void test(int *in, int *out, int N)
{
    int gid = threadIdx.x + blockDim.x * blockIdx.x;
    out[gid] = in[gid];
}

int main(int argc, char **argv)
{
    int N = 1048576;
    int tbsize = 256;
    int nblocks = N / tbsize;

    dim3 grid(nblocks);
    dim3 block(tbsize);

    /* d_in, d_out are device pointers (allocated on the next slide) */
    test<<<grid, block>>>(d_in, d_out, N);
    cudaThreadSynchronize();
}

CUDA by example: basic CUDA code (host side)

int main(int argc, char **argv)
{
    /* allocate memory for host and device */
    int *h_in, *h_out, *d_in, *d_out;
    h_in  = (int *) malloc(N * sizeof(int));
    h_out = (int *) malloc(N * sizeof(int));
    cudaMalloc((void **) &d_in,  N * sizeof(int));   /* allocate memory on device */
    cudaMalloc((void **) &d_out, N * sizeof(int));

    /* copy data from host to device (CPU to GPU) */
    cudaMemcpy(d_in, h_in, N * sizeof(int), cudaMemcpyHostToDevice);

    /* body of the problem here */
    ...

    /* copy data back to host (GPU to CPU) */
    cudaMemcpy(h_out, d_out, N * sizeof(int), cudaMemcpyDeviceToHost);

    /* free memory */
    free(h_in);
    free(h_out);
    cudaFree(d_in);
    cudaFree(d_out);
}

CUDA by example: what is this code doing?

__global__ void mysteryFunction(int *in)
{
    int tidx, tidy, gidx, gidy;
    tidx = threadIdx.x;
    tidy = threadIdx.y;
    gidx = tidx + blockDim.x * blockIdx.x;
    gidy = tidy + blockDim.y * blockIdx.y;

    __shared__ int buffer[16][16];

    buffer[tidx][tidy] = in[gidx + gidy * blockDim.x * gridDim.x];
    __syncthreads();

    int temp;
    if (tidx > 0 && tidy > 0) {
        temp = (buffer[tidx][tidy - 1] +
                buffer[tidx][tidy + 1] +
                buffer[tidx - 1][tidy] +
                buffer[tidx + 1][tidy] +
                buffer[tidx][tidy]) / 5;
    } else {
        /* take care of boundary conditions */
    }
    in[gidx + gidy * blockDim.x * gridDim.x] = temp;
}

Note the __shared__ buffer: why do we need it here?

Synchronization
  Within a block: via __syncthreads();
  Global synchronization: there is implicit synchronization between kernels; the only way to synchronize globally is to finish the grid and start another grid.
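A minimal sketch of the "finish the grid" pattern, with two hypothetical kernels phase1 and phase2 that must be globally ordered:

__global__ void phase1(float *d) { /* first pass over d */ }
__global__ void phase2(float *d) { /* second pass; needs all of phase1's results */ }

void run(float *d_data, int n)
{
    dim3 block(256), grid((n + 255) / 256);
    // Kernels launched into the same stream execute in issue order, so every
    // block of phase1 finishes before any block of phase2 starts: that launch
    // boundary is the global synchronization point.
    phase1<<<grid, block>>>(d_data);
    phase2<<<grid, block>>>(d_data);
    cudaDeviceSynchronize();   // host-side wait for both kernels to complete
}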

Scheduling
  Each block gets scheduled on a multiprocessor (SMX):
    there is no guarantee of the order in which blocks get scheduled
    blocks run independently of each other
  Multiple blocks can reside on a single SMX simultaneously (occupancy):
    the number of resident blocks is determined by resource usage and availability (shared memory and registers)
  Once scheduled, each block runs to completion.

Execution
  Minimum unit of execution: the warp (typically 32 threads)
  At any given time multiple warps will be executing; they could be from the same or different blocks
  A warp of threads can be either executing or waiting (for data, or for its turn)
  When a warp stalls, it can be switched out almost instantaneously so that another warp can start executing (hardware multithreading)

Performance Notes: Thread Divergence
  On a branch, threads in a warp can diverge
    execution is serialized: threads taking one branch execute while the others idle
  Avoid divergence!!!
    use bitwise operations when possible
    diverge at the granularity of warps (no penalty)
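A hypothetical illustration of the last point (assuming a warp size of 32 and a block size that is a multiple of 32): branching on the thread index diverges within a warp, while branching on the warp index does not.

__global__ void divergent(int *out)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    // Divergent: even and odd lanes of the same warp take different paths,
    // so the two paths are serialized.
    if (tid % 2 == 0) out[tid] = 1;
    else              out[tid] = 2;
}

__global__ void warpAligned(int *out)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    // Not divergent: all 32 threads of a warp share the same (tid / 32),
    // so every warp takes exactly one path.
    if ((tid / 32) % 2 == 0) out[tid] = 1;
    else                     out[tid] = 2;
}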

Performance Notes: Occupancy
  Occupancy = # resident warps / max # warps
    the number of resident warps is determined by per-thread register and per-block shared memory usage
    the max # of warps is specific to the hardware generation
  More warps means more threads with which to hide latency
    increases the chance of keeping the GPU busy at all times
    does not necessarily mean better performance
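A worked example (not from the slides, assuming Kepler-class limits of 64 resident warps, 65,536 registers, and 48 KB of shared memory per SMX): consider a kernel launched with 256-thread blocks that uses 64 registers per thread and 12 KB of shared memory per block. Registers allow 65,536 / (256 x 64) = 4 resident blocks and shared memory allows 48 / 12 = 4, so 4 blocks = 1024 threads = 32 warps are resident, for an occupancy of 32 / 64 = 50%.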

Performance Notes: Bandwidth Utilization
  Reading from DRAM occurs at the granularity of 128-byte transactions
    requests are further decomposed into aligned cache lines
      read-only cache: 128 bytes
      L2 cache: 32 bytes
  Minimize loading redundant cache lines to maximize bandwidth utilization
    aligned access to memory
    sequential access pattern
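A hypothetical before/after sketch: the strided version touches a full 128-byte line for every useful 4-byte word, while the coalesced version lets a warp's 32 consecutive 4-byte loads map onto a single 128-byte transaction.

// Strided: thread t reads a[t * stride]; for stride >= 32, each load pulls in
// a separate 128-byte line of which only 4 bytes are used.
__global__ void strided(const float *a, float *out, int stride)
{
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    out[t] = a[t * stride];
}

// Coalesced: consecutive threads read consecutive elements, so a warp's
// 32 x 4-byte loads fall into one aligned 128-byte transaction.
__global__ void coalesced(const float *a, float *out)
{
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    out[t] = a[t];
}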


Backup

GPU Architecture

Performance Notes: Bandwidth Utilization II
  Little's Law: L = λW
    L = average number of customers in a store
    λ = arrival rate
    W = average time spent
  Applied to memory bandwidth: λ is the bandwidth and W is the latency, so keeping the memory system busy requires tens of thousands of in-flight requests!!!
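To make this concrete (a back-of-the-envelope estimate, assuming the ~250 GB/s from the comparison table and a global-memory latency of roughly 400 ns, which is an assumed ballpark rather than a number from the slides): L = λW ~ 250 GB/s x 400 ns ~ 100 KB of data in flight at all times. That is about 800 outstanding 128-byte transactions, or, at one 4-byte load per thread, roughly 25,000 threads with a pending request, hence the "tens of thousands of in-flight requests".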

In summary
  Use as many cheap threads as possible
    maximizes occupancy
    increases the number of memory requests
  Avoid divergence
    if unavoidable, diverge at the warp level
  Use aligned and sequential data access patterns
    minimize redundant data loads

CUDA by example: Quicksort
  Let's now consider quicksort on a GPU.
  Step 1: partition the initial list
    how do we partition the list amongst blocks?
    recall that blocks CANNOT co-operate and blocks can run in ANY order
    however, we need MANY threads and blocks in order to see good performance

CUDA by example Quicksort 4 2 3 5 6 1 9 3 4 7 6 5 9 8 3 1 block 0 block 1 block 2 block 3

CUDA by example Quicksort thr 0 thr 1 thr 0 thr 1 thr 0 thr 1 thr 0 thr 1 4 2 3 5 6 1 9 3 4 7 6 5 9 8 3 1 block 0 block 1 block 2 block 3

CUDA by example Quicksort thr 0 thr 1 thr 0 thr 1 thr 0 thr 1 thr 0 thr 1 4 2 3 5 6 1 9 3 4 7 6 5 9 8 3 1 block 0 block 1 block 2 block 3 < pivot (5) 2 1 0 2 1 0 1 1 >= pivot (5) 0 1 2 0 1 2 1 1

CUDA by example Quicksort thr 0 thr 1 thr 0 thr 1 thr 0 thr 1 thr 0 thr 1 4 2 3 5 6 1 9 3 4 7 6 5 9 8 3 1 block 0 block 1 block 2 block 3 < pivot 2 3 0 2 1 1 1 2 >= pivot 0 1 2 2 1 3 1 2 Do a cumulative sum on < pivot and >= pivot This should be done in shared memory in parallel

CUDA by example Quicksort thr 0 thr 1 thr 0 thr 1 thr 0 thr 1 thr 0 thr 1 4 2 3 5 6 1 9 3 4 7 6 5 9 8 3 1 block 0 block 1 block 2 block 3 < pivot 2 3 0 2 1 1 1 2 >= pivot 0 1 2 2 1 3 1 2 This tells us how much space and where each block needs to store its values

CUDA by example Quicksort thr 0 thr 1 thr 0 thr 1 thr 0 thr 1 thr 0 thr 1 4 2 3 5 6 1 9 3 4 7 6 5 9 8 3 1 block 0 block 1 block 2 block 3 < pivot 2 3 >= pivot 0 1 temporary array start end

CUDA by example Quicksort thr 0 thr 1 thr 0 thr 1 thr 0 thr 1 thr 0 thr 1 4 2 3 5 6 1 9 3 4 7 6 5 9 8 3 1 block 0 block 1 block 2 block 3 < pivot 2 3 atomic fetch-and-add (FAA) >= pivot 0 1 temporary array start end

CUDA by example Quicksort thr 0 thr 1 thr 0 thr 1 thr 0 thr 1 thr 0 thr 1 4 2 3 5 6 1 9 3 4 7 6 5 9 8 3 1 block 0 block 1 block 2 block 3 < pivot 2 3 atomic fetch-and-add (FAA) >= pivot 0 1 temporary array start end

CUDA by example Quicksort thr 0 thr 1 thr 0 thr 1 thr 0 thr 1 thr 0 thr 1 4 2 3 5 6 1 9 3 4 7 6 5 9 8 3 1 block 0 block 1 block 2 block 3 < pivot 2 3 atomic fetch-and-add (FAA) >= pivot 0 1 temporary array start end 4 3 2 5
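A hedged sketch of this partition step (one element per thread, and a per-block atomic count standing in for the shared-memory prefix sum; the names and simplifications are mine, not from the slides). frontOffset and backOffset are single ints in global memory, initialized to zero before the launch.

__global__ void partitionStep(const int *in, int *tmp, int n, int pivot,
                              int *frontOffset, int *backOffset)
{
    __shared__ int nLess, nGeq;          // this block's counts
    __shared__ int frontBase, backBase;  // this block's slots in tmp

    int gid = blockIdx.x * blockDim.x + threadIdx.x;

    if (threadIdx.x == 0) { nLess = 0; nGeq = 0; }
    __syncthreads();

    // Count the block's elements on each side of the pivot; pos is the
    // thread's index within its block's "< pivot" or ">= pivot" group.
    int pos = -1, v = 0;
    if (gid < n) {
        v = in[gid];
        if (v < pivot) pos = atomicAdd(&nLess, 1);
        else           pos = atomicAdd(&nGeq, 1);
    }
    __syncthreads();

    // One thread reserves the block's space in the temporary array with FAA:
    // "< pivot" values grow from the front, ">= pivot" values from the back.
    if (threadIdx.x == 0) {
        frontBase = atomicAdd(frontOffset, nLess);
        backBase  = atomicAdd(backOffset, nGeq);
    }
    __syncthreads();

    if (gid < n) {
        if (v < pivot) tmp[frontBase + pos] = v;
        else           tmp[n - 1 - (backBase + pos)] = v;   // fill from the back
    }
}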

CUDA by example: Quicksort (continued)
  Phew. That was just the first part. It is repeated until there are enough independent partitions to assign one to each block.
  In the next part, each block does something similar, minus the FAA.
  When the sequences become small enough, you can sort them with an alternative sorting algorithm (e.g., bitonic sort).