Cartoon parallel architectures; CPUs and GPUs

1 Cartoon parallel architectures; CPUs and GPUs. CSE 6230, Fall 2014, Th Sep 11. Thanks to Jee Choi (a senior PhD student) for a big assist.

15-17 Rough CPU equivalents of the GPU's parts (figure labels only): ~ socket; ~ core; ~ HWMT+SIMD ("SIMT")

18-20 Intel E5-2687W (Sandy Bridge-EP) vs. NVIDIA K20X (Kepler), peak compute: ~ 500 GF/s (single) vs. ~ 4 TF/s (single)

21-24 Intel E5-2687W (Sandy Bridge-EP) vs. NVIDIA K20X (Kepler), memory bandwidth: ~ 50 GB/s vs. ~ 250 GB/s; a third figure label reads 6 GB/s (unlabeled in the extracted text, presumably the link between the two)

25 System Comparison

                          Intel Xeon E5-2687W         NVIDIA K20X          Difference
  # Cores / SMX           ...                         ...                  ...
  Clock frequency (max)   3.8 GHz                     735 MHz              ...
  SIMD width              ... bits                    ...                  ...
  Thread processors       ...                         2688 SP, ... DP      ...
  Performance             8 cores x 3.8 GHz           ... x ... MHz        8.12x
  (single precision)      x (8 Add + 8 Mul) = ...     x 2 (FMA) = ...
  Performance             8 cores x 3.8 GHz           ... x ... MHz        5.42x
  (double precision)      x (4 Add + 4 Mul) = ...     x 2 (FMA) = ...
  Memory bandwidth        51.2 GB/s                   250 GB/s             4.88x
  TDP                     150 W                       235 W                1.57x

28 6 GB/s

36 CUDA is NVIDIA's implementation of this execution model

37 Thread hierarchy: single instruction, multiple thread (SIMT)

38 An example to compare models

  Naïve:
    for (i = 0; i < n; i++) A[i] += 2;

  OpenMP:
    #pragma omp parallel for
    for (i = 0; i < n; i++) A[i] += 2;

  CUDA, with N threads:
    int i = f(global thread ID);
    A[i] += 2;

39-40 Global IDs: blockIdx.x and threadIdx.x together give each thread its global thread ID, which indexes the array A (figure)

41 Thread hierarchy
  Given a 3-D grid of thread blocks
    there are (gridDim.x * gridDim.y * gridDim.z) thread blocks in the grid
    each block's position is identified by blockIdx.x, blockIdx.y, and blockIdx.z
  Similarly for a 3-D thread block
    blockDim.x, blockDim.y, blockDim.z
    threadIdx.x, threadIdx.y, threadIdx.z
  Thread-to-data mapping depends on how the work is divided amongst the threads
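The index arithmetic can be tried out on the CPU. The sketch below is plain C++ that emulates a 1-D launch with two nested loops, one over blocks and one over the threads of a block, and applies the `A[i] += 2` update from the earlier comparison slide. `Dim3`, `global_id_1d`, `blocks_in_grid`, and `add_two_emulated` are hypothetical names for illustration, not CUDA API; on a real GPU the loop iterations would run as concurrent threads.

```cpp
#include <cstddef>
#include <vector>

// Stand-in for CUDA's dim3, for illustration only.
struct Dim3 { unsigned x, y, z; };

// Same arithmetic as in a kernel: threadIdx.x + blockDim.x * blockIdx.x.
unsigned global_id_1d(unsigned block_idx_x, unsigned block_dim_x, unsigned thread_idx_x) {
    return thread_idx_x + block_dim_x * block_idx_x;
}

// Number of thread blocks in a 3-D grid.
unsigned blocks_in_grid(const Dim3& grid_dim) {
    return grid_dim.x * grid_dim.y * grid_dim.z;
}

// Emulate a 1-D launch of grid_dim_x blocks with block_dim_x threads each.
void add_two_emulated(std::vector<int>& a, unsigned grid_dim_x, unsigned block_dim_x) {
    for (unsigned b = 0; b < grid_dim_x; ++b) {
        for (unsigned t = 0; t < block_dim_x; ++t) {
            unsigned gid = global_id_1d(b, block_dim_x, t);
            if (gid < a.size()) {   // threads past the end of the array do nothing
                a[gid] += 2;
            }
        }
    }
}
```

With 10 elements, 3 blocks, and 4 threads per block, every element is updated exactly once and the last two threads of the last block are idle.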

42 Memory hierarchy
  thread: variables, local memory
  thread block: shared memory
  grid: global memory, constant memory (read-only), texture memory (read-only)

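43 CUDA by example: Basic CUDA code

    __global__ void test (int* in, int* out, int N) {
      int gid = threadIdx.x + blockDim.x * blockIdx.x;
      if (gid < N)   /* guard: the last block may be only partly used */
        out[gid] = in[gid];
    }

    int main (int argc, char** argv) {
      int N = ...;   /* problem size */
      int tbsize = 256;

      int nblocks = (N + tbsize - 1) / tbsize;   /* round up so every element is covered */

      dim3 grid (nblocks);
      dim3 block (tbsize);

      /* d_in and d_out are device buffers, allocated as on the next slide */
      test <<<grid, block>>> (d_in, d_out, N);
      cudaDeviceSynchronize ();   /* replaces the deprecated cudaThreadSynchronize */
    }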

44 CUDA by example: Basic CUDA code

    int main (int argc, char** argv) {
      /* allocate memory for host and device */
      int *h_in, *h_out, *d_in, *d_out;
      h_in = (int*) malloc (N * sizeof (int));
      h_out = (int*) malloc (N * sizeof (int));
      cudaMalloc ((void**) &d_in, N * sizeof (int));    /* allocate memory on device */
      cudaMalloc ((void**) &d_out, N * sizeof (int));

      /* copy data from host to device (CPU to GPU) */
      cudaMemcpy (d_in, h_in, N * sizeof (int), cudaMemcpyHostToDevice);

      /* body of the problem here */
      ...

      /* copy data back to host (GPU to CPU) */
      cudaMemcpy (h_out, d_out, N * sizeof (int), cudaMemcpyDeviceToHost);

      /* free memory */
      free (h_in); free (h_out);
      cudaFree (d_in); cudaFree (d_out);
    }

45 CUDA by example: What is this code doing?

    __global__ void mysteryfunction (int* in) {
      int tidx, tidy, gidx, gidy;
      tidx = threadIdx.x; tidy = threadIdx.y;
      gidx = tidx + blockDim.x * blockIdx.x;
      gidy = tidy + blockDim.y * blockIdx.y;

      __shared__ int buffer[16][16];   /* assumes 16 x 16 thread blocks */

      buffer[tidx][tidy] = in[gidx + gidy * blockDim.x * gridDim.x];
      __syncthreads ();

      int temp = buffer[tidx][tidy];
      /* interior threads only: the +1 neighbors must also stay inside the tile */
      if (tidx > 0 && tidx < 15 && tidy > 0 && tidy < 15) {
        temp = (buffer[tidx][tidy - 1] + buffer[tidx][tidy + 1] +
                buffer[tidx - 1][tidy] + buffer[tidx + 1][tidy] +
                buffer[tidx][tidy]) / 5;
      } else {
        /* take care of boundary conditions */
      }
      in[gidx + gidy * blockDim.x * gridDim.x] = temp;
    }

46 CUDA by example: What is this code doing? (same kernel as on slide 45) Callout on the __shared__ buffer: shared memory, why do we need this?
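One way to read the mystery kernel is as a 5-point averaging stencil applied to each 16 x 16 tile; the slide leaves this as a question, so treat that reading as an interpretation. Under that reading, every input value is needed by up to five threads, which is a plausible reason for staging the tile in shared memory once instead of re-reading global memory. The plain C++ sketch below computes the same interior update for one tile on the CPU and can serve as a reference when checking a GPU version. `Tile` and `five_point_average` are hypothetical names, and leaving boundary cells unchanged is an assumption, since the slide does not say how boundaries are handled.

```cpp
#include <array>

constexpr int TILE = 16;
using Tile = std::array<std::array<int, TILE>, TILE>;

// Average each interior cell with its four neighbors (integer division, as in the kernel).
Tile five_point_average(const Tile& in) {
    Tile out = in;   // boundary cells keep their input value in this sketch
    for (int x = 1; x < TILE - 1; ++x) {
        for (int y = 1; y < TILE - 1; ++y) {
            out[x][y] = (in[x][y - 1] + in[x][y + 1] +
                         in[x - 1][y] + in[x + 1][y] +
                         in[x][y]) / 5;
        }
    }
    return out;
}
```

Note that the function reads from `in` and writes to a separate `out`, whereas the kernel writes back into the array it read from after the `__syncthreads()` barrier.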

47 Synchronization
  Within a thread block
    via __syncthreads ();
  Global synchronization
    implicit synchronization between kernels
    the only way to synchronize globally is to finish the grid and start another grid

48 Scheduling
  Each thread block gets scheduled on a multiprocessor (SMX)
    there is no guarantee about the order in which blocks get scheduled
    thread blocks run independently of each other
  Multiple thread blocks can reside on a single SMX simultaneously (occupancy)
    the number of resident blocks is determined by resource usage and availability (shared memory and registers)
  Once scheduled, each thread block runs to completion

49 Execution
  Minimum unit of execution: warp
    typically 32 threads
  At any given time, multiple warps will be executing
    they could be from the same or different thread blocks
  A warp of threads could be either
    executing
    waiting (for data or their turn)
  When a warp stalls, it can be switched out instantaneously so that another warp can start executing
    hardware multi-threading

50 Performance Notes: Thread Divergence
  On a branch, threads in a warp can diverge
    execution is serialized
    threads taking one branch execute while the others idle
  Avoid divergence!!!
    use bitwise operations when possible
    diverge at the granularity of warps (no penalty)
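The two pieces of advice above can be illustrated on the CPU. The plain C++ sketch below counts how many warps would diverge for a given branch condition (a warp diverges when its threads disagree on the branch), and shows a branch replaced by a bitwise select. The warp size of 32 is taken from the Execution slide; `count_divergent_warps` and `select_branchless` are hypothetical helper names, and this is only a model of the behavior, not a measurement.

```cpp
#include <algorithm>
#include <functional>

constexpr unsigned WARP_SIZE = 32;

// Count warps whose threads do not all take the same side of the branch.
unsigned count_divergent_warps(unsigned n_threads,
                               const std::function<bool(unsigned)>& takes_branch) {
    unsigned divergent = 0;
    for (unsigned first = 0; first < n_threads; first += WARP_SIZE) {
        unsigned last = std::min(first + WARP_SIZE, n_threads);
        bool any_taken = false;
        bool any_not_taken = false;
        for (unsigned t = first; t < last; ++t) {
            if (takes_branch(t)) any_taken = true;
            else any_not_taken = true;
        }
        if (any_taken && any_not_taken) ++divergent;
    }
    return divergent;
}

// Pick a when cond is true and b otherwise, using a bit mask instead of a branch.
int select_branchless(bool cond, int a, int b) {
    int mask = -static_cast<int>(cond);   // all ones if cond is true, zero otherwise
    return (a & mask) | (b & ~mask);
}
```

With 256 threads, branching on `t % 2 == 0` makes all 8 warps diverge, while branching on `(t / 32) % 2 == 0` makes none diverge, because the condition is then constant within each warp.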

51 Performance Notes: Occupancy
  Occupancy = # resident warps / max # warps
    # resident warps is determined by per-thread register and per-block shared memory usage
    max # warps is specific to the hardware generation
  More warps means more threads with which to hide latency
    increases the chance of keeping the GPU busy at all times
    does not necessarily mean better performance
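The occupancy formula can be turned into a small calculator. The plain C++ sketch below is a simplified model: it finds how many blocks fit on one multiprocessor under each resource limit, takes the smallest, and divides the resulting warp count by the hardware maximum. It ignores allocation granularity, so real tools may report slightly different numbers. `SmLimits` and `occupancy` are hypothetical names, and the Kepler-class limits used in the example (64 warps, 16 blocks, 65536 registers, 48 KB of shared memory per SMX) are assumptions that do not come from the slides.

```cpp
#include <algorithm>

// Per-multiprocessor limits; the values are supplied by the caller.
struct SmLimits {
    unsigned max_warps;
    unsigned max_blocks;
    unsigned registers;
    unsigned shared_bytes;
};

// Occupancy = resident warps / max warps. threads_per_block must be greater than 0.
double occupancy(unsigned threads_per_block, unsigned regs_per_thread,
                 unsigned shared_bytes_per_block, const SmLimits& lim) {
    const unsigned warps_per_block = (threads_per_block + 31) / 32;
    const unsigned by_warps = lim.max_warps / warps_per_block;
    const unsigned by_regs = regs_per_thread > 0
        ? lim.registers / (regs_per_thread * warps_per_block * 32)
        : lim.max_blocks;
    const unsigned by_shared = shared_bytes_per_block > 0
        ? lim.shared_bytes / shared_bytes_per_block
        : lim.max_blocks;
    const unsigned blocks = std::min({by_warps, by_regs, by_shared, lim.max_blocks});
    return static_cast<double>(blocks * warps_per_block) / lim.max_warps;
}
```

Under the assumed limits, a 256-thread block using 32 registers per thread reaches full occupancy, doubling the register use to 64 per thread halves it, and using 16 KB of shared memory per block caps it at 3 resident blocks.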

52 Performance Notes: Bandwidth Utilization
  Reading from DRAM occurs at the granularity of 128-byte transactions
    requests are further decomposed into aligned cache lines
      read-only cache: 128 bytes
      L2 cache: 32 bytes
  Minimize loading redundant cache lines to maximize bandwidth utilization
    aligned access to memory
    sequential access pattern
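To see why aligned, sequential access matters, one can count how many 128-byte segments a single warp touches for a given access pattern. The plain C++ sketch below does that for a warp in which lane k reads `access_bytes` bytes starting at `first_byte + k * stride_bytes`. The 128-byte granularity and the warp size of 32 come from the slides; `segments_touched` is a hypothetical helper, and the model ignores caching effects.

```cpp
#include <cstddef>
#include <set>

// Number of distinct aligned segments touched by one warp-wide read.
std::size_t segments_touched(std::size_t first_byte, std::size_t stride_bytes,
                             std::size_t access_bytes, std::size_t warp_size,
                             std::size_t segment_bytes) {
    std::set<std::size_t> segments;
    for (std::size_t lane = 0; lane < warp_size; ++lane) {
        const std::size_t start = first_byte + lane * stride_bytes;
        segments.insert(start / segment_bytes);                       // segment of first byte
        segments.insert((start + access_bytes - 1) / segment_bytes);  // segment of last byte
    }
    return segments.size();
}
```

For 4-byte elements, an aligned unit-stride read touches 1 segment, the same read shifted by one element touches 2, and a stride of 128 bytes touches 32, so in the last case only 4 of every 128 bytes fetched are useful.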

56 Backup

57 GPU Architecture

58-60 Performance Notes: Bandwidth Utilization II
  Little's Law: L = λW
    L = average number of customers in a store
    λ = arrival rate
    W = average time spent
  Applied to memory bandwidth
    λ = bandwidth
    W = latency
    tens of thousands of in-flight requests!!!
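Little's Law can be worked through with the bandwidth figure from the comparison slides. The plain C++ sketch below multiplies bandwidth by latency to get the number of bytes that must be in flight, then divides by the request size. The 250 GB/s (that is, 250 bytes per nanosecond) comes from the slides; the 500 ns latency and the 4-byte request size used in the example are assumptions chosen only for illustration, and both function names are hypothetical.

```cpp
// L = lambda * W: bytes that must be in flight to sustain the given bandwidth.
long long bytes_in_flight(long long bandwidth_bytes_per_ns, long long latency_ns) {
    return bandwidth_bytes_per_ns * latency_ns;
}

// Number of requests needed to cover that many bytes (rounded up).
long long requests_in_flight(long long bytes, long long bytes_per_request) {
    return (bytes + bytes_per_request - 1) / bytes_per_request;
}
```

Under those assumptions, `bytes_in_flight(250, 500)` gives 125000 bytes, and `requests_in_flight(125000, 4)` gives 31250 outstanding 4-byte requests, which matches the slide's "tens of thousands" in order of magnitude.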

61 In summary
  Use as many cheap threads as possible
    maximizes occupancy
    increases the number of memory requests
  Avoid thread divergence
    if unavoidable, diverge at the warp level
  Use aligned and sequential data access patterns
    minimize redundant data loads

62 CUDA by example: Quicksort
  Let's now consider quicksort on a GPU
  Step 1: partition the initial list
    how do we partition the list amongst thread blocks?
    recall that thread blocks CANNOT co-operate and thread blocks can go in ANY order
    however, we need to have MANY threads and thread blocks in order to see good performance

63-71 CUDA by example: Quicksort (figure sequence)
  The list is divided among block 0, block 1, block 2, and block 3, and within each block among its threads (thr 0, thr 1, ...)
  Elements are classified as < pivot or >= pivot (the pivot in the figure is 5)
  Do a cumulative sum on < pivot and >= pivot
    this should be done in shared memory, in parallel
    this tells us how much space each block needs and where it needs to store its values
  Each block claims its range (start, end) in the temporary array with an atomic fetch-and-add (FAA)
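The partitioning step can be emulated on the CPU to make the bookkeeping concrete. In the plain C++ sketch below, each "block" counts its elements below the pivot, reserves space in the temporary array with a fetch-and-add, and then writes its elements: smaller ones grow from the front, the rest fill from the back. This is a sketch under several assumptions: blocks run one after another here, `std::atomic::fetch_add` stands in for a GPU atomic, and the per-block cumulative sum over threads is replaced by a simple serial loop. `partition_by_blocks` is a hypothetical name.

```cpp
#include <algorithm>
#include <atomic>
#include <cstddef>
#include <vector>

// Partition `in` around `pivot`; `split` receives the number of elements < pivot.
// block_size must be greater than 0.
std::vector<int> partition_by_blocks(const std::vector<int>& in, int pivot,
                                     std::size_t block_size, std::size_t& split) {
    std::vector<int> out(in.size());
    std::atomic<std::size_t> lo_used{0};   // slots claimed at the front
    std::atomic<std::size_t> hi_used{0};   // slots claimed at the back

    for (std::size_t begin = 0; begin < in.size(); begin += block_size) {
        const std::size_t end = std::min(begin + block_size, in.size());

        // Count this block's elements on each side of the pivot.
        std::size_t n_lo = 0;
        for (std::size_t i = begin; i < end; ++i) {
            if (in[i] < pivot) ++n_lo;
        }
        const std::size_t n_hi = (end - begin) - n_lo;

        // Reserve output ranges with fetch-and-add; the order of blocks does not matter.
        std::size_t lo_pos = lo_used.fetch_add(n_lo);
        std::size_t hi_pos = in.size() - hi_used.fetch_add(n_hi) - n_hi;

        // Scatter the block's elements into its reserved ranges.
        for (std::size_t i = begin; i < end; ++i) {
            if (in[i] < pivot) out[lo_pos++] = in[i];
            else out[hi_pos++] = in[i];
        }
    }
    split = lo_used.load();
    return out;
}
```

Because each block only needs its own counts and two atomic reservations, no block has to wait for another, which fits the constraint that blocks cannot co-operate and may run in any order.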

72 CUDA by example: Quicksort
  Phew. That was the first part.
    this is repeated until there are enough independent partitions to assign to thread blocks
  In the next part, each thread block will do something similar, minus the FAA
  When sequences become small enough, you can sort them using an alternative sorting algorithm (e.g., bitonic sort)
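For the small-sequence case, bitonic sort is attractive because its compare-and-swap pattern is fixed in advance and does not depend on the data. The plain C++ sketch below is a serial version of the standard bitonic sorting network; the innermost loop over `i` is the part that would map onto threads, with a barrier between passes. It assumes the input length is a power of two, and `bitonic_sort` is a hypothetical name rather than anything from the slides.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// In-place ascending bitonic sort; a.size() must be a power of two (or 0).
void bitonic_sort(std::vector<int>& a) {
    const std::size_t n = a.size();
    for (std::size_t k = 2; k <= n; k <<= 1) {            // size of the sequences being merged
        for (std::size_t j = k >> 1; j > 0; j >>= 1) {    // distance between compared elements
            for (std::size_t i = 0; i < n; ++i) {         // one independent comparison per index
                const std::size_t partner = i ^ j;
                if (partner > i) {
                    const bool ascending = (i & k) == 0;
                    if ((a[i] > a[partner]) == ascending) {
                        std::swap(a[i], a[partner]);
                    }
                }
            }
        }
    }
}
```

Each pass over `i` touches every element exactly once as part of a disjoint pair, so the comparisons within a pass are independent of one another.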
