Design of Digital Circuits Lecture 22: GPU Programming. Dr. Juan Gómez Luna Prof. Onur Mutlu ETH Zurich Spring May 2018

Size: px

Start display at page:

Download "Design of Digital Circuits Lecture 22: GPU Programming. Dr. Juan Gómez Luna Prof. Onur Mutlu ETH Zurich Spring May 2018"

Lauren Taylor
5 years ago
Views:

1 Desig of Digital Circuits Lecture 22: GPU Programmig Dr. Jua Gómez Lua Prof. Our Mutlu ETH Zurich Sprig May 2018

2 Ageda for Today GPU as a accelerator Program structure Bulk sychroous programmig model Memory hierarchy ad memory maagemet Performace cosideratios Memory access SIMD utilizatio Atomic operatios Data trasfers 2

3 Recommeded Readigs CUDA Programmig Guide Hwu ad Kirk, Programmig Massively Parallel Processors, Third Editio,

4 A Example GPU

5 NVIDIA GeForce GTX 285 NVIDIA-speak: 240 stream processors SIMT executio Geeric speak: 30 cores 8 SIMD fuctioal uits per core Slide credit: Kayvo Fatahalia 5

6 NVIDIA GeForce GTX 285 core (I) 64 KB of storage for thread cotexts (registers) = SIMD fuctioal uit, cotrol shared across 8 uits = multiply-add = multiply = istructio stream decode = executio cotext storage Slide credit: Kayvo Fatahalia 6

7 NVIDIA GeForce GTX 285 core (II) 64 KB of storage for thread cotexts (registers) Groups of 32 threads share istructio stream (each group is a Warp): they execute the same istructio o differet data Up to 32 warps are iterleaved i a FGMT maer Up to 1024 thread cotexts ca be stored Slide credit: Kayvo Fatahalia 7

8 NVIDIA GeForce GTX 285 Tex Tex Tex Tex Tex Tex Tex Tex Tex Tex 30 cores o the GTX 285: 30,720 threads Slide credit: Kayvo Fatahalia 8

9 Evolutio of NVIDIA GPUs #Stream Processors GFLOPS 0 GTX 285 (2009) GTX 480 (2010) GTX 780 (2013) GTX 980 (2014) P100 (2016) V100 (2017) 0 Stream Processors GFLOPS 9

10 NVIDIA V100 NVIDIA-speak: 5120 stream processors SIMT executio Geeric speak: 80 cores 64 SIMD fuctioal uits per core Specialized Fuctioal Uits for Machie Learig (tesor cores i NVIDIA-speak) 10

11 NVIDIA V100 Block Diagram 80 cores o the V

12 NVIDIA V100 Core 15.7 TFLOPS Sigle Precisio 7.8 TFLOPS Double Precisio 125 TFLOPS for Deep Learig (Tesor cores ) 12

13 GPU Programmig

14 Recall: Vector Processor Disadvatages -- Works (oly) if parallelism is regular (data/simd parallelism) ++ Vector operatios -- Very iefficiet if parallelism is irregular -- How about searchig for a key i a liked list? Fisher, Very Log Istructio Word architectures ad the ELI-512, ISCA

15 Geeral Purpose Processig o GPU Easier programmig of SIMD processors with MD GPUs have democratized High Performace Computig (HPC) Great FLOPS/$, massively parallel chip o a commodity PC May workloads exhibit iheret parallelism Matrices Image processig However, this is ot for free New programmig model Algorithms eed to be re-implemeted ad rethought Still some bottleecks CPU-GPU data trasfers (PCIe, NVLINK) DRAM memory badwidth (GDDR5, HBM2) Data layout 15

16 CPU vs. GPU Differet desig philosophies CPU: A few out-of-order cores GPU: May i-order FGMT cores CPU GPU Cotrol ALU ALU ALU ALU Cache DRAM DRAM Slide credit: Hwu & Kirk 16

17 GPU Computig Computatio is offloaded to the GPU Three steps CPU-GPU data trasfer (1) GPU kerel executio (2) GPU-CPU data trasfer (3) CPU cores CPU memory Matrix 1 GPU memory Matrix 2 GPU cores 3 17

18 Traditioal Program Structure CPU threads ad GPU kerels Seuetial or modestly parallel sectios o CPU Massively parallel sectios o GPU Serial Code (host) Parallel Kerel (device) KerelA<<< Blk, Thr >>>(args);... Serial Code (host) Parallel Kerel (device) KerelB<<< Blk, Thr >>>(args);... Slide credit: Hwu & Kirk 18

19 Recall: MD Sigle procedure/program, multiple data This is a programmig model rather tha computer orgaizatio Each processig elemet executes the same procedure, except o differet data elemets Procedures ca sychroize at certai poits i program, e.g. barriers Essetially, multiple istructio streams execute the same program Each program/procedure 1) works o differet data, 2) ca execute a differet cotrol-flow path, at ru-time May scietific applicatios are programmed this way ad ru o MIMD hardware (multiprocessors) Moder GPUs programmed i a similar way o a SIMD hardware 19

20 CUDA/OpeCL Programmig Model SIMT or MD Bulk sychroous programmig Global (coarse-grai) sychroizatio betwee kerels The host (typically CPU) allocates memory, copies data, ad lauches kerels The device (typically GPU) executes kerels Grid (NDRage) Block (work-group) Withi a block, shared memory, ad sychroizatio Thread (work-item) 20

21 Trasparet Scalability Hardware is free to schedule thread blocks Device Kerel grid Block 0 Block 1 Device Block 2 Block 3 time Block 0 Block 1 Block 2 Block 3 Block 4 Block 5 Block 6 Block 7 time Block 0 Block 1 Block 2 Block 3 Block 4 Block 5 Block 6 Block 7 Block 4 Block 5 Block 6 Block 7 Each block ca execute i ay order relative to other blocks. Slide credit: Hwu & Kirk 21

22 Memory Hierarchy Grid (Device) Block (0, 0) Block (1, 0) Shared memory Shared memory Registers Registers Registers Registers Thread (0, 0) Thread (1, 0) Thread (0, 0) Thread (1, 0) Global / Texture & Surface memory Host Costat memory 22

23 Traditioal Program Structure i CUDA Fuctio prototypes float serialfuctio(); global void kerel(); mai() 1) Allocate memory space o the device cudamalloc(&d_i, bytes); 2) Trasfer data from host to device cudamemcpy(d_i, h_i, ); 3) Executio cofiguratio setup: #blocks ad #threads 4) Kerel call kerel<<<executio cofiguratio>>>(args); 5) Trasfer results from device to host cudamemcpy(h_out, d_out, ); repeat as eeded Kerel global void kerel(type args,) Automatic variables trasparetly assiged to registers Shared memory: shared Itra-block sychroizatio: sycthreads(); Slide credit: Hwu & Kirk 23

24 CUDA Programmig Laguage Memory allocatio cudamalloc((void**)&d_i, #bytes); Memory copy cudamemcpy(d_i, h_i, #bytes, cudamemcpyhosttodevice); Kerel lauch kerel<<< #blocks, #threads >>>(args); Memory deallocatio cudafree(d_i); Explicit sychroizatio cudadevicesychroize(); 24

25 Idexig ad Memory Access Images are 2D data structures height x width Image[j][i], where 0 j < height, ad 0 i < width Image[0][1] Image[1][2]

26 Image Layout i Memory Row-major layout Image[j][i] = Image[j x width + i] Stride = width Image[0][1] = Image[0 x 8 + 1] Image[1][2] = Image[1 x 8 + 2] 26

27 Idexig ad Memory Access: 1D Grid Oe GPU thread per pixel Grid of Blocks of Threads griddim.x, blockdim.x blockidx.x, threadidx.x blockidx.x Block 0 Thread 0 Thread 1 Thread 2 Thread 3 threadidx.x Block 0 6 * = 25 blockidx.x * blockdim.x + threadidx.x 27

28 Idexig ad Memory Access: 2D Grid 2D blocks griddim.x, griddim.y Block (0, 0) threadidx.x = 1 threadidx.y = 0 Row = blockidx.y * blockdim.y + threadidx.y Col = blockidx.x * blockdim.x + threadidx.x blockidx.x = 2 blockidx.y = 1 Row = 1 * = 3 Col = 0 * = 1 Image[3][1] = Image[3 * 8 + 1] 28

29 Brief Review of GPU Architecture (I) Streamig Processor Array Tesla architecture (G80/GT200) Streamig Processor Array TPC TPC TPC.. TPC TPC SM Istructio Cache SM SM Istructio Fetch/Dispatch Register File Texture Uit SFU Texture L1 Cache SFU Shared Memory Costat Cache 29

30 Brief Review of GPU Architecture (II) Streamig Multiprocessors (SM) Streamig Processors () Streamig Multiprocessor Istructio Cache Warp Scheduler Warp Scheduler Dispatch Uit Dispatch Uit Register File Blocks are divided ito warps SIMD uit (32 threads) LD/ST LD/ST LD/ST LD/ST SFU Block 0 s warps t0 t1 t2 t31 Block 1 s warps t0 t1 t2 t31 Block 2 s warps t0 t1 t2 t31 LD/ST LD/ST LD/ST LD/ST LD/ST LD/ST LD/ST LD/ST LD/ST LD/ST LD/ST LD/ST SFU SFU SFU Shared Memory / L1 Cache Costat Cache NVIDIA Fermi architecture 30

31 Brief Review of GPU Architecture (III) Streamig Multiprocessors (SM) or Compute Uits (CU) SIMD pipelies Streamig Processors () or CUDA cores Vector laes Number of SMs x s across geeratios Tesla (2007): 30 x 8 Fermi (2010): 16 x 32 Kepler (2012): 15 x 192 Maxwell (2014): 24 x 128 Pascal (2016): 56 x 64 Volta (2017): 80 x 64 31

32 Performace Cosideratios

33 Performace Cosideratios Mai bottleecks Global memory access CPU-GPU data trasfers Memory access Latecy hidig Occupacy Memory coalescig Data reuse Shared memory usage SIMD (Warp) Utilizatio: Divergece Atomic operatios: Serializatio Data trasfers betwee CPU ad GPU Overlap of commuicatio ad computatio 33

34 Memory Access

35 Latecy Hidig FGMT ca hide log latecy operatios (e.g., memory accesses) Occupacy: ratio of active warps 4 active warps 2 active warps Warp 0 Warp 0 Istructio 3 Istructio 3 Warp 1 Istructio 2 Warp 1 Istructio 2 time Warp 2 Warp 3 Istructio 1 Istructio 1 Warp 0 Istructio 4 (Log latecy) time Warp 1 Istructio 3 Warp 0 Istructio 4 (Log latecy) Warp 1 Istructio 3 Warp 0 Warp 0 Istructio 5 Istructio 5 35

36 Occupacy SM resources (typical values) Maximum umber of warps per SM (64) Maximum umber of blocks per SM (32) Register usage (256KB) Shared memory usage (64KB) Occupacy calculatio Number of threads per block (defied by the programmer) Registers per thread (kow at compile time) Shared memory per block (defied by the programmer) 36

37 Memory Coalescig Whe accessig global memory, we wat to make sure that cocurret threads access earby memory locatios Peak badwidth utilizatio occurs whe all threads i a warp access oe cache lie Not coalesced Coalesced Md Nd Thread 1 Thread 2 WIDTH WIDTH Slide credit: Hwu & Kirk 37

38 Ucoalesced Memory Accesses Access directio i Kerel code M 0,0 M 1,0 M 2,0 M 3,0 M 0,1 M 1,1 M 2,1 M 3,1 M 0,2 M 1,2 M 2,2 M 3,2 M 0,3 M 1,3 M 2,3 M 3,3 Time Period 2 T 1 T 2 T 3 T 4 Time Period 1 T 1 T 2 T 3 T 4 M M 0,0 M 1,0 M 2,0 M 3,0 M 0,1 M 1,1 M 2,1 M 3,1 M 0,2 M 1,2 M 2,2 M 3,2 M 0,3 M 1,3 M 2,3 M 3,3 Slide credit: Hwu & Kirk 38

39 Coalesced Memory Accesses Access directio i Kerel code M 0,0 M 1,0 M 2,0 M 3,0 M 0,1 M 1,1 M 2,1 M 3,1 M 0,2 M 1,2 M 2,2 M 3,2 M 0,3 M 1,3 M 2,3 M 3,3 Time Period 1 T 1 T 2 T 3 T 4 Time Period 2 T 1 T 2 T 3 T 4 M M 0,0 M 1,0 M 2,0 M 3,0 M 1,1 M 0,1 M 2,1 M 3,1 M 0,2 M 1,2 M 2,2 M 3,2 M 0,3 M 1,3 M 2,3 M 3,3 Slide credit: Hwu & Kirk 39

40 AoS vs. SoA Array of Structures vs. Structure of Arrays Structure of Arrays (SoA) struct foo{ float a[8]; float b[8]; float c[8]; it d[8]; } A; Array of Structures (AoS) struct foo{ float a; float b; float c; it d; } A[8]; 40

41 CPUs Prefer AoS, GPUs Prefer SoA Liear ad strided accesses GPU CPU 12.0# 5.0# Throughput#(GB/s)# 11.0# 10.0# 9.0# 8.0# 7.0# 6.0# 5.0# 4.0# 3.0# 2.0# 1.0# GPU# Throughput#(GB/s)# 4.5# 4.0# 3.5# 3.0# 2.5# 2.0# 1.5# 1.0# 0.5# 1CPU# 2CPU# 4CPU# 0.0# 1# 2# 4# 8# 16# 32# 64# 128# 256# 512# 1024# 0.0# 1# 2# 4# 8# 16# 32# 64# 128# 256# 512# 1024# Stride#(Structure#size)# Stride#(Structure#size)# AMD Kaveri A K Sug+, DL: A data layout trasformatio system for heterogeeous computig, INPAR

42 Data Reuse Same memory locatios accessed by eighborig threads for (it i = 0; i < 3; i++){ for (it j = 0; j < 3; j++){ sum += gauss[i][j] * Image[(i+row-1)*width + (j+col-1)]; } } 42

43 Data Reuse: Tilig To take advatage of data reuse, we divide the iput ito tiles that ca be loaded ito shared memory shared it l_data[(l_size+2)*(l_size+2)]; Load tile ito shared memory sycthreads(); for (it i = 0; i < 3; i++){ for (it j = 0; j < 3; j++){ sum += gauss[i][j] * l_data[(i+l_row-1)*(l_size+2)+j+l_col-1]; } } 43

44 Shared Memory Shared memory is a iterleaved (baked) memory Each bak ca service oe address per cycle Typically, 32 baks i NVIDIA GPUs Successive 32-bit words are assiged to successive baks Bak = Address % 32 Bak coflicts are oly possible withi a warp No bak coflicts betwee differet warps 44

45 Shared Memory Bak Coflicts (I) Bak coflict free Thread 0 Thread 1 Thread 2 Thread 3 Thread 4 Thread 5 Thread 6 Thread 7 Bak 0 Bak 1 Bak 2 Bak 3 Bak 4 Bak 5 Bak 6 Bak 7 Thread 0 Thread 1 Thread 2 Thread 3 Thread 4 Thread 5 Thread 6 Thread 7 Bak 0 Bak 1 Bak 2 Bak 3 Bak 4 Bak 5 Bak 6 Bak 7 Thread 15 Bak 15 Thread 15 Bak 15 Liear addressig: stride = 1 Radom addressig 1:1 Slide credit: Hwu & Kirk 45

46 Shared Memory Bak Coflicts (II) N-way bak coflicts Thread 0 Thread 1 Thread 2 Thread 3 Thread 4 Thread 8 Thread 9 Thread 10 Thread 11 Bak 0 Bak 1 Bak 2 Bak 3 Bak 4 Bak 5 Bak 6 Bak 7 Bak 15 Thread 0 Thread 1 Thread 2 Thread 3 Thread 4 Thread 5 Thread 6 Thread 7 Thread 15 x8 x8 Bak 0 Bak 1 Bak 2 Bak 7 Bak 8 Bak 9 Bak 15 2-way bak coflict: stride = 2 8-way bak coflict: stride = 8 Slide credit: Hwu & Kirk 46

47 Reducig Shared Memory Bak Coflicts Bak coflicts are oly possible withi a warp No bak coflicts betwee differet warps If strided accesses are eeded, some optimizatio techiues ca help Paddig Radomized mappig Rau, Pseudo-radomly iterleaved memory, ISCA 1991 Hash fuctios V.d.Braak+, Cofigurable XOR Hash Fuctios for Baked Scratchpad Memories i GPUs, IEEE TC,

48 SIMD Utilizatio

49 SIMD Utilizatio Itra-warp divergece Compute(threadIdx.x); if (threadidx.x % 2 == 0){ Do_this(threadIdx.x); } else{ Do_that(threadIdx.x); } Compute If Else 49

50 Icreasig SIMD Utilizatio Divergece-free executio Compute(threadIdx.x); if (threadidx.x < 32){ Do_this(threadIdx.x * 2); } else{ Do_that((threadIdx.x%32)*2+1); } Compute If Else 50

51 Vector Reductio: Naïve Mappig (I) Thread 0 Thread 2 Thread 4 Thread 6 Thread 8 Thread iteratios Slide credit: Hwu & Kirk 51

52 Vector Reductio: Naïve Mappig (II) Program with low SIMD utilizatio shared float partialsum[] usiged it t = threadidx.x; for (it stride = 1; stride < blockdim.x; stride *= 2) { } sycthreads(); if (t % (2*stride) == 0) partialsum[t] += partialsum[t + stride]; 52

53 Divergece-Free Mappig (I) All active threads belog to the same warp Thread 0 Thread 1 Thread 2 Thread 14 Thread iteratios 3 Slide credit: Hwu & Kirk 53

54 Divergece-Free Mappig (II) Program with high SIMD utilizatio shared float partialsum[] usiged it t = threadidx.x; for (it stride = blockdim.x; stride > 1; stride >> 1){ } sycthreads(); if (t < stride) partialsum[t] += partialsum[t + stride]; 54

55 Atomic Operatios

56 Shared Memory Atomic Operatios Atomic Operatios are eeded whe threads might update the same memory locatios at the same time CUDA: it atomicadd(it*, it); PTX: atom.shared.add.u32 %r25, [%rd14], %r24; SASS: Tesla, Fermi, Kepler /*00a0*/ LDSLK P0, R9, [R8]; IADD R10, R9, R7; STSCUL P1, [R8], R10; BRA 0xa0; Maxwell, Pascal, Volta /*01f8*/ ATOMS.ADD RZ, [R7], R11; Native atomic operatios for 32-bit iteger, ad 32-bit ad 64-bit atomiccas 56

57 Atomic Coflicts We defie the itra-warp coflict degree as the umber of threads i a warp that update the same memory positio The coflict degree ca be betwee 1 ad 32 th0 th1 th0 th1 th1 t coflict th0 th1 th0 0 2 t base 2 2 t base No atomic coflict = cocurret updates Shared memory Atomic coflict = serialized updates Shared memory 57

58 Histogram Calculatio Histograms cout the umber of data istaces i disjoit categories (bis) for (each pixel i i image I){ Pixel = I[i] Pixel = Computatio(Pixel) Histogram[Pixel ]++ } // Read pixel // Optioal computatio // Vote i histogram bi Iput data data[] data[+1] data[+2]... data[2-1] data[0] data[1] data[2]... data[-1] Thread 0 Thread 1 Thread 2 Thread Histogram... B-1 Atomic additios 58

Histogram Calculatio of Natural Images Freuet coflicts i atural images 169 170 171 174 177 182 187 192 194 192 169 173 173 175 177 181 185 189 191 192 169 173 173 175 177 180 184 188 190 193 169 172

59 Histogram Calculatio of Natural Images Freuet coflicts i atural images

60 Optimizig Histogram Calculatio Privatizatio: Per-block sub-histograms i shared memory Block 0 Block 1 Block 2 Block 3 Block 0 Block 1 Block 2 Block 3 Block 0 Block 1 Block 2 Block 3 Block 0 Block 1 Block 2 Block 3 Shared memory Block 0 s sub-histo Block 1 s sub-histo Block 2 s sub-histo Block 3 s sub-histo b0 b1 b2 b3 b0 b1 b2 b3 b0 b1 b2 b3 b0 b1 b2 b3 b0 b1 b2 b3 Global memory Fial histogram Gomez-Lua+, Performace Modelig of Atomic Additios o GPU Scratchpad Memory, IEEE TPDS,

61 Data Trasfers betwee CPU ad GPU

62 Data Trasfers Sychroous ad asychroous trasfers Streams (Commad ueues) Seuece of operatios that are performed i order CPU-GPU data trasfer Kerel executio D iput data istaces, B blocks GPU-CPU data trasfer Default stream t T Copy data t E Execute 62

63 Asychroous Trasfers Computatio divided ito Streams D iput data istaces, B blocks Streams D/Streams data istaces B/Streams blocks t T Copy data t E Execute Copy data Execute Estimates t E + t T Streams t E >= t T (domiat kerel) t T + t E Streams t T > t E (domiat trasfers) 63

64 Overlap of Commuicatio ad Computatio Applicatios with idepedet computatio o differet data istaces ca beefit from asychroous trasfers For istace, video processig Nostreamed executio A seuece of 6 frames is trasferred to device 6 x b blocks compute o the seuece of frames Streamed executio A chuk of 2 frames is trasferred to device 2 x b blocks compute o the chuk, while the secod chuk is beig trasferred Executio time saved thaks to streams Gomez-Lua+, Performace models for asychroous data trasfers o cosumer Graphics Processig Uits, JPDC,

65 Summary GPU as a accelerator Program structure Bulk sychroous programmig model Memory hierarchy ad memory maagemet Performace cosideratios Memory access Latecy hidig: occupacy (TLP) Memory coalescig Data reuse: shared memory SIMD utilizatio Atomic operatios Data trasfers 65

66 PUMPS+AI Summer School, Barceloa, July

67 Desig of Digital Circuits Lecture 22: GPU Programmig Dr. Jua Gómez Lua Prof. Our Mutlu ETH Zurich Sprig May 2018

Design of Digital Circuits Lecture 21: SIMD Processors II and Graphics Processing Units

Design of Digital Circuits Lecture 21: SIMD Processors II and Graphics Processing Units Desig of Digital Circuits Lecture 21: SIMD Processors II ad Graphics Processig Uits Dr. Jua Gómez Lua Prof. Our Mutlu ETH Zurich Sprig 2018 17 May 2018 New Course: Bachelor s Semiar i Comp Arch Fall 2018