CS 152 Computer Architecture and Engineering CS252 Graduate Computer Architecture. Lecture 17 GPUs

1 CS 152 Computer Architecture and Engineering
CS252 Graduate Computer Architecture
Lecture 17 GPUs
Krste Asanovic
Electrical Engineering and Computer Sciences
University of California at Berkeley

2 Last Time in Lecture 16
- RISC-V Vector Standard and programming examples

3 Types of Parallelism
- Instruction-Level Parallelism (ILP): execute independent instructions from one instruction stream in parallel (pipelining, superscalar, VLIW)
- Thread-Level Parallelism (TLP): execute independent instruction streams in parallel (multithreading, multiple cores)
- Data-Level Parallelism (DLP): execute multiple operations of the same type in parallel (vector/SIMD execution)
- Which is easiest to program?
- Which is the most flexible form of parallelism? i.e., can be used in more situations
- Which is most efficient? i.e., greatest tasks/second/area, lowest energy/task

4 Resurgence of DLP
- Convergence of application demands and technology constraints drives architecture choice
- New applications, such as graphics, machine vision, speech recognition, machine learning, etc. all require large numerical computations that are often trivially data parallel
- SIMD-based architectures (vector-SIMD, subword-SIMD, SIMT/GPUs) are the most efficient way to execute these algorithms

5 Packed SIMD Extensions
[Figure: one 64b register viewed as 2x32b, 4x16b, or 8x8b elements]
- Short vectors added to existing ISAs for microprocessors
- Use existing 64-bit registers split into 2x32b or 4x16b or 8x8b
  - Lincoln Labs TX-2 from 1957 had 36b datapath split into 2x18b or 4x9b
  - Newer designs have wider registers: 128b for PowerPC Altivec and Intel SSE2/3/4, 256b for Intel AVX
- Single instruction operates on all elements within register
[Figure: 4x16b adds, four independent 16b additions performed by one instruction]
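
To make the "single instruction operates on all elements" point concrete, here is a minimal sketch in plain C of a 4x16b add on a 64-bit word. It emulates in software what a packed-SIMD add does in hardware; the function name `add_4x16` is illustrative and not taken from any ISA.

```c
#include <stdint.h>

/* Add four packed 16-bit lanes held in two 64-bit words.
 * Carries must not cross lane boundaries, so the top bit of each lane
 * is handled separately from the low 15 bits. */
uint64_t add_4x16(uint64_t a, uint64_t b) {
    const uint64_t top = 0x8000800080008000ULL;   /* top bit of every lane */
    uint64_t low_sum = (a & ~top) + (b & ~top);   /* low 15 bits: cannot overflow a lane */
    return low_sum ^ ((a ^ b) & top);             /* add top bits modulo 2, dropping the carry out */
}
```

Each lane wraps around independently on overflow, which is the behavior of a non-saturating packed add.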

6 Multimedia Extensions versus Vectors
- Limited instruction set:
  - no vector length control
  - no strided load/store or scatter/gather
  - unit-stride loads must be aligned to 64/128-bit boundary
- Limited vector register length:
  - requires superscalar dispatch to keep multiply/add/load units busy
  - loop unrolling to hide latencies increases register pressure
- Trend towards fuller vector support in microprocessors
  - Better support for misaligned memory accesses
  - Support of double-precision (64-bit floating-point)
  - New Intel AVX spec (announced April 2008), 256b vector registers (expandable up to 1024b)
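
One consequence of having no vector length control can be sketched in plain C: a fixed-width loop handles only full groups of elements, and a separate scalar cleanup loop handles the remainder. The width `LANES` and the function name are assumptions for illustration; the inner loop stands in for one packed-SIMD instruction.

```c
#include <stddef.h>

#define LANES 4  /* hypothetical fixed SIMD width, in elements */

/* y[i] = a*x[i] + y[i] using a fixed-width main loop plus scalar cleanup.
 * Returns how many elements the cleanup loop had to handle. */
size_t daxpy_fixed_width(size_t n, double a, const double *x, double *y) {
    size_t i = 0;
    /* Main loop: only full groups of LANES elements. */
    for (; i + LANES <= n; i += LANES) {
        for (size_t lane = 0; lane < LANES; lane++)   /* models one packed operation */
            y[i + lane] = a * x[i + lane] + y[i + lane];
    }
    size_t leftover = n - i;
    /* Cleanup loop: the remaining n % LANES elements, one at a time. */
    for (; i < n; i++)
        y[i] = a * x[i] + y[i];
    return leftover;
}
```

A vector ISA with a vector length register would run the final partial group with the same instructions as the main loop, so no second loop body is needed.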

7 DLP important for conventional CPUs
- Prediction for x86 processors, from Hennessy & Patterson, 5th edition
  - Note: Educated guess, not Intel product plans!
- TLP: 2+ cores / 2 years
- DLP: 2x width / 4 years
- DLP will account for more mainstream parallelism growth than TLP in next decade.
  - SIMD: single-instruction multiple-data (DLP)
  - MIMD: multiple-instruction multiple-data (TLP)

8 Graphics Processing Units (GPUs)
- Original GPUs were dedicated fixed-function devices for generating 3D graphics (mid-late 1990s) including high-performance floating-point units
  - Provide workstation-like graphics for PCs
  - User could configure graphics pipeline, but not really program it
- Over time, more programmability added ( )
  - E.g., new language Cg for writing small programs run on each vertex or each pixel, also Windows DirectX variants
  - Massively parallel (millions of vertices or pixels per frame) but very constrained programming model
- Some users noticed they could do general-purpose computation by mapping input and output data to images, and computation to vertex and pixel shading computations
  - Incredibly difficult programming model as had to use graphics pipeline model for general computation

9 General-Purpose GPUs (GP-GPUs)
- In 2006, Nvidia introduced GeForce 8800 GPU supporting a new programming language: CUDA
  - Compute Unified Device Architecture
  - Subsequently, broader industry pushing for OpenCL, a vendor-neutral version of same ideas.
- Idea: Take advantage of GPU computational performance and memory bandwidth to accelerate some kernels for general-purpose computing
- Attached processor model: Host CPU issues data-parallel kernels to GP-GPU for execution
- This lecture has a simplified version of Nvidia CUDA-style model and only considers GPU execution for computational kernels, not graphics
  - Would probably need another course to describe graphics processing

10 Simplified CUDA Programming Model
- Computation performed by a very large number of independent small scalar threads (CUDA threads or microthreads) grouped into thread blocks.

    // C version of DAXPY loop.
    void daxpy(int n, double a, double *x, double *y)
    {
        for (int i = 0; i < n; i++)
            y[i] = a*x[i] + y[i];
    }

    // CUDA version.
    // Piece run on host processor.
    int nblocks = (n + 255) / 256;   // 256 CUDA threads/block
    daxpy<<<nblocks, 256>>>(n, 2.0, x, y);

    // Piece run on GP-GPU (a kernel launched from the host is declared __global__).
    __global__
    void daxpy(int n, double a, double *x, double *y)
    {
        int i = blockIdx.x*blockDim.x + threadIdx.x;
        if (i < n) y[i] = a*x[i] + y[i];
    }

11 Programmer's View of Execution
[Figure: a row of thread blocks, blockIdx 0, blockIdx 1, and so on up to the last of the (n+255)/256 blocks, each holding threadIdx 0, threadIdx 1, ..., threadIdx 255]
- blockDim = 256 (programmer can choose)
- Create enough blocks to cover input vector (NVIDIA calls this ensemble of blocks a Grid, can be 2-dimensional)
- Conditional (i<n) turns off unused threads in last block
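
The programmer's view can be emulated serially in plain C as a rough sketch: two nested loops stand in for the block and thread indices, and the function counts how many threads the (i<n) guard turns off. The function name and the serial loop order are assumptions for illustration; a real GPU runs the threads in parallel and in no guaranteed order.

```c
#include <stddef.h>

/* Serially emulate a 1-D grid running the DAXPY kernel body.
 * Returns the number of threads masked off by the (i < n) guard. */
size_t daxpy_grid_emulated(size_t n, double a, const double *x, double *y,
                           size_t block_dim) {
    size_t nblocks = (n + block_dim - 1) / block_dim;  /* enough blocks to cover n */
    size_t masked = 0;
    for (size_t block_idx = 0; block_idx < nblocks; block_idx++) {
        for (size_t thread_idx = 0; thread_idx < block_dim; thread_idx++) {
            size_t i = block_idx * block_dim + thread_idx;  /* global element index */
            if (i < n)
                y[i] = a * x[i] + y[i];
            else
                masked++;                                   /* unused thread in last block */
        }
    }
    return masked;
}
```

With n = 1000 and 256 threads per block, four blocks are created (1024 threads) and 24 threads in the last block do no work.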

12 Hardware Execution Model
[Figure: a CPU with CPU memory next to a GPU with GPU memory; the GPU contains Core 0 to Core 15, each with Lane 0 to Lane 15]
- GPU is built from multiple parallel cores, each core contains a multithreaded SIMD processor with multiple lanes but with no scalar processor
  - some adding scalar coprocessors now
- CPU sends whole grid over to GPU, which distributes thread blocks among cores (each thread block executes on one core)
  - Programmer unaware of number of cores

13 Historical Retrospective, Cray-2 (1985)
- 243MHz ECL logic
- 2GB DRAM main memory (128 banks of 16MB each)
  - Bank busy time 57 clocks!
- Local memory of 128KB/core
- 1 foreground + 4 background vector processors
[Figure: a foreground CPU and four background cores, each with lanes and a local memory, all attached to a shared memory]

14 Single Instruction, Multiple Thread (SIMT)
- GPUs use a SIMT model, where individual scalar instruction streams for each CUDA thread are grouped together for SIMD execution on hardware (NVIDIA groups 32 CUDA threads into a warp)
[Figure: a scalar instruction stream (ld x, mul a, ld y, add, st y) executed in SIMD fashion across the µthreads µt0 to µt7 of a warp]
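
A small sketch in plain C of how thread indices relate to warps, assuming a one-dimensional thread block in which consecutive thread indices are packed into the same warp. The helper names are illustrative only.

```c
#define WARP_SIZE 32u  /* NVIDIA groups 32 CUDA threads into a warp */

/* Which warp within its block a thread belongs to. */
unsigned warp_of(unsigned thread_idx) { return thread_idx / WARP_SIZE; }

/* Which SIMD position within its warp a thread occupies. */
unsigned lane_of(unsigned thread_idx) { return thread_idx % WARP_SIZE; }

/* Number of warps needed to hold one block; the last warp may be partly empty. */
unsigned warps_per_block(unsigned block_dim) {
    return (block_dim + WARP_SIZE - 1u) / WARP_SIZE;
}
```

For example, a block of 256 threads occupies 8 warps, and thread 70 is position 6 of warp 2.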

15 Implications of SIMT Model
- All vector loads and stores are scatter-gather, as individual µthreads perform scalar loads and stores
  - GPU adds hardware to dynamically coalesce individual µthread loads and stores to mimic vector loads and stores
- Every µthread has to perform stripmining calculations redundantly ("am I active?") as there is no scalar processor equivalent
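
The benefit of coalescing can be estimated with a rough model in plain C: count how many aligned memory segments one warp touches when lane k loads a double at a given element stride. The 128-byte segment size and the function name are assumptions for illustration; real coalescing hardware differs between GPU generations.

```c
#include <stdint.h>

#define WARP_SIZE 32u
#define SEGMENT_BYTES 128u  /* assumed memory transaction size */

/* Lane k loads one double at byte address base + k * stride_elems * 8.
 * Returns the number of distinct aligned segments the warp touches. */
unsigned segments_touched(uint64_t base, uint64_t stride_elems) {
    uint64_t seen[WARP_SIZE];
    unsigned count = 0;
    for (unsigned lane = 0; lane < WARP_SIZE; lane++) {
        uint64_t addr = base + (uint64_t)lane * stride_elems * sizeof(double);
        uint64_t seg = addr / SEGMENT_BYTES;
        int found = 0;
        for (unsigned j = 0; j < count; j++) {
            if (seen[j] == seg) { found = 1; break; }
        }
        if (!found) seen[count++] = seg;   /* first lane to touch this segment */
    }
    return count;
}
```

Under this model a unit-stride warp access needs 2 segments, while a stride of 16 doubles needs 32, one per µthread.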

16 CS152 Administrivia
- PS 4 due Friday March 23 in section
  - Can also turn in on class Wednesday, office hours, or can pdf
- Next week is Spring Break, no classes or sections!
- Lab 4 out on Friday

17 CS252 Administrivia

18 Conditionals in SIMT model
- Simple if-then-else are compiled into predicated execution, equivalent to vector masking
- More complex control flow compiled into branches
- How to execute a vector of branches?
[Figure: scalar instruction stream (tid = threadIdx; if (tid >= n) skip; call func1; add; st y; skip:) under SIMD execution across the µthreads µt0 to µt7 of a warp]

19 Branch divergence
- Hardware tracks which µthreads take or don't take branch
- If all go the same way, then keep going in SIMD fashion
- If not, create mask vector indicating taken/not-taken
- Keep executing not-taken path under mask, push taken branch PC+mask onto a hardware stack and execute later
- When can execution of µthreads in warp reconverge?
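
The divergence mechanism can be sketched as a small software model in plain C: an 8-wide warp meets one branch, the not-taken path runs first under a mask, and the taken path is pushed onto a stack and run later. The warp width, the two toy path bodies, and all names are assumptions for illustration, not a description of actual NVIDIA hardware.

```c
#include <stdint.h>

#define WARP_SIZE 8
enum { PC_NOT_TAKEN = 0, PC_TAKEN = 1 };

typedef struct { int pc; uint32_t mask; } StackEntry;

/* Lanes whose bit is set in taken_mask take the branch (out = in * 2);
 * the other lanes fall through (out = in + 1).
 * Returns the number of passes over the warp: 1 if uniform, 2 if diverged. */
int run_branch(uint32_t taken_mask, const int *in, int *out) {
    const uint32_t all = (1u << WARP_SIZE) - 1u;
    StackEntry stack[1];
    int top = 0, pc, passes = 0;
    uint32_t active;

    taken_mask &= all;
    if (taken_mask == all)      { pc = PC_TAKEN;     active = all; }  /* all taken */
    else if (taken_mask == 0)   { pc = PC_NOT_TAKEN; active = all; }  /* none taken */
    else {
        /* Diverged: defer the taken path, run the not-taken path first. */
        stack[top].pc = PC_TAKEN;
        stack[top].mask = taken_mask;
        top++;
        pc = PC_NOT_TAKEN;
        active = all & ~taken_mask;
    }
    for (;;) {
        for (int lane = 0; lane < WARP_SIZE; lane++)
            if (active & (1u << lane))      /* masked-off lanes do nothing */
                out[lane] = (pc == PC_TAKEN) ? in[lane] * 2 : in[lane] + 1;
        passes++;
        if (top == 0) break;                /* nothing deferred: paths reconverge */
        top--;
        pc = stack[top].pc;
        active = stack[top].mask;
    }
    return passes;
}
```

A diverged warp costs two passes with some lanes idle in each, which is why uniform branches are cheaper.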

20 NVIDIA Instruction Set Arch.
- ISA is an abstraction of the hardware instruction set
  - "Parallel Thread Execution (PTX)"
  - opcode.type d,a,b,c;
  - Uses virtual registers
  - Translation to machine code is performed in software
- Example:

    shl.s32       R8, blockIdx, 9   ; Thread Block ID * Block size (512 or 2^9)
    add.s32       R8, R8, threadIdx ; R8 = i = my CUDA thread ID
    ld.global.f64 RD0, [X+R8]       ; RD0 = X[i]
    ld.global.f64 RD2, [Y+R8]       ; RD2 = Y[i]
    mul.f64       RD0, RD0, RD4     ; Product in RD0 = RD0 * RD4 (scalar a)
    add.f64       RD0, RD0, RD2     ; Sum in RD0 = RD0 + RD2 (Y[i])
    st.global.f64 [Y+R8], RD0       ; Y[i] = sum (X[i]*a + Y[i])

Graphical Processing Units. Copyright 2019, Elsevier Inc. All rights reserved.

21 Conditional Branching
- Like vector architectures, GPU branch hardware uses internal masks
- Also uses:
  - Branch synchronization stack
    - Entries consist of masks for each SIMD lane
    - I.e. which threads commit their results (all threads execute)
  - Instruction markers to manage when a branch diverges into multiple execution paths
    - Push on divergent branch
  - ...and when paths converge
    - Act as barriers
    - Pops stack
- Per-thread-lane 1-bit predicate register, specified by programmer

22 Example

    if (X[i] != 0)
        X[i] = X[i] - Y[i];
    else X[i] = Z[i];

            ld.global.f64 RD0, [X+R8]     ; RD0 = X[i]
            setp.neq.s32  P1, RD0, #0     ; P1 is predicate register
            @!P1, bra     ELSE1, *Push    ; Push old mask, set new mask bits
                                          ; if P1 false, go to ELSE1
            ld.global.f64 RD2, [Y+R8]     ; RD2 = Y[i]
            sub.f64       RD0, RD0, RD2   ; Difference in RD0
            st.global.f64 [X+R8], RD0     ; X[i] = RD0
            @P1, bra      ENDIF1, *Comp   ; complement mask bits
                                          ; if P1 true, go to ENDIF1
    ELSE1:  ld.global.f64 RD0, [Z+R8]     ; RD0 = Z[i]
            st.global.f64 [X+R8], RD0     ; X[i] = RD0
    ENDIF1: <next instruction>, *Pop      ; pop to restore old mask
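
The same if/else can be modeled in plain C as masked execution over the lanes of a warp: compute a predicate bit per lane, run the "then" path under the predicate, then run the "else" path under its complement. The 8-lane width and the function name are assumptions for illustration.

```c
#include <stdint.h>

#define LANES 8

/* Per lane: if (x != 0) x = x - y; else x = z;
 * Both paths are walked by the whole warp; the mask decides which lanes commit.
 * Returns the predicate mask (bit k set means lane k took the "then" path). */
uint32_t masked_if_else(double *x, const double *y, const double *z) {
    uint32_t p1 = 0;
    /* Like setp: one predicate bit per lane, computed before either path runs. */
    for (int lane = 0; lane < LANES; lane++)
        if (x[lane] != 0.0) p1 |= 1u << lane;
    /* "Then" path under mask p1. */
    for (int lane = 0; lane < LANES; lane++)
        if ((p1 >> lane) & 1u) x[lane] = x[lane] - y[lane];
    /* "Else" path under the complemented mask. */
    for (int lane = 0; lane < LANES; lane++)
        if (!((p1 >> lane) & 1u)) x[lane] = z[lane];
    return p1;
}
```

Because the "else" pass tests the saved predicate rather than the updated x, a lane whose subtraction yields zero is not overwritten afterwards.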

23 Warps are multithreaded on core
- One warp of 32 µthreads is a single thread in the hardware
- Multiple warp threads are interleaved in execution on a single core to hide latencies (memory and functional unit)
- A single thread block can contain multiple warps (up to 512 µt max in CUDA), all mapped to single core
- Can have multiple blocks executing on one core
[Nvidia, 2010]

24 GPU Memory Hierarchy
[Nvidia, 2010]

25 SIMT
- Illusion of many independent threads
- But for efficiency, programmer must try and keep µthreads aligned in a SIMD fashion
  - Try and do unit-stride loads and stores so memory coalescing kicks in
  - Avoid branch divergence so most instruction slots execute useful work and are not masked off

26 Nvidia Fermi GF100 GPU
[Nvidia, 2010]

27 Fermi Streaming Multiprocessor Core

28 NVIDIA Pascal Multithreaded GPU Core

29 Fermi Dual-Issue Warp Scheduler

30 Importance of Machine Learning for GPUs
- NVIDIA stock price 20x in 5 years

31 Apple A5X Processor for iPad v3 (2012)
- 12.90mm x 12.79mm
- 45nm technology
[Source: Chipworks, 2012]
