Design of Digital Circuits Lecture 20: SIMD Processors. Prof. Onur Mutlu ETH Zurich Spring May 2018

Size: px

Start display at page:

Download "Design of Digital Circuits Lecture 20: SIMD Processors. Prof. Onur Mutlu ETH Zurich Spring May 2018"

Ginger Doyle
5 years ago
Views:

1 Desig of Digital Circuits Lecture 20: SIMD Processors Prof. Our Mutlu ETH Zurich Sprig May 2018

2 New Course: Bachelor s Semiar i Comp Arch Fall credit uits Rigorous semiar o fudametal ad cuttig-edge topics i computer architecture Critical presetatio, review, ad discussio of semial works i computer architecture We will cover may ideas & issues, aalyze their tradeoffs, perform critical thikig ad braistormig Participatio, presetatio, report ad review writig Stay tued for more iformatio 2

3 For the Curious: New Rowhammer Attack Aother Rowhammer-based attack disclosed yesterday 3

4 Last Week s Attack Usig a itegrated GPU i a mobile system to remotely escalate privilege via the WebGL iterface 4

5 More to Come Our Mutlu, "The RowHammer Problem ad Other Issues We May Face as Memory Becomes Deser" Ivited Paper i Proceedigs of the Desig, Automatio, ad Test i Europe Coferece (DATE), Lausae, Switzerlad, March [Slides (pptx) (pdf)] 5

6 Ageda for Today & Next Few Lectures Sigle-cycle Microarchitectures Multi-cycle ad Microprogrammed Microarchitectures Pipeliig Issues i Pipeliig: Cotrol & Data Depedece Hadlig, State Maiteace ad Recovery, Out-of-Order Executio Other Executio Paradigms 6

7 Readigs for Today Peleg ad Weiser, MMX Techology Extesio to the Itel Architecture, IEEE Micro Lidholm et al., "NVIDIA Tesla: A Uified Graphics ad Computig Architecture," IEEE Micro

8 Other Approaches to Cocurrecy (or Istructio Level Parallelism)

9 Approaches to (Istructio-Level) Cocurrecy Pipeliig Out-of-order executio Dataflow (at the ISA level) Superscalar Executio VLIW Fie-Graied Multithreadig SIMD Processig (Vector ad array processors, GPUs) Decoupled Access Execute Systolic Arrays 9

10 SIMD Processig: Exploitig Regular (Data) Parallelism

11 Fly s Taxoomy of Computers Mike Fly, Very High-Speed Computig Systems, Proc. of IEEE, 1966 SISD: Sigle istructio operates o sigle data elemet SIMD: Sigle istructio operates o multiple data elemets Array processor Vector processor MISD: Multiple istructios operate o sigle data elemet Closest form: systolic array processor, streamig processor MIMD: Multiple istructios operate o multiple data elemets (multiple istructio streams) Multiprocessor Multithreaded processor 11

12 Data Parallelism Cocurrecy arises from performig the same operatio o differet pieces of data Sigle istructio multiple data (SIMD) E.g., dot product of two vectors Cotrast with data flow Cocurrecy arises from executig differet operatios i parallel (i a data drive maer) Cotrast with thread ( cotrol ) parallelism Cocurrecy arises from executig differet threads of cotrol i parallel SIMD exploits operatio-level parallelism o differet data Same operatio cocurretly applied to differet pieces of data A form of ILP where istructio happes to be the same across data 12

13 SIMD Processig Sigle istructio operates o multiple data elemets I time or i space Multiple processig elemets Time-space duality Array processor: Istructio operates o multiple data elemets at the same time usig differet spaces Vector processor: Istructio operates o multiple data elemets i cosecutive time steps usig the same space 13

14 Array vs. Vector Processors ARRAY PROCESSOR VECTOR PROCESSOR Istructio Stream LD VR ß A[3:0] ADD VR ß VR, 1 MUL VR ß VR, 2 ST A[3:0] ß VR Time Same same time LD0 LD1 LD2 LD3 LD0 Differet time AD0 AD1 AD2 AD3 LD1 AD0 MU0 MU1 MU2 MU3 LD2 AD1 MU0 ST0 ST1 ST2 ST3 LD3 AD2 MU1 ST0 Differet same space AD3 MU2 ST1 MU3 ST2 Same space ST3 Space Space 14

15 SIMD Array Processig vs. VLIW VLIW: Multiple idepedet operatios packed together by the compiler 15

16 SIMD Array Processig vs. VLIW Array processor: Sigle operatio o multiple (differet) data elemets 16

17 Vector Processors (I) A vector is a oe-dimesioal array of umbers May scietific/commercial programs use vectors for (i = 0; i<=49; i++) C[i] = (A[i] + B[i]) / 2 A vector processor is oe whose istructios operate o vectors rather tha scalar (sigle data) values Basic reuiremets Need to load/store vectors à vector registers (cotai vectors) Need to operate o vectors of differet legths à vector legth register (VLEN) Elemets of a vector might be stored apart from each other i memory à vector stride register (VSTR) Stride: distace i memory betwee two elemets of a vector 17

18 Vector Processors (II) A vector istructio performs a operatio o each elemet i cosecutive cycles Vector fuctioal uits are pipelied Each pipelie stage operates o a differet data elemet Vector istructios allow deeper pipelies No itra-vector depedecies à o hardware iterlockig eeded withi a vector No cotrol flow withi a vector Kow stride allows easy address calculatio for all vector elemets Eables prefetchig of vectors ito registers/cache/memory 18

19 Vector Processor Advatages + No depedecies withi a vector Pipeliig & parallelizatio work really well Ca have very deep pipelies, o depedecies! + Each istructio geerates a lot of work Reduces istructio fetch badwidth reuiremets + Highly regular memory access patter + No eed to explicitly code loops Fewer braches i the istructio seuece 19

20 Vector Processor Disadvatages -- Works (oly) if parallelism is regular (data/simd parallelism) ++ Vector operatios -- Very iefficiet if parallelism is irregular -- How about searchig for a key i a liked list? Fisher, Very Log Istructio Word architectures ad the ELI-512, ISCA

21 Vector Processor Limitatios -- Memory (badwidth) ca easily become a bottleeck, especially if 1. compute/memory operatio balace is ot maitaied 2. data is ot mapped appropriately to memory baks 21

22 Vector Processig i More Depth

23 Vector Registers Each vector data register holds N M-bit values Vector cotrol registers: VLEN, VSTR, VMASK Maximum VLEN ca be N Maximum umber of elemets stored i a vector register Vector Mask Register (VMASK) Idicates which elemets of vector to operate o Set by vector test istructios e.g., VMASK[i] = (V k [i] == 0) V0,0 V0,1 M-bit wide V1,0 V1,1 M-bit wide V0,N-1 V1,N-1 23

24 Vector Fuctioal Uits Use a deep pipelie to execute elemet operatios à fast clock cycle V 1 V 2 V 3 Cotrol of deep pipelie is simple because elemets i vector are idepedet Six stage multiply pipelie V1 * V2 à V3 Slide credit: Krste Asaovic 24

25 Vector Machie Orgaizatio (CRAY-1) CRAY-1 Russell, The CRAY-1 computer system, CACM Scalar ad vector modes 8 64-elemet vector registers 64 bits per elemet 16 memory baks 8 64-bit scalar registers 8 24-bit address registers 25

26 CRAY ETH (CAB, E Floor) 26

27 CRAY X-MP System Orgaizatio CRAY X-MP system orgaizatio Cray Research Ic., The CRAY X-MP Series of Computer Systems,

28 CRAY X-MP Desig Detail CRAY X-MP desigdetail Maiframe CRAY X-MP sigle- ad multiprocessor systems are desiged to offer users outstadmg performace o large-scale, compute-itesive ad 110-boud jobs. CRAY X-MP maiframes cosist of SIX (X-MPII), eight (X-MPl2) or twelve (X-MPl4) vertical colums arraged i a arc. Power supplies ad coolig are clustered aroud the base ad exted outward. Model CRAY X-MPl416 CRAY X-MPl48 CRAY X-MPl216 CRAY X-MP128 CRAY X-MPl24 CRAY X-MPl18 CRAY X-MPl14 CRAY X-MP112 CRAY X-MPII 1 Number of CPUs Memory size (millios of 64-bit words) Number of baks A descriptio of the major system compoets ad their fuctios follows. CPU computatio sectio Withi the computatio sectio of each CPU are operatig registers, fuctioal uits ad a istructio cotrol etwork - hardware elemets that cooperate i executig seueces of istructios. The istructio cotrol etwork makes all decisios related to istructio issue as well as coordiatig the three types of processig withi each CPU: vector, scalar ad address. Each of the processig modes has its associated registers ad fuctioal u k The block diagram of a CRAY X-MPl4 (opposite page) illustrates the relatioship of the registers to the fuctioal uits, istructio buffers, I10 chael cotrol registers, iterprocessor commuicatios sectio ad memory. For multiple-processor CRAY X-MP models, the iterprocessor commuicatios sectio coordiates processig betwee CPUs, ad cetral memory is shared. Registers The basic set of programmable registers is composed of: Eight 24-bit address (A) registers Sixty-four 24-b~t itermediate address (B) registers Eight 64-bit scalar (S) registers Sixty-four 64-bit scalar-save (T) reg~sters Eight 64-elemet (4096-bit) vector (V) registers with 64 bits per elemet The 24-bit A registers are geerally used for addressig ad coutig operatios. Associated with them are 64 B registers, also 24 bits wide. Sice the trasfer betwee a A ad a B register takes oly oe clock period, the B registers assume the role of data cache, storig iformatio for fast access without tyig up the A registers for relatively log periods. Cray Research Ic., The CRAY X-MP Series of Computer Systems,

29 CRAY X-MP CPU Fuctioal Uits Cray Research Ic., The CRAY X-MP Series of Computer Systems,

30 CRAY X-MP System Cofiguratio Cray Research Ic., The CRAY X-MP Series of Computer Systems,

Seymour Cray, the Father of Supercomputers "If you

strog oxe or 1024 chickes?" amityrebecca / Piterest.

ch/pi/473018767088408061/ Scott Siklier / Corbis.

31 Seymour Cray, the Father of Supercomputers "If you were plowig a field, which would you rather use: Two strog oxe or 1024 chickes?" amityrebecca / Piterest. Scott Siklier / Corbis. 31

Vector Machie Orgaizatio (CRAY-1) CRAY-1 Russell, The CRAY-1 computer system, CACM 1978.

32 Vector Machie Orgaizatio (CRAY-1) CRAY-1 Russell, The CRAY-1 computer system, CACM Scalar ad vector modes 8 64-elemet vector registers 64 bits per elemet 16 memory baks 8 64-bit scalar registers 8 24-bit address registers 32

33 Loadig/Storig Vectors from/to Memory Reuires loadig/storig multiple elemets Elemets separated from each other by a costat distace (stride) Assume stride = 1 for ow Elemets ca be loaded i cosecutive cycles if we ca start the load of oe elemet per cycle Ca sustai a throughput of oe elemet per cycle Questio: How do we achieve this with a memory that takes more tha 1 cycle to access? Aswer: Bak the memory; iterleave the elemets across baks 33

34 Memory Bakig Memory is divided ito baks that ca be accessed idepedetly; baks share address ad data buses (to miimize pi cost) Ca start ad complete oe bak access per cycle Ca sustai N parallel accesses if all N go to differet baks Bak 0 Bak 1 Bak 2 Bak 15 MDR MAR MDR MAR MDR MAR MDR MAR Data bus Address bus Picture credit: Derek Chiou CPU 34

35 Vector Memory System Next address = Previous address + Stride If (stride == 1) && (cosecutive elemets iterleaved across baks) && (umber of baks >= bak latecy), the we ca sustai 1 elemet/cycle throughput Vector Registers Base Stride Address Geerator A B C D E F Memory Baks Picture credit: Krste Asaovic 35

36 Scalar Code Example: Elemet-Wise Avg. For I = 0 to 49 C[i] = (A[i] + B[i]) / 2 Scalar code (istructio ad its latecy) MOVI R0 = 50 1 MOVA R1 = A 1 MOVA R2 = B 1 MOVA R3 = C dyamic istructios X: LD R4 = MEM[R1++] 11 ;autoicremet addressig LD R5 = MEM[R2++] 11 ADD R6 = R4 + R5 4 SHFR R7 = R6 >> 1 1 ST MEM[R3++] = R7 11 DECBNZ R0, X 2 ;decremet ad brach if NZ 36

37 Scalar Code Executio Time (I Order) Scalar executio time o a i-order processor with 1 bak First two loads i the loop caot be pipelied: 2*11 cycles *40 = 2004 cycles Scalar executio time o a i-order processor with 16 baks (word-iterleaved: cosecutive words are stored i cosecutive baks) First two loads i the loop ca be pipelied *30 = 1504 cycles Why 16 baks? 11-cycle memory access latecy Havig 16 (>11) baks esures there are eough baks to overlap eough memory operatios to cover memory latecy 37

38 Vectorizable Loops A loop is vectorizable if each iteratio is idepedet of ay other For I = 0 to 49 C[i] = (A[i] + B[i]) / 2 Vectorized loop (each istructio ad its latecy): MOVI VLEN = 50 1 MOVI VSTR = dyamic istructios VLD V0 = A 11 + VLEN - 1 VLD V1 = B 11 + VLEN 1 VADD V2 = V0 + V1 4 + VLEN - 1 VSHFR V3 = V2 >> VLEN - 1 VST C = V VLEN 1 38

39 Basic Vector Code Performace Assume o chaiig (o vector data forwardig) i.e., output of a vector fuctioal uit caot be used as the direct iput of aother The etire vector register eeds to be ready before ay elemet of it ca be used as part of aother operatio Oe memory port (oe address geerator) 16 memory baks (word-iterleaved) V0 = A[0..49] V1 = B[0..49] ADD SHIFT STORE 285 cycles 39

40 Vector Chaiig Vector chaiig: Data forwardig from oe vector fuctioal uit to aother LV v1 MULV v3,v1,v2 ADDV v5, v3, v4 V 1 V 2 V 3 V 4 V 5 Chai Chai Load Uit Memory Mult. Add Slide credit: Krste Asaovic 40

41 Vector Code Performace - Chaiig Vector chaiig: Data forwardig from oe vector fuctioal uit to aother These two VLDs caot be pipelied. WHY? Strict assumptio: Each memory bak has a sigle port (memory badwidth bottleeck) cycles VLD ad VST caot be pipelied. WHY? 41

42 Vector Code Performace Multiple Memory Ports Chaiig ad 2 load ports, 1 store port i each bak cycles 19X perf. improvemet!

43 Questios (I) What if # data elemets > # elemets i a vector register? Idea: Break loops so that each iteratio operates o # elemets i a vector register E.g., 527 data elemets, 64-elemet VREGs 8 iteratios where VLEN = 64 1 iteratio where VLEN = 15 (eed to chage value of VLEN) Called vector stripmiig 43

44 (Vector) Stripmiig Source: 44

45 Questios (II) What if vector data is ot stored i a strided fashio i memory? (irregular memory access to a vector) Idea: Use idirectio to combie/pack elemets ito vector registers Called scatter/gather operatios 45

46 Gather/Scatter Operatios Wat to vectorize loops with idirect accesses: for (i=0; i<n; i++) A[i] = B[i] + C[D[i]] Idexed load istructio (Gather) LV vd, rd # Load idices i D vector LVI vc, rc, vd # Load idirect from rc base LV vb, rb # Load B vector ADDV.D va,vb,vc # Do add SV va, ra # Store result 46

47 Gather/Scatter Operatios Gather/scatter operatios ofte implemeted i hardware to hadle sparse vectors (matrices) Vector loads ad stores use a idex vector which is added to the base register to geerate the addresses Idex Vector Data Vector (to Store) Stored Vector (i Memory) Base Base+1 X Base Base+3 X Base+4 X Base+5 X Base Base

48 Coditioal Operatios i a Loop What if some operatios should ot be executed o a vector (based o a dyamically-determied coditio)? loop: for (i=0; i<n; i++) if (a[i]!= 0) the b[i]=a[i]*b[i] Idea: Masked operatios VMASK register is a bit mask determiig which data elemet should ot be acted upo VLD V0 = A VLD V1 = B VMASK = (V0!= 0) VMUL V1 = V0 * V1 VST B = V1 This is predicated executio. Executio is predicated o mask bit. 48

49 Aother Example with Maskig for (i = 0; i < 64; ++i) if (a[i] >= b[i]) else c[i] = a[i] c[i] = b[i] A B VMASK Steps to execute the loop i SIMD code 1. Compare A, B to get VMASK 2. Masked store of A ito C 3. Complemet VMASK 4. Masked store of B ito C 49

50 Masked Vector Istructios Simple Implemetatio execute all N operatios, tur off result writeback accordig to mask Desity-Time Implemetatio sca mask vector ad oly execute elemets with o-zero masks M[7]=1 A[7] B[7] M[7]=1 M[6]=0 M[5]=1 A[6] A[5] B[6] B[5] M[6]=0 M[5]=1 A[7] B[7] M[4]=1 M[3]=0 A[4] A[3] B[4] B[3] M[4]=1 M[3]=0 C[5] M[2]=0 C[4] M[2]=0 M[1]=1 C[2] C[1] M[1]=1 M[0]=0 C[1] Write data port M[0]=0 Write Eable C[0] Write data port Which oe is better? Tradeoffs? Slide credit: Krste Asaovic 50

51 Desig of Digital Circuits Lecture 20: SIMD Processors Prof. Our Mutlu ETH Zurich Sprig May 2018

52 We did ot cover the followig slides i lecture. These are for your preparatio for the ext lecture.

53 Some Issues Stride ad bakig As log as they are relatively prime to each other ad there are eough baks to cover bak access latecy, we ca sustai 1 elemet/cycle throughput Storage of a matrix Row major: Cosecutive elemets i a row are laid out cosecutively i memory Colum major: Cosecutive elemets i a colum are laid out cosecutively i memory You eed to chage the stride whe accessig a row versus colum 53

54 54

55 Miimizig Bak Coflicts More baks Better data layout to match the access patter Is this always possible? Better mappig of address to bak E.g., radomized mappig Rau, Pseudo-radomly iterleaved memory, ISCA

56 Array vs. Vector Processors, Revisited Array vs. vector processor distictio is a purist s distictio Most moder SIMD processors are a combiatio of both They exploit data parallelism i both time ad space GPUs are a prime example we will cover i a bit more detail 56

57 Remember: Array vs. Vector Processors ARRAY PROCESSOR VECTOR PROCESSOR Istructio Stream LD VR ß A[3:0] ADD VR ß VR, 1 MUL VR ß VR, 2 ST A[3:0] ß VR Time Same same time LD0 LD1 LD2 LD3 LD0 Differet time AD0 AD1 AD2 AD3 LD1 AD0 MU0 MU1 MU2 MU3 LD2 AD1 MU0 ST0 ST1 ST2 ST3 LD3 AD2 MU1 ST0 Differet same space AD3 MU2 ST1 MU3 ST2 Same space ST3 Space Space 57

58 Vector Istructio Executio VADD A,B à C Executio usig oe pipelied fuctioal uit Executio usig four pipelied fuctioal uits A[6] B[6] A[24] B[24] A[25] B[25] A[26] B[26] A[27] B[27] A[5] B[5] A[20] B[20] A[21] B[21] A[22] B[22] A[23] B[23] A[4] B[4] A[16] B[16] A[17] B[17] A[18] B[18] A[19] B[19] A[3] B[3] A[12] B[12] A[13] B[13] A[14] B[14] A[15] B[15] C[2] C[8] C[9] C[10] C[11] C[1] C[4] C[5] C[6] C[7] C[0] C[0] C[1] C[2] C[3] Slide credit: Krste Asaovic 58

59 Vector Uit Structure Fuctioal Uit Partitioed Vector Registers Elemets 0, 4, 8, Elemets 1, 5, 9, Elemets 2, 6, 10, Elemets 3, 7, 11, Lae Memory Subsystem Slide credit: Krste Asaovic 59

60 Vector Istructio Level Parallelism Ca overlap executio of multiple vector istructios Example machie has 32 elemets per vector register ad 8 laes Completes 24 operatios/cycle while issuig 1 vector istructio/cycle Load Uit Multiply Uit Add Uit load mul time add load mul add Istructio issue Slide credit: Krste Asaovic 60

61 Automatic Code Vectorizatio Scalar Seuetial Code for (i=0; i < N; i++) C[i] = A[i] + B[i]; Vectorized Code load load load Iter. 1 add load Time add load add load store store store Iter. 2 load load Iter. 1 Iter. 2 Vector Istructio add store Vectorizatio is a compile-time reorderig of operatio seuecig Þ reuires extesive loop depedece aalysis Slide credit: Krste Asaovic 61

62 Vector/SIMD Processig Summary Vector/SIMD machies are good at exploitig regular datalevel parallelism Same operatio performed o may data elemets Improve performace, simplify desig (o itra-vector depedecies) Performace improvemet limited by vectorizability of code Scalar operatios limit vector machie performace Remember Amdahl s Law CRAY-1 was the fastest SCALAR machie at its time! May existig ISAs iclude (vector-like) SIMD operatios Itel MMX/SSE/AVX, PowerPC AltiVec, ARM Advaced SIMD 62

63 SIMD Operatios i Moder ISAs

64 Caregie Mello SIMD ISA Extesios Sigle Istructio Multiple Data (SIMD) extesio istructios Sigle istructio acts o multiple pieces of data at oce Commo applicatio: graphics Perform short arithmetic operatios (also called packed arithmetic) For example: add four 8-bit umbers Must modify ALU to elimiate carries betwee 8-bit values padd8 $s2, $s0, $s Bit positio a 3 a 2 a 1 a 0 $s0 + b 3 b 2 b 1 b 0 $s1 a 3 + b 3 a 2 + b 2 a 1 + b 1 a 0 + b 0 $s2 64

65 Itel Petium MMX Operatios Idea: Oe istructio operates o multiple data elemets simultaeously Ala array processig (yet much more limited) Desiged with multimedia (graphics) operatios i mid No VLEN register Opcode determies data type: 8 8-bit bytes 4 16-bit words 2 32-bit doublewords 1 64-bit uadword Stride is always eual to 1. Peleg ad Weiser, MMX Techology Extesio to the Itel Architecture, IEEE Micro,

66 MMX Example: Image Overlayig (I) Goal: Overlay the huma i image 1 o top of the backgroud i image 2 Peleg ad Weiser, MMX Techology Extesio to the Itel Architecture, IEEE Micro,

67 MMX Example: Image Overlayig (II) Peleg ad Weiser, MMX Techology Extesio to the Itel Architecture, IEEE Micro,

68 GPUs (Graphics Processig Uits)

69 GPUs are SIMD Egies Udereath The istructio pipelie operates like a SIMD pipelie (e.g., a array processor) However, the programmig is doe usig threads, NOT SIMD istructios To uderstad this, let s go back to our parallelizable code example But, before that, let s distiguish betwee Programmig Model (Software) vs. Executio Model (Hardware) 69

70 Programmig Model vs. Hardware Executio Model Programmig Model refers to how the programmer expresses the code E.g., Seuetial (vo Neuma), Data Parallel (SIMD), Dataflow, Multi-threaded (MIMD, SPMD), Executio Model refers to how the hardware executes the code udereath E.g., Out-of-order executio, Vector processor, Array processor, Dataflow processor, Multiprocessor, Multithreaded processor, Executio Model ca be very differet from the Programmig Model E.g., vo Neuma model implemeted by a OoO processor E.g., SPMD model implemeted by a SIMD processor (a GPU) 70

71 How Ca You Exploit Parallelism Here? Scalar Seuetial Code for (i=0; i < N; i++) C[i] = A[i] + B[i]; Iter. 1 load add load Let s examie three programmig optios to exploit istructio-level parallelism preset i this seuetial code: store load 1. Seuetial (SISD) Iter. 2 load 2. Data-Parallel (SIMD) add store 3. Multithreaded (MIMD/SPMD) 71

72 Prog. Model 1: Seuetial (SISD) for (i=0; i < N; i++) C[i] = A[i] + B[i]; Scalar Seuetial Code Ca be executed o a: Iter. 1 load load Pipelied processor Out-of-order executio processor add Idepedet istructios executed whe ready store load Differet iteratios are preset i the istructio widow ad ca execute i parallel i multiple fuctioal uits Iter. 2 add load I other words, the loop is dyamically urolled by the hardware Superscalar or VLIW processor store Ca fetch ad execute multiple istructios per cycle 72

73 Prog. Model 2: Data Parallel (SIMD) for (i=0; i < N; i++) C[i] = A[i] + B[i]; Scalar Seuetial Code Vector Istructio Vectorized Code load load load VLD A à V1 Iter. 1 load load VLD B à V2 add add VADD V1 + V2 à V3 store store VST V3 à C Iter. 2 load Iter. 1 load Iter. 2 Realizatio: Each iteratio is idepedet add store Idea: Programmer or compiler geerates a SIMD istructio to execute the same istructio from all iteratios across differet data Best executed by a SIMD processor (vector, array) 73

74 Prog. Model 3: Multithreaded for (i=0; i < N; i++) C[i] = A[i] + B[i]; Scalar Seuetial Code load load load Iter. 1 load load Iter. 2 load Iter. 1 add store add store load add store Iter. 2 Realizatio: Each iteratio is idepedet Idea: Programmer or compiler geerates a thread to execute each iteratio. Each thread does the same thig (but o differet data) Ca be executed o a MIMD machie 74

75 Prog. Model 3: Multithreaded for (i=0; i < N; i++) C[i] = A[i] + B[i]; load load load load add store add store Iter. 1 Iter. 2 Realizatio: Each iteratio is idepedet Idea: This Programmer particular or model compiler is also geerates called: a thread to execute each iteratio. Each thread does the same SPMD: thig (but Sigle o Program differet data) Multiple Data Ca Ca be executed be executed o a MIMD o a SIMT SIMD machie machie Sigle Istructio Multiple Thread 75

76 A GPU is a SIMD (SIMT) Machie Except it is ot programmed usig SIMD istructios It is programmed usig threads (SPMD programmig model) Each thread executes the same code but operates a differet piece of data Each thread has its ow cotext (i.e., ca be treated/restarted/executed idepedetly) A set of threads executig the same istructio are dyamically grouped ito a warp (wavefrot) by the hardware A warp is essetially a SIMD operatio formed by hardware! 76

77 SPMD o SIMT Machie for (i=0; i < N; i++) C[i] = A[i] + B[i]; load load Warp 0 at PC X load load Warp 0 at PC X+1 add add Warp 0 at PC X+2 store store Warp 0 at PC X+3 Iter. 1 Iter. 2 Warp: A set of threads that execute Realizatio: Each iteratio is idepedet the same istructio (i.e., at the same PC) Idea: This Programmer particular or model compiler is also geerates called: a thread to execute each iteratio. Each thread does the same SPMD: thig Sigle (but o Program differet data) Multiple Data Ca A GPU Ca be executed be executes executed o it a usig MIMD o a SIMD the machie SIMT machie model: Sigle Istructio Multiple Thread 77

78 Graphics Processig Uits SIMD ot Exposed to Programmer (SIMT)

79 SIMD vs. SIMT Executio Model SIMD: A sigle seuetial istructio stream of SIMD istructios à each istructio specifies multiple data iputs [VLD, VLD, VADD, VST], VLEN SIMT: Multiple istructio streams of scalar istructios à threads grouped dyamically ito warps [LD, LD, ADD, ST], NumThreads Two Major SIMT Advatages: Ca treat each thread separately à i.e., ca execute each thread idepedetly (o ay type of scalar pipelie) à MIMD processig Ca group threads ito warps flexibly à i.e., ca group threads that are supposed to truly execute the same istructio à dyamically obtai ad maximize beefits of SIMD processig 79

80 Multithreadig of Warps for (i=0; i < N; i++) C[i] = A[i] + B[i]; Assume a warp cosists of 32 threads If you have 32K iteratios, ad 1 iteratio/thread à 1K warps Warps ca be iterleaved o the same pipelie à Fie graied multithreadig of warps load load Warp 10 at PC X load load add store add store Warp 20 at PC X+2 Iter * Iter *

81 Warps ad Warp-Level FGMT Warp: A set of threads that execute the same istructio (o differet data elemets) à SIMT (Nvidia-speak) All threads ru the same code Warp: The threads that ru legthwise i a wove fabric Thread Warp Scalar Scalar Scalar Thread Thread Thread W X Y Commo PC Scalar Thread Z Thread Warp 3 Thread Warp 8 Thread Warp 7 SIMD Pipelie 81

82 High-Level View of a GPU 82

83 Latecy Hidig via Warp-Level FGMT Warp: A set of threads that execute the same istructio (o differet data elemets) Thread Warp 3 Thread Warp 8 Warps available for schedulig Fie-graied multithreadig Oe istructio per thread i pipelie at a time (No iterlockig) Iterleave warp executio to hide latecies Register values of all threads stay i register file FGMT eables log latecy tolerace Millios of pixels Thread Warp 7 RF ALU All Hit? I-Fetch Decode RF ALU D-Cache Writeback Data RF ALU SIMD Pipelie Warps accessig memory hierarchy Miss? Thread Warp 1 Thread Warp 2 Thread Warp 6 Slide credit: Tor Aamodt 83

84 Warp Executio (Recall the Slide) 32-thread warp executig ADD A[tid],B[tid] à C[tid] Executio usig oe pipelied fuctioal uit Executio usig four pipelied fuctioal uits A[6] B[6] A[24] B[24] A[25] B[25] A[26] B[26] A[27] B[27] A[5] B[5] A[20] B[20] A[21] B[21] A[22] B[22] A[23] B[23] A[4] B[4] A[16] B[16] A[17] B[17] A[18] B[18] A[19] B[19] A[3] B[3] A[12] B[12] A[13] B[13] A[14] B[14] A[15] B[15] C[2] C[8] C[9] C[10] C[11] C[1] C[4] C[5] C[6] C[7] C[0] C[0] C[1] C[2] C[3] Slide credit: Krste Asaovic 84

85 SIMD Executio Uit Structure Fuctioal Uit Registers for each Thread Registers for thread IDs 0, 4, 8, Registers for thread IDs 1, 5, 9, Registers for thread IDs 2, 6, 10, Registers for thread IDs 3, 7, 11, Lae Memory Subsystem Slide credit: Krste Asaovic 85

86 Warp Istructio Level Parallelism Ca overlap executio of multiple istructios Example machie has 32 threads per warp ad 8 laes Completes 24 operatios/cycle while issuig 1 warp/cycle Load Uit Multiply Uit Add Uit W0 W1 time W2 W3 W4 W5 Warp issue Slide credit: Krste Asaovic 86

87 SIMT Memory Access Same istructio i differet threads uses thread id to idex ad access differet data elemets + Let s assume N=16, 4 threads per warp à 4 warps Threads Data elemets Warp 0 Warp 1 Warp 2 Warp 3 Slide credit: Hyesoo Kim

Sample GPU SIMT Code (Simplified) CPU code for (ii = 0; ii < 100000; ++ii)

void KerelFuctio( ) { it tid = blockdim.x * blockidx.x + threadidx.

88 Sample GPU SIMT Code (Simplified) CPU code for (ii = 0; ii < ; ++ii) { C[ii] = A[ii] + B[ii]; } CUDA code // there are threads global void KerelFuctio( ) { it tid = blockdim.x * blockidx.x + threadidx.x; it vara = aa[tid]; it varb = bb[tid]; C[tid] = vara + varb; } Slide credit: Hyesoo Kim

89 Sample GPU Program (Less Simplified) Slide credit: Hyesoo Kim 89

90 Warp-based SIMD vs. Traditioal SIMD Traditioal SIMD cotais a sigle thread Seuetial istructio executio; lock-step operatios i a SIMD istructio Programmig model is SIMD (o extra threads) à SW eeds to kow vector legth ISA cotais vector/simd istructios Warp-based SIMD cosists of multiple scalar threads executig i a SIMD maer (i.e., same istructio executed by all threads) Does ot have to be lock step Each thread ca be treated idividually (i.e., placed i a differet warp) à programmig model ot SIMD SW does ot eed to kow vector legth Eables multithreadig ad flexible dyamic groupig of threads ISA is scalar à SIMD operatios ca be formed dyamically Essetially, it is SPMD programmig model implemeted o SIMD hardware 90

91 SPMD Sigle procedure/program, multiple data This is a programmig model rather tha computer orgaizatio Each processig elemet executes the same procedure, except o differet data elemets Procedures ca sychroize at certai poits i program, e.g. barriers Essetially, multiple istructio streams execute the same program Each program/procedure 1) works o differet data, 2) ca execute a differet cotrol-flow path, at ru-time May scietific applicatios are programmed this way ad ru o MIMD hardware (multiprocessors) Moder GPUs programmed i a similar way o a SIMD hardware 91

92 SIMD vs. SIMT Executio Model SIMD: A sigle seuetial istructio stream of SIMD istructios à each istructio specifies multiple data iputs [VLD, VLD, VADD, VST], VLEN SIMT: Multiple istructio streams of scalar istructios à threads grouped dyamically ito warps [LD, LD, ADD, ST], NumThreads Two Major SIMT Advatages: Ca treat each thread separately à i.e., ca execute each thread idepedetly o ay type of scalar pipelie à MIMD processig Ca group threads ito warps flexibly à i.e., ca group threads that are supposed to truly execute the same istructio à dyamically obtai ad maximize beefits of SIMD processig 92

93 Threads Ca Take Differet Paths i Warp-based SIMD Each thread ca have coditioal cotrol flow istructios Threads ca execute differet cotrol flow paths A B Thread Warp Commo PC C D F Thread 1 Thread 2 Thread 3 Thread 4 E G Slide credit: Tor Aamodt 93

94 Cotrol Flow Problem i GPUs/SIMT A GPU uses a SIMD pipelie to save area o cotrol logic. Groups scalar threads ito warps Brach Brach divergece occurs whe threads iside warps brach to differet executio paths. Path A Path B This is the same as coditioal/predicated/masked executio. Recall the Vector Mask ad Masked Vector Operatios? Slide credit: Tor Aamodt 94

95 Remember: Each Thread Is Idepedet Two Major SIMT Advatages: Ca treat each thread separately à i.e., ca execute each thread idepedetly o ay type of scalar pipelie à MIMD processig Ca group threads ito warps flexibly à i.e., ca group threads that are supposed to truly execute the same istructio à dyamically obtai ad maximize beefits of SIMD processig If we have may threads We ca fid idividual threads that are at the same PC Ad, group them together ito a sigle warp dyamically This reduces divergece à improves SIMD utilizatio SIMD utilizatio: fractio of SIMD laes executig a useful operatio (i.e., executig a active thread) 95

96 Dyamic Warp Formatio/Mergig Idea: Dyamically merge threads executig the same istructio (after brach divergece) Form ew warps from warps that are waitig Eough threads brachig to each path eables the creatio of full ew warps Warp X Warp Z Warp Y 96

97 Dyamic Warp Formatio/Mergig Idea: Dyamically merge threads executig the same istructio (after brach divergece) Brach Path A Path B Fug et al., Dyamic Warp Formatio ad Schedulig for Efficiet GPU Cotrol Flow, MICRO

98 Dyamic Warp Formatio Example B x/1110 y/0011 A x/1111 y/1111 C x/1000 y/0010 D x/0110 y/0001 F x/0001 y/1100 E x/1110 y/0011 A D Leged A Executio of Warp x at Basic Block A A ew warp created from scalar threads of both Warp x ad y executig at Basic Block D Executio of Warp y at Basic Block A Baselie Dyamic Warp Formatio G x/1111 y/1111 A A B B C C D D E E F F G G A A A A B B C D E E F G G A A Time Time Slide credit: Tor Aamodt 98

99 Hardware Costraits Limit Flexibility of Warp Groupig Fuctioal Uit Registers for each Thread Registers for thread IDs 0, 4, 8, Registers for thread IDs 1, 5, 9, Registers for thread IDs 2, 6, 10, Registers for thread IDs 3, 7, 11, Lae Ca you move ay thread flexibly to ay lae? Memory Subsystem Slide credit: Krste Asaovic 99

100 A Example GPU

101 NVIDIA GeForce GTX 285 NVIDIA-speak: 240 stream processors SIMT executio Geeric speak: 30 cores 8 SIMD fuctioal uits per core Slide credit: Kayvo Fatahalia 101

102 NVIDIA GeForce GTX 285 core 64 KB of storage for thread cotexts (registers) = SIMD fuctioal uit, cotrol shared across 8 uits = multiply-add = multiply = istructio stream decode = executio cotext storage Slide credit: Kayvo Fatahalia 102

103 NVIDIA GeForce GTX 285 core 64 KB of storage for thread cotexts (registers) Groups of 32 threads share istructio stream (each group is a Warp) Up to 32 warps are simultaeously iterleaved Up to 1024 thread cotexts ca be stored Slide credit: Kayvo Fatahalia 103

104 NVIDIA GeForce GTX 285 Tex Tex Tex Tex Tex Tex Tex Tex Tex Tex 30 cores o the GTX 285: 30,720 threads Slide credit: Kayvo Fatahalia 104

105 Evolutio of NVIDIA GPUs #Stream Processors GFLOPS Stream Processors GFLOPS 0 GTX 285 (2009) GTX 480 (2010) GTX 780 (2013) GTX 980 (2014) P100 (2016) V100 (2017) 0 105

106 NVIDIA V100 NVIDIA-speak: 5120 stream processors SIMT executio Geeric speak: 80 cores 64 SIMD fuctioal uits per core Tesor cores for Machie Learig 106

107 NVIDIA V100 Block Diagram 80 cores o the V

108 NVIDIA V100 Core 15.7 TFLOPS Sigle Precisio 7.8 TFLOPS Double Precisio 125 TFLOPS for Deep Learig (Tesor cores) 108

Design of Digital Circuits Lecture 21: SIMD Processors II and Graphics Processing Units

Design of Digital Circuits Lecture 21: SIMD Processors II and Graphics Processing Units Desig of Digital Circuits Lecture 21: SIMD Processors II ad Graphics Processig Uits Dr. Jua Gómez Lua Prof. Our Mutlu ETH Zurich Sprig 2018 17 May 2018 New Course: Bachelor s Semiar i Comp Arch Fall 2018