Computer Architecture Lecture 8: SIMD Processors and GPUs. Prof. Onur Mutlu, ETH Zürich, Fall 2017. 18 October 2017

Agenda for Today & Next Few Lectures: SIMD Processors; GPUs; Introduction to GPU Programming. Digitaltechnik (Spring 2017) YouTube videos: Lecture 19: Beginning of SIMD https://youtu.be/xe9ogmpemlw?t=1h11m42s ; Lecture 20: SIMD Processors https://youtu.be/hrhs7xlp0sg?t=6m48s ; Lecture 21: GPUs https://youtu.be/muptdxl3jks?t=3m03s 2

SIMD Processing: Exploiting Regular (Data) Parallelism

Flynn's Taxonomy of Computers. Mike Flynn, "Very High-Speed Computing Systems," Proc. of IEEE, 1966. SISD: Single instruction operates on single data element. SIMD: Single instruction operates on multiple data elements (array processor, vector processor). MISD: Multiple instructions operate on single data element; closest forms: systolic array processor, streaming processor. MIMD: Multiple instructions operate on multiple data elements (multiple instruction streams): multiprocessor, multithreaded processor. 4

Data Parallelism. Concurrency arises from performing the same operations on different pieces of data: single instruction multiple data (SIMD), e.g., dot product of two vectors. Contrast with dataflow: concurrency arises from executing different operations in parallel (in a data-driven manner). Contrast with thread ("control") parallelism: concurrency arises from executing different threads of control in parallel. SIMD exploits instruction-level parallelism: multiple instructions (more appropriately, operations) are concurrent; the instructions happen to be the same. 5
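For instance, here is a minimal C sketch (illustrative, not from the slides) of the dot product mentioned above: the same multiply-accumulate is applied at every index, which is exactly the regularity SIMD hardware exploits.

    /* Dot product: the same operation on different pieces of data.
     * Each iteration is identical, so elements can be mapped onto
     * SIMD lanes (array processor) or vector elements (vector processor). */
    float dot(const float *a, const float *b, int n) {
        float sum = 0.0f;
        for (int i = 0; i < n; i++)
            sum += a[i] * b[i];   /* same op, data item i */
        return sum;
    }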

SIMD Processing. Single instruction operates on multiple data elements, in time or in space: multiple processing elements; time-space duality. Array processor: instruction operates on multiple data elements at the same time using different spaces. Vector processor: instruction operates on multiple data elements in consecutive time steps using the same space. 6

Array vs. Vector Processors. Instruction stream: LD VR ← A[3:0]; ADD VR ← VR, 1; MUL VR ← VR, 2; ST A[3:0] ← VR. [Figure: time-space diagram. The array processor performs the same op at the same time across different spaces: LD0 LD1 LD2 LD3, then AD0 AD1 AD2 AD3, then MU0 MU1 MU2 MU3, then ST0 ST1 ST2 ST3; over time, different ops occupy the same space. The vector processor performs the same op in the same space over consecutive time steps, with different ops in flight at the same time: LD0; LD1 AD0; LD2 AD1 MU0; LD3 AD2 MU1 ST0; AD3 MU2 ST1; MU3 ST2; ST3.] 7

SIMD Array Processing vs. VLIW. VLIW: multiple independent operations packed together by the compiler. 8

SIMD Array Processing vs. VLIW. Array processor: single operation on multiple (different) data elements. 9

Vector Processors. A vector is a one-dimensional array of numbers. Many scientific/commercial programs use vectors: for (i = 0; i <= 49; i++) C[i] = (A[i] + B[i]) / 2. A vector processor is one whose instructions operate on vectors rather than scalar (single data) values. Basic requirements: Need to load/store vectors → vector registers (contain vectors). Need to operate on vectors of different lengths → vector length register (VLEN). Elements of a vector might be stored apart from each other in memory → vector stride register (VSTR). Stride: distance between two elements of a vector. 10

Vector Processors (II). A vector instruction performs an operation on each element in consecutive cycles: vector functional units are pipelined; each pipeline stage operates on a different data element. Vector instructions allow deeper pipelines: no intra-vector dependencies → no hardware interlocking within a vector; no control flow within a vector; known stride allows prefetching of vectors into registers/cache/memory. 11

Vector Processor Advantages. + No dependencies within a vector: pipelining and parallelization work really well; can have very deep pipelines, no dependencies! + Each instruction generates a lot of work: reduces instruction fetch bandwidth requirements. + Highly regular memory access pattern. + No need to explicitly code loops: fewer branches in the instruction sequence. 12

Vector Processor Disadvantages. -- Works (only) if parallelism is regular (data/SIMD parallelism): ++ vector operations; -- very inefficient if parallelism is irregular. -- How about searching for a key in a linked list? Fisher, "Very Long Instruction Word architectures and the ELI-512," ISCA 1983. 13

Vector Processor Limitations. -- Memory (bandwidth) can easily become a bottleneck, especially if: 1. compute/memory operation balance is not maintained, 2. data is not mapped appropriately to memory banks. 14

Vector Processing in More Depth

Vector Registers. Each vector data register holds N M-bit values. Vector control registers: VLEN, VSTR, VMASK. Maximum VLEN can be N: the maximum number of elements stored in a vector register. Vector Mask Register (VMASK): indicates which elements of the vector to operate on; set by vector test instructions, e.g., VMASK[i] = (Vk[i] == 0). [Figure: vector registers V0 and V1, each M bits wide per element, holding elements 0 through N-1.] 16

Vector Functional Units. Use a deep pipeline to execute element operations → fast clock cycle. Control of a deep pipeline is simple because the elements in a vector are independent. [Figure: a six-stage multiply pipeline computing V1 * V2 → V3.] Slide credit: Krste Asanovic. 17

Vector Machine Organization (CRAY-1). CRAY-1. Russell, "The CRAY-1 computer system," CACM 1978. Scalar and vector modes. 8 64-element vector registers, 64 bits per element. 16 memory banks. 8 64-bit scalar registers. 8 24-bit address registers. 18

Loading/Storing Vectors from/to Memory. Requires loading/storing multiple elements. Elements are separated from each other by a constant distance (stride); assume stride = 1 for now. Elements can be loaded in consecutive cycles if we can start the load of one element per cycle: can sustain a throughput of one element per cycle. Question: How do we achieve this with a memory that takes more than 1 cycle to access? Answer: Bank the memory; interleave the elements across banks. 19

Memory Banking. Memory is divided into banks that can be accessed independently; banks share address and data buses (to minimize pin cost). Can start and complete one bank access per cycle. Can sustain N parallel accesses if all N go to different banks. [Figure: a CPU connected via shared address and data buses to Bank 0, Bank 1, Bank 2, ..., Bank 15, each with its own MAR and MDR.] Picture credit: Derek Chiou. 20
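As a concrete illustration (a minimal C sketch under the slide's assumptions, not from the deck itself), word-interleaved banking maps consecutive word addresses to consecutive banks, so a stride-1 stream visits all 16 banks in turn:

    /* Word-interleaved banking with 16 banks: consecutive words land in
     * consecutive banks. If the number of banks covers the bank access
     * latency, a stride-1 stream sustains one element per cycle. */
    #define NUM_BANKS 16
    unsigned bank_of(unsigned word_addr)        { return word_addr % NUM_BANKS; }
    unsigned offset_in_bank(unsigned word_addr) { return word_addr / NUM_BANKS; }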

Vector Memory System. Next address = previous address + stride. If (stride = 1) and (consecutive elements are interleaved across banks) and (number of banks >= bank latency), then we can sustain 1 element/cycle throughput. [Figure: base and stride feed an address generator that distributes vector-register accesses across memory banks 0 through F.] Picture credit: Krste Asanovic. 21

Scalar Code Example. for i = 0 to 49: C[i] = (A[i] + B[i]) / 2. Scalar code (each instruction and its latency): MOVI R0 = 50 (1); MOVA R1 = A (1); MOVA R2 = B (1); MOVA R3 = C (1); X: LD R4 = MEM[R1++] (11) ; autoincrement addressing; LD R5 = MEM[R2++] (11); ADD R6 = R4 + R5 (4); SHFR R7 = R6 >> 1 (1); ST MEM[R3++] = R7 (11); DECBNZ R0, X (2) ; decrement and branch if NZ. 304 dynamic instructions (4 setup + 50 iterations * 6 instructions). 22

Scalar Code Execution Time (In Order). Scalar execution time on an in-order processor with 1 bank: the first two loads in the loop cannot be pipelined: 2*11 cycles; total 4 + 50*40 = 2004 cycles (each iteration takes 11 + 11 + 4 + 1 + 11 + 2 = 40 cycles). Scalar execution time on an in-order processor with 16 banks (word-interleaved: consecutive words are stored in consecutive banks): the first two loads in the loop can be pipelined, shortening each iteration to 30 cycles: 4 + 50*30 = 1504 cycles. Why 16 banks? The memory access latency is 11 cycles; having 16 (> 11) banks ensures there are enough banks to overlap enough memory operations to cover the memory latency. 23

Vectorizable Loops. A loop is vectorizable if each iteration is independent of any other. for i = 0 to 49: C[i] = (A[i] + B[i]) / 2. Vectorized loop (each instruction and its latency): MOVI VLEN = 50 (1); MOVI VSTR = 1 (1); VLD V0 = A (11 + VLEN - 1); VLD V1 = B (11 + VLEN - 1); VADD V2 = V0 + V1 (4 + VLEN - 1); VSHFR V3 = V2 >> 1 (1 + VLEN - 1); VST C = V3 (11 + VLEN - 1). 7 dynamic instructions. 24

Basic Vector Code Performance. Assume no chaining (no vector data forwarding), i.e., the output of a vector functional unit cannot be used as the direct input of another: the entire vector register needs to be ready before any element of it can be used as part of another operation. One memory port (one address generator); 16 memory banks (word-interleaved). Timeline: the two MOVIs (1 + 1), then V0 = A[0..49] (11 + 49), V1 = B[0..49] (11 + 49), ADD (4 + 49), SHIFT (1 + 49), STORE (11 + 49), all serialized: 285 cycles. 25

Vector Chaining. Vector chaining: data forwarding from one vector functional unit to another. LV v1; MULV v3, v1, v2; ADDV v5, v3, v4. [Figure: the load unit fills V1 from memory and chains into the multiplier producing V3, which chains into the adder producing V5.] Slide credit: Krste Asanovic. 26

Vector Code Performance - Chaining. Vector chaining: data forwarding from one vector functional unit to another. Strict assumption: each memory bank has a single port (memory bandwidth bottleneck). The two VLDs cannot be pipelined with each other, and the VST cannot be pipelined with the VLDs. WHY? Because each bank has only one port, a second memory operation must wait until the previous one drains from the banks. Timeline: 1 + 1 + (11 + 49) + (11 + 49) + (11 + 49), with the chained VADD and VSHFR hidden under the memory operations: 182 cycles. 27

Vector Code Performance - Multiple Memory Ports. Chaining plus 2 load ports and 1 store port in each bank: the second VLD can start one cycle after the first, and the VST overlaps with the loads through chaining. Critical path: 1 + 1 + 1 + 11 + 4 + 1 + 11 + 49 = 79 cycles, a 19X performance improvement over the 1504-cycle scalar version! 28

Questions (I). What if # data elements > # elements in a vector register? Idea: break loops so that each iteration operates on # elements in a vector register. E.g., with 527 data elements and 64-element VREGs: 8 iterations where VLEN = 64, plus 1 iteration where VLEN = 15 (need to change the value of VLEN). This is called vector stripmining (a sketch follows below). What if vector data is not stored in a strided fashion in memory (irregular memory access to a vector)? Idea: use indirection to combine/pack elements into vector registers. Called scatter/gather operations. 29
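A minimal C sketch of stripmining (illustrative, not from the slides; vec_add_half stands in for the vectorized loop body, and MVL for the 64-element maximum vector length):

    #define MVL 64  /* maximum vector length: elements per vector register */

    /* Stand-in for the vectorized body: one vector instruction sequence
     * executing C[i] = (A[i] + B[i]) / 2 over 'vlen' elements (VLEN = vlen). */
    static void vec_add_half(const int *A, const int *B, int *C, int vlen) {
        for (int i = 0; i < vlen; i++)
            C[i] = (A[i] + B[i]) / 2;
    }

    /* Stripmining: process MVL elements per strip; the last strip is shorter.
     * For n = 527: 8 strips with VLEN = 64, then 1 strip with VLEN = 15. */
    void add_half_stripmined(const int *A, const int *B, int *C, int n) {
        for (int i = 0; i < n; i += MVL) {
            int vlen = (n - i < MVL) ? (n - i) : MVL;
            vec_add_half(A + i, B + i, C + i, vlen);
        }
    }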

Gather/Scatter Operations. Want to vectorize loops with indirect accesses: for (i = 0; i < N; i++) A[i] = B[i] + C[D[i]]. Indexed load instruction (gather): LV vD, rD # load indices in D vector; LVI vC, rC, vD # load indirect from rC base; LV vB, rB # load B vector; ADDV.D vA, vB, vC # do add; SV vA, rA # store result. 30

Gather/Scatter Operations. Gather/scatter operations are often implemented in hardware to handle sparse vectors (matrices). Vector loads and stores use an index vector, which is added to the base register to generate the addresses. Example: index vector = [0, 2, 6, 7]; data vector (to store) = [3.14, 6.5, 71.2, 2.71]; stored vector (in memory): Base+0 = 3.14, Base+1 = X, Base+2 = 6.5, Base+3 = X, Base+4 = X, Base+5 = X, Base+6 = 71.2, Base+7 = 2.71. 31

Conditional Operations in a Loop. What if some operations should not be executed on a vector (based on a dynamically-determined condition)? loop: for (i = 0; i < N; i++) if (a[i] != 0) then b[i] = a[i] * b[i]. Idea: masked operations. The VMASK register is a bit mask determining which data elements should not be acted upon. VLD V0 = A; VLD V1 = B; VMASK = (V0 != 0); VMUL V1 = V0 * V1; VST B = V1. This is predicated execution: execution is predicated on the mask bit. 32

Another Example with Masking. for (i = 0; i < 64; ++i) if (a[i] >= b[i]) c[i] = a[i]; else c[i] = b[i]. Example values: A = [1, 2, 3, 4, -5, 0, 6, -7], B = [2, 2, 2, 10, -4, -3, 5, -8], VMASK = (A >= B) = [0, 1, 1, 0, 0, 1, 1, 1]. Steps to execute the loop in SIMD code: 1. Compare A, B to get VMASK. 2. Masked store of A into C. 3. Complement VMASK. 4. Masked store of B into C. 33

Masked Vector Instructions. Simple implementation: execute all N operations, turn off result writeback according to the mask. Density-time implementation: scan the mask vector and only execute elements with non-zero masks. [Figure: on the left, every element pair A[i], B[i] flows through the pipeline and the mask bit M[i] gates the write enable at the write data port; on the right, a density-time implementation skips elements whose mask bit is 0.] Which one is better? Tradeoffs? Slide credit: Krste Asanovic. 34

Some Issues. Stride and banking: as long as the stride and the number of banks are relatively prime to each other and there are enough banks to cover the bank access latency, we can sustain 1 element/cycle throughput. Storage of a matrix: row major: consecutive elements in a row are laid out consecutively in memory; column major: consecutive elements in a column are laid out consecutively in memory. You need to change the stride when accessing a row versus a column (see the sketch below). 35
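A minimal C sketch (illustrative, not from the slides) of why the stride changes: in a row-major N x N matrix, walking a row is stride 1, while walking a column is stride N.

    #define N 64

    /* Row-major layout: element (r, c) lives at M[r*N + c]. */
    double sum_row(const double *M, int r) {   /* consecutive: stride 1 */
        double s = 0.0;
        for (int c = 0; c < N; c++) s += M[r * N + c];
        return s;
    }

    double sum_col(const double *M, int c) {   /* strided: set VSTR = N */
        double s = 0.0;
        for (int r = 0; r < N; r++) s += M[r * N + c];
        return s;
    }

Note that with N = 64 and a power-of-two number of banks, the column stride and the bank count are not relatively prime, so all column accesses hit the same few banks; this is exactly the bank-conflict problem the next slide addresses.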


Minimizing Bank Conflicts. More banks. Better data layout to match the access pattern: is this always possible? Better mapping of addresses to banks: e.g., randomized mapping. Rau, "Pseudo-randomly interleaved memory," ISCA 1991. 37

Array vs. Vector Processors, Revisited. The array vs. vector processor distinction is a purist's distinction. Most modern SIMD processors are a combination of both: they exploit data parallelism in both time and space. GPUs are a prime example; we will cover them in a bit more detail. 38

Remember: Array vs. Vector Processors. [Same figure as slide 7: instruction stream LD VR ← A[3:0]; ADD VR ← VR, 1; MUL VR ← VR, 2; ST A[3:0] ← VR; the array processor executes the same op at the same time across different spaces, while the vector processor executes the same op in the same space across consecutive time steps.] 39

Vector Instruction Execution. VADD A, B → C. [Figure: execution using one pipelined functional unit completes one element per cycle (C[0], C[1], C[2], ...); execution using four pipelined functional units completes four elements per cycle (C[0..3], C[4..7], C[8..11], ...).] Slide credit: Krste Asanovic. 40

Vector Unit Structure. [Figure: a vector unit partitioned into four lanes; each lane contains a pipelined functional unit and a partition of the vector register file holding elements 0, 4, 8, ... (lane 0), elements 1, 5, 9, ... (lane 1), elements 2, 6, 10, ... (lane 2), and elements 3, 7, 11, ... (lane 3), all connected to the memory subsystem.] Slide credit: Krste Asanovic. 41

Vector Instruction Level Parallelism. Can overlap execution of multiple vector instructions. Example machine has 32 elements per vector register and 8 lanes. Completes 24 operations/cycle while issuing 1 vector instruction/cycle. [Figure: the load, multiply, and add units each work on a different vector instruction at the same time, 8 lanes each.] Slide credit: Krste Asanovic. 42

Automatic Code Vectorization. for (i = 0; i < N; i++) C[i] = A[i] + B[i]; [Figure: scalar sequential code executes load, load, add, store per iteration; vectorized code issues one vector load, vector load, vector add, vector store covering all iterations.] Vectorization is a compile-time reordering of operation sequencing; it requires extensive loop dependence analysis. Slide credit: Krste Asanovic. 43

Vector/SIMD Processing Summary. Vector/SIMD machines are good at exploiting regular data-level parallelism: the same operation is performed on many data elements; improves performance and simplifies design (no intra-vector dependencies). Performance improvement is limited by the vectorizability of the code: scalar operations limit vector machine performance; remember Amdahl's Law; the CRAY-1 was the fastest SCALAR machine of its time! Many existing ISAs include (vector-like) SIMD operations: Intel MMX/SSE/AVX, PowerPC AltiVec, ARM Advanced SIMD. 44

SIMD Operations in Modern ISAs

SIMD ISA Extensions. Single Instruction Multiple Data (SIMD) extension instructions: a single instruction acts on multiple pieces of data at once. Common application: graphics. Perform short arithmetic operations (also called packed arithmetic). For example: add four 8-bit numbers; must modify the ALU to eliminate carries between 8-bit values. padd8 $s2, $s0, $s1: with bit positions 31:24, 23:16, 15:8, 7:0, $s0 holds a3 a2 a1 a0, $s1 holds b3 b2 b1 b0, and $s2 receives a3+b3 a2+b2 a1+b1 a0+b0.
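For a modern equivalent of the padd8 example above (a hedged sketch: the slide uses MIPS-style pseudocode, whereas this uses the x86 SSE2 intrinsic _mm_add_epi8, which performs sixteen independent 8-bit additions in one instruction):

    #include <emmintrin.h>  /* SSE2 intrinsics */
    #include <stdio.h>

    int main(void) {
        /* Sixteen 8-bit lanes added by a single instruction; carries do
         * not propagate between lanes, as with padd8 in the slide. */
        __m128i a = _mm_set1_epi8(10);
        __m128i b = _mm_set1_epi8(20);
        __m128i c = _mm_add_epi8(a, b);

        unsigned char out[16];
        _mm_storeu_si128((__m128i *)out, c);
        printf("%d\n", out[0]);  /* prints 30 */
        return 0;
    }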

Intel Pentium MMX Operations. Idea: one instruction operates on multiple data elements simultaneously: a la array processing (yet much more limited). Designed with multimedia (graphics) operations in mind. No VLEN register; the opcode determines the data type: 8 8-bit bytes, 4 16-bit words, 2 32-bit doublewords, 1 64-bit quadword. Stride is always equal to 1. Peleg and Weiser, "MMX Technology Extension to the Intel Architecture," IEEE Micro, 1996. 47

MMX Example: Image Overlaying (I). Goal: overlay the human in image 1 on top of the background in image 2. Peleg and Weiser, "MMX Technology Extension to the Intel Architecture," IEEE Micro, 1996. 48

MMX Example: Image Overlaying (II). Peleg and Weiser, "MMX Technology Extension to the Intel Architecture," IEEE Micro, 1996. 49

We did not cover the following slides in lecture. These are for your preparation for the next lecture.

GPUs (Graphics Processing Units)

GPUs are SIMD Engines Underneath. The instruction pipeline operates like a SIMD pipeline (e.g., an array processor). However, the programming is done using threads, NOT SIMD instructions. To understand this, let's go back to our parallelizable code example. But, before that, let's distinguish between the Programming Model (Software) and the Execution Model (Hardware). 52

Programming Model vs. Hardware Execution Model. Programming Model refers to how the programmer expresses the code: e.g., sequential (von Neumann), data parallel (SIMD), dataflow, multi-threaded (MIMD, SPMD), ... Execution Model refers to how the hardware executes the code underneath: e.g., out-of-order execution, vector processor, array processor, dataflow processor, multiprocessor, multithreaded processor, ... The execution model can be very different from the programming model: e.g., the von Neumann model implemented by an OoO processor; e.g., the SPMD model implemented by a SIMD processor (a GPU). 53

How Can You Exploit Parallelism Here? for (i = 0; i < N; i++) C[i] = A[i] + B[i]; [Figure: scalar sequential code; each iteration performs load, load, add, store.] Let's examine three programming options to exploit the instruction-level parallelism present in this sequential code: 1. Sequential (SISD), 2. Data-Parallel (SIMD), 3. Multithreaded (MIMD/SPMD). 54

Prog. Model 1: Sequential (SISD). for (i = 0; i < N; i++) C[i] = A[i] + B[i]; Can be executed on a: Pipelined processor. Out-of-order execution processor: independent instructions executed when ready; different iterations are present in the instruction window and can execute in parallel in multiple functional units; in other words, the loop is dynamically unrolled by the hardware. Superscalar or VLIW processor: can fetch and execute multiple instructions per cycle. 55

Prog. Model 2: Data Parallel (SIMD). for (i = 0; i < N; i++) C[i] = A[i] + B[i]; [Figure: the scalar iterations (load, load, add, store) collapse into one vector instruction sequence: VLD A → V1; VLD B → V2; VADD V1 + V2 → V3; VST V3 → C.] Realization: each iteration is independent. Idea: the programmer or compiler generates a SIMD instruction to execute the same instruction from all iterations across different data. Best executed by a SIMD processor (vector, array). 56

Prog. Model 3: Multithreaded. for (i = 0; i < N; i++) C[i] = A[i] + B[i]; Realization: each iteration is independent. Idea: the programmer or compiler generates a thread to execute each iteration. Each thread does the same thing (but on different data). Can be executed on a MIMD machine. 57

Prog. Model 3: Multithreaded. for (i = 0; i < N; i++) C[i] = A[i] + B[i]; Realization: each iteration is independent. This particular model is also called SPMD: Single Program Multiple Data. Can be executed on a MIMD machine. Can be executed on a SIMD machine. Can be executed on a SIMT machine: Single Instruction Multiple Thread. 58
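A minimal SPMD sketch in C with POSIX threads (illustrative, not from the slides): every thread runs the same function, each on a different slice of the data.

    #include <pthread.h>

    #define N 1024
    #define NUM_THREADS 4   /* assumes N is divisible by NUM_THREADS */

    static int A[N], B[N], C[N];

    /* SPMD: every thread executes the same program on different data. */
    static void *add_slice(void *arg) {
        long t = (long)arg;
        long chunk = N / NUM_THREADS;
        for (long i = t * chunk; i < (t + 1) * chunk; i++)
            C[i] = A[i] + B[i];
        return NULL;
    }

    int main(void) {
        pthread_t tid[NUM_THREADS];
        for (long t = 0; t < NUM_THREADS; t++)
            pthread_create(&tid[t], NULL, add_slice, (void *)t);
        for (long t = 0; t < NUM_THREADS; t++)
            pthread_join(tid[t], NULL);
        return 0;
    }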

A GPU is a SIMD (SIMT) Machine. Except it is not programmed using SIMD instructions. It is programmed using threads (SPMD programming model): each thread executes the same code but operates on a different piece of data; each thread has its own context (i.e., can be treated/restarted/executed independently). A set of threads executing the same instruction are dynamically grouped into a warp (wavefront) by the hardware. A warp is essentially a SIMD operation formed by hardware! 59

SPMD on SIMT Machine. for (i = 0; i < N; i++) C[i] = A[i] + B[i]; [Figure: the loads, add, and store of all iterations execute as Warp 0 at PC X, X+1, X+2, and X+3.] Warp: a set of threads that execute the same instruction (i.e., at the same PC). Realization: each iteration is independent. Idea: the programmer or compiler generates a thread to execute each iteration; each thread does the same thing (but on different data). This particular model is also called SPMD: Single Program Multiple Data. A GPU executes it using the SIMT model: Single Instruction Multiple Thread. 60

Graphics Processing Units: SIMD not Exposed to Programmer (SIMT)

SIMD vs. SIMT Execution Model. SIMD: a single sequential instruction stream of SIMD instructions → each instruction specifies multiple data inputs: [VLD, VLD, VADD, VST], VLEN. SIMT: multiple instruction streams of scalar instructions → threads grouped dynamically into warps: [LD, LD, ADD, ST], NumThreads. Two major SIMT advantages: can treat each thread separately → i.e., can execute each thread independently (on any type of scalar pipeline) → MIMD processing; can group threads into warps flexibly → i.e., can group threads that are supposed to truly execute the same instruction → dynamically obtain and maximize the benefits of SIMD processing. 62

Multithreading of Warps. for (i = 0; i < N; i++) C[i] = A[i] + B[i]; Assume a warp consists of 32 threads. If you have 32K iterations and 1 iteration/thread → 1K warps. Warps can be interleaved on the same pipeline → fine-grained multithreading of warps. [Figure: Warp 10 at PC X and Warp 20 at PC X+2 are interleaved on the same pipeline; iterations such as 20*32 + 1 and 20*32 + 2 execute as threads of Warp 20.] 63

Warps and Warp-Level FGMT. Warp: a set of threads that execute the same instruction (on different data elements) → SIMT (Nvidia-speak). All threads run the same code. (Warp: the threads that run lengthwise in a woven fabric.) [Figure: a thread warp of scalar threads W, X, Y, Z sharing a common PC; thread warps 3, 7, and 8 queued for a SIMD pipeline.] 64

High-Level View of a GPU 65

Latency Hiding via Warp-Level FGMT. Warp: a set of threads that execute the same instruction (on different data elements). Fine-grained multithreading: one instruction per thread in the pipeline at a time (no interlocking); interleave warp execution to hide latencies. Register values of all threads stay in the register file. FGMT enables long-latency tolerance: millions of pixels. [Figure: warps available for scheduling feed an I-Fetch/Decode stage and a SIMD pipeline of register files and ALUs; on a D-cache miss, a warp joins the group accessing the memory hierarchy while other warps keep executing and write back on a hit.] Slide credit: Tor Aamodt. 66

Warp Execution (Recall the Slide). A 32-thread warp executing ADD A[tid], B[tid] → C[tid]. [Figure: execution using one pipelined functional unit (one element per cycle) vs. execution using four pipelined functional units (four elements per cycle), as in slide 40.] Slide credit: Krste Asanovic. 67

SIMD Execution Unit Structure. [Figure: four lanes; each lane has a functional unit and the registers for a quarter of the threads: thread IDs 0, 4, 8, ... in lane 0; 1, 5, 9, ... in lane 1; 2, 6, 10, ... in lane 2; 3, 7, 11, ... in lane 3; all connected to the memory subsystem.] Slide credit: Krste Asanovic. 68

Warp Instruction Level Parallelism. Can overlap execution of multiple instructions. Example machine has 32 threads per warp and 8 lanes. Completes 24 operations/cycle while issuing 1 warp/cycle. [Figure: the load, multiply, and add units are each occupied by different warps W0 through W5 over time.] Slide credit: Krste Asanovic. 69

SIMT Memory Access. The same instruction in different threads uses the thread id to index and access different data elements. Let's assume N = 16 and 4 threads per warp → 4 warps. [Figure: threads 0-15 map one-to-one onto data elements 0-15; warp 0 adds elements 0-3, warp 1 elements 4-7, warp 2 elements 8-11, warp 3 elements 12-15.] Slide credit: Hyesoon Kim

Sample GPU SIMT Code (Simplified). CPU code: for (ii = 0; ii < 100000; ++ii) { C[ii] = A[ii] + B[ii]; } CUDA code: // there are 100000 threads __global__ void KernelFunction(...) { int tid = blockDim.x * blockIdx.x + threadIdx.x; int varA = aa[tid]; int varB = bb[tid]; C[tid] = varA + varB; } Slide credit: Hyesoon Kim

Sample GPU Program (Less Simplified). [Figure: full CUDA host and kernel code; not reproduced in this transcription.] Slide credit: Hyesoon Kim. 72
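Since the slide itself is an image, here is a minimal complete CUDA program in the same spirit (a sketch, not the slide's exact code; names such as vecAdd are illustrative):

    #include <cuda_runtime.h>
    #include <stdio.h>
    #include <stdlib.h>

    __global__ void vecAdd(const int *A, const int *B, int *C, int n) {
        int tid = blockDim.x * blockIdx.x + threadIdx.x;
        if (tid < n)                 /* guard: grid may overshoot n */
            C[tid] = A[tid] + B[tid];
    }

    int main(void) {
        const int n = 100000;
        size_t bytes = n * sizeof(int);
        int *hA = (int *)malloc(bytes), *hB = (int *)malloc(bytes),
            *hC = (int *)malloc(bytes);
        for (int i = 0; i < n; i++) { hA[i] = i; hB[i] = 2 * i; }

        int *dA, *dB, *dC;
        cudaMalloc(&dA, bytes); cudaMalloc(&dB, bytes); cudaMalloc(&dC, bytes);
        cudaMemcpy(dA, hA, bytes, cudaMemcpyHostToDevice);
        cudaMemcpy(dB, hB, bytes, cudaMemcpyHostToDevice);

        int threads = 256;                        /* threads per block */
        int blocks = (n + threads - 1) / threads; /* enough blocks for n */
        vecAdd<<<blocks, threads>>>(dA, dB, dC, n);

        cudaMemcpy(hC, dC, bytes, cudaMemcpyDeviceToHost);
        printf("C[10] = %d\n", hC[10]);           /* expect 30 */
        cudaFree(dA); cudaFree(dB); cudaFree(dC);
        free(hA); free(hB); free(hC);
        return 0;
    }

The hardware groups these threads into 32-thread warps at run time, which is the SIMT execution the following slides describe.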

Warp-based SIMD vs. Traditional SIMD. Traditional SIMD contains a single thread: sequential instruction execution; lock-step operations in a SIMD instruction; the programming model is SIMD (no extra threads) → SW needs to know the vector length; the ISA contains vector/SIMD instructions. Warp-based SIMD consists of multiple scalar threads executing in a SIMD manner (i.e., the same instruction executed by all threads): does not have to be lock step; each thread can be treated individually (i.e., placed in a different warp) → the programming model is not SIMD; SW does not need to know the vector length; enables multithreading and flexible dynamic grouping of threads; the ISA is scalar → SIMD operations can be formed dynamically. Essentially, it is an SPMD programming model implemented on SIMD hardware. 73

SPMD. Single procedure/program, multiple data: this is a programming model rather than a computer organization. Each processing element executes the same procedure, except on different data elements: procedures can synchronize at certain points in the program, e.g., barriers. Essentially, multiple instruction streams execute the same program: each program/procedure 1) works on different data, 2) can execute a different control-flow path at run-time. Many scientific applications are programmed this way and run on MIMD hardware (multiprocessors). Modern GPUs are programmed in a similar way on SIMD hardware. 74

SIMD vs. SIMT Execution Model. SIMD: a single sequential instruction stream of SIMD instructions → each instruction specifies multiple data inputs: [VLD, VLD, VADD, VST], VLEN. SIMT: multiple instruction streams of scalar instructions → threads grouped dynamically into warps: [LD, LD, ADD, ST], NumThreads. Two major SIMT advantages: can treat each thread separately → i.e., can execute each thread independently on any type of scalar pipeline → MIMD processing; can group threads into warps flexibly → i.e., can group threads that are supposed to truly execute the same instruction → dynamically obtain and maximize the benefits of SIMD processing. 75

Threads Can Take Different Paths in Warp-based SIMD. Each thread can have conditional control flow instructions. Threads can execute different control flow paths. [Figure: a control-flow graph with blocks A through G; threads 1-4 of a warp share a common PC at A but diverge across the paths through B, C, D, E, F, and G.] Slide credit: Tor Aamodt. 76

Control Flow Problem in GPUs/SIMT. A GPU uses a SIMD pipeline to save area on control logic: it groups scalar threads into warps. Branch divergence occurs when threads inside warps branch to different execution paths. [Figure: after a branch, some lanes follow Path A and the others Path B.] This is the same as conditional/predicated/masked execution. Recall the Vector Mask and Masked Vector Operations? Slide credit: Tor Aamodt. 77

Remember: Each Thread Is Independent. Two major SIMT advantages: can treat each thread separately → i.e., can execute each thread independently on any type of scalar pipeline → MIMD processing; can group threads into warps flexibly → i.e., can group threads that are supposed to truly execute the same instruction → dynamically obtain and maximize the benefits of SIMD processing. If we have many threads, we can find individual threads that are at the same PC and group them together into a single warp dynamically. This reduces divergence → improves SIMD utilization. SIMD utilization: the fraction of SIMD lanes executing a useful operation (i.e., executing an active thread). 78

Dynamic Warp Formation/Merging. Idea: dynamically merge threads executing the same instruction (after branch divergence). Form new warps from warps that are waiting: enough threads branching to each path enables the creation of full new warps. [Figure: warps X and Y combine into a new warp Z.] 79

Dynamic Warp Formation/Merging. Idea: dynamically merge threads executing the same instruction (after branch divergence). [Figure: after a branch, threads from different warps that took Path A are packed into a new full warp.] Fung et al., "Dynamic Warp Formation and Scheduling for Efficient GPU Control Flow," MICRO 2007. 80

Dynamic Warp Formation Example. [Figure: a control-flow graph A through G with per-block active masks for warps x and y: A: x/1111, y/1111; B: x/1110, y/0011; C: x/1000, y/0010; D: x/0110, y/0001; E: x/1110, y/0011; F: x/0001, y/1100; G: x/1111, y/1111. In the baseline, warps x and y each execute blocks A through G separately; with dynamic warp formation, a new warp is created from the scalar threads of both warps x and y executing at block D, shortening total execution time.] Slide credit: Tor Aamodt. 81

Hardware Constraints Limit Flexibility of Warp Grouping. [Figure: the SIMD execution unit structure from slide 68: four lanes, each with a functional unit and the registers for thread IDs 0, 4, 8, ... / 1, 5, 9, ... / 2, 6, 10, ... / 3, 7, 11, ..., connected to the memory subsystem.] Can you move any thread flexibly to any lane? Slide credit: Krste Asanovic. 82

An Example GPU

NVIDIA GeForce GTX 285. NVIDIA-speak: 240 stream processors; SIMT execution. Generic speak: 30 cores; 8 SIMD functional units per core. Slide credit: Kayvon Fatahalian. 84

NVIDIA GeForce GTX 285 "core". 64 KB of storage for thread contexts (registers). [Figure legend: SIMD functional unit, control shared across 8 units; multiply-add; multiply; instruction stream decode; execution context storage.] Slide credit: Kayvon Fatahalian. 85

NVIDIA GeForce GTX 285 "core". 64 KB of storage for thread contexts (registers). Groups of 32 threads share an instruction stream (each group is a warp). Up to 32 warps are simultaneously interleaved. Up to 1024 thread contexts can be stored. Slide credit: Kayvon Fatahalian. 86

NVIDIA GeForce GTX 285. [Figure: the full chip: 30 cores plus texture (Tex) units.] 30 cores on the GTX 285: 30,720 threads. Slide credit: Kayvon Fatahalian. 87

Computer Architecture Lecture 8: SIMD Processors and GPUs. Prof. Onur Mutlu, ETH Zürich, Fall 2017. 18 October 2017