Computer Architecture Lecture 8: SIMD Processors and GPUs. Prof. Onur Mutlu ETH Zürich Fall October 2017

Size: px
Start display at page:

Download "Computer Architecture Lecture 8: SIMD Processors and GPUs. Prof. Onur Mutlu ETH Zürich Fall October 2017"

Transcription

1 Computer Architecture Lecture 8: SIMD Processors ad GPUs Prof. Our Mutlu ETH Zürich Fall October 2017

2 Ageda for Today & Next Few Lectures SIMD Processors GPUs Itroductio to GPU Programmig Digitaltechik (Sprig 2017) YouTube videos Lecture 19: Begiig of SIMD Lecture 20: SIMD Processors Lecture 21: GPUs 2

3 SIMD Processig: Exploitig Regular (Data) Parallelism

4 Fly s Taxoomy of Computers Mike Fly, Very High-Speed Computig Systems, Proc. of IEEE, 1966 SISD: Sigle istructio operates o sigle data elemet SIMD: Sigle istructio operates o multiple data elemets Array processor Vector processor MISD: Multiple istructios operate o sigle data elemet Closest form: systolic array processor, streamig processor MIMD: Multiple istructios operate o multiple data elemets (multiple istructio streams) Multiprocessor Multithreaded processor 4

5 Data Parallelism Cocurrecy arises from performig the same operatios o differet pieces of data Sigle istructio multiple data (SIMD) E.g., dot product of two vectors Cotrast with data flow Cocurrecy arises from executig differet operatios i parallel (i a data drive maer) Cotrast with thread ( cotrol ) parallelism Cocurrecy arises from executig differet threads of cotrol i parallel SIMD exploits istructio-level parallelism Multiple istructios (more appropriately, operatios) are cocurret: istructios happe to be the same 5

6 SIMD Processig Sigle istructio operates o multiple data elemets I time or i space Multiple processig elemets Time-space duality Array processor: Istructio operates o multiple data elemets at the same time usig differet spaces Vector processor: Istructio operates o multiple data elemets i cosecutive time steps usig the same space 6

7 Array vs. Vector Processors ARRAY PROCESSOR VECTOR PROCESSOR Istructio Stream LD VR ß A[3:0] ADD VR ß VR, 1 MUL VR ß VR, 2 ST A[3:0] ß VR Time Same same time Differet time LD0 LD1 LD2 LD3 LD0 AD0 AD1 AD2 AD3 LD1 AD0 MU0 MU1 MU2 MU3 LD2 AD1 MU0 ST0 ST1 ST2 ST3 LD3 AD2 MU1 ST0 Differet same space AD3 MU2 ST1 MU3 ST2 Same space ST3 Space Space 7

8 SIMD Array Processig vs. VLIW VLIW: Multiple idepedet operatios packed together by the compiler 8

9 SIMD Array Processig vs. VLIW Array processor: Sigle operatio o multiple (differet) data elemets 9

10 Vector Processors A vector is a oe-dimesioal array of umbers May scietific/commercial programs use vectors for (i = 0; i<=49; i++) C[i] = (A[i] + B[i]) / 2 A vector processor is oe whose istructios operate o vectors rather tha scalar (sigle data) values Basic reuiremets Need to load/store vectors à vector registers (cotai vectors) Need to operate o vectors of differet legths à vector legth register (VLEN) Elemets of a vector might be stored apart from each other i memory à vector stride register (VSTR) Stride: distace betwee two elemets of a vector 10

11 Vector Processors (II) A vector istructio performs a operatio o each elemet i cosecutive cycles Vector fuctioal uits are pipelied Each pipelie stage operates o a differet data elemet Vector istructios allow deeper pipelies No itra-vector depedecies à o hardware iterlockig withi a vector No cotrol flow withi a vector Kow stride allows prefetchig of vectors ito registers/ cache/memory 11

12 Vector Processor Advatages + No depedecies withi a vector Pipeliig. parallelizatio work really well Ca have very deep pipelies, o depedecies! + Each istructio geerates a lot of work Reduces istructio fetch badwidth reuiremets + Highly regular memory access patter + No eed to explicitly code loops Fewer braches i the istructio seuece 12

13 Vector Processor Disadvatages -- Works (oly) if parallelism is regular (data/simd parallelism) ++ Vector operatios -- Very iefficiet if parallelism is irregular -- How about searchig for a key i a liked list? Fisher, Very Log Istructio Word architectures ad the ELI-512, ISCA

14 Vector Processor Limitatios -- Memory (badwidth) ca easily become a bottleeck, especially if 1. compute/memory operatio balace is ot maitaied 2. data is ot mapped appropriately to memory baks 14

15 Vector Processig i More Depth

16 Vector Registers Each vector data register holds N M-bit values Vector cotrol registers: VLEN, VSTR, VMASK Maximum VLEN ca be N Maximum umber of elemets stored i a vector register Vector Mask Register (VMASK) Idicates which elemets of vector to operate o Set by vector test istructios e.g., VMASK[i] = (V k [i] == 0) V0,0 V0,1 M-bit wide V1,0 V1,1 M-bit wide V0,N-1 V1,N-1 16

17 Vector Fuctioal Uits Use deep pipelie to execute elemet operatios à fast clock cycle V 1 V 2 V 3 Cotrol of deep pipelie is simple because elemets i vector are idepedet Six stage multiply pipelie V1 * V2 à V3 Slide credit: Krste Asaovic 17

18 Vector Machie Orgaizatio (CRAY-1) CRAY-1 Russell, The CRAY-1 computer system, CACM Scalar ad vector modes 8 64-elemet vector registers 64 bits per elemet 16 memory baks 8 64-bit scalar registers 8 24-bit address registers 18

19 Loadig/Storig Vectors from/to Memory Reuires loadig/storig multiple elemets Elemets separated from each other by a costat distace (stride) Assume stride = 1 for ow Elemets ca be loaded i cosecutive cycles if we ca start the load of oe elemet per cycle Ca sustai a throughput of oe elemet per cycle Questio: How do we achieve this with a memory that takes more tha 1 cycle to access? Aswer: Bak the memory; iterleave the elemets across baks 19

20 Memory Bakig Memory is divided ito baks that ca be accessed idepedetly; baks share address ad data buses (to miimize pi cost) Ca start ad complete oe bak access per cycle Ca sustai N parallel accesses if all N go to differet baks Bak 0 Bak 1 Bak 2 Bak 15 MDR MAR MDR MAR MDR MAR MDR MAR Data bus Address bus Picture credit: Derek Chiou CPU 20

21 Vector Memory System Next address = Previous address + Stride If stride = 1 & cosecutive elemets iterleaved across baks & umber of baks >= bak latecy, the ca sustai 1 elemet/cycle throughput Vector Registers Base Stride Address Geerator A B C D E F Memory Baks Picture credit: Krste Asaovic 21

22 Scalar Code Example For I = 0 to 49 C[i] = (A[i] + B[i]) / 2 Scalar code (istructio ad its latecy) MOVI R0 = 50 1 MOVA R1 = A 1 MOVA R2 = B 1 MOVA R3 = C dyamic istructios X: LD R4 = MEM[R1++] 11 ;autoicremet addressig LD R5 = MEM[R2++] 11 ADD R6 = R4 + R5 4 SHFR R7 = R6 >> 1 1 ST MEM[R3++] = R7 11 DECBNZ R0, X 2 ;decremet ad brach if NZ 22

23 Scalar Code Executio Time (I Order) Scalar executio time o a i-order processor with 1 bak First two loads i the loop caot be pipelied: 2*11 cycles *40 = 2004 cycles Scalar executio time o a i-order processor with 16 baks (word-iterleaved: cosecutive words are stored i cosecutive baks) First two loads i the loop ca be pipelied *30 = 1504 cycles Why 16 baks? 11 cycle memory access latecy Havig 16 (>11) baks esures there are eough baks to overlap eough memory operatios to cover memory latecy 23

24 Vectorizable Loops A loop is vectorizable if each iteratio is idepedet of ay other For I = 0 to 49 C[i] = (A[i] + B[i]) / 2 Vectorized loop (each istructio ad its latecy): MOVI VLEN = 50 1 MOVI VSTR = 1 1 VLD V0 = A 11 + VLEN - 1 VLD V1 = B 11 + VLEN 1 VADD V2 = V0 + V1 4 + VLEN - 1 VSHFR V3 = V2 >> VLEN - 1 VST C = V VLEN 1 7 dyamic istructios 24

25 Basic Vector Code Performace Assume o chaiig (o vector data forwardig) i.e., output of a vector fuctioal uit caot be used as the direct iput of aother The etire vector register eeds to be ready before ay elemet of it ca be used as part of aother operatio Oe memory port (oe address geerator) 16 memory baks (word-iterleaved) V0 = A[0..49] V1 = B[0..49] ADD SHIFT STORE 285 cycles 25

26 Vector Chaiig Vector chaiig: Data forwardig from oe vector fuctioal uit to aother LV v1 MULV v3,v1,v2 ADDV v5, v3, v4 V1 V 2 V 3 V 4 V 5 Chai Chai Load Uit Memory Mult. Add Slide credit: Krste Asaovic 26

27 Vector Code Performace - Chaiig Vector chaiig: Data forwardig from oe vector fuctioal uit to aother These two VLDs caot be pipelied. WHY? Strict assumptio: Each memory bak has a sigle port (memory badwidth bottleeck) cycles VLD ad VST caot be pipelied. WHY? 27

28 Vector Code Performace Multiple Memory Ports Chaiig ad 2 load ports, 1 store port i each bak cycles 19X perf. improvemet!

29 Questios (I) What if # data elemets > # elemets i a vector register? Idea: Break loops so that each iteratio operates o # elemets i a vector register E.g., 527 data elemets, 64-elemet VREGs 8 iteratios where VLEN = 64 1 iteratio where VLEN = 15 (eed to chage value of VLEN) Called vector stripmiig What if vector data is ot stored i a strided fashio i memory? (irregular memory access to a vector) Idea: Use idirectio to combie/pack elemets ito vector registers Called scatter/gather operatios 29

30 Gather/Scatter Operatios Wat to vectorize loops with idirect accesses: for (i=0; i<n; i++) A[i] = B[i] + C[D[i]] Idexed load istructio (Gather) LV vd, rd # Load idices i D vector LVI vc, rc, vd # Load idirect from rc base LV vb, rb # Load B vector ADDV.D va,vb,vc # Do add SV va, ra # Store result 30

31 Gather/Scatter Operatios Gather/scatter operatios ofte implemeted i hardware to hadle sparse vectors (matrices) Vector loads ad stores use a idex vector which is added to the base register to geerate the addresses Idex Vector Data Vector (to Store) Stored Vector (i Memory) Base Base+1 X Base Base+3 X Base+4 X Base+5 X Base Base

32 Coditioal Operatios i a Loop What if some operatios should ot be executed o a vector (based o a dyamically-determied coditio)? loop: for (i=0; i<n; i++) if (a[i]!= 0) the b[i]=a[i]*b[i] Idea: Masked operatios VMASK register is a bit mask determiig which data elemet should ot be acted upo VLD V0 = A VLD V1 = B VMASK = (V0!= 0) VMUL V1 = V0 * V1 VST B = V1 This is predicated executio. Executio is predicated o mask bit. 32

33 Aother Example with Maskig for (i = 0; i < 64; ++i) if (a[i] >= b[i]) c[i] = a[i] else c[i] = b[i] A B VMASK Steps to execute the loop i SIMD code 1. Compare A, B to get VMASK 2. Masked store of A ito C 3. Complemet VMASK 4. Masked store of B ito C 33

34 Masked Vector Istructios Simple Implemetatio execute all N operatios, tur off result writeback accordig to mask Desity-Time Implemetatio sca mask vector ad oly execute elemets with o-zero masks M[7]=1 A[7] B[7] M[7]=1 M[6]=0 M[5]=1 A[6] A[5] B[6] B[5] M[6]=0 M[5]=1 A[7] B[7] M[4]=1 M[3]=0 A[4] A[3] B[4] B[3] M[4]=1 M[3]=0 C[5] M[2]=0 C[4] M[2]=0 M[1]=1 C[2] C[1] M[1]=1 M[0]=0 C[1] Write data port M[0]=0 Write Eable C[0] Write data port Which oe is better? Tradeoffs? Slide credit: Krste Asaovic 34

35 Some Issues Stride ad bakig As log as they are relatively prime to each other ad there are eough baks to cover bak access latecy, we ca sustai 1 elemet/cycle throughput Storage of a matrix Row major: Cosecutive elemets i a row are laid out cosecutively i memory Colum major: Cosecutive elemets i a colum are laid out cosecutively i memory You eed to chage the stride whe accessig a row versus colum 35

36 36

37 Miimizig Bak Coflicts More baks Better data layout to match the access patter Is this always possible? Better mappig of address to bak E.g., radomized mappig Rau, Pseudo-radomly iterleaved memory, ISCA

38 Array vs. Vector Processors, Revisited Array vs. vector processor distictio is a purist s distictio Most moder SIMD processors are a combiatio of both They exploit data parallelism i both time ad space GPUs are a prime example we will cover i a bit more detail 38

39 Remember: Array vs. Vector Processors ARRAY PROCESSOR VECTOR PROCESSOR Istructio Stream LD VR ß A[3:0] ADD VR ß VR, 1 MUL VR ß VR, 2 ST A[3:0] ß VR Time Same same time Differet time LD0 LD1 LD2 LD3 LD0 AD0 AD1 AD2 AD3 LD1 AD0 MU0 MU1 MU2 MU3 LD2 AD1 MU0 ST0 ST1 ST2 ST3 LD3 AD2 MU1 ST0 Differet same space AD3 MU2 ST1 MU3 ST2 Same space ST3 Space Space 39

40 Vector Istructio Executio VADD A,B à C Executio usig oe pipelied fuctioal uit Executio usig four pipelied fuctioal uits A[6] B[6] A[24] B[24] A[25] B[25] A[26] B[26] A[27] B[27] A[5] B[5] A[20] B[20] A[21] B[21] A[22] B[22] A[23] B[23] A[4] B[4] A[16] B[16] A[17] B[17] A[18] B[18] A[19] B[19] A[3] B[3] A[12] B[12] A[13] B[13] A[14] B[14] A[15] B[15] C[2] C[8] C[9] C[10] C[11] C[1] C[4] C[5] C[6] C[7] C[0] C[0] C[1] C[2] C[3] Slide credit: Krste Asaovic 40

41 Vector Uit Structure Fuctioal Uit Partitioed Vector Registers Elemets 0, 4, 8, Elemets 1, 5, 9, Elemets 2, 6, 10, Elemets 3, 7, 11, Lae Memory Subsystem Slide credit: Krste Asaovic 41

42 Vector Istructio Level Parallelism Ca overlap executio of multiple vector istructios Example machie has 32 elemets per vector register ad 8 laes Completes 24 operatios/cycle while issuig 1 vector istructio/cycle Load Uit Multiply Uit Add Uit load mul time add load mul add Istructio issue Slide credit: Krste Asaovic 42

43 Automatic Code Vectorizatio for (i=0; i < N; i++) C[i] = A[i] + B[i]; Scalar Seuetial Code Vectorized Code load load load Iter. 1 add load Time add load add load store store store Iter. 2 load load Iter. 1 Iter. 2 Vector Istructio add store Vectorizatio is a compile-time reorderig of operatio seuecig reuires extesive loop depedece aalysis Slide credit: Krste Asaovic 43

44 Vector/SIMD Processig Summary Vector/SIMD machies are good at exploitig regular datalevel parallelism Same operatio performed o may data elemets Improve performace, simplify desig (o itra-vector depedecies) Performace improvemet limited by vectorizability of code Scalar operatios limit vector machie performace Remember Amdahl s Law CRAY-1 was the fastest SCALAR machie at its time! May existig ISAs iclude (vector-like) SIMD operatios Itel MMX/SSE/AVX, PowerPC AltiVec, ARM Advaced SIMD 44

45 SIMD Operatios i Moder ISAs

46 SIMD ISA Extesios Sigle Istruc.o Mul.ple Data (SIMD) extesio istruc.os Sigle istruc.o acts o mul.ple pieces of data at oce Commo applica.o: graphics Perform short arithme.c opera.os (also called packed arithme-c) For example: add four 8-bit umbers Must modify ALU to elimiate carries betwee 8-bit values padd8 $s2, $s0, $s Bit positio a 3 a 2 a 1 a 0 $s0 + b 3 b 2 b 1 b 0 $s1 a 3 + b 3 a 2 + b 2 a 1 + b 1 a 0 + b 0 $s2

47 Itel Petium MMX Operatios Idea: Oe istructio operates o multiple data elemets simultaeously Ala array processig (yet much more limited) Desiged with multimedia (graphics) operatios i mid No VLEN register Opcode determies data type: 8 8-bit bytes 4 16-bit words 2 32-bit doublewords 1 64-bit uadword Stride is always eual to 1. Peleg ad Weiser, MMX Techology Extesio to the Itel Architecture, IEEE Micro,

48 MMX Example: Image Overlayig (I) Goal: Overlay the huma i image 1 o top of the backgroud i image 2 Peleg ad Weiser, MMX Techology Extesio to the Itel Architecture, IEEE Micro,

49 MMX Example: Image Overlayig (II) Peleg ad Weiser, MMX Techology Extesio to the Itel Architecture, IEEE Micro,

50 We did ot cover the followig slides i lecture. These are for your preparatio for the ext lecture.

51 GPUs (Graphics Processig Uits)

52 GPUs are SIMD Egies Udereath The istructio pipelie operates like a SIMD pipelie (e.g., a array processor) However, the programmig is doe usig threads, NOT SIMD istructios To uderstad this, let s go back to our parallelizable code example But, before that, let s distiguish betwee Programmig Model (Software) vs. Executio Model (Hardware) 52

53 Programmig Model vs. Hardware Executio Model Programmig Model refers to how the programmer expresses the code E.g., Seuetial (vo Neuma), Data Parallel (SIMD), Dataflow, Multi-threaded (MIMD, SPMD), Executio Model refers to how the hardware executes the code udereath E.g., Out-of-order executio, Vector processor, Array processor, Dataflow processor, Multiprocessor, Multithreaded processor, Executio Model ca be very differet from the Programmig Model E.g., vo Neuma model implemeted by a OoO processor E.g., SPMD model implemeted by a SIMD processor (a GPU) 53

54 How Ca You Exploit Parallelism Here? for (i=0; i < N; i++) C[i] = A[i] + B[i]; Scalar Seuetial Code Iter. 1 load load add store load Let s examie three programmig optios to exploit istructio-level parallelism preset i this seuetial code: 1. Seuetial (SISD) Iter. 2 add load 2. Data-Parallel (SIMD) store 3. Multithreaded (MIMD/SPMD) 54

55 Prog. Model 1: Seuetial (SISD) for (i=0; i < N; i++) C[i] = A[i] + B[i]; Scalar Seuetial Code Ca be executed o a: Iter. 1 load load Pipelied processor Out-of-order executio processor add Idepedet istructios executed whe ready store load Differet iteratios are preset i the istructio widow ad ca execute i parallel i multiple fuctioal uits Iter. 2 add load I other words, the loop is dyamically urolled by the hardware Superscalar or VLIW processor store Ca fetch ad execute multiple istructios per cycle 55

56 Prog. Model 2: Data Parallel (SIMD) for (i=0; i < N; i++) C[i] = A[i] + B[i]; Scalar Seuetial Code Vector Istructio Vectorized Code load load load VLD A à V1 Iter. 1 load load VLD B à V2 add add VADD V1 + V2 à V3 store store VST V3 à C Iter. 2 load Iter. 1 load Iter. 2 Realizatio: Each iteratio is idepedet add store Idea: Programmer or compiler geerates a SIMD istructio to execute the same istructio from all iteratios across differet data Best executed by a SIMD processor (vector, array) 56

57 Prog. Model 3: Multithreaded for (i=0; i < N; i++) C[i] = A[i] + B[i]; Scalar Seuetial Code load load load Iter. 1 load load Iter. 2 load Iter. 1 add store add store load add store Iter. 2 Realizatio: Each iteratio is idepedet Idea: Programmer or compiler geerates a thread to execute each iteratio. Each thread does the same thig (but o differet data) Ca be executed o a MIMD machie 57

58 Prog. Model 3: Multithreaded for (i=0; i < N; i++) C[i] = A[i] + B[i]; load load load load add store add store Iter. 1 Iter. 2 Realizatio: Each iteratio is idepedet Idea: This Programmer particular or model compiler is also geerates called: a thread to execute each iteratio. Each thread does the same SPMD: thig (but Sigle o Program differet data) Multiple Data Ca Ca be executed be executed o a MIMD o a SIMT SIMD machie machie Sigle Istructio Multiple Thread 58

59 A GPU is a SIMD (SIMT) Machie Except it is ot programmed usig SIMD istructios It is programmed usig threads (SPMD programmig model) Each thread executes the same code but operates a differet piece of data Each thread has its ow cotext (i.e., ca be treated/restarted/ executed idepedetly) A set of threads executig the same istructio are dyamically grouped ito a warp (wavefrot) by the hardware A warp is essetially a SIMD operatio formed by hardware! 59

60 SPMD o SIMT Machie for (i=0; i < N; i++) C[i] = A[i] + B[i]; load load Warp 0 at PC X load load Warp 0 at PC X+1 add add Warp 0 at PC X+2 store store Warp 0 at PC X+3 Iter. 1 Iter. 2 Warp: A set of threads that execute Realizatio: Each iteratio is idepedet the same istructio (i.e., at the same PC) Idea: This Programmer particular or model compiler is also geerates called: a thread to execute each iteratio. Each thread does the same SPMD: thig Sigle (but o Program differet data) Multiple Data Ca A GPU Ca be executed be executes executed o it a usig MIMD o a SIMD the machie SIMT machie model: Sigle Istructio Multiple Thread 60

61 Graphics Processig Uits SIMD ot Exposed to Programmer (SIMT)

62 SIMD vs. SIMT Executio Model SIMD: A sigle seuetial istructio stream of SIMD istructios à each istructio specifies multiple data iputs [VLD, VLD, VADD, VST], VLEN SIMT: Multiple istructio streams of scalar istructios à threads grouped dyamically ito warps [LD, LD, ADD, ST], NumThreads Two Major SIMT Advatages: Ca treat each thread separately à i.e., ca execute each thread idepedetly (o ay type of scalar pipelie) à MIMD processig Ca group threads ito warps flexibly à i.e., ca group threads that are supposed to truly execute the same istructio à dyamically obtai ad maximize beefits of SIMD processig 62

63 Multithreadig of Warps for (i=0; i < N; i++) C[i] = A[i] + B[i]; Assume a warp cosists of 32 threads If you have 32K iteratios, ad 1 iteratio/thread à 1K warps Warps ca be iterleaved o the same pipelie à Fie graied multithreadig of warps load load Warp 10 at PC X load load add store add store Warp 20 at PC X+2 Iter * Iter *

64 Warps ad Warp-Level FGMT Warp: A set of threads that execute the same istructio (o differet data elemets) à SIMT (Nvidia-speak) All threads ru the same code Warp: The threads that ru legthwise i a wove fabric Thread Warp Scalar Scalar Scalar Thread Thread Thread W X Y Commo PC Scalar Thread Z Thread Warp 3 Thread Warp 8 Thread Warp 7 SIMD Pipelie 64

65 High-Level View of a GPU 65

66 Latecy Hidig via Warp-Level FGMT Warp: A set of threads that execute the same istructio (o differet data elemets) Thread Warp 3 Thread Warp 8 Warps available for schedulig Fie-graied multithreadig Oe istructio per thread i pipelie at a time (No iterlockig) Iterleave warp executio to hide latecies Register values of all threads stay i register file FGMT eables log latecy tolerace Millios of pixels Thread Warp 7 R F A L U All Hit? I-Fetch Decode R F A L U D-Cache Writeback Data R F A L U SIMD Pipelie Warps accessig memory hierarchy Miss? Thread Warp 1 Thread Warp 2 Thread Warp 6 Slide credit: Tor Aamodt 66

67 Warp Executio (Recall the Slide) 32-thread warp executig ADD A[tid],B[tid] à C[tid] Executio usig oe pipelied fuctioal uit Executio usig four pipelied fuctioal uits A[6] B[6] A[24] B[24] A[25] B[25] A[26] B[26] A[27] B[27] A[5] B[5] A[20] B[20] A[21] B[21] A[22] B[22] A[23] B[23] A[4] B[4] A[16] B[16] A[17] B[17] A[18] B[18] A[19] B[19] A[3] B[3] A[12] B[12] A[13] B[13] A[14] B[14] A[15] B[15] C[2] C[8] C[9] C[10] C[11] C[1] C[4] C[5] C[6] C[7] C[0] C[0] C[1] C[2] C[3] Slide credit: Krste Asaovic 67

68 SIMD Executio Uit Structure Fuctioal Uit Registers for each Thread Registers for thread IDs 0, 4, 8, Registers for thread IDs 1, 5, 9, Registers for thread IDs 2, 6, 10, Registers for thread IDs 3, 7, 11, Lae Memory Subsystem Slide credit: Krste Asaovic 68

69 Warp Istructio Level Parallelism Ca overlap executio of multiple istructios Example machie has 32 threads per warp ad 8 laes Completes 24 operatios/cycle while issuig 1 warp/cycle Load Uit Multiply Uit Add Uit W0 W1 time W2 W3 W4 W5 Warp issue Slide credit: Krste Asaovic 69

70 SIMT Memory Access Same istructio i differet threads uses thread id to idex ad access differet data elemets + Let s assume N=16, 4 threads per warp à 4 warps Threads Data elemets Warp 0 Warp 1 Warp 2 Warp 3 Slide credit: Hyesoo Kim

71 Sample GPU SIMT Code (Simplified) CPU code for (ii = 0; ii < ; ++ii) { C[ii] = A[ii] + B[ii]; } CUDA code // there are threads global void KerelFuctio( ) { it tid = blockdim.x * blockidx.x + threadidx.x; it vara = aa[tid]; it varb = bb[tid]; C[tid] = vara + varb; } Slide credit: Hyesoo Kim

72 Sample GPU Program (Less Simplified) Slide credit: Hyesoo Kim 72

73 Warp-based SIMD vs. Traditioal SIMD Traditioal SIMD cotais a sigle thread Seuetial istructio executio; lock-step operatios i a SIMD istructio Programmig model is SIMD (o extra threads) à SW eeds to kow vector legth ISA cotais vector/simd istructios Warp-based SIMD cosists of multiple scalar threads executig i a SIMD maer (i.e., same istructio executed by all threads) Does ot have to be lock step Each thread ca be treated idividually (i.e., placed i a differet warp) à programmig model ot SIMD SW does ot eed to kow vector legth Eables multithreadig ad flexible dyamic groupig of threads ISA is scalar à SIMD operatios ca be formed dyamically Essetially, it is SPMD programmig model implemeted o SIMD hardware 73

74 SPMD Sigle procedure/program, multiple data This is a programmig model rather tha computer orgaizatio Each processig elemet executes the same procedure, except o differet data elemets Procedures ca sychroize at certai poits i program, e.g. barriers Essetially, multiple istructio streams execute the same program Each program/procedure 1) works o differet data, 2) ca execute a differet cotrol-flow path, at ru-time May scietific applicatios are programmed this way ad ru o MIMD hardware (multiprocessors) Moder GPUs programmed i a similar way o a SIMD hardware 74

75 SIMD vs. SIMT Executio Model SIMD: A sigle seuetial istructio stream of SIMD istructios à each istructio specifies multiple data iputs [VLD, VLD, VADD, VST], VLEN SIMT: Multiple istructio streams of scalar istructios à threads grouped dyamically ito warps [LD, LD, ADD, ST], NumThreads Two Major SIMT Advatages: Ca treat each thread separately à i.e., ca execute each thread idepedetly o ay type of scalar pipelie à MIMD processig Ca group threads ito warps flexibly à i.e., ca group threads that are supposed to truly execute the same istructio à dyamically obtai ad maximize beefits of SIMD processig 75

76 Threads Ca Take Differet Paths i Warp-based SIMD Each thread ca have coditioal cotrol flow istructios Threads ca execute differet cotrol flow paths A B Thread Warp Commo PC C D F Thread 1 Thread 2 Thread 3 Thread 4 E G Slide credit: Tor Aamodt 76

77 Cotrol Flow Problem i GPUs/SIMT A GPU uses a SIMD pipelie to save area o cotrol logic. Groups scalar threads ito warps Brach divergece occurs whe threads iside warps brach to differet executio paths. Brach Path A Path B This is the same as coditioal/predicated/masked executio. Recall the Vector Mask ad Masked Vector Operatios? Slide credit: Tor Aamodt 77

78 Remember: Each Thread Is Idepedet Two Major SIMT Advatages: Ca treat each thread separately à i.e., ca execute each thread idepedetly o ay type of scalar pipelie à MIMD processig Ca group threads ito warps flexibly à i.e., ca group threads that are supposed to truly execute the same istructio à dyamically obtai ad maximize beefits of SIMD processig If we have may threads We ca fid idividual threads that are at the same PC Ad, group them together ito a sigle warp dyamically This reduces divergece à improves SIMD utilizatio SIMD utilizatio: fractio of SIMD laes executig a useful operatio (i.e., executig a active thread) 78

79 Dyamic Warp Formatio/Mergig Idea: Dyamically merge threads executig the same istructio (after brach divergece) Form ew warps from warps that are waitig Eough threads brachig to each path eables the creatio of full ew warps Warp X Warp Z Warp Y 79

80 Dyamic Warp Formatio/Mergig Idea: Dyamically merge threads executig the same istructio (after brach divergece) Brach Path A Path B Fug et al., Dyamic Warp Formatio ad Schedulig for Efficiet GPU Cotrol Flow, MICRO

81 Dyamic Warp Formatio Example B x/1110 y/0011 A x/1111 y/1111 C x/1000 y/0010 D x/0110 y/0001 F x/0001 y/1100 E x/1110 y/0011 A D Leged A Executio of Warp x at Basic Block A A ew warp created from scalar threads of both Warp x ad y executig at Basic Block D Executio of Warp y at Basic Block A Baselie Dyamic Warp Formatio G x/1111 y/1111 A A B B C C D D E E F F G G A A A A B B C D E E F G G A A Time Time Slide credit: Tor Aamodt 81

82 Hardware Costraits Limit Flexibility of Warp Groupig Fuctioal Uit Registers for each Thread Registers for thread IDs 0, 4, 8, Registers for thread IDs 1, 5, 9, Registers for thread IDs 2, 6, 10, Registers for thread IDs 3, 7, 11, Lae Ca you move ay thread flexibly to ay lae? Memory Subsystem Slide credit: Krste Asaovic 82

83 A Example GPU

84 NVIDIA GeForce GTX 285 NVIDIA-speak: 240 stream processors SIMT executio Geeric speak: 30 cores 8 SIMD fuctioal uits per core Slide credit: Kayvo Fatahalia 84

85 NVIDIA GeForce GTX 285 core 64 KB of storage for thread cotexts (registers) = SIMD fuctioal uit, cotrol shared across 8 uits = multiply-add = multiply = istructio stream decode = executio cotext storage Slide credit: Kayvo Fatahalia 85

86 NVIDIA GeForce GTX 285 core 64 KB of storage for thread cotexts (registers) Groups of 32 threads share istructio stream (each group is a Warp) Up to 32 warps are simultaeously iterleaved Up to 1024 thread cotexts ca be stored Slide credit: Kayvo Fatahalia 86

87 NVIDIA GeForce GTX 285 Tex Tex Tex Tex Tex Tex Tex Tex Tex Tex 30 cores o the GTX 285: 30,720 threads Slide credit: Kayvo Fatahalia 87

88 Computer Architecture Lecture 8: SIMD Processors ad GPUs Prof. Our Mutlu ETH Zürich Fall October 2017

Design of Digital Circuits Lecture 21: SIMD Processors II and Graphics Processing Units

Design of Digital Circuits Lecture 21: SIMD Processors II and Graphics Processing Units Desig of Digital Circuits Lecture 21: SIMD Processors II ad Graphics Processig Uits Dr. Jua Gómez Lua Prof. Our Mutlu ETH Zurich Sprig 2018 17 May 2018 New Course: Bachelor s Semiar i Comp Arch Fall 2018

More information

Computer Architecture: SIMD and GPUs (Part I) Prof. Onur Mutlu Carnegie Mellon University

Computer Architecture: SIMD and GPUs (Part I) Prof. Onur Mutlu Carnegie Mellon University Computer Architecture: SIMD and GPUs (Part I) Prof. Onur Mutlu Carnegie Mellon University A Note on This Lecture These slides are partly from 18-447 Spring 2013, Computer Architecture, Lecture 15: Dataflow

More information

Design of Digital Circuits Lecture 20: SIMD Processors. Prof. Onur Mutlu ETH Zurich Spring May 2018

Design of Digital Circuits Lecture 20: SIMD Processors. Prof. Onur Mutlu ETH Zurich Spring May 2018 Desig of Digital Circuits Lecture 20: SIMD Processors Prof. Our Mutlu ETH Zurich Sprig 2018 11 May 2018 New Course: Bachelor s Semiar i Comp Arch Fall 2018 2 credit uits Rigorous semiar o fudametal ad

More information

Computer Architecture Lecture 16: SIMD Processing (Vector and Array Processors)

Computer Architecture Lecture 16: SIMD Processing (Vector and Array Processors) 18-447 Computer Architecture Lecture 16: SIMD Processing (Vector and Array Processors) Prof. Onur Mutlu Carnegie Mellon University Spring 2014, 2/24/2014 Lab 4 Reminder Lab 4a out Branch handling and branch

More information

Computer Architecture: SIMD and GPUs (Part II) Prof. Onur Mutlu Carnegie Mellon University

Computer Architecture: SIMD and GPUs (Part II) Prof. Onur Mutlu Carnegie Mellon University Computer Architecture: SIMD and GPUs (Part II) Prof. Onur Mutlu Carnegie Mellon University A Note on This Lecture These slides are partly from 18-447 Spring 2013, Computer Architecture, Lecture 19: SIMD

More information

Design of Digital Circuits Lecture 20: SIMD Processors. Prof. Onur Mutlu ETH Zurich Spring May 2017

Design of Digital Circuits Lecture 20: SIMD Processors. Prof. Onur Mutlu ETH Zurich Spring May 2017 Design of Digital Circuits Lecture 20: SIMD Processors Prof. Onur Mutlu ETH Zurich Spring 2017 11 May 2017 Agenda for Today & Next Few Lectures! Single-cycle Microarchitectures! Multi-cycle and Microprogrammed

More information

Design of Digital Circuits Lecture 21: GPUs. Prof. Onur Mutlu ETH Zurich Spring May 2017

Design of Digital Circuits Lecture 21: GPUs. Prof. Onur Mutlu ETH Zurich Spring May 2017 Design of Digital Circuits Lecture 21: GPUs Prof. Onur Mutlu ETH Zurich Spring 2017 12 May 2017 Agenda for Today & Next Few Lectures Single-cycle Microarchitectures Multi-cycle and Microprogrammed Microarchitectures

More information

CMSC22200 Computer Architecture Lecture 9: Out-of-Order, SIMD, VLIW. Prof. Yanjing Li University of Chicago

CMSC22200 Computer Architecture Lecture 9: Out-of-Order, SIMD, VLIW. Prof. Yanjing Li University of Chicago CMSC22200 Computer Architecture Lecture 9: Out-of-Order, SIMD, VLIW Prof. Yajig Li Uiversity of Chicago Admiistrative Stuff Lab2 due toight Exam I: covers lectures 1-9 Ope book, ope otes, close device

More information

Computer Architecture Lecture 15: GPUs, VLIW, DAE. Prof. Onur Mutlu Carnegie Mellon University Spring 2015, 2/20/2015

Computer Architecture Lecture 15: GPUs, VLIW, DAE. Prof. Onur Mutlu Carnegie Mellon University Spring 2015, 2/20/2015 18-447 Computer Architecture Lecture 15: GPUs, VLIW, DAE Prof. Onur Mutlu Carnegie Mellon University Spring 2015, 2/20/2015 Agenda for Today & Next Few Lectures Single-cycle Microarchitectures Multi-cycle

More information

Instruction and Data Streams

Instruction and Data Streams Advaced Architectures Master Iformatics Eg. 2017/18 A.J.Proeça Data Parallelism 1 (vector & SIMD extesios) (most slides are borrowed) AJProeça, Advaced Architectures, MiEI, UMiho, 2017/18 1 Istructio ad

More information

Design of Digital Circuits Lecture 19: Approaches to Concurrency. Prof. Onur Mutlu ETH Zurich Spring May 2017

Design of Digital Circuits Lecture 19: Approaches to Concurrency. Prof. Onur Mutlu ETH Zurich Spring May 2017 Design of Digital Circuits Lecture 19: Approaches to Concurrency Prof. Onur Mutlu ETH Zurich Spring 2017 5 May 2017 Agenda for Today & Next Few Lectures! Single-cycle Microarchitectures! Multi-cycle and

More information

Multi-Threading. Hyper-, Multi-, and Simultaneous Thread Execution

Multi-Threading. Hyper-, Multi-, and Simultaneous Thread Execution Multi-Threadig Hyper-, Multi-, ad Simultaeous Thread Executio 1 Performace To Date Icreasig processor performace Pipeliig. Brach predictio. Super-scalar executio. Out-of-order executio. Caches. Hyper-Threadig

More information

Computer Architecture Lecture 15: Dataflow and SIMD. Prof. Onur Mutlu Carnegie Mellon University Spring 2013, 2/20/2013

Computer Architecture Lecture 15: Dataflow and SIMD. Prof. Onur Mutlu Carnegie Mellon University Spring 2013, 2/20/2013 18-447 Computer Architecture Lecture 15: Dataflow and SIMD Prof. Onur Mutlu Carnegie Mellon University Spring 2013, 2/20/2013 Reminder: Homework 3 Homework 3 Due Feb 25 REP MOVS in Microprogrammed LC-3b,

More information

CMSC Computer Architecture Lecture 12: Virtual Memory. Prof. Yanjing Li University of Chicago

CMSC Computer Architecture Lecture 12: Virtual Memory. Prof. Yanjing Li University of Chicago CMSC 22200 Computer Architecture Lecture 12: Virtual Memory Prof. Yajig Li Uiversity of Chicago A System with Physical Memory Oly Examples: most Cray machies early PCs Memory early all embedded systems

More information

Design of Digital Circuits Lecture 22: GPU Programming. Dr. Juan Gómez Luna Prof. Onur Mutlu ETH Zurich Spring May 2018

Design of Digital Circuits Lecture 22: GPU Programming. Dr. Juan Gómez Luna Prof. Onur Mutlu ETH Zurich Spring May 2018 Desig of Digital Circuits Lecture 22: GPU Programmig Dr. Jua Gómez Lua Prof. Our Mutlu ETH Zurich Sprig 2018 18 May 2018 Ageda for Today GPU as a accelerator Program structure Bulk sychroous programmig

More information

CMSC Computer Architecture Lecture 10: Caches. Prof. Yanjing Li University of Chicago

CMSC Computer Architecture Lecture 10: Caches. Prof. Yanjing Li University of Chicago CMSC 22200 Computer Architecture Lecture 10: Caches Prof. Yajig Li Uiversity of Chicago Midterm Recap Overview ad fudametal cocepts ISA Uarch Datapath, cotrol Sigle cycle, multi cycle Pipeliig Basic idea,

More information

COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface. Chapter 4. The Processor. Part A Datapath Design

COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface. Chapter 4. The Processor. Part A Datapath Design COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Iterface 5 th Editio Chapter The Processor Part A path Desig Itroductio CPU performace factors Istructio cout Determied by ISA ad compiler. CPI ad

More information

Design of Digital Circuits Lecture 17: Out-of-Order, DataFlow, Superscalar Execution. Prof. Onur Mutlu ETH Zurich Spring April 2018

Design of Digital Circuits Lecture 17: Out-of-Order, DataFlow, Superscalar Execution. Prof. Onur Mutlu ETH Zurich Spring April 2018 Desig of Digital Circuits Lecture 17: Out-of-Order, DataFlow, Superscalar Executio Prof. Our Mutlu ETH Zurich Sprig 2018 27 April 2018 Ageda for Today & Next Few Lectures Sigle-cycle Microarchitectures

More information

COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface. Chapter 4. The Processor. Single-Cycle Disadvantages & Advantages

COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface. Chapter 4. The Processor. Single-Cycle Disadvantages & Advantages COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Iterface 5 th Editio Chapter 4 The Processor Pipeliig Sigle-Cycle Disadvatages & Advatages Clk Uses the clock cycle iefficietly the clock cycle must

More information

Computer Graphics Hardware An Overview

Computer Graphics Hardware An Overview Computer Graphics Hardware A Overview Graphics System Moitor Iput devices CPU/Memory GPU Raster Graphics System Raster: A array of picture elemets Based o raster-sca TV techology The scree (ad a picture)

More information

Multiprocessors. HPC Prof. Robert van Engelen

Multiprocessors. HPC Prof. Robert van Engelen Multiprocessors Prof. Robert va Egele Overview The PMS model Shared memory multiprocessors Basic shared memory systems SMP, Multicore, ad COMA Distributed memory multicomputers MPP systems Network topologies

More information

CMSC Computer Architecture Lecture 5: Pipelining. Prof. Yanjing Li University of Chicago

CMSC Computer Architecture Lecture 5: Pipelining. Prof. Yanjing Li University of Chicago CMSC 22200 Computer Architecture Lecture 5: Pipeliig Prof. Yajig Li Uiversity of Chicago Admiistrative Stuff Lab1 Due toight Lab2: out later today; due 2 weeks from ow Review sessio this Friday Turig award

More information

Appendix D. Controller Implementation

Appendix D. Controller Implementation COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Iterface 5 th Editio Appedix D Cotroller Implemetatio Cotroller Implemetatios Combiatioal logic (sigle-cycle); Fiite state machie (multi-cycle, pipelied);

More information

Morgan Kaufmann Publishers 26 February, COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface. Chapter 5.

Morgan Kaufmann Publishers 26 February, COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface. Chapter 5. Morga Kaufma Publishers 26 February, 208 COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Iterface 5 th Editio Chapter 5 Virtual Memory Review: The Memory Hierarchy Take advatage of the priciple

More information

CMSC Computer Architecture Lecture 11: More Caches. Prof. Yanjing Li University of Chicago

CMSC Computer Architecture Lecture 11: More Caches. Prof. Yanjing Li University of Chicago CMSC 22200 Computer Architecture Lecture 11: More Caches Prof. Yajig Li Uiversity of Chicago Lecture Outlie Caches 2 Review Memory hierarchy Cache basics Locality priciples Spatial ad temporal How to access

More information

Course Site: Copyright 2012, Elsevier Inc. All rights reserved.

Course Site:   Copyright 2012, Elsevier Inc. All rights reserved. Course Site: http://cc.sjtu.edu.c/g2s/site/aca.html 1 Computer Architecture A Quatitative Approach, Fifth Editio Chapter 2 Memory Hierarchy Desig 2 Outlie Memory Hierarchy Cache Desig Basic Cache Optimizatios

More information

Design of Digital Circuits Lecture 16: Out-of-Order Execution. Prof. Onur Mutlu ETH Zurich Spring April 2018

Design of Digital Circuits Lecture 16: Out-of-Order Execution. Prof. Onur Mutlu ETH Zurich Spring April 2018 Desig of Digital Circuits Lecture 16: Out-of-Order Executio Prof. Our Mutlu ETH Zurich Sprig 2018 26 April 2018 Ageda for Today & Next Few Lectures Sigle-cycle Microarchitectures Multi-cycle ad Microprogrammed

More information

Chapter 4 Threads. Operating Systems: Internals and Design Principles. Ninth Edition By William Stallings

Chapter 4 Threads. Operating Systems: Internals and Design Principles. Ninth Edition By William Stallings Operatig Systems: Iterals ad Desig Priciples Chapter 4 Threads Nith Editio By William Stalligs Processes ad Threads Resource Owership Process icludes a virtual address space to hold the process image The

More information

CSC 220: Computer Organization Unit 11 Basic Computer Organization and Design

CSC 220: Computer Organization Unit 11 Basic Computer Organization and Design College of Computer ad Iformatio Scieces Departmet of Computer Sciece CSC 220: Computer Orgaizatio Uit 11 Basic Computer Orgaizatio ad Desig 1 For the rest of the semester, we ll focus o computer architecture:

More information

Master Informatics Eng. 2017/18. A.J.Proença. Memory Hierarchy. (most slides are borrowed) AJProença, Advanced Architectures, MiEI, UMinho, 2017/18 1

Master Informatics Eng. 2017/18. A.J.Proença. Memory Hierarchy. (most slides are borrowed) AJProença, Advanced Architectures, MiEI, UMinho, 2017/18 1 Advaced Architectures Master Iformatics Eg. 2017/18 A.J.Proeça Memory Hierarchy (most slides are borrowed) AJProeça, Advaced Architectures, MiEI, UMiho, 2017/18 1 Itroductio Programmers wat ulimited amouts

More information

CS252 Spring 2017 Graduate Computer Architecture. Lecture 6: Out-of-Order Processors

CS252 Spring 2017 Graduate Computer Architecture. Lecture 6: Out-of-Order Processors CS252 Sprig 2017 Graduate Computer Architecture Lecture 6: Out-of-Order Processors Lisa Wu, Krste Asaovic http://ist.eecs.berkeley.edu/~cs252/sp17 WU UCB CS252 SP17 2 WU UCB CS252 SP17 Last Time i Lecture

More information

Threads and Concurrency in Java: Part 1

Threads and Concurrency in Java: Part 1 Threads ad Cocurrecy i Java: Part 1 1 Cocurrecy What every computer egieer eeds to kow about cocurrecy: Cocurrecy is to utraied programmers as matches are to small childre. It is all too easy to get bured.

More information

Elementary Educational Computer

Elementary Educational Computer Chapter 5 Elemetary Educatioal Computer. Geeral structure of the Elemetary Educatioal Computer (EEC) The EEC coforms to the 5 uits structure defied by vo Neuma's model (.) All uits are preseted i a simplified

More information

COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface. Chapter 4. The Processor Advanced Issues

COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface. Chapter 4. The Processor Advanced Issues COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Iterface 5 th Editio Chapter 4 The Processor Advaced Issues Review: Pipelie Hazards Structural hazards Desig pipelie to elimiate structural hazards.

More information

Threads and Concurrency in Java: Part 1

Threads and Concurrency in Java: Part 1 Cocurrecy Threads ad Cocurrecy i Java: Part 1 What every computer egieer eeds to kow about cocurrecy: Cocurrecy is to utraied programmers as matches are to small childre. It is all too easy to get bured.

More information

Chapter 4 The Datapath

Chapter 4 The Datapath The Ageda Chapter 4 The Datapath Based o slides McGraw-Hill Additioal material 24/25/26 Lewis/Marti Additioal material 28 Roth Additioal material 2 Taylor Additioal material 2 Farmer Tae the elemets that

More information

CS 152 Computer Architecture and Engineering CS252 Graduate Computer Architecture. Lecture 17 GPUs

CS 152 Computer Architecture and Engineering CS252 Graduate Computer Architecture. Lecture 17 GPUs CS 152 Computer Architecture ad Egieerig CS252 Graduate Computer Architecture Lecture 17 GPUs Krste Asaovic Electrical Egieerig ad Computer Scieces Uiversity of Califoria at Berkeley http://www.eecs.berkeley.edu/~krste

More information

Isn t It Time You Got Faster, Quicker?

Isn t It Time You Got Faster, Quicker? Is t It Time You Got Faster, Quicker? AltiVec Techology At-a-Glace OVERVIEW Motorola s advaced AltiVec techology is desiged to eable host processors compatible with the PowerPC istructio-set architecture

More information

End Semester Examination CSE, III Yr. (I Sem), 30002: Computer Organization

End Semester Examination CSE, III Yr. (I Sem), 30002: Computer Organization Ed Semester Examiatio 2013-14 CSE, III Yr. (I Sem), 30002: Computer Orgaizatio Istructios: GROUP -A 1. Write the questio paper group (A, B, C, D), o frot page top of aswer book, as per what is metioed

More information

CMSC Computer Architecture Lecture 2: ISA. Prof. Yanjing Li Department of Computer Science University of Chicago

CMSC Computer Architecture Lecture 2: ISA. Prof. Yanjing Li Department of Computer Science University of Chicago CMSC 22200 Computer Architecture Lecture 2: ISA Prof. Yajig Li Departmet of Computer Sciece Uiversity of Chicago Admiistrative Stuff Lab1 out toight Due Thursday (10/18) Lab1 review sessio Tomorrow, 10/05,

More information

UNIVERSITY OF MORATUWA

UNIVERSITY OF MORATUWA UNIVERSITY OF MORATUWA FACULTY OF ENGINEERING DEPARTMENT OF COMPUTER SCIENCE & ENGINEERING B.Sc. Egieerig 2014 Itake Semester 2 Examiatio CS2052 COMPUTER ARCHITECTURE Time allowed: 2 Hours Jauary 2016

More information

The University of Adelaide, School of Computer Science 22 November Computer Architecture. A Quantitative Approach, Sixth Edition.

The University of Adelaide, School of Computer Science 22 November Computer Architecture. A Quantitative Approach, Sixth Edition. Computer Architecture A Quatitative Approach, Sixth Editio Chapter 2 Memory Hierarchy Desig 1 Itroductio Programmers wat ulimited amouts of memory with low latecy Fast memory techology is more expesive

More information

Data diverse software fault tolerance techniques

Data diverse software fault tolerance techniques Data diverse software fault tolerace techiques Complemets desig diversity by compesatig for desig diversity s s limitatios Ivolves obtaiig a related set of poits i the program data space, executig the

More information

Copyright 2016 Ramez Elmasri and Shamkant B. Navathe

Copyright 2016 Ramez Elmasri and Shamkant B. Navathe Copyright 2016 Ramez Elmasri ad Shamkat B. Navathe CHAPTER 18 Strategies for Query Processig Copyright 2016 Ramez Elmasri ad Shamkat B. Navathe Itroductio DBMS techiques to process a query Scaer idetifies

More information

CMSC Computer Architecture Lecture 3: ISA and Introduction to Microarchitecture. Prof. Yanjing Li University of Chicago

CMSC Computer Architecture Lecture 3: ISA and Introduction to Microarchitecture. Prof. Yanjing Li University of Chicago CMSC 22200 Computer Architecture Lecture 3: ISA ad Itroductio to Microarchitecture Prof. Yajig Li Uiversity of Chicago Lecture Outlie ISA uarch (hardware implemetatio of a ISA) Logic desig basics Sigle-cycle

More information

Transforming Irregular Algorithms for Heterogeneous Computing - Case Studies in Bioinformatics

Transforming Irregular Algorithms for Heterogeneous Computing - Case Studies in Bioinformatics Trasformig Irregular lgorithms for Heterogeeous omputig - ase Studies i ioiformatics Jig Zhag dvisor: Dr. Wu Feg ollaborator: Hao Wag syergy.cs.vt.edu Irregular lgorithms haracterized by Operate o irregular

More information

Analysis Metrics. Intro to Algorithm Analysis. Slides. 12. Alg Analysis. 12. Alg Analysis

Analysis Metrics. Intro to Algorithm Analysis. Slides. 12. Alg Analysis. 12. Alg Analysis Itro to Algorithm Aalysis Aalysis Metrics Slides. Table of Cotets. Aalysis Metrics 3. Exact Aalysis Rules 4. Simple Summatio 5. Summatio Formulas 6. Order of Magitude 7. Big-O otatio 8. Big-O Theorems

More information

One advantage that SONAR has over any other music-sequencing product I ve worked

One advantage that SONAR has over any other music-sequencing product I ve worked *gajedra* D:/Thomso_Learig_Projects/Garrigus_163132/z_productio/z_3B2_3D_files/Garrigus_163132_ch17.3d, 14/11/08/16:26:39, 16:26, page: 647 17 CAL 101 Oe advatage that SONAR has over ay other music-sequecig

More information

Uniprocessors. HPC Prof. Robert van Engelen

Uniprocessors. HPC Prof. Robert van Engelen Uiprocessors HPC Prof. Robert va Egele Overview PART I: Uiprocessors PART II: Multiprocessors ad ad Compiler Optimizatios Parallel Programmig Models Uiprocessors Multiprocessors Processor architectures

More information

Morgan Kaufmann Publishers 26 February, COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface. Chapter 5

Morgan Kaufmann Publishers 26 February, COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface. Chapter 5 Morga Kaufma Publishers 26 February, 28 COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Iterface 5 th Editio Chapter 5 Set-Associative Cache Architecture Performace Summary Whe CPU performace icreases:

More information

Switching Hardware. Spring 2018 CS 438 Staff, University of Illinois 1

Switching Hardware. Spring 2018 CS 438 Staff, University of Illinois 1 Switchig Hardware Sprig 208 CS 438 Staff, Uiversity of Illiois Where are we? Uderstad Differet ways to move through a etwork (forwardig) Read sigs at each switch (datagram) Follow a kow path (virtual circuit)

More information

CMSC Computer Architecture Lecture 15: Multi-Core. Prof. Yanjing Li University of Chicago

CMSC Computer Architecture Lecture 15: Multi-Core. Prof. Yanjing Li University of Chicago CMSC 22200 Computer Architecture Lecture 15: Multi-Core Prof. Yajig Li Uiversity of Chicago Course Evaluatio Very importat Please fill out! 2 Lab3 Brach Predictio Competitio 8 teams etered the competitio,

More information

Basic allocator mechanisms The course that gives CMU its Zip! Memory Management II: Dynamic Storage Allocation Mar 6, 2000.

Basic allocator mechanisms The course that gives CMU its Zip! Memory Management II: Dynamic Storage Allocation Mar 6, 2000. 5-23 The course that gives CM its Zip Memory Maagemet II: Dyamic Storage Allocatio Mar 6, 2000 Topics Segregated lists Buddy system Garbage collectio Mark ad Sweep Copyig eferece coutig Basic allocator

More information

Computer Architecture. Microcomputer Architecture and Interfacing Colorado School of Mines Professor William Hoff

Computer Architecture. Microcomputer Architecture and Interfacing Colorado School of Mines Professor William Hoff Computer rchitecture Microcomputer rchitecture ad Iterfacig Colorado School of Mies Professor William Hoff Computer Hardware Orgaizatio Processor Performs all computatios; coordiates data trasfer Iput

More information

Computers and Scientific Thinking

Computers and Scientific Thinking Computers ad Scietific Thikig David Reed, Creighto Uiversity Chapter 15 JavaScript Strigs 1 Strigs as Objects so far, your iteractive Web pages have maipulated strigs i simple ways use text box to iput

More information

UH-MEM: Utility-Based Hybrid Memory Management. Yang Li, Saugata Ghose, Jongmoo Choi, Jin Sun, Hui Wang, Onur Mutlu

UH-MEM: Utility-Based Hybrid Memory Management. Yang Li, Saugata Ghose, Jongmoo Choi, Jin Sun, Hui Wang, Onur Mutlu UH-MEM: Utility-Based Hybrid Memory Maagemet Yag Li, Saugata Ghose, Jogmoo Choi, Ji Su, Hui Wag, Our Mutlu 1 Executive Summary DRAM faces sigificat techology scalig difficulties Emergig memory techologies

More information

CS2410 Computer Architecture. Flynn s Taxonomy

CS2410 Computer Architecture. Flynn s Taxonomy CS2410 Computer Architecture Dept. of Computer Sciece Uiversity of Pittsburgh http://www.cs.pitt.edu/~melhem/courses/2410p/idex.html 1 Fly s Taxoomy SISD Sigle istructio stream Sigle data stream (SIMD)

More information

Python Programming: An Introduction to Computer Science

Python Programming: An Introduction to Computer Science Pytho Programmig: A Itroductio to Computer Sciece Chapter 1 Computers ad Programs 1 Objectives To uderstad the respective roles of hardware ad software i a computig system. To lear what computer scietists

More information

ECE5917 SoC Architecture: MP SoC Part 1. Tae Hee Han: Semiconductor Systems Engineering Sungkyunkwan University

ECE5917 SoC Architecture: MP SoC Part 1. Tae Hee Han: Semiconductor Systems Engineering Sungkyunkwan University ECE5917 SoC Architecture: MP SoC Part 1 Tae Hee Ha: tha@skku.edu Semicoductor Systems Egieerig Sugkyukwa Uiversity Outlie Overview Parallelism Data-Level Parallelism Istructio-Level Parallelism Thread-Level

More information

Lecture Notes 6 Introduction to algorithm analysis CSS 501 Data Structures and Object-Oriented Programming

Lecture Notes 6 Introduction to algorithm analysis CSS 501 Data Structures and Object-Oriented Programming Lecture Notes 6 Itroductio to algorithm aalysis CSS 501 Data Structures ad Object-Orieted Programmig Readig for this lecture: Carrao, Chapter 10 To be covered i this lecture: Itroductio to algorithm aalysis

More information

Design of Digital Circuits Lecture 14: Pipelining. Prof. Onur Mutlu ETH Zurich Spring April 2018

Design of Digital Circuits Lecture 14: Pipelining. Prof. Onur Mutlu ETH Zurich Spring April 2018 Desig of Digital Circuits Lecture 4: Pipeliig Prof. Our Mutlu ETH Zurich Sprig 28 9 April 28 Ageda for Today & Next Few Lectures Previous lectures Sigle-cycle Microarchitectures Multi-cycle ad Microprogrammed

More information

Computer Architecture ELEC3441

Computer Architecture ELEC3441 CPU-Memory Bottleeck Computer Architecture ELEC44 CPU Memory Lecture 8 Cache Dr. Hayde Kwok-Hay So Departmet of Electrical ad Electroic Egieerig Performace of high-speed computers is usually limited by

More information

Structuring Redundancy for Fault Tolerance. CSE 598D: Fault Tolerant Software

Structuring Redundancy for Fault Tolerance. CSE 598D: Fault Tolerant Software Structurig Redudacy for Fault Tolerace CSE 598D: Fault Tolerat Software What do we wat to achieve? Versios Damage Assessmet Versio 1 Error Detectio Iputs Versio 2 Voter Outputs State Restoratio Cotiued

More information

COMP Parallel Computing. PRAM (1): The PRAM model and complexity measures

COMP Parallel Computing. PRAM (1): The PRAM model and complexity measures COMP 633 - Parallel Computig Lecture 2 August 24, 2017 : The PRAM model ad complexity measures 1 First class summary This course is about parallel computig to achieve high-er performace o idividual problems

More information

EE University of Minnesota. Midterm Exam #1. Prof. Matthew O'Keefe TA: Eric Seppanen. Department of Electrical and Computer Engineering

EE University of Minnesota. Midterm Exam #1. Prof. Matthew O'Keefe TA: Eric Seppanen. Department of Electrical and Computer Engineering EE 4363 1 Uiversity of Miesota Midterm Exam #1 Prof. Matthew O'Keefe TA: Eric Seppae Departmet of Electrical ad Computer Egieerig Uiversity of Miesota Twi Cities Campus EE 4363 Itroductio to Microprocessors

More information

Chapter 5: Processor Design Advanced Topics. Microprogramming: Basic Idea

Chapter 5: Processor Design Advanced Topics. Microprogramming: Basic Idea 5-1 Chapter 5 Processor Desig Advaced Topics Chapter 5: Processor Desig Advaced Topics Topics 5.3 Microprogrammig Cotrol store ad microbrachig Horizotal ad vertical microprogrammig 5- Chapter 5 Processor

More information

Ones Assignment Method for Solving Traveling Salesman Problem

Ones Assignment Method for Solving Traveling Salesman Problem Joural of mathematics ad computer sciece 0 (0), 58-65 Oes Assigmet Method for Solvig Travelig Salesma Problem Hadi Basirzadeh Departmet of Mathematics, Shahid Chamra Uiversity, Ahvaz, Ira Article history:

More information

Hash Tables. Presentation for use with the textbook Algorithm Design and Applications, by M. T. Goodrich and R. Tamassia, Wiley, 2015.

Hash Tables. Presentation for use with the textbook Algorithm Design and Applications, by M. T. Goodrich and R. Tamassia, Wiley, 2015. Presetatio for use with the textbook Algorithm Desig ad Applicatios, by M. T. Goodrich ad R. Tamassia, Wiley, 2015 Hash Tables xkcd. http://xkcd.com/221/. Radom Number. Used with permissio uder Creative

More information

Copyright 2016 Ramez Elmasri and Shamkant B. Navathe

Copyright 2016 Ramez Elmasri and Shamkant B. Navathe Copyright 2016 Ramez Elmasri ad Shamkat B. Navathe CHAPTER 20 Itroductio to Trasactio Processig Cocepts ad Theory Copyright 2016 Ramez Elmasri ad Shamkat B. Navathe Itroductio Trasactio Describes local

More information

Pseudocode ( 1.1) Analysis of Algorithms. Primitive Operations. Pseudocode Details. Running Time ( 1.1) Estimating performance

Pseudocode ( 1.1) Analysis of Algorithms. Primitive Operations. Pseudocode Details. Running Time ( 1.1) Estimating performance Aalysis of Algorithms Iput Algorithm Output A algorithm is a step-by-step procedure for solvig a problem i a fiite amout of time. Pseudocode ( 1.1) High-level descriptio of a algorithm More structured

More information

How do we evaluate algorithms?

How do we evaluate algorithms? F2 Readig referece: chapter 2 + slides Algorithm complexity Big O ad big Ω To calculate ruig time Aalysis of recursive Algorithms Next time: Litterature: slides mostly The first Algorithm desig methods:

More information

CS200: Hash Tables. Prichard Ch CS200 - Hash Tables 1

CS200: Hash Tables. Prichard Ch CS200 - Hash Tables 1 CS200: Hash Tables Prichard Ch. 13.2 CS200 - Hash Tables 1 Table Implemetatios: average cases Search Add Remove Sorted array-based Usorted array-based Balaced Search Trees O(log ) O() O() O() O(1) O()

More information

This Unit: Dynamic Scheduling. Can Hardware Overcome These Limits? Scheduling: Compiler or Hardware. The Problem With In-Order Pipelines

This Unit: Dynamic Scheduling. Can Hardware Overcome These Limits? Scheduling: Compiler or Hardware. The Problem With In-Order Pipelines This Uit: Damic Schedulig CSE 560 Computer Sstems Architecture Damic Schedulig Slides origiall developed b Drew Hilto (IBM) ad Milo Marti (Uiversit of Peslvaia) App App App Sstem software Mem CPU I/O Code

More information

Chapter 1. Introduction to Computers and C++ Programming. Copyright 2015 Pearson Education, Ltd.. All rights reserved.

Chapter 1. Introduction to Computers and C++ Programming. Copyright 2015 Pearson Education, Ltd.. All rights reserved. Chapter 1 Itroductio to Computers ad C++ Programmig Copyright 2015 Pearso Educatio, Ltd.. All rights reserved. Overview 1.1 Computer Systems 1.2 Programmig ad Problem Solvig 1.3 Itroductio to C++ 1.4 Testig

More information

COMPUTER ORGANIZATION AND DESIGN

COMPUTER ORGANIZATION AND DESIGN COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Iterface ARM Editio Chapter 6 Parallel Processors from Cliet to Cloud Itroductio Goal: coectig multiple computers to get higher performace Multiprocessors

More information

Programming with Shared Memory PART II. HPC Spring 2017 Prof. Robert van Engelen

Programming with Shared Memory PART II. HPC Spring 2017 Prof. Robert van Engelen Programmig with Shared Memory PART II HPC Sprig 2017 Prof. Robert va Egele Overview Sequetial cosistecy Parallel programmig costructs Depedece aalysis OpeMP Autoparallelizatio Further readig HPC Sprig

More information

MOTIF XF Extension Owner s Manual

MOTIF XF Extension Owner s Manual MOTIF XF Extesio Ower s Maual Table of Cotets About MOTIF XF Extesio...2 What Extesio ca do...2 Auto settig of Audio Driver... 2 Auto settigs of Remote Device... 2 Project templates with Iput/ Output Bus

More information

Lecture 1: Introduction and Strassen s Algorithm

Lecture 1: Introduction and Strassen s Algorithm 5-750: Graduate Algorithms Jauary 7, 08 Lecture : Itroductio ad Strasse s Algorithm Lecturer: Gary Miller Scribe: Robert Parker Itroductio Machie models I this class, we will primarily use the Radom Access

More information

CS 152 Computer Architecture and Engineering. Lecture 16: Graphics Processing Units (GPUs)

CS 152 Computer Architecture and Engineering. Lecture 16: Graphics Processing Units (GPUs) CS 152 Computer Architecture and Engineering Lecture 16: Graphics Processing Units (GPUs) Dr. George Michelogiannakis EECS, University of California at Berkeley CRD, Lawrence Berkeley National Laboratory

More information

Threads and Concurrency in Java: Part 2

Threads and Concurrency in Java: Part 2 Threads ad Cocurrecy i Java: Part 2 1 Waitig Sychroized methods itroduce oe kid of coordiatio betwee threads. Sometimes we eed a thread to wait util a specific coditio has arise. 2003--09 T. S. Norvell

More information

Page 1. Why Care About the Memory Hierarchy? Memory. DRAMs over Time. Virtual Memory!

Page 1. Why Care About the Memory Hierarchy? Memory. DRAMs over Time. Virtual Memory! Why Care About the Memory Hierarchy? Memory Virtual Memory -DRAM Memory Gap (latecy) Reasos: Multi process systems (abstractio & memory protectio) Solutio: Tables (holdig per process traslatios) Fast traslatio

More information

Computer Architecture

Computer Architecture Computer Architecture Overview Prof. Tie-Fu Che Dept. of Computer Sciece Natioal Chug Cheg Uiv Sprig 2002 Overview- Computer Architecture Course Focus Uderstadig the desig techiques, machie structures,

More information

Lecture 5. Counting Sort / Radix Sort

Lecture 5. Counting Sort / Radix Sort Lecture 5. Coutig Sort / Radix Sort T. H. Corme, C. E. Leiserso ad R. L. Rivest Itroductio to Algorithms, 3rd Editio, MIT Press, 2009 Sugkyukwa Uiversity Hyuseug Choo choo@skku.edu Copyright 2000-2018

More information

Operating System Concepts. Operating System Concepts

Operating System Concepts. Operating System Concepts Chapter 4: Mass-Storage Systems Logical Disk Structure Logical Disk Structure Disk Schedulig Disk Maagemet RAID Structure Disk drives are addressed as large -dimesioal arrays of logical blocks, where the

More information

Lower Bounds for Sorting

Lower Bounds for Sorting Liear Sortig Topics Covered: Lower Bouds for Sortig Coutig Sort Radix Sort Bucket Sort Lower Bouds for Sortig Compariso vs. o-compariso sortig Decisio tree model Worst case lower boud Compariso Sortig

More information

. Written in factored form it is easy to see that the roots are 2, 2, i,

. Written in factored form it is easy to see that the roots are 2, 2, i, CMPS A Itroductio to Programmig Programmig Assigmet 4 I this assigmet you will write a java program that determies the real roots of a polyomial that lie withi a specified rage. Recall that the roots (or

More information

Lecture 6. Lecturer: Ronitt Rubinfeld Scribes: Chen Ziv, Eliav Buchnik, Ophir Arie, Jonathan Gradstein

Lecture 6. Lecturer: Ronitt Rubinfeld Scribes: Chen Ziv, Eliav Buchnik, Ophir Arie, Jonathan Gradstein 068.670 Subliear Time Algorithms November, 0 Lecture 6 Lecturer: Roitt Rubifeld Scribes: Che Ziv, Eliav Buchik, Ophir Arie, Joatha Gradstei Lesso overview. Usig the oracle reductio framework for approximatig

More information

K-NET bus. When several turrets are connected to the K-Bus, the structure of the system is as showns

K-NET bus. When several turrets are connected to the K-Bus, the structure of the system is as showns K-NET bus The K-Net bus is based o the SPI bus but it allows to addressig may differet turrets like the I 2 C bus. The K-Net is 6 a wires bus (4 for SPI wires ad 2 additioal wires for request ad ackowledge

More information

n Learn how resiliency strategies reduce risk n Discover automation strategies to reduce risk

n Learn how resiliency strategies reduce risk n Discover automation strategies to reduce risk Chapter Objectives Lear how resiliecy strategies reduce risk Discover automatio strategies to reduce risk Chapter #16: Architecture ad Desig Resiliecy ad Automatio Strategies 2 Automatio/Scriptig Resiliet

More information

Goals of the Lecture UML Implementation Diagrams

Goals of the Lecture UML Implementation Diagrams Goals of the Lecture UML Implemetatio Diagrams Object-Orieted Aalysis ad Desig - Fall 1998 Preset UML Diagrams useful for implemetatio Provide examples Next Lecture Ð A variety of topics o mappig from

More information

CIS 121 Data Structures and Algorithms with Java Spring Stacks, Queues, and Heaps Monday, February 18 / Tuesday, February 19

CIS 121 Data Structures and Algorithms with Java Spring Stacks, Queues, and Heaps Monday, February 18 / Tuesday, February 19 CIS Data Structures ad Algorithms with Java Sprig 09 Stacks, Queues, ad Heaps Moday, February 8 / Tuesday, February 9 Stacks ad Queues Recall the stack ad queue ADTs (abstract data types from lecture.

More information

Copyright 2016 Ramez Elmasri and Shamkant B. Navathe

Copyright 2016 Ramez Elmasri and Shamkant B. Navathe Copyright 2016 Ramez Elmasri ad Shamkat B. Navathe CHAPTER 19 Query Optimizatio Copyright 2016 Ramez Elmasri ad Shamkat B. Navathe Itroductio Query optimizatio Coducted by a query optimizer i a DBMS Goal:

More information

APPLICATION NOTE PACE1750AE BUILT-IN FUNCTIONS

APPLICATION NOTE PACE1750AE BUILT-IN FUNCTIONS APPLICATION NOTE PACE175AE BUILT-IN UNCTIONS About This Note This applicatio brief is iteded to explai ad demostrate the use of the special fuctios that are built ito the PACE175AE processor. These powerful

More information

Introduction to Computing Systems: From Bits and Gates to C and Beyond 2 nd Edition

Introduction to Computing Systems: From Bits and Gates to C and Beyond 2 nd Edition Lecture Goals Itroductio to Computig Systems: From Bits ad Gates to C ad Beyod 2 d Editio Yale N. Patt Sajay J. Patel Origial slides from Gregory Byrd, North Carolia State Uiversity Modified slides by

More information

Pattern Recognition Systems Lab 1 Least Mean Squares

Pattern Recognition Systems Lab 1 Least Mean Squares Patter Recogitio Systems Lab 1 Least Mea Squares 1. Objectives This laboratory work itroduces the OpeCV-based framework used throughout the course. I this assigmet a lie is fitted to a set of poits usig

More information

condition w i B i S maximum u i

condition w i B i S maximum u i ecture 10 Dyamic Programmig 10.1 Kapsack Problem November 1, 2004 ecturer: Kamal Jai Notes: Tobias Holgers We are give a set of items U = {a 1, a 2,..., a }. Each item has a weight w i Z + ad a utility

More information

Lecture 28: Data Link Layer

Lecture 28: Data Link Layer Automatic Repeat Request (ARQ) 2. Go ack N ARQ Although the Stop ad Wait ARQ is very simple, you ca easily show that it has very the low efficiecy. The low efficiecy comes from the fact that the trasmittig

More information

FAST BIT-REVERSALS ON UNIPROCESSORS AND SHARED-MEMORY MULTIPROCESSORS

FAST BIT-REVERSALS ON UNIPROCESSORS AND SHARED-MEMORY MULTIPROCESSORS SIAM J. SCI. COMPUT. Vol. 22, No. 6, pp. 2113 2134 c 21 Society for Idustrial ad Applied Mathematics FAST BIT-REVERSALS ON UNIPROCESSORS AND SHARED-MEMORY MULTIPROCESSORS ZHAO ZHANG AND XIAODONG ZHANG

More information

Bank-interleaved cache or memory indexing does not require euclidean division

Bank-interleaved cache or memory indexing does not require euclidean division Bak-iterleaved cache or memory idexig does ot require euclidea divisio Adré Sezec To cite this versio: Adré Sezec. Bak-iterleaved cache or memory idexig does ot require euclidea divisio. 11th Aual Workshop

More information

Quality of Service. Spring 2018 CS 438 Staff - University of Illinois 1

Quality of Service. Spring 2018 CS 438 Staff - University of Illinois 1 Quality of Service Sprig 2018 CS 438 Staff - Uiversity of Illiois 1 Quality of Service How good are late data ad lowthroughput chaels? It depeds o the applicatio. Do you care if... Your e-mail takes 1/2

More information