Design of Digital Circuits Lecture 17: Out-of-Order, DataFlow, Superscalar Execution. Prof. Onur Mutlu ETH Zurich Spring April 2018

Size: px

Start display at page:

Download "Design of Digital Circuits Lecture 17: Out-of-Order, DataFlow, Superscalar Execution. Prof. Onur Mutlu ETH Zurich Spring April 2018"

Evangeline McDowell
5 years ago
Views:

1 Desig of Digital Circuits Lecture 17: Out-of-Order, DataFlow, Superscalar Executio Prof. Our Mutlu ETH Zurich Sprig April 2018

2 Ageda for Today & Next Few Lectures Sigle-cycle Microarchitectures Multi-cycle ad Microprogrammed Microarchitectures Pipeliig Issues i Pipeliig: Cotrol & Data Depedece Hadlig, State Maiteace ad Recovery, Out-of-Order Executio Other Executio Paradigms 2

3 Remider: Optioal Homeworks Posted olie 3 Optioal Homeworks Optioal Good for your learig =homeworks 3

4 Readigs for Today Smith ad Sohi, The Microarchitecture of Superscalar Processors, Proceedigs of the IEEE, 1995 More advaced pipeliig Iterrupt ad exceptio hadlig Out-of-order ad superscalar executio cocepts H&H Chapters 7.8 ad 7.9 Optioal: Kessler, The Alpha Microprocessor, IEEE Micro

Lecture Aoucemet Moday, April 30, 2018 16:15-17:15 CAB G 61 Apéro after the lecture J Prof.

5 Lecture Aoucemet Moday, April 30, :15-17:15 CAB G 61 Apéro after the lecture J Prof. We-Mei Hwu (Uiversity of Illiois at Urbaa-Champaig) D-INFK Distiguished Collouium Iovative Applicatios ad Techology Pivots A Perfect Storm i Computig 5

6 Geeral Suggestio Atted the Computer Sciece Distiguished Collouia Happes o some Modays 16:15-17:15, followed by a apero Great way of learig about key developmets i the field Ad, meetig leadig researchers 6

7 Out-of-Order Executio (Dyamic Istructio Schedulig)

8 Review: Dataflow Graph for Our Example MUL R3 ß R1, R2 ADD R5 ß R3, R4 ADD R7 ß R2, R6 ADD R10 ß R8, R9 MUL R11 ß R7, R10 ADD R5 ß R5, R11 8

9 Review: State of RAT ad RS i Cycle 7 9

10 Review: Correspodig Dataflow Graph 10

11 Some More Questios (Desig Choices) Whe is a reservatio statio etry deallocated? Exactly whe does a istructio broadcast its tag? Should the reservatio statios be dedicated to each fuctioal uit or global across fuctioal uits? Cetralized vs. Distributed: What are the tradeoffs? Should reservatio statios ad ROB store data values or should there be a cetralized physical register file where all data values are stored? What are the tradeoffs? May other desig choices for OoO egies 11

12 For You: A Exercise, w/ Precise Exceptios MUL R3 ß R1, R2 ADD R5 ß R3, R4 ADD R7 ß R2, R6 ADD R10 ß R8, R9 MUL R11 ß R7, R10 ADD R5 ß R5, R11 F D E R W Assume ADD (4 cycle execute), MUL (6 cycle execute) Assume oe adder ad oe multiplier How may cycles i a o-pipelied machie i a i-order-dispatch pipelied machie with reorder buffer (o forwardig ad full forwardig) i a out-of-order dispatch pipelied machie with reorder buffer (full forwardig) 12

13 Out-of-Order Executio with Precise Exceptios Idea: Use a reorder buffer to reorder istructios before committig them to architectural state A istructio updates the RAT whe it completes executio Also called froted register file A istructio updates a separate architectural register file whe it retires i.e., whe it is the oldest i the machie ad has completed executio I other words, the architectural register file is always updated i program order O a exceptio: flush pipelie, copy architectural register file ito froted register file 13

14 Out-of-Order Executio with Precise Exceptios TAG ad VALUE Broadcast Bus F D S C H E D U L E E Iteger add E E E E Iteger mul E E E E E E E E FP mul E E E E E E E E... R E O R D E R W Load/store i order out of order i order Hump 1: Reservatio statios (schedulig widow) Hump 2: Reorderig (reorder buffer, aka istructio widow or active widow) 14

15 Two Humps i a Moder Pipelie TAG ad VALUE Broadcast Bus F D S C H E D U L E E Iteger add E E E E Iteger mul E E E E E E E E FP mul E E E E E E E E... R E O R D E R W Load/store i order out of order i order Hump 1: Reservatio statios (schedulig widow) Hump 2: Reorderig (reorder buffer, aka istructio widow or active widow) Photo credit: 15

16 Moder OoO Executio w/ Precise Exceptios Most moder processors use the followig Reorder buffer to support i-order retiremet of istructios A sigle register file to store all registers Both speculative ad architectural registers INT ad FP are still separate Two register maps Future/froted register map à used for reamig Architectural register map à used for maitaiig precise state 16

17 A Example from Moder Processors Boggs et al., The Microarchitecture of the Petium 4 Processor, Itel Techology Joural,

18 Eablig OoO Executio, Revisited 1. Lik the cosumer of a value to the producer Register reamig: Associate a tag with each data value 2. Buffer istructios util they are ready Isert istructio ito reservatio statios after reamig 3. Keep track of readiess of source values of a istructio Broadcast the tag whe the value is produced Istructios compare their source tags to the broadcast tag à if match, source value becomes ready 4. Whe all source values of a istructio are ready, dispatch the istructio to fuctioal uit (FU) Wakeup ad select/schedule the istructio 18

19 Summary of OOO Executio Cocepts Register reamig elimiates false depedecies, eables likig of producer to cosumers Bufferig eables the pipelie to move for idepedet ops Tag broadcast eables commuicatio (of readiess of produced value) betwee istructios Wakeup ad select eables out-of-order dispatch 19

20 OOO Executio: Restricted Dataflow A out-of-order egie dyamically builds the dataflow graph of a piece of the program which piece? The dataflow graph is limited to the istructio widow Istructio widow: all decoded but ot yet retired istructios Ca we do it for the whole program? Why would we like to? I other words, how ca we have a large istructio widow? Ca we do it efficietly with Tomasulo s algorithm? 20

21 Recall: Dataflow Graph for Our Example MUL R3 ß R1, R2 ADD R5 ß R3, R4 ADD R7 ß R2, R6 ADD R10 ß R8, R9 MUL R11 ß R7, R10 ADD R5 ß R5, R11 21

22 Recall: State of RAT ad RS i Cycle 7 22

23 Recall: Dataflow Graph 23

24 Questios to Poder Why is OoO executio beeficial? What if all operatios take a sigle cycle? Latecy tolerace: OoO executio tolerates the latecy of multi-cycle operatios by executig idepedet operatios cocurretly What if a istructio takes 500 cycles? How large of a istructio widow do we eed to cotiue decodig? How may cycles of latecy ca OoO tolerate? What limits the latecy tolerace scalability of Tomasulo s algorithm? Active/istructio widow size: determied by both schedulig widow ad reorder buffer size 24

25 Geeral Orgaizatio of a OOO Processor Smith ad Sohi, The Microarchitecture of Superscalar Processors, Proc. IEEE, Dec

26 A Moder OoO Desig: Itel Petium 4 26 Boggs et al., The Microarchitecture of the Petium 4 Processor, Itel Techology Joural, 2001.

27 Itel Petium 4 Simplified Mutlu+, Ruahead Executio, HPCA

28 Alpha Kessler, The Alpha Microprocessor, IEEE Micro, March-April

29 MIPS R10000 Yeager, The MIPS R10000 Superscalar Microprocessor, IEEE Micro, April

30 IBM POWER4 Tedler et al., POWER4 system microarchitecture, IBM J R&D,

31 IBM POWER4 2 cores, out-of-order executio 100-etry istructio widow i each core 8-wide istructio fetch, issue, execute Large, local+global hybrid brach predictor 1.5MB, 8-way L2 cache Aggressive stream based prefetchig 31

32 IBM POWER5 Kalla et al., IBM Power5 Chip: A Dual-Core Multithreaded Processor, IEEE Micro

33 Hadlig Out-of-Order Executio of Loads ad Stores

34 Registers versus Memory So far, we cosidered maily registers as part of state What about memory? What are the fudametal differeces betwee registers ad memory? Register depedeces kow statically memory depedeces determied dyamically Register state is small memory state is large Register state is ot visible to other threads/processors memory state is shared betwee threads/processors (i a shared memory multiprocessor) 34

35 Memory Depedece Hadlig (I) Need to obey memory depedeces i a out-of-order machie ad eed to do so while providig high performace Observatio ad Problem: Memory address is ot kow util a load/store executes Corollary 1: Reamig memory addresses is difficult Corollary 2: Determiig depedece or idepedece of loads/stores eed to be hadled after their (partial) executio Corollary 3: Whe a load/store has its address ready, there may be youger/older loads/stores with udetermied addresses i the machie 35

36 Memory Depedece Hadlig (II) Whe do you schedule a load istructio i a OOO egie? Problem: A youger load ca have its address ready before a older store s address is kow Kow as the memory disambiguatio problem or the ukow address problem Approaches Coservative: Stall the load util all previous stores have computed their addresses (or eve retired from the machie) Aggressive: Assume load is idepedet of ukow-address stores ad schedule the load right away Itelliget: Predict (with a more sophisticated predictor) if the load is depedet o the/ay ukow address store 36

37 Hadlig of Store-Load Depedeces A load s depedece status is ot kow util all previous store addresses are available. How does the OOO egie detect depedece of a load istructio o a previous store? Optio 1: Wait util all previous stores committed (o eed to check for address match) Optio 2: Keep a list of pedig stores i a store buffer ad check whether load address matches a previous store address How does the OOO egie treat the schedulig of a load istructio wrt previous stores? Optio 1: Assume load depedet o all previous stores Optio 2: Assume load idepedet of all previous stores Optio 3: Predict the depedece of a load o a outstadig store 37

38 Memory Disambiguatio (I) Optio 1: Assume load is depedet o all previous stores + No eed for recovery -- Too coservative: delays idepedet loads uecessarily Optio 2: Assume load is idepedet of all previous stores + Simple ad ca be commo case: o delay for idepedet loads -- Reuires recovery ad re-executio of load ad depedets o mispredictio Optio 3: Predict the depedece of a load o a outstadig store + More accurate. Load store depedecies persist over time -- Still reuires recovery/re-executio o mispredictio Alpha : Iitially assume load idepedet, delay loads foud to be depedet Moshovos et al., Dyamic speculatio ad sychroizatio of data depedeces, ISCA Chrysos ad Emer, Memory Depedece Predictio Usig Store Sets, ISCA

Memory Disambiguatio (II) Chrysos ad Emer, Memory Depedece Predictio Usig Store Sets, ISCA 1998.

39 Memory Disambiguatio (II) Chrysos ad Emer, Memory Depedece Predictio Usig Store Sets, ISCA Predictig store-load depedecies importat for performace Simple predictors (based o past history) ca achieve most of the potetial performace 39

40 Data Forwardig Betwee Stores ad Loads We caot update memory out of program order à Need to buffer all store ad load istructios i istructio widow Eve if we kow all addresses of past stores whe we geerate the address of a load, two uestios still remai: 1. How do we check whether or ot it is depedet o a store 2. How do we forward data to the load if it is depedet o a store Moder processors use a LQ (load ueue) ad a SQ for this Ca be combied or separate betwee loads ad stores A load searches the SQ after it computes its address. Why? A store searches the LQ after it computes its address. Why? 40

41 Out-of-Order Completio of Memory Ops Whe a store istructio fiishes executio, it writes its address ad data i its reorder buffer etry Whe a later load istructio geerates its address, it: searches the reorder buffer (or the SQ) with its address accesses memory with its address receives the value from the yougest older istructio that wrote to that address (either from ROB or memory) This is a complicated search logic implemeted as a Cotet Addressable Memory Cotet is memory address (but also eed size ad age) Called store-to-load forwardig logic 41

42 Store-Load Forwardig Complexity Cotet Addressable Search (based o Load Address) Rage Search (based o Address ad Size of both the Load ad earlier Stores) Age-Based Search (for last writte values) Load data ca come from a combiatio of multiple places Oe or more stores i the Store Buffer (SQ) Memory/cache 42

43 Other Approaches to Cocurrecy (or Istructio Level Parallelism)

44 Approaches to (Istructio-Level) Cocurrecy Pipeliig Out-of-order executio Dataflow (at the ISA level) Superscalar Executio VLIW Fie-Graied Multithreadig SIMD Processig (Vector ad array processors, GPUs) Decoupled Access Execute Systolic Arrays 44

45 Review: Data Flow: Exploitig Irregular Parallelism

46 Data Flow Summary Availability of data determies order of executio A data flow ode fires whe its sources are ready Programs represeted as data flow graphs (of odes) Data Flow at the ISA level has ot bee (as) successful Data Flow implemetatios uder the hood (while preservig seuetial ISA sematics) have bee very successful Out of order executio is the prime example 46

47 Pure Data Flow Advatages/Disadvatages Advatages Very good at exploitig irregular parallelism Oly real depedecies costrai processig More parallelism ca be exposed tha vo Neuma model Disadvatages No precise state sematics Debuggig very difficult Iterrupt/exceptio hadlig is difficult (what is precise state sematics?) Too much parallelism? (Parallelism cotrol eeded) High bookkeepig overhead (tag matchig, data storage) 47

48 Approaches to (Istructio-Level) Cocurrecy Pipeliig Out-of-order executio Dataflow (at the ISA level) Superscalar Executio VLIW Fie-Graied Multithreadig SIMD Processig (Vector ad array processors, GPUs) Decoupled Access Execute Systolic Arrays 48

49 Superscalar Executio

50 Superscalar Executio Idea: Fetch, decode, execute, retire multiple istructios per cycle N-wide superscalar à N istructios per cycle Need to add the hardware resources for doig so Hardware performs the depedece checkig betwee cocurretly-fetched istructios Superscalar executio ad out-of-order executio are orthogoal cocepts Ca have all four combiatios of processors: [i-order, out-of-order] x [scalar, superscalar] 50

51 Caregie Mello I-Order Superscalar Processor Example Multiple copies of datapath: Ca issue multiple istructios at per cycle Depedecies make it tricky to issue multiple istructios at oce CLK CLK CLK CLK CLK PC A RD Istructio Memory A1 A2 A3 A4 A5 A6 WD3 WD6 Register File RD1 RD4 RD2 RD5 ALUs A1 RD1 A2 RD2 Data Memory WD1 WD2 Here: Ideal IPC = 2 51

52 Caregie Mello I-Order Superscalar Performace Example lw $t0, 40($s0) add $t1, $s1, $s2 sub $t2, $s1, $s3 ad $t3, $s3, $s4 or $t4, $s1, $s5 sw $s5, 80($s0) Ideal IPC = Time (cycles) lw $t0, 40($s0) add $t1, $s1, $s2 IM lw add RF $s0 40 $s1 $s2 + + DM $t0 $t1 RF sub $t2, $s1, $s3 ad $t3, $s3, $s4 IM $s1 sub $t2 $s3 - RF DM $s3 ad $t3 $s4 & RF or $t4, $s1, $s5 sw $s5, 80($s0) IM or sw RF $s1 $s5 $s DM $s5 $t4 RF Actual IPC = 2 (6 istructios issued i 3 cycles) 52

53 Caregie Mello Superscalar Performace with Depedecies lw $t0, 40($s0) add $t1, $t0, $s1 sub $t0, $s2, $s3 ad $t2, $s4, $t0 or $t3, $s5, $s6 sw $s7, 80($t3) Ideal IPC = Time (cycles) lw $t0, 40($s0) IM lw RF $s DM $t0 RF add $t1, $t0, $s1 sub $t0, $s2, $s3 IM add sub RF $t0 $s1 $s2 $s3 RF $t0 $s1 $s2 $s3 + - DM $t1 $t0 RF ad $t2, $s4, $t0 or $t3, $s5, $s6 Stall ad IM or IM ad or RF $s4 $t0 $s5 $s6 & DM $t2 $t3 RF sw $s7, 80($t3) IM sw RF $t $s7 DM RF Actual IPC = 1.2 (6 istructios issued i 5 cycles) 53

54 Superscalar Tradeoffs Advatages Higher IPC (istructios per cycle) Disadvatages Higher complexity for depedecy checkig Reuire checkig withi a pipelie stage Reamig becomes more complex i a OoO processor More hardware resources eeded 54

55 Desig of Digital Circuits Lecture 17: Out-of-Order, DataFlow, Superscalar Executio Prof. Our Mutlu ETH Zurich Sprig April 2018

56 We did ot cover the followig slides i lecture. These are for your preparatio for the ext lecture.

57 Approaches to (Istructio-Level) Cocurrecy Pipeliig Out-of-order executio Dataflow (at the ISA level) Superscalar Executio VLIW Fie-Graied Multithreadig SIMD Processig (Vector ad array processors, GPUs) Decoupled Access Execute Systolic Arrays 57

58 VLIW

59 VLIW Cocept Superscalar Hardware fetches multiple istructios ad checks depedecies betwee them VLIW (Very Log Istructio Word) Software (compiler) packs idepedet istructios i a larger istructio budle to be fetched ad executed cocurretly Hardware fetches ad executes the istructios i the budle cocurretly No eed for hardware depedecy checkig betwee cocurretly-fetched istructios i the VLIW model 59

60 VLIW Cocept Fisher, Very Log Istructio Word architectures ad the ELI-512, ISCA ELI: Eormously logword istructios (512 bits) 60

61 VLIW (Very Log Istructio Word) A very log istructio word cosists of multiple idepedet istructios packed together by the compiler Packed istructios ca be logically urelated (cotrast with SIMD/vector processors, which we will see soo) Idea: Compiler fids idepedet istructios ad statically schedules (i.e. packs/budles) them ito a sigle VLIW istructio Traditioal Characteristics Multiple fuctioal uits All istructios i a budle are executed i lock step Istructios i a budle statically aliged to be directly fed ito the fuctioal uits 61

62 Caregie Mello VLIW Performace Example (2-wide budles) lw $t0, 40($s0) add $t1, $s1, $s2 sub $t2, $s1, $s3 ad $t3, $s3, $s4 or $t4, $s1, $s5 sw $s5, 80($s0) Ideal IPC = Time (cycles) lw $t0, 40($s0) add $t1, $s1, $s2 IM lw add RF $s0 40 $s1 $s2 + + DM $t0 $t1 RF sub $t2, $s1, $s3 ad $t3, $s3, $s4 IM $s1 sub $t2 $s3 - RF DM $s3 ad $t3 $s4 & RF or $t4, $s1, $s5 sw $s5, 80($s0) IM or sw RF $s1 $s5 $s DM $s5 $t4 RF Actual IPC = 2 (6 istructios issued i 3 cycles) 62

63 VLIW Lock-Step Executio Lock-step (all or oe) executio: If ay operatio i a VLIW istructio stalls, all istructios stall I a truly VLIW machie, the compiler hadles all depedecy-related stalls, hardware does ot perform depedecy checkig What about variable latecy operatios? 63

64 VLIW Philosophy Philosophy similar to RISC (simple istructios ad hardware) Except multiple istructios i parallel RISC (Joh Cocke, 1970s, IBM 801 miicomputer) Compiler does the hard work to traslate high-level laguage code to simple istructios (Joh Cocke: cotrol sigals) Ad, to reorder simple istructios for high performace Hardware does little traslatio/decodig à very simple VLIW (Josh Fisher, ISCA 1983) Compiler does the hard work to fid istructio level parallelism Hardware stays as simple ad streamlied as possible Executes each istructio i a budle i lock step Simple à higher freuecy, easier to desig 64

65 Commercial VLIW Machies Multiflow TRACE, Josh Fisher (7-wide, 28-wide) Cydrome Cydra 5, Bob Rau Trasmeta Crusoe: x86 biary-traslated ito iteral VLIW TI C6000, Trimedia, STMicro (DSP & embedded processors) Most successful commercially Itel IA-64 Not fully VLIW, but based o VLIW priciples EPIC (Explicitly Parallel Istructio Computig) Istructio budles ca have depedet istructios A few bits i the istructio format specify explicitly which istructios i the budle are depedet o which other oes 65

66 VLIW Tradeoffs Advatages + No eed for dyamic schedulig hardware à simple hardware + No eed for depedecy checkig withi a VLIW istructio à simple hardware for multiple istructio issue + o reamig + No eed for istructio aligmet/distributio after fetch to differet fuctioal uits à simple hardware Disadvatages -- Compiler eeds to fid N idepedet operatios per cycle -- If it caot, iserts NOPs i a VLIW istructio -- Parallelism loss AND code size icrease -- Recompilatio reuired whe executio width (N), istructio latecies, fuctioal uits chage (Ulike superscalar processig) -- Lockstep executio causes idepedet operatios to stall -- No istructio ca progress util the logest-latecy istructio completes 66

67 VLIW Summary VLIW simplifies hardware, but reuires complex compiler techiues Solely-compiler approach of VLIW has several dowsides that reduce performace -- Too may NOPs (ot eough parallelism discovered) -- Static schedule itimately tied to microarchitecture -- Code optimized for oe geeratio performs poorly for ext -- No tolerace for variable or log-latecy operatios (lock step) ++ Most compiler optimizatios developed for VLIW employed i optimizig compilers (for superscalar compilatio) Eable code optimizatios ++ VLIW successful whe parallelism is easier to fid by the compiler (traditioally embedded markets, DSPs) 67

68 A Example Work: Superblock Hwu et al., The superblock: A effective techiue for VLIW ad superscalar compilatio. The Joural of Supercomputig, Lecture Video o Static Istructio Schedulig 68

69 Aother Example Work: IMPACT Chag et al., IMPACT: a architectural framework for multiple-istructio-issue processors. ISCA

70 Approaches to (Istructio-Level) Cocurrecy Pipeliig Out-of-order executio Dataflow (at the ISA level) Superscalar Executio VLIW Fie-Graied Multithreadig SIMD Processig (Vector ad array processors, GPUs) Decoupled Access Execute Systolic Arrays 70

71 Recall: How to Hadle Data Depedeces Ati ad output depedeces are easier to hadle write to the destiatio i oe stage ad i program order Flow depedeces are more iterestig Five fudametal ways of hadlig flow depedeces Detect ad wait util value is available i register file Detect ad forward/bypass data to depedet istructio Detect ad elimiate the depedece at the software level No eed for the hardware to detect depedece Predict the eeded value(s), execute speculatively, ad verify Do somethig else (fie-graied multithreadig) No eed to detect 71

72 Fie-Graied Multithreadig 72

73 Fie-Graied Multithreadig Idea: Hardware has multiple thread cotexts (PC+registers). Each cycle, fetch egie fetches from a differet thread. By the time the fetched brach/istructio resolves, o istructio is fetched from the same thread Brach/istructio resolutio latecy overlapped with executio of other threads istructios + No logic eeded for hadlig cotrol ad data depedeces withi a thread -- Sigle thread performace suffers -- Extra logic for keepig thread cotexts -- Does ot overlap latecy if ot eough threads to cover the whole pipelie 73

74 Fie-Graied Multithreadig (II) Idea: Switch to aother thread every cycle such that o two istructios from a thread are i the pipelie cocurretly Tolerates the cotrol ad data depedecy latecies by overlappig the latecy with useful work from other threads Improves pipelie utilizatio by takig advatage of multiple threads Thorto, Parallel Operatio i the Cotrol Data 6600, AFIPS Smith, A pipelied, shared resource MIMD computer, ICPP

75 Fie-Graied Multithreadig: History CDC 6600 s peripheral processig uit is fie-graied multithreaded Thorto, Parallel Operatio i the Cotrol Data 6600, AFIPS Processor executes a differet I/O thread every cycle A operatio from the same thread is executed every 10 cycles Deelcor HEP (Heterogeeous Elemet Processor) Smith, A pipelied, shared resource MIMD computer, ICPP threads/processor available ueue vs. uavailable (waitig) ueue for threads each thread ca have oly 1 istructio i the processor pipelie; each thread idepedet to each thread, processor looks like a o-pipelied machie system throughput vs. sigle thread performace tradeoff 75

76 Fie-Graied Multithreadig i HEP Cycle time: 100s 8 stages à 800 s to complete a istructio assumig o memory access No cotrol ad data depedecy checkig Burto Smith ( ) 76

77 Multithreaded Pipelie Example Slide credit: Joel Emer 77

78 Su Niagara Multithreaded Pipelie Kogetira et al., Niagara: A 32-Way Multithreaded Sparc Processor, IEEE Micro

79 Fie-graied Multithreadig Advatages + No eed for depedecy checkig betwee istructios (oly oe istructio i pipelie from a sigle thread) + No eed for brach predictio logic + Otherwise-bubble cycles used for executig useful istructios from differet threads + Improved system throughput, latecy tolerace, utilizatio Disadvatages - Extra hardware complexity: multiple hardware cotexts (PCs, register files, ), thread selectio logic - Reduced sigle thread performace (oe istructio fetched every N cycles from the same thread) - Resource cotetio betwee threads i caches ad memory - Some depedecy checkig logic betwee threads remais (load/store) 79

80 Moder GPUs are FGMT Machies 80

81 NVIDIA GeForce GTX 285 core 64 KB of storage for thread cotexts (registers) = data-parallel (SIMD) fuc. uit, cotrol shared across 8 uits = multiply-add = multiply = istructio stream decode = executio cotext storage Slide credit: Kayvo Fatahalia 81

82 NVIDIA GeForce GTX 285 core 64 KB of storage for thread cotexts (registers) Groups of 32 threads share istructio stream (each group is a Warp): they execute the same istructio o differet data Up to 32 warps are iterleaved i a FGMT maer Up to 1024 thread cotexts ca be stored Slide credit: Kayvo Fatahalia 82

83 NVIDIA GeForce GTX 285 Tex Tex Tex Tex Tex Tex Tex Tex Tex Tex 30 cores o the GTX 285: 30,720 threads Slide credit: Kayvo Fatahalia 83

84 Ed of Fie-Graied Multithreadig 84

85 I Memory of Burto Smith Burto Smith ( ) 85

86 I Memory of Burto Smith (II) 86

87 Wedesday Keyote (HiPEAC 2015) Resource Maagemet i PACORA Burto J. Smith Microsoft

88 Techical Fellow at Microsoft Burto Smith Past: Co-fouder, chief scietist, chairma of Tera/Cray, Deelcor, Professor at Colorado Eckert-Mauchly Award i 1991, Seymour Cray Award, US Natioal Academy of Egieerig, AAAS/ACM/IEEE Fellow ad may other hoors May wide-rage cotributios spaig architecture, system software, compilers,, icludig: Deelcor HEP, Tera MTA fie-graied sychroizatio, commuicatio, multithreadig parallel architectures, resource maagemet, itercoectio etworks Oe I would like to share: Smith, A pipelied, shared resource MIMD computer, ICPP 1978.

89 Fie-graied Multithreadig i HEP 128 processes (hardware cotexts) Cycle time: 100s 8 stages à 800 s to complete a istructio assumig o memory access No cotrol ad data depedecy checkig 89

90 Wedesday Keyote (HiPEAC 2015) Resource Maagemet i PACORA Burto J. Smith Microsoft

Design of Digital Circuits Lecture 16: Out-of-Order Execution. Prof. Onur Mutlu ETH Zurich Spring April 2018

Design of Digital Circuits Lecture 16: Out-of-Order Execution. Prof. Onur Mutlu ETH Zurich Spring April 2018 Desig of Digital Circuits Lecture 16: Out-of-Order Executio Prof. Our Mutlu ETH Zurich Sprig 2018 26 April 2018 Ageda for Today & Next Few Lectures Sigle-cycle Microarchitectures Multi-cycle ad Microprogrammed