Design of Digital Circuits Lecture 16: Out-of-Order Execution. Prof. Onur Mutlu ETH Zurich Spring April 2018

Size: px

Start display at page:

Download "Design of Digital Circuits Lecture 16: Out-of-Order Execution. Prof. Onur Mutlu ETH Zurich Spring April 2018"

April Atkinson
5 years ago
Views:

1 Desig of Digital Circuits Lecture 16: Out-of-Order Executio Prof. Our Mutlu ETH Zurich Sprig April 2018

2 Ageda for Today & Next Few Lectures Sigle-cycle Microarchitectures Multi-cycle ad Microprogrammed Microarchitectures Pipeliig Issues i Pipeliig: Cotrol & Data Depedece Hadlig, State Maiteace ad Recovery, Out-of-Order Executio Other Executio Paradigms 2

3 Remider: Optioal Homeworks Posted olie 3 Optioal Homeworks Optioal Good for your learig =homeworks 3

4 Readigs Specifically for Today Smith ad Sohi, The Microarchitecture of Superscalar Processors, Proceedigs of the IEEE, 1995 More advaced pipeliig Iterrupt ad exceptio hadlig Out-of-order ad superscalar executio cocepts Optioal: Kessler, The Alpha Microprocessor, IEEE Micro

Lecture Aoucemet Moday, April 30, 2018 16:15-17:15 CAB G 61 Apéro after the lecture J Prof.

5 Lecture Aoucemet Moday, April 30, :15-17:15 CAB G 61 Apéro after the lecture J Prof. We-Mei Hwu (Uiversity of Illiois at Urbaa-Champaig) D-INFK Distiguished Collouium Iovative Applicatios ad Techology Pivots A Perfect Storm i Computig 5

6 Geeral Suggestio Atted the Computer Sciece Distiguished Collouia Happes o some Modays 16:15-17:15, followed by a apero Great way of learig about key developmets i the field Ad, meetig leadig researchers 6

7 Review: I-Order Pipelie with Reorder Buffer Decode (D): Access regfile/rob, allocate etry i ROB, check if istructio ca execute, if so dispatch istructio (sed to fuctioal uit) Execute (E): Istructios ca complete out-of-order Completio (R): Write result to reorder buffer Retiremet/Commit (W): Check for exceptios; if oe, write result to architectural register file or memory; else, flush pipelie ad start from exceptio hadler I-order dispatch/executio, out-of-order completio, i-order retiremet E Iteger add F D R E E E E Iteger mul E E E E E E E E FP mul E E E E E E E E... R W Load/store 7

8 Remember: Register Reamig with a Reorder Buffer Output ad ati depedecies are ot true depedecies WHY? The same register refers to values that have othig to do with each other They exist due to lack of register ID s (i.e. ames) i the ISA The register ID is reamed to the reorder buffer etry that will hold the register s value Register ID à ROB etry ID Architectural register ID à Physical register ID After reamig, ROB etry ID used to refer to the register This elimiates ati ad output depedecies Gives the illusio that there are a large umber of registers 8

9 Remember: Reamig Example Assume Register file has a poiter to the reorder buffer etry that cotais or will cotai the value, if the register is ot valid Reorder buffer works as described before Where is the latest defiitio of R3 for each istructio below i seuetial order? LD R0(0) à R3 LD R3, R1 à R10 MUL R1, R2 à R3 MUL R3, R4 à R11 ADD R5, R6 à R3 ADD R7, R8 à R12 9

10 Reorder Buffer Tradeoffs Advatages Coceptually simple for supportig precise exceptios Ca elimiate false depedeces Disadvatages Reorder buffer eeds to be accessed to get the results that are yet to be writte to the register file CAM or idirectio à icreased latecy ad complexity Other solutios aim to elimiate the disadvatages History buffer Future file Checkpoitig We will ot cover these 10

11 Out-of-Order Executio (Dyamic Istructio Schedulig)

12 A I-order Pipelie E Iteger add F D E E E E Iteger mul E E E E E E E E FP mul R W E E E E E E E E... Cache miss Dispatch: Act of sedig a istructio to a fuctioal uit Reamig with ROB elimiates stalls due to false depedecies Problem: A true data depedecy stalls dispatch of youger istructios ito fuctioal (executio) uits 12

13 Ca We Do Better? What do the followig two pieces of code have i commo (with respect to executio i the previous desig)? IMUL R3 ß R1, R2 ADD R3 ß R3, R1 ADD R4 ß R6, R7 IMUL R5 ß R6, R8 ADD R7 ß R9, R9 LD R3 ß R1 (0) ADD R3 ß R3, R1 ADD R4 ß R6, R7 IMUL R5 ß R6, R8 ADD R7 ß R9, R9 Aswer: First ADD stalls the whole pipelie! ADD caot dispatch because its source registers uavailable Later idepedet istructios caot get executed How are the above code portios differet? Aswer: Load latecy is variable (ukow util rutime) What does this affect? Thik compiler vs. microarchitecture 13

14 Prevetig Dispatch Stalls Problem: i-order dispatch (schedulig, or executio) Solutio: out-of-order dispatch (schedulig, or executio) Actually, we have see the basic idea before: Dataflow: fetch ad fire a istructio oly whe its iputs are ready We will use similar priciples, but ot expose it i the ISA Aside: Ay other way to prevet dispatch stalls? 1. Compile-time istructio schedulig/reorderig 2. Value predictio 3. Fie-graied multithreadig 14

15 Out-of-order Executio (Dyamic Schedulig) Idea: Move the depedet istructios out of the way of idepedet oes (s.t. idepedet oes ca execute) Rest areas for depedet istructios: Reservatio statios Moitor the source values of each istructio i the restig area Whe all source values of a istructio are available, fire (i.e. dispatch) the istructio Istructios dispatched i dataflow (ot cotrol-flow) order Beefit: Latecy tolerace: Allows idepedet istructios to execute ad complete i the presece of a log-latecy operatio 15

16 I-order vs. Out-of-order Dispatch I order dispatch + precise exceptios: F D E E E E R W F D STALL E R W F STALL D E R W F D E E E E E R W F D STALL E R W IMUL R3 ß R1, R2 ADD R3 ß R3, R1 ADD R1 ß R6, R7 IMUL R5 ß R6, R8 ADD R7 ß R3, R5 Out-of-order dispatch + precise exceptios: F D E E E E R W F D F D WAIT E R E R W W F D E E E E R W F D WAIT E R W 16 vs. 12 cycles 16

17 Eablig OoO Executio 1. Need to lik the cosumer of a value to the producer Register reamig: Associate a tag with each data value 2. Need to buffer istructios util they are ready to execute Isert istructio ito reservatio statios after reamig 3. Istructios eed to keep track of readiess of source values Broadcast the tag whe the value is produced Istructios compare their source tags to the broadcast tag à if match, source value becomes ready 4. Whe all source values of a istructio are ready, eed to dispatch the istructio to its fuctioal uit (FU) Istructio wakes up if all sources are ready If multiple istructios are awake, eed to select oe per FU 17

18 Tomasulo s Algorithm for OoO Executio OoO with register reamig iveted by Robert Tomasulo Used i IBM 360/91 Floatig Poit Uits Read: Tomasulo, A Efficiet Algorithm for Exploitig Multiple Arithmetic Uits, IBM Joural of R&D, Ja What is the major differece today? Precise exceptios Provided by Patt, Hwu, Shebaow, HPS, a ew microarchitecture: ratioale ad itroductio, MICRO Patt et al., Critical issues regardig HPS, a high performace microarchitecture, MICRO OoO variats are used i most high-performace processors Iitially i Itel Petium Pro, AMD K5 Alpha 21264, MIPS R10000, IBM POWER5, IBM z196, Oracle UltraSPARC T4, ARM Cortex A15 18

19 Two Humps i a Moder Pipelie TAG ad VALUE Broadcast Bus F D S C H E D U L E E Iteger add E E E E Iteger mul E E E E E E E E FP mul E E E E E E E E... R E O R D E R W Load/store i order out of order i order Hump 1: Reservatio statios (schedulig widow) Hump 2: Reorderig (reorder buffer, aka istructio widow or active widow) 19

20 Two Humps i a Moder Pipelie TAG ad VALUE Broadcast Bus F D S C H E D U L E E Iteger add E E E E Iteger mul E E E E E E E E FP mul E E E E E E E E... R E O R D E R W Load/store i order out of order i order Hump 1: Reservatio statios (schedulig widow) Hump 2: Reorderig (reorder buffer, aka istructio widow or active widow) Photo credit: 20

21 Geeral Orgaizatio of a OOO Processor Smith ad Sohi, The Microarchitecture of Superscalar Processors, Proc. IEEE, Dec

22 Tomasulo s Machie: IBM 360/91 from memory from istructio uit FP registers load buffers store buffers operatio bus FP FU FP FU reservatio statios to memory Commo data bus 22

23 Register Reamig Output ad ati depedecies are ot true depedecies WHY? The same register refers to values that have othig to do with each other They exist because ot eough register ID s (i.e. ames) i the ISA The register ID is reamed to the reservatio statio etry that will hold the register s value Register ID à RS etry ID Architectural register ID à Physical register ID After reamig, RS etry ID used to refer to the register This elimiates ati- ad output- depedecies Approximates the performace effect of a large umber of registers eve though ISA has a small umber 23

24 Tomasulo s Algorithm: Reamig Register reame table (register alias table) tag value valid? R0 R1 R2 R3 R4 R5 R6 R7 R8 R

25 Tomasulo s Algorithm If reservatio statio available before reamig Istructio + reamed operads (source value/tag) iserted ito the reservatio statio Oly reame if reservatio statio is available Else stall While i reservatio statio, each istructio: Watches commo data bus (CDB) for tag of its sources Whe tag see, grab value for the source ad keep it i the reservatio statio Whe both operads available, istructio ready to be dispatched Dispatch istructio to the Fuctioal Uit whe istructio is ready After istructio fiishes i the Fuctioal Uit Arbitrate for CDB Put tagged value oto CDB (tag broadcast) Register file is coected to the CDB Register cotais a tag idicatig the latest writer to the register If the tag i the register file matches the broadcast tag, write broadcast value ito register (ad set valid bit) Reclaim reame tag o valid copy of tag i system! 25

26 A Exercise MUL R3 ß R1, R2 ADD R5 ß R3, R4 ADD R7 ß R2, R6 ADD R10 ß R8, R9 MUL R11 ß R7, R10 ADD R5 ß R5, R11 F D E W Assume ADD (4 cycle execute), MUL (6 cycle execute) Assume oe adder ad oe multiplier How may cycles i a o-pipelied machie i a i-order-dispatch pipelied machie with imprecise exceptios (o forwardig ad full forwardig) i a out-of-order dispatch pipelied machie imprecise exceptios (full forwardig) 26

27 Exercise Cotiued 27

28 Exercise Cotiued 28

29 Exercise Cotiued MUL R3 ß R1, R2 ADD R5 ß R3, R4 ADD R7 ß R2, R6 ADD R10 ß R8, R9 MUL R11 ß R7, R10 ADD R5 ß R5, R11 29

30 How It Works 30

31 Cycle 0 31

32 Cycle 2 32

33 Cycle 3 33

34 Cycle 4 34

35 Cycle 7 35

36 Cycle 8 36

37 Some Questios What is eeded i hardware to perform tag broadcast ad value capture? à make a value valid à wake up a istructio Does the tag have to be the ID of the Reservatio Statio Etry? What ca potetially become the critical path? Tag broadcast à value capture à istructio wake up How ca you reduce the potetial critical paths? 37

38 Dataflow Graph for Our Example MUL R3 ß R1, R2 ADD R5 ß R3, R4 ADD R7 ß R2, R6 ADD R10 ß R8, R9 MUL R11 ß R7, R10 ADD R5 ß R5, R11 38

39 State of RAT ad RS i Cycle 7 39

40 Dataflow Graph 40

41 Desig of Digital Circuits Lecture 16: Out-of-Order Executio Prof. Our Mutlu ETH Zurich Sprig April 2018

42 We did ot cover the followig slides i lecture. These are for your preparatio for the ext lecture.

43 A Exercise, with Precise Exceptios MUL R3 ß R1, R2 ADD R5 ß R3, R4 ADD R7 ß R2, R6 ADD R10 ß R8, R9 MUL R11 ß R7, R10 ADD R5 ß R5, R11 F D E R W Assume ADD (4 cycle execute), MUL (6 cycle execute) Assume oe adder ad oe multiplier How may cycles i a o-pipelied machie i a i-order-dispatch pipelied machie with reorder buffer (o forwardig ad full forwardig) i a out-of-order dispatch pipelied machie with reorder buffer (full forwardig) 43

44 Out-of-Order Executio with Precise Exceptios Idea: Use a reorder buffer to reorder istructios before committig them to architectural state A istructio updates the RAT whe it completes executio Also called froted register file A istructio updates a separate architectural register file whe it retires i.e., whe it is the oldest i the machie ad has completed executio I other words, the architectural register file is always updated i program order O a exceptio: flush pipelie, copy architectural register file ito froted register file 44

45 Out-of-Order Executio with Precise Exceptios TAG ad VALUE Broadcast Bus F D S C H E D U L E E Iteger add E E E E Iteger mul E E E E E E E E FP mul E E E E E E E E... R E O R D E R W Load/store i order out of order i order Hump 1: Reservatio statios (schedulig widow) Hump 2: Reorderig (reorder buffer, aka istructio widow or active widow) 45

46 Two Humps i a Moder Pipelie TAG ad VALUE Broadcast Bus F D S C H E D U L E E Iteger add E E E E Iteger mul E E E E E E E E FP mul E E E E E E E E... R E O R D E R W Load/store i order out of order i order Hump 1: Reservatio statios (schedulig widow) Hump 2: Reorderig (reorder buffer, aka istructio widow or active widow) Photo credit: 46

47 Moder OoO Executio w/ Precise Exceptios Most moder processors use the followig Reorder buffer to support i-order retiremet of istructios A sigle register file to store all registers Both speculative ad architectural registers INT ad FP are still separate Two register maps Future/froted register map à used for reamig Architectural register map à used for maitaiig precise state 47

48 A Example from Moder Processors Boggs et al., The Microarchitecture of the Petium 4 Processor, Itel Techology Joural,

49 Eablig OoO Executio, Revisited 1. Lik the cosumer of a value to the producer Register reamig: Associate a tag with each data value 2. Buffer istructios util they are ready Isert istructio ito reservatio statios after reamig 3. Keep track of readiess of source values of a istructio Broadcast the tag whe the value is produced Istructios compare their source tags to the broadcast tag à if match, source value becomes ready 4. Whe all source values of a istructio are ready, dispatch the istructio to fuctioal uit (FU) Wakeup ad select/schedule the istructio 49

50 Summary of OOO Executio Cocepts Register reamig elimiates false depedecies, eables likig of producer to cosumers Bufferig eables the pipelie to move for idepedet ops Tag broadcast eables commuicatio (of readiess of produced value) betwee istructios Wakeup ad select eables out-of-order dispatch 50

51 OOO Executio: Restricted Dataflow A out-of-order egie dyamically builds the dataflow graph of a piece of the program which piece? The dataflow graph is limited to the istructio widow Istructio widow: all decoded but ot yet retired istructios Ca we do it for the whole program? Why would we like to? I other words, how ca we have a large istructio widow? Ca we do it efficietly with Tomasulo s algorithm? 51

52 Recall: Dataflow Graph for Our Example MUL R3 ß R1, R2 ADD R5 ß R3, R4 ADD R7 ß R2, R6 ADD R10 ß R8, R9 MUL R11 ß R7, R10 ADD R5 ß R5, R11 52

53 Recall: State of RAT ad RS i Cycle 7 53

54 Recall: Dataflow Graph 54

55 Questios to Poder Why is OoO executio beeficial? What if all operatios take sigle cycle? Latecy tolerace: OoO executio tolerates the latecy of multi-cycle operatios by executig idepedet operatios cocurretly What if a istructio takes 500 cycles? How large of a istructio widow do we eed to cotiue decodig? How may cycles of latecy ca OoO tolerate? What limits the latecy tolerace scalability of Tomasulo s algorithm? Active/istructio widow size: determied by both schedulig widow ad reorder buffer size 55

56 Geeral Orgaizatio of a OOO Processor Smith ad Sohi, The Microarchitecture of Superscalar Processors, Proc. IEEE, Dec

57 A Moder OoO Desig: Itel Petium 4 57 Boggs et al., The Microarchitecture of the Petium 4 Processor, Itel Techology Joural, 2001.

58 Itel Petium 4 Simplified Mutlu+, Ruahead Executio, HPCA

59 Alpha Kessler, The Alpha Microprocessor, IEEE Micro, March-April

60 MIPS R10000 Yeager, The MIPS R10000 Superscalar Microprocessor, IEEE Micro, April

61 IBM POWER4 Tedler et al., POWER4 system microarchitecture, IBM J R&D,

62 IBM POWER4 2 cores, out-of-order executio 100-etry istructio widow i each core 8-wide istructio fetch, issue, execute Large, local+global hybrid brach predictor 1.5MB, 8-way L2 cache Aggressive stream based prefetchig 62

63 IBM POWER5 Kalla et al., IBM Power5 Chip: A Dual-Core Multithreaded Processor, IEEE Micro

64 Hadlig Out-of-Order Executio of Loads ad Stores

65 Recall: Registers versus Memory So far, we cosidered maily registers as part of state What about memory? What are the fudametal differeces betwee registers ad memory? Register depedeces kow statically memory depedeces determied dyamically Register state is small memory state is large Register state is ot visible to other threads/processors memory state is shared betwee threads/processors (i a shared memory multiprocessor) 65

66 Memory Depedece Hadlig (I) Need to obey memory depedeces i a out-of-order machie ad eed to do so while providig high performace Observatio ad Problem: Memory address is ot kow util a load/store executes Corollary 1: Reamig memory addresses is difficult Corollary 2: Determiig depedece or idepedece of loads/stores eed to be hadled after their (partial) executio Corollary 3: Whe a load/store has its address ready, there may be youger/older loads/stores with udetermied addresses i the machie 66

67 Memory Depedece Hadlig (II) Whe do you schedule a load istructio i a OOO egie? Problem: A youger load ca have its address ready before a older store s address is kow Kow as the memory disambiguatio problem or the ukow address problem Approaches Coservative: Stall the load util all previous stores have computed their addresses (or eve retired from the machie) Aggressive: Assume load is idepedet of ukow-address stores ad schedule the load right away Itelliget: Predict (with a more sophisticated predictor) if the load is depedet o the/ay ukow address store 67

68 Hadlig of Store-Load Depedeces A load s depedece status is ot kow util all previous store addresses are available. How does the OOO egie detect depedece of a load istructio o a previous store? Optio 1: Wait util all previous stores committed (o eed to check for address match) Optio 2: Keep a list of pedig stores i a store buffer ad check whether load address matches a previous store address How does the OOO egie treat the schedulig of a load istructio wrt previous stores? Optio 1: Assume load depedet o all previous stores Optio 2: Assume load idepedet of all previous stores Optio 3: Predict the depedece of a load o a outstadig store 68

69 Memory Disambiguatio (I) Optio 1: Assume load depedet o all previous stores + No eed for recovery -- Too coservative: delays idepedet loads uecessarily Optio 2: Assume load idepedet of all previous stores + Simple ad ca be commo case: o delay for idepedet loads -- Reuires recovery ad re-executio of load ad depedets o mispredictio Optio 3: Predict the depedece of a load o a outstadig store + More accurate. Load store depedecies persist over time -- Still reuires recovery/re-executio o mispredictio Alpha : Iitially assume load idepedet, delay loads foud to be depedet Moshovos et al., Dyamic speculatio ad sychroizatio of data depedeces, ISCA Chrysos ad Emer, Memory Depedece Predictio Usig Store Sets, ISCA

Memory Disambiguatio (II) Chrysos ad Emer, Memory Depedece Predictio Usig Store Sets, ISCA 1998.

70 Memory Disambiguatio (II) Chrysos ad Emer, Memory Depedece Predictio Usig Store Sets, ISCA Predictig store-load depedecies importat for performace Simple predictors (based o past history) ca achieve most of the potetial performace 70

71 Data Forwardig Betwee Stores ad Loads We caot update memory out of program order à Need to buffer all store ad load istructios i istructio widow Eve if we kow all addresses of past stores whe we geerate the address of a load, two uestios still remai: 1. How do we check whether or ot it is depedet o a store 2. How do we forward data to the load if it is depedet o a store Moder processors use a LQ (load ueue) ad a SQ for this Ca be combied or separate betwee loads ad stores A load searches the SQ after it computes its address. Why? A store searches the LQ after it computes its address. Why? 71

72 Out-of-Order Completio of Memory Ops Whe a store istructio fiishes executio, it writes its address ad data i its reorder buffer etry Whe a later load istructio geerates its address, it: searches the reorder buffer (or the SQ) with its address accesses memory with its address receives the value from the yougest older istructio that wrote to that address (either from ROB or memory) This is a complicated search logic implemeted as a Cotet Addressable Memory Cotet is memory address (but also eed size ad age) Called store-to-load forwardig logic 72

73 Store-Load Forwardig Complexity Cotet Addressable Search (based o Load Address) Rage Search (based o Address ad Size) Age-Based Search (for last writte values) Load data ca come from a combiatio of Oe or more stores i the Store Buffer (SQ) Memory/cache 73

74 Food for Thought for You May other desig choices for OoO egies Should reservatio statios be cetralized or distributed across fuctioal uits? What are the tradeoffs? Should reservatio statios ad ROB store data values or should there be a cetralized physical register file where all data values are stored? What are the tradeoffs? Exactly whe does a istructio broadcast its tag? 74

CMSC22200 Computer Architecture Lecture 9: Out-of-Order, SIMD, VLIW. Prof. Yanjing Li University of Chicago

CMSC22200 Computer Architecture Lecture 9: Out-of-Order, SIMD, VLIW Prof. Yajig Li Uiversity of Chicago Admiistrative Stuff Lab2 due toight Exam I: covers lectures 1-9 Ope book, ope otes, close device