CMSC22200 Computer Architecture Lecture 8: Out-of-Order Execution Prof. Yanjing Li University of Chicago
Administrative Stuff
- Lab2 due tomorrow
  - 2 free late days
- Lab3 is out
  - Start early!
- My office hours today moved to tomorrow
  - Announcement on Piazza

Lecture Outline
- Review: branch prediction
- Out-of-order (OOO) execution
  - Motivation
  - How it works
  - Discussion

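Review: Gshare Branch Predictor
[Figure: the Program Counter (address of the current instruction) is XORed with the global branch history (which direction earlier branches went) to index the direction predictor (e.g., 2-bit counters), which answers "taken?". The PC also looks up the BTB (Branch Target Buffer), which answers "hit?" and supplies the target address. The Next Fetch Address is chosen between the target address and PC + 4.]
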
Two Levels of Gshare
- First level: Global branch history register (N bits) xor PC
- Second level: 2-bit counters for each history entry
  - The direction the branch took the last time the same history was seen
[Figure: the GHR (global history register, e.g. 1 1 ... 1 0) xor PC forms the index into the Pattern History Table (PHT), whose entries are indexed 00...00, 00...01, 00...10, ..., 11...11; each entry holds a 2-bit counter (0, 1, 2, 3).]

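The two levels can be sketched in a few lines of Python. This is a minimal illustration, not the lab's required design: the class name `Gshare`, the helper `mispredictions`, the 4-bit history, the initial counter value, and dropping the two low PC bits (4-byte instructions) are all assumptions made for the example.

```python
class Gshare:
    """Toy gshare direction predictor: PHT index = PC bits XOR global history."""

    def __init__(self, history_bits=4):
        self.mask = (1 << history_bits) - 1
        self.ghr = 0                              # first level: N-bit global history
        self.pht = [1] * (1 << history_bits)      # second level: 2-bit counters (1 = weakly !taken)

    def _index(self, pc):
        # Assumed indexing: drop the 2 low PC bits, XOR with the history, keep N bits.
        return ((pc >> 2) ^ self.ghr) & self.mask

    def predict(self, pc):
        return self.pht[self._index(pc)] >= 2     # counter values 2 and 3 predict taken

    def update(self, pc, taken):
        i = self._index(pc)
        if taken:
            self.pht[i] = min(3, self.pht[i] + 1)
        else:
            self.pht[i] = max(0, self.pht[i] - 1)
        # Shift the actual outcome into the history register.
        self.ghr = ((self.ghr << 1) | int(taken)) & self.mask


def mispredictions(predictor, pc, outcomes):
    """Count wrong predictions for one branch over a sequence of actual outcomes."""
    wrong = 0
    for taken in outcomes:
        wrong += predictor.predict(pc) != taken
        predictor.update(pc, taken)
    return wrong
```

Because the index mixes in the history, a single branch that alternates taken / not-taken ends up using two different counters, one per history pattern, so the predictor can learn the alternation after a short warm-up.
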
Branch Prediction Using a 2-bit Counter
[State diagram: four states, "strongly taken" and "weakly taken" (pred taken), "weakly !taken" and "strongly !taken" (pred !taken); every transition is labeled "actually taken" or "actually !taken".]
- Change prediction after 2 consecutive mistakes

2-bit Counter: Another Scheme
[State diagram: the same four states ("strongly taken", "weakly taken", "weakly !taken", "strongly !taken") with "pred taken" / "pred !taken" labels and "actually taken" / "actually !taken" transitions, wired as an alternative scheme.]

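One common way to realize a 2-bit scheme is a saturating up/down counter, sketched below. The numeric encoding (0 to 3) and the function names are assumptions for illustration; the two state diagrams in the slides may wire the transitions out of the weak states differently.

```python
# Assumed encoding (not from the slides): higher value = more confident "taken".
STATE_NAMES = ["strongly !taken", "weakly !taken", "weakly taken", "strongly taken"]


def predict(counter):
    """Predict taken in the two upper states."""
    return counter >= 2


def update(counter, actually_taken):
    """Move one step toward the actual outcome, saturating at 0 and 3."""
    if actually_taken:
        return min(3, counter + 1)
    return max(0, counter - 1)


def trace(counter, outcomes):
    """Return the (state name, predicted taken?) pair seen before each outcome."""
    seen = []
    for taken in outcomes:
        seen.append((STATE_NAMES[counter], predict(counter)))
        counter = update(counter, taken)
    return seen
```

Starting from "strongly taken", `trace(3, [False, False, False])` predicts taken for the first two not-taken outcomes and flips only for the third, which is the "change prediction after 2 consecutive mistakes" behavior.
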
Review: Dependency Handling in the Pipeline
- Software vs. hardware
  - Software-based instruction scheduling → static scheduling
  - Hardware-based instruction scheduling → dynamic scheduling
- What information does the compiler not know that makes static scheduling difficult?
  - Answer: Anything that is determined at run time
    - Variable-length operation latency, memory addr, branch direction

Example: Load-Use Dependency
- Consider this sequence
- Requires 1 stall
    LDUR X2, [X1,#20]
    AND  X4, X2, X5
    OR   X8, X3, X6
- Static scheduling to re-order instructions
- No need to stall
    LDUR X2, [X1,#20]
    OR   X8, X3, X6
    AND  X4, X2, X5
What if a load sometimes takes 100 cycles to execute?

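Another Example: Instructions w/ Variable Latencies
[Figure: a pipeline with F and D stages, then parallel execute paths of different lengths: integer add (1 E stage), integer mul (4 E stages), FP mul (8 E stages), and a path of 8 or more E stages labeled "Cache miss"; all paths then pass through R and W.]
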
Dependency Handling
- Consider the following two pieces of code

    IMUL R3 ← R1, R2        LD   R3 ← R1 (0)
    ADD  R3 ← R3, R1        ADD  R3 ← R3, R1
    ADD  R1 ← R6, R7        ADD  R1 ← R6, R7
    IMUL R5 ← R6, R8        IMUL R5 ← R6, R8
    ADD  R7 ← R3, R5        ADD  R7 ← R3, R5

- In both cases, the first ADD stalls the whole pipeline!
  - ADD cannot dispatch because its source registers are unavailable
  - Later independent instructions cannot get executed
- IMUL and LD can take a really long time
  - Latency of LD is unknown until runtime (cache hit vs. miss)

How to Do Better?
- Hardware has knowledge of dynamic events on a per-instruction basis (i.e., at a very fine granularity)
  - Cache misses
  - Branch mispredictions
  - Load/store addresses
- Wouldn't it be nice if hardware did the scheduling of instructions?
- Hardware-based dynamic instruction scheduling, enabling OOO execution
  - Tradeoffs vs. static scheduling?

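Benefits of OOO
Assume: IMUL: 4 Ex cycles, ADD: 1 Ex cycle ("--" marks a STALL / WAIT cycle)

- In order
    Cycle              1  2  3  4  5  6  7  8  9  10 11 12 13 14 15
    IMUL R3 ← R1, R2   F  D  E  E  E  E  M  W
    ADD  R3 ← R3, R1      F  D  -- -- -- E  M  W
    ADD  R1 ← R6, R7         F  -- -- -- D  E  M  W
    IMUL R5 ← R6, R8                     F  D  E  E  E  E  M  W
    ADD  R7 ← R3, R5                        F  D  -- -- -- E  M  W

- Out-of-order
    Cycle              1  2  3  4  5  6  7  8  9  10 11 12
    IMUL R3 ← R1, R2   F  D  E  E  E  E  M  W
    ADD  R3 ← R3, R1      F  D  -- -- -- E  M  W
    ADD  R1 ← R6, R7         F  D  E  M  W
    IMUL R5 ← R6, R8            F  D  E  E  E  E  M  W
    ADD  R7 ← R3, R5               F  D  -- -- -- E  M  W

- 15 vs. 12 cycles
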
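The 15 vs. 12 cycle counts can be cross-checked with a small timing model. This is a sketch under simplifying assumptions that are not stated on the slide: one instruction fetched per cycle, a result usable by a consumer in the cycle right after the producer's last E cycle, no structural hazards in E/M/W, and (for the out-of-order case) renaming plus unlimited buffering so only true dependencies delay an instruction. The function name `total_cycles` is made up for the example.

```python
LATENCY = {"IMUL": 4, "ADD": 1}        # execute cycles, as assumed on the slide

PROGRAM = [                            # (op, dest, src1, src2)
    ("IMUL", "R3", "R1", "R2"),
    ("ADD",  "R3", "R3", "R1"),
    ("ADD",  "R1", "R6", "R7"),
    ("IMUL", "R5", "R6", "R8"),
    ("ADD",  "R7", "R3", "R5"),
]


def total_cycles(program, in_order):
    """Return the cycle in which the last write-back (W) happens."""
    ready = {}            # register -> first cycle a consumer may start executing
    prev_start = 0        # cycle in which the previous instruction left decode
    last_writeback = 0
    for i, (op, dest, *srcs) in enumerate(program):
        decode = i + 2                            # fetched in cycle i+1, decoded one cycle later
        if in_order:
            # Decode stays occupied until the previous instruction starts executing.
            decode = max(decode, prev_start)
        start = max([decode + 1] + [ready.get(s, 0) for s in srcs])
        end = start + LATENCY[op] - 1             # last E cycle
        ready[dest] = end + 1                     # forwarded right after the last E cycle
        prev_start = start
        last_writeback = max(last_writeback, end + 2)   # E -> M -> W
    return last_writeback


print(total_cycles(PROGRAM, in_order=True), "vs.", total_cycles(PROGRAM, in_order=False))
```

The only difference between the two modes is the `in_order` constraint on decode: a stalled instruction blocks everything behind it, while in the out-of-order mode later independent instructions proceed.
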
Out-of-Order Execution
Out-of-Order Execution
- Idea
  - Move the dependent instructions out of the way of independent ones (s.t. independent ones can execute)
- Approach
  - Monitor the source values of each instruction
  - When all source values of an instruction are available, fire (i.e. dispatch) the instruction
  - Retire each instruction in program order
- Benefit
  - Latency tolerance: Allows independent instructions to execute and complete in the presence of a long latency operation

Illustration of an OOO Pipeline
[Figure: F and D feed SCHEDULE (in order); instructions then execute out of order on parallel units: integer add (1 E stage), integer mul (4 E stages), FP mul (8 E stages), load/store (8 or more E stages); results pass through REORDER and W (in order). A TAG and VALUE Broadcast Bus is drawn across the pipeline.]
- Two humps
  - Hump 1: reservation stations (scheduling window)
  - Hump 2: reorder buffer (instruction window or active window)

Dynamic Scheduling: Tomasulo's Algorithm
- Invented by Robert Tomasulo
  - Used in IBM 360/91 Floating Point Units
  - Tomasulo, "An Efficient Algorithm for Exploiting Multiple Arithmetic Units," IBM Journal of R&D, Jan. 1967.
- Variants are used in many high-performance processors

Key Ideas of Tomasulo's Algorithm
1. Register renaming
  - Track true dependencies by linking the consumer of a value to the producer
2. Buffer instructions in reservation stations until they are ready to execute
  - Keep track of readiness of source values
  - Instruction wakes up and dispatches to the appropriate functional unit (FU) if all sources are ready
    - If multiple instructions are awake, need to select one per FU

Register Renaming
- Output and anti dependencies are not true dependencies
  - WHY?
  - They exist because there are not enough register IDs (i.e. names) in the ISA
- The register ID is renamed to the reservation station (RS) entry that will hold the register's value
  - Register ID → RS entry ID
  - Architectural register ID → Physical register ID
  - After renaming, the RS entry ID is used to refer to the register
- This eliminates anti- and output- dependencies
  - As if there are a large number of registers even though the ISA can only support a small number

Register Renaming Using RAT
- RAT: Register Alias Table (aka Register Rename Table)

    Register   tag           value        valid?
    X0         Don't care    0            1
    X1         Don't care    1            1
    X2         RS entry 7    Don't care   0
    X3         Don't care    3            1
    X4         RS entry 3    Don't care   0
    X5         RS entry 13   Don't care   0
    X6         Don't care    6            1
    X7         Don't care    7            1
    X8         RS entry 4    Don't care   0
    X9         Don't care    9            1

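Renaming through a RAT can be sketched as follows, using the six-instruction sequence from the exercise later in this lecture. The `Operand` class, the `rename` and `initial_rat` helpers, and the pre-assigned RS entry names (r, a, b, c, t, d, in the order the exercise allocates them) are illustrative assumptions; no instruction executes here, so the sketch shows only the rename step.

```python
from dataclasses import dataclass, replace


@dataclass
class Operand:
    valid: bool
    tag: str = "*"      # "*" marks a don't-care field
    value: int = 0


def initial_rat():
    """All registers valid; Xi holds the value i (as in the exercise's first cycle)."""
    return {f"X{i}": Operand(valid=True, value=i) for i in range(1, 12)}


def rename(program, rat):
    """Rename (rs_entry, dest, src1, src2) tuples in program order."""
    stations = {}
    for rs_entry, dest, src1, src2 in program:
        # Sources: copy what the RAT holds now (a value, or the tag of the producer).
        stations[rs_entry] = (replace(rat[src1]), replace(rat[src2]))
        # Destination: from now on this register name refers to this RS entry.
        rat[dest] = Operand(valid=False, tag=rs_entry)
    return stations


PROGRAM = [
    ("r", "X3", "X1", "X2"),     # MUL X3  <- X1, X2
    ("a", "X5", "X3", "X4"),     # ADD X5  <- X3, X4
    ("b", "X7", "X2", "X6"),     # ADD X7  <- X2, X6
    ("c", "X10", "X8", "X9"),    # ADD X10 <- X8, X9
    ("t", "X11", "X7", "X10"),   # MUL X11 <- X7, X10
    ("d", "X5", "X5", "X11"),    # ADD X5  <- X5, X11
]
```

After `rename(PROGRAM, rat)`, entry `d` waits on tags `a` and `t` while the RAT entry for X5 points at `d`: the two writes to X5 no longer share a name, so the output dependency is gone and only the true dependencies remain.
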
Better Register Renaming Techniques
- Rename through ROB
- Rename through merged RF
Hinton et al., "The Microarchitecture of the Pentium 4 Processor," Intel Technology Journal, 2001.

Tomasulo's Machine: IBM 360/91
[Figure: block diagram with these labeled parts: load buffers (from memory), FP registers (from instruction unit), store buffers (to memory), operation bus, reservation stations in front of two FP FUs, and the common data bus.]

Tomasulo's Algorithm
- If reservation station not available, stall; else instruction + renamed operands (source value/tag) inserted into the reservation station
- While in reservation station, each instruction:
  - Watches common data bus (CDB) for tag of its sources
  - When tag seen, grab value for the source and keep it in the reservation station
  - When both operands available, instruction ready to be dispatched
- Dispatch instruction to the Functional Unit (FU) when instruction is ready
  - If multiple instructions ready at the same time and require the same FU, need logic to select one
- After instruction finishes in the FU
  - Arbitrate for CDB
  - Put tagged value onto CDB (tag broadcast)
  - Register file, RS, and RAT connected to the CDB
    - Register contains a tag indicating the latest writer to the register
    - If the tag in the register file, RS, and RAT matches the broadcast tag, write broadcast value into register (and set valid bit)
  - Reclaim rename tag (i.e., free the corresponding RS entry)

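The broadcast and wake-up step can be sketched as below. It is a simplified illustration, not the 360/91's actual logic: the `Operand` class, the `broadcast` function, and the `cycle8_state` helper are hypothetical names, CDB arbitration is left out, and the starting state is copied from the Cycle 8 tables of the exercise later in this lecture.

```python
from dataclasses import dataclass


@dataclass
class Operand:
    valid: bool
    tag: str = "*"      # "*" marks a don't-care field
    value: int = 0


def broadcast(tag, value, rat, stations):
    """Put (tag, value) on the CDB; return the RS entries that became ready."""
    woken = []
    for name, sources in stations.items():
        captured = False
        for src in sources:
            if not src.valid and src.tag == tag:       # waiting on this producer
                src.valid, src.tag, src.value = True, "*", value
                captured = True
        if captured and all(src.valid for src in sources):
            woken.append(name)
    for reg in rat.values():
        # Only a register whose latest writer is `tag` takes the value.
        if not reg.valid and reg.tag == tag:
            reg.valid, reg.tag, reg.value = True, "*", value
    stations.pop(tag, None)                            # reclaim the producer's RS entry
    return woken


def cycle8_state():
    """RAT entries with pending writers and RS contents, as in the Cycle 8 tables."""
    rat = {"X3": Operand(False, "r"), "X5": Operand(False, "d"), "X7": Operand(False, "b"),
           "X10": Operand(False, "c"), "X11": Operand(False, "t")}
    stations = {
        "r": [Operand(True, value=1), Operand(True, value=2)],
        "a": [Operand(False, "r"), Operand(True, value=4)],
        "b": [Operand(True, value=2), Operand(True, value=6)],
        "c": [Operand(True, value=8), Operand(True, value=9)],
        "t": [Operand(False, "b"), Operand(False, "c")],
        "d": [Operand(False, "a"), Operand(False, "t")],
    }
    return rat, stations
```

Broadcasting `("r", 2)` wakes entry `a` and makes X3 valid; broadcasting `("b", 8)` fills the first source of `t`, which keeps waiting on tag `c`. Unlike the slides, which clear a finished entry in the following cycle's table, this sketch frees it immediately.
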
An Exercise
    MUL X3 ← X1, X2
    ADD X5 ← X3, X4
    ADD X7 ← X2, X6
    ADD X10 ← X8, X9
    MUL X11 ← X7, X10
    ADD X5 ← X5, X11
- Assume ADD (4-cycle execute), MUL (6-cycle execute)
- Assume one adder and one multiplier in HW
- Assume operations are done entirely using registers
  - No memory access
- Pipeline stages: F D E W
[Figure: the IBM 360/91 datapath (FP registers, load buffers, store buffers, operation bus, reservation stations, two FP FUs, common data bus).]

Drawing Template

MUL X3 ← X1, X2
ADD X5 ← X3, X4
ADD X7 ← X2, X6
ADD X10 ← X8, X9
MUL X11 ← X7, X10
ADD X5 ← X5, X11

RAT   v?  tag  val
X1
X2
X3
X4
X5
X6
X7
X8
X9
X10
X11

RS       source 1     source 2
         v? tag val   v? tag val
ADD  a
     b
     c
     d
MUL  r
     t
     s
     v

Cycle 1

MUL X3 ← X1, X2     F
ADD X5 ← X3, X4
ADD X7 ← X2, X6
ADD X10 ← X8, X9
MUL X11 ← X7, X10
ADD X5 ← X5, X11

RAT   v?  tag  val
X1    1   *    1
X2    1   *    2
X3    1   *    3
X4    1   *    4
X5    1   *    5
X6    1   *    6
X7    1   *    7
X8    1   *    8
X9    1   *    9
X10   1   *    10
X11   1   *    11

RS       source 1     source 2
         v? tag val   v? tag val
ADD  a
     b
     c
     d
MUL  r
     t
     s
     v

Cycle 2

MUL X3 ← X1, X2     F  D
ADD X5 ← X3, X4        F
ADD X7 ← X2, X6
ADD X10 ← X8, X9
MUL X11 ← X7, X10
ADD X5 ← X5, X11

RAT   v?  tag  val
X1    1   *    1
X2    1   *    2
X3    0   r    *
X4    1   *    4
X5    1   *    5
X6    1   *    6
X7    1   *    7
X8    1   *    8
X9    1   *    9
X10   1   *    10
X11   1   *    11

RS       source 1     source 2
         v? tag val   v? tag val
ADD  a
     b
     c
     d
MUL  r   1  *   1     1  *   2
     t
     s
     v

Cycle 3

MUL X3 ← X1, X2     F  D  E
ADD X5 ← X3, X4        F  D
ADD X7 ← X2, X6           F
ADD X10 ← X8, X9
MUL X11 ← X7, X10
ADD X5 ← X5, X11

MUL (in RS entry r) starts to execute since both operands are valid

RAT   v?  tag  val
X1    1   *    1
X2    1   *    2
X3    0   r    *
X4    1   *    4
X5    0   a    *
X6    1   *    6
X7    1   *    7
X8    1   *    8
X9    1   *    9
X10   1   *    10
X11   1   *    11

RS       source 1     source 2
         v? tag val   v? tag val
ADD  a   0  r   *     1  *   4
     b
     c
     d
MUL  r   1  *   1     1  *   2
     t
     s
     v

Cycle 4

MUL X3 ← X1, X2     F  D  E  E
ADD X5 ← X3, X4        F  D  --
ADD X7 ← X2, X6           F  D
ADD X10 ← X8, X9             F
MUL X11 ← X7, X10
ADD X5 ← X5, X11

ADD (in RS entry a) waits since its first source (tag r) is not valid

RAT   v?  tag  val
X1    1   *    1
X2    1   *    2
X3    0   r    *
X4    1   *    4
X5    0   a    *
X6    1   *    6
X7    0   b    *
X8    1   *    8
X9    1   *    9
X10   1   *    10
X11   1   *    11

RS       source 1     source 2
         v? tag val   v? tag val
ADD  a   0  r   *     1  *   4
     b   1  *   2     1  *   6
     c
     d
MUL  r   1  *   1     1  *   2
     t
     s
     v

Cycle 5

MUL X3 ← X1, X2     F  D  E  E  E
ADD X5 ← X3, X4        F  D  -- --
ADD X7 ← X2, X6           F  D  E
ADD X10 ← X8, X9             F  D
MUL X11 ← X7, X10               F
ADD X5 ← X5, X11

ADD (in RS entry b) starts to execute

RAT   v?  tag  val
X1    1   *    1
X2    1   *    2
X3    0   r    *
X4    1   *    4
X5    0   a    *
X6    1   *    6
X7    0   b    *
X8    1   *    8
X9    1   *    9
X10   0   c    *
X11   1   *    11

RS       source 1     source 2
         v? tag val   v? tag val
ADD  a   0  r   *     1  *   4
     b   1  *   2     1  *   6
     c   1  *   8     1  *   9
     d
MUL  r   1  *   1     1  *   2
     t
     s
     v

Cycle 6

MUL X3 ← X1, X2     F  D  E  E  E  E
ADD X5 ← X3, X4        F  D  -- -- --
ADD X7 ← X2, X6           F  D  E  E
ADD X10 ← X8, X9             F  D  E
MUL X11 ← X7, X10               F  D
ADD X5 ← X5, X11                   F

ADD (in RS entry c) starts to execute

RAT   v?  tag  val
X1    1   *    1
X2    1   *    2
X3    0   r    *
X4    1   *    4
X5    0   a    *
X6    1   *    6
X7    0   b    *
X8    1   *    8
X9    1   *    9
X10   0   c    *
X11   0   t    *

RS       source 1     source 2
         v? tag val   v? tag val
ADD  a   0  r   *     1  *   4
     b   1  *   2     1  *   6
     c   1  *   8     1  *   9
     d
MUL  r   1  *   1     1  *   2
     t   0  b   *     0  c   *
     s
     v

Cycle 7

MUL X3 ← X1, X2     F  D  E  E  E  E  E
ADD X5 ← X3, X4        F  D  -- -- -- --
ADD X7 ← X2, X6           F  D  E  E  E
ADD X10 ← X8, X9             F  D  E  E
MUL X11 ← X7, X10               F  D  --
ADD X5 ← X5, X11                   F  D

MUL (in RS entry t) waits
Pay attention to register renaming removing WAW

RAT   v?  tag  val
X1    1   *    1
X2    1   *    2
X3    0   r    *
X4    1   *    4
X5    0   d    *
X6    1   *    6
X7    0   b    *
X8    1   *    8
X9    1   *    9
X10   0   c    *
X11   0   t    *

RS       source 1     source 2
         v? tag val   v? tag val
ADD  a   0  r   *     1  *   4
     b   1  *   2     1  *   6
     c   1  *   8     1  *   9
     d   0  a   *     0  t   *
MUL  r   1  *   1     1  *   2
     t   0  b   *     0  c   *
     s
     v

Cycle 8

MUL X3 ← X1, X2     F  D  E  E  E  E  E  E
ADD X5 ← X3, X4        F  D  -- -- -- -- --
ADD X7 ← X2, X6           F  D  E  E  E  E
ADD X10 ← X8, X9             F  D  E  E  E
MUL X11 ← X7, X10               F  D  -- --
ADD X5 ← X5, X11                   F  D  --

Broadcast results through CDB to wake up dependent instructions (check both RAT and RS)

RAT   v?  tag  val
X1    1   *    1
X2    1   *    2
X3    0   r    *
X4    1   *    4
X5    0   d    *
X6    1   *    6
X7    0   b    *
X8    1   *    8
X9    1   *    9
X10   0   c    *
X11   0   t    *

RS       source 1     source 2
         v? tag val   v? tag val
ADD  a   0  r   *     1  *   4
     b   1  *   2     1  *   6
     c   1  *   8     1  *   9
     d   0  a   *     0  t   *
MUL  r   1  *   1     1  *   2
     t   0  b   *     0  c   *
     s
     v

Cycle 9

MUL X3 ← X1, X2     F  D  E  E  E  E  E  E  W
ADD X5 ← X3, X4        F  D  -- -- -- -- -- E
ADD X7 ← X2, X6           F  D  E  E  E  E  W
ADD X10 ← X8, X9             F  D  E  E  E  E
MUL X11 ← X7, X10               F  D  -- -- --
ADD X5 ← X5, X11                   F  D  -- --

Assuming 2 reg write ports and forwarding, we can dispatch ADD in RS entry a; also reclaim RS entries r and b

RAT   v?  tag  val
X1    1   *    1
X2    1   *    2
X3    1   *    2
X4    1   *    4
X5    0   d    *
X6    1   *    6
X7    1   *    8
X8    1   *    8
X9    1   *    9
X10   0   c    *
X11   0   t    *

RS       source 1     source 2
         v? tag val   v? tag val
ADD  a   1  *   2     1  *   4
     b   1  *   2     1  *   6
     c   1  *   8     1  *   9
     d   0  a   *     0  t   *
MUL  r   1  *   1     1  *   2
     t   1  *   8     0  c   *
     s
     v

Cycle 10

MUL X3 ← X1, X2     F  D  E  E  E  E  E  E  W
ADD X5 ← X3, X4        F  D  -- -- -- -- -- E  E
ADD X7 ← X2, X6           F  D  E  E  E  E  W
ADD X10 ← X8, X9             F  D  E  E  E  E  W
MUL X11 ← X7, X10               F  D  -- -- -- E
ADD X5 ← X5, X11                   F  D  -- -- --

Now we dispatch the second MUL (in RS entry t)

RAT   v?  tag  val
X1    1   *    1
X2    1   *    2
X3    1   *    2
X4    1   *    4
X5    0   d    *
X6    1   *    6
X7    1   *    8
X8    1   *    8
X9    1   *    9
X10   1   *    17
X11   0   t    *

RS       source 1     source 2
         v? tag val   v? tag val
ADD  a   1  *   2     1  *   4
     b
     c   1  *   8     1  *   9
     d   0  a   *     0  t   *
MUL  r
     t   1  *   8     1  *   17
     s
     v

Cycle 11

MUL X3 ← X1, X2     F  D  E  E  E  E  E  E  W
ADD X5 ← X3, X4        F  D  -- -- -- -- -- E  E  E
ADD X7 ← X2, X6           F  D  E  E  E  E  W
ADD X10 ← X8, X9             F  D  E  E  E  E  W
MUL X11 ← X7, X10               F  D  -- -- -- E  E
ADD X5 ← X5, X11                   F  D  -- -- -- --

RAT   v?  tag  val
X1    1   *    1
X2    1   *    2
X3    1   *    2
X4    1   *    4
X5    0   d    *
X6    1   *    6
X7    1   *    8
X8    1   *    8
X9    1   *    9
X10   1   *    17
X11   0   t    *

RS       source 1     source 2
         v? tag val   v? tag val
ADD  a   1  *   2     1  *   4
     b
     c
     d   0  a   *     0  t   *
MUL  r
     t   1  *   8     1  *   17
     s
     v

Cycle 12

MUL X3 ← X1, X2     F  D  E  E  E  E  E  E  W
ADD X5 ← X3, X4        F  D  -- -- -- -- -- E  E  E  E
ADD X7 ← X2, X6           F  D  E  E  E  E  W
ADD X10 ← X8, X9             F  D  E  E  E  E  W
MUL X11 ← X7, X10               F  D  -- -- -- E  E  E
ADD X5 ← X5, X11                   F  D  -- -- -- -- --

RAT   v?  tag  val
X1    1   *    1
X2    1   *    2
X3    1   *    2
X4    1   *    4
X5    0   d    *
X6    1   *    6
X7    1   *    8
X8    1   *    8
X9    1   *    9
X10   1   *    17
X11   0   t    *

RS       source 1     source 2
         v? tag val   v? tag val
ADD  a   1  *   2     1  *   4
     b
     c
     d   0  a   *     0  t   *
MUL  r
     t   1  *   8     1  *   17
     s
     v

Cycle 13

MUL X3 ← X1, X2     F  D  E  E  E  E  E  E  W
ADD X5 ← X3, X4        F  D  -- -- -- -- -- E  E  E  E  W
ADD X7 ← X2, X6           F  D  E  E  E  E  W
ADD X10 ← X8, X9             F  D  E  E  E  E  W
MUL X11 ← X7, X10               F  D  -- -- -- E  E  E  E
ADD X5 ← X5, X11                   F  D  -- -- -- -- -- --

RAT   v?  tag  val
X1    1   *    1
X2    1   *    2
X3    1   *    2
X4    1   *    4
X5    0   d    *
X6    1   *    6
X7    1   *    8
X8    1   *    8
X9    1   *    9
X10   1   *    17
X11   0   t    *

RS       source 1     source 2
         v? tag val   v? tag val
ADD  a   1  *   2     1  *   4
     b
     c
     d   1  *   6     0  t   *
MUL  r
     t   1  *   8     1  *   17
     s
     v

Cycle 14

MUL X3 ← X1, X2     F  D  E  E  E  E  E  E  W
ADD X5 ← X3, X4        F  D  -- -- -- -- -- E  E  E  E  W
ADD X7 ← X2, X6           F  D  E  E  E  E  W
ADD X10 ← X8, X9             F  D  E  E  E  E  W
MUL X11 ← X7, X10               F  D  -- -- -- E  E  E  E  E
ADD X5 ← X5, X11                   F  D  -- -- -- -- -- -- --

RAT   v?  tag  val
X1    1   *    1
X2    1   *    2
X3    1   *    2
X4    1   *    4
X5    0   d    *
X6    1   *    6
X7    1   *    8
X8    1   *    8
X9    1   *    9
X10   1   *    17
X11   0   t    *

RS       source 1     source 2
         v? tag val   v? tag val
ADD  a
     b
     c
     d   1  *   6     0  t   *
MUL  r
     t   1  *   8     1  *   17
     s
     v

Cycle 15

MUL X3 ← X1, X2     F  D  E  E  E  E  E  E  W
ADD X5 ← X3, X4        F  D  -- -- -- -- -- E  E  E  E  W
ADD X7 ← X2, X6           F  D  E  E  E  E  W
ADD X10 ← X8, X9             F  D  E  E  E  E  W
MUL X11 ← X7, X10               F  D  -- -- -- E  E  E  E  E  E
ADD X5 ← X5, X11                   F  D  -- -- -- -- -- -- -- --

RAT   v?  tag  val
X1    1   *    1
X2    1   *    2
X3    1   *    2
X4    1   *    4
X5    0   d    *
X6    1   *    6
X7    1   *    8
X8    1   *    8
X9    1   *    9
X10   1   *    17
X11   0   t    *

RS       source 1     source 2
         v? tag val   v? tag val
ADD  a
     b
     c
     d   1  *   6     0  t   *
MUL  r
     t   1  *   8     1  *   17
     s
     v

Cycle 16

MUL X3 ← X1, X2     F  D  E  E  E  E  E  E  W
ADD X5 ← X3, X4        F  D  -- -- -- -- -- E  E  E  E  W
ADD X7 ← X2, X6           F  D  E  E  E  E  W
ADD X10 ← X8, X9             F  D  E  E  E  E  W
MUL X11 ← X7, X10               F  D  -- -- -- E  E  E  E  E  E  W
ADD X5 ← X5, X11                   F  D  -- -- -- -- -- -- -- -- E

Now we dispatch the last ADD

RAT   v?  tag  val
X1    1   *    1
X2    1   *    2
X3    1   *    2
X4    1   *    4
X5    0   d    *
X6    1   *    6
X7    1   *    8
X8    1   *    8
X9    1   *    9
X10   1   *    17
X11   1   *    136

RS       source 1     source 2
         v? tag val   v? tag val
ADD  a
     b
     c
     d   1  *   6     1  *   136
MUL  r
     t   1  *   8     1  *   17
     s
     v

Cycle 19

MUL X3 ← X1, X2     F  D  E  E  E  E  E  E  W
ADD X5 ← X3, X4        F  D  -- -- -- -- -- E  E  E  E  W
ADD X7 ← X2, X6           F  D  E  E  E  E  W
ADD X10 ← X8, X9             F  D  E  E  E  E  W
MUL X11 ← X7, X10               F  D  -- -- -- E  E  E  E  E  E  W
ADD X5 ← X5, X11                   F  D  -- -- -- -- -- -- -- -- E  E  E  E

RAT   v?  tag  val
X1    1   *    1
X2    1   *    2
X3    1   *    2
X4    1   *    4
X5    0   d    *
X6    1   *    6
X7    1   *    8
X8    1   *    8
X9    1   *    9
X10   1   *    17
X11   1   *    136

RS       source 1     source 2
         v? tag val   v? tag val
ADD  a
     b
     c
     d   1  *   6     1  *   136
MUL  r
     t
     s
     v

Cycle 20

MUL X3 ← X1, X2     F  D  E  E  E  E  E  E  W
ADD X5 ← X3, X4        F  D  -- -- -- -- -- E  E  E  E  W
ADD X7 ← X2, X6           F  D  E  E  E  E  W
ADD X10 ← X8, X9             F  D  E  E  E  E  W
MUL X11 ← X7, X10               F  D  -- -- -- E  E  E  E  E  E  W
ADD X5 ← X5, X11                   F  D  -- -- -- -- -- -- -- -- E  E  E  E  W

RAT   v?  tag  val
X1    1   *    1
X2    1   *    2
X3    1   *    2
X4    1   *    4
X5    1   *    142
X6    1   *    6
X7    1   *    8
X8    1   *    8
X9    1   *    9
X10   1   *    17
X11   1   *    136

RS       source 1     source 2
         v? tag val   v? tag val
ADD  a
     b
     c
     d   1  *   6     1  *   136
MUL  r
     t
     s
     v

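The write-back cycles and final values of the exercise can be cross-checked with a short dataflow calculation. This is a sketch under assumptions taken from the walkthrough (one instruction fetched per cycle, Xi initially holds i, a dependent instruction may start executing in the cycle its producer writes back) plus two simplifications of its own: the functional units are treated as pipelined and CDB arbitration is ignored. The function name `run` is made up for the example.

```python
LATENCY = {"ADD": 4, "MUL": 6}          # execute cycles, from the exercise statement

PROGRAM = [                             # (op, dest, src1, src2)
    ("MUL", "X3", "X1", "X2"),
    ("ADD", "X5", "X3", "X4"),
    ("ADD", "X7", "X2", "X6"),
    ("ADD", "X10", "X8", "X9"),
    ("MUL", "X11", "X7", "X10"),
    ("ADD", "X5", "X5", "X11"),
]


def run(program):
    """Return (final register values, write-back cycle of each instruction)."""
    regs = {f"X{i}": i for i in range(1, 12)}   # Xi initially holds i
    ready = {}              # register -> cycle in which its pending value is broadcast
    writebacks = []
    for i, (op, dest, s1, s2) in enumerate(program):
        decode = i + 2                          # fetched in cycle i+1, decoded in cycle i+2
        # Start when decoded and when both producers have broadcast (forwarding).
        start = max(decode + 1, ready.get(s1, 0), ready.get(s2, 0))
        done = start + LATENCY[op]              # W is the cycle after the last E cycle
        a, b = regs[s1], regs[s2]
        regs[dest] = a * b if op == "MUL" else a + b
        ready[dest] = done
        writebacks.append(done)
    return regs, writebacks


regs, writebacks = run(PROGRAM)
print(writebacks, regs["X5"])
```

Processing the instructions in program order is enough here because each source is looked up before the same instruction overwrites its destination, which mirrors what renaming achieves in hardware.
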