CMSC22200 Computer Architecture Lecture 8: Out-of-Order Execution. Prof. Yanjing Li University of Chicago

1 CMSC22200 Computer Architecture Lecture 8: Out-of-Order Execution Prof. Yanjing Li University of Chicago

2 Administrative Stuff
- Lab 2 due tomorrow
  - 2 free late days
- Lab 3 is out
  - Start early!
- My office hours today moved to tomorrow
  - Announcement on Piazza

3 Lecture Outline
- Review: branch prediction
- Out-of-order (OOO) execution
  - Motivation
  - How it works
  - Discussion

4 Review: Gshare Branch Predictor
[Diagram: the global branch history (which direction earlier branches went) is XORed with the program counter (the address of the current instruction) to index a direction predictor (e.g., 2-bit counters), which answers "taken?". In parallel, the PC indexes the BTB (Branch Target Buffer), which answers "hit?" and supplies the target address. The next fetch address is either PC + 4 or the predicted target.]

5 Two Levels of Gshare
- First level: global branch history register (N bits) XORed with the PC
- Second level: a 2-bit counter for each history entry
  - The direction the branch took the last time the same history was seen
[Diagram: GHR (global history register) XOR PC forms the index into the Pattern History Table (PHT).]
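
As a concrete illustration of the two levels, here is a minimal C sketch of a gshare lookup and update, assuming a 12-bit global history register and a PHT of 2-bit saturating counters; all sizes and names are illustrative, not taken from the slides.

    #include <stdint.h>

    #define GHR_BITS 12
    #define PHT_SIZE (1u << GHR_BITS)

    static uint32_t ghr;             /* global history register (low GHR_BITS used) */
    static uint8_t  pht[PHT_SIZE];   /* 2-bit saturating counters: 0..3 */

    /* First level: index = (low PC bits) XOR (global history). */
    static uint32_t gshare_index(uint64_t pc)
    {
        return ((uint32_t)(pc >> 2) ^ ghr) & (PHT_SIZE - 1);
    }

    /* Second level: a counter value of 2 or 3 predicts taken. */
    int gshare_predict(uint64_t pc)
    {
        return pht[gshare_index(pc)] >= 2;
    }

    /* On branch resolution: train the counter toward the actual outcome,
     * then shift the outcome into the global history. */
    void gshare_update(uint64_t pc, int taken)
    {
        uint32_t idx = gshare_index(pc);
        if (taken  && pht[idx] < 3) pht[idx]++;
        if (!taken && pht[idx] > 0) pht[idx]--;
        ghr = ((ghr << 1) | (uint32_t)(taken != 0)) & (PHT_SIZE - 1);
    }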

6 Branch Prediction Using a 2-bit Counter
[State diagram: four states -- strongly taken, weakly taken, weakly not-taken, strongly not-taken. The two "taken" states predict taken, the two "not-taken" states predict not-taken, and each actual outcome moves the state one step toward that outcome.]
Change prediction after 2 consecutive mistakes.
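
A minimal C sketch of this state machine, assuming the usual encoding of the four states as a saturating counter from 0 (strongly not-taken) to 3 (strongly taken); the encoding is an assumption, not stated on the slide.

    /* Because each outcome moves the state by only one step, the prediction
     * flips only after 2 consecutive mispredictions. */
    typedef unsigned char counter2_t;

    int counter2_predict_taken(counter2_t c)
    {
        return c >= 2;    /* weakly/strongly taken -> predict taken */
    }

    counter2_t counter2_update(counter2_t c, int actually_taken)
    {
        if (actually_taken)
            return (counter2_t)(c < 3 ? c + 1 : 3);   /* saturate at strongly taken */
        else
            return (counter2_t)(c > 0 ? c - 1 : 0);   /* saturate at strongly not-taken */
    }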

7 2-bit Counter: Another Scheme
[State diagram: an alternative 2-bit scheme with the same four states (strongly taken, weakly taken, weakly not-taken, strongly not-taken) and the same predictions, but different transitions out of the weak states on a misprediction.]

8 Review: Dependency Handling in the Pipeline
- Software vs. hardware
  - Software-based instruction scheduling -> static scheduling
  - Hardware-based instruction scheduling -> dynamic scheduling
- What information does the compiler not know that makes static scheduling difficult?
  - Answer: anything that is determined at run time
    - Variable-length operation latency, memory addresses, branch direction

9 Example: Load-Use Dependency
- Consider this sequence; it requires 1 stall:

    LDUR X2, [X1,#20]
    AND  X4, X2, X5
    OR   X8, X3, X6

- Static scheduling re-orders the instructions; no need to stall:

    LDUR X2, [X1,#20]
    OR   X8, X3, X6
    AND  X4, X2, X5

- What if the load sometimes takes 100 cycles to execute?

10 Another Example: Instructions w/ Variable Latencies
[Pipeline diagram: after Fetch and Decode, execution latency varies by operation -- integer add (1 E cycle), integer mul (4 E cycles), FP mul (8 E cycles), and a load that misses in the cache can take far longer -- before the register write-back stages (R, W).]

11 Dependency Handling
- Consider the following two pieces of code:

    Code A:                  Code B:
    IMUL R3 <- R1, R2        LD   R3 <- R1 (0)
    ADD  R3 <- R3, R1        ADD  R3 <- R3, R1
    ADD  R1 <- R6, R7        ADD  R1 <- R6, R7
    IMUL R5 <- R6, R8        IMUL R5 <- R6, R8
    ADD  R7 <- R3, R5        ADD  R7 <- R3, R5

- In both cases, the first ADD stalls the whole pipeline!
  - The ADD cannot dispatch because its source register is unavailable
  - Later independent instructions cannot get executed
- IMUL and LD can take a really long time
  - The latency of LD is unknown until runtime (cache hit vs. miss)

12 How to Do Better?
- Hardware has knowledge of dynamic events on a per-instruction basis (i.e., at a very fine granularity)
  - Cache misses
  - Branch mispredictions
  - Load/store addresses
- Wouldn't it be nice if hardware did the scheduling of instructions?
- Hardware-based dynamic instruction scheduling, enabling OOO execution
  - Tradeoffs vs. static scheduling?

13 Benefits of OOO
[Timing diagrams for the code sequence below, assuming IMUL takes 4 execute cycles and ADD takes 1: with in-order execution the dependent ADDs stall the pipeline and the sequence takes 15 cycles; with out-of-order execution the independent instructions proceed past the waiting ones and the sequence takes 12 cycles.]

    IMUL R3 <- R1, R2
    ADD  R3 <- R3, R1
    ADD  R1 <- R6, R7
    IMUL R5 <- R6, R8
    ADD  R7 <- R3, R5

14 Out-of-Order Execution

15 Out-of-Order Execution
- Idea
  - Move the dependent instructions out of the way of independent ones (s.t. the independent ones can execute)
- Approach
  - Monitor the source values of each instruction
  - When all source values of an instruction are available, fire (i.e., dispatch) the instruction
  - Retire each instruction in program order
- Benefit
  - Latency tolerance: allows independent instructions to execute and complete in the presence of a long-latency operation

16 Illustration of an OOO Pipeline
[Diagram: Fetch and Decode feed a SCHEDULE stage (in order), which dispatches to functional units with different latencies -- integer add, integer mul, FP mul, load/store -- that execute out of order; results pass through a REORDER stage and write-back (in order). A TAG and VALUE broadcast bus connects the units back to the scheduler.]
- Two humps
  - Hump 1: reservation stations (scheduling window)
  - Hump 2: reorder buffer (instruction window or active window)
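
To make the second hump concrete, here is a minimal C sketch of in-order retirement from a circular reorder buffer; the structure, field names, and sizes are illustrative assumptions, not the lecture's design.

    #include <stdbool.h>
    #include <stdint.h>

    #define ROB_SIZE 64

    typedef struct {
        bool    valid;      /* entry allocated to an in-flight instruction */
        bool    completed;  /* execution finished, result ready */
        int     dest_reg;   /* architectural destination register */
        int64_t result;     /* value to commit */
    } rob_entry_t;

    static rob_entry_t rob[ROB_SIZE];
    static int rob_head;    /* oldest in-flight instruction */

    /* Retire strictly in program order: stop at the first entry that has not
     * completed, even if younger entries behind it are already done. */
    void rob_retire(int64_t arch_regfile[])
    {
        while (rob[rob_head].valid && rob[rob_head].completed) {
            arch_regfile[rob[rob_head].dest_reg] = rob[rob_head].result;
            rob[rob_head].valid = false;
            rob_head = (rob_head + 1) % ROB_SIZE;
        }
    }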

17 Dynamic Scheduling: Tomasulo's Algorithm
- Invented by Robert Tomasulo
  - Used in the IBM 360/91 floating-point units
  - Tomasulo, "An Efficient Algorithm for Exploiting Multiple Arithmetic Units," IBM Journal of R&D, Jan. 1967
- Variants are used in many high-performance processors

18 Key Ideas of Tomasulo's Algorithm
1. Register renaming
   - Track true dependencies by linking the consumer of a value to the producer
2. Buffer instructions in reservation stations until they are ready to execute
   - Keep track of the readiness of source values
   - An instruction wakes up and is dispatched to the appropriate functional unit (FU) once all of its sources are ready
     - If multiple instructions are awake, need to select one per FU
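
A minimal C sketch of a reservation-station entry and the wakeup/select step it implies; the entry layout and the oldest-first selection policy are illustrative assumptions, not the lecture's exact design.

    #include <stdbool.h>
    #include <stdint.h>

    typedef struct {
        bool    busy;          /* entry holds an instruction */
        int     op;            /* operation to perform */
        bool    src_ready[2];  /* is each source value available? */
        int     src_tag[2];    /* if not ready: RS entry that will produce it */
        int64_t src_val[2];    /* if ready: the value itself */
    } rs_entry_t;

    /* An instruction "wakes up" when both of its sources are ready. */
    bool rs_ready(const rs_entry_t *e)
    {
        return e->busy && e->src_ready[0] && e->src_ready[1];
    }

    /* Select: pick one ready instruction per functional unit (oldest-first here). */
    int rs_select(rs_entry_t rs[], int n)
    {
        for (int i = 0; i < n; i++)
            if (rs_ready(&rs[i]))
                return i;     /* index of the instruction to dispatch */
        return -1;            /* nothing ready this cycle */
    }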

19 Register Renaming
- Output and anti dependencies are not true dependencies
  - Why? They exist only because there are not enough register IDs (i.e., names) in the ISA
- A register ID is renamed to the reservation station (RS) entry that will hold the register's value
  - Register ID -> RS entry ID
  - Architectural register ID -> physical register ID
  - After renaming, the RS entry ID is used to refer to the register
- This eliminates anti- and output-dependencies
  - As if there were a large number of registers, even though the ISA can only support a small number

20 Register Renaming Using a RAT
- RAT: Register Alias Table (aka Register Rename Table)

    Reg  tag          value       valid?
    X0   Don't care   0           1
    X1   Don't care   1           1
    X2   RS entry 7   Don't care  0
    X3   Don't care   3           1
    X4   RS entry 3   Don't care  0
    X5   RS entry 13  Don't care  0
    X6   Don't care   6           1
    X7   Don't care   7           1
    X8   RS entry 4   Don't care  0
    X9   Don't care   (remaining fields truncated in the transcription)
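
A minimal C sketch of renaming through a RAT like the one above: each architectural register holds either a valid value or the tag of the RS entry that will produce it. Field names and the register count are illustrative assumptions.

    #include <stdbool.h>
    #include <stdint.h>

    #define NUM_ARCH_REGS 32

    typedef struct {
        bool    valid;   /* 1: value holds the latest result */
        int     tag;     /* if !valid: RS entry that will write this register */
        int64_t value;   /* if valid: the register's current value */
    } rat_entry_t;

    static rat_entry_t rat[NUM_ARCH_REGS];

    /* Rename a source operand: read either a value or a tag from the RAT. */
    void rename_source(int arch_reg, bool *ready, int *tag, int64_t *value)
    {
        *ready = rat[arch_reg].valid;
        *tag   = rat[arch_reg].tag;
        *value = rat[arch_reg].value;
    }

    /* Rename the destination: the register now names the new producer, which
     * eliminates WAR and WAW dependences on the old name. */
    void rename_dest(int arch_reg, int rs_entry)
    {
        rat[arch_reg].valid = false;
        rat[arch_reg].tag   = rs_entry;
    }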

21 Better Register Renaming Techniques
- Rename through the ROB
- Rename through a merged register file
- Hinton et al., "The Microarchitecture of the Pentium 4 Processor," Intel Technology Journal, 2001

22 Tomasulo's Machine: IBM 360/91
[Diagram: instructions arrive from the instruction unit; load buffers receive data from memory and store buffers send data to memory; the FP registers and buffers feed reservation stations in front of the FP functional units via an operation bus; results return on the common data bus.]

23 Tomasulo's Algorithm
- If a reservation station is not available, stall; otherwise the instruction plus its renamed operands (source value/tag) is inserted into the reservation station
- While in the reservation station, each instruction:
  - Watches the common data bus (CDB) for the tags of its sources
  - When a tag is seen, grabs the value for that source and keeps it in the reservation station
  - When both operands are available, the instruction is ready to be dispatched
- Dispatch the instruction to the functional unit (FU) when it is ready
  - If multiple instructions are ready at the same time and require the same FU, need logic to select one
- After the instruction finishes in the FU
  - Arbitrate for the CDB
  - Put the tagged value onto the CDB (tag broadcast)
  - The register file, RS, and RAT are connected to the CDB
    - A register contains a tag indicating the latest writer of that register
    - If the tag in the register file, RS, or RAT matches the broadcast tag, write the broadcast value into that entry (and set the valid bit)
  - Reclaim the rename tag (i.e., free the corresponding RS entry)
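
A minimal C sketch of the tag-broadcast step on the common data bus, reusing the illustrative reservation-station and RAT types from the earlier sketches; this is a sketch under those assumptions, not the 360/91's exact logic.

    #include <stdbool.h>
    #include <stdint.h>

    /* Same illustrative types as in the earlier sketches. */
    typedef struct {
        bool busy; int op; bool src_ready[2]; int src_tag[2]; int64_t src_val[2];
    } rs_entry_t;
    typedef struct { bool valid; int tag; int64_t value; } rat_entry_t;

    /* When a functional unit finishes, its result goes onto the CDB together
     * with the tag of the producing RS entry; every waiting consumer whose tag
     * matches captures the value. */
    void cdb_broadcast(int tag, int64_t value,
                       rs_entry_t rs[], int num_rs,
                       rat_entry_t rat[], int num_regs)
    {
        /* Wake up reservation-station entries waiting on this tag. */
        for (int i = 0; i < num_rs; i++)
            for (int s = 0; s < 2; s++)
                if (rs[i].busy && !rs[i].src_ready[s] && rs[i].src_tag[s] == tag) {
                    rs[i].src_val[s]   = value;
                    rs[i].src_ready[s] = true;
                }

        /* Update the RAT / register file only if this tag is still the latest
         * writer of the register; otherwise a younger rename owns it. */
        for (int r = 0; r < num_regs; r++)
            if (!rat[r].valid && rat[r].tag == tag) {
                rat[r].value = value;
                rat[r].valid = true;
            }

        /* The producing RS entry can now be reclaimed (its tag is freed). */
        if (tag >= 0 && tag < num_rs)
            rs[tag].busy = false;
    }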

24 An Exercise

    MUL X3  <- X1, X2
    ADD X5  <- X3, X4
    ADD X7  <- X2, X6
    ADD X10 <- X8, X9
    MUL X11 <- X7, X10
    ADD X5  <- X5, X11

- Assume ADD takes 4 execute cycles and MUL takes 6 execute cycles
- Assume one adder and one multiplier in hardware
- Assume operations are done entirely using registers (no memory access)
- Pipeline stages: F D E W
[Diagram: the IBM 360/91 datapath from the previous slide, repeated for reference.]

25 Drawing Template
[Template for working the exercise: the instruction sequence above, an empty RAT with columns (valid?, tag, value) for registers X1 through X11, ADD reservation-station entries a, b, c, d, and MUL reservation-station entries r, s, t, v.]

26 Cycle 1
[Timing: the first MUL is in Fetch. RAT snapshot: all registers valid, holding their initial values X1 = 1 through X11 = 11; all ADD and MUL reservation-station entries are free.]

27 Cycle 2
[Timing: the MUL is in Decode, the first ADD is in Fetch. RAT/RS snapshot: X3 is renamed to MUL's RS entry r (valid bit cleared); entry r holds both source values (X1 = 1, X2 = 2).]

28 Cycle 3
MUL (in RS entry r) starts to execute since both operands are valid.
[RAT/RS snapshot: ADD X5 is renamed into RS entry a, waiting on tag r for X3 with X4 = 4 already captured; X5 now points to entry a.]

29 Cycle 4
ADD (in RS entry a) waits since its X3 source (tag r) is not yet valid.
[RAT/RS snapshot: ADD X7 is renamed into RS entry b with both sources ready (X2 = 2, X6 = 6); X7 now points to entry b.]

30 Cycle 5
ADD (in RS entry b) starts to execute.
[RAT/RS snapshot: ADD X10 is renamed into RS entry c with both sources ready (X8 = 8, X9 = 9); X10 now points to entry c.]

31 Cycle 6
ADD (in RS entry c) starts to execute.
[RAT/RS snapshot: MUL X11 is renamed into RS entry t, waiting on tags b (X7) and c (X10); X11 now points to entry t.]

32 Cycle 7
MUL (in RS entry t) waits for its sources. Pay attention to how register renaming removes the WAW dependence on X5 (both ADDs write X5).
[RAT/RS snapshot: the last ADD is renamed into RS entry d, waiting on tags a (X5) and t (X11); the RAT entry for X5 is overwritten to point to entry d.]

33 Cycle 8
Results are broadcast through the CDB to wake up dependent instructions (both the RAT and the RS entries are checked for matching tags).
[Timing and RAT/RS snapshot for cycle 8.]

34 Cycle 9
Assuming 2 register write ports and forwarding, we can dispatch the ADD in RS entry a; RS entries r and b are also reclaimed.
[RAT/RS snapshot: X3 is now valid with the MUL result 1 * 2 = 2, and X7 is valid with 2 + 6 = 8; entry a has both sources (2 and 4) and begins executing.]

35 Cycle 10
Now we dispatch the second MUL (in RS entry t).
[RAT/RS snapshot: X10 is valid with 8 + 9 = 17, so entry t has both sources (X7 = 8, X10 = 17) and starts executing.]

36 Cycle 11
[Timing and RAT/RS snapshot for cycle 11; execution continues.]

37 Cycle 12
[Timing and RAT/RS snapshot for cycle 12; execution continues.]

38 Cycle 13
[RAT/RS snapshot: the first ADD's result (tag a, value 2 + 4 = 6) has been captured by RS entry d, which still waits on tag t.]

39 Cycle 14
[Timing and RAT/RS snapshot for cycle 14; the second MUL continues executing.]

40 Cycle 15
[Timing and RAT/RS snapshot for cycle 15; the second MUL continues executing.]

41 Cycle 16
Now we dispatch the last ADD.
[RAT/RS snapshot: the second MUL's result X11 = 8 * 17 = 136 has been broadcast, so RS entry d has both sources (6 and 136) and begins executing.]

42 Cycle 19
[Timing and RAT/RS snapshot for cycle 19; the last ADD is in its final execute cycles.]

43 Cycle 20
[Timing and RAT/RS snapshot: the last ADD writes back, so X5 is now valid with 6 + 136 = 142; all instructions have completed.]
