Lecture: Out-of-order Processors

Topics: branch predictor wrap-up, a basic out-of-order processor with issue queue, register renaming, and reorder buffer

Amdahl's Law
Architecture design is very bottleneck-driven: make the common case fast; do not waste resources on a component that has little impact on overall performance/power.
Amdahl's Law: the performance improvement from an enhancement is limited by the fraction of time the enhancement comes into play.
Example: a web server spends 40% of its time in the CPU and 60% doing I/O. A new processor that is ten times faster yields a 36% reduction in execution time (speedup of 1.56). Amdahl's Law states that the maximum execution time reduction is 40% (max speedup of 1.66).
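The example's arithmetic can be checked with a short Python sketch (the function name and structure are mine, not the lecture's):

```python
def amdahl_speedup(enhanced_fraction, enhancement_factor):
    """Overall speedup when only part of the execution time is sped up."""
    new_time = (1 - enhanced_fraction) + enhanced_fraction / enhancement_factor
    return 1 / new_time

# Web-server example from the slide: 40% of time in the CPU, CPU made 10x faster.
print(amdahl_speedup(0.4, 10))            # ~1.56 (execution time drops by 36%)

# Best case: the CPU portion shrinks to zero (infinite enhancement factor).
print(amdahl_speedup(0.4, float('inf')))  # ~1.67, i.e. at most a 40% time reduction
```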

Principle of Locality
Most programs are predictable in terms of instructions executed and data accessed.
The 90-10 Rule: a program spends 90% of its execution time in only 10% of the code.
Temporal locality: a program will shortly re-visit X.
Spatial locality: a program will shortly visit X+1.

Problem 1
What is the storage requirement for a global predictor that uses 3-bit saturating counters and that produces an index by XOR-ing 12 bits of branch PC with 12 bits of global history?

Problem 1
What is the storage requirement for a global predictor that uses 3-bit saturating counters and that produces an index by XOR-ing 12 bits of branch PC with 12 bits of global history?

The index is 12 bits wide, so the table has 2^12 saturating counters. Each counter is 3 bits wide, so total storage = 3 * 4096 = 12 Kb, or 1.5 KB.
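As a quick check of this count, a tiny sketch (variable names are illustrative):

```python
index_bits = 12                  # 12 PC bits XORed with 12 history bits -> 12-bit index
counter_bits = 3
total_bits = counter_bits * 2 ** index_bits
print(total_bits, total_bits / 1024, total_bits / 8 / 1024)   # 12288 bits = 12 Kb = 1.5 KB
```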

Problem 2
What is the storage requirement for a tournament predictor that uses the following structures:
- a selector that has 4K entries and 2-bit counters
- a global predictor that XORs 14 bits of branch PC with 14 bits of global history and uses 3-bit counters
- a local predictor that uses an 8-bit index into L1 and produces a 12-bit index into L2 by XOR-ing branch PC and local history; the L2 uses 2-bit counters

Problem 2
What is the storage requirement for a tournament predictor that uses the following structures:
- a selector that has 4K entries and 2-bit counters
- a global predictor that XORs 14 bits of branch PC with 14 bits of global history and uses 3-bit counters
- a local predictor that uses an 8-bit index into L1 and produces a 12-bit index into L2 by XOR-ing branch PC and local history; the L2 uses 2-bit counters

Selector = 4K * 2b = 8 Kb
Global = 3b * 2^14 = 48 Kb
Local = (12b * 2^8) + (2b * 2^12) = 3 Kb + 8 Kb = 11 Kb
Total = 67 Kb
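The same tally as a small sketch, with all sizes taken from the problem statement (names are mine):

```python
Kb = 1024  # bits

selector    = 4 * Kb * 2        # 4K entries x 2-bit counters              = 8 Kb
global_pred = 2 ** 14 * 3       # 14-bit index x 3-bit counters            = 48 Kb
local_l1    = 2 ** 8 * 12       # 256 entries x 12-bit local histories     = 3 Kb
local_l2    = 2 ** 12 * 2       # 4K entries x 2-bit counters              = 8 Kb

total = selector + global_pred + local_l1 + local_l2
print(total / Kb)               # 67.0 Kb
```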

Problem 3
For the code snippet below, estimate the steady-state bpred accuracies for the default PC+4 prediction, the 1-bit bimodal, 2-bit bimodal, global, and local predictors. Assume that the global/local preds use 5-bit histories.

do {
  for (i=0; i<4; i++) { increment something }
  for (j=0; j<8; j++) { increment something }
  k++;
} while (k < some large number)

Problem 3
For the code snippet below, estimate the steady-state bpred accuracies for the default PC+4 prediction, the 1-bit bimodal, 2-bit bimodal, global, and local predictors. Assume that the global/local preds use 5-bit histories.

do {
  for (i=0; i<4; i++) { increment something }
  for (j=0; j<8; j++) { increment something }
  k++;
} while (k < some large number)

PC+4: 2/13 = 15%
1b Bim: (2+6+1)/(4+8+1) = 9/13 = 69%
2b Bim: (3+7+1)/13 = 11/13 = 85%
Global: (4+7+1)/13 = 12/13 = 92% (gets confused by 01111 unless you take the branch PC into account while indexing)
Local: (4+7+1)/13 = 12/13 = 92%
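To see where the bimodal numbers come from, here is a small simulation sketch (my own illustration, not from the slides). It replays the per-iteration branch stream, namely the i-loop branch (T,T,T,N), the j-loop branch (seven T then N), and the always-taken while branch, against per-branch 1-bit and 2-bit saturating counters and measures steady-state accuracy.

```python
def simulate(counter_bits, outer_iterations=1000, warmup=100):
    """Per-branch n-bit saturating counters; predict taken if counter >= half of max."""
    max_val = (1 << counter_bits) - 1
    threshold = (max_val + 1) // 2
    counters = {}                 # branch id -> saturating counter value
    correct = total = 0
    for it in range(outer_iterations):
        stream = ([('i', True)] * 3 + [('i', False)] +
                  [('j', True)] * 7 + [('j', False)] +
                  [('while', True)])
        for branch, taken in stream:
            ctr = counters.get(branch, 0)
            prediction = ctr >= threshold
            if it >= warmup:      # count only steady-state behavior
                total += 1
                correct += (prediction == taken)
            # move the counter toward the actual outcome
            counters[branch] = min(max_val, ctr + 1) if taken else max(0, ctr - 1)
    return correct / total

print(simulate(1))   # ~0.69  (9/13)
print(simulate(2))   # ~0.85  (11/13)
```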

An Out-of-Order Processor Implementation
[Block diagram: branch prediction and instr fetch feed the Instr Fetch Queue, then Decode & Rename. Decoded instructions (Instr 1-6) are allocated Reorder Buffer (ROB) entries with temporary registers T1-T6 and placed in the Issue Queue (IQ). The example code R1←R1+R2, R2←R1+R3, BEQZ R2, R3←R1+R2, R1←R3+R2 is renamed to T1←R1+R2, T2←T1+R3, BEQZ T2, T4←T1+T2, T5←T4+T2. Ready instructions issue from the IQ to the ALUs, reading operands from the Register File (R1-R32); results are written to the ROB and tags are broadcast to the IQ.]

Problem 1
Show the renamed version of the following code. Assume that you have 4 rename registers, T1-T4.

R1 ← R2+R3
R3 ← R4+R5
BEQZ R1
R1 ← R1 + R3
R1 ← R1 + R3
R3 ← R1 + R3

Problem 1
Show the renamed version of the following code. Assume that you have 4 rename registers, T1-T4.

R1 ← R2+R3      T1 ← R2+R3
R3 ← R4+R5      T2 ← R4+R5
BEQZ R1         BEQZ T1
R1 ← R1 + R3    T4 ← T1+T2
R1 ← R1 + R3    T1 ← T4+T2
R3 ← R1 + R3    T2 ← T1+R3

Design Details - I
Instructions enter the pipeline in order. No need for branch delay slots if prediction happens in time.
Instructions leave the pipeline in order: all instructions that enter also get placed in the ROB; the process of an instruction leaving the ROB (in order) is called commit; an instruction commits only if it and all instructions before it have completed successfully (without an exception).
To preserve precise exceptions, a result is written into the register file only when the instruction commits; until then, the result is saved in a temporary register in the ROB.
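A minimal sketch of this in-order commit rule (my own code, with illustrative field names): the ROB is a FIFO, and the head entry writes its result to the architectural register file only once it has completed without an exception.

```python
from collections import deque

# Each ROB entry: (dest_reg, value, completed, exception) -- illustrative fields.
rob = deque()
arch_regfile = {f'R{i}': 0 for i in range(1, 33)}

def commit():
    """Retire completed instructions from the head of the ROB, in order."""
    while rob:
        dest, value, completed, exception = rob[0]
        if not completed:
            break                        # head not done: younger finished instrs must wait
        if exception:
            raise RuntimeError(f'precise exception at ROB head (dest {dest})')
        arch_regfile[dest] = value       # architectural state updated only at commit
        rob.popleft()

rob.append(('R1', 5, True, False))       # instr 1: done, writes R1
rob.append(('R2', 7, False, False))      # instr 2: still executing
commit()
print(arch_regfile['R1'], len(rob))      # 5 1  (instr 2 stays until it completes)
```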

Design Details - II
Instructions get renamed and placed in the issue queue. Some operands are available (T1-T6; R1-R32), while others are being produced by instructions in flight (T1-T6).
As instructions finish, they write results into the ROB (T1-T6) and broadcast the operand tag (T1-T6) to the issue queue; instructions now know whether their operands are ready.
When a ready instruction issues, it reads its operands from T1-T6 and R1-R32 and executes (out-of-order execution).
Can you have WAW or WAR hazards? By using more names (T1-T6), name dependences can be avoided.
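A small sketch of this wakeup-and-select idea (again an illustration with made-up entry fields, not the lecture's code): completing instructions broadcast their destination tag, waiting entries mark matching source operands as ready, and any entry with all sources ready may issue out of order.

```python
class IQEntry:
    def __init__(self, op, dest_tag, src_tags):
        self.op = op
        self.dest_tag = dest_tag
        # R1-R32 sources are architectural and already available;
        # T1-T6 sources wait for a tag broadcast.
        self.waiting = {t for t in src_tags if t.startswith('T')}

issue_queue = [
    IQEntry('add', 'T4', ['T1', 'T2']),   # T4 <- T1 + T2
    IQEntry('add', 'T5', ['T4', 'T2']),   # T5 <- T4 + T2
]

def broadcast(tag):
    """Called when an instruction completes: wake up matching source operands."""
    for e in issue_queue:
        e.waiting.discard(tag)

def select():
    """Issue (remove) every entry whose operands are all ready."""
    ready = [e for e in issue_queue if not e.waiting]
    for e in ready:
        issue_queue.remove(e)
    return [e.dest_tag for e in ready]

broadcast('T1'); broadcast('T2')
print(select())      # ['T4']  (the second add still waits on T4)
broadcast('T4')
print(select())      # ['T5']
```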

Design Details - III
If instr-3 raises an exception, wait until it reaches the top of the ROB. At this point, R1-R32 contain results for all instructions up to instr-3; save the registers, save the PC of instr-3, and service the exception.
If a branch is a mispredict, flush all instructions after the branch and start on the correct path. Mispredicted instrs will not have updated the registers (the branch cannot commit until it has completed, and the flush happens as soon as the branch completes).
Potential problems?

Managing Register Names
Temporary values are stored in the register file and not the ROB.
Logical Registers R1-R32; Physical Registers P1-P64.
At the start, R1-R32 can be found in P1-P32. Instructions stop entering the pipeline when P64 is assigned.

R1 ← R1+R2      P33 ← P1+P2
R2 ← R1+R3      P34 ← P33+P3
BEQZ R2         BEQZ P34
R3 ← R1+R2      P35 ← P33+P34

What happens on commit?

The Commit Process
On commit, no copy is required. The register map table is updated: the committed value of R1 is now in P33 and not P1; on an exception, P33 is copied to memory and not P1.
An instruction in the issue queue need not modify its input operand when the producer commits.
When instruction-1 commits, we no longer have any use for P1; it is put in a free pool and a new instruction can now enter the pipeline. For every instr that commits, a new instr can enter the pipeline; the number of in-flight instrs is a constant = number of extra (rename) registers.
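Tying the last two slides together, here is a minimal renaming sketch under this physical-register scheme (illustrative code, assuming a simple FIFO free list): a speculative map from R names to P names, a free list, and a commit step that returns the previously mapped physical register to the pool.

```python
# Speculative rename map: at the start, R1-R32 live in P1-P32 (slide 16).
rename_map = {f'R{i}': f'P{i}' for i in range(1, 33)}
free_list = [f'P{i}' for i in range(33, 65)]     # P33-P64 start out free
prev_mapping = []                                 # (dest, previously mapped P reg), per instr

def rename(dest, srcs):
    """Rename one instruction: read sources through the map, give dest a fresh P reg."""
    phys_srcs = [rename_map[s] for s in srcs]
    new_p = free_list.pop(0)                      # real hardware stalls if the list is empty
    prev_mapping.append((dest, rename_map[dest]))
    rename_map[dest] = new_p
    return new_p, phys_srcs

def commit_oldest():
    """On commit no value is copied; the dest's previous physical reg is freed."""
    dest, old_p = prev_mapping.pop(0)
    free_list.append(old_p)                       # e.g. P1 returns to the pool when instr-1 commits

# Slide 16's example (the BEQZ, which has no destination, is omitted here):
print(rename('R1', ['R1', 'R2']))   # ('P33', ['P1', 'P2'])
print(rename('R2', ['R1', 'R3']))   # ('P34', ['P33', 'P3'])
print(rename('R3', ['R1', 'R2']))   # ('P35', ['P33', 'P34'])
commit_oldest()                      # instruction 1 commits; P1 is free again
```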

The Alpha 21264 Out-of-Order Implementation
[Block diagram: the same pipeline as slide 10 (branch prediction and instr fetch, Instr Fetch Queue, Decode & Rename, ROB with Instr 1-6, Issue Queue, ALUs), but with a physical Register File P1-P64 and two map tables: a Speculative Reg Map (e.g. R1→P36, R2→P34) updated at rename, and a Committed Reg Map (e.g. R1→P1, R2→P2) updated at commit. The example code is renamed to P33←P1+P2, P34←P33+P3, BEQZ P34, P35←P33+P34, P36←P35+P34. Results are written to the regfile and tags are broadcast to the IQ.]

Problem 2
Show the renamed version of the following code. Assume that you have 36 physical registers and 32 architected registers.

R1 ← R2+R3
R3 ← R4+R5
BEQZ R1
R1 ← R1 + R3
R1 ← R1 + R3
R3 ← R1 + R3
R4 ← R3 + R1

Problem 2
Show the renamed version of the following code. Assume that you have 36 physical registers and 32 architected registers.

R1 ← R2+R3      P33 ← P2+P3
R3 ← R4+R5      P34 ← P4+P5
BEQZ R1         BEQZ P33
R1 ← R1 + R3    P35 ← P33+P34
R1 ← R1 + R3    P36 ← P35+P34
R3 ← R1 + R3    P1 ← P36+P34
R4 ← R3 + R1    P3 ← P1+P36

(The last two instructions must wait for the first two to commit so that P1 and P3 return to the free pool.)
