E0-243: Computer Architecture

E0-243: Computer Architecture. Lecture 1: ILP Processors

ILP Architectures: superscalar architecture, VLIW architecture, EPIC, subword parallelism.

Motivation for ILP: how do we achieve a CPI below 1? By exploiting Instruction-Level Parallelism (ILP): multiple instructions are issued and executed each cycle, as in superscalar and VLIW architectures, using either a run-time (hardware) or a compile-time (software) approach. The processor must still honor the sequential program specification, but it removes non-essential sequentiality and achieves parallel execution wherever possible.

Instruction-Level Parallelism: Very Long Instruction Word (VLIW) architecture. Multiple operations are packed into a single instruction at compile time; the compiler (or programmer) is responsible for exposing ILP, so the hardware is less complex. Examples: Cydra, Multiflow, TI C6x, Trimedia. Superpipelining: the stages of the pipeline are themselves further pipelined, resulting in deeper pipelines and faster clocks.

Introduction (contd.): superscalar architecture. Multiple instructions are fetched, decoded, issued and executed each cycle; hardware detects at run time which instructions can execute in parallel in a given cycle. Issue may be in-order or out-of-order. Examples: most modern processors are superpipelined, superscalar designs, e.g. DEC 21x64, MIPS R8000, R10000, PA-RISC, PowerPC 620, Pentium, UltraSPARC I and II.

Comparison (timing diagram): overlap of the IF, ID, EX, MEM and WB stages under four organizations: a simple pipeline, a superpipelined machine (stages subdivided), a superscalar machine (multiple instructions per stage per cycle), and a VLIW machine (one wide instruction carrying multiple EX operations).

Pipeline Stages (diagram): Fetch → Instruction Buffer → Decode → Dispatch Buffer → Dispatch → Issuing Buffer → Execute → Completion Buffer → Complete → Store Buffer → Retire. Source: M. Lipasti @ U Wisc.

Superscalar Pipeline Stages (diagram): Fetch → Instruction/Decode Buffer → Decode → Dispatch Buffer → Dispatch → Reservation Stations → Issue → Execute (branch and other units) → Finish → Reorder/Completion Buffer → Complete → Store Buffer → Retire. Source: M. Lipasti @ U Wisc.

Superscalar Execution Model (diagram): static program → fetch & decode → instruction dispatch → instruction issue → instruction execution → instruction reorder & commit. Dispatched instructions wait in the instruction window, and true data dependences constrain when they may issue.

Microarchitecture Overview (block diagram): the memory interface and predecode logic feed the instruction cache and instruction buffer; instructions are decoded, renamed and dispatched into instruction queues that feed the integer and FP functional units; a branch predictor steers fetch, and the register file and reorder buffer hold results.

IBM Power4 Diversified Pipelines (diagram): the I-cache and fetch queue, with branch scan and branch prediction, feed decode; instructions are dispatched into separate issue queues (FP, FX/LD 1, FX/LD 2, BR/CR) backed by a reorder buffer, and issue to the FP1/FP2, FX1/FX2, LD1/LD2, CR and BR units, with a store queue in front of the D-cache. Source: M. Lipasti @ U Wisc.

Impediments to High IPC (diagram): instruction flow through the I-cache, branch predictor, instruction buffer and decode; register data flow through the integer, floating-point, media and memory units with the reorder buffer (ROB); and memory data flow through the store queue and D-cache in the execute and commit stages. Source: M. Lipasti @ U Wisc.

MIPS R10000 Issue Queues Source: M. Lipasti @ U Wisc.

R10000 Characteristics: 298 mm² die in a 0.35 µm process, 6.8 M transistors (2.4 M CPU + 4.4 M cache). Fetches and decodes 4 instructions every cycle; out-of-order execution. Five functional units: a non-blocking load/store unit, two integer units, one FP adder and one FP multiplier. 32 KB I-cache + 32 KB D-cache (2-way set associative), plus an external 2-way L2 cache on a 128-bit synchronous bus. Branch prediction: 512-entry 2-bit BHT, with speculative execution past up to 4 pending branches.

Instruction Fetch: multiple instructions are fetched from the I-cache into the instruction buffer; e.g., the R10K fetches 4 word-aligned instructions from a 16-word I-cache line each cycle. Fetch bandwidth may be lower than this because of a branch instruction among the fetched instructions, a fetch that starts in the middle of a cache line, or an I-cache miss.

MIPS R10000 Instruction Fetch (figure).

Branch Prediction: predecode logic identifies branch instructions early, and branches are predicted through a Branch History Table indexed by the PC value (and previous pattern history). The R10K has a 512-entry 2-bit BHT and supports speculative execution past up to 4 pending branches. A branch stack saves the alternate branch address and the integer and FP register remap tables for each pending branch. Every speculatively executed instruction carries a 4-bit branch mask; when a branch is found to be mispredicted, all instructions whose corresponding branch-mask bit is set are squashed.
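
As a rough illustration of the branch-mask mechanism just described, here is a minimal Python sketch; the names (Instr, resolve_branch) and data layout are hypothetical, not the R10000's actual structures. Each speculative instruction carries a mask with one bit per pending branch; when a branch resolves, that bit is either cleared (correct prediction) or used to squash the wrong-path instructions (misprediction).

    # Sketch of R10000-style branch masks (hypothetical data structures).
    MAX_PENDING_BRANCHES = 4          # R10K allows up to 4 unresolved branches

    class Instr:
        def __init__(self, name, br_mask=0):
            self.name = name
            self.br_mask = br_mask    # bit i set => depends on pending branch i
            self.squashed = False

    def resolve_branch(window, branch_bit, mispredicted):
        """Called when the branch owning 'branch_bit' resolves."""
        for ins in window:
            if ins.br_mask & (1 << branch_bit):
                if mispredicted:
                    ins.squashed = True                 # annul wrong-path instruction
                else:
                    ins.br_mask &= ~(1 << branch_bit)   # no longer speculative w.r.t. this branch

    # Usage: two instructions fetched past pending branch 0, which turns out mispredicted.
    window = [Instr("add", br_mask=0b0001), Instr("load", br_mask=0b0001)]
    resolve_branch(window, branch_bit=0, mispredicted=True)
    print([(i.name, i.squashed) for i in window])       # both squashed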

Superpipelined Superscalar (R10000 pipeline diagram): instruction fetch stages IF-1 and IF-2, Decode/Remap (using the Active List and Free List), Issue, Register Read, then execution down the integer ALU pipeline, the load/store pipeline (address calculation, TLB & cache access, tag check), or the FP pipeline (align, add, align), each ending in Write Back.

Register Data Dependences: program data dependences cause hazards: true dependences (RAW), antidependences (WAR) and output dependences (WAW). RAW dependences must be honored for correctness; WAR and WAW dependences are false (name) dependences and can be eliminated by register renaming. The slide's example sequence includes r8 <- load(r3) and r3 <- r3 + 4, annotated with RAW, WAR and WAW arrows (for instance, the later write to r3 forms a WAR with the load's read of r3). Source: M. Lipasti @ U Wisc.

Instruction Decode: decodes instructions from the instruction buffer; detects RAW hazards and eliminates WAR and WAW hazards through dynamic register renaming. WAR and WAW hazards occur because registers are reused: the logical register set is limited (e.g., 32 integer and 32 FP registers) while more physical registers/storage locations are available. The logical destination register is replaced by a free physical register, and any subsequent instruction that uses the value reads that physical register/storage location.

Register Renaming Example:
mult r2 <- r4, r5
add r8 <- r2, r4
sub r4 <- r1, r6
div r2 <- r19, r20
There is a WAR dependence on r4 (sub cannot proceed until add has read r4) and a WAW dependence on r2 (div cannot proceed until mult has completed). Renaming destinations to physical registers removes these name dependences:
mult R2 <- R4, R5
add R8 <- R2, R4
sub R14 <- R1, R6
div R25 <- R19, R20
WAR and WAW dependences can thus be eliminated using register renaming.

Register Renaming Using Physical Registers (typically about twice the number of logical registers): a register map table maintains the association between logical and physical registers, and a free list holds the available physical registers. Each source operand register is replaced by the corresponding physical register, and each (logical) destination register is assigned a free physical register; a physical register is reclaimed after all uses of its value. Before renaming add r3 <- r3, r4, the map table holds r0→R8, r1→R7, r2→R5, r3→R1, r4→R9 and the free list is {R2, R6, R13}; the instruction is renamed to add R2 <- R1, R9, after which r3 maps to R2 and the free list is {R6, R13}. A Reorder Buffer (ROB) keeps instructions in program order (for in-order completion). Examples: MIPS R10000/R12000, DEC 21264.
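
A minimal Python sketch of map-table-plus-free-list renaming, using the slide's register values; the function name rename and the data layout are illustrative, not the R10000's actual hardware structures.

    # Sketch of map-table + free-list renaming (illustrative, not cycle-accurate).
    from collections import deque

    map_table = {"r0": "R8", "r1": "R7", "r2": "R5", "r3": "R1", "r4": "R9"}
    free_list = deque(["R2", "R6", "R13"])

    def rename(dst, src1, src2):
        """Rename one instruction 'dst <- src1 op src2'."""
        # Sources are read through the current mapping first.
        p_src1, p_src2 = map_table[src1], map_table[src2]
        # The destination gets a fresh physical register from the free list.
        p_dst = free_list.popleft()
        old_dst = map_table[dst]   # remembered (e.g. in the active list) for later reclamation
        map_table[dst] = p_dst
        return p_dst, p_src1, p_src2, old_dst

    # The slide's example: add r3 <- r3, r4  becomes  add R2 <- R1, R9
    print(rename("r3", "r3", "r4"))            # ('R2', 'R1', 'R9', 'R1')
    print(map_table["r3"], list(free_list))    # R2 ['R6', 'R13']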

Register Renaming in R10000 (figure): the FP map table (16 read ports + 4 write ports) renames the fs/ft/fr/fd fields of an FLTX-unit instruction such as MADD, drawing new names from the FP free register list; renamed instructions enter the FP instruction queue (with busy/ready bits for OpA, OpB, OpC and the destination, plus the branch mask), and the Active List records the logical destination and the old physical destination along with exception and condition-code fields. Renaming complexity and renaming delay both grow with the issue width (IW).

Alternative Register Renaming: use the Reorder Buffer as temporary storage for speculative values, with the number of physical registers equal to the number of logical registers. Each destination register is assigned an entry at the tail of the ROB; result values are written into the ROB when instructions execute, and when an instruction reaches the head of the ROB its value is written into the (architectural) register. Source operands are read from a ROB entry or from the register file, as indicated by the register map table. Before renaming add r3 <- r3, r4, the map table maps r3 to ROB entry 6 (an older pending write); the instruction is renamed to add rob8 <- rob6, r4, and afterwards the map table maps r3 to rob8. Examples: Pentium Pro, HP PA-8000, PowerPC 604, SPARC64.
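
A corresponding sketch of the ROB-based variant (Python, illustrative names, not any particular processor's implementation), reproducing the slide's add r3 <- r3, r4 example:

    # Sketch of ROB-based renaming (illustrative): destination registers are
    # renamed to ROB entries; the architectural register is written only when
    # the instruction reaches the head of the ROB (commit).
    map_table = {"r3": "rob6"}   # r3 has an older pending write in ROB entry 6
    tail = 8                     # next free ROB entry (rob8), as in the slide

    def rename_with_rob(dst, srcs):
        global tail
        # Sources read from a ROB entry if a pending write exists, else from the register file.
        renamed_srcs = [map_table.get(s, s) for s in srcs]
        dst_entry = f"rob{tail}"             # destination gets the tail ROB entry
        map_table[dst] = dst_entry
        tail += 1
        return dst_entry, renamed_srcs

    # The slide's example: add r3 <- r3, r4  becomes  add rob8 <- rob6, r4
    print(rename_with_rob("r3", ["r3", "r4"]))   # ('rob8', ['rob6', 'r4'])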

Superpipelined Superscalar (R10000 pipeline diagram, repeated from earlier as a roadmap for the following slides).

In-order vs. Out-of-Order Issue. Consider:
load R2 <- M(R3)
mult R6 <- R4, R2
sub R8 <- R6, R1
store M(R9) <- R8
add R3 <- R3, 4
add R9 <- R9, 4
With in-order issue, the RAW dependence of mult on the load stalls issue, and the following instructions (including the two independent adds) stall behind it. With out-of-order issue, only the dependent chain (mult, sub, store) waits, while the two independent adds proceed to the functional units for execution. In-order issue examples: MIPS R8000, SPARC, DEC 21064/21164. Out-of-order issue examples: MIPS R10000, PowerPC 620, DEC 21264.

Instruction Dispatch & Issue: decoded and renamed instructions are dispatched to instruction queues. The organization may be a single instruction queue for all instruction types (typical of in-order issue processors), separate queues per instruction type (e.g., integer, FP and load/store; typical of out-of-order issue processors), or reservation stations per functional unit or per FU type (also typical of out-of-order issue processors). The slide's diagram shows these options feeding the load/store, integer and FP units from dispatch.

Instruction Issue: Wakeup & Select. Instruction issue: an instruction waits in the instruction queue/reservation station until its data dependences are satisfied (indicated by ready bits in the queue/reservation-station entry) and a functional unit is available. Instruction select: among the ready instructions, a subset is selected (based on some priority) and issued to the functional units.

Wakeup & Select in R10000 (figure): destination tags broadcast from the functional units (and branch verify) are compared against the OpA/OpB/OpC tags of every queue entry to set the corresponding ready bits; priority logic then selects among the ready entries. Each entry holds the function, immediate, branch mask, tag, operand fields with ready bits, and the destination.

Dependence Checking (figure): within a fetch group, each trailing instruction's source registers (Src0, Src1) are compared against the destination registers of all leading instructions, to detect dependences on those leading instructions. Source: M. Lipasti @ U Wisc.
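
The comparator array above can be expressed as a short Python sketch (the helper check_group is hypothetical): each trailing instruction's sources are compared against the destinations of all leading instructions in the same group.

    # Sketch of intra-group dependence checking: compare each trailing
    # instruction's sources against the destinations of all leading instructions.
    def check_group(group):
        """group: list of (dest, src0, src1). Returns RAW pairs (leader, trailer)."""
        raw = []
        for t in range(1, len(group)):
            _, s0, s1 = group[t]
            for l in range(t):                 # all leading instructions
                dest, _, _ = group[l]
                if dest in (s0, s1):
                    raw.append((l, t))
        return raw

    group = [("r2", "r4", "r5"),   # mult r2 <- r4, r5
             ("r8", "r2", "r4"),   # add  r8 <- r2, r4   (RAW on r2 with instruction 0)
             ("r4", "r1", "r6")]   # sub  r4 <- r1, r6
    print(check_group(group))      # [(0, 1)]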

Wakeup & Select. Wakeup phase of the issue stage: the result tag of each completing instruction is broadcast to all waiting instructions; every instruction compares the broadcast tags against its source tags and sets its ready bits accordingly. This is typically implemented as a CAM, and the wakeup logic complexity grows with the instruction window size and the issue width. Select phase of the issue stage: choose all or some of the ready instructions for execution, depending on age and execution-unit availability; the selection logic complexity also grows with the window size. Challenges: the wakeup and select phases cannot be pipelined (a selected instruction must wake its dependants in the very next cycle), and the broadcast latency grows quadratically with issue-queue size and issue width.
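
A minimal sketch of the two phases (Python; the entry format and the oldest-first selection policy are illustrative assumptions, whereas real designs use CAMs and more elaborate priority logic):

    # Sketch of wakeup-and-select (illustrative, single cycle, no pipelining).
    issue_queue = [
        # each entry: op, source tags, per-source ready bits
        {"op": "mult", "src": ["R4", "R5"], "rdy": [True, True]},
        {"op": "add",  "src": ["T7", "R4"], "rdy": [False, True]},   # waits on result tag T7
    ]

    def wakeup(queue, broadcast_tags):
        """Compare every waiting source tag against the broadcast result tags (CAM-like)."""
        for e in queue:
            for i, tag in enumerate(e["src"]):
                if tag in broadcast_tags:
                    e["rdy"][i] = True

    def select(queue, issue_width):
        """Pick up to issue_width ready entries; priority here is simply queue order (oldest first)."""
        ready = [e for e in queue if all(e["rdy"])]
        chosen = ready[:issue_width]
        for e in chosen:
            queue.remove(e)
        return chosen

    wakeup(issue_queue, broadcast_tags={"T7"})                        # a completing instruction broadcasts T7
    print([e["op"] for e in select(issue_queue, issue_width=2)])      # ['mult', 'add']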

Instruction Issue & Execution in R10000 (figure): the integer instruction queue issues to the integer functional units, which read the register file; the ALU-1 and ALU-2 results are bypassed back to the operand inputs and written to the register file/reorder buffer.

Bypass Logic: results are bypassed from the earliest stage of a functional unit in which they are available. With S stages between the result-producing stage and the result-writing stage, a result must be forwardable from all functional units (of the same type) to all operand inputs of all functional units for S cycles, so the bypass-network complexity grows as IW² × S. The slide's figure shows a functional unit followed by write-back buffering stages (WB1, WB2) from which results are forwarded.
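
As a rough count of why the network scales this way (assuming IW functional units, two source operands per unit, and results forwardable from S pipeline stages): there are IW × S in-flight results to forward, and each must be able to reach either operand input of any of the IW units, giving roughly 2 × IW × (IW × S) = 2 × S × IW² bypass paths, i.e. growth proportional to IW² × S.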

Superpipelined Superscalar (R10000 pipeline diagram, repeated from earlier as a roadmap for the following slides).

Load/Store Queue: memory operations involve address calculation and translation (using the TLB). The load/store queue is typically FIFO, to preserve load/store dependences under out-of-order execution. For example, with store M(r8) <- r3 followed by load r6 <- M(r16), whether the load depends on the store cannot be known until both addresses have been computed and compared. The slide's figure shows the address instruction queue feeding address calculation & translation, then a load buffer and a store buffer with an address compare between them before stores go to memory. Memory renaming (unlike register renaming) is difficult.
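
A sketch of a conservative disambiguation policy in such a queue (Python, hypothetical names; the R10000's actual mechanism differs in detail): a load compares its address against all older stores, forwards from a matching store, and stalls if any older store address is still unknown.

    # Sketch of load-store disambiguation with a FIFO store queue (illustrative).
    store_queue = []   # entries: {"addr": computed address or None, "data": value}

    def issue_store(addr, data):
        store_queue.append({"addr": addr, "data": data})

    def issue_load(addr):
        """A load compares its address against all older, un-retired stores."""
        for st in reversed(store_queue):       # youngest older store first
            if st["addr"] is None:
                return "stall"                 # unknown store address: cannot disambiguate
            if st["addr"] == addr:
                return st["data"]              # store-to-load forwarding
        return f"mem[{addr}]"                  # no conflict: read from cache/memory

    issue_store(addr=0x100, data=42)
    print(issue_load(0x100))    # 42 (forwarded from the older store)
    print(issue_load(0x200))    # falls through to memory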

Commit Stage: maintains the appearance of sequential execution under speculative and out-of-order execution, and supports precise interrupts. Instructions are typically committed in program order (from the Reorder Buffer). On a branch misprediction, speculatively executed instructions and their effects are annulled: with physical-register renaming, restoring the register map table (saved at the time of the branch prediction) suffices; if the ROB is used for register renaming, the corresponding part of the Reorder Buffer is squashed. Memory consistency (reading): Shared Memory Consistency Models: A Tutorial, S. Adve and K. Gharachorloo, IEEE Computer, 1995.
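
A minimal sketch of in-order commit from the ROB (Python, illustrative; the commit width of 4 is an assumption): only a completed head-of-ROB instruction may update architectural state, so younger completed instructions wait behind an unfinished head.

    # Sketch of in-order commit from the reorder buffer (illustrative).
    from collections import deque

    rob = deque()    # program order; each entry: {"dest": reg, "value": v, "done": bool}

    def commit(regfile, commit_width=4):
        """Retire completed instructions strictly from the head of the ROB."""
        retired = 0
        while rob and retired < commit_width:
            head = rob[0]
            if not head["done"]:
                break                                   # head not finished: younger instructions must wait
            regfile[head["dest"]] = head["value"]       # architectural state updated in order
            rob.popleft()
            retired += 1
        return retired

    rob.extend([{"dest": "r3", "value": 7, "done": True},
                {"dest": "r8", "value": 0, "done": False}])
    regfile = {}
    print(commit(regfile), regfile)    # 1 {'r3': 7}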

Microarchitecture Overview (block diagram, repeated from earlier): predecode, instruction cache and instruction buffer feed decode, rename & dispatch into the instruction queues, which issue to the integer and FP functional units; the branch predictor, register file and reorder buffer complete the picture.

Recall: Register Renaming in R10000 (figure, repeated from earlier): the FP map table (16 read ports + 4 write ports), FP free register list, FP instruction queue and Active List implement renaming; renaming complexity and delay both grow with the issue width (IW).