Superscalar Processors (Ch 14)

Topics: limitations and hazards, instruction issue policy, register renaming, branch prediction; example processors: PowerPC, Pentium 4.

Superscalar Processing (5)
- Basic idea: more than one instruction completion per cycle
- Aimed at speeding up scalar processing: use many pipelines (Fig. 14.2; Fig. 13.2 [Stal99]), not just more pipeline phases
- Many instructions are in the execution phase simultaneously
  - parallelism is needed also in the earlier and later phases
  - instructions may not execute (completely) in the given order
- Multiple pipelines raise the question: when can an instruction be executed?
- Fetch many instructions at the same time
  - memory access must not become a bottleneck (Fig. 14.1; Fig. 13.1 [Stal99])

Why couldn't we execute this instruction right now? (4)
- (True) data dependency, e.g. (Fig. 14.3; Fig. 13.3 [Stal99]):
      load r4, salary(r6)
      mul  r2, r4, r10
- Procedural or control dependency
  - even more costly than with a normal pipeline: now we may waste more than one instruction!
- Resource conflict: there is no available circuit right now (memory buffer, FP adder, register file port)
- Usual solution: circuits to detect the problem and stall the pipeline when needed

New dependency for the superscalar case? (8)
- Name dependency: two instructions use the same data item (a register or a memory location), but no value is passed from one instruction to the other
  - each instruction has all of its correct data available and each individual result is the one intended, yet the overall result is not the one intended
- Two cases: output dependency and antidependency (examples on the next two slides; a small classifier for these cases is sketched after the examples)
- What if there are aliases? E.g., two registers point to the same physical address

Output Dependency
- Some earlier instruction has not yet finished writing to the same location (register) that we want to write to
- Execution-time semantics are determined by the original order of the machine instructions, so that order must be preserved:
      read r1, X
      add  r2, r1, r3
      add  r1, r4, r5
- We want r1 to hold the sum of r4 and r5 after all three instructions have executed (not the value of variable X)

Antidependency
- Some earlier instruction has not yet finished reading from the same location that we want to write to, so the order must be preserved:
      mv  r2, r1
      add r1, r4, r5
- We want r2 to hold the original value of r1
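A minimal sketch (not part of the original slides; the function and the tuple format are illustrative assumptions) of how the dependency kinds above can be classified from each instruction's destination and source registers:

```python
# Rough classifier (illustrative only) for the dependency kinds above,
# based on each instruction's destination and source registers.

def classify_dependency(earlier, later):
    """earlier/later: (dest, sources) tuples, e.g. ('r1', ['r4', 'r5'])."""
    e_dest, e_srcs = earlier
    l_dest, l_srcs = later
    kinds = []
    if e_dest in l_srcs:                  # later reads what earlier writes
        kinds.append("true (data) dependency")
    if e_dest == l_dest:                  # both write the same location
        kinds.append("output dependency")
    if l_dest in e_srcs:                  # later writes what earlier still reads
        kinds.append("antidependency")
    return kinds or ["independent"]

# The antidependency example above: mv r2, r1  followed by  add r1, r4, r5
print(classify_dependency(("r2", ["r1"]), ("r1", ["r4", "r5"])))
# -> ['antidependency']
```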

Machine Parallelism (2)
- Instruction-level parallelism (ILP): how much parallelism there is in the program; the theoretical maximum
- Machine parallelism: how much parallelism is achieved by a specific machine or architecture?
  - at most as much as the instruction-level parallelism
  - limited by dependencies, physical resources, or a non-optimized (i.e., stupid?) design

Superscalar Processor (4)
- Instruction dispatch: get the next available executable instruction from the instruction stream
- Window of execution: all the instructions that are considered for issue
- Instruction issue: allow an instruction to start execution
  - the execution and completion phases should then continue with no stalls; if any stalls are needed, take them before issue
- Instruction reorder and commit (retiring)
  - hopefully all system state changes happen here!
  - the last chance to change the order or abandon results

Instruction Dispatch (7)
- Dispatch whenever there are both
  - available slots in the window of execution, and
  - ready instructions from the prefetch or branch prediction buffer
- Dispatched instructions need not yet be able to execute without stalls; all dependencies do not need to be solved at this point

Window of Execution
- Bigger is better
  - easier to find a good candidate that can be issued right now
  - but more work to figure out all the dependencies
- Too small a window limits machine parallelism significantly
  - e.g., the 6th instruction could be issued, but only the next 4 are even considered

Instruction Issue (3)
- Select the next instruction(s) for execution
- First check everything, so that execution can proceed to the end with no stalls (stops):
  - resource conflicts
  - data dependencies (e.g., the data in R4 is not there yet, but it will be in three cycles, when this instruction needs it)
  - control dependencies
  - output dependencies
  - antidependencies
- This keeps the instruction execution pipelines simpler: they have no need to check for dependencies (a pre-issue check of this kind is sketched after these slides)

Instruction Issue Policies (3)
- Instruction fetch policy: constraints on how many instructions are considered for dispatch at a time
  - e.g., 2 instructions are fetched and decoded at a time, and both must be dispatched before the next 2 are fetched
- Instruction execution policy: constraints on the order in which dispatched instructions may start execution
- Completion policy: constraints on the order of completions
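The pre-issue check referenced above might look roughly like the following; this is a simplified sketch under assumed data structures (sets of busy units and pending register reads/writes), not the actual issue logic of any processor.

```python
# Minimal sketch (illustrative, not from the slides) of the pre-issue check:
# an instruction issues only if its functional unit is free, its sources do not
# wait on earlier pending writes, and no earlier instruction still writes or
# reads the register it would overwrite.

def can_issue(instr, busy_units, pending_writes, pending_reads):
    """instr: dict with 'unit', 'dest', 'srcs'.
    busy_units: functional units currently in use (resource conflicts).
    pending_writes: registers that earlier instructions will still write.
    pending_reads: registers that earlier instructions still need to read."""
    if instr["unit"] in busy_units:                      # resource conflict
        return False
    if any(r in pending_writes for r in instr["srcs"]):  # true data dependency
        return False
    if instr["dest"] in pending_writes:                  # output dependency
        return False
    if instr["dest"] in pending_reads:                   # antidependency
        return False
    return True

# Example: 'mul r2, r4, r10' cannot issue while 'load r4, ...' is still pending.
print(can_issue({"unit": "int_mul", "dest": "r2", "srcs": ["r4", "r10"]},
                busy_units=set(), pending_writes={"r4"}, pending_reads=set()))
# -> False
```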

Example 1 of Issue Policy (7)
- In-order issue with in-order completion: the same as purely sequential execution (Fig. 14.4 (a); Fig. 13.4 (a) [Stal99]); no instruction window needed
- Instructions are issued only in the original order, though many can be issued at the same time
- Instructions are completed only in the original order, though many can be completed at the same time
- Check before issue: resource conflicts, data and control dependencies, and execution time, so that completions occur in order: wait long enough that earlier instructions complete first
- Pentium II: out-of-order execution in the middle for micro-ops (µops), with in-order completion

Example 2 of Issue Policy (5)
- In-order issue with out-of-order completion (Fig. 14.4 (b); Fig. 13.4 (b) [Stal99]); no instruction window needed
  - issue in the original order, though many can be issued at the same time
  - allow executions to complete before those of earlier instructions
- Check before issue: resource conflicts, data and control dependencies
  - output dependencies: wait long enough to resolve them

Example 3 of Issue Policy (5): the real superscalar processor
- Out-of-order issue with out-of-order completion (Fig. 14.4 (c); Fig. 13.4 (c) [Stal99])
  - issue in any order, and many can be issued at the same time; an instruction window is used for dynamic instruction scheduling
  - allow executions to complete before those of earlier instructions
- Check before issue: resource conflicts, data and control dependencies
  - output dependencies: wait for earlier instructions to write their results before we overwrite them
  - antidependencies: wait for instructions that are earlier in program order (even if issued later) to pick up their arguments before we overwrite them

Get Rid of Name Dependencies (3)
- Problem: independent data stored in locations with the same name
  - often a storage conflict: the same register is used for two different purposes
  - results in wait stages (pipeline stalls, "bubbles")
- Cure: register renaming
  - the actual registers may be different from the named registers
  - actual registers are allocated dynamically to the named registers, and allocated so that name dependencies are avoided
- Cost: more registers, plus circuits to allocate and keep track of the actual registers

Register Renaming (3)
- Original code:
      R3 := R3 + R5;  (I1)
      R4 := R3 + 1;   (I2)
      R3 := R5 + 1;   (I3)
      R7 := R3 + R4;  (I4)
  - output dependency: I3 cannot complete before I1 has completed first
  - antidependency: I3 cannot complete before I2 has read the value from R3
- Rename the data in register R3 to actual hardware registers R3a, R3b, R3c, and rename the other registers too (R4b, R5a, R7b); no name dependencies remain (a code sketch of this renaming follows after these slides):
      R3b := R3a + R5a  (I1)
      R4b := R3b + 1    (I2)
      R3c := R5a + 1    (I3)
      R7b := R3c + R4b  (I4)
- Drawback: more registers are needed (Pentium II: 40 extra registers + 16 normal registers)
- Why R3a and R3b?

Superscalar Implementation (7)
- Fetch strategy: prefetch, branch prediction
- Dependency check logic
- Forwarding circuits (shortcuts) to transfer dependent data directly instead of via registers or memory, so the data becomes accessible earlier
- Multiple functional units (pipelines)
- An effective memory hierarchy to service many memory accesses simultaneously
- Logic to issue multiple instructions simultaneously
- Logic to commit instructions in the correct order
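The renaming of R3 to R3a/R3b/R3c can be illustrated with a small sketch; the letter-suffix naming, the rename table, and the function below are assumptions made for illustration, not a description of real rename hardware.

```python
# Minimal renaming sketch: each write allocates a fresh "physical" register and
# each read uses the latest mapping, which removes output and antidependencies.
# Immediate operands (such as the +1) are omitted in this sketch.

def rename(instrs):
    """instrs: list of (dest, srcs) tuples using architectural register names."""
    rename_map = {}                       # architectural -> current physical name
    version = {}                          # per-register version counter
    renamed = []
    for dest, srcs in instrs:
        new_srcs = [rename_map.get(r, r + "a") for r in srcs]
        v = version.get(dest, 0) + 1      # allocate a fresh physical register
        version[dest] = v
        rename_map[dest] = dest + "abcdefgh"[v]   # R3 -> R3b, R3c, ...
        renamed.append((rename_map[dest], new_srcs))
    return renamed

# The I1..I4 example from the slide:
code = [("R3", ["R3", "R5"]),   # I1: R3 := R3 + R5
        ("R4", ["R3"]),         # I2: R4 := R3 + 1
        ("R3", ["R5"]),         # I3: R3 := R5 + 1
        ("R7", ["R3", "R4"])]   # I4: R7 := R3 + R4
for dest, srcs in rename(code):
    print(dest, ":=", " + ".join(srcs))
# -> R3b := R3a + R5a, R4b := R3b, R3c := R5a, R7b := R3c + R4b
```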

Overall Gain from Superscalar Implementation
- Base machine as the starting point for comparison: out-of-order issue (Fig. 14.5; Fig. 13.5 [Stal99])
- The figure shows the effect of register renaming (right-hand graph), of the issue window size (colour of the vertical bars), and of duplicating hardware: data cache access with the load/store unit, the ALU, or both
- Maximum speed-up is about 4

Example: PowerPC 601 Architecture (2)
- General RISC organization: instruction formats (Fig. 11.9; Fig. 10.9 [Stal99]), 3 execution units (Fig. 14.10; Fig. 13.10 [Stal99])
- Logical view (Fig. 14.11; Fig. 13.11 [Stal99]): a 4-instruction window for issue
  - each execution unit picks up the next instruction meant for it whenever it has room for a new instruction
  - integer instructions are issued only from the 1st position in the queue (dispatch buffer 0)

PowerPC 601 Pipelines (4)
- Instruction pipelines (Fig. 14.12; Fig. 13.12 [Stal99])
- All state changes happen in the final Write Back phase
- Up to 3 instructions can be dispatched at the same time, and issued right after that in each pipeline if no dependencies exist; dependencies are solved by stalls
- ALU ops place their result in one of the 8 condition code fields in the condition register, so up to 8 separate conditions can be active concurrently

PowerPC 601 Branches (4)
- Zero-cycle branches: branch target addresses are computed already in the lower dispatch buffers, before dispatch or issue!
- Easy cases: unconditional branches (jumps), or a branch on an already resolved condition code field
- Otherwise, static prediction (sketched in code after these slides): a backward conditional branch is guessed taken, a forward conditional branch is guessed not taken
- If the speculation turns out wrong, the conditionally executed (speculative) instructions in the pipeline are cancelled before write-back
- Speculation is done on only one branch at a time

PowerPC 601 Example
- Conditional branch example
  - original C code: Fig. 14.13 (a) (Fig. 13.13 [Stal99])
  - assembly code, with the branch predicted not taken: Fig. 14.13 (b) (Fig. 13.14 [Stal99])
  - correct branch prediction: Fig. 14.14 (a)
  - incorrect branch prediction: Fig. 14.14 (b)
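The backward-taken/forward-not-taken rule used by the 601 (and, as noted later, by the Pentium II/4 before dynamic prediction takes over) can be written as a tiny sketch; the function and its parameters are illustrative assumptions, not the 601's actual logic.

```python
# Static branch prediction sketch: backward conditional branches are guessed
# taken, forward conditional branches are guessed not taken.

def predict_taken(branch_pc, target_pc, is_conditional, condition_resolved=None):
    """Return the predicted outcome of a branch.
    condition_resolved: True/False if the condition code is already known."""
    if not is_conditional:
        return True                      # unconditional jumps are always taken
    if condition_resolved is not None:
        return condition_resolved        # already resolved: no speculation needed
    return target_pc < branch_pc         # backward: taken, forward: not taken

# A backward branch closing a loop is predicted taken:
print(predict_taken(branch_pc=0x120, target_pc=0x100, is_conditional=True))  # True
# A forward branch skipping an if-body is predicted not taken:
print(predict_taken(branch_pc=0x120, target_pc=0x140, is_conditional=True))  # False
```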

PowerPC 620 Architecture
- 6 execution units ([HePa96] Fig. 4.49)
- Up to 4 instructions are dispatched simultaneously
- Reservation stations store dispatched instructions and their arguments: a kind of rename registers as well!

PowerPC 620 Rename Registers (7)
- Rename registers store results that have not yet been committed ([HePa96] Fig. 4.49): results of normal uncompleted instructions and of speculative instructions
- 8 integer and 12 FP extra rename registers, kept in the same register file as the normal registers
- Results are copied to the normal registers at commit; the information on what to do at commit is kept in the completion unit's reorder buffers
- An instruction completes (commits) from the completion unit's reorder buffer once all previous instructions have committed; at most 4 instructions can commit at a time (an in-order commit sketch follows after these slides)

PowerPC 620 Speculation
- Speculation on branches
  - a 256-entry branch target buffer, two-way set-associative
  - a 2048-entry branch history table, used when the branch target buffer misses
- Speculation on at most 4 unresolved branches

Intel Pentium II Speculation
- 512-entry branch target buffer: 4-bit prediction state, 4-way set-associative
- Static prediction is used before the dynamic prediction starts to work: forward not taken, backward taken
- In-order completion for the 40 µops (micro-operations) limits speculation
- 4-entry Return Stack Buffer (RSB): return addresses are often found quickly without accessing the activation record stack
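In-order commit from a reorder buffer, with at most four retirements per cycle as described above, might be sketched roughly as follows; the entry fields and the commit_cycle function are assumptions for illustration, not the 620's actual mechanism.

```python
# In-order commit sketch: entries retire only from the head of the reorder
# buffer, only when finished, and at most four per cycle.

from collections import deque

def commit_cycle(rob, arch_regs, max_commit=4):
    """rob: deque of dicts {'dest': reg, 'value': v, 'done': bool}, oldest first.
    Copies results from rename storage to the architectural registers."""
    committed = 0
    while rob and committed < max_commit and rob[0]["done"]:
        entry = rob.popleft()                    # head of the reorder buffer
        if entry["dest"] is not None:            # e.g. a store may have no dest
            arch_regs[entry["dest"]] = entry["value"]
        committed += 1
    return committed                             # instructions retired this cycle

rob = deque([{"dest": "r3", "value": 7, "done": True},
             {"dest": "r4", "value": 1, "done": False},   # still executing
             {"dest": "r5", "value": 9, "done": True}])   # done, but must wait
regs = {}
print(commit_cycle(rob, regs), regs)   # -> 1 {'r3': 7}
```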

Example: Pentium 4
- Outside: a CISC ISA; inside: a full superscalar RISC core working on micro-operations (µops) (Fig. 14.7)
- Very long pipeline (Fig. 14.8)
  - get the next ISA instruction (rarely) and map it to µops, or get the µops from the Trace Cache (usually); the Trace Cache serves as the L1 instruction cache
  - additional µops come from ROM, if needed
  - the rest of the pipeline works purely on µops
  - "drive" stages exist just to make up for the time the signal needs to traverse the chip

Pipeline Front End (Fig. 14.9 (a-f))
a) Fetch the instruction from the L2 cache and generate µops when needed; store them in the Trace Cache
   - static branch prediction: backward taken, forward not taken
b) Get the new trace cache IP (for µops)
   - dynamic 4-bit branch prediction with a 512-entry BTB
c) Trace cache fetch
d) Drive: let the data traverse the chip

Pipeline Out-of-Order Execution (Fig. 14.9 (a-f))
e) Allocate resources
   - a reorder buffer (ROB) entry (one of 126) holds: the state (scheduled, dispatched, completed, ready), the original IA-32 instruction address, the µop and which of its operands are available, and an alias register for the result (one of 128 ROB registers referencing one of the 8 IA-32 registers, or one of 48 load or 24 store buffers)
   - true data dependencies are solved with these, and false dependencies are avoided by them
   - 2 Register Alias Tables (RAT) keep track of where the current version of each of the 8 IA-32 registers (e.g., EAX) is

Pipeline Out-of-Order Execution (continued)
f) 2 FIFO queues for µop scheduling: one for memory µops, one for non-memory µops (a scheduling sketch follows after these slides)
   - instructions are dispatched to the execution window from these queues: in order from each queue, out of order globally (Fig. 14.9 (a-f), (g-l))
g) Schedule and (h) dispatch µops (superscalar)
   - window of execution = ??? µops; at most 6 instructions are issued each cycle
   - out-of-order scheduling (because of the 2 queues)

Pipeline Int and FP Units (Fig. 14.9 (g-l))
i) Register access, data cache access
j) Execute, set flags
   - many different, pipelined execution units, e.g., a double-speed ALU for the most common cases
   - update the RAT and allow new µops to issue
k) Branch checking: kill bad instructions in the pipeline
l) Give the branch prediction; let the signals propagate

-- End of Chapter 14: Superscalar --
Figure 4: Pentium 4 processor microarchitecture, http://www.intel.com/technology/itj/q12001/articles/art_2.htm
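The two-FIFO µop scheduling in step (f) can be sketched as below; the queue contents, the ready set, and the schedule_cycle function are illustrative assumptions rather than Intel's actual scheduler.

```python
# Two-queue scheduling sketch: memory and non-memory µops sit in separate
# FIFOs, dispatch is in order within each queue but out of order globally,
# and at most six µops issue per cycle.

from collections import deque

def schedule_cycle(mem_q, alu_q, ready, max_issue=6):
    """mem_q/alu_q: deques of µop names, oldest first.
    ready: set of µops whose operands and functional unit are available."""
    issued = []
    progress = True
    while len(issued) < max_issue and progress:
        progress = False
        for q in (mem_q, alu_q):                 # look only at each queue's head
            if q and q[0] in ready and len(issued) < max_issue:
                issued.append(q.popleft())
                progress = True
    return issued

mem_q = deque(["load1", "store1"])
alu_q = deque(["add1", "mul1", "add2"])
# load1 is not ready, so the memory queue stalls while ALU µops still issue:
print(schedule_cycle(mem_q, alu_q, ready={"add1", "mul1", "store1"}))
# -> ['add1', 'mul1']
```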