Appendix C: Pipelining: Basic and Intermediate Concepts

Similar documents
Pipelining. Principles of pipelining. Simple pipelining. Structural Hazards. Data Hazards. Control Hazards. Interrupts. Multicycle operations

Instruction Pipelining Review

Minimizing Data hazard Stalls by Forwarding Data Hazard Classification Data Hazards Present in Current MIPS Pipeline

Some material adapted from Mohamed Younis, UMBC CMSC 611 Spr 2003 course slides Some material adapted from Hennessy & Patterson / 2003 Elsevier

Pipelining. Principles of pipelining. Simple pipelining. Structural Hazards. Data Hazards. Control Hazards. Interrupts. Multicycle operations

Predict Not Taken. Revisiting Branch Hazard Solutions. Filling the delay slot (e.g., in the compiler) Delayed Branch

Computer Architecture

Basic Pipelining Concepts

Appendix C. Instructor: Josep Torrellas CS433. Copyright Josep Torrellas 1999, 2001, 2002,

Overview. Appendix A. Pipelining: Its Natural! Sequential Laundry 6 PM Midnight. Pipelined Laundry: Start work ASAP

COSC4201 Pipelining. Prof. Mokhtar Aboelaze York University

Appendix A. Overview

Lecture 7: Pipelining Contd. More pipelining complications: Interrupts and Exceptions

Instruction Level Parallelism. Appendix C and Chapter 3, HP5e

Data Hazards Compiler Scheduling Pipeline scheduling or instruction scheduling: Compiler generates code to eliminate hazard

EITF20: Computer Architecture Part2.2.1: Pipeline-1

Pipeline Overview. Dr. Jiang Li. Adapted from the slides provided by the authors. Jiang Li, Ph.D. Department of Computer Science

EITF20: Computer Architecture Part2.2.1: Pipeline-1

Instruction Level Parallelism. ILP, Loop level Parallelism Dependences, Hazards Speculation, Branch prediction

There are different characteristics for exceptions. They are as follows:

Pipelining: Basic and Intermediate Concepts

EI338: Computer Systems and Engineering (Computer Architecture & Operating Systems)

Modern Computer Architecture

ECE 486/586. Computer Architecture. Lecture # 12

Lecture 3. Pipelining. Dr. Soner Onder CS 4431 Michigan Technological University 9/23/2009 1

Pipeline Review. Review

CISC 662 Graduate Computer Architecture Lecture 6 - Hazards

Some material adapted from Mohamed Younis, UMBC CMSC 611 Spr 2003 course slides Some material adapted from Hennessy & Patterson / 2003 Elsevier

Administrivia. CMSC 411 Computer Systems Architecture Lecture 6. When do MIPS exceptions occur? Review: Exceptions. Answers to HW #1 posted

Improvement: Correlating Predictors

Complications with long instructions. CMSC 411 Computer Systems Architecture Lecture 6 Basic Pipelining 3. How slow is slow?

Pipelining: Hazards Ver. Jan 14, 2014

CMSC 611: Advanced Computer Architecture

Pipelining concepts The DLX architecture A simple DLX pipeline Pipeline Hazards and Solution to overcome

Floating Point/Multicycle Pipelining in DLX

EITF20: Computer Architecture Part2.2.1: Pipeline-1

Pipelining concepts The DLX architecture A simple DLX pipeline Pipeline Hazards and Solution to overcome

Computer Systems Architecture I. CSE 560M Lecture 5 Prof. Patrick Crowley

Chapter 3. Pipelining. EE511 In-Cheol Park, KAIST

Chapter 4. The Processor

MIPS An ISA for Pipelining

Advanced issues in pipelining

CMSC 411 Computer Systems Architecture Lecture 6 Basic Pipelining 3. Complications With Long Instructions

Instr. execution impl. view

LECTURE 3: THE PROCESSOR

Advanced Parallel Architecture Lessons 5 and 6. Annalisa Massini /2017

COMPUTER ORGANIZATION AND DESI

CISC 662 Graduate Computer Architecture Lecture 7 - Multi-cycles

DLX Unpipelined Implementation

COMPUTER ORGANIZATION AND DESIGN. 5 th Edition. The Hardware/Software Interface. Chapter 4. The Processor

COMPUTER ORGANIZATION AND DESIGN. 5 th Edition. The Hardware/Software Interface. Chapter 4. The Processor

Chapter 4. Instruction Execution. Introduction. CPU Overview. Multiplexers. Chapter 4 The Processor 1. The Processor.

COSC 6385 Computer Architecture - Pipelining

What is Pipelining? Time per instruction on unpipelined machine Number of pipe stages

CISC 662 Graduate Computer Architecture Lecture 7 - Multi-cycles. Interrupts and Exceptions. Device Interrupt (Say, arrival of network message)

CPE Computer Architecture. Appendix A: Pipelining: Basic and Intermediate Concepts

Pipelining. CSC Friday, November 6, 2015

ECE/CS 552: Pipeline Hazards

Pipelining and Exploiting Instruction-Level Parallelism (ILP)

COSC 6385 Computer Architecture - Pipelining (II)

COMPUTER ORGANIZATION AND DESIGN

Pipelined Processors. Ideal Pipelining. Example: FP Multiplier. 55:132/22C:160 Spring Jon Kuhl 1

Advanced Computer Architecture Pipelining

3/12/2014. Single Cycle (Review) CSE 2021: Computer Organization. Single Cycle with Jump. Multi-Cycle Implementation. Why Multi-Cycle?

CMCS Mohamed Younis CMCS 611, Advanced Computer Architecture 1

Multi-cycle Instructions in the Pipeline (Floating Point)

Announcement. ECE475/ECE4420 Computer Architecture L4: Advanced Issues in Pipelining. Edward Suh Computer Systems Laboratory

Suggested Readings! Recap: Pipelining improves throughput! Processor comparison! Lecture 17" Short Pipelining Review! ! Readings!

Pipelining. Maurizio Palesi

COSC4201. Prof. Mokhtar Aboelaze York University

Chapter 4 The Processor 1. Chapter 4A. The Processor

Instruction Pipelining

ECE260: Fundamentals of Computer Engineering

Instruction Pipelining

The Big Picture Problem Focus S re r g X r eg A d, M lt2 Sub u, Shi h ft Mac2 M l u t l 1 Mac1 Mac Performance Focus Gate Source Drain BOX

Chapter 4. The Processor

What is Pipelining? RISC remainder (our assumptions)

LECTURE 10. Pipelining: Advanced ILP

Chapter3 Pipelining: Basic Concepts

C.1 Introduction. What Is Pipelining? C-2 Appendix C Pipelining: Basic and Intermediate Concepts

Lecture 9. Pipeline Hazards. Christos Kozyrakis Stanford University

Page 1. Pipelining: Its Natural! Chapter 3. Pipelining. Pipelined Laundry Start work ASAP. Sequential Laundry A B C D. 6 PM Midnight

Lecture 4: Advanced Pipelines. Data hazards, control hazards, multi-cycle in-order pipelines (Appendix A.4-A.10)

CMCS Mohamed Younis CMCS 611, Advanced Computer Architecture 1

Lecture 3: The Processor (Chapter 4 of textbook) Chapter 4.1

Complex Pipelining COE 501. Computer Architecture Prof. Muhamed Mudawar

Department of Computer and IT Engineering University of Kurdistan. Computer Architecture Pipelining. By: Dr. Alireza Abdollahpouri

Exception Handling. Precise Exception Handling. Exception Types. Exception Handling Terminology

Pipelining. Each step does a small fraction of the job All steps ideally operate concurrently

Outline. Pipelining basics The Basic Pipeline for DLX & MIPS Pipeline hazards. Handling exceptions Multi-cycle operations

COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface. 5 th. Edition. Chapter 4. The Processor

Full Datapath. Chapter 4 The Processor 2

MIPS Pipelining. Computer Organization Architectures for Embedded Computing. Wednesday 8 October 14

CPE 631 Lecture 09: Instruction Level Parallelism and Its Dynamic Exploitation

Updated Exercises by Diana Franklin

第三章 Instruction-Level Parallelism and Its Dynamic Exploitation. 陈文智 浙江大学计算机学院 2014 年 10 月

1 Hazards COMP2611 Fall 2015 Pipelined Processor

CS 61C: Great Ideas in Computer Architecture Pipelining and Hazards

Advanced Computer Architecture

Lecture 2: Processor and Pipelining 1

Transcription:

Appendix C: Pipelining: Basic and Intermediate Concepts Key ideas and simple pipeline (Section C.1) Hazards (Sections C.2 and C.3) Structural hazards Data hazards Control hazards Exceptions (Section C.4) Multicycle operations (Section C.5)

Pipelining - Key Idea instrns time 1/Throughput instrns time 1/Throughput Latency Latency Ideally, Time pipeline = Time sequential Pipeline Depth Speedup = Time sequential Time pipeline = Pipeline Depth

Practical Limit 1 Unbalanced Stages Consider an instruction that requires n stages s 1, s 2,..., s n, taking time t 1, t 2,..., t n. Let T = St i Without pipelining With an n-stage pipeline Throughput = Latency = Throughput = Latency = Speedup

Practical Limit 1 Unbalanced Stages** Consider an instruction that requires n stages s 1, s 2,..., s n, taking time t 1, t 2,..., t n. Let T = St i Without pipelining With an n-stage pipeline 1 Throughput = = 1 T St i Throughput = 1 max t i n T Latency = Latency = Speedup

Practical Limit 1 Unbalanced Stages** Consider an instruction that requires n stages s 1, s 2,..., s n, taking time t 1, t 2,..., t n. Let T = St i Without pipelining With an n-stage pipeline 1 Throughput = = 1 T St i Latency = T = 1 Throughput Throughput = 1 max t i n T Latency = n max t i ³ T Speedup

Practical Limit 1 Unbalanced Stages** Consider an instruction that requires n stages s 1, s 2,..., s n, taking time t 1, t 2,..., t n. Let T = St i Without pipelining With an n-stage pipeline 1 Throughput = = 1 T St i Latency = T = 1 Throughput Throughput = 1 max t i n T Latency = n max t i ³ T Speedup St i max t i n

Practical Limit 2 - Overheads Let D > 0 be extra delay per stage e.g., latches D limits the useful depth of a pipeline. With an nstage pipeline Throughput = D+ max t < n i T 1 Latency = n (D + max t i ) ³ nd + T Speedup = St i D+ max t i < n

Let t 1,2,3 = 8, 12, 10 ns and D = 2 ns Throughput = Latency = Speedup = Example

Example** Let t 1,2,3 = 8, 12, 10 ns and D = 2 ns Throughput = 1/(2 ns + 12 ns) = 1/(14 ns) Latency = Speedup =

Example** Let t 1,2,3 = 8, 12, 10 ns and D = 2 ns Throughput = 1/(2 ns + 12 ns) = 1/(14 ns) Latency = 3 (2 ns + 12 ns) = 42 ns Speedup =

Example** Let t 1,2,3 = 8, 12, 10 ns and D = 2 ns Throughput = 1/(2 ns + 12 ns) = 1/(14 ns) Latency = 3 (2 ns + 12 ns) = 42 ns Speedup = (30 ns)/(2 ns + 12 ns) = 2.14 < 3

Practical Limit 3 - Hazards Pipeline Speedup = Time sequential Time = CPI sequential pipeline CPI pipeline Cycle Time sequential Cycle Time pipeline If we ignore cycle time differences: CPI ideal-pipeline = CPI sequential Pipeline Depth Pipeline Speedup = CPI ideal-pipeline Pipeline Depth CPI ideal-pipeline + Pipeline stall cycles

Pipelining a Basic RISC ISA Assumptions: Only loads and stores affect memory Base register + immediate offset = effective address ALU operations Only access registers Two sources two registers, or register and immediate Branches and jumps Address = PC + offset Comparison between a register and zero The last assumption is different from the 6 th edition of the text and results in a slightly different pipeline. We will discuss reasons and implications in class.

A Simple Five Stage RISC Pipeline Pipeline Stages IF Instruction Fetch ID Instruction decode, register read, branch computation EX Execution and Effective Address MEM Memory Access WB Writeback 1 2 3 4 5 6 7 8 9 i IF ID EX MEM WB i+1 IF ID EX MEM WB i+2 IF ID EX MEM WB i+3 IF ID EX MEM WB i+4 IF ID EX MEM WB Pipelining really isn't this simple

A Naive Pipeline Implementation Figure C.28 of 5 th edition Pipelining really isn't this simple Copyright 2011, Elsevier Inc. All rights Reserved.

A Naive Pipeline Implementation** Figure C.28 The stall from branch hazards can be reduced by moving the zero test and branch-target calculation into the ID phase of the pipeline. Notice that we have made two important changes, each of which removes 1 cycle from the 3-cycle stall for branches. The first change is to move both the branch-target address calculation and the branch condition decision to the ID cycle. The second change is to write the PC of the instruction in the IF phase, using either the branch-target address computed during ID or the incremented PC computed during IF. In comparison, Figure C.22 obtained the branch-target address from the EX/MEM register and wrote the result during the MEM clock cycle. As mentioned in Figure C.22, the PC can be thought of as a pipeline register (e.g., as part of ID/IF), which is written with the address of the next instruction at the end of each IF cycle. Copyright 2011, Elsevier Inc. All rights Reserved.

Hazards Hazards Structural Hazards Data Hazards Control Hazards

Hazards** Hazards Conditions that prevent the next instruction from executing Structural Hazards Data Hazards Control Hazards

Hazards** Hazards Conditions that prevent the next instruction from executing Structural Hazards When two different instructions want to use the same hardware resource in the same cycle (resource conflict) Data Hazards Control Hazards

Hazards** Hazards Conditions that prevent the next instruction from executing Structural Hazards When two different instructions want to use the same hardware resource in the same cycle (resource conflict) Data Hazards When two different instructions use the same location, it must appear as if instructions execute one at a time and in the specified order Control Hazards

Hazards Hazards** Conditions that prevent the next instruction from executing Structural Hazards When two different instructions want to use the same hardware resource in the same cycle (resource conflict) Data Hazards When two different instructions use the same location, it must appear as if instructions execute one at a time and in the specified order Control Hazards When an instruction affects which instructions are executed next branches, jumps, calls

Handling Hazards Pipeline interlock logic Detects hazard and takes appropriate action Simplest solution: stall Increases CPI Decreases performance Other solutions are harder, but have better performance

Structural Hazards When two different instructions want to use the same hardware resource in the same cycle Stall (cause bubble) + Low cost, simple Increases CPI Use for rare events E.g.,?? Duplicate Resource + Good performance Increases cost (and maybe cycle time for interconnect) Use for cheap resources E.g., ALU and PC adder

Structural Hazards, cont. Pipeline Resource + Good performance Often complex to do Use when simple to do E.g., write & read registers every cycle Structural hazards are avoided if each instruction uses a resource At most once Always in the same pipeline stage For one cycle (Þ no cycle where two instructions use the same resource)

Structural Hazard Example Loads/stores (MEM) use same memory port as instrn fetches (IF) 30% of all instructions are loads and stores Assume CPI old is 1.5 1 2 3 4 5 6 7 8 9 i IF ID EX MEM WB < a load i+1 IF ID EX MEM WB i+2 IF ID EX MEM WB i+3 ** IF ID EX MEM WB i+4 IF ID EX MEM WB How much faster could a new machine with two memory ports be?

Structural Hazard Example** Loads/stores (MEM) use same memory port as instrn fetches (IF) 30% of all instructions are loads and stores Assume CPI old is 1.5 1 2 3 4 5 6 7 8 9 i IF ID EX MEM WB < a load i+1 IF ID EX MEM WB i+2 IF ID EX MEM WB i+3 ** IF ID EX MEM WB i+4 IF ID EX MEM WB How much faster could a new machine with two memory ports be? CPI new = 1.5-1 30% = 1.2 CPI Speedup = old = (1.5/1.2) = 1.25 CPI new

Data Hazards When two different instructions use the same location, it must appear as if instructions execute one at a time and in the specified order i ADD r1,r2, i+1 SUB r2,,r1 i+2 OR r1,--, Read-After-Write (RAW, data-dependence) A true dependence MOST IMPORTANT Write-After-Read (WAR, anti-dependence) Write-After-Write (WAW, output-dependence) NOT: Read-After-Read (RAR)

Example Read-After-Write Hazards r1 written ADD r1,_,_ IF ID EX MEM WB NOT OK! SUB _, r1,_ IF ID EX MEM WB r1 read r1 written LW r1,_,_ IF ID EX MEM WB NOT OK! SUB _, r1,_ IF ID EX MEM WB r1 read memory written SW r1,100(r0) IF ID EX MEM WB CORRECT! LW r2,100(r0) IF ID EX MEM WB (Unless LW instrn is at address 100(r0)) memory read

RAW Solutions Solutions must first detect RAW, and then... Stall r1 written ADD r1,_,_ IF ID EX MEM WB SUB _, r1,_ IF ID stall stall EX MEM WB r1 read (Assumes registers written then read each cycle) + Low cost, simple Increases CPI (plus 2 per stall in 5 stage pipeline) Use for rare events

Bypass/Forward/ShortCircuit RAW Solutions data available r1 written ADD r1,_,_ IF ID EX MEM WB SUB _, r1,_ IF ID EX MEM WB r1 read data used Use data before it is in register + Reduces (avoids) stalls More complex Critical for common RAW hazards

Bypass, cont. Figure C.27 5 th edition Additional hardware Muxes supply correct result to ALU Additional control Interlock logic must control muxes Copyright 2011, Elsevier Inc. All rights Reserved.

Bypass, cont.** Figure C.27 Forwarding of results to the ALU requires the addition of three extra inputs on each ALU multiplexer and the addition of three paths to the new inputs. The paths correspond to a bypass of: (1) the ALU output at the end of the EX, (2) the ALU output at the end of the MEM stage, and (3) the memory output at the end of the MEM stage. Copyright 2011, Elsevier Inc. All rights Reserved.

RAW Solutions, cont. Hybrid solution sometimes required: data available r1 written LW r1,_,_ IF ID EX MEM WB SUB _, r1,_ IF ID stall EX MEM WB r1 read data used One cycle bubble if result of load used by next instruction Pipeline scheduling at compile time Moves instructions to eliminate stalls

Pipeline Scheduling Example Before: After: a = b + c; LW Rb,b a = b + c; LW Rb,b LW Rc,c LW Rc,c < stall LW Re,e ADD Ra,Rb,Rc ADD Ra,Rb,Rc SW a, Ra d = e - f; LW Re,e d = e - f; LW Rf,f LW Rf,f SW a, Ra < stall SUB Rd,Re,Rf SUB Rd,Re, Rf SW d, Rd SW d, Rd

Other Data Hazards i ADD r1,r2, i+1 SUB r2,,r1 i+2 OR r1,, Write-After-Read (WAR, anti-dependence) i MULT, (r2), r1 /* RX mult */ i+1 LW, (r1)+ /* autoincrement */ Write-After-Write (WAW, output-dependence) i DIVF fr1,, /* slow */ i+1 i+2 ADDF fr1,, /* fast */

i ADD r1,r2, i+1 SUB r2,,r1 i+2 OR r1,, Other Data Hazards** Write-After-Read (WAR, anti-dependence) Not in basic pipeline: read early / write late Consider late read then early write: i MULT, (r2), r1 /* RX mult */ i+1 LW, (r1)+ /* autoincrement */ Write-After-Write (WAW, output-dependence) i DIVF fr1,, /* slow */ i+1 i+2 ADDF fr1,, /* fast */

Other Data Hazards** i ADD r1,r2, i+1 SUB r2,,r1 i+2 OR r1,, Write-After-Read (WAR, anti-dependence) Not in basic pipeline: read early / write late Consider late read then early write: i MULT, (r2), r1 /* RX mult */ i+1 LW, (r1)+ /* autoincrement */ Write-After-Write (WAW, output-dependence) Not in basic pipeline: writes are in order Consider: slow then fast operation i DIVF fr1,, /* slow */ i+1 i+2 ADDF fr1,, /* fast */ Occur easily with out-of-order execution

Control Hazards When an instruction affects which instructions are executed next -- i branches, jumps, calls BEQZ r1,#8 i+1 SUB,,... i+8 OR,, i+9 ADD,, 1 2 3 4 5 6 7 8 9 i IF ID EX MEM WB i+1 IF (aborted) i+8 IF ID EX MEM WB i+9 IF ID EX MEM Handling control hazards is very important

Handling Control Hazards Branch Prediction Guess the direction of the branch Minimize penalty when right May increase penalty when wrong Techniques Static At compile time Dynamic At run time Static Techniques Predict NotTaken Predict Taken Delayed Branches Dynamic techniques and more powerful static techniques later

Handling Control Hazards, cont. Predict NOT-TAKEN Always NotTaken: 1 2 3 4 5 6 7 8 i IF ID EX MEM WB i+1 IF ID EX MEM WB i+2 IF ID EX MEM WB i+3 IF ID EX MEM WB Taken: 1 2 3 4 5 6 7 8 i IF ID EX MEM WB i+1 IF (aborted) i+8 IF ID EX MEM WB i+9 IF ID EX MEM WB Don't change machine state until branch outcome is known Basic pipeline: State always changes late (WB)

Predict TAKEN Always Handling Control Hazards, cont. 1 2 3 4 5 6 7 8 i IF ID EX MEM WB i+8 IF ID EX MEM WB i+9 IF ID EX MEM WB i+10 IF ID EX MEM WB Must know what address to fetch at BEFORE branch is decoded Not practical for our basic pipeline

Handling Control Hazards, cont. Delayed branch Execute next instruction regardless (of whether branch is taken) What do we execute in the DELAY SLOT?

Delay Slots Fill from before branch When: Helps: Fill from target When: Helps: Fill from fall through When: Helps:

Delay Slots** Fill from before branch When: Branch independent of instruction Helps: Always Fill from target When: Helps: Fill from fall through When: Helps:

Delay Slots** Fill from before branch When: Branch independent of instruction Helps: Always Fill from target When: OK to execute target May have to duplicate code Helps: On taken branch May increase code size Fill from fall through When: Helps:

Delay Slots** Fill from before branch When: Branch independent of instruction Helps: Always Fill from target When: OK to execute target May have to duplicate code Helps: On taken branch May increase code size Fill from fall through When: OK to execute instruction Helps: when not taken

Delay Slots (Cont.) Cancelling or nullifying branch Instruction includes direction of prediction Delay instruction squashed if wrong prediction Allows second and third case of previous slide to be more aggressive

Comparison of Branch Schemes Suppose 14% of all instructions are branches Suppose 65% of all branches are taken Suppose 50% of delay slots usefully filled CPIpenalty = % branches (% Taken Taken-Penalty + % Not-Taken Not-Taken penalty) Branch Scheme Taken Penalty Not-Taken Penalty CPI Penalty Basic Branch 1 1.14 Not-Taken 1 0.09 Taken0 0 1.05 Taken1 1 1.14 Delayed Branch.5.5.07

Real Processors MIPS R4000: 3 cycle branch penalty First cycle: cancelling delayed branch (cancel if not taken) Next two cycles: Predict not taken Recent architectures: Because of deeper pipelines, delayed branches not very useful Processors rely more on hardware prediction (will see later) or may include both delayed and nondelayed branches

Branch Prediction Accuracy**

Interrupts Interrupts (a.k.a. faults, exceptions, traps) often require Surprise jump Linking of return address Saving of PSW (including CCs) State change (e.g., to kernel mode) Some examples Arithmetic overflow I/O device request O.S. call Page fault Make pipelining hard

One Classification of Interrupts 1a. Synchronous function of program and memory state (e.g., arithmetic overflow, page fault) 1b. Asynchronous external device or hardware malfunction (printer ready, bus error)

Handling Interrupts Precise Interrupts (Sequential Semantics) Complete instrns before offending one Squash (effects of) instrns after Save PC Force trap instrn into IF Must handle simultaneous interrupts IF ID EX MEM WB Which interrupt should be handled first?

Handling Interrupts** Precise Interrupts (Sequential Semantics) Complete instrns before offending one Squash (effects of) instrns after Save PC (& nextpc w/ delayed branches) Force trap instrn into IF Must handle simultaneous interrupts IF memory problems (pagefault, misaligned reference, protection violation) ID EX MEM WB Which interrupt should be handled first?

Handling Interrupts** Precise Interrupts (Sequential Semantics) Complete instrns before offending one Squash (effects of) instrns after Save PC (& nextpc w/ delayed branches) Force trap instrn into IF Must handle simultaneous interrupts IF memory problems (pagefault, misaligned reference, protection violation) ID illegal or privileged instrn EX MEM WB Which interrupt should be handled first?

Handling Interrupts** Precise Interrupts (Sequential Semantics) Complete instrns before offending one Squash (effects of) instrns after Save PC (& nextpc w/ delayed branches) Force trap instrn into IF Must handle simultaneous interrupts IF memory problems (pagefault, misaligned reference, protection violation) ID illegal or privileged instrn EX arithmetic exception MEM WB Which interrupt should be handled first?

Handling Interrupts** Precise Interrupts (Sequential Semantics) Complete instrns before offending one Squash (effects of) instrns after Save PC (& nextpc w/ delayed branches) Force trap instrn into IF Must handle simultaneous interrupts IF memory problems (pagefault, misaligned reference, protection violation) ID illegal or privileged instrn EX arithmetic exception MEM memory problems WB Which interrupt should be handled first?

Handling Interrupts** Precise Interrupts (Sequential Semantics) Complete instrns before offending one Squash (effects of) instrns after Save PC (& nextpc w/ delayed branches) Force trap instrn into IF Must handle simultaneous interrupts IF memory problems (pagefault, misaligned reference, protection violation) ID illegal or privileged instrn EX arithmetic exception MEM memory problems WB none Which interrupt should be handled first?

Example: Data Page Fault Interrupts, cont. 1 2 3 4 5 6 7 8 i IF ID EX MEM WB i+1 IF ID EX MEM WB < page fault (MEM) i+2 IF ID EX MEM WB < squash i+3 IF ID EX MEM WB < squash i+4 IF ID EX MEM WB < squash i+5 trap > IF ID EX MEM WB i+6 trap handler > IF ID EX MEM WB Preceding instruction already complete Squash succeeding instructions Prevent from modifying state Trap instruction jumps to trap handler Hardware saves PC in IAR Trap handler must save IAR

Example: Arithmetic Exception Interrupts, cont. 1 2 3 4 5 6 7 8 i IF ID EX MEM WB i+1 IF ID EX MEM WB i+2 IF ID EX MEM WB < Exception (EX) i+3 IF ID EX MEM WB < squash i+4 IF ID EX MEM WB < squash i+5 trap > IF ID EX MEM WB i+6 trap handler > IF ID EX MEM WB Let preceding instructions complete Squash succeeding instruction

Example: Illegal Opcode Interrupts, cont. 1 2 3 4 5 6 7 8 i IF ID EX MEM WB i+1 IF ID EX MEM WB i+2 IF ID EX MEM WB i+3 IF ID EX MEM WB < ill. op (ID) i+4 IF ID EX MEM WB < squash i+5 trap > IF ID EX MEM WB i+6 trap handler > IF ID EX MEM WB Let preceding instructions complete Squash succeeding instruction

Example: Out-of-order Interrupts Interrupts, cont. 1 2 3 4 5 6 7 8 i IF ID EX MEM WB < page fault (MEM) i+1 IF ID EX MEM WB < page fault (IF) i+2 IF ID EX MEM WB i+3 IF ID EX MEM WB Which page fault should we take? For precise interrupts Post interrupts on a status vector associated with instruction, disable later writes in pipeline Check interrupt bit on entering WB Longer latency For imprecise interrupts Handle immediately Interrupts may occur in different order than on a sequential machine May cause implementation headaches

Interrupts, cont. Other complications Odd bits of state (e.g., CCs) Earlywrites (e.g., autoincrement) Outoforder execution Interrupts come at random times The frequent case isn't everything The rare case MUST work correctly

Multicycle Operations Not all operations complete in one cycle Floating point arithmetic is inherently slower than integer arithmetic 2 to 4 cycles for multiply or add 20 to 50 cycles for divide Extend basic 5-stage pipeline EX stage may repeat multiple times Multiple function units Not pipelined for now

Handling Multicycle Operations Four Functional Units EX: Integer unit E*: FP/integer multiplier E+: FP adder E/: FP/integer divider Assume EX takes one cycle & all FP units take 4 Separate integer and FP registers All FP arithmetic from FP registers Worry about Structural hazards RAW hazards & forwarding WAR & WAW between integer & FP ops

Simple Multicycle Example 1 2 3 4 5 6 7 8 9 10 11 int IF ID EX MEM WB fp* IF ID E* E* E* E* MEM WB int IF ID EX MEM WB? (1) fp/ IF ID E/ E/ E/ E/ MEM WB int IF ID EX ** MEM WB (2) fp/ (3) IF ID ** ** E/ E/ int (4) IF ** ** ID EX Notes (1) WAW possible only if? (2) Stall forced by? (3) Stall forced by? (4) Stall forced by?

Simple Multicycle Example** 1 2 3 4 5 6 7 8 9 10 11 int IF ID EX MEM WB fp* IF ID E* E* E* E* MEM WB int IF ID EX MEM WB? (1) fp/ IF ID E/ E/ E/ E/ MEM WB int IF ID EX ** MEM WB (2) fp/ (3) IF ID ** ** E/ E/ int (4) IF ** ** ID EX Notes (1) WAW possible only if int is a load (2) Stall forced by (3) Stall forced by (4) Stall forced by

Simple Multicycle Example** 1 2 3 4 5 6 7 8 9 10 11 int IF ID EX MEM WB fp* IF ID E* E* E* E* MEM WB int IF ID EX MEM WB? (1) fp/ IF ID E/ E/ E/ E/ MEM WB int IF ID EX ** MEM WB (2) fp/ (3) IF ID ** ** E/ E/ int (4) IF ** ** ID EX Notes (1) WAW possible only if int is a load (2) Stall forced by MEM / WB conflict (3) Stall forced by (4) Stall forced by

Simple Multicycle Example** 1 2 3 4 5 6 7 8 9 10 11 int IF ID EX MEM WB fp* IF ID E* E* E* E* MEM WB int IF ID EX MEM WB? (1) fp/ IF ID E/ E/ E/ E/ MEM WB int IF ID EX ** MEM WB (2) fp/ (3) IF ID ** ** E/ E/ int (4) IF ** ** ID EX Notes (1) WAW possible only if int is a load (2) Stall forced by MEM / WB conflict (3) Stall forced by structural conflict (4) Stall forced by

Simple Multicycle Example** 1 2 3 4 5 6 7 8 9 10 11 int IF ID EX MEM WB fp* IF ID E* E* E* E* MEM WB int IF ID EX MEM WB? (1) fp/ IF ID E/ E/ E/ E/ MEM WB int IF ID EX ** MEM WB (2) fp/ (3) IF ID ** ** E/ E/ int (4) IF ** ** ID EX Notes (1) WAW possible only if int is a load (2) Stall forced by MEM / WB conflict (3) Stall forced by structural conflict (4) Stall forced by inorder issue

FP Instruction Issue Check for RAW data hazard (in ID) Wait until source registers are not used as destinations by instructions in EX that will not be available when needed Check for forwarding Bypass data from other stages, if necessary Check for structural hazard in function unit Wait until function unit is free (in ID) Check for structural hazard in MEM / WB Instructions stall in ID Instructions stall before MEM Static priority (e.g., FU with longest latency)

Check for WAW hazards DIVF F0, F2, F4 SUBF F0, F8, F10 SUBF completes first (1) Stall SUBF (2) Abort DIVF's WB WAR hazards? FP Instruction Issue (Cont.)

Check for WAW hazards DIVF F0, F2, F4 SUBF F0, F8, F10 SUBF completes first (1) Stall SUBF (2) Abort DIVF's WB WAR hazards Read early, write late FP Instruction Issue (Cont.)**

More Multicycle Operations Problems with Interrupts DIVF F0, F2, F4 ADDF F2, F8, F10 SUBF F6, F4, F10 ADDF and SUBF complete before DIVF Out-of-order completion Possible imprecise interrupt What happens if DIVF generates an exception after ADDF and SUBF complete?? We'll discuss solutions later