Pipelined Processors. Ideal Pipelining. Example: FP Multiplier. 55:132/22C:160, Spring, Jon Kuhl


Pipelined Processors (Chapter 2)

Pipelined Design
Motivation: increase processor throughput with a modest increase in hardware.
Bandwidth or throughput = performance.
Bandwidth (BW) = number of tasks per unit time.
For a system that operates on one task at a time, BW = 1/delay (latency), where latency is the time required to complete one task.
BW can be increased by pipelining if many operands exist which need the same operation, i.e. many repetitions of the same task are to be performed. The latency required for each task remains the same, or may even increase slightly.

Ideal Pipelining
- Combinational logic with n gate delays: BW = ~(1/n)
- Split into two stages of n/2 gate delays each, with a latch (L) after each stage: BW = ~(2/n)
- Split into three stages of n/3 gate delays each: BW = ~(3/n)
Bandwidth increases linearly with pipeline depth; latency increases only by the added latch delays.

Example: FP Multiplier
Number format:
- Exponent: excess-128 (8 bits)
- Mantissa: sign-magnitude fraction with hidden bit (57 bits total)
Algorithm (sign, exponent, mantissa):
1. Check if either operand is ZERO.
2. ADD the two characteristics (the physical bit patterns of the exponents) and correct for the excess-128 bias, i.e. e1 + (e2 - 128).
3. Perform a fixed-point MULTIPLICATION of the mantissas.
4. NORMALIZE the product of the mantissas; this may require one left shift and a decrement of the exponent.
5. ROUND the result by adding 1 to the first guard bit; if the mantissa overflows, shift right one bit and increment the exponent.
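
The ideal-pipelining relationship above is easy to check numerically. The following is a minimal sketch (Python, not from the original slides); the gate-delay and latch-delay numbers are arbitrary illustrative values.

# Ideal pipelining: a combinational block of n gate delays is cut into k
# equal stages, each followed by a latch of delay L.
def pipeline_metrics(n_gate_delay, k_stages, latch_delay):
    stage_delay = n_gate_delay / k_stages + latch_delay   # clock period
    bandwidth = 1.0 / stage_delay                         # approaches k/n for small L
    latency = k_stages * stage_delay                      # n + k*L in total
    return bandwidth, latency

for k in (1, 2, 3):
    bw, lat = pipeline_metrics(n_gate_delay=300, k_stages=k, latch_delay=10)
    print(f"{k} stage(s): BW = {bw:.4f} tasks/unit time, latency = {lat:.0f}")

Running it shows bandwidth growing roughly linearly with k while total latency grows only by one latch delay per added stage, which is the point the slide makes.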

Nonpipelined Implementation
Chip counts and delays (based on very old IC technology):
- P.P. (partial product) generation: 34 chips, 125 ns
- P.P. reduction: 72 chips, 150 ns
- Final reduction: 21 chips, 55 ns
- Normalization: 2 chips, 20 ns
- Rounding: 15 chips, 50 ns
- Exponent section: 4 chips, 100 ns (in parallel with the mantissa path)
- Input registers: 17 chips
- Output registers: 10 chips
Unpipelined clock period = 400 nsec (2.5 MFLOPS).

Pipelined Implementation
Three-stage pipelining:
- Stage 1: P.P. generation (125 ns)
- Stage 2: P.P. reduction (150 ns)
- Stage 3: final reduction (55 ns) + normalization (20 ns) + rounding (50 ns)
(figure: the pipelined datapath, with staging registers carrying the sign s1/s2/s3, exponent e1/e2/e3 and mantissa m1/m2/m3 values between stages, and the exponent add/sub logic alongside)
Longest delay path within a stage = 150 nsec.
Clock to register output + longest delay path + set-up time = minimum clock period = 172 nsec.
Number of ICs added: 82 (edge-triggered registers).
Original total delay: 400 nsec (2.5 MFLOPS). New minimum clock period: 172 nsec (5.8 MFLOPS).
Original number of ICs: 175 chips. New total number of ICs: 257 chips.
Less than a 50% increase in hardware more than doubles the throughput!
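
The throughput and hardware arithmetic above can be reproduced in a few lines. This is a sketch (Python) using only the figures quoted on this page; "MFLOPS" here is simply the reciprocal of the clock period.

# Nonpipelined vs. three-stage pipelined FP multiplier.
unpipelined_period_ns = 125 + 150 + 55 + 20 + 50   # = 400 ns
pipelined_period_ns = 150 + 17 + 5                 # longest stage + clock-to-Q + setup = 172 ns

mflops_before = 1e3 / unpipelined_period_ns        # 1/(ns) * 1000 = MFLOPS, = 2.5
mflops_after = 1e3 / pipelined_period_ns           # ~ 5.8
chips_before, chips_after = 175, 175 + 82          # 82 added register chips

print(f"throughput: {mflops_before:.1f} -> {mflops_after:.1f} MFLOPS "
      f"({mflops_after / mflops_before:.2f}x)")
print(f"hardware:   {chips_before} -> {chips_after} chips "
      f"(+{100 * (chips_after - chips_before) / chips_before:.0f}%)")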

Pipelining Idealisms
- Uniform subcomputations: can pipeline into stages with equal delay; balance the pipeline stages.
- Identical computations: can fill the pipeline with identical work; unify instruction types (example later).
- Independent computations: no relationships between work units; minimize pipeline stalls.
Are these practical? No, but we can get close enough to obtain significant speedup.

Instruction Pipelining
The computation to be pipelined:
- Instruction Fetch (IF)
- Instruction Decode (ID)
- Operand(s) Fetch (OF)
- Instruction Execution (EX)
- Operand Store (OS)
- Update Program Counter (PC)

Generic Instruction Pipeline (based on the obvious subcomputations):
1. Instruction Fetch (IF)
2. Instruction Decode (ID)
3. Operand Fetch (OF)
4. Instruction Execute (EX)
5. Operand Store (OS)

Granularity of Pipeline Stages
(figure: alternative partitionings of the generic IF-ID-OF-EX-OS pipeline into coarser or finer stages, with the delay of each stage)
Hardware needed:
- Logic for each pipeline stage
- Register file ports to support all the stages
- Memory access ports to support all the stages

Example Pipelines
(figure: the 5-stage MIPS R2000/R3000 pipeline - IF, RD, ALU, MEM, WB - shown alongside the 12-stage AMDAHL 470V/7 pipeline)

Amdahl 470V/7 Pipeline - Actions Performed per Stage
S1  Compute instruction address: request the instruction at the next sequential address from the storage control unit (S-unit)
S2  Start buffer: initiates the cache to read the instruction
S3  Read buffer: reads the instruction from the cache into the I-unit
S4  Decode instruction: decodes the instruction opcode
S5  Read GPRs: reads the general-purpose registers (GPRs) used as address registers
S6  Compute operand address: computes the address of the current memory operand
S7  Start buffer: initiates the cache to read the memory operand
S8  Read buffer: reads the operand from the cache into the I-unit; also reads register operands
S9  Execute: passes data to the E-unit and begins execution
S10 Execute: completes instruction execution
S11 Check result: performs a code-based error check on the result
S12 Write result: stores the result in a CPU register

Instruction Pipeline (outline)
UNIFYING INSTRUCTION TYPES
- Classification of instruction types
- Coalescing resource requirements
- Instruction pipeline implementation
MINIMIZING PIPELINE STALLS
- Program dependences and pipeline hazards

Unifying Instruction Types
Requirements of different instruction types (assume a typical modern RISC ISA).
ALU instructions (register data flow):
- Integer: access the integer register file; perform the operation; write back to the register file. Typically 1 cycle of (execution) latency.
- Floating-point: access the FP register file; perform the operation; write back to the register file. Typically > 1 cycle latency (variable for different ops).

Unifying Instruction Types (continued)
Load/store instructions (memory data flow):
- Load: access the register file; generate the effective address; access (read) memory (D-cache); write back to the register file. Typically 1+ cycle latency.
- Store: access the register file; generate the effective address; access (write) memory (D-cache). Typically 1+ cycle latency.
Branch instructions (instruction flow):
- Jump (unconditional): access the register file; generate the effective address; update the PC. Typically no penalty cycle(s).
- Conditional branch: access the register file; generate the effective address; evaluate the condition code; update the PC (if the condition is true). Typically incurs penalty cycles.

Instruction Unification Procedure
- Analyze the sequence of register transfers required by each instruction type.
- Find commonality across instruction types and merge them to share the same pipeline stage.
- If there is flexibility, shift or reorder some register transfers to facilitate further merging.

ALU Instruction Specification
Generic subcomputations for instruction type 1 (ALU instructions):
IF: Fetch instruction (access I-cache) - both integer and floating-point
ID: Decode instruction - both
OF: Integer: access the register file. FP: access the FP register file.
EX: Integer: perform the ALU operation. FP: perform the FP operation.
OS: Integer: write back to the register file. FP: write back to the FP register file.

Memory Instruction Specification
Generic subcomputations for instruction type 2 (load/store instructions):
IF: Fetch instruction (access I-cache) - both load and store
ID: Decode instruction - both
OF: Load: access the register file (base address); generate the effective address (base + offset); access (read) the memory location (D-cache).
    Store: access the register file (register operand and base address); generate the effective address (base + offset).
EX: (none for either)
OS: Load: write back to the register file.
    Store: access (write) the memory location (D-cache).

Branch Instruction Specification
Generic subcomputations for instruction type 3 (branch instructions):
IF: Fetch instruction (access I-cache) - both jump and conditional branch
ID: Decode instruction - both
OF: Jump (uncond.): access the register file (base address); generate the effective address (base + offset).
    Conditional branch: access the register file (base address); generate the effective address (base + offset).
EX: Jump: (none).
    Conditional branch: evaluate the branch condition.
OS: Jump: update the program counter with the target address.
    Conditional branch: if the condition is true, update the program counter with the target address.

Coalescing Resource Requirements - The Unified Pipeline
(figure: in the IF stage, each of the four instruction types - ALU, LOAD, STORE, BRANCH - accesses the I-cache and the PC)
Stage assignments for each instruction type across the six stages IF, ID, RD, ALU, MEM, WB:
IF  stage: all types: read the instruction from the I-cache; PC++
ID  stage: all types: decode
RD  stage: ALU: read regs (source operands); LOAD: read reg (memory base address); STORE: read regs (memory base address, store data); BRANCH: read reg (branch target base address)
ALU stage: ALU: ALU operation; LOAD: compute memory address; STORE: compute memory address; BRANCH: compute branch target address
MEM stage: ALU: -; LOAD: memory read; STORE: memory write; BRANCH: PC update
WB  stage: ALU: write result to destination reg; LOAD: write data to destination reg; STORE: -; BRANCH: -
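
The unified stage assignments above can be captured as a small lookup structure. This is purely an illustrative sketch (Python), not a description of any real control implementation; the stage and class names follow the table above.

# What each instruction class does in each stage of the unified 6-stage pipeline.
CLASSES = ("ALU", "LOAD", "STORE", "BRANCH")
UNIFIED_PIPELINE = {
    "IF":  {t: "read instr. from I-cache; PC++" for t in CLASSES},
    "ID":  {t: "decode" for t in CLASSES},
    "RD":  {"ALU": "read source regs", "LOAD": "read base reg",
            "STORE": "read base reg + store data", "BRANCH": "read target base reg"},
    "ALU": {"ALU": "ALU operation", "LOAD": "compute mem. address",
            "STORE": "compute mem. address", "BRANCH": "compute branch target"},
    "MEM": {"ALU": "-", "LOAD": "memory read", "STORE": "memory write", "BRANCH": "update PC"},
    "WB":  {"ALU": "write result reg", "LOAD": "write dest. reg", "STORE": "-", "BRANCH": "-"},
}

# Walk a LOAD instruction through the pipeline, stage by stage.
for stage in ("IF", "ID", "RD", "ALU", "MEM", "WB"):
    print(f"{stage:>3}: {UNIFIED_PIPELINE[stage]['LOAD']}")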

The 6-stage TYP Pipeline
Stages: IF, ID, RD, ALU, MEM, WB.

Interface to the Memory Subsystem
(figure: the IF stage presents an address to the I-cache and receives instruction data; the MEM stage presents an address, write data and a read/write control to the D-cache)

Register File Interface
(figure: the RD stage uses the two read ports (RAdd/RData) of the register file; the WB stage uses the write port (WAdd/WData))

TYP Pipeline Implementation
(figure: the datapath - the I-cache feeds the decode logic, the register file feeds the ALU, the D-cache sits in the MEM stage, and the PC-update path closes the loop back to instruction fetch)

Another View of the Pipeline
Staging buffers between stages:
- Instruction Fetch / Instruction Decode (IF/ID)
- Instruction Decode / Register Read (ID/RD)
- Register Read / ALU Operation (RD/ALU)
- ALU Operation / Memory Access (ALU/MEM)
- Memory Access / Register Write-Back (MEM/WB)
Note: the blue cross-hatched boxes in the figure denote the buffers (staging logic) between stages.
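
To make the stage overlap in the 6-stage TYP pipeline concrete, here is a toy pipeline-diagram generator (a Python sketch I am adding, assuming one instruction enters per cycle and no hazards; the instruction names are arbitrary).

STAGES = ["IF", "ID", "RD", "ALU", "MEM", "WB"]

def diagram(instrs):
    """Print a cycle-by-cycle occupancy chart, one row per instruction."""
    n_cycles = len(instrs) + len(STAGES) - 1
    print("cycle:".ljust(8) + " ".join(f"{c + 1:>3}" for c in range(n_cycles)))
    for i, name in enumerate(instrs):
        row = ["   "] * n_cycles
        for s, stage in enumerate(STAGES):
            row[i + s] = f"{stage:>3}"      # instruction i occupies stage s in cycle i+s
        print(name.ljust(8) + " ".join(row))

diagram(["add", "sub", "lw", "sw"])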

Reconciling the Views
(figure: the same datapath redrawn with the staging-logic affiliation shown in green - each piece of logic belongs to one of the buffers IF/ID, ID/RD, RD/ALU, ALU/MEM, MEM/WB; the D-cache is written from ALU/MEM for stores and read into MEM/WB for loads, and the register file is read in ID/RD and written from MEM/WB)

Program Dependences
(figure: instructions i1, i2, i3 with dependence arcs drawn between individual subcomputations)
The implied sequential precedences are an overspecification. They are sufficient, but not necessary, to ensure program correctness. A true dependence between two instructions may involve only one subcomputation of each instruction.

Program Data Dependences
Let D(i) be the set of locations instruction i reads and R(i) the set it writes. For an instruction i followed by an instruction j:
- True dependence (RAW): R(i) ∩ D(j) ≠ ∅ - j cannot execute until i produces its result.
- Anti-dependence (WAR): D(i) ∩ R(j) ≠ ∅ - j cannot write its result until i has read its sources.
- Output dependence (WAW): R(i) ∩ R(j) ≠ ∅ - j cannot write its result until i has written its result.

Control Dependences
- Conditional branches: the branch must execute to determine which instruction to fetch next.
- Instructions following a conditional branch are control dependent on the branch instruction.
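
The set definitions above translate directly into a dependence classifier. A minimal sketch (Python, my illustration): D() is modeled as the set of registers an instruction reads and R() as the set it writes.

def dependences(i_reads, i_writes, j_reads, j_writes):
    """Classify dependences from instruction i to a later instruction j."""
    found = []
    if i_writes & j_reads:
        found.append("RAW (true)")     # R(i) ∩ D(j) != ∅
    if i_reads & j_writes:
        found.append("WAR (anti)")     # D(i) ∩ R(j) != ∅
    if i_writes & j_writes:
        found.append("WAW (output)")   # R(i) ∩ R(j) != ∅
    return found

# add $1, $2, $3  followed by  sub $4, $5, $1
print(dependences({"$2", "$3"}, {"$1"}, {"$5", "$1"}, {"$4"}))   # ['RAW (true)']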

Example (quicksort/MIPS)
# for (; (j < high) && (array[j] < array[low]) ; ++j );
# $10 = j, $9 = high, $6 = array, $8 = low
      bge   done, $10, $9
      mul   $15, $10, 4
      addu  $24, $6, $15
      lw    $25, 0($24)
      mul   $13, $8, 4
      addu  $14, $6, $13
      lw    $15, 0($14)
      bge   done, $25, $15
cont: addu  $10, $10, 1
      ...
done: addu  $11, $11, -1

Resolution of Pipeline Hazards
- Pipeline hazards are potential violations of program dependences; we must ensure that program dependences are not violated.
- Hazard resolution: static - the compiler/programmer guarantees correctness; dynamic - the hardware performs checks at run time.
- Pipeline interlock: a hardware mechanism for dynamic hazard resolution; it must detect and enforce dependences at run time.

Pipeline Hazards - Necessary Conditions
- WAR: the write stage is earlier than the read stage. Is this possible in IF-ID-RD-EX-MEM-WB?
- WAW: one write stage is earlier than another write stage. Is this possible in IF-ID-RD-EX-MEM-WB?
- RAW: the read stage is earlier than the write stage. Is this possible in IF-ID-RD-EX-MEM-WB?
If the conditions are not met, there is no need to resolve the hazard. Check for both register and memory hazards.

RAW Data Dependence
An earlier instruction produces a value used by a later instruction:
  add $1, $2, $3
  sub $4, $5, $1
Cycle:   1 2 3 4 5 6 7
add      F D R X M W
sub        F D R X M W
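
The same idea extends to scanning a short instruction sequence, such as the loop body above, for RAW pairs that are close enough to matter in the pipeline. This is a sketch (Python) with a hand-encoded destination/source summary of a few of the instructions, not a real MIPS decoder; the 3-instruction window is an assumption for illustration.

# (opcode, destination, sources) for a fragment of the loop body above.
code = [
    ("mul",  "$15", ("$10",)),
    ("addu", "$24", ("$6", "$15")),
    ("lw",   "$25", ("$24",)),
    ("mul",  "$13", ("$8",)),
]

WINDOW = 3   # RAW pairs closer than this may need forwarding or a stall
for i, (op_i, dst, _) in enumerate(code):
    for j in range(i + 1, min(i + 1 + WINDOW, len(code))):
        op_j, _, srcs = code[j]
        if dst in srcs:
            print(f"RAW: {op_i} (#{i}) -> {op_j} (#{j}), distance {j - i}")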

RAW Data Dependence - Stall
Detect the dependence and stall:
  add $1, $2, $3
  sub $4, $5, $1
Cycle:   1 2 3 4 5 6 7 8 9 10
add      F D R X M W
sub        F D * * * R X M W
Note: 3 stall cycles (OUCH!!!)

Data Hazard Mitigation
A better response: forwarding, also called bypassing.
- Comparators ensure a register is read after it is written.
- Instead of stalling until the write occurs, use a mux to select the forwarded value rather than the register-file value.
- Control the mux with hazard-detection logic.

RAW Data Dependence - With Data Forwarding Paths
Detect the dependence and forward the data directly to the next instruction:
  add $1, $2, $3
  sub $4, $5, $1
Cycle:   1 2 3 4 5 6 7
add      F D R X M W
sub        F D R X M W
The RAW dependence is detected at the decode stage. The data is forwarded directly from the EX stage of the first instruction to the EX stage of the second instruction.
Note: no stalls (but additional hardware is required to detect hazards and forward data).

Other RAW Dependences to Consider
Dependences among non-adjacent instructions:
  add $1, $2, $3
  add $7, $8, $9
  sub $4, $5, $1
Dependences involving load instructions:
  add $1, $2, $3
  ld  $5, 0($7)
  sub $4, $5, $1
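
The comparator-and-mux idea can be sketched as follows (Python; the function, register names and pipeline-latch fields are my illustrative assumptions, not the slide's exact signal names).

def ex_operand(src_reg, regfile, ex_mem, mem_wb):
    """Select the value fed to the ALU for one source register.
    ex_mem / mem_wb model the downstream pipeline latches as (dest_reg, value)."""
    if ex_mem and ex_mem[0] == src_reg:     # forward from the instruction just ahead
        return ex_mem[1]
    if mem_wb and mem_wb[0] == src_reg:     # forward from two instructions ahead
        return mem_wb[1]
    return regfile[src_reg]                 # no hazard: use the register-file value

regfile = {"$1": 0, "$5": 7}
print(ex_operand("$1", regfile, ex_mem=("$1", 42), mem_wb=None))   # 42, bypassed value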

Forwarding Paths (ALU instructions)
(figure: forwarding paths a, b, c from the RD/ALU, ALU/MEM and MEM/WB staging buffers back to the ALU inputs)
For an instruction i that writes R1, followed by instructions i+1, i+2, i+3 that read R1:
- (i -> i+1): forwarding via path a
- (i -> i+2): forwarding via path b
- (i -> i+3): i writes R1 before i+3 reads R1 - write before read in the register file

Write-before-Read Register File
- Register-file designs with 2-phase clocks are common: write the RF on the first phase, read it on the second phase.
- Hence, in the same cycle: write $1, then read $1 - no bypass needed.
- If the register file reads before it writes, or is DFF-based, a bypass is needed.

Implementation of Forwarding
(figure: comparators on the ID/RD, RD/ALU, ALU/MEM and MEM/WB buffers control the muxes that select between the register-file outputs and the forwarded values)

Forwarding Paths (Load instructions)
(figure: load forwarding path(s) d, e from the D-cache output / MEM/WB buffer back to the ALU inputs)
For a load i: R1 <- MEM[...], followed by instructions i+1, i+2, i+3 that read R1:
- (i -> i+1): stall i+1
- (i -> i+2): forwarding via path d
- (i -> i+3): i writes R1 before i+3 reads R1
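
Load-use hazards need the extra stall because the loaded value only exists after the MEM stage. A minimal detection sketch (Python; the dictionary fields are illustrative assumptions, not real control signals):

def load_use_stall(instr_in_ex, instr_in_rd):
    """Stall the younger instruction if it reads the register that a load
    immediately ahead of it has not yet fetched from memory."""
    is_load = instr_in_ex["op"] == "lw"
    uses_dest = instr_in_ex["dest"] in instr_in_rd["srcs"]
    return is_load and uses_dest

older = {"op": "lw",  "dest": "$5", "srcs": ["$7"]}
newer = {"op": "sub", "dest": "$4", "srcs": ["$5", "$1"]}
print(load_use_stall(older, newer))   # True: insert one bubble, then forward from MEM/WB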

Implementation of Load Forwarding
(figure: the load-forwarding datapath - comparators on the staging buffers control the muxes feeding the ALU, the D-cache output is forwarded from MEM/WB, and a load-stall signal holds the IF, ID and RD stages while a bubble is inserted into ALU/MEM)

Control Flow Hazards
- Control flow instructions: branches, jumps, jals, returns.
- Can't fetch the next instruction until the branch outcome is known - too late for the next IF.

Control Flow Hazards - Important Pipeline Considerations
- Where is the branch target address (BTA) computed?
- For conditional branches, how/where is the branch outcome determined?
For our simple pipeline, assume:
- The BTA is computed in the EX stage, and the PC update is done during the MEM stage.
- The branch outcome is determined during the EX stage.

Control Dependence
One instruction affects which instruction executes next:
  sw  $4, 0($5)
  bne $2, $3, loop
  sub $6, $7, $8
Cycle:   1 2 3 4 5 6 7 8
sw       F D R X M W
bne        F D R X M W
sub          F D R X M W
(sub is fetched before the outcome of bne is known)

Control Dependence - Stall Until the Branch Outcome Is Known
Detect the dependence and stall:
  sw  $4, 0($5)
  bne $2, $3, loop
  sub $6, $7, $8
When the branch instruction is decoded, stall the pipeline until the branch outcome is known.
Cycle:    1 2 3 4 5 6 7 8 9 10 11 12
sw        F D R X M W
bne         F D R X M W
(next)        * * * * F D R  X  M  W    <- new fetch at the branch-outcome address
Note: 4 stall cycles (reducible to 3 with a minor pipeline redesign).

Control Dependence - Reducing to 3 Stall Cycles
With the minor redesign, the fetch at the branch-outcome address begins one cycle earlier:
Cycle:    1 2 3 4 5 6 7 8 9 10 11
sw        F D R X M W
bne         F D R X M W
(next)        * * * F D R X  M  W       <- 3 stall cycles

Control Flow Hazards - What To Do?
Always stall?
- Easy to implement, but performs poorly.
- Assume 1 out of every 5 instructions is a branch, and each branch introduces three stall cycles:
  CPI = 1 + (3 x .2) = 1.6 (lower bound)
  Branch penalty (average stall cycles per branch): BP = 3

Control Flow Hazards - What Else Could We Do?
Predict branch not taken:
- Continue to fetch instructions beyond the branch point into the pipeline.
- These instructions must be cancelled later if the branch prediction is incorrect.
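
The "always stall" bound above is just CPI = 1 + B x penalty. A one-line check (Python sketch, using the slide's numbers):

def cpi_with_branch_stalls(branch_fraction, stall_cycles):
    # Lower bound: every branch adds stall_cycles bubbles; nothing else stalls.
    return 1 + branch_fraction * stall_cycles

print(cpi_with_branch_stalls(0.20, 3))   # 1.6, matching the slide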

Branch Frequencies
(figure: branch frequency data from Hennessy & Patterson, Computer Architecture: A Quantitative Approach, 2nd Ed.)

Branching Behavior
(figure: taken/not-taken branch behavior data from Hennessy & Patterson, Computer Architecture: A Quantitative Approach, 2nd Ed.)

Control Dependence - Static Not-Taken Branch Prediction (branch is taken)
Detect the dependence and cancel/refetch if necessary:
  sw  $4, 0($5)      // I
  bne $2, $3, loop   // I+1
  sub $6, $7, $8     // I+2
  add $1, $2, $3     // I+3
When the branch outcome (branch taken) is known, cancel the execution of the instruction(s) in the pipeline and refetch at the branch target address.
Cycle:     1 2 3 4 5 6 7 8 9 10 11
sw         F D R X M W
bne          F D R X M W
sub            F D R (cancelled)
I+3              F D (cancelled)
I+4                F (cancelled)
(target)             F D R X M  W       <- new fetch at the branch-outcome address

Control Dependence - Static Not-Taken Branch Prediction (branch is not taken)
No stalls if the assumption (branch not taken) is correct:
  sw  $4, 0($5)
  bne $2, $3, loop
  sub $6, $7, $8
  add $1, $2, $3
If the branch-outcome guess was correct, continue processing the instructions following the branch instruction.
Cycle:   1 2 3 4 5 6 7 8 9
sw       F D R X M W
bne        F D R X M W
sub          F D R X M W
add            F D R X M W

Performance of Static Not-Taken Branch Prediction
Let T denote the fraction of executed branches that are taken:
  Branch penalty = 3T + 0(1-T) = 3T
Let B denote the fraction of executed instructions that are branches:
  CPI (lower bound) = 1 + B(3T)
For B = .2, T = .667: branch penalty = 2.0, CPI = 1 + .2(2.0) = 1.4
For B = .2, T = .333: branch penalty = 1.0, CPI = 1 + .2(1.0) = 1.2

Other Static Branch Prediction Strategies (not covered in the textbook)
- If branches are more likely to be taken than not taken, a static taken prediction strategy would be preferable to a not-taken strategy.
- But the branch target address must be computed before the instruction fetch at the BTA can commence.
- Partial solution: move BTA generation to the ID stage.

Control Dependence - Static Taken Branch Prediction (branch is taken)
Detect the dependence and cancel/refetch if necessary:
  sw  $4, 0($5)
  bne $2, $3, loop
  sub $6, $7, $8
  add $1, $2, $3
When the branch is predicted taken, the sequentially fetched instruction in the pipeline is cancelled and fetching begins at the branch target address.
sw                 F D R X M W
bne                  F D R X M W
sub                    F (cancelled)
1st instr at BTA         F D R X M W
2nd instr at BTA           F D R X M W
<- begin fetch at the branch target address

Control Dependence - Static Taken Branch Prediction (branch is not taken)
Detect the dependence and cancel/refetch if necessary:
  sw  $4, 0($5)      // I
  bne $2, $3, loop   // I+1
  sub $6, $7, $8     // I+2
  add $1, $2, $3     // I+3
When the branch outcome (branch not taken) is known, cancel the execution of the instruction(s) fetched from the branch target and refetch at the next not-taken address (I+2).
sw                      F D R X M W
bne                       F D R X M W
1st instr at BTA            F D (cancelled)
2nd instr at BTA              F (cancelled)
sub (refetched at I+2)          F D R X M W

Performance of Static Taken Branch Prediction
Let T denote the fraction of executed branches that are taken:
  Branch penalty = 1(T) + 2(1-T) = 2 - T
Let B denote the fraction of executed instructions that are branches:
  CPI (lower bound) = 1 + B(2-T)
For B = .2, T = .667: branch penalty = 1.33, CPI = 1 + .2(1.33) = 1.27
For B = .2, T = .333: branch penalty = 1.67, CPI = 1 + .2(1.67) = 1.33

Performance Comparison: Static Not-Taken versus Static Taken Prediction
                   Stall (no static prediction)   Static not-taken             Static taken
B = .2, T = .667   penalty = 3.0, CPI = 1.6       penalty = 2.0, CPI = 1.4     penalty = 1.33, CPI = 1.27
B = .2, T = .333   penalty = 3.0, CPI = 1.6       penalty = 1.0, CPI = 1.2     penalty = 1.67, CPI = 1.33
Static taken prediction outperforms static not-taken prediction when more than 50% of branches are taken.

Control Flow Hazards - Continued
Another option: delayed branches.
- The processor always executes the instruction(s) following a branch until the branch outcome is determined; these slots are referred to as the branch shadow.
- These instructions are considered to logically occur prior to the branch.
- The compiler is responsible for rearranging the program to place useful instructions into the branch shadow; if it can't put a useful instruction there, it must insert a nop.
- Stalls are eliminated only if useful instructions (not nops) can be placed in the shadow. This is often difficult.
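
The penalty and CPI formulas for the two static strategies tabulate directly. A small sketch (Python) that reproduces the comparison above, with the always-stall case included as a baseline:

def branch_penalty(taken_frac, strategy):
    if strategy == "not-taken":
        return 3 * taken_frac                          # 3-cycle penalty only when taken
    if strategy == "taken":
        return 1 * taken_frac + 2 * (1 - taken_frac)   # 1 cycle if taken, 2 if not
    return 3                                           # always stall

B = 0.2
for T in (0.667, 0.333):
    for strat in ("stall", "not-taken", "taken"):
        bp = branch_penalty(T, strat)
        print(f"T={T:.3f} {strat:>9}: penalty={bp:.2f}, CPI={1 + B * bp:.2f}")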

Control Dependence - Delayed Branching
Place instructions or nops in the branch shadow:
  bne $2, $3, loop
  (first shadow instruction here)
  (second shadow instruction here)
  (third shadow instruction here)
  sub $6, $7, $8

bne                      F D R X M W
1st shadow instruction     F D R X M W
2nd shadow instruction       F D R X M W
3rd shadow instruction         F D R X M W
First instruction logically      F D R X M W
following the branch
The branch outcome is known before the first instruction logically following the branch is fetched.

Exceptions and Pipelining
Consider processor exceptions such as arithmetic exceptions (overflow, divide-by-zero, etc.), interrupts, etc. These are essentially surprise branches. The pipeline state must be saved in a clean manner in order to return from (recover from) exceptions.
(figure: instruction stream I1 I2 I3 I4 I5 I6 I7 I8 ...; the exception occurs during the execution of one of these instructions, the following instructions are already in the pipeline (partially executed) at that point, and control transfers to the exception-handling routine)

Exceptions (continued)
Even worse: multiple exceptions could occur in one cycle:
- I/O interrupt (MEM)
- User trap to the OS (EX)
- Illegal instruction (ID)
- Arithmetic overflow
- Hardware error
- Etc.
Interrupt priorities must be supported.

MIPS R2000/R3000 Pipeline
Stage   Phase   Function performed
IF      φ1      Translate the virtual instruction address using the TLB
        φ2      Access the I-cache
RD      φ1      Return the instruction from the I-cache; check tags and parity
        φ2      Read the RF; if branch, generate the target (separate adder)
ALU     φ1      Start the ALU op; if branch, check the condition
        φ2      Finish the ALU op; if ld/st, translate the address
MEM     φ1      Access the D-cache
        φ2      Return data from the D-cache; check tags and parity
WB      φ1      Write the RF

IBM RISC Experience [Agerwala and Cocke 1987]
Internal IBM study: what are the limits of a scalar pipeline?
Memory bandwidth:
- Fetch 1 instruction per cycle from the I-cache
- 40% of instructions are load/store (D-cache)
Code characteristics (dynamic):
- Loads: 25%
- Stores: 15%
- ALU/RR: 40%
- Branches: 20%, of which 1/3 unconditional (always taken), 1/3 conditional taken, 1/3 conditional not taken

IBM Experience - Cache Performance
- Assume a 100% hit ratio (upper bound)
- Cache latency: I = D = 1 cycle default
Load and branch scheduling:
- Loads: 25% cannot be scheduled (delay slot empty); 65% can be moved back 1 or 2 instructions; 10% can be moved back 1 instruction
- Branches: unconditional - 100% schedulable (fill one delay slot); conditional - 50% schedulable (fill one delay slot)

CPI Optimizations
Goal and impediments: CPI = 1, prevented by pipeline stalls.
No cache bypass of the RF, no load/branch scheduling:
- Load penalty: 2 cycles: 0.25 x 2 = 0.5 CPI
- Branch penalty: 2 cycles: 0.2 x 2/3 x 2 = 0.27 CPI
- Total CPI: 1 + 0.5 + 0.27 = 1.77 CPI
Bypass, no load/branch scheduling:
- Load penalty: 1 cycle: 0.25 x 1 = 0.25 CPI
- Total CPI: 1 + 0.25 + 0.27 = 1.52 CPI
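
The CPI bookkeeping above is easy to reproduce. A sketch (Python) using only the instruction mix and penalties quoted in the study:

loads, stores, alu, branches = 0.25, 0.15, 0.40, 0.20
taken_frac = 2.0 / 3.0     # unconditional branches + taken conditionals

# No bypass, no scheduling: 2-cycle load penalty, 2-cycle penalty for taken branches.
cpi_base = 1 + loads * 2 + branches * taken_frac * 2
# Add a cache-to-RF bypass: the load penalty drops to 1 cycle.
cpi_bypass = 1 + loads * 1 + branches * taken_frac * 2

print(f"no bypass: {cpi_base:.2f} CPI   with bypass: {cpi_bypass:.2f} CPI")   # 1.77, 1.52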

More CPI Optimizations
Bypass, plus scheduling of loads and branches:
- Load penalty: 65% + 10% = 75% moved back, no penalty; 25% incur a 1-cycle penalty:
  0.25 x 0.25 x 1 = 0.0625 CPI
- Branch penalty:
  1/3 unconditional, 100% schedulable => 1 cycle
  1/3 conditional not-taken => no penalty (predict not-taken)
  1/3 conditional taken, 50% schedulable => 1 cycle
  1/3 conditional taken, 50% unschedulable => 2 cycles
  0.2 x [1/3 x 1 + 1/3 x 0.5 x 1 + 1/3 x 0.5 x 2] = 0.167 CPI
- Total CPI: 1 + 0.0625 + 0.167 = 1.23 CPI

Simplify Branches
Assume 90% of branches can be PC-relative:
- No register indirect, no register access
- Separate adder (like the MIPS R3000)
- Branch penalty reduced:
  PC-relative   Schedulable   Penalty
  Yes (90%)     Yes (50%)     0 cycles
  Yes (90%)     No (50%)      1 cycle
  No (10%)      Yes (50%)     1 cycle
  No (10%)      No (50%)      2 cycles
Total CPI: 1.15 CPI - about 15% overhead from program dependences.

Review
- Pipelining overview
- Control
- Data hazards: stalls; forwarding or bypassing
- Control flow hazards: branch prediction
- Exceptions
- Real pipelines
