Pipelined Processors. Ideal Pipelining. Example: FP Multiplier. 55:132/22C:160, Spring, Jon Kuhl


Pipelined Processors (Chapter 2)

Pipelined Design
Motivation: increase processor throughput with a modest increase in hardware.
Bandwidth or throughput = performance.
Bandwidth (BW) = number of tasks per unit time.
For a system that operates on one task at a time, BW = 1/delay (latency), where latency is the time required to complete one task.
BW can be increased by pipelining if many operands exist which need the same operation, i.e. many repetitions of the same task are to be performed. The latency required for each task remains the same, or may even increase slightly.

Ideal Pipelining
- Combinational logic with n gate delays: BW = ~(1/n)
- Split into two stages of n/2 gate delays each, with a latch (L) after each stage: BW = ~(2/n)
- Split into three stages of n/3 gate delays each: BW = ~(3/n)
Bandwidth increases linearly with pipeline depth; latency increases only by the added latch delays.

Example: FP Multiplier
Number format:
- Exponent: excess-128 (8 bits)
- Mantissa: sign-magnitude fraction with hidden bit (57 bits total)
Algorithm (sign, exponent, mantissa):
1. Check if either operand is ZERO.
2. ADD the two characteristics (the physical bit patterns of the exponents) and correct for the excess-128 bias, i.e. e1 + (e2 - 128).
3. Perform a fixed-point MULTIPLICATION of the mantissas.
4. NORMALIZE the product of the mantissas; this may require one left shift and a decrement of the exponent.
5. ROUND the result by adding 1 to the first guard bit; if the mantissa overflows, shift right one bit and increment the exponent.
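
The ideal-pipelining relationship above is easy to check numerically. The following is a minimal sketch (Python, not from the original slides); the gate-delay and latch-delay numbers are arbitrary illustrative values.

# Ideal pipelining: a combinational block of n gate delays is cut into k
# equal stages, each followed by a latch of delay L.
def pipeline_metrics(n_gate_delay, k_stages, latch_delay):
    stage_delay = n_gate_delay / k_stages + latch_delay   # clock period
    bandwidth = 1.0 / stage_delay                         # approaches k/n for small L
    latency = k_stages * stage_delay                      # n + k*L in total
    return bandwidth, latency

for k in (1, 2, 3):
    bw, lat = pipeline_metrics(n_gate_delay=300, k_stages=k, latch_delay=10)
    print(f"{k} stage(s): BW = {bw:.4f} tasks/unit time, latency = {lat:.0f}")

Running it shows bandwidth growing roughly linearly with k while total latency grows only by one latch delay per added stage, which is the point the slide makes.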

Nonpipelined Implementation
Chip counts and delays (based on very old IC technology):
- P.P. (partial product) generation: 34 chips, 125 ns
- P.P. reduction: 72 chips, 150 ns
- Final reduction: 21 chips, 55 ns
- Normalization: 2 chips, 20 ns
- Rounding: 15 chips, 50 ns
- Exponent section: 4 chips, 100 ns (in parallel with the mantissa path)
- Input registers: 17 chips
- Output registers: 10 chips
Unpipelined clock period = 400 nsec (2.5 MFLOPS).

Pipelined Implementation
Three-stage pipelining:
- Stage 1: P.P. generation (125 ns)
- Stage 2: P.P. reduction (150 ns)
- Stage 3: final reduction (55 ns) + normalization (20 ns) + rounding (50 ns)
(figure: the pipelined datapath, with staging registers carrying the sign s1/s2/s3, exponent e1/e2/e3 and mantissa m1/m2/m3 values between stages, and the exponent add/sub logic alongside)
Longest delay path within a stage = 150 nsec.
Clock to register output + longest delay path + set-up time = minimum clock period = 172 nsec.
Number of ICs added: 82 (edge-triggered registers).
Original total delay: 400 nsec (2.5 MFLOPS). New minimum clock period: 172 nsec (5.8 MFLOPS).
Original number of ICs: 175 chips. New total number of ICs: 257 chips.
Less than a 50% increase in hardware more than doubles the throughput!
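
The throughput and hardware arithmetic above can be reproduced in a few lines. This is a sketch (Python) using only the figures quoted on this page; "MFLOPS" here is simply the reciprocal of the clock period.

# Nonpipelined vs. three-stage pipelined FP multiplier.
unpipelined_period_ns = 125 + 150 + 55 + 20 + 50   # = 400 ns
pipelined_period_ns = 150 + 17 + 5                 # longest stage + clock-to-Q + setup = 172 ns

mflops_before = 1e3 / unpipelined_period_ns        # 1/(ns) * 1000 = MFLOPS, = 2.5
mflops_after = 1e3 / pipelined_period_ns           # ~ 5.8
chips_before, chips_after = 175, 175 + 82          # 82 added register chips

print(f"throughput: {mflops_before:.1f} -> {mflops_after:.1f} MFLOPS "
      f"({mflops_after / mflops_before:.2f}x)")
print(f"hardware:   {chips_before} -> {chips_after} chips "
      f"(+{100 * (chips_after - chips_before) / chips_before:.0f}%)")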

Pipelining Idealisms
- Uniform subcomputations: can pipeline into stages with equal delay; balance the pipeline stages.
- Identical computations: can fill the pipeline with identical work; unify instruction types (example later).
- Independent computations: no relationships between work units; minimize pipeline stalls.
Are these practical? No, but we can get close enough to obtain significant speedup.

Instruction Pipelining
The computation to be pipelined:
- Instruction Fetch (IF)
- Instruction Decode (ID)
- Operand(s) Fetch (OF)
- Instruction Execution (EX)
- Operand Store (OS)
- Update Program Counter (PC)

Generic Instruction Pipeline (based on the obvious subcomputations):
1. Instruction Fetch (IF)
2. Instruction Decode (ID)
3. Operand Fetch (OF)
4. Instruction Execute (EX)
5. Operand Store (OS)

Granularity of Pipeline Stages
(figure: alternative partitionings of the generic IF-ID-OF-EX-OS pipeline into coarser or finer stages, with the delay of each stage)
Hardware needed:
- Logic for each pipeline stage
- Register file ports to support all the stages
- Memory access ports to support all the stages

Example Pipelines
(figure: the 5-stage MIPS R2000/R3000 pipeline - IF, RD, ALU, MEM, WB - shown alongside the 12-stage AMDAHL 470V/7 pipeline)

Amdahl 470V/7 Pipeline - Actions Performed per Stage
S1  Compute instruction address: request the instruction at the next sequential address from the storage control unit (S-unit)
S2  Start buffer: initiates the cache to read the instruction
S3  Read buffer: reads the instruction from the cache into the I-unit
S4  Decode instruction: decodes the instruction opcode
S5  Read GPRs: reads the general-purpose registers (GPRs) used as address registers
S6  Compute operand address: computes the address of the current memory operand
S7  Start buffer: initiates the cache to read the memory operand
S8  Read buffer: reads the operand from the cache into the I-unit; also reads register operands
S9  Execute: passes data to the E-unit and begins execution
S10 Execute: completes instruction execution
S11 Check result: performs a code-based error check on the result
S12 Write result: stores the result in a CPU register

Instruction Pipeline (outline)
UNIFYING INSTRUCTION TYPES
- Classification of instruction types
- Coalescing resource requirements
- Instruction pipeline implementation
MINIMIZING PIPELINE STALLS
- Program dependences and pipeline hazards

Unifying Instruction Types
Requirements of different instruction types (assume a typical modern RISC ISA).
ALU instructions (register data flow):
- Integer: access the integer register file; perform the operation; write back to the register file. Typically 1 cycle of (execution) latency.
- Floating-point: access the FP register file; perform the operation; write back to the register file. Typically > 1 cycle latency (variable for different ops).

Unifying Instruction Types (continued)
Load/store instructions (memory data flow):
- Load: access the register file; generate the effective address; access (read) memory (D-cache); write back to the register file. Typically 1+ cycle latency.
- Store: access the register file; generate the effective address; access (write) memory (D-cache). Typically 1+ cycle latency.
Branch instructions (instruction flow):
- Jump (unconditional): access the register file; generate the effective address; update the PC. Typically no penalty cycle(s).
- Conditional branch: access the register file; generate the effective address; evaluate the condition code; update the PC (if the condition is true). Typically incurs penalty cycles.

Instruction Unification Procedure
- Analyze the sequence of register transfers required by each instruction type.
- Find commonality across instruction types and merge them to share the same pipeline stage.
- If there is flexibility, shift or reorder some register transfers to facilitate further merging.

ALU Instruction Specification
Generic subcomputations for instruction type 1 (ALU instructions):
IF: Fetch instruction (access I-cache) - both integer and floating-point
ID: Decode instruction - both
OF: Integer: access the register file. FP: access the FP register file.
EX: Integer: perform the ALU operation. FP: perform the FP operation.
OS: Integer: write back to the register file. FP: write back to the FP register file.

Memory Instruction Specification
Generic subcomputations for instruction type 2 (load/store instructions):
IF: Fetch instruction (access I-cache) - both load and store
ID: Decode instruction - both
OF: Load: access the register file (base address); generate the effective address (base + offset); access (read) the memory location (D-cache).
    Store: access the register file (register operand and base address); generate the effective address (base + offset).
EX: (none for either)
OS: Load: write back to the register file.
    Store: access (write) the memory location (D-cache).

Branch Instruction Specification
Generic subcomputations for instruction type 3 (branch instructions):
IF: Fetch instruction (access I-cache) - both jump and conditional branch
ID: Decode instruction - both
OF: Jump (uncond.): access the register file (base address); generate the effective address (base + offset).
    Conditional branch: access the register file (base address); generate the effective address (base + offset).
EX: Jump: (none).
    Conditional branch: evaluate the branch condition.
OS: Jump: update the program counter with the target address.
    Conditional branch: if the condition is true, update the program counter with the target address.

Coalescing Resource Requirements - The Unified Pipeline
(figure: in the IF stage, each of the four instruction types - ALU, LOAD, STORE, BRANCH - accesses the I-cache and the PC)
Stage assignments for each instruction type across the six stages IF, ID, RD, ALU, MEM, WB:
IF  stage: all types: read the instruction from the I-cache; PC++
ID  stage: all types: decode
RD  stage: ALU: read regs (source operands); LOAD: read reg (memory base address); STORE: read regs (memory base address, store data); BRANCH: read reg (branch target base address)
ALU stage: ALU: ALU operation; LOAD: compute memory address; STORE: compute memory address; BRANCH: compute branch target address
MEM stage: ALU: -; LOAD: memory read; STORE: memory write; BRANCH: PC update
WB  stage: ALU: write result to destination reg; LOAD: write data to destination reg; STORE: -; BRANCH: -
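
The unified stage assignments above can be captured as a small lookup structure. This is purely an illustrative sketch (Python), not a description of any real control implementation; the stage and class names follow the table above.

# What each instruction class does in each stage of the unified 6-stage pipeline.
CLASSES = ("ALU", "LOAD", "STORE", "BRANCH")
UNIFIED_PIPELINE = {
    "IF":  {t: "read instr. from I-cache; PC++" for t in CLASSES},
    "ID":  {t: "decode" for t in CLASSES},
    "RD":  {"ALU": "read source regs", "LOAD": "read base reg",
            "STORE": "read base reg + store data", "BRANCH": "read target base reg"},
    "ALU": {"ALU": "ALU operation", "LOAD": "compute mem. address",
            "STORE": "compute mem. address", "BRANCH": "compute branch target"},
    "MEM": {"ALU": "-", "LOAD": "memory read", "STORE": "memory write", "BRANCH": "update PC"},
    "WB":  {"ALU": "write result reg", "LOAD": "write dest. reg", "STORE": "-", "BRANCH": "-"},
}

# Walk a LOAD instruction through the pipeline, stage by stage.
for stage in ("IF", "ID", "RD", "ALU", "MEM", "WB"):
    print(f"{stage:>3}: {UNIFIED_PIPELINE[stage]['LOAD']}")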

The 6-stage TYP Pipeline
Stages: IF, ID, RD, ALU, MEM, WB.

Interface to the Memory Subsystem
(figure: the IF stage presents an address to the I-cache and receives instruction data; the MEM stage presents an address, write data and a read/write control to the D-cache)

Register File Interface
(figure: the RD stage uses the two read ports (RAdd/RData) of the register file; the WB stage uses the write port (WAdd/WData))

TYP Pipeline Implementation
(figure: the datapath - the I-cache feeds the decode logic, the register file feeds the ALU, the D-cache sits in the MEM stage, and the PC-update path closes the loop back to instruction fetch)

Another View of the Pipeline
Staging buffers between stages:
- Instruction Fetch / Instruction Decode (IF/ID)
- Instruction Decode / Register Read (ID/RD)
- Register Read / ALU Operation (RD/ALU)
- ALU Operation / Memory Access (ALU/MEM)
- Memory Access / Register Write-Back (MEM/WB)
Note: the blue cross-hatched boxes in the figure denote the buffers (staging logic) between stages.
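
To make the stage overlap in the 6-stage TYP pipeline concrete, here is a toy pipeline-diagram generator (a Python sketch I am adding, assuming one instruction enters per cycle and no hazards; the instruction names are arbitrary).

STAGES = ["IF", "ID", "RD", "ALU", "MEM", "WB"]

def diagram(instrs):
    """Print a cycle-by-cycle occupancy chart, one row per instruction."""
    n_cycles = len(instrs) + len(STAGES) - 1
    print("cycle:".ljust(8) + " ".join(f"{c + 1:>3}" for c in range(n_cycles)))
    for i, name in enumerate(instrs):
        row = ["   "] * n_cycles
        for s, stage in enumerate(STAGES):
            row[i + s] = f"{stage:>3}"      # instruction i occupies stage s in cycle i+s
        print(name.ljust(8) + " ".join(row))

diagram(["add", "sub", "lw", "sw"])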

Reconciling the Views
(figure: the same datapath redrawn with the staging-logic affiliation shown in green - each piece of logic belongs to one of the buffers IF/ID, ID/RD, RD/ALU, ALU/MEM, MEM/WB; the D-cache is written from ALU/MEM for stores and read into MEM/WB for loads, and the register file is read in ID/RD and written from MEM/WB)

Program Dependences
(figure: instructions i1, i2, i3 with dependence arcs drawn between individual subcomputations)
The implied sequential precedences are an overspecification. They are sufficient, but not necessary, to ensure program correctness. A true dependence between two instructions may involve only one subcomputation of each instruction.

Program Data Dependences
Let D(i) be the set of locations instruction i reads and R(i) the set it writes. For an instruction i followed by an instruction j:
- True dependence (RAW): R(i) ∩ D(j) ≠ ∅ - j cannot execute until i produces its result.
- Anti-dependence (WAR): D(i) ∩ R(j) ≠ ∅ - j cannot write its result until i has read its sources.
- Output dependence (WAW): R(i) ∩ R(j) ≠ ∅ - j cannot write its result until i has written its result.

Control Dependences
- Conditional branches: the branch must execute to determine which instruction to fetch next.
- Instructions following a conditional branch are control dependent on the branch instruction.
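
The set definitions above translate directly into a dependence classifier. A minimal sketch (Python, my illustration): D() is modeled as the set of registers an instruction reads and R() as the set it writes.

def dependences(i_reads, i_writes, j_reads, j_writes):
    """Classify dependences from instruction i to a later instruction j."""
    found = []
    if i_writes & j_reads:
        found.append("RAW (true)")     # R(i) ∩ D(j) != ∅
    if i_reads & j_writes:
        found.append("WAR (anti)")     # D(i) ∩ R(j) != ∅
    if i_writes & j_writes:
        found.append("WAW (output)")   # R(i) ∩ R(j) != ∅
    return found

# add $1, $2, $3  followed by  sub $4, $5, $1
print(dependences({"$2", "$3"}, {"$1"}, {"$5", "$1"}, {"$4"}))   # ['RAW (true)']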

Example (quicksort/MIPS)
# for (; (j < high) && (array[j] < array[low]) ; ++j );
# $10 = j, $9 = high, $6 = array, $8 = low
      bge   done, $10, $9
      mul   $15, $10, 4
      addu  $24, $6, $15
      lw    $25, 0($24)
      mul   $13, $8, 4
      addu  $14, $6, $13
      lw    $15, 0($14)
      bge   done, $25, $15
cont: addu  $10, $10, 1
      ...
done: addu  $11, $11, -1

Resolution of Pipeline Hazards
- Pipeline hazards are potential violations of program dependences; we must ensure that program dependences are not violated.
- Hazard resolution: static - the compiler/programmer guarantees correctness; dynamic - the hardware performs checks at run time.
- Pipeline interlock: a hardware mechanism for dynamic hazard resolution; it must detect and enforce dependences at run time.

Pipeline Hazards - Necessary Conditions
- WAR: the write stage is earlier than the read stage. Is this possible in IF-ID-RD-EX-MEM-WB?
- WAW: one write stage is earlier than another write stage. Is this possible in IF-ID-RD-EX-MEM-WB?
- RAW: the read stage is earlier than the write stage. Is this possible in IF-ID-RD-EX-MEM-WB?
If the conditions are not met, there is no need to resolve the hazard. Check for both register and memory hazards.

RAW Data Dependence
An earlier instruction produces a value used by a later instruction:
  add $1, $2, $3
  sub $4, $5, $1
Cycle:   1 2 3 4 5 6 7
add      F D R X M W
sub        F D R X M W
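
The same idea extends to scanning a short instruction sequence, such as the loop body above, for RAW pairs that are close enough to matter in the pipeline. This is a sketch (Python) with a hand-encoded destination/source summary of a few of the instructions, not a real MIPS decoder; the 3-instruction window is an assumption for illustration.

# (opcode, destination, sources) for a fragment of the loop body above.
code = [
    ("mul",  "$15", ("$10",)),
    ("addu", "$24", ("$6", "$15")),
    ("lw",   "$25", ("$24",)),
    ("mul",  "$13", ("$8",)),
]

WINDOW = 3   # RAW pairs closer than this may need forwarding or a stall
for i, (op_i, dst, _) in enumerate(code):
    for j in range(i + 1, min(i + 1 + WINDOW, len(code))):
        op_j, _, srcs = code[j]
        if dst in srcs:
            print(f"RAW: {op_i} (#{i}) -> {op_j} (#{j}), distance {j - i}")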

RAW Data Dependence - Stall
Detect the dependence and stall:
  add $1, $2, $3
  sub $4, $5, $1
Cycle:   1 2 3 4 5 6 7 8 9 10
add      F D R X M W
sub        F D * * * R X M W
Note: 3 stall cycles (OUCH!!!)

Data Hazard Mitigation
A better response: forwarding, also called bypassing.
- Comparators ensure a register is read after it is written.
- Instead of stalling until the write occurs, use a mux to select the forwarded value rather than the register-file value.
- Control the mux with hazard-detection logic.

RAW Data Dependence - With Data Forwarding Paths
Detect the dependence and forward the data directly to the next instruction:
  add $1, $2, $3
  sub $4, $5, $1
Cycle:   1 2 3 4 5 6 7
add      F D R X M W
sub        F D R X M W
The RAW dependence is detected at the decode stage. The data is forwarded directly from the EX stage of the first instruction to the EX stage of the second instruction.
Note: no stalls (but additional hardware is required to detect hazards and forward data).

Other RAW Dependences to Consider
Dependences among non-adjacent instructions:
  add $1, $2, $3
  add $7, $8, $9
  sub $4, $5, $1
Dependences involving load instructions:
  add $1, $2, $3
  ld  $5, 0($7)
  sub $4, $5, $1
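
The comparator-and-mux idea can be sketched as follows (Python; the function, register names and pipeline-latch fields are my illustrative assumptions, not the slide's exact signal names).

def ex_operand(src_reg, regfile, ex_mem, mem_wb):
    """Select the value fed to the ALU for one source register.
    ex_mem / mem_wb model the downstream pipeline latches as (dest_reg, value)."""
    if ex_mem and ex_mem[0] == src_reg:     # forward from the instruction just ahead
        return ex_mem[1]
    if mem_wb and mem_wb[0] == src_reg:     # forward from two instructions ahead
        return mem_wb[1]
    return regfile[src_reg]                 # no hazard: use the register-file value

regfile = {"$1": 0, "$5": 7}
print(ex_operand("$1", regfile, ex_mem=("$1", 42), mem_wb=None))   # 42, bypassed value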

Forwarding Paths (ALU instructions)
(figure: forwarding paths a, b, c from the RD/ALU, ALU/MEM and MEM/WB staging buffers back to the ALU inputs)
For an instruction i that writes R1, followed by instructions i+1, i+2, i+3 that read R1:
- (i -> i+1): forwarding via path a
- (i -> i+2): forwarding via path b
- (i -> i+3): i writes R1 before i+3 reads R1 - write before read in the register file

Write-before-Read Register File
- Register-file designs with 2-phase clocks are common: write the RF on the first phase, read it on the second phase.
- Hence, in the same cycle: write $1, then read $1 - no bypass needed.
- If the register file reads before it writes, or is DFF-based, a bypass is needed.

Implementation of Forwarding
(figure: comparators on the ID/RD, RD/ALU, ALU/MEM and MEM/WB buffers control the muxes that select between the register-file outputs and the forwarded values)

Forwarding Paths (Load instructions)
(figure: load forwarding path(s) d, e from the D-cache output / MEM/WB buffer back to the ALU inputs)
For a load i: R1 <- MEM[...], followed by instructions i+1, i+2, i+3 that read R1:
- (i -> i+1): stall i+1
- (i -> i+2): forwarding via path d
- (i -> i+3): i writes R1 before i+3 reads R1
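
Load-use hazards need the extra stall because the loaded value only exists after the MEM stage. A minimal detection sketch (Python; the dictionary fields are illustrative assumptions, not real control signals):

def load_use_stall(instr_in_ex, instr_in_rd):
    """Stall the younger instruction if it reads the register that a load
    immediately ahead of it has not yet fetched from memory."""
    is_load = instr_in_ex["op"] == "lw"
    uses_dest = instr_in_ex["dest"] in instr_in_rd["srcs"]
    return is_load and uses_dest

older = {"op": "lw",  "dest": "$5", "srcs": ["$7"]}
newer = {"op": "sub", "dest": "$4", "srcs": ["$5", "$1"]}
print(load_use_stall(older, newer))   # True: insert one bubble, then forward from MEM/WB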

Implementation of Load Forwarding
(figure: the load-forwarding datapath - comparators on the staging buffers control the muxes feeding the ALU, the D-cache output is forwarded from MEM/WB, and a load-stall signal holds the IF, ID and RD stages while a bubble is inserted into ALU/MEM)

Control Flow Hazards
- Control flow instructions: branches, jumps, jals, returns.
- Can't fetch the next instruction until the branch outcome is known - too late for the next IF.

Control Flow Hazards - Important Pipeline Considerations
- Where is the branch target address (BTA) computed?
- For conditional branches, how/where is the branch outcome determined?
For our simple pipeline, assume:
- The BTA is computed in the EX stage, and the PC update is done during the MEM stage.
- The branch outcome is determined during the EX stage.

Control Dependence
One instruction affects which instruction executes next:
  sw  $4, 0($5)
  bne $2, $3, loop
  sub $6, $7, $8
Cycle:   1 2 3 4 5 6 7 8
sw       F D R X M W
bne        F D R X M W
sub          F D R X M W
(sub is fetched before the outcome of bne is known)

Control Dependence - Stall Until the Branch Outcome Is Known
Detect the dependence and stall:
  sw  $4, 0($5)
  bne $2, $3, loop
  sub $6, $7, $8
When the branch instruction is decoded, stall the pipeline until the branch outcome is known.
Cycle:    1 2 3 4 5 6 7 8 9 10 11 12
sw        F D R X M W
bne         F D R X M W
(next)        * * * * F D R  X  M  W    <- new fetch at the branch-outcome address
Note: 4 stall cycles (reducible to 3 with a minor pipeline redesign).

Control Dependence - Reducing to 3 Stall Cycles
With the minor redesign, the fetch at the branch-outcome address begins one cycle earlier:
Cycle:    1 2 3 4 5 6 7 8 9 10 11
sw        F D R X M W
bne         F D R X M W
(next)        * * * F D R X  M  W       <- 3 stall cycles

Control Flow Hazards - What To Do?
Always stall?
- Easy to implement, but performs poorly.
- Assume 1 out of every 5 instructions is a branch, and each branch introduces three stall cycles:
  CPI = 1 + (3 x .2) = 1.6 (lower bound)
  Branch penalty (average stall cycles per branch): BP = 3

Control Flow Hazards - What Else Could We Do?
Predict branch not taken:
- Continue to fetch instructions beyond the branch point into the pipeline.
- These instructions must be cancelled later if the branch prediction is incorrect.
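
The "always stall" bound above is just CPI = 1 + B x penalty. A one-line check (Python sketch, using the slide's numbers):

def cpi_with_branch_stalls(branch_fraction, stall_cycles):
    # Lower bound: every branch adds stall_cycles bubbles; nothing else stalls.
    return 1 + branch_fraction * stall_cycles

print(cpi_with_branch_stalls(0.20, 3))   # 1.6, matching the slide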

Branch Frequencies
(figure: branch frequency data from Hennessy & Patterson, Computer Architecture: A Quantitative Approach, 2nd Ed.)

Branching Behavior
(figure: taken/not-taken branch behavior data from Hennessy & Patterson, Computer Architecture: A Quantitative Approach, 2nd Ed.)

Control Dependence - Static Not-Taken Branch Prediction (branch is taken)
Detect the dependence and cancel/refetch if necessary:
  sw  $4, 0($5)      // I
  bne $2, $3, loop   // I+1
  sub $6, $7, $8     // I+2
  add $1, $2, $3     // I+3
When the branch outcome (branch taken) is known, cancel the execution of the instruction(s) in the pipeline and refetch at the branch target address.
Cycle:     1 2 3 4 5 6 7 8 9 10 11
sw         F D R X M W
bne          F D R X M W
sub            F D R (cancelled)
I+3              F D (cancelled)
I+4                F (cancelled)
(target)             F D R X M  W       <- new fetch at the branch-outcome address

Control Dependence - Static Not-Taken Branch Prediction (branch is not taken)
No stalls if the assumption (branch not taken) is correct:
  sw  $4, 0($5)
  bne $2, $3, loop
  sub $6, $7, $8
  add $1, $2, $3
If the branch-outcome guess was correct, continue processing the instructions following the branch instruction.
Cycle:   1 2 3 4 5 6 7 8 9
sw       F D R X M W
bne        F D R X M W
sub          F D R X M W
add            F D R X M W

Performance of Static Not-Taken Branch Prediction
Let T denote the fraction of executed branches that are taken:
  Branch penalty = 3T + 0(1-T) = 3T
Let B denote the fraction of executed instructions that are branches:
  CPI (lower bound) = 1 + B(3T)
For B = .2, T = .667: branch penalty = 2.0, CPI = 1 + .2(2.0) = 1.4
For B = .2, T = .333: branch penalty = 1.0, CPI = 1 + .2(1.0) = 1.2

Other Static Branch Prediction Strategies (not covered in the textbook)
- If branches are more likely to be taken than not taken, a static taken prediction strategy would be preferable to a not-taken strategy.
- But the branch target address must be computed before the instruction fetch at the BTA can commence.
- Partial solution: move BTA generation to the ID stage.

Control Dependence - Static Taken Branch Prediction (branch is taken)
Detect the dependence and cancel/refetch if necessary:
  sw  $4, 0($5)
  bne $2, $3, loop
  sub $6, $7, $8
  add $1, $2, $3
When the branch is predicted taken, the sequentially fetched instruction in the pipeline is cancelled and fetching begins at the branch target address.
sw                 F D R X M W
bne                  F D R X M W
sub                    F (cancelled)
1st instr at BTA         F D R X M W
2nd instr at BTA           F D R X M W
<- begin fetch at the branch target address

Control Dependence - Static Taken Branch Prediction (branch is not taken)
Detect the dependence and cancel/refetch if necessary:
  sw  $4, 0($5)      // I
  bne $2, $3, loop   // I+1
  sub $6, $7, $8     // I+2
  add $1, $2, $3     // I+3
When the branch outcome (branch not taken) is known, cancel the execution of the instruction(s) fetched from the branch target and refetch at the next not-taken address (I+2).
sw                      F D R X M W
bne                       F D R X M W
1st instr at BTA            F D (cancelled)
2nd instr at BTA              F (cancelled)
sub (refetched at I+2)          F D R X M W

Performance of Static Taken Branch Prediction
Let T denote the fraction of executed branches that are taken:
  Branch penalty = 1(T) + 2(1-T) = 2 - T
Let B denote the fraction of executed instructions that are branches:
  CPI (lower bound) = 1 + B(2-T)
For B = .2, T = .667: branch penalty = 1.33, CPI = 1 + .2(1.33) = 1.27
For B = .2, T = .333: branch penalty = 1.67, CPI = 1 + .2(1.67) = 1.33

Performance Comparison: Static Not-Taken versus Static Taken Prediction
                   Stall (no static prediction)   Static not-taken             Static taken
B = .2, T = .667   penalty = 3.0, CPI = 1.6       penalty = 2.0, CPI = 1.4     penalty = 1.33, CPI = 1.27
B = .2, T = .333   penalty = 3.0, CPI = 1.6       penalty = 1.0, CPI = 1.2     penalty = 1.67, CPI = 1.33
Static taken prediction outperforms static not-taken prediction when more than 50% of branches are taken.

Control Flow Hazards - Continued
Another option: delayed branches.
- The processor always executes the instruction(s) following a branch until the branch outcome is determined; these slots are referred to as the branch shadow.
- These instructions are considered to logically occur prior to the branch.
- The compiler is responsible for rearranging the program to place useful instructions into the branch shadow; if it can't put a useful instruction there, it must insert a nop.
- Stalls are eliminated only if useful instructions (not nops) can be placed in the shadow. This is often difficult.
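
The penalty and CPI formulas for the two static strategies tabulate directly. A small sketch (Python) that reproduces the comparison above, with the always-stall case included as a baseline:

def branch_penalty(taken_frac, strategy):
    if strategy == "not-taken":
        return 3 * taken_frac                          # 3-cycle penalty only when taken
    if strategy == "taken":
        return 1 * taken_frac + 2 * (1 - taken_frac)   # 1 cycle if taken, 2 if not
    return 3                                           # always stall

B = 0.2
for T in (0.667, 0.333):
    for strat in ("stall", "not-taken", "taken"):
        bp = branch_penalty(T, strat)
        print(f"T={T:.3f} {strat:>9}: penalty={bp:.2f}, CPI={1 + B * bp:.2f}")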

Control Dependence - Delayed Branching
Place instructions or nops in the branch shadow:
  bne $2, $3, loop
  (first shadow instruction here)
  (second shadow instruction here)
  (third shadow instruction here)
  sub $6, $7, $8

bne                      F D R X M W
1st shadow instruction     F D R X M W
2nd shadow instruction       F D R X M W
3rd shadow instruction         F D R X M W
First instruction logically      F D R X M W
following the branch
The branch outcome is known before the first instruction logically following the branch is fetched.

Exceptions and Pipelining
Consider processor exceptions such as arithmetic exceptions (overflow, divide-by-zero, etc.), interrupts, etc. These are essentially surprise branches. The pipeline state must be saved in a clean manner in order to return from (recover from) exceptions.
(figure: instruction stream I1 I2 I3 I4 I5 I6 I7 I8 ...; the exception occurs during the execution of one of these instructions, the following instructions are already in the pipeline (partially executed) at that point, and control transfers to the exception-handling routine)

Exceptions (continued)
Even worse: multiple exceptions could occur in one cycle:
- I/O interrupt (MEM)
- User trap to the OS (EX)
- Illegal instruction (ID)
- Arithmetic overflow
- Hardware error
- Etc.
Interrupt priorities must be supported.

MIPS R2000/R3000 Pipeline
Stage   Phase   Function performed
IF      φ1      Translate the virtual instruction address using the TLB
        φ2      Access the I-cache
RD      φ1      Return the instruction from the I-cache; check tags and parity
        φ2      Read the RF; if branch, generate the target (separate adder)
ALU     φ1      Start the ALU op; if branch, check the condition
        φ2      Finish the ALU op; if ld/st, translate the address
MEM     φ1      Access the D-cache
        φ2      Return data from the D-cache; check tags and parity
WB      φ1      Write the RF

IBM RISC Experience [Agerwala and Cocke 1987]
Internal IBM study: what are the limits of a scalar pipeline?
Memory bandwidth:
- Fetch 1 instruction per cycle from the I-cache
- 40% of instructions are load/store (D-cache)
Code characteristics (dynamic):
- Loads: 25%
- Stores: 15%
- ALU/RR: 40%
- Branches: 20%, of which 1/3 unconditional (always taken), 1/3 conditional taken, 1/3 conditional not taken

IBM Experience - Cache Performance
- Assume a 100% hit ratio (upper bound)
- Cache latency: I = D = 1 cycle default
Load and branch scheduling:
- Loads: 25% cannot be scheduled (delay slot empty); 65% can be moved back 1 or 2 instructions; 10% can be moved back 1 instruction
- Branches: unconditional - 100% schedulable (fill one delay slot); conditional - 50% schedulable (fill one delay slot)

CPI Optimizations
Goal and impediments: CPI = 1, prevented by pipeline stalls.
No cache bypass of the RF, no load/branch scheduling:
- Load penalty: 2 cycles: 0.25 x 2 = 0.5 CPI
- Branch penalty: 2 cycles: 0.2 x 2/3 x 2 = 0.27 CPI
- Total CPI: 1 + 0.5 + 0.27 = 1.77 CPI
Bypass, no load/branch scheduling:
- Load penalty: 1 cycle: 0.25 x 1 = 0.25 CPI
- Total CPI: 1 + 0.25 + 0.27 = 1.52 CPI
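
The CPI bookkeeping above is easy to reproduce. A sketch (Python) using only the instruction mix and penalties quoted in the study:

loads, stores, alu, branches = 0.25, 0.15, 0.40, 0.20
taken_frac = 2.0 / 3.0     # unconditional branches + taken conditionals

# No bypass, no scheduling: 2-cycle load penalty, 2-cycle penalty for taken branches.
cpi_base = 1 + loads * 2 + branches * taken_frac * 2
# Add a cache-to-RF bypass: the load penalty drops to 1 cycle.
cpi_bypass = 1 + loads * 1 + branches * taken_frac * 2

print(f"no bypass: {cpi_base:.2f} CPI   with bypass: {cpi_bypass:.2f} CPI")   # 1.77, 1.52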

More CPI Optimizations
Bypass, plus scheduling of loads and branches:
- Load penalty: 65% + 10% = 75% moved back, no penalty; 25% incur a 1-cycle penalty:
  0.25 x 0.25 x 1 = 0.0625 CPI
- Branch penalty:
  1/3 unconditional, 100% schedulable => 1 cycle
  1/3 conditional not-taken => no penalty (predict not-taken)
  1/3 conditional taken, 50% schedulable => 1 cycle
  1/3 conditional taken, 50% unschedulable => 2 cycles
  0.2 x [1/3 x 1 + 1/3 x 0.5 x 1 + 1/3 x 0.5 x 2] = 0.167 CPI
- Total CPI: 1 + 0.0625 + 0.167 = 1.23 CPI

Simplify Branches
Assume 90% of branches can be PC-relative:
- No register indirect, no register access
- Separate adder (like the MIPS R3000)
- Branch penalty reduced:
  PC-relative   Schedulable   Penalty
  Yes (90%)     Yes (50%)     0 cycles
  Yes (90%)     No (50%)      1 cycle
  No (10%)      Yes (50%)     1 cycle
  No (10%)      No (50%)      2 cycles
Total CPI: 1.15 CPI - about 15% overhead from program dependences.

Review
- Pipelining overview
- Control
- Data hazards: stalls; forwarding or bypassing
- Control flow hazards: branch prediction
- Exceptions
- Real pipelines
