Pipelined Processors. Ideal Pipelining. Example: FP Multiplier. 55:132/22C:160, Spring. Jon Kuhl.
Pipelined Design

Motivation: increase processor throughput with a modest increase in hardware.

Bandwidth or Throughput = Performance
- Bandwidth (BW) = number of tasks per unit time.
- For a system that operates on one task at a time: BW = 1/latency, where latency is the time required to complete a task.
- BW can be increased by pipelining if many operands exist that need the same operation, i.e. many repetitions of the same task are to be performed. The latency of each task remains the same, or may even increase slightly.

Ideal Pipelining
- Combinational logic with n gate delays: BW ~ 1/n.
- Split into two stages of n/2 gate delays each: BW ~ 2/n.
- Split into three stages of n/3 gate delays each: BW ~ 3/n.
Bandwidth increases linearly with pipeline depth; latency increases by the latch delays.

Example: FP Multiplier

Format: exponent in excess-128 (8 bits); mantissa is a sign-magnitude fraction with hidden bit (57 bits total).

Algorithm (sign, exponent, mantissa):
1. Check if either operand is ZERO.
2. ADD the two characteristics (the physical bit patterns of the exponents) and correct for the excess-128 bias, i.e. e1 + (e2 - 128).
3. Perform fixed-point MULTIPLICATION of the mantissas.
4. NORMALIZE the product of the mantissas; this may require one left shift and a decrement of the exponent.
5. ROUND the result by adding 1 to the first guard bit; if the mantissa overflows, shift right one bit and increment the exponent.
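The five steps can be sketched in Python for a toy version of the format. The 8-bit fraction width, the all-zero encoding of zero, and the tuple representation are illustrative assumptions of this sketch, not the slides' 57-bit format:

```python
BIAS = 128       # excess-128 exponent encoding
FRAC_BITS = 8    # toy fraction width; the slides' format is much wider

def fp_mul(a, b):
    """Multiply two toy FP operands following the slides' five steps.
    An operand is (sign, characteristic, fraction); its mantissa is the
    normalized fraction 0.1fff... in [0.5, 1), with the leading 1 hidden."""
    (sa, ea, fa), (sb, eb, fb) = a, b
    # Step 1: check for a ZERO operand (encoded here as all-zero fields).
    if (ea, fa) == (0, 0) or (eb, fb) == (0, 0):
        return (0, 0, 0)
    sign = sa ^ sb
    # Step 2: add the characteristics and correct for the bias: e1 + (e2 - 128).
    e = ea + (eb - BIAS)
    # Step 3: fixed-point multiply of the mantissas with the hidden bit restored.
    ma = (1 << FRAC_BITS) | fa          # value ma / 2**(FRAC_BITS+1) in [0.5, 1)
    mb = (1 << FRAC_BITS) | fb
    prod = ma * mb                      # 2*(FRAC_BITS+1) fractional bits
    # Step 4: NORMALIZE -- the product lies in [0.25, 1); if it is below 0.5,
    # one left shift and a decrement of the exponent restore normal form.
    if not (prod >> (2 * FRAC_BITS + 1)) & 1:
        prod <<= 1
        e -= 1
    # Step 5: ROUND by adding 1 to the first guard bit; if the mantissa
    # overflows, shift right one bit and increment the exponent.
    prod += 1 << FRAC_BITS
    if prod >> (2 * FRAC_BITS + 2):
        prod >>= 1
        e += 1
    frac = (prod >> (FRAC_BITS + 1)) & ((1 << FRAC_BITS) - 1)
    return (sign, e, frac)
```

For example, multiplying 0.75 by 0.75 (both encoded as (0, 128, 128)) exercises steps 2-5 and yields 0.5625; multiplying 0.5 by 0.5 takes the normalization shift.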
Nonpipelined Implementation

Section delays (based on very old IC technology):
- Partial product (P.P.) generation: 125 ns
- P.P. reduction: 150 ns
- Final reduction: 55 ns
- Normalization: 20 ns
- Rounding: 50 ns
(The exponent section and the input/output registers contribute additional chips.)
Unpipelined clock period = 400 ns, i.e. 2.5 MFLOPS.

Pipelined Implementation

Three-stage pipelining: stage 1 = P.P. generation (125 ns); stage 2 = P.P. reduction (150 ns); stage 3 = final reduction, normalize, and round (125 ns).
- Longest delay path within a stage: 150 ns
- Clock-to-register-output delay: 17 ns
- Register set-up time: 5 ns
- Minimum clock period: 150 + 17 + 5 = 172 ns
- ICs added: 82 chips of edge-triggered registers

Original clock period: 400 ns (2.5 MFLOPS); new minimum clock period: 172 ns (5.8 MFLOPS).
Original IC count: 175 chips; new total: 257 chips.
Less than a 50% increase in hardware more than doubles the throughput!
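A short calculation reproduces the throughput claim, assuming section delays of 125/150/55/20/50 ns and register overheads of 17 ns (clock-to-output) plus 5 ns (setup) as reconstructed from the slide:

```python
def mflops(clock_period_ns):
    """Throughput when one operation is accepted per clock: 1000/period."""
    return 1000.0 / clock_period_ns

# Nonpipelined: the five combinational sections operate in series.
delays = {"pp_generation": 125, "pp_reduction": 150,
          "final_reduction": 55, "normalize": 20, "round": 50}
unpipelined_period = sum(delays.values())       # 400 ns

# Three-stage pipeline: the clock is set by the slowest stage plus the
# register overheads (clock-to-output delay + setup time).
stage_delays = [delays["pp_generation"],
                delays["pp_reduction"],
                delays["final_reduction"] + delays["normalize"] + delays["round"]]
clock_to_output, setup_time = 17, 5
pipelined_period = max(stage_delays) + clock_to_output + setup_time   # 172 ns

speedup = unpipelined_period / pipelined_period   # more than 2x
```

Note that the speedup (about 2.3x) is less than the ideal 3x for a three-stage pipeline: the stages are not perfectly balanced, and the latch overhead is charged once per clock.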
Pipelining Idealisms
- Uniform subcomputations: can pipeline into stages with equal delay (balance the pipeline stages).
- Identical computations: can fill the pipeline with identical work (unify instruction types; example later).
- Independent computations: no relationships between work units (minimize pipeline stalls).
Are these practical? No, but we can get close enough to achieve significant speedup.

Instruction Pipelining

The computation to be pipelined:
- Instruction Fetch (IF)
- Instruction Decode (ID)
- Operand(s) Fetch (OF)
- Instruction Execution (EX)
- Operand Store (OS)
- Update Program Counter (PC)

Generic Instruction Pipeline, based on the obvious subcomputations:
1. Instruction Fetch
2. Instruction Decode
3. Operand Fetch
4. Instruction Execute
5. Operand Store

Granularity of Pipeline Stages
Stages can be subdivided further (e.g. splitting IF or EX in two) or merged (e.g. fusing IF with ID, or EX with OS), trading clock period against:
- the logic needed for each pipeline stage,
- the register file ports needed to support all the stages,
- the memory access ports needed to support all the stages.
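The fill/drain arithmetic behind these idealisms can be sketched as follows; this is an illustrative model, not a formula from the slides:

```python
def pipeline_cycles(num_tasks, num_stages):
    """Cycles for num_tasks to flow through an ideal num_stages-stage
    pipeline: num_stages cycles to fill it, then one completion per cycle."""
    return num_stages + num_tasks - 1

def relative_bandwidth(num_tasks, num_stages):
    """Completions per cycle; approaches 1.0 -- i.e. num_stages times the
    unpipelined rate -- only while the pipeline can be kept full."""
    return num_tasks / pipeline_cycles(num_tasks, num_stages)
```

With many identical, independent tasks, relative_bandwidth(10000, 5) is already above 0.999; unbalanced stages, non-identical work, and stalls are exactly what pull a real machine below this ideal.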
Example Pipelines: MIPS R2000/R3000 and AMDAHL 470V/7

The MIPS R2000/R3000 uses a shallow 5-stage pipeline (IF, RD, ALU, MEM, WB); the Amdahl 470V/7 uses a much deeper 12-stage pipeline.

Amdahl 470V/7 pipeline: actions performed per stage
- S1: Compute instruction address; request the instruction at the next sequential address from the storage control unit (S-unit).
- S2: Start buffer; initiate the cache read of the instruction.
- S3: Read buffer; read the instruction from the cache into the I-unit.
- S4: Decode instruction; decode the instruction opcode.
- S5: Read GPRs; read the general-purpose registers (GPRs) used as address registers.
- S6: Compute operand address; compute the address of the current memory operand.
- S7: Start buffer; initiate the cache read of the memory operand.
- S8: Read buffer; read the operand from the cache into the I-unit; also read register operands.
- S9: Execute; pass data to the E-unit and begin execution.
- S10: Execute; complete instruction execution.
- S11: Check result; perform a code-based error check on the result.
- S12: Write result; store the result in a CPU register.

Instruction Pipeline Design
- UNIFYING INSTRUCTION TYPES: classification of instruction types; coalescing resource requirements; instruction pipeline implementation.
- MINIMIZING PIPELINE STALLS: program dependences and pipeline hazards.

Unifying Instruction Types

Requirements of different instruction types (assume a typical modern RISC ISA).
ALU instructions (register data flow):
- Integer: access the integer register file; perform the operation; write back to the register file. Typically 1-cycle (execution) latency.
- Floating-point: access the FP register file; perform the operation; write back to the register file. Typically more than 1 cycle of latency, and variable across different ops.
Unifying Instruction Types (continued)

Load/store instructions (memory data flow):
- Load: access the register file; generate the effective address; access (read) memory (D-cache); write back to the register file. Typically 2+ cycle latency.
- Store: access the register file; generate the effective address; access (write) memory (D-cache). Typically 2+ cycle latency.

Branch instructions (instruction flow):
- Jump: access the register file; generate the effective address; update the PC. Typically no penalty cycle(s).
- Conditional branch: access the register file; generate the effective address; evaluate the condition code; update the PC (if the condition is true). Typically incurs penalty cycles.

Instruction Unification Procedure
1. Analyze the sequence of register transfers required by each instruction type.
2. Find commonality across instruction types and merge them to share the same pipeline stage.
3. If there is flexibility, shift or reorder some register transfers to facilitate further merging.

ALU Instruction Specification (generic subcomputations, part 1)
- IF: fetch instruction (access I-cache) -- both integer and floating-point.
- ID: decode instruction -- both.
- OF: access the register file / access the FP register file.
- EX: perform the operation / perform the FP operation.
- OS: write back to the register file / write back to the FP register file.
Memory Instruction Specification (generic subcomputations, part 2)

Load instruction:
- IF: fetch instruction (access I-cache); ID: decode instruction.
- OF: access the register file (base address); generate the effective address (base + offset); access (read) the memory location (D-cache).
- EX: --
- OS: write back to the register file.
Store instruction:
- IF: fetch instruction (access I-cache); ID: decode instruction.
- OF: access the register file (register operand and base address).
- EX: --
- OS: generate the effective address (base + offset); access (write) the memory location (D-cache).

Branch Instruction Specification (generic subcomputations, part 3)

Jump (unconditional) instruction:
- IF: fetch instruction (access I-cache); ID: decode instruction.
- OF: access the register file (base address); generate the effective address (base + offset).
- EX: --
- OS: update the program counter with the target address.
Conditional branch instruction:
- IF: fetch instruction (access I-cache); ID: decode instruction.
- OF: access the register file (base address); generate the effective address (base + offset).
- EX: evaluate the branch condition.
- OS: if the condition is true, update the program counter with the target address.

Coalescing Resource Requirements: The Unified Pipeline

All four instruction types (ALU, LOAD, STORE, BRANCH) are merged into one 6-stage pipeline: IF, ID, RD, ALU, MEM, WB.
- IF: read the instruction from the I-cache; PC++ (all types).
- ID: decode (all types).
- RD: ALU: read registers (source operands); LOAD: read register (memory base address); STORE: read registers (memory base address and store data); BRANCH: read register (branch target base address).
- ALU: ALU: perform the operation; LOAD/STORE: compute the memory address; BRANCH: compute the branch target address.
- MEM: LOAD: memory read; STORE: memory write; BRANCH: PC update.
- WB: ALU: write the result to the destination register; LOAD: write the data to the destination register.
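The coalesced stage assignments can be captured in a small data structure; the stage names and wording follow the unified pipeline, while the helper function is illustrative:

```python
# Per-stage work of the unified 6-stage pipeline (IF-ID-RD-ALU-MEM-WB),
# one entry per instruction type; None marks an idle stage.
UNIFIED_PIPELINE = {
    "ALU":    {"IF": "fetch; PC++", "ID": "decode",
               "RD": "read source regs", "ALU": "perform operation",
               "MEM": None, "WB": "write result reg"},
    "LOAD":   {"IF": "fetch; PC++", "ID": "decode",
               "RD": "read base-addr reg", "ALU": "compute mem address",
               "MEM": "memory read", "WB": "write dest reg"},
    "STORE":  {"IF": "fetch; PC++", "ID": "decode",
               "RD": "read base-addr + store-data regs",
               "ALU": "compute mem address", "MEM": "memory write", "WB": None},
    "BRANCH": {"IF": "fetch; PC++", "ID": "decode",
               "RD": "read target base-addr reg",
               "ALU": "compute branch target", "MEM": "PC update", "WB": None},
}

def stages_used(itype):
    """Stages in which an instruction type does real work."""
    return [stage for stage, work in UNIFIED_PIPELINE[itype].items() if work]
```

Unification shows up directly in the table: every type shares IF, ID, and RD, the ALU stage does either the operation or an address computation, and only loads and ALU instructions reach WB.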
The 6-stage TYP Pipeline

Interface to the memory subsystem: the IF stage sends the PC as the address to the I-cache and receives the instruction; the MEM stage sends the computed address to the D-cache and transfers data in (store) or out (load) under a read/write control signal.

Register file interface: two read ports serve the RD stage (read addresses and read data for the source operands); one write port serves the WB stage (write address and write data for the destination register).

Another view of the pipeline: the stages are separated by buffers (staging logic):
- IF/ID: Instruction Fetch / Instruction Decode
- ID/RD: Instruction Decode / Register Read
- RD/ALU: Register Read / ALU Operation
- ALU/MEM: ALU Operation / Memory Access
- MEM/WB: Memory Access / Register Write-Back
Reconciling the Views

The datapath view and the staging-buffer view describe the same pipeline: each signal in the datapath is affiliated with one of the staging buffers (IF/ID, ID/RD, RD/ALU, ALU/MEM, MEM/WB). For example, register reads belong to ID/RD, the store data travels through ALU/MEM, and register writes (including load data) belong to MEM/WB.

Program Dependences

The implied sequential precedences between instructions (i1 before i2 before i3, ...) are an overspecification: they are sufficient, but not necessary, to ensure program correctness. A true dependence between two instructions may involve only one subcomputation of each instruction.

Program Data Dependences
- True dependence (RAW): j cannot execute until i produces its result: R(i) ∩ D(j) ≠ ∅.
- Anti-dependence (WAR): j cannot write its result until i has read its sources: D(i) ∩ R(j) ≠ ∅.
- Output dependence (WAW): j cannot write its result until i has written its result: R(i) ∩ R(j) ≠ ∅.
Here D(i) denotes the set of registers instruction i reads (its domain) and R(i) the set it writes (its range).

Control Dependences
- Conditional branches: the branch must execute to determine which instruction to fetch next.
- Instructions following a conditional branch are control dependent on the branch instruction.
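The three dependence tests follow directly from the set definitions above; the register names are just the slides' running example:

```python
def classify_dependences(i, j):
    """Classify the data dependences of a later instruction j on an earlier
    instruction i. Each instruction is a pair (reads, writes) of register
    sets: D() is the domain (reads), R() is the range (writes)."""
    (Di, Ri), (Dj, Rj) = i, j
    deps = []
    if Ri & Dj:
        deps.append("RAW")   # true dependence: j reads what i writes
    if Di & Rj:
        deps.append("WAR")   # anti-dependence: j overwrites what i reads
    if Ri & Rj:
        deps.append("WAW")   # output dependence: both write the same register
    return deps

# add $1, $2, $3  followed by  sub $4, $5, $1  -- a true (RAW) dependence on $1.
add_i = ({"$2", "$3"}, {"$1"})
sub_j = ({"$5", "$1"}, {"$4"})
```

A pipeline only has to enforce the dependences these tests report, not the full sequential order, which is exactly why the sequential precedences are an overspecification.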
Example (quicksort/MIPS)

  # for (; (j < high) && (array[j] < array[low]); ++j);
  # $10 = j, $9 = high, $6 = array, $8 = low
        bge   done, $10, $9
        mul   $15, $10, 4
        addu  $24, $6, $15
        lw    $25, 0($24)
        mul   $13, $8, 4
        addu  $14, $6, $13
        lw    $15, 0($14)
        bge   done, $25, $15
  cont: addu  $10, $10, 1
        ...
  done: addu  $11, $11, -1

Resolution of Pipeline Hazards
- Pipeline hazards are potential violations of program dependences; we must ensure that program dependences are not violated.
- Hazard resolution can be static (the compiler/programmer guarantees correctness) or dynamic (the hardware performs checks at runtime).
- Pipeline interlock: the hardware mechanism for dynamic hazard resolution; it must detect and enforce dependences at runtime.

Pipeline Hazards: Necessary Conditions
- WAR: the write stage is earlier than the read stage. Is this possible in IF-ID-RD-EX-MEM-WB? (No: reads happen in RD, which precedes the write in WB.)
- WAW: one write stage is earlier than another write stage. Is this possible in IF-ID-RD-EX-MEM-WB? (No: there is a single, in-order write stage, WB.)
- RAW: the read stage is earlier than the write stage. Is this possible in IF-ID-RD-EX-MEM-WB? (Yes: RD precedes WB.)
If the necessary conditions are not met, there is no need to resolve that hazard. Check for both register and memory hazards.

RAW Data Dependence
An earlier instruction produces a value used by a later instruction:
  add $1, $2, $3
  sub $4, $5, $1
In the cycle-by-cycle diagram (stages F D R X M W), sub would reach its R stage and read $1 before add has written it in its W stage.
RAW Data Dependence -- Stall

Detect the dependence and stall:
  add $1, $2, $3
  sub $4, $5, $1
The sub is held until add has written $1 back to the register file. Note: 3 stall cycles (OUCH!!!).

Data Hazard Mitigation
A better response: forwarding, also called bypassing.
- Comparators ensure a register is read after it is written.
- Instead of stalling until the write occurs, a mux selects the forwarded value rather than the register-file value.
- The mux is controlled by hazard detection logic.

RAW Data Dependence with Data Forwarding Paths
Detect the dependence and forward the data directly to the next instruction:
  add $1, $2, $3
  sub $4, $5, $1
The RAW dependence is detected at the decode stage, and the data is forwarded directly from the EX stage of the first instruction to the EX stage of the second instruction. Note: no stalls, but this requires additional hardware to detect hazards and forward data.

Other RAW Dependences to Consider
Dependences among non-adjacent instructions:
  add $1, $2, $3
  add $7, $8, $9
  sub $4, $5, $1
Dependences involving load instructions:
  add $1, $2, $3
  ld  $5, 0($7)
  sub $4, $5, $1
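The stall counts quoted in these diagrams can be reproduced with a small timing model. The stage numbering (IF=1 through WB=6, reads in RD, writes in WB) is an assumption of this sketch:

```python
RD_STAGE, WB_STAGE = 3, 6   # reads happen in RD, writes in WB

def alu_raw_stalls(dist, forwarding, write_before_read=False):
    """Stall cycles for a consumer `dist` instructions after an ALU
    producer, relative to the producer's pipeline timing."""
    if forwarding:
        return 0                        # result bypassed straight to EX
    read_cycle = RD_STAGE + dist        # cycle the consumer would reach RD
    # Without write-before-read, the read must come strictly after the write.
    slack = 0 if write_before_read else 1
    return max(0, WB_STAGE + slack - read_cycle)

def load_use_stalls(dist):
    """Even with forwarding, a load's data exists only after MEM, so an
    immediately dependent instruction must still stall one cycle."""
    return max(0, 2 - dist)
```

This matches the diagrams: an adjacent ALU consumer pays 3 stalls without forwarding and 0 with it, and a load followed immediately by its consumer pays exactly 1 stall even with forwarding.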
Forwarding Paths (ALU instructions)

Three forwarding distances matter for an ALU instruction i that writes R1:
- (i -> i+1): forwarding via path a, from the ALU output (the ALU/MEM buffer) back to the ALU input.
- (i -> i+2): forwarding via path b, from the MEM/WB buffer to the ALU input.
- (i -> i+3): no bypass needed -- i writes R1 before i+3 reads R1 (write before read in the register file).

Write-Before-Read Register File
Two-phase clocks are common in register file design: write the RF on the first phase and read it on the second phase. Hence, in the same cycle, "write $1" then "read $1" works with no bypass. If the RF reads before it writes, or is DFF-based, an additional bypass path (c) is needed.

Implementation of Forwarding
Comparators on the register specifiers held in the ID/RD and later staging buffers detect a match between a source register and an in-flight destination register; muxes in front of the ALU select among the register-file value and the forwarded values.

Forwarding Paths (load instructions)
For a load i that writes R1 (R1 := MEM[...]), the data is not available until the end of the MEM stage:
- (i -> i+1): stall i+1 for one cycle, then forward the load data.
- (i -> i+2): forwarding via path d (from the MEM/WB buffer).
- (i -> i+3): no bypass needed -- i writes R1 before i+3 reads R1.
Implementation of Load Forwarding

As with ALU forwarding, comparators on the ID/RD and later staging buffers detect a match against the destination of an in-flight load, hold the dependent instruction (and the IF, ID, and RD stages) for one cycle when necessary, and steer muxes so the D-cache data is forwarded from the MEM/WB buffer.

Control Flow Hazards
- Control flow instructions: branches, jumps, jals, returns.
- The fetch stage can't fetch the correct next instruction until the branch outcome is known -- too late for the next cycle.

Important pipeline considerations:
- Where is the branch target address (BTA) computed?
- For conditional branches, how and where is the branch outcome determined?
For our simple pipeline, assume: the BTA is computed in the EX stage and the PC update is done during the MEM stage; the branch outcome is determined during the EX stage.

Control Dependence
One instruction affects which instruction executes next:
  sw  $4, 0($5)
  bne $1, $3, loop
  sub $6, $7, $8
In the pipeline diagram, sub is fetched before the outcome of bne is known.
Control Dependence -- stall until the branch outcome is known

Detect the dependence and stall:
  sw  $4, 0($5)
  bne $1, $3, loop
  sub $6, $7, $8
When the branch instruction is decoded, stall the pipeline until the branch outcome is known, then fetch at the branch outcome address. Note: 4 stall cycles, reducible to 3 with a minor pipeline redesign (starting the new fetch in the same cycle the outcome becomes known).

Control Flow Hazards: What To Do?
Always stall?
- Easy to implement, but performs poorly.
- Assume 1 out of every 5 instructions is a branch, and each branch introduces three stall cycles:
    CPI = 1 + (3 x .2) = 1.6 (lower bound)
    Branch penalty (average stall cycles per branch): BP = 3

What else could we do? Predict the branch not taken:
- Continue to fetch instructions beyond the branch point into the pipeline.
- These instructions must be cancelled later if the branch prediction turns out to be incorrect.
Branch Frequencies and Branching Behavior
(Data from Hennessy & Patterson, Computer Architecture: A Quantitative Approach, 2nd Ed.)

Control Dependence -- static not-taken branch prediction

Detect the dependence and cancel/refetch if necessary:
  sw  $4, 0($5)    // I
  bne $1, $3, loop // I+1
  sub $6, $7, $8   // I+2
  add $1, $2, $3   // I+3
When the branch outcome (branch taken) becomes known, cancel the execution of the instructions already in the pipeline (I+2, I+3, ...) and refetch at the branch target address: a 3-cycle penalty.

If the guess (branch not taken) was correct, simply continue processing the instructions following the branch: no stalls.
Performance of static not-taken branch prediction

Let T denote the fraction of executed branches that are taken:
  Branch penalty = 3T + 0(1-T) = 3T
Let B denote the fraction of executed instructions that are branches:
  CPI (lower bound) = 1 + B(3T)
For B = .2, T = .667: branch penalty = 2.0; CPI = 1 + .2(2.0) = 1.4
For B = .2, T = .333: branch penalty = 1.0; CPI = 1 + .2(1.0) = 1.2

Other static branch prediction strategies (not covered in the textbook)
If branches are more likely to be taken than not taken, a static taken prediction strategy would be preferable to a not-taken strategy. But the branch target address must be computed before the instruction fetch at the BTA can commence. Partial solution: move the BTA generation to the ID stage.

Control Dependence -- static taken branch prediction
  sw  $4, 0($5)    // I
  bne $1, $3, loop // I+1
  sub $6, $7, $8   // I+2
  add $1, $2, $3   // I+3
With the BTA generated in the ID stage, fetching begins at the branch target address one cycle after the branch is fetched: a 1-cycle penalty when the branch is taken. When the outcome (branch not taken) becomes known, the instructions fetched from the target path are cancelled and fetching resumes at the next not-taken address (I+2): a 2-cycle penalty.
Performance of static taken branch prediction

Let T denote the fraction of executed branches that are taken:
  Branch penalty = 1T + 2(1-T) = 2 - T
Let B denote the fraction of executed instructions that are branches:
  CPI (lower bound) = 1 + B(2-T)
For B = .2, T = .667: branch penalty = 1.33; CPI = 1 + .2(1.33) = 1.27
For B = .2, T = .333: branch penalty = 1.67; CPI = 1 + .2(1.67) = 1.33

Performance comparison: static not-taken versus static taken prediction

B = .2, T = .667:
  Stall (no static prediction): branch penalty = 3.0;  CPI (lower bound) = 1.6
  Static not-taken:             branch penalty = 2.0;  CPI (lower bound) = 1.4
  Static taken:                 branch penalty = 1.33; CPI (lower bound) = 1.27
B = .2, T = .333:
  Stall (no static prediction): branch penalty = 3.0;  CPI (lower bound) = 1.6
  Static not-taken:             branch penalty = 1.0;  CPI (lower bound) = 1.2
  Static taken:                 branch penalty = 1.67; CPI (lower bound) = 1.33
Static taken prediction outperforms static not-taken prediction when more than 50% of branches are taken.

Control Flow Hazards -- continued

Another option: delayed branches.
- The processor always executes the instruction(s) following a branch while the branch outcome is determined; these slots are referred to as the branch shadow.
- These instructions are considered to logically occur prior to the branch.
- The compiler is responsible for rearranging the program to place useful instructions into the branch shadow; if it can't put a useful instruction there, it must insert a nop.
- Stalls are eliminated only if useful instructions (not nops) can be placed in the shadow. This is often difficult.
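The comparison table follows directly from the two penalty formulas; a few lines of Python regenerate every entry:

```python
def branch_penalty(T, strategy):
    """Average stall cycles per branch for the three static policies above,
    with T the fraction of executed branches that are taken."""
    return {"stall": 3.0,
            "not_taken": 3.0 * T,                  # 3 cycles, only when taken
            "taken": 1.0 * T + 2.0 * (1.0 - T),    # 1 if taken, 2 if not
            }[strategy]

def cpi_lower_bound(B, T, strategy):
    """CPI = 1 + B * branch_penalty, with B the branch fraction."""
    return 1.0 + B * branch_penalty(T, strategy)
```

For example, cpi_lower_bound(0.2, 2/3, "taken") gives roughly 1.27, and the crossover between the two prediction policies sits exactly at T = 0.5, matching the conclusion above.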
Control Dependence -- delayed branching

Place instructions or nops in the branch shadow:
  bne $1, $3, loop
  first shadow instruction here
  second shadow instruction here
  third shadow instruction here
  sub $6, $7, $8
The three shadow instructions issue in the cycles immediately after the branch, so the branch outcome is known before the first instruction logically following the branch is fetched.

Exceptions and Pipelining

Consider processor exceptions such as arithmetic exceptions (overflow, divide-by-zero, etc.), interrupts, and so on. These are essentially surprise branches: when an exception occurs during the execution of one instruction, the following instructions are already in the pipeline (partially executed), and control must transfer to the exception handling routine. The pipeline state must be saved in a clean manner in order to recover from (return from) the exception.

Even worse, multiple exceptions could occur in one cycle:
- I/O interrupt (MEM)
- User trap to the OS (EX)
- Illegal instruction (ID)
- Arithmetic overflow
- Hardware error
- Etc.
Interrupt priorities must be supported.
MIPS R2000/R3000 Pipeline

Each stage is split into two clock phases:
- IF, phase 1: translate the virtual instruction address using the TLB; phase 2: access the I-cache.
- RD, phase 1: return the instruction from the I-cache, checking tags and parity; phase 2: read the RF; if a branch, generate the target (separate adder).
- ALU, phase 1: start the operation; if a branch, check the condition; phase 2: finish the operation; if a load/store, translate the data address.
- MEM, phase 1: access the D-cache; phase 2: return the data from the D-cache, checking tags and parity.
- WB, phase 1: write the RF.

IBM RISC Experience [Agerwala and Cocke 1987]

An internal IBM study: what are the limits of a scalar pipeline?
- Memory bandwidth: fetch 1 instruction/cycle from the I-cache; 40% of instructions are loads/stores (D-cache).
- Code characteristics (dynamic): loads 25%, stores 15%, ALU/RR 40%, branches 20%. Of the branches: 1/3 unconditional (always taken), 1/3 conditional taken, 1/3 conditional not taken.

Cache performance: assume a 100% hit ratio (an upper bound) and a cache latency of 1 cycle for both I- and D-caches by default.

Load and branch scheduling:
- Loads: 25% cannot be scheduled (the delay slot stays empty); 65% can be moved back 1 or 2 instructions; 10% can be moved back 1 instruction.
- Branches: unconditional are 100% schedulable (fill one delay slot); conditional are 50% schedulable (fill one delay slot).

CPI Optimizations

Goal and impediments: CPI = 1, prevented by pipeline stalls.
- No cache bypass of the RF, no load/branch scheduling:
    Load penalty: 2 cycles: 0.25 x 2 = 0.5 CPI
    Branch penalty: 2 cycles: 0.2 x 2/3 x 2 = 0.27 CPI
    Total CPI: 1 + 0.5 + 0.27 = 1.77 CPI
- Bypass, no load/branch scheduling:
    Load penalty: 1 cycle: 0.25 x 1 = 0.25 CPI
    Total CPI: 1 + 0.25 + 0.27 = 1.52 CPI
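The study's CPI accounting can be reproduced directly from the instruction mix above:

```python
# Dynamic instruction mix from the IBM study.
LOADS, STORES, ALU_RR, BRANCHES = 0.25, 0.15, 0.40, 0.20
TAKEN_FRACTION = 2 / 3      # 1/3 unconditional + 1/3 conditional taken

def cpi(load_penalty_cycles):
    """Lower-bound CPI given the per-load stall; each taken branch costs
    2 cycles, and only taken branches pay a penalty."""
    load_cpi = LOADS * load_penalty_cycles
    branch_cpi = BRANCHES * TAKEN_FRACTION * 2
    return 1 + load_cpi + branch_cpi

no_bypass = cpi(2)      # 1 + 0.50 + 0.27 = 1.77 CPI
with_bypass = cpi(1)    # 1 + 0.25 + 0.27 = 1.52 CPI
```

Bypassing alone removes a quarter of a cycle per instruction; the remaining gap to CPI = 1 is split between the residual load penalty and taken branches.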
More CPI Optimizations

Bypass, plus scheduling of loads and branches:
- Load penalty: 65% + 10% = 75% are moved back, so no penalty; the remaining 25% incur a 1-cycle penalty: 0.25 x 0.25 x 1 = 0.0625 CPI.
- Branch penalty:
    1/3 unconditional, 100% schedulable => 1 cycle
    1/3 conditional not-taken => no penalty (predict not-taken)
    1/3 conditional taken, 50% schedulable => 1 cycle
    1/3 conditional taken, 50% unschedulable => 2 cycles
    0.2 x [1/3 x 1 + 1/3 x 0.5 x 1 + 1/3 x 0.5 x 2] = 0.167 CPI
- Total CPI: 1 + 0.0625 + 0.167 = 1.23 CPI

Simplify Branches

Assume 90% of branches can be PC-relative: no register indirect, no register access, and the target is computed early with a separate adder (as in the MIPS R3000). The branch penalty is reduced:
    PC-relative   Schedulable   Penalty
    Yes (90%)     Yes (50%)     0 cycles
    Yes (90%)     No  (50%)     1 cycle
    No  (10%)     Yes (50%)     1 cycle
    No  (10%)     No  (50%)     2 cycles
Total CPI: 1.15 -- a 15% overhead from program dependences.

Review
- Pipelining overview
- Control
- Data hazards: stalls; forwarding (bypassing)
- Control flow hazards: branch prediction
- Exceptions
- Real pipelines
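The first calculation on this page (bypassing plus scheduling, before the PC-relative simplification) checks out the same way, using the load and branch fractions given earlier:

```python
def cpi_scheduled():
    """CPI with bypassing plus compiler scheduling of loads and branches."""
    # Loads: 65% + 10% = 75% moved back (no penalty); 25% stall 1 cycle.
    load_cpi = 0.25 * 0.25 * 1
    # Branches: 1/3 unconditional (delay slot filled: 1 cycle),
    # 1/3 conditional not-taken (predicted correctly: free),
    # 1/3 conditional taken: 50% scheduled (1 cycle), 50% not (2 cycles).
    branch_cpi = 0.20 * (1/3 * 1 + 1/3 * 0.5 * 1 + 1/3 * 0.5 * 2)
    return 1 + load_cpi + branch_cpi
```

The result, about 1.23 CPI, sits between the bypass-only 1.52 and the 1.15 reached once most branch targets are generated early from PC-relative offsets.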
Basic ipelining oncepts Appendix A (recommended reading, not everything will be covered today) Basic pipelining ipeline hazards Data hazards ontrol hazards Structural hazards Multicycle operations Execution
More informationECEC 355: Pipelining
ECEC 355: Pipelining November 8, 2007 What is Pipelining Pipelining is an implementation technique whereby multiple instructions are overlapped in execution. A pipeline is similar in concept to an assembly
More informationWhat is Pipelining? Time per instruction on unpipelined machine Number of pipe stages
What is Pipelining? Is a key implementation techniques used to make fast CPUs Is an implementation techniques whereby multiple instructions are overlapped in execution It takes advantage of parallelism
More informationSome material adapted from Mohamed Younis, UMBC CMSC 611 Spr 2003 course slides Some material adapted from Hennessy & Patterson / 2003 Elsevier
Some material adapted from Mohamed Younis, UMBC CMSC 611 Spr 2003 course slides Some material adapted from Hennessy & Patterson / 2003 Elsevier Science Cases that affect instruction execution semantics
More informationLecture 9: Case Study MIPS R4000 and Introduction to Advanced Pipelining Professor Randy H. Katz Computer Science 252 Spring 1996
Lecture 9: Case Study MIPS R4000 and Introduction to Advanced Pipelining Professor Randy H. Katz Computer Science 252 Spring 1996 RHK.SP96 1 Review: Evaluating Branch Alternatives Two part solution: Determine
More informationMulti-cycle Instructions in the Pipeline (Floating Point)
Lecture 6 Multi-cycle Instructions in the Pipeline (Floating Point) Introduction to instruction level parallelism Recap: Support of multi-cycle instructions in a pipeline (App A.5) Recap: Superpipelining
More informationSlide Set 7. for ENCM 501 in Winter Term, Steve Norman, PhD, PEng
Slide Set 7 for ENCM 501 in Winter Term, 2017 Steve Norman, PhD, PEng Electrical & Computer Engineering Schulich School of Engineering University of Calgary Winter Term, 2017 ENCM 501 W17 Lectures: Slide
More informationCOSC4201 Pipelining. Prof. Mokhtar Aboelaze York University
COSC4201 Pipelining Prof. Mokhtar Aboelaze York University 1 Instructions: Fetch Every instruction could be executed in 5 cycles, these 5 cycles are (MIPS like machine). Instruction fetch IR Mem[PC] NPC
More informationPipelining. Principles of pipelining. Simple pipelining. Structural Hazards. Data Hazards. Control Hazards. Interrupts. Multicycle operations
Principles of pipelining Pipelining Simple pipelining Structural Hazards Data Hazards Control Hazards Interrupts Multicycle operations Pipeline clocking ECE D52 Lecture Notes: Chapter 3 1 Sequential Execution
More informationMIPS Pipelining. Computer Organization Architectures for Embedded Computing. Wednesday 8 October 14
MIPS Pipelining Computer Organization Architectures for Embedded Computing Wednesday 8 October 14 Many slides adapted from: Computer Organization and Design, Patterson & Hennessy 4th Edition, 2011, MK
More informationCISC 662 Graduate Computer Architecture Lecture 6 - Hazards
CISC 662 Graduate Computer Architecture Lecture 6 - Hazards Michela Taufer http://www.cis.udel.edu/~taufer/teaching/cis662f07 Powerpoint Lecture Notes from John Hennessy and David Patterson s: Computer
More informationLecture 3. Pipelining. Dr. Soner Onder CS 4431 Michigan Technological University 9/23/2009 1
Lecture 3 Pipelining Dr. Soner Onder CS 4431 Michigan Technological University 9/23/2009 1 A "Typical" RISC ISA 32-bit fixed format instruction (3 formats) 32 32-bit GPR (R0 contains zero, DP take pair)
More informationPredict Not Taken. Revisiting Branch Hazard Solutions. Filling the delay slot (e.g., in the compiler) Delayed Branch
branch taken Revisiting Branch Hazard Solutions Stall Predict Not Taken Predict Taken Branch Delay Slot Branch I+1 I+2 I+3 Predict Not Taken branch not taken Branch I+1 IF (bubble) (bubble) (bubble) (bubble)
More informationThe Processor Pipeline. Chapter 4, Patterson and Hennessy, 4ed. Section 5.3, 5.4: J P Hayes.
The Processor Pipeline Chapter 4, Patterson and Hennessy, 4ed. Section 5.3, 5.4: J P Hayes. Pipeline A Basic MIPS Implementation Memory-reference instructions Load Word (lw) and Store Word (sw) ALU instructions
More informationENGN1640: Design of Computing Systems Topic 06: Advanced Processor Design
ENGN1640: Design of Computing Systems Topic 06: Advanced Processor Design Professor Sherief Reda http://scale.engin.brown.edu Electrical Sciences and Computer Engineering School of Engineering Brown University
More informationFull Datapath. Chapter 4 The Processor 2
Pipelining Full Datapath Chapter 4 The Processor 2 Datapath With Control Chapter 4 The Processor 3 Performance Issues Longest delay determines clock period Critical path: load instruction Instruction memory
More informationCOMPUTER ORGANIZATION AND DESIGN
COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface 5 th Edition Chapter 4 The Processor Introduction CPU performance factors Instruction count Determined by ISA and compiler CPI and Cycle
More informationInstruction Pipelining
Instruction Pipelining Simplest form is a 3-stage linear pipeline New instruction fetched each clock cycle Instruction finished each clock cycle Maximal speedup = 3 achieved if and only if all pipe stages
More informationThe Processor. Z. Jerry Shi Department of Computer Science and Engineering University of Connecticut. CSE3666: Introduction to Computer Architecture
The Processor Z. Jerry Shi Department of Computer Science and Engineering University of Connecticut CSE3666: Introduction to Computer Architecture Introduction CPU performance factors Instruction count
More informationComputer and Information Sciences College / Computer Science Department Enhancing Performance with Pipelining
Computer and Information Sciences College / Computer Science Department Enhancing Performance with Pipelining Single-Cycle Design Problems Assuming fixed-period clock every instruction datapath uses one
More information1 Hazards COMP2611 Fall 2015 Pipelined Processor
1 Hazards Dependences in Programs 2 Data dependence Example: lw $1, 200($2) add $3, $4, $1 add can t do ID (i.e., read register $1) until lw updates $1 Control dependence Example: bne $1, $2, target add
More informationDetermined by ISA and compiler. We will examine two MIPS implementations. A simplified version A more realistic pipelined version
MIPS Processor Introduction CPU performance factors Instruction count Determined by ISA and compiler CPI and Cycle time Determined by CPU hardware We will examine two MIPS implementations A simplified
More informationComputer Architecture Computer Science & Engineering. Chapter 4. The Processor BK TP.HCM
Computer Architecture Computer Science & Engineering Chapter 4 The Processor Introduction CPU performance factors Instruction count Determined by ISA and compiler CPI and Cycle time Determined by CPU hardware
More informationPipelining. CSC Friday, November 6, 2015
Pipelining CSC 211.01 Friday, November 6, 2015 Performance Issues Longest delay determines clock period Critical path: load instruction Instruction memory register file ALU data memory register file Not
More informationInstr. execution impl. view
Pipelining Sangyeun Cho Computer Science Department Instr. execution impl. view Single (long) cycle implementation Multi-cycle implementation Pipelined implementation Processing an instruction Fetch instruction
More informationPipelining! Advanced Topics on Heterogeneous System Architectures. Politecnico di Milano! Seminar DEIB! 30 November, 2017!
Advanced Topics on Heterogeneous System Architectures Pipelining! Politecnico di Milano! Seminar Room @ DEIB! 30 November, 2017! Antonio R. Miele! Marco D. Santambrogio! Politecnico di Milano! 2 Outline!
More informationDepartment of Computer and IT Engineering University of Kurdistan. Computer Architecture Pipelining. By: Dr. Alireza Abdollahpouri
Department of Computer and IT Engineering University of Kurdistan Computer Architecture Pipelining By: Dr. Alireza Abdollahpouri Pipelined MIPS processor Any instruction set can be implemented in many
More informationInstruction Level Parallelism. ILP, Loop level Parallelism Dependences, Hazards Speculation, Branch prediction
Instruction Level Parallelism ILP, Loop level Parallelism Dependences, Hazards Speculation, Branch prediction Basic Block A straight line code sequence with no branches in except to the entry and no branches
More informationLecture 2: Pipelining Basics. Today: chapter 1 wrap-up, basic pipelining implementation (Sections A.1 - A.4)
Lecture 2: Pipelining Basics Today: chapter 1 wrap-up, basic pipelining implementation (Sections A.1 - A.4) 1 Defining Fault, Error, and Failure A fault produces a latent error; it becomes effective when
More informationLecture 2: Processor and Pipelining 1
The Simple BIG Picture! Chapter 3 Additional Slides The Processor and Pipelining CENG 6332 2 Datapath vs Control Datapath signals Control Points Controller Datapath: Storage, FU, interconnect sufficient
More informationExecution/Effective address
Pipelined RC 69 Pipelined RC Instruction Fetch IR mem[pc] NPC PC+4 Instruction Decode/Operands fetch A Regs[rs]; B regs[rt]; Imm sign extended immediate field Execution/Effective address Memory Ref ALUOutput
More informationSome material adapted from Mohamed Younis, UMBC CMSC 611 Spr 2003 course slides Some material adapted from Hennessy & Patterson / 2003 Elsevier
Some material adapted from Mohamed Younis, UMBC CMSC 611 Spr 2003 course slides Some material adapted from Hennessy & Patterson / 2003 Elsevier Science 6 PM 7 8 9 10 11 Midnight Time 30 40 20 30 40 20
More informationCMCS Mohamed Younis CMCS 611, Advanced Computer Architecture 1
CMCS 611-101 Advanced Computer Architecture Lecture 9 Pipeline Implementation Challenges October 5, 2009 www.csee.umbc.edu/~younis/cmsc611/cmsc611.htm Mohamed Younis CMCS 611, Advanced Computer Architecture
More informationComplex Pipelining COE 501. Computer Architecture Prof. Muhamed Mudawar
Complex Pipelining COE 501 Computer Architecture Prof. Muhamed Mudawar Computer Engineering Department King Fahd University of Petroleum and Minerals Presentation Outline Diversified Pipeline Detecting
More informationLecture 9. Pipeline Hazards. Christos Kozyrakis Stanford University
Lecture 9 Pipeline Hazards Christos Kozyrakis Stanford University http://eeclass.stanford.edu/ee18b 1 Announcements PA-1 is due today Electronic submission Lab2 is due on Tuesday 2/13 th Quiz1 grades will
More informationPipelining Analogy. Pipelined laundry: overlapping execution. Parallelism improves performance. Four loads: Non-stop: Speedup = 8/3.5 = 2.3.
Pipelining Analogy Pipelined laundry: overlapping execution Parallelism improves performance Four loads: Speedup = 8/3.5 = 2.3 Non-stop: Speedup =2n/05n+15 2n/0.5n 1.5 4 = number of stages 4.5 An Overview
More informationThomas Polzer Institut für Technische Informatik
Thomas Polzer tpolzer@ecs.tuwien.ac.at Institut für Technische Informatik Pipelined laundry: overlapping execution Parallelism improves performance Four loads: Speedup = 8/3.5 = 2.3 Non-stop: Speedup =
More informationComputer Architecture Spring 2016
Computer Architecture Spring 2016 Lecture 02: Introduction II Shuai Wang Department of Computer Science and Technology Nanjing University Pipeline Hazards Major hurdle to pipelining: hazards prevent the
More informationThe Processor: Instruction-Level Parallelism
The Processor: Instruction-Level Parallelism Computer Organization Architectures for Embedded Computing Tuesday 21 October 14 Many slides adapted from: Computer Organization and Design, Patterson & Hennessy
More informationProcessor (IV) - advanced ILP. Hwansoo Han
Processor (IV) - advanced ILP Hwansoo Han Instruction-Level Parallelism (ILP) Pipelining: executing multiple instructions in parallel To increase ILP Deeper pipeline Less work per stage shorter clock cycle
More informationPipelining. Maurizio Palesi
* Pipelining * Adapted from David A. Patterson s CS252 lecture slides, http://www.cs.berkeley/~pattrsn/252s98/index.html Copyright 1998 UCB 1 References John L. Hennessy and David A. Patterson, Computer
More informationControl Hazards - branching causes problems since the pipeline can be filled with the wrong instructions.
Control Hazards - branching causes problems since the pipeline can be filled with the wrong instructions Stage Instruction Fetch Instruction Decode Execution / Effective addr Memory access Write-back Abbreviation
More informationThese actions may use different parts of the CPU. Pipelining is when the parts run simultaneously on different instructions.
MIPS Pipe Line 2 Introduction Pipelining To complete an instruction a computer needs to perform a number of actions. These actions may use different parts of the CPU. Pipelining is when the parts run simultaneously
More informationL19 Pipelined CPU I 1. Where are the registers? Study Chapter 6 of Text. Pipelined CPUs. Comp 411 Fall /07/07
Pipelined CPUs Where are the registers? Study Chapter 6 of Text L19 Pipelined CPU I 1 Review of CPU Performance MIPS = Millions of Instructions/Second MIPS = Freq CPI Freq = Clock Frequency, MHz CPI =
More informationInstruction Pipelining
Instruction Pipelining Simplest form is a 3-stage linear pipeline New instruction fetched each clock cycle Instruction finished each clock cycle Maximal speedup = 3 achieved if and only if all pipe stages
More informationWhat is Pipelining? RISC remainder (our assumptions)
What is Pipelining? Is a key implementation techniques used to make fast CPUs Is an implementation techniques whereby multiple instructions are overlapped in execution It takes advantage of parallelism
More informationAppendix C. Instructor: Josep Torrellas CS433. Copyright Josep Torrellas 1999, 2001, 2002,
Appendix C Instructor: Josep Torrellas CS433 Copyright Josep Torrellas 1999, 2001, 2002, 2013 1 Pipelining Multiple instructions are overlapped in execution Each is in a different stage Each stage is called
More informationChapter 4. The Processor
Chapter 4 The Processor 1 Introduction CPU performance factors Instruction count Determined by ISA and compiler CPI and Cycle time Determined by CPU hardware We will examine two MIPS implementations A
More informationLecture 6 MIPS R4000 and Instruction Level Parallelism. Computer Architectures S
Lecture 6 MIPS R4000 and Instruction Level Parallelism Computer Architectures 521480S Case Study: MIPS R4000 (200 MHz, 64-bit instructions, MIPS-3 instruction set) 8 Stage Pipeline: first half of fetching
More information4. The Processor Computer Architecture COMP SCI 2GA3 / SFWR ENG 2GA3. Emil Sekerinski, McMaster University, Fall Term 2015/16
4. The Processor Computer Architecture COMP SCI 2GA3 / SFWR ENG 2GA3 Emil Sekerinski, McMaster University, Fall Term 2015/16 Instruction Execution Consider simplified MIPS: lw/sw rt, offset(rs) add/sub/and/or/slt
More informationELEC 5200/6200 Computer Architecture and Design Fall 2016 Lecture 9: Instruction Level Parallelism
ELEC 5200/6200 Computer Architecture and Design Fall 2016 Lecture 9: Instruction Level Parallelism Ujjwal Guin, Assistant Professor Department of Electrical and Computer Engineering Auburn University,
More informationCSEE 3827: Fundamentals of Computer Systems
CSEE 3827: Fundamentals of Computer Systems Lecture 21 and 22 April 22 and 27, 2009 martha@cs.columbia.edu Amdahl s Law Be aware when optimizing... T = improved Taffected improvement factor + T unaffected
More informationAdvanced issues in pipelining
Advanced issues in pipelining 1 Outline Handling exceptions Supporting multi-cycle operations Pipeline evolution Examples of real pipelines 2 Handling exceptions 3 Exceptions In pipelined execution, one
More informationPipelining: Hazards Ver. Jan 14, 2014
POLITECNICO DI MILANO Parallelism in wonderland: are you ready to see how deep the rabbit hole goes? Pipelining: Hazards Ver. Jan 14, 2014 Marco D. Santambrogio: marco.santambrogio@polimi.it Simone Campanoni:
More informationPipelining. Each step does a small fraction of the job All steps ideally operate concurrently
Pipelining Computational assembly line Each step does a small fraction of the job All steps ideally operate concurrently A form of vertical concurrency Stage/segment - responsible for 1 step 1 machine
More informationOverview. Appendix A. Pipelining: Its Natural! Sequential Laundry 6 PM Midnight. Pipelined Laundry: Start work ASAP
Overview Appendix A Pipelining: Basic and Intermediate Concepts Basics of Pipelining Pipeline Hazards Pipeline Implementation Pipelining + Exceptions Pipeline to handle Multicycle Operations 1 2 Unpipelined
More informationPipelined CPUs. Study Chapter 4 of Text. Where are the registers?
Pipelined CPUs Where are the registers? Study Chapter 4 of Text Second Quiz on Friday. Covers lectures 8-14. Open book, open note, no computers or calculators. L17 Pipelined CPU I 1 Review of CPU Performance
More informationChapter 3. Pipelining. EE511 In-Cheol Park, KAIST
Chapter 3. Pipelining EE511 In-Cheol Park, KAIST Terminology Pipeline stage Throughput Pipeline register Ideal speedup Assume The stages are perfectly balanced No overhead on pipeline registers Speedup
More informationPreventing Stalls: 1
Preventing Stalls: 1 2 PipeLine Pipeline efficiency Pipeline CPI = Ideal pipeline CPI + Structural Stalls + Data Hazard Stalls + Control Stalls Ideal pipeline CPI: best possible (1 as n ) Structural hazards:
More informationCS152 Computer Architecture and Engineering March 13, 2008 Out of Order Execution and Branch Prediction Assigned March 13 Problem Set #4 Due March 25
CS152 Computer Architecture and Engineering March 13, 2008 Out of Order Execution and Branch Prediction Assigned March 13 Problem Set #4 Due March 25 http://inst.eecs.berkeley.edu/~cs152/sp08 The problem
More informationAdvanced d Instruction Level Parallelism. Computer Systems Laboratory Sungkyunkwan University
Advanced d Instruction ti Level Parallelism Jin-Soo Kim (jinsookim@skku.edu) Computer Systems Laboratory Sungkyunkwan University http://csl.skku.edu ILP Instruction-Level Parallelism (ILP) Pipelining:
More informationEN2910A: Advanced Computer Architecture Topic 02: Review of classical concepts
EN2910A: Advanced Computer Architecture Topic 02: Review of classical concepts Prof. Sherief Reda School of Engineering Brown University S. Reda EN2910A FALL'15 1 Classical concepts (prerequisite) 1. Instruction
More informationLECTURE 10. Pipelining: Advanced ILP
LECTURE 10 Pipelining: Advanced ILP EXCEPTIONS An exception, or interrupt, is an event other than regular transfers of control (branches, jumps, calls, returns) that changes the normal flow of instruction
More informationChapter 4. The Processor
Chapter 4 The Processor Introduction CPU performance factors Instruction count Determined by ISA and compiler CPI and Cycle time Determined by CPU hardware We will examine two MIPS implementations A simplified
More information