Computer Architectures. DLX ISA: Pipelined Implementation
- Scott Taylor
- 6 years ago
1 Computer Architectures. DLX ISA: Pipelined Implementation
2 The Pipelining Principle. Pipelining is nowadays the main basic technique deployed to speed up a CPU. The key idea behind pipelining is general, and is applied in several industrial fields (production lines, oil pipelines, ...). A system S has to execute a task A N times: A1, A2, A3 ... AN enter S, and R1, R2, R3 ... RN come out. Latency: the time elapsed between the beginning and the end of task A (TA). Throughput: the frequency at which tasks are completed.
3 The Pipelining Principle. 1) Sequential system: tasks A1, A2, A3 ... AN execute one after the other, each taking TA. Latency (execution time of a single task) = TA; Throughput(1) = 1/TA. 2) Pipelined system: each task is split across the pipeline stages P1..P4; Si denotes the i-th pipeline stage (S1 S2 S3 S4).
4 The Pipelining Principle. [Timing diagram: tasks A1, A2 ... An flow through stages P1..P4 (S1 S2 S3 S4), with a new task entering the pipeline every cycle.] TP: pipeline cycle. Latency(2) = 4 * TP = TA; Throughput(2) = 1/TP = 4/TA = 4 * Throughput(1).
5 The Pipelining Principle (2). Pipelining does not decrease the amount of time needed to carry out each single task: Latency(2) = Latency(1). Pipelining, instead, increases the throughput, multiplying it by a factor K equal to the number of pipeline stages: Throughput(2) = K * Throughput(1). This yields a reduction, by the same factor K, of the total execution time of a sequence of N tasks (TN): TN = N / Throughput, so TN(1) = N / Throughput(1) and TN(2) = N / Throughput(2). Speedup(2 vs 1) = TN(1) / TN(2) = Throughput(2) / Throughput(1) = K.
6 The Pipelining Principle (2). Ideal case: TP = TPi = TA / K (perfectly balanced pipeline), Speedup = K. Real case: TP = max(TP1, TP2, ..., TPK) ((slightly) unbalanced pipeline), Speedup < K. Example: TA = 20t (t: time unit); TP1 = 5t, TP2 = 5t, TP3 = 6t, TP4 = 4t, hence TP = 6t. Speedup(2 vs 1) = TA / TP = 20t / 6t = 3.33 (< 4).
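The speedup computation on this slide can be sketched as a few lines of Python (a minimal sketch; the function name is ours, not from the slides):

```python
# Speedup of a (possibly unbalanced) pipeline, following the slide's example.
def pipeline_speedup(stage_delays, t_a=None):
    """Speedup vs. the non-pipelined unit: T_A / max(stage delay)."""
    t_p = max(stage_delays)          # the slowest stage sets the pipeline cycle
    if t_a is None:
        t_a = sum(stage_delays)      # unpipelined latency = sum of stage delays
    return t_a / t_p

# Slide example: T_A = 20t, stages of 5t, 5t, 6t, 4t -> T_P = 6t
print(pipeline_speedup([5, 5, 6, 4], t_a=20))  # 3.33... (< K = 4)
```

With a perfectly balanced pipeline (e.g. four 5t stages) the same function returns exactly K = 4.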
7 Pipelining in a CPU (the DLX). Tasks A1, A2, A3 ... AN become instructions I1, I2, I3 ... IN. The stages IF, ID, EX, MEM, WB are combinatorial circuits separated by registers (pipeline registers, FFs): IF/ID, ID/EX, EX/MEM, MEM/WB. CPU (datapath). N.B. this architecture is COMPLETELY different from the sequential one. Pipeline cycle = clock cycle = delay of the slowest stage. CPI = 1 (ideally!).
8 Pipeline in the DLX. Instr i, Instr i+1, Instr i+2, Instr i+3, Instr i+4 each flow through IF ID EX MEM WB, with one instruction entering the pipeline per cycle: CPI (ideally) = 1. Overhead introduced by the pipeline registers: Tclk = Td + TP + Tsu, where Tclk is the clock cycle, Td is the delay of the input stage register, TP is the delay of the slowest combinatorial stage, and Tsu is the set-up time of the output stage register.
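The clock-period formula above can be checked numerically (the delay values below are hypothetical, chosen only for illustration):

```python
# Pipeline register overhead on the clock period: T_clk = T_d + T_P + T_su.
def clock_period(t_d, stage_delays, t_su):
    """t_d: input-register delay; t_su: set-up time of the output register."""
    return t_d + max(stage_delays) + t_su

# Hypothetical numbers: 0.5t register delay, stages of 5t/5t/6t/4t, 0.5t set-up.
print(clock_period(0.5, [5, 5, 6, 4], 0.5))  # 7.0
```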
9 [Figure: one pipeline stage — input stage register (delay Td), combinatorial circuit (delay TP), output stage register (set-up time Tsu).]
10 Requirements for implementation of the pipeline. Each stage has to be active during each clock cycle. The PC has to be incremented in the IF stage (instead of EX): an ADDER has to be introduced in the IF stage (PC <- PC+4, i.e. PC <- PC+1 in word terms). Since instructions are word-aligned, a 30-bit register (counter) is incremented each clock cycle (the 2 least-significant bits are always 0). Two MDRs are required (referred to as LMDR and SMDR) to handle the situation where a LOAD is immediately followed by a STORE (WB-MEM overlapping): two data waiting to be written (one into memory, the other one into the RF) are in flight at the same time. At every clock cycle it must be possible to perform 2 memory accesses (IF, MEM): Instruction Memory (IM) and Data Memory (DM): Harvard Architecture. The CPU clock is determined by the slowest stage: IM and DM have to be cache memories (on-chip). Pipeline registers store both data and control information (the Control Unit is distributed among the pipeline stages).
11 DLX Pipelined Datapath. [Figure: pipelined datapath with PC, INSTR MEM, RF (RS1, RS2, RD), sign extension (SE), ALU, DATA MEM, and the pipeline barriers IF/ID, ID/EX, EX/MEM, MEM/WB.] Annotations: the PC is actually a presettable counter, since the two least-significant bits are always 0; +4 adder in IF; MUX for SCn in EX; MUX for Branch (also <0 and >0) in EX; a MEM-stage MUX [acting on the output] loads the PC if jumping; the ALU and the '=0?' test compute the new PC when branching; DEC computes the number of the destination register in case of LOAD and ALU instructions; JAL and JALR (PC stored in R31); SE for operations with immediates.
12 ID stage. [Figure: from the IF/ID registers (PC and IR1, 32 bits each) to the ID/EX registers.] Fields extracted from the IR: IR25-0 (J and JAL, 26 bits), IR15-0 (offset/immediate/JR/branch/load dest. reg.), the opcode, and the register fields of an R instruction; a DEC decodes the info travelling with the instruction (e.g. LB, SW). RS1, RS2 and RD address the RF; sign extension (SE) produces the immediate/branch and jump offsets; PC31-0 is passed along for JAL and JALR. Outputs to ID/EX: A and B. The datum and the number of the destination register arrive from the WB stage.
13 DLX Pipelined Datapath. [Figure: full datapath with PC1..PC4, IR1..IR4, A, B, ALU, COND, SMDR, LMDR, Y and the IF/ID, ID/EX, EX/MEM, MEM/WB barriers; MUX for SCn; EX logic for Branch (also <0 and >0); MUX acting on the WB output for JAL and JALR (PC in R31); number of the destination register.] Legend: SMDR: Store Memory Data Register; LMDR: Load Memory Data Register; IRi: Instruction Register i; ALUOUTPUT: ALU output, or MAR, or Branch Target Address; Y: data computed in the previous stages.
14 Pipelined execution of an ALU instruction. NOTE: for these instructions, RS2/RD need to be carried along the pipeline up to the WB stage. The decoded opcode is carried along all stages.
IF: IR <- IM[PC]; PC <- PC + 4; PC1 <- PC + 4
ID: A <- RS1; B <- RS2; PC2 <- PC1; IR2 <- IR1; ID/EX <- instruction decode
EX: ALUOUTPUT <- A op B, or ALUOUTPUT <- A op (IR2 15)^16 ## IR2 15-0 [IR3 <- IR2] [PC3 <- PC2]
MEM: Y <- ALUOUTPUT (temporary storage, waiting for WB) [IR4 <- IR3] [PC4 <- PC3]
WB: RD <- Y
NOTE: IRi bits that are no longer needed are dropped in the successive stages; from one stage to the next, only those bits that are needed by all instructions are kept. ALUOUTPUT (in EX/MEM); Y: ALUOUTPUT1.
15 Pipelined execution of a MEM instruction. The decoded opcode is carried along all stages.
IF: IR <- IM[PC]; PC <- PC + 4; PC1 <- PC + 4
ID: A <- RS1; B <- RS2; PC2 <- PC1; IR2 <- IR1; ID/EX <- instruction decode
EX: MAR <- A op (IR2 15)^16 ## IR2 15-0; SMDR <- B [IR3 <- IR2] [PC3 <- PC2]
MEM: LMDR <- DM[MAR] (LOAD), or DM[MAR] <- SMDR (STORE) [IR4 <- IR3] [PC4 <- PC3]
WB: RD <- LMDR (LOAD) [sign ext.]
MAR: Data Memory Address Register.
16 Pipelined execution of a BRANCH instruction (normally after a SCn instruction). The decoded opcode is carried along all stages.
IF: IR <- IM[PC]; PC <- PC + 4; PC1 <- PC + 4
ID: A <- RS1; B <- RS2; PC2 <- PC1; IR2 <- IR1; ID/EX <- instruction decode
EX: BTA <- PC2 op (IR 15)^16 ## IR 15-0; Cond <- A op 0 [PC3 <- PC2] [IR3 <- IR2]
MEM: if (Cond) PC <- BTA [PC4 <- PC3] [IR4 <- IR3]
WB: (NOP)
If the branch is taken, the PC is overwritten in this (MEM) stage. BTA: BRANCH TARGET ADDRESS. The branch test is performed on the current value of register A.
17 Pipelined execution of a JR instruction. The decoded opcode is carried along all stages.
IF: IR <- IM[PC]; PC <- PC + 4; PC1 <- PC + 4
ID: A <- RS1; B <- RS2; PC2 <- PC1; IR2 <- IR1; ID/EX <- instruction decode
EX: ALUOUTPUT <- A [IR3 <- IR2] [PC3 <- PC2]
MEM: PC <- ALUOUTPUT [IR4 <- IR3] [PC4 <- PC3]
WB: (NOP)
What would the stage sequence be for a J instruction?
18 Pipelined execution of a JAL or JALR instruction. The decoded opcode is carried along all stages.
IF: IR <- IM[PC]; PC <- PC + 4; PC1 <- PC + 4
ID: A <- RS1; B <- RS2; PC2 <- PC1; IR2 <- IR1; ID/EX <- instruction decode
EX: ALUOUTPUT <- A (if JALR), or ALUOUTPUT <- PC2 + (IR 25)^6 ## IR 25-0 (if JAL); PC3 <- PC2 [IR3 <- IR2]
MEM: PC <- ALUOUTPUT; PC4 <- PC3 [IR4 <- IR3]
WB: R31 <- PC4
In this case the PCi values are used. NOTE: writing into R31 can NOT be done on-the-fly, since it could overlap with another register write operation.
19 What would the sequence be in case of a SCn instruction (e.g. SLT R1,R2,R3)?
IF: IR <- IM[PC]; PC <- PC + 4; PC1 <- PC + 4
ID: A <- RS1; B <- RS2; PC2 <- PC1; IR2 <- IR1; ID/EX <- instruction decode
EX MEM WB: ???
20 Pipeline hazards. A hazard occurs when, in a specific clock cycle, an instruction currently flowing through a pipeline stage cannot be executed in that clock cycle. Structural hazards: the same resource is needed by two different pipeline stages; the instructions currently in those stages cannot execute simultaneously. Data hazards: they are due to instruction dependencies — for example, an instruction that needs to read a register not yet written by a previous instruction (Read After Write - RAW). Control hazards: the instructions that follow a branch depend on the branch outcome (taken/not taken). The instruction that cannot be executed has to be stopped ("pipeline stall" or "pipeline bubbling"), together with all the following instructions, while the previous instructions proceed normally (so as to eliminate the hazard).
21 Hazards and stalls. The consequence of a data hazard: if instruction Ii needs the result of instruction Ii-1 (registers are read in the ID stage), it has to wait until after the WB of Ii-1. [Timing table over Clk1..Clk12: Ii-3, Ii-2, Ii-1 proceed normally through IF ID EX MEM WB; Ii and Ii+1 each spend three cycles stalled (S S S) before continuing.] Stall: the clock signal for Ii, Ii+1, ... is stopped for three cycles. T5 = 8 * CLK = (5 + 3) * CLK. Ideally TN = N * 1 * CLK; here T5 = 5 * (1 + 3/5) * CLK, and in general TN = N * (1 + S) * CLK, where 1 is the ideal CPI, S the stalls per instruction, and (1 + S) the effective CPI.
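The cycle-count formula TN = N * (1 + S) * CLK can be sketched directly (function name is ours, not from the slides):

```python
# Effective cycle count with stalls: T_N = N * (ideal CPI + stalls/instr) * CLK.
def total_cycles(n_instr, stalls_per_instr, ideal_cpi=1):
    """Total clock cycles to run n_instr instructions."""
    return n_instr * (ideal_cpi + stalls_per_instr)

# Slide example: 5 instructions, 3 stall cycles in total -> 3/5 stalls/instr.
print(total_cycles(5, 3 / 5))  # 8.0, i.e. T_5 = (5 + 3) * CLK
```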
22 Forwarding. [Timing table over Clk1..Clk9:] ADD R3, R1, R4; SUB R7, R3, R5 (hazard); OR R1, R3, R5 (hazard); LW R6, 100(R3) (hazard); AND R9, R5, R3 (no hazard). Here too the datum is not yet in the RF, since it is written on the positive clock edge at the end of WB (register values are read in ID). Forwarding allows eliminating almost all RAW hazards of the DLX pipeline without stalling the pipeline. (NOTE: in the DLX, registers are modified only in WB.)
23 Forwarding implementation. [Figure: the Forwarding Unit (FU) compares RS1, RS2 with RD1, RD2 and the opcodes; bypass MUXes sit in front of the A and B inputs of the ALU; RF, Offset, and the ID/EX, EX/MEM, MEM/WB barriers.] On the ID/EX control part: opcode, and comparison of RD with RS1 and RS2. Alternatively, SPLIT-CYCLE (see next): write before read, often performed inside the RF; it allows anticipating the register write with respect to the read.
24 Forwarding Unit. Within the Forwarding Unit, the opcodes of the instructions in the EX, MEM and WB stages are decoded. If the instruction in the EX stage needs a register value (either A or B, i.e. an ALU instruction, NOT a J or Branch instruction), the opcodes of the instructions in the MEM and WB stages are examined. If they require a register update, the number of the involved register is compared with the register numbers of the instruction in the EX stage. If there is a match, the corresponding datum is forwarded to the EX stage, thus replacing the datum read from the register file. The bypass MUXes (at the inputs of the ID/EX barrier) are needed because a fetched instruction can require the contents of registers whose numbers match that of the instruction in the WB stage (if it must store a register value); in this case the datum must be read from the MEM/WB barrier instead of from the register file. Alternatively, split-cycle: the register is written in the first half-period and read in the second half-period.
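The forwarding decision described above can be sketched behaviorally (a sketch only: the real unit is combinational logic, and the field names below are hypothetical):

```python
# Behavioral sketch of the forwarding decision: for each ALU source operand,
# pick where the value comes from ('EX/MEM', 'MEM/WB', or the register file).
def forward_sources(ex_rs1, ex_rs2, mem_rd, mem_writes_reg, wb_rd, wb_writes_reg):
    srcs = []
    for rs in (ex_rs1, ex_rs2):
        if mem_writes_reg and mem_rd == rs:
            srcs.append("EX/MEM")   # the most recent result has priority
        elif wb_writes_reg and wb_rd == rs:
            srcs.append("MEM/WB")
        else:
            srcs.append("RF")
    return srcs

# ADD R3,R1,R4 followed by SUB R7,R3,R5: R3 is forwarded from EX/MEM.
print(forward_sources(3, 5, mem_rd=3, mem_writes_reg=True,
                      wb_rd=0, wb_writes_reg=False))  # ['EX/MEM', 'RF']
```

Note the priority: when both the MEM-stage and WB-stage instructions write the same register, the EX/MEM value is the newer one and must win.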
25 Data hazard due to LOAD instructions. LW R1,32(R6); ADD R4,R1,R7; SUB R5,R1,R8; AND R6,R1,R7. NOTE: the datum required by the ADD is available only at the end of the MEM stage. The hazard can not be eliminated by means of forwarding (unless an additional input is added to the MUXes between memory and ALU and everything is done in the same clock cycle: delays! There is a memory access in between, which is already slow by itself). As a matter of fact, the clock signal is not generated; the clock block is propagated along the pipeline one stage at a time. The pipeline needs to be stalled for one cycle; from the end of the MEM stage onwards, standard MEM->EX forwarding applies.
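The load-use stall condition can be sketched as a predicate (hypothetical signal names): stall when the instruction in EX is a load whose destination is a source of the instruction currently in ID.

```python
# Load-use hazard detection: a one-cycle bubble is needed when the load's
# destination register is read by the immediately following instruction.
def must_stall(ex_is_load, ex_rd, id_rs1, id_rs2):
    return ex_is_load and ex_rd in (id_rs1, id_rs2)

# LW R1,32(R6) followed by ADD R4,R1,R7 -> one bubble is required.
print(must_stall(True, 1, 1, 7))   # True
print(must_stall(False, 1, 1, 7))  # False: an ALU result can be forwarded
```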
26 Delayed load. In many RISC CPUs, the hazard associated with the LOAD instruction is not handled by the hardware through pipeline stalling; instead it is handled in software by the compiler (delayed load): LOAD instruction / delay slot / next instruction. The compiler tries to fill in the delay slot with a useful instruction (worst case: a NOP). Original: LW R1,32(R6); LW R3,10(R4); ADD R5,R1,R3; LW R6,20(R7); LW R8,40(R9). Scheduled: LW R1,32(R6); LW R3,10(R4); LW R6,20(R7); ADD R5,R1,R3; LW R8,40(R9).
27 Control Hazards. PC: BEQZ R4, 200; PC+4: SUB R7, R3, R5; PC+8: OR R1, R3, R5; PC+12: LW R6, 100(R8). Next instruction address: R4 = 0 -> Branch Target Address (taken); R4 != 0 -> PC+4 (not taken). [Timing table over Clk1..Clk8: the new PC value (ALUOUTPUT) is computed in EX, loaded into the PC one clock later; the fetch with the new PC (at the BTA, here AND R9, R5, R3) then proceeds through the pipeline.]
28 DLX Pipelined Datapath. Instruction Fetch, Instruction Decode, Execute (Branch or JMP), Memory, Write Back. [Figure: BEQZ R4, 200 travelling through the datapath; +4 adder, DEC, RF, =0? test, ALU, SE, and the IF/ID, ID/EX, EX/MEM, MEM/WB barriers.] NOTE: if the feedback signal of the new PC were taken directly from the ALU instead of from ALUOUTPUT, the required stalls would obviously be 2, but: slower clock! When the new PC acts on the IM, three instructions have already travelled through the first three stages (EX included).
29 Handling the Control Hazards. Always stall (a three-clock block being propagated): BEQZ R4,200 goes through IF ID EX MEM WB while the following slots stall (S S S S S) until the fetch at the new PC; the previous instruction has not been decoded yet, so in the real situation PC <- PC - 4 is repeated. Hyp.: branch frequency = 25%: CPI = (1 + S) = (1 + 3 * 0.25) = 1.75. Predict Not Taken: BEQZ R4, 200; SUB R7, R3, R5; OR R1, R3, R5; LW R6, 100(R8) are fetched; when the new value is sampled by the PC (branch completion), the instructions already in the pipeline are flushed (they become NOPs). No problem arises, since no instruction has gone through WB yet!
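The always-stall CPI estimate on this slide generalizes to any branch frequency and penalty (function name is ours):

```python
# Effective CPI with a fixed branch penalty, as in the slide's estimate.
def effective_cpi(branch_freq, penalty_cycles, ideal_cpi=1):
    return ideal_cpi + branch_freq * penalty_cycles

# Always-stall: 3-cycle penalty, 25% branch frequency.
print(effective_cpi(0.25, 3))  # 1.75
```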
30 Stalls with jumps (1/3). [Figure: datapath with a MUX on the PC input ("if jumping") and forced NOPs ("for jumping") on the IF/ID, ID/EX and EX/MEM registers; stages EX MEM WB.] On the first positive clock edge after sampling the assertion of the jump condition, 3 NOPs must be inserted to replace the 3 unwanted instructions already present in the pipeline.
31 Stalls when jumping (2/3). NOTE: in this case the jump condition and the new PC are sent to the MUX in the same clock cycle in which the condition is processed. [Figure: forced NOPs ("when jumping") on the IF/ID and ID/EX registers.] On the first positive clock edge after sampling the assertion of the jump condition, 2 NOPs must be inserted to replace the 2 unwanted instructions.
32 Stalls when jumping (3/3). NOTE: in this case the jump condition and the new PC control the MUX at the same moment in which the condition is processed. [Figure: a single forced NOP ("when jumping") on the IF/ID register.] On the first positive clock edge after the assertion of the jump condition, one NOP is inserted to replace the instruction currently in the IF/ID stage.
33 To reduce the number of stalls: independent ALU for BRANCH/JMP. NOTE: here there is only one stall, since the new value is inserted into the PC on the positive clock edge that ends the ID stage, while in the previous case it was inserted after the EX stage, that is, two clocks later!
IF: IR <- IM[PC]; PC <- PC + 4; PC1 <- PC + 4 (new fetch: only one stall)
ID: A <- RS1; B <- RS2; PC2 <- PC1; ID/EX <- decode; ID/EX <- opcode ext.; BTA <- PC1 + (IR 15)^16 ## IR 15-0, or (IR 25)^6 ## IR 25-0; if Branch: if (RS1 op 0) PC <- BTA; if JMP: always PC <- BTA
EX: ALU (additional full adder); then MEM, WB.
N.B. The full adder is separate from the +4 adder (this means it overlaps with the addition required to compute the next instruction address!); otherwise the same adder would have to be used together with some multiplexers (so as to select whether to add 4 or the offset, and whether to use PC or PC1).
34 [Figure: BRANCH/JMP with 1 stall — alongside the standard +4 addition, a dedicated adder computes (PC of the branch instruction) + (displacement of the branch instruction); the new PC is selected by a MUX according to the opcode and the value of the branch test register (=0?).] The stored branch PC actually coincides with the current value in the PC (the extra register can be avoided). NOTE: for the unconditional jump instructions there is an analogous situation: we only need to provide further inputs to the MUXes of the PC, taking into consideration either the RS1 register (JR and JALR) or the 26 least-significant bits of the IR with sign extension (J and JAL), to be added to the current PC.
35 Delayed branch. Similarly to the LOAD case, in several RISC CPUs the hazard associated with BRANCH instructions is handled in software by the compiler (delayed branch): BRANCH instruction / delay slot / delay slot / delay slot / next instruction. The compiler tries to fill in the delay slots with useful instructions (worst case: NOPs).
36 Delayed branch/jump. Original: Add R5, R4, R3; Sub R6, R5, R2; Or R14, R6, R21; Sne R1, R8, R9 ; branch condition; Br R1, +100. Compiled: Sne R1, R8, R9 ; branch condition; Br R1, +100; Add R5, R4, R3; Sub R6, R5, R2; Or R14, R6, R21 — the three moved instructions are executed in both cases (branch taken or not). Obviously, in this group of instructions there must be no jumps! Instead of one or more postponed instructions, the compiler inserts NOPs in case no suitable instructions are available.
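The reordering above can be sketched as a toy delay-slot filler (a sketch under strong assumptions: instructions are plain strings, and the caller guarantees the moved instructions do not feed the branch condition and contain no jumps):

```python
# Toy delay-slot filler in the spirit of the slide: move branch-independent
# instructions that precede the branch into its delay slots.
def fill_delay_slots(instrs, branch_idx, n_slots):
    """branch_idx: index of the first branch-related instruction (e.g. the
    condition-setting SCn); n_slots: number of delay slots to fill."""
    before, branch = instrs[:branch_idx], instrs[branch_idx:]
    movable = before[-n_slots:]        # candidates vetted by the compiler
    kept = before[:-n_slots]
    return kept + branch + movable     # movable instrs now fill the slots

code = ["Add R5,R4,R3", "Sub R6,R5,R2", "Or R14,R6,R21",
        "Sne R1,R8,R9", "Br R1,+100"]
print(fill_delay_slots(code, branch_idx=3, n_slots=3))
# -> Sne, Br, then Add/Sub/Or in the delay slots, as on the slide
```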
37 Handling the Control Hazards. Dynamic prediction: Branch Target Buffer -> no stall (almost..). [Table: PC | TAGS | Predicted PC | T/NT.] N.B. here the branch slot is selected during the clock cycle that loads IR1 into IF/ID. HIT: fetch with the predicted PC; MISS: fetch with PC + 4. Correct prediction: no stall. Wrong prediction: 1-3 stalls (correct fetch in ID or EX, see before).
38 Prediction buffer: the simplest implementation uses a single bit that records what happened the last time the branch executed. [Figure: nested loops Loop1/Loop2.] When exiting Loop2, the prediction fails (the branch is predicted taken but is actually untaken); it then fails again when it predicts untaken while entering Loop2 once again. In case of predominance of one outcome, each occurrence of the opposite outcome causes two consecutive mispredictions.
39 Hence, usually two bits are used for branch prediction: [State diagram: four states — predict taken (strong), predict taken (weak), predict untaken (weak), predict untaken (strong) — with TAKEN/UNTAKEN transitions between adjacent states; two consecutive mispredictions are needed to flip the prediction.]
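The four-state diagram is equivalent to a 2-bit saturating counter, which can be sketched as follows (the initial state is an assumption; real predictors choose it arbitrarily):

```python
# 2-bit saturating-counter branch predictor matching the state diagram:
# states 0-1 predict untaken, states 2-3 predict taken; two consecutive
# mispredictions are needed to flip the prediction.
class TwoBitPredictor:
    def __init__(self, state=3):       # start strongly taken (assumption)
        self.state = state

    def predict(self):
        return self.state >= 2         # True = predict taken

    def update(self, taken):
        if taken:
            self.state = min(3, self.state + 1)
        else:
            self.state = max(0, self.state - 1)

p = TwoBitPredictor()
hits = 0
for taken in [True, True, False, True]:   # e.g. a loop branch exiting once
    hits += (p.predict() == taken)
    p.update(taken)
print(hits)  # 3: only the single loop exit is mispredicted
```

With a 1-bit predictor the same sequence would cost two mispredictions (the exit and the re-entry), which is exactly the weakness the previous slide describes.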
More informationAppendix A. Overview
Appendix A Pipelining: Basic and Intermediate Concepts 1 Overview Basics of Pipelining Pipeline Hazards Pipeline Implementation Pipelining + Exceptions Pipeline to handle Multicycle Operations 2 1 Unpipelined
More informationPipeline Overview. Dr. Jiang Li. Adapted from the slides provided by the authors. Jiang Li, Ph.D. Department of Computer Science
Pipeline Overview Dr. Jiang Li Adapted from the slides provided by the authors Outline MIPS An ISA for Pipelining 5 stage pipelining Structural and Data Hazards Forwarding Branch Schemes Exceptions and
More informationEN2910A: Advanced Computer Architecture Topic 02: Review of classical concepts
EN2910A: Advanced Computer Architecture Topic 02: Review of classical concepts Prof. Sherief Reda School of Engineering Brown University S. Reda EN2910A FALL'15 1 Classical concepts (prerequisite) 1. Instruction
More informationEECS 322 Computer Architecture Improving Memory Access: the Cache
EECS 322 Computer Architecture Improving emory Access: the Cache Instructor: Francis G. Wolff wolff@eecs.cwru.edu Case Western Reserve University This presentation uses powerpoint animation: please viewshow
More informationEE 457 Unit 6a. Basic Pipelining Techniques
EE 47 Unit 6a Basic Pipelining Techniques 2 Pipelining Introduction Consider a drink bottling plant Filling the bottle = 3 sec. Placing the cap = 3 sec. Labeling = 3 sec. Would you want Machine = Does
More informationCOMPUTER ORGANIZATION AND DESI
COMPUTER ORGANIZATION AND DESIGN 5 Edition th The Hardware/Software Interface Chapter 4 The Processor 4.1 Introduction Introduction CPU performance factors Instruction count Determined by ISA and compiler
More informationInstr. execution impl. view
Pipelining Sangyeun Cho Computer Science Department Instr. execution impl. view Single (long) cycle implementation Multi-cycle implementation Pipelined implementation Processing an instruction Fetch instruction
More informationSISTEMI EMBEDDED. Computer Organization Pipelining. Federico Baronti Last version:
SISTEMI EMBEDDED Computer Organization Pipelining Federico Baronti Last version: 20160518 Basic Concept of Pipelining Circuit technology and hardware arrangement influence the speed of execution for programs
More informationProcessor Design CSCE Instructor: Saraju P. Mohanty, Ph. D. NOTE: The figures, text etc included in slides are borrowed
Lecture 3: General Purpose Processor Design CSCE 665 Advanced VLSI Systems Instructor: Saraju P. ohanty, Ph. D. NOTE: The figures, tet etc included in slides are borrowed from various books, websites,
More informationLecture 3. Pipelining. Dr. Soner Onder CS 4431 Michigan Technological University 9/23/2009 1
Lecture 3 Pipelining Dr. Soner Onder CS 4431 Michigan Technological University 9/23/2009 1 A "Typical" RISC ISA 32-bit fixed format instruction (3 formats) 32 32-bit GPR (R0 contains zero, DP take pair)
More informationAdvanced Computer Architecture
Advanced Computer Architecture Chapter 1 Introduction into the Sequential and Pipeline Instruction Execution Martin Milata What is a Processors Architecture Instruction Set Architecture (ISA) Describes
More informationEITF20: Computer Architecture Part2.2.1: Pipeline-1
EITF20: Computer Architecture Part2.2.1: Pipeline-1 Liang Liu liang.liu@eit.lth.se 1 Outline Reiteration Pipelining Harzards Structural hazards Data hazards Control hazards Implementation issues Multi-cycle
More informationComputer Organization and Structure. Bing-Yu Chen National Taiwan University
Computer Organization and Structure Bing-Yu Chen National Taiwan University The Processor Logic Design Conventions Building a Datapath A Simple Implementation Scheme An Overview of Pipelining Pipelined
More informationEITF20: Computer Architecture Part2.2.1: Pipeline-1
EITF20: Computer Architecture Part2.2.1: Pipeline-1 Liang Liu liang.liu@eit.lth.se 1 Outline Reiteration Pipelining Harzards Structural hazards Data hazards Control hazards Implementation issues Multi-cycle
More informationUnpipelined Machine. Pipelining the Idea. Pipelining Overview. Pipelined Machine. MIPS Unpipelined. Similar to assembly line in a factory
Pipelining the Idea Similar to assembly line in a factory Divide instruction into smaller tasks Each task is performed on subset of resources Overlap the execution of multiple instructions by completing
More informationPage 1. CISC 662 Graduate Computer Architecture. Lecture 8 - ILP 1. Pipeline CPI. Pipeline CPI (I) Pipeline CPI (II) Michela Taufer
CISC 662 Graduate Computer Architecture Lecture 8 - ILP 1 Michela Taufer Pipeline CPI http://www.cis.udel.edu/~taufer/teaching/cis662f07 Powerpoint Lecture Notes from John Hennessy and David Patterson
More informationThe Processor Pipeline. Chapter 4, Patterson and Hennessy, 4ed. Section 5.3, 5.4: J P Hayes.
The Processor Pipeline Chapter 4, Patterson and Hennessy, 4ed. Section 5.3, 5.4: J P Hayes. Pipeline A Basic MIPS Implementation Memory-reference instructions Load Word (lw) and Store Word (sw) ALU instructions
More informationComputer Architecture Computer Science & Engineering. Chapter 4. The Processor BK TP.HCM
Computer Architecture Computer Science & Engineering Chapter 4 The Processor Introduction CPU performance factors Instruction count Determined by ISA and compiler CPI and Cycle time Determined by CPU hardware
More informationCPE Computer Architecture. Appendix A: Pipelining: Basic and Intermediate Concepts
CPE 110408443 Computer Architecture Appendix A: Pipelining: Basic and Intermediate Concepts Sa ed R. Abed [Computer Engineering Department, Hashemite University] Outline Basic concept of Pipelining The
More informationFull Datapath. Chapter 4 The Processor 2
Pipelining Full Datapath Chapter 4 The Processor 2 Datapath With Control Chapter 4 The Processor 3 Performance Issues Longest delay determines clock period Critical path: load instruction Instruction memory
More informationThe Pipelined MIPS Processor
1 The niversity of Texas at Dallas Lecture #20: The Pipeline IPS Processor The Pipelined IPS Processor We complete our study of AL architecture by investigating an approach providing even higher performance
More informationECE 154A Introduction to. Fall 2012
ECE 154A Introduction to Computer Architecture Fall 2012 Dmitri Strukov Lecture 10 Floating point review Pipelined design IEEE Floating Point Format single: 8 bits double: 11 bits single: 23 bits double:
More informationPage # CISC 662 Graduate Computer Architecture. Lecture 8 - ILP 1. Pipeline CPI. Pipeline CPI (I) Michela Taufer
CISC 662 Graduate Computer Architecture Lecture 8 - ILP 1 Michela Taufer http://www.cis.udel.edu/~taufer/teaching/cis662f07 Powerpoint Lecture Notes from John Hennessy and David Patterson s: Computer Architecture,
More informationThese actions may use different parts of the CPU. Pipelining is when the parts run simultaneously on different instructions.
MIPS Pipe Line 2 Introduction Pipelining To complete an instruction a computer needs to perform a number of actions. These actions may use different parts of the CPU. Pipelining is when the parts run simultaneously
More informationInstruction Pipelining
Instruction Pipelining Simplest form is a 3-stage linear pipeline New instruction fetched each clock cycle Instruction finished each clock cycle Maximal speedup = 3 achieved if and only if all pipe stages
More informationMinimizing Data hazard Stalls by Forwarding Data Hazard Classification Data Hazards Present in Current MIPS Pipeline
Instruction Pipelining Review: MIPS In-Order Single-Issue Integer Pipeline Performance of Pipelines with Stalls Pipeline Hazards Structural hazards Data hazards Minimizing Data hazard Stalls by Forwarding
More informationInstruction Pipelining
Instruction Pipelining Simplest form is a 3-stage linear pipeline New instruction fetched each clock cycle Instruction finished each clock cycle Maximal speedup = 3 achieved if and only if all pipe stages
More informationLecture 05: Pipelining: Basic/ Intermediate Concepts and Implementation
Lecture 05: Pipelining: Basic/ Intermediate Concepts and Implementation CSE 564 Computer Architecture Summer 2017 Department of Computer Science and Engineering Yonghong Yan yan@oakland.edu www.secs.oakland.edu/~yan
More informationImprove performance by increasing instruction throughput
Improve performance by increasing instruction throughput Program execution order Time (in instructions) lw $1, 100($0) fetch 2 4 6 8 10 12 14 16 18 ALU Data access lw $2, 200($0) 8ns fetch ALU Data access
More informationInstruction Level Parallelism. ILP, Loop level Parallelism Dependences, Hazards Speculation, Branch prediction
Instruction Level Parallelism ILP, Loop level Parallelism Dependences, Hazards Speculation, Branch prediction Basic Block A straight line code sequence with no branches in except to the entry and no branches
More informationProcessor (II) - pipelining. Hwansoo Han
Processor (II) - pipelining Hwansoo Han Pipelining Analogy Pipelined laundry: overlapping execution Parallelism improves performance Four loads: Speedup = 8/3.5 =2.3 Non-stop: 2n/0.5n + 1.5 4 = number
More informationProcessor Architecture. Jin-Soo Kim Computer Systems Laboratory Sungkyunkwan University
Processor Architecture Jin-Soo Kim (jinsookim@skku.edu) Computer Systems Laboratory Sungkyunkwan University http://csl.skku.edu Moore s Law Gordon Moore @ Intel (1965) 2 Computer Architecture Trends (1)
More informationPipelined Processors. Ideal Pipelining. Example: FP Multiplier. 55:132/22C:160 Spring Jon Kuhl 1
55:3/C:60 Spring 00 Pipelined Design Motivation: Increase processor throughput with modest increase in hardware. Bandwidth or Throughput = Performance Pipelined Processors Chapter Bandwidth (BW) = no.
More informationPipelined Processor Design
Pipelined Processor Design Pipelined Implementation: MIPS Virendra Singh Computer Design and Test Lab. Indian Institute of Science (IISc) Bangalore virendra@computer.org Advance Computer Architecture http://www.serc.iisc.ernet.in/~viren/courses/aca/aca.htm
More informationDetermined by ISA and compiler. We will examine two MIPS implementations. A simplified version A more realistic pipelined version
MIPS Processor Introduction CPU performance factors Instruction count Determined by ISA and compiler CPI and Cycle time Determined by CPU hardware We will examine two MIPS implementations A simplified
More informationPage 1. Recall from Pipelining Review. Lecture 16: Instruction Level Parallelism and Dynamic Execution #1: Ideas to Reduce Stalls
CS252 Graduate Computer Architecture Recall from Pipelining Review Lecture 16: Instruction Level Parallelism and Dynamic Execution #1: March 16, 2001 Prof. David A. Patterson Computer Science 252 Spring
More informationPipelined Datapath. Reading. Sections Practice Problems: 1, 3, 8, 12
Pipelined Datapath Lecture notes from KP, H. H. Lee and S. Yalamanchili Sections 4.5 4. Practice Problems:, 3, 8, 2 ing Note: Appendices A-E in the hardcopy text correspond to chapters 7- in the online
More informationTi Parallel Computing PIPELINING. Michał Roziecki, Tomáš Cipr
Ti5317000 Parallel Computing PIPELINING Michał Roziecki, Tomáš Cipr 2005-2006 Introduction to pipelining What is this What is pipelining? Pipelining is an implementation technique in which multiple instructions
More informationAppendix C. Instructor: Josep Torrellas CS433. Copyright Josep Torrellas 1999, 2001, 2002,
Appendix C Instructor: Josep Torrellas CS433 Copyright Josep Torrellas 1999, 2001, 2002, 2013 1 Pipelining Multiple instructions are overlapped in execution Each is in a different stage Each stage is called
More informationPipelined CPUs. Study Chapter 4 of Text. Where are the registers?
Pipelined CPUs Where are the registers? Study Chapter 4 of Text Second Quiz on Friday. Covers lectures 8-14. Open book, open note, no computers or calculators. L17 Pipelined CPU I 1 Review of CPU Performance
More informationMIPS An ISA for Pipelining
Pipelining: Basic and Intermediate Concepts Slides by: Muhamed Mudawar CS 282 KAUST Spring 2010 Outline: MIPS An ISA for Pipelining 5 stage pipelining i Structural Hazards Data Hazards & Forwarding Branch
More informationChapter 4. The Processor
Chapter 4 The Processor Introduction CPU performance factors Instruction count Determined by ISA and compiler CPI and Cycle time Determined by CPU hardware We will examine two MIPS implementations A simplified
More informationSpeeding Up DLX Computer Architecture Hadassah College Spring 2018 Speeding Up DLX Dr. Martin Land
Speeding Up DLX 1 DLX Execution Stages Version 1 Clock Cycle 1 I 1 enters Instruction Fetch (IF) Clock Cycle2 I 1 moves to Instruction Decode (ID) Instruction Fetch (IF) holds state fixed Clock Cycle3
More information