Computer and Information Sciences College / Computer Science Department Enhancing Performance with Pipelining
- Gabriel Osborne
- 6 years ago
Transcription
1 Computer and Information Sciences College / Computer Science Department Enhancing Performance with Pipelining
2 Single-Cycle Design Problems
Assuming a fixed-period clock, every instruction datapath uses one clock cycle. This implies:
- CPI = 1
- cycle time is determined by the length of the longest instruction path (load)
- but several instructions could run in a shorter clock cycle: waste of time
- consider if we have more complicated instructions like floating point!
- resources used more than once in the same cycle need to be duplicated: waste of hardware and chip area
[Figure: single-cycle datapath stages IF, ID, EX, MEM, WB using IM, Reg, ALU, DM, Reg]
3 Ex.: Fixed-period clock vs. variable-period clock in a single-cycle implementation
Consider a machine with an additional floating point unit. Assume functional unit delays as follows:
- multiplexors, control unit, PC accesses, sign extension, wires: no delay
- memory: 2 ns
- ALU: 2 ns
- FP add: 8 ns
- FP mul: 16 ns
- register access (R): 1 ns
Assume instruction mix as follows:
Lw 31%, Sw 21%, R 27%, Beq 5%, J 2%, FP add 7%, FP mul 7%
Compare the performance of (a) a single-cycle implementation using a fixed-period clock with (b) one using a variable-period clock where each instruction executes in one clock cycle that is only as long as it needs to be (not really practical, but pretend it's possible!)
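4 Solution
Instruction class | Instr. mem. | Register read | ALU oper. | Data mem. | Register write | FPU add/sub | FPU mul/div | Total time (ns)
Load word         | 2 | 1 | 2 | 2 | 1 |   |    | 8
Store word        | 2 | 1 | 2 | 2 |   |   |    | 7
R-format          | 2 | 1 | 2 |   | 1 |   |    | 6
Branch            | 2 | 1 | 2 |   |   |   |    | 5
Jump              | 2 |   |   |   |   |   |    | 2
FP mul/div        | 2 | 1 |   |   | 1 |   | 16 | 20
FP add/sub        | 2 | 1 |   |   | 1 | 8 |    | 12
Clock period for fixed-period clock = longest instruction time = 20 ns.
Average clock period for variable-period clock = 8 × 31% + 7 × 21% + 6 × 27% + 5 × 5% + 2 × 2% + 20 × 7% + 12 × 7% = 8.1 ns.
Therefore, performance var-period / performance fixed-period = 20/8.1 ≈ 2.5, where T = Ic × CPI × t, with the same Ic and the same CPI for both implementations.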
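The per-class times and the two clock periods can be recomputed directly from the unit delays and the instruction mix on slide 3. The sketch below assumes that the "R" delay on that slide is the register access time and that each class uses the units in the order shown in `PATHS`; the dictionary and variable names are illustrative only.

```python
# Functional-unit delays in ns (slide 3); "reg" is assumed to be the 1 ns register access.
DELAY = {"mem": 2, "alu": 2, "fp_add": 8, "fp_mul": 16, "reg": 1}

# Units used by each instruction class, in order (illustrative names).
PATHS = {
    "lw":     ["mem", "reg", "alu", "mem", "reg"],
    "sw":     ["mem", "reg", "alu", "mem"],
    "r_type": ["mem", "reg", "alu", "reg"],
    "beq":    ["mem", "reg", "alu"],
    "j":      ["mem"],
    "fp_add": ["mem", "reg", "fp_add", "reg"],
    "fp_mul": ["mem", "reg", "fp_mul", "reg"],
}

# Instruction mix from slide 3 (fractions sum to 1).
MIX = {"lw": 0.31, "sw": 0.21, "r_type": 0.27, "beq": 0.05,
       "j": 0.02, "fp_add": 0.07, "fp_mul": 0.07}

# Time per instruction class = sum of the delays along its path.
times = {name: sum(DELAY[unit] for unit in units) for name, units in PATHS.items()}

# Fixed-period clock: every instruction pays for the slowest one.
fixed_period = max(times.values())

# Variable-period clock: average of the per-class times, weighted by the mix.
variable_period = sum(MIX[name] * times[name] for name in MIX)

speedup = fixed_period / variable_period

if __name__ == "__main__":
    print(times)
    print(fixed_period, round(variable_period, 2), round(speedup, 2))
```

Because the instruction count and CPI are the same in both designs, the ratio of clock periods is also the ratio of execution times.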
5 Fixing the problem with single-cycle designs
I- One solution: a variable-period clock with different cycle times for each instruction class
   - infeasible, as implementing a variable-speed clock is technically difficult
   Another solution: use a smaller cycle time and have different instructions take different numbers of cycles, by breaking instructions into steps and fitting each step into one cycle
II- Multicycle approach:
   - break up the instructions into steps; each step takes one clock cycle
   - at the end of one cycle, store data to be used in later cycles of the same instruction
   - balance the amount of work to be done in each step/cycle so that they are about equal
   - restrict each cycle to use each major functional unit at most once, so that such units do not have to be replicated
   - functional units can be shared between different cycles within one instruction
6 Multicycle Approach
[Figure: multicycle datapath with PC, a single memory (instruction or data), instruction register, memory data register, register file, A and B registers, ALU and ALUOut]
Note particularities of multicycle vs. single-cycle diagrams:
- single memory for data and instructions
- single ALU, no extra adders
- extra registers to hold data between clock cycles
7 Breaking instructions into steps
We break instructions into steps:
- not all instructions require all the steps
- each step takes one clock cycle
- each MIPS instruction takes from 3 to 5 cycles (steps)
To keep steps balanced in length, the design restriction is to allow each step to contain at most one ALU operation, or one register access, or one memory access.
Steps:
1. IF: Instruction fetch and PC increment
2. ID: Instruction decode and register fetch
3. EX: Execution, memory address computation, or branch completion
4. MEM: Memory access or R-type instruction completion
5. WB: Memory read completion

Step name | Action for R-type instructions | Action for memory-reference instructions | Action for branches | Action for jumps
Instruction fetch | IR = Memory[PC]; PC = PC + 4 (all instruction classes)
Instruction decode/register fetch | A = Reg[IR[25-21]]; B = Reg[IR[20-16]]; ALUOut = PC + (sign-extend(IR[15-0]) << 2) (all instruction classes)
Execution, address computation, branch/jump completion | ALUOut = A op B | ALUOut = A + sign-extend(IR[15-0]) | if (A == B) then PC = ALUOut | PC = PC[31-28] || (IR[25-0] << 2)
Memory access or R-type completion | Reg[IR[15-11]] = ALUOut | Load: MDR = Memory[ALUOut] or Store: Memory[ALUOut] = B | |
Memory read completion | | Load: Reg[IR[20-16]] = MDR | |
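The step table implies a cycle count per instruction class: loads use all five steps, stores and R-type instructions finish in four, and branches and jumps finish in three. A minimal sketch of the resulting average CPI follows; the instruction mix in `EXAMPLE_MIX` is a hypothetical one chosen for illustration, not taken from the slides.

```python
# Steps used by each instruction class in the multicycle design (from the step table).
STEPS = {
    "lw":     ["IF", "ID", "EX", "MEM", "WB"],
    "sw":     ["IF", "ID", "EX", "MEM"],
    "r_type": ["IF", "ID", "EX", "MEM"],
    "beq":    ["IF", "ID", "EX"],
    "j":      ["IF", "ID", "EX"],
}

def cycles(instr_class):
    """Clock cycles one instruction of this class takes (one cycle per step)."""
    return len(STEPS[instr_class])

def average_cpi(mix):
    """Average CPI for a mix given as {class: fraction}, fractions summing to 1."""
    return sum(fraction * cycles(name) for name, fraction in mix.items())

# Hypothetical mix, for illustration only.
EXAMPLE_MIX = {"lw": 0.25, "sw": 0.10, "r_type": 0.52, "beq": 0.11, "j": 0.02}

if __name__ == "__main__":
    print({name: cycles(name) for name in STEPS})
    print(round(average_cpi(EXAMPLE_MIX), 2))
```

The average CPI is above 1, but each cycle is much shorter than the single-cycle clock, which is where the multicycle design gains.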
8 Pipelining
Pipelining is an implementation technique in which multiple instructions are overlapped in execution. Start work ASAP!! Do not waste time!
[Figure: laundry analogy, task order A, B, C, D against time starting at 6 PM, shown not pipelined and pipelined]
Assume 30 min. for each task: wash, dry, fold, store. Separate tasks use separate hardware, so they can be overlapped.
Why is pipelining easy with MIPS?
1) all instructions are the same length
   - fetch and decode stages are similar for all instructions
2) few instruction formats
   - simplifies instruction decode and makes it possible in one stage
3) memory operands appear only in loads/stores
   - so memory access can be deferred to exactly one later stage
4) operands are aligned in memory
   - one data transfer instruction requires one memory access stage
What about x86? (instructions are 1 to 17 bytes long)
9 Pipelined Execution Representation
[Figure: five instructions in program flow, each passing through IFtch, Dcd, Exec, Mem, WB, with each instruction starting one clock cycle after the previous one]
To simplify the pipeline, every instruction takes the same number of steps, called stages.
One clock cycle per stage.
10 Pipelined vs. Single-Cycle Instruction Execution: the Plan
[Figure, single-cycle: program execution order (in instructions) vs. time for three consecutive loads into $1, $2 and $3, each addressed off $0. Each load passes through instruction fetch, register read, ALU, data access and register write, and takes 8 ns before the next one starts. Single-cycle T?]
[Figure, pipelined: the same three loads, with a new instruction fetched every 2 ns and each stage taking 2 ns. Pipelined T?]
Assume 2 ns for memory access and ALU operation, and 1 ns for register access: therefore, the single-cycle clock is 8 ns and the pipelined clock cycle is 2 ns.
Assume the write to the register file occurs in the first half of the clock cycle and the read in the second half.
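The two "T?" questions on this slide can be answered with a short calculation. The sketch below assumes the slide's 8 ns single-cycle clock, a 5-stage pipeline with a 2 ns clock, and no hazards; the function names are illustrative.

```python
def single_cycle_time(n_instructions, period_ns=8):
    """Total time when each instruction takes one full single-cycle clock period."""
    return n_instructions * period_ns

def pipelined_time(n_instructions, stages=5, period_ns=2):
    """Total time in an ideal pipeline: fill the pipeline once, then finish one instruction per cycle."""
    return (stages + n_instructions - 1) * period_ns

if __name__ == "__main__":
    for n in (3, 1000):
        t_single = single_cycle_time(n)
        t_pipe = pipelined_time(n)
        print(n, t_single, t_pipe, round(t_single / t_pipe, 2))
```

For the three loads on the slide the pipeline needs 14 ns against 24 ns; for long instruction streams the ratio approaches 8/2 = 4 rather than the number of stages, because the stages are not perfectly balanced (the register stages need only 1 ns but are allotted 2 ns).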
11 Hazards
What makes it hard?
- Structural hazards: different instructions, at different stages in the pipeline, want to use the same hardware resource
- Control hazards: deciding on a control action depends on a previous instruction
- Data hazards: an instruction in the pipeline requires data to be computed by a previous instruction still in the pipeline
We first briefly examine these potential hazards individually.
12 Structural Hazards
Structural hazard: inadequate hardware to simultaneously support all instructions in the pipeline in the same clock cycle.
E.g., suppose a single instruction and data memory in the pipeline with one read port: there is a structural hazard between the first and fourth instructions, because the load's data access and the fourth instruction's fetch need the memory in the same clock cycle.
[Figure: instruction order (Load, Instr 1, Instr 2, Instr 3, Instr 4) vs. time in clock cycles, each instruction using Mem, Reg, ALU, Mem, Reg]
Structural hazards are easy to avoid! Hazards can always be resolved by waiting.
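The memory conflict described on this slide can be found mechanically. The sketch below assumes a 5-stage pipeline with one instruction fetched per cycle and no stalls, a single one-port memory shared by instruction fetch and data access, and that only `lw` and `sw` use the memory in their MEM stage; the function name and tuple layout are illustrative.

```python
MEMORY_OPS = {"lw", "sw"}

def structural_conflicts(program):
    """Return (cycle, index of instruction in MEM, index of instruction in IF) for each clash.

    Instruction i (0-based) is fetched in cycle i + 1 and reaches MEM in cycle i + 4.
    """
    conflicts = []
    for i, op in enumerate(program):
        if op not in MEMORY_OPS:
            continue
        mem_cycle = i + 4            # IF in cycle i + 1, MEM three cycles later
        fetch_index = mem_cycle - 1  # the instruction whose IF falls in that same cycle
        if fetch_index < len(program):
            conflicts.append((mem_cycle, i, fetch_index))
    return conflicts

if __name__ == "__main__":
    # A load followed by four instructions that do not access data memory.
    print(structural_conflicts(["lw", "add", "add", "add", "add"]))
```

With separate instruction and data memories the two accesses go to different units and the list of conflicts would be empty by construction.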
13 Control Hazards
Control hazard: need to make a decision based on the result of a previous instruction still executing in the pipeline.
Solution 1: Stall the pipeline
[Figure: program execution order (in instructions) vs. time for add $4, $5, $6; beq $1, $2, 40; lw $3, 300($0). A bubble is inserted after the beq, so the lw is fetched 4 ns after the beq instead of 2 ns: a pipeline stall.]
Note that the branch outcome is computed in the ID stage with added hardware (discussed later).
14 Control Hazards
Solution 2: Predict the branch outcome
- e.g., predict branch-not-taken: guess one direction, then back up if wrong
- Random prediction: correct about 50% of the time
- History-based prediction: record the recent history of each branch; correct about 90% of the time
[Figure, prediction success: add $4, $5, $6; beq $1, $2, 40; lw $3, 300($0), each fetched 2 ns after the previous one]
[Figure, prediction failure: add $4, $5, $6; beq $1, $2, 40; a bubble; then or $7, $8, $9 fetched 4 ns after the beq. The lw is undone (= flushed).]
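To make the difference between a fixed guess and history-based prediction concrete, the sketch below compares always predicting not-taken with a simple last-outcome (1-bit) predictor. The 1-bit scheme is one assumed example of "recording recent history", not necessarily the scheme the slide has in mind, and the branch trace is hypothetical.

```python
def not_taken_accuracy(outcomes):
    """Fraction of branches guessed correctly when always predicting not-taken."""
    return sum(1 for taken in outcomes if not taken) / len(outcomes)

def last_outcome_accuracy(outcomes, initial_prediction=False):
    """Fraction guessed correctly when predicting that a branch repeats its previous outcome."""
    prediction = initial_prediction
    correct = 0
    for taken in outcomes:
        if prediction == taken:
            correct += 1
        prediction = taken  # remember the most recent outcome
    return correct / len(outcomes)

# Hypothetical trace: a loop branch taken 9 times, then not taken once, repeated 10 times.
TRACE = ([True] * 9 + [False]) * 10

if __name__ == "__main__":
    print(not_taken_accuracy(TRACE))
    print(last_outcome_accuracy(TRACE))
```

On this trace the history-based predictor only misses at the start and end of each loop run, while the fixed not-taken guess is wrong on every taken branch.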
15 Control Hazards
Solution 3: Delayed branch: always execute the sequentially next statement, with the branch executing after a one-instruction delay. It is the compiler's job to find a statement that is independent of the branch outcome and can be put in the slot.
MIPS does this, but it is an option in SPIM (Simulator -> Settings).
[Figure: program execution order (in instructions) vs. time for beq $1, $2, 40; add $4, $5, $6 (delayed branch slot); lw $3, 300($0), each fetched 2 ns after the previous one]
Delayed branch: beq is followed by an add that is independent of the branch outcome.
16 Data Hazards
Data hazard: an instruction depends on the result of a previous instruction still executing in the pipeline.
Solution: forward data if possible.
[Figure: pipeline diagram for add $s0, $t0, $t1 (IF ID EX MEM WB); shading indicates use of a unit: left half = write, right half = read]
[Figure: add $s0, $t0, $t1 followed by sub $t2, $s0, $t3, each passing through IF ID EX MEM WB]
Without forwarding (blue line), the data has to go back in time; with forwarding (red line), the data is available in time.
17 Data Hazards
Forwarding may not be enough, e.g., if an R-type instruction following a load uses the result of the load: this is called a load-use data hazard.
[Figure: lw $s0, 20($t1) followed by sub $t2, $s0, $t3, each passing through IF ID EX MEM WB]
Without a stall it is impossible to provide input to the sub instruction in time.
[Figure: lw $s0, 20($t1), then a bubble, then sub $t2, $s0, $t3]
With a one-stage stall, forwarding can get the data to the sub instruction in time.
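The stall counts behind these data-hazard slides follow from when a result is produced and when it is needed. The sketch below assumes the 5-stage pipeline (IF ID EX MEM WB), a register file that is written in the first half of a cycle and read in the second half, and forwarding paths into the EX stage; `distance` is 1 when the consumer immediately follows the producer. The function and its parameters are illustrative.

```python
def stall_cycles(distance, producer_is_load=False, forwarding=True):
    """Bubbles needed between a producer and a consumer that reads its result.

    Without forwarding the consumer's ID must not come before the producer's WB,
    which leaves a gap of 3 stages to cover. With forwarding an ALU result is
    ready for the very next EX, while a loaded value is only ready after MEM.
    """
    if not forwarding:
        return max(0, 3 - distance)
    if producer_is_load:
        return max(0, 2 - distance)
    return 0

if __name__ == "__main__":
    print(stall_cycles(1, forwarding=False))        # add followed by dependent sub, no forwarding
    print(stall_cycles(1))                          # same pair with forwarding
    print(stall_cycles(1, producer_is_load=True))   # lw followed by dependent sub (load-use)
```

The load-use case is the one where forwarding alone is not enough: the single remaining bubble matches the one-stage stall shown on the slide.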
18 Reordering Code to Avoid Pipeline Stall (Software Solution)
Example:
    lw $t0, 0($t1)
    lw $t2, 4($t1)
    sw $t2, 0($t1)    (data hazard: uses $t2 right after it is loaded)
    sw $t0, 4($t1)
Reordered code:
    lw $t0, 0($t1)
    lw $t2, 4($t1)
    sw $t0, 4($t1)    (interchanged)
    sw $t2, 0($t1)    (interchanged)
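A compiler pass that looks for this kind of stall only needs to check whether the instruction right after a load reads the loaded register. The sketch below is a simplified detector written for the two sequences on this slide; it assumes that every instruction except `sw` writes its first register operand, and it treats the register stored by `sw` as a read, as the slide does. The helper names are illustrative.

```python
import re

def reads_and_write(instr):
    """Return (set of registers read, register written or None) for one instruction."""
    op = instr.split()[0]
    regs = re.findall(r"\$\w+", instr)
    if op == "sw":
        return set(regs), None       # a store reads both its data and base registers
    return set(regs[1:]), regs[0]    # first operand is the destination

def load_use_hazards(program):
    """Pairs (i, i + 1) where instruction i is a load and i + 1 reads the loaded register."""
    hazards = []
    for i in range(len(program) - 1):
        if program[i].split()[0] != "lw":
            continue
        _, loaded = reads_and_write(program[i])
        next_reads, _ = reads_and_write(program[i + 1])
        if loaded in next_reads:
            hazards.append((i, i + 1))
    return hazards

ORIGINAL = ["lw $t0, 0($t1)", "lw $t2, 4($t1)", "sw $t2, 0($t1)", "sw $t0, 4($t1)"]
REORDERED = ["lw $t0, 0($t1)", "lw $t2, 4($t1)", "sw $t0, 4($t1)", "sw $t2, 0($t1)"]

if __name__ == "__main__":
    print(load_use_hazards(ORIGINAL))
    print(load_use_hazards(REORDERED))
```

Swapping the two stores is safe here because they write different addresses and neither changes a register the other uses.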
19 Pipelined Datapath - Single-Cycle Datapath Steps
[Figure: single-cycle datapath (PC, instruction memory, register file, sign extend, shift left 2, ALU, adders, data memory, multiplexers) divided into five steps]
- IF: Instruction Fetch
- ID: Instruction Decode
- EX: Execute/Address Calc.
- MEM: Memory Access
- WB: Write Back
20 Pipelined Datapath
Idea: What happens if we break the execution into multiple cycles, but keep the extra hardware?
Answer: We may be able to start executing a new instruction at each clock cycle (pipelining), but we shall need extra registers to hold data between cycles: pipeline registers.
Pipeline registers are wide enough to hold the data coming in.
[Figure: datapath divided by the pipeline registers IF/ID (64 bits), ID/EX (128 bits), EX/MEM (97 bits) and MEM/WB (64 bits)]
21 Pipelined Datapath
Pipeline registers are wide enough to hold the data coming in.
[Figure: datapath divided by the pipeline registers IF/ID (64 bits), ID/EX (128 bits), EX/MEM (97 bits) and MEM/WB (64 bits)]
Only data flowing right to left may cause a hazard. Why?
22 Bug in the Datapath
The write register number comes from another, later instruction!
[Figure: pipelined datapath with the registers IF/ID, ID/EX, EX/MEM and MEM/WB, in which the write register number (WN) is taken directly from the instruction currently in IF/ID]
23 Corrected Datapath
[Figure: pipelined datapath with the pipeline registers IF/ID (64 bits), ID/EX (133 bits), EX/MEM (102 bits) and MEM/WB (69 bits)]
The destination register number is also passed through the ID/EX, EX/MEM and MEM/WB registers, which are now wider by 5 bits.
24 Single-Clock-Cycle Diagram: Clock Cycle 1
Example: lw $t0, 10($t1); sw $t3, 20($t4); add $t5, $t6, $t7; sub $t8, $t9, $t10
IF: LW
25 Single-Clock-Cycle Diagram: Clock Cycle 2
IF: SW, ID: LW
26 Single-Clock-Cycle Diagram: Clock Cycle 3
IF: ADD, ID: SW, EX: LW
27 Single-Clock-Cycle Diagram: Clock Cycle 4
IF: SUB, ID: ADD, EX: SW, MEM: LW
28 Single-Clock-Cycle Diagram: Clock Cycle 5
ID: SUB, EX: ADD, MEM: SW, WB: LW
29 Single-Clock-Cycle Diagram: Clock Cycle 6
EX: SUB, MEM: ADD, WB: SW
30 Single-Clock-Cycle Diagram: Clock Cycle 7
MEM: SUB, WB: ADD
31 Single-Clock-Cycle Diagram: Clock Cycle 8
WB: SUB
32 Alternative View: Multiple-Clock-Cycle Diagram
Time axis:        | CC 1 | CC 2 | CC 3 | CC 4 | CC 5 | CC 6 | CC 7 | CC 8
lw $t0, 10($t1)   | IM   | REG  | ALU  | DM   | REG  |      |      |
sw $t3, 20($t4)   |      | IM   | REG  | ALU  | DM   | REG  |      |
add $t5, $t6, $t7 |      |      | IM   | REG  | ALU  | DM   | REG  |
sub $t8, $t9, $t10|      |      |      | IM   | REG  | ALU  | DM   | REG
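A multiple-clock-cycle diagram like this one can be generated for any instruction sequence. The sketch below assumes an ideal pipeline with no stalls, in which instruction i enters the first stage in clock cycle i + 1; the function names and the text layout are illustrative.

```python
STAGES = ["IM", "REG", "ALU", "DM", "REG"]

def pipeline_diagram(program):
    """Return a list of (instruction, cells) rows; cells[c] is the unit used in cycle c + 1."""
    total_cycles = len(program) + len(STAGES) - 1
    rows = []
    for start, instr in enumerate(program):
        cells = [""] * total_cycles
        for offset, stage in enumerate(STAGES):
            cells[start + offset] = stage
        rows.append((instr, cells))
    return rows

def render(rows):
    """Format the rows as fixed-width text with a CC header line."""
    width = max(len(instr) for instr, _ in rows)
    header = " " * width + " | " + " ".join(f"CC{c + 1:<3}" for c in range(len(rows[0][1])))
    lines = [header]
    for instr, cells in rows:
        lines.append(instr.ljust(width) + " | " + " ".join(cell.ljust(5) for cell in cells))
    return "\n".join(lines)

PROGRAM = ["lw $t0, 10($t1)", "sw $t3, 20($t4)", "add $t5, $t6, $t7", "sub $t8, $t9, $t10"]
rows = pipeline_diagram(PROGRAM)

if __name__ == "__main__":
    print(render(rows))
```

Reading a single column of the output gives the corresponding single-clock-cycle view of the pipeline.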
33 Notes
- No write control is needed for the pipeline registers and the PC, since they are updated at every clock cycle
- To specify the control for the pipeline, set the control values during each pipeline stage
- Control lines can be divided into 5 groups:
  - IF: none
  - ID: none
  - EX (ALU): RegDst, ALUOp, ALUSrc
  - MEM: Branch, MemRead, MemWrite
  - WB: MemtoReg, RegWrite
- Group these nine control lines into 3 subsets: ALUControl, MEMControl, WBControl
- Control signals are generated at the ID stage. How do we pass them to the other stages?
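One way to answer the closing question is to carry the control bits along in the pipeline registers, dropping each group once its stage has used it. The sketch below illustrates that idea; the signal values for `lw` and R-type instructions are the usual textbook settings and are stated here as an assumption, ALUOp is shown as two bits so that the total is nine lines, and the function name is illustrative.

```python
# Control values produced in ID, grouped by the stage that consumes them (assumed textbook settings).
CONTROL = {
    "lw": {
        "EX":  {"RegDst": 0, "ALUOp1": 0, "ALUOp0": 0, "ALUSrc": 1},
        "MEM": {"Branch": 0, "MemRead": 1, "MemWrite": 0},
        "WB":  {"RegWrite": 1, "MemtoReg": 1},
    },
    "r_type": {
        "EX":  {"RegDst": 1, "ALUOp1": 1, "ALUOp0": 0, "ALUSrc": 0},
        "MEM": {"Branch": 0, "MemRead": 0, "MemWrite": 0},
        "WB":  {"RegWrite": 1, "MemtoReg": 0},
    },
}

def control_in_pipeline_registers(op):
    """Control groups still carried by each pipeline register for one instruction."""
    ctrl = CONTROL[op]
    return {
        "ID/EX":  {group: ctrl[group] for group in ("EX", "MEM", "WB")},
        "EX/MEM": {group: ctrl[group] for group in ("MEM", "WB")},
        "MEM/WB": {group: ctrl[group] for group in ("WB",)},
    }

if __name__ == "__main__":
    for register, groups in control_in_pipeline_registers("lw").items():
        print(register, groups)
```

Each pipeline register therefore needs a control portion that shrinks from nine bits in ID/EX to five in EX/MEM and two in MEM/WB.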
35 Pipelined Datapath with Control
[Figure: pipelined datapath with the control unit in ID. The ID/EX register carries the WB, M and EX control fields, EX/MEM carries WB and M, and MEM/WB carries WB. Signals shown: PCSrc, RegWrite, ALUSrc, ALUOp, RegDst, Branch, MemRead, MemWrite, MemtoReg; instruction fields [15-0] (sign extend), [20-16] and [15-11].]
Control signals emanate from the control portions of the pipeline registers.
36 Instruction-Level Parallelism (ILP)
Pipelining: executing multiple instructions in parallel.
To increase ILP:
- Deeper pipeline
  - less work per stage, so a shorter clock cycle
- Multiple issue
  - replicate pipeline stages: multiple pipelines
  - start multiple instructions per clock cycle
  - CPI < 1, so use Instructions Per Cycle (IPC)
  - e.g., 4 GHz 4-way multiple-issue: 16 BIPS, peak CPI = 0.25, peak IPC = 4
  - but dependencies reduce this in practice
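The peak figures quoted in the multiple-issue example follow from the clock rate and the issue width. A minimal sketch, assuming that every issue slot is filled on every cycle (which is exactly what dependencies prevent in practice); the function name is illustrative.

```python
def peak_rates(clock_hz, issue_width):
    """Return (peak BIPS, peak CPI, peak IPC) for a multiple-issue processor."""
    peak_ipc = issue_width
    peak_cpi = 1 / peak_ipc
    peak_bips = clock_hz * peak_ipc / 1e9  # billions of instructions per second
    return peak_bips, peak_cpi, peak_ipc

if __name__ == "__main__":
    print(peak_rates(4e9, 4))
```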
37 How ILP Works
- Issuing multiple instructions per cycle requires fetching multiple instructions from memory per cycle; the number fetched per cycle is called the superscalar degree or issue width.
- To find independent instructions, we must have a big pool of instructions to choose from, called the instruction buffer (IB). As the IB length increases, the complexity of the decoder (control) increases, which increases the datapath cycle time.
- Instructions are prefetched sequentially by an instruction fetch unit (IFU) that operates independently from the datapath control: fetch the instruction at (PC)+L, where L is the IB size, or as directed by the branch predictor.
38 Compiler/Hardware Speculation
Compiler can reorder instructions: Static Multiple Issue
- Compiler groups instructions into issue packets
  - a group of instructions that can be issued in a single cycle
  - determined by the pipeline resources required
- Think of an issue packet as a very long instruction
  - specifies multiple concurrent operations
  - Very Long Instruction Word (VLIW)
- Compiler must remove some/all hazards
  - reorder instructions into issue packets
  - no dependencies within a packet
  - varies between ISAs; compiler must know!
  - pad with nop if necessary
Hardware can look ahead for instructions to execute
- Buffer results until it determines they are actually needed
- Flush buffers on incorrect speculation
Explicitly Parallel Instruction Computer (EPIC).
39 Loop Unrolling
Replicate the loop body to expose more parallelism, renaming the registers used by each copy.
    Loop: lw   $t0, 0($s1)
          addu $t0, $t0, $s2
          sw   $t0, 0($s1)
          addi $s1, $s1, -4
          bne  $s1, $zero, Loop
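The effect of unrolling and renaming can be illustrated at a higher level. The loop on this slide adds a scalar to every element of an array; the sketch below shows a rolled version and a version unrolled four times, where the temporaries `t0` to `t3` play the role of renamed registers so that the four updates are independent of each other. This is a Python illustration of the transformation, not the MIPS code a compiler would emit, and the function names are illustrative.

```python
def add_scalar_rolled(values, s):
    """One element per iteration, as in the loop on the slide."""
    out = list(values)
    for i in range(len(out)):
        out[i] = out[i] + s
    return out

def add_scalar_unrolled(values, s):
    """Four elements per iteration, each with its own temporary (register renaming)."""
    out = list(values)
    n = len(out)
    i = 0
    while i + 4 <= n:
        t0 = out[i] + s
        t1 = out[i + 1] + s
        t2 = out[i + 2] + s
        t3 = out[i + 3] + s
        out[i], out[i + 1], out[i + 2], out[i + 3] = t0, t1, t2, t3
        i += 4
    while i < n:  # leftover elements when the length is not a multiple of 4
        out[i] = out[i] + s
        i += 1
    return out

if __name__ == "__main__":
    data = list(range(10))
    print(add_scalar_rolled(data, 3))
    print(add_scalar_unrolled(data, 3))
```

Unrolling also cuts the loop overhead: the index update and the branch run once per four elements instead of once per element.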
40 HW Schemes: Instruction Parallelism
Why in HW at run time?
- Works when the real dependences can't be known at compile time
- Compiler is simpler
- Code for one machine runs well on another
Key idea: allow instructions behind a stall to proceed
    DIVD F0, F2, F4
    ADDD F10, F0, F8
    SUBD F12, F8, F14
Enables out-of-order execution => out-of-order completion.
The ID stage checks for hazards. If there are no hazards, issue the instruction for execution.
41 Dynamic Multiple Issue (Superscalar)
Superscalar processors: an advanced pipelining technique that enables the processor to execute more than one instruction per clock cycle.
- CPU decides whether to issue 0, 1, ..., up to the peak IPC instructions each cycle
  - avoiding structural and data hazards (dynamic pipeline)
- Avoids the need for compiler scheduling
- Allows the CPU to execute instructions out of order to avoid stalls
  - but commit results to registers in order
Example:
    lw   $t0, 20($s2)
    addu $t1, $t0, $t2
    sub  $s4, $s4, $t3
    slti $t5, $s4, 20
Can start sub while addu is waiting for lw.
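Why sub can start early is visible from the read-after-write dependences in the example. The sketch below is a simplified analysis written for this four-instruction sequence: it assumes every instruction writes its first register operand (so it does not handle stores) and it only tracks true (RAW) dependences. The helper names are illustrative.

```python
import re

def split_registers(instr):
    """Return (destination register, list of source registers)."""
    regs = re.findall(r"\$\w+", instr)
    return regs[0], regs[1:]

def raw_dependencies(program):
    """Map each instruction index to the set of earlier instructions whose results it reads."""
    last_writer = {}
    deps = {}
    for i, instr in enumerate(program):
        dest, sources = split_registers(instr)
        deps[i] = {last_writer[r] for r in sources if r in last_writer}
        last_writer[dest] = i
    return deps

PROGRAM = [
    "lw $t0, 20($s2)",
    "addu $t1, $t0, $t2",
    "sub $s4, $s4, $t3",
    "slti $t5, $s4, 20",
]

if __name__ == "__main__":
    for index, producers in raw_dependencies(PROGRAM).items():
        print(PROGRAM[index], "depends on", sorted(producers))
```

The sub has no producers in the window, so a dynamically scheduled processor may execute it while addu is still waiting for the load; slti in turn has to wait for sub.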
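43 Speed Up Equation for Pipelining
$$\text{CPI}_{\text{pipelined}} = \text{Ideal CPI} + \text{Pipeline stall clock cycles per instruction}$$
$$\text{Speedup} = \frac{\text{Ideal CPI} \times \text{Pipeline depth}}{\text{Ideal CPI} + \text{Pipeline stall CPI}} \times \frac{\text{Clock cycle}_{\text{unpipelined}}}{\text{Clock cycle}_{\text{pipelined}}}$$
With an ideal CPI of 1:
$$\text{Speedup} = \frac{\text{Pipeline depth}}{1 + \text{Pipeline stall CPI}} \times \frac{\text{Clock cycle}_{\text{unpipelined}}}{\text{Clock cycle}_{\text{pipelined}}}$$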
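The speedup equation is easy to evaluate for concrete numbers. The sketch below implements the general form with the ideal CPI defaulting to 1; the parameter names and the sample values (a 5-stage pipeline with 0.25 stall cycles per instruction and equal clock cycles) are hypothetical.

```python
def pipeline_speedup(depth, stall_cpi, cycle_unpipelined, cycle_pipelined, ideal_cpi=1.0):
    """Speedup = (ideal CPI x depth) / (ideal CPI + stall CPI) x (unpipelined cycle / pipelined cycle)."""
    cpi_term = (ideal_cpi * depth) / (ideal_cpi + stall_cpi)
    clock_term = cycle_unpipelined / cycle_pipelined
    return cpi_term * clock_term

if __name__ == "__main__":
    print(pipeline_speedup(5, 0.0, 1.0, 1.0))   # no stalls: speedup equals the pipeline depth
    print(pipeline_speedup(5, 0.25, 1.0, 1.0))  # stalls eat into the ideal speedup
```

With no stalls and equal clock cycles the speedup equals the pipeline depth; every stall cycle per instruction enlarges the denominator and pulls the result below that ideal.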
CSE 2021: Computer Organization Single Cycle (Review) Lecture-10b CPU Design : Pipelining-1 Overview, Datapath and control Shakil M. Khan 2 Single Cycle with Jump Multi-Cycle Implementation Instruction:
More informationChapter 4. The Processor. Computer Architecture and IC Design Lab
Chapter 4 The Processor Introduction CPU performance factors CPI Clock Cycle Time Instruction count Determined by ISA and compiler CPI and Cycle time Determined by CPU hardware We will examine two MIPS
More informationECE369. Chapter 5 ECE369
Chapter 5 1 State Elements Unclocked vs. Clocked Clocks used in synchronous logic Clocks are needed in sequential logic to decide when an element that contains state should be updated. State element 1
More informationLecture 8: Control COS / ELE 375. Computer Architecture and Organization. Princeton University Fall Prof. David August
Lecture 8: Control COS / ELE 375 Computer Architecture and Organization Princeton University Fall 2015 Prof. David August 1 Datapath and Control Datapath The collection of state elements, computation elements,
More informationFull Datapath. Chapter 4 The Processor 2
Pipelining Full Datapath Chapter 4 The Processor 2 Datapath With Control Chapter 4 The Processor 3 Performance Issues Longest delay determines clock period Critical path: load instruction Instruction memory
More informationChapter 4 The Processor 1. Chapter 4D. The Processor
Chapter 4 The Processor 1 Chapter 4D The Processor Chapter 4 The Processor 2 Instruction-Level Parallelism (ILP) Pipelining: executing multiple instructions in parallel To increase ILP Deeper pipeline
More informationChapter 4. The Processor. Jiang Jiang
Chapter 4 The Processor Jiang Jiang jiangjiang@ic.sjtu.edu.cn [Adapted from Computer Organization and Design, 4 th Edition, Patterson & Hennessy, 2008, MK] Chapter 4 The Processor 2 Introduction CPU performance
More information1 Hazards COMP2611 Fall 2015 Pipelined Processor
1 Hazards Dependences in Programs 2 Data dependence Example: lw $1, 200($2) add $3, $4, $1 add can t do ID (i.e., read register $1) until lw updates $1 Control dependence Example: bne $1, $2, target add
More informationHardware-based Speculation
Hardware-based Speculation Hardware-based Speculation To exploit instruction-level parallelism, maintaining control dependences becomes an increasing burden. For a processor executing multiple instructions
More informationProcessor (I) - datapath & control. Hwansoo Han
Processor (I) - datapath & control Hwansoo Han Introduction CPU performance factors Instruction count - Determined by ISA and compiler CPI and Cycle time - Determined by CPU hardware We will examine two
More informationOrange Coast College. Business Division. Computer Science Department. CS 116- Computer Architecture. Pipelining
Orange Coast College Business Division Computer Science Department CS 116- Computer Architecture Pipelining Recall Pipelining is parallelizing execution Key to speedups in processors Split instruction
More informationMIPS Pipelining. Computer Organization Architectures for Embedded Computing. Wednesday 8 October 14
MIPS Pipelining Computer Organization Architectures for Embedded Computing Wednesday 8 October 14 Many slides adapted from: Computer Organization and Design, Patterson & Hennessy 4th Edition, 2011, MK
More information5 th Edition. The Processor We will examine two MIPS implementations A simplified version A more realistic pipelined version
COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface Chapter 4 5 th Edition Introduction CPU performance factors Instruction count Determined by ISA and compiler CPI and Cycle time Determined
More informationDesign of the MIPS Processor (contd)
Design of the MIPS Processor (contd) First, revisit the datapath for add, sub, lw, sw. We will augment it to accommodate the beq and j instructions. Execution of branch instructions beq $at, $zero, L add
More informationCOMPUTER ORGANIZATION AND DESIGN. The Hardware/Software Interface. Chapter 4. The Processor: C Multiple Issue Based on P&H
COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface Chapter 4 The Processor: C Multiple Issue Based on P&H Instruction-Level Parallelism (ILP) Pipelining: executing multiple instructions in
More informationChapter 4 The Processor 1. Chapter 4B. The Processor
Chapter 4 The Processor 1 Chapter 4B The Processor Chapter 4 The Processor 2 Control Hazards Branch determines flow of control Fetching next instruction depends on branch outcome Pipeline can t always
More informationPipeline design. Mehran Rezaei
Pipeline design Mehran Rezaei How Can We Improve the Performance? Exec Time = IC * CPI * CCT Optimization IC CPI CCT Source Level * Compiler * * ISA * * Organization * * Technology * With Pipelining We
More informationCOMPUTER ORGANIZATION AND DESIGN
ARM COMPUTER ORGANIZATION AND DESIGN Edition The Hardware/Software Interface Chapter 4 The Processor Modified and extended by R.J. Leduc - 2016 To understand this chapter, you will need to understand some
More informationDesign of the MIPS Processor
Design of the MIPS Processor We will study the design of a simple version of MIPS that can support the following instructions: I-type instructions LW, SW R-type instructions, like ADD, SUB Conditional
More informationMultiple Instruction Issue. Superscalars
Multiple Instruction Issue Multiple instructions issued each cycle better performance increase instruction throughput decrease in CPI (below 1) greater hardware complexity, potentially longer wire lengths
More informationHomework 5. Start date: March 24 Due date: 11:59PM on April 10, Monday night. CSCI 402: Computer Architectures
Homework 5 Start date: March 24 Due date: 11:59PM on April 10, Monday night 4.1.1, 4.1.2 4.3 4.8.1, 4.8.2 4.9.1-4.9.4 4.13.1 4.16.1, 4.16.2 1 CSCI 402: Computer Architectures The Processor (4) Fengguang
More informationAdapted from instructor s. Organization and Design, 4th Edition, Patterson & Hennessy, 2008, MK]
Review and Advanced d Concepts Adapted from instructor s supplementary material from Computer Organization and Design, 4th Edition, Patterson & Hennessy, 2008, MK] Pipelining Review PC IF/ID ID/EX EX/M
More informationPipelining! Advanced Topics on Heterogeneous System Architectures. Politecnico di Milano! Seminar DEIB! 30 November, 2017!
Advanced Topics on Heterogeneous System Architectures Pipelining! Politecnico di Milano! Seminar Room @ DEIB! 30 November, 2017! Antonio R. Miele! Marco D. Santambrogio! Politecnico di Milano! 2 Outline!
More informationCSE 2021 COMPUTER ORGANIZATION
CSE 22 COMPUTER ORGANIZATION HUGH CHESSER CHESSER HUGH CSEB 2U 2U CSEB Agenda Topics:. Sample Exam/Quiz Q - Review 2. Multiple cycle implementation Patterson: Section 4.5 Reminder: Quiz #2 Next Wednesday
More informationProcessor: Multi- Cycle Datapath & Control
Processor: Multi- Cycle Datapath & Control (Based on text: David A. Patterson & John L. Hennessy, Computer Organization and Design: The Hardware/Software Interface, 3 rd Ed., Morgan Kaufmann, 27) COURSE
More informationInstruction word R0 R1 R2 R3 R4 R5 R6 R8 R12 R31
4.16 Exercises 419 Exercise 4.11 In this exercise we examine in detail how an instruction is executed in a single-cycle datapath. Problems in this exercise refer to a clock cycle in which the processor
More informationPipelining. CSC Friday, November 6, 2015
Pipelining CSC 211.01 Friday, November 6, 2015 Performance Issues Longest delay determines clock period Critical path: load instruction Instruction memory register file ALU data memory register file Not
More informationCS 61C: Great Ideas in Computer Architecture (Machine Structures) Lecture 32: Pipeline Parallelism 3
CS 61C: Great Ideas in Computer Architecture (Machine Structures) Lecture 32: Pipeline Parallelism 3 Instructor: Dan Garcia inst.eecs.berkeley.edu/~cs61c! Compu@ng in the News At a laboratory in São Paulo,
More informationCOMPUTER ORGANIZATION AND DESIGN. The Hardware/Software Interface. Chapter 4. The Processor: A Based on P&H
COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface Chapter 4 The Processor: A Based on P&H Introduction We will examine two MIPS implementations A simplified version A more realistic pipelined
More informationCS 61C: Great Ideas in Computer Architecture Pipelining and Hazards
CS 61C: Great Ideas in Computer Architecture Pipelining and Hazards Instructors: Vladimir Stojanovic and Nicholas Weaver http://inst.eecs.berkeley.edu/~cs61c/sp16 1 Pipelined Execution Representation Time
More informationT = I x CPI x C. Both effective CPI and clock cycle C are heavily influenced by CPU design. CPI increased (3-5) bad Shorter cycle good
CPU performance equation: T = I x CPI x C Both effective CPI and clock cycle C are heavily influenced by CPU design. For single-cycle CPU: CPI = 1 good Long cycle time bad On the other hand, for multi-cycle
More informationFour Steps of Speculative Tomasulo cycle 0
HW support for More ILP Hardware Speculative Execution Speculation: allow an instruction to issue that is dependent on branch, without any consequences (including exceptions) if branch is predicted incorrectly
More informationLECTURE 10. Pipelining: Advanced ILP
LECTURE 10 Pipelining: Advanced ILP EXCEPTIONS An exception, or interrupt, is an event other than regular transfers of control (branches, jumps, calls, returns) that changes the normal flow of instruction
More informationSystems Architecture I
Systems Architecture I Topics A Simple Implementation of MIPS * A Multicycle Implementation of MIPS ** *This lecture was derived from material in the text (sec. 5.1-5.3). **This lecture was derived from
More informationWhat do we have so far? Multi-Cycle Datapath (Textbook Version)
What do we have so far? ulti-cycle Datapath (Textbook Version) CPI: R-Type = 4, Load = 5, Store 4, Branch = 3 Only one instruction being processed in datapath How to lower CPI further? #1 Lec # 8 Summer2001
More informationLecture 7 Pipelining. Peng Liu.
Lecture 7 Pipelining Peng Liu liupeng@zju.edu.cn 1 Review: The Single Cycle Processor 2 Review: Given Datapath,RTL -> Control Instruction Inst Memory Adr Op Fun Rt
More informationRISC Processor Design
RISC Processor Design Single Cycle Implementation - MIPS Virendra Singh Indian Institute of Science Bangalore virendra@computer.org Lecture 13 SE-273: Processor Design Feb 07, 2011 SE-273@SERC 1 Courtesy:
More informationLecture 4: Review of MIPS. Instruction formats, impl. of control and datapath, pipelined impl.
Lecture 4: Review of MIPS Instruction formats, impl. of control and datapath, pipelined impl. 1 MIPS Instruction Types Data transfer: Load and store Integer arithmetic/logic Floating point arithmetic Control
More informationIF1/IF2. Dout2[31:0] Data Memory. Addr[31:0] Din[31:0] Zero. Res ALU << 2. CPU Registers. extension. sign. W_add[4:0] Din[31:0] Dout[31:0] PC+4
12 1 CMPE110 Fall 2006 A. Di Blas 110 Fall 2006 CMPE pipeline concepts Advanced ffl ILP ffl Deep pipeline ffl Static multiple issue ffl Loop unrolling ffl VLIW ffl Dynamic multiple issue Textbook Edition:
More informationWhat is Pipelining? Time per instruction on unpipelined machine Number of pipe stages
What is Pipelining? Is a key implementation techniques used to make fast CPUs Is an implementation techniques whereby multiple instructions are overlapped in execution It takes advantage of parallelism
More informationCC 311- Computer Architecture. The Processor - Control
CC 311- Computer Architecture The Processor - Control Control Unit Functions: Instruction code Control Unit Control Signals Select operations to be performed (ALU, read/write, etc.) Control data flow (multiplexor
More informationProcessor (II) - pipelining. Hwansoo Han
Processor (II) - pipelining Hwansoo Han Pipelining Analogy Pipelined laundry: overlapping execution Parallelism improves performance Four loads: Speedup = 8/3.5 =2.3 Non-stop: 2n/0.5n + 1.5 4 = number
More informationECEC 355: Pipelining
ECEC 355: Pipelining November 8, 2007 What is Pipelining Pipelining is an implementation technique whereby multiple instructions are overlapped in execution. A pipeline is similar in concept to an assembly
More informationSystems Architecture
Systems Architecture Lecture 15: A Simple Implementation of MIPS Jeremy R. Johnson Anatole D. Ruslanov William M. Mongan Some or all figures from Computer Organization and Design: The Hardware/Software
More information