# Computer and Information Sciences College / Computer Science Department Enhancing Performance with Pipelining

## Transcription

1 Computer and Information Sciences College / Computer Science Department Enhancing Performance with Pipelining

2 Single-Cycle Design Problems

Assuming a fixed-period clock, every instruction's datapath uses one clock cycle. This implies:

- CPI = 1.
- Cycle time is determined by the length of the longest instruction path (load).
- Several instructions could run in a shorter clock cycle, so time is wasted. Consider what happens with more complicated instructions such as floating point!
- Resources used more than once in the same cycle need to be duplicated, which wastes hardware and chip area.

[Figure: single-cycle datapath divided into IF, ID, EX, MEM, WB, using IM, Reg, ALU, DM, Reg]

3 Ex.: Fixed-period clock vs. variable-period clock in a single-cycle implementation

Consider a machine with an additional floating point unit. Assume functional unit delays as follows (multiplexors, control unit, PC accesses, sign extension, wires: no delay):

| Functional unit | Delay |
| --- | --- |
| Memory | 2 ns |
| ALU | 2 ns |
| FP add | 8 ns |
| FP mul | 16 ns |
| Register access (R) | 1 ns |

Assume the instruction mix is as follows:

| Lw | Sw | R | Beq | J | FP add | FP mul |
| --- | --- | --- | --- | --- | --- | --- |
| 31% | 21% | 27% | 5% | 2% | 7% | 7% |

Compare the performance of (a) a single-cycle implementation using a fixed-period clock with (b) one using a variable-period clock, where each instruction executes in one clock cycle that is only as long as it needs to be (not really practical, but pretend it's possible!).

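4 Solution

| Instruction class | Instr. mem. | Register read | ALU oper. | Data mem. | Register write | FPU add/sub | FPU mul/div | Total time (ns) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Load word | 2 | 1 | 2 | 2 | 1 | | | 8 |
| Store word | 2 | 1 | 2 | 2 | | | | 7 |
| R-format | 2 | 1 | 2 | | 1 | | | 6 |
| Branch | 2 | 1 | 2 | | | | | 5 |
| Jump | 2 | | | | | | | 2 |
| FP mul/div | 2 | 1 | | | 1 | | 16 | 20 |
| FP add/sub | 2 | 1 | | | 1 | 8 | | 12 |

Clock period for the fixed-period clock = longest instruction time = 20 ns.

Average clock period for the variable-period clock = 8 × 31% + 7 × 21% + 6 × 27% + 5 × 5% + 2 × 2% + 20 × 7% + 12 × 7% = 8.1 ns.

Therefore, performance(var-period) / performance(fixed-period) = 20 / 8.1 ≈ 2.5, where T = Ic × CPI × t with the same Ic and the same CPI.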
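The comparison can be rechecked with a short script. This is a minimal sketch, not part of the original slides: the delays and the instruction mix are taken from the example, while the list of functional units each class passes through (`PATH`) is an assumption based on the usual single-cycle MIPS datapath. With these inputs the weighted sum evaluates to about 8.1 ns.

```python
# Functional-unit delays in ns, as stated in the example.
DELAY = {"mem": 2, "alu": 2, "fp_add": 8, "fp_mul": 16, "reg": 1}

# ASSUMPTION: units used by each instruction class (standard single-cycle datapath).
PATH = {
    "lw": ["mem", "reg", "alu", "mem", "reg"],
    "sw": ["mem", "reg", "alu", "mem"],
    "r": ["mem", "reg", "alu", "reg"],
    "beq": ["mem", "reg", "alu"],
    "j": ["mem"],
    "fp_add": ["mem", "reg", "fp_add", "reg"],
    "fp_mul": ["mem", "reg", "fp_mul", "reg"],
}

# Instruction mix from the example (fractions sum to 1).
MIX = {"lw": 0.31, "sw": 0.21, "r": 0.27, "beq": 0.05,
       "j": 0.02, "fp_add": 0.07, "fp_mul": 0.07}


def instruction_time(cls):
    """Time in ns that one instruction of this class needs."""
    return sum(DELAY[unit] for unit in PATH[cls])


def fixed_period():
    """A fixed clock must fit the slowest instruction class."""
    return max(instruction_time(cls) for cls in PATH)


def variable_period():
    """Average period when every instruction takes only the time it needs."""
    return sum(frac * instruction_time(cls) for cls, frac in MIX.items())


if __name__ == "__main__":
    print(fixed_period(), round(variable_period(), 2))
    print(round(fixed_period() / variable_period(), 2))
```

Because the instruction count and CPI are the same in both designs, the ratio of the two clock periods is also the performance ratio.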

5 Fixing the problem with single-cycle designs

I- Possible solutions:

- One solution: a variable-period clock with different cycle times for each instruction class. This is infeasible, as implementing a variable-speed clock is technically difficult.
- Another solution: use a smaller cycle time and have different instructions take different numbers of cycles, by breaking instructions into steps and fitting each step into one cycle.

II- Multicycle approach:

- Break up the instructions into steps; each step takes one clock cycle.
- At the end of one cycle, store data to be used in later cycles of the same instruction.
- Balance the amount of work to be done in each step/cycle so that they are about equal.
- Restrict each cycle to use each major functional unit at most once, so that such units do not have to be replicated.
- Functional units can be shared between different cycles within one instruction.

6 Multicycle Approach

[Figure: multicycle datapath with PC, a single memory (instruction or data), instruction register, memory data register, register file, A and B registers, ALU, and ALUOut]

Note the particularities of the multicycle diagram vs. the single-cycle diagram:

- single memory for data and instructions
- single ALU, no extra adders
- extra registers to hold data between clock cycles

7 Breaking instructions into steps

We break instructions into steps. Not all instructions require all the steps, and each step takes one clock cycle, so each MIPS instruction takes from 3 to 5 cycles (steps). To keep the steps balanced in length, the design restriction is to allow each step to contain at most one ALU operation, or one register access, or one memory access.

1. IF: Instruction fetch and PC increment
2. ID: Instruction decode and register fetch
3. EX: Execution, memory address computation, or branch completion
4. MEM: Memory access or R-type instruction completion
5. WB: Memory read completion

| Step name | Action for R-type instructions | Action for memory-reference instructions | Action for branches | Action for jumps |
| --- | --- | --- | --- | --- |
| Instruction fetch | IR = Memory[PC]; PC = PC + 4 | same | same | same |
| Instruction decode/register fetch | A = Reg[IR[25-21]]; B = Reg[IR[20-16]]; ALUOut = PC + (sign-extend(IR[15-0]) << 2) | same | same | same |
| Execution, address computation, branch/jump completion | ALUOut = A op B | ALUOut = A + sign-extend(IR[15-0]) | if (A == B) then PC = ALUOut | PC = PC[31-28] concatenated with (IR[25-0] << 2) |
| Memory access or R-type completion | Reg[IR[15-11]] = ALUOut | Load: MDR = Memory[ALUOut], or Store: Memory[ALUOut] = B | | |
| Memory read completion | | Load: Reg[IR[20-16]] = MDR | | |
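The cycle count of each instruction class follows from the last step it needs in the table above. The sketch below, which is not part of the original slides, derives those counts and an average CPI; the instruction mix in `EXAMPLE_MIX` is a hypothetical one chosen only for illustration.

```python
STEPS = ["IF", "ID", "EX", "MEM", "WB"]

# Last step in which each class still does work, read off the step table.
LAST_STEP = {
    "R-type": "MEM",   # writes the register in step 4
    "load": "WB",      # memory read completion in step 5
    "store": "MEM",    # writes memory in step 4
    "branch": "EX",    # completes in step 3
    "jump": "EX",      # completes in step 3
}

# HYPOTHETICAL instruction mix, for illustration only (fractions sum to 1).
EXAMPLE_MIX = {"load": 0.25, "store": 0.10, "R-type": 0.50,
               "branch": 0.13, "jump": 0.02}


def cycles(cls):
    """Number of clock cycles a multicycle implementation spends on this class."""
    return STEPS.index(LAST_STEP[cls]) + 1


def average_cpi(mix):
    """Mix-weighted average number of cycles per instruction."""
    return sum(frac * cycles(cls) for cls, frac in mix.items())


if __name__ == "__main__":
    for cls in LAST_STEP:
        print(cls, cycles(cls))
    print(round(average_cpi(EXAMPLE_MIX), 2))
```

The multicycle CPI is larger than 1, but each cycle is much shorter than the single-cycle clock period, which is where the gain comes from.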

8 Pipelining

Pipelining is an implementation technique in which multiple instructions are overlapped in execution. Start work ASAP!! Do not waste time!

[Figure: laundry analogy with task order A, B, C, D over time starting at 6 PM, shown first not pipelined and then pipelined]

- Assume 30 min. for each task: wash, dry, fold, store.
- Separate tasks use separate hardware, so they can be overlapped.

Why is pipelining easy with MIPS?

1. All instructions are the same length, so the fetch and decode stages are similar for all instructions.
2. There are few instruction formats, which simplifies instruction decode and makes it possible in one stage.
3. Memory operands appear only in loads/stores, so memory access can be deferred to exactly one later stage.
4. Operands are aligned in memory, so one data transfer instruction requires one memory access stage.

What about x86? (Instructions are 1 to 17 bytes long.)
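The laundry analogy can be worked through numerically. This is a small sketch under the slide's assumptions (four loads A to D, four 30-minute tasks each); the function names are illustrative only.

```python
def sequential_minutes(loads, stages, stage_minutes):
    """Each load finishes all its tasks before the next load starts."""
    return loads * stages * stage_minutes


def pipelined_minutes(loads, stages, stage_minutes):
    """The first load needs `stages` slots; every further load adds one slot."""
    return (stages + loads - 1) * stage_minutes


if __name__ == "__main__":
    seq = sequential_minutes(4, 4, 30)   # 480 minutes = 8 hours
    pipe = pipelined_minutes(4, 4, 30)   # 210 minutes = 3.5 hours
    print(seq, pipe, round(seq / pipe, 2))
```

With only four loads the speedup stays below the number of stages because of the time spent filling and draining the pipeline; it approaches 4 as the number of loads grows.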

9 Pipelined Execution Representation

[Figure: program flow vs. time; five instructions each pass through IFtch, Dcd, Exec, Mem, WB, each starting one stage after the previous one]

- To simplify the pipeline, every instruction takes the same number of steps, called stages.
- One clock cycle per stage.

10 Pipelined vs. Single-Cycle Instruction Execution: the Plan

Assume 2 ns for memory access and ALU operation and 1 ns for register access; therefore the single-cycle clock is 8 ns and the pipelined clock cycle is 2 ns.

[Figure: single-cycle execution; program execution order vs. time for three lw instructions into \$1, \$2, \$3 (base register \$0), each going through instruction fetch, ALU, data access; a new instruction starts every 8 ns. Single-cycle T?]

[Figure: pipelined execution of the same three lw instructions; a new instruction starts every 2 ns. Pipelined T?]

Assume the write to the register file occurs in the first half of the clock cycle and the read in the second half.
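The two "T?" questions can be answered with a short calculation. This is a sketch, not from the slides: it assumes the 8 ns and 2 ns clock periods stated here, a five-stage pipeline, and no stalls.

```python
def single_cycle_time_ns(n_instructions, clock_ns=8):
    """Single-cycle design: one long clock cycle per instruction."""
    return n_instructions * clock_ns


def pipelined_time_ns(n_instructions, stages=5, clock_ns=2):
    """Pipelined design without stalls: fill the pipeline once,
    then complete one instruction per short clock cycle."""
    return (stages + n_instructions - 1) * clock_ns


if __name__ == "__main__":
    for n in (3, 1000):
        print(n, single_cycle_time_ns(n), pipelined_time_ns(n))
```

For the three lw instructions this gives 24 ns against 14 ns; for long instruction sequences the ratio approaches 8 / 2 = 4 rather than 5, because the stages are not perfectly balanced.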

11 Hazards

What makes it hard?

- Structural hazards: different instructions, at different stages in the pipeline, want to use the same hardware resource.
- Control hazards: deciding on a control action depends on a previous instruction.
- Data hazards: an instruction in the pipeline requires data to be computed by a previous instruction still in the pipeline.

We first briefly examine these potential hazards individually.

12 Structural Hazards

Structural hazard: inadequate hardware to simultaneously support all instructions in the pipeline in the same clock cycle.

E.g., suppose a single instruction and data memory in the pipeline with one read port: this causes a structural hazard between the first and fourth lw instructions.

[Figure: instruction order (Load, Instr 1, Instr 2, Instr 3, Instr 4) vs. time in clock cycles; each instruction uses M, Reg, ALU, M, Reg]

Structural hazards are easy to avoid! Hazards can always be resolved by waiting.

13 Control Hazards

Control hazard: need to make a decision based on the result of a previous instruction still executing in the pipeline.

Solution 1: Stall the pipeline.

[Figure: program execution order vs. time for add \$4, \$5, \$6, then beq \$1, \$2, then lw \$3, (\$0); a bubble (pipeline stall) delays the lw, which starts 4 ns after the beq instead of 2 ns]

Note that the branch outcome is computed in the ID stage with added hardware (later).

14 Control Hazards

Solution 2: Predict the branch outcome.

- E.g., predict branch-not-taken: guess one direction, then back up if wrong.
- Random prediction: correct 50% of the time.
- History-based prediction: record the recent history of each branch; correct 90% of the time.

[Figure: prediction success; add \$4, \$5, \$6, then beq \$1, \$2, 40, then lw \$3, 300(\$0) start 2 ns apart with no bubble]

[Figure: prediction failure; after add \$4, \$5, \$6 and beq \$1, \$2, 40, the lw is undone (= flushed), leaving a bubble, and or \$7, \$8, \$9 starts 4 ns after the beq]
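The effect of prediction accuracy on performance can be estimated with a simple model. This is a sketch, not from the slides: the 50% and 90% accuracies come from the slide, the one-cycle misprediction penalty matches the single bubble in the figures, and the branch fraction of 20% is a hypothetical value chosen for illustration.

```python
def effective_cpi(branch_fraction, accuracy, penalty_cycles=1, base_cpi=1.0):
    """Base CPI plus the stall cycles caused by mispredicted branches."""
    mispredict_rate = 1.0 - accuracy
    return base_cpi + branch_fraction * mispredict_rate * penalty_cycles


if __name__ == "__main__":
    BRANCH_FRACTION = 0.20  # HYPOTHETICAL share of branches in the program
    for name, accuracy in (("random", 0.50), ("history-based", 0.90)):
        print(name, round(effective_cpi(BRANCH_FRACTION, accuracy), 3))
```

Stalling on every branch corresponds to an accuracy of 0 in this model, so better prediction directly removes stall cycles.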

15 Control Hazards

Solution 3: Delayed branch.

- Always execute the sequentially next statement, with the branch executing after a one-instruction delay.
- It is the compiler's job to find a statement that is independent of the branch outcome and can be put in the slot.
- MIPS does this, but it is an option in SPIM (Simulator -> Settings).

[Figure: program execution order vs. time; beq \$1, \$2, 40, then add \$4, \$5, \$6 (delayed branch slot), then lw \$3, (\$0); instructions start 2 ns apart]

Delayed branch: beq is followed by an add that is independent of the branch outcome.

16 Data Hazards

Data hazard: an instruction depends on the result of a previous instruction still executing in the pipeline.

Solution: Forward data if possible.

[Figure: instruction pipeline diagram for add \$s0, \$t0, \$t1 (IF ID EX MEM WB); shading indicates use: left = write, right = read]

[Figure: program execution order vs. time; add \$s0, \$t0, \$t1 followed by sub \$t2, \$s0, \$t3, each passing through IF ID EX MEM WB]

Without forwarding (blue line), data has to go back in time; with forwarding (red line), data is available in time.

17 Data Hazards

Forwarding may not be enough, e.g., if an R-type instruction following a load uses the result of the load. This is called a load-use data hazard.

[Figure: lw \$s0, 20(\$t1) followed immediately by sub \$t2, \$s0, \$t3, each passing through IF ID EX MEM WB]

Without a stall it is impossible to provide input to the sub instruction in time.

[Figure: lw \$s0, 20(\$t1), then a bubble, then sub \$t2, \$s0, \$t3]

With a one-stage stall, forwarding can get the data to the sub instruction in time.
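The stall counts can be derived from the stage in which a value becomes available and the stage in which it is needed. This is a minimal sketch, not from the slides, assuming the five-stage pipeline above (stages numbered IF = 0 to WB = 4) and a register file that writes in the first half of a cycle and reads in the second half.

```python
EX, MEM, WB = 2, 3, 4  # stage offsets, counted from IF = 0


def stalls_needed(producer, distance=1, forwarding=True):
    """Bubbles required between a producer and a dependent consumer.

    producer: "alu" (result ready at the end of EX) or "load" (end of MEM).
    distance: how many instructions after the producer the consumer is issued.
    """
    if forwarding:
        ready = MEM if producer == "load" else EX
        needed = distance + EX          # consumer's EX stage, relative to producer's IF
        # A forwarded value must be produced in a strictly earlier cycle.
        return max(0, ready + 1 - needed)
    # Without forwarding the value travels through the register file:
    # written in WB, read in the consumer's ID stage of the same or a later cycle.
    needed = distance + 1               # consumer's ID stage
    return max(0, WB - needed)


if __name__ == "__main__":
    print(stalls_needed("alu"), stalls_needed("load"))
    print(stalls_needed("alu", forwarding=False))
```

The add/sub pair needs no stall with forwarding, while the lw/sub pair needs exactly one bubble, which matches the two diagrams.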

18 Reordering Code to Avoid Pipeline Stall (Software Solution)

Example (data hazard: the first sw uses \$t2 right after it is loaded):

- lw \$t0, 0(\$t1)
- lw \$t2, 4(\$t1)
- sw \$t2, 0(\$t1)
- sw \$t0, 4(\$t1)

Reordered code (the two sw instructions are interchanged):

- lw \$t0, 0(\$t1)
- lw \$t2, 4(\$t1)
- sw \$t0, 4(\$t1)
- sw \$t2, 0(\$t1)
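A compiler pass looking for this situation can be sketched as a scan over adjacent instruction pairs. This is an illustrative sketch only, not from the slides: the tiny parser handles just the lw/sw/R-type shapes used here, and it conservatively treats any use of a loaded register by the very next instruction as a load-use hazard.

```python
import re


def parse(instr):
    """Return (opcode, destination register or None, source registers)."""
    op = instr.split()[0]
    regs = re.findall(r"\$\w+", instr)
    if op == "sw":                 # a store writes no register; both registers are read
        return op, None, regs
    return op, regs[0], regs[1:]   # lw and R-type: first register is the destination


def load_use_hazards(program):
    """Adjacent pairs where an instruction reads the register loaded just before it."""
    hazards = []
    for prev, cur in zip(program, program[1:]):
        prev_op, prev_dest, _ = parse(prev)
        _, _, cur_srcs = parse(cur)
        if prev_op == "lw" and prev_dest in cur_srcs:
            hazards.append((prev, cur))
    return hazards


ORIGINAL = ["lw $t0, 0($t1)", "lw $t2, 4($t1)", "sw $t2, 0($t1)", "sw $t0, 4($t1)"]
REORDERED = ["lw $t0, 0($t1)", "lw $t2, 4($t1)", "sw $t0, 4($t1)", "sw $t2, 0($t1)"]

if __name__ == "__main__":
    print(len(load_use_hazards(ORIGINAL)), len(load_use_hazards(REORDERED)))
```

Both orders swap the two memory words, so the reordering keeps the program's meaning while moving the dependent store one slot further from its load.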

19 Pipelined Datapath - Single-Cycle Datapath Steps

[Figure: single-cycle datapath (PC, instruction memory, register file, sign extend, shift left 2, ALU, data memory, adders, multiplexors) divided into five steps]

- IF: Instruction Fetch
- ID: Instruction Decode
- EX: Execute/Address Calc.
- MEM: Memory Access
- WB: Write Back

20 Pipelined Datapath

Idea: What happens if we break the execution into multiple cycles, but keep the extra hardware?

Answer: We may be able to start executing a new instruction at each clock cycle (pipelining), but we shall need extra registers to hold data between cycles: pipeline registers.

[Figure: pipelined datapath with pipeline registers IF/ID, ID/EX, EX/MEM, MEM/WB, each wide enough to hold the data coming in]

21 Pipelined Datapath

[Figure: the same pipelined datapath; pipeline registers are wide enough to hold the data coming in: IF/ID 64 bits, ID/EX 128 bits, EX/MEM 97 bits, MEM/WB 64 bits]

Only data flowing right to left may cause a hazard. Why?

22 Bug in the Datapath

The write register number comes from another, later instruction!

[Figure: pipelined datapath with the IF/ID, ID/EX, EX/MEM, MEM/WB pipeline registers]

23 Corrected Datapath

[Figure: corrected pipelined datapath; pipeline register widths: IF/ID 64 bits, ID/EX 133 bits, EX/MEM 102 bits, MEM/WB 69 bits]

The destination register number is also passed through the ID/EX, EX/MEM and MEM/WB registers, which are now wider by 5 bits.
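The register widths quoted on these slides can be reproduced by adding up the fields each pipeline register has to carry. This is a sketch under an assumption: the field lists below are not given in the transcript and are taken from the usual textbook version of this datapath; only the totals appear on the slides.

```python
# ASSUMED contents of each pipeline register (field name -> width in bits).
FIELDS = {
    "IF/ID": {"PC+4": 32, "instruction": 32},
    "ID/EX": {"PC+4": 32, "read data 1": 32, "read data 2": 32,
              "sign-extended immediate": 32},
    "EX/MEM": {"branch target": 32, "ALU zero flag": 1, "ALU result": 32,
               "read data 2": 32},
    "MEM/WB": {"memory read data": 32, "ALU result": 32},
}

DEST_REG_BITS = 5  # a MIPS register number is 5 bits wide


def widths(with_dest_reg):
    """Width of every pipeline register, optionally including the
    destination register number carried from ID/EX onwards."""
    result = {}
    for name, fields in FIELDS.items():
        extra = DEST_REG_BITS if with_dest_reg and name != "IF/ID" else 0
        result[name] = sum(fields.values()) + extra
    return result


if __name__ == "__main__":
    print(widths(with_dest_reg=False))
    print(widths(with_dest_reg=True))
```

IF/ID keeps its width because the destination register number is still part of the instruction word at that point.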

24 Single-Clock-Cycle Diagram: Clock Cycle 1

Example: lw \$t0, 10(\$t1); sw \$t3, 20(\$t4); add \$t5, \$t6, \$t7; sub \$t8, \$t9, \$t10

IF: LW

25 Single-Clock-Cycle Diagram: Clock Cycle 2

IF: SW, ID: LW

26 Single-Clock-Cycle Diagram: Clock Cycle 3

IF: ADD, ID: SW, EX: LW

27 Single-Clock-Cycle Diagram: Clock Cycle 4

IF: SUB, ID: ADD, EX: SW, MEM: LW

28 Single-Clock-Cycle Diagram: Clock Cycle 5

ID: SUB, EX: ADD, MEM: SW, WB: LW

29 Single-Clock-Cycle Diagram: Clock Cycle 6

EX: SUB, MEM: ADD, WB: SW

30 Single-Clock-Cycle Diagram: Clock Cycle 7

MEM: SUB, WB: ADD

31 Single-Clock-Cycle Diagram: Clock Cycle 8

WB: SUB

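32 Alternative View: Multiple-Clock-Cycle Diagram

Time runs left to right along the clock-cycle axis.

| Instruction | CC 1 | CC 2 | CC 3 | CC 4 | CC 5 | CC 6 | CC 7 | CC 8 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| lw \$t0, 10(\$t1) | IM | REG | ALU | DM | REG | | | |
| sw \$t3, 20(\$t4) | | IM | REG | ALU | DM | REG | | |
| add \$t5, \$t6, \$t7 | | | IM | REG | ALU | DM | REG | |
| sub \$t8, \$t9, \$t10 | | | | IM | REG | ALU | DM | REG |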
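A multiple-clock-cycle diagram like this one can be generated mechanically: instruction i (counting from 0) occupies stage s in clock cycle i + s + 1. The following sketch, not from the slides, assumes one instruction issued per cycle and no stalls.

```python
STAGES = ["IM", "REG", "ALU", "DM", "REG"]


def multi_cycle_diagram(instructions):
    """Return (instruction, cells) rows; cells[c] is the resource used in cycle c + 1."""
    total_cycles = len(instructions) + len(STAGES) - 1
    rows = []
    for i, instr in enumerate(instructions):
        cells = [""] * total_cycles
        for s, resource in enumerate(STAGES):
            cells[i + s] = resource   # stage s of instruction i runs in cycle i + s + 1
        rows.append((instr, cells))
    return rows


PROGRAM = ["lw $t0, 10($t1)", "sw $t3, 20($t4)",
           "add $t5, $t6, $t7", "sub $t8, $t9, $t10"]

if __name__ == "__main__":
    for instr, cells in multi_cycle_diagram(PROGRAM):
        print(f"{instr:<20}" + " ".join(f"{c:<4}" for c in cells))
```

Reading one column of the output gives the single-clock-cycle view of that cycle.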

33 Notes

- There is no write control for the pipeline registers and the PC, since they are updated at every clock cycle.
- To specify the control for the pipeline, set the control values during each pipeline stage.
- Control lines can be divided into 5 groups:
  - IF: none
  - ID: none
  - ALU: RegDst, ALUOp, ALUSrc
  - MEM: Branch, MemRead, MemWrite
  - WB: MemtoReg, RegWrite
- Group these nine control lines (ALUOp is two lines wide) into 3 subsets: ALUControl, MEMControl, WBControl.
- Control signals are generated at the ID stage; how to pass them to other stages?

36 Instruction-Level Parallelism (ILP)

Pipelining: executing multiple instructions in parallel. To increase ILP:

- Deeper pipeline
  - Less work per stage, so a shorter clock cycle
- Multiple issue
  - Replicate pipeline stages to get multiple pipelines
  - Start multiple instructions per clock cycle
  - CPI < 1, so use Instructions Per Cycle (IPC)
  - E.g., 4 GHz 4-way multiple-issue: 16 BIPS, peak CPI = 0.25, peak IPC = 4
  - But dependencies reduce this in practice

37 How ILP Works

- Issuing multiple instructions per cycle requires fetching multiple instructions from memory per cycle; the number issued per cycle is called the superscalar degree or issue width.
- To find independent instructions, we must have a big pool of instructions to choose from, called the instruction buffer (IB). As the IB length increases, the complexity of the decoder (control) increases, which increases the datapath cycle time.
- Instructions are prefetched sequentially by an IFU that operates independently from the datapath control. It fetches the instruction at (PC)+L, where L is the IB size, or as directed by the branch predictor.

38 Compiler/Hardware Speculation

Compiler can reorder instructions: Static Multiple Issue

- Compiler groups instructions into issue packets
  - Group of instructions that can be issued on a single cycle
  - Determined by pipeline resources required
- Think of an issue packet as a very long instruction
  - Specifies multiple concurrent operations
  - Very Long Instruction Word (VLIW)
- Compiler must remove some/all hazards
  - Reorder instructions into issue packets
  - No dependencies within a packet
  - Varies between ISAs; compiler must know!
  - Pad with nop if necessary

Hardware can look ahead for instructions to execute

- Buffer results until it determines they are actually needed
- Flush buffers on incorrect speculation

Explicitly Parallel Instruction Computer (EPIC).

39 Loop Unrolling

- Replicate the loop body to expose more parallelism.
- Rename the registers used by each replicated copy.

Loop:

- lw \$t0, 0(\$s1)
- addu \$t0, \$t0, \$s2
- sw \$t0, 0(\$s1)
- addi \$s1, \$s1, -4
- bne \$s1, \$zero, Loop
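The idea can be illustrated at a higher level than assembly. The sketch below is not from the slides: it adds a scalar to every array element, first with a plain loop and then unrolled four times, where the separate temporaries `t0` to `t3` play the role of renamed registers. It assumes the element count is a multiple of four.

```python
def add_scalar_rolled(values, s):
    """One element per iteration, walking down the array like the loop above."""
    out = list(values)
    i = len(out) - 1
    while i >= 0:
        out[i] += s
        i -= 1
    return out


def add_scalar_unrolled4(values, s):
    """Four elements per iteration; requires len(values) to be a multiple of 4."""
    out = list(values)
    if len(out) % 4 != 0:
        raise ValueError("length must be a multiple of 4")
    i = len(out) - 4
    while i >= 0:
        # Four independent temporaries: no copy of the body waits on another.
        t0 = out[i + 3] + s
        t1 = out[i + 2] + s
        t2 = out[i + 1] + s
        t3 = out[i] + s
        out[i + 3], out[i + 2], out[i + 1], out[i] = t0, t1, t2, t3
        i -= 4   # one index update and one loop test per four elements
    return out


if __name__ == "__main__":
    data = list(range(8))
    print(add_scalar_rolled(data, 10) == add_scalar_unrolled4(data, 10))
```

Unrolling reduces the loop overhead (index update and branch) per element and exposes four independent load/add/store chains that a multiple-issue processor can overlap.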

40 HW Schemes: Instruction Parallelism

Why in HW at run time?

- Works when the real dependence can't be known at compile time
- Compiler is simpler
- Code for one machine runs well on another

Key idea: allow instructions behind a stall to proceed.

- DIVD F0, F2, F4
- ADDD F10, F0, F8
- SUBD F12, F8, F14

This enables out-of-order execution => out-of-order completion. The ID stage checks for hazards; if there are no hazards, it issues the instruction for execution.

41 Dynamic Multiple Issue (Superscalar)

- Superscalar processors: an advanced pipelining technique that enables the processor to execute more than one instruction per clock cycle.
- The CPU decides whether to issue 0, 1, ..., IPC instructions each cycle, avoiding structural and data hazards (dynamic pipeline).
- Avoids the need for compiler scheduling.
- Allows the CPU to execute instructions out of order to avoid stalls, but commit results to registers in order.

Example:

- lw \$t0, 20(\$s2)
- addu \$t1, \$t0, \$t2
- sub \$s4, \$s4, \$t3
- slti \$t5, \$s4, 20

Can start sub while addu is waiting for lw.
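Why sub can overtake addu becomes visible once the read-after-write dependences of the example are written down. This is a small sketch, not from the slides: destination and source registers are listed by hand for the four instructions, and only true (read-after-write) dependences are tracked.

```python
# (text, destination register, source registers) for the example above.
PROGRAM = [
    ("lw $t0, 20($s2)", "$t0", ["$s2"]),
    ("addu $t1, $t0, $t2", "$t1", ["$t0", "$t2"]),
    ("sub $s4, $s4, $t3", "$s4", ["$s4", "$t3"]),
    ("slti $t5, $s4, 20", "$t5", ["$s4"]),
]


def raw_dependencies(program):
    """Map each instruction index to the earlier instructions whose results it reads."""
    last_writer = {}   # register -> index of the most recent instruction writing it
    deps = {}
    for i, (_, dest, srcs) in enumerate(program):
        deps[i] = sorted({last_writer[r] for r in srcs if r in last_writer})
        last_writer[dest] = i
    return deps


if __name__ == "__main__":
    for i, producers in raw_dependencies(PROGRAM).items():
        print(PROGRAM[i][0], "<-", [PROGRAM[p][0] for p in producers])
```

addu depends on lw, and slti depends on sub, but sub depends on neither earlier instruction, so the hardware may start it while addu waits for the load.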

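43 Speed Up Equation for Pipelining

$$\text{CPI}_{\text{pipelined}} = \text{Ideal CPI} + \text{Pipeline stall clock cycles per instruction}$$

$$\text{Speedup} = \frac{\text{Ideal CPI} \times \text{Pipeline depth}}{\text{Ideal CPI} + \text{Pipeline stall CPI}} \times \frac{\text{Clock cycle}_{\text{unpipelined}}}{\text{Clock cycle}_{\text{pipelined}}}$$

For an ideal CPI of 1:

$$\text{Speedup} = \frac{\text{Pipeline depth}}{1 + \text{Pipeline stall CPI}} \times \frac{\text{Clock cycle}_{\text{unpipelined}}}{\text{Clock cycle}_{\text{pipelined}}}$$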
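The speedup equation translates directly into a small helper. This is a sketch, not from the slides; the parameter names are illustrative, and `clock_ratio` stands for the ratio of the unpipelined to the pipelined clock cycle.

```python
def pipeline_speedup(depth, stall_cpi, ideal_cpi=1.0, clock_ratio=1.0):
    """Speedup of a pipeline over the unpipelined machine.

    depth:       number of pipeline stages
    stall_cpi:   average stall cycles per instruction
    ideal_cpi:   CPI of the pipeline without stalls
    clock_ratio: unpipelined clock cycle / pipelined clock cycle
    """
    return (ideal_cpi * depth) / (ideal_cpi + stall_cpi) * clock_ratio


if __name__ == "__main__":
    print(pipeline_speedup(depth=5, stall_cpi=0.0))
    print(pipeline_speedup(depth=5, stall_cpi=0.25))
```

With no stalls and equal clock cycles the speedup equals the pipeline depth; every stall cycle per instruction enlarges the denominator and pulls the speedup below that ideal.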

### Full Datapath. Chapter 4 The Processor 2

Pipelining Full Datapath Chapter 4 The Processor 2 Datapath With Control Chapter 4 The Processor 3 Performance Issues Longest delay determines clock period Critical path: load instruction Instruction memory

### COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface. 5 th. Edition. Chapter 4. The Processor

COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface 5 th Edition Chapter 4 The Processor Introduction CPU performance factors Instruction count Determined by ISA and compiler CPI and Cycle

### Thomas Polzer Institut für Technische Informatik

Thomas Polzer tpolzer@ecs.tuwien.ac.at Institut für Technische Informatik Pipelined laundry: overlapping execution Parallelism improves performance Four loads: Speedup = 8/3.5 = 2.3 Non-stop: Speedup =

### The Processor: Instruction-Level Parallelism

The Processor: Instruction-Level Parallelism Computer Organization Architectures for Embedded Computing Tuesday 21 October 14 Many slides adapted from: Computer Organization and Design, Patterson & Hennessy

### Chapter 4. The Processor

Chapter 4 The Processor Recall. ISA? Instruction Fetch Instruction Decode Operand Fetch Execute Result Store Next Instruction Instruction Format or Encoding how is it decoded? Location of operands and

### Computer Architecture Computer Science & Engineering. Chapter 4. The Processor BK TP.HCM

Computer Architecture Computer Science & Engineering Chapter 4 The Processor Introduction CPU performance factors Instruction count Determined by ISA and compiler CPI and Cycle time Determined by CPU hardware

### COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface 5 th Edition. Chapter 4. The Processor

COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface 5 th Edition Chapter 4 The Processor The Processor? Chapter 4 The Processor 2 Introduction We will learn How the ISA determines many aspects

### Computer Organization and Structure. Bing-Yu Chen National Taiwan University

Computer Organization and Structure Bing-Yu Chen National Taiwan University The Processor Logic Design Conventions Building a Datapath A Simple Implementation Scheme An Overview of Pipelining Pipelined

### Determined by ISA and compiler. We will examine two MIPS implementations. A simplified version A more realistic pipelined version

MIPS Processor Introduction CPU performance factors Instruction count Determined by ISA and compiler CPI and Cycle time Determined by CPU hardware We will examine two MIPS implementations A simplified

### Chapter 4. The Processor

Chapter 4 The Processor 1 Introduction CPU performance factors Instruction count Determined by ISA and compiler CPI and Cycle time Determined by CPU hardware We will examine two MIPS implementations A

### Pipelining. Ideal speedup is number of stages in the pipeline. Do we achieve this? 2. Improve performance by increasing instruction throughput ...

CHAPTER 6 1 Pipelining Instruction class Instruction memory ister read ALU Data memory ister write Total (in ps) Load word 200 100 200 200 100 800 Store word 200 100 200 200 700 R-format 200 100 200 100

### 4. The Processor Computer Architecture COMP SCI 2GA3 / SFWR ENG 2GA3. Emil Sekerinski, McMaster University, Fall Term 2015/16

4. The Processor Computer Architecture COMP SCI 2GA3 / SFWR ENG 2GA3 Emil Sekerinski, McMaster University, Fall Term 2015/16 Instruction Execution Consider simplified MIPS: lw/sw rt, offset(rs) add/sub/and/or/slt

### Chapter 4. The Processor

Chapter 4 The Processor Introduction CPU performance factors Instruction count Determined by ISA and compiler CPI and Cycle time Determined by CPU hardware We will examine two MIPS implementations A simplified

### Pipelined Processor Design

Pipelined Processor Design Pipelined Implementation: MIPS Virendra Singh Computer Design and Test Lab. Indian Institute of Science (IISc) Bangalore virendra@computer.org Advance Computer Architecture http://www.serc.iisc.ernet.in/~viren/courses/aca/aca.htm

### Computer Architecture Computer Science & Engineering. Chapter 4. The Processor BK TP.HCM

Computer Architecture Computer Science & Engineering Chapter 4 The Processor Introduction CPU performance factors Instruction count Determined by ISA and compiler CPI and Cycle time Determined by CPU hardware

### ECS 154B Computer Architecture II Spring 2009

ECS 154B Computer Architecture II Spring 2009 Pipelining Datapath and Control 6.2-6.3 Partially adapted from slides by Mary Jane Irwin, Penn State And Kurtis Kredo, UCD Pipelined CPU Break execution into

### Chapter 4. The Processor

Chapter 4 The Processor Introduction CPU performance factors Instruction count Determined by ISA and compiler CPI and Cycle time Determined by CPU hardware We will examine two MIPS implementations A simplified

### Chapter 4. The Processor

Chapter 4 The Processor 4.1 Introduction Introduction CPU performance factors Instruction count CPI and Cycle time Determined by CPU hardware We will examine two MIPS implementations Determined by ISA

### Lecture 3: The Processor (Chapter 4 of textbook) Chapter 4.1

Lecture 3: The Processor (Chapter 4 of textbook) Chapter 4.1 Introduction Chapter 4.1 Chapter 4.2 Review: MIPS (RISC) Design Principles Simplicity favors regularity fixed size instructions small number

### COMPUTER ORGANIZATION AND DESIGN

COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface 5 th Edition Chapter 4 The Processor Introduction CPU performance factors Instruction count Determined by ISA and compiler CPI and Cycle

### COSC 6385 Computer Architecture - Pipelining

COSC 6385 Computer Architecture - Pipelining Fall 2006 Some of the slides are based on a lecture by David Culler, Instruction Set Architecture Relevant features for distinguishing ISA s Internal storage

### EIE/ENE 334 Microprocessors

EIE/ENE 334 Microprocessors Lecture 6: The Processor Week #06/07 : Dejwoot KHAWPARISUTH Adapted from Computer Organization and Design, 4 th Edition, Patterson & Hennessy, 2009, Elsevier (MK) http://webstaff.kmutt.ac.th/~dejwoot.kha/

### CPE 335. Basic MIPS Architecture Part II

CPE 335 Computer Organization Basic MIPS Architecture Part II Dr. Iyad Jafar Adapted from Dr. Gheith Abandah slides http://www.abandah.com/gheith/courses/cpe335_s08/index.html CPE232 Basic MIPS Architecture

### 14:332:331 Pipelined Datapath

14:332:331 Pipelined Datapath I n s t r. O r d e r Inst 0 Inst 1 Inst 2 Inst 3 Inst 4 Single Cycle Disadvantages & Advantages Uses the clock cycle inefficiently the clock cycle must be timed to accommodate

### CO Computer Architecture and Programming Languages CAPL. Lecture 18 & 19

CO2-3224 Computer Architecture and Programming Languages CAPL Lecture 8 & 9 Dr. Kinga Lipskoch Fall 27 Single Cycle Disadvantages & Advantages Uses the clock cycle inefficiently the clock cycle must be

### COMPUTER ORGANIZATION AND DESIGN. 5 th Edition. The Hardware/Software Interface. Chapter 4. The Processor

COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface 5 th Edition Chapter 4 The Processor COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface 5 th Edition The Processor - Introduction

### Lecture 2: Processor and Pipelining 1

The Simple BIG Picture! Chapter 3 Additional Slides The Processor and Pipelining CENG 6332 2 Datapath vs Control Datapath signals Control Points Controller Datapath: Storage, FU, interconnect sufficient

### Data Hazards Compiler Scheduling Pipeline scheduling or instruction scheduling: Compiler generates code to eliminate hazard

Data Hazards Compiler Scheduling Pipeline scheduling or instruction scheduling: Compiler generates code to eliminate hazard Consider: a = b + c; d = e - f; Assume loads have a latency of one clock cycle:

### EE557--FALL 1999 MAKE-UP MIDTERM 1. Closed books, closed notes

NAME: STUDENT NUMBER: EE557--FALL 1999 MAKE-UP MIDTERM 1 Closed books, closed notes Q1: /1 Q2: /1 Q3: /1 Q4: /1 Q5: /15 Q6: /1 TOTAL: /65 Grade: /25 1 QUESTION 1(Performance evaluation) 1 points We are

### Pipelining Analogy. Pipelined laundry: overlapping execution. Parallelism improves performance. Four loads: Non-stop: Speedup = 8/3.5 = 2.3.

Pipelining Analogy Pipelined laundry: overlapping execution Parallelism improves performance Four loads: Speedup = 8/3.5 = 2.3 Non-stop: Speedup =2n/05n+15 2n/0.5n 1.5 4 = number of stages 4.5 An Overview

### Chapter 4. The Processor

Chapter 4 The Processor Introduction CPU performance factors Instruction count Determined by ISA and compiler CPI and Cycle time Determined by CPU hardware We will examine two MIPS implementations A simplified

### 4. What is the average CPI of a 1.4 GHz machine that executes 12.5 million instructions in 12 seconds?

Chapter 4: Assessing and Understanding Performance 1. Define response (execution) time. 2. Define throughput. 3. Describe why using the clock rate of a processor is a bad way to measure performance. Provide

### LECTURE 3: THE PROCESSOR

LECTURE 3: THE PROCESSOR Abridged version of Patterson & Hennessy (2013):Ch.4 Introduction CPU performance factors Instruction count Determined by ISA and compiler CPI and Cycle time Determined by CPU

### ELEC 5200/6200 Computer Architecture and Design Spring 2017 Lecture 4: Datapath and Control

ELEC 52/62 Computer Architecture and Design Spring 217 Lecture 4: Datapath and Control Ujjwal Guin, Assistant Professor Department of Electrical and Computer Engineering Auburn University, Auburn, AL 36849

### Full Datapath. Chapter 4 The Processor 2

Pipelining Full Datapath Chapter 4 The Processor 2 Datapath With Control Chapter 4 The Processor 3 Performance Issues Longest delay determines clock period Critical path: load instruction Instruction memory

### Lecture 8: Control COS / ELE 375. Computer Architecture and Organization. Princeton University Fall Prof. David August

Lecture 8: Control COS / ELE 375 Computer Architecture and Organization Princeton University Fall 2015 Prof. David August 1 Datapath and Control Datapath The collection of state elements, computation elements,

### CPE 335 Computer Organization. Basic MIPS Pipelining Part I

CPE 335 Computer Organization Basic MIPS Pipelining Part I Dr. Iyad Jafar Adapted from Dr. Gheith Abandah slides http://www.abandah.com/gheith/courses/cpe335_s08/index.html CPE232 Basic MIPS Pipelining

### Design of the MIPS Processor (contd)

Design of the MIPS Processor (contd) First, revisit the datapath for add, sub, lw, sw. We will augment it to accommodate the beq and j instructions. Execution of branch instructions beq \$at, \$zero, L add

### Chapter 4. The Processor. Computer Architecture and IC Design Lab

Chapter 4 The Processor Introduction CPU performance factors CPI Clock Cycle Time Instruction count Determined by ISA and compiler CPI and Cycle time Determined by CPU hardware We will examine two MIPS

### ECE369. Chapter 5 ECE369

Chapter 5 1 State Elements Unclocked vs. Clocked Clocks used in synchronous logic Clocks are needed in sequential logic to decide when an element that contains state should be updated. State element 1

### MIPS Pipelining. Computer Organization Architectures for Embedded Computing. Wednesday 8 October 14

MIPS Pipelining Computer Organization Architectures for Embedded Computing Wednesday 8 October 14 Many slides adapted from: Computer Organization and Design, Patterson & Hennessy 4th Edition, 2011, MK

### Design of the MIPS Processor

Design of the MIPS Processor We will study the design of a simple version of MIPS that can support the following instructions: I-type instructions LW, SW R-type instructions, like ADD, SUB Conditional

### COMPUTER ORGANIZATION AND DESIGN. The Hardware/Software Interface. Chapter 4. The Processor: C Multiple Issue Based on P&H

COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface Chapter 4 The Processor: C Multiple Issue Based on P&H Instruction-Level Parallelism (ILP) Pipelining: executing multiple instructions in

### COMPUTER ORGANIZATION AND DESIGN

ARM COMPUTER ORGANIZATION AND DESIGN Edition The Hardware/Software Interface Chapter 4 The Processor Modified and extended by R.J. Leduc - 2016 To understand this chapter, you will need to understand some

### Processor (I) - datapath & control. Hwansoo Han

Processor (I) - datapath & control Hwansoo Han Introduction CPU performance factors Instruction count - Determined by ISA and compiler CPI and Cycle time - Determined by CPU hardware We will examine two

### 5 th Edition. The Processor We will examine two MIPS implementations A simplified version A more realistic pipelined version

COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface Chapter 4 5 th Edition Introduction CPU performance factors Instruction count Determined by ISA and compiler CPI and Cycle time Determined

### CSE 2021 COMPUTER ORGANIZATION

CSE 22 COMPUTER ORGANIZATION HUGH CHESSER CHESSER HUGH CSEB 2U 2U CSEB Agenda Topics:. Sample Exam/Quiz Q - Review 2. Multiple cycle implementation Patterson: Section 4.5 Reminder: Quiz #2 Next Wednesday

### Instruction word R0 R1 R2 R3 R4 R5 R6 R8 R12 R31

4.16 Exercises 419 Exercise 4.11 In this exercise we examine in detail how an instruction is executed in a single-cycle datapath. Problems in this exercise refer to a clock cycle in which the processor

### Pipelining! Advanced Topics on Heterogeneous System Architectures. Politecnico di Milano! Seminar DEIB! 30 November, 2017!

Advanced Topics on Heterogeneous System Architectures Pipelining! Politecnico di Milano! Seminar Room @ DEIB! 30 November, 2017! Antonio R. Miele! Marco D. Santambrogio! Politecnico di Milano! 2 Outline!

### LECTURE 10. Pipelining: Advanced ILP

LECTURE 10 Pipelining: Advanced ILP EXCEPTIONS An exception, or interrupt, is an event other than regular transfers of control (branches, jumps, calls, returns) that changes the normal flow of instruction

### Four Steps of Speculative Tomasulo cycle 0

HW support for More ILP Hardware Speculative Execution Speculation: allow an instruction to issue that is dependent on branch, without any consequences (including exceptions) if branch is predicted incorrectly

### Lecture 4: Review of MIPS. Instruction formats, impl. of control and datapath, pipelined impl.

Lecture 4: Review of MIPS Instruction formats, impl. of control and datapath, pipelined impl. 1 MIPS Instruction Types Data transfer: Load and store Integer arithmetic/logic Floating point arithmetic Control

### Processor (II) - pipelining. Hwansoo Han

Processor (II) - pipelining Hwansoo Han Pipelining Analogy Pipelined laundry: overlapping execution Parallelism improves performance Four loads: Speedup = 8/3.5 = 2.3 Non-stop: Speedup = 2n/(0.5n + 1.5) ≈ 4 = number

### Lecture 7 Pipelining. Peng Liu.

Lecture 7 Pipelining Peng Liu liupeng@zju.edu.cn 1 Review: The Single Cycle Processor 2 Review: Given Datapath,RTL -> Control Instruction Inst Memory Adr Op Fun Rt

### T = I x CPI x C. Both effective CPI and clock cycle C are heavily influenced by CPU design. CPI increased (3-5) bad Shorter cycle good

CPU performance equation: T = I x CPI x C Both effective CPI and clock cycle C are heavily influenced by CPU design. For single-cycle CPU: CPI = 1 good Long cycle time bad On the other hand, for multi-cycle

### Single vs. Multi-cycle Implementation

Single vs. Multi-cycle Implementation Multicycle: Instructions take several faster cycles For this simple version, the multi-cycle implementation could be as much as 1.27 times faster (for a typical instruction

### Systems Architecture I

Systems Architecture I Topics A Simple Implementation of MIPS * A Multicycle Implementation of MIPS ** *This lecture was derived from material in the text (sec. 5.1-5.3). **This lecture was derived from

### What do we have so far? Multi-Cycle Datapath (Textbook Version)

What do we have so far? Multi-cycle Datapath (Textbook Version) CPI: R-Type = 4, Load = 5, Store = 4, Branch = 3 Only one instruction being processed in datapath How to lower CPI further?

### IF1/IF2. Dout2[31:0] Data Memory. Addr[31:0] Din[31:0] Zero. Res ALU << 2. CPU Registers. extension. sign. W_add[4:0] Din[31:0] Dout[31:0] PC+4

CMPE110 Fall 2006 A. Di Blas Advanced pipeline concepts: ILP, Deep pipeline, Static multiple issue, Loop unrolling, VLIW, Dynamic multiple issue Textbook Edition:

### COSC121: Computer Systems. ISA and Performance

COSC121: Computer Systems. ISA and Performance Jeremy Bolton, PhD Assistant Teaching Professor Constructed using materials: - Patt and Patel Introduction to Computing Systems (2nd) - Patterson and Hennessy

### Basic Instruction Timings. Pipelining 1. How long would it take to execute the following sequence of instructions?

Basic Instruction Timings Pipelining 1 Making some assumptions regarding the operation times for some of the basic hardware units in our datapath, we have the following timings: Instruction class Instruction

### Pipeline Overview. Dr. Jiang Li. Adapted from the slides provided by the authors. Jiang Li, Ph.D. Department of Computer Science

Pipeline Overview Dr. Jiang Li Adapted from the slides provided by the authors Outline MIPS An ISA for Pipelining 5 stage pipelining Structural and Data Hazards Forwarding Branch Schemes Exceptions and

### COMPUTER ORGANIZATION AND DESIGN. The Hardware/Software Interface. Chapter 4. The Processor: A Based on P&H

COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface Chapter 4 The Processor: A Based on P&H Introduction We will examine two MIPS implementations A simplified version A more realistic pipelined

### Lecture 3. Pipelining. Dr. Soner Onder CS 4431 Michigan Technological University 9/23/2009 1

Lecture 3 Pipelining Dr. Soner Onder CS 4431 Michigan Technological University 9/23/2009 1 A "Typical" RISC ISA 32-bit fixed format instruction (3 formats) 32 32-bit GPR (R0 contains zero, DP take pair)

### ENE 334 Microprocessors

ENE 334 Microprocessors Lecture 6: Datapath and Control : Dejwoot KHAWPARISUTH Adapted from Computer Organization and Design, 3rd & 4th Edition, Patterson & Hennessy, 2005/2008, Elsevier (MK) http://webstaff.kmutt.ac.th/~dejwoot.kha/

### Beyond Pipelining. CP-226: Computer Architecture. Lecture 23 (19 April 2013) CADSL

Beyond Pipelining Virendra Singh Associate Professor Computer Architecture and Dependable Systems Lab Department of Electrical Engineering Indian Institute of Technology Bombay http://www.ee.iitb.ac.in/~viren/

### Chapter 5. The Processor. Islamic University of Gaza 2009/2010

Chapter 5 The Processor Husam Alzaq Islamic University of Gaza 2009/2010 Introduction CPU performance factors Instruction count Determined by ISA and compiler CPI and Cycle time Determined by CPU hardware

### CISC 662 Graduate Computer Architecture Lecture 6 - Hazards

CISC 662 Graduate Computer Architecture Lecture 6 - Hazards Michela Taufer http://www.cis.udel.edu/~taufer/teaching/cis662f07 Powerpoint Lecture Notes from John Hennessy and David Patterson's: Computer

### Systems Architecture

Systems Architecture Lecture 15: A Simple Implementation of MIPS Jeremy R. Johnson Anatole D. Ruslanov William M. Mongan Some or all figures from Computer Organization and Design: The Hardware/Software

### EI338: Computer Systems and Engineering (Computer Architecture & Operating Systems)

EI338: Computer Systems and Engineering (Computer Architecture & Operating Systems) Chentao Wu 吴晨涛 Associate Professor Dept. of Computer Science and Engineering Shanghai Jiao Tong University SEIEE Building

### Midnight Laundry. IC220 Set #19: Laundry, Co-dependency, and other Hazards of Modern (Architecture) Life. Return to Chapter 4

IC220 Set #19: Laundry, Co-dependency, and other Hazards of Modern (Architecture) Life Return to Chapter 4 Midnight Laundry (timeline figure: task order A B C D, starting at 6 PM) Smarty Laundry (timeline figure: task order A B C D, starting at 6 PM)

### Computer Architecture

Lecture 3: Pipelining Iakovos Mavroidis Computer Science Department University of Crete 1 Previous Lecture Measurements and metrics : Performance, Cost, Dependability, Power Guidelines and principles in

### Major CPU Design Steps

Datapath Major CPU Design Steps. Analyze instruction set operations using independent RTN ISA => RTN => datapath requirements. This provides the required datapath components and how they are connected

Advanced Computer Architecture Chapter 1 Introduction into the Sequential and Pipeline Instruction Execution Martin Milata What is a Processors Architecture Instruction Set Architecture (ISA) Describes

### Lecture 5 and 6. ICS 152 Computer Systems Architecture. Prof. Juan Luis Aragón

ICS 152 Computer Systems Architecture Prof. Juan Luis Aragón Lecture 5 and 6 Multicycle Implementation Introduction to Microprogramming Readings: Sections 5.4 and 5.5 1 Review of Last Lecture We have seen

### CS 61C: Great Ideas in Computer Architecture. Lecture 13: Pipelining. Krste Asanović & Randy Katz

CS 61C: Great Ideas in Computer Architecture Lecture 13: Pipelining Krste Asanović & Randy Katz http://inst.eecs.berkeley.edu/~cs61c/fa17 RISC-V Pipeline Pipeline Control Hazards Structural Data R-type

### CS/COE0447: Computer Organization

CS/COE0447: Computer Organization and Assembly Language Datapath and Control Sangyeun Cho Dept. of Computer Science A simple MIPS We will design a simple MIPS processor that supports a small instruction

### Instruction Pipelining Review

Instruction Pipelining Review Instruction pipelining is a CPU implementation technique where multiple operations on a number of instructions are overlapped. An instruction execution pipeline involves a number

### Lecture 05: Pipelining: Basic/ Intermediate Concepts and Implementation

Lecture 05: Pipelining: Basic/ Intermediate Concepts and Implementation CSE 564 Computer Architecture Summer 2017 Department of Computer Science and Engineering Yonghong Yan yan@oakland.edu www.secs.oakland.edu/~yan

### Pipeline Hazards. Jin-Soo Kim Computer Systems Laboratory Sungkyunkwan University

Pipeline Hazards Jin-Soo Kim (jinsookim@skku.edu) Computer Systems Laboratory Sungkyunkwan University http://csl.skku.edu Hazards What are hazards? Situations that prevent starting the next instruction

### The Processor (3) Jinkyu Jeong Computer Systems Laboratory Sungkyunkwan University

The Processor (3) Jinkyu Jeong (jinkyu@skku.edu) Computer Systems Laboratory Sungkyunkwan University http://csl.skku.edu EEE3050: Theory on Computer Architectures, Spring 2017, Jinkyu Jeong (jinkyu@skku.edu)

### ALUOut. Registers A. I + D Memory IR. combinatorial block. combinatorial block. combinatorial block MDR

CMPE110 Fall 2006 A. Di Blas Multicycle CPU: From single-cycle to multicycle CPU with sequential control: Finite State Machine, Microprogramming, Exceptions and interrupts Textbook Edition: 5.4,

### LECTURE 6. Multi-Cycle Datapath and Control

LECTURE 6 Multi-Cycle Datapath and Control SINGLE-CYCLE IMPLEMENTATION As we've seen, single-cycle implementation, although easy to implement, could potentially be very inefficient. In single-cycle, we

### ILP concepts (2.1) Basic compiler techniques (2.2) Reducing branch costs with prediction (2.3) Dynamic scheduling (2.4 and 2.5)

Instruction-Level Parallelism and its Exploitation: PART 1 ILP concepts (2.1) Basic compiler techniques (2.2) Reducing branch costs with prediction (2.3) Dynamic scheduling (2.4 and 2.5) Project and Case

### The Processor: Datapath & Control

Chapter Five 1 The Processor: Datapath & Control We're ready to look at an implementation of the MIPS Simplified to contain only: memory-reference instructions: lw, sw arithmetic-logical instructions:

### Chapter 4. The Processor

Chapter 4 The Processor Introduction CPU performance factors Instruction count Determined by ISA and compiler CPI and Cycle time Determined by CPU hardware 4.1 Introduction We will examine two MIPS implementations

### Control Hazards - branching causes problems since the pipeline can be filled with the wrong instructions.

Control Hazards - branching causes problems since the pipeline can be filled with the wrong instructions Stage Instruction Fetch Instruction Decode Execution / Effective addr Memory access Write-back Abbreviation

### ECE154A Introduction to Computer Architecture. Homework 4 solution

ECE154A Introduction to Computer Architecture Homework 4 solution 4.16.1 According to Figure 4.65 on the textbook, each register located between two pipeline stages keeps data shown below. Register IF/ID

### MIPS An ISA for Pipelining

Pipelining: Basic and Intermediate Concepts Slides by: Muhamed Mudawar CS 282 KAUST Spring 2010 Outline: MIPS An ISA for Pipelining 5 stage pipelining Structural Hazards Data Hazards & Forwarding Branch

### RISC Architecture: Multi-Cycle Implementation

RISC Architecture: Multi-Cycle Implementation Virendra Singh Associate Professor Computer Architecture and Dependable Systems Lab Department of Electrical Engineering Indian Institute of Technology Bombay

### CSE 2021 COMPUTER ORGANIZATION

CSE 2021 COMPUTER ORGANIZATION HUGH CHESSER CSEB 1012U W10-M Agenda Topics: 1. Multiple cycle implementation review 2. State Machine 3. Control Unit implementation for Multi-cycle

### CS 251, Winter 2018, Assignment % of course mark

CS 251, Winter 2018, Assignment 5.0.4 3% of course mark Due Wednesday, March 21st, 4:30PM Lates accepted until 10:00am March 22nd with a 15% penalty 1. (10 points) The code sequence below executes on a

### Handling Data Hazards The objectives of this module are to discuss how data hazards are handled in general and also in the MIPS architecture.

Handling Data Hazards The objectives of this module are to discuss how data hazards are handled in general and also in the MIPS architecture. We have already discussed in the previous module that true

### EITF20: Computer Architecture Part2.2.1: Pipeline-1

EITF20: Computer Architecture Part2.2.1: Pipeline-1 Liang Liu liang.liu@eit.lth.se 1 Outline Reiteration Pipelining Hazards Structural hazards Data hazards Control hazards Implementation issues Multi-cycle

### Chapter 5 Solutions: For More Practice

Chapter 5 Solutions: For More Practice 1 Chapter 5 Solutions: For More Practice 5.4 Fetching, reading registers, and writing the destination register takes a total of 300ps for both floating point add/subtract

### Improve performance by increasing instruction throughput

Improve performance by increasing instruction throughput (Figure: program execution order, in instructions, against time for lw \$1, 100(\$0) and lw \$2, 200(\$0); stages shown: instruction fetch, ALU, data access; an 8 ns interval is marked)

### Inf2C - Computer Systems Lecture 12 Processor Design Multi-Cycle

Inf2C - Computer Systems Lecture 12 Processor Design Multi-Cycle Boris Grot School of Informatics University of Edinburgh Previous lecture: single-cycle processor Inf2C Computer Systems - 2017-2018. Boris

Computer Architecture A Quantitative Approach, Fifth Edition Chapter 3 Instruction-Level Parallelism and Its Exploitation 1 Branch Prediction Basic 2-bit predictor: For each branch: Predict taken or not