# Computer and Information Sciences College / Computer Science Department Enhancing Performance with Pipelining

## Transcription

1 Computer and Information Sciences College / Computer Science Department Enhancing Performance with Pipelining

2 Single-Cycle Design Problems

Assuming a fixed-period clock, every instruction's datapath uses one clock cycle. This implies:

- CPI = 1.
- Cycle time is determined by the length of the longest instruction path (load).
- Several instructions could run in a shorter clock cycle, so time is wasted. Consider what happens with more complicated instructions such as floating point!
- Resources used more than once in the same cycle need to be duplicated, which wastes hardware and chip area.

[Figure: single-cycle instruction path through the IF, ID, EX, MEM, WB steps (instruction memory, register file, ALU, data memory, register file).]

3 Ex.: Fixed-period clock vs. variable-period clock in a single-cycle implementation

Consider a machine with an additional floating-point unit. Assume the following functional unit delays (multiplexors, control unit, PC accesses, sign extension, and wires have no delay):

| Memory | ALU | FP add | FP mul | Register |
|---|---|---|---|---|
| 2 ns | 2 ns | 8 ns | 16 ns | 1 ns |

Assume the following instruction mix:

| lw | sw | R-format | beq | j | FP add | FP mul |
|---|---|---|---|---|---|---|
| 31% | 21% | 27% | 5% | 2% | 7% | 7% |

Compare the performance of (a) a single-cycle implementation using a fixed-period clock with (b) one using a variable-period clock, where each instruction executes in one clock cycle that is only as long as it needs to be (not really practical, but pretend it's possible!).

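4 Solution

| Instruction class | Instr. mem. | Register read | ALU oper. | Data mem. | Register write | FPU add/sub | FPU mul/div | Total time (ns) |
|---|---|---|---|---|---|---|---|---|
| Load word | 2 | 1 | 2 | 2 | 1 | | | 8 |
| Store word | 2 | 1 | 2 | 2 | | | | 7 |
| R-format | 2 | 1 | 2 | | 1 | | | 6 |
| Branch | 2 | 1 | 2 | | | | | 5 |
| Jump | 2 | | | | | | | 2 |
| FP mul/div | 2 | 1 | | | 1 | | 16 | 20 |
| FP add/sub | 2 | 1 | | | 1 | 8 | | 12 |

Clock period for the fixed-period clock = longest instruction time = 20 ns.

Average clock period for the variable-period clock = 8 × 31% + 7 × 21% + 6 × 27% + 5 × 5% + 2 × 2% + 20 × 7% + 12 × 7% = 8.1 ns.

Therefore, performance(var-period) / performance(fixed-period) = 20 / 8.1 ≈ 2.5, since T = Ic × CPI × t with the same Ic and the same CPI in both designs.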
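The comparison can be checked numerically. The following is a minimal sketch, assuming each instruction class passes through the functional units listed in `PATH` (that mapping is an assumption read off the standard single-cycle datapath; the delays and the mix are the ones stated in the example).

```python
# Functional-unit delays in ns, as stated in the example.
DELAY = {"mem": 2, "alu": 2, "fp_add": 8, "fp_mul": 16, "reg": 1}

# Units used by each instruction class (assumed from the standard datapath).
PATH = {
    "lw":     ["mem", "reg", "alu", "mem", "reg"],
    "sw":     ["mem", "reg", "alu", "mem"],
    "r":      ["mem", "reg", "alu", "reg"],
    "beq":    ["mem", "reg", "alu"],
    "j":      ["mem"],
    "fp_add": ["mem", "reg", "fp_add", "reg"],
    "fp_mul": ["mem", "reg", "fp_mul", "reg"],
}

# Instruction mix from the example (fractions sum to 1).
MIX = {"lw": 0.31, "sw": 0.21, "r": 0.27, "beq": 0.05,
       "j": 0.02, "fp_add": 0.07, "fp_mul": 0.07}

# Time each class needs: sum of the delays along its path.
times = {cls: sum(DELAY[unit] for unit in units) for cls, units in PATH.items()}

# Fixed clock: every instruction pays for the slowest class.
fixed_period = max(times.values())

# Variable clock: each instruction pays only for its own class.
variable_period = sum(MIX[cls] * times[cls] for cls in MIX)

speedup = fixed_period / variable_period
print(times, fixed_period, round(variable_period, 2), round(speedup, 2))
```

Because the instruction count and CPI are identical in both designs, the ratio of clock periods is the performance ratio.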

5 Fixing the problem with single-cycle designs

I. One solution: a variable-period clock with different cycle times for each instruction class. This is infeasible, as implementing a variable-speed clock is technically difficult.

Another solution: use a smaller cycle time and have different instructions take different numbers of cycles, by breaking instructions into steps and fitting each step into one cycle.

II. Multicycle approach:

- Break up the instructions into steps; each step takes one clock cycle.
- At the end of one cycle, store the data to be used in later cycles of the same instruction.
- Balance the amount of work to be done in each step/cycle so that the steps are about equal.
- Restrict each cycle to use each major functional unit at most once, so that such units do not have to be replicated.
- Functional units can be shared between different cycles within one instruction.

6 Multicycle Approach

[Figure: multicycle datapath with PC, a single memory (instruction or data), instruction register, memory data register, register file, A and B registers, ALU, and ALUOut.]

Note the particularities of the multicycle diagram vs. the single-cycle diagram:

- a single memory for data and instructions
- a single ALU, no extra adders
- extra registers to hold data between clock cycles

7 Breaking instructions into steps

We break instructions into steps:

- not all instructions require all the steps
- each step takes one clock cycle
- each MIPS instruction takes from 3 to 5 cycles (steps)

To keep steps balanced in length, the design restriction is to allow each step to contain at most one ALU operation, or one register access, or one memory access.

1. IF: Instruction fetch and PC increment
2. ID: Instruction decode and register fetch
3. EX: Execution, memory address computation, or branch completion
4. MEM: Memory access or R-type instruction completion
5. WB: Memory read completion

| Step name | Action for R-type instructions | Action for memory-reference instructions | Action for branches | Action for jumps |
|---|---|---|---|---|
| Instruction fetch | IR = Memory[PC]; PC = PC + 4 | (same) | (same) | (same) |
| Instruction decode/register fetch | A = Reg[IR[25-21]]; B = Reg[IR[20-16]]; ALUOut = PC + (sign-extend(IR[15-0]) << 2) | (same) | (same) | (same) |
| Execution, address computation, branch/jump completion | ALUOut = A op B | ALUOut = A + sign-extend(IR[15-0]) | if (A == B) then PC = ALUOut | PC = PC[31-28] concatenated with (IR[25-0] << 2) |
| Memory access or R-type completion | Reg[IR[15-11]] = ALUOut | Load: MDR = Memory[ALUOut], or Store: Memory[ALUOut] = B | | |
| Memory read completion | | Load: Reg[IR[20-16]] = MDR | | |
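The step table fixes how many cycles each instruction class needs, and from that an average CPI follows. Below is a minimal sketch; the step names come from the list above, while the instruction mix is a hypothetical one chosen only for illustration.

```python
STEPS = ["IF", "ID", "EX", "MEM", "WB"]

# Last step in which each class still has an action, read off the step table.
LAST_STEP = {
    "R-type": "MEM",   # result written to the register file
    "load":   "WB",    # memory read completion
    "store":  "MEM",   # memory write
    "branch": "EX",    # branch completion
    "jump":   "EX",    # jump completion
}

def cycles(instr_class):
    """Clock cycles a class needs: one per step up to its last active step."""
    return STEPS.index(LAST_STEP[instr_class]) + 1

def average_cpi(mix):
    """Weighted average CPI for a mix given as {class: fraction}."""
    return sum(fraction * cycles(cls) for cls, fraction in mix.items())

# Hypothetical mix, not taken from the slides.
example_mix = {"load": 0.25, "store": 0.10, "R-type": 0.50,
               "branch": 0.13, "jump": 0.02}

print({cls: cycles(cls) for cls in LAST_STEP})
print(round(average_cpi(example_mix), 2))
```

The multicycle design trades a CPI above 1 for a much shorter clock cycle than the single-cycle design.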

8 Pipelining

Pipelining is an implementation technique in which multiple instructions are overlapped in execution. Start work ASAP! Do not waste time!

Laundry analogy: tasks A, B, C, D each consist of wash, dry, fold, and store, and each step is assumed to take 30 min. Separate steps use separate hardware, so they can be overlapped.

[Figure: task order A to D vs. time starting at 6 PM, first not pipelined, then pipelined.]

Why is pipelining easy with MIPS?

1. All instructions are the same length, so the fetch and decode stages are similar for all instructions.
2. There are few instruction formats, which simplifies instruction decode and makes it possible in one stage.
3. Memory operands appear only in loads/stores, so memory access can be deferred to exactly one later stage.
4. Operands are aligned in memory, so one data transfer instruction requires one memory access stage.

What about x86? (Instructions are 1 to 17 bytes long.)

9 Pipelined Execution Representation

[Figure: five instructions in program-flow order, each passing through IFtch, Dcd, Exec, Mem, WB, with each instruction starting one clock cycle after the previous one.]

- To simplify the pipeline, every instruction takes the same number of steps, called stages.
- One clock cycle per stage.

10 Pipelined vs. Single-Cycle Instruction Execution: the Plan

[Figure: program execution order vs. time for three consecutive lw instructions (loading \$1, \$2, and \$3). Single-cycle: each instruction (instruction fetch, register read, ALU, data access, register write) occupies 8 ns before the next one begins. Pipelined: a new instruction starts every 2 ns.]

Assume 2 ns for a memory access or an ALU operation and 1 ns for a register access. The single-cycle clock is therefore 8 ns, and the pipelined clock cycle is 2 ns.

Assume the write to the register file occurs in the first half of the clock cycle and the read in the second half.
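The total times asked for in the figure ("T?") follow from the two clock periods. This is a minimal sketch assuming a five-stage pipeline with no stalls; the function names are illustrative.

```python
def single_cycle_time(n_instructions, cycle_ns=8):
    """Each instruction takes one long cycle, one after another."""
    return n_instructions * cycle_ns

def pipelined_time(n_instructions, stages=5, cycle_ns=2):
    """The first instruction needs `stages` cycles; each later one adds a cycle."""
    return (stages + n_instructions - 1) * cycle_ns

for n in (3, 1_000_000):
    t_single = single_cycle_time(n)
    t_pipe = pipelined_time(n)
    print(n, t_single, t_pipe, round(t_single / t_pipe, 3))
```

For three loads the pipeline is only partly filled, so the gain is modest; for long instruction streams the speedup approaches the ratio of the clock periods, 8 ns / 2 ns = 4.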

11 Hazards

What makes pipelining hard?

- Structural hazards: different instructions, at different stages in the pipeline, want to use the same hardware resource.
- Control hazards: deciding on a control action depends on a previous instruction.
- Data hazards: an instruction in the pipeline requires data that is computed by a previous instruction still in the pipeline.

We first briefly examine these potential hazards individually.

12 Structural Hazards

Structural hazard: inadequate hardware to simultaneously support all instructions in the pipeline in the same clock cycle.

E.g., suppose the pipeline has a single instruction and data memory with one read port. There is then a structural hazard between the first and fourth instructions: the load reads data memory in the same clock cycle in which the fourth instruction is fetched.

[Figure: Load followed by Instr 1 to Instr 4 vs. time in clock cycles, each passing through memory, register file, ALU, memory, register file.]

Structural hazards are easy to avoid! Hazards can always be resolved by waiting.
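The conflict can be located mechanically. The sketch below assumes a five-stage pipeline, one instruction issued per cycle with no stalls, and a single one-port memory shared by IF and MEM; the opcode list is a made-up example.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]

def memory_conflicts(program):
    """Find cycles where a load/store in MEM collides with a fetch in IF.

    `program` is a list of opcode strings. Instruction i (0-based) is in IF
    during cycle i + 1. Returns (cycle, fetching_index, accessing_index) tuples.
    """
    conflicts = []
    for i, opcode in enumerate(program):
        if opcode in ("lw", "sw"):
            mem_cycle = i + 1 + STAGES.index("MEM")   # cycle in which i is in MEM
            fetching = mem_cycle - 1                   # instruction in IF that cycle
            if fetching < len(program):
                conflicts.append((mem_cycle, fetching, i))
    return conflicts

# A load followed by four other instructions, as in the slide's figure.
print(memory_conflicts(["lw", "add", "sub", "or", "and"]))
```

Separate instruction and data memories (or caches) remove this hazard, which is why the pipelined datapath in later slides keeps both.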

13 Control Hazards

Control hazard: need to make a decision based on the result of a previous instruction still executing in the pipeline.

Solution 1: Stall the pipeline.

[Figure: program execution order vs. time for an add, a beq, and a lw. The lw is delayed by one bubble, so it is fetched 4 ns after the beq instead of 2 ns.]

Note that the branch outcome is computed in the ID stage with added hardware (covered later).

14 Control Hazards

Solution 2: Predict the branch outcome.

- E.g., predict branch-not-taken: guess one direction, then back up if wrong.
- Random prediction: correct 50% of the time.
- History-based prediction: record the recent history of each branch; correct 90% of the time.

[Figure, prediction success: `add $4, $5, $6`, `beq $1, $2, 40`, and `lw $3, 300($0)` are fetched 2 ns apart with no bubble.]

[Figure, prediction failure: after `add $4, $5, $6` and `beq $1, $2, 40`, the lw is undone (flushed), leaving a bubble, and `or $7, $8, $9` is fetched 4 ns after the beq.]
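To make the difference between static and history-based prediction concrete, here is a minimal sketch. The 2-bit saturating counter is one common history-based scheme and is an assumption here (the slides do not name a specific predictor), and the branch trace is hypothetical.

```python
def not_taken_accuracy(outcomes):
    """Static prediction: always guess 'not taken'."""
    return sum(1 for taken in outcomes if not taken) / len(outcomes)

def two_bit_accuracy(outcomes, state=0):
    """2-bit saturating counter: states 0-1 predict not taken, 2-3 predict taken."""
    correct = 0
    for taken in outcomes:
        prediction = state >= 2
        if prediction == taken:
            correct += 1
        # Move one step toward the actual outcome, saturating at 0 and 3.
        state = min(state + 1, 3) if taken else max(state - 1, 0)
    return correct / len(outcomes)

# Hypothetical trace: a loop branch taken 9 times, then not taken, run 10 times.
trace = ([True] * 9 + [False]) * 10

print(not_taken_accuracy(trace), two_bit_accuracy(trace))
```

On loop-like traces the counter mispredicts roughly once per loop exit after warming up, which is the kind of behavior behind the high accuracy quoted for history-based prediction.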

15 Control Hazards

Solution 3: Delayed branch.

- Always execute the sequentially next statement, with the branch taking effect after a one-instruction delay.
- It is the compiler's job to find a statement that can be put in the slot and that is independent of the branch outcome.
- MIPS does this, but it is an option in SPIM (Simulator -> Settings).

[Figure: program execution order vs. time for a beq, then `add $4, $5, $6` in the delayed branch slot, then a lw, each fetched 2 ns apart.]

Delayed branch: the beq is followed by an add that is independent of the branch outcome.

16 Data Hazards

Data hazard: an instruction depends on the result of a previous instruction still executing in the pipeline.

Solution: forward data if possible.

[Figure: instruction pipeline diagram for `add $s0, $t0, $t1` (IF ID EX MEM WB) followed by `sub $t2, $s0, $t3`. Shading indicates use: left half = write, right half = read.]

Without forwarding (blue line in the figure), data would have to go back in time; with forwarding (red line), data is available in time.

17 Data Hazards

Forwarding may not be enough, e.g., if an R-type instruction following a load uses the result of the load. This is called a load-use data hazard.

[Figure: `lw $s0, 20($t1)` followed immediately by `sub $t2, $s0, $t3`, each passing through IF ID EX MEM WB.] Without a stall, it is impossible to provide the input to the sub instruction in time.

[Figure: `lw $s0, 20($t1)`, then a bubble, then `sub $t2, $s0, $t3`.] With a one-stage stall, forwarding can get the data to the sub instruction in time.
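The number of stall cycles in these cases can be derived from when a value becomes available and when the consumer needs it. The sketch below assumes the five-stage pipeline, the register file written in the first half of a cycle and read in the second half, and forwarding into the consumer's EX stage; the function and parameter names are illustrative.

```python
# Cycle in which an instruction issued in cycle 1 occupies each stage.
STAGE_CYCLE = {"IF": 1, "ID": 2, "EX": 3, "MEM": 4, "WB": 5}

def stalls_needed(producer, forwarding, distance=1):
    """Stall cycles for a consumer `distance` instructions after the producer.

    producer: "alu" (result ready at end of EX) or "load" (ready at end of MEM).
    """
    if forwarding:
        ready_stage = "EX" if producer == "alu" else "MEM"
        # The consumer's EX must come in a cycle after the value is produced.
        earliest_use = STAGE_CYCLE[ready_stage] + 1
        unstalled_use = STAGE_CYCLE["EX"] + distance
    else:
        # The consumer's ID may share the cycle of the producer's WB
        # (write in first half, read in second half).
        earliest_use = STAGE_CYCLE["WB"]
        unstalled_use = STAGE_CYCLE["ID"] + distance
    return max(0, earliest_use - unstalled_use)

print(stalls_needed("alu", forwarding=True))    # add followed by sub
print(stalls_needed("load", forwarding=True))   # load-use hazard
print(stalls_needed("alu", forwarding=False))
```

Under these assumptions forwarding removes the stall for ALU results entirely, while a load followed immediately by a dependent instruction still costs one bubble.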

18 Reordering Code to Avoid Pipeline Stall (Software Solution)

Example:

- `lw $t0, 0($t1)`
- `lw $t2, 4($t1)`
- `sw $t2, 0($t1)` (data hazard: uses \$t2 immediately after it is loaded)
- `sw $t0, 4($t1)`

Reordered code (the two stores are interchanged):

- `lw $t0, 0($t1)`
- `lw $t2, 4($t1)`
- `sw $t0, 4($t1)`
- `sw $t2, 0($t1)`
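The effect of the reordering can be checked with a small stall counter. This is a minimal sketch under a simplified model: an instruction that reads a register loaded by the immediately preceding lw costs one stall. Only lw and sw are parsed, which is enough for this example.

```python
import re

def parse(line):
    """Return (opcode, destination or None, set of source registers)."""
    opcode, operands = line.split(None, 1)
    regs = re.findall(r"\$\w+", operands)
    if opcode == "lw":
        return opcode, regs[0], {regs[1]}      # lw rt, offset(rs): writes rt
    if opcode == "sw":
        return opcode, None, set(regs)         # sw rt, offset(rs): reads both
    raise ValueError(f"unsupported opcode: {opcode}")

def load_use_stalls(program):
    """Count adjacent pairs where a load's result is used by the next instruction."""
    stalls = 0
    for previous, current in zip(program, program[1:]):
        prev_op, prev_dest, _ = parse(previous)
        _, _, sources = parse(current)
        if prev_op == "lw" and prev_dest in sources:
            stalls += 1
    return stalls

original = ["lw $t0, 0($t1)", "lw $t2, 4($t1)", "sw $t2, 0($t1)", "sw $t0, 4($t1)"]
reordered = ["lw $t0, 0($t1)", "lw $t2, 4($t1)", "sw $t0, 4($t1)", "sw $t2, 0($t1)"]

print(load_use_stalls(original), load_use_stalls(reordered))
```

Swapping the stores puts an independent instruction between the second load and its use, so the stall disappears without adding any instructions.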

19 Pipelined Datapath - Single-Cycle Datapath Steps

[Figure: single-cycle datapath (PC, instruction memory, register file, sign extend, shift left 2, adders, ALU with Zero output, data memory, multiplexors) divided into five steps.]

- IF: Instruction Fetch
- ID: Instruction Decode
- EX: Execute / Address Calculation
- MEM: Memory Access
- WB: Write Back

20 Pipelined Datapath

Idea: What happens if we break the execution into multiple cycles, but keep the extra hardware?

Answer: We may be able to start executing a new instruction at each clock cycle (pipelining), but we shall need extra registers to hold data between cycles: the pipeline registers.

[Figure: pipelined datapath with pipeline registers IF/ID, ID/EX, EX/MEM, and MEM/WB, each wide enough to hold the data coming in.]

21 Pipelined Datapath

Pipeline registers are wide enough to hold the data coming in:

| IF/ID | ID/EX | EX/MEM | MEM/WB |
|---|---|---|---|
| 64 bits | 128 bits | 97 bits | 64 bits |

[Figure: pipelined datapath with the four pipeline registers.]

Only data flowing right to left may cause a hazard. Why?

22 Bug in the Datapath

The write register number comes from another, later instruction!

[Figure: pipelined datapath with IF/ID, ID/EX, EX/MEM, and MEM/WB registers, in which the register file's write register number (WN) is taken directly from the instruction currently in the ID stage.]

23 Corrected Datapath

The destination register number is also passed through the ID/EX, EX/MEM, and MEM/WB registers, which are now wider by 5 bits:

| IF/ID | ID/EX | EX/MEM | MEM/WB |
|---|---|---|---|
| 64 bits | 133 bits | 102 bits | 69 bits |

[Figure: corrected pipelined datapath.]
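The register widths can be reproduced by adding up the fields each pipeline register carries. The field breakdown below is an assumption based on the standard textbook datapath (the slides give only the totals); it does match the stated widths before and after the fix.

```python
# Fields held by each pipeline register, in bits (assumed breakdown).
FIELDS = {
    "IF/ID":  {"PC+4": 32, "instruction": 32},
    "ID/EX":  {"PC+4": 32, "read data 1": 32, "read data 2": 32,
               "sign-extended immediate": 32},
    "EX/MEM": {"branch target": 32, "Zero flag": 1, "ALU result": 32,
               "read data 2": 32},
    "MEM/WB": {"memory read data": 32, "ALU result": 32},
}

DEST_REG_BITS = 5  # a MIPS register number is 5 bits

def width(register, with_dest=False):
    """Total width; the fix adds the destination register number after ID."""
    extra = DEST_REG_BITS if with_dest and register != "IF/ID" else 0
    return sum(FIELDS[register].values()) + extra

for name in FIELDS:
    print(name, width(name), width(name, with_dest=True))
```

IF/ID does not grow because the destination register number is still part of the instruction word at that point.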

24 Single-Clock-Cycle Diagram: Clock Cycle 1 LW

Example: `lw $t0, 10($t1)`; `sw $t3, 20($t4)`; `add $t5, $t6, $t7`; `sub $t8, $t9, $t10`

25 Single-Clock-Cycle Diagram: Clock Cycle 2 SW LW

26 Single-Clock-Cycle Diagram: Clock Cycle 3 ADD SW LW

27 Single-Clock-Cycle Diagram: Clock Cycle 4 SUB ADD SW LW

28 Single-Clock-Cycle Diagram: Clock Cycle 5 SUB ADD SW LW

29 Single-Clock-Cycle Diagram: Clock Cycle 6 SUB ADD SW

30 Single-Clock-Cycle Diagram: Clock Cycle 7 SUB ADD

31 Single-Clock-Cycle Diagram: Clock Cycle 8 SUB

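32 Alternative View: Multiple-Clock-Cycle Diagram

| Instruction | CC 1 | CC 2 | CC 3 | CC 4 | CC 5 | CC 6 | CC 7 | CC 8 |
|---|---|---|---|---|---|---|---|---|
| `lw $t0, 10($t1)` | IM | REG | ALU | DM | REG | | | |
| `sw $t3, 20($t4)` | | IM | REG | ALU | DM | REG | | |
| `add $t5, $t6, $t7` | | | IM | REG | ALU | DM | REG | |
| `sub $t8, $t9, $t10` | | | | IM | REG | ALU | DM | REG |

The time axis runs left to right; program order runs top to bottom.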
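Both views (one diagram per clock cycle, or one multi-cycle chart) come from the same rule: instruction i enters the pipeline in cycle i + 1 and advances one stage per cycle. A minimal sketch, assuming no stalls and using the stage names IF, ID, EX, MEM, WB:

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]
PROGRAM = ["lw", "sw", "add", "sub"]

def occupancy(cycle, program=PROGRAM):
    """Map each busy stage to the instruction in it during a 1-based clock cycle."""
    busy = {}
    for index, name in enumerate(program):
        stage_index = cycle - 1 - index
        if 0 <= stage_index < len(STAGES):
            busy[STAGES[stage_index]] = name
    return busy

def total_cycles(program=PROGRAM):
    """Cycles until the last instruction leaves WB."""
    return len(STAGES) + len(program) - 1

for cc in range(1, total_cycles() + 1):
    print(f"CC {cc}:", occupancy(cc))
```

Each printed line corresponds to one single-clock-cycle diagram, and stacking the lines gives the multiple-clock-cycle chart.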

33 Notes

- There is no write control for the pipeline registers and the PC, since they are updated at every clock cycle.
- To specify the control for the pipeline, set the control values during each pipeline stage.
- Control lines can be divided into 5 groups:

| Stage | Control lines |
|---|---|
| IF | none |
| ID | none |
| EX | RegDst, ALUOp, ALUSrc |
| MEM | Branch, MemRead, MemWrite |
| WB | MemtoReg, RegWrite |

- Group these nine control lines into 3 subsets: ALUControl, MEMControl, WBControl.
- Control signals are generated at the ID stage. How are they passed to the other stages?
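One common answer, following the standard textbook design, is to carry the control values along in the pipeline registers and drop each group once its stage has used it. The sketch below illustrates that idea; the signal values shown for lw are illustrative assumptions, and the helper names are made up for this example.

```python
EX_SIGNALS = ("RegDst", "ALUOp", "ALUSrc")
MEM_SIGNALS = ("Branch", "MemRead", "MemWrite")
WB_SIGNALS = ("MemtoReg", "RegWrite")

def split_control(control):
    """Group the control lines produced in ID by the stage that consumes them."""
    def pick(names):
        return {name: control[name] for name in names}
    return {"EX": pick(EX_SIGNALS), "MEM": pick(MEM_SIGNALS), "WB": pick(WB_SIGNALS)}

def advance(id_ex):
    """Contents of EX/MEM and MEM/WB control fields as the instruction moves on."""
    ex_mem = {stage: v for stage, v in id_ex.items() if stage != "EX"}
    mem_wb = {stage: v for stage, v in ex_mem.items() if stage != "MEM"}
    return ex_mem, mem_wb

# Illustrative control values for lw (assumed, not taken from the slides).
LW_CONTROL = {"RegDst": 0, "ALUOp": 0b00, "ALUSrc": 1,
              "Branch": 0, "MemRead": 1, "MemWrite": 0,
              "MemtoReg": 1, "RegWrite": 1}

id_ex = split_control(LW_CONTROL)
ex_mem, mem_wb = advance(id_ex)
print(sorted(id_ex), sorted(ex_mem), sorted(mem_wb))
```

The control fields shrink from stage to stage, so only the WB group has to travel all the way to MEM/WB.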

36 Instruction-Level Parallelism (ILP)

Pipelining: executing multiple instructions in parallel.

To increase ILP:

- Deeper pipeline
  - Less work per stage, so a shorter clock cycle
- Multiple issue
  - Replicate pipeline stages to get multiple pipelines
  - Start multiple instructions per clock cycle
  - CPI < 1, so use Instructions Per Cycle (IPC)
  - E.g., 4 GHz 4-way multiple-issue: 16 BIPS, peak CPI = 0.25, peak IPC = 4
  - But dependencies reduce this in practice
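The peak figures in the example follow directly from the clock rate and the issue width. A minimal sketch with illustrative function names:

```python
def peak_ipc(issue_width):
    """At best, every issue slot is filled every cycle."""
    return issue_width

def peak_cpi(issue_width):
    """CPI is the reciprocal of IPC."""
    return 1 / issue_width

def peak_bips(clock_ghz, issue_width):
    """Billions of instructions per second at peak issue rate."""
    return clock_ghz * issue_width

print(peak_ipc(4), peak_cpi(4), peak_bips(4, 4))
```

These are upper bounds; any cycle in which a slot cannot be filled because of a dependence lowers the sustained rate.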

37 How ILP Works

- Issuing multiple instructions per cycle requires fetching multiple instructions from memory per cycle. The number issued per cycle is called the superscalar degree or issue width.
- To find independent instructions, we must have a big pool of instructions to choose from, called the instruction buffer (IB). As the IB length increases, the complexity of the decoder (control) increases, which increases the datapath cycle time.
- Instructions are prefetched sequentially by an instruction fetch unit (IFU) that operates independently of the datapath control. It fetches the instruction at (PC) + L, where L is the IB size, or as directed by the branch predictor.

38 Compiler/Hardware Speculation

Compiler can reorder instructions (static multiple issue):

- The compiler groups instructions into issue packets.
  - An issue packet is a group of instructions that can be issued in a single cycle.
  - It is determined by the pipeline resources required.
- Think of an issue packet as a very long instruction.
  - It specifies multiple concurrent operations.
  - Hence Very Long Instruction Word (VLIW).
- The compiler must remove some or all hazards.
  - Reorder instructions into issue packets with no dependencies within a packet.
  - Rules vary between ISAs; the compiler must know!
  - Pad with nop if necessary.

Hardware can look ahead for instructions to execute:

- Buffer results until it determines they are actually needed.
- Flush buffers on incorrect speculation.

Explicitly Parallel Instruction Computer (EPIC).

39 Loop Unrolling

- Replicate the loop body to expose more parallelism.
- Rename the registers used in each replicated copy.

Example loop:

- `Loop: lw $t0, 0($s1)`
- `addu $t0, $t0, $s2`
- `sw $t0, 0($s1)`
- `addi $s1, $s1, -4`
- `bne $s1, $zero, Loop`
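The transformation can be sketched at a higher level. The Python below is an illustration in the spirit of the loop above (add a scalar to every array element, walking from the end), not a translation of the MIPS code; the unroll factor of 4 and the assumption that the length is a multiple of 4 are choices made for this example. The temporaries `t0` to `t3` play the role of renamed registers.

```python
def add_scalar(a, s2):
    """Rolled loop: one element per iteration, from the last element down."""
    i = len(a) - 1
    while i >= 0:
        t0 = a[i]
        t0 = t0 + s2
        a[i] = t0
        i -= 1

def add_scalar_unrolled4(a, s2):
    """Unrolled by 4 with renamed temporaries; len(a) must be a multiple of 4."""
    assert len(a) % 4 == 0
    i = len(a) - 4
    while i >= 0:
        # The four copies use different temporaries, so they are independent.
        t0, t1, t2, t3 = a[i + 3], a[i + 2], a[i + 1], a[i]
        t0, t1, t2, t3 = t0 + s2, t1 + s2, t2 + s2, t3 + s2
        a[i + 3], a[i + 2], a[i + 1], a[i] = t0, t1, t2, t3
        i -= 4   # one index update and one branch per four elements

x = list(range(8))
y = list(range(8))
add_scalar(x, 10)
add_scalar_unrolled4(y, 10)
print(x == y)
```

Besides exposing independent operations for a multiple-issue machine, unrolling reduces loop overhead: the index update and branch are executed once per four elements instead of once per element.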

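40 HW Schemes: Instruction Parallelism

Why in hardware at run time?

- It works when the real dependence can't be known at compile time.
- The compiler is simpler.
- Code for one machine runs well on another.

Key idea: allow instructions behind a stall to proceed.

- `DIVD F0, F2, F4`
- `ADDD F10, F0, F8`
- `SUBD F12, F8, F14`

Here ADDD must wait for the result of DIVD in F0, but SUBD does not depend on either and can proceed.

- This enables out-of-order execution, which leads to out-of-order completion.
- The ID stage checks for hazards. If there are no hazards, the instruction is issued for execution.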

41 Dynamic Multiple Issue (Superscalar)

Superscalar processors: an advanced pipelining technique that enables the processor to execute more than one instruction per clock cycle.

- The CPU decides whether to issue 0, 1, ... up to the issue width instructions each cycle, avoiding structural and data hazards (dynamic pipeline).
- This avoids the need for compiler scheduling.
- The CPU is allowed to execute instructions out of order to avoid stalls, but commits results to registers in order.

Example:

- `lw $t0, 20($s2)`
- `addu $t1, $t0, $t2`
- `sub $s4, $s4, $t3`
- `slti $t5, $s4, 20`

The CPU can start sub while addu is waiting for lw.

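43 Speed Up Equation for Pipelining

$$\text{CPI}_{\text{pipelined}} = \text{Ideal CPI} + \text{Pipeline stall clock cycles per instruction}$$

$$\text{Speedup} = \frac{\text{Ideal CPI} \times \text{Pipeline depth}}{\text{Ideal CPI} + \text{Pipeline stall CPI}} \times \frac{\text{Clock cycle}_{\text{unpipelined}}}{\text{Clock cycle}_{\text{pipelined}}}$$

With an ideal CPI of 1 this becomes:

$$\text{Speedup} = \frac{\text{Pipeline depth}}{1 + \text{Pipeline stall CPI}} \times \frac{\text{Clock cycle}_{\text{unpipelined}}}{\text{Clock cycle}_{\text{pipelined}}}$$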
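The speedup equation is easy to evaluate for a few cases. This is a minimal sketch; the parameter names and the sample stall rates are assumptions chosen for illustration, with `cycle_ratio` standing for the unpipelined clock cycle divided by the pipelined clock cycle.

```python
def pipeline_speedup(depth, stall_cpi, ideal_cpi=1.0, cycle_ratio=1.0):
    """Speedup = (ideal CPI x depth) / (ideal CPI + stall CPI) x cycle ratio."""
    return (ideal_cpi * depth) / (ideal_cpi + stall_cpi) * cycle_ratio

# Five-stage pipeline with hypothetical stall rates.
for stall in (0.0, 0.25, 1.0):
    print(stall, pipeline_speedup(depth=5, stall_cpi=stall))
```

With no stalls the speedup equals the pipeline depth; every stall cycle per instruction eats directly into that ideal figure.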

### Department of Computer and IT Engineering University of Kurdistan. Computer Architecture Pipelining. By: Dr. Alireza Abdollahpouri

Department of Computer and IT Engineering University of Kurdistan Computer Architecture Pipelining By: Dr. Alireza Abdollahpouri Pipelined MIPS processor Any instruction set can be implemented in many

### COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface. 5 th. Edition. Chapter 4. The Processor

COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface 5 th Edition Chapter 4 The Processor Introduction CPU performance factors Instruction count Determined by ISA and compiler CPI and Cycle

### Full Datapath. Chapter 4 The Processor 2

Pipelining Full Datapath Chapter 4 The Processor 2 Datapath With Control Chapter 4 The Processor 3 Performance Issues Longest delay determines clock period Critical path: load instruction Instruction memory

### Thomas Polzer Institut für Technische Informatik

Thomas Polzer tpolzer@ecs.tuwien.ac.at Institut für Technische Informatik Pipelined laundry: overlapping execution Parallelism improves performance Four loads: Speedup = 8/3.5 = 2.3 Non-stop: Speedup =

### Processor (IV) - advanced ILP. Hwansoo Han

Processor (IV) - advanced ILP Hwansoo Han Instruction-Level Parallelism (ILP) Pipelining: executing multiple instructions in parallel To increase ILP Deeper pipeline Less work per stage shorter clock cycle

### COMP2611: Computer Organization. The Pipelined Processor

COMP2611: Computer Organization The 1 2 Background 2 High-Performance Processors 3 Two techniques for designing high-performance processors by exploiting parallelism: Multiprocessing: parallelism among

### The Processor: Instruction-Level Parallelism

The Processor: Instruction-Level Parallelism Computer Organization Architectures for Embedded Computing Tuesday 21 October 14 Many slides adapted from: Computer Organization and Design, Patterson & Hennessy

Advanced Instruction-Level Parallelism Jinkyu Jeong (jinkyu@skku.edu) Computer Systems Laboratory Sungkyunkwan University http://csl.skku.edu EEE3050: Theory on Computer Architectures, Spring 2017, Jinkyu

### The Processor. Z. Jerry Shi Department of Computer Science and Engineering University of Connecticut. CSE3666: Introduction to Computer Architecture

The Processor Z. Jerry Shi Department of Computer Science and Engineering University of Connecticut CSE3666: Introduction to Computer Architecture Introduction CPU performance factors Instruction count

### Advanced d Instruction Level Parallelism. Computer Systems Laboratory Sungkyunkwan University

Advanced d Instruction ti Level Parallelism Jin-Soo Kim (jinsookim@skku.edu) Computer Systems Laboratory Sungkyunkwan University http://csl.skku.edu ILP Instruction-Level Parallelism (ILP) Pipelining:

### COMPUTER ORGANIZATION AND DESI

COMPUTER ORGANIZATION AND DESIGN 5 Edition th The Hardware/Software Interface Chapter 4 The Processor 4.1 Introduction Introduction CPU performance factors Instruction count Determined by ISA and compiler

### Chapter 4. The Processor

Chapter 4 The Processor Recall. ISA? Instruction Fetch Instruction Decode Operand Fetch Execute Result Store Next Instruction Instruction Format or Encoding how is it decoded? Location of operands and

### COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface 5 th Edition. Chapter 4. The Processor

COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface 5 th Edition Chapter 4 The Processor The Processor? Chapter 4 The Processor 2 Introduction We will learn How the ISA determines many aspects

### Determined by ISA and compiler. We will examine two MIPS implementations. A simplified version A more realistic pipelined version

MIPS Processor Introduction CPU performance factors Instruction count Determined by ISA and compiler CPI and Cycle time Determined by CPU hardware We will examine two MIPS implementations A simplified

### Outline Marquette University

COEN-4710 Computer Hardware Lecture 4 Processor Part 2: Pipelining (Ch.4) Cristinel Ababei Department of Electrical and Computer Engineering Credits: Slides adapted primarily from presentations from Mike

### Computer Organization and Structure. Bing-Yu Chen National Taiwan University

Computer Organization and Structure Bing-Yu Chen National Taiwan University The Processor Logic Design Conventions Building a Datapath A Simple Implementation Scheme An Overview of Pipelining Pipelined

### Computer Architecture Computer Science & Engineering. Chapter 4. The Processor BK TP.HCM

Computer Architecture Computer Science & Engineering Chapter 4 The Processor Introduction CPU performance factors Instruction count Determined by ISA and compiler CPI and Cycle time Determined by CPU hardware

### Pipelining. Ideal speedup is number of stages in the pipeline. Do we achieve this? 2. Improve performance by increasing instruction throughput ...

CHAPTER 6 1 Pipelining Instruction class Instruction memory ister read ALU Data memory ister write Total (in ps) Load word 200 100 200 200 100 800 Store word 200 100 200 200 700 R-format 200 100 200 100

### Chapter 4. The Processor

Chapter 4 The Processor 1 Introduction CPU performance factors Instruction count Determined by ISA and compiler CPI and Cycle time Determined by CPU hardware We will examine two MIPS implementations A

### Chapter 4. The Processor

Chapter 4 The Processor Introduction CPU performance factors Instruction count Determined by ISA and compiler CPI and Cycle time Determined by CPU hardware We will examine two MIPS implementations A simplified

### Chapter 4. The Processor

Chapter 4 The Processor Introduction CPU performance factors Instruction count Determined by ISA and compiler CPI and Cycle time Determined by CPU hardware We will examine two MIPS implementations A simplified

### Computer Architecture Computer Science & Engineering. Chapter 4. The Processor BK TP.HCM

Computer Architecture Computer Science & Engineering Chapter 4 The Processor Introduction CPU performance factors Instruction count Determined by ISA and compiler CPI and Cycle time Determined by CPU hardware

### ECS 154B Computer Architecture II Spring 2009

ECS 154B Computer Architecture II Spring 2009 Pipelining Datapath and Control 6.2-6.3 Partially adapted from slides by Mary Jane Irwin, Penn State And Kurtis Kredo, UCD Pipelined CPU Break execution into

### 4. The Processor Computer Architecture COMP SCI 2GA3 / SFWR ENG 2GA3. Emil Sekerinski, McMaster University, Fall Term 2015/16

4. The Processor Computer Architecture COMP SCI 2GA3 / SFWR ENG 2GA3 Emil Sekerinski, McMaster University, Fall Term 2015/16 Instruction Execution Consider simplified MIPS: lw/sw rt, offset(rs) add/sub/and/or/slt

### Pipelined Processor Design

Pipelined Processor Design Pipelined Implementation: MIPS Virendra Singh Computer Design and Test Lab. Indian Institute of Science (IISc) Bangalore virendra@computer.org Advance Computer Architecture http://www.serc.iisc.ernet.in/~viren/courses/aca/aca.htm

### Instruction Level Parallelism. Appendix C and Chapter 3, HP5e

Instruction Level Parallelism Appendix C and Chapter 3, HP5e Outline Pipelining, Hazards Branch prediction Static and Dynamic Scheduling Speculation Compiler techniques, VLIW Limits of ILP. Implementation

### Chapter 4. The Processor

Chapter 4 The Processor 4.1 Introduction Introduction CPU performance factors Instruction count CPI and Cycle time Determined by CPU hardware We will examine two MIPS implementations Determined by ISA

### COMPUTER ORGANIZATION AND DESIGN

COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface 5 th Edition Chapter 4 The Processor Introduction CPU performance factors Instruction count Determined by ISA and compiler CPI and Cycle

### Chapter 4. Instruction Execution. Introduction. CPU Overview. Multiplexers. Chapter 4 The Processor 1. The Processor.

COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface 5 th Edition COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface 5 th Edition Chapter 4 The Processor The Processor - Introduction

### CPE 335. Basic MIPS Architecture Part II

CPE 335 Computer Organization Basic MIPS Architecture Part II Dr. Iyad Jafar Adapted from Dr. Gheith Abandah slides http://www.abandah.com/gheith/courses/cpe335_s08/index.html CPE232 Basic MIPS Architecture

### Chapter 4 The Processor 1. Chapter 4A. The Processor

Chapter 4 The Processor 1 Chapter 4A The Processor Chapter 4 The Processor 2 Introduction CPU performance factors Instruction count Determined by ISA and compiler CPI and Cycle time Determined by CPU hardware

### Real Processors. Lecture for CPSC 5155 Edward Bosworth, Ph.D. Computer Science Department Columbus State University

Real Processors Lecture for CPSC 5155 Edward Bosworth, Ph.D. Computer Science Department Columbus State University Instruction-Level Parallelism (ILP) Pipelining: executing multiple instructions in parallel

### COSC 6385 Computer Architecture - Pipelining

COSC 6385 Computer Architecture - Pipelining Fall 2006 Some of the slides are based on a lecture by David Culler, Instruction Set Architecture Relevant features for distinguishing ISA s Internal storage

### Lecture 3: The Processor (Chapter 4 of textbook) Chapter 4.1

Lecture 3: The Processor (Chapter 4 of textbook) Chapter 4.1 Introduction Chapter 4.1 Chapter 4.2 Review: MIPS (RISC) Design Principles Simplicity favors regularity fixed size instructions small number

### EIE/ENE 334 Microprocessors

EIE/ENE 334 Microprocessors Lecture 6: The Processor Week #06/07 : Dejwoot KHAWPARISUTH Adapted from Computer Organization and Design, 4 th Edition, Patterson & Hennessy, 2009, Elsevier (MK) http://webstaff.kmutt.ac.th/~dejwoot.kha/

### 14:332:331 Pipelined Datapath

14:332:331 Pipelined Datapath I n s t r. O r d e r Inst 0 Inst 1 Inst 2 Inst 3 Inst 4 Single Cycle Disadvantages & Advantages Uses the clock cycle inefficiently the clock cycle must be timed to accommodate

### DEE 1053 Computer Organization Lecture 6: Pipelining

Dept. Electronics Engineering, National Chiao Tung University DEE 1053 Computer Organization Lecture 6: Pipelining Dr. Tian-Sheuan Chang tschang@twins.ee.nctu.edu.tw Dept. Electronics Engineering National

### CO Computer Architecture and Programming Languages CAPL. Lecture 18 & 19

CO2-3224 Computer Architecture and Programming Languages CAPL Lecture 8 & 9 Dr. Kinga Lipskoch Fall 27 Single Cycle Disadvantages & Advantages Uses the clock cycle inefficiently the clock cycle must be

### COMPUTER ORGANIZATION AND DESIGN. 5 th Edition. The Hardware/Software Interface. Chapter 4. The Processor

COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface 5 th Edition Chapter 4 The Processor COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface 5 th Edition The Processor - Introduction

### COMPUTER ORGANIZATION AND DESIGN. 5 th Edition. The Hardware/Software Interface. Chapter 4. The Processor

COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface 5 th Edition Chapter 4 The Processor Introduction CPU performance factors Instruction count Determined by ISA and compiler CPI and Cycle

### COMPUTER ORGANIZATION AND DESIGN

COMPUTER ORGANIZATION AND DESIGN 5 Edition th The Hardware/Software Interface Chapter 4 The Processor 4.1 Introduction Introduction CPU performance factors Instruction count CPI and Cycle time Determined

### CS 110 Computer Architecture. Pipelining. Guest Lecture: Shu Yin. School of Information Science and Technology SIST

CS 110 Computer Architecture Pipelining Guest Lecture: Shu Yin http://shtech.org/courses/ca/ School of Information Science and Technology SIST ShanghaiTech University Slides based on UC Berkley's CS61C

### Data Hazards Compiler Scheduling Pipeline scheduling or instruction scheduling: Compiler generates code to eliminate hazard

Data Hazards Compiler Scheduling Pipeline scheduling or instruction scheduling: Compiler generates code to eliminate hazard Consider: a = b + c; d = e - f; Assume loads have a latency of one clock cycle:

### Chapter 4. The Processor

Chapter 4 The Processor Introduction CPU performance factors Instruction count Determined by ISA and compiler CPI and Cycle time Determined by CPU hardware We will examine two MIPS implementations A simplified

### EE557--FALL 1999 MAKE-UP MIDTERM 1. Closed books, closed notes

NAME: STUDENT NUMBER: EE557--FALL 1999 MAKE-UP MIDTERM 1 Closed books, closed notes Q1: /1 Q2: /1 Q3: /1 Q4: /1 Q5: /15 Q6: /1 TOTAL: /65 Grade: /25 1 QUESTION 1(Performance evaluation) 1 points We are

### Lecture 2: Processor and Pipelining 1

The Simple BIG Picture! Chapter 3 Additional Slides The Processor and Pipelining CENG 6332 2 Datapath vs Control Datapath signals Control Points Controller Datapath: Storage, FU, interconnect sufficient

### 4. What is the average CPI of a 1.4 GHz machine that executes 12.5 million instructions in 12 seconds?

Chapter 4: Assessing and Understanding Performance 1. Define response (execution) time. 2. Define throughput. 3. Describe why using the clock rate of a processor is a bad way to measure performance. Provide

### LECTURE 3: THE PROCESSOR

LECTURE 3: THE PROCESSOR Abridged version of Patterson & Hennessy (2013):Ch.4 Introduction CPU performance factors Instruction count Determined by ISA and compiler CPI and Cycle time Determined by CPU

### Lecture 9. Pipeline Hazards. Christos Kozyrakis Stanford University

Lecture 9 Pipeline Hazards Christos Kozyrakis Stanford University http://eeclass.stanford.edu/ee18b 1 Announcements PA-1 is due today Electronic submission Lab2 is due on Tuesday 2/13 th Quiz1 grades will

### Pipelining Analogy. Pipelined laundry: overlapping execution. Parallelism improves performance. Four loads: Non-stop: Speedup = 8/3.5 = 2.3.

Pipelining Analogy Pipelined laundry: overlapping execution Parallelism improves performance Four loads: Speedup = 8/3.5 = 2.3 Non-stop: Speedup =2n/05n+15 2n/0.5n 1.5 4 = number of stages 4.5 An Overview

### Outline. A pipelined datapath Pipelined control Data hazards and forwarding Data hazards and stalls Branch (control) hazards Exception

Outline A pipelined datapath Pipelined control Data hazards and forwarding Data hazards and stalls Branch (control) hazards Exception 1 4 Which stage is the branch decision made? Case 1: 0 M u x 1 Add

### Computer Organization and Structure

Computer Organization and Structure 1. Assuming the following repeating pattern (e.g., in a loop) of branch outcomes: Branch outcomes a. T, T, NT, T b. T, T, T, NT, NT Homework #4 Due: 2014/12/9 a. What

### Lec 25: Parallel Processors. Announcements

Lec 25: Parallel Processors Kavita Bala CS 340, Fall 2008 Computer Science Cornell University PA 3 out Hack n Seek Announcements The goal is to have fun with it Recitations today will talk about it Pizza

### ELEC 5200/6200 Computer Architecture and Design Spring 2017 Lecture 4: Datapath and Control

ELEC 5200/6200 Computer Architecture and Design Spring 2017 Lecture 4: Datapath and Control Ujjwal Guin, Assistant Professor Department of Electrical and Computer Engineering Auburn University, Auburn, AL 36849

### Pipelined Processor Design

Pipelined Processor Design Pipelined Implementation: MIPS Virendra Singh Indian Institute of Science Bangalore virendra@computer.org Lecture 20 SE-273: Processor Design Courtesy: Prof. Vishwani Agrawal

### CPE 335 Computer Organization. Basic MIPS Pipelining Part I

CPE 335 Computer Organization Basic MIPS Pipelining Part I Dr. Iyad Jafar Adapted from Dr. Gheith Abandah slides http://www.abandah.com/gheith/courses/cpe335_s08/index.html CPE232 Basic MIPS Pipelining

### 3/12/2014. Single Cycle (Review) CSE 2021: Computer Organization. Single Cycle with Jump. Multi-Cycle Implementation. Why Multi-Cycle?

CSE 2021: Computer Organization Single Cycle (Review) Lecture-10b CPU Design : Pipelining-1 Overview, Datapath and control Shakil M. Khan 2 Single Cycle with Jump Multi-Cycle Implementation Instruction:

### Chapter 4. The Processor. Computer Architecture and IC Design Lab

Chapter 4 The Processor Introduction CPU performance factors CPI Clock Cycle Time Instruction count Determined by ISA and compiler CPI and Cycle time Determined by CPU hardware We will examine two MIPS

### ECE369. Chapter 5 ECE369

Chapter 5 1 State Elements Unclocked vs. Clocked Clocks used in synchronous logic Clocks are needed in sequential logic to decide when an element that contains state should be updated. State element 1

### Lecture 8: Control COS / ELE 375. Computer Architecture and Organization. Princeton University Fall Prof. David August

Lecture 8: Control COS / ELE 375 Computer Architecture and Organization Princeton University Fall 2015 Prof. David August 1 Datapath and Control Datapath The collection of state elements, computation elements,

### Full Datapath. Chapter 4 The Processor 2

Pipelining Full Datapath Chapter 4 The Processor 2 Datapath With Control Chapter 4 The Processor 3 Performance Issues Longest delay determines clock period Critical path: load instruction Instruction memory

### Chapter 4 The Processor 1. Chapter 4D. The Processor

Chapter 4 The Processor 1 Chapter 4D The Processor Chapter 4 The Processor 2 Instruction-Level Parallelism (ILP) Pipelining: executing multiple instructions in parallel To increase ILP Deeper pipeline

### Chapter 4. The Processor. Jiang Jiang

Chapter 4 The Processor Jiang Jiang jiangjiang@ic.sjtu.edu.cn [Adapted from Computer Organization and Design, 4th Edition, Patterson & Hennessy, 2008, MK] Chapter 4 The Processor 2 Introduction CPU performance

### 1 Hazards COMP2611 Fall 2015 Pipelined Processor

1 Hazards Dependences in Programs 2 Data dependence Example: lw \$1, 200(\$2) add \$3, \$4, \$1 add can't do ID (i.e., read register \$1) until lw updates \$1 Control dependence Example: bne \$1, \$2, target add

### Hardware-based Speculation

Hardware-based Speculation Hardware-based Speculation To exploit instruction-level parallelism, maintaining control dependences becomes an increasing burden. For a processor executing multiple instructions

### Processor (I) - datapath & control. Hwansoo Han

Processor (I) - datapath & control Hwansoo Han Introduction CPU performance factors Instruction count - Determined by ISA and compiler CPI and Cycle time - Determined by CPU hardware We will examine two

### Orange Coast College. Business Division. Computer Science Department. CS 116- Computer Architecture. Pipelining

Orange Coast College Business Division Computer Science Department CS 116- Computer Architecture Pipelining Recall Pipelining is parallelizing execution Key to speedups in processors Split instruction

### MIPS Pipelining. Computer Organization Architectures for Embedded Computing. Wednesday 8 October 14

MIPS Pipelining Computer Organization Architectures for Embedded Computing Wednesday 8 October 14 Many slides adapted from: Computer Organization and Design, Patterson & Hennessy 4th Edition, 2011, MK

### 5th Edition. The Processor We will examine two MIPS implementations A simplified version A more realistic pipelined version

COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface Chapter 4 5 th Edition Introduction CPU performance factors Instruction count Determined by ISA and compiler CPI and Cycle time Determined

### Design of the MIPS Processor (contd)

Design of the MIPS Processor (contd) First, revisit the datapath for add, sub, lw, sw. We will augment it to accommodate the beq and j instructions. Execution of branch instructions beq \$at, \$zero, L add

### COMPUTER ORGANIZATION AND DESIGN. The Hardware/Software Interface. Chapter 4. The Processor: C Multiple Issue Based on P&H

COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface Chapter 4 The Processor: C Multiple Issue Based on P&H Instruction-Level Parallelism (ILP) Pipelining: executing multiple instructions in

### Chapter 4 The Processor 1. Chapter 4B. The Processor

Chapter 4 The Processor 1 Chapter 4B The Processor Chapter 4 The Processor 2 Control Hazards Branch determines flow of control Fetching next instruction depends on branch outcome Pipeline can't always

### Pipeline design. Mehran Rezaei

Pipeline design Mehran Rezaei How Can We Improve the Performance? Exec Time = IC * CPI * CCT Optimization IC CPI CCT Source Level * Compiler * * ISA * * Organization * * Technology * With Pipelining We

### COMPUTER ORGANIZATION AND DESIGN

ARM COMPUTER ORGANIZATION AND DESIGN Edition The Hardware/Software Interface Chapter 4 The Processor Modified and extended by R.J. Leduc - 2016 To understand this chapter, you will need to understand some

### Design of the MIPS Processor

Design of the MIPS Processor We will study the design of a simple version of MIPS that can support the following instructions: I-type instructions LW, SW R-type instructions, like ADD, SUB Conditional

### Multiple Instruction Issue. Superscalars

Multiple Instruction Issue Multiple instructions issued each cycle better performance increase instruction throughput decrease in CPI (below 1) greater hardware complexity, potentially longer wire lengths

### Homework 5. Start date: March 24 Due date: 11:59PM on April 10, Monday night. CSCI 402: Computer Architectures

Homework 5 Start date: March 24 Due date: 11:59PM on April 10, Monday night 4.1.1, 4.1.2 4.3 4.8.1, 4.8.2 4.9.1-4.9.4 4.13.1 4.16.1, 4.16.2 1 CSCI 402: Computer Architectures The Processor (4) Fengguang

### Adapted from instructor's. Organization and Design, 4th Edition, Patterson & Hennessy, 2008, MK]

Review and Advanced Concepts [Adapted from instructor's supplementary material from Computer Organization and Design, 4th Edition, Patterson & Hennessy, 2008, MK] Pipelining Review PC IF/ID ID/EX EX/M

### Pipelining. Advanced Topics on Heterogeneous System Architectures. Politecnico di Milano. Seminar DEIB. 30 November, 2017.

Advanced Topics on Heterogeneous System Architectures Pipelining Politecnico di Milano Seminar Room @ DEIB 30 November, 2017 Antonio R. Miele Marco D. Santambrogio Politecnico di Milano 2 Outline

### CSE 2021 COMPUTER ORGANIZATION

CSE 2021 COMPUTER ORGANIZATION HUGH CHESSER CSEB 2U Agenda Topics: 1. Sample Exam/Quiz Q - Review 2. Multiple cycle implementation Patterson: Section 4.5 Reminder: Quiz #2 Next Wednesday

### Processor: Multi- Cycle Datapath & Control

Processor: Multi-Cycle Datapath & Control (Based on text: David A. Patterson & John L. Hennessy, Computer Organization and Design: The Hardware/Software Interface, 3rd Ed., Morgan Kaufmann, 2007) COURSE

### Instruction word R0 R1 R2 R3 R4 R5 R6 R8 R12 R31

4.16 Exercises 419 Exercise 4.11 In this exercise we examine in detail how an instruction is executed in a single-cycle datapath. Problems in this exercise refer to a clock cycle in which the processor

### Pipelining. CSC Friday, November 6, 2015

Pipelining CSC 211.01 Friday, November 6, 2015 Performance Issues Longest delay determines clock period Critical path: load instruction Instruction memory register file ALU data memory register file Not

### CS 61C: Great Ideas in Computer Architecture (Machine Structures) Lecture 32: Pipeline Parallelism 3

CS 61C: Great Ideas in Computer Architecture (Machine Structures) Lecture 32: Pipeline Parallelism 3 Instructor: Dan Garcia inst.eecs.berkeley.edu/~cs61c Computing in the News At a laboratory in São Paulo,

### COMPUTER ORGANIZATION AND DESIGN. The Hardware/Software Interface. Chapter 4. The Processor: A Based on P&H

COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface Chapter 4 The Processor: A Based on P&H Introduction We will examine two MIPS implementations A simplified version A more realistic pipelined

### CS 61C: Great Ideas in Computer Architecture Pipelining and Hazards

CS 61C: Great Ideas in Computer Architecture Pipelining and Hazards Instructors: Vladimir Stojanovic and Nicholas Weaver http://inst.eecs.berkeley.edu/~cs61c/sp16 1 Pipelined Execution Representation Time

### T = I x CPI x C. Both effective CPI and clock cycle C are heavily influenced by CPU design. CPI increased (3-5) bad Shorter cycle good

CPU performance equation: T = I x CPI x C Both effective CPI and clock cycle C are heavily influenced by CPU design. For single-cycle CPU: CPI = 1 good Long cycle time bad On the other hand, for multi-cycle

### Four Steps of Speculative Tomasulo cycle 0

HW support for More ILP Hardware Speculative Execution Speculation: allow an instruction to issue that is dependent on branch, without any consequences (including exceptions) if branch is predicted incorrectly

### LECTURE 10. Pipelining: Advanced ILP

LECTURE 10 Pipelining: Advanced ILP EXCEPTIONS An exception, or interrupt, is an event other than regular transfers of control (branches, jumps, calls, returns) that changes the normal flow of instruction

### Systems Architecture I

Systems Architecture I Topics A Simple Implementation of MIPS * A Multicycle Implementation of MIPS ** *This lecture was derived from material in the text (sec. 5.1-5.3). **This lecture was derived from

### What do we have so far? Multi-Cycle Datapath (Textbook Version)

What do we have so far? Multi-cycle Datapath (Textbook Version) CPI: R-Type = 4, Load = 5, Store = 4, Branch = 3 Only one instruction being processed in datapath How to lower CPI further? #1 Lec # 8 Summer2001

### Lecture 7 Pipelining. Peng Liu.

Lecture 7 Pipelining Peng Liu liupeng@zju.edu.cn 1 Review: The Single Cycle Processor 2 Review: Given Datapath,RTL -> Control Instruction Inst Memory Adr Op Fun Rt

### RISC Processor Design

RISC Processor Design Single Cycle Implementation - MIPS Virendra Singh Indian Institute of Science Bangalore virendra@computer.org Lecture 13 SE-273: Processor Design Feb 07, 2011 SE-273@SERC 1 Courtesy:

### Lecture 4: Review of MIPS. Instruction formats, impl. of control and datapath, pipelined impl.

Lecture 4: Review of MIPS Instruction formats, impl. of control and datapath, pipelined impl. 1 MIPS Instruction Types Data transfer: Load and store Integer arithmetic/logic Floating point arithmetic Control

### IF1/IF2. Dout2[31:0] Data Memory. Addr[31:0] Din[31:0] Zero. Res ALU << 2. CPU Registers. extension. sign. W_add[4:0] Din[31:0] Dout[31:0] PC+4

CMPE110 Fall 2006 A. Di Blas Advanced pipeline concepts: ILP; Deep pipeline; Static multiple issue; Loop unrolling; VLIW; Dynamic multiple issue Textbook Edition:

### What is Pipelining? Time per instruction on unpipelined machine Number of pipe stages

What is Pipelining? It is a key implementation technique used to make fast CPUs. It is an implementation technique whereby multiple instructions are overlapped in execution. It takes advantage of parallelism

### CC 311- Computer Architecture. The Processor - Control

CC 311- Computer Architecture The Processor - Control Control Unit Functions: Instruction code Control Unit Control Signals Select operations to be performed (ALU, read/write, etc.) Control data flow (multiplexor

### Processor (II) - pipelining. Hwansoo Han

Processor (II) - pipelining Hwansoo Han Pipelining Analogy Pipelined laundry: overlapping execution Parallelism improves performance Four loads: Speedup = 8/3.5 = 2.3 Non-stop: Speedup = 2n/(0.5n + 1.5) ≈ 4 = number

### ECEC 355: Pipelining

ECEC 355: Pipelining November 8, 2007 What is Pipelining Pipelining is an implementation technique whereby multiple instructions are overlapped in execution. A pipeline is similar in concept to an assembly

### Systems Architecture

Systems Architecture Lecture 15: A Simple Implementation of MIPS Jeremy R. Johnson Anatole D. Ruslanov William M. Mongan Some or all figures from Computer Organization and Design: The Hardware/Software