# Computer and Information Sciences College / Computer Science Department Enhancing Performance with Pipelining


2 Single-Cycle Design Problems

Assuming a fixed-period clock, every instruction datapath uses one clock cycle. This implies:

- CPI = 1.
- The cycle time is determined by the length of the longest instruction path (load).
  - Several instructions could run in a shorter clock cycle, so time is wasted.
  - Consider what happens with more complicated instructions such as floating point!
- Resources used more than once in the same cycle need to be duplicated, which wastes hardware and chip area.

[Figure: single-cycle datapath stages IF, ID, IE, MEM, WB using IM, Reg, ALU, DM, Reg]

3 Ex.: Fixed-period clock vs. variable-period clock in a single-cycle implementation

Consider a machine with an additional floating-point unit. Assume functional unit delays as follows (multiplexors, control unit, PC accesses, sign extension, and wires have no delay):

| Memory | ALU | FP add | FP mul | Register (R) |
|---|---|---|---|---|
| 2 ns | 2 ns | 8 ns | 16 ns | 1 ns |

Assume the instruction mix as follows:

| Lw | Sw | R | Beq | J | FP add | FP mul |
|---|---|---|---|---|---|---|
| 31% | 21% | 27% | 5% | 2% | 7% | 7% |

Compare the performance of (a) a single-cycle implementation using a fixed-period clock with (b) one using a variable-period clock, where each instruction executes in one clock cycle that is only as long as it needs to be (not really practical, but pretend it's possible!).

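4 Solution

| Instruction class | Instr. mem. | Register read | ALU oper. | Data mem. | Register write | FPU add/sub | FPU mul/div | Total time (ns) |
|---|---|---|---|---|---|---|---|---|
| Load word | 2 | 1 | 2 | 2 | 1 | | | 8 |
| Store word | 2 | 1 | 2 | 2 | | | | 7 |
| R-format | 2 | 1 | 2 | | 1 | | | 6 |
| Branch | 2 | 1 | 2 | | | | | 5 |
| Jump | 2 | | | | | | | 2 |
| FP mul/div | 2 | 1 | | | 1 | | 16 | 20 |
| FP add/sub | 2 | 1 | | | 1 | 8 | | 12 |

Clock period for the fixed-period clock = longest instruction time = 20 ns.

Average clock period for the variable-period clock = 8 × 31% + 7 × 21% + 6 × 27% + 5 × 5% + 2 × 2% + 20 × 7% + 12 × 7% = 8.1 ns.

Therefore, performance(var-period) / performance(fixed-period) = 20 / 8.1 ≈ 2.5, since T = Ic × CPI × t with the same Ic and the same CPI in both designs.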
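The comparison can be checked with a short script. This is a minimal sketch: the delays and the instruction mix come from the example, while the list of units each instruction class passes through (`PATHS`) is an assumption based on the usual MIPS datapath, and all names are illustrative.

```python
# Functional-unit delays in ns, as given in the example.
DELAY = {"mem": 2, "alu": 2, "fp_add": 8, "fp_mul": 16, "reg": 1}

# Units used by each instruction class (assumed standard MIPS datapath usage).
PATHS = {
    "lw":     ["mem", "reg", "alu", "mem", "reg"],
    "sw":     ["mem", "reg", "alu", "mem"],
    "r":      ["mem", "reg", "alu", "reg"],
    "beq":    ["mem", "reg", "alu"],
    "j":      ["mem"],
    "fp_add": ["mem", "reg", "fp_add", "reg"],
    "fp_mul": ["mem", "reg", "fp_mul", "reg"],
}

# Instruction mix from the example (fractions sum to 1).
MIX = {"lw": 0.31, "sw": 0.21, "r": 0.27, "beq": 0.05,
       "j": 0.02, "fp_add": 0.07, "fp_mul": 0.07}


def instruction_times():
    """Time in ns needed by each instruction class."""
    return {name: sum(DELAY[u] for u in units) for name, units in PATHS.items()}


def fixed_period():
    """A fixed clock must fit the slowest instruction."""
    return max(instruction_times().values())


def average_variable_period():
    """Mix-weighted average of the per-class times."""
    times = instruction_times()
    return sum(MIX[name] * times[name] for name in MIX)


def variable_clock_speedup():
    """Same instruction count and CPI, so the ratio of periods is the speedup."""
    return fixed_period() / average_variable_period()


if __name__ == "__main__":
    print(instruction_times())
    print(fixed_period(), average_variable_period(), variable_clock_speedup())
```

Changing `MIX` or `DELAY` shows how strongly the rare but slow FP multiply dominates the fixed-period design.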

5 Fixing the problem with single-cycle designs

I- One solution: a variable-period clock with different cycle times for each instruction class.

- Infeasible, as implementing a variable-speed clock is technically difficult.

Another solution:

- Use a smaller cycle time.
- Have different instructions take different numbers of cycles by breaking instructions into steps and fitting each step into one cycle.

II- Multicycle approach:

- Break up the instructions into steps; each step takes one clock cycle.
- At the end of one cycle, store data to be used in later cycles of the same instruction.
- Balance the amount of work to be done in each step/cycle so that they are about equal.
- Restrict each cycle to use each major functional unit at most once, so that such units do not have to be replicated.
- Functional units can be shared between different cycles within one instruction.

6 Multicycle Approach

[Figure: multicycle datapath with PC, a single memory (instruction or data), instruction register, memory data register, register file, registers A and B, ALU, and ALUOut]

Note the particularities of the multicycle diagram vs. the single-cycle diagram:

- single memory for data and instructions
- single ALU, no extra adders
- extra registers to hold data between clock cycles

7 Breaking instructions into steps

We break instructions into steps:

- Not all instructions require all the steps.
- Each step takes one clock cycle.
- Each MIPS instruction takes from 3 to 5 cycles (steps).

Steps:

1. IF: Instruction fetch and PC increment
2. ID: Instruction decode and register fetch
3. EX: Execution, memory address computation, or branch completion
4. MEM: Memory access or R-type instruction completion
5. WB: Memory read completion

To keep steps balanced in length, the design restriction is to allow each step to contain at most one ALU operation, or one register access, or one memory access.

| Step name | Action for R-type instructions | Action for memory-reference instructions | Action for branches | Action for jumps |
|---|---|---|---|---|
| Instruction fetch | IR = Memory[PC]; PC = PC + 4 | same as R-type | same as R-type | same as R-type |
| Instruction decode/register fetch | A = Reg[IR[25-21]]; B = Reg[IR[20-16]]; ALUOut = PC + (sign-extend(IR[15-0]) << 2) | same as R-type | same as R-type | same as R-type |
| Execution, address computation, branch/jump completion | ALUOut = A op B | ALUOut = A + sign-extend(IR[15-0]) | if (A == B) then PC = ALUOut | PC = PC[31-28] \|\| (IR[25-0] << 2) |
| Memory access or R-type completion | Reg[IR[15-11]] = ALUOut | Load: MDR = Memory[ALUOut], or Store: Memory[ALUOut] = B | | |
| Memory read completion | | Load: Reg[IR[20-16]] = MDR | | |
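Because instruction classes finish at different steps, the multicycle CPI depends on the instruction mix. The sketch below takes the cycle counts from the step table (branches and jumps finish in step 3, R-type and stores in step 4, loads in step 5). The mix in `HYPOTHETICAL_MIX` is invented purely for illustration and is not from the slides.

```python
# Cycles per instruction class, read off the step table.
CYCLES = {"lw": 5, "sw": 4, "r": 4, "beq": 3, "j": 3}

# Hypothetical instruction mix (an assumption for illustration only).
HYPOTHETICAL_MIX = {"lw": 0.25, "sw": 0.10, "r": 0.50, "beq": 0.10, "j": 0.05}


def multicycle_cpi(mix, cycles=CYCLES):
    """Average cycles per instruction for a multicycle implementation."""
    if abs(sum(mix.values()) - 1.0) > 1e-9:
        raise ValueError("instruction mix must sum to 1")
    return sum(fraction * cycles[name] for name, fraction in mix.items())


if __name__ == "__main__":
    print(multicycle_cpi(HYPOTHETICAL_MIX))
```

The result always lies between 3 and 5, and it moves toward 5 as the share of loads grows.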

8 Pipelining

Pipelining is an implementation technique in which multiple instructions are overlapped in execution. Start work ASAP!! Do not waste time!

[Figure: laundry analogy with task order A, B, C, D against time starting at 6 PM, shown not pipelined and pipelined]

- Assume 30 min. for each task: wash, dry, fold, store.
- Separate tasks use separate hardware, so they can be overlapped.

Why is pipelining easy with MIPS?

1) All instructions are the same length.
   - Fetch and decode stages are similar for all instructions.
2) Few instruction formats.
   - Simplifies instruction decode and makes it possible in one stage.
3) Memory operands appear only in loads/stores.
   - Memory access can be deferred to exactly one later stage.
4) Operands are aligned in memory.
   - One data transfer instruction requires one memory access stage.

What about x86? (Instructions are 1 to 17 bytes long.)
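The laundry analogy can be put into numbers. The following is a small sketch that assumes four loads (A to D) and the four 30-minute tasks named on the slide; the function names are illustrative.

```python
def sequential_minutes(loads, tasks, minutes_per_task):
    """Each load finishes all its tasks before the next load starts."""
    return loads * tasks * minutes_per_task


def pipelined_minutes(loads, tasks, minutes_per_task):
    """The first load needs `tasks` steps; every later load finishes one step
    after the previous one, because the tasks use separate hardware."""
    return (tasks + loads - 1) * minutes_per_task


if __name__ == "__main__":
    print(sequential_minutes(4, 4, 30))  # total minutes, not pipelined
    print(pipelined_minutes(4, 4, 30))   # total minutes, pipelined
```

The time for a single load is unchanged; only the throughput improves.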

9 Pipelined Execution Representation

[Figure: five instructions in program flow, each passing through IFtch, Dcd, Exec, Mem, WB against time]

- To simplify the pipeline, every instruction takes the same number of steps, called stages.
- One clock cycle per stage.

10 Pipelined vs. Single-Cycle Instruction Execution: the Plan

Assume 2 ns for memory access and ALU operation, and 1 ns for register access. Therefore, the single-cycle clock is 8 ns and the pipelined clock cycle is 2 ns.

[Figure: single-cycle execution of three consecutive lw instructions in program execution order (instruction fetch, ALU, data access), 8 ns per instruction. Single-cycle T?]

[Figure: pipelined execution of the same three lw instructions, with a new instruction starting every 2 ns. Pipelined T?]

Assume the write to the register file occurs in the first half of the clock cycle and the read in the second half.
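The two questions on this slide (single-cycle T and pipelined T) can be answered with a few lines of arithmetic. This is a sketch under the stated latencies; the stage names and the helper functions are illustrative, and the pipelined clock is assumed to be set by the slowest stage.

```python
# Stage latencies in ns for lw, using the assumptions on the slide:
# 2 ns for memory access and ALU operation, 1 ns for register access.
STAGE_NS = {"IF": 2, "ID": 1, "EX": 2, "MEM": 2, "WB": 1}


def single_cycle_clock(stage_ns):
    """One long cycle must cover every stage of the slowest instruction."""
    return sum(stage_ns.values())


def pipelined_clock(stage_ns):
    """Every stage gets one cycle, so the slowest stage sets the clock."""
    return max(stage_ns.values())


def total_single_cycle(n, stage_ns):
    return n * single_cycle_clock(stage_ns)


def total_pipelined(n, stage_ns):
    """Fill the pipeline once, then complete one instruction per cycle."""
    return (len(stage_ns) + n - 1) * pipelined_clock(stage_ns)


if __name__ == "__main__":
    for n in (3, 1000):
        single = total_single_cycle(n, STAGE_NS)
        piped = total_pipelined(n, STAGE_NS)
        print(n, single, piped, single / piped)
```

For three instructions the gain is modest because filling the pipeline dominates; for long instruction streams the ratio approaches the ratio of the two clock periods.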

11 Hazards

What makes pipelining hard?

- Structural hazards: different instructions, at different stages in the pipeline, want to use the same hardware resource.
- Control hazards: deciding on a control action depends on a previous instruction.
- Data hazards: an instruction in the pipeline requires data to be computed by a previous instruction still in the pipeline.

We first briefly examine these potential hazards individually.

12 Structural Hazards

Structural hazard: inadequate hardware to simultaneously support all instructions in the pipeline in the same clock cycle.

E.g., suppose a single instruction and data memory in the pipeline with one read port: there is a structural hazard between the first and fourth lw instructions, because the first is reading data in the same clock cycle in which the fourth is being fetched.

[Figure: instruction order (Load, Instr 1, Instr 2, Instr 3, Instr 4) against time in clock cycles, each instruction using M, Reg, ALU, M, Reg]

Structural hazards are easy to avoid! Hazards can always be resolved by waiting.
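The conflict between the first and fourth instruction follows directly from the stage timing. The sketch below assumes a five-stage pipeline with one instruction fetched per cycle and a single memory port shared by IF and MEM; the function name and tuple layout are illustrative.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]


def memory_conflicts(uses_data_memory):
    """Find cycles where IF and MEM would need the single memory port together.

    uses_data_memory[i] is True if instruction i (0-based) accesses data memory.
    Returns tuples (cycle, index_in_MEM, index_in_IF), with cycles numbered from 1.
    """
    n = len(uses_data_memory)
    mem_offset = STAGES.index("MEM")
    conflicts = []
    for i, uses_mem in enumerate(uses_data_memory):
        if not uses_mem:
            continue
        mem_cycle = (i + 1) + mem_offset   # instruction i is fetched in cycle i + 1
        fetched = mem_cycle - 1            # instruction whose IF falls in mem_cycle
        if fetched < n:
            conflicts.append((mem_cycle, i, fetched))
    return conflicts


if __name__ == "__main__":
    # Four consecutive loads: the first (index 0) and fourth (index 3) collide.
    print(memory_conflicts([True, True, True, True]))
```

Separate instruction and data memories remove every entry from the list, which is why this hazard is easy to avoid in hardware.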

13 Control Hazards

Control hazard: need to make a decision based on the result of a previous instruction still executing in the pipeline.

Solution 1: Stall the pipeline.

[Figure: add, beq, lw in program execution order against time; a bubble follows beq, so the pipeline stalls before lw is fetched]

Note that the branch outcome is computed in the ID stage with added hardware (later).

14 Control Hazards

Solution 2: Predict the branch outcome.

- E.g., predict branch-not-taken: guess one direction, then back up if wrong.
- Random prediction: correct 50% of the time.
- History-based prediction: record the recent history of each branch; correct 90% of the time.

[Figure: prediction success. add \$4, \$5, \$6; beq \$1, \$2, 40; lw \$3, 300(\$0) are fetched 2 ns apart.]

[Figure: prediction failure. add \$4, \$5, \$6; beq \$1, \$2, 40; a bubble; then or \$7, \$8, \$9, fetched 4 ns after beq. The lw is undone (flushed).]
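One very simple way to "record recent history" is to remember only the last outcome of a branch and predict that it repeats. The sketch below uses this 1-bit scheme as an illustration; it is an assumption chosen for brevity, not necessarily the predictor the slide has in mind, and the loop-branch pattern is invented.

```python
def count_correct_predictions(outcomes, initial_prediction=False):
    """Simulate a last-outcome (1-bit) predictor for a single branch.

    outcomes: list of booleans, True meaning the branch was taken.
    Returns the number of correct predictions.
    """
    prediction = initial_prediction
    correct = 0
    for taken in outcomes:
        if prediction == taken:
            correct += 1
        prediction = taken  # remember the most recent outcome
    return correct


if __name__ == "__main__":
    # Hypothetical loop branch: taken 9 times, then not taken, repeated 3 times.
    loop_branch = ([True] * 9 + [False]) * 3
    hits = count_correct_predictions(loop_branch)
    print(hits, len(loop_branch), hits / len(loop_branch))
```

This predictor misses twice per loop execution (on entry and on exit), which is why real designs usually keep more than one bit of history.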

15 Control Hazards

Solution 3: Delayed branch. Always execute the sequentially next statement, with the branch executing after a one-instruction delay. It is the compiler's job to find a statement that is independent of the branch outcome and can be put in the slot.

MIPS does this, but it is an option in SPIM (Simulator -> Settings).

[Figure: beq \$1, \$2, 40; add \$4, \$5, \$6 (delayed branch slot); lw, fetched 2 ns apart in program execution order.]

Delayed branch: beq is followed by an add that is independent of the branch outcome.

16 Data Hazards

Data hazard: an instruction depends on the result of a previous instruction still executing in the pipeline.

Solution: Forward data if possible.

[Figure: instruction pipeline diagram for add \$s0, \$t0, \$t1 through IF, ID, EX, MEM, WB; shading indicates use, left = write, right = read.]

[Figure: add \$s0, \$t0, \$t1 followed by sub \$t2, \$s0, \$t3 in program execution order, each through IF, ID, EX, MEM, WB.]

Without forwarding (blue line), data has to go back in time; with forwarding (red line), data is available in time.

17 Data Hazards

Forwarding may not be enough, e.g., if an R-type instruction following a load uses the result of the load. This is called a load-use data hazard.

[Figure: lw \$s0, 20(\$t1) followed by sub \$t2, \$s0, \$t3, each through IF, ID, EX, MEM, WB.] Without a stall it is impossible to provide input to the sub instruction in time.

[Figure: lw \$s0, 20(\$t1), then a bubble, then sub \$t2, \$s0, \$t3.] With a one-stage stall, forwarding can get the data to the sub instruction in time.
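The number of bubbles needed in each case can be derived from when a result becomes available and when the dependent instruction needs it. The sketch below assumes the five-stage pipeline used so far, with the producer fetched in cycle 1, and relies on the earlier assumption that the register file is written in the first half of a cycle and read in the second half. The function and its parameters are illustrative.

```python
def stalls_needed(producer_is_load, forwarding, distance=1):
    """Bubbles needed between a producer and a dependent consumer.

    distance: how many instructions after the producer the consumer is issued
    (1 means it follows immediately). The producer is in IF in cycle 1, so it is
    in EX in cycle 3, MEM in cycle 4, and WB in cycle 5.
    """
    if forwarding:
        # Result exists at the end of EX (ALU op) or at the end of MEM (load);
        # the consumer needs it at the start of its own EX stage.
        ready_after_cycle = 4 if producer_is_load else 3
        consumer_ex_cycle = 3 + distance
        return max(0, ready_after_cycle + 1 - consumer_ex_cycle)
    # Without forwarding the consumer must read registers (ID) no earlier than
    # the cycle in which the producer writes back (WB, cycle 5).
    consumer_id_cycle = 2 + distance
    return max(0, 5 - consumer_id_cycle)


if __name__ == "__main__":
    print(stalls_needed(producer_is_load=False, forwarding=False))  # add -> sub
    print(stalls_needed(producer_is_load=False, forwarding=True))   # add -> sub
    print(stalls_needed(producer_is_load=True, forwarding=True))    # lw -> sub
```

The load case is the only one where forwarding alone is not sufficient, which matches the one-stage stall described above.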

18 Reordering Code to Avoid Pipeline Stall (Software Solution)

Example:

| Original code | Reordered code |
|---|---|
| lw \$t0, 0(\$t1) | lw \$t0, 0(\$t1) |
| lw \$t2, 4(\$t1) | lw \$t2, 4(\$t1) |
| sw \$t2, 0(\$t1) (data hazard) | sw \$t0, 4(\$t1) (interchanged) |
| sw \$t0, 4(\$t1) | sw \$t2, 0(\$t1) (interchanged) |
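A compiler pass can look for this pattern mechanically. The following sketch checks only for a load whose destination is read by the very next instruction; the tuple encoding `(op, register, base_register)` is a simplification invented for this example and covers just lw and sw.

```python
def registers_read(instr):
    """Registers an instruction reads: lw reads its base, sw reads data and base."""
    op, reg, base = instr
    return {base} if op == "lw" else {reg, base}


def load_use_hazards(program):
    """Indices of lw instructions whose result is used by the next instruction."""
    hazards = []
    for i in range(len(program) - 1):
        op, dest, _ = program[i]
        if op == "lw" and dest in registers_read(program[i + 1]):
            hazards.append(i)
    return hazards


# (op, data/destination register, base register); offsets are omitted because
# they do not affect register dependences.
ORIGINAL = [("lw", "t0", "t1"), ("lw", "t2", "t1"),
            ("sw", "t2", "t1"), ("sw", "t0", "t1")]
REORDERED = [("lw", "t0", "t1"), ("lw", "t2", "t1"),
             ("sw", "t0", "t1"), ("sw", "t2", "t1")]

if __name__ == "__main__":
    print(load_use_hazards(ORIGINAL))   # the second lw feeds the next sw
    print(load_use_hazards(REORDERED))  # interchanging the stores removes it
```

Interchanging the two stores is safe here because they write different addresses and neither changes a register the other reads.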

19 Pipelined Datapath - Single-Cycle Datapath Steps

[Figure: single-cycle datapath (PC, instruction memory, register file, sign extend, ALU, data memory, adders, multiplexors) divided into five steps]

- IF: Instruction Fetch
- ID: Instruction Decode
- EX: Execute/Address Calc.
- MEM: Memory Access
- WB: Write Back

20 Pipelined Datapath

Idea: What happens if we break the execution into multiple cycles, but keep the extra hardware?

Answer: We may be able to start executing a new instruction at each clock cycle (pipelining), but we shall need extra registers to hold data between cycles: pipeline registers.

[Figure: pipelined datapath with pipeline registers IF/ID, ID/EX, EX/MEM, MEM/WB between the stages. Pipeline registers are wide enough to hold the data coming in.]

21 Pipelined Datapath

[Figure: pipelined datapath. Pipeline registers are wide enough to hold the data coming in: IF/ID 64 bits, ID/EX 128 bits, EX/MEM 97 bits, MEM/WB 64 bits.]

Only data flowing right to left may cause a hazard. Why?

22 Bug in the Datapath

The write register number comes from another, later instruction!

[Figure: pipelined datapath with pipeline registers IF/ID, ID/EX, EX/MEM, MEM/WB, highlighting the write register number (WN) input of the register file]

23 Corrected Datapath

[Figure: corrected pipelined datapath. Pipeline register widths: IF/ID 64 bits, ID/EX 133 bits, EX/MEM 102 bits, MEM/WB 69 bits.]

The destination register number is also passed through the ID/EX, EX/MEM, and MEM/WB registers, which are now wider by 5 bits.
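The register widths can be reproduced by adding up the fields each pipeline register carries. The field lists below are an assumption based on the usual MIPS pipelined datapath (they are not spelled out on the slides), chosen because they are consistent with the stated widths; the 5 extra bits are the destination register number.

```python
# Assumed contents of each pipeline register, in bits.
IF_ID = {"pc_plus_4": 32, "instruction": 32}
ID_EX = {"pc_plus_4": 32, "read_data_1": 32, "read_data_2": 32, "sign_ext_imm": 32}
EX_MEM = {"branch_target": 32, "zero": 1, "alu_result": 32, "read_data_2": 32}
MEM_WB = {"mem_read_data": 32, "alu_result": 32}

DEST_REG_BITS = 5  # a MIPS register number needs 5 bits (32 registers)


def width(fields, carries_dest_reg=False):
    """Total width of a pipeline register, optionally with the destination register."""
    return sum(fields.values()) + (DEST_REG_BITS if carries_dest_reg else 0)


if __name__ == "__main__":
    for name, fields in [("ID/EX", ID_EX), ("EX/MEM", EX_MEM), ("MEM/WB", MEM_WB)]:
        print(name, width(fields), width(fields, carries_dest_reg=True))
```

IF/ID does not grow because the destination register number is still part of the instruction word at that point.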

24 Single-Clock-Cycle Diagram: Clock Cycle 1 LW

Example: lw \$t0, 10(\$t1); sw \$t3, 20(\$t4); add \$t5, \$t6, \$t7; sub \$t8, \$t9, \$t10

25 Single-Clock-Cycle Diagram: Clock Cycle 2 SW LW

26 Single-Clock-Cycle Diagram: Clock Cycle 3 ADD SW LW

27 Single-Clock-Cycle Diagram: Clock Cycle 4 SUB ADD SW LW

28 Single-Clock-Cycle Diagram: Clock Cycle 5 SUB ADD SW LW

29 Single-Clock-Cycle Diagram: Clock Cycle 6 SUB ADD SW

30 Single-Clock-Cycle Diagram: Clock Cycle 7 SUB ADD

31 Single-Clock-Cycle Diagram: Clock Cycle 8 SUB

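32 Alternative View: Multiple-Clock-Cycle Diagram

| Instruction | CC 1 | CC 2 | CC 3 | CC 4 | CC 5 | CC 6 | CC 7 | CC 8 |
|---|---|---|---|---|---|---|---|---|
| lw \$t0, 10(\$t1) | IM | REG | ALU | DM | REG | | | |
| sw \$t3, 20(\$t4) | | IM | REG | ALU | DM | REG | | |
| add \$t5, \$t6, \$t7 | | | IM | REG | ALU | DM | REG | |
| sub \$t8, \$t9, \$t10 | | | | IM | REG | ALU | DM | REG |

The time axis runs from left to right.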
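A multiple-clock-cycle diagram like this one can be generated for any instruction sequence, as long as there are no stalls. The sketch below assumes one instruction is fetched per cycle and each passes through the five units in order; the function names are illustrative.

```python
UNITS = ["IM", "REG", "ALU", "DM", "REG"]


def multi_cycle_rows(instructions):
    """Return (instruction, cells) pairs; cells[c] is the unit used in cycle c + 1."""
    total_cycles = len(instructions) + len(UNITS) - 1
    rows = []
    for i, name in enumerate(instructions):
        cells = [""] * total_cycles
        for stage, unit in enumerate(UNITS):
            cells[i + stage] = unit  # instruction i enters the pipeline in cycle i + 1
        rows.append((name, cells))
    return rows


def render(instructions):
    """Format the diagram as plain text, using '-' for idle cycles."""
    total_cycles = len(instructions) + len(UNITS) - 1
    header = ["instruction"] + [f"CC {c}" for c in range(1, total_cycles + 1)]
    lines = [" | ".join(header)]
    for name, cells in multi_cycle_rows(instructions):
        lines.append(" | ".join([name] + [cell or "-" for cell in cells]))
    return "\n".join(lines)


if __name__ == "__main__":
    print(render(["lw", "sw", "add", "sub"]))
```

Reading one column of the output top to bottom gives the single-clock-cycle view for that cycle.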

33 Notes

- There is no write control for the pipeline registers and the PC, since they are updated at every clock cycle.
- To specify the control for the pipeline, set the control values during each pipeline stage.
- Control lines can be divided into 5 groups:
  - IF: NONE
  - ID: NONE
  - ALU: RegDst, ALUOp, ALUSrc
  - MEM: Branch, MemRead, MemWrite
  - WB: MemtoReg, RegWrite
- Group these nine control lines (ALUOp counts as two) into 3 subsets: ALUControl, MEMControl, WBControl.
- Control signals are generated at the ID stage. How do we pass them to the other stages?


36 Instruction-Level Parallelism (ILP)

Pipelining: executing multiple instructions in parallel.

To increase ILP:

- Deeper pipeline
  - Less work per stage, so a shorter clock cycle
- Multiple issue
  - Replicate pipeline stages: multiple pipelines
  - Start multiple instructions per clock cycle
  - CPI < 1, so use Instructions Per Cycle (IPC)
  - E.g., 4 GHz 4-way multiple-issue: 16 BIPS, peak CPI = 0.25, peak IPC = 4
  - But dependencies reduce this in practice
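The peak figures in the example follow from the clock rate and the issue width alone. A minimal sketch, with illustrative names, assuming every issue slot is filled in every cycle:

```python
def peak_throughput(clock_hz, issue_width):
    """Return (peak IPC, peak CPI, peak instructions per second)."""
    peak_ipc = issue_width
    peak_cpi = 1 / peak_ipc
    instructions_per_second = clock_hz * peak_ipc
    return peak_ipc, peak_cpi, instructions_per_second


if __name__ == "__main__":
    ipc, cpi, ips = peak_throughput(clock_hz=4e9, issue_width=4)
    print(ipc, cpi, ips / 1e9, "BIPS")
```

These are upper bounds; any dependency or hazard that leaves an issue slot empty lowers the achieved IPC.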

37 How ILP Works

- Issuing multiple instructions per cycle requires fetching multiple instructions from memory per cycle. The number fetched is called the superscalar degree or issue width.
- To find independent instructions, we must have a big pool of instructions to choose from, called the instruction buffer (IB). As the IB length increases, the complexity of the decoder (control) increases, which increases the datapath cycle time.
- Instructions are prefetched sequentially by an IFU that operates independently from the datapath control: it fetches the instruction at (PC)+L, where L is the IB size, or as directed by the branch predictor.

38 Compiler/Hardware Speculation

The compiler can reorder instructions: Static Multiple Issue.

- The compiler groups instructions into issue packets.
  - A group of instructions that can be issued in a single cycle
  - Determined by the pipeline resources required
- Think of an issue packet as a very long instruction.
  - Specifies multiple concurrent operations
  - Very Long Instruction Word (VLIW)
- The compiler must remove some/all hazards.
  - Reorder instructions into issue packets
  - No dependencies within a packet
  - Varies between ISAs; the compiler must know!
  - Pad with nop if necessary

The hardware can look ahead for instructions to execute.

- Buffer results until it determines they are actually needed.
- Flush buffers on incorrect speculation.

Explicitly Parallel Instruction Computer (EPIC).

39 Loop Unrolling

Replicate the loop body to expose more parallelism, renaming the registers.

Loop: lw \$t0, 0(\$s1); addu \$t0, \$t0, \$s2; sw \$t0, 0(\$s1); addi \$s1, \$s1, -4; bne \$s1, \$zero, Loop
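The effect of unrolling and renaming can be illustrated at a higher level. The sketch below is a Python analogue of the loop (add a scalar to every array element), not a translation of the MIPS code; the unroll factor of 4 and the function names are choices made for this example, and the separate temporaries `t0` to `t3` stand in for renamed registers.

```python
def add_scalar(values, s):
    """Rolled loop: one load, add, and store per iteration."""
    for i in range(len(values)):
        values[i] = values[i] + s


def add_scalar_unrolled(values, s):
    """Loop unrolled by 4, assuming the length is a multiple of 4.

    Each copy of the body uses its own temporary, so the four additions are
    independent of each other and could be scheduled in parallel.
    """
    if len(values) % 4 != 0:
        raise ValueError("length must be a multiple of 4 in this sketch")
    for i in range(0, len(values), 4):
        t0 = values[i] + s
        t1 = values[i + 1] + s
        t2 = values[i + 2] + s
        t3 = values[i + 3] + s
        values[i], values[i + 1], values[i + 2], values[i + 3] = t0, t1, t2, t3


if __name__ == "__main__":
    a = list(range(8))
    b = list(range(8))
    add_scalar(a, 10)
    add_scalar_unrolled(b, 10)
    print(a == b, b)
```

Unrolling also cuts the loop overhead: the index update and the branch run once per four elements instead of once per element.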

40 HW Schemes: Instruction Parallelism

Why in HW at run time?

- Works when the real dependence can't be known at compile time
- The compiler is simpler
- Code for one machine runs well on another

Key idea: Allow instructions behind a stall to proceed.

Example: DIVD F0, F2, F4; ADDD F10, F0, F8; SUBD F12, F8, F14. ADDD needs F0 from DIVD and must wait, but SUBD is independent of both and can proceed.

This enables out-of-order execution, and therefore out-of-order completion. The ID stage checks for hazards; if there are no hazards, it issues the instruction for execution.

41 Dynamic Multiple Issue (Superscalar)

Superscalar processors: an advanced pipelining technique that enables the processor to execute more than one instruction per clock cycle.

- The CPU decides whether to issue 0, 1, ..., IPC instructions each cycle, avoiding structural and data hazards (dynamic pipeline).
- Avoids the need for compiler scheduling.
- Allows the CPU to execute instructions out of order to avoid stalls, but commit results to registers in order.

Example: lw \$t0, 20(\$s2); addu \$t1, \$t0, \$t2; sub \$s4, \$s4, \$t3; slti \$t5, \$s4, 20

The CPU can start sub while addu is waiting for lw.


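43 Speed Up Equation for Pipelining

$$\text{CPI}_{\text{pipelined}} = \text{Ideal CPI} + \text{Pipeline stall clock cycles per instruction}$$

$$\text{Speedup} = \frac{\text{Ideal CPI} \times \text{Pipeline depth}}{\text{Ideal CPI} + \text{Pipeline stall CPI}} \times \frac{\text{Clock cycle}_{\text{unpipelined}}}{\text{Clock cycle}_{\text{pipelined}}}$$

For an ideal CPI of 1, this simplifies to:

$$\text{Speedup} = \frac{\text{Pipeline depth}}{1 + \text{Pipeline stall CPI}} \times \frac{\text{Clock cycle}_{\text{unpipelined}}}{\text{Clock cycle}_{\text{pipelined}}}$$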
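The speedup equation is easy to evaluate for different stall rates. The following is a small sketch of the general form with the ideal CPI as a parameter; the function name, the parameter names, and the sample values are illustrative.

```python
def pipeline_speedup(depth, stall_cpi, ideal_cpi=1.0,
                     clock_unpipelined=1.0, clock_pipelined=1.0):
    """Speedup of a pipelined design over an unpipelined one.

    depth: number of pipeline stages
    stall_cpi: average pipeline stall cycles per instruction
    """
    cpi_ratio = (ideal_cpi * depth) / (ideal_cpi + stall_cpi)
    return cpi_ratio * (clock_unpipelined / clock_pipelined)


if __name__ == "__main__":
    print(pipeline_speedup(depth=5, stall_cpi=0.0))   # no stalls
    print(pipeline_speedup(depth=5, stall_cpi=0.25))  # some stalls
```

With equal clock cycles and no stalls the speedup equals the pipeline depth; every stall cycle per instruction pulls it below that bound.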

