Chapter 5 (a) Overview



1 Chapter 5 (a) Overview
(a) The principles of pipelining
(a) A pipelined design of SRC
(b) Pipeline hazards
(b) Instruction-level parallelism (ILP)
- Superscalar processors
- Very Long Instruction Word (VLIW) machines
(b) Microprogramming
- Control store and micro-branching
- Horizontal and vertical microprogramming

2 Pipelining: It's Natural!
Laundry example: Ann, Brian, Cathy, and Dave each have one load of clothes to wash, dry, and fold.
- Washer takes 30 minutes
- Dryer takes 40 minutes
- Folder takes 20 minutes

3 Sequential Laundry
[Figure: task order A, B, C, D on a time axis from 6 PM to midnight]
Sequential laundry takes 6 hours for 4 loads.
If they learned pipelining, how long would laundry take?

4 Pipelined Laundry
[Figure: task order A, B, C, D overlapped on a time axis starting at 6 PM]
Start work ASAP.
Pipelined laundry takes 3.5 hours for 4 loads.
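The 6-hour and 3.5-hour figures for the laundry example can be checked with a short Python sketch. The function names are illustrative, and the model assumes a load may wait between machines whenever the next machine is still busy.

```python
def sequential_minutes(stage_minutes, loads):
    """Each load finishes every stage before the next load starts."""
    return loads * sum(stage_minutes)


def pipelined_minutes(stage_minutes, loads):
    """Each machine handles one load at a time; loads start as soon as possible."""
    stage_free_at = [0] * len(stage_minutes)  # time each machine becomes free
    for _ in range(loads):
        t = 0  # every load is ready at the start (6 PM)
        for s, duration in enumerate(stage_minutes):
            start = max(t, stage_free_at[s])  # wait for the load and the machine
            t = start + duration
            stage_free_at[s] = t
    return stage_free_at[-1]


LAUNDRY = [30, 40, 20]  # washer, dryer, folder (minutes)
print(sequential_minutes(LAUNDRY, 4) / 60)  # hours, sequential
print(pipelined_minutes(LAUNDRY, 4) / 60)   # hours, pipelined
```

The pipelined total is one wash, four dryer slots, and one fold, which shows how the 40-minute dryer sets the rate.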

5 Pipelining Lessons
- Pipelining doesn't help the latency of a single task; it helps the throughput of the entire workload
- Pipeline rate is limited by the slowest pipeline stage
- Multiple tasks operate simultaneously
- Potential speedup = number of pipe stages
- Unbalanced lengths of pipe stages reduce speedup
- Time to fill the pipeline and time to drain it reduce speedup

6 A 5-Stage Pipeline
1. Fetch instruction
2. Fetch operands
3. ALU operation
4. Memory access
5. Register write
A five-instruction program:
    shr r3, r3, 2
    sub r2, r5, r1
    add r4, r3, r2
    st r4, addr1
    ld r2, addr2

7 Pipelining
Program: shr r3,r3,2; sub r2,r5,r1; add r4,r3,r2; st r4,addr1; ld r2,addr2
Clock              1          2          3          4          5          6          7          8          9
Fetch instruction  shr        sub        add        st         ld
Fetch operands                shr        sub        add        st         ld (null)
ALU operation                            shr        sub        add        st         ld
Memory access                                       shr (null) sub (null) add (null) st         ld
Register write                                                 shr        sub        add        st (null)  ld
Are there any potential operand availability conflicts?

8 Pipelining Instruction Processing
Pipeline stages are shown top to bottom in the order traversed by one instruction (they are usually shown horizontally). Instructions are listed in the order they are fetched.
If each stage takes one clock:
- every instruction takes 5 clocks to complete
- one instruction completes every clock tick once the pipeline is full
Two performance issues:
- instruction latency: the time for an instruction to completely execute
- instruction bandwidth: the number of instructions executed per second
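The latency and bandwidth claims can be made concrete with a small Python sketch. It assumes an ideal pipeline with one clock per stage and no stalls; the function names are illustrative.

```python
def pipelined_clocks(n_instructions, n_stages=5):
    """Clocks to finish n instructions: fill the pipe, then one completes per clock."""
    return n_stages + n_instructions - 1


def unpipelined_clocks(n_instructions, n_stages=5):
    """Clocks when each instruction passes through every stage before the next starts."""
    return n_stages * n_instructions


def speedup(n_instructions, n_stages=5):
    """Ratio of unpipelined to pipelined clocks for the same program."""
    return unpipelined_clocks(n_instructions, n_stages) / pipelined_clocks(n_instructions, n_stages)


for n in (5, 100, 10_000):
    print(n, pipelined_clocks(n), round(speedup(n), 3))
```

Latency stays at 5 clocks per instruction, while the speedup approaches the number of stages as the instruction count grows.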

9 Dependence Among Instructions
Execution dependencies: some instructions may depend on the completion of others in the pipeline.
- Solution 1: stall the pipeline. Early stages stop while later ones complete processing.
- Solution 2: forwarding. Some register dependences can be detected and the data forwarded to the instruction needing it, which eliminates waiting for the register write.
- Solution 3: dependence involving memory is harder. It is sometimes addressed by restricting instruction set usage (branch delay slot, load delay), or it may be solved using multithreading.
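A minimal Python sketch of how register dependences could be detected in the five-instruction program from the earlier slides. The tuple encoding and the `window` parameter are assumptions made for illustration: a consumer is flagged when it follows the writer of one of its source registers by at most `window` instructions, and the exact window depends on register-file timing the slides do not specify.

```python
# (mnemonic, destination register or None, source registers)
PROGRAM = [
    ("shr", "r3", ["r3"]),
    ("sub", "r2", ["r5", "r1"]),
    ("add", "r4", ["r3", "r2"]),
    ("st", None, ["r4"]),
    ("ld", "r2", []),
]


def raw_hazards(program, window=3):
    """List (producer, consumer, register, distance) read-after-write pairs."""
    hazards = []
    for i, (consumer, _, sources) in enumerate(program):
        for reg in sources:
            # Search backwards for the nearest earlier writer inside the window.
            for j in range(i - 1, max(i - window, 0) - 1, -1):
                producer, dest, _ = program[j]
                if dest == reg:
                    hazards.append((producer, consumer, reg, i - j))
                    break
    return hazards


for hazard in raw_hazards(PROGRAM):
    print(hazard)
```

Each reported pair is a candidate for either a stall or forwarding; the add-st pair on r4 is the one the stalling slide addresses.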

10 Pipelining
[Figure: pipeline occupancy over time for the program shr r3,r3,2; sub r2,r5,r1; add r4,r3,r2; st r4,addr1; ld r2,addr2, with one column per stage (Fetch instr., Fetch oper., ALU oper., Memory access, Register write). st and ld are held by stall cycles.]
Stalling to prevent add-st hazard.

11 Branch & Load Delay Examples
Branch delay:
    brz r2, r3
    add r6, r7, r8      ; this instruction is always executed
    st r6, addr1        ; only done if r3 ≠ 0
Load delay:
    ld r2, addr
    add r5, r1, r2      ; this instruction gets the old value of r2
    shr r1, r1, 4
    sub r6, r8, r2      ; this instruction gets the r2 value loaded from addr
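The load-delay example can be imitated with a small Python sketch in which a loaded value becomes visible only after one following instruction has executed. The interpreter, the tuple encoding, and all register and memory values are hypothetical and chosen only for illustration; this is not the SRC hardware mechanism.

```python
def run_with_load_delay(program, regs, memory, delay=1):
    """Run a tiny SRC-like program; a loaded value lands `delay` instructions late."""
    regs = dict(regs)  # work on a copy
    pending = []       # entries: [instructions still to wait, dest, value]
    for op, dst, a, b in program:
        # Commit loads whose delay has elapsed before this instruction reads.
        for _, reg, value in [p for p in pending if p[0] == 0]:
            regs[reg] = value
        pending = [p for p in pending if p[0] > 0]
        new_load = None
        if op == "ld":
            new_load = [delay, dst, memory[a]]
        elif op == "add":
            regs[dst] = regs[a] + regs[b]
        elif op == "sub":
            regs[dst] = regs[a] - regs[b]
        elif op == "shr":
            regs[dst] = regs[a] >> b  # b is an immediate shift count
        else:
            raise ValueError(f"unsupported op: {op}")
        for p in pending:
            p[0] -= 1
        if new_load is not None:
            pending.append(new_load)
    for _, reg, value in pending:  # flush loads still in flight
        regs[reg] = value
    return regs


PROGRAM = [
    ("ld", "r2", "addr", None),
    ("add", "r5", "r1", "r2"),  # sees the old r2
    ("shr", "r1", "r1", 4),
    ("sub", "r6", "r8", "r2"),  # sees the loaded r2
]
final = run_with_load_delay(PROGRAM, {"r1": 16, "r2": 100, "r8": 50}, {"addr": 7})
print(final["r5"], final["r6"])
```

With these made-up values the add computes with the stale r2 (100) and the sub with the loaded r2 (7).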

12 Processor Design Characteristics
- Main memory must operate in one cycle. This is usually done with a cache (discussed later).
- Instruction and data memory must be separate. A Harvard architecture has separate instruction and data memories; this is usually done with separate caches.
- To avoid contention, few buses are used. Most connections are point to point, and some few-way multiplexers are used.
- Data is stored in temporary registers located between pipeline stages: the pipeline registers.
- ALU operations take only 1 clock (esp. shift).

13 ALU Instructions Use the Five-Stage Pipeline
- The second ALU operand comes from either a register or the IR c2 field.
- The op code must be available in stage 3 to specialize the ALU.
- The result register, ra, is written in stage 5.
- There are no memory operations for ALU instructions.

14 Control Signals for Pipeline Stage Activity
branch := br ∨ brl :
cond := (IR2⟨2..0⟩ = 1) ∨ ((IR2⟨2..1⟩ = 1) ∧ (IR2⟨0⟩ ⊕ R[rb] = 0)) ∨ ((IR2⟨2..1⟩ = 2) ∧ (¬IR2⟨0⟩ ⊕ R[rb]⟨31⟩)) :
sh := shr ∨ shra ∨ shl ∨ shc :                                  ;shifts
alu := add ∨ addi ∨ sub ∨ neg ∨ and ∨ andi ∨ or ∨ ori ∨ not ∨ sh :  ;ALU instr.
imm := addi ∨ andi ∨ ori ∨ (sh ∧ (IR2⟨4..0⟩ ≠ 0)) :            ;immediate operand
load := ld ∨ ldr :                                              ;load instruction
ladr := la ∨ lar :                                              ;load address instruction
store := st ∨ str :                                             ;store instructions
l-s := load ∨ ladr ∨ store :                                    ;memory address instructions
regwrite := load ∨ ladr ∨ brl ∨ alu :                           ;instructions that write to the register file
dsp := ld ∨ st ∨ la :                                           ;instructions that use disp addressing
rl := ldr ∨ str ∨ lar :                                         ;instructions that use relative addressing
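The opcode-class signals above can be sketched in Python as set-membership tests on the mnemonic. This is an illustrative model, not the hardware decoder: `control_signals` and its `shift_count_field` parameter are hypothetical names, the shift case of imm assumes the slide's condition tests for a nonzero immediate shift count, and cond is left out because it depends on instruction bits and a register value.

```python
SH = {"shr", "shra", "shl", "shc"}
ALU = {"add", "addi", "sub", "neg", "and", "andi", "or", "ori", "not"} | SH
LOAD = {"ld", "ldr"}
LADR = {"la", "lar"}
STORE = {"st", "str"}


def control_signals(op, shift_count_field=0):
    """Return the slide's opcode-class signals for one mnemonic.

    shift_count_field stands in for the immediate shift count (an assumption).
    """
    sh = op in SH
    alu = op in ALU
    load = op in LOAD
    ladr = op in LADR
    store = op in STORE
    return {
        "branch": op in {"br", "brl"},
        "sh": sh,
        "alu": alu,
        "imm": op in {"addi", "andi", "ori"} or (sh and shift_count_field != 0),
        "load": load,
        "ladr": ladr,
        "store": store,
        "l-s": load or ladr or store,
        "regwrite": load or ladr or op == "brl" or alu,
        "dsp": op in {"ld", "st", "la"},
        "rl": op in {"ldr", "str", "lar"},
    }


print(control_signals("ld"))
print(control_signals("shr", shift_count_field=2)["imm"])
```

Evaluating the same function on the opcode held in each stage's pipeline register gives the per-stage versions of a signal, such as regwrite in stage 5.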

15 Notes on the Equations & Different Stages
The logic equations are based on the instruction in the stage where they are used. When necessary, we append a digit to a logic signal name to specify the stage in which it is computed. For example, regwrite5 is true when op5, the opcode in stage 5, satisfies load5 ∨ ladr5 ∨ brl5 ∨ alu5.

16 Load & Store Instructions ALU computes effective addresses. Stage 4 does read or write. Result register written only on load.

17 Branch Instructions
- The new program counter value is known in stage 2 but not in stage 1.
- Only branch & link does a register write in stage 5.
- There is no ALU or memory operation.

18 SRC Pipeline Registers & RTN Specification The pipeline registers pass information from stage to stage. RTN specifies output register values in terms of input reg. values for stage. Discuss RTN at each stage.

19 Global State of the Pipelined SRC
PC, the general registers, instruction memory, and data memory make up the global machine state.
- PC is accessed in stage 1 (and in stage 2 on a branch).
- Instruction memory is accessed in stage 1.
- General registers are read in stage 2 and written in stage 5.
- Data memory is accessed only in stage 4.

20 Restrictions on Pipeline Access to Global State
Why are separate instruction and data memories (or caches) needed? When a load or store accesses data memory in stage 4, stage 1 is accessing an instruction, so two memory accesses occur simultaneously.
Two register operands may be needed in stage 2 while, in stage 5, another instruction is writing a result to a register. Thus the register file must support two reads and one write simultaneously.
Note that the increment of PC in stage 1 needs to be overridden by a successful branch in stage 2.

21 Pipeline Data Path & Control Signals Most control signals shown & given values. Multiplexer control is stressed.

22 Example: Instruction Propagation Through Pipe
    100: add r4, r6, r8     ; R[4] ← R[6] + R[8]
    104: ld r7, 128(r5)     ; R[7] ← M[R[5] + 128]
    108: brl r9, r11, 001   ; PC ← R[11] : R[9] ← PC
    112: str r12, 32        ; M[PC + 32] ← R[12]
    512: sub ...            ; next instruction
It is assumed that R[11] = 512 when the brl instruction is executed. R[6] = 4 and R[8] = 5 are the add operands. R[5] = 16 for the ld and R[12] = 23 for the str.
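The values that flow through the pipe in this example follow directly from the stated register contents. A minimal Python sketch, with a hypothetical `example_values` helper, works them out; the str address is left out because it depends on which PC value the store sees, which the slide does not state.

```python
def example_values(regs):
    """Evaluate the RTN of the add, ld, and brl instructions for given registers."""
    return {
        "add_result": regs[6] + regs[8],  # R[4] <- R[6] + R[8]
        "ld_address": regs[5] + 128,      # R[7] <- M[R[5] + 128]
        "branch_target": regs[11],        # PC <- R[11]
    }


REGS = {5: 16, 6: 4, 8: 5, 11: 512, 12: 23}  # values assumed on the slide
print(example_values(REGS))
```

The add therefore writes 9 to R[4], the ld reads memory location 144, and the next fetch after the branch takes effect is from 512.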

23 Cycle 1: add Enters Pipe
Program counter is incremented to 104.

24 Cycle 2: ld Enters Pipe
add operands are fetched in stage 2.

25 Cycle 3: brl Enters Pipe
add performs its arithmetic in stage 3.

26 Cycle 4: str Enters Pipe
add is idle in stage 4. Success of brl changes the program counter to 512.

27 Cycle 5: sub Enters Pipe
add completes in stage 5. sub is fetched from location 512 after the successful brl.
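The five cycles above can be summarized by a small Python sketch that reports which instruction occupies each stage in a given cycle. It assumes no stalls and takes the fetch order from the slides (sub is fetched from 512 after the successful brl); the `occupancy` function is an illustrative name.

```python
FETCH_ORDER = ["add", "ld", "brl", "str", "sub"]


def occupancy(cycle, fetch_order=FETCH_ORDER, n_stages=5):
    """Map stage number (1..n_stages) to the instruction in it during `cycle` (1-based)."""
    stages = {}
    for stage in range(1, n_stages + 1):
        index = cycle - stage  # instruction fetched (stage - 1) cycles earlier
        in_range = 0 <= index < len(fetch_order)
        stages[stage] = fetch_order[index] if in_range else None
    return stages


for cycle in range(1, 6):
    print(cycle, occupancy(cycle))
```

In cycle 5 the pipe is full: sub is in stage 1 and add is completing in stage 5.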
