Lecture 7 Pipelining. Peng Liu.

Alban Lawrence
6 years ago

1 Lecture 7 Pipelining Peng Liu liupeng@zju.edu.cn 1

2 Review: The Single Cycle Processor 2

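3 Review: Given Datapath, RTL -> Control
[Figure: the instruction memory, addressed by Adr, supplies Instruction<31:0>. Its fields Op, Fun, Rs <21:25>, Rt <16:20>, Rd <11:15>, and Imm16 <0:15> feed the control unit and the datapath. Control drives PCSrc, RegWr, RegDst, ExtOp, ALUSrc, ALUctr, MemWr, and MemtoReg into the datapath and receives Zero back.]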

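4 Review: The Concept of Local Decoding
[Figure: Main Control decodes op (6 bits) into ALUop (N bits). The local ALU Control combines ALUop with func (6 bits) to produce ALUctr (3 bits), which drives the ALU.]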

5 Review: The Encoding of ALUop
[Figure: Main Control (op, 6 bits) -> ALUop (N bits) -> ALU Control (Local; func, 6 bits) -> ALUctr (3 bits)]
In this exercise, ALUop has to be 2 bits wide to represent:
  (1) R-type instructions
  I-type instructions that require the ALU to perform:
  (2) Or, (3) Add, and (4) Subtract
To implement more of the MIPS ISA, ALUop has to be 3 bits wide (4 bits in the book, to include NOR) to represent:
  (1) R-type instructions
  I-type instructions that require the ALU to perform:
  (2) Or, (3) Add, (4) Subtract, and (5) And (example: andi)

                    R-type   ori   lw    sw    beq        jump
  ALUop (symbolic)  R-type   Or    Add   Add   Subtract   xxx
  ALUop<2:0>                                              xxx

6 Review: The Decoding of the func Field
[Figure: Main Control (op, 6 bits) -> ALUop (N bits) -> ALU Control (Local; func, 6 bits) -> ALUctr -> ALU]

                    R-type   ori   lw    sw    beq        jump
  ALUop (symbolic)  R-type   Or    Add   Add   Subtract   xxx
  ALUop<2:0>                                              xxx

R-type format: op | rs | rt | rd | shamt | funct
  - funct<5:0> selects the instruction operation: add, subtract, and, or, set-on-less-than
  - ALUctr<2:0> selects the ALU operation: And, Or, Add, Subtract, Set-on-less-than

7 Drawback of This Single Cycle Processor
Long cycle time: the cycle must be long enough for the load instruction:
    PC's Clock-to-Q
  + Instruction Memory Access Time
  + Register File Access Time
  + ALU Delay (address calculation)
  + Data Memory Access Time
  + Register File Setup Time
  + Clock Skew
The cycle time for load is much longer than needed for all other instructions.

8 Single Cycle Processor
Advantages
  - A single cycle per instruction makes logic and clocking simple
Disadvantages
  - Inefficient utilization of memory and functional units, since different instructions take different lengths of time
      - The ALU only computes values a small fraction of the time
  - Cycle time is set by the worst-case path -> long cycle times
      - The load instruction
  - Best possible CPI is 1

9 Single Cycle Processor Performance
Functional unit delay:
  - Memory: 200 ps
  - ALU and adders: 200 ps
  - Register file: 100 ps
CPU clock cycle = 800 ps = 0.8 ns (1.25 GHz)
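
The 800 ps figure is the delay of the slowest instruction, the load. A minimal sketch of that calculation follows. The per-instruction unit lists in `PATHS` are an assumption based on the usual MIPS single-cycle datapath; the slide gives only the three unit delays.

```python
# Functional-unit delays from the slide, in picoseconds.
MEMORY_PS = 200
ALU_PS = 200
REGFILE_PS = 100

# Units each instruction class passes through. This breakdown is an
# assumption (typical MIPS single-cycle datapath), not taken from the slide.
PATHS = {
    "load":   [MEMORY_PS, REGFILE_PS, ALU_PS, MEMORY_PS, REGFILE_PS],
    "store":  [MEMORY_PS, REGFILE_PS, ALU_PS, MEMORY_PS],
    "alu":    [MEMORY_PS, REGFILE_PS, ALU_PS, REGFILE_PS],
    "branch": [MEMORY_PS, REGFILE_PS, ALU_PS],
    "jump":   [MEMORY_PS],
}

def path_delay_ps(kind):
    """Total delay of the units one instruction class uses."""
    return sum(PATHS[kind])

def single_cycle_clock_ps():
    """A single-cycle clock must cover the slowest instruction class."""
    return max(path_delay_ps(kind) for kind in PATHS)

print(single_cycle_clock_ps())  # 800, set by the load path
```

Under these assumptions every other class finishes early and idles for the rest of the cycle, which is the inefficiency the previous slide describes.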

10 Variable Clock Single Cycle Processor Performance
Instruction mix: 45% ALU, 25% loads, 10% stores, 15% branches, 5% jumps
Average CPU clock cycle = 0.6 x 45% + 0.8 x 25% + 0.7 x 10% + 0.5 x 15% + 0.2 x 5% = 0.625 ns (1.6 GHz)
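
The weighted sum on this slide can be checked directly. The sketch below uses only the mix and the per-class times that appear in the slide's formula; the dictionary and function names are illustrative.

```python
# Instruction mix and per-class cycle times (ns) as they appear in the slide's sum.
MIX = {"alu": 0.45, "load": 0.25, "store": 0.10, "branch": 0.15, "jump": 0.05}
TIME_NS = {"alu": 0.6, "load": 0.8, "store": 0.7, "branch": 0.5, "jump": 0.2}

def average_cycle_ns(mix, time_ns):
    """Mix-weighted average cycle time for a hypothetical variable clock."""
    return sum(mix[kind] * time_ns[kind] for kind in mix)

avg = average_cycle_ns(MIX, TIME_NS)
print(f"{avg:.3f} ns, {1 / avg:.2f} GHz")  # prints 0.625 ns, 1.60 GHz
```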

11 Increasing Parallelism
Problem: each functional unit is used only once per cycle
  - Most of the time it is sitting waiting for its turn
  - Strictly, it is computing all the time, but it is waiting for valid data
  - There is no parallelism in this arrangement
Making instructions take more cycles makes the machine faster!
  - Each instruction takes roughly the same time
  - Each instruction now takes many more cycles, but the clock frequency is much higher
Overlap the execution of multiple instructions
  - Different instructions will be active at the same time
  - This is called pipelining
  - It increases the parallelism going on in the machine
We will look at a 5-stage pipeline
  - Modern machines take on the order of 15 cycles per instruction from fetch to completion

12 Pipelined MIPS Processor
Start the next instruction while still working on the current one
  - improves throughput or bandwidth: the total amount of work done in a given time (average instructions per second or per clock)
  - instruction latency (time from the start of an instruction to its completion) is not reduced

          Cycle 1  Cycle 2  Cycle 3  Cycle 4  Cycle 5  Cycle 6  Cycle 7  Cycle 8
  lw      IFetch   Dec      Exec     Mem      WB
  sw               IFetch   Dec      Exec     Mem      WB
  R-type                    IFetch   Dec      Exec     Mem      WB

  - the pipeline clock cycle (pipeline stage time) is limited by the slowest stage
  - for some instructions, some stages are wasted cycles

13 Pipeline
  - Pipelining doesn't help the latency of a single task; it helps the throughput of the entire workload
  - Multiple tasks operate simultaneously
  - Potential speedup = number of pipe stages
  - Pipeline rate is limited by the slowest pipeline stage

14 Pipelining the MIPS ISA
What makes it easy
  - all instructions are the same length (32 bits)
      - easier to fetch in the 1st stage and decode in the 2nd stage
  - few instruction formats (three) with symmetry across formats
      - can begin reading the register file in the 2nd stage
  - memory operations occur only in loads and stores
      - can use the execute stage to calculate memory addresses
  - each MIPS instruction writes at most one result and does so near the end of the pipeline

15 An Ideal Pipeline
[Figure: stage 1 -> stage 2 -> stage 3 -> stage 4]
  - All objects go through the same stages
  - No sharing of resources between any two stages
  - Propagation delay through all pipeline stages is equal
  - The scheduling of an object entering the pipeline is not affected by the objects in other stages
These conditions generally hold for industrial assembly lines, but instructions depend on each other!

16 Single Cycle, Multiple Cycle, vs. Pipeline
Single Cycle Implementation:
  Cycle 1: Load
  Cycle 2: Store, with the unused remainder of the cycle wasted
Multiple Cycle Implementation:
  Cycle   1       2     3     4     5     6       7     8     9     10
  lw      IFetch  Dec   Exec  Mem   WB
  sw                                      IFetch  Dec   Exec  Mem
  R-type                                                            IFetch
Pipeline Implementation:
  lw      IFetch  Dec     Exec    Mem   WB
  sw              IFetch  Dec     Exec  Mem   WB
  R-type                  IFetch  Dec   Exec  Mem   WB
  (for some instructions, some stages are wasted cycles)

17 Multiple Cycle v. Pipeline, Bandwidth v. Latency
Multiple Cycle Implementation:
  Cycle   1       2     3     4     5     6       7     8     9     10
  lw      IFetch  Dec   Exec  Mem   WB
  sw                                      IFetch  Dec   Exec  Mem
  R-type                                                            IFetch
Pipeline Implementation:
  lw      IFetch  Dec     Exec    Mem   WB
  sw              IFetch  Dec     Exec  Mem   WB
  R-type                  IFetch  Dec   Exec  Mem   WB
  - Latency per lw = 5 clock cycles for both
  - Bandwidth of lw is 1 per clock (IPC) for the pipeline vs. 1/5 IPC for multicycle
  - Pipelining improves instruction bandwidth, not instruction latency
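
The bandwidth comparison can be expressed as a cycle count. This is a minimal sketch assuming an ideal pipeline with no stalls; the function names are illustrative, and the 4-cycle cost used for the R-type instruction in the multicycle case is an assumption (the diagram shows 5 cycles for lw and 4 for sw only).

```python
def multicycle_cycles(cycles_per_instr):
    """Multicycle: instructions run back to back with no overlap."""
    return sum(cycles_per_instr)

def pipelined_cycles(n_instr, n_stages=5):
    """Ideal pipeline: n_stages cycles to fill, then one completion per cycle."""
    if n_instr <= 0:
        return 0
    return n_stages + (n_instr - 1)

# lw, sw, R-type as in the diagrams (R-type cost of 4 is assumed).
print(multicycle_cycles([5, 4, 4]))  # 13
print(pipelined_cycles(3))           # 7
```

Each instruction still spends 5 cycles in the pipeline, so the gain comes entirely from overlap.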

18 Graphically Representing MIPS Pipeline
[Figure: one instruction drawn as the sequence of units it uses: IM -> Reg -> ALU -> DM -> Reg]
Can help with answering questions like:
  - how many cycles does it take to execute this code?
  - what is the ALU doing during cycle 4?
  - is there a hazard, why does it occur, and how can it be fixed?

19 Why Pipeline? For Throughput!
[Figure: Inst 0 to Inst 4 in instruction order (vertical axis) against time in clock cycles (horizontal axis); each instruction occupies IM, Reg, ALU, DM, Reg in successive cycles, starting one cycle after the previous instruction. The first cycles are labeled "time to fill the pipeline".]
Once the pipeline is full, one instruction is completed every cycle.

20 The Five Stages of Load Instruction
        Cycle 1  Cycle 2  Cycle 3  Cycle 4  Cycle 5
  lw    IFetch   Dec      Exec     Mem      WB
  - IFetch: Instruction Fetch and Update PC
  - Dec: Registers Fetch and Instruction Decode
  - Exec: Execute R-type; calculate memory address
  - Mem: Read/write the data from/to the Data Memory
  - WB: Write the result data into the register file

21 Pipelining Load
  - A load instruction takes 5 stages
      - Five independent functional units, one working on each stage
      - Each functional unit is used only once
  - Another load can start as soon as the first finishes its IF stage
  - Each load still takes 5 cycles to complete
  - The throughput, however, is much higher

22 Functional Units Are Busy
Pipelining now keeps all the functional units busy:
  - Fetch a new instruction every cycle
  - Fetch registers every cycle
  - Use the ALU almost every cycle
  - Use the Data Memory in many cycles
Instructions still take 10 ns to complete
  - But a new instruction starts every 2 ns
  - It looks like the CPI is 1
[Figure: pipeline timing diagram]

23 Pipeline Datapath 23

24 Load Datapath: Stage 1 24

25 Load Datapath: Stage 2 25

26 Load Datapath: Stage 3 26

27 Load Datapath: Stage 4 27

28 Load Datapath: Stage 5 28

29 Pipeline Control
Need to control the functional units
  - But they are working on different instructions!
Not a problem
  - Just pipeline the control signals along with the data
  - Make sure they line up
      - Using labeling conventions often helps
      - Instruction_rf means this instruction is in RF
      - Every time a signal gets flopped, it changes pipestage
  - Make sure the right signals go to the right places

30 Control Signals
Use a main control unit to generate the signals during the RF/ID stage
  - Control signals for EX (ExtOp, ALUSrc, ...) are used 1 cycle later
  - Control signals for Mem (MemWr, Branch) are used 2 cycles later
  - Control signals for WB (MemtoReg, RegWr) are used 3 cycles later
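
The idea of flopping control signals alongside the data can be sketched as three pipeline registers that shift once per clock. The signal names follow the slides; the table values are illustrative, and treating the WB group as MemtoReg plus the register write enable RegWr is an assumption of this sketch.

```python
# Control words per opcode, grouped by the stage that consumes them.
# Values are illustrative assumptions; the WB group (MemtoReg, RegWr) is assumed.
CONTROL = {
    "lw":  {"EX": {"ExtOp": 1, "ALUSrc": 1},
            "MEM": {"MemWr": 0, "Branch": 0},
            "WB": {"MemtoReg": 1, "RegWr": 1}},
    "sw":  {"EX": {"ExtOp": 1, "ALUSrc": 1},
            "MEM": {"MemWr": 1, "Branch": 0},
            "WB": {"MemtoReg": 0, "RegWr": 0}},
    "beq": {"EX": {"ExtOp": 1, "ALUSrc": 0},
            "MEM": {"MemWr": 0, "Branch": 1},
            "WB": {"MemtoReg": 0, "RegWr": 0}},
}

def trace_control(program):
    """For each cycle, report which instruction's control word drives each stage."""
    id_ex = ex_mem = mem_wb = None  # control held in the pipeline registers
    trace = []
    for op in list(program) + [None, None, None]:  # extra cycles drain the pipe
        trace.append({"ID": op, "EX": id_ex, "MEM": ex_mem, "WB": mem_wb})
        # Clock edge: every control word moves one pipeline register down.
        mem_wb, ex_mem, id_ex = ex_mem, id_ex, op
    return trace

def active_signals(row, stage):
    """Signals applied in stage "EX", "MEM", or "WB" during the cycle in `row`."""
    op = row[stage]
    return CONTROL[op][stage] if op else {}

trace = trace_control(["lw", "sw", "beq"])
print(trace[3])  # lw has reached WB, three cycles after it was decoded
```

Keeping the whole control word together until each group is consumed is what makes the signals "line up" with the instruction they belong to.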

31 Implementing Control 31

32 Putting it All Together 32

33 Pipeline Performance
Assume the time for each stage is
  - 100 ps for register read or write
  - 200 ps for other stages
Compare the pipelined datapath with the single-cycle datapath
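
A rough sketch of the comparison, assuming the register read happens in ID and the register write in WB (so those two stages take 100 ps) and that there are no stalls. The stage names and helper function are illustrative.

```python
# Stage delays in picoseconds; ID and WB are the register read and write.
STAGE_PS = {"IF": 200, "ID": 100, "EX": 200, "MEM": 200, "WB": 100}

single_cycle_ps = sum(STAGE_PS.values())  # one long cycle covers every stage
pipelined_ps = max(STAGE_PS.values())     # clock is set by the slowest stage

def total_time_ps(n_instr, pipelined):
    """Time to run n_instr instructions with no hazards or stalls."""
    if n_instr <= 0:
        return 0
    if pipelined:
        return (len(STAGE_PS) + n_instr - 1) * pipelined_ps
    return n_instr * single_cycle_ps

print(single_cycle_ps, pipelined_ps)                     # 800 200
print(total_time_ps(3, False), total_time_ps(3, True))   # 2400 1400
```

With these numbers the steady-state gain is 800/200 = 4 rather than 5, because the stages are not balanced.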

34 Pipeline Performance Program 34

35 Pipeline Speedup
If all stages are balanced, i.e., all take the same time:
$\text{Time between instructions}_{\text{pipelined}} = \dfrac{\text{Time between instructions}_{\text{nonpipelined}}}{\text{Number of stages}}$
  - If the stages are not balanced, the speedup is less
  - The speedup is due to increased throughput
  - Latency (time for each instruction) does not decrease
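
The balanced-stage relation can be made concrete for a finite program. As a sketch, assume $k$ stages of equal delay $\tau$ and $n$ instructions with no stalls (the symbols $k$, $\tau$, and $n$ are introduced here for illustration):

```latex
\begin{align*}
T_{\text{nonpipelined}} &= n \, k \, \tau \\
T_{\text{pipelined}}    &= (k + n - 1) \, \tau \\
\text{Speedup}          &= \frac{n k}{k + n - 1} \;\longrightarrow\; k \quad (n \to \infty)
\end{align*}
```

For $k = 5$ and $n = 3$ the speedup is $15/7 \approx 2.1$; it approaches 5 only for long instruction streams.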

36 Pipelined Datapath
[Figure: datapath of PC (with 0x4 adder), instruction memory, GPRs, immediate extender, ALU, and data memory, divided into fetch, decode & reg-fetch, execute, memory, and write-back phases]
Clock period can be reduced by dividing the execution of an instruction into multiple cycles:
$t_C > \max\{t_{IM}, t_{RF}, t_{ALU}, t_{DM}, t_{RW}\}$ (probably $= t_{DM}$)
However, CPI will increase unless instructions are pipelined.

37 MIPS Pipeline Datapath Modifications
What do we need to add/modify in our MIPS datapath?
  - registers between pipeline stages to isolate them
[Figure: MIPS datapath split into IF (IFetch), ID (Dec), EX (Execute), MEM (MemAccess), and WB (WriteBack) by the pipeline registers IFetch/Dec, Dec/Exec, Exec/Mem, and Mem/WB, all driven by the system clock. Units shown: PC, +4 adder, instruction memory, register file, sign extend (16 -> 32), shift left, branch adder, ALU, data memory, and the selection muxes.]

38 Technology Assumptions
  - A small amount of very fast memory (caches) backed up by a large, slower memory
  - Fast ALU (at least for integers)
  - Multiported register files (slower!)
Thus, the following timing assumption is reasonable:
$t_{IM} \approx t_{RF} \approx t_{ALU} \approx t_{DM} \approx t_{RW}$
A 5-stage pipeline will be the focus of our detailed design; some commercial designs have over 30 pipeline stages to do an integer add!

39 5-Stage Pipelined Execution
[Figure: datapath divided into I-Fetch (IF), Decode, Reg. Fetch (ID), Execute (EX), Memory (MA), and Write-Back (WB)]
  time          t0   t1   t2   t3   t4   t5   t6   t7   ...
  instruction1  IF1  ID1  EX1  MA1  WB1
  instruction2       IF2  ID2  EX2  MA2  WB2
  instruction3            IF3  ID3  EX3  MA3  WB3
  instruction4                 IF4  ID4  EX4  MA4  WB4
  instruction5                      IF5  ID5  EX5  MA5  WB5

40 5-Stage Pipelined Execution: Resource Usage Diagram
[Figure: datapath divided into I-Fetch (IF), Decode, Reg. Fetch (ID), Execute (EX), Memory (MA), and Write-Back (WB)]
  time       t0   t1   t2   t3   t4   t5   t6   t7   ...
  Resources
  IF         I1   I2   I3   I4   I5
  ID              I1   I2   I3   I4   I5
  EX                   I1   I2   I3   I4   I5
  MA                        I1   I2   I3   I4   I5
  WB                             I1   I2   I3   I4   I5
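
Both views, per instruction and per resource, describe the same schedule. The sketch below generates them for an ideal pipeline with no stalls; the function names are illustrative, and the stage names follow the diagrams.

```python
STAGES = ["IF", "ID", "EX", "MA", "WB"]

def pipeline_diagram(n_instr):
    """Row i is instruction i+1; column t holds the stage it occupies at time t."""
    n_cycles = n_instr + len(STAGES) - 1 if n_instr > 0 else 0
    rows = []
    for i in range(n_instr):
        row = [""] * n_cycles
        for s, name in enumerate(STAGES):
            row[i + s] = name  # instruction i enters stage s at time i + s
        rows.append(row)
    return rows

def resource_usage(rows):
    """Transpose the view: for each stage, which instruction occupies it when."""
    n_cycles = len(rows[0]) if rows else 0
    usage = {name: [""] * n_cycles for name in STAGES}
    for i, row in enumerate(rows):
        for t, name in enumerate(row):
            if name:
                usage[name][t] = f"I{i + 1}"
    return usage

rows = pipeline_diagram(5)
for name, cells in resource_usage(rows).items():
    print(name, " ".join(f"{c:>3}" for c in cells))
```

From time t4 onward every stage holds a different instruction, which is the "pipeline full" condition.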

41 Pipelined Execution: ALU Instructions
[Figure: pipelined datapath with pipeline registers A, B, Y, R, MD1, and MD2 between the stages]
Not quite correct! We need an Instruction Register (IR) for each stage.

42 Pipelined MIPS Datapath without jumps
[Figure: pipelined datapath with stages F, D, E, M, W and the control points RegWrite, RegDst, ExtSel, BSrc, OpSel, MemWrite, and WBSrc]
Control points need to be connected.

43 Pipelining: What makes it hard
  - structural hazards: what if we had only one memory?
  - control hazards: what about branches?
  - data hazards: what if an instruction's input operands depend on the output of a previous instruction?

44 Instructions interact with each other in the pipeline
  - An instruction in the pipeline may need a resource being used by another instruction in the pipeline -> structural hazard
  - An instruction may depend on something produced by an earlier instruction
      - The dependence may be for a data value -> data hazard
      - The dependence may be for the next instruction's address -> control hazard (branches, exceptions)

45 Structural hazards
Resource conflict
  - Occurs when two instructions try to use the same hardware
  - Often arises when some functional units are not fully pipelined
Simple example: a MIPS pipeline with a single unified memory
  - Load/store requires a data access
  - Instruction fetch would have to stall for that cycle
The term is also used for units that are not fully pipelined (mult, div)

46 Resolving Structural Hazards
  - A structural hazard occurs when two instructions need the same hardware resource at the same time
      - Can be resolved in hardware by stalling the newer instruction until the older instruction has finished with the resource
  - A structural hazard can always be avoided by adding more hardware to the design
      - E.g., if two instructions both need a port to memory at the same time, the hazard could be avoided by adding a second port to memory
  - Our 5-stage pipe has no structural hazards by design
      - Thanks to the MIPS ISA, which was designed for pipelining

47 Data dependencies
Data dependencies for instruction j following instruction i:
  - Read after write (RAW) (true dependence)
      - instruction j tries to read an operand before instruction i writes it
  - Write after write (WAW) (output dependence)
      - instruction j tries to write an operand before i writes its value
  - Write after read (WAR) (anti dependence)
      - instruction j tries to write a destination before it is read by i
  - There is no such thing as a read after read (RAR) hazard, since there is never a problem reading twice

48 Dependency examples
True dependency (RAW hazard)
    addu $t0, $t1, $t2
    subu $t3, $t4, $t0
Output dependency (WAW hazard)
    addu $t0, $t1, $t2
    subu $t0, $t4, $t5
Anti dependency (WAR hazard)
    addu $t0, $t1, $t2
    subu $t1, $t4, $t5
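
The three dependence types can be detected mechanically from the destination and source registers of an instruction pair. This is a minimal sketch; the `(dest, sources)` tuple encoding and the function name are illustrative choices, not part of the lecture.

```python
def classify(first, second):
    """Return the dependences of `second` (instruction j) on `first` (instruction i).

    Each instruction is a (dest, sources) pair; dest may be None
    for instructions that write no register.
    """
    dest_i, srcs_i = first
    dest_j, srcs_j = second
    kinds = []
    if dest_i is not None and dest_i in srcs_j:
        kinds.append("RAW")  # j reads what i writes
    if dest_i is not None and dest_i == dest_j:
        kinds.append("WAW")  # both write the same register
    if dest_j is not None and dest_j in srcs_i:
        kinds.append("WAR")  # j writes what i reads
    return kinds

addu = ("$t0", ("$t1", "$t2"))                # addu $t0, $t1, $t2
print(classify(addu, ("$t3", ("$t4", "$t0"))))  # ['RAW']
print(classify(addu, ("$t0", ("$t4", "$t5"))))  # ['WAW']
print(classify(addu, ("$t1", ("$t4", "$t5"))))  # ['WAR']
```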

49 Data Hazards
    ...
    r1 <- (r0) + 10
    r4 <- (r1) + 17
    ...
[Figure: pipelined datapath with the second instruction in decode while the first is still in execute]
The second instruction reads r1 before the first has written it back: r1 is stale. Oops!

50 Resolving Data Hazards (1)
Strategy 1: Wait for the result to be available by freezing earlier pipeline stages -> interlocks

51 Feedback to Resolve Hazards
[Figure: stage 1 -> stage 2 -> stage 3 -> stage 4, with feedback paths FB1 to FB4 from later stages to earlier ones]
  - Later stages provide dependence information to earlier stages, which can stall (or kill) instructions
  - Controlling a pipeline in this manner works provided the instruction at stage i+1 can complete without any interference from instructions in stages 1 to i (otherwise deadlocks may occur)

52 Interlocks to resolve Data Hazards
    ...
    r1 <- (r0) + 10
    r4 <- (r1) + 17
    ...
[Figure: pipelined datapath in which the stall condition holds the PC and the instruction in decode and inserts a nop into the execute stage]

53 Stalled Stages and Pipeline Bubbles
  time                    t0   t1   t2   t3   t4   t5   t6   t7   ...
  (I1) r1 <- (r0) + 10    IF1  ID1  EX1  MA1  WB1
  (I2) r4 <- (r1) + 17         IF2  ID2  ID2  ID2  ID2  EX2  MA2  WB2
  (I3)                              IF3  IF3  IF3  IF3  ID3  EX3  MA3  WB3
  (I4)                                                  IF4  ID4  EX4  MA4  WB4
  (I5)                                                       IF5  ID5  EX5  MA5  WB5
  (the repeated ID2 and IF3 entries are the stalled stages)

Resource Usage
  time  t0   t1   t2   t3   t4   t5   t6   t7   ...
  IF    I1   I2   I3   I3   I3   I3   I4   I5
  ID         I1   I2   I2   I2   I2   I3   I4   I5
  EX              I1   nop  nop  nop  I2   I3   I4   I5
  MA                   I1   nop  nop  nop  I2   I3   I4   I5
  WB                        I1   nop  nop  nop  I2   I3   I4   I5
  nop -> pipeline bubble
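
The stalled schedule can be reproduced with a small model. This sketch assumes an in-order 5-stage pipe without bypassing, in which a consumer may leave ID only in the cycle after its producer's WB, as in the diagram. The `(dest, sources)` encoding and the function name are illustrative.

```python
def id_exit_cycles(program):
    """Cycle in which each instruction leaves ID (its last ID cycle).

    program is a list of (dest, sources) register-number pairs; dest may be None.
    EX, MA, and WB follow at +1, +2, and +3 cycles.
    """
    last_writer_wb = {}  # register -> WB cycle of its most recent writer
    exits = []
    prev_exit = 0        # the first instruction is fetched at t0, decoded at t1
    for dest, sources in program:
        # Earliest cycle in which every source has already been written back.
        ready = max((last_writer_wb.get(s, -1) + 1 for s in sources), default=0)
        exit_cycle = max(prev_exit + 1, ready)  # also wait for ID to free up
        exits.append(exit_cycle)
        if dest is not None and dest != 0:      # r0 is never really written
            last_writer_wb[dest] = exit_cycle + 3
        prev_exit = exit_cycle
    return exits

program = [
    (1, (0,)),   # I1: r1 <- (r0) + 10
    (4, (1,)),   # I2: r4 <- (r1) + 17
    (None, ()),  # I3 to I5: independent instructions
    (None, ()),
    (None, ()),
]
print(id_exit_cycles(program))  # [1, 5, 6, 7, 8]
```

I2 leaves ID at t5 instead of t2, so the single dependence costs three bubble cycles for every later instruction as well.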

54 Interlock Control Logic
[Figure: pipelined datapath with a stall unit C_stall that compares rs and rt of the instruction in decode against ws of the later stages; stall holds the PC and inserts a nop]
Compare the source registers of the instruction in the decode stage with the destination registers of the uncommitted instructions.

55 Interlock Control Logic (ignoring jumps & branches)
[Figure: as on the previous slide, with a C_dest block deriving ws and we in each later stage and a C_re block deriving re1 and re2 in decode, all feeding C_stall]
Should we always stall if the rs field matches some rd?
  - not every instruction writes a register -> we
  - not every instruction reads a register -> re

56 Source & Destination Registers
  R-type:  op | rs | rt | rd | func
  I-type:  op | rs | rt | immediate16
  J-type:  op | immediate26

  Instruction                                                   source(s)  destination
  ALU    rd <- (rs) func (rt)                                   rs, rt     rd
  ALUi   rt <- (rs) op imm                                      rs         rt
  LW     rt <- M[(rs) + imm]                                    rs         rt
  SW     M[(rs) + imm] <- (rt)                                  rs, rt
  BZ     cond (rs) true: PC <- (PC) + imm; false: PC <- (PC) + 4   rs
  J      PC <- (PC) + imm
  JAL    r31 <- (PC), PC <- (PC) + imm                                     31
  JR     PC <- (rs)                                             rs
  JALR   r31 <- (PC), PC <- (rs)                                rs         31

57 Deriving the Stall Signal
C_dest:
  ws = Case opcode
         ALU              -> rd
         ALUi, LW         -> rt
         JAL, JALR        -> R31
  we = Case opcode
         ALU, ALUi, LW    -> (ws != 0)
         JAL, JALR        -> on
         ...              -> off
C_re:
  re1 = Case opcode
         ALU, ALUi, LW, SW, BZ, JR, JALR -> on
         J, JAL                          -> off
  re2 = Case opcode
         ALU, SW          -> on
         ...              -> off
C_stall:
$\text{stall} = \big((rs_D = ws_E) \cdot we_E + (rs_D = ws_M) \cdot we_M + (rs_D = ws_W) \cdot we_W\big) \cdot re1_D + \big((rt_D = ws_E) \cdot we_E + (rt_D = ws_M) \cdot we_M + (rt_D = ws_W) \cdot we_W\big) \cdot re2_D$
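
The case tables and the stall equation translate almost line for line into code. The sketch below is an illustration under assumptions: the `Instr` record and its field names are hypothetical, opcode classes are plain strings as in the tables, and an empty pipeline stage is represented by `None`.

```python
from dataclasses import dataclass

@dataclass
class Instr:
    opcode: str  # one of the classes in the tables: "ALU", "ALUi", "LW", ...
    rs: int = 0
    rt: int = 0
    rd: int = 0

def ws(i):
    """C_dest: destination register, or None if nothing is written."""
    if i.opcode == "ALU":
        return i.rd
    if i.opcode in ("ALUi", "LW"):
        return i.rt
    if i.opcode in ("JAL", "JALR"):
        return 31
    return None

def we(i):
    """C_dest: write enable; writes to r0 do not count."""
    if i.opcode in ("ALU", "ALUi", "LW"):
        return ws(i) != 0
    return i.opcode in ("JAL", "JALR")

def re1(i):
    """C_re: does the instruction read rs?"""
    return i.opcode in ("ALU", "ALUi", "LW", "SW", "BZ", "JR", "JALR")

def re2(i):
    """C_re: does the instruction read rt?"""
    return i.opcode in ("ALU", "SW")

def stall(d, e, m, w):
    """C_stall for the instruction d in decode; e, m, w may be None (bubble)."""
    def pending_write(reg):
        return any(x is not None and we(x) and ws(x) == reg for x in (e, m, w))
    return (re1(d) and pending_write(d.rs)) or (re2(d) and pending_write(d.rt))

i1 = Instr("ALUi", rs=0, rt=1)  # r1 <- (r0) + 10
i2 = Instr("ALUi", rs=1, rt=4)  # r4 <- (r1) + 17
print(stall(i2, i1, None, None))  # True: i1 is still in execute
```

The `re1`/`re2` and `we` terms are what keep the interlock from stalling on register fields that an instruction does not actually read or write.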

58 Hazards due to Loads & Stores
    ...
    M[(r1)+7] <- (r2)
    r4 <- M[(r3)+5]
    ...
[Figure: pipelined datapath with the stall condition logic]
What if (r1)+7 = (r3)+5?
Is there any possible data hazard in this instruction sequence?

59 Load & Store Hazards
    ...
    M[(r1)+7] <- (r2)
    r4 <- M[(r3)+5]
    ...
(r1)+7 = (r3)+5 -> data hazard
However, the hazard is avoided because our memory system completes writes in one cycle!
Load/store hazards are sometimes resolved in the pipeline and sometimes in the memory system itself. More on this later in the course.

60 Resolving Data Hazards (2)
Strategy 2: Route data as soon as possible after it is calculated to the earlier pipeline stage -> bypass

61 Bypassing
  time                    t0   t1   t2   t3   t4   t5   t6   t7   ...
  (I1) r1 <- (r0) + 10    IF1  ID1  EX1  MA1  WB1
  (I2) r4 <- (r1) + 17         IF2  ID2  ID2  ID2  ID2  EX2  MA2  WB2
  (I3)                              IF3  IF3  IF3  IF3  ID3  EX3  MA3
  (I4)                                                  IF4  ID4  EX4
  (I5)                                                       IF5  ID5
  (the repeated ID2 and IF3 entries are the stalled stages)
Each stall or kill introduces a bubble in the pipeline -> CPI > 1
A new datapath, i.e., a bypass, can get the data from the output of the ALU to its input:
  time                    t0   t1   t2   t3   t4   t5   t6   t7   ...
  (I1) r1 <- (r0) + 10    IF1  ID1  EX1  MA1  WB1
  (I2) r4 <- (r1) + 17         IF2  ID2  EX2  MA2  WB2
  (I3)                              IF3  ID3  EX3  MA3  WB3
  (I4)                                   IF4  ID4  EX4  MA4  WB4
  (I5)                                        IF5  ID5  EX5  MA5  WB5

62 Adding a Bypass
[Figure: pipelined datapath (stages D, E, M, W) with a bypass path from the ALU output back to the A input through a mux controlled by ASrc; r4 <- (r1) ... is in D while r1 <- ... is in E, and the stall logic is still present]
When does this bypass help?
  (I1)                     (I2)                bypass helps?
  r1 <- (r0) + 10          r4 <- (r1) + 17     yes
  r1 <- M[(r0) + 10]       r4 <- (r1) + 17     no
  JAL 500                  r4 <- (r31) + 17    no

63 The Bypass Signal: Deriving it from the Stall Signal
$\text{stall} = \big((rs_D = ws_E) \cdot we_E + (rs_D = ws_M) \cdot we_M + (rs_D = ws_W) \cdot we_W\big) \cdot re1_D + \big((rt_D = ws_E) \cdot we_E + (rt_D = ws_M) \cdot we_M + (rt_D = ws_W) \cdot we_W\big) \cdot re2_D$
  ws = Case opcode
         ALU              -> rd
         ALUi, LW         -> rt
         JAL, JALR        -> R31
  we = Case opcode
         ALU, ALUi, LW    -> (ws != 0)
         JAL, JALR        -> on
         ...              -> off
$ASrc = (rs_D = ws_E) \cdot we_E \cdot re1_D$
Is this correct? No, because only ALU and ALUi instructions can benefit from this bypass.
Split $we_E$ into two components: we-bypass, we-stall

64 Bypass and Stall Signals
Split $we_E$ into two components: we-bypass, we-stall
  we-bypass_E = Case opcode_E
         ALU, ALUi        -> (ws != 0)
         ...              -> off
  we-stall_E = Case opcode_E
         LW               -> (ws != 0)
         JAL, JALR        -> on
         ...              -> off
$ASrc = (rs_D = ws_E) \cdot \text{we-bypass}_E \cdot re1_D$
$\text{stall} = \big((rs_D = ws_E) \cdot \text{we-stall}_E + (rs_D = ws_M) \cdot we_M + (rs_D = ws_W) \cdot we_W\big) \cdot re1_D + \big((rt_D = ws_E) \cdot we_E + (rt_D = ws_M) \cdot we_M + (rt_D = ws_W) \cdot we_W\big) \cdot re2_D$

65 Fully Bypassed Datapath
[Figure: pipelined datapath (stages D, E, M, W) with bypass muxes ASrc and BSrc on both ALU inputs, a path carrying the PC for JAL, ..., and the stall logic]
Is there still a need for the stall signal?
$\text{stall} = (rs_D = ws_E) \cdot (opcode_E = LW_E) \cdot (ws_E \neq 0) \cdot re1_D + (rt_D = ws_E) \cdot (opcode_E = LW_E) \cdot (ws_E \neq 0) \cdot re2_D$
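
With full bypassing the only remaining stall is the load-use case in the equation above. A minimal sketch, with hypothetical parameter names mirroring the subscripted signals:

```python
def stall_fully_bypassed(rs_d, rt_d, re1_d, re2_d, opcode_e, ws_e):
    """Stall only when decode needs the result of a load that is still in execute."""
    load_in_ex = opcode_e == "LW" and ws_e != 0
    return load_in_ex and ((rs_d == ws_e and re1_d) or (rt_d == ws_e and re2_d))

# r1 <- M[(r0) + 10] in E, r4 <- (r1) + 17 in D: the loaded value is not ready yet.
print(stall_fully_bypassed(1, 4, True, False, "LW", 1))    # True
# r1 <- (r0) + 10 in E instead: the ALU result can be bypassed, so no stall.
print(stall_fully_bypassed(1, 4, True, False, "ALUi", 1))  # False
```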

66 Resolving Data Hazards (3)
Strategy 3: Speculate on the dependence. Two cases:
  - Guessed correctly -> do nothing
  - Guessed incorrectly -> kill and restart
We'll later see examples of this approach in more complex processors.

67 Acknowledgements These slides contain material from courses: UCB CS152 Stanford EE108B 67
