CS 152 Computer Architecture and Engineering Lecture 4 Pipelining


1 CS 152 Computer Architecture and Engineering. Lecture 4: Pipelining. John Lazzaro (not a prof - "John" is always OK). TA: Eric Love. www-inst.eecs.berkeley.edu/~cs152/


2 Motorola 68000. Next week we will return to the microcode story... Today is the anti-microcode story - pipelining!

3 RISC CPU. Caches. Data Path and Control.

4 Today: Pipelining. Pipelining: an idea from assembly line production applied to CPU design. Why pipelining is hard: data hazards, control hazards, structural hazards. Visualizing pipelines to evaluate hazard detection and resolution. Short break. A tool kit for hazard resolution.

5 Starting Point: Performance Equation. $\frac{\text{Seconds}}{\text{Program}} = \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Cycles}}{\text{Instruction}} \times \frac{\text{Seconds}}{\text{Cycle}}$. Goal is to optimize execution time, not individual equation terms. Machines are optimized with respect to program workloads. Cycles/Instruction: the CPI of the program; reflects the program's instruction mix. Seconds/Cycle: the clock period; optimize jointly with machine CPI.
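A minimal sketch of the performance equation as code, to make the "optimize the product, not the terms" point concrete. The function name and all numbers are hypothetical, chosen only for illustration; they are not measurements from the lecture.

```python
def execution_time(instructions, cpi, clock_period_s):
    """Seconds/Program = Instructions/Program * Cycles/Instruction * Seconds/Cycle."""
    return instructions * cpi * clock_period_s


# Hypothetical machines running the same 1e9-instruction program.
# The pipelined machine has a worse CPI but a much shorter clock period.
single_cycle = execution_time(1_000_000_000, cpi=1.0, clock_period_s=10e-9)
pipelined = execution_time(1_000_000_000, cpi=1.2, clock_period_s=2.5e-9)
speedup = single_cycle / pipelined

print(f"single-cycle: {single_cycle:.2f} s, pipelined: {pipelined:.2f} s, "
      f"speedup: {speedup:.2f}x")
```

In this made-up case the CPI term gets worse while total execution time still improves, which is the trade the equation is meant to expose.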

6 Pipelining

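7 Recall: Our single-cycle processor. Challenge: speed up the clock while keeping CPI == 1. $\frac{\text{Seconds}}{\text{Program}} = \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Cycles}}{\text{Instruction}} \times \frac{\text{Seconds}}{\text{Cycle}}$. CPI == 1: this is good. Slow clock: this is bad. [Figure: single-cycle datapath with PC register (D, Q), +0x4 adder, Instr Mem, RegFile (rs1, rs2, ws, wd, rd1, rd2), Ext, ALU (op), Data Memory (Din, Dout), and MemToReg mux.]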

8 Recall: An R-format CPU design. Decode fields to get: ADD $8 $9 $10. Instruction fields: opcode, rs, rt, rd, shamt, funct. [Figure: decode logic drives op; RegFile (rs1, rs2, ws, wd, rd1, rd2) feeds the ALU.]
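The "decode fields" step can be sketched in a few lines. This assumes the standard MIPS32 R-format layout (6-bit opcode, three 5-bit register fields, 5-bit shamt, 6-bit funct); the slide names the fields but not their widths, and the function name is hypothetical.

```python
def decode_r_format(word):
    """Split a 32-bit MIPS R-format instruction word into its fields."""
    return {
        "opcode": (word >> 26) & 0x3F,  # bits 31..26
        "rs": (word >> 21) & 0x1F,      # bits 25..21, first source register
        "rt": (word >> 16) & 0x1F,      # bits 20..16, second source register
        "rd": (word >> 11) & 0x1F,      # bits 15..11, destination register
        "shamt": (word >> 6) & 0x1F,    # bits 10..6, shift amount
        "funct": word & 0x3F,           # bits 5..0, ALU function
    }


# 0x012A4020 is the MIPS32 encoding of ADD $8, $9, $10.
fields = decode_r_format(0x012A4020)
print(fields)
```

In the datapath, rs and rt would drive the RegFile read ports (rs1, rs2) and rd would drive the write select (ws).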

9 Reminder: How data flows after posedge. [Figure: PC register (D, Q), +0x4 adder, Instr Mem, decode logic (op), RegFile, ALU.]

10 Next posedge: Update state and repeat. [Figure: PC register (D, Q) and RegFile (rs1, rs2, ws, wd, rd1, rd2).]

11 Observation: Logic idle most of cycle. For most of the cycle, the ALU is either waiting for its inputs or holding its output. Ideal: a CPU architecture where each part is always working. [Figure: single-cycle datapath, as on slide 7.]

12 Inspiration: Automobile assembly line. Assembly line moves on a steady clock. Each station does the same task on each car. [Figure labels: the clock, car body shell, merge station, bolting station, car chassis.]

13 Inspiration: Automobile assembly line. Simpler station tasks mean more cars per hour. Simple tasks take less time, so the clock is faster.

14 Inspiration: Automobile assembly line. Line speed is limited by the slowest task. Most efficient if all tasks take the same time to do.

15 Inspiration: Automobile assembly line. Simpler tasks, complex car: long line! These lines go 24 x 7, and rarely shut down.

16 Lessons from car assembly lines. Faster line movement yields more cars per hour off the line. Faster line movement requires more stages, each doing simpler tasks. To maximize efficiency, all stages should take the same amount of time (if not, workers in fast stages are idle). Filling, flushing, and stalling the assembly line are all bad news.
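The lessons above can be sketched numerically: the line's clock is set by the slowest stage, and faster stages sit idle for the difference. The stage delays below are hypothetical values in nanoseconds, and `line_stats` is an illustrative helper, not something from the lecture.

```python
def line_stats(stage_delays_ns, n_items):
    """Return (clock period, total time for n_items, idle time per stage per cycle)."""
    clock = max(stage_delays_ns)                  # slowest stage sets the clock
    n_stages = len(stage_delays_ns)
    # The first item needs n_stages cycles (filling the line);
    # every later item finishes one cycle after the previous one.
    total = (n_stages + n_items - 1) * clock
    idle = [clock - d for d in stage_delays_ns]   # wasted time in each stage
    return clock, total, idle


balanced = line_stats([2, 2, 2, 2, 2], n_items=100)
unbalanced = line_stats([1, 1, 6, 1, 1], n_items=100)
print("balanced:  ", balanced)
print("unbalanced:", unbalanced)
```

Both hypothetical lines do 10 ns of work per item, but the unbalanced one runs at a third of the speed because every stage waits on the 6 ns stage.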

17 Key analogy: The instruction is the car. Pipeline stages #1 to #5. Stage #1 is Instruction Fetch. The fetched instruction then controls the hardware in stage 2, stage 3, stage 4, and stage 5 in turn ("data-stationary control"). [Figure: PC register (D, Q), +0x4 adder, Instr Mem, and a pipeline register between each pair of stages.]

18 Example: Decode & Register Fetch stage. Stage #1: Instr Fetch. Stage #2: Decode & Reg Fetch. Sample program: ADD R4,R3,R2; OR R7,R6,R5; SUB R10,R9,R8. In the snapshot, SUB R10,R9,R8 is in stage #1, OR R7,R6,R5 is in stage #2, and ADD R4,R3,R2 is in stage #3. Register numbers are chosen so that instructions are independent - like cars on the line. [Figure: PC, +0x4 adder, Instr Mem, RegFile, Ext.]

19 Performance Equation and Pipelining. $\frac{\text{Seconds}}{\text{Program}} = \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Cycles}}{\text{Instruction}} \times \frac{\text{Seconds}}{\text{Cycle}}$. CPI == 1: once the pipe is full, one instruction completes per cycle. Clock period is shorter: less work to do in each cycle. To get the shortest clock period, balance the work to do in each pipeline stage.

20 Hazards: An instruction is not a car... New sample program: ADD R4,R3,R2; OR R5,R4,R2. Stage #2 (Decode & Reg Fetch) holds OR R5,R4,R2: wrong value of R4 fetched from RegFile, contract with programmer broken! Oops! Stage #3 holds ADD R4,R3,R2: R4 not written yet... An example of a hazard: we must (1) detect and (2) resolve all hazards to make a CPU that matches the ISA.

21 Performance Equation and Hazards. $\frac{\text{Seconds}}{\text{Program}} = \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Cycles}}{\text{Instruction}} \times \frac{\text{Seconds}}{\text{Cycle}}$. Some ways to cope with hazards make CPI > 1 (stalling the pipeline). Added logic to detect and resolve hazards increases the clock period. "Software slows the machine down" - Seymour Cray.

22 (simplified) 5-stage pipelined CPU. 1: IF Stage (Instr Fetch). 2: ID/RF Stage (Decode & Reg Fetch). 3: EX Stage (Execution). 4: MEM Stage (Memory). 5: WB (Write Back). [Figure: pipelined datapath with PC, +0x4 adder, Instr Mem, mux and logic in ID, RegFile, Ext, ALU (op), Data Memory (Din, Dout), MemToReg mux, and pipeline registers between stages.]

23 Sometimes, contract is a challenge. Sample program: LW R4,0(R0); OR R5,R4,R2. The ID/RF stage holds OR R5,R4,R2... but we haven't even started the load yet! The EX stage holds LW R4,0(R0). One approach: change the contract!

24 From Lecture 1: Delayed Loads... Instruction Fetch: fetch the load inst from memory. Instruction Decode: decode fields to get: LW $1, ($2) (I-Format: opcode, rs, rt, offset). Operand Fetch: retrieve register value: $2. Execute: compute memory address: offset + $2. Result Store: load memory address contents into: $1. Next Instruction: prepare to fetch instr that follows the LW in the program. Depending on load semantics, new $1 is visible to that instr, or not until the following instr ("delayed loads").

25 After we change the contract... Sample program: LW R4,0(R0); OR R5,R4,R2. The ID/RF stage holds OR R5,R4,R2: the delayed load contract does not guarantee new R4 is seen. The EX stage holds LW R4,0(R0). Only partially solves problem... soon, we finish the story.

26 Visualizing Pipelines

27 Pipeline Representation #1: Timeline. Good for visualizing pipeline fills. Stages: IF (Fetch), ID (Decode), EX (ALU), MEM, WB.
Sample program:
I1: ADD R4,R3,R2
I2: AND R6,R5,R4
I3: SUB R1,R9,R8
I4: XOR R3,R2,R1
I5: OR R7,R6,R5
Inst   t1   t2   t3   t4   t5   t6   t7   t8
I1:    IF   ID   EX   MEM  WB
I2:         IF   ID   EX   MEM  WB
I3:              IF   ID   EX   MEM  WB
I4:                   IF   ID   EX   MEM  WB
I5:                        IF   ID   EX   MEM
I6:                             IF   ID   EX
Pipeline is full from t5, when all five stages are busy.

28 Representation #2: Resource Usage. Good for visualizing pipeline stalls. Stages: IF (Fetch), ID (Decode), EX (ALU), MEM, WB.
Sample program:
I1: ADD R4,R3,R2
I2: AND R6,R5,R4
I3: SUB R1,R9,R8
I4: XOR R3,R2,R1
I5: OR R7,R6,R5
Stage  t1   t2   t3   t4   t5   t6   t7   t8
IF:    I1   I2   I3   I4   I5   I6   I7   I8
ID:         I1   I2   I3   I4   I5   I6   I7
EX:              I1   I2   I3   I4   I5   I6
MEM:                  I1   I2   I3   I4   I5
WB:                        I1   I2   I3   I4
Pipeline is full from t5, when all five stages are busy.
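Both representations are two views of one rule: with no stalls, instruction i (counting from 0) is in stage t - i at cycle t. A small sketch that generates each view from that rule; the function names are hypothetical and the model assumes an ideal pipeline with no hazards.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]


def timeline(n_instr, n_cycles):
    """Representation #1: one row per instruction, one column per cycle."""
    rows = []
    for i in range(n_instr):
        row = []
        for t in range(n_cycles):
            s = t - i  # stage index occupied by instruction i in cycle t
            row.append(STAGES[s] if 0 <= s < len(STAGES) else "")
        rows.append(row)
    return rows


def resource_usage(n_instr, n_cycles):
    """Representation #2: one row per stage, one column per cycle."""
    rows = []
    for s in range(len(STAGES)):
        row = []
        for t in range(n_cycles):
            i = t - s  # instruction index occupying stage s in cycle t
            row.append(f"I{i + 1}" if 0 <= i < n_instr else "")
        rows.append(row)
    return rows


def render(labels, rows):
    """Format rows as fixed-width text columns."""
    return "\n".join(
        f"{label + ':':<7}" + "".join(f"{cell:<5}" for cell in row).rstrip()
        for label, row in zip(labels, rows)
    )


by_instr = timeline(6, 8)
by_stage = resource_usage(8, 8)
print(render([f"I{i + 1}" for i in range(6)], by_instr))
print(render(STAGES, by_stage))
```

A stall would show up as a repeated stage name in a timeline row, or as a repeated instruction name in a resource-usage row; this sketch does not model that.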

29 Hazard Taxonomy

30 Structural Hazards. Several pipeline stages need to use the same hardware resource at the same time. Solution #1: Add extra copies of the resource (only works sometimes). Solution #2: Change the resource so that it can handle concurrent use. Solution #3: Stages take turns by stalling parts of the pipeline.

31 Structural Hazard Example: One Memory. A single memory is used by both the IF stage and the MEM stage. [Figure: 5-stage datapath with one shared memory (Din, Dout), RegFile, ALU, Ext, MemToReg mux, and a path to the branch logic.]

32 A solution: Extra copies of memory. [Figure: 5-stage datapath with a separate Instr Mem in the IF stage and Data Memory in the MEM stage.] I and D caches are a hybrid solution.

33 Alternatively: Concurrent use. ID and WB stages use the register file in the same clock cycle. [Figure: 5-stage datapath, as on slide 32.]

34 Data Hazards: 3 Types (RAW, WAR, WAW). Several pipeline stages read or write the same data location in an incompatible way. Read After Write (RAW) hazards: instruction I2 expects to read a data value written by an earlier instruction, but I2 executes too early and reads the wrong copy of the data. Note "data value", not "register". Data hazards are possible for any architected state (such as main memory). In practice, main memory hazard avoidance is the job of the memory system.

35 Recall: RAW example. Sample program: ADD R4,R3,R2; OR R5,R4,R2. Stage #2 (Decode & Reg Fetch) holds OR R5,R4,R2: wrong value of R4 fetched from RegFile, contract with programmer broken! Oops! Stage #3 holds ADD R4,R3,R2: R4 not written yet... This is what we mean when we say Read After Write (RAW) hazard.

36 Data Hazards: 3 Types (RAW, WAR, WAW). Write After Read (WAR) hazards: instruction I2 expects to write over a data value after an earlier instruction I1 reads it. But instead, I2 writes too early, and I1 sees the new value. Write After Write (WAW) hazards: instruction I2 writes over data an earlier instruction I1 also writes. But instead, I1 writes after I2, and the final data value is incorrect. WAR and WAW are not possible in our 5-stage pipeline, but they are possible in other pipeline designs.
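The three definitions can be written as a small classifier over register names. This is only a sketch: it reports which orderings between two instructions could turn into hazards, not whether a particular pipeline actually exposes them. The tuple layout and function name are assumptions made for illustration.

```python
def classify_dependences(first, second):
    """first and second are (dest, sources) tuples; first is earlier in program order."""
    dest1, srcs1 = first
    dest2, srcs2 = second
    kinds = set()
    if dest1 is not None and dest1 in srcs2:
        kinds.add("RAW")  # second reads what first writes
    if dest2 is not None and dest2 in srcs1:
        kinds.add("WAR")  # second overwrites what first still has to read
    if dest1 is not None and dest1 == dest2:
        kinds.add("WAW")  # both write the same location
    return kinds


add = ("R4", ("R3", "R2"))       # ADD R4,R3,R2
or_raw = ("R5", ("R4", "R2"))    # OR  R5,R4,R2
or_war = ("R3", ("R6", "R5"))    # OR  R3,R6,R5
sub_waw = ("R4", ("R9", "R8"))   # SUB R4,R9,R8

print(classify_dependences(add, or_raw))
print(classify_dependences(add, or_war))
print(classify_dependences(add, sub_waw))
```

In the 5-stage pipeline only the RAW case needs hardware attention, as the slide notes; the other two matter once instructions can complete out of order.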

37 Control Hazards: taken branch/jump. Stages: IF (Fetch), ID (Decode), EX (ALU), MEM, WB.
Sample program (ISA w/o branch delay slot):
I1: BEQ R4,R3,25
I2: AND R6,R5,R4
I3: SUB R1,R9,R8
Inst   t1   t2   t3   t4   t5   t6   t7   t8
I1:    IF   ID   EX   MEM  WB
I2:         IF   ID
I3:              IF
EX stage computes if branch is taken. If branch is taken, these instructions (I2 and I3) MUST NOT complete! Note: with branch delay slot, I2 MUST complete, I3 MUST NOT complete.

38 Hazards Recap. Structural Hazards. Data Hazards (RAW, WAR, WAW). Control Hazards (taken branches and jumps). On each clock cycle, we must detect the presence of all of these hazards, and resolve them before they break the contract with the programmer.

39 Break

40 Hazard Resolution Tools

41 The Hazard Resolution Toolkit. Stall earlier instructions in pipeline. Forward results computed in later pipeline stages to earlier stages. Add new hardware or rearrange hardware design to eliminate hazard. Change ISA to eliminate hazard. Kill earlier instructions in pipeline. Make hardware handle concurrent requests to eliminate hazard.

42 Resolving a RAW hazard by stalling. Sample program: ADD R4,R3,R2; OR R5,R4,R2. Stage #2 (Decode & Reg Fetch) holds OR R5,R4,R2: keep executing the OR instruction until R4 is ready. Until then, send NOPs to the stage 2/3 pipeline register. Stage #3 holds ADD R4,R3,R2: let ADD proceed to the WB stage, so that R4 is written to the regfile. New datapath hardware: (1) Mux into the stage 2/3 pipeline register to feed in NOP. (2) Write enable on PC and the stage 1/2 pipeline register. Freeze PC and the stage 1/2 pipeline register until the stall is over.
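A rough sketch of how many NOP bubbles the stall policy above inserts. It assumes a 5-stage pipeline with no forwarding, in-order issue, and a hypothetical `same_cycle_rw` flag saying whether a value written in WB can be read by ID in the same cycle (the concurrent-use register file from slide 33). Names and the tuple layout are illustrative.

```python
def id_exit_cycles(program, same_cycle_rw=True):
    """program: list of (name, dest, sources).

    Returns, for each instruction, the cycle in which it finally leaves ID.
    """
    ready = {}   # register -> first cycle its new value can be read in ID
    exits = []
    for _name, dest, sources in program:
        # In order: the first instruction leaves ID in cycle 2,
        # each later one no sooner than one cycle after its predecessor.
        earliest = exits[-1] + 1 if exits else 2
        for src in sources:
            earliest = max(earliest, ready.get(src, 0))
        exits.append(earliest)
        wb_cycle = earliest + 3  # ID -> EX -> MEM -> WB
        if dest is not None:
            ready[dest] = wb_cycle if same_cycle_rw else wb_cycle + 1
    return exits


def stall_cycles(program, same_cycle_rw=True):
    """Number of bubbles inserted compared with a hazard-free run."""
    exits = id_exit_cycles(program, same_cycle_rw)
    ideal_last = len(program) + 1  # no stalls: instruction i leaves ID in cycle i + 2
    return exits[-1] - ideal_last


sample = [("ADD", "R4", ("R3", "R2")), ("OR", "R5", ("R4", "R2"))]
print(stall_cycles(sample, same_cycle_rw=True))
print(stall_cycles(sample, same_cycle_rw=False))
```

Under these assumptions the OR waits in ID until the ADD reaches WB, which is why stalling raises CPI.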


44 Resolving a RAW hazard by forwarding. Sample program: ADD R4,R3,R2; OR R5,R4,R2. The EX stage holds ADD R4,R3,R2: the ALU computes R4 in the EX stage, so... just forward it back! The ID/RF stage holds OR R5,R4,R2. Unlike stalling, does not change CPI. May hurt cycle time.


46 Control Hazards: Fix with more hardware. Sample program (ISA w/o branch delay slot): I1: BEQ R4,R3,25; I2: AND R6,R5,R4; I3: SUB R1,R9,R8. Timeline: I1 is IF ID EX MEM WB in t1 to t5, I2 is IF ID in t2 to t3, I3 is IF in t3. EX stage computes if branch is taken. If branch is taken, these instructions (I2 and I3) MUST NOT complete! If we add hardware, can we move the branch decision to the ID stage?

47 Resolving control hazard with hardware. [Figure: in stage #2 (Decode & Reg Fetch), an equality comparator (==) on the RegFile outputs rd1 and rd2 feeds the branch control logic.]

48 Control Hazards: After more hardware. Sample program (ISA w/o branch delay slot): I1: BEQ R4,R3,25; I2: AND R6,R5,R4; I3: SUB R1,R9,R8. Timeline: I1 is IF ID EX MEM WB in t1 to t5, I2 is IF in t2. ID stage computes if branch is taken. If branch is taken, this instruction (I2) MUST NOT complete! If we change the ISA, can we always let I2 complete ("branch delay slot") and eliminate the control hazard?

49 From Lecture 1: BEQ $1,$2,25. Instruction Fetch: fetch branch inst from memory. Instruction Decode: decode fields to get: BEQ $1, $2, 25 (I-Format: opcode, rs, rt, offset). Operand Fetch: retrieve register values: $1, $2. Execute: compute if we take branch: $1 == $2? Result Store / Next Instruction: ALWAYS prepare to fetch instr that follows the BEQ in the program ("delayed branch"). IF we take branch, the instr we fetch AFTER that instruction is PC [...]. PC == Program Counter.


51 Resolve control hazard by killing instr. Sample program (no delay slot): J 200; OR R5,R4,R2. Stage #2 (Decode & Reg Fetch) holds J 200: detect the J instruction, mux a NOP into the stage 1/2 pipeline register. Compute new PC using hardware not shown... This hurts CPI. Can we do better?
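To put a number on "this hurts CPI", here is a back-of-the-envelope sketch. It assumes a base CPI of 1 and that every jump or taken branch kills a fixed number of younger instructions; the 20% frequency is a made-up figure, not data from the lecture.

```python
def cpi_with_kills(redirect_fraction, killed_per_redirect, base_cpi=1.0):
    """Average CPI when each jump or taken branch turns some fetches into NOPs."""
    return base_cpi + redirect_fraction * killed_per_redirect


# Hypothetical workload: 20% of instructions are jumps or taken branches.
resolved_in_ex = cpi_with_kills(0.20, killed_per_redirect=2)  # I2 and I3 killed (slide 37)
resolved_in_id = cpi_with_kills(0.20, killed_per_redirect=1)  # only I2 killed (slide 48)
print(resolved_in_ex, resolved_in_id)
```

Moving the decision from EX to ID halves the penalty in this model; a branch delay slot aims to remove the remaining bubble by changing the ISA instead.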


53 Structural hazard solution: concurrent use. ID and WB stages use the register file in the same clock cycle. Does not come for free. [Figure: 5-stage datapath, as on slide 33.]

54 Hazard Diagnosis

55 Data Hazards: Read After Write. Read After Write (RAW) hazards: instruction I2 expects to read a data value written by an earlier instruction, but I2 executes too early and reads the wrong copy of the data. Classic solution: use forwarding heavily, fall back on stalling when forwarding won't work or slows down the critical path too much.

56 Full bypass network... Stages: ID (Decode), EX, MEM, WB. [Figure: mux and logic in ID choose each operand from the RegFile outputs (rd1, rd2) or from values forwarded from later stages, including a path labeled "From WB".]

57 Common bug: Multiple forwards... ID holds ADD R4,R3,R2; EX holds OR R2,R3,R1; MEM holds AND R2,R2,R1. Which do we forward from? [Figure: full bypass network, as on slide 56.]

58 Common bug: Multiple forwards II... ID holds ADD R4,R0,R2; EX holds OR R0,R3,R1; MEM holds AND R0,R2,R1. Which do we forward from? [Figure: full bypass network, as on slide 56.]
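One way to think about the two questions above, sketched as operand-select logic. It assumes MIPS-style semantics: the youngest in-flight writer of a register is the one program order says the reader should see, and R0 always reads as zero so it is never forwarded. The function name and data layout are hypothetical, and the slides leave the answers as questions, so treat this as one plausible reading.

```python
def select_operand(src, regfile, producers):
    """Pick the value the instruction in ID should see for register number src.

    producers: list of (dest, value) for the instructions currently in
    EX, MEM, WB, in that order (youngest first).
    """
    if src == 0:
        return 0                 # R0 is hardwired to zero: ignore any writer
    for dest, value in producers:
        if dest == src:
            return value         # first match is the most recent writer
    return regfile[src]          # no in-flight writer: RegFile copy is current


regfile = [0] + [100 + r for r in range(1, 32)]

# Slide 57: OR (in EX) and AND (in MEM) both write R2; values are made up.
from_ex = select_operand(2, regfile, [(2, 7), (2, 9), (None, 0)])
# Slide 58: OR and AND both name R0 as their destination.
r0_value = select_operand(0, regfile, [(0, 7), (0, 9), (None, 0)])
print(from_ex, r0_value)
```

A priority encoder that checks EX before MEM before WB gives the same ordering in hardware; forgetting the R0 special case is the second bug.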

59 LW and Hazards. No load delay slot.

60 Questions about LW and forwarding. ID holds ADDIU R1,R1,24; EX holds OR R3,R3,R2; MEM holds LW R1,128(R29). Do we need to stall? [Figure: full bypass network, as on slide 56.]

61 Questions about LW and forwarding. ID holds ADDIU R1,R1,24; EX holds LW R1,128(R29); MEM holds OR R1,R3,R1. Do we need to stall? [Figure: full bypass network, as on slide 56.]
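A hedged sketch of one common stall policy for these two cases; the slides pose them as questions, so this is an assumption rather than the lecture's answer. It assumes a load's data does not exist while the load is in EX, and it uses a hypothetical flag, `forward_load_from_mem`, for whether the design forwards Data Memory Dout back to ID in the same cycle (possible, but it lengthens the critical path).

```python
def must_stall(id_sources, ex_instr, mem_instr, forward_load_from_mem=True):
    """Decide whether the instruction in ID has to wait this cycle.

    id_sources: register numbers read by the instruction in ID.
    ex_instr, mem_instr: (opcode, dest) for the instructions in EX and MEM.
    """
    ex_op, ex_dest = ex_instr
    mem_op, mem_dest = mem_instr
    for src in id_sources:
        if src == 0:
            continue                    # R0 never needs forwarding
        if ex_dest == src:
            if ex_op == "LW":
                return True             # load data is not available yet
            continue                    # younger ALU result in EX wins
        if mem_dest == src and mem_op == "LW" and not forward_load_from_mem:
            return True                 # design chose not to forward Dout
    return False


# Slide 60: ADDIU R1,R1,24 in ID; OR R3,R3,R2 in EX; LW R1,128(R29) in MEM.
case_60 = must_stall({1}, ("OR", 3), ("LW", 1))
# Slide 61: ADDIU R1,R1,24 in ID; LW R1,128(R29) in EX; OR R1,R3,R1 in MEM.
case_61 = must_stall({1}, ("LW", 1), ("OR", 1))
print(case_60, case_61)
```

In the slide 61 case the older OR also writes R1, but forwarding its result would hand ADDIU a stale value, so under this policy the pipeline stalls rather than forwarding from MEM.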

62 Resolving a RAW hazard by stalling. Sample program: ADD R4,R3,R2; OR R5,R4,R2. Stage #2 (Decode & Reg Fetch) holds OR R5,R4,R2: keep executing the OR instruction until R4 is ready. Until then, send NOPs to the stage 2/3 pipeline register. Stage #3 holds ADD R4,R3,R2: let ADD proceed to the WB stage, so that R4 is written to the regfile. New datapath hardware: (1) Mux into the stage 2/3 pipeline register to feed in NOP. (2) Write enable on PC and the stage 1/2 pipeline register. Freeze PC and the stage 1/2 pipeline register until the stall is over.

63 Branches and Hazards. Single delay slot.

64 Recall: Control hazard and hardware. [Figure: in stage #2 (Decode & Reg Fetch), an equality comparator (==) on the RegFile outputs rd1 and rd2 feeds the branch control logic.]

65 Recall: After more hardware, change ISA. Sample program (ISA w/o branch delay slot): I1: BEQ R4,R3,25; I2: AND R6,R5,R4; I3: SUB R1,R9,R8. Timeline: I1 is IF ID EX MEM WB in t1 to t5, I2 is IF in t2. ID stage computes if branch is taken. If branch is taken, this instruction (I2) MUST NOT complete! If we change the ISA, can we always let I2 complete ("branch delay slot") and eliminate the control hazard?

66 Question about branch and forwards. ID holds BEQ R1,R3,label; EX holds OR R3,R3,R1. Will this work as shown? [Figure: full bypass network with the == comparator in ID feeding the branch control logic.]


68 Lessons learned. Pipelining is hard. Study every instruction. Write test code in advance. Think about interactions between forwarding, branch and jump delay slots, R0 issues, LW issues... a long list!

69 Control Implementation

70 Recall: What is single cycle control? Combinational logic (only gates, no flip-flops). Just specify logic functions! Inputs: the instruction from Instr Mem and the Equal signal. Outputs: RegDest, RegWr, ExtOp, ALUsrc, ALUctr, MemWr, MemToReg, PCSrc. [Figure: control logic driving the RegFile, Ext, ALU, and Data Memory.]

71 In pipelines, all registers are used. Stages: ID (Decode), EX, MEM, WB. Combinational logic (only gates, no flip-flops) (add extra state outside!). Control signals shown: RegDest, PCSrc, RegWr, ExtOp, MemToReg. A conceptual design: for shortest critical path, registers may hold decoded info, not the complete instruction word.

72 On Tuesday: Quantitative instruction set architecture... Also, we will revisit the CPU design, and the topic of microcode. Have a good weekend!
