Implementing a MIPS Processor. Readings:

1 Implementing a MIPS Processor Readings: 4.1-4.11

2 Goals for this Class Understand how CPUs run programs How do we express the computation to the CPU? How does the CPU execute it? How does the CPU support other system components (e.g., the OS)? What techniques and technologies are involved and how do they work? Understand why CPU performance (and other metrics) varies How does CPU design impact performance? What trade-offs are involved in designing a CPU? How can we meaningfully measure and compare computer systems? Understand why program performance varies How do program characteristics affect performance? How can we improve a program's performance by considering the CPU running it? How do other system components impact program performance? 2

3 Goals Understand how the 5-stage MIPS pipeline works See examples of how architecture impacts ISA design Understand how the pipeline affects performance Understand hazards and how to avoid them Structural hazards Data hazards Control hazards 3

4 Processor Design in Two Acts Act I: A single-cycle CPU

5 Foreshadowing Act I: A Single-cycle Processor Simplest design Not how many real machines work (maybe some deeply embedded processors) Figure out the basic parts; what it takes to execute instructions Act II: A Pipelined Processor This is how many real machines work Exploit parallelism by executing multiple instructions at once.

6 Target ISA We will focus on part of MIPS Enough to run into the interesting issues Memory operations A few arithmetic/logical operations (Generalizing is straightforward) BEQ and J This corresponds pretty directly to what you'll be implementing in 141L. 6

7 Basic Steps for Execution Fetch an instruction from the instruction store Decode it What does this instruction do? Gather inputs From the register file From memory Perform the operation Write the outputs To register file or memory Determine the next instruction to execute 7

8 The Processor Design Algorithm Once you have an ISA Design/Draw the datapath Identify and instantiate the hardware for your architectural state For each instruction Simulate the instruction Add and connect the datapath elements it requires Is it workable? If not, fix it. Design the control For each instruction Simulate the instruction What control lines do you need? How will you compute their value? Modify control accordingly Is it workable? If not, fix it. You've already done much of this in 141L. 8

9 Arithmetic; R-Type Inst = Mem[PC] PC = PC + 4 REG[rd] = REG[rs] op REG[rt] bits 31:26 25:21 20:16 15:11 10:6 5:0 name op rs rt rd shamt funct # bits 6 5 5 5 5 6

17 ADDI; I-Type PC = PC + 4 REG[rt] = REG[rs] op SignExtImm bits 31:26 25:21 20:16 15:0 name op rs rt imm # bits 6 5 5 16

19 Load Word PC = PC + 4 REG[rt] = MEM[SignExtImm + REG[rs]] bits 31:26 25:21 20:16 15:0 name op rs rt immediate # bits 6 5 5 16

21 Store Word PC = PC + 4 MEM[SignExtImm + REG[rs]] = REG[rt] bits 31:26 25:21 20:16 15:0 name op rs rt immediate # bits 6 5 5 16

23 Branch-equal; I-Type PC = (REG[rs] == REG[rt]) ? PC + SignExtImmediate*4 : PC + 4; bits 31:26 25:21 20:16 15:0 name op rs rt displacement # bits 6 5 5 16

25 A Single-cycle Processor Performance refresher ET = IC * CPI * CT Single cycle CPI == 1; That sounds great Unfortunately, single-cycle CT is large Even RISC instructions take quite a bit of effort to execute This is a lot to do in one cycle 14

26 Our Hardware is Mostly Idle Cycle time = 15 ns Slowest module (alu) is ~6ns 15

28 Processor Design in Two Acts Act II: A pipelined CPU

29 Pipelining Review 17

30 Pipelining [timing diagram: cycles 1-3 at 10 ns, 20 ns, 30 ns; each cycle a latch feeding 10 ns of logic feeding a latch] What's the throughput? What's the latency for one unit of work?

31 Pipelining Break up the logic with latches into pipeline stages Each stage can act on different data Latches hold the inputs to their stage Every clock cycle data transfers from one pipe stage to the next [figure: one 10 ns logic block vs. five 2 ns logic blocks separated by latches]

32 Pipelining [timing diagram: cycles 1-6 at 2 ns, 4 ns, 6 ns, 8 ns, 10 ns, 12 ns; five pipelined logic stages with a new unit of work entering each cycle] What's the latency for one unit of work? What's the throughput?
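
A small Python sketch (my own illustration, not from the slides) of the latency/throughput arithmetic behind these two slides, assuming ideal latches with zero overhead:

def pipeline_metrics(logic_delay_ns, n_stages):
    cycle_time = logic_delay_ns / n_stages   # each stage does 1/N of the logic
    latency = cycle_time * n_stages          # one unit of work still sees the full delay
    throughput = 1.0 / cycle_time            # one result per cycle once the pipe is full
    return cycle_time, latency, throughput

print(pipeline_metrics(10.0, 1))   # unpipelined: 10 ns cycle, 0.1 results/ns
print(pipeline_metrics(10.0, 5))   # five 2 ns stages: same 10 ns latency, 5x the throughput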

33 Critical path review Critical path is the longest possible delay between two registers in a design. The critical path sets the cycle time, since the cycle time must be long enough for a signal to traverse the critical path. Lengthening or shortening non-critical paths does not change performance Ideally, all paths are about the same length 21

34 Pipelining and Logic [figure: a chain of logic blocks cut into thirds by pipeline registers] Hopefully, critical path reduced by 1/3 22

35 Limitations of Pipelining You cannot pipeline forever Some logic cannot be pipelined arbitrarily -- Memories Some logic is inconvenient to pipeline. How do you insert a register in the middle of a multiplier? Registers have a cost They cost area -- choose narrow points in the logic They cost energy -- latches don't do any useful work They cost time Extra logic delay Set-up and hold times. Pipelining may not affect the critical path as you expect 23

36 Pipelining Overhead Logic Delay (LD) -- How long does the logic take (i.e., the useful part) Set up time (ST) -- How long before the clock edge do the inputs to a register need to be ready? Relatively short on our FPGAs Register delay (RD) -- Delay through the internals of the register. Longer for our FPGAs. Much, much shorter for raw CMOS. 24

37 Pipelining Overhead [timing diagram: register inputs must hold new data for the setup time before the clock edge; the register output switches from old data to new data after the register delay] 25

38 Pipelining Overhead Logic Delay (LD) -- How long does the logic take (i.e., the useful part) Set up time (ST) -- How long before the clock edge do the inputs to a register need to be ready? Register delay (RD) -- Delay through the internals of the register. CT_base -- cycle time before pipelining: CT_base = LD + ST + RD CT_pipe -- cycle time after pipelining N times: CT_pipe = ST + RD + LD/N Total time = N*ST + N*RD + LD 26
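
The cycle-time model above written out as a Python sketch; the LD/ST/RD values below are made up for illustration, not measurements from the class FPGA:

def cycle_times(LD, ST, RD, N):
    ct_base = LD + ST + RD             # single-stage cycle time
    ct_pipe = ST + RD + LD / N         # per-stage overhead does not shrink with N
    total_time = N * ST + N * RD + LD  # time for one item to cross all N stages
    return ct_base, ct_pipe, total_time

LD, ST, RD = 10.0, 0.2, 0.5            # ns, illustrative values only
for N in (1, 2, 5, 10):
    print(N, cycle_times(LD, ST, RD, N))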

39 Pipelining Difficulties [figure: alternating fast and slow logic blocks between registers] You can't always pipeline how you would like 27

40 How to pipeline a processor Break each instruction into pieces -- remember the basic algorithm for execution Fetch Decode Collect arguments Execute Write results Compute next PC The classic 5-stage MIPS pipeline Fetch -- read the instruction Decode -- decode and read from the register file Execute -- Perform arithmetic ops and address calculations Memory -- access data memory. Write -- Store results in the register file. 28

41 Pipelining a processor 29

42 Pipelining a processor Fetch Decode/reg read Mem Reg Write (Reality) 29

43 Pipelining a processor Fetch Decode/reg read Mem Reg Write (Reality) Fetch Decode/reg read Mem Reg Write (Easier to draw) 29

44 Pipelining a processor Fetch Decode/reg read Mem Reg Write (Reality) Fetch Decode/reg read Mem Reg Write (Easier to draw) Fetch Decode/reg read Mem Reg Write (Pipelined) 29

45 Pipelined Datapath [datapath figure: PC, +4 adder, instruction memory, register file, sign extend (16 to 32), shift left 2, branch adder, ALU, data memory]

46 Pipelined Datapath [same datapath with the pipeline registers IFetch/Dec, Dec/Exec, Exec/Mem, and Mem/WB added]

47 Pipelined Datapath [animation spanning slides 47-51: a sequence of instructions (add, lw, sub, subi) advances through the pipelined datapath one stage per cycle]

53 Pipelined Datapath [pipelined datapath figure] This is a bug: rd must flow through the pipeline with the instruction. This signal needs to come from the WB stage.

55 Pipelined Control Control lives in the decode stage; signals flow down the pipe with the instructions they correspond to [pipelined datapath figure with a control unit in decode; the ex, mem, and write control signals are carried through the Dec/Exec, Exec/Mem, and Mem/WB pipeline registers] 38

56 Impact of Pipelining ET = IC * CPI * CT Break the processor into P pipe stages CT_new = CT/P CPI_new = CPI_old CPI is an average: cycles/instruction The latency of one instruction is P cycles The average CPI = 1 IC_new = IC_old Total speedup should be 5x! Except for the overhead of the pipeline registers And the realities of logic design... 39

57

58

59

60 Imem 2.77 ns

61 Imem 2.77 ns Ctrl ns

62 Imem 2.77 ns Ctrl ns ArgBMux ns

63 Imem 2.77 ns Ctrl ns ArgBMux ns ALU 5.94ns

64 Dmem ns

65

66 ALU 5.94ns

67

68 WriteRegMux ns

69

70 RegFile ns

71 [cycle-time breakdown chart: PC +, IMem, RegFile read, ALU, Mux, RegFile write, DMem read, Control, Mux, DMem write vs. cycle time]

73 What we want: 5 balanced pipeline stages Each stage would take 3.16 ns Clock speed: 316 MHz Speedup: 5x What's easy: 5 unbalanced pipeline stages Longest stage: 6.14 ns <- this is the critical path Shortest stage: 0.29 ns Clock speed: 163 MHz Speedup: 2.6x 44

75 Cycle time = 15.8 ns Clock Speed = 63 MHz [cycle-time breakdown chart] What we want: 5 balanced pipeline stages Each stage would take 3.16 ns Clock speed: 316 MHz Speedup: 5x What's easy: 5 unbalanced pipeline stages Longest stage: 6.14 ns <- this is the critical path Shortest stage: 0.29 ns Clock speed: 163 MHz Speedup: 2.6x 44
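
A quick Python check of the arithmetic on this slide (a sketch using the 15.8 ns single-cycle time and 6.14 ns critical stage quoted above):

single_cycle_ct = 15.8                  # ns
balanced_ct = single_cycle_ct / 5       # 3.16 ns per stage -> ~316 MHz
unbalanced_ct = 6.14                    # ns, the longest (critical) stage

print("balanced:   %.0f MHz, speedup %.1fx" % (1e3 / balanced_ct, single_cycle_ct / balanced_ct))
print("unbalanced: %.0f MHz, speedup %.1fx" % (1e3 / unbalanced_ct, single_cycle_ct / unbalanced_ct))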

78 A Structural Hazard Both the decode and write stages have to access the register file. There is only one register file. A structural hazard!! Solution: Write early, read late Writes occur at the clock edge and complete long before the end of the cycle This leaves enough time for the outputs to settle for the reads. Hazard avoided! Fetch Decode Mem Write 45

79 Quiz 4 1. For my CPU, CPI = 1 and CT = 4ns. a. What is the maximum possible speedup I can achieve by turning it into an 8-stage pipeline? b. What will the new CT, CPI, and IC be in the ideal case? c. Give 2 reasons why achieving the speedup in (a) will be very difficult. 2. Instructions in a VLIW instruction word are executed a. Sequentially b. In parallel c. Conditionally d. Optionally 3. List x86, MIPS, and ARM in order from most CISC-like to most RISC-like. 4. List 3 characteristics of MIPS that make it a RISC ISA. 5. Give one advantage of the multiple, complex addressing modes that ARM and x86 provide. 46

80 HW 1 Score Distribution [histogram: # of students vs. grade, F through A] 47

81 HW 2 Score Distribution [histogram: # of students vs. grade, F through A] 48

82 Projected Grade Distribution [histogram: # of students vs. grade, F through A] 49

83 What You Like More real world stuff. More class interactions. Cool factoids, side notes, etc. Examples in the slides. Jokes 50

84 What You Would Like to See Improved More time for the quizzes. Too much math. Going too fast. The subject of the class. The pace should be faster Late-breaking changes to the homeworks. Too much discussion of latency and metrics. It'd be nice to have the slides before lecture. Not enough time to ask questions MIPS is not the most relevant ISA. More precise instructions for HW and quizzes. The pace should be slower The software (SPIM, Quartus, and Modelsim) is hard to use 51

85 Pipelining is Tricky Simple pipelining is easy If the data flows in one direction only If the stages are independent In fact the tool can do this automatically via retiming (If you are curious, experiment with this in Quartus). Not so, for processors. Branch instructions affect the next PC -- backward flow Instructions need values computed by previous instructions -- not independent 52

86 Not just tricky, Hazardous! Hazards are situations where pipelining does not work as elegantly as we would like Three kinds Structural hazards -- we have run out of a hardware resource. Data hazards -- an input is not available on the cycle it is needed. Control hazards -- the next instruction is not known. Dealing with hazards increases complexity or decreases performance (or both) Dealing efficiently with hazards is much of what makes processor design hard. That, and the Quartus tools ;-) 53

87 Hazards: Key Points They prevent us from achieving CPI = 1 Hazards cause imperfect pipelining They are generally caused by counter-flow data dependences in the pipeline Three kinds Structural -- contention for hardware resources Data -- a data value is not available when/where it is needed. Control -- the next instruction to execute is not known. Two ways to deal with hazards Removal -- add hardware and/or complexity to work around the hazard so it does not occur Stall -- Hold up the execution of new instructions. Let the older instructions finish, so the hazard will clear. 54

88 Data Dependences A data dependence occurs whenever one instruction needs a value produced by another. Register values Also memory accesses (more on this later) add $s0, $t0, $t1 sub $t2, $s0, $t3 add $t3, $s0, $t4 add $t3, $t2, $t4 55
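
A minimal Python sketch (my own illustration) that finds the read-after-write dependences in the four-instruction sequence above; each instruction is written as (destination, source1, source2):

insts = [
    ("$s0", "$t0", "$t1"),   # add $s0, $t0, $t1
    ("$t2", "$s0", "$t3"),   # sub $t2, $s0, $t3
    ("$t3", "$s0", "$t4"),   # add $t3, $s0, $t4
    ("$t3", "$t2", "$t4"),   # add $t3, $t2, $t4
]
for i, (dest, _, _) in enumerate(insts):
    for j in range(i + 1, len(insts)):
        if dest in insts[j][1:]:
            print("instruction %d depends on instruction %d (through %s)" % (j, i, dest))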

94 Dependences in the pipeline In our simple pipeline, these instructions cause a data hazard Time Cyc 1 Cyc 2 Cyc 3 Cyc 4 Cyc 5 add $s0, $t0, $t1 Fetch Decode Mem Write sub $t2, $s0, $t3 Fetch Decode Mem Write 56

97 How can we fix it? Ideas? 57

98 Solution 1: Make the compiler deal with it. Expose hazards to the big-A architecture A result is available N instructions after the instruction that generates it. In the meantime, the register file has the old value. This is called a register delay slot What is N? Can it change? add $s0, $t0, $t1 Fetch Decode Mem Write delay slot Fetch Decode Mem Write delay slot Fetch Decode Mem Write sub $t2, $s0, $t3 Fetch Decode Mem Write 58

99 Solution 1: Make the compiler deal with it. Expose hazards to the big-A architecture A result is available N instructions after the instruction that generates it. In the meantime, the register file has the old value. This is called a register delay slot What is N? Can it change? N = 2, for our design add $s0, $t0, $t1 Fetch Decode Mem Write delay slot Fetch Decode Mem Write delay slot Fetch Decode Mem Write sub $t2, $s0, $t3 Fetch Decode Mem Write 58

100 Compiling for delay slots The compiler must fill the delay slots Ideally, with useful instructions, but nops will work too. add $s0, $t0, $t1 sub $t2, $s0, $t3 add $t3, $s0, $t4 and $t7, $t5, $t4 add $s0, $t0, $t1 and $t7, $t5, $t4 nop sub $t2, $s0, $t3 add $t3, $s0, $t4 59

101 Solution 2: Stall When you need a value that is not ready, stall Suspend the execution of the executing instruction and those that follow. This introduces a pipeline bubble. A bubble is a lack of work to do; it propagates through the pipeline like nop instructions Cyc 1 Cyc 2 Cyc 3 Cyc 4 Cyc 5 Cyc 6 Cyc 7 Cyc 8 Cyc 9 Cyc 10 add $s0, $t0, $t1 Fetch Decode Mem Write sub $t2, $s0, $t3 Fetch Decode Decode Decode Mem Write NOPs in the bubble Mem Write Mem Write add $t3, $s0, $t4 Fetch Fetch Fetch Decode Mem Write

102 Solution 2: Stall When you need a value that is not ready, stall Suspend the execution of the executing instruction and those that follow. This introduces a pipeline bubble. A bubble is a lack of work to do; it propagates through the pipeline like nop instructions Cyc 1 Cyc 2 Cyc 3 Cyc 4 Cyc 5 Cyc 6 Cyc 7 Cyc 8 Cyc 9 Cyc 10 add $s0, $t0, $t1 Fetch Decode Mem Write sub $t2, $s0, $t3 Fetch Decode Decode Decode Mem Write Both of these instructions are stalled NOPs in the bubble Mem Write Mem Write add $t3, $s0, $t4 Fetch Fetch Fetch Decode Mem Write

103 Solution 2: Stall When you need a value that is not ready, stall Suspend the execution of the executing instruction and those that follow. This introduces a pipeline bubble. A bubble is a lack of work to do; it propagates through the pipeline like nop instructions Cyc 1 Cyc 2 Cyc 3 Cyc 4 Cyc 5 Cyc 6 Cyc 7 Cyc 8 Cyc 9 Cyc 10 add $s0, $t0, $t1 Fetch Decode Mem Write sub $t2, $s0, $t3 Fetch Decode Decode Decode Mem Write Both of these instructions are stalled NOPs in the bubble Mem Write Mem Write One instruction or nop completes each cycle add $t3, $s0, $t4 Fetch Fetch Fetch Decode Mem Write

104 Stalling the pipeline Freeze all pipeline stages before the stage where the hazard occurred. Disable the PC update Disable the pipeline registers This is equivalent to inserting nops into the pipeline when a hazard exists Insert nop control bits at stalled stage (decode in our example) How is this solution still potentially better than relying on the compiler? 61

105 Stalling the pipeline Freeze all pipeline stages before the stage where the hazard occurred. Disable the PC update Disable the pipeline registers This is equivalent to inserting nops into the pipeline when a hazard exists Insert nop control bits at stalled stage (decode in our example) How is this solution still potentially better than relying on the compiler? The compiler can still act like there are delay slots to avoid stalls. Implementation details are not exposed in the ISA 61

106 Calculating CPI for Stalls In this case, the bubble lasts for 2 cycles. As a result, in cycles 6 and 7, no instruction completes. What happens to CPI? In the absence of stalls, CPI is one, since one instruction completes per cycle If an instruction stalls for N cycles, its CPI goes up by N Cyc 1 Cyc 2 Cyc 3 Cyc 4 Cyc 5 Cyc 6 Cyc 7 Cyc 8 Cyc 9 Cyc 10 add $s0, $t0, $t1 Fetch Decode Mem Write sub $t2, $s0, $t3 Fetch Decode Decode Decode Mem Write add $t3, $s0, $t4 Fetch Fetch Fetch Decode Mem Write and $t3, $t2, $t4 Fetch Decode Mem Write

107 Midterm and Midterm Review Midterm in 1 week! Midterm review in 2 days! I'll just take questions from you. Come prepared with questions. Today Go over last 2 quizzes More about hazards 63

108 Quiz 4 1. For my CPU, CPI = 1 and CT = 4ns. a. What is the maximum possible speedup I can achieve by turning it into an 8-stage pipeline? a. 8x b. What will the new CT, CPI, and IC be in the ideal case? a. CT = CT/8, CPI = 1, IC = IC c. Give 2 reasons why achieving the speedup in (a) will be very difficult. a. Setup time; difficult-to-pipeline logic; register delay. 2. Instructions in a VLIW instruction word are executed a. Sequentially b. In parallel c. Conditionally d. Optionally 3. List x86, MIPS, and ARM in order from most CISC-like to most RISC-like. 1. x86, ARM, MIPS 4. List 3 characteristics of MIPS that make it a RISC ISA. 1. Fixed inst size; few inst formats; orthogonality; general purpose registers; designed for fast implementations; few addressing modes. 5. Give one advantage of the multiple, complex addressing modes that ARM and x86 provide. 1. Reduced code size 2. User-friendliness for hand-written assembly 64

109 Hardware for Stalling Turn off the enables on the earlier pipeline stages The earlier stages will keep processing the same instruction over and over. No new instructions get fetched. Insert control and data values corresponding to a nop into the downstream pipeline register. This will create the bubble. The nops will flow downstream, doing nothing. When the stall is over, re-enable the pipeline registers The instructions in the upstream stages will start moving again. New instructions will start entering the pipeline again. [figure: inject_nop_decode_execute, stall_pc, and stall_fetch_decode signals driving the enables and nop muxes of the Fetch, Decode/reg read, Mem, and Write stages] 65
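
A Python sketch of the stall logic described above. The signal names follow the figure; the hazard check itself is a simplification I am assuming (a design without forwarding), not the actual 141L implementation:

def stall_controls(dec_srcs, ex_dest, mem_dest):
    # Stall while an in-flight instruction will write a register that the
    # instruction in decode wants to read (no forwarding assumed).
    hazard = any(r is not None and r in dec_srcs for r in (ex_dest, mem_dest))
    stall_pc = hazard                    # PC keeps its value
    stall_fetch_decode = hazard          # Fetch/Decode register keeps its value
    inject_nop_decode_execute = hazard   # a nop (bubble) enters Execute instead
    return stall_pc, stall_fetch_decode, inject_nop_decode_execute

# add $s0,... is in Execute while sub $t2, $s0, $t3 sits in Decode -> stall
print(stall_controls({"$s0", "$t3"}, "$s0", None))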

110 The Impact of Stalling On Performance ET = I * CPI * CT I and CT are constant What is the impact of stalling on CPI? What do we need to know to figure it out? 66

111 The Impact of Stalling On Performance ET = I * CPI * CT I and CT are constant What is the impact of stalling on CPI? Fraction of instructions that stall: 30% Baseline CPI = 1 Stall CPI = 1 + 2 = 3 New CPI = ? 67

112 The Impact of Stalling On Performance ET = I * CPI * CT I and CT are constant What is the impact of stalling on CPI? Fraction of instructions that stall: 30% Baseline CPI = 1 Stall CPI = 1 + 2 = 3 New CPI = 0.3*3 + 0.7*1 = 1.6
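
The same calculation as a short Python check (numbers from this slide):

stall_fraction = 0.30          # 30% of instructions stall
stall_cpi = 1 + 2              # each stalled instruction spends 2 extra cycles
new_cpi = stall_fraction * stall_cpi + (1 - stall_fraction) * 1
print(new_cpi)                 # 1.6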

113 Solution 3: Bypassing/Forwarding Data values are computed in Ex and Mem but published in Write The data exists! We should use it. [pipeline diagram: Fetch Decode Mem Write, annotated with where inputs are needed, where results are known, and where results are "published" to the registers] 68

114 Bypassing or Forwarding Take the values from wherever they are Cycles add $s0, $t0, $t1 Fetch Decode Mem Write sub $t2, $s0, $t3 Fetch Decode Mem Write 69

115 Forwarding Paths Cycles add $s0, $t0, $t1 Fetch Decode Mem Write sub $t2, $s0, $t3 Fetch Decode Mem Write sub $t2, $s0, $t3 Fetch Decode Mem Write sub $t2, $s0, $t3 Fetch Decode Mem Write 70

116 Forwarding in Hardware [pipelined datapath figure: PC, instruction memory, register file, sign extend, ALU, data memory, with the pipeline registers IFetch/Dec, Dec/Exec, Exec/Mem, and Mem/WB]

117 Hardware Cost of Forwarding In our pipeline, adding forwarding required relatively little hardware. For deeper pipelines it gets much more expensive Roughly: ALUs * pipe_stages you need to forward over Some modern processors have multiple ALUs (4-5) And deeper pipelines (4-5 stages to forward across) Not all forwarding paths need to be supported. If a path does not exist, the processor will need to stall. 72
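
A Python sketch (my own simplification, not the 141L design) of the operand selection that forwarding adds in front of each ALU input; newer in-flight results take priority over the stale register-file copy:

def forward_operand(src_reg, regfile_value,
                    ex_mem_dest, ex_mem_value,
                    mem_wb_dest, mem_wb_value):
    if src_reg == ex_mem_dest:    # result computed last cycle, not yet written back
        return ex_mem_value
    if src_reg == mem_wb_dest:    # older result, being written back this cycle
        return mem_wb_value
    return regfile_value          # no hazard: the register file copy is current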

118 Forwarding for Loads Values can also come from the Mem stage In this case, forwarding is not possible to the next instruction (and is not necessary for later instructions) Choices Always stall! Stall when needed! Expose this in the ISA. Cycles ld $s0, 0($t0) Fetch Decode Mem Write sub $t2, $s0, $t3 Fetch Decode Mem 73

119 Forwarding for Loads Values can also come from the Mem stage In this case, forwarding is not possible to the next instruction (and is not necessary for later instructions) Choices Always stall! Stall when needed! Expose this in the ISA. Cycles ld $s0, 0($t0) Fetch Decode Mem Write sub $t2, $s0, $t3 Fetch Decode Mem Time travel presents significant implementation challenges 73

120 Forwarding for Loads Values can also come from the Mem stage In this case, forwarding is not possible to the next instruction (and is not necessary for later instructions) Choices Always stall! Stall when needed! Expose this in the ISA. Which does MIPS do? Cycles ld $s0, 0($t0) Fetch Decode Mem Write sub $t2, $s0, $t3 Fetch Decode Mem Time travel presents significant implementation challenges 73

121 Pros and Cons Punt to the compiler This is what MIPS does and is the source of the load delay slot Future versions must emulate a single load-delay slot. The compiler fills the slot if possible, or drops in a nop. Always stall. The compiler is oblivious, but performance will suffer 10-15% of instructions are loads, and the CPI for loads will be 2 Forward when possible, stall otherwise Here the compiler can order instructions to avoid the stall. If the compiler can't fix it, the hardware will. 74
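
A quick Python check of the "always stall" cost mentioned above, assuming loads are 10-15% of instructions and every load pays one stall cycle:

for load_fraction in (0.10, 0.15):
    cpi = load_fraction * 2 + (1 - load_fraction) * 1
    print("load fraction %.2f -> CPI %.2f" % (load_fraction, cpi))   # 1.10 and 1.15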

122 Stalling for Load Cycles Fetch Decode Mem Write Fetch Decode Mem Write Load $s1, 0($s1) Fetch Decode Mem Write Addi $t1, $s1, 4 Fetch Decode Mem Write Fetch Decode Decode Mem Write Fetch Fetch Decode Mem Write back Fetch Decode Mem To stall we insert a noop in place of the instruction and freeze the earlier stages of the pipeline 75

123 Stalling for Load Cycles Fetch Decode Mem Write Fetch Decode Mem Write Load $s1, 0($s1) Fetch Decode Mem Write Addi $t1, $s1, 4 Fetch Decode Mem Write Fetch Decode Decode Mem Write Fetch Fetch Decode Mem Write back Fetch Decode Mem All stages of the pipeline earlier than the stall stand still. To stall we insert a noop in place of the instruction and freeze the earlier stages of the pipeline 75

124 Stalling for Load Cycles Fetch Decode Mem Write Fetch Decode Mem Write Load $s1, 0($s1) Fetch Decode Mem Write Addi $t1, $s1, 4 Fetch Decode Mem Write Fetch Decode Decode Mem Write Fetch Fetch Decode Mem Write back Fetch Decode Mem All stages of the pipeline earlier than the stall stand still. Only four stages are occupied. What's in Mem? To stall we insert a noop in place of the instruction and freeze the earlier stages of the pipeline 75

125 Inserting Noops Cycles Fetch Decode Mem Write Fetch Decode Mem Write Load $s1, 0($s1) Fetch Decode Mem Write The noop is in Mem Noop inserted Mem Write Addi $t1, $s1, 4 Fetch Decode Mem Write Fetch Decode Decode Mem Write Fetch Fetch Decode Mem To stall we insert a noop in place of the instruction and freeze the earlier stages of the pipeline 76

126 Quiz 1 Score Distribution [histogram: # of students vs. grade, F through A] 77

127 Quiz 2 Score Distribution [histogram: # of students vs. grade, F through A] 78

128 Quiz 3 Score Distribution [histogram: # of students vs. grade, F through A] 79

129 Quiz 4 Score Distribution [histogram: # of students vs. grade, F through A] 80

130 Key Points: Control Hazards Control hazards occur when we don't know what the next instruction is Caused by branches and jumps. Strategies for dealing with them Stall Guess! Leads to speculation Flushing the pipeline Strategies for making better guesses Understand the difference between stall and flush 81

131 Computing the PC Normally Non-branch instruction PC = PC + 4 When is PC ready? Fetch Decode Mem Write 82

133 Computing the PC Normally Non-branch instruction PC = PC + 4 When is PC ready? Fetch Decode Mem Write Cycles add $s0, $t0, $t1 Fetch Decode Mem Write sub $t2, $s0, $t3 Fetch Decode Mem Write sub $t2, $s0, $t3 Fetch Decode Mem Write sub $t2, $s0, $t3 Fetch Decode Mem Write 82

134 Fixing the Ubiquitous Control Hazard We need to know if an instruction is a branch in the fetch stage! How can we accomplish this? Solution 1: Partially decode the instruction in fetch. You just need to know if it's a branch, a jump, or something else. Solution 2: We'll discuss later. Fetch Decode Mem Write 83

136 Computing the PC Normally Pre-decode in the fetch unit. PC = PC + 4 The PC is ready for the next fetch cycle. Fetch Decode Mem Write 84

138 Computing the PC Normally Pre-decode in the fetch unit. PC = PC + 4 The PC is ready for the next fetch cycle. Fetch Decode Mem Write Cycles add $s0, $t0, $t1 Fetch Decode Mem Write sub $t2, $s0, $t3 Fetch Decode Mem Write sub $t2, $s0, $t3 Fetch Decode Mem Write sub $t2, $s0, $t3 Fetch Decode Mem Write 84

139 Computing the PC for Branches Branch instructions bne $s1, $s2, offset if ($s1 != $s2) { PC = PC + offset } else { PC = PC + 4; } When is the value ready? Fetch Decode Mem Write Cycles sll $s4, $t6, $t5 Fetch Decode Mem Write bne $t2, $s0, somewhere Fetch Decode Mem Write add $s0, $t0, $t1 Fetch Decode Mem Write and $s4, $t0, $t1 Fetch Decode Mem Write 85

141 Computing the PC for Jumps Jump instructions jr $s1 -- jump register PC = $s1 When is the value ready? Fetch Decode Mem Write Cycles sll $s4, $t6, $t5 Fetch Decode Mem Write jr $s4 Fetch Decode Mem Write add $s0, $t0, $t1 Fetch Decode Mem Write 86

143 Dealing with Branches: Option 0 -- stall Cycles sll $s4, $t6, $t5 Fetch Decode Mem Write bne $t2, $s0, somewhere Fetch Decode Mem Write add $s0, $t0, $t1 Fetch Fetch Stall Fetch Decode Mem Write NOP Fetch Decode Mem Write NOP Fetch Decode Mem Write and $s4, $t0, $t1 Fetch Decode Mem Write What does this do to our CPI? 87

144 Option 1: The compiler Use branch delay slots. The next N instructions after a branch are always executed How big is N? For jumps? For branches? Good Simple hardware Bad N cannot change. 88

145 Delay slots. Cycles Taken bne $t2, $s0, somewhere Fetch Decode Mem Write back add $t2, $s4, $t1 Fetch Decode Mem Branch Delay add $s0, $t0, $t1 Fetch Decode... somewhere: sub $t2, $s0, $t3 Fetch Decode 89

146 But MIPS Only Has One Delay Slot! The second branch delay slot is expensive! Filling one slot is hard. Filling two is even more so. Solution!: Resolve branches in decode. [simplified datapath figure: PC, instruction memory, register file, branch comparator (CMP) in decode, sign extend, ALU, data memory] 90

147 For the rest of this slide deck, we will assume that MIPS has no branch delay slot. If you have questions about whether part of the homework/test/quiz makes this assumption, ask. 91

148 Option 2: Simple Prediction Can a processor tell the future? For non-taken branches, the new PC is ready immediately. Let's just assume the branch is not taken Also called branch prediction or control speculation What if we are wrong? Branch prediction vocabulary Prediction -- a guess about whether a branch will be taken or not taken Misprediction -- a prediction that turns out to be incorrect. Misprediction rate -- fraction of predictions that are incorrect. 92

149 Predict Not-taken Not-taken bne $t2, $s0, foo Cycles Fetch Decode Mem Write add $s4, $t6, $t5 Fetch Decode Mem Write Taken bne $t2, $s4, else Fetch Decode Mem Write and $s4, $t0, $t1... else: sub $t2, $s0, $t3 Fetch Decode Mem Write Fetch Squash Decode Mem We start the add, and then, when we discover the branch outcome, we squash it. Also called flushing the pipeline Just like a stall, flushing one instruction increases the branch's CPI by 1 93

150 Flushing the Pipeline When we flush the pipe, we convert instructions into noops Turn off the write enables for write and mem stages Disable branches (i.e., make sure the ALU does not raise the branch signal). Instructions do not stop moving through the pipeline For the example on the previous slide, the inject_nop_decode_execute signal will go high for one cycle. [figure: stall_pc, stall_fetch_decode, inject_nop_decode_execute, and inject_nop_execute_mem control signals driving the Fetch, Decode, Reg file, Mem, and Write stages] 94

151 Flushing the Pipeline When we flush the pipe, we convert instructions into noops Turn off the write enables for write and mem stages Disable branches (i.e., make sure the ALU does not raise the branch signal). Instructions do not stop moving through the pipeline For the example on the previous slide, the inject_nop_decode_execute signal will go high for one cycle. These signals are for stalling [same figure] 94

152 Flushing the Pipeline When we flush the pipe, we convert instructions into noops Turn off the write enables for write and mem stages Disable branches (i.e., make sure the ALU does not raise the branch signal). Instructions do not stop moving through the pipeline For the example on the previous slide, the inject_nop_decode_execute signal will go high for one cycle. These signals are for stalling This signal is for both stalling and flushing [same figure] 94

153 Simple static Prediction static means before run time Many prediction schemes are possible Predict taken Pros? Predict not-taken Pros? Backward taken/forward not taken The best of both worlds! Most loops have a backward branch at the bottom; those will predict taken Other (non-loop) branches will be not-taken. 95

154 Simple static Prediction static means before run time Many prediction schemes are possible Predict taken Pros? Loops are common Predict not-taken Pros? Backward taken/forward not taken The best of both worlds! Most loops have a backward branch at the bottom; those will predict taken Other (non-loop) branches will be not-taken. 95

155 Simple static Prediction static means before run time Many prediction schemes are possible Predict taken Pros? Loops are common Predict not-taken Pros? Not all branches are for loops. Backward taken/forward not taken The best of both worlds! Most loops have a backward branch at the bottom; those will predict taken Other (non-loop) branches will be not-taken. 95

156 Basic Pipeline Recap The PC is required in Fetch For branches, it's not known till decode. [simplified datapath figure: PC, instruction memory, register file, branch comparator in decode, sign extend, ALU, data memory] Should this be here? Branches only, one delay slot, simplified ISA, no control 96

157 Supporting Speculation [pipelined datapath figure, highlighted in stages across slides 157-162: PC calculation and branch prediction in fetch, squash/flush control in decode] Predict: Compute all possible next PCs in fetch. Choose one. The correct next PC is known in decode Flush as needed: Replace wrong-path instructions with no-ops. 97

163 Implementing Backward taken/forward not taken (BTFNT) A new branch predictor module determines what guess we are going to make. The BTFNT branch predictor has two inputs The sign of the offset -- to make the prediction The branch signal from the comparator -- to check if the prediction was correct. And two outputs The PC mux selector Steers execution in the predicted direction Re-directs execution when the branch resolves. A mis-predict signal that causes control to flush the pipe. 98
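
A Python sketch of the BTFNT rule itself (my own illustration of the predictor's decision, not the Verilog module):

def btfnt_predict(branch_offset):
    # Backward branches (negative offset) are predicted taken, forward ones not taken.
    return branch_offset < 0

def mispredicted(branch_offset, actually_taken):
    return btfnt_predict(branch_offset) != actually_taken

print(btfnt_predict(-16), btfnt_predict(8))   # loop-style branch: taken; forward branch: not taken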

164 Performance Impact (ex 1) ET = I * CPI * CT BTFNT has a misprediction rate of 20%. Branches are 20% of instructions Changing the front end increases the cycle time by 10% What is the speedup of BTFNT compared to just stalling on every branch? 99

165 Performance Impact (ex 1) ET = I * CPI * CT Backward taken, forward not taken is 80% accurate Branches are 20% of instructions Changing the front end increases the cycle time by 10% What is the speedup of BTFNT compared to just stalling on every branch? BTFNT CPI = 0.2*0.2*(1 + 1) + (1 - .2*.2)*1 = 1.04 CT = 1.1 IC = IC ET = 1.144 Stall CPI = .2*2 + .8*1 = 1.2 CT = 1 IC = IC ET = 1.2 Speedup = 1.2/1.144 = 1.05
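
The example above as a short Python check (same assumptions: 20% branches, 20% misprediction rate, a 1-cycle penalty, 10% longer cycle time):

branch_frac, mispredict_rate, penalty = 0.2, 0.2, 1
btfnt_cpi = branch_frac * mispredict_rate * (1 + penalty) + (1 - branch_frac * mispredict_rate) * 1
btfnt_et = btfnt_cpi * 1.1                             # relative ET = CPI * CT (IC cancels out)
stall_cpi = branch_frac * 2 + (1 - branch_frac) * 1    # every branch stalls one cycle
stall_et = stall_cpi * 1.0
print(btfnt_cpi, btfnt_et, stall_et / btfnt_et)        # 1.04, 1.144, ~1.05x speedup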

166 The Branch Delay Penalty The number of cycles between fetch and branch resolution is called the branch delay penalty It is the number of instructions that get flushed on a misprediction. It is the number of extra cycles the branch gets charged (i.e., the CPI for mispredicted branches goes up by the penalty) 101

167 Performance Impact (ex 2) ET = I * CPI * CT Our current design resolves branches in decode, so the branch delay penalty is 1 cycle. If removing the comparator from decode (and resolving branches in execute) would reduce cycle time by 20%, would it help or hurt performance? Mispredict rate = 20% Branches are 20% of instructions Resolve in Decode CPI = 0.2*0.2*(1 + 1) + (1 - .2*.2)*1 = 1.04 CT = 1 IC = IC ET = 1.04 Resolve in Execute CPI = 0.2*0.2*(1 + 2) + (1 - .2*.2)*1 = 1.08 CT = 0.8 IC = IC ET = 0.864 Speedup = 1.04/0.864 = 1.2 102

168 Performance Impact (ex 2) ET = I * CPI * CT Our current design resolves branches in decode, so the branch delay penalty is 1 cycle. If removing the comparator from decode (and resolving branches in execute) would reduce cycle time by 20%, would it help or hurt performance? Mispredict rate = 20% Branches are 20% of instructions 103

169 The Importance of Pipeline Depth There are two important parameters of the pipeline that determine the impact of branches on performance Branch decode time -- how many cycles does it take to identify a branch (in our case, this is less than 1) Branch resolution time -- cycles until the real branch outcome is known (in our case, this is 2 cycles) 104

170 Pentium 4 pipeline Branches take 19 cycles to resolve Stalling is not an option. Identifying a branch takes 4 cycles. 80% branch prediction accuracy is also not an option. Not quite as bad now, but BP is still very important.

171 Performance Impact (ex 1) ET = I * CPI * CT Backward taken, forward not taken is 80% accurate Branches are 20% of instructions Changing the front end increases the cycle time by 10% What is the speedup of BTFNT compared to just stalling on every branch? BTFNT CPI = 0.2*0.2*(1 + 1) + (1 - .2*.2)*1 = 1.04 CT = 1.1 IC = IC ET = 1.144 Stall CPI = .2*2 + .8*1 = 1.2 CT = 1 What if this were 20 instead of 1? IC = IC ET = 1.2 Speedup = 1.2/1.144 = 1.05

172 Performance Impact (ex 1) ET = I * CPI * CT Backward taken, forward not taken is 80% accurate Branches are 20% of instructions Changing the front end increases the cycle time by 10% What is the speedup of BTFNT compared to just stalling on every branch? BTFNT CPI = 0.2*0.2*(1 + 1) + (1 - .2*.2)*1 = 1.04 CT = 1.1 IC = IC ET = 1.144 Stall CPI = .2*2 + .8*1 = 1.2 CT = 1 What if this were 20 instead of 1? Branches are relatively infrequent (~20% of instructions), but Amdahl's Law tells us that we can't completely ignore this uncommon case. IC = IC ET = 1.2 Speedup = 1.2/1.144 = 1.05

173 Dynamic Branch Prediction Long pipes demand higher accuracy than static schemes can deliver. Instead of making the guess once (i.e., statically), make it every time we see the branch. Many ways to predict dynamically We will focus on predicting future behavior based on past behavior 107

174 Predictable control Use previous branch behavior to predict future branch behavior. When is branch behavior predictable? 108

175 Predictable control Use previous branch behavior to predict future branch behavior. When is branch behavior predictable? Loops -- for(i = 0; i < 10; i++) {} 9 taken branches, 1 not-taken branch. All 10 are pretty predictable. Run-time constants Foo(int v) { for (i = 0; i < 1000; i++) { if (v) {...} } } The branch is always taken or not taken. Correlated control a = 10; b = <something usually larger than a> if (a > 10) {} if (b > 10) {} Function calls LibraryFunction() -- Converts to a jr (jump register) instruction, but it's always the same. BaseClass * t; // t is usually of type SubClass t->someVirtualFunction() // will usually call the same function 109

176 Dynamic Predictor 1: The Simplest Thing Predict that this branch will go the same way as the previous branch did. Pros? Cons? 110

177 Dynamic Predictor 1: The Simplest Thing Predict that this branch will go the same way as the previous branch did. Pros? Dead simple. Keep a bit in the fetch stage. Works ok for simple loops. The compiler might be able to arrange things to make it work better. Cons? 110

178 Dynamic Predictor 1: The Simplest Thing Predict that this branch will go the same way as the previous branch did. Pros? Dead simple. Keep a bit in the fetch stage. Works ok for simple loops. The compiler might be able to arrange things to make it work better. Cons? An unpredictable branch in a loop will mess everything up. It can't tell the difference between branches 110

179 Dynamic Prediction 2: A table of bits Give each branch its own bit in a table Look up the prediction bit for the branch How big does the table need to be? Pros: Cons: 111

180 Dynamic Prediction 2: A table of bits Give each branch its own bit in a table Look up the prediction bit for the branch How big does the table need to be? Pros: It can differentiate between branches. Bad behavior by one won't mess up others... mostly. Cons: 111
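
A Python sketch (my own illustration) of such a table of 1-bit predictors, indexed by the low bits of the branch PC:

TABLE_BITS = 10
table = [False] * (1 << TABLE_BITS)              # False = predict not taken

def index(pc):
    return (pc >> 2) & ((1 << TABLE_BITS) - 1)   # drop the byte offset, keep low bits

def predict(pc):
    return table[index(pc)]

def update(pc, taken):
    table[index(pc)] = taken                     # remember the branch's last outcome

# A 10-iteration loop branch: mispredicts only on the first and last iterations.
pc = 0x400100
for i in range(10):
    taken = i < 9
    print("hit" if predict(pc) == taken else "miss")
    update(pc, taken)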


More information

EITF20: Computer Architecture Part2.2.1: Pipeline-1

EITF20: Computer Architecture Part2.2.1: Pipeline-1 EITF20: Computer Architecture Part2.2.1: Pipeline-1 Liang Liu liang.liu@eit.lth.se 1 Outline Reiteration Pipelining Harzards Structural hazards Data hazards Control hazards Implementation issues Multi-cycle

More information

Department of Computer and IT Engineering University of Kurdistan. Computer Architecture Pipelining. By: Dr. Alireza Abdollahpouri

Department of Computer and IT Engineering University of Kurdistan. Computer Architecture Pipelining. By: Dr. Alireza Abdollahpouri Department of Computer and IT Engineering University of Kurdistan Computer Architecture Pipelining By: Dr. Alireza Abdollahpouri Pipelined MIPS processor Any instruction set can be implemented in many

More information

ECE331: Hardware Organization and Design

ECE331: Hardware Organization and Design ECE331: Hardware Organization and Design Lecture 35: Final Exam Review Adapted from Computer Organization and Design, Patterson & Hennessy, UCB Material from Earlier in the Semester Throughput and latency

More information

MIPS Pipelining. Computer Organization Architectures for Embedded Computing. Wednesday 8 October 14

MIPS Pipelining. Computer Organization Architectures for Embedded Computing. Wednesday 8 October 14 MIPS Pipelining Computer Organization Architectures for Embedded Computing Wednesday 8 October 14 Many slides adapted from: Computer Organization and Design, Patterson & Hennessy 4th Edition, 2011, MK

More information

Suggested Readings! Recap: Pipelining improves throughput! Processor comparison! Lecture 17" Short Pipelining Review! ! Readings!

Suggested Readings! Recap: Pipelining improves throughput! Processor comparison! Lecture 17 Short Pipelining Review! ! Readings! 1! 2! Suggested Readings!! Readings!! H&P: Chapter 4.5-4.7!! (Over the next 3-4 lectures)! Lecture 17" Short Pipelining Review! 3! Processor components! Multicore processors and programming! Recap: Pipelining

More information

4. The Processor Computer Architecture COMP SCI 2GA3 / SFWR ENG 2GA3. Emil Sekerinski, McMaster University, Fall Term 2015/16

4. The Processor Computer Architecture COMP SCI 2GA3 / SFWR ENG 2GA3. Emil Sekerinski, McMaster University, Fall Term 2015/16 4. The Processor Computer Architecture COMP SCI 2GA3 / SFWR ENG 2GA3 Emil Sekerinski, McMaster University, Fall Term 2015/16 Instruction Execution Consider simplified MIPS: lw/sw rt, offset(rs) add/sub/and/or/slt

More information

ECE331: Hardware Organization and Design

ECE331: Hardware Organization and Design ECE331: Hardware Organization and Design Lecture 27: Midterm2 review Adapted from Computer Organization and Design, Patterson & Hennessy, UCB Midterm 2 Review Midterm will cover Section 1.6: Processor

More information

EE557--FALL 1999 MAKE-UP MIDTERM 1. Closed books, closed notes

EE557--FALL 1999 MAKE-UP MIDTERM 1. Closed books, closed notes NAME: STUDENT NUMBER: EE557--FALL 1999 MAKE-UP MIDTERM 1 Closed books, closed notes Q1: /1 Q2: /1 Q3: /1 Q4: /1 Q5: /15 Q6: /1 TOTAL: /65 Grade: /25 1 QUESTION 1(Performance evaluation) 1 points We are

More information

Computer Architecture Computer Science & Engineering. Chapter 4. The Processor BK TP.HCM

Computer Architecture Computer Science & Engineering. Chapter 4. The Processor BK TP.HCM Computer Architecture Computer Science & Engineering Chapter 4 The Processor Introduction CPU performance factors Instruction count Determined by ISA and compiler CPI and Cycle time Determined by CPU hardware

More information

Determined by ISA and compiler. We will examine two MIPS implementations. A simplified version A more realistic pipelined version

Determined by ISA and compiler. We will examine two MIPS implementations. A simplified version A more realistic pipelined version MIPS Processor Introduction CPU performance factors Instruction count Determined by ISA and compiler CPI and Cycle time Determined by CPU hardware We will examine two MIPS implementations A simplified

More information

CS 110 Computer Architecture. Pipelining. Guest Lecture: Shu Yin. School of Information Science and Technology SIST

CS 110 Computer Architecture. Pipelining. Guest Lecture: Shu Yin.   School of Information Science and Technology SIST CS 110 Computer Architecture Pipelining Guest Lecture: Shu Yin http://shtech.org/courses/ca/ School of Information Science and Technology SIST ShanghaiTech University Slides based on UC Berkley's CS61C

More information

Basic Pipelining Concepts

Basic Pipelining Concepts Basic ipelining oncepts Appendix A (recommended reading, not everything will be covered today) Basic pipelining ipeline hazards Data hazards ontrol hazards Structural hazards Multicycle operations Execution

More information

1 Hazards COMP2611 Fall 2015 Pipelined Processor

1 Hazards COMP2611 Fall 2015 Pipelined Processor 1 Hazards Dependences in Programs 2 Data dependence Example: lw $1, 200($2) add $3, $4, $1 add can t do ID (i.e., read register $1) until lw updates $1 Control dependence Example: bne $1, $2, target add

More information

EIE/ENE 334 Microprocessors

EIE/ENE 334 Microprocessors EIE/ENE 334 Microprocessors Lecture 6: The Processor Week #06/07 : Dejwoot KHAWPARISUTH Adapted from Computer Organization and Design, 4 th Edition, Patterson & Hennessy, 2009, Elsevier (MK) http://webstaff.kmutt.ac.th/~dejwoot.kha/

More information

EITF20: Computer Architecture Part2.2.1: Pipeline-1

EITF20: Computer Architecture Part2.2.1: Pipeline-1 EITF20: Computer Architecture Part2.2.1: Pipeline-1 Liang Liu liang.liu@eit.lth.se 1 Outline Reiteration Pipelining Harzards Structural hazards Data hazards Control hazards Implementation issues Multi-cycle

More information

Complex Pipelines and Branch Prediction

Complex Pipelines and Branch Prediction Complex Pipelines and Branch Prediction Daniel Sanchez Computer Science & Artificial Intelligence Lab M.I.T. L22-1 Processor Performance Time Program Instructions Program Cycles Instruction CPI Time Cycle

More information

Chapter 4. The Processor

Chapter 4. The Processor Chapter 4 The Processor 1 Introduction CPU performance factors Instruction count Determined by ISA and compiler CPI and Cycle time Determined by CPU hardware We will examine two MIPS implementations A

More information

Outline Marquette University

Outline Marquette University COEN-4710 Computer Hardware Lecture 4 Processor Part 2: Pipelining (Ch.4) Cristinel Ababei Department of Electrical and Computer Engineering Credits: Slides adapted primarily from presentations from Mike

More information

14:332:331 Pipelined Datapath

14:332:331 Pipelined Datapath 14:332:331 Pipelined Datapath I n s t r. O r d e r Inst 0 Inst 1 Inst 2 Inst 3 Inst 4 Single Cycle Disadvantages & Advantages Uses the clock cycle inefficiently the clock cycle must be timed to accommodate

More information

CPU Pipelining Issues

CPU Pipelining Issues CPU Pipelining Issues What have you been beating your head against? This pipe stuff makes my head hurt! Finishing up Chapter 6 L20 Pipeline Issues 1 Structural Data Hazard Consider LOADS: Can we fix all

More information

Chapter 4. The Processor

Chapter 4. The Processor Chapter 4 The Processor Introduction CPU performance factors Instruction count Determined by ISA and compiler CPI and Cycle time Determined by CPU hardware We will examine two MIPS implementations A simplified

More information

ELE 655 Microprocessor System Design

ELE 655 Microprocessor System Design ELE 655 Microprocessor System Design Section 2 Instruction Level Parallelism Class 1 Basic Pipeline Notes: Reg shows up two places but actually is the same register file Writes occur on the second half

More information

EECS Digital Design

EECS Digital Design EECS 150 -- Digital Design Lecture 11-- Processor Pipelining 2010-2-23 John Wawrzynek Today s lecture by John Lazzaro www-inst.eecs.berkeley.edu/~cs150 1 Today: Pipelining How to apply the performance

More information

Full Datapath. Chapter 4 The Processor 2

Full Datapath. Chapter 4 The Processor 2 Pipelining Full Datapath Chapter 4 The Processor 2 Datapath With Control Chapter 4 The Processor 3 Performance Issues Longest delay determines clock period Critical path: load instruction Instruction memory

More information

EITF20: Computer Architecture Part2.2.1: Pipeline-1

EITF20: Computer Architecture Part2.2.1: Pipeline-1 EITF20: Computer Architecture Part2.2.1: Pipeline-1 Liang Liu liang.liu@eit.lth.se 1 Outline Reiteration Pipelining Harzards Structural hazards Data hazards Control hazards Implementation issues Multi-cycle

More information

EECS 151/251A Fall 2017 Digital Design and Integrated Circuits. Instructor: John Wawrzynek and Nicholas Weaver. Lecture 13 EE141

EECS 151/251A Fall 2017 Digital Design and Integrated Circuits. Instructor: John Wawrzynek and Nicholas Weaver. Lecture 13 EE141 EECS 151/251A Fall 2017 Digital Design and Integrated Circuits Instructor: John Wawrzynek and Nicholas Weaver Lecture 13 Project Introduction You will design and optimize a RISC-V processor Phase 1: Design

More information

Advanced Parallel Architecture Lessons 5 and 6. Annalisa Massini /2017

Advanced Parallel Architecture Lessons 5 and 6. Annalisa Massini /2017 Advanced Parallel Architecture Lessons 5 and 6 Annalisa Massini - Pipelining Hennessy, Patterson Computer architecture A quantitive approach Appendix C Sections C.1, C.2 Pipelining Pipelining is an implementation

More information

Chapter 4. The Processor

Chapter 4. The Processor Chapter 4 The Processor 4.1 Introduction Introduction CPU performance factors Instruction count CPI and Cycle time Determined by CPU hardware We will examine two MIPS implementations Determined by ISA

More information

More CPU Pipelining Issues

More CPU Pipelining Issues More CPU Pipelining Issues What have you been beating your head against? This pipe stuff makes my head hurt! Important Stuff: Study Session for Problem Set 5 tomorrow night (11/11) 5:30-9:00pm Study Session

More information

Introduction to Pipelining. Silvina Hanono Wachman Computer Science & Artificial Intelligence Lab M.I.T.

Introduction to Pipelining. Silvina Hanono Wachman Computer Science & Artificial Intelligence Lab M.I.T. Introduction to Pipelining Silvina Hanono Wachman Computer Science & Artificial Intelligence Lab M.I.T. L15-1 Performance Measures Two metrics of interest when designing a system: 1. Latency: The delay

More information

Pipelining Analogy. Pipelined laundry: overlapping execution. Parallelism improves performance. Four loads: Non-stop: Speedup = 8/3.5 = 2.3.

Pipelining Analogy. Pipelined laundry: overlapping execution. Parallelism improves performance. Four loads: Non-stop: Speedup = 8/3.5 = 2.3. Pipelining Analogy Pipelined laundry: overlapping execution Parallelism improves performance Four loads: Speedup = 8/3.5 = 2.3 Non-stop: Speedup =2n/05n+15 2n/0.5n 1.5 4 = number of stages 4.5 An Overview

More information

Computer Architecture. Lecture 6.1: Fundamentals of

Computer Architecture. Lecture 6.1: Fundamentals of CS3350B Computer Architecture Winter 2015 Lecture 6.1: Fundamentals of Instructional Level Parallelism Marc Moreno Maza www.csd.uwo.ca/courses/cs3350b [Adapted from lectures on Computer Organization and

More information

Processor Design Pipelined Processor (II) Hung-Wei Tseng

Processor Design Pipelined Processor (II) Hung-Wei Tseng Processor Design Pipelined Processor (II) Hung-Wei Tseng Recap: Pipelining Break up the logic with pipeline registers into pipeline stages Each pipeline registers is clocked Each pipeline stage takes one

More information

CS 152 Computer Architecture and Engineering Lecture 4 Pipelining

CS 152 Computer Architecture and Engineering Lecture 4 Pipelining CS 152 Computer rchitecture and Engineering Lecture 4 Pipelining 2014-1-30 John Lazzaro (not a prof - John is always OK) T: Eric Love www-inst.eecs.berkeley.edu/~cs152/ Play: 1 otorola 68000 Next week

More information

4. What is the average CPI of a 1.4 GHz machine that executes 12.5 million instructions in 12 seconds?

4. What is the average CPI of a 1.4 GHz machine that executes 12.5 million instructions in 12 seconds? Chapter 4: Assessing and Understanding Performance 1. Define response (execution) time. 2. Define throughput. 3. Describe why using the clock rate of a processor is a bad way to measure performance. Provide

More information

CENG 3531 Computer Architecture Spring a. T / F A processor can have different CPIs for different programs.

CENG 3531 Computer Architecture Spring a. T / F A processor can have different CPIs for different programs. Exam 2 April 12, 2012 You have 80 minutes to complete the exam. Please write your answers clearly and legibly on this exam paper. GRADE: Name. Class ID. 1. (22 pts) Circle the selected answer for T/F and

More information

Pipelining. CSC Friday, November 6, 2015

Pipelining. CSC Friday, November 6, 2015 Pipelining CSC 211.01 Friday, November 6, 2015 Performance Issues Longest delay determines clock period Critical path: load instruction Instruction memory register file ALU data memory register file Not

More information

Pipelining! Advanced Topics on Heterogeneous System Architectures. Politecnico di Milano! Seminar DEIB! 30 November, 2017!

Pipelining! Advanced Topics on Heterogeneous System Architectures. Politecnico di Milano! Seminar DEIB! 30 November, 2017! Advanced Topics on Heterogeneous System Architectures Pipelining! Politecnico di Milano! Seminar Room @ DEIB! 30 November, 2017! Antonio R. Miele! Marco D. Santambrogio! Politecnico di Milano! 2 Outline!

More information

Pipeline Control Hazards and Instruction Variations

Pipeline Control Hazards and Instruction Variations Pipeline Control Hazards and Instruction Variations Hakim Weatherspoon CS 3410, Spring 2012 Computer Science Cornell University See P&H Appendix 4.8 Goals for Today Recap: Data Hazards Control Hazards

More information

The Big Picture: Where are We Now? EEM 486: Computer Architecture. Lecture 3. Designing a Single Cycle Datapath

The Big Picture: Where are We Now? EEM 486: Computer Architecture. Lecture 3. Designing a Single Cycle Datapath The Big Picture: Where are We Now? EEM 486: Computer Architecture Lecture 3 The Five Classic Components of a Computer Processor Input Control Memory Designing a Single Cycle path path Output Today s Topic:

More information

The Processor. Z. Jerry Shi Department of Computer Science and Engineering University of Connecticut. CSE3666: Introduction to Computer Architecture

The Processor. Z. Jerry Shi Department of Computer Science and Engineering University of Connecticut. CSE3666: Introduction to Computer Architecture The Processor Z. Jerry Shi Department of Computer Science and Engineering University of Connecticut CSE3666: Introduction to Computer Architecture Introduction CPU performance factors Instruction count

More information

Computer Architecture ELEC3441

Computer Architecture ELEC3441 Computer Architecture ELEC3441 RISC vs CISC Iron Law CPUTime = # of instruction program # of cycle instruction cycle Lecture 5 Pipelining Dr. Hayden Kwok-Hay So Department of Electrical and Electronic

More information

CS 61C: Great Ideas in Computer Architecture. MIPS CPU Datapath, Control Introduction

CS 61C: Great Ideas in Computer Architecture. MIPS CPU Datapath, Control Introduction CS 61C: Great Ideas in Computer Architecture MIPS CPU Datapath, Control Introduction Instructor: Alan Christopher 7/28/214 Summer 214 -- Lecture #2 1 Review of Last Lecture Critical path constrains clock

More information

Outline. A pipelined datapath Pipelined control Data hazards and forwarding Data hazards and stalls Branch (control) hazards Exception

Outline. A pipelined datapath Pipelined control Data hazards and forwarding Data hazards and stalls Branch (control) hazards Exception Outline A pipelined datapath Pipelined control Data hazards and forwarding Data hazards and stalls Branch (control) hazards Exception 1 4 Which stage is the branch decision made? Case 1: 0 M u x 1 Add

More information

361 datapath.1. Computer Architecture EECS 361 Lecture 8: Designing a Single Cycle Datapath

361 datapath.1. Computer Architecture EECS 361 Lecture 8: Designing a Single Cycle Datapath 361 datapath.1 Computer Architecture EECS 361 Lecture 8: Designing a Single Cycle Datapath Outline of Today s Lecture Introduction Where are we with respect to the BIG picture? Questions and Administrative

More information

DEE 1053 Computer Organization Lecture 6: Pipelining

DEE 1053 Computer Organization Lecture 6: Pipelining Dept. Electronics Engineering, National Chiao Tung University DEE 1053 Computer Organization Lecture 6: Pipelining Dr. Tian-Sheuan Chang tschang@twins.ee.nctu.edu.tw Dept. Electronics Engineering National

More information

Lecture 4: Review of MIPS. Instruction formats, impl. of control and datapath, pipelined impl.

Lecture 4: Review of MIPS. Instruction formats, impl. of control and datapath, pipelined impl. Lecture 4: Review of MIPS Instruction formats, impl. of control and datapath, pipelined impl. 1 MIPS Instruction Types Data transfer: Load and store Integer arithmetic/logic Floating point arithmetic Control

More information

EI338: Computer Systems and Engineering (Computer Architecture & Operating Systems)

EI338: Computer Systems and Engineering (Computer Architecture & Operating Systems) EI338: Computer Systems and Engineering (Computer Architecture & Operating Systems) Chentao Wu 吴晨涛 Associate Professor Dept. of Computer Science and Engineering Shanghai Jiao Tong University SEIEE Building

More information

Instruction Level Parallelism. Appendix C and Chapter 3, HP5e

Instruction Level Parallelism. Appendix C and Chapter 3, HP5e Instruction Level Parallelism Appendix C and Chapter 3, HP5e Outline Pipelining, Hazards Branch prediction Static and Dynamic Scheduling Speculation Compiler techniques, VLIW Limits of ILP. Implementation

More information

Thomas Polzer Institut für Technische Informatik

Thomas Polzer Institut für Technische Informatik Thomas Polzer tpolzer@ecs.tuwien.ac.at Institut für Technische Informatik Pipelined laundry: overlapping execution Parallelism improves performance Four loads: Speedup = 8/3.5 = 2.3 Non-stop: Speedup =

More information

Control Hazards - branching causes problems since the pipeline can be filled with the wrong instructions.

Control Hazards - branching causes problems since the pipeline can be filled with the wrong instructions. Control Hazards - branching causes problems since the pipeline can be filled with the wrong instructions Stage Instruction Fetch Instruction Decode Execution / Effective addr Memory access Write-back Abbreviation

More information

Pipelining. Ideal speedup is number of stages in the pipeline. Do we achieve this? 2. Improve performance by increasing instruction throughput ...

Pipelining. Ideal speedup is number of stages in the pipeline. Do we achieve this? 2. Improve performance by increasing instruction throughput ... CHAPTER 6 1 Pipelining Instruction class Instruction memory ister read ALU Data memory ister write Total (in ps) Load word 200 100 200 200 100 800 Store word 200 100 200 200 700 R-format 200 100 200 100

More information

CSEE 3827: Fundamentals of Computer Systems

CSEE 3827: Fundamentals of Computer Systems CSEE 3827: Fundamentals of Computer Systems Lecture 21 and 22 April 22 and 27, 2009 martha@cs.columbia.edu Amdahl s Law Be aware when optimizing... T = improved Taffected improvement factor + T unaffected

More information

Computer Architecture

Computer Architecture Lecture 3: Pipelining Iakovos Mavroidis Computer Science Department University of Crete 1 Previous Lecture Measurements and metrics : Performance, Cost, Dependability, Power Guidelines and principles in

More information

Chapter 4. The Processor

Chapter 4. The Processor Chapter 4 The Processor Introduction CPU performance factors Instruction count Determined by ISA and compiler CPI and Cycle time Determined by CPU hardware 4.1 Introduction We will examine two MIPS implementations

More information

ECE473 Computer Architecture and Organization. Pipeline: Control Hazard

ECE473 Computer Architecture and Organization. Pipeline: Control Hazard Computer Architecture and Organization Pipeline: Control Hazard Lecturer: Prof. Yifeng Zhu Fall, 2015 Portions of these slides are derived from: Dave Patterson UCB Lec 15.1 Pipelining Outline Introduction

More information

EC 413 Computer Organization - Fall 2017 Problem Set 3 Problem Set 3 Solution

EC 413 Computer Organization - Fall 2017 Problem Set 3 Problem Set 3 Solution EC 413 Computer Organization - Fall 2017 Problem Set 3 Problem Set 3 Solution Important guidelines: Always state your assumptions and clearly explain your answers. Please upload your solution document

More information

COSC 6385 Computer Architecture - Pipelining

COSC 6385 Computer Architecture - Pipelining COSC 6385 Computer Architecture - Pipelining Fall 2006 Some of the slides are based on a lecture by David Culler, Instruction Set Architecture Relevant features for distinguishing ISA s Internal storage

More information

Multi-cycle Instructions in the Pipeline (Floating Point)

Multi-cycle Instructions in the Pipeline (Floating Point) Lecture 6 Multi-cycle Instructions in the Pipeline (Floating Point) Introduction to instruction level parallelism Recap: Support of multi-cycle instructions in a pipeline (App A.5) Recap: Superpipelining

More information

Advanced processor designs

Advanced processor designs Advanced processor designs We ve only scratched the surface of CPU design. Today we ll briefly introduce some of the big ideas and big words behind modern processors by looking at two example CPUs. The

More information

Full Datapath. Chapter 4 The Processor 2

Full Datapath. Chapter 4 The Processor 2 Pipelining Full Datapath Chapter 4 The Processor 2 Datapath With Control Chapter 4 The Processor 3 Performance Issues Longest delay determines clock period Critical path: load instruction Instruction memory

More information

5 th Edition. The Processor We will examine two MIPS implementations A simplified version A more realistic pipelined version

5 th Edition. The Processor We will examine two MIPS implementations A simplified version A more realistic pipelined version COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface Chapter 4 5 th Edition Introduction CPU performance factors Instruction count Determined by ISA and compiler CPI and Cycle time Determined

More information

Processor Architecture. Jin-Soo Kim Computer Systems Laboratory Sungkyunkwan University

Processor Architecture. Jin-Soo Kim Computer Systems Laboratory Sungkyunkwan University Processor Architecture Jin-Soo Kim (jinsookim@skku.edu) Computer Systems Laboratory Sungkyunkwan University http://csl.skku.edu Moore s Law Gordon Moore @ Intel (1965) 2 Computer Architecture Trends (1)

More information

Processor Design Pipelined Processor. Hung-Wei Tseng

Processor Design Pipelined Processor. Hung-Wei Tseng Processor Design Pipelined Processor Hung-Wei Tseng Pipelining 7 Pipelining Break up the logic with isters into pipeline stages Each stage can act on different instruction/data States/Control signals of

More information

Real Processors. Lecture for CPSC 5155 Edward Bosworth, Ph.D. Computer Science Department Columbus State University

Real Processors. Lecture for CPSC 5155 Edward Bosworth, Ph.D. Computer Science Department Columbus State University Real Processors Lecture for CPSC 5155 Edward Bosworth, Ph.D. Computer Science Department Columbus State University Instruction-Level Parallelism (ILP) Pipelining: executing multiple instructions in parallel

More information

Pipeline Overview. Dr. Jiang Li. Adapted from the slides provided by the authors. Jiang Li, Ph.D. Department of Computer Science

Pipeline Overview. Dr. Jiang Li. Adapted from the slides provided by the authors. Jiang Li, Ph.D. Department of Computer Science Pipeline Overview Dr. Jiang Li Adapted from the slides provided by the authors Outline MIPS An ISA for Pipelining 5 stage pipelining Structural and Data Hazards Forwarding Branch Schemes Exceptions and

More information

Lecture 2: Processor and Pipelining 1

Lecture 2: Processor and Pipelining 1 The Simple BIG Picture! Chapter 3 Additional Slides The Processor and Pipelining CENG 6332 2 Datapath vs Control Datapath signals Control Points Controller Datapath: Storage, FU, interconnect sufficient

More information

EN2910A: Advanced Computer Architecture Topic 02: Review of classical concepts

EN2910A: Advanced Computer Architecture Topic 02: Review of classical concepts EN2910A: Advanced Computer Architecture Topic 02: Review of classical concepts Prof. Sherief Reda School of Engineering Brown University S. Reda EN2910A FALL'15 1 Classical concepts (prerequisite) 1. Instruction

More information

Appendix C: Pipelining: Basic and Intermediate Concepts

Appendix C: Pipelining: Basic and Intermediate Concepts Appendix C: Pipelining: Basic and Intermediate Concepts Key ideas and simple pipeline (Section C.1) Hazards (Sections C.2 and C.3) Structural hazards Data hazards Control hazards Exceptions (Section C.4)

More information

Processor (IV) - advanced ILP. Hwansoo Han

Processor (IV) - advanced ILP. Hwansoo Han Processor (IV) - advanced ILP Hwansoo Han Instruction-Level Parallelism (ILP) Pipelining: executing multiple instructions in parallel To increase ILP Deeper pipeline Less work per stage shorter clock cycle

More information

Computer Architecture Computer Science & Engineering. Chapter 4. The Processor BK TP.HCM

Computer Architecture Computer Science & Engineering. Chapter 4. The Processor BK TP.HCM Computer Architecture Computer Science & Engineering Chapter 4 The Processor Introduction CPU performance factors Instruction count Determined by ISA and compiler CPI and Cycle time Determined by CPU hardware

More information