Chapter 4 The Processor


4.1 Introduction
4.2 Logic Design Conventions
4.3 The Single-Cycle Design
4.4 The Pipelined Design

(c) Kevin R. Burger :: Computer Science & Engineering :: Arizona State University

4.1 Introduction

In this chapter we will examine in more detail the inner workings of a computing system, in particular, the CPU. Two separate designs for a simple MIPS32-like processor are discussed: the single-cycle design and a pipelined design.

A Basic MIPS Implementation

It would be impossible in the time we have allotted for Chapter 4 to discuss the design of a processor that would implement every MIPS32 instruction, so Chapter 4 focuses on a select subset of instructions to give you an idea of how R-, I-, and J-format instructions are executed. The subset of instructions is:

Memory reference instructions: lw, sw (I-format)
Arithmetic-logical instructions: add, sub, and, or, nor, slt (R-format)
Branch instructions: beq (I-format), j (J-format)

4.2 Logic Design Conventions

A microprocessor is a complex digital circuit that executes instructions. It can be divided into two primary parts: the datapath and the control unit (or simply, the control). The datapath consists of those elements that store data (bits), move bits around, and operate on bits. Datapath elements include:

Logic gates (AND, OR, NAND, XOR, etc.)
Muxes, decoders, encoders
Adders, multipliers, shifting circuits
Register file (CPU registers)
Instruction and data memory
Cache memory

The Arithmetic Logic Unit (ALU) is a major component of the datapath and contains the circuitry for performing arithmetic operations (+, -, *, /), as well as other operations, e.g., logical AND.

The control is responsible for generating control signals which control the behavior of the datapath. There are two common techniques for implementing the control:

Hardwiring - employs sequential logic and a finite state machine (FSM) to generate the control signals.
Microprogramming - microinstructions (called microcode) are executed to generate the control signals.

Each has its advantages and disadvantages (control is discussed in more detail in Appendix D). In Chapter 4, a simple hardwired control unit is designed.

Note that the wall time it takes to execute an instruction will vary depending on the instruction. For example, a lw may take 50 ns (memory accesses are slow) whereas an add may require only 100 ps (assuming the operands are already in registers and available to be sent to the ALU). It is more difficult to design a microprocessor circuit in which instructions execute in a variable amount of time. For this reason, in Chapter 4, a single-cycle design is implemented first. In the single-cycle design, every instruction takes one clock cycle to execute even though each individual instruction will require a different amount of time within that clock cycle.
Question: If each instruction requires a variable amount of time to complete, how do we ensure that each instruction completes within one clock cycle? Answer: Make the clock period greater than or equal to the time required by the slowest instruction. For example, if lw is the slowest instruction, at 50 ns, then make the clock period 50 ns (equivalently, the clock frequency 20 MHz). I hope you will notice that the single-cycle design is very inefficient, but understand that for learning purposes, we are more interested in the simplicity of the design than in its performance.

The single-cycle design implements a Harvard architecture 1 : separate instruction and data memories. The primary advantage is that both memories can be accessed simultaneously. The main disadvantage is that it requires two buses.

1 The term originated with the Harvard Mark I (officially the IBM Automatic Sequence Controlled Calculator, or ASCC), designed and built by IBM in 1944.
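The clock-period rule above can be sketched in a few lines of Python. This is a minimal sketch, not part of the original notes; the latency values are the illustrative ones from the text (lw = 50 ns, add = 100 ps).

```python
# Sketch: the single-cycle clock must accommodate the slowest instruction.
latencies_ns = {"lw": 50.0, "add": 0.1}   # illustrative values from the text

def single_cycle_clock(latencies):
    """Return (period_ns, frequency_MHz) for a single-cycle design."""
    period = max(latencies.values())       # period >= slowest instruction
    freq_mhz = 1000.0 / period             # 1/period, converting ns to MHz
    return period, freq_mhz

period, freq = single_cycle_clock(latencies_ns)
print(period, freq)    # 50.0 ns -> 20.0 MHz
```

Note that the fast add is forced to occupy the same 50 ns cycle as lw, which is exactly the inefficiency mentioned above.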

Some small microcontrollers and DSP controllers used primarily in embedded systems implement a Harvard architecture. Most other processors, and certainly the type found in desktops, laptops, tablets, and cell phones, implement a Princeton architecture or Von Neumann architecture 2 : one combined memory for both instructions and data. (Looking ahead to Chapter 5, most modern systems actually implement a modified Harvard architecture where there are separate instruction and data caches backed by a common memory storing both instructions and data.)

Clocking

In digital logic, components employ combinational or sequential logic. A combinational logic component is one where the outputs depend only on the inputs, e.g., a logic gate. A sequential logic component is one that contains state (i.e., internal storage or memory) and where the outputs depend on both the current inputs and the state of the component. All modern processors are designed using synchronous digital logic, i.e., a clock is involved. The clock is used to determine when data are valid and stable (i.e., the signal on the wire has reached a steady 0 or 1 state) relative to the clock, and to control when writes to sequential logic components occur.

In the clocking methodology, events can be based on the level of the clock signal (low or high), in which case the events are said to be level-triggered. Alternatively, events can be based on the edge of the clock signal (rising or falling), and these events are edge-triggered. Edge-triggering is more common and is used in Chapter 4. For an example of how edge-triggering is used, see Fig. 4.3 in the textbook. During a single clock cycle, this sequence of events occurs:

1. Signals emanating from State Element 1 are "read" and fed into the combinational logic.
2. The combinational logic performs some operation, which takes x amount of time.
3. On the next rising clock edge, the combinational logic outputs are written to State Element 2.
The clock period must be greater than or equal to x, i.e., the combinational logic must have enough time to generate its results prior to the next clock edge, but note that the prior contents of State Element 2 (i.e., its state) will be coming out of State Element 2 just ahead of the bits being written on the rising clock edge. State Element 2 performs some internal logic on the inputs, taking y amount of time, which changes its state. This may change its outputs, so the outputs of State Element 2 will not be stable until at least y time units after the rising edge of the clock. Consequently, the clock cycle time (or clock period) must be greater than or equal to x + y, and since clock frequency is inversely related to clock period, the clock frequency must be less than or equal to 1/(x + y).

2 The term originated in a 1945 draft document describing the design of a computer system known as EDVAC (Electronic Discrete Variable Automatic Computer), which was authored by John von Neumann, a renowned physicist and mathematician at the Princeton Institute for Advanced Study.
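The x + y constraint can also be written as a tiny calculation. This is a sketch with hypothetical delay values (x = 600 ps of combinational delay, y = 200 ps of state-element delay); these numbers are mine, not from the text.

```python
# Sketch: minimum clock period and maximum frequency from the two delays
# described above (combinational delay x, state-element delay y), in ps.

def min_period_ps(x_ps, y_ps):
    """Clock period must be >= x + y."""
    return x_ps + y_ps

def max_freq_ghz(x_ps, y_ps):
    """Frequency must be <= 1/(x + y); 1000/ps gives GHz."""
    return 1000.0 / min_period_ps(x_ps, y_ps)

print(min_period_ps(600, 200))    # 800 ps
print(max_freq_ghz(600, 200))     # 1.25 GHz
```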

Note: In the diagrams of Chapter 4, state elements are assumed to be written at the end of every clock cycle. If a state element is not written at the end of every clock cycle, a separate write control signal must be connected to it so that it can be written when required.

4.3 The Single-Cycle Design

The figure shows the datapath and control that implement the subset of instructions mentioned in 4.1, with the exception of the j instruction, which is not yet implemented (we will see how j is implemented shortly). I have augmented the diagram in the book by assigning names to some of the datapath components.

4.3.1 The Machine Cycle

The digital logic circuit comprising the processor implements what we will refer to as the machine cycle (other terms include instruction cycle and fetch-decode-execute cycle). The machine cycle begins with the rising edge of the system clock.

1. At the rising edge of the clock, the 32-bit instruction Instr is fetched from InstructionMemory[PC].
2. In parallel do:
   a. PC Adder: compute PC + 4.
   b. Send instruction opcode bits Instr 31:26 to the control unit for decoding.
   c. Send instruction function code bits Instr 5:0 to the ALU Control unit for decoding.
   d. Send instruction bits Instr 25:21 for register operand rs to the register file.
   e. Send instruction bits Instr 20:16 for register operand rt to the register file.
   f. Send either instruction bits Instr 20:16 or Instr 15:11 for destination register operand rt or rd to the register file.
   g. The control unit decodes opcode bits Instr 31:26 to determine which instruction this is:
      1. If the instruction is lw, the ALU Control configures the ALU to perform an addition; the sum is sent as the address to DataMemory; the data word is read; the word is sent to the register file for writing to rt on the next clock edge.
      2. If the instruction is sw, the ALU Control configures the ALU to perform an addition; the sum is sent as the address to DataMemory; the word read from source register rt is written into DataMemory on the next clock edge.
      3. If the instruction is an arithmetic-logical instruction (R-format), the two operands rs and rt are sent from the register file to the ALU; the control unit sends a 2-bit signal to the ALU Control to help it determine which operation to perform; the ALU Control sends a 4-bit signal to the ALU telling it which operation to perform; the ALU result is sent back to the register file for writing to rd on the next clock edge.
      4.
If the instruction is beq, the two operands rs and rt are sent from the register file to the ALU; the ALU Control configures the ALU to perform a subtraction; the ALU asserts the Zero control signal if rs = rt; the 16-bit immediate value in Instr 15:0 is sign-extended to a 32-bit value; the Branch Adder computes the branch target address; if Zero is asserted, the PC Src Mux is selected so the branch target address is written to PC on the next clock edge.
      5. What happens during a j instruction is discussed later, in the section on implementing jump.
      6. If the instruction is not beq or j, the PC Src Mux is selected so PC + 4 is written to PC on the next clock edge.
3. On the next rising edge of the clock, any writes take place at the same time that Step 1 starts again.

This cycle is repeated millions, billions, trillions, or gazillions of times per second.
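The machine cycle above can be sketched as a tiny behavioral simulator. This is a minimal sketch, not the actual hardware: instructions are modeled as Python dicts rather than 32-bit encodings so the fetch-decode-execute structure stays visible, and all the names (`step`, the dict keys) are my own.

```python
# Hypothetical sketch of one machine cycle for the instruction subset.

def step(state):
    """Execute one clock cycle of the single-cycle machine."""
    pc, regs, imem, dmem = state["pc"], state["regs"], state["imem"], state["dmem"]
    instr = imem[pc]                      # 1.  fetch from InstructionMemory[PC]
    next_pc = pc + 4                      # 2a. PC Adder computes PC + 4
    op = instr["op"]                      # 2b. control decodes the opcode
    if op == "lw":                        # address = rs + imm; read DMEM
        regs[instr["rt"]] = dmem[regs[instr["rs"]] + instr["imm"]]
    elif op == "sw":                      # address = rs + imm; write DMEM
        dmem[regs[instr["rs"]] + instr["imm"]] = regs[instr["rt"]]
    elif op == "beq":                     # ALU subtracts; branch if Zero asserted
        if regs[instr["rs"]] == regs[instr["rt"]]:
            next_pc = (pc + 4) + (instr["imm"] << 2)
    elif op == "add":                     # one representative R-format instruction
        regs[instr["rd"]] = regs[instr["rs"]] + regs[instr["rt"]]
    state["pc"] = next_pc                 # 3.  writes land on the next rising edge

state = {"pc": 0,
         "regs": {"t0": 1, "t1": 2, "t2": 0},
         "imem": {0: {"op": "add", "rd": "t2", "rs": "t0", "rt": "t1"}},
         "dmem": {}}
step(state)    # after one cycle: $t2 = 3, PC = 4
```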

4.3.2 Datapath Components: Program Counter Register

The Program Counter (PC) register always contains the address in the Instruction Memory (IMEM) of the instruction that will be fetched and executed in the next clock cycle. Assuming no branch or jump is taken, the address in IMEM of the instruction that will be executed next is PC + 4, which is calculated by the PC Adder.

4.3.3 Datapath Components: Register File

The processor's 32 general-purpose registers $0 through $31 are stored in a structure termed the Register File. Fig. 8.9 in Appendix B (shown below) shows one way to construct an n-register register file. Each 32-bit register could be constructed using 32 D flip-flops and would have two inputs: C is a CLK signal that is asserted to write to the register and D is the 32-bit word to write. In the single-cycle datapath we write to a register by performing these steps:

1. Put the 5-bit register number on the Write Register input to the register file.
2. Put the 32-bit word on the Write Data input.
3. Assert the RegWrite control signal.

Internal to the register file, the 5-to-32 decoder selects the appropriate register for writing by ensuring that the output of only one AND gate will be asserted. The other input to each AND gate is the RegWrite control signal, which is asserted at the end of each clock cycle (equivalently, the beginning of the next) when a write must occur.

4.3.4 Datapath Components: ALU

The construction of a 32-bit ALU is discussed in Section B.5 of Appendix B. The ALU implements the following operations on 32-bit inputs A and B:

   A AND B   Logical AND
   A OR B    Logical OR
   A + B     Addition
   A - B     Subtraction
   SLT       Set on less than (1 when A < B, 0 otherwise)
   A NOR B   Logical NOR
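The six ALU operations listed above can be sketched behaviorally in Python. This is a minimal sketch under the assumption of a 32-bit datapath (hence the masking); the function and operation names are mine, not from the text.

```python
# A behavioral sketch of the six ALU operations, masked to 32 bits.

MASK = 0xFFFFFFFF

def alu(op, a, b):
    """Return the 32-bit result of the named ALU operation."""
    if op == "and": return a & b
    if op == "or":  return a | b
    if op == "add": return (a + b) & MASK          # wraps on overflow
    if op == "sub": return (a - b) & MASK          # two's-complement subtract
    if op == "slt": return (((a - b) & MASK) >> 31) & 1   # sign bit of A - B
    if op == "nor": return ~(a | b) & MASK
    raise ValueError("unknown ALU operation: " + op)

print(hex(alu("sub", 3, 5)))    # 0xfffffffe, i.e., -2 in two's complement
```

The slt case assumes the subtraction does not overflow; the hardware's overflow handling is discussed below.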

The 32-bit ALU is constructed from 32 1-bit ALUs (ALU 0 through ALU 31), with ALU 0 through ALU 30 being implemented using the 1-bit ALU shown below left and ALU 31 being constructed from the 1-bit ALU shown below right. Each 1-bit ALU contains:

1. Four 1-bit data inputs:
   a        One bit of the full 32-bit operand A.
   b        One bit of the full 32-bit operand B.
   CarryIn  The carry input signal for the internal full adder.
   Less     Used in implementing the slt instruction.

2. Three control inputs:
   Ainvert    When asserted, the complement of a is used in the operation.
   Binvert    When asserted, the complement of b is used in the operation.
   Operation  A 2-bit signal that selects the mux that outputs the Result.

3. ALU 0 through ALU 30 have two data outputs:
   Result    The 1-bit result of performing the operation selected by Operation.
   CarryOut  The carry-out bit from the internal full adder.

4. ALU 31 has three data outputs:
   Result    The 1-bit result of performing the operation selected by Operation.
   Set       Used in implementing the slt instruction.
   Overflow  Asserted high when an arithmetic operation results in overflow.

The 32-bit ALU has three control inputs:
   Ainvert    Used in computing A NOR B.
   Bnegate    Used in computing A NOR B and in implementing subtraction.
   Operation  A 2-bit signal that selects each of the 4-to-1 muxes in ALU 0 through ALU 31.

For ALU 0 through ALU 31, the 2-bit Operation signal controls one of four operations by selecting the output of the 4-to-1 Result mux. In conjunction with Ainvert and Bnegate, this permits the ALU to perform these operations:

   Ainvert  Bnegate  Operation  Result                          Implements
   0        0        00         a AND b                         Logical AND
   0        0        01         a OR b                          Logical OR
   1        1        00         (NOT a) AND (NOT b) = NOT (a OR b)   Logical NOR
   0        0        10         sum bit from a + b              Addition
   0        1        10         sum bit from a + (NOT b)        Subtraction
   0        1        11         Less input                      SLT

Subtraction of B from A is performed as: A - B = A + (-B). The negation of B in two's complement is accomplished by forming the one's complement of B (by inverting each b bit) and adding 1. The addition of 1 is accomplished by Bnegate, which both inverts each b bit and sets the CarryIn of ALU 0 to 1.

The slt rd, rs, rt instruction is performed by subtracting B (register rt) from A (register rs). If A is less than B, the result of the subtraction will be a negative value; if A is not less than B, the result of the subtraction will be zero or positive. In two's complement, a negative 32-bit integer has bit 31 set and a nonnegative integer has bit 31 cleared. Bit 31 of our 32-bit result will be set when the sum bit from the full adder of ALU 31 is 1, and bit 31 will be cleared when the sum bit of this full adder is 0.
Therefore, the sum bit from the full adder of ALU 31 forms the Set output of ALU 31, and this output is fed back to become the Less input of ALU 0. Consequently, when Operation = 11, the ALU Result will be 1 if A is less than B and 0 if A is greater than or equal to B. ALU 31 contains overflow detection logic which we will not discuss.

To implement beq rs, rt, label and bne rs, rt, label in hardware requires us to determine if rs = rt and if rs ≠ rt, respectively. We can easily determine this by detecting whether rs - rt is zero (as it will be when rs = rt) or nonzero (as it will be when rs ≠ rt). That is the function of the large 32-input NOR gate: when all of the Result bits are 0, the Zero control signal will be 1; if one or more Result bits are 1, then Zero will be 0.
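The invert-and-add-1 subtraction, the Set/Less feedback, and the Zero detection above can be sketched as follows. This is a minimal sketch; the function names are mine, and as noted above the slt result is only meaningful when the subtraction does not overflow.

```python
# Sketch: slt and Zero via 32-bit two's-complement subtraction.

MASK = 0xFFFFFFFF

def alu_sub(a, b):
    """A - B computed as A + (one's complement of B) + 1, kept to 32 bits."""
    return (a + ((~b & MASK) + 1)) & MASK

def slt(a, b):
    # Bit 31 of A - B is the Set output of ALU 31, fed back to Less of ALU 0.
    return (alu_sub(a, b) >> 31) & 1

def zero(a, b):
    # Zero = 1 exactly when every Result bit of A - B is 0 (the big NOR gate).
    return 1 if alu_sub(a, b) == 0 else 0

print(slt(3, 5), zero(7, 7))    # 1 1
```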

The standard circuit symbol for the full 32-bit ALU is shown above. In the diagram for the datapath of the single-cycle design, the ALU Control logic block generates a 4-bit ALU control signal we shall label ALUOpInput 3:0, where:

   ALUOpInput 3  Connects to Ainvert
   ALUOpInput 2  Connects to Bnegate
   ALUOpInput 1  Connects to internal ALUOperation 1
   ALUOpInput 0  Connects to internal ALUOperation 0

Thus,

   ALUOpInput  ALU Operation
   0000        Logical AND
   0001        Logical OR
   0010        Addition
   0110        Subtraction
   0111        Set on less than
   1100        Logical NOR

The ALU Control is controlled by two inputs: the 6-bit function code field from instruction bits Instr 5:0 and a 2-bit signal from the main control unit named ALUOp 1:0, which is encoded as:

   lw   00 (I)      add  10 (R)
   sw   00 (I)      sub  10 (R)
   beq  01 (I)      and  10 (R)
   j    xx (J)      or   10 (R)
                    nor  10 (R)
                    slt  10 (R)

For lw and sw the ALU needs to perform an addition. The 2-bit ALUOp signal going from the control unit to the ALU Control is set to 00 and the ALU Control must output 0010. For beq the ALU needs to perform a subtraction. The ALUOp signal is set to 01 and the ALU Control must output 0110. For j, we do not use the ALU, so the ALUOp signal can be set to anything, i.e., ALUOp = xx. For the R-format instructions add, sub, and, or, nor, and slt, the main control unit sets ALUOp to 10 and the ALU Control must output 0010, 0110, 0000, 0001, 1100, or 0111, respectively.
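The ALU Control mapping just described can be sketched as a small function. This is a behavioral sketch, not the gate-level logic; the function name is mine. The funct values are the standard MIPS32 function codes for these six R-format instructions.

```python
# Sketch of the ALU Control: ALUOp (from the main control) plus the funct
# field select the 4-bit ALUOpInput code listed above.

def alu_control(aluop, funct=None):
    if aluop == 0b00:                 # lw / sw: always add
        return 0b0010
    if aluop == 0b01:                 # beq: subtract
        return 0b0110
    # aluop == 0b10: R-format, decode the 6-bit funct field
    return {0b100000: 0b0010,         # add
            0b100010: 0b0110,         # sub
            0b100100: 0b0000,         # and
            0b100101: 0b0001,         # or
            0b100111: 0b1100,         # nor
            0b101010: 0b0111}[funct]  # slt

print(bin(alu_control(0b10, 0b101010)))    # 0b111, i.e., 0111 for slt
```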

All of this can be summarized in a truth table where the inputs are ALUOp 1:0 and Instr 5:0 and the output is ALUOpInput 3:0.

   Instruction  ALUOp 1:0  Instr 5:0  ALUOpInput 3:0
   lw           00         xxxxxx     0010
   sw           00         xxxxxx     0010
   beq          01         xxxxxx     0110
   j            xx         xxxxxx     xxxx
   add          10         xx0000     0010
   sub          10         xx0010     0110
   and          10         xx0100     0000
   or           10         xx0101     0001
   nor          10         xx0111     1100
   slt          10         xx1010     0111

Drawing the resulting combinational logic block that this truth table produces will be left as an exercise to the student (honestly, it is really not as complex as you might think it would be).

4.3.5 Datapath Components: ALU Src 2 Mux

The ALU Src 2 Mux is used to select the second operand for an ALU operation.

   Instruction  ALU Source Operand 1  ALU Source Operand 2
   lw, sw       rs                    sign-ext(imm 15:0)
   beq          rs                    rt
   R-format     rs                    rt
   j            n/a                   n/a

4.3.6 Datapath Components: Sign Extend Unit

We can always prepend 0-bits to a nonnegative binary number, changing the bit-length of the integer, without changing the number itself, e.g.,

   0101 (4 bits) = 5  ==>  00000101 (8 bits) = 5

The same is true of a negative binary number, i.e., we can always prepend 1-bits, e.g.,

   1011 (4 bits) = -5  ==>  11111011 (8 bits) = -5

The Sign Extend unit is used to convert the 16-bit immediate encoded in I-format instructions to a 32-bit word by replicating the sign bit. There is no logic involved: this simply involves wiring.
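Sign extension is easy to express in software as well. This is a minimal sketch (the function name is mine): replicate bit 15 into bits 31:16.

```python
# Sketch of the Sign Extend unit: copy the sign bit (bit 15) into bits 31:16.

def sign_extend16(imm16):
    """Extend a 16-bit value to a 32-bit value by replicating the sign bit."""
    if imm16 & 0x8000:                 # negative: prepend 1-bits
        return imm16 | 0xFFFF0000
    return imm16                       # nonnegative: prepend 0-bits

print(hex(sign_extend16(0x0005)))      # 0x5
print(hex(sign_extend16(0xFFF8)))      # 0xfffffff8, i.e., -8
```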

4.3.7 Datapath Components: Shift Left by Two Unit

Shifting left by two is simply wiring as well.

4.3.8 Datapath Components: Branch Adder

The beq instruction is an I-format instruction with the 16-bit immediate field being used to compute the branch target address using this formula:

   branch target address = (PC + 4) + (sign-ext(imm 15:0) << 2)

Most processor architectures form the branch target address by using an addressing mode known as PC-relative addressing, where a branch offset is added to PC to form the branch target address. If we formed the 32-bit branch target address by adding PC and the 16-bit immediate field of the branch instruction (which forms a two's complement offset in the range [-32768, 32767]), then we could branch to any address in the range [PC - 32768, PC + 32767]. To extend this range, MIPS treats the offset as words rather than bytes, i.e., offset = imm 15:0 << 2.

Although branch instructions are common (they are used to implement if statements and loops, which are very common in HLL programming), the majority of instructions that are executed are not branches. This means that the majority of the time, the next instruction that is executed will be the instruction following the one that is currently being executed. Consequently, the PC Adder proceeds to immediately calculate PC + 4 before the Control and the rest of the datapath have time to figure out that the instruction is a branch instruction and whether the branch will or will not be taken. Therefore, PC + 4, rather than PC, is the input to the Branch Adder.

Example: Consider this code. What would be the encoding of the beq and bne instructions assuming that the address of the add instruction is 0x0040_4000?
loop:     add $t0, $t0, $t1        # 0x0040_4000
          sll $t0, $t0, 2          # 0x0040_4004
          slt $t1, $t0, $zero      # 0x0040_4008
          beq $t1, $zero, false    # 0x0040_400C
          li  $t2, 13              # 0x0040_4010
          j   end_if               # 0x0040_4014
false:    li  $t2, -13             # 0x0040_4018
end_if:   bne $t0, $zero, loop     # 0x0040_401C
end_loop: nop                      # 0x0040_4020

When the beq instruction is fetched from memory to be executed, PC is 0x0040_400C and PC+4 is 0x0040_4010. The branch target address is 0x0040_4018:

   branch-target-address 31:0 = (PC + 4) + (sign-ext(imm 15:0) << 2)

Solving for imm 15:0 we have:

   imm 15:0 = (branch-target-address 31:0 - (PC + 4)) >> 2

Consequently,

   imm 15:0 = (0x0040_4018 - 0x0040_4010) >> 2
   imm 15:0 = 0x08 >> 2
   imm 15:0 = 1000 (binary) >> 2
   imm 15:0 = 10 (binary) = 2

The encoding for beq will be: 000100 01001 00000 0000000000000010 = 0x1120_0002.

When the bne instruction is fetched from memory to be executed, PC is 0x0040_401C and PC+4 is 0x0040_4020. The branch target address is 0x0040_4000. Consequently,

   imm 15:0 = (0x0040_4000 - 0x0040_4020) >> 2
   imm 15:0 = -0x20 >> 2    (note: -0x20 in 32-bit two's complement is 0xFFFF_FFE0)
   imm 15:0 = 0xFFFF_FFE0 >> 2
   imm 15:0 = 0xFFF8       (which is -8 in decimal)

The encoding for bne will be: 000101 01000 00000 1111111111111000 = 0x1500_FFF8.

4.3.9 Datapath Components: PC Src Mux

The value to be written to PC at the next clock edge is selected by what I will refer to as the PC Src Mux, which is selected by the output of an AND gate. The inputs to the AND gate are a control signal named Branch emanating from the Control (asserted when Instr 31:26 = 000100 = beq) and the Zero output from the ALU. Remember that Zero is 1 when the ALU result is 0. During the execution of beq, the contents of registers rs and rt are fed into the ALU and ALUOpInput is configured to perform a subtraction. If rs = rt, the ALU result will be 0 and both Zero and Branch will be 1, thus selecting the output from the Branch Adder. If rs ≠ rt, Zero will be 0, which causes the AND gate to output 0, selecting PC + 4 as the address to be written to PC.

4.3.10 Datapath Components: Dst Reg Mux

For R-format instructions (add, sub, and, or, nor, slt), at the next clock edge we must write a result to the rd register (encoded in Instr 15:11). sw, beq, and j do not write a result, but lw does, and for lw, the register to be written is rt (encoded in Instr 20:16).
Therefore, the Dst Reg Mux is used to select the destination register for those instructions that write to the register file. The mux is selected by a signal named RegDst which emanates from the Control.
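The beq/bne offset arithmetic worked through earlier in this section can be checked mechanically. This is a sketch using standard MIPS register numbers ($t0 = 8, $t1 = 9, $zero = 0) and opcodes (beq = 000100, bne = 000101); the function names are mine.

```python
# Sketch: compute the 16-bit branch offset and the full I-format encoding.

def branch_offset(target, branch_addr):
    """imm 15:0 = (target - (PC + 4)) >> 2, truncated to 16 bits."""
    return ((target - (branch_addr + 4)) >> 2) & 0xFFFF

def encode_i(opcode, rs, rt, imm16):
    """Assemble opcode | rs | rt | imm into one 32-bit I-format word."""
    return (opcode << 26) | (rs << 21) | (rt << 16) | (imm16 & 0xFFFF)

# beq $t1, $zero, false  at 0x0040_400C, target 0x0040_4018 -> offset 2
beq_word = encode_i(0b000100, 9, 0, branch_offset(0x00404018, 0x0040400C))
# bne $t0, $zero, loop   at 0x0040_401C, target 0x0040_4000 -> offset -8
bne_word = encode_i(0b000101, 8, 0, branch_offset(0x00404000, 0x0040401C))
print(hex(bne_word))    # 0x1500fff8, matching the worked example
```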

4.3.11 Datapath Components: Result Mux

The Result Mux selects the word to be written to the destination register on the next clock edge. For R-format instructions, the word will be the ALU Result, but for lw the value will be a word read from the Data Memory (DMEM).

4.3.12 The Control Unit

How do we decode the instruction? Decoding is performed by examining the op bits in Instr 31:26 and, for R-format instructions, the function bits in Instr 5:0. The Control receives the opcode bits of the instruction, examines them, and, using combinational logic, asserts control signals going to different parts of the datapath. Combining these two tables, we get the master control table on the top of p. 16.

Drawing the resulting combinational logic is left as an exercise for the student.

4.3.13 Implementing Jump

Implementing j is straightforward. The formula to compute the jump target address is:

   jump target address = (PC+4) 31:28 || (Instr 25:0 << 2)

where || means bit-concatenation. All of this is accomplished with wiring, a new mux (PC Src Mux 2), and a new control signal named Jump that is asserted when Instr 31:26 = 000010.

Example: Consider this code. What would be the encoding of the j instruction assuming that the address of the add instruction is 0x0040_4000?

loop:     add $t0, $t0, $t1        # 0x0040_4000
          sll $t0, $t0, 2          # 0x0040_4004
          slt $t1, $t0, $zero      # 0x0040_4008
          beq $t1, $zero, false    # 0x0040_400C
          li  $t2, 13              # 0x0040_4010
          j   end_if               # 0x0040_4014
false:    li  $t2, -13             # 0x0040_4018
end_if:   bne $t0, $zero, loop     # 0x0040_401C
end_loop: nop                      # 0x0040_4020

When the j end_if instruction is fetched from memory to be executed, PC is 0x0040_4014 and PC+4 is 0x0040_4018. The jump target address is 0x0040_401C:

   jump-target-address 31:0 = (PC+4) 31:28 || (addr 25:0 << 2)

Solving for addr 25:0 we have:

   addr 25:0 = jump-target-address 31:0 >> 2    (keeping only the low 26 bits)

Consequently,

   addr 25:0 = 0x0040_401C >> 2
   addr 25:0 = 0x0010_1007    (discarding the four msb's after shifting right)

The j opcode is 000010, so the instruction encoding will be: 000010 followed by the 26-bit address 0x0010_1007, which is 0x0810_1007.

Here is the completed single-cycle design block diagram showing the logic in the upper right corner to implement j.
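The jump encoding can be checked the same way as the branch encodings. This is a sketch; the function name is mine, and the assertion encodes the constraint that j can only reach targets in the same 256 MB region as PC+4 (their top four address bits must match).

```python
# Sketch of the J-format encoding from the example above.

def encode_j(target, pc):
    """Build the J-format word; requires target[31:28] == (PC+4)[31:28]."""
    assert (target >> 28) == ((pc + 4) >> 28), "target outside current 256 MB region"
    addr26 = (target >> 2) & 0x03FFFFFF     # word address, low 26 bits
    return (0b000010 << 26) | addr26        # j opcode is 000010

print(hex(encode_j(0x0040401C, 0x00404014)))    # 0x8101007
```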

Single-Cycle Summary

Every instruction finishes within one clock cycle. This restriction limits each combinational logic block to one operation per clock cycle, i.e., at the beginning of the clock cycle it starts computing some output and must be finished by the end of the clock cycle. Because each combinational logic block can only be used once, we must duplicate some of them. Which ones?

We have three adders (PC Adder, Branch Adder, ALU Adder).
We have two shift-left-by-2 units.
Memory cannot be read and written in the same clock cycle, so separate instruction and data memories were required.

The clock period, and hence clock frequency, is determined by the time it takes for the bits of the slowest instruction (lw) to travel through the datapath. The path the bits take is known as the critical path.

4.4 Pipelining

Laundry Analogy

Suppose we need to complete four loads of laundry. Each load requires four steps: wash the clothes in the washer; dry the clothes in the dryer; fold the clothes; put the clothes up. Suppose each step takes 30 minutes. How long will it take to complete all four loads? I hope you see that it will take 4 steps × 30 mins/step = 2 hours to complete one load, and if we complete the loads in sequence, the total time will be 2 hours/load × 4 loads = 8 hours.

However, if we were smart about it, we would quickly realize that while one load of clothes is in the dryer (which will take exactly 30 minutes to dry) another load could be placed in the washer. A little more thought would lead us to realize that we could speed things up even more if we followed this procedure:

1. Place a load in the washer.
2. When the washer is complete, move the load to the dryer, and start another load in the washer.
3. When the washer and dryer are both complete, move the clothes from the dryer to the folding table and fold them, move the clothes from the washer to the dryer, and start another load in the washer.
4. When the washer, dryer, and folding are all complete, ask your roommate to put the clothes up, move the clothes from the dryer to the folding table and fold them, move the clothes from the washer to the dryer, and start another load in the washer.

I hope you get the idea. Now how long does it take to complete all four loads of laundry? As the figure makes clear, 3.5 hours. Let's define a time unit to be the time for one step, i.e., 30 minutes, and time units per load to be the total number of time units it takes to complete one load of laundry, starting at the time we put the clothes in the washer to when they are all put up.

For method 1: 4 time units per load
For method 2: 4 time units per load

Hmm, that's strange. We don't seem to have saved any time, yet we finished 4.5 hours sooner using method 2. Let's define average time units per load to be the total number of time units to complete all four of the loads, divided by the number of loads:

For method 1: 16 time units ÷ 4 loads = 4 time units per load
For method 2: 7 time units ÷ 4 loads = 1.75 time units per load

Now we're on to something. By overlapping the work (i.e., performing steps in parallel) in method 2, we did not decrease the total time units that it took to complete any one load, but we did decrease the average time units for four loads. If you recall from Chapter 1, we discussed two primary performance metrics that we could focus on when measuring the performance of a system:

1. Execution (response) time - time to complete a program or task (want to minimize).
2. Throughput - number of programs or tasks completed per time unit (want to maximize).

If we think of a time unit as being a computer clock cycle and a load of laundry as being an instruction, then time units per load is analogous to execution time: the time to complete one instruction (4 clock cycles; this is called instruction latency) was the same whether we performed the steps sequentially or in parallel. The average time units per load is analogous to average clocks per instruction (CPI): the average time to complete one instruction (load of laundry) decreased when we performed the steps in parallel. The inverse of clocks per instruction is instructions per clock (IPC), and IPC measures throughput; i.e., by decreasing CPI, we are increasing IPC, which increases throughput.

Pipelining History

Method 2 is such a common technique, both in doing laundry and in computing, that we have a name for it: pipelining.
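The laundry arithmetic above generalizes to any number of loads and stages, and a few lines of Python make the pattern visible. This is a sketch; the function names are mine, and it assumes every stage takes exactly one time unit.

```python
# Sketch: n loads, s stages, one time unit per stage.

def sequential_time(n, s):
    """Method 1: finish each load completely before starting the next."""
    return n * s

def pipelined_time(n, s):
    """Method 2: s units to fill the pipe, then one load finishes per unit."""
    return s + (n - 1)

n, s = 4, 4
print(sequential_time(n, s))          # 16 time units (8 hours)
print(pipelined_time(n, s))           # 7 time units (3.5 hours)
print(pipelined_time(n, s) / n)       # 1.75 average time units per load
```

Notice that as n grows, the pipelined average s + (n - 1) divided by n approaches 1 time unit per load, which is the ideal throughput of a full pipeline.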
The basic idea of pipelining is to divide the execution of an instruction into steps, called stages, so that multiple instructions may be executed in parallel, with each instruction being in one pipeline stage at a time. The most important concept to know regarding pipelining is that it does not improve system performance by reducing the time to execute a single instruction (i.e., it does not decrease instruction latency); rather, it increases performance by increasing instruction throughput.

Pipelining is not a new idea. It was first used in the IBM 7030 "Stretch" computer system in 1961 (meaning we have been building pipelined architectures for roughly 55 years). Stretch was a computer designed to perform scientific and mathematical calculations very quickly. In its time, it was the fastest computer in the world (until 1964, when the CDC 6600 came along; the CDC 6600 is generally considered to be the world's first supercomputer, and it employed pipelining as well). Pipelining was then used extensively in the mainframes and supercomputers of the 1960s and 1970s. It did not become popular in the microprocessor world until the RISC philosophy came along in the early 1980s with the first MIPS

and RISC microprocessors. Today, all modern microprocessors use extensive pipelining because it is a proven technique for improving throughput.

Designing Instruction Sets for Pipelining

MIPS is an acronym for Microprocessor without Interlocked Pipeline Stages. It's difficult at this stage (ha ha) to fully explain what "interlocked" means, but we will learn it later. The original MIPS pipeline design employed five stages (this design is sometimes referred to as a classic pipeline design):

Stage 1: Fetch the instruction from memory (IF = instruction fetch).
Stage 2: Decode the instruction, read source operands from the register file (ID = instruction decode).
Stage 3: Perform any required ALU operation (EX = execute).
Stage 4: Perform any memory accesses (MEM = memory).
Stage 5: Write any results to a register (WB = write back).

In order to facilitate pipelining, each MIPS instruction was carefully designed.

1. All MIPS instructions are the same length (4 bytes in MIPS32). CISC processors commonly use variable-length instructions. Fixed-length instructions make it easier to fetch an instruction in the first pipeline stage (IF) while the second stage (ID) is decoding an instruction (decoding means determining which instruction this is so the control can assert the proper signals to make the datapath execute the instruction).

2. MIPS has only a few instruction formats (R, I, and J; there are actually a few others not discussed in the textbook). CISC processors often have many different instruction formats. Fewer instruction formats simplify the decoding logic in the control. Also, for R- and I-format instructions, the first two source operands rs and rt are always in bits 25:21 and 20:16 of the instruction. This enables the hardware to begin reading the register file in the second stage (ID) to obtain the first and second source operands at the same time the instruction is being decoded.

3. MIPS has only a few addressing modes.
Remember that addressing modes refer to ways of specifying the locations of operands. CISC processors typically have numerous addressing modes. In the MIPS subset we are implementing, the only addressing modes are,

Register Direct Addressing Mode: add $t1, $t2, $t3
    All three operands are in registers.
Immediate Addressing Mode: addi $t1, $t2, 100
    The operand 100 is an immediate. Register direct is used for $t1 and $t2.
Base + Displacement Addressing Mode: lw $t1, 4($sp)
    $sp (the base register) is added to 4 (the displacement). Register direct is used for $t1.
PC-Relative Addressing Mode: beq $t1, $t2, label
    PC+4 is added to sign-extend(imm) << 2. Register direct is used for $t1 and $t2.

Jump Addressing Mode: j label
    Bits 31:28 of PC+4 are concatenated with addr << 2.

Fewer addressing modes simplify the overall design.

4. lw and sw are the only instructions which access memory. This allows stage 3 EX to calculate the memory address, with the actual memory access being performed in the following stage 4 MEM. If it were possible to specify a memory address as an operand (e.g., in an add instruction, which is very typical of CISC processors) then stages 3 and 4 would need to be expanded to three stages: a stage to calculate the memory address (EX1), a stage to fetch the operand from memory (MEM), and a stage to perform the operation (EX2).

5. Instructions and operands are required to be word-aligned (i.e., at an address that is divisible by four). If an operand were not aligned on a word boundary, then a memory access would require two clock cycles rather than one.

Pipelining Speedup

Suppose a single-cycle design is implemented where,

1. A memory access takes 200 ps (from cache; a main memory access would be much slower).
2. An ALU operation completes in 200 ps.
3. A read from or write to the register file requires 100 ps.

The time to execute various instructions will be,

Instruction   Fetch Instr   Read Reg   ALU Op   Access Mem   Write Reg   Total
lw            200 ps        100 ps     200 ps   200 ps       100 ps      800 ps
sw            200 ps        100 ps     200 ps   200 ps       n/a         700 ps
R-format      200 ps        100 ps     200 ps   n/a          100 ps      600 ps
beq           200 ps        100 ps     200 ps   n/a          n/a         500 ps

Because each instruction must finish in exactly one clock cycle in the single-cycle design, the clock period must be greater than or equal to the time required by the slowest instruction, lw, i.e., clock period >= 800 ps. Since clock period and clock frequency are inversely related, the maximum clock frequency for this example is 1.25 GHz. Try to make it any faster, and data and control signals will not always arrive at their destinations before the next clock edge.
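The timing arithmetic above can be checked with a short script; the stage latencies and the stages each instruction class uses are exactly the ones assumed in this example (a sketch, not part of the original notes):

```python
# Stage latencies assumed in this example, in picoseconds.
STAGE_PS = {"IF": 200, "RegRead": 100, "ALU": 200, "MEM": 200, "RegWrite": 100}

# Stages each instruction class actually uses (the n/a entries are omitted).
USES = {
    "lw":       ["IF", "RegRead", "ALU", "MEM", "RegWrite"],
    "sw":       ["IF", "RegRead", "ALU", "MEM"],
    "R-format": ["IF", "RegRead", "ALU", "RegWrite"],
    "beq":      ["IF", "RegRead", "ALU"],
}

totals = {i: sum(STAGE_PS[s] for s in stages) for i, stages in USES.items()}
print(totals)  # {'lw': 800, 'sw': 700, 'R-format': 600, 'beq': 500}

# The single-cycle clock period must cover the slowest instruction (lw),
# so the maximum clock frequency is 1/800 ps = 1.25 GHz.
period_ps = max(totals.values())
freq_ghz = 1000 / period_ps  # 1000 ps period would be 1 GHz
print(freq_ghz)  # 1.25
```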
Consider this sequence of instructions,

lw $t0, 0($s0)    # will take 800 ps
lw $t1, 4($s0)    # will take 800 ps
lw $t2, 8($s0)    # will take 800 ps

CPI = (1 + 1 + 1)/3 = 1, IPC = 1/CPI = 1/1 = 1

So the total time to execute the three instructions is 2400 ps and the throughput, IPC, is 1. Now consider the five-stage pipeline (IF, ID, EX, MEM, WB) where each stage finishes in one clock cycle of 200 ps (in a manner similar to the clock cycle time of the single-cycle design, the pipeline stage clock period is determined by whichever stage takes the greatest amount of time; for lw that is 200 ps, both the memory access time and the ALU time). Consider the execution of the same sequence of lw instructions. Each lw will go through all five stages,

Stage 1 IF:  The lw instruction is fetched from memory (200 ps).
Stage 2 ID:  The control determines this is a lw; rs ($s0) is read (200 ps).
Stage 3 EX:  The ALU adds the offset and $s0 (200 ps).
Stage 4 MEM: The word is read from memory (200 ps).
Stage 5 WB:  The destination register rt is written with the word that was read (200 ps).

However, the key to pipelining is to recognize that multiple instructions may be executed in parallel as long as each stage of the pipeline is working on only one instruction at a time. This can be seen by drawing what we will call a pipeline diagram.

                    time -->
lw $t0, 0($s0)      IF  ID  EX  MEM WB                      Completes at 1000 ps
lw $t1, 4($s0)          IF  ID  EX  MEM WB                  Completes at 1200 ps
lw $t2, 8($s0)              IF  ID  EX  MEM WB              Completes at 1400 ps

So the total time to execute the three instructions is 1400 ps = 7 clock cycles. The average CPI is 7/3 = 2.33 clocks per instruction and the IPC is 1/2.33 = 0.43 instructions per clock, which seems worse than the single-cycle design, but note that we are comparing apples and oranges: the clock period for the single-cycle design is 800 ps and the clock period for each stage of the pipelined design is 200 ps (which means each instruction completes in 1000 ps). Under ideal conditions, if p_s is the clock period of a single-cycle design, then the clock period of a pipelined design with n stages should be p_p = p_s/n (we assume a balanced pipeline where each stage requires the same amount of time). When this holds, the time to execute k instructions on the single-cycle design will be t_s(k) = k*p_s. For the n-stage pipeline, the time to execute k instructions will be,

t_p(k) = p_s + (k - 1)p_p = p_s + (k - 1)(p_s/n) = p_s + k*p_s/n - p_s/n = p_s(1 + (k - 1)/n)

The pipeline speedup is defined to be,

speedup = single-cycle time / pipeline time = t_s(k)/t_p(k) = k*p_s / [p_s(1 + (k - 1)/n)] = k/(1 + (k - 1)/n) = kn/(k + n - 1)

For k >> n, speedup = kn/(k + n - 1) --> kn/k = n as k --> infinity.
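The speedup formula can be sanity-checked numerically; this sketch evaluates kn/(k + n - 1) and shows that the limit for large k is n (under the balanced-pipeline assumption, which our 200 ps stages only approximate):

```python
def pipeline_speedup(k, n):
    """Ideal speedup of an n-stage balanced pipeline over a single-cycle
    design when executing k instructions: k*n / (k + n - 1)."""
    return k * n / (k + n - 1)

print(pipeline_speedup(3, 5))          # only 3 instructions: 15/7, about 2.14
print(pipeline_speedup(1_000_000, 5))  # large k: approaches n = 5
```

Note that for k = 1 the speedup is exactly 1: a pipeline never makes a single instruction finish sooner, which is the latency-vs-throughput point made above.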
That is, the ideal speedup is proportional to the number of pipeline stages. In practice, the speedup is less than this for various reasons, including hazards.

Pipeline Hazards

In our lw example we assumed that each lw began executing in the IF stage exactly one clock cycle after the previous lw began executing. However, that was an ideal situation. In practice it is not always possible for the next instruction to begin executing in the next clock cycle due to hazards.

Structural Hazards

A structural hazard occurs when a planned instruction cannot execute in the proper clock cycle because the hardware does not support the combination of instructions that are set to execute.

This would happen when two instructions in different pipeline stages each need to use the same datapath component. Structural hazards are generally easy to design around, e.g., one can simply duplicate the datapath component.

Data Hazards

A data hazard occurs when a planned instruction cannot execute in the proper clock cycle because data that is needed to execute the instruction is not yet available. For the laundry analogy, assume you are putting away clothes and you determine that a sock is missing. A little searching around leads to the discovery that the missing sock is in the washer. Obviously, you cannot finish putting away the clothes until that sock makes its way out of the washer and through the dryer and folding stages. Data hazards are quite common and we will examine them in more detail later.

Control Hazards

A control hazard is also called a branch hazard because these arise during the execution of branch instructions. The problem basically boils down to an inability to load the pipeline with the correct next instruction because, at the time we are to fetch the next instruction, we have not yet determined if the branch is going to be taken or not.

4.5 Pipelined Datapath and Control

As we saw in the single-cycle design, a sequence of steps must be performed to execute each instruction. Every instruction begins execution by being fetched from memory. Now consider lw. The second step is to read rs (the base register) and sign-extend the 16-bit immediate to form a 32-bit immediate. The third step is to use the main ALU to compute the address in memory. The fourth step is to fetch the word from the data memory. The fifth, and final, step is to write the word that was read into the destination register encoded in rt. Since lw is the instruction that requires the most steps, if we are going to design a pipelined datapath using the single-cycle design as our starting point, then it will need at least five stages.
The figure on the next page shows how the five stages map onto the single-cycle datapath.

Instruction Fetch (IF): Fetches the instruction from memory. Computes PC + 4.
Instruction Decode (ID): The control determines which instruction this is in order to assert and deassert the necessary control signals. The register file is read (rs and rt) and the 16-bit immediate in Instr 15:0 is sign-extended to form a 32-bit immediate.
Execute (EX): The main ALU performs (or executes) an operation. The branch adder computes the branch target address.
Memory Access (MEM): For lw, the address output from the main ALU is sent to the data memory to read a word. For sw, the address output from the main ALU is sent to the data memory to write a word.
Writeback (WB): For an R-format instruction, the ALU result is written into the destination register encoded in rd. For lw, the word read from memory is written into the destination register encoded in rt. For sw, nothing is done because sw does not write to a register.

To see how instructions flow through the five stages, consider a graphical representation of each stage and the datapath components that are used in each stage, where IM means the instruction memory is being accessed, Reg means the register file is being read or written, ALU means the main ALU is performing an operation, and DM means the data memory is being read or written. The figure shows three lw instructions in the pipeline. First, note that each stage completes in one clock cycle, and since lw moves through all five stages, it requires five clock cycles to complete. For IM, Reg, and DM, the left half of the icon is shaded when it is being read or written in the first half of the clock cycle and the right half is shaded when it is being read or

written in the second half. Note that the register file is both read and written in each clock cycle, i.e., written in the first half and read in the second half. This can be accomplished by asserting the register write control signal on the rising clock edge and deasserting it on the falling clock edge. Now consider what happens in each clock cycle (the three lw instructions are named lw1, lw2, and lw3).

CC1: lw1 is fetched from IM.
CC2: lw2 is fetched from IM. The contents of rs = $0 for lw1 are read from Reg in the second half of the clock cycle.
CC3: The ALU adds rs = $0 to 100 to compute the address of the memory location that will be read for lw1. At the same time, lw3 is fetched from IM, and rs = $0 for lw2 is read from Reg in the second half of the clock cycle.
CC4: The word at address 100 is read from DM for lw1. Meanwhile, the ALU adds rs = $0 to 200 to compute the memory address for lw2, and rs = $0 for lw3 is read from Reg in the second half of the clock cycle.
CC5: The word read from DM[100] is written to the register encoded in the rt field of lw1 in the first half of the clock cycle. Meanwhile, the word at address 200 is read from DM for lw2, and the ALU adds rs = $0 to 300 to compute the memory address for lw3. Note that lw1 has completed at the end of this clock cycle.
CC6: The word read from DM[200] is written to the register encoded in the rt field of lw2 in the first half of the clock. Meanwhile, the word at address 300 is read from DM for lw3. Note that lw2 has completed at the end of this clock cycle.
CC7: The word read from DM[300] is written to the register encoded in the rt field of lw3 in the first half of the clock.
Note that lw3 has completed at the end of this clock cycle.

Hazards

The most significant challenge with pipelining is keeping the pipeline full. Things can happen which will, or could, prevent the next instruction from starting execution (IF) in the following clock cycle. These are called hazards, and as discussed in Section 4.5.1, the three main types are: structural, data, and control hazards.

Structural Hazards

A structural hazard arises when something about the structure of the hardware precludes two instructions from using the same datapath component during the same clock cycle. Consider a microprocessor with a von Neumann architecture (one combined memory for instructions and data) and a lw in the MEM stage trying to read from memory at the same time that another instruction is being fetched in the IF stage; see the pipeline diagram on the next page. As we saw with the single-cycle design, a simple fix is to employ a Harvard architecture: separate instruction and data memories. The structural hazard is eliminated because the lw in the MEM stage reads from the data memory while the instruction being fetched in the IF stage is read from the instruction memory.

Another structural hazard concerns two instructions which are trying to simultaneously write to and read from the register file. In this case the fix is to write to the register file in the first half of the clock cycle (on the rising clock edge) and read in the second half of the clock cycle (on the falling clock edge).

Data Hazards

A data hazard occurs when a data value that is needed in some stage of the pipeline is not yet available. For example,

                    C01 C02 C03 C04 C05 C06 C07 C08
add $s0, $t0, $t1   IF  ID  EX  MEM WB
add $s2, $s0, $s1       IF  ID  EX  MEM WB

The second add cannot execute in the EX stage until the clock cycle after the first add has written the sum of $t0 and $t1 to $s0 in its WB stage. A simple way to handle a data hazard is to stall the pipeline (commonly called inserting bubbles into the pipeline). For these two add instructions, how many bubbles would be required?

                    C01 C02 C03 C04 C05 C06 C07 C08
add $s0, $t0, $t1   IF  ID  EX  MEM WB
nop                     IF  ID  EX  MEM WB
nop                         IF  ID  EX  MEM WB
add $s2, $s0, $s1               IF  ID  EX  MEM WB

Bubbles can be inserted into the pipeline using a nop (no operation) instruction. In MIPS, nop is an assembler-generated pseudoinstruction which is equivalent to sll $zero, $zero, 0.
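The bubble count can be derived mechanically. In this sketch (assumptions: the 5-stage pipeline above, no forwarding, and the write-first-half/read-second-half register file, so a consumer's ID may share a cycle with the producer's WB), `distance` is how many instructions after the producer the consumer issues:

```python
def bubbles_needed(distance):
    """Stalls required, without forwarding, between an instruction that
    writes a register and a later instruction that reads it.

    Counting the producer's IF as cycle 1, it writes back in cycle 5 and
    the consumer naturally decodes in cycle 2 + distance. Because the
    register file is written in the first half of a cycle and read in the
    second half, the consumer's ID may coincide with the producer's WB,
    so we only stall until 2 + distance + stalls >= 5.
    """
    return max(0, 3 - distance)

print(bubbles_needed(1))  # adjacent add/add pair from the text: 2 bubbles
print(bubbles_needed(2))  # one unrelated instruction in between: 1 bubble
print(bubbles_needed(3))  # three or more apart: no stall needed
```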

Now consider this HLL code,

a = b + d;   // Assume a is at 8($sp), b at 12($sp), c at 16($sp),
c = b + e;   // d at 20($sp), and e at 24($sp).

And the compiler-generated assembly language code,

lw  $t0, 12($sp)     # $t0 = b
lw  $t1, 20($sp)     # $t1 = d
add $t1, $t0, $t1    # $t1 = b + d
sw  $t1, 8($sp)      # a = b + d
lw  $t1, 24($sp)     # $t1 = e
add $t1, $t0, $t1    # $t1 = b + e
sw  $t1, 16($sp)     # c = b + e

Any data hazards here?

                    C01 C02 C03 C04 C05 C06 C07 C08 C09 C10 C11
lw  $t0, 12($sp)    IF  ID  EX  MEM WB
lw  $t1, 20($sp)        IF  ID  EX  MEM WB
add $t1, $t0, $t1           IF  ID  EX  MEM WB
sw  $t1, 8($sp)                 IF  ID  EX  MEM WB
lw  $t1, 24($sp)                    IF  ID  EX  MEM WB
add $t1, $t0, $t1                       IF  ID  EX  MEM WB
sw  $t1, 16($sp)                            IF  ID  EX  MEM WB

Ideally this code would complete in 11 clocks. With stalling, two bubbles must precede each instruction that consumes a value its predecessor produces:

                    C01 C02 C03 C04 C05 C06 C07 C08 C09 C10 C11 C12 C13 C14 C15 C16 C17 C18 C19
lw  $t0, 12($sp)    IF  ID  EX  MEM WB
lw  $t1, 20($sp)        IF  ID  EX  MEM WB
add $t1, $t0, $t1                   IF  ID  EX  MEM WB
sw  $t1, 8($sp)                                 IF  ID  EX  MEM WB
lw  $t1, 24($sp)                                    IF  ID  EX  MEM WB
add $t1, $t0, $t1                                               IF  ID  EX  MEM WB
sw  $t1, 16($sp)                                                            IF  ID  EX  MEM WB

Requires inserting 8 bubbles and completes in 19 clocks, which is about 72% slower than the ideal code. Another technique for handling data hazards is instruction reordering. Consider,

lw  $t0, 12($sp)     # $t0 = b
lw  $t1, 20($sp)     # $t1 = d
lw  $t2, 24($sp)     # $t2 = e
add $t1, $t0, $t1    # $t1 = b + d
add $t2, $t0, $t2    # $t2 = b + e
sw  $t1, 8($sp)      # a = b + d
sw  $t2, 16($sp)     # c = b + e

Any data hazards here?

                    C01 C02 C03 C04 C05 C06 C07 C08 C09 C10 C11 C12 C13
lw  $t0, 12($sp)    IF  ID  EX  MEM WB
lw  $t1, 20($sp)        IF  ID  EX  MEM WB
lw  $t2, 24($sp)            IF  ID  EX  MEM WB
add $t1, $t0, $t1                   IF  ID  EX  MEM WB
add $t2, $t0, $t2                       IF  ID  EX  MEM WB
sw  $t1, 8($sp)                                 IF  ID  EX  MEM WB
sw  $t2, 16($sp)                                    IF  ID  EX  MEM WB

Requires inserting 2 bubbles and completes in 13 clocks, which is only 18% slower than the ideal code. A compiler will, during optimization, reorder instructions to eliminate as many data hazards as it can. Key points concerning data hazards and assembly language code: this example illustrates that the quality of the code generated by the compiler can affect execution time, so compiler writers and assembly language programmers have to have a good understanding of the hardware. However, instruction reordering cannot remove all bubbles. Consider,

# for whatever reason we cannot reorder the code above the lw
lw $t0, 8($sp)     # $t0 = a
sw $t0, 12($sp)    # b = a
# for whatever reason we cannot reorder the code below the sw

                    C01 C02 C03 C04 C05 C06 C07 C08
lw $t0, 8($sp)      IF  ID  EX  ME  WB
sw $t0, 12($sp)                 IF  ID  EX  ME  WB

Note: lw writes to the register file on the rising edge of the clock in clock cycle 5 and sw reads from the register file on the falling edge of the clock, also in clock cycle 5. Hence, 2 bubbles are required. Can we design the hardware to remove bubbles or to lessen their occurrence? That way, the compiler and the hardware could work together to reduce execution time. Consider this sequence,

                    C01 C02 C03 C04 C05 C06 C07 C08
add $s0, $t1, $t2   IF  ID  EX  ME  WB
add $s1, $s0, $t3               IF  ID  EX  ME  WB

which requires 2 bubbles. But note that the ALU produces $t1 + $t2 in the EX stage during clock cycle 3, and that sum is coming out of the ALU on the ALUResult line. If we could send that value forward (in time) so it becomes the first source operand to the ALU for the second add instruction during its EX stage, then we would have no bubbles.
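Under this forwarding scheme the stall counts change; the following sketch encodes the two cases (an ALU result becomes forwardable at the end of EX, cycle 3; a loaded word at the end of MEM, cycle 4 — cycle numbers count the producer's IF as cycle 1):

```python
def bubbles_with_forwarding(producer_is_load, distance):
    """Stall cycles with forwarding. ALU-to-ALU dependences cost nothing;
    a load followed immediately by a consumer (a MEM data hazard, the
    classic load-use case) still costs one bubble."""
    ready = 4 if producer_is_load else 3      # cycle the value becomes forwardable
    consumer_ex = 3 + distance                # the consumer's natural EX cycle
    return max(0, (ready + 1) - consumer_ex)  # EX can consume any earlier result

print(bubbles_with_forwarding(False, 1))  # add -> add (EX data hazard): 0
print(bubbles_with_forwarding(True, 1))   # lw -> add (MEM data hazard): 1
print(bubbles_with_forwarding(True, 2))   # lw, unrelated instr, add: 0
```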
                    C01 C02 C03 C04 C05 C06 C07 C08
add $s0, $t1, $t2   IF  ID  EX  ME  WB
add $s1, $s0, $t3       IF  ID  EX  ME  WB

This is called forwarding (also referred to as bypassing). We will call this an EX data hazard since the needed value is coming from the EX stage of a prior instruction. Next, consider this code,

                    C01 C02 C03 C04 C05 C06 C07 C08
lw  $t0, 8($sp)     IF  ID  EX  ME  WB
nop                     IF  ID  EX  ME  WB
nop                         IF  ID  EX  ME  WB
add $s0, $t0, $t1               IF  ID  EX  ME  WB

which requires 2 bubbles. But note that the value read from memory, which is to be written to $t0 during clock cycle 5, is available at the end of clock cycle 4. If we could send that value forward to become the first source operand to the ALU for the add instruction's EX stage, then we would only require 1 bubble,

                    C01 C02 C03 C04 C05 C06 C07 C08
lw  $t0, 8($sp)     IF  ID  EX  ME  WB
nop                     IF  ID  EX  ME  WB
add $s0, $t0, $t1           IF  ID  EX  ME  WB

We will call this a MEM data hazard since the needed value is coming from the MEM stage of a prior instruction. To summarize, remember these two key points about data hazards and forwarding:

Forwarding cannot completely eliminate pipeline stalls, but it may be able to reduce their occurrence.
Forwarding coupled with instruction reordering is more powerful than either technique alone.

Control Hazards

Also called branch hazards because these hazards arise in branching instructions. A branch hazard occurs when the proper instruction cannot execute in the proper pipeline clock cycle because the instruction that was fetched is not the one that is needed, i.e., the flow of instruction addresses is not what the pipeline expected. Consider the code below. With our current pipeline design, we can determine during the MEM stage if the branch is taken (because the branch adder calculates the branch target address when beq is in the EX stage while the main ALU calculates $rs - $rt and asserts Zero). We have three immediate options for pipelining it: (1) wait to fetch the next instruction until the MEM stage of the beq completes, when we know which instruction to fetch; (2) assume the branch will not be taken, and fetch the lw instruction in the next clock cycle; (3) assume the branch is taken and fetch the instruction at the branch target address in the next clock cycle.
                          C01 C02 C03 C04 C05 C06 C07 C08
add $v0, $t0, $t1         IF  ID  EX  ME  WB
beq $s0, $s1, label1          IF  ID  EX  ME  WB      <-- branch taken/not taken determined in MEM stage

        lw   $t3, 4($sp)
        addi $t5, $t6, 1
        j    label2
label1: add  $v0, $t0, $t1
label2:

Option 1: Stall the pipeline until the beq is resolved, then fetch the correct instruction in clock 6.

                          C01 C02 C03 C04 C05 C06 C07 C08 C09 C10
add $v0, $t0, $t1         IF  ID  EX  ME  WB
beq $s0, $s1, label1          IF  ID  EX  ME  WB
nop                               IF  ID  EX  ME  WB
nop                                   IF  ID  EX  ME  WB
nop                                       IF  ID  EX  ME  WB
lw $t3, 4($sp)                                IF  ID  EX  ME  WB
   -- or --
label1: add $v0, $t0, $t1                     IF  ID  EX  ME  WB

How does stalling impact performance? The instruction latency is 5 pipeline clocks. With stalling, each beq will essentially require 8 pipeline clocks. In one computer system benchmarking suite (SPECint2006) branches are 17% of the instructions. Suppose we analyze a program consisting of 10,000 instructions: 8,300 of them will complete in 5 pipeline clocks and 1,700 of them will complete in 8 pipeline clocks. The average instruction latency would be the sum of the pipeline clocks divided by the number of instructions, i.e.,

average pipeline CPI = [(8,300 x 5) + (1,700 x 8)] / 10,000 = 5.51

The stalling would lead to a 1 - 5/5.51 = 9.3% performance hit. This option is very costly, so clearly, it would be desirable to find a better way to pipeline branches.

Option 2: Assume the branch is not taken. Fetch lw in clock 3, addi in clock 4, and j in clock 5. If it turns out the branch is, in fact, not taken (we guessed correctly), then the proper sequence of instructions is in the pipeline, the pipeline has remained full, and there is no performance impact. However, if the branch is taken, then the lw, addi, and j that are in progress need to be flushed from the pipeline and we need to fetch add in clock 6. The performance hit for guessing incorrectly is 3 clocks, which is the same as for stalling. However, if it turns out that 75% of the time branches are not taken, then 75% of the time there will be no performance penalty (the average performance penalty would drop to 75% x 9.3% = 7%).
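The stall-penalty arithmetic works out as follows (a check of the numbers in the text, using its SPECint2006 branch mix of 17%):

```python
total = 10_000
branches = int(total * 0.17)   # 1,700 beq instructions
normal = total - branches      # 8,300 everything else

# With stalling, each beq effectively takes 8 pipeline clocks instead of 5.
avg_cpi = (normal * 5 + branches * 8) / total
print(avg_cpi)  # 5.51

hit = 1 - 5 / avg_cpi          # slowdown relative to the ideal 5 clocks
print(round(hit * 100, 1))  # 9.3
```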
                          C01 C02 C03 C04 C05 C06 C07 C08 C09 C10
add $v0, $t0, $t1         IF  ID  EX  ME  WB
beq $s0, $s1, label1          IF  ID  EX  ME  WB
lw   $t3, 4($sp)                  IF  ID  EX  ME  WB      \
addi $t5, $t6, 1                      IF  ID  EX  ME  WB   } flush these three instructions if necessary
j    label2                               IF  ID  EX  ME  WB
label1: add $v0, $t0, $t1                     IF  ID  EX  ME  WB   <-- 3 clock delay if branch taken

Option 3: Assume the branch will be taken. Fetch add in clock 3. If we determine in clock 5 that the branch is taken, then there is no performance impact (by the same argument as before, the average performance penalty is about 7%). On the other hand, if the branch is not taken, then we have to flush the add and fetch lw in clock 6. The equivalent performance impact is 3 clocks.

                          C01 C02 C03 C04 C05 C06 C07 C08 C09 C10
add $v0, $t0, $t1         IF  ID  EX  ME  WB
beq $s0, $s1, label1          IF  ID  EX  ME  WB
label1: add $v0, $t0, $t1         IF  ID  EX  ME  WB      <-- flush if branch not taken
lw   $t3, 4($sp)                              IF  ID  EX  ME  WB   \ 3 clock delay if branch
addi $t5, $t6, 1                                  IF  ID  EX  ME  WB / not taken

So we seem to have a performance impact of 3 clocks in all three scenarios. However, with stalling, the performance impact is always present and is always 3 clocks. When we guess by fetching either the next instruction in sequence or the instruction at the branch target address, then sometimes we will guess correctly and there will be no performance impact. If we guess incorrectly, then the performance impact is the same as if we had stalled, so clearly stalling is not a solution, i.e., it's better to guess. Guessing is a form of branch prediction, i.e., we are attempting to predict the future. Sometimes we will predict correctly and life is great; other times we will guess incorrectly and take a hit. However, notice that with the current pipeline design, we do not actually determine if the branch will be taken until the MEM stage of the beq instruction. What if we could determine this sooner, i.e., in a prior pipeline stage? If we reduce the clock delay for a wrong guess from 3 to 2, or 1, then there would still be a performance impact, but it would not be as severe. So the question is: with some modifications to the datapath, what would be the earliest pipeline stage in which we could determine if the branch will be taken? Remember the format of the beq instruction,

beq $rs, $rt, label

There are two basic operations that have to be performed: (1) determine if the contents of $rs and $rt are the same; and (2) compute the branch target address:

branch target address = (PC + 4) + (sign-ext(imm 15:0) << 2)

It would be easy to move the calculation of the branch target address from the EX stage to the ID stage.
PC+4 is already calculated in the IF stage, and we can easily sign-extend the 16-bit immediate and shift left by two in the ID stage, feeding those inputs into the branch adder, which is also moved to the ID stage. Determining if $rs = $rt does not actually require us to perform a subtraction using the main ALU. That was a convenient way to determine if they are equal, but it is simple to build an n-bit equality comparator by observing that a XOR a = 0.
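The XOR trick can be demonstrated in a few lines; this sketch mirrors the hardware (XOR corresponding bits, then check that the result is all zeros, which in hardware is a wide NOR of the XOR outputs):

```python
def regs_equal(rs_val, rt_val, width=32):
    """Equality comparator sketch: since a XOR a = 0 for every bit, two
    registers are equal exactly when the bitwise XOR of their contents
    is all zeros."""
    mask = (1 << width) - 1                  # keep only `width` bits
    return ((rs_val ^ rt_val) & mask) == 0   # all-zero XOR means equal

print(regs_equal(0x12345678, 0x12345678))  # True  -> branch taken
print(regs_equal(0x12345678, 0x12345679))  # False -> branch not taken
```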

Since $rs and $rt are read during the ID stage, it would be possible to add a 32-bit comparator to the ID stage. If the result of the comparator is 0, then the control would assert the proper control signal to cause PC to be updated with the branch target address. Consequently, a beq would essentially execute in 2 pipeline clocks, and using branch prediction we would reduce the performance impact of an incorrect guess to only 1 clock,

                          C01 C02 C03 C04 C05 C06 C07 C08 C09
add $v0, $t0, $t1         IF  ID  EX  ME  WB
beq $s0, $s1, label1          IF  ID  EX  ME  WB          <-- taken/not taken now determined in ID
lw $t3, 4($sp)                    IF                      <-- predict branch not taken; flush if wrong
label1: add $v0, $t0, $t1             IF  ID  EX  ME  WB  <-- fetch correct instruction, 1 clock hit

Many RISC processors of that era would schedule (when possible) an instruction that must always be executed, whether the branch is taken or not, following beq. The "space" created in the pipeline for the inserted instruction is called the branch delay slot. With the branch delay slot, when an instruction sequence can be reordered (by the compiler) so that a must-execute instruction fills the branch delay slot, the branch penalty is reduced to 0 clocks. If no instruction can be reordered to fill that slot, then the original 1-clock penalty still applies. Suffice it to say, correctly predicting branches is of such importance that there are numerous strategies for improving the prediction rate. Section 4.8 of the book discusses more efficient and complex strategies for branch prediction, which we will not discuss.

4.6 Pipeline Registers

Recall this sequence of instructions and the pipeline diagram,

lw $1, 100($0)
lw $2, 200($0)
lw $3, 300($0)

Question: Consider CC4. The main ALU computed the DM address 0 + 100 = 100 in CC3 and that address is sent to the DM so we can read DM[100] during CC4. But what is the ALU doing during CC4? It is adding 0 + 200 to compute the memory address for lw2. And where does that result go in CC5? To the DM. Now, the time that it takes to retrieve the word at DM[100] will be longer than the time it takes the ALU to compute 0 + 200, which means that the address inputs to DM cannot be changed during CC4; in other words, we need to store the result from the ALU in CC4 somewhere until that result is needed in CC5. In fact, that is not our only problem. Consider lw1. The instruction bits are fetched from the IM during CC1, but the 32-bit sign-extended immediate encoded in Instr 15:0 is not used until CC3, when it becomes the second source operand to the ALU. If we read the lw1 instruction during CC1 and stored the instruction bits in a register (an instruction register, or IR) so the bits would be available in later clock cycles, then what would happen during CC2 when lw2 is fetched? The instruction bits for lw2 would overwrite the instruction bits for lw1 stored in IR. This would seem to imply that we would need two instruction registers. Or maybe we would need five because, in theory, we could have up to five instructions in the pipeline at one time; each instruction would, of course, be in a different stage. The way to resolve this problem is to place pipeline registers between each pair of pipeline stages. The pipeline registers store bits obtained or generated in previous pipeline stages that are needed in successive pipeline stages.
The pipeline registers are named according to the stages they lie between, i.e., the IF/ID register separates the IF and ID stages, the ID/EX register separates the ID and EX stages, and so on.

IF/ID:  Stores bits from the IF stage that are used in the ID, EX, MEM, or WB stages.
ID/EX:  Stores bits from the IF or ID stages that are used in the EX, MEM, or WB stages.
EX/MEM: Stores bits from the IF, ID, or EX stages that are used in the MEM or WB stages.
MEM/WB: Stores bits from the IF, ID, EX, or MEM stages that are used in the WB stage.
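A toy software model makes the one-register-between-each-pair-of-stages idea concrete. Instruction names stand in for the real bit fields (PC + 4, rs, rt, the immediate, ALUResult, and so on); on every clock edge each pipeline register latches what the stage before it produced (a sketch with made-up names, not the actual hardware):

```python
REGS = ["IF/ID", "ID/EX", "EX/MEM", "MEM/WB"]

def clock(pipe, fetched):
    """Advance one cycle; return the instruction completing WB this cycle."""
    retiring = pipe["MEM/WB"]
    # Shift back-to-front so each value moves exactly one register per clock
    # and nothing is overwritten before it has been passed along.
    pipe["MEM/WB"] = pipe["EX/MEM"]
    pipe["EX/MEM"] = pipe["ID/EX"]
    pipe["ID/EX"] = pipe["IF/ID"]
    pipe["IF/ID"] = fetched
    return retiring

pipe = dict.fromkeys(REGS)  # empty pipeline
stream = ["lw1", "lw2", "lw3", None, None, None, None]
done = [clock(pipe, instr) for instr in stream]
print(done)  # [None, None, None, None, 'lw1', 'lw2', 'lw3']
```

Note how lw1 reaches WB on the fifth clock while lw2 and lw3 follow one clock apart, which is exactly the three-lw pipeline diagram from earlier.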

As with the register file, we can both read from and write to a pipeline register in each clock cycle: we write in the first half on the rising clock edge, and read in the second half on the falling clock edge (see the figure in the textbook). The next question then is: what are the contents of these registers? Let's consider lw $t0, 4($sp) as it moves through the pipeline. The format of lw is: op (6 bits), rs (base register = $sp), rt (destination register = $t0), imm 15:0 = 4.

1. At the beginning of CC1 the contents of PC is sent to IM and the lw instruction at that address is fetched. At the same time, the contents of PC is added to 4 by the PC adder. Any bits created in stage n of the pipeline that are needed in stages n + 1, n + 2, ... must be written to the pipeline register that separates stage n from stage n + 1. For lw, the bits that must be forwarded are,

op        To ID, used by the Control
rs        To ID, address of the register to be read from
rt        To WB, address of the register to be written to
imm 15:0  To ID, to be sign-extended and ultimately used in EX to calculate the DM address

These bits will be written to IF/ID on the rising edge of CC2.

2. On the rising edge of CC2, bits from the IF stage will be written to IF/ID. Note that PC + 4 will also be written to PC, which will cause the instruction following the lw to be fetched from the instruction memory; that instruction will enter the pipeline in the IF stage. On the falling edge of CC2, rs (now stored in IF/ID) is sent to the register file and the contents of rs is read. During CC2, imm 15:0 will be sign-extended to form imm 31:0. Also during ID, the control logic will assert and deassert control signals that will be written to ID/EX on the rising edge of CC3; these control signals will travel through the subsequent stages along with the relevant bits of the instruction being executed in each stage.
We will discuss the control logic soon, but for now, note that once the op bits are read from IF/ID and used by the control, they are not needed in later stages (see Fig. 4.51).

The bits that must be forwarded are,

   rt         To WB, address of the register to write to
   imm[31:0]  To EX, used to calculate the DM address to read from
   $rs        To EX, contents of $rs, used to calculate the DM address to read from

These bits will be written to ID/EX on the rising edge of CC3.

3. On the rising edge of CC3, bits from the ID stage will be written to ID/EX. On the falling edge of CC3, the contents of $rs (now stored in ID/EX) will be sent to the main ALU as source operand 1. The 32-bit immediate imm[31:0] will be read from ID/EX and sent to the main ALU as source operand 2. The ALU will compute the DM address. The bits that must be forwarded are,

   rt         To WB, address of the register to write to
   ALUResult  To MEM, DM address to read from

These bits will be written to EX/MEM on the rising edge of CC4.

4. On the rising edge of CC4, bits from the EX stage will be written to EX/MEM. The ALU result from the EX stage will be sent to DM to specify the address to read from. Since this is not a branch instruction, the Control will deassert PCSrc to cause PC to be written with PC + 4 on the rising edge of CC5. The bits that must be forwarded are,

   rt       To WB, address of the register to write to
   DM word  To WB, the word that was read from DM, to be written to register rt

These bits will be written to MEM/WB on the rising edge of CC5.

5. On the rising edge of CC5, bits from the MEM stage will be written to MEM/WB. On the falling edge of CC5, rt will be read from MEM/WB and sent to the Write Register input of the register file. Also, the word that was read from DM will be sent to the Write Data input of the register file and will be written on the rising edge of CC6.

This completes the pipelined execution of the lw $t0, 4($sp) instruction. However, lw is not the only instruction our pipeline supports, and other instructions will require additional bits in the pipeline registers. Remember, the other instructions are,

   add $rd, $rs, $rt
   and $rd, $rs, $rt
   beq $rs, $rt, label
   j label
   or $rd, $rs, $rt
   slt $rd, $rs, $rt
   sub $rd, $rs, $rt
   sw $rt, imm[15:0]($rs)

Determining the pipeline register contents for the remaining instructions will be left as an exercise for the student.

4.7 Pipeline Control

The figure in the textbook shows the pipeline control signals, which are derived from the single-cycle design: ALUOp, ALUSrc, Branch, MemRead, MemToReg, MemWrite, PCSrc, RegDst, and RegWrite. These can be grouped by the pipeline stage in which each control signal is needed:

IF: We always write either PC + 4 or the branch target address to PC. For a beq instruction, the state of the PCSrc control signal (asserted) will come from the beq MEM stage if the branch is taken. In all other cases (a non-beq instruction, or the branch is not taken), PCSrc will be deasserted to cause PC + 4 to be written to PC. In the IF stage we always read from the instruction memory, so no write control signal is needed for the instruction memory (how the instruction bits find their way into the IM is a mystery to me).

ID: We always read the contents of $rs and $rt from the register file, whether those words are needed or not. This is so even for the j instruction, where the rs and rt fields do not exist; the register file is still read, but the words that are read are simply not used.
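The five steps above can be summarized as a toy trace of which bits each pipeline register must carry for lw, and on which rising clock edge they are written (a minimal illustration of the walkthrough, not a datapath simulator):

```python
# Bits each pipeline register must carry for lw $t0, 4($sp), per the
# walkthrough above, keyed by the register they are written to.
carried = {
    "IF/ID":  ["op", "rs", "rt", "imm[15:0]"],    # written on rising edge of CC2
    "ID/EX":  ["rt", "imm[31:0]", "$rs value"],   # written on rising edge of CC3
    "EX/MEM": ["rt", "ALUResult"],                # written on rising edge of CC4
    "MEM/WB": ["rt", "DM word"],                  # written on rising edge of CC5
}

# rt (the write-register address) must ride through every pipeline register
# because it is not used until the WB stage.
for cc, (reg, bits) in enumerate(carried.items(), start=2):
    print(f"rising edge of CC{cc}: write {bits} to {reg}")
```

Note that rt appears in every register: a field produced in IF but consumed in WB must be forwarded through every intervening pipeline register.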
EX: ALUSrc selects the second source operand of the ALU. ALUOp is generated by the main control and routed to the ALU control to select the ALU operation. For the j instruction, which does not use the ALU, these control signals are don't cares. For those instructions that write to a destination register (all but beq, j, and sw), RegDst selects the destination register (Instr[20:16] = rt for lw, or Instr[15:11] = rd for R-format instructions).

MEM: For beq, Branch must be asserted. The Zero output of the ALU will be forwarded from the beq EX stage to the beq MEM stage via the EX/MEM register. If both of those signals are asserted, then the output of the AND gate, labeled PCSrc, will be asserted, which will cause the branch target address to be written to PC on the rising edge of the next clock, i.e., when beq moves to the WB stage. MemWrite is asserted for sw. MemRead is asserted for lw.

WB: RegWrite is asserted for those instructions that write to a destination register (all but beq, j, and sw). MemToReg selects the word to be written to the destination register (either the ALU result or the word read from the data memory).

The figure in the textbook summarizes the control signals. The control signals are generated by the main control when an instruction reaches the ID stage. The states of these signals must then be propagated through the pipeline registers as the instruction moves through the EX, MEM, and WB stages; see Fig. 4.50.

4.8 Resolving Data Hazards by Forwarding

Earlier we discussed two types of data hazards that can be resolved in hardware,

   EX data hazard: arises when a word is available in the EX stage of instruction X that is needed as an ALU operand for the next instruction Y, which will be in the EX stage in the next clock.

   MEM data hazard: arises when a word is available in the MEM stage of instruction X that is needed as an ALU operand for a subsequent instruction Y (Y is the instruction that follows the instruction that follows X), which will be in the EX stage in the next clock.

Resolving data hazards in hardware requires,

   1. Detecting the hazard
   2. Forwarding the value
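The per-stage grouping above can be collected into a table of control-signal values per instruction class. The sketch below follows the standard single-cycle control values from Patterson & Hennessy (X marks a don't care); treat it as an illustration of the grouping, not a reproduction of the notes' figure:

```python
# Control signal values by instruction class (EX, MEM, and WB groups).
# X = don't care. Values follow the standard P&H single-cycle control table.
X = None
control = {
    "R-format": dict(RegDst=1, ALUSrc=0, MemToReg=0, RegWrite=1,
                     MemRead=0, MemWrite=0, Branch=0, ALUOp="10"),
    "lw":       dict(RegDst=0, ALUSrc=1, MemToReg=1, RegWrite=1,
                     MemRead=1, MemWrite=0, Branch=0, ALUOp="00"),
    "sw":       dict(RegDst=X, ALUSrc=1, MemToReg=X, RegWrite=0,
                     MemRead=0, MemWrite=1, Branch=0, ALUOp="00"),
    "beq":      dict(RegDst=X, ALUSrc=0, MemToReg=X, RegWrite=0,
                     MemRead=0, MemWrite=0, Branch=1, ALUOp="01"),
}

# Only lw and R-format instructions assert RegWrite in the WB stage,
# matching the "all but beq, j, and sw" rule above.
writers = [i for i, sig in control.items() if sig["RegWrite"] == 1]
print(writers)   # ['R-format', 'lw']
```

These nine bits (plus PCSrc, computed in MEM) are exactly the states that must travel through ID/EX, EX/MEM, and MEM/WB alongside the instruction.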

The text uses this notation to simplify the discussion,

   ID/EX.RegisterRs    rs field stored in the ID/EX pipeline register
   ID/EX.RegisterRt    rt field stored in the ID/EX pipeline register
   EX/MEM.RegisterRd   rd field stored in the EX/MEM pipeline register
   MEM/WB.RegisterRd   rd field stored in the MEM/WB pipeline register

Consider this code and the pipeline diagram,

                       C01  C02  C03  C04  C05  C06
   sub $t0, $t1, $t2   IF   ID   EX   MEM  WB
   and $t3, $t0, $t4        IF   ID   EX   MEM  WB

This code illustrates an EX hazard: the result generated by the ALU for the sub instruction in the EX stage (clock 3) is needed as an ALU source operand for the and instruction in its EX stage (clock 4). This particular type of hazard is denoted,

   1a. EX/MEM.RegisterRd = ID/EX.RegisterRs

The other three specific data hazards that can be detected are,

   1b. EX/MEM.RegisterRd = ID/EX.RegisterRt
   2a. MEM/WB.RegisterRd = ID/EX.RegisterRs
   2b. MEM/WB.RegisterRd = ID/EX.RegisterRt

Examples,

   Type 1b:
   sub $t0, $t1, $t2   # EX/MEM.RegisterRd = $t0
   and $t3, $t4, $t0   # ID/EX.RegisterRt = $t0

   Type 2a:
   sub $t0, $t1, $t2   # MEM/WB.RegisterRd = $t0
   lw  $t9, 0($t9)     #
   and $t3, $t0, $t4   # ID/EX.RegisterRs = $t0

   Type 2b:
   sub $t0, $t1, $t2   # MEM/WB.RegisterRd = $t0
   lw  $t9, 0($t9)     #
   and $t3, $t4, $t0   # ID/EX.RegisterRt = $t0

Remember that the $zero register can be both a source and a destination register,

   sll $zero, $zero, 0   # EX/MEM.RegisterRd = $zero
   and $t3, $t4, $zero   # ID/EX.RegisterRt = $zero

This is not a type 1b EX data hazard because the sll instruction is not actually going to write to the $zero register when it reaches the WB stage. To account for this, the hazard detection equations must be modified,

   Detect type 1a (EX/MEM.RegisterRd = ID/EX.RegisterRs)
   If EX/MEM.RegWrite = 1
   And EX/MEM.RegisterRd ≠ $zero
   And EX/MEM.RegisterRd = ID/EX.RegisterRs
   Then forward the ALU result from EX/MEM to ALU input 1

In English: if the instruction entering the MEM stage will write to a register when it reaches the WB stage, and its destination register rd is not $zero, and the instruction entering the EX stage uses the value that is going to be written (in register rs of the instruction entering the EX stage), then forward the value that is going to be written to rd to be the first input to the ALU.

   Detect type 1b (EX/MEM.RegisterRd = ID/EX.RegisterRt)
   If EX/MEM.RegWrite = 1
   And EX/MEM.RegisterRd ≠ $zero
   And EX/MEM.RegisterRd = ID/EX.RegisterRt
   Then forward the ALU result from EX/MEM to ALU input 2

In English: if the instruction entering the MEM stage will write to a register when it reaches the WB stage, and its destination register rd is not $zero, and the instruction entering the EX stage uses the value that is going to be written (in register rt of the instruction entering the EX stage), then forward the value that is going to be written to rd to be the second input to the ALU.

   Detect type 2a (MEM/WB.RegisterRd = ID/EX.RegisterRs)
   If MEM/WB.RegWrite = 1
   And MEM/WB.RegisterRd ≠ $zero
   And MEM/WB.RegisterRd = ID/EX.RegisterRs
   Then forward the word to be written to MEM/WB.RegisterRd to ALU input 1

In English: if the instruction entering the WB stage will write to a register, and its destination register rd is not $zero, and the instruction entering the EX stage uses the value that is going to be written (in register rs of the instruction entering the EX stage), then forward the value that is going to be written to rd to be the first input to the ALU.

   Detect type 2b (MEM/WB.RegisterRd = ID/EX.RegisterRt)
   If MEM/WB.RegWrite = 1
   And MEM/WB.RegisterRd ≠ $zero
   And MEM/WB.RegisterRd = ID/EX.RegisterRt
   Then forward the word to be written to MEM/WB.RegisterRd to ALU input 2

In English: if the instruction entering the WB stage will write to a register, and its destination register rd is not $zero, and the instruction entering the EX stage uses the value that is going to be written (in register rt of the instruction entering the EX stage), then forward the value that is going to be written to rd to be the second input to the ALU.

Now that these data hazards can be detected, the datapath must be modified so the data is properly forwarded. Consider the example of Fig. 4.53: in CC3, the ALU generates for sub the word that will be written to $2 in CC5. This word will be written to the EX/MEM pipeline register at the beginning of CC4 and is needed in the EX stage of the and in CC4 (a type 1a EX hazard). Similarly, this word will be written to the MEM/WB pipeline register at the beginning of CC5 and is needed in the EX stage of the or in CC5 (a type 2b MEM hazard).

The first modification to the datapath is to place multiplexors before the two ALU source operand inputs; see Fig. 4.54.

The Forwarding Unit in the EX stage receives as inputs: ID/EX.RegisterRs and ID/EX.RegisterRt (the rs and rt fields of the instruction moving from the ID stage to the EX stage); EX/MEM.RegisterRd (the rd field of the instruction moving from the EX stage to the MEM stage); and MEM/WB.RegisterRd (the rd field of the instruction moving from the MEM stage to the WB stage). The Forwarding Unit outputs two 2-bit control signals: ForwardA, which selects the first source operand of the ALU, and ForwardB, which selects the second source operand.

The inputs to mux A are (from top to bottom): 00 = the contents of $rs stored in ID/EX; 01 = the output of the MemToReg mux (either the ALU result or the word read from data memory of the instruction that is in the WB stage); 10 = the ALU result of the instruction that is in the MEM stage. The inputs to mux B are the same, except that input 00 is the contents of $rt stored in ID/EX.

Rewriting the hazard detection equations to specify the states of ForwardA and ForwardB:

   Detect type 1a (EX/MEM.RegisterRd = ID/EX.RegisterRs)
   If EX/MEM.RegWrite = 1
   And EX/MEM.RegisterRd ≠ $zero
   And EX/MEM.RegisterRd = ID/EX.RegisterRs
   Then ForwardA = 10

   Detect type 1b (EX/MEM.RegisterRd = ID/EX.RegisterRt)
   If EX/MEM.RegWrite = 1
   And EX/MEM.RegisterRd ≠ $zero
   And EX/MEM.RegisterRd = ID/EX.RegisterRt
   Then ForwardB = 10

   Detect type 2a (MEM/WB.RegisterRd = ID/EX.RegisterRs)
   If MEM/WB.RegWrite = 1
   And MEM/WB.RegisterRd ≠ $zero
   And MEM/WB.RegisterRd = ID/EX.RegisterRs
   Then ForwardA = 01

   Detect type 2b (MEM/WB.RegisterRd = ID/EX.RegisterRt)
   If MEM/WB.RegWrite = 1
   And MEM/WB.RegisterRd ≠ $zero
   And MEM/WB.RegisterRd = ID/EX.RegisterRt
   Then ForwardB = 01
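The four detection equations translate directly into combinational logic. Below is a Python sketch of the Forwarding Unit: register fields are plain integers with 0 standing for $zero, the 2-bit mux selects are returned as strings, and (as an added assumption beyond the equations above) the EX-hazard check is given priority over the MEM-hazard check so the most recent value wins when both match:

```python
# Forwarding Unit: compute the ForwardA/ForwardB mux selects from the
# pipeline register fields, per the four detection equations.
def forwarding_unit(id_ex_rs, id_ex_rt,
                    ex_mem_regwrite, ex_mem_rd,
                    mem_wb_regwrite, mem_wb_rd):
    forward_a = "00"   # default: first ALU operand comes from ID/EX ($rs)
    forward_b = "00"   # default: second ALU operand comes from ID/EX ($rt)

    # Types 2a/2b: MEM hazard, forward the WB-stage value from MEM/WB.
    if mem_wb_regwrite and mem_wb_rd != 0:
        if mem_wb_rd == id_ex_rs: forward_a = "01"
        if mem_wb_rd == id_ex_rt: forward_b = "01"

    # Types 1a/1b: EX hazard, forward the MEM-stage ALU result from EX/MEM.
    # Checked last so the more recent EX/MEM value overrides MEM/WB.
    if ex_mem_regwrite and ex_mem_rd != 0:
        if ex_mem_rd == id_ex_rs: forward_a = "10"
        if ex_mem_rd == id_ex_rt: forward_b = "10"

    return forward_a, forward_b

# sub $2,$1,$3 in MEM, and $12,$2,$5 in EX: type 1a hazard on rs.
print(forwarding_unit(2, 5, 1, 2, 0, 0))   # ('10', '00')
# sll $zero,... ahead in the pipeline: rd = $zero, so nothing is forwarded.
print(forwarding_unit(0, 4, 1, 0, 0, 0))   # ('00', '00')
```

In hardware these conditions are simply comparators and AND gates driving the two mux select lines; the $zero and RegWrite checks prevent forwarding a value that will never actually be written.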


More information

Mapping Control to Hardware

Mapping Control to Hardware C A P P E N D I X A custom format such as this is slave to the architecture of the hardware and the instruction set it serves. The format must strike a proper compromise between ROM size, ROM-output decoding,

More information

The Big Picture: Where are We Now? EEM 486: Computer Architecture. Lecture 3. Designing a Single Cycle Datapath

The Big Picture: Where are We Now? EEM 486: Computer Architecture. Lecture 3. Designing a Single Cycle Datapath The Big Picture: Where are We Now? EEM 486: Computer Architecture Lecture 3 The Five Classic Components of a Computer Processor Input Control Memory Designing a Single Cycle path path Output Today s Topic:

More information

Advanced Parallel Architecture Lessons 5 and 6. Annalisa Massini /2017

Advanced Parallel Architecture Lessons 5 and 6. Annalisa Massini /2017 Advanced Parallel Architecture Lessons 5 and 6 Annalisa Massini - Pipelining Hennessy, Patterson Computer architecture A quantitive approach Appendix C Sections C.1, C.2 Pipelining Pipelining is an implementation

More information

Major CPU Design Steps

Major CPU Design Steps Datapath Major CPU Design Steps. Analyze instruction set operations using independent RTN ISA => RTN => datapath requirements. This provides the the required datapath components and how they are connected

More information

CSE140: Components and Design Techniques for Digital Systems

CSE140: Components and Design Techniques for Digital Systems CSE4: Components and Design Techniques for Digital Systems Tajana Simunic Rosing Announcements and Outline Check webct grades, make sure everything is there and is correct Pick up graded d homework at

More information

The overall datapath for RT, lw,sw beq instrucution

The overall datapath for RT, lw,sw beq instrucution Designing The Main Control Unit: Remember the three instruction classes {R-type, Memory, Branch}: a) R-type : Op rs rt rd shamt funct 1.src 2.src dest. 31-26 25-21 20-16 15-11 10-6 5-0 a) Memory : Op rs

More information

LECTURE 3: THE PROCESSOR

LECTURE 3: THE PROCESSOR LECTURE 3: THE PROCESSOR Abridged version of Patterson & Hennessy (2013):Ch.4 Introduction CPU performance factors Instruction count Determined by ISA and compiler CPI and Cycle time Determined by CPU

More information

The MIPS Processor Datapath

The MIPS Processor Datapath The MIPS Processor Datapath Module Outline MIPS datapath implementation Register File, Instruction memory, Data memory Instruction interpretation and execution. Combinational control Assignment: Datapath

More information

Computer Science 324 Computer Architecture Mount Holyoke College Fall Topic Notes: MIPS Instruction Set Architecture

Computer Science 324 Computer Architecture Mount Holyoke College Fall Topic Notes: MIPS Instruction Set Architecture Computer Science 324 Computer Architecture Mount Holyoke College Fall 2009 Topic Notes: MIPS Instruction Set Architecture vonneumann Architecture Modern computers use the vonneumann architecture. Idea:

More information

4. The Processor Computer Architecture COMP SCI 2GA3 / SFWR ENG 2GA3. Emil Sekerinski, McMaster University, Fall Term 2015/16

4. The Processor Computer Architecture COMP SCI 2GA3 / SFWR ENG 2GA3. Emil Sekerinski, McMaster University, Fall Term 2015/16 4. The Processor Computer Architecture COMP SCI 2GA3 / SFWR ENG 2GA3 Emil Sekerinski, McMaster University, Fall Term 2015/16 Instruction Execution Consider simplified MIPS: lw/sw rt, offset(rs) add/sub/and/or/slt

More information

Data paths for MIPS instructions

Data paths for MIPS instructions You are familiar with how MIPS programs step from one instruction to the next, and how branches can occur conditionally or unconditionally. We next examine the machine level representation of how MIPS

More information

EECS150 - Digital Design Lecture 10- CPU Microarchitecture. Processor Microarchitecture Introduction

EECS150 - Digital Design Lecture 10- CPU Microarchitecture. Processor Microarchitecture Introduction EECS150 - Digital Design Lecture 10- CPU Microarchitecture Feb 18, 2010 John Wawrzynek Spring 2010 EECS150 - Lec10-cpu Page 1 Processor Microarchitecture Introduction Microarchitecture: how to implement

More information

CPE 335. Basic MIPS Architecture Part II

CPE 335. Basic MIPS Architecture Part II CPE 335 Computer Organization Basic MIPS Architecture Part II Dr. Iyad Jafar Adapted from Dr. Gheith Abandah slides http://www.abandah.com/gheith/courses/cpe335_s08/index.html CPE232 Basic MIPS Architecture

More information

RISC Processor Design

RISC Processor Design RISC Processor Design Single Cycle Implementation - MIPS Virendra Singh Indian Institute of Science Bangalore virendra@computer.org Lecture 13 SE-273: Processor Design Feb 07, 2011 SE-273@SERC 1 Courtesy:

More information

These actions may use different parts of the CPU. Pipelining is when the parts run simultaneously on different instructions.

These actions may use different parts of the CPU. Pipelining is when the parts run simultaneously on different instructions. MIPS Pipe Line 2 Introduction Pipelining To complete an instruction a computer needs to perform a number of actions. These actions may use different parts of the CPU. Pipelining is when the parts run simultaneously

More information

Lecture Topics. Announcements. Today: Single-Cycle Processors (P&H ) Next: continued. Milestone #3 (due 2/9) Milestone #4 (due 2/23)

Lecture Topics. Announcements. Today: Single-Cycle Processors (P&H ) Next: continued. Milestone #3 (due 2/9) Milestone #4 (due 2/23) Lecture Topics Today: Single-Cycle Processors (P&H 4.1-4.4) Next: continued 1 Announcements Milestone #3 (due 2/9) Milestone #4 (due 2/23) Exam #1 (Wednesday, 2/15) 2 1 Exam #1 Wednesday, 2/15 (3:00-4:20

More information

Pipelined Processor Design

Pipelined Processor Design Pipelined Processor Design Pipelined Implementation: MIPS Virendra Singh Computer Design and Test Lab. Indian Institute of Science (IISc) Bangalore virendra@computer.org Advance Computer Architecture http://www.serc.iisc.ernet.in/~viren/courses/aca/aca.htm

More information

ECE369. Chapter 5 ECE369

ECE369. Chapter 5 ECE369 Chapter 5 1 State Elements Unclocked vs. Clocked Clocks used in synchronous logic Clocks are needed in sequential logic to decide when an element that contains state should be updated. State element 1

More information

EECS150 - Digital Design Lecture 9- CPU Microarchitecture. Watson: Jeopardy-playing Computer

EECS150 - Digital Design Lecture 9- CPU Microarchitecture. Watson: Jeopardy-playing Computer EECS150 - Digital Design Lecture 9- CPU Microarchitecture Feb 15, 2011 John Wawrzynek Spring 2011 EECS150 - Lec09-cpu Page 1 Watson: Jeopardy-playing Computer Watson is made up of a cluster of ninety IBM

More information

Chapter 5: The Processor: Datapath and Control

Chapter 5: The Processor: Datapath and Control Chapter 5: The Processor: Datapath and Control Overview Logic Design Conventions Building a Datapath and Control Unit Different Implementations of MIPS instruction set A simple implementation of a processor

More information

Materials: 1. Projectable Version of Diagrams 2. MIPS Simulation 3. Code for Lab 5 - part 1 to demonstrate using microprogramming

Materials: 1. Projectable Version of Diagrams 2. MIPS Simulation 3. Code for Lab 5 - part 1 to demonstrate using microprogramming CS311 Lecture: CPU Control: Hardwired control and Microprogrammed Control Last revised October 18, 2007 Objectives: 1. To explain the concept of a control word 2. To show how control words can be generated

More information

CS 61C: Great Ideas in Computer Architecture Datapath. Instructors: John Wawrzynek & Vladimir Stojanovic

CS 61C: Great Ideas in Computer Architecture Datapath. Instructors: John Wawrzynek & Vladimir Stojanovic CS 61C: Great Ideas in Computer Architecture Datapath Instructors: John Wawrzynek & Vladimir Stojanovic http://inst.eecs.berkeley.edu/~cs61c/fa15 1 Components of a Computer Processor Control Enable? Read/Write

More information

Pipelining Analogy. Pipelined laundry: overlapping execution. Parallelism improves performance. Four loads: Non-stop: Speedup = 8/3.5 = 2.3.

Pipelining Analogy. Pipelined laundry: overlapping execution. Parallelism improves performance. Four loads: Non-stop: Speedup = 8/3.5 = 2.3. Pipelining Analogy Pipelined laundry: overlapping execution Parallelism improves performance Four loads: Speedup = 8/3.5 = 2.3 Non-stop: Speedup =2n/05n+15 2n/0.5n 1.5 4 = number of stages 4.5 An Overview

More information

CS 31: Intro to Systems Digital Logic. Kevin Webb Swarthmore College February 2, 2016

CS 31: Intro to Systems Digital Logic. Kevin Webb Swarthmore College February 2, 2016 CS 31: Intro to Systems Digital Logic Kevin Webb Swarthmore College February 2, 2016 Reading Quiz Today Hardware basics Machine memory models Digital signals Logic gates Circuits: Borrow some paper if

More information

The Processor: Datapath & Control

The Processor: Datapath & Control Orange Coast College Business Division Computer Science Department CS 116- Computer Architecture The Processor: Datapath & Control Processor Design Step 3 Assemble Datapath Meeting Requirements Build the

More information

The Processor: Datapath & Control

The Processor: Datapath & Control Chapter Five 1 The Processor: Datapath & Control We're ready to look at an implementation of the MIPS Simplified to contain only: memory-reference instructions: lw, sw arithmetic-logical instructions:

More information

Working on the Pipeline

Working on the Pipeline Computer Science 6C Spring 27 Working on the Pipeline Datapath Control Signals Computer Science 6C Spring 27 MemWr: write memory MemtoReg: ALU; Mem RegDst: rt ; rd RegWr: write register 4 PC Ext Imm6 Adder

More information

CS 2506 Computer Organization II

CS 2506 Computer Organization II Instructions: Print your name in the space provided below. This examination is closed book and closed notes, aside from the permitted one-page formula sheet. No calculators or other computing devices may

More information

Computer Architecture Computer Science & Engineering. Chapter 4. The Processor BK TP.HCM

Computer Architecture Computer Science & Engineering. Chapter 4. The Processor BK TP.HCM Computer Architecture Computer Science & Engineering Chapter 4 The Processor Introduction CPU performance factors Instruction count Determined by ISA and compiler CPI and Cycle time Determined by CPU hardware

More information

EITF20: Computer Architecture Part2.2.1: Pipeline-1

EITF20: Computer Architecture Part2.2.1: Pipeline-1 EITF20: Computer Architecture Part2.2.1: Pipeline-1 Liang Liu liang.liu@eit.lth.se 1 Outline Reiteration Pipelining Harzards Structural hazards Data hazards Control hazards Implementation issues Multi-cycle

More information

ECEC 355: Pipelining

ECEC 355: Pipelining ECEC 355: Pipelining November 8, 2007 What is Pipelining Pipelining is an implementation technique whereby multiple instructions are overlapped in execution. A pipeline is similar in concept to an assembly

More information

Chapter 4. The Processor

Chapter 4. The Processor Chapter 4 The Processor 1 Introduction CPU performance factors Instruction count Determined by ISA and compiler CPI and Cycle time Determined by CPU hardware We will examine two MIPS implementations A

More information

THE HONG KONG UNIVERSITY OF SCIENCE & TECHNOLOGY Computer Organization (COMP 2611) Spring Semester, 2014 Final Examination

THE HONG KONG UNIVERSITY OF SCIENCE & TECHNOLOGY Computer Organization (COMP 2611) Spring Semester, 2014 Final Examination THE HONG KONG UNIVERSITY OF SCIENCE & TECHNOLOGY Computer Organization (COMP 2611) Spring Semester, 2014 Final Examination May 23, 2014 Name: Email: Student ID: Lab Section Number: Instructions: 1. This

More information

Full Datapath. Chapter 4 The Processor 2

Full Datapath. Chapter 4 The Processor 2 Pipelining Full Datapath Chapter 4 The Processor 2 Datapath With Control Chapter 4 The Processor 3 Performance Issues Longest delay determines clock period Critical path: load instruction Instruction memory

More information

ECE232: Hardware Organization and Design

ECE232: Hardware Organization and Design ECE232: Hardware Organization and Design Lecture 14: One Cycle MIPs Datapath Adapted from Computer Organization and Design, Patterson & Hennessy, UCB R-Format Instructions Read two register operands Perform

More information

Improving Performance: Pipelining

Improving Performance: Pipelining Improving Performance: Pipelining Memory General registers Memory ID EXE MEM WB Instruction Fetch (includes PC increment) ID Instruction Decode + fetching values from general purpose registers EXE EXEcute

More information

Multicycle Approach. Designing MIPS Processor

Multicycle Approach. Designing MIPS Processor CSE 675.2: Introduction to Computer Architecture Multicycle Approach 8/8/25 Designing MIPS Processor (Multi-Cycle) Presentation H Slides by Gojko Babić and Elsevier Publishing We will be reusing functional

More information

--------------------------------------------------------------------------------------------------------------------- 1. Objectives: Using the Logisim simulator Designing and testing a Pipelined 16-bit

More information

McGill University Faculty of Engineering FINAL EXAMINATION Fall 2007 (DEC 2007)

McGill University Faculty of Engineering FINAL EXAMINATION Fall 2007 (DEC 2007) McGill University Faculty of Engineering FINAL EXAMINATION Fall 2007 (DEC 2007) VERSION 1 Examiner: Professor T.Arbel Signature: INTRODUCTION TO COMPUTER ENGINEERING ECSE-221A 6 December 2007, 1400-1700

More information

Determined by ISA and compiler. We will examine two MIPS implementations. A simplified version A more realistic pipelined version

Determined by ISA and compiler. We will examine two MIPS implementations. A simplified version A more realistic pipelined version MIPS Processor Introduction CPU performance factors Instruction count Determined by ISA and compiler CPI and Cycle time Determined by CPU hardware We will examine two MIPS implementations A simplified

More information

14:332:331 Pipelined Datapath

14:332:331 Pipelined Datapath 14:332:331 Pipelined Datapath I n s t r. O r d e r Inst 0 Inst 1 Inst 2 Inst 3 Inst 4 Single Cycle Disadvantages & Advantages Uses the clock cycle inefficiently the clock cycle must be timed to accommodate

More information

ECE 154A Introduction to. Fall 2012

ECE 154A Introduction to. Fall 2012 ECE 154A Introduction to Computer Architecture Fall 2012 Dmitri Strukov Lecture 10 Floating point review Pipelined design IEEE Floating Point Format single: 8 bits double: 11 bits single: 23 bits double:

More information