Lecture 3: The Processor (Chapter 4 of textbook) Chapter PDF Free Download

Lecture 3: The Processor (Chapter 4 of textbook) Chapter 4.1

Introduction Chapter 4.1 Chapter 4.2

Review: MIPS (RISC) Design Principles Simplicity favors regularity fixed size instructions small number of instruction formats opcode always the first 6 bits Smaller is faster limited instruction set limited number of registers in register file limited number of addressing modes Make the common case fast arithmetic operands from the register file (load-store machine) allow instructions to contain immediate operands Good design demands good compromises three instruction formats Chapter 4.3

The Processor: Datapath & Control Our implementation of the MIPS is simplified memory-reference instructions: lw, sw arithmetic-logical instructions: add, sub, and, or, slt control flow instructions: beq, j Generic implementation use the program counter (PC) to supply the instruction address and fetch the instruction from memory (and update the PC) decode the instruction (and read registers) execute the instruction Exec Fetch PC = PC+4 Decode All instructions (except j) use the after reading the registers How? memory-reference? arithmetic? control flow? Chapter 4.4

Logic Design Conventions Chapter 4.2 Chapter 4.5

Logic Design Basics Information encoded in binary Low voltage = 0, High voltage = 1 One wire per bit Multi-bit data encoded on multi-wire buses Combinational components Operate on data Output is a function of input State (sequential) components Store information Chapter 4.6

Combinational Components AND-gate Y = A & B A B Y Adder Y = A + B A B + Y Multiplexer Y = S? I1 : I0 I0 I1 M u x Y S Arithmetic/Logic Unit Y = F(A,B) A B Y F Chapter 4.7

Sequential Components Register: stores data in a circuit Uses a clock signal to determine when to update the stored value Edge-triggered: update when the clock signal changes the value - Rising edge (positive edge) : 0 1 - Falling edge (negative edge) : 1 0 D flip-flop is one option to implement register D Clk Q Clk D Q Chapter 4.8

Sequential Components Register with write control Only updates on clock edge when write control input (e.g., Write Enable) is asserted Clk D WE Clk Q WE D Q Chapter 4.9

Clocking Methodologies The clocking methodology defines when data in a state element are valid and stable relative to the clock State elements a memory element such as a register Edge-triggered all state changes occur on a clock edge Typical execution read contents of state elements send values through combinational logic write results to one or more state elements State element 1 Combinational logic State element 2 clock one clock cycle Assumes state elements are written on every clock cycle; if not, need explicit write control signal write occurs only when both the write control is asserted and the clock edge occurs Chapter 4.10

Building a Datapath Chapter 4.3 Chapter 4.11

Fetching Instructions Fetching instructions involves reading the instruction from the Instruction Memory updating the PC value to be the address of the next (sequential) instruction clock Exec Fetch PC = PC+4 Decode PC 4 Address Add Instruction Memory Instruction PC is updated every clock cycle, so it does not need an explicit write control signal Chapter 4.12

Decoding Instructions Decoding instructions involves sending the fetched instruction s opcode and function field bits to the control unit Fetch PC = PC+4 Control Unit Exec Decode Instruction Addr 1 Register Addr 2 Data 1 File Write Addr Data 2 Write Data reading two values from the Register File - Register File addresses are contained in the instruction Chapter 4.13

Executing R Format Operations R format operations (add, sub, slt, and, or) R-type: 31 25 20 15 10 5 0 op rs rt rd shamt funct perform operation (op and funct) on values in rs and rt store the result back into the Register File (into location rd) RegWrite control Exec Fetch PC = PC+4 Decode Instruction Addr 1 Register Addr 2 Data 1 File Write Addr Data 2 Write Data overflow zero Note that Register File is not written every cycle (e.g., sw), so we need an explicit write control signal for the Register File Chapter 4.14

Executing Store Operations Store operations involve compute memory address by adding the base register (read from the Register File during decode) to the 16-bit signed-extended offset field in the instruction store value (read from the Register File during decode) written to the Data Memory control MemWrite Instruction Addr 1 Register Addr 2 Data 1 File Write Addr Data 2 Write Data overflow zero Address Data Memory Write Data Data Sign 16 Extend 32 Chapter 4.15

Executing Load Operations Load operations involve compute memory address by adding the base register (read from the Register File during decode) to the 16-bit signed-extended offset field in the instruction load value (read from the Data Memory) written to the Register File RegWrite control Instruction Addr 1 Register Addr 2 Data 1 File Write Addr Data 2 Write Data overflow zero Address Data Memory Write Data Data Sign 16 Extend 32 Mem Chapter 4.16

Executing Branch Operations Branch operations involve compare the operands read from the Register File during decode for equality (zero output) compute the branch target address by adding the updated PC to the 16-bit signed-extended offset field in the instr 4 Add Shift left 2 Add Branch target address control PC Instruction Addr 1 Register Addr 2 Data 1 File Write Addr Data 2 Write Data zero (to branch control logic) Sign 16 Extend 32 Chapter 4.17

Executing Jump Operations Jump operation involves replace the lower 28 bits of the PC with the lower 26 bits of the fetched instruction shifted left by 2 bits Add 4 4 PC Address Instruction Memory Instruction 26 Shift left 2 28 Jump address Chapter 4.18

Creating a Single Datapath from the Parts Assemble the datapath segments and add control lines and multiplexors as needed Single cycle design fetch, decode and execute each instruction in one clock cycle no datapath resource can be used more than once per instruction, so some must be duplicated (e.g., separate Instruction Memory and Data Memory, several adders) multiplexors needed at the input of shared elements with control lines to do the selection write signals to control writing to the Register File and Data Memory Cycle time is determined by the length of the longest path Chapter 4.19

Fetch, Register, and Memory Access Portions PC 4 Address Add Instruction Memory Instruction RegWrite Addr 1 Register Addr 2 Data 1 File Write Addr Data 2 Write Data Src control ovf zero MemWrite Address Data Memory Write Data Data MemtoReg Sign 16 Extend 32 Mem zero flag of - 1 : result is zero - 0 : result is not zero Chapter 4.20

Full Datapath (without Jump) Chapter 4.21

Adding Control Units Chapter 4.4, page 259 Chapter 4.22

Adding the Control Selecting the operations to perform (, Register File and Memory read/write) Controlling the flow of data (multiplexor inputs) Observations op field always in bits 31-26 R-type: 31 25 20 15 10 5 0 op rs rt rd shamt funct 31 25 20 15 0 I-Type: op rs rt address offset addr. of registers J-type: to be read are op always specified by the rs field (bits 25-21) and rt field (bits 20-16) - in lw and sw rs is the base register addr. of register to be written is in one of two places in rt (bits 20-16) for lw; in rd (bits 15-11) for R-type instructions offset for beq, lw, and sw always in bits 15-0 31 25 0 target address Chapter 4.23

The Main Control Unit Control signals derived from instruction R-type Load/ Store Branch 0 rs rt rd shamt funct 31:26 25:21 20:16 15:11 10:6 5:0 35 or 43 rs rt address 31:26 25:21 20:16 15:0 4 rs rt address 31:26 25:21 20:16 15:0 opcode always read always read write for R-type and load sign-extend Chapter 4.24

Single Cycle Datapath with Control Unit (w/o jump) 0 4 Add Op Instr[31-26] Control Unit Branch Src Shift left 2 Add 1 PCSrc Mem MemtoReg MemWrite PC Address Instruction Memory Instr[31-0] RegDst Instr[25-21] Instr[20-16] 0 1 Instr[15-11] RegWrite Addr 1 Register Addr 2 Data 1 File Write Addr Data 2 Write Data 0 1 ovf zero Address Data Memory Write Data Data 1 0 Instr[15-0] Sign 16 Extend 32 control Instr[5-0] Chapter 4.25

Single Cycle Datapath with Control Unit 4 Add Instr[25-0] 26 Op Instr[31-26] Shift left 2 Control Unit 28 32 PC+4 [31-28] Jump Branch Src Shift left 2 Add 1 0 0 1 PCSrc Mem MemtoReg MemWrite PC Address Instruction Memory Instr[31-0] RegDst Instr[25-21] Instr[20-16] 0 1 Instr[15-11] RegWrite Addr 1 Register Addr 2 Data 1 File Write Addr Data 2 Write Data 0 1 ovf zero Address Data Memory Write Data Data 1 0 Instr[15-0] Sign 16 Extend 32 control Instr[5-0] Chapter 4.26

Control used for Load/Store: Function = add Branch: Function = subtract R-type: Function depends on funct field control Function 0000 AND 0001 OR 0010 add 0110 subtract 0111 set-on-less-than Chapter 4.27

Control Assume 2-bit Op derived from opcode Combinational logic derives control opcode Op Operation funct function control lw 00 load word XXXXXX add 0010 sw 00 store word XXXXXX add 0010 beq 01 branch equal XXXXXX subtract 0110 R-type 10 add 100000 add 0010 subtract 100010 subtract 0110 AND 100100 AND 0000 OR 100101 OR 0001 set-on-less-than 101010 set-on-less-than 0111 Chapter 4.28

Single-cycle Implementation Chapter 4.4, page 264 Chapter 4.29

R-type Instruction Data/Control Flow 0 4 Add Op Instr[31-26] Control Unit Branch Src Shift left 2 Add 1 PCSrc Mem MemtoReg MemWrite PC Address Instruction Memory Instr[31-0] RegDst Instr[25-21] Instr[20-16] 0 1 Instr[15-11] RegWrite Addr 1 Register Addr 2 Data 1 File Write Addr Data 2 Write Data 0 1 ovf zero Address Data Memory Write Data Data 1 0 Instr[15-0] Sign 16 Extend 32 control Instr[5-0] Chapter 4.30

R-type Instruction Data/Control Flow (alt.) Chapter 4.31

Load Word Instruction Data/Control Flow 0 4 Add Op Instr[31-26] Control Unit Branch Src Shift left 2 Add 1 PCSrc Mem MemtoReg MemWrite PC Address Instruction Memory Instr[31-0] RegDst Instr[25-21] Instr[20-16] 0 1 Instr[15-11] RegWrite Addr 1 Register Addr 2 Data 1 File Write Addr Data 2 Write Data 0 1 ovf zero Address Data Memory Write Data Data 1 0 Instr[15-0] Sign 16 Extend 32 control Instr[5-0] Chapter 4.32

Load Word Instruction Data/Control Flow (alt.) Chapter 4.34

Store Word Instruction Data/Control Flow 0 4 Add Op Instr[31-26] Control Unit Branch Src Shift left 2 Add 1 PCSrc Mem MemtoReg MemWrite PC Address Instruction Memory Instr[31-0] RegDst Instr[25-21] Instr[20-16] 0 1 Instr[15-11] RegWrite Addr 1 Register Addr 2 Data 1 File Write Addr Data 2 Write Data 0 1 ovf zero Address Data Memory Write Data Data 1 0 Instr[15-0] Sign 16 Extend 32 control Instr[5-0] Chapter 4.35

Branch Instruction Data/Control Flow 0 4 Add Op Instr[31-26] Control Unit Branch Src Shift left 2 Add 1 PCSrc Mem MemtoReg MemWrite PC Address Instruction Memory Instr[31-0] RegDst Instr[25-21] Instr[20-16] 0 1 Instr[15-11] RegWrite Addr 1 Register Addr 2 Data 1 File Write Addr Data 2 Write Data 0 1 ovf zero Address Data Memory Write Data Data 1 0 Instr[15-0] Sign 16 Extend 32 control Instr[5-0] Chapter 4.37

Branch Instruction Data/Control Flow (alt.) Chapter 4.39

Adding the Jump Operation 4 Add Instr[25-0] 26 Op Instr[31-26] Shift left 2 Control Unit 28 32 PC+4[31-28] Branch Jump Src Shift left 2 Add 1 0 0 1 PCSrc Mem MemtoReg MemWrite PC Address Instruction Memory Instr[31-0] RegDst Instr[25-21] Instr[20-16] 0 1 Instr[15-11] RegWrite Addr 1 Register Addr 2 Data 1 File Write Addr Data 2 Write Data 0 1 ovf zero Address Data Memory Write Data Data 1 0 Instr[15-0] Sign 16 Extend 32 control Instr[5-0] Chapter 4.40

Jump Operation (alt.) Chapter 4.41

Introduction to Pipelining Design Chapter 4.5 Chapter 4.42

Instruction Critical Paths What is the clock cycle time assuming negligible delays for muxes, control unit, sign extension, PC access, shift left 2, wires, setup and hold times except: Instruction Memory and Data Memory (200 ps) and adders (200 ps) Register File access (reads or writes) (100 ps) Instr. I Mem Reg Rd Op D Mem Reg Wr Total R- type load store beq jump 200 100 200 100 600 200 100 200 200 100 800 200 100 200 200 700 200 100 200 500 200 200 Chapter 4.43

Single Cycle Disadvantages & Advantages Uses the clock cycle inefficiently the clock cycle must be timed to accommodate the slowest instruction especially problematic for more complex instructions like floating-point multiplication Clk Cycle 1 Cycle 2 lw sw Waste Is slow but Is simple and easy to understand Chapter 4.44

How Can We Make It Faster? Start fetching and executing the next instruction before the current one has completed Pipelining modern processors are pipelined for performance Remember the performance equation: CPU time = IC CPI CC Under ideal conditions and with a large number of instructions, the speedup from pipelining is approximately equal to the number of pipe stages A five stage pipeline is nearly five times faster because the CC is nearly five times faster - CPI=1 for single-cycle implementation - CPI 1 for pipelined implementation Chapter 4.45

Analogy: Assembly Line v.s. Mechanic Shop Mechanic Shop The mechanic needs to do everything It takes hours to fix just one car Car assembly line Many workers work together - Each worker just puts one or a few components into the car One assembly line can produce hundreds or thousands of cars per day Chapter 4.46

The Five Stages of Executing Instruction Cycle 1 Cycle 2 Cycle 3 Cycle 4 Cycle 5 lw IFetch Dec Exec Mem WB IFetch: Instruction Fetch and Update PC Dec: Registers Fetch and Instruction Decode Exec: Execute R-type; calculate memory address; etc. Mem: /write the data from/to the Data Memory WB: Write the result data into the register file Chapter 4.47

Why Pipeline? For Performance! Time (clock cycles) I n s t r. Inst 0 Inst 1 Once the pipeline is full, one instruction is completed every cycle, so CPI = 1 O r d e r Inst 2 Inst 3 Inst 4 Time to fill the pipeline Chapter 4.48

A Pipelined MIPS Processor Start the next instruction before the current one has completed improves throughput - total amount of work done in a given time instruction latency (execution time, delay time, response time - time from the start of an instruction to its completion) is not reduced Cycle 1 Cycle 2 Cycle 3 Cycle 4 Cycle 5 Cycle 6 Cycle 7 Cycle 8 lw IFetch Dec Exec Mem WB sw IFetch Dec Exec Mem WB R-type IFetch Dec Exec Mem WB - clock cycle (pipeline stage time) is limited by the slowest stage for some stages don t need the whole clock cycle (e.g., WB) Chapter 4.49

Single Cycle versus Pipeline Single Cycle Implementation (CC = 800 ps): Clk Cycle 1 Cycle 2 lw sw Waste Pipeline Implementation (CC = 200 ps): lw IFetch Dec Exec Mem WB 400 ps sw IFetch Dec Exec Mem WB R-type IFetch Dec Exec Mem WB To complete an entire instruction in the pipelined case takes 1000 ps (as compared to 800 ps for the single cycle case). Why? How long does each take to complete 1,000,000 instrs? Chapter 4.50

Pipelining the MIPS ISA What makes it easy all instructions are the same length (32 bits) - can fetch in the 1 st stage and decode in the 2 nd stage few instruction formats (three) memory operations occur only in loads and stores - can use the execute stage to calculate memory addresses each instruction writes at most one result (i.e., changes the machine state) and does it in the last few pipeline stages (MEM or WB) operands must be aligned in memory so a single data transfer takes only one data memory access Only cover the following 8 instructions as an example lw, sw, add, sub, and, or, slt, beq Chapter 4.51

Pipelined Datapath Chapter 4.6, 286 Chapter 4.52

MIPS Pipelined Datapath Chapter 4.53

Pipeline Registers Need registers between stages Hold information produced in previous cycle 1 0 Chapter 4.54

Single-clock-cycle diagram Cycle-by-cycle flow of instructions through the pipelined datapath Single-clock-cycle pipeline diagram Show pipeline usage in a single cycle Highlight resources used in each cycle We will look at single-clock-cycle diagrams for load & store instructions Chapter 4.55

IF for Load & Store 1 0 Chapter 4.56

ID for Load & Store 1 0 Chapter 4.57

EX for Load & Store 1 0 Chapter 4.58

MEM for Load 1 0 Chapter 4.59

MEM for Store 1 0 Chapter 4.60

WB for Load 1 0 Wrong register number Chapter 4.61

Corrected Pipelined Datapath 1 0 Chapter 4.62

MIPS Pipeline Datapath State registers between each pipeline stage to isolate them IF:IFetch ID:Dec EX:Execute MEM: MemAccess WB: WriteBack IF/ID ID/EX EX/MEM Add PC 4 Instruction Memory Address Addr 1 Register Addr 2Data 1 File Write Addr Data 2 Write Data Shift left 2 Add Address Write Data Data Memory Data MEM/WB Sign 16 Extend 32 System Clock Chapter 4.63

Graphically Representing MIPS Pipeline Can help with answering questions like: How many cycles does it take to execute this code? What is the doing during cycle 4? Is there a hazard, why does it occur, and how can it be fixed? Chapter 4.64

Multi-Cycle Pipeline Diagram Showing the resource usage Chapter 4.65

Multi-Cycle Pipeline Diagram Traditional form Chapter 4.66

Pipeline Control Chapter 4.6, page 300 Chapter 4.67

Pipelined Control Chapter 4.68

Pipelined Control Signals Control signals derived from instructions As in single-cycle implementation Chapter 4.69

Pipeline Control IF Stage: read Instr Memory (always asserted) and write PC (on System Clock) ID Stage: no control signals to set Reg Dst EX Stage MEM Stage WB Stage Op1 Op0 Src Brch Mem Mem Write Reg Write Mem toreg R 1 1 0 0 0 0 0 1 0 lw 0 0 0 1 0 1 0 1 1 sw X 0 0 1 0 0 1 0 X beq X 0 1 0 1 0 0 0 X Chapter 4.70

MIPS Pipeline Control Path Modifications 1 0 Chapter 4.71

Pipeline Hazards Data Hazard Chapter 4.7, page 303 Chapter 4.72

Can Pipelining Get Us Into Trouble? Yes: Pipeline Hazards structural hazards: attempt to use the same resource by two different instructions at the same time - Structural hazards are solved by duplicating the necessary components data hazards: attempt to use data before it is ready - An instruction s source operand(s) are produced by a prior instruction still in the pipeline control hazards: attempt to make a decision about program control flow before the condition has been evaluated and the new PC target address calculated - branch and jump instructions, exceptions Can usually resolve hazards by waiting pipeline control must detect the hazard and take action to resolve hazards Chapter 4.73

A Single Memory Would Be a Structural Hazard Time (clock cycles) I n s t r. lw Inst 2 Mem Reg Mem Reg Mem Reg Mem Reg ing data from memory O r d e r Inst 3 Inst 4 Mem Reg Mem Reg Mem Reg Mem Reg Inst 5 ing instruction from memory Mem Reg Mem Reg Fix with separate instr and data memories (I$ and D$) Chapter 4.74

How About Register File Access? Time (clock cycles) I n s t r. add $1, Inst 2 Fix simple register file hazard by doing writes in the first half of the cycle and reads in the second half O r d e r Inst 3 add $2,$1, Chapter 4.75

Register Usage Can Cause Data Hazards Dependencies backward in time cause hazards I n s t r. O r d e r add $1,$8,$9 sub $4,$1,$5 and $6,$1,$7 or $8,$1,$9 xor $4,$1,$5 After Write data hazard Chapter 4.77

Loads Can Cause Data Hazards Dependencies backward in time cause hazards I n s t r. O r d e r lw $1,4($2) sub $4,$1,$5 and $6,$1,$7 or $8,$1,$9 xor $4,$1,$5 Load-use data hazard Another After Write hazard Chapter 4.78

Formal Definitions of Data Hazards Consider two instructions i and j, with i occurring before j in program order Three data hazards RAW (read after write) - j tries to read a source before i writes it, so j incorrectly gets the old value WAW (write after write) - j tries to write an operand before it is written by i, leaving the value written by i rather than the value written by j in the destination WAR (write after read) - j tries to write a destination before it is read by i, so i incorrectly gets the new value In the basic 5-stage pipeline, WAW and WAR dependences do not cause any hazards Chapter 4.79

One Way to Fix a Data Hazard I n s t r. add $1, stall (insert nop) Can fix data hazard by waiting stall but impacts CPI O r d e r stall (insert nop) sub $4,$1,$5 and $6,$1,$7 Chapter 4.80

Another Way to Fix a Data Hazard I n s t r. add $1, sub $4,$1,$5 Fix data hazards by forwarding results as soon as they are available to where they are needed O r d e r and $6,$1,$7 or $8,$1,$9 xor $4,$1,$5 Chapter 4.81

Data Forwarding (aka Bypassing) Take the result from the earliest point where it exists in any of the pipeline registers and forward it to the functional units (e.g., the ) that need it in that cycle For functional unit: the inputs can come from any pipeline register rather than just from ID/EX by adding multiplexors to both inputs of the connecting the register write data in EX/MEM or MEM/WB to both mux inputs in the EX s stage adding the proper control hardware to control the new muxes With forwarding the processor can achieve a CPI close to 1 even in the presence of data dependencies Chapter 4.82

Forwarding Illustration I n s t r. add $1, sub $4,$1,$5 O r d e r and $6,$7,$1 EX forwarding MEM forwarding Chapter 4.83

Detecting the Need to Forward Pass register numbers along pipeline e.g., ID/EX.RegisterRs: register number for Rs stored in ID/EX pipeline register operand register numbers in EX stage are given by ID/EX.RegisterRs, ID/EX.RegisterRt Data hazards only if 1a. EX/MEM.RegisterRd = ID/EX.RegisterRs 1b. EX/MEM.RegisterRd = ID/EX.RegisterRt 2a. MEM/WB.RegisterRd = ID/EX.RegisterRs 2b. MEM/WB.RegisterRd = ID/EX.RegisterRt Fwd from previous instr., i.e., EX/MEM pipeline register Fwd from second previous instruction, i.e., MEM/WB pipeline register Chapter 4.84

Detecting the Need to Forward But only if forwarding instruction will write to a register! EX/MEM.RegWrite MEM/WB.RegWrite And only if Rd for that instruction is not $zero EX/MEM.RegisterRd 0 MEM/WB.RegisterRd 0 Chapter 4.85

Datapath with Forwarding Hardware PCSrc IF/ID Control ID/EX WB M EX EX/MEM WB M PC 4 Instruction Memory Address Add Addr 1 Register Addr 2Data 1 File Write Addr Data 2 Write Data 16 Sign 32 Extend Shift left 2 Add cntrl ForwardA Branch Data Memory Address Data Write Data MEM/WB WB ForwardB Forward Unit Chapter 4.86

Control Values for the Forwarding Multiplexors Mux control Source Explanation ForwardA=00 ID/EX The first operand comes from the register file. ForwardA=10 EX/MEM The first operand is forwarded from the prior result. ForwardA=01 MEM/WB The first operand is forwarded from the data memory or an earlier result. ForwardB=00 ID/EX The second operand comes from the register file. ForwardB=10 EX/MEM The second operand is forwarded from the prior result. ForwardB=01 MEM/WB The second operand is forwarded from the data memory or an earlier result. Chapter 4.87

Yet Another Complication! Another potential data hazard can occur when there is a conflict between the outputs of EX/MEM pipeline register and MEM/WB pipeline register which should be forwarded? I n s t r. O r d e r add $1,$1,$2 add $1,$1,$3 add $1,$1,$4 Chapter 4.88

Statement for Forwarding Control Signals (in C) 1. ForwardA: if ( EX/MEM.RegWrite and (EX/MEM.RegisterRd!= 0) and (EX/MEM.RegisterRd = ID/EX.RegisterRs)) ForwardA = 10; Forwards the result from the previous instr. to either input of the Forwards the else if ( MEM/WB.RegWrite and result from the (MEM/WB.RegisterRd!= 0) and second previous (MEM/WB.RegisterRd = ID/EX.RegisterRs)) ForwardA = 01; instr. to either input of the else ForwardA = 00; No forwarding 2. ForwardB The logic is similar Chapter 4.90

Datapath with Forwarding Hardware PCSrc IF/ID Control ID/EX WB M EX EX/MEM WB M PC 4 Instruction Memory Address Add Addr 1 Register Addr 2Data 1 File Write Addr Data 2 Write Data 16 Sign 32 Extend 00 01 10 00 01 10 Shift left 2 Add cntrl Branch Data Memory Address Data Write Data MEM/WB WB EX/MEM.RegisterRd ID/EX.RegisterRt ID/EX.RegisterRs Forward Unit MEM/WB.RegisterRd Chapter 4.91

Data Hazards and Stalls Chapter 4.7, page 313 Chapter 4.92

Forwarding with Load-use Data Hazards (logical view) I n s t r. O r d e r lw $1,4($2) sub $4,$1,$5 and $6,$1,$7 or $8,$1,$9 xor $4,$1,$5 Chapter 4.93

Forwarding with Load-use Data Hazards (logical view) I n s t r. O r d e r lw $1,4($2) sub stall $4,$1,$5 sub $4,$1,$5 and $6,$1,$7 or $8,$1,$9 Reg DM Reg IM xor $4,$1,$5 IM Reg DM Will still need one stall cycle even with forwarding Chapter 4.94

Load-use Hazard Detection Unit Need a Hazard detection Unit in the ID stage that inserts a stall between the load and its use 1. ID Hazard detection Unit: if (ID/EX.Mem and ((ID/EX.RegisterRt = IF/ID.RegisterRs) or (ID/EX.RegisterRt = IF/ID.RegisterRt))) stall the pipeline (more accurate, stall the following instructions) The first line tests to see if the instruction now in the EX stage is a lw; the next two lines check to see if the destination register of the lw matches either source register of the instruction in the ID stage (the load-use instruction) After this one cycle stall, the forwarding logic can handle the remaining data hazards Chapter 4.95

Hazard/Stall Hardware Along with the Hazard Unit, we have to implement the stall Prevent the instructions in the IF and ID stages from progressing down the pipeline done by preventing the PC register and the IF/ID pipeline register from changing Hazard detection Unit controls the writing of the PC (PC.write) and IF/ID (IF/ID.write) registers Insert a bubble between the lw instruction (in the EX stage) and the load-use instruction (in the ID stage) (i.e., insert a nop in the execution stream) Set the control bits in the EX, MEM, and WB control fields of the ID/EX pipeline register to 0 (nop). The Hazard Unit controls the mux that chooses between the real control values and the 0 s. Let the lw instruction and the following instructions in the pipeline proceed normally down the pipeline Chapter 4.96

Adding the Hazard/Stall Hardware PCSrc PC 4 Instruction Memory Address Add IF/ID Hazard Unit Control 0 0 1 Addr 1 Register Addr 2Data 1 File Write Addr Data 2 Write Data 16 Sign 32 Extend ID/EX WB M EX ID/EX.Mem Shift left 2 Add cntrl EX/MEM WB M Branch Data Memory Address Data Write Data MEM/WB WB ID/EX.RegisterRt Forward Unit Chapter 4.97

Stall/Bubble in the Pipeline Chapter 4.98

A Challenge: Memory-to-Memory Copies For loads immediately followed by stores (memory-tomemory copies), a stall can be avoided by adding forwarding hardware from the MEM/WB register to the data memory input. Would need to add a Forward Unit and a mux to the MEM stage I n s t r. O r d e r lw $1,4($2) sw $1,4($3) Chapter 4.99

Control Hazards Chapter 4.8 Chapter 4.100

MIPS Pipeline Control Path Modifications 1 0 Chapter 4.101

Control Hazards When the flow of instruction addresses is not sequential (i.e., PC PC + 4); incurred by change of flow instructions Unconditional branches (j, jal, jr) Conditional branches (beq, bne) Exceptions Possible approaches Stall (impacts CPI) Move decision point as early in the pipeline as possible, thereby reducing the number of stall cycles Predict and hope for the best! Delay decision (requires compiler support) Control hazards occur less frequently than data hazards, but there is nothing as effective against control hazards as forwarding is for data hazards Chapter 4.102

Branch Instr s Cause Control Hazards Dependencies backward in time cause hazards I n s t r. O r d e r beq lw Inst 3 Inst 4 Chapter 4.103

One Way to Fix a Branch Control Hazard I n s t r. beq flush Fix branch hazard by waiting flush but affects CPI O r d e r flush flush beq target Inst 3 IM Reg DM Chapter 4.104

Another Way to Fix a Branch Control Hazard Move branch decision hardware as early in the pipeline as possible, i.e., during the decode cycle I n s t r. O r d e r beq flush beq target Inst 3 IM Reg DM Chapter 4.105

Reducing the Delay of Branches Add hardware to compute the branch target address and evaluate the branch decision to the ID stage Reduces the number of stall (flush) cycles to one (like with jumps) - But now need to add forwarding hardware in ID stage Comparing and updating the PC adds 2 muxes, a comparator, and an and gate to the ID stage - Computing branch target address can be done in parallel with RegFile read (done for all instructions only used when needed) - Comparing the registers can be done in the same clock cycle as RegFile read For deeper pipelines, branch decision points can be even later in the pipeline, incurring more stalls Chapter 4.106

Supporting ID Stage Branches PCSrc Branch Hazard Unit 0 1 ID/EX EX/MEM IF/ID Control 0 PC 4 Add Instruction Memory 0 Address IF.Flush Shift left 2 Addr 1 RegFile Addr 2 Data 1 Write Addr Data 2 Write Data 16 Sign Extend Add 32 Compare cntrl Data Memory Data Address Write Data MEM/WB Forward Unit Forward Unit Chapter 4.107

Example: Branch Taken Example 36: sub $10, $4, $8 40: beq $1, $3, 7 44: and $12, $2, $5 48: or $13, $2, $6 52: add $14, $4, $2 56: slt $15, $6, $7... 68: add $1, $2, $3 72: lw $4, 50($7) 76: and $3, $1, $2 Chapter 4.108

Example: Branch Taken Chapter 4.109

Example: Branch Taken Chapter 4.110

Data Hazards for Branches If a comparison register is a destination of 2 nd preceding instruction Can resolve using forwarding add $1, $2, $3 IF ID EX MEM WB add $4, $5, $6 IF ID EX MEM WB Inst 3 IF ID EX MEM WB beq $1, $4, target IF ID EX MEM WB Chapter 4.111

Data Hazards for Branches If a comparison register is a destination of preceding instruction or 2 nd preceding load instruction Need 1 stall cycle lw $1, addr IF ID EX MEM WB add $4, $5, $6 IF ID EX MEM WB beq stalled IF ID beq $1, $4, target ID EX MEM WB Chapter 4.112

Data Hazards for Branches If a comparison register is a destination of immediately preceding load instruction Need 2 stall cycles But no data forwarding lw $1, addr IF ID EX MEM WB beq stalled IF ID beq stalled ID beq $1, $0, target ID EX MEM WB Chapter 4.113

Summary All modern processors use pipelining for performance (a CPI close to 1 and fast clock frequency) Pipeline clock rate limited by slowest pipeline stage so designing a balanced pipeline is important Pipelining doesn t decrease the latency of single task, it increases the throughput of entire workload Must detect and resolve hazards Structural hazards resolved by designing the pipeline correctly Data hazards - Stall (impacts CPI) - Forward (requires hardware support) Control hazards put the branch decision hardware in as early a stage in the pipeline as possible - Stall (impacts CPI) - Static and dynamic prediction (requires hardware support) - Delay decision (requires compiler support) Chapter 4.114