COMPUTER ORGANIZATION AND DESIGN

Evelyn Jeffry Caldwell
6 years ago
1 COMPUTER ORGANIZATION AND DESIGN, ARM Edition: The Hardware/Software Interface. Chapter 4: The Processor. Modified and extended by R.J. Leduc

2 To understand this chapter, you will need to understand some basic digital logic concepts Earlier, we discussed AND, OR, NOT, and XOR (exclusive OR) logic gates You should go back and review this You should also review our earlier discussion on how RAM works You should review the Boolean logic axioms we discussed We will shortly discuss how multiplexors and registers work This section of slides includes information from Section Digital Logic Introduction Digital Logic Introduction Chapter 4 The Processor 2

3 Logic Design Basics. Information in the CPU is encoded in binary: low voltage = 0, high voltage = 1; one wire per bit; multi-bit data is encoded on multi-wire buses. A combinational component operates on data: its output is a function of the current inputs only (i.e. no history). Examples are circuits created from AND, OR, and NOT gates, but without any feedback loops. State (sequential) elements store information (e.g. flip-flops, registers).

4 2- Input Multiplexor A 2-input multiplexor has two data sources, x1 and x2, and one output, f The third input, s, selects which input is transmitted to the output If s = 0, then f has the same value as x1 If s = 1, then f has the same value as x2 Chapter 4 The Processor 4
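The selection rule above can be sketched in a few lines of Python. This is only an illustrative model on single bits; the function name `mux2` is an assumption, and the expression used is the usual sum-of-products form f = (NOT s AND x1) OR (s AND x2).

```python
def mux2(x1: int, x2: int, s: int) -> int:
    """2-input multiplexor on single bits: f = x1 when s = 0, f = x2 when s = 1."""
    # Sum-of-products form: f = (NOT s AND x1) OR (s AND x2)
    return ((1 - s) & x1) | (s & x2)


# Print the full truth table.
for s in (0, 1):
    for x1 in (0, 1):
        for x2 in (0, 1):
            print(f"s={s} x1={x1} x2={x2} -> f={mux2(x1, x2, s)}")
```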

5 2- Input Multiplexor II Chapter 4 The Processor 5

6 4-Input Multiplexor. For 4 inputs, we need two select lines: s0 and s1. If we wish to select between 32 sources as possible inputs, we would need five select inputs (i.e. 2^5 = 32).

7 Multiplexor for 64-Bit Registers. A 32-input multiplexor can select between 32 1-bit sources. To select between 32 64-bit registers, we would need an array of 32-input multiplexors: one 32-input multiplexor for each bit of the register (64 muxes in total), i.e. the first 32-input multiplexor would select between b0 of each of the LEGv8 registers.
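As a rough illustration of the bit-sliced arrangement described above, the sketch below models a register read as 64 independent multiplexors, one per bit position. The helper names `select_lines` and `read_register` are assumptions made for this example, and the registers are modeled as plain Python integers.

```python
def select_lines(num_sources: int) -> int:
    """Number of select inputs needed to choose among num_sources inputs."""
    return (num_sources - 1).bit_length()


def read_register(registers: list, select: int, width: int = 64) -> int:
    """Model an array of `width` multiplexors, one per bit of the register."""
    result = 0
    for bit in range(width):
        # Multiplexor number `bit` sees bit `bit` of every register
        # and passes through the one chosen by the select lines.
        column = [(reg >> bit) & 1 for reg in registers]
        result |= column[select] << bit
    return result


registers = list(range(100, 132))          # 32 hypothetical register values
print(select_lines(len(registers)))        # select inputs needed for 32 sources
print(read_register(registers, select=7))  # value held in register 7
```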

8 Combinational Components
n-bit, m x 1 Multiplexor: data inputs I0, I1, ..., I(m-1), each n bits wide, with log2(m) select lines S. O = I0 if S = 0..00, O = I1 if S = 0..01, ..., O = I(m-1) if S = 1..11.
log2(n) x n Decoder: input I of log2(n) bits, outputs O0, O1, ..., O(n-1). O0 = 1 if I = 0..00, O1 = 1 if I = 0..01, ..., O(n-1) = 1 if I = 1..11. With enable input e, all Os are 0 if e = 0.
n-bit Adder: inputs A and B, outputs sum and carry. sum = A + B (first n bits), carry = (n+1)th bit of A + B. With carry-in input Ci, sum = A + B + Ci.
n-bit Comparator: inputs A and B. less = 1 if A < B, equal = 1 if A = B, greater = 1 if A > B.
n-bit, m-function ALU: inputs A and B, with log2(m) select lines S. O = A op B, where op is determined by S. May have status outputs carry, zero, etc.

9 Combinational Elements. Adder: Y = A + B. Multiplexor: Y = S ? I1 : I0. Arithmetic/Logic Unit (ALU): Y = F(A, B).

10 Tri-state Buffer When e = 1, acts like a buffer. i.e. output = input When e = 0, output is electrically disconnected from the input For a shared data bus, each register's outputs would go through a tri-state buffer As long as only one register enables output at a time, there is no conflict This is a more efficient method than multiplexors to have a large number of devices share a common wire Chapter 4 The Processor 10

11 D Flip-Flops The basic building block of a register is a device called a D (data) flip-flop (FF) A positive edge-triggered D flip-flop Stores a single bit of data On rising edge (when signal changes from low-to-high) of the clock: Stores the value at the D input Stored value then appears at the Q output, after a short delay Changes of D input otherwise ignored Can add an enable input to flip-flop When enable set to 1, it behaves as above When enable set to 0, it keeps its current value and ignores clock signal Chapter 4 The Processor 11
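The behaviour described above can be mimicked with a small Python model. This is a simplified sketch: the class name `DFlipFlop` and its `tick` method are assumptions for illustration, and the short delay before Q changes is not modeled.

```python
class DFlipFlop:
    """Positive edge-triggered D flip-flop with an enable input (behavioural model)."""

    def __init__(self):
        self.q = 0          # stored bit, visible at the Q output
        self._prev_clk = 0  # previous clock level, used to detect a rising edge

    def tick(self, clk: int, d: int, enable: int = 1) -> int:
        rising_edge = self._prev_clk == 0 and clk == 1
        # Store D only on a rising edge while enabled; otherwise keep the old value.
        if rising_edge and enable:
            self.q = d
        self._prev_clk = clk
        return self.q


ff = DFlipFlop()
print(ff.tick(clk=0, d=1))            # no edge yet: Q keeps its initial value
print(ff.tick(clk=1, d=1))            # rising edge: Q takes the value of D
print(ff.tick(clk=1, d=0))            # clock stays high: change of D is ignored
```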

12 D Flip-Flops II Devices shown from top are: D latch (ignore) Positive edge-triggered D FF Negative edge-triggered (clock transitions from high-to-low) D FF To correctly store data, value at D input must be constant for period just before and after desired clock edge. Chapter 4 The Processor 12

13 Registers To create an 8-bit register: Group eight D flip-flops together Gives 8 D inputs, and 8 Q outputs Connect their enable signals together Connect their clock inputs together Chapter 4 The Processor 13

14 Registers II. Example shows two 2-bit registers (R1 and R2), connected to a 2-bit shared data bus. Register i (i in {1,2}) has an input enable signal (Riin) and an output enable signal (Riout). All registers use tri-state buffers to connect to the data bus.

15 Clocking Methodology Most data is stored in state elements such as registers Typical operation in a processor: One state element stores data, which appears at its output This data then propagates through a combinational logic circuit Data then appears at the input of a second state element Chapter 4 The Processor 15

16 Clocking Methodology II This occurs between clock edges: On first rising edge of clock, data is stored in State Element 1 Data propagates through combinational logic (i.e. an adder) Output of combinational logic stored in State Element2 on next rising edge of clock Clock period must be long enough for data to reach State element 2 and be stable Longest delay in processor determines the minimum clock period Chapter 4 The Processor 16

17 Clocking Methodology III Edge triggered elements allow for a state element to be read and written in the same clock cycle Clock period needs to be long enough for output of combinational logic to reach next input The delay going through the combinational logic must be long enough that newly loaded value of state element can't propagate too quickly back to input of state element Input must not change for short period after clock edge Chapter 4 The Processor 17

18 4.1 Introduction: Chapter Introduction. From previous chapters, we saw that CPU performance was determined by: instruction count (determined by the ISA and compiler), and CPI and cycle time (determined by the implementation of the processor). In examining the implementation of the CPU, we will see how the ISA determines many aspects of the CPU design, and how different implementations affect clock rate and CPI.

19 Chapter Introduction II We will examine two LEGv8 implementations A simplified version A more realistic pipelined version For both we will examine the datapath and controller design We will start with a highly abstract and simplified overview, and then refine the design as we add details Chapter 4 The Processor 19

20 Basic LEGv8 Implementation. We will focus our implementation on a subset of LEGv8 instructions to simplify things. This will demonstrate the key concepts of datapath and controller; implementation of the remaining instructions is similar. The subset of instructions we will focus on is:
Memory reference: load register (LDUR X0, [X1,#8]) and store register (STUR X0, [X1,#0])
Arithmetic/logical: ADD, SUB, AND, ORR (e.g. ADD X0, X1, X2)
Branching: compare and branch (CBZ X0, Label1) and unconditional branch (B Label2), which we will add to the design last

21 Instruction Execution. For all three classes of instructions (memory reference, arithmetic/logical, and branching), the first three steps are identical: 1. Set the program counter (PC) register to the memory location that contains the next instruction. 2. Fetch the instruction from memory. 3. Read zero, one or two registers, using the fields of the instruction to select registers. LDUR and CBZ only require one register; most other instructions, including STUR (base address and value to store), require two; an unconditional branch requires none.

22 Instruction Execution II Next action required depends on instruction type Except for unconditional branches, instructions next need to use the ALU Memory reference: need address calculation Arithmetic/logical: perform indicated operation Conditional branch: needs comparison to zero After ALU, next actions are: Memory reference: access memory to read data or write data Arithmetic/logical/load: store output of ALU or data from memory to a register Chapter 4 The Processor 22

23 Instruction Execution III Finally, we need to determine the address of the next instruction to execute If current instruction is a conditional branch and specified register was zero: Load new address (PC plus offset specified as part of instruction) into PC register Otherwise, load PC + 4 into PC register Chapter 4 The Processor 23
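The next-address rule above can be written as a small function. This is a minimal sketch under stated assumptions: the function and parameter names are made up for this example, and the offset is taken to be already converted to bytes.

```python
MASK64 = (1 << 64) - 1  # the PC is a 64-bit register


def next_pc(pc: int, is_cond_branch: bool, reg_value: int, offset_bytes: int) -> int:
    """Address loaded into the PC register at the end of the current instruction."""
    if is_cond_branch and reg_value == 0:
        # Branch taken: PC plus the offset specified as part of the instruction.
        return (pc + offset_bytes) & MASK64
    # Every other case: fall through to the following instruction.
    return (pc + 4) & MASK64


print(hex(next_pc(0x1000, True, 0, 16)))   # taken branch
print(hex(next_pc(0x1000, True, 5, 16)))   # register not zero: not taken
print(hex(next_pc(0x1000, False, 0, 16)))  # not a branch
```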

24 CPU Overview Provides an abstract and simplified view Omits two important details Chapter 4 The Processor 24

25 (1) Need Multiplexors. Can't just join wires together; use multiplexors.

26 (2) Needs Control Signals Chapter 4 The Processor 26

27 Logic Design Conventions A signal is asserted if it is logically high We assert a signal when it should be set to logically high A signal is deasserted if it is logically low We deassert a signal when it should be set to logically low In textbook, we will only store data on the rising clock edge for our flip-flops and registers The data bus signals are assumed to be 64 bit unless specified otherwise Two overlapping signals are not connected unless there is a dot where they cross each other Chapter 4 The Processor 27

28 4.3 Building a Datapath. We will now examine the major components of a datapath needed to execute each class of LEGv8 instruction. A datapath element is a unit used to operate on or hold data within the processor. For LEGv8 we have: instruction and data memories, register file, the ALU, and adders. A register file is a state element that consists of a set of registers that can be read or written by supplying the register number to be accessed. We will build a LEGv8 datapath incrementally, refining the overview design.

29 Instruction Fetch 64-bit register Increment by 4 for next instruction Chapter 4 The Processor 29

30 R-Format Instructions
Fields: opcode (11 bits) | Rm (5 bits) | shamt (6 bits) | Rn (5 bits) | Rd (5 bits)
Need to read two register operands, perform the arithmetic/logical operation, and write the result to a register. Uses a register file and ALU.

31 R-Format Instructions II Register file requires: three 5-bit selection inputs to specify the two source registers, and the destination register One 64-bit input to load data to be written to the destination register Two 64-bit outputs Must assert RegWrite input to write to register on next clock edge Chapter 4 The Processor 31

32 R-Format Instructions III ALU has: two 64-bit inputs for operands and a 64 bit output A 4-bit input to select which function to perform A 1-bit output that is asserted when the result of the operation is zero Chapter 4 The Processor 32

33 Load/Store Instructions
Fields: opcode (11 bits) | address (9 bits) | op2 (2 bits) | Rn (5 bits) | Rt (5 bits)
To perform a load/store we need to: read the base address from the Rn register; add the 9-bit signed offset to the base address to get the data address; if load, read memory and update the register; if store, write the register value to memory.

34 Load/Store Instructions II To perform these instructions, we will need a register file and ALU: Register file provides base address and source/destination register ALU will add base address and address offset We will also need a sign extension unit: Unit takes the 32-bit instruction word as input For a load/store, it extracts a 9-bit offset For CBZ, it will extract a 19-bit offset Chapter 4 The Processor 34

35 Load/Store Instructions III We also need a data memory unit to read from and write to Data memory unit has: 64-bit address input to select memory location 64-bit input for write operations 64-bit output for read operations Input MemWrite is asserted to enable writing Input MemRead is asserted to specify a read operation Chapter 4 The Processor 35

36 Branch Instruction (CBZ)
Fields: opcode (8 bits) | Offset (19 bits) | Rn (5 bits)
Requires a register to test for zero, and a signed address offset. The register is passed through the ALU to the output, and the ALU's zero output is set based upon this value. We also need to calculate the branch target address, which is the address to load into the PC register if the branch is taken (i.e. the register is equal to zero). To calculate the branch target address: sign-extend the displacement to 64 bits; shift the displacement left 2 places to multiply by 4 (the displacement is how many words to jump, but each word is 4 bytes); add it to the current value of the PC register (which is the address of the branch instruction).
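The three steps for the branch target (sign-extend, shift left by 2, add to PC) can be sketched as follows. The helper names `sign_extend` and `cbz_branch_target` are assumptions for this example; the 19-bit width is the offset size given on the slide.

```python
MASK64 = (1 << 64) - 1


def sign_extend(value: int, bits: int) -> int:
    """Interpret the low `bits` bits of value as a two's-complement number."""
    sign_bit = 1 << (bits - 1)
    return (value & (sign_bit - 1)) - (value & sign_bit)


def cbz_branch_target(pc: int, offset_field: int) -> int:
    """Branch target = PC + (sign-extended 19-bit displacement shifted left by 2)."""
    displacement_words = sign_extend(offset_field, 19)
    displacement_bytes = displacement_words << 2  # each word is 4 bytes
    return (pc + displacement_bytes) & MASK64


print(hex(cbz_branch_target(0x1000, 3)))        # forward by 3 words
print(hex(cbz_branch_target(0x1000, 0x7FFFE)))  # 19-bit encoding of -2: back 2 words
```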

37 Branch Instruction II Shows segment of datapath that handles branches Adds a dedicated adder unit to calculate the branch target address Note: Figure 4.9 of the text incorrectly used PC +4 instead of PC in the top adder Chapter 4 The Processor 37

38 Composing the Elements Our first datapath version will execute one instruction in each clock cycle This means: Each datapath element can only do one function at a time Hence, we need separate instruction and data memories Also need two dedicated adders for calculating next instruction address Use multiplexers where alternate data sources are used for different instructions Add needed control signals Chapter 4 The Processor 38

39 R-Type/Load/Store Datapath They can share register file and ALU, but: Bottom ALU input needs to choose between offset and register Destination register input needs to choose between ALU output and data memory Chapter 4 The Processor 39

40 Full Datapath Still not using ALU zero output to determine PCSrc mux input Chapter 4 The Processor 40

41 4.4 A Simple Implementation Scheme: ALU Control. We will now look at the design of the control unit. To simplify the design of the main control unit, we will first design a simple controller for our ALU. Our ALU offers six functions that we need. To specify the desired function, we need to specify the value of the ALU's four control bits.

42 ALU Control II. The table below shows the available functions and the corresponding 4-bit selection patterns.
ALU control | Function
0000 | AND
0001 | OR
0010 | add
0110 | subtract
0111 | pass input b
1100 | NOR
The ALU is used as follows when executing the indicated instruction type. Load/Store: need to add the base register to the offset. CBZ branch: need to pass the register to the output. R-type: operation depends on the opcode.

43 ALU Control III. We assume the main control unit will provide a 2-bit ALUOp signal that is derived from the opcode. This will determine ALU control for all but the R-type instructions. We use X to indicate a don't care condition.
opcode | ALUOp | Operation | Opcode field | ALU function | ALU control
LDUR | 00 | load register | XXXXXXXXXXX | add | 0010
STUR | 00 | store register | XXXXXXXXXXX | add | 0010
CBZ | 01 | compare and branch on zero | XXXXXXXXXXX | pass input b | 0111
R-type | 10 | add | | add | 0010
R-type | 10 | subtract | | subtract | 0110
R-type | 10 | AND | | AND | 0000
R-type | 10 | ORR | | OR | 0001
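The mapping from ALUOp to the 4-bit ALU control value can be sketched as a lookup. This is an illustrative model only: the R-type case is keyed on the instruction mnemonic rather than on opcode bit patterns (which are not reproduced here), and the names `ALU_CONTROL`, `R_TYPE_FUNCTION` and `alu_control` are assumptions.

```python
# 4-bit ALU control patterns, as listed in the ALU control table.
ALU_CONTROL = {
    "AND": 0b0000,
    "OR": 0b0001,
    "add": 0b0010,
    "subtract": 0b0110,
    "pass input b": 0b0111,
    "NOR": 0b1100,
}

# ALU function used by each R-type instruction in the subset.
R_TYPE_FUNCTION = {"ADD": "add", "SUB": "subtract", "AND": "AND", "ORR": "OR"}


def alu_control(alu_op: int, mnemonic: str = "") -> int:
    """Return the ALU control bits for a 2-bit ALUOp (mnemonic only matters for R-type)."""
    if alu_op == 0b00:   # LDUR / STUR: add base register and offset
        return ALU_CONTROL["add"]
    if alu_op == 0b01:   # CBZ: pass the register through to test for zero
        return ALU_CONTROL["pass input b"]
    if alu_op == 0b10:   # R-type: function depends on the instruction
        return ALU_CONTROL[R_TYPE_FUNCTION[mnemonic]]
    raise ValueError(f"unused ALUOp value: {alu_op:02b}")


print(f"{alu_control(0b00):04b}")          # load/store
print(f"{alu_control(0b10, 'SUB'):04b}")   # R-type subtract
```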

44 ALU Control Truth Table Next step is to create a truth table Can then use standard methods to derive a Boolean logic expression for each bit of ALU Control signal Each Boolean expression implemented as a combinatorial circuit Used don't cares to minimize displayed rows Chapter 4 The Processor 44

45 The Main Control Unit We will create the main Control Unit We note that our control signals will be derived from our 32 bit instruction word Note: figure below as well as Fig 4.14 in text are incorrect. Opcode for conditional branch is 31:24, NOT 31:26. Chapter 4 The Processor 45

46 The Main Control Unit II. We make the following important observations: Opcodes are found in bits 31:21. The first register operand is bits 9:5 (Rn), for R-type instructions and for the base register of a load/store. The second register operand is bits 20:16 (Rm) for R-type instructions, but bits 4:0 (Rt) for a store operation and for the register tested by CBZ; this means we require a multiplexor to make the selection. The destination register for R-type and load operations is bits 4:0.
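The bit positions listed above can be illustrated with a small field extractor. This is a sketch under assumptions: the helper names and the `use_rt` flag (standing in for the select input of the multiplexor mentioned above) are hypothetical, and the opcode value in the usage example is an arbitrary bit pattern, not a real LEGv8 encoding.

```python
def bits(word: int, hi: int, lo: int) -> int:
    """Extract bits hi:lo (inclusive) of a 32-bit instruction word."""
    return (word >> lo) & ((1 << (hi - lo + 1)) - 1)


def decode_fields(instruction: int) -> dict:
    return {
        "opcode": bits(instruction, 31, 21),
        "Rm": bits(instruction, 20, 16),
        "Rn": bits(instruction, 9, 5),
        "Rd_or_Rt": bits(instruction, 4, 0),
    }


def second_read_register(instruction: int, use_rt: bool) -> int:
    """Multiplexor choosing the second register number: Rt (4:0) or Rm (20:16)."""
    fields = decode_fields(instruction)
    return fields["Rd_or_Rt"] if use_rt else fields["Rm"]


# Arbitrary example word: opcode bits 0b10101010101, Rm = 3, Rn = 2, Rd = 1.
word = (0b10101010101 << 21) | (3 << 16) | (2 << 5) | 1
print(decode_fields(word))
print(second_read_register(word, use_rt=False), second_read_register(word, use_rt=True))
```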

47 Datapath With Control Unit Chapter 4 The Processor 47

48 Logic for Control Signals. Read Figure 4.16 in the text for a good description of the purpose and operation of the main control signals. The values of the control signals are determined by the instruction's opcode alone. Read the description for Figure 4.18 in the text.

49 Truth Table For Control Signals Use standard methods to derive Boolean logic expressions for each control signal Implement each expression as combinatorial circuit Used don't cares to minimize size Chapter 4 The Processor 49

50 Operation of R-type Instruction Consider operation of datapath for: ADD X1, X2,X3 Happens in one clock cycle, but steps ordered by flow of information 1. On rising clock edge, new instruction address is loaded into PC register 2. The instruction at this address is loaded 3. Registers X2 and X3 are read from register file, while control unit (then ALU control) sets the control signals 4. ALU does specified operation on data read from register file (ADD, SUB, AND, ORR) 5. Output of ALU directed to Write data input of register file (X1) 6. On next rising edge of clock, data saved to X1, while PC register will be loaded with next instruction address (PC + 4), and process repeats Chapter 4 The Processor 50

51 R-Type Instruction Chapter 4 The Processor 51

52 Operation of Load Instruction. Consider operation of datapath for: LDUR X1, [X2,offset] 1. On rising clock edge, new instruction address is loaded into PC register 2. The instruction at this address is loaded 3. Register X2 is read from register file, while control unit (then ALU control) sets the control signals 4. ALU computes the sum of X2 and the sign-extended address offset 5. Output of ALU used for address of Data memory (MemRead asserted) 6. On next rising edge of clock, output of data memory saved to X1, while PC register will be loaded with next instruction address (PC + 4), and process repeats

53 Load Instruction For store operation, read description of Figure 4.20 in text Chapter 4 The Processor 53

54 Operation of CBZ Instruction. Consider operation of datapath for: CBZ X1, offset 1. On rising clock edge, new instruction address is loaded into PC register 2. The instruction at this address is loaded 3. Register X1 is read from register file (Read Reg 2) using I[4:0], while control unit (then ALU control) sets the control signals 4. ALU passes value of Read data 2 to output, setting signal zero = 1 if result = 0, and zero = 0 otherwise 5. Value of PC is added to sign-extended offset shifted left by 2 (i.e. branch target address) 6. On next rising edge of clock, PC register loaded with branch target address if zero asserted, otherwise loaded with next instruction address (PC + 4), and process repeats

55 CBZ Instruction Chapter 4 The Processor 55

56 Implementing Unconditional Branch
B instruction fields: opcode 5 (bits 31:26) | Offset (bits 25:0)
Like CBZ, but with a 26-bit offset, and it always branches. Branch target calculated by adding PC to the sign-extended offset shifted left by 2. Need to add to the control unit a new output called Uncondbranch that is only true when bits 31:26 of the instruction equal 5 (i.e. B instruction). Need to add an OR gate for the select input of the top right multiplexor. Need to extend the sign-extend unit to be able to also select the 26-bit offset from a B instruction.

57 Datapath With B Added Chapter 4 The Processor 57

58 Performance Issues. The single-cycle design will work correctly, but is too inefficient for modern designs. Every instruction must have the same clock period, which means the longest delay determines the clock period. Critical path: load instruction (instruction memory -> register file -> ALU -> data memory -> register file). It is not feasible to vary the period for different instructions. This violates the design principle of making the common case fast. We will improve performance by pipelining.

59 4.5 An Overview of Pipelining: Pipelining Analogy. Consider the steps needed to process multiple loads of laundry with your roommate: 1. Place load in washer 2. Move load to dryer 3. Place dry load on table and fold 4. Have roommate put clothes away 5. Go to step 1 until all loads finished. This is not a very time-efficient approach, as it doesn't take into account that steps 1-4 can be done in parallel.

60 Pipelining Analogy II Assuming that we have the resources that each step can be done at same time, we can overlap execution For example, immediately start next load in washer after we move wet clothes to dryer Pipelining is a technique where multiple steps (called stages) are operated concurrently For Laundry, after the first load was finished, all four stages would be active, but on different loads Chapter 4 The Processor 60

61 Pipelining Analogy III. For the sequential version, each load takes 2 hours. For the pipelined version, the first load takes 2 hours, but each additional load takes only an additional 30 minutes! Four loads: speedup = 8/3.5 = 2.3. Non-stop (n loads): speedup = 2n/(0.5n + 1.5), which approaches 4 = number of stages. As n gets large, the speedup tends to the number of stages.
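The laundry speedup figures can be checked numerically. The sketch below assumes four stages of 30 minutes each, as in the analogy; the function name `laundry_speedup` is made up for this example.

```python
def laundry_speedup(n_loads: int, stage_hours: float = 0.5, stages: int = 4) -> float:
    """Speedup of pipelined over sequential laundry for n_loads loads."""
    sequential = n_loads * stages * stage_hours                   # 2 hours per load
    pipelined = stages * stage_hours + (n_loads - 1) * stage_hours  # 2 h, then 0.5 h each
    return sequential / pipelined


for n in (4, 100, 1_000_000):
    print(n, round(laundry_speedup(n), 3))
```

With the default values the expression reduces to 2n/(0.5n + 1.5), so the ratio climbs toward 4 as the number of loads grows.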

62 Pipelining Analogy IV. The time spent per stage (cycle time) is the time required for the longest stage. Laundry must be passed forward by all stages at the same time. We go from a cycle time of 2 hours to one of 30 minutes. After the initial latency to complete the first load, the completion time for each load is that of the new cycle time. We have not changed latency, but have improved throughput.

63 LEGv8 Pipeline We can now apply this concept to instruction execution LEGv8 requires five steps to execute instructions IF: Fetch instruction from memory ID: Decode instruction and read register EX: Execute operation or calculate address MEM: Access operand in data memory (if needed) WB: Write result back to register (if needed) Each step becomes one stage of the instruction pipeline Clock period now only needs to be long enough for slowest stage to complete Chapter 4 The Processor 63

64 Pipeline Performance. Assume elapsed times for execution stages are: 100ps for register read or write, 200ps for other stages. Elapsed time for each instruction class is shown in the table below.
Instr | Instr fetch | Register read | ALU op | Memory access | Register write | Total time
LDUR | 200ps | 100ps | 200ps | 200ps | 100ps | 800ps
STUR | 200ps | 100ps | 200ps | 200ps | | 700ps
R-format | 200ps | 100ps | 200ps | | 100ps | 600ps
CBZ | 200ps | 100ps | 200ps | | | 500ps

65 Pipeline Performance II Single-cycle (Tc= 800ps) Pipelined (Tc= 200ps) Chapter 4 The Processor 65
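The two clock periods can be derived from the stage times assumed on the previous slide (100ps for register read or write, 200ps for other stages). The sketch below is illustrative; the dictionary layout and stage names are assumptions chosen for this example.

```python
# Stage times in picoseconds for each instruction class (unused stages omitted).
STAGE_TIMES_PS = {
    "LDUR":     {"fetch": 200, "reg_read": 100, "alu": 200, "mem": 200, "reg_write": 100},
    "STUR":     {"fetch": 200, "reg_read": 100, "alu": 200, "mem": 200},
    "R-format": {"fetch": 200, "reg_read": 100, "alu": 200, "reg_write": 100},
    "CBZ":      {"fetch": 200, "reg_read": 100, "alu": 200},
}

# Total time per instruction class.
totals = {name: sum(stages.values()) for name, stages in STAGE_TIMES_PS.items()}

# Single-cycle: the clock must fit the slowest whole instruction.
single_cycle_tc = max(totals.values())

# Pipelined: the clock must fit the slowest single stage.
pipelined_tc = max(t for stages in STAGE_TIMES_PS.values() for t in stages.values())

print(totals)
print(single_cycle_tc, pipelined_tc, single_cycle_tc / pipelined_tc)
```

The ratio of the two clock periods here is 4 rather than the number of stages, because the register read and write stages are shorter than the others, so the stages are not perfectly balanced.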

66 Pipeline Speedup. If we execute 4 instructions, this takes 800 x 4 = 3200 ps for the single-cycle version. For the pipelined version: 800 + 200 x 3 = 1400 ps. Speedup is 3200/1400 = 2.29. The reason is that with only 4 instructions, the first one dominates. For a program executing billions of instructions, this should tend to a speedup of 5, the number of stages of the pipeline, assuming that the execution time of each stage is about the same.

67 Pipeline Speedup II. If all stages are balanced (i.e., all take the same time):
$$\text{Time between instructions}_{\text{pipelined}} = \frac{\text{Time between instructions}_{\text{nonpipelined}}}{\text{Number of stages}}$$
If not balanced, speedup is less. Speedup is due to increased throughput; latency (time for each instruction) does not decrease.

68 Pipelining and ISA Design. The LEGv8 ISA was designed for pipelining. All instructions are 32 bits: easier to fetch in one stage and decode in the next (compare with x86: 1- to 15-byte instructions). Few and regular instruction formats: can decode and read registers in one step. Load/store addressing: memory operands only appear in loads and stores, which frees up the ALU to calculate the address in the 3rd stage of the pipeline and access memory in the 4th stage (compare with x86, which can perform operations such as ADD on operands in memory and so would need an extra stage).

69 Hazards. Hazards are situations when the next instruction cannot execute in the following clock cycle. There are three types. A structural hazard is when a required resource is needed at the same time by more than one instruction. A data hazard is when an instruction cannot execute a stage because it is waiting on data from an earlier instruction, e.g. waiting for a result to be written to a destination register. A control hazard occurs when the instruction that was fetched was not the one needed, e.g. we don't know which instruction is needed after a conditional branch until the branch is evaluated.

70 Structural Hazards This is when different instructions have a conflict for simultaneous use of a resource If LEGv8 pipeline used a single memory we could get a structural hazard Both load and store require data access If tried to also do an instruction fetch at same time, pipeline would have to stall for that cycle Would cause a pipeline bubble Hence, pipelined datapaths require separate instruction/data memories Or separate instruction/data caches Chapter 4 The Processor 70

71 Data Hazards. Occurs when an instruction depends on completion of data access by a previous instruction:
    ADD X19, X0, X1
    SUB X2, X19, X3
The result of the ADD will not be written to X19 in time to be read for the SUB operation, causing a delay.

72 Forwarding (aka Bypassing). In forwarding, we use a result as soon as it is computed; we don't wait for it to be stored in a register. This requires extra connections in the datapath.

73 Load-Use Data Hazard. We can't always avoid stalls by forwarding. Consider a data load followed by a SUB instruction: if the value is not ready when needed, we can't forward backward in time! Even with forwarding, the data won't be retrieved until one cycle too late. When we need to delay the pipeline by a cycle or more, we call this a pipeline stall.

74 Code Scheduling to Avoid Stalls. To avoid a stall, the compiler can often help: it can reorder code to avoid use of a load result in the next instruction. C code for A = B + E; C = B + F;
Original order (13 cycles):
    LDUR X1, [X0,#0]
    LDUR X2, [X0,#8]
    (stall)
    ADD  X3, X1, X2
    STUR X3, [X0,#24]
    LDUR X4, [X0,#16]
    (stall)
    ADD  X5, X1, X4
    STUR X5, [X0,#32]
Reordered (11 cycles):
    LDUR X1, [X0,#0]
    LDUR X2, [X0,#8]
    LDUR X4, [X0,#16]
    ADD  X3, X1, X2
    STUR X3, [X0,#24]
    ADD  X5, X1, X4
    STUR X5, [X0,#32]
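The 13-cycle and 11-cycle counts can be reproduced with a simple model. This sketch assumes a 5-stage pipeline with forwarding, where the only stall is one cycle when an instruction uses the result of the LDUR immediately before it. The tuple encoding of instructions and the function name `count_cycles` are assumptions for this example.

```python
# Each instruction is (mnemonic, destination register or None, source registers).
ORIGINAL = [
    ("LDUR", "X1", ("X0",)),
    ("LDUR", "X2", ("X0",)),
    ("ADD",  "X3", ("X1", "X2")),
    ("STUR", None, ("X3", "X0")),
    ("LDUR", "X4", ("X0",)),
    ("ADD",  "X5", ("X1", "X4")),
    ("STUR", None, ("X5", "X0")),
]

REORDERED = [
    ("LDUR", "X1", ("X0",)),
    ("LDUR", "X2", ("X0",)),
    ("LDUR", "X4", ("X0",)),
    ("ADD",  "X3", ("X1", "X2")),
    ("STUR", None, ("X3", "X0")),
    ("ADD",  "X5", ("X1", "X4")),
    ("STUR", None, ("X5", "X0")),
]


def count_cycles(program, stages: int = 5) -> int:
    """Cycles to run the program: pipeline fill + one per instruction + load-use stalls."""
    stalls = 0
    for previous, current in zip(program, program[1:]):
        mnemonic, dest, _ = previous
        # Load-use hazard: the instruction right after a load reads the loaded register.
        if mnemonic == "LDUR" and dest in current[2]:
            stalls += 1
    return (stages - 1) + len(program) + stalls


print(count_cycles(ORIGINAL), count_cycles(REORDERED))
```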

75 Control Hazards Control Hazards are caused by branch instructions Conditional branches determine whether we execute the next instruction or the one at the branch target We don't know which until after the condition is evaluated Can try to guess which instruction is next Wrong guess means flushing the pipeline and loading the correct instruction In LEGv8 pipeline Need to compare registers and compute target early in the pipeline Add extra hardware to do it in ID stage Still causes a one cycle delay Chapter 4 The Processor 75

76 Stall on Branch One solution to deal with control hazards: Wait until branch outcome determined before fetching next instruction For LEGv8, adds an extra cycle Chapter 4 The Processor 76

77 Branch Prediction. Long pipelines can't readily determine the branch outcome early, so the stall penalty becomes unacceptable. A better option is to predict the outcome of the branch and only stall if the prediction is wrong. In the LEGv8 pipeline: we can assume (predict) a branch will never be taken; fetch the instruction after the branch, with no delay; if the branch is actually taken, flush the pipeline and load the branch target instruction.

78 More-Realistic Branch Prediction 1. Static branch prediction Based on typical branch behavior Example: for loop and if-statement branches Always predict backward branches taken Always predict forward branches not taken 2. Dynamic branch prediction Hardware measures actual branch behavior e.g., record recent history of each branch Assume future behavior will continue the trend When wrong, stall while re-fetching, and update history Chapter 4 The Processor 78
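Both ideas can be sketched briefly. The static rule below follows the slide (backward branches predicted taken, forward branches not taken). The dynamic predictor is only one possible illustration of "record recent history": it remembers the last outcome of each branch, and the class and method names are assumptions rather than a description of any particular hardware.

```python
def predict_static(branch_offset: int) -> bool:
    """Static prediction: backward branches (negative offset) are predicted taken."""
    return branch_offset < 0


class LastOutcomePredictor:
    """Dynamic prediction sketch: predict that each branch repeats its last outcome."""

    def __init__(self):
        self.history = {}  # branch address -> last observed outcome

    def predict(self, pc: int) -> bool:
        # Branches never seen before default to not taken (an assumption).
        return self.history.get(pc, False)

    def update(self, pc: int, taken: bool) -> None:
        self.history[pc] = taken


# A loop branch at a hypothetical address: taken 9 times, then falls through.
predictor = LastOutcomePredictor()
outcomes = [True] * 9 + [False]
mispredictions = 0
for taken in outcomes:
    if predictor.predict(0x2000) != taken:
        mispredictions += 1
    predictor.update(0x2000, taken)
print(mispredictions)
```

In this run the predictor is wrong only on the first iteration and on loop exit, which shows why following the recent trend works well for loop branches.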

79 Pipeline Summary (The BIG Picture). Pipelining improves performance by increasing instruction throughput: it executes multiple instructions in parallel, while each instruction has the same latency. Parallelism is achieved without requiring action by the programmer. Pipelining is subject to hazards: structural, data, control. Instruction set design affects the complexity of the pipeline implementation.

80 4.6 Pipelined Datapath and Control LEGv8 Pipelined Datapath Instructions and data mostly move from left to right Exceptions: branch target and register writes Chapter 4 The Processor 80

81 Pipeline registers Once a pipeline is full, each stage is processing part of a different instruction This means each stage needs: To be independent of previous stages To keep a copy of the instruction information that either it needs, or that future stages will need We require registers between stages They will hold needed information produced in previous cycles Chapter 4 The Processor 81

82 Pipeline registers II Figure shows pipeline registers (highlighted) added between stages PC register and pipeline registers assumed to store information each cycle, so no write enable needed Chapter 4 The Processor 82

83 Pipeline Operation. We will first examine the flow of data through the pipeline, focusing on what data needs to be saved in the pipeline registers at each stage. We will then examine the flow of control signals through the pipeline. We will first look at a single-clock-cycle pipeline diagram, which shows pipeline usage for a single cycle at a time and highlights the resources used. We'll look at single-clock-cycle diagrams for load and store.

84 IF Stage for Load, Store, Diagram shows which resources (blue) are being used by current stage On rising edge of the clock, PC register loads new instruction address This progresses to address selection for Instruction memory Selected 32 bit instruction appears at data port of memory Instruction and 64 bit PC register content arrive at input of IF/ID Value of PC+4 computed and fed to multiplexor for input of PC register Chapter 4 The Processor 84

85 IF stage for Load, Store, II Chapter 4 The Processor 85

86 ID for Load, Store, On the rising edge of the clock, the current instruction and PC register contents from the previous stage are saved to the IF/ID register. The new instruction stored in the IF/ID register is then decoded by the main controller to generate control signals. The instruction stored in IF/ID is used to select read/write registers. The instruction is also used by the sign-extension unit. The stored PC value, the output of the two read registers, and the sign-extended offset arrive at the input of the ID/EX register. We need to store any data in the next pipeline register that may be needed by a later stage.

87 ID for Load, Store, II Chapter 4 The Processor 87

88 EX for Load On rising edge of the clock, the stored PC value, the output of the two read registers, and the sign-extended offset are stored in the ID/EX register The new output (PC register and offset) of the ID/EX register is used to calculate a branch target address in case it is needed The stored offset and the stored Read data 1 are used by the ALU to calculate the desired data memory location The branch target address, the output of the ALU (zero and operation result), and the stored value from Read data 2 arrive at input of EX/MEM register Chapter 4 The Processor 88

89 EX for Load II

90 MEM for Load: On the rising edge of the clock, the branch target address, the output of the ALU (zero and operation result), and the stored value from Read data 2 are stored in the EX/MEM register. The new output of the EX/MEM register supplies the address and write data inputs of the data memory unit. The stored branch target is fed back to the input multiplexor for the PC register. The stored output of the ALU (operation result) and the Read data output of the data memory arrive at the input of the MEM/WB register.

91 MEM for Load II

92 WB for Load: On the rising edge of the clock, the stored output of the ALU (operation result) and the Read data output of the data memory are stored in the MEM/WB register. The new output of the MEM/WB register is connected to the two inputs of the multiplexor that feeds the Write data input of the register file. The output of the multiplexor arrives at the Write data input of the register file. On the next rising edge, this data will be stored in the selected write register of the register file.

93 WB for Load (with error) II: Wrong register number. We see now that we have made a design error. When we feed back data to be written to the write register, we are using the select bits from the wrong instruction: the register number comes from the instruction currently held in IF/ID, not from the load that is now in WB.

94 Corrected Datapath for Load: During the instruction decode stage of our load instruction, we need to save the write register select bits to our pipeline registers. We then feed back the saved select bits together with the data to be written.
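The difference between the faulty and the corrected datapath can be shown with a small simulation. This is a sketch under simplifying assumptions: only the write register number is tracked, one instruction enters ID per cycle, and the function name `write_back_targets` and the flag `carry_write_reg` are hypothetical.

```python
def write_back_targets(dest_regs, carry_write_reg):
    """Return the register number used at WB for each instruction, in program order.

    dest_regs: intended write register number of each instruction.
    carry_write_reg: True models the corrected datapath (register number saved in
    ID/EX, EX/MEM, MEM/WB); False models the faulty one (number taken from IF/ID).
    """
    id_ex = ex_mem = mem_wb = None   # write register fields held in pipeline registers
    targets = []
    # Three trailing bubbles (None) drain the pipeline after the last instruction.
    for in_id in list(dest_regs) + [None] * 3:
        if mem_wb is not None:       # some instruction is in WB during this cycle
            targets.append(mem_wb if carry_write_reg else in_id)
        # Rising clock edge: each pipeline register captures the previous stage's value.
        id_ex, ex_mem, mem_wb = in_id, id_ex, ex_mem
    return targets


print(write_back_targets([1, 2, 3, 4], carry_write_reg=True))   # [1, 2, 3, 4]
print(write_back_targets([1, 2, 3, 4], carry_write_reg=False))  # [4, None, None, None]
```

In the faulty version, the first instruction writes to the register named by the instruction three places behind it, because that is the instruction sitting in IF/ID when the first one reaches WB.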

95 MEM for Store: For a store operation, the memory stage is the last active stage. The instruction still has to pass through the write-back stage, as later instructions are already progressing at the maximum rate.

96 Multi-Cycle Pipeline Diagram: Form showing resource usage for each instruction as it progresses over time.

97 Multi-Cycle Pipeline Diagram II: More traditional form of this type of diagram.

98 Single-Cycle Pipeline Diagram: State of the pipeline in a single given cycle.
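The relationship between the two diagram types can be sketched in code: a multi-cycle diagram has one row per instruction and one column per clock cycle, and a single-cycle diagram is one column of it. The function names and the instruction strings below are illustrative assumptions, and the model ignores hazards and stalls.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]


def pipeline_diagram(instructions):
    """Return (instruction, row) pairs; row[c] is the stage occupied in cycle c + 1."""
    n_cycles = len(instructions) + len(STAGES) - 1
    rows = []
    for i, name in enumerate(instructions):
        row = [""] * n_cycles
        for s, stage in enumerate(STAGES):
            row[i + s] = stage          # instruction i enters IF in cycle i + 1
        rows.append((name, row))
    return rows


def single_cycle_slice(instructions, cycle):
    """Return {stage: instruction} for one 1-based clock cycle."""
    state = {}
    for i, name in enumerate(instructions):
        s = cycle - 1 - i               # stage index of instruction i in this cycle
        if 0 <= s < len(STAGES):
            state[STAGES[s]] = name
    return state


program = ["LDUR X10, [X1, #40]", "SUB X11, X2, X3", "ADD X12, X3, X4"]
for name, row in pipeline_diagram(program):
    print(f"{name:22}" + " ".join(f"{stage:4}" for stage in row))
print(single_cycle_slice(program, 3))
```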

99 Pipelined Control (Simplified): We now need to add a controller. We start by adding control signals to the existing design.

100 Pipelined Control Design: We first note that this design ignores the data and control hazards we discussed in Section 4.5. We will borrow as much as we can from the single-cycle design. We will keep the same main and ALU controllers, branch logic, control signals, and multiplexor design. The main controller will create its control signals during the ID stage, and we will then pass the needed control signals forward via the pipeline registers.

101 Pipelined Control Design II: As the PC and pipeline registers are written on each clock cycle, they don't need separate write signals. As each control signal is associated with a component active only during a single stage, we can associate each signal with a stage. IF: nothing needed here. ID: need to set Reg2Loc. EX: need to set ALUOp and ALUSrc. MEM: need to set Branch, MemRead, and MemWrite. WB: need to set MemToReg and RegWrite.

102 Pipelined Control Propagation: Control signals are derived from the instruction during the ID stage. Signals needed for the ID stage are kept local. The rest are saved into pipeline registers and passed forward to the stage where they are needed.
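One way to picture this propagation is as a bundle of signals that shrinks as it moves down the pipeline. The sketch below uses the per-stage grouping listed earlier in these slides. The function names are hypothetical, and the control values for a load are assumed for illustration (Reg2Loc is treated as a don't-care and set to 0).

```python
# Control signals grouped by the stage that consumes them.
SIGNALS_BY_STAGE = {
    "EX": ("ALUOp", "ALUSrc"),
    "MEM": ("Branch", "MemRead", "MemWrite"),
    "WB": ("MemToReg", "RegWrite"),
}


def decode_load():
    """Assumed main-controller output for a load; values are illustrative."""
    return {"Reg2Loc": 0, "ALUOp": 0b00, "ALUSrc": 1, "Branch": 0,
            "MemRead": 1, "MemWrite": 0, "MemToReg": 1, "RegWrite": 1}


def split_for_pipeline(control):
    """Keep the ID-stage signal local; everything else goes into ID/EX."""
    id_local = {"Reg2Loc": control["Reg2Loc"]}
    id_ex = {name: control[name]
             for stage in ("EX", "MEM", "WB")
             for name in SIGNALS_BY_STAGE[stage]}
    return id_local, id_ex


def advance(bundle, stage):
    """Return (signals used in this stage, signals saved to the next pipeline register)."""
    used = {name: bundle[name] for name in SIGNALS_BY_STAGE[stage]}
    remaining = {name: value for name, value in bundle.items() if name not in used}
    return used, remaining


id_local, id_ex = split_for_pipeline(decode_load())
ex_used, ex_mem = advance(id_ex, "EX")
mem_used, mem_wb = advance(ex_mem, "MEM")
print(sorted(ex_mem), sorted(mem_wb))
```

Each pipeline register only needs to hold the signals for the stages that come after it, so the control portion of the registers gets narrower toward the WB stage.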

103 Pipelined Control: The next slide shows the full design, including the pipelined control signals and the stage in which each is used. As instruction bits 31:21 are used by the ALU control unit in the EX stage, we must add these to the ID/EX register.

104 Pipelined Control II

105 Read for Own Interest: Read Sections 4.7, 4.8, 4.9, and 4.10 for your own interest.

106 Read On Your Own: Read Sections 4.14 and 4.15 on your own.
