Chapter 4 The Processor
Introduction CPU performance factors CPI Clock Cycle Time Instruction count Determined by ISA and compiler CPI and Cycle time Determined by CPU hardware We will examine two MIPS implementations A simplified version (Single cycle implementation) A more realistic pipelined version Multi cycle version is removed in this version) Implement simple subset, but shows most aspects Memory reference: lw, sw Arithmetic/logical: add, sub, and, or, slt Control transfer: beq, j CPU Time Instruction Count 4. Introduction 3
Review: MIPS Instruction Set Architecture (ISA) Instruction Categories Arithmetic Load/Store Jump and Branch Floating Point coprocessor Memory Management Special 3 Instruction Formats: all 32 bits wide Registers R R3 PC HI LO OP OP OP rs rt rd sa funct rs rt immediate jump target R format I format J format 4
Review: MIPS Register Convention Name Register Number Usage Preserve on call? $zero constant (hardware) n.a. $at reserved for assembler n.a. $v - $v 2-3 returned values no $a - $a3 4-7 arguments yes $t - $t7 8-5 temporaries no $s - $s7 6-23 saved values yes $t8 - $t9 24-25 temporaries no $gp 28 global pointer yes $sp 29 stack pointer yes $fp 3 frame pointer yes $ra 3 return addr (hardware) yes
Instruction Execution PC (Program counter) is used to fetch instruction in the instruction memory) After instruction is obtained, register numbers in instructions is used to read registers in register files. PC PC +4 for sequentially execution 6
Different actions for different instruction classes Use ALU to calculate Arithmetic result Memory address for load/store Branch target address Access data memory for load/store PC target address add $t, $s, $s2 lw $s, 2($s2) bne $t, $s5, Exit lw $s, 2($s2) j Loop 7
Multiplexers Can t just join wires together Use multiplexers 8
CPU Overview We need three Muxes to avoid problems Details of each Mux and Control will be introduced later 9
Logic Design Basics Information encoded in binary Low voltage =, High voltage = One wire per bit Multi bit data encoded on multi wire buses 4.2 Logic Design Conventions Combinational element (See next slide) Operate on data Output is a function of input State (sequential) elements 32 bit bus Output is a function of input and current states Store information
Review: Combinational Elements AND gate Y = A & B Adder Y = A + B A B + Y A B Multiplexer I I M u x S Y Y = S? I : I Y Arithmetic/Logic Unit Y = F(A, B) A ALU Y B F
Review: Sequential Elements Register: stores data in a circuit Uses a clock signal to determine when to update the stored value Edge triggered: update when Clk changes ( > or > ) The following figure is positive edge triggered: update when Clk changes from to D Clk Q Clk D Q 2
Review: Sequential Elements (with write enable) Register with write control Only updates on clock edge when write control input is Used when stored value is required later Clk D Write Clk Q Write D Q Q is not changed when Write= 3
Building a Datapath Datapath : Elements that process data and addresses in the CPU Registers, ALUs, mux s, memories, 4.3 Building a Datapath We will show how to build MIPS datapath 5
Instruction Fetch 32 bit register Increment by 4 for next instruction 6
Read two register operands Perform arithmetic/logical operation Write results into destination registers add $t, $s, $s2 R Format Instructions Instruction $s $s2 $t Read register Read register 2 Registers Write register Write data 32 32 Read data Read data 2 3 ALU operation Zero ALU ALU result RegWrite 7
Load/Store Instructions (need 4 components) Read register operands =>register files Calculate address using 6 bit offset Use ALU, but sign extend offset Load/store: read memory and update register, and write register value to memory Need data memory lw $t, 4($s3) #load word from memory sw $t, 8($s3) #store word to memory 2
Load/store Datapath: Load/Store Instruction Instruction Read register Read register 2 Registers Write register Write data RegWrite Read data Read data 2 3 ALU operation Zero ALU ALU result Address Write data MemWrite Data memory Read data 6 Sign 32 extend MemRead Datapath Two additional elements (sign extension and data memory) used To implement load/stores 2
Load e.g. lw $t, 4($s3) Animating the Datapath load op rs rt offset/immediate WD 5 5 RN RN2 WN RD Register File RegWrite 5 RD2 E T N D 6 32 RN: register number RN2: register number 2 WN: register number that will be written WD: write data 6 Operation 3 ALU lw rt, offset(rs) R[rt] <- MEM[R[rs] + s_extend(offset)]; 4 Zero MemWrite ADDR Memory WD MemRead RD 22
store Animating the Datapath store sw $t, 8($s3) op rs rt offset/immediate WD 5 5 RN RN2 WN RD Register File RegWrite 5 RD2 6 E T N D 32 6 Operation 3 ALU sw rt, offset(rs) MEM[R[rs] + sign_extend(offset)] <- R[rt] Zero MemWrite ADDR Memory WD MemRead RD 23
Review: Specifying Branch Destinations MIPS conditional branch instructions: op rs rt offset 6 bits 5 bits 5 bits 6 bits PC relative addressing Target address = PC + offset 4 PC already incremented by 4 by this time from the low order 6 bits of the branch instruction sign extend PC 32 6 offset 32 4 32 32 Add 32 32 Add 32 2 beq $s $t 2 24. 28 2C Target Address (address of next instruction) =? Each word is 4 bytes? 2C branch dst address 24
Read register operands Compare operands Use ALU, subtract and check Zero output Calculate target address Sign extend offset Shift left 2 bits (word displacement) Add to PC + 4 (already calculated by instruction fetch) Datapath: Branch Instructions See animation in the next slide 25
beq Animating the Datapath (beq) e.g. beq $s $t 2 op rs rt offset/immediate 6 PC +4 from instruction datapath ADD WD 5 5 RN RN2 WN RD Register File Operation ALU Zero <<2 RegWrite RD2 6 E T N D 32 beq rs, rt, offset if (R[rs] == R[rt]) then PC <- PC+4 + s_extend(offset<<2) 26
Sign extension and shift left by 2 hardware Simple hardware is used for sign extension and shift left by 2 msb lsb Shift left by 2 lsb Signed extension hardware 27 Sign bit wire replicated
Composing the Elements Make Data path do an instruction in one clock cycle Each datapath element can only do one function at a time Hence, we need separate instruction and data memories Use multiplexers where alternate data sources are used for different instructions 28
R Type/Load/Store Datapath A Single Cycle Datapath Correct Control signal (RegWrite, ALUSrc, ALU operation, MemWrite, MemtoReg, MemRead) are needed to make sure correct operation is done 29
Full Datapath (Single Cycle Datapath) 3
Control for the single cycle CPU Designing a processor A single cycle implementation (the datapath) Control for the single cycle CPU Control of CPU operations ALU controller Main controller 4.4 A Simple Implementation Scheme 3
Next: Building Datapath With Control
Main Control and ALU Control Main Control: Based on opcode: generate RegDst, Branch, MemRead MemtoReg, ALUOp MemWrite, ALUSrc, RegWrite ALU Control: Based on 2 bit ALUop and the 6 bit func field of instruction, the ALU control unit generates the 3 bit ALU control field 3 26 2 6 6 R-type op rs rt rd shamt funct func Op code 6 6 Main Control ALUop 2 ALU Control Unit ALU control 3 ALU 33
Deciding ALU Control Assume 2 bit ALUOp derived from opcode Combinational logic derives ALU control opcode ALUOp Operation funct ALU function ALU control lw load word add? sw store word add? beq branch equal subtract? R-type add add? subtract subtract? AND AND? OR OR? set-on-less-than set-on-less-than?
ALU Control ALU Control has 4 four bits: Ainvert, Binvert, and Operation (2 bits) Function ALU control AND OR add subtract set-on-lessthan NOR If NOR is not supported, we just need three bits (MSB bit is always
ALU used for ALU Control Load/Store: function = add Branch: function = subtract R type: function depends on funct field Assume 2 bit ALUOp derived from opcode Combinational logic derives ALU control opcode ALUOp Operation funct ALU function ALU control lw load word add sw store word add beq branch equal subtract R-type add add subtract subtract AND AND OR OR set-on-less-than set-on-less-than
Control l signal Hopefully, generate Inst. Memory Addr Deciding Main Control Signals Instruction<3:> Op <3:26> <2:25> Rs Rt <6:2> Rd <:5> Control <:5> Imm6 <:5> Funct PCsrc RegDst ALUSrc RegWr MemRd ALUctr (ALUOp) MemWr MemtoReg Equal Datapath 37
Review: The Main Control Unit Control signals derived from instruction R-type Load/ Store Branch rs rt rd shamt funct 3:26 25:2 2:6 5: :6 5: 35 or 43 rs rt offset 3:26 25:2 2:6 5: 4 rs rt offset 3:26 25:2 2:6 5: opcode always read read, except for load write for R type and load sign extend and add
Control Signals for R Type Instruction PC 4 ADDR ADD Instruction Memory RD immediate/ offset I[5:] Control signals shown in blue Instruction I 32 6 rs I[25:2] WD rt I[2:6] 5 5 RN RN2 WN RD Register File RegWrite rd I[5:] 5 MU 5 RD2 E 6 32 T N D RegDst M U ALUSrc <<2??? Operation 3 ALU ADD Value depends on funct Zero M U PCSrc MemWrite ADDR Data Memory WD MemRead RD MemtoReg M U 39
Control Signals: lw Instruction PC 4 ADDR ADD Instruction Memory RD immediate/ offset I[5:] Instruction I 32 6 rs I[25:2] WD rt I[2:6] 5 5 RN RN2 WN RD Register File RegWrite rd I[5:] 5 MU 5 RD2 E 6 32 T N D RegDst M U ALUSrc <<2 Operation 3 ALU ADD Zero M U PCSrc MemWrite ADDR Data Memory WD MemRead RD MemtoReg M U 4
Control Signals: sw Instruction PC 4 ADDR ADD Instruction Memory RD immediate/ offset I[5:] Instruction I 32 6 rs I[25:2] WD rt I[2:6] 5 5 RN RN2 WN RD Register File RegWrite rd I[5:] 5 MU 5 RD2 E 6 32 T N D RegDst M U ALUSrc <<2 Operation 3 ALU ADD Zero M U PCSrc MemWrite ADDR Data Memory WD MemRead RD MemtoReg M U 43
Control Signals: beq Instruction PC 4 ADDR ADD Instruction Memory RD immediate/ offset I[5:] Instruction I 32 6 rs I[25:2] WD rt I[2:6] 5 5 RN RN2 WN RD Register File RegWrite rd I[5:] 5 MU 5 RD2 E 6 32 T N D RegDst M U ALUSrc <<2 Operation 3 ALU ADD Zero M U PCSrc PCSrc= if Zero= MemWrite ADDR Data Memory WD MemRead RD MemtoReg M U 44
Review: Target address of Jump Assume PC=4 6, what is the target address of the jump instruction? 6 bits 26 bits Address in the instruction= x2 Target Address= PC[3:28]+2 6 *4= x484 46
Review: Implementing Jumps Jump 2 address 3:26 25: Jump uses word address Update PC with concatenation of Top 4 bits of old PC 26 bit jump address Need an extra control signal decoded from opcode 26 PC+4 Address (25:) 3:28 + New Target Address 47
Datapath Executing j j PC 4 ADDR ADD Instruction Memory RD jmpaddr I[25:] 26 32 op I[3: Control Unit 6 WD <<2 ALUOp 2 5 5 RN RN2 WN RD Register File RegWrite 28 ALU Control CONCAT op 6 I[3:26] funct 6 I[5:] Instruction I 5 MU 5 RD2 6 E T N D 32 RegDst 32 PC+4[3-28] M U ALUSrc <<2 Operation 3 ALU ADD Zero PCSrc M U M U Jump Branch MemWrite ADDR Data Memory WD MemRead RD MemtoReg M U 48
Truth Table for Main Control Signals Current design of control is for lw, sw, beq, and, or, add, sub, slt I format: lw, sw, beq R format: and, or, add, sub, slt Given 4 OP codes (each 6 bits) as inputs, the outputs are as follows => a main control logic (the next slide) inputs Instruction RegDst ALUSrc Memto Reg outputs Reg Write Mem Read See appendix D for details Mem Write Branch ALUOp ALUOp R format lw sw beq 5
Inputs Outputs Implementation of Main Control Block (Use PLA) Use PLA Signal R- lw sw beq name format Op5 Op4 Op3 Op2 Op Op RegDst x x ALUSrc MemtoReg x x RegWrite MemRead MemWrite Branch ALUOp ALUOP Truth table for main control signals Inputs Op5 Op4 Op3 Op2 Op Op R-format Iw sw beq Outputs RegDst ALUSrc MemtoReg RegWrite MemRead MemWrite Branch ALUOp ALUOpO Main control PLA (programmable logic array) 5 4 3 2 ALUSrc=? 5
Truth Table for ALU control signals inputs outputs Merge LW & SW ALUOp Funct field Operation ALUOp ALUOp F5 F4 F3 F2 F F C 3 C 2 C C 52 add subtract add subtract and or slt
C3= Implementation of ALU Control Block C2 ALUOp= is able to identify branch instruction C C ALUOp ALUOp ALUOp ALU control block C2 ALUOp or ALUOp and F F3 C2 2 3 53 F(5 ) F2 F F C C Operation
Performance Issues Longest delay determines clock period Critical path: load instruction Instruction memory register file ALU data memory register file Not feasible to vary period for different instructions Violates design principle Making the common case fast We will improve performance by pipelining
Backup Slides 55