Systems Architecture
Lecture 15: A Simple Implementation of MIPS
Jeremy R. Johnson, Anatole D. Ruslanov, William M. Mongan
Some or all figures from Computer Organization and Design: The Hardware/Software Interface, Third Edition, by David Patterson and John Hennessy, are copyrighted material (COPYRIGHT 2004 MORGAN KAUFMANN PUBLISHERS, INC. ALL RIGHTS RESERVED).
Lec 15 Systems Architecture 1
Introduction
Objective: To understand how to implement the MIPS instruction set.
  - Combine components (registers, memory, ALU) and add control
  - Fetch-execute cycle
Topics
  - Sequential logic (elements with state) and timing (edge-triggered)
  - Memory
  - Registers
  - Datapath components: instruction memory, PC, adder, register file, ALU, data memory
  - Implement a subset of MIPS in a single-cycle computer
  - Shortcomings of a single-cycle computer
The Processor: Datapath & Control
Implementation of MIPS, simplified to contain only:
  - memory-reference instructions: lw, sw
  - arithmetic-logical instructions: add, sub, and, or, slt
  - control flow instructions: beq, j
Generic implementation:
  - use the program counter (PC) to supply the instruction address
  - get the instruction from memory
  - read registers
  - use the instruction to decide exactly what to do
Instruction Execution
  - PC → instruction memory, fetch instruction
  - Register numbers → register file, read registers
  - Depending on instruction class:
      - Use ALU to calculate
          - arithmetic result
          - memory address for load/store
          - branch target address
      - Access data memory for load/store
      - PC ← target address or PC + 4
12/22/2011 Chapter 4 The Processor 4
Abstract View
Two types of functional units:
  - elements that operate on data values (combinational)
  - elements that contain state (sequential)
Multiplexers
  - Can't just join wires together
  - Use multiplexers
Control
Timing
  - Clocks are used in synchronous logic: when should an element that contains state be updated?
  - Edge-triggered timing
  [Figure: clock waveform labeled with falling edge, rising edge, and cycle time]
Edge-Triggered Timing
  - State updated at clock edge
      - Read contents of some state elements
      - Send values through some combinational logic
      - Write results to one or more state elements
  [Figure: State element 1 → Combinational logic → State element 2, within one clock cycle]
Logic Design Basics (4.2 Logic Design Conventions)
  - Information encoded in binary
      - Low voltage = 0, high voltage = 1
      - One wire per bit
      - Multi-bit data encoded on multi-wire buses
  - Combinational elements
      - Operate on data
      - Output is a function of input
  - State (sequential) elements
      - Store information
Combinational Elements
  - AND-gate: Y = A & B
  - Adder: Y = A + B
  - Multiplexer: Y = S ? I1 : I0
  - Arithmetic/Logic Unit: Y = F(A, B)
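The four combinational elements above can be modeled as pure functions, since each output depends only on the current inputs. The following is a minimal sketch under stated assumptions: the 32-bit width, the function names, and the string operation names passed to `alu` are illustrative choices, not part of the slides.

```python
MASK32 = 0xFFFFFFFF  # assumption: 32-bit buses, so results wrap at 32 bits


def and_gate(a, b):
    """Y = A & B (bitwise, one AND gate per wire)."""
    return a & b


def adder(a, b):
    """Y = A + B, truncated to 32 bits."""
    return (a + b) & MASK32


def mux(s, i0, i1):
    """Y = S ? I1 : I0."""
    return i1 if s else i0


def alu(f, a, b):
    """Y = F(A, B); f is a hypothetical operation name chosen for this sketch."""
    if f == "and":
        return a & b
    if f == "or":
        return a | b
    if f == "add":
        return (a + b) & MASK32
    if f == "sub":
        return (a - b) & MASK32
    raise ValueError(f"unsupported ALU function: {f}")
```

None of these functions holds state between calls, which is what distinguishes them from the sequential elements.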
Sequential Elements
  - Register: stores data in a circuit
      - Uses a clock signal to determine when to update the stored value
      - Edge-triggered: update when Clk changes from 0 to 1
  [Figure: register with inputs D and Clk and output Q, with a timing diagram of Clk, D, and Q]
Sequential Elements
  - Register with write control
      - Only updates on clock edge when write control input is 1
      - Used when stored value is required later
  [Figure: register with inputs D, Clk, and Write and output Q, with a timing diagram of Clk, Write, D, and Q]
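A register with write control can be sketched as a small class that remembers the previous clock level and updates only on a 0-to-1 transition while Write is 1. The class name `Register` and the `tick` method are assumptions made for illustration; real hardware has no such method call, only wires.

```python
class Register:
    """Edge-triggered register with a write-enable input (illustrative model)."""

    def __init__(self, value=0):
        self.q = value        # stored value, visible on output Q
        self._prev_clk = 0    # previous clock level, used to detect a rising edge

    def tick(self, clk, d, write=1):
        """Present Clk, D, and Write; Q changes only on a rising edge with Write == 1."""
        rising_edge = self._prev_clk == 0 and clk == 1
        if rising_edge and write:
            self.q = d
        self._prev_clk = clk
        return self.q
```

Holding Clk high while D changes leaves Q untouched, and so does a rising edge with Write at 0.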
Clocking Methodology
  - Combinational logic transforms data during clock cycles
      - Between clock edges
      - Input from state elements, output to state element
      - Longest delay determines clock period
Components for Simple Implementation
  - Functional units needed for each instruction
  [Figure: a. Instruction memory (instruction address in, instruction out); b. Program counter (PC); c. Adder (Add, Sum)]
  [Figure: a. Registers (5-bit read register 1, read register 2, and write register numbers; write data; read data 1 and 2; RegWrite); b. ALU (3-bit ALU control, Zero, ALU result)]
  [Figure: a. Data memory unit (address, write data, read data, MemWrite, MemRead); b. Sign-extension unit (16 bits in, 32 bits out)]
Instruction Fetch
  - PC: 32-bit register
  - Increment by 4 for next instruction
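Instruction fetch reads the word at the PC and computes PC + 4 in parallel. The sketch below assumes the instruction memory is a Python dict keyed by byte address, which is a modeling convenience and not something the slides specify; the function name `fetch` is likewise hypothetical.

```python
def fetch(pc, instruction_memory):
    """Return (instruction word, PC + 4) for a byte-addressed instruction memory."""
    instruction = instruction_memory[pc]
    next_pc = (pc + 4) & 0xFFFFFFFF  # the PC is a 32-bit register, so it wraps
    return instruction, next_pc


# Two instruction words used as sample contents:
# add $t0,$s1,$s2 and lw $t0,256($t1)
imem = {0: 0x02324020, 4: 0x8D280100}
```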
R-Format Instructions
  - Read two register operands
  - Perform arithmetic/logical operation
  - Write register result
Load/Store Instructions
  - Read register operands
  - Calculate address using 16-bit offset
      - Use ALU, but sign-extend offset
  - Load: read memory and update register
  - Store: write register value to memory
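The address calculation for loads and stores can be sketched as sign extension followed by a 32-bit add. The function names below are assumptions chosen for this example.

```python
def sign_extend16(imm16):
    """Extend a 16-bit value to 32 bits by replicating bit 15."""
    imm16 &= 0xFFFF
    return (imm16 | 0xFFFF0000) if (imm16 & 0x8000) else imm16


def effective_address(base, imm16):
    """Base register value plus sign-extended offset, wrapped to 32 bits."""
    return (base + sign_extend16(imm16)) & 0xFFFFFFFF
```

For `lw $t0,256($t1)` with `$t1` holding 0x1000, `effective_address(0x1000, 256)` gives 0x1100; an offset of 0xFFFC acts as -4.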
Branch Instructions
  - Read register operands
  - Compare operands
      - Use ALU, subtract and check Zero output
  - Calculate target address
      - Sign-extend displacement
      - Shift left 2 places (word displacement)
      - Add to PC + 4 (already calculated by instruction fetch)
Branch Instructions
  [Figure annotations on the branch datapath]
  - Shift left 2: just re-routes wires
  - Sign extend: sign-bit wire replicated
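The branch steps (subtract and test Zero, sign-extend, shift left 2, add to PC + 4) can be sketched as follows. All function names are hypothetical, and register values are assumed to be 32-bit unsigned integers.

```python
MASK32 = 0xFFFFFFFF


def sign_extend16(imm16):
    """Replicate bit 15 into the upper 16 bits."""
    imm16 &= 0xFFFF
    return (imm16 | 0xFFFF0000) if (imm16 & 0x8000) else imm16


def branch_target(pc, imm16):
    """(PC + 4) + (sign-extended displacement << 2), wrapped to 32 bits."""
    pc_plus_4 = (pc + 4) & MASK32
    return (pc_plus_4 + (sign_extend16(imm16) << 2)) & MASK32


def next_pc_beq(pc, rs_value, rt_value, imm16):
    """The ALU subtracts; a Zero result selects the branch target, otherwise PC + 4."""
    zero = ((rs_value - rt_value) & MASK32) == 0
    return branch_target(pc, imm16) if zero else (pc + 4) & MASK32
```

A displacement of 25 words moves the PC 100 bytes past PC + 4, so a taken branch at 0x1000 lands on 0x1068.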
Composing the Elements
  - First-cut datapath does an instruction in one clock cycle
      - Each datapath element can only do one function at a time
      - Hence, we need separate instruction and data memories
  - Use multiplexers where alternate data sources are used for different instructions
R-Type/Load/Store Datapath
Full Datapath
Adding Control
  - Selecting the operations to perform (ALU, read/write, etc.)
  - Controlling the flow of data (multiplexor inputs)
  - Information comes from the 32 bits of the instruction
Instruction formats:
  R:  op | rs | rt | rd | shamt | funct
  I:  op | rs | rt | 16-bit address
  J:  op | 26-bit address
MIPS Instructions
  add $t0,$s1,$s2
      op      rs     rt     rd     shamt  funct
      000000  10001  10010  01000  00000  100000
  lw $t0,256($t1)
      op      rs     rt     offset
      100011  01001  01000  0000 0001 0000 0000
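The two encodings above can be reproduced by packing the fields with shifts. This is a minimal sketch: the helper names `encode_r` and `encode_i` are made up for the example, and the register numbers follow the standard MIPS convention ($t0 = 8, $t1 = 9, $s1 = 17, $s2 = 18).

```python
T0, T1, S1, S2 = 8, 9, 17, 18  # standard MIPS register numbers


def encode_r(rs, rt, rd, shamt, funct, op=0):
    """Pack an R-format word: op(6) rs(5) rt(5) rd(5) shamt(5) funct(6)."""
    return (op << 26) | (rs << 21) | (rt << 16) | (rd << 11) | (shamt << 6) | funct


def encode_i(op, rs, rt, imm16):
    """Pack an I-format word: op(6) rs(5) rt(5) immediate(16)."""
    return (op << 26) | (rs << 21) | (rt << 16) | (imm16 & 0xFFFF)


add_word = encode_r(rs=S1, rt=S2, rd=T0, shamt=0, funct=0b100000)  # add $t0,$s1,$s2
lw_word = encode_i(op=0b100011, rs=T1, rt=T0, imm16=256)           # lw $t0,256($t1)
```

Printing `format(add_word, "032b")` shows the same bit pattern as the slide, with the field separators removed.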
MIPS Instructions Continued
  beq $s1,$s2,25   (word offset 25 => byte offset 100)
      op      rs     rt     offset
      000100  10001  10010  0000 0000 0001 1001
  j 1024   (word address 1024 => byte address 4096, combined with (PC+4)[31-28])
      op      address
      000010  00 0000 0000 0000 0100 0000 0000
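Going the other way, the fields can be pulled back out of a 32-bit word with shifts and masks. The `fields` helper and its dictionary keys are assumptions for illustration; it extracts every field regardless of format, and the caller uses whichever ones apply to the opcode.

```python
def fields(word):
    """Split a 32-bit instruction word into all R/I/J-format fields."""
    return {
        "op": (word >> 26) & 0x3F,
        "rs": (word >> 21) & 0x1F,
        "rt": (word >> 16) & 0x1F,
        "rd": (word >> 11) & 0x1F,
        "shamt": (word >> 6) & 0x1F,
        "funct": word & 0x3F,
        "imm16": word & 0xFFFF,
        "addr26": word & 0x03FFFFFF,
    }


beq_word = 0x12320019  # beq $s1,$s2,25
j_word = 0x08000400    # j 1024
```

The word-to-byte conversions quoted on the slide are then `fields(beq_word)["imm16"] * 4` and `fields(j_word)["addr26"] << 2`.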
Determining ALU Control Bits
ALUOp determined by instruction

  ALU control lines   Function
  000                 and
  001                 or
  010                 add
  110                 sub
  111                 slt

  Instruction opcode  ALUOp  Instruction operation  funct   ALU action  ALU control
  LW                  00     load word              xxxxxx  add         010
  SW                  00     store word             xxxxxx  add         010
  BEQ                 01     branch eq              xxxxxx  sub         110
  R-type              10     add                    100000  add         010
  R-type              10     sub                    100010  sub         110
  R-type              10     and                    100100  and         000
  R-type              10     or                     100101  or          001
  R-type              10     slt                    101010  slt         111
ALU Control
  - Must describe hardware to compute the 3-bit ALU control input
      - given instruction type (ALUOp, computed from instruction type):
          00 = lw, sw
          01 = beq
          10 = arithmetic
      - and the function code, for arithmetic instructions
  - Describe it using a truth table (can turn into gates).
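The mapping from ALUOp and funct to the 3-bit ALU control value can be written as a lookup, which is a software stand-in for the truth table. This is a sketch only; the function name `alu_control` and the table name are assumptions, while the bit patterns are the ones listed in the lecture.

```python
# funct field -> ALU control, used only when ALUOp == 10 (R-type)
FUNCT_TO_ALU_CONTROL = {
    0b100000: 0b010,  # add
    0b100010: 0b110,  # sub
    0b100100: 0b000,  # and
    0b100101: 0b001,  # or
    0b101010: 0b111,  # slt
}


def alu_control(alu_op, funct=0):
    """Return the 3-bit ALU control value for a 2-bit ALUOp and 6-bit funct."""
    if alu_op == 0b00:   # lw, sw: add to form the address (funct is don't-care)
        return 0b010
    if alu_op == 0b01:   # beq: subtract and test Zero (funct is don't-care)
        return 0b110
    if alu_op == 0b10:   # R-type: the funct field selects the operation
        return FUNCT_TO_ALU_CONTROL[funct]
    raise ValueError(f"unsupported ALUOp: {alu_op:02b}")
```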
Datapath with Control
  [Figure: single-cycle datapath with control. Labeled elements: PC, instruction memory, two adders (PC + 4 and branch target), shift left 2, registers, sign extend (16 to 32 bits), ALU, ALU control, data memory, and four multiplexers. Instruction fields shown: Instruction [31-26] to Control; Instruction [25-21], [20-16], and [15-11] to the register file; Instruction [15-0] to sign extend; Instruction [5-0] to ALU control. Control outputs: RegDst, Branch, MemRead, MemtoReg, ALUOp, MemWrite, ALUSrc, RegWrite.]
Control Line Settings
8 control lines (control read/write and multiplexors)

  Instruction  RegDst  ALUSrc  MemtoReg  RegWrite  MemRead  MemWrite  Branch  ALUOp
  R-format     1       0       0         1         0        0         0       Func Code
  lw           0       1       1         1         1        0         0       add
  sw           X       1       X         0         0        1         0       add
  beq          X       0       X         0         0        0         1       sub
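The control line settings can be sketched as a table indexed by opcode. Assumptions in this example: the names `Control`, `CONTROL_TABLE`, and `main_control` are hypothetical; don't-care entries (X) are represented as `None`; ALUOp is given as the 2-bit codes from the ALU control discussion (10 = R-type, 00 = add for lw/sw, 01 = sub for beq); and the sw opcode 101011 comes from the standard MIPS encoding, not from the slides.

```python
from collections import namedtuple

Control = namedtuple(
    "Control",
    "reg_dst alu_src mem_to_reg reg_write mem_read mem_write branch alu_op",
)

X = None  # don't-care: the value does not affect the result for this instruction

CONTROL_TABLE = {
    0b000000: Control(1, 0, 0, 1, 0, 0, 0, 0b10),  # R-format
    0b100011: Control(0, 1, 1, 1, 1, 0, 0, 0b00),  # lw
    0b101011: Control(X, 1, X, 0, 0, 1, 0, 0b00),  # sw
    0b000100: Control(X, 0, X, 0, 0, 0, 1, 0b01),  # beq
}


def main_control(opcode):
    """Look up the control line settings for a 6-bit opcode."""
    return CONTROL_TABLE[opcode]
```

For example, `main_control(0b100011)` returns the lw row, in which both `mem_read` and `reg_write` are 1.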
R-Type Instruction
Load Instruction
Branch-on-Equal Instruction
Implementing Jumps
  Jump format: op = 2 (bits 31:26), address (bits 25:0)
  - Jump uses word address
  - Update PC with concatenation of
      - top 4 bits of old PC
      - 26-bit jump address
      - 00
  - Need an extra control signal decoded from opcode
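The concatenation for jumps can be sketched with a mask and a shift. One assumption to flag: this example takes the top 4 bits from PC + 4 (the incremented PC), as in the earlier `j 1024` example's "PC+4[31-28]" note; the function name `jump_target` is hypothetical.

```python
def jump_target(pc, addr26):
    """Concatenate (PC + 4)[31:28], the 26-bit jump address, and two zero bits."""
    pc_plus_4 = (pc + 4) & 0xFFFFFFFF
    upper4 = pc_plus_4 & 0xF0000000           # top 4 bits stay in place
    return upper4 | ((addr26 & 0x03FFFFFF) << 2)  # word address -> byte address
```

With the PC in the lowest 256 MB region, `jump_target(0, 1024)` yields byte address 4096.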
Datapath With Jumps Added
Shortcomings of a Single-Cycle Implementation
  - Limits reuse of hardware components
      - each functional unit can be used only once per cycle
      - e.g., separate instruction and data memories are required
  - Inefficient
      - clock cycle determined by longest possible path in the machine
  - E.g., assume time for:
      - Memory units = 200 ps
      - ALU and adders = 100 ps
      - Register file (read or write) = 50 ps

  Instruction class  Instruction memory  Register read  ALU operation  Data memory  Register write  Total
  R-type             200                 50             100            0            50              400 ps
  Load word          200                 50             100            200          50              600 ps
  Store word         200                 50             100            200          -               550 ps
  Branch             200                 50             100            0            -               350 ps
  Jump               200                 -              -              -            -               200 ps
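The totals in the table follow from summing the delay of each unit on an instruction's path, and the single-cycle clock period is the largest of those sums. The sketch below uses the delays assumed on the slide; the dictionary and function names are illustrative.

```python
DELAY_PS = {"memory": 200, "alu": 100, "regfile": 50}

# Units each instruction class passes through, in order
UNITS_USED = {
    "R-type":     ["memory", "regfile", "alu", "regfile"],
    "Load word":  ["memory", "regfile", "alu", "memory", "regfile"],
    "Store word": ["memory", "regfile", "alu", "memory"],
    "Branch":     ["memory", "regfile", "alu"],
    "Jump":       ["memory"],
}


def path_delay(instr_class):
    """Sum of unit delays along the path for one instruction class, in ps."""
    return sum(DELAY_PS[unit] for unit in UNITS_USED[instr_class])


def clock_period():
    """A single-cycle design must fit the slowest instruction class."""
    return max(path_delay(c) for c in UNITS_USED)
```

Here the load word path sets the clock period, so every other instruction class waits out the same 600 ps cycle.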
Single-Cycle Model Is Inefficient!
  - Assume 25% loads, 10% stores, 45% ALU instructions, 15% branches, and 5% jumps
  - CPU execution time = Instruction count x CPI x Clock cycle time

\[
\text{Performance ratio}
= \frac{\text{CPU performance (multicycle impl.)}}{\text{CPU performance (single-cycle impl.)}}
= \frac{\text{CPU exec. time (single-cycle impl.)}}{\text{CPU exec. time (multicycle impl.)}}
\]
\[
= \frac{600}{600 \times 25\% + 550 \times 10\% + 400 \times 45\% + 350 \times 15\% + 200 \times 5\%}
= \frac{600\ \text{ps}}{447.5\ \text{ps}} \approx 1.34
\]

  - The alternative is about 1.34 times faster than the single-cycle implementation
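The arithmetic behind the 1.34 figure can be checked with a weighted average of the per-class latencies. This sketch assumes the 45% ALU instructions correspond to the R-type row of the timing table and keeps the mix as integer percentages so the average is computed exactly; the names are illustrative.

```python
LATENCY_PS = {
    "Load word": 600,
    "Store word": 550,
    "R-type": 400,
    "Branch": 350,
    "Jump": 200,
}

MIX_PERCENT = {
    "Load word": 25,
    "Store word": 10,
    "R-type": 45,
    "Branch": 15,
    "Jump": 5,
}


def average_latency_ps():
    """Mix-weighted average time per instruction, in ps."""
    return sum(LATENCY_PS[c] * MIX_PERCENT[c] for c in MIX_PERCENT) / 100


def performance_ratio():
    """Single-cycle time per instruction (slowest class) over the weighted average."""
    return max(LATENCY_PS.values()) / average_latency_ps()
```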