Chapter 5 1
State Elements Unclocked vs. Clocked Clocks used in synchronous logic Clocks are needed in sequential logic to decide when an element that contains state should be updated. State element 1 Combinational logic State element 2 Clock cycle 2
Latches and Flip-flops C Q D _ Q 3
Latches and Flip-flops D D C D latch Q D C D latch Q Q Q Q C 4
Latches and Flip-flops Latches: whenever the inputs change, and the clock is asserted Flip-flop: state changes only on a clock edge (edge-triggered methodology) 5
SRAM 6
SRAM vs. DRAM Which one has a better memory density? static RAM (SRAM): value stored in a cell is kept on a pair of inverting gates dynamic RAM (DRAM), value kept in a cell is stored as a charge in a capacitor. DRAMs use only a single transistor per bit of storage, By comparison, SRAMs require four to six transistors per bit Which one is faster? In DRAMs, the charge is stored on a capacitor, so it cannot be kept indefinitely and must periodically be refreshed. (called dynamic) Synchronous RAMs?? is the ability to transfer a burst of data from a series of sequential addresses within an array or row. 7
Datapath & control design We will design a simplified MIPS processor The instructions supported are Memory-reference instructions: lw, sw Arithmetic-logical instructions: add, sub, and, or, slt Control flow instructions: beq, j Generic implementation Use the program counter (PC) to supply instruction address Get the instruction from memory Read registers Use the instruction to decide exactly what to do All instructions use the ALU after reading the registers Why? memory-reference? arithmetic? control flow? 8
ALU Control ALU's operation based on instruction type and function code Example: add $t1, $s7, $s8 000000 10111 11000 01001 00000 100000 000 AND op rs rt rd shamt funct 001 OR 010 Add 110 Subtract 111 Set-on-less-than lw $t0, 32($s2) 35 18 8 32 op rs rt 16-bit number 9
Summary of Instruction Types R-Type: op=0 31:26 25:21 20:16 15:11 10:6 5:0 op rs rt rd shamt funct Load/Store: op=35 or 43 31:26 25:21 20:16 15:0 op rs rt address Branch: op=4 31:26 25:21 20:16 15:0 op rs rt address 10
Building blocks Instruction address Instruction PC Add Sum MemWrite Instruction memory a. Instruction memory b. Program counter c. Adder Address Write data Data memory Read data 16 Sign 32 extend Register numbers Data 5 Read 3 register 1 Read 5 data 1 Read 5 register 2 Registers Write register Write data Read data 2 RegWrite Data ALU control ALU ALU result a. Registers b. ALU Zero MemRead a. Data memory unit b. Sign-extension unit Why do we need each of these? 11
Fetching instructions 12
Reading registers 13
Load/Store memory access 14
Branch target 15
Combining datapath for memory and R-type instructions 16
Appending instruction fetch 17
Now Insert Branch 18
The simple datapath 19
Control For each instruction Select the registers to be read (always read two) Select the 2nd ALU input Select the operation to be performed by ALU Select if data memory is to be read or written Select what is written and where in the register file Select what goes in PC Information comes from the 32 bits of the instruction 20
Adding control to datapath 21
Adding control to datapath 0 4 Add Instruction [31 26] Control RegDst Branch MemRead MemtoReg ALUOp MemWrite ALUSrc RegWrite Shift left 2 Add ALU result M u x 1 PC Read address Instruction memory Instruction [31 0] Instruction [25 21] Instruction [20 16] Instruction [15 11] 0 M u x 1 Read register 1 Read register 2 Registers Write register Write data Read data 1 Read data 2 0 M u x 1 Zero ALU ALU result Address Write data Data memory Read data 1 M u x 0 Instruction [15 0] 16 Sign 32 extend ALU control Instruction [5 0] 22
ALU Control given instruction type 00 = lw, sw 01 = beq, 10 = arithmetic 23
Control (Reading Assignment: Appendix C.2) Simple combinational logic (truth tables) Inputs Op5 Op4 ALUOp ALUOp0 ALUcontrol block Op3 Op2 Op1 Op0 ALUOp1 Outputs F (5 0) F3 F2 F1 F0 Operation2 Operation1 Operation0 Operation R-format Iw sw beq RegDst ALUSrc MemtoReg RegWrite MemRead MemWrite Branch ALUOp1 ALUOpO 24
Instruction RegDst ALUSrc R-format lw sw beq Memto- Reg Reg Write Mem Read Mem Write Branch ALUOp1 ALUp0 25
Datapath in Operation for R-Type Instruction Memto- Reg Mem Mem Instruction RegDst ALUSrc Reg Write Read Write Branch ALUOp1 ALUp0 R-format 1 0 0 1 0 0 0 1 0 lw sw beq 26
Datapath in Operation for Load Instruction Memto- Reg Mem Mem Instruction RegDst ALUSrc Reg Write Read Write Branch ALUOp1 ALUp0 R-format 1 0 0 1 0 0 0 1 0 lw 0 1 1 1 1 0 0 0 0 sw X 1 X 0 0 1 0 0 0 beq 27
Datapath in Operation for Branch Equal Instruction Memto- Reg Mem Mem Instruction RegDst ALUSrc Reg Write Read Write Branch ALUOp1 ALUp0 R-format 1 0 0 1 0 0 0 1 0 lw 0 1 1 1 1 0 0 0 0 sw X 1 X 0 0 1 0 0 0 beq X 0 X 0 0 0 1 0 1 28
Datapath with control for Jump instruction J-type instructions use 6 bits for the opcode, and 26 bits for the immediate value (called the target). newpc <- PC[31-28] IR[25-0] 00 29
Timing: Single cycle implementation Calculate cycle time assuming negligible delays except Memory (2ns), ALU and adders (2ns), Register file access (1ns) 30
Why is Single Cycle not GOOD??? Memory - 2ns; ALU - 2ns; Adder - 2ns; Reg - 1ns Instruction class Instruction memory Register read ALU Data memory Register write Total (in ns) ALU type 2 1 2 0 1 6 Load word 2 1 2 2 1 8 Store word 2 1 2 2 7 Branch 2 1 2 5 Jump 2 2 what if we had floating point instructions to handle? 31
1 clock cycle fixed vs. variable for each instruction Memory - 2ns; ALU - 2ns; Adder - 2ns; Reg - 1ns Loads 24% Stores 12% R-type 44% Branch 18% Jumps 2% Instruction class Instruction memory Register read ALU Data memory Register write Total (in ns) ALU type 2 1 2 0 1 6 Load word 2 1 2 2 1 8 Store word 2 1 2 2 7 Branch 2 1 2 5 Jump 2 2 32
1 clock cycle fixed vs. variable for each instruction Memory - 2ns; ALU - 2ns; Adder - 2ns; Reg - 1ns Loads 24% Stores 12% R-type 44% Branch 18% Jumps 2% CPU = IC * CPI * CC CPU = 8*24% + 7*12% + 6*44% + 5*18% + 2*2% CPU = 6.3ns Instruction class Instruction memory Register read ALU Data memory Register write Total (in ns) ALU type 2 1 2 0 1 6 Load word 2 1 2 2 1 8 Store word 2 1 2 2 7 Branch 2 1 2 5 Jump 2 2 33
Single Cycle Problems Wasteful of area Each unit used once per clock cycle Clock cycle equal to worst case scenario Will reducing the delay of common case help? 34
Multicycle Approach Ability to allow different cycles for different instructions Ability to share functional units within the execution of a single instruction Each step in execution in 1 cycle Functional unit to be used more than once per instruction As long as it is used in different cycle This will lead to reduction in hardware 35
Where we are headed One Solution: use a smaller cycle time have different instructions take different numbers of cycles a multicycle datapath shares resources Break up the instructions into steps, each step takes a cycle balance the amount of work to be done restrict each cycle to use only one major functional unit At the end of a cycle store values for use in later cycles (easiest thing to do) introduce additional internal registers PC Address Instruction or data Memory Data Instruction register Memory data register Data Register # Registers Register # Register # A B ALU ALUOut What is new? One ALU Single memory Some registers 36
Multicycle Datapath R I op rs rt rd shamt funct op rs rt 16 bit address J op 26 bit address What is this wire for? 37
Multicycle Datapath with Control Signals Missing wires?? 38
Multicycle Datapath, All Together 39
Idea behind multicycle approach We define each instruction from the ISA perspective Break it down into steps following our rule that data flows through at most one major functional unit (e.g., balance work across steps) Introduce new registers as needed (e.g, A, B, ALUOut, MDR, etc.) Finally try and pack as much work into each step (avoid unnecessary cycles) while also trying to share steps where possible (minimizes control, helps to simplify solution) Result: Our book s multicycle Implementation! 40
Breaking down an instruction ISA definition of arithmetic: Reg[Memory[PC][15:11]] <= Reg[Memory[PC][25:21]] op Reg[Memory[PC][20:16]] Could break down to: IR <= Memory[PC] A <= Reg[IR[25:21]] B <= Reg[IR[20:16]] ALUOut <= A op B Reg[IR[20:16]] <= ALUOut We forgot an important part of the definition of arithmetic! PC <= PC + 4 41
Five Execution Steps Instruction Fetch Instruction Decode and Register Fetch Execution, Memory Address Computation, or Branch Completion Memory Access or R-type instruction completion Write-back step INSTRUCTIONS TAKE FROM 3-5 CYCLES! 42
Step 1: Instruction Fetch Use PC to get instruction and put it in the Instruction Register. Increment the PC by 4 and put the result back in the PC. Can be described succinctly using RTL "Register-Transfer Language" IR <= Memory[PC]; PC <= PC + 4; Can we figure out the values of the control signals? What is the advantage of updating the PC now? 43
Step 2: Instruction Decode and Register Fetch Read registers rs and rt in case we need them Compute the branch address in case the instruction is a branch RTL: A <= Reg[IR[25:21]]; B <= Reg[IR[20:16]]; ALUOut <= PC + (sign-extend(ir[15:0]) << 2); We aren't setting any control lines based on the instruction type (we are busy "decoding" it in our control logic) 44
Step 3 Execution (instruction dependent) Memory Reference: ALUOut <= A + sign-extend(ir[15:0]); R-type: ALUOut <= A op B; Branch: if (A==B) PC <= ALUOut; Jump: PC <= {PC[31:28],(IR[25:0],2 b00)}; 45
Step 4 and 5 Step 4 R-type Completion or memory-access Memory access or R-type completion: MDR <= Memory[ALUout]; or Memory[ALUout] <= B; R-type instructions finish Reg[IR[15:11]] <= ALUOut; Step 5 Write-back step Reg[IR[20:16]] <= MDR; Why not do this in Step 4? 46
Summary: 47
Defining the Control Now that we have determined what the control signals are and when they have to be asserted Next step: implementing the control unit Single cycle datapath used truth tables Not feasible Alternatives Finite state machines Microprogramming 48
Implementing the Control Value of control signals is dependent upon: what instruction is being executed which step is being performed Use the information we ve accumulated to specify a finite state machine specify the finite state machine graphically, or use microprogramming Implementation can be derived from specification 49
Finite State Machine Control A <= Reg[IR[25:21]]; B <= Reg[IR[20:16]]; ALUOut <= PC + (sign-extend(ir[15:0]) << 2); IR <= Memory[PC]; PC <= PC + 4; 50
Big Picture 51
State Table 52
State Table 53
ROM Implementation ROM = "Read Only Memory" values of memory locations are fixed ahead of time A ROM can be used to implement a truth table if the address is m-bits, we can address 2 m entries in the ROM. our outputs are the bits of data that the address points to. m n 0 0 0 0 0 1 1 0 0 1 1 1 0 0 0 1 0 1 1 0 0 0 1 1 1 0 0 0 1 0 0 0 0 0 0 1 0 1 0 0 0 1 1 1 0 0 1 1 0 1 1 1 0 1 1 1 m is the "height", and n is the "width" 54
ROM Implementation (Appendix-C!!!) How many inputs are there? 6 bits for opcode, 4 bits for state = 10 address lines (i.e., 2 10 = 1024 different addresses) How many outputs are there? 16 datapath-control outputs, 4 state bits = 20 outputs ROM is 2 10 x 20 = 20K bits (and a rather unusual size) Rather wasteful, since for lots of the entries, the outputs are the same i.e., opcode is often ignored 55
Big Picture 56
Putting All Together 57
PLA 58
59
ROM vs PLA Break up the table into two parts 4 state bits tell you the 16 outputs, 2 4 x 16 bits of ROM 10 bits tell you the 4 next state bits, 2 10 x 4 bits of ROM Total: 4.3K bits of ROM PLA is much smaller can share product terms only need entries that produce an active output can take into account don't cares Size is (#inputs #product-terms) + (#outputs #product-terms) For this example = (10x17)+(20x17) = 510 PLA cells PLA cells usually about the size of a ROM cell (slightly bigger) 60
Another Implementation Style Complex instructions: the "next state" is often current state + 1 Control unit PLA or ROM Input Outputs PCWrite PCWriteCond IorD MemRead MemWrite IRWrite BWrite MemtoReg PCSource ALUOp ALUSrcB ALUSrcA RegWrite RegDst AddrCtl 1 State Adder Address select logic Op[5 0] Instruction register opcode field 61
Microprogramming Control unit Microcode memory Input Outputs PCWrite PCWriteCond IorD MemRead MemWrite IRWrite BWrite MemtoReg PCSource ALUOp ALUSrcB ALUSrcA RegWrite RegDst AddrCtl Datapath 1 Microprogram counter Adder Address select logic Instruction register opcode field What are the microinstructions? 62
Pentium 4 Pipelining is important (last IA-32 without it was 80386 in 1985) Control Control I/O interface Instruction cache Enhanced floating point and multimedia Control Data cache Integer datapath Secondary cache and memory interface Chapter 7 Chapter 6 Advanced pipelining hyperthreading support Control Pipelining is used for the simple instructions favored by compilers Simply put, a high performance implementation needs to ensure that the simple instructions execute quickly, and that the burden of the complexities of the instruction set penalize the complex, less frequently used, instructions 63
Chapter 5 Summary If we understand the instructions We can build a simple processor! If instructions take different amounts of time, multi-cycle is better Datapath implemented using: Combinational logic for arithmetic State holding elements to remember bits Control implemented using: Combinational logic for single-cycle implementation Finite state machine for multi-cycle implementation 64