Designing the Main Control Unit:

Recall the three instruction classes {R-type, Memory, Branch}:

a) R-type
   Field:  Op      rs      rt      rd      shamt   funct
   Role:           1.src   2.src   dest.
   Bits:   31-26   25-21   20-16   15-11   10-6    5-0

b) Memory
   Field:  Op      rs      rt         offset
   Role:           base    dest/src
   Bits:   31-26   25-21   20-16      15-0

c) Branch
   Field:  Op      rs      rt      offset
   Role:           1.src   2.src
   Bits:   31-26   25-21   20-16   15-0

Observations:
- Bits 31-26 are always the opcode field.
- The rs and rt fields always specify the two registers to be read (R-type, beq, sw).
- The base register for sw and lw is always in the rs field.
- The 16-bit offset value is always in bits 15-0 for beq, lw, and sw.
- For lw, the destination register is in the rt field, whereas for R-type it is in the rd field. We need a MUX to select the appropriate destination register field. This new MUX is placed in front of the write register (wr) input of the register file unit.

For the 4 MUXes, we need 4 select control lines. We also need write control inputs for the register file and the data memory, and a read control input for the data memory. We shall also show the ALU control block and its connection to the ALU.

Figure: The overall datapath for R-type, lw, sw, and beq instructions
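The field layout and the destination-register MUX described above can be sketched in a few lines of Python. This is only an illustration: the function names and the select-line name reg_dst are assumptions, not names taken from the slides.

```python
def decode_fields(instr: int) -> dict:
    """Split a 32-bit instruction word into the bit fields listed above."""
    return {
        "op":     (instr >> 26) & 0x3F,   # bits 31-26, always the opcode
        "rs":     (instr >> 21) & 0x1F,   # bits 25-21
        "rt":     (instr >> 16) & 0x1F,   # bits 20-16
        "rd":     (instr >> 11) & 0x1F,   # bits 15-11 (meaningful for R-type)
        "shamt":  (instr >> 6) & 0x1F,    # bits 10-6  (meaningful for R-type)
        "funct":  instr & 0x3F,           # bits 5-0   (meaningful for R-type)
        "offset": instr & 0xFFFF,         # bits 15-0  (lw, sw, beq)
    }


def write_register(fields: dict, reg_dst: int) -> int:
    """Destination MUX: rd when reg_dst is 1 (R-type), rt when 0 (lw).

    reg_dst is an assumed name for the MUX select line.
    """
    return fields["rd"] if reg_dst else fields["rt"]


# add $8, $9, $10 : op=0, rs=9, rt=10, rd=8, shamt=0, funct=0x20
add_word = (0 << 26) | (9 << 21) | (10 << 16) | (8 << 11) | (0 << 6) | 0x20
# lw $8, 4($9)    : op=0x23, rs=9 (base), rt=8 (dest), offset=4
lw_word = (0x23 << 26) | (9 << 21) | (8 << 16) | 4
```

In both example words register 8 is the destination, but it sits in rd for add and in rt for lw, which is why the MUX in front of the write register input is needed.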
The main control unit asserts output signals to control the operation of the sequential and combinational blocks at every clock pulse. All control inputs and outputs of the circuit have been introduced previously. These signals are summarized in the following table.
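As a rough illustration of what such a control table looks like, the sketch below encodes a textbook-style truth table as a Python dictionary. The signal names (RegDst, ALUSrc, MemtoReg, RegWrite, MemRead, MemWrite, Branch) and their values are assumptions based on the conventional single-cycle MIPS design and may differ from the table in the slides; the reset/PCClr part and the ALU control are left out.

```python
# Assumed, textbook-style control values; not copied from the slides' table.
# None marks a "don't care" output.
CONTROL = {
    "R-type": dict(RegDst=1,    ALUSrc=0, MemtoReg=0,    RegWrite=1,
                   MemRead=0, MemWrite=0, Branch=0),
    "lw":     dict(RegDst=0,    ALUSrc=1, MemtoReg=1,    RegWrite=1,
                   MemRead=1, MemWrite=0, Branch=0),
    "sw":     dict(RegDst=None, ALUSrc=1, MemtoReg=None, RegWrite=0,
                   MemRead=0, MemWrite=1, Branch=0),
    "beq":    dict(RegDst=None, ALUSrc=0, MemtoReg=None, RegWrite=0,
                   MemRead=0, MemWrite=0, Branch=1),
}

# MIPS opcode values (bits 31-26) for the four instruction classes.
OPCODES = {0x00: "R-type", 0x23: "lw", 0x2B: "sw", 0x04: "beq"}


def main_control(opcode: int) -> dict:
    """Return the control outputs for a given opcode field."""
    return CONTROL[OPCODES[opcode]]
```

For example, main_control(0x23) returns the lw row, in which the register file is written from memory (MemtoReg=1) and the destination comes from rt (RegDst=0).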
In this truth table the reset input is independent of the opcode and serves only to assert the PCClr output. Consequently, the table can be decomposed into two parts: reset and execution.

The control unit with the truth table above asserts the Branch output, independent of ALU-Zero, each time a beq instruction is executed. In this case, a circuit in the datapath tests ALU-Zero and generates PCSrc = 1 only when both Branch and ALU-Zero are high.

To eliminate the branch circuit from the datapath, the controller must have an ALU-Zero input and a PCSrc output, as specified in the following table. In this case, the controller asserts the PCSrc output directly, depending on the ALU-Zero input. The truth table for this case is seen below.
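The two ways of producing PCSrc can be compared with a small sketch. The function names and the is_beq flag are hypothetical; only Branch, ALU-Zero, and PCSrc come from the text.

```python
def pc_src_in_datapath(branch: int, alu_zero: int) -> int:
    """Option 1: the controller outputs Branch, and an AND gate in the
    datapath combines it with ALU-Zero to form PCSrc."""
    return branch & alu_zero


def pc_src_in_controller(is_beq: bool, alu_zero: int) -> int:
    """Option 2: the controller takes ALU-Zero as an extra input and
    asserts PCSrc directly."""
    return 1 if (is_beq and alu_zero) else 0


def next_pc(pc_plus_4: int, branch_target: int, pc_src: int) -> int:
    """PC-source MUX: take the branch target only when PCSrc is 1."""
    return branch_target if pc_src else pc_plus_4
```

Both options give the same PCSrc for every input combination; the difference is only where the AND function lives, in the datapath or inside the controller's truth table.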
Datapath modification for an unconditional jump

Instruction format for jump:
   Field:  Op      address
   Value:  2
   Bits:   31-26   25-0

Jump branches unconditionally to a 32-bit address, which is obtained as follows:
   Bits:    31-28             27-2                        1-0
   Source:  from current PC   from jump inst. (26 bits)   00

The highest four bits of the PC (PC31-28) remain unchanged.

Modification required on the datapath: a new MUX is needed to select either the jump address or the (branch target address or PC+4).
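The address formation above can be sketched as follows. The function names and the jump select signal are assumptions for illustration; the pc argument stands for whichever PC value the datapath feeds to the upper four bits.

```python
def jump_target(pc: int, instr: int) -> int:
    """Form the 32-bit jump address from the PC and the jump instruction."""
    address26 = instr & 0x03FFFFFF       # bits 25-0 of the jump instruction
    upper4 = pc & 0xF0000000             # PC bits 31-28, left unchanged
    return upper4 | (address26 << 2)     # shifting left by 2 appends 00


def select_next_pc(jump: int, jump_addr: int, branch_or_pc4: int) -> int:
    """New MUX: jump address when jump is 1, else branch target or PC+4."""
    return jump_addr if jump else branch_or_pc4


# A jump word with opcode 2 and a 26-bit address field of 0x100.
j_word = (2 << 26) | 0x100
target = jump_target(0x40000008, j_word)   # 0x40000400
```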
Problems with the single-cycle implementation:

Since every instruction is executed in a clock cycle of the same length, CPI = 1. The length of the single clock cycle is determined by the longest path in the design. This longest path is the one used by the load instruction, which uses, in sequence:
- the instruction memory
- the register file
- the ALU
- the data memory, and
- the register file again

Note that some of the instructions considered could fit in shorter clock cycles. The shortest is the jump instruction.

Assume the operation times (access times) of the major functional units for this implementation are given as:

As expected, the instruction that takes the longest is lw, which requires 40 ns to complete.

Table: Minimum processing time for each instruction of the single-cycle processor

That is, if all instructions must run in one clock cycle, the processor needs 40 ns to complete the lw instruction, and thus the processor clock cycle must be at least 40 ns.
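The reasoning above (cycle time = longest path) can be sketched as below. The unit delays are placeholder values chosen only so that the lw path adds up to the 40 ns quoted in the text; they are not taken from the slides' table. The unit lists for instructions other than lw are also assumptions based on the usual single-cycle datapath.

```python
# Placeholder delays in ns (assumed, NOT the slides' values).
DELAY_NS = {"instr_mem": 10, "reg_file": 5, "alu": 10, "data_mem": 10}

# Functional units each instruction passes through, in order.
# Only the lw path is listed in the text; the others are assumed.
PATH = {
    "R-type": ["instr_mem", "reg_file", "alu", "reg_file"],
    "lw":     ["instr_mem", "reg_file", "alu", "data_mem", "reg_file"],
    "sw":     ["instr_mem", "reg_file", "alu", "data_mem"],
    "beq":    ["instr_mem", "reg_file", "alu"],
    "j":      ["instr_mem"],
}

# Minimum processing time per instruction = sum of the delays on its path.
time_ns = {name: sum(DELAY_NS[u] for u in units) for name, units in PATH.items()}

# The single clock cycle must cover the slowest instruction.
clock_ns = max(time_ns.values())
slowest = max(time_ns, key=time_ns.get)
fastest = min(time_ns, key=time_ns.get)
```

With these placeholder delays, lw is the slowest instruction and jump the fastest, so every instruction, including jump, is charged the full lw time.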
Ex: Let us find out which of the following implementations would be faster:
a) The single-clock-cycle implementation (of fixed length)
b) The variable-length implementation with clock cycle = 10 ns, where each instruction takes several clock cycles

Instruction mixture of the test program:
   Fraction of instructions:  49%   22%   11%   16%   2%
   Number of cycles:          3     4     4     3     2

Answer (single-cycle and multi-cycle cases for the implemented datapath):

Instruction-execution-time = CPI × clock-cycle-time

Compute the CPU-execution-time for each case:

CPU-execution-time = Inst. count × CPI × Clock cycle time

a) For the single-clock-cycle implementation: clock cycle time = 40 ns, CPI = 1
   CPU-execution-time = I × 1 × 40 ns = 40 I ns

b) For the multi-clock-cycle implementation the average CPI is:
   CPI = Σ (CPI_k × Instr.-count_k) / Instruction-count
   CPI = (3 × 0.49) + (4 × 0.22) + (4 × 0.11) + (3 × 0.16) + (2 × 0.02) = 3.31
   CPU-execution-time = I × 3.31 × 10 ns = 33.1 I ns

The performance ratio of the two implementations is:
   (40 I ns) / (33.1 I ns) ≈ 1.21

The variable-clock implementation performs 1.21 times better than the fixed single-clock-cycle implementation.

>> If floating-point operations or a more complicated instruction set is considered in the design, the single-clock-cycle design would use an extremely slow clock. So, instead, use implementation techniques that have a shorter clock cycle and require multiple clock cycles for each instruction.
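The arithmetic of this example can be checked with a short script. The cycle counts and instruction fractions are the ones appearing in the CPI formula above; the variable names are only illustrative.

```python
# Cycle counts and instruction fractions, in the order used in the CPI formula.
cycles = [3, 4, 4, 3, 2]
fractions = [0.49, 0.22, 0.11, 0.16, 0.02]

# Average CPI is the fraction-weighted sum of the per-class cycle counts.
avg_cpi = sum(c * f for c, f in zip(cycles, fractions))

# Execution time per instruction (ns), i.e. CPU time divided by I.
single_cycle_ns = 1 * 40.0       # CPI = 1, clock cycle = 40 ns
multi_cycle_ns = avg_cpi * 10.0  # average CPI, clock cycle = 10 ns

ratio = single_cycle_ns / multi_cycle_ns

print(round(avg_cpi, 2), round(multi_cycle_ns, 1), round(ratio, 2))
# prints: 3.31 33.1 1.21
```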
SUMMARY for Single-Cycle Processor

Advantages
- A single cycle per instruction makes the logic and clocking simple

Disadvantages
- Inefficient utilization of memory and functional units, since different instructions take different lengths of time
- The ALU computes values only a small fraction of the time
- Cycle time is set by the worst-case path, which gives long cycle times
  - Load instruction: PC CLK-to-Q + instruction memory access time + register file access time + ALU delay + data memory access time + register file setup time
- All machines would have a CPI of 1

Fixing the Single Cycle: Multicycle Implementation
- Divide each instruction into a series of steps
- Each step takes one clock cycle
- Different instructions can now have different CPI
- Requires a few significant changes to the organization
- Use registers to separate each stage

Advantages
- Shorter cycle time
- Simple instructions execute in a short period of time
- Variable cycles per instruction, so timing is no longer restricted to the worst case
- Functional units can be used more than once per instruction
- Less hardware required to implement the processor

Disadvantages
- Requires additional registers to store values between stages
- More timing paths to design, analyze, and tune

Multiple-Cycle Implementation

The single-clock-cycle implementation requires distinct memory units for instruction fetch and data access. A simplified representation of the data processing path starts with the instruction memory and terminates with the data memory. This datapath has four main sections, which require almost the same processing time.

Figure: A simplified single-clock datapath can be divided into four main sections
In dividing, we take care not to leave datapath elements in between the sections, and to give each division almost the same processing time.

Benefits of multi-cycle processing
- Divide the execution into steps
- Each step takes one clock cycle to complete
- In this implementation, a functional unit can be used more than once per instruction, provided that it is used on different clock cycles (reduced hardware)
- Different instructions may take different numbers of clock cycles (flexibility)

Figure: Simplified multi-clock-cycle datapath

ALU usage during each cycle of multi-clock-cycle processing

At the first clock cycle:
- The ALU calculates PC+4.

At the second cycle:
- The ALU calculates the branch target address, PC+4 + imm × 4, while the operands are not yet ready from the register-file outputs.

During the first and second cycles the instruction is not yet decoded, and the datapath performs the same operations for all instructions.

At the third cycle:
- For R-format instructions, the ALU performs the operation specified by the instruction.
- For the base-addressing instructions lw and sw, the memory address is calculated.
- For the branch instruction, the ALU compares the contents of the source registers.

At the fourth cycle:
- For R-format instructions, the ALU result is stored into the destination register.
- For the sw instruction, the memory write occurs.
- For the lw instruction, the memory read occurs.

At the fifth cycle:
- For the lw instruction, the memory contents are stored into the destination register.
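The cycle-by-cycle description above can be tabulated to count how many clock cycles each instruction class needs and in how many of them the ALU is busy. The step labels and the "ALU" prefix convention are only an illustrative encoding of the list above, not a definition from the slides.

```python
# Steps shared by every instruction (cycles 1 and 2, before decoding).
COMMON = ["ALU: PC+4", "ALU: branch target address"]

# Remaining steps per instruction class, following the description above.
STEPS = {
    "R-type": COMMON + ["ALU: execute operation",
                        "write ALU result to register"],
    "lw":     COMMON + ["ALU: memory address",
                        "memory read",
                        "write memory data to register"],
    "sw":     COMMON + ["ALU: memory address",
                        "memory write"],
    "beq":    COMMON + ["ALU: compare source registers"],
}

# Number of clock cycles each class takes (one step per cycle).
cycles = {name: len(steps) for name, steps in STEPS.items()}

# Number of cycles in which the ALU is used by each class.
alu_uses = {name: sum(s.startswith("ALU") for s in steps)
            for name, steps in STEPS.items()}
```

Under this encoding the single ALU is used three times per instruction, each time on a different clock cycle, which is what allows the multi-cycle datapath to drop the separate adders.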