Design of Decode, Control and Associated Datapath Units
ECE/CS 3710 - Computer Design Lab
Lab 3 - Due Date: Thu Oct 18

I. OVERVIEW

In the previous lab, you designed the ALU and hooked it up with the register files. The execute operation of your machine performs arithmetic and logical operations in the ALU; it fetches the data from the register files and computes the result. So far, so good. In this lab, you will conceptualize, design and implement the decode and control logic for your CPU and integrate it with the ALU and register files to complete the datapath.

Before you begin designing the components for this lab, I would urge you to first recap how your ALU works and how many cycles it takes to perform the execute operation, and then decide what control signals need to be generated to complete the data flow through the CPU. Think about what extra registers/status flags you would need (if any) to keep track of instruction execution. Develop a high-level view, preferably a block diagram, and conceptually verify that all your requirements are met. Then decide how to partition the decode, control, and associated program counter hardware, and proceed with the implementation. If you have already made plans for augmenting your base instruction set, make your design amenable to later modifications so that the extra hardware/control can be accommodated at a later stage (this is tough, I know, but you can get a feel for how/what you would have to modify).

In the next lab (lab 4), you will design the interface of your CPU with the memory. For designing and validating your CPU, you may assume for now that an instruction and/or data is available in the associated registers: Memory Address Register (MAR), Memory Data Register (MDR), etc., as and when you need it.

II. FETCHING THE INSTRUCTION

The program counter generates the addresses used to fetch instructions.
Since the CPU will be interfaced with a memory controller, the current address has to be latched in a Memory Address Register (MAR) for the memory controller. Of course, the PC itself may act as an MAR, but this can create the following problem. Usually, the PC gets incremented by one the moment an instruction is fetched. Also, when branch instructions are executed, the PC gets updated only after the branch target address is computed. In the case of pipelined execution, the PC computation can get rather involved if you implement some form of prediction. Without a specific MAR, how would the PC be interfaced with the memory controller so that the current address stays latched while the corresponding instruction/data is being fetched? That's why implementing an MAR may not be a bad idea. If an MAR is indeed implemented, how would it be interfaced with the PC, and when/how would it get updated? That is the thinking part of this problem. Note that this MAR can be made a part of the memory controller, but that's an issue for the next lab. To come to the point: maybe you should implement a memory address register to store the address of the current instruction/data being fetched.
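To make the PC/MAR question concrete, here is a minimal behavioral sketch in Python (not HDL). It assumes one possible answer: the MAR latches the PC at the start of a fetch, and the PC increments at the same time. The class and method names are hypothetical, and your design may update the two registers at different times.

```python
class FetchAddressUnit:
    """Toy model of a PC with a separate MAR (names are illustrative only)."""

    MASK = 0xFFFF  # the PC holds a 16-bit word address

    def __init__(self, reset_pc=0):
        self.pc = reset_pc & self.MASK
        self.mar = 0

    def start_fetch(self):
        # Assumption: latch the current PC into the MAR, then advance the PC.
        self.mar = self.pc
        self.pc = (self.pc + 1) & self.MASK
        return self.mar


unit = FetchAddressUnit(reset_pc=0x0010)
addr = unit.start_fetch()
# The MAR still holds the fetch address although the PC has moved on.
print(hex(addr), hex(unit.mar), hex(unit.pc))
```

The point of the sketch is only that the memory controller sees a stable address in the MAR while the PC is free to change.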
[Fig. 1. Block Diagram of the Memory-CPU Interface for fetch/decode stages. Blocks shown: SDRAM, MAR, PC (PC = PC + 1 or PC = branch target address), MDR, IR. Decode outputs: mux/buf controls for regfiles, opcodes for ALU, write_enables for regfiles (or maybe for PC?). Other labels: Data ready, Data to ALU BUS/Regfile, Fetch, Decode.]

A. Program Counter Design

The program counter is a dedicated special register in the machine that holds the address of the next instruction to execute. It needs to be capable of being updated in every way the PC needs to be updated. For your machine, this means that the PC needs to be incremented by one word (the normal case), added to a (sign-extended) displacement (for branches) or loaded from a register (for jumps). Your datapath needs to be able to perform all of these operations.

If your PC is already set up to feed into your ALU (which is unlikely), then you could perform the displacement calculation by simply setting some input MUXes (or bufs) so that the PC and the immediate go to the ALU and the ALU function is set to add. You might load from a register by simply setting the ALU such that the appropriate register source makes it through the ALU without modification. This value can then be latched into the PC.

For the basic increment case, you could either put the PC through one side of the ALU and select a constant 1 for the other side somehow (put a constant value on one of the input muxes, or something similar), or you could build your PC as a loadable counter. If you use the counter approach, you can load the counter for the jump and displacement functions, and count the counter for the increment-the-PC function. The choice is yours. The advantage of the counter is that you may not have to use the complete datapath for each PC increment. The advantage of the increment-through-the-ALU approach is that every PC-update function goes through the same process, but with different mux settings.
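The three PC updates described above can be sketched behaviorally in Python. This is only an illustration under stated assumptions: the function names, the 8-bit displacement width, and the choice to add the displacement to the PC value passed in are all assumptions, so check the instruction set handout for the actual field width and reference point.

```python
MASK16 = 0xFFFF  # the PC is a 16-bit word address


def sign_extend(value, bits):
    """Interpret the low `bits` bits of value as a two's complement number."""
    value &= (1 << bits) - 1
    sign = 1 << (bits - 1)
    return (value ^ sign) - sign


def next_pc(pc, kind, disp=0, reg_value=0, disp_bits=8):
    """Return the updated PC for one of the three update kinds."""
    if kind == "increment":   # normal case: next word
        return (pc + 1) & MASK16
    if kind == "branch":      # PC + sign-extended displacement
        return (pc + sign_extend(disp, disp_bits)) & MASK16
    if kind == "jump":        # load from a register
        return reg_value & MASK16
    raise ValueError(f"unknown PC update kind: {kind}")


print(hex(next_pc(0x0100, "increment")))             # 0x101
print(hex(next_pc(0x0100, "branch", disp=0xFE)))     # 0xfe (displacement of -2)
print(hex(next_pc(0x0100, "jump", reg_value=0x2000)))  # 0x2000
```

Masking the sum to 16 bits models what a 16-bit adder does naturally: adding a negative two's complement displacement to an unsigned address produces the right address as long as the carry out is ignored.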
Remember that when you use the ALU to update the PC, you should not update the condition codes! Finally, remember that for a JAL instruction, the PC needs to have a path into the register file so that it can be stored in the link register. Your datapath must allow this operation.

Another issue with the PC has to do with signed and unsigned arithmetic. Recall that signed arithmetic is all done with two's complement numbers. This means that the range of numbers in a 16-bit word is -32,768 to 32,767. On the other hand, if you use those 16 bits to encode an unsigned number, you can represent 0 to 65,535 (64k values). Since addresses are usually considered unsigned numbers, we need to consider what it means to have an unsigned PC that is operated on by a two's complement ALU, especially in the face of signed offsets that might require subtraction! Note that our PC is addressing 16-bit words and not bytes! That way the bits in the PC are bits 0-15 of the word address, and the PC can address 64k 16-bit words.

B. Memory Data Register and Instruction Registers

When the memory controller fetches and delivers a 16-bit word, the CPU has to know whether an instruction or data was fetched. The control logic knows what is being fetched. If an instruction is being fetched, you would like to store it in a register (usually called the Instruction Register, or IR) for subsequent decode. What if it is data? Would you like to store it in the IR or have a dedicated Memory Data Register? Maybe, in your opinion, the Memory Data Register (MDR) should be inside the memory controller, and the CPU's job would be to: i) fetch the 16-bit instruction from the MDR into the IR; or ii) fetch the 16-bit data from the MDR and put it on the system bus/ALU input MUX/buf; or iii) any other way you would want to implement it. The choice is yours, but you need to show consistency - synchronizing the machine operation is the main thinking issue here.

III. SIGN EXTENSION

Various instructions in our machine make use of sign-extended immediates. Recall from the instruction set handout that immediates in arithmetic operations are sign-extended from 8 bits. Immediates in logical operations are zero-extended instead of sign-extended. Check the 427 handout for details.

IV. DECODING THE INSTRUCTION AND GENERATING THE CONTROL SIGNALS

In this assignment, the complete decode logic needs to be designed and completed: opcodes, MUX/buf select signals, read and write enables on the register files, and so on will all have to be generated by this piece of sequential control logic.
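As a quick illustration of the extension rules from Section III, the sketch below widens an 8-bit immediate to a 16-bit pattern, as the datapath would see it. The function names and the `logical` flag are hypothetical; which opcodes count as logical is defined by the instruction set handout, not by this sketch.

```python
def sign_extend8(imm8):
    """Replicate bit 7 into bits 15..8 of a 16-bit result."""
    imm8 &= 0xFF
    return (imm8 | 0xFF00) if (imm8 & 0x80) else imm8


def zero_extend8(imm8):
    """Fill bits 15..8 with zeros."""
    return imm8 & 0xFF


def extend_immediate(imm8, logical):
    # Assumption: the decoder provides a flag telling whether the
    # instruction is a logical one (zero-extend) or arithmetic (sign-extend).
    return zero_extend8(imm8) if logical else sign_extend8(imm8)


print(hex(extend_immediate(0x80, logical=False)))  # 0xff80
print(hex(extend_immediate(0x80, logical=True)))   # 0x80
print(hex(extend_immediate(0x7F, logical=False)))  # 0x7f
```

In hardware this is just a mux on the upper eight bits: either copies of bit 7 or zeros, selected by the decode logic.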
Even though this is the most important feature as far as your machine operation is concerned, you have already developed part of your decode logic/signals: (i) the Rsrc, Rdest read buffers and their controls in the register files; (ii) the write enable control signal; (iii) PSR updates; (iv) anything else?

You need to think about what's involved in actually executing each instruction. Since you are (most likely) not building a pipelined machine, the execution model is pretty simple (see footnote 1). Basically, your control needs to sequence some actions in the machine that will result in the instruction being executed.

Footnote 1: If you want to pipeline your machine, you should discuss your plans with me and Neal. I would encourage those of you who have little software coding for their project to attempt some engineering optimizations. Pipelining is one, but be careful: it is easy to understand and difficult to implement.

Instruction Fetch: Already discussed in the previous section. To repeat, before you can execute an instruction, you need to fetch it from memory. Think about what this involves. First you have to get the current PC value into the MAR. Then you read from the SDRAM (through the sdram macro and your sdram interface state machine). The returned data should then be latched into the instruction register (on the rising edge of the clock after DONE goes high from the controller). The result of this sequence of actions is that you fetch the next instruction into the instruction register, so you now know which instruction you are about to execute.

Instruction Decode: In this phase of execution you use the information in the instruction register to set up all the state in the control path that you need to execute the instruction. This may or may not involve a separate clock phase, depending on how your datapath is arranged. Things that get decoded include: mux settings, register file addressing, immediate fields (including sign extension or zero extension), and register enables. Of course, if the decoded value describes an instruction that needs a second word of data (i.e. some extended version of the instruction set; the baseline doesn't have any two-word instructions), you'll need to do a second fetch from the memory, but to a separate register so that you don't write over the current instruction. You'll need to have incremented the PC in this case too.

Instruction Execution: Now that everything is set up by the instruction decode logic, you can execute the instruction. In this non-pipelined case this simply means allowing the correct data to go through the datapath and compute a result. Make sure you understand each and every instruction. Note that loads and stores may require some extra work here because you need to tell the SDRAM controller to execute the load or store - the SDRAM interface is still pending. Also note that a PC operation must be performed somewhere. If the instruction is not a branch or jump, the PC needs to be incremented by one 16-bit word. If it is a taken branch, then a signed offset from the instruction must be added to the PC (the offset is a word offset from the current PC), and if it's a jump, the PC must be loaded from a register. If it's a jump-and-link, then the incremented PC must be stored in the destination register. Make sure to get the details of JAL right!

Writeback to the Register File: Once the result is computed for that instruction, you need, in most cases, to write the answer back to the register file. Most of you have already set up all the relevant information in the datapath (like destination addresses and mux/buf settings), so this is probably nothing more than enabling the register file to do a write on the next cycle. Remember, register files are enabled for write before the clock trigger arrives. By the way, what about writes into the Processor Status Register? Check whether your control logic needs to directly access (read, write, or read-modify-write) the PSR.

This cycle then repeats itself for each new instruction.
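The fetch, decode, execute, writeback sequence above is naturally a small state machine. Here is a rough Python sketch of just the sequencing, assuming one state per phase and a fetch state that waits for the memory's DONE signal. The state names and the `mem_done` input are illustrative; your design may merge decode into another phase, skip writeback for some instructions, or need extra states for multi-cycle operations.

```python
from enum import Enum, auto


class State(Enum):
    FETCH = auto()
    DECODE = auto()
    EXECUTE = auto()
    WRITEBACK = auto()


def next_state(state, mem_done):
    """Next control state; FETCH holds until the memory reports DONE."""
    if state is State.FETCH:
        return State.DECODE if mem_done else State.FETCH
    if state is State.DECODE:
        return State.EXECUTE
    if state is State.EXECUTE:
        return State.WRITEBACK
    return State.FETCH  # after WRITEBACK, start the next instruction


def run(done_sequence):
    """Step the machine once per clock and record the states visited."""
    state = State.FETCH
    trace = [state.name]
    for mem_done in done_sequence:
        state = next_state(state, mem_done)
        trace.append(state.name)
    return trace


# DONE arrives on the second clock, so FETCH is held for one extra cycle.
trace = run([False, True, False, False, False])
print(trace)
```

In the real controller each state would also drive the control signals for that phase (MAR load, IR load, mux selects, register file write enable, PC update), which is where most of the design effort goes.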
V. SUGGESTIONS ON HOW TO PROCEED

Disclaimer: These are just my suggestions, mostly based on my experience. By no means do you have to follow them strictly.

You may want to design an MAR, MDR, IR and PC, and try to synchronize the fetch part. Since the SDRAM interface is not ready yet, in your testbench you can emulate it by writing the PC into the MAR and fetching the instruction from the MDR into the IR. This way, you can de-link the fetch stage from the decode and execute stages. Later on, when we design the memory interface, you would have to make minimal changes to this fetch stage and perhaps hardly any to subsequent decoding and execution. If you mix everything in one-helluva-complex-module, re-synchronizing the CPU with the SDRAM would become almost impossible.

Moreover, try to group the instructions for decoding that: i) either perform similar execute operations; or ii) generate similar control signals. I would group the instructions with similar addressing modes together. Furthermore, keep track of how many execute cycles an instruction requires - e.g. your shifter, if implemented as a register, may require n cycles for shift-by-n operations. Counters can help, but ensure their correct operation by testing corner cases, resets, roll-overs, etc.

Implementation of the fetch part of Load-Store can be postponed until the next lab, when we have the SDRAM interface ready. But you can surely test the decode and control part for load/store operations. Validate each module (fetch/decode stage) separately, then put them together. Finally, you may want to use a program to test your machine. You can do that in the same way as we have been doing so far.

Another important piece of advice: First, try to implement the machine without jumps and branches. Make sure that a purely step-by-step sequential machine operation is correctly updating the PC, is setting up the right control signals at the right time, and so on. Once this is achieved, augment the hardware to include jumps and branches.

Good luck!