Project 4
Dual-Issue Superscalar MIPS Processor

Project Checkoff: Friday, June 1st, 2018
Report Due: Monday, June 4th, 2018

Overview:

Some machines go beyond pipelining and execute more than one instruction per cycle. How is this practical? In class, we learned about superscalar processors. The basic idea is that if we duplicate some of the functional units of the processor and provide logic to issue several instructions concurrently, the resulting CPI will effectively be less than one. There are two general approaches to multiple issue: static multiple issue, with issue scheduling performed at compile time, and dynamic multiple issue, with issue scheduling performed in hardware (also known as superscalar).

In this lab, you will implement a superscalar dual-issue processor with in-order scheduling. Like the original Pentium processor, your design will have a 5-stage pipeline, and like the Cortex A8 processor (used, e.g., in the iPhone 4), it will be based on a RISC-type instruction set architecture. Your 5-stage pipelined processor keeps the simple memory hierarchy and branch predictors that you developed in the previous labs; however, in lab4 your processor will be capable of running two instructions per cycle!

Fig. 1: Intel 80586 (P5) and Cortex A8 processors
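To make the "CPI less than one" claim concrete, here is a rough back-of-the-envelope model in Python. It is only a sketch and not part of the lab specification. It assumes that instructions issue in groups of `width`, that the pipeline needs `stages - 1` extra cycles to fill, and that every stall costs exactly one cycle. The names `pipeline_cycles`, `cpi`, and `stall_cycles` are made up for this illustration.

```python
import math

def pipeline_cycles(n_instr, width=2, stages=5, stall_cycles=0):
    """Cycles needed to run n_instr instructions on an in-order pipeline.

    Assumptions (illustrative only): instructions issue in groups of
    `width`, the first group needs `stages` cycles to complete, every
    later group completes one cycle after the previous one, and each
    stall adds exactly one cycle.
    """
    groups = math.ceil(n_instr / width)
    return (stages - 1) + groups + stall_cycles

def cpi(n_instr, **kwargs):
    """Average cycles per instruction under the same assumptions."""
    return pipeline_cycles(n_instr, **kwargs) / n_instr

print(cpi(1000, width=1))                    # 1.004 (single issue)
print(cpi(1000, width=2))                    # 0.504 (ideal dual issue)
print(cpi(1000, width=2, stall_cycles=300))  # 0.804 (dual issue with stalls)
```

Under these assumptions the ideal dual-issue CPI approaches 0.5 for long programs, and every stall cycle pushes it back toward the single-issue value.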
Dynamic (out-of-order) scheduling is a feature of all current x86 architectures. In such processors, dependencies are handled by complicated control logic such as Tomasulo's algorithm, which we considered in class. In this lab, you should implement in-order scheduling, just like you did in lab1, which is representative of the earlier ARM and MIPS processors. Handling hazards is more straightforward with static in-order scheduling; however, the number of possible dependencies that your hazard logic must handle can be quite large (and you may come to appreciate why it might be conceptually simpler to handle all dependencies in a centralized table, as in the scoreboard technique). A good approach is to make a complete list of dependencies and come up with a solution (e.g., stalling the processor or forwarding data) for each case. Then apply these techniques to your current architecture.

Hints:

Before designing, you should consider a few points that will help you with the lab. The timing diagram of a dual-issue pipelined processor (in the ideal case) is shown in Figure 2. Here, "ideal" means that there is no stalling and the processor runs at its maximum throughput of two instructions per cycle. In your processor, stalling will limit the performance significantly.

Fig. 2: The timing diagram of a dual-issue MIPS
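One way to start the "complete list of dependencies" is to enumerate the cases with a small script before writing any Verilog. The sketch below is a Python model, not RTL, and rests on several assumptions. The `Instr` record, the `raw` and `classify` helpers, and the suggested actions are all hypothetical. Only the decode-stage pair and the instructions one stage ahead are modeled, and the actions shown are one possible policy rather than the required solution.

```python
from collections import namedtuple

# Hypothetical instruction record: `dst` is the register written (or None),
# `srcs` are the registers read, `is_load` marks lw.
Instr = namedtuple("Instr", "name dst srcs is_load")

def raw(producer, consumer):
    """True if `consumer` reads a register that `producer` writes ($0 excluded)."""
    return producer.dst not in (None, 0) and producer.dst in consumer.srcs

def classify(older, younger, ex_stage=()):
    """List the data dependencies seen by a decode-stage pair.

    older, younger: the two instructions in decode, in program order.
    ex_stage: the instructions one stage ahead (assumed at most two).
    Returns (case, suggested_action) tuples.
    """
    cases = []
    if raw(older, younger):
        cases.append(("RAW inside the pair", "issue younger one cycle later"))
    if older.dst not in (None, 0) and older.dst == younger.dst:
        cases.append(("WAW inside the pair", "younger write must win"))
    for ahead in ex_stage:
        for instr in (older, younger):
            if raw(ahead, instr):
                action = "stall (load-use)" if ahead.is_load else "forward EX result"
                cases.append((f"RAW {ahead.name} -> {instr.name}", action))
    return cases

# Example: lw is in EX while add and sub sit together in decode.
lw = Instr("lw $2", 2, (1,), True)
add = Instr("add $3", 3, (2, 4), False)
sub = Instr("sub $5", 5, (3, 2), False)
for case in classify(add, sub, ex_stage=[lw]):
    print(case)
```

Extending `ex_stage` to cover the MEM and WB stages, and adding structural cases, grows the table quickly, which is the point the paragraph above makes about the number of dependencies.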
In the first lab, we considered a variety of instructions: add, addu, addi, addiu, sub, subu, and, or, xor, xnor, andi, ori, xori, slt, sltu, slti, sltiu, lw, sw, lui, j, bne, beq, mult, multu. In this lab, the mult and multu instructions are reserved for the extra credit questions, so you may ignore them when designing your processor.

Figure 3 shows the simplified architecture of a dual-issue processor. Some blocks and signals are omitted for clarity. There are several questions you should ask yourself before designing this processor. In particular, you have to know what is involved in fetching two instructions per cycle, decoding two instructions per cycle, executing two ALU operations per cycle, accessing the data cache twice per cycle, and writing back two results per cycle.

Fig. 3: 5-stage dual-issue pipeline

Two-wide fetching is easy when the processor has no cache, but handling branch instructions is a bit tricky. Basically, in each clock cycle, two instructions (64 bits) must be read from the instruction memory. The first step is to make the instruction memory a dual-port ROM. For branch prediction, you may access the branch target buffer in parallel (you will have to modify its structure), or you may come up with another solution. If the first instruction, or both the first and second instructions, are predicted not taken, the case is relatively straightforward. If the second instruction is predicted taken, you have to provide the next PC with the appropriate target address. But what if the first one is predicted taken? In this case, you may want to discard the second instruction by inserting a NOP (though executing it may also have some benefits). With early branch resolution in the decode stage, you will decide whether or not to flush the following instructions. Taking care of mispredictions
should also be straightforward, as you have already done it for a single-issue processor.

Wide decoding is easy in our case because the instruction length is fixed. The register file is now addressable through four ports. You may assume that some signals are also bypassed or sign-extended on their way to the EX stage. Obviously, you have to design another control unit to manage the second instruction. The problem that may arise here is managing hazards. Doubling the issue width quadruples the required stall logic because you have two instructions in the decode stage and two instructions in every other stage. You have to make a list of all possible hazards. A very important step is to generalize the ideas that you applied to the single-issue processor, i.e., forwarding and stalling, to take care of all possible hazards.

For the execution, memory, and write-back stages, we will simplify the problem by stalling the processor whenever there is a structural hazard. For simplicity, we can assume that the data memory can process only one instruction per cycle. In addition, you may adopt some other rules to simplify your design. For instance, if the older instruction (the first one) stalls, then the younger one has to stall as well and cannot bypass it. On the other hand, if the younger instruction stalls, the older instruction from the next group may or may not move up. You may assume a rigid pipe (the next instruction doesn't move up) for simplicity. Executing two instructions per cycle is also doable with two ALUs and sufficient bypass logic.

In order to design your processor successfully before the deadline, I strongly recommend that you follow this step-by-step routine.

Step1: Redesign your architecture to support dual issue. At this stage, you don't need to consider hazards, branches, etc.
Step2: Test your design by running two simple consecutive ADD instructions.
Step3: If the design passes the second step, you may increase the number of instructions to test the pipeline, but still use only simple instructions with no hazards.
Step4: Sketch a complete diagram of your design and identify all possible hazards. You could do this even earlier.
Step5: Now, take care of hazards by either stalling or forwarding. There will be lots of muxes and wires; therefore, you may want to implement it step by step and
as clearly as possible. Your diagram will help you follow every detail. Remember that the basic idea is the same as in the single-issue design.
Step6: Finally, you can take care of other details such as branch prediction.

FAQs:

1. Grading: Your grade is based mainly on correct operation. Functionality is the primary goal. 60% of your score is based on functionality, and the rest is based on your report.

2. What happens during the checkoff? You have 10 minutes to present your project. Both group members must be present during the presentation. You may bring your own laptop or use the computers in the ECI lab. Everybody has to explain the whole project and answer some questions.

3. What to turn in? Submit an organized Zip file containing the files mentioned below to zfahimi@ucsb.edu by the deadline.

Your report is very important. Start with an introduction, an illustration of the instructions, and your design methodology. Then focus on each of the steps provided in the manual. When describing each step, provide the code, the test bench, and the waveforms. Explain why your waveforms are correct and answer all questions. Organization and completeness of the report determine 40% of your score. Figures should be readable, and you have to explain them in detail. Mention how many hours you spent on this lab, your common mistakes in Verilog coding, and the lessons you learned. Finally, provide a conclusion that wraps up the project. Cite any references appropriately in the report.

A folder containing the project files, including all source files, test benches, and waveforms. Please comment your code heavily. Poorly commented code will not be graded.