Pipelined RISC-V Processors


Due date: Tuesday November 20th 11:59:59pm EST

Getting started: To create your initial Lab 7 repository, please visit the repository creation page. Once your repository is created, you can clone it into your VM by running:

    git clone git@github.mit.edu:6004-fall18/labs-lab7-{yourmitusername}.git lab7

Turning in the lab: To turn in this lab, commit and push the changes you made to your git repository. After pushing, check the course website to verify that your submission passes all the tests. If you finish the lab in time but forget to push, you will incur the standard late submission penalties.

Check-off meeting: After turning in this lab, you are required to go to the lab for a check-off meeting within 10 days of the lab's due date (i.e., by Fri Nov 30th; this is more days than usual to account for the Thanksgiving holidays). See the course website for lab hours.

Pipelined RISC-V Processors

In this lab you will implement two pipelined RISC-V processors in Bluespec. For Bluespec-related questions, you may want to check out the Introductory Bluespec User Guide. To pass the lab you must complete all of the exercises and discussion questions and PASS all of the exercises.

Coding guidelines: You should only change the following files: TwoStage.bsv, TwoStagePlus.bsv and ThreeStage.bsv. Modifications to other files will be overwritten during didit grading. Please provide answers to the discussion questions in discussion.txt.

Debugging guidelines: If your processor does not work as expected, please read the Appendix, which describes general debugging strategies and shows how to use an optional pipeline visualization aid.

1 Two-Stage Pipelined Processor

1.1 Fixing the Two-Stage Pipelined Processor

TwoStage.bsv contains an implementation of a functional two-stage pipelined processor that correctly handles control hazards, but it is not properly pipelined. It passes all the fullasmtests for functional correctness, but fails the pipetests, which also check that the cycle counts match those of a pipelined processor. The processor is not properly pipelined because the dofetch and doexecute rules conflict, so they cannot fire in the same cycle.

Discussion Question 1 (10 points): Why do rules dofetch and doexecute conflict?

To resolve the rule conflict, you can split the conflicting part of rule doexecute into rule doredirection, such that:

  - Rule doexecute saves the misprediction condition and redirected PC in two registers.
  - Rule doredirection is executed only if the saved misprediction condition is true, and it updates the pc and epoch registers.
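To make the notion of a rule conflict more concrete, here is a small Python sketch (not Bluespec) of a pairwise conflict check. It assumes a simplified model in which all state lives in plain registers and, within one cycle, every read of a register must be scheduled before any write to that register. The rule names (ruleA, ruleB, ruleC) and register names (x, y, z) are hypothetical and deliberately unrelated to the rules in this lab.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Rule:
    """A rule described only by the registers it reads and writes."""
    name: str
    reads: frozenset = field(default_factory=frozenset)
    writes: frozenset = field(default_factory=frozenset)


def can_precede(first: Rule, second: Rule) -> bool:
    """True if `first` may be ordered before `second` in the same cycle.

    Simplifying assumption: a register read must come before a write to the
    same register, so `first` must not write anything that `second` reads.
    """
    return not (first.writes & second.reads)


def conflict(a: Rule, b: Rule) -> bool:
    """Two rules conflict when neither order is allowed."""
    return not can_precede(a, b) and not can_precede(b, a)


# Hypothetical rules: each one reads the register the other one writes.
rule_a = Rule("ruleA", reads=frozenset({"x"}), writes=frozenset({"y"}))
rule_b = Rule("ruleB", reads=frozenset({"y"}), writes=frozenset({"x"}))
# ruleC writes nothing that ruleB reads, so it can be ordered before ruleB.
rule_c = Rule("ruleC", reads=frozenset({"x"}), writes=frozenset({"z"}))

print(conflict(rule_a, rule_b))  # True
print(conflict(rule_c, rule_b))  # False
```

Listing the registers each of your own rules reads and writes, and applying the same check by hand, is one way to approach questions about why two rules cannot fire together.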

The topic of rule splitting is explained in the slides of Lecture 18.

Exercise 1 (20 points): Fix the two-stage pipelined processor by splitting the conflicting part of rule doexecute into rule doredirection. All the processor-related types are defined in ProcTypes.bsv. Build your two-stage pipelined processor by running make TwoStage. You can run a suite of tests on the processor by running ./test.sh, which prints all the $display messages and PASSED or FAILED to the screen. You can silence the $display messages and only see PASSED or FAILED by running ./test.sh -s or ./test.sh --summary. You should pass both the fullasmtests (option 6) and the pipetests (option 11). If things don't work as expected, please refer to Appendix 3 for more Bluespec debugging guides.

1.2 Improving the Two-Stage Pipelined Processor

Now that you have fixed the two-stage pipelined processor, you can further improve it and save a cycle by sending the instruction memory load request in rule doredirection in case of a misprediction. In other words, if there is a misprediction, then doredirection should initiate the instruction fetch and update the program counter, just like dofetch does. If there is no misprediction, dofetch should perform the instruction fetch as before.

For the processor to work correctly, make sure that the guards of dofetch and doredirection are mutually exclusive. If they are not, then when rules dofetch and doredirection conflict, the Bluespec compiler will automatically schedule dofetch before doredirection, since dofetch appears before doredirection in the code. Thus, if doredirection and dofetch are correlated such that rule doredirection being ready always implies that dofetch is ready, doredirection will never fire. In such a case, Bluespec will print a warning:

    According to the generated schedule, rule domisprediction can never fire.
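The effect of overlapping guards can be illustrated with a toy Python model. It is not the Bluespec compiler's actual scheduling algorithm: it simply assumes two conflicting rules where the one listed first wins whenever both guards are true. The state variable mispredicted and the guard functions are hypothetical.

```python
def fired_rule(guards, state):
    """Return the first rule (in code order) whose guard is true, or None.

    Models mutually conflicting rules: at most one fires per cycle and the
    rule listed earlier takes priority.
    """
    for name, guard in guards:
        if guard(state):
            return name
    return None


def rules_that_fire(guards, states):
    """Collect every rule that fires in at least one of the given states."""
    fired = set()
    for state in states:
        name = fired_rule(guards, state)
        if name is not None:
            fired.add(name)
    return fired


states = [{"mispredicted": False}, {"mispredicted": True}]

# Overlapping guards: dofetch is always ready, so doredirection is starved.
overlapping = [
    ("dofetch", lambda s: True),
    ("doredirection", lambda s: s["mispredicted"]),
]

# Mutually exclusive guards: exactly one rule is ready in each state.
exclusive = [
    ("dofetch", lambda s: not s["mispredicted"]),
    ("doredirection", lambda s: s["mispredicted"]),
]

print(sorted(rules_that_fire(overlapping, states)))  # ['dofetch']
print(sorted(rules_that_fire(exclusive, states)))    # ['dofetch', 'doredirection']
```

In the overlapping case the second rule never appears in the result, which corresponds to the "can never fire" situation described above.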
To prevent rule doredirection from being blocked forever due to this problem, make sure that dofetch and doredirection have mutually exclusive guards.

Exercise 2 (20 points): Copy your working code from TwoStage.bsv to TwoStagePlus.bsv. Improve the two-stage pipelined processor by sending an instruction load request in rule doredirection. Build your two-stage pipelined processor by running make TwoStagePlus. You can run a suite of tests on the processor by running ./test.sh, which prints all the $display messages and PASSED or FAILED to the screen. You can silence the $display messages and only see PASSED or FAILED by running ./test.sh -s or ./test.sh --summary. You should pass both the fullasmtests (option 6) and the pipetests (option 11). If things don't work as expected, please refer to Appendix 3 for more Bluespec debugging guides.

1.3 Synthesizing the Two-Stage Pipeline Processor

Synthesize your processor by running:

    synth TwoStagePlus.bsv mkproctwostageplus -l multisize

Note: synth has been updated for this lab. If you get an error when trying to run synth, close and reopen your terminal. This will automatically pull the latest version.

Discussion Question 2 (10 points): What are the critical-path delay and area (excluding memories) of your Two-Stage Processor? Which stage determines the critical path?

Hint: You can determine which stage is in the critical path by looking at the names of the startpoints and endpoints of the critical path. These could be either the inputs or outputs of the instruction and data memories, or the inputs or outputs of a register.

2 Three-Stage Pipelined Processor

2.1 Fixing the Three-Stage Pipelined Processor

To improve on the two-stage design, let's implement a three-stage pipelined processor with the following stages:

  - The Fetch stage initiates an instruction memory read request and sets the PC to the predicted next-pc value (PC+4).
  - The Decode stage decodes the fetched instruction and reads its source operands from the register file.
  - The Execute stage executes the instruction, reading or writing to the data memory and writing to the register file as needed.

This design is like the one described in the slides of Lecture 18.

Unfortunately, since the Decode and Execute stages can execute concurrently, there can be a data hazard in this processor pipeline: the Decode stage can read a stale value from the register file, one that has not yet been updated by an earlier instruction that is still in the Execute stage. One can resolve this data hazard by tracking all outstanding register file writes in a hardware structure called a Scoreboard, and stalling the Decode stage when the index of one of the source registers is found in the Scoreboard. When an instruction writes to the register file, its entry should be removed from the Scoreboard, and the Decode stage can then proceed. The Scoreboard has the following interface:

    interface Scoreboard#(numeric type size);
        method Action insert(Maybe#(Bit#(5)) dst);
        method Action remove();
        method Bool search1(Maybe#(Bit#(5)) src1);
        method Bool search2(Maybe#(Bit#(5)) src2);
    endinterface

  - size is the number of outstanding register write indices that the Scoreboard can hold.
  - method insert inserts a destination register index into the Scoreboard. An Invalid dst is treated as a NOP on the register file write. Each Valid or Invalid dst occupies a slot in the Scoreboard, and a search for an Invalid dst will return False.
  - method remove removes the oldest outstanding register write index from the Scoreboard. You also need to remove Invalid dst entries from the Scoreboard to free up space for later instructions.
  - methods search1 and search2 match src register indices against the Valid register indices stored in the Scoreboard, and return True if a match is found. A search for register 0 is always False.

ThreeStage.bsv contains a non-functional three-stage pipelined processor that does not handle hazards correctly. Specifically, the code in ThreeStage.bsv has three issues discussed in the slides of Lecture 18:

1. Rule dodecode does not have the necessary logic to stall the Decode stage on a data hazard. In rule dodecode, a new instruction inst from imem (Instruction Memory) should not be processed if the previous instruction stalled. This is a consequence of the request-response interface of imem: once imem.resp() is called, the value it returns is no longer available in imem; subsequent imem.resp() calls return the data for subsequent load requests. Therefore, if dodecode needs to stall (due to a data hazard), it needs to save the fetched instruction in fetchedinst to avoid losing it. Consequently, after a stall, dodecode should use the instruction previously saved in the fetchedinst register instead of calling imem.resp().

2. Rules doexecute and doloadwait do not have the necessary logic to remove the oldest item from the Scoreboard in the Execute stage when an instruction finishes execution. Specifically, these rules should call sb.remove in the following three cases:

  - For an instruction with a Valid dst, sb.remove and rf.wr should be called atomically, which is guaranteed if they are called together in the same rule.
  - For an instruction with an Invalid dst, sb.remove should also be called to make space for later instructions. Otherwise the pipeline would get stuck.
  - For an instruction on the wrong path of execution (i.e., a mispredicted instruction), sb.remove should also be called to make space for later instructions.

3. Finally, rules dofetch and doexecute conflict just like in the two-stage pipelined processor (and this conflict can be fixed in the same way).

Exercise 3 (30 points; only passing fullasmtests is 20 points): Fix the three issues in ThreeStage.bsv. Build your three-stage pipelined processor by running make ThreeStage. You can run a suite of tests on the processor by running ./test.sh, which prints all the $display messages and PASSED or FAILED to the screen. You can silence the $display messages and only see PASSED or FAILED by running ./test.sh -s or ./test.sh --summary. You should pass both the fullasmtests (option 6) and the pipetests (option 11). If things don't work as expected, please refer to Appendix 3 for more Bluespec debugging guides.

Hint: You can first tackle issues 1 and 2, which is enough to pass fullasmtests. After this, you can apply the same strategy from Section 1.1 or Section 1.2 to solve issue 3 and pass pipetests.

2.2 Synthesizing the Three-Stage Pipeline Processor

Synthesize your processor by running:

    synth ThreeStage.bsv mkprocthreestage -l multisize

Discussion Question 3 (10 points): What are the critical-path delay and area (excluding memories) of your three-stage pipelined processor? What stage determines the critical path? How does this design compare with your two-stage pipelined processor?
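The Scoreboard semantics described in Section 2.1 can be sketched as a small Python model. This is a behavioral illustration only, not the lab's Bluespec implementation: None stands in for an Invalid value, the method bodies are assumptions derived from the interface description, and raising an error on a full Scoreboard is a modeling choice.

```python
from collections import deque
from typing import Optional


class Scoreboard:
    """FIFO of outstanding destination registers (None models Invalid)."""

    def __init__(self, size: int):
        self.size = size
        self.slots = deque()

    def insert(self, dst: Optional[int]) -> None:
        # Valid and Invalid destinations both occupy a slot.
        if len(self.slots) >= self.size:
            raise RuntimeError("scoreboard full")
        self.slots.append(dst)

    def remove(self) -> None:
        # Drop the oldest outstanding entry, Valid or Invalid.
        self.slots.popleft()

    def search(self, src: Optional[int]) -> bool:
        # Invalid sources and register 0 never match.
        if src is None or src == 0:
            return False
        return src in self.slots

    # The interface exposes one search method per source operand.
    search1 = search
    search2 = search


sb = Scoreboard(size=2)
sb.insert(5)       # an instruction that will write register 5
sb.insert(None)    # an instruction with no destination still takes a slot
print(sb.search1(5), sb.search2(None), sb.search1(0))  # True False False
sb.remove()        # the instruction writing register 5 finishes
print(sb.search1(5))  # False
```

Note that after the first remove the Invalid entry still occupies a slot, which is why every instruction leaving the Execute stage has to call remove, whether or not it writes the register file.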

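The need for a holding register such as fetchedinst (issue 1 in Section 2.1) can be seen in a toy Python model of a memory whose responses are consumed when read. This is a sketch under stated assumptions, not the lab's code: ResponseQueue, decode_step, and the string "instructions" are hypothetical stand-ins, and the hazard pattern is made up for the example.

```python
from collections import deque


class ResponseQueue:
    """Toy stand-in for imem: each resp() call consumes one response."""

    def __init__(self, responses):
        self.pending = deque(responses)

    def resp(self):
        return self.pending.popleft()


def decode_step(imem, fetched_inst, hazard):
    """One Decode attempt.

    Returns (issued, new_fetched_inst): `issued` is the instruction passed
    on to Execute (None on a stall), and `new_fetched_inst` is the value to
    keep in the holding register for the next attempt.
    """
    # After a stall, reuse the saved instruction instead of calling resp().
    inst = fetched_inst if fetched_inst is not None else imem.resp()
    if hazard(inst):
        return None, inst   # stall: keep the instruction for later
    return inst, None       # issue: the holding register is empty again


imem = ResponseQueue(["add", "sub"])
hazards = iter([True, False, False])  # hypothetical: stall once, then proceed

held = None
issued = []
for _ in range(3):
    stall = next(hazards)
    out, held = decode_step(imem, held, lambda inst: stall)
    if out is not None:
        issued.append(out)

print(issued)  # ['add', 'sub']
```

If decode_step called imem.resp() on every attempt, the stalled instruction would be replaced by the next response and never issued.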
3 Appendix: Debugging Help

If your processor does not work as expected, there are some simple strategies you can follow to debug it. This appendix first discusses a general strategy to debug Bluespec circuits, then discusses a tool that's specific to pipelined designs.

3.1 General Guidelines

If things don't work as expected, start by adding $display statements to see what rules are being invoked and at which cycles. It helps to be systematic: we recommend that you first add $display("[%d] <rulename>", cycles); at the top of each rule. Many times this is sufficient to understand what's going wrong (e.g., if you forget to enqueue to a FIFO that's read by a rule, you'll see that the rule doesn't fire at all or stops firing). Then, refine by adding more $display statements or more output to each statement.

3.2 Pipeline Visualization with ScheduleMonitor

This lab contains a ScheduleMonitor module that you can optionally use to obtain a visual representation of which pipeline rules are firing each cycle. This module is simulation-only and produces no actual hardware. For a fully pipelined processor with no data hazards, control hazards, or load instructions, this module may produce an output similar to the one below:

    fetch decode execute
    F
    FD_

The names at the top are the names of each of the columns. These correspond to pipeline stages. The rows below correspond to what is happening in each clock cycle. The first row F means in the first clock cycle only fetch fired. The fifth row W means in the fifth clock cycle all 4 stages of the pipeline fired concurrently.

There are four other letters that may appear as output from the ScheduleMonitor integrated with the provided initial code:

  L - An instruction is in the LoadWait state of the execute stage.
  x - An instruction was killed in-place in the specified stage.
  s - An instruction stalled in the decode stage due to a data hazard.
  R - The execute stage fired and redirected the fetch stage due to a mispredicted next pc.
Using ScheduleMonitor

Using ScheduleMonitor is optional and requires adding some code to your processor. The module constructor for ScheduleMonitor (mkschedulemonitor) takes a File object (either stdout, stderr, or an opened text file) and a vector of pipeline stage names. The order of names in this vector determines the order of the columns in the output. The code changes outlined below instantiate a ScheduleMonitor for a 3-stage pipeline that prints to stdout.

    ScheduleMonitor monitor <- mkschedulemonitor(stdout, vec("fetch", "decode", "execute"));

    rule dofetch;
        // do rest of fetch
        monitor.record("fetch", "F");
    endrule

    rule dodecode;
        // do rest of decode
        if (...) // not stalling
            monitor.record("decode", "D");
        else // stalling
            monitor.record("decode", "s");
    endrule

    rule doexecute;
        // do rest of execute
        if (...) // not redirecting
            monitor.record("execute", "E");
        else // killed
            monitor.record("execute", "x");
    endrule

    rule doloadwait;
        // do rest of loadwait
        monitor.record("execute", "L");
    endrule

    rule doredirection;
        // do rest of redirection
        monitor.record("execute", "R");
    endrule

The record method of ScheduleMonitor writes a character in the specified column of the pipeline schedule diagram. Typically the first letter of the pipeline stage is written in the column when the stage fires normally, but the above code uses some other letters to show special conditions.
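To illustrate how record calls turn into rows of the schedule diagram, here is a toy Python stand-in for ScheduleMonitor. It is not the lab's module: the class name, the end_cycle method, and the use of _ for a stage that did not fire (suggested by the FD_ row in the sample output) are assumptions made for this sketch.

```python
class ToyScheduleMonitor:
    """Collects one character per stage per cycle and renders rows."""

    def __init__(self, stages):
        self.stages = list(stages)
        self.current = {}
        self.rows = []

    def record(self, stage, char):
        # Write `char` into the column of `stage` for the current cycle.
        if stage not in self.stages:
            raise KeyError(stage)
        self.current[stage] = char

    def end_cycle(self):
        # Stages with no record this cycle are shown as '_' (assumption).
        row = "".join(self.current.get(s, "_") for s in self.stages)
        self.rows.append(row)
        self.current = {}
        return row


monitor = ToyScheduleMonitor(["fetch", "decode", "execute"])

monitor.record("fetch", "F")
print(monitor.end_cycle())  # F__

monitor.record("fetch", "F")
monitor.record("decode", "D")
print(monitor.end_cycle())  # FD_

monitor.record("fetch", "F")
monitor.record("decode", "D")
monitor.record("execute", "E")
print(monitor.end_cycle())  # FDE
```

The column order follows the order of the stage names passed to the constructor, mirroring the role of the name vector described above.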
