EECS 151/251 FPGA Project Report
GSI: Vighnesh Iyer
Team Number: 6
Partners: Ashwinlal Sreelal and Zhixin Alice Ye
Due Date: Dec 9, 2016

Part I: Project Description

The aim of this project was to develop a 3-stage pipelined Central Processing Unit (CPU) running on a Virtex-6 Xilinx FPGA. The CPU implements a reduced instruction set architecture that includes most of the instructions in the RV32I Base Instruction Set. The CPU had a number of requirements, including minimizing the Cycles Per Instruction (CPI) and maximizing the speed of the CPU clock (to beyond 50 MHz), while staying within the size limitations of the FPGA. To achieve this, data forwarding and hazard-handling logic were implemented to resolve the different types of control, structural, and data hazards. Optimizations were also made to reduce the core CPU critical path, and a global branch predictor was added to improve CPI. In addition to the core processor, several peripheral units were added to allow the CPU to communicate with its surrounding environment. A memory controller was implemented to communicate between Data Memory, Instruction Memory, the UART, a cycle counter, switches and LEDs on the FPGA, and an AC97 interface. Because programs could be loaded into the data or instruction memory and run on the FPGA, a number of relatively complex and interesting programs could be run on the CPU, such as a piano program that uses the computer keyboard as input.

Part II: High Level Organization

The above diagram provides a high-level overview of our design. ml505top is the top-level block, and the outputs of this block interface directly with ports on the FPGA. The tone generator block, the RiscV151 CPU, and the AC97 periphery (including the FIFOs for the microphone and speaker outputs) are located here. Within the CPU, the main 3-stage datapath is instantiated, which communicates with the memory controller to access the various inputs and outputs. Other features within the CPU are the FIFO to the GPIO LEDs and switches, the UART for communication with external computing devices, a cycle counter, and the branch predictor. Three types of memory are also included: the instruction memory, BIOS memory, and data memory.

Our datapath consists of three combinational stages, each separated by a set of registers. Each stage takes in registered inputs, including a set of control signals, and uses them to drive the outputs of the next stage. A diagram illustrating the entire datapath is shown below.

The first stage, the Instruction Fetch stage, decodes the instruction and converts it to a set of control signals for future stages. It determines the instruction's inputs into the Execute stage, including which values are fed into the ALU, and it determines the address of the next instruction to fetch, stored in the next_pc value. Decoded signals also send read addresses to the RegFile so that the register values are available in the Execute stage. The stage's outputs (next_pc, regout1, regout2, and the control signals) are registered.

The second stage, the Execute stage, takes inputs from the first stage and computes the ALU output to write to either memory or the register file. Data forwarding logic is needed here to resolve hazards, which is discussed further in Part III. To reduce the fanout of the design, parallel arithmetic units compute the ALU output and the branch-compare output. The data, address, and write enable bits are then forwarded to the next stage.

The third stage, the Writeback stage, decodes the memory output and chooses which value to write back to the register file. It also kills instructions following branch mispredictions or JALRs.
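As a concrete illustration of these stage boundaries, the sketch below shows one way the stage-1 to stage-2 pipeline registers could be written; the port names and the width of the control bundle are assumptions made for the sketch, not the identifiers used in our design.

    // Illustrative IF -> EX pipeline registers. Names and the control-bundle
    // width are assumed for this sketch.
    module if_ex_reg (
        input             clk,
        input      [31:0] next_pc_d, regout1_d, regout2_d,
        input      [15:0] ctrl_d,     // decoded control signals (width assumed)
        output reg [31:0] next_pc_q, regout1_q, regout2_q,
        output reg [15:0] ctrl_q
    );
        always @(posedge clk) begin
            next_pc_q <= next_pc_d;   // registered inputs seen by the Execute stage
            regout1_q <= regout1_d;
            regout2_q <= regout2_d;
            ctrl_q    <= ctrl_d;
        end
    endmodule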

Datapath (figure)

Part III: Component Description

A. Control Signal Decoding

The IFController block provides the decode logic in the first stage of the CPU. It takes in a 32-bit instruction from instruction memory or BIOS memory and outputs the relevant signals to the rest of the CPU. These signals include the appropriately bit-shifted and sign-extended immediate, as well as indications to the other components of the CPU of what type of instruction it is (jal, branch, register-file related) and what the instruction opcode is.

B. Branch/Jump Implementation

The pc update logic for our CPU is the following:

    if (3rd stage inst killed):
        if (1st stage is jal OR we predict a taken branch) AND 1st stage not killed:
            next_pc = curr_pc + imm
        else:
            next_pc = curr_pc + 4
    else:
        if 3rd stage is jalr:
            next_pc = 3rd stage jalr pc
        else if 3rd stage is mispredicted branch:
            next_pc = 3rd stage branch_not_taken pc
        else if (1st stage is jal OR we predict a taken branch) AND 1st stage not killed:
            next_pc = curr_pc + imm
        else:
            next_pc = curr_pc + 4
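For concreteness, below is a minimal Verilog sketch of this priority selection. All signal names (st1_*, st3_*, and so on) are assumptions made for the sketch and are not the identifiers used in our source.

    // Illustrative next-PC selection implementing the priority above.
    module next_pc_sel (
        input  [31:0] curr_pc,
        input  [31:0] st1_imm,               // immediate decoded in stage 1
        input         st1_is_jal,
        input         st1_predict_taken,     // predictor says the branch is taken
        input         st1_killed,
        input         st3_killed,
        input         st3_is_jalr,
        input  [31:0] st3_jalr_target,       // corrected PC for a valid JALR
        input         st3_branch_mispredict,
        input  [31:0] st3_redirect_pc,       // branch-not-taken PC from stage 3
        output reg [31:0] next_pc
    );
        wire st1_jump = (st1_is_jal | st1_predict_taken) & ~st1_killed;

        always @(*) begin
            if (st3_killed)
                next_pc = st1_jump ? curr_pc + st1_imm : curr_pc + 32'd4;
            else if (st3_is_jalr)
                next_pc = st3_jalr_target;        // correct a valid JALR
            else if (st3_branch_mispredict)
                next_pc = st3_redirect_pc;        // correct a mispredicted branch
            else if (st1_jump)
                next_pc = curr_pc + st1_imm;      // JAL or predicted-taken branch
            else
                next_pc = curr_pc + 32'd4;        // default sequential fetch
        end
    endmodule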

This updates the pc (program counter) with either a correction for a valid JALR (jump and link register) instruction or a mispredicted branch, a valid JAL (jump and link) instruction, or the default address. This implementation does not kill instructions on a JAL, but kills two instructions for a JALR or mispredicted branch. The kills are implemented using a kill signal that is sent by the final stage and forwarded through the pipeline. When a stage is killed, its ability to affect the next address is removed (see the pc update logic above) and its write ports are disabled.

We chose to implement the branch logic in this manner for several reasons:

1. Since our implementation needs to calculate the ALU immediate out of Instruction Fetch and add the two values by the second stage in order to do branch prediction, most of the hardware for calculating and taking a JAL in the first stage already needed to exist. We believed the cost of one additional mux would be worth the gain in CPI (cycles per instruction) from not needing to kill instructions on JALs.

2. While branches and JALRs were originally taken in the second stage, we found that this design created a large critical path (approximately 50% longer than the next largest critical path) when combined with data forwarding into the ALU. Moving this logic out of the second stage would therefore improve performance.

3. Since JALRs are relatively rare (it is uncommon to take large jumps), the only penalty for moving these operations into the third stage is an additional cycle on mispredicted branches. Therefore, if we could properly predict branches, there would be minimal loss in CPI.

C. Branch Prediction (Extra Credit)

We implemented two different versions of the branch predictor.

1. The first version was a global branch predictor. It used a 2-bit saturating counter that increments whenever a branch is taken and decrements whenever a branch is not taken (a sketch of this counter appears after this list). The state does not change on instructions that are not branches. If the counter value is 2'b10 or 2'b11, the IF stage is told to predict that the branch is taken. This global branch predictor provided a significant improvement, reducing CPI from 1.2 to 1.1 with little impact on maximum clock frequency.

2. The branch history table (BHT) worked very similarly to the global branch predictor. Instead of a single counter, there are 32 counters that form a history table, and branches update and receive predictions based on bits [6:2] of their PC. This worked only marginally better than the global branch predictor, saving only a few hundred cycles out of tens of millions of instructions, or about a 0.001% reduction in runtime. It had no impact on clock frequency, but used about 10% more LUTs and slice registers.
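The sketch below shows a minimal implementation of the 2-bit saturating counter used by the global branch predictor (version 1). Port names are assumptions for the sketch, not the names used in our design.

    // Illustrative global branch predictor: one 2-bit saturating counter.
    module global_branch_predictor (
        input  clk,
        input  rst,
        input  branch_resolved,   // a branch resolved this cycle
        input  branch_taken,      // ...and was actually taken
        output predict_taken      // prediction sent to the IF stage
    );
        reg [1:0] counter;        // 00/01 -> predict not taken, 10/11 -> predict taken

        always @(posedge clk) begin
            if (rst)
                counter <= 2'b01;                          // start weakly not taken
            else if (branch_resolved) begin
                if (branch_taken && counter != 2'b11)
                    counter <= counter + 2'b01;            // saturate at 2'b11
                else if (!branch_taken && counter != 2'b00)
                    counter <= counter - 2'b01;            // saturate at 2'b00
            end
            // non-branch instructions leave the counter unchanged
        end

        assign predict_taken = counter[1];                 // high for 2'b10 or 2'b11
    endmodule

The BHT version simply replaces the single counter with an array of 32 such counters indexed by pc[6:2].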

D. Register File

The register file is implemented using a 2D array to store values. It has a synchronous write port and two synchronous read ports. r0 cannot be written to and always reads as 32'b0.

E. Arithmetic Logic Unit

The Arithmetic Logic Unit (ALU) is the main component of the Execute stage. It was written to be easy to read and debug in Verilog, using a case statement to choose the internal operation from a list of ALU operations. It performs operations on alu_in1 and alu_in2 such as addition, greater-than-or-equal comparison, and bitwise shifting. Inputs are sign-extended when the ALU operation calls for it. Later, for optimization purposes, we worked to reduce fanout through the ALU logic stage and consequently removed a few unneeded operations from this block. This is discussed in more detail in Part IV.

F. Data Forwarding

To resolve data hazards, data forwarding was implemented to save on CPI. The data forwarding hazard logic is as follows (a sketch of the detection logic appears after this list):

1. If there was a register write, the register was not r0, and the next instruction uses the same register, the value is forwarded in front of the ALU in the Execute stage.

2. If there was a register write and the second instruction after it uses the same register, the data is instead forwarded at the end of the Instruction Fetch stage.

3. Lastly, if there is a load from memory followed by a store to memory, the data is forwarded within the Execute stage, prior to the Data Memory block, with a mux.

We did not need to check whether either register equals r0, because the register file is designed such that an r0 read/write always outputs the correct value.
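The sketch below illustrates the first two forwarding conditions as comparator logic. The stage and signal names are assumptions chosen for the sketch; the actual hazard checks in our design are structured around its own pipeline registers.

    // Illustrative forwarding-detection logic for cases 1 and 2 above.
    module fwd_detect (
        input  [4:0] ex_rd,                     // destination of the instruction one ahead
        input        ex_reg_we,                 // ...which writes the register file
        input  [4:0] wb_rd,                     // destination of the instruction two ahead
        input        wb_reg_we,
        input  [4:0] if_rs1, if_rs2,            // sources decoded in Instruction Fetch
        output       fwd_alu_rs1, fwd_alu_rs2,  // case 1: forward in front of the ALU
        output       fwd_if_rs1,  fwd_if_rs2    // case 2: forward at the end of IF
    );
        // Case 1: the previous instruction writes a register (not r0) that the
        // next instruction reads.
        assign fwd_alu_rs1 = ex_reg_we && (ex_rd != 5'd0) && (ex_rd == if_rs1);
        assign fwd_alu_rs2 = ex_reg_we && (ex_rd != 5'd0) && (ex_rd == if_rs2);

        // Case 2: the instruction two ahead wrote a register this one reads.
        assign fwd_if_rs1 = wb_reg_we && (wb_rd == if_rs1);
        assign fwd_if_rs2 = wb_reg_we && (wb_rd == if_rs2);
    endmodule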

G. Memory Controller

The memory controller is the block that interfaces between the CPU and all of the various forms of memory and peripheral circuits, as listed below (signal prefix, address, description):

mem_* (N/A): The interface with the CPU, indicating whether the CPU wants to read or write something from memory.

dmem_* (Addr[31:28] = 4'b00x1): Data memory, a data storage block for the CPU.

imem_* (Write: Addr[31:28] = 4'b001x; Read: Addr[31:28] = 4'b0001): Instruction memory, a read/write memory that supplies the instructions read by the CPU core.

bios_* (Addr[31:28] = 4'b0100): BIOS memory, a pre-loaded memory block that ensures the CPU always boots up to a known state.

uart_* (Control: 32'h8000_0000; Receive: 32'h8000_0004; Transmit: 32'h8000_0008): UART interface, for communication between the CPU on the FPGA and other computers.

counter_*, cycle_* (Cycle counter: 32'h8000_0010; Instruction counter: 32'h8000_0014; Reset counters to 0: 32'h8000_0018): Counter interface for determining the number of cycles/instructions that were executed.

GPIO_* (FIFO empty: 32'h8000_0020; FIFO read data: 32'h8000_0024; DIP switches: 32'h8000_0028; GPIO and compass LEDs: 32'h8000_0030): GPIO/DIP interface, for reading from the switches as well as writing to the LEDs on the FPGA board.

TG_*, ac97_fifo_*, ac97_mic_* (TG enable: 32'h8000_0034; TG switch period: 32'h8000_0038; AC97 FIFO full: 32'h8000_0040; AC97 FIFO sample: 32'h8000_0044; AC97 volume: 32'h8000_0048; MIC FIFO empty: 32'h8000_0050; MIC FIFO sample: 32'h8000_0054): The AC97 interface, which is responsible for coordinating the tone generator, headphones, and microphone.
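A sketch of how this address map might be decoded is shown below; the signal names and the UART control-register bit layout are assumptions made for illustration, and the real controller additionally registers its inputs and handles byte enables.

    // Illustrative address decode for the memory map above.
    module mem_decode (
        input      [31:0] mem_addr,
        input      [31:0] cycle_count, inst_count,
        input             uart_rx_valid, uart_tx_ready,
        output            dmem_sel, imem_wr_sel, bios_sel, mmio_sel,
        output reg [31:0] mmio_rdata
    );
        assign dmem_sel    = (mem_addr[31:28] == 4'b0001) ||
                             (mem_addr[31:28] == 4'b0011);   // Addr[31:28] = 4'b00x1
        assign imem_wr_sel = (mem_addr[31:28] == 4'b0010) ||
                             (mem_addr[31:28] == 4'b0011);   // Addr[31:28] = 4'b001x
        assign bios_sel    = (mem_addr[31:28] == 4'b0100);
        assign mmio_sel    = (mem_addr[31:28] == 4'b1000);   // 32'h8000_xxxx peripherals

        always @(*) begin
            case (mem_addr)
                32'h8000_0000: mmio_rdata = {30'b0, uart_rx_valid, uart_tx_ready}; // UART control (bit layout assumed)
                32'h8000_0010: mmio_rdata = cycle_count;     // cycle counter
                32'h8000_0014: mmio_rdata = inst_count;      // instruction counter
                default:       mmio_rdata = 32'b0;
            endcase
        end
    endmodule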

This block reads the input address and then maps the relevant data and address to the corresponding output port. Since the timing between signals can be important, some signals needed to be delayed. This also prevents potential combinational loops by ensuring that all inputs into memory pass through a register. Two small blocks (mem_data_from_reg and mem_data_to_reg) were implemented in Stage 2 and Stage 3, respectively, to ensure that data memory accesses are appropriately shifted for reading/writing half-words or bytes; the input to the memory controller therefore arrives already adjusted to be correct.

H. Cycle Counter

The cycle counter is a separate block that stores the number of clock cycles that pass as well as the number of valid instructions that were executed. The cycle counter increments every clock cycle, while the instruction counter increments only when the inst_ena signal is high. These counts can be used to calculate the CPI of the CPU.

I. FIFO

Two types of FIFOs were implemented for this project: a synchronous FIFO and an asynchronous FIFO.

The synchronous FIFO is implemented with a read pointer and a write pointer that are incremented each time a read or write is performed, respectively. When the two pointers match following a read, the FIFO's empty signal goes high; when they match following a write, the full signal goes high. This block was used to buffer inputs/outputs for the GPIO buttons.

The asynchronous FIFO is used to buffer signals that cross separate clock domains, for example the AC97 microphone and headphone interfaces. Gray-coded pointers and synchronizers were used for signals that cross clock domains, to ensure that only one bit changes at a time. An additional bit (also Gray-coded) is used as a wrap-around bit to determine whether the FIFO is full or empty. To evaluate the wrap-around bit, a Gray-code to binary converter is used to compare the read and write pointers; this converter works for up to 8 bits (an address depth of 2^7). On a write, the FIFO is marked full by checking that a write fired and then comparing the converted (binary) addresses: if the addresses match but the wrap-around bits differ, the full bit is set. Similarly, on a read, the empty bit is set if a read occurred and both the wrap-around bits and the addresses match. A sketch of the Gray-code comparison appears below.
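The sketch below shows the Gray-code to binary conversion and the full/empty comparison described above. The module name, parameterization, and port names are assumptions for the sketch.

    // Illustrative Gray-to-binary conversion and full/empty check for the
    // asynchronous FIFO pointers (W = address bits + 1 wrap-around bit).
    module gray_compare #(parameter W = 4) (
        input  [W-1:0] wr_ptr_gray,   // synchronized write pointer, Gray-coded
        input  [W-1:0] rd_ptr_gray,   // synchronized read pointer, Gray-coded
        output         full,
        output         empty
    );
        // Gray -> binary: each binary bit is the XOR of all higher Gray bits.
        function [W-1:0] gray2bin(input [W-1:0] g);
            integer i;
            begin
                gray2bin[W-1] = g[W-1];
                for (i = W-2; i >= 0; i = i - 1)
                    gray2bin[i] = gray2bin[i+1] ^ g[i];
            end
        endfunction

        wire [W-1:0] wr_bin = gray2bin(wr_ptr_gray);
        wire [W-1:0] rd_bin = gray2bin(rd_ptr_gray);

        // Same address and same wrap-around bit -> empty;
        // same address but different wrap-around bit -> full.
        assign empty = (wr_bin == rd_bin);
        assign full  = (wr_bin[W-1] != rd_bin[W-1]) &&
                       (wr_bin[W-2:0] == rd_bin[W-2:0]);
    endmodule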

J. AC97 Controller

The AC97_controller block implements an interface between the CPU and the AC97 codec. It takes in serial data from the AC97 codec, as well as data written by the CPU to an audio sample FIFO, and connects these to the speaker output and the microphone input. Since both the speaker and the extra-credit microphone were implemented, the AC97 controller writes out to several different registers via Slot 1 in a repeating loop, setting the microphone volume, the record select register, and the record gain register, as well as the speaker registers from Lab 6.

When the CPU wants to send data to the codec, it writes the samples to memory. The memory controller decodes the memory mapping and routes the samples to the AC97 FIFO, and the data is written to the FIFO as long as the FIFO is not full. The AC97 controller reads values from the FIFO as long as the FIFO is not empty and sends them to the codec as required. In other words, the AC97 controller sets all the relevant registers and then sends out the speaker data via Slot 3 and Slot 4.

For the graduate student requirement, the microphone input from the AC97 codec was also implemented. When the AC97 codec is sending data to the CPU (from the microphone), values are first shifted in over the serial data interface from the codec. The controller reads the bits in Slot 0 to ensure that the frame is valid and that the Slot 3 input is ready. It then shifts in the 20-bit value in Slot 3 and sends it out to the microphone FIFO.

Part IV: Status and Results

Currently, we have implemented all parts of the project, as well as the AC97 microphone and a branch predictor. Our design runs at up to 75 MHz, with all components fully functional.

The design area usage was 2341 slice LUTs and 1030 slice registers. This grew over time from our initial design, since we added several units to reduce the critical path by parallelizing certain operations.

To optimize the design's performance, a number of improvements over our original design were implemented, and experimental data was gathered on clock frequency and CPI. For example, we modified some logic on our timing-critical path to improve the maximum clock frequency, and we implemented two flavors of branch prediction. The trade-offs of each design change are described below.

Design                                 CPI for mmult   Min Clock Period (ns)   Max Clock Freq (MHz)   Time to Complete (s)
Base design                            1.105           19.939                  50                     1.3835
ALU - changed forwarding order         1.105           17.69                   50                     1.3835
Moved branch logic to Stage 3          1.21            16.63                   60                     1.2621
Reduced ALU fanout/added more units    1.21            13.296                  75                     1.009
Global branch predictor                1.10            13.297                  75                     0.9187
Branch history table                   1.10            13.303                  75                     0.9186

1. Base Design

Our base design was unable to run at a clock frequency beyond 50 MHz because of a long critical path in which data is forwarded from the memory output, muxed in with the data from the Instruction Fetch stage, and then sent into the Execute stage through the ALU. While the base design optimized CPI by maximizing data forwarding and minimizing the number of kill signals required, overall performance suffered from this long critical path.

2. ALU Forwarding Change

We then attempted to reduce the critical path without increasing CPI. We noticed that the late arrival of the forwarded signal was unnecessarily limiting the clock speed by forcing other signals to wait. To remedy this, we changed the order of the data forwarding logic so that the forwarded value is used as late as possible, relaxing its setup requirement slightly. While we were still unable to close timing with a 60 MHz CPU clock, this change cut more than 2 ns from the critical path. A sketch of the idea is shown below.
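One way to realize "use the forwarded value as late as possible" is to reorder the operand-select muxes so that the late-arriving forwarded data passes through only the final mux before the ALU. The sketch below illustrates the idea; the signal names are assumptions, and the two forms are only equivalent when fwd_ex is asserted solely for instructions that actually use the register operand.

    // Illustrative mux reordering for the ALU's second operand.
    module fwd_mux_order (
        input  [31:0] imm, reg_out2, fwd_data,
        input         use_imm, fwd_ex,
        output [31:0] alu_in2_before, alu_in2_after
    );
        // Before: the late-arriving forwarded value also passes through the
        // imm/register select, adding a mux level to the critical path.
        assign alu_in2_before = use_imm ? imm : (fwd_ex ? fwd_data : reg_out2);

        // After: non-critical operands are selected first; the forwarded value
        // is muxed in last and sees only one mux before the ALU.
        assign alu_in2_after  = fwd_ex ? fwd_data : (use_imm ? imm : reg_out2);
    endmodule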

3. Moving Branches to Stage 3

Another critical path that now became important was the path from the ALU through the branch logic. To further increase clock frequency, we moved the branch logic into the third stage; this implementation is discussed in Section III.B. The CPI degradation from this change came from the fact that after a branch or JALR instruction, the following two instructions were killed instead of just one, meaning that every mispredicted branch added an additional cycle. The benefit is that the data forwarding path no longer needed to go through the branch logic and into instruction memory, since a register now breaks up this path. With these changes the clock frequency was increased to 60 MHz, which more than offset the increase in CPI. We found that the critical path still went through the ALU, notably into the address and write-enable bits of the memory.

4. Reducing ALU Fanout/Adding Additional Arithmetic Units

One reason the ALU was part of the critical path is that the ALU inputs drive a large number of computational units, each of which drives multiple outputs (at least in the old version of the design); this multiplicative fanout increased the delay of the critical path. Another reason is that memory data can be forwarded into the ALU, whose outputs drive the address and write-enable bits. By adding a specialized arithmetic unit for branch comparison, called branch_compare, we reduced the fanout by decreasing the number of outputs driven by the ALU, and the ALU output now feeds only into an aluout register. We also added units to calculate the memory address and the write-enable bits. Since the memory address is always IF_imm + aluin1 (forwarded), we removed this calculation from the ALU, further reducing its fanout. An even more substantial improvement came from removing aluout as the path for the write-enable bits: since only the 2 least-significant bits carry information, we do not need the full ALU output and can instead use a simple 2-bit add, addr[1:0] + offset[1:0], to compute these bits. The combination of all these changes boosted our achievable clock speed to 75 MHz.

5. Global Branch Prediction

After increasing the clock speed substantially, we sought to recover some of the lost CPI by adding branch prediction. Our initial thought was to use a 2-bit saturating counter for global branch prediction, exploiting temporal locality in the sense that if we had recently taken a lot of branches, we are likely to take the next branch, and vice versa. With that, we were able to reduce CPI from 1.21 to 1.1, a reduction of more than 7 million cycles.

6. Branch History Table

We then implemented a branch history table, using pc[6:2] to index into 32 entries. When the mmult test program was run, we found only an 8-cycle reduction in execution time out of 60+ million instructions compared with the global branch predictor, so this change was insignificant. The trade-off, however, was an approximately 10% increase in area, making the global branch predictor the better option overall.

Part V: Concluding Remarks

The design of a CPU on an FPGA proved to be challenging and exciting: we had to optimize the trade-off between the clock speed and the cycles per instruction of the CPU. In addition, we found that standard interfaces and codecs are extremely useful for connecting computing components, and we were able to implement a variety of features for our CPU, including a branch predictor and communication via the UART as well as the AC97 codec. To successfully complete the project, it was important to work as a team to think through challenges and debug, and to keep good project timelines and organization. Some aspects that could have been improved include keeping our documentation updated throughout the project, instead of going back and re-reading code to understand how each component functioned. It may also have been possible to further increase the speed of the processor, at the cost of some CPI, by removing some of the hazard control logic or by moving to a 4- or 5-stage pipeline. In the end, we achieved a fully functional 3-stage CPU operating at 75 MHz on the Virtex-6 FPGA, with a CPI of around 1.1 on the mmult test program.