LAB 9 The Performance of MIPS

Similar documents
LAB 9 The Performance of MIPS

LAB 9 The Performance of MIPS

LAB 5 Implementing an ALU

Lecture 8: Addition, Multiplication & Division

LAB 6 Testing the ALU

Tailoring the 32-Bit ALU to MIPS

Math in MIPS. Subtracting a binary number from another binary number also bears an uncanny resemblance to the way it s done in decimal.

LAB 7 Writing Assembly Code

Two-Level CLA for 4-bit Adder. Two-Level CLA for 4-bit Adder. Two-Level CLA for 16-bit Adder. A Closer Look at CLA Delay

Outline. EEL-4713 Computer Architecture Multipliers and shifters. Deriving requirements of ALU. MIPS arithmetic instructions

NUMBER OPERATIONS. Mahdi Nazm Bojnordi. CS/ECE 3810: Computer Organization. Assistant Professor School of Computing University of Utah

Laboratory Exercise 3 Comparative Analysis of Hardware and Emulation Forms of Signed 32-Bit Multiplication

Arithmetic for Computers

Homework 3. Assigned on 02/15 Due time: midnight on 02/21 (1 WEEK only!) B.2 B.11 B.14 (hint: use multiplexors) CSCI 402: Computer Architectures

Programming the processor

Overview. Introduction to the MIPS ISA. MIPS ISA Overview. Overview (2)

EEM 486: Computer Architecture. Lecture 2. MIPS Instruction Set Architecture

Review of Last lecture. Review ALU Design. Designing a Multiplier Shifter Design Review. Booth s algorithm. Today s Outline

LAB B Translating Code to Assembly

Computer Systems and -architecture

Lecture Topics. Announcements. Today: Integer Arithmetic (P&H ) Next: The MIPS ISA (P&H ) Consulting hours. Milestone #1 (due 1/26)

Review: MIPS Organization

Number Systems and Their Representations

M2 Instruction Set Architecture

Final Project: MIPS-like Microprocessor

Data paths for MIPS instructions

Assembly Programming

COMP MIPS instructions 2 Feb. 8, f = g + h i;

ECEN 651: Microprogrammed Control of Digital Systems Department of Electrical and Computer Engineering Texas A&M University

Signed Multiplication Multiply the positives Negate result if signs of operand are different

ECE 473 Computer Architecture and Organization Project: Design of a Five Stage Pipelined MIPS-like Processor Project Team TWO Objectives

ECE260: Fundamentals of Computer Engineering

CPS 104 Computer Organization and Programming

Number Systems and Computer Arithmetic

Review. Lecture #9 MIPS Logical & Shift Ops, and Instruction Representation I Logical Operators (1/3) Bitwise Operations

EECS150 Lab Lecture 2 Design Verification

Integer Multiplication and Division

4. What is the average CPI of a 1.4 GHz machine that executes 12.5 million instructions in 12 seconds?

Computer Systems and -architecture

CSE/ESE 260M Introduction to Digital Logic and Computer Design. Lab 3 Supplement

ECE331: Hardware Organization and Design

Integer Multiplication and Division

IA Digital Electronics - Supervision I

ECE260: Fundamentals of Computer Engineering

Week 10: Assembly Programming

Design of Digital Circuits 2017 Srdjan Capkun Onur Mutlu (Guest starring: Frank K. Gürkaynak and Aanjhan Ranganathan)

EE260: Logic Design, Spring n Integer multiplication. n Booth s algorithm. n Integer division. n Restoring, non-restoring

Lab Assignment 1. Developing and Using Testbenches

The Processor: Datapath and Control. Jin-Soo Kim Computer Systems Laboratory Sungkyunkwan University

ECE Exam I February 19 th, :00 pm 4:25pm

Chapter 4. The Processor. Instruction count Determined by ISA and compiler. We will examine two MIPS implementations

The assignments will help you learn Verilog as a Hardware Description Language and how hardware circuits can be developed using Verilog and FPGA.

Recommended Design Techniques for ECE241 Project Franjo Plavec Department of Electrical and Computer Engineering University of Toronto

IMPLEMENTATION MICROPROCCESSOR MIPS IN VHDL LAZARIDIS DIMITRIS.

COMP 303 Computer Architecture Lecture 6

Major and Minor States

Midterm. Sticker winners: if you got >= 50 / 67

CS61C Machine Structures. Lecture 13 - MIPS Instruction Representation I. 9/26/2007 John Wawrzynek. www-inst.eecs.berkeley.

Introduction. Overview. Top-level module. EE108a Lab 3: Bike light

ECE 15B Computer Organization Spring 2010

Lecture Topics. Announcements. Today: Integer Arithmetic (P&H ) Next: continued. Consulting hours. Introduction to Sim. Milestone #1 (due 1/26)

Chapter 4. The Processor

Elec 326: Digital Logic Design

Number Systems Using and Converting Between Decimal, Binary, Octal and Hexadecimal Number Systems

TSEA44 - Design for FPGAs

Data Representation Type of Data Representation Integers Bits Unsigned 2 s Comp Excess 7 Excess 8

Computer Architecture Chapter 3. Fall 2005 Department of Computer Science Kent State University

ECE 2035 Programming HW/SW Systems Fall problems, 6 pages Exam One 22 September Your Name (please print clearly) Signed.

Cpr E 281 FINAL PROJECT ELECTRICAL AND COMPUTER ENGINEERING IOWA STATE UNIVERSITY. FINAL Project. Objectives. Project Selection

ECE/CS 3710 Computer Design Lab Lab 2 Mini-MIPS processor Controller modification, memory mapping, assembly code

Chapter 3 Arithmetic for Computers (Part 2)

COMPUTER ORGANIZATION AND DESIGN. 5 th Edition. The Hardware/Software Interface. Chapter 4. The Processor

Chapter 4. Instruction Execution. Introduction. CPU Overview. Multiplexers. Chapter 4 The Processor 1. The Processor.

TDT4255 Computer Design. Lecture 4. Magnus Jahre. TDT4255 Computer Design

COMPUTER ORGANIZATION AND DESIGN. 5 th Edition. The Hardware/Software Interface. Chapter 4. The Processor

CS61C : Machine Structures

Digital Design Review

DIGITAL ARITHMETIC: OPERATIONS AND CIRCUITS

E85: Digital Design and Computer Engineering Lab 2: FPGA Tools and Combinatorial Logic Design

Lecture 3: Modeling in VHDL. EE 3610 Digital Systems

Physics 364, Fall 2014, reading due your answers to by 11pm on Sunday

In this lecture, we will focus on two very important digital building blocks: counters which can either count events or keep time information, and

Lecture Topics. Announcements. Today: The MIPS ISA (P&H ) Next: continued. Milestone #1 (due 1/26) Milestone #2 (due 2/2)

CSc 256 Midterm 2 Fall 2011

Kernel Registers 0 1. Global Data Pointer. Stack Pointer. Frame Pointer. Return Address.

DC57 COMPUTER ORGANIZATION JUNE 2013

TEACHING COMPUTER ARCHITECTURE THROUGH DESIGN PRACTICE. Guoping Wang 1. INTRODUCTION

Lab 6 : Introduction to Verilog

The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL

CS 61C: Great Ideas in Computer Architecture MIPS Instruction Formats

EEC 483 Computer Organization

Chapter 4. The Processor

Verilog for High Performance

Lab 1: Introduction to Verilog HDL and the Xilinx ISE

5/17/2012. Recap from Last Time. CSE 2021: Computer Organization. The RISC Philosophy. Levels of Programming. Stored Program Computers

The Processor (1) Jinkyu Jeong Computer Systems Laboratory Sungkyunkwan University

Recap from Last Time. CSE 2021: Computer Organization. Levels of Programming. The RISC Philosophy 5/19/2011

CSc 256 Midterm 2 Spring 2012

M A S S A C H U S E T T S I N S T I T U T E O F T E C H N O L O G Y DEPARTMENT OF ELECTRICAL ENGINEERING AND COMPUTER SCIENCE

CS/EE 3710 Computer Architecture Lab Checkpoint #2 Datapath Infrastructure

Transcription:

LAB 9 The Performance of MIPS Goals Learn how the performance of the processor is determined. Improve the processor performance by adding new instructions. To Do Determine the speed of the processor in Lab 8. Introduction Determine the bottleneck in the calculation. Enhance the processor by implementing additional standard MIPS instructions. Determine the new performance of the processor. Follow the instructions. Paragraphs that have a gray background like the current paragraph denote descriptions that require you to do something. To complete the lab you must hand in a report. The last page of this document has been formatted so that it can be used for the report. Fill in the necessary parts and return the report to your assistant during the following exercise session. In the recent labs we have slowly built a small working microprocessor based on the 32-bit MIPS. The ALU was built during Lab 5 and allowed us to implement a subset of the original R-type instructions. In Lab 7, we used these instructions to write a small program that can add a range of consecutive integers. Finally, in Lab 8, we managed to put pieces of the MIPS together to get our first microprocessor. During this laboratory exercise we will be interested in improving the performance of the processor. Performance Before beginning, let us examine the current processor and find out how it can be improved. Refer to Lab 5 if you have forgotten how to read the reports in Xilinx ISE. We will be using the program from Lab 7 as a benchmark in this exercise. Calculate how long it will take us to calculate the result for different sizes of input parameters A and B. Assume that the system works at 20 MHz when answering the speed questions. Fill the table in Part 1. The problem of adding consecutive integers is a known one. To find the sum of all numbers from 0 to N one can use the simple formula: S = N *(N +1) 2 To calculate the sum of all integers from A to B, we can calculate this first for B and then for A. By subtracting these two we will get the desired sum. However, as our limited processor has neither a multiplication nor a division we were unable to make use of this trick and resorted to a brute force method that ended up summed up all numbers in the range. The problem with this method is that the execution time is proportional to the difference between the numbers A and B. Additional Instructions We can improve the performance of the program significantly if we could somehow use the Gauss formula. The first problem is the division: normally division is a complex operation. However, for our purposes we only need a simple division by two. From the class you should remember that dividing a binary number by 1

powers of two is very simple. We just need to shift the bits of the binary representation of the number to right. At the moment we do not have an instruction in our processor that can do this operation, but the original MIPS instruction set has an instruction named srl (shift right logical) for this purpose 1. We could use this instruction to do the final division by two. However we still need a way to multiply two numbers. There are two solutions. One of them is to implement the multu 2 instruction from the original MIPS instruction set. The other is to implement the sll (shift left logical) instruction that would be able to multiply by two. By shifting and conditionally adding the multiplicand, it is possible to implement a bit-serial multiplication with the sll, srl and add commands. You are free to try this approach, but we recommend using the first approach, implementing multu instruction. Multiplication Result Unfortunately, our problems do not stop here. While it will be very straightforward to implement the core function of multu instruction, the problem is that when you multiply two 32-bit numbers, the result is 64- bits. The MIPS architecture allows only one register write-back, and cannot support writing the multiplication result back into the standard registers. The solution is to use two additional 32-bit registers that are called hi and lo. These will contain the most significant 32-bit and least significant 32-bit of the multiplication respectively. Since these registers are not part of the standard register file, they can also not be accessed directly by other instructions. MIPS provides two additional instructions to move data from these registers: mfhi (move from hi) and mflo (move from lo) 3. If we decide to use the multiplier we will have to implement these functions as well. Determine the performance Before we start working on such a major change on our processor, we should first make sure that it is actually worth making the effort. For this exercise, we assume that the multiplication result can be expressed with less than 32bits 4. This means that the multiplication result will be small enough to be stored only in the Lo register. You have two options here choose the one that suits your challenge needs. Option 1 (challenging): Use the MARS simulator to write a faster version of the code that you have written in exercise 7. Make sure that the code is functional by testing it for smaller values 5. If you have problems with the MARS simulator or assembly, refer to exercise 7. After you have a functional code you can determine how many cycles are needed for your program to calculate the results given in Part 2 of the report. Option 2 (easy): Download and extract the Lab9_helpers.zip file from the course website. Look at the provided assembly code in helper_mul.asm to determine how many cycles will be needed for your 1 There is also an instruction called sra (shift right arithmetic) that seems similar, the difference is that in a logical shift zeroes are inserted from the left side, and in the arithmetic shift the sign (MSB) is preserved. Since the numbers in our example are all positive integers, sra and srl would have given the same result, but srl is easier to implement. 2 Similarly, there is a mult instruction where the operands can be signed. Since all numbers will be positive, mult and multu will deliver the same result, and multu is simpler to implement. 3 There are also mthi (move to hi) and mtlo (move to lo) instructions to store data to these registers, but we will not be needing this functionality 4 To use the full 64-bits we would need both Hi and Lo registers. This is not really much of a problem, but the shift operation would also be slightly more complex as we would have to add the bit that is shifted out of Hi register as the MSB of the Lo register. In order to make the exercise more manageable we will use this simplification and assume that the Lo register will contain the complete result. This would limit the maximum number that we can use to slightly more than 32 000. 5 Although it is tempting, do not use values larger than 100 for A and B. The simulation will just take a very long time, and will not give you additional information. 2

program to calculate the results given in Part 2 of the report. You could also try to run it with the MARS simulator. Please note that the simulator will accept any valid instructions, but our MIPS only supports a handful of instructions (the ones in Exercise 6 and srl, mflo and multu). Make sure that you only use the allowed instructions. Modify the processor This is the slightly harder part. Note that all three instructions are actually R-type instructions. If you remember, in the last exercise we had left the ALUControl signal to be 6-bits wide in the ControlUnit (although at that time we needed only 4 bits). Modify the ALU.v file from Lab 8 (or Lab 5), so that: 1. It accepts a 6-bit aluop signal. 2. It accepts a 5 bit ShAmt (shift amount) from the MIPS. 3. It performs the instruction that takes the B input and shifts the value by ShAmt bits to the right when aluop is 6 b000010 (srl). 4. It multiplies A and B and writes the result to an internal 32-bit register Lo when aluop is 6 b011001 (multu). Note that for the register you will need to add clock and reset signals to the interface as well. 5. It takes the present value of the register Lo and copies it to the output when aluop is 6 b010010 (mflo). Make sure that other instructions are not affected. Since we have changed the interface of the ALU (now the aluop connection is 6-bits instead of four, and we need additional clock and reset connections), we also have to modify the MIPS.v slightly to make sure that the modified ALU is connected correctly. Make sure that the new ALU is integrated correctly on our top level. This would require small changes on the instantiation within the MIPS.v. You need to extract the ShAmt signal (Shift Amount) from the instruction (Instr signal) to pass it to the ALU as well. Performance We have a new processor, which hopefully reduces the computation time for large numbers considerably. There is a cost associated with this improvement. There is a more complex operation in the ALU component of the processor (multiplication), which should theoretically also increase the length of the critical path. When you compare the two implementations, you realize that this is not really apparent. First of all, the synthesizer will make use of the built-in multipliers within the FPGA to implement the costly multiplier. These are substantially faster than building a multiplier using individual gates, so most probably you will not see the penalty in the timing. To keep the code simple, we use a very simple approach for the Instruction Memory. The program is defined as a constant look-up table. Since the program is embedded into the processor, the synthesizer is able to recognize which parts of the processor are not used, and is able to optimize those parts away. If you have a multiplier but do not have an instruction that uses it, there is no need to synthesize a multiplier. As a result the area numbers that you see also depend on the program you have loaded. This makes comparisons tricky. 3

Verification Here, we follow the example that we had in Lab 7, and make sure that the final result is written to a specific memory location 6 and wait until the processor is in a loop (PC equals to the end loop) and check the content of the memory location to see if the value is indeed correct. To do this you have again two options: Option 1 (challenging): Your assembly program needs to be copied into the textfile named insmem_h.txt. To do this, you can use the Memory Dump option within the MARS assembler. By selecting the memory Segment as.text, and Dump Format as Hexadecimal Text you will be able to generate the required file. All you have to do is use an editor and make sure that the file has exactly 64 lines. All lines after your real code will be filled with zeroes. If something is not clear, refer Lab 8 manual. Option 2 (easy): The Lab9_helpers.zip includes a helper_insmem_h.txt file that contains the binary program corresponding to the helper_mul.asm assembly program. Rename helper_insmem_h.txt file to insmem_h.txt and place it in the same folder as the project files. Remember that the InstructionMemory.v file reads the insmem_h.txt to initialize its contents. By changing this file, and recompiling your circuit you effectively reprogram your processor. Use the file MIPS_test.v file to test your processor as explained previously. It is a simplified version of the one used in lab 6 in which we no longer have to read in the expected responses and we will no longer check whether or not the results match our expectations. We simply need a clock generator, an initial reset signal and give the processor sufficient time to finish the calculation. Run the new testbench in ISim and monitor the value of the final memory location (where the result will be written) and the PC in the wave window. Note that this approach is a rather ad-hoc approach to verification, and will not help you much if the final result is not correct. You would need more effort in finding out why the circuit is not working as expected (since we do not have any intermediate expected results). Some Ideas Now that we have almost a full processor, here are some ideas you can try to implement (optional). - Use your files from the previous exercise to display the output of the mult on the 7-segment LED - Write a counter program in assembly, dump the instruction and data memory and display the counter on the 7-segment LED. - Write an assembly program to take inputs A and B from the switches and display the multiplied output on the 7-segment LED display. Final Words In this exercise we have added instructions to enhance the performance of our processor. In fact, rather than improving the original MIPS architecture we have just added some of the missing instructions to the processor. The addition comes at a cost, more instructions have to be decoded, and more hardware resources are required to perform the additional instructions. Processor designers often face this trade-off. Since we are talking about adding instructions, why don t we just add the instructions the way we like. We could also add one instruction (for example called gauss) that would take N, add one to it to generate (N+1), multiply these values, divide the result by two, and return the result in one single cycle. This is possible, but once we start adding non-standard instructions to MIPS, tools (for example compilers, debuggers like MARS) will no longer work with this modified architecture. These also have to be modified. This is additional overhead. In our case, the decision was actually pretty simple; we simply do not have the time to make modifications to the tools as well. 6 In Lab 7, you were asked to keep the result in $t2. 4

There is also the issue of how much we would gain by this gauss instruction. Adding a multiplication instruction reduced the calculation time tremendously (for larger N saving us millions of cycles). The gauss instruction would be able to reduce 4 instructions into a single one. The problem is that we would only need this instruction twice, so in total we could save only 6 cycles, overall not a very impressive gain for all our efforts. Clearly, this is not always the case. Instruction set extensions is an active field of research, and in some cases can offer significant gains with little overhead. This is the last exercise, so these will be the final words of this exercise series. First of all we hope that you have learned something from these exercises. In a short time (with some help) you have implemented a 32- bit processor, wrote programs for your own processor, and you have enhanced this processor to improve its performance. We hope that you enjoyed the exercises and have a better understanding about digital circuits and digital design in general. In the end, you can safely say that you have used professional digital design tools, and gained practical experience in designing actual circuits, which are not that far from what is needed in industrially relevant projects. 5

Digital Circuits Laboratory Report Lab 9: The Performance of MIPS Date Grade Group Number Names Assistant Part 1 For the following values of A and B, how many clock cycles are needed to execute your program from Lab 7? Assuming that we run the MIPS processor at 20 MHz, how much time (in seconds) would that take? Value of A Value of B Number of cycles Time in seconds 0 8 6 8 0 250 000 000 249 999 996 250 000 002 Part 2 With the modified MIPS architecture fill in the new values for the Table in Part 1. Value of A Value of B Number of cycles Time in seconds 0 8 6 8 0 250 000 000 249 999 996 250 000 002 Part 3 Compare the size of the two implementations (before and after the modifications). What differences do you see? Briefly comment on them. Hint: Look into the synthesis report. 6

Part 4 Attach the Verilog code for the modified ALU to the report. Part 5 Show an assistant your working MIPS code. Part 6 If you have any comments about the exercise please add them here: mistakes in the text, difficulty level of the exercise, anything that will help us improve it for the next time. 7