
CALIFORNIA STATE UNIVERSITY, NORTHRIDGE

A TOMASULO BASED MIPS SIMULATOR

A graduate project submitted in partial fulfillment of the requirements for the degree of Master of Science in Electrical Engineering

By Reza Azimi

May 2013

Signature Page

The graduate project of Reza Azimi is approved:

Ali Amini, Ph.D.              Date

Shahnam Mirzaei, Ph.D.        Date

Ramin Roosta, Ph.D., Chair    Date

California State University, Northridge

Acknowledgement

I would like to thank Dr. Shahnam Mirzaei for providing ideas to work on and Dr. Ramin Roosta for his guidance. I also sincerely thank my other committee member, Dr. Ali Amini, for his support. I am grateful to all of my project committee members for being great mentors and for their continuous guidance. Most importantly, I would like to thank my family for their endless support, unconditional love, and great care throughout my life.

Table of Contents

Signature Page
Acknowledgement
List of Figures
List of Tables
ABSTRACT
Chapter 1: Introduction
    1.1 RISC Versus CISC
    1.2 MIPS
    1.3 Design Environment
Chapter 2: Pipelining
    2.1 A Simple Implementation of RISC
    2.2 The Classic Five Stage Pipeline
    2.3 Pipeline Hazards
    2.4 Assume Branch Not Taken
    2.5 Data Forwarding
    2.6 MIPS Implementation
Chapter 3: Dynamic Scheduling
    Tomasulo's Algorithm
    Applications
Chapter 4: Design Implementation
    Pipelined MIPS
    Tomasulo
Chapter 5: Analysis and Conclusion
    Pipelined MIPS Simulation Example
    Tomasulo's Algorithm Simulation Example
    Conclusion
    Future Enhancement
References
Appendix: Source Codes

List of Figures

Figure 2-1 RISC Data Path
Figure 2-2 Forwarding Paths for the Given Example
Figure 2-3 The Implemented Pipelined MIPS Architecture
Figure 3-1 Tomasulo Based Processor Structure
Figure 4-1 Designed User Interface for Pipelined MIPS
Figure 4-2 Pipelined MIPS Simulator Flow Chart
Figure 4-3 Designed User Interface for Tomasulo's Algorithm Implementation
Figure 4-4 Tomasulo's Algorithm Simulator Flow Chart
Figure 5-1 Pipelined MIPS Simulation in the First Clock Cycle
Figure 5-2 Pipelined MIPS Simulation in the Fifth Clock Cycle
Figure 5-3 Pipelined MIPS Simulation in the Ninth and Tenth Clock Cycle
Figure 5-4 Pipelined MIPS Simulation in the Eleventh Clock Cycle
Figure 5-5 Pipelined MIPS Simulation in the Fifteenth and Sixteenth Clock Cycle
Figure 5-6 Pipelined MIPS Simulation in the Nineteenth Clock Cycle
Figure 5-7 Tomasulo's Algorithm Simulation in the First Clock Cycle
Figure 5-8 Tomasulo's Algorithm Simulation in the Second and Third Clock Cycle
Figure 5-9 Tomasulo's Algorithm Simulation in the Fourth Clock Cycle
Figure 5-10 Tomasulo's Algorithm Simulation in the Fifth and Sixth Clock Cycle
Figure 5-11 Tomasulo's Algorithm Simulation in the Seventh and Eighth Clock Cycle
Figure 5-12 Tomasulo's Algorithm Simulation in the Tenth Clock Cycle
Figure 5-13 Tomasulo's Algorithm Simulation in the Sixteenth Clock Cycle
Figure 5-14 Tomasulo's Algorithm Simulation in the 56th and 57th Clock Cycle

List of Tables

Table 1-1 Format I Instruction Fields
Table 1-2 Format R Instruction Fields
Table 4-1 Supported Instructions Information
Table 5-1 Pipelined MIPS Simulation Example
Table 5-2 Expected Result from the Given Example to the Pipelined MIPS Simulator

ABSTRACT

A Tomasulo Based MIPS Simulator

By Reza Azimi

Master of Science in Electrical Engineering

Since 1985, processor designers have used pipelining to overlap the execution of instructions and increase design speed. This overlap between instructions is called instruction-level parallelism (ILP), because the instructions are evaluated in parallel. There are two main approaches to exploiting ILP: the first uses hardware to detect and take advantage of the parallelism dynamically, while the second uses software to find parallelism statically at compile time. Processors that use the dynamic, hardware-based method, such as the Intel Pentium family, have been more successful in the market. Robert Tomasulo developed the Tomasulo algorithm for the floating-point unit of the IBM 360/91. It is a hardware-based algorithm that handles the problems of out-of-order execution. This project describes the design and implementation of a simulator for a Tomasulo based MIPS architecture processor, its fundamental ideas, and the techniques it requires. The principles of dynamic scheduling and Tomasulo's algorithm are explained. The main features of the simulator are defined, the important hazards that must be handled in the implementation are identified, and their solutions are described. The implemented simulator is presented with its core components and most important functions. The simulator was developed in C++ on the Qt framework.

Chapter 1: Introduction

1.1 RISC Versus CISC:

The goal of the Reduced Instruction Set Computer (RISC) architecture is to reduce the number and complexity of instructions in a processor. RISC designers use many techniques, such as caching, superscalarity, and pipelining, in their processor architectures. These techniques are described in the following paragraphs.

One instruction per cycle: The central idea of RISC is to issue one instruction per clock cycle. The most important feature of early RISC designs is that every instruction completes in a single clock cycle.

Fixed instruction length: Since one instruction has to issue per clock cycle, it is natural to use a fixed length for all instructions, usually one word with a word size of 32 bits. This word specifies everything we need to know about an instruction: what the operation is; if there are operands, where to find them; if there is a result, where to write it; and where to find the next instruction in the instruction memory.

Only load and store instructions access memory: Accessing memory has its own time penalty, so RISC designers place all operands in registers before operating on them. Two instructions access memory: load and store. This limitation minimizes traffic on the processor-memory bus and guarantees that all needed operands are available in the register file.

Simple addressing modes: Complicated addressing modes require more arithmetic to calculate the target address and thus need longer clock cycles. To speed up, RISC designers limit the ISA to two addressing modes: indexed addressing and register indirect, where the index can be found in a register or in the immediate constant of the instruction word. Using the immediate part of the instruction word carries no timing penalty, because the instruction width is fixed and the value is available when it is needed.

Fewer and simpler operations: Simpler operations require less arithmetic per clock cycle and therefore allow shorter clock cycles. Programmers can build any complex operation from a sequence of simpler instructions. Before adding an instruction to the ISA, we should ask: is the increase in architectural complexity and clock cycle time worth it? Only if we can answer this question with a confident yes should we add that operation to the ISA.

Delayed loads and branches: In the RISC ISA, load, store, and branch instructions need more than one clock cycle to execute. Loads and stores need time to access memory, and branches need time to fetch the instruction at the branch target address. The idea of delayed loads and branches is to allow other instructions to execute while the processor completes these instructions.

The Complex Instruction Set Computer (CISC) is not the result of a specific design ideology. Instead, it is the result of designers' efforts to pack more features, such as addressing modes and instruction types, into each instruction at a time when memory was expensive and memory access times were high; compact instruction encodings were intended to compensate. The CISC designers' goal was to do more work with each instruction in the instruction set architecture. They therefore had to support a large set of addressing modes and different numbers of operands in different instructions. That is why CISC instruction words differ in length and execution time.

From the first days of processor architecture design, architects have aimed to increase instruction throughput by overlapping the execution of more than one instruction in a clock cycle. The most common overlapping techniques are prefetching, pipelining, and superscalar operation. In the first processor generations, an instruction was fetched, then executed, and then the next instruction was fetched. In many current machines these operations are overlapped. Execution can be sped up considerably by prefetching, that is, fetching the next instruction or instructions into an instruction queue before the current instruction is complete. Prefetching can be considered a primitive form of pipelining. Put simply, pipelining means starting, or issuing, the next instruction before the completion of the currently executing one.

Pipelining is discussed fully in the next chapter. Superscalar operation refers to a processor that can issue more than one instruction simultaneously. If superscalar operation is used to its fullest extent, some instructions will execute out of order. For example, if two instructions issue simultaneously, the one with the shorter execution time will finish ahead of the other. We will see how to handle out-of-order execution in Chapter 3. Prefetching, pipelining, and superscalar techniques can be found in CISC processors, but because the instruction word is complex, its width is not fixed, and operand access depends on various complex addressing modes, it is hard to apply these techniques efficiently in CISC machines; as noted before, complex addressing lengthens the clock cycle. RISC architectures, on the other hand, were designed to use caching, prefetching, superscalarity, and pipelining efficiently, techniques that were already known in the CISC days. [1]

1.2 MIPS:

Microprocessor without Interlocked Pipeline Stages (MIPS) is a reduced instruction set computer (RISC) instruction set architecture (ISA) developed by MIPS Technologies (formerly MIPS Computer Systems Inc.). The most important strength of MIPS is its simple, easy-to-pipeline instruction set architecture. The term instruction set architecture refers to the actual programmer-visible instruction set. The ISA forms the boundary between hardware and software; it is the tool the compiler uses to translate software for the hardware. Nearly all ISAs today are classified as general-purpose register architectures, where operands are either registers or memory locations. The MIPS architecture has 32 general-purpose registers, 32 floating-point registers, and two registers named High and Low for multiplication and division. MIPS is a load-store machine, which can access memory only with load or store instructions. Addressing modes specify the address of a memory object and can also specify registers and constant operands. There are three addressing modes in MIPS: register; immediate, for constants; and displacement, where a constant offset is added to a register value to form the memory address. All MIPS instruction words are 32 bits long. As noted before, a fixed instruction word length simplifies instruction decoding and reduces the time needed for it. The general categories of MIPS operations are ALU instructions, load and store instructions, and branches and jumps. [2]

ALU instructions: These instructions take as operands either two registers or a register and a sign-extended immediate (in the latter case they are called ALU immediate instructions and have a 16-bit immediate in MIPS), operate on them, and store the result in a third register indicated in the instruction word. MIPS has both signed and unsigned forms of the arithmetic instructions; the unsigned forms do not generate overflow exceptions.

Load and store instructions: These instructions take a source register, called the base register, and an immediate field (16 bits in MIPS), called the offset, as operands from the instruction word. The content of the base register is added to the offset to form the effective address, which is used as a memory address. For a load, a second register operand acts as the destination for the data loaded from memory. For a store, the second register operand is the source of the data stored into memory.

Branch and jump instructions: Branches are conditional transfers of control. RISC architectures identify the branch condition in one of two ways: with condition bits in the instruction word, or with a limited set of comparisons between a pair of registers or between a register and zero. MIPS uses the second way. The branch destination is obtained by adding a sign-extended offset to the current program counter (PC). Jumps are unconditional branches, provided in many RISC architectures including MIPS [1].

There are three instruction formats in MIPS, named R, I, and J. All formats use the 6 most significant bits for the opcode. Format J, used for jump instructions, assigns its remaining 26 bits to the jump target address. Formats R and I are explained briefly below.

Format I: This format is used by the data transfer instructions (load and store), branch instructions, and immediate-format instructions, i.e., instructions that use immediate addressing. Table 1-1 shows the format I instruction fields.

Op      rs      rt      Address/immediate
6 bits  5 bits  5 bits  16 bits

Table 1-1 Format I Instruction Fields

For these instructions, rs is the source register and rt is the destination register.

Format R: This format is used by the arithmetic instructions. Table 1-2 shows the format R instruction fields.

Op      rs      rt      rd      shamt   Funct
6 bits  5 bits  5 bits  5 bits  5 bits  6 bits

Table 1-2 Format R Instruction Fields

rs and rt are source registers, rd is the destination register, shamt is the shift amount for shift operations, and funct selects the variant of the operation given in the Op field. Chapter 4 gives a complete description of each supported instruction, the values of its fields, and how each instruction format is decoded and executed.

1.3 Design Environment:

Qt is used to develop this project. Qt is an open-source development framework with helpful libraries and tools designed to simplify the creation of applications and user interfaces for different platforms. Qt is a cross-platform application framework widely used by C++ programmers for developing graphical user interface (GUI)

application software, and also for developing non-GUI programs such as command-line tools and servers. [4] Qt lets developers reuse code professionally, targeting multiple platforms with one code base. The modular C++ class library and developer tools enable developers to create an application for one platform and easily build and deploy it on others. The open-source nature of Qt matters to many developers and has made it very popular among them. [5] The simulator was developed in C++ and, naturally for that language, in an object-oriented programming style. The design is described in Chapter 4 and the simulation results in Chapter 5.
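The field layouts in Tables 1-1 and 1-2 can be illustrated with a few bit operations. The sketch below is ours, not code from the project's appendix; the struct and function names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: split a 32-bit MIPS word into format-R fields.
// Layout per Table 1-2: op(6) rs(5) rt(5) rd(5) shamt(5) funct(6).
struct RFields { uint32_t op, rs, rt, rd, shamt, funct; };

RFields decodeR(uint32_t word) {
    return {
        word >> 26,           // op: 6 most significant bits
        (word >> 21) & 0x1F,  // rs: first source register
        (word >> 16) & 0x1F,  // rt: second source register
        (word >> 11) & 0x1F,  // rd: destination register
        (word >> 6)  & 0x1F,  // shamt: shift amount
        word & 0x3F           // funct: variant of the operation
    };
}
```

For example, add $3,$1,$2 (op 0, funct 0x20) encodes as 0x00221820, and decodeR(0x00221820) yields rs = 1, rt = 2, rd = 3.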

Chapter 2: Pipelining

Pipelining is an implementation technique that overlaps multiple instructions in execution; it exploits the parallelism that exists among the actions needed to execute an instruction. Pipelining has long been the key technique for making fast processors. A pipeline is like a factory assembly line. In a bottled-water plant, the assembly line has stages, each performing a separate step of production, all operating in parallel on different bottles. In the same way, each pipeline step in a processor performs part of an instruction's execution, and all steps work in parallel on different instructions. Each of these steps is called a pipe segment or pipe stage. The stages are connected back to back, so the output of one stage is the input of the next. Like a bottle on the line, an instruction enters, each stage completes some part of it, and it exits the pipeline fully executed. In the bottling plant, the number of bottles produced per hour is the throughput, determined by how often a finished bottle leaves the line. Similarly, how often an instruction exits the pipeline determines the throughput of the instruction pipeline. Because the pipe stages are connected, all stages must be ready to operate at the same time, just as in a factory assembly line. The processor cycle is the time needed to move an instruction one step down the pipeline. Its length is set by the time required for the slowest pipe stage, since all stages advance together; just as in the assembly line, the slowest step determines the speed at which the stages can operate. In a pipelined processor, the processor cycle is usually one clock cycle.

The goal in designing a pipelined processor is to balance the length of each pipeline stage, just as the designer of an assembly line balances its steps. If the stages are perfectly balanced, then under ideal conditions the time per instruction in each stage equals the time per instruction on the unpipelined machine divided by the number of pipe stages, and the speedup gained from pipelining equals the number of pipe stages, just as an assembly line with n stages ideally produces bottles n times faster. In practice the stages cannot be perfectly balanced, and pipelining has some overhead, so the pipelined processor will not reach this ideal speed, though it will come close. Pipelining reduces the average execution time per instruction, not the execution time of an individual instruction. The reduction can be viewed as cutting the number of clock cycles per instruction (CPI), as reducing the clock cycle time, or as a mixture of both,

depending on what you take as your design baseline. Pipelining is usually viewed as reducing the CPI if the base design takes multiple clock cycles per instruction, and as reducing the clock cycle time if the base design takes one long clock cycle per instruction. In a sequence of instructions, pipelining exploits parallelism among the instructions. One of its big benefits is that it does not affect the software programmer's job, yet it delivers a considerable advantage in execution timing and speed that is hidden from the programmer [2].

2.1 A Simple Implementation of RISC:

To understand how a RISC instruction set can be pipelined, its implementation without pipelining must first be grasped. In a RISC machine, every instruction can be executed in at most 5 clock cycles. The 5 clock cycles are explained below.

Instruction fetch cycle (IF): Read the instruction to execute from the instruction memory at the address held in the program counter (PC). Since each instruction is 4 bytes, update the PC to its next sequential value by adding four to it.

Instruction decode and register fetch cycle (ID): Read from the register file the registers matching the source register specifiers, and decode the instruction. For a possible branch, perform the equality test on the source registers and, if needed, sign-extend the immediate field of the instruction. The possible branch target address is computed by adding the incremented PC to the sign-extended immediate field. In our design, branch instructions complete at the end of the ID stage; if the branch condition is true, the branch target address is written to the PC register. RISC architectures use fixed-field decoding: the register specifiers sit at fixed positions, which makes it feasible to decode the instruction and read the registers in parallel. Note that we may sometimes read an unused register, which does not help us but does not harm performance either; it does waste energy, however, so power-sensitive designs avoid it. The sign-extended immediate is also calculated during this cycle if needed, because the immediate likewise occupies a fixed field of the instruction.

Execution and effective address cycle (EX): Depending on the instruction type, the ALU operates on the operands prepared in the previous cycles, performing one of three functions.

Memory reference: For these instructions, the ALU adds the offset to the base register to obtain the effective address used by loads and stores.

Register-register ALU instruction: The ALU performs the operation specified by the ALU opcode (ALUOP in our design) on the two operands read from the register file in the previous stages.

Register-immediate ALU instruction: The ALU performs the operation specified by ALUOP on the operand read from the register file and the sign-extended immediate field of the instruction word.

Because no instruction in a load-store architecture needs to calculate a data address and operate on the data at the same time, the effective address calculation and execution cycles are combined into one clock cycle.

Memory access cycle (MEM): For a load instruction, the data memory is read at the effective address calculated in the previous stage. For a store instruction, the second register read from the register file is written to the effective address in data memory. Other instructions do nothing in this stage.

Write-back cycle (WB): The results coming from the data memory (load instructions) or the ALU (ALU instructions) are written into the register file in this stage [2].

2.2 The Classic Five Stage Pipeline:

The RISC processor described in the previous section can be pipelined with almost no changes: it just needs to start a new instruction on each clock cycle. Each cycle in the previous section becomes a pipe stage. Although it still takes 5 clock cycles to finish each instruction, a new instruction starts every clock cycle, with a portion of each of the five in-flight instructions executing in its own pipeline stage. Pipelining introduces new problems; to start, we have to establish what happens on every clock cycle of execution.

The first difficulty is that we must ensure the processor does not attempt to use the same data path resource for different operations in the same clock cycle. For example, a single ALU cannot compute an effective address and perform a subtraction in the same clock cycle. It is therefore the designer's duty to make sure that the overlapping of instructions in the pipeline does not create such conflicts (structural hazards). Fortunately, because of the simplicity of the RISC instruction set, resource evaluation is easy. Most functional units are used in different clock cycles, so overlapping the execution of multiple instructions introduces only a few structural hazards that we have to handle.
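The three EX-stage functions described in Section 2.1 can be sketched as a small dispatch. This is an illustrative sketch with made-up names, not the simulator's code, and ALUOP is fixed to addition for brevity.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical EX-stage sketch: select the ALU's inputs and job by
// instruction class, as described in Section 2.1.
enum class InstrClass { MemRef, RegReg, RegImm };

int32_t executeStage(InstrClass c, int32_t a, int32_t b, int32_t imm) {
    switch (c) {
        case InstrClass::MemRef:
            return a + imm;  // effective address = base register + offset
        case InstrClass::RegReg:
            return a + b;    // ALUOP fixed to "add" in this sketch
        case InstrClass::RegImm:
            return a + imm;  // operand 2 is the sign-extended immediate
    }
    return 0;
}
```

Note that the memory-reference and register-immediate cases compute the same sum here; they differ only in how the result is used (as an address versus as a value written back).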

First, we use separate memories for instructions and data; that is, the architecture has two memories, one for instructions and one for data. This separation removes a hazard that would otherwise arise between the instruction fetch stage and the memory access stage. The memory system of the pipelined processor, at the same clock frequency as the unpipelined one, must provide five times the bandwidth; this is the cost of higher performance that we must pay.

Second, the register file is used in two stages: in ID for reading operands and in WB for writing the result. These uses are separate, which is why the register file is read twice and written once in every clock cycle. To read and write the same register in one clock cycle, we use master-slave flip-flops, write the register on the rising edge of the clock, and read it on the falling edge.

Third, the PC must be incremented and stored in every clock cycle so that a new instruction is fetched every clock. This must be done during the IF stage. In addition, an adder is needed in the ID stage to calculate the possible branch target address.

Besides checking that pipelined instructions do not use the same hardware resources at the same time, we must make sure that instructions in different stages of the pipeline do not interfere with each other. This partitioning is done with pipeline registers between the stages: the result of each stage is stored into its pipeline register at the end of the clock cycle, and that register serves as the input of the next stage on the next clock cycle.

To sum up, the pipelining technique increases the number of instructions whose execution completes in a given period of time. Pipelining improves instruction throughput, but it cannot decrease the time needed to complete one instruction. Indeed, because the control unit of a pipelined processor is more complicated and time is needed to compute the control signals, the time to complete one instruction increases a little. Although no single instruction runs faster with pipelining, throughput increases, so the program runs faster and the total execution time is lower. A simple model of the pipelined RISC data path is shown in Figure 2-1, where IM stands for instruction memory, DM for data memory, and Reg for register file.

Figure 2-1 RISC Data Path
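The throughput argument above can be made concrete with a toy cost model. This is a sketch under the ideal, hazard-free assumption; the function names are ours, not the simulator's.

```cpp
#include <cassert>

// In an ideal five-stage pipeline, instruction i (0-based) enters IF in
// cycle i + 1 and leaves WB in cycle i + 5, so n instructions take
// (n - 1) + 5 cycles in total.
int totalCycles(int n, int stages = 5) {
    return n == 0 ? 0 : (n - 1) + stages;
}

// CPI approaches 1 for long programs even though each individual
// instruction still needs five cycles: throughput improves, latency
// of a single instruction does not.
double idealCpi(int n) {
    return static_cast<double>(totalCycles(n)) / n;
}
```

For instance, 100 instructions occupy 104 cycles, an effective CPI of 1.04, while a single instruction still takes 5 cycles from fetch to write-back.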

2.3 Pipeline Hazards:

In a pipeline, there are situations in which one instruction prevents the next instructions from executing. These situations are called hazards. By definition they cause longer execution times and thus lower performance compared with the ideal pipeline speed. There are three types of hazards:

1. Structural hazards: These result from resource conflicts in overlapped execution, when the hardware cannot support every possible combination of instructions at the same time.
2. Data hazards: These result from an instruction depending on the result of a previous instruction.
3. Control hazards: These result from branches and other instructions that change the PC.

Sometimes it is necessary to stall the pipeline because of a hazard. To clear a hazard, we often let some instructions continue through the stages while others are stopped. When an instruction is stalled, all instructions issued after it are also stalled, while instructions issued before it must proceed; otherwise the hazard would never clear and no new instruction could ever be fetched.

Structural Hazards: A pipelined processor needs pipelined functional units, and also duplicated functional unit resources, to let every possible combination of instructions in the pipeline overlap. If some combination of instructions cannot be accommodated because of resource conflicts, the processor has a structural hazard. The main cause of structural hazards is a functional unit that is not fully pipelined; a sequence of instructions using that unit then cannot proceed at one per clock cycle. Another cause is a resource that the architect did not duplicate enough to allow all combinations of instructions in the pipeline to execute. For example, a processor has a structural hazard if it has only one register file read port but sometimes needs to perform two reads in one clock cycle. When a sequence of instructions encounters a structural hazard, the pipeline stalls the affected instructions until the needed functional unit is free. Since a stall floats through the pipeline doing no useful work, it is usually called a pipeline bubble, or simply a bubble. Like any hazard, structural hazards increase the CPI above its ideal value of one.
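As a back-of-the-envelope illustration, and not a claim about this project's design, suppose instruction fetch and data access shared a single memory port. Each load or store would then steal the port from a fetch for one cycle, adding one bubble per memory access. The cost model below is our own sketch.

```cpp
#include <cassert>

// Hypothetical cost model: a hazard-free five-stage pipe finishes n
// instructions in n + 4 cycles; with one shared memory port, each
// load/store blocks one instruction fetch, adding one stall cycle.
int cyclesSharedPort(int n, int memOps) {
    const int fill = 4;        // pipeline fill time (stages - 1)
    return n + fill + memOps;  // one bubble per contested memory access
}
```

This is exactly why the classic five-stage design uses separate instruction and data memories, as described in the previous section.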

Data Hazards: The main consequence of pipelining is that it changes the relative timing of instruction execution because of its overlapping property. Overlapping causes data and control hazards. A data hazard occurs when the order of read and write accesses to operands is changed from the order seen in the instruction sequence. Consider the execution of these instructions:

OR $1,$2,$3
XOR $4,$1,$5
ADD $6,$1,$7
SUB $8,$1,$9
AND $10,$1,$11

Register one, the result of the first instruction (OR), is used by all the other instructions. As stated before, instructions, including OR, write their results in the WB stage. However, the XOR instruction reads the value of its operands during its ID stage, before OR has written register one. This problem is called a data hazard. If we do nothing about it, XOR will read a wrong and indeterminate value of register one. Such arbitrary behavior is not acceptable in processor design. The hazard also affects the ADD instruction: OR does not write register one until the end of the fifth clock cycle, so ADD, which needs the value in the fourth clock cycle, would catch the wrong result. The AND instruction reads register one in the sixth clock cycle, after OR's WB stage, and therefore executes correctly. The SUB instruction also executes correctly, but only because the write operation is performed first (on the clock's rising edge) and the read afterwards (on the falling edge) [2].

Control Hazards: Control hazards cause a bigger performance loss than data hazards in the RISC family of architectures. If a branch instruction changes the PC to its target address, the branch is called taken; if not, it is untaken (not taken).
For a taken branch, the PC is usually not changed until the end of the ID stage, when the address calculation and condition comparison are done. The simplest technique for dealing with branches and control hazards is, upon spotting a branch in the ID stage, to restart the fetch of the instruction after the branch. The first IF stage of that instruction does no useful work, so it should certainly be stalled. If the branch is untaken,

19 then stalling IF stage is redundant because the correct instruction was fetched. We will develop a technique to use this property in the next section. 2.4 Assume Branch Not Taken: The easiest way to handle branches and control hazards is to flush the pipeline. Flushing the pipeline means to hold to delete any instruction after the branch instruction until the branch target address is determined. This answer is appealing because it is straightforward for both hardware and software to implement. In this method the branch fine is fixed and software cannot do anything to reduce it. Assuming that the branch will not be taken is a frequent enhancement compare to stall the pipeline. In this method, processor keep execute the serial of instruction until it faces a taken branch instruction in ID stage, then the current fetched instruction in IF stage must be neglected and the execution continues at the branch target. Fifty percent of branch instructions are not taken or more in the loops, so it improves the cost of control hazards. This technique is called predicted untaken or predicted-not-taken in the classic five stage pipeline [3]. We continue fetch instruction the branch instruction like it is a normal instruction and pipeline looks it as it is nothing unusual and if the branch is taken in ID stage we need to turn the fetched instruction into a no op or bubble and restart the fetch the target address in the next clock cycle. 2.5 Data Forwarding: A basic hardware method called forwarding can unravel the explained problem in data hazard section. In the explained example of data hazard, the main point of data forwarding is the XOR instruction does not really need the result of OR instruction until after the OR actually produces it. If we move the result of XOR from the pipeline register that this instruction stores it to where the XOR instruction needs it to continue operating then we do not need the stall. 
Using this observation, data forwarding works as follows: we take the results from the MEM and EX stages and feed them to the ALU through a multiplexer. When we detect that a source register of the instruction in the EX stage is the same as the destination register of the instruction in MEM or WB, we take the operand from that stage instead of from the value read in ID. Detecting these hazards and producing the appropriate control signals for the ALU input multiplexer and the other components is the job of the forwarding control unit. Data forwarding is not limited to the inputs of the ALU; it is feasible to forward data from anywhere we produce it to anywhere we need it. We can forward results from the output of any pipeline register to the input of another unit, and the producing and consuming instructions need not be only one clock cycle apart: the distance can be two clock cycles.
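The forwarding-control decision described above can be sketched as a small selection function. This is an illustrative sketch only; all names (ForwardSrc, selectAluInput, and the parameters) are ours, not taken from the simulator's source code.

```cpp
// Sketch of the forwarding-control decision for one ALU source operand.
// It chooses where the operand comes from: the normal ID/EX pipeline
// register, or forwarded from the EX/MEM or MEM/WB pipeline registers.
enum class ForwardSrc { FromIdEx, FromExMem, FromMemWb };

ForwardSrc selectAluInput(int srcReg,
                          int exMemRd, bool exMemWritesReg,
                          int memWbRd, bool memWbWritesReg) {
    // The most recent producer must win, so check EX/MEM first.
    if (exMemWritesReg && exMemRd != 0 && exMemRd == srcReg)
        return ForwardSrc::FromExMem;   // one-clock-cycle distance
    if (memWbWritesReg && memWbRd != 0 && memWbRd == srcReg)
        return ForwardSrc::FromMemWb;   // two-clock-cycle distance
    return ForwardSrc::FromIdEx;        // value read in ID is already correct
}
```

The `!= 0` guard reflects that MIPS register $0 is hardwired to zero and never needs forwarding.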

Take, for example, the following sequence:

ADD $1,$2,$3
LW $4, 10($1)
SW $4, 20($1)

In this example we have to forward the values at the ALU and data memory outputs (their pipeline registers) to the ALU and memory inputs to resolve the hazards. In the fourth clock cycle, the value of register 1, computed by the ALU for the ADD instruction, is needed for the target address calculation of the load, so we pass it from the ALU output back to its input. In the next clock cycle we need the same value, which has still not been written to the register file, for the target address calculation of the store, so we forward it from where it is (now in the MEM stage) to the ALU inputs; we also need the value of register 4 from memory so that the store instruction can write it back to memory. Figure 2-2 shows all the forwarding paths for this example.

Figure 2-2 Forwarding Paths for the Given Example

Unfortunately, we cannot handle all potential data hazards with the data forwarding technique [2]. Consider the following sequence of instructions:

LW $1, 15($2)
ADD $3,$1,$2
OR $4,$1,$5

This example is different from the previous one. Let us first examine each instruction. The load does not depend on any other instruction, and it finishes in five clock cycles; it has its value at the end of the fourth clock cycle, after it reads the data memory. ADD needs registers 1 and 2 as its operands in the ID stage, in the third clock cycle; the latest moment at which ADD can receive its operands is the beginning of clock cycle four, but the load will not have the value of register 1 until the end of the fourth clock cycle. There is nothing forwarding can do about this: we must stall one clock cycle until the load reads the memory and can pass the value to the ALU for the ADD. The OR instruction needs the values of registers 1 and 5 as its operands; in the fourth clock cycle register 1 is still not available, but we can pass it to the ALU at the beginning of clock cycle five, which is the latest moment it is needed.

2.6 MIPS Implementation:

Every MIPS instruction can be executed in at most 5 clock cycles. The five clock cycles are as follows.

1. Instruction fetch stage (IF): In this clock cycle we read the PC, fetch the instruction from the instruction memory into the instruction register (IR), and then add four to the PC to address the next instruction. We use IR to keep the instruction word that is required in the following clock cycles, and we use NPC to hold the value of the next program counter.

2. Instruction decode and register read stage (ID): In this clock cycle we decode the instruction and read the registers from the register file (rs and rt are the source register specifiers). We write the outputs of the general purpose registers into two temporary registers (X3 and Y3 in our code) for use in the following clock cycles. If we have an immediate operation, the lower 16 bits of the IR are also sign extended to 32 bits and stored into the register Imm for use in the later clock cycles.
In the MIPS architecture, decoding and reading the registers happen in parallel. We have this opportunity because the instruction fields are at fixed locations in the MIPS instruction formats. As noted above, the immediate field of an instruction is located in the same place in every MIPS instruction format, so we also compute the sign extended immediate during this cycle in case it is required in the following cycles.

3. Execution and effective address calculation stage (EX): In this clock cycle we take the operands from the ID stage and, depending on the instruction type, perform the needed calculation in the ALU. Every MIPS instruction falls into one of three categories, and in each case the ALU's duty is as follows:

Memory reference: The operands are added together to form the effective address, and the result is placed into the register ALUOutput for the next clock cycles.

Register-Register ALU instruction: The ALU performs the operation specified by the function code on the values in registers X3 and Y3. Again the result is stored into ALUOutput.

Register-Immediate ALU instruction: The ALU performs the operation specified by the opcode on the values in registers X3 and Imm and stores its result in ALUOutput.

4. Memory access stage (MEM): We only need to access memory for load and store instructions, and this is the clock cycle in which we do it. If the instruction is a load, the memory sends the data and it is placed in the LMD (load memory data) register for the next cycle; if it is a store, we write the data from the designated register into memory. In both cases the address was computed in the previous clock cycles and is in ALUOutput.

5. Write back stage (WB): In this clock cycle we write the result into the register file, whether it was produced by the data memory system (for a load) or by the ALU; the destination register is in instruction field rt or rd, depending on the instruction opcode.

Both data and control values are carried through the pipeline stages by the pipeline registers. If a value is needed in a later stage, we place it in the pipeline registers so that it automatically travels through every stage.
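The inter-stage values named in this section (IR, NPC, X3, Y3, Imm, ALUOutput, LMD) travel in the pipeline registers. A minimal sketch of how they might be grouped is shown below; the value names follow the text, but the exact grouping and the helper function are our own illustration, not the simulator's actual layout.

```cpp
#include <cstdint>

// Illustrative grouping of the values carried by each pipeline register.
struct IFID  { uint32_t IR; uint32_t NPC; };                 // after fetch
struct IDEX  { uint32_t IR; uint32_t NPC;
               int32_t X3; int32_t Y3; int32_t Imm; };       // after decode/read
struct EXMEM { uint32_t IR; int32_t ALUOutput;
               int32_t Y3; };                                // Y3 kept for stores
struct MEMWB { uint32_t IR; int32_t ALUOutput;
               int32_t LMD; };                               // LMD: load memory data

// Sign-extend the lower 16 bits of an instruction word into Imm,
// as the ID stage does for immediate operations.
int32_t signExtendImm(uint32_t ir) {
    return static_cast<int32_t>(static_cast<int16_t>(ir & 0xFFFF));
}
```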
At each clock cycle, each stage of the pipeline holds only one active instruction, and any actions taken on behalf of that instruction occur between the two pipeline registers surrounding its stage. Therefore, the pipeline's behavior can be analyzed by observing what happens on the pipeline registers. Instruction issue is the process of letting an instruction move from the instruction decode stage (ID) into the execution stage (EX); an issued instruction is one that has completed this step. For our MIPS integer pipeline implementation, all of the data hazards explained above can be tested during the ID stage of the pipeline. If a data hazard cannot be resolved by forwarding, the pipeline stalls the instruction before issuing it. The data forwarding decisions are also made in the ID stage, and the corresponding control values are set there. Detecting interlocks early in the pipeline reduces the hardware complexity. When the control unit finds a hazard that data forwarding cannot solve, it inserts a stall into the pipeline to prevent the instructions in the IF and ID stages from advancing. As mentioned earlier, all of the needed control information travels through the pipeline registers. Therefore, when such a hazard is spotted, the control signals in the ID/EX pipeline register are set to zeros, which makes that stage act as a no-op, and the contents of the IF/ID register are fed back to itself to hold the stalled instruction. To implement forwarding, note that all forwarding logically occurs from the ALU or data memory output to the ALU input, the data memory input, or the branch condition unit. Forwarding is therefore implemented by comparing the destination registers designated by the IR in the EX/MEM and MEM/WB registers with the source registers of the IR in the ID/EX and EX/MEM registers. The branch instructions in MIPS (BNE and BEQ) test a register for equality with another register, and if we move the compare unit to the ID stage, the branch decision can be made by the end of that stage. To take advantage of an early decision on whether the branch is taken or not, we must compute the PC for both outcomes. Calculating the branch target address in ID requires an extra adder, since the ALU is in the EX stage. If we do so, branches cost only one clock cycle of stall.
Although this reduces the branch delay to one clock cycle, an ALU instruction followed by a branch that tests its result incurs a data hazard stall, and a load instruction followed by such a branch requires two clock cycles of stall. Figure 2-3 shows the pipelined MIPS implemented in this project [2].

Courtesy of Computer Organization and Design, 4th edition by David A. Patterson and John L. Hennessy

Figure 2-3 The Implemented Pipelined MIPS Architecture
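As a closing illustration of the hazard logic described in this chapter, the load-use check that forces the one-cycle stall (the LW/ADD case above, which even forwarding cannot cover) can be sketched as follows. The parameter names are our own, not the simulator's.

```cpp
// Sketch of the load-use hazard check performed in the ID stage:
// if the instruction in EX is a load whose destination (rt) matches one
// of the source registers of the instruction being decoded, one stall is
// required, because the memory data is not ready in time for forwarding.
// On a stall, the ID/EX control signals are zeroed (a no-op) and the
// IF/ID register is fed back to itself, as described in the text.
bool needLoadUseStall(bool idExIsLoad, int idExRt, int ifIdRs, int ifIdRt) {
    return idExIsLoad && idExRt != 0 &&
           (idExRt == ifIdRs || idExRt == ifIdRt);
}
```

For the sequence LW $1, 15($2) followed by ADD $3,$1,$2, the check fires because the load's destination ($1) is a source of the ADD.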

Chapter 3 : Dynamic Scheduling

Since 1985, processor designers have used pipelining to overlap the execution of instructions and increase design speed. This overlap between instructions is called instruction level parallelism (ILP), because the instructions are evaluated in parallel. There are two main approaches to exploiting ILP: the first uses hardware to discover and exploit the parallelism dynamically, and the second uses software to find the parallelism statically at compile time. Processors using the dynamic, hardware-based method, like the Intel Pentium family, have been more successful in the market. A statically scheduled processor fetches and issues instructions unless there is a data dependence between an instruction already in the pipeline and a newly fetched one that cannot be handled with data forwarding. If such a hazard cannot be resolved, the hazard detection hardware stalls the first instruction that uses the result; the processor waits until all hazards are cleared before fetching and issuing new instructions. Dynamic hardware scheduling instead rearranges the order of instruction execution to minimize the stalls needed to preserve the data flow. Dynamic scheduling brings several improvements. It handles cases in which the hazards are unknown at compile time, which makes the compiler simpler. Most significantly, performance improves because other instructions can execute while an instruction waits for its hazards to be resolved. Another advantage is that code compiled for one pipeline can run efficiently on a different pipeline architecture. A dynamically scheduled processor does not change the data flow; when there are dependencies, it tries to work around them.
In a different approach, static scheduling tries to minimize stalls by separating dependent instructions so that hazards are resolved at compile time. A program compiled with static scheduling can, of course, also run on a dynamically scheduled pipelined processor. As we saw before, the main problem of the simple pipelining procedure is that it issues and executes instructions strictly in order. If there is a dependence between two adjacent instructions in the pipeline, a hazard arises and leads to a stall, and while one instruction is stalled the pipeline processes no further instructions. If instruction j depends on a long running instruction k that is executing in the pipeline, then every instruction after j must stall until k produces its result and j can execute. This problem can be addressed with multiple functional units, which can remain idle when not needed.

Before going any further, let us define some terminology for data hazards. There are three types of data hazards, categorized by the order of the read and write operations in the instructions involved. Consider two instructions k and j, with k coming first and j after it. The data hazards are categorized as follows:

RAW (read after write): This hazard occurs when instruction j tries to read a register before k writes it, so j receives the old, incorrect value. This is the most common type of hazard, and it is caused by an operand dependence. To make sure j gets its correct operand from k, the two must execute in program order.

WAW (write after write): This hazard occurs when instruction j tries to write an operand after k writes it, but the writes actually happen in the wrong order, so in the end the value written by k remains in the register instead of the value written by j. This hazard is caused by a destination register dependence. WAW hazards exist only in pipelines that write results in more than one pipeline stage, or that allow an instruction to proceed while a previous instruction is stalled.

WAR (write after read): This hazard occurs when instruction j writes its destination register before k reads it, so k incorrectly gets the new value. WAR hazards do not occur in most statically scheduled pipelines, because all reads happen early (in the ID stage) and all writes happen late (in the WB stage). In pipelines where some instructions write their results early and other instructions read their operands late, WAR hazards become likely. The most common cause of WAR hazards is reordering the execution of instructions.

In the pipelined MIPS, data and structural hazards are examined in the decode stage (ID). If there is no hazard, or the hazard can be solved by forwarding, the instruction is issued to the next stage.
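The three definitions above can be restated as a small classification helper. This is a sketch with illustrative names: each instruction is reduced to its destination and source register numbers, register numbers are treated as plain identifiers (the floating point examples in this chapter use register 0 as an ordinary register), and only the first matching hazard type is reported even if several dependences exist between the pair.

```cpp
// Classify the dependence between two instructions k and j (k first).
enum class Hazard { None, RAW, WAR, WAW };

Hazard classify(int kDest, int kSrc1, int kSrc2,
                int jDest, int jSrc1, int jSrc2) {
    if (kDest == jSrc1 || kDest == jSrc2)
        return Hazard::RAW;   // j reads what k writes
    if (kDest == jDest)
        return Hazard::WAW;   // both write the same register
    if (jDest == kSrc1 || jDest == kSrc2)
        return Hazard::WAR;   // j writes what k still has to read
    return Hazard::None;
}
```

Applied to the MULT/SUB/ADD/DIV sequence discussed later in this chapter, MULT followed by SUB is RAW, SUB followed by ADD is WAR, and SUB followed by DIV is WAW.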
In dynamic scheduling, the issue process is divided into two parts: first check for structural hazards, then wait for data hazards to clear. Dynamic scheduling still uses in-order instruction issue, meaning instructions are issued in program order; however, an instruction begins execution as soon as its data operands are available, so instructions can execute out of order, which implies out-of-order completion. Out-of-order execution introduces the possibility of WAR and WAW hazards, which do not exist in our classic five-stage pipeline. Consider the MIPS instruction sequence below:

MULT $0,$2,$4
SUB $6,$0,$8
ADD $8,$10,$14
DIV $6,$10,$8

The SUB instruction is antidependent on the ADD instruction: if we perform the ADD before the SUB (which must wait on the MULT for its operand), we violate the antidependence, yielding a WAR hazard. We also have to avoid violating the output dependence (a WAW hazard): if the DIV writes register 6 before the SUB does, the wrong value remains in the register. As we will see in the next sections, both kinds of hazards can be resolved by a technique named register renaming. To allow out-of-order execution without these hazards, we split the ID stage of the classic five-stage pipeline into two stages, issue and read operands. In the issue stage we decode the instruction and check for possible structural hazards, and in the read operands stage, as its name suggests, we check for data hazards and, if there are none, read the operands for the instruction. The instruction fetch stage precedes the issue stage and reads instructions from instruction memory into a queue of pending instructions; our processor issues instructions from this queue. The EX stage follows the read operands stage, exactly as in the classic five-stage pipeline. Depending on the type of operation, instructions need different amounts of time to execute. In dynamic scheduling, the beginning of execution and its completion are separate events; between these two events, the instruction is in execution. Multiple instructions can be in the execution phase at the same time, and this is the key improvement over the classic five-stage pipeline; of course, executing multiple instructions at once requires multiple functional units. In a dynamically scheduled pipeline all instructions are issued in order, but they can stall or bypass each other in the read operands stage and therefore enter the execution phase out of order.
Because of this, and because operations have different latencies, completion is out of order as well [2].

3.1 Tomasulo's Algorithm

In 1967 IBM developed a method, sophisticated for its day, to handle out-of-order execution in the floating point unit of its 360/91; the algorithm was designed by Robert Tomasulo. Its key idea is to track the operands of instructions to determine when they become available, minimizing hazards. This technique is implemented in many contemporary processors in different forms, but the main spirit is the same: they all track instruction dependencies to allow execution as soon as operands are available, and they use register renaming to avoid WAW and WAR hazards [2]. In the paper written by Tomasulo [6], the objective is to break the execution functions into two independent execution areas, namely fixed-point and floating-point. The reason behind this division is to increase the speed of multiplication and addition operations by assigning only one kind of task to each unit. To obtain maximum overlap with minimal or no effort from either the programmer or the compiler, Tomasulo designed a scheme called the Common Data Bus (CDB). This scheme serves Tomasulo's objective using a hardware algorithm. To reduce the time the execution circuitry spends waiting for operands, reservation stations are allocated separately, and the execution circuitry services the reservation stations in the order in which they are filled. The CDB improves performance by providing a shorter effective cycle in the execution unit. We avoid RAW hazards by executing an instruction only when its operands are available. As noted before, WAW and WAR hazards arise from name dependences, so they can be handled by register renaming. Register renaming renames the destination registers of instructions, including those involved in a pending read or write from an earlier instruction, so that an instruction depending on an earlier value of an operand is not affected by the out-of-order write. For a better explanation of how WAR and WAW hazards are removed by register renaming, look at the example below.
Possible WAR and WAW hazards can be found in this example:

MULT $0,$2,$4
SUB $6,$0,$8
SW $6,0($1)
ADD $8,$10,$14
DIV $6,$10,$8

Here there is an antidependence between the SUB and the ADD instructions, and the SUB and the DIV have an output dependence, leading to two potential hazard situations: a WAR hazard, because the SUB needs register 8 (if the ADD writes it before the SUB reads it), and a WAW hazard between the SUB and the DIV (if the DIV writes register 6 before the SUB does). There are also RAW data dependences between the MULT and SUB, the ADD and DIV, and the SUB and SW.

Basically we just need two temporary registers, named for example T1 and T2. Using T1 and T2, we can rewrite this example without any name dependences as:

MULT $0,$2,$4
SUB T1,$0,$8
SW T1,0($1)
ADD T2,$10,$14
DIV $6,$10,T2

Notice that any subsequent use of register 8 must now use T2 instead. Register renaming can also be done with the help of the compiler, as here (in which case it is no longer dynamic), but finding every later use of register 8 in the rest of the program requires either a sophisticated compiler or hardware resources, because branches can occur between the code above and a later use of the register. One advantage of Tomasulo's algorithm is that it can handle renaming across branches as well [2]. In Tomasulo's algorithm, reservation stations perform the register renaming and hold the operands of instructions waiting to issue. The idea is simple: we fetch and buffer operands in the reservation stations as soon as they are ready, which removes the need to get the operands from the register file. A reservation station remembers which unit will supply each operand that a waiting instruction is still missing. Finally, if multiple writes to a register overlap, only the last one actually updates the register. When we issue an instruction, we rename the register designators of its pending operands to the names of the reservation stations that will produce them; this is how register renaming is provided. Because there can be more reservation stations than real registers, this technique completely removes name-dependence hazards, something that cannot be achieved with compiler techniques. Using reservation stations instead of reading all operands from a central register file gives us two other significant features. First, hazard detection and handling are distributed: the information held at each functional unit's reservation stations controls when an instruction can begin its execution.
Second, the reservation stations hold results and can deliver them to the functional units directly, instead of going through the register file. A common result bus carries this data and allows all units to be loaded with an operand at the same time; Tomasulo named this bus the common data bus, or CDB, in the IBM 360/91. With multiple functional units, more than one result bus is needed. Figure 3-1 shows the basic structure of a Tomasulo based processor. None of the execution control tables are shown, but the floating point unit and the load store unit are. Issued instructions are held in the reservation stations while they wait for their functional units. A reservation station holds either operand values that have already been computed or the names of the other reservation stations that will provide those values.

Courtesy of Computer Architecture: A Quantitative Approach, 2nd edition by David A. Patterson and John L. Hennessy

Figure 3-1 Tomasulo Based Processor Structure

Load buffers and store buffers are kept separate from the reservation stations only where necessary, since they behave much like them; the load and store buffers hold memory addresses and data. The floating point registers, functional units and store buffers are connected to each other by common data buses. Over the common data bus, results travel from the functional units and from memory to everywhere they are needed except the load buffers. Reservation station tag fields are used for control. Instructions are sent from instruction memory to the instruction queue and issued from there in first-in first-out order. The operation, its operands, and the information needed for detecting and resolving hazards all reside in the reservation stations. Load buffers hold the components of the effective address until it is computed, track loads that are waiting on memory, and hold the results of completed loads that are waiting for the CDB. Similarly, store buffers hold the components of the effective address until it is computed, and hold the destination memory address of data to be written until the memory unit is available. The results of the FP units and the load unit are put on the CDB to go to the floating point registers, the reservation stations and the store buffers. Floating point add and subtract operations are performed in the FP adder, and floating point multiply and divide in the FP multiplier unit. Before we explain the details of the reservation stations and the algorithm, let us look at the stages an instruction passes through. In Tomasulo's algorithm there are three stages, each of which can take a different number of clock cycles to finish.

1. Issue: In this stage we take the next instruction from the head of the instruction queue. The queue is maintained in FIFO (first-in first-out) order, both to keep the correct data flow and to provide in-order issue. If there is an empty matching reservation station, we issue the instruction to that station together with its operand values, if they are available in the register file. If an operand is not in the register file, we keep track of the functional unit that will generate it. If no matching reservation station is free, there is a structural hazard and we must stall instruction issue until a station becomes empty. This stage also performs the register renaming that eliminates WAW and WAR hazards.

2. Execute: In this stage we watch the common data bus for the operands that are not yet available. When an operand has been computed by its functional unit, it is placed into every reservation station waiting for it. Once all operands are available, the instruction starts executing in its functional unit. This is how RAW hazards are resolved: by waiting for all operands to be available before starting execution. More than one instruction can become ready for a functional unit in the same clock cycle, in which case one of them must be selected; independent functional units can begin executing different instructions in the same clock cycle.
Executing a load or store takes two clock cycles. In the first, we calculate the effective address once the base register is ready and save it in a buffer; in the second, we perform the memory operation. A store must additionally wait for the value to be stored before the memory operation can start, whereas a load can access memory as soon as the address is ready. To avoid hazards through the memory unit, the effective address calculations of loads and stores are kept in program order.

3. Write results: In this stage, when the result is ready and the execution of the instruction has completed, we write the result on the CDB, and the CDB delivers it to every reservation station waiting for it. For store instructions, once the address and the value to be stored are both ready, we write them as soon as the memory unit is not busy [2].
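The three stages above can be sketched around a reservation-station record. The field names Op, Qj, Qk, Vj, Vk and Busy follow the terminology used in this chapter, and Qi is the per-register tag; the surrounding logic is our own simplified illustration (single result per broadcast, integer values, no functional-unit timing), not the simulator's actual implementation.

```cpp
#include <array>
#include <cstdint>
#include <cstddef>

// Q == 0 means "value already available in the V field".
struct ReservationStation {
    bool Busy = false;
    int  Op = 0;             // operation to perform
    int  Qj = 0, Qk = 0;     // producing station numbers (0 = ready)
    int32_t Vj = 0, Vk = 0;  // operand values, valid when the Q field is 0
};

struct RegisterStatus { int Qi = 0; int32_t value = 0; };  // Qi = producing RS

// Issue: rename the destination register to this station and capture each
// operand either as a value (no producer pending) or as a station tag.
void issue(ReservationStation& rs, int stationNum, int op,
           RegisterStatus& src1, RegisterStatus& src2, RegisterStatus& dest) {
    rs.Busy = true; rs.Op = op;
    if (src1.Qi == 0) { rs.Vj = src1.value; rs.Qj = 0; } else rs.Qj = src1.Qi;
    if (src2.Qi == 0) { rs.Vk = src2.value; rs.Qk = 0; } else rs.Qk = src2.Qi;
    dest.Qi = stationNum;  // register renaming
}

// Execute may begin once both operands are present (RAW resolved).
bool readyToExecute(const ReservationStation& rs) {
    return rs.Busy && rs.Qj == 0 && rs.Qk == 0;
}

// Write results: broadcast on the CDB to every waiting station and register.
template <std::size_t N>
void broadcast(int producer, int32_t result,
               std::array<ReservationStation, N>& stations,
               std::array<RegisterStatus, 32>& regs) {
    for (auto& rs : stations) {
        if (rs.Qj == producer) { rs.Vj = result; rs.Qj = 0; }
        if (rs.Qk == producer) { rs.Vk = result; rs.Qk = 0; }
    }
    for (auto& r : regs)
        if (r.Qi == producer) { r.value = result; r.Qi = 0; }
}
```

Issuing MULT $0,$2,$4 to station 1 and then SUB $6,$0,$8 to station 2 leaves the SUB waiting with Qj = 1; when the MULT's result is broadcast with tag 1, the SUB captures the value and becomes ready, exactly as the stages above describe.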

To resolve hazards we define a data structure that detects and removes them. This data structure is attached to the reservation stations, the register file, and the load and store buffers, with small differences in the information kept in each. We also use these tags for register renaming: when we issue an instruction that is waiting for source operands, we refer to those operands by the number of the reservation station that will write the corresponding register. Because there are more reservation stations than actual registers, WAR and WAW hazards are handled by renaming onto the reservation stations. A value of zero, or an unused field, indicates that the operand is already available in the register file. We use the following terminology from [2] in our project:

Op: the operation to be performed.

Qj, Qk: the producing stations for the corresponding source operands. A nonzero value means we are waiting for that numbered reservation station; zero or empty means the source operand is already available in the V field, or is not needed.

Vj, Vk: the values of the source operands. For each operand, only one of the V field and the Q field is valid.

Busy: indicates whether the reservation station is occupied, used to prevent structural hazards.

Qi: a per-register field naming the reservation station that will deliver a result to this register. If Qi is blank, no currently active instruction is computing a value destined for this register, which means the register's value is the one in the register file.

Finally, here is how Tomasulo's algorithm works. When we issue an instruction, we set the Qi field of its destination register to the number of the reservation station or buffer used for the instruction. If its operands are in the registers, we read their values from the register file into the V fields.
Otherwise we record in the Q field the reservation station or buffer that will produce the operand, and wait until the CDB delivers it. An instruction waits until both of its operands are ready and its Q fields become empty. When an instruction has both operands, its execution can start, and when its execution has finished, the CDB writes its result to every buffer, register and reservation station whose Qj or Qk matches the completing reservation station's tag, and those Q fields are cleared to mark that the values have been received. Issued instructions that have received all their operands can start execution in the next cycle. All writes, whether to memory or to the register file, happen in the Write Result stage.

3.2 Applications

As an application of Tomasulo's algorithm other than ours, we can mention Roza's work [7]. Taking the dependability of systems into account leads to the use of SoCs, which offer a lower price than ASICs together with higher reliability than Commercial Off-The-Shelf (COTS) components. However, these systems demand instruction scheduling algorithms that are normally not accurate enough in the presence of faults. In that paper, Tomasulo's algorithm provides dynamic instruction scheduling through the application of Triple Modular Redundancy (TMR). An advantage of this Fault Tolerant Dynamic Instruction Scheduling (FTDIS) structure is that its fault tolerance adds no latency to the system. T. Arons and A. Pnueli [8] have verified the architecture of modern processors based on refinement, by proving the correctness of Tomasulo's algorithm for out-of-order execution. The motivation for the algorithm there is the ability to increase the number of instructions executed per cycle using dynamic scheduling. Their proof introduces a predicted value field for direct comparison of the Tomasulo system with their reference system model (SEQ). According to Lorenzo [9], evidence shows a doubling of the number of cores per semiconductor chip over a period of at least three years; through the application of Tomasulo's algorithm to his work, however, he obtained a performance gain exceeding this increase. His insistence on not pursuing thread parallelism through multicore architectures was due to its heavy dependence on software problem-solving and its lower hardware support; as a result, he focused on task parallelism, or function parallelism. In this method a network of hardware devices is associated with several processing cores.
Each processing core uses declare, provide and require operations to interact with its associated hardware unit. This setup gives the hardware a sequencing capability rather than relying on conventional synchronization techniques.

Chapter 4 : Design Implementation

4.1 Pipelined MIPS:

For the implementation we developed two major classes, described in this chapter. Each class has a header file, in which the methods and data fields are declared, and a body, in which these methods are defined. We also designed a graphical user interface for user convenience, described below.

Widget Class: The main duty of this class is to control the contents of the user interface. It gets its data from the Processor class, which will be discussed shortly. Figure 4-1 shows the different elements of the designed user interface.

Figure 4-1 Designed User Interface for Pipelined MIPS

As you can see, the user interface has three tables showing the register file, data memory and instruction memory values. Three buttons are provided: one runs the simulation from start to end, one steps it clock by clock, and a reset button resets the simulator. The Speed slider determines the speed of the simulation when the run button is used. The plain text section is where the report appears after each clock cycle of simulation. The line editors in the bottom left corner show the status of each pipeline stage, and above them you can find the values of PC and IR. Notice that PC always shows the address that will be fetched in the next clock cycle. Through the File menu in the top left corner, the instruction and data text files can be opened: the user puts the values to be loaded into each memory (data and instruction) in a text file and opens it in the program via this menu. There is also a save tab, which saves the reports to a text file for the user. The Widget class essentially implements what each element of the user interface does. In the Widget constructor we connect the buttons to their slots, initialize the slider to 5 on a 1-to-10 scale, create an object pointer of the Processor class (named MyRisk), call the File menu constructor, generate and initialize the tables in the user interface, and set the window size and title. The file menu is generated by a function named createmenus, which uses an object named filemenu of class QMenu. We have also defined three QAction objects for this class, named opendataact, openinstact and saveact. The characteristics of these three actions are defined in the createactions function, which assigns the opendata, openinst and save slots to the opendataact, openinstact and saveact actions respectively; as an example, the action opendataact calls the opendata slot. In the opendata slot, the data file is opened, one line is read, and that line is written into the data memory using the writedatamemory function of the Processor class; this continues until the end of the file is reached. In the Processor class you will see that each time the memory-write functions (for data and instruction) are called, the written values are passed to the Widget class via the MyWriteDataMemSignal and MyWriteInstMemSignal signals.
When these two signals are emitted, they call the updatedatatable and updateinsttable slots respectively, which update the memory tables in the user interface. UpdateDataTable gets the value and address written to the data memory via MyWriteDataMemSignal, converts the value to binary and writes it into four allocated cells of the table: our memory width is 1 byte (8 bits) and our word size is 32 bits, so every word needs 4 cells of the memory and of its table in the user interface. UpdateInstTable does exactly the same operation with the help of the MyWriteInstMemSignal signal. The UpdateRegTable slot likewise updates the register file table in the user interface with the help of the MyWriteRegSignal signal from the Processor class. Whenever a value in either memory or in the register file changes in the Processor class, the value and its address are passed by a signal to the Widget class so the user can see the change in the user interface.

The most important slot in the Widget class is the stepbuttonpushed slot. As its name says, this slot is called when the user presses the step button in the user interface. It calls the compute function of the Processor class, which we will explain shortly in the Processor class section.
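The byte-wide table update described above can be sketched in plain C++. The helper name and the big-endian cell ordering below are our illustrative assumptions, not taken from the simulator source:

```cpp
#include <array>
#include <cstdint>

// Split one 32-bit word into the four 8-bit cells used by the memory
// tables (memory width is 1 byte, word size is 32 bits, so each word
// occupies four consecutive cells). Most significant byte comes first.
std::array<uint8_t, 4> wordToCells(uint32_t word) {
    return {{ static_cast<uint8_t>(word >> 24),
              static_cast<uint8_t>(word >> 16),
              static_cast<uint8_t>(word >> 8),
              static_cast<uint8_t>(word) }};
}
```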

The clock cycle counter (clockcycle) then increments, and if the stop flag is true (i.e. we have reached the end of the program) a message ("You have reached to end of your program") pops up. The stop flag is set to true in the finishsignalarrived slot, which is called when the Processor class emits the MyStopSignal signal.

The updatelineedit function updates the values of PC, IR and the pipeline stage statuses in the user interface with the values it gets from the Processor class. If the status of a pipeline stage returned from the Processor class is -3, the line edit for that stage is left blank. If it is -4, No Op is displayed in its line edit. The value -2 means the stage is flushed, whereas -1 means the stage is stalled. Any other non-negative number is the address of the instruction on which the stage is operating. The updatereporttextedit function updates the report text edit of the user interface.

When the user clicks the reset button, the resetpushedclick slot is called. Inside this slot the resetmem and resetregs functions of the Processor class are called, clockcycle is set to zero and the user interface tables and line edits are cleared.

To run the simulator from start to end, we provide the run button. The runpushedclicked slot is called whenever the user presses it. This slot makes an object of class QTimer named timer. Whenever the timer emits its timeout signal, the clock slot is called; the clock slot is exactly like the stepbuttonpushed slot. So as long as the stop flag is false, the timer keeps calling the clock slot and thereby runs the simulation. When the stop flag is set to true, the clock slot kills the timer object, shows the end-of-program message and the simulation stops.

Processor Class: In this class we have thirteen important functions, which we describe below.

ResetRegs and ResetMem: Here is where we initialize all the MIPS registers and memory to zero.
These include the 32 32-bit general-purpose registers, the pipeline registers, and the high and low registers for multiplication and division. The status of each pipeline stage is initialized to -3 and the stall flags, branch-taken flag and flush flag are set to false.

writedatamemory and writeinstmemory: These two functions write to the data memory and the instruction memory respectively. Both get a value and an address as inputs and pass them, with the MyWriteDataMemSignal and MyWriteInstMemSignal signals, to the Widget class to update the user interface.

ReadDataMemory and ReadInstMemory: All processing operates on integers; they are converted to binary and hex only to display results to the user in the Widget class. However, our memory width is 8 bits, so stored values must be converted back to integers before being processed. This conversion is done in these two functions.
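The inverse conversion can be sketched similarly; again the helper name and the byte order are illustrative assumptions:

```cpp
#include <cstdint>

// Reassemble four 8-bit memory cells into the 32-bit integer that the
// pipeline stages actually operate on (the inverse of the word-to-cells
// split performed when memory is written).
uint32_t cellsToWord(uint8_t b0, uint8_t b1, uint8_t b2, uint8_t b3) {
    return (static_cast<uint32_t>(b0) << 24) |
           (static_cast<uint32_t>(b1) << 16) |
           (static_cast<uint32_t>(b2) << 8)  |
            static_cast<uint32_t>(b3);
}
```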

GetPC, GetIR and GetReg: These functions return the values of PC, IR and a register. Since these variables are private, the Widget class does not have direct access to them; we write these functions to pass their values to the user interface.

Compute: This is the function the Widget class calls each time the user presses the step button, or the clock slot calls when the timer emits its timeout signal. It runs the simulator for one clock cycle. In hardware the stages operate in parallel, but in software we run the code line by line. Since each stage of a pipelined processor works concurrently, to simulate this behavior in software we have to run the last stage first and work back up to the first stage. This way each stage works with the old data of its previous stage; otherwise the data of the first stage would fall through to the fifth stage in a single clock. So in compute we first call the WB function, which is the write back stage of MIPS, then MEM, EX and ID respectively, and at the end the IF function.

WB: This function implements the write back stage of pipelined MIPS. Except for the SW, BEQ, BNE, J and JR operations, we write the result coming from stage MEM into the register file and notify the Widget with the MyWriteRegSignal signal to update the user interface. In this stage, if the opcode is 63 then the stop operation has reached the last stage and the function emits the MyStopSignal signal to notify the Widget of the end of the simulation.

MEM: This function implements the memory access stage of the MIPS pipeline. If the opcode in this stage is 35 (load operation) we read the data memory at the address we got from the ALU in the EX stage. If it is 43 (store operation) we write the value of the store data register (MD4 in the program) to the data memory at the address we got from the ALU.
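The reverse stage ordering described under Compute can be illustrated with a minimal model (a toy five-slot pipeline, not the simulator's actual code): because WB is updated first and IF last, a fetched value needs five calls to clock() to reach the WB slot, exactly one stage per cycle.

```cpp
#include <array>

// Toy model of compute(): five stage latches, updated from the last
// stage back to the first, so every stage consumes the value its
// predecessor held at the end of the PREVIOUS clock cycle.
struct MiniPipeline {
    std::array<int, 5> latch{};   // [0]=IF, [1]=ID, [2]=EX, [3]=MEM, [4]=WB
    void clock(int fetched) {
        for (int s = 4; s >= 1; --s)   // WB first, then MEM, EX, ID
            latch[s] = latch[s - 1];
        latch[0] = fetched;            // IF runs last
    }
};
```

Updating in the opposite (IF-first) order would instead let a fetched value fall through all five latches in a single call, which is exactly the behavior compute avoids.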
Like every other stage except WB, we pass the old values of this stage's opcode, funct and destination register to the next stage (WB) for use in the next clock.

EX: This function implements the execution stage of pipelined MIPS. First we pass the old values of the opcode, funct and destination register to the next stage, MEM. The value of MD3 is also passed to MD4 (the MDs are the registers in which the store data value is held). Then, based on the opcode and the funct, we produce the appropriate ALU opcode and pass it with the operands to the ALU to get the result. X3 and Y3 are the operands, which get their values from the ID stage.

ID: This function implements the instruction decode stage of pipelined MIPS. It is the most important function in the Processor class, since data forwarding and stalls are generated here. If this stage was stalled in the last clock cycle then we pass a no op on to the next stage (EX); otherwise we give the EX stage the old values of this stage. In this stage we decode the instruction register, which we get from IF, into the opcode,

operands, destination register, the shift amount and funct. We also sign extend the 16-bit immediate address field to 32 bits. Then we decode the opcode and funct to read the operand registers into X3, Y3 and MD3. Based on the operation, the next program counter is evaluated and the decisions on stalls and data forwarding are taken if needed. If a stall is needed, the stall flags for stages IF and ID are set to true and a no op is passed to the EX stage. In the case of branch operations, if a branch has to be taken the BranchTaken flag is set to true, which tells the IF stage to flush the fetched instruction.

In Table 4-1 on the next page our project instructions are summarized, showing what each instruction's operands are (X3 and Y3) and where the result is saved (destination register). MD3 is the register in which we put the store instruction's data, and imadd means the immediate address field of the instruction; it receives rt in the store operation. All values are in decimal. (Refer to chapter 1 for a review of rd, rt, rs, funct, opcode, etc.) Empty cells indicate that the processor does not care about those values. The ALU does the AND operation for ALUOP equal to zero, the OR operation for one, add for two, shift left for three, shift right for four, subtract for six, signed compare for seven and unsigned compare for ALUOP equal to eight.
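The ALUOP encoding just described, together with the sign extension done in ID, can be sketched as follows (the function names are ours and overflow handling is omitted):

```cpp
#include <cstdint>

// Sign-extend the 16-bit immediate address field to 32 bits (ID stage).
int32_t signExtend16(uint16_t imm) {
    return static_cast<int16_t>(imm);
}

// ALU following the report's encoding: 0 AND, 1 OR, 2 add, 3 shift
// left, 4 shift right, 6 subtract, 7 signed compare, 8 unsigned compare.
int32_t alu(int aluop, int32_t x, int32_t y) {
    switch (aluop) {
        case 0: return x & y;
        case 1: return x | y;
        case 2: return x + y;
        case 3: return x << y;
        case 4: return static_cast<int32_t>(static_cast<uint32_t>(x) >> y);
        case 6: return x - y;
        case 7: return x < y ? 1 : 0;
        case 8: return static_cast<uint32_t>(x) < static_cast<uint32_t>(y) ? 1 : 0;
        default: return 0;   // 5 is unused in the report's encoding
    }
}
```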

Name   Instruction          opcode  funct  ALUOP  X3     Y3     ALURd
ADD    add                  0       32     2      rs     rt     rd
SUB    subtract             0       34     6      rs     rt     rd
ADDI   add immediate        8              2      rs     imadd  rt
ADDU   add unsigned         0       33     2      rs     rt     rd
SUBU   subtract unsigned    0       35     6      rs     rt     rd
ADDIU  add imm. unsigned    9              2      rs     imadd  rt
MULT   multiply             0       24            rs     rt     H/L
MULTU  multiply unsigned    0       25            rs     rt     H/L
DIV    divide               0       26            rs     rt     H/L
DIVU   divide unsigned      0       27            rs     rt     H/L
MFHI   move from Hi         0       16            H      0      rd
MFLO   move from Lo         0       18            L      0      rd
AND    and                  0       36     0      rs     rt     rd
OR     or                   0       37     1      rs     rt     rd
ANDI   and immediate        12             0      rs     imadd  rt
ORI    or immediate         13             1      rs     imadd  rt
SLL    shift left logical   0       0      3      rt     shamt  rd
SRL    shift right logical  0       2      4      rt     shamt  rd
LW     load word            35             2      rs     imadd  rt
SW     save word            43             2      rs     imadd  0
LUI    load upper imm.      15             3      imadd  16     rt
BEQ    branch on equal      4
BNE    branch on not eq.    5
SLT    set on less than     0       42     7      rs     rt     rd
SLTI   set less than imm.   10             7      rs     imadd  rt
SLTU   set less than uns.   0       43     8      rs     rt     rd
SLTIU  set l.t. imm. uns.   11             8      rs     imadd  rt
J      jump                 2
JR     jump register        0       8
JAL    jump and link        3              2      ID PC  0      31

Table 4-1 Supported Instructions Information

Except for BEQ, BNE and JR, which need their operands (from the register file) in the ID stage, instructions can have their operands as late as the beginning of the EX stage. Also, except for the load instruction, all instructions that change the value of a register have the value ready at the end of the EX stage. So in general we can forward values from the start of the MEM stage (which is the end of the EX stage) to the start of the EX stage, except for the load instruction. In

the ID stage, when an operand of an instruction is equal to the destination register of the EX instruction and the EX instruction is LW, we stall the IF and ID stages and put a no op in EX. In the next clock cycle we then forward the data from the end of the MEM stage (the start of the WB stage) to the start of the EX stage. If the EX operation is a load, the ID operation is a branch or JR and an ID operand is the same as the destination register of EX, then we need two stalls. The difference is that here the branch instruction needs the data at the beginning of the ID stage, not in the EX stage, and the data is ready at the end of the MEM stage. Therefore we need not only two stalls but also to forward the data from the end of MEM to ID. Also, if the branch is taken, we have a flush in the IF stage. This is the worst delay (two stalls and one flush) which we encounter with this architecture.

IF: This function implements the instruction fetch stage of pipelined MIPS. If we get the IF stall flag from the ID stage, it means that in the next clock cycle we have to stall: we set the second IF stall flag to true and this one to false. If the second IF stall flag is true we do not read a new instruction and this stage is stalled. Otherwise we read the instruction memory at the address in PC and then update PC with the next PC. If we get the BranchTaken flag from ID, we clear the instruction register (flush) and update PC. Figure 4-2 shows the pipelined MIPS simulator flow chart.

Figure 4-2 Pipelined MIPS Simulator Flow Chart
(Flow: Start, Setup UI, Get the Inputs, then per clock cycle Write Back, Memory Access, Execution, Instruction Decode, Instruction Fetch and Update UI, repeated until the last instruction, then End.)
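The stall rules described in the ID and IF sections can be condensed into a small predicate; the struct layouts and names here are illustrative, not the simulator's:

```cpp
// One stall when the instruction in EX is a load whose destination
// feeds an ordinary consumer in ID; two stalls when that consumer is
// a branch or JR, which needs its operands already in the ID stage.
struct IdInstr { int rs, rt; bool isBranchOrJr; };
struct ExInstr { bool isLoad; int destReg; };

int stallsNeeded(const IdInstr& id, const ExInstr& ex) {
    if (!ex.isLoad || ex.destReg == 0) return 0;   // $0 never stalls
    if (id.rs != ex.destReg && id.rt != ex.destReg) return 0;
    return id.isBranchOrJr ? 2 : 1;
}
```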

4.2 Tomasulo: For implementing Tomasulo's algorithm we developed another program with a separate user interface. Figure 4-3 shows the user interface for the Tomasulo's algorithm implementation.

Figure 4-3 Designed User Interface for Tomasulo's Algorithm Implementation

In this program the user builds the instruction queue using the combo boxes in the top left corner of the user interface and then presses the execute button. The user can change the execution time of each functional unit at the bottom of these combo boxes. The program executes one clock cycle per press of the step button, and the user can reset the whole program with the reset button. The tables show the instruction station, the load store unit, the reservation stations and the register file station.

We designate six operations from the MIPS floating point ISA for this program: load, store, add, subtract, multiplication and division. Our assumption is that the program starts its work from the instruction queue, i.e. branches are resolved earlier, in previous stages. We have floating point registers F0 to F14 as operands, and whenever the user chooses a load or store operation, the immediate address part can be written in the third column and the general integer register chosen from the fourth column. In this example of Tomasulo's algorithm we have 3 add, 2 multiplication, 3 load and 3 store functional units, and we assume our instruction queue depth is 8.

Like the pipelined MIPS program, we have a Widget class which controls this user interface. Below we describe the behavior of its most important functions.

setupui: This function initializes all the user interface components and variables used in this program. It fills the combo boxes, writes the basic information of the tables, initializes the program variables such as the clock to zero and so on.

executepushclicked: The program starts when the user provides the input with the combo boxes and presses the execute button. Upon activation of the execute button this function is called in the Widget class. It reads the combo boxes and translates them from strings to integers for use in the rest of the program. If everything set by the user is correct, the step button is enabled so the user can proceed.

steppushclicked: This function is called when the step button is pressed in the user interface. In it we increment the clock, show the clock cycle value in the register file station table and call the write, execomp and issue functions respectively.

Write: This function implements the write result stage of Tomasulo's algorithm. If an instruction has finished executing, it goes to the write stage, which writes its result into the register file and updates the register file station. This function puts the current clock number in the third column of the instruction station table, sets the Busy field of the instruction's functional unit to no and clears that functional unit's reservation station.

Execomp: This function implements the execute stage of Tomasulo's algorithm. For issued instructions, this function monitors the CDB to get the operands of each instruction. Operands that were not ready at issue time are indicated in the Q fields.
After all operands are available, it starts the operation in the functional unit, and when it finishes (the timer of the operation reaches zero) it updates the second column of the instruction station with the current clock.

Issue: This function implements the issue stage of Tomasulo's algorithm. It starts from the top instruction in the queue and finds a non-busy functional unit for it. It looks for the operands of the operation: if available, it fills the V fields of the reservation station, otherwise it fills the Q fields. If it cannot find a functional unit for an instruction, it stops issuing until it finds one, since, as we said, in Tomasulo's algorithm instructions are issued in order. After issuing, it fills the first column of the instruction station with the current clock. When it issues an instruction, it marks in the register file station the destination register of the instruction with the functional unit that is going to produce its result.

resetpushclicked: Whenever the reset button is pushed in the user interface this function is called. It clears the tables in the user interface and calls the setupui function to initialize the program.
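The V/Q bookkeeping that issue, execomp and write perform over the CDB can be sketched like this (struct and function names are ours, not from the simulator source):

```cpp
#include <string>
#include <vector>

// A reservation station entry: each operand is either a captured value
// (V field) or the name of the functional unit that will produce it
// (Q field); an empty Q means the operand is ready.
struct Station {
    bool busy = false;
    double Vj = 0.0, Vk = 0.0;
    std::string Qj, Qk;
    bool ready() const { return busy && Qj.empty() && Qk.empty(); }
};

// Write-result step: broadcast a finished unit's tag and value on the
// CDB; every station waiting on that tag captures the value.
void cdbBroadcast(std::vector<Station>& stations,
                  const std::string& tag, double value) {
    for (auto& s : stations) {
        if (s.busy && s.Qj == tag) { s.Vj = value; s.Qj.clear(); }
        if (s.busy && s.Qk == tag) { s.Vk = value; s.Qk.clear(); }
    }
}
```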

Figure 4-4 shows the Tomasulo's algorithm simulator flow chart.

Figure 4-4 Tomasulo's Algorithm Simulator Flow Chart
(Flow: Start, Setup UI, Get the Inputs, then per clock cycle Write Back, Execute, Issue and Update UI, repeated until the last instruction is written, then End.)

Chapter 5 : Analysis and Conclusion

Here we analyze one example for each program to see its results.

5.1 Pipelined MIPS Simulation Example: Here is the example for the pipelined MIPS. Table 5-1 shows the instructions that we want to run on the processor. In the stalls/flush column you can see any stall and flush that is going to happen, and in the comment column you can find the effect of each instruction. IM is instruction memory and DM is data memory.

location in IM  instruction     stalls/flush          comment
0               LW $2,0(0)                            $2=128
4               LW $3,4(0)                            $3=256
8               ADD $4,$3,$2    one stall             $4=384
12              SLT $5,$4,$3                          $5=0
16              LW $6,8(0)                            $6=0
20              BEQ $5,$6,8     2 stalls, one flush   branch happens
24              J 1024                                will not execute
28              J 512                                 will not execute
32              AND $7,$2,$3                          $7=0
36              LW $8,12(0)                           $8=16
40              BEQ $7,$8,8     2 stalls              no branch
44              SW $8,16(0)                           save $8 to add. 16 of DM
48              JAL 56          one flush             $31=52
52              J 1024                                will not execute
56              END                                   finish!

Table 5-1 Pipelined MIPS Simulation Example

We load the data memory (DM) with the values 128, 256, 0 and 16 respectively. Note that both the instruction and data text files should be written in hex. Here is an example of how to write the instructions in the instruction text file which will be loaded into the instruction memory:

Format R: ADD $4,$3,$2. Opcode = 0 = 000000 binary, rs = 2 = 00010 binary, rt = 3 = 00011 binary, rd = 4 = 00100 binary, shamt = 0 = 00000 binary, funct = 32 = 100000 binary, so ADD $4,$3,$2 would be 00000000010000110010000000100000 binary or 00432020 hex.

Format I: LW $3,4($0). Opcode = 35 = 100011 binary, rs = 0 = 00000 binary, rt = 3 = 00011 binary, immediate address = 4.

So LW $3,4($0) would be 10001100000000110000000000000100 binary or 8C030004 hex.

Format J: J 1024. Opcode = 2 = 000010 binary and the target address = 1024 = 00000000000000010000000000 in the 26-bit target field, so J 1024 would be 00001000000000000000010000000000 binary or 08000400 hex.

Table 5-2 shows the timing that we expect our program to have.

Clock cycle  IF stage   ID stage   EX stage  MEM stage  WB stage
1            0          X          X         X          X
2            4          0          X         X          X
3            8          4          0         X          X
4            12         8          4         0          X
5            12/Stall   8/Stall    NO OP     4          0
6            16         12         8         NO OP      4
7            20         16         12        8          NO OP
8            24         20         16        12         8
9            24/Stall   20/Stall   NO OP     16         12
10           24/Flush   20/Stall   NO OP     NO OP      16
11           32         Flush      20        NO OP      NO OP
12           36         32         Flush     20         NO OP
13           40         36         32        Flush      20
14           44         40         36        32         Flush
15           44/Stall   40/Stall   NO OP     36         32
16           44/Stall   40/Stall   NO OP     NO OP      36
17           48         44         40        NO OP      NO OP
18           52/Flush   48         44        40         NO OP
19           56         Flush      48        44         40
20           60         56         Flush     48         44
21           64         60         56        Flush      48
22           68         64         60        56         Flush
23           72         68         64        60         56

Table 5-2 Expected Result from the Given Example to the Pipelined MIPS Simulator
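The hand encodings above can be cross-checked with a small field packer (note that, following the report's example, ADD $4,$3,$2 is encoded with rs = 2 and rt = 3):

```cpp
#include <cstdint>

// Pack MIPS R-format and I-format fields into a 32-bit word, mirroring
// the bit layout used in the hand encodings above.
uint32_t encodeR(unsigned op, unsigned rs, unsigned rt, unsigned rd,
                 unsigned shamt, unsigned funct) {
    return (op << 26) | (rs << 21) | (rt << 16) | (rd << 11) |
           (shamt << 6) | funct;
}

uint32_t encodeI(unsigned op, unsigned rs, unsigned rt, uint16_t imm) {
    return (op << 26) | (rs << 21) | (rt << 16) | imm;
}
```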

These instructions first load register 2 with the data at address 0+0 of the data memory; we put 128 (decimal) at address 0 of the data memory. Then register 3 is loaded with the content of the data memory at address 4+0, where we wrote 256. The third instruction adds them together and stores the result in register 4. Since the value of register 3 is not available until it reaches the MEM stage, and we need it at the beginning of EX for the add instruction, we have to stall IF and ID here. The value of register 4 will be 384, and since it is not less than register 3, the fourth instruction stores zero in register 5. The next instruction loads the value stored at address 8+0 of the data memory, which is 0. Since registers 5 and 6 have the same value, we branch over the two jump instructions. Because the branch has a register dependency on a load instruction, we get two stalls here, and since the branch is taken we flush the IF once. Register 7 gets the AND of registers 2 and 3, which is zero. Register 8 is loaded with the data memory value 16 (stored at address 12+0). The next branch will not be taken, but we still get two stalls for it. The next instruction saves the value of register 8 (i.e. 16) in the data memory at address 16+0. Then we jump to the instruction stored at address 56 of the instruction memory, which is the END, and the program is finished.

Here are the results. Figure 5-1 shows the first clock cycle of the simulation. As you can see, instruction zero is in the IF stage. PC is 4, which is the address of the next instruction to be fetched. The rest of the stages are empty.

Figure 5-1 Pipelined MIPS Simulation in the First Clock Cycle

Figure 5-2 shows the fifth clock cycle of the simulation. As we expected, because of the dependency between the add and load instructions, the IF and ID stages are stalled and we have a no op in the EX stage. The first load instruction is in the WB stage and the value 128 is written into register 2.

Figure 5-2 Pipelined MIPS Simulation in the Fifth Clock Cycle

Figure 5-3 shows the ninth and tenth clock cycles of the simulation. 256 is written into register 3 in the sixth clock and 384 is written into register 4 in the eighth clock. As we expected, we have two stalls for a branch that depends on a load instruction. The value of PC does not change because of the stall in the ninth clock cycle, and in the tenth clock we flush what we fetched in the IF stage. In the ninth clock cycle the value zero is written into register 5 and in clock cycle ten zero is written into register 6, both in the WB stage.

Figure 5-3 Pipelined MIPS Simulation in the Ninth and Tenth Clock Cycle

Figure 5-4 shows the eleventh clock cycle of the simulation. The instruction at location 32 is fetched in the IF stage. The ID stage gets the flushed instruction (equal to a no op) from the last clock cycle of IF. The branch is in the EX stage and we have two no op operations in MEM and WB. PC is 36, which assures us that the jump instructions are not executed.

Figure 5-4 Pipelined MIPS Simulation in the Eleventh Clock Cycle

Figure 5-5 shows the fifteenth and sixteenth clock cycles of the simulation. Here again we have a branch that depends on the load before it. However, this time the branch is not taken, so we do not have a flush. In the fifteenth clock cycle zero is written into the seventh register and in clock cycle sixteen 16 is written into the eighth register, both in the WB stage.

Figure 5-5 Pipelined MIPS Simulation in the Fifteenth and Sixteenth Clock Cycle

Figure 5-6 shows the nineteenth clock cycle of the simulation. In the eighteenth clock cycle we decode the jump instruction at location 48 of the instruction memory, so we flush the IR for clock cycle 19; therefore you can see that the ID status in the nineteenth clock cycle is flush. As you can see, the fetched instruction in IF is END and its opcode is in the IR. When this instruction reaches the WB stage, the simulation is terminated and the finished message pops up.

Figure 5-6 Pipelined MIPS Simulation in the Nineteenth Clock Cycle

Below is the report text file, where the values of each stage's registers and variables in every clock cycle can be found.

STAGE WB status:-3 Z5= 0 L5= 0 H5= 0 Opcode= 0 WDRd= 0
STAGE MEM status:-3 Z5= 0 Z4= 0 MD4= 0 Opcode= 0 MEMRd= 0
STAGE EX status:-3 X3= 0Y3= 0Z4= 0MD3= 0Opcode= 0EX funct= 0 EXRd= 0 Over flow= 0
STAGE ID status:-3 X3= 0Y3= 0MD3= 0 Rs= 0 Rt= 0 Rd= 0 shamt= 0 Immidiat address= 0 Opcode= 0 ID funct= 0 IDRd= 0
STAGE IF status:0 IR= PC= 4

STAGE WB status:-3 Z5= 0 L5= 0 H5= 0 Opcode= 0 WDRd= 0
STAGE MEM status:-3 Z5= 0 Z4= 0 MD4= 0 Opcode= 0 MEMRd= 0
STAGE EX status:-3 X3= 0Y3= 0Z4= 0MD3= 0Opcode= 0EX funct= 0 EXRd= 0 Over flow= 0
STAGE ID status:0 X3= 0Y3= 0MD3= 0 Rs= 0 Rt= 2 Rd= 2 shamt= 0 Immidiat address= 0 Opcode= 35 ID funct= 0 IDRd= 2
STAGE IF status:4 IR= PC= 8

STAGE WB status:-3 Z5= 0 L5= 0 H5= 0 Opcode= 0 WDRd= 0
STAGE MEM status:-3 Z5= 0 Z4= 0 MD4= 0 Opcode= 0 MEMRd= 0
STAGE EX status:0 X3= 0Y3= 0Z4= 0MD3= 0Opcode= 35EX funct= 0 EXRd= 2 Over flow= 0
STAGE ID status:4 X3= 0Y3= 4MD3= 0 Rs= 0 Rt= 3 Rd= 3 shamt= 0 Immidiat address= 4 Opcode= 35 ID funct= 0 IDRd= 3
STAGE IF status:8 IR= PC= 12

STAGE WB status:-3 Z5= 0 L5= 0 H5= 0 Opcode= 0 WDRd= 0
STAGE MEM status:0 Z5= 128 Z4= 0 MD4= 0 Opcode= 35 MEMRd= 2
STAGE EX status:4 X3= 0Y3= 4Z4= 4MD3= 0Opcode= 35EX funct= 0 EXRd= 3 Over flow= 0
STAGE ID status:8 X3= 0Y3= 0MD3= 0 Rs= 2 Rt= 3 Rd= 0 shamt= 0 Immidiat address= 8224 Opcode= 0 ID funct= 0 IDRd= 0
STAGE IF status:12 IR= PC= 16

STAGE WB status:0 Z5= 128 L5= 0 H5= 0 Opcode= 35 WDRd= 2
STAGE MEM status:4 Z5= 256 Z4= 4 MD4= 0 Opcode= 35 MEMRd= 3
STAGE EX status:8 X3= 0Y3= 0Z4= 0MD3= 0Opcode= 0EX funct= 0 EXRd= 0 Over flow= 0
STAGE ID status:-1 X3= 128Y3= 256MD3= 0 Rs= 2 Rt= 3 Rd= 4 shamt= 0 Immidiat address= 8224 Opcode= 0 ID funct= 32 IDRd= 4
STAGE IF status:-1 IR= PC= 16

STAGE WB status:4 Z5= 256 L5= 0 H5= 0 Opcode= 35 WDRd= 3
STAGE MEM status:-4 Z5= 0 Z4= 0 MD4= 0 Opcode= 0 MEMRd= 0
STAGE EX status:-1 X3= 128Y3= 256Z4= 384MD3= 0Opcode= 0EX funct= 32 EXRd= 4 Over flow= 0
STAGE ID status:-1 X3= 384Y3= 256MD3= 0 Rs= 4 Rt= 3 Rd= 5 shamt= 0 Immidiat address= Opcode= 0 ID funct= 42 IDRd= 5
STAGE IF status:16 IR= PC= 20

STAGE WB status:-4 Z5= 0 L5= 0 H5= 0 Opcode= 0 WDRd= 0
STAGE MEM status:8 Z5= 384 Z4= 384 MD4= 0 Opcode= 0 MEMRd= 4
STAGE EX status:12 X3= 384Y3= 256Z4= 0MD3= 0Opcode= 0EX funct= 42 EXRd= 5 Over flow= 0
STAGE ID status:16 X3= 0Y3= 8MD3= 0 Rs= 0 Rt= 6 Rd= 6 shamt= 0 Immidiat address= 8 Opcode= 35 ID funct= 42 IDRd= 6
STAGE IF status:20 IR= PC= 24

STAGE WB status:8 Z5= 384 L5= 0 H5= 0 Opcode= 0 WDRd= 4
STAGE MEM status:12 Z5= 0 Z4= 0 MD4= 0 Opcode= 0 MEMRd= 5
STAGE EX status:16 X3= 0Y3= 8Z4= 8MD3= 0Opcode= 35EX funct= 42 EXRd= 6 Over flow= 0
STAGE ID status:20 X3= 0Y3= 0MD3= 0 Rs= 5 Rt= 6 Rd= 0 shamt= 0 Immidiat address= 8 Opcode= 0 ID funct= 0 IDRd= 0
STAGE IF status:24 IR= PC= 28

STAGE WB status:12 Z5= 0 L5= 0 H5= 0 Opcode= 0 WDRd= 5
STAGE MEM status:16 Z5= 0 Z4= 8 MD4= 0 Opcode= 35 MEMRd= 6
STAGE EX status:20 X3= 0Y3= 0Z4= 0MD3= 0Opcode= 0EX funct= 0 EXRd= 0 Over flow= 0
STAGE ID status:-1 X3= 0Y3= 0MD3= 0 Rs= 5 Rt= 6 Rd= 0 shamt= 0 Immidiat address= 8 Opcode= 0 ID funct= 0 IDRd= 0
STAGE IF status:-1 IR= PC= 28

STAGE WB status:16 Z5= 0 L5= 0 H5= 0 Opcode= 35 WDRd= 6
STAGE MEM status:-4 Z5= 0 Z4= 0 MD4= 0 Opcode= 0 MEMRd= 0
STAGE EX status:-1 X3= 0Y3= 0Z4= 0MD3= 0Opcode= 0EX funct= 0 EXRd= 0 Over flow= 0
STAGE ID status:-1 X3= 0Y3= 0MD3= 0 Rs= 5 Rt= 6 Rd= 0 shamt= 0 Immidiat address= 8 Opcode= 4 ID funct= 0 IDRd= 0
STAGE IF status:-2 IR= 0 PC= 32

STAGE WB status:-4 Z5= 0 L5= 0 H5= 0 Opcode= 0 WDRd= 0
STAGE MEM status:-4 Z5= 0 Z4= 0 MD4= 0 Opcode= 0 MEMRd= 0
STAGE EX status:-1 X3= 0Y3= 0Z4= 0MD3= 0Opcode= 4EX funct= 0 EXRd= 0 Over flow= 0
STAGE ID status:-2 X3= 0Y3= 0MD3= 0 Rs= 0 Rt= 0 Rd= 0 shamt= 0 Immidiat address= 0 Opcode= 0 ID funct= 0 IDRd= 0
STAGE IF status:32 IR= PC= 36

STAGE WB status:-4 Z5= 0 L5= 0 H5= 0 Opcode= 0 WDRd= 0
STAGE MEM status:20 Z5= 0 Z4= 0 MD4= 0 Opcode= 4 MEMRd= 0
STAGE EX status:-2 X3= 0Y3= 0Z4= 0MD3= 0Opcode= 0EX funct= 0 EXRd= 0 Over flow= 0
STAGE ID status:32 X3= 128Y3= 256MD3= 0 Rs= 2 Rt= 3 Rd= 7 shamt= 0 Immidiat address= Opcode= 0 ID funct= 36 IDRd= 7
STAGE IF status:36 IR= PC= 40

STAGE WB status:20 Z5= 0 L5= 0 H5= 0 Opcode= 4 WDRd= 0
STAGE MEM status:-2 Z5= 0 Z4= 0 MD4= 0 Opcode= 0 MEMRd= 0
STAGE EX status:32 X3= 128Y3= 256Z4= 0MD3= 0Opcode= 0EX funct= 36 EXRd= 7 Over flow= 0
STAGE ID status:36 X3= 0Y3= 12MD3= 0 Rs= 0 Rt= 8 Rd= 8 shamt= 0 Immidiat address= 12 Opcode= 35 ID funct= 36 IDRd= 8
STAGE IF status:40 IR= PC= 44

STAGE WB status:-2 Z5= 0 L5= 0 H5= 0 Opcode= 0 WDRd= 0
STAGE MEM status:32 Z5= 0 Z4= 0 MD4= 0 Opcode= 0 MEMRd= 7
STAGE EX status:36 X3= 0Y3= 12Z4= 12MD3= 0Opcode= 35EX funct= 36 EXRd= 8 Over flow= 0
STAGE ID status:40 X3= 0Y3= 0MD3= 0 Rs= 7 Rt= 8 Rd= 0 shamt= 0 Immidiat address= 16 Opcode= 0 ID funct= 0 IDRd= 0
STAGE IF status:44 IR= PC= 48

STAGE WB status:32 Z5= 0 L5= 0 H5= 0 Opcode= 0 WDRd= 7
STAGE MEM status:36 Z5= 16 Z4= 12 MD4= 0 Opcode= 35 MEMRd= 8
STAGE EX status:40 X3= 0Y3= 0Z4= 0MD3= 0Opcode= 0EX funct= 0 EXRd= 0 Over flow= 0
STAGE ID status:-1 X3= 0Y3= 0MD3= 0 Rs= 7 Rt= 8 Rd= 0 shamt= 0 Immidiat address= 16 Opcode= 0 ID funct= 0 IDRd= 0
STAGE IF status:-1 IR= PC= 48

STAGE WB status:36 Z5= 16 L5= 0 H5= 0 Opcode= 35 WDRd= 8
STAGE MEM status:-4 Z5= 0 Z4= 0 MD4= 0 Opcode= 0 MEMRd= 0
STAGE EX status:-1 X3= 0Y3= 0Z4= 0MD3= 0Opcode= 0EX funct= 0 EXRd= 0 Over flow= 0
STAGE ID status:-1 X3= 0Y3= 0MD3= 0 Rs= 7 Rt= 8 Rd= 0 shamt= 0 Immidiat address= 16 Opcode= 4 ID funct= 0 IDRd= 0
STAGE IF status:-1 IR= PC= 48

STAGE WB status:-4 Z5= 0 L5= 0 H5= 0 Opcode= 0 WDRd= 0
STAGE MEM status:-4 Z5= 0 Z4= 0 MD4= 0 Opcode= 0 MEMRd= 0
STAGE EX status:-1 X3= 0Y3= 0Z4= 0MD3= 0Opcode= 4EX funct= 0 EXRd= 0 Over flow= 0
STAGE ID status:-1 X3= 0Y3= 16MD3= 16 Rs= 0 Rt= 8 Rd= 0 shamt= 0 Immidiat address= 16 Opcode= 43 ID funct= 0 IDRd= 0
STAGE IF status:48 IR= PC= 52

STAGE WB status:-4 Z5= 0 L5= 0 H5= 0 Opcode= 0 WDRd= 0
STAGE MEM status:40 Z5= 0 Z4= 0 MD4= 0 Opcode= 4 MEMRd= 0
STAGE EX status:44 X3= 0Y3= 16Z4= 16MD3= 16Opcode= 43EX funct= 0 EXRd= 0 Over flow= 0
STAGE ID status:48 X3= 52Y3= 0MD3= 0 Rs= 0 Rt= 0 Rd= 31 shamt= 0 Immidiat address= 56 Opcode= 3 ID funct= 0 IDRd= 31
STAGE IF status:-2 IR= 0 PC= 56

STAGE WB status:40 Z5= 0 L5= 0 H5= 0 Opcode= 4 WDRd= 0
STAGE MEM status:44 Z5= 0 Z4= 16 MD4= 16 Opcode= 43 MEMRd= 0
STAGE EX status:48 X3= 52Y3= 0Z4= 52MD3= 0Opcode= 3EX funct= 0 EXRd= 31 Over flow= 0
STAGE ID status:-2 X3= 0Y3= 0MD3= 0 Rs= 0 Rt= 0 Rd= 0 shamt= 0 Immidiat address= 0 Opcode= 0 ID funct= 0 IDRd= 0
STAGE IF status:56 IR= PC= 60

STAGE WB status:44 Z5= 0 L5= 0 H5= 0 Opcode= 43 WDRd= 0
STAGE MEM status:48 Z5= 52 Z4= 52 MD4= 0 Opcode= 3 MEMRd= 31
STAGE EX status:-2 X3= 0Y3= 0Z4= 0MD3= 0Opcode= 0EX funct= 0 EXRd= 0 Over flow= 0
STAGE ID status:56 X3= 0Y3= 0MD3= 0 Rs= 0 Rt= 0 Rd= 0 shamt= 0 Immidiat address= 0 Opcode= 63 ID funct= 0 IDRd= 0
STAGE IF status:60 IR= 0 PC= 64

STAGE WB status:48 Z5= 52 L5= 0 H5= 0 Opcode= 3 WDRd= 31
STAGE MEM status:-2 Z5= 0 Z4= 0 MD4= 0 Opcode= 0 MEMRd= 0
STAGE EX status:56 X3= 0Y3= 0Z4= 0MD3= 0Opcode= 63EX funct= 0 EXRd= 0 Over flow= 0
STAGE ID status:60 X3= 0Y3= 0MD3= 0 Rs= 0 Rt= 0 Rd= 0 shamt= 0 Immidiat address= 0 Opcode= 0 ID funct= 0 IDRd= 0
STAGE IF status:64 IR= 0 PC= 68

STAGE WB status:-2 Z5= 0 L5= 0 H5= 0 Opcode= 0 WDRd= 0
STAGE MEM status:56 Z5= 0 Z4= 0 MD4= 0 Opcode= 63 MEMRd= 0
STAGE EX status:60 X3= 0Y3= 0Z4= 0MD3= 0Opcode= 0EX funct= 0 EXRd= 0 Over flow= 0
STAGE ID status:64 X3= 0Y3= 0MD3= 0 Rs= 0 Rt= 0 Rd= 0 shamt= 0 Immidiat address= 0 Opcode= 0 ID funct= 0 IDRd= 0
STAGE IF status:68 IR= 0 PC= 72

STAGE WB status:56 Z5= 0 L5= 0 H5= 0 Opcode= 63 WDRd= 0
STAGE MEM status:60 Z5= 0 Z4= 0 MD4= 0 Opcode= 0 MEMRd= 0
STAGE EX status:64 X3= 0Y3= 0Z4= 0MD3= 0Opcode= 0EX funct= 0 EXRd= 0 Over flow= 0
STAGE ID status:68 X3= 0Y3= 0MD3= 0 Rs= 0 Rt= 0 Rd= 0 shamt= 0 Immidiat address= 0 Opcode= 0 ID funct= 0 IDRd= 0
STAGE IF status:72 IR= 0 PC=

5.2 Tomasulo's Algorithm Simulation Example: Our second example covers Tomasulo's algorithm. Here is our instruction queue:

LD F6,32(R2)
LD F2,45(R3)
MULT F0,F2,F4
SUB F8,F6,F2
DIV F10,F0,F6
ADD F6,F8,F2

We assume that the add and subtract operations need two cycles, multiplication needs 10 cycles and division needs 40 cycles to execute. As you can see, there are multiple RAW hazards in this example: the multiplication needs F2 from the second load instruction and the subtraction needs F6 from the first load instruction. DIV needs F0 from MULT and F6 from the first LD. ADD must get the F8 value from the subtraction and F2 from the second LD. There is also a WAR hazard between the DIV and ADD instructions in this example.

Figure 5-7 shows the first clock cycle of the Tomasulo's algorithm simulation. In the first clock cycle we issue the first LD instruction and indicate the first load functional unit in the register file station.
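The RAW dependences listed above can also be found mechanically; the three-field instruction representation below is our own, for illustration:

```cpp
#include <cstddef>
#include <initializer_list>
#include <string>
#include <vector>

// An instruction as one destination register and up to two sources.
struct Instr { std::string name, dest, src1, src2; };

// For instruction i, return the names of the earlier instructions it
// has a RAW dependence on: for each source, the LATEST earlier writer.
std::vector<std::string> rawDeps(const std::vector<Instr>& q, std::size_t i) {
    std::vector<std::string> deps;
    for (const std::string& src : {q[i].src1, q[i].src2}) {
        if (src.empty()) continue;
        for (std::size_t j = i; j-- > 0; )
            if (q[j].dest == src) { deps.push_back(q[j].name); break; }
    }
    return deps;
}
```

Note that Tomasulo's register renaming removes the WAR hazard noted above (DIV reads F6 while the later ADD writes it), since ADD's pending result is tracked by a reservation-station tag instead of overwriting F6 early.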

Figure 5-7 Tomasulo's Algorithm Simulation in the First Clock Cycle

Figure 5-8 shows the second and third simulation clock cycles. In the second cycle we issue the second LD instruction, and F2 is marked in the register file station as waiting for the LOAD2 unit. In the third clock cycle we issue the multiplication instruction, and since F2 is waiting for its result from the LOAD2 functional unit, we indicate this in the Q field of the reservation station. In the same clock the execution of the first load instruction is completed. Load and store instructions need two clocks to execute: one for reading the operands and one for evaluating the address.

Figure 5-8 Tomasulo's Algorithm Simulation in the Second and Third Clock Cycle

Figure 5-9 shows the simulation in the fourth clock cycle. In this clock cycle the first LD instruction goes to the write-result stage, and the LOAD1 unit is no longer busy. The second LD instruction completes execution. The MULT instruction is still waiting for the second load instruction, and the SUB instruction is issued. SUB also waits on the second load instruction for its F2 operand, while its F6 operand comes from the first load instruction.

Figure 5-9 Tomasulo's Algorithm Simulation in the Fourth Clock Cycle

Figure 5-10 shows clock cycles 5 and 6 of the simulation. In the fifth clock cycle the second LD instruction goes to the write-result stage, so both SUB and MULT get their F2 operand: the Q field is now empty and the value is in the V field. The DIV instruction is also issued, and it waits for MULT to produce the value of F0. In the sixth cycle both SUB and MULT perform the first clock of their execution; SUB needs one more clock and MULT needs nine more to complete. ADD is issued and waits for SUB to produce the value of F8.
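The cycle-5 hand-off — a waiting Q entry turning into a V value — is the common data bus (CDB) broadcast; a minimal sketch, using the usual `Vj/Vk/Qj/Qk` Tomasulo field names rather than the simulator's actual code:

```python
def broadcast(stations, reg_status, registers, producer, value):
    """Write-result stage: the producing station drives (name, value) on
    the CDB; every waiting reservation station and the register file snoop it."""
    for rs in stations:
        if rs.get("Qj") == producer:
            rs["Vj"], rs["Qj"] = value, None   # operand arrives, tag cleared
        if rs.get("Qk") == producer:
            rs["Vk"], rs["Qk"] = value, None
    for reg, src in reg_status.items():
        if src == producer:                    # only if still the newest writer
            registers[reg] = value
            reg_status[reg] = None

# Cycle 5 of the example: LOAD2 broadcasts F2; MULT and SUB both pick it up.
mult_rs = {"Qj": "LOAD2", "Vj": None, "Qk": None, "Vk": "F4"}
sub_rs  = {"Qj": None, "Vj": "v(F6)", "Qk": "LOAD2", "Vk": None}
reg_status, registers = {"F2": "LOAD2"}, {}
broadcast([mult_rs, sub_rs], reg_status, registers, "LOAD2", "v(F2)")
print(mult_rs["Vj"], sub_rs["Vk"])   # v(F2) v(F2)
```

The register file only captures the value if its status entry still names the broadcasting station — this is the check that makes WAR and WAW hazards harmless under register renaming.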

Figure 5-10 Tomasulo's Algorithm Simulation in the Fifth and Sixth Clock Cycle

Figure 5-11 shows the seventh and eighth simulation clock cycles. In the seventh clock cycle SUB finishes its execution, and in the eighth it writes its result, passing the value of F8 to the ADD instruction that is waiting for it. MULT is still executing and DIV is waiting for MULT.

Figure 5-11 Tomasulo's Algorithm Simulation in the Seventh and Eighth Clock Cycle

Figure 5-12 shows the simulation in clock cycle 10. ADD, which began executing once it received F8, completes its two-cycle execution in clock cycle 10. MULT is still executing and DIV is waiting for it.

Figure 5-12 Tomasulo's Algorithm Simulation in the Tenth Clock Cycle

Figure 5-13 shows the simulation in the sixteenth clock cycle. In clock cycle 15 the MULT instruction completes its execution, and it writes the result in clock cycle 16. In the same clock cycle DIV, which was waiting for its operand from MULT, receives the value and starts its execution.

Figure 5-13 Tomasulo's Algorithm Simulation in the Sixteenth Clock Cycle

Figure 5-14 shows clock cycles 56 and 57, where the DIV instruction completes its execution and writes its result.

Figure 5-14 Tomasulo's Algorithm Simulation in the 56th and 57th Clock Cycle
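As a sanity check on the cycle numbers in this walkthrough, a little arithmetic reproduces them (illustrative only; it assumes, as the figures suggest, that execution occupies the latency-many cycles after the operand arrives on the CDB):

```python
# Latencies assumed in the example.
MULT_LATENCY, DIV_LATENCY = 10, 40

mult_exec_first = 6                                    # MULT begins executing in cycle 6
mult_exec_last = mult_exec_first + MULT_LATENCY - 1    # cycle 15: execution complete
mult_write = mult_exec_last + 1                        # cycle 16: F0 broadcast on the CDB
div_exec_last = mult_write + DIV_LATENCY               # cycle 56: DIV execution complete
div_write = div_exec_last + 1                          # cycle 57: F10 written
print(mult_write, div_exec_last, div_write)            # 16 56 57
```

The 40-cycle DIV latency stacked after MULT's cycle-16 broadcast lands exactly on the cycles 56 and 57 shown in Figure 5-14.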


MIPS Pipelining. Computer Organization Architectures for Embedded Computing. Wednesday 8 October 14 MIPS Pipelining Computer Organization Architectures for Embedded Computing Wednesday 8 October 14 Many slides adapted from: Computer Organization and Design, Patterson & Hennessy 4th Edition, 2011, MK

More information

The Processor: Instruction-Level Parallelism

The Processor: Instruction-Level Parallelism The Processor: Instruction-Level Parallelism Computer Organization Architectures for Embedded Computing Tuesday 21 October 14 Many slides adapted from: Computer Organization and Design, Patterson & Hennessy

More information

CISC 662 Graduate Computer Architecture. Lecture 4 - ISA

CISC 662 Graduate Computer Architecture. Lecture 4 - ISA CISC 662 Graduate Computer Architecture Lecture 4 - ISA Michela Taufer http://www.cis.udel.edu/~taufer/courses Powerpoint Lecture Notes from John Hennessy and David Patterson s: Computer Architecture,

More information

Processor Architecture

Processor Architecture Processor Architecture Jinkyu Jeong (jinkyu@skku.edu) Computer Systems Laboratory Sungkyunkwan University http://csl.skku.edu SSE2030: Introduction to Computer Systems, Spring 2018, Jinkyu Jeong (jinkyu@skku.edu)

More information

Computer Architecture Computer Science & Engineering. Chapter 4. The Processor BK TP.HCM

Computer Architecture Computer Science & Engineering. Chapter 4. The Processor BK TP.HCM Computer Architecture Computer Science & Engineering Chapter 4 The Processor Introduction CPU performance factors Instruction count Determined by ISA and compiler CPI and Cycle time Determined by CPU hardware

More information

Real Processors. Lecture for CPSC 5155 Edward Bosworth, Ph.D. Computer Science Department Columbus State University

Real Processors. Lecture for CPSC 5155 Edward Bosworth, Ph.D. Computer Science Department Columbus State University Real Processors Lecture for CPSC 5155 Edward Bosworth, Ph.D. Computer Science Department Columbus State University Instruction-Level Parallelism (ILP) Pipelining: executing multiple instructions in parallel

More information

Thomas Polzer Institut für Technische Informatik

Thomas Polzer Institut für Technische Informatik Thomas Polzer tpolzer@ecs.tuwien.ac.at Institut für Technische Informatik Pipelined laundry: overlapping execution Parallelism improves performance Four loads: Speedup = 8/3.5 = 2.3 Non-stop: Speedup =

More information

Pipelining and Vector Processing

Pipelining and Vector Processing Chapter 8 Pipelining and Vector Processing 8 1 If the pipeline stages are heterogeneous, the slowest stage determines the flow rate of the entire pipeline. This leads to other stages idling. 8 2 Pipeline

More information

Lecture 4: Review of MIPS. Instruction formats, impl. of control and datapath, pipelined impl.

Lecture 4: Review of MIPS. Instruction formats, impl. of control and datapath, pipelined impl. Lecture 4: Review of MIPS Instruction formats, impl. of control and datapath, pipelined impl. 1 MIPS Instruction Types Data transfer: Load and store Integer arithmetic/logic Floating point arithmetic Control

More information

Pipelining Analogy. Pipelined laundry: overlapping execution. Parallelism improves performance. Four loads: Non-stop: Speedup = 8/3.5 = 2.3.

Pipelining Analogy. Pipelined laundry: overlapping execution. Parallelism improves performance. Four loads: Non-stop: Speedup = 8/3.5 = 2.3. Pipelining Analogy Pipelined laundry: overlapping execution Parallelism improves performance Four loads: Speedup = 8/3.5 = 2.3 Non-stop: Speedup =2n/05n+15 2n/0.5n 1.5 4 = number of stages 4.5 An Overview

More information

(Basic) Processor Pipeline

(Basic) Processor Pipeline (Basic) Processor Pipeline Nima Honarmand Generic Instruction Life Cycle Logical steps in processing an instruction: Instruction Fetch (IF_STEP) Instruction Decode (ID_STEP) Operand Fetch (OF_STEP) Might

More information

Processor (IV) - advanced ILP. Hwansoo Han

Processor (IV) - advanced ILP. Hwansoo Han Processor (IV) - advanced ILP Hwansoo Han Instruction-Level Parallelism (ILP) Pipelining: executing multiple instructions in parallel To increase ILP Deeper pipeline Less work per stage shorter clock cycle

More information

Advanced d Instruction Level Parallelism. Computer Systems Laboratory Sungkyunkwan University

Advanced d Instruction Level Parallelism. Computer Systems Laboratory Sungkyunkwan University Advanced d Instruction ti Level Parallelism Jin-Soo Kim (jinsookim@skku.edu) Computer Systems Laboratory Sungkyunkwan University http://csl.skku.edu ILP Instruction-Level Parallelism (ILP) Pipelining:

More information