CAD for VLSI 2 Project - Superscalar Processor Implementation
- Trevor Harrison
1 Superscalar Processor

Objective: The main objective is to implement a superscalar pipelined processor using Verilog HDL. This project may be divided into three parts:

- The Arithmetic Logic Unit
- The Pipelined Processor Architecture
- Cache design

2 The ALU

The objective in this phase is to implement the ALU for integer operations. For a fast ALU, the following are required:

- Fully pipelined Carry Lookahead Adder (CLA) - 32 bit
- Fully pipelined Wallace Tree Multiplier (WTM) - 32 bit
- Fully pipelined Load-Store Unit (LSU)

Refer to lab class notes and your earlier homework for this. Each processor in your design must have A1 CLAs, A2 WTMs and A3 LSUs, where A1, A2 and A3 are parameters. The addition, multiplication and load-store operations may take several cycles to complete, but the pipelining above ensures that at every cycle a new set of operands can be pushed into the arithmetic units for computation. Note that if there is a structural hazard due to non-availability of functional units, the pipeline may stall, and all the instructions that follow the stalled instruction should not be scheduled. In other words, instructions are issued in program order. It is interesting to note that if issue were not in program order, the Tomasulo technique described in class would not correctly handle the data hazards. An id for every instruction should also be passed through the units. This is needed because, if a reservation station R is waiting for a result from an execution unit E, it must identify that instruction among the several instructions that may currently be pipelined and executing in E.
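The CLA's combinational core can be illustrated with a behavioural sketch (in Python, not the required Verilog): each bit position computes a generate and a propagate signal, and all carries follow from the recurrence c[i+1] = g[i] | (p[i] & c[i]). In the real CLA the carries are produced by two-level lookahead logic rather than by evaluating the recurrence serially; the sketch only checks the arithmetic.

```python
# Behavioural sketch of 32-bit carry-lookahead addition (not the RTL).
MASK32 = (1 << 32) - 1

def cla_add32(a: int, b: int, cin: int = 0):
    g = [(a >> i) & (b >> i) & 1 for i in range(32)]    # generate: a & b
    p = [((a >> i) ^ (b >> i)) & 1 for i in range(32)]  # propagate: a ^ b
    c = [cin]
    for i in range(32):
        c.append(g[i] | (p[i] & c[i]))                  # c[i+1] = g | p & c
    s = 0
    for i in range(32):
        s |= (p[i] ^ c[i]) << i                         # sum bit = p ^ carry-in
    return s & MASK32, c[32]                            # (sum, carry-out)

print(cla_add32(5, 7))                                  # -> (12, 0)
```

In the hardware version the same g/p signals are grouped 4 bits at a time into group generate/propagate terms, giving the fully pipelined two-level structure covered in the lab notes.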
3 The Pipelined Processor

3.1 The Basic Pipeline

The processor that you have to design is a RISC (Reduced Instruction Set Computer), also called a Load-Store architecture, with the following instruction set.

General purpose registers: Assume that there are thirty-two 32-bit registers, named R0, ..., R31. R0 always stores the value 0 to facilitate many calculations involving zero (jump on zero, for example).

Instruction set: The instruction set of the processor includes

3 Arithmetic instructions:
ADD R1, R2, R3 ; //R1 = R2 + R3
SUB R1, R2, R3 ; //R1 = R2 - R3
MUL R1, R2, R3 ; //R1 = R2 * R3
All operations are two's complement operations. Exactly one of the source operands of an arithmetic instruction can be a signed immediate operand of 16 bits stored in two's complement format. For example, ADD R1, R0, #5 makes R1 = 5.

2 Data transfer instructions:
LD R1, [Reg]; //R1 = content of the memory location; address is specified by Reg
SD [Reg], R1; //[Reg] = R1

2 Control transfer instructions:
JMP L1; //Unconditional jump to location L1
BEQZ (Reg), L1; //Jump to L1 if Reg content is zero
L1 is given as an offset from the current Program Counter (PC). This is called PC-relative addressing.

Halt instruction: HLT

There are basically 5 stages of instruction execution, as shown in Figure 1. The instructions are assumed to be of fixed length, 4 bytes each. In a store instruction, the WB stage is non-existent. In an arithmetic instruction, the MEM stage is non-existent. The processor is pipelined at the instruction level also.

1. Instruction fetch cycle (IF):
IR <- Mem[PC]; NPC <- PC + 4;
Operation: Send out the Program Counter (PC) and fetch the instruction from memory into the Instruction Register (IR); increment the PC by 4 to address the next sequential instruction. The IR holds the instruction that will be needed on subsequent clock cycles; likewise, the register NPC holds the next sequential PC. The above describes fetching one instruction at a time.
You should fetch P1 instructions at a time in the superscalar architecture. Note that our desire is to execute more than one instruction in every cycle.
Figure 1: The five stages of instruction execution (Instruction Fetch - IF, Instruction Decoding - ID, Execution or Address Evaluation - EX, Memory Access/Branch Completion - MEM, Write Back Results - WB)

2. Instruction Decode/Register fetch cycle (ID):
A <- Regs[rs]; B <- Regs[rt]; Imm <- sign-extended immediate field of IR;
Operation: Decode the instruction and access the register file to read the registers (rs and rt are the register specifiers). The outputs of the general purpose registers are read into two temporary registers (A and B) for use in later clock cycles. The lower 16 bits of the IR are also sign-extended and stored into the temporary register Imm, for use in the next cycle. Decoding is done in parallel with reading registers, which is possible because these fields are at fixed locations in the instruction format. Since the immediate portion of an instruction is located in an identical place in every instruction, the sign-extended immediate is also calculated during this cycle in case it is needed in the next cycle.

The above describes how to decode one instruction. You should decode P1 instructions in parallel. In addition, in superscalar execution, before registers are fetched, the register status indicators have to be consulted. Also beware of Load and Store instructions, which read registers for calculating memory addresses; these register reads can lead to RAW hazards. This stage is responsible for dynamically scheduling P1 instructions at a time into the respective A1, A2 and A3 units. If units are not available, then stall the pipeline, as a structural hazard is caused. The memory aliasing problem is to be handled using an associative memory as the memory status indicator. Note that the size of this associative memory will be A3 x (number of pipeline stages in the Load-Store unit); this is the maximum number of memory addresses that could be accessed at a time.
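The in-order issue rule above can be sketched as follows (a Python sketch with illustrative names, not the required RTL): each cycle, up to P1 decoded instructions are dispatched to free functional units, and the first instruction that finds no free unit stalls, blocking everything younger than it.

```python
# Hypothetical sketch of in-order issue with structural-hazard stalls.
def issue_in_order(window, free_units):
    """window: opcodes of up to P1 decoded instructions, in program order.
    free_units: counts of free CLAs/WTMs/LSUs. Returns the issued prefix."""
    unit_for = {"ADD": "CLA", "SUB": "CLA", "MUL": "WTM", "LD": "LSU", "SD": "LSU"}
    issued = []
    for op in window:
        u = unit_for[op]
        if free_units.get(u, 0) == 0:   # structural hazard: stall here, and
            break                       # nothing after this point is scheduled
        free_units[u] -= 1
        issued.append(op)
    return issued

# With one CLA free but no WTM, the MUL stalls and blocks the trailing LD
# even though an LSU is available:
print(issue_in_order(["ADD", "MUL", "LD"], {"CLA": 1, "WTM": 0, "LSU": 1}))
# -> ['ADD']
```

Note that only a *prefix* of the window issues: this is exactly the program-order property that the Tomasulo scheme relies on.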
3. Execution/Effective Address cycle (EX): The ALU operates on the operands prepared in the prior cycle, performing one of the following four functions depending on the instruction type.

Memory reference (LD and SD):
ALUOutput <- R0 + Reg;
Operation: The ALU adds R0 to the contents of Reg fetched in the earlier cycle to form the effective address and places the result into the register ALUOutput. Consult the memory status indicator for resolving the memory aliasing problem.

Register-Register ALU instruction (ADD, SUB and MUL):
ALUOutput <- A op B;
Operation: The ALU performs the operation specified by the function code on the value in register A and the value in register B. The result is placed in the temporary register ALUOutput.

Register-Immediate ALU instruction (ADD, SUB and MUL):
ALUOutput <- A op Imm;
Operation: The ALU performs the operation specified by the opcode on the value in register A and the operand Imm. The result is placed in the temporary register ALUOutput.

Branch:
ALUOutput <- NPC + (Imm << 2); Cond <- (A == 0);
Operation: The ALU adds the NPC to the sign-extended immediate value in Imm, shifted left by 2 bits to create a word offset, to compute the address of the branch target. Register A, which was read in the prior cycle, is checked to determine whether the branch is taken. Since we are considering only one form of branch (BEQZ), the comparison is against 0. Note that BEQZ is actually a pseudo-instruction that translates to a BEQ with R0 as an operand. For simplicity, this is the only form of branch we consider.

To reduce the penalty due to control hazards, jumps can be treated specially. Both unconditional and conditional jumps may be decoded in the IF cycle itself. Note that unconditional jumps can be executed in the IF cycle and conditional jumps in the ID cycle. This is straightforward to implement.
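The branch-target arithmetic above can be checked with a small sketch: the 16-bit immediate is interpreted as a two's complement value, shifted left by 2 to form a word offset, and added to NPC, with the result wrapped to the 32-bit datapath width.

```python
# Sketch of ALUOutput <- NPC + (Imm << 2) with a sign-extended 16-bit Imm.
MASK32 = (1 << 32) - 1

def sign_extend16(imm16):
    """Interpret a 16-bit field as a signed two's complement value."""
    return imm16 - (1 << 16) if imm16 & 0x8000 else imm16

def branch_target(npc, imm16):
    return (npc + (sign_extend16(imm16) << 2)) & MASK32

# A backward branch: the field 0xFFFF encodes -1, i.e. 4 bytes back.
print(hex(branch_target(0x1000, 0xFFFF)))   # -> 0xffc
```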
Note that out of the P1 instructions fetched along with a JMP instruction, all the instructions that appear after the jump instruction should not be scheduled. In the case of a conditional jump, the pipeline should be stalled for one cycle due to the control hazard. The load-store architecture enables the effective address calculation and execution cycle to be combined into a single clock cycle, since no instruction needs to simultaneously calculate a data address, calculate an instruction target address, and perform an operation on the data.

4. Memory access cycle (MEM):
Memory reference: LMD <- Mem[ALUOutput] or Mem[ALUOutput] <- B;
Operation: Access memory, if needed. If the instruction is a load, data returns from memory and is placed in the LMD (load memory data) register; if it is a store, then the data from the B register is written into memory. In either case
the address used is the one computed during the prior cycle and stored in the register ALUOutput.

Note: Each processor has two caches - the Instruction cache and the Data cache. The memory has two ports - a read port for accessing instructions and a read/write port for accessing data. Conflicts in addressing on these ports, namely the same address loaded on both ports, should be resolved. When two or more Load/Store units try to access the cache, there would be a structural hazard for accessing the data cache, resulting in stalling of the pipeline inside the Load/Store units. In your implementation, assume that a cache-based structural hazard takes one extra cycle for simultaneous access by two LSUs. In the worst case you may waste A3 - 1 cycles due to cache-based structural hazards. In the case of a cache miss, after the miss is detected, assume it takes two clock cycles to access memory and read/write data.

5. Write-back cycle (WB):
Register-Register ALU instruction: Regs[rd] <- ALUOutput;
Load instruction: Regs[rt] <- LMD;
Operation: Write the result into the register file, whether it comes from the memory system (which is in LMD) or from the ALU (which is in ALUOutput); the register destination field is in one of two positions (rd or rt) depending on the effective opcode.

The write-back in the superscalar is on the Common Data Bus (CDB), which is communicated back to the reservation stations. The CDB is shared by several execution units to write back results. The CDB should be designed to handle C1 units committing their results at a time; accordingly, it has 32 x C1 data lines. Note that C1 <= A1 + A2 + A3. The bus arbiter has a simple circular-token protocol. It has a register which stores an integer k <= K = A1 + A2 + A3. In the current cycle, the bus arbiter permits the next C1 units, counted circularly from the k-th execution unit, that have a request for the CDB to write onto the CDB.
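A minimal sketch of the circular-token arbitration, under the reading above (names and the token-update rule are illustrative, not the required RTL): units are numbered 1..K; each cycle the arbiter scans circularly from the token position k and grants the CDB to the first C1 requesting units, then advances the token past the last grant.

```python
# Illustrative circular-token CDB arbiter.
def cdb_arbitrate(k, requests, K, C1):
    """requests: set of unit numbers (1..K) requesting the CDB.
    Returns (granted_units_in_scan_order, new_token_position)."""
    granted = []
    for step in range(K):                       # scan at most one full circle
        unit = ((k - 1 + step) % K) + 1         # k, k+1, ..., wrapping at K
        if unit in requests:
            granted.append(unit)
            if len(granted) == C1:              # CDB can commit C1 per cycle
                break
    new_k = (granted[-1] % K) + 1 if granted else k   # resume after last grant
    return granted, new_k

# K = 5 units, C1 = 2 commits per cycle, token at unit 2:
print(cdb_arbitrate(2, {1, 3, 5}, K=5, C1=2))   # -> ([3, 5], 1)
```

Advancing the token past the last granted unit keeps the protocol fair: a unit that just wrote must wait a full circle before it can pre-empt others.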
Note: The write-back cycle resets the register status indicator and the memory status indicator (if applicable).

3.2 Implementation of the Parallelism

The ideal CPI (Cycles Per Instruction) of a pipelined processor is 1, so we cannot do better than that without introducing redundancy. This redundancy takes the form of parallel execution units in the EX stage, as shown in Figure 2. This arrangement permits overlapped and out-of-order execution of instructions in the EX stage, in addition to conventional pipelining, and has the potential to achieve a CPI < 1.
Figure 2: Duplication of functional units for parallelism (IF and ID stages feeding parallel units EX1, EX2, ..., EXN)

3.3 Pipelining hazards

Hazards are situations which prevent the next instruction in the instruction stream from executing in its designated clock cycle. Hazards may stall the pipeline. There are three types of hazards:

Structural - If functional units are duplicated to accommodate overlap in execution but some combination of instructions cannot run in parallel, a structural hazard results. For example, we have only one write port but pipelining requires 2 writes in the same clock cycle.

Data - to be explained shortly.

Control - these arise from pipelining of branches and other instructions that change the Program Counter (PC). For example, in a conditional jump instruction, until the condition is evaluated the new PC can be either the incremented PC value or the address computed by that instruction. To avoid this we either stall the pipeline for 2 cycles or use branch predictors. In this project, assume no branch predictors are used; instead, we choose to stall the processor. When a conflict is encountered, all instructions before the stalled instruction continue, and all instructions after it are stalled.

Data hazard classification:

1. RAW - Read After Write. Consider the instruction sequence below.
ADD R1, R2, R3
SUB R4, R1, R5
The result of the ADD instruction, written into R1, is required for the SUB instruction to proceed.

2. WAW - Write After Write
LD R1, [addr]
SUB R4, R1, R6
ADD R1, R2, R3
The result of the ADD cannot be written to R1 before the value loaded by the LD is written into R1, since that value is needed by the SUB. Moreover, if the LD suffers a cache miss, the ADD may reach the WB stage before the LD does, leaving R1 with the older value at the end of the sequence.

3. WAR - Write After Read
SD [addr], R4
ADD R4, R3, R2
The older value of R4 should get stored into [addr] by the SD instruction before the new value of R4 is produced by the ADD instruction.

Figure 3: Hardware for handling the pipelining hazards (issue unit, register file, register status indicator and memory status cache feeding reservation stations RS1, ..., RSN, which drive execution units EX1, ..., EXN connected to the Common Data Bus (CDB); each reservation station holds Operation, Qj, Qk, Vj, Vk, Address and Busy fields)

3.4 Hardware for handling pipelining hazards

The hardware used to overcome data hazards is shown in Figure 3. There are K = A1 + A2 + A3 execution units running in parallel, delivering data to a common bus (the Common Data Bus, CDB). Each execution unit has an identification number, an integer in the range [1..K]. The register file is an array of registers which supplies inputs to the execution units. It has K triples of a 5-bit input to specify the register, a read/write input signal and a 32-bit output port. The memory status cache is an associative memory with each entry as shown in Figure 4; it is implemented to avoid the memory aliasing problem. The register status indicator is implemented for handling the RAW and WAR hazards. Each execution unit is driven by an intermediate block called a reservation station. The bits of a reservation station are changed by the issue unit. The register status bits indicate the following:

(0, 0): the register is not currently being written by any other instruction.
(i, j): execution unit i is currently evaluating the instruction with id j, whose result is to be written to this register.

Whenever an execution unit finishes evaluation, it puts its result and its id on the CDB.
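The three data-hazard classes defined above can be summarized mechanically (a purely pedagogical sketch, not part of the RTL): given the register sets an older and a younger instruction write and read, each class corresponds to one intersection test.

```python
# Illustrative classifier for the RAW/WAW/WAR definitions above.
def classify_hazards(older, younger):
    """older, younger: (writes, reads) register-name sets for each instruction."""
    older_w, older_r = older
    younger_w, younger_r = younger
    hazards = set()
    if older_w & younger_r:
        hazards.add("RAW")    # younger reads a register the older writes
    if older_w & younger_w:
        hazards.add("WAW")    # both write the same register
    if older_r & younger_w:
        hazards.add("WAR")    # younger writes a register the older reads
    return hazards

# ADD R1,R2,R3 followed by SUB R4,R1,R5: the SUB reads R1.
print(classify_hazards(({"R1"}, {"R2", "R3"}), ({"R4"}, {"R1", "R5"})))
# -> {'RAW'}
```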
The reservation stations of the other execution units wait for the result from a particular execution unit by constantly snooping the CDB. The format of the bits in a reservation station is given below.
Qj = 0 indicates that Vj holds the value of operand 1.
Qk = 0 indicates that Vk holds the value of operand 2.
Qj = (m, j), where m != 0, indicates that operand 1 is to be taken from the output of the instruction with id 'j' currently executing in the m-th unit.
Qk = (m, j), where m != 0, indicates that operand 2 is to be taken from the output of the instruction with id 'j' currently executing in the m-th unit.
Busy = 0: the execution unit is free. Busy = 1: it is waiting for input.

Figure 4: An entry in the associative memory (Effective address, Unit number accessing it, Instruction id)

Using these units the various hazards are handled. The memory status register needs some explanation. It is used to handle the memory aliasing problem, which occurs in situations such as:

SD [R3+300], R4
LD R2, [R0+100]

A read-after-write conflict will occur if R3+300 = R0+100. To handle this problem, the associative memory status register is used; each entry in this associative memory is shown in Figure 4. Whenever a load is issued, it checks whether the associative memory already holds its address; if so, the load takes its value from the CDB when the unit recorded in the matching entry completes the instruction specified by that entry. This is called Tomasulo's scoreboard technique.

The architecture shown above is basically meant for handling data hazards. For the other two hazards, a separate kind of architecture is not necessary. Firstly, the structural hazard cannot be avoided. To handle the control hazard we can do one of the following:

- Stall the pipeline until completion of the branch instruction
- Use branch predictors

In this project, you will stall instructions till the branch condition is evaluated.

3.5 The CACHE

The cache is used to bridge the gap between the speed of the fast processor and that of the slow main memory. The cache memory is smaller than the main memory and faster than it.
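The Qj/Qk snooping behaviour can be sketched as follows (illustrative names, not the required RTL): when unit m broadcasts the result of instruction id j on the CDB, every reservation station whose Qj or Qk tag equals (m, j) captures the value into Vj or Vk and clears the tag to (0, 0).

```python
# Sketch of a reservation station snooping the CDB.
class ReservationStation:
    def __init__(self, op):
        self.op = op
        self.Qj = self.Qk = (0, 0)   # (0, 0): operand value already in Vj/Vk
        self.Vj = self.Vk = None

    def snoop(self, unit, instr_id, value):
        """Called on every CDB broadcast: (producing unit, instruction id, result)."""
        if self.Qj == (unit, instr_id):
            self.Vj, self.Qj = value, (0, 0)
        if self.Qk == (unit, instr_id):
            self.Vk, self.Qk = value, (0, 0)

    def ready(self):
        return self.Qj == (0, 0) and self.Qk == (0, 0)

rs = ReservationStation("ADD")
rs.Qj = (2, 7)                 # waiting on instruction id 7 in unit 2
rs.Vk = 10                     # operand 2 already available
rs.snoop(2, 7, 32)             # unit 2 broadcasts result 32 on the CDB
print(rs.ready(), rs.Vj)       # -> True 32
```

This is why every instruction carries an id through the execution units: (m, j) distinguishes the awaited result from the other instructions pipelined inside unit m.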
It sits between the processor and the main memory and holds data from the portion of main memory that is currently being referenced. The use of a cache is motivated by the principle of locality of reference. There are basically 2 types of cache, viz. the fully associative and the direct-mapped cache. We use a cache which is a combination of both: the set-associative cache.
The structure of a cache entry is given below.

Figure 5: An entry (line) in the set associative cache memory, holding C3 ways, each with Tag, Data, V and D fields

V: validity of the data.
D: dirty bit. If 1, the data has been written by the processor and is inconsistent with the data in memory. If 0, it has not been modified by the processor.

Caches use two policies for writing to memory:
1. Write through: If a value is to be written, it is updated in the cache and also written to main memory immediately.
2. Write back: The value is written only to the cache, and written to memory only when a location with D=1 and V=1 is to be replaced.

You will design a cache unit with a write-back policy for this project. The set associative cache has C2 cache lines; each can hold up to C3 cache entries. In other words, we design a C3-way set associative cache with C2 lines, so up to C3 collisions can be handled without having to replace a cache entry. The tag is the MSB portion of the address that is not used in cache address generation, and hence identifies the entry uniquely. The least significant log2(C2) bits of the main memory address are decoded to select a particular cache line; hence assume C2 to be a power of 2. The system bus has separate data lines and address lines.

1. Read hit - the cache line holds the value being searched for.
2. Read miss - the cache line does not hold the data, which hence needs to be fetched from memory.
3. Write hit - the cache entry to be written is already in the cache, so the update can be done in the cache only.
4. Write miss - the data already in the cache entry has to be written back to main memory, and the new data written into this cache entry.

Parameter list: A1, A2, A3, P1, C1, C2, C3, N

A1 - Number of CLAs in the processor.
A2 - Number of WTMs in the processor.
A3 - Number of LSUs in the processor.
P1 - Number of instructions fetched at a time.
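The tag/index split described above can be sketched as follows (an illustrative, word-addressed simplification, not the required RTL): the low log2(C2) bits of the address select the cache line, and the remaining high bits form the tag compared against all C3 ways of that line.

```python
# Sketch of set-associative lookup with C2 lines (C2 a power of 2).
def split_address(addr, C2):
    index_bits = C2.bit_length() - 1        # log2(C2)
    index = addr & (C2 - 1)                 # low bits select the line
    tag = addr >> index_bits                # remaining MSBs identify the entry
    return tag, index

def lookup(cache, addr, C2):
    """cache: list of C2 lines, each a list of up to C3 (tag, valid, data) ways."""
    tag, index = split_address(addr, C2)
    for way_tag, valid, data in cache[index]:
        if valid and way_tag == tag:
            return data                     # read hit
    return None                             # read miss

C2 = 4                                      # 4 lines -> 2 index bits
cache = [[] for _ in range(C2)]
tag, index = split_address(0b101101, C2)    # index = 0b01, tag = 0b1011
cache[index].append((tag, True, "hello"))
print(lookup(cache, 0b101101, C2))          # -> hello
```

A full design would also carry the D bit per way and, on a write miss to a full line, write back the evicted dirty entry as the write-back policy requires.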
C1 - Number of execution units whose results are to be committed simultaneously.
C2 - Number of cache lines in the set associative cache.
C3 - Number of cache entries held by a cache line in the set associative cache; in other words, the number of ways of the set-associative cache.

Once the RTL is developed, the next document will give you the verification plan, which will enable you to do the functional verification of your RTL.
4 Implementation

Your Verilog code must follow the synthesis guidelines discussed in class. You will be required to take the design through the various steps of the design flow later; the primary requirement for those stages is that the code is synthesizable. Further instructions will be given as you proceed. Remember that this is a group project, and partitioning of your design is an absolute requirement. Use your time judiciously. Unlike the project specifications for other groups, a grading scheme for the report is not provided; I will talk to the groups and decide on the grading policy.
More informationECSE 425 Lecture 6: Pipelining
ECSE 425 Lecture 6: Pipelining H&P, Appendix A Vu, Meyer Textbook figures 2007 Elsevier Science Last Time Processor Performance EquaQon System performance Benchmarks 2 Today Pipelining Basics RISC InstrucQon
More informationPipelining! Advanced Topics on Heterogeneous System Architectures. Politecnico di Milano! Seminar DEIB! 30 November, 2017!
Advanced Topics on Heterogeneous System Architectures Pipelining! Politecnico di Milano! Seminar Room @ DEIB! 30 November, 2017! Antonio R. Miele! Marco D. Santambrogio! Politecnico di Milano! 2 Outline!
More informationOverview. Appendix A. Pipelining: Its Natural! Sequential Laundry 6 PM Midnight. Pipelined Laundry: Start work ASAP
Overview Appendix A Pipelining: Basic and Intermediate Concepts Basics of Pipelining Pipeline Hazards Pipeline Implementation Pipelining + Exceptions Pipeline to handle Multicycle Operations 1 2 Unpipelined
More information(Basic) Processor Pipeline
(Basic) Processor Pipeline Nima Honarmand Generic Instruction Life Cycle Logical steps in processing an instruction: Instruction Fetch (IF_STEP) Instruction Decode (ID_STEP) Operand Fetch (OF_STEP) Might
More informationPipelining. Each step does a small fraction of the job All steps ideally operate concurrently
Pipelining Computational assembly line Each step does a small fraction of the job All steps ideally operate concurrently A form of vertical concurrency Stage/segment - responsible for 1 step 1 machine
More informationEE382A Lecture 7: Dynamic Scheduling. Department of Electrical Engineering Stanford University
EE382A Lecture 7: Dynamic Scheduling Department of Electrical Engineering Stanford University http://eeclass.stanford.edu/ee382a Lecture 7-1 Announcements Project proposal due on Wed 10/14 2-3 pages submitted
More informationComputer Architecture
Lecture 3: Pipelining Iakovos Mavroidis Computer Science Department University of Crete 1 Previous Lecture Measurements and metrics : Performance, Cost, Dependability, Power Guidelines and principles in
More informationHardware-based speculation (2.6) Multiple-issue plus static scheduling = VLIW (2.7) Multiple-issue, dynamic scheduling, and speculation (2.
Instruction-Level Parallelism and its Exploitation: PART 2 Hardware-based speculation (2.6) Multiple-issue plus static scheduling = VLIW (2.7) Multiple-issue, dynamic scheduling, and speculation (2.8)
More informationVery Simple MIPS Implementation
06 1 MIPS Pipelined Implementation 06 1 line: (In this set.) Unpipelined Implementation. (Diagram only.) Pipelined MIPS Implementations: Hardware, notation, hazards. Dependency Definitions. Hazards: Definitions,
More informationProcessor: Superscalars Dynamic Scheduling
Processor: Superscalars Dynamic Scheduling Z. Jerry Shi Assistant Professor of Computer Science and Engineering University of Connecticut * Slides adapted from Blumrich&Gschwind/ELE475 03, Peh/ELE475 (Princeton),
More informationLecture 3: The Processor (Chapter 4 of textbook) Chapter 4.1
Lecture 3: The Processor (Chapter 4 of textbook) Chapter 4.1 Introduction Chapter 4.1 Chapter 4.2 Review: MIPS (RISC) Design Principles Simplicity favors regularity fixed size instructions small number
More informationAppendix A. Overview
Appendix A Pipelining: Basic and Intermediate Concepts 1 Overview Basics of Pipelining Pipeline Hazards Pipeline Implementation Pipelining + Exceptions Pipeline to handle Multicycle Operations 2 1 Unpipelined
More informationCISC 662 Graduate Computer Architecture Lecture 13 - CPI < 1
CISC 662 Graduate Computer Architecture Lecture 13 - CPI < 1 Michela Taufer http://www.cis.udel.edu/~taufer/teaching/cis662f07 Powerpoint Lecture Notes from John Hennessy and David Patterson s: Computer
More informationAppendix C. Instructor: Josep Torrellas CS433. Copyright Josep Torrellas 1999, 2001, 2002,
Appendix C Instructor: Josep Torrellas CS433 Copyright Josep Torrellas 1999, 2001, 2002, 2013 1 Pipelining Multiple instructions are overlapped in execution Each is in a different stage Each stage is called
More informationPipeline Hazards. Jin-Soo Kim Computer Systems Laboratory Sungkyunkwan University
Pipeline Hazards Jin-Soo Kim (jinsookim@skku.edu) Computer Systems Laboratory Sungkyunkwan University http://csl.skku.edu Hazards What are hazards? Situations that prevent starting the next instruction
More informationare Softw Instruction Set Architecture Microarchitecture are rdw
Program, Application Software Programming Language Compiler/Interpreter Operating System Instruction Set Architecture Hardware Microarchitecture Digital Logic Devices (transistors, etc.) Solid-State Physics
More informationAppendix C. Authors: John Hennessy & David Patterson. Copyright 2011, Elsevier Inc. All rights Reserved. 1
Appendix C Authors: John Hennessy & David Patterson Copyright 2011, Elsevier Inc. All rights Reserved. 1 Figure C.2 The pipeline can be thought of as a series of data paths shifted in time. This shows
More informationASSEMBLY LANGUAGE MACHINE ORGANIZATION
ASSEMBLY LANGUAGE MACHINE ORGANIZATION CHAPTER 3 1 Sub-topics The topic will cover: Microprocessor architecture CPU processing methods Pipelining Superscalar RISC Multiprocessing Instruction Cycle Instruction
More informationLecture 7 Pipelining. Peng Liu.
Lecture 7 Pipelining Peng Liu liupeng@zju.edu.cn 1 Review: The Single Cycle Processor 2 Review: Given Datapath,RTL -> Control Instruction Inst Memory Adr Op Fun Rt
More informationPIPELINING: HAZARDS. Mahdi Nazm Bojnordi. CS/ECE 6810: Computer Architecture. Assistant Professor School of Computing University of Utah
PIPELINING: HAZARDS Mahdi Nazm Bojnordi Assistant Professor School of Computing University of Utah CS/ECE 6810: Computer Architecture Overview Announcement Homework 1 submission deadline: Jan. 30 th This
More informationPipelining and Exploiting Instruction-Level Parallelism (ILP)
Pipelining and Exploiting Instruction-Level Parallelism (ILP) Pipelining and Instruction-Level Parallelism (ILP). Definition of basic instruction block Increasing Instruction-Level Parallelism (ILP) &
More information3/12/2014. Single Cycle (Review) CSE 2021: Computer Organization. Single Cycle with Jump. Multi-Cycle Implementation. Why Multi-Cycle?
CSE 2021: Computer Organization Single Cycle (Review) Lecture-10b CPU Design : Pipelining-1 Overview, Datapath and control Shakil M. Khan 2 Single Cycle with Jump Multi-Cycle Implementation Instruction:
More informationComputer Architecture Spring 2016
Computer Architecture Spring 2016 Lecture 02: Introduction II Shuai Wang Department of Computer Science and Technology Nanjing University Pipeline Hazards Major hurdle to pipelining: hazards prevent the
More informationPipelining Analogy. Pipelined laundry: overlapping execution. Parallelism improves performance. Four loads: Non-stop: Speedup = 8/3.5 = 2.3.
Pipelining Analogy Pipelined laundry: overlapping execution Parallelism improves performance Four loads: Speedup = 8/3.5 = 2.3 Non-stop: Speedup =2n/05n+15 2n/0.5n 1.5 4 = number of stages 4.5 An Overview
More informationECE 505 Computer Architecture
ECE 505 Computer Architecture Pipelining 2 Berk Sunar and Thomas Eisenbarth Review 5 stages of RISC IF ID EX MEM WB Ideal speedup of pipelining = Pipeline depth (N) Practically Implementation problems
More informationMinimizing Data hazard Stalls by Forwarding Data Hazard Classification Data Hazards Present in Current MIPS Pipeline
Instruction Pipelining Review: MIPS In-Order Single-Issue Integer Pipeline Performance of Pipelines with Stalls Pipeline Hazards Structural hazards Data hazards Minimizing Data hazard Stalls by Forwarding
More informationLecture-13 (ROB and Multi-threading) CS422-Spring
Lecture-13 (ROB and Multi-threading) CS422-Spring 2018 Biswa@CSE-IITK Cycle 62 (Scoreboard) vs 57 in Tomasulo Instruction status: Read Exec Write Exec Write Instruction j k Issue Oper Comp Result Issue
More informationEECC551 Exam Review 4 questions out of 6 questions
EECC551 Exam Review 4 questions out of 6 questions (Must answer first 2 questions and 2 from remaining 4) Instruction Dependencies and graphs In-order Floating Point/Multicycle Pipelining (quiz 2) Improving
More informationPage 1. Recall from Pipelining Review. Lecture 15: Instruction Level Parallelism and Dynamic Execution
CS252 Graduate Computer Architecture Recall from Pipelining Review Lecture 15: Instruction Level Parallelism and Dynamic Execution March 11, 2002 Prof. David E. Culler Computer Science 252 Spring 2002
More informationThe Tomasulo Algorithm Implementation
2162 Term Project The Tomasulo Algorithm Implementation Assigned: 11/3/2015 Due: 12/15/2015 In this project, you will implement the Tomasulo algorithm with register renaming, ROB, speculative execution
More informationImproving Performance: Pipelining
Improving Performance: Pipelining Memory General registers Memory ID EXE MEM WB Instruction Fetch (includes PC increment) ID Instruction Decode + fetching values from general purpose registers EXE EXEcute
More informationmywbut.com Pipelining
Pipelining 1 What Is Pipelining? Pipelining is an implementation technique whereby multiple instructions are overlapped in execution. Today, pipelining is the key implementation technique used to make
More informationScoreboard information (3 tables) Four stages of scoreboard control
Scoreboard information (3 tables) Instruction : issued, read operands and started execution (dispatched), completed execution or wrote result, Functional unit (assuming non-pipelined units) busy/not busy
More informationSome material adapted from Mohamed Younis, UMBC CMSC 611 Spr 2003 course slides Some material adapted from Hennessy & Patterson / 2003 Elsevier
Some material adapted from Mohamed Younis, UMBC CMSC 611 Spr 2003 course slides Some material adapted from Hennessy & Patterson / 2003 Elsevier Science 6 PM 7 8 9 10 11 Midnight Time 30 40 20 30 40 20
More informationCS 252 Graduate Computer Architecture. Lecture 4: Instruction-Level Parallelism
CS 252 Graduate Computer Architecture Lecture 4: Instruction-Level Parallelism Krste Asanovic Electrical Engineering and Computer Sciences University of California, Berkeley http://wwweecsberkeleyedu/~krste
More informationVery Simple MIPS Implementation
06 1 MIPS Pipelined Implementation 06 1 line: (In this set.) Unpipelined Implementation. (Diagram only.) Pipelined MIPS Implementations: Hardware, notation, hazards. Dependency Definitions. Hazards: Definitions,
More informationPipelining. CSC Friday, November 6, 2015
Pipelining CSC 211.01 Friday, November 6, 2015 Performance Issues Longest delay determines clock period Critical path: load instruction Instruction memory register file ALU data memory register file Not
More informationLecture 3. Pipelining. Dr. Soner Onder CS 4431 Michigan Technological University 9/23/2009 1
Lecture 3 Pipelining Dr. Soner Onder CS 4431 Michigan Technological University 9/23/2009 1 A "Typical" RISC ISA 32-bit fixed format instruction (3 formats) 32 32-bit GPR (R0 contains zero, DP take pair)
More informationStructure of Computer Systems
288 between this new matrix and the initial collision matrix M A, because the original forbidden latencies for functional unit A still have to be considered in later initiations. Figure 5.37. State diagram
More informationDepartment of Computer and IT Engineering University of Kurdistan. Computer Architecture Pipelining. By: Dr. Alireza Abdollahpouri
Department of Computer and IT Engineering University of Kurdistan Computer Architecture Pipelining By: Dr. Alireza Abdollahpouri Pipelined MIPS processor Any instruction set can be implemented in many
More information1 Hazards COMP2611 Fall 2015 Pipelined Processor
1 Hazards Dependences in Programs 2 Data dependence Example: lw $1, 200($2) add $3, $4, $1 add can t do ID (i.e., read register $1) until lw updates $1 Control dependence Example: bne $1, $2, target add
More informationTomasulo s Algorithm
Tomasulo s Algorithm Architecture to increase ILP Removes WAR and WAW dependencies during issue WAR and WAW Name Dependencies Artifact of using the same storage location (variable name) Can be avoided
More informationCOMPUTER ORGANIZATION AND DESIGN
COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface 5 th Edition Chapter 4 The Processor Introduction CPU performance factors Instruction count Determined by ISA and compiler CPI and Cycle
More informationAdvanced Computer Architecture
Advanced Computer Architecture Chapter 1 Introduction into the Sequential and Pipeline Instruction Execution Martin Milata What is a Processors Architecture Instruction Set Architecture (ISA) Describes
More informationFull Datapath. Chapter 4 The Processor 2
Pipelining Full Datapath Chapter 4 The Processor 2 Datapath With Control Chapter 4 The Processor 3 Performance Issues Longest delay determines clock period Critical path: load instruction Instruction memory
More informationChapter 3 Instruction-Level Parallelism and its Exploitation (Part 1)
Chapter 3 Instruction-Level Parallelism and its Exploitation (Part 1) ILP vs. Parallel Computers Dynamic Scheduling (Section 3.4, 3.5) Dynamic Branch Prediction (Section 3.3) Hardware Speculation and Precise
More informationControl Hazards - branching causes problems since the pipeline can be filled with the wrong instructions.
Control Hazards - branching causes problems since the pipeline can be filled with the wrong instructions Stage Instruction Fetch Instruction Decode Execution / Effective addr Memory access Write-back Abbreviation
More informationNOW Handout Page 1. Review from Last Time #1. CSE 820 Graduate Computer Architecture. Lec 8 Instruction Level Parallelism. Outline
CSE 820 Graduate Computer Architecture Lec 8 Instruction Level Parallelism Based on slides by David Patterson Review Last Time #1 Leverage Implicit Parallelism for Performance: Instruction Level Parallelism
More informationAdvanced Computer Architecture CMSC 611 Homework 3. Due in class Oct 17 th, 2012
Advanced Computer Architecture CMSC 611 Homework 3 Due in class Oct 17 th, 2012 (Show your work to receive partial credit) 1) For the following code snippet list the data dependencies and rewrite the code
More informationComputer Architecture V Fall Practice Exam Questions
Computer Architecture V22.0436 Fall 2002 Practice Exam Questions These are practice exam questions for the material covered since the mid-term exam. Please note that the final exam is cumulative. See the
More informationEITF20: Computer Architecture Part3.2.1: Pipeline - 3
EITF20: Computer Architecture Part3.2.1: Pipeline - 3 Liang Liu liang.liu@eit.lth.se 1 Outline Reiteration Dynamic scheduling - Tomasulo Superscalar, VLIW Speculation ILP limitations What we have done
More informationLecture 5: Instruction Pipelining. Pipeline hazards. Sequential execution of an N-stage task: N Task 2
Lecture 5: Instruction Pipelining Basic concepts Pipeline hazards Branch handling and prediction Zebo Peng, IDA, LiTH Sequential execution of an N-stage task: 3 N Task 3 N Task Production time: N time
More information