CS 251, Winter 2018, Assignment 5.0.4
3% of course mark
Due Wednesday, March 21st, 4:30PM
Late submissions accepted until 10:00am March 22nd with a 15% penalty

1. (10 points) The code sequence below executes on a pipelined datapath where branching is determined in the ID stage. You must consider branch data hazards that might exist between the branch instruction and the instruction immediately before the branch. A one clock cycle delay is needed if the instruction immediately before the branch is an R-format instruction or an addi/subi instruction and a data dependency exists. A two clock cycle delay is needed if the instruction immediately before the branch is a load word instruction and a load-use hazard exists. You may assume that if a branch data hazard exists, the datapath will add in the necessary stalls (NOPs).

(a) (5 pts) Assume the datapath implements data forwarding and load-use stalling but does not implement Branch Flushing. Mark with (*) any instruction that has a hazard with a prior instruction. Rearrange the code to remove the load-use hazards and branch data hazards if they exist. Fill the branch delay slot if possible. If code rearrangement cannot be used, you may use NOPs.

 *   Original                   Rearranged Code
     100  addi $1, $0, 50
     104  add  $5, $0, $0
     108  lw   $2, 100($4)
     112  lw   $3, 200($4)
     116  sw   $3, 300($4)
     120  addi $4, $4, 4
     124  add  $5, $5, $2
     128  addi $1, $1, -1
     132  bne  $1, $0, -7
     136  add  $8, $5, $0
(b) (5 points) This part asks for calculations on the original sequence of instructions above, running on a pipelined datapath where branch is determined in the ID stage. You should assume that Branch Flushing exists for instructions that are not needed following the branch and that the datapath implements a one clock cycle stall if a Branch Data hazard exists. You should also assume that data forwarding and load-use stalling exist in the datapath.

(i) What is the total number of instructions that are flushed?

(ii) State the total number of clock cycles required to run the original sequence of code, including pipeline start-up time.
2. (6 points) Given the following execution times for individual components on the Pipelined datapath, find the minimum time that can be assigned to the clock cycle length (in class we always used a 200ps clock cycle for the pipelined datapath). You may assume Branch is determined in the MEM stage for this question.

Memory accesses: 120ps
Register File access: 70ps (read or write)
ALU computations: 150ps
Adders: 150ps
Sign Extension: 5ps
Shift Left by two: 5ps
MUXes: 10ps
Writing to Intermediate Pipeline Registers (IF/ID etc.): negligible
Reading data from any Pipeline register: negligible
Control Unit decode of instruction opcode bits: 10ps
ALU Control: 5ps

Assume all other components are negligible and that many operations occur in parallel. Complete the following table giving the minimum time needed for each stage to execute correctly. Be careful with the ID stage.

Stage   Min Time
IF
ID
EX
MEM
WB

State the shortest clock cycle time we could allow on the Pipelined Datapath:
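The reasoning behind question 2 can be sketched in a few lines: within a stage, components in series add up and parallel paths overlap, so a stage needs as long as its slowest path, and the clock cycle must be at least as long as the slowest stage. The function names and every number below are hypothetical placeholders chosen for illustration; they are not the component times or the answers for this question.

```python
def stage_time(paths):
    """Time for one stage: the slowest of its parallel paths.

    Each path is a list of component delays (ps) that occur in series.
    """
    return max(sum(path) for path in paths)


def min_clock_cycle(stage_times):
    """The clock must accommodate the slowest pipeline stage."""
    return max(stage_times.values())


# Hypothetical stage with two parallel paths: 30ps + 40ps in series, and 60ps alone.
example_stage = stage_time([[30, 40], [60]])

# Hypothetical per-stage times in ps (NOT the values asked for in question 2).
example_stages = {"IF": 100, "ID": 90, "EX": 140, "MEM": 110, "WB": 60}
example_cycle = min_clock_cycle(example_stages)
```

With these placeholder numbers the example stage takes 70ps and the clock cycle is set by the 140ps EX stage.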
3. (6 points) Given a simple high level loop:

    for (register int k=1; k<11; k++)
        A[k] = A[k-1] + k;

The following MIPS code implements the above high level code fragment. It is run on a pipelined datapath that performs branch in the MEM stage, has data forwarding and load-use stalls, and implements branch flushing for unwanted instructions following the branch.

096  addi $1, $0, 1      # k
100  addi $2, $0, 0      # index into A
104  addi $3, $0, 11     # end value of k
108  lw   $4, 200($2)
112  add  $4, $4, $1
116  sw   $4, 204($2)
120  addi $2, $2, 4
124  addi $1, $1, 1
128  bne  $1, $3, -6
132  slt  $2, $4, $0
136  add  $1, $1, $1
140  add  $2, $1, $2

(a) (2 points) How many total clock cycles (including flushed instructions and stalls) does the above code need to execute, ending after executing line 140? Be sure to include the time to start up the pipeline.

(b) (4 points) Rewrite this code using code rearrangement to solve any possible hazards. If hazards cannot be solved completely you may use NOPs. Note: Instructions that are not part of the loop should not be moved into the loop for any reason.

Line number   Rearranged Code
096
100
104
108
112
116
120
124
128
132
136
140
144
148
152
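Cycle counts such as the one requested in part (a) are commonly tallied by adding the pipeline fill time, the instructions that complete, the stall cycles, and the slots lost to flushed instructions. The sketch below assumes the five-stage pipeline (IF, ID, EX, MEM, WB) used in this course; the function name and the example numbers are hypothetical and are not the answer to this question.

```python
def total_cycles(instructions_executed, stall_cycles, flushed_slots, stages=5):
    """Clock cycles until the last instruction leaves the final stage.

    stages - 1 cycles fill the pipeline; after that, one instruction,
    stall bubble, or flushed instruction occupies each cycle.
    """
    return (stages - 1) + instructions_executed + stall_cycles + flushed_slots


# Hypothetical run: 10 instructions complete, 2 stall cycles, 3 flushed slots.
example = total_cycles(10, 2, 3)
```

For the placeholder run this gives 4 + 10 + 2 + 3 = 19 cycles.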
4. (7 points) The datapath on the next page shows the hardware needed to execute branch in the ID stage. The zero bit ANDed with the Branch control bit is missing from this diagram; however, you may assume it exists and that all the necessary hardware to take a branch in ID exists in the datapath.

As noted in question 1 of the assignment, when branching is determined in the ID stage, data hazards may now exist between the branch instruction and an instruction that immediately precedes it. In class we discussed data hazards between instructions in the EX stage and instructions in the MEM or WB stages. A condition that detects a data hazard between two instructions, copied from the course notes, is given below:

    if ((MEM/WB.RegWrite)
        and (MEM/WB.RegisterRd != 0)
        and (EX/MEM.RegisterRd != ID/EX.RegisterRs)
        and (MEM/WB.RegisterRd = ID/EX.RegisterRs))

This condition detects a data hazard between an instruction in the WB stage and an instruction in the EX stage.

(a) (4 points) State the conditions necessary to detect a data hazard between a branch instruction in the ID stage and an R-format instruction in the EX stage. You need only state the conditions necessary to detect a data hazard for the $rt register in the ID stage. There are no forwarding control bits that need to be set.

(b) (3 points) If a branch data hazard exists between a branch instruction in ID and an R-format instruction in EX, state how many stalls would be required between the two instructions. You may assume the necessary forwarding hardware was added to allow forwarding to the ID stage from the EX/MEM or MEM/WB pipeline registers. State which instruction would need to be stalled, which instruction would need to move forward, and how you would implement the stall.
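As a reading aid, the WB-to-EX condition quoted from the course notes can be written as a boolean function. This is only a sketch of that one given condition; the function and parameter names are hypothetical stand-ins for the pipeline register fields, and it does not answer parts (a) or (b).

```python
def wb_to_ex_hazard_rs(mem_wb_reg_write, mem_wb_rd, ex_mem_rd, id_ex_rs):
    """Data hazard on $rs between the instruction in WB and the one in EX.

    Mirrors the quoted condition term by term:
      MEM/WB.RegWrite
      and MEM/WB.RegisterRd != 0
      and EX/MEM.RegisterRd != ID/EX.RegisterRs
      and MEM/WB.RegisterRd == ID/EX.RegisterRs
    """
    return bool(
        mem_wb_reg_write
        and mem_wb_rd != 0
        and ex_mem_rd != id_ex_rs
        and mem_wb_rd == id_ex_rs
    )


# WB writes $5, EX reads $5 as rs, and MEM writes a different register: hazard.
hazard = wb_to_ex_hazard_rs(True, 5, 3, 5)
# Register $0 is never a real dependency, so no hazard is reported.
no_hazard_zero = wb_to_ex_hazard_rs(True, 0, 3, 0)
```

The third term suppresses the WB hazard when the instruction in MEM writes the same register, since that newer value is the one that should be forwarded.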
Pipelined datapath with Forwarding, Branch in ID stage. This is the WRONG datapath to use for question 2!
5. (15 points) Here is a series of address references given as 4-bit word addresses in both decimal and binary; we also list the relative time at which these references occur:

Addr      0    1    2    3    4    0    8    0    4    0    8    5    4    2    0
Binary 0000 0001 0010 0011 0100 0000 1000 0000 0100 0000 1000 0101 0100 0010 0000
Time      1    2    3    4    5    6    7    8    9   10   11   12   13   14   15

Below are four different 8-word caches (similar to Figure 5.14 of the text). For each cache type, assuming the cache is initially empty, show the final contents of the cache, and in the table at the bottom, show how many cache hits and misses there are for each type of cache.

Write your solution in the tables below, assuming the above word addresses are 4-bit binary numbers. You should write the binary form of the tag in the tables below, except for the fully associative cache, where you may write the decimal form of the tag. In the data column, write M[3] for data at memory address 3, M[8] for data at memory address 8. Assume an LRU replacement scheme. When inserting an element into the cache, if there are multiple empty slots for that index, you should put the new element in the left-most empty slot.

Direct mapped

Block  Tag  Data
0
1
2
3
4
5
6
7

Two-way set associative

Set  Tag  Data  Tag  Data
0
1
2
3

Four-way set associative

Set  Tag  Data  Tag  Data  Tag  Data  Tag  Data
0
1

Fully associative

Tag  Data  Tag  Data  Tag  Data  Tag  Data  Tag  Data  Tag  Data  Tag  Data  Tag  Data

Write the number of cache hits and misses for each scheme in the table below:

         Direct Mapped   Two-way s.a.   Four-way s.a.   Fully Associative
Hits
Misses
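The bookkeeping that question 5 asks for by hand can be sketched as a small simulator. It assumes one-word blocks, LRU replacement, and left-most placement into empty slots, as the question states; the function name and the three-reference trace are hypothetical and deliberately much shorter than the assignment's trace.

```python
def simulate_cache(addresses, num_blocks, ways):
    """Count hits and misses for a set-associative cache of one-word blocks.

    ways=1 gives a direct-mapped cache; ways=num_blocks gives a fully
    associative one. Returns (hits, misses, sets), where sets holds the
    final tag stored in each way (None for an empty slot).
    """
    num_sets = num_blocks // ways
    sets = [[None] * ways for _ in range(num_sets)]
    last_used = [[0] * ways for _ in range(num_sets)]
    hits = misses = 0
    for time, addr in enumerate(addresses, start=1):
        index = addr % num_sets   # low-order bits select the set
        tag = addr // num_sets    # remaining high-order bits form the tag
        slots = sets[index]
        if tag in slots:
            way = slots.index(tag)
            hits += 1
        else:
            misses += 1
            if None in slots:
                way = slots.index(None)  # left-most empty slot
            else:
                # Evict the least recently used way in this set.
                way = min(range(ways), key=lambda w: last_used[index][w])
            slots[way] = tag
        last_used[index][way] = time
    return hits, misses, sets


# Hypothetical short trace: addresses 1 and 9 collide in a direct-mapped cache.
trace = [1, 9, 1]
direct = simulate_cache(trace, num_blocks=8, ways=1)[:2]
two_way = simulate_cache(trace, num_blocks=8, ways=2)[:2]
```

On this placeholder trace the direct-mapped cache misses all three references, while the two-way cache keeps both words in set 1 and hits on the third reference.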
6. (6 points) Suppose we have a 16-word, 2-way set associative cache that is partially filled as indicated below (a missing tag indicates that the cache entry is invalid; a tag indicates that the cache entry is valid). Only the tags are shown in the cache (i.e., we have omitted the data stored in the cache).

index tag0 tag1
000 00 10
001 00 01
010 11 00
011 00 00
100 00 01
101 11 10
110 01
111 00

(a) (1 point) What is wrong with the cache entries in this cache?

(b) (4 points) Assume the cache starts partially full as shown above. Given the following word accesses, fill in the table with cache hits or misses.

Binary Addr   Hit/miss
00 000
01 000
00 100
11 101
10 110
10 100        Miss

(c) (1 point) We labeled the last cache access of the previous part as a Miss. After fetching this word from memory, we will need to replace one word in the cache. Assuming we have executed the sequence of memory accesses listed in the previous part of this question, which of tag0 and tag1 would you replace? Justify your answer.
7. (2 points) AMAT (average memory access time) is the average (expected) time it takes for a memory access, considering both hits and misses. It can be calculated using the formula:

    AMAT = hit time + miss rate × miss penalty

The miss rate is the percentage of memory accesses that are not in cache. The miss penalty is the additional time it takes for memory access to the next higher level in the hierarchy.

Suppose you only have a level 1 cache, and that level 1 cache hits in 1 clock cycle with a miss rate of 4%. The cost to access main memory (the miss penalty) is 120 clock cycles.

AMAT =

8. (3 points) CPI is a measure of clock cycles per instruction that is used to compare Instruction Set Architectures based on a particular instruction mix. Assume an instruction mix of 14% load words, 10% store words, 60% R-format, 10% branches, and 6% jumps.

Consider a pipelined datapath where branch is determined in the MEM stage and the datapath implements data forwarding, load-use stalling, and branch flushing when necessary. Assume half of all branch instructions cause flushing of unwanted instructions following the branch. A quarter of all load words are followed by a use and generate a load-use hazard. The jump instruction is determined in the ID stage, and every jump instruction requires flushing the 1 instruction behind it.

State the average CPI and be sure to show your work.

CPI =
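The AMAT formula from question 7, and the usual way of building an average CPI from a base CPI plus per-event penalties as in question 8, can be sketched as two small functions. The function names, the base CPI of 1 for an ideal pipeline, and all example numbers are assumptions for illustration; they are not the values or answers for these questions.

```python
def amat(hit_time, miss_rate, miss_penalty):
    """Average memory access time: hit time + miss rate * miss penalty."""
    return hit_time + miss_rate * miss_penalty


def average_cpi(penalties, base_cpi=1.0):
    """Base CPI plus the expected extra cycles per instruction.

    penalties is a list of (fraction_of_all_instructions, cycles_lost)
    pairs, one per kind of stall or flush event.
    """
    return base_cpi + sum(fraction * cycles for fraction, cycles in penalties)


# Hypothetical cache: 2-cycle hit, 10% miss rate, 50-cycle miss penalty.
example_amat = amat(2, 0.10, 50)

# Hypothetical events: 5% of instructions lose 1 cycle, 10% lose 3 cycles.
example_cpi = average_cpi([(0.05, 1), (0.10, 3)])
```

With these placeholder inputs the AMAT is 7 cycles and the CPI is 1.35. Note that each fraction is taken over all instructions, so an event that affects only part of an instruction class must first be scaled by that class's share of the mix.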
9. BONUS (5 points) Below is a diagram showing the Forwarding Unit in the pipelined datapath. The inputs and outputs of the forwarding unit are indicated in the diagram. You need only consider a data hazard between the instruction in the EX stage of the pipeline and an instruction in the MEM stage.

In the space below, implement the circuit that will partially implement the ForwardB signal to the multiplexor before the ALU. You only need to generate the signal to detect a hazard between the $rt source register of the instruction in the EX stage and an instruction in the MEM stage of the pipeline. You must show and correctly label all of the necessary inputs and outputs that you use, and indicate with a slash the width of each input/output. You may use any of the gates that we discussed in class. (You may not use decoders, multiplexors or comparators.)

This question will be marked all or nothing, meaning it must be exactly correct for 5 bonus marks; otherwise you will receive zero.
The remaining questions will NOT be used to compute your assignment mark; they are included here as additional questions you may want to try to aid your understanding of the course material.

Exercises from the textbook: 5.2.1, 5.2.2, 5.3, 5.7.1, 5.7.2, 5.7.3.