CS 251, Winter 2019, Assignment % of course mark


Jemima Hodges
5 years ago

Due Wednesday, March 27th, 5:30PM
Lates accepted until 1:00pm March 28th with a 15% penalty

1. (10 points) The code sequence below executes on a pipelined datapath where branching is determined in the ID stage. You must consider branch data hazards that might exist between the branch instruction and the instruction immediately before the branch. A one clock cycle delay is needed if the instruction immediately before the branch is an R-format instruction or an addi/subi instruction and a data dependency exists. A two clock cycle delay is needed if the instruction immediately before the branch is a load word instruction and a load-use hazard exists. You may assume that if a branch data hazard exists, the datapath will add the necessary stalls (NOPs).

(a) (6 points) Assume the datapath implements data forwarding and load-use stalling but does not implement Branch Flushing. Mark with (*) any instruction that has a hazard (control hazard or data hazard) between itself and a prior instruction. Rearrange the code to remove the load-use hazards and branch data hazards if they exist. Fill the branch delay slot if possible. If code rearrangement cannot be used, you may use NOPs.

*   Original                 Rearranged Code
    100 addi $1, $0,
        add $5, $0, $0
    108 lw $3, 100($4)
    112 sw $3, 300($4)
    116 lw $2, 200($4)
    120 sw $2, 400($4)
    124 addi $4, $4,
        add $5, $5, $2
    132 addi $1, $1,
        bne $1, $0,
        add $8, $5, $0

(b) (4 points) This question asks for calculations for the original sequence of instructions above, running on a pipelined datapath where branch is determined in the ID stage. You should assume that Branch Flushing exists for instructions that are not needed following the branch and that the datapath implements a one clock cycle stall for a Branch Data hazard. You should also assume that data forwarding and load-use stalling exist in the datapath.

(i) What is the total number of instructions that are flushed?

(ii) State the total number of clock cycles required to run the original sequence of code, including pipeline start-up time.

2. (6 points) Given the following execution times for individual components on the Pipelined datapath, find the minimum time that can be assigned to the clock cycle length (i.e., in class we always used a 200ps clock cycle for the pipelined datapath). You may assume Branch is determined in the MEM Stage for this question.

Memory accesses: 120ps
Register File access: 75ps (read or write)
ALU computations: 140ps
Adders: 140ps
Sign Extension: 5ps
Shift Left by two: 5ps
MUXes: 10ps
Writing to Intermediate Pipeline Registers (IF/ID etc.): Negligible
Reading data from any Pipeline registers: Negligible
Control Unit decode of instruction opcode bits: 10ps
ALU Control: 5ps

Assume all other components are negligible and many operations occur in parallel. Complete the following table giving the minimum time needed for each stage to execute correctly. Be careful with the ID stage.

Stage   Min Time
IF
ID
EX
MEM
WB

State the shortest clock cycle time we could allow on the Pipelined Datapath:
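The reasoning behind question 2 can be sketched in a few lines of Python. This is a minimal illustration, not the course's solution: it assumes a stage's minimum time is the longest chain of components that must operate one after another (parallel chains overlap), and that the clock period is set by the slowest stage. The function names and the numbers in the example are hypothetical placeholders, not the answer to the table above.

```python
def stage_latency(parallel_paths):
    """Minimum time for one stage, in ps.

    Each path is a list of component delays that happen in series;
    paths run in parallel, so the slowest path sets the stage time.
    """
    return max(sum(path) for path in parallel_paths)


def min_clock_period(stage_latencies):
    """Shortest usable clock period: every stage must fit in one cycle."""
    return max(stage_latencies.values())


# Hypothetical example (NOT the values asked for in question 2):
# one stage with two parallel paths, a 120 ps access and a 140 ps + 10 ps chain.
example_stage = stage_latency([[120], [140, 10]])

# Hypothetical per-stage times in ps, chosen only to show the max() step.
example_stages = {"IF": 200, "ID": 100, "EX": example_stage, "MEM": 200, "WB": 100}
print(example_stage, min_clock_period(example_stages))
```

To apply the sketch to the question, each stage's serial chains would have to be read off the datapath diagram first; that step is deliberately left out here.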

3. (6 points) Given a simple high level loop:

for (register int k=1; k<10; k++)
    A[k] = A[k-1] + 2*k;

The following MIPS code implements the above high level code fragment. It is run on the pipelined datapath that performs branch in the MEM stage and has data forwarding and load-use stalls. Note: The datapath does not implement branch flushing for unwanted instructions following the branch.

096 addi $1,$0,1     # k
100 addi $2,$0,0     # index into A located at
    addi $3,$0,10    # end value of k
108 lw $4,200($2)    # read A[k-1]
112 add $4,$4,$1     # add k to A[k-1]
116 add $4,$4,$1     # add k to A[k-1] again
120 sw $4,204($2)    # store A[k-1]+2k in A[k]
124 addi $2,$2,4     # next index into A
128 addi $1,$1,1     # next k
132 bne $1,$3,-7     # branch if not done
136 slt $4,$2,$0     # code immediately following the loop
140 add $2,$5,$6
144 add $4,$3,$3
148 addi $1,$3,2

(a) (2 points) Some of the instructions following the loop will execute erroneously. Regardless, how many total clock cycles does the above code need to execute through line 148? Be sure to include the time to start up the pipeline and all loop iterations.

(cont)

(b) (4 points) Rewrite this code using code rearrangement to solve any possible hazards, including data hazards, load-use hazards or control hazards. If a hazard cannot be solved completely, you must use NOPs to indicate that the hazard cannot be solved using code rearrangement. Note: Instructions that are not part of the loop should not be moved into the loop, and instructions inside the loop should remain in the loop even if you could achieve a performance gain by making such a change.

Line number   Rearranged Code

4. (7 points) The datapath on the next page shows the hardware needed to execute branch in the ID stage. The zero bit ANDed with the Branch control bit is missing from this diagram; however, you may assume it exists and that all the necessary hardware to take a branch in the ID stage exists in the datapath. As noted in question 1 of the assignment, when branching is determined in the ID stage, data hazards may now exist between the branch instruction and an instruction that immediately precedes it. In class we discussed data hazards between instructions in the EX stage and instructions in the MEM or WB stages. A condition to detect a data hazard between two instructions has been copied from the course notes and is given below:

if ((MEM/WB.RegWrite) and (MEM/WB.RegisterRd != 0) and (EX/MEM.RegisterRd != ID/EX.RegisterRs) and (MEM/WB.RegisterRd = ID/EX.RegisterRs))

This condition detects a data hazard between an instruction in the WB stage and an instruction in the EX stage.

a) (4 points) State the conditions necessary to detect a data hazard between a branch instruction in the ID stage and an R-format instruction in the EX stage. You only need to state the conditions necessary to detect a data hazard for the $rt register in the ID stage. There are no forwarding control bits that need to be set.

b) (3 points) If a branch data hazard exists between a branch instruction in the ID stage and an R-format instruction in the EX stage, state how many stalls would be required between the two instructions. You may assume the necessary forwarding hardware was added to allow forwarding to the ID stage from the EX/MEM or MEM/WB pipeline registers. State which instruction would need to be stalled, which instruction would need to move forward, and how you would implement the stall.
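The condition quoted from the course notes in question 4 can be written out as a small predicate. The sketch below only restates that WB-to-EX condition for the $rs source; it is not an answer to parts a) or b). The function and parameter names are hypothetical stand-ins for the pipeline register fields named in the condition.

```python
def wb_to_ex_hazard_rs(mem_wb_reg_write, mem_wb_rd, ex_mem_rd, id_ex_rs):
    """Return True when the quoted WB-stage to EX-stage hazard condition holds.

    mem_wb_reg_write: MEM/WB.RegWrite control bit
    mem_wb_rd:        MEM/WB.RegisterRd (destination of the instruction in WB)
    ex_mem_rd:        EX/MEM.RegisterRd (destination of the instruction in MEM)
    id_ex_rs:         ID/EX.RegisterRs (first source of the instruction in EX)
    """
    return bool(
        mem_wb_reg_write            # the WB instruction actually writes a register
        and mem_wb_rd != 0          # register $0 is never a real destination
        and ex_mem_rd != id_ex_rs   # the MEM instruction does not supply a newer value
        and mem_wb_rd == id_ex_rs   # the WB destination is the EX source
    )


# The instruction in WB writes $5 and the instruction in EX reads $5 as $rs.
print(wb_to_ex_hazard_rs(True, 5, 3, 5))
# Same registers, but the instruction in MEM also writes $5, so the
# quoted condition leaves that case to the MEM-stage check instead.
print(wb_to_ex_hazard_rs(True, 5, 5, 5))
```

Each line of the predicate corresponds to one clause of the quoted condition, which makes it easier to see which pipeline register fields a similar condition for another pair of stages would need to compare.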

Pipelined datapath with Forwarding, Branch in ID stage. This is the WRONG datapath to use for question 2!

5. (15 points) Here is a series of address references given as 4-bit word addresses in both decimal and binary; we also list the relative time at which these references occur:

Addr   Binary   Time

Below are four different 8-word caches (similar to Figure 5.14 of the text). For each cache type, assuming the cache is initially empty, show the final contents of the cache, and in the table at the bottom, show how many cache hits and misses there are for each type of cache. Write your solution in the tables below, assuming the above word addresses are 4-bit binary numbers. You should write the binary form of the tag in the tables below, except for the fully associative cache, where you may write the decimal form of the tag. In the data column, write M[3] for data at memory address 3, M[8] for data at memory address 8. Assume an LRU replacement scheme. When inserting an element into the cache, if there are multiple empty slots for that index, you should put the new element in the left-most empty slot.

Direct mapped
Block   Tag   Data

Two-way set associative
Set   Tag   Data   Tag   Data

Four-way set associative
Set   Tag   Data   Tag   Data   Tag   Data   Tag   Data
0
1

Fully associative
Tag   Data   Tag   Data   Tag   Data   Tag   Data   Tag   Data   Tag   Data   Tag   Data   Tag   Data

Write the number of cache hits and misses for each scheme in the table below:

         Direct Mapped   Two-way s.a.   Four-way s.a.   Fully Associative
Hits
Misses
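The bookkeeping that question 5 asks for can be checked with a short simulator. The sketch below is one possible way to model it, assuming one-word blocks, a set index taken from the low-order address bits, and LRU replacement within each set. The function name and the address trace are hypothetical; the trace is not the one from the assignment's table. The sketch counts hits and misses only and does not track which physical slot within a set holds each entry.

```python
from collections import OrderedDict


def simulate_cache(addresses, num_blocks=8, ways=1):
    """Count hits and misses for a word-addressed cache with LRU replacement.

    ways=1 is direct mapped; ways=num_blocks is fully associative.
    Returns (hits, misses, sets), where each set maps tag -> data label,
    ordered from least to most recently used.
    """
    num_sets = num_blocks // ways
    sets = [OrderedDict() for _ in range(num_sets)]
    hits = misses = 0
    for addr in addresses:
        index, tag = addr % num_sets, addr // num_sets
        lines = sets[index]
        if tag in lines:
            hits += 1
            lines.move_to_end(tag)          # mark as most recently used
        else:
            misses += 1
            if len(lines) == ways:          # set is full: evict the LRU entry
                lines.popitem(last=False)
            lines[tag] = f"M[{addr}]"
    return hits, misses, sets


# Hypothetical trace of 4-bit word addresses (not the assignment's data).
trace = [3, 8, 3, 11, 0, 8, 3]
for name, ways in [("direct mapped", 1), ("two-way", 2),
                   ("four-way", 4), ("fully associative", 8)]:
    hits, misses, _ = simulate_cache(trace, num_blocks=8, ways=ways)
    print(f"{name}: {hits} hits, {misses} misses")
```

Using an `OrderedDict` per set keeps the LRU order implicit: the first key is always the next eviction candidate, so no separate timestamp column is needed.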

6. (6 points) Suppose we have a 16-word, 2-way set associative cache that is partially filled as indicated below (a missing tag indicates that the cache entry is invalid; a tag indicates that the cache entry is valid). Only the tags are shown in the cache (i.e., we have omitted the data stored in the cache).

index   tag0   tag1

(a) (1 point) What is wrong with the cache entries in this cache?

(b) (4 points) Assume the cache starts partially full as shown above. Given the following word accesses, fill in the table with cache hits or misses.

Binary Addr   Hit/miss
              Miss

(c) (1 point) We labeled the last cache access of the previous question as a Miss. After fetching this word from memory, we will need to replace one word in the cache. Assuming we have executed the sequence of memory accesses listed in the previous part of this question, which of tag0 and tag1 would you replace? Justify your answer.

7. (3 points) CPI is a measure of clock cycles per instruction that is used to compare Instruction Set Architectures based on a particular instruction mix. Assume an instruction mix of 15% Load words, 10% Store words, 60% R-format, 10% Branch, and 5% Jumps.

You are given a Pipelined datapath where branch is determined in the MEM stage and the datapath implements data forwarding, load-use stalling and branch flushing when necessary. Assume half of all branch instructions cause flushing of unwanted instructions following the branch. A quarter of all load words are followed by a use and generate a load-use hazard. The jump instruction is determined in the ID stage, and every jump instruction requires flushing 1 instruction behind it.

State the average CPI and be sure to show your work. You do not need to show the final answer, only the formulas you used.

CPI =
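One way to organise the formula for question 7 is a base CPI of 1 plus the expected penalty cycles per instruction from each hazard source. The sketch below follows that reading and is not the official solution. Two values are assumptions not stated in the question: a taken branch resolved in the MEM stage is taken to cost 3 flushed instructions, and a load-use hazard is taken to cost 1 stall cycle. The 1-instruction jump flush comes from the question. Pipeline start-up time is ignored.

```python
BRANCH_FLUSH_PENALTY = 3  # assumption: branch resolved in MEM flushes 3 instructions
LOAD_USE_PENALTY = 1      # assumption: one stall cycle per load-use hazard
JUMP_FLUSH_PENALTY = 1    # from the question: each jump flushes 1 instruction


def average_cpi(load_frac, branch_frac, jump_frac,
                load_use_rate, branch_flush_rate):
    """Base CPI of 1 plus the expected penalty cycles per instruction."""
    load_stalls = load_frac * load_use_rate * LOAD_USE_PENALTY
    branch_flushes = branch_frac * branch_flush_rate * BRANCH_FLUSH_PENALTY
    jump_flushes = jump_frac * JUMP_FLUSH_PENALTY
    return 1.0 + load_stalls + branch_flushes + jump_flushes


# Instruction mix and hazard rates as given in question 7.
cpi = average_cpi(load_frac=0.15, branch_frac=0.10, jump_frac=0.05,
                  load_use_rate=0.25, branch_flush_rate=0.5)
print(cpi)
```

Store and R-format instructions do not appear in the function because, under these assumptions, they add no penalty cycles beyond the base CPI.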

8. BONUS (5 points) Below is a diagram showing the pipelined datapath where Branch is determined in the ID stage. A new forwarding unit (Branch ID Forwarding Unit) has been added to the ID stage. It generates the signals ForwardC and ForwardD to forward to the Branch instruction in the ID stage if a data hazard exists between a branch in the ID stage and an instruction in the MEM stage. In the drawing, some of the connections in the ID stage have been broken for clarity. You can assume the datapath works exactly as it did previously when Branch is determined in the ID stage. Further, the ForwardC and ForwardD signals are only shown as inputs to the multiplexors they control and are generated by the Branch ID Forwarding Unit.

On the next page, provide the circuit that will implement the ForwardD control bit to the multiplexor before the comparator in the ID stage (you do not have to worry about forwarding from other stages in the datapath). You only need to generate the signal as true or false in order to detect a hazard between the $rt source register of a Branch instruction in the ID stage and an instruction in the MEM stage of the pipeline. You must show and correctly label all of the necessary inputs and outputs that you use and indicate with a slash the width of each input/output. You may use any of the gates that we discussed in class. You may use inputs and information from anywhere in the datapath in order to correctly detect the branch data hazard. (You may not use decoders, multiplexors or comparators.)

Special marking note: Your answer must be exactly correct for the full 5 bonus marks; for a mostly correct answer, you will receive only 2 bonus points; answers with any significant errors will receive no bonus points.

Q8 solution:

The remaining questions will NOT be used to compute your assignment mark; they are included here as additional questions you may want to try to aid your understanding of the course material.

Exercises from the textbook: 5.2.1, 5.2.2, 5.3, 5.7.1, 5.7.2,
