CS/CoE 1541 Mid Term Exam (Fall 2018). Name: ______________

Question 1: (6+3+3+4+4 = 20 points) For this question, refer to the following pipeline architecture.

a) Consider the execution of the following code (5 instructions) on the architecture:

L:  lw   r3, 8(r10)
    add  r4, r5, r6
    beq  r9, r10, LL
    sw   r7, 100(r8)
    addi r1, r11, 8

Assuming that at the end of some cycle, C, the lw instruction is in the MEM/WB buffer, specify the values of the following control signals at the end of cycle C (use x if you do not care about the value being 0 or 1).

ID/E.RegWrite: 0    E/MEM.RegWrite: 1    MEM/WB.RegWrite: 1
ID/E.MemtoReg: x    E/MEM.MemtoReg: 1    MEM/WB.MemtoReg: 0
ID/E.MemWrite: 0    E/MEM.MemWrite: 0
ID/E.MemRead:  x    E/MEM.MemRead:  x
ID/E.Branch:   1
ID/E.RegDst:   x
ID/E.ALUSrc:   0
b) What will be stored in the PC and the IF/ID buffer at the end of cycle C?
- The PC will store: L+16
- The IF/ID buffer will store: A will store L+16 and B will store the 32-bit instruction sw r7, 100(r8)

c) What is the reason for having the three shaded components in the E stage?
- The shaded shift-left-2 unit: to multiply the branch offset by 4.
- The shaded ALU unit: to compute the branch target address.
- The shaded AND gate: to force the target address into the PC only when a taken branch is in E.

d) Now assume that the branch condition of the beq instruction, which is in the ID/E buffer at the end of cycle C, evaluates to true. What actions (in terms of changing the values stored in the inter-stage buffers) will be taken by the architecture during cycle C+1 as a result of the branch being taken?
1) PC = the target address
2) IF/ID = 0 (to force opcode = 0, the opcode of a no-op)
3) ID/E.MemWrite = ID/E.RegWrite = ID/E.Branch = 0

e) If 25% of the instructions executing on this architecture are branch instructions, 30% are lw/sw instructions, and 45% are ALU instructions, what would be the CPI for the architecture assuming that 40% of the branches are taken (consider only control hazards and ignore the effect of other hazards)?

CPI = 1 + 0.25 * 0.4 * 2 = 1.2
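The arithmetic in part (e) can be checked with a minimal Python sketch; it charges each taken branch the two bubbles created by the flush described in part (d). Variable names are illustrative.

```python
# CPI with control hazards only: a taken branch flushes the two
# instructions behind it (in IF/ID and ID/E), adding 2 stall cycles.
branch_frac = 0.25   # fraction of instructions that are branches
taken_frac = 0.40    # fraction of branches that are taken
penalty = 2          # bubbles per taken branch (branch resolved in E)

cpi = 1 + branch_frac * taken_frac * penalty
print(round(cpi, 2))  # 1.2
```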
Question 2 (5+4+4 = 13 points): Assume that the 5-stage pipelined architecture has only one memory unit, which is used both for fetching instructions (in the IF stage) and for loading/storing data (in the MEM stage). Hence, if in a given cycle the IF stage wants to read an instruction and the MEM stage wants to read or write data, then one of the two stages has to stall and use the memory in a later cycle. Assume that when the IF and MEM stages compete for the memory in a given cycle, priority is given to the MEM stage.

a) Complete the following timing diagram to trace the execution of the first 7 instructions of the following code segment:

I1: lw   $1, 100($0)
I2: and  $3, $2, $4
I3: add  $9, $9, $10
I4: sw   $7, 50($6)
I5: sub  $3, $3, $1
I6: sub  $2, $2, $2
I7: bneq $9, $6, I1
I8: ...
I9: ...

Cycle    IF        ID        E         MEM       WB
1        I1: lw
2        I2: and   I1: lw
3        I3: add   I2: and   I1: lw
4        I4: sw    I3: add   I2: and   I1: lw
5        I4: sw    --        I3: add   I2: and   I1: lw
6        I5: sub   I4: sw    --        I3: add   I2: and
7        I6: sub   I5: sub   I4: sw    --        I3: add
8        I7: bneq  I6: sub   I5: sub   I4: sw    --
9        I7: bneq  --        I6: sub   I5: sub   I4: sw
10                 I7: bneq  --        I6: sub   I5: sub
11                           I7: bneq  --        I6: sub
12                                     I7: bneq  --
13                                               I7: bneq

(In cycles 4 and 8 the fetch is stalled because lw and sw, respectively, occupy the memory in the MEM stage.)

b) Assuming that the loop (I1-I7) will execute 10000 iterations and that the branch condition/target is resolved in the E stage, compute the CPI during the execution of the loop if the branch is always predicted not-taken (ignore the time to fill up the pipeline).

During each iteration (7 instructions) there will be two bubbles due to structural hazards and two bubbles due to control hazards. Hence, each iteration takes 11 cycles to complete. CPI = 11/7.

c) What would the CPI be if a 1-bit branch predictor is used?

A 1-bit branch predictor will be correct 9998 times out of 10000, so the misprediction penalty can be ignored; the two structural-hazard bubbles remain. Hence, each iteration takes 9 cycles to complete. CPI = 9/7.
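The per-iteration cycle counts in parts (b) and (c) can be reproduced with a short sketch (variable names are illustrative):

```python
# Cycles per loop iteration (I1-I7): 7 instructions plus bubbles.
instructions = 7
structural_bubbles = 2   # lw and sw each delay one instruction fetch
control_bubbles = 2      # taken branch resolved in E, predicted not-taken

cpi_not_taken = (instructions + structural_bubbles + control_bubbles) / instructions
cpi_predictor = (instructions + structural_bubbles) / instructions  # mispredictions negligible

print(round(cpi_not_taken, 3))  # 1.571  (= 11/7)
print(round(cpi_predictor, 3))  # 1.286  (= 9/7)
```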
Question 3 (6+6 = 12 points): The following figure shows two forwarding paths from the E/MEM and MEM/WB buffers to the E stage. A forwarding unit sets the mux control signals, A and B, to 1 or 2 whenever it detects data dependencies that may cause hazards.

a) Using the notation given in the figure for the information stored in the ID/E, E/MEM and MEM/WB buffers, state the condition(s) that will cause the forwarding unit to set control signal A to the value 1.

(MEM/WB.Rd = ID/E.Rs) AND (MEM/WB.RegWrite = 1) AND (MEM/WB.Rd != 0)

b) As is clear from the figure, the branch condition is determined in the ID stage. A data hazard that we did not discuss in class occurs when the instruction in the E stage writes into a register, $r, and in the same cycle a branch instruction uses register $r to determine the branch condition. For example,

add $r1, $r2, $r3
beq $r1, $r5, L

causes a data hazard because the condition of the branch is evaluated using the old value of $r1 rather than the value computed by the add instruction. By drawing on the figure, show an additional forwarding path that can be used to avoid the above data hazard.
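The condition in part (a) can be sketched as code. This assumes, by symmetry with the figure, that A = 2 selects the E/MEM path (only A = 1 is defined by the answer above), and follows the usual rule that the newer E/MEM result takes priority when both buffers match; the function and argument names are illustrative.

```python
def forward_a(id_e_rs, ex_mem_rd, ex_mem_regwrite, mem_wb_rd, mem_wb_regwrite):
    """Mux control for the ALU's first operand.

    0 = value read from the register file
    1 = forward from the MEM/WB buffer (the condition in part a)
    2 = forward from the E/MEM buffer (assumed convention)
    """
    # The E/MEM result is newer, so it takes priority when both match.
    if ex_mem_regwrite and ex_mem_rd != 0 and ex_mem_rd == id_e_rs:
        return 2
    if mem_wb_regwrite and mem_wb_rd != 0 and mem_wb_rd == id_e_rs:
        return 1
    return 0

print(forward_a(id_e_rs=5, ex_mem_rd=7, ex_mem_regwrite=1,
                mem_wb_rd=5, mem_wb_regwrite=1))  # 1
```

Note the Rd != 0 test: register $0 is hardwired to zero, so a "result" destined for it must never be forwarded.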
Question 4 (5+5+5 = 15 points): Consider the following loop, which accumulates in $r3 the sum of the N elements of a vector. The value of N is initially stored in $r5 and the address of the first element of the vector is initially stored in $r2.

I1: lw   $r1, 0($r2)
I2: add  $r3, $r3, $r1
I3: addi $r2, $r2, -4
I4: addi $r5, $r5, -1
I5: bneq $r5, $0, I1

In this question, assume a superscalar architecture with two pipelines, one for ALU/branch instructions and the other for lw/sw instructions, with hardware support for forwarding.

a) Show, using the table below, the schedule for one iteration of the loop on the superscalar. You should rearrange the instructions to minimize the execution time (of course, without violating correctness).

b) Assuming that N is a multiple of 3, show the code for the loop after unrolling it twice.

I1: lw   $r1, 0($r2)
    lw   $r6, -4($r2)
    lw   $r7, -8($r2)
    add  $r3, $r3, $r1
    add  $r3, $r3, $r6
    add  $r3, $r3, $r7
    addi $r2, $r2, -12
    addi $r5, $r5, -3
    bneq $r5, $0, I1

c) Show, using the table below, the schedule for one iteration of the unrolled loop on the superscalar. You should rearrange the instructions to minimize the execution time.

Answer of part a: schedule for the original loop (one iteration)

Cycle   ALU/branch pipeline    Load/store pipeline
1       addi $r5, $r5, -1      lw $r1, 0($r2)
2       addi $r2, $r2, -4
3       add  $r3, $r3, $r1
4       bneq $r5, $0, I1

Answer of part c: schedule for the unrolled loop

Cycle   ALU/branch pipeline    Load/store pipeline
1       addi $r5, $r5, -3      lw $r1, 0($r2)
2       addi $r2, $r2, -12     lw $r6, -4($r2)
3       add  $r3, $r3, $r1     lw $r7, 4($r2)
4       add  $r3, $r3, $r6
5       add  $r3, $r3, $r7
6       bneq $r5, $0, I1

(The lw in cycle 2 issues alongside the addi and therefore still reads the old value of $r2; the third load's offset changes from -8 to +4 because $r2 has already been decremented by 12.)
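The unrolling in part (b) can be sanity-checked with a Python sketch of the same idea: three elements are accumulated per iteration and the counter is decremented by 3 (function name and variables are illustrative, and the loop runs forward rather than downward through memory):

```python
def vector_sum_unrolled(v):
    """Sum a vector three elements per iteration, mirroring the
    structure of the unrolled loop; len(v) must be a multiple of 3."""
    assert len(v) % 3 == 0
    total = 0
    i = 0
    n = len(v)
    while n != 0:          # bneq $r5, $0, I1
        total += v[i]      # the three unrolled add/lw pairs
        total += v[i + 1]
        total += v[i + 2]
        i += 3             # addi $r2, $r2, -12 (pointer step)
        n -= 3             # addi $r5, $r5, -3
    return total

print(vector_sum_unrolled([1, 2, 3, 4, 5, 6]))  # 21
```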
Question 5 (10 points): Indicate which of the following sentences is true and which is false. For the first 4 sentences, refer to the execution of the following sequence of instructions on a 5-stage CPU pipeline:

I1: add $r1, $r3, $r2
I2: lw  $r1, 100($r2)
I3: lw  $r3, 104($r1)

1. Assuming no forwarding or stalling hardware, a data hazard will occur because I2 will not use the correct register data. (true / false)
2. Assuming no forwarding or stalling hardware, a data hazard will occur because I3 will not use the correct register data. (true / false)
3. Assuming forwarding but no stalling hardware, a data hazard will occur because I2 will not use the correct register data. (true / false)
4. Assuming forwarding but no stalling hardware, a data hazard will occur because I3 will not use the correct register data. (true / false)
5. When an exception occurs in a 5-stage pipelined CPU, the address of the exception handler is stored in the EPC (Exception PC) register. (true / false)
6. In the DRAM open-row policy, opening a row occurs before every access to the row. (true / false)
7. In the DRAM closed-row policy, closing a row occurs after every access to the row. (true / false)
8. In the write-allocate cache policy, a block is allocated in the cache on either a read or a write miss. (true / false)
9. Increasing the cache's associativity reduces compulsory misses. (true / false)
10. If memory words are accessed sequentially (e.g., 1, 2, 3, 4, 5, 6, ...) and the cache block size is 4 words, then the cache miss rate is 25%. (true / false)
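The scenario in statement 10 can be simulated directly. This sketch assumes a cache with 4-word blocks and sequential word accesses, so only the first access to each block misses (function name is illustrative):

```python
def miss_rate_sequential(n_accesses, block_words=4):
    """Miss rate for sequential word accesses: only the first
    access to each new block misses."""
    cached_block = None
    misses = 0
    for addr in range(n_accesses):
        block = addr // block_words   # block containing this word
        if block != cached_block:     # first touch of the block
            misses += 1
            cached_block = block
    return misses / n_accesses

print(miss_rate_sequential(1000))  # 0.25
```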
Question 6 (4+5+6 = 15 points): In this question, use the notation [tag, M(address), ...] to describe the content of each entry. For example, [4, M(46)] indicates that the entry contains tag = 4 and the data from memory location (word address) 46. Similarly, [4, M(46), M(47)] indicates that the entry contains a block of two words from word addresses 46 and 47.

a) Show the content of each of the caches shown below after the two memory references (word addresses) 31, 36.

(a) An 8-word, direct-mapped cache with block size = 1 word:

Index   Content of cache
0
1
2
3
4       [4, M(36)]
5
6
7       [3, M(31)]

(b) An 8-word, direct-mapped cache with block size = 2 words:

Block index   Content of cache
0
1
2             [4, M(36), M(37)]
3             [3, M(30), M(31)]

b) Show the content of the cache shown below after the three memory references (word addresses) 31, 36, 20.

(c) A 32-word, 2-way set-associative cache with block size = 2 words:

Block index   Content of cache (way 1)   Content of cache (way 2)
0
1
2             [2, M(36), M(37)]          [1, M(20), M(21)]
3
4
5
6
7             [1, M(30), M(31)]
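The placements above follow from splitting each word address into block offset, index, and tag. A small helper for the direct-mapped cases (a) and (b) (a sketch; names are illustrative):

```python
def place(word_addr, num_blocks, block_words):
    """Return (index, tag) of the block holding word_addr
    in a direct-mapped cache."""
    block_addr = word_addr // block_words   # drop the block offset
    return block_addr % num_blocks, block_addr // num_blocks

# Cache (a): 8 blocks of 1 word each
print(place(31, 8, 1))  # (7, 3)
print(place(36, 8, 1))  # (4, 4)
# Cache (b): 4 blocks of 2 words each
print(place(31, 4, 2))  # (3, 3)
print(place(36, 4, 2))  # (2, 4)
```

For the set-associative cache (c), the same split applies with num_blocks replaced by the number of sets (8), which yields set 7/tag 1 for address 31 and set 2 with tags 2 and 1 for addresses 36 and 20.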
Question 7 (5+5+5 = 15 points): Assume that the CPI of a particular CPU is 2.0 with an infinitely large cache and consider the addition of a 4-way, 1 MByte L1 cache to the CPU. You have two possible options for the L1 cache:

Option C1: uses a block size of 32 bytes and results in an effective miss rate of 4%.
Option C2: uses a block size of 64 bytes and results in an effective miss rate of 3%.

Because the block size in C2 is larger than the block size in C1, the miss penalty for C2 is 100 cycles while the miss penalty of C1 is 80 cycles.

(a) By computing and comparing the CPI for each of the cache options, determine which option results in a faster system.

CPI with C1 (32-byte blocks) = 2.0 + 0.04 * 80 = 5.2
CPI with C2 (64-byte blocks) = 2.0 + 0.03 * 100 = 5.0

Hence, option C2 is faster than option C1.

(b) To compare the total number of bits (for tag and data, ignoring other bits) required to implement the L1 cache, complete the following sentences:

For option C1,
- the tag for each block in the cache consists of N-5-13 = N-18 bits (14 if N = 32)
- the total number of blocks in the cache is 2^15

For option C2,
- the tag for each block in the cache consists of N-6-12 = N-18 bits (14 if N = 32)
- the total number of blocks in the cache is 2^14

Hence, option C1 uses more bits (total) for tags and data than option C2.

(c) Assume that an L2 cache is added to option C2 to reduce the miss penalty for blocks that are not in L1 but are in L2 to 10 cycles. Assuming that the miss penalty of L2 is 100 cycles and that its hit rate is 30% (30% of the misses from L1 are found in L2), what is the CPI of the system with two levels of caches?

CPI = 2.0 + 0.03 * (10 + 0.7 * 100) = 4.4
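The three CPI figures can be reproduced with a short sketch. In the two-level term, every L1 miss pays the 10-cycle L2 access, and the 70% of those misses that also miss in L2 pay the additional 100-cycle penalty (variable names are illustrative):

```python
base_cpi = 2.0  # CPI with an infinitely large cache

# Part (a): single-level cache options
cpi_c1 = base_cpi + 0.04 * 80    # C1: 32-byte blocks, 4% miss rate, 80-cycle penalty
cpi_c2 = base_cpi + 0.03 * 100   # C2: 64-byte blocks, 3% miss rate, 100-cycle penalty

# Part (c): C2 plus an L2 cache (10-cycle access, 30% of L1 misses hit in L2)
cpi_two_level = base_cpi + 0.03 * (10 + 0.7 * 100)

print(round(cpi_c1, 2), round(cpi_c2, 2), round(cpi_two_level, 2))  # 5.2 5.0 4.4
```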