ECE 3056 Exam II - Solutions November 8th, 2017
1. (15 pts) To the base pipeline we add data forwarding to EX, data hazard detection and stall generation, and branches implemented in MEM and predicted not taken with flushing. For a load-to-use hazard, a check will generate a stall = 1 signal when this hazard is detected.

a. (5 pts) Write the Boolean expression to generate this stall signal. Use the signal notation from the figures, defining any other variables if you feel you need them.

If (ID/EX.MemRead = 1) & ((ID/EX.RegisterRt = IF/ID.RegisterRs) or (ID/EX.RegisterRt = IF/ID.RegisterRt)) Stall = 1

An equivalent solution is possible by checking between EX and MEM and inserting the stall in the MEM stage.

b. (15 pts) How should the data path be modified to use this stall signal to ensure correct implementation of stalls for the load-to-use dependency?

The following specific actions are necessary when performing the check between ID and EX.

Stall the IF/ID register and the PC. This can be done by gating the clock at the PC and the IF/ID register with the stall signal (the logical AND of the clock and the inverted stall signal), so that neither register is updated while stall = 1.

The control signals generated in ID should be zero to create a stall cycle in EX at the next clock.

If the check is being performed between EX and MEM, equivalent actions have to be performed, including stalling the EX stage.
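The stall condition in part (a) can be sketched as a small Python function. This is only an illustrative model of the Boolean expression, not part of the exam collateral: the function name `load_use_stall` and its argument names are assumptions chosen to mirror the pipeline register fields, and the register numbers in the example are arbitrary.

```python
def load_use_stall(id_ex_mem_read, id_ex_rt, if_id_rs, if_id_rt):
    """Return 1 when the instruction in ID must stall behind a load in EX.

    Mirrors: (ID/EX.MemRead = 1) & ((ID/EX.RegisterRt = IF/ID.RegisterRs)
                                   or (ID/EX.RegisterRt = IF/ID.RegisterRt))
    """
    depends_on_load = id_ex_rt in (if_id_rs, if_id_rt)
    return int(bool(id_ex_mem_read) and depends_on_load)


# A load writing $6 is in EX; the instruction in ID reads $6 and $7.
print(load_use_stall(1, 6, 6, 7))  # 1
# Same registers, but the instruction in EX is not a load.
print(load_use_stall(0, 6, 6, 7))  # 0
```

When the function returns 1, the actions in part (b) apply: the PC and IF/ID register hold their values and the control signals entering ID/EX are zeroed.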
2. (20 pts) To the base pipeline we add data forwarding to EX with data hazard detection (load-to-use) with stall generation. Now imagine that the implementation of the branch has been moved to ID and predicted not taken with flushing, but there is no forwarding to ID. Consider the following code sequence at the stages indicated in cycle i.

    sub  $8, $12, $11    -- WB
    add  $4, $8, $9      -- MEM
    add  $4, $4, $7      -- EX
    beq  $4, $6, loop    -- ID
    subi $2, $2, 4       -- IF

This code will not execute correctly on this pipeline.

a) (5 pts) Show what the correct state of the pipeline should be at cycle i+1, i.e., what should be in each stage.

    IF     subi $2, $2, 4
    ID     beq $4, $6, loop
    EX     <stall>
    MEM    add $4, $4, $7
    WB     add $4, $8, $9

b) (15 pts) Describe a general hardware solution to ensure correct execution when there are data hazards on branches. Clearly define i) the checks to be made, and ii) how they are implemented, i.e., what are the main functional blocks. It is probably easier to show additions to one of the figures.

Check for hazards and insert the correct number of stalls. Hazards can occur when beq is in ID and rformat or lw instructions are in EX or MEM. Stalls ensure the producer instruction gets to WB, since reads and writes can happen in the same cycle.

Check dependencies between EX and ID and insert a stall in EX if necessary, for both rformat and lw dependencies. Checks compare source and destination registers and check for instruction types, i.e., beq in ID and rformat or lw in EX.

If ((ID/EX.ALUOp = 10) & (IF/ID.Opcode = beq)) & ((ID/EX.RegisterRd = IF/ID.RegisterRs) or (ID/EX.RegisterRd = IF/ID.RegisterRt)) Stall = 1;

If (ID/EX.MemRead = 1) & ((ID/EX.RegisterRt = IF/ID.RegisterRs) or (ID/EX.RegisterRt = IF/ID.RegisterRt)) Stall = 1; (The question notes that this check is already present.)
Similar checks are performed between MEM and ID and may insert a stall in EX. Both EX and MEM checks could be concurrently positive - generate the logical OR of the outcome of the checks in each stage. There are also alternative combinations of variables for these checks but they look very similar. These checks will insert 1 or 2 stalls. Alternatively, one could add forwarding to ID from MEM and add checks between EX and ID. However, you still need a check between MEM and ID for lw.
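The combined EX and MEM checks can be modelled with a short sketch. It is a simplification under stated assumptions: the function `branch_stall` is hypothetical, and instead of decoding ALUOp and MemRead it takes the destination register of whatever instruction sits in EX and MEM (or `None` for a bubble or an instruction that writes no register).

```python
def branch_stall(branch_in_id, rs, rt, ex_dest, mem_dest):
    """Return 1 if a branch in ID must wait for a producer in EX or MEM.

    ex_dest / mem_dest: destination register number of the instruction in
    that stage, or None when the stage holds a bubble or a non-writing
    instruction (an assumption of this sketch).
    """
    if not branch_in_id:
        return 0
    sources = (rs, rt)
    ex_hit = ex_dest is not None and ex_dest in sources
    mem_hit = mem_dest is not None and mem_dest in sources
    # Both checks may be positive at once, so the results are ORed.
    return int(ex_hit or mem_hit)


# Question 2 sequence: beq $4, $6, loop waits in ID while the adds drain.
cycles = [
    (4, 4),        # cycle i:   add $4,$4,$7 in EX, add $4,$8,$9 in MEM
    (None, 4),     # cycle i+1: bubble in EX, add $4,$4,$7 in MEM
    (None, None),  # cycle i+2: bubbles in EX and MEM, producer is in WB
]
print([branch_stall(True, 4, 6, ex, mem) for ex, mem in cycles])  # [1, 1, 0]
```

For this sequence the check is positive in two consecutive cycles, which matches the statement that the checks insert 1 or 2 stalls.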
3. (15 pts) To the base pipeline we add data forwarding to EX, data hazard detection with stall generation, and branches are implemented in MEM and predicted not taken with flushing. Consider the following code sequence in the pipeline at cycle 10. In the attached forwarding figure, ForwardA is the top mux and ForwardB is the bottom mux in the collateral. The inputs are numbered 00, 01, 10 from top to bottom.

    add $1, $2, $3    -- WB
    add $1, $1, $4    -- MEM
    add $1, $1, $5    -- EX
    add $1, $1, $6    -- ID
    add $1, $1, $7    -- IF

a) (5 pts) What should be the value of ForwardA and ForwardB during cycles 10 and 11?

    ForwardA @ cycle 10: 10
    ForwardB @ cycle 10: 00
    ForwardA @ cycle 11: 10
    ForwardB @ cycle 11: 00

The MEM stage will always have the latest value for $1. Hence rs will be forwarded from MEM. The mux value from the figure is 10. The rt value is unique, has not been updated, and therefore will be whatever was read from the register file; hence the mux value (from the figure) will be 00.

b) (10 pts) Now consider that we pipeline some stages in the interest of increasing the clock rate. For each approach separately, what are the number of clock cycles for the branch and load-to-use penalties?

    IF ID EX MEM WB

                                          Branch Penalty    Load-to-Use Penalty
    i.   MEM is pipelined to 3 stages           3                   3
    ii.  IF is pipelined to two stages          4                   1
    iii. EX is pipelined to two stages          4                   2

Pipelining MEM to 3 stages will not add to the branch penalty, since a branch instruction does not need the result from MEM. The AND gate will operate on the equality test result from EX in the first cycle of MEM. However, a load-to-use dependency needs the result from MEM, and hence the load-to-use penalty will increase by 2 cycles.

When IF is pipelined to 2 stages, the branch penalty will increase by 1 cycle, but the load-to-use penalty does not increase since this test is between ID and EX.

If EX is pipelined to 2 stages, the branch penalty will increase by 1 cycle, since the branch test has to complete and all preceding stages are flushed when the branch is taken. The load-to-use penalty will increase by 1 cycle. The hazard test can be performed in the first cycle of EX. However, there has to be a 2-cycle gap between the lw and the dependent instruction. When lw is in WB, the dependent instruction should be in the first cycle of EX.
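The ForwardA and ForwardB values in part (a) can be checked with a sketch of a forwarding unit. The source only states that 10 selects the value from MEM and 00 selects the register file read; treating 01 as the WB path and ignoring writes to $0 follow the usual textbook convention and are assumptions here, as is the function name `forward_select`.

```python
def forward_select(src, ex_mem_reg_write, ex_mem_rd, mem_wb_reg_write, mem_wb_rd):
    """Return the mux select for one ALU source register of the instruction in EX."""
    # The newest value lives in EX/MEM, so that check takes priority.
    if ex_mem_reg_write and ex_mem_rd != 0 and ex_mem_rd == src:
        return "10"
    # Assumed encoding: 01 forwards from MEM/WB.
    if mem_wb_reg_write and mem_wb_rd != 0 and mem_wb_rd == src:
        return "01"
    return "00"  # use the value read from the register file


# Cycle 10: add $1,$1,$5 in EX, add $1,$1,$4 in MEM, add $1,$2,$3 in WB.
forward_a = forward_select(1, True, 1, True, 1)  # rs = $1
forward_b = forward_select(5, True, 1, True, 1)  # rt = $5
print(forward_a, forward_b)  # 10 00
```

Cycle 11 gives the same result, because the instruction in EX again reads $1 as rs while the previous add, which writes $1, is in MEM.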
4. (20 pts) To the base pipeline we add data forwarding to EX, data hazard detection with stall generation, and branches are implemented in MEM and predicted not taken with flushing. Consider the execution of the code below. The first instruction is at 0x00400000.

    1         add  $7, $0, $0
    2         addi $5, $0, 72
    3         addi $3, $0, 0x1028
    4  loop:  lw   $4, 0($3)
    5         lw   $9, -8($3)
    6         lw   $6, -4($3)
    7         add  $7, $6, $7
    8         add  $7, $9, $7
    9         add  $7, $4, $7
    10        addi $3, $3, -12
    11        addi $5, $5, -12
    12        bne  $5, $0, loop

a. (10 pts) Identify all data and control hazards, i.e., I -> J where I and J are instruction numbers.

Data hazard from 6 -> 7.
Control hazard on the bne.
The question is asking for hazards, not dependencies.

b. (10 pts) Fill in the values below for cycle 9 (start counting at 0!). Use the notation in the base pipeline in the collateral.

    Pipeline Signal        Instruction in the stage    Value
    IF/ID.PC4              add $7, $9, $7              0x00400020
    ID/EX.ALUSrc           add $7, $6, $7              0
    ID/EX.RegDst           add $7, $6, $7              1
    EX/MEM.MEM.Address     <stall>                     0x1024
    MEM/WB.MemToReg        lw $6, -4($3)               1
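The distinction between hazards and dependencies in part (a) can be illustrated with a short scan of the listing. This is a sketch under the assumptions of this question (forwarding to EX, so only a load followed immediately by a consumer needs a stall); the tuple encoding of the program and the function name `load_use_hazards` are hypothetical.

```python
# (instruction number, opcode, destination register, source registers)
program = [
    (1, "add", 7, (0, 0)),
    (2, "addi", 5, (0,)),
    (3, "addi", 3, (0,)),
    (4, "lw", 4, (3,)),
    (5, "lw", 9, (3,)),
    (6, "lw", 6, (3,)),
    (7, "add", 7, (6, 7)),
    (8, "add", 7, (9, 7)),
    (9, "add", 7, (4, 7)),
    (10, "addi", 3, (3,)),
    (11, "addi", 5, (5,)),
    (12, "bne", None, (5, 0)),
]


def load_use_hazards(instructions):
    """Return (I, J) pairs where J directly follows a load I and reads its result."""
    hazards = []
    for producer, consumer in zip(instructions, instructions[1:]):
        number, opcode, dest, _ = producer
        if opcode == "lw" and dest in consumer[3]:
            hazards.append((number, consumer[0]))
    return hazards


print(load_use_hazards(program))  # [(6, 7)]
```

Every other dependency in the loop, such as 5 -> 8 or 11 -> 12, is either far enough apart or resolved by forwarding to EX, so it does not appear as a hazard.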
5. (20 pts) Consider a 32 Kbyte direct mapped cache with 64 byte lines operating with 32-bit addresses.

i. (5 pts) How many lines are there in the cache?

2^15 / 2^6 = 2^9 = 512 lines

ii. (5 pts) Show the breakdown of the address bits into fields used to address the cache.

    Tag: 17 bits    Line index: 9 bits    Byte offset: 6 bits

iii. (10 pts) Consider a read to address 0x004000F8 that misses in the cache and is processed to bring the corresponding line into the cache. For each of the following addresses, if it is the next reference to the cache, identify whether it will be a reference to a word in the same line (Y/N).

    i.   0x00401000    N
    ii.  0x00400020    N
    iii. 0x004000C4    Y
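The field breakdown and the same-line answers can be verified with a few lines of Python. This is a minimal sketch; the constant and function names are illustrative, and it assumes the cache size and line size are powers of two, as they are in this question.

```python
CACHE_BYTES = 32 * 1024
LINE_BYTES = 64
ADDRESS_BITS = 32

NUM_LINES = CACHE_BYTES // LINE_BYTES                # 512
OFFSET_BITS = LINE_BYTES.bit_length() - 1            # 6
INDEX_BITS = NUM_LINES.bit_length() - 1              # 9
TAG_BITS = ADDRESS_BITS - INDEX_BITS - OFFSET_BITS   # 17


def split(addr):
    """Split a 32-bit address into (tag, line index, byte offset)."""
    offset = addr & (LINE_BYTES - 1)
    index = (addr >> OFFSET_BITS) & (NUM_LINES - 1)
    tag = addr >> (OFFSET_BITS + INDEX_BITS)
    return tag, index, offset


def same_line(a, b):
    """Two addresses share a line when both tag and index match."""
    return split(a)[:2] == split(b)[:2]


miss = 0x004000F8
for addr in (0x00401000, 0x00400020, 0x004000C4):
    print(hex(addr), "Y" if same_line(miss, addr) else "N")
```

The line brought in by the miss covers addresses 0x004000C0 through 0x004000FF, which is why only 0x004000C4 falls in the same line.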
6. (10 pts) To the base pipeline we add data forwarding to EX, branches are implemented in MEM, and support for jumps is implemented in ID. There is no hardware support for hazard detection or associated flushing/stall generation. Compiler support maintains correctness via the insertion of nops for any possible data or control hazards. Executing programs results in the following statistics. 40% of branches are taken and 20% of all load operations produce a hazard. The compiler can reorganize the code to fill 60% of branch delay slots, 30% of the jump delay slots, and 50% of load delay slots. What is the improvement in execution time achieved by such instruction scheduling? Assume a base CPI of 1.0. You may leave answers in expression form.

    Instruction       Frequency
    Loads             22%
    Stores            13%
    ALU Operations    42%
    Branches          18%
    Jumps             5%

EX_1 = #I * CPI_1 * clock_cycle

CPI_1 = CPI_base + 0.22 * 0.2 * 1 + 0.18 * 3 + 0.05 * 1 = 1.0 + 0.634 = 1.634

Note that the branch taken probability does not matter. There is no hazard detection, so every branch will incur a 3-cycle penalty. The load-to-use and jump penalties are 1 cycle.

CPI_2 = CPI_base + 0.22 * 0.2 * 0.5 * 1 + 0.18 * 0.4 * 3 + 0.05 * 0.7 * 1 = 1.0 + 0.273 = 1.273

Without flushing, every branch will incur a penalty, but this penalty has been reduced by 60% with instruction scheduling. Similarly, the load-to-use and jump penalties have been reduced.

EX_2 = #I * CPI_2 * clock_cycle

Improvement in execution time = EX_1 - EX_2
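The two CPI expressions can be evaluated directly. The sketch below only restates the arithmetic of the solution; the variable names are illustrative, and the penalties (3 cycles per branch, 1 cycle per load hazard and per jump) are the ones stated in the explanation.

```python
cpi_base = 1.0
loads, branches, jumps = 0.22, 0.18, 0.05   # instruction frequencies
load_hazard_rate = 0.20                      # fraction of loads causing a hazard
branch_penalty, load_penalty, jump_penalty = 3, 1, 1
fill_branch, fill_jump, fill_load = 0.60, 0.30, 0.50  # slots the compiler fills

# Before scheduling: every hazard slot holds a nop.
cpi_1 = (cpi_base
         + loads * load_hazard_rate * load_penalty
         + branches * branch_penalty
         + jumps * jump_penalty)

# After scheduling: only the unfilled fraction of slots still costs cycles.
cpi_2 = (cpi_base
         + loads * load_hazard_rate * (1 - fill_load) * load_penalty
         + branches * (1 - fill_branch) * branch_penalty
         + jumps * (1 - fill_jump) * jump_penalty)

print(round(cpi_1, 3), round(cpi_2, 3))  # 1.634 1.273
```

Since the instruction count and clock cycle are the same in both cases, the improvement EX_1 - EX_2 equals #I * (CPI_1 - CPI_2) * clock_cycle, which is #I * 0.361 * clock_cycle.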
Forwarding Paths