CS2100 Computer Organisation
Tutorial #10: Pipelining
Answers to Selected Questions

Tutorial Questions

2. [AY2014/5 Semester 2 Exam]
Refer to the following MIPS program:

    # register $s0 contains a 32-bit value
    # register $s1 contains a non-zero 8-bit value
    # in the rightmost (least significant) byte
          add  $t0, $s0, $zero    #inst A
          add  $s2, $zero, $zero  #inst B
    lp:   bne  $s2, $zero, done   #inst C
          beq  $t0, $zero, done   #inst D
          andi $t1, $t0, 0xFF     #inst E
          bne  $s1, $t1, nt       #inst F
          addi $s2, $s2, 1        #inst G
    nt:   srl  $t0, $t0, 8        #inst H
          j    lp                 #inst J
    done:

We assume that the register $s0 contains 0xAFAFFAFA and $s1 contains 0xFF.

Given a 5-stage MIPS pipeline processor, for each of the parts below, give the total number of cycles needed for the first iteration of the execution from instructions A to H (i.e. excluding the j lp). Remember to include the cycles needed for instruction H to finish the WB stage. Note that the parts are independent of each other.

a. With only data forwarding mechanisms and no control hazard mechanism.
b. With data forwarding and assuming not-taken branch prediction. Note that there is no early branching.
c. By swapping two instructions (from Instructions A to H), we can improve the performance of early branching (with all additional forwarding paths). Give the two instructions that can be swapped. You only need to indicate the instruction letters in your answer.

AY2017/8 Semester 2 - 1 of 5 - CS2100 Tutorial #10 Selected Answers
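Answers:

a) 20 cycles

With no control hazard mechanism, the instruction after each branch (C, D and F) is fetched only after the branch has passed the MEM stage, i.e. 3 stall cycles per branch.

    Cycle     1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20
    A add     F  D  E  M  W
    B add        F  D  E  M  W
    C bne           F  D  E  M  W
    D beq                       F  D  E  M  W
    E andi                                  F  D  E  M  W
    F bne                                      F  D  E  M  W
    H srl                                                  F  D  E  M  W

The addi instruction (G) is not executed: $t1 = 0xFA differs from $s1 = 0xFF, so the bne at F is taken.

b) 14 cycles

The branches at C and D are not taken, so the prediction is correct and costs nothing. The bne at F is taken, so the addi fetched after it is flushed (shown as *) and srl is fetched once F has passed the MEM stage.

    Cycle     1  2  3  4  5  6  7  8  9 10 11 12 13 14
    A add     F  D  E  M  W
    B add        F  D  E  M  W
    C bne           F  D  E  M  W
    D beq              F  D  E  M  W
    E andi                F  D  E  M  W
    F bne                    F  D  E  M  W
    G addi                      F  D  E  *  *
    H srl                                F  D  E  M  W

c) Swap instructions A and B to reduce the delay between instructions B and C.

          add  $s2, $zero, $zero  #inst B
          add  $t0, $s0, $zero    #inst A
    lp:   bne  $s2, $zero, done   #inst C
          beq  $t0, $zero, done   #inst D
          andi $t1, $t0, 0xFF     #inst E
          bne  $s1, $t1, nt       #inst F
          addi $s2, $s2, 1        #inst G
    nt:   srl  $t0, $t0, 8        #inst H
          j    lp                 #inst J
    done: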
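
The cycle counts for parts (a) and (b) of Question 2 can be cross-checked with a few lines of Python. This is only a sketch under stated assumptions: forwarding removes every data stall (there is no load in this code), each branch is resolved in the MEM stage, and an unresolved or mispredicted branch costs 3 cycles. The function names (first_iteration, cycles_stall_on_branch, cycles_predict_not_taken) are made up for illustration and do not come from the tutorial.

```python
# Hedged sketch: trace the first pass through instructions A to H and count
# cycles on a 5-stage pipeline. All names here are illustrative assumptions.

STAGES = 5
BRANCH_PENALTY = 3  # assumed: branch resolved in MEM, so 3 cycles are lost


def first_iteration(s0, s1):
    """Return [(letter, is_branch, taken)] for the instructions executed."""
    trace = []
    t0 = s0                                  # A: add  $t0, $s0, $zero
    trace.append(("A", False, False))
    s2 = 0                                   # B: add  $s2, $zero, $zero
    trace.append(("B", False, False))
    taken = s2 != 0                          # C: bne  $s2, $zero, done
    trace.append(("C", True, taken))
    if taken:
        return trace
    taken = t0 == 0                          # D: beq  $t0, $zero, done
    trace.append(("D", True, taken))
    if taken:
        return trace
    t1 = t0 & 0xFF                           # E: andi $t1, $t0, 0xFF
    trace.append(("E", False, False))
    taken = s1 != t1                         # F: bne  $s1, $t1, nt
    trace.append(("F", True, taken))
    if not taken:
        trace.append(("G", False, False))    # G: addi $s2, $s2, 1
    trace.append(("H", False, False))        # H: srl  $t0, $t0, 8
    return trace


def cycles_stall_on_branch(trace):
    """Part (a): every branch stalls the fetch until it leaves MEM."""
    branches = sum(1 for _, is_branch, _ in trace if is_branch)
    return len(trace) + (STAGES - 1) + BRANCH_PENALTY * branches


def cycles_predict_not_taken(trace):
    """Part (b): only taken (mispredicted) branches pay the penalty."""
    wrong = sum(1 for _, is_branch, taken in trace if is_branch and taken)
    return len(trace) + (STAGES - 1) + BRANCH_PENALTY * wrong


trace = first_iteration(0xAFAFFAFA, 0xFF)
print("".join(letter for letter, _, _ in trace))  # ABCDEFH
print(cycles_stall_on_branch(trace))              # 20
print(cycles_predict_not_taken(trace))            # 14
```

The count is "instructions executed + 4 cycles to drain the pipeline + branch penalties", which is why G does not contribute in either part.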
3. Consider the following code fragment:

    loop: lw   $t1, 0($t2)        #i1
          addi $t1, $t1, 1        #i2
          sw   $t1, 0($t2)        #i3
          addi $t2, $t2, 4        #i4
          sub  $t4, $t3, $t2      #i5
          bne  $t4, $zero, loop   #i6

For simplicity, the setup code is not given. You can assume that $t2 refers to a valid array element and $t3 is initialized to $t2 + 400 at the start of the code.

Let us study a pipeline processor with the different mechanisms discussed in lecture. For each of the following parts, give:
- Timing chart for {i1 to i6} + {i1 from next iteration}
- Total cycles needed for the code

Data Hazards

a. Suppose the processor has no mechanisms to handle RAW data hazards as well as control hazards. The processor will stall on these hazards until the hazards "disappear". Note that the register file still supports "write-then-read" in a single cycle.
b. Suppose the processor has full forwarding paths for RAW data hazards. However, control dependency is still handled by stalling the processor.

Control Hazards

The following parts assume full forwarding paths for RAW data hazards.

c. Branch is handled by early resolution in the ID stage.
d. Branch is handled by branch prediction (predict taken). Branch is still resolved in the MEM stage. When we use the predict-taken scheme, every branch instruction is assumed to be taken. The target instruction will be fetched in the next cycle with no stall.
e. Branch is handled by delayed branching. Branch is still resolved in the MEM stage. Show the modified code sequence in addition to the two tasks mentioned.
Answers:

(a)

    Cycle     1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18
    i1        F  D  E  M  W
    i2           F  D  D  D  E  M  W
    i3              F  F  F  D  D  D  E  M  W
    i4                       F  F  F  D  E  M  W
    i5                                F  D  D  D  E  M  W
    i6                                   F  F  F  D  D  D  E  M  W
    i1                                                           F  (next iteration)

One iteration takes 18 cycles. However, as the next iteration starts at cycle 18, i.e. an overlap of 1 cycle, we can calculate the total cycles as:

    (18 - 1) × 100 iterations + 1 cycle for the WB of the last bne inst. = 1701 cycles

(b)

The RAW hazard from i1 to i2 causes a 1-cycle delay because the leading instruction (i1) is a lw. One iteration takes 11 cycles. However, as the next iteration starts at cycle 11, i.e. an overlap of 1 cycle, we can calculate the total cycles as:

    (11 - 1) × 100 iterations + 1 cycle for the WB of the last bne inst. = 1001 cycles

(c)

    i6   F  D* D  E  M  W

Since i5 produces the value used by the branch instruction, a 1-cycle stall is needed so that the EX/MEM to ID stage forwarding (only for branches) can be used. The additional stall cycle is denoted as D* in the timing chart above.

    (12 - 3) × 100 iterations + 3 cycles for the EX-MEM-WB of the last bne inst. = 903 cycles
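
The timings in parts (a) and (b) of Question 3 can be reproduced with a small scheduling model. The sketch below is an assumption-laden illustration, not the lecture's method: it models an in-order 5-stage pipeline where an instruction waits in ID until its operands are available, the final branch is resolved in MEM with the fetch stalled until then, and every iteration is assumed to have the same timing as the first. It does not model the early branch resolution of part (c). The names LOOP, schedule and total_cycles are hypothetical.

```python
# Hedged sketch of an in-order 5-stage pipeline timing model (F, D, E, M, W).
# The model and every name in it are assumptions made for illustration.

# (label, destination register, source registers, is_load)
LOOP = [
    ("i1", "$t1", ["$t2"], True),             # lw   $t1, 0($t2)
    ("i2", "$t1", ["$t1"], False),            # addi $t1, $t1, 1
    ("i3", None, ["$t1", "$t2"], False),      # sw   $t1, 0($t2)
    ("i4", "$t2", ["$t2"], False),            # addi $t2, $t2, 4
    ("i5", "$t4", ["$t3", "$t2"], False),     # sub  $t4, $t3, $t2
    ("i6", None, ["$t4", "$zero"], False),    # bne  $t4, $zero, loop
]


def schedule(program, forwarding):
    """Return ({label: (first F, last D, E, M, W)}, next_fetch).

    next_fetch is the cycle in which the instruction after the final branch
    is fetched, assuming the branch is resolved in MEM and the fetch stalls.
    """
    timing = {}
    producer = {}            # register -> (M cycle, W cycle, is_load) of writer
    fetch, id_start = 1, 2
    for label, dest, sources, is_load in program:
        ready = id_start     # last cycle this instruction spends in ID
        for reg in sources:
            if reg in producer:
                p_mem, p_wb, p_load = producer[reg]
                if not forwarding:
                    ready = max(ready, p_wb)    # write-then-read in one cycle
                elif p_load:
                    ready = max(ready, p_mem)   # loaded value forwarded to EX
        ex, mem, wb = ready + 1, ready + 2, ready + 3
        timing[label] = (fetch, ready, ex, mem, wb)
        if dest is not None:
            producer[dest] = (mem, wb, is_load)
        # The next instruction enters IF when this one enters ID, and enters
        # ID when this one moves on to EX.
        fetch, id_start = id_start, ready + 1
    last_mem = timing[program[-1][0]][3]
    return timing, last_mem + 1


def total_cycles(program, forwarding, iterations=100):
    timing, next_fetch = schedule(program, forwarding)
    period = next_fetch - 1                  # cycles between iteration starts
    last_wb = timing[program[-1][0]][4]      # length of one full iteration
    return period * (iterations - 1) + last_wb


print(total_cycles(LOOP, forwarding=False))  # 1701
print(total_cycles(LOOP, forwarding=True))   # 1001
```

Here period * (iterations - 1) + last_wb is the same quantity as "(iteration length - overlap) × 100 + overlap" used in the answers, written from the point of view of the last iteration.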
(d)

[Timing chart, cycles 1 to 14]

With predict taken, i1 of the next iteration is fetched in the cycle right after i6 is fetched, so consecutive iterations overlap by 4 cycles.

    (11 - 4) × 100 iterations + 4 cycles for the last bne inst. to finish (ID to WB) = 704 cycles

(e)

Firstly, there are 3 delay slots in this case, as the branch is resolved in the MEM stage: the distance to the Fetch stage is 3 cycles, hence 3 delay slots.

If we can only reorder the original code without modification, then there are no suitable instructions for the delay slots. This results in 3 nop instructions after the bne instruction:

    loop: lw   $t1, 0($t2)        #i1
          addi $t1, $t1, 1        #i2
          sw   $t1, 0($t2)        #i3
          addi $t2, $t2, 4        #i4
          sub  $t4, $t3, $t2      #i5
          bne  $t4, $zero, loop   #i6
          nop
          nop
          nop

This gives us exactly the same performance as if there were no mechanism for control dependency.
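
All the totals for Question 3 follow one pattern: consecutive iterations overlap by some number of cycles, so each iteration contributes its length minus the overlap, and the overlap is paid once more at the very end. A minimal Python sketch of that arithmetic, using the per-part figures quoted in the answers (the helper name loop_total is an assumption for illustration):

```python
# Hedged sketch: total cycles = (iteration length - overlap) * iterations + overlap.
# The (length, overlap) pairs are the figures quoted in the answers above.

def loop_total(iteration_cycles, overlap, iterations=100):
    """Total cycles when consecutive iterations overlap by `overlap` cycles."""
    return (iteration_cycles - overlap) * iterations + overlap


cases = {
    "a": (18, 1),   # no forwarding, stall on branch
    "b": (11, 1),   # forwarding, stall on branch
    "c": (12, 3),   # forwarding, branch resolved early in ID
    "d": (11, 4),   # forwarding, predict taken
}

for part, (length, overlap) in cases.items():
    print(part, loop_total(length, overlap))
# a 1701
# b 1001
# c 903
# d 704
```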