Assignment solutions

1. The jal instruction does a jump identical to the j instruction (i.e., replacing the low-order 28 bits of the PC with the address in the instruction) and also writes the value of PC + 4 into register 31 ($ra). The hardware required to do this is similar to that for the single-cycle implementation: the value must be passed along the pipeline until it can be written into the register file, and the value 31 must be made available as an address input to the register file. (See the red components in the attached diagram.) The main question is: when can the return address (PC + 4) be written into the register file? There may already be up to 4 instructions waiting to write values into the register file (say, R-type instructions). For those to complete successfully, the return address cannot be written into the register file until all of them have completed. Therefore, this value is written in pipeline stage 5.

2. From the earlier question, the jump part of jal is completed in the first stage, and the link part (writing the return address in register $ra) in the fifth stage. Therefore, any operation requiring the return address must wait until the correct value is written there. This means that, for example, a return from a subprogram (jr) cannot happen immediately. A MIPS function call requires saving some information on the stack and restoring it, so this would not normally happen anyway. However, register 31 ($ra) is normally pushed on the stack, and this also should not happen until the correct value has been saved in $ra. Therefore, other instructions should be scheduled before this is done.

3. The jr instruction does not alter any of the values in the register file, only the PC, so it can complete after the register value is available, which happens in the second pipeline stage.
The additional hardware and control logic required is again similar to that for the single-cycle implementation; namely, a path from the register file to the PC, and a control signal for the required MUX. Since jr is an R-type instruction, at least part of the function field must also be decoded in this stage. (See the blue components in the attached diagram.)

4. If the jr completes in the second cycle, there would also be a delay slot following this instruction, as was the case for the branch instructions. This could again be filled with a useful instruction that is always executed, or with a nop instruction otherwise. Forwarding here would be similar to forwarding for the branch instructions: if an R-type instruction, say, immediately preceded the jr instruction, then forwarding would not work and a stall would be required. If it were two instructions before the jr instruction, forwarding could help. The logic would be similar to that for the ALU stage. It would, of course, have to be in the ID stage.

Comment: In the original MIPS architecture, all the jump instructions had a delay slot following the instruction. This meant that for the jal instruction, it was actually PC + 8 that should be saved as the return address. There was also a jump and link register (jalr) instruction, similar to the jal instruction, that jumped to the address stored in a register and also saved the return address in register 31.

5. Each instruction following the initial add depends on the value in register 3, written by that instruction. Therefore, there are three hazards from this instruction, all of which can be resolved by forwarding. There is also a hazard between the lw instruction, which writes register 6, and the instruction following, which reads register 6. This hazard cannot be resolved by forwarding.

6.
    loop: addi $3, $3, 4
          beq  $3, $4, loop
          lw   $2, 96($3)      # subtract the 4 added by addi

7. The original code was [...], where the [...] instruction [...] only on the preceding [...] instruction. In the non-pipelined machine, [...] and the generated code would be [...]. This code would be executed [...] times. Neglecting the final 4 cycles (to complete the last instruction in the loop), the total time would be 5 cycles per execution of this code, or 5 cycles for 2 instructions. The CPI would therefore be 5/2 = 2.5. For the pipelined machine with forwarding but no hazard detection, nop instructions would not be required, and the generated code would be
[...]. The CPI would therefore be 1. Note that if the [...] instruction depended on the preceding [...] instruction, one nop instruction would still be required, and the effective CPI would be 1.5.

8. The following shows the predictions for the four predictors and the given branch patterns, for the 25 branch instances. In the Behavior column, T = taken and N = not taken; in the prediction columns, T = correct prediction and F = incorrect prediction.

                       Predictions
       Behavior        always taken    always not taken   1-bit           2-bit (weak taken)
    1  T-T-T           T-T-T           F-F-F              T-T-T           T-T-T
    2  N-N-N-N         F-F-F-F         T-T-T-T            F-T-T-T         F-T-T-T
    3  T-N-T-N-T-N     T-F-T-F-T-F     F-T-F-T-F-T        T-F-F-F-F-F     T-F-T-F-T-F
    4  T-T-T-N-T       T-T-T-F-T       F-F-F-T-F          T-T-T-F-F       T-T-T-F-T
    5  T-T-N-T-T-N-T   T-T-F-T-T-F-T   F-F-T-F-F-T-F      T-T-F-F-T-F-F   T-T-F-T-T-F-T
       Total           15T, 10F        10T, 15F           13T, 12F        18T, 7F
       Accuracy        15/25 = .6      10/25 = .4         13/25 = .52     18/25 = .72

9. The original loop, which is to be unrolled, was:

    loop: lw   $2, ($)
          sub  $4, $2, $3
          sw   $4, ($)
          addi $, $, 4
          bne  $, $3, loop

Almost every instruction depends on the instruction preceding it. (The purpose of this loop is to subtract the value in register $3 from the array in memory pointed to by register $. Note that a better programmer would have reused register $2 rather than introducing register $4.) The target machine is the standard MIPS, which had no forwarding and a single branch delay slot. Although not part of the answer, consider the following rescheduling of the original loop, for both the original MIPS and a MIPS with forwarding:
    standard MIPS                     MIPS with forwarding
    loop: lw   $2, ($)                loop: lw   $2, ($)
          addi $, $, 4                      addi $, $, 4
          nop                               sub  $4, $2, $3
          nop                               bne  $, $3, loop
          sub  $4, $2, $3                   sw   $4, -4($)
          nop
          nop
          bne  $, $3, loop
          sw   $4, -4($)              /* register was incremented */

Note that the MIPS with forwarding does not need loop unrolling for this example. The previous schedule can be used as a pattern for the unrolled loop (it may not be optimal):

    loop: lw   $2, ($)
          lw   $5, 4($)
          addi $, $, 8
          nop
          sub  $4, $2, $3
          sub  $6, $5, $3
          nop
          nop
          sw   $4, -8($)
          bne  $, $3, loop
          sw   $6, -4($)

The original rescheduled loop for the standard MIPS required 9 instruction fetches for a single loop iteration, or 18 for two loop iterations. The unrolled loop required 11 instruction fetches for a single iteration. The unrolled loop should then complete in 11/18, or about 0.6, of the time required for the original loop. The loop could be implemented without nop instructions if it was unrolled 4 times.
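The fetch counts in question 9 can be cross-checked with a small model. The sketch below is only an illustration, not part of the original solution. It assumes that on the machine without forwarding a register may not be read until at least three other instructions have been fetched after the instruction that writes it, which is the spacing that reproduces the stated 9 fetches per iteration. The tuple encoding, the function name `hazard_free`, and the register name `ptr` (a stand-in for the array pointer register) are all hypothetical.

```python
# Hypothetical model: an instruction is (mnemonic, written register, read registers).
# "ptr" stands in for the array pointer register; it is a placeholder name.
NOP = ("nop", None, ())


def hazard_free(schedule, gap=3):
    """Return True if every register read is separated from the most recent
    earlier write of that register by at least `gap` other instructions.
    Only dependences within a single pass through the loop body are checked."""
    last_write = {}
    for i, (_, dest, srcs) in enumerate(schedule):
        for reg in srcs:
            if reg in last_write and i - last_write[reg] - 1 < gap:
                return False
        if dest is not None:
            last_write[dest] = i
    return True


# Rescheduled original loop for the machine without forwarding (assumed nop placement).
original = [
    ("lw", "$2", ("ptr",)),
    ("addi", "ptr", ("ptr",)),
    NOP,
    NOP,
    ("sub", "$4", ("$2", "$3")),
    NOP,
    NOP,
    ("bne", None, ("ptr", "$3")),
    ("sw", None, ("$4", "ptr")),  # branch delay slot
]

# Loop unrolled twice, following the same pattern (assumed nop placement).
unrolled2 = [
    ("lw", "$2", ("ptr",)),
    ("lw", "$5", ("ptr",)),
    ("addi", "ptr", ("ptr",)),
    NOP,
    ("sub", "$4", ("$2", "$3")),
    ("sub", "$6", ("$5", "$3")),
    NOP,
    NOP,
    ("sw", None, ("$4", "ptr")),
    ("bne", None, ("ptr", "$3")),
    ("sw", None, ("$6", "ptr")),  # branch delay slot
]

if __name__ == "__main__":
    print(len(original), hazard_free(original))
    print(len(unrolled2), hazard_free(unrolled2))
    # Fetches for two array elements: unrolled loop versus two original iterations.
    print(round(len(unrolled2) / (2 * len(original)), 2))
```

Under these assumptions, removing any nop from either schedule makes `hazard_free` return False, which is one way to see why the nops are needed on the machine without forwarding.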
[Figure: pipelined datapath with IF, ID, EX, MEM and WB stages and IF/ID, ID/EX, EX/MEM and MEM/WB pipeline registers, showing the added JAL hardware (red) and JR hardware (blue).]
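The predictor accuracies in question 8 can be reproduced with a short simulation. The following is a minimal sketch, not the method used in the original solution. It assumes that each of the five patterns starts from a fresh predictor state, that the 1-bit predictor initially predicts taken, and that the 2-bit predictor is a saturating counter starting in the weakly-taken state. All function and variable names are hypothetical.

```python
# Branch outcomes for the five patterns: T = taken, N = not taken.
PATTERNS = ["TTT", "NNNN", "TNTNTN", "TTTNT", "TTNTTNT"]


def always_taken(outcomes):
    """Per-branch correctness when every branch is predicted taken."""
    return [taken for taken in outcomes]


def always_not_taken(outcomes):
    """Per-branch correctness when every branch is predicted not taken."""
    return [not taken for taken in outcomes]


def one_bit(outcomes, predict_taken=True):
    """1-bit predictor: predict whatever the previous outcome was.
    Assumption: the initial prediction is taken."""
    correct = []
    for taken in outcomes:
        correct.append(predict_taken == taken)
        predict_taken = taken
    return correct


def two_bit(outcomes, counter=2):
    """2-bit saturating counter: 0 and 1 predict not taken, 2 and 3 predict taken.
    Assumption: the initial state is 2 (weakly taken)."""
    correct = []
    for taken in outcomes:
        correct.append((counter >= 2) == taken)
        counter = min(3, counter + 1) if taken else max(0, counter - 1)
    return correct


def total_correct(predictor):
    """Number of correct predictions over all patterns, resetting state per pattern."""
    return sum(sum(predictor([c == "T" for c in p])) for p in PATTERNS)


def row(predictor, pattern):
    """Format one table entry, e.g. 'F-T-T-T' (T = correct, F = incorrect)."""
    results = predictor([c == "T" for c in pattern])
    return "-".join("T" if ok else "F" for ok in results)


if __name__ == "__main__":
    total = sum(len(p) for p in PATTERNS)
    for predictor in (always_taken, always_not_taken, one_bit, two_bit):
        hits = total_correct(predictor)
        print(predictor.__name__, hits, total, hits / total)
```

Resetting the state for each pattern is a modelling choice; carrying state from one pattern to the next would change some of the first predictions in each row.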