Computer Architecture
Homework 3
2012-2013

Please state clearly any assumptions you make in solving the following problems.

1 Processors

Write a short report on at least five processors from at least three different companies. The points of comparison in your report must include, but are not limited to, the frequency of operation, the number of different pipelines used, and the number of pipeline stages in each pipeline. (7 points)

2 Branch prediction

1. Suppose that a machine with a 5-stage pipeline uses branch prediction (i.e., no branch delay slots). 15% of the instructions for a given test program are branches, of which 80% are correctly predicted. The other 20% of the branches suffer a 4-cycle mis-prediction penalty. (In other words, when the branch predictor predicts incorrectly, there are four instructions in the pipeline that must be discarded.) Assuming there are no other stalls, develop a formula for the number of cycles it will take to complete n lines of this program. (4 points)

Answer:

$n(0.85 \times 1 + 0.15(0.8 \times 1 + 0.2 \times (1 + 4))) = n(1 + 0.03 \times 4) = 1.12n$

2. Now suppose you are given the option of replacing this processor's branch prediction scheme with a 1-cycle branch delay system (i.e., there is one branch delay slot after every branch). What percentage of the branch delay slots must be filled in order for the CPU with the branch-delay system to have better performance than the CPU described above? (3 points)

Answer: With p as the required fraction of filled delay slots, a branch costs 1 cycle when its slot is filled and 2 cycles when it is not, so the new number of cycles is

$n(0.85 \times 1 + 0.15(p \times 1 + (1 - p) \times 2)) = n(0.85 + 0.15(2 - p)) = n(1.15 - 0.15p)$

We want $1.15 - 0.15p < 1.12$, i.e. $0.15p > 0.03$, hence $p > 0.2$: more than 20% of the delay slots must be filled.

3 Re-ordering

Assume that the following code runs on a processor with a 5-stage pipeline: fetch, decode, execute, memory, write-back. If you have stalls in the code, can you re-order it to avoid the stalls? If yes, what is the new order? If no, explain why it cannot be re-ordered.
(3 points)

    I1: add R1, R3, R4
    I2: ld R6, 12(R1)
    I3: sub R6, R6, R5
    I4: ld R7, 16(R1)
    I5: mul R8, R7, R7

Answer: Yes. Moving I4 ahead of I3 separates each load from the instruction that uses the loaded value:

    I1: add R1, R3, R4
    I2: ld R6, 12(R1)
    I4: ld R7, 16(R1)
    I3: sub R6, R6, R5
    I5: mul R8, R7, R7
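The re-ordering answer can be sanity-checked with a small script. The sketch below is only an illustration and rests on an assumption the problem does not state: the pipeline has full forwarding, so the only remaining stall is the one-cycle load-use bubble when an instruction reads a register loaded by the instruction immediately before it. The helper names `parse` and `load_use_stalls` are hypothetical.

```python
import re

def parse(line):
    """Split e.g. 'ld R6, 12(R1)' into (opcode, destination, source registers)."""
    op, rest = line.split(None, 1)
    regs = re.findall(r"R\d+", rest)
    return op, regs[0], regs[1:]

def load_use_stalls(program):
    """Count instructions that read a register loaded by the instruction just before.

    Assumption: full forwarding, so only back-to-back load-use pairs stall.
    """
    stalls = 0
    for prev, cur in zip(program, program[1:]):
        prev_op, prev_dest, _ = parse(prev)
        _, _, cur_srcs = parse(cur)
        if prev_op == "ld" and prev_dest in cur_srcs:
            stalls += 1
    return stalls

original = [
    "add R1, R3, R4",   # I1
    "ld R6, 12(R1)",    # I2
    "sub R6, R6, R5",   # I3
    "ld R7, 16(R1)",    # I4
    "mul R8, R7, R7",   # I5
]
# Order proposed in the answer: I1, I2, I4, I3, I5
reordered = [original[0], original[1], original[3], original[2], original[4]]

print(load_use_stalls(original))   # 2
print(load_use_stalls(reordered))  # 0
```

The script only counts load-use pairs; it does not check that the new order preserves the data dependences. Here it does, since I4 reads only R1 and writes only R7, neither of which I3 touches.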
4 Advanced pipelines

The following code is from the algorithms employed in the machine you are designing. Assume that your processor uses a pipeline with full bypassing (forwarding), the initial value of register R23 is much bigger than the initial value of R20, and all memory references hit in the caches, taking a single cycle. You may not re-order the instructions, and if an instruction stalls, it stalls all the following instructions.

    LOOP: lw    R10, X(R20)     ; load the first value into R10
          lw    R11, Y(R20)     ; load the second value into R11
          subu  R10, R10, R11   ; subtract
          sw    Z(R20), R10     ; store R10 into memory
          addiu R20, R20, 4     ; step the index
          subu  R5, R23, R20    ; check the limit
          bnez  R5, LOOP        ; branch if R5 is not equal to zero
          or    R20, R5, 0      ; start of block after the loop
          lw    R12, X(R20)     ; part of new block, load first value
          lw    R13, Y(R20)     ; load the second value into R13
          mul   R12, R12, R13   ; multiply...

1. Fill the following table with a pipeline diagram of 2 iterations of the loop's execution on a standard 5-stage pipeline (similar to what we studied). Assume that the branch is completely resolved in the decode stage. Indicate clearly all the required stall cycles and the reasons for those stalls. Write the average number of cycles required to complete a single iteration of the loop. (8 points)
                01 02 03 04 05 06 07 08 09 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
lw R10, X(R20)  F  D  X  M  W

Average cycles for a single iteration =

Reasons for stalls:

Answer: The diagram is
Iteration 1 (cycles 1 to 13):

           1     2     3     4     5     6     7     8     9     10    11    12    13
lw R10     IF    D     EX    M     WB
lw R11           IF    D     EX    M     WB
subu R10               IF    D     stall EX    M     WB
sw                           IF    stall D     EX    M     WB
addiu R20                                IF    D     EX    M     WB
subu R5                                        IF    D     EX    M     WB
bnez                                                 IF    D     D     EX    M     WB
or                                                         IF    stall flush

Iteration 2 (cycles 11 to 23):

           11    12    13    14    15    16    17    18    19    20    21    22    23
lw R10     IF    D     EX    M     WB
lw R11           IF    D     EX    M     WB
subu R10               IF    D     stall EX    M     WB
sw                           IF    stall D     EX    M     WB
addiu R20                                IF    D     EX    M     WB
subu R5                                        IF    D     EX    M     WB
bnez                                                 IF    D     D     EX    M     WB
or                                                         IF    stall flush

The subu R10 instruction is stalled waiting for the loaded value of R11, and the following sw is stalled as well for the same reason. The bnez cannot read the value of R5 during its original decode stage since it is calculated within the same cycle by the previous instruction. A stall cycle must be used for the or instruction, and the branch evaluates the condition in the following cycle.

The average number of cycles is 10: from the start of one iteration to the start of the following iteration.

(Three points for the diagram, two points for each explanation, one point for the number of cycles.)

2. As an attempt to increase the performance, you consider doubling the frequency of operation and using the following stages:

    IF1: Begin Instruction Fetch
    IF2: Complete Instruction Fetch
    ID: Instruction Decode
    RF: Register Fetch (including the fetching of registers for branch resolution)
    EX1: ALU operation execution begins. Branch target calculation finishes. Memory address calculation. Branch condition resolution calculation begins.
    EX2: Branch condition resolution finishes. Finish ALU ops. (But branch and memory address calculations finish in a single cycle during EX1.)
    M1: First part of memory access, address sent to memory.
    M2: Second part of memory access, data sent to memory for stores OR returned from memory for loads.
    WB: Write back results to register file

Assume that forwarding allows the RF stage of an instruction to complete in the same cycle in which a previous instruction produces the required value. Fill the following table with a pipeline diagram of two iterations of the loop on the new pipeline (a single iteration is from the fetch of the first load till that same instruction is fetched again). Indicate clearly any required stall cycles and the reasons for those stalls. Assume that any instructions following a branch are fetched in order and may move up to the register fetch stage but are not issued for execution until the condition of the branch is resolved. If the branch is taken, the following instructions are flushed and the target is fetched. Write the average number of cycles required for a single iteration. (12 points)
                01 02 03 04 05 06 07 08 09 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40
lw R10, X(R20)  F1 F2 D  RF X1 X2 M1 M2 W

Pipeline Load Delay =
Pipeline Branch Delay =
Average cycles for a single iteration =

Reasons for stalls:
Answer: The diagram is

Iteration 1 (cycles 1 to 20):

           1      2      3      4      5      6      7      8      9      10     11     12     13     14     15     16     17     18     19     20
lw R10     IF1    IF2    D      RF     EX1    EX2    M1     M2     WB
lw R11            IF1    IF2    D      RF     EX1    EX2    M1     M2     WB
subu R10                 IF1    IF2    D      stall1 stall2 stall3 RF     EX1    EX2    M1     M2     WB
sw                              IF1    IF2    stall1 stall2 stall3 D      RF     EX1    EX2    M1     M2     WB
addiu R20                              IF1    stall1 stall2 stall3 IF2    D      RF     EX1    EX2    M1     M2     WB
subu R5                                                            IF1    IF2    D      stall1 RF     EX1    EX2    M1     M2     WB
bnez                                                                      IF1    IF2    stall1 D      stall2 RF     EX1    EX2    M1     M2     WB
or                                                                               IF1    stall1 IF2    stall2 D      RF     stall3 flush
lw R12                                                                                         IF1    stall1 IF2    D      stall2 flush
lw R13                                                                                                       IF1    IF2    stall1 flush
mul R12                                                                                                             IF1    stall1 flush

Iteration 2 (cycles 18 to 37):

           18     19     20     21     22     23     24     25     26     27     28     29     30     31     32     33     34     35     36     37
lw R10     IF1    IF2    D      RF     EX1    EX2    M1     M2     WB
lw R11            IF1    IF2    D      RF     EX1    EX2    M1     M2     WB
subu R10                 IF1    IF2    D      stall1 stall2 stall3 RF     EX1    EX2    M1     M2     WB
sw                              IF1    IF2    stall1 stall2 stall3 D      RF     EX1    EX2    M1     M2     WB
addiu R20                              IF1    stall1 stall2 stall3 IF2    D      RF     EX1    EX2    M1     M2     WB
subu R5                                                            IF1    IF2    D      stall1 RF     EX1    EX2    M1     M2     WB
bnez                                                                      IF1    IF2    stall1 D      stall2 RF     EX1    EX2    M1     M2     WB
or                                                                               IF1    stall1 IF2    stall2 D      RF     stall3 flush
lw R12                                                                                         IF1    stall1 IF2    D      stall2 flush
lw R13                                                                                                       IF1    IF2    stall1 flush
mul R12                                                                                                             IF1    stall1 flush
lw R10                                                                                                                            IF1    IF2    D...

The subu R10 instruction is stalled waiting for the loaded value of R11, and the following sw is stalled as well for the same reason. The RF stage of subu R5 must wait for R20 from the previous instruction. The RF stage of bnez must wait for the previous instruction. All the instructions following bnez wait for the condition resolution.

Delay for load is 3 cycles. Delay for branch is 1 stall for its register fetch and 5 cycles till its condition resolution at the EX2 stage.

The average number of cycles is 17: from the start of one iteration to the start of the following iteration.

(Two points for the diagram, two points for each explanation, two points for each number of cycles.)
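One way to cross-check the two per-iteration counts (10 on the 5-stage pipeline, 17 on the 9-stage one) is simple bookkeeping: in steady state each iteration costs one cycle per loop instruction, plus every stall cycle inserted by a data hazard, plus the fetch slots wasted behind the taken branch. The sketch below encodes that accounting. The function name `cycles_per_iteration` is hypothetical, and the stall and branch-delay figures are read off the two answers in this section rather than derived by the code.

```python
def cycles_per_iteration(instructions, stalls, branch_delay):
    """Steady-state cycles between two fetches of the loop's first instruction.

    instructions: number of instructions in the loop body
    stalls:       dict mapping a hazard description to its stall cycles
    branch_delay: fetch slots lost after the taken branch
    """
    return instructions + sum(stalls.values()) + branch_delay

LOOP_INSTRUCTIONS = 7  # lw, lw, subu, sw, addiu, subu, bnez

five_stage = cycles_per_iteration(
    LOOP_INSTRUCTIONS,
    stalls={
        "subu R10 waits for lw R11": 1,
        "bnez waits for subu R5": 1,
    },
    branch_delay=1,  # the 'or' fetched after bnez is flushed
)

nine_stage = cycles_per_iteration(
    LOOP_INSTRUCTIONS,
    stalls={
        "subu R10 waits for lw R11": 3,
        "subu R5 waits for addiu R20": 1,
        "bnez waits for subu R5": 1,
    },
    branch_delay=5,  # slots lost until the condition resolves in EX2
)

print(five_stage, nine_stage)  # 10 17
```

The stall that sw inherits from subu R10 is not counted separately, because an in-order pipeline holds every later instruction for the same cycles.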
3. Based on this piece of code only, is it beneficial to double the frequency and use the new pipeline? Explain why or why not. (3 points)

Answer: The new pipeline takes 17 cycles to finish one iteration. Because the clock frequency is doubled, those 17 cycles take as long as 8.5 cycles of the old pipeline. This is better than the 10 cycles per iteration achieved on the old pipeline, and thus it is beneficial to double the frequency.

(One point for the answer, two points for the analysis.)
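The comparison can be written out as a short calculation. This is a minimal sketch that measures time in units of the old clock period and assumes, as the question does, that the new clock is exactly twice as fast. The name `time_per_iteration` is hypothetical.

```python
def time_per_iteration(cycles, relative_frequency):
    """Time for one loop iteration, in units of the old pipeline's clock period."""
    return cycles / relative_frequency

old_time = time_per_iteration(10, 1.0)  # 5-stage pipeline, original clock
new_time = time_per_iteration(17, 2.0)  # 9-stage pipeline, doubled clock

print(new_time)                       # 8.5
print(round(old_time / new_time, 3))  # 1.176
```

Under these assumptions the deeper pipeline is about 1.18 times faster on this loop, well short of the factor of 2 in clock frequency, because the longer load and branch delays add stall cycles.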