Surname (Cognome): ____________  Name (Nome): ____________
POLIMI ID Number: ____________   Signature (Firma): ____________

SOLUTION

Politecnico di Milano, July 9, 2018
Course on Advanced Computer Architectures
Prof. D. Sciuto, Prof. C. Silvano

EX1 (5 points)   EX2 (5 points)   EX3 (6 points)
Q1 (5 points)    Q2 (5 points)    SIX QUESTIONS (6 points)
TOTAL (32 points)
EXERCISE 1 - MULTI-CYCLE PIPELINE (5 points)

In this problem we examine the execution of the following code segment on a single-issue out-of-order processor:

- All functional units are pipelined.
- ALU operations take 1 clock cycle.
- Memory operations take 2 clock cycles (including the time in the ALU).
- FP ALU operations take 2 clock cycles.
- The Write Back unit has a single write port.
- There is no register renaming.
- Instructions are fetched, decoded and issued in order.
- An instruction enters the ISSUE stage only if it does not cause a WAR or WAW hazard.
- Only one instruction can be issued at a time; when multiple instructions are ready, the oldest one goes first.

Fill in the following table pointing out, for each instruction, the pipeline stages it activates and the type of hazards:

Instruction             Stages                    Type of Hazards
lw.d   $f1,basea($r1)   IF ID IS ALU M   WB       -
addi.d $f1,$f1,k1       IF ID IS FA1 FA2 WB       RAW $f1, WAW $f1
add.d  $f3,$f2,$f1      IF ID IS FA1 FA2 WB       RAW $f1
add.d  $f1,$f2,$f3      IF ID IS FA1 FA2 WB       RAW $f3, WAW $f1, WAR $f1
sw.d   $f1,basea($r1)   IF ID IS ALU M   WB       RAW $f1

Insert in the empty pipeline scheme the pipeline stages and stalls needed to solve the previous data, control and structural hazards.

Solution:

INS  C1  C2  C3  C4   C5   C6   C7   C8   C9   C10  C11  C12  C13  C14  C15  C16  C17  C18  C19  C20
I1   IF  ID  IS  ALU  M    WB
I2       IF  ID  S    S    S    IS   FA1  FA2  WB
I3           IF  S    S    S    ID   IS   S    S    FA1  FA2  WB
I4                              IF   ID   S    S    S    S    S    IS   FA1  FA2  WB
I5                                   IF   S    S    S    S    S    ID   IS   S    S    ALU  M    WB
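The hazard column above can be cross-checked mechanically. Below is a small Python sketch (not part of the exam; the data-structure layout is ours) that classifies RAW, WAR and WAW dependences between each instruction and every earlier one, using only each instruction's destination and source registers:

```python
# Classify data hazards between each instruction and all earlier ones.
# Each entry: (mnemonic, destination register or None, source registers).
instrs = [
    ("lw.d",   "$f1", []),            # load into $f1 (address register $r1 ignored here)
    ("addi.d", "$f1", ["$f1"]),
    ("add.d",  "$f3", ["$f2", "$f1"]),
    ("add.d",  "$f1", ["$f2", "$f3"]),
    ("sw.d",   None,  ["$f1"]),       # store reads $f1, writes no register
]

def hazards(instrs):
    out = []
    for i, (_, dst_i, srcs_i) in enumerate(instrs):
        hs = set()
        for j in range(i):
            _, dst_j, srcs_j = instrs[j]
            if dst_j and dst_j in srcs_i:
                hs.add(("RAW", dst_j))       # read after an earlier write
            if dst_i and dst_i in srcs_j:
                hs.add(("WAR", dst_i))       # write after an earlier read
            if dst_i and dst_i == dst_j:
                hs.add(("WAW", dst_i))       # write after an earlier write
        out.append(sorted(hs))
    return out

for (mnem, _, _), hz in zip(instrs, hazards(instrs)):
    print(mnem, hz)
```

Running it reproduces the table's annotations, e.g. RAW $f3, WAR $f1 and WAW $f1 for the second add.d.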
Fill in the following table pointing out the NUMBER OF STALLS to be inserted within each instruction to solve the hazards:

INSTRUCTION             Num. of Stalls   Type of Hazards
lw.d   $f1,basea($r1)   -                -
addi.d $f1,$f1,k1       3                (RAW $f1), WAW $f1
add.d  $f3,$f2,$f1      2                RAW $f1
add.d  $f1,$f2,$f3      3                (RAW $f3), WAW $f1, WAR $f1
sw.d   $f1,basea($r1)   2                RAW $f1
EXERCISE 2 - TOMASULO (5 points)

Assume that the following code has been executed on a CPU with dynamic scheduling based on the TOMASULO algorithm.

1. List all the possible conflicts in the code (use the column HAZARD TYPE):

Instruction                Issue  Exec. Start  Write Result  Hazard Type
I1: LW.D  $F6, 32(R2)      1      2            8             -
I2: ADDD  $F2, $F6, $F4    2      9            11            RAW $F6
I3: MULTD $F0, $F4, $F2    3      12           32            RAW $F2
I4: SUBD  $F12, $F2, $F6   12     13           15            STRUCT RS ADD/SUB, (RAW $F2), (RAW $F6)
I5: ADDD  $F0, $F12, $F2   16     17           19            STRUCT RS ADD/SUB, (RAW $F12), (RAW $F2)

2. Is there a hardware configuration that can respect the shown execution timing?
YES: Tomasulo with a ROB, to solve the WAW hazard on $F0 between I5 and I3.

3. If it is possible, describe the architecture in terms of Functional Units and Reservation Stations (number, type and latency of Functional Units, and number of Reservation Stations for each FU):

- 1 Load/Store Unit, latency 6 clock cycles, with 1 Reservation Station
- 1 ADD/SUB Unit, latency 2 clock cycles, with 1 Reservation Station
- 1 MUL Unit, latency 20 clock cycles, with 1 Reservation Station

The delayed ISSUEs of I4 and I5 imply that there is only one reservation station for ADDD/SUBD, which generates the STRUCTURAL hazards.
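The timing table can be replayed with a short simulation of the configuration found in point 3 (one reservation station per unit, held until Write Result; operands delivered on the CDB at the producer's Write Result). This is a sketch with our own naming, not exam material:

```python
# Replay the Tomasulo timing: in-order issue, one reservation station per
# functional unit, latencies LD/ST = 6, ADD/SUB = 2, MUL = 20 clock cycles.
LAT = {"LD": 6, "ADDSUB": 2, "MUL": 20}

# (unit, destination, sources)
code = [
    ("LD",     "F6",  []),            # I1: LW.D  F6, 32(R2)
    ("ADDSUB", "F2",  ["F6", "F4"]),  # I2: ADDD  F2, F6, F4
    ("MUL",    "F0",  ["F4", "F2"]),  # I3: MULTD F0, F4, F2
    ("ADDSUB", "F12", ["F2", "F6"]),  # I4: SUBD  F12, F2, F6
    ("ADDSUB", "F0",  ["F12", "F2"]), # I5: ADDD  F0, F12, F2
]

def simulate(code):
    ready = {}       # register -> cycle its value appears on the CDB (Write Result)
    rs_free = {}     # unit -> first cycle its single RS is available again
    prev_issue = 0
    rows = []
    for unit, dst, srcs in code:
        # structural hazard: wait until the unit's only RS has been released
        issue = max(prev_issue + 1, rs_free.get(unit, 1))
        # RAW hazard: execution starts the cycle after all operands are ready
        exec_start = max([issue] + [ready.get(s, 0) for s in srcs]) + 1
        write = exec_start + LAT[unit]
        ready[dst] = write
        rs_free[unit] = write + 1     # RS held until Write Result
        prev_issue = issue
        rows.append((issue, exec_start, write))
    return rows

print(simulate(code))
```

The output matches the (Issue, Exec. Start, Write Result) columns of the table, including the structural delays of I4 and I5 on the single ADD/SUB reservation station.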
EXERCISE 3: VLIW (6 points)

In this problem, you will port code to a simple 3-issue VLIW machine and schedule it to improve performance. Details of the 3-issue VLIW machine with 3 fully pipelined functional units:

- Integer ALU with 1-cycle latency to the next Integer/FP instruction and 2-cycle latency to the next Branch
- Memory Unit with 3-cycle latency
- Floating Point Unit with 3-cycle latency (the FPU can complete one add or one multiply per clock cycle)
- Branch completed with 1 delay slot (branch resolved in the ID stage)
- No interlocks

C Code:                         Assembly Code:
for(int i=0; i<n; i++)          loop: ld   f1, 0(r6)
    A[i] = A[i] + A[i];               ld   f2, 0(r6)
                                      fadd f2, f2, f1
                                      st   f2, 0(r6)
                                      addi r6, r6, 4
                                      bne  r6, r7, loop

EXERCISE 3.A
Considering one iteration of the loop, schedule the assembly code for the 3-issue VLIW machine in the following table by using list-based scheduling, without software pipelining, loop unrolling, or modifying the loop indexes. There is no need to write in NOPs (slots can be left blank).

      Integer ALU        Memory Unit        FPU
C0                       ld f1, 0(r6)
C1                       ld f2, 0(r6)
C2
C3
C4                                          fadd f2, f2, f1
C5
C6
C7    addi r6, r6, 4     st f2, 0(r6)
C8
C9    bne r6, r7, loop
C10   (br delay slot)

How long is the critical path? 11 cycles
What performance did you achieve in FP ops per cycle? 1/11
What performance did you achieve in cycles per loop iteration? 11
What code efficiency did you achieve? 6/33
What CPI did you achieve? CPI = #VLIW_cycles / IC = 11/6 = 1.83
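The metrics for EX 3.A follow directly from the schedule; a quick arithmetic check (variable names are ours):

```python
# Sanity-check the reported metrics for the EX 3.A schedule.
issue_slots = 3       # 3-issue VLIW machine
cycles = 11           # C0..C10, branch delay slot included
instructions = 6      # useful operations scheduled (all other slots are NOPs)
fp_ops = 1            # one fadd per iteration

cpi = cycles / instructions                         # cycles per instruction
fp_per_cycle = fp_ops / cycles                      # FP ops per cycle
efficiency = instructions / (cycles * issue_slots)  # filled slots / total slots
print(round(cpi, 2), efficiency)
```

Code efficiency is 6 useful operations out of 11 x 3 = 33 available slots, i.e. 6/33.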
EXERCISE 3.B
Considering the unrolling of two iterations of the loop as follows:

Assembly Code:
loop: ld   f1, 0(r6)
      ld   f2, 0(r6)
      fadd f2, f2, f1
      st   f2, 0(r6)
      ld   f3, 4(r6)
      ld   f4, 4(r6)
      fadd f4, f4, f3
      st   f4, 4(r6)
      addi r6, r6, 8
      bne  r6, r7, loop

Considering one iteration of the unrolled loop, schedule the assembly code by using list-based scheduling, without software pipelining, further loop unrolling, or modifying the loop indexes. There is no need to write in NOPs (slots can be left blank).

      Integer ALU        Memory Unit        FPU
C0                       ld f1, 0(r6)
C1                       ld f2, 0(r6)
C2                       ld f3, 4(r6)
C3                       ld f4, 4(r6)
C4                                          fadd f2, f2, f1
C5
C6                                          fadd f4, f4, f3
C7                       st f2, 0(r6)
C8
C9    addi r6, r6, 8     st f4, 4(r6)
C10
C11   bne r6, r7, loop
C12   (br delay slot)

How long is the critical path? 13 cycles
What performance did you achieve in FP ops per cycle? 2/13
What performance did you achieve in cycles per loop iteration? 13/2 = 6.5
What code efficiency did you achieve? 10/39
What CPI did you achieve? CPI = #VLIW_cycles / IC = 13/10 = 1.3
How much is the speedup with respect to EX 3.A? Speedup = CPI_A / CPI_B = 1.83/1.3 = 1.41, i.e. a 41% speedup
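A quick check of the comparison between the two schedules (variable names are ours; the speedup figure uses the CPI ratio, as in the solution):

```python
# Compare EX 3.A (one iteration per loop pass) with EX 3.B (two iterations per pass).
cycles_a, instrs_a = 11, 6
cycles_b, instrs_b = 13, 10

cpi_a = cycles_a / instrs_a   # CPI of the single-iteration schedule
cpi_b = cycles_b / instrs_b   # CPI of the unrolled schedule
speedup = cpi_a / cpi_b       # CPI ratio, the metric used in the solution
print(round(cpi_a, 2), cpi_b, round(speedup, 2))
```

Note that in terms of cycles per array element the unrolled schedule also wins: 13/2 = 6.5 cycles per element versus 11.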
QUESTION 1: (5 points)

1. Superscalar processors today use an approach that exploits both Instruction-Level Parallelism and Thread-Level Parallelism. Briefly explain this approach.
Answer: SMT (Simultaneous Multithreading)

2. Which HW mechanisms of dynamically scheduled processors intrinsically support this approach?

3. Which modifications to a generic architecture of dynamically scheduled superscalar processors are required?
QUESTION 2: (5 points)

1. Explain the concept of the Reorder Buffer and the motivations for introducing it in dynamically scheduled superscalar architectures.

2. In speculative Tomasulo, how is the Reorder Buffer used, and during which stage is the ROB entry assigned?
SIX QUESTIONS: (6 points)

Answer True or False to the following questions:

Q1) In MIMD architectures with a physically centralized memory organization, the memory address space is shared.
Answer: False
Feedback: The address space does not depend on the physical memory organization.

Q2) Increasing the issue rate of modern processors (for instance up to 12 instructions per clock cycle) provides a huge improvement in the processor clock frequency.
Answer: False
Feedback: Increasing the issue rate of a processor means increasing the memory accesses, branch resolutions, register renamings and register accesses per clock cycle. Due to the complexity of implementing such an architecture, we would sacrifice the maximum clock rate, so this choice also has some important disadvantages: notice that the processor with the highest issue rate is also the one with the slowest clock cycle.

Mark the True answers to the following questions (some questions might have multiple True answers):

Q3) Considering the MESI write-invalidate write-back protocol, a cached block can be in one of four states: Modified, Exclusive, Shared, Invalid. For which of these states is the memory up to date?
Answer 1: Only Exclusive.
Answer 2: Only Shared.
Answer 3: Shared and Exclusive. (TRUE)
Answer 4: Shared and Modified.
Feedback: The correct answer is: Shared and Exclusive. When the block is in the Shared state, no write has been performed on the block: both memory and caches hold the latest value. When the state is Exclusive, only one cache holds a copy of the block and it hasn't performed any write on it (i.e. both the memory and the single cache holding the copy have the latest value of the block).

Q4) In case of the delayed branch technique, which of the following assertions are true?
Answer 1: Instruction in the branch delay slot is always executed. (TRUE)
Answer 2: Instruction in the branch delay slot is always a NOP.
Answer 3: Instruction in the branch delay slot is always an independent instruction taken from before the branch.
Answer 4: Helps the compiler to statically predict the outcome of the branch.
Feedback: The instruction in the branch delay slot is always executed, whether or not the branch is taken. The job of the compiler is to make the instruction placed in the branch delay slot valid and useful; the remaining slots are filled with NOPs.
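The memory-consistency claim in Q3 can be written down as a small lookup table (a trivial sketch, not exam material; the per-state comments restate the feedback above):

```python
# For each MESI state: does main memory hold the latest value of the block?
memory_up_to_date = {
    "Modified":  False,  # dirty copy in exactly one cache; memory is stale
    "Exclusive": True,   # one clean copy; memory matches the cache
    "Shared":    True,   # clean copies in one or more caches; memory matches
    "Invalid":   False,  # no valid copy in this cache; the question does not apply
}
print(sorted(state for state, ok in memory_up_to_date.items() if ok))
# ['Exclusive', 'Shared']
```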
Q5) Which complications are introduced by out-of-order execution and out-of-order completion?
Answer 1: WAR and WAW hazards (TRUE)
Answer 2: RAW hazards
Answer 3: Exception handling (TRUE)
Answer 4: Control hazards
Answer 5: Instruction reordering
Feedback: Out-of-order execution introduces the possibility of WAR and WAW hazards, which do not exist in the five-stage integer pipeline. Out-of-order completion creates major complications in handling exceptions: dynamic scheduling with out-of-order completion must preserve exception behavior, in the sense that exactly those exceptions that would arise if the program were executed in strict program order actually do arise.

Q6) Which of these modifications would you introduce on a cache to decrease the conflict misses?
Answer 1: Increase the cache size (TRUE)
Answer 2: Increase the cache associativity (TRUE)
Answer 3: Increase the size of the blocks
Answer 4: Add a second level of cache
Feedback: The best and most effective way of reducing conflict misses is moving to a higher-associativity cache organization. It is also true that increasing the cache size reduces conflict misses as well as capacity misses. It is wrong, instead, that increasing the block size reduces conflict misses: it is the other way around, since decreasing the block size (for a fixed cache size, hence more blocks and sets) decreases conflict misses. See also pages 12-13 of the slide set "Memory performance improvement techniques".
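The effect of associativity on conflict misses (Q6) can be demonstrated with a tiny cache model. The sketch below (our own construction, not exam material) replays an access pattern that thrashes a direct-mapped cache but fits comfortably in a 2-way set-associative cache of the same total capacity:

```python
# Tiny LRU cache model illustrating Q6: two blocks that map to the same set
# thrash a direct-mapped cache but coexist in a 2-way set-associative one.
from collections import OrderedDict

def misses(accesses, n_sets, ways):
    sets = [OrderedDict() for _ in range(n_sets)]   # OrderedDict keeps LRU order
    count = 0
    for block in accesses:
        s = sets[block % n_sets]                    # set index = block mod n_sets
        if block in s:
            s.move_to_end(block)                    # hit: refresh LRU position
        else:
            count += 1                              # miss
            if len(s) == ways:
                s.popitem(last=False)               # evict the least recently used
            s[block] = True
    return count

# Blocks 0 and 8 map to the same set in an 8-set direct-mapped cache.
pattern = [0, 8] * 4
print(misses(pattern, n_sets=8, ways=1))   # direct-mapped: every access misses
print(misses(pattern, n_sets=4, ways=2))   # 2-way, same capacity: only 2 cold misses
```

With 8 sets and 1 way all 8 accesses miss (pure conflict thrashing), while with 4 sets and 2 ways only the two compulsory misses remain, matching the feedback's point that higher associativity is the most effective cure for conflict misses.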