Print: First Name: ............ Solutions ............ Last Name: ............................. Student Number: ...............................................

University of Toronto
Faculty of Applied Science and Engineering
Final Examination
December 19, 2011
ECE552F Computer Architecture
Examiner: Natalie Enright Jerger

1. There are 6 questions and 14 pages. Do all questions. The total number of marks is 100. The duration of the test is 2.5 hours.
2. ALL WORK IS TO BE DONE ON THESE SHEETS! Use the back of the pages if you need more space. Be sure to indicate clearly if your work continues elsewhere.
3. Please put your final solution in the box if one is provided.
4. Clear and concise answers will be considered more favourably than ones that ramble. Do not fill space just because it exists!
5. You may use two 8.5x11 aid sheets.
6. You may use faculty-approved non-programmable calculators.
7. Always give some explanations or reasoning for how you arrived at your solutions to help the marker understand your thinking.

Marks: 1 [23], 2 [10], 3 [23], 4 [12], 5 [26], 6 [6], Total [100]

Page 1 of 14
1. Start with some short answer questions:

[4 marks] (a) Why is using test-and-test-and-set better than using test-and-set for synchronization?

Test-and-set (T&S) performs a write every time it tries to acquire the lock; each processor trying to acquire the lock will install the cache block containing the lock in the Modified state in its cache and invalidate all copies in other caches. Test-and-test-and-set (T&T&S) will first perform a read and spin locally on a shared copy of the line until the lock becomes available. T&T&S therefore issues fewer bus transactions.

[4 marks] (b) Why will having more reservation stations than functional units in Tomasulo's algorithm result in better performance than a 1-to-1 ratio of reservation stations to functional units?

Instructions can be dispatched to functional units out of program order, but reservation stations are allocated in the issue stage (in order). More reservation stations increase the likelihood that a ready instruction (one with both operands available) will be found.
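The contrast between the two lock primitives can be sketched in C11 atomics (a minimal sketch; the function names are illustrative, not from the exam):

```c
#include <stdatomic.h>
#include <assert.h>

// Test-and-set: every acquire attempt is an atomic read-modify-write,
// so each spinning core repeatedly takes the line in Modified state
// and invalidates all other copies.
void ts_lock(atomic_int *lock) {
    while (atomic_exchange(lock, 1)) { /* spin, writing every attempt */ }
}

// Test-and-test-and-set: spin on a plain read, so the line stays cached
// in Shared state locally; only attempt the atomic write once the lock
// looks free. Bus traffic under contention drops sharply.
void tts_lock(atomic_int *lock) {
    for (;;) {
        while (atomic_load(lock)) { /* spin locally on shared copy */ }
        if (!atomic_exchange(lock, 1))
            return;                 // write succeeded: lock acquired
    }
}

void unlock(atomic_int *lock) {
    atomic_store(lock, 0);
}
```

The inner read-only loop in `tts_lock` is exactly the "test" that T&S lacks: it generates no bus transactions while the lock is held.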
[4 marks] (c) What is fine-grained multithreading? What benefit does it achieve?

Fine-grained multithreading (FGMT) switches between multiple threads on a cycle-by-cycle basis in round-robin fashion. FGMT is able to hide the latency of short stalls such as load-to-use penalties and branch mispredictions.

[4 marks] (d) Give one advantage of implementing load-linked (ll)/store-conditional (sc) instructions instead of an exchange (exch) instruction.

LL/SC are more RISC-like and therefore easier to pipeline. EXCH implements both a load and a store in one instruction, making it more difficult to pipeline.
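The LL/SC retry structure can be sketched in C. This is a single-threaded simulation only (real hardware tracks a reservation on the cache line and fails the sc on an intervening coherence event); the point is that each half is an ordinary load or store, so each pipelines like a normal RISC instruction:

```c
#include <stdint.h>
#include <assert.h>

// Hypothetical single-threaded model of ll/sc. A saved snapshot stands
// in for the hardware reservation, just to show the retry structure.
static uint32_t *reserved_addr;
static uint32_t  reserved_val;

uint32_t ll(uint32_t *addr) {            // load-linked: read and reserve
    reserved_addr = addr;
    reserved_val  = *addr;
    return reserved_val;
}

int sc(uint32_t *addr, uint32_t val) {   // store-conditional
    if (addr != reserved_addr || *addr != reserved_val)
        return 0;                        // reservation lost: fail, no store
    *addr = val;
    return 1;
}

// An atomic exchange built from ll/sc: the software loop replaces the
// single combined load+store that makes EXCH hard to pipeline.
uint32_t exch_from_llsc(uint32_t *lock, uint32_t newval) {
    uint32_t old;
    do {
        old = ll(lock);
    } while (!sc(lock, newval));         // retry if reservation was lost
    return old;
}
```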
[2 marks] (e) What is the difference between coherence and consistency for multiprocessors?

Coherence creates a globally consistent (uniform) view of accesses to a single memory address. Consistency creates a globally consistent (uniform) view of accesses to all memory addresses.

[5 marks] (f) Can a multiprocessor built with dynamically scheduled (out-of-order) processors be sequentially consistent? Explain your reasoning.

Yes. Stores occur in program order at retirement, so they are not a problem. However, loads occur out of program order in the execute stage. In order to achieve sequential consistency, loads must be treated as speculative. Coherence events to speculative loads are treated as a mis-speculation and require the load (and subsequent instructions) to be re-executed, similar to a branch misprediction.
2. Dynamic Scheduling

This snapshot of the ROB, Map Table and Free List for a MIPS R10K-like dynamically scheduled scalar processor was taken as the st instruction is about to retire. In the table, h denotes the head of the ROB and t denotes the tail of the ROB.

ROB
    #   Insn                T     Told   S    D    X      C
    1   mult R0,R2 -> R5    PR8   PR5    c1   c2   c3-c8  c9
    2   add  R5,R2 -> R5    PR11  PR8    c2   c9   c10    c11
    3   add  R1,R4 -> R3    PR7   PR3    c3   c4   c5     c6
h   4   st   R3 -> [R2]     --    --     c4   c6   c7     c8
    5   div  R0,R2 -> R3    PR9   PR7    c5   c6   c7-
    6   sub  R4,R2 -> R4    PR10  PR6    c6   c7   c8     c10
t   7   ld   [R3] -> R3     PR4   PR9    c7

Map Table: R0 = PR0, R1 = PR1, R2 = PR2, R3 = PR4, R4 = PR10, R5 = PR11
Free List: PR8, PR5, PR3

[4 marks] (a) In which cycle will the store retire? Explain your reasoning.

Cycle = 14

Retirement occurs in order, and since this is a scalar processor, only one instruction can retire each cycle. Instruction 1 (mult) completes in cycle 9 and can retire at cycle 10, instruction 2 can retire at cycle 12, instruction 3 at cycle 13, and the store (instruction 4) at cycle 14.
[6 marks] (b) Now assume the store experiences a page fault. Fill in the tables to show what the state of the Map Table and Free List should be right before the processor proceeds to handle the page fault.

All instructions younger than the faulting store (5, 6, and 7) must be squashed: walking the ROB from the tail back toward the store, each instruction's rename is undone (Map[dest] = Told) and its destination register T is returned to the free list.

Map Table: R0 = PR0, R1 = PR1, R2 = PR2, R3 = PR7, R4 = PR6, R5 = PR11
Free List: PR8, PR5, PR3, PR10, PR9, PR4
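The rollback walk can be sketched in C (a sketch of the recovery algorithm, not the R10K hardware; the struct layout and names are invented for illustration, with register/PR numbers taken from the exam's snapshot):

```c
#include <assert.h>

#define NUM_ARCH 6   // R0..R5
#define ROB_SIZE 8

typedef struct {
    int valid;       // occupies a ROB slot
    int has_dest;    // stores have no destination register
    int dest;        // architectural destination (e.g. 3 for R3)
    int T, Told;     // new and previous physical mappings
} RobEntry;

// Undo renames from the tail back to (but not including) the faulting
// instruction. Walking youngest -> oldest unwinds nested renames of the
// same architectural register in the right order.
void rollback(RobEntry *rob, int tail, int fault, int *map,
              int *freelist, int *nfree) {
    for (int i = tail; i > fault; i--) {
        if (!rob[i].valid || !rob[i].has_dest) continue;
        map[rob[i].dest] = rob[i].Told;        // restore old mapping
        freelist[(*nfree)++] = rob[i].T;       // return T to free list
        rob[i].valid = 0;                      // squash the entry
    }
}
```

Running this over the snapshot (fault at entry 4, tail at 7) restores R3 to PR7 and R4 to PR6, and returns PR4, PR10 and PR9 to the free list, matching the answer above (free-list order is immaterial).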
3. Multiprocessor Issues

[12 marks] (a) Draw the state transition diagram for a snooping-based MOSI protocol. The states M, S, I correspond to the Modified, Shared and Invalid states discussed in class. An O (Owned) state has been added. The Owned state indicates that even though other shared copies of the block may exist, this cache (instead of main memory) is responsible for supplying the data when it observes a relevant bus transaction. Label the arcs in the transition diagram with the convention used in class: event => generated event. Example: R => BR denotes that the processor has experienced a read miss (R) and must generate a bus read (BR) to obtain the data. Please include any comments or assumptions to help the marker interpret your diagram.

Please use this notation (you may add additional abbreviations as necessary; be sure to clarify your notation for the marker): Bus Read: BR, Bus Write: BW, Bus Invalidate: BI, Read: R, Write: W, Send Data: SD, Writeback: WB.

Solution (transitions written as a list; "observed" events are bus transactions snooped from other processors):

From M: R, W => hit, stay in M. Observed BR => SD, go to O. Observed BW => SD, go to I. Eviction => WB, go to I.
From O: R => hit, stay in O. W => BI, go to M. Observed BR => SD, stay in O. Observed BW => SD, go to I. Observed BI => go to I.
From S: R => hit, stay in S. W => BI, go to M. Observed BR => stay in S. Observed BW or BI => go to I.
From I: R => BR, go to S. W => BW, go to M. Observed BR, BW, BI => stay in I.
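The next-state function of the transitions above can be written out as a sketch in C (event names are invented for illustration; actions such as SD and WB are omitted, only the state change is modeled):

```c
#include <assert.h>

typedef enum { INV, SHD, OWN, MOD } State;
// PR_R/PR_W: this processor's read/write; BUS_RD/BUS_WR/BUS_INV:
// observed bus transactions from other processors; EVICT: replacement.
typedef enum { PR_R, PR_W, BUS_RD, BUS_WR, BUS_INV, EVICT } Event;

State mosi_next(State s, Event e) {
    switch (s) {
    case MOD:
        if (e == BUS_RD) return OWN;            // supply data, keep ownership
        if (e == BUS_WR || e == EVICT) return INV;
        return MOD;                             // local R/W hit
    case OWN:
        if (e == PR_W) return MOD;              // issue BI (upgrade)
        if (e == BUS_WR || e == BUS_INV || e == EVICT) return INV;
        return OWN;                             // local R hit; BUS_RD: supply data
    case SHD:
        if (e == PR_W) return MOD;              // issue BI
        if (e == BUS_WR || e == BUS_INV || e == EVICT) return INV;
        return SHD;                             // local R hit; BUS_RD: no action
    case INV:
        if (e == PR_R) return SHD;              // issue BR
        if (e == PR_W) return MOD;              // issue BW
        return INV;                             // observed events: no action
    }
    return INV;
}
```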
[3 marks] (b) One proposed solution for the problem of false sharing is to add a valid bit per word (in a multi-word cache block). This would allow the coherence protocol to invalidate a word without removing the entire block, letting a processor keep a portion of a block in its cache while another processor writes a different portion of the block.

i. First, explain what is meant by false sharing.

False sharing occurs when two processors access the same cache line but do not access the same word within that cache line.

[8 marks] ii. Give two extra complications that are introduced into the basic (MSI) snooping cache coherence protocol if this capability is included. (Any two of the following:)

1. Coherence state must be tracked at a per-word granularity instead of per-block (extra bits).
2. The memory system must be able to handle narrow (partial-block) writebacks.
3. On a cache access, both the tag and the offset's valid bit must be checked.
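The software-side fix for false sharing (as opposed to the per-word-valid-bit hardware fix above) is simply padding, sketched here assuming a 64-byte line:

```c
#include <stdalign.h>
#include <stddef.h>
#include <assert.h>

#define LINE 64  // assumed cache line size

// Both counters share one cache line: a write to 'a' on one core
// invalidates the line holding 'b' in the other core's cache even
// though the two threads never touch the same word.
struct counters_bad  { long a; long b; };

// Alignment gives each counter its own line, so the ping-ponging
// disappears (at the cost of wasted space).
struct counters_good {
    alignas(LINE) long a;
    alignas(LINE) long b;
};
```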
4. Caches

Consider a fully associative 128-byte instruction cache with 4-byte blocks (every block can hold one instruction), i.e., 32 blocks in total.

[3 marks] (a) Consider an LRU replacement policy. What is the asymptotic instruction miss rate for a 16-instruction loop with a very large number of iterations?

Miss rate (16-instruction loop) = 0%. The 16 instructions fit entirely in the 32-block cache, so after the first iteration every access hits.

[3 marks] (b) What is the asymptotic instruction miss rate for a 48-instruction loop with a very large number of iterations?

Miss rate (48-instruction loop) = 100%. The loop is larger than the cache, and with LRU each new instruction evicts exactly the block that will be needed again soonest, so every access misses.
[6 marks] (c) If the cache replacement policy is changed to most recently used (MRU), where the most recently accessed line is selected for replacement, which loop (16-instruction or 48-instruction) would benefit from this policy? Explain your reasoning.

The first loop (part a) already fits within the cache and would not be affected by the new replacement policy. The second loop (part b) benefits: its miss rate drops to 16/48 (33%). The first 32 instructions fill the empty cache. Each of the remaining 16 instructions misses and replaces the most recently used block: instruction 32 replaces 31, instruction 33 replaces 32, and so on. On each subsequent iteration the loop hits on the 32 retained instructions and misses on the other 16 (the window of misses shifts by one each iteration, but its size stays 16), giving 16 misses per 48 instructions asymptotically.
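The three miss rates can be checked with a small simulator; this is a sketch of a fully associative cache with one instruction per block (32 blocks = 128 B / 4 B), parameterized by replacement policy:

```c
#include <assert.h>

#define BLOCKS 32

typedef struct { int tag; long last_use; int valid; } Line;

// Runs 'iters' passes over a loop of 'n' instructions and returns the
// miss count of the final pass (the asymptotic behaviour).
// mru = 0 selects LRU victims, mru = 1 selects MRU victims.
int misses_last_pass(int n, int iters, int mru) {
    Line c[BLOCKS] = {0};
    long t = 0;
    int last = 0;
    for (int it = 0; it < iters; it++) {
        last = 0;
        for (int pc = 0; pc < n; pc++) {
            int hit = -1;
            for (int i = 0; i < BLOCKS; i++)
                if (c[i].valid && c[i].tag == pc) { hit = i; break; }
            if (hit < 0) {
                last++;
                int victim = -1;
                for (int i = 0; i < BLOCKS; i++)        // fill empty lines first
                    if (!c[i].valid) { victim = i; break; }
                if (victim < 0) {
                    victim = 0;                          // else LRU or MRU victim
                    for (int i = 1; i < BLOCKS; i++) {
                        int newer = c[i].last_use > c[victim].last_use;
                        if (mru ? newer : !newer) victim = i;
                    }
                }
                c[victim].tag = pc;
                c[victim].valid = 1;
                hit = victim;
            }
            c[hit].last_use = ++t;                       // update recency
        }
    }
    return last;
}
```

Running it reproduces the answers: 0 misses for the 16-instruction loop under LRU, 48/48 for the 48-instruction loop under LRU, and 16/48 under MRU.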
5. Pipelining

[2 marks] (a) Consider a single-cycle CPU implementation. When the stages are split by functionality, the stages do not require exactly the same amount of time. The original machine had a clock cycle of 7 ns. After the stages were split, the measured times were F (Fetch): 1 ns; D (Decode): 1.5 ns; E (Execute): 1 ns; M (Memory): 2 ns; W (Writeback): 1.5 ns. The total pipeline register delay is 0.1 ns.

i. What is the clock cycle time of the 5-stage pipelined machine?

The clock cycle is determined by the longest stage plus the pipeline register delay = 2.0 ns + 0.1 ns.

Cycle time = 2.1 ns

[3 marks] ii. If the pipelined machine had an infinite number of stages (the amount of work per stage can be divided into infinitely small chunks), what would its speedup be over the single-cycle machine (ignore any stall cycles)?

If the latency per stage goes to zero, the pipeline register delay is all that remains. Speedup = 7 ns (original machine) / 0.1 ns (pipeline register delay).

Speedup = 70
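The arithmetic above can be captured in a one-line helper (a sketch; times in ns):

```c
#include <assert.h>

// Pipelined clock = slowest stage + pipeline register overhead.
double pipelined_cycle(const double *stage, int n, double reg_delay) {
    double longest = 0.0;
    for (int i = 0; i < n; i++)
        if (stage[i] > longest)
            longest = stage[i];
    return longest + reg_delay;
}
```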
[3 marks] (b) Consider the 5-stage single-issue (scalar) in-order pipeline (F,D,X,M,W) from class with full bypassing support.

i. List the bypassing paths required for full bypassing support. Use the notation FromStage-ToStage for each path. Be sure to indicate if multiple paths are needed between the same two stages (if multiple inputs in the same stage are forwarded to from the same stage, this counts as multiple paths).

2 MX, 2 WX, 1 WM = 5 total paths.

[12 marks] ii. How many bypass paths are needed for a 5-stage N-wide in-order superscalar processor to have full bypass support? Place your answer (in terms of N) in the given box. You must justify/explain your answer. (Hint: If you are having trouble generalizing to N, start with a 2-wide processor.)

Each scalar path generalizes to N producers times N consumers: the 2 MX paths become 2·N·N, the 2 WX paths become 2·N·N, and the 1 WM path becomes N·N, for 2N² + 2N² + N² total.

# of bypass paths = 5N²
[6 marks] (c) Consider a deeply-pipelined processor for which we implement a branch-target buffer (BTB) for the conditional branches only. Assume that the misprediction penalty is always four cycles and the buffer miss penalty is always three cycles. Assume a 90% hit rate, 90% accuracy, and 15% conditional branch frequency. How much faster is the processor with the branch-target buffer versus a processor that has a fixed two-cycle conditional branch penalty? (A processor with a fixed two-cycle conditional branch penalty does no branch prediction and stalls when it encounters a conditional branch.) Assume the base CPI without conditional branch stalls is one.

Stall cycles per conditional branch with the BTB:
0.90 x 0.90 x 0 cycles (hit and accurate) + 0.10 x 3 cycles (buffer miss) + 0.90 x 0.10 x 4 cycles (hit but mispredicted) = 0.66

CPI with BTB = 1 + 0.15 x 0.66 = 1.099
CPI without BTB = 1 + 0.15 x 2 = 1.3
Speedup = 1.3 / 1.099

Speedup = 1.183
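The CPI computation above can be checked numerically (a sketch; the function name and parameter order are invented for illustration):

```c
#include <assert.h>

// CPI with a BTB for conditional branches: base CPI of 1 plus the
// expected branch stall, weighted by conditional branch frequency.
double btb_cpi(double hit_rate, double accuracy, double br_freq,
               double miss_penalty, double mispred_penalty) {
    double stall = hit_rate * (1.0 - accuracy) * mispred_penalty  // hit, mispredicted
                 + (1.0 - hit_rate) * miss_penalty;               // buffer miss
    return 1.0 + br_freq * stall;   // hit-and-accurate contributes 0 cycles
}
```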
[6 marks] 6. A common transformation required in graphics processors is square root. Suppose FP square root (FPSQR) is responsible for 20% of the execution time of a critical graphics benchmark. Proposal A is to enhance the FPSQR hardware and speed up this operation by a factor of 10. Proposal B is just to try to make all FP instructions in the graphics processor run faster by a factor of 1.6; FP instructions are responsible for half of the execution time for the application. Assuming the design effort/time is similar for Proposal A and Proposal B, which proposal would you work on? Justify your answer.

Speedup(A) = 1 / ((1 - 0.2) + 0.2/10) = 1/0.82 = 1.22
Speedup(B) = 1 / ((1 - 0.5) + 0.5/1.6) = 1/0.8125 = 1.23

Proposal B will give you slightly better performance.
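Both proposals are instances of Amdahl's law, which can be checked with a tiny helper (a sketch):

```c
#include <assert.h>

// Amdahl's law: overall speedup when a fraction 'frac' of execution
// time is accelerated by a factor 's'.
double amdahl(double frac, double s) {
    return 1.0 / ((1.0 - frac) + frac / s);
}
```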