University of Toronto Faculty of Applied Science and Engineering
Print: First Name: Solution    Last Name:    Student Number:

University of Toronto Faculty of Applied Science and Engineering
Final Examination, December 16, 2013
ECE552F Computer Architecture
Examiner: Natalie Enright Jerger

1. There are 6 questions and 12 pages. Do all questions. The total number of marks is 101. The duration of the test is 2.5 hours.
2. ALL WORK IS TO BE DONE ON THESE SHEETS! Use the back of the pages if you need more space. Be sure to indicate clearly if your work continues elsewhere.
3. Please put your final solution in the box if one is provided.
4. Clear and concise answers will be considered more favourably than ones that ramble. Do not fill space just because it exists!
5. You may use two 8.5x11 aid sheets.
6. You may use faculty approved non-programmable calculators.
7. Always give some explanations or reasoning for how you arrived at your solutions to help the marker understand your thinking.

Marks: 1 [18], 2 [25], 3 [18], 4 [16], 5 [9], 6 [15], Total [101]

Page 1 of 12
1. From Lab

[2 marks] (a) Multiple virtual networks are often used for cache coherence. Briefly explain what purpose these virtual networks serve. (Lab 6)

Multiple virtual networks prevent protocol-level deadlock by ensuring that different coherence message types do not block each other in the network or its queues. This prevents cyclic dependences from forming between in-flight coherence messages.

[5 marks] (b) Why are transient states needed in coherence protocols? Give one example of a transient state you implemented in Lab 6 and explain why this state was needed. (Lab 6)

1st part: Transient states are needed because transitions between protocol states are not atomic.
2nd part: Many valid answers.

[3 marks] (c) Give an example of a data structure where a stride prefetcher would work perfectly but a next-line prefetcher would fail (would not produce useful prefetches). (Lab 5)

Many possible answers. Consider:

    struct a {
        int x;
        int y[31];   // assumes 16 ints per line
    } array[N];

    for (int i = 0; i < N; i++) {
        array[i].x = i;
    }
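The struct example above can be checked with a small sketch (illustrative Python, not part of the exam; `LINE`, `STRIDE`, and `useful` are made-up names): with 64B lines and a 128B struct, a next-line prefetcher only ever fetches the unused `y` lines, while a stride prefetcher fetches the next `x` line.

```python
# Hypothetical sketch: count useful prefetches for next-line vs. stride
# prefetching on the access pattern of the loop over array[i].x.

LINE = 64         # assumed cache line size in bytes
STRIDE = 128      # sizeof(struct a) = 32 ints * 4 bytes

accesses = [i * STRIDE for i in range(64)]   # addresses of array[i].x

def useful(prefetches, demand):
    """Count prefetched lines that some demand access actually touches."""
    demand_lines = {a // LINE for a in demand}
    return sum(1 for p in prefetches if p // LINE in demand_lines)

# Next-line: on a miss to line L, prefetch line L+1 (always a y[] line here).
next_line = [a + LINE for a in accesses]
# Stride: prefetch addr + observed stride (the next array element's x).
stride = [a + STRIDE for a in accesses]

print(useful(next_line, accesses), useful(stride, accesses))
```

The next-line prefetcher scores zero useful prefetches; the stride prefetcher's prefetches are all useful except the one past the end of the array.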
[3 marks] (d) If two instructions compete for a resource in Tomasulo in the same cycle, which instruction would you choose to access the resource first? Why? (Lab 4)

The older instruction. This prevents starvation and is likely better for performance, since the older instruction may have dependent instructions waiting on it.

[5 marks] (e) Write a short microbenchmark that you would use to validate that you are correctly tracking load-to-use dependences in a 6-stage in-order pipeline. The stages of this pipeline are: Fetch, Decode, Execute1, Execute2, Memory, Writeback. Operands are needed at the start of Execute1 to compute the correct value. Correct syntax is not important for your microbenchmark, but it must be clear what your code is doing (use comments as needed). (Lab 1)

Many correct answers. Consider:

    LOOP: ADDI R1, 1 -> R1
          LW   [R3] -> R2
          ADD  R2, R3 -> R4   // 2-cycle stall, twice
          LW   [R3] -> R2
          ADD  R2, R3 -> R4
          LW   [R3] -> R2
          SUB  R5, R5 -> R5
          ADD  R2, R3 -> R4   // 1-cycle stall, once
          LW   [R3] -> R2
          SUB  R5, R5 -> R5
          SUB  R6, R6 -> R6
          ADD  R2, R3 -> R4   // no stall
          BNE  R1, R7, LOOP
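The stall counts in the microbenchmark's comments follow from the pipeline timing in the question (load value available at the end of Memory, consumer needs it at the start of Execute1). A minimal sketch of that model, with `load_use_stalls` as an illustrative name:

```python
# Sketch of the load-to-use stall model: a dependent instruction that sits
# 1, 2, or 3+ slots after the load stalls 2, 1, or 0 cycles respectively,
# assuming a forward from the end of Memory to the start of Execute1.

def load_use_stalls(distance):
    """distance = number of instructions between issue of the load and the
    first use, counting the use itself (1 = use immediately follows load)."""
    return max(0, 3 - distance)

print([load_use_stalls(d) for d in (1, 2, 3)])
```

This reproduces the comments in the microbenchmark: back-to-back LW/ADD stalls 2 cycles, one intervening instruction leaves 1 stall, two leave none.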
2. Multiprocessors

(a) Consider the following code executed on processors P1 and P2:

    Initially: A = B = 0

    P1:          P2:
    A = 1        Print B
    B = 1        Print A

[5 marks] Considering a sequentially consistent memory model, list all valid combinations that can be printed by P2. If certain combinations are not possible, provide a brief explanation as to why not.

Possible combinations (B, A): (0, 0), (0, 1), (1, 1)
Not possible: (1, 0). If B prints 1, then A must also print 1, since P1's update to B occurs after its update to A.

[7 marks] (b) Using load locked (LL) and store conditional (SC), write the assembly code to implement an atomic compare and swap: CAS Rx, Ry, X, where the value of Rx is first compared to the value of X and, if they are equal, the values in Ry and X are swapped. X is located in memory and the address of X is in R3.

    CAS:  LL   R1, [R3]       // R1 = X
          ADD  Ry, R0 -> R2   // R2 = Ry (preserved for the not-equal path)
          BNE  R1, Rx, exit   // no swap if X != Rx
          ADD  R1, R0 -> R2   // R2 = old X
          ADD  Ry, R0 -> R1   // R1 = Ry
    exit: SC   R1, [R3]       // attempt to store R1 back to X
          BEQZ R1, CAS        // retry if SC failed
          ADD  R2, R0 -> Ry   // Ry = old X (or unchanged Ry if no swap)
          return
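A functional sketch of the semantics the LL/SC sequence implements (illustrative Python, ignoring the retry loop, which only matters under contention):

```python
# CAS Rx, Ry, X: compare Rx to the memory word X; if equal, swap Ry and X.
# Ry always ends up holding whatever the assembly copied through R2.

def cas(mem, addr, rx, ry):
    old = mem[addr]          # LL R1, [R3]
    if old == rx:            # falls through BNE when equal
        mem[addr] = ry       # SC R1, [R3] with R1 = Ry
        return old           # new Ry = old X
    return ry                # no match: memory and Ry unchanged

mem = {0x100: 5}
ry = cas(mem, 0x100, 5, 9)   # match: X becomes 9, Ry receives old X
print(mem[0x100], ry)
```

On a match this prints the swapped values; on a mismatch both the memory word and Ry are left as they were, which is what the initial copy of Ry into R2 guarantees.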
(c) On the next page, you are given cache, memory and coherence state. This represents the initial state for each subpart of this question. Do NOT use your answer from part i in part ii or part iii; each subpart is independent.

This multiprocessor uses a directory coherence protocol and has 4 processors. Each processor has a direct-mapped cache with 2 sets; each set holds two words. The 4th bit in the address indicates the set. To simplify the format, the cache address tag contains the full address, and each word shows only two hex characters with the least significant word on the right. The directory coherence states are M (modified), S (shared) and U (uncached), while the cache has states M, S and I.

Each part of this question signifies a sequence of one or more CPU operations of the form:

    P#: op address [value]

where P# designates the CPU (P0, P1, P2, P3), op is the CPU operation (e.g. read or write), address denotes the memory address, and value indicates the new word to be assigned on a write operation.

What is the resulting state (coherence state, address tags and data) of the caches and memory (including directory state) after the given sequence of actions? Show only the blocks that change; for example, P0.Set0: (S, 0x110, AB 33) indicates that CPU P0's set 0 has the final state of S, address of 0x110 and data contents of 33 (address 0x110) and AB (address 0x114). Use a similar format to show changes in the directory. Also, what value is sent to the processor by a read operation? Write comments to help the marker understand your thinking.
Directory and Memory contents

    Address  State  Sharers    Data
    0x100    U                 EF 01
    0x108    S      1
    0x110    S      0, 1, 3    AB 33
    0x118    M
    0x120    S
    0x128    M
    0x130    U
    0x138    U

Cache Contents

    P0     State  Addr   Data      P1     State  Addr   Data
    Set 0  S      0x110  AB 33     Set 0  S      0x110  AB 33
    Set 1  M      0x               Set 1  S      0x

    P2     State  Addr   Data      P3     State  Addr   Data
    Set 0  S      0x                Set 0  S      0x110  AB 33
    Set 1  S      0x                Set 1  M      0x

[2 marks] i. P3: write 0x

    Dir: (S, 0x110, (0, 1), AB 33)
    Dir: (M, 0x130, 3, 78 11)
    P3.Set0: (M, 0x130, 78 84)

[6 marks] ii. P0: read 0x128; P1: read 0x128

    Dir: (U, 0x118, -, 34 04)
    P0.Set0: (S, 0x128, 03 02)
    Dir: (S, 0x108, 2, 10 20)
    Dir: (S, 0x128, (0, 1, 3), 03 02)
    P1.Set0: (S, 0x128, 03 02)
    Returns 02 to both P0 and P1.

[5 marks] iii. P1: write 0x; P1: read 0x110; P2: read 0x110

    Dir: (S, 0x110, (1, 2), AB 11)
    Dir: (U, 0x120, -, 56 22)
    P0.Set0: (I, 0x110, AB 33)
    P3.Set0: (I, 0x110, AB 33)
    P1.Set0: (S, 0x110, AB 11)
    P2.Set0: (S, 0x110, AB 11)
    Reads return 11 to both P1 and P2.
3. Dynamic Scheduling

[18 marks] (a) Assume that you have a single-issue processor that uses MIPS R10K dynamic scheduling with a re-order buffer, as discussed in lecture. There are 3 reservation stations (Int 1, Int 2, Int 3) for integer operations and 3 integer execution units. Integer units are capable of doing addition, subtraction and multiplication. There are 2 load reservation stations, 1 store reservation station and 1 CDB. The ROB is initially empty and has 32 entries. Addition and subtraction take 2 cycles and multiplication takes 6 cycles. They write the CDB in the cycle after execution is complete. A reservation station is available to a new instruction on the cycle after the instruction in the reservation station writes the CDB. If multiple instructions are ready to write the CDB in the same cycle, priority is given to the instruction dispatched earliest. Instructions waiting for operands can complete issue in the same cycle that the data appears on the CDB. Memory instructions (load and store) take 4 cycles to compute the address and access memory. The address calculation does not use the integer units. Memory instructions write the CDB in the cycle after they finish accessing memory.

Assume you have 10 physical registers. Initially R1-R6 are mapped to P1-P6, and P7 through P10 are free. A physical register can be reused by another instruction the cycle after it is freed. All reservation stations, execution units and ROB entries are free/available at the start of this code sequence. Consider the following code:

    LD   [R2+0] -> R1
    MULT R1, R2 -> R4
    LD   [R5] -> R6
    ADD  R6, R2 -> R6
    ST   R6 -> [R2+8]
    ADD  R3, R4 -> R1

Complete the following table for this code sequence. For each column, record the cycle at which the instruction completes this stage. Also fill in the old (T_old) and new (T) register mapping for each instruction. Write comments to help the marker; clearly indicate what you are doing.
    Instruction         D   S   X   C   R   T    T_old   Comment
    LD   [R2+0] -> R1                       P7   P1
    MULT R1, R2 -> R4                       P8   P4
    LD   [R5] -> R6                         P9   P6
    ADD  R6, R2 -> R6                       P10  P9
    ST   R6 -> [R2+8]                       -    -
    ADD  R3, R4 -> R1                       P1   P7
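The T / T_old columns follow mechanically from R10K-style renaming. A minimal sketch (illustrative Python; the timing assumption that P1 is back on the free list before the final ADD renames is mine, forced here by having only 10 physical registers):

```python
# Each destination register takes the next free physical register (T);
# T_old records the previous mapping, freed when the instruction commits.
# The store has no register destination, so it allocates nothing.

map_table = {f"R{i}": f"P{i}" for i in range(1, 7)}   # R1-R6 -> P1-P6
free_list = ["P7", "P8", "P9", "P10",
             "P1"]   # P1 (first LD's T_old) assumed freed by its commit

dests = ["R1", "R4", "R6", "R6", None, "R1"]   # destinations, in order
renames = []
for d in dests:
    if d is None:                         # ST writes no register
        renames.append((None, None))
        continue
    t = free_list.pop(0)                  # T: next free physical register
    renames.append((t, map_table[d]))     # (T, T_old)
    map_table[d] = t                      # update the map table

print(renames)
```

Running this reproduces the (T, T_old) pairs in the table, including the reuse of P1 by the last ADD.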
4. Caches

[4 marks] (a) You're given a benchmark and a cache simulator, but you cannot modify either. The simulator outputs the total number of cache misses in the benchmark. As inputs to the simulator, you can configure the cache size (4B to infinite), the line size (4B to 128B) and the associativity (1-way to fully). You are asked to estimate how many capacity misses there would be if you were to use a 4-way 16kB cache with 64B lines. Describe how you would use the simulator to do this.

A: a fully associative 16kB cache with 64B lines gives cold + capacity misses.
B: an infinite-size, fully associative cache with 64B lines gives cold misses only.
Capacity misses = A - B
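The A - B method can be sketched with a tiny fully associative LRU simulator (illustrative Python; the trace and the `misses` helper are made up for demonstration, not from the exam):

```python
# Run the same trace through a fully associative LRU cache twice: once at
# the target capacity (cold + capacity misses) and once effectively
# infinite (cold misses only); the difference is the capacity misses.
from collections import OrderedDict

def misses(trace, capacity_lines, line=64):
    cache, miss = OrderedDict(), 0
    for addr in trace:
        l = addr // line
        if l in cache:
            cache.move_to_end(l)          # refresh LRU position
        else:
            miss += 1
            cache[l] = True
            if len(cache) > capacity_lines:
                cache.popitem(last=False)  # evict least recently used
    return miss

trace = [(i % 512) * 64 for i in range(2048)]   # loops over 512 lines
A = misses(trace, 16 * 1024 // 64)   # 16kB fully associative
B = misses(trace, 10**9)             # effectively infinite
print(A, B, A - B)                   # capacity misses = A - B
```

On this trace the 256-line cache thrashes (every access misses), so A = 2048, the cold misses B = 512, and the capacity-miss estimate is 1536.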
[12 marks] (b) Consider the following trace of accesses to a set. Each letter represents a cache block:

    a b c a d e f f b e f e a b g c f e

Assume a fully associative cache that can hold 3 cache blocks. The cache is cold at the start of the trace. Fill in the table below to show the contents of the cache after the access indicated in that row for each of the three replacement policies: LRU, MRU (most recently used), Optimum. The first row has been filled in for you. What is the miss rate with each replacement policy: LRU, MRU and Optimum? Also indicate if the access is a miss (enter Y/N).

    Block    LRU contents  Miss?   MRU contents  Miss?   Optimum       Miss?
    a        a, -, -       Y       a, -, -       Y       a, -, -       Y
    b        a, b          Y       a, b          Y       a, b          Y
    c        a, b, c       Y       a, b, c       Y       a, b, c       Y
    a        a, b, c       N       a, b, c       N       a, b, c       N
    d        a, c, d       Y       b, c, d       Y       a, b, d      Y
    e        a, d, e       Y       b, c, e       Y       a, b, e      Y
    f        d, e, f       Y       b, c, f       Y       b, e, f      Y
    f        d, e, f       N       b, c, f       N       b, e, f      N
    b        e, f, b       Y       b, c, f       N       b, e, f      N
    e        e, f, b       N       c, f, e       Y       b, e, f      N
    f        e, f, b       N       c, f, e       N       b, e, f      N
    e        e, f, b       N       c, f, e       N       b, e, f      N
    a        e, f, a       Y       c, f, a       Y       b, f, a      Y
    b        e, a, b       Y       c, f, b       Y       b, f, a      N
    g        a, b, g       Y       c, f, g       Y       b, f, g      Y
    c        b, g, c       Y       c, f, g       N       f, g, c      Y
    f        g, c, f       Y       c, f, g       N       f, g, c      N
    e        c, f, e       Y       c, g, e       Y       f, c, e      Y

    Miss rate:  13/18 = 72.2%      11/18 = 61.1%         10/18 = 55.6%
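The miss counts in the table can be reproduced with a short simulation (illustrative Python; `simulate` is a made-up helper, and "OPT" uses Belady's rule of evicting the block reused furthest in the future):

```python
# Simulate LRU, MRU, and optimal replacement on the given trace with a
# 3-block fully associative cache; the list keeps most-recent at the end.

trace = list("abcadeffbefeabgcfe")

def simulate(policy):
    cache, miss = [], 0
    for i, blk in enumerate(trace):
        if blk in cache:
            cache.remove(blk); cache.append(blk)   # refresh recency
            continue
        miss += 1
        if len(cache) == 3:
            if policy == "LRU":
                victim = cache[0]                  # least recently used
            elif policy == "MRU":
                victim = cache[-1]                 # most recently used
            else:                                  # OPT (Belady)
                future = trace[i + 1:]
                victim = max(cache, key=lambda b: future.index(b)
                             if b in future else len(trace))
            cache.remove(victim)
        cache.append(blk)
    return miss

print([simulate(p) for p in ("LRU", "MRU", "OPT")])
```

This yields 13, 11, and 10 misses respectively, matching the miss rates in the table.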
5. Pipelining

(a) Consider a single-cycle CPU implementation. When the stages are split by functionality, the stages do not require exactly the same amount of time. The original machine had a clock cycle of 8 ns. After the stages were split, the measured times were: F (Fetch) 2.0 ns; D (Decode) 1.5 ns; E (Execute) 1.4 ns; M (Memory) 2.1 ns; W (Writeback) 1.0 ns. The total pipeline register delay is 0.2 ns.

[2 marks] i. What is the clock cycle time of the 5-stage pipelined machine?

Cycle time = 2.1 ns + 0.2 ns = 2.3 ns

[2 marks] ii. If you could split one of the 5 stages into two stages, which stage would you select and why?

The longest stage (Memory), because this would reduce the cycle time.

[2 marks] iii. What negative impact on performance might arise from splitting 1 stage into two stages?

The pipeline is deeper, so flushing the pipeline becomes more expensive; the cost of RAW hazards increases.

[3 marks] iv. If the pipelined machine had an infinite number of stages (the amount of work per stage can be divided into infinitely small chunks), what would its speedup be over the single-cycle machine (ignore any stall cycles)?

Speedup = 8 ns / 0.2 ns = 40
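The arithmetic behind parts i and iv can be written out directly (illustrative Python; variable names are mine):

```python
# Pipelined cycle time = longest stage + register overhead; with infinitely
# many stages, only the register overhead remains, bounding the speedup.

stages = {"F": 2.0, "D": 1.5, "E": 1.4, "M": 2.1, "W": 1.0}  # ns
overhead = 0.2                        # total pipeline register delay, ns
single_cycle = 8.0                    # original clock cycle, ns

cycle = max(stages.values()) + overhead
speedup_infinite = single_cycle / overhead
print(round(cycle, 2), speedup_infinite)
```

This confirms the 2.3 ns cycle time and the limiting speedup of 40.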
6. Control flow

(a) Consider the following code, where a and b can each have a value of 0 or 1. For this code, a branch is considered taken (T) if the code in the if clause would execute and not taken (N) otherwise.

    int my_func(int a, int b) {
        int c = 0;
        int d = 1;
        if (a == 0) {   // 1st if
            c = 1;
        }
        if (b == 0) {   // 2nd if
            d = 0;
        }
        if (c == d) {   // 3rd if
            return 1;
        } else {
            return 0;
        }
    }

[7 marks] i. Explain the branch prediction mechanism you would use to accurately predict the 3rd if statement. Your explanation could include a discussion of local history, global history and PC indexing bits. Note: there are many other branch instructions in this program besides those given in the code above.

The 3rd if statement is taken if and only if the previous two if statements have different outcomes. We need global history to correlate the outcomes of B1 and B2 with the outcome of B3. For example, you could use 2 bits of global history and then use PC bits to index into private predictor tables to minimize aliasing with other branches.

    Values    Branch outcomes
    a  b      B1  B2  B3
    0  0      T   T   N
    0  1      T   N   T
    1  0      N   T   T
    1  1      N   N   N
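The correlation argument can be demonstrated with a toy predictor (illustrative Python; `predict`, `train`, and `my_func_branches` are made-up names, and counters start weakly taken, matching the assumption in part ii below): a table of 2-bit counters indexed by the outcomes of the previous two branches predicts B3 perfectly once trained.

```python
# Two bits of global history (B1, B2) index a table of 2-bit saturating
# counters; B3 is taken iff B1 and B2 differ, so each history pattern maps
# to a single outcome and the counters converge.

counters = {}                        # (B1, B2) history -> 2-bit counter

def predict(hist):
    return counters.get(hist, 2) >= 2          # init weakly taken (2)

def train(hist, taken):
    c = counters.get(hist, 2)
    counters[hist] = min(3, c + 1) if taken else max(0, c - 1)

def my_func_branches(a, b):
    b1, b2 = (a == 0), (b == 0)                # 1st and 2nd if outcomes
    b3 = b1 != b2                              # 3rd if: taken iff they differ
    return (b1, b2), b3

for _ in range(2):                             # two training passes
    for a in (0, 1):
        for b in (0, 1):
            hist, b3 = my_func_branches(a, b)
            train(hist, b3)

ok = all(predict(my_func_branches(a, b)[0]) == my_func_branches(a, b)[1]
         for a in (0, 1) for b in (0, 1))
print(ok)
```

A purely local (per-branch) predictor cannot do this: B3's own history alternates with the inputs, so only global history makes it predictable.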
[4 marks] ii. How many times would you need to call my_func (with different values of a and b) in order to fully train your predictor? Clearly state any assumptions you make to arrive at your answer.

There are four possible combinations of (a, b). Assuming there is no aliasing, the answer depends on the initial state of the 2-bit saturating counters. If each counter starts weakly taken, the predictor needs to see each input combination that results in a taken outcome, (a=0, b=1) and (a=1, b=0), once, and each combination that results in a not-taken outcome twice (weakly taken -> weakly not taken -> strongly not taken). This is 6 calls in total to guarantee correct predictions. Multiple correct answers.

[4 marks] (b) Calculate the CPI for a 6-stage pipelined processor where the branch prediction is verified in stage 4 and the branch target is calculated in stage 2 (there is no BTB). 25% of instructions are branches. 40% of branches are taken. Your branch direction predictor has an accuracy of 75%. 40% of correctly predicted branches are taken. There are no data hazards.

Taken: 40%; 1 cycle to get target (correct), 3 cycles penalty (incorrect)
Not taken: 60%; 0 cycles (correct), 3 cycles penalty (incorrect)

CPI = 1 + 0.25 x [0.75 x (0.4 x 1 + 0.6 x 0) + 0.25 x 3] = 1 + 0.25 x 1.05 = 1.2625
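The CPI calculation spelled out as code (illustrative Python; variable names are mine, and the per-case penalties are the ones listed above):

```python
# CPI = base 1 + branch fraction * expected penalty per branch, where a
# correct taken prediction costs 1 cycle (target known after stage 2),
# a correct not-taken prediction costs 0, and a misprediction costs 3
# cycles (resolved in stage 4).

branch_frac = 0.25
acc = 0.75                  # direction predictor accuracy
taken_if_correct = 0.40     # of correctly predicted branches, 40% taken

penalty = (acc * (taken_if_correct * 1 + (1 - taken_if_correct) * 0)
           + (1 - acc) * 3)
cpi = 1 + branch_frac * penalty
print(round(cpi, 4))
```

This gives the 1.2625 CPI derived above.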
More informationA Cache Hierarchy in a Computer System
A Cache Hierarchy in a Computer System Ideally one would desire an indefinitely large memory capacity such that any particular... word would be immediately available... We are... forced to recognize the
More informationPage 1. CISC 662 Graduate Computer Architecture. Lecture 8 - ILP 1. Pipeline CPI. Pipeline CPI (I) Pipeline CPI (II) Michela Taufer
CISC 662 Graduate Computer Architecture Lecture 8 - ILP 1 Michela Taufer Pipeline CPI http://www.cis.udel.edu/~taufer/teaching/cis662f07 Powerpoint Lecture Notes from John Hennessy and David Patterson
More informationChapter 4 The Processor 1. Chapter 4D. The Processor
Chapter 4 The Processor 1 Chapter 4D The Processor Chapter 4 The Processor 2 Instruction-Level Parallelism (ILP) Pipelining: executing multiple instructions in parallel To increase ILP Deeper pipeline
More informationCS433 Midterm. Prof Josep Torrellas. October 19, Time: 1 hour + 15 minutes
CS433 Midterm Prof Josep Torrellas October 19, 2017 Time: 1 hour + 15 minutes Name: Instructions: 1. This is a closed-book, closed-notes examination. 2. The Exam has 4 Questions. Please budget your time.
More informationCSE 490/590 Computer Architecture Homework 2
CSE 490/590 Computer Architecture Homework 2 1. Suppose that you have the following out-of-order datapath with 1-cycle ALU, 2-cycle Mem, 3-cycle Fadd, 5-cycle Fmul, no branch prediction, and in-order fetch
More informationEE382A Lecture 7: Dynamic Scheduling. Department of Electrical Engineering Stanford University
EE382A Lecture 7: Dynamic Scheduling Department of Electrical Engineering Stanford University http://eeclass.stanford.edu/ee382a Lecture 7-1 Announcements Project proposal due on Wed 10/14 2-3 pages submitted
More informationThe basic structure of a MIPS floating-point unit
Tomasulo s scheme The algorithm based on the idea of reservation station The reservation station fetches and buffers an operand as soon as it is available, eliminating the need to get the operand from
More informationMidterm Exam 1 Wednesday, March 12, 2008
Last (family) name: Solution First (given) name: Student I.D. #: Department of Electrical and Computer Engineering University of Wisconsin - Madison ECE/CS 752 Advanced Computer Architecture I Midterm
More informationCS 152 Computer Architecture and Engineering. Lecture 12 - Advanced Out-of-Order Superscalars
CS 152 Computer Architecture and Engineering Lecture 12 - Advanced Out-of-Order Superscalars Dr. George Michelogiannakis EECS, University of California at Berkeley CRD, Lawrence Berkeley National Laboratory
More informationCS Mid-Term Examination - Fall Solutions. Section A.
CS 211 - Mid-Term Examination - Fall 2008. Solutions Section A. Ques.1: 10 points For each of the questions, underline or circle the most suitable answer(s). The performance of a pipeline processor is
More informationECE/CS 757: Homework 1
ECE/CS 757: Homework 1 Cores and Multithreading 1. A CPU designer has to decide whether or not to add a new micoarchitecture enhancement to improve performance (ignoring power costs) of a block (coarse-grain)
More informationCS/CoE 1541 Mid Term Exam (Fall 2018).
CS/CoE 1541 Mid Term Exam (Fall 2018). Name: Question 1: (6+3+3+4+4=20 points) For this question, refer to the following pipeline architecture. a) Consider the execution of the following code (5 instructions)
More informationLecture 8: Branch Prediction, Dynamic ILP. Topics: static speculation and branch prediction (Sections )
Lecture 8: Branch Prediction, Dynamic ILP Topics: static speculation and branch prediction (Sections 2.3-2.6) 1 Correlating Predictors Basic branch prediction: maintain a 2-bit saturating counter for each
More informationMemory Hierarchies 2009 DAT105
Memory Hierarchies Cache performance issues (5.1) Virtual memory (C.4) Cache performance improvement techniques (5.2) Hit-time improvement techniques Miss-rate improvement techniques Miss-penalty improvement
More informationInstruction Frequency CPI. Load-store 55% 5. Arithmetic 30% 4. Branch 15% 4
PROBLEM 1: An application running on a 1GHz pipelined processor has the following instruction mix: Instruction Frequency CPI Load-store 55% 5 Arithmetic 30% 4 Branch 15% 4 a) Determine the overall CPI
More informationReferences EE457. Out of Order (OoO) Execution. Instruction Scheduling (Re-ordering of instructions)
EE457 Out of Order (OoO) Execution Introduction to Dynamic Scheduling of Instructions (The Tomasulo Algorithm) By Gandhi Puvvada References EE557 Textbook Prof Dubois EE557 Classnotes Prof Annavaram s
More informationComputer Architecture Spring 2016
Computer Architecture Spring 2016 Final Review Shuai Wang Department of Computer Science and Technology Nanjing University Computer Architecture Computer architecture, like other architecture, is the art
More informationCS232 Final Exam May 5, 2001
CS232 Final Exam May 5, 2 Name: This exam has 4 pages, including this cover. There are six questions, worth a total of 5 points. You have 3 hours. Budget your time! Write clearly and show your work. State
More informationAdvanced Parallel Architecture Lessons 5 and 6. Annalisa Massini /2017
Advanced Parallel Architecture Lessons 5 and 6 Annalisa Massini - Pipelining Hennessy, Patterson Computer architecture A quantitive approach Appendix C Sections C.1, C.2 Pipelining Pipelining is an implementation
More informationCourse on Advanced Computer Architectures
Surname (Cognome) Name (Nome) POLIMI ID Number Signature (Firma) SOLUTION Politecnico di Milano, July 9, 2018 Course on Advanced Computer Architectures Prof. D. Sciuto, Prof. C. Silvano EX1 EX2 EX3 Q1
More informationLecture 9: More ILP. Today: limits of ILP, case studies, boosting ILP (Sections )
Lecture 9: More ILP Today: limits of ILP, case studies, boosting ILP (Sections 3.8-3.14) 1 ILP Limits The perfect processor: Infinite registers (no WAW or WAR hazards) Perfect branch direction and target
More informationCS146 Computer Architecture. Fall Midterm Exam
CS146 Computer Architecture Fall 2002 Midterm Exam This exam is worth a total of 100 points. Note the point breakdown below and budget your time wisely. To maximize partial credit, show your work and state
More informationComputer Architecture A Quantitative Approach, Fifth Edition. Chapter 3. Instruction-Level Parallelism and Its Exploitation
Computer Architecture A Quantitative Approach, Fifth Edition Chapter 3 Instruction-Level Parallelism and Its Exploitation Introduction Pipelining become universal technique in 1985 Overlaps execution of
More information3/12/2014. Single Cycle (Review) CSE 2021: Computer Organization. Single Cycle with Jump. Multi-Cycle Implementation. Why Multi-Cycle?
CSE 2021: Computer Organization Single Cycle (Review) Lecture-10b CPU Design : Pipelining-1 Overview, Datapath and control Shakil M. Khan 2 Single Cycle with Jump Multi-Cycle Implementation Instruction:
More informationCOMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface. 5 th. Edition. Chapter 4. The Processor
COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface 5 th Edition Chapter 4 The Processor Introduction CPU performance factors Instruction count Determined by ISA and compiler CPI and Cycle
More informationInstruction-Level Parallelism Dynamic Branch Prediction. Reducing Branch Penalties
Instruction-Level Parallelism Dynamic Branch Prediction CS448 1 Reducing Branch Penalties Last chapter static schemes Move branch calculation earlier in pipeline Static branch prediction Always taken,
More informationEECS 470 Midterm Exam Winter 2015
EECS 470 Midterm Exam Winter 2015 Name: unique name: Sign the honor code: I have neither given nor received aid on this exam nor observed anyone else doing so. Scores: Page # Points 2 /20 3 /15 4 /9 5
More informationPlease state clearly any assumptions you make in solving the following problems.
Computer Architecture Homework 3 2012-2013 Please state clearly any assumptions you make in solving the following problems. 1 Processors Write a short report on at least five processors from at least three
More informationCS433 Midterm. Prof Josep Torrellas. October 16, Time: 1 hour + 15 minutes
CS433 Midterm Prof Josep Torrellas October 16, 2014 Time: 1 hour + 15 minutes Name: Alias: Instructions: 1. This is a closed-book, closed-notes examination. 2. The Exam has 4 Questions. Please budget your
More informationCSE 820 Graduate Computer Architecture. week 6 Instruction Level Parallelism. Review from Last Time #1
CSE 820 Graduate Computer Architecture week 6 Instruction Level Parallelism Based on slides by David Patterson Review from Last Time #1 Leverage Implicit Parallelism for Performance: Instruction Level
More information