University of Toronto Faculty of Applied Science and Engineering


Print: First Name: Solutions    Last Name:    Student Number:

University of Toronto
Faculty of Applied Science and Engineering
Midterm Examination 2
November 11, 2011
ECE552F Computer Architecture
Examiner: Natalie Enright Jerger

1. There are 4 questions and 13 pages. Do all questions. The total number of marks is 50. The duration of the test is 50 minutes.
2. ALL WORK IS TO BE DONE ON THESE SHEETS! Use the back of the pages if you need more space. Be sure to indicate clearly if your work continues elsewhere.
3. Please put your final solution in the box if one is provided.
4. Clear and concise answers will be considered more favourably than ones that ramble. Do not fill space just because it exists!
5. You may use a single 8.5x11 aid sheet.
6. You may use faculty approved non-programmable calculators.
7. Always give some explanations or reasoning for how you arrived at your solutions to help the marker understand your thinking.

This page is for grading purposes only. The marks breakdown is given for each question.

Question   Marks
1          [14]
2          [11]
3          [9]
4          [16]
Total      [50]

1. Assume you have a single-issue processor that uses the Scoreboard algorithm for dynamic scheduling. There are 2 integer functional units capable of doing addition and subtraction, 1 multiply unit and 1 load/store unit (the address calculation does not use the integer units). Addition and subtraction take 2 cycles and multiplication takes 4 cycles. Load/Store operations take 6 cycles. A scoreboard entry is available to a new instruction the cycle after writeback. Updated register values can be read in the same cycle that they are written. Consider the following code:

    LD   [R2+0] -> R1
    ADD  R1, R2 -> R3
    MULT R2, R4 -> R2
    ADDI R4, 8  -> R4
    SUB  R2, 12 -> R1

[10 marks] (a) Complete the following table for this code sequence. For the Issue (S), Dispatch (D), Execute (X) and Writeback (W) columns, enter the cycle at which the instruction completes this stage.

Execution latencies: Add/subtract: 2 cycles, Multiplication: 4 cycles, Load/Store: 6 cycles

Write comments to help the marker. A confused and unhappy marker gives fewer marks, so clearly indicate what you are doing.

Instruction Status

Instruction         S     D     X     W     Comment
LD   [R2+0] -> R1   c1    c2    c8    c9
ADD  R1,R2  -> R3   c2    c9    c11   c12   Cannot dispatch because of RAW hazard with LD
MULT R2,R4  -> R2   c3    c4    c8    c10   Waits in W because of WAR hazard with ADD
ADDI R4,8   -> R4   c4    c5    c7    c8
SUBI R2,12  -> R2   c11   c12   c14   c15   Stalls in issue because of no functional unit and WAW hazard with MULT

[4 marks] (b) On the following pages are the tables for the reservation stations and register tags. Show the contents of these tables as the above instructions are executed. The tables do not go to the completion of the instruction sequence. Add comments to explain what is happening. Be sure to fill in all relevant entries. Do not forget to fill in entries that do not change from one cycle to the next.

Cycle 1

Functional Unit Status Table
             Busy  op     Fi   Fj   Fk   Qj           Qk   Rj   Rk
Int 1
Int 2
Mult
Load/Store   Y     Load   R1   R2                          Y    -

Register Status
R1           R2    R3     R4
Load/Store

Comments

Cycle 2

Functional Unit Status Table
             Busy  op     Fi   Fj   Fk   Qj           Qk   Rj   Rk
Int 1        Y     ADD    R3   R1   R2   Load/Store   -    N    Y
Int 2
Mult
Load/Store   Y     Load   R1   R2                          N    -

Register Status
R1           R2    R3     R4
Load/Store         Int 1

Comments

Cycle 3

Functional Unit Status Table
             Busy  op     Fi   Fj   Fk   Qj           Qk   Rj   Rk
Int 1        Y     ADD    R3   R1   R2   Load/Store   -    N    Y
Int 2
Mult         Y     Mult   R2   R4   R2   -            -    Y    Y
Load/Store   Y     Load   R1   R2                          N    -

Register Status
R1           R2    R3     R4
Load/Store   Mult  Int 1

Comments

Cycle 4

Functional Unit Status Table
             Busy  op     Fi   Fj   Fk   Qj           Qk   Rj   Rk
Int 1        Y     ADD    R3   R1   R2   Load/Store   -    N    Y
Int 2        Y     ADDI   R4   R4                          Y    -
Mult         Y     Mult   R2   R4   R2   -            -    N    N
Load/Store   Y     Load   R1   R2                          N    -

Register Status
R1           R2    R3     R4
Load/Store   Mult  Int 1  Int 2

Comments
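As a quick sanity check on the Instruction Status table in 1(a), the sketch below tests two properties that follow directly from the question: issue is in order on a single-issue machine, and the Execute column equals the Dispatch column plus the stated latency. The names `LATENCY`, `TABLE` and `inconsistent_rows` are illustrative, and this is only a consistency check on the given cycle numbers, not a scoreboard simulator.

```python
# Execution latencies stated in the question, keyed by opcode.
LATENCY = {"LD": 6, "ADD": 2, "MULT": 4, "ADDI": 2, "SUBI": 2}

# (opcode, S, D, X, W) rows copied from the solution table in 1(a).
TABLE = [
    ("LD", 1, 2, 8, 9),
    ("ADD", 2, 9, 11, 12),
    ("MULT", 3, 4, 8, 10),
    ("ADDI", 4, 5, 7, 8),
    ("SUBI", 11, 12, 14, 15),
]


def inconsistent_rows(table):
    """Return opcodes whose row violates a basic timing property."""
    bad = []
    previous_issue = 0
    for op, s, d, x, w in table:
        in_order_issue = s > previous_issue       # single-issue, in-order
        stages_ordered = s < d < x < w            # S before D before X before W
        latency_matches = (x - d) == LATENCY[op]  # X completes latency cycles after D
        if not (in_order_issue and stages_ordered and latency_matches):
            bad.append(op)
        previous_issue = s
    return bad


print(inconsistent_rows(TABLE))  # prints []
```

The check passes for every row, which says the table is internally consistent with the latencies; it does not by itself confirm the hazard stalls recorded in the Comment column.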

2. Branch Prediction

[6 marks] (a) The following pattern history table gives the current state of 2-bit saturating predictors for a design with a two-bit history register. The valid states of the 2-bit saturating predictor are N, n, t and T. The branch history register holds the value TN at the start of this paper simulation. The branch outcomes are given for you. You must fill in the state in the pattern history table and the prediction made in each column. You only need to fill in the entries that change. Each column should represent the state of the branch predictor prior to the outcome of the branch in that column. Updated state (based on the outcome given to you) should be placed in the next column to the right. In the final column you only need to fill in the updated state, not the prediction.

Pattern History Table

Column        1   2   3   4   5   6   7   8   9   10  11  12  13
Pattern: NN   T                           T           t   T
Pattern: NT   t       T           t           N               n
Pattern: TN   n   T           T       t           N
Pattern: TT   N           N
Prediction:   N   T   N   T   T   T   T   T   T   T   T   N
Outcome:      T   T   N   T   N   N   T   N   N   N   T   T
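The paper simulation in 2(a) can be replayed with a short script. This is a sketch under one stated assumption: the transition table `NEXT_STATE` is inferred from the solution entries (n moves to T on a taken branch, and t moves to N on a not-taken branch), since the exam text does not spell out the state machine. The function name `simulate` and the string encoding of states are illustrative.

```python
# Assumed 2-bit predictor transitions, inferred from the solution entries:
# a weak state that mispredicts jumps to the opposite strong state's side.
NEXT_STATE = {
    ("N", "T"): "n", ("N", "N"): "N",
    ("n", "T"): "T", ("n", "N"): "N",
    ("t", "T"): "T", ("t", "N"): "N",
    ("T", "T"): "T", ("T", "N"): "t",
}


def simulate(pht, history, outcomes):
    """Replay outcomes through a 2-bit-history, 2-bit-counter predictor."""
    pht = dict(pht)  # work on a copy so the caller's table is untouched
    predictions = []
    for outcome in outcomes:
        state = pht[history]
        predictions.append("T" if state in ("t", "T") else "N")
        pht[history] = NEXT_STATE[(state, outcome)]
        history = history[1] + outcome  # shift the newest outcome in on the right
    return "".join(predictions), pht, history


initial_pht = {"NN": "T", "NT": "t", "TN": "n", "TT": "N"}
predictions, final_pht, final_history = simulate(initial_pht, "TN", "TTNTNNTNNNTT")
print(predictions)  # prints NTNTTTTTTTTN
print(final_pht)    # prints {'NN': 'T', 'NT': 'n', 'TN': 'N', 'TT': 'N'}
```

Under that assumed state machine, the predictions and the final table entries agree with the solution row by row.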

[5 marks] (b) Consider the following code:

    for (i=0; i<1000; i++) {
        for (j=0; j<4; j++) {
            c[i,j] = a[i] * b[j];
        }
    }

Consider a branch prediction scheme with a separate local branch history register for each PC that is used to index into a pattern history table; each entry contains a single-bit predictor. How many bits of history are needed to correctly predict the inner loop branch (ignore mispredictions that occur while the predictor is still learning the relevant patterns)? Ignore the outer loop branch. Explain your reasoning.

The inner loop branch follows the repeating pattern T T T N. To fully capture this pattern, 3 bits of history must be maintained; fewer bits cannot distinguish the position within the pattern. For example, if 2 bits of history are maintained, a history of TT can be followed by either a taken or a not-taken outcome, which results in mispredictions.

Number of history bits = 3
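The argument in 2(b) can be checked by brute force. The sketch below models the scheme described in the question (a local history register indexing a table of single-bit, last-outcome predictors) and counts mispredictions after a warm-up period. The function name, the warm-up length and the all-not-taken initial history are assumptions made for illustration.

```python
def steady_state_mispredictions(pattern, history_bits, repeats=50):
    """Count mispredictions in the second half of a repeated branch pattern."""
    outcomes = [c == "T" for c in pattern] * repeats
    history = (False,) * history_bits  # assumed initial history: all not-taken
    pht = {}                           # history -> last outcome seen (1-bit predictor)
    warmup = len(outcomes) // 2        # ignore mispredictions while learning
    misses = 0
    for i, taken in enumerate(outcomes):
        prediction = pht.get(history, False)
        if i >= warmup and prediction != taken:
            misses += 1
        pht[history] = taken
        if history_bits:
            history = (history + (taken,))[-history_bits:]
    return misses


for bits in range(4):
    print(bits, steady_state_mispredictions("TTTN", bits))
# prints:
# 0 50
# 1 50
# 2 50
# 3 0
```

With 0, 1 or 2 bits the predictor keeps mispredicting twice per trip through the inner loop, and only at 3 bits does every history map to a single outcome.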

3. Caches

[5 marks] (a) Consider a cache with the following design:

1. 2-way set associative
2. Each cache block holds 2 bytes.
3. The cache contains 4 sets.
4. This design has a 6-bit address space.
5. This cache uses an LRU replacement policy. The initial LRU state is: Way 0 is LRU for Set 0, Way 0 is LRU for Set 1, Way 1 is LRU for Set 2, Way 0 is LRU for Set 3.

For the following sequence of memory accesses, update the table to reflect the addresses that are held in the cache (e.g., entry 1 of the table should reflect the cache contents after access 1 has completed). Each entry gives the starting address of blocks in that cache entry (not the data). Indicate in the last column if the access was a hit (H) or miss (M). The row Init holds the addresses initially in the cache. You only need to fill in the entries that change. The effect of the accesses is cumulative (e.g., the initial state for access 2 is the cache contents after access 1).

Accesses: 1. Load  2. Store  3. Load  4. Load  5. Load

          Set 0           Set 1           Set 2           Set 3           Hit/Miss
          Way 0   Way 1   Way 0   Way 1   Way 0   Way 1   Way 0   Way 1
Init
1                                                                         M
2                                                                         H
3                                                                         M
4                                                                         H
5                                                                         M
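For the cache in 3(a), the design parameters fix how each 6-bit address is split: 2-byte blocks give 1 offset bit, 4 sets give 2 index bits, and the remaining 3 bits are the tag. The sketch below shows that split. The function names and the example address 0b101101 are hypothetical and are not taken from the exam's access sequence.

```python
OFFSET_BITS = 1  # 2-byte blocks -> 1 offset bit
INDEX_BITS = 2   # 4 sets -> 2 index bits
# The remaining 6 - 1 - 2 = 3 bits of the address form the tag.


def split_address(addr):
    """Split a 6-bit address into (tag, set index, block offset)."""
    offset = addr & ((1 << OFFSET_BITS) - 1)
    index = (addr >> OFFSET_BITS) & ((1 << INDEX_BITS) - 1)
    tag = addr >> (OFFSET_BITS + INDEX_BITS)
    return tag, index, offset


def block_start(addr):
    """Starting address of the block containing addr (what the table records)."""
    return addr & ~((1 << OFFSET_BITS) - 1)


# Hypothetical address, chosen only to illustrate the split.
print(split_address(0b101101))  # prints (5, 2, 1)
print(block_start(0b101101))    # prints 44
```

Two addresses hit in the same cache entry only if they share both the set index and the tag, which is why the table stores block starting addresses rather than individual byte addresses.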

[4 marks] (b) Calculate the average memory access time for data accesses for a processor with the following memory hierarchy: an L1 data cache that has a 1 cycle access latency and a 90% hit rate, an L2 cache with a 10 cycle access latency and an 85% hit rate, an L3 cache with a 20 cycle access latency and a 75% hit rate, and main memory that has a 70 cycle access time and a 100% hit rate.

Average memory access time
  = $T_{L1} + \%Miss_{L1}\,(T_{L2} + \%Miss_{L2}\,(T_{L3} + \%Miss_{L3}\,T_{Mem}))$
  = $1 + 0.10\,(10 + 0.15\,(20 + 0.25 \times 70))$

Average memory access time = 2.5625 cycles
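The nested formula in 3(b) can be evaluated from the innermost level outward. The sketch below does that with the latencies and hit rates given in the question; the function name `amat` and the list-of-tuples layout are illustrative choices.

```python
def amat(levels, memory_latency):
    """Average memory access time for a cache hierarchy.

    levels: list of (access_latency, hit_rate) from L1 outward.
    memory_latency: access time of main memory, which always hits.
    """
    total = memory_latency
    # Fold from the last cache level back to L1:
    # time at a level = its latency + its miss rate * time of the level below.
    for latency, hit_rate in reversed(levels):
        total = latency + (1 - hit_rate) * total
    return total


# Values from the question: L1 (1 cycle, 90%), L2 (10, 85%), L3 (20, 75%), memory 70.
print(round(amat([(1, 0.90), (10, 0.85), (20, 0.75)], 70), 4))  # prints 2.5625
```

The intermediate values are 37.5 cycles seen from L3, 15.625 cycles seen from L2, and 2.5625 cycles seen from L1.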

4. Answer the following short answer questions:

[3 marks] (a) Instruction scheduling can be done either by the compiler or by hardware. Give one advantage that hardware scheduling has over compiler scheduling.

Hardware has better dynamic knowledge, including cache misses, branch mispredictions and memory addresses.
It is easier to speculate and recover in hardware than in software.
Hardware has more registers available to it.
More portable software.

[2 marks] (b) Define accuracy and coverage (two commonly used prefetching metrics).

Accuracy = # of useful prefetches / # of total prefetches
Coverage = # of useful prefetches / # of misses with no prefetching
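As a small worked example of the two definitions in 4(b), the sketch below computes both metrics from raw counts. The counts are hypothetical and chosen only so the arithmetic is easy to follow.

```python
def prefetch_accuracy(useful_prefetches, total_prefetches):
    """Fraction of issued prefetches that were actually used."""
    return useful_prefetches / total_prefetches


def prefetch_coverage(useful_prefetches, misses_without_prefetching):
    """Fraction of the baseline misses that prefetching removed."""
    return useful_prefetches / misses_without_prefetching


# Hypothetical counts: 80 prefetches issued, 60 of them useful,
# and 200 misses in the same run with prefetching disabled.
print(prefetch_accuracy(60, 80))   # prints 0.75
print(prefetch_coverage(60, 200))  # prints 0.3
```

The example shows that the two metrics are independent: a prefetcher can be fairly accurate while still covering only a small share of the misses.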

[3 marks] (c) How can you determine which cache misses in a program are compulsory misses using simulation?

Compulsory (cold) misses are misses to addresses that have never been seen before. Simulate an infinite-sized cache: all misses to an infinite cache are compulsory misses.

[4 marks] (d) In what stage of the in-order 5-stage pipeline are exceptions/interrupts handled? Why?

Exceptions are handled in the W stage. Exceptions must be handled in program order, not temporal order. Therefore, exceptions are not handled when they occur but rather when the instruction is ready to write back. This ensures that older instructions have completed.
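The infinite-cache approach from 4(c) amounts to counting the first touch of each block, since a cache that never evicts can only miss on a block it has never held. The sketch below does this for an address trace; the function name, the trace and the 2-byte block size are hypothetical.

```python
def count_compulsory_misses(addresses, block_size):
    """Count misses in an infinite cache: the first access to each block."""
    seen_blocks = set()  # an infinite cache never evicts, so a set is enough
    misses = 0
    for addr in addresses:
        block = addr // block_size
        if block not in seen_blocks:
            seen_blocks.add(block)
            misses += 1
    return misses


# Hypothetical byte-address trace with 2-byte blocks.
trace = [0, 1, 2, 0, 64, 65, 2]
print(count_compulsory_misses(trace, block_size=2))  # prints 3
```

Any additional misses that a finite cache shows on the same trace are then capacity or conflict misses rather than compulsory ones.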

[4 marks] (e) Consider Tomasulo's algorithm.

i. What gets stored in the Vj and Vk fields in the reservation stations in Tomasulo's algorithm?

The source operand values get stored in Vj and Vk.

ii. What problem with the Scoreboard do Vj and Vk overcome?

Storing operand values in Vj and Vk eliminates WAR and WAW hazards.
