University of Toronto Faculty of Applied Science and Engineering

Print: First Name: SOLUTION    Last Name: ....................    Student Number: ....................

Midterm Examination
November 3, 2014
ECE552F Computer Architecture
Examiner: Natalie Enright Jerger

1. There are 4 questions and 8 pages. Do all questions. The total number of marks is 48. The duration of the test is 50 minutes.
2. ALL WORK IS TO BE DONE ON THESE SHEETS! Use the back of the pages if you need more space. Be sure to indicate clearly if your work continues elsewhere.
3. Please put your final solution in the box if one is provided.
4. Clear and concise answers will be considered more favourably than ones that ramble. Do not fill space just because it exists!
5. You may use a single 8.5x11 aid sheet.
6. You may use faculty-approved non-programmable calculators.
7. Always give some explanation or reasoning for how you arrived at your solutions to help the marker understand your thinking.
8. State your assumptions. Show your work. Use your time wisely, as not all questions will require the same amount of time. If you think that assumptions must be made to answer a question, state them clearly. If there are multiple possibilities, comment that there are, explain why, and then provide at least one possible answer and state the corresponding assumptions.
9. Only exams written in pen can be considered for remarking.

This page is for grading purposes only. The marks breakdown is given for each question.

Question   Marks
1          [11]
2          [19]
3          [8]
4          [10]
Total      [48]

1. Pipelining

[3 marks] (a) Branches represent 30% of dynamic instructions. Branches are statically predicted not-taken. 40% of branches are taken. Loads make up 30% of dynamic instructions. 65% of loads are followed immediately by a dependent ALU instruction in the dynamic instruction sequence. Consider a 4-stage pipeline where the Execute and Memory Access stages have been combined into 1 stage (called XM). The branch outcome is known at the end of the XM stage. Full forwarding exists in this pipeline. Calculate the CPI for this pipeline implementation.

The branch outcome is known at the end of XM, the third stage, so each taken (mispredicted) branch costs 2 cycles. Because X and M are combined, a load's value is available at the end of XM and full forwarding delivers it to a dependent instruction in the next cycle, so load-use dependences add no stalls.

CPI = 1 + 0.3 * 0.4 * 2
CPI = 1.24
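The CPI arithmetic for part (a) can be checked with a short script. This is a minimal sketch: the function name `pipeline_cpi` and its parameter names are illustrative and do not come from the exam. The load-use penalty is passed as 0 on the assumption that full forwarding into the combined XM stage removes that stall.

```python
def pipeline_cpi(base_cpi, branch_frac, taken_frac, branch_penalty,
                 load_frac, load_use_frac, load_use_penalty):
    """Base CPI plus average stall cycles per instruction (hypothetical helper)."""
    branch_stalls = branch_frac * taken_frac * branch_penalty
    load_stalls = load_frac * load_use_frac * load_use_penalty
    return base_cpi + branch_stalls + load_stalls


# Values from the question; the load-use penalty of 0 is the assumption
# that a combined XM stage with full forwarding needs no load-use stall.
cpi = pipeline_cpi(
    base_cpi=1.0,
    branch_frac=0.30,
    taken_frac=0.40,
    branch_penalty=2,
    load_frac=0.30,
    load_use_frac=0.65,
    load_use_penalty=0,
)
print(f"CPI = {cpi:.2f}")
```

Keeping the load term in the formula, even with a zero penalty, makes it explicit why the load statistics given in the question do not change the result.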

(b) This part of the question assumes the typical 5-stage in-order pipeline used in class (F, D, X, M, W). The table below gives the fraction of instructions that have a particular type of RAW data dependence. The type of RAW data dependence is identified by the stage that produces the result (X or M) and the instruction that consumes the result (1st instruction that follows the one that produces the result, 2nd instruction that follows, or both). Assume that the register write is done in the first half of the clock cycle and the register read is done in the second half of the cycle.

RAW dependence type        Fraction of instructions
X to 1st Only              5%
M to 1st Only              20%
X to 2nd Only              5%
M to 2nd Only              10%
X to 1st and M to 2nd      10%
Other RAW Dependences      0%

[8 marks] Let us assume that we cannot afford to have the three-input muxes that are needed for full forwarding. We have to decide if it is better to implement MX forwarding or WX forwarding. Which of the two options results in fewer data stall cycles? You must show your work to justify your answer. An answer without any justification will not receive full marks.

WX forwarding is better. WX forwarding leads to 0.35 stall cycles per instruction while MX forwarding leads to 0.65 stall cycles per instruction.

To receive full marks, the answer should enumerate stall conditions and/or do calculations on stall cycles.

                                   MX forwarding          WX forwarding

Case 1 (X to 1st Only):            F D X M W              F D X M W
                                   F D X M W              F D d* X M W

Case 2 (M to 1st Only):            F D X M W              F D X M W
                                   F d* d* D X M W        F D d* X M W

Case 3 (X to 2nd Only):            F D X M W              F D X M W
                                   F D X M W              F D X M W
                                   F d* D X M W           F D X M W

Case 4 (M to 2nd Only):            F D X M W              F D X M W
                                   F D X M W              F D X M W
                                   F d* D X M W           F D X M W

Case 5 (X to 1st and M to 2nd):    F D X M W              F D X M W
                                   F D X M W              F D d* X M W
                                   F d* D X M W           F p* D X M W

MX forwarding: 0 * 0.05 + 2 * 0.2 + 1 * 0.05 + 1 * 0.1 + 1 * 0.1 = 0.65
WX forwarding: 1 * 0.05 + 1 * 0.2 + 0 * 0.05 + 0 * 0.1 + 1 * 0.1 = 0.35
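The stall totals for the two forwarding options can be reproduced as a weighted sum over the dependence types. The sketch below takes the per-case stall counts from the pipeline diagrams in the solution. The dictionary keys and the function name `stalls_per_instruction` are illustrative labels, not names from the exam.

```python
# Fraction of instructions with each RAW dependence type (from the question).
FRACTIONS = {
    "X to 1st only": 0.05,
    "M to 1st only": 0.20,
    "X to 2nd only": 0.05,
    "M to 2nd only": 0.10,
    "X to 1st and M to 2nd": 0.10,
}

# Stall cycles per dependence type, read off the pipeline diagrams.
STALLS = {
    "MX": {
        "X to 1st only": 0,
        "M to 1st only": 2,
        "X to 2nd only": 1,
        "M to 2nd only": 1,
        "X to 1st and M to 2nd": 1,
    },
    "WX": {
        "X to 1st only": 1,
        "M to 1st only": 1,
        "X to 2nd only": 0,
        "M to 2nd only": 0,
        "X to 1st and M to 2nd": 1,
    },
}


def stalls_per_instruction(scheme):
    """Average data stall cycles per instruction for one forwarding scheme."""
    return sum(FRACTIONS[case] * STALLS[scheme][case] for case in FRACTIONS)


mx = stalls_per_instruction("MX")
wx = stalls_per_instruction("WX")
print(f"MX forwarding: {mx:.2f} stall cycles per instruction")
print(f"WX forwarding: {wx:.2f} stall cycles per instruction")
print("Better option:", "WX" if wx < mx else "MX")
```

Most of the difference comes from the 20% "M to 1st" case, which costs two stall cycles with MX forwarding but only one with WX forwarding.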

2. Dynamic Scheduling

[16 marks] (a) Consider a MIPS R10K processor with the following execution latencies and one reservation station/functional unit for each type of instruction:

5 cycles for an add
12 cycles for a multiply
20 cycles for a divide

Suppose the segment of the program with the four instructions is

i1: DIV R3, R5 -> R2
i2: ADD R1, R4 -> R3
i3: MULT R2, R6 -> R4
i4: ADD R2, R3 -> R4

Considering the MIPS R10K implementation, if i1 is issued at cycle 0, in what cycle will each instruction complete and retire? The ROB size is infinite, as is the number of physical registers.

Assumptions: Multiple instructions can issue in the same cycle (i3, i4). A dependent instruction can issue in the same cycle in which its producer broadcasts its tag on the CDB.

      Complete   Retire
i1    21         22
i2    7          23
i3    34         35
i4    27         36

[3 marks] (b) Explain how WAW hazards are avoided in Tomasulo's algorithm.

WAW hazards are avoided through the register map table. If the ID in the map table does not match the ID of the instruction writing the CDB, then that instruction does not update the register file.

The answer must be more detailed than "register renaming".
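The Complete and Retire columns in part (a) can be reproduced with a small timing model. This is only a sketch under assumptions inferred from the answer table, not rules stated in the exam. It assumes that instructions enter the window one per cycle in program order, that an instruction issues as soon as all of its producers have completed, that completion occurs at issue + latency + 1, and that retirement is in order, one cycle after both completion and the previous retirement. Contention for the single adder is ignored because the two ADDs do not overlap here. All names are illustrative.

```python
LATENCY = {"DIV": 20, "ADD": 5, "MULT": 12}

# (name, opcode, earlier instructions whose results it reads)
PROGRAM = [
    ("i1", "DIV", []),
    ("i2", "ADD", []),
    ("i3", "MULT", ["i1"]),        # reads R2 produced by i1
    ("i4", "ADD", ["i1", "i2"]),   # reads R2 from i1 and R3 from i2
]


def schedule(program):
    """Return (complete, retire) cycle dictionaries under the assumed model."""
    complete, retire = {}, {}
    prev_retire = 0
    for slot, (name, op, producers) in enumerate(program):
        # Earliest issue: own dispatch slot, or when the last producer completes.
        issue = max([slot] + [complete[p] for p in producers])
        complete[name] = issue + LATENCY[op] + 1
        # In-order retirement, one cycle after completion and after the
        # previous instruction has retired.
        retire[name] = max(complete[name], prev_retire) + 1
        prev_retire = retire[name]
    return complete, retire


complete, retire = schedule(PROGRAM)
for name, _, _ in PROGRAM:
    print(name, complete[name], retire[name])
```

Under this model i3 and i4 both issue in cycle 21, the cycle in which i1 completes, which matches the stated assumptions. A different convention for counting issue or completion cycles would shift the numbers.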

[8 marks] 3. Consider two possible improvements to a processor design. The first improvement can speed up floating point arithmetic instructions by a factor of 8. The second improvement can speed up load and store instructions by a factor of 3. Let F_fp and F_ls be the fraction of execution time spent on floating point and load/store instructions respectively. The executions of these two sets of instructions are non-overlapping in time. What should the relation be between the fractions F_fp and F_ls such that a machine built with the first improvement outperforms a machine built with the second improvement?

$$\frac{1}{1 - F_{fp} + \frac{F_{fp}}{8}} > \frac{1}{1 - F_{ls} + \frac{F_{ls}}{3}}$$

Comparing the denominators gives

$$\frac{7}{8} F_{fp} > \frac{2}{3} F_{ls}$$

$$\frac{F_{fp}}{F_{ls}} > \frac{16}{21}$$
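The 16/21 threshold can be checked numerically with Amdahl's law. The sketch below uses exact rational arithmetic so that the boundary case compares equal. The function name `amdahl_speedup` and the sample fractions are illustrative choices, not values from the exam.

```python
from fractions import Fraction


def amdahl_speedup(fraction, factor):
    """Overall speedup when `fraction` of execution time is sped up by `factor`."""
    return 1 / (1 - fraction + fraction / factor)


# Boundary case: F_fp / F_ls is exactly 16/21, so both machines tie.
f_ls = Fraction(21, 100)
f_fp = Fraction(16, 100)
tie_fp = amdahl_speedup(f_fp, 8)
tie_ls = amdahl_speedup(f_ls, 3)
print("at the threshold:", tie_fp, tie_ls)

# Just above the threshold the floating point improvement wins.
above_fp = amdahl_speedup(Fraction(17, 100), 8)
# Just below the threshold the load/store improvement wins.
below_fp = amdahl_speedup(Fraction(15, 100), 8)
print("above threshold, FP machine faster:", above_fp > tie_ls)
print("below threshold, FP machine faster:", below_fp > tie_ls)
```

Using `Fraction` instead of floats avoids rounding noise at the tie point, where both speedups are exactly 50/43.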

4. Branch Prediction

Consider the following code sequence. Assume that each instruction is encoded in one 32-bit word.

Address   Instruction
0x0038    L3: ...
0x003C    ...
0x0040    SUBI R1, 2 -> R3
0x0044    BNEZ R3, L1
0x0048    ADD R0, R0 -> R1
0x004C    L1: SUBI R2, 2 -> R3
0x0050    BNEZ R3, L2
0x0054    ADD R0, R0 -> R2
0x0058    L2: SUB R1, R2 -> R3
0x005C    BEQZ R3, L3

[4 marks] (a) Show the contents of a 4-entry branch target buffer (BTB) after one execution of the code starting at PC = 0x0040. Assume the BTB is initially empty. Discard the lowest-order PC bits that never change and use the next set of bits to index into the BTB. Assume the initial register values are such that every branch is taken. Each entry can hold 16 bits of information.

Entry # (index)   Contents
0                 0x0058
1                 0x004C
2                 (empty)
3                 0x0038
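The BTB contents in part (a) follow from the indexing rule: instructions are word-aligned, so PC bits [1:0] never change, and a 4-entry table is indexed by bits [3:2]. A minimal sketch of that rule follows. It assumes each entry stores only the 16-bit target, with no tag, and the names `btb_index` and `fill_btb` are illustrative.

```python
def btb_index(pc, entries=4, offset_bits=2):
    """Drop the always-zero byte-offset bits, then take the low index bits."""
    return (pc >> offset_bits) % entries


def fill_btb(taken_branches, entries=4):
    """Record the target of each taken branch; None marks an empty entry."""
    btb = [None] * entries
    for pc, target in taken_branches:
        btb[btb_index(pc, entries)] = target
    return btb


# (branch PC, target) pairs from the code sequence, all assumed taken.
TAKEN_BRANCHES = [
    (0x0044, 0x004C),  # BNEZ R3, L1
    (0x0050, 0x0058),  # BNEZ R3, L2
    (0x005C, 0x0038),  # BEQZ R3, L3
]

btb = fill_btb(TAKEN_BRANCHES)
for index, target in enumerate(btb):
    print(index, "(empty)" if target is None else f"0x{target:04X}")
```

The three branches map to different entries, so no entry is overwritten during this single pass through the code.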

[6 marks] (b) Consider the case where we use a global branch direction predictor with a 3-bit global history register. Execution of the 13th iteration of the code sequence above is about to start. Provide an example of i) the value of a feasible 3-bit global branch history, and ii) the value of an infeasible global branch history. To receive marks, you must justify your answers. Simply writing some combination of N and T values without any explanation is not sufficient.

Other answers are possible provided correct and sufficient justification is given.

i. Feasible 3-bit global branch history

Assume R1 initially has the value 2 and R2 initially has the value 2. The first two branches will be not taken and the third branch will be taken, leading to a feasible global history of NNT.

ii. Infeasible global branch history

From the previous answer, we can see that if both of the first two branches are not taken, the third branch can never be not taken, so an infeasible history is NNN.
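The argument in part (b) can be checked by simulating one pass through the three branches and recording the outcomes. This is a sketch under the assumption that R0 always reads as zero, which the code sequence appears to rely on when it clears R1 and R2. The function name `run_iteration` and the range of initial values tried are illustrative.

```python
def run_iteration(r1, r2):
    """Execute the sequence from 0x0040 once; return (history, r1, r2).

    Assumes R0 == 0, so ADD R0, R0 -> Rx clears Rx.
    """
    history = ""

    r3 = r1 - 2                 # SUBI R1, 2 -> R3
    if r3 != 0:                 # BNEZ R3, L1
        history += "T"
    else:
        history += "N"
        r1 = 0                  # ADD R0, R0 -> R1

    r3 = r2 - 2                 # L1: SUBI R2, 2 -> R3
    if r3 != 0:                 # BNEZ R3, L2
        history += "T"
    else:
        history += "N"
        r2 = 0                  # ADD R0, R0 -> R2

    r3 = r1 - r2                # L2: SUB R1, R2 -> R3
    history += "T" if r3 == 0 else "N"   # BEQZ R3, L3
    return history, r1, r2


feasible, _, _ = run_iteration(2, 2)
print("R1 = 2, R2 = 2 gives history", feasible)

# Collect every history reachable from a small range of initial values.
seen = {run_iteration(a, b)[0] for a in range(-8, 9) for b in range(-8, 9)}
print("histories seen:", sorted(seen))
print("NNN reachable:", "NNN" in seen)
```

When both of the first two branches fall through, R1 and R2 are both cleared, so the final comparison always finds them equal and the third branch is taken. That is why NNN never appears in the enumeration.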