
Print:
First Name: ............ Solutions ............
Last Name: .............................
Student Number: ...............................................

University of Toronto
Faculty of Applied Science and Engineering

Final Examination
December 19, 2011
ECE552F Computer Architecture
Examiner: Natalie Enright Jerger

1. There are 6 questions and 14 pages. Do all questions. The total number of marks is 100. The duration of the test is 2.5 hours.
2. ALL WORK IS TO BE DONE ON THESE SHEETS! Use the back of the pages if you need more space. Be sure to indicate clearly if your work continues elsewhere.
3. Please put your final solution in the box if one is provided.
4. Clear and concise answers will be considered more favourably than ones that ramble. Do not fill space just because it exists!
5. You may use two 8.5x11 aid sheets.
6. You may use faculty-approved non-programmable calculators.
7. Always give some explanations or reasoning for how you arrived at your solutions to help the marker understand your thinking.

Marks: 1 [23]   2 [10]   3 [23]   4 [12]   5 [26]   6 [6]   Total [100]

1. Start with some short answer questions:

[4 marks] (a) Why is using test-and-test-and-set better than using test-and-set for synchronization?

Test-and-set (T&S) performs a write every time it tries to acquire the lock; each processor trying to access the lock will install the cache block containing the lock in the Modified state in its cache and invalidate all copies in other caches. Test-and-test-and-set (T&T&S) will first perform a read and spin locally on a shared copy of the line until the lock becomes available. T&T&S will therefore issue far fewer bus transactions.

[4 marks] (b) Why will having more reservation stations than functional units in Tomasulo's algorithm result in better performance than a 1-to-1 ratio of reservation stations to functional units?

Instructions can be dispatched to functional units out of program order, but reservation stations are allocated in the issue stage (in order). More reservation stations increase the likelihood that a ready instruction (one with both operands available) will be found, so the functional units sit idle less often.
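The difference in bus traffic is easiest to see in code. Below is a minimal sketch of the two acquire loops using C11 atomics; the function and variable names are illustrative, not part of the exam.

#include <stdatomic.h>

/* Test-and-set: every attempt is an atomic read-modify-write, so every
 * spinning processor repeatedly pulls the line into Modified state.
 * (Initialize the flag with ATOMIC_FLAG_INIT.) */
void ts_lock(atomic_flag *lock) {
    while (atomic_flag_test_and_set(lock))
        ; /* spin: generates a bus transaction on every iteration */
}

/* Test-and-test-and-set: spin on an ordinary read (a Shared copy of
 * the line) and only attempt the atomic write once the lock looks free. */
void tts_lock(atomic_int *lock) {
    for (;;) {
        while (atomic_load(lock) != 0)
            ; /* local spin: reads hit in this cache, no bus traffic */
        if (atomic_exchange(lock, 1) == 0)
            return; /* won the race */
        /* lost the race: go back to local spinning */
    }
}

void tts_unlock(atomic_int *lock) { atomic_store(lock, 0); }

Only the final atomic_exchange in tts_lock needs the line in Modified state; until then, all contenders share a read-only copy.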

[4 marks] (c) What is fine-grained multithreading? What benefit does it achieve?

Fine-grained multithreading (FGMT) switches between multiple threads on a cycle-by-cycle basis in a round-robin fashion. FGMT is able to hide the latency of short stalls such as load-to-use penalties and branch mispredictions.

[4 marks] (d) Give one advantage of implementing load-linked (ll)/store-conditional (sc) instructions instead of an exchange (exch) instruction.

LL/SC are more RISC-like and therefore easier to pipeline. EXCH implements a load and a store in a single instruction, making it more difficult to pipeline.
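To illustrate (d): an exchange built from ll/sc is just a small retry loop around a load and a conditional store. C has no ll/sc primitive, so in this sketch a C11 compare-exchange stands in for the store-conditional commit point; the shape of the loop is what matters.

#include <stdatomic.h>

/* Atomic exchange synthesized from an ll/sc-style retry loop. */
int atomic_exch(atomic_int *addr, int newval) {
    int old = atomic_load(addr);   /* plays the role of ll */
    /* sc "fails" if another processor touched the location; on failure
     * the compare-exchange refreshes old, mirroring a re-executed ll. */
    while (!atomic_compare_exchange_weak(addr, &old, newval))
        ; /* retry until the conditional store succeeds */
    return old;
}

Each piece of the loop is an ordinary one-load or one-store operation, which is why it pipelines more cleanly than a single instruction that must both read and write memory.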

[2 marks] (e) What is the difference between coherence and consistency for multiprocessors?

Coherence creates a globally consistent (uniform) view of accesses to a single memory address. Consistency creates a globally consistent (uniform) view of accesses to all memory addresses.

[5 marks] (f) Can a multiprocessor built with dynamically scheduled (out-of-order) processors be sequentially consistent? Explain your reasoning.

Yes. Stores occur in program order at retirement, so they are not a problem. However, loads occur out of program order in the execute stage. To achieve sequential consistency, loads must be treated as speculative: a coherence event that hits a speculatively loaded address is treated as a mis-speculation and requires the load (and all subsequent instructions) to be re-executed, similar to recovering from a branch misprediction.
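The classic store-buffer litmus test makes (f) concrete. Below is a sketch with POSIX threads (the variable names are ours, not from the exam); under sequential consistency, at least one thread must observe the other's store, so the outcome r1 = r2 = 0 is forbidden.

#include <pthread.h>
#include <stdio.h>

volatile int x = 0, y = 0;
int r1, r2;

void *t1(void *arg) { x = 1; r1 = y; return NULL; }
void *t2(void *arg) { y = 1; r2 = x; return NULL; }

int main(void) {
    pthread_t a, b;
    pthread_create(&a, NULL, t1, NULL);
    pthread_create(&b, NULL, t2, NULL);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    printf("r1=%d r2=%d\n", r1, r2); /* (0,0) would violate SC */
    return 0;
}

Compiled as-is on a real x86 or ARM machine, this program can in fact print r1=0 r2=0, because plain volatile accesses do not enforce sequential consistency; that is precisely the load reordering the speculative-load mechanism described above must detect and squash.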

2. Dynamic Scheduling

This snapshot of the ROB, Map Table and Free List for a MIPS R10K-like dynamically scheduled scalar processor was taken as the store (st) instruction is about to retire. In the table, h denotes the head of the ROB and t denotes the tail of the ROB.

ROB:

      #  Insn              T     Told   S   D   X      C
      1  mult R0,R2 -> R5  PR8   PR5    c1  c2  c3-c8  c9
      2  add  R5,R2 -> R5  PR11  PR8    c2  c9  c10    c11
      3  add  R1,R4 -> R3  PR7   PR3    c3  c4  c5     c6
   h  4  st   R3 -> [R2]   --    --     c4  c6  c7     c8
      5  div  R0,R2 -> R3  PR9   PR7    c5  c6  c7-
      6  sub  R4,R2 -> R4  PR10  PR6    c6  c7  c8     c10
   t  7  ld   [R3] -> R3   PR4   PR9    c7

Map Table:
   R0 -> PR0   R1 -> PR1   R2 -> PR2
   R3 -> PR4   R4 -> PR10  R5 -> PR11

Free List: PR8, PR5, PR3

[4 marks] (a) In which cycle will the store retire? Explain your reasoning.

Retirement occurs in order and, since this is a scalar processor, only one instruction can retire per cycle. Instruction 1 (mult) completes in c9 and retires in cycle 10, instruction 2 retires in cycle 12 (it completes in c11), instruction 3 in cycle 13, and the store (instruction 4) in cycle 14.

Cycle = 14

[6 marks] (b) Now assume the store experiences a page fault. Fill in the tables to show what the state of the Map Table and Free List should be right before the processor proceeds to handle the page fault.

The faulting store and all younger instructions (5, 6 and 7) are flushed. Walking the ROB from the tail back to the faulting instruction, each flushed instruction's rename is undone (its destination's mapping is restored to Told) and its allocated register T is returned to the Free List.

Map Table:
   R0 -> PR0   R1 -> PR1   R2 -> PR2
   R3 -> PR7   R4 -> PR6   R5 -> PR11

Free List: PR8, PR5, PR3, PR10, PR9, PR4
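A minimal sketch of this recovery walk over the exam's snapshot, with illustrative structure and function names (the order in which registers land on the free list is immaterial):

#include <stdio.h>

enum { NUM_ARCH_REGS = 6, ROB_SIZE = 8, FREE_SIZE = 16 };

typedef struct {
    int dest;   /* architectural destination, -1 for stores */
    int T;      /* newly allocated physical register        */
    int Told;   /* previous mapping of dest                 */
} RobEntry;

int map_table[NUM_ARCH_REGS] = {0, 1, 2, 4, 10, 11}; /* R0..R5 -> PRn */
int free_list[FREE_SIZE] = {8, 5, 3};                /* PR8, PR5, PR3 */
int free_count = 3;

/* ROB entries 4..7 from the snapshot (index == ROB number). */
RobEntry rob[ROB_SIZE] = {
    [4] = {-1, -1, -1},  /* st: no register destination  */
    [5] = { 3,  9,  7},  /* div -> R3: T=PR9,  Told=PR7  */
    [6] = { 4, 10,  6},  /* sub -> R4: T=PR10, Told=PR6  */
    [7] = { 3,  4,  9},  /* ld  -> R3: T=PR4,  Told=PR9  */
};

/* Walk youngest-to-oldest, undoing renames and recycling registers. */
void rollback(int tail, int fault) {
    for (int i = tail; i >= fault; i--) {
        if (rob[i].dest >= 0) {
            map_table[rob[i].dest] = rob[i].Told; /* undo the rename */
            free_list[free_count++] = rob[i].T;   /* recycle T       */
        }
    }
}

int main(void) {
    rollback(7, 4); /* the store (ROB 4) faults; tail is ROB 7 */
    for (int r = 0; r < NUM_ARCH_REGS; r++)
        printf("R%d -> PR%d\n", r, map_table[r]);
    return 0;
}

Running it reproduces the answer above: R3 ends up mapped to PR7 and R4 to PR6, with PR10, PR9 and PR4 returned to the free list.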

3. Multiprocessor Issues

[12 marks] (a) Draw the state transition diagram for a snooping-based MOSI protocol. The states M, S, I correspond to the Modified, Shared and Invalid states discussed in class. An O (Owned) state has been added. The Owned state indicates that even though other shared copies of the block may exist, this cache (instead of main memory) is responsible for supplying the data when it observes a relevant bus transaction. Label the arcs in the transition diagram with the convention used in class: event => generated event. Example: R => BR denotes that the processor has experienced a read miss (R) and must generate a bus read (BR) to obtain the data. Please include any comments or assumptions to help the marker interpret your diagram.

Please use this notation (you may add additional abbreviations as necessary; be sure to clarify your notation for the marker): Bus Read: BR; Bus Write: BW; Read: R; Write: W; Send Data: SD; Writeback: WB.

Solution (the original hand-drawn diagram, reconstructed here as one transition list per state; BI denotes a Bus Invalidate, an added abbreviation for an upgrade that invalidates other copies without transferring data):

M: R, W: hit, remain in M.
   Observed BR => SD, go to O.
   Observed BW => SD, go to I.

O: R: hit, remain in O. W => BI, go to M.
   Observed BR => SD, remain in O.
   Observed BW => SD, go to I. Observed BI: go to I.

S: R: hit, remain in S. W => BI, go to M.
   Observed BR: remain in S.
   Observed BW or BI: go to I.

I: R => BR, go to S. W => BW, go to M.
   Observed BR, BW, BI: remain in I.

Assumptions: a write miss in I generates BW, which both fetches the block and invalidates other copies; a write hit in S or O only needs BI; a dirty block evicted from M or O generates a WB.
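One way to sanity-check such a diagram is to encode it as a next-state function. The sketch below is a direct transcription of the transition list above into C; the enum and function names are ours.

#include <stdio.h>

typedef enum { M, O, S, I } State;
typedef enum { PR_READ, PR_WRITE, BUS_READ, BUS_WRITE, BUS_INV } Event;

State next_state(State s, Event e) {
    switch (s) {
    case M:
        if (e == BUS_READ)  return O;  /* BR => SD, keep ownership  */
        if (e == BUS_WRITE) return I;  /* BW => SD, then invalidate */
        return M;                      /* R/W hit                   */
    case O:
        if (e == PR_WRITE)  return M;  /* W => BI                   */
        if (e == BUS_WRITE || e == BUS_INV) return I; /* (BW => SD) */
        return O;                      /* R hit; BR => SD, stay     */
    case S:
        if (e == PR_WRITE)  return M;  /* W => BI                   */
        if (e == BUS_WRITE || e == BUS_INV) return I;
        return S;                      /* R hit; observed BR: stay  */
    case I:
        if (e == PR_READ)   return S;  /* R => BR                   */
        if (e == PR_WRITE)  return M;  /* W => BW                   */
        return I;
    }
    return I; /* unreachable */
}

int main(void) {
    State s = I;
    s = next_state(s, PR_READ);   /* I -> S */
    s = next_state(s, PR_WRITE);  /* S -> M */
    s = next_state(s, BUS_READ);  /* M -> O */
    printf("final state: %d (expected O = 1)\n", s);
    return 0;
}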

[3 marks] (b) One proposed solution for the problem of false sharing is to add a valid bit per word (in a multi-word cache block). This would allow the coherence protocol to invalidate a word without removing the entire block, letting a processor keep a portion of a block in its cache while another processor writes a different portion of the block.

i. First, explain what is meant by false sharing.

False sharing occurs when two processors access the same cache line but do not access the same word within that cache line.

[8 marks] ii. Give two extra complications that are introduced into the basic (MSI) snooping cache coherence protocol if this capability is included.

Any two of the following:
1. Coherence state must be tracked at a per-word granularity instead of per-block (extra bits).
2. The memory system must be able to handle narrow (partial-block) writebacks.
3. On a cache access, both the tag and the offset must be matched to determine a hit.
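A small illustration of false sharing, assuming 64-byte cache lines (the struct layout, names and iteration counts are illustrative): both threads below write the same line even though they never touch the same word, so the line ping-pongs between the two caches in Modified state.

#include <pthread.h>
#include <stdio.h>

/* Two counters in the same (assumed 64-byte) cache line. */
struct { volatile long a, b; } same_line;

/* Padding pushes b onto its own line; the ping-ponging disappears. */
struct { volatile long a; char pad[56]; volatile long b; } split_lines;

static void *bump(void *p) {
    volatile long *ctr = p;
    for (long i = 0; i < 100000000L; i++)
        (*ctr)++; /* each write invalidates the other cache's copy */
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    /* Time this run against one using &split_lines.a / &split_lines.b. */
    pthread_create(&t1, NULL, bump, (void *)&same_line.a);
    pthread_create(&t2, NULL, bump, (void *)&same_line.b);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("a=%ld b=%ld\n", same_line.a, same_line.b);
    return 0;
}

With the per-word valid bits proposed above, each processor could keep its own word valid and the invalidation traffic would disappear, at the cost of the complications listed in part ii.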

Caches

4. Consider a fully associative 128-byte instruction cache with 4-byte blocks (every block can hold one instruction).

[3 marks] (a) Consider an LRU replacement policy. What is the asymptotic instruction miss rate for a 16-instruction loop with a very large number of iterations?

The cache holds 128/4 = 32 blocks, so the 16-instruction loop fits entirely in the cache; after the first iteration, every fetch hits.

Miss rate (16-instruction loop) = 0%

[3 marks] (b) What is the asymptotic instruction miss rate for a 48-instruction loop with a very large number of iterations?

A 48-instruction loop does not fit in the 32-block cache, and with LRU it thrashes: every block is evicted before the loop wraps around to it again, so every fetch misses.

Miss rate (48-instruction loop) = 100%

[6 marks] (c) If the cache replacement policy is changed to most recently used (MRU), where the most recently accessed line is selected for replacement, which loop (16-instruction or 48-instruction) would benefit from this policy? Explain your reasoning.

The first loop (part a) already fits within the cache and would not be affected by the new replacement policy. The second loop (part b) benefits: its asymptotic miss rate drops to 16/48 (33%). On the first iteration, instructions 0-31 fill the empty 32-block cache, and each of the remaining 16 instructions (32-47) replaces the most recently used block. On the next iteration we hit on instructions 0-30, instruction 31 replaces 30, 32 replaces 31, and so on, while instruction 47 (left in the cache by the previous iteration) hits. Subsequent iterations proceed similarly, with the window of 16 missing instructions shifting back by one each time.
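The three miss rates above can be checked with a small fully associative cache simulator (32 four-byte blocks, LRU or MRU victim selection). The code below is our sketch, not part of the exam; it reports the last iteration's miss rate, which is the asymptotic value.

#include <stdio.h>

#define BLOCKS 32

static int tag[BLOCKS];     /* which instruction each block holds */
static long stamp[BLOCKS];  /* last-access time, for recency      */
static long now;

static double run(int n, int iters, int mru) {
    long misses = 0;
    for (int b = 0; b < BLOCKS; b++) { tag[b] = -1; stamp[b] = 0; }
    now = 0;
    for (int it = 0; it < iters; it++) {
        long iter_misses = 0;
        for (int pc = 0; pc < n; pc++) {   /* one pass over the loop */
            now++;
            int hit = -1, victim = 0;
            for (int b = 0; b < BLOCKS; b++)
                if (tag[b] == pc) { hit = b; break; }
            if (hit >= 0) { stamp[hit] = now; continue; }
            iter_misses++;
            /* pick a victim: an empty block first, else LRU or MRU */
            for (int b = 0; b < BLOCKS; b++) {
                if (tag[b] == -1) { victim = b; goto fill; }
                if (mru ? stamp[b] > stamp[victim]
                        : stamp[b] < stamp[victim]) victim = b;
            }
fill:       tag[victim] = pc;
            stamp[victim] = now;
        }
        misses = iter_misses; /* keep only the last iteration's count */
    }
    return 100.0 * misses / n;
}

int main(void) {
    printf("LRU, 16-insn loop: %.0f%% miss rate\n", run(16, 100, 0));
    printf("LRU, 48-insn loop: %.0f%% miss rate\n", run(48, 100, 0));
    printf("MRU, 48-insn loop: %.0f%% miss rate\n", run(48, 100, 1));
    return 0;
}

It prints 0%, 100% and 33%, matching parts (a), (b) and (c).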

5. Pipelining

[2 marks] (a) Consider a single-cycle CPU implementation. When the stages are split by functionality, the stages do not require exactly the same amount of time. The original machine had a clock cycle of 7 ns. After the stages were split, the measured times were F (Fetch): 1 ns; D (Decode): 1.5 ns; E (Execute): 1 ns; M (Memory): 2 ns; W (Writeback): 1.5 ns. The total pipeline register delay is 0.1 ns.

i. What is the clock cycle time of the 5-stage pipelined machine?

The clock cycle is determined by the longest stage plus the pipeline register delay: 2.0 ns + 0.1 ns.

Cycle time = 2.1 ns

[3 marks] ii. If the pipelined machine had an infinite number of stages (the amount of work per stage can be divided into infinitely small chunks), what would its speedup be over the single-cycle machine (ignore any stall cycles)?

As the latency per stage goes to zero, the pipeline register delay is all that remains, so the cycle time approaches 0.1 ns. Speedup = 7 ns (original machine) / 0.1 ns (pipeline register delay).

Speedup = 70
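Stated compactly (a small LaTeX aside; $t_{\text{reg}}$ denotes the pipeline register delay), the two results in (a) come from:

\[
T_{\text{clk}} = \max_i t_i + t_{\text{reg}} = 2.0\,\text{ns} + 0.1\,\text{ns} = 2.1\,\text{ns}
\]
\[
\text{Speedup}_{\infty} = \frac{T_{\text{single-cycle}}}{t_{\text{reg}}} = \frac{7\,\text{ns}}{0.1\,\text{ns}} = 70
\]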

[3 marks] (b) Consider the 5-stage single-issue (scalar) in-order pipeline (F, D, X, M, W) from class with full bypassing support.

i. List the bypassing paths required for full bypassing support. Use the notation FromStage-ToStage for each path. Be sure to indicate if multiple paths are needed between the same two stages (if multiple inputs in the same stage are forwarded to from the same stage, this counts as multiple paths).

2 MX, 2 WX, 1 WM = 5 total paths

[12 marks] ii. How many bypass paths are needed for a 5-stage N-wide in-order superscalar processor to have full bypass support? Place your answer (in terms of N) in the given box. You must justify/explain your answer. (Hint: If you are having trouble generalizing to N, start with a 2-wide processor.)

Each stage now holds N instructions, so every scalar path is multiplied by N producers and N consumers: the 2 MX paths become 2 x N x N, the 2 WX paths become 2 x N x N, and the 1 WM path becomes N x N, for a total of 5N².

# of bypass paths = 5N²

[6 marks] (c) Consider a deeply-pipelined processor for which we implement a branch-target buffer (BTB) for the conditional branches only. Assume that the misprediction penalty is always four cycles and the buffer miss penalty is always three cycles. Assume a 90% hit rate, 90% accuracy, and 15% conditional branch frequency. How much faster is the processor with the branch-target buffer versus a processor that has a fixed two-cycle conditional branch penalty? (A processor with a fixed two-cycle conditional branch penalty does no branch prediction and stalls when it encounters a conditional branch.) Assume the base CPI without conditional branch stalls is one.

Stall cycles per conditional branch with the BTB:
0.90 x 0.90 x 0 cycles (hit and accurate) + 0.10 x 3 cycles (buffer miss) + 0.90 x 0.10 x 4 cycles (hit but mispredicted) = 0.66

CPI with BTB = 1 + 0.66 x 0.15 = 1.099
CPI without BTB = 1 + 2 x 0.15 = 1.3

Speedup = 1.3 / 1.099 = 1.183

[6 marks] 6. A common transformation required in graphics processors is square root. Suppose FP square root (FPSQR) is responsible for 20% of the execution time of a critical graphics benchmark. Proposal A is to enhance the FPSQR hardware and speed up this operation by a factor of 10. Proposal B is to make all FP instructions in the graphics processor run faster by a factor of 1.6; FP instructions are responsible for half of the execution time for the application. Assuming the design effort/time is similar for Proposal A and Proposal B, which proposal would you work on? Justify your answer.

Speedup(A) = 1 / ((1 - 0.2) + 0.2/10) = 1 / 0.82 = 1.22
Speedup(B) = 1 / ((1 - 0.5) + 0.5/1.6) = 1 / 0.8125 = 1.23

Proposal B will give you slightly better performance.
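For reference, both figures are instances of Amdahl's Law (a LaTeX aside; $f$ is the fraction of execution time affected and $s$ the speedup applied to that fraction):

\[
\text{Speedup}_{\text{overall}} = \frac{1}{(1 - f) + f/s}
\]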