University of Toronto Faculty of Applied Science and Engineering

Size: px
Start display at page:

Download "University of Toronto Faculty of Applied Science and Engineering"

Transcription

1 Print: First Name: Solutions Last Name: Student Number: University of Toronto Faculty of Applied Science and Engineering Final Examination December 19, 2011 ECE552F Computer Architecture Examiner Natalie Enright Jerger 1. There are 6 questions and 14 pages. Do all questions. The total number of marks is 100. The duration of the test is 2.5 hours. 2. ALL WORK IS TO BE DONE ON THESE SHEETS! Use the back of the pages if you need more space. Be sure to indicate clearly if your work continues elsewhere. 3. Please put your final solution in the box if one is provided. 4. Clear and concise answers will be considered more favourably than ones that ramble. Do not fill space just because it exists! 5. You may use two 8.5x11 aid sheets. 6. You may use faculty approved non-programmable calculators. 7. Always give some explanations or reasoning for how you arrived at your solutions to help the marker understand your thinking. 1 [23] 2 [10] 3 [23] 4 [12] 5 [26] 6 [6] Total [100] Page 1 of 14

2 1. Start with some short answer questions: [4 marks] (a) Why is using test-and-test-and-set better than using test-and-set for synchronization? Test-and-set (T&S) performs a write every time it tries to acquire the lock; each processor trying to access the lock will install the cache block containing the lock in the modified state in its cache and invalidate all copies in other caches. Test-and-test-and-set (T&T&S) will first perform a read and spin locally on a shared copy of the line until the lock becomes available. T&T&S will issue fewer bus transactions. [4 marks] (b) Why will having more reservations stations than functional units in Tomasulo s algorithm result in better performance than a 1-to-1 ratio of reservation stations to functional units? Instructions can be dispatched to functional units out of program order but reservation stations are allocated in the issue stage (in-order). More reservation stations will increase the likelihood that an instruction that is ready (both operands are available) will be found. Page 2 of 14

3 [4 marks] (c) What is fine-grained multithreading? What benefit does it achieve? Fine-grained multithreading (FGMT) switches between multiple threads on a cycle-by-cycle basis in a round robin fashion. FGMT is able to hide the latency of short stalls such as load-touse penalties and branch mispredictions. [4 marks] (d) Give one advantage of implementing load-linked (ll)/store-conditional (sc) instructions instead of an exchange (exch) instruction. LL/SC are more RISC-like and therefore are easier to pipeline. EXCH implements a load and a store in one instruction making it more difficult to pipeline. Page 3 of 14

4 [2 marks] (e) What is the difference between coherence and consistency for multiprocessors? Coherence creates a globally consistent (uniform) view of accesses to a single memory address. Consistency creates a globally consistent (uniform) view of accesses to all memory addresses [5 marks] (f) Can a multiprocessor built with dynamically scheduled (out-of-order) processors be sequentially consistent? Explain your reasoning. Yes. Stores occur in program order at retirement so they are not a problem. However, loads occur out of program order in the execute stage. In order to achieve sequential consistency, loads must be treated as speculative. Coherence events to speculative loads are treated as a mis-speculation and require the load (and subsequent instructions) to be re-executed similar to a branch mis-prediction. Page 4 of 14

5 2. Dynamic Scheduling This snapshot of the ROB, Map Table and Free list for a MIPSR10K-like dynamically scheduled scalar processor was taken as the st instruction is about to retire. In the table, h denotes the head of the ROB and t denotes the tail of the ROB. ROB ROB number Insn T T old S D X C 1 mult R0, R2 R5 PR 8 PR 5 c1 c2 c3-c8 c9 2 add R5, R2 R5 PR11 PR8 c2 c9 c10 c11 3 add R1, R4 R3 PR 7 PR 3 c3 c4 c5 c6 h 4 st R3 [R2] c4 c6 c7 c8 5 div R0, R2 R3 PR9 PR7 c5 c6 c7-6 sub R4, R2 R4 PR10 PR6 c6 c7 c8 c10 t 7 ld [R3] R3 PR4 PR9 c7 Map Table R0 PR 0 R1 PR1 R2 PR2 R3 PR4 R4 PR10 R5 PR11 Free List PR 8, PR 5, PR 3 [4 marks] (a) In which cycle will the store retire? Explain your reasoning. Cycle = 14 Retire occurs in order and since it is a scalar processor, only one instruction can retire each cycle. Instruction 1 (mult) can retire at cycle 10, Insn 2 can retire at cycle 12, Insn 3 at cycle 13 and the store (insn 4) at cycle 14. Page 5 of 14

6 [6 marks] (b) Now assume the store experiences a page fault. Fill in the tables to show what the state of the Map Table and Free List should be right before the processor proceeds to handle the page fault. Map Table R0 R1 R2 R3 R4 R5 PR0 PR1 PR2 PR7 PR6 PR11 Free List PR8, PR5, PR3, PR10, PR9, PR4 Page 6 of 14

7 3. Multiprocessor Issues [12 marks] (a) Draw the state transition diagram for a snooping-based MOSI protocol. The states M, S, I correspond to the the Modified, Shared and Invalid state discussed in class. An O (Owned) has been added. The Owned state indicates that even though other shared copies of the block may exist, this cache (instead of main memory) is responsible for supplying the data when it observes a relevant bus transaction. Label the arcs in the transition diagram with the convention used in class: event generatedevent. Example: R BR denotes that the processor has experienced a read miss (R) and must generate a bus read (BR) to obtain the data. Please include any comments or assumptions to help the marker interpret your diagram. Please use this notation (You may add additional abbreviations as necessary. Be sure to clarify your notation for the marker): Bus Read: BR Bus Write: BW Read: R Write: W Send Data: SD Writeback: WB BR, BW M W => BW W=>BI BW=>SD, WB=>SD I BW, BI R => BR S R, BR R,W W=>BI BR=>SD BI, BW=>SD O BR => SD, R Page 7 of 14

8 [3 marks] (b) One proposed solution for the problem of false sharing is to add a valid bit per word (in a multi-word cache block). This would allow the coherence protocol to invalidate a word without removing the entire block, letting a processor keep a portion of a block in its cache while another processor writes a different portion of the block. i. First, explain what is meant by false sharing. False sharing occurs when two processors access the same cache line but do not access the same word within that cache line [8 marks] ii. Give two extra complications that are introduced into the basic (MSI) snooping cache coherence protocol if this capability is included? 1. Coherence state must be tracked on a per-word granularity instead of per-block (extra bits). 2. The memory system must be able to handle narrow writebacks. 3. On a cache access, need to match both tag and offset. Page 8 of 14

9 Caches 4. Consider a fully associative 128-byte instruction cache with 4-byte blocks (every block can hold one instruction). [3 marks] (a) Consider an LRU replacement policy. What is the asymptotic instruction miss rate for a 16 instruction loop with a very large number of iterations? Miss rate (16 instruction) = 0% [3 marks] (b) What is the asymptotic instruction miss rate for a 48 instruction loop with a very large number of iterations? Miss rate (48 instruction) = 100% Page 9 of 14

10 [6 marks] (c) If the cache replacement policy is changed to most recently used (MRU) where the most recently accessed line is selected for replacement, which loop (16 instruction or 48 instruction) would benefit from this policy? Explain your reasoning. The first loop (part a) already fits within the cache and would not be impacted by the new replacement policy. The second loop (part b) would reduce its miss rate to 17/48 (35%). The first 31 instructions would be placed in the empty cache. For the remaining 17 instructions, they will replace the most recently used block. On the next iteration, we will hit on the first 0-30 instructions, instruction 31 will replace 30, 32 will replace 31 and so on. We will hit on instruction 47. Subsequent iterations will proceed similarly. Page 10 of 14

11 5. Pipelining [2 marks] (a) Consider a single-cycle CPU implementation. When the stages are split by functionality, the stages do not require exactly the same amount of time. The original machine had a clock cycle of 7ns. After the stages were split, the measured times were F (Fetch): 1 ns; D (Decode): 1.5 ns; E (Execute): 1 ns; M (Memory): 2 ns; W (Writeback): 1.5 ns. The total pipeline register delay is 0.1 ns. i. What is the clock cycle time of the 5-stage pipelined machine? The clock cycle is determined by the longest stage + the pipeline register delay = 2.0ns + 0.1in. [3 marks] Cycle time = 2.1ns ii. If the pipelined machine had an infinite number of stages (the amount of work per stage can be divided into infinitely small chunks), what would its speedup be over the single-cycle machine (Ignore any stall cycles)? If the latency per stage goes to zero, the pipeline register delay is all that remains. Speedup = 7ns (original machine) / 0.1ns (pipeline register delay). Speedup = 70 Page 11 of 14

12 [3 marks] (b) Consider the 5-stage single issue (scalar) in-order pipeline (F,D,X,M,W) from class with full bypassing support. i. List the bypassing paths required for full bypassing support. Use the notation FromStage- ToStage for each path. Be sure to indicate if multiple paths are needed between the same two stages (If multiple inputs in the same stage are forwarded to from the same stage, this counts as multiple paths). 2 MX, 2 WX, 1 WM = 5 total paths [12 marks] ii. How many bypass paths are needed for a 5-stage N-wide in-order superscalar processor to have full bypass support? Place your answer (in terms of N) in the given box. You must justify/explain your answer. (Hint: If you are having trouble generalizing to N, start with a 2-wide processor). # of bypass paths = 5N 2 Page 12 of 14

13 [6 marks] (c) Consider a deeply-pipelined processor for which we implement a branch-target buffer (BTB) for the conditional branches only. Assume that the misprediction penalty is always four cycles and the buffer miss penalty is always three cycles. Assume a 90% hit rate, 90% accuracy and 15% conditional branch frequency. How much faster is the processor with the branch-target buffer versus a processor that has a fixed two-cycle conditional branch penalty (A processor with a fixed two-cycle conditional branch penalty does no branch prediction and stalls when it encounters a conditional branch). Assume the base CPI without conditional branch stalls is one cycles(hitandaccurate) cycles(buf f ermiss) (hit+ mis prediction) = = No BTB = = 1.3 Speedup = 1.3 / Speedup = Page 13 of 14

14 [6 marks] 6. A common transformation required in graphics processors is square root. Suppose FP square root (FPSQR) is responsible for 20% of the execution time of a critical graphics benchmark. Proposal A is to enhance the FPSQR hardware and speed up this operation by a factor of 10. Proposal B is just to try to make all FP instructions in the graphics processor run faster by a factor of 1.6; FP instructions are responsible for half of the execution time for the application. Assuming the design effort/time is similar for Proposal A and Proposal B, which proposal would you work on? Justify your answer. 1 Proposal A = = Proposal B = = 1.23 Proposal B will give you slightly better performance. Page 14 of 14

University of Toronto Faculty of Applied Science and Engineering

University of Toronto Faculty of Applied Science and Engineering Print: First Name:......... SOLUTION............... Last Name:............................. Student Number:............................................... University of Toronto Faculty of Applied Science

More information

University of Toronto Faculty of Applied Science and Engineering

University of Toronto Faculty of Applied Science and Engineering Print: First Name:Solution...................... Last Name:............................. Student Number:............................................... University of Toronto Faculty of Applied Science

More information

University of Toronto Faculty of Applied Science and Engineering

University of Toronto Faculty of Applied Science and Engineering Print: First Name:............ Solutions............ Last Name:............................. Student Number:............................................... University of Toronto Faculty of Applied Science

More information

University of Toronto Faculty of Applied Science and Engineering

University of Toronto Faculty of Applied Science and Engineering Print: First Name:............ Solutions............ Last Name:............................. Student Number:............................................... University of Toronto Faculty of Applied Science

More information

CS 654 Computer Architecture Summary. Peter Kemper

CS 654 Computer Architecture Summary. Peter Kemper CS 654 Computer Architecture Summary Peter Kemper Chapters in Hennessy & Patterson Ch 1: Fundamentals Ch 2: Instruction Level Parallelism Ch 3: Limits on ILP Ch 4: Multiprocessors & TLP Ap A: Pipelining

More information

CS433 Homework 2 (Chapter 3)

CS433 Homework 2 (Chapter 3) CS433 Homework 2 (Chapter 3) Assigned on 9/19/2017 Due in class on 10/5/2017 Instructions: 1. Please write your name and NetID clearly on the first page. 2. Refer to the course fact sheet for policies

More information

Reorder Buffer Implementation (Pentium Pro) Reorder Buffer Implementation (Pentium Pro)

Reorder Buffer Implementation (Pentium Pro) Reorder Buffer Implementation (Pentium Pro) Reorder Buffer Implementation (Pentium Pro) Hardware data structures retirement register file (RRF) (~ IBM 360/91 physical registers) physical register file that is the same size as the architectural registers

More information

UNIT I (Two Marks Questions & Answers)

UNIT I (Two Marks Questions & Answers) UNIT I (Two Marks Questions & Answers) Discuss the different ways how instruction set architecture can be classified? Stack Architecture,Accumulator Architecture, Register-Memory Architecture,Register-

More information

ECE/CS 757: Homework 1

ECE/CS 757: Homework 1 ECE/CS 757: Homework 1 Cores and Multithreading 1. A CPU designer has to decide whether or not to add a new micoarchitecture enhancement to improve performance (ignoring power costs) of a block (coarse-grain)

More information

CS433 Homework 2 (Chapter 3)

CS433 Homework 2 (Chapter 3) CS Homework 2 (Chapter ) Assigned on 9/19/2017 Due in class on 10/5/2017 Instructions: 1. Please write your name and NetID clearly on the first page. 2. Refer to the course fact sheet for policies on collaboration..

More information

Hardware-Based Speculation

Hardware-Based Speculation Hardware-Based Speculation Execute instructions along predicted execution paths but only commit the results if prediction was correct Instruction commit: allowing an instruction to update the register

More information

Keywords and Review Questions

Keywords and Review Questions Keywords and Review Questions lec1: Keywords: ISA, Moore s Law Q1. Who are the people credited for inventing transistor? Q2. In which year IC was invented and who was the inventor? Q3. What is ISA? Explain

More information

Computer System Architecture Final Examination Spring 2002

Computer System Architecture Final Examination Spring 2002 Computer System Architecture 6.823 Final Examination Spring 2002 Name: This is an open book, open notes exam. 180 Minutes 22 Pages Notes: Not all questions are of equal difficulty, so look over the entire

More information

Alexandria University

Alexandria University Alexandria University Faculty of Engineering Computer and Communications Department CC322: CC423: Advanced Computer Architecture Sheet 3: Instruction- Level Parallelism and Its Exploitation 1. What would

More information

Hardware-based Speculation

Hardware-based Speculation Hardware-based Speculation Hardware-based Speculation To exploit instruction-level parallelism, maintaining control dependences becomes an increasing burden. For a processor executing multiple instructions

More information

Chapter 3 Instruction-Level Parallelism and its Exploitation (Part 1)

Chapter 3 Instruction-Level Parallelism and its Exploitation (Part 1) Chapter 3 Instruction-Level Parallelism and its Exploitation (Part 1) ILP vs. Parallel Computers Dynamic Scheduling (Section 3.4, 3.5) Dynamic Branch Prediction (Section 3.3) Hardware Speculation and Precise

More information

MULTIPROCESSORS AND THREAD-LEVEL. B649 Parallel Architectures and Programming

MULTIPROCESSORS AND THREAD-LEVEL. B649 Parallel Architectures and Programming MULTIPROCESSORS AND THREAD-LEVEL PARALLELISM B649 Parallel Architectures and Programming Motivation behind Multiprocessors Limitations of ILP (as already discussed) Growing interest in servers and server-performance

More information

MULTIPROCESSORS AND THREAD-LEVEL PARALLELISM. B649 Parallel Architectures and Programming

MULTIPROCESSORS AND THREAD-LEVEL PARALLELISM. B649 Parallel Architectures and Programming MULTIPROCESSORS AND THREAD-LEVEL PARALLELISM B649 Parallel Architectures and Programming Motivation behind Multiprocessors Limitations of ILP (as already discussed) Growing interest in servers and server-performance

More information

Updated Exercises by Diana Franklin

Updated Exercises by Diana Franklin C-82 Appendix C Pipelining: Basic and Intermediate Concepts Updated Exercises by Diana Franklin C.1 [15/15/15/15/25/10/15] Use the following code fragment: Loop: LD R1,0(R2) ;load R1 from address

More information

CS6303 Computer Architecture Regulation 2013 BE-Computer Science and Engineering III semester 2 MARKS

CS6303 Computer Architecture Regulation 2013 BE-Computer Science and Engineering III semester 2 MARKS CS6303 Computer Architecture Regulation 2013 BE-Computer Science and Engineering III semester 2 MARKS UNIT-I OVERVIEW & INSTRUCTIONS 1. What are the eight great ideas in computer architecture? The eight

More information

Tutorial 11. Final Exam Review

Tutorial 11. Final Exam Review Tutorial 11 Final Exam Review Introduction Instruction Set Architecture: contract between programmer and designers (e.g.: IA-32, IA-64, X86-64) Computer organization: describe the functional units, cache

More information

Chapter 03. Authors: John Hennessy & David Patterson. Copyright 2011, Elsevier Inc. All rights Reserved. 1

Chapter 03. Authors: John Hennessy & David Patterson. Copyright 2011, Elsevier Inc. All rights Reserved. 1 Chapter 03 Authors: John Hennessy & David Patterson Copyright 2011, Elsevier Inc. All rights Reserved. 1 Figure 3.3 Comparison of 2-bit predictors. A noncorrelating predictor for 4096 bits is first, followed

More information

Chapter. Out of order Execution

Chapter. Out of order Execution Chapter Long EX Instruction stages We have assumed that all stages. There is a problem with the EX stage multiply (MUL) takes more time than ADD MUL ADD We can clearly delay the execution of the ADD until

More information

1 Tomasulo s Algorithm

1 Tomasulo s Algorithm Design of Digital Circuits (252-0028-00L), Spring 2018 Optional HW 4: Out-of-Order Execution, Dataflow, Branch Prediction, VLIW, and Fine-Grained Multithreading uctor: Prof. Onur Mutlu TAs: Juan Gomez

More information

Write only as much as necessary. Be brief!

Write only as much as necessary. Be brief! 1 CIS371 Computer Organization and Design Final Exam Prof. Martin Wednesday, May 2nd, 2012 This exam is an individual-work exam. Write your answers on these pages. Additional pages may be attached (with

More information

For this problem, consider the following architecture specifications: Functional Unit Type Cycles in EX Number of Functional Units

For this problem, consider the following architecture specifications: Functional Unit Type Cycles in EX Number of Functional Units CS333: Computer Architecture Spring 006 Homework 3 Total Points: 49 Points (undergrad), 57 Points (graduate) Due Date: Feb. 8, 006 by 1:30 pm (See course information handout for more details on late submissions)

More information

Four Steps of Speculative Tomasulo cycle 0

Four Steps of Speculative Tomasulo cycle 0 HW support for More ILP Hardware Speculative Execution Speculation: allow an instruction to issue that is dependent on branch, without any consequences (including exceptions) if branch is predicted incorrectly

More information

CS 1013 Advance Computer Architecture UNIT I

CS 1013 Advance Computer Architecture UNIT I CS 1013 Advance Computer Architecture UNIT I 1. What are embedded computers? List their characteristics. Embedded computers are computers that are lodged into other devices where the presence of the computer

More information

/ : Computer Architecture and Design Fall Final Exam December 4, Name: ID #:

/ : Computer Architecture and Design Fall Final Exam December 4, Name: ID #: 16.482 / 16.561: Computer Architecture and Design Fall 2014 Final Exam December 4, 2014 Name: ID #: For this exam, you may use a calculator and two 8.5 x 11 double-sided page of notes. All other electronic

More information

Lecture 9: More ILP. Today: limits of ILP, case studies, boosting ILP (Sections )

Lecture 9: More ILP. Today: limits of ILP, case studies, boosting ILP (Sections ) Lecture 9: More ILP Today: limits of ILP, case studies, boosting ILP (Sections 3.8-3.14) 1 ILP Limits The perfect processor: Infinite registers (no WAW or WAR hazards) Perfect branch direction and target

More information

Chapter 4. Advanced Pipelining and Instruction-Level Parallelism. In-Cheol Park Dept. of EE, KAIST

Chapter 4. Advanced Pipelining and Instruction-Level Parallelism. In-Cheol Park Dept. of EE, KAIST Chapter 4. Advanced Pipelining and Instruction-Level Parallelism In-Cheol Park Dept. of EE, KAIST Instruction-level parallelism Loop unrolling Dependence Data/ name / control dependence Loop level parallelism

More information

ECE 341 Final Exam Solution

ECE 341 Final Exam Solution ECE 341 Final Exam Solution Time allowed: 110 minutes Total Points: 100 Points Scored: Name: Problem No. 1 (10 points) For each of the following statements, indicate whether the statement is TRUE or FALSE.

More information

CS 2410 Mid term (fall 2018)

CS 2410 Mid term (fall 2018) CS 2410 Mid term (fall 2018) Name: Question 1 (6+6+3=15 points): Consider two machines, the first being a 5-stage operating at 1ns clock and the second is a 12-stage operating at 0.7ns clock. Due to data

More information

Computer Architecture Spring 2016

Computer Architecture Spring 2016 Computer Architecture Spring 2016 Final Review Shuai Wang Department of Computer Science and Technology Nanjing University Computer Architecture Computer architecture, like other architecture, is the art

More information

Design of Digital Circuits ( L) ETH Zürich, Spring 2017

Design of Digital Circuits ( L) ETH Zürich, Spring 2017 Name: Student ID: Final Examination Design of Digital Circuits (252-0028-00L) ETH Zürich, Spring 2017 Professors Onur Mutlu and Srdjan Capkun Problem 1 (70 Points): Problem 2 (50 Points): Problem 3 (40

More information

Exam-2 Scope. 3. Shared memory architecture, distributed memory architecture, SMP, Distributed Shared Memory and Directory based coherence

Exam-2 Scope. 3. Shared memory architecture, distributed memory architecture, SMP, Distributed Shared Memory and Directory based coherence Exam-2 Scope 1. Memory Hierarchy Design (Cache, Virtual memory) Chapter-2 slides memory-basics.ppt Optimizations of Cache Performance Memory technology and optimizations Virtual memory 2. SIMD, MIMD, Vector,

More information

Chapter 5. Multiprocessors and Thread-Level Parallelism

Chapter 5. Multiprocessors and Thread-Level Parallelism Computer Architecture A Quantitative Approach, Fifth Edition Chapter 5 Multiprocessors and Thread-Level Parallelism 1 Introduction Thread-Level parallelism Have multiple program counters Uses MIMD model

More information

Memory Hierarchy Computing Systems & Performance MSc Informatics Eng. Memory Hierarchy (most slides are borrowed)

Memory Hierarchy Computing Systems & Performance MSc Informatics Eng. Memory Hierarchy (most slides are borrowed) Computing Systems & Performance Memory Hierarchy MSc Informatics Eng. 2012/13 A.J.Proença Memory Hierarchy (most slides are borrowed) AJProença, Computer Systems & Performance, MEI, UMinho, 2012/13 1 2

More information

EECS 470 Midterm Exam Winter 2008 answers

EECS 470 Midterm Exam Winter 2008 answers EECS 470 Midterm Exam Winter 2008 answers Name: KEY unique name: KEY Sign the honor code: I have neither given nor received aid on this exam nor observed anyone else doing so. Scores: #Page Points 2 /10

More information

Lecture-13 (ROB and Multi-threading) CS422-Spring

Lecture-13 (ROB and Multi-threading) CS422-Spring Lecture-13 (ROB and Multi-threading) CS422-Spring 2018 Biswa@CSE-IITK Cycle 62 (Scoreboard) vs 57 in Tomasulo Instruction status: Read Exec Write Exec Write Instruction j k Issue Oper Comp Result Issue

More information

CS433 Midterm. Prof Josep Torrellas. October 16, Time: 1 hour + 15 minutes

CS433 Midterm. Prof Josep Torrellas. October 16, Time: 1 hour + 15 minutes CS433 Midterm Prof Josep Torrellas October 16, 2014 Time: 1 hour + 15 minutes Name: Alias: Instructions: 1. This is a closed-book, closed-notes examination. 2. The Exam has 4 Questions. Please budget your

More information

CS 2410 Mid term (fall 2015) Indicate which of the following statements is true and which is false.

CS 2410 Mid term (fall 2015) Indicate which of the following statements is true and which is false. CS 2410 Mid term (fall 2015) Name: Question 1 (10 points) Indicate which of the following statements is true and which is false. (1) SMT architectures reduces the thread context switch time by saving in

More information

HW1 Solutions. Type Old Mix New Mix Cost CPI

HW1 Solutions. Type Old Mix New Mix Cost CPI HW1 Solutions Problem 1 TABLE 1 1. Given the parameters of Problem 6 (note that int =35% and shift=5% to fix typo in book problem), consider a strength-reducing optimization that converts multiplies by

More information

CPE 631 Advanced Computer Systems Architecture: Homework #3,#4

CPE 631 Advanced Computer Systems Architecture: Homework #3,#4 Issued: 04/09/2007 Due: 04/18/2007 CPE 631 Advanced Computer Systems Architecture: Homework #3,#4 Name: Q1 Q2 Q3 Q4 Q5 Total 20 20 20 60 40 160 Question #1: (20 points) 1.1 (20 points) Consider the following

More information

RECAP. B649 Parallel Architectures and Programming

RECAP. B649 Parallel Architectures and Programming RECAP B649 Parallel Architectures and Programming RECAP 2 Recap ILP Exploiting ILP Dynamic scheduling Thread-level Parallelism Memory Hierarchy Other topics through student presentations Virtual Machines

More information

Multi-cycle Instructions in the Pipeline (Floating Point)

Multi-cycle Instructions in the Pipeline (Floating Point) Lecture 6 Multi-cycle Instructions in the Pipeline (Floating Point) Introduction to instruction level parallelism Recap: Support of multi-cycle instructions in a pipeline (App A.5) Recap: Superpipelining

More information

TDT Coarse-Grained Multithreading. Review on ILP. Multi-threaded execution. Contents. Fine-Grained Multithreading

TDT Coarse-Grained Multithreading. Review on ILP. Multi-threaded execution. Contents. Fine-Grained Multithreading Review on ILP TDT 4260 Chap 5 TLP & Hierarchy What is ILP? Let the compiler find the ILP Advantages? Disadvantages? Let the HW find the ILP Advantages? Disadvantages? Contents Multi-threading Chap 3.5

More information

Load1 no Load2 no Add1 Y Sub Reg[F2] Reg[F6] Add2 Y Add Reg[F2] Add1 Add3 no Mult1 Y Mul Reg[F2] Reg[F4] Mult2 Y Div Reg[F6] Mult1

Load1 no Load2 no Add1 Y Sub Reg[F2] Reg[F6] Add2 Y Add Reg[F2] Add1 Add3 no Mult1 Y Mul Reg[F2] Reg[F4] Mult2 Y Div Reg[F6] Mult1 Instruction Issue Execute Write result L.D F6, 34(R2) L.D F2, 45(R3) MUL.D F0, F2, F4 SUB.D F8, F2, F6 DIV.D F10, F0, F6 ADD.D F6, F8, F2 Name Busy Op Vj Vk Qj Qk A Load1 no Load2 no Add1 Y Sub Reg[F2]

More information

Week 11: Assignment Solutions

Week 11: Assignment Solutions Week 11: Assignment Solutions 1. Consider an instruction pipeline with four stages with the stage delays 5 nsec, 6 nsec, 11 nsec, and 8 nsec respectively. The delay of an inter-stage register stage of

More information

c. What are the machine cycle times (in nanoseconds) of the non-pipelined and the pipelined implementations?

c. What are the machine cycle times (in nanoseconds) of the non-pipelined and the pipelined implementations? Brown University School of Engineering ENGN 164 Design of Computing Systems Professor Sherief Reda Homework 07. 140 points. Due Date: Monday May 12th in B&H 349 1. [30 points] Consider the non-pipelined

More information

Memory Hierarchy Computing Systems & Performance MSc Informatics Eng. Memory Hierarchy (most slides are borrowed)

Memory Hierarchy Computing Systems & Performance MSc Informatics Eng. Memory Hierarchy (most slides are borrowed) Computing Systems & Performance Memory Hierarchy MSc Informatics Eng. 2011/12 A.J.Proença Memory Hierarchy (most slides are borrowed) AJProença, Computer Systems & Performance, MEI, UMinho, 2011/12 1 2

More information

CS252 Graduate Computer Architecture Midterm 1 Solutions

CS252 Graduate Computer Architecture Midterm 1 Solutions CS252 Graduate Computer Architecture Midterm 1 Solutions Part A: Branch Prediction (22 Points) Consider a fetch pipeline based on the UltraSparc-III processor (as seen in Lecture 5). In this part, we evaluate

More information

Complex Pipelines and Branch Prediction

Complex Pipelines and Branch Prediction Complex Pipelines and Branch Prediction Daniel Sanchez Computer Science & Artificial Intelligence Lab M.I.T. L22-1 Processor Performance Time Program Instructions Program Cycles Instruction CPI Time Cycle

More information

ENGN 2910A Homework 03 (140 points) Due Date: Oct 3rd 2013

ENGN 2910A Homework 03 (140 points) Due Date: Oct 3rd 2013 ENGN 2910A Homework 03 (140 points) Due Date: Oct 3rd 2013 Professor: Sherief Reda School of Engineering, Brown University 1. [from Debois et al. 30 points] Consider the non-pipelined implementation of

More information

EECC551 Exam Review 4 questions out of 6 questions

EECC551 Exam Review 4 questions out of 6 questions EECC551 Exam Review 4 questions out of 6 questions (Must answer first 2 questions and 2 from remaining 4) Instruction Dependencies and graphs In-order Floating Point/Multicycle Pipelining (quiz 2) Improving

More information

Good luck and have fun!

Good luck and have fun! Midterm Exam October 13, 2014 Name: Problem 1 2 3 4 total Points Exam rules: Time: 90 minutes. Individual test: No team work! Open book, open notes. No electronic devices, except an unprogrammed calculator.

More information

Instruction Level Parallelism (ILP)

Instruction Level Parallelism (ILP) 1 / 26 Instruction Level Parallelism (ILP) ILP: The simultaneous execution of multiple instructions from a program. While pipelining is a form of ILP, the general application of ILP goes much further into

More information

" # " $ % & ' ( ) * + $ " % '* + * ' "

 #  $ % & ' ( ) * + $  % '* + * ' ! )! # & ) * + * + * & *,+,- Update Instruction Address IA Instruction Fetch IF Instruction Decode ID Execute EX Memory Access ME Writeback Results WB Program Counter Instruction Register Register File

More information

EECS 470 Midterm Exam Winter 2009

EECS 470 Midterm Exam Winter 2009 EECS 70 Midterm Exam Winter 2009 Name: unique name: Sign the honor code: I have neither given nor received aid on this exam nor observed anyone else doing so. Scores: # Points 1 / 18 2 / 12 3 / 29 / 21

More information

/ : Computer Architecture and Design Spring Final Exam May 1, Name: ID #:

/ : Computer Architecture and Design Spring Final Exam May 1, Name: ID #: 16.482 / 16.561: Computer Architecture and Design Spring 2014 Final Exam May 1, 2014 Name: ID #: For this exam, you may use a calculator and two 8.5 x 11 double-sided page of notes. All other electronic

More information

CSE502 Lecture 15 - Tue 3Nov09 Review: MidTerm Thu 5Nov09 - Outline of Major Topics

CSE502 Lecture 15 - Tue 3Nov09 Review: MidTerm Thu 5Nov09 - Outline of Major Topics CSE502 Lecture 15 - Tue 3Nov09 Review: MidTerm Thu 5Nov09 - Outline of Major Topics Computing system: performance, speedup, performance/cost Origins and benefits of scalar instruction pipelines and caches

More information

Computer Science 146. Computer Architecture

Computer Science 146. Computer Architecture Computer Architecture Spring 24 Harvard University Instructor: Prof. dbrooks@eecs.harvard.edu Lecture 2: More Multiprocessors Computation Taxonomy SISD SIMD MISD MIMD ILP Vectors, MM-ISAs Shared Memory

More information

Chapter 5. Multiprocessors and Thread-Level Parallelism

Chapter 5. Multiprocessors and Thread-Level Parallelism Computer Architecture A Quantitative Approach, Fifth Edition Chapter 5 Multiprocessors and Thread-Level Parallelism 1 Introduction Thread-Level parallelism Have multiple program counters Uses MIMD model

More information

Instruction-Level Parallelism and Its Exploitation

Instruction-Level Parallelism and Its Exploitation Chapter 2 Instruction-Level Parallelism and Its Exploitation 1 Overview Instruction level parallelism Dynamic Scheduling Techniques es Scoreboarding Tomasulo s s Algorithm Reducing Branch Cost with Dynamic

More information

EECS 470 Midterm Exam Winter 2015

EECS 470 Midterm Exam Winter 2015 EECS 470 Midterm Exam Winter 2015 Name: unique name: Sign the honor code: I have neither given nor received aid on this exam nor observed anyone else doing so. Scores: Page # Points 2 /20 3 /15 4 /9 5

More information

SOLUTION. Midterm #1 February 26th, 2018 Professor Krste Asanovic Name:

SOLUTION. Midterm #1 February 26th, 2018 Professor Krste Asanovic Name: SOLUTION Notes: CS 152 Computer Architecture and Engineering CS 252 Graduate Computer Architecture Midterm #1 February 26th, 2018 Professor Krste Asanovic Name: I am taking CS152 / CS252 This is a closed

More information

LRU. Pseudo LRU A B C D E F G H A B C D E F G H H H C. Copyright 2012, Elsevier Inc. All rights reserved.

LRU. Pseudo LRU A B C D E F G H A B C D E F G H H H C. Copyright 2012, Elsevier Inc. All rights reserved. LRU A list to keep track of the order of access to every block in the set. The least recently used block is replaced (if needed). How many bits we need for that? 27 Pseudo LRU A B C D E F G H A B C D E

More information

CSE 820 Graduate Computer Architecture. week 6 Instruction Level Parallelism. Review from Last Time #1

CSE 820 Graduate Computer Architecture. week 6 Instruction Level Parallelism. Review from Last Time #1 CSE 820 Graduate Computer Architecture week 6 Instruction Level Parallelism Based on slides by David Patterson Review from Last Time #1 Leverage Implicit Parallelism for Performance: Instruction Level

More information

Computer Systems Architecture

Computer Systems Architecture Computer Systems Architecture Lecture 24 Mahadevan Gomathisankaran April 29, 2010 04/29/2010 Lecture 24 CSCE 4610/5610 1 Reminder ABET Feedback: http://www.cse.unt.edu/exitsurvey.cgi?csce+4610+001 Student

More information

DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING UNIT-1

DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING UNIT-1 DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING Year & Semester : III/VI Section : CSE-1 & CSE-2 Subject Code : CS2354 Subject Name : Advanced Computer Architecture Degree & Branch : B.E C.S.E. UNIT-1 1.

More information

Hardware-Based Speculation

Hardware-Based Speculation Hardware-Based Speculation Execute instructions along predicted execution paths but only commit the results if prediction was correct Instruction commit: allowing an instruction to update the register

More information

Computer Architecture EE 4720 Final Examination

Computer Architecture EE 4720 Final Examination Name Computer Architecture EE 4720 Final Examination Primary: 6 December 1999, Alternate: 7 December 1999, 10:00 12:00 CST 15:00 17:00 CST Alias Problem 1 Problem 2 Problem 3 Problem 4 Exam Total (25 pts)

More information

East Tennessee State University Department of Computer and Information Sciences CSCI 4717 Computer Architecture TEST 3 for Fall Semester, 2005

East Tennessee State University Department of Computer and Information Sciences CSCI 4717 Computer Architecture TEST 3 for Fall Semester, 2005 Points missed: Student's Name: Total score: /100 points East Tennessee State University Department of Computer and Information Sciences CSCI 4717 Computer Architecture TEST 3 for Fall Semester, 2005 Section

More information

CS433 Final Exam. Prof Josep Torrellas. December 12, Time: 2 hours

CS433 Final Exam. Prof Josep Torrellas. December 12, Time: 2 hours CS433 Final Exam Prof Josep Torrellas December 12, 2006 Time: 2 hours Name: Instructions: 1. This is a closed-book, closed-notes examination. 2. The Exam has 6 Questions. Please budget your time. 3. Calculators

More information

ECE 485/585 Microprocessor System Design

ECE 485/585 Microprocessor System Design Microprocessor System Design Lecture 11: Reducing Hit Time Cache Coherence Zeshan Chishti Electrical and Computer Engineering Dept Maseeh College of Engineering and Computer Science Source: Lecture based

More information

S = 32 2 d kb (1) L = 32 2 D B (2) A = 2 2 m mod 4 (3) W = 16 2 y mod 4 b (4)

S = 32 2 d kb (1) L = 32 2 D B (2) A = 2 2 m mod 4 (3) W = 16 2 y mod 4 b (4) 1 Cache Design You have already written your civic registration number (personnummer) on the cover page in the format YyMmDd-XXXX. Use the following formulas to calculate the parameters of your caches:

More information

EXAM 1 SOLUTIONS. Midterm Exam. ECE 741 Advanced Computer Architecture, Spring Instructor: Onur Mutlu

EXAM 1 SOLUTIONS. Midterm Exam. ECE 741 Advanced Computer Architecture, Spring Instructor: Onur Mutlu Midterm Exam ECE 741 Advanced Computer Architecture, Spring 2009 Instructor: Onur Mutlu TAs: Michael Papamichael, Theodoros Strigkos, Evangelos Vlachos February 25, 2009 EXAM 1 SOLUTIONS Problem Points

More information

Lecture 8 Dynamic Branch Prediction, Superscalar and VLIW. Computer Architectures S

Lecture 8 Dynamic Branch Prediction, Superscalar and VLIW. Computer Architectures S Lecture 8 Dynamic Branch Prediction, Superscalar and VLIW Computer Architectures 521480S Dynamic Branch Prediction Performance = ƒ(accuracy, cost of misprediction) Branch History Table (BHT) is simplest

More information

Advanced issues in pipelining

Advanced issues in pipelining Advanced issues in pipelining 1 Outline Handling exceptions Supporting multi-cycle operations Pipeline evolution Examples of real pipelines 2 Handling exceptions 3 Exceptions In pipelined execution, one

More information

Instruction Level Parallelism

Instruction Level Parallelism Instruction Level Parallelism The potential overlap among instruction execution is called Instruction Level Parallelism (ILP) since instructions can be executed in parallel. There are mainly two approaches

More information

CS 2506 Computer Organization II Test 2. Do not start the test until instructed to do so! printed

CS 2506 Computer Organization II Test 2. Do not start the test until instructed to do so! printed Instructions: Print your name in the space provided below. This examination is closed book and closed notes, aside from the permitted fact sheet, with a restriction: 1) one 8.5x11 sheet, both sides, handwritten

More information

Architectures for Instruction-Level Parallelism

Architectures for Instruction-Level Parallelism Low Power VLSI System Design Lecture : Low Power Microprocessor Design Prof. R. Iris Bahar October 0, 07 The HW/SW Interface Seminar Series Jointly sponsored by Engineering and Computer Science Hardware-Software

More information

Getting CPI under 1: Outline

Getting CPI under 1: Outline CMSC 411 Computer Systems Architecture Lecture 12 Instruction Level Parallelism 5 (Improving CPI) Getting CPI under 1: Outline More ILP VLIW branch target buffer return address predictor superscalar more

More information

EECS 570 Final Exam - SOLUTIONS Winter 2015

EECS 570 Final Exam - SOLUTIONS Winter 2015 EECS 570 Final Exam - SOLUTIONS Winter 2015 Name: unique name: Sign the honor code: I have neither given nor received aid on this exam nor observed anyone else doing so. Scores: # Points 1 / 21 2 / 32

More information

CPU issues address (and data for write) Memory returns data (or acknowledgment for write)

CPU issues address (and data for write) Memory returns data (or acknowledgment for write) The Main Memory Unit CPU and memory unit interface Address Data Control CPU Memory CPU issues address (and data for write) Memory returns data (or acknowledgment for write) Memories: Design Objectives

More information

6x86 PROCESSOR Superscalar, Superpipelined, Sixth-generation, x86 Compatible CPU

6x86 PROCESSOR Superscalar, Superpipelined, Sixth-generation, x86 Compatible CPU 1-6x86 PROCESSOR Superscalar, Superpipelined, Sixth-generation, x86 Compatible CPU Product Overview Introduction 1. ARCHITECTURE OVERVIEW The Cyrix 6x86 CPU is a leader in the sixth generation of high

More information

CS / ECE 6810 Midterm Exam - Oct 21st 2008

CS / ECE 6810 Midterm Exam - Oct 21st 2008 Name and ID: CS / ECE 6810 Midterm Exam - Oct 21st 2008 Notes: This is an open notes and open book exam. If necessary, make reasonable assumptions and clearly state them. The only clarifications you may

More information

EC 513 Computer Architecture

EC 513 Computer Architecture EC 513 Computer Architecture Cache Coherence - Directory Cache Coherence Prof. Michel A. Kinsy Shared Memory Multiprocessor Processor Cores Local Memories Memory Bus P 1 Snoopy Cache Physical Memory P

More information

Assuming ideal conditions (perfect pipelining and no hazards), how much time would it take to execute the same program in: b) A 5-stage pipeline?

Assuming ideal conditions (perfect pipelining and no hazards), how much time would it take to execute the same program in: b) A 5-stage pipeline? 1. Imagine we have a non-pipelined processor running at 1MHz and want to run a program with 1000 instructions. a) How much time would it take to execute the program? 1 instruction per cycle. 1MHz clock

More information

Processor: Superscalars Dynamic Scheduling

Processor: Superscalars Dynamic Scheduling Processor: Superscalars Dynamic Scheduling Z. Jerry Shi Assistant Professor of Computer Science and Engineering University of Connecticut * Slides adapted from Blumrich&Gschwind/ELE475 03, Peh/ELE475 (Princeton),

More information

15-740/ Computer Architecture Lecture 5: Precise Exceptions. Prof. Onur Mutlu Carnegie Mellon University

15-740/ Computer Architecture Lecture 5: Precise Exceptions. Prof. Onur Mutlu Carnegie Mellon University 15-740/18-740 Computer Architecture Lecture 5: Precise Exceptions Prof. Onur Mutlu Carnegie Mellon University Last Time Performance Metrics Amdahl s Law Single-cycle, multi-cycle machines Pipelining Stalls

More information

Beyond ILP II: SMT and variants. 1 Simultaneous MT: D. Tullsen, S. Eggers, and H. Levy

Beyond ILP II: SMT and variants. 1 Simultaneous MT: D. Tullsen, S. Eggers, and H. Levy EE482: Advanced Computer Organization Lecture #13 Processor Architecture Stanford University Handout Date??? Beyond ILP II: SMT and variants Lecture #13: Wednesday, 10 May 2000 Lecturer: Anamaya Sullery

More information

Lecture 11 Cache. Peng Liu.

Lecture 11 Cache. Peng Liu. Lecture 11 Cache Peng Liu liupeng@zju.edu.cn 1 Associative Cache Example 2 Associative Cache Example 3 Associativity Example Compare 4-block caches Direct mapped, 2-way set associative, fully associative

More information

MULTIPROCESSORS AND THREAD-LEVEL PARALLELISM (PART 1)

MULTIPROCESSORS AND THREAD-LEVEL PARALLELISM (PART 1) 1 MULTIPROCESSORS AND THREAD-LEVEL PARALLELISM (PART 1) Chapter 5 Appendix F Appendix I OUTLINE Introduction (5.1) Multiprocessor Architecture Challenges in Parallel Processing Centralized Shared Memory

More information

CS433 Midterm. Prof Josep Torrellas. October 19, Time: 1 hour + 15 minutes

CS433 Midterm. Prof Josep Torrellas. October 19, Time: 1 hour + 15 minutes CS433 Midterm Prof Josep Torrellas October 19, 2017 Time: 1 hour + 15 minutes Name: Instructions: 1. This is a closed-book, closed-notes examination. 2. The Exam has 4 Questions. Please budget your time.

More information

CS 433 Homework 4. Assigned on 10/17/2017 Due in class on 11/7/ Please write your name and NetID clearly on the first page.

CS 433 Homework 4. Assigned on 10/17/2017 Due in class on 11/7/ Please write your name and NetID clearly on the first page. CS 433 Homework 4 Assigned on 10/17/2017 Due in class on 11/7/2017 Instructions: 1. Please write your name and NetID clearly on the first page. 2. Refer to the course fact sheet for policies on collaboration.

More information

CS152 Computer Architecture and Engineering March 13, 2008 Out of Order Execution and Branch Prediction Assigned March 13 Problem Set #4 Due March 25

CS152 Computer Architecture and Engineering March 13, 2008 Out of Order Execution and Branch Prediction Assigned March 13 Problem Set #4 Due March 25 CS152 Computer Architecture and Engineering March 13, 2008 Out of Order Execution and Branch Prediction Assigned March 13 Problem Set #4 Due March 25 http://inst.eecs.berkeley.edu/~cs152/sp08 The problem

More information

Website for Students VTU NOTES QUESTION PAPERS NEWS RESULTS

Website for Students VTU NOTES QUESTION PAPERS NEWS RESULTS Advanced Computer Architecture- 06CS81 Hardware Based Speculation Tomasulu algorithm and Reorder Buffer Tomasulu idea: 1. Have reservation stations where register renaming is possible 2. Results are directly

More information

Multiprocessors and Thread-Level Parallelism. Department of Electrical & Electronics Engineering, Amrita School of Engineering

Multiprocessors and Thread-Level Parallelism. Department of Electrical & Electronics Engineering, Amrita School of Engineering Multiprocessors and Thread-Level Parallelism Multithreading Increasing performance by ILP has the great advantage that it is reasonable transparent to the programmer, ILP can be quite limited or hard to

More information

Final Exam Fall 2007

Final Exam Fall 2007 ICS 233 - Computer Architecture & Assembly Language Final Exam Fall 2007 Wednesday, January 23, 2007 7:30 am 10:00 am Computer Engineering Department College of Computer Sciences & Engineering King Fahd

More information