Principles in Computer Architecture I
CSE 240A (Section 631684)
Homework Three
November 18, 2008

Only Problem Set Two will be graded; turn it in before December 4, 2008, 11:00 am.

1 Problem Set One

Hennessy & Patterson (4th Ed.) 5.2
Hennessy & Patterson (4th Ed.) 5.4
Hennessy & Patterson (4th Ed.) 5.5
Hennessy & Patterson (4th Ed.) 5.6
Hennessy & Patterson (4th Ed.) 5.8
Hennessy & Patterson (4th Ed.) 5.18

A cache may use a write buffer to reduce write latency and a victim cache to hold recently evicted (non-dirty) blocks. Would there be any advantages to combining the two into a single piece of hardware? Would there be any disadvantages? (A minimal sketch of a victim-cache probe is given at the end of this problem set for reference.)

As caches increase in size, blocks often increase in size as well.

1. If a large instruction cache has larger blocks, is there still a need for prefetching? Explain the interaction between prefetching and increased block size in instruction caches.

2. Is there a need for data prefetch instructions when data blocks get larger? Explain.
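For reference on the victim-cache question above, here is a minimal sketch of how a victim cache is typically probed on an L1 miss. The entry count and field names are illustrative assumptions, not part of the exercise.

/* Minimal illustrative sketch of a victim-cache probe on an L1 miss.
 * Entry count and field names are assumptions for illustration only. */
#include <stdbool.h>
#include <stdint.h>

#define VICTIM_ENTRIES 4

struct victim_entry {
    bool     valid;
    uint32_t block_addr;   /* address of a recently evicted (clean) block */
};

static struct victim_entry victim[VICTIM_ENTRIES];

/* On an L1 miss, probe the small fully associative victim cache before
 * going to the next level of the memory hierarchy. */
bool victim_hit(uint32_t block_addr)
{
    for (int i = 0; i < VICTIM_ENTRIES; i++)
        if (victim[i].valid && victim[i].block_addr == block_addr)
            return true;   /* block can be swapped back into L1 */
    return false;
}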

2 Problem Set Two

1 (Simultaneous Multithreading) (30 points)

One critical issue in computer architecture is the limited instruction-level parallelism available in a single thread, which motivates multithreading: multiple threads share the functional units of a single processor in an overlapping fashion. Recall that to permit this sharing, the processor must add hardware to preserve the independent state of each thread.

(Part A)

There is a raging debate within a company that is trying to adopt SMT for a new processor it is designing. Being newcomers to the market, their business strategy is to undercut the competition on price (with, of course, no impact on performance!). The engineers are in heated discussions trying to figure out the feasibility and desirability of adopting shared structures, and those in the shared-structure camp are arguing further over whether sharing is the only correct design, whether sharing is performance-neutral, or whether sharing will impose a stiff performance degradation on the system. Since this is a low-cost provider, no additional functionality or hardware is envisioned in the case of the shared structure. Here are the four possibilities:

A. The shared structure will not work, period.
B. The shared structure not only will work but is the only way to go; anything else will result in an incorrectly functioning processor.
C. The shared structure will work, and it makes no difference performance-wise whether the structure is shared or not.
D. The shared structure will work, but performance will degrade stiffly.

For each of the following five structures, decide which of the four options above holds and provide brief reasoning (an illustrative sketch of one common per-thread versus shared split appears at the end of this problem):

Instruction fetch queue:
Register renaming table:
L1 cache:
Load/store queue:
Reorder buffer:

(Part B)

One of the limiting factors in an SMT is the bandwidth available for getting instructions into the fetch queue. While giving preference to the preferred thread(s) reduces the latency of those threads, it may hurt the processor's ability to fully utilize its processing units. Why? If your preferred thread(s) were straight-line DSP code with little branch behavior and highly analyzable memory accesses, which fetch-queue loading policy would you suggest? For a preferred application with heavy use of conditional branches and pointer chasing, which fetch-queue loading policy would you suggest?
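As background for Part A, the sketch below shows one common SMT design point, in which some front-end state is replicated per thread while back-end structures are shared and tagged with a thread ID. The field names, sizes, and the particular split shown are assumptions for illustration only, not the answer expected for the five structures above.

/* Illustrative only: one common way an SMT core splits state between
 * per-thread copies and shared, thread-tagged structures.  Names and
 * sizes here are assumptions, not taken from the assignment. */
#define NTHREADS 2

struct per_thread_state {
    unsigned pc;                 /* each thread keeps its own program counter */
    unsigned rename_map[32];     /* per-thread architectural-to-physical map */
};

struct shared_backend {
    /* shared physical registers, caches, and functional units would live
     * here; shared queues carry a thread tag on each entry */
    unsigned char rob_thread_id[128];   /* owning thread of each ROB entry */
};

struct smt_core {
    struct per_thread_state thread[NTHREADS];
    struct shared_backend   shared;
};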

2 (Way prediction) (40 points)

With way prediction, only one way of an associative cache needs to be accessed if the way of the next cache access is predicted correctly. A misprediction, however, requires checking the other ways for a match in the next clock cycle.

(Part A)

One limitation of way prediction is that the way predictor itself must be accessed before initiating the cache access. Assume accessing one way of the cache takes T_one ns, while accessing all the ways in parallel takes T_all ns. Furthermore, assume the way predictor itself takes T_pred ns to access. What must the prediction accuracy of the way predictor be for it to improve the average hit time of the cache?

(Part B)

Another potential benefit of way prediction is reduced energy consumption, since only one way needs to be accessed when the prediction is correct. For an n-way set-associative cache, assume the energy consumed in accessing each way of the cache is E, while the energy consumed in accessing the way predictor is P. What must the prediction accuracy of the way predictor be for it to reduce the overall energy consumed by cache accesses?
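One way to begin Parts A and B is to write the average cost with and without the predictor as a function of the prediction accuracy p. The expressions below are only a starting sketch under two assumptions that you should check against your own reading of the problem: the predictor lookup is serialized before the data access, and a misprediction pays for a full parallel probe (or, for energy, for the remaining n - 1 ways) on top of the first probe.

    T_avg = T_pred + T_one + (1 - p) * T_all          (with way prediction)
    T_avg = T_all                                      (without way prediction)

    E_avg = P + E + (1 - p) * (n - 1) * E              (with way prediction)
    E_avg = n * E                                      (without way prediction)

The predictor helps only when the predicted-access expression is smaller than the corresponding baseline.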

The following loop, which performs 4 load/store accesses in each iteration, is to be executed, by itself, on a processor whose L1 cache is two-way set-associative with 1 word per block and a 6-bit index. Furthermore, assume the starting addresses (in bytes) of arrays A[ ] and B[ ] are 0 and 4100, respectively, and that each element of A[ ] and B[ ] is 4 bytes.

for i = 0 to 1000 by step 1
    A[i] = i * B[i+1];
    B[i] = i + B[i+3];
end for

(Part C)

For each set of the cache, an engineer has proposed using a 2-bit saturating counter, similar to the one used in a local branch predictor, to generate the way prediction. The counter is incremented or decremented according to which way is accessed within that set. Comment on the prediction accuracy of this counter for the loop presented above.

(Part D)

Since assigning a 2-bit counter to every set of the cache introduces considerable hardware overhead, another engineer has proposed keeping a record of the global history of the ways used by the most recent cache accesses and using it for way prediction. To achieve maximal prediction accuracy, what is the minimum number of bits of global history that must be recorded? Compared with the predictor used in (Part C), does this predictor reduce the hardware overhead? Justify your answer.
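As an aid for Parts C and D, the short program below is a sketch that prints which cache set each of the four references in an iteration maps to, in the order the loop issues them. The base addresses and cache geometry are taken from the problem statement above (6-bit index, so 64 sets of 4-byte blocks); everything else is just scaffolding for illustration.

/* Sketch: map each reference of the loop body to its cache set.
 * Geometry and base addresses come from the problem statement. */
#include <stdio.h>

#define WORD_BYTES 4
#define NUM_SETS   64          /* 6-bit index */
#define A_BASE     0           /* byte address of A[0] */
#define B_BASE     4100        /* byte address of B[0] */

static unsigned set_index(unsigned byte_addr)
{
    return (byte_addr / WORD_BYTES) % NUM_SETS;
}

int main(void)
{
    /* The first few iterations are enough to see the access pattern. */
    for (int i = 0; i <= 5; i++) {
        printf("i=%d  B[i+1]->set %2u  A[i]->set %2u  B[i+3]->set %2u  B[i]->set %2u\n",
               i,
               set_index(B_BASE + (i + 1) * WORD_BYTES),
               set_index(A_BASE + i * WORD_BYTES),
               set_index(B_BASE + (i + 3) * WORD_BYTES),
               set_index(B_BASE + i * WORD_BYTES));
    }
    return 0;
}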

3 (Compiler-based prefetching) (30 points)

In this question you will evaluate software prefetching for the following control flow graph (CFG), which consists of 7 basic blocks. All the memory references in the code fragment are shown in the CFG, along with the taken/not-taken probabilities of both branches, obtained by static profiling.

[CFG figure: a loop of basic blocks B1 through B7. B1 increments i and loads B[i]; a branch with 80%/20% profiled probabilities leads to B2 and B3; a second branch with 80%/20% probabilities leads to B4 (Load A[i]) and B5 (Load B[i+1]); B6 stores B[i] and B7 stores C[i].]

Assume the program executes on a processor with the following memory configuration: the L1 cache is direct-mapped with 1 word per block, a 6-bit index, and a 1-cycle hit time. On a cache miss caused by a load instruction, memory must be accessed, which takes an additional 20 cycles. A cache miss caused by a store instruction, on the other hand, does not stall the processor.

Because of the significant miss penalty, you are considering adding an explicit prefetch instruction to reduce the miss rate. If the word to be prefetched is already in the cache, the prefetch instruction behaves as a no-op.

For the following parts of this question, assume the starting addresses (in bytes) of arrays A[ ], B[ ], and C[ ] are 260, 0, and 400, respectively. All the data stored in A[ ], B[ ], and C[ ] are 32-bit integers. The loop body is executed 30 times.
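For concreteness, here is what an explicit software prefetch looks like in C when compiled with GCC or Clang. This is only an illustration of the mechanism described above (a non-binding hint that has no architectural effect if the block is already cached), not the code of the assignment's loop; the function name is made up for the example.

/* Illustration only: a non-binding software prefetch of A[i]'s block.
 * __builtin_prefetch is a GCC/Clang intrinsic; like the prefetch
 * instruction described in the problem, it effectively acts as a no-op
 * when the requested word is already in the cache. */
void prefetch_a(const int *A, int i)
{
    __builtin_prefetch(&A[i]);
}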

(Part A)

Since basic block B4 is on the most frequently executed trace, you want to add an explicit prefetch instruction for the load of A[i] in B4. Because the prefetch must be inserted at least 20 cycles ahead of the load, the only possibility is to insert it in B1 (prefetching across loop iterations is not considered). However, inserting a prefetch instruction can sometimes degrade performance. For this particular program, analyze whether the inserted prefetch instruction causes a performance degradation. How much speedup (or slowdown) is achieved if the prefetch instruction is inserted? Your answer should take into account the T/NT probabilities of the branches.

(Part B)

Since inserting a prefetch instruction for A[i] may degrade performance, another optimization is suggested to reduce the execution time of this loop: instead of inserting a prefetch instruction in B1 for the load of A[i] in B4, insert a prefetch instruction in B1 for the load of B[i+1] in B5. Is this a better idea than the one in (Part A)? Compared with the original code, how much speedup (or slowdown) is achieved if this prefetch instruction is inserted? Your answer should take into account the T/NT probabilities of the branches.