
Exam-2 Scope

1. Memory hierarchy design (cache, virtual memory). Chapter 2 slides (memory-basics.ppt): optimizations of cache performance, memory technology and optimizations, virtual memory.
2. SIMD, MIMD, vector, multimedia-extended ISA, GPU, loop-level parallelism. Chapter 4 slides; you may also refer to chapter3-ilp.ppt, starting with slide #114.
3. Shared memory architecture, distributed memory architecture, SMP, distributed shared memory, and directory-based coherence.

Study Guide

Study part 1 at a conceptual level only. For parts 1 and 2, the main focus will be questions like 2-14 listed below: how does a specific change in the architecture affect specific performance metrics? Exercises 1, 15, 16, and 17 are provided as a reference. The exam will not include questions such as finding the tag or index bits, the cache size, or the number of hits/misses, and it will not include a problem that requires the CPU execution time formula. Problem types will be purely analysis and discussion.

For part 3, cache coherence: there will be questions on both snoopy and directory-based cache protocols. For example: given a coherence protocol, fill in the state transition tables; evaluate the advantages/disadvantages of a protocol; add a new state to overcome a specific limitation; evaluate a scenario; etc.

Exercises:

1. A vector A and a vector B are added together, and the result is then written back to vector A. A pseudo-disassembly of the inner loop is shown after the C code. (A small simulation sketch for part (a) follows exercise 6.)

    #define N 4096
    int A[N], B[N];
    int i;
    for (i = 0; i < N; i++)
        A[i] = A[i] + B[i];

    # ra holds the addr of A[i]
    # rb holds the addr of B[i]
    LD  r2, 0(rb)
    LD  r1, 0(ra)
    ADD r1, r1, r2
    ST  r1, 0(ra)

a) Assume A and B are cache-aligned to a 4KB boundary and contiguous in memory, and that ints are 32 bits (4 bytes). Also assume that the cache has the following properties: the address size is 32 bits, the index size is 8 bits, and the block offset size is 4 bits. What is the miss rate for the two-way set-associative cache (using LRU) running the above code, i.e., the percentage of memory accesses that miss completely in the cache and require fetching the data from main memory?

b) What is the average memory access time when running the above code on the two-way set-associative cache? Assume the miss penalty is 100 ns, and that the processor's clock speed is limited by the cache access time, which is 1540 ps.

2. What is simultaneous multithreading and why is it useful?

3. What technological forces have caused Intel, AMD, Sun, and others to start putting multiple processors on a chip?

4. Why are vector processors more power-efficient than superscalar processors when executing applications with a lot of data-level parallelism? Explain.

5. For a computer with 64-bit virtual addresses, how large is the page table if only a single-level page table is used? Assume that each page is 4KB, that each page table entry is 8 bytes, and that the processor is byte-addressable.

6. Consider a Simultaneous Multithreading (SMT) machine with limited hardware resources. Circle the hardware constraints below that can limit the total number of threads the machine can support. For the item(s) you circle, briefly describe the minimum requirement to support N threads.
   1. Number of functional units
   2. Number of physical registers
   3. Data cache size
   4. Data cache associativity
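For exercise 1(a), the miss rate can be sanity-checked with a small trace-driven simulation. The sketch below models the stated cache (8 index bits, 4-bit block offset, two ways, LRU) and replays the three memory accesses of each loop iteration. Placing A at address 0 with B immediately after it is an assumption consistent with "4KB-aligned and contiguous", not something the exercise fixes.

    /* Trace-driven sketch of exercise 1(a): 2-way set-associative LRU cache,
     * 256 sets (8 index bits), 16-byte blocks (4 offset bits). */
    #include <stdio.h>
    #include <stdint.h>

    #define N     4096
    #define SETS  256
    #define BLOCK 16

    static uint32_t tags[SETS][2];   /* stored tags */
    static int      valid[SETS][2];  /* valid bits */
    static int      lru[SETS];       /* way to evict next */
    static long     misses, accesses;

    static void access_addr(uint32_t addr)
    {
        uint32_t set = (addr / BLOCK) % SETS;
        uint32_t tag = addr / BLOCK / SETS;
        accesses++;
        for (int w = 0; w < 2; w++)
            if (valid[set][w] && tags[set][w] == tag) {
                lru[set] = 1 - w;    /* hit: the other way is now LRU */
                return;
            }
        misses++;                    /* miss: fill the LRU way */
        int w = lru[set];
        valid[set][w] = 1;
        tags[set][w]  = tag;
        lru[set]      = 1 - w;
    }

    int main(void)
    {
        uint32_t A = 0, B = A + N * 4;   /* assumed placement */
        for (int i = 0; i < N; i++) {
            access_addr(B + 4 * i);      /* LD r2,0(rb) */
            access_addr(A + 4 * i);      /* LD r1,0(ra) */
            access_addr(A + 4 * i);      /* ST r1,0(ra) */
        }
        printf("miss rate = %.2f%%\n", 100.0 * misses / accesses);
        return 0;
    }

Note that under this placement A[i] and B[i] index the same set in every iteration (B starts an exact multiple of the cache size after A), so the second way is precisely what keeps the loop from thrashing.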

7. Ben Bitdiddle is implementing a directory-based cache coherence invalidate protocol for a 64-processor system. He first builds a smaller prototype with only 4 processors to test the directory-based cache coherence protocol described in the practice problems. (A copy of the protocol is provided at the end of this test.) To implement the list of sharers, S, kept by the home site, he maintains a bit vector per cache block to keep track of all the sharers. The bit vector has one bit for each processor in the system: the bit is set to one if that processor is caching a shared copy of the block, and zero if it is not. For example, if processors 1 and 3 are caching a shared copy of some data, the corresponding bit vector would be 1010 (bits for processors 3, 2, 1, 0, respectively). The bit vector worked well for the 4-processor prototype, but when building the actual 64-processor system, Ben discovered that he did not have enough hardware resources. Assume each cache block is 32 bytes. What is the overhead of maintaining the sharing bit vector for a 4-processor system, as a ratio of bit vector (overhead) bits to data storage bits? What is the overhead for a 64-processor system? (A sketch of the bit-vector bookkeeping follows exercise 8.)

   Overhead for a 4-processor system:
   Overhead for a 64-processor system:

8. Mark whether each of the following modifications to the cache parameters will cause each miss category to increase or decrease, or whether the modification will have no effect. You may assume the baseline cache is set-associative. Explain your reasoning. Assume that in each case the other cache parameters (number of sets, number of ways, number of bytes per line) and the rest of the machine design remain the same.

                                            compulsory   conflict   capacity
   Increase the number of sets
   Increase the number of ways
   Increase the number of bytes per line
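For exercise 7, the directory state Ben describes is just a per-block bit vector indexed by processor number. The C sketch below (names are hypothetical) shows the bookkeeping and reproduces the 1010 example from the text; each entry costs P vector bits per 32-byte (256-bit) block, which is the ratio the question asks you to evaluate for P = 4 and P = 64.

    /* Sharer bit-vector bookkeeping for one directory entry; a uint64_t
     * covers up to 64 processors. A sketch of the data structure only,
     * not a full coherence protocol. */
    #include <stdio.h>
    #include <stdint.h>

    typedef struct {
        uint64_t sharers;  /* bit p set => processor p holds a shared copy */
    } dir_entry_t;

    static void add_sharer(dir_entry_t *e, int p)  { e->sharers |=  (1ULL << p); }
    static void drop_sharer(dir_entry_t *e, int p) { e->sharers &= ~(1ULL << p); }

    int main(void)
    {
        dir_entry_t e = { 0 };
        add_sharer(&e, 1);
        add_sharer(&e, 2);
        add_sharer(&e, 3);
        drop_sharer(&e, 2);  /* processor 2 gives up its copy */
        /* Prints 0xa (binary 1010): processors 1 and 3 share, as in the text. */
        printf("sharers = 0x%llx\n", (unsigned long long)e.sharers);
        return 0;
    }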

9. Explain the effect of the number of TLB entries on the TLB's contribution to CPI and on TLB capacity.

                                          TLB contribution to the CPI   TLB capacity
   Increase the number of TLB entries

10. Describe how you expect switching to each of the following architectures to affect instructions/program and cycles/instruction (CPI) relative to a baseline 5-stage, in-order processor. Mark whether each modification will cause instructions/program and CPI to increase or decrease, or whether the change will have no effect. Explain your reasoning.

a) How do instructions/program and CPI change when moving from a 5-stage-pipeline in-order processor to a traditional VLIW processor?

b) How do instructions/program and CPI change when moving from a 5-stage-pipeline in-order processor to a multithreaded processor? Assume that the new processor is still an in-order, 5-stage-pipeline processor, but that it has been modified to switch between two threads every clock cycle (fine-grained multithreading). If a thread is not ready to issue (e.g., on a cache miss), a bubble is inserted in the pipeline.

11. Design choice: You are the manager of the architecture group at the Acme Corporation. One of your team members proposes a large direct-mapped cache plus a victim cache as a faster and cheaper alternative to higher associativity. Would you agree with this proposal? Justify your answer.

12. Vector processors vs. superscalar vs. VLIW:
a) How can parallelism (such as in a vector processor) be used to reduce the total energy consumed by a computation? Why doesn't a superscalar processor get this advantage?
b) When does a vector processor perform better than a VLIW processor? Think about the operations occurring in an application.

13. Instruction set: ISA-extended processors are especially popular for targeting multimedia applications, and the extended instructions are meant for programmers to use directly. Even though compiler support is minimal relative to vector processors, why are ISA extensions still popular?

14. Smith and Goodman have shown that, for a small instruction cache, a cache using direct mapping can consistently outperform one using full associativity with LRU replacement. Explain why this is possible. (Hint: you can't explain this with the 4 C's model, because it ignores the replacement policy.)

15. What is the formula for the average access time of a three-level cache in terms of HL, ML, and PL? (6 pts) HLi: hit rate, MLi: miss rate, and PLi: miss penalty for the i-th level cache, where i is 1, 2, and 3.

16. Assume that we have a 32-bit processor (with 32-bit words) and that this processor is byte-addressed (i.e., addresses specify bytes). Suppose that it has a 512-byte, two-way set-associative cache with 4-word cache lines and LRU replacement. Split the 32-bit address into tag, index, and cache-line offset pieces. Below is a series of memory read references sent to the cache. Assume that the cache is initially empty. Classify each memory reference as a hit or a miss, and identify each miss as compulsory, conflict, or capacity. (A sketch of the address split follows exercise 17.)

   Tag =         Index =         Cache-line offset =

   Address                                    Hit/Miss   Miss type
   0000-0000-0000-0000-0000-0011-0000-0000   Miss       Compulsory
   0000-0000-0000-0000-0000-0001-1011-1100
   0000-0000-0000-0000-0000-0010-0000-0110
   0000-0000-0000-0000-0000-0001-0000-1001
   0000-0000-0000-0000-0000-0011-0000-1000
   0000-0000-0000-0000-0000-0001-1010-0001
   0000-0000-0000-0000-0000-0001-1011-0001
   0000-0000-0000-0000-0000-0010-1010-1110
   0000-0000-0000-0000-0000-0011-1011-0010
   0000-0000-0000-0000-0000-0001-0000-1100
   0000-0000-0000-0000-0000-0010-0000-0101
   0000-0000-0000-0000-0000-0011-0000-0001
   0000-0000-0000-0000-0000-0011-1010-1110
   0000-0000-0000-0000-0000-0001-1010-1000
   0000-0000-0000-0000-0000-0011-1010-0001
   0000-0000-0000-0000-0000-0001-1011-1010

17. One difference between a write-through cache and a write-back cache is the time it takes to write. Assume that 50% of the blocks are dirty in the write-back cache, that a cache read hit takes 1 clock cycle, that the cache miss penalty is 50 clock cycles, and that a block write from cache to main memory takes 50 clock cycles. Finally, assume that the instruction cache miss rate is 0.5% and the data cache miss rate is 1%. What is the CPI based on the cache behavior, with a two-cycle write, for the gzip benchmark? Note that during the first cycle we detect whether a hit will occur, and during the second (assuming a hit) we actually write the data.
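For exercise 16, the field widths follow from the stated geometry: 4-word (16-byte) lines give a 4-bit offset, and 512 bytes / (2 ways x 16 bytes per line) = 16 sets give a 4-bit index, leaving 24 tag bits. The C sketch below decomposes the first reference in the table under those assumptions; treat it as a worked starting point, not the graded answer format.

    /* Sketch for exercise 16: tag/index/offset split for a 512-byte,
     * 2-way set-associative cache with 16-byte lines. */
    #include <stdio.h>
    #include <stdint.h>

    #define CACHE_BYTES 512
    #define WAYS        2
    #define LINE_BYTES  16   /* 4 words x 4 bytes */

    int main(void)
    {
        int sets        = CACHE_BYTES / (WAYS * LINE_BYTES); /* 16 */
        int offset_bits = 4;                                 /* log2(LINE_BYTES) */
        int index_bits  = 4;                                 /* log2(sets) */
        int tag_bits    = 32 - index_bits - offset_bits;     /* 24 */

        uint32_t addr   = 0x00000300;  /* first reference in the table */
        uint32_t offset = addr & (LINE_BYTES - 1);
        uint32_t index  = (addr >> offset_bits) & (sets - 1);
        uint32_t tag    = addr >> (offset_bits + index_bits);

        printf("tag=%d bits, index=%d bits, offset=%d bits\n",
               tag_bits, index_bits, offset_bits);
        printf("addr 0x%08x -> tag 0x%x, set %u, offset %u\n",
               addr, tag, index, offset);
        return 0;
    }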

16. Assume that we have a 32-bit processor (with 32-bit words) and that this processor is byte-addressed (i.e. addresses specify bytes). Suppose that it has a 512-byte cache that is two-way set-associative, has 4-word cache lines, and uses LRU replacement. Split the 32-bit address into tag, index, and cache-line offset pieces. Below is a series of memory read references set to the cache. Assume that the cache is initially empty. Classify each memory references as a hit or a miss. Identify each miss as either compulsory, conflict, or capacity. Tag= Index= Cache-line offset= Address Hit/Miss Miss type 0000-0000-0000-0000-0000-0011-0000-0000 Miss Compulsory 0000-0000-0000-0000-0000-0001-1011-1100 0000-0000-0000-0000-0000-0010-0000-0110 0000-0000-0000-0000-0000-0001-0000-1001 0000-0000-0000-0000-0000-0011-0000-1000 0000-0000-0000-0000-0000-0001-1010-0001 0000-0000-0000-0000-0000-0001-1011-0001 0000-0000-0000-0000-0000-0010-1010-1110 0000-0000-0000-0000-0000-0011-1011-0010 0000-0000-0000-0000-0000-0001-0000-1100 0000-0000-0000-0000-0000-0010-0000-0101 0000-0000-0000-0000-0000-0011-0000-0001 0000-0000-0000-0000-0000-0011-1010-1110 0000-0000-0000-0000-0000-0001-1010-1000 0000-0000-0000-0000-0000-0011-1010-0001 0000-0000-0000-0000-0000-0001-1011-1010 17. One difference between a write through cache and a write back cache can be in the time it takes to write. Let s assume that 50% of the blocks are dirty for a write back cache. Assume a cache read hit takes 1 clock cycle, the cache miss penalty is 50 clock cycles, and a block write from cache to main memory takes 50 clock cycles. Finally assume the instruction cache miss rate is 0.5% and the data cache miss rate is 1%. What is the CPI based on the cache behavior with a two cycle write for the gzip benchmark? Note that during the first cycle, we detect whether a hit will occur, and during the second (assuming a hit) we actually write the data.