Exam-2 Scope

1. Memory Hierarchy Design (cache, virtual memory): Chapter 2 slides, memory-basics.ppt. Topics: optimizations of cache performance, memory technology and optimizations, virtual memory.
2. SIMD, MIMD, vector, multimedia extended ISA, GPU, loop-level parallelism: Chapter 4 slides. You may also refer to chapter3-ilp.ppt starting with slide #114.
3. Shared memory architecture, distributed memory architecture, SMP, distributed shared memory, and directory-based coherence.

Study Guide

Study part 1 at the conceptual level only. For parts 1 and 2, the main focus will be questions like Exercises 2-14 listed below: how does a specific change in the architecture affect specific performance metrics? Exercises 1, 15, 16, and 17 are provided as a reference. The exam will not include questions such as finding the tag or index bits, the cache size, or the number of hits/misses, and will not include a problem that requires the CPU execution time formula. Problem types will be purely analysis and discussion.

For part 3 (cache coherence): there will be questions on both snoopy and directory-based cache protocols. For example: given a coherence protocol, fill in the state transition tables; evaluate the advantages/disadvantages of a protocol; add a new state to overcome a specific limitation; evaluate a scenario.
Exercises:

1. A vector A and a vector B are added together; the result is then written back to vector A. A pseudo-disassembly of the inner loop is shown below.

   #define N 4096
   int A[N], B[N];
   int i;
   for (i = 0; i < N; i++)
       A[i] = A[i] + B[i];

   # ra holds the address of A[i]
   # rb holds the address of B[i]
   LD  r2, 0(rb)
   LD  r1, 0(ra)
   ADD r1, r1, r2
   ST  r1, 0(ra)

   a) Assume A and B are cache-aligned to a 4KB boundary and are contiguous in memory. ints are 32 bits (4 bytes). Also assume that the cache has the following properties: the address size is 32 bits, the index size is 8 bits, and the block offset size is 4 bits. What is the miss rate for the two-way set-associative (LRU) cache running the above code (the percentage of memory accesses that miss completely in the cache and require fetching the data from main memory)?

   b) What is the average memory access time when running the above code on the two-way set-associative cache? Assume the miss penalty is 100 ns. Also assume that the processor's clock speed is limited by the cache access time, which is 1540 ps.

2. What is simultaneous multithreading and why is it useful?

3. What technological forces have caused Intel, AMD, Sun, and others to start putting multiple processors on a chip?

4. Why are vector processors more power-efficient than superscalar processors when executing applications with a lot of data-level parallelism? Explain.

5. For a computer with 64-bit virtual addresses, how large is the page table if only a single-level page table is used? Assume that each page is 4KB, that each page table entry is 8 bytes, and that the processor is byte-addressable.

6. Consider a simultaneous multithreading (SMT) machine with limited hardware resources. Circle the hardware constraints below that can limit the total number of threads the machine can support. For the item(s) that you circle, briefly describe the minimum requirement to support N threads.
   1. Number of Functional Units
   2. Number of Physical Registers
   3. Data Cache Size
   4. Data Cache Associativity
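The access pattern in Exercise 1 can be checked with a short cache simulation. The sketch below is not an official solution: the base address, the modeling of the inner loop as the access sequence LD B[i], LD A[i], ST A[i], and the list-based LRU bookkeeping are all my assumptions, made to match the stated cache geometry (8 index bits, 4 offset bits, 2 ways).

```python
# Hypothetical simulation of Exercise 1's two-way set-associative LRU cache.
# Geometry from the problem: 8 index bits (256 sets), 4 offset bits
# (16-byte blocks), 2 ways.

N = 4096
A_BASE = 0x10000          # assumed 4KB-aligned base address for A
B_BASE = A_BASE + 4 * N   # B is contiguous in memory after A

INDEX_BITS, OFFSET_BITS, WAYS = 8, 4, 2
NUM_SETS = 1 << INDEX_BITS

def simulate():
    # Each set is an ordered list of tags; front = most recently used.
    sets = [[] for _ in range(NUM_SETS)]
    hits = misses = 0
    for i in range(N):
        a, b = A_BASE + 4 * i, B_BASE + 4 * i
        # Per-iteration memory accesses: LD B[i], LD A[i], ST A[i].
        for addr in (b, a, a):
            idx = (addr >> OFFSET_BITS) & (NUM_SETS - 1)
            tag = addr >> (OFFSET_BITS + INDEX_BITS)
            tags = sets[idx]
            if tag in tags:
                hits += 1
                tags.remove(tag)
            else:
                misses += 1
                if len(tags) == WAYS:
                    tags.pop()       # evict least recently used (back of list)
            tags.insert(0, tag)      # mark this tag most recently used
    return hits, misses

hits, misses = simulate()
print(misses / (hits + misses))      # 0.1666... = 1/6 of accesses miss
```

Each 16-byte block holds four ints, so every fourth iteration takes two compulsory misses (one per array) followed by ten hits, and with two ways the A and B blocks coexist in a set. That gives a 1/6 miss rate, from which part (b)'s AMAT follows as 1540 ps + (1/6)(100 ns), roughly 18.2 ns.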
7. Ben Bitdiddle is implementing a directory-based cache-coherence invalidate protocol for a 64-processor system. He first builds a smaller prototype with only 4 processors to test out the directory-based cache coherence protocol described in the practice problems. (A copy of the protocol is provided at the end of this test.) To implement the list of sharers, S, kept by the home site, he maintains a bit vector per cache block to keep track of all the sharers. The bit vector has one bit corresponding to each processor in the system. The bit is set to one if the processor is caching a shared copy of the block, and zero if the processor does not have a copy of the block. For example, if processors 1 and 3 are caching a shared copy of some data, the corresponding bit vector would be 1010, representing processors 3, 2, 1, 0, respectively. The bit vector worked well for the 4-processor prototype, but when building the actual 64-processor system, Ben discovered that he did not have enough hardware resources. Assume each cache block is 32 bytes. What is the overhead of maintaining the sharing bit vector, as a ratio of bit-vector (overhead) bits to data storage bits, for a 4-processor system? What is the overhead for a 64-processor system?

   Overhead for a 4-processor system:
   Overhead for a 64-processor system:

8. Mark whether each of the following modifications to cache parameters will cause each miss category to increase, decrease, or have no effect. You can assume the baseline cache is set-associative. Explain your reasoning. Assume that in each case the other cache parameters (number of sets, number of ways, number of bytes per line) and the rest of the machine design remain the same.

                              compulsory   conflict   capacity
   Number of sets
   Number of ways
   Number of bytes per line
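Exercise 7's overhead ratio is a one-line calculation. The sketch below is my own arithmetic, not an answer key: it assumes one sharer bit per processor per block, compared against the 32-byte (256-bit) data payload of each block.

```python
# Sketch of the sharer-bit-vector overhead from Exercise 7.
BLOCK_BYTES = 32
DATA_BITS = BLOCK_BYTES * 8          # 256 data bits per cache block

def overhead(num_processors):
    # One sharer bit per processor, per block, relative to the data bits.
    return num_processors / DATA_BITS

print(overhead(4))    # 4/256  = 0.015625 (about 1.6%)
print(overhead(64))   # 64/256 = 0.25     (25%)
```

The jump from about 1.6% to 25% is exactly the scaling problem that motivates limited-pointer and coarse-vector directory schemes.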
9. Explain the effect of the number of TLB entries on CPI and TLB capacity.

                                    TLB contribution to CPI    TLB capacity
   Increase number of TLB entries

10. Describe how you expect switching to each of the following architectures will affect instructions/program and cycles/instruction (CPI) relative to a baseline 5-stage, in-order processor. Mark whether each modification will cause instructions/program and CPI to increase, decrease, or have no effect. Explain your reasoning.

   a) How do instructions/program and CPI change when moving from a 5-stage-pipeline in-order processor to a traditional VLIW processor?

   b) How do instructions/program and CPI change when moving from a 5-stage-pipeline in-order processor to a multithreaded processor? Assume that the new processor is still an in-order, 5-stage-pipeline processor, but that it has been modified to switch between two threads every clock cycle (fine-grained multithreading). If a thread is not ready to issue (e.g., because of a cache miss), a bubble is inserted in the pipeline.

11. Design choice: You are the manager of the architecture group at the Acme Corporation. One of your team members proposes a large direct-mapped cache plus a victim cache as a faster and cheaper alternative to higher associativity. Would you agree with this proposal? Justify your answer.

12. Vector processors vs. superscalar vs. VLIW:

   a) How can parallelism (such as in a vector processor) be used to reduce the total energy consumed by a computation? Why doesn't a superscalar processor get this advantage?

   b) When does a vector processor perform better than a VLIW processor? Think about the operations occurring in an application.

13. Instruction set: ISA-extended processors are especially popular for targeting multimedia applications. These extended instructions are meant for programmers to utilize. Even though compiler support is minimal relative to vector processors, why are ISA extensions still popular?

14. Smith and Goodman have shown that for a small instruction cache, a cache using direct mapping can consistently outperform one using full associativity with LRU replacement. Explain why this is possible. (Hint: you can't explain this with the 4 C's model because it ignores the replacement policy.)

15. What is the formula for the average access time of a three-level cache in terms of HL_i, ML_i, and PL_i? (6 pts) HL_i: hit rate, ML_i: miss rate, and PL_i: miss penalty for the i-th level cache, where i = 1, 2, 3.
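Exercise 15 asks you to derive the formula yourself, so treat the sketch below as one plausible formulation rather than the required answer. It assumes the L1 hit time t1 is also known (the problem's variable list does not include hit times) and interprets each PL_i as the extra penalty paid when level i misses.

```python
# One common expansion of a three-level average memory access time:
#   AMAT = t1 + ML1*(PL1 + ML2*(PL2 + ML3*PL3))
# t1 (L1 hit time) and the meaning of PL_i are my assumptions; the exercise
# may intend a different formulation in terms of HL_i as well.
def amat_3level(t1, ml1, pl1, ml2, pl2, ml3, pl3):
    # A level-i miss (probability ML_i, conditional on reaching level i)
    # pays PL_i and then possibly the deeper levels' penalties.
    return t1 + ml1 * (pl1 + ml2 * (pl2 + ml3 * pl3))

# Example: t1=1 cycle, ML1=10%, PL1=10, ML2=50%, PL2=20, ML3=20%, PL3=100
print(amat_3level(1, 0.1, 10, 0.5, 20, 0.2, 100))   # 4.0 cycles
```

Multiplying out gives the equivalent flat form t1 + ML1*PL1 + ML1*ML2*PL2 + ML1*ML2*ML3*PL3, which makes explicit that each deeper penalty is weighted by the product of the miss rates above it.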
16. Assume that we have a 32-bit processor (with 32-bit words) and that this processor is byte-addressed (i.e., addresses specify bytes). Suppose that it has a 512-byte cache that is two-way set-associative, has 4-word cache lines, and uses LRU replacement. Split the 32-bit address into tag, index, and cache-line-offset pieces. Below is a series of memory read references sent to the cache. Assume that the cache is initially empty. Classify each memory reference as a hit or a miss. Identify each miss as either compulsory, conflict, or capacity.

   Tag =        Index =        Cache-line offset =

   Address                                   Hit/Miss   Miss type
   0000-0000-0000-0000-0000-0011-0000-0000   Miss       Compulsory
   0000-0000-0000-0000-0000-0001-1011-1100
   0000-0000-0000-0000-0000-0010-0000-0110
   0000-0000-0000-0000-0000-0001-0000-1001
   0000-0000-0000-0000-0000-0011-0000-1000
   0000-0000-0000-0000-0000-0001-1010-0001
   0000-0000-0000-0000-0000-0001-1011-0001
   0000-0000-0000-0000-0000-0010-1010-1110
   0000-0000-0000-0000-0000-0011-1011-0010
   0000-0000-0000-0000-0000-0001-0000-1100
   0000-0000-0000-0000-0000-0010-0000-0101
   0000-0000-0000-0000-0000-0011-0000-0001
   0000-0000-0000-0000-0000-0011-1010-1110
   0000-0000-0000-0000-0000-0001-1010-1000
   0000-0000-0000-0000-0000-0011-1010-0001
   0000-0000-0000-0000-0000-0001-1011-1010

17. One difference between a write-through cache and a write-back cache can be in the time it takes to write. Let's assume that 50% of the blocks are dirty for the write-back cache. Assume a cache read hit takes 1 clock cycle, the cache miss penalty is 50 clock cycles, and a block write from cache to main memory takes 50 clock cycles. Finally, assume the instruction cache miss rate is 0.5% and the data cache miss rate is 1%. What is the CPI based on the cache behavior, with a two-cycle write, for the gzip benchmark? Note that during the first cycle we detect whether a hit will occur, and during the second (assuming a hit) we actually write the data.
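For Exercise 16, the address split follows from the geometry; the derivation below is mine, so check it against your own: 512 bytes / 16-byte lines = 32 lines, and 32 lines / 2 ways = 16 sets, giving a 4-bit offset, a 4-bit index, and a 24-bit tag.

```python
# Sketch of the Exercise 16 address split (my derivation, worth verifying):
# 512-byte cache, 2 ways, 4-word (16-byte) lines -> 16 sets.
OFFSET_BITS = 4                            # 16 bytes per line
INDEX_BITS = 4                             # 16 sets
TAG_BITS = 32 - INDEX_BITS - OFFSET_BITS   # 24 tag bits

def split(addr):
    offset = addr & (1 << OFFSET_BITS) - 1
    index = (addr >> OFFSET_BITS) & (1 << INDEX_BITS) - 1
    tag = addr >> (OFFSET_BITS + INDEX_BITS)
    return tag, index, offset

# First reference in the table: ...0011-0000-0000 = 0x300.
print(split(0x300))   # (3, 0, 0): tag 3, set 0, offset 0
```

Applying split() to each address in the table tells you which references share a set, which is what you need to separate conflict misses from the others.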