ECE 587/687 Final Exam Solution

Vernon Evans
5 years ago

Time allowed: 80 minutes
Total Points: 60
Points Scored:
Name:

Problem No. 1 (15 points)

Consider a computer system with a two-level cache hierarchy, consisting of split L1 instruction and data caches and a unified L2 cache. The cache parameters and statistics for a program are as follows:

(i) L1 data cache hit ratio is 95%
(ii) L1 instruction cache hit ratio is 98%
(iii) L2 cache miss rate is 10 misses per 1000 instructions
(iv) L2 cache hit latency is 10 cycles
(v) Average L2 miss latency is 100 cycles
(vi) Fetching instructions for the program requires one L1 cache access per instruction
(vii) 40% of the instructions in the program are load/store instructions. Each data read (or write) caused by a load (or store) instruction requires one access to the L1 data cache
(viii) The program's CPI, assuming a perfect L2 cache with no misses, is 1.0
(ix) The L2 cache is a blocking cache that processes only one miss at a time, and blocks all other accesses until the current miss returns from memory

Based on the above statistics, answer the following questions:

(a) (3 points) Calculate the actual CPI of the program.

Actual CPI = CPI(Perfect L2) + (L2 miss rate)(average L2 miss latency) = 1.0 + (10/1000)(100) = 2.0

(b) (6 points) Calculate the hit ratio of the L2 cache.

L2 miss rate = L2 miss ratio * L2 accesses per instruction
L2 miss ratio = L2 miss rate / L2 accesses per instruction
L2 miss rate = 10 misses per 1000 instructions = 0.01 misses per instruction

There are 2 sources of L2 accesses: (i) L1 instruction cache misses, (ii) L1 data cache misses.

L1 i-cache miss rate = L1 i-cache access rate * L1 i-cache miss ratio = 1 access/instruction * (1 - 98%) = 0.02 misses/instruction
L1 d-cache miss rate = L1 d-cache access rate * L1 d-cache miss ratio = 0.4 accesses/instruction * (1 - 95%) = 0.02 misses/instruction

Therefore: L2 access rate = 0.02 + 0.02 = 0.04 accesses/instruction.
Hence, L2 miss ratio = 0.01 / 0.04 = 25%, and L2 hit ratio = 1 - 25% = 75%
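The arithmetic in parts (a) and (b) can be checked with a short script. This is only a sketch: the constant and function names are chosen here for illustration, and the values come straight from the problem statement.

```python
# Parameters from the problem statement (names are illustrative).
L1D_HIT_RATIO = 0.95          # (i)
L1I_HIT_RATIO = 0.98          # (ii)
L2_MISSES_PER_INSTR = 10 / 1000   # (iii) 10 misses per 1000 instructions
L2_MISS_LATENCY = 100         # (v) cycles
LOAD_STORE_FRACTION = 0.40    # (vii)
CPI_PERFECT_L2 = 1.0          # (viii)


def actual_cpi():
    """Part (a): perfect-L2 CPI plus stall cycles per instruction from L2 misses."""
    return CPI_PERFECT_L2 + L2_MISSES_PER_INSTR * L2_MISS_LATENCY


def l2_accesses_per_instr():
    """Part (b): every L1 miss (instruction or data) becomes one L2 access."""
    icache_misses = 1.0 * (1 - L1I_HIT_RATIO)                  # one fetch per instruction
    dcache_misses = LOAD_STORE_FRACTION * (1 - L1D_HIT_RATIO)  # 0.4 data accesses per instruction
    return icache_misses + dcache_misses


def l2_hit_ratio():
    """Part (b): hit ratio = 1 - (misses per instruction / accesses per instruction)."""
    return 1 - L2_MISSES_PER_INSTR / l2_accesses_per_instr()


print(f"Actual CPI    = {actual_cpi():.2f}")
print(f"L2 accesses   = {l2_accesses_per_instr():.2f} per instruction")
print(f"L2 hit ratio  = {l2_hit_ratio():.0%}")
```

Note that the blocking behaviour in (ix) is what justifies charging the full 100-cycle latency for every miss; with overlapping misses the simple product in `actual_cpi` would overestimate the stall time.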

(c) (6 points) Assume that we added compression to the L2 cache. This change led to reducing the L2 miss rate to 4 misses per 1000 instructions. However, it introduced 5 cycles of additional latency (due to decompression) on every L2 cache access. Does this design change help or hurt performance, and by how much?

New CPI = Old CPI - CPI saved due to fewer misses + CPI increased due to higher access latency

Old CPI = 2.0 from part (a)
Old L2 miss rate = 10 MPKI, New L2 miss rate = 4 MPKI, Miss latency = 100 cycles
Therefore, CPI saved due to fewer misses = (10 - 4) / 1000 * 100 = 0.6

L2 access rate = 0.04 accesses per instruction from part (b)
Added L2 latency due to decompression = 5 cycles
Therefore, CPI increased due to higher access latency = 0.04 * 5 = 0.2

Therefore new CPI = 2.0 - 0.6 + 0.2 = 1.6

Since new CPI < old CPI, this design change helps performance.
Speedup = 2.0 / 1.6 = 1.25, i.e., a 25% improvement

Problem No. 2 (8 points)

A computer architect in the processor research team for a microprocessor company has proposed adding a prefetcher to the L2 cache. He implements the prefetcher in a simulator and reports the simulator results to his manager. The following questions relate to the conversation between the researcher and his manager:

(a) (3 points) The researcher says: "My results show that the prefetcher has an accuracy of 50%." Upon hearing that, the manager responds: "That is great. This means that by adding this prefetcher, we can eliminate half of the L2 misses." Is the manager correct? Why or why not?

No. The manager is wrong. Accuracy = # of prefetch hits / # of prefetches. The accuracy metric captures how many prefetches were successful. It does not measure how many of the original misses were avoided due to prefetching. The manager is confusing coverage with accuracy.

(b) (5 points) The researcher then reports: "I found that despite getting rid of many misses, the prefetcher was not able to obtain any improvements in processor performance." Provide two possible explanations for this result.

(i) Useless prefetch requests could have slowed down demand misses, resulting in higher average access latency.
(ii) Harmful prefetches could have introduced additional cache misses.
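The distinction between accuracy and coverage in Problem 2(a) can be illustrated with a small sketch. The counts below are hypothetical, picked only to show that a 50% accurate prefetcher need not remove 50% of the misses; the function names are likewise illustrative.

```python
def prefetch_accuracy(useful_prefetches, total_prefetches):
    """Fraction of issued prefetches that were later hit by a demand access."""
    return useful_prefetches / total_prefetches


def prefetch_coverage(useful_prefetches, baseline_misses):
    """Fraction of the original (no-prefetcher) misses that prefetching removed."""
    return useful_prefetches / baseline_misses


# Hypothetical simulator counts (not from the exam).
baseline_misses = 1000    # L2 misses without the prefetcher
total_prefetches = 200    # prefetches issued
useful_prefetches = 100   # prefetches that turned a demand miss into a hit

print(f"accuracy = {prefetch_accuracy(useful_prefetches, total_prefetches):.0%}")
print(f"coverage = {prefetch_coverage(useful_prefetches, baseline_misses):.0%}")
```

With these made-up counts the accuracy is 50% while only 10% of the original misses are eliminated, which is the gap the manager overlooks.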

Problem No. 3 (13 points)

A program is run on a Multiscalar architecture with 4 processing units (cores). The Multiscalar sequencer divides up the program into four tasks of length 1200, 1800, 500, and 2000 instructions, respectively. Each task can commit 2 instructions per cycle on average. All inter-task branches are predicted correctly, and all inter-task dependences are resolved in time and don't cause any additional latency.

(a) (4 points) Calculate the execution time of this program in terms of number of cycles.

Execution time = Time taken by longest task = 2000 instructions / 2 instructions per cycle = 1000 cycles

(b) (4 points) Calculate the overall IPC of this program.

Total instructions executed = 1200 + 1800 + 500 + 2000 = 5500
Total cycles = 1000
Overall IPC = 5500 / 1000 = 5.5

(c) (5 points) Assume that one of the cores is unavailable and therefore the program is run on only 3 cores. Will that change the execution time of the program? If yes, what is the new execution time in terms of number of cycles?

Yes. In this case, the first 3 tasks will initially be executed on the 3 available cores. After the first task is done (at 1200 / 2 = 600 cycles), its core will become available and will subsequently be used by the 4th task. Therefore:

Execution time = (1200 + 2000) instructions / 2 instructions per cycle = 1600 cycles

Problem No. 4 (24 points)

(Note: For the following problems, please provide concise answers. Unnecessarily long answers will waste more time than they are worth.)

(a) (3 points) Pentium-M designers established a criterion that a performance optimization would be considered only if it incurs a power overhead of at most 3% for every 1% performance increase. Why did they use the 3% power overhead limit? Why not 2%, 4% or any other number?

Scaling the processor supply voltage and frequency already provides a 1% performance increase for 3% additional power. An optimization targeted towards performance must be able to beat that in order to be meaningful.
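For Problem 3, the timing in parts (a) through (c) can be reproduced with a few lines of Python. This is a simplified sketch, not a Multiscalar simulator: it assumes tasks are handed to cores in ring order (task i goes to core i mod N), that a core starts its next task as soon as its previous one finishes, and that there is no other overhead. The function name and parameters are illustrative.

```python
def execution_cycles(task_lengths, num_cores, commit_rate=2):
    """Cycles until the last task finishes under ring-order task assignment.

    Assumption: task i runs on core (i % num_cores) and each core runs its
    tasks back to back at `commit_rate` instructions per cycle.
    """
    core_busy_until = [0.0] * num_cores
    for i, length in enumerate(task_lengths):
        core = i % num_cores
        core_busy_until[core] += length / commit_rate
    return max(core_busy_until)


tasks = [1200, 1800, 500, 2000]

cycles_4 = execution_cycles(tasks, num_cores=4)
cycles_3 = execution_cycles(tasks, num_cores=3)

print(f"4 cores: {cycles_4:.0f} cycles, IPC = {sum(tasks) / cycles_4:.2f}")
print(f"3 cores: {cycles_3:.0f} cycles, IPC = {sum(tasks) / cycles_3:.2f}")
```

Under these assumptions the 4-core run is bounded by the 2000-instruction task, while on 3 cores the fourth task queues behind the first one on the same core, which gives the (1200 + 2000) / 2 term in part (c).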

(b) (3 points) State ONE advantage and ONE disadvantage of SMT:dual issue as compared to SMT:single issue.

Advantage: higher performance (throughput)
Disadvantage: higher design complexity

(c) (2 points) State TWO potential benefits of runahead execution.

1. Loads and stores independent of L2-miss instructions generate useful prefetch requests
2. Prefetch tables are trained using future memory access information, enabling even more potentially useful prefetch activity

(d) (3 points) Consider a load instruction in the runahead mode, which accesses the runahead cache and the store buffer in parallel. Assume that there is a hit in the runahead cache. Also, assume that the load address matches one of the preceding stores in the store buffer. Should the load consume the data read from the runahead cache? Why or why not?

No. The store buffer contains a more recent store that writes to the same address the load needs to read from, so the value in the runahead cache is stale.

(e) (2 points) State ONE advantage and ONE disadvantage of PCM compared to DRAM.

Advantage: Non-volatility
Disadvantage: Limited endurance

(f) (3 points) In the Alloy cache paper, the miss latency of the DRAM cache was assumed to be 2x of the hit latency. If the miss latency were to become 8x of hit latency, do you expect Alloy cache's performance advantage over LH-cache to increase, decrease, or stay the same? Why?

Alloy cache's performance advantage over LH-cache would decrease. Since Alloy cache has a higher miss rate than LH-cache, a longer miss latency would affect it disproportionately compared to LH-cache.

(g) (2 points) Which of the following techniques is the most effective in reducing compulsory misses: (i) Stream buffer, (ii) Victim cache, (iii) Compression?

Stream buffer

(h) (3 points) Can the destination physical register used by an instruction be freed after the instruction has been committed? Why or why not?

No. Even if the producer instruction has committed, there could be other consumer instructions that are stalled (due to some other dependence) and haven't had the opportunity to read their source register. If the physical register is freed, then another instruction may start using it and overwrite it with a new value before one of the consumer instructions has read the original contents of the physical register.

(i) (3 points) Consider a design upgrade in which the simple 1-bit branch predictor used in a processor is being replaced by a more sophisticated PAp predictor with 10 bits of branch history. How does that change impact the effectiveness of each of the following processor features: (i) Trace cache, (ii) Runahead execution, (iii) Wakeup logic? For each feature, simply state whether the upgrade would make it MORE effective, LESS effective, or the SAME.

1. Trace cache would become MORE effective
2. Runahead execution would become MORE effective
3. Wakeup logic will remain the SAME
