Topics. Digital Systems Architecture EECE EECE Need More Cache?


Digital Systems Architecture EECE 33-0 EECE 9-0
Need More Cache?
Dr. William H. Robinson
March, 00

"Cache: a safe place for hiding or storing things." (Webster's New World Dictionary of the American Language)

Topics
Administrative stuff
  Homework assignment #3 posted
  Reading assignment summary due Friday
Caches
  Role in the memory hierarchy
  Improving cache performance

Who Cares About the Memory Hierarchy?
Processor only thus far in course: CPU cost/performance, ISA, pipelined execution.
CPU-DRAM gap: µproc performance grows 60%/yr ("Moore's Law") while DRAM performance grows 7%/yr ("Less' Law?"), so the processor-memory performance gap grows about 50% per year.
1980: no cache in µproc; 1995: 2-level cache on chip (1989: first Intel µproc with a cache on chip).

What is a Cache?
Small, fast storage used to improve average access time to slow memory.
Exploits spatial and temporal locality.
In computer architecture, almost everything is a cache!
  Registers: a cache on variables
  First-level cache: a cache on second-level cache
  Second-level cache: a cache on memory
  Memory: a cache on disk (virtual memory)
[Figure: memory hierarchy from Proc/Regs through L1-Cache, L2-Cache, and Memory to Disk, Tape, etc.; levels get bigger going down and faster going up.]

Adapted from David Patterson's CS 252 lecture notes. Copyright 00 UCB.

The Principle of Locality
The Principle of Locality: programs access a relatively small portion of the address space at any instant of time.
Two different types of locality:
  Temporal locality (locality in time): if an item is referenced, it will tend to be referenced again soon (e.g., loops, reuse).
  Spatial locality (locality in space): if an item is referenced, items whose addresses are close by tend to be referenced soon (e.g., straight-line code, array access).
Last 5 years, H/W relied on locality for speed.

Cache Performance (Figure 5.9 page )
Miss-oriented approach to memory access:
\[ CPUtime = IC \times \left( CPI_{Execution} + \frac{MemAccess}{Inst} \times MissRate \times MissPenalty \right) \times CycleTime \]
Separating out the memory component entirely (AMAT = Average Memory Access Time):
\[ CPUtime = IC \times \left( CPI_{AluOps} + \frac{MemAccess}{Inst} \times AMAT \right) \times CycleTime \]
\[ AMAT = HitTime + MissRate \times MissPenalty \]
\[ AMAT = \left( HitTime_{Inst} + MissRate_{Inst} \times MissPenalty_{Inst} \right) + \left( HitTime_{Data} + MissRate_{Data} \times MissPenalty_{Data} \right) \]

Example: Harvard Architecture
Unified vs. separate I&D (Harvard)
[Figure: Proc, Unified Cache-1, Unified Cache-2 (unified organization) versus Proc, I-Cache-1 and D-Cache-1, Unified Cache-2 (Harvard organization).]
Statistics (given in H&P):
  16KB I&D: inst miss rate = 0.64%, data miss rate = 6.47%
  32KB unified: aggregate miss rate = 1.99%
Which is better (ignore L2 cache)?
  Assume 33% data ops, so 75% of accesses are from instructions (1.0/1.33).
  Hit time = 1, miss time = 50.
  Note that a data hit has 1 stall for the unified cache (only one port).
AMAT Harvard = 75% x (1 + 0.64% x 50) + 25% x (1 + 6.47% x 50) = 2.05
AMAT Unified = 75% x (1 + 1.99% x 50) + 25% x (1 + 1 + 1.99% x 50) = 2.24

Four Questions for Memory Hierarchy Designers
Q1: Where can a block be placed in the upper level? (Block placement)
  Fully associative, set associative, direct mapped
Q2: How is a block found if it is in the upper level? (Block identification)
  Tag/block
Q3: Which block should be replaced on a miss? (Block replacement)
  Random, LRU
Q4: What happens on a write? (Write strategy)
  Write back or write through (with write buffer)
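The Harvard versus unified comparison above can be checked with a few lines of Python. This is only a sketch: the miss rates (0.64%, 6.47%, 1.99%), the hit time of 1 cycle, the 50-cycle miss penalty, and the 75%/25% access split are taken to be the values of the H&P example the slide quotes, and the function names are made up for illustration.

```python
def amat(hit_time, miss_rate, miss_penalty):
    """Average memory access time in cycles: hit time + miss rate * miss penalty."""
    return hit_time + miss_rate * miss_penalty


# Assumed inputs for the example (see the lead-in paragraph).
INST_FRACTION = 0.75   # share of memory accesses that are instruction fetches
DATA_FRACTION = 0.25   # share of memory accesses that are loads/stores
HIT_TIME = 1           # cycles
MISS_PENALTY = 50      # cycles


def amat_harvard(inst_miss=0.0064, data_miss=0.0647):
    """Split I and D caches: each access type sees its own miss rate."""
    return (INST_FRACTION * amat(HIT_TIME, inst_miss, MISS_PENALTY)
            + DATA_FRACTION * amat(HIT_TIME, data_miss, MISS_PENALTY))


def amat_unified(miss=0.0199):
    """Unified cache: one miss rate, but data accesses pay one extra stall
    cycle because the single port is busy with the instruction fetch."""
    return (INST_FRACTION * amat(HIT_TIME, miss, MISS_PENALTY)
            + DATA_FRACTION * amat(HIT_TIME + 1, miss, MISS_PENALTY))


if __name__ == "__main__":
    print(f"AMAT Harvard: {amat_harvard():.3f} cycles")
    print(f"AMAT Unified: {amat_unified():.3f} cycles")
```

Under these assumptions the split (Harvard) organization comes out ahead even though the unified cache has the lower aggregate miss rate, because the extra port removes the structural stall on data accesses.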

Types of Caches
Direct-mapped
  Cache line has only 1 possible location
  Simple to implement
Set-associative
  Cache line has a set of possible locations, e.g., n-way set-associative
Fully-associative
  Cache line can be placed anywhere
  Costly in H/W to check entire cache for hit

Block Identification
[Figure: address fields: Block Address (Cache Tag, Cache Index) followed by Block Offset.]
Generally tags are checked in parallel for speed.
Block offset is not needed for the comparison.
Checking the index is redundant (the index has already selected the set).
Also need a valid bit.

Replacement Policy
Random
  Spreads allocation uniformly
  Implemented with a pseudorandom number generator
Least Recently Used (LRU)
  Blocks not accessed for the longest time are replaced
  Complex to implement
First In First Out (FIFO)
  Determines the oldest block
  Approximates LRU

Write Policy: Write-Through vs Write-Back
Write-through: all writes update the cache and the underlying memory/cache
  Can always discard cached data; the most up-to-date data is in memory
  Cache control bit: only a valid bit
Write-back: all writes simply update the cache
  Can't just discard cached data; may have to write it back to memory
  Cache control bits: both valid and dirty bits
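The block placement, block identification, and LRU replacement ideas on this page can be sketched as a tiny cache model. The class name `SetAssociativeCache` and its parameters are hypothetical, chosen for illustration; it assumes the number of sets and the block size are powers of two, and it tracks only tags (no data, no write policy).

```python
from collections import OrderedDict


class SetAssociativeCache:
    """Minimal tag-only model of an n-way set-associative cache with LRU."""

    def __init__(self, num_sets, ways, block_size):
        self.num_sets = num_sets
        self.ways = ways
        self.block_size = block_size
        # One LRU-ordered dict of tags per set; presence of a tag plays
        # the role of the valid bit.
        self.sets = [OrderedDict() for _ in range(num_sets)]

    def split(self, address):
        """Split a byte address into (tag, index, block offset)."""
        offset = address % self.block_size
        block_address = address // self.block_size
        index = block_address % self.num_sets
        tag = block_address // self.num_sets
        return tag, index, offset

    def access(self, address):
        """Return True on a hit, False on a miss (the block is then filled)."""
        tag, index, _ = self.split(address)
        lines = self.sets[index]      # the index selects the set
        if tag in lines:              # only tags are compared
            lines.move_to_end(tag)    # mark as most recently used
            return True
        if len(lines) >= self.ways:
            lines.popitem(last=False)  # evict the least recently used tag
        lines[tag] = True
        return False


if __name__ == "__main__":
    cache = SetAssociativeCache(num_sets=4, ways=2, block_size=16)
    for addr in (0, 64, 0, 128, 0, 64):
        print(hex(addr), "hit" if cache.access(addr) else "miss")
```

Setting `ways=1` gives a direct-mapped cache and `num_sets=1` gives a fully associative one, so the same sketch covers all three placement schemes.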

Write Policy: Write-Through vs Write-Back
Other advantages:
Write-through:
  Memory (or other processors) always has the latest data
  Simpler management of the cache
Write-back:
  Much lower bandwidth, since data is often overwritten multiple times
  Better tolerance to long-latency memory?

Write Policy 2: Write Allocate vs No-Write Allocate
What happens on a write miss:
Write allocate: allocate a new cache line in the cache
  Usually means that you have to do a read miss to fill in the rest of the cache line!
No-write allocate (or "write-around"):
  Simply send the write data through to the underlying memory/cache; don't allocate a new cache line!

Improving Cache Performance
\[ CPUtime = IC \times \left( CPI_{Execution} + \frac{\text{Memory accesses}}{\text{Instruction}} \times \text{Miss rate} \times \text{Miss penalty} \right) \times \text{Clock cycle time} \]
1. Reduce the miss rate,
2. Reduce the miss penalty, or
3. Reduce the time to hit in the cache.

Reducing Misses
Classifying Misses: 3 Cs
Compulsory: The first access to a block is not in the cache, so the block must be brought into the cache. Also called cold start misses or first reference misses. (Misses in even an infinite cache)
Capacity: If the cache cannot contain all the blocks needed during execution of a program, capacity misses will occur due to blocks being discarded and later retrieved. (Misses in a fully associative cache of size X)
Conflict: If the block-placement strategy is set associative or direct mapped, conflict misses (in addition to compulsory and capacity misses) will occur because a block can be discarded and later retrieved if too many blocks map to its set. Also called collision misses or interference misses. (Misses in an N-way associative cache of size X)
More recent, 4th "C": Coherence. Misses caused by cache coherence.
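The 3 Cs definitions above suggest a direct way to classify misses in a trace: compare the real cache against a fully associative cache of the same size and against an infinite cache. The following is a rough sketch under stated assumptions: both caches use LRU replacement, the trace is a list of block addresses (not byte addresses), and the helper names `lru_access` and `classify_misses` are invented for this example.

```python
from collections import OrderedDict


def lru_access(lines, key, capacity):
    """Touch key in an LRU-ordered dict of at most `capacity` entries.
    Return True on a hit, False on a miss (the key is then inserted)."""
    if key in lines:
        lines.move_to_end(key)
        return True
    if len(lines) >= capacity:
        lines.popitem(last=False)  # evict least recently used
    lines[key] = True
    return False


def classify_misses(block_addresses, num_blocks, ways):
    """Count hits and compulsory/capacity/conflict misses for a cache of
    `num_blocks` blocks organized as `ways`-way set associative."""
    num_sets = num_blocks // ways
    sets = [OrderedDict() for _ in range(num_sets)]  # the cache under study
    fully = OrderedDict()                            # same size, fully associative
    seen = set()                                     # stands in for an infinite cache
    counts = {"hit": 0, "compulsory": 0, "capacity": 0, "conflict": 0}
    for block in block_addresses:
        hit = lru_access(sets[block % num_sets], block, ways)
        fully_hit = lru_access(fully, block, num_blocks)
        first_reference = block not in seen
        seen.add(block)
        if hit:
            counts["hit"] += 1
        elif first_reference:
            counts["compulsory"] += 1   # misses even in an infinite cache
        elif not fully_hit:
            counts["capacity"] += 1     # misses even when fully associative
        else:
            counts["conflict"] += 1     # misses only because of placement
    return counts


if __name__ == "__main__":
    # Blocks 0 and 4 collide in a 4-block direct-mapped cache.
    print(classify_misses([0, 4, 0, 4], num_blocks=4, ways=1))
```

The two colliding blocks in the usage example would both fit in a fully associative cache of the same size, so the repeated misses are counted as conflict misses rather than capacity misses.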

3Cs Absolute Miss Rate (SPEC92)
[Figure: absolute miss rate for 1-way, 2-way, 4-way, and 8-way associativity, broken down into compulsory (vanishingly small), capacity, and conflict misses.]

2:1 Cache Rule
Miss rate of a 1-way associative cache of size X = miss rate of a 2-way associative cache of size X/2.

3Cs Relative Miss Rate
[Figure: relative miss rate (0% to 100%) for 1-way, 2-way, 4-way, and 8-way associativity, broken down by miss type.]
Flaws: for fixed block size.
Good: insight => invention.

How To Reduce Misses?
3 Cs: Compulsory, Capacity, Conflict
In all cases, assume total cache size is not changed. What happens if we:
1) Change block size: which of the 3Cs is obviously affected?
2) Change associativity: which of the 3Cs is obviously affected?
3) Change compiler: which of the 3Cs is obviously affected?

1. Reduce Misses Via Larger Block Size
[Figure: miss rate versus block size (bytes), one curve per cache size.]

2. Reduce Misses Via Larger Caches
[Figure: miss rate versus cache size for 1-way, 2-way, 4-way, and 8-way associativity.]

3. Reduce Misses Via Higher Associativity
2:1 Cache Rule: the miss rate of a direct-mapped (DM) cache of size N is about the same as the miss rate of a 2-way cache of size N/2.
Beware: execution time is the only final measure!
  Will clock cycle time increase?
  Hill [1988] suggested hit time for 2-way vs. 1-way: external cache +10%, internal +2%.

Example: Avg. Memory Access Time vs. Miss Rate
Example: assume CCT = 1.10 for 2-way, 1.12 for 4-way, 1.14 for 8-way vs. CCT of direct mapped.
[Table: A.M.A.T. by cache size (KB) and associativity (1-way, 2-way, 4-way, 8-way); red means A.M.A.T. not improved by more associativity.]
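The warning that execution time is the only final measure can be illustrated by folding the clock-cycle-time (CCT) penalty of associativity into the average memory access time. This is a sketch, not the slide's table: the CCT factors are assumed to be 1.10, 1.12, and 1.14 relative to direct mapped, while the miss rates and the 50-cycle miss penalty are hypothetical values chosen only to show the trade-off.

```python
# Assumed clock-cycle-time factors relative to a direct-mapped (1-way) cache.
CCT_FACTOR = {1: 1.00, 2: 1.10, 4: 1.12, 8: 1.14}


def amat(ways, miss_rate, miss_penalty=50.0, hit_cycles=1.0):
    """AMAT in direct-mapped clock cycles: the hit time is stretched by the
    CCT factor; the miss penalty (hypothetical) is left unscaled."""
    return hit_cycles * CCT_FACTOR[ways] + miss_rate * miss_penalty


def best_associativity(miss_rates, miss_penalty=50.0):
    """Return the associativity with the lowest AMAT for the given
    {ways: miss_rate} mapping."""
    return min(miss_rates, key=lambda ways: amat(ways, miss_rates[ways], miss_penalty))


if __name__ == "__main__":
    # Hypothetical miss rates: associativity helps a small cache a lot,
    # a large cache only marginally.
    small_cache = {1: 0.10, 2: 0.08, 4: 0.075, 8: 0.07}
    large_cache = {1: 0.010, 2: 0.009, 4: 0.0088, 8: 0.0087}
    for name, rates in (("small", small_cache), ("large", large_cache)):
        for ways, rate in rates.items():
            print(f"{name} {ways}-way: AMAT = {amat(ways, rate):.3f}")
        print(f"{name}: best associativity = {best_associativity(rates)}-way")
```

With these made-up numbers the small cache benefits from more associativity, while for the large cache the slower clock outweighs the small drop in miss rate, which is the kind of case the table flags.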

4. Reducing Misses via Pseudo-Associativity
How to combine the fast hit time of direct mapped with the lower conflict misses of a 2-way SA cache?
Divide the cache: on a miss, check the other half of the cache to see if the block is there; if so, have a pseudo-hit (slow hit).
[Figure: timeline showing Hit Time, then Pseudo Hit Time, then Miss Penalty.]
Drawback: CPU pipeline is hard if a hit takes 1 or 2 cycles.
  Better for caches not tied directly to the processor (L2)
  Used in MIPS R10000 L2 cache, similar in UltraSPARC

5. Reducing Misses by Compiler Optimizations
McFarling [1989] reduced cache misses by 75% on an 8KB direct-mapped cache with 4-byte blocks, in software.
Instructions
  Reorder procedures in memory so as to reduce conflict misses
  Profiling to look at conflicts (using tools they developed)
Data
  Merging arrays: improve spatial locality by a single array of compound elements vs. 2 arrays
  Loop interchange: change nesting of loops to access data in the order stored in memory
  Loop fusion: combine 2 independent loops that have the same looping and some variables overlap
  Blocking: improve temporal locality by accessing "blocks" of data repeatedly vs. going down whole columns or rows

Summary
Memory hierarchy goal
  Infinite storage with access time of the processor clock
Caches utilize locality
  Spatial and temporal
Quantitative performance measurements
  Used to compare different implementations
Techniques to reduce miss rate

Things to Do
Reading Assignment #5 due Friday
  Alpha 21264 microprocessor
Continue reading H & P Chapter 5
Homework Assignment #3 posted
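To see why the loop interchange optimization listed among the compiler techniques helps, one can count misses for the two loop orders in a simulated cache. This is a minimal sketch under assumptions not in the slides: a 2KB direct-mapped cache with 32-byte blocks, a 128 x 128 array of 8-byte elements stored in row-major order, and hypothetical helper names `count_misses` and `traversal`.

```python
def count_misses(addresses, num_lines=64, block_size=32):
    """Count misses in a direct-mapped cache of num_lines blocks."""
    resident = [None] * num_lines   # block number held by each cache line
    misses = 0
    for address in addresses:
        block = address // block_size
        index = block % num_lines
        if resident[index] != block:
            resident[index] = block  # fill the line on a miss
            misses += 1
    return misses


def traversal(rows, cols, row_in_outer_loop, elem_size=8):
    """Yield the byte addresses touched when visiting every element of a
    row-major rows x cols array with the given loop nesting."""
    if row_in_outer_loop:
        # for i: for j: x[i][j]   (matches storage order)
        for i in range(rows):
            for j in range(cols):
                yield (i * cols + j) * elem_size
    else:
        # for j: for i: x[i][j]   (strides through memory by a whole row)
        for j in range(cols):
            for i in range(rows):
                yield (i * cols + j) * elem_size


if __name__ == "__main__":
    before = count_misses(traversal(128, 128, row_in_outer_loop=False))
    after = count_misses(traversal(128, 128, row_in_outer_loop=True))
    print(f"column-order loops: {before} misses")
    print(f"interchanged loops: {after} misses")
```

In this model the interchanged loops miss once per 32-byte block (4 elements), whereas the original order evicts each block before its neighbors are used, so every access misses. The exact ratio depends on the assumed cache geometry.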
