NOW Handout Page 1. Review: Who Cares About the Memory Hierarchy? EECS 252 Graduate Computer Architecture. Lec 12 - Caches

Size: px

Start display at page:

Download "NOW Handout Page 1. Review: Who Cares About the Memory Hierarchy? EECS 252 Graduate Computer Architecture. Lec 12 - Caches"

Randell Nash
6 years ago
Views:

1 NOW Handout Page EECS 5 Graduate Computer Architecture Lec - Caches David Culler Electrical Engineering and Computer Sciences University of California, Berkeley http// http//www-inst.eecs.berkeley.edu/~cs5 Review Who Cares About the Memory Hierarchy? Processor Only Thus Far in Course CPU cost/performance, ISA, Pipelined Execution CPU-DRAM Gap Moore s Law Performance Less Law? µproc 6%/yr. DRAM 7%/yr. /8/ CS5-S5 L Caches CPU Processor-Memory Performance Gap (grows 5% / year) 98 no cache in µproc; 995 -level cache on chip (989 first Intel µproc with a cache on chip) DRAM Review What is a cache? Small, fast storage used to improve average access time to slow memory. Exploits spacial and temporal locality In computer architecture, almost everything is a cache! Registers a cache on variables First-level cache a cache on second-level cache Second-level cache a cache on memory Memory a cache on disk (virtual memory) TLB a cache on page table Branch-prediction a cache on prediction information? Bigger Proc/Regs L-Cache L-Cache Memory Faster Disk, Tape, etc. /8/ CS5-S5 L Caches 3 Review Terminology Hit data appears in some block in the upper level (example Block X) Hit Rate the fraction of memory access found in the upper level Hit Time Time to access the upper level which consists of RAM access time + Time to determine hit/miss Miss data needs to be retrieve from a block in the lower level (Block Y) Miss Rate = - (Hit Rate) Miss Penalty Time to replace a block in the upper level + Time to deliver the block the processor Hit Time << Miss Penalty (5 instructions on 6!) To Processor Upper Level Memory Lower Level Memory Blk X From Processor Blk Y /8/ CS5-S5 L Caches Why it works Exploit the statistical properties of programs Locality of reference Temporal Spatial P(access,t) Block Placement Q Where can a block be placed in the upper level? Fully Associative, Set Associative, Direct Mapped Average Memory Access Time AMAT = HitTime + MissRate MissPenalty = ( HitTimeInst + MissRateInst MissPenaltyInst ) + ( HitTimeData + MissRateData MissPenaltyData ) Simple hardware structure that observes program behavior and reacts to improve future performance Is the cache visible in the ISA? address /8/ CS5-S5 L Caches 5 /8/ CS5-S5 L Caches 6

2 NOW Handout Page KB Direct Mapped Cache, 3B blocks For a ** N byte cache The uppermost (3 - N) bits are always the The lowest M bits are the Byte Select (Block Size = ** M) 3 Valid Bit x5 Example x5 Stored as part of the cache state Cache Index Ex x Byte 3 Byte 63 Byte 3 Byte Byte Byte 33 Byte 3 /8/ CS5-S5 L Caches 7 9 Byte Select Ex x 3 Byte 99 3 Review Set Associative Cache N-way set associative N entries for each Cache Index N direct mapped caches operates in parallel How big is the tag? Example Two-way set associative cache Cache Index selects a set from the cache The two tags in the set are compared to the input in parallel Data is selected based on the tag result Valid Cache Block Adr Tag Compare Sel Cache Index Mux Cache Block Sel OR /8/ CS5-S5 L Cache Caches Block Hit 8 Compare Valid Q How is a block found if it is in the upper level? Index identifies set of possibilities Tag on each block No need to check index or block offset Increasing associativity shrinks index, expands tag Block Address Block Tag Index Offset Cache size = Associativity * index_size * offest_size Q3 Which block should be replaced on a miss? Easy for Direct Mapped Set Associative or Fully Associative Random LRU (Least Recently Used) Assoc -way -way 8-way Size LRU Ran LRU Ran LRU Ran 6 KB 5.% 5.7%.7% 5.3%.% 5.% 6 KB.9%.%.5%.7%.%.5% 56 KB.5%.7%.3%.3%.%.% /8/ CS5-S5 L Caches 9 /8/ CS5-S5 L Caches Q What happens on a write? Write through The information is written to both the block in the cache and to the block in the lowerlevel memory. Write back The information is written only to the block in the cache. The modified cache block is written to main memory only when it is replaced. is block clean or dirty? Pros and Cons of each? WT read misses cannot result in writes WB no repeated writes to same location WT always combined with write buffers so that don t wait for lower level memory What about on a miss? Write_no_allocate vs write_allocate /8/ CS5-S5 L Caches Write Buffer for Write Through Processor Cache Write Buffer DRAM A Write Buffer is needed between the Cache and Memory Processor writes data into the cache and the write buffer Memory controller write contents of the buffer to memory Write buffer is just a FIFO Typical number of entries Works fine if Store frequency (w.r.t. time) << / DRAM write cycle /8/ CS5-S5 L Caches

3 NOW Handout Page 3 Review Cache performance Miss-oriented Approach to Memory Access MemAccess CPUtime = IC CPI Execution + MissRate MissPenalty CycleTime Inst Separating out Memory component entirely AMAT = Average Memory Access Time MemAccess CPUtime = IC CPI AluOps + AMAT CycleTime Inst Effective CPI = CPI ideal_mem + P mem * AMAT /8/ CS5-S5 L Caches 3 Impact on Performance Suppose a processor executes at Clock Rate = MHz (5 ns per cycle), Ideal (no misses) CPI =. 5% arith/logic, 3% ld/st, % control Suppose that % of memory operations get 5 cycle miss penalty Suppose that % of instructions get same miss penalty CPI = ideal CPI + average stalls per instruction.(cycles/ins) + [.3 (DataMops/ins) x. (miss/datamop) x 5 (cycle/miss)] + [ (InstMop/ins) x. (miss/instmop) x 5 (cycle/miss)] = ( ) cycle/ins = 3. 58% of the time the proc is stalled waiting for memory! AMAT=(/.3)x[+.x5]+(.3/.3)x[+.x5]=.5 /8/ CS5-S5 L Caches Example Harvard Architecture Unified vs Separate I&D (Harvard) Proc Unified Cache- Unified Cache- I-Cache- Proc D-Cache- Unified Cache- Statistics (given in H&P) 6KB I&D Inst miss rate=.6%, Data miss rate=6.7% 3KB unified Aggregate miss rate=.99% Which is better (ignore L cache)? Assume 33% data ops 75% accesses from instructions (./.33) hit time=, miss time=5 Note that data hit has stall for unified cache (only one port) AMAT Harvard =75%x(+.6%x5)+5%x(+6.7%x5) =.5 AMAT Unified =75%x(+.99%x5)+5%x(++.99%x5)=. /8/ CS5-S5 L Caches 5 The Cache Design Space Several interacting dimensions cache size block size associativity replacement policy write-through vs write-back The optimal choice is a compromise depends on access characteristics» workload Bad» use (I-cache, D-cache, TLB) depends on technology / cost Good Simplicity often wins Cache Size Factor A Less Associativity Block Size Factor B More /8/ CS5-S5 L Caches 6 Review Improving Cache Performance Memory accesses CPUtime = IC CPI Execution + Miss rate Miss penalty Clock cycle time Instruction. Reduce the miss rate,. Reduce the miss penalty, or 3. Reduce the time to hit in the cache. /8/ CS5-S5 L Caches 7 Reducing Misses Classifying Misses 3 Cs Compulsory The first access to a block is not in the cache, so the block must be brought into the cache. Also called cold start misses or first reference misses. (Misses in even an Infinite Cache) Capacity If the cache cannot contain all the blocks needed during execution of a program, capacity misses will occur due to blocks being discarded and later retrieved. (Misses in Fully Associative Size X Cache) Conflict If block-placement strategy is set associative or direct mapped, conflict misses (in addition to compulsory & capacity misses) will occur because a block can be discarded and later retrieved if too many blocks map to its set. Also called collision misses or interference misses. (Misses in N-way Associative, Size X Cache) More recent, th C Coherence - Misses caused by cache coherence. /8/ CS5-S5 L Caches 8

4 NOW Handout Page 3Cs Absolute Miss Rate (SPEC9) way -way -way 8-way Conflict Capacity Cache Rule miss rate -way associative cache size X ~= miss rate -way associative cache size X/ way -way -way 8-way Conflict Capacity Compulsory vanishingly small Cache Size (KB) Compulsory /8/ CS5-S5 L Caches Cache Size (KB) Compulsory /8/ CS5-S5 L Caches 3Cs Relative Miss Rate How Can Reduce Misses? % 8% 6% % % -way -way -way 8-way Capacity Conflict 3 Cs Compulsory, Capacity, Conflict In all cases, assume total cache size not changed What happens if ) Change Block Size Which of 3Cs is obviously affected? ) Change Associativity Which of 3Cs is obviously affected? % Caveat fixed block size Compulsory Cache Size (KB) /8/ CS5-S5 L Caches 3) Change Algorithm / Compiler Which of 3Cs is obviously affected? /8/ CS5-S5 L Caches. Reduce Misses via Larger Block Size Miss Rate 5% % 5% % 5% K K 6K 6K 56K. Reduce Misses via Higher Associativity Cache Rule Miss Rate DM cache size N ~ Miss Rate -way cache size N/ Beware Execution time is only final measure! Will Clock Cycle time increase? Hill [988] suggested hit time for -way vs. -way external cache +%, internal + % % Block Size (bytes) 56 /8/ CS5-S5 L Caches 3 /8/ CS5-S5 L Caches

5 NOW Handout Page 5 Example Avg. Memory Access Time vs. Miss Rate 3. Reducing Misses via a Victim Cache assume CCT =. for -way,. for -way,. for 8-way vs. CCT direct mapped Cache Size Associativity (KB) -way -way -way 8-way (Red means A.M.A.T. not improved by more associativity) /8/ CS5-S5 L Caches 5 How to combine fast hit time of direct mapped yet still avoid conflict misses? Add buffer to place data discarded from cache Jouppi [99] -entry victim cache removed % to 95% of conflicts for a KB direct mapped data cache Used in Alpha, HP machines TAGS DATA Tag and Comparator One Cache line of Data Tag and Comparator One Cache line of Data Tag and Comparator One Cache line of Data Tag and Comparator One Cache line of Data To Next Lower Level In Hierarchy /8/ CS5-S5 L Caches 6. Reducing Misses via Pseudo-Associativity How to combine fast hit time of Direct Mapped and have the lower conflict misses of -way SA cache? Divide cache on a miss, check other half of cache to see if there, if so have a pseudo-hit (slow hit) Hit Time Pseudo Hit Time Miss Penalty Time Drawback CPU pipeline is hard if hit takes or cycles Better for caches not tied directly to processor (L) Used in MIPS R L cache, similar in UltraSPARC 5. Reducing Misses by Hardware Prefetching of Instructions & Data E.g., Instruction Prefetching Alpha 6 fetches blocks on a miss Extra block placed in stream buffer On miss check stream buffer Works with data blocks too Jouppi [99] data stream buffer got 5% misses from KB cache; streams got 3% Palacharla & Kessler [99] for scientific programs for 8 streams got 5% to 7% of misses from 6KB, -way set associative caches Prefetching relies on having extra memory bandwidth that can be used without penalty /8/ CS5-S5 L Caches 7 /8/ CS5-S5 L Caches 8 6. Reducing Misses by Software Prefetching Data Data Prefetch Load data into register (HP PA-RISC loads) Cache Prefetch load into cache (MIPS IV, PowerPC, SPARC v. 9) Special prefetching instructions cannot cause faults; a form of speculative execution Issuing Prefetch Instructions takes time Is cost of prefetch issues < savings in reduced misses? Higher superscalar reduces difficulty of issue bandwidth 7. Reducing Misses by Compiler Optimizations McFarling [989] reduced caches misses by 75% on 8KB direct mapped cache, byte blocks in software Instructions Reorder procedures in memory so as to reduce conflict misses Profiling to look at conflicts(using tools they developed) Data Merging Arrays improve spatial locality by single array of compound elements vs. arrays Loop Interchange change nesting of loops to access data in order stored in memory Loop Fusion Combine independent loops that have same looping and some variables overlap Blocking Improve temporal locality by accessing blocks of data repeatedly vs. going down whole columns or rows /8/ CS5-S5 L Caches 9 /8/ CS5-S5 L Caches 3

6 NOW Handout Page 6 Merging Arrays Example Loop Interchange Example /* Before sequential arrays */ int val[size]; int key[size]; /* After array of stuctures */ struct merge { int val; int key; }; struct merge merged_array[size]; /* Before */ for (k = ; k < ; k = k+) for (j = ; j < ; j = j+) for (i = ; i < 5; i = i+) x[i][j] = * x[i][j]; /* After */ for (k = ; k < ; k = k+) for (i = ; i < 5; i = i+) for (j = ; j < ; j = j+) x[i][j] = * x[i][j]; Reducing conflicts between val & key; improve spatial locality /8/ CS5-S5 L Caches 3 Sequential accesses instead of striding through memory every words; improved spatial locality /8/ CS5-S5 L Caches 3 Loop Fusion Example /* Before */ for (i = ; i < N; i = i+) for (j = ; j < N; j = j+) a[i][j] = /b[i][j] * c[i][j]; for (i = ; i < N; i = i+) for (j = ; j < N; j = j+) d[i][j] = a[i][j] + c[i][j]; /* After */ for (i = ; i < N; i = i+) for (j = ; j < N; j = j+) { a[i][j] = /b[i][j] * c[i][j]; d[i][j] = a[i][j] + c[i][j];} misses per access to a & c vs. one miss per access; improve spatial locality /8/ CS5-S5 L Caches 33 Blocking Example /* Before */ for (i = ; i < N; i = i+) for (j = ; j < N; j = j+) {r = ; for (k = ; k < N; k = k+){ r = r + y[i][k]*z[k][j];}; x[i][j] = r; }; Two Inner Loops Read all NxN elements of z[] Read N elements of row of y[] repeatedly Write N elements of row of x[] Capacity Misses a function of N & Cache Size N 3 + N => (assuming no conflict; otherwise ) Idea compute on BxB submatrix that fits /8/ CS5-S5 L Caches 3 Blocking Example /* After */ for (jj = ; jj < N; jj = jj+b) for (kk = ; kk < N; kk = kk+b) for (i = ; i < N; i = i+) for (j = jj; j < min(jj+b-,n); j = j+) {r = ; for (k = kk; k < min(kk+b-,n); k = k+) { r = r + y[i][k]*z[k][j];}; x[i][j] = x[i][j] + r; }; B called Blocking Factor Capacity Misses from N 3 + N to N 3 /B +N Conflict Misses Too? /8/ CS5-S5 L Caches 35 Reducing Conflict Misses by Blocking..5 Blocking Factor Direct Mapped Cache Fully Associative Cache 5 5 Conflict misses in caches not FA vs. Blocking size Lam et al [99] a blocking factor of had a fifth the misses vs. 8 despite both fit in cache /8/ CS5-S5 L Caches 36

7 NOW Handout Page 7 Summary of Compiler Optimizations to Reduce Cache Misses (by hand) vpenta (nasa7) gmty (nasa7) tomcatv btrix (nasa7) mxm (nasa7) spice cholesky (nasa7) compress Performance Improvement Impact of Memory Hierarchy on Algorithms Today CPU time is a function of (ops, cache misses) vs. just f(ops) What does this mean to Compilers, Data structures, Algorithms? The Influence of Caches on the Performance of Sorting by A. LaMarca and R.E. Ladner. Proceedings of the Eighth Annual ACM- SIAM Symposium on Discrete Algorithms, January, 997, Quicksort fastest comparison based sorting algorithm when all keys fit in memory Radix sort also called linear time sort because for keys of fixed length and fixed radix a constant number of passes over the data is sufficient independent of the number of keys For Alphastation 5, 3 byte blocks, direct mapped L MB cache, 8 byte keys, from to merged loop loop fusion blocking arrays interchange /8/ CS5-S5 L Caches 37 /8/ CS5-S5 L Caches 38 Quicksort vs. Radix as vary number keys Instructions Quicksort vs. Radix as vary number keys Instrs & Time Radix sort Quick (Instr/key) Radix (Instr/key) Radix sort Time Quick (Instr/key) Radix (Instr/key) Quick (Clocks/key) Radix (clocks/key) 3 Quick sort E+7 Instructions/key Set size in keys /8/ CS5-S5 L Caches 39 3 Quick sort Instructions E+7 Set size in keys /8/ CS5-S5 L Caches Quicksort vs. Radix as vary number keys Cache misses 5 3 Quick sort Radix sort Cache misses Set size in keys Quick(miss/key) Radix(miss/key) What is proper approach to fast algorithms? /8/ CS5-S5 L Caches Review What happens on Cache miss? For in-order pipeline, options Freeze pipeline in Mem stage (popular early on Sparc, R) IF ID EX Mem stall stall stall stall Mem Wr IF ID EX stall stall stall stall Ex Mem Wr» Stall, Load cache line, Restart mem stage» This is why cost on CM = Penalty + Hit Time Use Full/Empty bits in registers + MSHR queue» MSHR = Miss Status/Handler Registers (Kroft) Each entry in this queue keeps track of status of outstanding memory requests to one complete memory line. Per cache-line keep info about memory address. For each word register (if any) that is waiting for result. Used to merge multiple requests to one memory line» New load creates MSHR entry and sets destination register to Empty. Load is released from pipeline.» Attempt to use register before result returns causes instruction to block in decode stage.» Limited out-of-order execution with respect to loads. Popular with in-order superscalar architectures. Out-of-order pipelines already have this functionality built in /8/ (load queues, etc). CS5-S5 L Caches

8 NOW Handout Page 8 Disadvantage of Set Associative Cache N-way Set Associative Cache v. Direct Mapped Cache N comparators vs. Extra MUX delay for the data Data comes AFTER Hit/Miss In a direct mapped cache, Cache Block is available BEFORE Hit/Miss Possible to assume a hit and continue. Recover later if miss. Valid Adr Tag Compare Cache Index Cache Block Cache Block Sel Mux Sel Compare Valid Review Four Questions for Memory Hierarchy Designers Q Where can a block be placed in the upper level? (Block placement) Fully Associative, Set Associative, Direct Mapped Q How is a block found if it is in the upper level? (Block identification) Tag/Block Q3 Which block should be replaced on a miss? (Block replacement) Random, LRU Q What happens on a write? (Write strategy) Write Back or Write Through (with Write Buffer) OR /8/ CS5-S5 L Cache Caches Block Hit 3 /8/ CS5-S5 L Caches Summary Memory accesses CPUtime = IC CPI Execution + Miss rate Miss penalty Clock cycle time Instruction 3 Cs Compulsory, Capacity, Conflict. Reduce Misses via Larger Block Size. Reduce Misses via Higher Associativity 3. Reducing Misses via Victim Cache. Reducing Misses via Pseudo-Associativity 5. Reducing Misses by HW Prefetching Instr, Data 6. Reducing Misses by SW Prefetching Data 7. Reducing Misses by Compiler Optimizations Remember danger of concentrating on just one parameter when evaluating performance /8/ CS5-S5 L Caches 5

Lecture 16: Memory Hierarchy Misses, 3 Cs and 7 Ways to Reduce Misses. Professor Randy H. Katz Computer Science 252 Fall 1995

Lecture 16: Memory Hierarchy Misses, 3 Cs and 7 Ways to Reduce Misses Professor Randy H. Katz Computer Science 252 Fall 1995 Review: Who Cares About the Memory Hierarchy? Processor Only Thus Far in Course: