Memory Hierarchy Chapter 2. Abdullah Muzahid
17. 2-Way Set Associative Example
Assume 2-way set-associative, 64 cache sets, 16-byte cache line, and LRU replacement policy. How big is the cache? (2 ways x 64 sets x 16 bytes/line = 2048 bytes = 2 KB of data.)
[Worksheet: for each reference (address, R/W), fill in the binary address, tag, set, offset, whether the block is found (hit/miss), and which way is updated. The reference addresses and table cells were lost in transcription.]
17. 2-Way Set Associative Example (solution)
[Solution table: the early references miss on a cold cache; later references hit when the tag matches a valid block in the indexed set, and LRU picks the way to update on each miss. The numeric entries were lost in transcription.]
18. Write-Through, No Write Allocate Example
Assume 2-way set-associative, 64 cache sets, 16-byte cache line, and LRU replacement policy.
[Worksheet: as before, plus a memory-refs column counting main-memory reads and writes per reference. The reference addresses were lost in transcription.]
How many main memory reads? How many main memory writes?
18. Write-Through, No Write Allocate Example (solution)
[Solution table lost in transcription. Key behavior: every write goes to main memory (write-through); a write miss does not allocate a block ("None" in the update column); a read miss loads the block, costing one memory read.]
Main memory reads: 4. Main memory writes: 3.
19. Write-Back, Write Allocate Example
Assume 2-way set-associative, 64 cache sets, 16-byte cache line, and LRU replacement policy.
[Worksheet: as before, plus a dirty column; put * in the updated way if that way is (still) dirty.]
How many reads and writes, and why?
19. Write-Back, Write Allocate Example (solution)
[Solution table lost in transcription. Key behavior: a write miss allocates, costing one memory read and marking the way dirty; a write hit stays in the cache without touching memory.]
⇒ The last reference evicts dirty block 448, causing a read of block 34 and a write-back of 448!
Main memory reads: 5. Main memory writes: 1.
20. L1 Cache of AMD Opteron (Comp Arch, Henn & Patt, Fig B.5, pg B-13)
[Figure: 2-way associative L1 cache, 512 blocks per way; valid & dirty bit per block, one LRU bit per set; block address split into a <25>-bit tag and <9>-bit index, plus a <6>-bit block offset; two tag comparators feed a 2:1 mux; an 8-block victim buffer sits between the cache and lower-level memory.]
1. Address split into 3 parts: 64-byte block size → 6-bit offset; 512-entry index → 9 bits; 25-bit tag (40-bit physical address)
2. Index determines the proper set
3. Check if the tag matches & the valid bit is set
4. Mux selects which way to pass out: on hit, output → CPU; on miss, output → victim buffer
21. Improving Cache Performance
Assuming the main and virtual memory implementations are fixed, how can we improve our cache performance?
- Reduce the cache miss penalty
- Reduce the cache miss rate
- Use parallelism to overlap operations, improving one or both of the above (e.g., doing hardware prefetch in parallel with normal memory traffic can reduce miss rate)
- Reduce the cache hit time
⇒ Will discuss each of these in turn
22. Ideas for Reducing Cache Miss Penalty
- Use early restart: allow the CPU to continue as soon as the required bytes are in the cache, rather than waiting for the entire block to load
- Critical word first: load the accessed bytes of the block first, then load the remaining words in wrap-around order; status bits are needed to indicate how much of the block has arrived; particularly good for caches with large block sizes
- Give memory reads priority over writes
- Merging write buffer
- Victim caches
- Use multilevel caches
23. Giving Memory Reads Priority Over Writes (Reducing Cache Miss Penalty)
Assume the common case of having a write buffer so that the CPU does not stall on writes (as it must for reads). The CPU can check the write buffer on a read miss:
- If the data is presently in the write buffer, load from there
- If not in the write buffer, load from memory before the prior writes
Advantages: since a read stalls the CPU and a write does not, we minimize stalls; we may avoid a memory load if the data is in the write buffer.
Write buffers can make write-back more efficient as well:
1. On a dirty-block eviction, copy the dirty block from the cache to the write buffer
2. Load the evicting block from memory to the cache (CPU unstalled)
3. Write the dirty block from the buffer to memory
24. Merging Write Buffer (Reducing Cache Miss Penalty)
Due to latency, multiword writes are more efficient than writing words separately. Multiple words in the write buffer may be associated with the same block address; valid bits indicate which words to write. This reduces the number of memory accesses and the number of write-buffer stalls for a given buffer size.
[Comp Arch, Henn & Patt, Fig 2.7: without merging, sequential 8-byte writes to Mem[100], Mem[108], Mem[116], Mem[124] occupy four buffer entries; with merging, one entry holds all four words with their valid bits set.]
Notes: valid bits are not needed in a write-back cache; assume a 32-byte block for the cache; for sequential accesses, a 4-fold reduction in the number of writes and buffer entries. In practice, must handle 1-, 2-, and 4-byte as well as 8-byte words. ⇒ The larger the block size, the more this helps.
25. Victim Caches (Reducing Cache Miss Penalty)
Victim cache: a small (e.g., 1-8 blocks), fully-associative cache that contains recently evicted blocks from a primary cache.
- Checked in parallel with the primary cache
- Available on the following cycle if the item is in the victim cache
- Victim block swapped with the block in the cache
- Of great benefit to a direct-mapped L1 cache
(Comp Arch 3rd ed, Henn & Patt, Fig 5.13, pg 422) Less popular today!
26. Multi-Level Caches (Reducing Cache Miss Penalty)
Becomes more popular as the miss penalty for the primary cache grows. Further caches may be off-chip, but are still made of SRAM. Almost all general-purpose machines have at least 2 levels of cache; most have 2 on-chip caches. Further caches typically have larger blocks and cache size.
Equations:
local miss rate = misses in this cache / accesses to this cache
global miss rate = misses in this cache / accesses to L1 cache
avg acc time = L1 hit time + L1 miss rate * L1 miss penalty
L1 miss penalty = L2 hit time + L2 miss rate * L2 miss penalty
L2 miss penalty = main memory access time
The L1 miss penalty is the average access time for L2.
Local miss rate: % of this cache's refs that go to the next level. Global miss rate: % of all memory refs that go to the next level.
26. Cache Miss Equation Examples
Assume nref = 1000, nL1miss = 40, nL2miss = 20, L2 miss penalty = 100 cycles, L2 hit time = 10, L1 hit time = 1.
1. What is the local and global miss rate for each cache?
→ L1 has the same local and global miss rate, since all memory refs go to L1:
L1 miss rate = 40 / 1000 = 0.04 = 4%
L2 local miss rate = 20 / 40 = 0.5 = 50%
L2 global miss rate = 20 / 1000 = 0.02 = 2%
2. What is the average access time?
avg acc time = L1 hit time + L1 miss rate * L1 miss penalty
L1 miss penalty = avg L2 acc time = L2 hit time + L2 miss rate * L2 miss penalty = 10 + 0.5 * 100 = 60
avg acc time = 1 + 0.04 * 60 = 3.4 cycles
27. Reducing Cache Miss Rate
We just discussed techniques for reducing the cost of a cache miss; now we want to investigate ways to increase our chances of hitting in the cache.
Cache Miss Categories:
- Compulsory: the first access to a block must always miss. (Count the total # of distinct blocks accessed by the program.)
- Capacity: blocks that are replaced and reloaded because the cache cannot contain all the blocks needed during execution. (Simulate a fully-associative cache; subtract compulsory misses from total misses.)
- Conflict: occurs when too many blocks map to the same cache set. (Simulate the desired cache; subtract compulsory & capacity misses from total misses.)
28. Miss Rate vs. Cache Size
[Comp Arch, Henn & Patt, Fig 2.2: miss rate per type vs. cache size (KB) for 1-way, 2-way, 4-way, and 8-way associativity, with capacity and compulsory components.]
- The top figure shows total miss rate
- Compulsory misses (tiny first band) stay constant; the only way to decrease them is to increase block size, which may increase miss penalty
- Capacity misses (large band) go down with cache size
- Conflict misses decrease with size: since the # of conflicts goes down with size, associativity pays off less for large caches
- The bottom figure shows the distribution of misses: the % of compulsory misses increases with size, since the other miss types decrease with size
29. Reducing Miss Rate with Larger Blocks
Advantages: exploits spatial locality; reduces compulsory misses.
Disadvantages: increases miss penalty; can increase conflicts; may waste bandwidth.
SPEC92 block-size analysis: if line size is large compared to cache size, conflicts rise, increasing miss rate; a 64-byte line size is reasonable across the studied cache sizes. (Comp Arch, Henn & Patt, Fig B.10, pg B-27; cells lost in transcription are marked —.)

blksz | 4K    | 16K   | 64K   | 256K
16    |  —    | 3.94% | 2.04% | 1.09%
32    |  —    | 2.87% | 1.35% | 0.70%
64    | 7.00% | 2.64% | 1.06% | 0.51%
128   |  —    | 2.77% | 1.02% | 0.49%
256   |  —    | 3.29% | 1.15% | 0.49%
30. Reducing Miss Rate with Larger Caches & Higher Associativity
Larger Caches
- Advantages: reduces capacity & conflict misses
- Disadvantages: uses more space; may increase hit time; higher cost ($, power, die)
Higher Associativity
- Advantages: reduces conflict misses
- Disadvantages: may increase hit time (tag check done before data can be sent); requires more space & power (more logic for comparators, more bits for tags, other status bits such as LRU)
31. Reducing Miss Rate with Way Prediction & Pseudo-Associativity
Hit time as fast as direct-mapped, requiring only 1 comparator; reduces misses like a set-associative cache; will have fast hits and slow hits.
Way Prediction: each set has bits indicating which block to check on the next access; a miss requires checking the other blocks in the set on subsequent cycles.
Pseudo-Associativity: accesses the cache as in a direct-mapped cache with 1 less index bit; on a miss, checks the sister block in the cache (e.g., by inverting the most significant index bit); may swap the two blocks on an initial cache miss with a pseudo-way hit.
32. Reducing Cache Miss Penalty and/or Miss Rate via Parallelism
Nonblocking caches: allow cache hit accesses while a cache miss is being serviced; some allow hits under multiple misses (requires a queue of outstanding misses); could use a status bit for each block to indicate the block is currently being filled.
Hardware prefetching & software prefetching: the idea is that predicted memory blocks are fetched while doing computations on the present blocks. Requires nonblocking caches; most prefetches do not raise exceptions.
- If the guess is right, the data is in-cache for use; if wrong, we wasted some bandwidth we weren't using anyway
- Helps with latency by exploiting unused bandwidth; if the bus is saturated, prefetch won't help, and most architectures ignore it
- Can help with throughput if usage is sporadic
- Could expand conflict/capacity misses if the prefetch is wrong
33. Pipelined Cache Access
Pipelining cache access increases hit latency, but gives a fast clock cycle and high bandwidth. Most modern processors do this.
34. Reducing Cache Hit Time
Small & simple caches: small caches → less propagation delay; direct mapped → overlap tag check & data sending; some designs have tags on-chip, data off-chip.
Avoiding address translation: virtual caches avoid the virtual-to-physical translation step, but are problematic in practice. Virtually indexed, physically tagged: index the cache by page offset, but tag with the physical address; can get data from the cache earlier.
Pipelined cache access: allows a fast clock speed, but results in greater branch misprediction penalty & load latency.
35. Increasing Cache Bandwidth with Multibanked Caches
Increase bandwidth by sending an address to b banks simultaneously; the b banks look up the address & write to the bus at the same time, increasing bandwidth by b in the best case. Usually use sequential interleaving: block address mod b selects the bank. (Comp Arch, Henn & Patt, Fig 5.6, pg 299 shows b = 4: block addresses 0, 4, 8, ... go to bank 0; 1, 5, 9, ... to bank 1; and so on.)
Compiler Opt: Loop Interchange
Improve spatial/temporal locality of data:

for (j=0; j<100; j=j+1) {
  for (i=0; i<5000; i=i+1) {
    x[i][j] = 2 * x[i][j];
  }
}

becomes

for (i=0; i<5000; i=i+1) {
  for (j=0; j<100; j=j+1) {
    x[i][j] = 2 * x[i][j];
  }
}

Copyright Josep Torrellas 1999, 2001, 2002
Hardware Prefetching of Instructions & Data
Prefetch: access items before they are needed and deposit them into caches or external buffers.
- Instruction prefetching: e.g., fetch the next block on a miss or on access. The prefetched block goes to a stream buffer (or cache).
- Data prefetching: same idea; could have several stream buffers to capture several localities.
- Careful about bandwidth use.
Compiler-Controlled Prefetching
The compiler inserts prefetch instructions:
- Register prefetch: into a register (+ cache)
- Cache prefetch: into the cache
Can be faulting (causes an exception on a protection violation) or non-faulting (turns into a no-op if it would cause an exception). Needs a non-blocking (lockup-free) cache: the cache can be accessed while there is a prefetch / miss pending.
Example
8 KB direct-mapped cache with 16 B blocks. Each element of a and b is 8 bytes long; a is 3 rows x 100 cols, b is 101 rows x 3 cols.

for (i=0; i<3; i=i+1)
  for (j=0; j<100; j=j+1)
    a[i][j] = b[j][0] * b[j+1][0];

a: even j values miss, odd j values hit (spatial locality) → 150 misses
b: no spatial locality, only temporal locality; supposing no conflicts, misses 101 times
TOTAL = 251 misses
Prefetching
- Usually works in loops
- Can be combined with loop unrolling & software pipelining
- Problem: overhead
Simplifications: 1) don't worry about the first few misses, 2) assume a non-faulting prefetch. Split so that the first loop prefetches both a & b, and the second loop prefetches only a. Assume a long miss latency: prefetch 7 iterations ahead.

for (j=0; j<100; j=j+1) {
  prefetch(b[j+8][0]);
  prefetch(a[0][j+7]);
  a[0][j] = b[j][0]*b[j+1][0];
}
for (i=1; i<3; i=i+1) {
  for (j=0; j<100; j=j+1) {
    prefetch(a[i][j+7]);
    a[i][j] = b[j][0]*b[j+1][0];
  }
}
We are prefetching a[0][7]–a[0][99], a[1][7]–a[1][99], a[2][7]–a[2][99], and b[8][0]–b[100][0]. We are only left with:
- 8 misses for b: b[0][0] ... b[7][0]
- 12 misses for a: a[0][0], a[0][2], a[0][4], a[0][6]; a[1][0], a[1][2], a[1][4], a[1][6]; a[2][0], a[2][2], a[2][4], a[2][6]
So we execute 400 prefetch instructions to avoid 231 misses.
36. Summary of Cache Optimizations
(+ improves the metric, − hurts it, = no effect; cells lost in transcription are reconstructed from the slides above.)

Technique                       | Miss pen | Miss rate | Hit time | BW | HW cmplx       | Comment
Larger cache size               |    =     |     +     |    −     | =  | 1              | widely used for L2, L3
Larger block size               |    −     |     +     |    =     | =  | 0              | P4 L2 uses 128 bytes
Higher associativity            |    =     |     +     |    −     | =  | 1              | widely used
Multilevel caches               |    +     |     =     |    =     | =  | 2              | costly hardware, esp. if L1 blksz ≠ L2 blksz; widely used
Cache index w/o translation     |    =     |     =     |    +     | =  | 1              | trivial if small cache; US-III/21264
Read priority over writes       |    +     |     =     |    =     | =  | 1              | easy for uniprocessor; widely used
Crit word first & early restart |    +     |     =     |    =     | =  | 2              | widely used
Merging write buffer            |    +     |     =     |    =     | =  | 1              | widely used
Victim caches                   |    +     |     +     |    =     | =  | 2              | Athlon had 8-entry
Way prediction                  |    =     |     =     |    +     | =  | 1              | I-cache of US-III / D-cache of R4300
Pseudoassociativity             |    =     |     =     |    +     | =  | 1              | L2 of R10K
Compiler optimizations          |    =     |     +     |    =     | =  | 0              | hard; varies by compiler
Hardware prefetch               |    +     |     +     |    =     | =  | 2 instr, 3 data | widely used
Software prefetch               |    +     |     +     |    =     | =  | 3              | widely used
Small & simple caches           |    =     |     −     |    +     | =  | 0              | widely used for L1
Nonblocking caches              |    +     |     =     |    =     | +  | 3              | all out-of-order CPUs
Pipelined cache access          |    =     |     =     |    −     | +  | 1              | widely used
Multibanked caches              |    =     |     =     |    =     | +  | 1              | L2 of Opteron & Niagara
More informationClassification Steady-State Cache Misses: Techniques To Improve Cache Performance:
#1 Lec # 9 Winter 2003 1-21-2004 Classification Steady-State Cache Misses: The Three C s of cache Misses: Compulsory Misses Capacity Misses Conflict Misses Techniques To Improve Cache Performance: Reduce
More informationCS 61C: Great Ideas in Computer Architecture (Machine Structures)
CS 6C: Great Ideas in Computer Architecture (Machine Structures) Instructors: Randy H Katz David A PaHerson hhp://insteecsberkeleyedu/~cs6c/fa Direct Mapped (contnued) - Interface CharacterisTcs of the
More informationCSE Memory Hierarchy Design Ch. 5 (Hennessy and Patterson)
CSE 4201 Memory Hierarchy Design Ch. 5 (Hennessy and Patterson) Memory Hierarchy We need huge amount of cheap and fast memory Memory is either fast or cheap; never both. Do as politicians do: fake it Give
More informationMemory Hierarchy. Advanced Optimizations. Slides contents from:
Memory Hierarchy Advanced Optimizations Slides contents from: Hennessy & Patterson, 5ed. Appendix B and Chapter 2. David Wentzlaff, ELE 475 Computer Architecture. MJT, High Performance Computing, NPTEL.
More informationPage 1. Memory Hierarchies (Part 2)
Memory Hierarchies (Part ) Outline of Lectures on Memory Systems Memory Hierarchies Cache Memory 3 Virtual Memory 4 The future Increasing distance from the processor in access time Review: The Memory Hierarchy
More informationCache performance Outline
Cache performance 1 Outline Metrics Performance characterization Cache optimization techniques 2 Page 1 Cache Performance metrics (1) Miss rate: Neglects cycle time implications Average memory access time
More informationLecture 11. Virtual Memory Review: Memory Hierarchy
Lecture 11 Virtual Memory Review: Memory Hierarchy 1 Administration Homework 4 -Due 12/21 HW 4 Use your favorite language to write a cache simulator. Input: address trace, cache size, block size, associativity
More informationChapter 5A. Large and Fast: Exploiting Memory Hierarchy
Chapter 5A Large and Fast: Exploiting Memory Hierarchy Memory Technology Static RAM (SRAM) Fast, expensive Dynamic RAM (DRAM) In between Magnetic disk Slow, inexpensive Ideal memory Access time of SRAM
More informationAdvanced Caching Techniques
Advanced Caching Approaches to improving memory system performance eliminate memory accesses/operations decrease the number of misses decrease the miss penalty decrease the cache/memory access times hide
More informationChapter 5. Large and Fast: Exploiting Memory Hierarchy
Chapter 5 Large and Fast: Exploiting Memory Hierarchy Memory Technology Static RAM (SRAM) 0.5ns 2.5ns, $2000 $5000 per GB Dynamic RAM (DRAM) 50ns 70ns, $20 $75 per GB Magnetic disk 5ms 20ms, $0.20 $2 per
More informationCOSC4201. Chapter 5. Memory Hierarchy Design. Prof. Mokhtar Aboelaze York University
COSC4201 Chapter 5 Memory Hierarchy Design Prof. Mokhtar Aboelaze York University 1 Memory Hierarchy The gap between CPU performance and main memory has been widening with higher performance CPUs creating
More informationCPU issues address (and data for write) Memory returns data (or acknowledgment for write)
The Main Memory Unit CPU and memory unit interface Address Data Control CPU Memory CPU issues address (and data for write) Memory returns data (or acknowledgment for write) Memories: Design Objectives
More informationL2 cache provides additional on-chip caching space. L2 cache captures misses from L1 cache. Summary
HY425 Lecture 13: Improving Cache Performance Dimitrios S. Nikolopoulos University of Crete and FORTH-ICS November 25, 2011 Dimitrios S. Nikolopoulos HY425 Lecture 13: Improving Cache Performance 1 / 40
More informationCS 152 Computer Architecture and Engineering. Lecture 7 - Memory Hierarchy-II
CS 152 Computer Architecture and Engineering Lecture 7 - Memory Hierarchy-II Krste Asanovic Electrical Engineering and Computer Sciences University of California at Berkeley http://www.eecs.berkeley.edu/~krste
More informationHandout 4 Memory Hierarchy
Handout 4 Memory Hierarchy Outline Memory hierarchy Locality Cache design Virtual address spaces Page table layout TLB design options (MMU Sub-system) Conclusion 2012/11/7 2 Since 1980, CPU has outpaced
More informationSome material adapted from Mohamed Younis, UMBC CMSC 611 Spr 2003 course slides Some material adapted from Hennessy & Patterson / 2003 Elsevier
Some material adapted from Mohamed Younis, UMBC CMSC 611 Spr 2003 course slides Some material adapted from Hennessy & Patterson / 2003 Elsevier Science CPUtime = IC CPI Execution + Memory accesses Instruction
More informationCaches Concepts Review
Caches Concepts Review What is a block address? Why not bring just what is needed by the processor? What is a set associative cache? Write-through? Write-back? Then we ll see: Block allocation policy on
More information10/16/2017. Miss Rate: ABC. Classifying Misses: 3C Model (Hill) Reducing Conflict Misses: Victim Buffer. Overlapping Misses: Lockup Free Cache
Classifying Misses: 3C Model (Hill) Divide cache misses into three categories Compulsory (cold): never seen this address before Would miss even in infinite cache Capacity: miss caused because cache is
More informationCache Performance and Memory Management: From Absolute Addresses to Demand Paging. Cache Performance
6.823, L11--1 Cache Performance and Memory Management: From Absolute Addresses to Demand Paging Asanovic Laboratory for Computer Science M.I.T. http://www.csg.lcs.mit.edu/6.823 Cache Performance 6.823,
More informationPortland State University ECE 587/687. Caches and Memory-Level Parallelism
Portland State University ECE 587/687 Caches and Memory-Level Parallelism Revisiting Processor Performance Program Execution Time = (CPU clock cycles + Memory stall cycles) x clock cycle time For each
More informationMemory Hierarchy 3 Cs and 6 Ways to Reduce Misses
Memory Hierarchy 3 Cs and 6 Ways to Reduce Misses Soner Onder Michigan Technological University Randy Katz & David A. Patterson University of California, Berkeley Four Questions for Memory Hierarchy Designers
More informationregisters data 1 registers MEMORY ADDRESS on-chip cache off-chip cache main memory: real address space part of virtual addr. sp.
Cache associativity Cache and performance 12 1 CMPE110 Spring 2005 A. Di Blas 110 Spring 2005 CMPE Cache Direct-mapped cache Reads and writes Textbook Edition: 7.1 to 7.3 Second Third Edition: 7.1 to 7.3
More informationLECTURE 11. Memory Hierarchy
LECTURE 11 Memory Hierarchy MEMORY HIERARCHY When it comes to memory, there are two universally desirable properties: Large Size: ideally, we want to never have to worry about running out of memory. Speed
More informationCS 152 Computer Architecture and Engineering. Lecture 8 - Memory Hierarchy-III
CS 152 Computer Architecture and Engineering Lecture 8 - Memory Hierarchy-III Krste Asanovic Electrical Engineering and Computer Sciences University of California at Berkeley http://www.eecs.berkeley.edu/~krste
More informationEI338: Computer Systems and Engineering (Computer Architecture & Operating Systems)
EI338: Computer Systems and Engineering (Computer Architecture & Operating Systems) Chentao Wu 吴晨涛 Associate Professor Dept. of Computer Science and Engineering Shanghai Jiao Tong University SEIEE Building
More informationComputer Architecture Computer Science & Engineering. Chapter 5. Memory Hierachy BK TP.HCM
Computer Architecture Computer Science & Engineering Chapter 5 Memory Hierachy Memory Technology Static RAM (SRAM) 0.5ns 2.5ns, $2000 $5000 per GB Dynamic RAM (DRAM) 50ns 70ns, $20 $75 per GB Magnetic
More informationארכיטקטורת יחידת עיבוד מרכזי ת
ארכיטקטורת יחידת עיבוד מרכזי ת (36113741) תשס"ג סמסטר א' July 2, 2008 Hugo Guterman (hugo@ee.bgu.ac.il) Arch. CPU L8 Cache Intr. 1/77 Memory Hierarchy Arch. CPU L8 Cache Intr. 2/77 Why hierarchy works
More informationLecture: Large Caches, Virtual Memory. Topics: cache innovations (Sections 2.4, B.4, B.5)
Lecture: Large Caches, Virtual Memory Topics: cache innovations (Sections 2.4, B.4, B.5) 1 More Cache Basics caches are split as instruction and data; L2 and L3 are unified The /L2 hierarchy can be inclusive,
More informationMemory Hierarchy Design
Memory Hierarchy Design Outline Introduction Cache Basics Cache Performance Reducing Cache Miss Penalty Reducing Cache Miss Rate Reducing Hit Time Main Memory and Organizations Memory Technology Virtual
More informationCS 136: Advanced Architecture. Review of Caches
1 / 30 CS 136: Advanced Architecture Review of Caches 2 / 30 Why Caches? Introduction Basic goal: Size of cheapest memory... At speed of most expensive Locality makes it work Temporal locality: If you
More informationImproving Cache Performance. Dr. Yitzhak Birk Electrical Engineering Department, Technion
Improving Cache Performance Dr. Yitzhak Birk Electrical Engineering Department, Technion 1 Cache Performance CPU time = (CPU execution clock cycles + Memory stall clock cycles) x clock cycle time Memory
More informationChapter 5. Large and Fast: Exploiting Memory Hierarchy
Chapter 5 Large and Fast: Exploiting Memory Hierarchy Memory Technology Static RAM (SRAM) 0.5ns 2.5ns, $2000 $5000 per GB Dynamic RAM (DRAM) 50ns 70ns, $20 $75 per GB Magnetic disk 5ms 20ms, $0.20 $2 per
More informationLECTURE 4: LARGE AND FAST: EXPLOITING MEMORY HIERARCHY
LECTURE 4: LARGE AND FAST: EXPLOITING MEMORY HIERARCHY Abridged version of Patterson & Hennessy (2013):Ch.5 Principle of Locality Programs access a small proportion of their address space at any time Temporal
More information