Memory Hierarchy
Maurizio Palesi
References
John L. Hennessy and David A. Patterson, Computer Architecture: A Quantitative Approach, 2nd edition, Morgan Kaufmann, Chapter 5
Who Cares About the Memory Hierarchy?
[Figure: processor vs. DRAM performance, 1980-2000, log scale. CPU performance improves ~60%/year (2x every 1.5 years, Moore's Law), while DRAM latency improves only ~9%/year (2x every 10 years). The resulting processor-memory performance gap grows about 50% per year.]
Levels of the Memory Hierarchy (faster and smaller at the top, larger and cheaper at the bottom):
- Registers: 100s of bytes, <10 ns. Transfer unit: instruction operands (1-8 bytes), managed by the program/compiler
- Cache: KBytes, 10-100 ns, 1-0.1 cents/bit. Transfer unit: blocks (8-128 bytes), managed by the cache controller
- Main memory: MBytes, 200-500 ns, 0.0001-0.00001 cents/bit. Transfer unit: pages (512 bytes-4 KB), managed by the OS
- Disk: GBytes, ~10 ms (10,000,000 ns), 10^-5 to 10^-6 cents/bit. Transfer unit: files (MBytes), managed by the user/operator
- Tape: effectively infinite capacity, seconds-to-minutes access time, 10^-8 cents/bit
What is a Cache?
Small, fast storage used to improve the average access time to slow memory. It exploits spatial and temporal locality. In computer architecture, almost everything is a cache:
- Registers: a cache on variables
- First-level cache: a cache on the second-level cache
- Second-level cache: a cache on memory
- Memory: a cache on disk (virtual memory)
- TLB: a cache on the page table
- Branch predictor: a cache on prediction information
The Principle of Locality
Programs access a relatively small portion of the address space at any instant of time. Two different types of locality:
- Temporal locality (locality in time): if an item is referenced, it will tend to be referenced again soon (e.g., loops, data reuse)
- Spatial locality (locality in space): if an item is referenced, items whose addresses are close by tend to be referenced soon (e.g., straight-line code, array access)
For the last 20 years, hardware has relied on locality for speed.
Exploit Locality
By taking advantage of the principle of locality, we can:
- Present the user with as much memory as is available in the cheapest technology
- Provide access at the speed offered by the fastest technology
DRAM is slow but cheap and dense: a good choice for presenting the user with a BIG memory system. SRAM is fast but expensive and not very dense: a good choice for providing the user with FAST access time.
General Principles
Locality:
- Temporal locality: referenced items tend to be referenced again soon
- Spatial locality: nearby items tend to be referenced soon
Locality + "smaller hardware is faster" = memory hierarchy:
- Levels: each level is smaller, faster, and more expensive per byte than the level below
- Inclusive: data found in the top level is also found in the levels below
Definitions:
- The upper level is the one closer to the processor
- A block is the minimum unit of data that is either present or not present in the upper level
- Address = block frame address + block offset
Memory Hierarchy: Terminology
- Hit: the data is found in some block in the upper level (e.g., block X)
- Hit rate: the fraction of memory accesses found in the upper level
- Hit time: time to access the upper level = RAM access time + time to determine hit/miss
- Miss: the data must be retrieved from a block in the lower level (block Y)
- Miss rate = 1 - hit rate
- Miss penalty: time to replace a block in the upper level + time to deliver the block to the processor
Hit time << miss penalty (about 500 instructions on the Alpha 21264!)
Cache Measures
Average memory access time = hit time + miss rate x miss penalty [ns or clock cycles]
Miss penalty: time to replace a block from the lower level, including the time to deliver it to the CPU:
- Access time: time to reach the lower level = f(latency to the lower level)
- Transfer time: time to transfer the block = f(bandwidth between the upper and lower levels)
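The formula above is easy to sanity-check numerically. The sketch below uses illustrative numbers (1 ns hit time, 5% miss rate, 100 ns miss penalty); these values are assumptions for the example, not figures from the slides.

```python
def amat(hit_time, miss_rate, miss_penalty):
    """Average memory access time = hit time + miss rate * miss penalty."""
    return hit_time + miss_rate * miss_penalty

# Hypothetical cache: 1 ns hit, 5% miss rate, 100 ns miss penalty.
print(amat(1.0, 0.05, 100.0))  # 1 + 0.05 * 100 = 6.0 ns
```

Note how a miss rate of only 5% still multiplies the effective access time by six relative to the hit time alone.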
Block Size Tradeoff
In general, a larger block size takes advantage of spatial locality, BUT:
- A larger block size means a larger miss penalty: it takes longer to fill the block
- If the block size is too big relative to the cache size, the miss rate goes up: too few cache blocks compromises temporal locality
In general: average access time = hit time + miss penalty x miss rate
[Figure: as block size increases, the miss penalty grows steadily; the miss rate first falls (exploiting spatial locality) and then rises (fewer blocks); the average access time therefore has a minimum at an intermediate block size.]
4 Questions for the Memory Hierarchy
- Q1: Where can a block be placed in the upper level? (Block placement) Fully associative, set associative, direct mapped
- Q2: How is a block found if it is in the upper level? (Block identification) Tag/block
- Q3: Which block should be replaced on a miss? (Block replacement) Random, LRU
- Q4: What happens on a write? (Write strategy) Write back or write through (with a write buffer)
Q1: Where Can a Block Be Placed in the Upper Level?
For an 8-block cache and memory block address 12:
- Fully associative: block 12 can go anywhere
- Direct mapped: block 12 can go only into cache block 4 (12 mod 8)
- 2-way set associative: block 12 can go anywhere in set 0 (12 mod 4, with 4 sets)
[Figure: the three placement schemes shown over cache block numbers 0-7 and memory block addresses 0-31.]
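The mod arithmetic in the example above can be sketched directly; the function names are illustrative, and the 8-block cache matches the slide's example.

```python
NUM_BLOCKS = 8  # cache size in blocks, as in the slide's example

def direct_mapped(block_addr):
    # Exactly one legal cache block for each memory block.
    return block_addr % NUM_BLOCKS

def set_associative(block_addr, num_sets):
    # The block may go in any way of this set.
    return block_addr % num_sets

print(direct_mapped(12))       # 4 (12 mod 8)
print(set_associative(12, 4))  # 0 (12 mod 4: 2-way, 8 blocks / 4 sets)
# Fully associative: block 12 may go in any of the 8 cache blocks.
```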
Q2: How Is a Block Found If It Is in the Upper Level?
- A tag is stored with each block; there is no need to check the index or block offset
- Increasing associativity shrinks the index and expands the tag
- Fully associative: no index. Direct mapped: large index
Cache: Direct Mapped
[Figure: an 8-block direct-mapped cache and a 32-block main memory, addresses 00000 (0) through 11111 (31). Memory is divided into 4 partitions (Part. 0-3) of 8 blocks each; memory block i maps to cache block i mod 8, and the tag records which partition, i.e., the upper address bits, the cached block came from.]
Q3: Which Block Should Be Replaced on a Miss?
Easy for direct mapped. For set associative or fully associative:
- Random (good for large associativities)
- LRU (good for smaller associativities)
Data cache miss rates, LRU vs. random:

Size    | 2-way LRU | 2-way RND | 4-way LRU | 4-way RND | 8-way LRU | 8-way RND
16 KB   | 5.2%      | 5.7%      | 4.7%      | 5.3%      | 4.4%      | 5.0%
64 KB   | 1.9%      | 2.0%      | 1.5%      | 1.7%      | 1.4%      | 1.5%
256 KB  | 1.15%     | 1.17%     | 1.13%     | 1.13%     | 1.12%     | 1.12%
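A minimal sketch of LRU replacement within one cache set, assuming a 2-way set (the associativity and the `LRUSet` class are illustrative choices, not from the slides):

```python
from collections import OrderedDict

class LRUSet:
    """One set of an LRU-managed cache; entries are ordered oldest-first."""
    def __init__(self, ways):
        self.ways = ways
        self.blocks = OrderedDict()  # tag -> data

    def access(self, tag):
        """Return True on a hit; on a miss, evict the least recently used tag."""
        if tag in self.blocks:
            self.blocks.move_to_end(tag)  # mark as most recently used
            return True
        if len(self.blocks) >= self.ways:
            self.blocks.popitem(last=False)  # evict the LRU block
        self.blocks[tag] = None
        return False

s = LRUSet(ways=2)
print([s.access(t) for t in (1, 2, 1, 3, 2)])
# [False, False, True, False, False]: tag 2 was LRU when 3 arrived, so it
# was evicted and the final access to 2 misses again.
```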
Q4: What Happens on a Write?
- Write through: the information is written both to the block in the cache and to the block in the lower-level memory
- Write back: the information is written only to the block in the cache; the modified cache block is written to main memory only when it is replaced (is the block clean or dirty?)
Pros and cons of each:
- WT: read misses never cause writes to the lower level (no dirty block to write back on replacement)
- WB: repeated writes to the same block cost only one write to the lower level
WT is always combined with write buffers so that the processor does not wait for the lower-level memory.
Write Buffer for Write Through
A write buffer is needed between the cache and memory (Processor -> Cache -> Write buffer -> DRAM):
- The processor writes data into the cache and the write buffer
- The memory controller writes the contents of the buffer to memory
The write buffer is just a FIFO; a typical number of entries is 4. It works fine if:
store frequency (w.r.t. time) << 1 / DRAM write cycle
The memory system designer's nightmare:
store frequency (w.r.t. time) > 1 / DRAM write cycle, leading to write buffer saturation.
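The saturation condition is a simple rate comparison; the numbers below are illustrative assumptions, not figures from the slides.

```python
# Assume one DRAM write takes 100 ns, so the buffer drains at most
# 1/100 writes per ns.
dram_write_cycle_ns = 100
max_drain_rate = 1 / dram_write_cycle_ns  # writes per ns the buffer can retire

stores_per_ns_ok = 1 / 500   # one store every 500 ns: buffer keeps up
stores_per_ns_bad = 1 / 50   # one store every 50 ns: buffer saturates

print(stores_per_ns_ok < max_drain_rate)   # True: steady state is fine
print(stores_per_ns_bad < max_drain_rate)  # False: the FIFO fills and the
                                           # processor eventually stalls
```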
How a Block Is Found in the Cache
[Figure: the CPU address is split into tag, index, and offset fields. The index selects a cache entry; the stored cache tag is compared with the address tag to produce the hit/miss signal, and on a hit the cache data is driven onto the data bus to the CPU.]
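The address split and tag compare can be sketched for a direct-mapped cache. The bit widths (64-byte blocks, 16 lines) are illustrative assumptions, not values from the slides.

```python
OFFSET_BITS = 6  # 64-byte blocks
INDEX_BITS = 4   # 16 cache lines

def split_address(addr):
    """Split an address into (tag, index, offset) fields."""
    offset = addr & ((1 << OFFSET_BITS) - 1)
    index = (addr >> OFFSET_BITS) & ((1 << INDEX_BITS) - 1)
    tag = addr >> (OFFSET_BITS + INDEX_BITS)
    return tag, index, offset

tags = [None] * (1 << INDEX_BITS)  # one stored tag per cache line

def lookup(addr):
    tag, index, _ = split_address(addr)
    return tags[index] == tag      # hit iff the stored tag matches

tag, index, _ = split_address(0x1234)
tags[index] = tag                  # simulate filling the line
print(lookup(0x1234))              # True: same tag, same index
print(lookup(0x1234 + (1 << (OFFSET_BITS + INDEX_BITS))))
# False: same index but a different tag, i.e., a conflicting block
```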
How a Block Is Found in the Cache (2-Way Set Associative)
- Two sets of address tags and data RAMs
- The address index bits select the correct entry in each data RAM
- A 2:1 mux selects the data from the matching way
Reducing Misses: Classifying Misses, the 3 Cs
- Compulsory: the first access to a block cannot find it in the cache, so the block must be brought in. Also called cold-start misses or first-reference misses
- Capacity: if the cache cannot contain all the blocks needed during execution of a program, capacity misses occur due to blocks being discarded and later retrieved
- Conflict: if the block placement strategy is set associative or direct mapped, conflict misses (in addition to compulsory and capacity misses) occur because a block can be discarded and later retrieved when too many blocks map to its set. Also called collision misses or interference misses
3Cs Absolute Miss Rate (SPEC92)
[Figure: absolute miss rate (0 to 0.14) vs. cache size (1 KB to 128 KB) for 1-, 2-, 4-, and 8-way associativity, decomposed into compulsory, capacity, and conflict components. The compulsory component is tiny, and the conflict component shrinks as associativity increases.]
2:1 Cache Rule
miss rate of a 1-way (direct-mapped) cache of size X ≈ miss rate of a 2-way set-associative cache of size X/2
[Figure: the same miss rate vs. cache size plot (0 to 0.14, 1 KB to 128 KB, 1- to 8-way) as on the previous slide, illustrating the rule.]
3Cs Relative Miss Rate
[Figure: miss rate normalized to 100% vs. cache size (1 KB to 128 KB) for 1-, 2-, 4-, and 8-way associativity, split into conflict, capacity, and compulsory components.]