Chapter 6: Caches


Topics
  Memory hierarchy
  Locality of reference
  SRAM caches
    Direct mapped
    Associative

Computer System
  [Figure: processor with on-chip caches and an interrupt line, connected by the memory-I/O bus to memory and to the I/O controllers for the disks, the display, and the network. Labeled caches: on-chip cache, row cache, disk cache, net cache, web cache.]

Alpha Chip Photo
  Source: Microprocessor Report, 9/12/94
  Caches: L1 data, L1 instruction, L2 unified, TLB, branch history

Alpha Chip Caches
  Caches: L1 data, L1 instruction, L2 unified, TLB, branch history
  [Figure: annotated die photo. Labeled regions: L1 data, L1 instr., right half of L2, L2 tags, L3 control.]

Locality of Reference
  Principle of Locality: Programs tend to reuse data and instructions near those they have used recently.
    Temporal locality: recently referenced items are likely to be referenced in the near future.
    Spatial locality: items with nearby addresses tend to be referenced close together in time.

    sum = 0;
    for (i = 0; i < n; i++)
        sum += a[i];
    *v = sum;

  Locality in Example:
    Data
      Reference array elements in succession (spatial)
    Instructions
      Reference instructions in sequence (spatial)
      Cycle through loop repeatedly (temporal)

Caching: The Basic Idea
  Main Memory
    Stores words (A-Z in example)
  Cache
    Stores subset of the words (4 in example)
    Organized in lines
      Multiple words
      To exploit spatial locality
  Access
    Word must be in cache for processor to access
  [Figure: Processor <-> Cache (small, fast; holds A B and G H) <-> Memory (big, slow; holds A B C ... Y Z)]

Basic Idea (Cont.)
  Cache holds 2 lines, each with 2 words

  Step      Cache contents    Result
  Initial   A B | G H
  Read C    A B | C D         Miss: load line C+D into cache
  Read D    A B | C D         Hit: word already in cache
  Read Z    Y Z | C D         Miss: load line Y+Z into cache, evict oldest entry

  Maintaining Cache:
    Each time the processor performs a load or store, bring line containing the word into the cache
      May need to evict existing line
    Subsequent loads or stores to any word in line performed within cache

Design Issues for Caches
  Key Questions:
    Where should a line be placed in the cache? (line placement)
    How is a line found in the cache? (line identification)
    Which line should be replaced on a miss? (line replacement)
    What happens on a write? (write strategy)
  Constraints:
    Design must be very simple
      Hardware realization
      All decision making within nanosecond time scale
    Want to optimize performance for typical programs
      Do extensive benchmarking and simulations
      Many subtle engineering tradeoffs

Direct-Mapped Caches
  Simplest Design
    Each memory line has a unique cache location
  Parameters
    Line (or block) size B = 2^b
      Number of bytes in each line
      Typically 2X-8X word size
    Number of sets S = 2^s
      Number of lines cache can hold
    Total cache size = B*S = 2^(b+s)
  Physical Address
    Address used to reference main memory
    m bits to reference M = 2^m total bytes
    Partition into fields (m-bit physical address):

      | tag (t bits) | set index (s bits) | offset (b bits) |

      Offset: Lower b bits indicate which byte within line
      Set: Next s bits indicate how to locate line within cache
      Tag: Remaining t = m - s - b bits identify this line when in cache
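The address partition described above can be sketched in a few lines of C. This is a minimal illustration that assumes 32-bit addresses; the names `addr_fields` and `split_address` are hypothetical and not part of the slides.

```c
#include <assert.h>
#include <stdint.h>

/* Fields of a physical address for a cache with 2^s sets and 2^b-byte lines.
   Hypothetical type, introduced only for this sketch. */
typedef struct {
    uint32_t tag;     /* upper t = m - s - b bits */
    uint32_t set;     /* middle s bits */
    uint32_t offset;  /* lower b bits */
} addr_fields;

/* Split addr into tag, set index, and offset (assumes 32-bit addresses). */
addr_fields split_address(uint32_t addr, unsigned s, unsigned b)
{
    addr_fields f;
    assert(s + b < 32);                       /* keep the shifts well defined */
    f.offset = addr & ((1u << b) - 1u);       /* lower b bits */
    f.set    = (addr >> b) & ((1u << s) - 1u); /* next s bits */
    f.tag    = addr >> (s + b);               /* everything above */
    return f;
}
```

For a cache with 4 sets and 2-byte lines (s = 2, b = 1), address 13 (binary 1101) splits into tag 1, set 2, offset 1.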

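Indexing into Direct-Mapped Cache
  Use set index bits to select cache set
  [Figure: sets 0 through S-1, each holding one line with a valid bit, a tag, and bytes 0 through B-1. The physical address is split into tag (t bits), set index (s bits), and offset (b bits).]

Direct-Mapped Cache Tag Matching
  Identifying Line
    Must have tag match high order bits of address
    Must have Valid = 1
  Lower bits of address select byte or word within cache line
  [Figure: in the selected set, the stored tag is compared (=?) with the address tag and the valid bit is checked (= 1?).]

Direct Mapped Cache Simulation
  t=1 s=2 b=1 (address bits: x xx x)
  M=16 byte addresses, B=2 bytes/line, S=4 sets, E=1 entry/set
  Address trace (reads): 0 [0000], 1 [0001], 13 [1101], 8 [1000], 0 [0000]

  (1) 0 [0000] (miss; the following read of 1 [0001] hits in the same line)
      set  v  tag  data
      0    1  0    m[1] m[0]

  (2) 13 [1101] (miss)
      set  v  tag  data
      0    1  0    m[1] m[0]
      2    1  1    m[13] m[12]

  (3) 8 [1000] (miss; replaces the line in set 0)
      set  v  tag  data
      0    1  1    m[9] m[8]
      2    1  1    m[13] m[12]

  (4) 0 [0000] (miss; replaces the line in set 0 again)
      set  v  tag  data
      0    1  0    m[1] m[0]
      2    1  1    m[13] m[12]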
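The simulation above can be replayed with a short C sketch. It hard-codes the slide's parameters (S = 4 sets, B = 2 bytes per line) and tracks only valid bits and tags, not data; the names `dm_line` and `dm_simulate` are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>

#define DM_SETS 4          /* S = 4 sets, so s = 2 */
#define DM_SET_BITS 2
#define DM_OFFSET_BITS 1   /* B = 2 bytes per line, so b = 1 */

typedef struct {
    bool valid;
    unsigned tag;
} dm_line;

/* Replay a read trace on an initially empty direct-mapped cache and
   return the number of hits. If hit_out is non-NULL, hit_out[i] records
   whether access i was a hit. */
int dm_simulate(const unsigned *trace, size_t n, bool *hit_out)
{
    dm_line cache[DM_SETS] = {0};
    int hits = 0;
    for (size_t i = 0; i < n; i++) {
        unsigned set = (trace[i] >> DM_OFFSET_BITS) & (DM_SETS - 1);
        unsigned tag = trace[i] >> (DM_OFFSET_BITS + DM_SET_BITS);
        bool hit = cache[set].valid && cache[set].tag == tag;
        if (!hit) {                 /* miss: the new line replaces the old one */
            cache[set].valid = true;
            cache[set].tag = tag;
        }
        if (hit_out)
            hit_out[i] = hit;
        hits += hit;
    }
    return hits;
}
```

Replaying the trace 0, 1, 13, 8, 0 gives a single hit (the read of address 1), since addresses 0 and 8 share set 0 and keep evicting each other.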

Example
  We have a direct-mapped cache with 16 cache lines; the line size is 16 bytes.
    What is the cache size?
    Integer variable A's address is 0x . Which line in the example cache may store variable A?
    The selected cache line holds the following information:
      V bit: valid
      Tag: 0x
      Data: 0x aabbcc
    Is this a cache hit?

One More Example
  We have a direct-mapped cache with 256 cache lines; the line size is 16 bytes.
    What is the cache size?
    Integer variable A's address is 0x . Which line in the example cache may store variable A?
    The selected cache line holds the following information:
      V bit: valid
      Tag: 0x40800
      Data: 0x aabbcc
    Is this a cache hit? What value is A?

Why Use Middle Bits as Index?
  High-Order Bit Indexing
    Adjacent memory lines would map to same cache entry
    Poor use of spatial locality
  Middle-Order Bit Indexing
    Consecutive memory lines map to different cache lines
    Can hold C-byte region of address space in cache at one time
  [Figure: memory lines mapped onto a 4-line cache under high-order bit indexing and under middle-order bit indexing.]

Why Use Middle Bits as Index?
  Example: Assume B is an array of double-precision floating-point values and we have a 4-line direct-mapped cache with a line size of 16 bytes. In the following loop, the variables a, i, and j are allocated to registers.

    for (i = 0; i < ...; i++)
        for (j = 0; j < 8; j++)
            a += B[j]*i;

  If middle bits are used as index, we can get a ~100% cache hit rate.
  If high bits are used as index, we get a ~50% hit rate.

Direct Mapped Cache Implementation (DECStation 3100)
  [Figure: the address is split into tag, set index, and byte offset. The set index selects one of 16,384 sets, each holding a valid bit, a 16-bit tag, and 32 bits of data. The stored tag is compared (=) with the address tag to produce the hit signal, and the data word is read out.]

Properties of Direct Mapped Caches
  Strength
    Minimal control hardware overhead
    Simple design
    (Relatively) easy to make fast
  Weakness
    Vulnerable to thrashing

Vector Product Example

    float dot_prod(float x[1024], float y[1024])
    {
        float sum = 0.0;
        int i;
        for (i = 0; i < 1024; i++)
            sum += x[i]*y[i];
        return sum;
    }

  Machine
    DECStation 5000
    MIPS processor with 64KB direct-mapped cache, 16 B line size
  Performance
    Good case: 24 cycles / element
    Bad case: 66 cycles / element

Thrashing Example
  [Figure: arrays x[0..1023] and y[0..1023], each drawn as cache lines of four elements: x[0] x[1] x[2] x[3], ..., x[1020] x[1021] x[1022] x[1023], and likewise for y.]
  Access one element from each array per iteration

Thrashing Example: Good Case
  Access Sequence
    Read x[0]: x[0], x[1], x[2], x[3] loaded
    Read y[0]: y[0], y[1], y[2], y[3] loaded
    Read x[1]: hit
    Read y[1]: hit
    ...
    2 misses / 8 reads
  Analysis
    x[i] and y[i] map to different cache lines
    Miss rate = 25%
      Two memory accesses / iteration
      On every 4th iteration have two misses
  Timing
    10 cycle loop time
    28 cycles / cache miss
    Average time / iteration = 10 + 0.25 * 2 * 28 = 24 cycles
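The timing estimate above is a one-line formula: loop time plus miss rate times accesses per iteration times miss penalty. A minimal sketch, with a hypothetical function name `avg_cycles`:

```c
/* Average cycles per loop iteration:
   loop time + miss rate * memory accesses per iteration * miss penalty.
   Hypothetical helper for the slide's back-of-the-envelope estimate. */
double avg_cycles(double loop_cycles, double miss_rate,
                  int accesses_per_iter, double miss_penalty)
{
    return loop_cycles + miss_rate * accesses_per_iter * miss_penalty;
}
```

With the slide's numbers, `avg_cycles(10, 0.25, 2, 28)` gives 24 cycles, which matches the measured good case; a 100% miss rate gives 66 cycles, which matches the bad case listed under Performance.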

Thrashing Example: Bad Case
  Access Pattern
    Read x[0]: x[0], x[1], x[2], x[3] loaded
    Read y[0]: y[0], y[1], y[2], y[3] loaded
    Read x[1]: x[0], x[1], x[2], x[3] loaded
    Read y[1]: y[0], y[1], y[2], y[3] loaded
    ...
    8 misses / 8 reads
  Analysis
    x[i] and y[i] map to same cache lines
    Miss rate = 100%
      Two memory accesses / iteration
      On every iteration have two misses
  Timing
    10 cycle loop time
    28 cycles / cache miss
    Average time / iteration = 10 + 1.0 * 2 * 28 = 66 cycles

When will X conflict with Y?

    float dot_prod(float x[], float y[], int n)
    { ...
        for (i = 0; i < n; i++)
            sum += x[i]*y[i];
    }

  General Case:
    The difference between X's and Y's starting addresses is a multiple of the direct-mapped cache size.
  Example:
    Assume the direct-mapped cache has 256 bytes and 16 lines.
    Arrays A and B are declared as float A[256], B[256]; and passed as X and Y to dot_prod().
    The address difference between A and B is 1024 bytes. If A's starting address is a, then B's starting address is a + 1024.
    The lower 10 bits of A[i]'s and B[i]'s addresses are identical, so they will have the same index bits.

When will X conflict with Y? (Cont.)

    float dot_prod(float x[], float y[], int n)
    { ...
        for (i = 0; i < n; i++)
            sum += x[i]*y[i];
    }

  Common Cases:
    1) Multiple dimensional arrays
         float A[64][64]
         dot_prod(A[i], A[i+1], n)
    2) Multiple arrays
         float A[1024], B[1024], ...
         dot_prod(A, B, n)
    3) Power-of-two strides
         float A[100000]
         dot_prod(&A[i], &A[i+256], n)
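The conflict condition above can be checked directly: two addresses collide in a direct-mapped cache when their line numbers are equal modulo the number of sets. A minimal sketch, assuming the cache size and line size are powers of two given in bytes; the name `same_set` is hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* True if addresses p and q select the same set of a direct-mapped cache
   with the given total size and line size (both in bytes).
   Hypothetical helper for illustration. */
bool same_set(uintptr_t p, uintptr_t q,
              unsigned cache_bytes, unsigned line_bytes)
{
    unsigned sets = cache_bytes / line_bytes;
    return (p / line_bytes) % sets == (q / line_bytes) % sets;
}
```

For the 256-byte, 16-line example, any two addresses 1024 bytes apart satisfy `same_set`, because 1024 is a multiple of 256. Addresses that are only one line (16 bytes) apart do not.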

Set Associative Cache
  Mapping of Memory Lines
    Each set can hold E lines
      Typically between 2 and 8
    Given memory line can map to any entry within its given set
  Eviction Policy
    Which line gets kicked out when a new line is brought in
    Commonly either Least Recently Used (LRU) or pseudo-random
      LRU: least-recently accessed (read or written) line gets evicted
  [Figure: set i holds LRU state plus lines 0 through E-1, each with a valid bit, a tag, and bytes 0 through B-1.]

Indexing into 2-Way Associative Cache
  Use middle s bits to select from among S = 2^s sets
  [Figure: sets 0 through S-1, each holding two lines with a valid bit, a tag, and bytes 0 through B-1. The physical address is split into tag (t bits), set index (s bits), and offset (b bits).]

2-Way Associative Cache Tag Matching
  Identifying Line
    Must have one of the tags match high order bits of address
    Must have Valid = 1 for this line
  Lower bits of address select byte or word within cache line
  [Figure: in the selected set, both stored tags are compared (=?) with the address tag and the valid bit of the matching line is checked (= 1?).]

2-Way Set Associative Simulation
  t=2 s=1 b=1 (address bits: xx x x)
  M=16 addresses, B=2 bytes/line, S=2 sets, E=2 entries/set
  Address trace (reads): 0 [0000], 1 [0001], 13 [1101], 8 [1000], 0 [0000]

  0 (miss)
      v  tag  data
      1  00   m[1] m[0]

  13 (miss)
      v  tag  data
      1  00   m[1] m[0]
      1  11   m[13] m[12]

  8 (miss) (LRU replacement)
      v  tag  data
      1  10   m[9] m[8]
      1  11   m[13] m[12]

  0 (miss) (LRU replacement)
      v  tag  data
      1  10   m[9] m[8]
      1  00   m[1] m[0]

2-Way Set Associative Simulation (second trace)
  t=2 s=1 b=1 (address bits: xx x x)
  M=16 addresses, B=2 bytes/line, S=2 sets, E=2 entries/set
  Address trace (reads): 0 [0000], 1 [0001], 14 [1110], 8 [1000], 0 [0000]

  0 (miss)
      set  v  tag  data
      0    1  00   m[1] m[0]

  14 (miss)
      set  v  tag  data
      0    1  00   m[1] m[0]
      1    1  11   m[15] m[14]

  8 (miss)
      set  v  tag  data
      0    1  00   m[1] m[0]
      0    1  10   m[9] m[8]
      1    1  11   m[15] m[14]

  0 (hit)
      set  v  tag  data
      0    1  00   m[1] m[0]
      0    1  10   m[9] m[8]
      1    1  11   m[15] m[14]

Two-Way Set Associative Cache Implementation
  Set index selects a set from the cache
  The two tags in the set are compared in parallel
  Data is selected based on the tag result
  [Figure: the set index selects one entry (valid bit, cache tag, cache data) from each of the two ways. Each stored tag is compared with the address tag; the two compare outputs drive Sel0 and Sel1 of a mux that picks the cache line, and their OR produces Hit.]
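Both traces above can be replayed with a small LRU simulator. This is a sketch that hard-codes the slide's parameters (S = 2 sets, E = 2 entries per set, B = 2 bytes per line) and implements LRU with a per-line timestamp, which is one of several possible ways to track recency; the names `sa_line` and `sa_simulate` are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>

#define SA_SETS 2          /* S = 2 sets, so s = 1 */
#define SA_SET_BITS 1
#define SA_WAYS 2          /* E = 2 entries per set */
#define SA_OFFSET_BITS 1   /* B = 2 bytes per line, so b = 1 */

typedef struct {
    bool valid;
    unsigned tag;
    unsigned last_used;    /* time of most recent access, for LRU */
} sa_line;

/* Replay a read trace on an initially empty 2-way set associative cache
   with LRU replacement and return the number of hits. */
int sa_simulate(const unsigned *trace, size_t n)
{
    sa_line cache[SA_SETS][SA_WAYS] = {0};
    int hits = 0;
    for (size_t i = 0; i < n; i++) {
        unsigned set = (trace[i] >> SA_OFFSET_BITS) & (SA_SETS - 1);
        unsigned tag = trace[i] >> (SA_OFFSET_BITS + SA_SET_BITS);
        sa_line *victim = &cache[set][0];
        sa_line *found = NULL;
        for (int w = 0; w < SA_WAYS; w++) {
            sa_line *line = &cache[set][w];
            if (line->valid && line->tag == tag) {
                found = line;
                break;
            }
            /* Prefer an invalid line; otherwise the least recently used. */
            if (victim->valid &&
                (!line->valid || line->last_used < victim->last_used))
                victim = line;
        }
        if (found) {
            hits++;
        } else {               /* miss: fill the chosen victim line */
            found = victim;
            found->valid = true;
            found->tag = tag;
        }
        found->last_used = (unsigned)i + 1;
    }
    return hits;
}
```

The trace 0, 1, 13, 8, 0 produces one hit, because three different lines compete for the two entries of set 0. Replacing 13 with 14 moves that line to set 1, and the final read of 0 becomes a hit, for two hits in total.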

Fully Associative Cache
  Mapping of Memory Lines
    Cache consists of single set holding E lines
    Given memory line can map to any line in set
    Only practical for small caches
  [Figure: the entire cache is one set holding LRU state plus lines 0 through E-1, each with a valid bit, a tag, and bytes 0 through B-1.]

Fully Associative Cache Tag Matching
  Identifying Line
    Must check all of the tags for match
    Must have Valid = 1 for this line
  Lower bits of address select byte or word within cache line
  [Figure: every stored tag is compared (=?) with the address tag and the valid bit is checked (= 1?). The physical address is split into tag (t bits) and offset (b bits) only.]

Fully Associative Cache Simulation
  t=3 s=0 b=1 (address bits: xxx x)
  M=16 addresses, B=2 bytes/line, S=1 set, E=4 entries/set
  Address trace (reads): 0 [0000], 1 [0001], 13 [1101], 8 [1000], 0 [0000]

  (1) 0 (miss)
      v  tag  data
      1  000  m[1] m[0]

  (2) 13 (miss)
      v  tag  data
      1  000  m[1] m[0]
      1  110  m[13] m[12]

  (3) 8 (miss)
      v  tag  data
      1  000  m[1] m[0]
      1  110  m[13] m[12]
      1  100  m[9] m[8]

Write Policy
  What happens when processor writes to the cache?
  Should memory be updated as well?
  Write Through:
    Store by processor updates cache and memory.
    Memory always consistent with cache
    Never need to store from cache to memory
    ~2X more loads than stores
  [Figure: Processor sends stores to both Cache and Memory; loads go from Memory to Cache to Processor.]

Write Strategies (Cont.)
  Write Back:
    Store by processor only updates cache line
    Modified line written to memory only when it is evicted
      Requires "dirty bit" for each line
        Set when line in cache is modified
        Indicates that line in memory is stale
    Memory not always consistent with cache
  [Figure: Processor stores to and loads from Cache; Cache writes back to Memory and loads from Memory.]

Write Strategies (Cont.)
  Advantages of Write Through
    Cache and lower level memory are consistent (important for I/O)
    Read misses don't result in writes to the lower memory
    Easier to implement
  Advantages of Write Back
    Multiple writes within a line require only one write to the lower level memory.
    Writes occur at the speed of cache memory
    Minimize traffic to the lower level memory (could be very significant, especially in multiprocessor systems)
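The traffic difference between the two policies can be illustrated by counting writes to memory for a sequence of stores. The sketch below assumes a small direct-mapped cache (4 lines of 16 bytes) with write allocate under both policies, and it counts a final flush of dirty lines for write back; all of these are illustrative assumptions, and `wp_line` and `memory_writes` are hypothetical names.

```c
#include <stdbool.h>
#include <stddef.h>

#define WP_LINES 4         /* assumed cache: 4 lines, direct mapped */
#define WP_LINE_BYTES 16   /* assumed line size */

typedef struct {
    bool valid;
    bool dirty;
    unsigned tag;
} wp_line;

/* Count writes to memory caused by a sequence of store addresses.
   write_back == false: write through, every store also updates memory.
   write_back == true:  a modified line is written only when it is
   evicted, plus once more if it is still dirty at the end (final flush). */
int memory_writes(const unsigned *stores, size_t n, bool write_back)
{
    wp_line cache[WP_LINES] = {0};
    int writes = 0;
    for (size_t i = 0; i < n; i++) {
        unsigned line_addr = stores[i] / WP_LINE_BYTES;
        wp_line *l = &cache[line_addr % WP_LINES];
        unsigned tag = line_addr / WP_LINES;
        if (!l->valid || l->tag != tag) {      /* miss: allocate the line */
            if (write_back && l->valid && l->dirty)
                writes++;                      /* evict the modified line */
            l->valid = true;
            l->tag = tag;
            l->dirty = false;
        }
        if (write_back)
            l->dirty = true;                   /* memory copy is now stale */
        else
            writes++;                          /* write through to memory */
    }
    for (int k = 0; write_back && k < WP_LINES; k++)
        if (cache[k].valid && cache[k].dirty)
            writes++;                          /* final flush */
    return writes;
}
```

Four stores to the same line (addresses 0, 4, 8, 12) cost four memory writes with write through but only one with write back. When stores alternate between two lines that conflict (addresses 0, 64, 0), write back saves nothing in this model: both policies perform three writes.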

Multi-Level Caches
  Options: separate data and instruction caches, or a unified cache
  [Figure: Processor (regs, TLB, L1 Dcache, L1 Icache) -> L2 cache -> Memory -> disk]

               regs     L1 caches   L2 cache       Memory        disk
  size:        256 B    ... KB      1-8 MB SRAM    1.0 GB DRAM   200 GB
  speed:       ... ns   1-2 ns      6 ns           60 ns         6-9 ms
  $/Mbyte:                          $20/MB         $0.12/MB      $0.002/MB
  line size:   8 B      32 B        32 B           8 KB

  Moving away from the processor: larger, slower, cheaper
  Moving away from the processor: larger line size, higher associativity, more likely to write back

Alpha Hierarchy
  Regs.
  L1 Data: 1 cycle latency, 8KB, direct mapped, write-through, dual ported, 32B lines
  L1 Instruction: 8KB, direct mapped, 32B lines
  L2 Unified: 8 cycle latency, 96KB, 3-way assoc., write-back, write allocate, 32B/64B lines
  L3 Unified: 1M-64M, direct mapped, write-back, write allocate, 32B or 64B lines
  Main Memory: up to 1TB
  (Figure label: Processor Chip)
  Improving memory performance was a main design goal
    Earlier Alpha CPUs starved for data

Pentium III Xeon Hierarchy
  Regs.
  L1 Data: 1 cycle latency, 16KB, 4-way, write-through, 32B lines
  L1 Instruction: 16KB, 4-way, 32B lines
  L2 Unified: 512K, 4-way, write-back, write-allocate, 32B lines
  Main Memory: up to 4GB
  (Figure label: Processor Chip)

Itanium-II Hierarchy
  Int Regs., fp Regs.
  L1 D$: 1 cycle latency, 16KB, 4-way, write-through, dual ported, 32B lines
  L1 I$: 16KB, 4-way, 32B lines
  L2 Unified: 5 cycle latency, 256KB, 8-way assoc., write-back, 64B lines
  L3 Unified: 3M, 12 cycle latency, write-back, 64B lines
  Main Memory
  (Figure label: Processor Chip)
  Madison will have 6MB of L3

Caching as a General Principle
  Levels, from smaller and faster to larger, slower, and cheaper storage devices:
    L0: registers
    L1: on-chip L1 cache (SRAM)
    L2: off-chip L2 cache (SRAM)
    L3: main memory (DRAM)
    L4: local secondary storage (local disks)
    L5: remote secondary storage (distributed file systems, Web servers)
  CPU registers hold words retrieved from L1 cache.
  L1 cache holds cache lines retrieved from L2.
  L2 cache holds cache lines retrieved from memory.
  Main memory holds disk blocks retrieved from local disks.
  Local disks hold files retrieved from disks on remote network servers.

Forms of Caching

  Cache Type          What Cached            Where Cached     Latency (cycles)   Managed By
  Registers           4-byte word            CPU Registers                       Compiler
  TLB                 Address Translations   On-Chip TLB                         Hardware
  SRAM                32-byte block          On-Chip L1                          Hardware
  SRAM                32-byte block          Off-Chip L2                         Hardware
  Virtual Memory      4-KB page              Main Memory                         MMU+OS
  File Buffer         Buffered Files         Main Memory                         OS
  Network File Cache  Parts of Files         Processor Disk   ...,000,000        AFS Client
  Browser Cache       Web Pages              Processor Disk   10,000,000         Browser
  Web Cache           Web Pages              Server Disks     1,000,000,000      Akamai Server
