Slide Set 5 for ENCM 501 in Winter Term, 2017
Steve Norman, PhD, PEng
Electrical & Computer Engineering
Schulich School of Engineering
University of Calgary
Winter Term, 2017

ENCM 501 W17 Lectures: Slide Set 5 slide 2/37 Contents: Review of example from Slide Set 4; SRAM cells and CAM cells; Direct-mapped caches; More about set-associative cache organization; Fully-associative caches; Options for handling writes to caches; Multi-level caches; Caches and Virtual Memory.

ENCM 501 W17 Lectures: Slide Set 5 slide 3/37 Outline of Slide Set 5: Review of example from Slide Set 4; SRAM cells and CAM cells; Direct-mapped caches; More about set-associative cache organization; Fully-associative caches; Options for handling writes to caches; Multi-level caches; Caches and Virtual Memory.

ENCM 501 W17 Lectures: Slide Set 5 slide 4/37 Review: Example computer with only one level of cache and no virtual memory. [Block diagram: CORE, L1 I-CACHE, L1 D-CACHE, DRAM CONTROLLER, DRAM MODULES.] We're looking at this simple system because it helps us to think about cache design and performance issues while avoiding the complexity of real systems like the Intel i7 shown in textbook Figure 2.21.

ENCM 501 W17 Lectures: Slide Set 5 slide 5/37 Review: Example data cache organization. [Diagram: an 8-bit index feeds an 8-to-256 decoder that selects one of 256 sets (set 0 through set 255); there are two ways, way 0 and way 1; block status: 1 valid bit and 1 dirty bit per block; tag: one 18-bit stored tag per block; data: one 64-byte (512-bit) data block; set status: 1 LRU bit per set.] This could fit within our simple example hierarchy, but is also not much different from some current L1 D-cache designs.
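
As a quick sanity check on those numbers, here is a small C sketch, not taken from the slides, that splits an address into tag, index, and block offset for this organization. It assumes 32-bit physical addresses, which is consistent with 18 tag bits + 8 index bits + 6 offset bits; the example address is the one used in the index-conflict scenario later in this slide set.

#include <stdio.h>

int main(void) {
    unsigned addr   = 0x0804fff0u;            /* example address from a later slide */
    unsigned offset =  addr        & 0x3fu;   /* bits  5..0:  6-bit block offset    */
    unsigned index  = (addr >> 6)  & 0xffu;   /* bits 13..6:  8-bit index           */
    unsigned tag    =  addr >> 14;            /* bits 31..14: 18-bit tag            */
    printf("tag = 0x%x, index = %u, offset = %u\n", tag, index, offset);
    return 0;
}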

ENCM 501 W17 Lectures: Slide Set 5 slide 6/37 Valid bits in caches going 1 → 0. I hope it's obvious why V bits for blocks go 0 → 1. But why might V bits go 1 → 0? In other words, why does it sometimes make sense to invalidate one or more cache blocks? Here are two big reasons. (There are likely some other good reasons.) DMA: direct memory access. Instruction writes by O/S kernels and by programs that write their own instructions. Let's make some notes about each of these reasons.

ENCM 501 W17 Lectures: Slide Set 5 slide 7/37 Outline of Slide Set 5: Review of example from Slide Set 4; SRAM cells and CAM cells; Direct-mapped caches; More about set-associative cache organization; Fully-associative caches; Options for handling writes to caches; Multi-level caches; Caches and Virtual Memory.

ENCM 501 W17 Lectures: Slide Set 5 slide 8/37 Storage cells in the example cache. Data blocks are implemented with SRAM cells. The tradeoffs in cell design relate to speed, chip area, and energy use per read or write. Tags and status bits might be SRAM cells or might be CAM ("content addressable memory") cells. A CMOS CAM cell uses the same 6-transistor structure as an SRAM cell for reads, writes, and holding a stored bit for long periods of time during which there are neither reads nor writes. A CAM cell also has 3 or 4 extra transistors that help in determining whether the bit pattern in a group of CAM cells (e.g., a stored tag) matches some other bit pattern (e.g., a search tag).

ENCM 501 W17 Lectures: Slide Set 5 slide 9/37 CAM cells organized to make a J-bit stored tag. [Diagram: a row of J CAM cells sharing wordline WL_i and matchline MATCH_i, with complementary bitline pairs BL_{J-1}, BL_{J-2}, ..., BL_0.] For reads or writes, the wordline and bitlines play the same roles they do in a row of an SRAM array. To check for a match, the wordline is held LOW, and the search tag is applied to the bitlines. If every search tag bit matches the corresponding stored tag bit, the matchline stays HIGH; if there is even a single mismatch, the matchline goes LOW.

ENCM 501 W17 Lectures: Slide Set 5 slide 10/37 CAM cells versus SRAM cells for stored tags With CAM cells, tag comparisons can be done in place. With SRAM cells, stored tags would have to be read via bitlines to comparator circuits outside the tag array, which is a slower process. (Schematics of caches in the textbook tend to show tag comparison done outside of tag arrays, but that is likely done to show that a comparison is needed, not to indicate physical design.) CAM cells are larger than SRAM cells. But the total area needed for CAM-cell tag arrays will still be much smaller than the total area needed for SRAM data blocks. Would it make sense to use CAM cells for V (valid) bits?

ENCM 501 W17 Lectures: Slide Set 5 slide 11/37 Outline of Slide Set 5: Review of example from Slide Set 4; SRAM cells and CAM cells; Direct-mapped caches; More about set-associative cache organization; Fully-associative caches; Options for handling writes to caches; Multi-level caches; Caches and Virtual Memory.

ENCM 501 W17 Lectures: Slide Set 5 slide 12/37 Cache line is a synonym for cache block. Hennessy and Patterson are fairly consistent in their use of the term "cache block", and a lot of other literature uses that term as well. However, the term "cache line", which means the same thing, is also in wide use. So in other literature, you may read things like... In a 4-way set-associative cache, an index finds a set containing 4 cache lines. In a direct-mapped cache, there is one cache line per index. A cache miss, even if it is for access to a single byte, will result in the transfer of an entire cache line.

ENCM 501 W17 Lectures: Slide Set 5 slide 13/37 Direct-mapped caches. A direct-mapped cache can be thought of as a special case of a set-associative cache, in which there is only one way. For a given cache capacity, a direct-mapped cache is easier to build than an N-way set-associative cache with N ≥ 2: no logic is required to find the correct way for data transfer after a hit is detected; no logic is needed to decide which block in a set to replace in handling a miss. Direct-mapped caches may also be faster and more energy-efficient. However, direct-mapped caches are vulnerable to index conflicts (sometimes called index collisions).

ENCM 501 W17 Lectures: Slide Set 5 slide 14/37 Let's think about changing the design of our example 32 KB 2-way set-associative data cache to a direct-mapped organization. If the capacity stays at 32 KB and the block size remains at 64 bytes, how does the change in organization affect the number of sets, the widths of indexes and tags, and the number of memory cells needed?
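
One way to work through the set and width arithmetic, as a rough sketch rather than an official answer, is shown below. It assumes 32-bit addresses, which matches the 18-bit tags of the 2-way design (18 + 8 + 6 = 32). Changing ways back to 2 recovers the original numbers (256 sets, 8-bit index, 18-bit tag); the memory-cell count is left as the exercise asks.

#include <stdio.h>

/* x is assumed to be a power of 2 */
static unsigned log2u(unsigned x) {
    unsigned n = 0;
    while (x > 1) { x >>= 1; n++; }
    return n;
}

int main(void) {
    unsigned capacity = 32 * 1024, block_size = 64, ways = 1, addr_bits = 32;
    unsigned sets        = capacity / (block_size * ways);       /* 512     */
    unsigned offset_bits = log2u(block_size);                    /* 6 bits  */
    unsigned index_bits  = log2u(sets);                          /* 9 bits  */
    unsigned tag_bits    = addr_bits - index_bits - offset_bits; /* 17 bits */
    printf("sets = %u, index = %u bits, tag = %u bits\n",
           sets, index_bits, tag_bits);
    return 0;
}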

ENCM 501 W17 Lectures: Slide Set 5 slide 15/37 Data cache index conflict example. Consider this sketch of a C function:

int g1, g2, g3, g4;

void func(int *x, int n) {
    int loc[10], k;
    while ( condition ) {
        /* make accesses to g1, g2, g3, g4 and loc */
    }
}

What will happen in the following scenario? The addresses of g1 to g4 are 0x0804 fff0 to 0x0804 fffc; the address of loc[0] is 0xbfbd fff0.
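
To see why this scenario is painful, here is a rough check (not the slide's worked answer) of which set each address maps to in a 32 KB direct-mapped cache with 64-byte blocks, so 512 sets and a 9-bit index taken from bits 14..6 of the address:

#include <stdio.h>

static unsigned dm_index(unsigned addr) {
    return (addr >> 6) & 0x1ffu;   /* 9-bit index: 512 sets of 64-byte blocks */
}

int main(void) {
    unsigned g1   = 0x0804fff0u;   /* g1; g2..g4 sit in the same 64-byte block */
    unsigned loc0 = 0xbfbdfff0u;   /* loc[0]; loc[0..3] share this block       */
    printf("index for the g1..g4 block:    %u\n", dm_index(g1));
    printf("index for the loc[0..3] block: %u\n", dm_index(loc0));
    /* Both print 511, so in a direct-mapped cache each access to one of these
       blocks evicts the other, and the loop can miss on nearly every access to
       these variables. In the 2-way example cache, both blocks map to set 255
       but can live in the two ways at the same time. */
    return 0;
}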

ENCM 501 W17 Lectures: Slide Set 5 slide 16/37 Instruction cache index conflict example. Suppose a program spends much of its time in a loop within function f...

void f(double *x, double *y, int n) {
    int k;
    for (k = 0; k < n; k++)
        y[k] = g(x[k]) + h(x[k]);
}

Suppose that g and h are small, simple functions that don't call other functions. What kind of bad luck could cause huge numbers of misses in a direct-mapped instruction cache?

ENCM 501 W17 Lectures: Slide Set 5 slide 17/37 Outline of Slide Set 5: Review of example from Slide Set 4; SRAM cells and CAM cells; Direct-mapped caches; More about set-associative cache organization; Fully-associative caches; Options for handling writes to caches; Multi-level caches; Caches and Virtual Memory.

ENCM 501 W17 Lectures: Slide Set 5 slide 18/37 Motivation for set-associative caches (1) Qualitatively: In our example of data cache index conflicts, the conflicts go away if the cache is changed from direct-mapped to 2-way set-associative. In our example of instruction cache index conflicts, in the worst case, the conflicts go away if the cache is changed from direct-mapped to 4-way set-associative. Quantitatively, see Figure B.8 on page B-24 of the textbook. Conflict misses are a big problem in direct-mapped caches. Moving from direct-mapped to 2-way to 4-way to 8-way reduces the conflict miss rate at each step.

ENCM 501 W17 Lectures: Slide Set 5 slide 19/37 Motivation for set-associative caches (2) Detailed studies show that 4-way set-associativity is good enough to eliminate almost all conflict misses. But many practical cache designs are 8- or even 16-way set-associative. There must be reasons for this other than the desire to avoid conflict misses. We'll come back to this question later.

ENCM 501 W17 Lectures: Slide Set 5 slide 20/37 Replacement strategies in set-associative caches. Let N be the number of ways. With N = 2, LRU replacement is easy to implement: a single bit in each set can track which block should be replaced on a miss in that set. Exact LRU replacement is harder to implement with N > 2. LRU status bits would have to somehow encode a list of least-to-most-recent accesses within a set. However, choice of replacement strategy among various reasonable options seems to have very little effect on miss rate; see Figure B.4 on page B-10 of the textbook. So we're not going to study cache block replacement strategy in detail in ENCM 501.
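
For concreteness, here is a small software illustration (not a description of any real circuit) of how a single LRU bit per set might be maintained in a 2-way set-associative cache. The 256-set size matches the example cache from earlier in this slide set; everything else is an assumption made for the sketch.

#include <stdio.h>
#include <stdint.h>

#define NUM_SETS 256

static uint8_t lru_bit[NUM_SETS];   /* per set: which way to replace on a miss */

/* On a hit to 'way' in 'set', the other way becomes the least recently used. */
static void update_on_hit(unsigned set, unsigned way) {
    lru_bit[set] = (uint8_t)(way ^ 1u);
}

/* On a miss in 'set', replace the LRU way; the refilled way is then MRU. */
static unsigned choose_victim(unsigned set) {
    unsigned victim = lru_bit[set];
    lru_bit[set] = (uint8_t)(victim ^ 1u);
    return victim;
}

int main(void) {
    update_on_hit(0, 0);   /* hit in way 0 of set 0, so way 1 is now LRU */
    printf("victim on next miss in set 0: way %u\n", choose_victim(0));   /* 1 */
    return 0;
}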

ENCM 501 W17 Lectures: Slide Set 5 slide 21/37 Outline of Slide Set 5: Review of example from Slide Set 4; SRAM cells and CAM cells; Direct-mapped caches; More about set-associative cache organization; Fully-associative caches; Options for handling writes to caches; Multi-level caches; Caches and Virtual Memory.

ENCM 501 W17 Lectures: Slide Set 5 slide 22/37 Fully-associative caches A fully-associative cache can be thought of as an N-way set-associative cache in which N is equal to the number of blocks. In this way of thinking, how many sets are there in a fully-associative cache? What is the width of an index? For energy use, how would hit detection in a fully-associative cache compare with hit detection in a direct-mapped or (small N) set-associative cache with the same capacity?

ENCM 501 W17 Lectures: Slide Set 5 slide 23/37 Outline of Slide Set 5: Review of example from Slide Set 4; SRAM cells and CAM cells; Direct-mapped caches; More about set-associative cache organization; Fully-associative caches; Options for handling writes to caches; Multi-level caches; Caches and Virtual Memory.

ENCM 501 W17 Lectures: Slide Set 5 slide 24/37 Different options for handling writes. See "Q4: What Happens on a Write?" on pages B-10 to B-12 of the textbook for details. There is not much I can put into lecture slides to improve on the clarity of that material. Note that the cache access examples presented at the end of Slide Set 4 assume a write-back policy, using write allocate in the case of a write miss.

ENCM 501 W17 Lectures: Slide Set 5 slide 25/37 Write buffers: Location and purpose. A question that could arise from recent lecture material: In a system with one level of cache, is a write buffer part of the L1 D-cache or part of the DRAM controller? Partial answer: Neither the textbook nor Web search results are perfectly clear about this. But it's probably simpler to think of the write buffer as part of the L1 D-cache. Regardless, it's more important to understand the purpose of a write buffer: to securely hold pending main-memory updates while minimizing the need for the processor core to stall.

ENCM 501 W17 Lectures: Slide Set 5 slide 26/37 Write buffers are hardware queues. In general, a write buffer can be thought of as a collection of write buffer entries, in which each entry is a triple: (address, data size, data). The key idea is that a write buffer holds multiple pending writes, in case there is a flurry of cache misses, each causing replacement of a dirty block in a write-back cache; a simple flurry of writes to a write-through cache; or some other situation that generates writes to main memory faster than main memory can receive them. A processor should only stall on a write if a write buffer is full (has as many pending writes as it can hold), which should be an unusual event.
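
As an illustration of the queue idea, here is a rough software model of a write buffer holding (address, data size, data) entries. It is not from the slides; the 8-entry depth and 64-bit data field are arbitrary choices for the sketch, and a real entry would more likely hold a whole block.

#include <stdio.h>
#include <stdint.h>
#include <stdbool.h>

#define WB_DEPTH 8   /* illustration value, not from the slides */

struct wb_entry {
    uint64_t address;
    unsigned size;       /* bytes: e.g. 1, 2, 4, 8, or a whole 64-byte block */
    uint64_t data;
};

struct write_buffer {
    struct wb_entry entries[WB_DEPTH];
    unsigned head, tail, count;   /* simple ring-buffer bookkeeping */
};

/* Core side: queue a pending write; returns false (core must stall) only
   when the buffer already holds as many pending writes as it can. */
static bool wb_push(struct write_buffer *wb, struct wb_entry e) {
    if (wb->count == WB_DEPTH) return false;
    wb->entries[wb->tail] = e;
    wb->tail = (wb->tail + 1) % WB_DEPTH;
    wb->count++;
    return true;
}

/* Memory side: the DRAM controller drains the oldest pending write. */
static bool wb_pop(struct write_buffer *wb, struct wb_entry *out) {
    if (wb->count == 0) return false;
    *out = wb->entries[wb->head];
    wb->head = (wb->head + 1) % WB_DEPTH;
    wb->count--;
    return true;
}

int main(void) {
    struct write_buffer wb = {0};
    struct wb_entry e = { 0x0804fff0u, 4, 42 }, out;
    wb_push(&wb, e);
    if (wb_pop(&wb, &out))
        printf("drained a %u-byte write to 0x%llx\n",
               out.size, (unsigned long long)out.address);
    return 0;
}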

ENCM 501 W17 Lectures: Slide Set 5 slide 27/37 Outline of Slide Set 5: Review of example from Slide Set 4; SRAM cells and CAM cells; Direct-mapped caches; More about set-associative cache organization; Fully-associative caches; Options for handling writes to caches; Multi-level caches; Caches and Virtual Memory.

ENCM 501 W17 Lectures: Slide Set 5 slide 28/37 Multi-level caches A common configuration: A core has an L1 I-cache and an L1 D-cache, each with 32 KB capacity, and a unified L2 cache with 256 KB capacity. So there is a total of 320 KB of cache capacity for this core. Why is it not better to use the chip area for a 160 KB L1 I-cache and a 160 KB L1 D-cache? Give two reasons, one related to SRAM array design issues, and another related to memory use patterns of software programs. The same ideas apply to systems with 3 levels of cache. Other complicated design concerns come up when multiple cores share access to a single L2 or L3 cache.

ENCM 501 W17 Lectures: Slide Set 5 slide 29/37 AMAT for a 2-level cache system. A formula from the textbook: AMAT = L1 hit time + L1 miss rate × (L2 hit time + L2 miss rate × L2 miss penalty). What is AMAT if there are no cache misses at all? What is AMAT if there are some L1 misses but no L2 misses? What are design criteria for the L1 and L2 caches, given that the goal is to minimize AMAT?

ENCM 501 W17 Lectures: Slide Set 5 slide 30/37 This definition (not from the textbook!) is incorrect: For a system with two levels of caches, the L2 hit rate of a program is the number of L2 hits divided by the total number of memory accesses. What is a correct definition for L2 hit rate, compatible with the formula for AMAT?

ENCM 501 W17 Lectures: Slide Set 5 slide 31/37 L2 cache design tradeoffs An L1 cache must keep up with a processor core. That is a challenge to a circuit design team but keeps the problem simple: If a design is too slow, it fails. For an L2 cache, the tradeoffs are more complex: Increasing capacity improves L2 miss rate but makes L2 hit time, chip area and (probably) energy use worse. Decreasing capacity improves L2 hit time, chip area, and (probably) energy use, but makes L2 miss rate worse. Suppose L1 hit time = 1 cycle, L1 miss rate = 0.020, L2 miss penalty = 100 cycles. Which is better, considering AMAT only, not chip area or energy? (a) L2 hit time = 10 cycles, L2 miss rate = 0.50 (b) L2 hit time = 12 cycles, L2 miss rate = 0.40
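
As a sketch of how the comparison can be checked, the snippet below simply plugs the numbers above into the AMAT formula; the arithmetic is a check of the exercise, not an official answer from the slides.

#include <stdio.h>

static double amat(double l1_hit, double l1_miss_rate,
                   double l2_hit, double l2_miss_rate, double l2_miss_penalty) {
    return l1_hit + l1_miss_rate * (l2_hit + l2_miss_rate * l2_miss_penalty);
}

int main(void) {
    double a = amat(1.0, 0.020, 10.0, 0.50, 100.0);   /* option (a) */
    double b = amat(1.0, 0.020, 12.0, 0.40, 100.0);   /* option (b) */
    printf("(a) AMAT = %.2f cycles\n", a);   /* 1 + 0.020 * (10 + 50) = 2.20 */
    printf("(b) AMAT = %.2f cycles\n", b);   /* 1 + 0.020 * (12 + 40) = 2.04 */
    return 0;
}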

ENCM 501 W17 Lectures: Slide Set 5 slide 32/37 3 C's of cache misses: compulsory, capacity, conflict. It's useful to think about the causes of cache misses. Compulsory misses (sometimes called "cold misses") happen on access to instructions or data that have never been in a cache. Examples include: instruction fetches in a program that has just been copied from disk to main memory; data reads of information that has just been copied from a disk controller or network interface to main memory. Compulsory misses would happen even if a cache had the same capacity as the main memory the cache was supposed to mirror.

ENCM 501 W17 Lectures: Slide Set 5 slide 33/37 Capacity misses This kind of miss arises because a cache is not big enough to contain all the instructions and/or data a program accesses while it runs. Capacity misses for a program can be counted by simulating a program run with a fully-associative cache of some fixed capacity. Since instruction and data blocks can be placed anywhere within a fully-associative cache, it s reasonable to assume that any miss on access to a previously accessed instruction or data item in a fully-associative cache occurs because the cache is not big enough. Why is this a good but not perfect approximation?
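
Here is a rough sketch, not from the slides or the textbook, of what such a simulation might look like: a tiny trace-driven model of a fully-associative cache with LRU replacement, counting misses for a sequence of block addresses. The 4-block capacity and the trace are made-up illustration values.

#include <stdio.h>

#define CAPACITY_BLOCKS 4   /* made-up capacity: a 4-block fully-associative cache */

static unsigned long blk_addr[CAPACITY_BLOCKS];
static unsigned long last_use[CAPACITY_BLOCKS];
static int valid[CAPACITY_BLOCKS];

/* Returns 1 on a hit, 0 on a miss; on a miss, an invalid or LRU block is replaced. */
static int access_block(unsigned long blk, unsigned long now) {
    for (int i = 0; i < CAPACITY_BLOCKS; i++) {
        if (valid[i] && blk_addr[i] == blk) {
            last_use[i] = now;
            return 1;
        }
    }
    int victim = 0;
    for (int i = 0; i < CAPACITY_BLOCKS; i++) {
        if (!valid[i]) { victim = i; break; }
        if (last_use[i] < last_use[victim]) victim = i;
    }
    blk_addr[victim] = blk;
    last_use[victim] = now;
    valid[victim] = 1;
    return 0;
}

int main(void) {
    /* Made-up trace of block numbers (e.g. address >> 6 for 64-byte blocks). */
    unsigned long trace[] = { 1, 2, 3, 4, 5, 1, 2, 3, 4, 5 };
    unsigned long n = sizeof trace / sizeof trace[0], misses = 0;
    for (unsigned long t = 0; t < n; t++)
        misses += access_block(trace[t], t) ? 0 : 1;
    printf("%lu misses in %lu accesses\n", misses, n);
    return 0;
}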

ENCM 501 W17 Lectures: Slide Set 5 slide 34/37 Conflict misses. Conflict misses (also called "collision misses") occur in direct-mapped and N-way set-associative caches because too many accesses to memory generate a common index. In the absence of a 4th kind of miss (coherence misses, which can happen when multiple processors share access to a memory system) we can write: conflict misses = total misses - compulsory misses - capacity misses. The main idea behind increasing set-associativity is to reduce conflict misses without the time and energy problems of a fully-associative cache.

ENCM 501 W17 Lectures: Slide Set 5 slide 35/37 3 C's: Data from experiments. Textbook Figure B.8 has a lot of data; it's unreasonable to try to jam all of that data into a few lecture slides. So here's a subset of the data, for 8 KB capacity. N is the degree of associativity, and miss rates are in misses per thousand accesses.

  N   compulsory   capacity   conflict
  1      0.1          44         24
  2      0.1          44          5
  4      0.1          44       < 0.5
  8      0.1          44       < 0.5

This is real data from practical applications. It is worthwhile to study the table to see what general patterns emerge.

ENCM 501 W17 Lectures: Slide Set 5 slide 36/37 Outline of Slide Set 5: Review of example from Slide Set 4; SRAM cells and CAM cells; Direct-mapped caches; More about set-associative cache organization; Fully-associative caches; Options for handling writes to caches; Multi-level caches; Caches and Virtual Memory.

ENCM 501 W17 Lectures: Slide Set 5 slide 37/37 Caches and Virtual Memory Both are essential systems to support applications running on modern operating systems. As mentioned in Slide Set 4, it really helps to keep in mind what problems are solved by caches and what very different problems are solved by virtual memory. Caches are an impressive engineering workaround for difficult facts about relative latencies of memory arrays. Virtual memory (VM) is a concept, a great design idea, that solves a wide range of problems for computer systems in which multiple applications are sharing resources. Slide Set 6 will be about virtual memory and interactions between cache systems and virtual memory systems.