Slide Set 9 for ENCM 369 Winter 2018 Section 01 Steve Norman, PhD, PEng Electrical & Computer Engineering Schulich School of Engineering University of Calgary March 2018

ENCM 369 Winter 2018 Section 01 Slide Set 9 slide 2/71 Contents Introduction to cache design Direct-mapped caches The Set-Bit Conflict Problem in Direct-Mapped Caches Set-Associative Caches Fully-Associative Caches Multi-Word Blocks Replacement Policies Introduction to cache write policies Write buffers Writethrough and writeback policies Multi-level cache systems

ENCM 369 Winter 2018 Section 01 Slide Set 9 slide 3/71 Outline of Slide Set 9 Introduction to cache design Direct-mapped caches The Set-Bit Conflict Problem in Direct-Mapped Caches Set-Associative Caches Fully-Associative Caches Multi-Word Blocks Replacement Policies Introduction to cache write policies Write buffers Writethrough and writeback policies Multi-level cache systems

ENCM 369 Winter 2018 Section 01 Slide Set 9 slide 4/71 Must-haves for caches Here are the two essential requirements for a cache: Correctness: A cache must never feed a garbage instruction or data item to a processor core. Speed, when there is a cache hit: Pipeline designs assume that instructions and data memory items will be ready within a fixed number of clock cycles, such as 1 cycle for the 5-stage pipeline we've just studied, or perhaps 2-4 cycles for deeper 10- to 15-stage pipelines in current commercial processors.

ENCM 369 Winter 2018 Section 01 Slide Set 9 slide 5/71 Goals in cache design In comparing designs that are correct and sufficiently fast, these goals are important: Low miss rate for most programs: misses can't be entirely avoided, but obviously it's good to make them infrequent. Relatively low chip area: smaller is better. Relatively low energy use per clock cycle: why is this important both for computers running on batteries and for plugged-in computers? There are tradeoffs here! Most ways to reduce miss rate require more transistors (so more chip area) and more energy per clock cycle.

ENCM 369 Winter 2018 Section 01 Slide Set 9 slide 6/71 Consequence of the requirement for speed The decisions about where to look in a cache, and about whether or not there was a hit, must both be made very quickly. The choices are: Look in one place only, and do a simple comparison of bit patterns to determine hit or miss. Look in multiple places at the same time, and do multiple parallel comparisons to determine hit or miss. Serial solutions (check in one place, then in another, maybe then yet another, etc.) are too slow.

ENCM 369 Winter 2018 Section 01 Slide Set 9 slide 7/71 Outline of Slide Set 9 Introduction to cache design Direct-mapped caches The Set-Bit Conflict Problem in Direct-Mapped Caches Set-Associative Caches Fully-Associative Caches Multi-Word Blocks Replacement Policies Introduction to cache write policies Write buffers Writethrough and writeback policies Multi-level cache systems

ENCM 369 Winter 2018 Section 01 Slide Set 9 slide 8/71 Direct-mapped caches A direct-mapped cache uses a look-in-one-place-only strategy to detect hits or misses. Let's look at an example for MIPS with room for 1024 instructions (if it's an I-cache) or 1024 data words (if it's a D-cache). Note: 1024 = 2^10. How is cache capacity defined? What is the capacity of this example cache in KB? Our cache will be a small memory array, with ten-bit addresses. The dimensions will be 1024 rows x 53 columns. We'll see shortly why 1024 x 32 won't work.

ENCM 369 Winter 2018 Section 01 Slide Set 9 slide 9/71 How addresses are split for our example cache Here is a 32-bit main memory address split into three pieces: bits 31-12 are the search tag, bits 11-2 are the set bits, and bits 1-0 are the byte offset (always 00 for word accesses). Byte offset: In our simple example that allows only word accesses, these bits don't get used. They would matter in a cache that supported instructions like LB, LBU, and SB. Set bits: These ten bits are used as an address into the cache. Many textbooks use the word index for what Harris and Harris call the set bits. Your instructor is likely to slip from time to time and say index when he means set bits! Search tag: We'll soon see how these 20 bits get used.
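
To make the split concrete, here is a minimal C sketch (not from the slides; the field widths match this example cache) that pulls the three pieces out of a 32-bit address with shifts and masks:

#include <stdint.h>
#include <stdio.h>

int main(void) {
    uint32_t addr = 0x00403570u;                 /* example instruction address */
    unsigned byte_offset = addr & 0x3u;          /* bits 1..0                   */
    unsigned set_bits    = (addr >> 2) & 0x3FFu; /* bits 11..2: 10 set bits     */
    unsigned search_tag  = addr >> 12;           /* bits 31..12: 20-bit tag     */
    printf("tag = 0x%05X, set = %u, byte offset = %u\n",
           search_tag, set_bits, byte_offset);
    /* prints: tag = 0x00403, set = 348, byte offset = 0 */
    return 0;
}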

ENCM 369 Winter 2018 Section 01 Slide Set 9 slide 10/71 An example address split for our example cache Suppose that 0x0040_3570 is the main memory address of some instruction, and our example cache is an I-cache To try to find this instruction in the I-cache, the main memory address is split into search tag, set bits, and byte offset as 0000 0000 0100 0000 0011 0101 0111 00 00 How many other addresses of main memory words would generate the same set bit pattern of 0101 0111 00?

ENCM 369 Winter 2018 Section 01 Slide Set 9 slide 11/71 Memory cell organization in our example cache Our cache is organized as 1024 sets. Each set needs 53 1-bit SRAM cells: 1 cell for the V-bit, 20 cells for the stored tag, and 32 cells for the cached instruction or data word. A valid bit, or V-bit, indicates whether its set contains valid information: 1 means YES and 0 means NO. The meaning of the stored tags will be explained by example.

ENCM 369 Winter 2018 Section 01 Slide Set 9 slide 12/71 Hit detection logic in the example direct-mapped cache [Schematic: the 10 set bits from the main memory address drive a 10-to-1024 decoder, which selects one row of the V-bit column and of the 1024 x 20 stored-tag array; the selected stored tag feeds a 20-bit equality comparator along with the 20-bit search tag from the main memory address. Active wires, SRAM cells, and logic are shown in RED. Comparator output: 1 means Hit, 0 means Miss.] For a hit, the V-bit in the selected set must be 1 AND the stored tag in the selected set must match the search tag.
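
In software terms, the hit check amounts to the following minimal C sketch; the arrays v_bit[] and stored_tag[] are hypothetical stand-ins for the SRAM cells, and the real hardware does the comparison combinationally rather than by executing code:

#include <stdint.h>
#include <stdbool.h>

#define NSETS 1024u

uint8_t  v_bit[NSETS];       /* one valid bit per set  */
uint32_t stored_tag[NSETS];  /* one 20-bit tag per set */

/* Hit: the selected set's V-bit is 1 AND its stored tag equals the search tag. */
bool dm_cache_hit(uint32_t addr) {
    uint32_t set        = (addr >> 2) & (NSETS - 1u);  /* 10 set bits       */
    uint32_t search_tag = addr >> 12;                  /* 20-bit search tag */
    return v_bit[set] == 1 && stored_tag[set] == search_tag;
}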

ENCM 369 Winter 2018 Section 01 Slide Set 9 slide 13/71 Tracing an example instruction fetch Let's continue to suppose that our example cache is an I-cache. Suppose also that a program has been running for a while, so many of the program's instructions are in the I-cache. Finally, suppose that the processor core now wants the instruction at address 0x0040_3570. As seen a few slides back, the address is split into search tag, set bits, and byte offset as 0000 0000 0100 0000 0011 0101 0111 00 00. Set 0101 0111 00 (base two) in the cache is checked: the decoder selects one 20-bit stored tag to copy into the 20-bit equality comparator, as shown on slide 12.

ENCM 369 Winter 2018 Section 01 Slide Set 9 slide 14/71 Tracing an example instruction fetch, continued 0101 0111 00 in base two is 348 in base ten. This fact is irrelevant to the digital hardware, but handy for human discussion. How is it determined whether there is a hit or a miss? What happens if there is a hit? What happens if there is a miss? What is a possibly unfortunate side effect of a miss if the V-bit in the selected set is already 1 when the miss is detected?

ENCM 369 Winter 2018 Section 01 Slide Set 9 slide 15/71 Textbook example, pages 482-485 Harris and Harris present an example of a tiny direct-mapped cache with 8-word (so 32-byte) capacity. (It's too small to be practical, but the size makes it possible to draw a schematic without using a lot of ellipsis symbols.) The schematic on page 484 shows that the cache has 8 sets. Using that information, let's show how main memory addresses should be split into byte offset, set bits, and search tag for this cache. Highly recommended: Please read Example 8.8 on page 485 for a brief and clear example of how a data cache is helpful when a program has good temporal locality of reference.

ENCM 369 Winter 2018 Section 01 Slide Set 9 slide 16/71 Outline of Slide Set 9 Introduction to cache design Direct-mapped caches The Set-Bit Conflict Problem in Direct-Mapped Caches Set-Associative Caches Fully-Associative Caches Multi-Word Blocks Replacement Policies Introduction to cache write policies Write buffers Writethrough and writeback policies Multi-level cache systems

ENCM 369 Winter 2018 Section 01 Slide Set 9 slide 17/71 The set-bit conflict problem in direct-mapped caches for (i = 0; i < n; i++) { x = foo(i); bar(x); } Let's consider again the example 1024-word direct-mapped I-cache presented in slides 8-14. Suppose instructions for foo start at 0x0040_3570. Suppose instructions for bar start at 0x0041_3570. What is going to happen in the I-cache?
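
The trouble can be checked numerically: the two start addresses differ only in their tag bits, so they select the same set. A quick sketch (the addresses are from this slide; the code itself is illustrative):

#include <stdint.h>
#include <stdio.h>

int main(void) {
    uint32_t foo_addr = 0x00403570u, bar_addr = 0x00413570u;
    /* Both map to set 348; only the tags (0x00403 vs 0x00413) differ,
       so each call to bar evicts foo's instructions, and vice versa. */
    printf("foo set = %u, bar set = %u\n",
           (unsigned)((foo_addr >> 2) & 0x3FFu),
           (unsigned)((bar_addr >> 2) & 0x3FFu));
    return 0;
}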

ENCM 369 Winter 2018 Section 01 Slide Set 9 slide 18/71 Set-bit conflicts, continued What could make the for-loop example even worse? Can you think of conflict examples for a direct-mapped D-cache?

ENCM 369 Winter 2018 Section 01 Slide Set 9 slide 19/71 The general problem of set-bit conflicts A conflict in a direct-mapped cache occurs when two or more frequently-accessed instructions or two or more frequently-accessed data items generate the same set bits for access in an I-cache or a D-cache. Conflicts can cause a high miss rate, and therefore many lost clock cycles, even in a situation where a cache is not close to full of frequently-accessed items. Misses due to conflicts can be reduced significantly, but not totally eliminated, by the use of set-associative caches.

ENCM 369 Winter 2018 Section 01 Slide Set 9 slide 20/71 Outline of Slide Set 9 Introduction to cache design Direct-mapped caches The Set-Bit Conflict Problem in Direct-Mapped Caches Set-Associative Caches Fully-Associative Caches Multi-Word Blocks Replacement Policies Introduction to cache write policies Write buffers Writethrough and writeback policies Multi-level cache systems

ENCM 369 Winter 2018 Section 01 Slide Set 9 slide 21/71 Set-associative caches Instead of starting with a definition for set-associative cache, let's jump straight into an example. Suppose we would like the same 1024-word capacity as in our previous direct-mapped example, but with a 2-way set-associative organization. It turns out that for this new design, main memory addresses should be split like this: bits 31-11 are the search tag, bits 10-2 are the set bits, and bits 1-0 are the byte offset (00).

ENCM 369 Winter 2018 Section 01 Slide Set 9 slide 22/71 The next slide shows a schematic of the hit detection and read logic for our example 1024-word two-way set-associative cache. Let's look at the schematic and answer some questions: Why is 9 the right number of set bits? What do the "=" boxes do? What is the purpose of the AND and OR gates? What does the mux do in the case of a hit? What does the mux do in the case of a miss? What will happen in the selected set in the case of a miss?

ENCM 369 Winter 2018 Section 01 Slide Set 9 slide 23/71 [Schematic: the main memory address supplies a 21-bit search tag and 9 set bits (byte offset 00). A decoder selects one of 512 sets; each set holds, for way 1 and way 0, a V-bit, a 21-bit stored tag, and a data/instruction word. Two equality comparators generate Hit 1 and Hit 0; their OR is 1 for a hit and 0 for a miss, and a mux driven by the hit signals delivers the data/instruction available to the core.]

ENCM 369 Winter 2018 Section 01 Slide Set 9 slide 24/71 Notes about slide 23 I designed the circuit and schematic to be very similar to Figure 8.9 on page 486 of Harris and Harris; please take a close look at that figure. I drew in the decoder to be as clear as possible about how one set is inspected for a hit, while all other sets are ignored. Obviously, a capacity of 1024 words (my example) versus a capacity of 8 words (textbook example) affects the number of sets, the number of set bits, and the number of bits in tags.
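
In software terms, the hit detection and read logic of slide 23 can be modeled roughly as below; the struct and array names are hypothetical, and where the hardware compares both stored tags in parallel, this C sketch necessarily checks them one after the other:

#include <stdint.h>
#include <stdbool.h>

#define NSETS 512u  /* 1024 words / 2 ways = 512 sets */

typedef struct {
    uint8_t  v;     /* valid bit                       */
    uint32_t tag;   /* 21-bit stored tag               */
    uint32_t word;  /* cached instruction or data word */
} Way;

Way way0[NSETS], way1[NSETS];

bool two_way_lookup(uint32_t addr, uint32_t *out) {
    uint32_t set        = (addr >> 2) & (NSETS - 1u);  /* 9 set bits        */
    uint32_t search_tag = addr >> 11;                  /* 21-bit search tag */
    if (way1[set].v && way1[set].tag == search_tag) { *out = way1[set].word; return true; }
    if (way0[set].v && way0[set].tag == search_tag) { *out = way0[set].word; return true; }
    return false;  /* miss: replacement logic (not shown) picks a way to refill */
}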

ENCM 369 Winter 2018 Section 01 Slide Set 9 slide 25/71 Logic details we WON'T study for ANY cache designs We won't look at wiring and logic for communication of data between the cache and the processor core in the case of a memory write. We won't look at wiring and logic for communication of addresses and data-or-instructions between the cache and main memory. However, we will look at important concepts related to writing to caches and main memory.

ENCM 369 Winter 2018 Section 01 Slide Set 9 slide 26/71 Review: Set-bit conflicts in a direct-mapped cache This is from slide 17: for (i = 0; i < n; i++) { x = foo(i); bar(x); } It's possible, by bad luck, that instruction addresses from foo generate the same set bits as instruction addresses from bar. That's not a problem in a two-way set-associative cache.

ENCM 369 Winter 2018 Section 01 Slide Set 9 slide 27/71 If foo needs to use sets 348-353, and bar needs to use sets 348-351, that can be resolved without conflict: [Figure: the 512-set cache of slide 23, with foo's instructions and tags occupying way 0 of sets 348-353 and bar's instructions and tags occupying way 1 of sets 348-351; the V-bits of those ways are 1.] What kind of conflict problem in the for loop would a 2-way cache be unable to solve?

ENCM 369 Winter 2018 Section 01 Slide Set 9 slide 28/71 How many ways in an N-way set-associative cache? In the example we've just studied, N = 2. In a lot of textbook and lecture examples, N = 2, simply because with N > 2 it gets hard to fit diagrams into pages and slides! However, many caches in modern processor chips are 4-way, 8-way, or 16-way set-associative.

ENCM 369 Winter 2018 Section 01 Slide Set 9 slide 29/71 Outline of Slide Set 9 Introduction to cache design Direct-mapped caches The Set-Bit Conflict Problem in Direct-Mapped Caches Set-Associative Caches Fully-Associative Caches Multi-Word Blocks Replacement Policies Introduction to cache write policies Write buffers Writethrough and writeback policies Multi-level cache systems

ENCM 369 Winter 2018 Section 01 Slide Set 9 slide 30/71 Fully-associative caches A fully-associative cache can be thought of as an extreme case of a set-associative cache, in which there is only one set. Every lookup in a fully-associative cache requires a parallel, simultaneous check of all of the V-bits and tags in the cache. This uses a lot of energy, and makes fully-associative design a poor choice for medium-size and large caches. Some very small memory systems do use fully-associative lookup. See pages 487-488 of the textbook for a little more discussion of fully-associative caches.
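
A minimal sketch of fully-associative lookup, assuming a hypothetical 64-entry cache; the hardware checks every entry simultaneously, which is exactly why the energy cost grows with capacity, while this C loop models the same result sequentially:

#include <stdint.h>
#include <stdbool.h>

#define NENTRIES 64

uint8_t  fa_v[NENTRIES];    /* valid bits */
uint32_t fa_tag[NENTRIES];  /* with only one set, the tag is everything above the byte offset */

bool fa_hit(uint32_t addr) {
    uint32_t search_tag = addr >> 2;  /* bits 31..2: no set bits at all */
    for (int i = 0; i < NENTRIES; i++)
        if (fa_v[i] && fa_tag[i] == search_tag)
            return true;
    return false;
}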

ENCM 369 Winter 2018 Section 01 Slide Set 9 slide 31/71 Outline of Slide Set 9 Introduction to cache design Direct-mapped caches The Set-Bit Conflict Problem in Direct-Mapped Caches Set-Associative Caches Fully-Associative Caches Multi-Word Blocks Replacement Policies Introduction to cache write policies Write buffers Writethrough and writeback policies Multi-level cache systems

ENCM 369 Winter 2018 Section 01 Slide Set 9 slide 32/71 Multi-word blocks DRAM latency is the length of time needed for a DRAM system to start sending data in response to a read request. This is typically 30-100 processor core clock cycles. DRAM bandwidth is the rate at which a DRAM system can transmit data from sequential addresses once transmission has started. DRAM bandwidth is much less of a problem than DRAM latency. For example, in a typical laptop with DRAM on two DIMMs, 64 bytes (so, 512 bits) of instructions or data can be transmitted in about 8 core clock cycles. Conclusion: It makes much more sense to read DRAM in many-word bursts than it does to read DRAM in single-word accesses.

ENCM 369 Winter 2018 Section 01 Slide Set 9 slide 33/71 1024-word direct-mapped cache with 16-word blocks A schematic for the read logic in this kind of cache is shown on the next slide. For each V-bit and tag there will be 16 instruction words (if it's an I-cache) or 16 data words (if it's a D-cache). For this cache it turns out to be necessary to split the main memory address into four pieces: bits 31-12 are the search tag, bits 11-6 are the set bits, bits 5-2 are the block offset, and bits 1-0 are the byte offset (00).
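
A minimal C sketch of the four-piece split (field widths from this slide; the names are illustrative):

#include <stdint.h>

typedef struct {
    uint32_t tag;          /* bits 31..12: 20 bits */
    uint32_t set;          /* bits 11..6:  6 bits  */
    uint32_t block_offset; /* bits 5..2:   4 bits  */
    uint32_t byte_offset;  /* bits 1..0:   2 bits  */
} AddrFields;

AddrFields split_addr(uint32_t addr) {
    AddrFields f;
    f.byte_offset  = addr & 0x3u;
    f.block_offset = (addr >> 2) & 0xFu;
    f.set          = (addr >> 6) & 0x3Fu;
    f.tag          = addr >> 12;
    return f;  /* for 0x0040_3570: tag 0x00403, set 21, block offset 12, byte offset 0 */
}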

ENCM 369 Winter 2018 Section 01 Slide Set 9 slide 34/71 [Schematic: the main memory address supplies a 20-bit search tag, 6 set bits, and a 4-bit block offset (byte offset 00). A decoder selects one of 64 sets; each set holds a V-bit, a 20-bit stored tag, and 16 data/instruction words per block (word positions 0000 through 1111). An equality comparator produces the Hit signal, and the block offset drives a 16:1 multiplexer that delivers the selected data/instruction word to the core.]

ENCM 369 Winter 2018 Section 01 Slide Set 9 slide 35/71 For the cache on the previous slide, let's answer some questions. Q1: Is hit detection really any different from hit detection in a direct-mapped cache with one-word blocks? Q2: What are the roles of the block offset and the 16:1 32-bit bus multiplexer? If we use this design as an I-cache, and try a fetch with the previous example instruction address of 0x0040_3570, the address would be split into tag, set bits, block offset, and byte offset as 0000 0000 0100 0000 0011 0101 01 11 00 00. Q3: How is a hit detected for the above address, and in the case of a hit, how does the correct instruction get fed to the processor core?

ENCM 369 Winter 2018 Section 01 Slide Set 9 slide 36/71 Example instruction fetch in the cache of slide 34: Miss In set 21, either the V-bit is 0 or the stored tag doesn't match the tag coming from the instruction address. The whole 16-word block needs to be replaced! Why? Words with addresses 0x0040_3540, 0x0040_3544, ..., 0x0040_357c get copied from main memory into set 21 in the cache. Why are those the right addresses to use? What other updates happen in set 21? What else happens?

ENCM 369 Winter 2018 Section 01 Slide Set 9 slide 37/71 16 word addresses with common tags and indexes

address (hex)   address (base two)
0x0040_3540     0000 0000 0100 0000 0011 0101 01 00 00 00
0x0040_3544     0000 0000 0100 0000 0011 0101 01 00 01 00
0x0040_3548     0000 0000 0100 0000 0011 0101 01 00 10 00
0x0040_354c     0000 0000 0100 0000 0011 0101 01 00 11 00
0x0040_3550     0000 0000 0100 0000 0011 0101 01 01 00 00
0x0040_3554     0000 0000 0100 0000 0011 0101 01 01 01 00
0x0040_3558     0000 0000 0100 0000 0011 0101 01 01 10 00
0x0040_355c     0000 0000 0100 0000 0011 0101 01 01 11 00
0x0040_3560     0000 0000 0100 0000 0011 0101 01 10 00 00
0x0040_3564     0000 0000 0100 0000 0011 0101 01 10 01 00
0x0040_3568     0000 0000 0100 0000 0011 0101 01 10 10 00
0x0040_356c     0000 0000 0100 0000 0011 0101 01 10 11 00
0x0040_3570     0000 0000 0100 0000 0011 0101 01 11 00 00
0x0040_3574     0000 0000 0100 0000 0011 0101 01 11 01 00
0x0040_3578     0000 0000 0100 0000 0011 0101 01 11 10 00
0x0040_357c     0000 0000 0100 0000 0011 0101 01 11 11 00
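
The table can be generated by clearing the low six bits of the miss address (block offset plus byte offset, for 64-byte blocks) and stepping through the block one word at a time; a sketch:

#include <stdint.h>
#include <stdio.h>

int main(void) {
    uint32_t miss_addr = 0x00403570u;
    uint32_t base = miss_addr & ~0x3Fu;  /* clear bits 5..0: base is 0x0040_3540 */
    for (int i = 0; i < 16; i++)
        printf("0x%08X\n", (unsigned)(base + 4u * (unsigned)i));  /* 0x0040_3540 .. 0x0040_357c */
    return 0;
}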

ENCM 369 Winter 2018 Section 01 Slide Set 9 slide 38/71 Why use multi-word blocks? As seen a few slides back, reading 64 bytes (16 4-byte words) from consecutive DRAM addresses does not take much more time than reading one 4-byte word Because most programs have good spatial locality of reference, it pays to read many adjacent words at once when copying instructions or data from main memory to a cache 64 bytes is a very common block size in current computers Much smaller block sizes (2 or 4 words) might be used to make a point in classroom examples or lab exercises

ENCM 369 Winter 2018 Section 01 Slide Set 9 slide 39/71 I-cache spatial locality example Suppose a MIPS program calls procedure bob for the first time bob is a leaf procedure with 27 instructions Suppose every one of those instructions is fetched at least once The instruction cache is organized like the cache on slide 34 How many misses will there be in the I-cache while the procedure runs?

ENCM 369 Winter 2018 Section 01 Slide Set 9 slide 40/71 D-cache spatial locality example Suppose that i, n and sum are of type int, in GPRs, and a is of type int*, also in a GPR for (i = 0; i < n; i++) { sum += a[i]; } The data cache is organized like the cache on slide 34 Assume that none of the array elements read by the loop are in the D-cache when the loop starts Roughly how many misses will there be in the D-cache while the loop runs?
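
A rough estimate, offered as a sketch rather than the slide's official answer: if a[0] happens to sit at the start of a 64-byte block, each miss brings in 16 elements and the next 15 accesses hit, so the loop misses about once per block:

/* Rough D-cache miss estimate for the loop above, assuming a[0] is
   64-byte aligned and n elements are read: one miss per 16-word block. */
int estimated_misses(int n) {
    return (n + 15) / 16;  /* ceil(n / 16) */
}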

ENCM 369 Winter 2018 Section 01 Slide Set 9 slide 41/71 Summary of cache examples in this slide set and textbook Section 8.3 The example 1024-word caches we've seen in this slide set were, in order: direct-mapped with 1-word blocks; set-associative with 1-word blocks; direct-mapped with multi-word blocks. Section 8.3 of the textbook presents example 8-word caches in exactly the same order. A capacity of 8 words is far too small to be useful, but does allow textbook authors to draw diagrams showing entire caches.

ENCM 369 Winter 2018 Section 01 Slide Set 9 slide 42/71 Caches in current processor chips Current processor chips have caches that are set-associative and have multi-word blocks. Q1: Why might textbooks and lecture slides not show a set-associative cache with multi-word blocks? Q2: What are the capacity, associativity, and block size of the L1 (level one) D-caches in an Intel Core i7 chip?

ENCM 369 Winter 2018 Section 01 Slide Set 9 slide 43/71 Outline of Slide Set 9 Introduction to cache design Direct-mapped caches The Set-Bit Conflict Problem in Direct-Mapped Caches Set-Associative Caches Fully-Associative Caches Multi-Word Blocks Replacement Policies Introduction to cache write policies Write buffers Writethrough and writeback policies Multi-level cache systems

ENCM 369 Winter 2018 Section 01 Slide Set 9 slide 44/71 Replacement policies The term replacement policy is a general name for a method of deciding where to place new instructions or data when they are brought into a cache after a miss First, an important review item: If a cache has multi-word blocks, replacement must bring an entire block into a cache, so that all of the updated block is consistent with the new tag for the block Second, a question: In a direct-mapped cache, is replacement a complicated problem with several different reasonable solutions?

ENCM 369 Winter 2018 Section 01 Slide Set 9 slide 45/71 Replacement in set-associative caches A block of data or instructions will be brought into a cache as a result of a cache miss The set bits of the address that caused the miss dictate which set must receive the new block But in an N-way set-associative cache, any one of the N blocks in the selected set could receive the new block Is it possible for the cache to make a perfect choice about which of N blocks to replace?

ENCM 369 Winter 2018 Section 01 Slide Set 9 slide 46/71 LRU replacement in 2-way set-associative caches A replacement strategy that works well for 2-way set-associative caches is called LRU, for least-recently-used An extra SRAM bit, called U, can be added to each set, to indicate which of the two blocks in the set has been less recently used This is shown on the next slide Questions, related to the next slide: Q1: How does a hit in set 233, way 0 affect the U bit in set 233? Q2: How does a hit in set 42, way 1 affect the U bit in set 42? Q3: If there is a miss in set 98, where in the cache does the new data or instruction go?

ENCM 369 Winter 2018 Section 01 Slide Set 9 slide 47/71 This is the cache of slide 23, with one extra 1-bit SRAM cell per set: the U-bit. [Schematic: as on slide 23 (21-bit search tag, 9 set bits, byte offset 00, decoder, 512 sets; ways 1 and 0 each with a V-bit, a 21-bit stored tag, and a data/instruction word; two comparators producing Hit 1 and Hit 0, and a mux delivering the data/instruction available to the core), plus a U-bit column with one bit per set.]
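
One plausible software model of the U-bit bookkeeping follows; the slides leave the encoding open, so this sketch assumes the convention that u_bit[set] holds the number of the less recently used way, i.e. the way to replace on a miss:

#include <stdint.h>

#define NSETS 512

uint8_t u_bit[NSETS];

void note_hit(int set, int way)  { u_bit[set] = (uint8_t)(1 - way); } /* the other way is now LRU */
int  victim_way(int set)         { return u_bit[set]; }               /* on a miss, replace the LRU way */
void note_fill(int set, int way) { u_bit[set] = (uint8_t)(1 - way); } /* a freshly filled way is MRU */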

ENCM 369 Winter 2018 Section 01 Slide Set 9 slide 48/71 More about LRU replacement What is the rationale for LRU replacement? LRU replacement is easy to implement in an N-way set-associative cache if N = 2 Is it easy to implement when N > 2? Why or why not? If not, what might be a reasonable alternative?

ENCM 369 Winter 2018 Section 01 Slide Set 9 slide 49/71 Outline of Slide Set 9 Introduction to cache design Direct-mapped caches The Set-Bit Conflict Problem in Direct-Mapped Caches Set-Associative Caches Fully-Associative Caches Multi-Word Blocks Replacement Policies Introduction to cache write policies Write buffers Writethrough and writeback policies Multi-level cache systems

ENCM 369 Winter 2018 Section 01 Slide Set 9 slide 50/71 Introduction to cache write policies So far, lecture and textbook discussion about caches has focused on two very similar problems: For instruction fetch in an I-cache, how is a hit detected, and what should happen if there is a miss? When a load instruction searches for data in a D-cache, how is a hit detected, and what should happen if there is a miss? A really important topic, not covered in depth in the textbook, is: How does a D-cache support STORE instructions?

ENCM 369 Winter 2018 Section 01 Slide Set 9 slide 51/71 Quick Review: One level of cache, no address translation [Diagram: the processor core (control, PC, GPRs, ALUs, etc.) sends a physical address to the I-cache and receives instructions, and sends a physical address to the D-cache and exchanges data; both caches exchange physical addresses and instructions or data with main memory.]

ENCM 369 Winter 2018 Section 01 Slide Set 9 slide 52/71 Outline of Slide Set 9 Introduction to cache design Direct-mapped caches The Set-Bit Conflict Problem in Direct-Mapped Caches Set-Associative Caches Fully-Associative Caches Multi-Word Blocks Replacement Policies Introduction to cache write policies Write buffers Writethrough and writeback policies Multi-level cache systems

ENCM 369 Winter 2018 Section 01 Slide Set 9 slide 53/71 Write buffers A write buffer is a critical component of a system for handling writes with a memory system that has caches In a simple system, like the one shown on the next slide, a write buffer contains a collection of data addresses and associated data items that are waiting to be copied out from a D-cache to main memory Depending on the D-cache design, data items in the write buffer could be as narrow as a single byte, or as wide as an entire D-cache block

ENCM 369 Winter 2018 Section 01 Slide Set 9 slide 54/71 This is a slight enhancement to the diagram on slide 51. Note that data bound from the core to main memory must pass through the write buffer. [Diagram: as on slide 51, but with a write buffer between the D-cache and main memory; the D-cache-to-write-buffer and write-buffer-to-memory paths are bwb bits wide, where bwb stands for block width in bits.]

ENCM 369 Winter 2018 Section 01 Slide Set 9 slide 55/71 Attention: The core and the caches are fast! Main memory is relatively slow. A write buffer allows the core and the D-cache to keep running while multiple chunks of data accumulate in the write buffer, waiting for the time-consuming trip to main memory. If the write buffer becomes completely full, the core may have to stall until at least some of the data in the write buffer has moved out to main memory.
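
Functionally, a write buffer behaves like a small FIFO between the D-cache and main memory. Here is a minimal sketch, assuming a hypothetical 8-entry buffer of single words (real buffers may hold whole blocks, as noted on slide 53):

#include <stdint.h>
#include <stdbool.h>

#define WB_ENTRIES 8

typedef struct { uint32_t addr, data; } WBEntry;

static WBEntry wb[WB_ENTRIES];
static int wb_head = 0, wb_count = 0;

/* Core/D-cache side: false means the buffer is full and the core must stall. */
bool wb_push(uint32_t addr, uint32_t data) {
    if (wb_count == WB_ENTRIES) return false;
    wb[(wb_head + wb_count) % WB_ENTRIES] = (WBEntry){ addr, data };
    wb_count++;
    return true;
}

/* Main-memory side: drain the oldest entry when the memory bus is free. */
bool wb_pop(WBEntry *out) {
    if (wb_count == 0) return false;
    *out = wb[wb_head];
    wb_head = (wb_head + 1) % WB_ENTRIES;
    wb_count--;
    return true;
}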

ENCM 369 Winter 2018 Section 01 Slide Set 9 slide 56/71 Outline of Slide Set 9 Introduction to cache design Direct-mapped caches The Set-Bit Conflict Problem in Direct-Mapped Caches Set-Associative Caches Fully-Associative Caches Multi-Word Blocks Replacement Policies Introduction to cache write policies Write buffers Writethrough and writeback policies Multi-level cache systems

ENCM 369 Winter 2018 Section 01 Slide Set 9 slide 57/71 Writethrough and writeback policies Writethrough and writeback are names given to kinds of policies for making sure that D-caches properly transmit data to main memory in response to store instructions. There are many variations of writethrough policies. There are many variations of writeback policies. In ENCM 369 in 2018, we'll just briefly present the basics of each kind of policy.

ENCM 369 Winter 2018 Section 01 Slide Set 9 slide 58/71 [Diagram: the same one-level cache system as slide 54 — core, I-cache, D-cache, write buffer (bwb bits wide), main memory.] The key difference between writethrough and writeback is in the rules for deciding when an address and some data must be put in the write buffer.

ENCM 369 Winter 2018 Section 01 Slide Set 9 slide 59/71 Writethrough Here's a simple example of a writethrough policy: Every store instruction sends an address and some data to the write buffer! So every write goes through the D-cache, regardless of whether there is a hit or a miss in the D-cache. In the case of a D-cache write miss, replacement of a block is similar to what happens in response to a read miss.
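
A sketch of this writethrough store path in C; the helper functions are hypothetical stand-ins for cache and write-buffer hardware, not names from the slides:

#include <stdint.h>
#include <stdbool.h>

/* Hypothetical helpers standing in for cache and write-buffer hardware. */
bool dcache_hit(uint32_t addr);
void dcache_fill_block(uint32_t addr);
void dcache_write_word(uint32_t addr, uint32_t data);
bool wb_push(uint32_t addr, uint32_t data);

/* Writethrough store: the write ALWAYS heads toward main memory. */
void store_writethrough(uint32_t addr, uint32_t data) {
    if (!dcache_hit(addr))
        dcache_fill_block(addr);    /* write miss: replace a block, as for a read miss */
    dcache_write_word(addr, data);  /* update the cached copy */
    wb_push(addr, data);            /* and send the write to the write buffer */
}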

ENCM 369 Winter 2018 Section 01 Slide Set 9 slide 60/71 Writeback Now, here's a simple example of a writeback policy: If a store instruction hits in the D-cache, the data for the store is written into the appropriate D-cache block but not to the write buffer! A block in the D-cache containing fresh data that hasn't yet been put in the write buffer is called a dirty block. Data in a dirty block will go to the write buffer when that block is evicted from the D-cache.
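
A matching sketch of the writeback store-hit path and of eviction, with the same caveat that the helper names are hypothetical:

#include <stdint.h>
#include <stdbool.h>

/* Hypothetical helpers, as in the writethrough sketch. */
void dcache_write_word(uint32_t addr, uint32_t data);
void set_dirty_bit(int set, int way);
bool is_dirty(int set, int way);
void wb_push_block(int set, int way);

/* Writeback store hit: update the block, mark it dirty, touch nothing else. */
void store_writeback_hit(int set, int way, uint32_t addr, uint32_t data) {
    dcache_write_word(addr, data);
    set_dirty_bit(set, way);   /* fresh data, not yet in main memory */
}

/* Eviction: only dirty blocks owe data to the write buffer. */
void evict_block(int set, int way) {
    if (is_dirty(set, way))
        wb_push_block(set, way);
    /* a clean block is simply overwritten */
}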

ENCM 369 Winter 2018 Section 01 Slide Set 9 slide 61/71 Block status bits in writeback caches In earlier cache examples, which were I-caches, a block could have one of two statuses: valid or invalid, as indicated by the V-bit. In a writeback D-cache, an extra status bit is added to each block, called the dirty bit, or D-bit. A block can have any one of three statuses: invalid (V = 0), valid-and-clean (D = 0, V = 1), or valid-and-dirty (D = 1, V = 1). For the organization of slide 58: A valid, clean block is a perfect reflection of main memory contents. A valid, dirty block has fresh data that isn't yet in main memory.

ENCM 369 Winter 2018 Section 01 Slide Set 9 slide 62/71 The next slide shows the addition of D-bits to the 1024-word direct-mapped cache with 1-word blocks, to enable the cache to function as a writeback cache. The words "to replacement logic" indicate that some details have been left out: a miss in a set with D = 1 and V = 1 will require copying a data address and a data word to the write buffer.

ENCM 369 Winter 2018 Section 01 Slide Set 9 slide 63/71 [Schematic: the 1024-word direct-mapped cache of slide 12, now with a D-bit column beside the V-bit column — 20-bit search tag, 10 set bits, byte offset 00, decoder, 1024 sets each holding D, V, a 20-bit stored tag, and a data word. The equality comparator produces the Hit signal (1 for hit, 0 for miss), the data word is available to the core, and the D and V bits feed the replacement logic.]

ENCM 369 Winter 2018 Section 01 Slide Set 9 slide 64/71 As noted previously, caches in modern processor chips tend both to be set-associative and to have multi-word blocks. The next slide shows a 1024-word 2-way set-associative D-cache with 2-word blocks. Note that each 2-word block has one D-bit, one V-bit, and one stored tag. The U-bit in each set helps with an LRU replacement policy.

ENCM 369 Winter 2018 Section 01 Slide Set 9 slide 65/71 [Schematic: a 1024-word 2-way set-associative D-cache with 2-word blocks — 21-bit search tag, 8 set bits, block offset, byte offset 00, decoder, 256 sets. Each way of a set holds D, V, a 21-bit stored tag, and words 1 and 0, and each set has a U-bit. Two equality comparators produce Hit 1 and Hit 0 (1 for hit, 0 for miss); muxes driven by the hit signals and the block offset deliver the selected data word to the core.]

ENCM 369 Winter 2018 Section 01 Slide Set 9 slide 66/71 Writethrough is simpler, but writeback is more efficient. Most current data caches use some kind of writeback policy. Imagine that a program is making very frequent updates to elements of a small array. With a writeback cache these updates leave the bus between cache and main memory idle, which saves energy (and would make bus bandwidth available for other cores in a multi-core system); in comparison to a writethrough cache, the risk that the write buffer could become full is reduced.

ENCM 369 Winter 2018 Section 01 Slide Set 9 slide 67/71 Outline of Slide Set 9 Introduction to cache design Direct-mapped caches The Set-Bit Conflict Problem in Direct-Mapped Caches Set-Associative Caches Fully-Associative Caches Multi-Word Blocks Replacement Policies Introduction to cache write policies Write buffers Writethrough and writeback policies Multi-level cache systems

ENCM 369 Winter 2018 Section 01 Slide Set 9 slide 68/71 Multi-level cache systems All the cache designs we have looked at so far are single-level: A miss in the I-cache or D-cache requires access to main memory for the missing instructions or data. Most current cache systems are multi-level. Slide 70 shows the simplest reasonable two-level arrangement. Cache level numbering starts with L1 (level one) closest to the core. Note that the L2 cache is unified: unlike the L1 caches, it holds a mix of instructions and data.

ENCM 369 Winter 2018 Section 01 Slide Set 9 slide 69/71 Review yet again: One level of cache, no address translation [Diagram: as on slide 54 — core, I-cache, D-cache, write buffer (bwb bits wide), main memory, all using physical addresses.]

ENCM 369 Winter 2018 Section 01 Slide Set 9 slide 70/71 Two levels of cache, no address translation [Diagram: the core sends physical addresses to the L1 I-cache (instructions) and the L1 D-cache (data); the L1 D-cache feeds a write buffer. Both L1 caches exchange physical addresses and instructions or data with a unified L2 cache, which has its own write buffer and in turn exchanges physical addresses and instructions or data with main memory. Paths between levels are bwb bits wide.]

ENCM 369 Winter 2018 Section 01 Slide Set 9 slide 71/71 Design goals for multi-level cache systems Key fact: Small SRAM arrays are faster than medium-size SRAM arrays, because cells in smaller arrays drive shorter wires. L1 caches are small, so they can be fast enough to take care of hits from the core in roughly 1-3 clock cycles. L2 caches are bigger and slower than L1 caches, but much smaller and much faster than the DRAM used for main memory. If a memory access misses in L1 but hits in L2, the time lost by the core is much, much less than the time lost in a miss that involves access to main memory.
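
The payoff can be made concrete with the usual average-memory-access-time formula. Here is a sketch with made-up but plausible cycle counts and miss rates, not figures from the slides:

#include <stdio.h>

int main(void) {
    /* Hypothetical latencies (core clock cycles) and miss rates. */
    double l1_hit = 2.0, l2_hit = 12.0, mem = 80.0;
    double l1_miss_rate = 0.05, l2_miss_rate = 0.20;

    /* AMAT = L1 hit time + L1 miss rate * (L2 hit time + L2 miss rate * memory time) */
    double amat = l1_hit + l1_miss_rate * (l2_hit + l2_miss_rate * mem);
    printf("average memory access time = %.2f cycles\n", amat);  /* 3.40 */
    return 0;
}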