6th Lecture :: The Cache - Part Three

1 Dr. Michael Manzke :: CS7031 :: 6th Lecture :: The Cache - Part Three :: October 20, 2010. [CS7031] Graphics and Console Hardware and Real-time Rendering. 6th Lecture :: The Cache - Part Three. Dr. Michael Manzke, michael.manzke@cs.tcd.ie, Trinity College Dublin.

2 Textbook. The following slides are based on Chapter Five, "Memory Hierarchy Design", and Appendix C, "Review of the Memory Hierarchy", in [HP07]. Figures are taken from the book's support material. [HP07] John L. Hennessy and David A. Patterson. Computer Architecture: A Quantitative Approach. Morgan Kaufmann, fourth edition, 2007.

3 Analysis of Replacement Policy. The table in [HP07] reports cache misses per 1000 instructions for LRU, random, and FIFO replacement at two-way, four-way, and eight-way associativity across several cache sizes (KB). LRU performs better than random or FIFO for the smaller cache sizes, but LRU is more difficult to implement. We looked at the various associativities and cache sizes of the L1 and L2 caches of the XBOX 360 [AB06] in the 5th lecture. See page C-10 in Hennessy and Patterson [HP07].
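
To make the difference between the three policies concrete, here is a minimal sketch that counts misses for a single cache set. The function name `count_misses`, the block numbers, and the access trace are illustrative assumptions and do not come from [HP07]; real hardware usually approximates LRU rather than tracking an exact ordering.

```python
import random
from collections import OrderedDict

def count_misses(trace, ways, policy, seed=0):
    """Count misses for one cache set holding `ways` blocks (hypothetical model)."""
    rng = random.Random(seed)
    resident = OrderedDict()  # keys are block numbers, ordered oldest first
    misses = 0
    for block in trace:
        if block in resident:
            if policy == "lru":
                resident.move_to_end(block)  # only LRU updates the order on a hit
            continue
        misses += 1
        if len(resident) == ways:
            if policy == "random":
                del resident[rng.choice(list(resident))]
            else:
                resident.popitem(last=False)  # oldest entry is the LRU or FIFO victim
        resident[block] = None
    return misses

# Block 0 is reused between accesses to other blocks (assumed trace).
trace = [0, 1, 0, 2, 0, 3, 0, 4]
lru_misses = count_misses(trace, ways=2, policy="lru")    # 5
fifo_misses = count_misses(trace, ways=2, policy="fifo")  # 6
```

On this trace LRU keeps the frequently reused block 0 resident, while FIFO evicts it because it was loaded first. Other traces can favour a different policy.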

4 Write-through vs. Write-back. A write-back cache may use a dirty bit to decide whether a cache block needs to be written to the next lower level when it is replaced. This reduces the number of write operations to the next level. Write-back can run at cache frequency. Multiple write operations to the same block require only one write to the next level. This reduces the memory bandwidth demand on the next level, which is attractive for multiprocessors (we will cover this in a later lecture). It also saves power.
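
The bandwidth argument can be illustrated with a small counting sketch. It assumes a single resident block that receives a number of stores and is then evicted; the function name `next_level_writes` is hypothetical.

```python
def next_level_writes(stores, policy):
    """Count writes that reach the next level for `stores` stores to one
    resident block, followed by the eviction of that block."""
    writes = 0
    dirty = False
    for _ in range(stores):
        if policy == "write-through":
            writes += 1   # every store is forwarded to the next level
        else:
            dirty = True  # write-back: only mark the block as modified
    if policy == "write-back" and dirty:
        writes += 1       # one block write when the dirty block is evicted
    return writes

through = next_level_writes(10, "write-through")  # 10
back = next_level_writes(10, "write-back")        # 1
```

A clean block (no stores) causes no write at all under write-back, which is what the dirty bit is for.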

5 Write-through and Coherency. Write-through is simple to implement. It also simplifies coherency, which is important for multiprocessors and I/O. Multilevel caches make write-through more viable for the upper-level cache, because writes only need to propagate to the next lower level rather than all the way to main memory.

6 Write Misses. Write misses may be handled in one of two ways. Write allocate: the block is allocated in the cache on a write miss. No-write allocate: the block is not allocated on a write miss; it is modified only in the lower-level memory.
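
The effect of the two write-miss options on later accesses can be sketched as follows. This is a simplified model that assumes an unbounded cache (no replacement) and a made-up access sequence; `run` and the block number 100 are illustrative names only.

```python
def run(ops, write_allocate):
    """Return 'hit' or 'miss' for each access in `ops`."""
    resident = set()
    outcomes = []
    for kind, block in ops:
        hit = block in resident
        outcomes.append("hit" if hit else "miss")
        # Read misses always allocate; write misses allocate only
        # when the cache uses write allocate.
        if not hit and (kind == "read" or write_allocate):
            resident.add(block)
    return outcomes

ops = [("write", 100), ("write", 100), ("read", 100), ("write", 100)]
with_allocate = run(ops, write_allocate=True)    # miss, hit, hit, hit
no_allocate = run(ops, write_allocate=False)     # miss, miss, miss, hit
```

With no-write allocate the block only enters the cache once it is read, so repeated writes to it keep missing until then.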

7 Opteron Data Cache. The slide on page 8 shows the structure of the AMD Opteron's L1 data cache in the following configuration: 64 KB cache size; 64-byte block size; two-way set-associative placement; LRU replacement; write back; write allocate (on write miss). Equation (1) calculates the width of the index:

$$2^{\text{index}} = \frac{\text{Cache size}}{\text{Block size} \times \text{Set associativity}} = \frac{65536}{64 \times 2} = 512 = 2^{9} \quad (1)$$
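
The same arithmetic can be checked in a few lines. The variable names are chosen here for illustration; the values are the configuration listed above.

```python
cache_size = 64 * 1024   # 64 KB
block_size = 64          # bytes per block
associativity = 2        # two-way set-associative

num_sets = cache_size // (block_size * associativity)
index_bits = num_sets.bit_length() - 1     # log2 of a power of two
offset_bits = block_size.bit_length() - 1  # bits needed to address a byte in a block

print(num_sets, index_bits, offset_bits)
```

This gives 512 sets, a 9-bit index, and a 6-bit block offset.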

8 The Opteron Cache Block Diagram. See page C-13 in Hennessy and Patterson [HP07].

9 Opteron Data Cache Example, Steps 1 & 2 (Read). Step 1 in the figure on page 8 generates the tag, index, and block offset from the CPU address. The index is 9 bits wide and the tag is 25 bits wide. The block offset is 6 bits wide because a block holds 64 bytes; 3 of those bits select the requested 8 bytes from the cache block. Step 2 in the figure on page 8 uses the index to select the correct set for the two compare units (the cache is two-way set-associative). Two units are used so that the operation is performed concurrently.
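
Step 1 amounts to slicing the address into three bit fields. The sketch below assumes a 40-bit physical address (25-bit tag, 9-bit index, 6-bit block offset for 64-byte blocks); `split_address` and the example address are hypothetical.

```python
OFFSET_BITS = 6   # 64-byte block
INDEX_BITS = 9    # 512 sets
TAG_BITS = 25     # remaining bits of an assumed 40-bit physical address

def split_address(addr):
    """Split a physical address into (tag, index, block offset)."""
    offset = addr & ((1 << OFFSET_BITS) - 1)
    index = (addr >> OFFSET_BITS) & ((1 << INDEX_BITS) - 1)
    tag = (addr >> (OFFSET_BITS + INDEX_BITS)) & ((1 << TAG_BITS) - 1)
    return tag, index, offset

# Build an address from known fields, then split it again.
addr = (5 << 15) | (3 << 6) | 8
tag, index, offset = split_address(addr)  # (5, 3, 8)
word = offset >> 3                        # which 8-byte word of the block: 1
```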

10 Opteron Data Cache Example, Steps 3 & 4 (Read). Step 3 in the figure on page 8 compares the tags of the two cache blocks in the selected set against the tag from the address. Again, the two units perform this operation concurrently. In step 4 in the figure on page 8, the logic uses the result of the comparison in step 3 to switch the 2:1 multiplexer to the correct cache block. The block offset is applied to select the correct bytes in the cache block, which are returned to the CPU.
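
Steps 2 to 4 of the read can be sketched in software as follows. This is only an illustrative model: the `Line` class and `read` function are made-up names, and the two tag comparisons that the hardware performs concurrently are evaluated one after the other here.

```python
from dataclasses import dataclass

@dataclass
class Line:
    valid: bool = False
    tag: int = 0
    data: bytes = bytes(64)  # one 64-byte cache block

def read(sets, tag, index, offset):
    """Return the requested 8-byte word on a hit, or None on a miss."""
    way0, way1 = sets[index]                   # step 2: index selects the set
    hit0 = way0.valid and way0.tag == tag      # step 3: compare both tags
    hit1 = way1.valid and way1.tag == tag
    if not (hit0 or hit1):
        return None
    line = way0 if hit0 else way1              # step 4: 2:1 multiplexer
    start = offset & ~0x7                      # align to the 8-byte word
    return line.data[start:start + 8]

sets = [(Line(), Line()) for _ in range(512)]
sets[3] = (Line(), Line(valid=True, tag=5, data=bytes(range(64))))
word = read(sets, tag=5, index=3, offset=8)    # bytes 8..15 of the block
miss = read(sets, tag=6, index=3, offset=8)    # None: no matching tag
```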

11 Opteron Data Cache Example (Write). The first three steps are identical to the read operation if a cache block holds the word (write hit). Write miss: the Opteron uses write back, so the dirty bit determines whether the replaced cache block needs to be written to the next lower level through the victim buffer. The victim buffer is similar to a write buffer. Both read misses and write misses require the cache controller to replace a cache line. The Opteron uses an LRU replacement policy.

12 Separate Data and Instruction Cache. The CPU knows whether it is fetching an instruction or data and can therefore use separate caches for instructions and data. This doubles the available bandwidth. The two caches may be individually optimised in terms of capacity, block size, and associativity. We saw in Andrews and Baker's paper [AB06] (5th lecture) that each of the three simultaneous multithreading (SMT) cores of the XBOX 360 has a two-way set-associative L1 instruction cache and a four-way set-associative L1 data cache. Both caches are 32 Kbyte. The data cache is write-through and does not allocate cache blocks on write misses.

13 Basic Cache Optimisations. Cache optimisations can be classified as follows. Miss rate reduction: larger block size, larger cache size, higher associativity. Miss penalty reduction: multilevel caches, giving reads priority over writes. Hit time reduction: avoiding address translation when indexing the cache.

14 The Four Cs Model. The Four Cs model is used as a taxonomy of cache miss causes. Compulsory: the first access to a block. Capacity: a cache block must be replaced with a different block because the cache cannot hold all the blocks the program needs. Conflict: a cache block may have to be replaced because the cache is not fully associative. Coherency: cache flushes that keep multiple caches coherent in multiprocessors.
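
One common way to separate the first three categories is to compare the cache under study with a fully associative LRU cache of the same size. The sketch below assumes a direct-mapped cache and a made-up trace of block numbers; `classify_misses` is a hypothetical name, and coherency misses are not modelled because there is only one cache.

```python
from collections import OrderedDict

def classify_misses(trace, num_blocks):
    """Classify the misses of a direct-mapped cache with `num_blocks` blocks."""
    seen = set()                   # blocks referenced at least once
    full = OrderedDict()           # fully associative LRU cache, same size
    direct = [None] * num_blocks   # direct-mapped cache under study
    counts = {"compulsory": 0, "capacity": 0, "conflict": 0}
    for block in trace:
        full_hit = block in full
        if full_hit:
            full.move_to_end(block)
        else:
            if len(full) == num_blocks:
                full.popitem(last=False)   # evict the least recently used block
            full[block] = None
        slot = block % num_blocks
        if direct[slot] != block:          # miss in the direct-mapped cache
            direct[slot] = block
            if block not in seen:
                counts["compulsory"] += 1  # first access to the block
            elif not full_hit:
                counts["capacity"] += 1    # would miss even if fully associative
            else:
                counts["conflict"] += 1    # caused only by restricted placement
        seen.add(block)
    return counts

counts = classify_misses([0, 4, 0, 1, 2, 3, 8, 0], num_blocks=4)
# {'compulsory': 6, 'capacity': 1, 'conflict': 1}
```

In this trace the second access to block 0 is a conflict miss (blocks 0 and 4 map to the same slot), while the last access to block 0 is a capacity miss because five other blocks were touched in between.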

15 Cache Miss Rate Table. The percentages give each component's share of the total miss rate.

Cache size (KB)  Associativity  Compulsory  Capacity  Conflict
4                1-way          0.1%        72%       28%
4                2-way          0.1%        93%       7%
4                4-way          0.1%        99%       1%
4                8-way          0.1%        100%      0%
32               1-way          0.2%        89%       11%
32               2-way          0.2%        99%       0%
32               4-way          0.2%        100%      0%
32               8-way          0.2%        100%      0%

See page C-23 in Hennessy and Patterson [HP07] for the full table, including the absolute miss rates.

16 Textbook. Please read Appendix C, "Review of the Memory Hierarchy", in [HP07].

17 The XBOX System Architecture Paper. [AB06] Jeff Andrews and Nick Baker. Xbox 360 system architecture. IEEE Micro, 26(2):25-37, 2006.
