CACHE MEMORIES ADVANCED COMPUTER ARCHITECTURES. Slides by: Pedro Tomás


1 CACHE MEMORIES Slides by: Pedro Tomás Additional reading: Computer Architecture: A Quantitative Approach, 5th edition, Chapter 2 and Appendix B, John L. Hennessy and David A. Patterson, Morgan Kaufmann, 2011 ADVANCED COMPUTER ARCHITECTURES ARQUITECTURAS AVANÇADAS DE COMPUTADORES (AAC)

2 Outline
- Cache memories:
  - Revision of basic cache memories
  - Direct mapping caches
  - N-way associative and fully associative caches
  - Data replacement policies
  - Write policies
  - Hierarchical cache systems
- Optimizing caches: software and hardware techniques

3 Reducing memory access times
Locality principles:
- Temporal locality: if a given address is accessed, it is likely to be accessed again in the near future (e.g., program loops)
- Spatial locality: if a given address is accessed, its adjacent addresses are likely to be needed in the near future (e.g., sequential code, or accessing the elements of an n-dimensional array)
90/10 RULE: a program typically spends around 90% of its time executing the same 10% of its instructions!

4 Cache memory organization
A cache memory is usually composed of:
- A set of n ways (n parallel cache memories)
- Each way is composed of multiple lines of data, each line including:
  - A TAG value
  - A VALID bit that specifies:
    - Value 0: the block of data is invalid and corresponds to uninitialized data
    - Value 1: the block of data corresponds to valid data
  - A data BLOCK, typically composed of several data elements
  - Some additional CTRL fields for data replacement and write policies
[Figure: n cache ways (WAY 0 to WAY N-1); each line holds a TAG, a VALID bit, CTRL fields, and a data block of several words, checked against the address]

5 Mapping a memory in a small cache
Whenever a memory access is performed, the processor first checks the cache memory:
- The address is divided into TAG, INDEX and WORD SELECT fields
- The validity bit states whether the cache entry stores valid data (1) or values corresponding to an uninitialized cache (0)
- If the data is in the cache memory, retrieve it immediately
- If the data is not in the cache memory, fetch it from memory (or from a lower-level cache memory); once the data is received from the lower memory hierarchy, send the fetched data to the processor and also place it in the cache
[Figure: the INDEX selects a cache line; a cache hit requires the line to be valid and the stored TAG to match the address TAG]

6 Mapping a memory in a small cache
To check if data is in the cache, perform the following steps:
1. Divide the memory address into:
   - TAG (most significant bits)
   - OFFSET, or word/byte select (least significant bits)
   - INDEX (remaining bits)
2. Use the INDEX to select a row of the cache memory and retrieve the following fields:
   - TAG (tag of the data stored in cache)
   - VALID (validity of the cache entry)
   - DATA (actual data)
3. Check the cache entry for data presence:
   - If V=0, the entry is invalid: generate a cache miss
   - If V=1 AND entry TAG != address TAG, the data is not in the cache (it corresponds to some other memory position): generate a cache miss
   - If V=1 AND entry TAG = address TAG, fetch the data from the cache: generate a cache hit
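As an illustration of these three steps, the following is a minimal C sketch of a direct-mapped lookup. The cache geometry (8 lines of 8 bytes, 32-bit byte-addressed machine) and all names are assumptions chosen for the example, not a real API:

    #include <stdbool.h>
    #include <stdint.h>

    /* Hypothetical parameters: direct-mapped cache, 8 lines of 8 bytes. */
    #define NUM_LINES   8
    #define LINE_BYTES  8
    #define OFFSET_BITS 3                 /* log2(LINE_BYTES) */
    #define INDEX_BITS  3                 /* log2(NUM_LINES)  */

    typedef struct {
        bool     valid;
        uint32_t tag;
        uint8_t  data[LINE_BYTES];
    } cache_line_t;

    static cache_line_t cache[NUM_LINES];

    /* Step 1: split the address; step 2: select the line; step 3: compare. */
    bool cache_lookup(uint32_t addr, uint8_t *byte_out)
    {
        uint32_t offset = addr & (LINE_BYTES - 1);
        uint32_t index  = (addr >> OFFSET_BITS) & (NUM_LINES - 1);
        uint32_t tag    = addr >> (OFFSET_BITS + INDEX_BITS);

        cache_line_t *line = &cache[index];
        if (line->valid && line->tag == tag) {   /* V=1 AND tags match */
            *byte_out = line->data[offset];
            return true;                         /* cache hit  */
        }
        return false;                            /* cache miss */
    }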

7 Example
Consider the following code segment, where initially: R1=10F8h, R2=11F8h, R4=12F8h, R5=00F8h.
Assume instruction and data caches with the following characteristics:
- Direct mapping (1-way) caches
- 8 lines each, where each line has: instruction cache, 8 bytes; data cache, 32 bytes

120h Loop: L.D    F0,0(R1)
124h       L.D    F2,0(R2)
128h       MUL.D  F2,F0,F2
12Ch       DADD.D F2,F2,F4
130h       S.D    0(R4),F2
134h       DADD   R1,R1,#-8
138h       DADD   R2,R2,#-8
13Ch       DADD   R4,R4,#-8
140h       BNE    R1,R5,Loop

1. Divide the memory address word between the TAG, INDEX and OFFSET fields
2. Make a diagram of the cache verification process when accessing memory address 120h
3. Determine the cache size (in bytes)
4. Compute the data and instruction hit rates for the example program (assume that the cache is initially empty)
P.S.: assume that the memory is byte addressed

8 Associative caches
To decrease cache misses, multiple caching tables (ways) can be combined:
- An n-way associative cache looks for the target address in each of the n tables
- If the address is matched in one of the ways, the data is retrieved from that way
- If the value is not found, the data is fetched from memory (or from a lower-level cache memory) and placed in one of the ways (see replacement policies)
[Figure: the address TAG is compared in parallel against the selected line of each way; the per-way hit signals select which way supplies the data]

9 Associative caches
Addressing is made as if only one way existed. Thus, the number of bits of each addressing field is computed as:

  WORD SELECT = log2(LINE SIZE / MEM. WORD SIZE)
  INDEX       = log2(#CACHE LINES)
  TAG         = ADDR SIZE - (WORD SELECT + INDEX)

NOTE: the word select is often referred to as line offset or simply offset.
[Figure: the INDEX selects a line in every way and the TAG comparison in each way produces the per-way hit signals]
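A small C sketch of these field-width formulas, assuming illustrative values (32-bit addresses, 32-byte lines, 4-byte memory words, 8 lines per way); as in the slide's formula, any byte-within-word bits are folded into the word select:

    #include <math.h>
    #include <stdio.h>

    int main(void)
    {
        int addr_bits  = 32;   /* address size (assumption) */
        int line_bytes = 32;   /* line (block) size         */
        int word_bytes = 4;    /* memory word size          */
        int num_lines  = 8;    /* #cache lines per way      */

        int word_select = (int)log2(line_bytes / word_bytes);
        int index       = (int)log2(num_lines);
        int tag         = addr_bits - (word_select + index);

        printf("WORD SELECT = %d bits, INDEX = %d bits, TAG = %d bits\n",
               word_select, index, tag);
        return 0;
    }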

10 Associative caches
A fully associative cache is a particular case of associative caches, where each table (way) is composed of only one entry (line):
- The index field has 0 bits
In general, increased associativity (more ways) decreases the number of cache misses:
- Particular cases can be found where this is not true
Increased associativity also increases the total amount of hardware:
- Each way requires one comparator
- An N:1 multiplexer is required to select the output data
- It also increases the propagation delay
An advantage of direct mapping caches is that the actual data can be retrieved before data validation: if the data turns out to be invalid, it is ignored at a later pipeline stage.
[Figure: in a fully associative cache the address splits into TAG and WORD SELECT only]

11 Data replacement policies
- Direct mapping cache: there is only one possible location
- Associative caches may place the new data into any of the multiple ways:
  - Random: randomly select the way where the new data is inserted (thus discarding the old one)
  - Least Recently Used (LRU): select the line which hasn't been used for the longest time; implementing a true LRU algorithm is prohibitive with a high number of ways (typically > 4); pseudo-LRU uses only one bit to mark the most recently used line
  - Least Frequently Used (LFU): counts how many times each line was accessed
  - First-In First-Out (FIFO): replace the oldest line in the cache
[Figure: in a direct mapping cache (1 way) the target line is fixed; in a 2-way cache the line in either way is a possible target]
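To make the LRU idea concrete, here is a toy C sketch that tracks access order with a global counter; real hardware keeps compact ordering (or pseudo-LRU) bits instead, and all names here are illustrative:

    #include <stdint.h>

    #define NUM_WAYS 4

    typedef struct {
        uint64_t last_used;   /* pseudo-timestamp of the most recent access */
    } way_state_t;

    static uint64_t access_clock = 0;

    void touch(way_state_t set[NUM_WAYS], int way)
    {
        set[way].last_used = ++access_clock;   /* record the access order */
    }

    int lru_victim(const way_state_t set[NUM_WAYS])
    {
        int victim = 0;
        for (int w = 1; w < NUM_WAYS; w++)     /* pick the oldest way */
            if (set[w].last_used < set[victim].last_used)
                victim = w;
        return victim;
    }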

12 Data replacement policies: victim cache
A substituted line is likely to be needed in the near future (temporal locality principle). To overcome situations where the removed line is required again:
- Use a victim cache to store the last lines removed (substituted) from the cache
- When a cache miss occurs, check the victim cache before looking in the lower-hierarchy memory for the required block
- If the line is found in the victim cache, put it back in the (main) cache
Victim caches are typically fully associative caches:
- They work as a FIFO memory
- The AMD Athlon uses an 8-entry victim cache
Victim caches are especially useful when associated with small direct mapping caches: a victim cache with 4 entries can reduce the cache misses of a 4 KB associative cache by 25%.
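A minimal C sketch of the miss path with a victim cache, assuming a hypothetical 4-entry buffer indexed by the full line address (insertion/eviction of the buffer itself, which works as a FIFO, is omitted):

    #include <stdbool.h>
    #include <stdint.h>

    #define VICTIM_ENTRIES 4

    typedef struct { bool valid; uint32_t line_addr; } victim_entry_t;
    static victim_entry_t victim[VICTIM_ENTRIES];

    /* On a main-cache miss, probe every victim entry (a fully associative
     * search). A hit means the line is swapped back into the main cache
     * instead of being fetched from the lower memory level. */
    bool victim_probe(uint32_t line_addr)
    {
        for (int i = 0; i < VICTIM_ENTRIES; i++)
            if (victim[i].valid && victim[i].line_addr == line_addr)
                return true;   /* found: restore the line to the main cache */
        return false;          /* not found: access the lower-level memory  */
    }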

13 Data replacement policies: reducing the cache miss overhead
When the cache line includes more than one word (which is the most typical case), the processor does not need to wait for the whole line to be fetched:
- Early Restart: as soon as the required word is fetched from the lower-level memory, it is given to the processor, so that it can continue executing
- Critical Word First: start by fetching the required word from the lower-level memory and immediately give the value to the CPU; then load the remaining words (generally in a circular manner)
[Figure: within the cache line, (1) the critical word is fetched first, then (2) the remaining words]
This does not guarantee that the cache miss overhead is reduced: if an adjacent word is required immediately after (spatial locality principle), the processor may still stall.

14 Example 2
Consider the following code segment, where initially: R1=10F8h, R2=11F8h, R4=12F8h, R5=00F8h.
Assume an instruction cache with the following characteristics:
- Cache size of 32 B
- Lines of 4 bytes

120h Loop: L.D    F0,0(R1)
124h       L.D    F2,0(R2)
128h       MUL.D  F2,F0,F2
12Ch       DADD.D F2,F2,F4
130h       S.D    0(R4),F2
134h       DADD   R1,R1,#-8
138h       DADD   R2,R2,#-8
13Ch       DADD   R4,R4,#-8
140h       BNE    R1,R5,Loop

Determine the cache hit and miss rates considering:
1. A direct mapping cache
2. A fully associative cache with a Least Recently Used (LRU) substitution policy
P.S.: assume that the memory is byte addressed

15 Write policies
Write through:
- If the data is in the cache:
  - Write the new value to the cache
  - Write the new value to memory
- If the data is NOT in the cache:
  - Write the new value to memory
- Summary: data can be written both to the cache and to memory; the memory is always updated
Write back:
- If the data is in the cache:
  - Write the new value to the cache
- If the data is NOT in the cache:
  - Load the corresponding line into the cache and update it with the new word value
- Summary: data can only be written to the cache; data is only written to memory when the line is replaced
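A hypothetical C sketch contrasting the two policies on a write hit; the line structure and mem_write() are placeholder stubs for this illustration, not a real interface:

    #include <stdbool.h>
    #include <stdint.h>

    typedef struct { bool valid, dirty; uint32_t tag; uint32_t data[8]; } line_t;

    /* Stub standing in for a write to the lower memory level. */
    static void mem_write(uint32_t addr, uint32_t value) { (void)addr; (void)value; }

    void write_through_hit(line_t *line, uint32_t addr, int word, uint32_t value)
    {
        line->data[word] = value;    /* update the cache ...         */
        mem_write(addr, value);      /* ... and always update memory */
    }

    void write_back_hit(line_t *line, int word, uint32_t value)
    {
        line->data[word] = value;    /* update only the cache          */
        line->dirty = true;          /* memory is updated on eviction  */
    }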

16 Write policies: write back implementation
- Use an additional dirty bit to indicate whether any of the words in the line has been modified
- If the dirty bit is one, before replacing a line with new data, the cache controller must write the line back to memory (or to a lower cache hierarchy level)
[Figure: each line now holds TAG, V (valid bit), P (replacement policy bits), D (dirty bit) and the line data]

17 Write policies: write back implementation
A cache miss imposes a large overhead if the block to be replaced is dirty. Prioritize the accesses:
- Use a buffer to temporarily store the data that is going to be replaced
- First load the data from the lower-level memory
- Then store the dirty data to memory
[Figure: same cache structure as before, with P (replacement policy bits) and D (dirty bit) per line]

18 Write policies: write through implementation
A write through policy imposes a large latency on memory writes. Use a write buffer to decrease the latency:
- The processor writes to the buffer
- The cache controller writes the data to the cache and to the lower-level memory
[Figure: CPU -> write buffer -> cache -> memory (or lower-level cache)]
The write buffer works as a FIFO (First In, First Out) queue. It works well if:

  write frequency << 1 / (time to write to the lower-level cache)

Otherwise the write buffer will saturate! Saturation of the write buffer can be prevented by:
- Increasing its size
- Increasing the bandwidth to the lower cache level (e.g., by introducing more caching levels)
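A toy C sketch of such a write buffer as a fixed-depth ring FIFO; the depth and entry format are illustrative assumptions:

    #include <stdbool.h>
    #include <stdint.h>

    #define WB_DEPTH 4

    typedef struct { uint32_t addr, value; } wb_entry_t;

    static wb_entry_t wb[WB_DEPTH];
    static int wb_head = 0, wb_count = 0;

    bool wb_push(uint32_t addr, uint32_t value)   /* CPU side */
    {
        if (wb_count == WB_DEPTH)
            return false;                         /* saturated: CPU must stall */
        wb[(wb_head + wb_count) % WB_DEPTH] = (wb_entry_t){ addr, value };
        wb_count++;
        return true;
    }

    bool wb_pop(wb_entry_t *out)                  /* memory side: drains FIFO */
    {
        if (wb_count == 0)
            return false;
        *out = wb[wb_head];
        wb_head = (wb_head + 1) % WB_DEPTH;
        wb_count--;
        return true;
    }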

19 Write policies: write through implementation
To decrease the latency of memory writes, group writes to the lower-level memory in the write buffer:
- Data with adjacent addresses are written together (to allow bursting)
- Grouping is typically performed when a new value is added to the buffer
The write buffer may generate RAW conflicts if the data is in the write buffer but not in the cache:
- Solution 1: drain the write buffer before loading the value
- Solution 2: also check whether the data is in the write buffer

20 Write allocation policies
- Write allocate: if the write data is not in the cache, allocate it (i.e., generate a cache miss to force the line to be brought into the cache)
- Write not allocate: if the write data is not in the cache, do not attempt to fetch the line from memory
Naturally, only the combined policies make sense:
- Write back + write allocate
- Write through + write not allocate

21 Example 3
Consider the following code segment, where initially: R1=10F8h, R2=11F8h, R4=12F8h, R5=00F8h.
Assume a data cache with the following characteristics:
- 2-way associative cache
- 8 lines, each line with 32 bytes

120h Loop: L.D    F0,0(R1)
124h       L.D    F2,0(R2)
128h       MUL.D  F2,F0,F2
12Ch       DADD.D F2,F2,F4
130h       S.D    0(R4),F2
134h       DADD   R1,R1,#-8
138h       DADD   R2,R2,#-8
13Ch       DADD   R4,R4,#-8
140h       BNE    R1,R5,Loop

Determine the cache hit and miss rates considering:
1. A write through, write not allocate policy
2. A write back, write allocate policy
3. A write back, write allocate policy using a 4-entry victim cache
P.S.: assume that the memory is byte addressed

22 Example 4
Consider the execution of a segment of code on the MIPS 32 processor:
- Assume that, by tracing the execution of the program, you obtain the memory accesses listed below
- Assume that all memory accesses correspond to 32-bit word loads/stores
- Assume a 256 B, direct mapping data cache with lines of 16 B

LOAD  ( D4h)
LOAD  ( E4h)
STORE ( D8h)
LOAD  ( E8h)
LOAD  ( E0h)
STORE ( h)
LOAD  ( Ch)
LOAD  ( Dh)

Assuming that the cache is initially empty, indicate the data in the cache after the execution of the program, for the following policies:
1. A write through, write not allocate policy
2. A write back, write allocate policy
P.S.: assume that the memory is byte addressed

23 Causes of misses
- Compulsory: first reference to a block (line)
- Capacity: blocks discarded and later retrieved
- Conflict: the program makes repeated references to multiple addresses from different blocks that map to the same location in the cache

24 Cache miss impact on memory access time: using a single cache
The average memory access time depends on:
- The cache access latency: the time to check if the required data is in the cache
- The cache miss rate (MR)
- The cache miss penalty: the time to retrieve the data from a lower-level memory

  T_MEM_ACCESS = T_CACHE_LATENCY + MR x T_MISS_PENALTY

Example: what is the average memory access time considering:
- A processor running at 2 GHz
- A cache access latency of 5 clock cycles, with an average hit rate of 90%
- An average memory access latency of 100 clock cycles
Solution: (5 + 0.10 x 100) x 0.5 ns = 15 x 0.5 ns = 7.5 ns
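A worked version of this example as a short C program (the numbers are the slide's; the variable names are illustrative):

    #include <stdio.h>

    int main(void)
    {
        double t_cache   = 5.0;     /* cache access latency, cycles     */
        double miss_rate = 0.10;    /* 90% hit rate -> 10% miss rate    */
        double t_penalty = 100.0;   /* miss penalty, cycles             */
        double cycle_ns  = 0.5;     /* 1 / 2 GHz                        */

        double amat = t_cache + miss_rate * t_penalty;   /* = 15 cycles */
        printf("AMAT = %.1f cycles = %.2f ns\n", amat, amat * cycle_ns);
        return 0;
    }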

25 Cache miss impact on memory access time: using a multi-level cache
Adding additional cache levels decreases the average access time. For an n-level cache hierarchy:

  T_MEM_ACCESS = T_L1_LATENCY + MR_L1 x T_L2_ACCESS
  T_L2_ACCESS  = T_L2_LATENCY + MR_L2 x T_L3_ACCESS
  ...
  T_Ln_ACCESS  = T_Ln_LATENCY + MR_Ln x T_MEM

For a 3-level cache hierarchy:

  T_ACCESS = T_L1_LAT + MR_L1 x (T_L2_LAT + MR_L2 x (T_L3_LAT + MR_L3 x T_MEM))

Example: what is the average memory access time considering:
- A processor running at 2 GHz
- An L1 cache with a latency of 3 clock cycles and an average hit rate of 80%
- An L2 cache with a latency of 10 clock cycles and an average hit rate of 60%
- An L3 cache with a latency of 20 clock cycles and an average hit rate of 50%
- An average memory access latency of 100 clock cycles
Solution: (3 + 0.2 x (10 + 0.4 x (20 + 0.5 x 100))) x 0.5 ns = 10.6 x 0.5 ns = 5.3 ns
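The same 3-level example computed bottom-up in C, following the recurrence above (each level's access time is its latency plus its miss rate times the next level's access time):

    #include <stdio.h>

    int main(void)
    {
        double t_mem = 100.0;
        double t_l3  = 20.0 + 0.5 * t_mem;   /* 50% L3 hit rate -> 70.0  */
        double t_l2  = 10.0 + 0.4 * t_l3;    /* 60% L2 hit rate -> 38.0  */
        double t_l1  =  3.0 + 0.2 * t_l2;    /* 80% L1 hit rate -> 10.6  */

        printf("AMAT = %.1f cycles = %.2f ns\n", t_l1, t_l1 * 0.5);
        return 0;
    }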

26 Cache miss impact on memory access time: using a multi-level cache
To reduce the memory access time:
- Decrease the L1 cache size to decrease the initial latency (even though it increases the L1 miss rate)
- Add a second-level (L2) cache, where:
  - L2 memory size >> L1 memory size
  - L2 associativity (#ways) > L1 associativity
  - L2 access time > L1 access time (typically around 3 times higher)
  - Global L2 miss rate < L1 miss rate

  Global L2 Miss Rate = #L2 misses / #memory accesses
                      = (#memory accesses x MR_L1 x MR_L2) / #memory accesses
                      = MR_L1 x MR_L2

27 Multi-level cache organization
If a block of data is in L1, should it also be in L2?
- Multi-level inclusion: a block in L1 is always in L2
  - Whenever an L1 miss occurs, the line (block) is loaded from L2 into L1
  - Worthwhile when the L2 cache is much larger than the L1 cache
- Multi-level exclusion: a block in L1 is never in L2
  - Whenever an L1 miss occurs, the block is swapped between L2 and L1; the swapped-out block is placed in L2
  - Requires the L1 and L2 line (block) sizes to be equal
  - Worthwhile when the L2 cache is not much larger than the L1 cache

28 Multi-level cache organization: instruction and data caches
Separating the data and instruction caches:
- The processor can simultaneously access the L1 instruction cache to fetch an instruction and the L1 data cache for a load/store operation
  - Avoids structural conflicts
- The data and instruction caches can be individually optimized
  - Capacity (memory size), block size, associativity (#ways)
- Solves cache miss conflicts: an instruction word can never occupy a data line
- Increases capacity conflicts: each cache has a fixed size

29 Hardware optimizations

30 Cache optimizations
- Increased block size
  - Decreases compulsory misses
  - Increases capacity and conflict misses; increases the miss penalty
- Increased associativity
  - Reduces conflict misses
  - Increases the hit time and power consumption
- Increased total cache size
  - Increases the cache access latency and power consumption
- Higher number of cache levels
  - Reduces the average memory access time

31 Cache optimizations
- Give priority to reads over writes
  - Reduces the miss penalty
- Pipelined cache access
  - Increases the overall cache bandwidth
  - Makes it easier to increase associativity
- Multi-banked caches
  - Use multiple (interleaved) memory banks to store the data
  - Allow multiple simultaneous cache accesses

32 Cache optimizations
- Way prediction
  - Use extra control bits to predict which way holds the required address
  - Reduces the cache latency, but requires data invalidation whenever the prediction is wrong
  - Used in the ARM Cortex-A8
- Data prefetching
  - Detect patterns in the memory accesses to predict the address of the next load instruction
  - Useful in stream-like applications
  - Allows hiding the memory access latency
  - Current Intel architectures use multiple data prefetchers

33 Software optimizations

34 Software optimizations: instruction cache
Techniques for reducing instruction cache misses: instruction re-ordering to reduce the miss rate [McFarling89]:
- For a direct mapping instruction cache of 2 kB with lines of 4 B, reduces the miss rate by 50%
- For a direct mapping instruction cache of 8 kB with lines of 4 B, reduces the miss rate by 75%
Simple examples:
- Align sub-routines to the beginning of a cache line
- Apply loop unrolling only as long as the instructions fit in the cache
Modern processors include special registers (typically referred to as performance counters) that can count the number of cache misses; other events can also be measured (e.g., the number of FP operations).

35 Software optimizations: data cache
- Array composition
  - Join arrays in memory such that memory is accessed sequentially
  - Exploits spatial locality
- Loop interchange (see the sketch below)
  - Swap nested loops to access memory in sequential order
  - Typically used when performing vector/matrix operations
  - Exploits spatial locality
- Blocking (see the sketch below)
  - Instead of accessing entire rows or columns, subdivide the matrices into blocks
  - Requires more memory accesses, but improves the locality of the accesses
  - Exploits both spatial and temporal locality
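A C sketch of the last two techniques; the matrix size N and block size B are illustrative assumptions (N is chosen divisible by B to keep the tiling loops simple):

    #define N 1024
    #define B 64

    static double a[N][N];

    /* Loop interchange: C stores arrays row-major, so the second version
     * walks memory with unit stride and exploits spatial locality. */
    double column_order_sum(void)     /* poor locality: stride-N accesses   */
    {
        double s = 0.0;
        for (int j = 0; j < N; j++)
            for (int i = 0; i < N; i++)
                s += a[i][j];
        return s;
    }

    double row_order_sum(void)        /* good locality: unit-stride accesses */
    {
        double s = 0.0;
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                s += a[i][j];
        return s;
    }

    /* Blocking: operate on B x B tiles so each tile stays cache-resident
     * while it is being reused. */
    void blocked_transpose(double dst[N][N], double src[N][N])
    {
        for (int ii = 0; ii < N; ii += B)
            for (int jj = 0; jj < N; jj += B)
                for (int i = ii; i < ii + B; i++)
                    for (int j = jj; j < jj + B; j++)
                        dst[j][i] = src[i][j];
    }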

36 Software optimizations: data cache
Data pre-fetching:
- Register prefetching: insert load instructions before the data is actually used (requires extra registers)
- Cache prefetching: use special instructions to load data into the cache (see the sketch below)
These techniques are usually combined with loop unrolling and software pipelining.
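As an example of cache prefetching, the GCC/Clang __builtin_prefetch intrinsic emits such a special prefetch instruction where the target supports one; the prefetch distance (16 elements ahead) is a tuning assumption for this sketch:

    void scale(const double *x, double *y, long n, double k)
    {
        for (long i = 0; i < n; i++) {
            if (i + 16 < n)
                /* rw=0 (prefetch for read), locality=1 (low temporal reuse) */
                __builtin_prefetch(&x[i + 16], 0, 1);
            y[i] = k * x[i];
        }
    }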

37 Next lesson
More on memory systems: virtual memory
