Microprocessor System Design
Lecture 8: Principle of Locality, Cache Architecture, Cache Replacement Policies
Zeshan Chishti
Electrical and Computer Engineering Dept
Maseeh College of Engineering and Computer Science
Source: Lecture based on materials provided by Mark F.
Cache Topics
- Cache Basics: memory vs. processor performance; the memory hierarchy (registers, SRAM, DRAM, disk); spatial and temporal locality; cache hits and misses
- Cache Organization: direct-mapped caches; two-way, four-way caches; fully associative (N-way) caches; sector-mapped caches
- Cache Line Replacement Algorithms
- Cache Performance and Performance Improvements
- Cache Coherence
- Intel Cache Evolution
- Multicore Caches
- Cache Design Issues
The Problem: The Memory Wall
[Figure: relative performance gains by year, log scale from 1 to 100,000; CPU performance has grown far faster than memory performance, and the gap widens every year.]
From Hennessy & Patterson, Computer Architecture: A Quantitative Approach (4th edition)
Memory System Design Tradeoffs
A big challenge in memory system design is to provide a sufficiently large memory capacity, with reasonable speed, at an affordable cost.
- SRAM: complex basic cell circuit => fast access, but high cost per bit
- DRAM: simpler basic cell circuit => lower cost per bit, but slower than SRAM; DRAMs provide more storage than SRAM, but less than what is necessary
- Flash memory and magnetic disks: disks provide a large amount of storage, but are much slower than DRAMs
No single memory technology can provide both large capacity and fast speed at an affordable cost.
A Solution: Memory Hierarchy
From Hennessy & Patterson, Computer Architecture: A Quantitative Approach (4th edition)
- Processor (datapath, control, registers): intermediate results
- On-chip cache, second-level cache (SRAM), third-level cache (SRAM): cached DRAM contents (instructions and data)
- Main memory (DRAM): paging, cached files
- Secondary storage (disk): file system
- Tertiary storage (tape): archive and backup
Intel Pentium 4 3.2 GHz Server: Access Speed (time for data to be returned)
- Registers: 1 cycle = 0.3 nanoseconds
- L1 Cache: 3 cycles = 1 nanosecond
- L2 Cache: 20 cycles = 7 nanoseconds
- L3 Cache: 40 cycles = 13 nanoseconds
- Memory: 300 cycles = 100 nanoseconds
How is the Hierarchy Managed?
- Registers <-> Memory: compiler (and programmer)
- Cache <-> Memory: hardware
- Memory <-> Disk: operating system (virtual memory: paging) and programmer (file system)
Principle of Locality
- Analysis of programs shows that many instructions in localized areas of a program are executed repeatedly during some period of time, while other instructions are executed relatively infrequently. The frequently executed instructions may be the ones in a loop, a nested loop, or a few procedures calling each other repeatedly. This is called locality of reference.
- Temporal locality of reference: a recently executed instruction is likely to be executed again very soon; recently accessed data is likely to be accessed again very soon.
- Spatial locality of reference: instructions/data with addresses close to a recently accessed instruction/data are likely to be accessed soon.
- A cache is designed to take advantage of both types of locality of reference.
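As a minimal sketch (the function and data are illustrative, not from the lecture), a simple array traversal shows both kinds of locality at once:

```python
# Hypothetical example: a sequential loop exhibits both forms of locality
# that a cache exploits.
def sum_array(a):
    total = 0        # 'total' is reused every iteration: temporal locality
    for x in a:      # consecutive elements sit at adjacent addresses, so one
        total += x   # cache-line fill services many subsequent reads: spatial locality
    return total

print(sum_array(list(range(10))))  # → 45
```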
Use of a Cache Memory
[Processor <-> Cache <-> Main memory]
- A cache is a small but fast SRAM inserted between the processor and main memory.
- Data in a cache is organized at the granularity of cache blocks.
- When the processor issues a request for a memory address, an entire block (e.g., 64 bytes) is transferred from main memory to the cache.
- Later references to the same address can be serviced by the cache (temporal locality).
- References to other addresses in this block can also be serviced by the cache (spatial locality).
- Higher locality => more requests serviced by the cache.
Caching: Student Advising Analogy
- Thousands of student folders, indexed by 9-digit student ID, located up the stairs and down the hall: a long walk.
- Space for 100 file folders at my desk, located at my side: short access time.
Cache Organization: How is the Cache Laid Out?
- The cache is made up of a number of cache lines (sometimes called blocks).
- Data is hauled into the cache from memory in chunks (a chunk may be smaller than a line).
- If the CPU requests 4 bytes of data, the cache gets the entire line (32/64/128 bytes). Spatial locality says you're likely to need that data anyway, so the cost is incurred only once rather than each time the CPU needs a piece of data.
- Example: the Pentium 4 Xeon's Level 1 data cache contains 8K bytes and its cache lines are each 64 bytes. This gives 8192 bytes / 64 bytes = 128 cache lines.
Simple Direct-Mapped Cache
[Figure: a 32-bit address with bits 3:0 used as the index, selecting one of 16 cache lines ("sets"), each holding data.]
- Use the least-significant 4 bits of the address to determine which slot to cache the data in.
- But 2^28 different addresses could have their data cached in the same spot.
Simple Direct-Mapped Cache (cont'd)
[Figure: address bits 31:4 form the tag and bits 3:0 the index into 16 sets; each set holds a valid bit, a tag, and data.]
- Need to store the tag to be sure the data is for this address and not another.
- Only need to store the address minus the index bits: 28 bits in this example.
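The address split above can be sketched in a few lines. This is a hypothetical helper matching the slide's simplified parameters (16 sets, no block-offset bits; a real cache would also carve out offset bits for the bytes within a line):

```python
# Assumed parameters from the example: 32-bit address, 16 sets,
# so index = bits [3:0] and tag = bits [31:4].
INDEX_BITS = 4

def split_address(addr):
    index = addr & ((1 << INDEX_BITS) - 1)  # least-significant 4 bits pick the set
    tag = addr >> INDEX_BITS                # remaining 28 bits are stored as the tag
    return tag, index

tag, index = split_address(0xDEADBEE5)
print(hex(tag), index)  # tag = 0xdeadbee, index = 5
```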
Cache Hits and Misses
When the processor needs to access some data, that data may or may not be found in the cache.
- If the data is found in the cache, it is called a cache hit.
  - Read hit: the processor reads data from the cache and does not need to go to memory.
  - Write hit: since the cache holds a replica of the contents of main memory, both the cache and main memory need to be updated.
- If the data is not found in the cache, it is called a cache miss. The block containing the data is transferred from memory to cache, and after the block is transferred, the desired data is forwarded to the processor. Alternatively, the desired data may be forwarded to the processor as soon as it arrives, without waiting for the entire cache block to be transferred; this is called load-through or critical word first.
Cache Behavior: Reads

    if Valid bit clear:           /* slot empty: cache miss */
        stall CPU
        read cache line from memory
        set Valid bit
        write Tag bits
        deliver data to CPU
    else:                         /* slot occupied */
        if Tag bits match:        /* cache hit! */
            deliver data to CPU
        else:                     /* occupied by another: cache miss */
            stall CPU
            cast out existing cache line (the "victim")
            read cache line from memory
            write Tag bits
            deliver data to CPU
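The read pseudocode above can be sketched as a tiny direct-mapped cache simulation. All names are illustrative; a dict stands in for DRAM, and the 16-set, 28-bit-tag split follows the earlier example:

```python
# Minimal direct-mapped cache read, following the slide's pseudocode.
NUM_SETS = 16

class Line:
    def __init__(self):
        self.valid = False
        self.tag = None
        self.data = None

cache = [Line() for _ in range(NUM_SETS)]

def read(addr, memory):
    tag, index = addr >> 4, addr & 0xF
    line = cache[index]
    if line.valid and line.tag == tag:   # cache hit: deliver data to CPU
        return line.data, "hit"
    # cache miss: (implicitly) cast out any victim, then fill from memory
    line.valid, line.tag = True, tag
    line.data = memory[addr]
    return line.data, "miss"

mem = {0x10: "A", 0x20: "B"}
print(read(0x10, mem))  # ('A', 'miss')  compulsory miss
print(read(0x10, mem))  # ('A', 'hit')   temporal locality pays off
print(read(0x20, mem))  # ('B', 'miss')  same index 0: conflict, casts out 0x10
```

Note that 0x10 and 0x20 share index 0 but have different tags, so the third read evicts the first line even though the rest of the cache is empty: exactly the conflict problem the later slides address with associativity.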
Cache Behavior: Writes
Policy decisions for all writes:
- Write Through: replace data in the cache and in memory. Requires a write buffer to be effective, which allows the CPU to continue without waiting for DRAM.
- Write Back: replace data in the cache only. Requires the addition of a dirty bit in the tag/valid memory. Write back to memory only when a cache flush is performed, or when the line becomes a victim and is cast out.
Policy decision for a write miss:
- Write Allocate: place the data into the cache.
- Write No Allocate (or Write Around): don't place the data in the cache. Philosophy: successive writes (without an intervening read) are unlikely. This saves not only the cache-line fill for the requested line but also the possibility of casting out a line which is more likely to be used later.
Write Buffer for Write-Through
[Processor -> Cache; Processor -> Write Buffer -> DRAM]
- A write buffer is needed between the cache and memory when using a write-through policy, to avoid having the processor wait.
- The processor writes data into the cache and the write buffer; the memory controller writes the contents of the buffer to memory.
- The write buffer is just a FIFO (Intel: posted write buffer, PWB). A small depth suffices because the store frequency is much less than 1/(DRAM write cycle).
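A minimal sketch of such a posted write buffer, assuming a dict stands in for DRAM and an invented depth of 4 (class and method names are hypothetical):

```python
from collections import deque

# Sketch of a posted write buffer for a write-through cache: the processor
# deposits (address, data) pairs and continues; the memory controller later
# drains them to DRAM in FIFO order.
class WriteBuffer:
    def __init__(self, memory, depth=4):
        self.memory = memory   # dict standing in for DRAM
        self.fifo = deque()
        self.depth = depth

    def post(self, addr, data):
        if len(self.fifo) == self.depth:  # buffer full: the CPU would have to
            self.drain_one()              # stall until one entry retires
        self.fifo.append((addr, data))

    def drain_one(self):
        addr, data = self.fifo.popleft()  # FIFO order preserves write ordering
        self.memory[addr] = data

    def flush(self):
        while self.fifo:
            self.drain_one()

dram = {}
wb = WriteBuffer(dram)
wb.post(0x100, 42)   # CPU continues immediately; DRAM not yet updated
wb.flush()
print(dram[0x100])   # → 42
```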
Cache Behavior: Writes (cont'd)

    if Valid bit set:              /* slot occupied */
        if Tag bits match:         /* cache hit! */
            write data to cache
            write data to memory (write-through) or set dirty bit for cache line (write-back)
        else:                      /* occupied by another */
            stall CPU
            cast out existing cache line (the "victim")
            read cache line from memory   /* assumes write allocate; why read? see next slide */
            write Tag bits
            write data to cache
            write data to memory (write-through) or set dirty bit for cache line (write-back)
    else:                          /* slot empty */
        stall CPU
        read cache line from memory       /* assumes write allocate */
        write Tag bits
        set Valid bit
        write data to cache
        write data to memory (write-through) or set dirty bit for cache line (write-back)
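The write path can be sketched in the same style as the read simulation. This is a simplified model, not the lecture's exact design: it assumes write-allocate on a miss (as the pseudocode does), uses the 16-set/28-bit-tag split from the earlier example, and treats the whole line as one data value:

```python
# Hypothetical write path under the two policies. Each line tracks
# valid, tag, data, and a dirty bit (used only by write-back).
def write(cache, memory, addr, data, policy="write_back"):
    tag, index = addr >> 4, addr & 0xF
    line = cache[index]
    if not (line["valid"] and line["tag"] == tag):   # write miss
        if line["valid"] and line["dirty"]:          # cast out dirty victim first
            memory[(line["tag"] << 4) | index] = line["data"]
        line.update(valid=True, tag=tag,             # write allocate: fill line
                    data=memory.get(addr), dirty=False)
    line["data"] = data                              # write data to cache
    if policy == "write_through":
        memory[addr] = data                          # update DRAM immediately
    else:
        line["dirty"] = True                         # defer DRAM update to cast-out

cache = [{"valid": False, "tag": None, "data": None, "dirty": False}
         for _ in range(16)]
mem = {0x35: "old"}
write(cache, mem, 0x35, "new", policy="write_back")
print(mem[0x35], cache[5]["dirty"])  # → old True  (memory not yet updated)
```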
Why Read a Cache Line for a Write?
[Figure: dirty bit, valid bit, and tag alongside a cache line, with only a few bytes of the line being written.]
- The data being written by the CPU is smaller than a cache line, and the write misses in the cache.
- There is only a single valid bit and one set of tag bits for the entire line.
- A subsequent read operation must find valid data for the rest of the cache line, so the line must be filled from memory before the partial write.
Casting Out a Victim
What happens depends on the write policy:
- Write Through: the data in the cache isn't the only current copy (memory is up to date). Just overwrite the victim cache line with the new cache line (and change the tag bits).
- Write Back: must check the dirty bit to see if the victim cache line has been modified. If so, the victim cache line must be written back to memory.
This can lead to interesting behavior:
- A CPU read can cause a memory write followed by a read: write back the dirty cache line (victim), then read the new cache line.
- A CPU write can also cause a memory write followed by a read: write back the dirty cache line (victim), then read the new cache line into which the data will be written.
What If...
- A cache miss occurs and the cache location at the cache index is occupied. This is called a cache conflict or collision.
- Action: cast out the existing entry (the "victim") and replace it with the new entry.
- ...but what if we need that earlier entry again? Repeatedly evicting and re-fetching lines that map to the same slot is called thrashing.
- Solution: N-way set-associative caches, which simultaneously hold two (or more) lines that would have been forced to share the same place in a direct-mapped cache.
Cache Organization: How Does the Cache Manage the Cache Lines?
Associativity describes how data is stored in the cache:
- Direct mapped (associativity = 1): each set has a single line. If it's in the cache, there's only one place it could be.
- N-way associativity: each set contains N lines, so there are N places ("ways") the line could be.
- Fully associative: all cache lines share the same possible places.
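As a bridge to the replacement policies named in the lecture title, here is a minimal sketch of one set of an N-way set-associative cache with least-recently-used (LRU) replacement. The class and names are illustrative; an OrderedDict keeps the lines in recency order:

```python
from collections import OrderedDict

# One set of an N-way set-associative cache with LRU replacement:
# the set holds up to N tags, and on a conflict the least-recently-used
# line is chosen as the victim.
class CacheSet:
    def __init__(self, ways=2):
        self.ways = ways
        self.lines = OrderedDict()   # tag -> data, oldest (LRU) first

    def access(self, tag, fill_data=None):
        if tag in self.lines:                 # hit: promote to MRU position
            self.lines.move_to_end(tag)
            return "hit"
        if len(self.lines) == self.ways:      # set full: evict the LRU victim
            self.lines.popitem(last=False)
        self.lines[tag] = fill_data           # miss: fill the line
        return "miss"

s = CacheSet(ways=2)
print([s.access(t) for t in (1, 2, 1, 3, 2)])
# ['miss', 'miss', 'hit', 'miss', 'miss']
# When tag 3 arrives, tag 2 is LRU (tag 1 was just re-used), so 2 is evicted
# and the final access to 2 misses again.
```

With two ways, tags 1 and 2 coexist instead of thrashing as they would in a direct-mapped cache; the misses above come only from capacity within the set.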