EC 513 Computer Architecture Cache Organization Prof. Michel A. Kinsy
The course has 4 modules:
Module 1: Instruction Set Architecture (ISA); Simple Pipelining and Hazards
Module 2: Superscalar Architectures; Vector Machines; VLIW; Multithreading; GPUs
Module 3: Branch Prediction; Caches; Memory Models & Synchronization; Cache Coherence Protocols
Module 4: On-Chip Networks; On-Chip Network Routing
Architecture Taxonomy
Processor organizations:
Single instruction, single data stream (SISD): uniprocessor
Single instruction, multiple data streams (SIMD): vector processor, array processor
Multiple instruction, single data stream (MISD)
Multiple instruction, multiple data streams (MIMD): shared memory (tightly coupled), i.e. symmetric multiprocessor (SMP) or nonuniform memory access (NUMA); distributed memory (loosely coupled), i.e. cluster
Parallelism paradigms: instruction-level, data-level, and task-level parallelism
CPU-Memory Bottleneck
The performance of high-speed computers is usually limited by memory bandwidth and latency.
Latency (time for a single access): memory access time >> processor cycle time
Bandwidth (number of accesses per unit time): if a fraction m of instructions access memory, there are 1+m memory references per instruction, so CPI = 1 requires 1+m memory references per cycle (the ghost of the stored-program architecture)
Processor-Memory Gap
Performance gap: CPU (55% each year) vs. DRAM (7% each year)
Processor operations take on the order of 1 ns; memory access requires 10s or even 100s of ns
Each instruction executed involves at least one memory access
[Figure: processor vs. DRAM performance, 1980-2000; microprocessors improve 60%/year (Moore's Law) while DRAM improves 7%/year, so the processor-memory performance gap grows about 50%/year. From David Patterson, UC Berkeley]
Processor-DRAM Gap (Latency)
A four-issue 2 GHz superscalar accessing 100 ns DRAM could execute 800 instructions during the time for one memory access!
[Figure: same processor vs. DRAM performance plot, 1980-2000; the gap grows about 50%/year. From David Patterson, UC Berkeley]
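The 800-instruction figure follows directly from the stated parameters; a quick sketch of the arithmetic (all numbers taken from the slide):

```python
# Back-of-envelope check of the slide's claim:
# a 4-issue, 2 GHz superscalar waiting on a 100 ns DRAM access.
issue_width = 4           # instructions issued per cycle
clock_hz = 2e9            # 2 GHz, i.e. 0.5 ns per cycle
dram_latency_s = 100e-9   # 100 ns per memory access

cycles_per_access = dram_latency_s * clock_hz        # 200 cycles
lost_instructions = issue_width * cycles_per_access  # 800 instructions
print(int(cycles_per_access), int(lost_instructions))
```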
Memory Trends
The fastest memories are expensive and thus not very large.
Level: Capacity / Access Time / Cost (per GB)
Registers: 100s of bytes / <1 ns / $millions
L1 cache: 10s of KB / a few ns / $100s of thousands (exchanges 4-8 bytes, one word, with registers)
Ln cache: MBs / 10s of ns / $10s of thousands (exchanges 8-32 bytes, one block, with L1)
Main memory: 100s of MB / 100s of ns / $1000s (exchanges 1 to 4 blocks with Ln)
Secondary memory: 10s of GB / 10s of ms (exchanges 1,024+ bytes, a disk sector = page, with main memory)
Illustrative View of Memory Organization
A fast memory can help bridge the CPU-memory gap; the fastest memories are expensive and thus not very large.
On-chip components: control, datapath (ALU, RegFile), instruction cache, data cache
Beyond the core: second-level cache (SRAM), main memory (DRAM), secondary memory (disk)
Intel Core i7 Organization
Intel Haswell
Memory Technology
Early machines used a variety of memory technologies:
The Manchester Mark I used CRT memory storage
EDVAC used a mercury delay line
Core memory was the first large-scale reliable main memory, invented by Forrester in the late 1940s at MIT for the Whirlwind project; bits were stored as magnetization polarity on small ferrite cores threaded onto a 2-dimensional grid of wires
Memory Technology
The first commercial DRAM was the Intel 1103: 1 Kbit of storage on a single chip, with the charge on a capacitor used to hold each value
Semiconductor memory quickly replaced core in the 1970s; Intel was formed to exploit the market for semiconductor memory
Phase-change memory (PCM) looks promising for the future: slightly slower, but much denser than DRAM, and non-volatile
Memory Technology
Random Access Memory (RAM): any byte of memory can be accessed without touching the preceding bytes. RAM is the most common type of memory found in computers and other digital devices. There are two main types of RAM:
DRAM (Dynamic Random Access Memory): needs to be refreshed regularly (~every 8 ms), which consumes 1% to 2% of the DRAM's active cycles; used for main memory
SRAM (Static Random Access Memory): content lasts until power is turned off; low density (6-transistor cells), high power, expensive, fast; used for caches
RAM Organization
One memory row holds a block of data, so the column address selects the requested bit or word from that block.
[Figure: a 2^N x 2^M memory array; the N+M-bit address splits into N row bits (the row address decoder drives 2^N word lines) and M column bits (the column decoder and sense amplifiers select among the bit lines); each memory cell stores one bit, data out D]
DRAM Architecture
[Figure: a 2^N x 2^M array; row address decoder and word lines select a row, column decoder and sense amplifiers select from the bit lines, each memory cell stores one bit, data out D]
Modern chips have around 4 logical banks on each chip; each logical bank is physically implemented as many smaller arrays.
RAM Organization
One memory row holds a block of data, so the column address selects the requested bit or word from that block.
RAS (Row Access Strobe): triggers the row decoder
CAS (Column Access Strobe): triggers the column selector
[Figure: a 2M x 16 SRAM chip with a 21-bit address, chip select, output enable, and write enable signals, and 16-bit Din[15-0] and Dout[15-0] buses]
RAM Organization
Latency: time to access one word
Access time: time between the request and when the data is available (or written)
Cycle time: time between requests; usually cycle time > access time
Bandwidth: how much data the memory can supply to the processor per unit time = width of the data channel x the rate at which it can be used
Typical Memory Reference Patterns
[Figure: address vs. time. Instruction fetches advance linearly and repeat across n loop iterations; stack accesses rise and fall with subroutine call, subroutine return, and argument access; data accesses cluster around scalar accesses]
A Typical Memory Hierarchy
Multi-ported register file (part of the CPU)
Split instruction and data primary caches (on-chip SRAM): L1 instruction cache and L1 data cache
Large unified secondary cache (on-chip SRAM): unified L2 cache
Multiple interleaved memory banks (off-chip DRAM)
DMA to disks / external memory / devices / others
Definition of a Cache
A cache is simply a copy of a small segment of data residing in main memory:
Fast but small extra memory
Holds identical copies of main memory contents
Lower latency, higher bandwidth
Usually organized in several levels (1, 2, and 3)
Caching & Cache Structures
[Figure: the processor sends an address to the cache and receives data; the cache in turn exchanges addresses and data with main memory. Each cache line holds an address tag (e.g., 100, 304, 6848, 416) and a data block of several data bytes; the line tagged 100 holds copies of main memory locations 100, 101, and so on]
Multilevel Caches
The cache is transparent to the user (it operates automatically). Words move between the CPU register file and the cache; lines move between the cache and main memory.
Data is in the cache a fraction h of the time; we go to main memory 1 - h of the time.
For a cache with hit rate h, the effective access time is:
C_eff = h * C_fast + (1 - h) * (C_slow + C_fast) = C_fast + (1 - h) * C_slow
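The effective access time formula can be sketched as a small calculation; the hit rate and timings below are hypothetical illustration values, not from the slides:

```python
def effective_access_time(h, c_fast, c_slow):
    """C_eff = h*C_fast + (1-h)*(C_slow + C_fast) = C_fast + (1-h)*C_slow:
    every access pays the fast lookup; misses additionally pay the slow memory."""
    return c_fast + (1 - h) * c_slow

# Hypothetical numbers: 1 ns cache, 100 ns main memory, 95% hit rate.
print(effective_access_time(0.95, 1.0, 100.0))  # ~6 ns
```

Even a 5% miss rate multiplies the average access time by roughly six, which is why hit rate dominates cache design.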
Caches
This organization works because most programs exhibit locality:
The principle of temporal locality says that if a program accesses one memory address, there is a good chance that it will access the same address in the near future.
The principle of spatial locality says that if a program accesses one memory address, there is a good chance that it will also access other nearby addresses.
(Hierarchy: CPU -> L1 -> L2 -> DRAM)
Caching Principles
The cache contains copies of some main memory contents: the storage locations most recently used.
When main memory address A is referenced by the CPU, the cache is checked for a copy of the contents of A:
If found (cache hit): the copy is used; no need to access main memory
If not found (cache miss): main memory is accessed to get the contents of A, and a copy of the contents is also loaded into the cache
Caching Principles
Cache size (in bytes or words): total cache capacity. A larger cache can hold more of the program's useful data but is more costly and likely to be slower.
Block or cache-line size: unit of data transfer between the cache and main memory. With a larger cache line, more data is brought into the cache with each miss; this can improve the hit rate but may also bring low-utility data into the cache.
Caching Principles
Placement policy: determines where an incoming cache line is stored. More flexible policies imply higher hardware cost and may or may not have performance benefits (due to more complex data location).
Replacement policy: determines which of several existing cache blocks (into which a new cache line can be mapped) should be overwritten. Typical policies: choosing a random block or the least recently used block.
Caching Principles
Compulsory misses: with on-demand fetching, the first access to any item is a miss.
Capacity misses: we have to evict some items to make room for others; this leads to misses that are not incurred with an infinitely large cache.
Conflict misses: the placement scheme may force us to displace useful items to bring in other items, which may lead to misses in the future.
Caching Principles
Line width (2^W): too small a value of W causes a lot of main memory accesses; too large a value increases the miss penalty and may tie up cache space with low-utility items that are replaced before being used.
Set size or associativity (2^S): direct mapping (S = 0) is simple and fast; greater associativity leads to more complexity, and thus slower access, but tends to reduce conflict misses.
Cache Algorithm (Read)
Look at the processor address and search the cache tags to find a match. Then either:
Found in cache, a.k.a. HIT: return a copy of the data from the cache
Not in cache, a.k.a. MISS: read the block of data from main memory (wait), return the data to the processor, and update the cache. Q: Which line do we replace?
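The read algorithm above can be sketched with a toy direct-mapped, tags-only model; the line count and block size below are arbitrary illustration choices:

```python
class DirectMappedCache:
    """Toy direct-mapped cache (tags only, no data) illustrating the read
    algorithm: the index selects one line, and a tag match means HIT."""
    def __init__(self, num_lines, block_size):
        self.num_lines = num_lines
        self.block_size = block_size
        self.tags = [None] * num_lines   # None = invalid line

    def access(self, address):
        block = address // self.block_size
        index = block % self.num_lines   # which line to check
        tag = block // self.num_lines    # identifies the memory region
        if self.tags[index] == tag:
            return "HIT"
        self.tags[index] = tag           # miss: fetch block, update cache
        return "MISS"

cache = DirectMappedCache(num_lines=8, block_size=16)
print([cache.access(a) for a in (0, 4, 128, 0)])
```

With 8 lines of 16 bytes (128 bytes total), addresses 0 and 128 map to the same line, so the final access to 0 misses again: a conflict miss.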
Caches
Local miss rate = misses in this cache / accesses to this cache
Global miss rate = misses in this cache / total CPU memory accesses
Misses per instruction = misses in this cache / number of instructions
(Hierarchy: CPU -> L1 -> L2 -> DRAM)
Cache Performance Metrics
Cache miss rate: number of cache misses divided by number of accesses
Cache hit time: time between sending the address and data returning from the cache
Cache miss latency: time between sending the address and data returning from the next-level cache/memory
Cache miss penalty: extra processor stall caused by the next-level cache/memory access
Average Memory Access Time
Average Memory Access Time (AMAT): AMAT = Hit time + (Miss rate x Miss penalty)
Memory stall cycles = Memory accesses x Miss rate x Miss penalty
CPU time = (CPU execution cycles + Memory stall cycles) x Cycle time
CPI = ideal CPI + average stalls per instruction
With L1 and L2 caches:
AMAT = Hit Time_L1 + Miss Rate_L1 x Miss Penalty_L1
Miss Penalty_L1 = Hit Time_L2 + Miss Rate_L2 x Miss Penalty_L2
AMAT = Hit Time_L1 + Miss Rate_L1 x (Hit Time_L2 + Miss Rate_L2 x Miss Penalty_L2)
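The two-level AMAT expansion can be sketched as one function; all the numbers below are hypothetical illustration values:

```python
def amat_two_level(hit_l1, mr_l1, hit_l2, mr_l2, penalty_l2):
    """AMAT = HitTime_L1 + MissRate_L1 * (HitTime_L2 + MissRate_L2 * MissPenalty_L2),
    obtained by substituting the L2 expression for the L1 miss penalty."""
    return hit_l1 + mr_l1 * (hit_l2 + mr_l2 * penalty_l2)

# Hypothetical: 1-cycle L1 hit, 10% L1 miss rate,
# 10-cycle L2 hit, 20% L2 miss rate, 100-cycle memory penalty.
print(amat_two_level(1, 0.10, 10, 0.20, 100))  # ~4 cycles
```

Note that the 0.20 here is L2's local miss rate (misses per L2 access), matching the formula's structure.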
Placement Policy
[Figure: memory blocks 0-31 mapping into an 8-block cache (set numbers 0-3 for the 2-way case). Memory block 12 can be placed:
Fully associative: anywhere
(2-way) set associative: anywhere in set 0 (12 mod 4)
Direct mapped: only into block 4 (12 mod 8)]
Address Bit-Field Partitioning
The address (e.g., 32-bit) issued by the CPU is generally divided into 3 fields:
Tag (t bits): serves as the unique identifier for a group of data. Different regions of memory may be mapped to the same cache location/block; the tag is used to differentiate between them.
Index (k bits): used to index into the cache structure.
Block offset (b bits): the least significant bits are used to determine the exact data word. If the block size is B, then b = log2(B) bits are needed in the address to specify the data word.
Address layout: | Tag (t bits) | Index (k bits) | Block offset (b bits) |
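Extracting the three fields is just shifting and masking; a sketch assuming the tag/index/offset layout above (the concrete widths in the example are illustrative):

```python
def split_address(addr, t_bits, k_bits, b_bits):
    """Split an address into (tag, index, block offset), assuming the
    layout | tag (t bits) | index (k bits) | offset (b bits) |."""
    offset = addr & ((1 << b_bits) - 1)            # low b bits
    index = (addr >> b_bits) & ((1 << k_bits) - 1)  # next k bits
    tag = addr >> (b_bits + k_bits)                 # remaining high bits
    return tag, index, offset

# Illustrative: 32-bit address, 64 B blocks (b=6), 256 sets (k=8), so t=18.
print(split_address(0x12345678, t_bits=18, k_bits=8, b_bits=6))
```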
Direct-Mapped Cache
[Figure: the address splits into tag (t bits), index (k bits), and block offset (b bits). The index selects one of 2^k lines, each holding a valid bit V, a tag, and a data block. The stored tag is compared with the address tag; a match (with V set) signals HIT, and the offset selects the data word or byte]
Direct-Map Address Selection
[Figure: a variant with the index field in the high-order bits (index, then tag, then block offset); the index still selects one of 2^k lines, and the stored tag is compared with the address tag to signal HIT and select the data word or byte]
Hashed Address Selection
[Figure: the address (excluding the block offset) is hashed to form the index into the 2^k lines; the full t-bit tag is stored and compared to signal HIT and select the data word or byte]
2-Way Set-Associative Cache
[Figure: the address splits into tag (t bits), index (k bits), and block offset (b bits). The index selects a set containing two ways, each with a valid bit, tag, and data block. Both stored tags are compared with the address tag in parallel; a match in either way signals HIT, and a multiplexer selects the data word or byte from the hitting way]
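The parallel tag compare can be sketched as a lookup over the ways of one set (tags only; sizes are illustrative, and the hardware comparison happens in parallel rather than in a loop):

```python
def set_assoc_lookup(sets, address, block_size):
    """Look up an address in a 2-way set-associative tag array:
    the index selects a set, and every way's tag is compared."""
    block = address // block_size
    index = block % len(sets)
    tag = block // len(sets)
    ways = sets[index]          # e.g. [tag_way0, tag_way1]
    return "HIT" if tag in ways else "MISS"

# 4 sets of 2 ways, 16 B blocks; set 0 holds tags 0 and 2 (blocks 0 and 8).
sets = [[0, 2], [None, None], [None, None], [None, None]]
print(set_assoc_lookup(sets, 0, 16),
      set_assoc_lookup(sets, 128, 16),
      set_assoc_lookup(sets, 64, 16))
```

Addresses 0 and 128 coexist here even though they map to the same set, illustrating how associativity removes the conflict a direct-mapped cache would suffer.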
Fully Associative Cache
[Figure: there is no index field; the address is just tag (t bits) and block offset (b bits). Every line's stored tag is compared with the address tag in parallel; any match signals HIT and selects the data word or byte]
Write Performance
[Figure: direct-mapped cache with tag (t bits), index (k bits), and block offset (b bits) over 2^k lines; the write enable (WE) of the data array is gated by the HIT signal from the tag comparison, so the data write waits on the tag check]
Improving Cache Performance
Average memory access time = Hit time + Miss rate x Miss penalty
To improve performance:
Reduce the hit time
Reduce the miss rate (e.g., larger cache)
Reduce the miss penalty (e.g., L2 cache)
What is the simplest design strategy? The biggest cache that doesn't increase hit time past 1-2 cycles (approx. 8-32 KB in modern technology)
Effect of Cache on Performance
Larger cache size: reduces capacity misses (and can reduce conflict misses); hit time will increase
Higher associativity: reduces conflict misses; may increase hit time
Larger block size: reduces compulsory misses and exploits burst transfers in memory and on buses; increases miss penalty and conflict misses
Replacement Policy
Which block from a set should be evicted?
Random
Least Recently Used (LRU): LRU cache state must be updated on every access; a true implementation is only feasible for small sets (2-way); a pseudo-LRU binary tree is often used for 4-8 ways
First In, First Out (FIFO), a.k.a. Round-Robin: used in highly associative caches
Not Least Recently Used (NLRU): FIFO with an exception for the most recently used block or blocks
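True LRU for one set can be sketched with an ordered map; real hardware tracks recency with per-set state bits updated on every access, which is exactly why it only scales to small associativities:

```python
from collections import OrderedDict

class LRUSet:
    """LRU replacement for a single cache set, tags only."""
    def __init__(self, ways):
        self.ways = ways
        self.lines = OrderedDict()  # tag -> True, least recent first

    def access(self, tag):
        if tag in self.lines:
            self.lines.move_to_end(tag)      # mark most recently used
            return "HIT"
        if len(self.lines) >= self.ways:
            self.lines.popitem(last=False)   # evict least recently used
        self.lines[tag] = True
        return "MISS"

s = LRUSet(ways=2)
print([s.access(t) for t in ("A", "B", "A", "C", "B")])
```

After the hit on A, B becomes the LRU line, so inserting C evicts B; the final access to B then misses.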
Reducing Write Hit Time
Problem: writes take two cycles in the memory stage, one cycle for the tag check plus one cycle for the data write if it hits.
Solutions:
Design a data RAM that can perform a read and a write in one cycle, restoring the old value after a tag miss
Fully associative (CAM tag) caches: the word line is only enabled on a hit
Pipelined writes: hold the store's write data in a single buffer ahead of the cache, and write the cache data during the next store's tag check
Victim Caches
A victim cache is a small associative backup cache, added to a direct-mapped cache, which holds recently evicted lines:
1. First look up in the direct-mapped cache
2. If miss, look in the victim cache
3. If hit in the victim cache, swap the hit line with the line now evicted from L1
4. If miss in the victim cache, L1 victim -> VC, VC victim -> ? (where?)
This gives the fast hit time of direct mapping but with reduced conflict misses.
[Figure: CPU/RF and L1 data cache backed by a unified L2 cache, with a fully associative 4-block victim cache receiving evicted data from L1 and supplying hit data on L1 misses]
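The four-step lookup can be sketched as follows; for simplicity the "addresses" here are already block numbers and the victim cache is a 4-entry FIFO, both simplifications of real hardware:

```python
from collections import deque

def access(l1, victim, block, vc_blocks=4):
    """Direct-mapped L1 (list of block numbers indexed by set) with a
    small fully associative victim cache (deque of block numbers)."""
    index = block % len(l1)
    if l1[index] == block:                 # 1. look up in L1
        return "L1 HIT"
    if block in victim:                    # 2./3. hit in VC: swap lines
        victim.remove(block)
        if l1[index] is not None:
            victim.append(l1[index])
        l1[index] = block
        return "VC HIT"
    if l1[index] is not None:              # 4. miss: L1 victim -> VC
        victim.append(l1[index])
        if len(victim) > vc_blocks:
            victim.popleft()               # VC victim leaves the hierarchy
    l1[index] = block
    return "MISS"

l1, vc = [None] * 8, deque()
# Blocks 0 and 8 conflict in set 0 of the direct-mapped L1:
print([access(l1, vc, b) for b in (0, 8, 0, 8)])
```

After the first round trip, the two conflicting blocks ping-pong between L1 and the victim cache and both keep hitting, instead of missing on every alternation.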
Delayed Write Timing
[Figure: pipeline timing diagram for the sequence LD 0, ST 1, ST 2, LD 3, ST 4, LD 5; each store occupies the tag array for its check while its data waits in the delayed-write buffer and is written into the data array during a later store's tag-check cycle, and loads read the data array directly]
Pipelining Cache Writes
Data from a store hit is written into the data portion of the cache during the tag access of a subsequent store.
[Figure: the address and store data from the CPU split into tag, index, and store data; delayed-write address and delayed-write data registers sit between the tag and data arrays, so one store's data write overlaps the next store's tag check. On a load whose address matches the delayed write, the buffered data is bypassed to the CPU]
Write Policy Choices
On a cache hit:
Write-through: write both cache and memory; generally higher traffic, but simplifies cache coherence
Write-back: write the cache only (memory is written only when the entry is evicted); a dirty bit per block can further reduce the traffic
Write Policy Choices
On a cache miss:
No-write-allocate: only write to main memory
Write-allocate (a.k.a. fetch-on-write): fetch the block into the cache
Common combinations: write-through with no-write-allocate; write-back with write-allocate
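A minimal sketch of why write-back reduces traffic: counting main-memory write transactions for repeated stores to one resident block (write hits only; a real model would also handle misses and allocation):

```python
def memory_writes(num_stores, policy):
    """Count main-memory write transactions for num_stores write hits
    to a single cached block under each policy."""
    if policy == "write-through":
        return num_stores      # every store also goes to memory
    # write-back: stores only set the dirty bit; one writeback on eviction
    return 1 if num_stores > 0 else 0

print(memory_writes(10, "write-through"), memory_writes(10, "write-back"))  # 10 1
```

Ten stores to the same block cost ten memory writes under write-through but a single eventual writeback under write-back, at the price of coherence complexity.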
Reducing Read Miss Penalty
[Figure: CPU/RF -> data cache -> write buffer -> unified L2 cache. The write buffer holds evicted dirty lines for a write-back cache, or all writes for a write-through cache]
Problem: the write buffer may hold the updated value of a location needed by a read miss (a RAW data hazard).
Stall: on a read miss, wait for the write buffer to go empty
Bypass: check write buffer addresses against the read miss address; if no match, allow the read miss to go ahead of the writes, else return the value in the write buffer
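The bypass check can be sketched as an address match against pending writes; modeling the write buffer as a plain list of (address, value) pairs is a simplification of the hardware's associative search:

```python
def read_miss_action(write_buffer, miss_address):
    """Bypass policy: a read miss may go ahead of buffered writes only if
    its address matches no pending write; otherwise the buffered value is
    forwarded, avoiding the RAW hazard."""
    for addr, value in reversed(write_buffer):  # newest pending write wins
        if addr == miss_address:
            return ("forward", value)
    return ("go to memory", None)

wb = [(0x100, 7), (0x200, 9)]
print(read_miss_action(wb, 0x200))  # ('forward', 9)
print(read_miss_action(wb, 0x300))  # ('go to memory', None)
```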
Prefetching
Speculate on future instruction and data accesses and fetch them into the cache(s):
Instruction accesses are easier to predict than data accesses
Varieties of prefetching: hardware prefetching, software prefetching, mixed schemes
What types of misses does prefetching affect?
Issues in Prefetching
Usefulness: should produce hits
Timeliness: not late and not too early
Cache and bandwidth pollution
Most recently: security / side-channel issues
[Figure: CPU/RF with L1 instruction and L1 data caches backed by a unified L2 cache holding the prefetched data]
Hardware Instruction Prefetching
Instruction prefetch in the Alpha AXP 21064:
Fetch two blocks on a miss: the requested block (i) and the next consecutive block (i+1)
The requested block is placed in the cache, and the next block in an instruction stream buffer
If there is a miss in the cache but a hit in the stream buffer, move the stream buffer block into the cache and prefetch the next block (i+2)
[Figure: the unified L2 cache returns the requested block to the L1 instruction cache and the prefetched instruction block to the stream buffer]
Hardware Data Prefetching
Prefetch-on-miss: prefetch block b+1 upon a miss on block b
One-Block Lookahead (OBL) scheme: initiate a prefetch for block b+1 when block b is accessed. Why is this different from doubling the block size? Can extend to N-block lookahead.
Strided prefetch: if we observe a sequence of accesses to blocks b, b+N, b+2N, then prefetch b+3N, etc.
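The stride rule can be sketched directly; a real strided prefetcher would track candidate strides per load PC in a reference prediction table, which this sketch omits:

```python
def next_prefetch(recent_blocks):
    """Strided prefetch sketch: if the last three block numbers form an
    arithmetic sequence b, b+N, b+2N, predict and prefetch b+3N."""
    if len(recent_blocks) < 3:
        return None
    b0, b1, b2 = recent_blocks[-3:]
    stride = b1 - b0
    if stride != 0 and b2 - b1 == stride:
        return b2 + stride
    return None

print(next_prefetch([4, 7, 10]))  # stride 3 detected -> prefetch block 13
print(next_prefetch([4, 7, 9]))   # no constant stride -> None
```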
Itanium-2 On-Chip Caches
Level 1: 16 KB, 4-way set-associative, 64 B line, quad-port (2 load + 2 store), single-cycle latency
Level 2: 256 KB, 4-way set-associative, 128 B line, quad-port (4 load or 4 store), five-cycle latency
Level 3: 3 MB, 12-way set-associative, 128 B line, single 32 B port, twelve-cycle latency
Next Class Advanced Memory Operations