Department of Electrical Engineering, Feng-Chia University

Chapter 5  Large and Fast: Exploiting Memory Hierarchy (Part 1)
王振傑 (Chen-Chieh Wang)
ccwang@mail.ee.ncku.edu.tw

Outline
5.1 Introduction
5.2 The Basics of Caches
5.3 Measuring and Improving Cache Performance
Since 1980, CPU performance has outpaced DRAM performance, with the gap growing about 50% per year.
Q. How do architects address this gap?
A. Put smaller, faster cache memories between the CPU and DRAM; create a memory hierarchy.

Memories: Review
SRAM (Static Random Access Memory):
  value is stored on a pair of inverting gates
  very fast, but takes up more space than DRAM (4 to 6 transistors per bit)
DRAM (Dynamic Random Access Memory):
  value is stored as a charge on a capacitor (must be refreshed)
  very small, but slower than SRAM (by a factor of 5 to 10)
[Figure: DRAM cell and memory array, showing activation and precharge. Source: http://en.wikipedia.org/wiki/dram]

Exploiting Memory Hierarchy
Users want large and fast memories!
  SRAM access times are 0.5-5 ns, at a cost of $2,000 to $5,000 per GB.
  DRAM access times are 50-70 ns, at a cost of $20 to $75 per GB.
  Disk access times are 5 to 20 million ns, at a cost of $0.20 to $2 per GB.
  (figures as of 2008)
Try to give it to them anyway: build a memory hierarchy.
[Figure: CPU with registers and SRAM cache, connected through an interconnect and memory controller to DRAM main memory and input/output devices]
The Principle of Locality
A principle that makes having a memory hierarchy a good idea.
Two different types of locality:
  Temporal locality (locality in time): if an item is referenced, it will tend to be referenced again soon (e.g., loops, data reuse)
  Spatial locality (locality in space): if an item is referenced, items whose addresses are close by will tend to be referenced soon (e.g., straight-line code, array accesses)
Our initial focus: two levels (upper, lower)
  Block (aka line): the minimum unit of data transferred between levels
  Hit: the data requested is present in the upper level
  Miss: the data requested is not present in the upper level

Outline
5.1 Introduction
5.2 The Basics of Caches
5.3 Measuring and Improving Cache Performance
Cache Organization
  Direct mapped
  Fully associative
  Set associative

Cache: Two Issues
  How do we know if a data item is in the cache?
  If it is, how do we find it?
Our first example: block size is one word of data, "direct mapped"
  For each item of data at the lower level, there is exactly one location in the cache where it might be;
  i.e., lots of items at the lower level share one location in the upper level.
Direct Mapped Cache
Mapping: cache index = block address modulo the number of blocks in the cache
[Figure: an 8-block cache (indices 000-111) mapped from memory; memory blocks 00001, 00101, 01001, 01101, 10001, 10101, 11001, and 11101 map to cache indices 001 and 101]
For MIPS (32-bit address, 1024 one-word blocks):
[Figure: address bits 31-12 form the 20-bit tag, bits 11-2 the 10-bit index, bits 1-0 the byte offset; each of the 1024 cache entries holds a valid bit, a 20-bit tag, and 32 bits of data; a hit occurs when the valid bit is set and the stored tag equals the address tag]
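The MIPS address split described above can be sketched as follows; this is an illustrative helper (the function name and defaults are my own), assuming the 1024-block, one-word-per-block configuration:

```python
# Sketch of the MIPS direct-mapped lookup: 1024 one-word blocks,
# so bits [1:0] are the byte offset, bits [11:2] the index,
# and bits [31:12] the tag.
def split_address(addr, index_bits=10, offset_bits=2):
    """Split a 32-bit address into (tag, index, byte_offset)."""
    offset = addr & ((1 << offset_bits) - 1)
    index = (addr >> offset_bits) & ((1 << index_bits) - 1)
    tag = addr >> (offset_bits + index_bits)
    return tag, index, offset

# The word at byte address 0x1004 has offset 0, index 1, tag 1:
tag, index, offset = split_address(0x00001004)
```

A hit then requires that the entry at `index` is valid and its stored tag equals `tag`.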
Tags and Valid Bits
How do we know which particular block is stored in a cache location?
  Store the block address as well as the data.
  Actually, only the high-order bits are needed: these are called the tag.
What if there is no data in a location?
  Valid bit: 1 = present, 0 = not present
  Initially 0

Direct Mapped Cache
Taking advantage of spatial locality: use multiword blocks.
[Figure: a direct-mapped cache with multiword blocks; the address gains a block-offset field that selects a word within the block]
4-Way Set Associative Cache
[Figure: 32-bit address split into a 22-bit tag (bits 31-10), an 8-bit index (bits 9-2), and a byte offset; the index selects one of 256 sets, each holding four (valid, tag, data) entries; the four tag comparators drive a 4-to-1 multiplexor that selects the hit data]

Example: Direct Mapped Cache
  32-bit address
  Cache size = 64 KByte
  Block size = 32 Byte
  Direct mapped
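The field widths for the direct-mapped example above work out as follows; a small sketch of the arithmetic, with the variable names my own:

```python
# 64 KB direct-mapped cache with 32-byte blocks on a 32-bit address.
cache_size = 64 * 1024                        # bytes
block_size = 32                               # bytes
num_blocks = cache_size // block_size         # 2048 blocks
offset_bits = block_size.bit_length() - 1     # log2(32)   = 5
index_bits = num_blocks.bit_length() - 1      # log2(2048) = 11
tag_bits = 32 - index_bits - offset_bits      # 16
```

So each cache entry stores a 16-bit tag plus a valid bit alongside its 32-byte block.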
Example: Set Associative Cache
  32-bit address
  Cache size = 64 KByte
  Block size = 32 Byte
  2-way set associative

Block Size Considerations
Larger blocks should reduce the miss rate
  due to spatial locality.
But in a fixed-sized cache:
  larger blocks mean fewer of them
    → more competition → increased miss rate
  larger blocks can also mean pollution.
Larger blocks also mean a larger miss penalty
  which can override the benefit of the reduced miss rate;
  early restart and critical-word-first can help.
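The 2-way set-associative example earlier (64 KB, 32-byte blocks) can be worked out the same way; a sketch, with my own variable names:

```python
# 64 KB cache, 32-byte blocks, 2 ways, 32-bit address.
cache_size = 64 * 1024
block_size = 32
ways = 2
num_sets = cache_size // (block_size * ways)  # 1024 sets
offset_bits = block_size.bit_length() - 1     # log2(32)   = 5
index_bits = num_sets.bit_length() - 1        # log2(1024) = 10
tag_bits = 32 - index_bits - offset_bits      # 17
```

Doubling the associativity halves the number of sets, so one index bit moves into the tag (17 bits instead of the 16 of the direct-mapped version).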
Line Size and Locality
Increasing the block size tends to decrease the miss rate:
[Figure: miss rate vs. block size (4 to 256 bytes) for cache sizes of 1 KB, 8 KB, 16 KB, 64 KB, and 256 KB; miss rate falls as blocks grow, then rises again for small caches with very large blocks]
Use split caches because there is more spatial locality in code:

  Program   Block size   Instruction   Data        Effective combined
            in words     miss rate     miss rate   miss rate
  gcc       1            6.1%          2.1%        5.4%
  gcc       4            2.0%          1.7%        1.9%
  spice     1            1.2%          1.3%        1.2%
  spice     4            0.3%          0.6%        0.4%

Cache Misses
On a cache hit, the CPU proceeds normally.
On a cache miss:
  stall the CPU pipeline
  fetch the block from the next level of the hierarchy
  instruction cache miss: restart the instruction fetch
  data cache miss: complete the data access
Hits vs. Misses
Memory references fall into four cases:
  Read hit
  Read miss
  Write hit: write-through or write-back
  Write miss: write-allocate or write-around

Write-Through
On a data-write hit, we could just update the block in the cache
  but then the cache and memory would be inconsistent.
Write-through: also update memory
  but this makes writes take longer.
  e.g., if base CPI = 1, 10% of instructions are stores, and a write to memory takes 100 cycles:
  Effective CPI = 1 + 0.1 × 100 = 11
Solution: write buffer
  holds data waiting to be written to memory
  CPU continues immediately
  only stalls on a write if the write buffer is already full
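The write-through penalty above is just the stall cycles added per instruction; checking the arithmetic:

```python
# Effective CPI of naive write-through: every store stalls
# for a full memory write.
base_cpi = 1.0
store_fraction = 0.10     # 10% of instructions are stores
write_cycles = 100        # cycles per memory write

effective_cpi = base_cpi + store_fraction * write_cycles
# effective_cpi comes to 11 cycles per instruction: writes alone
# slow the machine down 11x, motivating the write buffer.
```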
Write Buffers for Write-Through Caches
[Figure: processor → cache → lower-level memory, with a write buffer between the cache and lower-level memory holding data awaiting write-through]
Q. Why a write buffer?
A. So the CPU doesn't stall on every write.
Q. Why a buffer, why not just one register?
A. Bursts of writes are common.
Q. Are read-after-write (RAW) hazards an issue for the write buffer?
A. Yes! Either drain the buffer before the next read, or check the write buffer first and send the read on only after the check.

Write-Back
Alternative: on a data-write hit, just update the block in the cache
  keep track of whether each block is dirty.
When a dirty block is replaced:
  write it back to memory
  a write buffer can be used to allow the replacing block to be read first.
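The RAW check above can be illustrated with a toy model, reads consult the pending writes before going to memory; this is purely illustrative (class and method names are my own), not the hardware design:

```python
from collections import OrderedDict

class WriteBuffer:
    """Toy write buffer: pending writes are checked on every read."""
    def __init__(self, depth=4):
        self.entries = OrderedDict()  # address -> pending data
        self.depth = depth

    def write(self, addr, data, memory):
        if len(self.entries) >= self.depth:
            self.drain(memory)        # a real CPU would stall here
        self.entries[addr] = data

    def read(self, addr, memory):
        # RAW hazard check: a pending write wins over stale memory.
        if addr in self.entries:
            return self.entries[addr]
        return memory.get(addr, 0)

    def drain(self, memory):
        memory.update(self.entries)   # writes reach memory in FIFO order
        self.entries.clear()

memory = {}
wb = WriteBuffer()
wb.write(0x100, 42, memory)
value = wb.read(0x100, memory)  # served from the buffer, not memory
```

Without the check in `read`, the load would fetch a stale value from memory while the correct data still sat in the buffer.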
Replacement Policy
Random:
  candidate blocks are randomly selected, possibly with some hardware assistance.
Least recently used (LRU):
  the block replaced is the one that has been unused for the longest time.
First in, first out (FIFO):
  because true LRU can be complicated to compute, FIFO approximates it by replacing the oldest block rather than the least recently used one.

Main Memory Supporting Caches
Use DRAMs for main memory
  fixed width (e.g., 1 word)
  connected by a fixed-width clocked bus
  the bus clock is typically slower than the CPU clock.
Example cache block read:
  1 bus cycle for the address transfer
  15 bus cycles per DRAM access
  1 bus cycle per data transfer
For a 4-word block and a 1-word-wide DRAM:
  miss penalty = 1 + 4 × 15 + 4 × 1 = 65 bus cycles
  bandwidth = 16 bytes / 65 cycles ≈ 0.25 bytes/cycle
Hardware Issues
Make reading multiple words easier by using banks of memory.
2-word-wide memory:
  miss penalty = 1 + 2 × 15 + 2 × 1 = 33 bus cycles
  bandwidth = 16 bytes / 33 cycles ≈ 0.48 bytes/cycle
4-bank interleaved memory:
  miss penalty = 1 + 15 + 4 × 1 = 20 bus cycles
  bandwidth = 16 bytes / 20 cycles = 0.8 bytes/cycle
[Figure: non-interleaved memory stores words 0, 1, 2, ... sequentially in one bank; 4-way interleaved memory spreads word addresses round-robin across banks 0-3 (bank 0 holds words 0, 4, 8, 12, ...), so the read/write accesses of a block can be pipelined]
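The three miss-penalty calculations above follow one pattern; a sketch of that pattern (the function and its parameters are my own, and it assumes banks fully overlap their DRAM accesses, as in the interleaved case above):

```python
# Miss penalty for a 4-word block read: 1 address cycle,
# 15 cycles per DRAM access, 1 bus cycle per transfer.
def miss_penalty(words=4, width=1, banks=1):
    accesses = words // width               # DRAM accesses needed
    dram_cycles = (accesses // banks) * 15  # banks overlap accesses
    transfers = words // width              # bus transfers
    return 1 + dram_cycles + transfers

one_word   = miss_penalty(width=1, banks=1)  # 1 + 60 + 4 = 65 cycles
two_word   = miss_penalty(width=2, banks=1)  # 1 + 30 + 2 = 33 cycles
interleave = miss_penalty(width=1, banks=4)  # 1 + 15 + 4 = 20 cycles
```

Interleaving gets most of the benefit of a wide memory without widening the bus: the four DRAM accesses overlap, and only the transfers remain serialized.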
Advanced DRAM Organization
Bits in a DRAM are organized as a rectangular array
  a DRAM access reads an entire row
  burst mode: supply successive words from a row with reduced latency.
Double data rate (DDR) DRAM:
  transfers on both the rising and falling clock edges.
Quad data rate (QDR) DRAM:
  separate DDR inputs and outputs.

Outline
5.1 Introduction
5.2 The Basics of Caches
5.3 Measuring and Improving Cache Performance
Performance
Simplified model:
  execution time = (execution cycles + stall cycles) × cycle time
  stall cycles = number of instructions × miss ratio × miss penalty
Two ways of improving performance:
  decreasing the miss ratio
  decreasing the miss penalty
What happens if we increase the block size?

Cache Performance Example
Given:
  I-cache miss rate = 2%
  D-cache miss rate = 4%
  miss penalty = 100 cycles
  base CPI (ideal cache) = 2
  loads & stores are 36% of instructions
Miss cycles per instruction:
  I-cache: 0.02 × 100 = 2
  D-cache: 0.36 × 0.04 × 100 = 1.44
Actual CPI = 2 + 2 + 1.44 = 5.44
  The ideal-cache CPU is 5.44/2 = 2.72 times faster.
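The example above, worked step by step (variable names are my own):

```python
# Cache performance example: stall cycles per instruction
# from each cache, added onto the ideal-cache CPI.
base_cpi = 2.0
i_miss_rate, d_miss_rate = 0.02, 0.04
miss_penalty = 100
mem_fraction = 0.36   # loads and stores per instruction

i_stall = i_miss_rate * miss_penalty                 # 2.0
d_stall = mem_fraction * d_miss_rate * miss_penalty  # 1.44
actual_cpi = base_cpi + i_stall + d_stall            # 5.44
speedup_of_ideal = actual_cpi / base_cpi             # 2.72
```

Note that the D-cache term is scaled by the fraction of instructions that access data memory, while every instruction accesses the I-cache.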
Decreasing Miss Ratio with Associativity

Decreasing Miss Penalty with Multilevel Caches
Add a second-level cache:
  often the primary cache is on the same chip as the processor
  use SRAMs to add another cache between the primary cache and main memory (DRAM)
  the miss penalty goes down if the data is found in the 2nd-level cache.
Using multilevel caches:
  try to optimize the hit time on the 1st-level cache
  try to optimize the miss rate on the 2nd-level cache
Performance of Multilevel Caches
Example:
  CPI of 1.0 on a 4 GHz machine with a 2% miss rate and 100 ns DRAM access.
  Adding a 2nd-level cache with a 5 ns access time decreases the miss rate to main memory to 0.5%.

Interactions with Software
Misses depend on memory access patterns
  algorithm behavior
  compiler optimization for memory accesses
Difficult to predict the best algorithm: experimental data is needed.
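A worked version of the multilevel example above, assuming the 0.5% is the global miss rate to DRAM and that access times convert directly to cycles at 4 GHz (both assumptions, not stated on the slide):

```python
# Multilevel cache example: 4 GHz clock, so one cycle is 0.25 ns.
clock_ghz = 4.0
cycle_ns = 1.0 / clock_ghz           # 0.25 ns
dram_penalty = round(100 / cycle_ns) # 400 cycles to DRAM
l2_penalty = round(5 / cycle_ns)     # 20 cycles to the L2 cache

# L1 only: every L1 miss pays the full DRAM penalty.
cpi_l1_only = 1.0 + 0.02 * dram_penalty                       # 9.0
# With L2: L1 misses pay the L2 penalty; only global misses reach DRAM.
cpi_with_l2 = 1.0 + 0.02 * l2_penalty + 0.005 * dram_penalty  # 3.4
speedup = cpi_l1_only / cpi_with_l2                           # ≈ 2.6
```

So the second-level cache roughly shrinks the stall component from 8 cycles per instruction to 2.4, giving about a 2.6× speedup.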