Memory. Lecture 22 CS301


Administrative
- Daily Review of today's lecture: due tomorrow (11/13) at 8am
- HW #8: due today at 5pm
- Program #2: due Friday, 11/16 at 11:59pm
- Test #2: Wednesday

Pipelined Machine
[datapath diagram: Fetch, Decode, Execute, Memory, and Writeback stages - PC and instruction memory feed decode; the register file supplies src1/src2 data; a 16-bit immediate is sign-extended to 32 bits; ALU; data memory with address, in-data, and out-data ports; pipeline registers between stages]

The Challenge
Be able to randomly access gigabytes (or more) of data at processor speeds

How Do We Access Data?

Program Characteristics
Temporal Locality
- If you use one item, you are likely to use it again soon
Spatial Locality
- If you use one item, you are likely to use its neighbors soon

Examples of Each Type of Locality?
Temporal locality: Good? Bad?
Spatial locality: Good? Bad?

Locality
Programs tend to exhibit spatial & temporal locality. Just a fact of life.
How can we use this knowledge of program behavior to solve our problem?
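To make the two kinds of locality concrete, here is a minimal C sketch (the array size and function names are illustrative, not from the lecture). Both functions compute the same sum, but the row-wise loop visits consecutive addresses, so each block fetched on a miss supplies the next several elements, while the column-wise loop strides a full row between accesses and wastes most of each block.

#include <stdio.h>

#define N 1024

static double a[N][N];

double sum_rowwise(void) {
    double sum = 0.0;                  /* sum, i, j reused constantly: temporal locality */
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            sum += a[i][j];            /* consecutive addresses: good spatial locality */
    return sum;
}

double sum_colwise(void) {
    double sum = 0.0;
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            sum += a[i][j];            /* stride of N doubles: poor spatial locality */
    return sum;
}

int main(void) {
    printf("%f %f\n", sum_rowwise(), sum_colwise());
    return 0;
}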

Predicting Data Accesses
Can we predict what data we will use?
- Instead of predicting branch direction, predict the next memory address request
- Like branch prediction, use previous behavior
Keep a prediction for every load?
- Fetch stage for a load is *TOO LATE*
Keep a prediction per memory address?
- Given an address, guess the next likely address
- Too many choices: the prediction table becomes too large to fit

Memory Hierarchy

Tech               Level     Speed     Size      Cost/bit
SRAM (logic)       CPU/L1    Fastest   Smallest  Highest
SRAM (logic)       L2 cache  ...       ...       ...
DRAM (capacitors)  DRAM      Slowest   Largest   Lowest

Using Caches To Improve Performance
Caches make the large gap between processor speed and memory speed appear much smaller
Caches give the appearance of having lots and lots of quickly accessible memory
- Achieved by exploiting spatial and temporal locality

SRAM: Static Random Access Memory
- Volatile memory array, 4-6 transistors per bit
- Fast accesses: 0.5-2.5 ns
Dimensions:
- Height: # of addressable locations
- Width: # of bits per addressable unit (usually 1 or 4)
Example: a 2M x 16 SRAM has height 2M, width 16

SRAM: Selection Logic
Need to choose which addressable unit goes to the output lines
- A 2M-input multiplexor is infeasible
- Single shared output line: the bit line
- Tri-state buffers allow multiple sources to drive the bit line

Tri-State Buffer (X = high impedance):
A  Ctrl  Z
0  0     X
0  1     0
1  0     X
1  1     1

[diagram: four data inputs, each through a tri-state buffer with its own select/enable, all sharing one output line]

SRAM: Using Bit Lines
[diagram: SRAM array with input lines across the top, address (word) lines selecting rows, and bit (output) lines running down the columns]

SRAM: For Large Arrays
Large arrays (e.g., a 4M x 8 SRAM) would require HUGE decoders and word lines
Instead, 2-stage decoding:
- First stage selects addresses in eight 4K x 1024 arrays
- Multiplexors then select 1 bit from each 1024-bit-wide array

DRAM: Dynamic RAM
Value stored as charge in a capacitor (1 transistor per bit)
- Must be refreshed
- Refresh by reading and writing back cells
- Refresh uses only 1-2% of active DRAM cycles
2-level decoder:
- Row access: row access strobe (RAS)
- Column access: column access strobe (CAS)
Access times: 50-70 ns (typical)
Example: a 4M x 1 DRAM is a 2048 x 2048 array

Memory Hierarchy (revisited): the same SRAM (L1, L2) / DRAM hierarchy table as above.

What Do We Need to Think About?
1. Design a cache that takes advantage of spatial & temporal locality
2. When you program, place data together that is used together, to increase spatial & temporal locality
   - Java: difficult to do
   - C: more control over data placement
Note: Caches exploit locality. Programs have varying degrees of locality. Caches do not have locality!

Cache Design
Temporal Locality
- When we obtain the data, store it in the cache
Spatial Locality
- Transfer a large block of contiguous data to get the item's neighbors
- Block (Line): the amount of data transferred for a single miss (the data plus its neighbors)

Where do we put data?
Searching the whole cache takes time & power
Direct-mapped:
- Limit each piece of data to one possible position
- Search is quick and simple

Direct-Mapped
[diagram: memory blocks at addresses 000000, 000100, 010000, 010100, 100000, 100100, 110000, 110100 each map to one of the cache indices 00, 01, 10, 11; each arrow carries one block (line)]

Direct-Mapped Cache
Block (Line) size = 2 words (8 B)
Byte address: 0b100100100 (= 292)
- Where do we look in the cache? BlockAddress mod #slots, which for a power-of-two slot count equals BlockAddress & (#slots - 1)
- Where is it within the block? The block offset tells us
- How do we know if it is there? We need a tag & a valid bit
After the block is loaded, index 00 holds: valid = 1, tag = 1001, data = M[292-295] M[288-291]

Splitting the Address
[ Tag | Index | Block Offset | Byte Offset ]
[diagram: a 4-entry direct-mapped cache with valid, tag (e.g., 0b1010001), and data columns; all entries initially invalid]
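A small C sketch of this split for the lecture's toy cache (4 slots, 2-word blocks, 4-byte words; the field widths are assumptions matching the slides). The AND trick works only because the slot count is a power of two: x mod 2^k == x & (2^k - 1).

#include <stdio.h>
#include <stdint.h>

#define BYTE_OFFSET_BITS  2   /* 4-byte words      */
#define BLOCK_OFFSET_BITS 1   /* 2 words per block */
#define INDEX_BITS        2   /* 4 slots           */

int main(void) {
    uint32_t addr = 0x124;                       /* 0b100100100 = 292 */
    uint32_t byte_off  = addr & 0x3;
    uint32_t block_off = (addr >> BYTE_OFFSET_BITS) & 0x1;
    uint32_t index     = (addr >> (BYTE_OFFSET_BITS + BLOCK_OFFSET_BITS))
                         & ((1u << INDEX_BITS) - 1);
    uint32_t tag       = addr >> (BYTE_OFFSET_BITS + BLOCK_OFFSET_BITS + INDEX_BITS);
    printf("tag=%u index=%u block_off=%u byte_off=%u\n",
           tag, index, block_off, byte_off);     /* tag=9 (0b1001), index=0 */
    return 0;
}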

Definitions
- Byte Offset: which byte within the word
- Block Offset: which word within the block
- Set: group of blocks checked on each access
- Index: which set within the cache
- Tag: is this the right one?

Definitions
- Block (Line): unit of data transfer, bytes/words
- Hit: data found in this cache
- Miss: data not found in this cache; send the request to the lower level
- Hit time / Access time: time to access this cache
- Miss Penalty: time to receive the block from the lower level (not always constant)

Example 1: Direct-Mapped, Block size = 2 words
Address fields: [ Tag | Index | Block Offset | Byte Offset ]
The cache starts empty (all valid bits 0).

Reference stream (byte addresses), worked one access at a time:
0b1001000 (72): index 01, tag 10 - Miss; load M[76-79] M[72-75]
0b0010100 (20): index 10, tag 00 - Miss; load M[20-23] M[16-19]
0b0111000 (56): index 11, tag 01 - Miss; load M[60-63] M[56-59]
0b0010000 (16): index 10, tag 00 - Hit
0b0010100 (20): index 10, tag 00 - Hit
0b0100100 (36): index 00, tag 01 - Miss; load M[36-39] M[32-35]

Final cache state:
Index  Valid  Tag  Data
00     1      01   M[36-39] M[32-35]
01     1      10   M[76-79] M[72-75]
10     1      00   M[20-23] M[16-19]
11     1      01   M[60-63] M[56-59]

Miss Rate: 4 / 6 = 67%    Hit Rate: 2 / 6 = 33%

Implementation
[diagram: byte address 0b100100100 split into Tag | Index | Block Offset | Byte Offset; the index selects one of rows 00-11, a comparator (=) checks the stored tag against the address tag along with the valid bit to produce Hit?, and a MUX uses the block offset to select the requested word of Data]

Example 2
You are implementing a 64-KByte cache; 32-bit address
The block size (line size) is 16 bytes; each word is 4 bytes
- How many bits is the block offset? 16 / 4 = 4 words -> 2 bits
- How many bits is the index? 64*1024 / 16 = 4096 blocks -> 12 bits
- How many bits is the tag? 32 - (2 + 12 + 2) = 16 bits
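The same arithmetic, generalized in a short C sketch (the function names are mine, not the lecture's). ways = 1 gives a direct-mapped cache, and the same formulas handle set-associative caches such as the 1-MByte 4-way example later in the lecture. Power-of-two sizes and a 32-bit address are assumed.

#include <stdio.h>

static int log2i(unsigned x) {             /* x assumed a power of two */
    int n = 0;
    while (x > 1) { x >>= 1; n++; }
    return n;
}

static void cache_bits(unsigned cache_bytes, unsigned block_bytes,
                       unsigned word_bytes, unsigned ways) {
    int byte_off  = log2i(word_bytes);
    int block_off = log2i(block_bytes / word_bytes);       /* words per block */
    int index     = log2i(cache_bytes / block_bytes / ways);/* number of sets */
    int tag       = 32 - byte_off - block_off - index;
    printf("block offset=%d bits, index=%d bits, tag=%d bits\n",
           block_off, index, tag);
}

int main(void) {
    cache_bits(64 * 1024, 16, 4, 1);   /* Example 2: prints 2, 12, 16 */
    /* try cache_bits(1024 * 1024, 256, 4, 4) for the later 4-way example */
    return 0;
}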

Example
Direct-mapped $
- Block size = 2 words
- Total size = 16 words (8 blocks)
Word addresses: 0, 16, 1, 17, 32, 16, 36, 45
What is the hit rate? (See the simulation sketch after the 2-way example below.)
[worksheet: cache indices 0-7; references 0, 16, 1, 17, 32, 16, 36, 45 placed as they arrive]

Reducing Cache Conflicts
Problem:
- Lines that map to the same cache index conflict
- Lines conflict even if other cache lines are unused
Solution:
- Have multiple cache lines for each mapping

Cache Set Associativity
Set: group of cache lines an address can map to
- Direct-mapped: 1 location for a block
- n-way set associative: n locations for a block
- Fully-associative: a block maps to any location
[diagram: 8 lines organized as a direct-mapped cache (sets 0-7), as a 2-way set associative cache (sets 0-3), and as a fully-associative cache (one set, 0)]

Cache Set Associativity
Decreases conflicts => increases hit rate
On a cache request, must check every cache line in the set
- Increases hit time
Number of sets is smaller than in a direct-mapped cache, so fewer index bits
- log2(number of sets), where #sets < #cache lines
- Tag bits increase

Example
2-way set associative $
- Block size = 2 words
- Total size = 16 words (4 sets of 2 lines)
Word addresses: 0, 16, 1, 17, 32, 16, 36, 45
What is the hit rate? (See the simulation sketch below.)
[worksheet: sets 0-3, two lines each; references 0, 32, 16, 36, 45 placed in their sets]
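One way to check both of these traces is a tiny trace-driven simulator, sketched below in C under the slides' parameters (16-word cache, 2-word blocks, word addresses; LRU replacement within a set is an assumption). With WAYS set to 1 it replays the direct-mapped example, where the first six references ping-pong in index 0 and every access misses (0/8); with WAYS = 2 the conflicting blocks coexist and three of the eight references hit.

#include <stdio.h>

#define BLOCK_WORDS 2
#define TOTAL_WORDS 16
#define WAYS        2          /* set to 1 for the direct-mapped example */
#define SETS        (TOTAL_WORDS / BLOCK_WORDS / WAYS)

int main(void) {
    int tags[SETS][WAYS], valid[SETS][WAYS], lru[SETS][WAYS];
    int trace[] = {0, 16, 1, 17, 32, 16, 36, 45};    /* word addresses */
    int n = sizeof trace / sizeof trace[0], hits = 0, clock = 0;

    for (int s = 0; s < SETS; s++)
        for (int w = 0; w < WAYS; w++)
            valid[s][w] = lru[s][w] = 0;

    for (int i = 0; i < n; i++) {
        int block = trace[i] / BLOCK_WORDS;          /* block address   */
        int set   = block % SETS;                    /* which set       */
        int tag   = block / SETS;                    /* rest is the tag */
        int way   = -1;

        for (int w = 0; w < WAYS; w++)               /* check every line in the set */
            if (valid[set][w] && tags[set][w] == tag) way = w;

        if (way >= 0) {
            hits++;
            printf("addr %2d: hit\n", trace[i]);
        } else {                                     /* miss: fill least recently used way */
            way = 0;
            for (int w = 1; w < WAYS; w++)
                if (lru[set][w] < lru[set][way]) way = w;
            valid[set][way] = 1;
            tags[set][way]  = tag;
            printf("addr %2d: miss\n", trace[i]);
        }
        lru[set][way] = ++clock;                     /* mark most recently used */
    }
    printf("hit rate: %d/%d\n", hits, n);            /* 0/8 direct-mapped, 3/8 2-way */
    return 0;
}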

Implementation
[diagram: 2-way set associative lookup - byte address 0b100100100 split into Tag | Index | Block Offset | Byte Offset; the index selects one set, each way's valid bit and tag are checked by a parallel comparator (=), Hit? is the OR of the matches, and MUXes select the matching way and then the requested word of Data]

Example
You are implementing a 1-MByte, 4-way set associative cache; 32-bit address
The block size (line size) is 256 bytes
- How many bits is the block offset?
- How many bits is the index?
- How many bits is the tag?

What Happens on a Cache Miss?
Detect that the desired block is not there:
- Valid bit is 0, OR
- Tag is not the one we're looking for
If the valid bit is set but the tag is not the one we're looking for, evict the current block
Request the line from the lower level
Upon receipt of the data from the lower level, set the tag and valid bits and store the data
Pass the data up to the requestor
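In C, that sequence might look like the following sketch for a direct-mapped cache with one-word blocks. All names here (cache_read, lower_level_read) are hypothetical, and lower_level_read is a stub standing in for a real L2 or DRAM access.

#include <stdio.h>
#include <stdint.h>

#define NLINES 256

struct line { int valid; uint32_t tag; uint32_t data; };
static struct line cache[NLINES];

/* Stub standing in for "request line from lower level" (L2 or DRAM). */
static uint32_t lower_level_read(uint32_t addr) { return addr * 2; }

uint32_t cache_read(uint32_t addr) {
    uint32_t block = addr >> 2;            /* 4-byte words */
    uint32_t index = block % NLINES;
    uint32_t tag   = block / NLINES;
    struct line *l = &cache[index];

    if (l->valid && l->tag == tag)         /* hit: pass the data up */
        return l->data;

    /* Miss: valid bit 0, or tag mismatch (the old block is simply
       overwritten, i.e., evicted). Request the line from the lower
       level, set tag and valid, store the data, and pass it up. */
    l->data  = lower_level_read(addr);
    l->tag   = tag;
    l->valid = 1;
    return l->data;
}

int main(void) {
    printf("%u %u\n", cache_read(64), cache_read(64));   /* miss, then hit */
    return 0;
}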

How Caches Work
Classic abstraction: each level of the hierarchy has no knowledge of the configuration of the lower levels
- From the L1 cache's perspective: "Me" = L1; "Memory" = L2 cache + DRAM
- From the L2 cache's perspective: "Me" = L2 cache; "Memory" = DRAM

Memory Operation at Any Level
1. Cache receives the request (an address)
2. Look for the item in the cache
3. Hit: return the data
   Miss: request it from memory (the lower level)
4. Receive the data and update the cache
5. Return the data

Performance
Hit: latency = access time
Miss: latency = access time + miss penalty
Goal: minimize misses!!!

Performance
How does the memory system affect CPI?
Penalty on cache hit:
- hit time; frequently only 1 cycle is needed to access the cache on a hit
Penalty on cache miss:
- miss time: time to get the data from the lower level of memory
CPI = 1 + memory stalls/instruction = 1 + (% miss) * (cache miss penalty)

L1 cache's perspective
[diagram: Me = L1; Memory = L2 cache + DRAM]
L1's miss penalty contains the access of L2, and possibly the access of DRAM!!!

Multi-level Caches
Base CPI 1.0, 500 MHz clock
Main memory: 100 cycles; L2: 10 cycles
L1 miss rate per instruction: 5%
With L2: 2% of instructions go to DRAM
What is the speedup with the L2 cache?

Multi-level Caches
CPI = 1 + memory stalls / instruction
CPI_old = 1 + 5% misses/instr * 100 cycles/miss = 1 + 5 = 6 cycles/instr
CPI_new = 1 + L1miss% * L2 penalty + Mem% * Mem penalty
        = 1 + 5% * 10 + 2% * 100 = 3.5 cycles/instr
Speedup = 6 / 3.5 = 1.7
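The same arithmetic in a short C sketch (the variable names are mine); the rates are per instruction, as on the slide.

#include <stdio.h>

int main(void) {
    double base_cpi   = 1.0;
    double l1_miss    = 0.05;   /* 5% of instructions miss in L1        */
    double dram_rate  = 0.02;   /* with L2, 2% go all the way to DRAM   */
    double l2_penalty = 10.0, dram_penalty = 100.0;

    double cpi_old = base_cpi + l1_miss * dram_penalty;
    double cpi_new = base_cpi + l1_miss * l2_penalty + dram_rate * dram_penalty;

    printf("CPI old = %.1f, CPI new = %.1f, speedup = %.2f\n",
           cpi_old, cpi_new, cpi_old / cpi_new);   /* 6.0, 3.5, 1.71 */
    return 0;
}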

Average Memory Access Time
AMAT = L1 access time + L1 miss rate * L1 miss penalty
L1 miss penalty = L2 access time + L2 miss rate * L2 miss penalty
L2 miss penalty = memory access time + memory miss rate * memory miss penalty

Calculate AMAT
Organization:
- L1 cache: access time 1 cycle, hit rate 90%
- L2 cache: access time 10 cycles, hit rate 95%
- Memory: access time 100 cycles, hit rate 100%
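Working the exercise with the recursive formula, in a minimal C sketch: the miss rate at each level is 1 - hit rate, and the recursion bottoms out at memory, which always hits. It evaluates AMAT = 1 + 0.10 * (10 + 0.05 * 100) = 2.5 cycles.

#include <stdio.h>

int main(void) {
    double l1_time = 1.0,   l1_miss = 0.10;   /* 90% hit rate  */
    double l2_time = 10.0,  l2_miss = 0.05;   /* 95% hit rate  */
    double mem_time = 100.0;                  /* 100% hit rate */

    double l2_miss_penalty = mem_time;                        /* memory always hits */
    double l1_miss_penalty = l2_time + l2_miss * l2_miss_penalty;
    double amat = l1_time + l1_miss * l1_miss_penalty;

    printf("AMAT = %.2f cycles\n", amat);                     /* 2.50 */
    return 0;
}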

Ways To Improve Cache Performance
Make the cache bigger
- Pro: more stuff fits in the cache, so data doesn't have to get thrown out as often
- Con: a larger memory takes longer to access
Reduce the number of conflicts in the cache by increasing associativity
- Pro: multiple memory lines that map to the same cache set can reside in the cache simultaneously
- Con: more time is needed to determine whether there is a hit, because multiple cache blocks must be checked

Ways To Improve Cache Performance
Use multiple levels of cache
- Access time of a non-primary cache is not as important; it is more important for it to have a lower miss rate
- Pro: reduces the (average) miss penalty if there is a hit in the lower level of cache
- Con: takes up space and increases (worst-case) latency if the access misses in this level of cache
Make the block size larger to exploit spatial locality
- Pro: fewer misses for sequential accesses
- Pro: fewer bits dedicated to tags
- Con: fewer blocks in the cache for a given cache size
- Con: the miss penalty may be larger because larger blocks must be retrieved from the lower level of the hierarchy

Example
2-way set associative $
- Block size = 4 words
- Total size = 32 words
Word addresses: 2, 35, 63, 110, 210, 77, 3, 97, 170
What is the hit rate?

Cache Writes
There are multiple copies of the data lying around:
- L1 cache, L2 cache, DRAM
Do we write to all of them?
Do we wait for the write to complete before the processor can proceed?

Do we write to all of them?
Write-through: write to all levels of the hierarchy
Write-back: write to the lower level only when the cache line gets evicted from the cache
- Creates inconsistent data: different values for the same item in the cache and in DRAM
- The inconsistent copy in the highest cache level is referred to as dirty

Write-Through vs Write-Back
[diagrams: for sw $3, 0($5), write-through sends the store down through L1, L2, and DRAM; write-back updates only L1]

Write-through vs Write-back
Which performs the write faster?
- Write-back: it only writes the L1 cache
Which has faster evictions from a cache?
- Write-through: no write involved, just overwrite the tag
Which causes more bus traffic?
- Write-through: DRAM is written on every store; write-back only writes on eviction
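A compact C sketch contrasting the two store policies for a one-level, direct-mapped cache over a DRAM array (all sizes and names here are illustrative): write-through updates both copies on every store, while write-back only marks the line dirty and defers the DRAM write until the line is evicted.

#include <stdio.h>
#include <stdint.h>

#define NLINES 4

static uint32_t dram[1024];                    /* word-addressed "DRAM" */
static struct { int valid, dirty; uint32_t tag, data; } cache[NLINES];

void store_write_through(uint32_t addr, uint32_t value) {
    uint32_t idx = addr % NLINES;
    cache[idx].valid = 1;
    cache[idx].tag   = addr / NLINES;
    cache[idx].data  = value;
    dram[addr] = value;                        /* write all levels, every store */
}

void store_write_back(uint32_t addr, uint32_t value) {
    uint32_t idx = addr % NLINES, tag = addr / NLINES;
    if (cache[idx].valid && cache[idx].dirty && cache[idx].tag != tag)
        dram[cache[idx].tag * NLINES + idx] = cache[idx].data;  /* eviction: write back */
    cache[idx].valid = 1;
    cache[idx].dirty = 1;                      /* DRAM copy is now stale ("dirty") */
    cache[idx].tag   = tag;
    cache[idx].data  = value;
}

int main(void) {
    store_write_back(3, 42);                   /* only the cache is updated */
    printf("dram[3] before eviction: %u\n", dram[3]);   /* still 0 */
    store_write_back(7, 9);                    /* maps to the same line: evicts, writes back */
    printf("dram[3] after eviction:  %u\n", dram[3]);   /* now 42  */
    return 0;
}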

Beyond The Cache: Memory

Memory System Design Challenges
DRAM is designed for density, not speed
DRAM is slower than the bus
We are allowed to change the width, the number of DRAMs, and the bus protocol, but the access latency stays slow
Widening anything increases the cost by quite a bit

Narrow Configuration
[CPU - Cache - Bus - DRAM, all one word wide]
Given:
- 1 clock cycle to send the request
- 15 cycles/word DRAM latency
- 1 cycle/word bus latency
If a cache block is 8 words, what is the miss penalty of an L2 cache miss?
1 cycle + 15 cycles/word * 8 words + 1 cycle/word * 8 words = 129 cycles

Wide Configuration
[CPU - Cache - Bus - DRAM, two words wide]
Given:
- 1 clock cycle to send the request
- 15 cycles/2 words DRAM latency
- 1 cycle/2 words bus latency
If a cache block is 8 words, what is the miss penalty of an L2 cache miss?
1 cycle + 15 cycles/2 words * 8 words + 1 cycle/2 words * 8 words = 65 cycles

Interleaved Configuration
[CPU - Cache - Bus - two DRAM banks; byte 0 in DRAM 0, byte 1 in DRAM 1, byte 2 in DRAM 0, ...]
Given:
- 1 clock cycle to send the request
- 15 cycles/word DRAM latency (the two banks access in parallel)
- 1 cycle/word bus latency
If a cache block is 8 words, what is the miss penalty of an L2 cache miss?
1 cycle + 15 cycles/2 words * 8 words + 1 cycle/word * 8 words = 69 cycles
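All three calculations follow the same pattern, captured in this C sketch (the model and parameter names are mine, not the lecture's): one request cycle, DRAM accesses sped up by both the width and the number of parallel banks, and bus transfers sped up only by the width.

#include <stdio.h>

static int penalty(int block_words, int dram_cyc, int bus_cyc,
                   int width, int banks) {
    int dram = dram_cyc * block_words / (width * banks);  /* banks overlap accesses */
    int bus  = bus_cyc  * block_words / width;            /* bus moves width words/cycle */
    return 1 + dram + bus;                                /* 1 cycle for the request */
}

int main(void) {
    printf("narrow:      %d cycles\n", penalty(8, 15, 1, 1, 1));  /* 129 */
    printf("wide:        %d cycles\n", penalty(8, 15, 1, 2, 1));  /* 65  */
    printf("interleaved: %d cycles\n", penalty(8, 15, 1, 1, 2));  /* 69  */
    return 0;
}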

DRAM Optimizations
Fast page mode
- Allow repeated accesses to the row buffer without another row access
Synchronous DRAM (SDRAM)
- Add a clock signal to the DRAM interface to make it synchronous
- A programmable register holds the number of bytes to transfer over many cycles
Double Data Rate (DDR)
- Transfer data on both the rising and falling clock edges instead of just one
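To see why the DDR change matters, a quick back-of-the-envelope in C; the bus width and clock rate here are assumed for illustration, not taken from the lecture.

#include <stdio.h>

int main(void) {
    double clock_mhz = 100.0, bus_bytes = 8.0;   /* assumed: 64-bit bus at 100 MHz */
    double sdr = bus_bytes * clock_mhz;          /* MB/s, one transfer per cycle   */
    double ddr = bus_bytes * clock_mhz * 2.0;    /* MB/s, both clock edges         */
    printf("single rate: %.0f MB/s, DDR: %.0f MB/s\n", sdr, ddr);
    return 0;
}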

DRAM Optimizations
Make the DRAM chip act like a memory system: each chip has interleaved memory and a high-speed interface
RDRAM
- Switch the RAS/CAS lines to a bus that allows multiple accesses to be in flight simultaneously
- You don't have to wait for one DRAM request to finish before sending another request
Direct RDRAM
- Don't multiplex over one bus; have 3 buses: data, row, column