Memory. Lecture 22 CS301


1 Memory Lecture 22 CS301

2 Administrative Daily Review of today's lecture w Due tomorrow (11/13) at 8am HW #8 due today at 5pm Program #2 due Friday, 11/16 at 11:59pm Test #2 Wednesday

3 Pipelined Machine Fetch Decode Execute Memory PC 4 Read Addr Out Data Instruction Memory << 2 op/fun rs rt rd imm src1 src1data src2 src2data Register File destreg destdata << 2 Addr Out Data Data Memory In Data 16 Sign Ext 32 Pipeline Register (Writeback)

4 The Challenge Be able to randomly access gigabytes (or more) of data at processor speeds

5 How Do We Access Data?

6 Program Characteristics Temporal Locality Spatial Locality

7 Program Characteristics Temporal Locality w If you use one item, you are likely to use it again soon Spatial Locality

8 Program Characteristics Temporal Locality w If you use one item, you are likely to use it again Spatial Locality w If you use one item, you are likely to use its neighbors soon

9 Examples of Each Type of Locality? Temporal locality w Good? w Bad? Spatial locality w Good? w Bad?

10 Locality Programs tend to exhibit spatial & temporal locality. Just a fact of life. How can we use this knowledge of program behavior to solve our problem?
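Both kinds of locality show up in even the simplest loops. A tiny illustrative sketch (hypothetical code, not from the slides):

```python
# A running sum over an array exhibits both kinds of locality.
data = list(range(1024))

total = 0
for i in range(len(data)):
    total += data[i]  # 'total' is touched every iteration  -> temporal locality
                      # data[0], data[1], ... are neighbors -> spatial locality
```

A cache exploits exactly these two patterns: keep `total` (and recently used elements) close by, and fetch `data[i]`'s neighbors along with it.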

11 Predicting Data Accesses Can we predict what data we will use?

12 Predicting Data Accesses Can we predict what data we will use? w Instead of predicting branch direction, predict next memory address request

13 Predicting Data Accesses Can we predict what data we will use? w Instead of predicting branch direction, predict next memory address request w Like branch prediction, use previous behavior

14 Predicting Data Accesses Can we predict what data we will use? w Instead of predicting branch direction, predict next memory address request w Like branch prediction, use previous behavior Keep a prediction for every load? w Fetch stage for load is *TOO LATE* Keep a prediction per-memory address?

15 Predicting Data Accesses Can we predict what data we will use? w Instead of predicting branch direction, predict next memory address request w Like branch prediction, use previous behavior Keep a prediction for every load? w Fetch stage for load is *TOO LATE* Keep a prediction per-memory address? w Given address, guess next likely address

16 Predicting Data Accesses Can we predict what data we will use? w Instead of predicting branch direction, predict next memory address request w Like branch prediction, use previous behavior Keep a prediction for every load? w Fetch stage for load is *TOO LATE* Keep a prediction per-memory address? w Given address, guess next likely address w Too many choices: the table becomes too large to fit

17 Memory Hierarchy Tech Speed Size Cost/bit SRAM (logic) Fastest CPU L1 Smallest Highest SRAM (logic) L2 Cache DRAM (capacitors) Slowest DRAM Largest Lowest

18 Using Caches To Improve Performance Caches make the large gap between processor speed and memory speed appear much smaller Caches give the appearance of having lots and lots of quickly accessible memory w Achieved by exploiting spatial and temporal locality

19 SRAM Static Random Access Memory w Volatile memory array, 4-6 transistors per bit w Fast accesses ( ns) Dimensions: w Height: # addressable locations w Width: # of bits per addressable unit Usually 1 or 4 2M x 16 SRAM Height: 2M Width: 16

20 SRAM: Selection Logic Need to choose which addressable unit goes to output lines 2M multiplexor infeasible Single shared output line: bit line w Tri-state buffer used to allow multiple sources to drive bit line w When enabled, the buffer's output follows its input; when disabled, the output floats (high impedance) (Figure: tri-state buffer truth table, and four data inputs, each behind a tri-state buffer with its own select/enable line, driving one shared output)

21 SRAM: Using Bit Lines input lines address or word lines bit (output) lines

22 SRAM: For Large Arrays Large arrays (4Mx8 SRAM) require HUGE decoders and word lines Instead, 2-stage decoding w Selects addresses for eight 4Kx1024 arrays w Multiplexors select 1 bit from each 1024-b wide array

23 DRAM Dynamic RAM Value stored as charge in capacitor (1T) w Must be refreshed w Refresh by reading and writing back cells w Only uses 1-2% of active DRAM cycles 2 level decoder w Row access Row access strobe (RAS) w Column access Column access strobe (CAS) Access times w ns (typical) 4M x 1 DRAM 2048 x 2048 array

24 Memory Hierarchy Tech Speed Size Cost/bit SRAM (logic) Fastest CPU L1 Smallest Highest SRAM (logic) L2 Cache DRAM (capacitors) Slowest DRAM Largest Lowest

25 What Do We Need to Think About? 1. Design cache that takes advantage of spatial & temporal locality

26 What does that mean?!? 1. Design cache that takes advantage of spatial & temporal locality 2. When you program, place data together that is used together to increase spatial & temporal locality

27 What does that mean?!? 1. Design cache that takes advantage of spatial & temporal locality 2. When you program, place data together that is used together to increase locality w Java - difficult to do w C - more control over data placement

28 What does that mean?!? 1. Design cache that takes advantage of spatial & temporal locality 2. When you program, place data together that is used together to increase locality w Java - difficult to do w C - more control over data placement Note: Caches exploit locality. Programs have varying degrees of locality. Caches do not have locality!

29 Cache Design Temporal Locality Spatial Locality

30 Cache Design Temporal Locality w When we obtain the data, store it in the cache. Spatial Locality

31 Cache Design Temporal Locality w When we obtain the data, store it in the cache. Spatial Locality w Transfer large block of contiguous data to get item's neighbors. w Block (Line): Amount of data transferred for a single miss (data plus neighbors)

32 Where do we put data? Searching whole cache takes time & power Direct-mapped w Limit each piece of data to one possible position Search is quick and simple

33 Memory Direct-Mapped Index Cache

34 Memory Direct-Mapped One block (line) Index Cache

35 Direct-Mapped cache Block (Line) size = 2 words (8B) Index Data Byte Address 0b Where do we look in the cache? How do we know if it is there?

36 Direct-Mapped cache Block (Line) size = 2 words (8B) Index Data Byte Address 0b Block Address Where is it within the block? Where do we look in the cache? BlockAddress mod #slots BlockAddress & (#slots-1) How do we know if it is there?

37 Direct-Mapped cache Block (Line) size = 2 words (8B) Valid Tag Data M[ ] M[ ] Byte Address 0b Tag Index Where is it within the block? Where do we look in the cache? BlockAddress mod #slots BlockAddress & (#slots-1) How do we know if it is there? We need a tag & valid bit
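The slide's index/tag/offset arithmetic can be sketched directly. The 8-byte block size and 4-slot cache below are assumptions inferred from the figure:

```python
# Assumed geometry from the slide's figure: 8-byte blocks (2 words), 4 slots.
# Both are powers of two, so the mod can also be done with a bitmask.
BLOCK_SIZE = 8  # bytes
NUM_SLOTS = 4

def split_address(byte_addr):
    block_addr = byte_addr // BLOCK_SIZE            # which block of memory
    index = block_addr % NUM_SLOTS                  # where to look in the cache
    assert index == (block_addr & (NUM_SLOTS - 1))  # same thing, as a bitmask
    tag = block_addr // NUM_SLOTS                   # identifies which block is resident
    offset = byte_addr % BLOCK_SIZE                 # where within the block
    return tag, index, offset
```

For byte address 72 this gives block 9, so index 1 and tag 2, matching the figure's placement of M[72-75]/M[76-79].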

38 Splitting the Address Direct-Mapped Cache Valid Tag Data b Tag Index Block Offset Byte Offset

39 Definitions Byte Offset: Which within? Block Offset: Which within? Set: Group of checked each access Index: Which within cache? Tag: Is this the right one?

40 Definitions Byte Offset: Which byte within word Block Offset: Which within? Set: Group of checked each access Index: Which within cache? Tag: Is this the right one?

41 Definitions Byte Offset: Which byte within word Block Offset: Which word within block Set: Group of checked each access Index: Which within cache? Tag: Is this the right one?

42 Definitions Byte Offset: Which byte within word Block Offset: Which word within block Set: Group of blocks checked each access Index: Which within cache? Tag: Is this the right one?

43 Definitions Byte Offset: Which byte within word Block Offset: Which word within block Set: Group of blocks checked each access Index: Which set within cache? Tag: Is this the right one?

44 Definitions Block (Line) Hit Miss Hit time / Access time Miss Penalty

45 Definitions Block - unit of data transfer bytes/ words Hit Miss Hit time / Access time Miss Penalty

46 Definitions Block - unit of data transfer bytes/ words Hit - data found in this cache Miss Hit time / Access time Miss Penalty

47 Definitions Block - unit of data transfer bytes/ words Hit - data found in this cache Miss - data not found in this cache w Send request to lower level Hit time / Access time Miss Penalty

48 Definitions Block - unit of data transfer bytes/ words Hit - data found in this cache Miss - data not found in this cache w Send request to lower level Hit time / Access time w Time to access this cache Miss Penalty

49 Definitions Block - unit of data transfer bytes/words Hit - data found in this cache Miss - data not found in this cache w Send request to lower level Hit time / Access time w Time to access this cache Miss Penalty w Time to receive block from lower level w Not always constant

50 Example 1 Direct-Mapped Block size=2 words Direct-Mapped Cache Valid Tag Data x Tag Index Block Offset Byte Offset

51 Example 1 Direct-Mapped Block size=2 words Valid Direct-Mapped Cache Tag Data Reference Stream: Hit/Miss 0b b b b b b Miss Rate: Tag Index Block Offset Byte Offset

52 Example 1 Direct-Mapped Block size=2 words Valid Direct-Mapped Cache Tag Data 72 Reference Stream: Hit/Miss 0b b b b b b Miss Rate: Tag Index Block Offset Byte Offset

53 Example 1 Direct-Mapped Block size=2 words Valid Direct-Mapped Cache Tag 10 Data M[76-79] M[72-75] 72 Reference Stream: Hit/Miss 0b M 0b b b b b Miss Rate: Tag Index Block Offset Byte Offset

54 Example 1 Direct-Mapped Block size=2 words Valid Direct-Mapped Cache Tag 10 Data M[76-79] M[72-75] 20 Reference Stream: Hit/Miss 0b M 0b b b b b Miss Rate: Tag Index Block Offset Byte Offset

55 Example 1 Direct-Mapped Block size=2 words Valid Direct-Mapped Cache Tag Data M[76-79] M[72-75] M[20-23] M[16-19] 20 Reference Stream: Hit/Miss 0b M 0b b b b b Miss Rate: Tag Index Block Offset Byte Offset

56 Example 1 Direct-Mapped Block size=2 words Valid Direct-Mapped Cache Tag Data M[76-79] M[72-75] M[20-23] M[16-19] 56 Reference Stream: Hit/Miss 0b M 0b M 0b b b b Miss Rate: Tag Index Block Offset Byte Offset

57 Example 1 Direct-Mapped Block size=2 words Valid Direct-Mapped Cache Tag Data M[76-79] M[72-75] M[20-23] M[16-19] 11 M[60-63] M[56-59] 56 Reference Stream: Hit/Miss 0b M 0b M 0b M 0b b b Miss Rate: Tag Index Block Offset Byte Offset

58 Example 1 Direct-Mapped Block size=2 words Valid Direct-Mapped Cache Tag Data M[76-79] M[72-75] M[20-23] M[16-19] M[60-63] M[56-59] 16 Reference Stream: Hit/Miss 0b M 0b M 0b M 0b b b Miss Rate: Tag Index Block Offset Byte Offset

59 Example 1 Direct-Mapped Block size=2 words Valid Direct-Mapped Cache Tag Data M[76-79] M[72-75] M[20-23] M[16-19] M[60-63] M[56-59] 16 Reference Stream: Hit/Miss 0b M 0b M 0b M 0b H 0b b Miss Rate: Tag Index Block Offset Byte Offset

60 Example 1 Direct-Mapped Block size=2 words Valid Direct-Mapped Cache Tag Data M[20-23] M[16-19] M[76-79] M[72-75] M[60-63] M[56-59] 20 Reference Stream: Hit/Miss 0b M 0b M 0b M 0b H 0b b Miss Rate: Tag Index Block Offset Byte Offset

61 Example 1 Direct-Mapped Block size=2 words Valid Direct-Mapped Cache Tag Data M[20-23] M[16-19] M[76-79] M[72-75] M[60-63] M[56-59] 20 Reference Stream: Hit/Miss 0b M 0b M 0b M 0b H 0b H 0b Miss Rate: Tag Index Block Offset Byte Offset

62 Example 1 Direct-Mapped Block size=2 words Valid Direct-Mapped Cache Tag Data M[76-79] M[72-75] M[20-23] M[16-19] M[60-63] M[56-59] 36 Reference Stream: Hit/Miss 0b M 0b M 0b M 0b H 0b H 0b M Miss Rate: Tag Index Block Offset Byte Offset

63 Example 1 Direct-Mapped Block size=2 words Valid Direct-Mapped Cache Tag Data M[36-39] M[32-35] M[76-79] M[72-75] M[20-23] M[16-19] M[60-63] M[56-59] 36 Reference Stream: Hit/Miss 0b M 0b M 0b M 0b H 0b H 0b M Miss Rate: Tag Index Block Offset Byte Offset

64 Example 1 Direct-Mapped Block size=2 words Valid Direct-Mapped Cache Tag Data M[36-39] M[32-35] M[76-79] M[72-75] M[20-23] M[16-19] M[60-63] M[56-59] Reference Stream: Hit/Miss 0b M 0b M 0b M 0b H 0b H 0b M Miss Rate: Tag Index Block Offset Byte Offset

65 Example 1 Direct-Mapped Block size=2 words Valid Direct-Mapped Cache Tag Data M[36-39] M[32-35] M[76-79] M[72-75] M[20-23] M[16-19] M[60-63] M[56-59] Reference Stream: Hit/Miss 0b M 0b M 0b M 0b H 0b H 0b M Miss Rate: 4 / 6 = 67% Hit Rate: 2 / 6 = 33% Tag Index Block Offset Byte Offset
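The whole of Example 1 can be replayed in a few lines. The 4-slot, 8-byte-block geometry is inferred from the figure's 2-bit index, so treat this as a sketch rather than the slide's exact hardware:

```python
# Direct-mapped cache simulation for Example 1's reference stream
# (byte addresses 72, 20, 56, 16, 20, 36; 8-byte blocks, 4 slots assumed).
BLOCK_SIZE, NUM_SLOTS = 8, 4
cache = {}  # index -> tag of the block currently resident in that slot

def access(byte_addr):
    block = byte_addr // BLOCK_SIZE
    index, tag = block % NUM_SLOTS, block // NUM_SLOTS
    if cache.get(index) == tag:
        return 'H'
    cache[index] = tag  # miss: fetch the block, evicting whatever was there
    return 'M'

results = [access(a) for a in (72, 20, 56, 16, 20, 36)]
```

The simulation reproduces the slides' outcome, M M M H H M: 16 hits because bytes 16-23 form one block with 20, and the second 20 is a repeat.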

66 Implementation Byte Address 0x Tag Valid Index Tag Data Byte Offset Block offset = MUX Hit? Data

67 Example 2 You are implementing a 64-Kbyte cache, 32-bit address The block size (line size) is 16 bytes. Each word is 4 bytes How many bits is the block offset? How many bits is the index? How many bits is the tag?

68 Example 2 You are implementing a 64-Kbyte cache The block size (line size) is 16 bytes. Each word is 4 bytes How many bits is the block offset? w 16 / 4 = 4 words -> 2 bits How many bits is the index? How many bits is the tag?

69 Example 2 You are implementing a 64-Kbyte cache The block size (line size) is 16 bytes. Each word is 4 bytes, address 32 bits How many bits is the block offset? w 16 / 4 = 4 words -> 2 bits How many bits is the index? w 64*1024 / 16 = 4096 blocks -> 12 bits How many bits is the tag?

70 Example 2 You are implementing a 64-Kbyte cache The block size (line size) is 16 bytes. Each word is 4 bytes, address 32 bits How many bits is the block offset? w 16 / 4 = 4 words -> 2 bits How many bits is the index? w 64*1024 / 16 = 4096 blocks -> 12 bits How many bits is the tag? w 32 - (12 + 2 + 2) = 16 bits
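Example 2's arithmetic, written out as a sketch (the parameter names are illustrative, not from the slides):

```python
import math

# Example 2: 64 KB direct-mapped cache, 16-byte blocks, 4-byte words,
# 32-bit byte addresses.
cache_bytes, block_bytes, word_bytes, addr_bits = 64 * 1024, 16, 4, 32

byte_offset_bits = int(math.log2(word_bytes))                  # 2: byte within word
block_offset_bits = int(math.log2(block_bytes // word_bytes))  # 2: word within block
index_bits = int(math.log2(cache_bytes // block_bytes))        # 12: 4096 slots
tag_bits = addr_bits - index_bits - block_offset_bits - byte_offset_bits  # 16
```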

71 Direct-mapped $ w Block size = 2 words w Total size = 16 words Word addresses w 0 w 16 w 1 w 17 w 32 w 16 w 36 w 45 What is the hit rate? Example

72 Example Direct-mapped $ w Block size = 2 words w Total size = 16 words Word addresses w 0 w 16 w 1 w 17 w 32 w 16 w 36 w 45 What is the hit rate?
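The slide leaves the hit rate as an exercise. A short simulation (direct-mapped, 2-word blocks, 16 words total so 8 slots; the addresses given are word addresses) suggests every access misses:

```python
# Direct-mapped: 2-word blocks, 16 words total -> 8 slots.
NUM_SLOTS = 8
cache = {}  # index -> tag
hits = 0
for word_addr in (0, 16, 1, 17, 32, 16, 36, 45):
    block = word_addr // 2
    index, tag = block % NUM_SLOTS, block // NUM_SLOTS
    if cache.get(index) == tag:
        hits += 1
    else:
        cache[index] = tag
```

Addresses 0, 1, 16, 17, and 32 all map to index 0 and keep evicting one another, so the hit rate is 0/8: a pure conflict-miss pattern, motivating the next slides.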

73 Reducing Cache Conflicts Problem: w Lines that map to same cache index conflict w Lines conflict even if other cache lines unused Solution: w Have multiple cache lines for each mapping

74 Cache Set Associativity Set: Group of cache lines address can map to Direct-mapped: 1 location for block n-way set associative: n locations for block Fully-associative: Maps to any location Direct-mapped 2-way set associative Fully-associative Set Set Set 0

75 Cache Set Associativity Decreases conflicts => increases hit rate On cache request, must check every cache line in set w Increases hit time Number of sets smaller than direct mapped, so fewer index bits w lg (number of sets) where sets < # cache lines Tag bits increase

76 2-way set associative $ w Block size = 2 words w Total size = 16 words Word addresses w 0 w 16 w 1 w 17 w 32 w 16 w 36 w 45 What is the hit rate? Example

77 Example 2-way set associative $ w Block size = 2 words w Total size = 16 words Word addresses w 0 w 16 w 1 w 17 w 32 w 16 w 36 w 45 What is the hit rate?
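The same reference stream can be replayed 2-way set associative: 16 words total, 2-word blocks, so 8 blocks in 4 sets of 2. LRU replacement is assumed here; the slide does not say which policy it uses:

```python
from collections import OrderedDict

# 2-way set associative: 4 sets of 2 lines, 2-word blocks, LRU replacement.
NUM_SETS, WAYS = 4, 2
sets = [OrderedDict() for _ in range(NUM_SETS)]  # per set: tag -> None, oldest first
hits = 0
for word_addr in (0, 16, 1, 17, 32, 16, 36, 45):
    block = word_addr // 2
    index, tag = block % NUM_SETS, block // NUM_SETS
    lines = sets[index]
    if tag in lines:
        hits += 1
        lines.move_to_end(tag)         # refresh: now most recently used
    else:
        if len(lines) == WAYS:
            lines.popitem(last=False)  # evict the least recently used line
        lines[tag] = None
```

With two ways, the blocks holding 0/1 and 16/17 no longer evict each other, and the simulation gives 3 hits out of 8 (37.5%), versus zero in the direct-mapped case.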

78 Implementation Byte Address 0x Valid Tag Tag Data Index Valid Byte Offset Block offset Tag Data 1 Hit? = MUX = MUX MUX Data

79 Example You are implementing a 1Mbyte 4-way set associative cache, 32-bit address The block size (line size) is 256 bytes. How many bits is the block offset? How many bits is the index? How many bits is the tag?
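The slide leaves this one as an exercise; applying the same breakdown as Example 2, now with 4 ways (a sketch, same illustrative names as before):

```python
import math

# 1 MB 4-way set associative cache, 256-byte blocks, 4-byte words,
# 32-bit byte addresses.
cache_bytes, block_bytes, word_bytes, ways, addr_bits = 1024 * 1024, 256, 4, 4, 32

byte_offset_bits = int(math.log2(word_bytes))                  # 2
block_offset_bits = int(math.log2(block_bytes // word_bytes))  # 6 (64 words per block)
num_sets = cache_bytes // block_bytes // ways                  # 1024 sets
index_bits = int(math.log2(num_sets))                          # 10
tag_bits = addr_bits - index_bits - block_offset_bits - byte_offset_bits  # 14
```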

80 What Happens on Cache Miss? Detect desired block is not there w Valid bit 0 OR w Tag not one we're looking for If valid bit is set but tag not one we're looking for, evict current block Request line from lower level Upon receipt of data from lower level, set tag and valid bits and store data. Pass data up to requestor

81 How caches work Classic abstraction Each level of hierarchy has no knowledge of the configuration of lower level L1 cache's perspective Me L1 L2 cache's perspective Me L2 Cache Memory L2 Cache Memory DRAM DRAM

82 Memory Operation at any level Address 1. Me Cache 1. Cache receives request Memory

83 Memory operation at any level Address 1. Me 2. Cache 1. Cache receives request 2. Look for item in cache Memory

84 Memory operation at any level Address 1. Me 2. Cache 3. Data 1. Cache receives request 2. Look for item in cache Hit - return data Memory

85 Memory operation at any level Address 1. Me Memory Cache 1. Cache receives request 2. Look for item in cache Hit - return data Miss - request memory

86 Memory operation at any level Address 1. Me Memory Cache Cache receives request 2. Look for item in cache Hit - return data Miss - request memory receive data update cache

87 Memory operation at any level Address 1. Me Memory Cache 5. Data Cache receives request 2. Look for item in cache Hit - return data Miss - request memory receive data update cache return data

88 Performance Hit: latency = Miss: latency = Goal: minimize misses!!!

89 Performance Hit: latency = access time Miss: latency = Goal: minimize misses!!!

90 Performance Hit: latency = access time Miss: latency = access time + miss penalty Goal: minimize misses!!!

91 Performance How does the memory system affect CPI? Penalty on cache hit: w hit time w frequently only 1 cycle needed to access on cache hit Penalty on cache miss: w miss time time to get from lower level of memory CPI = 1 + memory stalls/instruction = 1 + (% miss) (cache miss penalty)

92 L1 cache's perspective Me Memory L1 L1's miss penalty contains the access of L2, and possibly the access of DRAM!!! L2 Cache DRAM

93 Multi-level Caches Base CPI 1.0, 500MHz clock Main memory-100 cycles, L2-10 cycles L1 miss rate per instruction - 5% W/L2-2% of instructions go to DRAM What is the speedup with the L2 cache?

94 Multi-level Caches CPI = 1 + memory stalls / instructions

95 Multi-level Caches CPI = 1 + memory stalls / instruction CPI old = 1 + 5% miss/instr * 100 cycles/miss = 1 + 5 = 6 cycles/instr

96 Multi-level Caches CPI = 1 + memory stalls / instruction CPI old = 1 + 5% miss/instr * 100 cycles/miss = 1 + 5 = 6 cycles/instr CPI new = 1 + L2% * L2 penalty + Mem% * Mem penalty = 1 + 5% * 10 + 2% * 100 = 3.5 cycles/instr

97 Multi-level Caches CPI = 1 + memory stalls / instruction CPI old = 1 + 5% miss/instr * 100 cycles/miss = 1 + 5 = 6 cycles/instr CPI new = 1 + L2% * L2 penalty + Mem% * Mem penalty = 1 + 5% * 10 + 2% * 100 = 3.5 cycles/instr Speedup = 6 / 3.5 = 1.7
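The slide's arithmetic, reproduced as a sketch (base CPI 1.0, 100-cycle memory, 10-cycle L2, 5% of instructions miss L1, 2% also miss L2 and go to DRAM):

```python
# Without L2: every L1 miss pays the full 100-cycle memory penalty.
cpi_old = 1 + 0.05 * 100

# With L2: all L1 misses pay 10 cycles for L2; the 2% that also miss L2
# pay an additional 100 cycles for DRAM.
cpi_new = 1 + 0.05 * 10 + 0.02 * 100

speedup = cpi_old / cpi_new  # 6.0 / 3.5, about 1.71
```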

98 Average Memory Access Time AMAT = L1 access time + L1 miss penalty L1 miss penalty = L2 access time + L2 miss penalty L2 miss penalty = Memory access time + Memory miss penalty

99 Calculate AMAT Organization: w L1 cache Access time is 1 cycle Hit rate of 90% w L2 cache Access time is 10 cycles Hit rate of 95% w Memory Access time is 100 cycles Hit rate of 100%
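Applying the AMAT recurrence from the previous slide to this organization (a sketch of the requested calculation):

```python
# AMAT = access time + miss rate * miss penalty, applied level by level.
mem_amat = 100                        # memory always hits (100% hit rate)
l2_amat = 10 + (1 - 0.95) * mem_amat  # 10 + 0.05 * 100 = 15 cycles
l1_amat = 1 + (1 - 0.90) * l2_amat    # 1 + 0.10 * 15  = 2.5 cycles
```

So even with a 100-cycle memory, the hierarchy delivers an average access time of 2.5 cycles.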

100 Ways To Improve Cache Performance Make the cache bigger w Pro: More stuff can fit in the cache so stuff doesn't have to get thrown out as often w Con: Time to access larger memory is longer Reduce the number of conflicts in the cache by increasing associativity w Pro: Multiple memory lines that map to same cache set can reside in cache simultaneously w Con: More time needed to determine if there is a hit because have to check multiple cache blocks

101 Ways To Improve Cache Performance Use multiple levels of cache w Access time of non-primary cache not as important. More important for it to have lower miss rate. w Pro: Reduces (average) miss penalty if there is a hit in lower level of cache w Con: Takes up space and increases (worst-case) latency if access misses in this level of cache. Make the block size larger to exploit spatial locality w Pro: Fewer misses for sequential accesses w Pro: Decreases bits dedicated to tags w Con: Fewer blocks in cache for given cache size w Con: Miss penalty may be larger because larger blocks need to be retrieved from lower level of hierarchy

102 2-way set associative $ w Block size = 4 words w Total size = 32 words Word addresses w 2 w 35 w 63 w 110 w 210 w 77 w 3 w 97 w 170 What is the hit rate? Example

103 Cache Writes There are multiple copies of the data lying around w L1 cache, L2 cache, DRAM Do we write to all of them? Do we wait for the write to complete before the processor can proceed?

104 Do we write to all of them? Write-through - write to all levels of hierarchy Write-back - write to lower level only when cache line gets evicted from cache w creates inconsistent data - different values for same item in cache and DRAM. w Inconsistent data in highest level in cache is referred to as dirty

105 Write-Through CPU sw $3, 0($5) L1 L2 Cache DRAM

106 Write-Back CPU sw $3, 0($5) L1 L2 Cache DRAM

107 Write-through vs Write-back Which performs the write faster? w Write-back - it only writes the L1 cache Which has faster evictions from a cache? w Write-through - no write involved, just overwrite tag Which causes more bus traffic? w Write-through. DRAM is written every store. Write-back only writes on eviction.
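The bus-traffic point can be made with a toy model (not from the slides): a program that stores to the same cached block many times before the block is evicted.

```python
# Toy model: 100 stores land in one cached block before it is evicted.
stores_to_same_block = 100

# Write-through: every store also goes to DRAM.
write_through_dram_writes = stores_to_same_block

# Write-back: the block is written to DRAM once, when the dirty block
# is finally evicted.
write_back_dram_writes = 1
```

The gap (100 DRAM writes vs. 1) is why write-back wins on bus traffic, at the cost of dirty blocks and slower evictions.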

108 Beyond The Cache: Memory

109 Memory System Design Challenges DRAM is designed for density, not speed DRAM is slower than the bus We are allowed to change the width, the number of DRAMs, and the bus protocol, but the access latency stays slow. Widening anything increases the cost by quite a bit.

110 Narrow Configuration CPU Given: w 1 clock cycle request w 15 cycles / word DRAM latency w 1 cycle / word bus latency If a cache block is 8 words, what is the miss penalty of an L2 cache miss? Cache Bus DRAM

111 Narrow Configuration CPU Given: w 1 clock cycle request w 15 cycles / word DRAM latency w 1 cycle / word bus latency If a cache block is 8 words, what is the miss penalty of an L2 cache miss? 1 cycle + 15 cycles/word * 8 words + 1 cycle/word * 8 words = 129 cycles Cache Bus DRAM

112 Wide Configuration CPU Given: w 1 clock cycle request w 15 cycles / 2 words DRAM latency w 1 cycle / 2 words bus latency If a cache block is 8 words, what is the miss penalty of an L2 cache miss? Cache Bus DRAM

113 Wide Configuration CPU Given: w 1 clock cycle request w 15 cycles / 2 words DRAM latency w 1 cycle / 2 words bus latency If a cache block is 8 words, what is the miss penalty of an L2 cache miss? 1 cycle + 15 cycles/2 words * 8 words + 1 cycle/2 words * 8 words = 65 cycles Cache Bus DRAM

114 Interleaved Configuration CPU Byte 0 in DRAM 0, byte 1 in DRAM 1, Byte 2 in DRAM 0,... Given: w 1 clock cycle request w 15 cycles / word DRAM latency w 1 cycle / word bus latency If a cache block is 8 words, what is the miss penalty of an L2 cache miss? Cache Bus DRAM DRAM

115 Interleaved Configuration CPU Given: w 1 clock cycle request w 15 cycles / word DRAM latency w 1 cycle / word bus latency If a cache block is 8 words, what is the miss penalty of an L2 cache miss? 1 cycle + 15 cycles / 2 words * 8 words + 1 cycle / word * 8 words = 69 cycles Cache Bus DRAM DRAM
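The three configurations' miss penalties, reproduced side by side as a sketch (1-cycle request, 15-cycle DRAM access, 1-cycle-per-word bus, 8-word block, as given on the slides):

```python
words = 8  # cache block size in words

# Narrow: one word at a time through DRAM and the bus.
narrow = 1 + 15 * words + 1 * words                # 129 cycles

# Wide: DRAM and bus both move two words at a time.
wide = 1 + 15 * (words // 2) + 1 * (words // 2)    # 65 cycles

# Interleaved: two DRAM banks overlap their 15-cycle accesses,
# but the bus still carries one word per cycle.
interleaved = 1 + 15 * (words // 2) + 1 * words    # 69 cycles
```

Interleaving buys most of the wide configuration's DRAM overlap without paying for a wider (and costlier) bus.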

116 DRAM Optimizations Fast page mode w Allow repeated accesses to row buffer without another row access time Synchronous DRAM w Add clock signal to DRAM interface to make synchronous w Programmable register holds number of bytes to transfer over many cycles Double Data Rate (DDR) w Transfer data on rising and falling clock edges instead of just one.

117 DRAM Optimizations Make DRAM chip act like a memory system Each chip has interleaved memory and a high speed interface RDRAM w Switch RAS/CAS lines to a bus that allows multiple accesses to be in flight simultaneously You don't have to wait for one DRAM request to finish before sending another request Direct RDRAM w Don't multiplex over one bus. Have 3: Data Row Column


More information

10/19/17. You Are Here! Review: Direct-Mapped Cache. Typical Memory Hierarchy

10/19/17. You Are Here! Review: Direct-Mapped Cache. Typical Memory Hierarchy CS 6C: Great Ideas in Computer Architecture (Machine Structures) Caches Part 3 Instructors: Krste Asanović & Randy H Katz http://insteecsberkeleyedu/~cs6c/ Parallel Requests Assigned to computer eg, Search

More information

Chapter 5. Large and Fast: Exploiting Memory Hierarchy

Chapter 5. Large and Fast: Exploiting Memory Hierarchy Chapter 5 Large and Fast: Exploiting Memory Hierarchy Processor-Memory Performance Gap 10000 µproc 55%/year (2X/1.5yr) Performance 1000 100 10 1 1980 1983 1986 1989 Moore s Law Processor-Memory Performance

More information

ECE331: Hardware Organization and Design

ECE331: Hardware Organization and Design ECE331: Hardware Organization and Design Lecture 22: Direct Mapped Cache Adapted from Computer Organization and Design, Patterson & Hennessy, UCB Intel 8-core i7-5960x 3 GHz, 8-core, 20 MB of cache, 140

More information

ELEC 5200/6200 Computer Architecture and Design Spring 2017 Lecture 7: Memory Organization Part II

ELEC 5200/6200 Computer Architecture and Design Spring 2017 Lecture 7: Memory Organization Part II ELEC 5200/6200 Computer Architecture and Design Spring 2017 Lecture 7: Organization Part II Ujjwal Guin, Assistant Professor Department of Electrical and Computer Engineering Auburn University, Auburn,

More information

CPS101 Computer Organization and Programming Lecture 13: The Memory System. Outline of Today s Lecture. The Big Picture: Where are We Now?

CPS101 Computer Organization and Programming Lecture 13: The Memory System. Outline of Today s Lecture. The Big Picture: Where are We Now? cps 14 memory.1 RW Fall 2 CPS11 Computer Organization and Programming Lecture 13 The System Robert Wagner Outline of Today s Lecture System the BIG Picture? Technology Technology DRAM A Real Life Example

More information

The Memory Hierarchy & Cache Review of Memory Hierarchy & Cache Basics (from 350):

The Memory Hierarchy & Cache Review of Memory Hierarchy & Cache Basics (from 350): The Memory Hierarchy & Cache Review of Memory Hierarchy & Cache Basics (from 350): Motivation for The Memory Hierarchy: { CPU/Memory Performance Gap The Principle Of Locality Cache $$$$$ Cache Basics:

More information

ECE232: Hardware Organization and Design

ECE232: Hardware Organization and Design ECE232: Hardware Organization and Design Lecture 22: Introduction to Caches Adapted from Computer Organization and Design, Patterson & Hennessy, UCB Overview Caches hold a subset of data from the main

More information

The University of Adelaide, School of Computer Science 13 September 2018

The University of Adelaide, School of Computer Science 13 September 2018 Computer Architecture A Quantitative Approach, Sixth Edition Chapter 2 Memory Hierarchy Design 1 Programmers want unlimited amounts of memory with low latency Fast memory technology is more expensive per

More information

CS 61C: Great Ideas in Computer Architecture (Machine Structures) Caches Part 1

CS 61C: Great Ideas in Computer Architecture (Machine Structures) Caches Part 1 CS 61C: Great Ideas in Computer Architecture (Machine Structures) Caches Part 1 Instructors: Nicholas Weaver & Vladimir Stojanovic http://inst.eecs.berkeley.edu/~cs61c/ Components of a Computer Processor

More information

CSE 2021: Computer Organization

CSE 2021: Computer Organization CSE 2021: Computer Organization Lecture-12a Caches-1 The basics of caches Shakil M. Khan Memory Technology Static RAM (SRAM) 0.5ns 2.5ns, $2000 $5000 per GB Dynamic RAM (DRAM) 50ns 70ns, $20 $75 per GB

More information

CS650 Computer Architecture. Lecture 9 Memory Hierarchy - Main Memory

CS650 Computer Architecture. Lecture 9 Memory Hierarchy - Main Memory CS65 Computer Architecture Lecture 9 Memory Hierarchy - Main Memory Andrew Sohn Computer Science Department New Jersey Institute of Technology Lecture 9: Main Memory 9-/ /6/ A. Sohn Memory Cycle Time 5

More information

CpE 442. Memory System

CpE 442. Memory System CpE 442 Memory System CPE 442 memory.1 Outline of Today s Lecture Recap and Introduction (5 minutes) Memory System: the BIG Picture? (15 minutes) Memory Technology: SRAM and Register File (25 minutes)

More information

Page 1. Multilevel Memories (Improving performance using a little cash )

Page 1. Multilevel Memories (Improving performance using a little cash ) Page 1 Multilevel Memories (Improving performance using a little cash ) 1 Page 2 CPU-Memory Bottleneck CPU Memory Performance of high-speed computers is usually limited by memory bandwidth & latency Latency

More information

Chapter 5A. Large and Fast: Exploiting Memory Hierarchy

Chapter 5A. Large and Fast: Exploiting Memory Hierarchy Chapter 5A Large and Fast: Exploiting Memory Hierarchy Memory Technology Static RAM (SRAM) Fast, expensive Dynamic RAM (DRAM) In between Magnetic disk Slow, inexpensive Ideal memory Access time of SRAM

More information

Mainstream Computer System Components

Mainstream Computer System Components Mainstream Computer System Components Double Date Rate (DDR) SDRAM One channel = 8 bytes = 64 bits wide Current DDR3 SDRAM Example: PC3-12800 (DDR3-1600) 200 MHz (internal base chip clock) 8-way interleaved

More information

EN1640: Design of Computing Systems Topic 06: Memory System

EN1640: Design of Computing Systems Topic 06: Memory System EN164: Design of Computing Systems Topic 6: Memory System Professor Sherief Reda http://scale.engin.brown.edu Electrical Sciences and Computer Engineering School of Engineering Brown University Spring

More information

CSE 2021: Computer Organization

CSE 2021: Computer Organization CSE 2021: Computer Organization Lecture-12 Caches-1 The basics of caches Shakil M. Khan Memory Technology Static RAM (SRAM) 0.5ns 2.5ns, $2000 $5000 per GB Dynamic RAM (DRAM) 50ns 70ns, $20 $75 per GB

More information

CS 61C: Great Ideas in Computer Architecture (Machine Structures) Caches Part 3

CS 61C: Great Ideas in Computer Architecture (Machine Structures) Caches Part 3 CS 61C: Great Ideas in Computer Architecture (Machine Structures) Caches Part 3 Instructors: Krste Asanović & Randy H. Katz http://inst.eecs.berkeley.edu/~cs61c/ 10/19/17 Fall 2017 - Lecture #16 1 Parallel

More information

registers data 1 registers MEMORY ADDRESS on-chip cache off-chip cache main memory: real address space part of virtual addr. sp.

registers data 1 registers MEMORY ADDRESS on-chip cache off-chip cache main memory: real address space part of virtual addr. sp. 13 1 CMPE110 Computer Architecture, Winter 2009 Andrea Di Blas 110 Winter 2009 CMPE Cache Direct-mapped cache Reads and writes Cache associativity Cache and performance Textbook Edition: 7.1 to 7.3 Third

More information

The Memory Hierarchy & Cache The impact of real memory on CPU Performance. Main memory basic properties: Memory Types: DRAM vs.

The Memory Hierarchy & Cache The impact of real memory on CPU Performance. Main memory basic properties: Memory Types: DRAM vs. The Hierarchical Memory System The Memory Hierarchy & Cache The impact of real memory on CPU Performance. Main memory basic properties: Memory Types: DRAM vs. SRAM The Motivation for The Memory Hierarchy:

More information

Caches Part 1. Instructor: Sören Schwertfeger. School of Information Science and Technology SIST

Caches Part 1. Instructor: Sören Schwertfeger.   School of Information Science and Technology SIST CS 110 Computer Architecture Caches Part 1 Instructor: Sören Schwertfeger http://shtech.org/courses/ca/ School of Information Science and Technology SIST ShanghaiTech University Slides based on UC Berkley's

More information

Page 1. Memory Hierarchies (Part 2)

Page 1. Memory Hierarchies (Part 2) Memory Hierarchies (Part ) Outline of Lectures on Memory Systems Memory Hierarchies Cache Memory 3 Virtual Memory 4 The future Increasing distance from the processor in access time Review: The Memory Hierarchy

More information

Memory Hierarchies &

Memory Hierarchies & Memory Hierarchies & Cache Memory CSE 410, Spring 2009 Computer Systems http://www.cs.washington.edu/410 4/26/2009 cse410-13-cache 2006-09 Perkins, DW Johnson and University of Washington 1 Reading and

More information

Reducing Hit Times. Critical Influence on cycle-time or CPI. small is always faster and can be put on chip

Reducing Hit Times. Critical Influence on cycle-time or CPI. small is always faster and can be put on chip Reducing Hit Times Critical Influence on cycle-time or CPI Keep L1 small and simple small is always faster and can be put on chip interesting compromise is to keep the tags on chip and the block data off

More information

registers data 1 registers MEMORY ADDRESS on-chip cache off-chip cache main memory: real address space part of virtual addr. sp.

registers data 1 registers MEMORY ADDRESS on-chip cache off-chip cache main memory: real address space part of virtual addr. sp. Cache associativity Cache and performance 12 1 CMPE110 Spring 2005 A. Di Blas 110 Spring 2005 CMPE Cache Direct-mapped cache Reads and writes Textbook Edition: 7.1 to 7.3 Second Third Edition: 7.1 to 7.3

More information

Chapter 5. Large and Fast: Exploiting Memory Hierarchy

Chapter 5. Large and Fast: Exploiting Memory Hierarchy Chapter 5 Large and Fast: Exploiting Memory Hierarchy Memory Technology Static RAM (SRAM) 0.5ns 2.5ns, $2000 $5000 per GB Dynamic RAM (DRAM) 50ns 70ns, $20 $75 per GB Magnetic disk 5ms 20ms, $0.20 $2 per

More information

CS152 Computer Architecture and Engineering Lecture 16: Memory System

CS152 Computer Architecture and Engineering Lecture 16: Memory System CS152 Computer Architecture and Engineering Lecture 16: System March 15, 1995 Dave Patterson (patterson@cs) and Shing Kong (shing.kong@eng.sun.com) Slides available on http://http.cs.berkeley.edu/~patterson

More information

ECE468 Computer Organization and Architecture. Memory Hierarchy

ECE468 Computer Organization and Architecture. Memory Hierarchy ECE468 Computer Organization and Architecture Hierarchy ECE468 memory.1 The Big Picture: Where are We Now? The Five Classic Components of a Computer Processor Control Input Datapath Output Today s Topic:

More information

Locality. Cache. Direct Mapped Cache. Direct Mapped Cache

Locality. Cache. Direct Mapped Cache. Direct Mapped Cache Locality A principle that makes having a memory hierarchy a good idea If an item is referenced, temporal locality: it will tend to be referenced again soon spatial locality: nearby items will tend to be

More information

Chapter 5. Large and Fast: Exploiting Memory Hierarchy

Chapter 5. Large and Fast: Exploiting Memory Hierarchy Chapter 5 Large and Fast: Exploiting Memory Hierarchy Review: Major Components of a Computer Processor Devices Control Memory Input Datapath Output Secondary Memory (Disk) Main Memory Cache Performance

More information

Memory. Objectives. Introduction. 6.2 Types of Memory

Memory. Objectives. Introduction. 6.2 Types of Memory Memory Objectives Master the concepts of hierarchical memory organization. Understand how each level of memory contributes to system performance, and how the performance is measured. Master the concepts

More information

Registers. Instruction Memory A L U. Data Memory C O N T R O L M U X A D D A D D. Sh L 2 M U X. Sign Ext M U X ALU CTL INSTRUCTION FETCH

Registers. Instruction Memory A L U. Data Memory C O N T R O L M U X A D D A D D. Sh L 2 M U X. Sign Ext M U X ALU CTL INSTRUCTION FETCH PC Instruction Memory 4 M U X Registers Sign Ext M U X Sh L 2 Data Memory M U X C O T R O L ALU CTL ISTRUCTIO FETCH ISTR DECODE REG FETCH EXECUTE/ ADDRESS CALC MEMOR ACCESS WRITE BACK A D D A D D A L U

More information

Modern Computer Architecture

Modern Computer Architecture Modern Computer Architecture Lecture3 Review of Memory Hierarchy Hongbin Sun 国家集成电路人才培养基地 Xi an Jiaotong University Performance 1000 Recap: Who Cares About the Memory Hierarchy? Processor-DRAM Memory Gap

More information

ECE7995 (4) Basics of Memory Hierarchy. [Adapted from Mary Jane Irwin s slides (PSU)]

ECE7995 (4) Basics of Memory Hierarchy. [Adapted from Mary Jane Irwin s slides (PSU)] ECE7995 (4) Basics of Memory Hierarchy [Adapted from Mary Jane Irwin s slides (PSU)] Major Components of a Computer Processor Devices Control Memory Input Datapath Output Performance Processor-Memory Performance

More information

Memory systems. Memory technology. Memory technology Memory hierarchy Virtual memory

Memory systems. Memory technology. Memory technology Memory hierarchy Virtual memory Memory systems Memory technology Memory hierarchy Virtual memory Memory technology DRAM Dynamic Random Access Memory bits are represented by an electric charge in a small capacitor charge leaks away, need

More information

CS 61C: Great Ideas in Computer Architecture (Machine Structures) Caches Part 2

CS 61C: Great Ideas in Computer Architecture (Machine Structures) Caches Part 2 CS 61C: Great Ideas in Computer Architecture (Machine Structures) Caches Part 2 Instructors: Bernhard Boser & Randy H Katz http://insteecsberkeleyedu/~cs61c/ 10/18/16 Fall 2016 - Lecture #15 1 Outline

More information

Computer Organization and Structure. Bing-Yu Chen National Taiwan University

Computer Organization and Structure. Bing-Yu Chen National Taiwan University Computer Organization and Structure Bing-Yu Chen National Taiwan University Large and Fast: Exploiting Memory Hierarchy The Basic of Caches Measuring & Improving Cache Performance Virtual Memory A Common

More information

CS3350B Computer Architecture

CS3350B Computer Architecture CS335B Computer Architecture Winter 25 Lecture 32: Exploiting Memory Hierarchy: How? Marc Moreno Maza wwwcsduwoca/courses/cs335b [Adapted from lectures on Computer Organization and Design, Patterson &

More information

LECTURE 10: Improving Memory Access: Direct and Spatial caches

LECTURE 10: Improving Memory Access: Direct and Spatial caches EECS 318 CAD Computer Aided Design LECTURE 10: Improving Memory Access: Direct and Spatial caches Instructor: Francis G. Wolff wolff@eecs.cwru.edu Case Western Reserve University This presentation uses

More information

Pipelining, Instruction Level Parallelism and Memory in Processors. Advanced Topics ICOM 4215 Computer Architecture and Organization Fall 2010

Pipelining, Instruction Level Parallelism and Memory in Processors. Advanced Topics ICOM 4215 Computer Architecture and Organization Fall 2010 Pipelining, Instruction Level Parallelism and Memory in Processors Advanced Topics ICOM 4215 Computer Architecture and Organization Fall 2010 NOTE: The material for this lecture was taken from several

More information

Multilevel Memories. Joel Emer Computer Science and Artificial Intelligence Laboratory Massachusetts Institute of Technology

Multilevel Memories. Joel Emer Computer Science and Artificial Intelligence Laboratory Massachusetts Institute of Technology 1 Multilevel Memories Computer Science and Artificial Intelligence Laboratory Massachusetts Institute of Technology Based on the material prepared by Krste Asanovic and Arvind CPU-Memory Bottleneck 6.823

More information

CPE 631 Lecture 04: CPU Caches

CPE 631 Lecture 04: CPU Caches Lecture 04 CPU Caches Electrical and Computer Engineering University of Alabama in Huntsville Outline Memory Hierarchy Four Questions for Memory Hierarchy Cache Performance 26/01/2004 UAH- 2 1 Processor-DR

More information

Cycle Time for Non-pipelined & Pipelined processors

Cycle Time for Non-pipelined & Pipelined processors Cycle Time for Non-pipelined & Pipelined processors Fetch Decode Execute Memory Writeback 250ps 350ps 150ps 300ps 200ps For a non-pipelined processor, the clock cycle is the sum of the latencies of all

More information

Caching Basics. Memory Hierarchies

Caching Basics. Memory Hierarchies Caching Basics CS448 1 Memory Hierarchies Takes advantage of locality of reference principle Most programs do not access all code and data uniformly, but repeat for certain data choices spatial nearby

More information

14:332:331. Week 13 Basics of Cache

14:332:331. Week 13 Basics of Cache 14:332:331 Computer Architecture and Assembly Language Spring 2006 Week 13 Basics of Cache [Adapted from Dave Patterson s UCB CS152 slides and Mary Jane Irwin s PSU CSE331 slides] 331 Week131 Spring 2006

More information

CS 61C: Great Ideas in Computer Architecture. The Memory Hierarchy, Fully Associative Caches

CS 61C: Great Ideas in Computer Architecture. The Memory Hierarchy, Fully Associative Caches CS 61C: Great Ideas in Computer Architecture The Memory Hierarchy, Fully Associative Caches Instructor: Alan Christopher 7/09/2014 Summer 2014 -- Lecture #10 1 Review of Last Lecture Floating point (single

More information

Memory Hierarchy, Fully Associative Caches. Instructor: Nick Riasanovsky

Memory Hierarchy, Fully Associative Caches. Instructor: Nick Riasanovsky Memory Hierarchy, Fully Associative Caches Instructor: Nick Riasanovsky Review Hazards reduce effectiveness of pipelining Cause stalls/bubbles Structural Hazards Conflict in use of datapath component Data

More information

CSF Improving Cache Performance. [Adapted from Computer Organization and Design, Patterson & Hennessy, 2005]

CSF Improving Cache Performance. [Adapted from Computer Organization and Design, Patterson & Hennessy, 2005] CSF Improving Cache Performance [Adapted from Computer Organization and Design, Patterson & Hennessy, 2005] Review: The Memory Hierarchy Take advantage of the principle of locality to present the user

More information

Textbook: Burdea and Coiffet, Virtual Reality Technology, 2 nd Edition, Wiley, Textbook web site:

Textbook: Burdea and Coiffet, Virtual Reality Technology, 2 nd Edition, Wiley, Textbook web site: Textbook: Burdea and Coiffet, Virtual Reality Technology, 2 nd Edition, Wiley, 2003 Textbook web site: www.vrtechnology.org 1 Textbook web site: www.vrtechnology.org Laboratory Hardware 2 Topics 14:332:331

More information

14:332:331. Week 13 Basics of Cache

14:332:331. Week 13 Basics of Cache 14:332:331 Computer Architecture and Assembly Language Fall 2003 Week 13 Basics of Cache [Adapted from Dave Patterson s UCB CS152 slides and Mary Jane Irwin s PSU CSE331 slides] 331 Lec20.1 Fall 2003 Head

More information

Lecture 11 Cache. Peng Liu.

Lecture 11 Cache. Peng Liu. Lecture 11 Cache Peng Liu liupeng@zju.edu.cn 1 Associative Cache Example 2 Associative Cache Example 3 Associativity Example Compare 4-block caches Direct mapped, 2-way set associative, fully associative

More information

The Memory Hierarchy. Cache, Main Memory, and Virtual Memory (Part 2)

The Memory Hierarchy. Cache, Main Memory, and Virtual Memory (Part 2) The Memory Hierarchy Cache, Main Memory, and Virtual Memory (Part 2) Lecture for CPSC 5155 Edward Bosworth, Ph.D. Computer Science Department Columbus State University Cache Line Replacement The cache

More information

Lecture 20: Memory Hierarchy Main Memory and Enhancing its Performance. Grinch-Like Stuff

Lecture 20: Memory Hierarchy Main Memory and Enhancing its Performance. Grinch-Like Stuff Lecture 20: ory Hierarchy Main ory and Enhancing its Performance Professor Alvin R. Lebeck Computer Science 220 Fall 1999 HW #4 Due November 12 Projects Finish reading Chapter 5 Grinch-Like Stuff CPS 220

More information

CS 152 Computer Architecture and Engineering. Lecture 7 - Memory Hierarchy-II

CS 152 Computer Architecture and Engineering. Lecture 7 - Memory Hierarchy-II CS 152 Computer Architecture and Engineering Lecture 7 - Memory Hierarchy-II Krste Asanovic Electrical Engineering and Computer Sciences University of California at Berkeley http://www.eecs.berkeley.edu/~krste

More information

Lecture 18: DRAM Technologies

Lecture 18: DRAM Technologies Lecture 18: DRAM Technologies Last Time: Cache and Virtual Memory Review Today DRAM organization or, why is DRAM so slow??? Lecture 18 1 Main Memory = DRAM Lecture 18 2 Basic DRAM Architecture Lecture

More information

Review : Pipelining. Memory Hierarchy

Review : Pipelining. Memory Hierarchy CS61C L11 Caches (1) CS61CL : Machine Structures Review : Pipelining The Big Picture Lecture #11 Caches 2009-07-29 Jeremy Huddleston!! Pipeline challenge is hazards "! Forwarding helps w/many data hazards

More information

Memory Hierarchy: Caches, Virtual Memory

Memory Hierarchy: Caches, Virtual Memory Memory Hierarchy: Caches, Virtual Memory Readings: 5.1-5.4, 5.8 Big memories are slow Computer Fast memories are small Processor Memory Devices Control Input Datapath Output Need to get fast, big memories

More information

Memory Hierarchy Technology. The Big Picture: Where are We Now? The Five Classic Components of a Computer

Memory Hierarchy Technology. The Big Picture: Where are We Now? The Five Classic Components of a Computer The Big Picture: Where are We Now? The Five Classic Components of a Computer Processor Control Datapath Today s Topics: technologies Technology trends Impact on performance Hierarchy The principle of locality

More information

Chapter 6 Objectives

Chapter 6 Objectives Chapter 6 Memory Chapter 6 Objectives Master the concepts of hierarchical memory organization. Understand how each level of memory contributes to system performance, and how the performance is measured.

More information

Adapted from David Patterson s slides on graduate computer architecture

Adapted from David Patterson s slides on graduate computer architecture Mei Yang Adapted from David Patterson s slides on graduate computer architecture Introduction Ten Advanced Optimizations of Cache Performance Memory Technology and Optimizations Virtual Memory and Virtual

More information

Computer Architecture. Memory Hierarchy. Lynn Choi Korea University

Computer Architecture. Memory Hierarchy. Lynn Choi Korea University Computer Architecture Memory Hierarchy Lynn Choi Korea University Memory Hierarchy Motivated by Principles of Locality Speed vs. Size vs. Cost tradeoff Locality principle Temporal Locality: reference to

More information

Caches Concepts Review

Caches Concepts Review Caches Concepts Review What is a block address? Why not bring just what is needed by the processor? What is a set associative cache? Write-through? Write-back? Then we ll see: Block allocation policy on

More information

Caches and Memory Hierarchy: Review. UCSB CS240A, Winter 2016

Caches and Memory Hierarchy: Review. UCSB CS240A, Winter 2016 Caches and Memory Hierarchy: Review UCSB CS240A, Winter 2016 1 Motivation Most applications in a single processor runs at only 10-20% of the processor peak Most of the single processor performance loss

More information

CS61C : Machine Structures

CS61C : Machine Structures inst.eecs.berkeley.edu/~cs61c/su05 CS61C : Machine Structures Lecture #21: Caches 3 2005-07-27 CS61C L22 Caches III (1) Andy Carle Review: Why We Use Caches 1000 Performance 100 10 1 1980 1981 1982 1983

More information

Memory Hierarchy. Maurizio Palesi. Maurizio Palesi 1

Memory Hierarchy. Maurizio Palesi. Maurizio Palesi 1 Memory Hierarchy Maurizio Palesi Maurizio Palesi 1 References John L. Hennessy and David A. Patterson, Computer Architecture a Quantitative Approach, second edition, Morgan Kaufmann Chapter 5 Maurizio

More information

CENG4480 Lecture 09: Memory 1

CENG4480 Lecture 09: Memory 1 CENG4480 Lecture 09: Memory 1 Bei Yu byu@cse.cuhk.edu.hk (Latest update: November 8, 2017) Fall 2017 1 / 37 Overview Introduction Memory Principle Random Access Memory (RAM) Non-Volatile Memory Conclusion

More information

Memory Hierarchy. ENG3380 Computer Organization and Architecture Cache Memory Part II. Topics. References. Memory Hierarchy

Memory Hierarchy. ENG3380 Computer Organization and Architecture Cache Memory Part II. Topics. References. Memory Hierarchy ENG338 Computer Organization and Architecture Part II Winter 217 S. Areibi School of Engineering University of Guelph Hierarchy Topics Hierarchy Locality Motivation Principles Elements of Design: Addresses

More information

Caches and Memory Hierarchy: Review. UCSB CS240A, Fall 2017

Caches and Memory Hierarchy: Review. UCSB CS240A, Fall 2017 Caches and Memory Hierarchy: Review UCSB CS24A, Fall 27 Motivation Most applications in a single processor runs at only - 2% of the processor peak Most of the single processor performance loss is in the

More information

EEC 170 Computer Architecture Fall Improving Cache Performance. Administrative. Review: The Memory Hierarchy. Review: Principle of Locality

EEC 170 Computer Architecture Fall Improving Cache Performance. Administrative. Review: The Memory Hierarchy. Review: Principle of Locality Administrative EEC 7 Computer Architecture Fall 5 Improving Cache Performance Problem #6 is posted Last set of homework You should be able to answer each of them in -5 min Quiz on Wednesday (/7) Chapter

More information

5. Memory Hierarchy Computer Architecture COMP SCI 2GA3 / SFWR ENG 2GA3. Emil Sekerinski, McMaster University, Fall Term 2015/16

5. Memory Hierarchy Computer Architecture COMP SCI 2GA3 / SFWR ENG 2GA3. Emil Sekerinski, McMaster University, Fall Term 2015/16 5. Memory Hierarchy Computer Architecture COMP SCI 2GA3 / SFWR ENG 2GA3 Emil Sekerinski, McMaster University, Fall Term 2015/16 Movie Rental Store You have a huge warehouse with every movie ever made.

More information