Memory Hierarchy. Mehran Rezaei


1 Memory Hierarchy Mehran Rezaei

2 What types of memory do we have? Registers Cache (Static RAM) Main Memory (Dynamic RAM) Disk (Magnetic Disk)

3 Option 1: Build It Out of Fast SRAM. About 5 ns access. Decoders are big; arrays are big. It will cost LOTS of money: SRAM costs $7 per megabyte, $286,72 for the MIPS.

4 Option 2: Build It Out of DRAM. Access time on the order of tens of ns. Why build a fast processor that stalls for dozens of cycles on each memory load? Still costs lots of money for new machines: DRAM costs $2 per megabyte, $8,92 for the MIPS.

5 Option 3: Build It Using Disks. Millions of ns per access (snore!); we could have stopped with the Intel 4004. Costs are pretty reasonable: disk storage costs $.5 per megabyte, $2 for the MIPS.

6 Our Requirements. We want a memory system that runs at processor clock speed (about 1 ns access). We want a memory system that we can afford (maybe 25% to 33% of the total system cost). Options 2-3 are too slow; options 1-2 (or 1-3) are too expensive. Time for option 4!

7 Option 4: Use a Little of Everything (Wisely) Use a small array of SRAM Small means fast! Small means cheap! Use a larger amount of DRAM And hope that you rarely have to use it Use a really big amount of Disk storage Disks are getting cheaper at a faster rate than we fill them up with data (for most people).

8 Option 4: The Memory Hierarchy Use a small array of SRAM For the CACHE (hopefully for most accesses) Use a bigger amount of DRAM For the Main memory Use a really big amount of Disk storage For the Virtual memory

9 Famous Picture (of Food): the memory hierarchy drawn as a pyramid, like the food pyramid, with Cache at the top, Main Memory in the middle, and Disk Storage at the base; cost, latency, and access speed are the labeled axes.

10 Memory Hierarchy (diagram): the processor (control unit and datapath) sits beside the levels of the hierarchy; moving down the levels, speed grows from fractions of a nanosecond to millions of nanoseconds while size grows from hundreds of bytes through KBs and MBs to GBs. The principle of locality suggests: make the common case fast.

11 The Big Picture (block diagram): CPU with I-TLB and D-TLB, I-cache and D-cache, a unified L2 cache, bus interface unit (BIU), physical-to-memory address map, read data buffer, DRAM, and memory request scheduling.

12 Five components of a computer: Processor (Control, the "brain", and Datapath, the "brawn"), Memory (where programs and data live when running), and Devices (Input: keyboard, mouse; Output: display, printer). The memory hierarchy belongs to the Memory component.

13 Memory Hierarchy (diagram): input and output devices, the processor's control and datapath, the cache, main memory, and the disk.

14 Memory Hierarchy (Analogy). You're writing a term paper about programming languages (C++ in particular) at a table in a library. The library is equivalent to the disk: limitless capacity (in principle), but very slow to retrieve a book. The table is memory: smaller capacity, which means you must return books when the table fills up, but it is easier and faster to find a book on the table once you have already retrieved it.

15 Memory Hierarchy Analogy (Cont'd). Open books on the table are the cache: smaller capacity still, since only a few open books fit on the table, and again, when the table is full you must close a book. But if you think about it, this is much, much faster for retrieving data. The illusion created: as if the whole library were on the table, and open. Keep as many recently used books open on the table as possible, since they are likely to be used again; also keep as many books on the table as possible, since that is faster than going back to the bookshelves.

16 Class Problem. Given the following: Cache: 1 cycle access time; Main memory: cycles access time; Disk: cycles access time. What is the average access time if you measure that 90% of the cache accesses are hits and 80% of the accesses to main memory are hits?
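
A minimal sketch in C of how such an average is computed, following the form of the formula on slide 39. The 1-, 10-, and 1,000,000-cycle access times are assumed example values (not necessarily the slide's), while the 90% and 80% hit rates are the ones measured in the problem.

    #include <stdio.h>

    int main(void) {
        /* Assumed access times (cycles); the hit rates come from the problem. */
        double t_cache = 1.0, t_mem = 10.0, t_disk = 1000000.0;
        double h_cache = 0.90, h_mem = 0.80;

        /* Same form as slide 39: hits pay their level's latency,
           misses pay the next level's average. */
        double avg = t_cache * h_cache
                   + (1.0 - h_cache) * (t_mem * h_mem
                   + (1.0 - h_mem) * t_disk);

        printf("average access time = %.1f cycles\n", avg);
        return 0;
    }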

17 A Very Simple Memory System: a processor issues five loads (Ld R1, Ld R2, Ld R3, Ld R3, Ld R2) against a cache of 2 lines, each with a valid bit, a 4-bit tag field, and a 1-byte block, backed by a small memory.

18 First load: there are no valid tags, so this is a cache miss; allocate a line, storing the address as the tag and the memory byte as the block.

19-21 Second load: check the tags; the address does not match, another cache miss; allocate the other (LRU) line. Running total: Misses 2, Hits 0.

22-24 Third load repeats the first address: the tags are checked and one matches (HIT!); update the LRU state. Running total: Misses 2, Hits 1.

25-26 Fourth load: neither tag matches (MISS!); evict the LRU line and allocate it for the new address. Running total: Misses 3, Hits 1.

27-28 Fifth load repeats the fourth address: the tag matches (HIT!). Final total: Misses 3, Hits 2.

29 Basic Cache Design. Cache memory can copy data from any part of main memory. It has 2 parts: the TAG (CAM) holds the memory address, and the BLOCK (SRAM) holds the memory data. Accessing the cache: compare the reference address with the tag. If they match, get the data from the cache block; if they don't match, get the data from main memory.
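
A minimal C sketch of this lookup, modeled on the 2-line, 1-byte-block example above. The round-robin victim choice is a stand-in for a real replacement policy, and the toy memory contents and load trace are made up for the demo.

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    #define NUM_LINES 2                 /* as in the simple example: 2 cache lines */

    typedef struct {
        bool     valid;                 /* is this line holding real data?   */
        uint32_t tag;                   /* the TAG (held in the CAM)         */
        uint8_t  block;                 /* the BLOCK (SRAM); 1-byte blocks   */
    } line_t;

    static uint8_t memory[16];          /* toy backing memory                */
    static line_t  cache[NUM_LINES];
    static int     next_victim;         /* stand-in for a real LRU policy    */

    uint8_t cache_read(uint32_t addr, int *hit) {
        uint32_t tag = addr;            /* 1-byte blocks: the whole address is the tag */
        for (int i = 0; i < NUM_LINES; i++) {
            if (cache[i].valid && cache[i].tag == tag) {
                *hit = 1;               /* tag match: cache HIT              */
                return cache[i].block;
            }
        }
        *hit = 0;                       /* no match: cache MISS, allocate a line */
        int v = next_victim;
        next_victim = (next_victim + 1) % NUM_LINES;
        cache[v].valid = true;
        cache[v].tag   = tag;
        cache[v].block = memory[addr];  /* fetch the data from main memory   */
        return cache[v].block;
    }

    int main(void) {
        for (int i = 0; i < 16; i++) memory[i] = (uint8_t)(10 * i);
        uint32_t trace[] = {1, 5, 1, 7, 7};   /* a load sequence in the spirit of slides 17-28 */
        for (int i = 0; i < 5; i++) {
            int hit;
            uint8_t v = cache_read(trace[i], &hit);
            printf("M[%u] = %u (%s)\n", trace[i], v, hit ? "hit" : "miss");
        }
        return 0;
    }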

30 CAMs: Content Addressable Memories Instead of thinking of memory as an array of data indexed by a memory address Think of memory as a set of data matching a query. Instead of an address, we send data to the memory, asking if any location contains that data.

31 Operations on CAMs Search: the primary way to access a CAM Send data to CAM memory Return found or not found Alternatively, return the address of where it was found Write: Send data for CAM to remember Where should it be stored if CAM is full? Replacement policy» Replace oldest data in the CAM» Replace least recently searched data

32 CAM Array (diagram): a 5-storage-element CAM array, 9 bits per entry, with a Found output raised when any entry matches the search data.

33 Previous Use of CAMs You have seen a simple CAM used before, where?

34 Cache Organization A cache memory consists of multiple tag/block pairs (called cache lines) Searches can be done in parallel (within reason) At most one tag will match If there is a tag match, it is a cache HIT If there is no tag match, it is a cache MISS Our goal is to keep the data we think will be accessed in the near future in the cache

35 Cache Operation Every cache miss will get the data from memory and ALLOCATE a cache line to put the data in. Just like any CAM write Which line should be allocated? Random? OK, but hard to grade test questions Better than random? How?

36 Picking the Most Likely Addresses What is the probability of accessing a random memory location? With no information, it is just as likely as any other address But programs are not random They tend to use the same memory locations over and over. We can use this to pick the most referenced locations to put into the cache

37 Temporal Locality. The principle of temporal locality in program references says that if you access a memory location, you will be more likely to re-access that location than you will be to reference some other random location.

38 Using Locality in the Cache. Temporal locality says any miss data should be placed into the cache: it is the most recently referenced location. Temporal locality also says that the least recently referenced (or least recently used, LRU) cache line should be evicted to make room for the new line: because the re-access probability falls over time as a cache line isn't referenced, the LRU line is the least likely to be re-referenced.

39 Calculating Average Access Latency. Avg latency = cache latency x hit rate + memory latency x miss rate. Avg latency = 1 cycle x (2/5) + 15 cycles x (3/5) = 9.4 cycles per reference. To improve average latency: improve memory access latency, or improve cache access latency, or improve cache hit rate.

40 Class Problem 2. For application X running on the MIPS, you measure that 40% of the operations access memory. You also measure that 90% of the memory references hit in the cache. Whenever a cache miss occurs, the processor is stalled for 20 cycles to transfer the block from memory into the cache. What is the CPI for application X on this processor? What would the CPI be if there were no cache?
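
A hedged sketch of the CPI arithmetic in C. The base CPI of 1 and the 40% / 90% / 20-cycle figures are assumed values standing in for the class-problem numbers, and instruction fetches are assumed to always hit.

    #include <stdio.h>

    int main(void) {
        double base_cpi     = 1.0;    /* assumed ideal CPI with a perfect memory          */
        double mem_per_inst = 0.40;   /* fraction of instructions that access memory      */
        double hit_rate     = 0.90;   /* cache hit rate                                   */
        double miss_penalty = 20.0;   /* stall cycles per cache miss                      */
        double mem_latency  = 20.0;   /* cycles per access with no cache (assumed equal)  */

        /* CPI = base CPI + memory accesses per instruction * miss rate * miss penalty */
        double cpi_with_cache = base_cpi + mem_per_inst * (1.0 - hit_rate) * miss_penalty;

        /* With no cache, every data access pays the full memory latency. */
        double cpi_no_cache = base_cpi + mem_per_inst * mem_latency;

        printf("CPI with cache: %.2f, CPI without cache: %.2f\n",
               cpi_with_cache, cpi_no_cache);
        return 0;
    }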

41 Calculating Cost How much does a cache cost? Calculate storage requirements Example cache: 2 bytes of SRAM Calculate overhead to support access (tags) Example cache: 2 4-bit tags The cost of the tags is often forgotten for caches, but this cost drives the design of real caches What is the cost if a 32 bit address is used?
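
A tiny sketch of the comparison this slide asks for, using the deck's 2-line, 1-byte-block example cache; the 32-bit case is the point of the exercise, since with 1-byte blocks the whole address becomes the tag.

    #include <stdio.h>

    int main(void) {
        int lines      = 2;    /* the deck's simple example: 2 cache lines */
        int block_bits = 8;    /* 1-byte blocks => 8 bits of data per line */

        /* With the toy 4-bit addresses: 4-bit tags. */
        printf("4-bit addresses:  data %d bits, tags %d bits\n",
               lines * block_bits, lines * 4);

        /* With 32-bit addresses and 1-byte blocks, the tags dwarf the data. */
        printf("32-bit addresses: data %d bits, tags %d bits\n",
               lines * block_bits, lines * 32);
        return 0;
    }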

42 How do we reduce the overhead? Have a small address Impractical Cache bigger units than bytes Each block has a single tag, and blocks can be whatever size we choose.

43 Block size for caches: the same style of exercise, but now the 2 cache lines each hold a 2-byte block, so the tag field shrinks to 3 bits; the processor again issues five loads (Ld R1, Ld R2, Ld R3, Ld R3, Ld R2).

44-46 The first load misses, and the whole 2-byte block (the addressed byte plus its neighbor) is brought in and tagged. Running total: Misses 1, Hits 0.

47 The second load also misses and fills the other line with its 2-byte block. Running total: Misses 2.

48-53 The remaining three loads all hit, either because they repeat an earlier address or because the earlier misses also brought in a neighboring byte. Final total: Misses 2, Hits 3.

54 Spatial Locality. Notice that when we accessed one address, we also brought in its neighbor. This turned out to be a good thing, since we later referenced that neighboring address and found it in the cache. Spatial locality in a program says that if we reference a memory location, we are more likely to reference a location near it than some random location.

55 Basic Cache Organization. Decide on the block size. How? Simulate lots of different block sizes and see which one gives the best performance. Most systems use a block size between 32 bytes and 128 bytes. Longer block sizes reduce the overhead by reducing the number of CAM entries and reducing the size of each CAM entry. (The address is split into Tag and Block offset.)

56 Class Problem 2 (diagram: processor, cache, memory). What is the cache hit rate when the following code is executed: Ld R1 M[7], Ld R2 M[6], Ld R3 M[9], Ld R3 M[ ], Ld R2 M[ ], on a cache with 2 cache lines, a 3-bit tag field, and 2-byte blocks?

57 The Main Issue: a fully associative cache with a 1-byte line size (diagram: the address's tag is compared in parallel against every valid TAG entry in the CAM; any match asserts hit and selects that entry's Data).

58 What seems to be the problem? 1. How large is the CAM if we need an 8 Kbyte cache? 2. What does each of those comparators look like, in terms of actual hardware? 3. How many comparators are needed for an 8 Kbyte cache? 4. What should we do? 1. Increase cache line size 2. Decrease the level of associativity 3. Increase cache line size and decrease set associativity

59 Increase cache line size (diagram: a byte decoder selects one of the 32 bytes within a wide line, Byte 31 ... Byte 1 Byte 0; one tag comparison per line drives hit). How many comparators do we need now? Is the size of each comparator reduced?

60 Other alternatives: ultimately only one comparator. How many cache lines do we have? Using a portion of the address, can we address the cache lines directly?

61 Direct Mapped Cache (diagram): an 8-to-256 row decoder selects one cache line from the index bits; a byte decoder selects Byte 31 ... Byte 1 Byte 0 within the line; a single tag comparator produces hit.

62 2-way set-associative cache (diagram): a 7-to-128 row decoder selects a set from the index bits; the two ways' tags are compared in parallel, the matching way's bytes (Byte 31 ... Byte 1 Byte 0) are selected, and hit is the OR of the two comparators.

63 The three fields All fields are read as unsigned integers. Offset: specifies which byte within the block we want Index: specifies the cache index (which row of the cache we should look in) Tag: the remaining bits after offset and index are determined; these are used to distinguish between all the memory addresses that map to the same cache line (cache block)

64 Class Problem. Determine the length of offset, index, and tag for the following cache configurations: Direct mapped, 8 Kbytes, 32-byte line size, on a 32-bit machine; Two-way set associative, 8 Kbytes, 64-byte line size, on a 32-bit machine; Direct mapped, 32 Kbytes, 128-byte line size, on a 32-bit machine.

65 Example. Suppose we have a 16 KB cache with 4-word blocks. Determine the size of the tag, index, and offset fields if we're using a 32-bit architecture. Offset: we need to specify the correct byte within a block; a block contains 4 words (each word is 32 bits) = 16 bytes = 2^4 bytes, so we need 4 bits to specify the correct byte. (Case study: cache simulator, cont'd.)

66 Example (Cont'd). Index (an index into the array of blocks): we need to specify the correct row in the cache. The cache contains 16 KB = 2^14 bytes and a block contains 2^4 bytes (4 words), so # blocks/cache = bytes/cache / bytes/block = 2^14 bytes/cache / 2^4 bytes/block = 2^10 blocks/cache; we need 10 bits to specify this many rows.

67 Example (Cont'd). Tag: use the remaining bits as the tag. Tag length = address length - offset - index = 32 - 4 - 10 = 18 bits, so the tag is the leftmost 18 bits of the memory address. 32-bit address = tag (18 bits) | index (10 bits) | offset (4 bits).
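
A minimal C sketch of the address split just derived (4 offset bits, 10 index bits, 18 tag bits for the 16 KB direct-mapped cache with 16-byte blocks); the example address is arbitrary.

    #include <stdint.h>
    #include <stdio.h>

    #define OFFSET_BITS 4                    /* 16-byte blocks            */
    #define INDEX_BITS  10                   /* 1024 blocks in the cache  */

    int main(void) {
        uint32_t addr = 0x12345678u;         /* an arbitrary example address */

        uint32_t offset = addr & ((1u << OFFSET_BITS) - 1);
        uint32_t index  = (addr >> OFFSET_BITS) & ((1u << INDEX_BITS) - 1);
        uint32_t tag    = addr >> (OFFSET_BITS + INDEX_BITS);

        printf("tag=0x%x index=%u offset=%u\n",
               (unsigned)tag, (unsigned)index, (unsigned)offset);
        return 0;
    }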

68 Accessing Data in Cache. Example: a 16 KB cache with 4-word blocks. Read 4 addresses: x4, xc, x34, x84. Memory holds the words a, b, c, d; e, f, g, h; and i, j, k, l at the corresponding block addresses.

69 Accessing Data in Cache. The 4 addresses are divided (for convenience) into Tag, Index, and Byte Offset fields.

70 Direct mapped, 16 Kbytes, 16-byte blocks: the cache is drawn as a table of Valid bit, Tag, and the four words of each block (byte offsets 0-3, 4-7, 8-b, c-f), one row per index.

71 Read x4: split the address into its Tag, Index, and Offset fields.

72 So we read the block at that index.

73 There is no valid data there (the valid bit is 0).

74 So load that block from memory into the cache, setting the tag and the valid bit; the block now holds a b c d.

75 Read from the cache at the offset, and return word b.

76 Read xc: split the address into Tag, Index, and Offset.

77 The index selects the same row as before, and it is valid.

78 Index valid, and the tag matches.

79 Index valid, tag matches: return word d.

80 Read x34: split the address into Tag, Index, and Offset.

81 So read block 3.

82 There is no valid data there.

83 Load that cache block from memory (e f g h) and return word f.

84 Read x84: split the address into Tag, Index, and Offset.

85 So read the selected cache block; its data is valid.

86 But the cache block's tag does not match (the stored tag is not 2).

87 Miss, so replace the block with the new data (i j k l) and the new tag.

88 And return word j.

89 And return word j.

90 A Better Look (Direct Mapped): diagram of a small direct-mapped cache against memory, with a 6-bit memory address split into tag (2 bits), index (2 bits), and offset (2 bits).

91 A Better Look (2-way set associative): the same memory and 6-bit address, now split into tag, set index, and offset bits, with two lines per set.

92 2-way set-associative cache (bigger cache): the same structure as slide 62, with a 7-to-128 set decoder, per-way tag comparators, and byte selection within the chosen way's block.

93 Direct Mapped vs. Set Associative cache size. Direct mapped: cache size = NL * LS (NL: number of lines, LS: line size). Set associative: cache size = NW * NS * LS (NW: number of ways, NS: number of sets, LS: line size).
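
A quick check of the two size formulas in C, using the 8 KB, 32-byte-line configuration from the class problem on slide 64: direct mapped it has 256 lines; organized 2-way set associative it has 128 sets of 2 lines. Either way the product is 8 KB.

    #include <stdio.h>

    int main(void) {
        int line_size = 32;                 /* LS, bytes                         */
        int nl        = 256;                /* NL for the direct-mapped case     */
        int nw = 2, ns = 128;               /* NW and NS for the 2-way case      */

        printf("direct mapped:   %d bytes\n", nl * line_size);        /* NL * LS      */
        printf("2-way set assoc: %d bytes\n", nw * ns * line_size);   /* NW * NS * LS */
        return 0;
    }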

94 Writing vs. Reading. Writing: sw $r4, 0($r6). Is this a write, a read, or a read and a write? Hit: the data is found in the cache. Write through: write to the cache and to the memory (for speed we need write buffers). Write back: write to the cache and not to the memory (there must be a dirty bit for each line of the cache). Miss: the data is not found in the cache. Write allocate: first bring the block into the cache, then write to the cache. Write no-allocate: do not bring the line into the cache; simply write to the memory and not to the cache. Does write back + write no-allocate make sense?
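
A sketch, in C, of the two write-hit policies on a single hypothetical cache line; the toy backing memory and line layout are made up for illustration. Write-through updates memory on every store, while write-back only sets the dirty bit and defers the memory write until eviction.

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    #define LINE_BYTES 32

    static uint8_t main_memory[1024];               /* toy backing store                     */

    typedef struct {
        bool    valid, dirty;                       /* dirty bit is only needed for write-back */
        uint8_t data[LINE_BYTES];
    } line_t;

    /* Write hit, write-through: update the cache AND memory right away
       (a write buffer would hide the memory latency). */
    static void write_hit_through(line_t *line, uint32_t addr, uint8_t byte) {
        line->data[addr % LINE_BYTES] = byte;
        main_memory[addr] = byte;
    }

    /* Write hit, write-back: update only the cache and remember that the
       line now differs from memory; it is flushed when the line is evicted. */
    static void write_hit_back(line_t *line, uint32_t addr, uint8_t byte) {
        line->data[addr % LINE_BYTES] = byte;
        line->dirty = true;
    }

    /* Eviction under write-back: only dirty lines need to go back to memory. */
    static void evict(line_t *line, uint32_t base_addr) {
        if (line->dirty)
            for (int i = 0; i < LINE_BYTES; i++)
                main_memory[base_addr + i] = line->data[i];
        line->valid = line->dirty = false;
    }

    int main(void) {
        line_t line = { .valid = true };
        write_hit_through(&line, 5, 42);   /* memory updated immediately          */
        write_hit_back(&line, 6, 43);      /* memory updated only when evicted... */
        evict(&line, 0);                   /* ...which happens here               */
        printf("mem[5]=%u mem[6]=%u\n", main_memory[5], main_memory[6]);
        return 0;
    }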

95 Writing vs. Reading (cont'd). Reading: lw $4, 0($6). Hit: the data is found in the cache; simply read it from the cache and put it into the register. Miss: bring the data from the memory into the cache and then load it into the register. But what if the cache is write back and the victim line is dirty (its dirty bit has been set by a write)? As it turns out, write back is more complicated to implement, and therefore write through is more suitable for the first level.

96 Unified vs. Split Unified cache Both instructions and data share the cache About 75% of the accesses are instruction references Unified cache has slightly lower miss rate Split Instruction cache and data cache No structural hazard

97 Problems. Comparing unified and split caches: consider a 32K unified cache versus a 16K instruction + 16K data split cache. The miss rate for the unified cache is 1.99%; 75% of the memory references are instruction references; the instruction cache miss rate is 0.64% and the data cache miss rate is 6.47% for the split cache. Which of the caches has the higher miss rate?

98 Problems (Cont'd). For the previous problem, assume that a hit takes 1 cycle, a miss takes 50 cycles, and a load or store hit takes 1 extra cycle on a unified cache (why?); assume a write-through cache with a write buffer, and ignore the penalty of writes into the buffer. What is the average memory access time for each case?
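
A sketch of the requested computation in C, using the miss rates quoted above and the assumed timing figures (1-cycle hit, 50-cycle miss penalty, 75% instruction references, 1 extra cycle for a load/store hit on the unified cache).

    #include <stdio.h>

    int main(void) {
        double f_inst = 0.75, f_data = 0.25;   /* instruction/data reference mix        */
        double mr_i = 0.0064, mr_d = 0.0647;   /* split: I-cache and D-cache miss rates */
        double mr_u = 0.0199;                  /* unified cache miss rate               */
        double hit = 1.0, penalty = 50.0;      /* assumed hit time and miss penalty     */

        /* Overall miss rate of the split configuration. */
        double mr_split = f_inst * mr_i + f_data * mr_d;

        double amat_split   = hit + mr_split * penalty;
        /* Unified: a load/store hit costs one extra cycle (structural hazard). */
        double amat_unified = hit + f_data * 1.0 + mr_u * penalty;

        printf("split:   miss rate %.4f, AMAT %.3f cycles\n", mr_split, amat_split);
        printf("unified: miss rate %.4f, AMAT %.3f cycles\n", mr_u, amat_unified);
        return 0;
    }

Running the numbers this way shows the unified cache with the slightly lower miss rate but the split configuration with the lower average access time, which is the point of the exercise.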

99 Where are we? Cache and main memory: already talked about. Today: virtual memory (the disk level).

100 Motivation Cache: Speed is the main motivation Take advantage of the principle of locality Spatial locality Temporal locality Virtual Memory: unlimited memory space Multitasking Allocation Protection A task needs more memory than exists

101 Tasks (processes) have to co-exist. What happens if gcc needs to expand? If netscape needs more memory than is on the machine? If acroread has an error and writes to address x2? When does gcc have to know it will run at x3? What if netscape isn't using its memory? (Diagram: the OS, gcc, acroread, and netscape laid out at different addresses in one memory.)

102 Linux user virtual address space (diagram, 32-bit): kernel virtual memory at the top, above 0xc0000000 (memory invisible to user code); below it the user stack, growing down from the stack pointer; a region for shared libraries around 0x40000000; then the heap, growing up to brk; the data segment; the text segment starting at 0x08048000; and an unused region at the bottom.

103 Who takes care of address translation? The CPU issues a virtual address; the MMU translates it into a physical address, which goes to physical memory.

104 The advantage (diagram): the pages of several processes (e.g., gcc and netscape) are scattered across physical memory, with the overflow kept on the hard disk.

105 Virtual Memory (paging mechanism). Memory is divided into fixed-size pages: Linux uses a 4 Kbyte page size, Tru64 uses an 8 Kbyte page size. A virtual address = virtual page number + page offset; address translation maps it to a physical address = physical page number + page offset.
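
A minimal sketch of the virtual-address split for the 4 KB (Linux) and 8 KB (Tru64) page sizes mentioned above; the page size in bits is the only parameter, and the example address is arbitrary.

    #include <stdint.h>
    #include <stdio.h>

    static void split(uint64_t va, int page_bits) {
        uint64_t offset = va & ((1ull << page_bits) - 1);   /* page offset          */
        uint64_t vpn    = va >> page_bits;                  /* virtual page number  */
        printf("%d KB pages: VPN = 0x%llx, offset = 0x%llx\n",
               1 << (page_bits - 10),
               (unsigned long long)vpn, (unsigned long long)offset);
    }

    int main(void) {
        uint64_t va = 0x12345678ull;     /* arbitrary example virtual address */
        split(va, 12);                   /* Linux: 4 KB pages (2^12)          */
        split(va, 13);                   /* Tru64: 8 KB pages (2^13)          */
        return 0;
    }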

106 Remarks: Paging mechanism (Cont'd). With virtual memory, the CPU generates virtual addresses. Each block is called a page. If a page is not found in main memory, we say a page fault has occurred. Two main issues: where to place a page when a page fault occurs (page replacement), and how to determine that a page is in main memory (the addressing mechanism).

107 Page table components (diagram): the page table register points to the beginning of the page table; the virtual page number indexes into the table; each entry holds a valid bit and a physical page number; the selected physical page number is concatenated with the page offset to form the physical address.

108 Page table components (worked example): the virtual page number selects a valid entry whose physical page number (x45fc in the transcription) is concatenated with the page offset (xff) to form the physical address.

109 Page table components (page fault case): the selected entry's valid bit is clear. Exception: page fault. 1. Stop this process 2. Pick a page to replace 3. Write back its data 4. Get the referenced page 5. Update the page table 6. Reschedule the process
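
A sketch of this single-level translation in C. The table size, frame numbers, and stub fault handler are made-up stand-ins for the real OS machinery described in the six steps above.

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    #define PAGE_BITS 13                      /* 8 KB pages, as in the deck           */
    #define NUM_PAGES 8                       /* toy-sized virtual address space      */

    typedef struct {
        bool     valid;                       /* is the page resident in main memory? */
        uint32_t ppn;                         /* physical page number                 */
    } pte_t;

    static pte_t page_table[NUM_PAGES];       /* the page table register would point here */

    /* Stand-in for the OS fault handler (the six steps on slide 109). */
    static uint32_t handle_page_fault(uint32_t vpn) {
        printf("page fault on VPN %u\n", vpn);
        page_table[vpn].valid = true;         /* pretend the page was brought in      */
        page_table[vpn].ppn   = vpn + 100;    /* at some arbitrary physical frame     */
        return page_table[vpn].ppn;
    }

    static uint32_t translate(uint32_t va) {
        uint32_t vpn    = va >> PAGE_BITS;                 /* virtual page number  */
        uint32_t offset = va & ((1u << PAGE_BITS) - 1);    /* page offset          */

        pte_t *pte = &page_table[vpn];                     /* VPN indexes the table */
        uint32_t ppn = pte->valid ? pte->ppn : handle_page_fault(vpn);

        return (ppn << PAGE_BITS) | offset;                /* physical page number + offset */
    }

    int main(void) {
        uint32_t va = (3u << PAGE_BITS) | 0x7f;            /* VPN 3, offset 0x7f        */
        printf("PA = 0x%x\n", (unsigned)translate(va));    /* faults, then translates   */
        printf("PA = 0x%x\n", (unsigned)translate(va));    /* now a clean translation   */
        return 0;
    }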

110 Size of page table. 32-bit address, 8 Kbyte pages, and 128 Mbytes of memory: the virtual address is VPN (19 bits) + offset (13 bits), and the physical address is PPN (14 bits) + offset (13 bits). Each page table entry is about 4 bytes, so the table takes 2^19 * 4 bytes = 2 MBytes.

111 Reduce the page table size Keep two registers Start address End address Allow hashing Multiple page tables Page the page table

112 Multiple page tables (diagram): the page table register points to the first-level (superpage) table; the virtual address is split into a virtual superpage number, a virtual page number, and a page offset; a valid superpage entry points to a second-level page table, whose valid entry supplies the physical page number that is concatenated with the page offset to form the physical address.
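
A sketch of the two-level walk in C, assuming 4 KB pages and 10-bit indices at each level (made-up field widths for illustration); each level's valid bit is checked before it is followed.

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    #define OFFSET_BITS 12                          /* assumed 4 KB pages            */
    #define L2_BITS     10                          /* virtual page index bits       */
    #define L1_BITS     10                          /* virtual superpage index bits  */

    typedef struct { bool valid; uint32_t next; } entry_t;  /* next: table number or frame number */

    static entry_t level1[1 << L1_BITS];            /* superpage table               */
    static entry_t level2[4][1 << L2_BITS];         /* a few second-level tables     */

    static int translate2(uint32_t va, uint32_t *pa) {
        uint32_t sp   = va >> (OFFSET_BITS + L2_BITS);           /* virtual superpage */
        uint32_t page = (va >> OFFSET_BITS) & ((1u << L2_BITS) - 1);
        uint32_t off  = va & ((1u << OFFSET_BITS) - 1);

        if (!level1[sp].valid) return -1;           /* whole second-level table absent: page fault */
        entry_t *pte = &level2[level1[sp].next][page];
        if (!pte->valid) return -1;                 /* the page itself absent: page fault          */

        *pa = (pte->next << OFFSET_BITS) | off;     /* physical page number + page offset          */
        return 0;
    }

    int main(void) {
        level1[1] = (entry_t){ true, 0 };           /* superpage 1 -> second-level table 0 */
        level2[0][5] = (entry_t){ true, 777 };      /* page 5 of that table -> frame 777   */

        uint32_t pa, va = (1u << 22) | (5u << 12) | 0xabc;   /* superpage 1, page 5, offset 0xabc */
        if (translate2(va, &pa) == 0) printf("PA = 0x%x\n", (unsigned)pa);
        else                          printf("page fault\n");
        return 0;
    }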

113 Putting it all together Loading your program in memory Ask operating system to create a new process Construct a page table for this process Mark all page table entries as invalid with a pointer to the disk image of the program That is, point to the executable file containing the binary. Run the program and get an immediate page fault on the first instruction.

114 Page replacement Page table indirection enables a fully associative mapping between virtual and physical pages How do we implement LRU? True LRU is expensive, but LRU is a heuristic anyway, so approximating LRU is fine Reference bit on page, cleared occasionally by operating system. Then pick any unreferenced page to evict.
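
A tiny sketch of this reference-bit approximation in C, with a made-up frame count: the OS clears the bits occasionally, and any frame whose bit is still clear is a candidate victim.

    #include <stdbool.h>
    #include <stdio.h>

    #define FRAMES 8

    static bool referenced[FRAMES];   /* set (by hardware or on a TLB refill) when a page is touched */

    /* Called occasionally by the operating system. */
    static void clear_reference_bits(void) {
        for (int i = 0; i < FRAMES; i++) referenced[i] = false;
    }

    /* Pick any unreferenced frame to evict; fall back to frame 0 if every frame was touched. */
    static int pick_victim(void) {
        for (int i = 0; i < FRAMES; i++)
            if (!referenced[i]) return i;
        return 0;
    }

    int main(void) {
        clear_reference_bits();
        referenced[0] = referenced[1] = true;       /* pages 0 and 1 were used recently   */
        printf("evict frame %d\n", pick_victim());  /* prints 2: not referenced recently  */
        return 0;
    }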

115 Performance issues. Any memory access needs a page table access first, and the page table is itself in memory; so does every load or store take an extra memory access? How does the cache fit into this picture: a physically addressed cache or a virtually addressed cache?

116 Translation Lookaside Buffer (TLB) (diagram): the VPN of the virtual address is compared in parallel against every TLB entry; on a match, the stored PPN is concatenated with the page offset.
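
A small C sketch of the parallel compare in the TLB diagram, assuming an 8-entry fully associative TLB and a stub page-table walk for the miss path; the replacement choice is deliberately trivial.

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    #define PAGE_BITS   13
    #define TLB_ENTRIES 8

    typedef struct { bool valid; uint32_t vpn, ppn; } tlb_entry_t;
    static tlb_entry_t tlb[TLB_ENTRIES];

    /* Stand-in for the full page-table walk (the slow path). */
    static uint32_t walk_page_table(uint32_t vpn) { return vpn + 100; }

    static uint32_t tlb_translate(uint32_t va, int *tlb_hit) {
        uint32_t vpn = va >> PAGE_BITS;
        uint32_t off = va & ((1u << PAGE_BITS) - 1);

        for (int i = 0; i < TLB_ENTRIES; i++)          /* done in parallel in hardware */
            if (tlb[i].valid && tlb[i].vpn == vpn) {
                *tlb_hit = 1;
                return (tlb[i].ppn << PAGE_BITS) | off;
            }

        *tlb_hit = 0;                                  /* TLB miss: walk the page table, then refill */
        uint32_t ppn = walk_page_table(vpn);
        tlb[0] = (tlb_entry_t){ true, vpn, ppn };      /* trivial replacement, just for the sketch   */
        return (ppn << PAGE_BITS) | off;
    }

    int main(void) {
        int hit;
        uint32_t va = (7u << PAGE_BITS) | 0x123;
        printf("PA=0x%x (%s)\n", (unsigned)tlb_translate(va, &hit), hit ? "TLB hit" : "TLB miss");
        printf("PA=0x%x (%s)\n", (unsigned)tlb_translate(va, &hit), hit ? "TLB hit" : "TLB miss");
        return 0;
    }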

117 Physically vs. virtually addressed cache (diagram): with a physically addressed cache, the CPU's virtual address goes through the TLB first and the resulting physical address indexes the cache, with misses going on to memory.

118 With a virtually addressed cache, the virtual address indexes the cache directly, and the TLB translation to a physical address is needed only on a cache miss, on the way to memory.

119 Synonyms (aliasing problem) (diagram): two different virtual addresses can map to the same physical memory location, so in a virtually addressed cache the same data can end up in two different cache lines.

120 The Big Picture (block diagram, revisited): CPU with I-TLB and D-TLB, I-cache and D-cache, a unified L2 cache, bus interface unit (BIU), physical-to-memory address map, read data buffer, DRAM, and memory request scheduling.
