Lecture 11 Cache and Virtual Memory


1 Lecture 11 Cache and Virtual Memory Peng Liu 1

2 Associative Cache Example 2

3 Tag & Index with Set-Associative Caches: Assume a 2^n-byte cache with 2^m-byte blocks that is 2^a-way set-associative. Which bits of the address are the tag or the index? The m least significant bits are the byte select within the block. Basic idea: the cache contains 2^n / 2^m = 2^(n-m) blocks, and each cache way contains 2^(n-m) / 2^a = 2^(n-m-a) blocks. Cache index: the (n-m-a) bits after the byte select; the same index is used with all cache ways. Observation: for a fixed cache size, the length of the tags increases with the associativity, so associative caches incur more overhead for tags. 3
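
The bit split described on this slide can be sketched in a few lines. This is an illustrative sketch only: the function names `split_address` and `tag_bits` and the 32-bit address width used in the usage note are assumptions, not part of the lecture.

```python
def split_address(addr, n, m, a):
    """Split a byte address for a 2^n-byte cache with 2^m-byte blocks
    and 2^a ways into (tag, index, offset)."""
    index_bits = n - m - a                      # index field width
    offset = addr & ((1 << m) - 1)              # m least significant bits
    index = (addr >> m) & ((1 << index_bits) - 1)
    tag = addr >> (m + index_bits)              # everything above the index
    return tag, index, offset


def tag_bits(address_bits, n, m, a):
    """Tag width = address bits - byte-select bits - index bits."""
    return address_bits - m - (n - m - a)
```

For a 16KB cache (n=14) with 16-byte blocks (m=4) and 32-bit addresses, `tag_bits` gives 18 bits when direct-mapped (a=0) and 19 bits when 2-way (a=1), which matches the observation that tags grow with associativity.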

4 Placement Policy: Where can memory block 12 be placed in an 8-block cache? Fully associative: anywhere. 2-way set associative: anywhere in set 0 (12 mod 4). Direct mapped: only into block 4 (12 mod 8). 4

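5 Direct-Mapped Cache: [Figure: the address is split into a tag (t bits), an index (k bits), and a block offset (b bits). The index selects one of 2^k lines; the stored tag of the valid (V) line is compared with the address tag to generate HIT, and the block offset selects the data word or byte.] 5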

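6 2-Way Set-Associative Cache: [Figure: the address is split into a tag (t bits), an index (k bits), and a block offset (b bits). The index selects one set; the tags of both ways (each with a valid bit V, a tag, and a data block) are compared in parallel with the address tag, and a match generates HIT and selects the data word or byte.] 6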

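7 Fully Associative Cache: [Figure: the address is split into a tag (t bits) and a block offset (b bits), with no index. The address tag is compared in parallel with the tag of every valid (V) line; a match generates HIT, and the block offset selects the data word or byte.] 7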

8 Replacement Methods: Which line do you replace on a miss? Direct mapped: easy, you have only one choice; replace the line at the index you need. N-way set associative: need to choose which way to replace. Random: choose one at random. Least Recently Used (LRU): the one used least recently; this is often difficult to calculate, so people use approximations, which often really select a line that is "not recently used". 8

9 Replacement Policy: In an associative cache, which block from a set should be evicted when the set becomes full? Random. Least Recently Used (LRU): LRU cache state must be updated on every access; a true implementation is only feasible for small sets (2-way); a pseudo-LRU binary tree is often used for 4-8 way. First In, First Out (FIFO), a.k.a. Round-Robin: used in highly associative caches. Not Least Recently Used (NLRU): FIFO with an exception for the most recently used block or blocks. This is a second-order effect. Why? Replacement only happens on misses. 9
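
As a rough illustration of true LRU within one set, the following sketch keeps the tags of a set in recency order. The class name `LRUSet` is hypothetical, and a software `OrderedDict` stands in for the per-set LRU state that hardware would track; it is not how a real cache is built.

```python
from collections import OrderedDict


class LRUSet:
    """One set of an N-way set-associative cache with true LRU replacement."""

    def __init__(self, ways):
        self.ways = ways
        self.lines = OrderedDict()   # tag -> None, least recently used first

    def access(self, tag):
        """Return True on a hit, False on a miss (filling or evicting)."""
        if tag in self.lines:
            self.lines.move_to_end(tag)      # LRU state updated on every access
            return True
        if len(self.lines) >= self.ways:
            self.lines.popitem(last=False)   # evict the least recently used tag
        self.lines[tag] = None
        return False
```

With a 2-way set and the tag sequence 1, 2, 1, 3, 2, 1, only the third access hits: the access to 3 evicts 2 (least recently used), the access to 2 evicts 1, and the final access to 1 evicts 3.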

10 Block Size and Spatial Locality: The block is the unit of transfer between the cache and memory. [Figure: a 4-word block (Tag, Word0, Word1, Word2, Word3), b=2.] The CPU address is split into a block address (32-b bits) and an offset (b bits), where 2^b = block size, a.k.a. line size (in bytes). Larger block size has distinct hardware advantages: less tag overhead; exploits fast burst transfers from DRAM; exploits fast burst transfers over wide busses. What are the disadvantages of increasing block size? Fewer blocks => more conflicts. Can waste bandwidth. 10

11 CPU-Cache Interaction (5-stage pipeline): [Figure: 5-stage pipeline datapath in which the PC addresses the primary instruction cache (addr, inst, hit?) and the memory stage accesses the primary data cache (we, addr, wdata, rdata, hit?); cache refill data comes from lower levels of the memory hierarchy through the memory control.] Stall the entire CPU on a data cache miss. 11

12 Improving Cache Performance: Average memory access time = Hit time + Miss rate x Miss penalty. To improve performance: reduce the hit time, reduce the miss rate, reduce the miss penalty. What is the simplest design strategy? The biggest cache that doesn't increase hit time past 1-2 cycles (approx. 8-32KB in modern technology) [design issues are more complex with out-of-order superscalar processors]. 12

13 Serial-versus-Parallel Cache and Memory Access: α is the HIT RATIO, the fraction of references found in the cache; 1 - α is the MISS RATIO, the remaining references. Serial search (the processor accesses the cache, and main memory only after a miss): average access time = t_cache + (1 - α) t_mem. Parallel search (the processor accesses the cache and main memory at the same time): average access time = α t_cache + (1 - α) t_mem. Savings are usually small, since t_mem >> t_cache and the hit ratio α is high. High bandwidth is required for the memory path. Complexity of handling parallel paths can slow t_cache. 13
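
The two average-access-time formulas can be compared numerically. The sketch below is illustrative; the function names and the sample values (hit ratio 0.95, 1-cycle cache, 100-cycle memory) are assumptions chosen only to show how small the saving is.

```python
def serial_access_time(hit_ratio, t_cache, t_mem):
    """Cache is searched first; memory is accessed only after a miss."""
    return t_cache + (1 - hit_ratio) * t_mem


def parallel_access_time(hit_ratio, t_cache, t_mem):
    """Cache and memory are accessed at the same time."""
    return hit_ratio * t_cache + (1 - hit_ratio) * t_mem


# Assumed sample values: alpha = 0.95, t_cache = 1 cycle, t_mem = 100 cycles.
serial = serial_access_time(0.95, 1, 100)      # about 6.0 cycles
parallel = parallel_access_time(0.95, 1, 100)  # about 5.95 cycles
```

The difference between the two is (1 - α) t_cache, which is tiny when the hit ratio is high.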

14 Causes for Cache Misses Compulsory: first-reference to a block a.k.a. cold start misses - misses that would occur even with infinite cache Capacity: cache is too small to hold all data needed by the program - misses that would occur even under perfect replacement policy Conflict: misses that occur because of collisions due to block-placement strategy - misses that would not occur with full associativity 14

15 Effect of Cache Parameters on Performance Larger cache size + reduces capacity and conflict misses - hit time will increase Higher associativity + reduces conflict misses - may increase hit time Larger block size + reduces compulsory and capacity (reload) misses - increases conflict misses and miss penalty 15

16 Multilevel Caches A memory cannot be large and fast Increasing sizes of cache at each level CPU L1$ L2$ DRAM Local miss rate = misses in cache / accesses to cache Global miss rate = misses in cache / CPU memory accesses Misses per instruction = misses in cache / number of instructions 16
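
The difference between local and global miss rates is easy to get wrong, so here is a small sketch for a two-level hierarchy. The function name and the access and miss counts in the usage note are made-up illustration values, not measurements.

```python
def miss_rates(cpu_accesses, l1_misses, l2_misses):
    """Return (L1 local, L2 local, L2 global) miss rates.

    L2 is only accessed on L1 misses, so its local miss rate is
    measured against l1_misses, not against all CPU accesses.
    """
    l1_local = l1_misses / cpu_accesses
    l2_local = l2_misses / l1_misses
    l2_global = l2_misses / cpu_accesses
    return l1_local, l2_local, l2_global
```

With 1000 CPU accesses, 40 L1 misses, and 20 L2 misses, the L2 local miss rate is 50% even though its global miss rate is only 2%.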

17 Multilevel Caches Primary (L1) caches attached to CPU Small, but fast Focusing on hit time rather than hit rate Level-2 cache services misses from primary cache Larger, slower, but still faster than main memory Unified instruction and data Focusing on hit rate rather than hit time Main memory services L2 cache misses Some high-end systems include L3 cache 17

18 A Typical Memory Hierarchy: CPU with a multiported register file (RF, part of the CPU); split instruction & data primary caches (L1 instruction cache and L1 data cache, on-chip SRAM); large unified secondary cache (unified L2 cache, on-chip SRAM); multiple interleaved memory banks (off-chip DRAM). 18

19 Itanium-2 On-Chip Caches (Intel/HP, 2002) Level 1: 16KB, 4-way s.a., 64B line, quad-port (2 load+2 store), single cycle latency Level 2: 256KB, 4-way s.a, 128B line, quad-port (4 load or 4 store), five cycle latency Level 3: 3MB, 12-way s.a., 128B line, single 32B port, twelve cycle latency 22

20 What About Writes? Where do we put the data we want to write? In the cache? In main memory? In both? Caches have different policies for this question Most systems store the data in the cache (why?) Some also store the data in memory as well (why?) Interesting observation Processor does not need to wait until the store completes 23

21 Cache Write Policies: Major Options Write-through (write data go to cache and memory) Main memory is updated on each cache write Replacing a cache entry is simple (just overwrite new block) Memory write causes significant delay if pipeline must stall Write-back (write data only goes to the cache) Only the cache entry is updated on each cache write so main memory and cache data are inconsistent Add dirty bit to the cache entry to indicate whether the data in the cache entry must be committed to memory Replacing a cache entry requires writing the data back to memory before replacing the entry if it is dirty 24

22 Write Policy Trade-offs: Write-through: misses are simpler and cheaper (no write-back to memory); easier to implement; requires buffering to be practical; uses a lot of bandwidth to the next level of memory. Write-back: writes are fast on a hit; multiple writes within a block require only one write-back later; efficient block transfer on write-back to memory at eviction. 25

23 Write Policy Choices Cache hit: write through: write both cache & memory generally higher traffic but simplifies cache coherence write back: write cache only (memory is written only when the entry is evicted) a dirty bit per block can further reduce the traffic Cache miss: no write allocate: only write to main memory write allocate (aka fetch on write): fetch into cache Common combinations: write through and no write allocate write back with write allocate 26

24 Write Buffer to Reduce Read Miss Penalty: [Figure: CPU and RF, data cache, write buffer, unified L2 cache.] The write buffer holds evicted dirty lines for a write-back cache, or all writes for a write-through cache. The processor is not stalled on writes, and read misses can go ahead of writes to main memory. Problem: the write buffer may hold the updated value of a location needed by a read miss. Simple scheme: on a read miss, wait for the write buffer to go empty. Faster scheme: check write buffer addresses against read miss addresses; if there is no match, allow the read miss to go ahead of the writes, else return the value in the write buffer. 27

25 Write Buffers for Write-Through Caches: [Figure: Processor, Cache, Write Buffer, Lower Level Memory.] The write buffer holds data awaiting write-through to lower level memory. Q. Why a write buffer? A. So the CPU doesn't stall. Q. Why a buffer, why not just one register? A. Bursts of writes are common. Q. Are Read After Write (RAW) hazards an issue for the write buffer? A. Yes! Drain the buffer before the next read, or check the write buffers. 28

26 Avoiding the Stalls for Write-Through Use write buffer between cache and memory Processor writes data into the cache and the write buffer Memory controller slowly drains buffer to memory Write buffer: a first-in-first-out buffer (FIFO) Typically holds a small number of writes Can absorb small bursts as long as the long term rate of writing to the buffer does not exceed the maximum rate of writing to DRAM 29

27 Cache Write Policy: Allocation Options: What happens on a cache write that misses? It's actually two subquestions. Do you allocate space in the cache for the address? Write-allocate vs. no-write-allocate. Actions: select a cache entry, evict old contents, update tags. Do you fetch the rest of the block contents from memory? Of interest if you do write-allocate. Remember that a store updates up to 1 word from a wider block. Fetch-on-miss vs. no-fetch-on-miss. For no-fetch-on-miss, you must remember which words are valid; use fine-grain valid bits in each cache line. 30

28 Typical Choices: Write-back caches: write-allocate, fetch-on-miss. Write-through caches: write-allocate, fetch-on-miss; write-allocate, no-fetch-on-miss; no-write-allocate, write-around. Modern HW supports multiple policies, selected by the OS at some coarse granularity. 31

29 Write Miss Actions (steps in order, per policy): Write-through, write-allocate, fetch-on-miss: pick replacement, fetch block, write cache, write memory. Write-through, write-allocate, no-fetch-on-miss: pick replacement, write partial cache, write memory. Write-through, no-write-allocate, write-around: write memory. Write-through, no-write-allocate, write-invalidate: invalidate tag, write memory. Write-back, write-allocate, fetch-on-miss: pick replacement, write back, fetch block, write cache. Write-back, write-allocate, no-fetch-on-miss: pick replacement, write back, write partial cache. 32

30 Be Careful, Even with Write Hits: Reading from a cache: read tags and data in parallel; if it hits, return the data, else go to the lower level. Writing a cache can take more time: first read the tag to determine hit/miss, then overwrite the data on a hit; otherwise, you may overwrite dirty data or write the wrong cache way. Can you ever access the tag and write data in parallel? 33

31 Splitting Caches Most processors have separate caches for instructions & data Often noted $I and $D Advantages Extra access port Can customize to specific access patterns Low hit time Disadvantages Capacity utilization Miss rate 34

32 Write Policy: Write-Through vs Write-Back: Policy: write-through writes data to the cache block and also to lower-level memory; write-back writes data only to the cache and updates the lower level when a block falls out of the cache. Debug: easy (write-through), hard (write-back). Do read misses produce writes? No (write-through), yes (write-back). Do repeated writes make it to the lower level? Yes (write-through), no (write-back). Additional option: let writes to an un-cached address allocate a new cache line ("write-allocate"). 35

33 Cache Design: Datapath + Control: Most design errors come from incorrect specification of state machine behavior! Common bugs: stalls, block replacement, write buffer. [Figure: a control state machine with control signals to the CPU and to lower level memory, driving a datapath of tags and blocks with Addr, Din, and Dout ports to the CPU and to lower level memory.] 36

34 Cache Controller: Example cache characteristics: direct-mapped, write-back, write-allocate. Block size: 4 words (16 bytes). Cache size: 16KB (1024 blocks). 32-bit byte addresses. Valid bit and dirty bit per block. Blocking cache: the CPU waits until the access is complete. 37

35 Signals between the Processor and the Cache 38

36 Finite State Machines: Use an FSM to sequence control steps. Set of states, with a transition on each clock edge. State values are binary encoded. The current state is stored in a register. Next state = fn(current state, current inputs). Control output signals = fo(current state). 39

37 Cache Controller FSM: Idle state: waiting for a valid read or write request from the processor. Compare Tag state: testing if hit or miss. If hit, set Cache Ready after the read or write -> Idle state. If miss, update the cache tag; if the old block is dirty -> Write-Back state, else -> Allocate state. Write-Back state: writing the 128-bit block to memory and waiting for the ready signal from memory -> Allocate state. Allocate state: fetching the new block from memory. 40
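
The four states above map naturally onto a next-state function. The sketch below is a software model only, with hypothetical signal names (`cpu_request`, `hit`, `dirty`, `mem_ready`). The slide does not say where the Allocate state goes once memory is ready; returning to Compare Tag is an assumption here, and it is marked as such in the code.

```python
from enum import Enum


class State(Enum):
    IDLE = 0
    COMPARE_TAG = 1
    WRITE_BACK = 2
    ALLOCATE = 3


def next_state(state, cpu_request=False, hit=False, dirty=False, mem_ready=False):
    """Next state = fn(current state, current inputs)."""
    if state is State.IDLE:
        return State.COMPARE_TAG if cpu_request else State.IDLE
    if state is State.COMPARE_TAG:
        if hit:
            return State.IDLE                  # Cache Ready, request done
        return State.WRITE_BACK if dirty else State.ALLOCATE
    if state is State.WRITE_BACK:
        return State.ALLOCATE if mem_ready else State.WRITE_BACK
    if state is State.ALLOCATE:
        # Assumption (not stated on the slide): once the block has been
        # fetched, go back to Compare Tag so the access completes as a hit.
        return State.COMPARE_TAG if mem_ready else State.ALLOCATE
    raise ValueError(f"unknown state: {state}")
```

A miss on a dirty block therefore walks Idle -> Compare Tag -> Write-Back -> Allocate, waiting in the last two states until memory signals ready.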

38 Main Memory Supporting Caches Use DRAMs for main memory Fixed width (e.g., 1 word) Connected by fixed-width clocked bus Bus clock is typically slower than CPU clock Example cache block read 1 bus cycle for address transfer 15 bus cycles per DRAM access 1 bus cycle per data transfer For 4-word block, 1-word-wide DRAM Miss penalty = 1 + 4x15 + 4x1 = 65 bus cycles Bandwidth = 16 bytes / 65 cycles = 0.25 B/cycle 41
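
The miss-penalty arithmetic on this slide can be written as a small function. This is a sketch with assumed parameter names; the default cycle counts are the ones given in the example (1 address cycle, 15 cycles per DRAM access, 1 cycle per word transferred, one word accessed at a time).

```python
def miss_penalty(words_per_block, addr_cycles=1, dram_cycles=15, transfer_cycles=1):
    """Bus cycles to read one block from a 1-word-wide DRAM."""
    return (addr_cycles
            + words_per_block * dram_cycles
            + words_per_block * transfer_cycles)


def bandwidth_bytes_per_cycle(words_per_block, bytes_per_word=4):
    """Bytes delivered per bus cycle during a block read."""
    return words_per_block * bytes_per_word / miss_penalty(words_per_block)
```

For a 4-word block this gives 65 bus cycles and 16 / 65, roughly 0.25 bytes per cycle, matching the slide.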

39 Increasing Memory Bandwidth 42

40 DRAM Technology 43

41 Why DRAM over SRAM? Density! SRAM cell: large; 6 transistors (nFET and pFET); 3 interface wires (bit, bit-bar, word); Vdd and Gnd. DRAM cell: small; 1 transistor + capacitor (nFET only); 2 interface wires; no Vdd. Density advantage: 3X to 10X, depending on the metric. 44

42 DRAM: Reading, Writing, Refresh: Writing DRAM: drive data on the bit line, then select the row. The capacitor holds state for 60 ms, then a refresh must be done. Reading DRAM: select the row, sense the bit line (~1 million electrons), then write the value back. Refresh: a dummy read.

43 Synchronous DRAM (SDRAM) Interface A clocked bus protocol (ex: 100 MHz) Note! This example is best-case! For a random access, DRAM takes many more than 2 cycles! Cache controller puts commands on bus (CAS = Column Address Strobe) Data comes out several cycles later. 46

44 Advanced DRAM Organization Bits in a DRAM are organized as a rectangular array DRAM accesses an entire row Burst mode: supply successive words from a row with reduced latency Double data rate (DDR) DRAM Transfer on rising and falling clock edges Quad data rate (QDR) DRAM Separate DDR inputs and outputs DIMMs: small boards with multiple DRAM chips connected in parallel Functions as a higher capacity, wider interface DRAM chip Easier to manipulate, replace,.. 47

45 Measuring Performance 48

46 Measuring Performance: The memory system is important for performance. Cache access time often determines the overall system clock cycle time, since it is often the slowest pipeline stage. Memory stalls are a large contributor to CPI: stalls due to instructions & data, reading & writing. Stalls include both cache miss stalls and write buffer stalls. Memory system & performance: CPU Time = (CPU Cycles + Memory Stall Cycles) * Cycle Time. MemStallCycles = Read Stall Cycles + Write Stall Cycles. CPI = CPIpipe + AvgMemStallCycles. CPIpipe = 1 + HazardStallCycles. 49

47 Memory Performance: Read stalls are fairly easy to understand: Read Stall Cycles = Reads/Prog * ReadMissRate * ReadMissPenalty. Write stalls depend upon the write policy. Write-through: Write Stall Cycles = (Writes/Prog * WriteMissRate * WriteMissPenalty) + Write Buffer Stalls. Write-back: Write Stall Cycles = (Writes/Prog * WriteMissRate * WriteMissPenalty). The write miss penalty can be complex: it can be partially hidden if the processor can continue executing, and it can include extra time to write back a value we are evicting. 50

48 Worst-Case Simplicity: Assume that write and read misses cause the same delay. MemoryStallCycles = (MemoryAccesses / Program) * MissRate * MissPenalty. MemoryStallCycles = (Instructions / Program) * (Misses / Instruction) * MissPenalty. In a single-level cache system, MissPenalty = latency of DRAM. In a multi-level cache system, MissPenalty is the latency of the L2 cache, etc.; calculate by considering MissRateL2, MissPenaltyL2, etc. Watch out: global vs. local miss rate for L2. 51

49 Simple Cache Performance Example Consider the following Miss rate for instruction access is 5% Miss rate for data access is 8% Data references per instruction are 0.4 CPI with perfect cache is 2 Read and write miss penalty is 20 cycles Including possible write buffer stalls What is the performance of this machine relative to one without misses? Always start by considering execution times (IC*CPI*CCT) But IC and CCT are the same here, so focus on CPI 52

50 Performance Solution: Find the CPI for the base system without misses: CPI_no_misses = CPI_perfect = 2. Find the CPI for the system with misses: Misses/Inst = I-cache misses + D-cache misses = 0.05 + (0.08 * 0.4) = 0.082. Memory Stall Cycles = Misses/Inst * MissPenalty = 0.082 * 20 = 1.64 cycles/inst. CPI_with_misses = CPI_perfect + Memory Stall Cycles = 2 + 1.64 = 3.64. Compare the performance: Performance_no_misses / Performance_with_misses = CPI_with_misses / CPI_no_misses = 3.64 / 2 = 1.82. 53
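
The same calculation, written out as a short script so each intermediate value can be checked. The variable names are assumptions; the numbers are the ones stated in the example (5% instruction miss rate, 8% data miss rate, 0.4 data references per instruction, perfect-cache CPI of 2, 20-cycle miss penalty).

```python
def cpi_with_misses(cpi_perfect, inst_miss_rate, data_miss_rate,
                    data_refs_per_inst, miss_penalty):
    """CPI including memory stall cycles per instruction."""
    # Every instruction is fetched once; only some also reference data.
    misses_per_inst = inst_miss_rate + data_miss_rate * data_refs_per_inst
    stall_cycles_per_inst = misses_per_inst * miss_penalty
    return cpi_perfect + stall_cycles_per_inst


cpi = cpi_with_misses(2, 0.05, 0.08, 0.4, 20)   # about 3.64
slowdown = cpi / 2                              # about 1.82
```

Because instruction count and clock cycle time are unchanged, the CPI ratio is the performance ratio.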

51 Another Cache Problem: Given the following data: base CPI of 1.5; 1 instruction reference per instruction fetch; 0.27 loads/instruction; 0.13 stores/instruction; a 64KB set-associative cache with 4-word block size has a miss rate of 1.7%; memory access time = 4 cycles + #words/block. Suppose the cache uses a write-through, write-around write strategy without a write buffer. How much faster would the machine be with a perfect write buffer? CPUtime = InstructionCount * (CPIbase + CPImemory) * ClockCycleTime. Performance is proportional to CPI = 1.5 + CPImemory. 54

52 No Write Buffer: [Figure: CPU, Cache, Lower Level Memory.] CPI memory = reads/inst. * miss rate * read miss penalty + writes/inst. * write penalty. Read miss penalty = 4 cycles + 4 words * 1 cycle/word = 8 cycles. Write penalty = 4 cycles + 1 word * 1 cycle/word = 5 cycles. CPI memory = (1 if + 0.27 ld)(1/inst.) * (0.017) * 8 cycles + (0.13 st)(1/inst.) * 5 cycles = 0.17 cycles/inst. + 0.65 cycles/inst. = 0.82 cycles/inst. CPI overall = 1.5 cycles/inst. + 0.82 cycles/inst. = 2.32 cycles/inst. 55

53 Perfect Write Buffer: [Figure: CPU, Cache, Wbuff, Lower Level Memory.] CPI memory = reads/inst. * miss rate * 8-cycle read miss penalty + writes/inst. * (1 - miss rate) * 1-cycle hit penalty. A hit penalty is required because on hits we must access the cache tags during the MEM cycle to determine a hit, then stall the processor for a cycle to update a hit cache block. CPI memory = 0.17 cycles/inst. + (0.13 st)(1/inst.) * (1 - 0.017) * 1 cycle = 0.17 cycles/inst. + 0.13 cycles/inst. = 0.30 cycles/inst. CPI overall = 1.5 cycles/inst. + 0.30 cycles/inst. = 1.80 cycles/inst. 56

54 Perfect Write Buffer + Cache Write Buffer: [Figure: CPU, Cache with cache write buffer (CWB), WBuff, Lower Level Memory.] CPI memory = reads/inst. * miss rate * 8-cycle read miss penalty. Avoid a hit penalty on writes by adding a one-entry write buffer to the cache itself: write the last store hit to the data array during the next store's MEM stage. Hazard: on loads, must check the CWB along with the cache! CPI memory = 0.17 cycles/inst. CPI overall = 1.5 cycles/inst. + 0.17 cycles/inst. = 1.67 cycles/inst. 57
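
The three write-buffer configurations can be compared side by side. This sketch uses assumed names and takes its inputs from the worked numbers of the problem: base CPI 1.5, one instruction fetch plus 0.27 loads per instruction, 0.13 stores per instruction, a 1.7% miss rate, an 8-cycle read miss penalty, and a 5-cycle write penalty.

```python
BASE_CPI = 1.5
READS_PER_INST = 1 + 0.27      # instruction fetch + loads
STORES_PER_INST = 0.13
MISS_RATE = 0.017
READ_MISS_PENALTY = 8          # 4 cycles + 4 words * 1 cycle/word
WRITE_PENALTY = 5              # 4 cycles + 1 word * 1 cycle/word

read_stalls = READS_PER_INST * MISS_RATE * READ_MISS_PENALTY

# No write buffer: every store pays the full write penalty.
cpi_no_buffer = BASE_CPI + read_stalls + STORES_PER_INST * WRITE_PENALTY

# Perfect write buffer: stores only pay a 1-cycle penalty on hits.
cpi_write_buffer = BASE_CPI + read_stalls + STORES_PER_INST * (1 - MISS_RATE) * 1

# Perfect write buffer plus cache write buffer: only read stalls remain.
cpi_with_cwb = BASE_CPI + read_stalls

speedup = cpi_no_buffer / cpi_write_buffer
```

Rounded to two decimals, the three CPIs come out to 2.32, 1.80, and 1.67, so the perfect write buffer makes the machine about 1.29 times faster than the configuration with no write buffer.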

55 Virtual Memory 58

56 Motivation #1: Large Address Space for Each Executing Program: Each program thinks it has a ~2^32 byte address space of its own, though it may not use it all. Available main memory may be much smaller. [Figure: address space layout, from top to bottom: 0xFFFF_FFFF down to 0x8000_0000, kernel (OS) memory (code, data, heap, stack), memory invisible to user code; user stack (created at runtime), with $sp (stack pointer) marking its top; run-time heap (managed by malloc), bounded by brk; from 0x1000_0000, read/write segment (.data, .bss); from 0x0040_0000, read-only segment (.text); both segments are loaded from the executable file; from 0x0000_0000, unused.] 59

57 Motivation #2: Memory Management for Multiple Programs: At any point in time, a computer may be running multiple programs, e.g., Firefox + Thunderbird. Questions: How do we share memory between multiple programs? How do we avoid address conflicts? How do we protect programs? Isolation and selective sharing. 60

58 Virtual Memory in a Nutshell: Use the hard disk (or Flash) as large storage for the data of all programs. Main memory (DRAM) is a cache for the disk, managed jointly by hardware and the operating system (OS). Each running program has its own virtual address space: the address space shown in the previous figure, protected from other programs. Frequently-used portions of the virtual address space are copied to DRAM; DRAM = physical address space. Hardware + OS translate virtual addresses (VA) used by the program to physical addresses (PA) used by the hardware. Translation enables relocation (DRAM <-> disk) & protection. 61

59 Reminder: Memory Hierarchy, Everything is a Cache for Something Else. Rows are access time / capacity / managed by: 1 cycle / ~500B / software/compiler; 1-3 cycles / ~64KB / hardware; 5-10 cycles / 1-10MB / hardware; ~100 cycles / ~10GB / software/OS; cycles / ~100GB / software/OS. 62

60 DRAM vs. SRAM as a Cache: DRAM vs. disk is more extreme than SRAM vs. DRAM. Access latencies: DRAM is ~10X slower than SRAM; disk is ~100,000X slower than DRAM. Importance of exploiting spatial locality: the first byte is ~100,000X slower than successive bytes on disk, vs. a ~4X improvement for page-mode vs. regular accesses to DRAM. 63

61 Impact of These Properties on Design: Bottom line: design decisions made for virtual memory are driven by the enormous cost of misses (disk accesses). Consider the following parameters for DRAM as a cache for the disk. Line size? Large, since disk is better at transferring large blocks and this minimizes miss rate. Associativity? High, to minimize miss rate. Write through or write back? Write back, since we can't afford to perform small writes to disk. 64

62 Terminology for Virtual Memory: Virtual memory uses DRAM as a cache for disk. New terms: a VM block is called a page, the unit of data moving between disk and DRAM; it is typically larger than a cache block (e.g., 4KB or 16KB). Virtual and physical address spaces can be divided up into virtual pages and physical pages (e.g., contiguous chunks of 4KB). A VM miss is called a page fault (more on this later). 65

63 Locating an Object in a Cache SRAM Cache (L1,L2, etc) Tag stored with cache line Maps from cache block to a memory address No tag for blocks not in cache If not in cache, then it is in main memory Hardware retrieves and manages tag information Can quickly match against multiple tags 66

64 Locating an Object in a Cache (cont.): DRAM Cache (virtual memory): Each allocated page of virtual memory has an entry in the page table, which maps from virtual pages to physical pages. There is one entry per page in the virtual address space. There is a page table entry even if the page is not in memory; it specifies the disk address. The OS retrieves and manages page table information. [Figure: a page table maps each object name (D, J, X) either to a location in the cache or to "On Disk".]

65 A System with Physical Memory Only Examples: Most Cray machines, early PCs, nearly all embedded systems, etc Addresses generated by the CPU point directly to bytes in physical memory 68

66 A System with Virtual Memory: Examples: workstations, servers, modern PCs, etc. [Figure: the CPU issues virtual addresses (0 to N-1), which the page table maps to physical addresses (0 to P-1) in memory or to locations on disk.] Address translation: hardware converts virtual addresses to physical addresses via an OS-managed lookup table (page table). 69

67 Page Faults (Similar to "Cache Misses"): What if an object is on disk rather than in memory? The page table entry indicates that the virtual address is not in memory. An OS exception handler is invoked to move data from disk into memory. The OS has full control over placement: full associativity to minimize future misses. [Figure: CPU, page table, memory, and disk before and after the fault.] 70

68 Does VM Satisfy Original Motivations? Multiple active programs can share physical address space Address conflicts are resolved All programs think their code is at 0x Data from different programs can be protected Programs can share data or code when desired 71

69 Answer: Yes, Using Separate Address Spaces Per Program: Each program has its own virtual address space and its own page table. Addresses 0x from different programs can map to different locations or to the same location, as desired. The OS controls how virtual pages are assigned to physical memory. 72

70 Translation: High-level View: Fixed-size pages. A physical page is sometimes called a frame. 73

71 Translation: Process 74

72 Translation Process Explained Valid page Check access rights (R,W,X) against access type Generate physical address if allowed Generate a protection fault (exception) if illegal access Invalid page Page is not currently mapped and a page fault is generated Faults are handled by the operating system Sometimes due to a program error => program terminated E.g. accessing out of the bounds of array Sometimes due to caching => refill & restart Desired data or code available on disk Space allocated in DRAM, page copied from disk, page table updated Replacement may be needed 75
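
The decision sequence above (valid check, access-rights check, physical address generation) can be sketched as a lookup function. Everything named here is illustrative: the 4KB page size, the dictionary-based page table, the `"rwx"` rights strings, and the exception classes are assumptions for the sketch, not the format of any real page table.

```python
PAGE_SIZE = 4096  # assumed 4KB pages (12-bit offset)


class PageFault(Exception):
    """Page is not currently mapped; the OS must handle it."""


class ProtectionFault(Exception):
    """Access type is not allowed by the page's access rights."""


def translate(page_table, vaddr, access):
    """Translate a virtual address for an access of type 'r', 'w', or 'x'."""
    vpn, offset = divmod(vaddr, PAGE_SIZE)     # virtual page number, page offset
    entry = page_table.get(vpn)
    if entry is None or not entry["valid"]:
        raise PageFault(vpn)                   # invalid page -> page fault
    if access not in entry["rights"]:
        raise ProtectionFault(vpn)             # illegal access -> protection fault
    return entry["ppn"] * PAGE_SIZE + offset   # physical page number + same offset
```

For example, with virtual page 0x10 mapped to physical page 0x3 with rights `"rx"`, a read of virtual address 0x10234 translates to 0x3234, while a write to the same address raises `ProtectionFault`.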

73 VM: Replacement and Writes To reduce page fault rate, OS uses least-recently used (LRU) replacement Reference bit (aka use bit) in PTE set to 1 on access to page Periodically cleared to 0 by OS A page with reference bit = 0 has not been used recently Disk writes take millions of cycles Block at once, not individual locations Write through is impractical Use write-back Dirty bit in PTE set when page is written 76

74 VM: Issues with Unaligned Accesses Memory access might be aligned or unaligned What happens if unaligned address access straddles a page boundary? What if one page is present and the other is not? Or, what if neither is present? MIPS architecture disallows unaligned memory access Interesting legacy problem on 80x86 which does support unaligned access 77

75 Fast Translation Using a TLB Address translation would appear to require extra memory references One to access the PTE Then the actual memory access But access to page tables has good locality So use a fast hardware cache of PTEs within the processor Called a Translation Look-aside Buffer (TLB) Typical: PTEs, cycle for hit cycles for miss, 0.01%-1% miss rate Misses could be handled by hardware or software 78

76 Fast Translation Using a TLB 79

77 TLB Entries: The TLB is a cache for page table entries (PTEs). The data for a TLB entry (== a PTE): physical page number (frame #); access rights (R/W bits); any other PTE information (dirty bit, LRU info, etc.). The tags for a TLB entry: virtual page number (the portion of it not used for indexing into the TLB); valid bit; LRU bits, if the TLB is associative and LRU replacement is used. 80

78 TLB Case Study: MIPS R2000/R3000 Consider the MIPS R2000/R3000 processors Addresses are 32 bits with 4KB pages (12 bit offset) TLB has 64 entries, fully associative Each entry is 64 bits wide 81

79 TLB Misses: If the page is in memory: load the PTE from memory and retry. This could be handled in hardware, which can get complex for more complicated page table structures, or in software: raise a special exception, with an optimized handler. This is what MIPS does, using a special vectored interrupt. If the page is not in memory (page fault): the OS handles fetching the page and updating the page table, then restarts the faulting instruction. 82

80 TLB & Memory Hierarchies: Once the address is translated, it is used to access the memory hierarchy: a hierarchy of caches (L1, L2, etc.). 83

81 TLB and Cache Interaction: Basic process: use the TLB to get the PA, then use the PA to access caches and DRAM. Question: can you ever access the TLB and the cache in parallel? 84

82 TLB Caveats What happens to the TLB when switching between programs The OS must flush the entries in the TLB Large number of TLB misses after every switch Alternatively, use PIDs (process ID) in each TLB entry Allows entries from multiple programs to co-exist Gradual replacement Limited reach 64 entry TLB with 8KB pages maps 0.5MB Smaller than many L2 caches in most systems TLB miss rate > L2 miss rate! Potential solutions Multilevel TLBs ( just like multi-level caches) Larger pages 85
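
TLB reach is simply the number of entries times the page size. The sketch below uses an assumed helper name to check the figure on this slide and to show the effect of larger pages; the 4MB page size in the usage note is only an illustrative value.

```python
KB = 1024
MB = 1024 * KB


def tlb_reach(entries, page_size_bytes):
    """Bytes of address space covered by a fully populated TLB."""
    return entries * page_size_bytes
```

A 64-entry TLB with 8KB pages covers `tlb_reach(64, 8 * KB)`, which is 512KB (0.5MB); with 4MB pages the same TLB would cover 256MB.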

83 Page Size Tradeoff: Larger pages: Advantages: smaller page tables; fewer page faults and more efficient transfer with larger applications; improved TLB coverage. Disadvantages: higher internal fragmentation. Smaller pages: Advantages: improved time to start up small processes with fewer pages; internal fragmentation is low (important for small programs). Disadvantages: high overhead in large page tables. General trend toward larger pages: 1978: 512B, 1984: 4KB, 1990: 16KB, 2000: 64KB. 86

84 Multiple Page Sizes: Many machines support multiple page sizes. SPARC: 8KB, 64KB, 1MB, 4MB. MIPS R4000: 4KB-16MB. Page size depends upon the application: the OS kernel uses large pages; user applications use smaller pages. Issues: software complexity, TLB complexity. 87

85 Virtual Memory Summary: Use the hard disk (or Flash) as large storage for the data of all programs. Main memory (DRAM) is a cache for the disk, managed jointly by hardware and the operating system (OS). Each running program has its own virtual address space: the address space shown in the previous figure, protected from other programs. Frequently-used portions of the virtual address space are copied to DRAM; DRAM = physical address space. Hardware + OS translate virtual addresses (VA) used by the program to physical addresses (PA) used by the hardware. Translation enables relocation & protection. 88

86 Acknowledgements These slides contain material from courses: UCB CS152 Stanford EE108B 89
