Improving Cache Performance and Memory Management: From Absolute Addresses to Demand Paging. Highly-Associative Caches


Improving Cache Performance and Memory Management: From Absolute Addresses to Demand Paging
6.823, L8--1
Asanovic, Laboratory for Computer Science, M.I.T.

Highly-Associative Caches

For high associativity, use content-addressable memory (CAM) for tags
- Used in low-power microprocessors, e.g. StrongARM is 32-way set-associative (higher hit rates at lower energy than 2-4 way set-associative RAM tags)
- Comparator per tag requires more transistors (~double area per tag bit)

[Figure: CAM-tag cache. The address is split into tag (t), set (i) and offset (b) fields. Sets 0 to i each hold several "Tag =?" comparators paired with data blocks. Only one set is enabled and only the hit data is accessed. Outputs: Hit?, Data.]

Replacement Policy

In an associative cache, which block from a set should be evicted when the set becomes full?
- Random
- Least Recently Used (LRU)
  - LRU cache state must be updated on every access
  - true implementation only feasible for small sets (2-way easy)
  - pseudo-LRU binary tree often used for 4-8 way, e.g. 3 state bits for 4-way pseudo-LRU (way3, way2, way1, way0)
- First In, First Out (FIFO), aka Round-Robin
  - used in highly associative caches

This is a second-order effect. How often does replacement happen?

CPU-Cache Interaction (Simple 5-stage pipeline)

[Figure: 5-stage pipeline datapath. The PC (with PCen and a 0x4 adder) addresses the primary instruction cache (addr, inst, hit?), which feeds IR; a nop can be inserted. Decode and register fetch are followed by the ALU and the primary data cache (we, addr, wdata, rdata, hit?). The entire CPU stalls on a data cache miss. Cache refill data comes from lower levels of the memory hierarchy through the memory control.]
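The 3-state-bit tree pseudo-LRU mentioned on the Replacement Policy slide can be sketched as follows. This is a minimal illustration, not a description of any particular processor: the class name `TreePLRU4` and the bit polarity (each bit points toward the less recently used half) are assumptions made for the example.

```python
class TreePLRU4:
    """Tree pseudo-LRU for one 4-way set, using 3 state bits.

    Assumed convention: each bit points toward the side used LESS recently.
      bits[0]: 0 -> pair (way0, way1), 1 -> pair (way2, way3)
      bits[1]: 0 -> way0, 1 -> way1
      bits[2]: 0 -> way2, 1 -> way3
    """

    def __init__(self):
        self.bits = [0, 0, 0]

    def touch(self, way):
        """Record an access: flip the bits on the path to point away from `way`."""
        if way < 2:
            self.bits[0] = 1          # other pair is now the older one
            self.bits[1] = 1 - way    # sibling within the pair is older
        else:
            self.bits[0] = 0
            self.bits[2] = 1 - (way - 2)

    def victim(self):
        """Follow the bits from the root to pick the way to evict."""
        if self.bits[0] == 0:
            return self.bits[1]
        return 2 + self.bits[2]


plru = TreePLRU4()
for way in (0, 1, 2, 3):
    plru.touch(way)
first_victim = plru.victim()   # matches true LRU here (way 0)
plru.touch(0)
second_victim = plru.victim()  # true LRU would pick way 1; the tree picks way 2
```

The second eviction shows why the scheme is only "pseudo" LRU: three bits cannot record the full access order of four ways, so the tree occasionally evicts a block that is not the least recently used.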

Write Policy

Cache hit:
- write through: write both cache & memory
  - generally higher traffic but simplifies cache coherence
- write back: write cache only (memory is written only when the entry is evicted)
  - a dirty bit per block can further reduce the traffic
Cache miss:
- no write allocate: only write to main memory
- write allocate (aka fetch on write): fetch block into cache
Common combinations:
- write through and no write allocate
- write back with write allocate

Managing Cache Writes

- In a direct-mapped cache, can we write cache data RAM in same cycle as cache tag RAM read?
- In a highly-associative cache with CAM tags, can we write cache data in the same cycle as tag CAM search?
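To make the traffic difference between the two common combinations on the Write Policy slide concrete, the sketch below counts memory transactions for a stream of stores. It assumes a small direct-mapped cache; the function name `store_traffic` and the default sizes are hypothetical, and the write-through count is in word writes while the write-back counts are in whole-block transfers.

```python
def store_traffic(store_addrs, num_blocks=4, block_size=16):
    """Compare memory traffic for a list of store byte addresses.

    Returns (write_through_writes, block_fetches, block_writebacks):
      - write through + no write allocate: every store writes memory
      - write back + write allocate: a miss fetches the block, and a dirty
        block is written back only when it is evicted
    """
    write_through_writes = len(store_addrs)

    tags = [None] * num_blocks      # block number held in each cache line
    dirty = [False] * num_blocks    # one dirty bit per line
    block_fetches = 0
    block_writebacks = 0
    for addr in store_addrs:
        block = addr // block_size
        index = block % num_blocks
        if tags[index] != block:    # write miss: allocate the block
            if dirty[index]:
                block_writebacks += 1
            tags[index] = block
            block_fetches += 1
        dirty[index] = True         # the store modifies the cached copy
    return write_through_writes, block_fetches, block_writebacks


traffic = store_traffic([0, 4, 8, 12, 64, 68])
```

Four stores to the same block cost four memory writes under write through but only one fetch, and one later write-back, under write back with write allocate.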

Pipelining Cache Writes

Possible solutions:
- Writes take two cycles in memory stage, one cycle for tag check plus one cycle for data write if hit
- Design data RAM that can perform read and write in one cycle, restore old value after tag miss
- Hold write data for store in single buffer ahead of cache, write cache data during next store's tag check
  - Need to bypass from write buffer if read matches write buffer tag
- Use CAM tags: data write only enabled if hit

Cache Performance

Average memory access time = Hit time + Miss rate x Miss penalty

To improve performance:
- reduce the hit time
- reduce the miss rate (e.g., larger cache)
- reduce the miss penalty (e.g., L2 cache)

First order effect: the size and the hit time
=> design the largest primary cache without slowing down the clock or adding pipeline stages
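As a worked example of the average memory access time formula on the Cache Performance slide, the sketch below evaluates it with and without an L2 cache. All numbers (hit times, miss rates, penalties) are made-up illustrative values, not measurements from the lecture.

```python
def amat(hit_time, miss_rate, miss_penalty):
    """Average memory access time = hit time + miss rate x miss penalty."""
    return hit_time + miss_rate * miss_penalty


# Assumed numbers: 1-cycle L1 hit, 5% L1 miss rate, 100-cycle main memory.
l1_only = amat(hit_time=1, miss_rate=0.05, miss_penalty=100)

# Adding an L2 (assumed 10-cycle hit, 20% local miss rate) reduces the
# L1 miss penalty: the penalty becomes the AMAT of the L2 itself.
l2_amat = amat(hit_time=10, miss_rate=0.20, miss_penalty=100)
with_l2 = amat(hit_time=1, miss_rate=0.05, miss_penalty=l2_amat)
```

With these assumptions the L1-only system averages about 6 cycles per access and the two-level system about 2.5, which illustrates the "reduce the miss penalty (e.g., L2 cache)" bullet.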

Causes for Cache Misses

- Compulsory: first-reference, aka cold start, misses
  - misses that would occur even with infinite cache
- Capacity: cache is too small to hold all data needed by the program
  - misses that would occur even under perfect placement & replacement policy
- Conflict: misses that occur because of collisions due to block-placement strategy
  - misses that would not occur with full associativity

Determining the type of a miss requires running program traces on a cache simulator

Effect of Cache Parameters on Performance

Larger cache size
+ reduces capacity and conflict misses
- hit time may increase

Larger block size
+ spatial locality reduces compulsory misses and capacity reload misses
- fewer blocks may increase conflict miss rate
- larger blocks may increase miss penalty

Higher associativity
+ reduces conflict misses (up to around 4-8 way)
- hit time may increase
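The Causes for Cache Misses slide notes that classifying a miss requires a cache simulator. A minimal sketch of such a classifier is shown below. It assumes a direct-mapped cache under test and uses a fully associative LRU cache of the same size as the reference for capacity misses; LRU stands in for the "perfect replacement policy" of the definition, which is a simplification. The name `classify_misses` is hypothetical.

```python
from collections import OrderedDict


def classify_misses(block_trace, num_blocks):
    """Classify each access of a direct-mapped cache with `num_blocks` lines.

    block_trace is a sequence of block numbers (addresses already divided
    by the block size).
    """
    seen = set()                    # infinite cache: every block ever touched
    lru = OrderedDict()             # fully associative LRU reference cache
    direct = [None] * num_blocks    # direct-mapped cache under test
    counts = {"hit": 0, "compulsory": 0, "capacity": 0, "conflict": 0}

    for block in block_trace:
        # Update the fully associative reference.
        fa_hit = block in lru
        if fa_hit:
            lru.move_to_end(block)
        else:
            if len(lru) == num_blocks:
                lru.popitem(last=False)   # evict least recently used
            lru[block] = True

        # Update the direct-mapped cache.
        index = block % num_blocks
        dm_hit = direct[index] == block
        direct[index] = block

        if dm_hit:
            counts["hit"] += 1
        elif block not in seen:
            counts["compulsory"] += 1     # would miss even in an infinite cache
        elif not fa_hit:
            counts["capacity"] += 1       # misses even with full associativity
        else:
            counts["conflict"] += 1       # full associativity would have hit
        seen.add(block)
    return counts


conflict_example = classify_misses([0, 4, 0, 4], num_blocks=4)
capacity_example = classify_misses([0, 1, 2, 3, 0], num_blocks=2)
```

In the first trace, blocks 0 and 4 map to the same line and keep evicting each other, so the repeat accesses are conflict misses. In the second, the working set of four blocks exceeds the two-block cache, so the return to block 0 is a capacity miss.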

Reducing Hit Time

- On-chip versus off-chip caches
- Pipelining write tag check and data update for single-cycle write-hits
- Sum-Addressed Caches (UltraSPARC-III)
  - Can evaluate A+B=C equality faster than adding A and B!
  - In register+offset addressing mode, no need to perform addition before cache access
  - Tag compare unit performs A+B=C operation
- Pseudo-associative caches (way-predicting caches)
  - Guess which way will have hit, only look there first
  - Check other ways sequentially on miss
  - Combine direct-mapped hit time, associative miss rate
  - Larger hit time if way prediction poor

Techniques to Further Reduce Cache Miss Rate

- Hardware prefetching of instructions and data
  - Can remove compulsory misses
  - Stream buffers predict sequential or strided accesses
  - Speculative fetches can add useless memory traffic
- Software prefetching
  - Software prefetch requires extending the ISA with register and cache prefetch instructions
  - Takes extra instruction issue slots
  - Software tuned to one hardware implementation
- Other software techniques
  - placement (array padding) to reduce cache conflicts
  - blocking transformations (reuse data once in cache) to reduce capacity misses

Reducing Miss Penalty

- Give priority to read-misses over writes and write-backs
  - queue writes in write buffer, let read overtake writes to memory
  - must check write buffer for match on read address
- Multi-level caches
  - reduced latency on a primary cache miss
- Fetch critical word in block first (aka wrap-around refill)
  - restart the processor as soon as needed word arrives
- Sub-block placement, fetch only part of a block on miss
  - small tag overhead from large blocks, but low miss penalty from small sub-blocks
  - need an extra valid bit per sub-block
  [Figure: one cache line with a single tag and four sub-blocks (sub-blk. 0 to sub-blk. 3), each with its own valid bit V]
- Victim caches
  - hold recently evicted blocks nearby to reduce conflict miss penalty for low-associativity caches
- Non-blocking caches to reduce stalls on cache misses

Memory Management

The Fifties:
- Absolute Addresses
- Dynamic address translation
The Sixties:
- Paged memory systems and TLBs
- Atlas demand paging
Modern Virtual Memory Systems (next lecture)
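The critical-word-first (wrap-around) refill listed on the Reducing Miss Penalty slide can be illustrated by the order in which the words of a block are fetched. The helper below is a small sketch; the name `wraparound_order` and the 8-word block in the example are assumptions.

```python
def wraparound_order(critical_word, words_per_block):
    """Order in which words are fetched for a wrap-around refill.

    The word that caused the miss comes first, so the processor can be
    restarted as soon as it arrives; the rest of the block follows,
    wrapping around to the start of the block.
    """
    return [(critical_word + i) % words_per_block
            for i in range(words_per_block)]


refill = wraparound_order(critical_word=5, words_per_block=8)
# refill == [5, 6, 7, 0, 1, 2, 3, 4]
```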

Types of Names for Memory Locations

[Figure: machine language address -> (ISA) -> virtual address -> (Address Mapping) -> physical address -> Physical Memory (DRAM)]

- Machine language address
  => as specified in machine code
- Virtual address
  => ISA specifies translation of machine code address into virtual address of program variable (may involve segment registers etc.)
- Physical address
  => operating system specifies mapping of virtual address into name for a physical memory location (i.e., actual address signals going to DRAM chips)

Absolute Addresses

EDSAC, early 50's

effective address = physical memory address

- Only one program ran at a time, with unrestricted access to entire machine (RAM + I/O devices)
- Addresses in a program depended upon where the program was to be loaded in memory
- But it was more convenient for programmers to write location-independent subroutines
  => Led to the development of loaders & linkers to statically relocate and link programs

Dynamic Address Translation

Motivation: In the early machines, I/O operations were slow and each word transferred involved the CPU
- Higher throughput if CPU and I/O of two or more programs were overlapped
  => multiprogramming
Location independent programs:
- Programming and storage management ease
  => need for a base register
Protection:
- Independent programs should not affect each other inadvertently
  => need for a bound register

[Figure: Physical Memory holding program1 and program2]

User versus Kernel

With multiprogramming came a move away from programming the bare machine
- users should be protected from each other (and program bugs)
- users need to share resources (e.g., CPU time, memory, disk)
Hardware support evolves to support OS
- device interrupts to support multiprogramming (can't enforce that users' software polls for each other's devices)
- protected state and privileged execution modes to run OS
  => exceptions to catch protection violations
Purely software schemes to manage protection
- high-level language programming only (no assembly code)
- trusted compiler
- software bugs cause security loopholes and system crashes

Simple Base and Bound Translation

[Figure: a "Load X" in the program address space produces an effective address. The effective address is compared (<=) against the Bound Register, raising a Bound Violation if it is out of range, and is added (+) to the Base Register to form the physical address within the current segment in main memory.]

Base and bounds registers only visible/accessible when processor running in kernel mode (aka supervisor mode)

Separate Areas for Program and Data

[Figure: two base-and-bound pairs. Data accesses ("Load X") use the effective address register with the Data Bound Register (<=, Bound Violation) and Data Base Register (+) to reach the data segment in main memory. Instruction fetches use the Program Counter with the Program Bound Register (<=, Bound Violation) and Program Base Register (+) to reach the program segment.]

- This permitted sharing of program segments
- Used today on Cray vector supercomputers
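The base-and-bound check shown on the Simple Base and Bound Translation slide can be sketched in a few lines. The sketch assumes the bound register holds the segment length, so valid effective addresses run from 0 to bound - 1; real machines differ in whether the bound is a length or a last valid address. The names `BoundViolation` and `base_bound_translate` are hypothetical.

```python
class BoundViolation(Exception):
    """Raised when an effective address falls outside the segment."""


def base_bound_translate(effective_addr, base, bound):
    """Translate an effective address to a physical address.

    Assumption: `bound` is the segment length, so the address is valid
    when 0 <= effective_addr < bound.
    """
    if not 0 <= effective_addr < bound:
        raise BoundViolation(hex(effective_addr))
    return base + effective_addr


# A program loaded at physical address 0x4000 with a 0x100-byte segment.
physical = base_bound_translate(0x10, base=0x4000, bound=0x100)

try:
    base_bound_translate(0x100, base=0x4000, bound=0x100)
    violation_caught = False
except BoundViolation:
    violation_caught = True
```

Relocating the program only requires changing `base`; the addresses inside the program stay the same, which is the location independence the slides motivate.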

Memory Fragmentation

[Figure: three snapshots of physical memory, each starting with OS Space]

Initially:           user 1 (16 K) | user 2 (24 K) | free (24 K) | user 3 (32 K) | free (24 K)
Users 4 & 5 arrive:  user 1 (16 K) | user 2 (24 K) | user 4 (16 K) | free (8 K) | user 3 (32 K) | user 5 (24 K)
Users 2 & 3 leave:   user 1 (16 K) | free (24 K) | user 4 (16 K) | free (40 K) | user 5 (24 K)

As users come and go, the storage is fragmented. Therefore, at some stage programs have to be moved around to compact the storage (burping the memory)

Paged Memory Systems: To reduce fragmentation

- Processor generated address can be interpreted as a pair <page number, offset>
- A page table contains the physical address of the base of each page
- Fixed-length pages plus indirection through page table relaxes the contiguous allocation requirement

[Figure: the address space of User-1 is mapped page by page through the page table of User-1 into non-contiguous locations in physical memory]
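The <page number, offset> interpretation on the Paged Memory Systems slide can be sketched as below. The 4096-byte page size, the dictionary used as a page table, and the name `paged_translate` are assumptions for the example; as on the slide, the table holds the physical base address of each page.

```python
PAGE_SIZE = 4096  # assumed page size in bytes


def paged_translate(virtual_addr, page_table):
    """Split the address into <page number, offset> and look up the page base."""
    page_number, offset = divmod(virtual_addr, PAGE_SIZE)
    page_base = page_table[page_number]   # physical address of the page's base
    return page_base + offset


# Two consecutive virtual pages placed in non-contiguous physical pages.
user1_page_table = {0: 0x8000, 1: 0x3000}

addr_a = paged_translate(0x0FFC, user1_page_table)  # last word of page 0
addr_b = paged_translate(0x1004, user1_page_table)  # second word of page 1
```

Adjacent virtual addresses on either side of a page boundary land in unrelated physical locations, which is exactly the relaxation of contiguous allocation that removes the need for compaction.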

Private Address Space per User

[Figure: Users 1, 2 and 3 each translate their own VA1 through a private page table into physical memory, which also holds OS pages and FREE page frames]

- Each user has a page table
- Page table contains an entry for each user page
- Where should page tables reside?

Where Should Page Tables Reside?

Space required by the page tables is proportional to the address space size, number of users, ... (and inversely proportional to the page size)
=> Space requirement is large: too expensive to keep in registers

Special registers just for the current user:
- need new management instructions
- affects the context-switching time
- may not be feasible for large page tables

Main memory:
- needs one reference to retrieve the page base address and another to access the data word
  => doubles number of memory references!

Page Tables in Physical Memory

[Figure: physical memory holds the page table of User 1 and the page table of User 2; VA1 of each user is mapped through that user's page table to a different physical page]

Translation Lookaside Buffers

Caching the Address Translation

[Figure: the virtual address is split into VPN and offset. The VPN is matched against TLB entries with fields V, R, W, D, tag, PPN. On a hit, the physical address is formed from PPN and offset. (VPN = virtual page number, PPN = physical page number)]

- TLB speeds up the address translation (IBM, late 60's)
- TLB keeps the <VPN, PPN> mappings for the recently accessed pages
- TLB also keeps additional information about each page, e.g., read/write, dirty
- Usually TLB is per process and flushed on a context switch
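The way a TLB avoids the extra memory reference for each translation can be sketched as follows. This is a simplified model under stated assumptions: a 4096-byte page size, a dictionary standing in for the page table in main memory, no capacity limit or replacement, and no V/R/W/D bits. The class name `SimpleTLB` is hypothetical.

```python
PAGE_SIZE = 4096  # assumed page size in bytes


class SimpleTLB:
    """Caches <VPN, PPN> mappings in front of an in-memory page table."""

    def __init__(self, page_table):
        self.page_table = page_table  # VPN -> PPN, stands in for main memory
        self.entries = {}             # cached <VPN, PPN> mappings
        self.table_walks = 0          # extra memory references on TLB misses

    def translate(self, virtual_addr):
        vpn, offset = divmod(virtual_addr, PAGE_SIZE)
        if vpn not in self.entries:   # TLB miss: read the page table
            self.table_walks += 1
            self.entries[vpn] = self.page_table[vpn]
        return self.entries[vpn] * PAGE_SIZE + offset

    def flush(self):
        """Drop all mappings, as done on a context switch."""
        self.entries.clear()


tlb = SimpleTLB({0: 7, 1: 3})
first = tlb.translate(4100)    # miss: one page-table reference
second = tlb.translate(4104)   # hit: same page, no extra reference
walks_before_flush = tlb.table_walks
tlb.flush()
tlb.translate(4100)            # misses again after the flush
walks_after_flush = tlb.table_walks
```

Only the first access to a page pays for the page-table read, so accesses with locality no longer double the number of memory references; the flush shows the cost of a context switch in this per-process arrangement.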

A Problem in Early Sixties

- There were many applications whose data could not fit in the main memory, e.g., Payroll
- Paged memory system reduced fragmentation but still required the whole program to be resident in the main memory
- Programmers moved the data back and forth from the secondary store by overlaying it repeatedly on the primary store: tricky programming!

Manual Overlays

[Figure: Ferranti Mercury, 1956: 40k bits main (central store), 640k bits drum]

Assuming an instruction can address all the storage on the drum:
- method 1: programmer keeps track of addresses in the main memory and initiates an I/O transfer when required
- method 2: automatic initiation of I/O transfers by software address translation (Brookner's interpretive coding, 1960)

Method 1 proved too difficult for users and method 2 too slow!

Demand Paging: Atlas

"A page from secondary storage is brought into the primary storage whenever it is (implicitly) demanded by the processor." (Tom Kilburn)

[Figure: central memory. Primary: 32 pages, 512 words/page. Secondary (drum): 32 x 6 pages.]

- User sees 32 x 6 x 512 words of storage
- Primary memory as a cache for secondary memory

Hardware Organization of Atlas

[Figure: the effective address goes through the initial address decode. Words are 48 bits, pages are 512 words, and there is 1 PAR (Page Address Register) per page frame: PARs 0 to 31, each holding <effective PN, status>.]

Store        Size       Speed            Contents
Fixed (ROM)  16 pages   0.4 ~1 µsec      system code (not swapped)
Subsidiary   2 pages    1.4 µsec         system data (not swapped)
Main         32 pages   1.4 µsec
Drum (4)     192 pages
Tape         8 decks    88 µsec/word

Compare the effective page address against all 32 PARs
- match => normal access
- no match => page fault; the state of the partially executed instruction was saved

Atlas Demand Paging Scheme

On a page fault:
- input transfer into a free page is initiated
- the PAR is updated
- if no free page is left, a page is selected to be replaced (based on usage)
- the replaced page is written on the drum
  - to minimize drum latency effect, the first empty page on the drum was selected
- the page table is updated to point to the new location of the page on the drum

Caching vs Demand Paging

[Figure: caching places a cache between the CPU and primary memory; demand paging places primary memory between the CPU and secondary memory]

Caching                        Demand paging
cache entry                    page-frame
cache block (~32 bytes)        page (~4K bytes)
cache miss (1% to 20%)         page miss (<0.001%)
cache hit (~1 cycle)           page hit (~100 cycles)
cache miss (~10 cycles)        page miss (~5M cycles)
a miss is handled in hardware  a miss is handled mostly in software
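The page-fault steps of the Atlas Demand Paging Scheme slide can be sketched as a small simulation. This is an illustration under assumptions, not a model of the Atlas supervisor: replacement "based on usage" is approximated by LRU, every replaced page is counted as one drum write, and the class name `DemandPager` is hypothetical.

```python
from collections import OrderedDict


class DemandPager:
    """Primary memory used as a cache for pages held on the drum."""

    def __init__(self, num_frames):
        self.num_frames = num_frames
        self.resident = OrderedDict()  # resident pages, least recent first
        self.page_faults = 0
        self.drum_writes = 0

    def access(self, page):
        if page in self.resident:            # match: normal access
            self.resident.move_to_end(page)
            return
        self.page_faults += 1                # no match: page fault
        if len(self.resident) == self.num_frames:
            # No free frame left: pick a victim (LRU here, an assumption)
            # and write the replaced page back to the drum.
            self.resident.popitem(last=False)
            self.drum_writes += 1
        self.resident[page] = True           # input transfer into the frame


pager = DemandPager(num_frames=2)
for page in (0, 1, 0, 2, 1):
    pager.access(page)
```

With two frames, the trace faults on pages 0, 1, 2 and then on 1 again (page 1 was the least recently used when page 2 arrived), giving four faults and two drum writes.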