COEN-4730 Computer Architecture Lecture 3 Review of Caches and Virtual Memory

Size: px

Start display at page:

Download "COEN-4730 Computer Architecture Lecture 3 Review of Caches and Virtual Memory"

Steven Jones
6 years ago
Views:

1 1 COEN-4730 Computer Architecture Lecture 3 Review of Caches and Virtual Memory Cristinel Ababei Dept. of Electrical and Computer Engineering Marquette University Credits: Slides adapted from presentations of Sudeep Pasricha and others: Kubiatowicz, Patterson, Mutlu, Elsevier Outline Memory hierarchy, example, terminology 4 questions 5 basic cache optimizations Virtual memory Summary 2

2 2 Since 1980, CPU has outpaced DRAM... Performance (1/latency) CPU CPU 60% per yr 2X in 1.5 yrs Gap grew 50% per year DRAM 9% per yr DRAM 2X in 10 yrs Year How did architects address this gap? Put small, fast cache memories between CPU and DRAM. Create a memory hierarchy 3 Memory Hierarchy Everything is a cache for something else Take advantage of the principle of locality to: Present as much memory as possible in the cheapest technology Provide access at speed offered by the fastest technology Processor Datapath Control Registers On-Chip Cache L3 Cache (SRAM) Main Memory (DRAM/ PCM) Secondary Storage (Disk/ FLASH/ PCM) Tertiary Storage (Tape/ Cloud Storage) Speed (ns): 1s 10s-100s 100s 10,000,000s (10s ms) 10,000,000,000s (10s sec) Size (bytes): 100s Ks-Ms Ms Gs Ts 4

3 Example of Memory Hierarchy: Apple imac G5 (released 2004) Managed by hardware Managed by

512K 256M 80G Latency Cycles, Time 1, 0.6 ns 3, 1.9 ns 3, 1.9 ns 11, 6.

6 GHz Goal: Illusion of large, fast, cheap memory Let programs address a memory space that

3 3 Example of Memory Hierarchy: Apple imac G5 (released 2004) Managed by hardware Managed by compiler Managed by OS, hardware, application Reg L1 Inst L1 Data L2 DRAM Disk Size 1K 64K 32K 512K 256M 80G Latency Cycles, Time 1, 0.6 ns 3, 1.9 ns 3, 1.9 ns 11, 6.9 ns 88, 55 ns 10 7, 12 ms imac G5 1.6 GHz Goal: Illusion of large, fast, cheap memory Let programs address a memory space that scales to the disk size, at a speed that is usually as fast as register access 5 imac s PowerPC 970: all caches on-chip L1 (64K Instruction) Registers 512K L2 (1K) L1 (32K Data) 6

4 4 Memory Hierarchy: Terminology Hit: data appears in some block in the upper level (example: Block X) Hit Rate: Fraction of memory access found in the upper level Hit Time: Time to access the upper level which consists of: Time to determine hit/miss + Memory access time Miss: data needs to be retrieved from a block in the lower level (Block Y) Miss Rate = 1 - (Hit Rate) Miss Penalty: Time to replace a block in the upper level + Time to deliver the block to the processor Hit Time << Miss Penalty (500 instructions on 21264!) To Processor Upper Level Memory Lower Level Memory From Processor Blk X Blk Y 7 Program Locality Programs access a small proportion of their address space at any time Temporal locality Items accessed recently are likely to be accessed again soon e.g., instructions in a loop, induction variables Spatial locality Items near those accessed recently are likely to be accessed soon E.g., sequential instruction access, array data 8

5 5 Outline Memory hierarchy, example, terminology 4 questions 5 basic cache optimizations Virtual memory Summary 9 4 Questions for Memory Hierarchy Q1: Where can a block be placed in the upper level? (Block placement) Q2: How is a block found if it is in the upper level? (Block identification) Q3: Which block should be replaced on a miss? (Block replacement) Q4: What happens on a write? (Write strategy) 10

6 6 Q1: Where can a block be placed in the upper level? Block 12 placed in 8 block cache: Direct mapped, 2-way set associative (SA), Fully associative S.A. Mapping = Block Number MODULO Number Sets Direct Mapped (12 mod 8) = Way Set Associative (12 mod 4) = Fully associative (Full Mapped) Cache Memory 11 Direct Mapped Cache Location determined by address Direct mapped: only one choice (Block address) modulo (#Blocks in cache) #Blocks is a power of 2 Use loworder address bits 12

7 7 Sources of Cache Misses Compulsory (cold start or process migration, first reference): first access to a block Cold fact of life: not a whole lot you can do about it Note: If you are going to run billions of instruction, Compulsory Misses are insignificant Capacity: Cache cannot contain all blocks accessed by the program Solution: increase cache size Conflict (collision): Multiple memory locations mapped to the same cache location Solution 1: increase cache size Solution 2: increase associativity Coherence (invalidation): other process (e.g., I/O) updates memory 13 Q2: How is a block found if it is in the upper level? Block is minimum quantum of caching Data select field used to select data within block Index Used to Lookup Candidates in Cache Index identifies the set How do we know which particular block is stored in a cache location? Store block address as well as the data Actually, only need the high-order bits Called the tag If no candidates match, then declare cache miss What if there is no data in a location? Valid bit: 1 = present, 0 = not present Initially 0 14

: : : 8 Address Subdivision 15 Review: Direct Mapped Cache Direct Mapped 2 N byte cache: The uppermost (32 - N) bits are always the Cache Tag The lowest M bits are the Byte Select (Block Size = 2 M )

8 : : : 8 Address Subdivision 15 Review: Direct Mapped Cache Direct Mapped 2 N byte cache: The uppermost (32 - N) bits are always the Cache Tag The lowest M bits are the Byte Select (Block Size = 2 M ) Example: 1 KB Direct Mapped Cache with 32 B Blocks Index chooses potential block Tag checked to verify block Byte select chooses byte within block Cache Tag Cache Index Ex: 0x50 Ex: 0x Byte Select Ex: 0x00 0 Valid Bit : Cache Tag 0x50 : Byte 31 Byte 63 Cache Data Byte 1 Byte 33 : Byte 0 Byte Byte 1023 Byte

9 : : 9 Review: Set Associative (SA) Cache N-way set associative: N entries (blocks) per Cache Index N direct mapped caches operates in parallel Example: Two-way Set Associative cache Cache Index selects a set from the cache Two tags in the set are compared to input in parallel Data is selected based on the tag result Cache Tag Cache Index 5 4 Byte Select 0 Valid Cache Tag Cache Data Cache Block 0 : : : Cache Data Cache Block 0 : Cache Tag Valid : : Compare Sel1 1 Mux 0 Sel0 Compare OR CS252-S10, Hit Lecture 03 Cache Block 17 Review: Fully Associative Cache Fully Associative: Every block can hold any line Address does not include a cache index Compare Cache Tags of all Cache Entries in Parallel Example: Block Size = 32 B blocks We need N 27-bit comparators Still have byte select to choose from within block 31 Cache Tag (27 bits long) 4 Byte Select Ex: 0x01 0 = = = = Cache Tag Valid Bit Cache Data Byte 31 Byte 63 Byte 1 Byte 33 Byte 0 Byte 32 = : : : 18

10 10 Q3: Which block should be replaced on a miss? Easy for Direct Mapped Set Associative or Fully Associative: LRU (Least Recently Used): Appealing, but hard to implement for high associativity Random: Easy, but how well does it work?» Miss rates: Assoc: 2-way 4-way 8-way Size LRU Ran LRU Ran LRU Ran 16K 5.2% 5.7% 4.7% 5.3% 4.4% 5.0% 64K 1.9% 2.0% 1.5% 1.7% 1.4% 1.5% 256K 1.15% 1.17% 1.13% 1.13% 1.12% 1.12% 19 Q4: What happens on a Write? Write-Through Write-Back Policy Data written to cache block also written to lower-level memory Write data only to the cache block Update lower level when a block falls out of the cache Debug Easy Hard Do read misses produce writes? Do repeated writes make it to lower level? No Yes Yes No 20

11 11 Write Buffers for Write-Through Caches Processor Cache Write Buffer Lower Level Memory Holds data awaiting write-through to lower level memory Q. Why a write buffer? Q. Why a buffer, why not just one register? Q. Are Read After Write (RAW) hazards an issue for write buffer? A. So CPU doesn t stall A. Bursts of writes are common A. Yes! Drain buffer before next read, or check write buffers for match on reads 21 Outline Memory hierarchy, example, terminology 4 questions 5 basic cache optimizations Virtual memory Summary 22

12 12 5 Basic Cache Optimizations Reducing Miss Rate 1. Larger Block size (compulsory misses) 2. Larger Cache size (capacity misses) 3. Higher Associativity (conflict misses) Reducing Miss Penalty 4. Multilevel Caches Reducing hit time 5. Giving Reads Priority over Writes E.g., Read complete before earlier writes in write buffer 23 Outline Memory hierarchy, example, terminology 4 questions 5 basic cache optimizations Virtual memory Summary 24

13 13 Virtual Memory Use main memory as a cache for secondary (disk) storage Managed jointly by CPU hardware and the operating system (OS) Programs share main memory Each gets a private virtual address space holding its frequently used code and data Protected from other programs CPU and OS translate virtual addresses to physical addresses VM block is called a page; Typical size of a page: 1K - 8K VM translation miss is called a page fault 25 Virtual Address Space Virtual Memory Physical Address Space Virtual Address V page no. 10 offset Page Table Base Reg index into page table Page Table V Access Rights PA table located in physical memory P page no. offset 10 Physical Address Virtual Memory => treat memory as a cache for the disk Page table maps virtual page numbers to physical frames PTE = Page Table Entry 26

14 14 Page Fault Penalty On page fault, the page must be fetched from disk Takes millions of clock cycles Handled by OS code Try to minimize page fault rate Fully associative placement Smart replacement algorithms 27 Page Tables Stores placement information Array of page table entries, indexed by virtual page number Page table register in CPU points to page table in physical memory If page is present in memory PTE stores the physical page number Plus other status bits (referenced, dirty, ) If page is not present PTE can refer to location in swap space on disk

15 15 Translation Using a Page Table Mapping Pages to Storage

16 16 Replacement and Writes To reduce page fault rate, prefer leastrecently used (LRU) replacement Reference bit (aka use bit) in PTE set to 1 on access to page Periodically cleared to 0 by OS A page with reference bit = 0 has not been used recently Disk writes take millions of cycles Block at once, not individual locations Write through is impractical Use write-back Dirty bit in PTE set when page is written Fast Translation Using a TLB Address translation would appear to require extra memory references One to access the PTE Then the actual memory access But access to page tables has good locality So use a fast cache of PTEs within the CPU Called a Translation Look-aside Buffer (TLB) Typical: PTEs, cycle for hit, cycles for miss, 0.01% 1% miss rate Fully Associative Misses could be handled by hardware or software

Associative, or Direct Mapped Translation with a TLB hit VA PA miss

17 17 Translation Look-Aside Buffers Translation Look-Aside Buffers (TLB) Cache applied to address translations Fully Associative, Set Associative, or Direct Mapped Translation with a TLB hit VA PA miss CPU TLB Cache miss Translation hit data Main Memory 33 Fast Translation Using a TLB

18 18 TLB Misses If page is in memory Load the PTE from memory and retry Could be handled in hardware» Can get complex for more complicated page table structures Or in software» Raise a special exception, with optimized handler If page is not in memory (page fault) OS handles fetching the page and updating the page table Then restart the faulting instruction TLB Miss Handler TLB miss indicates Page present, but PTE not in TLB Page not preset Must recognize TLB miss before destination register overwritten Raise exception Handler copies PTE from memory to TLB Then restarts instruction If page not present, page fault will occur

instruction TLB and Cache Interaction If cache tag uses physical address Need to translate before cache lookup

19 19 Page Fault Handler Use faulting virtual address to find PTE Locate page on disk Choose page to replace If dirty, write to disk first Read page into memory and update page table Make process runnable again Restart from faulting instruction TLB and Cache Interaction If cache tag uses physical address Need to translate before cache lookup Alternative: use virtual address tag Complications due to aliasing» Different virtual addresses for shared physical address

20 20 Memory Protection Different tasks can share parts of their virtual address spaces But need to protect against errant access Requires OS assistance Hardware support for OS protection Privileged supervisor mode (aka kernel mode) Privileged instructions Page tables and other state information only accessible in supervisor mode System call exception (e.g., syscall in MIPS) Three Advantages of Virtual Memory 1. Translation: Program can be given consistent view of memory, even though physical memory is scrambled Makes multithreading reasonable (now used a lot!) Only the most important part of program ( Working Set ) must be in physical memory. Contiguous structures (like stacks) use only as much physical memory as necessary yet still grow later. 2. Protection: Different threads (or processes) protected from each other. Different pages can be given special behavior» (Read Only, Invisible to user programs, etc.) Kernel data protected from User programs Very important for protection from malicious programs 3. Sharing: Can map same physical page to multiple users ( Shared memory ) 40

21 21 Outline Memory hierarchy, example, terminology 4 questions 5 basic cache optimizations Virtual memory Summary 41 Summary #1/3: The Cache Design Space Several interacting dimensions cache size block size associativity replacement policy write-through vs. write-back write allocation The optimal choice is a compromise depends on access characteristics» workload Bad» use (I-cache, D-cache, TLB) depends on technology / cost Simplicity often wins Good Cache Size Factor A Less Associativity Block Size Factor B More 42

22 22 Summary #2/3: Caches The Principle of Locality: Program access a relatively small portion of the address space at any instant of time.» Temporal Locality: Locality in Time» Spatial Locality: Locality in Space Three Major Categories of Cache Misses: Compulsory Misses: sad facts of life. Example: cold start misses. Capacity Misses: increase cache size Conflict Misses: increase cache size and/or associativity. Nightmare Scenario: ping pong effect! Write Policy: Write Through vs. Write Back Today CPU time is a function of (ops, cache misses) vs. just of(ops): affects Compilers, Data structures, and Algorithms 43 Summary #3/3: TLB, Virtual Memory Page tables map virtual address to physical address TLBs are important for fast translation TLB misses are significant in processor performance Caches, TLBs, Virtual Memory all understood by examining how they deal with 4 questions: 1) Where can block be placed? 2) How is block found? 3) What block is replaced on miss? 4) How are writes handled? Today VM allows many processes to share single memory without having to swap all processes to disk; today VM protection is paramount! 44

Handout 4 Memory Hierarchy

Handout 4 Memory Hierarchy Outline Memory hierarchy Locality Cache design Virtual address spaces Page table layout TLB design options (MMU Sub-system) Conclusion 2012/11/7 2 Since 1980, CPU has outpaced