CISC 662 Graduate Computer Architecture Lecture 16 - Cache and virtual memory review


1 CISC 662 Graduate Computer Architecture
Lecture 16 - Cache and virtual memory review
Michela Taufer
PowerPoint lecture notes from John Hennessy and David Patterson's Computer Architecture, 4th edition
Additional teaching material from Jelena Mirkovic (U Del) and John Kubiatowicz (UC Berkeley)

2 Cache

3 Recap: Who Cares About the Memory Hierarchy?
Processor-DRAM Memory Gap (latency)
[Figure: performance (log scale) versus time. µProc (CPU) performance improves 60%/yr (2X/1.5 yr, "Moore's Law"); DRAM performance improves 9%/yr (2X/10 yrs). The processor-memory performance gap grows 50%/year.]

4 Levels of the Memory Hierarchy

Level          Capacity       Access time              Cost
CPU registers  100s bytes     <1s ns
Cache          10s-100s KB    1-10 ns                  $10/MByte
Main memory    MBytes         100 ns-300 ns            $1/MByte
Disk           10s GBytes     10 ms (10,000,000 ns)    $0.0031/MByte
Tape           infinite       sec-min                  $0.0014/MByte

Staging/transfer unit between adjacent levels:
Registers <-> Cache    instructions, operands   1-8 bytes       managed by program/compiler
Cache <-> Memory       blocks                   8-128 bytes     managed by cache controller
Memory <-> Disk        pages                    512-4K bytes    managed by OS
Disk <-> Tape          files                    MBytes          managed by user/operator

Upper levels are faster; lower levels are larger.

5 What is a cache?
- Small, fast storage used to improve average access time to slow memory.
- Exploits spatial and temporal locality.
- In computer architecture, almost everything is a cache!
  » Registers: a cache on variables
  » First-level cache: a cache on second-level cache
  » Second-level cache: a cache on memory
  » Memory: a cache on disk (virtual memory)
  » TLB (translation lookaside buffer): a cache on the page table
  » Branch prediction: a cache on prediction information?
[Figure: hierarchy from Proc/Regs through L1-Cache, L2-Cache, Memory, to Disk, Tape, etc.; levels get faster toward the processor and bigger toward disk.]

6 The Principle of Locality
- The Principle of Locality: programs access a relatively small portion of the address space at any instant of time.
- Two different types of locality:
  » Temporal locality (locality in time): if an item is referenced, it will tend to be referenced again soon (e.g., loops, reuse)
  » Spatial locality (locality in space): if an item is referenced, items whose addresses are close by tend to be referenced soon (e.g., straight-line code, array access)
- For the last 15 years, hardware (HW) has relied on locality for speed

7 Memory Hierarchy: Terminology
- Hit: data appears in some block in the upper level (example: Block X)
  » Hit rate: the fraction of memory accesses found in the upper level
  » Hit time: time to access the upper level, which consists of RAM access time + time to determine hit/miss
- Miss: data needs to be retrieved from a block in the lower level (Block Y)
  » Miss rate = 1 - (hit rate)
  » Miss penalty: time to replace a block in the upper level + time to deliver the block to the processor
- Hit time << miss penalty
[Figure: the processor exchanges data with the upper-level memory (holding Blk X), which exchanges blocks with the lower-level memory (holding Blk Y).]

8 Cache Measures
- Hit rate: fraction of accesses found in that level
  » Usually so high that we talk about miss rate instead
  » Miss rate fallacy: miss rate can be as misleading a proxy for average memory-access time as MIPS is for CPU performance
- Average memory-access time = Hit time + Miss rate x Miss penalty (ns or clocks)
- Miss penalty: time to replace a block from the lower level, including time to replace in CPU
  » Access time: time to reach the lower level = f(latency to lower level)
  » Transfer time: time to transfer the block = f(bandwidth between upper and lower levels)
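The average memory-access time formula above can be tried out with a few lines of Python. This is only a sketch: the function name `amat` and the sample numbers (1-clock hit, 5% miss rate, 100-clock miss penalty) are assumptions chosen for illustration, not figures from the lecture.

```python
def amat(hit_time, miss_rate, miss_penalty):
    """Average memory-access time = hit time + miss rate x miss penalty.

    hit_time and miss_penalty must use the same unit (ns or clocks);
    miss_rate is a fraction between 0 and 1.
    """
    return hit_time + miss_rate * miss_penalty


# Hypothetical cache: 1-clock hit time, 5% miss rate, 100-clock miss penalty.
example_clocks = amat(hit_time=1, miss_rate=0.05, miss_penalty=100)
```

With these made-up numbers, misses contribute five times as many clocks as hits on average, which is why a small change in miss rate can dominate the result.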

9 Simplest Cache: Direct Mapped
[Figure: a memory with addresses 0 through F next to a 4-byte direct-mapped cache with cache indexes 0 through 3.]
- Location 0 can be occupied by data from memory locations 0, 4, 8, ... etc.
  » In general: any memory location whose 2 LSBs of the address are 0s
  » Address<1:0> => cache index
- Which one should we place in the cache?
- How can we tell which one is in the cache?

10 Four Questions for Memory Hierarchy
- Q1: Where can a block be placed in the upper level? (Block placement)
- Q2: How is a block found if it is in the upper level? (Block identification)
- Q3: Which block should be replaced on a miss? (Block replacement)
- Q4: What happens on a write? (Write strategy)

11 Q1: Where can a block be placed in the upper level?
- Block 12 placed in an 8-block cache:
  » Fully associative, direct mapped, 2-way set associative
  » Set-associative mapping = block number modulo number of sets
[Figure: memory block 12 placed in an 8-block cache. Fully associative: any block frame. Direct mapped: (12 mod 8) = 4. 2-way set associative: (12 mod 4) = 0, i.e. set 0.]
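The three placement policies can be treated as one rule with different associativity. The sketch below assumes frames are numbered so that set i owns frames i*ways through i*ways + ways - 1; the function name `candidate_frames` is hypothetical.

```python
def candidate_frames(block_number, num_frames, ways):
    """Return the cache frames in which a memory block may be placed.

    ways == 1           -> direct mapped
    ways == num_frames  -> fully associative
    anything in between -> set associative
    """
    num_sets = num_frames // ways
    set_index = block_number % num_sets  # block number modulo number of sets
    first = set_index * ways
    return list(range(first, first + ways))


# Block 12 in an 8-frame cache, as on the slide.
direct_mapped = candidate_frames(12, num_frames=8, ways=1)
two_way = candidate_frames(12, num_frames=8, ways=2)
fully_associative = candidate_frames(12, num_frames=8, ways=8)
```

Here `direct_mapped` holds the single frame 4, `two_way` holds the two frames of set 0, and `fully_associative` holds all eight frames.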

12 Direct-Mapped Caches
[Figure: sets 0 through S-1, each holding a single line made of a valid bit, a tag, and a cache block; E = 1 line per set.]

13 Set Associative Cache
[Figure: sets 0 through S-1, each holding two lines, each line made of a valid bit, a tag, and a cache block; E = 2 lines per set.]

14 Q2: How is a block found if it is in the upper level?
- Tag on each block
  » No need to check index or block offset
- Increasing associativity shrinks index, expands tag
[Figure: address fields: the block address (tag, then index) followed by the block offset.]

15 Accessing Direct-Mapped Caches
[Figure: the set-index bits of the address select one set among set 0 ... set S-1, each holding one line (valid bit, tag, cache block). Address layout from bit m-1 down to bit 0: t bits of tag, s bits of set index, b bits of block offset.]
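The t/s/b address layout can be expressed as shifts and masks. A minimal sketch, assuming the block offset occupies the low-order b bits and the set index the next s bits; the name `split_address` and the sample address are made up for illustration.

```python
def split_address(addr, s_bits, b_bits):
    """Split an address into (tag, set_index, block_offset)."""
    block_offset = addr & ((1 << b_bits) - 1)          # low b bits
    set_index = (addr >> b_bits) & ((1 << s_bits) - 1)  # next s bits
    tag = addr >> (s_bits + b_bits)                      # remaining high bits
    return tag, set_index, block_offset


# Hypothetical 8-bit address 0b11010110 with s = 2 and b = 2.
tag, set_index, block_offset = split_address(0b11010110, s_bits=2, b_bits=2)
```

For this address the fields are tag 0b1101, set index 0b01, and block offset 0b10.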

16 Accessing Set Associative Caches
[Figure: the set-index bits of the address select one set among set 0 ... set S-1, each holding two lines (valid bit, tag, cache block). Address layout: t bits of tag, s bits of set index, b bits of block offset, ending at bit 0.]
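Once the set is selected, a hit means that some valid line in that set carries the requested tag. The sketch below models this with a hypothetical `Line` record and a `lookup` helper; real hardware compares all tags of the set in parallel, whereas this code simply scans them.

```python
from dataclasses import dataclass


@dataclass
class Line:
    valid: bool = False
    tag: int = 0


def lookup(sets, tag, set_index):
    """Return True on a hit: a valid line in the selected set matches the tag."""
    return any(line.valid and line.tag == tag for line in sets[set_index])


# Hypothetical 2-way cache with 4 sets; one line of set 1 holds tag 13.
cache_sets = [[Line(), Line()] for _ in range(4)]
cache_sets[1][1] = Line(valid=True, tag=13)
```

The valid bit matters: a line whose tag field happens to match but whose valid bit is off must not count as a hit.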

17 Q3: Which block should be replaced on a miss?
- Easy for direct mapped
- Set associative or fully associative:
  » Random
  » LRU (Least Recently Used)

Miss rates, LRU versus random (Ran) replacement:
Size     2-way LRU  2-way Ran  4-way LRU  4-way Ran  8-way LRU  8-way Ran
16 KB    5.2%       5.7%       4.7%       5.3%       4.4%       5.0%
64 KB    1.9%       2.0%       1.5%       1.7%       1.4%       1.5%
256 KB   1.15%      1.17%      1.13%      1.13%      1.12%      1.12%
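LRU replacement can be explored with a small trace-driven simulator. This is a sketch under assumptions: the trace is a list of block numbers, each set is modeled with an `OrderedDict` whose order records recency, and the function name `count_misses_lru` is hypothetical.

```python
from collections import OrderedDict


def count_misses_lru(block_trace, num_sets, ways):
    """Count misses for a set-associative cache with LRU replacement."""
    sets = [OrderedDict() for _ in range(num_sets)]
    misses = 0
    for block in block_trace:
        lines = sets[block % num_sets]
        if block in lines:
            lines.move_to_end(block)       # hit: now the most recently used
        else:
            misses += 1
            if len(lines) == ways:
                lines.popitem(last=False)  # evict the least recently used
            lines[block] = True
    return misses


# Hypothetical trace in which every block maps to set 0 of a 4-set cache.
trace = [0, 4, 0, 8, 4]
misses_two_way = count_misses_lru(trace, num_sets=4, ways=2)
misses_direct = count_misses_lru(trace, num_sets=4, ways=1)
```

On this trace the direct-mapped configuration misses on every access, while the 2-way configuration saves the second reference to block 0.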

18 Q4: What happens on a write?
- Write through: the information is written to both the block in the cache and the block in the lower-level memory.
- Write back: the information is written only to the block in the cache. The modified cache block is written to main memory only when it is replaced.
  » Is the block clean or dirty?
- Pros and cons of each?
  » WT: read misses cannot result in writes
  » WB: no repeated writes to the same location
- WT is always combined with write buffers so that the processor doesn't wait for lower-level memory
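The difference in lower-level traffic between the two policies can be seen on a toy model. The sketch below assumes a cache holding a single block, a trace made only of stores, and write-allocate on a miss; none of these assumptions come from the slides, and the name `lower_level_writes` is hypothetical.

```python
def lower_level_writes(store_trace, policy):
    """Count writes that reach lower-level memory for a one-block cache.

    store_trace lists the block number written by each store;
    policy is "write-through" or "write-back".
    """
    cached, dirty, writes = None, False, 0
    for block in store_trace:
        if block != cached:                 # miss: replace the resident block
            if policy == "write-back" and dirty:
                writes += 1                 # dirty victim is written back
            cached, dirty = block, False
        if policy == "write-through":
            writes += 1                     # every store also goes below
        else:
            dirty = True                    # only mark the cached block dirty
    return writes


# Three stores to block 7, then one store to block 9.
wt_writes = lower_level_writes([7, 7, 7, 9], "write-through")
wb_writes = lower_level_writes([7, 7, 7, 9], "write-back")
```

Write back absorbs the repeated stores to block 7 and writes it once when it is replaced; the final dirty block 9 is still in the cache and has not been written yet.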

19 Virtual Memory

20 Virtual Memory
- Use main memory as a cache for secondary storage to:
  » Allow efficient and safe sharing of memory among multiple programs
  » Provide the ability to easily run programs larger than the size of physical memory
  » Simplify loading a program for execution by providing for code relocation (i.e., the code can be loaded anywhere in main memory)
- What makes it work? Again, the Principle of Locality
  » A program is likely to access a relatively small portion of its address space during any period of time
- Each program is compiled into its own address space, a virtual address space
  » At run time each virtual address must be translated to a physical address (an address in main memory)

21 Virtual Memory Space
- A virtual address is translated to a physical address by a combination of hardware and software
- A virtual memory block is called a page, and a virtual memory miss is called a page fault
[Figure: virtual addresses mapped to physical addresses.]

22 Pages: Virtual Memory Blocks
- A virtual page can reside either in main memory or on disk
- Page fault: the data is not in memory, so it must be retrieved from disk
  » The miss penalty is huge, so pages should be fairly large (e.g., 4 KB)
  » Reducing page faults is important (LRU)
  » The faults can be handled in software instead of hardware
  » Write-through is too expensive, so we use write-back

23 Address Translation
[Figure: the virtual address (VA) consists of a virtual page number and a page offset; translation replaces the virtual page number with a physical page number, and the page offset is carried over unchanged to form the physical address (PA).]
- So each memory request first requires an address translation from the virtual space to the physical space

24 Page Tables
- If a virtual page can be mapped to any physical page: fully associative placement of pages
  » The OS can choose to replace any page when a page fault occurs
  » But it needs a clever and flexible replacement scheme to reduce the page fault rate
  » And a full search is impractical!
- A page table is used to index the memory
  » It is indexed by the virtual page number
  » Each entry contains the physical page number for that virtual page
  » It is stored in memory
  » Each program has its own page table
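Translation through a page table amounts to swapping the virtual page number for a physical page number while keeping the page offset. A minimal sketch, assuming 4 KB pages (12 offset bits) and modeling the page table as a Python dict keyed by virtual page number; `translate`, `PageFault`, and the entry layout are hypothetical names, not an actual OS interface.

```python
PAGE_OFFSET_BITS = 12  # assumption: 4 KB pages


class PageFault(Exception):
    """Raised when the virtual page is not present in main memory."""


def translate(virtual_address, page_table):
    """Translate a virtual address to a physical address."""
    vpn = virtual_address >> PAGE_OFFSET_BITS
    offset = virtual_address & ((1 << PAGE_OFFSET_BITS) - 1)
    entry = page_table.get(vpn)  # the table is indexed by virtual page number
    if entry is None or not entry["valid"]:
        raise PageFault(vpn)
    return (entry["ppn"] << PAGE_OFFSET_BITS) | offset


# Hypothetical mapping: virtual page 2 lives in physical page 5.
table = {2: {"valid": True, "ppn": 5}}
physical = translate(0x2ABC, table)
```

Because the table is indexed directly, no search over physical pages is needed even though placement is fully associative.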

25 Address Translation Mechanisms
[Figure: the virtual page number indexes the page table (in physical memory); each entry has a valid bit (V) and a physical page number, which is combined with the offset to address physical memory; entries for non-resident pages point to disk storage.]
- The page table maps each page in virtual memory to EITHER a page in main memory OR a page stored on disk (the next level of the hierarchy)

26 Virtual Addressing with a Cache
- It takes an extra memory access to translate a VA to a PA
[Figure: the CPU issues a VA; translation produces the PA used to access the cache; a hit returns data to the CPU, a miss goes to main memory.]
- This makes memory (cache) accesses very expensive; this is the hit time that defines the clock cycle time
- The fix is to use a Translation Lookaside Buffer (TLB): a small cache that keeps track of recently used address mappings to avoid a page table access

27 Translation-lookaside buffer
- A memory access by a program = two memory accesses:
  » Memory access to obtain the physical address
  » Memory access to get the data
- To improve access performance:
  » Rely on spatial and temporal locality of the page
  » Keep track of recent translations in a translation-lookaside buffer
- Translation-lookaside buffer (TLB): a cache containing a subset of the virtual-to-physical page mappings for pages that are already in memory

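28 Making Address Translation Fast
[Figure: the virtual page number is first looked up in the TLB (valid bit, tag, physical page number); the full page table (valid bit, physical page number) resides in physical memory, and its entries point either into physical memory or to disk storage.]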

29 Bits
- Valid bit: if it is off, a page fault occurs
- Reference bit: set whenever a page is accessed. Used to implement an LRU (least recently used) scheme
- Dirty bit: tracks whether a page has been written since it was read into memory
  » If the OS decides to replace the page, the bit tells whether the page has to be written to disk before its location is given to another page
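One common way to approximate LRU with a single reference bit is the clock (second-chance) algorithm. The slides do not say which approximation is meant, so the sketch below is only one possible reading; `choose_victim` and the dict-based page records are hypothetical.

```python
def choose_victim(pages, hand):
    """Clock (second-chance) victim selection.

    pages is a list of dicts with "ref" and "dirty" flags; hand is the
    current clock position. Returns (victim_index, new_hand).
    """
    while True:
        if pages[hand]["ref"]:
            pages[hand]["ref"] = False       # recently used: give a second chance
            hand = (hand + 1) % len(pages)
        else:
            return hand, (hand + 1) % len(pages)


# Hypothetical resident pages.
resident = [
    {"ref": True, "dirty": False},
    {"ref": False, "dirty": True},
    {"ref": True, "dirty": False},
]
victim, next_hand = choose_victim(resident, hand=0)
must_write_to_disk = resident[victim]["dirty"]  # dirty pages are saved first
```

Page 0 is spared because its reference bit was set; page 1 is chosen, and since its dirty bit is on it has to be written to disk before the frame is reused.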

30 A TLB in the Memory Hierarchy
[Figure: the CPU sends the VA to the TLB lookup; a TLB hit yields the PA used to access the cache (a cache hit returns data, a cache miss goes to main memory); a TLB miss goes to translation.]
- On a TLB miss: is it a page fault or merely a TLB miss?
  » If the page exists in memory, the TLB miss can be handled (in hardware or software) by loading the translation information from the page table into the TLB
  » If the page is not in memory, then it's a true page fault
- TLB misses are much more frequent than true page faults
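The distinction between a plain TLB miss and a true page fault can be sketched as follows. The TLB and page table are modeled as Python dicts, and `access_translation` with its return strings is a hypothetical helper, not how any particular machine handles misses.

```python
def access_translation(vpn, tlb, page_table):
    """Classify a translation request and return (outcome, ppn)."""
    if vpn in tlb:
        return "tlb hit", tlb[vpn]
    entry = page_table.get(vpn)
    if entry is not None and entry["valid"]:
        tlb[vpn] = entry["ppn"]      # refill the TLB from the page table
        return "tlb miss", entry["ppn"]
    return "page fault", None        # page is not in memory


# Hypothetical state: page 2 is resident, page 3 is on disk.
tlb = {}
page_table = {2: {"valid": True, "ppn": 5}, 3: {"valid": False, "ppn": None}}
first = access_translation(2, tlb, page_table)
second = access_translation(2, tlb, page_table)
third = access_translation(3, tlb, page_table)
```

The first access to page 2 misses in the TLB and refills it, the second hits, and the access to page 3 is a true page fault.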


32 Some Virtual Memory Design Parameters

                        Paged VM                      TLBs
Total size (blocks)     16,000 to 250,000             16 to 512
Total size (KB)         250,000 to 1,000,000,000      0.25 to 16
Block size (B)          4000 to 64,000                4 to 32
Miss penalty (clocks)   10,000,000 to 100,000,000     10 to 1000
Miss rates              0.00001% to 0.0001%           0.01% to 2%

33 Two Machines' Cache Parameters: TLB organization
- Intel P4
  » 1 TLB for instructions and 1 TLB for data
  » Both 4-way set associative
  » Both use ~LRU replacement
  » Both have 128 entries
  » TLB misses handled in hardware
- AMD Opteron
  » 2 TLBs for instructions and 2 TLBs for data
  » Both L1 TLBs fully associative with ~LRU replacement
  » Both L2 TLBs are 4-way set associative with round-robin LRU
  » Both L1 TLBs have 40 entries
  » Both L2 TLBs have 512 entries
  » TLB misses handled in hardware

34 TLB Event Combinations

TLB    Page Table  Cache  Possible? Under what circumstances?
Hit    Hit         Hit    Yes: what we want!
Hit    Hit         Miss   Yes: although the page table is not checked if the TLB hits
Miss   Hit         Hit    Yes: TLB miss, PA in page table
Miss   Hit         Miss   Yes: TLB miss, PA in page table, but data not in cache
Miss   Miss        Miss   Yes: page fault
Hit    Miss        Miss   Impossible: TLB translation not possible if page is not present in memory
Hit    Miss        Hit    Impossible: TLB translation not possible if page is not present in memory
Miss   Miss        Hit    Impossible: data not allowed in cache if page is not in memory
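The eight rows reduce to one rule: if the page is not in memory (page table miss), neither the TLB nor the cache can hit. The sketch below encodes that rule; the function name `combination_possible` is hypothetical.

```python
from itertools import product


def combination_possible(tlb_hit, page_table_hit, cache_hit):
    """Return True if the (TLB, page table, cache) outcome can occur."""
    if not page_table_hit:
        # Page not in memory: no valid TLB entry and no cached data.
        return not tlb_hit and not cache_hit
    return True


# Enumerate all eight combinations (True = hit, False = miss).
possible = [combo for combo in product([True, False], repeat=3)
            if combination_possible(*combo)]
```

Enumerating all combinations leaves the same five possible cases as the table.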

35 Why Not a Virtually Addressed Cache?
- A virtually addressed cache would only require address translation on cache misses
[Figure: the CPU sends the VA directly to the cache; a hit returns data; a miss goes through translation to a PA and then to main memory.]
- But two different virtual addresses can map to the same physical address (processes sharing data), i.e., two different cache entries hold data for the same physical address: synonyms
  » Must update all cache entries with the same physical address or the memory becomes inconsistent

36 The Hardware/Software Boundary
- What parts of the virtual-to-physical address translation are done by or assisted by the hardware?
- Translation Lookaside Buffer (TLB) that caches the recent translations
  » TLB access time is part of the cache hit time
  » Can allot an extra stage in the pipeline for TLB access
- Page table storage, fault detection and updating
  » Page faults result in interrupts (precise) that are then handled by the OS
  » Hardware must support (i.e., update appropriately) Dirty and Reference bits (e.g., ~LRU) in the page tables
- Disk placement
  » Bootstrap (e.g., out of disk sector 0) so the system can service a limited number of page faults before the OS is even loaded

37 Summary
- The Principle of Locality: a program is likely to access a relatively small portion of the address space at any instant of time.
  » Temporal locality: locality in time
  » Spatial locality: locality in space
- Caches, TLBs, and virtual memory are all understood by examining how they deal with the four questions:
  1) Where can a block be placed?
  2) How is a block found?
  3) What block is replaced on a miss?
  4) How are writes handled?
- Page tables map virtual addresses to physical addresses
  » TLBs are important for fast translation

38 Deadlines

Week  Date    Item                                                                             Reading
10    Nov 6   Lec14 - Review Cache and Review Virtual Memory                                   Chap 4
11    Nov 10  Homework 3 due
11    Nov 11  Lec15 - Multiprocessors and Thread-Level Parallelism; Symmetric Shared Memory
11    Nov 13  Lec16 - Distributed Shared Memory
12    Nov 18  No class                                                                         Chap 5
12    Nov 20  Lec17 - Homework 3 review
13    Nov 24  Homework 4 due
13    Nov 25  Lec18 - Synchronization
13    Nov 27  Thanksgiving Holiday
14    Dec 2   Lec19 - Homework 4 review
14    Dec 4   Lec20 - Memory Technology; Virtual Memory and Virtual Machine
15    Dec 9   Lec21 - Design of Memory Hierarchy
