Chapter 5 Large and Fast: Exploiting Memory Hierarchy (Part 2)


Emery Jennings
5 years ago

王振傑 (Chen-Chieh Wang)
ccwang@mail.ee.ncku.edu.tw
Department of Electrical Engineering, Feng-Chia University

Outline
- 5.4 Virtual Memory
- 5.5 A Common Framework for Memory Hierarchies
- 5.6 Virtual Machines
- 5.8 Cache Coherence
- 5.12 Concluding Remarks

Virtual Memory
- Main memory can act as a cache for the secondary storage (disk)
- Advantages:
  - illusion of having more physical memory
  - program relocation
  - protection

[Figure: Translation Using a Page Table]


Page Tables
- Stores placement information
  - Array of page table entries (PTEs), indexed by virtual page number
  - Page table register in CPU points to page table in physical memory
- If page is present in memory
  - PTE stores the physical page number
  - Plus other status bits (referenced, dirty, ...)
- If page is not present
  - PTE can refer to location in swap space on disk

[Figure: Mapping Pages to Storage]
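The lookup described on the Page Tables slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: a single-level table held in a list, 4 KB pages, and the hypothetical names `PTE`, `PageFault`, and `translate`, none of which come from the slides.

```python
PAGE_SIZE = 4096          # assumed page size (4 KB)
OFFSET_BITS = 12          # log2(PAGE_SIZE)


class PageFault(Exception):
    """Raised when the referenced page is not present in physical memory."""


class PTE:
    """One page table entry (field names are illustrative)."""

    def __init__(self, valid=False, ppn=None, disk_loc=None):
        self.valid = valid          # page is present in memory
        self.ppn = ppn              # physical page number, if present
        self.disk_loc = disk_loc    # location in swap space, if not present
        self.referenced = False     # status bit: set on any access
        self.dirty = False          # status bit: set on a write


def translate(page_table, vaddr, is_write=False):
    """Translate a virtual address using a table indexed by virtual page number."""
    vpn = vaddr >> OFFSET_BITS
    offset = vaddr & (PAGE_SIZE - 1)
    pte = page_table[vpn]
    if not pte.valid:
        # The PTE only refers to a disk location; the OS has to bring the page in.
        raise PageFault(f"virtual page {vpn} is at disk location {pte.disk_loc}")
    pte.referenced = True
    if is_write:
        pte.dirty = True
    return (pte.ppn << OFFSET_BITS) | offset
```

For example, with virtual page 0 mapped to physical page 5, `translate(table, 0x0ABC)` returns `0x5ABC`: the page offset is copied unchanged and only the page number is replaced.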

Page Faults
- Page fault: the data is not in memory, so it must be retrieved from disk
  - Huge miss penalty, thus pages should be fairly large (e.g., 4KB)
  - Faults can be handled in software instead of hardware

Replacement and Writes
- To reduce page fault rate, prefer least-recently used (LRU) replacement
  - Reference bit (aka use bit) in PTE set to 1 on access to page
  - Periodically cleared to 0 by OS
  - A page with reference bit = 0 has not been used recently
- Disk writes take millions of cycles
  - Block at once, not individual locations
  - Write-through is impractical
  - Use write-back
  - Dirty bit in PTE set when page is written
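A minimal sketch of how the reference and dirty bits might drive replacement and write-back. The policy shown (evict the first page whose reference bit is 0, fall back to any page) is an assumed approximation of LRU, and `PageInfo`, `access`, `clear_reference_bits`, and `evict` are hypothetical names chosen for the illustration.

```python
class PageInfo:
    """Status bits kept per resident page (illustrative)."""

    def __init__(self):
        self.referenced = False
        self.dirty = False


def access(pages, vpn, is_write=False):
    """Record an access: set the reference bit, and the dirty bit on a write."""
    page = pages[vpn]
    page.referenced = True
    if is_write:
        page.dirty = True


def clear_reference_bits(pages):
    """What the OS does periodically so that stale pages become visible."""
    for page in pages.values():
        page.referenced = False


def evict(pages, disk_writes):
    """Pick a victim, preferring a page with reference bit 0.

    A dirty victim is written back to disk as a whole page (write-back);
    a clean victim is simply dropped.
    """
    victim = next((v for v, p in pages.items() if not p.referenced), None)
    if victim is None:
        victim = next(iter(pages))   # every page was used recently
    page = pages.pop(victim)
    if page.dirty:
        disk_writes.append(victim)
    return victim
```

Note that clearing the reference bits leaves the dirty bits alone: a page that was written long ago still has to be written back when it is finally evicted.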


Making Address Translation Fast
- A cache for address translations: Translation Lookaside Buffer (TLB)
- Typical values (the entry count and the miss penalty in cycles are missing in the source): miss rate 0.01% - 1%

TLB Misses
- If page is in memory
  - Load the PTE from memory and retry
  - Could be handled in hardware
    - Can get complex for more complicated page table structures
  - Or in software
    - Raise a special exception, with optimized handler
- If page is not in memory (page fault)
  - OS handles fetching the page and updating the page table
  - Then restart the faulting instruction
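The two miss cases can be illustrated with a small TLB model. This is a sketch under assumptions: the TLB is fully associative with LRU replacement, the page table is a plain dictionary from virtual to physical page numbers (a missing key stands for "page not in memory"), and the names `TLB`, `lookup`, and `PageFault` are hypothetical.

```python
from collections import OrderedDict


class PageFault(Exception):
    """The page is not in memory; the OS must fetch it and update the page table."""


class TLB:
    """A tiny fully associative TLB with LRU replacement (illustrative)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()   # vpn -> ppn, least recently used first
        self.hits = 0
        self.misses = 0

    def lookup(self, vpn, page_table):
        if vpn in self.entries:
            self.hits += 1
            self.entries.move_to_end(vpn)      # mark as most recently used
            return self.entries[vpn]
        # TLB miss: load the PTE from the page table in memory.
        self.misses += 1
        ppn = page_table.get(vpn)
        if ppn is None:
            raise PageFault(vpn)               # not a mere TLB miss
        if len(self.entries) >= self.capacity:
            self.entries.popitem(last=False)   # evict the LRU translation
        self.entries[vpn] = ppn
        return ppn
```

After a `PageFault`, the caller would install the page, add the mapping to `page_table`, and call `lookup` again, which mirrors restarting the faulting instruction.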

TLBs and Caches
- Physically indexed and physically tagged
- Virtually indexed and virtually tagged
- Virtually indexed but physically tagged

[Figure: TLBs and Caches]

Questions for Memory Hierarchy
- Q1: Where can a block be placed? (Block placement)
  - One place (direct mapped)
  - A few places (set associative)
  - Any place (fully associative)
- Q2: How is a block found? (Block identification)
  - Indexing (as in a direct-mapped cache)
  - Limited search (as in a set-associative cache)
  - Full search (as in a fully associative cache)
  - A separate lookup table (as in a page table)
- Q3: Which block should be replaced on a miss? (Block replacement)
  - Random
  - Least Recently Used (LRU)
  - First In, First Out (FIFO)
- Q4: What happens on a write? (Write strategy)
  - Write-through
  - Write-back

3Cs (3 categories for the sources of misses)
- Compulsory misses
  - First access to a block
  - Also called cold-start misses
- Capacity misses
  - Caused when the cache cannot contain all the blocks needed during execution of a program
  - Occur when blocks are replaced and then later retrieved
- Conflict misses
  - Occur in set-associative or direct-mapped caches
  - Multiple blocks compete for the same set
  - Also called collision misses

[Figure: Data Cache Miss Rates. Miss rate (0 to 15%) versus associativity (one-way, two-way, four-way, eight-way), one curve per cache size from 1 KB to 128 KB]
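One common way to tell the three categories apart is to run the same trace through a fully associative cache of equal capacity. The sketch below takes that approach; it assumes LRU replacement in both caches, a trace of block addresses rather than byte addresses, and a hypothetical function name `classify_misses`.

```python
from collections import OrderedDict


def classify_misses(trace, num_blocks, ways):
    """Count compulsory, capacity, and conflict misses for a block-address trace."""
    num_sets = num_blocks // ways
    sets = [OrderedDict() for _ in range(num_sets)]   # cache under study, LRU per set
    fully_assoc = OrderedDict()                       # same capacity, fully associative
    seen = set()                                      # every block referenced so far
    counts = {"compulsory": 0, "capacity": 0, "conflict": 0}

    for block in trace:
        # Reference model: would a fully associative cache have hit?
        fa_hit = block in fully_assoc
        if fa_hit:
            fully_assoc.move_to_end(block)
        else:
            if len(fully_assoc) >= num_blocks:
                fully_assoc.popitem(last=False)
            fully_assoc[block] = True

        # Cache under study: the set is chosen by the low bits of the block address.
        cache_set = sets[block % num_sets]
        if block in cache_set:
            cache_set.move_to_end(block)
        else:
            if block not in seen:
                counts["compulsory"] += 1     # first access to this block
            elif not fa_hit:
                counts["capacity"] += 1       # would miss even with full associativity
            else:
                counts["conflict"] += 1       # only the set mapping caused the miss
            if len(cache_set) >= ways:
                cache_set.popitem(last=False)
            cache_set[block] = True
        seen.add(block)
    return counts
```

With four blocks and the trace `[0, 4, 0, 4]`, a direct-mapped cache (`ways=1`) sees two compulsory and two conflict misses because blocks 0 and 4 share a set, while a fully associative cache (`ways=4`) sees only the two compulsory misses.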

Cache Design Trade-offs

Design change          | Effect on miss rate        | Negative performance effect
Increase cache size    | Decrease capacity misses   | May increase access time
Increase associativity | Decrease conflict misses   | May increase access time
Increase block size    | Decrease compulsory misses | Increases miss penalty. For very large block size, may increase miss rate due to pollution.

Virtual Machines
- Host computer emulates guest operating system and machine resources
  - Improved isolation of multiple guests
  - Avoids security and reliability problems
  - Aids sharing of resources
- Virtualization has some performance impact
  - Feasible with modern high-performance computers
- Examples
  - IBM VM/370 (1970s technology!)
  - VMWare
  - Microsoft Virtual PC

Virtual Machine Monitor
- Maps virtual resources to physical resources
  - Memory, I/O devices, CPUs
- Guest code runs on native machine in user mode
  - Traps to VMM on privileged instructions and access to protected resources
- Guest OS may be different from host OS
- VMM handles real I/O devices
  - Emulates generic virtual I/O devices for guest

Example: Timer Virtualization
- In native machine, on timer interrupt
  - OS suspends current process, handles interrupt, selects and resumes next process
- With Virtual Machine Monitor
  - VMM suspends current VM, handles interrupt, selects and resumes next VM
- If a VM requires timer interrupts
  - VMM emulates a virtual timer
  - Emulates interrupt for VM when physical timer interrupt occurs

Instruction Set Support
- User and System modes
- Privileged instructions only available in system mode
  - Trap to system if executed in user mode
- All physical resources only accessible using privileged instructions
  - Including page tables, interrupt controls, I/O registers
- Renaissance of virtualization support
  - Current ISAs (e.g., x86) adapting
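The timer example can be sketched as a toy scheduler. Everything here is an assumption made for illustration: the VMM switches VMs round-robin on every physical timer interrupt, it delivers one emulated interrupt per tick to each VM that asked for a timer, and the class and method names (`VM`, `VMM`, `on_physical_timer_interrupt`) are hypothetical.

```python
class VM:
    """A guest that may or may not have requested timer interrupts."""

    def __init__(self, name, wants_timer=False):
        self.name = name
        self.wants_timer = wants_timer
        self.virtual_interrupts = 0   # emulated interrupts delivered so far

    def deliver_timer_interrupt(self):
        self.virtual_interrupts += 1


class VMM:
    """Toy virtual machine monitor that owns the physical timer."""

    def __init__(self, vms):
        self.vms = vms
        self.current = 0
        self.schedule_log = [vms[0].name]   # which VM ran, in order

    def on_physical_timer_interrupt(self):
        # The VMM handles the real interrupt and emulates a virtual
        # timer interrupt for each guest that requires one.
        for vm in self.vms:
            if vm.wants_timer:
                vm.deliver_timer_interrupt()
        # Suspend the current VM, then select and resume the next one.
        self.current = (self.current + 1) % len(self.vms)
        resumed = self.vms[self.current]
        self.schedule_log.append(resumed.name)
        return resumed
```

The guest never touches the physical timer. It only observes the emulated interrupts, which is the same pattern the VMM uses for other protected resources.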


Cache Coherence Problem
- Suppose two CPU cores share a physical address space
  - Write-through caches

Time step | Event               | CPU A's cache | CPU B's cache | Memory
0         |                     |               |               | 0
1         | CPU A reads X       | 0             |               | 0
2         | CPU B reads X       | 0             | 0             | 0
3         | CPU A writes 1 to X | 1             | 0             | 1
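The three events in this example can be replayed with two private write-through caches and no coherence mechanism at all. This is a minimal sketch: the class name `WriteThroughCache` is hypothetical, a cache line holds a single value, and X is assumed to start at 0.

```python
class WriteThroughCache:
    """Private cache with write-through and no coherence support."""

    def __init__(self, memory):
        self.memory = memory   # shared dictionary: address -> value
        self.lines = {}        # this cache's private copies

    def read(self, addr):
        if addr not in self.lines:            # miss: fetch from memory
            self.lines[addr] = self.memory[addr]
        return self.lines[addr]               # hit: may return a stale copy

    def write(self, addr, value):
        self.lines[addr] = value
        self.memory[addr] = value             # write-through to memory


memory = {"X": 0}
cache_a = WriteThroughCache(memory)
cache_b = WriteThroughCache(memory)

cache_a.read("X")                # step 1: CPU A reads X
cache_b.read("X")                # step 2: CPU B reads X
cache_a.write("X", 1)            # step 3: CPU A writes 1 to X
stale_value = cache_b.read("X")  # CPU B still hits on its old copy
```

Memory and CPU A's cache both hold 1 after step 3, but `stale_value` is 0 because nothing told CPU B's cache that its copy is out of date.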

Coherence Defined
- Informally: Reads return most recently written value
- Formally:
  - P writes X; P reads X (no intervening writes)
    - read returns written value
  - P1 writes X; P2 reads X (sufficiently later)
    - read returns written value
    - cf. CPU B reading X after step 3 in example
  - P1 writes X, P2 writes X
    - all processors see writes in the same order
    - End up with the same final value for X

Cache Coherence Protocols
- Operations performed by caches in multiprocessors to ensure coherence
  - Migration of data to local caches
    - Reduces bandwidth for shared memory
  - Replication of read-shared data
    - Reduces contention for access
- Snooping protocols
  - Each cache monitors bus reads/writes
- Directory-based protocols
  - Caches and memory record sharing status of blocks in a directory

Invalidating Snooping Protocols
- Cache gets exclusive access to a block when it is to be written
  - Broadcasts an invalidate message on the bus
  - Subsequent read in another cache misses
    - Owning cache supplies updated value

CPU activity        | Bus activity     | CPU A's cache | CPU B's cache | Memory
                    |                  |               |               | 0
CPU A reads X       | Cache miss for X | 0             |               | 0
CPU B reads X       | Cache miss for X | 0             | 0             | 0
CPU A writes 1 to X | Invalidate for X | 1             |               | 0
CPU B reads X       | Cache miss for X | 1             | 1             | 1
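The bus traffic in this example can be reproduced with a small write-invalidate model. It is a sketch under assumptions, not a full protocol: caches are write-back, a block holds a single value, the owning cache updates memory when it supplies data, and the names `Bus` and `SnoopingCache` are hypothetical.

```python
class Bus:
    """Shared bus that every cache snoops (illustrative)."""

    def __init__(self, memory):
        self.memory = memory
        self.caches = []
        self.log = []                          # bus activity, in order

    def read_miss(self, requester, addr):
        self.log.append(f"Cache miss for {addr}")
        for cache in self.caches:
            if cache is not requester and addr in cache.dirty:
                value = cache.lines[addr]      # owning cache supplies the value
                self.memory[addr] = value      # memory is updated as well
                cache.dirty.discard(addr)
                return value
        return self.memory[addr]

    def invalidate(self, requester, addr):
        self.log.append(f"Invalidate for {addr}")
        for cache in self.caches:
            if cache is not requester:
                cache.lines.pop(addr, None)    # drop any other copy
                cache.dirty.discard(addr)


class SnoopingCache:
    """Write-back cache using write-invalidate."""

    def __init__(self, bus):
        self.bus = bus
        self.lines = {}
        self.dirty = set()                     # blocks this cache owns exclusively
        bus.caches.append(self)

    def read(self, addr):
        if addr not in self.lines:
            self.lines[addr] = self.bus.read_miss(self, addr)
        return self.lines[addr]

    def write(self, addr, value):
        if addr not in self.dirty:             # gain exclusive access first
            self.bus.invalidate(self, addr)
        self.lines[addr] = value
        self.dirty.add(addr)


memory = {"X": 0}
bus = Bus(memory)
cache_a, cache_b = SnoopingCache(bus), SnoopingCache(bus)
cache_a.read("X")
cache_b.read("X")
cache_a.write("X", 1)
value_seen_by_b = cache_b.read("X")
```

Unlike the write-through example without coherence, CPU B's last read misses because its copy was invalidated, so `value_seen_by_b` is the updated value 1.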

Concluding Remarks
- Fast memories are small, large memories are slow
  - We really want fast, large memories
  - Caching gives this illusion
- Principle of locality
  - Programs use a small part of their memory space frequently
- Memory hierarchy
  - L1 cache, L2 cache, DRAM memory, disk
- Memory system design is critical for multiprocessors

Modern Systems
- Things are getting complicated!

[Figure: modern system block diagram; surviving labels: CPU, L2 Cache]

Multilevel On-Chip Caches
- Intel Nehalem 4-core processor
- Per core: 32KB L1 I-cache, 32KB L1 D-cache, 256KB L2 cache

Some Issues for Research
- Processor speeds continue to increase very fast
  - much faster than either DRAM or disk access times
- Design challenge: dealing with this growing disparity
  - Prefetching?
  - Third-level caches and more?
  - Memory design?
- Memory system design for multiprocessors


Caches: What Are They For?

For computers, memory accesses are like going to the library (load word 0x02009AD0), finding the necessary information in the page of a book (0x0CA829F0), and going back home to do the work involving that information.

While computers don't mind going back and forth like this for data, it usually means users have to do a lot of waiting.

Fortunately for users, computers have caches, which are the equivalent of keeping copies of the books needed on a shelf near the workspace. Through a number of mechanisms, caches give the illusion of being able to access memory very quickly!

Cache Associativity

Just as bookshelves come in different shapes and sizes, caches can also take on a variety of forms and capacities. But no matter how large or small they are, caches fall into one of three categories: direct mapped, n-way set associative, and fully associative.

Direct Mapped (memory address fields: Tag, Index, Offset)
A cache block can only go in one spot in the cache. This makes a cache block very easy to find, but it's not very flexible about where to put the blocks.

2-Way Set Associative (Tag, Index, Offset)
This cache is made up of sets that can fit two blocks each. The index is now used to find the set, and the tag helps find the block within the set.

4-Way Set Associative (Tag, Index, Offset)
Each set here fits four blocks, so there are fewer sets. As such, fewer index bits are needed.

Fully Associative (Tag, Offset)
No index is needed, since a cache block can go anywhere in the cache. Every tag must be compared when finding a block in the cache, but block placement is very flexible!

"They all look set associative to me..."
"That's because they are! The direct mapped cache is just a 1-way set associative cache, and a fully associative cache of m blocks is an m-way set associative cache!" (In the figure, m = 8.)
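How the tag, index, and offset fields shift as associativity grows can be worked out directly. The sketch below assumes power-of-two sizes and a 16-byte block; the function name `split_address` and the sample address are illustrative only.

```python
def split_address(addr, num_blocks, ways, block_size):
    """Split an address into (tag, index, offset) for a cache of num_blocks blocks.

    Assumes num_blocks, ways, and block_size are all powers of two.
    """
    num_sets = num_blocks // ways
    offset_bits = block_size.bit_length() - 1   # log2(block_size)
    index_bits = num_sets.bit_length() - 1      # log2(num_sets); 0 if fully associative
    offset = addr & (block_size - 1)
    index = (addr >> offset_bits) & (num_sets - 1)
    tag = addr >> (offset_bits + index_bits)
    return tag, index, offset


# The same address in an 8-block cache at each associativity.
for ways in (1, 2, 4, 8):
    tag, index, offset = split_address(0xB4, num_blocks=8, ways=ways, block_size=16)
    print(f"{ways}-way: tag={tag:#x} index={index} offset={offset}")
```

Each doubling of the associativity halves the number of sets, so one bit moves from the index to the tag. At 8 ways there is a single set and the index field disappears, which is the fully associative case.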

Cache Misses: When you just can't find what you're looking for...

Sometimes, the cache doesn't have the memory block the computer's looking for. When this happens, it's called a cache miss. There are three causes of cache misses. Just remember the three C's:

Compulsory
Compulsory misses happen when a block is referenced for the first time. The computer can't get a block that doesn't exist yet!

Capacity
The block is not in the cache because there is no space in the cache for it. Caches are of finite size, after all.

Conflict
These types of misses happen only in direct-mapped and set-associative caches. Multiple blocks can be mapped to a set, forcing evictions when the set is full.
