
EEC 581 Computer Architecture
Memory Hierarchy Design (II)
Department of Electrical Engineering and Computer Science, Cleveland State University

Topics to be covered: cache penalty reduction techniques (victim cache, assist cache, non-blocking cache), data prefetch mechanisms, and virtual memory.

3Cs Absolute Miss Rate (SPEC92)
Compulsory misses are a tiny fraction of the overall misses. Capacity misses decrease with increasing cache size. Conflict misses decrease with increasing associativity.
[Figure: absolute miss rate vs. cache size (1 KB to 128 KB) for 1-way through 8-way set-associative caches, decomposed into compulsory, capacity, and conflict components.]

2:1 Cache Rule
Miss rate of a direct-mapped cache of size X ~= miss rate of a 2-way set-associative cache of size X/2.
[Figure: the same miss-rate breakdown, illustrating the 2:1 rule.]

3Cs Relative Miss Rate
[Figure: relative miss rate (0-100%) vs. cache size (1 KB to 128 KB) for 1-way through 8-way associativity, split into compulsory, capacity, and conflict components.] Caveat: fixed block size.

Victim Caching [Jouppi 90]
[Figure: victim cache organization - processor, L1, victim cache (VC), memory.]
A victim cache (VC) is a small, fully associative structure, especially effective with direct-mapped L1 caches. Whenever a line is displaced from the L1 cache, it is loaded into the VC. The processor checks both L1 and the VC simultaneously; if L1 misses and the VC hits, the lines are swapped between the VC and L1. When a line has to be evicted from the VC, it is written back to memory.
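The swap behavior described above can be sketched in a few lines of C. This is a simplified software model, not the Jouppi hardware: the cache sizes, the 64-byte line, and the trivial VC replacement policy are all illustrative assumptions.

    #include <stdbool.h>
    #include <stdint.h>

    #define L1_SETS    1024          /* direct-mapped L1 (assumed size) */
    #define VC_ENTRIES 4             /* small, fully associative victim cache */
    #define LINE_SHIFT 6             /* 64-byte lines (assumed) */

    /* For simplicity both structures store the full block address;
       real hardware would store only the tag in L1. */
    struct line { bool valid; uint64_t block; };

    static struct line l1[L1_SETS];
    static struct line vc[VC_ENTRIES];

    /* Returns true on a hit in either the L1 or the victim cache. */
    bool cache_access(uint64_t addr)
    {
        uint64_t block = addr >> LINE_SHIFT;
        unsigned set   = block % L1_SETS;

        if (l1[set].valid && l1[set].block == block)
            return true;                          /* L1 hit */

        for (int i = 0; i < VC_ENTRIES; i++) {
            if (vc[i].valid && vc[i].block == block) {
                struct line tmp = l1[set];        /* L1 miss, VC hit: swap the lines */
                l1[set] = vc[i];
                vc[i]   = tmp;
                return true;
            }
        }

        /* Miss in both: the displaced L1 line moves into the VC (whose victim
           would be written back to memory), and the requested block fills L1. */
        vc[0] = l1[set];                          /* VC replacement policy omitted */
        l1[set].valid = true;
        l1[set].block = block;
        return false;
    }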

[Figure: percentage of conflict misses removed by victim caching, for the D-cache and the I-cache.]

Assist Cache [Chan et al. 96]
[Figure: assist cache organization - processor, L1, assist cache (AC), memory.]
The on-chip assist cache avoids thrashing in the main (off-chip) L1 cache; both run at full speed. It is a 64-entry x 32-byte fully associative CAM. Data enters the assist cache on a miss (FIFO replacement within the assist cache) and is conditionally moved to L1 or back to memory on eviction. Lines brought in by spatial-locality-hint instructions are flushed back to memory rather than promoted to L1, reducing pollution.

Multi-lateral Cache Architecture
[Figure: a fully connected multi-lateral cache architecture - processor core, caches A and B, memory.]
Most cache architectures can be generalized into this form.

Cache Architecture Taxonomy
[Figure: taxonomy of cache organizations as specializations of the multi-lateral form - the general description, single-level cache, two-level cache, victim cache, assist cache, and the NTS and PCS caches.]

Non-blocking (Lockup-Free) Cache [Kroft 81]
Prevents the pipeline from stalling on cache misses: the cache continues to serve hits to other lines while servicing a miss on one or more lines.
Uses Miss Status Handling Registers (MSHRs) to track cache misses, allocating one entry per outstanding miss (called a fill buffer in the Intel P6 family). Each new cache miss is checked against the MSHRs; the pipeline stalls on a miss only when all MSHRs are full. The number of MSHR entries is chosen to match the sustainable bus bandwidth.

Bus Utilization (MSHR = 2)
[Figure: timeline of five misses (m1-m5) on the memory bus, showing the lead-off latency, initiation interval, 4-chunk data transfers, bus idle time, memory bus utilization, and a stall caused by having only two MSHRs.]
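A rough C model of the MSHR bookkeeping just described: one entry is allocated per outstanding miss, a new miss is first checked against existing entries, and the pipeline stalls only when every entry is busy. The entry count and field names are illustrative assumptions.

    #include <stdbool.h>
    #include <stdint.h>

    #define NUM_MSHR 4                       /* chosen to match sustainable bus bandwidth */

    struct mshr { bool valid; uint64_t block; };
    static struct mshr mshrs[NUM_MSHR];

    /* Returns false if the pipeline must stall (all MSHRs busy),
       true if the miss was recorded (new entry, or merged with an existing one). */
    bool handle_miss(uint64_t block)
    {
        int free_slot = -1;
        for (int i = 0; i < NUM_MSHR; i++) {
            if (mshrs[i].valid && mshrs[i].block == block)
                return true;                 /* secondary miss: merge, no new request */
            if (!mshrs[i].valid && free_slot < 0)
                free_slot = i;
        }
        if (free_slot < 0)
            return false;                    /* MSHRs full: stall */
        mshrs[free_slot].valid = true;       /* primary miss: allocate and go to memory */
        mshrs[free_slot].block = block;
        return true;
    }

    /* Called when the fill for `block` returns from memory. */
    void fill_complete(uint64_t block)
    {
        for (int i = 0; i < NUM_MSHR; i++)
            if (mshrs[i].valid && mshrs[i].block == block)
                mshrs[i].valid = false;
    }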

Bus Utilization (MSHR = 4)
[Figure: the same miss stream with four MSHRs, showing bus idle time, data transfers, and the resulting memory bus utilization.]

Sample question
What is the major difference between the CDC 6600's scoreboarding algorithm and the IBM 360/91's Tomasulo algorithm? (One sentence)
Why did the IBM 360/91 implement Tomasulo's algorithm only in the floating-point unit and not in the integer unit? (One sentence)
What are the two main functions of a ReOrder Buffer (ROB)?

Sample question (with answers)
What is the major difference between the CDC 6600's scoreboarding algorithm and the IBM 360/91's Tomasulo algorithm? (One sentence) The Tomasulo algorithm performs register renaming.
Why did the IBM 360/91 implement Tomasulo's algorithm only in the floating-point unit and not in the integer unit? (One sentence) Because of the long latency of the FPU, and because the FPU has only 4 registers.
What are the two main functions of a ReOrder Buffer (ROB)? To support i) precise exceptions and ii) branch misprediction recovery.

Sample question
What is the main responsibility of the Load Store Queue?
Given 4 architectural registers (R0 to R3) and 16 physical registers (T0 to T15), the current RAT content is indicated in the leftmost table below. Note that the physical registers are allocated in circular numbering order (i.e., T0, T1, ..., T15, then back to T0). Assume the renaming logic can rename 4 instructions per clock cycle. For the following instruction sequence, fill in the RAT content one cycle later. (The destination register of arithmetic instructions is on the left-hand side.)
What are the two main functions of a ReOrder Buffer (ROB)?
[Table: current RAT, instruction sequence, and RAT after one cycle.]

Sample question (with answers)
What is the main responsibility of the Load Store Queue? To perform memory address disambiguation and maintain memory ordering.
Given 4 architectural registers (R0 to R3) and 16 physical registers (T0 to T15), the current RAT content is indicated in the leftmost table below. Note that the physical registers are allocated in circular numbering order (i.e., T0, T1, ..., T15, then back to T0). Assume the renaming logic can rename 4 instructions per clock cycle. For the following instruction sequence, fill in the RAT content one cycle later. (The destination register of arithmetic instructions is on the left-hand side.)
What are the two main functions of a ReOrder Buffer (ROB)?

Sample question
Caches and main memory are sub-divided into multiple banks in order to allow parallel access. What is an alternative way of allowing parallel access?
What is a cache called that allows multiple cache misses to be outstanding to main memory at the same time while the pipeline is not stalled?
While cache misses are outstanding to main memory, what is the structure that keeps bookkeeping information about the outstanding cache misses? This structure often augments the cache.

Sample question (with answers)
Caches and main memory are sub-divided into multiple banks in order to allow parallel access. What is an alternative way of allowing parallel access? Multiporting or duplicating.
What is a cache called that allows multiple cache misses to be outstanding to main memory at the same time while the pipeline is not stalled? A non-blocking (or lockup-free) cache.
While cache misses are outstanding to main memory, what is the structure that keeps bookkeeping information about the outstanding cache misses? This structure often augments the cache. Miss status handling registers (MSHRs).

Sample question
Consider a processor with separate instruction and data caches (and no L2 cache). We are focusing on improving the data cache performance, since our instruction cache achieves a 100% hit rate with various optimizations. The data cache is 4 KB, direct-mapped, and has a single-cycle access latency. The processor supports a 64-bit virtual address space, 8 KB pages, and no more than 16 GB of physical memory. The cache block size is 32 bytes. The data cache is virtually indexed and physically tagged. Assume that the data TLB hit rate is 100%. The miss rate of the data cache is measured to be 10%, and the miss penalty is 20 cycles. Compute the average memory access latency (in cycles) for data accesses.
To improve the overall memory access latency, we decide to introduce a victim cache. It is fully associative and has eight entries; its access latency is one cycle. To save power and energy, we access the victim cache only after we detect a miss in the data cache. The victim cache hit rate is measured to be 50% (i.e., the probability of finding data in the victim cache given that the data cache doesn't have it). Further, miss handling starts only after we detect a miss in the victim cache. Compute the average memory access latency for data accesses.
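One way to work out the two answers above (a sketch of the arithmetic, assuming the 20-cycle miss penalty is paid on top of the 1-cycle hit time):
Without the victim cache: AMAT = 1 + 0.10 × 20 = 3 cycles.
With the victim cache: an L1 miss first pays 1 cycle for the victim-cache probe; half the time that probe hits, and the other half the 20-cycle miss handling follows, so AMAT = 1 + 0.10 × (1 + 0.5 × 20) = 1 + 0.10 × 11 = 2.1 cycles.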

Prefetch (Data/Instruction)
Predict what data will be needed in the future.
Pollution vs. latency reduction: if you correctly predict the data that will be required, you reduce latency; if you mispredict, you bring in unwanted data and pollute the cache.
Questions that determine effectiveness: When to initiate the prefetch (timeliness)? Which lines to prefetch? How big a line to prefetch (note that the cache line mechanism already performs a form of prefetching)? What to replace?
Software (data) prefetching vs. hardware prefetching.

Software-controlled Prefetching
Use instructions.
Existing instruction: Alpha's load to R31 (hardwired to 0).
Specialized instructions and hints: Intel's SSE prefetchnta and prefetcht0/t1/t2; MIPS32 PREF; PowerPC dcbt (data cache block touch) and dcbtst (data cache block touch for store).
Prefetch instructions are inserted by the compiler or by hand.

Alpha
The Alpha architecture supports data prefetch via load instructions with a destination of register R31 or F31:
LDBU, LDF, LDG, LDL, LDT, LDWU - normal cache line prefetches.
LDS - prefetch with modify intent; sets the dirty and modified bits.
LDQ - prefetch, evict next; no temporal locality.
The Alpha architecture also defines the following instructions:
FETCH - prefetch data.
FETCH_M - prefetch data, modify intent.

PowerPC
dcbt - data cache block touch.
dcbtst - data cache block touch for store.

Intel SSE
The SSE prefetch instruction has the following variants:
prefetcht0 - temporal data; prefetch data into all cache levels.
prefetcht1 - temporal with respect to the first-level cache; prefetch data into all cache levels except the 0th level.
prefetcht2 - temporal with respect to the second-level cache; prefetch data into all cache levels except the 0th and 1st levels.
prefetchnta - non-temporal with respect to all cache levels; prefetch data into a non-temporal cache structure, with minimal cache pollution.
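From C, these ISA-specific prefetch instructions are usually reached through compiler builtins. Below is a minimal sketch using GCC/Clang's __builtin_prefetch; the 4-iteration prefetch distance is an illustrative assumption and would normally be tuned so the prefetch latency is covered by the intervening computation.

    #include <stddef.h>

    /* Dot product with software prefetching 4 iterations ahead. */
    double dot(const double *a, const double *b, size_t n)
    {
        double sop = 0.0;
        for (size_t i = 0; i < n; i++) {
            if (i + 4 < n) {                      /* stay inside the arrays */
                __builtin_prefetch(&a[i + 4]);    /* hint only: no fault, no stall */
                __builtin_prefetch(&b[i + 4]);
            }
            sop += a[i] * b[i];
        }
        return sop;
    }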

Software-controlled Prefetching

    for (i = 0; i < N; i++) {
        prefetch(&a[i+1]);
        prefetch(&b[i+1]);
        sop = sop + a[i]*b[i];
    }

    /* unroll loop 4 times */
    for (i = 0; i < N-4; i += 4) {
        prefetch(&a[i+4]);
        prefetch(&b[i+4]);
        sop = sop + a[i]*b[i];
        sop = sop + a[i+1]*b[i+1];
        sop = sop + a[i+2]*b[i+2];
        sop = sop + a[i+3]*b[i+3];
    }
    sop = sop + a[N-4]*b[N-4];
    sop = sop + a[N-3]*b[N-3];
    sop = sop + a[N-2]*b[N-2];
    sop = sop + a[N-1]*b[N-1];

The prefetch latency should be no more than the computation time that overlaps it.

Hardware-based Prefetching
Sequential prefetching: prefetch on miss, and tagged prefetch.
Both techniques are based on one-block-lookahead (OBL) prefetch: prefetch line (L+1) when line L is accessed, based on some criterion.

Sequential Prefetching
Prefetch on miss: initiate a prefetch of line (L+1) whenever an access to line L results in a miss. The Alpha 21064 does this for instructions; prefetched instructions are stored in a separate structure called a stream buffer.
Tagged prefetch: whenever there is a first use of a line (demand fetched or previously prefetched), prefetch the next one. One additional tag bit is kept per cache line. The prefetched, not-yet-used line is tagged (Tag = 1); Tag = 0 means the line was demand fetched, or a prefetched line has been referenced for the first time. Prefetch (L+1) only if the tag bit of L is 1.

[Figure: prefetch-on-miss vs. tagged prefetch when accessing contiguous lines, showing which accesses hit or miss and which lines are demand fetched vs. prefetched (with their tag bits) under each scheme.]
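A small, self-contained C model of the tagged (one-block-lookahead) prefetch policy described above; the direct-mapped cache, its size, and the helper names are illustrative assumptions, not a real prefetcher design.

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    #define SETS 256                               /* tiny direct-mapped cache (assumed) */

    struct line {
        bool     valid;
        bool     prefetch_tag;                     /* 1 = prefetched, not yet referenced */
        uint64_t block;                            /* block address */
    };

    static struct line cache[SETS];

    static void fetch(uint64_t block, bool prefetched)
    {
        struct line *l = &cache[block % SETS];
        l->valid = true;
        l->block = block;
        l->prefetch_tag = prefetched;
    }

    static void prefetch(uint64_t block)
    {
        struct line *l = &cache[block % SETS];
        if (!l->valid || l->block != block)        /* only if not already cached */
            fetch(block, true);
    }

    /* Tagged prefetch: on a demand miss, or on the first reference to a
       previously prefetched line, also prefetch the next sequential block. */
    static void access_block(uint64_t block)
    {
        struct line *l = &cache[block % SETS];
        if (!l->valid || l->block != block) {      /* demand miss */
            fetch(block, false);
            prefetch(block + 1);                   /* OBL prefetch, tag = 1 */
        } else if (l->prefetch_tag) {              /* first use of a prefetched line */
            l->prefetch_tag = false;
            prefetch(block + 1);
        }                                          /* plain hit: no prefetch */
    }

    int main(void)
    {
        for (uint64_t b = 0; b < 8; b++)           /* sequential access stream */
            access_block(b);
        printf("block 8 already prefetched: %d\n",
               cache[8 % SETS].valid && cache[8 % SETS].block == 8);
        return 0;
    }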

Virtual Memory
Virtual memory is the separation of logical memory from physical memory. Only part of the program needs to be in memory for execution; hence, the logical address space can be much larger than the physical address space. It allows address spaces to be shared by several processes (or threads) and allows for more efficient process creation.
Virtual memory can be implemented via demand paging or demand segmentation.
Main memory acts like a cache for the hard disk!

Virtual Address
The concept of a virtual (or logical) address space that is bound to a separate physical address space is central to memory management. A virtual address is generated by the CPU; a physical address is what the memory sees. Virtual and physical addresses are the same in compile-time and load-time address-binding schemes; they differ in execution-time address-binding schemes.

Advantages of Virtual Memory
Translation: the program can be given a consistent view of memory, even though physical memory is scrambled. Only the most important part of the program (the "working set") must be in physical memory. Contiguous structures (like stacks) use only as much physical memory as necessary, yet can grow later.
Protection: different threads (or processes) are protected from each other. Different pages can be given special behavior (read-only, invisible to user programs, etc.). Kernel data is protected from user programs. This is very important for protection from malicious programs (=> far more viruses under Microsoft Windows).
Sharing: the same physical page can be mapped into multiple address spaces ("shared memory").

Use of Virtual Memory
[Figure: address-space layouts of Process A and Process B (code, static data, heap, shared libraries, stack), with a shared page mapped into both.]

Virtual vs. Physical Address Space
[Figure: pages A-D of a 4 GB virtual address space; some pages reside in 4 KB frames of main memory, others on disk.]

Paging
Divide physical memory into fixed-size blocks (e.g., 4 KB) called frames, and divide logical memory into blocks of the same size called pages. To run a program of size n pages, find n free frames and load the program. A page table, set up by the operating system, maps page addresses to frame addresses.

Page Table and Address Translation
[Figure: the virtual page number (VPN) indexes the page table to obtain the physical page number (PPN); the PPN concatenated with the page offset forms the physical address used to access main memory.]
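A minimal C sketch of the translation in the figure, assuming a flat (single-level) page table indexed by the VPN, 4 KB pages, and a 32-bit virtual address space; all widths and names are illustrative.

    #include <stdint.h>

    #define PAGE_SHIFT 12                       /* 4 KB pages */
    #define PAGE_SIZE  (1u << PAGE_SHIFT)
    #define NUM_PAGES  (1u << 20)               /* 32-bit virtual address space (assumed) */

    /* One entry per virtual page: the physical frame number plus a valid bit. */
    struct pte { uint32_t ppn : 20; uint32_t valid : 1; };
    static struct pte page_table[NUM_PAGES];    /* set up by the operating system */

    uint64_t translate(uint32_t vaddr)
    {
        uint32_t vpn    = vaddr >> PAGE_SHIFT;           /* virtual page number */
        uint32_t offset = vaddr & (PAGE_SIZE - 1);       /* unchanged by translation */

        struct pte e = page_table[vpn];
        if (!e.valid)
            return UINT64_MAX;                           /* page fault: not in main memory */
        return ((uint64_t)e.ppn << PAGE_SHIFT) | offset; /* physical address */
    }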

Page Table Structure Examples
A one-to-one mapping raises a space question. Large pages cause internal fragmentation (similar to having large line sizes in caches); small pages raise page-table size issues. Alternatives: multi-level paging and the inverted page table.
Example: with a 64-bit address space, 4 KB pages (12 bits), and 512 MB (29 bits) of RAM, the number of pages is 2^64 / 2^12 = 2^52, and the page table has that many entries. At ~4 bytes per entry, the page table is 2^54 bytes = 16 petabytes, so it cannot fit in the 512 MB of RAM!

Multi-level (Hierarchical) Page Table
Divide the virtual address into multiple levels (P1, P2, page offset). Level 1 is stored in main memory.
[Figure: P1 indexes the level-1 page directory (a pointer array); P2 indexes the level-2 page table, which stores the PPN; the PPN combined with the page offset forms the physical address.]
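The two-level walk can be sketched as follows; the 10/10/12 bit split and the structure names are illustrative assumptions. Only level-2 tables for regions actually in use need to be allocated, which is what saves space relative to the flat table.

    #include <stddef.h>
    #include <stdint.h>

    #define L1_BITS    10
    #define L2_BITS    10
    #define PAGE_SHIFT 12

    struct l2_table { uint32_t ppn[1 << L2_BITS]; };          /* stores PPNs */
    struct l1_dir   { struct l2_table *next[1 << L1_BITS]; }; /* pointer array */

    /* Walk the two levels; returns -1 for an unmapped region. */
    int64_t walk(const struct l1_dir *dir, uint32_t vaddr)
    {
        uint32_t p1     = vaddr >> (L2_BITS + PAGE_SHIFT);        /* index into directory */
        uint32_t p2     = (vaddr >> PAGE_SHIFT) & ((1 << L2_BITS) - 1);
        uint32_t offset = vaddr & ((1 << PAGE_SHIFT) - 1);

        const struct l2_table *t = dir->next[p1];
        if (t == NULL)
            return -1;                                            /* no level-2 table here */
        return ((int64_t)t->ppn[p2] << PAGE_SHIFT) | offset;
    }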

Inverted Page Table
One entry for each real page (frame) of memory, shared by all active processes. Each entry holds the virtual address of the page stored in that physical location, together with process ID information. This decreases the memory needed to store the page tables, but increases the time needed to search the table when a page reference occurs.

Linear Inverted Page Table
Contains one entry per physical page in a linear array; the array must be traversed sequentially to find a match, which can be time consuming.
[Figure: example lookup with PID = 8 and VPN = 0x2AA70; the matching entry (PID, VPN) is found at index 0x120D of the linear inverted page table, so PPN = 0x120D is concatenated with the offset to form the physical address.]
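A C sketch of the linear search implied above: one entry per physical frame, matched on the (PID, VPN) pair. The frame count follows the earlier 512 MB / 4 KB example; the structure layout is an assumption.

    #include <stdint.h>

    #define PAGE_SHIFT 12
    #define NUM_FRAMES (1u << 17)               /* 512 MB RAM / 4 KB pages */

    struct ipt_entry { uint16_t pid; uint64_t vpn; };   /* indexed by physical frame number */
    static struct ipt_entry ipt[NUM_FRAMES];

    /* Returns the physical address, or -1 if no frame currently holds (pid, vpn). */
    int64_t ipt_translate(uint16_t pid, uint64_t vaddr)
    {
        uint64_t vpn    = vaddr >> PAGE_SHIFT;
        uint64_t offset = vaddr & ((1u << PAGE_SHIFT) - 1);

        for (uint64_t ppn = 0; ppn < NUM_FRAMES; ppn++)       /* sequential scan: slow */
            if (ipt[ppn].pid == pid && ipt[ppn].vpn == vpn)
                return (int64_t)((ppn << PAGE_SHIFT) | offset);
        return -1;                                            /* page fault */
    }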

Hashed Inverted Page Table
Use a hash table to limit the search to a smaller number of page-table entries.
[Figure: example lookup; the (PID, VPN) pair (PID = 8, VPN = 0x2AA70) is hashed into a hash anchor table, which points to a chain of inverted-page-table entries linked by a Next field; the chain is followed until the matching entry is found at index 0x120D.]

Fast Address Translation
How often does address translation occur? Where is the page table kept?
Keep translations in hardware: use a Translation Lookaside Buffer (TLB), with separate instruction and data TLBs. A TLB is essentially a cache (tag array = VPN, data array = PPN). It is small (32 to 256 entries are typical) and typically fully associative (implemented as a content-addressable memory, CAM) or highly associative, to minimize conflicts.
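A software model of the fully associative TLB lookup just described; in hardware the loop is a single parallel CAM comparison. The size and field names are assumptions.

    #include <stdbool.h>
    #include <stdint.h>

    #define TLB_ENTRIES 64                 /* typically 32-256, fully associative */

    struct tlb_entry { bool valid; uint64_t vpn; uint64_t ppn; };
    static struct tlb_entry tlb[TLB_ENTRIES];

    /* Every entry is compared against the VPN at once (a CAM in hardware);
       here the loop stands in for the parallel comparison. */
    bool tlb_lookup(uint64_t vpn, uint64_t *ppn)
    {
        for (int i = 0; i < TLB_ENTRIES; i++) {
            if (tlb[i].valid && tlb[i].vpn == vpn) {
                *ppn = tlb[i].ppn;         /* hit: translate without touching the page table */
                return true;
            }
        }
        return false;                      /* miss: walk the page table, then refill the TLB */
    }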

Example: Alpha 21264 data TLB
[Figure: the virtual address is split into a 35-bit VPN and a 13-bit page offset; each TLB entry holds an 8-bit address space number (ASN), 4 protection bits, a valid bit, a 35-bit virtual tag, and a 31-bit PPN; a 128:1 mux selects the matching entry to form the 44-bit physical address.]

TLB and Caches
Several design alternatives:
VIVT: virtually indexed, virtually tagged cache.
VIPT: virtually indexed, physically tagged cache.
PIVT: physically indexed, virtually tagged cache. Not outright useful; the R6000 is the only design that used it.
PIPT: physically indexed, physically tagged cache.

Virtually-Indexed Virtually-Tagged (VIVT)
[Figure: the processor core accesses the VIVT cache with the virtual address; only on a miss is the TLB used to translate before going to main memory, which returns the cache line.]
Fast cache access; address translation is required only when going to memory (on a miss). Issues?

VIVT Cache Issues - Aliasing
Homonym: the same VA maps to different PAs. Occurs when there is a context switch. Solutions: include the process ID (PID) in the cache, or flush the cache upon context switches.
Synonym (also a problem in VIPT): different VAs map to the same PA. Occurs when data is shared by multiple processes. The cache line is duplicated in a VIPT cache and in a VIVT cache with PIDs, and data becomes inconsistent due to the duplicated locations. Solutions: Can write-through solve the problem? Flush the cache upon context switch. If (index + offset) < page offset, can the problem be solved? (Discussed later under VIPT.)

Physically-Indexed Physically-Tagged (PIPT)
[Figure: the processor core sends the VA to the TLB; the resulting PA accesses the PIPT cache, which goes to main memory on a miss and returns the cache line.]
Slower: the address is always translated before the cache is accessed. Simpler for data coherence.

Virtually-Indexed Physically-Tagged (VIPT)
[Figure: the processor core accesses the VIPT cache with the VA while the TLB translates to the PA in parallel; the PA is used for the tag comparison, and main memory is accessed on a miss.]
Gains the benefits of both VIVT and PIPT: the TLB and the VIPT cache are accessed in parallel. No homonym problem. What about synonyms?

Dealing with Synonyms in a VIPT Cache
[Figure: processes A and B use different virtual pages (VPN A != VPN B) that point to the same location within a physical page; if their cache indices differ, the line is duplicated in the tag and data arrays.]
How to eliminate the duplication? Make cache index A == index B.

Synonym in a VIPT Cache
[Figure: the address split as VPN | page offset versus cache tag | set index | line offset; 'a' denotes the set-index bits that extend above the page offset into the VPN.]
If two VPNs do not differ in 'a', then there is no synonym problem, since they will be indexed to the same set of a VIPT cache. This implies the number of sets cannot be too big: max number of sets = page size / cache line size. Example: 4 KB page, 32 B line, max sets = 128. A more complicated solution is used in the MIPS R10000.

R10000's Solution to Synonyms
32 KB 2-way virtually indexed L1; direct-mapped physical L2.
[Figure: address split for the virtually indexed L1 (12-bit page offset, 10-bit index, 4-bit field); a = VPN[1:0].]
a = VPN[1:0] is stored as part of the L2 cache tag. L2 is inclusive of L1, and VPN[1:0] is appended to the tag of L2.
Given two virtual addresses VA1 and VA2 that differ in VPN[1:0] and both map to the same physical address PA, suppose VA1 is accessed first, so blocks are allocated in L1 and L2. What happens when VA2 is referenced?
1. VA2 indexes to a different block in L1 and misses.
2. VA2 translates to PA and goes to the same block as VA1 in L2.
3. The tag comparison fails (since VA1[1:0] != VA2[1:0]).
4. It is treated just like an L2 conflict miss: VA1's entry in L1 is evicted (or written back if dirty) due to the inclusion policy.

Dealing with Synonyms in the MIPS R10000
[Figure: VA2 misses in the virtually indexed L1 because its index bits a2 select a different block than VA1's a1; the access goes through the TLB to the physically indexed L2, where the a1 bits stored with the physical tag reveal the mismatch (a2 != a1).]
[Figure: after the conflict is detected, VA1's copy is evicted from L1 and the data is returned for VA2, so only one copy is present in L1.]