CHAPTER 4: MEMORY HIERARCHIES


OUTLINE
- Typical memory hierarchy
- Cache design
- Techniques to improve cache performance
- Virtual memory support

MEMORY HIERARCHIES
Principle of locality: a program accesses a relatively small portion of the address space at a time. There are two different types of locality:
- Temporal locality: if an item is referenced, it will tend to be referenced again soon.
- Spatial locality: if an item is referenced, items whose addresses are close tend to be referenced soon.
Spatial locality turns into temporal locality in blocks/pages.

TYPICAL MEMORY HIERARCHY: THE PYRAMID
[Figure: the memory pyramid, from CPU registers (compiler managed), to the L1 and L2 SRAM caches (hardware managed), down to Ln-1 (main memory, DRAM) and Ln (secondary memory, disk) (kernel/OS managed); speed decreases while cost per bit drops and size grows toward the bottom.]
- Goals: high speed, low cost, high capacity.
- Properties maintained across levels: inclusion and coherence.

CACHE PERFORMANCE
Average memory access time (AMAT):
    AMAT = hit time + miss rate x miss penalty
- Miss rate: the fraction of accesses not satisfied at the highest level, i.e., the number of misses in L1 divided by the number of processor references. Also, hit rate = 1 - miss rate.
- Misses per instruction (MPI): the number of misses in L1 divided by the number of instructions. Easier to use than the miss rate:
    CPI = CPI_0 + MPI x miss penalty
- Miss penalty: the average delay per miss caused in the processor. If the processor blocks on misses, this is simply the number of clock cycles to bring a block from memory, i.e., the miss latency. In an out-of-order (OoO) processor, the penalty of a miss cannot be measured directly and differs from the miss latency.
- Miss rate and miss penalty can be defined at every cache level, usually normalized to the number of processor references or to the number of accesses from the lower level.
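As a concrete illustration of the two formulas above, the short C sketch below computes AMAT and the effective CPI for a single-level cache. All numeric parameters are made-up example values, not figures from the text.

    #include <stdio.h>

    int main(void) {
        /* Hypothetical example parameters, not values from the text. */
        double hit_time     = 1.0;    /* cycles for an L1 hit                 */
        double miss_rate    = 0.05;   /* fraction of references that miss     */
        double miss_penalty = 100.0;  /* cycles to bring a block from memory  */
        double cpi0         = 1.2;    /* base CPI with a perfect cache        */
        double mpi          = 0.02;   /* misses per instruction               */

        /* AMAT = hit time + miss rate x miss penalty */
        double amat = hit_time + miss_rate * miss_penalty;

        /* CPI = CPI_0 + MPI x miss penalty */
        double cpi = cpi0 + mpi * miss_penalty;

        printf("AMAT = %.2f cycles\n", amat);   /* 1.0 + 0.05*100 = 6.00  */
        printf("CPI  = %.2f\n", cpi);           /* 1.2 + 0.02*100 = 3.20  */
        return 0;
    }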

CACHE MAPPING
- Memory blocks are mapped to cache lines. The mapping can be direct, set-associative, or fully associative.
- Cache access has two phases: index + tag check.
- Direct-mapped: each memory block can be mapped to only one cache line, given by the block address modulo the number of lines in the cache.
- Set-associative: each memory block can be mapped to a set of lines in the cache; the set number is the block address modulo the number of cache sets.
- Fully associative: each memory block can be in any cache line.

DIRECT-MAPPED CACHES: CACHE SLICE
[Figure: example of a direct-mapped cache with two words per line, in a narrow (one-word-wide) and a wide (one-block-wide) organization; the address is split into tag, set#, word_offset and byte_offset fields, and a comparator (=?) checks the tag from the directory.]
- The cache is made of a directory plus a data memory, with one entry per cache line.
- The directory holds the tag and status (state) bits: valid, dirty, reference, cache coherence.
- Cache access has two phases: use the index bits to fetch the tags and data from the set (cache index), then check the tags to detect hit/miss.

CACHE ACCESS
[Figure: a 3-way set-associative cache with ways/slices 0-2; the address is split into tag, set# and byte_offset. A fully associative cache has an associatively searched directory and no set index; the address is split into tag and byte_offset only.]

REPLACEMENT POLICIES
- Random, LRU, FIFO, pseudo-LRU; the cache maintains replacement bits. Example: least-recently used (LRU).
- Direct-mapped: no replacement policy needed.
- Set-associative: per-set replacement.
- Fully associative: replacement across the whole cache.
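To make the index + tag-check phases concrete, here is a minimal C sketch of a direct-mapped lookup. The cache geometry (64 lines of 8 bytes) and all names are illustrative assumptions, not parameters from the text.

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    /* Hypothetical parameters: 64 lines of 8 bytes each (a 512-byte cache). */
    #define NUM_LINES   64
    #define BLOCK_BYTES 8

    typedef struct {
        bool     valid;
        uint32_t tag;
    } dir_entry_t;                    /* directory entry: status bit + tag */

    static dir_entry_t directory[NUM_LINES];

    /* Phase 1: index into the directory; phase 2: compare the tag. */
    bool dm_lookup(uint32_t addr) {
        uint32_t block = addr / BLOCK_BYTES;       /* drop the byte offset      */
        uint32_t index = block % NUM_LINES;        /* block address mod #lines  */
        uint32_t tag   = block / NUM_LINES;        /* remaining high-order bits */
        return directory[index].valid && directory[index].tag == tag;
    }

    int main(void) {
        directory[3].valid = true;                 /* pretend this block is cached */
        directory[3].tag   = 0x12;
        uint32_t addr = (0x12 * NUM_LINES + 3) * BLOCK_BYTES + 4;
        printf("hit = %d\n", dm_lookup(addr));     /* prints hit = 1 */
        return 0;
    }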

WRITE POLICIES
- Write-through: write to the next level on all writes.
  - Combined with a write buffer to avoid CPU stalls.
  - Simple; no inconsistency among levels.
  [Figure: the CPU sends all accesses to L1 and all stores also to a write buffer; L1 misses go to L2.]
- Write-back: write to the next level on replacement, with the help of a dirty bit and a write-back buffer.
  - Writes to the next level happen on a miss only.
- Allocation on write misses: always allocate in write-back; a design choice in write-through.

CLASSIFICATION OF CACHE MISSES: THE 3 C's
- Compulsory (cold) misses: on the first reference to a block.
- Capacity misses: the cache space is not sufficient to host the data or code.
- Conflict misses: happen when two memory blocks map to the same cache block in direct-mapped or set-associative caches.
- Later on: coherence misses (a 4 C's classification).
How to find out?
- Cold misses: simulate an infinite cache size.
- Capacity misses: simulate a fully associative cache, then deduct the cold misses.
- Conflict misses: simulate the actual cache, then deduct the cold and capacity misses.
- Problem: which replacement policy should we use in the fully associative cache?
The classification is useful to understand how to eliminate misses.

MULTI-LEVEL CACHE HIERARCHIES
- The first and second levels are on-chip; third and fourth levels are mostly off-chip.
- Usually, cache inclusion is maintained:
  - When a block misses in L1, it must be brought into all Li.
  - When a block is replaced in Li, it must be removed from all Lj, j < i.
- Also possible: exclusion. If a block is in Li, it is not in any other cache level.
  - If a block misses in L1, all copies are removed from all Li, i > 1.
  - If a block is replaced in Li, it is allocated in Li+1.
- Or no policy at all. We will assume inclusion.

EFFECT OF CACHE PARAMETERS
- Larger caches: slower and more complex, but fewer capacity misses.
- Larger block size: exploits spatial locality, but too big a block increases capacity misses, and big blocks also increase the miss penalty.
- Higher associativity: addresses conflict misses. 8-16 way set-associative is as good as fully associative, and a 2-way set-associative cache of size N has a similar miss rate to a direct-mapped cache of size 2N, but at a higher hit time.
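The 3 C's recipe above lends itself to a small simulation. Below is a minimal C sketch that classifies each miss of a direct-mapped cache as cold, capacity, or conflict by running an "infinite" cache and a same-size fully associative LRU cache alongside it. The tiny 4-block geometry and the block-address trace are made-up illustrations, not data from the text.

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    #define NUM_BLOCKS 4      /* total cache capacity, in blocks (assumed) */
    #define MAX_TRACE  64

    /* "Infinite" cache: has this block ever been referenced before? */
    static uint32_t seen[MAX_TRACE]; static int seen_cnt;
    static bool seen_before(uint32_t b) {
        for (int i = 0; i < seen_cnt; i++) if (seen[i] == b) return true;
        seen[seen_cnt++] = b;
        return false;
    }

    /* Fully associative LRU cache of NUM_BLOCKS entries (slot 0 = MRU). */
    static uint32_t fa[NUM_BLOCKS]; static int fa_cnt;
    static bool fa_access(uint32_t b) {
        for (int i = 0; i < fa_cnt; i++)
            if (fa[i] == b) {                               /* hit: move to front */
                memmove(&fa[1], &fa[0], i * sizeof fa[0]);
                fa[0] = b; return true;
            }
        if (fa_cnt < NUM_BLOCKS) fa_cnt++;                  /* miss: insert at front, */
        memmove(&fa[1], &fa[0], (fa_cnt - 1) * sizeof fa[0]); /* dropping the LRU     */
        fa[0] = b; return false;
    }

    /* The actual cache: direct-mapped with NUM_BLOCKS lines. */
    static uint32_t dm[NUM_BLOCKS]; static bool dm_valid[NUM_BLOCKS];
    static bool dm_access(uint32_t b) {
        int idx = b % NUM_BLOCKS;
        if (dm_valid[idx] && dm[idx] == b) return true;
        dm_valid[idx] = true; dm[idx] = b; return false;
    }

    int main(void) {
        /* Made-up block-address trace. */
        uint32_t trace[] = {0, 4, 0, 8, 4, 0, 12, 8, 0, 4};
        int cold = 0, capacity = 0, conflict = 0;
        for (unsigned i = 0; i < sizeof trace / sizeof trace[0]; i++) {
            uint32_t b = trace[i];
            bool inf_hit = seen_before(b);
            bool fa_hit  = fa_access(b);
            bool dm_hit  = dm_access(b);
            if (!dm_hit) {                    /* classify this miss           */
                if (!inf_hit)     cold++;     /* first reference to the block */
                else if (!fa_hit) capacity++; /* same-size FA cache also missed */
                else              conflict++; /* only the real cache missed   */
            }
        }
        printf("cold=%d capacity=%d conflict=%d\n", cold, capacity, conflict);
        return 0;                             /* cold=4 capacity=0 conflict=6 */
    }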

LOCKUP-FREE (NON-BLOCKING) CACHES
- The cache is a 2-ported device, with a cache-to-memory interface and a processor-to-cache interface.
- If a lockup-free cache misses, it does not block; rather, it handles the miss and keeps accepting accesses from the processor. This allows the concurrent processing of multiple misses and hits.
- The cache has to bookkeep all pending misses in MSHRs (Miss Status Handling Registers), which contain the address of the pending miss, the destination block in the cache, and the destination register.
- The number of MSHRs limits the number of pending misses; data dependences eventually block the processor.
- Non-blocking caches are required in dynamically scheduled processors and to support prefetching.

PRIMARY AND SECONDARY MISSES
- Primary miss: the first miss to a block.
- Secondary miss: a following access to a block that is pending due to a primary miss.
- A lockup-free cache therefore sees a lot more misses (a blocking cache only has primary misses) and needs MSHRs for both primary and secondary misses.
- Misses are overlapped with computation and with other misses.

HARDWARE PREFETCHING OF INSTRUCTIONS AND DATA
- Sequential prefetching of instructions: on an I-fetch miss, fetch two blocks instead of one; the second block is stored in an I-stream buffer. If the I-stream buffer hits, the block is moved to L1; I-stream buffer blocks are overlaid if not accessed. Also applicable to data, but less effective.
- Hardware prefetch engines: detect strides in the stream of missing addresses, then start fetching ahead.

COMPILER-CONTROLLED (SOFTWARE) PREFETCHING
- Insert prefetch instructions in the code; these are non-binding (they load into the cache only).
- In a loop, we may insert prefetch instructions in the body of the loop to prefetch data needed in future loop iterations:

    LOOP: L.D   F2, 0(R1)
          PREF  -24(R1)
          ADD.D F4, F2, F0
          S.D   F4, 0(R1)
          SUBI  R1, R1, #8
          BNEZ  R1, LOOP

- Can work for both loads and stores; requires a non-blocking cache; adds instruction overhead.
- Data must be prefetched on time, so that it is present in the cache at the time of the access, but not too early, so that it is still in the cache when accessed.
- Can easily be done for arrays, but can also be applied to pointer accesses.
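In C, the same software-prefetching idea can be expressed with the GCC/Clang __builtin_prefetch intrinsic, which emits a non-binding prefetch hint. This is a minimal sketch over an assumed array-scaling loop; the function name, array, and the prefetch distance of 8 elements are illustrative choices, not taken from the text.

    #include <stddef.h>

    /* Scale an array in place, prefetching data needed a few iterations ahead. */
    void scale(double *a, size_t n, double k) {
        const size_t dist = 8;                           /* assumed prefetch distance */
        for (size_t i = 0; i < n; i++) {
            if (i + dist < n)
                __builtin_prefetch(&a[i + dist], 1, 3);  /* rw=1 (written), high locality */
            a[i] = a[i] * k;                             /* the actual work */
        }
    }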

FASTER HIT TIMES
Princeton vs. Harvard cache:
- Princeton: a unified instruction/data cache. Instructions and data can use the whole cache as they need.
- Harvard: split instruction and data caches. Both caches can be optimized for their access type.
- In a pipelined machine, the first-level cache (FLC) is Harvard and the second-level cache (SLC) is Princeton.

PIPELINE CACHE ACCESSES
- Especially useful for stores: pipeline the tag check and the data store, with separate read and write ports to the cache, each optimized for its access type.
- Also useful for I-caches and for loads in D-caches.
- Increases the pipeline length; one must find ways of splitting cache accesses into stages.

FASTER HIT TIMES: KEEP THE CACHE SIMPLE AND FAST
- This favors direct-mapped caches: less multiplexing, and the tag check can overlap with the use of the data.
- Interestingly, the size of the FLC tends to decrease and its associativity goes up as FLCs try to keep up with the CPU:

    Processor         L1 data cache
    Alpha 21164       8KB, direct mapped
    Alpha 21364       64KB, 2-way
    MPC750            32KB, 8-way, PLRU
    PA-8500           1MB, 4-way, PLRU
    Classic Pentium   16KB, 4-way, LRU
    Pentium-II        16KB, 4-way, PLRU
    Pentium-III       16KB, 4-way, PLRU
    Pentium-IV        8KB, 4-way, PLRU
    MIPS R10K/12K     32KB, 2-way, LRU
    UltraSPARC-IIi    16KB, direct mapped
    UltraSPARC-III    64KB, 4-way, Random

- Avoid address translation overhead (to be covered later).

VIRTUAL MEMORY: WHY?
- Allows applications to be bigger than the main memory size (previously: memory overlays).
- Helps with multiple-process management: each process gets its own chunk of memory; protection of processes against each other and against themselves; mapping of multiple processes to memory; relocation.
- The application and the CPU run in the virtual space; the mapping of the virtual to the physical space is invisible to the application.
- Management between main memory (MM) and disk: a miss in MM is a page fault or address fault, and the block is a page or a segment.

PAGES vs. SEGMENTS
- Most systems today use paging; some systems use paged segments; some systems use multiple page sizes, i.e., superpages (to be covered later).

                            Page                     Segment
    Addressing              one address              two (segment and offset)
    Programmer visible?     invisible                may be visible
    Replacing a block       trivial                  hard
    Memory use efficiency   internal fragmentation   external fragmentation
    Efficient disk traffic  yes                      not always

VIRTUAL ADDRESS MAPPING
[Figure: two virtual address spaces (Process 1 and Process 2) mapped onto the physical (main memory) space. VA1 in Process 1 and VA2 in Process 2 map to the same physical address: they are synonyms. VA3 in Process 1 and VA3 in Process 2 map to different physical addresses: they are homonyms.]

PAGED VIRTUAL MEMORY
- The virtual address space is divided into pages; the physical address space is divided into page frames.
- A page missing from main memory causes a page fault. Pages not in main memory are either on disk (swap-in/swap-out) or have never been allocated.
- A new page may be placed anywhere in main memory (fully associative mapping).
- Dynamic address translation: the effective address is virtual and must be translated to a physical address for every access. Virtual-to-physical translation goes through the page table in main memory.
- The page table translates addresses and enforces protection.

PAGE REPLACEMENT
- FIFO, LRU, MFU; approximate LRU (working set): a reference bit (R) per page is periodically reset by the OS.
- Page cache: hard vs. soft page faults.
- The write strategy is write-back, using a modify (M) bit.
- The M and R bits are easily maintained by software using traps.

HIERARCHICAL PAGE TABLE
- A hierarchical page table supports superpages (how?).
- It requires multiple DRAM accesses per translation. Translation accesses can be cached, but still require multiple cache accesses.
- Solution: a special cache dedicated to translations.
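The translation step described above can be sketched in C. The following is a minimal two-level page-table walk, assuming 4KB pages and a 32-bit virtual address split into 10 + 10 + 12 bits; the structure and field names are illustrative assumptions, not definitions from the text.

    #include <stdint.h>
    #include <stdio.h>

    #define PAGE_BITS   12                 /* 4KB pages (assumed)            */
    #define LEVEL_BITS  10                 /* 10 index bits per level         */
    #define ENTRIES     (1u << LEVEL_BITS)

    typedef struct {
        uint32_t frame;    /* physical page-frame number          */
        int      valid;    /* page present in main memory?        */
        int      dirty;    /* modify (M) bit                      */
        int      ref;      /* reference (R) bit, reset by the OS  */
    } pte_t;

    typedef struct { pte_t *tables[ENTRIES]; } page_dir_t;   /* level-1 directory */

    /* Walk both levels; returns 0 and sets *pa on success, -1 on a page fault. */
    int translate(page_dir_t *dir, uint32_t va, uint32_t *pa) {
        uint32_t l1  = (va >> (PAGE_BITS + LEVEL_BITS)) & (ENTRIES - 1);
        uint32_t l2  = (va >> PAGE_BITS) & (ENTRIES - 1);
        uint32_t off = va & ((1u << PAGE_BITS) - 1);

        pte_t *table = dir->tables[l1];        /* first memory access  */
        if (!table || !table[l2].valid)        /* second memory access */
            return -1;                         /* page fault: trap to the OS */
        table[l2].ref = 1;                     /* mark the page referenced   */
        *pa = (table[l2].frame << PAGE_BITS) | off;
        return 0;
    }

    int main(void) {
        static pte_t      l2tab[ENTRIES];
        static page_dir_t dir;
        dir.tables[0] = l2tab;
        l2tab[5] = (pte_t){ .frame = 0x123, .valid = 1 };
        uint32_t pa, va = (0u << 22) | (5u << 12) | 0x1A4;
        if (translate(&dir, va, &pa) == 0)
            printf("VA 0x%08x -> PA 0x%08x\n", va, pa);   /* PA 0x001231a4 */
        return 0;
    }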

TRANSLATION LOOKASIDE BUFFER (TLB)
- A page table entry is cached in the TLB.
- The TLB is organized as a cache (direct-mapped, set-associative, or fully associative) and is accessed with the virtual page number (VPN); a process ID (PID) is added to the tag to deal with homonyms.
- TLBs are much smaller than caches because of their coverage. Usually there are two TLBs: an i-TLB and a d-TLB.
- A TLB miss can be handled by a hardware MMU or by a software trap handler that walks the page table.

CACHE SIZE LIMITED TO 1 PAGE PER WAY OF ASSOCIATIVITY
- When the cache is indexed with virtual address bits and the L1 cache size is larger than 1 page per way of associativity, a block might end up in two different sets.
- This is due to synonyms, a.k.a. aliases.
[Figure: VA1 in Process 1 and VA2 in Process 2 both map to the same physical address PA; VA1 and VA2 are synonyms.]

ALIASING PROBLEM
[Figure: the address split into block offset, page displacement, and set index; the set index extends two bits above the page displacement, and these two bits differ across the aliases (z1 z2, y1 y2, x1 x2).]
- If the cache is indexed with the 2 extra bits of the virtual address, then a block may end up in different sets when it is accessed through two synonyms, causing consistency problems.
- How to avoid this?
  - On a miss, search all the possible sets (in this case 4 sets), move the block if needed (a SHORT miss), and settle on a (LONG) miss only if all 4 sets miss, OR
  - since the SLC is physically addressed and inclusion is enforced, make sure that the SLC entry contains a pointer to the latest block copy in the cache, OR
  - page coloring: make sure that all aliases have the same two extra bits (z1 = y1 = x1 and z2 = y2 = x2); no two synonyms can reside in the same page.
- The aliasing problem is more acute in caches that are both indexed and tagged with virtual address bits.
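To illustrate how the VPN and PID are used, here is a minimal C sketch of a small fully associative TLB lookup for a physically addressed cache. The TLB size, page size, and field names are illustrative assumptions, not parameters from the text.

    #include <stdint.h>
    #include <stdio.h>

    #define TLB_ENTRIES 8       /* assumed TLB size   */
    #define PAGE_BITS   12      /* assumed 4KB pages  */

    typedef struct {
        int      valid;
        uint32_t pid;   /* process ID, distinguishes homonyms */
        uint32_t vpn;   /* virtual page number                */
        uint32_t pfn;   /* physical page-frame number         */
    } tlb_entry_t;

    static tlb_entry_t tlb[TLB_ENTRIES];

    /* Returns 1 and fills *pa on a TLB hit; 0 means a miss (walk the page table). */
    int tlb_lookup(uint32_t pid, uint32_t va, uint32_t *pa) {
        uint32_t vpn = va >> PAGE_BITS;
        uint32_t off = va & ((1u << PAGE_BITS) - 1);
        for (int i = 0; i < TLB_ENTRIES; i++)
            if (tlb[i].valid && tlb[i].pid == pid && tlb[i].vpn == vpn) {
                *pa = (tlb[i].pfn << PAGE_BITS) | off;
                return 1;
            }
        return 0;   /* miss: the hardware MMU or a software handler walks the page table */
    }

    int main(void) {
        tlb[0] = (tlb_entry_t){ .valid = 1, .pid = 7, .vpn = 0x42, .pfn = 0x123 };
        uint32_t pa;
        if (tlb_lookup(7, 0x00042ABC, &pa))
            printf("PA = 0x%08x\n", pa);     /* prints PA = 0x00123abc */
        return 0;
    }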

VIRTUAL ADDRESS CACHES
- Both the index and the tag are virtual, and the TLB is accessed on cache misses only.
- To search all the possible sets, we need to check each tag one by one against the TLB (very inefficient unless the L1 cache is direct-mapped).
- Aliasing can be solved with anti-aliasing hardware, usually in the form of a reverse map in the SLC or some other table.
- Protection: the access-right bits must be migrated to the L1 cache and managed there.
- In multiprocessor systems, virtual address caches aggravate the coherence problem.
- The main advantages of virtual addressing in L1 caches are (1) a fast hit with a large cache (no TLB translation in the critical path) and (2) a very large TLB with extremely low miss rates.