Lec 11 How to improve cache performance


How to Improve Cache Performance?
AMAT = Hit time + Miss rate × Miss penalty
1. Reduce the time to hit in the cache: small and simple caches, avoiding address translation, way prediction, and trace caches.
2. Increase cache bandwidth: pipelined cache access, multibanked caches, and nonblocking caches.
3. Reduce the miss penalty: multilevel caches, critical word first, read misses prior to writes, merging write buffers, and victim caches.
4. Reduce the miss rate: larger block size, larger cache size, higher associativity, and compiler optimizations.
5. Reduce the miss penalty and miss rate via parallelism: hardware prefetching and compiler prefetching.
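As a quick illustration of the AMAT formula, here is a minimal Python sketch; the hit time, miss rate, and miss penalty values are made-up assumptions, not measurements from any real machine:

```python
def amat(hit_time, miss_rate, miss_penalty):
    """Average Memory Access Time = hit time + miss rate * miss penalty (times in clock cycles)."""
    return hit_time + miss_rate * miss_penalty

# Assumed example values: 1-cycle hit, 5% miss rate, 100-cycle miss penalty.
print(amat(hit_time=1, miss_rate=0.05, miss_penalty=100))  # -> 6.0 cycles
```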

1st Hit Time Reduction Technique: Small and Simple Caches
Use a small, direct-mapped cache. The less hardware needed to implement the cache, the shorter the critical path through it. Direct-mapped lookup is faster than set-associative lookup for both reads and writes. Fitting the cache on the same chip as the CPU is also very important for fast access times.

Figure: hit time versus cache size (16 KB to 1 MB) for 1-way, 2-way, 4-way, and 8-way set associativity; hit time increases with both size and associativity.
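Part of why a small direct-mapped cache is fast is the simplicity of its lookup: each block can live in exactly one set, so the index is a simple bit slice of the address. A minimal sketch of the usual tag/index/offset split (the 32 KB size, 64-byte blocks, and the example address are assumptions for illustration):

```python
CACHE_SIZE = 32 * 1024   # bytes (assumed)
BLOCK_SIZE = 64          # bytes (assumed)
NUM_SETS = CACHE_SIZE // BLOCK_SIZE  # direct-mapped: one block per set

def split_address(addr):
    """Return (tag, index, block offset) for a direct-mapped cache."""
    offset = addr % BLOCK_SIZE
    index = (addr // BLOCK_SIZE) % NUM_SETS
    tag = addr // (BLOCK_SIZE * NUM_SETS)
    return tag, index, offset

print(split_address(0x1234ABCD))
```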

2nd Hit Time Reduction Technique: Way Prediction (Pentium 4)
Extra bits are kept in the cache to predict the way, or block within the set, of the next cache access. If the predictor is correct, the instruction cache latency is 1 clock cycle. If not, it tries the other block, changes the way prediction, and incurs 1 extra clock cycle of latency. Simulations using SPEC95 suggest a set prediction accuracy in excess of 85%, so way prediction saves a pipeline stage in more than 85% of instruction fetches.
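The sketch below illustrates the idea for a 2-way set-associative cache: each set remembers the last way that hit, and the predicted way is probed first. The structure, latencies, and replacement choice are illustrative assumptions, not the Pentium 4 implementation:

```python
class TwoWayWithPrediction:
    def __init__(self, num_sets):
        self.tags = [[None, None] for _ in range(num_sets)]  # 2 ways per set
        self.predicted_way = [0] * num_sets                  # way-prediction bits

    def access(self, set_index, tag):
        """Return access latency in cycles: 1 on a correct prediction, 2 otherwise (assumed)."""
        pred = self.predicted_way[set_index]
        if self.tags[set_index][pred] == tag:
            return 1                                  # predicted way hit: fast path
        other = 1 - pred
        if self.tags[set_index][other] == tag:
            self.predicted_way[set_index] = other     # mispredicted: update predictor
            return 2                                  # 1 extra clock cycle
        self.tags[set_index][pred] = tag              # miss: simplistic fill for illustration
        return 2 + 100                                # assumed miss penalty
```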

3rd Hit Time Reduction Technique: Avoiding Address Translation during Indexing of the Cache
Figure: the CPU issues a virtual address (VA), which is translated to a physical address (PA) before the cache is accessed; on a cache miss the request goes to main memory for the data.
The page table is a large data structure in memory, so naive translation means TWO memory accesses for every load, store, or instruction fetch!
A virtually addressed cache? That raises the synonym problem.
Cache the address translations? That is the TLB, discussed next.

TLBs
A way to speed up translation is to use a special cache of recently used page table entries. This has many names, but the most frequently used is Translation Lookaside Buffer, or TLB.
A TLB entry holds a virtual address (the tag), a physical address, and Dirty, Reference, Valid, and Access bits.
The TLB is really just a cache on the page table mappings. TLB access time is comparable to cache access time (much less than main memory access time).

Translation Look-Aside Buffers
Just like any other cache, the TLB can be organized as fully associative, set associative, or direct mapped. TLBs are usually small, typically no more than 128-256 entries even on high-end machines; this permits a fully associative lookup on those machines. Most mid-range machines use small n-way set-associative organizations.
Figure: translation with a TLB — the CPU presents the VA to the TLB; on a TLB hit the cache is accessed with the PA (TLB lookup roughly 1/2 t, data access t); on a TLB miss the translation goes through the page table (on the order of 20 t), after which the cache and main memory are accessed.
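A back-of-the-envelope sketch of the effective access time with a TLB, using the relative costs suggested by the figure (1/2 t for the TLB lookup, t for the data access, roughly 20 t for a page-table walk); the 98% TLB hit rate is an assumed value:

```python
def effective_access_time(t, tlb_hit_rate, tlb_lookup=0.5, walk_cost=20):
    """All costs are expressed as multiples of the basic access time t (assumed model)."""
    hit_cost = tlb_lookup * t + t                        # TLB hit: lookup + data access
    miss_cost = tlb_lookup * t + walk_cost * t + t       # TLB miss: lookup + walk + data access
    return tlb_hit_rate * hit_cost + (1 - tlb_hit_rate) * miss_cost

print(effective_access_time(t=1.0, tlb_hit_rate=0.98))   # ~1.9 t with these assumptions
```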

Fast Hits by Avoiding Address Translation
Figure: three organizations of the TLB (TB) and cache ($):
Conventional organization — the CPU sends the VA to the TLB, and the cache is indexed and tagged with the PA.
Virtually addressed cache — the cache is indexed and tagged with the VA, and translation is done only on a miss; this raises the synonym problem.
Virtually indexed, physically tagged — the cache access is overlapped with the VA translation (with an L2 behind it); this requires the cache index to remain invariant across translation.

Virtually Addressed Cache
Send the virtual address to the cache? This is called a virtually addressed cache, or just virtual cache (vs. physical cache).
Any problems? Every time the process is switched, the cache logically must be flushed; otherwise we get false hits. The cost is the time to flush plus the compulsory misses from an empty cache. Alternatively, add a process identifier (PID) tag that identifies the process as well as the address within the process: then a hit cannot occur for the wrong process.

Figure: virtual cache indexing — the virtual address is split into virtual page number, index, and page offset; the tag comparison uses the virtual page number.

Dealing with Aliases
Aliases (synonyms): two different virtual addresses map to the same physical address. If we insist on no aliasing, what are the implications?
HW antialiasing: guarantee that every cache block has a unique address, verified on a miss (rather than on every hit). This works when the cache set size <= page size — what if the cache gets larger?
How can SW simplify the problem? By page coloring.
I/O must also interact with the cache, so it too needs a virtual address to probe a virtually addressed cache.

Figure: the alias problem in a virtual cache — two virtual addresses A1 and A2 (each split into tag, index, offset) refer to the same physical data yet can land in different cache blocks.

Overlap Address Translation and Cache Access (Virtually Indexed, Physically Tagged)
Figure: the virtual address is split into virtual page number and page offset; the page offset (unchanged by translation) indexes the cache while the TLB translates the virtual page number into the physical page number, which is then compared against the cache tag.
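For this overlap to work, all of the index and block-offset bits must fall inside the page offset, which bounds the size of each cache way by the page size. A small Python check of that constraint (the cache and page parameters are assumed example values):

```python
def can_index_virtually(cache_size, associativity, page_size):
    """Virtually indexed, physically tagged works if each way is no larger than a page."""
    way_size = cache_size // associativity   # bytes covered by index + block offset
    return way_size <= page_size

# Assumed examples: a 32 KB 8-way cache with 4 KB pages fits; a 64 KB 2-way cache does not.
print(can_index_virtually(32 * 1024, 8, 4096))  # True
print(can_index_virtually(64 * 1024, 2, 4096))  # False
```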

4th Hit Time Reduction Technique: Trace Caches
Find a dynamic sequence of instructions, including taken branches, to load into a cache block. The block contents are determined by the CPU's execution path instead of by memory layout. The drawback is a complicated address mapping mechanism.

Why a Trace Cache?
To bring N instructions per cycle to the pipeline with no I-cache misses, no prediction misses, and no fetch-packet breaks. Because there is a branch roughly every 5 instructions, a conventional cache can only provide one fetch packet per cycle.

What Is a Trace?
A trace is a dynamic instruction sequence. When instructions (operations) retire from the pipeline, the instruction segments are packed into a trace and stored in the trace cache, including the branch instructions. Although a branch may go to a different target, most of the time the next sequence of operations will be the same as the last one (locality).

Who proposed it?
Peleg and Weiser (1994) at Intel; Patel and Patt (1996); Rotenberg and J. Smith (1996). Paper: ISCA '98.

Figure: the trace cache in the CPU — it sits alongside the I-cache; a fill unit packs retired operations into traces, and the execution units access the data memory.

Figure: building a trace — the fill unit packs an instruction segment of basic blocks (A, B, C, D) along taken/not-taken (T/F) branch paths into a trace, together with the branch prediction info.

Pentium 4: Trace Cache, 12 instr. per cycle
Figure: a trace segment (blocks A, B, C) held in the trace cache, backed by the instruction cache, on-chip L2 cache, memory, and disk; the PC and branch predictor drive decode and rename, the fill unit builds traces, and the execution core accesses the data cache.

How to Improve Cache Performance?
AMAT = Hit time + Miss rate × Miss penalty
1. Reduce the time to hit in the cache: small and simple caches, avoiding address translation, way prediction, and trace caches.
2. Increase cache bandwidth: pipelined cache access, multibanked caches, and nonblocking caches.
3. Reduce the miss penalty: multilevel caches, critical word first, read misses prior to writes, merging write buffers, and victim caches.
4. Reduce the miss rate: larger block size, larger cache size, higher associativity, and compiler optimizations.
5. Reduce the miss penalty and miss rate via parallelism: hardware prefetching and compiler prefetching.

1st Technique for Increasing Cache Bandwidth: Pipelined Caches
Pipeline the cache access so that a hit takes multiple cycles, giving a fast clock cycle time. What problems does this create for the instruction pipeline? More cycles between issuing a load and using its data, and a greater branch misprediction penalty.

2nd Technique for Increasing Cache Bandwidth: Nonblocking Caches
A nonblocking (lockup-free) cache allows the cache to continue to supply hits while processing read misses (hit under miss, hit under multiple misses). Complex caches can even have multiple outstanding misses (miss under miss), which further lowers the effective miss penalty. What is the precondition? Nonblocking caches, in conjunction with out-of-order execution, can allow the CPU to continue executing instructions after a data cache miss.
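A rough way to see the benefit is to discount the stall time by the fraction of the miss latency that can be hidden under hits or other misses. The overlap fraction below is purely an assumed illustration, not a measured value:

```python
def memory_stall_cpi(miss_rate, miss_penalty, accesses_per_instr, overlap_fraction=0.0):
    """Memory stall cycles per instruction; overlap_fraction models latency hidden by a nonblocking cache."""
    return accesses_per_instr * miss_rate * miss_penalty * (1.0 - overlap_fraction)

blocking = memory_stall_cpi(0.05, 100, 1.3)                           # blocking cache
nonblocking = memory_stall_cpi(0.05, 100, 1.3, overlap_fraction=0.3)  # assume 30% of latency hidden
print(blocking, nonblocking)   # 6.5 vs 4.55 stall cycles per instruction
```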

Figure: performance of a nonblocking cache.

3rd Technique for Increasing Cache Bandwidth: Multibanked Caches
The cache is divided into independent banks that can support simultaneous accesses, like interleaved memory banks. E.g., the T1 (Niagara) L2 has 4 banks. Banking works best when the accesses naturally spread themselves across the banks, so the mapping of addresses to banks affects the behavior of the memory system. A simple mapping that works well is sequential interleaving.
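With sequential interleaving, consecutive block addresses go to consecutive banks. A minimal sketch (the 4 banks and 64-byte blocks are assumed values):

```python
NUM_BANKS = 4      # assumed, e.g. a T1-style 4-banked L2
BLOCK_SIZE = 64    # bytes (assumed)

def bank_of(addr):
    """Sequential interleaving: block address modulo the number of banks."""
    block_addr = addr // BLOCK_SIZE
    return block_addr % NUM_BANKS

# Consecutive blocks map to banks 0,1,2,3,0,1,... so streaming accesses spread across banks.
print([bank_of(b * BLOCK_SIZE) for b in range(8)])  # [0, 1, 2, 3, 0, 1, 2, 3]
```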

Figure: block-to-bank mappings — a single-banked cache, two banks with consecutive blocks per bank, two banks with interleaving, and two banks with group interleaving.

Summary: Increasing Cache Bandwidth
1. Increase bandwidth via pipelined cache access.
2. Increase bandwidth via multibanked caches.
3. Increase bandwidth via nonblocking caches.

How to Improve Cache Performance?
AMAT = Hit time + Miss rate × Miss penalty
1. Reduce the time to hit in the cache: small and simple caches, avoiding address translation, way prediction, and trace caches.
2. Increase cache bandwidth: pipelined cache access, multibanked caches, and nonblocking caches.
3. Reduce the miss penalty: multilevel caches, critical word first, read misses prior to writes, merging write buffers, and victim caches.
4. Reduce the miss rate: larger block size, larger cache size, higher associativity, and compiler optimizations.
5. Reduce the miss penalty and miss rate via parallelism: hardware prefetching and compiler prefetching.

1st Miss Penalty Reduction Technique: Multilevel Caches
This technique focuses on the interface between the cache and main memory. Add a second-level cache between main memory and a small, fast first-level cache, to make the overall cache both fast and large. The smaller first-level cache is fast enough to match the clock cycle time of the fast CPU and to fit on the chip with the CPU, thereby reducing hit time. The second-level cache can be large enough to capture many of the memory accesses that would otherwise go to main memory, thereby reducing the effective miss penalty.

Parameters of a Multilevel Cache
The performance of a two-level cache is calculated in a similar way to that of a single-level cache: the miss penalty for level 1 is obtained from the hit time, miss rate, and miss penalty of the level 2 cache.
L2 equations:
AMAT = Hit time_L1 + Miss rate_L1 × Miss penalty_L1
Miss penalty_L1 = Hit time_L2 + Miss rate_L2 × Miss penalty_L2
AMAT = Hit time_L1 + Miss rate_L1 × (Hit time_L2 + Miss rate_L2 × Miss penalty_L2)
where Miss rate_L1 = Misses_L1 / Memory accesses and Miss rate_L2 = Misses_L2 / Misses_L1.
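A small Python sketch of the two-level AMAT formula; the hit times, miss rates, and penalty are assumed example numbers:

```python
def two_level_amat(hit_l1, miss_rate_l1, hit_l2, miss_rate_l2, penalty_l2):
    """AMAT = HitTime_L1 + MissRate_L1 * (HitTime_L2 + MissRate_L2 * MissPenalty_L2)."""
    miss_penalty_l1 = hit_l2 + miss_rate_l2 * penalty_l2
    return hit_l1 + miss_rate_l1 * miss_penalty_l1

# Assumed values: 1-cycle L1 hit, 4% L1 miss rate, 10-cycle L2 hit,
# 50% local L2 miss rate, 200-cycle main memory penalty.
print(two_level_amat(1, 0.04, 10, 0.50, 200))  # 1 + 0.04 * (10 + 100) = 5.4 cycles
```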

Two Concepts for a Two-Level Cache
Definitions:
Local miss rate — misses in this cache divided by the total number of memory accesses to this cache (this is Miss rate_L2).
Global miss rate — misses in this cache divided by the total number of memory accesses generated by the CPU.
Using the terms above, the global miss rate for the first-level cache is still just Miss rate_L1, but for the second-level cache it is:
Global miss rate_L2 = Misses_L2 / Memory accesses = (Misses_L2 / Misses_L1) × (Misses_L1 / Memory accesses) = Miss rate_L1 × Miss rate_L2
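To make the distinction concrete, here is a tiny sketch computing local and global L2 miss rates from raw counts; the counts are made up for illustration:

```python
# Assumed counts: 1000 memory accesses from the CPU, 40 L1 misses, 20 L2 misses.
memory_accesses, misses_l1, misses_l2 = 1000, 40, 20

miss_rate_l1 = misses_l1 / memory_accesses          # 0.04
local_miss_rate_l2 = misses_l2 / misses_l1          # 0.50: measured against L2's own accesses
global_miss_rate_l2 = misses_l2 / memory_accesses   # 0.02 = miss_rate_l1 * local_miss_rate_l2

print(miss_rate_l1, local_miss_rate_l2, global_miss_rate_l2)
```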

2nd Miss Penalty Reduction Technique: Critical Word First and Early Restart
Don't wait for the full block to be loaded before restarting the CPU.
Critical word first — request the missed word first from memory and send it to the CPU as soon as it arrives; let the CPU continue execution while the rest of the words in the block are filled in. Also called wrapped fetch or requested word first.
Early restart — as soon as the requested word of the block arrives, send it to the CPU and let the CPU continue execution.
These techniques are generally useful only with large blocks. Because of spatial locality the CPU tends to want the next sequential word anyway, so it is not clear how much early restart actually helps.

Example: Critical Word First
Assume a 64-byte cache block and an L2 that takes 11 CLK to get the critical 8 bytes (as in the AMD Athlon) and then 2 CLK per 8 bytes to fetch the rest of the block. Assuming there will be no other accesses to the rest of the block, calculate the average miss penalty with critical word first. Then, assuming the following instructions read data sequentially 8 bytes at a time from the rest of the block, compare the times with and without critical word first.
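A hedged worked calculation for this example; the timing model below is my reading of the stated assumptions, in the style of the Hennessy & Patterson answer:

```python
BLOCK_BYTES, CHUNK_BYTES = 64, 8
CHUNKS = BLOCK_BYTES // CHUNK_BYTES          # 8 transfers of 8 bytes
FIRST_CHUNK_CLK, NEXT_CHUNK_CLK = 11, 2

# With critical word first, the CPU restarts after the critical 8 bytes arrive:
miss_penalty_cwf = FIRST_CHUNK_CLK                              # 11 CLK

# If the following instructions then read the remaining 7 chunks sequentially,
# each chunk becomes available 2 CLK after the previous one:
time_cwf = FIRST_CHUNK_CLK + (CHUNKS - 1) * NEXT_CHUNK_CLK       # 11 + 7*2 = 25 CLK

# Without critical word first, the CPU waits for the whole block before restarting:
time_whole_block = FIRST_CHUNK_CLK + (CHUNKS - 1) * NEXT_CHUNK_CLK  # 25 CLK to fill the block

print(miss_penalty_cwf, time_cwf, time_whole_block)
# With sequential use of the whole block the two schemes take about the same total time;
# critical word first mainly helps the very first access (11 vs 25 CLK).
```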

3rd Miss Penalty Reduction Technique: Giving Priority to Read Misses over Writes
If the system has a write buffer, writes can be delayed until after reads. The system must, however, be careful to check the write buffer to see whether the value being read is still waiting to be written.

Write Buffer
Write-back caches want a buffer to hold displaced blocks. Consider a read miss that replaces a dirty block: normally we would write the dirty block to memory and then do the read; instead, copy the dirty block to a write buffer, do the read, and then do the write. The CPU stalls less, since it restarts as soon as the read is done.
Write-through caches also want write buffers, but this creates RAW conflicts with main memory reads on cache misses. If we simply wait for the write buffer to empty, the read miss penalty may increase (by 50% on the old MIPS 1000). Instead, check the write buffer contents before the read; if there are no conflicts, let the memory access continue.
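A minimal sketch of "check the write buffer before the read": on a read miss we first look for the address among the pending writes and forward the data if found. The data structure and names are illustrative assumptions:

```python
class WriteBuffer:
    def __init__(self):
        self.pending = {}          # block address -> data waiting to be written to memory

    def add_write(self, block_addr, data):
        self.pending[block_addr] = data

    def read_miss(self, block_addr, memory):
        """Before going to memory, check for a RAW conflict with a buffered write."""
        if block_addr in self.pending:
            return self.pending[block_addr]   # forward buffered data; no need to wait
        return memory[block_addr]             # no conflict: the read proceeds to memory

memory = {0x40: b"old"}
wb = WriteBuffer()
wb.add_write(0x40, b"new")
print(wb.read_miss(0x40, memory))  # b"new" (served from the write buffer)
```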

4th Miss Penalty Reduction Technique: Merging Write Buffer
Replace one-word writes with multiword writes to improve the buffer's efficiency. In a write-through cache, when a write misses and the buffer already contains other modified blocks, the addresses can be checked to see whether the address of the new data matches the address of a valid write buffer entry; if so, the new data are combined with that entry. This optimization also reduces the stalls that occur when the write buffer is full.
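A toy sketch of write merging: each buffer entry covers one block, and a new word whose block address matches an existing valid entry is merged into it instead of taking a fresh entry. The block size and entry count are assumed values:

```python
BLOCK_SIZE = 32    # bytes per write-buffer entry (assumed)
NUM_ENTRIES = 4    # write-buffer entries (assumed)

buffer = {}        # block address -> {offset: word}

def buffered_write(addr, word):
    """Merge the write into an existing entry for its block if possible."""
    block, offset = addr - addr % BLOCK_SIZE, addr % BLOCK_SIZE
    if block in buffer:                    # merge: no new entry consumed
        buffer[block][offset] = word
    elif len(buffer) < NUM_ENTRIES:        # otherwise allocate a new entry
        buffer[block] = {offset: word}
    else:
        raise RuntimeError("write buffer full: CPU would stall here")

for a in (100, 104, 108, 112):             # four sequential word writes...
    buffered_write(a, "data")
print(len(buffer))                          # ...occupy just 1 merged entry, not 4
```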

Figure: write merging — sequential word writes are combined into a single write-buffer entry instead of occupying one entry each.

Miss Penalty Reduction Technique: Victim Caches
A victim cache is a small (usually, but not necessarily, fully associative) cache that holds a few of the most recently replaced blocks, or victims, from the main cache. This cache is checked on a miss, before going to the next lower-level memory (main memory), to see whether it has the desired item. If found, the victim block and the cache block are swapped. The AMD Athlon has a victim cache (a write buffer for written-back blocks) with 8 entries.

Figure: the victim cache sits beside the main cache and is probed on a miss before the next lower level of the hierarchy.

How to Combine a Victim Cache
How do we keep the fast hit time of a direct-mapped cache yet still avoid conflict misses? Add a small buffer to hold data discarded from the cache. Jouppi [1990] found that a 4-entry victim cache removed 20% to 95% of the conflicts for a 4 KB direct-mapped data cache. Victim caches have been used in Alpha and HP machines.
Figure: a victim cache — a few entries, each with its own tag and comparator and one cache line of data, placed between the cache and the next lower level in the hierarchy.
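A minimal sketch of the victim-cache lookup-and-swap behavior on a miss in a direct-mapped cache; the sizes and replacement policy here are illustrative assumptions, not Jouppi's exact design:

```python
from collections import OrderedDict

NUM_SETS = 64          # direct-mapped main cache (assumed size, in blocks)
VICTIM_ENTRIES = 4     # small fully associative victim cache

main_cache = [None] * NUM_SETS                 # set index -> block address
victim = OrderedDict()                         # block address -> True, in insertion order

def access(block_addr):
    idx = block_addr % NUM_SETS
    if main_cache[idx] == block_addr:
        return "hit"
    if block_addr in victim:                   # probe the victim cache on a main-cache miss
        victim.pop(block_addr)
        if main_cache[idx] is not None:
            victim[main_cache[idx]] = True     # swap: the displaced block becomes the new victim
        main_cache[idx] = block_addr
        return "victim hit"
    if main_cache[idx] is not None:            # real miss: the evicted block goes to the victim cache
        victim[main_cache[idx]] = True
        if len(victim) > VICTIM_ENTRIES:
            victim.popitem(last=False)         # evict the oldest victim
    main_cache[idx] = block_addr
    return "miss"

# Two blocks that conflict in the direct-mapped cache now mostly hit in the victim cache.
print([access(a) for a in (0, 64, 0, 64)])     # ['miss', 'miss', 'victim hit', 'victim hit']
```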

Summary: Miss Penalty Reduction
CPU time = IC × (CPI_execution + Memory accesses / Instruction × Miss rate × Miss penalty) × Clock cycle time
1. Reduce the penalty via multilevel caches.
2. Reduce the penalty via critical word first.
3. Reduce the penalty via giving read misses priority over writes.
4. Reduce the penalty via a merging write buffer.