
Lecture-18: Cache Optimizations
CS422, Spring 2018
Biswa@CSE-IITK

Compiler Optimizations
- Loop interchange
- Merging (arrays)
- Loop fusion
- Blocking
Refer to H&P; you will need these for PA3 and PA4 too.
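A minimal C sketch of two of these transformations, under illustrative assumptions (the sizes N and B and the function names are not from the slides): loop interchange turns a strided traversal into a unit-stride one, and blocking reuses a tile of data while it is still resident in the cache.

```c
#include <stddef.h>

#define N 512   /* matrix dimension (illustrative) */
#define B 32    /* tile size; N must be a multiple of B here */

/* Loop interchange: x is row-major, so putting j in the inner loop
 * walks memory with unit stride instead of stride N. */
void interchange(double x[N][N])
{
    for (size_t i = 0; i < N; i++)
        for (size_t j = 0; j < N; j++)   /* was the outer loop */
            x[i][j] += 1.0;
}

/* Blocking (tiling) for x = x + y*z: each B x B tile of y and z is
 * reused from the cache instead of being re-fetched from memory. */
void blocked_matmul(double x[N][N], double y[N][N], double z[N][N])
{
    for (size_t jj = 0; jj < N; jj += B)
        for (size_t kk = 0; kk < N; kk += B)
            for (size_t i = 0; i < N; i++)
                for (size_t j = jj; j < jj + B; j++) {
                    double r = 0.0;
                    for (size_t k = kk; k < kk + B; k++)
                        r += y[i][k] * z[k][j];
                    x[i][j] += r;   /* accumulate across kk tiles */
                }
}

static double X[N][N], Y[N][N], Z[N][N];   /* statics are zero-initialized */

int main(void)
{
    interchange(X);
    blocked_matmul(X, Y, Z);
    return 0;
}
```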

Non-blocking Cache
- Enables cache accesses while there is a pending miss
- Enables multiple misses in parallel
- Memory-level parallelism (MLP): generating and servicing multiple memory accesses in parallel
- Why generate multiple misses? Isolated misses (A, then B, then C) serialize their latencies; parallel misses overlap in time
- Enables latency tolerance: the latencies of different misses overlap
- How to generate multiple misses? Out-of-order execution, multithreading, prefetching
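A software-level illustration of why MLP matters (a sketch; the data structures are hypothetical): in the pointer chase each load address depends on the previous load, so misses stay isolated; in the array scan all addresses are independent, so out-of-order execution over a non-blocking cache can keep several misses in flight at once.

```c
#include <stddef.h>

struct node { struct node *next; long val; };

/* Isolated misses: the address of the next load is not known until
 * the current miss returns, so miss latencies serialize. */
long chase(const struct node *p, long n)
{
    long sum = 0;
    for (long i = 0; i < n; i++) {
        sum += p->val;
        p = p->next;
    }
    return sum;
}

/* Parallel misses (MLP): every a[i] address is known up front, so a
 * non-blocking cache can service several misses concurrently. */
long scan(const long *a, long n)
{
    long sum = 0;
    for (long i = 0; i < n; i++)
        sum += a[i];
    return sum;
}

int main(void)
{
    static struct node nodes[1024];
    static long a[1024];
    for (int i = 0; i < 1024; i++)
        nodes[i].next = &nodes[(i * 617 + 1) % 1024]; /* scrambled order */
    (void)chase(nodes, 1024);
    (void)scan(a, 1024);
    return 0;
}
```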

Miss-Status Holding Registers
- Also called the miss buffer
- Keeps track of:
  - Outstanding cache misses
  - Pending load/store accesses that refer to the missing cache block
- Fields of a single MSHR:
  - Valid bit
  - Cache block address (to match incoming accesses)
  - Control/status bits (prefetch, issued to memory, which subblocks have arrived, etc.)
  - Data for each subblock
  - For each pending load/store: valid bit, type, data size, byte in block, destination register or store-buffer entry address
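The fields listed above, written out as a C struct (the names, the 16-byte subblocks, and the four-waiter capacity are assumptions for the sketch, not a real design):

```c
#include <stdbool.h>
#include <stdint.h>

#define SUBBLOCKS_PER_BLOCK 4    /* assumed subblocking of the line */
#define SUBBLOCK_BYTES      16
#define LDST_PER_MSHR       4    /* assumed pending accesses per entry */

/* One pending load/store that refers to the missing block. */
struct ldst_entry {
    bool    valid;
    bool    is_store;    /* type */
    uint8_t size;        /* data size in bytes */
    uint8_t offset;      /* byte in block */
    uint8_t dest;        /* destination register or store-buffer entry */
};

/* A single MSHR: one outstanding miss plus its waiting accesses. */
struct mshr {
    bool     valid;
    uint64_t block_addr;         /* matched against incoming accesses */
    bool     is_prefetch;        /* control/status bits */
    bool     issued_to_memory;
    uint8_t  subblocks_arrived;  /* one bit per subblock */
    uint8_t  data[SUBBLOCKS_PER_BLOCK][SUBBLOCK_BYTES];
    struct ldst_entry waiters[LDST_PER_MSHR];
};
```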

MSHRs
[Figure: organization of the MSHR file (not transcribed)]

MSHR in Action
On a cache miss:
- Search the MSHRs for a pending access to the same block
- Found: allocate a load/store entry in the matching MSHR entry
- Not found: allocate a new MSHR
- No free entry: stall
When a subblock returns from the next level of the memory hierarchy:
- Check which loads/stores are waiting for it; forward the data to the load/store unit and deallocate the load/store entry in the MSHR entry
- Write the subblock into the cache or the MSHR
- If it is the last subblock, deallocate the MSHR (after writing the block into the cache)
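A behavioral C sketch of both paths, continuing the struct sketch above (the MSHR-file size and the commented-out helpers forward_to_lsu and write_cache_block are hypothetical):

```c
#define NUM_MSHRS 8              /* assumed MSHR file size */

static struct mshr mshrs[NUM_MSHRS];

/* On a cache miss: find a matching MSHR or allocate a new one.
 * Returns NULL when the miss cannot be tracked, i.e. stall. */
struct mshr *handle_miss(uint64_t block_addr, struct ldst_entry acc)
{
    struct mshr *free_slot = NULL;
    acc.valid = true;
    for (int i = 0; i < NUM_MSHRS; i++) {
        if (mshrs[i].valid && mshrs[i].block_addr == block_addr) {
            for (int j = 0; j < LDST_PER_MSHR; j++)
                if (!mshrs[i].waiters[j].valid) {
                    mshrs[i].waiters[j] = acc;  /* found: add waiter */
                    return &mshrs[i];
                }
            return NULL;                        /* waiters full: stall */
        }
        if (!mshrs[i].valid && !free_slot)
            free_slot = &mshrs[i];
    }
    if (!free_slot)
        return NULL;                            /* no free MSHR: stall */
    *free_slot = (struct mshr){ .valid = true,  /* not found: allocate */
                                .block_addr = block_addr,
                                .issued_to_memory = true };
    free_slot->waiters[0] = acc;
    return free_slot;
}

/* When subblock sb returns from the next level of the hierarchy. */
void subblock_arrived(struct mshr *m, unsigned sb)
{
    m->subblocks_arrived |= 1u << sb;
    for (int j = 0; j < LDST_PER_MSHR; j++)
        if (m->waiters[j].valid &&
            m->waiters[j].offset / SUBBLOCK_BYTES == sb) {
            /* forward_to_lsu(&m->waiters[j]);  (hypothetical) */
            m->waiters[j].valid = false;        /* deallocate waiter */
        }
    if (m->subblocks_arrived == (1u << SUBBLOCKS_PER_BLOCK) - 1) {
        /* write_cache_block(m->block_addr, m->data);  (hypothetical) */
        m->valid = false;   /* last subblock: deallocate the MSHR */
    }
}
```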

Early Restart and CWF
Don't wait for the full block to be loaded before restarting the CPU.
- Early restart: as soon as the requested word of the block arrives, send it to the CPU and let the CPU continue execution.
- Critical word first (CWF): request the missed word first from memory and send it to the CPU as soon as it arrives; the CPU continues execution while the rest of the words in the block are filled in. Also called wrapped fetch or requested word first.
Generally useful only with large blocks. Spatial locality can be a problem: the CPU tends to want the next sequential word soon, so the benefit of early restart is not clear-cut.
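A tiny sketch of the wrapped-fetch return order (an 8-word block is assumed): the critical word arrives first and the rest of the block wraps around.

```c
#include <stdio.h>

#define WORDS_PER_BLOCK 8   /* assumed block size in words */

/* Order in which memory returns the words of a block under
 * critical word first / wrapped fetch. */
void wrapped_fetch_order(int critical_word)
{
    for (int i = 0; i < WORDS_PER_BLOCK; i++)
        printf("%d ", (critical_word + i) % WORDS_PER_BLOCK);
    putchar('\n');
}

int main(void)
{
    wrapped_fetch_order(5);   /* miss on word 5 prints: 5 6 7 0 1 2 3 4 */
    return 0;
}
```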

Victim Cache
How can we combine the fast hit time of a direct-mapped cache and still avoid conflict misses? Add a small buffer that holds data discarded from the cache.
- Jouppi [1990]: a 4-entry victim cache removed 20% to 95% of the conflict misses of a 4 KB direct-mapped data cache
- Used in Alpha and HP machines
[Figure: a fully associative victim cache; each entry has a tag with comparator and one cache line of data, connected to the next lower level in the hierarchy]
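A behavioral sketch of the victim-cache probe (entry count per Jouppi's 4-entry example; the struct and swap details are assumptions): on a main-cache miss the victim cache is searched fully associatively, and on a victim hit the victim block and the just-evicted block swap places.

```c
#include <stdbool.h>
#include <stdint.h>

#define VC_ENTRIES 4   /* 4-entry victim cache, as in Jouppi [1990] */

struct line { bool valid; uint64_t tag; /* data payload omitted */ };

static struct line victim[VC_ENTRIES];

/* Called on a main-cache miss. `evicted` is the line just displaced
 * from the direct-mapped cache. On a victim hit the two lines swap,
 * so the conflict miss is served without going to the next level. */
bool victim_probe(uint64_t miss_tag, struct line *evicted)
{
    for (int i = 0; i < VC_ENTRIES; i++)
        if (victim[i].valid && victim[i].tag == miss_tag) {
            struct line hit = victim[i];
            victim[i] = *evicted;   /* displaced line enters the VC */
            *evicted  = hit;        /* victim line refills the cache */
            return true;
        }
    /* Victim miss: fetch from the next level; the displaced line
     * would still be installed in the VC (replacement not shown). */
    return false;
}
```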

Trace Cache
Key idea: pack multiple non-contiguous basic blocks into one contiguous trace cache line.
[Figure: a dynamic instruction sequence crossing several branches (BR) packed into a single trace line]
- A single fetch brings in multiple basic blocks
- The trace cache is indexed by the start address and the next n branch predictions
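A lookup sketch (set count, trace length, and the 3 embedded predictions are assumptions): a trace line hits only when both its start address and the predicted branch path match.

```c
#include <stdbool.h>
#include <stdint.h>

#define TC_SETS     256   /* assumed trace cache sets */
#define TC_BRANCHES 3     /* assumed branch predictions per trace */
#define TC_INSTS    16    /* assumed max instructions per trace */

struct trace_line {
    bool     valid;
    uint64_t start_pc;            /* trace start address (tag) */
    uint8_t  path;                /* taken/not-taken bits of the trace */
    uint32_t insts[TC_INSTS];     /* packed basic blocks */
};

static struct trace_line tcache[TC_SETS];

/* Index by start address; hit requires the next TC_BRANCHES
 * predictions to match the path embedded in the trace. */
struct trace_line *tc_lookup(uint64_t pc, uint8_t preds)
{
    struct trace_line *t = &tcache[(pc >> 2) % TC_SETS];
    uint8_t mask = (1u << TC_BRANCHES) - 1;
    if (t->valid && t->start_pc == pc && (t->path & mask) == (preds & mask))
        return t;   /* one fetch supplies multiple basic blocks */
    return NULL;    /* rebuild the trace from the ordinary i-cache */
}
```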

Multi-Banked Cache
Rather than treating the cache as a single monolithic block, divide it into independent banks that can support simultaneous accesses.
- E.g., the T1 ("Niagara") L2 has 4 banks
- Banking works best when the accesses naturally spread themselves across the banks; the mapping of addresses to banks affects the behavior of the memory system
- A simple mapping that works well is sequential interleaving: spread block addresses sequentially across the banks. E.g., with 4 banks, bank 0 holds all blocks whose address modulo 4 is 0, bank 1 holds all blocks whose address modulo 4 is 1, and so on.
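Sequential interleaving in one line of C (4 banks, matching the slide's example):

```c
#include <stdint.h>
#include <stdio.h>

#define NUM_BANKS 4   /* e.g., the T1 ("Niagara") L2 */

/* Consecutive block addresses map to consecutive banks, so a
 * streaming access pattern spreads evenly across the banks. */
static unsigned bank_of(uint64_t block_addr)
{
    return (unsigned)(block_addr % NUM_BANKS);
}

int main(void)
{
    for (uint64_t b = 0; b < 8; b++)   /* blocks 0..7 -> banks 0,1,2,3,0,... */
        printf("block %llu -> bank %u\n", (unsigned long long)b, bank_of(b));
    return 0;
}
```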

Way Prediction
- Way prediction selects one likely block among those in a set, so only one tag comparison is needed on a (correctly predicted) hit.
- Preserves the advantages of direct mapping (why?); on a misprediction, the other block(s) in the set are checked.
Pseudoassociative (also called column associative) caches:
- Operate exactly like a direct-mapped cache on a hit, again preserving the advantages of direct mapping.
- On a miss, another block is checked (as in a set-associative cache) by simply inverting the most significant bit of the index field to find the other block in the pseudo-set.
- Note: real hit time < pseudo-hit time.
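A sketch of the pseudo-set computation (a 10-bit index is assumed): the alternate location checked on a miss is the same index with its most significant bit inverted.

```c
#include <stdint.h>

#define INDEX_BITS 10   /* assumed: 1024 sets in the direct-mapped cache */

/* Column-associative probe: on a miss at `index`, the second probe
 * goes to the partner block found by flipping the index MSB. */
static uint32_t pseudo_partner(uint32_t index)
{
    return index ^ (1u << (INDEX_BITS - 1));
}
/* e.g., index 0x012 -> partner 0x212; a block and its partner form
 * a two-entry pseudo-set. */
```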