Computer Systems Architecture I
CSE 560M Lecture 17
Guest Lecturer: Shakir James

Plan for Today

Announcements and Reminders
- Project demos in three weeks (Nov. 23rd)
- Questions

Today's discussion: improving cache performance
- Decrease misses
  - Cache organization
  - Prefetching
  - Compiler algorithms
- Decrease miss latency
  - Prioritize reads over writes
  - Enable multiple concurrent misses: lock-up free caches
  - Cache hierarchies

[Figure: worked example of a cache lookup. The address 10111100 is split into tag (10111), index, and offset fields; the tag is compared against the tags stored in the indexed cache set, and the selected block supplies the data ABCD/BCDD corresponding to memory addresses 10111100-10111101.]
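To make the field split concrete, here is a small C sketch that decodes an address into its three fields. The field widths (1 offset bit, 2 index bits, 5 tag bits) are assumptions inferred from the 8-bit addresses in the example, not stated on the original slide:

    #include <stdint.h>
    #include <stdio.h>

    /* Decode an address into tag / index / offset. The field widths
     * (1 offset bit, 2 index bits, 5 tag bits) are assumptions chosen
     * to match the 8-bit addresses in the example above. */
    int main(void) {
        uint32_t addr   = 0xBC;               /* 10111100 in binary */
        uint32_t offset = addr & 0x1;         /* low 1 bit           */
        uint32_t index  = (addr >> 1) & 0x3;  /* next 2 bits         */
        uint32_t tag    = addr >> 3;          /* remaining 5 bits    */

        /* Prints: tag=23 index=2 offset=0, i.e., tag 10111, index 10 */
        printf("tag=%u index=%u offset=%u\n", tag, index, offset);
        return 0;
    }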

Improving Cache Performance

CPI contributed by the cache
- Miss rate * number of cycles to handle the miss

Execution time contributed by the cache
- Clock cycle time * CPI
- Clock cycle time might depend on cache organization

To improve cache performance
- Decrease the miss rate without increasing the time to handle the miss
- Decrease the time to handle the miss (i.e., the penalty) without increasing the miss rate
- In general, larger caches require longer access times
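To make the arithmetic concrete, here is a minimal C sketch of these formulas. Every constant below is a made-up illustrative value, and the accesses-per-instruction factor is a standard refinement of the slide's per-miss formula, not something the slide specifies:

    #include <stdio.h>

    /* Sketch of: cache CPI = miss rate * miss penalty (per access),
     * execution time = IC * CPI * clock cycle time.
     * All numbers are hypothetical, chosen only for illustration. */
    int main(void) {
        double base_cpi          = 1.0;   /* CPI with a perfect cache (assumed) */
        double accesses_per_inst = 1.3;   /* memory accesses per instruction (assumed) */
        double miss_rate         = 0.02;  /* misses per access (assumed) */
        double miss_penalty      = 40.0;  /* cycles to handle a miss (assumed) */
        double clock_ns          = 0.5;   /* clock cycle time in ns (assumed) */
        long   instr_count       = 1000000;

        /* CPI contributed by the cache. */
        double cache_cpi = accesses_per_inst * miss_rate * miss_penalty;
        double total_cpi = base_cpi + cache_cpi;

        /* Execution time = IC * CPI * clock cycle time. */
        double exec_ns = instr_count * total_cpi * clock_ns;

        printf("cache CPI contribution = %.3f\n", cache_cpi);
        printf("total CPI = %.3f, execution time = %.0f ns\n", total_cpi, exec_ns);
        return 0;
    }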

Obvious Solutions to Decrease Miss Rate

Increase cache capacity
- Larger means slower
- Limitations for on-chip caches

Increase cache associativity
- Diminishing returns (perhaps after 4-way)
- More comparisons (i.e., more logic)

Make the cache look more associative than it really is

What About Cache Block Size?

For a given application, cache capacity, and associativity, there is an optimal block size

Benefits of long cache blocks
- Good for spatial locality (code, vectors)
- Reduce compulsory misses

Drawbacks
- Increase access time
- Increase inter-level transfer time
- Fewer blocks imply more conflicts

Miss Rate vs. Block Size

[Figure: plot of miss rate against block size]

Example Block Analysis

We are to choose between 32B and 64B blocks, with miss rates m32 and m64, respectively

Miss penalty components
- Send request + access time + transfer time

Example
- Send request (independent of block size): 2 cycles
- Access time (assumed independent of block size): 28 cycles
- Transfer time depends on bus width. If the bus is 64 bits wide, the transfer time for a 64B block (8 cycles) is twice that for a 32B block (4 cycles)

So 32B is better if 34 * m32 < 38 * m64

Example (cont'd)

From the text (p. C-27)
- For a 16K cache: m32 = 2.87, m64 = 2.64
- For a 64K cache: m32 = 1.35, m64 = 1.06

Thus, 32B is better for the 16K cache and 64B is better for the 64K cache
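A short C sketch works the comparison out with the numbers above (the miss penalties 34 and 38 come from 2 + 28 + transfer cycles):

    #include <stdio.h>

    /* The 32B-vs-64B block comparison from the slides: stall
     * contribution is miss rate * miss penalty; smaller wins. */
    int main(void) {
        double penalty32 = 2 + 28 + 4;   /* 34 cycles for a 32B block */
        double penalty64 = 2 + 28 + 8;   /* 38 cycles for a 64B block */

        /* Miss rates from the text (p. C-27). */
        double m32_16K = 2.87, m64_16K = 2.64;   /* 16K cache */
        double m32_64K = 1.35, m64_64K = 1.06;   /* 64K cache */

        printf("16K: 34*m32 = %.2f vs 38*m64 = %.2f -> %s\n",
               penalty32 * m32_16K, penalty64 * m64_16K,
               penalty32 * m32_16K < penalty64 * m64_16K ? "32B wins" : "64B wins");
        printf("64K: 34*m32 = %.2f vs 38*m64 = %.2f -> %s\n",
               penalty32 * m32_64K, penalty64 * m64_64K,
               penalty32 * m32_64K < penalty64 * m64_64K ? "32B wins" : "64B wins");
        return 0;
    }

This prints 97.58 vs. 100.32 for the 16K cache (32B wins) and 45.90 vs. 40.28 for the 64K cache (64B wins), matching the slide's conclusion.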

Associativity

The same kind of analysis can be used for associativity: miss rate vs. cycle time

Old conventional wisdom
- Direct-mapped caches are faster; cache access is the bottleneck for an on-chip L1, so make L1 direct-mapped
- For on-board (L2) caches, direct-mapped is 10% faster

New conventional wisdom
- A 2-way set-associative cache can be made fast enough for L1; this allows larger caches to be addressed with only the page-offset bits
- Associativity does not appear to make much difference for L2/L3 caches

Victim Cache: Bringing Associativity to a Direct-Mapped Cache

Goal: remove some of the conflict misses

Idea: place a small, fully-associative buffer behind the L1 cache and before the L2 cache

Procedure
- L1 hit: do nothing
- L1 miss on block b, hit on victim v: pay an extra cycle and swap b and v
- L1 miss on b, victim miss: add b to the victim cache (evicting an entry if necessary)

A victim buffer of 4 to 8 entries works well with a 4KB cache
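A simulator-style C sketch of the procedure above; the structure layout and the sizes are illustrative assumptions, not a hardware description:

    #include <stdbool.h>
    #include <stdint.h>

    #define L1_SETS 128   /* direct-mapped L1; size is an assumption */
    #define VC_WAYS 4     /* small fully-associative victim cache */

    typedef struct { bool valid; uint64_t blk; } Line;  /* blk = block address */

    static Line l1[L1_SETS];
    static Line vc[VC_WAYS];
    static int  vc_next;   /* FIFO replacement within the victim cache */

    /* Returns true on a hit in either the L1 or the victim cache. */
    bool cache_access(uint64_t blk) {
        uint32_t index = blk % L1_SETS;

        /* L1 hit: do nothing. */
        if (l1[index].valid && l1[index].blk == blk)
            return true;

        /* L1 miss: search the victim cache associatively. */
        for (int w = 0; w < VC_WAYS; w++) {
            if (vc[w].valid && vc[w].blk == blk) {
                /* Hit on victim v: swap b and v (costs an extra cycle). */
                Line tmp = l1[index];
                l1[index] = vc[w];
                vc[w] = tmp;
                return true;
            }
        }

        /* Miss in both: the evicted L1 block goes into the victim
         * cache, and the requested block is fetched from the next level. */
        if (l1[index].valid) {
            vc[vc_next] = l1[index];
            vc_next = (vc_next + 1) % VC_WAYS;
        }
        l1[index] = (Line){ true, blk };
        return false;
    }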

Victim Cache

[Figure: block diagram of a victim cache placed between the L1 cache and the next memory level]

Improving Hit Rate via Prefetching

Goal: bring data into the cache just in time for its use
- Not too early, otherwise it might displace other useful blocks (i.e., pollution)
- Not too late, otherwise there will be hit-wait cycles

Constrained by (among other things)
- Imprecise knowledge of the instruction stream
- Imprecise knowledge of the data stream

Prefetching: Why, What, When & Where

Why: hide memory latency, reduce cache misses

What
- Ideally, a semantic object
- In practice, a cache line or a sequence of them

When
- Ideally, just in time
- In practice, depends dynamically on the prefetching technique

Where: either the cache or a prefetch buffer

How: Hardware Prefetching

- Next-line prefetching for instructions (many processors bring in a block on a miss, and the following block as well)
- OBL (one-block lookahead) for data prefetching (many variations; can be directed by previous successes)
- Stream buffers [Jouppi, ISCA '90]: an example that works well for instructions but not for data
- Stream buffers with hardware stride-prediction mechanisms work well for scientific programs with disciplined data access patterns
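A simulator-style C sketch of one way the stride prediction in the last bullet can work: a small table indexed by load PC remembers the last address and the stride between the last two addresses, and issues a prefetch once a stride repeats. The table size, confidence threshold, and the prefetch_block hook are all assumptions for illustration:

    #include <stdint.h>

    #define RPT_ENTRIES 64   /* reference prediction table size (assumed) */

    /* One entry per load PC: last address seen and the stride
     * between the last two addresses for that load. */
    typedef struct {
        uint64_t pc;
        uint64_t last_addr;
        int64_t  stride;
        int      confidence;
    } RptEntry;

    static RptEntry rpt[RPT_ENTRIES];

    /* Hypothetical hook: in a simulator this would enqueue a
     * memory request for the given block. */
    static void prefetch_block(uint64_t addr) { (void)addr; }

    void on_load(uint64_t pc, uint64_t addr) {
        RptEntry *e = &rpt[pc % RPT_ENTRIES];
        if (e->pc == pc) {
            int64_t stride = (int64_t)(addr - e->last_addr);
            if (stride == e->stride && stride != 0) {
                /* Stride confirmed again: prefetch the next address. */
                if (++e->confidence >= 2)
                    prefetch_block(addr + stride);
            } else {
                e->stride = stride;       /* relearn the stride */
                e->confidence = 0;
            }
        } else {                          /* new load PC takes the entry */
            e->pc = pc;
            e->stride = 0;
            e->confidence = 0;
        }
        e->last_addr = addr;
    }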

How (cont'd): Software Prefetching

Use of special instructions
- touch (dcbt) in PowerPC
- Load to R31 in Alpha
- SSE prefetch in IA-32
- lfetch in IA-64

Non-binding prefetch: the prefetch is ignored if it would cause an exception

Must be inserted by software, usually via compiler analysis
- Advantage: no special hardware needed
- Drawback: increases instruction count
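As an illustration, GCC's __builtin_prefetch (which the compiler lowers to machine-specific prefetch instructions such as the ones above) can be inserted by hand. The prefetch distance of 16 elements is an assumed tuning value; the right distance depends on miss latency and loop work:

    /* Software prefetching with GCC's __builtin_prefetch.
     * Arguments: address, rw (0 = read), temporal locality (0-3). */
    double sum_array(const double *a, int n) {
        double sum = 0.0;
        for (int i = 0; i < n; i++) {
            if (i + 16 < n)   /* prefetch 16 elements ahead (assumed distance) */
                __builtin_prefetch(&a[i + 16], 0, 3);
            sum += a[i];
        }
        return sum;
    }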

Compiler and Algorithm Optimizations

Tiling or blocking
- Reuse the contents of the data cache as much as possible (e.g., structure code to work on blocks of a matrix)
- Same type of compiler analysis as for parallelizing code

Code reordering
- Particularly useful for improving I-cache performance
- Can be guided by profile information
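The standard illustration of tiling is blocked matrix multiply: each tile of the matrices is reused many times while it is resident in the data cache. The matrix dimension N and tile size B below are assumed values; B should be chosen so that the working tiles fit in the cache:

    #define N 512   /* matrix dimension (assumed; divisible by B) */
    #define B 32    /* tile size (assumed tuning parameter) */

    /* C += A * B2, processed tile by tile for data-cache reuse. */
    void matmul_tiled(const double a[N][N], const double b2[N][N],
                      double c[N][N]) {
        for (int ii = 0; ii < N; ii += B)
            for (int jj = 0; jj < N; jj += B)
                for (int kk = 0; kk < N; kk += B)
                    /* Work entirely within one B x B tile of each matrix. */
                    for (int i = ii; i < ii + B; i++)
                        for (int j = jj; j < jj + B; j++) {
                            double sum = c[i][j];
                            for (int k = kk; k < kk + B; k++)
                                sum += a[i][k] * b2[k][j];
                            c[i][j] = sum;
                        }
    }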

Reduce Cache Miss Penalty

- Give priority to reads (e.g., write buffers)
- Lock-up free (i.e., non-blocking) caches
- Cache hierarchy

Write Buffers

Reads are more important than writes. Writes to memory occur
- on every store, for a write-through (WT) cache
- on replacement of dirty lines, for a write-back (WB) cache

Hence, buffer the writes in write buffers
- FIFO queues that store data temporarily (loads must check the buffer's contents)
- Since writes have a tendency to come in bunches, the write buffer must be deep
- Allow a read miss to bypass the writes in the write buffer (when there are no conflicts)

Coalescing Write Buffers and Write Caches

- Coalescing (or merging) write buffers: writes to an address (block) already in the write buffer are combined with the buffered entry
- Write buffers can be extended into small, fully-associative write caches with a write-back strategy and a dirty bit: a natural extension for write-through caches
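A simulator-style C sketch of a coalescing write buffer with the read-bypass check from the previous slide; the depth and block size are assumptions, and per-byte valid bits (which a real buffer would track for partial-block writes) are omitted for brevity:

    #include <stdbool.h>
    #include <stdint.h>
    #include <string.h>

    #define WB_ENTRIES 8      /* write-buffer depth (assumed) */
    #define BLK_BYTES  32     /* block size (assumed) */

    typedef struct {
        bool     valid;
        uint64_t blk_addr;            /* block-aligned address */
        uint8_t  data[BLK_BYTES];
    } WbEntry;

    static WbEntry wb[WB_ENTRIES];

    /* Coalescing: a write to a block already buffered merges into that
     * entry instead of allocating a new one. Caller keeps (addr, len)
     * within one block. Returns false if the buffer is full (stall). */
    bool wb_write(uint64_t addr, const uint8_t *bytes, int len) {
        uint64_t blk = addr / BLK_BYTES * BLK_BYTES;
        int free_slot = -1;
        for (int i = 0; i < WB_ENTRIES; i++) {
            if (wb[i].valid && wb[i].blk_addr == blk) {
                memcpy(&wb[i].data[addr - blk], bytes, len);  /* merge */
                return true;
            }
            if (!wb[i].valid && free_slot < 0) free_slot = i;
        }
        if (free_slot < 0) return false;
        wb[free_slot].valid = true;
        wb[free_slot].blk_addr = blk;
        memcpy(&wb[free_slot].data[addr - blk], bytes, len);
        return true;
    }

    /* Read bypass: a read miss may go to memory ahead of the buffered
     * writes only if it does not conflict with any buffered block. */
    bool wb_read_conflicts(uint64_t addr) {
        uint64_t blk = addr / BLK_BYTES * BLK_BYTES;
        for (int i = 0; i < WB_ENTRIES; i++)
            if (wb[i].valid && wb[i].blk_addr == blk)
                return true;   /* must drain or forward first */
        return false;
    }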

Lock-up Free Caches

Idea: allow the cache to have several outstanding miss requests simultaneously (i.e., hit under miss)
- Proposed in the early 1980s, but implemented only in the late 90s/early 00s due to complexity
- A single hit under miss (a la the HP PA7100) is relatively simple
- Several outstanding misses require miss status holding registers (MSHRs)

MSHRs

Each MSHR should hold
- A valid (busy) bit
- The address of the requested cache block
- The index into the cache where the block will go
- A comparator (to prevent using a second MSHR for a miss to the same block)
- If data is to be forwarded to the CPU at the same time as to the cache, the destination register name

Implemented in the MIPS R10K, Alpha 21164, and Pentium Pro
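A C sketch of the MSHR state listed above, as a simulator might represent it; the table size and field widths are assumptions:

    #include <stdbool.h>
    #include <stdint.h>

    #define NUM_MSHRS 8   /* outstanding misses supported (assumed) */

    typedef struct {
        bool     busy;        /* valid bit */
        uint64_t blk_addr;    /* address of the requested cache block */
        uint32_t cache_index; /* where the block will be placed */
        uint8_t  dest_reg;    /* register to forward data to, if any */
        bool     forward;     /* forward to CPU as well as to the cache */
    } Mshr;

    static Mshr mshrs[NUM_MSHRS];

    /* On a miss, this plays the role of the comparator: if an MSHR
     * already tracks this block, the new miss merges with it rather
     * than allocating a second MSHR for the same block. Returns the
     * MSHR index, or -1 if all MSHRs are busy (a structural stall). */
    int mshr_allocate(uint64_t blk_addr, uint32_t index,
                      uint8_t dest_reg, bool forward) {
        int free_slot = -1;
        for (int i = 0; i < NUM_MSHRS; i++) {
            if (mshrs[i].busy && mshrs[i].blk_addr == blk_addr)
                return i;                  /* merge with existing miss */
            if (!mshrs[i].busy && free_slot < 0)
                free_slot = i;
        }
        if (free_slot < 0) return -1;      /* all MSHRs in use: stall */
        mshrs[free_slot] = (Mshr){ true, blk_addr, index, dest_reg, forward };
        return free_slot;
    }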

Cache Hierarchy

- Two and even three levels of cache are quite common now
- L2 is very large, but L1 filters many of the references, so L2's local hit rate can appear low
- L2, in general, has longer blocks and higher associativity
- The multi-level inclusion property is implemented most of the time (often with a write-through L1 and a write-back L2)
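A small C sketch of how local miss rates compose into average memory access time (AMAT) across two levels, and why L2's local miss rate can look high even when the hierarchy works well. All latencies and rates below are hypothetical:

    #include <stdio.h>

    /* Two-level AMAT: hit time + miss rate * (next level's AMAT).
     * All numbers are illustrative assumptions, not measurements. */
    int main(void) {
        double l1_hit  = 1.0;    /* L1 hit time, cycles (assumed) */
        double l2_hit  = 10.0;   /* L2 hit time, cycles (assumed) */
        double mem     = 100.0;  /* memory latency, cycles (assumed) */
        double l1_miss = 0.05;   /* L1 local miss rate (assumed) */
        double l2_miss = 0.40;   /* L2 local miss rate: looks high because
                                    L1 already filtered the well-behaved
                                    references */

        double amat = l1_hit + l1_miss * (l2_hit + l2_miss * mem);
        double global_l2_miss = l1_miss * l2_miss;  /* misses per access */

        printf("AMAT = %.2f cycles\n", amat);                   /* 3.50  */
        printf("global L2 miss rate = %.3f\n", global_l2_miss); /* 0.020 */
        return 0;
    }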

Assignment

Project demos in three weeks

Readings for Wednesday
- H&P: Sections 5.3-5.6