Why memory hierarchy? L1 cache design. CS2410: Computer Architecture. Sangyeun Cho

Why memory hierarchy? L1 cache design
Sangyeun Cho, Computer Science Department

Memory hierarchy
- (Figure: the hierarchy runs from CPU registers through the L1 and L2 caches (SRAM) to main memory (DRAM) and hard disk (magnetics); levels closer to the CPU are smaller, faster, and more expensive per byte, while levels farther away are larger, slower, and cheaper per byte.)

Memory hierarchy goals
- To provide the CPU with the data (and instructions) it needs as quickly as possible
  - To achieve this goal, a cache should keep frequently used data
  - Cache hit: the CPU finds the requested data in the cache; hit rate = # of cache hits / # of cache accesses
  - Average memory access latency (AMAL) = hit time + (1 - hit rate) x miss penalty
  - To decrease AMAL: reduce hit time, increase hit rate, and reduce miss penalty
- To reduce traffic on the memory bus
  - The cache becomes a filter and reduces the bandwidth required from main memory
  - Typically, max. L1 bandwidth (to CPU) > max. L2 bandwidth (to L1) > max. memory bandwidth

Cache organization
- Caches use blocks, or lines (block > byte), as their granule of management
- Memory > cache: we can only keep a subset of the memory blocks
- A cache is in essence a fixed-width hash table; the memory blocks kept in a cache are therefore associated with their addresses, or "tagged" (a small data-structure sketch follows after this slide group)
- (Figure: an N-bit address selects one word from a 2^N-word memory; a cache instead looks the address up in an address (tag) table and a data table, with supporting circuitry producing a hit/miss signal and the specified data word.)

L1 cache vs. L2 cache
- Their basic parameters are similar: associativity, block size, and cache size (the capacity of the data array)
- Address used to index
  - L1: typically the virtual address (to start indexing quickly); using a virtual address causes some complexities
  - L2: typically the physical address, which is available by then
- System visibility
  - L1: not visible
  - L2: page coloring can affect the hit rate
- Hardware organization (esp. in multicores)
  - L1: private
  - L2: often shared among cores

Key questions
- Where to place a block? How to find a block? Block placement is a matter of mapping: if you have a simple rule for placing data, you can find it later using the same rule
- Which block to replace to make room for a new block?
- How to handle a write? Writes make a cache design much more complex!
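To make the "fixed-width hash table" view above concrete, here is a minimal C sketch (not from the slides) of the bookkeeping a set-associative cache keeps and of how a block is found again by the same placement rule. The parameters (64 B blocks, 64 sets, 4 ways) and all names are illustrative assumptions.

    #include <stdbool.h>
    #include <stdint.h>

    /* Illustrative parameters (not from the slides):
       4 ways x 64 sets x 64 B blocks = 16 kB of data. */
    #define BLOCK_BYTES 64
    #define NUM_SETS    64
    #define NUM_WAYS     4

    /* Per-block metadata: the tag associates the cached copy with its address. */
    typedef struct {
        bool     valid;
        bool     dirty;               /* needed only for write-back caches   */
        uint64_t tag;                 /* upper address bits                  */
        uint8_t  data[BLOCK_BYTES];   /* the cached copy of the memory block */
    } cache_line_t;

    /* The cache: a fixed-width "hash table" of sets, each holding NUM_WAYS lines. */
    typedef struct {
        cache_line_t sets[NUM_SETS][NUM_WAYS];
    } cache_t;

    /* Address decomposition: offset within the block, set index, and tag. */
    static inline uint64_t block_offset(uint64_t addr) { return addr % BLOCK_BYTES; }
    static inline uint64_t set_index(uint64_t addr)    { return (addr / BLOCK_BYTES) % NUM_SETS; }
    static inline uint64_t tag_bits(uint64_t addr)     { return addr / (BLOCK_BYTES * NUM_SETS); }

    /* Lookup: the same placement rule (set_index) is used to find the block later;
       within the set, the tag comparison decides hit or miss. */
    bool cache_lookup(const cache_t *c, uint64_t addr, int *way_out) {
        const cache_line_t *set = c->sets[set_index(addr)];
        for (int w = 0; w < NUM_WAYS; w++) {
            if (set[w].valid && set[w].tag == tag_bits(addr)) {
                *way_out = w;
                return true;   /* hit */
            }
        }
        return false;          /* miss */
    }

In this sketch a direct-mapped cache is simply the special case NUM_WAYS = 1, and a fully associative cache the case NUM_SETS = 1.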

Direct-mapped cache
- 2^B-byte blocks, 2^M entries, N-bit address
- (Figure: the N-bit address is split into an (N-B-M)-bit tag, an M-bit index, and a B-bit block offset; the index selects one of the 2^M entries, and a tag comparison (=?) produces hit/miss.)

2-way set-associative cache
- Same block size, but organized as two direct-mapped halves of 2^(M-1) entries each
- (Figure: the index selects one set; the tags of both ways are compared in parallel to produce hit/miss.)

Fully associative cache
- 2^B-byte blocks, 2^M entries, N-bit address; the address is split into an (N-B)-bit tag and a B-bit block offset
- CAM (content-addressable memory): input is the content, output is the index; used for the tag memory
- (Figure: the tag is matched against all entries at once, yielding an M-bit index into the data array and a hit/miss signal.)

Why caches work (or do not work)
- Principle of locality
  - Temporal locality: if location A is accessed now, it'll be accessed again soon
  - Spatial locality: if location A is accessed now, a nearby location (e.g., A+1) will be accessed soon
- Can you explain how locality is manifested in your program (at the source code level)? What about instructions?
- Can you write the same program twice, once with a high degree of locality and once with badly low locality? (See the two loops below.)
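As one possible answer to the last question, the two loops below (my illustration, not from the slides) sum exactly the same N integers but differ sharply in locality: the first walks the array with stride 1, the second with a large stride, so almost every access lands in a different cache block. The array size and stride are arbitrary choices.

    #include <stdio.h>

    #define N (1 << 20)          /* 1 Mi ints, illustrative size */
    static int a[N];

    /* High locality: stride-1 walk. Consecutive elements share cache blocks
       (spatial locality), and the running sum stays in a register (temporal). */
    long sum_sequential(void) {
        long sum = 0;
        for (int i = 0; i < N; i++)
            sum += a[i];
        return sum;
    }

    /* Low locality: the same N elements, visited with a large stride, so
       nearly every access touches a different cache block. */
    long sum_strided(int stride) {
        long sum = 0;
        for (int start = 0; start < stride; start++)
            for (int i = start; i < N; i += stride)
                sum += a[i];
        return sum;
    }

    int main(void) {
        for (int i = 0; i < N; i++) a[i] = 1;
        printf("%ld %ld\n", sum_sequential(), sum_strided(4096));
        return 0;
    }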

Which block to replace?
- Which block to replace, to make room for a new block on a miss? Goal: minimize the total number of misses
- Trivial in a direct-mapped cache; N choices in an N-way associative cache
- What is the optimal policy? MRU ("most remotely used": evict the block whose next use lies furthest in the future) is considered optimal, but this is an oracle scheme; we do not know the future
- Replacement approaches
  - LRU (least recently used): look at the past to predict the future
  - FIFO (first in, first out): honor the new ones
  - Random: don't remember anything
  - Cost-based: what is the cost (e.g., latency) of bringing this block in again?
- Design considerations: performance, design complexity
- (A small LRU bookkeeping sketch appears after this slide group.)

How to handle a write?
- Allocation policy (on a miss): write-allocate, no-write-allocate, write-validate
- Update policy: write-through, write-back
- Typical combinations: write-back with write-allocate; write-through with no-write-allocate

Write-through vs. write-back: Alpha 21264 example
- L1 cache: advantages of write-through + no-write-allocate
  - Simple control
  - No stalls for evicting dirty data on an L1 miss with an L2 hit
  - Avoids L1 cache pollution with results that are not read for a while
  - Avoids problems with coherence (L2 is consistent with L1)
  - Allows efficient transient-error handling: parity protection in L1 and ECC in L2
  - What about high traffic between L1 and L2, esp. in a multicore processor?
- L2 cache: advantages of write-back + write-allocate
  - Typically reduces overall bus traffic by filtering all L1 write-through traffic
  - Better able to capture the temporal locality of infrequently written memory locations
  - Provides a safety net for programs where write-allocate helps a lot: garbage-collected heaps, write-followed-by-read situations, linking loaders (if the cache is unified, it need not be flushed before execution)
- Some ISAs/caches support explicitly installing cache blocks with empty contents or common values (e.g., zero)
- (Figure: Alpha 21264 L1 organization: 64 B blocks, 2 ways, 64 x 1024 B = 64 kB.)
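As an illustration of the LRU bookkeeping mentioned above, here is a minimal C sketch for a single set, using one small age counter per way. The 4-way configuration and the counter encoding are assumptions for the example; real caches often approximate this with pseudo-LRU bits instead.

    #include <stdint.h>

    #define WAYS 4

    /* One age counter per way: 0 = most recently used, WAYS-1 = least recently used. */
    typedef struct {
        uint8_t age[WAYS];
    } lru_state_t;

    /* Start with an arbitrary total order, e.g., way w has age w. */
    void lru_init(lru_state_t *s) {
        for (int w = 0; w < WAYS; w++)
            s->age[w] = (uint8_t)w;
    }

    /* On a hit (or after a fill), promote the touched way to most recently used
       and age every way that was more recently used than it. */
    void lru_touch(lru_state_t *s, int way) {
        uint8_t old = s->age[way];
        for (int w = 0; w < WAYS; w++)
            if (s->age[w] < old)
                s->age[w]++;
        s->age[way] = 0;
    }

    /* On a miss, the victim is the least recently used way (largest age). */
    int lru_victim(const lru_state_t *s) {
        int victim = 0;
        for (int w = 1; w < WAYS; w++)
            if (s->age[w] > s->age[victim])
                victim = w;
        return victim;
    }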

More examples
- IBM Power5
  - L1I: 64 kB, 2-way, 128 B block, LRU
  - L1D: 32 kB, 4-way, 128 B block, write-through, LRU
  - L2: 1.875 MB (3 banks), 10-way, 128 B block, pseudo-LRU
- Intel Core Duo
  - L1I: 32 kB, 8-way, 64 B block, LRU
  - L1D: 32 kB, 8-way, 64 B block, LRU, write-through
  - L2: 2 MB, 8-way, 64 B line, LRU, write-back
- Sun Niagara
  - L1I: 16 kB, 4-way, 32 B block, random
  - L1D: 8 kB, 4-way, 16 B block, random, write-through, write no-allocate
  - L2: 3 MB, 4 banks, 64 B block, 12-way, write-back

Impact of caches on performance
- Average memory access latency (AMAL) = hit time + (1 - hit rate) x miss penalty
- Example 1: hit time = 1 cycle, miss penalty = 100 cycles, miss rate = 2%. Average memory access latency?
- Example 2: a 1 GHz processor; two configurations, 16 kB direct-mapped and 16 kB 2-way, with miss rates of 3% and 2%; hit time = 1 cycle, but the clock cycle time is stretched by 1.1x in the 2-way design; miss penalty = 100 ns (how many cycles?). Average memory access latency?
- (A worked calculation follows after this slide group.)

1. Reducing miss penalty
- Multi-level caches: miss penalty_L1 = hit time_L2 + miss rate_L2 x miss penalty_L2
- Critical word first and early restart: useful when the L2-to-L1 bus width is smaller than the L1 cache block
- Giving priority to read misses, esp. in dynamically scheduled processors
- Merging write buffer
- Victim caches
- Non-blocking caches, esp. in dynamically scheduled processors

Victim cache
- (Figure: without a victim cache, the L1 cache exchanges requests and replacements directly with the L2 cache; with a victim cache, L1 replacements go into a small victim cache, which in turn exchanges requests and replacements with L2.)
- Jouppi, "Improving direct-mapped cache performance by the addition of a small fully-associative cache and prefetch buffers," ISCA 1990.
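Working the two examples above through the AMAL formula (only the arithmetic is added here; the parameters are the ones given on the slide):

    #include <stdio.h>

    /* AMAL = hit time + (1 - hit rate) * miss penalty,
       i.e., hit time + miss rate * miss penalty. */
    static double amal(double hit_time, double miss_rate, double miss_penalty) {
        return hit_time + miss_rate * miss_penalty;
    }

    int main(void) {
        /* Example 1: hit time 1 cycle, miss rate 2%, miss penalty 100 cycles. */
        printf("Example 1: %.1f cycles\n", amal(1.0, 0.02, 100.0));            /* 3.0 cycles */

        /* Example 2, measured in ns on a 1 GHz processor (1 cycle = 1 ns):
           - 16 kB direct-mapped: 1 ns hit, 3% misses, 100 ns penalty (100 cycles)
           - 16 kB 2-way: cycle stretched by 1.1 => 1.1 ns hit, 2% misses,
             100 ns penalty (about 91 of the longer cycles) */
        printf("Example 2, direct-mapped: %.1f ns\n", amal(1.0, 0.03, 100.0)); /* 4.0 ns */
        printf("Example 2, 2-way:         %.1f ns\n", amal(1.1, 0.02, 100.0)); /* 3.1 ns */
        return 0;
    }

So in Example 2 the 2-way cache wins (about 3.1 ns vs. 4.0 ns per access) even though its hit time is longer, because the lower miss rate more than pays for the stretched cycle.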

Categorizing misses
- Compulsory: "I've not met this block before"
- Capacity: the working set is larger than my cache
- Conflict: the cache has space, but the blocks in use map to busy sets
- How can you measure the contributions from the different miss categories?

2. Reducing miss rate
- Larger block size: reduces compulsory misses
- Larger cache size: tackles capacity misses
- Higher associativity: attacks conflict misses
- Prefetching: relevancy issues (due to pollution)
- Pseudo-associative caches
- Compiler optimizations: loop interchange, blocking (see the blocking sketch after this slide group)

3. Reducing hit time
- Small and simple cache (e.g., a direct-mapped cache); test different cache configurations using cacti (http://quid.hpl.hp.com:9082/cacti/)
- Avoid address translation during cache indexing
- Pipeline the access

Prefetching
- The memory hierarchy generally works well: L1 cache, 1~3-cycle latency; L2 cache, 8~13-cycle latency; main memory, 100~300-cycle latency. Cache hit rates are critical to high performance
- Prefetching: if we know a priori what data we'll need from the level-n cache, get the data from level n+1 and place it in the level-n cache before it is accessed by the processor
- What are the design goals and issues? E.g., shall we load the prefetched block into L1 or not? E.g., instruction prefetch vs. data prefetch
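Of the compiler optimizations listed under "Reducing miss rate", blocking is the least obvious, so here is a sketch of a blocked matrix multiply. The matrix and tile sizes are illustrative assumptions; the point is that the blocked loops keep a working set of roughly 3 x BS x BS values in the cache and reuse each loaded tile many times before it is evicted.

    #define N  512     /* matrix dimension, illustrative              */
    #define BS  32     /* block (tile) size, tuned to the cache size  */

    /* Straightforward triple loop: for large N, each pass streams whole
       matrices through the cache and gets little reuse.
       C is assumed to be zero-initialized by the caller. */
    void matmul_naive(const double A[N][N], const double B[N][N], double C[N][N]) {
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                for (int k = 0; k < N; k++)
                    C[i][j] += A[i][k] * B[k][j];
    }

    /* Blocked (tiled) version: the three innermost loops touch only
       BS x BS tiles of A, B, and C, so their working set fits in the cache. */
    void matmul_blocked(const double A[N][N], const double B[N][N], double C[N][N]) {
        for (int ii = 0; ii < N; ii += BS)
            for (int jj = 0; jj < N; jj += BS)
                for (int kk = 0; kk < N; kk += BS)
                    for (int i = ii; i < ii + BS; i++)
                        for (int j = jj; j < jj + BS; j++)
                            for (int k = kk; k < kk + BS; k++)
                                C[i][j] += A[i][k] * B[k][j];
    }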

Evaluating a prefetching scheme
- Coverage: by applying a prefetching scheme, how many misses are covered?
- Timeliness: do prefetched data arrive early enough that the (otherwise) miss latency is effectively hidden?
- Relevance and usefulness: are prefetched blocks actually used? Do they replace other useful data?
- How would program execution time change? In particular, dynamically scheduled processors have an intrinsic ability to tolerate long latencies to a degree

Hardware schemes
- Prefetch data under hardware control: without software modification, without increasing the instruction count
- Hardware prefetch engines
  - Programmable engines: program the engine with data-access-pattern information; who programs this engine, then?
  - Automatic: the hardware detects access behavior

Stride detection
- A popular method uses a reference prediction table indexed by the load instruction's PC, with each entry holding the last address A_{i-1}, the last stride S = A_{i-1} - A_{i-2}, and other flags (e.g., confidence, time stamp, ...)
- Next address predicted for this PC: A_i = A_{i-1} + S
- In practice, simpler stream-buffer-like methods are often used; IBM Power4/5 use 8 stream buffers between L1 & L2 and between L2 & L3 (or main memory)
- (A small reference-prediction-table sketch follows after this slide group.)

Power4 example
- 8 stream buffers: ascending/descending
- Requires at least 4 sequential misses to install a stream
- Supports L2 to L1, L3 to L2, and memory to L3 prefetching
- Based on physical addresses; when a page boundary is met, stop there
- Software interface to explicitly (and quickly) install a stream
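A minimal C model (my sketch, not Power4's actual hardware) of the reference prediction table described under "Stride detection": the table size, the direct-mapped indexing by PC, and the 2-bit confidence threshold are all illustrative assumptions.

    #include <stdbool.h>
    #include <stdint.h>

    #define RPT_ENTRIES 256   /* illustrative table size */

    /* One entry per (hashed) load PC: last address, last stride, confidence. */
    typedef struct {
        uint64_t pc_tag;
        uint64_t last_addr;   /* A_{i-1}                 */
        int64_t  stride;      /* S = A_{i-1} - A_{i-2}   */
        uint8_t  confidence;  /* saturating counter      */
    } rpt_entry_t;

    static rpt_entry_t rpt[RPT_ENTRIES];

    /* Called on every executed load: update the entry for this PC and, if the
       stride has repeated often enough, return a prefetch address A_i = A_{i-1} + S. */
    bool rpt_access(uint64_t pc, uint64_t addr, uint64_t *prefetch_addr) {
        rpt_entry_t *e = &rpt[pc % RPT_ENTRIES];

        if (e->pc_tag != pc) {            /* new load: (re)install the entry */
            e->pc_tag = pc;
            e->last_addr = addr;
            e->stride = 0;
            e->confidence = 0;
            return false;
        }

        int64_t new_stride = (int64_t)(addr - e->last_addr);
        if (new_stride == e->stride && new_stride != 0) {
            if (e->confidence < 3)        /* 2-bit saturating confidence */
                e->confidence++;
        } else {
            e->confidence = 0;
            e->stride = new_stride;
        }
        e->last_addr = addr;

        if (e->confidence >= 2) {         /* stride seen repeatedly: predict A_i */
            *prefetch_addr = addr + (uint64_t)e->stride;
            return true;
        }
        return false;
    }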

Software schemes
- Use a prefetch instruction (needs ISA support)
  - Check whether the desired block is already in the cache; if not, bring the block into the cache; if yes, do nothing
  - In any case, a prefetch request is a performance hint and not a correctness requirement: not acting on it must not affect program output
- The compiler or programmer then inserts prefetch instructions into programs
- Hardware prefetch vs. software prefetch

Software prefetch example
- Original loop:

    for (int i = 0; i < 100; ++i) {
        a[i] = b[i] + c[i];
    }

- With prefetch instructions inserted, fetching four iterations ahead:

    prefetch(&b[0]);
    prefetch(&c[0]);
    for (int i = 0; i < 100; i++) {
        prefetch(&b[i+4]);
        prefetch(&c[i+4]);
        a[i] = b[i] + c[i];
    }
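The prefetch() calls on the slide are generic placeholders for an ISA prefetch instruction. As one concrete way to experiment, the GCC/Clang intrinsic __builtin_prefetch can stand in for them; the sketch below is a self-contained version of the slide's loop under that assumption. The prefetch distance of 4 is kept from the slide, though it would normally be tuned to the miss latency and loop cost.

    #include <stdio.h>

    #define N 100

    /* Map the slide's generic prefetch() onto the GCC/Clang intrinsic.
       The second argument (0) means "prefetch for read"; the third (3)
       requests high temporal locality. */
    #define prefetch(p) __builtin_prefetch((p), 0, 3)

    int main(void) {
        static double a[N], b[N], c[N];
        for (int i = 0; i < N; i++) { b[i] = i; c[i] = 2 * i; }

        prefetch(&b[0]);
        prefetch(&c[0]);
        for (int i = 0; i < N; i++) {
            /* Prefetch a few iterations ahead; guard the end of the arrays
               (the slide omits this, and the hint would be harmless anyway). */
            if (i + 4 < N) {
                prefetch(&b[i + 4]);
                prefetch(&c[i + 4]);
            }
            a[i] = b[i] + c[i];
        }

        printf("a[%d] = %f\n", N - 1, a[N - 1]);
        return 0;
    }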