Computer Architecture Spring 2016


Computer Architecture, Spring 2016. Lecture 08: Caches III. Shuai Wang, Department of Computer Science and Technology, Nanjing University.

Improve Cache Performance
Average memory access time (AMAT): AMAT = Hit Time + Miss Rate × Miss Penalty
Approaches to improve cache performance:
Reduce the miss penalty
Reduce the miss rate
Reduce the miss penalty or miss rate via parallelism
Reduce the hit time
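
As a quick worked example with assumed numbers (not from the lecture): a 1-cycle hit time, a 5% miss rate, and a 20-cycle miss penalty give AMAT = 1 + 0.05 × 20 = 2 cycles. Halving the miss rate to 2.5% gives AMAT = 1 + 0.025 × 20 = 1.5 cycles, and halving the miss penalty to 10 cycles instead gives AMAT = 1 + 0.05 × 10 = 1.5 cycles, which is why each of the listed approaches can pay off.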

Reducing Misses
Classifying Misses: the 3 Cs
Compulsory: the first access to a block cannot be in the cache, so the block must be brought into the cache. Also called cold-start misses or first-reference misses. [These occur even in an infinite cache.]
Capacity: if the cache cannot contain all the blocks needed during execution of a program, capacity misses occur as blocks are discarded and later retrieved. [Misses in a fully associative cache of size X.]
Conflict: if the block-placement strategy is set-associative or direct-mapped, conflict misses (in addition to compulsory and capacity misses) occur because a block can be discarded and later retrieved when too many blocks map to its set. Also called collision misses or interference misses. [Misses in an N-way associative cache of size X.]
A more recent 4th C: Coherence - misses caused by cache coherence (e.g., invalidations from other processors).

Reduce Miss Rate: Larger Block Size, Larger Cache, Higher Associativity, Prefetching

Reduce Miss Rate: Compiler Optimizations
McFarling [1989] reduced cache misses by 75% in software on an 8KB direct-mapped cache with 4-byte blocks.
Instructions:
Reorder procedures in memory so as to reduce conflict misses
Profiling to look at conflicts (using tools they developed)
Data:
Merging Arrays: improve spatial locality by using a single array of compound elements vs. 2 separate arrays
Loop Interchange: change the nesting of loops to access data in the order it is stored in memory
Loop Fusion: combine 2 independent loops that have the same looping and some variables in common
Blocking: improve temporal locality by accessing blocks of data repeatedly vs. going down whole columns or rows

Merging Arrays Example

/* Before: 2 sequential arrays */
int val[SIZE];
int key[SIZE];

/* After: 1 array of structures */
struct merge {
    int val;
    int key;
};
struct merge merged_array[SIZE];

Reducing conflicts between val & key; improves spatial locality.

Loop Interchange Example

/* Before */
for (k = 0; k < 100; k = k+1)
    for (j = 0; j < 100; j = j+1)
        for (i = 0; i < 5000; i = i+1)
            x[i][j] = 2 * x[i][j];

/* After */
for (k = 0; k < 100; k = k+1)
    for (i = 0; i < 5000; i = i+1)
        for (j = 0; j < 100; j = j+1)
            x[i][j] = 2 * x[i][j];

Sequential accesses instead of striding through memory every 100 words; improved spatial locality.

Loop Fusion Example

/* Before */
for (i = 0; i < N; i = i+1)
    for (j = 0; j < N; j = j+1)
        a[i][j] = 1/b[i][j] * c[i][j];
for (i = 0; i < N; i = i+1)
    for (j = 0; j < N; j = j+1)
        d[i][j] = a[i][j] + c[i][j];

/* After */
for (i = 0; i < N; i = i+1)
    for (j = 0; j < N; j = j+1) {
        a[i][j] = 1/b[i][j] * c[i][j];
        d[i][j] = a[i][j] + c[i][j];
    }

Before fusion: 2 misses per access to a & c; after: one miss per access, since the second use of a[i][j] and c[i][j] now hits in the cache. Improves temporal locality.

Blocking Example

/* Before */
for (i = 0; i < N; i = i+1)
    for (j = 0; j < N; j = j+1) {
        r = 0;
        for (k = 0; k < N; k = k+1)
            r = r + y[i][k]*z[k][j];
        x[i][j] = r;
    }

The two inner loops:
Read all N×N elements of z[]
Read N elements of 1 row of y[] repeatedly
Write N elements of 1 row of x[]
Capacity misses are a function of N and the cache size: 2N³ + N² words accessed (assuming no conflict misses; otherwise it is even worse).
Idea: compute on a B×B submatrix that fits in the cache.

Blocking Example

/* After */
for (jj = 0; jj < N; jj = jj+B)
    for (kk = 0; kk < N; kk = kk+B)
        for (i = 0; i < N; i = i+1)
            for (j = jj; j < min(jj+B-1, N); j = j+1) {
                r = 0;
                for (k = kk; k < min(kk+B-1, N); k = k+1)
                    r = r + y[i][k]*z[k][j];
                x[i][j] = x[i][j] + r;
            }

B is called the blocking factor.
Capacity misses drop from 2N³ + N² to N³/B + 2N².
Conflict misses, too?

Summary: Miss Rate Reduction
3 Cs: Compulsory, Capacity, Conflict
Reduce misses via larger block size
Reduce misses via larger cache size
Reduce misses via higher associativity
Reduce misses via prefetching
Reduce misses via compiler optimizations

Improve Cache Performance
Average memory access time (AMAT): AMAT = Hit Time + Miss Rate × Miss Penalty
Approaches to improve cache performance:
Reduce the miss penalty
Reduce the miss rate
Reduce the miss penalty or miss rate via parallelism
Reduce the hit time

Nonblocking Caches to Reduce Stalls on Cache Misses
A non-blocking (lockup-free) cache allows the data cache to continue supplying cache hits during a miss:
requires F/E bits on registers or out-of-order execution
requires multi-bank memories
Hit under miss reduces the effective miss penalty by doing useful work during the miss instead of ignoring CPU requests.
Hit under multiple miss, or miss under miss, may further lower the effective miss penalty by overlapping multiple misses:
significantly increases the complexity of the cache controller, as there can be multiple outstanding memory accesses
requires multiple memory banks (otherwise the misses cannot be overlapped)
The Pentium Pro allows 4 outstanding memory misses.
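
A minimal sketch in C of the bookkeeping such a lockup-free cache needs (the structure and names below are illustrative assumptions, not the lecture's design): miss status holding registers (MSHRs) track each outstanding miss, letting later hits proceed and merging later misses to the same block.

/* Hypothetical MSHR (miss status holding register) table, sketched to show
 * why a non-blocking cache controller is more complex: each outstanding
 * miss must be tracked so hits can keep flowing and duplicate misses to
 * the same block can be merged instead of re-requested. */
#include <stdbool.h>
#include <stdint.h>

#define NUM_MSHRS 4            /* e.g., Pentium Pro-style: 4 outstanding misses */

typedef struct {
    bool     valid;            /* entry in use? */
    uint64_t block_addr;       /* block-aligned miss address */
    int      waiting_loads;    /* loads merged onto this outstanding miss */
} mshr_t;

static mshr_t mshrs[NUM_MSHRS];

/* Returns 0 if the miss was recorded (new or merged), -1 if all MSHRs are
 * busy, in which case the cache must stall this request. */
int record_miss(uint64_t block_addr)
{
    int free_slot = -1;
    for (int i = 0; i < NUM_MSHRS; i++) {
        if (mshrs[i].valid && mshrs[i].block_addr == block_addr) {
            mshrs[i].waiting_loads++;      /* miss under miss to same block: merge */
            return 0;
        }
        if (!mshrs[i].valid && free_slot < 0)
            free_slot = i;
    }
    if (free_slot < 0)
        return -1;                         /* structural stall: no MSHR available */
    mshrs[free_slot].valid = true;
    mshrs[free_slot].block_addr = block_addr;
    mshrs[free_slot].waiting_loads = 1;
    return 0;
}

With only NUM_MSHRS entries, a further miss when all entries are busy must stall, which is the structural limit behind allowing a fixed number of outstanding misses.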

Multiple Accesses per Cycle
We need high-bandwidth access to caches:
the core can make multiple access requests per cycle
multiple cores can access the LLC at the same time
Must either delay some requests, or:
design the SRAM with multiple ports (big and power-hungry), or
split the SRAM into multiple banks (can result in delays, but usually does not)

Multi-Ported SRAMs (cell diagram omitted): each port adds one wordline and two bitlines per cell, so area grows as O(ports²).

Multi-Porting vs Banking (array diagrams omitted)
Multi-porting (e.g., 4 ports): big (and slow), but guarantees concurrent access.
Banking (e.g., 4 banks, 1 port each): each bank is small (and fast), but conflicts (delays) are possible.
How to decide which bank an access goes to?

Bank Conflicts
Banks are address interleaved: for a cache with block size b and N banks, Bank = (Address / b) % N.
This looks more complicated than it is: it just takes the low-order bits of the index.
Modern processors perform hashed cache indexing, which may randomize bank and index by XORing some low-order tag bits with the bank/index bits (why XOR?).
Banking can provide high bandwidth, but only if all accesses go to different banks.
For 4 banks and 2 accesses, the chance of conflict is 25%.
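
A small sketch (illustrative block size and bank count, not the lecture's parameters) of the two bank-selection schemes described above, plain interleaving and XOR-hashed indexing:

#include <stdint.h>

#define BLOCK_SIZE 64u          /* assumed block size b, in bytes */
#define NUM_BANKS  4u           /* assumed number of banks N (power of two) */

/* Plain interleaving: Bank = (Address / b) % N, i.e. the low-order bits
 * just above the block offset. */
unsigned bank_simple(uint64_t addr)
{
    return (unsigned)((addr / BLOCK_SIZE) % NUM_BANKS);
}

/* Hashed indexing: XOR a few low-order tag bits into the bank bits, so
 * addresses with a large power-of-two stride spread across banks instead
 * of all hitting the same one. XOR is cheap and invertible, so distinct
 * addresses that share an index still get distinct (bank, index) pairs. */
unsigned bank_hashed(uint64_t addr, unsigned tag_shift)   /* tag_shift is hypothetical */
{
    uint64_t bank_bits = (addr / BLOCK_SIZE) % NUM_BANKS;
    uint64_t tag_bits  = (addr >> tag_shift) % NUM_BANKS;
    return (unsigned)(bank_bits ^ tag_bits);
}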

Improve Cache Performance
Average memory access time (AMAT): AMAT = Hit Time + Miss Rate × Miss Penalty
Approaches to improve cache performance:
Reduce the miss penalty
Reduce the miss rate
Reduce the miss penalty or miss rate via parallelism
Reduce the hit time

Reducing Hit Time: Parallel vs Serial Caches
Tag and data arrays are usually separate (the tag array is smaller and faster).
State bits are stored along with the tags: valid bit, LRU bit(s), etc.
Parallel access to tag and data reduces latency (good for L1).
Serial access to tag and data reduces power (good for L2+), since the data array is only enabled after the tag comparison hits.
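
As a rough software analogy (control flow only; the structure below is an assumption, not real hardware), the difference between the two orderings is which data ways actually get read:

#include <stdbool.h>
#include <stdint.h>

#define WAYS 4
#define LINE_BYTES 64

typedef struct {
    bool     valid[WAYS];
    uint64_t tag[WAYS];
    uint8_t  data[WAYS][LINE_BYTES];
} set_t;

/* Parallel (L1-style): read every way's data while comparing tags, then
 * select. Lower latency, but all data ways are read (more energy). */
const uint8_t *lookup_parallel(const set_t *s, uint64_t tag)
{
    const uint8_t *speculative[WAYS];
    int hit_way = -1;
    for (int w = 0; w < WAYS; w++) {
        speculative[w] = s->data[w];                 /* data read regardless of hit */
        if (s->valid[w] && s->tag[w] == tag)
            hit_way = w;                             /* tag compare "in parallel" */
    }
    return hit_way >= 0 ? speculative[hit_way] : 0;  /* null on miss */
}

/* Serial (L2+-style): compare tags first, then enable only the hit way's
 * data array. Extra latency, but only one data read (less energy). */
const uint8_t *lookup_serial(const set_t *s, uint64_t tag)
{
    for (int w = 0; w < WAYS; w++)
        if (s->valid[w] && s->tag[w] == tag)
            return s->data[w];                       /* only this way is read */
    return 0;
}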

Way-Prediction
How to combine the fast hit time of a direct-mapped cache with the lower conflict misses of a 2-way set-associative cache?
Way prediction: extra bits are kept in the cache to predict which way to try on the next cache access.
A correct prediction delivers the access latency of a direct-mapped cache; otherwise, the other ways are tried in subsequent clock cycles.
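
A sketch of a way-predicted lookup for a 2-way cache (the structure and field names are assumptions for illustration): the predicted way is probed first, and only a misprediction costs extra cycles while the remaining ways are tried.

#include <stdbool.h>
#include <stdint.h>

#define NUM_WAYS 2
#define NUM_SETS 256

typedef struct {
    bool     valid[NUM_WAYS];
    uint64_t tag[NUM_WAYS];
    uint8_t  predicted_way;    /* extra prediction bits kept per set */
} cache_set_t;

static cache_set_t sets[NUM_SETS];

/* Returns the hit way, or -1 on a miss; *extra_cycles is 0 when the
 * prediction was right, >0 when other ways had to be tried. */
int lookup(uint64_t tag, unsigned set_idx, int *extra_cycles)
{
    cache_set_t *s = &sets[set_idx];
    *extra_cycles = 0;

    int w = s->predicted_way;
    if (s->valid[w] && s->tag[w] == tag)
        return w;                          /* fast hit: direct-mapped latency */

    for (int i = 0; i < NUM_WAYS; i++) {   /* slower path: try remaining ways */
        if (i == w) continue;
        (*extra_cycles)++;
        if (s->valid[i] && s->tag[i] == tag) {
            s->predicted_way = (uint8_t)i; /* update prediction for next access */
            return i;
        }
    }
    return -1;                             /* miss */
}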

Physically-Indexed Caches (datapath figure omitted)
Core requests are virtual addresses (VAs), but the cache index is taken from the physical address (PA[15:6] in the figure) and the cache tag is PA[63:16].
The VA must pass through the TLB before the cache can be indexed, so the D-TLB sits on the critical path of every access.
If the index size is smaller than the page size, the index bits fall within the page offset and the VA can be used for the index directly.
Simple, but slow. Can we do better?

Virtually-Indexed Caches (datapath figure omitted)
Core requests are VAs; the cache index is VA[15:6] while the cache tag is still PA[63:16], so the TLB lookup proceeds in parallel with indexing (in the figure, one index bit overlaps the virtual page number).
Why not tag with the VA as well? The cache would have to be flushed on every context switch, and virtual aliases would have to be prevented or checked for on every miss.
Fast, but complicated.

Reducing Hit Time: Avoiding Address Translation during Indexing of the Cache
Send the virtual address to the cache? This is called a virtually addressed cache (or just virtual cache), as opposed to a physical cache.
Every time the process is switched, the cache logically must be flushed; otherwise we get false hits. The cost is the time to flush plus the compulsory misses from an empty cache.
We must also deal with aliases (sometimes called synonyms): two different virtual addresses that map to the same physical address.
I/O must interact with the cache, so it would need to use virtual addresses.
Solution to aliases: guarantee (in HW, or by having the OS do page coloring) that all aliases agree in the address bits covering the index field; with a direct-mapped cache they then map to the same block, so duplicates cannot coexist.
Solution to the cache flush: add a process-identifier tag alongside the address within the process, so a hit cannot occur for the wrong process.

Reducing Hit Time: Avoiding Address Translation during Indexing of the Cache
Virtually indexed, physically tagged caches: if the index comes from the physical part of the address (the page offset), tag access can start in parallel with translation, and the result is compared against the physical tag.
This limits the cache to the page size per way: what if we want a bigger cache using the same trick? Higher associativity moves the barrier to the right (more capacity for the same number of index bits), or page coloring can be used.
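
For example, with assumed parameters of 4 KB pages and 64-byte blocks: the page offset is 12 bits, the block offset takes 6 of them, so at most 6 index bits (64 sets) are available before translation completes. A direct-mapped cache is then limited to 64 × 64 B = 4 KB, while an 8-way set-associative cache built the same way can reach 8 × 4 KB = 32 KB.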

Reducing Hit Time: Pipelined Cache Access
Pipelining cache access over multiple cycles gives a fast clock cycle time but slow cache hits.
Disadvantages of pipelined cache access: it increases the pipeline depth, increases the branch misprediction penalty, and increases the number of clock cycles between the issue of a load and the use of its data.
Pipelining increases instruction bandwidth rather than reducing latency.

Reducing Hit Time: Trace Caches
A wider superscalar datapath demands higher instruction fetch bandwidth, which is limited by instruction cache bandwidth, the branch predictor, and dynamic control flow.
The instruction cache stores the static code sequences generated by the compiler, which differ from the dynamic execution sequences.
The idea of a trace cache: capture and store dynamic instruction sequences, so that multiple noncontiguous basic blocks can be fetched in one cycle instead of the multiple fetch cycles a conventional cache would need.

Write Policies
Writes are more interesting than reads: on reads, tag and data can be accessed in parallel, while writes need two steps (check the tag first, then write the data). Is access time as important for writes?
Choices of write policies:
On write hits, update memory? Yes: write-through (requires higher memory bandwidth). No: write-back (uses dirty bits to identify blocks to write back).
On write misses, allocate a cache block frame? Yes: write-allocate. No: no-write-allocate.
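
A compact sketch (hypothetical line structure, not the lecture's code) of what the two write-hit policies actually do differently with memory traffic and the dirty bit:

#include <stdbool.h>
#include <stdint.h>

typedef struct {
    bool     valid;
    bool     dirty;          /* only meaningful for write-back */
    uint64_t tag;
    uint8_t  data[64];
} cache_line_t;

/* Write-through: update the line and memory on every write hit. */
void write_hit_through(cache_line_t *line, unsigned off, uint8_t byte,
                       void (*write_mem)(uint64_t addr, uint8_t byte),
                       uint64_t addr)
{
    line->data[off] = byte;
    write_mem(addr, byte);   /* memory kept up to date; more bandwidth used */
}

/* Write-back: update only the line and mark it dirty; memory is written
 * later, when the dirty line is evicted. */
void write_hit_back(cache_line_t *line, unsigned off, uint8_t byte)
{
    line->data[off] = byte;
    line->dirty = true;      /* remember to write back on eviction */
}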

Inclusion
The core often accesses blocks not currently present on chip. Should the block be allocated in L3, L2, and L1?
Yes: inclusive caches. This wastes some space and requires forced evictions (e.g., forcing an eviction from L1 when the block is evicted from L2+).
Allocate blocks only in L1: non-inclusive caches (why not call them exclusive?). Clean lines must then be written back to the outer level on eviction.
Some processors combine both: L3 is inclusive of L1 and L2, while L2 is non-inclusive of L1 (acting like a large victim cache).

Parity & ECC
Cosmic radiation can flip a stored bit at any time, especially at high altitude or during solar flares. What can be done?
Parity: 1 bit indicates whether the number of set bits is odd or even; this detects single-bit errors.
Error Correcting Codes (ECC): an 8-bit code per 64-bit word, generally SECDED (Single-Error-Correct, Double-Error-Detect).
Detecting an error on a clean cache line is harmless: just pretend it is a cache miss and refetch the block.
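
As a small illustration (not from the slides), the parity of a 64-bit word can be computed by XOR-folding; the one stored parity bit detects any single flipped bit but cannot locate it, which is why correction needs the wider ECC code.

#include <stdint.h>

/* Compute the parity (XOR of all bits) of a 64-bit word: 1 if an odd
 * number of bits are set. Storing this one extra bit lets a single-bit
 * flip be detected (the recomputed parity will disagree with the stored
 * bit), but it cannot be located or corrected. */
unsigned parity64(uint64_t w)
{
    w ^= w >> 32;
    w ^= w >> 16;
    w ^= w >> 8;
    w ^= w >> 4;
    w ^= w >> 2;
    w ^= w >> 1;
    return (unsigned)(w & 1u);
}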