Computer Architecture Spring 2016


Computer Architecture, Spring 2016
Lecture 07: Caches II
Shuai Wang
Department of Computer Science and Technology, Nanjing University

Fully-Associative Cache: the address (bits 63 down to 0) splits into tag [63:6] and block offset [5:0]. Blocks are kept in cache frames, each holding state (e.g., a valid bit), a tag, and data. The incoming tag is compared against every frame's tag in parallel; any match signals a hit, and a multiplexor selects that frame's data. What happens when the cache runs out of space?

The 3 C's of Cache Misses:
- Compulsory: never accessed before
- Capacity: accessed long ago and already replaced
- Conflict: neither compulsory nor capacity (later today)
- Coherence: (to appear in the multi-core lecture)

Cache Size: Cache size is data capacity (don't count tag and state). Bigger caches can exploit temporal locality better, but bigger is not always better:
- Too large a cache: smaller is faster, bigger is slower; access time may hurt the critical path.
- Too small a cache: limited temporal locality; useful data is constantly replaced.
[Plot: hit rate vs. capacity; hit rate saturates once capacity reaches the working-set size.]

Block Size: Block size is the data that is associated with an address tag; it is not necessarily the unit of transfer between hierarchy levels.
- Too small a block: doesn't exploit spatial locality well; excessive tag overhead.
- Too large a block: useless data transferred; too few total blocks, so useful data is frequently replaced.
[Plot: hit rate vs. block size.]
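To make these size and block-size tradeoffs concrete, here is a minimal C sketch that derives a cache's geometry (frames, sets, and the tag/index/offset field widths) from its parameters. The 64 KB / 4-way / 64 B values are assumptions for illustration, not from the slides.

```c
#include <stdio.h>

/* Hypothetical parameters for illustration (not from the slides):
 * a 64 KB, 4-way set-associative cache with 64-byte blocks. */
#define CAPACITY   (64 * 1024)   /* data capacity in bytes (tag/state not counted) */
#define BLOCK_SIZE 64            /* bytes per block */
#define WAYS       4             /* associativity */

static unsigned log2u(unsigned x) {   /* x must be a power of two */
    unsigned n = 0;
    while (x > 1) { x >>= 1; n++; }
    return n;
}

int main(void) {
    unsigned frames      = CAPACITY / BLOCK_SIZE;  /* total block frames */
    unsigned sets        = frames / WAYS;          /* frames grouped into sets */
    unsigned offset_bits = log2u(BLOCK_SIZE);      /* select byte within block */
    unsigned index_bits  = log2u(sets);            /* select set */
    unsigned tag_bits    = 64 - index_bits - offset_bits;

    printf("frames=%u sets=%u offset=%u index=%u tag=%u (bits)\n",
           frames, sets, offset_bits, index_bits, tag_bits);
    return 0;   /* prints: frames=1024 sets=256 offset=6 index=8 tag=50 (bits) */
}
```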

[Figure: 8×8-bit SRAM array; a 1-of-8 decoder drives the wordlines, and data is read out on the bitlines.]

[Figure: 64×1-bit SRAM array; a 1-of-8 row decoder drives the wordlines, and a second 1-of-8 decoder controls a column mux on the bitlines.]

Direct-Mapped Cache: tag [63:16], index [15:6], block offset [5:0]. Use the middle bits as the index: a decoder selects exactly one frame, so only one tag comparison is needed, and a multiplexor delivers the data on a match (hit?). Why take the index bits out of the middle?
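As a sketch of this lookup, the following C fragment extracts the slide's tag [63:16], index [15:6], and block offset [5:0] fields and performs the single tag comparison of a direct-mapped cache. The frame count and struct layout are illustrative assumptions.

```c
#include <stdio.h>
#include <stdint.h>
#include <stdbool.h>

/* Direct-mapped lookup with the slide's field split:
 * tag = addr[63:16], index = addr[15:6], block offset = addr[5:0]. */
#define NUM_FRAMES  1024   /* 2^10 frames, one per index value */
#define BLOCK_BYTES 64

typedef struct {
    bool     valid;
    uint64_t tag;
    uint8_t  data[BLOCK_BYTES];
} frame_t;

static frame_t cache[NUM_FRAMES];

static bool dm_lookup(uint64_t addr, uint8_t *byte_out) {
    uint64_t offset = addr & 0x3F;           /* bits [5:0]  */
    uint64_t index  = (addr >> 6) & 0x3FF;   /* bits [15:6] */
    uint64_t tag    = addr >> 16;            /* bits [63:16] */

    frame_t *f = &cache[index];              /* decoder: exactly one frame */
    if (f->valid && f->tag == tag) {         /* the single tag comparison */
        *byte_out = f->data[offset];         /* multiplexor selects the byte */
        return true;                         /* hit */
    }
    return false;                            /* miss */
}

int main(void) {
    uint8_t b;
    printf("0xDEADBEEF: %s\n",
           dm_lookup(0xDEADBEEF, &b) ? "hit" : "miss");  /* cold cache: miss */
    return 0;
}
```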

Cache Conflicts: What if two blocks alias to the same frame? They have the same index but different tags. Address sequence (tag | index | block offset):
0xDEADBEEF  1101111010101101 | 1011111011 | 101111
0xFEEDBEEF  1111111011101101 | 1011111011 | 101111
0xDEADBEEF  1101111010101101 | 1011111011 | 101111
The second access to 0xDEADBEEF experiences a Conflict miss: not Compulsory (we have seen it before) and not Capacity (lots of other indexes are available in the cache).
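A quick check of the aliasing claim, using the same [15:6] index split as above (the two addresses come from the slide; nothing else is assumed):

```c
#include <assert.h>
#include <stdint.h>

int main(void) {
    uint64_t a = 0xDEADBEEF, b = 0xFEEDBEEF;
    /* Same index [15:6] and offset [5:0] (low 16 bits are 0xBEEF in both),
     * but different tags [63:16]: a direct-mapped conflict. */
    assert(((a >> 6) & 0x3FF) == ((b >> 6) & 0x3FF));  /* index = 0x2FB */
    assert((a >> 16) != (b >> 16));                    /* 0xDEAD vs 0xFEED */
    return 0;
}
```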

Associativity (1/2): Where does block address 12 (0b1100) go?
- Direct-mapped: the block goes in exactly one frame (1 frame per set).
- Set-associative: the block goes in any frame in one set (frames grouped into sets).
- Fully-associative: the block goes in any frame (all frames in 1 set).
[Figure: the same 8 frames organized as 8 sets of 1, 4 sets of 2, and 1 set of 8.]
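A tiny sketch of the mapping arithmetic, assuming the slide's 8-frame organizations (8 sets of 1, 4 sets of 2, 1 set of 8):

```c
#include <stdio.h>

/* Where does block address 12 (0b1100) go in an 8-frame cache? */
int main(void) {
    unsigned block = 12;
    printf("direct-mapped:   frame %u\n", block % 8);  /* 12 mod 8 = 4 */
    printf("2-way set-assoc: set %u (either way)\n", block % 4);  /* 12 mod 4 = 0 */
    printf("fully-assoc:     any of the 8 frames\n");
    return 0;
}
```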

Associativity (2/2):
- Larger associativity: lower miss rate (fewer conflicts); higher power consumption.
- Smaller associativity: lower cost; faster hit time.
[Plot: hit rate vs. associativity for an L1-D cache.]

N-Way Set-Associative Cache: tag [63:15], index [14:6], block offset [5:0]. The frames are organized into ways; the index selects one set, which contains one frame per way. Each way has its own decoder and tag comparator, the comparisons proceed in parallel, and a final multiplexor selects the data from the hitting way. Note the additional bit(s) moved from the index to the tag.
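A C sketch of the N-way lookup under the slide's field split (tag [63:15], index [14:6], offset [5:0]). The 2 ways and 512 sets are illustrative; what hardware does with parallel comparators is shown here as a loop over ways.

```c
#include <stdio.h>
#include <stdint.h>
#include <stdbool.h>

#define WAYS        2
#define SETS        512    /* 2^9 sets, index bits [14:6] */
#define BLOCK_BYTES 64

typedef struct {
    bool     valid;
    uint64_t tag;
    uint8_t  data[BLOCK_BYTES];
} frame_t;

static frame_t cache[SETS][WAYS];

static bool sa_lookup(uint64_t addr, uint8_t *byte_out) {
    uint64_t offset = addr & 0x3F;           /* bits [5:0]  */
    uint64_t index  = (addr >> 6) & 0x1FF;   /* bits [14:6] */
    uint64_t tag    = addr >> 15;            /* bits [63:15] */

    for (int w = 0; w < WAYS; w++) {         /* one comparator per way */
        frame_t *f = &cache[index][w];
        if (f->valid && f->tag == tag) {
            *byte_out = f->data[offset];     /* final mux picks the hitting way */
            return true;
        }
    }
    return false;                            /* miss in every way */
}

int main(void) {
    uint8_t b;
    printf("0xDEADBEEF: %s\n",
           sa_lookup(0xDEADBEEF, &b) ? "hit" : "miss");  /* cold cache: miss */
    return 0;
}
```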

Associative Block Replacement: Which block in a set should be replaced on a miss?
- Ideal replacement (Belady's Algorithm): replace the block accessed farthest in the future. Trick question: how do you implement it?
- Least Recently Used (LRU): optimized for temporal locality (expensive for >2-way).
- Not Most Recently Used (NMRU): track the MRU block, randomly select among the rest.
- Random: nearly as good as LRU, sometimes better (when?)
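As a sketch of two of these policies, the following C fragment picks an LRU or an NMRU victim within one set. The per-frame last_used counter is one illustrative way to track recency, not the only (or cheapest) one.

```c
#include <stdio.h>
#include <stdint.h>
#include <stdbool.h>
#include <stdlib.h>

#define WAYS 4

typedef struct {
    bool     valid;
    uint64_t tag;
    uint64_t last_used;   /* global access-counter value at last touch */
} frame_t;

/* LRU: evict the frame touched farthest in the past. */
static int pick_lru(const frame_t set[WAYS]) {
    int victim = 0;
    for (int w = 1; w < WAYS; w++)
        if (set[w].last_used < set[victim].last_used)
            victim = w;
    return victim;
}

/* NMRU: identify the MRU frame, then pick randomly among the rest. */
static int pick_nmru(const frame_t set[WAYS]) {
    int mru = 0;
    for (int w = 1; w < WAYS; w++)
        if (set[w].last_used > set[mru].last_used)
            mru = w;
    int victim = rand() % (WAYS - 1);     /* uniform over the non-MRU ways */
    return victim >= mru ? victim + 1 : victim;
}

int main(void) {
    frame_t set[WAYS] = {
        {true, 0xA, 7}, {true, 0xB, 3}, {true, 0xC, 9}, {true, 0xD, 5}
    };
    printf("LRU victim:  way %d\n", pick_lru(set));   /* way 1 (oldest) */
    printf("NMRU victim: way %d\n", pick_nmru(set));  /* any way but 2 */
    return 0;
}
```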

Improve Cache Performance: Average memory access time (AMAT): AMAT = Hit Time + Miss Rate × Miss Penalty. Approaches to improve cache performance:
- Reduce the miss penalty
- Reduce the miss rate
- Reduce the miss penalty or miss rate via parallelism
- Reduce the hit time

1st. Reduce Miss Penalty: Multilevel Caches. AMAT with a two-level cache:
AMAT = Hit Time_L1 + Miss Rate_L1 × Miss Penalty_L1
Miss Penalty_L1 = Hit Time_L2 + Miss Rate_L2 × Miss Penalty_L2
AMAT = Hit Time_L1 + Miss Rate_L1 × (Hit Time_L2 + Miss Rate_L2 × Miss Penalty_L2)
Definitions:
- Local miss rate: misses in this cache divided by the total number of memory accesses to this cache (Miss Rate_L2).
- Global miss rate: misses in this cache divided by the total number of memory accesses generated by the CPU (Miss Rate_L1 × Miss Rate_L2).
The global miss rate is what matters.
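A worked instance of these formulas in C, with made-up latencies and miss rates (only the formulas come from the slide):

```c
#include <stdio.h>

int main(void) {
    double hit_l1 = 1.0,  miss_rate_l1 = 0.05;   /* local = global for L1 */
    double hit_l2 = 10.0, miss_rate_l2 = 0.20;   /* local miss rate of L2 */
    double penalty_l2 = 100.0;                   /* time to main memory */

    double penalty_l1 = hit_l2 + miss_rate_l2 * penalty_l2;
    double amat = hit_l1 + miss_rate_l1 * penalty_l1;
    double global_l2 = miss_rate_l1 * miss_rate_l2;  /* what matters */

    printf("Miss Penalty L1 = %.1f cycles\n", penalty_l1);  /* 30.0  */
    printf("AMAT = %.1f cycles\n", amat);                   /* 2.5   */
    printf("Global L2 miss rate = %.3f\n", global_l2);      /* 0.010 */
    return 0;
}
```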

Local and Global Miss Rates

Performance vs. Second Level Cache Size

2nd. Reduce Miss Penalty: Critical Word First and Early Restart. Don't wait for the full block to be loaded before restarting the CPU.
- Critical Word First: request the missed word first from memory and send it to the CPU as soon as it arrives; let the CPU continue execution while the rest of the words in the block are filled in. Also called wrapped fetch and requested word first.
- Early Restart: as soon as the requested word of the block arrives, send it to the CPU and let the CPU continue execution.
Generally useful only with large blocks. Spatial locality can be a problem: the program tends to want the next sequential word, so it is not clear whether early restart helps. A sketch of the wrapped fill order follows.
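This small C program, assuming an 8-word block, prints the wrapped (critical-word-first) fill order: the missed word is delivered first, the CPU restarts, and the remaining words wrap around.

```c
#include <stdio.h>

#define WORDS_PER_BLOCK 8   /* illustrative block size */

int main(void) {
    unsigned critical = 5;   /* word within the block that the CPU wants */
    printf("fill order:");
    for (unsigned i = 0; i < WORDS_PER_BLOCK; i++)
        printf(" %u", (critical + i) % WORDS_PER_BLOCK);
    printf("\n");   /* 5 6 7 0 1 2 3 4: CPU restarts after word 5 arrives */
    return 0;
}
```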

3rd. Reduce Miss Penalty: Giving Priority to Read Misses over Writes. Write-through caches with write buffers create RAW conflicts between buffered writes and main-memory reads on cache misses. If we simply wait for the write buffer to empty, the read miss penalty may increase (by 50% on the old MIPS 1000). Instead, check the write-buffer contents before the read; if there are no conflicts, let the memory access continue. Write-back caches also want a buffer to hold displaced blocks. On a read miss replacing a dirty block, the normal approach writes the dirty block to memory and then does the read; instead, copy the dirty block to a write buffer, do the read, and then do the write. The CPU stalls less, since it restarts as soon as the read is done.
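A minimal sketch of the write-buffer check on a read miss; the entry count, block size, and field names are assumptions for illustration.

```c
#include <stdio.h>
#include <stdint.h>
#include <stdbool.h>

#define WB_ENTRIES 4
#define BLOCK_MASK (~(uint64_t)0x3F)   /* 64-byte blocks */

typedef struct {
    bool     valid;
    uint64_t block_addr;
} wb_entry_t;

static wb_entry_t write_buffer[WB_ENTRIES];

/* True if the read miss may bypass buffered writes (no RAW conflict). */
static bool read_may_proceed(uint64_t read_addr) {
    uint64_t block = read_addr & BLOCK_MASK;
    for (int i = 0; i < WB_ENTRIES; i++)
        if (write_buffer[i].valid && write_buffer[i].block_addr == block)
            return false;   /* conflict: drain (or forward) before reading */
    return true;
}

int main(void) {
    write_buffer[0] = (wb_entry_t){ true, 0x1000 };
    printf("read 0x1008: %s\n",
           read_may_proceed(0x1008) ? "proceed" : "conflict");  /* conflict */
    printf("read 0x2000: %s\n",
           read_may_proceed(0x2000) ? "proceed" : "conflict");  /* proceed  */
    return 0;
}
```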

4th. Reduce Miss Penalty: Merging Write Buffer. Merge writes to the same block into a single write-buffer entry instead of letting each write occupy its own entry; this uses buffer space more efficiently and reduces stalls when the buffer fills. A sketch follows.
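A minimal merging-write-buffer sketch in C; the entry format (block address, data bytes, dirty-byte mask) and the sizes are assumptions for illustration.

```c
#include <stdio.h>
#include <stdint.h>
#include <stdbool.h>

#define WB_ENTRIES  4
#define BLOCK_BYTES 64

typedef struct {
    bool     valid;
    uint64_t block_addr;          /* aligned to BLOCK_BYTES */
    uint8_t  data[BLOCK_BYTES];
    uint64_t byte_mask;           /* which bytes in the entry are dirty */
} wb_entry_t;

static wb_entry_t wb[WB_ENTRIES];

static bool wb_write(uint64_t addr, uint8_t byte) {
    uint64_t block  = addr & ~(uint64_t)(BLOCK_BYTES - 1);
    unsigned offset = addr & (BLOCK_BYTES - 1);

    for (int i = 0; i < WB_ENTRIES; i++) {   /* try to merge first */
        if (wb[i].valid && wb[i].block_addr == block) {
            wb[i].data[offset] = byte;
            wb[i].byte_mask |= 1ULL << offset;
            return true;
        }
    }
    for (int i = 0; i < WB_ENTRIES; i++) {   /* otherwise allocate */
        if (!wb[i].valid) {
            wb[i].valid = true;
            wb[i].block_addr = block;
            wb[i].data[offset] = byte;
            wb[i].byte_mask = 1ULL << offset;
            return true;
        }
    }
    return false;   /* buffer full: the CPU must stall until an entry drains */
}

int main(void) {
    wb_write(0x1000, 0xAA);   /* allocates an entry for block 0x1000 */
    wb_write(0x1001, 0xBB);   /* merges into the same entry */
    printf("entry 0 mask = 0x%llx\n",
           (unsigned long long)wb[0].byte_mask);   /* 0x3: bytes 0 and 1 dirty */
    return 0;
}
```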

5th. Reduce Miss Penalty: Victim Caches. Hold recently discarded data in case it is needed again.
- Add a small fully-associative victim cache between a cache and its refill path.
- The victim cache contains only blocks discarded from that cache on a miss.
- The victim cache is checked during a cache miss before going to the lower-level memory.
- If the data is found in the victim cache, the victim block and the cache block are swapped.
Example: the AMD Athlon has a victim cache with 8 entries.
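A sketch of the victim-cache probe on a miss; 8 entries as in the AMD Athlon example, but the entry format is assumed, and the swap with the evicted cache block is only indicated in a comment.

```c
#include <stdio.h>
#include <stdint.h>
#include <stdbool.h>

#define VC_ENTRIES 8

typedef struct {
    bool     valid;
    uint64_t tag;        /* full block address: fully associative */
    uint8_t  data[64];
} vc_entry_t;

static vc_entry_t victim_cache[VC_ENTRIES];

/* Probe during a cache miss: returns the matching entry index,
 * or -1 to continue to the lower-level memory. */
static int vc_probe(uint64_t block_addr) {
    for (int i = 0; i < VC_ENTRIES; i++)
        if (victim_cache[i].valid && victim_cache[i].tag == block_addr)
            return i;    /* hit: swap this entry with the evicted cache block */
    return -1;
}

int main(void) {
    victim_cache[3] = (vc_entry_t){ .valid = true, .tag = 0x40 };
    printf("probe 0x40: entry %d\n", vc_probe(0x40));   /* prints 3 */
    return 0;
}
```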

Victim Cache Access Sequence: compare a 4-way set-associative L1 cache against the same L1 plus a fully-associative victim cache. Two streams, A B C D E and J K L M N, each map five blocks into a 4-way set, so with L1 alone every access is a miss: ABCDE and JKLMN do not fit in a 4-way set-associative cache. With the victim cache, each evicted block lands in the victim cache and is swapped back in on its next access. The victim cache provides a fifth way so long as only four sets overflow into it at the same time; it can even provide 6th or 7th ways. It provides extra associativity, but not for all sets.

Reducing Miss Penalty Summary. Four techniques:
- Read priority over write on miss
- Early Restart and Critical Word First on miss
- Non-blocking caches (hit under miss, miss under miss)
- Second-level cache
These can be applied recursively to multilevel caches. The danger is that the time to DRAM will grow with multiple levels in between; first attempts at L2 caches can make things worse, since the increased worst-case penalty is worse.