Portland State University ECE 587/687. Caches and Memory-Level Parallelism


Revisiting Processor Performance

- Program execution time = (CPU clock cycles + Memory stall cycles) x clock cycle time
- For each instruction: CPI = CPI(Perfect Memory) + memory stall cycles per instruction
- With no caches, all memory requests require accessing main memory
  - Very long latency (discuss)
- Caches filter out many memory accesses
  - Reduce execution time
  - Reduce memory bandwidth
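A quick worked example with assumed numbers (illustrative only, not from the slides): with CPI(Perfect Memory) = 1.0, 0.02 misses per instruction, and a 100-cycle miss penalty,

```latex
\mathrm{CPI} = \underbrace{1.0}_{\mathrm{CPI(Perfect)}}
             + \underbrace{0.02 \times 100}_{\text{memory stall cycles per inst}}
             = 3.0
```

Even a 2% per-instruction miss rate triples the CPI, which is why the cache hierarchy dominates performance.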

Cache Performance

- Memory stall cycles per instruction = cache misses per instruction x miss penalty
- Processor performance: CPI = CPI(Perfect Cache) + miss rate x miss penalty
- Average memory access time (AMAT) = hit ratio x hit latency + miss ratio x miss penalty
- Cache hierarchies attempt to reduce average memory access time
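Continuing the same assumed numbers (1-cycle hit latency, 2% miss ratio, 100-cycle miss penalty):

```latex
\mathrm{AMAT} = 0.98 \times 1 + 0.02 \times 100 = 2.98 \ \text{cycles}
```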

Cache Performance Metrics

- Hit ratio: #hits / #accesses
- Miss ratio: #misses / #accesses
- Miss rate: misses per instruction
  - Miss rate = miss ratio x memory accesses per instruction
  - Indicates the memory intensity of the program
  - Often expressed as misses per kilo-instruction (MPKI)
- Hit time: time from when a request is issued to the cache until data is returned to the processor
  - Depends on cache design parameters: bigger caches, higher associativity, or more ports increase hit time
- Miss penalty: depends on memory hierarchy parameters
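For example (assumed numbers again): a 2% miss ratio with 1.3 memory accesses per instruction gives

```latex
\text{miss rate} = 0.02 \times 1.3 = 0.026 \ \tfrac{\text{misses}}{\text{inst}} = 26 \ \mathrm{MPKI}
```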

Why Do Caches Work?

- Spatial locality: if data at a certain address is accessed, it is likely that data at nearby addresses will also be accessed in the near future
  - Implication: cache line (block) size tradeoff
- Temporal locality: if data at a certain address is accessed, it is likely the same data will be accessed again in the near future
  - Implication: replacement algorithms try to predict which lines will be accessed again
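Both kinds of locality show up in even the simplest loop. A minimal C sketch (my own illustration, not from the slides):

```c
#include <stddef.h>

/* Spatial locality: consecutive elements of `a` share cache lines, so
 * one miss brings in data for several upcoming accesses.
 * Temporal locality: `sum` and the loop variables are reused every
 * iteration, so they stay in registers or close to the core. */
long sum_array(const long *a, size_t n) {
    long sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += a[i];
    return sum;
}
```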

Memory Hierarchy

- First-level caches: usually split I & D caches; small and fast
- Second-level caches: usually on-die SRAM cells
- Third level and more?
- Main memory: DRAM cells, with a focus on density
- Solid-state disk?
- Hard disk: usually a magnetic, non-volatile device with slow access time

(Figure: CPU with register file -> L1I$ / L1D$ -> L2 Cache -> L3/LLC -> Main Memory -> Disk)

Basic Cache Structure

- Array of blocks (lines); each block is usually 32-128 bytes
- Finding a block in the cache: the data address is split into Tag | Index | Offset
  - Offset: byte offset within the block (byte-in-block)
  - Index: which set of the cache the block maps to
  - Tag: must match the address tag stored in the cache
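Equivalently, for a block size of B bytes and S sets (symbols introduced here for convenience), the three fields of an address A are:

```latex
\text{offset} = A \bmod B, \qquad
\text{index}  = \left\lfloor A/B \right\rfloor \bmod S, \qquad
\text{tag}    = \left\lfloor A/(B \cdot S) \right\rfloor
```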

Associativity

- Set associativity
  - Set: group of blocks corresponding to the same index
  - Each block in a set is called a way
- 2-way set-associative cache: each set contains two blocks
- Direct-mapped cache: each set contains one block
- Fully-associative cache: the whole cache is one set
- Need to check all tags in a set to determine hit/miss status
- Discuss: diminishing returns, hit rate vs. hit time

Associativity (figure: direct-mapped vs. 2-way set-associative organization)

Example: Cache Block Reference

- Consider a 4-way, 32 KB cache with 64-byte lines and 48-bit physical addresses
- Number of lines = cache size / line size = 32K / 64 = 512 lines
- 4-way: each set contains 4 lines, so # of sets = 512 / 4 = 128 sets
- Offset bits = log2(64-byte line) = 6 bits
- Index bits = log2(128 sets) = 7 bits
- Tag bits = 48 - (6 + 7) = 35 bits
- Locating an address: 1101 10011101111001 (Tag | Index | Offset)
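A small C sketch of this decomposition (my own, using the slide's parameters; the example address and helper names are made up):

```c
#include <stdint.h>
#include <stdio.h>

/* 4-way, 32 KB cache, 64-byte lines, 48-bit physical addresses. */
#define OFFSET_BITS 6   /* log2(64-byte line) */
#define INDEX_BITS  7   /* log2(128 sets)     */

static uint64_t offset_of(uint64_t a) { return a & ((1ull << OFFSET_BITS) - 1); }
static uint64_t index_of(uint64_t a)  { return (a >> OFFSET_BITS) & ((1ull << INDEX_BITS) - 1); }
static uint64_t tag_of(uint64_t a)    { return a >> (OFFSET_BITS + INDEX_BITS); } /* 35 tag bits */

int main(void) {
    uint64_t a = 0x7f3a9c41beefull;  /* arbitrary 48-bit example address */
    printf("tag=0x%llx index=%llu offset=%llu\n",
           (unsigned long long)tag_of(a),
           (unsigned long long)index_of(a),
           (unsigned long long)offset_of(a));
    return 0;
}
```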

Types of Cache Misses

- Compulsory (cold) misses: first access to a block
  - Prefetching can reduce these misses
- Capacity misses: the cache cannot contain all blocks needed by the program, so some blocks are discarded
  - Replacement policies should target blocks that won't be used later
- Conflict misses: blocks mapping to the same set may be discarded (in direct-mapped and set-associative caches)
  - Increasing associativity can reduce these misses
- For multiprocessors, coherence misses can also occur (details in ECE 588/688)

Cache Replacement and Insertion

- Cache replacement policy: on a cache line fill, which victim line to replace
  - Only applicable to set-associative caches
  - Examples: LRU and more advanced policies
  - Discuss: true LRU, pseudo-LRU, stack algorithms
- Cache insertion policy: when a cache line is filled, what priority does it get in the replacement stack?
  - LRU: the filled line is inserted in the Most Recently Used position
  - Other policies: LIP, BIP, DIP
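A minimal sketch of true-LRU bookkeeping for one 4-way set, using age counters where 0 is MRU and WAYS-1 is LRU (all names are my own, not from the slides):

```c
#include <stdbool.h>
#include <stdint.h>

#define WAYS 4

struct set {
    uint64_t tag[WAYS];
    bool     valid[WAYS];
    uint8_t  age[WAYS];   /* replacement-stack position of each way */
};

/* On an access, promote `way` to MRU: everything younger ages by one. */
static void touch(struct set *s, int way) {
    for (int w = 0; w < WAYS; w++)
        if (s->age[w] < s->age[way]) s->age[w]++;
    s->age[way] = 0;
}

/* Pick a victim: any invalid way first, otherwise the LRU way. */
static int victim(const struct set *s) {
    for (int w = 0; w < WAYS; w++)
        if (!s->valid[w]) return w;
    for (int w = 0; w < WAYS; w++)
        if (s->age[w] == WAYS - 1) return w;
    return 0; /* unreachable if ages stay a permutation of 0..WAYS-1 */
}
```

Under classic LRU (MRU insertion), a fill would call touch() on the filled way; an insertion policy like LIP would instead leave the filled way at age WAYS-1, the LRU position.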

Handling Writes

- When do we write the modified data in a cache to the next level?
  - Write-through: when the write happens
  - Write-back: when the line is evicted
- Write-through:
  - (+) Simple; all levels are up to date; easy coherence
  - (-) Bandwidth intensive
- Write-back:
  - (+) Can consolidate many writes before eviction
  - (-) Needs a modified (dirty) bit in the tag store
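A compact sketch of the write-back bookkeeping described above (writeback_to_next_level is a hypothetical stub standing in for the L2/memory interface):

```c
#include <stdbool.h>
#include <stdint.h>

struct line { uint64_t tag; bool valid; bool dirty; /* data omitted */ };

static void writeback_to_next_level(struct line *l) { (void)l; /* stub */ }

/* Write-back: a write hit only marks the line dirty. */
static void write_hit(struct line *l) { l->dirty = true; }

/* On eviction, only dirty lines are sent to the next level;
 * write-through would instead have sent every write immediately. */
static void evict(struct line *l) {
    if (l->valid && l->dirty)
        writeback_to_next_level(l);
    l->valid = false;
    l->dirty = false;
}
```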

Non-Blocking Cache Hierarchy

- Superscalar processors require parallel execution units
  - Multiple pipelined functional units
  - Cache hierarchies capable of servicing multiple memory requests simultaneously
- Do not block cache references that do not need the miss data
- Service multiple miss requests to memory concurrently
- Revisit miss penalty with memory-level parallelism (see the approximation below)
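One common back-of-envelope refinement, added here as an assumption rather than a formula from the slides: if an average of MLP misses overlap in time, the effective stall time shrinks roughly in proportion:

```latex
\text{memory stall cycles} \approx
    \frac{\#\text{misses} \times \text{miss penalty}}{\mathrm{MLP}}
```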

Miss Status Holding (Handling) Registers

- MSHRs facilitate non-blocking memory-level parallelism
- Used to track address, data, and status for multiple outstanding cache misses
- Need to provide correct memory ordering, respond to CPU requests, and maintain cache coherence
- Design details vary widely between processors, but the basic functions are similar

Cache & MSHR Organization

- Main components:
  - MSHRs: one register for each miss to be handled concurrently
  - N-way comparator: compares an address to all block addresses in the MSHRs (N = # of MSHRs)
  - Input stack (fill buffer): buffer space for all misses corresponding to MSHR entries; size = # of MSHRs x block size
  - Status update and collecting networks
- Current implementations combine the MSHRs and the input stack

MSHR Structure

Each MSHR contains the following information (see the struct sketch below):
- Data address
- PC of the requesting instruction
- Input identification tags & send-to-CPU flags: allow forwarding the requested word(s) to the CPU
- Input stack indicators (flags): allow reading data directly from the input stack
- Partial write codes: indicate which bytes in a word have been written to the cache
- Valid flag
- Obsolete flag: either the info is not valid or the MSHR hit on data in transit
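A possible C rendering of such an entry (field widths and the four-target limit are assumptions, not from the slides):

```c
#include <stdbool.h>
#include <stdint.h>

#define MAX_TARGETS 4   /* CPU requests merged into one outstanding miss */

struct mshr_entry {
    bool     valid;              /* entry in use                          */
    bool     obsolete;           /* info not valid, or hit on data in transit */
    uint64_t block_addr;         /* data (block) address                  */
    uint64_t pc;                 /* PC of the requesting instruction      */
    struct {
        uint16_t id;             /* input identification tag              */
        uint8_t  word;           /* which word to forward on fill         */
        bool     send_to_cpu;    /* forward this word to the CPU?         */
        bool     in_input_stack; /* data already readable from input stack */
    } target[MAX_TARGETS];
    uint8_t  partial_write[8];   /* per-word masks of bytes already written */
};
```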

MSHR Operation

- On a cache miss, one MSHR is assigned:
  - Valid flag set, obsolete flag cleared
  - Data address saved
  - PC of the requesting instruction saved
  - Appropriate send-to-CPU flags set and the others cleared
  - Input identification tag saved in the correct position
  - Partial write codes cleared
- All MSHRs pointing to the same data address are purged
- The optimal number of MSHRs depends on the available MLP
  - Diminishing returns, since MLP is limited

Virtual vs. Physical Addressing

- Using virtual addresses to access the L1 cache reduces latency, since physical addresses require address translation first
- However, virtually-addressed caches introduce problems:
  - Handling synonyms: multiple VAs mapping to the same PA
  - Address translation is still needed on L1 misses
  - Reverse translation is needed for coherence in a multiprocessor system
  - Need to invalidate the whole cache on a context switch
- More details later when we discuss virtual memory

Reducing Cache Misses

- Cache misses are very costly
- Need multiple cache levels, with:
  - High associativity and/or victim caches to reduce conflict misses
  - Effective replacement algorithms
  - Data and instruction prefetching, preferably with multiple stream buffers
- More details in Thursday's class and next week

What's Next?

- Reading assignment: A. Jaleel et al., "High Performance Cache Replacement Using Re-Reference Interval Prediction (RRIP)," ISCA 2010 (Review: ECE587_REVIEW_10_26)
- HW1 due Thursday (10/26); submit at the D2L dropbox
- Mid-term exam next Tuesday (10/31)
  - In-class written exam
  - Covers all the topics we discussed
  - May include key concepts from the papers