Portland State University ECE 587/687. Caches and Memory-Level Parallelism


Portland State University ECE 587/687
Caches and Memory-Level Parallelism
Copyright by Alaa Alameldeen, Zeshan Chishti, and Haitham Akkary, 2017

Revisiting Processor Performance

- Program Execution Time = (CPU clock cycles + Memory stall cycles) x clock cycle time
- For each instruction: CPI = CPI(Perfect Memory) + Memory stall cycles per instruction
- With no caches, all memory requests require accessing main memory
  - Very long latency (discuss)
- Caches filter out many memory accesses
  - Reduce execution time
  - Reduce memory bandwidth
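The CPI formula above can be made concrete with a small calculation. The numbers here (base CPI, miss frequency, penalty, cycle time) are illustrative assumptions, not values from the slides:

```python
# Assumed values: CPI(Perfect Memory) = 1.2, 0.03 misses per instruction,
# 100-cycle miss penalty, 0.5 ns clock cycle, 1M instructions.
cpi_perfect = 1.2
stalls_per_inst = 0.03 * 100          # memory stall cycles per instruction
cpi = cpi_perfect + stalls_per_inst   # effective CPI with memory stalls
inst_count = 1_000_000
cycle_time_ns = 0.5
exec_time_ns = inst_count * cpi * cycle_time_ns
print(cpi)           # 4.2
print(exec_time_ns)  # 2100000.0
```

Note how the memory stalls dominate: a base CPI of 1.2 balloons to 4.2, which is the motivation for caches filtering out most main-memory accesses.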

Cache Performance

- Memory stall cycles per instruction = cache misses per instruction x miss penalty
- Processor performance: CPI = CPI(Perfect Cache) + miss rate x miss penalty
- Average memory access time = hit ratio x hit latency + miss ratio x miss penalty
- Cache hierarchies attempt to reduce average memory access time
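A quick sketch of the average memory access time formula, with assumed values (95% hit ratio, 2-cycle hit latency, 100-cycle miss penalty; none of these come from the slides):

```python
# AMAT = hit ratio x hit latency + miss ratio x miss penalty
hit_ratio, hit_latency, miss_penalty = 0.95, 2, 100
miss_ratio = 1 - hit_ratio
amat = hit_ratio * hit_latency + miss_ratio * miss_penalty
print(amat)  # ~6.9 cycles
```

Even a 5% miss ratio more than triples the effective access time relative to the 2-cycle hit latency, which is why the hierarchy works so hard to reduce misses.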

Cache Performance Metrics

- Hit ratio: #hits / #accesses
- Miss ratio: #misses / #accesses
- Miss rate: misses per instruction (or per 1000 instructions)
  - Miss rate = miss ratio x memory accesses per instruction
- Hit time: time from when a request is issued to the cache until data is returned to the processor
  - Depends on cache design parameters
  - Bigger caches, higher associativity, or more ports increase hit time
- Miss penalty: depends on memory hierarchy parameters
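The distinction between miss ratio (per access) and miss rate (per instruction) is easy to trip over; a one-line conversion with assumed numbers (4% miss ratio, 1.3 memory accesses per instruction):

```python
# miss rate (per instruction) = miss ratio (per access) x accesses per instruction
miss_ratio = 0.04
accesses_per_inst = 1.3
miss_rate = miss_ratio * accesses_per_inst
print(miss_rate)         # ~0.052 misses per instruction
print(miss_rate * 1000)  # ~52 misses per 1000 instructions
```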

Memory Hierarchy

- First-level caches
  - Usually split I & D caches
  - Small and fast
- Second-level caches
  - Usually on-die SRAM cells
- Third level, etc.?
- Main memory
  - DRAM cells; focus on density
- Disk
  - Usually a magnetic device
  - Non-volatile
  - Slow access time

[Figure: hierarchy from Processor through L1I$ / L1D$, L2 Cache, and Main Memory down to Disk]

Basic Cache Structure

- Array of blocks (lines)
- Each block is usually 32-128 bytes
- Finding a block in the cache: data address = Tag | Index | Offset
  - Offset: byte offset within the block
  - Index: which set in the cache the block is located in
  - Tag: must match the address tag stored in the cache

Set Associativity

- Set: group of blocks corresponding to the same index
- Each block in the set is called a way
- 2-way set-associative cache: each set contains two blocks
- Direct-mapped cache: each set contains one block
- Fully-associative cache: the whole cache is one set
- Need to check all tags in a set to determine hit/miss status

Example: Cache Block Placement

- Consider a 4-way, 32KB cache with 64-byte lines
- Where is the 48-bit address 0x0000FFFFAB64?
  - Number of lines = cache size / line size = 32K / 64 = 512
  - Each set contains 4 lines, so number of sets = 512 / 4 = 128
  - Offset bits = log2(64) = 6: Offset = 0x24
  - Index bits = log2(128) = 7: Index = 0x2D
  - Tag bits = 48 - (6 + 7) = 35: Tag = 0x00007FFFD
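The slide's arithmetic can be checked with a few lines of bit manipulation, reproducing the same offset/index/tag split:

```python
# Recompute the slide's example: 4-way, 32 KB cache, 64-byte lines, 48-bit address.
addr = 0x0000FFFFAB64
cache_size, line_size, ways, addr_bits = 32 * 1024, 64, 4, 48

num_lines = cache_size // line_size              # 512 lines
num_sets = num_lines // ways                     # 128 sets
offset_bits = (line_size - 1).bit_length()       # log2(64)  = 6
index_bits = (num_sets - 1).bit_length()         # log2(128) = 7
tag_bits = addr_bits - offset_bits - index_bits  # 48 - 13   = 35

offset = addr & (line_size - 1)                  # low 6 bits
index = (addr >> offset_bits) & (num_sets - 1)   # next 7 bits
tag = addr >> (offset_bits + index_bits)         # remaining 35 bits

print(hex(offset), hex(index), hex(tag))  # 0x24 0x2d 0x7fffd
```

The masks work because line size and set count are powers of two; `x & (n - 1)` extracts the low log2(n) bits.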

Types of Cache Misses

- Compulsory (cold) misses: first access to a block
  - Prefetching can reduce these misses
- Capacity misses: the cache cannot contain all blocks needed by a program, so some blocks are discarded and later accessed again
  - Replacement policies should target blocks that won't be used later
- Conflict misses: blocks mapping to the same set may be discarded (in direct-mapped and set-associative caches)
  - Increasing associativity can reduce these misses
- For multiprocessors, coherence misses can also occur
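Conflict misses and the effect of associativity can be demonstrated with a tiny LRU cache simulator. The trace and geometry here are made up for illustration: two blocks that map to the same set, accessed alternately.

```python
from collections import OrderedDict

def misses(trace, num_sets, ways):
    """Count misses for a set-associative cache with LRU replacement.
    Trace entries are block addresses (offset bits already stripped)."""
    sets = [OrderedDict() for _ in range(num_sets)]
    count = 0
    for block in trace:
        s = sets[block % num_sets]
        if block in s:
            s.move_to_end(block)       # hit: mark most recently used
        else:
            count += 1                 # miss
            if len(s) == ways:
                s.popitem(last=False)  # evict the LRU way
            s[block] = True
    return count

# Blocks 0 and 8 conflict in an 8-set direct-mapped cache (8 % 8 == 0).
trace = [0, 8] * 4
print(misses(trace, num_sets=8, ways=1))  # 8: every access is a conflict miss
# Same capacity organized 2-way (4 sets): the set holds both blocks.
print(misses(trace, num_sets=4, ways=2))  # 2: only the two cold misses remain
```

Doubling associativity at equal capacity removes all the conflict misses in this trace; only the compulsory misses are left.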

Non-Blocking Cache Hierarchy

- Superscalar processors require parallel execution units
  - Multiple pipelined functional units
  - Cache hierarchies capable of servicing multiple memory requests simultaneously
- Do not block cache references that do not need the miss data
- Service multiple miss requests to memory concurrently
- Revisit miss penalty with memory-level parallelism
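Why memory-level parallelism changes the effective miss penalty, in round numbers (the miss count and penalty are assumed, and full overlap is an idealization):

```python
# A blocking cache serializes misses; a non-blocking cache overlaps them.
miss_penalty = 100   # cycles per miss (assumed)
outstanding = 4      # independent misses in flight (assumed)

blocking_stall = outstanding * miss_penalty  # serialized: 4 x 100
nonblocking_stall = miss_penalty             # fully overlapped: ~1 x 100
print(blocking_stall, nonblocking_stall)  # 400 100
```

In practice the overlap is partial (bandwidth limits, dependent loads), but the stall time approaches the penalty of one miss rather than the sum of all of them.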

Miss Status Holding (Handling) Registers

- MSHRs enable non-blocking caches and memory-level parallelism
- Used to track address, data, and status for multiple outstanding cache misses
- Need to provide correct memory ordering, respond to CPU requests, and maintain cache coherence
- Design details vary widely between processors, but the basic functions are similar

Cache & MSHR Organization

- Paper Fig. 1: block diagram of the cache organization
- Main components:
  - MSHRs: one register for each miss to be handled concurrently
  - N-way comparator: compares an address against the block addresses in all MSHRs (N = #MSHRs)
  - Input stack: buffer space for all misses corresponding to MSHR entries
    - Size = #MSHRs x block size
  - Status update and collecting networks
- Current implementations combine the MSHR and input stack

MSHR Structure

Each MSHR contains the following information:
- Data address
- PC of the requesting instruction
- Input identification tags & send-to-CPU flags: allow forwarding the requested word(s) to the CPU
- Input stack indicators (flags): allow reading data directly from the input stack
- Partial write codes: indicate which bytes in a word have been written to the cache
- Valid flag
- Obsolete flag: either the info is not valid or the MSHR hit on data in transit

MSHR Operation

On a cache miss, one MSHR is assigned:
- Valid flag set
- Obsolete flag cleared
- Data address saved
- PC of the requesting instruction saved
- Appropriate send-to-CPU flags set and the others cleared
- Input identification tag saved in the correct position
- Partial write codes cleared
- All MSHRs pointing to the same data address purged

Optimal number of MSHRs: paper Figure 2
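A minimal sketch of MSHR allocation, using the fields listed on the previous slide. The field names, the four-entry file, and the merge-on-secondary-miss policy are illustrative assumptions; as the slides note, real designs vary widely.

```python
class MSHR:
    """One miss status holding register (fields per the MSHR Structure slide)."""
    def __init__(self):
        self.valid = False
        self.obsolete = False
        self.block_addr = None
        self.pc = None
        self.send_to_cpu = {}    # word offset -> forward-to-CPU flag
        self.dest_tags = {}      # word offset -> input identification tag
        self.partial_write = {}  # word offset -> byte write mask

def handle_miss(mshrs, block_addr, word, pc, dest_tag):
    # Secondary miss: an MSHR already tracks this block, so merge into it.
    for m in mshrs:
        if m.valid and not m.obsolete and m.block_addr == block_addr:
            m.send_to_cpu[word] = True
            m.dest_tags[word] = dest_tag
            return m
    # Primary miss: allocate a free MSHR and initialize its fields.
    for m in mshrs:
        if not m.valid:
            m.valid, m.obsolete = True, False
            m.block_addr, m.pc = block_addr, pc
            m.send_to_cpu = {word: True}
            m.dest_tags = {word: dest_tag}
            m.partial_write = {}
            return m
    return None  # all MSHRs busy: the cache must stall this request

mshrs = [MSHR() for _ in range(4)]
a = handle_miss(mshrs, block_addr=0x40, word=0, pc=0x1000, dest_tag=7)
b = handle_miss(mshrs, block_addr=0x40, word=1, pc=0x1004, dest_tag=8)
print(a is b)  # True: the second miss to the same block merges, no new MSHR
```

Returning `None` when every entry is valid models the point on the next slide: the number of MSHRs bounds how many misses can be outstanding before the cache must block.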

Reducing Cache Misses

- Cache misses are very costly
- Need multiple cache levels, with:
  - High associativity and/or victim caches to reduce conflict misses
  - Effective replacement algorithms
  - Data and instruction prefetching, preferably with multiple stream buffers
- More details in the next class

Announcements

- Midterm exam next class (10/31)
  - Open notes
  - Bring your own calculator
- Reading assignment for 11/1:
  - N. Jouppi, "Improving Direct-Mapped Cache Performance by the Addition of a Small Fully-Associative Cache and Prefetch Buffers," ISCA 1990 (review)
  - T.-F. Chen and J.-L. Baer, "Effective Hardware-Based Data Prefetching for High-Performance Processors," IEEE Transactions on Computers, 1995 (skim)