Page 1. Multilevel Memories (Improving performance using a little cash)


Page 2 CPU-Memory Bottleneck

Performance of high-speed computers is usually limited by memory bandwidth and latency.

Latency (time for a single access):
- Memory access time >> processor cycle time

Bandwidth (number of accesses per unit time):
- If a fraction m of instructions access memory, there are 1 + m memory references per instruction.
- CPI = 1 therefore requires 1 + m memory references per cycle.
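As a back-of-the-envelope illustration of that requirement (the 30% load/store fraction below is an assumed number, not from the slides), a small C sketch:

#include <stdio.h>

int main(void) {
    /* Hypothetical instruction mix: fraction m of instructions access data memory. */
    double m = 0.30;

    /* Every instruction is fetched (1 reference) plus m data references on average. */
    double refs_per_instr = 1.0 + m;

    /* CPI = 1 means one instruction completes per cycle, so the memory system
       must sustain this many references per cycle. */
    double ipc = 1.0;
    printf("memory references per instruction: %.2f\n", refs_per_instr);
    printf("references per cycle required at CPI = 1: %.2f\n", refs_per_instr * ipc);
    return 0;
}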

Page 3 Processor-DRAM Gap (latency)

[Figure: log-scale performance vs. time, 1980-2000. Processor performance ("Moore's Law") grows about 60%/yr while DRAM performance grows about 7%/yr, so the processor-memory performance gap grows about 50%/year.]
[Courtesy of David Patterson, UC Berkeley; used with permission]

A four-issue superscalar executes 400 instructions during one cache miss!

Page 4 Multilevel Memory

Strategy: hide latency using small, fast memories called caches.

Caches are a mechanism to hide memory latency, based on the empirical observation that the stream of memory references made by a processor exhibits locality.

PC    96  loop: ADD  r2, r1, r1
     100        SUBI r3, r3, #1
     104        BNEZ r3, loop
     108
     112

What is the pattern of instruction memory addresses?

Page 5 Typical Memory Reference Patterns

[Figure: memory address vs. time for a typical program, showing three bands of references: instruction fetches (a linear sequence interrupted by n loop iterations), stack accesses, and data accesses.]

Page 6 Locality Properties of Patterns

Two locality properties of memory references:
- Temporal locality: if a location is referenced, it is likely to be referenced again in the near future.
- Spatial locality: if a location is referenced, it is likely that locations near it will be referenced in the near future.

Page 7 Caches

Caches exploit both properties of these patterns:
- Exploit temporal locality by remembering the contents of recently accessed locations.
- Exploit spatial locality by remembering blocks of contents around recently accessed locations.

Page 8 Memory Hierarchy

[Figure: the CPU connected to A, a small fast memory (RF, SRAM) that holds frequently used data, which in turn connects to B, a big slow memory (DRAM).]

- size:      Register << SRAM << DRAM       (why? due to cost)
- latency:   Register << SRAM << DRAM       (why? due to the size of DRAM)
- bandwidth: on-chip >> off-chip            (why? due to cost and wire delays; on-chip wires cost much less and are faster)

On a data access:
- hit (data is in the fast memory): low-latency access
- miss (data is not in the fast memory): long-latency access to DRAM

The fast memory is effective only if the bandwidth required at B is much smaller than at A.

Page 9 Management of Memory Hierarchy

Software managed, e.g., registers:
- part of the software-visible processor state
- software is in complete control of storage allocation
  - but hardware might do things behind software's back, e.g., register renaming

Hardware managed, e.g., caches:
- not part of the software-visible processor state
- hardware automatically decides what is kept in fast memory
  - but software may provide hints, e.g., "don't cache" or prefetch

Page 10 A Typical Memory Hierarchy, c. 2000

[Figure: CPU with a multiported register file (part of the CPU), split instruction and data primary caches (L1 Instruction Cache and L1 Data Cache, on-chip SRAM), a large unified secondary (L2) cache (on-chip or off-chip SRAM), and multiple interleaved memory banks (DRAM).]

Page 11 Inside a Cache

[Figure: the cache sits between the processor and main memory, exchanging addresses and data with both. Each cache line holds an address tag (e.g., 100, 304, 6848) and a data block that is a copy of the corresponding main-memory locations (e.g., copies of locations 100, 101, ...).]

Page 12 Cache Algorithm (Read)

Look at the processor address and search the cache tags for a match. Then either:
- Found in cache, a.k.a. HIT: return the copy of the data from the cache.
- Not in cache, a.k.a. MISS: read the block of data from main memory, wait, then return the data to the processor and update the cache. Which line do we replace?
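A minimal C sketch of this read flow, using a toy software cache model (the structure, sizes, and the naive replacement choice are illustrative assumptions, not the slides' hardware):

#include <stdint.h>
#include <stdbool.h>
#include <string.h>

#define NLINES 8
#define BLOCK_BYTES 16

/* Toy cache line: one valid bit, one tag, one block of data. */
struct line {
    bool     valid;
    uint32_t tag;                  /* block address stored in this line */
    uint8_t  data[BLOCK_BYTES];
};

static struct line cache[NLINES];

/* Stand-in for a slow main-memory read of one block. */
static void memory_read_block(uint32_t block_addr, uint8_t *dst) {
    (void)block_addr;
    memset(dst, 0, BLOCK_BYTES);   /* placeholder contents */
}

/* Read one byte through the cache, following the slide's flow. */
uint8_t cache_read(uint32_t addr) {
    uint32_t block_addr = addr / BLOCK_BYTES;
    uint32_t offset     = addr % BLOCK_BYTES;

    /* Search the cache tags for a match. */
    for (int i = 0; i < NLINES; i++) {
        if (cache[i].valid && cache[i].tag == block_addr)
            return cache[i].data[offset];          /* HIT */
    }

    /* MISS: pick a line to replace, read the block from main memory, update the cache. */
    int victim = block_addr % NLINES;              /* naive replacement choice */
    memory_read_block(block_addr, cache[victim].data);
    cache[victim].valid = true;
    cache[victim].tag   = block_addr;
    return cache[victim].data[offset];
}

int main(void) {
    (void)cache_read(0x100);   /* first access to the block: miss, fills a line */
    (void)cache_read(0x104);   /* same block: hit */
    return 0;
}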

Page 13 Placement Policy

[Figure: memory blocks 0-31 mapping into a cache of 8 blocks, shown three ways: fully associative, 2-way set associative (sets 0-3), and direct mapped.]

Where can memory block 12 be placed?
- Fully associative: anywhere in the cache.
- 2-way set associative: anywhere in set 0 (12 mod 4).
- Direct mapped: only in block 4 (12 mod 8).
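The three placement rules for block 12 can be checked with a few lines of C (the mapping of sets onto consecutive cache blocks is an assumed convention):

#include <stdio.h>

int main(void) {
    int block        = 12;                 /* memory block number from the slide */
    int cache_blocks = 8;                  /* cache size in blocks */
    int ways         = 2;                  /* associativity for the set-associative case */
    int sets         = cache_blocks / ways;/* 4 sets */

    /* Direct mapped: exactly one candidate block. */
    printf("direct mapped: block %d\n", block % cache_blocks);   /* 12 mod 8 = 4 */

    /* 2-way set associative: any block within one set. */
    int set = block % sets;                                       /* 12 mod 4 = 0 */
    printf("2-way set associative: set %d (blocks %d..%d)\n",
           set, set * ways, set * ways + ways - 1);

    /* Fully associative: any block in the cache. */
    printf("fully associative: any of blocks 0..%d\n", cache_blocks - 1);
    return 0;
}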

Page 14 Direct-Mapped Cache

[Figure: the address is split into a t-bit tag, a k-bit index, and a b-bit block offset. The index selects one of 2^k lines; each line holds a valid bit, a tag, and a data block. The stored tag is compared with the address tag; on a match, HIT is asserted and the selected data word or byte is returned.]
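A small C sketch of the address split, assuming an illustrative geometry of 32-byte blocks and 128 lines (these parameters are not from the slides):

#include <stdio.h>
#include <stdint.h>

#define B_BITS 5   /* b: block offset bits -> 32-byte blocks */
#define K_BITS 7   /* k: index bits        -> 128 lines      */

int main(void) {
    uint32_t addr = 0x12345678;

    uint32_t offset = addr & ((1u << B_BITS) - 1);
    uint32_t index  = (addr >> B_BITS) & ((1u << K_BITS) - 1);
    uint32_t tag    = addr >> (B_BITS + K_BITS);

    /* The index picks one line; the stored tag of that line is compared with 'tag'. */
    printf("addr   = 0x%08x\n", addr);
    printf("offset = %u, index = %u, tag = 0x%x\n", offset, index, tag);
    return 0;
}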

Page 15 Fully Associative Cache

[Figure: the address is split only into a tag and a block offset. Every line holds a valid bit, a tag, and a data block; the address tag is compared against all stored tags in parallel, and any match asserts HIT and selects the data word or byte.]

Page 16 Direct-Map Address Selection: higher-order vs. lower-order address bits

[Figure: the address consists of a block address (tag + index, m - b bits) and a b-bit offset; the index selects a line of (tag, status, data block), and the stored tag is compared and qualified by a valid check to produce the hit signal and the selected word.]

Should the index be taken from the lower-order or the higher-order bits of the block address?

Page 17 2-Way Set-Associative Cache

[Figure: the address is split into tag, index, and block offset. The index selects one set containing two ways; each way has a valid bit, a tag, and a data block. Both stored tags are compared with the address tag in parallel, and a match in either way asserts HIT and supplies the data word or byte.]

Page 18 Highly-Associative Caches

For high associativity, use a content-addressable memory (CAM) for the tags. What is the overhead?

[Figure: the address is split into a t-bit tag, a set index i, and a b-bit offset. Each set has a tag comparator per block; only set i is enabled, and only the data block that hits is accessed.]

The advantage is low power, because only the data of the hitting block is accessed.

Page 19 Average Read Latency Using a Cache

α is the HIT RATIO: the fraction of references found in the cache.
1 - α is the MISS RATIO: the remaining references.

Average access time with a serial search (main memory accessed only after the cache misses):
  t_c + (1 - α) t_m

Average access time with a parallel search (main memory accessed in parallel with the cache):
  α t_c + (1 - α) t_m

For which type of cache is t_c smallest?
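Plugging in made-up numbers (t_c = 1 cycle, t_m = 50 cycles, α = 0.95; none of these are from the slides), a quick C check of the two formulas:

#include <stdio.h>

int main(void) {
    double alpha = 0.95;   /* hit ratio (assumed) */
    double t_c   = 1.0;    /* cache access time in cycles (assumed) */
    double t_m   = 50.0;   /* main-memory access time in cycles (assumed) */

    /* Serial search: always pay t_c, and additionally t_m on a miss. */
    double serial = t_c + (1.0 - alpha) * t_m;

    /* Parallel search: pay t_c on a hit, t_m on a miss. */
    double parallel = alpha * t_c + (1.0 - alpha) * t_m;

    printf("serial search:   %.2f cycles\n", serial);    /* 1 + 0.05*50     = 3.50 */
    printf("parallel search: %.2f cycles\n", parallel);  /* 0.95 + 0.05*50  = 3.45 */
    return 0;
}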

Page 20 Write Policy

Cache hit:
- Write through: write both cache and memory
  - generally higher traffic, but simplifies cache coherence
- Write back: write the cache only (memory is written only when the entry is evicted)
  - a dirty bit per block can further reduce the traffic

Cache miss:
- No write allocate: only write to main memory
- Write allocate (a.k.a. fetch on write): fetch the block into the cache

Common combinations:
- write through and no write allocate
- write back with write allocate
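A rough C sketch contrasting the two common combinations in a toy direct-mapped cache model (all structures, sizes, and stubs here are illustrative assumptions, not the slides' hardware):

#include <stdint.h>
#include <stdbool.h>

#define NLINES 64
#define BLOCK_BYTES 32

struct line { bool valid, dirty; uint32_t tag; uint8_t data[BLOCK_BYTES]; };
static struct line cache[NLINES];

/* Stubs standing in for the slow path to memory. */
static void memory_write_byte(uint32_t addr, uint8_t val) { (void)addr; (void)val; }
static void memory_fetch_block(uint32_t block, uint8_t *dst) {
    (void)block;
    for (int i = 0; i < BLOCK_BYTES; i++) dst[i] = 0;
}

/* Toy direct-mapped write path showing the two common policy combinations. */
void cache_write(uint32_t addr, uint8_t val, bool write_back_allocate) {
    uint32_t block = addr / BLOCK_BYTES;
    uint32_t off   = addr % BLOCK_BYTES;
    struct line *l = &cache[block % NLINES];
    bool hit = l->valid && l->tag == block;

    if (write_back_allocate) {
        /* Write back + write allocate: a miss fetches the block; the write stays
           in the cache and only marks the line dirty (memory updated on eviction). */
        if (!hit) {
            /* (A real cache would first write back l->data if l->dirty.) */
            memory_fetch_block(block, l->data);
            l->valid = true;
            l->dirty = false;
            l->tag   = block;
        }
        l->data[off] = val;
        l->dirty = true;
    } else {
        /* Write through + no write allocate: memory is always written;
           the cache is updated only if the line is already present. */
        if (hit) l->data[off] = val;
        memory_write_byte(addr, val);
    }
}

int main(void) {
    cache_write(0x2000, 7, true);    /* write back + write allocate */
    cache_write(0x3000, 9, false);   /* write through + no write allocate */
    return 0;
}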

Page 21 Write Performance

[Figure: the same structure as the direct-mapped cache of Page 14 (tag / index / block offset, 2^k lines of valid bit, tag, and data), but the data array also takes a write enable (WE); the tag comparison that produces HIT gates the write into the data array.]

Page 22 Replacement Policy

In an associative cache, which block from a set should be evicted when the set becomes full?
- Random
- Least Recently Used (LRU)
  - LRU cache state must be updated on every access
  - a true implementation is only feasible for small sets (2-way is easy)
  - a pseudo-LRU binary tree is often used for 4- to 8-way sets
- First In, First Out (FIFO), a.k.a. round-robin
  - used in highly associative caches

This is a second-order effect. Why?
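A minimal C sketch of the pseudo-LRU binary tree for one 4-way set, which needs only 3 bits instead of a full LRU ordering (the bit encoding below is one common convention, assumed here rather than taken from the slides):

#include <stdio.h>
#include <stdbool.h>

/* Each bit points toward the side that should be victimized next (false = left, true = right).
   b0 is the root; b1 covers ways {0,1}; b2 covers ways {2,3}. */
struct plru4 { bool b0, b1, b2; };

/* Update on an access (hit or fill) to way w: point every node on w's path away from w. */
void plru_touch(struct plru4 *s, int w) {
    if (w < 2) { s->b0 = true;  s->b1 = (w == 0); }
    else       { s->b0 = false; s->b2 = (w == 2); }
}

/* Choose a victim by following the bits from the root. */
int plru_victim(const struct plru4 *s) {
    if (!s->b0) return s->b1 ? 1 : 0;
    else        return s->b2 ? 3 : 2;
}

int main(void) {
    struct plru4 s = {false, false, false};
    int accesses[] = {0, 2, 1, 2};                 /* an arbitrary access pattern */
    for (int i = 0; i < 4; i++) plru_touch(&s, accesses[i]);

    /* Prints way 0. Exact LRU would have picked the never-accessed way 3,
       showing that pseudo-LRU is only an approximation. */
    printf("pseudo-LRU victim: way %d\n", plru_victim(&s));
    return 0;
}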

Page 23 Causes for Cache Misses

- Compulsory: the first reference to a block, a.k.a. cold-start misses
  - misses that would occur even with an infinite cache
- Capacity: the cache is too small to hold all the data needed by the program
  - misses that would occur even under a perfect placement and replacement policy
- Conflict: misses that occur because of collisions due to the block-placement strategy
  - misses that would not occur with full associativity

Determining the type of a miss requires running program traces on a cache simulator.

Page 24 Improving Cache Performance

Average memory access time = Hit time + Miss rate x Miss penalty

To improve performance:
- reduce the miss rate (e.g., a larger cache)
- reduce the miss penalty (e.g., an L2 cache)
- reduce the hit time

Aside: what is the simplest design strategy? Design the largest primary cache that does not slow down the clock or add pipeline stages.
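A quick C check of the average memory access time with and without an L2, using assumed miss rates and latencies (none of these numbers come from the slides):

#include <stdio.h>

int main(void) {
    /* Assumed, illustrative parameters. */
    double l1_hit  = 1.0;      /* L1 hit time, cycles */
    double l1_miss = 0.05;     /* L1 miss rate */
    double mem_pen = 100.0;    /* penalty to main memory, cycles */

    /* Without an L2: AMAT = hit time + miss rate * miss penalty. */
    double amat_no_l2 = l1_hit + l1_miss * mem_pen;

    /* With an L2, the L1 miss penalty itself becomes an AMAT:
       L2 hit time + L2 local miss rate * memory penalty. */
    double l2_hit  = 10.0;     /* cycles */
    double l2_miss = 0.2;      /* fraction of L1 misses that also miss in L2 */
    double amat_l2 = l1_hit + l1_miss * (l2_hit + l2_miss * mem_pen);

    printf("AMAT without L2: %.2f cycles\n", amat_no_l2);  /* 1 + 0.05*100       = 6.00 */
    printf("AMAT with L2:    %.2f cycles\n", amat_l2);     /* 1 + 0.05*(10 + 20) = 2.50 */
    return 0;
}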

Page 25 Block Size and Spatial Locality

A block is the unit of transfer between the cache and memory.

[Figure: one tag covering a 4-word block (Word0..Word3), i.e., b = 2.]

The CPU address is split into a block address (32 - b bits) and an offset (b bits); 2^b = block size, a.k.a. line size (in bytes).

A larger block size has distinct hardware advantages:
- less tag overhead
- exploits fast burst transfers from DRAM
- exploits fast burst transfers over wide buses

What are the disadvantages of increasing the block size?

Notes: a larger block size will reduce compulsory misses (the first miss to a block), but larger blocks may increase conflict misses since the number of blocks is smaller.
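To make the "less tag overhead" point concrete, a C sketch computing tag storage for a few block sizes, assuming a 64 KB direct-mapped cache and 32-bit addresses (illustrative numbers, not from the slides); the total tag storage halves each time the block size doubles:

#include <stdio.h>

static int log2i(int x) { int n = 0; while (x > 1) { x >>= 1; n++; } return n; }

int main(void) {
    const int addr_bits   = 32;
    const int cache_bytes = 64 * 1024;    /* assumed 64 KB direct-mapped cache */

    printf("block bytes   lines   tag bits/line   total tag bits\n");
    for (int block = 16; block <= 128; block *= 2) {
        int lines   = cache_bytes / block;
        int b       = log2i(block);       /* offset bits */
        int k       = log2i(lines);       /* index bits  */
        int tagbits = addr_bits - k - b;  /* tag bits per line */
        printf("%11d %7d %15d %16d\n", block, lines, tagbits, tagbits * lines);
    }
    return 0;
}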

Page 26 Victim Caches (HP 7200)

[Figure: CPU/RF, L1 data cache, and unified L2 cache, with a small fully associative victim cache (4 blocks) beside the L1. Data that misses in L1 but hits in the victim cache is returned to the CPU; data replaced from L1 goes into the victim cache. Where does data replaced from the victim cache go?]

Victim caches work very well with a direct-mapped cache.
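A toy C sketch of the victim-cache interaction on an L1 miss (the 4-entry size follows the slide's "4 blocks"; the FIFO replacement and everything else are illustrative assumptions):

#include <stdint.h>
#include <stdbool.h>
#include <stdio.h>

#define VC_BLOCKS 4   /* small fully associative victim cache, as on the slide */

struct vc_entry { bool valid; uint32_t block; };
static struct vc_entry victim[VC_BLOCKS];
static int vc_next;   /* FIFO replacement within the victim cache (assumed) */

/* Called on an L1 miss: returns true if the block is found in the victim cache
   (and removes it, since it would be swapped back into L1). */
bool victim_lookup(uint32_t block) {
    for (int i = 0; i < VC_BLOCKS; i++) {
        if (victim[i].valid && victim[i].block == block) {
            victim[i].valid = false;
            return true;
        }
    }
    return false;
}

/* Called when L1 evicts a block: stash it in the victim cache. */
void victim_insert(uint32_t block) {
    victim[vc_next].valid = true;
    victim[vc_next].block = block;
    vc_next = (vc_next + 1) % VC_BLOCKS;
}

int main(void) {
    victim_insert(0x1234);                  /* L1 evicts block 0x1234 */
    printf("%d\n", victim_lookup(0x1234));  /* later L1 miss on 0x1234 hits in the VC -> 1 */
    printf("%d\n", victim_lookup(0x9999));  /* unrelated block -> 0 */
    return 0;
}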

Page 27 Pseudo-Associative Caches (MIPS R10000)

Look at the processor address and check the indexed set for the data. Then either:
- HIT: return the copy of the data from the cache.
- MISS: invert the MSB of the index field and look in the "pseudo set".
  - SLOW HIT: return the data from the pseudo set.
  - MISS: read the block of data from the next level of the cache.
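The "invert the MSB of the index" step in a few lines of C, with assumed index and offset widths (not from the slides):

#include <stdio.h>
#include <stdint.h>

#define B_BITS 5   /* block offset bits (assumed) */
#define K_BITS 7   /* index bits -> 128 sets (assumed) */

int main(void) {
    uint32_t addr  = 0x12345678;
    uint32_t index = (addr >> B_BITS) & ((1u << K_BITS) - 1);

    /* On a miss in the primary set, probe the "pseudo set": the same index
       with its most significant bit inverted. */
    uint32_t pseudo_index = index ^ (1u << (K_BITS - 1));

    printf("primary set: %u, pseudo set: %u\n", index, pseudo_index);
    return 0;
}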

Page 28 Reducing (Read) Miss Penalty

[Figure: CPU/RF, L1 data cache, a write buffer, and the unified L2 cache. The write buffer holds replaced dirty data from L1 in a write-back cache, OR all writes in a write-through cache.]

The write buffer may hold the updated value of a location needed by a read miss.
- On a read miss, the simple scheme is to wait for the write buffer to go empty.
- Better: check the write buffer on a read miss; if there are no conflicts, allow the read miss to continue (otherwise, return the value from the write buffer).

Doesn't help much. Why? Designers of the MIPS M/1000 estimated that waiting for a four-word buffer to empty increased the read miss penalty by a factor of 1.5.
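A small C sketch of checking the write buffer on a read miss (the 4-entry size echoes the four-word buffer in the note; the structure and names are illustrative assumptions):

#include <stdint.h>
#include <stdbool.h>
#include <stdio.h>

#define WB_ENTRIES 4

struct wb_entry { bool valid; uint32_t addr; uint32_t data; };
static struct wb_entry wbuf[WB_ENTRIES];

/* On a read miss, scan the write buffer. If the missing address is pending there,
   forward its value instead of waiting for the buffer to drain. */
bool write_buffer_forward(uint32_t addr, uint32_t *out) {
    for (int i = 0; i < WB_ENTRIES; i++) {
        if (wbuf[i].valid && wbuf[i].addr == addr) {
            *out = wbuf[i].data;   /* conflict: return the buffered (newest) value */
            return true;
        }
    }
    return false;                  /* no conflict: the read miss may proceed to L2/memory */
}

int main(void) {
    wbuf[0] = (struct wb_entry){ true, 0x1000, 42 };   /* a pending write */
    uint32_t v;
    if (write_buffer_forward(0x1000, &v))
        printf("read miss satisfied from write buffer: %u\n", v);
    return 0;
}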

Page 29 Block-level Optimizations

Tags are too large, i.e., too much overhead. The simple solution is larger blocks, but then the miss penalty could be large.

Sub-block placement:
- a valid bit is added to units smaller than the full block, called sub-blocks
- only a sub-block is read on a miss
- if a tag matches, is the word in the cache?

[Figure: three tags (100, 300, 204), each with four per-sub-block valid bits: 1 1 1 1, 1 0 0 1, 0 1 0 1.]
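A tiny C sketch of the hit check with sub-block valid bits, using the slide's example of tag 300 with valid bits 1 0 0 1 (the struct is an illustrative model, not the slides' hardware):

#include <stdint.h>
#include <stdbool.h>
#include <stdio.h>

#define SUBBLOCKS 4   /* one valid bit per sub-block, as in the figure */

/* One cache line with sub-block placement: one tag, but SUBBLOCKS valid bits. */
struct line { uint32_t tag; bool valid[SUBBLOCKS]; };

/* A tag match alone is not enough: the addressed sub-block must also be valid. */
bool subblock_hit(const struct line *l, uint32_t tag, int sub) {
    return l->tag == tag && l->valid[sub];
}

int main(void) {
    /* The slide's middle example: tag 300 with valid bits 1 0 0 1. */
    struct line l = { 300, { true, false, false, true } };

    printf("tag 300, sub-block 0: %s\n", subblock_hit(&l, 300, 0) ? "hit" : "miss");
    printf("tag 300, sub-block 1: %s\n", subblock_hit(&l, 300, 1) ? "hit" : "miss");
    return 0;
}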

Page 30 Write Alternatives

- Writes take two cycles in the memory stage: one cycle for the tag check plus one cycle for the data write if it hits.
- Design a data RAM that can perform a read and a write in one cycle, and restore the old value after a tag miss.
- Hold the write data for a store in a single buffer ahead of the cache, and write the cache data during the next store's tag check.
  - Need to bypass from the write buffer if a read matches the write buffer tag.

Page 31 Pipelining Cache Writes (Alpha 21064)

[Figure: the write datapath of Page 21 (tag / index / block offset, 2^k lines, write enable), extended with a buffered write and a bypass path, so that a store's data write can overlap a later access's tag check, as described on the previous page.]

Page 32 Effect of Cache Parameters on Performance

Larger cache size:
+ reduces capacity and conflict misses
- hit time may increase

Larger block size:
+ spatial locality reduces compulsory misses and capacity reload misses
- fewer blocks may increase the conflict miss rate
- larger blocks may increase the miss penalty

Higher associativity:
+ reduces conflict misses (up to around 4-8 way)
- may increase access time

See page 427 of the text for a nice summary.