Chapter 5 Large and Fast: Exploiting Memory Hierarchy (Part 1)

Chapter 5 Large and Fast: Exploiting Memory Hierarchy (Part 1)
王振傑 (Chen-Chieh Wang), ccwang@mail.ee.ncku.edu.tw
Department of Electrical Engineering, Feng-Chia University

Outline
5.1 Introduction
5.2 The Basics of Caches
5.3 Measuring and Improving Cache Performance

Since 1980, CPU Speed Has Outpaced DRAM
Q. How do architects address this gap, which has grown by about 50% per year?
A. Put smaller, faster cache memories between the CPU and DRAM, i.e., create a memory hierarchy.

Memories: Review
SRAM (Static Random Access Memory): the value is stored on a pair of inverting gates; very fast, but takes up more space than DRAM (4 to 6 transistors per bit).
DRAM (Dynamic Random Access Memory): the value is stored as a charge on a capacitor and must be refreshed; very small, but slower than SRAM (by a factor of 5 to 10).

DRAM
[Figure: DRAM memory array organization, showing row activation and precharge. Source: http://en.wikipedia.org/wiki/dram]

Exploiting Memory Hierarchy
Users want large and fast memories! (Costs and speeds as of 2008.)
SRAM access times are 0.5-5 ns, at a cost of $2,000 to $5,000 per GB.
DRAM access times are 50-70 ns, at a cost of $20 to $75 per GB.
Disk access times are 5 to 20 million ns, at a cost of $0.20 to $2 per GB.
Try to give it to them anyway: build a memory hierarchy.
[Figure: CPU with registers and an SRAM cache, connected through an interconnect and memory controller to DRAM main memory and input/output devices.]

The Principle of Locality
The principle that makes having a memory hierarchy a good idea.
Two different types of locality:
Temporal locality (locality in time): if an item is referenced, it will tend to be referenced again soon (e.g., loops, data reuse).
Spatial locality (locality in space): if an item is referenced, items whose addresses are close by will tend to be referenced soon (e.g., straight-line code, array accesses). A short sketch after the outline below illustrates both.
Our initial focus: two levels (upper, lower).
Block (aka line): the minimum unit of data that can be present in the upper level.
Hit: the data requested is in the upper level.
Miss: the data requested is not in the upper level.

Outline
5.1 Introduction
5.2 The Basics of Caches
5.3 Measuring and Improving Cache Performance
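As an illustration of both kinds of locality, consider the simple loop below (a minimal sketch; the array name and size are arbitrary, not from the slides):

#include <stdio.h>

#define N 1024

int main(void) {
    static int a[N];
    int sum = 0;                /* sum is reused every iteration: temporal locality   */

    for (int i = 0; i < N; i++) {
        a[i] = i;               /* a[0], a[1], ... are adjacent in memory: a cache    */
        sum += a[i];            /* block fetched for a[i] also holds a[i+1], a[i+2],  */
    }                           /* giving spatial locality                            */

    printf("sum = %d\n", sum);  /* the loop body itself is straight-line code, so     */
    return 0;                   /* instruction fetches also show spatial locality     */
}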

Cache Organization
Direct mapped, fully associative, or set associative.

Cache
Two issues:
How do we know if a data item is in the cache?
If it is, how do we find it?
Our first example: the block size is one word of data and the cache is "direct mapped": for each item of data at the lower level, there is exactly one location in the cache where it might be, so many items at the lower level share the same location in the upper level.

Direct Mapped Cache
Mapping: the cache index is the block address modulo the number of blocks in the cache.
[Figure: an 8-block direct-mapped cache with indices 000-111; memory blocks 00001, 00101, 01001, 01101, 10001, 10101, 11001, and 11101 all map to the cache index given by their low-order 3 address bits.]

Direct Mapped Cache for MIPS
[Figure: a 1024-entry direct-mapped cache with one-word blocks. The 32-bit address is split into a 20-bit tag (bits 31-12), a 10-bit index (bits 11-2), and a 2-bit byte offset (bits 1-0); a hit requires the valid bit to be set and the stored tag to match the tag field of the address.]

Tags and Valid Bits
How do we know which particular block is stored in a cache location? Store the block address as well as the data. Actually, only the high-order bits are needed; these are called the tag.
What if there is no data in a location? A valid bit records this: 1 = present, 0 = not present; initially 0. A sketch of this lookup follows below.

Direct Mapped Cache
Taking advantage of spatial locality: use blocks larger than one word, so a single miss brings in several neighboring words.
[Figure: direct-mapped cache with multi-word blocks; the address now also contains a block-offset field that selects a word within the block.]
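The tag/index/offset decomposition and the valid-bit check can be summarized in a short sketch, matching the MIPS-style cache above (1024 one-word blocks); the structure and function names are illustrative, not from the slides:

#include <stdbool.h>
#include <stdint.h>

#define NUM_BLOCKS 1024

struct cache_line {
    bool     valid;   /* 1 = the block holds live data, initially 0 */
    uint32_t tag;     /* high-order 20 bits of the block address    */
    uint32_t data;    /* one 32-bit word per block                  */
};

static struct cache_line cache[NUM_BLOCKS];

/* Returns true on a hit and delivers the word through *word. */
bool cache_lookup(uint32_t addr, uint32_t *word) {
    uint32_t index = (addr >> 2) & (NUM_BLOCKS - 1);   /* address bits 11..2  */
    uint32_t tag   = addr >> 12;                       /* address bits 31..12 */

    if (cache[index].valid && cache[index].tag == tag) {
        *word = cache[index].data;                     /* hit */
        return true;
    }
    return false;                                      /* miss: go to the next level */
}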

4-Way Set Associative Cache
[Figure: a 4-way set-associative cache with 256 sets. The 32-bit address provides a 22-bit tag (bits 31-10) and an 8-bit set index (bits 9-2); each set holds four (valid, tag, data) entries, the four tag comparisons proceed in parallel, and a 4-to-1 multiplexor selects the data on a hit.]

Example: Direct Mapped Cache
32-bit address, cache size = 64 KByte, block size = 32 Byte, direct mapped.

Example: Set Associative Cache
32-bit address, cache size = 64 KByte, block size = 32 Byte, 2-way set associative.
(A sketch working out the address fields for this and the preceding direct-mapped example follows the next slide.)

Block Size Considerations
Larger blocks should reduce the miss rate, due to spatial locality.
But in a fixed-sized cache:
Larger blocks mean fewer of them, so more competition for cache locations and an increased miss rate.
Larger blocks also cause pollution, bringing in words that may never be used.
Larger blocks mean a larger miss penalty, which can override the benefit of the reduced miss rate; early restart and critical-word-first can help.
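The following sketch (not from the slides) works out the address-field widths for the two sizing examples above: a 64 KB cache with 32-byte blocks, organized direct mapped and 2-way set associative.

#include <stdio.h>

static int log2i(unsigned x) {         /* x is assumed to be a power of two */
    int n = 0;
    while (x > 1) { x >>= 1; n++; }
    return n;
}

int main(void) {
    unsigned cache_bytes = 64 * 1024;
    unsigned block_bytes = 32;
    unsigned num_blocks  = cache_bytes / block_bytes;       /* 2048 blocks */

    for (unsigned ways = 1; ways <= 2; ways *= 2) {
        unsigned sets   = num_blocks / ways;
        int offset_bits = log2i(block_bytes);               /* 5 */
        int index_bits  = log2i(sets);                      /* 11 (direct) or 10 (2-way) */
        int tag_bits    = 32 - index_bits - offset_bits;    /* 16 (direct) or 17 (2-way) */
        printf("%u-way: %u sets, offset=%d, index=%d, tag=%d bits\n",
               ways, sets, offset_bits, index_bits, tag_bits);
    }
    return 0;
}

For the direct-mapped case this gives a 5-bit offset, an 11-bit index, and a 16-bit tag; the 2-way cache has half as many sets, so one index bit moves into the tag (5/10/17).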

Line Size and Locality
Increasing the block size tends to decrease the miss rate, up to a point:
[Figure: miss rate (0% to 40%) versus block size (4 to 256 bytes) for cache sizes of 1 KB, 8 KB, 16 KB, 64 KB, and 256 KB.]
Use split instruction/data caches, because there is more spatial locality in code:

Program   Block size (words)   Instruction miss rate   Data miss rate   Effective combined miss rate
gcc       1                    6.1%                    2.1%             5.4%
gcc       4                    2.0%                    1.7%             1.9%
spice     1                    1.2%                    1.3%             1.2%
spice     4                    0.3%                    0.6%             0.4%

Cache Misses
On a cache hit, the CPU proceeds normally.
On a cache miss:
Stall the CPU pipeline.
Fetch the block from the next level of the hierarchy.
On an instruction cache miss, restart the instruction fetch.
On a data cache miss, complete the data access.

Hits vs. Misses
Memory references divide into reads and writes, and each can hit or miss in the cache:
Read hit / read miss.
Write hit: handled with write-through or write-back.
Write miss: handled with write-around or write-allocate.

Write-Through
On a data-write hit, we could just update the block in the cache, but then the cache and memory would be inconsistent.
Write-through: also update memory.
But this makes writes take longer. E.g., if the base CPI = 1, 10% of instructions are stores, and a write to memory takes 100 cycles:
Effective CPI = 1 + 0.1 × 100 = 11
Solution: a write buffer, which holds data waiting to be written to memory. The CPU continues immediately and stalls on a write only if the write buffer is already full.

Write Buffers for Write-Through Caches
[Figure: processor and cache connected to lower-level memory through a write buffer.]
The write buffer holds data awaiting write-through to the lower-level memory.
Q. Why a write buffer? So the CPU doesn't stall on writes.
Q. Why a buffer rather than just one register? Bursts of writes are common.
Q. Are read-after-write (RAW) hazards an issue for the write buffer? Yes! Either drain the buffer before the next read, or check the write buffer for a match and send the read first only when it is safe.

Write-Back
Alternative: on a data-write hit, just update the block in the cache and keep track of whether each block is dirty.
When a dirty block is replaced, write it back to memory; a write buffer can be used so the replacing block can be read first. A sketch of the dirty-bit bookkeeping follows below.
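A minimal sketch of write-back bookkeeping with write-allocate, reusing the direct-mapped layout from the earlier sketch; the next-level memory interface and all names are illustrative, not from the slides:

#include <stdbool.h>
#include <stdint.h>

#define NUM_BLOCKS 1024

struct line { bool valid, dirty; uint32_t tag, data; };
static struct line cache[NUM_BLOCKS];

static uint32_t mem[1u << 20];                                   /* toy next-level memory (words)   */
static void memory_write(uint32_t baddr, uint32_t d) { mem[baddr % (1u << 20)] = d; }
static uint32_t memory_read(uint32_t baddr) { return mem[baddr % (1u << 20)]; }

void cache_store(uint32_t addr, uint32_t value) {
    uint32_t index = (addr >> 2) & (NUM_BLOCKS - 1);
    uint32_t tag   = addr >> 12;
    struct line *l = &cache[index];

    if (!(l->valid && l->tag == tag)) {                          /* write miss: allocate the block  */
        if (l->valid && l->dirty)                                /* evict: write the old block back */
            memory_write((l->tag << 10) | index, l->data);
        l->data  = memory_read(addr >> 2);                       /* fetch the block being written   */
        l->tag   = tag;
        l->valid = true;
    }
    l->data  = value;                                            /* update only the cache copy      */
    l->dirty = true;                                             /* memory is updated at eviction   */
}

With write-back, repeated stores to the same block cost only cache accesses; the single write-back at eviction replaces the per-store memory writes of write-through.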

Replacement Policy
Random: candidate blocks are randomly selected, possibly with some hardware assistance.
Least recently used (LRU): the block replaced is the one that has been unused for the longest time.
First in, first out (FIFO): because true LRU can be complicated to compute, FIFO approximates it by replacing the oldest block rather than the least recently used one.
(A minimal LRU sketch for a 2-way cache follows the next slide.)

Main Memory Supporting Caches
Use DRAMs for main memory:
Fixed width (e.g., 1 word), connected by a fixed-width clocked bus; the bus clock is typically slower than the CPU clock.
Example cache block read:
1 bus cycle for the address transfer, 15 bus cycles per DRAM access, 1 bus cycle per data transfer.
For a 4-word block and a 1-word-wide DRAM:
Miss penalty = 1 + 4 × 15 + 4 × 1 = 65 bus cycles
Bandwidth = 16 bytes / 65 cycles ≈ 0.25 bytes per bus cycle
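For a 2-way set-associative cache, true LRU needs only one bit per set; the sketch below (illustrative names, not from the slides) shows the bookkeeping:

#include <stdint.h>

#define NUM_SETS 256

static uint8_t lru[NUM_SETS];          /* lru[set] = way (0 or 1) to evict next */

/* Call on every access that hits (or fills) 'way' in 'set'. */
void lru_touch(int set, int way) {
    lru[set] = (uint8_t)(way ^ 1);     /* the other way is now the least recently used */
}

/* Call on a miss to pick the victim way for 'set'. */
int lru_victim(int set) {
    return lru[set];
}

For higher associativity, exact LRU needs more state per set, which is why pseudo-LRU or FIFO approximations are used in practice.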

Hardware Issues
Make reading multiple words easier by using a wider memory or banks of memory:
2-word-wide memory: miss penalty = 1 + 2 × 15 + 2 × 1 = 33 bus cycles; bandwidth = 16 bytes / 33 cycles ≈ 0.48 bytes per bus cycle.
4-bank interleaved memory: miss penalty = 1 + 15 + 4 × 1 = 20 bus cycles; bandwidth = 16 bytes / 20 cycles = 0.8 bytes per bus cycle.
(A sketch that reproduces these numbers follows below.)

Interleaved Memory
[Figure: non-interleaved memory versus 4-bank interleaved memory with pipelined read/write accesses. Word addresses are spread round-robin across the banks: bank 0 holds words 0, 4, 8, 12; bank 1 holds 1, 5, 9, 13; bank 2 holds 2, 6, 10, 14; bank 3 holds 3, 7, 11, 15.]
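The miss-penalty and bandwidth figures for the three organizations can be reproduced with the parameters from the slides above (a sketch; variable names are illustrative):

#include <stdio.h>

int main(void) {
    const int addr_cycles = 1;    /* bus cycles to send the address           */
    const int dram_cycles = 15;   /* bus cycles per DRAM access               */
    const int xfer_cycles = 1;    /* bus cycles per bus-width data transfer   */
    const int block_words = 4;    /* 4-word (16-byte) cache block             */

    /* 1-word-wide DRAM: one DRAM access and one transfer per word */
    int narrow = addr_cycles + block_words * dram_cycles + block_words * xfer_cycles;

    /* 2-word-wide DRAM and bus: two accesses, two transfers */
    int wide2  = addr_cycles + 2 * dram_cycles + 2 * xfer_cycles;

    /* 4-bank interleaved: DRAM accesses overlap, transfers stay sequential */
    int banked = addr_cycles + dram_cycles + block_words * xfer_cycles;

    printf("1-word-wide: %d cycles, %.2f B/cycle\n", narrow, 16.0 / narrow);
    printf("2-word-wide: %d cycles, %.2f B/cycle\n", wide2, 16.0 / wide2);
    printf("4-bank:      %d cycles, %.2f B/cycle\n", banked, 16.0 / banked);
    return 0;
}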

Advanced DRAM Organization
Bits in a DRAM are organized as a rectangular array, and a DRAM access reads an entire row.
Burst mode: supply successive words from a row with reduced latency.
Double data rate (DDR) DRAM: transfer on both the rising and falling clock edges.
Quad data rate (QDR) DRAM: separate DDR inputs and outputs.

Outline
5.1 Introduction
5.2 The Basics of Caches
5.3 Measuring and Improving Cache Performance

Performance
Simplified model:
execution time = (execution cycles + stall cycles) × cycle time
stall cycles = number of instructions × miss ratio × miss penalty
Two ways of improving performance: decreasing the miss ratio, and decreasing the miss penalty.
What happens if we increase the block size?

Cache Performance Example
Given: I-cache miss rate = 2%, D-cache miss rate = 4%, miss penalty = 100 cycles, base CPI (ideal cache) = 2, and loads & stores are 36% of instructions.
Miss cycles per instruction:
I-cache: 0.02 × 100 = 2
D-cache: 0.36 × 0.04 × 100 = 1.44
Actual CPI = 2 + 2 + 1.44 = 5.44
The CPU with an ideal cache would be 5.44 / 2 = 2.72 times faster. (A sketch of this calculation follows below.)
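The same arithmetic in a short sketch, with parameters taken from the slide above:

#include <stdio.h>

int main(void) {
    double base_cpi     = 2.0;    /* CPI with an ideal cache        */
    double icache_miss  = 0.02;   /* I-cache miss rate              */
    double dcache_miss  = 0.04;   /* D-cache miss rate              */
    double mem_ops      = 0.36;   /* loads & stores per instruction */
    double miss_penalty = 100.0;  /* cycles                         */

    double i_stall = icache_miss * miss_penalty;            /* 2.00 */
    double d_stall = mem_ops * dcache_miss * miss_penalty;  /* 1.44 */
    double cpi     = base_cpi + i_stall + d_stall;          /* 5.44 */

    printf("actual CPI = %.2f, slowdown vs. ideal cache = %.2fx\n",
           cpi, cpi / base_cpi);                            /* 2.72x */
    return 0;
}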

Decreasing Miss Ratio with Associativity
[Figure: the same cache organized direct mapped, set associative, and fully associative.]

Decreasing Miss Penalty with Multilevel Caches
Add a second-level cache:
Often the primary cache is on the same chip as the processor.
Use SRAMs to add another level of cache between the primary cache and main memory (DRAM).
The miss penalty goes down if the data is found in the 2nd-level cache.
Using multilevel caches:
Try to optimize the hit time on the 1st-level cache.
Try to optimize the miss rate on the 2nd-level cache.

Performance of Multilevel Caches
Example: CPI of 1.0 on a 4 GHz machine with a 2% miss rate and a 100 ns DRAM access time; adding a 2nd-level cache with a 5 ns access time decreases the miss rate to main memory to 0.5%. (A sketch working this example out appears at the end of this section.)

Interactions with Software
Misses depend on memory access patterns:
Algorithm behavior.
Compiler optimization for memory accesses.
It is difficult to predict the best algorithm: experimental data is needed.
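A sketch working the multilevel example out, under the assumption (not stated on the slide) that the 2% and 0.5% rates are misses per instruction that go to the L2 cache and to DRAM respectively, and that each penalty is the full access time converted to 0.25 ns clock cycles:

#include <stdio.h>

int main(void) {
    double clock_ghz    = 4.0;                 /* 4 GHz => 0.25 ns cycle time */
    double cycle_ns     = 1.0 / clock_ghz;
    double base_cpi     = 1.0;
    double l1_miss_rate = 0.02;                /* misses per instruction       */
    double dram_ns      = 100.0;
    double l2_ns        = 5.0;
    double l2_miss_rate = 0.005;               /* still have to go to DRAM     */

    double dram_penalty = dram_ns / cycle_ns;  /* 400 cycles                   */
    double l2_penalty   = l2_ns / cycle_ns;    /* 20 cycles                    */

    double cpi_l1_only  = base_cpi + l1_miss_rate * dram_penalty;
    double cpi_l1_l2    = base_cpi + l1_miss_rate * l2_penalty
                                   + l2_miss_rate * dram_penalty;

    printf("CPI with L1 only: %.1f\n", cpi_l1_only);
    printf("CPI with L1+L2:   %.1f (speedup %.1fx)\n",
           cpi_l1_l2, cpi_l1_only / cpi_l1_l2);
    return 0;
}

Under these assumptions the second-level cache cuts the CPI from 9 to 3.4, a speedup of roughly 2.6.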