CS356: Discussion #9 Memory Hierarchy and Caches Marco Paolieri (paolieri@usc.edu) Illustrations from CS:APP3e textbook

The Memory Hierarchy
So far, we have modeled the memory system as an abstract array of bytes that the CPU can access in constant time. In practice, the memory system is a hierarchy of storage devices with different capacities, costs, and access times:
Small, fast cache memories close to the CPU serve as a staging area for data and instructions read from main memory.
Main memory can be used as a staging area for large, slow local disks.
Local disks are often used as a staging area for data from network devices.
Well-written programs tend to access storage at any level more frequently than storage at the next, lower level.

The Memory Hierarchy

Latency Numbers (2018) http://people.eecs.berkeley.edu/~rcs/research/interactive_latency.html

Storage Technologies
Static RAM (SRAM): used for cache memories (inside and outside the CPU). Faster and more expensive: 6 transistors per bit. Resistant to noise, and keeps its value as long as it is powered (bistable cells). About 10 MB on a desktop computer.
Dynamic RAM (DRAM): used for main memory and for the frame buffers of GPUs. Very sensitive to noise, even light (an array of capacitors), and must be periodically refreshed (recharging the capacitors). About 1000× cheaper and 10× slower than SRAM. About 16 GB on a desktop computer.
DRAM organization: a two-dimensional array of supercells (i, j), each storing w bits. The memory controller sends the row and column addresses (RAS/CAS, on the same pins) to read and write values.

System Bus: Memory and I/O Access
Historically, a northbridge (RAM / PCIe) and a southbridge (PCI / IDE / SATA / USB). Since Sandy Bridge, the northbridge functions are integrated into the CPU die.

Locality
Temporal locality: a memory location referenced once is likely to be referenced again multiple times in the near future.
Spatial locality: memory locations near a referenced location are likely to be referenced in the near future.
Locality is exploited at all levels:
Hardware level: CPU cache memories speed up main memory accesses.
OS level: main memory is used as a cache for virtual memory pages or disk blocks.
Application level: web browsers cache page elements; web servers cache frequently accessed pages and images in memory.

Example of Data Locality

#include <stdio.h>

int sum(int *array, int size) {
    int sum = 0;
    for (int i = 0; i < size; i++) {
        sum += array[i];
    }
    return sum;
}

int main() {
    int array[5] = {1, 2, 3, 10, 20};
    printf("%d\n", sum(array, 5));
    return 0;
}

The function sum(int *array, int size) has good locality:
Variable sum is referenced in every loop iteration: good temporal locality.
The elements of array are read sequentially: good spatial locality.
A stride-k reference pattern visits every k-th element of an array: the smaller the stride, the better the spatial locality.

Sum by row or by column?

#include <stdio.h>

#define N 3

int sum_by_row(int matrix[N][N]) {
    int sum = 0;
    for (int i = 0; i < N; i++) {
        for (int j = 0; j < N; j++) {
            sum += matrix[i][j];
        }
    }
    return sum;
}

int sum_by_col(int matrix[N][N]) {
    int sum = 0;
    for (int j = 0; j < N; j++) {
        for (int i = 0; i < N; i++) {
            sum += matrix[i][j];
        }
    }
    return sum;
}

sum_by_row has better spatial locality: multidimensional arrays are stored in row-major order in C, so sum_by_row performs stride-1 accesses, while sum_by_col performs stride-N accesses. Loops also have great temporal and spatial locality with respect to instruction fetches. A benchmark sketch follows.
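To observe the difference, one can time the two traversals on a matrix large enough to exceed the caches. A minimal benchmark sketch, assuming a hypothetical size LARGE_N = 4096 (a 64 MB matrix) and using clock() for rough timing:

#include <stdio.h>
#include <time.h>

#define LARGE_N 4096  /* assumed size: the 64 MB matrix does not fit in any cache */

static int matrix[LARGE_N][LARGE_N];

static long long sum_rows(void) {  /* stride-1: follows the row-major layout */
    long long sum = 0;
    for (int i = 0; i < LARGE_N; i++)
        for (int j = 0; j < LARGE_N; j++)
            sum += matrix[i][j];
    return sum;
}

static long long sum_cols(void) {  /* stride-LARGE_N: jumps across rows */
    long long sum = 0;
    for (int j = 0; j < LARGE_N; j++)
        for (int i = 0; i < LARGE_N; i++)
            sum += matrix[i][j];
    return sum;
}

int main(void) {
    clock_t t0 = clock();
    long long r = sum_rows();
    clock_t t1 = clock();
    long long c = sum_cols();
    clock_t t2 = clock();
    printf("by row: %lld in %.2f s, by column: %lld in %.2f s\n",
           r, (double)(t1 - t0) / CLOCKS_PER_SEC,
           c, (double)(t2 - t1) / CLOCKS_PER_SEC);
    return 0;
}

On a typical machine the column-major traversal is several times slower, even though both functions execute the same number of additions.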

Caching
A cache is a small, fast staging area for objects from a larger, slower device.
Cache hit: the requested object is found in the cache at that level.
Cache miss: the object must be fetched from the next level, which can in turn cause a cache miss at that level.
Once the data is fetched, a replacement policy decides whether to store it in the cache and, if the cache is full, which block to evict (e.g., random, LRU).
Data is transferred between levels in blocks containing many objects: if another object in the same block is referenced, it will already be in the cache.

Types of Cache Misses
Compulsory miss (or cold miss): the cache is initially empty (or "cold"), so the first reference to a block always misses.
Conflict miss: the placement policy inside the cache is restrictive, and two referenced blocks map to the same cache set, evicting each other even if the cache is not full.
Capacity miss: the cache is not large enough for the working set of a program phase.
Who takes care of cache misses? The compiler manages the register file; the L1, L2, L3 caches are managed by hardware logic; for virtual memory, DRAM is managed by the OS and by the address-translation hardware of the CPU.

Cache Organization
Memory: addresses of m bits, so M = 2^m memory locations.
Cache: S = 2^s cache sets. Each set has K lines. Each line has a data block of B = 2^b bytes, a valid bit, and t = m - (s + b) tag bits.
How to check if the word at an address is in the cache? Split the address into tag (t bits), set index (s bits), and block offset (b bits): the word is in the cache if some line of the indexed set is valid and holds a matching tag.
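As a concrete illustration, here is a minimal sketch in C of this address decomposition; the function and parameter names (split_address, s_bits, b_bits) are mine, not from the slides:

#include <stdio.h>
#include <stdint.h>

/* Split an address into (tag, set index, block offset), given s set-index
   bits and b block-offset bits. */
typedef struct { uint64_t tag, set, offset; } addr_fields;

addr_fields split_address(uint64_t addr, int s_bits, int b_bits) {
    addr_fields f;
    f.offset = addr & ((1ULL << b_bits) - 1);             /* low b bits     */
    f.set    = (addr >> b_bits) & ((1ULL << s_bits) - 1); /* middle s bits  */
    f.tag    = addr >> (b_bits + s_bits);                 /* remaining bits */
    return f;
}

int main(void) {
    /* Example geometry: 512 sets (s = 9) of 32-byte blocks (b = 5). */
    addr_fields f = split_address(0x12345678, 9, 5);
    printf("tag=0x%llx set=%llu offset=%llu\n",
           (unsigned long long)f.tag, (unsigned long long)f.set,
           (unsigned long long)f.offset);
    return 0;
}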

Direct-Mapped Caches (K = 1)

Conflict Misses in Direct-Mapped Caches
Multiple memory blocks can map to the same set (and thus to its only line)! They are identified by different tags, and referencing them alternately generates conflict misses. The eviction policy is trivial: the only line in the set must be removed.
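A minimal sketch of the line-matching and eviction logic of a direct-mapped cache; the geometry (8 sets of 4-byte blocks) is an illustrative assumption:

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define S 8  /* number of sets  (assumed) */
#define B 4  /* bytes per block (assumed) */

typedef struct { bool valid; uint32_t tag; } line;
static line cache[S];

/* Returns true on a hit; on a miss, installs the new tag, evicting
   whatever was in the set (trivial eviction: K = 1). */
bool access_byte(uint32_t addr) {
    uint32_t set = (addr / B) % S;
    uint32_t tag = addr / (B * S);
    if (cache[set].valid && cache[set].tag == tag)
        return true;   /* hit: valid line with a matching tag */
    cache[set].valid = true;
    cache[set].tag = tag;
    return false;      /* miss */
}

int main(void) {
    /* Addresses 0 and 32 map to set 0 with different tags:
       alternating between them misses every time (conflict misses). */
    printf("%d %d %d\n", access_byte(0), access_byte(32), access_byte(0));
    return 0;
}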

Why caches index with middle bits
With high-order bit indexing, contiguous memory blocks would all map to the same cache set, so a program scanning a large array would use only a handful of sets. Indexing with the middle bits spreads contiguous blocks across all the sets.
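A small sketch makes this concrete: with 8-bit addresses, 4-byte blocks, and 8 sets (an assumed geometry), the program below prints the set index of 8 contiguous blocks under both schemes:

#include <stdio.h>

int main(void) {
    /* 8-bit addresses, b = 2 (4-byte blocks), s = 3 (8 sets). */
    for (int block = 0; block < 8; block++) {
        unsigned addr   = block * 4;
        unsigned middle = (addr >> 2) & 0x7; /* middle-bit indexing     */
        unsigned high   = addr >> 5;         /* high-order-bit indexing */
        printf("block %d: middle-bit set %u, high-order set %u\n",
               block, middle, high);
    }
    return 0;
}

Middle-bit indexing assigns the 8 blocks to 8 different sets; high-order indexing maps them all to set 0.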

K-way Set-Associative Caches (1 < K < C/B)
More than one line per set. Line matching is more difficult: the tags of all K lines in the set must be checked. Requires an eviction policy for when all the lines in a set are full: random, FIFO, least-recently used (LRU), or least-frequently used (LFU).
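A minimal sketch of a set-associative lookup with LRU eviction, at block granularity; the geometry (4 sets, 2 ways) and the logical-clock implementation of LRU are illustrative assumptions:

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define SETS 4  /* assumed */
#define WAYS 2  /* assumed */

typedef struct { bool valid; uint32_t tag; unsigned last_used; } line;
static line cache[SETS][WAYS];
static unsigned now;  /* logical clock used to track recency */

/* Returns true on a hit; on a miss, evicts an invalid line if any,
   otherwise the least-recently used line of the set. */
bool access_block(uint32_t block) {
    line *set = cache[block % SETS];
    uint32_t tag = block / SETS;
    for (int i = 0; i < WAYS; i++)
        if (set[i].valid && set[i].tag == tag) {  /* check all K tags */
            set[i].last_used = ++now;
            return true;
        }
    int victim = 0;
    for (int i = 1; i < WAYS; i++)
        if (!set[i].valid ||
            (set[victim].valid && set[i].last_used < set[victim].last_used))
            victim = i;
    set[victim].valid = true;
    set[victim].tag = tag;
    set[victim].last_used = ++now;
    return false;
}

int main(void) {
    /* Blocks 0 and 4 both map to set 0: with 2 ways they coexist,
       so the second round of accesses hits (prints 0 0 1 1). */
    printf("%d %d %d %d\n", access_block(0), access_block(4),
           access_block(0), access_block(4));
    return 0;
}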

Fully Associative Caches (K = C/B)
A single set contains all the cache lines. Line matching is very expensive: the tags of all the lines must be checked. Appropriate only for small caches (e.g., the TLB used for virtual-memory address translation).

Memory Writes and Caching
Write hit (writing to a word that is in the cache). Two options:
Write-through: immediately update the word in the next cache level.
Write-back: wait until the block is evicted from this cache level. Can significantly reduce traffic on the bus, at the cost of additional complexity (a "dirty" bit per line). More common at the lower levels of the hierarchy (e.g., virtual memory), but also used for Intel L1 caches.
Write miss (the word is not in the cache). Two options:
Write-allocate: load the block from the next cache level, then write.
No-write-allocate: bypass the cache and write directly into the next level.
Write-through caches are typically no-write-allocate; write-back caches are typically write-allocate. A sketch of these policies follows.
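The sketch below combines the two common pairings in one store path: write-back with write-allocate, and write-through with no-write-allocate. The single-level geometry (4 sets of 16-byte blocks backed by a 64 KB "next level") is an assumption for illustration:

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define NSETS 4
#define BLOCK 16

typedef struct { bool valid, dirty; uint32_t tag; uint8_t data[BLOCK]; } line;
static line cache[NSETS];
static uint8_t memory[1 << 16];  /* the next level, here plain RAM */

static line *lookup(uint32_t a) {
    line *l = &cache[(a / BLOCK) % NSETS];
    return (l->valid && l->tag == a / (BLOCK * NSETS)) ? l : NULL;
}

static line *install(uint32_t a) {  /* write-allocate: fetch the whole block */
    line *l = &cache[(a / BLOCK) % NSETS];
    if (l->valid && l->dirty) {     /* write-back: flush the evicted block */
        uint32_t base = (l->tag * NSETS + (a / BLOCK) % NSETS) * BLOCK;
        for (int i = 0; i < BLOCK; i++) memory[base + i] = l->data[i];
    }
    l->valid = true; l->dirty = false; l->tag = a / (BLOCK * NSETS);
    for (int i = 0; i < BLOCK; i++) l->data[i] = memory[a - a % BLOCK + i];
    return l;
}

void store(uint32_t a, uint8_t v, bool write_back) {
    line *l = lookup(a);
    if (!l) {
        if (!write_back) { memory[a] = v; return; }  /* no-write-allocate */
        l = install(a);                              /* write-allocate */
    }
    l->data[a % BLOCK] = v;
    if (write_back) l->dirty = true;  /* defer the update until eviction */
    else memory[a] = v;               /* write-through: update now */
}

int main(void) {
    store(100, 7, true);   /* write-back: the value stays in the cache for now */
    store(200, 9, false);  /* write-through: memory is updated immediately */
    printf("%d %d\n", memory[100], memory[200]);  /* prints "0 9" */
    return 0;
}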

Performance Tuning of Caches
Average Access Time = (Hit Time) + (Miss Ratio) × (Miss Penalty)
Large caches decrease the miss ratio, but increase the hit time.
Large blocks decrease the miss ratio when there is spatial locality, but leave fewer lines per set, which can hurt programs where temporal locality dominates; large blocks also increase the miss penalty.
Large associativity K decreases the chance of conflict misses, but is more expensive to implement and hard to make fast: more tag bits per line, additional LRU state bits per line, and additional control logic can increase both hit time and miss penalty.

Exercise: Cache Size and Address
Problem. A processor has a 32-bit memory address space. Memory is broken into blocks of 32 bytes each. The cache is capable of storing 16 KB.
- How many blocks can the cache store?
- Break the address into tag, set index, and byte offset for a direct-mapped cache.
- Break the address into tag, set index, and byte offset for a 4-way set-associative cache.
Solution. 16 KB / 32 bytes per block = 512 blocks.
Direct-mapped: 512 sets, so 18-bit tag (the rest), 9-bit set index, 5-bit block offset.
4-way set-associative: each set has 4 lines, so there are 512 / 4 = 128 sets: 20-bit tag (the rest), 7-bit set index, 5-bit block offset.

Exercise: Cache Size and Address
Problem. A processor has a 36-bit memory address space. Memory is broken into blocks of 64 bytes each. The cache is capable of storing 1 MB.
- How many blocks can the cache store?
- Break the address into tag, set index, and byte offset for a direct-mapped cache.
- Break the address into tag, set index, and byte offset for an 8-way set-associative cache.
Solution. 1 MB / 64 bytes per block = 2^(20-6) = 16K blocks.
Direct-mapped: 16K sets, so 16-bit tag (the rest), 14-bit set index, 6-bit block offset.
8-way set-associative: each set has 8 lines, so there are 16K / 8 = 2K sets: 19-bit tag (the rest), 11-bit set index, 6-bit block offset.
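Both breakdowns follow mechanically from the cache geometry. A small sketch that computes them (the function breakdown and its parameters are mine, not from the slides):

#include <stdio.h>

static int log2i(long long x) {  /* integer log2 for powers of two */
    int n = 0;
    while (x > 1) { x >>= 1; n++; }
    return n;
}

/* Print the tag/set/offset widths of a cache with the given geometry. */
void breakdown(int addr_bits, long long cache_bytes, int block_bytes, int ways) {
    long long blocks = cache_bytes / block_bytes;
    long long sets = blocks / ways;
    int b = log2i(block_bytes), s = log2i(sets);
    printf("%lld blocks, %lld sets: %d-bit tag, %d-bit set index, %d-bit offset\n",
           blocks, sets, addr_bits - s - b, s, b);
}

int main(void) {
    breakdown(32, 16 * 1024, 32, 1);    /* first exercise, direct-mapped  */
    breakdown(32, 16 * 1024, 32, 4);    /* first exercise, 4-way          */
    breakdown(36, 1024 * 1024, 64, 1);  /* second exercise, direct-mapped */
    breakdown(36, 1024 * 1024, 64, 8);  /* second exercise, 8-way         */
    return 0;
}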

Exercise: Direct-Mapped Performance
You are asked to optimize a cache capable of storing 8 bytes total for the given trace of references. Three direct-mapped cache designs are possible by varying the block size: C1 has one-byte blocks, C2 has two-byte blocks, and C3 has four-byte blocks.
- In terms of miss rate, which cache design is best?
- If the miss stall time is 25 cycles, and C1 has an access time of 2 cycles, C2 takes 3 cycles, and C3 takes 5 cycles, which is the best cache design? (Every access, hit or miss, requires an access to the cache.)
Trace (each address with its 8 least-significant bits):
  1 = 0000 0001    134 = 1000 0110    212 = 1101 0100      1 = 0000 0001
135 = 1000 0111    213 = 1101 0101    162 = 1010 0010    161 = 1010 0001
  2 = 0000 0010     44 = 0010 1100     41 = 0010 1001    221 = 1101 1101

Solution: Direct-Mapped Performance
Address breakdown:
C1 has no block offset and a 3-bit set index.
C2 has a 1-bit block offset and a 2-bit set index.
C3 has a 2-bit block offset and a 1-bit set index.
How to run a trace: extract the set index (3, 2, or 1 bits) from the low-order bits, just above the block offset; on a miss, load the whole block (1, 2, or 4 bytes).
Running C3: Get 1: miss, load bytes 0-3 into set 0. Get 134: miss, load 132-135 into set 1. Get 212: miss, load 212-215 into set 1 (evicting 132-135). Get 1: hit. Get 135: miss, load 132-135 into set 1. Get 213: miss, load 212-215 into set 1. Get 162: miss, load 160-163 into set 0. Get 161: hit. And so on.
Results per access (set index, hit/miss):
  Address   LSBs         C1     C2     C3
        1   0000 0001    1 m    0 m    0 m
      134   1000 0110    6 m    3 m    1 m
      212   1101 0100    4 m    2 m    1 m
        1   0000 0001    1 h    0 h    0 h
      135   1000 0111    7 m    3 h    1 m
      213   1101 0101    5 m    2 h    1 m
      162   1010 0010    2 m    1 m    0 m
      161   1010 0001    1 m    0 m    0 h
        2   0000 0010    2 m    1 m    0 m
       44   0010 1100    4 m    2 m    1 m
       41   0010 1001    1 m    0 m    0 m
      221   1101 1101    5 m    2 m    1 m
Miss rates:              11/12  9/12   10/12
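These counts are easy to check mechanically. A minimal simulator sketch for the three designs (direct-mapped, 8 bytes total, varying block size):

#include <stdbool.h>
#include <stdio.h>

#define CACHE_BYTES 8

/* Count the misses of a direct-mapped cache of CACHE_BYTES total
   with the given block size on the trace. */
int count_misses(const int *trace, int n, int block) {
    int sets = CACHE_BYTES / block;
    int tag[CACHE_BYTES];                /* at most 8 sets (block = 1) */
    bool valid[CACHE_BYTES] = { false };
    int misses = 0;
    for (int i = 0; i < n; i++) {
        int set = (trace[i] / block) % sets;
        int t = trace[i] / (block * sets);
        if (!(valid[set] && tag[set] == t)) {  /* miss: install the block */
            misses++;
            valid[set] = true;
            tag[set] = t;
        }
    }
    return misses;
}

int main(void) {
    int trace[] = { 1, 134, 212, 1, 135, 213, 162, 161, 2, 44, 41, 221 };
    for (int block = 1; block <= 4; block *= 2)  /* C1, C2, C3 */
        printf("block size %d: %d/12 misses\n", block,
               count_misses(trace, 12, block));
    return 0;
}

It reports 11, 9, and 10 misses, matching the table above.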

Solution: Performance
Average Access Time = (Hit Time) + (Miss Rate) × (Miss Penalty)
In terms of miss rate, C2 is best (9/12).
If the miss stall time is 25 cycles, and C1 has an access time of 2 cycles, C2 takes 3 cycles, and C3 takes 5 cycles, which is the best cache design? (Every access, hit or miss, requires an access to the cache.)
For C1, average access time = 2 + (11/12) × 25 ≈ 24.92 cycles.
For C2, average access time = 3 + (9/12) × 25 = 21.75 cycles.
For C3, average access time = 5 + (10/12) × 25 ≈ 25.83 cycles.
C2 is the best design by this metric as well.
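The same arithmetic in a few lines of C, reusing the miss counts from the simulator above:

#include <stdio.h>

int main(void) {
    /* Average access time = hit time + miss rate * miss penalty. */
    double hit_time[] = { 2, 3, 5 };    /* cycles for C1, C2, C3 */
    double misses[]   = { 11, 9, 10 };  /* out of 12 accesses    */
    for (int i = 0; i < 3; i++)
        printf("C%d: %.2f cycles\n", i + 1,
               hit_time[i] + misses[i] / 12.0 * 25.0);
    return 0;
}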