Caches Concepts Review


Caches Concepts Review. What is a block address? Why not bring in just the data the processor needs? What is a set-associative cache? Write-through? Write-back? Then we'll see: the block allocation policy on a write miss, and cache performance.

Block addresses. Now, how can we figure out where data should be placed in the cache? It's time for block addresses! If the cache block size is 2^n bytes, we can conceptually split main memory into 2^n-byte chunks too. To determine the block address of byte address i, take the integer division i / 2^n. Our example has two-byte cache blocks, so we can think of a 16-byte main memory as an 8-block main memory instead. For instance, memory addresses 12 and 13 both correspond to block address 6, since 12 / 2 = 6 and 13 / 2 = 6. (The slide's figure maps the sixteen byte addresses 0-15 onto the eight block addresses 0-7.)
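As a quick check of the arithmetic, here is a minimal C sketch; it is not from the slides, and n = 1 (two-byte blocks) is chosen only to match the example above:

    #include <stdio.h>

    /* Sketch: with 2^n-byte blocks, the block address of byte address i is
     * the integer quotient i / 2^n (equivalently i >> n).                  */
    int main(void) {
        unsigned n = 1;                       /* two-byte blocks, as in the example */
        unsigned bytes[] = { 12, 13, 15 };
        for (int k = 0; k < 3; k++)
            printf("byte address %2u -> block address %u\n", bytes[k], bytes[k] >> n);
        return 0;
    }

Running this prints block address 6 for byte addresses 12 and 13, and block address 7 for byte address 15.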

Set associative caches are a general idea. By now you may have noticed that a 1-way set associative cache is the same as a direct-mapped cache. Similarly, if a cache has 2^k blocks, a 2^k-way set associative cache would be the same as a fully associative cache. For the 8-block cache in the figure: 1-way means 8 sets of 1 block each (direct mapped), 2-way means 4 sets of 2 blocks each, 4-way means 2 sets of 4 blocks each, and 8-way means a single set of 8 blocks (fully associative).
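The set arithmetic can be sketched in a few lines of C (a sketch, not from the slides; the block address 6 is just the block holding bytes 12-13 from the earlier example):

    #include <stdio.h>

    /* Sketch: for a cache with total_blocks blocks and `ways` blocks per set,
     * the number of sets is total_blocks / ways, and a block address maps to
     * set (block_address % num_sets).                                        */
    int main(void) {
        unsigned total_blocks = 8;
        unsigned block_addr = 6;
        for (unsigned ways = 1; ways <= total_blocks; ways *= 2) {
            unsigned num_sets = total_blocks / ways;
            printf("%u-way: %u set(s) of %u block(s); block %u -> set %u\n",
                   ways, num_sets, ways, block_addr, block_addr % num_sets);
        }
        return 0;
    }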

Write-through caches. A write-through cache forces all writes to update both the cache and the main memory. Example: the write Mem[214] = 21763 updates the cache line (index 110, V = 1, tag 11010, data 21763) and, at the same time, main memory (address 1101 0110, data 21763). This is simple to implement and keeps the cache and memory consistent. The bad thing is that by forcing every write to go to main memory, we use up bandwidth between the cache and the memory.

Write-back caches. In a write-back cache, the memory is not updated until the cache block needs to be replaced (e.g., when loading data into a full cache set). For example, we might write some data to the cache at first, leaving it inconsistent with the main memory as shown before. The cache block is marked dirty to indicate this inconsistency. Example: after the write Mem[214] = 21763, the cache line (index 110, V = 1, Dirty = 1, tag 11010) holds 21763, while main memory still holds the stale value 42803 at address 1101 0110 (another location, 1000 1110, holds 1225). What about subsequent reads to the same memory address? Multiple writes to the same block?

Write-back cache discussion. The advantage of write-back caches is that, unlike with write-through caches, not every write operation needs to access main memory. If a single address is frequently written to, then it doesn't pay to keep writing that data through to main memory. If several bytes within the same cache block are modified, they will only force one memory write operation at write-back time.
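The contrast between the two write-hit policies can be sketched with a toy, hypothetical one-line cache; the struct, function names, and the 256-entry memory array below are illustrative only, not from the slides:

    #include <stdbool.h>
    #include <stdio.h>

    typedef struct { bool valid, dirty; unsigned tag, data; } Line;

    unsigned memory[256];                     /* toy main memory */

    /* Write-through hit: update the cache and memory together. */
    void write_through_hit(Line *c, unsigned addr, unsigned val) {
        c->data = val;
        memory[addr] = val;                   /* every write uses memory bandwidth */
    }

    /* Write-back hit: update only the cache and set the dirty bit;
     * memory is written later, when the line is evicted.           */
    void write_back_hit(Line *c, unsigned addr, unsigned val) {
        (void)addr;
        c->data = val;
        c->dirty = true;
    }

    /* On eviction, a dirty line must be written back to memory. */
    void evict(Line *c, unsigned addr) {
        if (c->valid && c->dirty) memory[addr] = c->data;
        c->valid = c->dirty = false;
    }

    int main(void) {
        Line line = { .valid = true, .dirty = false, .tag = 0x1A, .data = 42803 };
        write_back_hit(&line, 214, 21763);    /* cache updated, memory now stale */
        printf("cache=%u memory=%u dirty=%d\n", line.data, memory[214], line.dirty);
        evict(&line, 214);                    /* the write to memory happens here */
        printf("after evict: memory=%u\n", memory[214]);
        return 0;
    }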

Write misses. A second scenario is if we try to write to an address that is not already contained in the cache; this is called a write miss. Let's say we want to store 21763 into Mem[1101 0110], but we find that address is not currently in the cache: the line at index 110 holds a different block (V = 1, tag 00010, data 123456), and memory address 1101 0110 currently holds 6378. When we update Mem[1101 0110], should we also load it into the cache?

Write-around caches (a.k.a. write-no-allocate). With a write-around policy, the write operation goes directly to main memory without affecting the cache. Example: after Mem[214] = 21763, the cache line at index 110 is unchanged (V = 1, tag 00010, data 123456), while memory address 1101 0110 now holds 21763. This is good when data is written but not immediately used again, in which case there's no point in loading it into the cache yet, as in: for (int i = 0; i < SIZE; i++) a[i] = i;

Allocate on write. An allocate-on-write strategy would instead load the newly written data into the cache. Example: after Mem[214] = 21763, the cache line at index 110 now holds the new block (V = 1, tag 11010, data 21763), and memory address 1101 0110 holds 21763 as well. If that data is needed again soon, it will be available in the cache.
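A companion sketch for the write-miss policies, again with made-up names and a toy memory array; assume the write itself also goes to memory (write-through style), so the two policies differ only in whether the cache is filled:

    #include <stdbool.h>
    #include <stdio.h>

    typedef struct { bool valid; unsigned tag, data; } Line;

    unsigned memory[256];                     /* toy main memory */

    void write_miss(Line *c, unsigned addr, unsigned tag, unsigned val,
                    bool allocate_on_write) {
        memory[addr] = val;                   /* the write itself */
        if (allocate_on_write) {              /* allocate-on-write: fill the cache too */
            c->valid = true;
            c->tag   = tag;
            c->data  = val;
        }                                     /* write-around: leave the cache untouched */
    }

    int main(void) {
        Line around = { false, 0, 0 }, alloc = { false, 0, 0 };
        write_miss(&around, 214, 0x1A, 21763, false);
        write_miss(&alloc,  214, 0x1A, 21763, true);
        printf("write-around:      cached=%d\n", around.valid);
        printf("allocate-on-write: cached=%d (data=%u)\n", alloc.valid, alloc.data);
        return 0;
    }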

Which is it? Given the following trace of accesses, can you determine whether the cache is write-allocate or write-no-allocate? Assume A and B are distinct and can be in the cache simultaneously. The trace and its outcomes: Load A: Miss, Store B: Miss, Store A: Hit, Load A: Hit, Load B: Miss, Load B: Hit, Load A: Hit.

Which is it? (answer) Load A: Miss, Store B: Miss, Store A: Hit, Load A: Hit, Load B: Miss, Load B: Hit, Load A: Hit. Answer: write-no-allocate. The giveaway is the first Load B: on a write-allocate cache, the earlier Store B miss would have brought B into the cache, so that Load B would have been a hit.
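A tiny simulation makes the same point. This sketch (not from the slides) replays the trace with a two-entry table that simply records whether A and B are resident, under both write-miss policies:

    #include <stdbool.h>
    #include <stdio.h>

    typedef struct { char op; char addr; } Access;

    static void replay(bool write_allocate) {
        Access trace[] = { {'L','A'}, {'S','B'}, {'S','A'}, {'L','A'},
                           {'L','B'}, {'L','B'}, {'L','A'} };
        bool cached[2] = { false, false };        /* [0] = A, [1] = B */
        printf(write_allocate ? "write-allocate:    " : "write-no-allocate: ");
        for (int i = 0; i < 7; i++) {
            int idx = trace[i].addr - 'A';
            bool hit = cached[idx];
            /* Fill on a read miss always; fill on a write miss only if allocating. */
            if (!hit && (trace[i].op == 'L' || write_allocate))
                cached[idx] = true;
            printf("%s ", hit ? "Hit" : "Miss");
        }
        printf("\n");
    }

    int main(void) {
        replay(false);   /* prints Miss Miss Hit Hit Miss Hit Hit, matching the slide */
        replay(true);    /* the first Load B becomes a hit                            */
        return 0;
    }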

First observations. Split instruction/data caches. Pro: no structural hazard between the IF and MEM stages, whereas a single-ported unified cache would stall instruction fetch during a load or store. Con: static partitioning of the cache between instructions and data, which is bad if the working sets are unequal (e.g., much more data than code, or the reverse). Cache hierarchies: a trade-off between access time and hit rate. The L1 cache can focus on fast access time (with an okay hit rate), while the L2 cache can focus on a good hit rate (with an okay access time). Such hierarchical design is another big idea; we'll see this in section. (Figure: CPU connected to an L1 cache, then an L2 cache, then main memory.)

Opteron vital statistics. L1 caches (separate instruction and data): 64 KB each, 64-byte blocks, 2-way set associative, 2-cycle access time. L2 cache: 1 MB, 64-byte blocks, 4-way set associative, 16-cycle access time (total, not just the miss penalty). Main memory: 200+ cycle access time.
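To see how these latencies combine, here is a sketch of average memory access time through the two-level hierarchy. The slide gives no miss rates, so the 5% L1 and 20% L2 miss rates below are invented purely for illustration:

    #include <stdio.h>

    int main(void) {
        double l1_time = 2.0, l2_time = 16.0, mem_time = 200.0;   /* cycles, from the slide */
        double l1_miss = 0.05, l2_miss = 0.20;                    /* assumed, for illustration */

        /* Each level's miss penalty is the average access time of the level
         * below it, so the formula nests.                                   */
        double amat = l1_time + l1_miss * (l2_time + l2_miss * mem_time);
        printf("AMAT = %.2f cycles\n", amat);   /* 2 + 0.05 * (16 + 0.2 * 200) = 4.8 */
        return 0;
    }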

Comparing cache organizations. Like many architectural features, caches are evaluated experimentally. As always, performance depends on the actual instruction mix, since different programs will have different memory access patterns. Simulating or executing real applications is the most accurate way to measure performance characteristics. The graphs on the next few slides illustrate the simulated miss rates for several different cache designs. Again, lower miss rates are generally better, but remember that the miss rate is just one component of average memory access time and execution time. You'll probably do some cache simulations if you take CS433.

Associativity tradeoffs and miss rates. As we saw last time, higher associativity means more complex hardware. But a highly associative cache will also exhibit a lower miss rate: each set has more blocks, so there's less chance of a conflict between two addresses that both belong in the same set. Overall, this will reduce AMAT and memory stall cycles. The textbook shows the miss rates decreasing as the associativity increases. (Figure: miss rate versus associativity, from one-way to eight-way.)

Cache size and miss rates. The cache size also has a significant impact on performance: the larger a cache is, the less chance there will be of a conflict. Again, this means the miss rate decreases, so the AMAT and the number of memory stall cycles also decrease. The complete Figure 7.29 depicts the miss rate as a function of both the cache size (1 KB to 8 KB) and its associativity (one-way to eight-way).

Block size and miss rates. Finally, Figure 7.12 on p. 559 shows miss rates relative to the block size and the overall cache size. Smaller blocks do not take maximum advantage of spatial locality. But if blocks are too large, there will be fewer blocks available, and more potential misses due to conflicts. (Figure: miss rate versus block size of 4 to 256 bytes, for cache sizes of 1 KB, 8 KB, 16 KB, and 64 KB.)

Memory and overall performance. How do cache hits and misses affect overall system performance? Assuming a hit time of one CPU clock cycle, program execution will continue normally on a cache hit. (Our earlier computations always assumed one clock cycle for an instruction fetch or data access.) For cache misses, we'll assume the CPU must stall to wait for a load from main memory. The total number of stall cycles depends on the number of cache misses and the miss penalty: Memory stall cycles = Memory accesses x Miss rate x Miss penalty. To include stalls due to cache misses in CPU performance equations, we have to add them to the base number of execution cycles: CPU time = (CPU execution cycles + Memory stall cycles) x Cycle time.

Performance example. Assume that 33% of the instructions in a program are data accesses. The cache hit ratio is 97% and the hit time is one cycle, but the miss penalty is 20 cycles. Memory stall cycles = Memory accesses x Miss rate x Miss penalty = 0.33 I x 0.03 x 20 cycles, or about 0.2 I cycles. So if I instructions are executed, roughly 0.2 I cycles are wasted on stalls, and this code is 1.2 times slower than a program with a perfect CPI of 1.

Memory systems are a bottleneck. CPU time = (CPU execution cycles + Memory stall cycles) x Cycle time. Processor performance traditionally outpaces memory performance, so the memory system is often the system bottleneck. For example, with a base CPI of 1, the CPU time from the last page is: CPU time = (I + 0.2 I) x Cycle time. What if we could double the CPU performance so the CPI becomes 0.5, but memory performance remained the same? CPU time = (0.5 I + 0.2 I) x Cycle time. The overall CPU time improves by just 1.2/0.7 = 1.7 times! Refer back to Amdahl's Law from textbook page 101: speeding up only part of a system has diminishing returns.
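Plugging the numbers from the last two slides into a short C program (a sketch; the instruction count I is left symbolic, so everything is expressed per instruction):

    #include <stdio.h>

    int main(void) {
        double data_accesses = 0.33;   /* data accesses per instruction */
        double miss_rate     = 0.03;   /* 1 - 0.97 hit ratio            */
        double miss_penalty  = 20.0;   /* cycles                        */
        double base_cpi      = 1.0;

        double stall_cpi = data_accesses * miss_rate * miss_penalty;   /* ~0.2 */
        printf("stall cycles per instruction = %.3f\n", stall_cpi);
        printf("slowdown vs. perfect cache   = %.2fx\n",
               (base_cpi + stall_cpi) / base_cpi);                     /* ~1.2 */

        /* Doubling CPU performance (base CPI 0.5) with the same memory system
         * improves overall time by only about (1 + 0.2) / (0.5 + 0.2) = 1.7x. */
        printf("speedup with base CPI 0.5    = %.2fx\n",
               (base_cpi + stall_cpi) / (0.5 + stall_cpi));
        return 0;
    }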

Basic main memory design. There are some ways the main memory can be organized to reduce miss penalties and help with caching. For some concrete examples, let's assume the following three steps are taken when a cache needs to load data from the main memory: 1. It takes 1 cycle to send an address to the RAM. 2. There is a 15-cycle latency for each RAM access. 3. It takes 1 cycle to return data from the RAM. In the setup shown here, the buses from the CPU to the cache and from the cache to RAM are all one word wide. If the cache has one-word blocks, then filling a block from RAM (i.e., the miss penalty) would take 1 + 15 + 1 = 17 clock cycles: the cache controller has to send the desired address to the RAM, wait, and receive the data.

Miss penalties for larger cache blocks. If the cache has four-word blocks, then loading a single block would need four individual main memory accesses, and a miss penalty of 4 x (1 + 15 + 1) = 68 clock cycles!

A wider memory. A simple way to decrease the miss penalty is to widen the memory and its interface to the cache, so we can read multiple words from RAM in one shot. If we could read four words from the memory at once, a four-word cache load would need just 1 + 15 + 1 = 17 cycles. The disadvantage is the cost of the wider buses: each additional bit of memory width requires another connection to the cache.

An interleaved memory. Another approach is to interleave the memory, or split it into banks that can be accessed individually. The main benefit is overlapping the latencies of accessing each word. For example, if our main memory has four banks, each one byte wide, then we could load four bytes into a cache block in just 1 + 15 + (4 x 1) = 20 cycles. Our buses are still one byte wide here, so four cycles are needed to transfer the data to the cache. This is cheaper than implementing a four-byte bus, but not too much slower. (Figure: CPU, cache, and a main memory split into Bank 0 through Bank 3.)

Interleaved memory accesses. The slide's timing diagram shows four loads ("Load word 1" through "Load word 4") overlapped along a clock-cycle axis. The magenta cycles represent sending an address to a memory bank, each memory bank then has a 15-cycle latency, and it takes another cycle (shown in blue) to return data from the memory. This is the same basic idea as pipelining! As soon as we request data from one memory bank, we can go ahead and request data from another bank as well. Each individual load takes 17 clock cycles, but four overlapped loads require just 20 cycles.
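A short sketch comparing the three organizations under the slide's timing model (1 cycle to send the address, 15 cycles of RAM latency, 1 cycle per bus transfer; the variable names are just illustrative):

    #include <stdio.h>

    /* Miss penalty, in cycles, to fill a 4-word block under each organization. */
    int main(void) {
        int send = 1, latency = 15, xfer = 1, words = 4;

        int narrow      = words * (send + latency + xfer);      /* 4 separate accesses    */
        int wide        = send + latency + xfer;                /* 4-word-wide memory/bus */
        int interleaved = send + latency + words * xfer;        /* 4 banks, 1-word bus    */

        printf("narrow bus:   %d cycles\n", narrow);        /* 68 */
        printf("wide memory:  %d cycles\n", wide);          /* 17 */
        printf("interleaved:  %d cycles\n", interleaved);   /* 20 */
        return 0;
    }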

Which is better? Increasing the block size can improve the hit rate (due to spatial locality), but the transfer time increases. Which cache configuration would be better? Cache #1 has 32-byte blocks and a 5% miss rate; Cache #2 has 64-byte blocks and a 4% miss rate. Assume both caches have single-cycle hit times. Memory accesses take 15 cycles, and the memory bus is 8 bytes wide; i.e., a 16-byte memory access takes 18 cycles: 1 (send address) + 15 (memory access) + 2 (two 8-byte transfers). Recall: AMAT = Hit time + (Miss rate x Miss penalty). Cache #1: Miss penalty = 1 + 15 + 32B/8B = 20 cycles, so AMAT = 1 + (0.05 x 20) = 2. Cache #2: Miss penalty = 1 + 15 + 64B/8B = 24 cycles, so AMAT = 1 + (0.04 x 24) = 1.96. Cache #2 comes out slightly ahead.
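The same comparison as a runnable sketch; the amat helper below just encodes the slide's timing model and numbers:

    #include <stdio.h>

    static double amat(double hit_time, double miss_rate, int block_bytes) {
        int bus_bytes = 8;                                   /* 8-byte memory bus */
        double miss_penalty = 1 + 15 + (double)block_bytes / bus_bytes;
        return hit_time + miss_rate * miss_penalty;
    }

    int main(void) {
        printf("Cache #1 (32B blocks, 5%% misses): AMAT = %.2f cycles\n",
               amat(1.0, 0.05, 32));   /* 1 + 0.05 * 20 = 2.00 */
        printf("Cache #2 (64B blocks, 4%% misses): AMAT = %.2f cycles\n",
               amat(1.0, 0.04, 64));   /* 1 + 0.04 * 24 = 1.96 */
        return 0;
    }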

Summary. Writing to a cache poses a couple of interesting issues. Write-through and write-back policies keep the cache consistent with main memory in different ways for write hits. Write-around and allocate-on-write are two strategies to handle write misses, differing in whether updated data is loaded into the cache. Memory system performance depends upon the cache hit time, miss rate, and miss penalty, as well as the actual program being executed. We can use these numbers to find the average memory access time, and we can also revise our CPU time formula to include stall cycles: AMAT = Hit time + (Miss rate x Miss penalty); Memory stall cycles = Memory accesses x Miss rate x Miss penalty; CPU time = (CPU execution cycles + Memory stall cycles) x Cycle time. The organization of a memory system affects its performance. The cache size, block size, and associativity affect the miss rate. We can organize the main memory to help reduce miss penalties; for example, interleaved memory supports pipelined data accesses.

Writing Cache-Friendly Code. Two major rules: repeated references to data are good (temporal locality), and stride-1 reference patterns are good (spatial locality). Example: cold cache, 4-byte words, 4-word cache blocks.

    int sum_array_rows(int a[M][N]) {
        int i, j, sum = 0;
        for (i = 0; i < M; i++)
            for (j = 0; j < N; j++)
                sum += a[i][j];
        return sum;
    }

    int sum_array_cols(int a[M][N]) {
        int i, j, sum = 0;
        for (j = 0; j < N; j++)
            for (i = 0; i < M; i++)
                sum += a[i][j];
        return sum;
    }

Miss rate for sum_array_rows = 1/4 = 25%; miss rate for sum_array_cols = 100%. Adapted from Randy Bryant.