Caches (Writing) Hakim Weatherspoon CS 3410, Spring 2012 Computer Science Cornell University. P & H Chapter 5.2-3, 5.5

Administrivia. Lab due next Monday, April …th. HW due next Monday, April …th.

Goals for Today. Cache parameter tradeoffs. Cache-conscious programming. Writing to the cache: write-through vs. write-back.

Cache Design Tradeoffs

Cache Design. Need to determine parameters: cache size; block size (aka line size); number of ways of set associativity (1, N, ∞); eviction policy; number of levels of caching, and the parameters for each; separate I-cache from D-cache, or a unified cache; prefetching policies / instructions; write policy.
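To make the parameter list concrete, here is a hedged sketch (all struct, enum and field names, and the example values, are hypothetical, not from the lecture) of a configuration record that captures these choices:

#include <stdbool.h>
#include <stddef.h>

enum write_policy    { WRITE_THROUGH, WRITE_BACK };
enum allocate_policy { WRITE_ALLOCATE, NO_WRITE_ALLOCATE };
enum eviction_policy { EVICT_RANDOM, EVICT_ROUND_ROBIN, EVICT_FIFO, EVICT_LRU };

/* One level of the hierarchy, described by the parameters listed above. */
struct cache_config {
    size_t total_size;           /* cache size in bytes                  */
    size_t block_size;           /* block (line) size in bytes           */
    unsigned ways;               /* set associativity: 1 = direct-mapped */
    enum eviction_policy evict;  /* which line to replace                */
    enum write_policy write;     /* write-through vs. write-back         */
    enum allocate_policy alloc;  /* write-allocate vs. no-write-allocate */
    bool unified;                /* unified vs. split I/D cache          */
    bool prefetch;               /* hardware prefetching enabled?        */
};

/* Example: a 32 KB, 8-way, 64-byte-line, write-back L1 data cache
 * (illustrative numbers only). */
const struct cache_config example_l1d = {
    .total_size = 32 * 1024, .block_size = 64, .ways = 8,
    .evict = EVICT_LRU, .write = WRITE_BACK, .alloc = WRITE_ALLOCATE,
    .unified = false, .prefetch = true,
};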

A Real Example
> dmidecode -t cache
Cache Information
Configuration: Enabled, Not Socketed, Level 1
Operational Mode: Write Back
Installed Size: … KB
Error Correction Type: None
Cache Information
Configuration: Enabled, Not Socketed, Level 2
Operational Mode: Varies With Memory Address
Installed Size: … KB
Error Correction Type: Single-bit ECC
> cd /sys/devices/system/cpu/cpu0; grep cache/*/*
cache/index0/level: 1
cache/index0/type: Data
cache/index0/ways_of_associativity: …
cache/index0/number_of_sets: …
cache/index0/coherency_line_size: …
cache/index0/size: …K
cache/index1/level: 1
cache/index1/type: Instruction
cache/index1/ways_of_associativity: …
cache/index1/number_of_sets: …
cache/index1/coherency_line_size: …
cache/index1/size: …K
cache/index2/level: 2
cache/index2/type: Unified
cache/index2/shared_cpu_list: …
cache/index2/ways_of_associativity: …
cache/index2/number_of_sets: …
cache/index2/coherency_line_size: …
cache/index2/size: …K
Dual-core … GHz Intel (purchased in …)

A Real Example. Dual …K L1 instruction caches: …-way set associative, … sets, …-byte line size. Dual …K L1 data caches: same as above. Single … L2 unified cache: …-way set associative (!!!), … sets, …-byte line size. … GB main memory, … TB disk. Dual-core … GHz Intel (purchased in …).

Basic Cache Organization. Q: How to decide the block size? A: Try it and see. But: it depends on cache size, workload, associativity, … Experimental approach!

Experimental Results

Tradeoffs. For a given total cache size, larger block sizes mean: fewer lines, so fewer tags (and smaller tags for associative caches), so less overhead; and fewer cold misses (within-block "prefetching"). But they also mean fewer blocks available (for scattered accesses!), so more conflicts, and a larger miss penalty (the time to fetch a block).
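For concreteness, a small sketch of the arithmetic (the 32 KB cache size and the candidate block sizes are assumptions, not values from the slides): halving the block size doubles the number of lines, and therefore the number of tags that must be stored and compared.

#include <stdio.h>

int main(void) {
    const unsigned cache_bytes   = 32 * 1024;           /* assumed cache size  */
    const unsigned block_sizes[] = { 16, 32, 64, 128 }; /* assumed candidates  */

    for (unsigned i = 0; i < sizeof block_sizes / sizeof block_sizes[0]; i++) {
        unsigned lines = cache_bytes / block_sizes[i];
        printf("block = %3u B -> %4u lines (one tag each)\n",
               block_sizes[i], lines);
    }
    /* 16 B blocks -> 2048 lines; 128 B blocks -> 256 lines. Larger blocks
     * mean fewer tags (less overhead) and better within-block prefetching,
     * but fewer distinct blocks for scattered accesses, hence more
     * conflicts and a larger penalty per miss. */
    return 0;
}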

Cache-Conscious Programming

Cache-Conscious Programming. // H = …, W = … int A[H][W]; for (x = 0; x < W; x++) for (y = 0; y < H; y++) sum += A[y][x]; Every access is a cache miss! (unless the entire matrix can fit in the cache)

Cache-Conscious Programming. // H = …, W = … int A[H][W]; for (y = 0; y < H; y++) for (x = 0; x < W; x++) sum += A[y][x]; Block size …: …% hit rate. Block size …: ….% hit rate. Block size …: ….% hit rate. And you can easily prefetch to warm the cache.
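The two loop orders above, made self-contained for experimentation; H and W are hypothetical values chosen here (the slide's matrix dimensions and hit-rate numbers are not reproduced), and the comments spell out why the traversal order matters.

#include <stdio.h>

#define H 1000   /* hypothetical matrix dimensions */
#define W 1000

static int A[H][W];

int main(void) {
    long sum = 0;

    /* Column-major traversal: consecutive accesses are W * sizeof(int)
     * bytes apart, so unless the entire matrix fits in the cache,
     * essentially every access lands in a different block -> mostly misses. */
    for (int x = 0; x < W; x++)
        for (int y = 0; y < H; y++)
            sum += A[y][x];

    /* Row-major traversal: consecutive accesses are adjacent in memory,
     * so a block holding B ints costs one miss followed by B - 1 hits,
     * and the next block is trivial to prefetch. */
    for (int y = 0; y < H; y++)
        for (int x = 0; x < W; x++)
            sum += A[y][x];

    printf("sum = %ld\n", sum);
    return 0;
}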

Writing with Caches

Eviction. Which cache line should be evicted from the cache to make room for a new line? Direct-mapped: no choice, you must evict the line selected by the index. Associative caches: random (select one of the lines at random), round-robin (similar to random), FIFO (replace the oldest line), or LRU (replace the line that has not been used for the longest time).
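A minimal sketch of LRU victim selection for one set (the struct layout, the last_used timestamp, and the example values are hypothetical; real hardware usually approximates LRU with a few status bits rather than full timestamps):

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

struct line { bool valid; uint64_t tag; uint64_t last_used; };

/* Pick a victim in an N-way set: prefer an invalid line (no eviction
 * needed), otherwise the least-recently-used one. */
unsigned choose_victim_lru(const struct line *set, unsigned ways) {
    unsigned victim = 0;
    for (unsigned i = 0; i < ways; i++) {
        if (!set[i].valid)
            return i;
        if (set[i].last_used < set[victim].last_used)
            victim = i;
    }
    return victim;
}

int main(void) {
    struct line set[4] = {
        { true, 0x1, 10 }, { true, 0x2, 3 }, { true, 0x3, 7 }, { true, 0x4, 12 },
    };
    printf("victim way = %u\n", choose_victim_lru(set, 4));  /* prints 1 */
    return 0;
}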

Cached Write Policies. Q: How to write data? (diagram: the CPU sends addr and data to the cache (SRAM), which talks to Memory (DRAM)) If the data is already in the cache: No-Write: writes invalidate the cache and go directly to memory. Write-Through: writes go to main memory and to the cache. Write-Back: the CPU writes only to the cache; the cache writes to main memory later (when the block is evicted).

What about Stores? Where should you write the result of a store? If that memory location is in the cache: send it to the cache. Should we also send it to memory right away (write-through policy), or wait until we kick the block out (write-back policy)? If it is not in the cache: allocate the line, i.e. put it in the cache (write-allocate policy), or write it directly to memory without allocation (no-write-allocate policy)?

Write Allocation Policies. Q: How to write data? (diagram: the CPU sends addr and data to the cache (SRAM), which talks to Memory (DRAM)) If the data is not in the cache: Write-Allocate: allocate a cache line for the new data (and maybe write through). No-Write-Allocate: ignore the cache, just go to main memory.
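Putting the last three slides together, here is a hedged sketch of how a single store might be routed; every helper below (cache_lookup, cache_fill, cache_write, cache_mark_dirty, mem_write) is a hypothetical stand-in, printed rather than implemented, just to show the decision tree.

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical stand-ins for the cache and memory operations. */
static bool cache_lookup(uint64_t a)           { (void)a; return false; }
static void cache_fill(uint64_t a)             { printf("fill line %#llx\n", (unsigned long long)a); }
static void cache_write(uint64_t a, uint8_t v) { printf("cache  %#llx <- %u\n", (unsigned long long)a, v); }
static void cache_mark_dirty(uint64_t a)       { printf("mark   %#llx dirty\n", (unsigned long long)a); }
static void mem_write(uint64_t a, uint8_t v)   { printf("memory %#llx <- %u\n", (unsigned long long)a, v); }

/* One store, routed by write policy (write-back vs. write-through) and
 * allocation policy (write-allocate vs. no-write-allocate). */
static void handle_store(uint64_t addr, uint8_t v,
                         bool write_back, bool write_allocate) {
    if (cache_lookup(addr)) {                   /* store hits the cache       */
        cache_write(addr, v);
        if (write_back) cache_mark_dirty(addr); /* memory updated on eviction */
        else            mem_write(addr, v);     /* write-through: update now  */
    } else if (write_allocate) {                /* miss: bring the line in    */
        cache_fill(addr);
        cache_write(addr, v);
        if (write_back) cache_mark_dirty(addr);
        else            mem_write(addr, v);
    } else {                                    /* miss, no-write-allocate    */
        mem_write(addr, v);                     /* bypass the cache entirely  */
    }
}

int main(void) {
    handle_store(0x1000, 7, /*write_back=*/true, /*write_allocate=*/true);
    return 0;
}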

Handling Stores (Write-Through). Using byte addresses in this example! The address bus is … bits wide. Assume a write-allocate policy. A short sequence of LB and SB references is run against a small, fully associative cache with two-word blocks; each line has a V bit, a tag field and a data field, and Misses and Hits are tallied against the Memory contents shown alongside.

Write-Through (REF 1 through 7): step-by-step trace of the reference sequence, one reference per slide, updating each line's V bit, tag, data and LRU state and tallying Misses and Hits as it goes; the later stores that find their block already resident are marked H (hit), but under write-through every store still sends its data to memory.

How Many Memory References? Write-through performance: every miss (read or write) reads a block from memory, so misses cost memory reads; every store writes its item to memory, so stores cost memory writes; evictions don't need to write to memory, so there is no need for a dirty bit.

Write-Through (REF,) emory LB $ [ ] LB $ [ ] SB $ [ ] SB $ [ ] LB $ [ ] SB $ [ ] SB $ [ ] SB $ [ ] SB $ [ ] $ $ $ $ H H H lru V tag data isses: Hits:

Write-Through (REF,) emory LB $ [ ] LB $ [ ] SB $ [ ] SB $ [ ] LB $ [ ] SB $ [ ] SB $ [ ] SB $ [ ] SB $ [ ] $ $ $ $ H H H H H lru V tag data isses: Hits:

Write-Through vs. Write-Back. Can we also design the cache NOT to write all stores immediately to memory? Keep the most current copy in the cache, and update memory when that data is evicted (write-back policy). Do we need to write back all evicted lines? No, only blocks that have been stored into (written).

Write-Back Meta-data. Each line holds V, D, Tag, and Bytes 1 … N. V = 1 means the line has valid data. D = 1 means the bytes are newer than main memory. When allocating a line: set V = 1, D = 0, fill in the Tag and Data. When writing a line: set D = 1. When evicting a line: if D = 0, just set V = 0; if D = 1, write back the Data, then set D = 0, V = 0.
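A minimal sketch of that metadata and of the three transitions (allocate, write, evict); the struct layout, block size and helper names are hypothetical, but the V/D logic follows the slide.

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define BLOCK_BYTES 8                    /* assumed two-word block */

struct cache_line {
    bool     valid;                      /* V = 1: line holds valid data        */
    bool     dirty;                      /* D = 1: bytes newer than main memory */
    uint64_t tag;
    uint8_t  data[BLOCK_BYTES];
};

/* Hypothetical stand-in for writing one block back to memory. */
static void mem_write_block(const struct cache_line *l) {
    printf("write back block with tag %#llx\n", (unsigned long long)l->tag);
}

static void line_allocate(struct cache_line *l, uint64_t tag) {
    l->valid = true;                     /* V = 1, D = 0, fill in Tag (and Data) */
    l->dirty = false;
    l->tag   = tag;
}

static void line_write(struct cache_line *l) {
    l->dirty = true;                     /* a store just sets D = 1 */
}

static void line_evict(struct cache_line *l) {
    if (l->dirty)
        mem_write_block(l);              /* D = 1: write the Data back first */
    l->dirty = false;                    /* then clear D and V               */
    l->valid = false;
}

int main(void) {
    struct cache_line l = {0};
    line_allocate(&l, 0x2a);
    line_write(&l);                      /* the line is now dirty             */
    line_evict(&l);                      /* so eviction triggers a write-back */
    return 0;
}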

Handling Stores (Write-Back). Using byte addresses in this example! The address bus is … bits wide. Assume a write-allocate policy. The same LB/SB reference sequence is run against the same fully associative, two-word-block cache, but now each line also carries a dirty (d) bit alongside V, the tag and the data; Misses and Hits are tallied against the Memory contents shown alongside.

Write-Back (REF 1 through 7): step-by-step trace of the same reference sequence, now also updating each line's dirty (d) bit; stores that hit simply set d on the cached line instead of writing to memory, and a dirty line's data goes back to memory only when that line is evicted.

How Many Memory References? Write-back performance: every miss (read or write) reads a block from memory, so misses cost memory reads; only some evictions write a block to memory: each dirty eviction costs memory writes (plus more dirty evictions, and hence more memory writes, later).

How many memory references? Each miss reads a block (two words in this cache). Each evicted dirty cache line writes a block. Total reads: six words. Total writes: …/… words (after the final eviction).

Write-Back (REF,) emory LB $ [ ] LB $ [ ] SB $ [ ] SB $ [ ] LB $ [ ] SB $ [ ] SB $ [ ] SB $ [ ] SB $ [ ] $ $ $ $ H H H lru V d tag data isses: Hits:

Write-Back (REF,) emory LB $ [ ] LB $ [ ] SB $ [ ] SB $ [ ] LB $ [ ] SB $ [ ] SB $ [ ] SB $ [ ] SB $ [ ] $ $ $ $ H H H H H lru V d tag data isses: Hits:

How Many Memory References? Write-back performance: every miss (read or write) reads a block from memory (misses cost memory reads), and only some evictions write a block to memory (dirty evictions cost memory writes, plus more dirty evictions and memory writes later). By comparison, write-through was: reads, eight words; writes, …/…/… etc. words, one per store. Write-through or write-back?

Write-through vs. Write-back. Write-through is slower, but cleaner (memory is always consistent). Write-back is faster, but complicated when multiple cores share memory.

Performance: An Example. Write-back versus write-through. Assume: a large associative cache and …-byte lines. for (i = 0; i < n; i++) A[0] += A[i]; for (i = 0; i < n; i++) B[i] = A[i];
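A back-of-the-envelope count of the memory traffic for those two loops, assuming 4-byte ints, 16-byte lines (4 words per line) and write-allocate; all of these numbers are assumptions for illustration, not the slide's.

#include <stdio.h>

int main(void) {
    const long n = 1 << 20;                 /* hypothetical array length       */
    const long words_per_line = 16 / 4;     /* assumed 16-byte lines, 4 B ints */

    /* Loop 1: for (i = 0; i < n; i++) A[0] += A[i];
     * A is streamed in once; A[0] is stored on every iteration. */
    long l1_reads = n / words_per_line;     /* line fills for A                */
    long l1_wt    = n;                      /* write-through: one memory write
                                               per store to A[0]               */
    long l1_wb    = 1;                      /* write-back: A[0]'s line written
                                               back once, on eviction          */

    /* Loop 2: for (i = 0; i < n; i++) B[i] = A[i];
     * A is streamed in; with write-allocate, B's lines are filled too. */
    long l2_reads = 2 * n / words_per_line;
    long l2_wt    = n;                      /* one memory write per store      */
    long l2_wb    = n / words_per_line;     /* one write-back per dirty B line:
                                               the same words, but batched into
                                               far fewer memory transactions   */

    printf("loop1: reads=%ld  WT writes=%ld  WB write-backs=%ld\n",
           l1_reads, l1_wt, l1_wb);
    printf("loop2: reads=%ld  WT writes=%ld  WB write-backs=%ld\n",
           l2_reads, l2_wt, l2_wb);
    return 0;
}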

Performance Tradeoffs. Q: Hit time: write-through vs. write-back? A: Write-through is slower on writes. Q: Miss penalty: write-through vs. write-back? A: Write-back is slower on evictions.

Write Buffering. Q: Writes to main memory are slow! A: Use a write-back buffer: a small queue holding dirty lines. Add to the end upon eviction; remove from the front upon completion. Q: What does it help? A: Short bursts of writes (but not sustained writes). A: Fast eviction reduces the miss penalty.
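A minimal sketch of such a buffer as a small FIFO of dirty lines (the sizes, struct names and ring-buffer implementation are hypothetical; real write buffers are also searched on loads so that a pending write is not missed):

#include <stdbool.h>
#include <stdint.h>

#define WB_ENTRIES  8                    /* assumed: a small queue */
#define BLOCK_BYTES 16                   /* assumed block size     */

struct wb_entry { uint64_t addr; uint8_t data[BLOCK_BYTES]; };

/* Evicted dirty lines are appended at the tail and drained from the
 * head as their memory writes complete. */
struct write_buffer {
    struct wb_entry q[WB_ENTRIES];
    unsigned head, tail, count;
};

/* On eviction of a dirty line: queue it and let the cache move on;
 * the eviction only stalls if the buffer is full (sustained writes). */
bool wb_push(struct write_buffer *wb, const struct wb_entry *e) {
    if (wb->count == WB_ENTRIES)
        return false;                    /* full: must wait for memory */
    wb->q[wb->tail] = *e;
    wb->tail = (wb->tail + 1) % WB_ENTRIES;
    wb->count++;
    return true;                         /* miss penalty hidden        */
}

/* When a memory write completes: retire the oldest entry. */
bool wb_pop(struct write_buffer *wb, struct wb_entry *out) {
    if (wb->count == 0)
        return false;
    *out = wb->q[wb->head];
    wb->head = (wb->head + 1) % WB_ENTRIES;
    wb->count--;
    return true;
}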

Write-through vs. Write-back. Write-through is slower, but simpler (memory is always consistent). Write-back is almost always faster (the write-back buffer hides the large eviction cost), but what about multiple cores with separate caches sharing memory? Write-back requires a cache coherency protocol: inconsistent views of memory, the need to snoop in each other's caches, extremely complex protocols, very hard to get right.

Cache-coherency. Q: Multiple readers and writers? A: Potentially inconsistent views of memory. (diagram: four CPUs, each with private caches, a shared lower-level cache, the interconnect, Memory and disk) A cache coherency protocol may need to snoop on other CPUs' cache activity: invalidate a cache line when another CPU writes it, and flush write-back caches before another CPU reads (or the reverse: before writing/reading). Extremely complex protocols, very hard to get right.
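A grossly simplified sketch of the "invalidate on a remote write" step named above, for a single set of a private cache; real protocols (MESI and its relatives) track several states per line and many more transitions, so this only shows the idea, with hypothetical types and names.

#include <stdbool.h>
#include <stdint.h>

#define WAYS 4

struct line  { bool valid; uint64_t tag; };
struct cache { struct line set[WAYS]; };   /* one set of a private cache */

/* Snooped bus write: another CPU wrote this block, so any copy we hold
 * is now stale and must be invalidated (it will be re-fetched on the
 * next access). */
void snoop_remote_write(struct cache *c, uint64_t tag) {
    for (int i = 0; i < WAYS; i++)
        if (c->set[i].valid && c->set[i].tag == tag)
            c->set[i].valid = false;
}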

Summary: caching assumptions. Small working set: the 90/10 rule. Can predict the future: spatial & temporal locality. Benefits: (big & fast) built from (big & slow) + (small & fast). Tradeoffs: associativity, line size, hit cost, miss penalty, hit rate.

Summary. Memory performance matters! Often more than CPU performance, because it is the bottleneck and is not improving much, and because most programs move a LOT of data. The design space is huge, and it is gambling against program behavior; it cuts across all layers: users, programs, OS, hardware. Multi-core / multi-cache is complicated: inconsistent views of memory, extremely complex protocols, very hard to get right.