CS24: INTRODUCTION TO COMPUTING SYSTEMS Spring 2015 Lecture 15

LAST TIME
- Discussed concepts of locality and stride
  - Spatial locality: programs tend to access values near values they have already accessed
  - Temporal locality: programs tend to access a particular value multiple times
  - A stride-k reference pattern means the program accesses values every k steps (see the sketch below)
- Covered basic cache designs and trade-offs
  - The cache holds blocks of memory of a specific size
  - Cache lines contain blocks, plus bookkeeping info: a unique tag associated with each block, and a valid flag
  - Cache sets allow us to simplify our cache designs: each address maps to exactly one cache set
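To make the stride idea concrete, here is a minimal C sketch (a hypothetical helper, not from the lecture): with k = 1 the loop touches every element of each cached block, while a large k touches only one element per block before moving on.

    #include <stddef.h>

    /* Hypothetical helper illustrating a stride-k reference pattern:
     * k = 1 gives good spatial locality (every element of each cached
     * block gets used); a large k wastes most of each fetched block. */
    long sum_stride(const long *a, size_t n, size_t k) {
        long sum = 0;
        for (size_t i = 0; i < n; i += k)
            sum += a[i];             /* one access every k elements */
        return sum;
    }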

DIRECT-MAPPED CACHES
- Direct-mapped caches have S sets, but each set contains only one cache line (E = 1)
- An m-bit address divides into three fields (decomposed in the sketch below):
  - tag: t = m - (s + b) bits
  - set index: s bits
  - block offset: b bits
- Example: direct-mapped cache with 4 sets (2 bits in the set index)
  [Figure: a direct-mapped cache with 4 sets; each of Set 0 through Set 3 holds one line with a valid bit, a tag, and a block]
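To make the field decomposition concrete, a minimal C sketch (the field widths are assumptions matching the 4-set example with 64-byte blocks; real hardware does this with wiring, not code):

    #include <stdint.h>
    #include <stdio.h>

    #define S_BITS 2   /* s: 4 sets          (assumed) */
    #define B_BITS 6   /* b: 64-byte blocks  (assumed) */

    int main(void) {
        uint64_t addr   = 0x7f3a9c;
        uint64_t offset = addr & ((1u << B_BITS) - 1);
        uint64_t set    = (addr >> B_BITS) & ((1u << S_BITS) - 1);
        uint64_t tag    = addr >> (B_BITS + S_BITS);
        printf("tag=%#llx set=%llu offset=%llu\n",
               (unsigned long long)tag, (unsigned long long)set,
               (unsigned long long)offset);
        return 0;
    }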

FULLY ASSOCIATIVE CACHES
- Fully associative caches have only one set, which contains all cache lines: S = 2^s = 1, so s = 0
  - No bits are used for the set index
- An m-bit address has only two fields:
  - tag: t = m - b bits
  - block offset: b bits
- Example: fully-associative cache
  [Figure: a fully associative cache; a single set (Set 0) holds every line, each with a valid bit, a tag, and a block]

SET-ASSOCIATIVE CACHES (2)
- Set-associative caches blend strengths of direct-mapped caches and fully associative caches:
  - Multiple cache sets
  - Multiple lines per set
- For an m-bit address: tag (t bits), set index (s = 2 bits), block offset (b bits)
  - (4 cache sets, so 2 bits for the cache-set index; see the lookup sketch below)
  [Figure: a two-way set-associative cache with 4 sets; each of Set 0 through Set 3 holds two lines, each with a valid bit, a tag, and a block]
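A minimal lookup sketch for this organization (hypothetical structures, assuming 4 sets, 2 ways, and 64-byte blocks; hardware compares the candidate tags in parallel, while this loop does it sequentially):

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    #define NUM_SETS 4            /* S = 4, so s = 2 index bits */
    #define WAYS     2            /* E = 2 lines per set        */
    #define B_BITS   6            /* 64-byte blocks             */
    #define S_BITS   2

    typedef struct {
        bool     valid;
        uint64_t tag;
        uint8_t  block[1 << B_BITS];
    } line;

    static line cache[NUM_SETS][WAYS];

    /* Returns the matching line on a hit, or NULL on a miss. */
    line *lookup(uint64_t addr) {
        uint64_t set = (addr >> B_BITS) & (NUM_SETS - 1);
        uint64_t tag = addr >> (B_BITS + S_BITS);
        for (int way = 0; way < WAYS; way++) {
            line *l = &cache[set][way];
            if (l->valid && l->tag == tag)
                return l;         /* hit */
        }
        return NULL;              /* miss */
    }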

WRITING TO CACHES
- So far, only discussed reading from the cache
- Most programs also write to memory
  - Need to handle writes to the cache as well
- Ideally, want to minimize performance penalties on writes, just like on reads
- If the CPU writes a word that is already in the cache:
  - Called a write hit
  - Several choices of how to handle this situation
- Option 1: employ a write-through strategy
  - Every write to the cache causes the cache to write the entire block back to memory
  - Problem: every single write to the cache causes a write to main memory! Can't exploit data locality!

WRITING TO CACHES (2)
- Option 2: use a write-back strategy
  - Add a dirty flag to each cache line:  [ Valid | Dirty | Tag | Block ]
  - When a cached block is written to, set the dirty flag
  - When a cache line is evicted, write the block back to main memory
- Benefit: can now exploit locality of writes as well as reads
  - Writes to adjacent values likely to hit cached blocks
  - When dirty blocks are finally evicted, have a much smaller number of writes to main memory
- Drawback: more complex than write-through
- These days, most caches use a write-back strategy (a sketch of the dirty-flag handling follows below)
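A minimal write-back sketch (hypothetical structures and helper, assuming a 64-byte block; not the lecture's code): a write hit only marks the line dirty, and the block reaches main memory when the line is evicted.

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    #define BLOCK_SIZE 64         /* assumed block size */

    typedef struct {
        bool     valid;
        bool     dirty;
        uint64_t tag;
        uint8_t  block[BLOCK_SIZE];
    } cache_line;

    /* placeholder for the memory transfer the hardware would perform */
    void write_block_to_memory(uint64_t tag, const uint8_t *block);

    void write_hit(cache_line *line, size_t offset, uint8_t value) {
        line->block[offset] = value;
        line->dirty = true;            /* defer the memory write */
    }

    void evict(cache_line *line) {
        if (line->valid && line->dirty)
            write_block_to_memory(line->tag, line->block);
        line->valid = false;
        line->dirty = false;
    }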

WRITING TO CACHES (3)
- If the address being written to isn't already in the cache:
  - Called a write miss
  - Again we have several options
- Option 1: use a write-allocate strategy
  - When a write miss occurs, load the block into the cache and then perform the write against the cache
  - Assumes the program will perform subsequent writes
- Option 2: use a no-write-allocate strategy
  - When a write miss occurs, just perform the write against main memory
- Write-back caches usually use a write-allocate strategy
- Write-through caches typically use a no-write-allocate strategy (the sketch below contrasts the two)
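A companion sketch contrasting the two write-miss policies (the helper functions are assumed primitives standing in for what the hardware would do, not real APIs):

    #include <stdbool.h>
    #include <stdint.h>

    /* assumed primitives */
    void load_block_into_cache(uint64_t addr);
    void write_word_in_cache(uint64_t addr, uint64_t value);
    void write_word_to_memory(uint64_t addr, uint64_t value);

    void handle_write_miss(uint64_t addr, uint64_t value, bool write_allocate) {
        if (write_allocate) {
            /* write-allocate: fetch the block, then write in the cache */
            load_block_into_cache(addr);
            write_word_in_cache(addr, value);
        } else {
            /* no-write-allocate: send the write straight to memory */
            write_word_to_memory(addr, value);
        }
    }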

CACHE PERFORMANCE ANALYSIS
- Cache performance modeling can be extremely complicated
  - Usually easiest to measure actual system behaviors
- Basic ideas can be captured with only a few simple parameters
- Miss rate: the fraction of memory references that result in cache misses
  - Miss rate = # misses / # references  (0 ≤ miss rate ≤ 1)
- Hit rate: the fraction of memory references that hit the cache
  - Hit rate = 1 - miss rate

CACHE PERFORMANCE ANALYSIS (2)
- Hit time: time to deliver a value from the cache to the CPU
  - Includes all necessary steps for delivering the value:
    - Time to select the appropriate cache set
    - Time to find the cache line using the block's tag
    - Time to retrieve the specified word from the block data
- Miss penalty: time required to handle a cache miss (includes cache-set identification, checking tags, etc.)
  - Need to fetch a block from main memory
  - May need to evict another cache line from the cache (the evicted line may be dirty and need to be written back)
  - Other associated bookkeeping for storing the cache line

CACHE PERFORMANCE ANALYSIS (3)
- Simple example to see the benefit of caching:
  - Hit time = 1 clock (typical goal for L1 hits)
  - Miss penalty = 100 clocks (main memory is usually 25-100)
- If all reads were from cache, each read is 1 clock
- Hit rate of 80%: 0.8 × 1 clock + 0.2 × 100 clocks = 20.8 clocks/access
- Hit rate of 90%: 0.9 × 1 clock + 0.1 × 100 clocks = 10.9 clocks/access
- Hit rate of 95%: 0.95 × 1 clock + 0.05 × 100 clocks = 5.95 clocks/access
- Hit rate is very important! (the general formula is written out below)
  - For programs with a low miss rate (good data locality), we get a large memory at (nearly) the cost of a small one!
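Written out as a formula, with h the hit rate (this is just the weighted average the slide computes; the 95% row as a worked example):

    \[
    \text{avg. clocks/access} = h \cdot t_{\text{hit}} + (1 - h) \cdot t_{\text{miss penalty}}
    \]
    \[
    0.95 \cdot 1 + 0.05 \cdot 100 = 5.95 \ \text{clocks/access}
    \]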

CACHE DESIGN TRADE-OFFS
- Previous example shows the importance of designing programs to maximize cache usage, and compilers that know how to optimize well
- Cache designers also have important trade-offs to consider
- Try to maximize hit rate, without increasing hit time!
- Can increase the overall size of the cache, but this will probably increase the hit time
  - Idea is to increase the number of lines, or the size of each line
  - Physics: larger memories are slower than smaller ones
- Can increase the block size, keeping the cache size constant (don't want to incur penalties due to physics)
  - Miss penalty will increase due to larger transfer times

CACHE DESIGN TRADE-OFFS (2)
- For associative caches, what about increasing the number of cache lines per set (E)?
- As the number of cache lines per set increases:
  - Complexity of line-matching goes up, since there are more lines to compare. Will probably increase hit times.
  - Must choose a line to evict when a cache miss occurs; now there are more choices.
    - Will probably increase the miss penalty!
    - Complexity of the replacement-policy logic also increases!
- Direct-mapped caches avoid the above penalties with an ultra-simple mapping of blocks to cache lines, but dramatically increase the likelihood of thrashing due to conflict misses!

INTEL PROCESSOR CACHES
- So far have only discussed caching data
  - Reading instructions also frequently has good locality
- Processors can manage separate instruction caches (i-caches) and data caches (d-caches)
  - Allows parallel data-access paths within the CPU
  - Also, instruction caches usually only support reads
  [Figure: caches on Intel processor cores; the registers connect to a split L1 d-cache and L1 i-cache, both backed by a unified L2 cache]

INTEL CORE 2 CACHES
- Intel Core 2 Duo/Quad cache information (typical):

  Cache             Associativity (E)  Block Size (B)        Sets (S)    Cache Size (C)
  L1 i-cache        8                  64 bytes (16 dwords)  64          32 KB
  L1 d-cache        8                  64 bytes (16 dwords)  64          32 KB
  Unified L2 cache  8                  64 bytes (16 dwords)  2048-16384  2 MB - 6 MB+

- For the Core 2 Quad, L2 is divided into two segments, each of which is shared by two cores
- Small number of cache lines per set, but large number of cache sets
  - Fast to determine if a block is in a cache set, and many cache sets minimize thrashing issues

INTEL CORE I7 CACHES
- Intel Core i7 cache information (the identity below ties these parameters together):

  Cache             Associativity (E)  Block Size (B)        Sets (S)  Cache Size (C)
  L1 i-cache        8                  64 bytes (16 dwords)  64        32 KB
  L1 d-cache        8                  64 bytes (16 dwords)  64        32 KB
  Unified L2 cache  8                  64 bytes (16 dwords)  512       256 KB
  Unified L3 cache  16                 64 bytes (16 dwords)  8192      8 MB

- Each core has its own L2 cache; all cores share the L3 cache
- With another level of caching, expect to see a smoother degradation in memory performance
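As a sanity check, the columns are related by the standard identity C = S × E × B; the arithmetic below verifies the L1 and L3 rows:

    \[
    C_{L1} = S \cdot E \cdot B = 64 \cdot 8 \cdot 64~\text{bytes} = 32768~\text{bytes} = 32~\text{KB}
    \]
    \[
    C_{L3} = 8192 \cdot 16 \cdot 64~\text{bytes} = 8388608~\text{bytes} = 8~\text{MB}
    \]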

MEASURING CACHE BEHAVIOR
- Can measure the performance of the various caches in our computer by constructing a special program:
  - Allocate an array of size elements to scan through
  - Access the array with stride-k reference patterns
    - Vary k from 1 to some large value
    - Scan through the memory for each stride value
  - Measure memory access throughput as we vary both of these parameters
- Remember, two different kinds of locality!
  - Spatial locality: program accesses data items that are close to other data items that have been recently accessed
  - Temporal locality: program accesses the same data item multiple times

MEASURING CACHE BEHAVIOR (2)
- Same measurement program as the previous slide: scan an array of size elements with stride-k reference patterns, varying both parameters (a sketch of the measurement loop follows below)
- When our working set and stride are small:
  - Should fit entirely into the L1 cache, giving fast access!
- When the working set doesn't fit completely within the L1 cache, and the stride is increased:
  - Should see performance taper off, as the ratio of L1 misses to L1 hits increases
  - When all accesses miss L1, will see L2 cache performance
- Should see similar behavior from L2-L3 and L3-DRAM
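A minimal sketch of such a measurement loop (not the lecture's code; get_time_sec is a thin wrapper over the POSIX clock, and a careful version would also warm the cache and keep the best of several runs):

    #include <stddef.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    static double get_time_sec(void) {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return ts.tv_sec + ts.tv_nsec / 1e9;
    }

    volatile long sink;  /* keeps the compiler from deleting the scan */

    static double throughput_mb_s(const long *data, size_t nelems,
                                  size_t stride) {
        long sum = 0;
        double start = get_time_sec();
        for (size_t i = 0; i < nelems; i += stride)
            sum += data[i];
        double elapsed = get_time_sec() - start;
        sink = sum;
        /* bytes actually touched per second, reported in MB/s */
        return (nelems / stride) * sizeof(long) / (elapsed * 1e6);
    }

    int main(void) {
        size_t nelems = 1 << 22;              /* 32 MB working set */
        long *data = malloc(nelems * sizeof(long));
        for (size_t i = 0; i < nelems; i++)
            data[i] = i;
        for (size_t k = 1; k <= 16; k *= 2)   /* vary the stride */
            printf("stride %2zu: %10.1f MB/s\n",
                   k, throughput_mb_s(data, nelems, k));
        free(data);
        return 0;
    }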

INTEL CORE I7 MEMORY MOUNTAIN
- Produces a 3D surface that shows our caches!
  [Figure: the "memory mountain": read throughput (MB/s) plotted against working-set size (bytes) and stride, measured on an Intel Core i7, 2.67 GHz, with 32 KB L1 d-cache, 256 KB L2 cache, 8 MB L3 cache; regions labeled L1, L2, L3, and Mem, with ridges of temporal locality and slopes of spatial locality]

INTEL CORE I7 MEMORY MOUNTAIN (2)
- As total size increases, temporal locality decreases
  - Working set can no longer fit in a given cache
  - See a significant dropoff as each cache level becomes ineffective
  [Figure: same Core i7 memory mountain as the previous slide]

INTEL CORE I7 MEMORY MOUNTAIN (3)
- As stride increases, spatial locality decreases
  - Cache miss rate increases as stride increases
  - Throughput tapers off to a minimum for each level
  [Figure: same Core i7 memory mountain as the previous slide]

INTEL CORE 2 DUO MACBOOK
- Different systems have very different statistics
  [Figure: memory mountain measured on an Intel Core 2 Duo MacBook, 2.4 GHz, with 32 KB L1 d-cache and 3 MB unified L2 cache; regions labeled L1, L2, and Mem, with ridges of temporal locality and slopes of spatial locality]

INTEL CORE 2 DUO MACBOOK (2)
- Core 2 only has two caches before main memory
  - Sharper degradation on Core 2 than on Core i7
  [Figure: same Core 2 Duo memory mountain as the previous slide]

INTEL CORE 2 DUO MACBOOK (3)
- Dip for very small working sets: at these sizes, the cost of function invocation overwhelms the measurements!
  [Figure: same Core 2 Duo memory mountain as the previous slide]

INTEL CORE 2 DUO MACBOOK (4)
- Small bump where the stride allows previously loaded cache lines to be used again
  [Figure: same Core 2 Duo memory mountain as the previous slide]

INTEL CORE 2 DUO MACBOOK (5)
- Peak throughput is much lower than before
  - Affected as much by the motherboard chipset as by the processor itself
  [Figure: same Core 2 Duo memory mountain as the previous slide]

INTEL CORE 2 QUAD DESKTOP
- Peak throughput is much more similar to the Core i7 shown before
  [Figure: memory mountain measured on an Intel Core 2 Quad desktop, 2.66 GHz, with 32 KB L1 d-cache and 4 MB unified L2 cache; regions labeled L1, L2, and Mem]

SUMMARY: CACHE IS KING
- Caches are essential components in computers
  - Performance gap between CPU and memory is large and increasing
- Caches give us an opportunity to have large memories that are nearly as fast as small ones
  - As long as we keep our cache hit rates up, program performance will improve dramatically
- As programmers, it is extremely important to be aware of data locality and caching considerations
  - Sometimes, simple changes in a program can produce dramatic changes in memory throughput
  - Other times, must carefully design the program to take advantage of processor caches

PROCESSOR PERFORMANCE
- Increases in CPU clock speeds have tapered off
  - Issues: clock signal propagation, heat dissipation
- New focus: design CPUs that do more with less
  - Parallelize instruction execution
    - Pipelined instruction execution
    - Superscalar architectures: multiple ALUs, etc.
    - Out-of-order instruction execution
  - Disabling unused CPU components to reduce power
- (See CS:APP chapters 4 and 5: a very good discussion!)
- Although CPU speeds have remained in the 1-3 GHz ballpark, CPUs are still becoming more powerful

PROCESSOR PERFORMANCE (2)
- Another approach: multi-core processors
  - Put multiple independent CPUs onto a single chip
  - Intel Core 2/Core i7, IBM/Sony/Toshiba Cell, NVIDIA GeForce 9/GeForce 200 series GPUs, etc.
- Still suffers from the same processor-memory performance gap as before
  - (And the solution is still caching...)
- We will explore the challenges and opportunities that multi-core introduces into cached memory access
  - Very rich topic, so our discussion will be a pretty high-level overview

MULTICORE AND CACHING
- Go ahead and stick multiple processors on a chip!
- What happens to caching in these systems?
- For example:
  [Figure: CPU 1 and CPU 2, each with its own SRAM cache (address/data paths), both sharing a DRAM main memory]
- Any problems with this approach?
  - More accurately: what new problems do we have?

INDEPENDENT CACHES
- Two independent caches of a common resource:
  [Figure: as before, each CPU has its own SRAM cache in front of the shared DRAM main memory]
- CPU 1 reads data in block 23, starts using it
  - Block 23 is loaded into CPU 1's cache
- CPU 2 issues a write on block 23
  - Block 23 is also loaded into CPU 2's cache
  - CPU 1's cache is no longer consistent with the memory state

COORDINATED CACHES
- Must coordinate state between the separate caches
  [Figure: the same two-CPU diagram, with a "Magic!" coordination link between the two SRAM caches]
- CPU 1 reads data in block 23, starts using it
  - Block 23 is loaded into CPU 1's cache
- CPU 2 issues a write on block 23
  - Block 23 is also loaded into CPU 2's cache
  - Somehow, tell CPU 1's cache that block 23 has changed

CACHE COHERENCE
- Cache coherence constrains the behavior of reads and writes to caches of a shared resource
- Need to define how coherent memory should behave:
- Processor P reads a location X, then writes to X
  - No other processors modify X between the read and the write
  - Subsequent reads of X must return the value that P wrote
- Processor P1 writes to a location X
  - Subsequently, another processor P2 reads X
  - P2 must read the value that P1 wrote
  - Specifically, P2 must not read the old value of X
- Processor P1 writes value A to a location X
  - Then, processor P2 writes value B to the same location X
  - Reads of X must see A and then B, but never B and then A

CACHE COHERENCE (2)
- Clearly need a mechanism to coordinate cache interactions
- Approach:
  - Previously had simple state info in each cache line:  [ Valid | Dirty | Tag | Block ]
  - Introduce additional state and operations to coordinate cache lines between processors
  [Figure: the two-CPU diagram again, with a coherence mechanism connecting the two SRAM caches above the shared DRAM main memory]

CACHE COHERENCE PROTOCOL
- The mechanism is called a cache coherence protocol
  - Many different protocols to choose from!
- Intel and AMD multi-core processors use variants of the MSI protocol
  - Letters in the protocol name specify the states of cache lines
- Each CPU has its own cache, but the caches coordinate read and write operations to maintain coherence
- In each cache, a cache line is in one of these states (see the sketch below):
  - Modified: the line contains data modified by the CPU, and is thus inconsistent with main memory
  - Shared: the line contains unmodified data, and appears in at least one CPU's cache
  - Invalid: the line's data has been invalidated (e.g. because another CPU wrote to their cache), and must be reread, either from main memory or from another cache
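A minimal sketch of the per-line state this adds (a hypothetical struct; real hardware keeps these bits alongside the tag and block):

    #include <stdint.h>

    typedef enum {
        MSI_INVALID,    /* must be reread from memory or another cache  */
        MSI_SHARED,     /* unmodified; may appear in several caches     */
        MSI_MODIFIED    /* modified; inconsistent with main memory, and
                           allowed to appear in only one cache          */
    } msi_state;

    typedef struct {
        msi_state state;
        uint64_t  tag;
        uint8_t   block[64];   /* assumed 64-byte block */
    } coherent_line;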

MSI CACHE COHERENCE PROTOCOL
- Basic principles:
  - A memory block can appear in multiple caches only if it is shared (S)
  - If modified (M), it may only appear in one CPU's cache!
- When a processor reads a value from a block:
  - If the block is in my cache, and marked either S or M, just return the value
  - If the block is not in my cache (or is marked I):
    - If no other cache has the block marked M, just read it from main memory
    - If another cache has the block marked M, that cache must write its data back to main memory, then switch to the S or I state
    - Then the processor can read the block from memory, and mark it S

MSI CACHE COHERENCE PROTOCOL (2)
- When a processor writes a value to a block (the sketch below walks these transitions):
  - If the block is in my cache and already marked M, perform the write
  - If the block is marked S (shared) in my cache:
    - Invalidate all other caches' copies of the block
    - Change my block's status to M, then perform the write
  - If the block isn't in my cache, or is marked I (invalid):
    - If the block is marked S in other caches, invalidate them
    - If the block is marked M in another cache, that cache must write its data back to main memory, then switch it to I
      - i.e. the block is completely evicted from the other CPU's cache
    - Then the processor can read the block from memory, mark it M, then perform the write
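A minimal sketch of these write transitions (self-contained, so the state enum is repeated; the helper functions are assumed stand-ins for bus/interconnect operations the hardware performs, not real APIs):

    #include <stdbool.h>
    #include <stdint.h>

    typedef enum { MSI_INVALID, MSI_SHARED, MSI_MODIFIED } msi_state;

    /* assumed protocol primitives */
    void invalidate_other_copies(uint64_t block);
    bool another_cache_has_modified(uint64_t block);
    void force_writeback_and_invalidate(uint64_t block);
    void fetch_block_from_memory(uint64_t block);
    void perform_cache_write(uint64_t block);

    void msi_write(msi_state *state, uint64_t block) {
        switch (*state) {
        case MSI_MODIFIED:      /* already exclusive: just write   */
            break;
        case MSI_SHARED:        /* gain exclusivity before writing */
            invalidate_other_copies(block);
            *state = MSI_MODIFIED;
            break;
        case MSI_INVALID:       /* write miss                      */
            if (another_cache_has_modified(block))
                force_writeback_and_invalidate(block);
            else
                invalidate_other_copies(block);
            fetch_block_from_memory(block);
            *state = MSI_MODIFIED;
            break;
        }
        perform_cache_write(block);
    }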

NEXT TIME
- Finish discussion of cache coherence protocols
  - Variants of MSI used by Intel/AMD processors
  - New cache performance issues to consider
- Begin discussing a new topic: virtualization