
ECE 5315 PP used in class for assessing cache coherence protocols

Assessing Protocol Design
- The benchmark programs are executed on a multiprocessor simulator.
- The state transitions observed determine the frequency of various events, such as cache misses and bus transactions.
- The effect of protocols is evaluated in terms of design parameters (e.g., bandwidth requirements, cache block size).
- The analysis is based on the frequency of various events, not on absolute time (since it is a simulation).
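As a concrete miniature of this methodology, the sketch below (a hypothetical toy, not the simulator used in class; it assumes infinite-capacity caches and a plain three-state MSI write-invalidate protocol) replays a reference trace and counts event frequencies, never absolute time:

```python
from collections import Counter

def simulate_msi(trace, nprocs):
    """Replay a trace of (processor, op, block) references through
    per-processor MSI caches (infinite capacity, to isolate protocol
    behavior) and tally the frequency of protocol events."""
    state = [dict() for _ in range(nprocs)]  # block -> 'M' or 'S'; absent means Invalid
    events = Counter()
    for proc, op, block in trace:
        mine = state[proc]
        st = mine.get(block, 'I')
        if op == 'R':
            if st == 'I':                    # read miss: fetch with BusRd
                events['read_miss'] += 1
                events['BusRd'] += 1
                for p in range(nprocs):      # a modified owner flushes and downgrades
                    if p != proc and state[p].get(block) == 'M':
                        state[p][block] = 'S'
                        events['writeback'] += 1
                mine[block] = 'S'
            # read hits in M or S generate no bus traffic
        else:                                # op == 'W'
            if st == 'I':                    # write miss: fetch exclusive with BusRdX
                events['write_miss'] += 1
                events['BusRdX'] += 1
            elif st == 'S':                  # data already present: upgrade only
                events['BusUpgr'] += 1
            if st != 'M':                    # either way, invalidate all other copies
                for p in range(nprocs):
                    if p != proc and block in state[p]:
                        del state[p][block]
                        events['invalidation'] += 1
                mine[block] = 'M'
            # write hits in M are silent
    return events
```

For example, `simulate_msi([(0, 'R', 100), (1, 'W', 100), (0, 'R', 100)], 2)` reports two read misses, both P0's; the second is a coherence miss caused by P1's write.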

State Transitions
[Figure: observed MESI state-transition frequencies; 16 processors, 1MB 4-way set-associative caches, 64B blocks]

Bandwidth Requirements
[Figure: address-bus and data-bus traffic (MB/s) for the parallel program workload (Barnes, LU, Ocean, Radiosity, Radix, Raytrace) and the multiprogram workload (Appl-Code, Appl-Data, OS-Code, OS-Data), each run under three protocols]
- Assumes 200 MIPS/MFLOPS processors with 1MB caches
- Ill: MESI (Illinois) protocol
- 3St: MSI with BusUpgr
- 3St-RdEx: MSI with BusRdX

Cache-Miss Types
- Cold miss (compulsory miss): occurs on the first reference to a memory block by a processor.
- Capacity miss: occurs when all the blocks referenced during the execution of a program do not fit in the cache.
- Collision miss (conflict miss): occurs in caches with less than full associativity, i.e., when the referenced block does not fit in its set.
- Coherence miss: occurs when blocks of data are shared among multiple processors.
  - True sharing: a word in a cache block produced by one processor is used by another processor.
  - False sharing: words accessed by different processors happen to be placed in the same block.

Sharing Misses: Illustration
True sharing miss:
1. One processor writes some words in a cache block.
2. Copies of that block in other processors' caches are invalidated.
3. A second processor reads one of the modified words (read miss).
False sharing miss:
1. One processor writes some words in a cache block.
2. Copies of that block in other processors' caches are invalidated.
3. A second processor reads a different, unmodified word in the same cache block.
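The only difference between the two cases is which word the second processor reads, which can be stated as a one-line predicate (an illustrative sketch; `classify_sharing_miss` is a hypothetical name, not from the lecture):

```python
def classify_sharing_miss(modified_words, word_read):
    """Both miss types begin the same way: a write invalidates remote
    copies of the block. They differ only in whether the second
    processor then reads a word the writer actually modified."""
    return 'true sharing' if word_read in modified_words else 'false sharing'
```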

Sharing Misses
- True sharing misses: reduced by increasing the cache block size, given spatial locality in the workload.
- False sharing misses: increase as the cache block size increases; they would not occur if the cache block size were one word.
- The current trend is toward larger cache block sizes, which potentially increases false sharing misses.
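A toy model makes the block-size effect concrete (a hypothetical sketch assuming a write-invalidate protocol, with two processors each repeatedly writing its own word):

```python
def coherence_miss_counts(block_words, iters=1000):
    """P0 repeatedly writes word 0 and P1 repeatedly writes word 1.
    A write invalidates every other cached copy of the same block."""
    holders = {}            # block id -> set of processors holding a valid copy
    misses = [0, 0]
    for _ in range(iters):
        for proc, word in ((0, 0), (1, 1)):
            blk = word // block_words
            if proc not in holders.get(blk, set()):
                misses[proc] += 1   # first miss is cold; the rest are false sharing
            holders[blk] = {proc}   # the writer becomes the sole holder
    return misses
```

With one-word blocks, each processor misses exactly once (cold); with two-word blocks, every single write misses as the block ping-pongs between the caches, even though the processors never touch each other's data.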

Classification of Cache Misses
[Figure: miss-classification decision tree. Each miss is classified by asking, in sequence: is this the first reference to the block by this processor, and the first access system-wide? what was the reason for elimination of the last copy (invalidation or replacement)? has the block been modified since replacement? and were modified word(s) accessed during the block's lifetime? The leaves are twelve categories:]
1. Cold
2. Cold
3. False-sharing-cold
4. True-sharing-cold
5. False-sharing-inval-cap
6. True-sharing-inval-cap
7. Pure-false-sharing
8. Pure-true-sharing
9. Pure-capacity
10. True-sharing-capacity
11. False-sharing-cap-inval
12. True-sharing-cap-inval

Impact of block size on miss rates (1MB cache)
[Figure: miss-rate breakdown (cold, capacity, true sharing, false sharing, upgrade) for Barnes, LU, Radiosity, Ocean, Radix, and Raytrace at block sizes of 8 to 256 bytes; 16 processors, 1MB 4-way set-associative caches]
- Cold, capacity, and true sharing misses tend to decrease with increasing block size.
- False sharing misses tend to increase with block size.

Impact of block size on miss rates (64KB cache)
- Overall miss rates increase (compared with the 1MB cache).
- Capacity misses are a much larger portion of overall misses.

Impact of Block Size on Bus Traffic (1MB Cache)
Traffic affects performance indirectly, through contention.
[Figure: address-bus and data-bus traffic (bytes per instruction; bytes per FLOP for LU and Ocean) for Barnes, Radiosity, Raytrace, Radix, LU, and Ocean at block sizes of 8 to 256 bytes]
- Data traffic increases quickly with block size.
- Address-bus traffic tends to decrease with block size.
- Address traffic overhead is a significant fraction of total traffic at small block sizes.

Impact of Block Size on Bus Traffic (64KB Cache)
- For Ocean, data traffic increases only slowly with block size (compare the 1MB-cache case).

Drawbacks of Large Cache Blocks
- The trend toward larger cache block sizes is driven by the increasing density of processor and memory chips.
- This trend bodes poorly for multiprocessor designs because of the potential increase in false sharing misses.

Countering the effects of large block size
- Organize data structures or work assignments so that data accessed by different processes is not finely interleaved in the shared address space (software approach).
- Use sub-blocks within a cache block: one sub-block may be valid while others are invalid.
- Use small cache blocks, but on a miss prefetch blocks beyond the accessed block.
- Use an adjustable block size (complex).
- Delay propagating or applying invalidations from a processor until it has issued multiple writes.
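The first (software) countermeasure is often realized by padding per-process data out to cache-line boundaries so that neighboring processes never share a block. A minimal sketch of the layout arithmetic, assuming a hypothetical 64-byte line and 8-byte counters (`CACHE_LINE_BYTES` and `padded_offsets` are invented names for illustration):

```python
CACHE_LINE_BYTES = 64   # assumed line size, for illustration only

def padded_offsets(nprocs, elem_bytes=8):
    """Byte offset of each processor's private counter, with the stride
    rounded up to a whole cache line so no two counters share a line."""
    stride = ((elem_bytes + CACHE_LINE_BYTES - 1)
              // CACHE_LINE_BYTES) * CACHE_LINE_BYTES
    return [p * stride for p in range(nprocs)]
```

In C or C++ the same idea is typically expressed by aligning each per-process element to the line size, at the cost of some wasted space.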

Update-Based vs. Invalidation-Based Protocols
- Update-based protocols perform better if the processors that were using the data before it was updated are likely to use the new values in the future.
- Invalidation-based protocols perform better if those processors are never going to use the new values (since the update traffic is useless).

Hybrid of Update and Invalidation (Mixed)
- Start with an update protocol and associate a counter with each cache block, initialized to a threshold k.
- Whenever a cache block is accessed by the local processor, its counter is reset to k.
- Every time an update is received for a block, the counter is decremented.
- If the counter reaches zero, the block is locally invalidated.
- The next time an update is generated, the block is switched to the modified state and stops generating updates.
- If some other processor later accesses the block, the block switches back to the shared state and resumes generating updates.
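The per-copy countdown can be sketched as follows (a hypothetical illustration of the counter logic only, not a complete protocol; `MixedCopy` is an invented name):

```python
class MixedCopy:
    """One processor's cached copy under the mixed update/invalidate scheme."""

    def __init__(self, k=4):
        self.k = k               # threshold
        self.counter = k
        self.valid = True

    def local_access(self):
        """A local read or write resets the countdown (refetching if invalid)."""
        self.valid = True
        self.counter = self.k

    def remote_update(self):
        """An incoming update decrements the counter; at zero, self-invalidate.
        Returns whether the copy is still valid (i.e., still wants updates)."""
        if self.valid:
            self.counter -= 1
            if self.counter == 0:
                self.valid = False
        return self.valid
```

Once every other copy has invalidated itself this way, the writer's next update finds no sharers, so its block can switch to the modified state and stop generating updates.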

Update vs. Invalidate: Miss Rates
[Figure: miss-rate breakdown (cold, capacity, true sharing, false sharing) for LU, Ocean, Raytrace, and Radix under invalidation (inv), update (upd), and mixed (mix) protocols; k = 4 for mixed]
- Lots of coherence misses: updates help.
- Lots of capacity misses: updates hurt (data is kept in the cache uselessly).

Update Protocols
- For applications with significant capacity miss rates, misses increase with an update protocol.
- False sharing misses decrease with an update protocol.
- The traffic associated with updates is quite substantial (many bus transactions versus one in an invalidation protocol).
- The increased traffic can cause contention and can greatly increase the cost of misses.
- Update protocols have greater problems in scalable systems.
- The trend is away from update-based protocols as the default.