Multicore Workshop: Cache Coherency. Mark Bull, David Henty. EPCC, University of Edinburgh


Symmetric MultiProcessing
Each processor in an SMP has equal access to all parts of memory: same latency and bandwidth. [Diagram: processors connected to memory via a bus/interconnect.] Examples: IBM servers, Sun HPC Servers, Fujitsu PrimePower, multiprocessor PCs.

Multicore chips
It is now possible (and economically desirable) to place multiple processors on a chip. From a programming perspective this is largely irrelevant: it is simply a convenient way to build a small SMP, and on-chip buses can have very high bandwidth. The main difference is that processors may share caches. Typically, each core has its own Level 1 and Level 2 caches, but the Level 3 cache is shared between cores.

Typical cache hierarchy
[Diagram: four CPUs on a chip, each with private L1 and L2 caches, sharing an L3 cache in front of main memory.]

Power4 two-core chip [slide: chip diagram]

Intel Nehalem quad-core chip [slide: chip diagram]

Power7 8-core chip [slide: chip diagram]

This means that multiple cores on the same chip can communicate with low latency and high bandwidth, via reads and writes that are cached in the shared cache. However, cores contend for space in the shared cache: one thread may suffer capacity and/or conflict misses caused by threads or processes on another core; it is harder to have precise control over what data is in the cache; and if only a single core is running, it may have access to the whole shared cache. Cores also have to share off-chip bandwidth for access to main memory.

Multiprocessors
The simplest way to build a (small-scale) parallel machine is to connect multiple processors to a single memory (true shared memory).

Cache coherency
The main difficulty in building multiprocessor systems is the cache coherency problem. The shared memory programming model assumes that a shared variable has a unique value at a given time. Caching in a shared memory system means that multiple copies of a memory location may exist in the hardware. To avoid two processors caching different values of the same memory location, caches must be kept coherent. To achieve this, a write to a memory location must cause all other copies of this location to be removed from the caches they are in.
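The problem can be illustrated with a toy model: two write-back caches that never invalidate each other's copies. The class and variable names below are illustrative only, not a real hardware interface.

```python
# Toy model of the incoherence problem: write-back caches with no
# invalidation protocol. Each cache keeps its own private copy.
memory = {'x': 0}

class Cache:
    def __init__(self):
        self.data = {}

    def read(self, addr):
        if addr not in self.data:        # miss: fetch from main memory
            self.data[addr] = memory[addr]
        return self.data[addr]           # hit: use the cached copy

    def write(self, addr, value):
        self.data[addr] = value          # write-back: memory is NOT updated
                                         # and other caches are NOT invalidated

c1, c2 = Cache(), Cache()
c1.read('x'); c2.read('x')   # both caches now hold a copy of x == 0
c1.write('x', 1)             # processor 1 updates only its own copy
print(c2.read('x'))          # prints 0: processor 2 sees a stale value
```

With an invalidation-based protocol, the write by processor 1 would have removed processor 2's copy, forcing it to fetch the new value.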

Coherence protocols
We need to store information about the sharing status of cache blocks: has this block been modified? Is this block stored in more than one cache? There are two main types of protocol:
1. Snooping (or broadcast) based: every cached copy carries its sharing status; there is no central status; all processors can see every request.
2. Directory based: sharing status is stored centrally (in a directory).
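To make the directory-based idea concrete, here is a minimal Python sketch of one directory entry and the invalidations a write triggers. The entry layout and names are illustrative assumptions, not any particular machine's format.

```python
# Sketch of a directory entry: per-block sharing status kept centrally.
# 'dirty' says whether some cache holds a modified copy; 'sharers' is the
# set of processor ids that currently cache the block.
directory = {
    'block42': {'dirty': False, 'sharers': {0, 2}},  # cached clean by P0, P2
}

def on_write(block, writer):
    """On a write, the directory tells every other sharer to invalidate
    and records the writer as the sole (dirty) holder."""
    entry = directory[block]
    to_invalidate = entry['sharers'] - {writer}
    entry['sharers'] = {writer}
    entry['dirty'] = True
    return to_invalidate

print(on_write('block42', 0))   # prints {2}: only P2 needs an invalidation
```

Unlike snooping, only the caches named in the sharer set are contacted, which is what makes directories scale beyond a single bus.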

Snoopy protocols
Cache lines already have a valid tag: this can be used for invalidation. An extra tag is needed to indicate sharing status; the clean/dirty bit in write-back caches can be used. All processors monitor all bus transactions: if an invalidation message is on the bus, they check whether the block is cached and, if so, invalidate it; if a memory read request is on the bus, they check whether the block is cached and, if so, return the data and cancel the memory request. Many different implementations are possible.

3-state snoopy protocol: MSI
MSI is the simplest protocol which allows multiple copies to exist. Each cache block can be in one of three states:
Modified: this is the only valid copy in any cache, and its value is different from that in memory.
Shared: this is a valid copy, but other caches may also contain it, and its value is the same as in memory.
Invalid: this copy is out of date and cannot be used.
The protocol can be described by a state transition diagram: state transitions can occur due to actions by the processor or by the bus, and state transitions may trigger actions.
Processor actions: read (PrRd), write (PrWr).
Bus actions: read (BusRd), read exclusive (BusRdX), flush to memory (Flush).

MSI protocol walkthrough
Assume we have three processors, each reading and writing the same value from memory, where R1 means a read by processor 1 and W3 means a write by processor 3. For simplicity's sake, the memory location will be referred to simply as "the value". The memory access stream we will walk through is: R1, R2, W3, R2, W1, W2, R3, R2.

R1: P1 wants to read the value. Its cache does not have it and generates a BusRd for the data. The main memory controller provides the data, which goes into P1's cache in the Shared state.

R2: P2 wants to read the value. Its cache does not have the data, so it places a BusRd to notify the other processors and ask for the data. The memory controller provides the data, and P2's copy also enters the Shared state.

W3: P3 wants to write the value. It places a BusRdX to get exclusive access and the most recent copy of the data. The caches of P1 and P2 see the BusRdX and invalidate their copies. Because the value is still up to date in memory, memory provides the data, and P3's copy enters the Modified state.

R2: P2 wants to read the value. P3's cache has the most up-to-date copy and will provide it. P2's cache puts a BusRd on the bus; P3's cache snoops this and cancels the memory access because it will provide the data, then flushes the data to the bus. Both copies end up in the Shared state.

W1: P1 wants to write the value. Its cache places a BusRdX on the bus to gain exclusive access and the most up-to-date value. Main memory is not stale, so it provides the data. The snoopers for P2 and P3 see the BusRdX and invalidate their copies; P1's copy enters the Modified state.

W2: P2 wants to write the value. Its cache places a BusRdX to get exclusive access and the most recent copy of the data. P1's snooper sees the BusRdX and flushes the data to the bus; it also invalidates the copy in its cache and cancels the memory access. P2's copy enters the Modified state.

R3: P3 wants to read the value. Its cache does not have a valid copy, so it places a BusRd on the bus. P2 has a modified copy, so it flushes the data onto the bus and changes the status of its cached copy to Shared. The flush cancels the memory access and updates the value in memory as well.

R2: P2 wants to read the value. Its cache has an up-to-date copy, so no bus transactions need to take place: there is no cache miss.

MSI state transitions
(A=>B means that when action A occurs, the state transition shown happens and action B is generated.)
Modified (M): PrRd, PrWr: hit, remain in M. BusRd=>Flush, go to S. BusRdX=>Flush, go to I.
Shared (S): PrRd: hit, remain in S. BusRd: remain in S. PrWr=>BusRdX, go to M. BusRdX: go to I.
Invalid (I): PrRd=>BusRd, go to S. PrWr=>BusRdX, go to M.
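These transitions can be captured in a few lines of code. Below is a minimal Python sketch of the MSI state machine for a single memory location, replayed over the walkthrough stream R1, R2, W3, R2, W1, W2, R3, R2. It models only states and bus messages, not data movement, and the function names are our own, not part of any real simulator.

```python
# Minimal MSI simulator for ONE memory location (real protocols track
# state per cache block, for many blocks).
def simulate(stream, n_caches=3):
    """stream: list of (processor_index, 'R' or 'W').
    Returns (final per-cache states, list of bus transactions)."""
    state = ['I'] * n_caches
    bus = []
    for p, op in stream:
        if op == 'R':
            if state[p] == 'I':                  # read miss
                bus.append('BusRd')
                for q in range(n_caches):        # snoopers react
                    if q != p and state[q] == 'M':
                        bus.append('Flush')      # owner supplies data,
                        state[q] = 'S'           # memory is updated
                state[p] = 'S'
            # in M or S: read hit, no bus traffic
        else:                                    # write
            if state[p] != 'M':
                bus.append('BusRdX')             # request exclusive access
                for q in range(n_caches):
                    if q != p:
                        if state[q] == 'M':
                            bus.append('Flush')  # modified copy written back
                        state[q] = 'I'           # all other copies invalidated
                state[p] = 'M'
    return state, bus

# The walkthrough stream (processors 0-indexed: R1 -> (0,'R'), etc.):
stream = [(0, 'R'), (1, 'R'), (2, 'W'), (1, 'R'),
          (0, 'W'), (1, 'W'), (2, 'R'), (1, 'R')]
final, bus = simulate(stream)
print(final)   # ['I', 'S', 'S'] -- matches the final slide of the walkthrough
```

Running it also shows the cost: the stream generates four BusRd, three BusRdX and three Flush transactions, and only the last read is bus-free.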

Other protocols
MSI is inefficient: it generates more bus traffic than is necessary. It can be improved by adding other states, e.g.:
Exclusive: this copy has not been modified, but it is the only copy in any cache.
Owned: this copy has been modified, but there may be other copies in the Shared state.
The MESI and MOESI protocols are more commonly used than MSI, but MSI is nevertheless a useful mental model for the programmer. It is also possible to update values in other caches on writes, instead of invalidating them.

False sharing The units of data on which coherency operations are performed are cache blocks: the size of these units is usually 64 or 128 bytes. The fact that coherency units consist of multiple words of data gives rise to the phenomenon of false sharing. Consider what happens when two processors are both writing to different words on the same cache line. no data s are actually being shared by the processors Each write will invalidate the copy in the other processor s cache, causing a lot of bus traffic and memory accesses. same problem if one processor is writing and the other reading Can be a significant performance problem in threaded programs Quite difficult to detect 25