Overview: Shared Memory Hardware


COMP4300/8300 L14,15: Shared Memory Hardware, 2017

This lecture covers:
- overview of shared address space systems
- example: the cache hierarchy of the Intel Core i7
- cache coherency protocols: basic ideas, invalidate and update protocols
- false sharing
- MSI protocol implementation
- snoopy cache-based systems
- directory cache-based systems, and their cache coherency issues
- cache coherency protocols in practice
- HPC study article: "12 Ways to Fool the Masses: Fast Forward to 2011"

Refs: Lin & Snyder Ch 2; Grama et al. Ch 2; the SGI Origin architecture, the AMD Northbridge architecture, Intel QuickPath technology

Shared Address Space Systems

- systems with caches, but otherwise flat memory, are generally called UMA
- if access to local memory is cheaper than access to remote memory (NUMA), this should be built into your algorithm; how to do this, and what O/S support exists, is another matter (man numa gives details of Linux NUMA support)
- a global address space is considered easier to program:
  - read-only interactions are invisible to the programmer and can be coded like a sequential program
  - read/write interactions are harder, as they require mutual exclusion for concurrent accesses
- the main programming models are threads and directive-based (we will use Pthreads and OpenMP)
- synchronization uses locks and related mechanisms

Shared Address Space and Shared Memory Computers

- "shared memory" was historically used for architectures in which memory is physically shared among the processors, all of which have equal access to any memory segment; this is identical to the UMA model
- the term SMP originally meant Symmetric Multi-Processor: all CPUs had equal OS capabilities (interrupts, I/O and other system calls). It now means Shared Memory Processor (almost all of which are symmetric)
- cf. distributed-memory computers, where different memory segments are physically associated with different processing elements
- either of these physical models can present the logical view of a disjoint or a shared address space platform
- a distributed-memory, shared-address-space computer is a NUMA system (Fig 2.5, Grama et al., Introduction to Parallel Computing)

Cache Hierarchy on the Intel Core i7 (2013)

- 64-byte cache line size
- Ref: http://15418.courses.cs.cmu.edu/spring2015/lecture/cachecoherence1

Caches on Multiprocessors

- multiple copies of some data word may be manipulated by two or more processors at the same time
- this places two requirements on the system:
  - an address translation mechanism that locates each physical memory word in the system
  - concurrent operations on multiple copies must have well-defined semantics
- the latter is generally provided by a cache coherency protocol
- input/output using direct memory access (DMA) on machines with caches also leads to coherency issues
- some machines only provide shared address space mechanisms and leave coherence to (system or user-level) software, e.g. the Texas Instruments Keystone II system and the Intel Single Chip Cloud Computer

Cache Coherency

- intuitive behaviour: reading the value at address X should return the last value written to address X by any processor
- but what does "last" mean? What if two accesses are simultaneous, or closer in time than the time required to communicate between two processors?
- in a sequential program, "last" is determined by program order (not time); this holds true within one thread of a parallel program, but what does it mean across multiple threads?

Cache/Memory Coherency

A memory system is coherent if:
- Ordered as Issued: a read by processor P to address X that follows a write by P to X returns the value of that write (assuming no other processor writes to X in between)
- Write Propagation: a read by processor P1 to address X that follows a write by processor P2 to X returns the written value, if the read and the write are sufficiently separated in time (assuming no other write to X occurs in between)
- Write Serialization: writes to the same address are serialized: any two writes by any two processors are observed in the same order by all processors

(later to be contrasted with memory consistency!)
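The Write Serialization condition can be made concrete with a small, hypothetical trace checker (all names here are illustrative, not from the lecture): given each processor's observed sequence of values at one address, serialization demands that all observations be consistent with a single global order of the writes.

```python
from itertools import permutations

def is_subsequence(seq, order):
    """True if seq appears, in order, within the tuple 'order'."""
    it = iter(order)
    return all(v in it for v in seq)  # membership consumes 'it' left to right

def serializable(writes, observations):
    """Check write serialization for one address: is there a single
    global order of the given (distinct) write values such that every
    processor's observed value sequence is a subsequence of it?"""
    return any(
        all(is_subsequence(obs, order) for obs in observations)
        for order in permutations(writes)
    )

# Two processors observing writes 1 and 2 in the same order: coherent.
print(serializable([1, 2], [[1, 2], [1, 2]]))   # True
# Observing the two writes in opposite orders violates serialization.
print(serializable([1, 2], [[1, 2], [2, 1]]))   # False
```

This brute-force check is exponential in the number of writes, so it is only a thought experiment; real hardware enforces the property by ordering bus transactions, not by searching.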

Two Cache Coherency Protocols: Update vs Invalidate

- update protocol: when a data item is written, all of its copies in the system are updated
- invalidate protocol (the most common): before a data item is written, all other copies are marked as invalid
- comparison (Fig 2.21, Grama et al., Introduction to Parallel Computing):
  - multiple writes to the same word with no intervening reads require multiple write broadcasts in an update protocol, but only one initial invalidation
  - with multi-word cache blocks, each word written in a cache block (line) must be broadcast in an update protocol, but only one invalidate per line is required
  - the delay between writing a word on one processor and reading the written data on another is usually less for the update protocol

Cache Line View

- need to augment cache line information with information regarding validity
- Ref: http://15418.courses.cs.cmu.edu/spring2015/lecture/cachecoherence1

False Sharing

- occurs when two processors modify different parts of the same cache line:
  - the invalidate protocol leads to ping-ponged cache lines
  - the update protocol performs reads locally, but the updates cause much traffic between processors
- this effect is entirely an artifact of the hardware
- need to design parallel systems/programs with this issue in mind:
  - cache line size: the longer the line, the more likely false sharing becomes
  - alignment of data structures with respect to the cache line size matters
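The false-sharing ping-pong effect can be illustrated with a toy model (a sketch with made-up names, not lecture code): map each byte address to a 64-byte line and, under an invalidate protocol, count how often a write forces an invalidation in the other processor's cache.

```python
LINE = 64  # cache line size in bytes (as on the Core i7)

def line_of(addr):
    return addr // LINE

def count_invalidations(trace):
    """trace: list of (processor, byte_address) writes under an
    invalidate protocol.  A write invalidates the line in another
    processor's cache if that processor wrote the same line earlier
    (the 'ping-pong' effect)."""
    last_writer = {}           # line number -> processor that wrote it last
    invalidations = 0
    for proc, addr in trace:
        ln = line_of(addr)
        if ln in last_writer and last_writer[ln] != proc:
            invalidations += 1
        last_writer[ln] = proc
    return invalidations

# P0 and P1 alternately update different 8-byte counters in ONE line:
shared = [(i % 2, (i % 2) * 8) for i in range(10)]
# Padding each counter to its own 64-byte line removes the effect:
padded = [(i % 2, (i % 2) * 64) for i in range(10)]
print(count_invalidations(shared))  # 9 ping-pong invalidations
print(count_invalidations(padded))  # 0
```

The padded trace touches two distinct lines, so each stays resident in its writer's cache; in a real program the same fix is applied by padding or aligning per-thread data to the line size.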

Implementing Cache Coherency

- on small-scale bus-based machines, a processor must obtain access to the bus to broadcast a write invalidation
- with two competing processors, the first to gain access to the bus will invalidate the other's data
- a cache miss needs to locate the top (most up-to-date) copy of the data:
  - easy for a write-through cache
  - for a write-back cache, each processor's cache snoops the bus and responds if it has the top copy of the data
- for writes, we would like to know whether any other copies of the block are cached, i.e. whether a write-back cache needs to put details on the bus
  - handled by having a tag to indicate shared status
- processor stalls are minimized either by duplicating the tags or by having multiple inclusive caches

3-State (MSI) Cache Coherency Protocol

- read: local read
- write: local write
- c_read (coherency read): a read (miss) on a remote processor gives rise to the shown transition in the local cache
- c_write (coherency write): a write miss, or a write in the Shared state, on a remote processor gives rise to the shown transition in the local cache

(Figs 2.22 and 2.23, Grama et al., Introduction to Parallel Computing)

Snoopy Cache Systems

- all caches broadcast all transactions (read or write misses, and writes in the S state)
- well suited (easy to implement) to bus or ring interconnects; however, scalability is limited (to roughly 8 processors). What about torus on-chip networks (assuming wormhole routing)?
- all processors' caches monitor the bus (or interconnect port) for transactions of interest
- each processor's cache has a set of tag bits that determine the state of the cache block; the tags are updated according to the state diagram of the relevant protocol
- e.g. when the snoop hardware detects that a read has been issued for a cache block that it has a dirty copy of, it asserts control of the bus and puts the data out (to the requesting cache and to main memory), setting its tag to the S state
- what sort of data access characteristics are likely to perform well/badly on snoopy-based systems?
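The MSI transitions described above can be sketched as a table-driven simulator for one cache's copy of a block (a simplified model of the lecture's state diagram, not a full multi-cache simulation):

```python
# One cache's state for a block under MSI.  'read'/'write' are local
# accesses; 'c_read'/'c_write' are the bus transactions observed when a
# *remote* processor reads or writes the block.
MSI = {
    ("I", "read"):    "S",  # read miss: fetch block, enter Shared
    ("I", "write"):   "M",  # write miss: fetch and invalidate others
    ("S", "read"):    "S",  # local read hit
    ("S", "write"):   "M",  # upgrade: invalidate other copies
    ("S", "c_read"):  "S",  # remote read: keep sharing
    ("S", "c_write"): "I",  # remote write invalidates our copy
    ("M", "read"):    "M",
    ("M", "write"):   "M",
    ("M", "c_read"):  "S",  # flush dirty data, drop to Shared
    ("M", "c_write"): "I",  # remote write: flush, then invalidate
}

def run(events, state="I"):
    """Apply a sequence of events; absent (state, event) pairs
    (e.g. coherency traffic for an Invalid block) leave the state alone."""
    for ev in events:
        state = MSI.get((state, ev), state)
    return state

print(run(["read", "write"]))            # I -> S -> M
print(run(["write", "c_read"]))          # I -> M -> S (snooped read of dirty data)
print(run(["read", "c_write", "read"]))  # I -> S -> I -> S
```

The last trace is exactly the ping-pong pattern behind false sharing: a remote write invalidates the line, so the next local read misses again.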

Snoopy Cache-Based System: Bus

(Fig 2.24, Grama et al., Introduction to Parallel Computing)

Snoopy Cache-Based System: Ring

The Core i7 (Sandy Bridge) on-chip interconnect revisited:
- a ring-based interconnect between the Cores, Graphics, Last Level Cache (LLC) and System Agent domains
- has 4 physical rings: Data (32 B), Request, Acknowledge and Snoop
- the rings are fully pipelined; bandwidth, latency and power scale with the number of cores
- the shortest path is chosen to minimize latency
- has distributed arbitration and sophisticated protocols to handle coherency and ordering

(courtesy www.lostcircuits.com; Fig 2.25, Grama et al., Introduction to Parallel Computing)

Directory Cache-Based Systems

- the need to broadcast is clearly not scalable
- a solution is to send information only to the processing elements specifically interested in that data
- this requires a directory to store the necessary information: augment global memory with a presence bitmap to indicate which caches each memory block is located in

Directory-Based Cache Coherency

- a simple protocol might use three directory states:
  - uncached: no processor has a copy
  - shared: one or more processors have the block cached, and the value in memory is up to date
  - exclusive: only one processor (the owner) has a copy, and the value in memory is out of date
- it must handle a read/write miss and a write to a shared, clean cache block:
  - these first reference the directory entry to determine the current state of the block
  - then they update the entry's status and presence bitmap
  - and send the appropriate state update transactions to the processors in the presence bitmap

Costs on the SGI Origin 3000 (clock cycles)

                                                 <= 16 CPUs    > 16 CPUs
  cache hit                                            1             1
  cache miss to local memory                          85            85
  cache miss to remote home directory                125           150
  cache miss to remotely cached data (3 hops)        140           170

Figure from http://people.nas.nasa.gov/schang/origin_opt.html
Data from: Computer Architecture: A Quantitative Approach, 3rd ed., John L. Hennessy and David A. Patterson, Morgan Kaufmann, 2003

Issues in Directory-Based Systems

- how much memory is required to store the directory?
- what sort of data access characteristics are likely to perform well/badly on directory-based systems?
- how do distributed and centralized directory systems compare?
- should the presence bitmaps be replicated in the caches? Must they be?
- how would you implement sending an invalidation message to all (and only) the processors in the presence bitmap?

Real Cache Coherency Protocols

From Wikipedia: modern systems use variants of the MSI protocol to reduce the amount of traffic in the coherency interconnect:
- the MESI protocol adds an Exclusive state to reduce the traffic caused by writes of blocks that exist in only one cache
- the MOSI protocol adds an Owned state to reduce the traffic caused by write-backs of blocks that are read by other caches (the processor owning the cache line services requests for that data)
- the MOESI protocol does both of these things
- the MESIF protocol uses a Forward state to reduce the traffic caused by multiple responses to read requests, when the coherency architecture allows caches to respond to snoop requests with data

Case study: coherency via the MOESI protocol in the SunFire V1289 NUMA SMP (2003)
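The simple three-state directory protocol sketched earlier can be written down directly. The sketch below uses invented names and omits much that a real directory must handle (races, write-backs, replacement): each entry holds a state plus a presence bitmap, and a miss first consults the entry, then updates it and sends transactions only to the processors recorded in the bitmap.

```python
class DirectoryEntry:
    """Directory state for one memory block: 'uncached', 'shared' or
    'exclusive', plus a presence set recording which caches hold it."""
    def __init__(self):
        self.state = "uncached"
        self.presence = set()
        self.messages = []   # transactions "sent" to caches, for inspection

    def read_miss(self, proc):
        if self.state == "exclusive":
            owner = next(iter(self.presence))
            self.messages.append(("fetch", owner))   # recall dirty data
        self.presence.add(proc)
        self.state = "shared"

    def write_miss(self, proc):
        # Invalidate only the recorded sharers -- no broadcast needed.
        for p in self.presence - {proc}:
            self.messages.append(("invalidate", p))
        self.presence = {proc}
        self.state = "exclusive"

d = DirectoryEntry()
d.read_miss(0); d.read_miss(1)   # two sharers, memory up to date
d.write_miss(2)                  # caches 0 and 1 invalidated; 2 owns the block
print(d.state, sorted(d.presence))   # exclusive [2]
print(sorted(d.messages))            # invalidations went to 0 and 1 only
```

The key contrast with snooping is visible in `write_miss`: the set difference replaces a broadcast, which is what makes the scheme scale to larger machines.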

MESI Protocol (on a bus)

Ref: https://www.cs.tcd.ie/jeremy.jones/vivio/caches/mesihelp.htm

Multi-Level Caches

- what is the visibility of changes between the levels of cache?
- the easiest model is inclusive: if a line is in the owned state in L1, it is also in the owned state in L2
- Ref: http://15418.courses.cs.cmu.edu/spring2015/lecture/cachecoherence1

The Coherency Wall: Cache Coherency Considered Harmful

- interconnects are expected to consume 50x more energy than logic circuits
- standard protocols require a broadcast message for each invalidation; maintaining the (MOESI) protocol also requires a broadcast on every miss
  - the energy cost of each broadcast is O(p); the overall cost is O(p^2)!
  - broadcasts also cause contention (and delay) in the network (worse than O(p^2)?)
- directory-based protocols can direct invalidation messages to only the caches holding the same data
  - far more scalable for lightly-shared data, worse otherwise; they also introduce overhead through indirection
  - for each cached line, a bit vector of length p is needed: an O(p^2) storage cost
- false sharing in any case results in wasted traffic
- atomic instructions (essential for locks etc.) sync the memory system down to the LLC, at a cost of O(p) energy each!
- the cache line size is sub-optimal for messages on on-chip networks

Cache Coherency Summary

- cache coherency arises because the abstraction of a single shared address space is not actually implemented by a single storage unit in a machine
- three components to cache coherency: issue order, write propagation, write serialization
- two implementations:
  - broadcast/snoop: suitable for small-to-medium intra-chip and small inter-socket systems
  - directory-based: suitable for medium-to-large inter-socket systems
- false sharing is a potential performance issue: the longer the cache line, the more likely it becomes
- energy considerations argue for no coherency at all on large intra-chip systems (e.g. the PEZY-SC), using OS-managed distributed shared memory or message-passing programming models instead
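The O(p^2) directory storage cost claimed under the Coherency Wall is easy to check with a few lines of arithmetic (the parameters below are illustrative, not figures from the lecture): with p nodes each caching a fixed number of lines, and a p-bit presence vector per cached line, total presence-bitmap storage grows quadratically in p.

```python
def directory_bits(p, lines_per_node):
    """Total presence-bitmap storage across the machine: every cached
    line on every one of the p nodes carries one bit per node."""
    return p * lines_per_node * p   # O(p^2) for a fixed per-node cache size

# Example: 1 MiB of 64-byte lines per node = 16384 lines per node.
for p in (16, 64, 256):
    bits = directory_bits(p, 16384)
    print(p, bits // (8 * 1024), "KiB")   # grows ~16x for every 4x in p
```

At p = 256 the bitmaps alone reach 128 MiB, which is why large machines resort to tricks such as coarse (per-node-group) presence bits or limited-pointer directories rather than full bit vectors.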