Overview: Shared Memory Hardware


COMP4300/8300 L14,15: Shared Memory Hardware, 2017

This lecture covers:
- overview of shared address space systems
- example: the cache hierarchy of the Intel Core i7
- cache coherency protocols: basic ideas, invalidate and update protocols
- false sharing
- MSI protocol implementation
- snoopy cache-based systems
- directory cache-based systems, and their cache coherency issues
- cache coherency protocols in practice
- HPC study article: "12 Ways to Fool the Masses: Fast Forward to 2011"

Refs: Lin & Snyder Ch 2; Grama et al. Ch 2; the SGI Origin architecture, the AMD Northbridge architecture, Intel QuickPath technology

Shared Address Space Systems

- systems with caches, but otherwise flat memory, are generally called UMA
- if access to local memory is cheaper than access to remote memory (NUMA), this should be built into your algorithm; how to do this, and what O/S support exists, is another matter (man numa gives details of Linux NUMA support)
- a global address space is considered easier to program:
  - read-only interactions are invisible to the programmer and can be coded like a sequential program
  - read/write interactions are harder, as they require mutual exclusion for concurrent accesses
- the main programming models are threads and directive-based (we will use Pthreads and OpenMP)
- synchronization uses locks and related mechanisms

Shared Address Space and Shared Memory Computers

- "shared memory" was historically used for architectures in which memory is physically shared among the processors, all of which have equal access to any memory segment; this is identical to the UMA model
- the term SMP originally meant Symmetric Multi-Processor: all CPUs had equal OS capabilities (interrupts, I/O and other system calls). It now means Shared Memory Processor (almost all of which are symmetric)
- cf. distributed-memory computers, where different memory segments are physically associated with different processing elements
- either of these physical models can present the logical view of a disjoint or a shared address space platform
- a distributed-memory, shared-address-space computer is a NUMA system (Fig 2.5, Grama et al., Introduction to Parallel Computing)

Cache Hierarchy on the Intel Core i7 (2013)

- 64-byte cache line size
- Ref: http://15418.courses.cs.cmu.edu/spring2015/lecture/cachecoherence1

Caches on Multiprocessors

- multiple copies of some data word may be manipulated by two or more processors at the same time
- this places two requirements on the system:
  - an address translation mechanism that locates each physical memory word in the system
  - concurrent operations on multiple copies must have well-defined semantics
- the latter is generally provided by a cache coherency protocol
- input/output using direct memory access (DMA) on machines with caches also leads to coherency issues
- some machines only provide shared address space mechanisms and leave coherence to (system or user-level) software, e.g. the Texas Instruments Keystone II system and the Intel Single Chip Cloud Computer

Cache Coherency

- intuitive behaviour: reading the value at address X should return the last value written to address X by any processor
- but what does "last" mean? What if two accesses are simultaneous, or closer in time than the time required to communicate between two processors?
- in a sequential program, "last" is determined by program order (not time); this holds true within one thread of a parallel program, but what does it mean across multiple threads?

Cache/Memory Coherency

A memory system is coherent if:
- Ordered as Issued: a read by processor P to address X that follows a write by P to X returns the value of that write (assuming no other processor writes to X in between)
- Write Propagation: a read by processor P1 to address X that follows a write by processor P2 to X returns the written value, if the read and the write are sufficiently separated in time (assuming no other write to X occurs in between)
- Write Serialization: writes to the same address are serialized: any two writes by any two processors are observed in the same order by all processors

(later to be contrasted with memory consistency!)
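The Write Serialization condition can be made concrete with a small, hypothetical trace checker (all names here are illustrative, not from the lecture): given each processor's observed sequence of values at one address, serialization demands that all observations be consistent with a single global order of the writes.

```python
from itertools import permutations

def is_subsequence(seq, order):
    """True if seq appears, in order, within the tuple 'order'."""
    it = iter(order)
    return all(v in it for v in seq)  # membership consumes 'it' left to right

def serializable(writes, observations):
    """Check write serialization for one address: is there a single
    global order of the given (distinct) write values such that every
    processor's observed value sequence is a subsequence of it?"""
    return any(
        all(is_subsequence(obs, order) for obs in observations)
        for order in permutations(writes)
    )

# Two processors observing writes 1 and 2 in the same order: coherent.
print(serializable([1, 2], [[1, 2], [1, 2]]))   # True
# Observing the two writes in opposite orders violates serialization.
print(serializable([1, 2], [[1, 2], [2, 1]]))   # False
```

This brute-force check is exponential in the number of writes, so it is only a thought experiment; real hardware enforces the property by ordering bus transactions, not by searching.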

Two Cache Coherency Protocols: Update vs Invalidate

- update protocol: when a data item is written, all of its copies in the system are updated
- invalidate protocol (the most common): before a data item is written, all other copies are marked as invalid
- comparison (Fig 2.21, Grama et al., Introduction to Parallel Computing):
  - multiple writes to the same word with no intervening reads require multiple write broadcasts in an update protocol, but only one initial invalidation
  - with multi-word cache blocks, each word written in a cache block (line) must be broadcast in an update protocol, but only one invalidate per line is required
  - the delay between writing a word on one processor and reading the written data on another is usually less for the update protocol

Cache Line View

- need to augment cache line information with information regarding validity
- Ref: http://15418.courses.cs.cmu.edu/spring2015/lecture/cachecoherence1

False Sharing

- occurs when two processors modify different parts of the same cache line:
  - the invalidate protocol leads to ping-ponged cache lines
  - the update protocol performs reads locally, but the updates cause much traffic between processors
- this effect is entirely an artifact of the hardware
- need to design parallel systems/programs with this issue in mind:
  - cache line size: the longer the line, the more likely false sharing becomes
  - alignment of data structures with respect to the cache line size matters
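The false-sharing ping-pong effect can be illustrated with a toy model (a sketch with made-up names, not lecture code): map each byte address to a 64-byte line and, under an invalidate protocol, count how often a write forces an invalidation in the other processor's cache.

```python
LINE = 64  # cache line size in bytes (as on the Core i7)

def line_of(addr):
    return addr // LINE

def count_invalidations(trace):
    """trace: list of (processor, byte_address) writes under an
    invalidate protocol.  A write invalidates the line in another
    processor's cache if that processor wrote the same line earlier
    (the 'ping-pong' effect)."""
    last_writer = {}           # line number -> processor that wrote it last
    invalidations = 0
    for proc, addr in trace:
        ln = line_of(addr)
        if ln in last_writer and last_writer[ln] != proc:
            invalidations += 1
        last_writer[ln] = proc
    return invalidations

# P0 and P1 alternately update different 8-byte counters in ONE line:
shared = [(i % 2, (i % 2) * 8) for i in range(10)]
# Padding each counter to its own 64-byte line removes the effect:
padded = [(i % 2, (i % 2) * 64) for i in range(10)]
print(count_invalidations(shared))  # 9 ping-pong invalidations
print(count_invalidations(padded))  # 0
```

The padded trace touches two distinct lines, so each stays resident in its writer's cache; in a real program the same fix is applied by padding or aligning per-thread data to the line size.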

Implementing Cache Coherency

- on small-scale bus-based machines, a processor must obtain access to the bus to broadcast a write invalidation
- with two competing processors, the first to gain access to the bus will invalidate the other's data
- a cache miss needs to locate the top (most up-to-date) copy of the data:
  - easy for a write-through cache
  - for a write-back cache, each processor's cache snoops the bus and responds if it has the top copy of the data
- for writes, we would like to know whether any other copies of the block are cached, i.e. whether a write-back cache needs to put details on the bus
  - handled by having a tag to indicate shared status
- processor stalls are minimized either by duplicating the tags or by having multiple inclusive caches

3-State (MSI) Cache Coherency Protocol

- read: local read
- write: local write
- c_read (coherency read): a read (miss) on a remote processor gives rise to the shown transition in the local cache
- c_write (coherency write): a write miss, or a write in the Shared state, on a remote processor gives rise to the shown transition in the local cache

(Figs 2.22 and 2.23, Grama et al., Introduction to Parallel Computing)

Snoopy Cache Systems

- all caches broadcast all transactions (read or write misses, and writes in the S state)
- well suited (easy to implement) to bus or ring interconnects; however, scalability is limited (to roughly 8 processors). What about torus on-chip networks (assuming wormhole routing)?
- all processors' caches monitor the bus (or interconnect port) for transactions of interest
- each processor's cache has a set of tag bits that determine the state of the cache block; the tags are updated according to the state diagram of the relevant protocol
- e.g. when the snoop hardware detects that a read has been issued for a cache block that it has a dirty copy of, it asserts control of the bus and puts the data out (to the requesting cache and to main memory), setting its tag to the S state
- what sort of data access characteristics are likely to perform well/badly on snoopy-based systems?
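The MSI transitions described above can be sketched as a table-driven simulator for one cache's copy of a block (a simplified model of the lecture's state diagram, not a full multi-cache simulation):

```python
# One cache's state for a block under MSI.  'read'/'write' are local
# accesses; 'c_read'/'c_write' are the bus transactions observed when a
# *remote* processor reads or writes the block.
MSI = {
    ("I", "read"):    "S",  # read miss: fetch block, enter Shared
    ("I", "write"):   "M",  # write miss: fetch and invalidate others
    ("S", "read"):    "S",  # local read hit
    ("S", "write"):   "M",  # upgrade: invalidate other copies
    ("S", "c_read"):  "S",  # remote read: keep sharing
    ("S", "c_write"): "I",  # remote write invalidates our copy
    ("M", "read"):    "M",
    ("M", "write"):   "M",
    ("M", "c_read"):  "S",  # flush dirty data, drop to Shared
    ("M", "c_write"): "I",  # remote write: flush, then invalidate
}

def run(events, state="I"):
    """Apply a sequence of events; absent (state, event) pairs
    (e.g. coherency traffic for an Invalid block) leave the state alone."""
    for ev in events:
        state = MSI.get((state, ev), state)
    return state

print(run(["read", "write"]))            # I -> S -> M
print(run(["write", "c_read"]))          # I -> M -> S (snooped read of dirty data)
print(run(["read", "c_write", "read"]))  # I -> S -> I -> S
```

The last trace is exactly the ping-pong pattern behind false sharing: a remote write invalidates the line, so the next local read misses again.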

Snoopy Cache-Based System: Bus

(Fig 2.24, Grama et al., Introduction to Parallel Computing)

Snoopy Cache-Based System: Ring

The Core i7 (Sandy Bridge) on-chip interconnect revisited:
- a ring-based interconnect between the Cores, Graphics, Last Level Cache (LLC) and System Agent domains
- has 4 physical rings: Data (32 B), Request, Acknowledge and Snoop
- the rings are fully pipelined; bandwidth, latency and power scale with the number of cores
- the shortest path is chosen to minimize latency
- has distributed arbitration and sophisticated protocols to handle coherency and ordering

(courtesy www.lostcircuits.com; Fig 2.25, Grama et al., Introduction to Parallel Computing)

Directory Cache-Based Systems

- the need to broadcast is clearly not scalable
- a solution is to send information only to the processing elements specifically interested in that data
- this requires a directory to store the necessary information: augment global memory with a presence bitmap to indicate which caches each memory block is located in

Directory-Based Cache Coherency

- a simple protocol might use three directory states:
  - uncached: no processor has a copy
  - shared: one or more processors have the block cached, and the value in memory is up to date
  - exclusive: only one processor (the owner) has a copy, and the value in memory is out of date
- it must handle a read/write miss and a write to a shared, clean cache block:
  - these first reference the directory entry to determine the current state of the block
  - then they update the entry's status and presence bitmap
  - and send the appropriate state update transactions to the processors in the presence bitmap

Costs on the SGI Origin 3000 (clock cycles)

                                                 <= 16 CPUs    > 16 CPUs
  cache hit                                            1             1
  cache miss to local memory                          85            85
  cache miss to remote home directory                125           150
  cache miss to remotely cached data (3 hops)        140           170

Figure from http://people.nas.nasa.gov/schang/origin_opt.html
Data from: Computer Architecture: A Quantitative Approach, 3rd ed., John L. Hennessy and David A. Patterson, Morgan Kaufmann, 2003

Issues in Directory-Based Systems

- how much memory is required to store the directory?
- what sort of data access characteristics are likely to perform well/badly on directory-based systems?
- how do distributed and centralized directory systems compare?
- should the presence bitmaps be replicated in the caches? Must they be?
- how would you implement sending an invalidation message to all (and only) the processors in the presence bitmap?

Real Cache Coherency Protocols

From Wikipedia: modern systems use variants of the MSI protocol to reduce the amount of traffic in the coherency interconnect:
- the MESI protocol adds an Exclusive state to reduce the traffic caused by writes of blocks that exist in only one cache
- the MOSI protocol adds an Owned state to reduce the traffic caused by write-backs of blocks that are read by other caches (the processor owning the cache line services requests for that data)
- the MOESI protocol does both of these things
- the MESIF protocol uses a Forward state to reduce the traffic caused by multiple responses to read requests, when the coherency architecture allows caches to respond to snoop requests with data

Case study: coherency via the MOESI protocol in the SunFire V1289 NUMA SMP (2003)
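The simple three-state directory protocol sketched earlier can be written down directly. The sketch below uses invented names and omits much that a real directory must handle (races, write-backs, replacement): each entry holds a state plus a presence bitmap, and a miss first consults the entry, then updates it and sends transactions only to the processors recorded in the bitmap.

```python
class DirectoryEntry:
    """Directory state for one memory block: 'uncached', 'shared' or
    'exclusive', plus a presence set recording which caches hold it."""
    def __init__(self):
        self.state = "uncached"
        self.presence = set()
        self.messages = []   # transactions "sent" to caches, for inspection

    def read_miss(self, proc):
        if self.state == "exclusive":
            owner = next(iter(self.presence))
            self.messages.append(("fetch", owner))   # recall dirty data
        self.presence.add(proc)
        self.state = "shared"

    def write_miss(self, proc):
        # Invalidate only the recorded sharers -- no broadcast needed.
        for p in self.presence - {proc}:
            self.messages.append(("invalidate", p))
        self.presence = {proc}
        self.state = "exclusive"

d = DirectoryEntry()
d.read_miss(0); d.read_miss(1)   # two sharers, memory up to date
d.write_miss(2)                  # caches 0 and 1 invalidated; 2 owns the block
print(d.state, sorted(d.presence))   # exclusive [2]
print(sorted(d.messages))            # invalidations went to 0 and 1 only
```

The key contrast with snooping is visible in `write_miss`: the set difference replaces a broadcast, which is what makes the scheme scale to larger machines.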

MESI Protocol (on a bus)

Ref: https://www.cs.tcd.ie/jeremy.jones/vivio/caches/mesihelp.htm

Multi-Level Caches

- what is the visibility of changes between the levels of cache?
- the easiest model is inclusive: if a line is in the owned state in L1, it is also in the owned state in L2
- Ref: http://15418.courses.cs.cmu.edu/spring2015/lecture/cachecoherence1

The Coherency Wall: Cache Coherency Considered Harmful

- interconnects are expected to consume 50x more energy than logic circuits
- standard protocols require a broadcast message for each invalidation; maintaining the (MOESI) protocol also requires a broadcast on every miss
  - the energy cost of each broadcast is O(p); the overall cost is O(p^2)!
  - broadcasts also cause contention (and delay) in the network (worse than O(p^2)?)
- directory-based protocols can direct invalidation messages to only the caches holding the same data
  - far more scalable for lightly-shared data, worse otherwise; they also introduce overhead through indirection
  - for each cached line, a bit vector of length p is needed: an O(p^2) storage cost
- false sharing in any case results in wasted traffic
- atomic instructions (essential for locks etc.) sync the memory system down to the LLC, at a cost of O(p) energy each!
- the cache line size is sub-optimal for messages on on-chip networks

Cache Coherency Summary

- cache coherency arises because the abstraction of a single shared address space is not actually implemented by a single storage unit in a machine
- three components to cache coherency: issue order, write propagation, write serialization
- two implementations:
  - broadcast/snoop: suitable for small-to-medium intra-chip and small inter-socket systems
  - directory-based: suitable for medium-to-large inter-socket systems
- false sharing is a potential performance issue: the longer the cache line, the more likely it becomes
- energy considerations argue for no coherency at all on large intra-chip systems (e.g. the PEZY-SC), using OS-managed distributed shared memory or message-passing programming models instead
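The O(p^2) directory storage cost claimed under the Coherency Wall is easy to check with a few lines of arithmetic (the parameters below are illustrative, not figures from the lecture): with p nodes each caching a fixed number of lines, and a p-bit presence vector per cached line, total presence-bitmap storage grows quadratically in p.

```python
def directory_bits(p, lines_per_node):
    """Total presence-bitmap storage across the machine: every cached
    line on every one of the p nodes carries one bit per node."""
    return p * lines_per_node * p   # O(p^2) for a fixed per-node cache size

# Example: 1 MiB of 64-byte lines per node = 16384 lines per node.
for p in (16, 64, 256):
    bits = directory_bits(p, 16384)
    print(p, bits // (8 * 1024), "KiB")   # grows ~16x for every 4x in p
```

At p = 256 the bitmaps alone reach 128 MiB, which is why large machines resort to tricks such as coarse (per-node-group) presence bits or limited-pointer directories rather than full bit vectors.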