CS252 Graduate Computer Architecture, Lecture 21: Distributed Shared Memory (con't), Synchronization


CS252 Graduate Computer Architecture
Lecture 21, April 11th, 2012
Distributed Shared Memory (con't), Synchronization
Prof John D. Kubiatowicz

Recall: Sequential Consistency of Directory Protocols
How do we get an exclusion zone for a directory protocol?
Clearly need to make sure that invalidations really invalidate copies
» keep state to handle reordering at the client (previous slide's problem)
While acknowledgements are outstanding, cannot handle read requests
» NAK read requests
» queue read requests
Example for an invalidation-based scheme: the block owner (home node) provides the appearance of atomicity by waiting for all invalidations to be ack'd before allowing access to the new value
As a result, the write commit point becomes the point at which WData leaves the home node (after the last ack is received)
Much harder in update schemes!
[Figure: a requester sends Req to the home node; while acknowledgements are outstanding the home NAKs or queues further requests, sends Inv to each reader, collects the Acks, and only then returns WData.]

Recall: Deadlock Issues with Protocols
Consider the dual graph of message dependencies
» Nodes: networks; Arcs: protocol actions
» number of networks = length of the longest dependency chain
» must always make sure the response (end) can be absorbed!
[Figure: three L-H-R message patterns. (1) 1: req, 2: reply: 2 networks sufficient to avoid deadlock. (2) 1: req, 2: intervention, 3: reply, 4: response: need 4 networks to avoid deadlock. (3) 1: req, 2: intervention, 3a: revise, 3b: response: need 3 networks to avoid deadlock.]

Recall: Mechanisms for reducing depth
[Figure: three variants of the pattern 1: req, 2: intervention, 3a: revise, 3b: response.]
» original: need 3 networks to avoid deadlock
» optional NAK when blocked: need 2 networks
» transform to request/response (2': SendInt to R): need 2 networks

A Popular Middle Ground
Two-level hierarchy
Individual nodes are multiprocessors, connected non-hierarchically
» e.g. a mesh of SMPs
Coherence across nodes is directory-based
» directory keeps track of nodes, not individual processors
Coherence within nodes is snooping or directory
» orthogonal, but needs a good interface of functionality
Examples:
» Convex Exemplar: directory-directory
» Sequent, Data General, HAL: directory-snoopy
SMP on a chip?

Example Two-level Hierarchies
[Figure: four organizations built from nodes containing processors/caches, main memory, and an M/D assist (A): (a) snooping-snooping, with snooping adapters joining local buses B1 to a global bus; (b) snooping-directory, with directory adapters joining local buses B1 to a general network; (c) directory-directory, with directory adapters between local networks (Network1) and a global network (Network2); (d) directory-snooping, with dir/snoopy adapters between local networks and a global bus or ring.]

Advantages of Multiprocessor Nodes
Potential for cost and performance advantages
» amortization of node fixed costs over multiple processors (applies even if processors are simply packaged together but not coherent)
» can use commodity SMPs
» fewer nodes for the directory to keep track of
» much communication may be contained within a node (cheaper)
» nodes prefetch data for each other (fewer remote misses)
» combining of requests (like hierarchical, only two-level)
» can even share caches (overlapping of working sets)
Benefits depend on sharing pattern (and mapping)
» good for widely read-shared data: e.g. tree data in Barnes-Hut
» good for nearest-neighbor, if properly mapped
» not so good for all-to-all communication

Disadvantages of Coherent MP Nodes
Bandwidth is shared among nodes
» the all-to-all example applies, coherent or not
Bus increases latency to local memory
With coherence, typically wait for local snoop results before sending remote requests
Snoopy bus at the remote node increases delays there too, increasing latency and reducing bandwidth
May hurt performance if sharing patterns don't comply

Insight into Directory Requirements
If most misses involve O(P) transactions, might as well broadcast!
Study inherent program characteristics:
» frequency of write misses?
» how many sharers on a write miss?
» how do these scale?
Also provides insight into how to organize and store directory information

Cache Invalidation Patterns
[Figure: histograms of the percentage of shared writes vs. the number of invalidations they cause (buckets 0, 1, ..., 7, then 8 to 11 up through 60 to 63) for LU and Ocean; most invalidating writes hit very few caches.]

Cache Invalidation Patterns (con't)
[Figure: the same invalidation histograms for Barnes-Hut and Radiosity.]

Sharing Patterns Summary
Generally, few sharers at a write; scales slowly with P
Code and read-only objects (e.g., scene data in Raytrace)
» no problems, as rarely written
Migratory objects (e.g., cost array cells in LocusRoute)
» even as the # of PEs scales, only 1-2 invalidations
Mostly-read objects (e.g., root of tree in Barnes)
» invalidations are large but infrequent, so little impact on performance
Frequently read/written objects (e.g., task queues)
» invalidations usually remain small, though frequent
Synchronization objects
» low-contention locks result in small invalidations
» high-contention locks need special support (SW trees, queueing locks)
Implies directories are very useful in containing traffic
» if organized properly, traffic and latency shouldn't scale too badly
Suggests techniques to reduce storage overhead

Organizing Directories
[Taxonomy: Directory Schemes divide into Centralized and Distributed; Distributed divides into Flat and Hierarchical; Flat divides into Memory-based and Cache-based. Two questions organize the space: how to find the source of directory information, and how to locate the copies.]

How to Find Directory Information
Centralized memory and directory: easy, go to it
» but not scalable
Distributed memory and directory:
» flat schemes: directory distributed with the memory, at the home; location based on address (hashing): network transaction sent directly to the home
» hierarchical schemes: ??

How Hierarchical Directories Work
[Figure: a tree with processing nodes at the leaves, level-1 directories above them, and a level-2 directory at the root.]
Level-1 directory: tracks which of its children processing nodes have a copy of the memory block; also tracks which local memory blocks are cached outside this subtree; inclusion is maintained between processor caches and the directory.
Level-2 directory: tracks which of its children level-1 directories have a copy of the memory block; also tracks which local memory blocks are cached outside this subtree; inclusion is maintained between level-1 directories and the level-2 directory.

Find Directory Info (con't)
Distributed memory and directory:
» flat schemes: hash
» hierarchical schemes: a node's directory entry for a block says whether each subtree caches the block; to find directory info, send a search message up to the parent; it routes itself through directory lookups
» like hierarchical snooping, but point-to-point messages between children and parents
The directory is a hierarchical data structure
» leaves are processing nodes; internal nodes are just directories
» a logical hierarchy, not necessarily physical (can be embedded in a general network)
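
A minimal sketch of the flat scheme's "location based on address" lookup in C; the node count, block size, and interleaving function are illustrative assumptions rather than any particular machine's scheme:

    #include <stdint.h>

    #define NUM_NODES   64      /* assumed machine size */
    #define BLOCK_SHIFT 6       /* assumed 64-byte coherence blocks */

    /* The home of a block is a fixed function of its physical address,
       so a request can be sent straight to the home with no search. */
    static inline int home_node(uint64_t paddr) {
        uint64_t block = paddr >> BLOCK_SHIFT;  /* block number */
        return (int)(block % NUM_NODES);        /* simple interleaving */
    }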

How Is Location of Copies Stored?
Hierarchical schemes
» through the hierarchy
» each directory has presence bits for its child subtrees and a dirty bit
Flat schemes
» vary a lot: different storage overheads and performance characteristics
» memory-based schemes: info about copies stored all at the home with the memory block (Dash, Alewife, SGI Origin, Flash)
» cache-based schemes: info about copies distributed among the copies themselves; each copy points to the next (Scalable Coherent Interface, SCI: an IEEE standard)

Flat, Memory-based Schemes
Info about copies co-located with the block at the home
» just like the centralized scheme, except distributed
Performance scaling
» traffic on a write: proportional to the number of sharers
» latency on a write: can issue invalidations to sharers in parallel
Storage overhead
» simplest representation: full bit vector (called a "Full-Mapped Directory"), i.e. one presence bit per node
» storage overhead doesn't scale well with P; a 64-byte line implies: 64 nodes: 12.5% ovhd.; 256 nodes: 50% ovhd.; 1024 nodes: 200% ovhd.
» for M memory blocks in memory, storage overhead is proportional to P*M: assuming each node has memory M_local = M/P, that is P² × M_local
» this is why people talk about full-mapped directories as scaling with the square of the number of processors

Reducing Storage Overhead
Optimizations for full bit vector schemes
» increase cache block size (reduces storage overhead proportionally)
» use multiprocessor nodes (one bit per MP node, not per processor)
» still scales as P*M, but reasonable for all but very large machines: 256 procs, 4 per cluster, 128B line: 6.25% ovhd.
Reducing "width": addressing the P term
Reducing "height": addressing the M term

Storage Reductions
Width observation: most blocks are cached by only a few nodes
» don't keep a bit per node; instead the entry contains a few pointers to sharing nodes (called "Limited Directory Protocols")
» P=1024 means 10-bit pointers; can use 100 pointers and still save space
» sharing patterns indicate a few pointers should suffice (five or so)
» need an overflow strategy for when there are more sharers
Height observation: number of memory blocks >> number of cache blocks
» most directory entries are useless at any given time
» could allocate directory entries from a pool: if a memory line has no directory entry, no one has a copy; on overflow, invalidate an entry (with invalidations) to reuse it
» organize the directory as a cache, rather than having one entry per memory block
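
The overhead percentages above are just P presence bits divided by the block size in bits; a quick check in C, using the slide's 64-byte block:

    #include <stdio.h>

    int main(void) {
        const double block_bits = 64 * 8;       /* 64-byte coherence block */
        const int nodes[] = {64, 256, 1024};
        for (int i = 0; i < 3; i++)
            printf("%4d nodes: %5.1f%% overhead\n",
                   nodes[i], 100.0 * nodes[i] / block_bits);
        return 0;   /* prints 12.5, 50.0, 200.0 */
    }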

Case Study: Alewife Architecture
Cost-effective mesh network
» Pro: scales in terms of hardware
» Pro: exploits locality
Directory distributed along with main memory
» bandwidth scales with the number of processors
Con: non-uniform latencies of communication
» have to manage the mapping of processes/threads onto processors
» Alewife employs techniques for latency minimization and latency tolerance so the programmer does not have to manage this
Context switch in 11 cycles between processes on a remote memory request, which has to incur communication network latency
Cache controller holds tags and implements the coherence protocol

LimitLESS Protocol (Alewife)
Limited Directory that is Locally Extended through Software Support
Handle the common case (small worker set) in hardware and the exceptional case (overflow) in software
Processor with rapid trap handling (executes trap code within a few cycles of initiation)
State shared between processor and controller
» processor needs complete access to coherence-related controller state in the hardware directories
» directory controller can invoke processor trap handlers
Machine needs an interface to the network that allows the processor to launch and to intercept coherence protocol packets

The Protocol
Alewife: p=5-entry limited directory with software extension (LimitLESS)
Read-only directory transaction:
» incoming RREQ with n ≤ p: the hardware memory controller responds
» if n > p: send the RREQ to the processor for handling

Transition to Software
Trap routine can either discard the packet or store it to memory
» store-back capability permits message-passing and block transfers
Potential deadlock scenario: processor stalled and waiting for a remote cache-fill
» solution: synchronous trap (stored in local memory) to empty the input queue
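
A minimal sketch of the LimitLESS hardware/software split, assuming the five hardware pointers above; the types and the trap-handler name are illustrative, not Alewife's actual interface:

    #include <stdint.h>
    #include <stdbool.h>

    #define HW_PTRS 5                     /* limited directory size */

    typedef struct {
        uint16_t sharer[HW_PTRS];         /* hardware pointers to sharers */
        uint8_t  count;                   /* sharers tracked in hardware */
        bool     overflowed;              /* software bit vector took over */
    } dir_entry_t;

    /* Hypothetical trap handler: moves the hardware pointers into a
       full-map bit vector in local memory (the LimitLESS overflow case). */
    extern void software_overflow_trap(dir_entry_t *e, uint16_t node);

    void add_sharer(dir_entry_t *e, uint16_t node) {
        if (!e->overflowed && e->count < HW_PTRS) {
            e->sharer[e->count++] = node; /* common case: pure hardware */
        } else {
            e->overflowed = true;
            software_overflow_trap(e, node);
        }
    }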

Transition to Software (con't)
Overflow trap scenario
» first instance: a full-map bit vector is allocated in local memory, the hardware pointers are transferred into it, and the vector is entered into a hash table
» otherwise: transfer the hardware pointers into the existing bit vector
» meta-state set to "Trap-On-Write"
» while emptying the hardware pointers, meta-state: "Trans-In-Progress"
Incoming write request scenario
» empty the hardware pointers to memory
» set AckCtr to the number of bits that are set in the bit vector
» send invalidations to all caches except possibly the requesting one
» free the vector in memory
» upon invalidate acknowledgements (AckCtr == 0), send write permission and set the memory state to "Read-Write"

Flat, Cache-based Schemes
How they work:
» the home only holds a pointer to the rest of the directory info
» a distributed linked list of copies weaves through the caches: each cache tag has a pointer to the next cache with a copy
» on a read, add yourself to the head of the list (communication needed)
» on a write, propagate a chain of invalidations down the list
Scalable Coherent Interface (SCI): IEEE standard; doubly linked list
[Figure: main memory at the home holds the head pointer; the caches of nodes 0, 1, and 2 are linked into a sharing list.]

Scaling Properties (Cache-based)
Traffic on a write: proportional to the number of sharers
Latency on a write: proportional to the number of sharers!
» don't know the identity of the next sharer until you reach the current one
» also assist processing at each node along the way
» (even reads involve more than one other assist: the home and the first sharer on the list)
Storage overhead: quite good scaling along both axes
» only one head pointer per memory block; the rest is all proportional to cache size
Very complex!!!

Summary of Directory Organizations
Flat schemes:
Issue (a): finding the source of directory data
» go to the home, based on the address
Issue (b): finding out where the copies are
» memory-based: all info is in the directory at the home
» cache-based: the home has a pointer to the first element of a distributed linked list
Issue (c): communicating with those copies
» memory-based: point-to-point messages (perhaps coarser on overflow); can be multicast or overlapped
» cache-based: part of the point-to-point linked-list traversal to find them; serialized
Hierarchical schemes:
» all three issues handled by sending messages up and down the tree
» no single explicit list of sharers
» only direct communication is between parents and children
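
A sketch of the cache-based list state under simplifying assumptions (a single block, node ids as array indices); real SCI involves many more states and a careful message protocol:

    #include <stdint.h>

    #define NO_NODE 0xFFFF

    typedef struct { uint16_t head; } home_state_t;        /* at the home */
    typedef struct { uint16_t fwd, bwd; } cache_state_t;   /* per cache */

    /* On a read, the requester adds itself at the head of the sharing
       list: one transaction to the home, one to the old head. */
    void list_insert_head(home_state_t *home, cache_state_t *caches,
                          uint16_t me) {
        caches[me].fwd = home->head;
        caches[me].bwd = NO_NODE;
        if (home->head != NO_NODE)
            caches[home->head].bwd = me;
        home->head = me;
    }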

Summary of Directory Approaches
Directories offer scalable coherence on general networks
» no need for broadcast media
Many possibilities for organizing the directory and managing protocols
Hierarchical directories not used much
» high latency, many network transactions, and a bandwidth bottleneck at the root
Both memory-based and cache-based flat schemes are alive
» for memory-based, a full bit vector suffices for moderate scale (measured in nodes visible to the directory protocol, not processors)
» will examine case studies of each

Role of Synchronization
Types of synchronization
» mutual exclusion
» event synchronization: point-to-point, group, global (barriers)
How much hardware support?
» high-level operations?
» atomic instructions?
» specialized interconnect?

Components of a Synchronization Event
Acquire method
» acquire the right to the synch: enter critical section, go past event
Waiting algorithm
» wait for the synch to become available when it isn't: busy-waiting, blocking, or hybrid
Release method
» enable other processors to acquire the right to the synch
The waiting algorithm is independent of the type of synchronization
» makes no sense to put it in hardware

Strawman Lock (Busy-Wait)
    lock:   ld  register, location  /* copy location to register */
            cmp location, #0        /* compare with 0 */
            bnz lock                /* if not 0, try again */
            st  location, #1        /* store 1 to mark it locked */
            ret                     /* return control to caller */
    unlock: st  location, #0        /* write 0 to location */
            ret                     /* return control to caller */
Why doesn't the acquire method work? Release method?
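
To make the bug concrete, here is the same sequence in C: the test and the set are separate memory operations, so two threads can interleave between them (a sketch of the broken lock, not something to use):

    volatile int location = 0;

    void broken_lock(void) {
        while (location != 0)   /* ld + cmp + bnz */
            ;
        /* race window: another thread can pass the test here */
        location = 1;           /* st location, #1 */
    }

    void broken_unlock(void) {  /* the release method is fine */
        location = 0;           /* st location, #0 */
    }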

What to do if you only have load and store?
Here is a possible two-thread solution:

    Thread A:
        Set A=1;
        while (B) {          //X
            do nothing;
        }
        Critical Section;
        Set A=0;

    Thread B:
        Set B=1;
        if (!A) {            //Y
            Critical Section;
        }
        Set B=0;

Does this work? Yes. Both can guarantee that:
» only one will enter the critical section at a time
At X: if B=0, it is safe for A to perform the critical section; otherwise wait to find out what will happen
At Y: if A=0, it is safe for B to perform the critical section; otherwise, A is in the critical section or waiting for B to quit
But:
» really messy
» generalization gets worse

Atomic Instructions
Specify a location, a register, and an atomic operation
» the value in the location is read into the register
» another value (a function of the value read, or not) is stored into the location
Many variants
» varying degrees of flexibility in the second part
Simple example: test&set
» the value in the location is read into a specified register
» the constant 1 is stored into the location
» successful if the value loaded into the register is 0
» other constants could be used instead of 1 and 0
How to implement test&set in a distributed cache-coherent machine?
» wait until you have write privileges, then perform the operation without allowing any intervening operations (either local or remote)

    lock:   t&s register, location
            bnz register, lock      /* if not 0, try again */
            ret                     /* return control to caller */
    unlock: st  location, #0        /* write 0 to location */
            ret                     /* return control to caller */

T&S Lock Microbenchmark: SGI Challenge
[Figure: time (µs) vs. number of processors for the loop "lock; delay(c); unlock;": test&set with c = 0, test&set with exponential backoff and c = 3.64 µs, test&set with exponential backoff and c = 0, and the ideal curve.]

Zoo of hardware primitives
    test&set (&address) {            /* most architectures */
        result = M[address];
        M[address] = 1;
        return result;
    }
    swap (&address, register) {      /* x86 */
        temp = M[address];
        M[address] = register;
        register = temp;
    }
    compare&swap (&address, reg1, reg2) {  /* 68000 */
        if (reg1 == M[address]) {
            M[address] = reg2;
            return success;
        } else {
            return failure;
        }
    }
    load-linked&store-conditional(&address) {  /* R4000, alpha */
      loop:
        ll   r1, M[address];
        movi r2, 1;                  /* Can do arbitrary computation */
        sc   r2, M[address];
        beqz r2, loop;
    }
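
In C11, atomic_flag provides exactly this test&set primitive; a minimal spin-lock sketch:

    #include <stdatomic.h>

    static atomic_flag lock_loc = ATOMIC_FLAG_INIT;

    void acquire(void) {
        /* test&set returns the previous value; spin while it was 1 */
        while (atomic_flag_test_and_set_explicit(&lock_loc,
                                                 memory_order_acquire))
            ;   /* spin */
    }

    void release(void) {
        atomic_flag_clear_explicit(&lock_loc, memory_order_release);
    }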

Mini-Instruction Set Debate
Atomic read-modify-write instructions
» IBM 370: included atomic compare&swap for multiprogramming
» x86: any instruction can be prefixed with a lock modifier
» high-level language advocates want hardware locks/barriers: but that goes against the RISC flow, and has other problems
» SPARC: atomic register-memory ops (swap, compare&swap)
» MIPS, IBM Power: no atomic operations but a pair of instructions: load-locked, store-conditional; later used by PowerPC and DEC Alpha too
» 68000: CCS: compare and compare-and-swap; no one does this any more
Rich set of tradeoffs

Other Forms of Hardware Support
» separate lock lines on the bus
» lock locations in memory
» lock registers (Cray X-MP, Intel Single-Chip Cloud Computer)
» hardware full/empty bits (Tera, Alewife)
» QOLB (machines supporting the SCI protocol)
» bus support for interrupt dispatch

Enhancements to Simple Lock
Reduce the frequency of issuing test&sets while waiting
» test&set lock with backoff
» don't back off too much, or you will still be backed off when the lock becomes free
» exponential backoff works quite well empirically: ith time = k*c^i
Busy-wait with read operations rather than test&set
» test-and-test&set lock
» keep testing with an ordinary load: the cached lock variable will be invalidated when the release occurs
» when the value changes (to 0), try to obtain the lock with test&set: only one attemptor will succeed; the others will fail and start testing again

Busy-wait vs Blocking
Busy-wait (i.e. spin lock)
» keep trying to acquire the lock until you succeed
» very low latency/processor overhead!
» very high system overhead! causes stress on the network while spinning; the processor is not doing anything else useful
Blocking
» if you can't acquire the lock, deschedule the process (i.e. unload its state)
» higher latency/processor overhead (1000s of cycles?): takes time to unload/restart a task; a notification mechanism is needed
» low system overhead: no stress on the network; the processor does something useful
Hybrid: spin for a while, then block
» 2-competitive: spin until you have waited the blocking time
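
A C11 sketch of the test-and-test&set lock with exponential backoff; the backoff constants are illustrative, not tuned:

    #include <stdatomic.h>

    static atomic_int the_lock;   /* 0 = free, 1 = held */

    void tts_acquire(void) {
        unsigned backoff = 1;
        for (;;) {
            /* test: spin locally on the cached copy */
            while (atomic_load_explicit(&the_lock, memory_order_relaxed) != 0)
                ;
            /* test&set: only one attemptor succeeds */
            if (atomic_exchange_explicit(&the_lock, 1,
                                         memory_order_acquire) == 0)
                return;
            for (volatile unsigned i = 0; i < backoff; i++)
                ;                            /* back off before retrying */
            if (backoff < 1024) backoff *= 2;
        }
    }

    void tts_release(void) {
        atomic_store_explicit(&the_lock, 0, memory_order_release);
    }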

Improved Hardware Primitives: LL-SC
Goals:
» test with reads
» failed read-modify-write attempts don't generate invalidations
» nice if a single primitive can implement a range of r-m-w operations
Load-Locked (or -linked), Store-Conditional
» LL reads the variable into a register
» follow with arbitrary instructions to manipulate its value
» SC tries to store back to the location: it succeeds if and only if there has been no other write to the variable since this processor's LL (indicated by condition codes)
If the SC succeeds, all three steps happened atomically
If it fails, it doesn't write or generate invalidations
» must retry the acquire

Simple Lock with LL-SC
    lock:   ll   reg1, location   /* LL location to reg1 */
            bnz  reg1, lock       /* if already locked, try again */
            sc   location, reg2   /* SC reg2 (holding 1) into location */
            beqz reg2, lock       /* if SC failed, start again */
            ret
    unlock: st   location, #0     /* write 0 to location */
            ret
Can do more fancy atomic ops by changing what's between the LL & SC
» but keep it small so the SC is likely to succeed
» don't include instructions that would need to be undone (e.g. stores)
SC can fail (without putting the transaction on the bus) if:
» it detects an intervening write even before trying to get the bus
» it tries to get the bus but another processor's SC gets the bus first
LL, SC are not lock, unlock respectively
» they only guarantee no conflicting write to the lock variable between them
» but they can be used directly to implement simple operations on shared variables

Ticket Lock
Only one r-m-w per acquire
Two counters per lock (next_ticket, now_serving)
Acquire: fetch&inc next_ticket; wait for now_serving == next_ticket
» atomic op when you arrive at the lock, not when it's free (so less contention)
Release: increment now_serving
Performance
» low latency for low contention, if fetch&inc is cacheable
» O(p) read misses at release, since all spin on the same variable
» FIFO order
» like the simple LL-SC lock, but no invalidation when the SC succeeds, and fair
Backoff? Wouldn't it be nice to poll different locations...

Array-based Queuing Locks
Waiting processes poll on different locations in an array of size p
Acquire
» fetch&inc to obtain the address on which to spin (the next array element)
» ensure that these addresses are in different cache lines or memories
Release
» set the next location in the array, thus waking up the process spinning on it
O(1) traffic per acquire with coherent caches
FIFO ordering, as in the ticket lock, but O(p) space per lock
Not so great for non-cache-coherent machines with distributed memory
» the array location I spin on is not necessarily in my local memory
Example: MCS lock (Mellor-Crummey and Scott)
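
A C11 sketch of the ticket lock, with fetch&inc rendered as atomic_fetch_add:

    #include <stdatomic.h>

    typedef struct {
        atomic_uint next_ticket;   /* fetch&inc'd by each arriving thread */
        atomic_uint now_serving;   /* incremented on release */
    } ticket_lock_t;

    void ticket_acquire(ticket_lock_t *l) {
        unsigned me = atomic_fetch_add_explicit(&l->next_ticket, 1,
                                                memory_order_relaxed);
        while (atomic_load_explicit(&l->now_serving,
                                    memory_order_acquire) != me)
            ;   /* all waiters spin on now_serving (hence O(p) misses) */
    }

    void ticket_release(ticket_lock_t *l) {
        atomic_fetch_add_explicit(&l->now_serving, 1, memory_order_release);
    }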

Lock Performance on SGI Challenge
[Figure: time (µs) vs. number of processors for array-based, LL-SC, LL-SC with exponential backoff, ticket, and ticket with proportional backoff locks, running the loop "lock; delay(c); unlock; delay(d);" in three cases: (a) Null (c = 0, d = 0), (b) Critical-section (c = 3.64 µs, d = 0), (c) Delay (c = 3.64 µs, d = 1.29 µs).]

Point-to-Point Event Synchronization
Software methods:
» interrupts
» busy-waiting: use ordinary variables as flags
» blocking: use semaphores
Full hardware support: a full/empty bit with each word in memory
» set when the word is "full" with newly produced data (i.e. when written)
» unset when the word is "empty" due to being consumed (i.e. when read)
Natural for word-level producer-consumer synchronization
» producer: write if empty, set to full; consumer: read if full, set to empty
Hardware preserves the atomicity of the bit manipulation with the read or write
Problem: flexibility
» multiple consumers, or multiple writes before the consumer reads?
» needs language support to specify when to use it
» composite data structures?

Barriers
Software algorithms implemented using locks, flags, counters
Hardware barriers
» wired-AND line separate from the address/data bus: set your input high when you arrive, wait for the output to be high to leave
» in practice, multiple wires to allow reuse
» useful when barriers are global and very frequent
» difficult to support an arbitrary subset of processors: even harder with multiple processes per processor
» difficult to dynamically change the number and identity of participants: e.g. the latter due to process migration
» not common today on bus-based machines

A Simple Centralized Barrier
Shared counter maintains the number of processes that have arrived
» increment when you arrive (lock), check until it reaches numprocs
» Problem?

    struct bar_type {
        int counter;
        struct lock_type lock;
        int flag = 0;
    } bar_name;

    BARRIER (bar_name, p) {
        LOCK(bar_name.lock);
        if (bar_name.counter == 0)
            bar_name.flag = 0;           /* reset flag if first to reach */
        mycount = bar_name.counter++;    /* mycount is private */
        UNLOCK(bar_name.lock);
        if (mycount == p) {              /* last to arrive */
            bar_name.counter = 0;        /* reset for next barrier */
            bar_name.flag = 1;           /* release waiters */
        }
        else
            while (bar_name.flag == 0) {};  /* busy wait for release */
    }

A Working Centralized Barrier
Consecutively entering the same barrier doesn't work
» must prevent a process from entering until all have left the previous instance
» could use another counter, but that increases latency and contention
Sense reversal: wait for the flag to take a different value in consecutive instances
» toggle this value only when all processes have reached the barrier

    BARRIER (bar_name, p) {
        local_sense = !(local_sense);    /* toggle private sense variable */
        LOCK(bar_name.lock);
        mycount = bar_name.counter++;    /* mycount is private */
        if (bar_name.counter == p) {     /* last to arrive */
            UNLOCK(bar_name.lock);
            bar_name.counter = 0;        /* reset for next barrier */
            bar_name.flag = local_sense; /* release waiters */
        }
        else {
            UNLOCK(bar_name.lock);
            while (bar_name.flag != local_sense) {};
        }
    }

Centralized Barrier Performance
Latency
» centralized has a critical path length at least proportional to p
Traffic
» about 3p bus transactions
Storage cost
» very low: a centralized counter and flag
Fairness
» the same processor should not always be last to exit the barrier
» no such bias in the centralized barrier
Key problems for the centralized barrier are latency and traffic
» especially with distributed memory, where the traffic all goes to the same node

Improved Barrier Algorithms for a Bus
Software combining tree
» only k processors access the same location, where k is the degree of the tree
» [Figure: flat arrival (contention at one location) vs. tree-structured arrival (little contention).]
Separate arrival and exit trees, and use sense reversal
Valuable in a distributed network: communicate along different paths
On a bus, all traffic goes on the same bus, and there is no less total traffic
» higher latency (log p steps of work, and O(p) serialized bus transactions)
» the advantage on a bus is the use of ordinary reads/writes instead of locks

Barrier Performance on SGI Challenge
[Figure: time (µs) vs. number of processors for centralized, combining tree, tournament, and dissemination barriers.]
Centralized does quite well
» will discuss fancier barrier algorithms for distributed machines
Helpful hardware support: piggybacking of read misses on the bus
» also for spinning on highly contended locks
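
The same sense-reversing barrier as a C11 sketch, using fetch&inc on the counter instead of an explicit lock:

    #include <stdatomic.h>
    #include <stdbool.h>

    typedef struct {
        atomic_int  counter;
        atomic_bool flag;
    } barrier_t;

    static _Thread_local bool local_sense = false;

    void barrier_wait(barrier_t *b, int p) {
        local_sense = !local_sense;               /* toggle private sense */
        if (atomic_fetch_add(&b->counter, 1) == p - 1) {
            atomic_store(&b->counter, 0);         /* last to arrive: reset */
            atomic_store(&b->flag, local_sense);  /* release waiters */
        } else {
            while (atomic_load(&b->flag) != local_sense)
                ;                                 /* busy wait for release */
        }
    }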

Lock-Free Synchronization
What happens if a process grabs a lock, then goes to sleep???
» page fault
» processor scheduling
» etc.
Lock-free synchronization
» operations do not require mutual exclusion over multiple instructions
Nonblocking: some process will complete in a finite amount of time even if other processors halt
Wait-free (Herlihy): every (nonfaulting) process will complete in a finite amount of time
Systems based on LL&SC can implement these

Using Compare&Swap for Queues
    compare&swap (&address, reg1, reg2) {  /* 68000 */
        if (reg1 == M[address]) {
            M[address] = reg2;
            return success;
        } else {
            return failure;
        }
    }

Here is an atomic add-to-linked-list function:
    addToQueue(&object) {
        do {                       // repeat until no conflict
            ld r1, M[root]         // Get ptr to current head
            st r1, M[object]       // Save link in new object
        } until (compare&swap(&root, r1, object));
    }
[Figure: root points to the head of a linked list of objects; the new object's next pointer is set to the old head before the compare&swap swings root to the new object.]

Transactional Memory
Transaction-based model of memory
Interface:
» start_transaction();
» read/write data
» commit_transaction(): if conflicts are detected, the commit will abort and must be retried
What is a conflict?
» if values you read are written by others before you commit
Hardware support for transactions
» typically uses the cache coherence protocol to help the process

Brief discussion of Transactional Memory
LogTM: Log-based Transactional Memory (Kevin Moore, Jayaram Bobba, Michelle Moravan, Mark Hill & David Wood)
Uses the cache coherence protocol to detect transaction conflicts
Transactional interface:
» begin_transaction(): requests that subsequent statements form a transaction
» commit_transaction(): ends a successful transaction begun by the matching begin_transaction(); discards any transaction state saved for a potential abort
» abort_transaction(): transfers control to a previously registered conflict handler, which should undo and discard the work since the last begin_transaction()
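
The same addToQueue rendered in C11, with compare_exchange playing the role of compare&swap (a sketch; a complete queue must also deal with the ABA problem on removal):

    #include <stdatomic.h>
    #include <stddef.h>

    typedef struct node {
        struct node *next;
        int value;
    } node_t;

    static _Atomic(node_t *) root = NULL;

    void add_to_queue(node_t *object) {
        node_t *head;
        do {                                  /* repeat until no conflict */
            head = atomic_load(&root);        /* get ptr to current head */
            object->next = head;              /* save link in new object */
        } while (!atomic_compare_exchange_weak(&root, &head, object));
    }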

Specific Logging Mechanism
[Figure: LogTM's undo-log mechanism; the slide's diagram was not captured in this transcription.]

Summary
Performance enhancements
» reduce the number of hops
» reduce the occupancy of transactions in the memory controller
Deadlock issues with protocols
» many protocols are not simply request-response
» consider the dual graph of message dependencies: Nodes: networks, Arcs: protocol actions
» consider the maximum depth of the graph to discover the number of networks
Distributed directory structure
» flat: each address has a home node
» hierarchical: directory spread along a tree
Mechanism for locating copies of data
» memory-based schemes: info about copies stored all at the home with the memory block
» cache-based schemes: info about copies distributed among the copies themselves

Synchronization Summary
Rich interaction of hardware-software tradeoffs
Must evaluate hardware primitives and software algorithms together
» the primitives determine which algorithms perform well
Evaluation methodology is challenging
» use of delays, microbenchmarks
» should use both microbenchmarks and real workloads
Simple software algorithms with common hardware primitives do well on a bus
» will see more sophisticated techniques for distributed machines
Hardware support still subject of debate
» theoretical research argues for swap or compare&swap, not fetch&op
» algorithms that ensure constant-time access, but complex


More information

Computer Architecture and Engineering CS152 Quiz #5 April 27th, 2016 Professor George Michelogiannakis Name: <ANSWER KEY>

Computer Architecture and Engineering CS152 Quiz #5 April 27th, 2016 Professor George Michelogiannakis Name: <ANSWER KEY> Computer Architecture and Engineering CS152 Quiz #5 April 27th, 2016 Professor George Michelogiannakis Name: This is a closed book, closed notes exam. 80 Minutes 19 pages Notes: Not all questions

More information

Relaxed Memory-Consistency Models

Relaxed Memory-Consistency Models Relaxed Memory-Consistency Models [ 9.1] In Lecture 13, we saw a number of relaxed memoryconsistency models. In this lecture, we will cover some of them in more detail. Why isn t sequential consistency

More information

ECE7660 Parallel Computer Architecture. Shared Memory Multiprocessors

ECE7660 Parallel Computer Architecture. Shared Memory Multiprocessors ECE7660 Parallel Computer Architecture Shared Memory Multiprocessors 1 Layer Perspective CAD Database Scientific modeling Parallel applications Multipr ogramming Shar ed addr ess Message passing Data parallel

More information

ESE 545 Computer Architecture Symmetric Multiprocessors and Snoopy Cache Coherence Protocols CA SMP and cache coherence

ESE 545 Computer Architecture Symmetric Multiprocessors and Snoopy Cache Coherence Protocols CA SMP and cache coherence Computer Architecture ESE 545 Computer Architecture Symmetric Multiprocessors and Snoopy Cache Coherence Protocols 1 Shared Memory Multiprocessor Memory Bus P 1 Snoopy Cache Physical Memory P 2 Snoopy

More information

Lecture 10: Cache Coherence: Part I. Parallel Computer Architecture and Programming CMU /15-618, Spring 2015

Lecture 10: Cache Coherence: Part I. Parallel Computer Architecture and Programming CMU /15-618, Spring 2015 Lecture 10: Cache Coherence: Part I Parallel Computer Architecture and Programming CMU 15-418/15-618, Spring 2015 Tunes Marble House The Knife (Silent Shout) Before starting The Knife, we were working

More information

Portland State University ECE 588/688. Transactional Memory

Portland State University ECE 588/688. Transactional Memory Portland State University ECE 588/688 Transactional Memory Copyright by Alaa Alameldeen 2018 Issues with Lock Synchronization Priority Inversion A lower-priority thread is preempted while holding a lock

More information

A More Sophisticated Snooping-Based Multi-Processor

A More Sophisticated Snooping-Based Multi-Processor Lecture 16: A More Sophisticated Snooping-Based Multi-Processor Parallel Computer Architecture and Programming CMU 15-418/15-618, Spring 2014 Tunes The Projects Handsome Boy Modeling School (So... How

More information

Lecture 33: Multiprocessors Synchronization and Consistency Professor Randy H. Katz Computer Science 252 Spring 1996

Lecture 33: Multiprocessors Synchronization and Consistency Professor Randy H. Katz Computer Science 252 Spring 1996 Lecture 33: Multiprocessors Synchronization and Consistency Professor Randy H. Katz Computer Science 252 Spring 1996 RHK.S96 1 Review: Miss Rates for Snooping Protocol 4th C: Coherency Misses More processors:

More information

Foundations of Computer Systems

Foundations of Computer Systems 18-600 Foundations of Computer Systems Lecture 21: Multicore Cache Coherence John P. Shen & Zhiyi Yu November 14, 2016 Prevalence of multicore processors: 2006: 75% for desktops, 85% for servers 2007:

More information

Computer Architecture

Computer Architecture Jens Teubner Computer Architecture Summer 2016 1 Computer Architecture Jens Teubner, TU Dortmund jens.teubner@cs.tu-dortmund.de Summer 2016 Jens Teubner Computer Architecture Summer 2016 83 Part III Multi-Core

More information

Page 1. SMP Review. Multiprocessors. Bus Based Coherence. Bus Based Coherence. Characteristics. Cache coherence. Cache coherence

Page 1. SMP Review. Multiprocessors. Bus Based Coherence. Bus Based Coherence. Characteristics. Cache coherence. Cache coherence SMP Review Multiprocessors Today s topics: SMP cache coherence general cache coherence issues snooping protocols Improved interaction lots of questions warning I m going to wait for answers granted it

More information

ECE 669 Parallel Computer Architecture

ECE 669 Parallel Computer Architecture ECE 669 Parallel Computer Architecture Lecture 19 Processor Design Overview Special features in microprocessors provide support for parallel processing Already discussed bus snooping Memory latency becoming

More information

Multiprocessors 1. Outline

Multiprocessors 1. Outline Multiprocessors 1 Outline Multiprocessing Coherence Write Consistency Snooping Building Blocks Snooping protocols and examples Coherence traffic and performance on MP Directory-based protocols and examples

More information

Lecture #7: Implementing Mutual Exclusion

Lecture #7: Implementing Mutual Exclusion Lecture #7: Implementing Mutual Exclusion Review -- 1 min Solution #3 to too much milk works, but it is really unsatisfactory: 1) Really complicated even for this simple example, hard to convince yourself

More information

Lecture 5: Directory Protocols. Topics: directory-based cache coherence implementations

Lecture 5: Directory Protocols. Topics: directory-based cache coherence implementations Lecture 5: Directory Protocols Topics: directory-based cache coherence implementations 1 Flat Memory-Based Directories Block size = 128 B Memory in each node = 1 GB Cache in each node = 1 MB For 64 nodes

More information