5 Chip Multiprocessors (II) Robert Mullins


1 5 Chip Multiprocessors (II) Chip Multiprocessors (ACS MPhil) Robert Mullins

2 Overview Synchronization hardware primitives Cache Coherency Issues Coherence misses Cache coherence and interconnects Directory-based Coherency Protocols

3 Synchronization The lock problem The lock is supposed to provide atomicity for critical sections. Unfortunately, as implemented, this lock lacks atomicity in its own implementation: multiple processors could read the lock as free and progress past the branch simultaneously.
lock:   ld reg, lock-addr
        cmp reg, #0
        bnz lock
        st lock-addr, #1
        ret
unlock: st lock-addr, #0
        ret
(Culler)
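The failure mode can be seen in a C sketch of the same lock (names are illustrative, not from the slides): nothing makes the test and the store a single atomic step.

```c
/* A sketch of the broken lock: the load, compare and store are three
 * separate operations, so two processors can both observe locked == 0,
 * both fall through the test, and both enter the critical section. */
typedef struct { volatile int locked; } naive_lock_t;

void naive_lock(naive_lock_t *l) {
    while (l->locked != 0)   /* ld reg, lock-addr; cmp reg, #0; bnz lock */
        ;                    /* spin */
    l->locked = 1;           /* st lock-addr, #1 -- not atomic with the test */
}

void naive_unlock(naive_lock_t *l) {
    l->locked = 0;           /* st lock-addr, #0 */
}
```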

4 Synchronization Test and Set Executes the following atomically: reg=m[lock-addr]; m[lock-addr]=1. The branch makes sure that if the lock was already taken we try again. A more general, but similar, instruction is swap: reg1=m[lock-addr]; m[lock-addr]=reg2.
lock:   t&s reg, lock-addr
        bnz reg, lock
        ret
unlock: st lock-addr, #0
        ret
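As a working counterpart, here is a test-and-set spinlock sketched with C11 atomics (names are illustrative): `atomic_exchange` performs the read-and-set as a single atomic read-modify-write, which is exactly the property the t&s instruction provides.

```c
#include <stdatomic.h>

typedef struct { atomic_int locked; } tas_lock_t;

void tas_lock(tas_lock_t *l) {
    /* t&s: reg = m[lock-addr]; m[lock-addr] = 1, atomically.       */
    /* If the old value was non-zero the lock was taken: try again. */
    while (atomic_exchange(&l->locked, 1) != 0)
        ;  /* spin (bnz reg, lock) */
}

void tas_unlock(tas_lock_t *l) {
    atomic_store(&l->locked, 0);   /* st lock-addr, #0 */
}
```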

5 Synchronization We could implement test&set with two bus transactions: a read and a write transaction. We could lock down the bus for these two cycles to ensure the sequence is atomic. This is more difficult with a split-transaction bus (performance and deadlock issues). Culler p.391

6 Synchronization If we assume an invalidation-based CC protocol with a WB cache, a better approach is to: issue a read-exclusive (BusRdX) transaction, then perform the read and write in the cache without giving up ownership. Any incoming requests to the block are buffered until the data is written in the cache. Any other processors are forced to wait.

7 Synchronization Other common synchronization instructions: swap, fetch&op (fetch&inc, fetch&add), compare&swap. Many x86 instructions can be prefixed with the lock modifier to make them atomic. A simpler general-purpose solution?

8 LL/SC Load-Linked (LL): read memory, set the lock flag and put the address in the lock register. Intervening writes to the address in the lock register will cause the lock flag to be reset. Store-Conditional (SC): check the lock flag to ensure an intervening conflicting write has not occurred; if the lock flag is not set, SC will fail. If (atomic_update) then mem[addr]=rt, rt=1 else rt=0

9 LL/SC
        reg2=1
lock:   ll reg1, lock-addr
        bnz reg1, lock     ; lock already taken?
        sc lock-addr, reg2
        beqz reg2, lock    ; if SC failed goto lock
        ret
unlock: st lock-addr, #0
        ret
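C has no direct LL/SC primitive, but `atomic_compare_exchange_weak` expresses the same retry-on-interference pattern (and compilers lower it to LL/SC loops on targets such as ARM and RISC-V). A sketch of the lock above, with illustrative names:

```c
#include <stdatomic.h>

/* Mirrors the ll / bnz / sc / beqz loop: observe the lock, retry if it
 * is taken, otherwise attempt an atomic 0 -> 1 update that fails if
 * another write intervened (the SC failure case). */
void llsc_style_lock(atomic_int *lock) {
    for (;;) {
        if (atomic_load(lock) != 0)           /* ll + bnz: lock taken? */
            continue;
        int expected = 0;
        if (atomic_compare_exchange_weak(lock, &expected, 1))
            return;                           /* SC succeeded */
        /* SC failed (intervening write, or spurious failure): retry */
    }
}

void llsc_style_unlock(atomic_int *lock) {
    atomic_store(lock, 0);                    /* st lock-addr, #0 */
}
```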

10 LL/SC This SC will fail as the lock flag will be reset by the store from P2. (Culler)

11 LL/SC LL/SC can be implemented using the CC protocol: LL loads the cache line with write permission (issues BusRdX, holds the line in state M). SC only succeeds if the cache line is still in state M, otherwise it fails.

12 LL/SC Need to ensure forward progress: may prevent LL from giving up the M state for n cycles, or (after repeated fails) guarantee success, i.e. simply don't give up the M state. We normally implement a restricted form of LL/SC called RLL/RSC: SC may experience spurious failures, e.g. due to context switches and TLB misses. We add restrictions to avoid the cache line (holding the lock variable) being replaced: disallow memory-referencing instructions between LL and SC; prohibit out-of-order execution between LL and SC.

13 Coherence misses Remember your 3 C's! Compulsory: cold-start or first-reference misses. Capacity: the cache is not large enough to store all the blocks needed during the execution of the program. Conflict (or collision): conflict misses occur due to direct-mapped or set-associative block placement strategies. Coherence: misses that arise due to interprocessor communication.

14 True sharing A block typically contains many words (e.g. 4-8); coherency is maintained at the granularity of cache blocks. True sharing miss: misses that arise from the communication of data. E.g., the 1st write to a shared block (S) will cause an invalidation to establish ownership; additionally, subsequent reads of the invalidated block by another processor will also cause a miss. Both these misses are classified as true sharing if data is communicated and they would occur irrespective of block size.

15 False sharing False sharing miss: different processors are writing and reading different words in a block, but no communication is taking place. E.g. a block may contain words X and Y; P1 repeatedly writes to X, P2 repeatedly writes to Y. The block will be repeatedly invalidated (leading to cache misses) even though no communication is taking place. These are false misses and are due to the fact that the block contains multiple words; they would not occur if the block size equalled a single word. For more details see 'Coherence miss classification for performance debugging in multi-core processors', Venkataramani et al., INTERACT.
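The usual software mitigation is padding/alignment so X and Y land in different blocks. A sketch assuming 64-byte cache blocks (the struct names are illustrative):

```c
#include <stdalign.h>
#include <stddef.h>

/* X and Y in one 64-byte block: P1's writes to x invalidate P2's copy
 * even though P2 only touches y -- false sharing. */
struct together { long x, y; };

/* Give each word its own block: the writes no longer interfere. */
struct padded {
    alignas(64) long x;   /* x starts a new 64-byte block */
    alignas(64) long y;   /* y starts another one         */
};
```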

16 Cache coherence and interconnects Broadcast-based snoopy protocols rely on bus-based interconnects. Buses have limited scalability, and broadcasting has energy and bandwidth implications. They permit direct cache-to-cache transfers and hence low-latency communication: two hops (1. broadcast, 2. receive data from remote cache). Very useful for applications with lots of fine-grain sharing.

17 Cache coherence and interconnects Totally-ordered interconnects: all messages are delivered to all destinations in the same order. Totally-ordered interconnects often employ a centralised arbiter or switch, e.g. a bus or pipelined broadcast tree. Traditional snoopy protocols are built around the concept of a bus (or virtual bus): (1) broadcast: all transactions are visible to all components connected to the bus; (2) the interconnect provides a total order of messages.

18 Cache coherence and interconnects A pipelined broadcast tree is sufficiently similar to a bus to support traditional snooping protocols. The centralised switch guarantees a total ordering of messages, i.e. messages are sent to the root switch then broadcast. [Reproduced from Milo Martin's PhD thesis (Wisconsin)]

19 Cache coherence and interconnects Unordered interconnects: networks (e.g. mesh, torus) can't typically provide strong ordering guarantees, i.e. nodes don't perceive transactions in a single global order. Point-to-point ordering: networks may be able to ensure messages sent between a pair of nodes are guaranteed not to be reordered, e.g. a mesh with a single VC and deterministic dimension-ordered (XY) routing.

20 Directory-based cache coherence In a snoopy protocol the state of the blocks in each cache is maintained by broadcasting all memory operations on the bus. We want to avoid the need to broadcast, so we maintain the state of each block explicitly. We store this information in the directory. Requests can be made to the appropriate directory entry to read or write a particular block; the directory orchestrates the actions necessary to satisfy the request.

21 Directory-based cache coherence The directory provides a per-block ordering point to resolve races. All requests for a particular block are made to the same directory, and the directory decides the order in which the requests will be satisfied. Directory protocols can operate over unordered interconnects.

22 Broadcast-based directory protocols A number of recent coherence protocols broadcast transactions over unordered interconnects, similar to snoopy coherence protocols. They provide a directory, or coherence hub, that serves as an ordering point. The directory simply broadcasts requests to all nodes (no sharer state is maintained). The ordering point also buffers subsequent coherence requests to the same cache line to prevent races with a request in progress. An early example is AMD's Hammer protocol. High bandwidth requirements, but simple: no need to maintain/read sharer state.

23 Directory-based cache coherence The directory keeps track of who has a copy of the block and their states. Broadcasting is replaced by cheaper point-to-point communications by maintaining a list of sharers. The number of invalidations on a write is typically small in real applications, giving us a significant reduction in communication costs (esp. in systems with a large number of processors).

24 Directory-based cache coherence Read miss to a block in a modified state in a cache (Culler, Fig. 8.5). An example of a simple protocol; this is only meant to introduce the concept of a directory.

25 Directory-based cache coherence Write miss to a block with two sharers

26 Directory-based cache coherence Let's consider the requester, directory and sharer state transitions for the previous slide...
Requester state:
I->P (1): the block is initially in the I(nvalid) state. The processor executes a store; we make an ExclReq to the directory and move to a pending state.
P->E (4): we receive write permission and data from the directory.
Directory state:
Shared->TransWaitForInvalidate (2): the block is initially marked as shared; the directory holds a list of the sharers. The directory receives an ExclReq from cache 'id', where id is not in the sharers list and the sharers list is not empty. It must send invalidate requests to all sharers and wait for their responses.
TransWaitForInvalidate->M (4): all invalidate acks are received; the directory can reply to the requester and provide data + write permission. It moves to a state that records that the requester has the only copy.
Sharer state:
S->I (3): on receiving an InvReq each sharer invalidates its copy of the block and moves to state I. It then acks with an InvRep message.
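The directory side of these transitions can be sketched as a small state machine in C (names and encoding are hypothetical, not the slides' exact protocol): an ExclReq while Shared starts the invalidation round, and the reply to the requester is released only by the final InvRep.

```c
enum dir_state { DIR_SHARED, DIR_WAIT_INV, DIR_MODIFIED };

struct dir_entry {
    enum dir_state state;
    unsigned sharers;       /* bit i set => cache i holds a copy  */
    unsigned pending_acks;  /* InvReps still outstanding          */
    int owner;              /* requester, valid once DIR_MODIFIED */
};

/* Step 2: ExclReq arrives from a non-sharer while the block is Shared.
 * The directory sends an InvReq to every sharer (sends not modelled)
 * and waits for their acks: Shared -> TransWaitForInvalidate.        */
void dir_excl_req(struct dir_entry *e, int requester) {
    unsigned n = e->sharers, count = 0;
    while (n) { count += n & 1u; n >>= 1; }   /* popcount of sharers */
    e->pending_acks = count;
    e->owner = requester;
    e->state = DIR_WAIT_INV;
}

/* Steps 3-4: each InvRep retires one ack; the last one lets the
 * directory reply with data + write permission and record the new
 * sole owner: TransWaitForInvalidate -> M. Returns 1 on completion. */
int dir_inv_rep(struct dir_entry *e) {
    if (--e->pending_acks > 0)
        return 0;
    e->sharers = 1u << e->owner;
    e->state = DIR_MODIFIED;
    return 1;
}
```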

27 Directory-based cache coherence We now have two types of controller, one at each directory and one at each private cache. The complete cache coherence protocol is specified in state diagrams for both controllers. The stable cache states are often MESI, as in a snoopy protocol. There are some complete example protocols available on the wiki (courtesy of Brian Gold). Exercise: try to understand how each of these protocols handles read and write misses.

28 Organising directory information How do we know which directory to send our request to? How is directory state actually stored?

29 Organising directory information Directory schemes may be centralized or distributed; the key questions are how to find the source of directory information and how to locate copies. Flat, memory-based: information about all sharers is stored at the directory using a full bit-vector organization, limited-pointer scheme etc. Flat, cache-based: information is distributed amongst sharers, e.g. the sharers form a linked list (IEEE SCI, Sequent NUMA-Q); typical operations are add to head, remove a node (by contacting neighbours only) and invalidate all nodes (from head). Hierarchical: requests traverse up a tree to find a node with information on the block (we won't discuss). Figure 8.7 (reproduced from the Culler Parallel book).

30 Organising directory information How do we store the list of sharers in a flat, memory-based directory scheme? Full bit-vector: P presence bits, which indicate for each of the P processors whether the processor has a copy of the block. Limited-pointer schemes: maintain a fixed (and limited) number of pointers [Culler p.568]. Typically the number of sharers is small (4 pointers may often suffice), but we need a backup or overflow strategy: overflow to memory, resort to broadcast, or a coarse-vector scheme where each bit represents a group of processors (e.g. SGI Origin). Extract from duplicated L1 tags (reverse-mapped): query a local copy of the tags to find sharers.
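The full bit-vector variant is simple enough to sketch directly in C (for P <= 32; names are illustrative): one presence bit per processor, with the invalidation cost of a write proportional to the number of set bits.

```c
/* Full bit-vector directory entry for up to 32 processors:
 * bit p set => processor p has a copy of the block. */
typedef unsigned sharer_vec;

void add_sharer(sharer_vec *v, int p)    { *v |=  (1u << p); }
void remove_sharer(sharer_vec *v, int p) { *v &= ~(1u << p); }
int  is_sharer(sharer_vec v, int p)      { return (v >> p) & 1u; }

/* Number of invalidations a write must send = number of set bits. */
int invalidations_needed(sharer_vec v) {
    int count = 0;
    while (v) { count += v & 1u; v >>= 1; }
    return count;
}
```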

31 Organising directory information Four examples of how we might store our directory information in a CMP: (1) append state to L2 tags; (2) duplicate L1 tags at the directory; (3) store directory state in main memory and include a directory cache at each node; (4) a hierarchical directory. I assume the L2 is the first shared cache; in a real system this could just as easily be the L3 or the interface to main memory. The directory is placed at the first shared memory regardless of the number of levels of cache.

32 Organising directory information 1. Append state to L2 tags Perhaps conceptually the simplest scheme. Assume a shared, banked, inclusive L2 cache. The location of the directory depends only on the block address; directory state can simply be appended to the L2 cache tags. Reproduced from 'Victim Replication: Maximizing Capacity while Hiding Wire Delay in Tiled Chip Multiprocessors', Zhang/Asanovic, ISCA'05

33 Organising directory information 1. Append state to L2 tags May be expensive in terms of memory: the L2 may contain many more cache lines than can reside in the aggregated L1s (or, on a per-bank basis, than the L1 lines that can map to the L2 bank), so it may be unnecessarily power and area hungry. Doesn't provide support for non-inclusive L2 caches: it assumes the L2 is always caching anything in the L1s, which is problematic if the L2 is small in comparison to the aggregated L1 capacity.

34 Organising directory information 2. Duplicating L1 tags (a reverse-mapped directory CAM) At each directory (e.g. L2 bank): duplicate the L1 tags of those L1 lines that can map to the bank. We can interrogate the duplicated tags to determine the sharers list. At what granularity do we interleave addresses across banks for the directory and L2 cache? Simpler if we interleave the directory and L2 in the same way. What about the impact of granularity on the directory?

35 Organising directory information 2. Duplicating L1 tags In this example precisely one quarter of the L1 lines map to each of the 4 L2 banks

36 Organising directory information 2. Duplicating L1 tags A fine-grain interleaving, as illustrated on the previous slide, means that only a subset of each L1's lines may map to a particular L2 bank. Each directory is organised as s/n sets of n*a ways, where n = no. of processors, a = associativity, s = no. of sets in the L1. If a coarse-grain interleaving is selected (where the L2 bank is selected from bits outside the L1's index bits), any L1 line could map to any L2 bank, hence each directory is organised as s sets of n*a ways.
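The two organisations reduce to simple arithmetic; a sketch with illustrative numbers (n = 4 processors, 4-way L1s with s = 128 sets):

```c
/* Directory sizing from the slide: with n processors and a-way L1s of
 * s sets each, a fine-grain interleave lets only s/n of each L1's sets
 * map to a given bank (s/n sets of n*a ways); a coarse-grain interleave
 * means any L1 line can map to any bank (s sets of n*a ways). */
struct dir_shape { unsigned sets, ways; };

struct dir_shape fine_grain_dir(unsigned s, unsigned n, unsigned a) {
    struct dir_shape d = { s / n, n * a };
    return d;
}

struct dir_shape coarse_grain_dir(unsigned s, unsigned n, unsigned a) {
    struct dir_shape d = { s, n * a };
    return d;
}
```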

37 Organising directory information 2. Duplicating L1 tags Example: Sun Niagara T1. L1 caches are write-through with 16-byte lines; allocate on load, no-allocate on store. The L2 maintains the directory by duplicating L1 tags; the L2 is banked and interleaved at a 64-byte granularity. The number of L1 lines that may map to each L2 bank is much less than the total number of L2 lines in a bank, so duplicating L1 tags saves area and power over adding directory state to each L2 tag.

38 Organising directory information 3. Directory caches Directory state is stored in main memory and cached at each node. Note: the L2 caches are private in this example. Figure reproduced from 'Proximity-Aware Directory-based Coherence for Multi-core Processor Architectures', Brown/Kumar/Tullsen, SPAA'07

39 Organising directory information 3. Directory caches Each tile and corresponding memory channel has access to a different range of physical memory locations. There is only one possible home (location of the associated directory) for each memory block. Two different directories never share directory state, so there are no coherence worries between directory caches! Each cache line in the directory cache may hold state corresponding to multiple contiguous memory blocks to exploit spatial locality (as you would in a normal cache). We typically assign home nodes at a page granularity using a first-touch policy. The cited work uses a 4-way set-associative, 16KB cache at each tile. (The proximity-aware protocol described is able to request data from a nearby sharer if it is not present in the home node's L2.)

40 Organising directory information 4. A hierarchical directory Reproduced from 'A consistency architecture for hierarchical shared caches', Ladan-Mozes/Leiserson, SPAA'08

41 Organising directory information 4. A hierarchical directory Aimed at processors with a large number of cores. The black dots indicate where a particular block may be cached or stored in memory; there is only one place as we move up each level of the tree. Example: if an L3 cache holds write permission for a block (holds the block in state M), it can manage the line in its subtree as if it were main memory. No need to tell its parent! See the paper for details (and proofs). See also the Fractal Coherence paper from MICRO'10.

42 Organising directory information 4. A hierarchical directory Less extreme examples of hierarchical schemes are common, where larger-scale machines exploit bus-based first-level coherence (commodity hardware) and a directory protocol at the second level. In such schemes a bridge between the two protocols monitors activity on the bus and, when necessary, intervenes to ensure coherence actions are handled at the second level (removing the transaction from the bus, completing the coherence actions at the 2nd level and then replaying the request on the bus).

43 Sharing patterns Invalidation frequency and size distribution How many writes require copies in other caches to be invalidated (invalidating writes)? I.e. the local private cache does not already hold the block in the M state. What is the distribution of the number of invalidations (sharers) required upon these writes?

44 Sharing patterns [Histograms of invalidation patterns for Barnes-Hut and Radiosity: number of invalidations per invalidating write.] See Culler p.574 for more (assumes infinitely large private caches).

45 Sharing patterns Read-only: no invalidating writes. Producer-consumer: processor A writes, then one or more processors read the data, then processor A writes again, the data is read again, and so on. Invalidation size is often 1, all, or a few. This categorization is originally from 'Cache invalidation patterns in shared memory multiprocessors', Gupta/Weber. See also Culler.

46 Sharing patterns Migratory: data migrates from one processor to another, often being read as well as written along the way. Invalidation size = 1; only the previous writer has a copy (it invalidated the previous copy). Irregular read-write: irregular/unpredictable read/write access patterns; invalidation size is normally concentrated around the small end of the spectrum.

47 Protocol optimisations Goals? Performance, power, complexity and area! Aim to lower the average memory access time. If we look at the protocol in isolation, the typical approach is to: (1) reduce the number of network transactions; (2) reduce the number of transactions on the critical path of the processor. Culler Section 8.4.1

48 Protocol optimisations Let's look again at the simple protocol we introduced in slides 24/25. In the case of a read miss to a block in a modified state in another cache we required 5 transactions in total, 4 of them on the critical path. Let's look at forwarding as a protocol optimisation. An intervention here is just like a request, but issued to a cache in reaction to a request.

49 Directory-based cache coherence Read miss to a block in a modified state in a cache (Culler, Fig. 8.5)

50 Directory-based cache coherence (a) Strict request-reply: 1: req, 2: reply, 3: intervention, 4a: revise, 4b: response. (b) Intervention forwarding: 1: req, 2: intervention, 3: response, 4: reply. (c) Reply forwarding: 1: req, 2: intervention, 3a: revise, 3b: response. (L = local node, H = home node, R = remote node.) Culler

51 Protocol optimisations Other possible ways improvements can be made: optimise the protocol for common sharing patterns, e.g. producer-consumer and migratory; exploit a particular network topology or hierarchical directory structure, perhaps with multiple networks tuned to different types of traffic; exploit locality (in a physical sense), obtaining the required data via a cache-to-cache transfer from the nearest sharer or an immediate neighbour; perform speculative transactions to accelerate acquisition of permissions or data; compiler assistance.

52 Correctness Directory protocols can quickly become very complicated. Timeouts, retries and negative acknowledgements have all been used in different protocols to avoid deadlock and livelock issues (and guarantee forward progress).


More information

Scalable Cache Coherence

Scalable Cache Coherence Scalable Cache Coherence [ 8.1] All of the cache-coherent systems we have talked about until now have had a bus. Not only does the bus guarantee serialization of transactions; it also serves as convenient

More information

Cache Coherence in Scalable Machines

Cache Coherence in Scalable Machines Cache Coherence in Scalable Machines COE 502 arallel rocessing Architectures rof. Muhamed Mudawar Computer Engineering Department King Fahd University of etroleum and Minerals Generic Scalable Multiprocessor

More information

EECS 570 Final Exam - SOLUTIONS Winter 2015

EECS 570 Final Exam - SOLUTIONS Winter 2015 EECS 570 Final Exam - SOLUTIONS Winter 2015 Name: unique name: Sign the honor code: I have neither given nor received aid on this exam nor observed anyone else doing so. Scores: # Points 1 / 21 2 / 32

More information

Computer Science 146. Computer Architecture

Computer Science 146. Computer Architecture Computer Architecture Spring 24 Harvard University Instructor: Prof. dbrooks@eecs.harvard.edu Lecture 2: More Multiprocessors Computation Taxonomy SISD SIMD MISD MIMD ILP Vectors, MM-ISAs Shared Memory

More information

1. Memory technology & Hierarchy

1. Memory technology & Hierarchy 1. Memory technology & Hierarchy Back to caching... Advances in Computer Architecture Andy D. Pimentel Caches in a multi-processor context Dealing with concurrent updates Multiprocessor architecture In

More information

Cache Coherence. CMU : Parallel Computer Architecture and Programming (Spring 2012)

Cache Coherence. CMU : Parallel Computer Architecture and Programming (Spring 2012) Cache Coherence CMU 15-418: Parallel Computer Architecture and Programming (Spring 2012) Shared memory multi-processor Processors read and write to shared variables - More precisely: processors issues

More information

Chapter 9 Multiprocessors

Chapter 9 Multiprocessors ECE200 Computer Organization Chapter 9 Multiprocessors David H. lbonesi and the University of Rochester Henk Corporaal, TU Eindhoven, Netherlands Jari Nurmi, Tampere University of Technology, Finland University

More information

Page 1. Lecture 12: Multiprocessor 2: Snooping Protocol, Directory Protocol, Synchronization, Consistency. Bus Snooping Topology

Page 1. Lecture 12: Multiprocessor 2: Snooping Protocol, Directory Protocol, Synchronization, Consistency. Bus Snooping Topology CS252 Graduate Computer Architecture Lecture 12: Multiprocessor 2: Snooping Protocol, Directory Protocol, Synchronization, Consistency Review: Multiprocessor Basic issues and terminology Communication:

More information

Introduction. Coherency vs Consistency. Lec-11. Multi-Threading Concepts: Coherency, Consistency, and Synchronization

Introduction. Coherency vs Consistency. Lec-11. Multi-Threading Concepts: Coherency, Consistency, and Synchronization Lec-11 Multi-Threading Concepts: Coherency, Consistency, and Synchronization Coherency vs Consistency Memory coherency and consistency are major concerns in the design of shared-memory systems. Consistency

More information

CSE502: Computer Architecture CSE 502: Computer Architecture

CSE502: Computer Architecture CSE 502: Computer Architecture CSE 502: Computer Architecture Shared-Memory Multi-Processors Shared-Memory Multiprocessors Multiple threads use shared memory (address space) SysV Shared Memory or Threads in software Communication implicit

More information

Cache Coherence Protocols: Implementation Issues on SMP s. Cache Coherence Issue in I/O

Cache Coherence Protocols: Implementation Issues on SMP s. Cache Coherence Issue in I/O 6.823, L21--1 Cache Coherence Protocols: Implementation Issues on SMP s Laboratory for Computer Science M.I.T. http://www.csg.lcs.mit.edu/6.823 Cache Coherence Issue in I/O 6.823, L21--2 Processor Processor

More information

EN2910A: Advanced Computer Architecture Topic 05: Coherency of Memory Hierarchy Prof. Sherief Reda School of Engineering Brown University

EN2910A: Advanced Computer Architecture Topic 05: Coherency of Memory Hierarchy Prof. Sherief Reda School of Engineering Brown University EN2910A: Advanced Computer Architecture Topic 05: Coherency of Memory Hierarchy Prof. Sherief Reda School of Engineering Brown University Material from: Parallel Computer Organization and Design by Debois,

More information

Computer Architecture

Computer Architecture 18-447 Computer Architecture CSCI-564 Advanced Computer Architecture Lecture 29: Consistency & Coherence Lecture 20: Consistency and Coherence Bo Wu Prof. Onur Mutlu Colorado Carnegie School Mellon University

More information

CS 152 Computer Architecture and Engineering

CS 152 Computer Architecture and Engineering CS 152 Computer Architecture and Engineering Lecture 18: Directory-Based Cache Protocols John Wawrzynek EECS, University of California at Berkeley http://inst.eecs.berkeley.edu/~cs152 Administrivia 2 Recap:

More information

Overview: Shared Memory Hardware. Shared Address Space Systems. Shared Address Space and Shared Memory Computers. Shared Memory Hardware

Overview: Shared Memory Hardware. Shared Address Space Systems. Shared Address Space and Shared Memory Computers. Shared Memory Hardware Overview: Shared Memory Hardware Shared Address Space Systems overview of shared address space systems example: cache hierarchy of the Intel Core i7 cache coherency protocols: basic ideas, invalidate and

More information

Overview: Shared Memory Hardware

Overview: Shared Memory Hardware Overview: Shared Memory Hardware overview of shared address space systems example: cache hierarchy of the Intel Core i7 cache coherency protocols: basic ideas, invalidate and update protocols false sharing

More information

Scalable Cache Coherent Systems

Scalable Cache Coherent Systems NUM SS Scalable ache oherent Systems Scalable distributed shared memory machines ssumptions: rocessor-ache-memory nodes connected by scalable network. Distributed shared physical address space. ommunication

More information

Lecture: Transactional Memory, Networks. Topics: TM implementations, on-chip networks

Lecture: Transactional Memory, Networks. Topics: TM implementations, on-chip networks Lecture: Transactional Memory, Networks Topics: TM implementations, on-chip networks 1 Summary of TM Benefits As easy to program as coarse-grain locks Performance similar to fine-grain locks Avoids deadlock

More information

Cache Coherence: Part II Scalable Approaches

Cache Coherence: Part II Scalable Approaches ache oherence: art II Scalable pproaches Hierarchical ache oherence Todd. Mowry S 74 October 27, 2 (a) 1 2 1 2 (b) 1 Topics Hierarchies Directory rotocols Hierarchies arise in different ways: (a) processor

More information

Parallel Computer Architecture Lecture 5: Cache Coherence. Chris Craik (TA) Carnegie Mellon University

Parallel Computer Architecture Lecture 5: Cache Coherence. Chris Craik (TA) Carnegie Mellon University 18-742 Parallel Computer Architecture Lecture 5: Cache Coherence Chris Craik (TA) Carnegie Mellon University Readings: Coherence Required for Review Papamarcos and Patel, A low-overhead coherence solution

More information

The need for atomicity This code sequence illustrates the need for atomicity. Explain.

The need for atomicity This code sequence illustrates the need for atomicity. Explain. Lock Implementations [ 8.1] Recall the three kinds of synchronization from Lecture 6: Point-to-point Lock Performance metrics for lock implementations Uncontended latency Traffic o Time to acquire a lock

More information

Cache Coherence. (Architectural Supports for Efficient Shared Memory) Mainak Chaudhuri

Cache Coherence. (Architectural Supports for Efficient Shared Memory) Mainak Chaudhuri Cache Coherence (Architectural Supports for Efficient Shared Memory) Mainak Chaudhuri mainakc@cse.iitk.ac.in 1 Setting Agenda Software: shared address space Hardware: shared memory multiprocessors Cache

More information

NOW Handout Page 1. Context for Scalable Cache Coherence. Cache Coherence in Scalable Machines. A Cache Coherent System Must:

NOW Handout Page 1. Context for Scalable Cache Coherence. Cache Coherence in Scalable Machines. A Cache Coherent System Must: ontext for Scalable ache oherence ache oherence in Scalable Machines Realizing gm Models through net transaction protocols - efficient node-to-net interface - interprets transactions Switch Scalable network

More information

Lecture 3: Directory Protocol Implementations. Topics: coherence vs. msg-passing, corner cases in directory protocols

Lecture 3: Directory Protocol Implementations. Topics: coherence vs. msg-passing, corner cases in directory protocols Lecture 3: Directory Protocol Implementations Topics: coherence vs. msg-passing, corner cases in directory protocols 1 Future Scalable Designs Intel s Single Cloud Computer (SCC): an example prototype

More information

Lect. 6: Directory Coherence Protocol

Lect. 6: Directory Coherence Protocol Lect. 6: Directory Coherence Protocol Snooping coherence Global state of a memory line is the collection of its state in all caches, and there is no summary state anywhere All cache controllers monitor

More information

Special Topics. Module 14: "Directory-based Cache Coherence" Lecture 33: "SCI Protocol" Directory-based Cache Coherence: Sequent NUMA-Q.

Special Topics. Module 14: Directory-based Cache Coherence Lecture 33: SCI Protocol Directory-based Cache Coherence: Sequent NUMA-Q. Directory-based Cache Coherence: Special Topics Sequent NUMA-Q SCI protocol Directory overhead Cache overhead Handling read miss Handling write miss Handling writebacks Roll-out protocol Snoop interaction

More information

Flynn s Classification

Flynn s Classification Flynn s Classification SISD (Single Instruction Single Data) Uniprocessors MISD (Multiple Instruction Single Data) No machine is built yet for this type SIMD (Single Instruction Multiple Data) Examples:

More information

Cache Coherence. Todd C. Mowry CS 740 November 10, Topics. The Cache Coherence Problem Snoopy Protocols Directory Protocols

Cache Coherence. Todd C. Mowry CS 740 November 10, Topics. The Cache Coherence Problem Snoopy Protocols Directory Protocols Cache Coherence Todd C. Mowry CS 740 November 10, 1998 Topics The Cache Coherence roblem Snoopy rotocols Directory rotocols The Cache Coherence roblem Caches are critical to modern high-speed processors

More information

Why memory hierarchy? Memory hierarchy. Memory hierarchy goals. CS2410: Computer Architecture. L1 cache design. Sangyeun Cho

Why memory hierarchy? Memory hierarchy. Memory hierarchy goals. CS2410: Computer Architecture. L1 cache design. Sangyeun Cho Why memory hierarchy? L1 cache design Sangyeun Cho Computer Science Department Memory hierarchy Memory hierarchy goals Smaller Faster More expensive per byte CPU Regs L1 cache L2 cache SRAM SRAM To provide

More information

Parallel Computer Architecture Spring Distributed Shared Memory Architectures & Directory-Based Memory Coherence

Parallel Computer Architecture Spring Distributed Shared Memory Architectures & Directory-Based Memory Coherence Parallel Computer Architecture Spring 2018 Distributed Shared Memory Architectures & Directory-Based Memory Coherence Nikos Bellas Computer and Communications Engineering Department University of Thessaly

More information

A Scalable SAS Machine

A Scalable SAS Machine arallel omputer Organization and Design : Lecture 8 er Stenström. 2008, Sally. ckee 2009 Scalable ache oherence Design principles of scalable cache protocols Overview of design space (8.1) Basic operation

More information

Chapter 8. Multiprocessors. In-Cheol Park Dept. of EE, KAIST

Chapter 8. Multiprocessors. In-Cheol Park Dept. of EE, KAIST Chapter 8. Multiprocessors In-Cheol Park Dept. of EE, KAIST Can the rapid rate of uniprocessor performance growth be sustained indefinitely? If the pace does slow down, multiprocessor architectures will

More information

A Basic Snooping-Based Multi-Processor Implementation

A Basic Snooping-Based Multi-Processor Implementation Lecture 15: A Basic Snooping-Based Multi-Processor Implementation Parallel Computer Architecture and Programming CMU 15-418/15-618, Spring 2015 Tunes Pushing On (Oliver $ & Jimi Jules) Time for the second

More information

Scalable Multiprocessors

Scalable Multiprocessors Scalable Multiprocessors [ 11.1] scalable system is one in which resources can be added to the system without reaching a hard limit. Of course, there may still be economic limits. s the size of the system

More information

Shared Memory Multiprocessors

Shared Memory Multiprocessors Parallel Computing Shared Memory Multiprocessors Hwansoo Han Cache Coherence Problem P 0 P 1 P 2 cache load r1 (100) load r1 (100) r1 =? r1 =? 4 cache 5 cache store b (100) 3 100: a 100: a 1 Memory 2 I/O

More information

Multiprocessors and Locking

Multiprocessors and Locking Types of Multiprocessors (MPs) Uniform memory-access (UMA) MP Access to all memory occurs at the same speed for all processors. Multiprocessors and Locking COMP9242 2008/S2 Week 12 Part 1 Non-uniform memory-access

More information

EN2910A: Advanced Computer Architecture Topic 05: Coherency of Memory Hierarchy

EN2910A: Advanced Computer Architecture Topic 05: Coherency of Memory Hierarchy EN2910A: Advanced Computer Architecture Topic 05: Coherency of Memory Hierarchy Prof. Sherief Reda School of Engineering Brown University Material from: Parallel Computer Organization and Design by Debois,

More information

Lecture 4: Directory Protocols and TM. Topics: corner cases in directory protocols, lazy TM

Lecture 4: Directory Protocols and TM. Topics: corner cases in directory protocols, lazy TM Lecture 4: Directory Protocols and TM Topics: corner cases in directory protocols, lazy TM 1 Handling Reads When the home receives a read request, it looks up memory (speculative read) and directory in

More information

Computer Architecture

Computer Architecture Jens Teubner Computer Architecture Summer 2016 1 Computer Architecture Jens Teubner, TU Dortmund jens.teubner@cs.tu-dortmund.de Summer 2016 Jens Teubner Computer Architecture Summer 2016 83 Part III Multi-Core

More information

Speculative Locks. Dept. of Computer Science

Speculative Locks. Dept. of Computer Science Speculative Locks José éf. Martínez and djosep Torrellas Dept. of Computer Science University it of Illinois i at Urbana-Champaign Motivation Lock granularity a trade-off: Fine grain greater concurrency

More information

Module 9: Addendum to Module 6: Shared Memory Multiprocessors Lecture 17: Multiprocessor Organizations and Cache Coherence. The Lecture Contains:

Module 9: Addendum to Module 6: Shared Memory Multiprocessors Lecture 17: Multiprocessor Organizations and Cache Coherence. The Lecture Contains: The Lecture Contains: Shared Memory Multiprocessors Shared Cache Private Cache/Dancehall Distributed Shared Memory Shared vs. Private in CMPs Cache Coherence Cache Coherence: Example What Went Wrong? Implementations

More information

CMSC Computer Architecture Lecture 15: Memory Consistency and Synchronization. Prof. Yanjing Li University of Chicago

CMSC Computer Architecture Lecture 15: Memory Consistency and Synchronization. Prof. Yanjing Li University of Chicago CMSC 22200 Computer Architecture Lecture 15: Memory Consistency and Synchronization Prof. Yanjing Li University of Chicago Administrative Stuff! Lab 5 (multi-core) " Basic requirements: out later today

More information

Lecture: Consistency Models, TM

Lecture: Consistency Models, TM Lecture: Consistency Models, TM Topics: consistency models, TM intro (Section 5.6) No class on Monday (please watch TM videos) Wednesday: TM wrap-up, interconnection networks 1 Coherence Vs. Consistency

More information

MULTIPROCESSORS AND THREAD-LEVEL. B649 Parallel Architectures and Programming

MULTIPROCESSORS AND THREAD-LEVEL. B649 Parallel Architectures and Programming MULTIPROCESSORS AND THREAD-LEVEL PARALLELISM B649 Parallel Architectures and Programming Motivation behind Multiprocessors Limitations of ILP (as already discussed) Growing interest in servers and server-performance

More information

CS 152 Computer Architecture and Engineering. Lecture 19: Directory-Based Cache Protocols

CS 152 Computer Architecture and Engineering. Lecture 19: Directory-Based Cache Protocols CS 152 Computer Architecture and Engineering Lecture 19: Directory-Based Cache Protocols Krste Asanovic Electrical Engineering and Computer Sciences University of California, Berkeley http://www.eecs.berkeley.edu/~krste

More information

M4 Parallelism. Implementation of Locks Cache Coherence

M4 Parallelism. Implementation of Locks Cache Coherence M4 Parallelism Implementation of Locks Cache Coherence Outline Parallelism Flynn s classification Vector Processing Subword Parallelism Symmetric Multiprocessors, Distributed Memory Machines Shared Memory

More information

MULTIPROCESSORS AND THREAD-LEVEL PARALLELISM. B649 Parallel Architectures and Programming

MULTIPROCESSORS AND THREAD-LEVEL PARALLELISM. B649 Parallel Architectures and Programming MULTIPROCESSORS AND THREAD-LEVEL PARALLELISM B649 Parallel Architectures and Programming Motivation behind Multiprocessors Limitations of ILP (as already discussed) Growing interest in servers and server-performance

More information

Computer Architecture and Engineering CS152 Quiz #5 April 27th, 2016 Professor George Michelogiannakis Name: <ANSWER KEY>

Computer Architecture and Engineering CS152 Quiz #5 April 27th, 2016 Professor George Michelogiannakis Name: <ANSWER KEY> Computer Architecture and Engineering CS152 Quiz #5 April 27th, 2016 Professor George Michelogiannakis Name: This is a closed book, closed notes exam. 80 Minutes 19 pages Notes: Not all questions

More information

EN164: Design of Computing Systems Lecture 34: Misc Multi-cores and Multi-processors

EN164: Design of Computing Systems Lecture 34: Misc Multi-cores and Multi-processors EN164: Design of Computing Systems Lecture 34: Misc Multi-cores and Multi-processors Professor Sherief Reda http://scale.engin.brown.edu Electrical Sciences and Computer Engineering School of Engineering

More information

Role of Synchronization. CS 258 Parallel Computer Architecture Lecture 23. Hardware-Software Trade-offs in Synchronization and Data Layout

Role of Synchronization. CS 258 Parallel Computer Architecture Lecture 23. Hardware-Software Trade-offs in Synchronization and Data Layout CS 28 Parallel Computer Architecture Lecture 23 Hardware-Software Trade-offs in Synchronization and Data Layout April 21, 2008 Prof John D. Kubiatowicz http://www.cs.berkeley.edu/~kubitron/cs28 Role of

More information

Speculative Lock Elision: Enabling Highly Concurrent Multithreaded Execution

Speculative Lock Elision: Enabling Highly Concurrent Multithreaded Execution Speculative Lock Elision: Enabling Highly Concurrent Multithreaded Execution Ravi Rajwar and Jim Goodman University of Wisconsin-Madison International Symposium on Microarchitecture, Dec. 2001 Funding

More information

Chapter 5. Thread-Level Parallelism

Chapter 5. Thread-Level Parallelism Chapter 5 Thread-Level Parallelism Instructor: Josep Torrellas CS433 Copyright Josep Torrellas 1999, 2001, 2002, 2013 1 Progress Towards Multiprocessors + Rate of speed growth in uniprocessors saturated

More information

Foundations of Computer Systems

Foundations of Computer Systems 18-600 Foundations of Computer Systems Lecture 21: Multicore Cache Coherence John P. Shen & Zhiyi Yu November 14, 2016 Prevalence of multicore processors: 2006: 75% for desktops, 85% for servers 2007:

More information

CMSC 611: Advanced. Distributed & Shared Memory

CMSC 611: Advanced. Distributed & Shared Memory CMSC 611: Advanced Computer Architecture Distributed & Shared Memory Centralized Shared Memory MIMD Processors share a single centralized memory through a bus interconnect Feasible for small processor

More information

SHARED-MEMORY COMMUNICATION

SHARED-MEMORY COMMUNICATION SHARED-MEMORY COMMUNICATION IMPLICITELY VIA MEMORY PROCESSORS SHARE SOME MEMORY COMMUNICATION IS IMPLICIT THROUGH LOADS AND STORES NEED TO SYNCHRONIZE NEED TO KNOW HOW THE HARDWARE INTERLEAVES ACCESSES

More information

Fall 2012 EE 6633: Architecture of Parallel Computers Lecture 4: Shared Address Multiprocessors Acknowledgement: Dave Patterson, UC Berkeley

Fall 2012 EE 6633: Architecture of Parallel Computers Lecture 4: Shared Address Multiprocessors Acknowledgement: Dave Patterson, UC Berkeley Fall 2012 EE 6633: Architecture of Parallel Computers Lecture 4: Shared Address Multiprocessors Acknowledgement: Dave Patterson, UC Berkeley Avinash Kodi Department of Electrical Engineering & Computer

More information

Parallel Computers. CPE 631 Session 20: Multiprocessors. Flynn s Tahonomy (1972) Why Multiprocessors?

Parallel Computers. CPE 631 Session 20: Multiprocessors. Flynn s Tahonomy (1972) Why Multiprocessors? Parallel Computers CPE 63 Session 20: Multiprocessors Department of Electrical and Computer Engineering University of Alabama in Huntsville Definition: A parallel computer is a collection of processing

More information

... The Composibility Question. Composing Scalability and Node Design in CC-NUMA. Commodity CC node approach. Get the node right Approach: Origin

... The Composibility Question. Composing Scalability and Node Design in CC-NUMA. Commodity CC node approach. Get the node right Approach: Origin The Composibility Question Composing Scalability and Node Design in CC-NUMA CS 28, Spring 99 David E. Culler Computer Science Division U.C. Berkeley adapter Sweet Spot Node Scalable (Intelligent) Interconnect

More information

Module 9: "Introduction to Shared Memory Multiprocessors" Lecture 16: "Multiprocessor Organizations and Cache Coherence" Shared Memory Multiprocessors

Module 9: Introduction to Shared Memory Multiprocessors Lecture 16: Multiprocessor Organizations and Cache Coherence Shared Memory Multiprocessors Shared Memory Multiprocessors Shared memory multiprocessors Shared cache Private cache/dancehall Distributed shared memory Shared vs. private in CMPs Cache coherence Cache coherence: Example What went

More information

Recall: Sequential Consistency Example. Implications for Implementation. Issues for Directory Protocols

Recall: Sequential Consistency Example. Implications for Implementation. Issues for Directory Protocols ecall: Sequential onsistency Example S252 Graduate omputer rchitecture Lecture 21 pril 14 th, 2010 Distributed Shared ory rof John D. Kubiatowicz http://www.cs.berkeley.edu/~kubitron/cs252 rocessor 1 rocessor

More information

CS 152 Computer Architecture and Engineering. Lecture 19: Directory-Based Cache Protocols

CS 152 Computer Architecture and Engineering. Lecture 19: Directory-Based Cache Protocols CS 152 Computer Architecture and Engineering Lecture 19: Directory-Based Protocols Dr. George Michelogiannakis EECS, University of California at Berkeley CRD, Lawrence Berkeley National Laboratory http://inst.eecs.berkeley.edu/~cs152

More information

Multiple Issue and Static Scheduling. Multiple Issue. MSc Informatics Eng. Beyond Instruction-Level Parallelism

Multiple Issue and Static Scheduling. Multiple Issue. MSc Informatics Eng. Beyond Instruction-Level Parallelism Computing Systems & Performance Beyond Instruction-Level Parallelism MSc Informatics Eng. 2012/13 A.J.Proença From ILP to Multithreading and Shared Cache (most slides are borrowed) When exploiting ILP,

More information

COEN-4730 Computer Architecture Lecture 08 Thread Level Parallelism and Coherence

COEN-4730 Computer Architecture Lecture 08 Thread Level Parallelism and Coherence 1 COEN-4730 Computer Architecture Lecture 08 Thread Level Parallelism and Coherence Cristinel Ababei Dept. of Electrical and Computer Engineering Marquette University Credits: Slides adapted from presentations

More information

Consistency & Coherence. 4/14/2016 Sec5on 12 Colin Schmidt

Consistency & Coherence. 4/14/2016 Sec5on 12 Colin Schmidt Consistency & Coherence 4/14/2016 Sec5on 12 Colin Schmidt Agenda Brief mo5va5on Consistency vs Coherence Synchroniza5on Fences Mutexs, locks, semaphores Hardware Coherence Snoopy MSI, MESI Power, Frequency,

More information

Data/Thread Level Speculation (TLS) in the Stanford Hydra Chip Multiprocessor (CMP)

Data/Thread Level Speculation (TLS) in the Stanford Hydra Chip Multiprocessor (CMP) Data/Thread Level Speculation (TLS) in the Stanford Hydra Chip Multiprocessor (CMP) Hydra is a 4-core Chip Multiprocessor (CMP) based microarchitecture/compiler effort at Stanford that provides hardware/software

More information

Scientific Applications. Chao Sun

Scientific Applications. Chao Sun Large Scale Multiprocessors And Scientific Applications Zhou Li Chao Sun Contents Introduction Interprocessor Communication: The Critical Performance Issue Characteristics of Scientific Applications Synchronization:

More information

Lecture: Coherence, Synchronization. Topics: directory-based coherence, synchronization primitives (Sections )

Lecture: Coherence, Synchronization. Topics: directory-based coherence, synchronization primitives (Sections ) Lecture: Coherence, Synchronization Topics: directory-based coherence, synchronization primitives (Sections 5.1-5.5) 1 Cache Coherence Protocols Directory-based: A single location (directory) keeps track

More information