5 Chip Multiprocessors (II) Chip Multiprocessors (ACS MPhil) Robert Mullins


2 Overview Synchronization hardware primitives Cache Coherency Issues Coherence misses, false sharing Cache coherence and interconnects Directory-based Coherency Protocols Introduction, organising directories in a CMP, sharing patterns and protocol optimisations, correctness, broadcasting over unordered interconnects Chip Multiprocessors (ACS MPhil) 2

3 Synchronization The lock problem The lock is supposed to provide atomicity for critical sections. Unfortunately, as implemented, this lock lacks atomicity in its own implementation: multiple processors could read the lock as free and progress past the branch simultaneously.
lock:   ld reg, lock-addr
        cmp reg, #0
        bnz lock
        st lock-addr, #1
        ret
unlock: st lock-addr, #0
        ret
Culler p.338 Chip Multiprocessors (ACS MPhil) 3

4 Synchronization Test and Set Executes the following atomically: reg = m[lock-addr]; m[lock-addr] = 1. The branch ensures that if the lock was already taken we try again. A more general, but similar, instruction is swap: reg1 = m[lock-addr]; m[lock-addr] = reg2.
lock:   t&s reg, lock-addr
        bnz reg, lock
        ret
unlock: st lock-addr, #0
        ret
Chip Multiprocessors (ACS MPhil) 4
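A minimal sketch (not from the slides) of a test&set spin-lock in Python. A `threading.Lock` stands in for the hardware's guarantee that the read and write of the lock word execute as one atomic read-modify-write; the class and names are invented for illustration.

```python
import threading

class TestAndSetLock:
    def __init__(self):
        self._mem = 0                      # m[lock-addr]
        self._atomic = threading.Lock()    # stands in for the atomic RMW cycle

    def test_and_set(self):
        # reg = m[lock-addr]; m[lock-addr] = 1, executed atomically
        with self._atomic:
            old, self._mem = self._mem, 1
            return old

    def acquire(self):
        # lock: t&s reg, lock-addr ; bnz reg, lock
        while self.test_and_set() != 0:
            pass                           # spin until we saw the lock free

    def release(self):
        # unlock: st lock-addr, #0
        self._mem = 0

counter = 0
lock = TestAndSetLock()

def worker():
    global counter
    for _ in range(2500):
        lock.acquire()
        counter += 1                       # critical section
        lock.release()

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter)                             # 10000: no increments were lost
```

Without the lock (or with the broken lock from the previous slide) increments could be lost; with an atomic test&set the final count is always 4 x 2500.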

5 Synchronization We could implement test&set with two bus transactions: a read and a write transaction. We could lock down the bus for these two cycles to ensure the sequence is atomic. This is more difficult with a split-transaction bus (performance and deadlock issues). Chip Multiprocessors (ACS MPhil) 5 Culler p.391

6 Synchronization If we assume an invalidation-based CC protocol with a WB cache, a better approach is to: Issue a read exclusive (BusRdX) transaction then perform the read and write (in the cache) without giving up ownership Any incoming requests to the block are buffered until the data is written in the cache Any other processors are forced to wait Chip Multiprocessors (ACS MPhil) 6

7 Synchronization Other common synchronization instructions: swap, fetch&op (fetch&inc, fetch&add), compare&swap. Many x86 instructions can be prefixed with the lock modifier to make them atomic. A simpler general-purpose solution? Chip Multiprocessors (ACS MPhil) 7

8 Synchronization LL/SC Load-Linked (LL): read memory, set the lock flag and put the address in the lock register. Intervening writes to the address in the lock register will cause the lock flag to be reset. Store-Conditional (SC): check the lock flag to ensure an intervening conflicting write has not occurred; if the lock flag is not set, SC will fail.
if (atomic_update) then mem[addr]=rt, rt=1 else rt=0
Chip Multiprocessors (ACS MPhil) 8
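The lock-flag behaviour can be modelled with a per-address version counter: any intervening store bumps the version, so a later SC detects the conflict and fails. A hypothetical sketch (class and method names are invented):

```python
class LLSCMemory:
    def __init__(self):
        self.mem = {}
        self.version = {}

    def store(self, addr, value):
        # an ordinary store: breaks any outstanding link to this address
        self.mem[addr] = value
        self.version[addr] = self.version.get(addr, 0) + 1

    def load_linked(self, addr):
        # returns (value, link); link records the version we observed
        return self.mem.get(addr, 0), self.version.get(addr, 0)

    def store_conditional(self, addr, value, link):
        # succeeds only if no write intervened since the LL
        if self.version.get(addr, 0) != link:
            return False                   # rt = 0: SC failed
        self.store(addr, value)
        return True                        # rt = 1: SC succeeded

m = LLSCMemory()
val, link = m.load_linked(0x100)           # LL
assert m.store_conditional(0x100, 1, link)      # no intervening write: succeeds

val, link = m.load_linked(0x100)
m.store(0x100, 7)                          # intervening write by another CPU
assert not m.store_conditional(0x100, 1, link)  # SC fails, memory unchanged
```

The failed SC leaves the other processor's value (7) in place, which is exactly the property the lock acquisition loop on the next slide relies on.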

9 Synchronization
        reg2=1
lock:   ll reg1, lock-addr
        bnz reg1, lock      ; lock already taken?
        sc lock-addr, reg2
        beqz reg2, lock     ; if SC failed (reg2=0) goto lock
        ret
unlock: st lock-addr, #0
        ret
Chip Multiprocessors (ACS MPhil) 9

10 Synchronization This SC will fail as the lock flag will be reset by the store from P2 Culler p.391 Chip Multiprocessors (ACS MPhil) 10

11 Synchronization LL/SC can be implemented using the CC protocol: LL loads the cache line with write permission (issues BusRdX, holds the line in state M); SC only succeeds if the cache line is still in state M, otherwise it fails. Implementations often come with caveats: SC may experience spurious failures, e.g. due to context switches and TLB misses; restrictions to prevent the cache line (holding the lock variable) from being replaced; memory-referencing instructions may be disallowed between LL and SC; out-of-order execution may be prohibited between LL and SC. Chip Multiprocessors (ACS MPhil) 11

12 Coherence misses Remember your 3 C's! Compulsory: cold-start or first-reference misses. Capacity: the cache is not large enough to store all the blocks needed during the execution of the program. Conflict (or collision): misses that occur due to direct-mapped or set-associative block placement strategies. Coherence: misses that arise due to interprocessor communication. Chip Multiprocessors (ACS MPhil) 12

13 True sharing A block typically contains many words (e.g. 4-8); coherency is maintained at the granularity of cache blocks. True sharing misses are misses that arise from the communication of data: the first write to a shared block causes an invalidate, and a subsequent read of the block by another processor will also cause a miss. Both these misses are classified as true sharing. Chip Multiprocessors (ACS MPhil) 13

14 False sharing False sharing miss Different processors are writing and reading different words in a block, but no communication is taking place. e.g. a block may contain words X and Y; P1 repeatedly writes to X, P2 repeatedly writes to Y. The block will be repeatedly invalidated (leading to cache misses) even though no communication is taking place. These are false misses and are due to the fact that the block contains multiple words; they would not occur if the block size were a single word. Chip Multiprocessors (ACS MPhil) 14
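The effect of block size on false sharing can be made concrete with a small counting sketch (hypothetical, not from the slides): a single-writer model in which a write to a block owned by another processor costs one invalidation.

```python
def count_invalidations(writes, block_size):
    # writes: list of (processor, word_address); one writer per block at a time
    owner = {}                 # block -> processor currently holding it modified
    invalidations = 0
    for proc, word in writes:
        block = word // block_size
        if owner.get(block) not in (None, proc):
            invalidations += 1         # must invalidate the other copy first
        owner[block] = proc
    return invalidations

# P1 hammers word 0 (X), P2 hammers word 1 (Y) in the same 2-word block
trace = [("P1", 0), ("P2", 1)] * 4
print(count_invalidations(trace, block_size=2))  # 7: the line ping-pongs
print(count_invalidations(trace, block_size=1))  # 0: no sharing at all
```

With a 2-word block every write after the first invalidates the other processor's copy, even though P1 and P2 never exchange data; with 1-word blocks the misses vanish.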

15 Cache coherence and interconnects Broadcast-based snoopy protocols Discussed in Seminar 4. These protocols rely on bus-based interconnects. Buses have limited scalability: energy and bandwidth implications of broadcasting. They permit direct cache-to-cache transfers, giving low-latency communication in 2 hops (1. broadcast, 2. receive data from remote cache). Very useful for applications with lots of fine-grain sharing. Chip Multiprocessors (ACS MPhil) 15

16 Cache coherence and interconnects Totally-ordered interconnects All messages are delivered to all destinations in the same order. Totally-ordered interconnects often employ a centralised arbiter or switch e.g. a bus or pipelined broadcast tree Traditional snoopy protocols are built around the concept of a bus (or virtual bus): (1) Broadcast - All transactions are visible to all components connected to the bus (2) The interconnect provides a total order of messages Chip Multiprocessors (ACS MPhil) 16

17 Cache coherence and interconnects A pipelined broadcast tree is sufficiently similar to a bus to support traditional snooping protocols. [Reproduced from Milo Martin's PhD thesis (Wisconsin)] The centralised switch guarantees a total ordering of messages. Chip Multiprocessors (ACS MPhil) 17

18 Cache coherence and interconnects Unordered interconnects Networks (e.g. mesh, torus) can't typically provide strong ordering guarantees, i.e. nodes don't perceive transactions in a single global order. Point-to-point ordering Networks may be able to ensure messages sent between a pair of nodes are guaranteed not to be reordered. e.g. a mesh with a single VC and XY routing Chip Multiprocessors (ACS MPhil) 18

19 Directory-based cache coherence In a snoopy protocol the state of the blocks in each cache is maintained by broadcasting all memory operations on the bus. We want to avoid the need to broadcast, so we maintain the state of each block explicitly. We store this information in the directory. Requests can be made to the appropriate directory entry to read or write a particular block; the directory orchestrates the actions necessary to satisfy the request. Chip Multiprocessors (ACS MPhil) 19

20 Directory-based cache coherence The directory provides a per-block ordering point to resolve races All requests for a particular block are made to the same directory entry. The directory decides the order the requests will be satisfied. Directory protocols can operate over unordered interconnects Chip Multiprocessors (ACS MPhil) 20

21 Broadcasting over unordered interconnects A number of recent commercial solutions broadcast transactions over unordered interconnects. They cannot use snoopy protocols directly, but don't require directory state to be stored; they just provide an ordering point. The ordering point also blocks subsequent coherent requests to the same cache line to prevent races with a request in progress. e.g. AMD's Hammer, Intel's E8870 Scalability Port, IBM's Power4 and xSeries Summit systems. Disadvantage: high bandwidth requirements. Chip Multiprocessors (ACS MPhil) 21

22 Directory-based cache coherence The directory keeps track of who has a copy of the block and their states Broadcasting is replaced by cheaper point-to-point communications by maintaining a list of sharers The number of invalidations on a write is typically small in real applications, giving us a significant reduction in communication costs. Chip Multiprocessors (ACS MPhil) 22
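The saving from point-to-point invalidation can be sketched with simple message counts (hypothetical cost model, not from the slides): a directory pays per actual sharer, while a broadcast pays per core regardless of how many copies exist.

```python
def directory_messages(sharers):
    # request to the directory + one invalidate and one ack per sharer
    # + the final data/permission reply to the requester
    return 1 + 2 * len(sharers) + 1

def broadcast_messages(num_cores):
    # request broadcast to every other node plus a snoop response from each
    return 2 * (num_cores - 1)

print(directory_messages({"P3", "P9"}))   # 6 messages for two sharers
print(broadcast_messages(64))             # 126 messages on a 64-core machine
```

Since real applications typically have only a handful of sharers per invalidating write, the directory's cost stays small even as the core count grows.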

23 Directory-based cache coherence Read Miss to a block in a modified state in a cache (Culler, Fig. 8.5) An example of a simple protocol. This is only meant to introduce the concept of a directory Chip Multiprocessors (ACS MPhil) 23

24 Directory-based cache coherence Write miss to a block with two sharers Chip Multiprocessors (ACS MPhil) 24

25 Directory-based cache coherence Let's consider the requester, directory and sharer state transitions for the previous slide...
Requester state:
I->P (1): the block is initially in the I(nvalid) state. The processor executes a store; we make an ExclReq to the directory and move to a pending state.
P->E (4): we receive write permission and data from the directory.
Directory state:
Shared->TransWaitForInvalidate (2): the block is initially marked as shared; the directory holds a list of the sharers. The directory receives an ExclReq from cache 'id', where id is not in the sharers list and the sharers list is not empty. It must send invalidate requests to all sharers and wait for their responses.
TransWaitForInvalidate->M (4): all invalidate acks are received; the directory can reply to the requester and provide data + write permission. It moves to a state that records that the requester has the only copy.
Sharer state:
S->I (3): on receiving an InvReq each sharer invalidates its copy of the block and moves to state I. It then acks with an InvRep message.
Chip Multiprocessors (ACS MPhil) 25
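The transitions above can be sketched as a small state machine. The state and message names (ExclReq, InvReq, InvRep, TransWaitForInvalidate) follow the slide; the classes and the two-sharer scenario are invented for illustration.

```python
class Cache:
    def __init__(self, name):
        self.name, self.state = name, "I"

    def recv_inv(self):
        self.state = "I"                   # S -> I on receiving InvReq
        return "InvRep"

class Directory:
    def __init__(self):
        self.state, self.sharers = "Shared", set()

    def excl_req(self, requester, caches):
        # Shared -> TransWaitForInvalidate: invalidate every current sharer
        self.state = "TransWaitForInvalidate"
        acks = [caches[s].recv_inv() for s in self.sharers]
        assert all(a == "InvRep" for a in acks)
        # TransWaitForInvalidate -> M once all invalidate acks are received
        self.state, self.sharers = "M", {requester}
        return "data+write-permission"

caches = {n: Cache(n) for n in ("P0", "P1", "P2")}
caches["P1"].state = caches["P2"].state = "S"
directory = Directory()
directory.sharers = {"P1", "P2"}

reply = directory.excl_req("P0", caches)   # P0's store: I -> P (pending)
caches["P0"].state = "E"                   # P -> E on receiving the reply
assert directory.state == "M" and directory.sharers == {"P0"}
assert caches["P1"].state == caches["P2"].state == "I"
```

A real protocol must also handle requests that arrive while the directory is in the transient state; here the directory simply completes the invalidations synchronously.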

26 Directory-based cache coherence We now have two types of controller, one at each directory and one at each private cache The complete cache coherence protocol is specified in state-diagrams for both controllers The stable cache states are often MESI as in a snoopy protocol There are some complete example protocols available on the wiki (Courtesy of Brian Gold) Exercise: try and understand how each of these protocols handles the situations described in slides 22 and 23. Chip Multiprocessors (ACS MPhil) 26

27 Organising directory information How do we know which directory to send our request to? How is directory state actually stored? Chip Multiprocessors (ACS MPhil) 27

28 Organising directory information Directory schemes may be centralized or distributed. Two questions: how to find the source of directory information, and how to locate copies.
Flat, memory-based: information about all sharers is stored at the directory, e.g. using a full bit-vector organisation or a limited-pointer scheme.
Flat, cache-based: information is distributed amongst sharers, e.g. sharers form a linked list (IEEE SCI, Sequent NUMA-Q).
Hierarchical: requests traverse up a tree to find a node with information on the block.
Figure 8.7 (reproduced from Culler's Parallel book) Chip Multiprocessors (ACS MPhil) 28

29 Organising directory information How do we store the sharers list in a flat, memory-based directory scheme? Full bit-vector: p presence bits, which indicate for each of the p processors whether it has a copy of the block. Limited-pointer schemes: maintain a fixed (and limited) number of pointers; typically the number of sharers is small (4 pointers may often suffice). These need a backup or overflow strategy: overflow to memory, resort to broadcast, or a coarse-vector scheme (where each bit represents a group of processors). Extract from duplicated L1 tags: query a local copy of the tags to find sharers. Chip Multiprocessors (ACS MPhil) 29 [Culler p.568]
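The storage trade-off between the two flat schemes is easy to quantify. A hypothetical sketch of the per-block directory cost for p processors (function names invented):

```python
import math

def full_bit_vector_bits(p):
    # one presence bit per processor, stored with every directory entry
    return p

def limited_pointer_bits(p, pointers=4):
    # a few pointers of ceil(log2(p)) bits each; needs an overflow strategy
    return pointers * math.ceil(math.log2(p))

print(full_bit_vector_bits(256))    # 256 bits per block
print(limited_pointer_bits(256))    # 32 bits per block with 4 pointers
```

The bit-vector grows linearly with the processor count, while the pointer scheme grows only logarithmically, which is why the latter wins once p is large and sharer counts stay small.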

30 Organising directory information Four examples of how we might store our directory information in a CMP: 1) append state to L2 tags; 2) duplicate L1 tags at the directory; 3) store directory state in main memory and include a directory cache at each node; 4) a hierarchical directory. I assume the L2 is the first shared cache; in a real system this could just as easily be the L3 or the interface to main memory. The directory is placed at the first shared memory regardless of the number of levels of cache. Chip Multiprocessors (ACS MPhil) 30

31 Organising directory information 1. Append state to L2 tags Perhaps conceptually the simplest scheme Assume a shared banked inclusive L2 cache The location of the directory depends only on the block address Directory state can simply be appended to the L2 cache tags Chip Multiprocessors (ACS MPhil) 31 Reproduced from Victim Replication: Maximizing Capacity while Hiding Wire Delay in Tiled Chip Multiprocessors, Zhang/Asanovic, ISCA'05

32 Organising directory information 1. Append state to L2 tags May be expensive in terms of memory: the L2 may contain many more cache lines than can reside in the aggregated L1s (or, on a per-bank basis, than those L1 lines that can map to the L2 bank). May be unnecessarily power and area hungry. Doesn't support non-inclusive L2 caches: it assumes the L2 is always caching anything in the L1s, which is problematic if the L2 is small in comparison to the aggregated L1 capacity. Chip Multiprocessors (ACS MPhil) 32

33 Organising directory information 2. Duplicating L1 tags (A reverse mapped directory CAM) At each directory (e.g. L2 bank): Duplicate the L1 tags of those L1 lines that can map to the bank We can interrogate the duplicated tags to determine the sharers list At what granularity do we interleave addresses across banks for the directory and L2 cache? Simpler if we interleave the directory and L2 in the same way What about the impact of granularity on the directory? Chip Multiprocessors (ACS MPhil) 33

34 Organising directory information 2. Duplicating L1 tags In this example precisely one quarter of the L1 lines map to each of the 4 L2 banks Chip Multiprocessors (ACS MPhil) 34

35 Organising directory information 2. Duplicating L1 tags A fine-grain interleaving, as illustrated on the previous slide, means that only a subset of each L1's lines may map to a particular L2 bank. Each directory is organised as s/n sets of n*a ways, where n is the number of processors, s the number of sets in an L1 cache and a its associativity. If a coarse-grain interleaving is selected (where the L2 bank is selected from bits outside the L1's index bits), any L1 line could map to any L2 bank, hence each directory is organised as s sets of n*a ways. Chip Multiprocessors (ACS MPhil) 35
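The two organisations above can be sketched as a small sizing helper (hypothetical function and example parameters, chosen only to illustrate the slide's s/n-sets-versus-s-sets distinction):

```python
def duplicate_tag_dir(s, a, n, fine_grain):
    # s: sets per L1, a: L1 associativity, n: number of processors/banks
    if fine_grain:
        # only s/n of each L1's sets can map to this bank
        return {"sets": s // n, "ways": n * a}
    # coarse interleaving: any L1 line may map to any bank
    return {"sets": s, "ways": n * a}

# e.g. 4 processors with 4-way, 128-set L1s (hypothetical numbers)
print(duplicate_tag_dir(s=128, a=4, n=4, fine_grain=True))   # 32 sets x 16 ways
print(duplicate_tag_dir(s=128, a=4, n=4, fine_grain=False))  # 128 sets x 16 ways
```

Both organisations hold the same number of duplicated tags per processor's reachable lines; the interleaving choice only changes how many sets each bank's directory CAM must cover.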

36 Organising directory information 2. Duplicating L1 tags Example: Sun Niagara T1 L1 caches are write-through, 16-byte lines Allocate on load, no-allocate on a store L2 maintains directory by duplicating L1 tags L2 is banked and interleaved at a 64-byte granularity No. of L1 lines that may map to each L2 bank is much less than the total number of L2 lines in a bank. Duplicating L1 tags saves area and power over adding directory state to each L2 tag. Chip Multiprocessors (ACS MPhil) 36

37 Organising directory information 3. Directory-caches Directory state is stored in main memory and cached at each node Note: The L2 caches are private in this example Figure reproduced from Proximity- Aware Directory-based Coherence for Multi-core Processor Architectures, Brown/Kumar/Tullsen, SPAA'07 Chip Multiprocessors (ACS MPhil) 37

38 Organising directory information 3. Directory-caches Each tile and corresponding memory channel has access to a different range of physical memory locations There is only one possible home (location of the associated directory) for each memory block Two different directories never share directory state, so there are no coherence worries between directory caches! Directory information can be associated with multiple contiguous memory blocks to take advantage of spatial locality We typically assign home nodes at a page-granularity using a first-touch policy Chip Multiprocessors (ACS MPhil) 38

39 Organising directory information 4. A hierarchical directory Reproduced from A consistency architecture for hierarchical shared caches, Ladan-Mozes/Lesierson, SPAA'08 Chip Multiprocessors (ACS MPhil) 39

40 Organising directory information 4. A hierarchical directory Aimed at processors with a large number of cores. The black dots indicate where a particular block may be cached or stored in memory; there is only one place as we move up each level of the tree. Example: if an L3 cache holds write permission for a block (holds the block in state M), it can manage the line in its subtree as if it were main memory, with no need to tell its parent. See the paper for details (and proofs!). See also the Fractal Coherence paper from MICRO'10. Chip Multiprocessors (ACS MPhil) 40

41 Organising directory information 4. A hierarchical directory Less extreme examples of hierarchical schemes are common, where larger-scale machines exploit bus-based first-level coherence (commodity hardware) and a directory protocol at the second level. In such schemes a bridge between the two protocols monitors activity on the bus and intervenes when necessary to ensure coherence actions are handled at the second level (removing the transaction from the bus, completing the coherence actions at the 2nd level and then replaying the request on the bus). Chip Multiprocessors (ACS MPhil) 41

42 Sharing patterns Invalidation frequency How many writes might require invalidating other copies? (invalidating writes) i.e. the local private cache does not already hold block in M state What is the distribution of the no. of invalidations (sharers) required upon these writes? Invalidation size distribution Chip Multiprocessors (ACS MPhil) 42

43 Sharing patterns Barnes-Hut and Radiosity invalidation patterns: histograms of the number of invalidations (0 up to 63) required per invalidating write for each application. See Culler p.574 for more. Chip Multiprocessors (ACS MPhil) 43

44 Sharing patterns Read-only: no invalidating writes. Producer-consumer: a processor writes, then one or more read the data, the processor writes again, the data is read again, and so on. Invalidation size is often 1, all, or a few. This categorization is originally from Cache invalidation patterns in shared memory multiprocessors, Gupta/Weber. See also Culler Section 8.3. Chip Multiprocessors (ACS MPhil) 44

45 Sharing patterns Migratory Data migrates from one processor to another. Often being read as well as written along the way Invalidation size = 1, only previous writer has a copy (it invalidated the previous copy) Irregular read-write Irregular/unpredictable read/write access patterns Invalidation size is normally concentrated around the small end of the spectrum Chip Multiprocessors (ACS MPhil) 45

46 Protocol optimisations Goals? Performance, power, complexity and area! Aim to lower the average memory access time If we look at the protocol in isolation, the typical approach is to: 1) Aim to reduce the number of network transactions 2) Reduce the number of transactions on the critical path of the processor Chip Multiprocessors (ACS MPhil) 46 Culler Section 8.4.1

47 Protocol optimisations Let's look again at the simple protocol we introduced in slides 22/23 In the case of a read miss to a block in a modified state in another cache we required: 5 transactions in total 4 transactions are on the critical path Let's look at forwarding as a protocol optimisation An intervention here is just like a request, but issued in reaction to a request to a cache Chip Multiprocessors (ACS MPhil) 47

48 Directory-based cache coherence Read Miss to a block in a modified state in a cache (Culler, Fig. 8.5) Chip Multiprocessors (ACS MPhil) 48

49 Directory-based cache coherence
(a) Strict request-reply: 1: req (L->H), 2: reply (H->L), 3: intervention (L->R), 4a: revise (R->H), 4b: response (R->L)
(b) Intervention forwarding: 1: req (L->H), 2: intervention (H->R), 3: response (R->H), 4: reply (H->L)
(c) Reply forwarding: 1: req (L->H), 2: intervention (H->R), 3a: revise (R->H), 3b: response (R->L)
Culler, p.586 Chip Multiprocessors (ACS MPhil) 49
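The benefit of forwarding can be summarised as message counts. A hypothetical tabulation of the three variants in the figure (L = local requester, H = home directory, R = remote owner), counting total messages and those on the requester's critical path to data:

```python
protocols = {
    # name: (total messages, messages on the requester's critical path)
    "strict request-reply":    (5, 4),  # revise (4a) is off the critical path
    "intervention forwarding": (4, 4),  # every message is on the critical path
    "reply forwarding":        (4, 3),  # revise (3a) overlaps the response (3b)
}
for name, (total, critical) in protocols.items():
    print(f"{name}: {total} total, {critical} on the critical path")
```

Reply forwarding wins on both counts: one fewer network transaction than the strict scheme, and one fewer transaction on the processor's critical path than either alternative.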

50 Protocol optimisations Other possible ways improvements can be made: Optimise the protocol for common sharing patterns e.g. producer-consumer and migratory Exploit a particular network topology or hierarchical directory structure Perhaps multiple networks tuned to different types of traffic Exploit locality (in a physical sense) Obtain required data using a cache-to-cache transfer from the nearest sharer or an immediate neighbour Perform speculative transactions to accelerate acquisition of permissions or data Compiler assistance... Chip Multiprocessors (ACS MPhil) 50

51 Correctness Directory protocols can quickly become very complicated Timeouts, retries, negative acknowledgements have all been used in different protocols to avoid deadlock and livelock issues (and guarantee forward progress) Chip Multiprocessors (ACS MPhil) 51


Cache Coherence. CMU : Parallel Computer Architecture and Programming (Spring 2012) Cache Coherence CMU 15-418: Parallel Computer Architecture and Programming (Spring 2012) Shared memory multi-processor Processors read and write to shared variables - More precisely: processors issues

More information

The need for atomicity This code sequence illustrates the need for atomicity. Explain.

The need for atomicity This code sequence illustrates the need for atomicity. Explain. Lock Implementations [ 8.1] Recall the three kinds of synchronization from Lecture 6: Point-to-point Lock Performance metrics for lock implementations Uncontended latency Traffic o Time to acquire a lock

More information

Module 14: "Directory-based Cache Coherence" Lecture 31: "Managing Directory Overhead" Directory-based Cache Coherence: Replacement of S blocks

Module 14: Directory-based Cache Coherence Lecture 31: Managing Directory Overhead Directory-based Cache Coherence: Replacement of S blocks Directory-based Cache Coherence: Replacement of S blocks Serialization VN deadlock Starvation Overflow schemes Sparse directory Remote access cache COMA Latency tolerance Page migration Queue lock in hardware

More information

EECS 570 Final Exam - SOLUTIONS Winter 2015

EECS 570 Final Exam - SOLUTIONS Winter 2015 EECS 570 Final Exam - SOLUTIONS Winter 2015 Name: unique name: Sign the honor code: I have neither given nor received aid on this exam nor observed anyone else doing so. Scores: # Points 1 / 21 2 / 32

More information

Page 1. Lecture 12: Multiprocessor 2: Snooping Protocol, Directory Protocol, Synchronization, Consistency. Bus Snooping Topology

Page 1. Lecture 12: Multiprocessor 2: Snooping Protocol, Directory Protocol, Synchronization, Consistency. Bus Snooping Topology CS252 Graduate Computer Architecture Lecture 12: Multiprocessor 2: Snooping Protocol, Directory Protocol, Synchronization, Consistency Review: Multiprocessor Basic issues and terminology Communication:

More information

EN2910A: Advanced Computer Architecture Topic 05: Coherency of Memory Hierarchy Prof. Sherief Reda School of Engineering Brown University

EN2910A: Advanced Computer Architecture Topic 05: Coherency of Memory Hierarchy Prof. Sherief Reda School of Engineering Brown University EN2910A: Advanced Computer Architecture Topic 05: Coherency of Memory Hierarchy Prof. Sherief Reda School of Engineering Brown University Material from: Parallel Computer Organization and Design by Debois,

More information

Cache Coherence Protocols: Implementation Issues on SMP s. Cache Coherence Issue in I/O

Cache Coherence Protocols: Implementation Issues on SMP s. Cache Coherence Issue in I/O 6.823, L21--1 Cache Coherence Protocols: Implementation Issues on SMP s Laboratory for Computer Science M.I.T. http://www.csg.lcs.mit.edu/6.823 Cache Coherence Issue in I/O 6.823, L21--2 Processor Processor

More information

CSE502: Computer Architecture CSE 502: Computer Architecture

CSE502: Computer Architecture CSE 502: Computer Architecture CSE 502: Computer Architecture Shared-Memory Multi-Processors Shared-Memory Multiprocessors Multiple threads use shared memory (address space) SysV Shared Memory or Threads in software Communication implicit

More information

Computer Architecture

Computer Architecture 18-447 Computer Architecture CSCI-564 Advanced Computer Architecture Lecture 29: Consistency & Coherence Lecture 20: Consistency and Coherence Bo Wu Prof. Onur Mutlu Colorado Carnegie School Mellon University

More information

Cache Coherence. (Architectural Supports for Efficient Shared Memory) Mainak Chaudhuri

Cache Coherence. (Architectural Supports for Efficient Shared Memory) Mainak Chaudhuri Cache Coherence (Architectural Supports for Efficient Shared Memory) Mainak Chaudhuri mainakc@cse.iitk.ac.in 1 Setting Agenda Software: shared address space Hardware: shared memory multiprocessors Cache

More information

CS 152 Computer Architecture and Engineering

CS 152 Computer Architecture and Engineering CS 152 Computer Architecture and Engineering Lecture 18: Directory-Based Cache Protocols John Wawrzynek EECS, University of California at Berkeley http://inst.eecs.berkeley.edu/~cs152 Administrivia 2 Recap:

More information

Overview: Shared Memory Hardware. Shared Address Space Systems. Shared Address Space and Shared Memory Computers. Shared Memory Hardware

Overview: Shared Memory Hardware. Shared Address Space Systems. Shared Address Space and Shared Memory Computers. Shared Memory Hardware Overview: Shared Memory Hardware Shared Address Space Systems overview of shared address space systems example: cache hierarchy of the Intel Core i7 cache coherency protocols: basic ideas, invalidate and

More information

Overview: Shared Memory Hardware

Overview: Shared Memory Hardware Overview: Shared Memory Hardware overview of shared address space systems example: cache hierarchy of the Intel Core i7 cache coherency protocols: basic ideas, invalidate and update protocols false sharing

More information

Cache Coherence in Scalable Machines

Cache Coherence in Scalable Machines Cache Coherence in Scalable Machines COE 502 arallel rocessing Architectures rof. Muhamed Mudawar Computer Engineering Department King Fahd University of etroleum and Minerals Generic Scalable Multiprocessor

More information

Scalable Multiprocessors

Scalable Multiprocessors Scalable Multiprocessors [ 11.1] scalable system is one in which resources can be added to the system without reaching a hard limit. Of course, there may still be economic limits. s the size of the system

More information

Cache Coherence: Part II Scalable Approaches

Cache Coherence: Part II Scalable Approaches ache oherence: art II Scalable pproaches Hierarchical ache oherence Todd. Mowry S 74 October 27, 2 (a) 1 2 1 2 (b) 1 Topics Hierarchies Directory rotocols Hierarchies arise in different ways: (a) processor

More information

Scalable Cache Coherent Systems

Scalable Cache Coherent Systems NUM SS Scalable ache oherent Systems Scalable distributed shared memory machines ssumptions: rocessor-ache-memory nodes connected by scalable network. Distributed shared physical address space. ommunication

More information

Lecture 5: Directory Protocols. Topics: directory-based cache coherence implementations

Lecture 5: Directory Protocols. Topics: directory-based cache coherence implementations Lecture 5: Directory Protocols Topics: directory-based cache coherence implementations 1 Flat Memory-Based Directories Block size = 128 B Memory in each node = 1 GB Cache in each node = 1 MB For 64 nodes

More information

Special Topics. Module 14: "Directory-based Cache Coherence" Lecture 33: "SCI Protocol" Directory-based Cache Coherence: Sequent NUMA-Q.

Special Topics. Module 14: Directory-based Cache Coherence Lecture 33: SCI Protocol Directory-based Cache Coherence: Sequent NUMA-Q. Directory-based Cache Coherence: Special Topics Sequent NUMA-Q SCI protocol Directory overhead Cache overhead Handling read miss Handling write miss Handling writebacks Roll-out protocol Snoop interaction

More information

Lecture 3: Directory Protocol Implementations. Topics: coherence vs. msg-passing, corner cases in directory protocols

Lecture 3: Directory Protocol Implementations. Topics: coherence vs. msg-passing, corner cases in directory protocols Lecture 3: Directory Protocol Implementations Topics: coherence vs. msg-passing, corner cases in directory protocols 1 Future Scalable Designs Intel s Single Cloud Computer (SCC): an example prototype

More information

Computer Architecture

Computer Architecture Jens Teubner Computer Architecture Summer 2016 1 Computer Architecture Jens Teubner, TU Dortmund jens.teubner@cs.tu-dortmund.de Summer 2016 Jens Teubner Computer Architecture Summer 2016 83 Part III Multi-Core

More information

Parallel Computer Architecture Spring Distributed Shared Memory Architectures & Directory-Based Memory Coherence

Parallel Computer Architecture Spring Distributed Shared Memory Architectures & Directory-Based Memory Coherence Parallel Computer Architecture Spring 2018 Distributed Shared Memory Architectures & Directory-Based Memory Coherence Nikos Bellas Computer and Communications Engineering Department University of Thessaly

More information

Computer Architecture and Engineering CS152 Quiz #5 April 27th, 2016 Professor George Michelogiannakis Name: <ANSWER KEY>

Computer Architecture and Engineering CS152 Quiz #5 April 27th, 2016 Professor George Michelogiannakis Name: <ANSWER KEY> Computer Architecture and Engineering CS152 Quiz #5 April 27th, 2016 Professor George Michelogiannakis Name: This is a closed book, closed notes exam. 80 Minutes 19 pages Notes: Not all questions

More information

Lecture: Consistency Models, TM

Lecture: Consistency Models, TM Lecture: Consistency Models, TM Topics: consistency models, TM intro (Section 5.6) No class on Monday (please watch TM videos) Wednesday: TM wrap-up, interconnection networks 1 Coherence Vs. Consistency

More information

Foundations of Computer Systems

Foundations of Computer Systems 18-600 Foundations of Computer Systems Lecture 21: Multicore Cache Coherence John P. Shen & Zhiyi Yu November 14, 2016 Prevalence of multicore processors: 2006: 75% for desktops, 85% for servers 2007:

More information

Chapter 9 Multiprocessors

Chapter 9 Multiprocessors ECE200 Computer Organization Chapter 9 Multiprocessors David H. lbonesi and the University of Rochester Henk Corporaal, TU Eindhoven, Netherlands Jari Nurmi, Tampere University of Technology, Finland University

More information

Chapter 8. Multiprocessors. In-Cheol Park Dept. of EE, KAIST

Chapter 8. Multiprocessors. In-Cheol Park Dept. of EE, KAIST Chapter 8. Multiprocessors In-Cheol Park Dept. of EE, KAIST Can the rapid rate of uniprocessor performance growth be sustained indefinitely? If the pace does slow down, multiprocessor architectures will

More information

A Basic Snooping-Based Multi-Processor Implementation

A Basic Snooping-Based Multi-Processor Implementation Lecture 15: A Basic Snooping-Based Multi-Processor Implementation Parallel Computer Architecture and Programming CMU 15-418/15-618, Spring 2015 Tunes Pushing On (Oliver $ & Jimi Jules) Time for the second

More information

EN2910A: Advanced Computer Architecture Topic 05: Coherency of Memory Hierarchy

EN2910A: Advanced Computer Architecture Topic 05: Coherency of Memory Hierarchy EN2910A: Advanced Computer Architecture Topic 05: Coherency of Memory Hierarchy Prof. Sherief Reda School of Engineering Brown University Material from: Parallel Computer Organization and Design by Debois,

More information

SHARED-MEMORY COMMUNICATION

SHARED-MEMORY COMMUNICATION SHARED-MEMORY COMMUNICATION IMPLICITELY VIA MEMORY PROCESSORS SHARE SOME MEMORY COMMUNICATION IS IMPLICIT THROUGH LOADS AND STORES NEED TO SYNCHRONIZE NEED TO KNOW HOW THE HARDWARE INTERLEAVES ACCESSES

More information

Lecture 4: Directory Protocols and TM. Topics: corner cases in directory protocols, lazy TM

Lecture 4: Directory Protocols and TM. Topics: corner cases in directory protocols, lazy TM Lecture 4: Directory Protocols and TM Topics: corner cases in directory protocols, lazy TM 1 Handling Reads When the home receives a read request, it looks up memory (speculative read) and directory in

More information

Parallel Computer Architecture Lecture 5: Cache Coherence. Chris Craik (TA) Carnegie Mellon University

Parallel Computer Architecture Lecture 5: Cache Coherence. Chris Craik (TA) Carnegie Mellon University 18-742 Parallel Computer Architecture Lecture 5: Cache Coherence Chris Craik (TA) Carnegie Mellon University Readings: Coherence Required for Review Papamarcos and Patel, A low-overhead coherence solution

More information

Multiprocessors and Locking

Multiprocessors and Locking Types of Multiprocessors (MPs) Uniform memory-access (UMA) MP Access to all memory occurs at the same speed for all processors. Multiprocessors and Locking COMP9242 2008/S2 Week 12 Part 1 Non-uniform memory-access

More information

Lect. 6: Directory Coherence Protocol

Lect. 6: Directory Coherence Protocol Lect. 6: Directory Coherence Protocol Snooping coherence Global state of a memory line is the collection of its state in all caches, and there is no summary state anywhere All cache controllers monitor

More information

Module 9: Addendum to Module 6: Shared Memory Multiprocessors Lecture 17: Multiprocessor Organizations and Cache Coherence. The Lecture Contains:

Module 9: Addendum to Module 6: Shared Memory Multiprocessors Lecture 17: Multiprocessor Organizations and Cache Coherence. The Lecture Contains: The Lecture Contains: Shared Memory Multiprocessors Shared Cache Private Cache/Dancehall Distributed Shared Memory Shared vs. Private in CMPs Cache Coherence Cache Coherence: Example What Went Wrong? Implementations

More information

CS 152 Computer Architecture and Engineering. Lecture 19: Directory-Based Cache Protocols

CS 152 Computer Architecture and Engineering. Lecture 19: Directory-Based Cache Protocols CS 152 Computer Architecture and Engineering Lecture 19: Directory-Based Cache Protocols Krste Asanovic Electrical Engineering and Computer Sciences University of California, Berkeley http://www.eecs.berkeley.edu/~krste

More information

CMSC Computer Architecture Lecture 15: Memory Consistency and Synchronization. Prof. Yanjing Li University of Chicago

CMSC Computer Architecture Lecture 15: Memory Consistency and Synchronization. Prof. Yanjing Li University of Chicago CMSC 22200 Computer Architecture Lecture 15: Memory Consistency and Synchronization Prof. Yanjing Li University of Chicago Administrative Stuff! Lab 5 (multi-core) " Basic requirements: out later today

More information

NOW Handout Page 1. Context for Scalable Cache Coherence. Cache Coherence in Scalable Machines. A Cache Coherent System Must:

NOW Handout Page 1. Context for Scalable Cache Coherence. Cache Coherence in Scalable Machines. A Cache Coherent System Must: ontext for Scalable ache oherence ache oherence in Scalable Machines Realizing gm Models through net transaction protocols - efficient node-to-net interface - interprets transactions Switch Scalable network

More information

EN164: Design of Computing Systems Lecture 34: Misc Multi-cores and Multi-processors

EN164: Design of Computing Systems Lecture 34: Misc Multi-cores and Multi-processors EN164: Design of Computing Systems Lecture 34: Misc Multi-cores and Multi-processors Professor Sherief Reda http://scale.engin.brown.edu Electrical Sciences and Computer Engineering School of Engineering

More information

Lecture 21: Transactional Memory. Topics: Hardware TM basics, different implementations

Lecture 21: Transactional Memory. Topics: Hardware TM basics, different implementations Lecture 21: Transactional Memory Topics: Hardware TM basics, different implementations 1 Transactions New paradigm to simplify programming instead of lock-unlock, use transaction begin-end locks are blocking,

More information

COEN-4730 Computer Architecture Lecture 08 Thread Level Parallelism and Coherence

COEN-4730 Computer Architecture Lecture 08 Thread Level Parallelism and Coherence 1 COEN-4730 Computer Architecture Lecture 08 Thread Level Parallelism and Coherence Cristinel Ababei Dept. of Electrical and Computer Engineering Marquette University Credits: Slides adapted from presentations

More information

Chapter 5. Thread-Level Parallelism

Chapter 5. Thread-Level Parallelism Chapter 5 Thread-Level Parallelism Instructor: Josep Torrellas CS433 Copyright Josep Torrellas 1999, 2001, 2002, 2013 1 Progress Towards Multiprocessors + Rate of speed growth in uniprocessors saturated

More information

Shared Memory Multiprocessors

Shared Memory Multiprocessors Parallel Computing Shared Memory Multiprocessors Hwansoo Han Cache Coherence Problem P 0 P 1 P 2 cache load r1 (100) load r1 (100) r1 =? r1 =? 4 cache 5 cache store b (100) 3 100: a 100: a 1 Memory 2 I/O

More information

Module 10: "Design of Shared Memory Multiprocessors" Lecture 20: "Performance of Coherence Protocols" MOESI protocol.

Module 10: Design of Shared Memory Multiprocessors Lecture 20: Performance of Coherence Protocols MOESI protocol. MOESI protocol Dragon protocol State transition Dragon example Design issues General issues Evaluating protocols Protocol optimizations Cache size Cache line size Impact on bus traffic Large cache line

More information

Cache Coherence. Todd C. Mowry CS 740 November 10, Topics. The Cache Coherence Problem Snoopy Protocols Directory Protocols

Cache Coherence. Todd C. Mowry CS 740 November 10, Topics. The Cache Coherence Problem Snoopy Protocols Directory Protocols Cache Coherence Todd C. Mowry CS 740 November 10, 1998 Topics The Cache Coherence roblem Snoopy rotocols Directory rotocols The Cache Coherence roblem Caches are critical to modern high-speed processors

More information

CMSC 611: Advanced. Distributed & Shared Memory

CMSC 611: Advanced. Distributed & Shared Memory CMSC 611: Advanced Computer Architecture Distributed & Shared Memory Centralized Shared Memory MIMD Processors share a single centralized memory through a bus interconnect Feasible for small processor

More information

Role of Synchronization. CS 258 Parallel Computer Architecture Lecture 23. Hardware-Software Trade-offs in Synchronization and Data Layout

Role of Synchronization. CS 258 Parallel Computer Architecture Lecture 23. Hardware-Software Trade-offs in Synchronization and Data Layout CS 28 Parallel Computer Architecture Lecture 23 Hardware-Software Trade-offs in Synchronization and Data Layout April 21, 2008 Prof John D. Kubiatowicz http://www.cs.berkeley.edu/~kubitron/cs28 Role of

More information

Chapter 5 (Part II) Large and Fast: Exploiting Memory Hierarchy. Baback Izadi Division of Engineering Programs

Chapter 5 (Part II) Large and Fast: Exploiting Memory Hierarchy. Baback Izadi Division of Engineering Programs Chapter 5 (Part II) Baback Izadi Division of Engineering Programs bai@engr.newpaltz.edu Virtual Machines Host computer emulates guest operating system and machine resources Improved isolation of multiple

More information

MULTIPROCESSORS AND THREAD-LEVEL. B649 Parallel Architectures and Programming

MULTIPROCESSORS AND THREAD-LEVEL. B649 Parallel Architectures and Programming MULTIPROCESSORS AND THREAD-LEVEL PARALLELISM B649 Parallel Architectures and Programming Motivation behind Multiprocessors Limitations of ILP (as already discussed) Growing interest in servers and server-performance

More information

MULTIPROCESSORS AND THREAD-LEVEL PARALLELISM. B649 Parallel Architectures and Programming

MULTIPROCESSORS AND THREAD-LEVEL PARALLELISM. B649 Parallel Architectures and Programming MULTIPROCESSORS AND THREAD-LEVEL PARALLELISM B649 Parallel Architectures and Programming Motivation behind Multiprocessors Limitations of ILP (as already discussed) Growing interest in servers and server-performance

More information

Module 9: "Introduction to Shared Memory Multiprocessors" Lecture 16: "Multiprocessor Organizations and Cache Coherence" Shared Memory Multiprocessors

Module 9: Introduction to Shared Memory Multiprocessors Lecture 16: Multiprocessor Organizations and Cache Coherence Shared Memory Multiprocessors Shared Memory Multiprocessors Shared memory multiprocessors Shared cache Private cache/dancehall Distributed shared memory Shared vs. private in CMPs Cache coherence Cache coherence: Example What went

More information

Cache Coherence and Atomic Operations in Hardware

Cache Coherence and Atomic Operations in Hardware Cache Coherence and Atomic Operations in Hardware Previously, we introduced multi-core parallelism. Today we ll look at 2 things: 1. Cache coherence 2. Instruction support for synchronization. And some

More information

CS 152 Computer Architecture and Engineering. Lecture 19: Directory-Based Cache Protocols

CS 152 Computer Architecture and Engineering. Lecture 19: Directory-Based Cache Protocols CS 152 Computer Architecture and Engineering Lecture 19: Directory-Based Protocols Dr. George Michelogiannakis EECS, University of California at Berkeley CRD, Lawrence Berkeley National Laboratory http://inst.eecs.berkeley.edu/~cs152

More information

Speculative Locks. Dept. of Computer Science

Speculative Locks. Dept. of Computer Science Speculative Locks José éf. Martínez and djosep Torrellas Dept. of Computer Science University it of Illinois i at Urbana-Champaign Motivation Lock granularity a trade-off: Fine grain greater concurrency

More information

A Scalable SAS Machine

A Scalable SAS Machine arallel omputer Organization and Design : Lecture 8 er Stenström. 2008, Sally. ckee 2009 Scalable ache oherence Design principles of scalable cache protocols Overview of design space (8.1) Basic operation

More information

Multiple Issue and Static Scheduling. Multiple Issue. MSc Informatics Eng. Beyond Instruction-Level Parallelism

Multiple Issue and Static Scheduling. Multiple Issue. MSc Informatics Eng. Beyond Instruction-Level Parallelism Computing Systems & Performance Beyond Instruction-Level Parallelism MSc Informatics Eng. 2012/13 A.J.Proença From ILP to Multithreading and Shared Cache (most slides are borrowed) When exploiting ILP,

More information

Recall: Sequential Consistency Example. Implications for Implementation. Issues for Directory Protocols

Recall: Sequential Consistency Example. Implications for Implementation. Issues for Directory Protocols ecall: Sequential onsistency Example S252 Graduate omputer rchitecture Lecture 21 pril 14 th, 2010 Distributed Shared ory rof John D. Kubiatowicz http://www.cs.berkeley.edu/~kubitron/cs252 rocessor 1 rocessor

More information

Consistency & Coherence. 4/14/2016 Sec5on 12 Colin Schmidt

Consistency & Coherence. 4/14/2016 Sec5on 12 Colin Schmidt Consistency & Coherence 4/14/2016 Sec5on 12 Colin Schmidt Agenda Brief mo5va5on Consistency vs Coherence Synchroniza5on Fences Mutexs, locks, semaphores Hardware Coherence Snoopy MSI, MESI Power, Frequency,

More information

Lecture: Consistency Models, TM. Topics: consistency models, TM intro (Section 5.6)

Lecture: Consistency Models, TM. Topics: consistency models, TM intro (Section 5.6) Lecture: Consistency Models, TM Topics: consistency models, TM intro (Section 5.6) 1 Coherence Vs. Consistency Recall that coherence guarantees (i) that a write will eventually be seen by other processors,

More information

Synchronization. Jinkyu Jeong Computer Systems Laboratory Sungkyunkwan University

Synchronization. Jinkyu Jeong Computer Systems Laboratory Sungkyunkwan University Synchronization Jinkyu Jeong (jinkyu@skku.edu) Computer Systems Laboratory Sungkyunkwan University http://csl.skku.edu Types of Synchronization Mutual Exclusion Locks Event Synchronization Global or group-based

More information

M4 Parallelism. Implementation of Locks Cache Coherence

M4 Parallelism. Implementation of Locks Cache Coherence M4 Parallelism Implementation of Locks Cache Coherence Outline Parallelism Flynn s classification Vector Processing Subword Parallelism Symmetric Multiprocessors, Distributed Memory Machines Shared Memory

More information

Why memory hierarchy? Memory hierarchy. Memory hierarchy goals. CS2410: Computer Architecture. L1 cache design. Sangyeun Cho

Why memory hierarchy? Memory hierarchy. Memory hierarchy goals. CS2410: Computer Architecture. L1 cache design. Sangyeun Cho Why memory hierarchy? L1 cache design Sangyeun Cho Computer Science Department Memory hierarchy Memory hierarchy goals Smaller Faster More expensive per byte CPU Regs L1 cache L2 cache SRAM SRAM To provide

More information

Lecture: Coherence, Synchronization. Topics: directory-based coherence, synchronization primitives (Sections )

Lecture: Coherence, Synchronization. Topics: directory-based coherence, synchronization primitives (Sections ) Lecture: Coherence, Synchronization Topics: directory-based coherence, synchronization primitives (Sections 5.1-5.5) 1 Cache Coherence Protocols Directory-based: A single location (directory) keeps track

More information

Interconnect Routing

Interconnect Routing Interconnect Routing store-and-forward routing switch buffers entire message before passing it on latency = [(message length / bandwidth) + fixed overhead] * # hops wormhole routing pipeline message through

More information