5 Chip Multiprocessors (II) Chip Multiprocessors (ACS MPhil) Robert Mullins
1 5 Chip Multiprocessors (II) Robert Mullins
2 Overview
Synchronization: hardware primitives. Cache coherency issues: coherence misses, false sharing. Cache coherence and interconnects. Directory-based coherency protocols: introduction, organising directories in a CMP, sharing patterns and protocol optimisations, correctness, broadcasting over unordered interconnects.
3 Synchronization: the lock problem
The lock is supposed to provide atomicity for critical sections. Unfortunately, as implemented below, the lock lacks atomicity in its own implementation: multiple processors could read the lock as free and progress past the branch simultaneously.

lock:   ld   reg, lock-addr
        cmp  reg, #0
        bnz  lock
        st   lock-addr, #1
        ret
unlock: st   lock-addr, #0
        ret

(Culler p. 338)
4 Synchronization: test&set
Test&set executes the following atomically:
        reg = m[lock-addr]
        m[lock-addr] = 1
The branch makes sure that if the lock was already taken we try again. A more general, but similar, instruction is swap:
        reg1 = m[lock-addr]
        m[lock-addr] = reg2

lock:   t&s  reg, lock-addr
        bnz  reg, lock
        ret
unlock: st   lock-addr, #0
        ret
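The test&set lock above can be sketched in C11, where atomic_exchange plays the role of the t&s instruction (it atomically returns the old value and stores 1). This is a minimal illustrative sketch, not the lecture's assembly; the type and function names are my own.

```c
#include <stdatomic.h>

typedef struct { atomic_int locked; } spinlock_t;

static void spin_lock(spinlock_t *l) {
    /* atomic_exchange = test&set: if the old value was non-zero the
     * lock was already held, so spin, like the bnz loop above. */
    while (atomic_exchange(&l->locked, 1) != 0)
        ;
}

static void spin_unlock(spinlock_t *l) {
    atomic_store(&l->locked, 0);  /* an ordinary store releases the lock */
}
```

Note that each spin re-runs the atomic exchange, which (as the next slides discuss) generates write traffic even while the lock is held.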
5 Synchronization
We could implement test&set with two bus transactions, a read and a write, and lock down the bus for these two cycles to ensure the sequence is atomic. This is more difficult with a split-transaction bus, raising performance and deadlock issues. (Culler p. 391)
6 Synchronization
If we assume an invalidation-based cache coherence protocol with a write-back cache, a better approach is to issue a read-exclusive (BusRdX) transaction, then perform the read and write in the cache without giving up ownership. Any incoming requests for the block are buffered until the data is written in the cache, so any other processors are forced to wait.
7 Synchronization
Other common synchronization instructions: swap, fetch&op (e.g. fetch&inc, fetch&add), compare&swap. Many x86 instructions can be prefixed with the lock modifier to make them atomic. Is there a simpler general-purpose solution?
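As a sketch of how fetch&add is used as a primitive, here is a ticket lock built on C11's atomic_fetch_add, which atomically returns the old value and increments. The ticket-lock structure is a standard technique, not something from the slides, and the names are my own.

```c
#include <stdatomic.h>

/* A ticket lock: fetch&add hands out tickets; each waiter spins
 * until the "now serving" counter reaches its ticket. */
typedef struct {
    atomic_uint next;      /* next ticket to hand out */
    atomic_uint serving;   /* ticket currently being served */
} ticket_lock_t;

static void ticket_acquire(ticket_lock_t *t) {
    unsigned my = atomic_fetch_add(&t->next, 1);  /* take a ticket */
    while (atomic_load(&t->serving) != my)
        ;                                         /* wait your turn */
}

static void ticket_release(ticket_lock_t *t) {
    atomic_fetch_add(&t->serving, 1);             /* serve the next ticket */
}
```

Unlike the plain test&set lock, the ticket lock grants the lock in FIFO order, which avoids starvation.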
8 Synchronization: LL/SC
Load-Linked (LL): reads memory, sets the lock flag and puts the address in the lock register. Intervening writes to the address in the lock register cause the lock flag to be reset.
Store-Conditional (SC): checks the lock flag to ensure no intervening conflicting write has occurred. If the lock flag is not set, SC fails.
        if (atomic_update) then mem[addr] = rt; rt = 1
        else rt = 0
9 Synchronization
        reg2 = 1
lock:   ll   reg1, lock-addr
        bnz  reg1, lock      ; lock already taken?
        sc   lock-addr, reg2
        beqz lock            ; if SC failed goto lock
        ret
unlock: st   lock-addr, #0
        ret
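C has no direct LL/SC, but the same retry loop can be sketched with compare-and-swap; on LL/SC machines (ARM, RISC-V) compilers lower atomic_compare_exchange loops to LL/SC pairs (ldrex/strex, lr/sc). A minimal sketch with my own function names:

```c
#include <stdatomic.h>

static void ll_sc_style_lock(atomic_int *lock_addr) {
    int expected;
    do {
        expected = 0;   /* like LL: we only proceed if we saw the lock free */
        /* like SC: store 1 only if nothing intervened; otherwise it
         * fails and we go around the loop again */
    } while (!atomic_compare_exchange_weak(lock_addr, &expected, 1));
}

static void ll_sc_style_unlock(atomic_int *lock_addr) {
    atomic_store(lock_addr, 0);
}
```

The weak form of compare-exchange is used deliberately: like SC, it is allowed to fail spuriously, which is harmless inside a retry loop.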
10 Synchronization
This SC will fail, as the lock flag will be reset by the store from P2. (Culler p. 391)
11 Synchronization
LL/SC can be implemented using the cache coherence protocol: LL loads the cache line with write permission (issues BusRdX and holds the line in state M); SC succeeds only if the cache line is still in state M, and fails otherwise. Implementations often come with caveats:
SC may experience spurious failures, e.g. due to context switches and TLB misses
Restrictions to prevent the cache line holding the lock variable from being replaced
Memory-referencing instructions may be disallowed between LL and SC
Out-of-order execution may be prohibited between LL and SC
12 Coherence misses
Remember your 3 C's!
Compulsory: cold-start or first-reference misses
Capacity: the cache is not large enough to store all the blocks needed during execution of the program
Conflict (or collision): misses that occur due to direct-mapped or set-associative block placement strategies
Coherence: misses that arise due to interprocessor communication
13 True sharing
A block typically contains many words (e.g. 4-8), and coherency is maintained at the granularity of cache blocks. True sharing misses arise from the communication of data: the first write to a shared block causes an invalidate, and a subsequent read of the block by another processor will also cause a miss. Both of these misses are classified as true sharing.
14 False sharing
A false sharing miss occurs when different processors are writing and reading different words in a block, but no communication is taking place. For example, a block may contain words X and Y; P1 repeatedly writes to X, P2 repeatedly writes to Y. The block will be repeatedly invalidated (leading to cache misses) even though no communication is taking place. These misses are false: they are due to the fact that the block contains multiple words, and would not occur if the block size were a single word.
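The X/Y scenario above can be sketched in C: in the first layout the two counters share a cache block, so writes by P1 and P2 ping-pong the block; padding each counter to its own block removes the false misses at the cost of memory. The 64-byte block size is an assumption for illustration.

```c
#include <stdalign.h>

/* Both words land in the same 64-byte cache block: P1 writing x and
 * P2 writing y repeatedly invalidate each other's copies. */
struct shared_bad {
    long x;   /* written by P1 */
    long y;   /* written by P2, same block as x: false sharing */
};

/* Each word is aligned to (and therefore starts) its own 64-byte
 * block, so writes to x never invalidate the block holding y. */
struct shared_good {
    alignas(64) long x;
    alignas(64) long y;
};
```

The same trick is why per-thread counters in real code are often padded to the cache-line size.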
15 Cache coherence and interconnects
Broadcast-based snoopy protocols (discussed in Seminar 4) rely on bus-based interconnects. Buses have limited scalability, and broadcasting has energy and bandwidth implications. On the other hand, they permit direct cache-to-cache transfers and hence low-latency communication in 2 hops (1. broadcast, 2. receive data from the remote cache), which is very useful for applications with lots of fine-grain sharing.
16 Cache coherence and interconnects
Totally-ordered interconnects deliver all messages to all destinations in the same order. They often employ a centralised arbiter or switch, e.g. a bus or a pipelined broadcast tree. Traditional snoopy protocols are built around the concept of a bus (or virtual bus): (1) broadcast, so all transactions are visible to all components connected to the bus; (2) the interconnect provides a total order of messages.
17 Cache coherence and interconnects
A pipelined broadcast tree is sufficiently similar to a bus to support traditional snooping protocols: the centralised switch guarantees a total ordering of messages. [Figure reproduced from Milo Martin's PhD thesis (Wisconsin)]
18 Cache coherence and interconnects
Unordered interconnects: networks (e.g. mesh, torus) typically can't provide strong ordering guarantees, i.e. nodes don't perceive transactions in a single global order.
Point-to-point ordering: networks may be able to guarantee that messages sent between a given pair of nodes are not reordered, e.g. a mesh with a single VC and XY routing.
19 Directory-based cache coherence
In a snoopy protocol, the state of the blocks in each cache is maintained by broadcasting all memory operations on the bus. We want to avoid the need to broadcast, so we maintain the state of each block explicitly and store this information in the directory. Requests to read or write a particular block are made to the appropriate directory entry, and the directory orchestrates the actions necessary to satisfy the request.
20 Directory-based cache coherence
The directory provides a per-block ordering point to resolve races: all requests for a particular block are made to the same directory entry, and the directory decides the order in which the requests will be satisfied. Directory protocols can therefore operate over unordered interconnects.
21 Broadcasting over unordered interconnects
A number of recent commercial designs broadcast transactions over unordered interconnects. They cannot use snoopy protocols directly, but they don't require directory state to be stored; they just provide an ordering point. The ordering point also blocks subsequent coherent requests to the same cache line, to prevent races with a request in progress. Examples: AMD's Hammer, Intel's E8870 Scalability Port, IBM's Power4 and xSeries Summit systems. Disadvantage: high bandwidth requirements.
22 Directory-based cache coherence
The directory keeps track of who has a copy of the block and their states. By maintaining a list of sharers, broadcasting is replaced by cheaper point-to-point communication. The number of invalidations on a write is typically small in real applications, giving a significant reduction in communication costs.
23 Directory-based cache coherence
Read miss to a block in modified state in a cache (Culler, Fig. 8.5). This is an example of a simple protocol, only meant to introduce the concept of a directory.
24 Directory-based cache coherence
Write miss to a block with two sharers.
25 Directory-based cache coherence
Let's consider the requester, directory and sharer state transitions for the previous slide.

Requester state:
I->P (1): the block is initially in the I(nvalid) state. The processor executes a store, so we make an ExclReq to the directory and move to a pending state.
P->E (4): we receive write permission and the data from the directory.

Directory state:
Shared->TransWaitForInvalidate (2): the block is initially marked as shared, and the directory holds a list of the sharers. The directory receives an ExclReq from cache 'id'; id is not in the sharers list and the sharers list is not empty, so it must send invalidate requests to all sharers and wait for their responses.
TransWaitForInvalidate->M (4): all invalidate acks are received, so the directory can reply to the requester and provide the data plus write permission. It moves to a state that records that the requester has the only copy.

Sharer state:
S->I (3): on receiving an InvReq, each sharer invalidates its copy of the block and moves to state I. It then acks with an InvRep message.
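The directory side of this transaction can be sketched in C. The message names (ExclReq, InvReq, InvRep) follow the slides; everything else (types, bit-vector, collapsing the transient state into one step) is a simplifying assumption of mine: a real controller would sit in TransWaitForInvalidate counting InvRep acks rather than completing atomically.

```c
#include <stdbool.h>

#define MAX_PROCS 8

typedef enum { DIR_I, DIR_S, DIR_M } dir_state_t;

typedef struct {
    dir_state_t state;
    bool sharer[MAX_PROCS];   /* full bit-vector of sharers */
} dir_entry_t;

/* Handle an ExclReq from 'requester'; returns the number of InvReq
 * messages sent (= invalidation size for this write). */
static int handle_excl_req(dir_entry_t *e, int requester) {
    int invals = 0;
    for (int p = 0; p < MAX_PROCS; p++) {
        if (e->sharer[p] && p != requester) {
            e->sharer[p] = false;   /* send InvReq to p; p replies InvRep */
            invals++;
        }
    }
    e->sharer[requester] = true;    /* requester now has the only copy */
    e->state = DIR_M;               /* block is Modified at the requester */
    return invals;
}
```

For the write miss with two sharers on the previous slide, this sketch sends exactly two invalidations before granting ownership.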
26 Directory-based cache coherence
We now have two types of controller, one at each directory and one at each private cache. The complete cache coherence protocol is specified as state diagrams for both controllers. The stable cache states are often MESI, as in a snoopy protocol. There are some complete example protocols available on the wiki (courtesy of Brian Gold). Exercise: try to understand how each of these protocols handles the situations described in slides 22 and 23.
27 Organising directory information
How do we know which directory to send our request to? How is the directory state actually stored?
28 Organising directory information
Directory schemes may be centralized or distributed, and are classified by how the source of directory information is found and how copies are located (Figure 8.7, reproduced from Culler's Parallel Computer Architecture book):
Flat, memory-based: information about all sharers is stored at the directory, e.g. using a full bit-vector organisation or a limited-pointer scheme
Flat, cache-based: information is distributed amongst the sharers, e.g. the sharers form a linked list (IEEE SCI, Sequent NUMA-Q)
Hierarchical: requests traverse up a tree to find a node with information on the block
29 Organising directory information
How do we store the sharers list in a flat, memory-based directory scheme?
Full bit-vector: P presence bits, indicating for each of the P processors whether it has a copy of the block
Limited-pointer schemes: maintain a fixed (and limited) number of pointers. Typically the number of sharers is small (4 pointers may often suffice), but we need a backup or overflow strategy: overflow to memory, resort to broadcast, or use a coarse-vector scheme (where each bit represents a group of processors)
Extract from duplicated L1 tags: query a local copy of the tags to find the sharers
[Culler p. 568]
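The storage trade-off between the full bit-vector and a limited-pointer scheme can be made concrete with a small sketch (my own helper functions; the +1 bit models a single overflow/broadcast flag, one common but not universal choice):

```c
/* Directory storage per block for a full bit-vector: one presence
 * bit per processor. */
static int bitvector_bits(int p) {
    return p;
}

/* Smallest number of bits needed to name one of p processors. */
static int ceil_log2(int p) {
    int bits = 0;
    while ((1 << bits) < p)
        bits++;
    return bits;
}

/* Directory storage per block for a limited-pointer scheme with k
 * pointers, plus one overflow/broadcast flag bit. */
static int limited_ptr_bits(int p, int k) {
    return k * ceil_log2(p) + 1;
}
```

For 256 processors, the bit-vector needs 256 bits per block while 4 pointers need only 4*8+1 = 33, which is why limited pointers win once the machine is large and sharer counts stay small.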
30 Organising directory information
Four examples of how we might store our directory information in a CMP:
1) Append state to the L2 tags
2) Duplicate the L1 tags at the directory
3) Store directory state in main memory and include a directory cache at each node
4) A hierarchical directory
Here the L2 is assumed to be the first shared cache. In a real system this could just as easily be the L3 or the interface to main memory: the directory is placed at the first shared level of the memory hierarchy regardless of the number of levels of cache.
31 Organising directory information
1. Append state to L2 tags. Perhaps conceptually the simplest scheme: assume a shared, banked, inclusive L2 cache. The location of the directory depends only on the block address, and the directory state can simply be appended to the L2 cache tags. [Figure reproduced from "Victim Replication: Maximizing Capacity while Hiding Wire Delay in Tiled Chip Multiprocessors", Zhang/Asanovic, ISCA'05]
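Because the directory location depends only on the block address, home-bank selection reduces to arithmetic on the physical address. A minimal sketch of a fine-grain (block-granularity) interleave; the 64-byte block size and 4 banks are assumptions for illustration:

```c
#include <stdint.h>

enum { BLOCK_BYTES = 64, NUM_BANKS = 4 };

/* Home L2 bank (and hence directory slice) for a physical address:
 * drop the block offset, then interleave block addresses across
 * the banks. */
static unsigned home_bank(uint64_t paddr) {
    uint64_t block = paddr / BLOCK_BYTES;   /* block address */
    return (unsigned)(block % NUM_BANKS);   /* fine-grain interleave */
}
```

Consecutive blocks land in consecutive banks, spreading directory traffic; a coarse-grain interleave would instead select the bank from higher address bits.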
32 Organising directory information
1. Append state to L2 tags. This may be expensive in terms of memory: the L2 may contain many more cache lines than can reside in the aggregated L1s (or, on a per-bank basis, than the L1 lines that can map to the L2 bank), so it may be unnecessarily power- and area-hungry. It also doesn't support non-inclusive L2 caches: it assumes the L2 is always caching anything in the L1s, which is problematic if the L2 is small in comparison to the aggregate L1 capacity.
33 Organising directory information
2. Duplicating L1 tags (a reverse-mapped directory CAM). At each directory (e.g. L2 bank), duplicate the L1 tags of those L1 lines that can map to the bank; we can interrogate the duplicated tags to determine the sharers list. At what granularity do we interleave addresses across banks for the directory and the L2 cache? It is simpler if we interleave the directory and L2 in the same way. What is the impact of the interleaving granularity on the directory?
34 Organising directory information
2. Duplicating L1 tags. In this example, precisely one quarter of the L1 lines map to each of the 4 L2 banks.
35 Organising directory information
2. Duplicating L1 tags. A fine-grain interleaving, as illustrated on the previous slide, means that only a subset of each L1's lines may map to a particular L2 bank. Each directory is then organised as s/n sets of n*a ways, where s is the number of L1 sets, a is the L1 associativity and n is the number of processors. If a coarse-grain interleaving is selected (where the L2 bank is selected from bits outside the L1's index bits), any L1 line could map to any L2 bank, so each directory must be organised as s sets of n*a ways.
36 Organising directory information
2. Duplicating L1 tags: example, Sun Niagara T1. The L1 caches are write-through with 16-byte lines (allocate on load, no-allocate on store). The L2 maintains the directory by duplicating the L1 tags; the L2 is banked and interleaved at a 64-byte granularity. The number of L1 lines that may map to each L2 bank is much less than the total number of L2 lines in a bank, so duplicating L1 tags saves area and power over adding directory state to each L2 tag.
37 Organising directory information
3. Directory caches. Directory state is stored in main memory and cached at each node. Note: the L2 caches are private in this example. [Figure reproduced from "Proximity-Aware Directory-based Coherence for Multi-core Processor Architectures", Brown/Kumar/Tullsen, SPAA'07]
38 Organising directory information
3. Directory caches. Each tile and corresponding memory channel has access to a different range of physical memory locations, so there is only one possible home (the location of the associated directory) for each memory block. Two different directories never share directory state, so there are no coherence worries between directory caches. Directory information can be associated with multiple contiguous memory blocks to take advantage of spatial locality; we typically assign home nodes at page granularity using a first-touch policy.
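A first-touch, page-granularity home policy can be sketched as a small lookup table: the first node to touch a page becomes its home. The table, its size, and the 4 KiB page size are assumptions for illustration; in a real system this mapping lives in the OS page tables.

```c
#include <stdint.h>

enum { PAGE_BYTES = 4096, MAX_PAGES = 1024 };

static int home_of[MAX_PAGES];   /* -1 = page not yet touched */

static void init_homes(void) {
    for (int i = 0; i < MAX_PAGES; i++)
        home_of[i] = -1;
}

/* Return the home node for paddr; on the first touch of a page,
 * the touching node is recorded as its home. */
static int home_node(uint64_t paddr, int touching_node) {
    uint64_t page = paddr / PAGE_BYTES;
    if (home_of[page] == -1)
        home_of[page] = touching_node;   /* first touch assigns the home */
    return home_of[page];
}
```

First-touch tends to place directory state (and memory) near the node that uses the page most, at the cost of poor placement when initialisation is done by a single thread.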
39 Organising directory information
4. A hierarchical directory. [Figure reproduced from "A consistency architecture for hierarchical shared caches", Ladan-Mozes/Leiserson, SPAA'08]
40 Organising directory information
4. A hierarchical directory, aimed at processors with a large number of cores. The black dots indicate where a particular block may be cached or stored in memory; there is only one place as we move up each level of the tree. Example: if an L3 cache holds write permission for a block (holds the block in state M), it can manage the line in its subtree as if it were main memory, with no need to tell its parent. See the paper for details (and proofs!), and also the Fractal Coherence paper from MICRO'10.
41 Organising directory information
Less extreme hierarchical schemes are common, where larger-scale machines exploit bus-based first-level coherence (commodity hardware) and a directory protocol at the second level. In such schemes a bridge between the two protocols monitors activity on the bus and intervenes when necessary to ensure coherence actions are handled at the second level: removing the transaction from the bus, completing the coherence actions at the second level, and then replaying the request on the bus.
42 Sharing patterns
Invalidation frequency: how many writes might require invalidating other copies (invalidating writes)? i.e. writes where the local private cache does not already hold the block in state M.
Invalidation size distribution: what is the distribution of the number of invalidations (sharers) required by these writes?
43 Sharing patterns
[Histograms of the invalidation patterns for the Barnes-Hut and Radiosity benchmarks: the number of invalidations per invalidating write, binned from 0 up to 63. See Culler p. 574 for more.]
44 Sharing patterns
Read-only: no invalidating writes.
Producer-consumer: a processor writes, then one or more processors read the data, the processor writes again, the data is read again, and so on. The invalidation size is often 1, all, or a few.
This categorisation is originally from "Cache invalidation patterns in shared memory multiprocessors", Gupta/Weber; see also Culler Section 8.3.
45 Sharing patterns
Migratory: data migrates from one processor to another, often being read as well as written along the way. The invalidation size is 1: only the previous writer has a copy (it invalidated the copy before that).
Irregular read-write: irregular/unpredictable read/write access patterns; the invalidation size is normally concentrated at the small end of the spectrum.
46 Protocol optimisations
Goals? Performance, power, complexity and area! We aim to lower the average memory access time. If we look at the protocol in isolation, the typical approach is to: (1) reduce the number of network transactions, and (2) reduce the number of transactions on the critical path of the processor. (Culler Section 8.4.1)
47 Protocol optimisations
Let's look again at the simple protocol we introduced in slides 22/23. In the case of a read miss to a block in modified state in another cache, we required 5 transactions in total, 4 of them on the critical path. Let's look at forwarding as a protocol optimisation. An intervention here is just like a request, but issued to a cache in reaction to a request.
48 Directory-based cache coherence
Read miss to a block in modified state in a cache (Culler, Fig. 8.5).
49 Directory-based cache coherence
Three ways of handling the miss, with L = local requester, H = home/directory, R = remote owner (Culler p. 586):
(a) Strict request-reply: 1: request (L to H); 2: reply (H to L); 3: intervention (L to R); 4a: revise (R to H); 4b: response (R to L)
(b) Intervention forwarding: 1: request (L to H); 2: intervention (H to R); 3: response (R to H); 4: reply (H to L)
(c) Reply forwarding: 1: request (L to H); 2: intervention (H to R); 3a: revise (R to H); 3b: response (R to L)
50 Protocol optimisations
Other possible improvements:
Optimise the protocol for common sharing patterns, e.g. producer-consumer and migratory
Exploit a particular network topology or hierarchical directory structure, perhaps with multiple networks tuned to different types of traffic
Exploit locality (in a physical sense): obtain the required data using a cache-to-cache transfer from the nearest sharer or an immediate neighbour
Perform speculative transactions to accelerate the acquisition of permissions or data
Compiler assistance...
51 Correctness
Directory protocols can quickly become very complicated. Timeouts, retries and negative acknowledgements have all been used in different protocols to avoid deadlock and livelock (and to guarantee forward progress).
More informationCS 152 Computer Architecture and Engineering
CS 152 Computer Architecture and Engineering Lecture 18: Directory-Based Cache Protocols John Wawrzynek EECS, University of California at Berkeley http://inst.eecs.berkeley.edu/~cs152 Administrivia 2 Recap:
More informationOverview: Shared Memory Hardware. Shared Address Space Systems. Shared Address Space and Shared Memory Computers. Shared Memory Hardware
Overview: Shared Memory Hardware Shared Address Space Systems overview of shared address space systems example: cache hierarchy of the Intel Core i7 cache coherency protocols: basic ideas, invalidate and
More informationOverview: Shared Memory Hardware
Overview: Shared Memory Hardware overview of shared address space systems example: cache hierarchy of the Intel Core i7 cache coherency protocols: basic ideas, invalidate and update protocols false sharing
More informationCache Coherence in Scalable Machines
Cache Coherence in Scalable Machines COE 502 arallel rocessing Architectures rof. Muhamed Mudawar Computer Engineering Department King Fahd University of etroleum and Minerals Generic Scalable Multiprocessor
More informationScalable Multiprocessors
Scalable Multiprocessors [ 11.1] scalable system is one in which resources can be added to the system without reaching a hard limit. Of course, there may still be economic limits. s the size of the system
More informationCache Coherence: Part II Scalable Approaches
ache oherence: art II Scalable pproaches Hierarchical ache oherence Todd. Mowry S 74 October 27, 2 (a) 1 2 1 2 (b) 1 Topics Hierarchies Directory rotocols Hierarchies arise in different ways: (a) processor
More informationScalable Cache Coherent Systems
NUM SS Scalable ache oherent Systems Scalable distributed shared memory machines ssumptions: rocessor-ache-memory nodes connected by scalable network. Distributed shared physical address space. ommunication
More informationLecture 5: Directory Protocols. Topics: directory-based cache coherence implementations
Lecture 5: Directory Protocols Topics: directory-based cache coherence implementations 1 Flat Memory-Based Directories Block size = 128 B Memory in each node = 1 GB Cache in each node = 1 MB For 64 nodes
More informationSpecial Topics. Module 14: "Directory-based Cache Coherence" Lecture 33: "SCI Protocol" Directory-based Cache Coherence: Sequent NUMA-Q.
Directory-based Cache Coherence: Special Topics Sequent NUMA-Q SCI protocol Directory overhead Cache overhead Handling read miss Handling write miss Handling writebacks Roll-out protocol Snoop interaction
More informationLecture 3: Directory Protocol Implementations. Topics: coherence vs. msg-passing, corner cases in directory protocols
Lecture 3: Directory Protocol Implementations Topics: coherence vs. msg-passing, corner cases in directory protocols 1 Future Scalable Designs Intel s Single Cloud Computer (SCC): an example prototype
More informationComputer Architecture
Jens Teubner Computer Architecture Summer 2016 1 Computer Architecture Jens Teubner, TU Dortmund jens.teubner@cs.tu-dortmund.de Summer 2016 Jens Teubner Computer Architecture Summer 2016 83 Part III Multi-Core
More informationParallel Computer Architecture Spring Distributed Shared Memory Architectures & Directory-Based Memory Coherence
Parallel Computer Architecture Spring 2018 Distributed Shared Memory Architectures & Directory-Based Memory Coherence Nikos Bellas Computer and Communications Engineering Department University of Thessaly
More informationComputer Architecture and Engineering CS152 Quiz #5 April 27th, 2016 Professor George Michelogiannakis Name: <ANSWER KEY>
Computer Architecture and Engineering CS152 Quiz #5 April 27th, 2016 Professor George Michelogiannakis Name: This is a closed book, closed notes exam. 80 Minutes 19 pages Notes: Not all questions
More informationLecture: Consistency Models, TM
Lecture: Consistency Models, TM Topics: consistency models, TM intro (Section 5.6) No class on Monday (please watch TM videos) Wednesday: TM wrap-up, interconnection networks 1 Coherence Vs. Consistency
More informationFoundations of Computer Systems
18-600 Foundations of Computer Systems Lecture 21: Multicore Cache Coherence John P. Shen & Zhiyi Yu November 14, 2016 Prevalence of multicore processors: 2006: 75% for desktops, 85% for servers 2007:
More informationChapter 9 Multiprocessors
ECE200 Computer Organization Chapter 9 Multiprocessors David H. lbonesi and the University of Rochester Henk Corporaal, TU Eindhoven, Netherlands Jari Nurmi, Tampere University of Technology, Finland University
More informationChapter 8. Multiprocessors. In-Cheol Park Dept. of EE, KAIST
Chapter 8. Multiprocessors In-Cheol Park Dept. of EE, KAIST Can the rapid rate of uniprocessor performance growth be sustained indefinitely? If the pace does slow down, multiprocessor architectures will
More informationA Basic Snooping-Based Multi-Processor Implementation
Lecture 15: A Basic Snooping-Based Multi-Processor Implementation Parallel Computer Architecture and Programming CMU 15-418/15-618, Spring 2015 Tunes Pushing On (Oliver $ & Jimi Jules) Time for the second
More informationEN2910A: Advanced Computer Architecture Topic 05: Coherency of Memory Hierarchy
EN2910A: Advanced Computer Architecture Topic 05: Coherency of Memory Hierarchy Prof. Sherief Reda School of Engineering Brown University Material from: Parallel Computer Organization and Design by Debois,
More informationSHARED-MEMORY COMMUNICATION
SHARED-MEMORY COMMUNICATION IMPLICITELY VIA MEMORY PROCESSORS SHARE SOME MEMORY COMMUNICATION IS IMPLICIT THROUGH LOADS AND STORES NEED TO SYNCHRONIZE NEED TO KNOW HOW THE HARDWARE INTERLEAVES ACCESSES
More informationLecture 4: Directory Protocols and TM. Topics: corner cases in directory protocols, lazy TM
Lecture 4: Directory Protocols and TM Topics: corner cases in directory protocols, lazy TM 1 Handling Reads When the home receives a read request, it looks up memory (speculative read) and directory in
More informationParallel Computer Architecture Lecture 5: Cache Coherence. Chris Craik (TA) Carnegie Mellon University
18-742 Parallel Computer Architecture Lecture 5: Cache Coherence Chris Craik (TA) Carnegie Mellon University Readings: Coherence Required for Review Papamarcos and Patel, A low-overhead coherence solution
More informationMultiprocessors and Locking
Types of Multiprocessors (MPs) Uniform memory-access (UMA) MP Access to all memory occurs at the same speed for all processors. Multiprocessors and Locking COMP9242 2008/S2 Week 12 Part 1 Non-uniform memory-access
More informationLect. 6: Directory Coherence Protocol
Lect. 6: Directory Coherence Protocol Snooping coherence Global state of a memory line is the collection of its state in all caches, and there is no summary state anywhere All cache controllers monitor
More informationModule 9: Addendum to Module 6: Shared Memory Multiprocessors Lecture 17: Multiprocessor Organizations and Cache Coherence. The Lecture Contains:
The Lecture Contains: Shared Memory Multiprocessors Shared Cache Private Cache/Dancehall Distributed Shared Memory Shared vs. Private in CMPs Cache Coherence Cache Coherence: Example What Went Wrong? Implementations
More informationCS 152 Computer Architecture and Engineering. Lecture 19: Directory-Based Cache Protocols
CS 152 Computer Architecture and Engineering Lecture 19: Directory-Based Cache Protocols Krste Asanovic Electrical Engineering and Computer Sciences University of California, Berkeley http://www.eecs.berkeley.edu/~krste
More informationCMSC Computer Architecture Lecture 15: Memory Consistency and Synchronization. Prof. Yanjing Li University of Chicago
CMSC 22200 Computer Architecture Lecture 15: Memory Consistency and Synchronization Prof. Yanjing Li University of Chicago Administrative Stuff! Lab 5 (multi-core) " Basic requirements: out later today
More informationNOW Handout Page 1. Context for Scalable Cache Coherence. Cache Coherence in Scalable Machines. A Cache Coherent System Must:
ontext for Scalable ache oherence ache oherence in Scalable Machines Realizing gm Models through net transaction protocols - efficient node-to-net interface - interprets transactions Switch Scalable network
More informationEN164: Design of Computing Systems Lecture 34: Misc Multi-cores and Multi-processors
EN164: Design of Computing Systems Lecture 34: Misc Multi-cores and Multi-processors Professor Sherief Reda http://scale.engin.brown.edu Electrical Sciences and Computer Engineering School of Engineering
More informationLecture 21: Transactional Memory. Topics: Hardware TM basics, different implementations
Lecture 21: Transactional Memory Topics: Hardware TM basics, different implementations 1 Transactions New paradigm to simplify programming instead of lock-unlock, use transaction begin-end locks are blocking,
More informationCOEN-4730 Computer Architecture Lecture 08 Thread Level Parallelism and Coherence
1 COEN-4730 Computer Architecture Lecture 08 Thread Level Parallelism and Coherence Cristinel Ababei Dept. of Electrical and Computer Engineering Marquette University Credits: Slides adapted from presentations
More informationChapter 5. Thread-Level Parallelism
Chapter 5 Thread-Level Parallelism Instructor: Josep Torrellas CS433 Copyright Josep Torrellas 1999, 2001, 2002, 2013 1 Progress Towards Multiprocessors + Rate of speed growth in uniprocessors saturated
More informationShared Memory Multiprocessors
Parallel Computing Shared Memory Multiprocessors Hwansoo Han Cache Coherence Problem P 0 P 1 P 2 cache load r1 (100) load r1 (100) r1 =? r1 =? 4 cache 5 cache store b (100) 3 100: a 100: a 1 Memory 2 I/O
More informationModule 10: "Design of Shared Memory Multiprocessors" Lecture 20: "Performance of Coherence Protocols" MOESI protocol.
MOESI protocol Dragon protocol State transition Dragon example Design issues General issues Evaluating protocols Protocol optimizations Cache size Cache line size Impact on bus traffic Large cache line
More informationCache Coherence. Todd C. Mowry CS 740 November 10, Topics. The Cache Coherence Problem Snoopy Protocols Directory Protocols
Cache Coherence Todd C. Mowry CS 740 November 10, 1998 Topics The Cache Coherence roblem Snoopy rotocols Directory rotocols The Cache Coherence roblem Caches are critical to modern high-speed processors
More informationCMSC 611: Advanced. Distributed & Shared Memory
CMSC 611: Advanced Computer Architecture Distributed & Shared Memory Centralized Shared Memory MIMD Processors share a single centralized memory through a bus interconnect Feasible for small processor
More informationRole of Synchronization. CS 258 Parallel Computer Architecture Lecture 23. Hardware-Software Trade-offs in Synchronization and Data Layout
CS 28 Parallel Computer Architecture Lecture 23 Hardware-Software Trade-offs in Synchronization and Data Layout April 21, 2008 Prof John D. Kubiatowicz http://www.cs.berkeley.edu/~kubitron/cs28 Role of
More informationChapter 5 (Part II) Large and Fast: Exploiting Memory Hierarchy. Baback Izadi Division of Engineering Programs
Chapter 5 (Part II) Baback Izadi Division of Engineering Programs bai@engr.newpaltz.edu Virtual Machines Host computer emulates guest operating system and machine resources Improved isolation of multiple
More informationMULTIPROCESSORS AND THREAD-LEVEL. B649 Parallel Architectures and Programming
MULTIPROCESSORS AND THREAD-LEVEL PARALLELISM B649 Parallel Architectures and Programming Motivation behind Multiprocessors Limitations of ILP (as already discussed) Growing interest in servers and server-performance
More informationMULTIPROCESSORS AND THREAD-LEVEL PARALLELISM. B649 Parallel Architectures and Programming
MULTIPROCESSORS AND THREAD-LEVEL PARALLELISM B649 Parallel Architectures and Programming Motivation behind Multiprocessors Limitations of ILP (as already discussed) Growing interest in servers and server-performance
More informationModule 9: "Introduction to Shared Memory Multiprocessors" Lecture 16: "Multiprocessor Organizations and Cache Coherence" Shared Memory Multiprocessors
Shared Memory Multiprocessors Shared memory multiprocessors Shared cache Private cache/dancehall Distributed shared memory Shared vs. private in CMPs Cache coherence Cache coherence: Example What went
More informationCache Coherence and Atomic Operations in Hardware
Cache Coherence and Atomic Operations in Hardware Previously, we introduced multi-core parallelism. Today we ll look at 2 things: 1. Cache coherence 2. Instruction support for synchronization. And some
More informationCS 152 Computer Architecture and Engineering. Lecture 19: Directory-Based Cache Protocols
CS 152 Computer Architecture and Engineering Lecture 19: Directory-Based Protocols Dr. George Michelogiannakis EECS, University of California at Berkeley CRD, Lawrence Berkeley National Laboratory http://inst.eecs.berkeley.edu/~cs152
More informationSpeculative Locks. Dept. of Computer Science
Speculative Locks José éf. Martínez and djosep Torrellas Dept. of Computer Science University it of Illinois i at Urbana-Champaign Motivation Lock granularity a trade-off: Fine grain greater concurrency
More informationA Scalable SAS Machine
arallel omputer Organization and Design : Lecture 8 er Stenström. 2008, Sally. ckee 2009 Scalable ache oherence Design principles of scalable cache protocols Overview of design space (8.1) Basic operation
More informationMultiple Issue and Static Scheduling. Multiple Issue. MSc Informatics Eng. Beyond Instruction-Level Parallelism
Computing Systems & Performance Beyond Instruction-Level Parallelism MSc Informatics Eng. 2012/13 A.J.Proença From ILP to Multithreading and Shared Cache (most slides are borrowed) When exploiting ILP,
More informationRecall: Sequential Consistency Example. Implications for Implementation. Issues for Directory Protocols
ecall: Sequential onsistency Example S252 Graduate omputer rchitecture Lecture 21 pril 14 th, 2010 Distributed Shared ory rof John D. Kubiatowicz http://www.cs.berkeley.edu/~kubitron/cs252 rocessor 1 rocessor
More informationConsistency & Coherence. 4/14/2016 Sec5on 12 Colin Schmidt
Consistency & Coherence 4/14/2016 Sec5on 12 Colin Schmidt Agenda Brief mo5va5on Consistency vs Coherence Synchroniza5on Fences Mutexs, locks, semaphores Hardware Coherence Snoopy MSI, MESI Power, Frequency,
More informationLecture: Consistency Models, TM. Topics: consistency models, TM intro (Section 5.6)
Lecture: Consistency Models, TM Topics: consistency models, TM intro (Section 5.6) 1 Coherence Vs. Consistency Recall that coherence guarantees (i) that a write will eventually be seen by other processors,
More informationSynchronization. Jinkyu Jeong Computer Systems Laboratory Sungkyunkwan University
Synchronization Jinkyu Jeong (jinkyu@skku.edu) Computer Systems Laboratory Sungkyunkwan University http://csl.skku.edu Types of Synchronization Mutual Exclusion Locks Event Synchronization Global or group-based
More informationM4 Parallelism. Implementation of Locks Cache Coherence
M4 Parallelism Implementation of Locks Cache Coherence Outline Parallelism Flynn s classification Vector Processing Subword Parallelism Symmetric Multiprocessors, Distributed Memory Machines Shared Memory
More informationWhy memory hierarchy? Memory hierarchy. Memory hierarchy goals. CS2410: Computer Architecture. L1 cache design. Sangyeun Cho
Why memory hierarchy? L1 cache design Sangyeun Cho Computer Science Department Memory hierarchy Memory hierarchy goals Smaller Faster More expensive per byte CPU Regs L1 cache L2 cache SRAM SRAM To provide
More informationLecture: Coherence, Synchronization. Topics: directory-based coherence, synchronization primitives (Sections )
Lecture: Coherence, Synchronization Topics: directory-based coherence, synchronization primitives (Sections 5.1-5.5) 1 Cache Coherence Protocols Directory-based: A single location (directory) keeps track
More informationInterconnect Routing
Interconnect Routing store-and-forward routing switch buffers entire message before passing it on latency = [(message length / bandwidth) + fixed overhead] * # hops wormhole routing pipeline message through
More information