Basic Architecture of SMP. Shared Memory Multiprocessors. Cache Coherency -- The Problem. Cache Coherency, The Goal.
|
|
- Baldric Day
- 6 years ago
- Views:
Transcription
1 Shared emory ultiprocessors Basic Architecture of SP Buses are good news and bad news The (memory) bus is a point all processors can see and thus be informed of what is happening A bus is serially used, so can be a bottleneck By constructing a standard sequential processor and its memory system intelligently, i.e. adopting a slightly more general design, a multiprocessor element can be constructed P Cache Cache emory Pn Cache /O Devices Cache Coherency -- The Problem ndependent processors modifying shared locations can change values without other processors being aware of it reads a into cache P reads a into cache writes a to, and writes through to main memory, correcting it -- P has stale data -- P reads a into cache emory is arbiter of present value... move it closer to processors a: a: P a: emory a: P Cache Coherency, The Goal A multiprocessor memory system is coherent if for every location there exists a serial order for the operations on that location consistent with the results of the execution such that... The subsequence of operations issued by any processor are in the order issued The value urned by each read is the value written by the last write in serial order mplied property of coherency... Write Serialization -- all writes to a location are seen in the same order by all processes Snooping To Solve Coherency The cache controllers can snoop on the bus, meaning that they watch the events on the bus even if they do not issue them, noting any action applied to cache lines they hold There are two possible actions when a memory location held by processor A is changed by processor B nvalidate -- mark local copy as invalid Update -- make the same change B made The unit of cache coherency is a cache line or block Snooping At Work By snooping the cache controller for processor P can take action in response to s write reads a into cache P reads a into cache writes a to, and writes through to main memory, correcting it P sees WT, invalidates or updates P reads a into cache a: a: P a: emory P a:
2 Prd/-- Write-through Coherency Applying The WT Protocol V State diagrams show the protocol... Prd/-- Prd/Busd V BusWr/-- States on cache line... V is valid is invalid Transitions... ed is processor initiated Blue is bus initiated Labeling A/B... f A is observed Then transaction B is generated Consider the transitions reads a into cache P reads a into cache writes a to, and writes through to main memory, correcting it P sees WT, invalidates P reads a into cache : ---> V Prd/Busd P: ---> V Prd/Busd : V ---> V P: V ---> BusWr/-- P: ---> V Prd/Busd Prd/Busd BusWr/-- Partial Order On emory Operations Write bus transactions define a global sequence of events between which processors can read... any total order produced by interleaving P P P W W emory Consistency What should it mean for processors to see a consistent view of memory? Coherency is too weak because it only requires ordering with respect to individual locations, but there are other ways of binding values together -- a and flag are initially -- P a := ; while (flag!= ) ; -- spin flag := ; print a; 9 SC -- Sequential Consistency Write Atomicity Lamport: A multiprocessor is sequentially consistent if the result of any execution is the same as some sequential order, and within any processor, the operations are executed in program order Write atomicity says that all writes to a location should appear to all processors to have occurred in the same order -- a and b are initially -- P a := ; print b; b := ; print a; Possible esults a b -- a and b are initially -- P P a := ; while a!= do; while b!= do; b := ; print a;
3 Sufficient Conditions For SC Three conditions suffice for SC emory requests are issued in program order Wait for writes to complete Wait for reads to complete and the write filling the read to complete -- a and b are initially -- P P a := ; while a!= do; while b!= do; b := ; print a; Basic Write-back Snoopy Cache Designs Write-back protocols are more complex than write-through because modified data remains in the cache ntroduce more cache states odified, or dirty, the value differs from memory xclusive, no other cache has this value We consider three S, an invalidation protocol S, an invalidation protocol Dragon, an update protocol S Protocol Prd/-- S n Action Prd/-- Busd/Flush BusdX BusdX/Flush BusdX V Prd/-- BusdX/-- Prd/Busd Busd/-- Proc Action P P Bus Data P: r a S - - Bd em P: r a S - S Bd em P:w a - Bdx em P: r a S - S Bd P : r a S S S Bd em BusdX BusdX V Busd/Flush BusdX/Flush Prd/-- BusdX/-- Prd/Busd Busd/-- S (llinois) Protocol Dragon -- An Update Protocol The problem with S is that bus transactions are needed just to load and update a value even with no one is sharing Add a new state to get = modified or dirty, value differs from memory = exclusive, clean, one cached copy S = shared, multiple cached copies = invalid The caches are the valid memory contents -- memory is changed only when a line is needed ntroduce Shared clean (Sc) and Shared odified (Sm) states, dump nvalid Need to introduce the concept of ead iss and Write iss Add a shared line to the bus The basic idea: Keep all lines of all caches current -- note that updates will update modified word only
4 Yes First reference to memory block by processor. False-sharinginval-cap iss classi cation Other. Cold eplacement last copy nvalidation. Cold Yes Yes still there. False-sharing- Yes cold odi ed Yes. True-sharing- word(s) accessed odi ed cold word(s) accessed Yes. True-sharinginval-cap. Purefalse-sharing. Purue-sharing replacement Yes Yes odi ed word(s) accessed Yes 9. Pure-. True-sharingcapacity capacity. False-sharing-. True-sharingcap-inval cap-inval x d l t x t 9 Dragon Protocol Prdiss/ Busd(S ) PrWriss/ (Busd(S); BusUpd) Prd/-- Sm Prd/-- Busd/Flush BusUpd(S ) Prd/-- BusUpd(S) Busd/Flush BusUpd(S) Prd/-- Sc BusUpd(S ) Prdiss/ Busd(S) PrWriss/ Busd(S ) Prd/-- Prdiss/ Busd(S ) Prd/-- Dragon n Action PrWriss/ (Busd(S); BusUpd) Sm Prd/-- Prd/-- Busd/Flush BusUpd(S ) Prd/-- BusUpd(S) Busd/Flush BusUpd(S) Sc BusUpd(S ) Proc Action P P Bus Data Proc Action P P Bus Data P: r a - - Bd em P: r a Sc - Sm null - P: r a Sc - Sc Bd em : r a Sc Sc Sm Bd P P:w a Sc - Sm BUpd P Prdiss/ Busd(S) PrWriss/ Busd(S ) Prd/-- Assessing trade-offs You cannot design a protocol and reason through how it will work, and probably not even whether it is correct... any criteria for success Good use of bandwidth apid response Performance must be relative to some computation... what s typical? Protocol Optimizations: Worth t? ffect of state, and of BusUpgr instead of BusdX Traffic (B/s) Barnes/ Barnes/St Barnes/St-dx LU/ LU/St LU/St-dx Ocean/ Ocean/S Ocean/St-dx adiosity/ adiosity/st adiosity/st-dx adix/ adix/st adix/st-dx ll x aytrace/ aytrace/st aytrace/st-dx Traffic (B/s) Appl-Code/ Appl-Code/St Appl-Code/St-dx Appl-Data/ Appl-Data/St Appl-Data/St-dx OS-Code/ OS-Code/St OS-Code/St-dx OS-Data/ OS-Data/St OS-Data/St-dx S vs S little difference; Upgrade helps some Caching Properties Cache-block size Cache misses have long been categorized Compulsory Capacity Conflict Add Coherence Sharing in forms True False First access systemwide Written before odi ed word(s) accessed eason for miss eason for elimination of Old copy with state = invalid odi ed word(s) accessed Has block been modi ed since iss rate (%) Barnes/ Several features illustrated Upgrade False sharing True sharing Capacity Cold Barnes/ Barnes/ Barnes/ Barnes/ Barnes/ Lu/ Lu/ Lu/ Lu/ Lu/ Lu/ adiosity/ adiosity/ adiosity/ adiosity/ adiosity/ adiosity/ iss rate (%) Upgrade False sharing True sharing Capacity Cold Ocean/ Ocean/ Ocean/ Ocean/ Ocean/ Ocean/ adix/ adix/ adix/ adix/ adix/ adix/ aytrace/ aytrace/ aytrace/ aytrace/ aytrace/ aytrace/
5 Block Size Affects Traffic Update vs nvalidate Traffic (bytes/instructions) Barnes/ Barnes/ Barnes/ Barnes/ Barnes/ Barnes/ adiosity/ adiosity/ adiosity/ adiosity/ adiosity/ adiosity/ aytrace/ aytrace/ aytrace/ aytrace/ aytrace/ aytrace/ Contention increases Traffic (bytes/instruction) 9 adix/ adix/ adix/ adix/ adix/ adix/ Traffic (bytes/flop) LU/ LU/ LU/ LU/ LU/ LU/ Ocean/ Ocean/ Ocean/ Ocean/ Ocean/ Ocean/ ntuition -- f use implies continued to use, and writes between use are few, update should do better e.g. producer-consumer pattern f use implies unlikely to use again, or many writes between reads, updates not good pack rat phenomenon particularly bad under process migration useless updates where only last one will be used Update vs nvalidation uch coherence: updates help uch capacity updates hurt iss rate (%) False sharing True sharing Capacity Cold iss rate (%)..... There s ore To Story Bus traffic is huge A single processor tends to write a lot before other proc resds any bus updates vs one invalidate! LU/inv LU/upd Ocean/inv Ocean/mix Ocean/upd aytrace/inv aytrace/upd... Upgrade/update rate (%).... Upgrade/update rate (%) adix/inv.. adix/mix LU/inv LU/upd Ocean/inv Ocean/mix Ocean/upd aytrace/inv aytrace/upd adix/inv adix/mix adix/upd adix/upd Synchronization Simple Software Lock A long and glorious past A huge time cost in parallel programs Though studied intensively it is still not really solved The problem: Processes must share information, but its integrity must be preserved Some hardware assist is essential in order to achieve atomicity User say, Just give me primitives that work lock: ld,loc -- get fresh value cmp loc, # -- test if it changed bnz,lock -- spin if loc ~ free? st loc, # -- its free, set it and unlock: st loc,# -- clear setting The code seems simple enough, but consider various interleavings 9
6 Test&Set s A Simple Solution Test_and_set, Loc fetches Loc s value, and sets it to, urning value to Consider its operation lock: t&s,loc -- atomically set bnz,loc -- spin if loc ~ free? and unlock: st Loc,# -- clear setting Performance Of a Lock lock; delay (c) ; unlock on SG Challenge Test&set, c = Test&set, exponential backof f, c =. Test&set, exponential backof f, c = deal 9 Number of processors Performance Goals of Locks mproved Hardware Primitives Locks affect performance -- critical aspects Low Latency Low Traffic Scalability Low Storage Cost Fairness Seek a basic primitive suitable for range of cases Load-Locked (or Load-Linked), Store-Conditional LL reads location into a register Follow with arbitrary instructions to manipulate value SC tries to store back to location if and only if no other processor has written to the variable since this processor s LL f SC succeeds, all three steps happened atomically f fail, don t write or generate invalidations (must ry LL) Success indicated by condition codes Sample Use of LL-SC Performance lock: ll, Loc -- Load-lock Loc to bnz, lock -- Spin if Loc locked sc Loc, -- Cond ly store in Loc beqz lock -- f failed, repeat and unlock: st Loc, # -- Clear location any processes can do an LL at once, but only the first to the SC wins lock; delay(c); unlock; delay(d) on SG Challenge Array-based LL-SC LL-SC, exponential Ticket Ticket, proportional Number of processors Number of processors Number of processors (a) Null (c =, d = ) (b) Critical-section (c =. µs, d = ) (c) Delay (c =. µs, d =.9 µs)
7 Software mplications Blocked allocation of D array eferences straddling cache lines loses on Fragmentation False Sharing Cache block straddles partition boundary P P P P P P P P P (a) Two-dimensional array
ECE PP used in class for assessing cache coherence protocols
ECE 5315 PP used in class for assessing cache coherence protocols Assessing Protocol Design The benchmark programs are executed on a multiprocessor simulator The state transitions observed determine the
More informationL7 Shared Memory Multiprocessors. Shared Memory Multiprocessors
L7 Shared Memory Multiprocessors Logical design and software interactions 1 Shared Memory Multiprocessors Symmetric Multiprocessors (SMPs) Symmetric access to all of main memory from any processor Dominate
More informationCache Coherence: Part 1
Cache Coherence: art 1 Todd C. Mowry CS 74 October 5, Topics The Cache Coherence roblem Snoopy rotocols The Cache Coherence roblem 1 3 u =? u =? $ 4 $ 5 $ u:5 u:5 1 I/O devices u:5 u = 7 3 Memory A Coherent
More informationShared Memory Architectures. Approaches to Building Parallel Machines
Shared Memory Architectures Arvind Krishnamurthy Fall 2004 Approaches to Building Parallel Machines P 1 Switch/Bus P n Scale (Interleaved) First-level $ P 1 P n $ $ (Interleaved) Main memory Shared Cache
More informationSnooping coherence protocols (cont.)
Snooping coherence protocols (cont.) A four-state update protocol [ 5.3.3] When there is a high degree of sharing, invalidation-based protocols perform poorly. Blocks are often invalidated, and then have
More informationSymmetric Multiprocessors
Symmetric Multiprocessors Implementing a single memory image operated upon by multiple processors is possible at small scales. The SMP is a standard design in which cache coherency allows all processors
More information[ 5.4] What cache line size is performs best? Which protocol is best to use?
Performance results [ 5.4] What cache line size is performs best? Which protocol is best to use? Questions like these can be answered by simulation. However, getting the answer write is part art and part
More informationApproaches to Building Parallel Machines. Shared Memory Architectures. Example Cache Coherence Problem. Shared Cache Architectures
Approaches to Building arallel achines Switch/Bus n Scale Shared ory Architectures (nterleaved) First-level (nterleaved) ain memory n Arvind Krishnamurthy Fall 2004 (nterleaved) ain memory Shared Cache
More informationA three-state update protocol
A three-state update protocol Whenever a bus update is generated, suppose that main memory as well as the caches updates its contents. Then which state don t we need? What s the advantage, then, of having
More informationLecture 9 Outline. Lower-Level Protocol Choices. MESI (4-state) Invalidation Protocol. MESI: Processor-Initiated Transactions
Outline protocol Dragon updatebased protocol mpact of protocol optimizations LowerLevel Protocol Choices observed in state: what transition to make? Change to : assume ll read again soon good for mostly
More informationRecall: Sequential Consistency. CS 258 Parallel Computer Architecture Lecture 15. Sequential Consistency and Snoopy Protocols
CS 258 Parallel Computer Architecture Lecture 15 Sequential Consistency and Snoopy Protocols arch 17, 2008 Prof John D. Kubiatowicz http://www.cs.berkeley.edu/~kubitron/cs258 ecall: Sequential Consistency
More informationNOW Handout Page 1. Recap. Protocol Design Space of Snooping Cache Coherent Multiprocessors. Sequential Consistency.
Recap Protocol Design Space of Snooping Cache Coherent ultiprocessors CS 28, Spring 99 David E. Culler Computer Science Division U.C. Berkeley Snooping cache coherence solve difficult problem by applying
More informationPerformance of coherence protocols
Performance of coherence protocols Cache misses have traditionally been classified into four categories: Cold misses (or compulsory misses ) occur the first time that a block is referenced. Conflict misses
More informationThe MESI State Transition Graph
Small-scale shared memory multiprocessors Semantics of the shared address space model (Ch. 5.3-5.5) Design of the M(O)ESI snoopy protocol Design of the Dragon snoopy protocol Performance issues Synchronization
More informationShared Memory Multiprocessors
Parallel Computing Shared Memory Multiprocessors Hwansoo Han Cache Coherence Problem P 0 P 1 P 2 cache load r1 (100) load r1 (100) r1 =? r1 =? 4 cache 5 cache store b (100) 3 100: a 100: a 1 Memory 2 I/O
More informationLecture 33: Multiprocessors Synchronization and Consistency Professor Randy H. Katz Computer Science 252 Spring 1996
Lecture 33: Multiprocessors Synchronization and Consistency Professor Randy H. Katz Computer Science 252 Spring 1996 RHK.S96 1 Review: Miss Rates for Snooping Protocol 4th C: Coherency Misses More processors:
More informationAlewife Messaging. Sharing of Network Interface. Alewife User-level event mechanism. CS252 Graduate Computer Architecture.
CS252 Graduate Computer Architecture Lecture 18 April 5 th, 2010 ory Consistency Models and Snoopy Bus Protocols Alewife Messaging Send message write words to special network interface registers Execute
More informationChapter 5. Multiprocessors and Thread-Level Parallelism
Computer Architecture A Quantitative Approach, Fifth Edition Chapter 5 Multiprocessors and Thread-Level Parallelism 1 Introduction Thread-Level parallelism Have multiple program counters Uses MIMD model
More informationLecture 11: Snooping Cache Coherence: Part II. CMU : Parallel Computer Architecture and Programming (Spring 2012)
Lecture 11: Snooping Cache Coherence: Part II CMU 15-418: Parallel Computer Architecture and Programming (Spring 2012) Announcements Assignment 2 due tonight 11:59 PM - Recall 3-late day policy Assignment
More informationM4 Parallelism. Implementation of Locks Cache Coherence
M4 Parallelism Implementation of Locks Cache Coherence Outline Parallelism Flynn s classification Vector Processing Subword Parallelism Symmetric Multiprocessors, Distributed Memory Machines Shared Memory
More informationComputer Architecture. A Quantitative Approach, Fifth Edition. Chapter 5. Multiprocessors and Thread-Level Parallelism
Computer Architecture A Quantitative Approach, Fifth Edition Chapter 5 Multiprocessors and Thread-Level Parallelism 1 Introduction Thread-Level parallelism Have multiple program counters Uses MIMD model
More informationBus-Based Coherent Multiprocessors
Bus-Based Coherent Multiprocessors Lecture 13 (Chapter 7) 1 Outline Bus-based coherence Memory consistency Sequential consistency Invalidation vs. update coherence protocols Several Configurations for
More information4 Chip Multiprocessors (I) Chip Multiprocessors (ACS MPhil) Robert Mullins
4 Chip Multiprocessors (I) Robert Mullins Overview Coherent memory systems Introduction to cache coherency protocols Advanced cache coherency protocols, memory systems and synchronization covered in the
More informationReview: Multiprocessor. CPE 631 Session 21: Multiprocessors (Part 2) Potential HW Coherency Solutions. Bus Snooping Topology
Review: Multiprocessor CPE 631 Session 21: Multiprocessors (Part 2) Department of Electrical and Computer Engineering University of Alabama in Huntsville Basic issues and terminology Communication: share
More informationLecture 11: Cache Coherence: Part II. Parallel Computer Architecture and Programming CMU /15-618, Spring 2015
Lecture 11: Cache Coherence: Part II Parallel Computer Architecture and Programming CMU 15-418/15-618, Spring 2015 Tunes Bang Bang (My Baby Shot Me Down) Nancy Sinatra (Kill Bill Volume 1 Soundtrack) It
More informationCS/ECE 757: Advanced Computer Architecture II (Parallel Computer Architecture) Symmetric Multiprocessors Part 1 (Chapter 5)
CS/ECE 757: Advanced Computer Architecture II (Parallel Computer Architecture) Symmetric Multiprocessors Part 1 (Chapter 5) Copyright 2001 Mark D. Hill University of Wisconsin-Madison Slides are derived
More informationCaches. Parallel Systems. Caches - Finding blocks - Caches. Parallel Systems. Parallel Systems. Lecture 3 1. Lecture 3 2
Parallel ystems Parallel ystems Parallel ystems Outline for lecture 3 s (a quick review) hared memory multiprocessors hierarchies coherence nooping protocols» nvalidation protocols (, )» Update protocol
More informationProcessor Architecture
Processor Architecture Shared Memory Multiprocessors M. Schölzel The Coherence Problem s may contain local copies of the same memory address without proper coordination they work independently on their
More informationChapter 8. Multiprocessors. In-Cheol Park Dept. of EE, KAIST
Chapter 8. Multiprocessors In-Cheol Park Dept. of EE, KAIST Can the rapid rate of uniprocessor performance growth be sustained indefinitely? If the pace does slow down, multiprocessor architectures will
More informationRole of Synchronization. CS 258 Parallel Computer Architecture Lecture 23. Hardware-Software Trade-offs in Synchronization and Data Layout
CS 28 Parallel Computer Architecture Lecture 23 Hardware-Software Trade-offs in Synchronization and Data Layout April 21, 2008 Prof John D. Kubiatowicz http://www.cs.berkeley.edu/~kubitron/cs28 Role of
More informationMultiprocessors II: CC-NUMA DSM. CC-NUMA for Large Systems
Multiprocessors II: CC-NUMA DSM DSM cache coherence the hardware stuff Today s topics: what happens when we lose snooping new issues: global vs. local cache line state enter the directory issues of increasing
More informationLecture 19: Coherence and Synchronization. Topics: synchronization primitives (Sections )
Lecture 19: Coherence and Synchronization Topics: synchronization primitives (Sections 5.4-5.5) 1 Caching Locks Spin lock: to acquire a lock, a process may enter an infinite loop that keeps attempting
More informationCache Coherence. CMU : Parallel Computer Architecture and Programming (Spring 2012)
Cache Coherence CMU 15-418: Parallel Computer Architecture and Programming (Spring 2012) Shared memory multi-processor Processors read and write to shared variables - More precisely: processors issues
More informationModule 10: "Design of Shared Memory Multiprocessors" Lecture 20: "Performance of Coherence Protocols" MOESI protocol.
MOESI protocol Dragon protocol State transition Dragon example Design issues General issues Evaluating protocols Protocol optimizations Cache size Cache line size Impact on bus traffic Large cache line
More informationMULTIPROCESSORS AND THREAD LEVEL PARALLELISM
UNIT III MULTIPROCESSORS AND THREAD LEVEL PARALLELISM 1. Symmetric Shared Memory Architectures: The Symmetric Shared Memory Architecture consists of several processors with a single physical memory shared
More informationShared Memory Architectures. Shared Memory Multiprocessors. Caches and Cache Coherence. Cache Memories. Cache Memories Write Operation
hared Architectures hared Multiprocessors ngo ander ngo@imit.kth.se hared Multiprocessor are often used pecial Class: ymmetric Multiprocessors (MP) o ymmetric access to all of main from any processor A
More informationLecture 3: Snooping Protocols. Topics: snooping-based cache coherence implementations
Lecture 3: Snooping Protocols Topics: snooping-based cache coherence implementations 1 Design Issues, Optimizations When does memory get updated? demotion from modified to shared? move from modified in
More informationChapter 5. Multiprocessors and Thread-Level Parallelism
Computer Architecture A Quantitative Approach, Fifth Edition Chapter 5 Multiprocessors and Thread-Level Parallelism 1 Introduction Thread-Level parallelism Have multiple program counters Uses MIMD model
More informationPage 1. Lecture 12: Multiprocessor 2: Snooping Protocol, Directory Protocol, Synchronization, Consistency. Bus Snooping Topology
CS252 Graduate Computer Architecture Lecture 12: Multiprocessor 2: Snooping Protocol, Directory Protocol, Synchronization, Consistency Review: Multiprocessor Basic issues and terminology Communication:
More informationPage 1. SMP Review. Multiprocessors. Bus Based Coherence. Bus Based Coherence. Characteristics. Cache coherence. Cache coherence
SMP Review Multiprocessors Today s topics: SMP cache coherence general cache coherence issues snooping protocols Improved interaction lots of questions warning I m going to wait for answers granted it
More informationIntroduction. Coherency vs Consistency. Lec-11. Multi-Threading Concepts: Coherency, Consistency, and Synchronization
Lec-11 Multi-Threading Concepts: Coherency, Consistency, and Synchronization Coherency vs Consistency Memory coherency and consistency are major concerns in the design of shared-memory systems. Consistency
More informationPage 1. Cache Coherence
Page 1 Cache Coherence 1 Page 2 Memory Consistency in SMPs CPU-1 CPU-2 A 100 cache-1 A 100 cache-2 CPU-Memory bus A 100 memory Suppose CPU-1 updates A to 200. write-back: memory and cache-2 have stale
More informationLecture: Coherence and Synchronization. Topics: synchronization primitives, consistency models intro (Sections )
Lecture: Coherence and Synchronization Topics: synchronization primitives, consistency models intro (Sections 5.4-5.5) 1 Performance Improvements What determines performance on a multiprocessor: What fraction
More informationCSE502: Computer Architecture CSE 502: Computer Architecture
CSE 502: Computer Architecture Shared-Memory Multi-Processors Shared-Memory Multiprocessors Multiple threads use shared memory (address space) SysV Shared Memory or Threads in software Communication implicit
More informationChapter-4 Multiprocessors and Thread-Level Parallelism
Chapter-4 Multiprocessors and Thread-Level Parallelism We have seen the renewed interest in developing multiprocessors in early 2000: - The slowdown in uniprocessor performance due to the diminishing returns
More informationMemory Consistency and Multiprocessor Performance. Adapted from UCB CS252 S01, Copyright 2001 USB
Memory Consistency and Multiprocessor Performance Adapted from UCB CS252 S01, Copyright 2001 USB 1 Memory Consistency Model Define memory correctness for parallel execution Execution appears to the that
More informationMultiprocessor Synchronization
Multiprocessor Synchronization Material in this lecture in Henessey and Patterson, Chapter 8 pgs. 694-708 Some material from David Patterson s slides for CS 252 at Berkeley 1 Multiprogramming and Multiprocessing
More informationThe need for atomicity This code sequence illustrates the need for atomicity. Explain.
Lock Implementations [ 8.1] Recall the three kinds of synchronization from Lecture 6: Point-to-point Lock Performance metrics for lock implementations Uncontended latency Traffic o Time to acquire a lock
More informationSwitch Gear to Memory Consistency
Outline Memory consistency equential consistency Invalidation vs. update coherence protocols MI protocol tate diagrams imulation Gehringer, based on slides by Yan olihin 1 witch Gear to Memory Consistency
More informationFall 2012 EE 6633: Architecture of Parallel Computers Lecture 4: Shared Address Multiprocessors Acknowledgement: Dave Patterson, UC Berkeley
Fall 2012 EE 6633: Architecture of Parallel Computers Lecture 4: Shared Address Multiprocessors Acknowledgement: Dave Patterson, UC Berkeley Avinash Kodi Department of Electrical Engineering & Computer
More informationComputer Architecture
18-447 Computer Architecture CSCI-564 Advanced Computer Architecture Lecture 29: Consistency & Coherence Lecture 20: Consistency and Coherence Bo Wu Prof. Onur Mutlu Colorado Carnegie School Mellon University
More informationSuggested Readings! What makes a memory system coherent?! Lecture 27" Cache Coherency! ! Readings! ! Program order!! Sequential writes!! Causality!
1! 2! Suggested Readings! Readings!! H&P: Chapter 5.8! Could also look at material on CD referenced on p. 538 of your text! Lecture 27" Cache Coherency! 3! Processor components! Multicore processors and
More informationLecture 18: Coherence and Synchronization. Topics: directory-based coherence protocols, synchronization primitives (Sections
Lecture 18: Coherence and Synchronization Topics: directory-based coherence protocols, synchronization primitives (Sections 5.1-5.5) 1 Cache Coherence Protocols Directory-based: A single location (directory)
More informationEE382 Processor Design. Processor Issues for MP
EE382 Processor Design Winter 1998 Chapter 8 Lectures Multiprocessors, Part I EE 382 Processor Design Winter 98/99 Michael Flynn 1 Processor Issues for MP Initialization Interrupts Virtual Memory TLB Coherency
More informationEC 513 Computer Architecture
EC 513 Computer Architecture Cache Coherence - Directory Cache Coherence Prof. Michel A. Kinsy Shared Memory Multiprocessor Processor Cores Local Memories Memory Bus P 1 Snoopy Cache Physical Memory P
More informationSymmetric Multiprocessors: Synchronization and Sequential Consistency
Constructive Computer Architecture Symmetric Multiprocessors: Synchronization and Sequential Consistency Arvind Computer Science & Artificial Intelligence Lab. Massachusetts Institute of Technology November
More informationShared Memory Multiprocessors
Shared Memory Multiprocessors Jesús Labarta Index 1 Shared Memory architectures............... Memory Interconnect Cache Processor Concepts? Memory Time 2 Concepts? Memory Load/store (@) Containers Time
More informationCMSC Computer Architecture Lecture 15: Memory Consistency and Synchronization. Prof. Yanjing Li University of Chicago
CMSC 22200 Computer Architecture Lecture 15: Memory Consistency and Synchronization Prof. Yanjing Li University of Chicago Administrative Stuff! Lab 5 (multi-core) " Basic requirements: out later today
More informationParallel Computer Architecture Spring Distributed Shared Memory Architectures & Directory-Based Memory Coherence
Parallel Computer Architecture Spring 2018 Distributed Shared Memory Architectures & Directory-Based Memory Coherence Nikos Bellas Computer and Communications Engineering Department University of Thessaly
More information250P: Computer Systems Architecture. Lecture 14: Synchronization. Anton Burtsev March, 2019
250P: Computer Systems Architecture Lecture 14: Synchronization Anton Burtsev March, 2019 Coherence and Synchronization Topics: synchronization primitives (Sections 5.4-5.5) 2 Constructing Locks Applications
More informationShared Memory Multiprocessors. Symmetric Shared Memory Architecture (SMP) Cache Coherence. Cache Coherence Mechanism. Interconnection Network
Shared Memory Multis Processor Processor Processor i Processor n Symmetric Shared Memory Architecture (SMP) cache cache cache cache Interconnection Network Main Memory I/O System Cache Coherence Cache
More informationMultiprocessor Cache Coherence. Chapter 5. Memory System is Coherent If... From ILP to TLP. Enforcing Cache Coherence. Multiprocessor Types
Chapter 5 Multiprocessor Cache Coherence Thread-Level Parallelism 1: read 2: read 3: write??? 1 4 From ILP to TLP Memory System is Coherent If... ILP became inefficient in terms of Power consumption Silicon
More informationScalable Cache Coherence. Jinkyu Jeong Computer Systems Laboratory Sungkyunkwan University
Scalable Cache Coherence Jinkyu Jeong (jinkyu@skku.edu) Computer Systems Laboratory Sungkyunkwan University http://csl.skku.edu Hierarchical Cache Coherence Hierarchies in cache organization Multiple levels
More informationAleksandar Milenkovic, Electrical and Computer Engineering University of Alabama in Huntsville. Review: Small-Scale Shared Memory
Lecture 20: Multiprocessors Aleksandar Milenkovic, milenka@ece.uah.edu Electrical and Computer Engineering University of Alabama in Huntsville Review: Small-Scale Shared Memory Caches serve to: Increase
More informationMULTIPROCESSORS AND THREAD-LEVEL. B649 Parallel Architectures and Programming
MULTIPROCESSORS AND THREAD-LEVEL PARALLELISM B649 Parallel Architectures and Programming Motivation behind Multiprocessors Limitations of ILP (as already discussed) Growing interest in servers and server-performance
More informationMULTIPROCESSORS AND THREAD-LEVEL PARALLELISM. B649 Parallel Architectures and Programming
MULTIPROCESSORS AND THREAD-LEVEL PARALLELISM B649 Parallel Architectures and Programming Motivation behind Multiprocessors Limitations of ILP (as already discussed) Growing interest in servers and server-performance
More informationECE7660 Parallel Computer Architecture. Shared Memory Multiprocessors
ECE7660 Parallel Computer Architecture Shared Memory Multiprocessors 1 Layer Perspective CAD Database Scientific modeling Parallel applications Multipr ogramming Shar ed addr ess Message passing Data parallel
More informationMemory Consistency and Multiprocessor Performance
Memory Consistency Model Memory Consistency and Multiprocessor Performance Define memory correctness for parallel execution Execution appears to the that of some correct execution of some theoretical parallel
More informationFlynn s Classification
Flynn s Classification SISD (Single Instruction Single Data) Uniprocessors MISD (Multiple Instruction Single Data) No machine is built yet for this type SIMD (Single Instruction Multiple Data) Examples:
More informationMultiprocessor Cache Coherency. What is Cache Coherence?
Multiprocessor Cache Coherency CS448 1 What is Cache Coherence? Two processors can have two different values for the same memory location 2 1 Terminology Coherence Defines what values can be returned by
More informationCache Coherence in Bus-Based Shared Memory Multiprocessors
Cache Coherence in Bus-Based Shared Memory Multiprocessors Shared Memory Multiprocessors Variations Cache Coherence in Shared Memory Multiprocessors A Coherent Memory System: Intuition Formal Definition
More informationScalable Cache Coherence
Scalable Cache Coherence [ 8.1] All of the cache-coherent systems we have talked about until now have had a bus. Not only does the bus guarantee serialization of transactions; it also serves as convenient
More informationScalable Cache Coherent Systems
NUM SS Scalable ache oherent Systems Scalable distributed shared memory machines ssumptions: rocessor-ache-memory nodes connected by scalable network. Distributed shared physical address space. ommunication
More informationScalable Cache Coherence
arallel Computing Scalable Cache Coherence Hwansoo Han Hierarchical Cache Coherence Hierarchies in cache organization Multiple levels of caches on a processor Large scale multiprocessors with hierarchy
More informationEN2910A: Advanced Computer Architecture Topic 05: Coherency of Memory Hierarchy
EN2910A: Advanced Computer Architecture Topic 05: Coherency of Memory Hierarchy Prof. Sherief Reda School of Engineering Brown University Material from: Parallel Computer Organization and Design by Debois,
More informationLecture: Coherence, Synchronization. Topics: directory-based coherence, synchronization primitives (Sections )
Lecture: Coherence, Synchronization Topics: directory-based coherence, synchronization primitives (Sections 5.1-5.5) 1 Cache Coherence Protocols Directory-based: A single location (directory) keeps track
More informationCS252 Spring 2017 Graduate Computer Architecture. Lecture 12: Cache Coherence
CS252 Spring 2017 Graduate Computer Architecture Lecture 12: Cache Coherence Lisa Wu, Krste Asanovic http://inst.eecs.berkeley.edu/~cs252/sp17 WU UCB CS252 SP17 Last Time in Lecture 11 Memory Systems DRAM
More informationEN2910A: Advanced Computer Architecture Topic 05: Coherency of Memory Hierarchy Prof. Sherief Reda School of Engineering Brown University
EN2910A: Advanced Computer Architecture Topic 05: Coherency of Memory Hierarchy Prof. Sherief Reda School of Engineering Brown University Material from: Parallel Computer Organization and Design by Debois,
More informationMultiprocessors. Key questions: How do parallel processors share data? How do parallel processors coordinate? How many processors?
Multiprocessors Key questions: How do parallel processors share data? How do parallel processors coordinate? How many processors? Two main approaches to sharing data: Shared memory (single address space)
More informationPage 1. Outline. Coherence vs. Consistency. Why Consistency is Important
Outline ECE 259 / CPS 221 Advanced Computer Architecture II (Parallel Computer Architecture) Memory Consistency Models Copyright 2006 Daniel J. Sorin Duke University Slides are derived from work by Sarita
More informationSynchronization. Coherency protocols guarantee that a reading processor (thread) sees the most current update to shared data.
Synchronization Coherency protocols guarantee that a reading processor (thread) sees the most current update to shared data. Coherency protocols do not: make sure that only one thread accesses shared data
More information1. Memory technology & Hierarchy
1. Memory technology & Hierarchy Back to caching... Advances in Computer Architecture Andy D. Pimentel Caches in a multi-processor context Dealing with concurrent updates Multiprocessor architecture In
More informationParallel Computer Architecture Lecture 5: Cache Coherence. Chris Craik (TA) Carnegie Mellon University
18-742 Parallel Computer Architecture Lecture 5: Cache Coherence Chris Craik (TA) Carnegie Mellon University Readings: Coherence Required for Review Papamarcos and Patel, A low-overhead coherence solution
More informationComputer Architecture
Jens Teubner Computer Architecture Summer 2016 1 Computer Architecture Jens Teubner, TU Dortmund jens.teubner@cs.tu-dortmund.de Summer 2016 Jens Teubner Computer Architecture Summer 2016 83 Part III Multi-Core
More informationCS 252 Graduate Computer Architecture. Lecture 11: Multiprocessors-II
CS 252 Graduate Computer Architecture Lecture 11: Multiprocessors-II Krste Asanovic Electrical Engineering and Computer Sciences University of California, Berkeley http://www.eecs.berkeley.edu/~krste http://inst.eecs.berkeley.edu/~cs252
More informationIntroduction to Multiprocessors (Part II) Cristina Silvano Politecnico di Milano
Introduction to Multiprocessors (Part II) Cristina Silvano Politecnico di Milano Outline The problem of cache coherence Snooping protocols Directory-based protocols Prof. Cristina Silvano, Politecnico
More informationModule 7: Synchronization Lecture 13: Introduction to Atomic Primitives. The Lecture Contains: Synchronization. Waiting Algorithms.
The Lecture Contains: Synchronization Waiting Algorithms Implementation Hardwired Locks Software Locks Hardware Support Atomic Exchange Test & Set Fetch & op Compare & Swap Traffic of Test & Set Backoff
More informationChapter 5. Thread-Level Parallelism
Chapter 5 Thread-Level Parallelism Instructor: Josep Torrellas CS433 Copyright Josep Torrellas 1999, 2001, 2002, 2013 1 Progress Towards Multiprocessors + Rate of speed growth in uniprocessors saturated
More informationEN164: Design of Computing Systems Lecture 34: Misc Multi-cores and Multi-processors
EN164: Design of Computing Systems Lecture 34: Misc Multi-cores and Multi-processors Professor Sherief Reda http://scale.engin.brown.edu Electrical Sciences and Computer Engineering School of Engineering
More informationSnooping-Based Cache Coherence
Lecture 10: Snooping-Based Cache Coherence Parallel Computer Architecture and Programming CMU 15-418/15-618, Spring 2017 Tunes Elle King Ex s & Oh s (Love Stuff) Once word about my code profiling skills
More informationSynchronization. Jinkyu Jeong Computer Systems Laboratory Sungkyunkwan University
Synchronization Jinkyu Jeong (jinkyu@skku.edu) Computer Systems Laboratory Sungkyunkwan University http://csl.skku.edu Types of Synchronization Mutual Exclusion Locks Event Synchronization Global or group-based
More informationAdvanced OpenMP. Lecture 3: Cache Coherency
Advanced OpenMP Lecture 3: Cache Coherency Cache coherency Main difficulty in building multiprocessor systems is the cache coherency problem. The shared memory programming model assumes that a shared variable
More informationSynchronization. Jinkyu Jeong Computer Systems Laboratory Sungkyunkwan University
Synchronization Jinkyu Jeong (jinkyu@skku.edu) Computer Systems Laboratory Sungkyunkwan University http://csl.skku.edu SSE3054: Multicore Systems, Spring 2017, Jinkyu Jeong (jinkyu@skku.edu) Types of Synchronization
More information5 Chip Multiprocessors (II) Chip Multiprocessors (ACS MPhil) Robert Mullins
5 Chip Multiprocessors (II) Chip Multiprocessors (ACS MPhil) Robert Mullins Overview Synchronization hardware primitives Cache Coherency Issues Coherence misses, false sharing Cache coherence and interconnects
More informationScalable Locking. Adam Belay
Scalable Locking Adam Belay Problem: Locks can ruin performance 12 finds/sec 9 6 Locking overhead dominates 3 0 0 6 12 18 24 30 36 42 48 Cores Problem: Locks can ruin performance the locks
More informationCMSC 411 Computer Systems Architecture Lecture 21 Multiprocessors 3
MS 411 omputer Systems rchitecture Lecture 21 Multiprocessors 3 Outline Review oherence Write onsistency dministrivia Snooping Building Blocks Snooping protocols and examples oherence traffic and performance
More informationLecture 24: Board Notes: Cache Coherency
Lecture 24: Board Notes: Cache Coherency Part A: What makes a memory system coherent? Generally, 3 qualities that must be preserved (SUGGESTIONS?) (1) Preserve program order: - A read of A by P 1 will
More informationCache Coherence in Scalable Machines
ache oherence in Scalable Machines SE 661 arallel and Vector Architectures rof. Muhamed Mudawar omputer Engineering Department King Fahd University of etroleum and Minerals Generic Scalable Multiprocessor
More informationScalable Cache Coherent Systems Scalable distributed shared memory machines Assumptions:
Scalable ache oherent Systems Scalable distributed shared memory machines ssumptions: rocessor-ache-memory nodes connected by scalable network. Distributed shared physical address space. ommunication assist
More informationLecture 24: Multiprocessing Computer Architecture and Systems Programming ( )
Systems Group Department of Computer Science ETH Zürich Lecture 24: Multiprocessing Computer Architecture and Systems Programming (252-0061-00) Timothy Roscoe Herbstsemester 2012 Most of the rest of this
More information