Multiprocessors continued


Multiprocessors continued. IBM's Power7 with eight cores and 32 MB of eDRAM; a quad-core Kentsfield package. (Slides: Brehob 2014, portions Falsafi, Hill, Hoe, Lipasti, Martin, Roth, Shen, Smith, Sohi, Vijaykumar, Wenisch.)

Quick overview: Why do we have multiprocessors? What types of programs will work well? What types work poorly?

Overview of today's lecture: Performance expectations (Amdahl's law). Review: shared memory vs. message passing, interconnection networks, thread-level parallelism. Context: bandwidth. Coherence: snoopy-bus protocols. Consistency: strong (sequential consistency) and weak (relaxed consistency) models. Locking mechanisms: test-and-set; load-locked/store-conditional.

Let's start at the top: Performance Expectations. Amdahl's Law: if a fraction P of your program can be sped up by a factor of S, then your speedup is 1 / ((1 - P) + P/S). So if P = 0.3 and S = 2 we get 1/(0.7 + 0.15) = 1/0.85, which is about 1.18. Question: if you have a program in which 10% can't be done in parallel (i.e., must be serial) and you've got 64 processors, what's the best speedup you could hope for?
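As a sanity check on those numbers, here is a minimal sketch in plain C (my own illustration, not from the slides) that evaluates Amdahl's Law for the example above and for the question; with 10% serial code the bound is 1/(0.1 + 0.9/64), roughly 8.8, no matter how many processors you add.

/* Evaluate Amdahl's Law: speedup = 1 / ((1 - P) + P/S). */
#include <stdio.h>

static double amdahl(double p, double s) {
    return 1.0 / ((1.0 - p) + p / s);
}

int main(void) {
    /* Slide example: 30% of the program sped up 2x. */
    printf("P=0.3, S=2  -> %.2f\n", amdahl(0.3, 2.0));   /* ~1.18 */
    /* Slide question: 90% parallel portion spread over 64 processors. */
    printf("P=0.9, S=64 -> %.2f\n", amdahl(0.9, 64.0));  /* ~8.77 */
    return 0;
}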

TLP Review: Thread-Level Parallelism

struct acct_t { int bal; };
shared struct acct_t accts[max_acct];
int id, amt;
if (accts[id].bal >= amt) {
  accts[id].bal -= amt;
  spew_cash();
}

0: addi r1,accts,r3
1: ld 0(r3),r4
2: blt r4,r2,6
3: sub r4,r2,r4
4: st r4,0(r3)
5: call spew_cash
6: ...

Thread-level parallelism (TLP): a collection of asynchronous tasks, not started and stopped together, with data shared loosely and dynamically. Example: a database/web server (each query is a thread). accts is shared, so it can't be register allocated(*) even if it were scalar; id and amt are private variables, register allocated to r1 and r2. (*) The keyword in C is volatile.

TLP. We exploited parallelism at the instruction level (ILP). Hardware can't find TLP; the compiler/programmer needs to find it (hardware lacks the window size), and compilers are getting better at this. But hardware can take advantage of TLP: run on multiple cores, and if data is shared, provide rules to the software about what it can assume about shared data.

Shared memory vs. message passing. Review: why shared memory? Pluses: to applications it looks like a multitasking uniprocessor; the OS needs only evolutionary extensions; communication is easy to do without the OS; software can worry about correctness first, then performance. Minuses: proper synchronization is complex; communication is implicit, so it is harder to optimize; hardware designers must implement it. Result: traditionally bus-based symmetric multiprocessors (SMPs), and now CMPs, are the most successful parallel machines ever, and the first with multi-billion-dollar markets.

Shared memory vs. message passing. The alternative to shared memory? Explicit message passing. Rather difficult to code; we'll not be doing anything with it. But while it's not a 470 topic, it is important; the graduate-level parallel programming class covers it in detail.

Shared memory vs. message passing. But shared memory systems have their own problems, two in particular: cache coherence and the memory consistency model. The solutions differ depending on the interconnect, so let's do interconnect now and jump back to these two later.

Interconnect. Review: there are many ways to connect processors. Shared network: bus-based, crossbar. Point-to-point: more topologies than I want to think about, including mesh (in any number of dimensions), torus (in any number of dimensions), ncube, tree, etc.

Interconnect: Shared vs. Point-to-Point Networks. Shared network, e.g., bus (left): + low latency; - low bandwidth, doesn't scale beyond ~8 processors; + the shared medium simplifies cache coherence protocols (later). Point-to-point network, e.g., mesh or ring (right): - longer latency, may need multiple hops to communicate; + higher bandwidth, scales to 1000s of processors; - cache coherence protocols are complex. (Figure: a bus connecting Mem/router nodes on the left, a ring of Mem/router nodes on the right.)
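To make the hop-count point concrete, here is a small sketch in C (my own illustration, not from the slides) comparing a shared bus, where every node is one bus transaction away from every other, with an N x N mesh, where the worst case is the Manhattan distance between opposite corners.

/* Worst-case hop counts: shared bus vs. N x N mesh. */
#include <stdio.h>

/* On a bus, any node reaches any other in one bus transaction. */
static int bus_max_hops(void) { return 1; }

/* On an N x N mesh with dimension-order routing, the worst case is
   corner to corner: (N - 1) + (N - 1) hops. */
static int mesh_max_hops(int n) { return (n - 1) + (n - 1); }

int main(void) {
    for (int n = 2; n <= 32; n *= 2)
        printf("%4d nodes: bus max hops = %d, %2dx%-2d mesh max hops = %d\n",
               n * n, bus_max_hops(), n, n, mesh_max_hops(n));
    return 0;
}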

Interconnect: Organizing Point-To-Point Networks. Network topology: the organization of the network; a tradeoff between performance (connectivity, latency, bandwidth) and cost. Router chips: networks that require separate router chips are indirect; networks that use processor/memory/router packages are direct (+ fewer components, glueless MP). Point-to-point network examples: an indirect tree (left), a direct mesh or ring (right). (Figure: a tree of router chips above Mem nodes on the left, a ring of Mem/router nodes on the right.)

Interconnect Implementation #1: Snooping Bus MP. (Figure: processors and memories sharing a bus.) The first of two basic implementations: bus-based systems. Typically small: 2-8 (maybe 16) processors. Typically the processors are split from the memories (UMA). Sometimes multiple processors sit on a single chip (CMP). Symmetric multiprocessors (SMPs): common; I use one every day.

Interconnect Implementation #2: Scalable MP. (Figure: processor/memory/router blocks connected point-to-point.) The second implementation: general point-to-point network-based systems. Typically processor/memory/router blocks (NUMA). Glueless MP: no need for additional glue chips. Can be arbitrarily large: 1000s of processors. Massively parallel processors (MPPs).

Context: Bandwidth. As in the uniprocessor, we need caches to reduce average memory latency. But recall that caches are also important for reducing bandwidth demand, and now we've got (say) 4x as many requests, so we've probably got a bandwidth problem. It's important that caches work well!

Next: Shared data. It's easy without a cache! Just make all accesses go to memory. But we need a cache, both for latency and bandwidth reasons. With caches, two different processors might see an incoherent view of the same memory location. Let's quickly see why.

An Example Execution. Processor 0 and Processor 1 each run:

0: addi r1,accts,r3
1: ld 0(r3),r4
2: blt r4,r2,6
3: sub r4,r2,r4
4: st r4,0(r3)
5: call spew_cash

Two $100 withdrawals from account #241 at two ATMs. Each transaction maps to a thread on a different processor. Track accts[241].bal (its address is in r3). (Figure: CPU0 and CPU1 attached to Mem.)

No-Cache, No-Problem. Scenario I: the processors have no caches, so both threads' loads and stores go straight to memory. The value of accts[241].bal in memory over time: 500, 500, 400, 400, 300. No problem: the second withdrawal sees the first one's store, and the final balance is 300.

Cache Incoherence. Scenario II: the processors have write-back caches, so there are potentially three copies of accts[241].bal: memory, P0's cache, and P1's cache. Running the same code, the copies evolve as (P0$, P1$, Mem): (--, --, 500) initially; (V:500, --, 500) after P0's load; (D:400, --, 500) after P0's store; (D:400, V:500, 500) after P1 loads the stale 500; (D:400, D:400, 500) after P1's store. The copies can get incoherent (inconsistent): after two $100 withdrawals the balance should be 300, yet no copy holds it.
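A minimal sketch in C (my own illustration, not from the slides; the simple cache-line struct is made up) that replays this interleaving with two simulated write-back caches and shows how memory is left behind:

/* Replay the write-back-cache interleaving from the slide. */
#include <stdio.h>

typedef struct { int valid, dirty, data; } line_t;  /* one cached copy */

static int mem_bal = 500;                           /* accts[241].bal in memory */

static int cache_load(line_t *c) {                  /* on a miss, fill from memory */
    if (!c->valid) { c->valid = 1; c->dirty = 0; c->data = mem_bal; }
    return c->data;
}

static void cache_store(line_t *c, int v) {         /* write-back: memory untouched */
    c->valid = 1; c->dirty = 1; c->data = v;
}

int main(void) {
    line_t p0 = {0, 0, 0}, p1 = {0, 0, 0};
    int amt = 100;

    int b0 = cache_load(&p0); cache_store(&p0, b0 - amt);  /* P0 withdraws $100 */
    int b1 = cache_load(&p1); cache_store(&p1, b1 - amt);  /* P1 reads stale 500 */

    printf("P0$=%d P1$=%d Mem=%d\n", p0.data, p1.data, mem_bal);  /* 400 400 500 */
    return 0;
}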

Hardware Cache Coherence. (Figure: a CPU with its D$ tags and D$ data, attached to the bus through a coherence controller.) Coherence controller (CC): examines bus traffic (addresses and data) and executes the coherence protocol, i.e., decides what to do with the local copy when you see different things happening on the bus.

Snooping Cache-Coherence Protocols. The bus provides a serialization point. Each cache controller snoops all bus transactions and takes action to ensure coherence: invalidate, update, or supply the value, depending on the state of the block and the protocol.

Snooping Design Choices. The controller updates the state of blocks in response to processor and snoop events and generates bus transactions; caches often have duplicate tags for snooping. A snoopy protocol is a set of states, a state-transition diagram, and actions. Basic choices: write-through vs. write-back; invalidate vs. update. (Figure: the cache, holding state/tag/data per block, sits between processor loads/stores and snooped bus transactions.)

Simple bus-based scheme: write-through/no-write-allocate with an invalidate protocol. You either have the data or you don't (Valid or Invalid). The bus just supports read and write. When anyone else does a write, we go to I. We could go with an update protocol instead; how would that be different? What are the 4 C's of caching now?

The Simple Invalidate Snooping Protocol (write-through, no-write-allocate cache). States: Valid, Invalid. Inputs: PrRd, PrWr, BusRd, BusWr. Outputs: BusRd, BusWr. Transitions: Invalid, PrRd / BusRd -> Valid; Valid, PrRd / -- (stay Valid); PrWr / BusWr from either state (write-through with no allocation, so Valid stays Valid and Invalid stays Invalid); Valid, observed BusWr -> Invalid. Yes, that's confusing!
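A minimal rendering of that diagram as code, in C (my own sketch, not from the slides; the enum and function names are made up):

/* Valid/Invalid snooping FSM for a write-through, no-write-allocate cache. */
#include <stdio.h>

typedef enum { INVALID, VALID } state_t;
typedef enum { PR_RD, PR_WR, BUS_RD, BUS_WR } event_t;    /* inputs  */
typedef enum { NONE, OUT_BUS_RD, OUT_BUS_WR } action_t;   /* outputs */

static state_t next_state(state_t s, event_t e, action_t *out) {
    *out = NONE;
    switch (e) {
    case PR_RD:                      /* a read miss fetches; a read hit is silent */
        if (s == INVALID) *out = OUT_BUS_RD;
        return VALID;
    case PR_WR:                      /* write-through, no write allocate */
        *out = OUT_BUS_WR;
        return s;                    /* Valid stays Valid, Invalid stays Invalid */
    case BUS_WR:                     /* someone else wrote: drop our copy */
        return INVALID;
    case BUS_RD:                     /* memory services other readers; nothing to do */
    default:
        return s;
    }
}

int main(void) {
    action_t a;
    state_t s = INVALID;
    s = next_state(s, PR_RD, &a);    /* Invalid -> Valid, issues BusRd */
    s = next_state(s, BUS_WR, &a);   /* Valid -> Invalid on an observed BusWr */
    printf("final state: %s\n", s == VALID ? "Valid" : "Invalid");
    return 0;
}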

Example time. (Worksheet: Processor 1 issues Read A, Read A, Write A, Write A and Processor 2 issues Read A, Read A, Write A; for each access, fill in the processor transaction (PrRd, PrWr), the bus transaction (BusRd, BusWr), and each cache's state.)

More Generally: MOESI [Sweazey & Smith, ISCA '86]. M - Modified (dirty). O - Owned (dirty but shared) WHY? E - Exclusive (clean, unshared): only copy, not dirty. S - Shared. I - Invalid. Variants: MSI, MESI, MOSI, MOESI. (Figure: the five states arranged along the ownership, validity, and exclusiveness axes.)

MESI: write-back/write-allocate/invalidate. A cache line is in one of 4 states: Modified (dirty); Exclusive (clean, unshared): only copy, not dirty; Shared; Invalid. Four transactions can be issued on the bus: Read, Write, Read and Invalidate, Invalidate. Each processor snoops its own cache and checks whether it has the data; if so, it announces whether it has it clean (HIT) or dirty (HITM).
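A simplified sketch of how a MESI line responds to processor and snooped events, in C (my own rendering, not the lecture's full FSM; it ignores data transfer and write-backs, and the helper names are made up):

/* Simplified MESI state transitions (no data movement modeled). */
#include <stdbool.h>
#include <stdio.h>

typedef enum { M, E, S, I } mesi_t;

/* Processor-side events. 'other_sharers' is whether another cache announced
   HIT/HITM when our bus read was snooped. */
static mesi_t on_processor(mesi_t st, bool is_write, bool other_sharers) {
    switch (st) {
    case I: return is_write ? M                        /* Read-and-Invalidate */
                            : (other_sharers ? S : E); /* plain Read */
    case S: return is_write ? M : S;                   /* write issues Invalidate */
    case E: return is_write ? M : E;                   /* silent upgrade, no bus xact */
    case M: return M;                                  /* hits either way */
    }
    return I;
}

/* Snoop-side events for this line's address. */
static mesi_t on_snoop(mesi_t st, bool writer_wants_it) {
    if (st == I) return I;
    if (writer_wants_it) return I;   /* Read-and-Invalidate or Invalidate: drop it */
    return S;                        /* plain Read: M supplies data (HITM); all -> S */
}

int main(void) {
    mesi_t p0 = I, p1 = I;
    p0 = on_processor(p0, false, false);  /* P0 read miss, no sharers -> E */
    p1 = on_processor(p1, false, true);   /* P1 read miss, P0 has it  -> S */
    p0 = on_snoop(p0, false);             /* P0 snoops the read       -> S */
    p1 = on_processor(p1, true, false);   /* P1 writes                -> M */
    p0 = on_snoop(p0, true);              /* P0 snoops the invalidate -> I */
    printf("P0=%d P1=%d (0=M 1=E 2=S 3=I)\n", p0, p1);
    return 0;
}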

MESI example. Actions: PrRd, PrWr; BRL - Bus Read Line (BusRd); BWL - Bus Write Line (BusWr); BRIL - Bus Read and Invalidate Line; BIL - Bus Invalidate Line. States: M - Modified (dirty); E - Exclusive (clean, unshared): only copy, not dirty; S - Shared; I - Invalid. (Worksheet: the same access sequence as before, Processor 1 issuing Read A, Read A, Write A, Write A and Processor 2 issuing Read A, Read A, Write A; fill in each processor transaction, bus transaction, and cache state.)

A wrong FSM for MESI. It is actually not so much wrong as different (well, some of both).

Let's draw a better one.

Two processors each have a 2-line direct-mapped cache, with each line consisting of 16 bytes. R, W, R&I, and I are the transactions. Fill in the tables: for each access, give the bus transaction(s), hit/miss, HIT/HITM, and the 4C miss type (if any), and track the address and state held in Set 0 and Set 1 of Proc 1's and Proc 2's caches. Accesses, in order (processor, address, read/write): 1, 0x100, Read; 1, 0x100, Write; 1, 0x210, Read; 1, 0x220, Read; 2, 0x210, Read; 2, 0x220, Write; 1, 0x220, Read.

Scalability problems of Snoopy Coherence. Prohibitive bus bandwidth: the required bandwidth grows with the number of CPUs, but the available bandwidth per bus is fixed, and adding busses makes serialization/ordering hard. Prohibitive processor snooping bandwidth: all caches do a tag lookup when ANY processor accesses memory; inclusion limits this to the L2, but that is still lots of lookups. Upshot: bus-based coherence doesn't scale beyond 8-16 CPUs.

Scalable Cache Coherence. Scalable cache coherence is a two-part solution. Part I: bus bandwidth. Replace the non-scalable bandwidth substrate (bus) with a scalable-bandwidth one (a point-to-point network, e.g., a mesh). Part II: processor snooping bandwidth. Interesting observation: most snoops result in no action. Replace the non-scalable broadcast protocol (spam everyone) with a scalable directory protocol (only spam processors that care).

Directory Coherence Protocols. Observe: the physical address space is statically partitioned. + Can easily determine which memory module holds a given line; that memory module is sometimes called the home. - Can't easily determine which processors have the line in their caches. A bus-based protocol broadcasts events to all processors/caches: simple and fast, but non-scalable. Directories: a non-broadcast coherence protocol. Extend memory to track caching information: for each physical cache line whose home this is, track the Owner (which processor has a dirty copy, i.e., M state) and the Sharers (which processors have clean copies, i.e., S state). A processor sends coherence events to the home directory, and the home directory only sends events to the processors that care.
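A minimal sketch in C of the per-line bookkeeping a home node might keep (my own illustration, not from the slides; the field names and the 64-node limit are assumptions):

/* Per-line directory entry at the home node. */
#include <stdint.h>
#include <stdio.h>

#define NO_OWNER (-1)

typedef struct {
    int      owner;    /* processor holding a dirty (M) copy, or NO_OWNER */
    uint64_t sharers;  /* bitmask of processors with clean (S) copies     */
} dir_entry_t;

/* A read miss arrives at the home: record the requester as a sharer.
   (A real protocol would first downgrade a dirty owner, omitted here.) */
static void dir_read_miss(dir_entry_t *e, int requester) {
    e->sharers |= (1ull << requester);
}

/* A write miss arrives: only the recorded sharers get invalidations,
   instead of broadcasting to every processor in the machine. */
static void dir_write_miss(dir_entry_t *e, int requester) {
    for (int p = 0; p < 64; p++)
        if (((e->sharers >> p) & 1) && p != requester)
            printf("send invalidate to processor %d\n", p);
    e->sharers = (1ull << requester);
    e->owner   = requester;
}

int main(void) {
    dir_entry_t a = { NO_OWNER, 0 };
    dir_read_miss(&a, 1);      /* directory now says: A: Shared, #1         */
    dir_write_miss(&a, 2);     /* invalidate #1; directory says: A: Mod, #2 */
    printf("owner=%d sharers=0x%llx\n", a.owner, (unsigned long long)a.sharers);
    return 0;
}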

Read Processing. (Figure: Node #1 misses on Read A and sends the request to A's home directory; the directory responds with the data and records A: Shared, #1.)

Write Processing. (Figure: after Node #1's read the directory holds A: Shared, #1; when Node #2 writes A, the directory invalidates Node #1's copy and the entry becomes A: Mod., #2.) Trade-offs: longer accesses (3 hops between the requesting processor, the directory, and the other processor), but lower bandwidth since no snoops are necessary. Makes sense either for CMPs (lots of L1 miss traffic) or for large-scale servers (shared-memory MPs with > 32 nodes).

Serialization and Ordering. Let A and flag be 0. P1 runs: A += 5; flag = 1. P2 runs: while (flag == 0); print A. Assume A and flag are in different cache blocks. What happens? How do you implement it correctly?
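One answer to "how do you implement it correctly" is to make flag a real synchronization variable, so that the compiler and the hardware keep the update of A ordered before the store to flag. A minimal sketch using C11 atomics and pthreads (my own illustration, not from the slides):

/* Producer/consumer handoff through a flag, using C11 atomics. */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

static int        A    = 0;
static atomic_int flag = 0;

static void *p1(void *arg) {
    (void)arg;
    A += 5;
    atomic_store(&flag, 1);      /* seq_cst store: A's update is published first */
    return NULL;
}

static void *p2(void *arg) {
    (void)arg;
    while (atomic_load(&flag) == 0)
        ;                        /* spin until P1 publishes */
    printf("A = %d\n", A);       /* guaranteed to print 5 */
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, p1, NULL);
    pthread_create(&t2, NULL, p2, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return 0;
}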

Coherence vs. Consistency. Intuition says loads should return the latest value, but what is "latest"? Coherence concerns only one memory location; consistency concerns the apparent ordering of all locations. A memory system is coherent if all operations to a given location can be serialized such that: operations performed by any processor appear in program order (program order = the order defined by the program text or assembly code), and the value returned by a read is the value written by the last store to that location.

Why Coherence != Consistency. /* initial A = B = flag = 0 */ P1 runs: A = 1; B = 1; flag = 1. P2 runs: while (flag == 0); /* spin */ print A; print B. Intuition says the printed A = B = 1. Coherence doesn't say anything about this; why? Your uniprocessor ordering mechanisms (the ld/st queue) hurt here; why?

Sequential Consistency (SC). Processors issue memory ops in program order. (Figure: P1, P2, and P3 feed a switch that is randomly set after each memory op, connecting one processor at a time to memory.) This provides a single sequential order among all operations.

Sufficient Conditions for SC. Every processor issues memory ops in program order. Memory ops happen (start and end) atomically: a processor must wait for a store to complete before issuing its next memory op, and after a load, the issuing processor waits for the load to complete before issuing its next op. Easily implemented with a shared bus.

Relaxed Memory Models. Motivation with directory protocols: misses have longer latency (you would like to do cache hits under a miss to get to the next miss), and collecting acknowledgements can take even longer. Recall that SC has each processor generate a total order of its reads and writes (R-->R, R-->W, W-->W, and W-->R) that are interleaved into a global total order. Example relaxed models: PC relaxes the ordering from writes to (other processors') reads; RC relaxes all read/write orderings (but adds fences).
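To illustrate "relax the orderings but add fences", here is a hedged C11 variant of the earlier flag example (my own sketch, not from the slides) that uses relaxed accesses plus explicit release/acquire fences to state the only two orderings it needs:

/* The flag handoff again, under relaxed atomics with explicit fences. */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

static int        A    = 0;
static atomic_int flag = 0;

static void *producer(void *arg) {
    (void)arg;
    A = 5;
    atomic_thread_fence(memory_order_release);              /* order A before flag */
    atomic_store_explicit(&flag, 1, memory_order_relaxed);
    return NULL;
}

static void *consumer(void *arg) {
    (void)arg;
    while (atomic_load_explicit(&flag, memory_order_relaxed) == 0)
        ;
    atomic_thread_fence(memory_order_acquire);               /* order flag before A */
    printf("A = %d\n", A);                                    /* prints 5 */
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, producer, NULL);
    pthread_create(&t2, NULL, consumer, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return 0;
}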

Locks

Lock-based Mutual Exclusion. (Figure: a timeline of processors repeatedly waiting, transferring the lock, running the critical section, and releasing, with "acquire starts/done" and "release starts/done" marking the synchronization period.) With no contention we want low latency for the acquire; under contention we want a low synchronization period, low traffic, and fairness.

How Not to Implement Locks.

LOCK:   while (lock_variable == 1);   // <-- a context switch can happen here!
        lock_variable = 1;
UNLOCK: lock_variable = 0;

The test and the set are separate operations, so two threads can both see the lock free and both think they acquired it.

Solution: Atomic Read-Modify-Write. (r is a register; m[x] is memory location x.)

Test&Set(r,x):         { r = m[x]; m[x] = 1; }
Fetch&Op(r1,r2,x,op):  { r1 = m[x]; m[x] = op(r1,r2); }
Swap(r,x):             { temp = m[x]; m[x] = r; r = temp; }
Compare&Swap(r1,r2,x): { temp = r2; r2 = m[x]; if (r1 == r2) then m[x] = temp; }

Write a lock and unlock with test-and-set (one possible answer is sketched below). lock: unlock:
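A possible answer to the exercise, rendered in C11 atomics rather than the slide's register/memory notation (my own sketch; atomic_exchange plays the role of Test&Set, and the counter demo is just to show the lock working):

/* lock/unlock built from test-and-set, with a small two-thread demo. */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

static atomic_int lock_variable = 0;          /* 0 = free, 1 = held */

static void lock(void) {
    /* atomic_exchange is a test-and-set: it returns the old value and writes 1. */
    while (atomic_exchange(&lock_variable, 1) == 1)
        ;                                     /* it was already held: spin and retry */
}

static void unlock(void) {
    atomic_store(&lock_variable, 0);          /* release: a store of 0 */
}

static int counter = 0;

static void *worker(void *arg) {
    (void)arg;
    for (int i = 0; i < 100000; i++) {
        lock();
        counter++;                            /* critical section */
        unlock();
    }
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, worker, NULL);
    pthread_create(&t2, NULL, worker, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("counter = %d\n", counter);        /* 200000 with correct locking */
    return 0;
}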

Implementing RMWs. Bus-based systems: hold the bus and issue the load/store operations without any intervening accesses by other processors. Scalable systems: acquire exclusive ownership via cache coherence, then perform the load/store operations without allowing external coherence requests.

Load-Locked / Store-Conditional. Load-locked: issues a normal load and sets a flag and an address field. Store-conditional: checks that the flag is set and the address matches; if so, it performs the store and sets the store register to 1, otherwise it sets the store register to 0. The flag is cleared by: an invalidation, a cache eviction, or a context switch.

lock:   lda r2, #1
        ll r1, lock_variable
        sc lock_variable, r2
        beqz r2, lock        // SC failed: retry
        bnez r1, lock        // branch not equal to zero: lock was held, retry
unlock: st lock_variable, #0

Test-and-Set Spin Lock (T&S). Lock is acquire, unlock is release.

acquire(lock_ptr):
    while (true):
        // Perform test-and-set
        old = compare_and_swap(lock_ptr, UNLOCKED, LOCKED)
        if (old == UNLOCKED):
            break            // lock acquired!
        // otherwise keep spinning, back to the top of the while loop

release(lock_ptr):
    store[lock_ptr] <- UNLOCKED

Performance problem: CAS is both a read and a write, so spinning causes lots of invalidations.
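For concreteness, the same acquire/release expressed with C11 compare-and-swap (my own sketch; atomic_compare_exchange_strong stands in for the slide's compare_and_swap):

/* The slide's T&S spin lock, expressed with C11 compare-and-swap. */
#include <stdatomic.h>
#include <stdio.h>

enum { UNLOCKED = 0, LOCKED = 1 };

static void acquire(atomic_int *lock_ptr) {
    for (;;) {
        int expected = UNLOCKED;
        /* CAS is a read AND a write: every attempt grabs the line exclusively,
           which is why contended spinning generates heavy invalidation traffic. */
        if (atomic_compare_exchange_strong(lock_ptr, &expected, LOCKED))
            break;                   /* the old value was UNLOCKED: lock acquired */
        /* otherwise keep spinning */
    }
}

static void release(atomic_int *lock_ptr) {
    atomic_store(lock_ptr, UNLOCKED);
}

int main(void) {
    atomic_int lock = UNLOCKED;
    acquire(&lock);
    printf("held: %d\n", atomic_load(&lock));   /* 1 */
    release(&lock);
    printf("free: %d\n", atomic_load(&lock));   /* 0 */
    return 0;
}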