M4 Parallelism. Implementation of Locks Cache Coherence

Similar documents
Flynn's Classification

Multiprocessor Synchronization

Lecture 19: Coherence and Synchronization. Topics: synchronization primitives (Sections )

250P: Computer Systems Architecture. Lecture 14: Synchronization. Anton Burtsev March, 2019

SHARED-MEMORY COMMUNICATION

Lecture: Coherence and Synchronization. Topics: synchronization primitives, consistency models intro (Sections )

Lecture: Coherence, Synchronization. Topics: directory-based coherence, synchronization primitives (Sections )

Lecture 18: Coherence and Synchronization. Topics: directory-based coherence protocols, synchronization primitives (Sections

Synchronization. Coherency protocols guarantee that a reading processor (thread) sees the most current update to shared data.

Computer Architecture. A Quantitative Approach, Fifth Edition. Chapter 5. Multiprocessors and Thread-Level Parallelism

EN2910A: Advanced Computer Architecture Topic 05: Coherency of Memory Hierarchy Prof. Sherief Reda School of Engineering Brown University

Aleksandar Milenkovic, Electrical and Computer Engineering University of Alabama in Huntsville. Review: Small-Scale Shared Memory

Lecture 33: Multiprocessors Synchronization and Consistency Professor Randy H. Katz Computer Science 252 Spring 1996

The MESI State Transition Graph

Cache Coherence and Atomic Operations in Hardware

Chapter 8. Multiprocessors. In-Cheol Park Dept. of EE, KAIST

Chapter 5. Thread-Level Parallelism

EN164: Design of Computing Systems Lecture 34: Misc Multi-cores and Multi-processors

Chapter 5. Multiprocessors and Thread-Level Parallelism

CPE 631 Lecture 21: Multiprocessors

Chapter 5 Thread-Level Parallelism. Abdullah Muzahid

Chapter 5. Multiprocessors and Thread-Level Parallelism

Concurrent Programming

EN2910A: Advanced Computer Architecture Topic 05: Coherency of Memory Hierarchy

Introduction. Coherency vs Consistency. Lec-11. Multi-Threading Concepts: Coherency, Consistency, and Synchronization

Chapter-4 Multiprocessors and Thread-Level Parallelism

Module 7: Synchronization Lecture 13: Introduction to Atomic Primitives. The Lecture Contains: Synchronization. Waiting Algorithms.

Lecture 25: Multiprocessors

Parallel Computer Architecture Spring Distributed Shared Memory Architectures & Directory-Based Memory Coherence

Symmetric Multiprocessors: Synchronization and Sequential Consistency

Review: Multiprocessor. CPE 631 Session 21: Multiprocessors (Part 2) Potential HW Coherency Solutions. Bus Snooping Topology

5008: Computer Architecture

MULTIPROCESSORS AND THREAD LEVEL PARALLELISM

The need for atomicity This code sequence illustrates the need for atomicity. Explain.

Synchronization. P&H Chapter 2.11 and 5.8. Hakim Weatherspoon CS 3410, Spring 2013 Computer Science Cornell University

Page 1. Lecture 12: Multiprocessor 2: Snooping Protocol, Directory Protocol, Synchronization, Consistency. Bus Snooping Topology

Role of Synchronization. CS 258 Parallel Computer Architecture Lecture 23. Hardware-Software Trade-offs in Synchronization and Data Layout

CS3350B Computer Architecture

Computer Architecture Lecture 10: Thread Level Parallelism II (Chapter 5) Chih Wei Liu 劉志尉 National Chiao Tung University

CS654 Advanced Computer Architecture Lec 14 Directory Based Multiprocessors

EEC 581 Computer Architecture. Lec 11 Synchronization and Memory Consistency Models (4.5 & 4.6)

CMSC Computer Architecture Lecture 15: Memory Consistency and Synchronization. Prof. Yanjing Li University of Chicago

Synchronization. Jinkyu Jeong Computer Systems Laboratory Sungkyunkwan University

Synchronization. Jinkyu Jeong Computer Systems Laboratory Sungkyunkwan University

Page 1. Cache Coherence

SELECTED TOPICS IN COHERENCE AND CONSISTENCY

Last Class: Synchronization

CS510 Advanced Topics in Concurrency. Jonathan Walpole

Shared Memory Architecture

Lecture: Consistency Models, TM

Today: Synchronization. Recap: Synchronization

Synchronization. Han Wang CS 3410, Spring 2012 Computer Science Cornell University. P&H Chapter 2.11

Multiprocessors and Locking

Data-Centric Consistency Models. The general organization of a logical data store, physically distributed and replicated across multiple processes.

Lecture 19: Synchronization. CMU : Parallel Computer Architecture and Programming (Spring 2012)

CS5460: Operating Systems

CS 61C: Great Ideas in Computer Architecture. Amdahl s Law, Thread Level Parallelism

Synchronization. Han Wang CS 3410, Spring 2012 Computer Science Cornell University. P&H Chapter 2.11

Lecture 25: Thread Level Parallelism -- Synchronization and Memory Consistency

Lecture 25: Synchronization. James C. Hoe Department of ECE Carnegie Mellon University

Potential violations of Serializability: Example 1

Shared Memory Multiprocessors

Multiprocessor Cache Coherence. Chapter 5. Memory System is Coherent If... From ILP to TLP. Enforcing Cache Coherence. Multiprocessor Types

CSE 502 Graduate Computer Architecture Lec 19 Directory-Based Shared-Memory Multiprocessors & MP Synchronization

Page 1. Outline. Coherence vs. Consistency. Why Consistency is Important

Multiprocessors. Key questions: How do parallel processors share data? How do parallel processors coordinate? How many processors?

EE382 Processor Design. Processor Issues for MP

Chapter 5: Thread-Level Parallelism Part 1

Computer Architecture and Engineering CS152 Quiz #5 April 27th, 2016 Professor George Michelogiannakis Name: <ANSWER KEY>

CS 252 Graduate Computer Architecture. Lecture 11: Multiprocessors-II

Parallel Computer Architecture Spring Memory Consistency. Nikos Bellas

Lecture #7: Implementing Mutual Exclusion

Lecture 7: Implementing Cache Coherence. Topics: implementation details

Today s Outline: Shared Memory Review. Shared Memory & Concurrency. Concurrency v. Parallelism. Thread-Level Parallelism. CS758: Multicore Programming

Other consistency models

Presentation 8 Shared Memory Architecture

«Real Time Embedded systems» Multi Masters Systems

Multiprocessors II: CC-NUMA DSM. CC-NUMA for Large Systems

Computer Science 146. Computer Architecture

G52CON: Concepts of Concurrency

Lecture 17: Multiprocessors: Size, Consitency. Review: Networking Summary

Mutex Implementation

Concurrent Preliminaries

Parallel Architecture. Hwansoo Han

6.852: Distributed Algorithms Fall, Class 15

Chapter - 5 Multiprocessors and Thread-Level Parallelism

INSTITUTO SUPERIOR TÉCNICO. Architectures for Embedded Computing

Cache Coherence Protocols: Implementation Issues on SMP s. Cache Coherence Issue in I/O

Shared Symmetric Memory Systems

Dr. George Michelogiannakis. EECS, University of California at Berkeley CRD, Lawrence Berkeley National Laboratory

Embedded Systems Architecture. Multiprocessor and Multicomputer Systems

CSE 451: Operating Systems Winter Lecture 7 Synchronization. Steve Gribble. Synchronization. Threads cooperate in multithreaded programs

CS252 Spring 2017 Graduate Computer Architecture. Lecture 15: Synchronization and Memory Models Part 2

Memory Consistency Models

Basic Architecture of SMP. Shared Memory Multiprocessors. Cache Coherency -- The Problem. Cache Coherency, The Goal.

CSE502: Computer Architecture CSE 502: Computer Architecture

C09: Process Synchronization

Lecture 26: Multiprocessors. Today s topics: Directory-based coherence Synchronization Consistency Shared memory vs message-passing

Control Instructions

Control Instructions. Computer Organization Architectures for Embedded Computing. Thursday, 26 September Summary

Transcription:

M4 Parallelism Implementation of Locks Cache Coherence

Outline Parallelism Flynn s classification Vector Processing Subword Parallelism Symmetric Multiprocessors, Distributed Memory Machines Shared Memory Multiprocessing, Message Passing Synchronization Primitives Locks, LL-SC Cache coherence

Synchronization Two processors sharing an area of memory P1 writes, then P2 reads Data race if P1 and P2 don t synchronize Result depends of order of accesses

Synchronization Two processors sharing an area of memory P1 writes, then P2 reads Data race if P1 and P2 don t synchronize Result depends of order of accesses Hardware support required Atomic read/write memory operation No other access to the location allowed between the read and write

Synchronization Two processors sharing an area of memory P1 writes, then P2 reads Data race if P1 and P2 don t synchronize Result depends of order of accesses Hardware support required Atomic read/write memory operation No other access to the location allowed between the read and write Could be a single instruction E.g., atomic swap of register memory Or an atomic pair of instructions

Implementing Locks Must synchronize processes so that they access shared variable one at a time in critical section; called Mutual Exclusion

Implementing Locks Must synchronize processes so that they access shared variable one at a time in critical section; called Mutual Exclusion Mutex Lock: a synchronization primitive

Implementing Locks Mutex Lock: a synchronization primitive AcquireLock(L) ReleaseLock(L)

Implementing Locks Mutex Lock: a synchronization primitive AcquireLock(L) Done before critical section of code Returns when safe for process to enter critical section ReleaseLock(L)

Implementing Locks Mutex Lock: a synchronization primitive AcquireLock(L) Done before critical section of code Returns when safe for process to enter critical section ReleaseLock(L) Done after critical section Allows another process to acquire lock

Implementing Locks int L=0; AcquireLock(L): while (L==1) ; L = 1; /* BUSY WAITING */

Implementing Locks int L=0; AcquireLock(L): while (L==1) ; L = 1; /* BUSY WAITING */ ReleaseLock(L): L = 0;

Problem in Implementing Locks AcquireLock(L): while (L==1) ; L = 1; wait: LW R1, Addr(L) BNEZ wait ADDI R1, R1, 1 SW R1, Addr(L)

Problem in Implementing Locks wait: LW R1, Addr(L) BNEZ wait ADDI R1, R1, 1 SW R1, Addr(L) Initally L=0. P1 and P2 are in contention to acquire the lock.

Problem in Implementing Locks Process 1 Process 2 wait: LW R1, Addr(L) BNEZ wait ADDI R1, R1, 1 SW R1, Addr(L) Initally L=0. P1 and P2 are in contention to acquire the lock.

Problem in Implementing Locks Process 1 Process 2 LW R1, Addr(L) Context Switch wait: LW R1, Addr(L) BNEZ wait ADDI R1, R1, 1 SW R1, Addr(L) Initally L=0. P1 and P2 are in contention to acquire the lock.

Problem in Implementing Locks Process 1 Process 2 LW R1, Addr(L) Context Switch LW R1, Addr(L) BNEZ wait ADDI R1, R1, 1 # Critical Section # Context Switch wait: LW R1, Addr(L) BNEZ wait ADDI R1, R1, 1 SW R1, Addr(L) Initally L=0. P1 and P2 are in contention to acquire the lock.

Problem in Implementing Locks Process 1 Process 2 LW R1, Addr(L) Context Switch LW R1, Addr(L) BNEZ wait ADDI R1, R1, 1 # Critical Section # Context Switch wait: LW R1, Addr(L) BNEZ wait ADDI R1, R1, 1 SW R1, Addr(L) Initally L=0. P1 and P2 are in contention to acquire the lock. BNEZ wait ADDI R1, R1, 1 # Critical Section #

Problem in Implementing Locks Process 1 Process 2 LW R1, Addr(L) Context Switch LW R1, Addr(L) BNEZ wait ADDI R1, R1, 1 # Critical Section # Context Switch wait: LW R1, Addr(L) BNEZ wait ADDI R1, R1, 1 SW R1, Addr(L) Initally L=0. P1 and P2 are in contention to acquire the lock. BNEZ wait ADDI R1, R1, 1 # Critical Section # Both P1 and P2 are executing in the Critical Section!!!

Atomic Exchange Hardware support for lock implementation

Atomic Exchange Hardware support for lock implementation Atomic exchange: Swap contents between register and memory.

Atomic Exchange Test&Set Takes one memory operand and a register operand Test&Set Lock tmp = Lock Lock = 1 return tmp Test&Set occurs atomically (indivisibly).

Atomic Exchange Test&Set Takes one memory operand and a register operand Test&Set Lock tmp = Lock Lock = 1 return tmp Test&Set occurs atomically (indivisibly). Atomic Read-Modify-Write (RMW) instruction

Lock Implementation lock: Test&Set R1, L BNZ R1, lock Critical Section SW R0, L PP R1 R1 L L MM MM

Lock Implementation lock: Test&Set R1, L BNZ R1, lock PP 1 R1 R1 Critical Section SW R0, L 0 L L MM MM

Lock Implementation lock: Test&Set R1, L BNZ R1, lock PP 0 R1 R1 Critical Section SW R0, L 1 L L MM MM

Test&Set Implementation Test&Set R1, L t t = L; L; # Store a copy of of Lock L = 1; 1; # Set Lock R1 R1 t; t; # Write original value of of the Lock in in R1 R1 PP R1 R1 L L MM MM

Test&Set Implementation Test&Set R1, L t t = L; L; # Store a copy of of Lock L = 1; 1; # Set Lock R1 R1 t; t; # Write original value of of the Lock in in R1 R1 2 stores (t, (t, L) L) and 1 load (R1) (atomic practically 1 instruction) PP MM MM R1 R1 L L

Test&Set Implementation Test&Set R1, L t t = L; L; # Store a copy of of Lock L = 1; 1; # Set Lock R1 R1 t; t; # Write original value of of the Lock in in R1 R1 2 stores (t, (t, L) L) and 1 load (R1) (atomic practically 1 instruction) PP MM MM R1 R1 L L The atomic read-modify-write hardware primitive facilitates synchronization implementations (locks, barriers, etc.)

Test&Set Issues T&S is an atomic Read-Modify-Write instruction 2 Stores, 1 Load Atomic

Test&Set Issues T&S is an atomic Read-Modify-Write instruction 2 Stores, 1 Load Atomic Consider N processors executing T&S on a single lock variable L

Test&Set Issues T&S is an atomic Read-Modify-Write instruction 2 Stores, 1 Load Atomic Consider N processors executing T&S on a single lock variable

Test&Set Issues T&S is an atomic Read-Modify-Write instruction 2 Stores, 1 Load Atomic Consider N processors executing T&S on a single lock variable Redundant, useless work Performance loss

Test&Set Issues T&S is an atomic Read-Modify-Write instruction 2 Stores, 1 Load Atomic Use LL-SC

LL-SC Example LD R2, X SW R2, X

LL-SC Example Atomic execution required! LD R2, X SW R2, X

LL-SC Example Atomic execution required! LD R2, X SW R2, X LL R2, X SC R2, X

LL-SC Example Atomic execution required! LD R2, X SW R2, X LL R2, X SC R2, X X: X:

LL-SC Example Atomic execution required! LD R2, X SW R2, X LL R2, X SC R2, X X: X:

LL-SC Example Atomic execution required! LD R2, X SW R2, X LL R2, X SC R2, X X: X:

LL-SC Example Atomic execution required! LD R2, X SW R2, X LL R2, X SC R2, X X: X: Some other thread has has modified X

LL-SC Example Atomic execution required! LD R2, X SW R2, X LL R2, X SC R2, X X: X: Some other thread has has modified X R2 R2 is is filled filled with with a special value indicating failure of of SC SC

LL-SC Example Atomic execution required! LD R2, X SW R2, X LL R2, X SC R2, X X: X: if (R2 == )... Some other thread has has modified X R2 R2 is is filled filled with with a special value indicating failure of of SC SC

Load Linked and Store Conditional LL records the lock variable address in a table

Load Linked and Store Conditional LL records the lock variable address in a table Snoop on the bus; Record the first Write Invalidate request on the lock variable in the table.

Load Linked and Store Conditional LL records the lock variable address in a table Snoop on the bus; Record the first Write Invalidate request on the lock variable in the table. Before the thread updates its cache copy, SC checks the table.

Load Linked and Store Conditional LL records the lock variable address in a table Snoop on the bus; Record the first Write Invalidate request on the lock variable in the table. Before the thread updates its cache copy, SC checks the table. If flag is set, SC fails (no coherence traffic). Fail is indicated by a special value in the register.

Load Linked and Store Conditional LL records the lock variable address in a table Snoop on the bus; Record the first Write Invalidate request on the lock variable in the table. Before the thread updates its cache copy, SC checks the table. If flag is set, SC fails (no coherence traffic). Fail is indicated by a special value in the register. If flag is not set, SC succeeds. Lock acquired.

Spin Lock using LL-SC lockit: LL R2, 0(R1) BNEZ R2, lockit DADDUI R2, R0, #1 SC R2, 0(R1) BEQZ R2, lockit ; no coherence traffic ; not available, keep spinning ; put value 1 in R2 ; store-conditional succeeds if no one ; updated the lock since the last LL ; confirm that SC succeeded, else keep trying

Spin Lock using LL-SC lockit: LL R2, 0(R1) BNEZ R2, lockit DADDUI R2, R0, #1 SC R2, 0(R1) BEQZ R2, lockit Spin lock with lower coherence traffic. ; no coherence traffic ; not available, keep spinning ; put value 1 in R2 ; store-conditional succeeds if no one ; updated the lock since the last LL ; confirm that SC succeeded, else keep trying

Outline Parallelism Flynn s classification Vector Processing Subword Parallelism Symmetric Multiprocessors, Distributed Memory Machines Shared Memory Multiprocessing, Message Passing Synchronization Primitives Locks, LL-SC Cache coherence