M4 Parallelism: Implementation of Locks, Cache Coherence

Outline
    Parallelism
    Flynn's classification
    Vector Processing
    Subword Parallelism
    Symmetric Multiprocessors, Distributed Memory Machines
    Shared Memory Multiprocessing, Message Passing
    Synchronization Primitives
    Locks, LL-SC
    Cache coherence

Synchronization
Two processors share an area of memory: P1 writes, then P2 reads.
    Data race if P1 and P2 don't synchronize.
    The result depends on the order of accesses.
Hardware support is required: an atomic read/write memory operation.
    No other access to the location is allowed between the read and the write.
    Could be a single instruction, e.g., an atomic swap of a register and a memory location.
    Or an atomic pair of instructions.

Implementing Locks
Processes must be synchronized so that they access a shared variable one at a time, inside a critical section; this is called Mutual Exclusion.
Mutex Lock: a synchronization primitive.

Implementing Locks
A mutex lock provides two operations:
    AcquireLock(L)
        Done before the critical section of code.
        Returns when it is safe for the process to enter the critical section.
    ReleaseLock(L)
        Done after the critical section.
        Allows another process to acquire the lock.

Implementing Locks
A first attempt, using an ordinary shared variable:
    int L = 0;
    AcquireLock(L):  while (L == 1) ;   /* BUSY WAITING */
                     L = 1;
    ReleaseLock(L):  L = 0;

Problem in Implementing Locks
AcquireLock(L) becomes a separate load and store at the instruction level:
    AcquireLock(L):                wait: LW   R1, Addr(L)
        while (L == 1) ;                 BNEZ R1, wait
        L = 1;                           ADDI R1, R1, 1
                                         SW   R1, Addr(L)

Problem in Implementing Locks
Initially L = 0. P1 and P2 are in contention to acquire the lock.
    Process 1                     Process 2
    LW   R1, Addr(L)
    (context switch)
                                  LW   R1, Addr(L)
                                  BNEZ R1, wait
                                  ADDI R1, R1, 1
                                  SW   R1, Addr(L)
                                  # Critical Section #
                                  (context switch)
    BNEZ R1, wait
    ADDI R1, R1, 1
    SW   R1, Addr(L)
    # Critical Section #
P1 read L = 0 before it was switched out, so its stale R1 lets it pass the test.
Both P1 and P2 are executing in the critical section!
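
The interleaving above can be replayed deterministically. The following is a minimal C sketch, not part of the original slides: struct proc, the step_* helpers, and replay_race are hypothetical names that model each process's private R1 and the shared word L, with the ADDI and SW folded into one step.

```c
#include <stdbool.h>

/* Hypothetical model: each process has a private R1; L is the shared lock word. */
struct proc {
    int  r1;
    bool in_critical;
};

static int L = 0;

/* LW R1, Addr(L) */
static void step_load(struct proc *p) { p->r1 = L; }

/* BNEZ R1, wait: falls through only when R1 == 0 */
static bool step_test(const struct proc *p) { return p->r1 == 0; }

/* ADDI R1, R1, 1 ; SW R1, Addr(L) ; then enter the critical section */
static void step_store(struct proc *p)
{
    p->r1 += 1;
    L = p->r1;
    p->in_critical = true;
}

/* Replays the schedule from the slide; returns true if mutual exclusion is violated. */
static bool replay_race(void)
{
    struct proc p1 = { 0, false }, p2 = { 0, false };
    L = 0;

    step_load(&p1);                       /* P1 reads L == 0, then is switched out */
    step_load(&p2);                       /* P2 also reads L == 0 */
    if (step_test(&p2)) step_store(&p2);  /* P2 takes the lock */
    if (step_test(&p1)) step_store(&p1);  /* P1 resumes with its stale R1 == 0 */

    return p1.in_critical && p2.in_critical;
}
```

Calling replay_race() returns true for this schedule: the test and the store are separate steps, so nothing stops a second process from slipping in between them.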

Atomic Exchange
Hardware support for lock implementation.
Atomic exchange: swap the contents of a register and a memory location as one indivisible operation.
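
At the C level, an atomic exchange is available through the C11 <stdatomic.h> function atomic_exchange. The sketch below is an illustration rather than the slides' own code; the wrapper name atomic_swap is hypothetical, and the "register" is modeled as an ordinary int.

```c
#include <stdatomic.h>

/* Atomically store `reg` into *mem and return the previous contents of *mem.
 * `atomic_swap` is a hypothetical wrapper name; atomic_exchange is standard C11. */
static int atomic_swap(atomic_int *mem, int reg)
{
    return atomic_exchange(mem, reg);
}
```

Usage: with a memory word holding 5, atomic_swap(&word, 9) returns 5 and leaves 9 in memory, with no other access able to intervene between the read and the write.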

Atomic Exchange: Test&Set
Test&Set takes one memory operand and one register operand.
    Test&Set Lock:
        tmp = Lock
        Lock = 1
        return tmp
Test&Set occurs atomically (indivisibly). It is an atomic Read-Modify-Write (RMW) instruction.
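
C11 exposes the same primitive as atomic_flag_test_and_set, which sets a flag and returns its previous state. A minimal sketch follows; the wrapper test_and_set and the variable lock_flag are hypothetical names chosen to mirror the pseudocode above.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* A clear flag means the lock is free. `lock_flag` is an illustrative name. */
static atomic_flag lock_flag = ATOMIC_FLAG_INIT;

/* Returns the old state of the flag and leaves it set, as one atomic step. */
static bool test_and_set(atomic_flag *lock)
{
    return atomic_flag_test_and_set(lock);
}
```

The first call on a clear flag returns false (the caller now holds the lock); later calls return true until the flag is cleared with atomic_flag_clear.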

Lock Implementation
    lock: Test&Set R1, L       ; R1 = old value of L, L = 1
          BNZ      R1, lock    ; old value was 1: lock already held, retry
          Critical Section
          SW       R0, L       ; release: store 0 to L
(Figure: processor P with register R1 and memory M holding L. Across the builds, R1 changes from 1 to 0 while L changes from 0 to 1.)
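
The loop above maps onto C11 atomics fairly directly. This is a hedged sketch, not the slides' implementation: acquire_lock and release_lock are hypothetical names, and atomic_exchange stands in for the Test&Set instruction.

```c
#include <stdatomic.h>

/* lock: Test&Set R1, L ; BNZ R1, lock
 * Spin until the exchange returns 0, i.e. until this caller is the one
 * that changed L from 0 to 1. */
static void acquire_lock(atomic_int *L)
{
    while (atomic_exchange(L, 1) != 0) {
        /* busy wait: lock was already held */
    }
}

/* SW R0, L : release by storing 0 */
static void release_lock(atomic_int *L)
{
    atomic_store(L, 0);
}
```

A critical section is then bracketed as acquire_lock(&L); ... release_lock(&L);. Because the read and the write happen in one atomic exchange, the context-switch interleaving that broke the plain load/store version cannot occur.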

Test&Set Implementation
    Test&Set R1, L
        t  = L;     # Store a copy of Lock
        L  = 1;     # Set Lock
        R1 = t;     # Write original value of the Lock in R1
2 stores (t, L) and 1 load (R1), executed atomically: practically 1 instruction.
The atomic read-modify-write hardware primitive facilitates synchronization implementations (locks, barriers, etc.).
(Figure: processor P with register R1; memory M holding L.)

Test&Set Issues
T&S is an atomic Read-Modify-Write instruction: 2 stores and 1 load, performed atomically.
Consider N processors executing T&S on a single lock variable L:
    Every attempt writes L, even when the lock is already held and the attempt fails.
    Redundant, useless work.
    Performance loss.
Alternative: use LL-SC.
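
To make the redundant work concrete, the following C sketch counts the stores issued while a lock stays held. It is a counting model under stated assumptions, not a hardware measurement: struct lock_model, model_test_and_set, and spin_writes are hypothetical names, and the waiters are simulated sequentially.

```c
/* Hypothetical counting model of a lock word. */
struct lock_model {
    int           value;   /* 0 = free, 1 = held */
    unsigned long writes;  /* stores issued to the lock word */
};

/* Test&Set semantics: the store happens unconditionally. */
static int model_test_and_set(struct lock_model *l)
{
    int old = l->value;
    l->value = 1;          /* written even if it was already 1 */
    l->writes++;
    return old;
}

/* `waiters` processors each attempt T&S `rounds` times while the lock
 * remains held by someone else; returns the number of stores issued. */
static unsigned long spin_writes(int waiters, int rounds)
{
    struct lock_model l = { 1, 0 };
    for (int r = 0; r < rounds; r++)
        for (int w = 0; w < waiters; w++)
            (void)model_test_and_set(&l);   /* returns 1: attempt fails */
    return l.writes;
}
```

In this model, 4 waiters spinning for 10 rounds issue 40 stores without any of them acquiring the lock; on a real machine each such store would also have to be made visible to the other caches.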

LL-SC Example
A plain load/store pair on X would require atomic execution:
    LD R2, X
    SW R2, X
With LL-SC, the pair becomes:
    LL R2, X
    SC R2, X
    if (R2 == <special value>) ...
If some other thread has modified X between the LL and the SC, the SC does not store, and R2 is filled with a special value indicating failure of SC.
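
The same load, modify, conditional-store, retry pattern can be written portably in C11. The sketch below uses atomic_compare_exchange_weak as a stand-in: it is not LL-SC itself, although compilers for LL-SC architectures commonly build it from such a pair. The function atomic_increment and the choice of "add 1" as the update are assumptions made for illustration.

```c
#include <stdatomic.h>

/* Atomically add 1 to *x and return the new value. */
static int atomic_increment(atomic_int *x)
{
    int old = atomic_load(x);   /* plays the role of LL R2, X */

    /* Plays the role of SC: stores old + 1 only if *x still equals old.
     * On failure, `old` is refreshed with the current value and we retry. */
    while (!atomic_compare_exchange_weak(x, &old, old + 1)) {
        /* another thread modified x (or the weak CAS failed spuriously) */
    }
    return old + 1;
}
```

The retry loop corresponds to the if (R2 == ...) check on the slide: a failed conditional store is detected and the whole sequence is repeated.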

Load Linked and Store Conditional
LL records the lock variable address in a table.
The processor snoops on the bus and records, as a flag in the table, the first Write Invalidate request on the lock variable.
Before the thread updates its cache copy, SC checks the table.
    If the flag is set, SC fails (no coherence traffic). Failure is indicated by a special value in the register.
    If the flag is not set, SC succeeds. Lock acquired.
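
The table-and-flag mechanism described above can be modeled in a few lines of sequential C. This is a simplified software model, not a description of any real processor: NCPU, struct link_entry, table, snoop_write, load_linked, and store_conditional are all hypothetical names, and the model assumes a successful SC is the only kind of write other processors snoop.

```c
#include <stdbool.h>

#define NCPU 2   /* illustrative number of processors */

/* One table entry per processor: the linked address and the snooped flag. */
struct link_entry {
    const int *addr;
    bool       armed;        /* an LL is outstanding */
    bool       invalidated;  /* a write invalidate was seen since the LL */
};

static struct link_entry table[NCPU];

/* Models snooping: a write by `writer` flags every other processor's link. */
static void snoop_write(int writer, const int *addr)
{
    for (int cpu = 0; cpu < NCPU; cpu++)
        if (cpu != writer && table[cpu].armed && table[cpu].addr == addr)
            table[cpu].invalidated = true;
}

/* LL: record the address, clear the flag, return the current value. */
static int load_linked(int cpu, const int *addr)
{
    table[cpu] = (struct link_entry){ addr, true, false };
    return *addr;
}

/* SC: store only if the flag is not set; returns true on success. */
static bool store_conditional(int cpu, int *addr, int value)
{
    struct link_entry *e = &table[cpu];
    bool ok = e->armed && e->addr == addr && !e->invalidated;
    e->armed = false;
    if (ok) {
        *addr = value;
        snoop_write(cpu, addr);   /* other processors observe the invalidate */
    }
    return ok;
}
```

If two processors both execute LL on a free lock and processor 1 completes its SC first, processor 0's flag is set and its SC fails without writing, which is the behavior the slide describes.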

Spin Lock using LL-SC
    lockit: LL     R2, 0(R1)     ; no coherence traffic
            BNEZ   R2, lockit    ; not available, keep spinning
            DADDUI R2, R0, #1    ; put value 1 in R2
            SC     R2, 0(R1)     ; store-conditional succeeds if no one
                                 ; updated the lock since the last LL
            BEQZ   R2, lockit    ; confirm that SC succeeded, else keep trying
Spin lock with lower coherence traffic.
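
A C11 analogue of this spin lock is sketched below, assuming atomic_compare_exchange_weak as a portable stand-in for the LL/SC pair; the names try_lock, spin_lock, and spin_unlock are hypothetical. As in the assembly, waiting is done with plain reads, and a write is attempted only once the lock looks free.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* One attempt to change the lock from 0 to 1 (the DADDUI / SC step).
 * May fail spuriously, like a failed SC, but never succeeds wrongly. */
static bool try_lock(atomic_int *lock)
{
    int expected = 0;
    return atomic_compare_exchange_weak(lock, &expected, 1);
}

static void spin_lock(atomic_int *lock)
{
    for (;;) {
        while (atomic_load(lock) != 0) {
            /* LL / BNEZ: read-only spin while the lock is held */
        }
        if (try_lock(lock))
            return;            /* BEQZ not taken: the store succeeded */
        /* store failed: someone else got there first, keep trying */
    }
}

static void spin_unlock(atomic_int *lock)
{
    atomic_store(lock, 0);
}
```

Compared with the Test&Set loop, waiting processors here only read the lock word, so they do not generate a store on every failed attempt.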