M4 Parallelism. Implementation of Locks Cache Coherence

M4 Parallelism Implementation of Locks Cache Coherence

Outline Parallelism Flynn s classification Vector Processing Subword Parallelism Symmetric Multiprocessors, Distributed Memory Machines Shared Memory Multiprocessing, Message Passing Synchronization Primitives Locks, LL-SC Cache coherence

Synchronization Two processors sharing an area of memory P1 writes, then P2 reads Data race if P1 and P2 don t synchronize Result depends of order of accesses

Synchronization Two processors sharing an area of memory P1 writes, then P2 reads Data race if P1 and P2 don t synchronize Result depends of order of accesses Hardware support required Atomic read/write memory operation No other access to the location allowed between the read and write

Implementing Locks Must synchronize processes so that they access shared variable one at a time in critical section; called Mutual Exclusion

Implementing Locks Must synchronize processes so that they access shared variable one at a time in critical section; called Mutual Exclusion Mutex Lock: a synchronization primitive

Implementing Locks Mutex Lock: a synchronization primitive AcquireLock(L) ReleaseLock(L)

Implementing Locks Mutex Lock: a synchronization primitive AcquireLock(L) Done before critical section of code Returns when safe for process to enter critical section ReleaseLock(L)

Implementing Locks Mutex Lock: a synchronization primitive AcquireLock(L) Done before critical section of code Returns when safe for process to enter critical section ReleaseLock(L) Done after critical section Allows another process to acquire lock

Implementing Locks int L=0; AcquireLock(L): while (L==1) ; L = 1; /* BUSY WAITING */

Implementing Locks int L=0; AcquireLock(L): while (L==1) ; L = 1; /* BUSY WAITING */ ReleaseLock(L): L = 0;

Problem in Implementing Locks AcquireLock(L): while (L==1) ; L = 1; wait: LW R1, Addr(L) BNEZ wait ADDI R1, R1, 1 SW R1, Addr(L)

Problem in Implementing Locks wait: LW R1, Addr(L) BNEZ wait ADDI R1, R1, 1 SW R1, Addr(L) Initally L=0. P1 and P2 are in contention to acquire the lock.

Problem in Implementing Locks Process 1 Process 2 wait: LW R1, Addr(L) BNEZ wait ADDI R1, R1, 1 SW R1, Addr(L) Initally L=0. P1 and P2 are in contention to acquire the lock.

Problem in Implementing Locks Process 1 Process 2 LW R1, Addr(L) Context Switch wait: LW R1, Addr(L) BNEZ wait ADDI R1, R1, 1 SW R1, Addr(L) Initally L=0. P1 and P2 are in contention to acquire the lock.

Problem in Implementing Locks Process 1 Process 2 LW R1, Addr(L) Context Switch LW R1, Addr(L) BNEZ wait ADDI R1, R1, 1 # Critical Section # Context Switch wait: LW R1, Addr(L) BNEZ wait ADDI R1, R1, 1 SW R1, Addr(L) Initally L=0. P1 and P2 are in contention to acquire the lock.

Atomic Exchange Hardware support for lock implementation

Atomic Exchange Hardware support for lock implementation Atomic exchange: Swap contents between register and memory.

Atomic Exchange Test&Set Takes one memory operand and a register operand Test&Set Lock tmp = Lock Lock = 1 return tmp Test&Set occurs atomically (indivisibly).

Atomic Exchange Test&Set Takes one memory operand and a register operand Test&Set Lock tmp = Lock Lock = 1 return tmp Test&Set occurs atomically (indivisibly). Atomic Read-Modify-Write (RMW) instruction

Lock Implementation lock: Test&Set R1, L BNZ R1, lock Critical Section SW R0, L PP R1 R1 L L MM MM

Lock Implementation lock: Test&Set R1, L BNZ R1, lock PP 1 R1 R1 Critical Section SW R0, L 0 L L MM MM

Lock Implementation lock: Test&Set R1, L BNZ R1, lock PP 0 R1 R1 Critical Section SW R0, L 1 L L MM MM

Test&Set Implementation Test&Set R1, L t t = L; L; # Store a copy of of Lock L = 1; 1; # Set Lock R1 R1 t; t; # Write original value of of the Lock in in R1 R1 PP R1 R1 L L MM MM

Test&Set Implementation Test&Set R1, L t t = L; L; # Store a copy of of Lock L = 1; 1; # Set Lock R1 R1 t; t; # Write original value of of the Lock in in R1 R1 2 stores (t, (t, L) L) and 1 load (R1) (atomic practically 1 instruction) PP MM MM R1 R1 L L The atomic read-modify-write hardware primitive facilitates synchronization implementations (locks, barriers, etc.)

Test&Set Issues T&S is an atomic Read-Modify-Write instruction 2 Stores, 1 Load Atomic

Test&Set Issues T&S is an atomic Read-Modify-Write instruction 2 Stores, 1 Load Atomic Consider N processors executing T&S on a single lock variable L

Test&Set Issues T&S is an atomic Read-Modify-Write instruction 2 Stores, 1 Load Atomic Consider N processors executing T&S on a single lock variable

Test&Set Issues T&S is an atomic Read-Modify-Write instruction 2 Stores, 1 Load Atomic Consider N processors executing T&S on a single lock variable Redundant, useless work Performance loss

Test&Set Issues T&S is an atomic Read-Modify-Write instruction 2 Stores, 1 Load Atomic Use LL-SC

LL-SC Example LD R2, X SW R2, X

LL-SC Example Atomic execution required! LD R2, X SW R2, X

LL-SC Example Atomic execution required! LD R2, X SW R2, X LL R2, X SC R2, X

LL-SC Example Atomic execution required! LD R2, X SW R2, X LL R2, X SC R2, X X: X:

LL-SC Example Atomic execution required! LD R2, X SW R2, X LL R2, X SC R2, X X: X: Some other thread has has modified X

LL-SC Example Atomic execution required! LD R2, X SW R2, X LL R2, X SC R2, X X: X: Some other thread has has modified X R2 R2 is is filled filled with with a special value indicating failure of of SC SC

LL-SC Example Atomic execution required! LD R2, X SW R2, X LL R2, X SC R2, X X: X: if (R2 == )... Some other thread has has modified X R2 R2 is is filled filled with with a special value indicating failure of of SC SC

Load Linked and Store Conditional LL records the lock variable address in a table

Load Linked and Store Conditional LL records the lock variable address in a table Snoop on the bus; Record the first Write Invalidate request on the lock variable in the table.

Load Linked and Store Conditional LL records the lock variable address in a table Snoop on the bus; Record the first Write Invalidate request on the lock variable in the table. Before the thread updates its cache copy, SC checks the table. If flag is set, SC fails (no coherence traffic). Fail is indicated by a special value in the register.

Spin Lock using LL-SC lockit: LL R2, 0(R1) BNEZ R2, lockit DADDUI R2, R0, #1 SC R2, 0(R1) BEQZ R2, lockit ; no coherence traffic ; not available, keep spinning ; put value 1 in R2 ; store-conditional succeeds if no one ; updated the lock since the last LL ; confirm that SC succeeded, else keep trying

Spin Lock using LL-SC lockit: LL R2, 0(R1) BNEZ R2, lockit DADDUI R2, R0, #1 SC R2, 0(R1) BEQZ R2, lockit Spin lock with lower coherence traffic. ; no coherence traffic ; not available, keep spinning ; put value 1 in R2 ; store-conditional succeeds if no one ; updated the lock since the last LL ; confirm that SC succeeded, else keep trying