Message Passing Improvements to Shared Address Space Thread Synchronization Techniques DAN STAFFORD, ROBERT RELYEA




2 Agenda
- Background
  - Motivation
  - Remote Memory Requests (RMRs)
- Shared Address Space (SAS) Synchronization
  - Remote Core Locking
  - Combiner Approach
  - Wait-Free Locks
- Limitations of SAS
  - Performance tied to the cache coherency model
- Hybrid Solution with Message Passing
  - Server Approach: MP-SERVER
  - Combiner Approach: HYBCOMB
- Performance Evaluation
- Conclusion

3 Motivation
- Application acceleration through many-core processors
- Obstacles to efficient parallelization
  - Artificial communication
  - Critical sections and synchronization
    - Effectively serialize parallel processing
- Commonly shared objects involving synchronization
  - Queues
  - Stacks

4 Remote Memory Requests (RMRs)
- An access to data shared between all processes that requires communication with shared memory
- Occurs when the data is not readily available in the local cache of the process
- Generated by any update or access to a shared variable that is not present in the core's cache

5 Traditional SAS Synchronization Techniques
- Remote Core Locking
  - A dedicated process (server) executes critical sections
- Combiner Approach
  - Any process adopts critical section execution duty when needed
- Wait-Free Objects
  - Locally modify the shared object and commit only if no other thread has modified it

6 Remote Core Locking
- Transfer execution of a critical section (CS) to the server
- The server is the only entity that can modify variables in the critical section
- Improves data locality
  - Reduces cache misses
  - Reduces hot spots

7 RCL Request List [2]
- Data sent to the server
  - Address of the lock
  - Variables (encapsulated in a structure)
  - Address of the CS function
- Stored in a cache line for SAS
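The request slot and server loop described on the two slides above can be modeled in a few lines. This is a minimal Python sketch, not the RCL implementation from [2]: the names `Request`, `serve_once`, `remote_execute` and `run_rcl_demo` are hypothetical, Python object references stand in for the lock and function addresses, and a `threading.Event` stands in for the client spinning on its cache line.

```python
import threading
import time
from dataclasses import dataclass, field
from typing import Any, Callable, Optional


@dataclass
class Request:
    """One client's request slot (a cache line in the SAS implementation)."""
    lock: Optional[threading.Lock] = None             # stands in for "address of the lock"
    context: Any = None                               # variables, encapsulated in a structure
    function: Optional[Callable[[Any], Any]] = None   # "address of the CS function"; None = empty slot
    result: Any = None
    done: threading.Event = field(default_factory=threading.Event)


def serve_once(slots):
    """One server pass over all slots; returns how many critical sections ran."""
    served = 0
    for slot in slots:
        fn = slot.function
        if fn is not None:
            with slot.lock:                 # only the server ever runs the CS
                slot.result = fn(slot.context)
            slot.function = None            # free the slot before waking the client
            slot.done.set()
            served += 1
    return served


def remote_execute(slot, lock, fn, context):
    """Client side: publish a request in the slot and wait for the server."""
    slot.done.clear()
    slot.lock, slot.context = lock, context
    slot.function = fn                      # written last: this publishes the request
    slot.done.wait()
    return slot.result


def run_rcl_demo(n_clients=3, n_ops=20):
    counter = {"value": 0}
    lock = threading.Lock()
    slots = [Request() for _ in range(n_clients)]
    stop = threading.Event()

    def increment(ctx):
        ctx["value"] += 1
        return ctx["value"]

    def server():
        while not stop.is_set():
            if serve_once(slots) == 0:
                time.sleep(0.0001)          # yield instead of burning the interpreter lock

    def client(slot):
        for _ in range(n_ops):
            remote_execute(slot, lock, increment, counter)

    srv = threading.Thread(target=server)
    srv.start()
    clients = [threading.Thread(target=client, args=(s,)) for s in slots]
    for t in clients:
        t.start()
    for t in clients:
        t.join()
    stop.set()
    srv.join()
    return counter["value"]
```

Because only the server thread touches `counter`, the shared data never migrates between clients, which is the data-locality benefit the slides attribute to RCL.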

8 Remote Core Locking

9 Combiner Approach
- No dedicated server to handle CS requests
- A head node, or combiner, services requests
- When a CS is reached, the task is submitted to the combiner
  - If there is no combiner, the process submitting the task becomes the combiner
- The combiner runs until the request queue is empty or h requests have been serviced
- SAS implementations
  - CC-Synch
  - H-Synch
- Message passing implementations
  - HybComb

10 CC-Synch
- Coherent-Cache Synch
- List of CS requests maintained in shared memory
- Atomic operations are used to access the CS request list
- When a process reaches a critical section, it
  - Adds the request to the list
  - Tries to acquire a global lock
  - If successful
    - The thread is now the head node, or combiner
    - Services up to h requests
    - Then returns
  - If not successful
    - Spins until it becomes the head node or its CS is completed
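The combiner steps listed above can be sketched as follows. This is a simplified Python model of the slide's description, not the published CC-Synch algorithm from [3]: the class `Combiner` and the helper `run_combiner_counter` are hypothetical names, a `deque` stands in for the shared request list, and a non-blocking `threading.Lock` plays the role of the global lock that elects the combiner.

```python
import threading
from collections import deque


class Combiner:
    def __init__(self, h=8):
        self.h = h                               # max requests one combiner services
        self.pending = deque()                   # shared list of CS requests
        self.combiner_lock = threading.Lock()    # the holder is the current combiner

    def execute(self, fn, arg):
        req = {"fn": fn, "arg": arg, "result": None, "done": threading.Event()}
        self.pending.append(req)                 # add the request to the list
        while not req["done"].is_set():
            if self.combiner_lock.acquire(blocking=False):
                try:
                    self._combine()              # this thread is now the combiner
                finally:
                    self.combiner_lock.release()
            else:
                req["done"].wait(0.001)          # wait, then re-check or retry
        return req["result"]

    def _combine(self):
        # Service up to h requests, or stop early when the list is empty.
        for _ in range(self.h):
            try:
                req = self.pending.popleft()
            except IndexError:
                return
            req["result"] = req["fn"](req["arg"])
            req["done"].set()


def run_combiner_counter(n_threads=4, n_ops=50, h=3):
    state = {"value": 0}
    comb = Combiner(h)

    def add(x):
        state["value"] += x                      # runs only under the combiner lock
        return state["value"]

    def worker():
        for _ in range(n_ops):
            comb.execute(add, 1)

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return state["value"]
```

A combiner that hits the limit h before reaching its own request simply loops and competes for the lock again, so every submitted request is eventually serviced by some thread.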

11 H-Synch
- Hierarchical Synch
- Implementation for clusters
  - m processors, c clusters
  - m/c processors per cluster
  - Communication is slower across clusters
- CC-Synch runs locally on each cluster
- CS synchronization across clusters
  - Before the local combiner executes requests, it acquires lock L to ensure other clusters do not access the CS
  - It releases L after servicing the requests


12 Wait-Free Construction
- Goal: eliminate serialization delay in shared objects
- Use atomic primitives on shared objects that execute in a fixed number of cycles
  - CAS
  - LL/SC
  - Fetch&Add
- A combination of the three primitives is often used for synchronization
- Use the wait-free primitives for the combiner identity

13 Wait-Free Construction
- CAS: Compare-and-Swap. Supports two operations
  1. read(O): return the current value of O
  2. CAS(O, u, v): compare O to u
     - If equal, set O to v and return true
     - If not equal, return false
- LL/SC: Load-Linked/Store-Conditional. Supports two operations
  1. LL(O): return the current value of O
  2. SC(O, v) by process p
     - O takes the value v if and only if no other process has performed a successful SC on O since p's last LL(O)
     - Return true if successful, otherwise false
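The semantics of the two primitives can be pinned down with a small executable model. The sketch below is an assumption-laden illustration, not hardware code: `CASObject`, `LLSCObject` and `cas_increment` are hypothetical names, and an internal `threading.Lock` only models the atomicity that the hardware instruction would provide.

```python
import threading


class CASObject:
    """Models an object O supporting read(O) and CAS(O, u, v)."""

    def __init__(self, value):
        self._value = value
        self._atomic = threading.Lock()      # models hardware atomicity

    def read(self):
        with self._atomic:
            return self._value

    def cas(self, expected, new):
        with self._atomic:
            if self._value == expected:
                self._value = new
                return True
            return False


class LLSCObject:
    """Models an object O supporting LL(O) and SC(O, v)."""

    def __init__(self, value):
        self._value = value
        self._version = 0                    # bumped by every successful SC
        self._links = {}                     # thread id -> version seen at its last LL
        self._atomic = threading.Lock()

    def ll(self):
        with self._atomic:
            self._links[threading.get_ident()] = self._version
            return self._value

    def sc(self, new):
        with self._atomic:
            linked = self._links.pop(threading.get_ident(), None)
            if linked != self._version:      # no prior LL, or someone's SC got in first
                return False
            self._value = new
            self._version += 1
            return True


def cas_increment(obj):
    """Typical retry loop: read, compute locally, commit only if unchanged."""
    while True:
        old = obj.read()
        if obj.cas(old, old + 1):
            return old
```

Note that the retry loop in `cas_increment` is lock-free rather than wait-free: an individual thread can lose the race repeatedly, although some thread always makes progress.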

14 Wait-Free Construction
- Fetch&Add. Supports two operations
  1. read(O)
  2. Fetch&Add(O, x)
     - O += x
     - Return the previous value of O
- One memory access
- Minimizes serialization delays
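As an illustration of why a single Fetch&Add suffices where CAS would need a retry loop, the sketch below hands out unique ticket numbers to several threads. The names `FetchAddCounter` and `draw_tickets` are hypothetical, and the internal lock again only models the atomicity of the hardware instruction.

```python
import threading


class FetchAddCounter:
    """Models an object O supporting read(O) and Fetch&Add(O, x)."""

    def __init__(self, value=0):
        self._value = value
        self._atomic = threading.Lock()      # models hardware atomicity

    def read(self):
        with self._atomic:
            return self._value

    def fetch_add(self, x):
        with self._atomic:
            previous = self._value
            self._value += x                 # O += x
            return previous                  # return the previous value of O


def draw_tickets(n_threads=4, per_thread=100):
    counter = FetchAddCounter()
    drawn = [[] for _ in range(n_threads)]

    def worker(i):
        for _ in range(per_thread):
            # One operation per ticket: it never fails and never retries.
            drawn[i].append(counter.fetch_add(1))

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    tickets = sorted(t for row in drawn for t in row)
    return tickets, counter.read()
```

Every call returns a distinct previous value, so the tickets form a gap-free sequence regardless of how the threads interleave.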

15 Shortcomings of Traditional SAS
- Cache coherency (CC) is difficult to scale
- Performance is tied to the cache coherency model
  - RCL and combiners on SAS still have to synchronize the request queue through the cache
  - Short CSes are dominated by cache-coherency stalls
- Optimization requires in-depth understanding of the hardware
  - Most programmers do not understand the underlying hardware architecture
  - HW vendors hide hardware information

16 Thread Synchronization Using Message Passing
- Message passing allows explicit control over communication
- Message passing only
  - MP-Server
- Hybrid SAS and message passing
  - HybComb
  - Message passing is better suited for sending tasks to the combiner
  - SAS is better suited for maintaining the identity of the combiner
  - Meant for a platform with HW support for both SAS and message passing

17 MP-Server
- Essentially RCL using message passing
- The server reads requests from its local message queue
  - No Remote Memory Requests (RMRs) involved
- When sending the response for a finished CS request
  - The server does not wait for the transmission to complete
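The MP-Server structure can be sketched with software queues standing in for hardware message queues. This is only a model under that assumption: Python's `queue.Queue` is itself built on locks over shared memory, so it shows the protocol, not the performance. The names `mp_server`, `call`, `add` and `run_mp_counter` are hypothetical.

```python
import queue
import threading


def mp_server(inbox, state):
    """Server loop: read a request from the local queue, run the CS, reply."""
    while True:
        msg = inbox.get()
        if msg is None:                  # shutdown message (an assumption of this sketch)
            return
        fn, arg, reply = msg
        # put() returns immediately: the server does not wait for the
        # client to receive the response before taking the next request.
        reply.put(fn(state, arg))


def call(inbox, reply, fn, arg):
    """Client side: send the CS as a message, then wait for the result."""
    inbox.put((fn, arg, reply))
    return reply.get()


def add(state, x):
    state["value"] += x                  # only the server thread touches state
    return state["value"]


def run_mp_counter(n_clients=4, n_ops=50):
    state = {"value": 0}
    inbox = queue.Queue()
    server = threading.Thread(target=mp_server, args=(inbox, state))
    server.start()

    def client():
        reply = queue.Queue()            # one reply queue per client
        for _ in range(n_ops):
            call(inbox, reply, add, 1)

    clients = [threading.Thread(target=client) for _ in range(n_clients)]
    for t in clients:
        t.start()
    for t in clients:
        t.join()
    inbox.put(None)
    server.join()
    return state["value"]
```

Compared with the shared-memory RCL design, the clients never poll a shared request slot; all coordination is carried by the two messages per critical section.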

18 HybComb
- HW message passing: synchronization between the combiner and other processes
- SAS and cache: store the identity of the combiner
- When a CS is reached by a process
  1. Check the shared memory for the identity of the combiner
  2. Send the task to the combiner
     - If no combiner is present, become the combiner
     - Write the process identity to the shared memory using CAS(O, u, v)
  3. Wait for completion
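The three steps above can be modeled as follows. This is a hedged sketch, not the HybComb algorithm from [1]: `AtomicCell`, `HybComb` and `run_hybcomb_counter` are hypothetical names, software queues stand in for hardware message passing, and the shared cell stores a pair `(combiner_id, registered_count)` rather than the identity alone. The registration count is an assumption of this sketch, added so that the combiner knows how many messages to drain before it resigns.

```python
import queue
import threading
import time


class AtomicCell:
    """Shared-memory word with read and CAS (lock models hardware atomicity)."""

    def __init__(self, value=None):
        self._value = value
        self._atomic = threading.Lock()

    def read(self):
        with self._atomic:
            return self._value

    def cas(self, expected, new):
        with self._atomic:
            if self._value == expected:
                self._value = new
                return True
            return False


class HybComb:
    def __init__(self, n_threads, h=4):
        self.h = h                                   # max tasks sent to one combiner
        self.cell = AtomicCell(None)                 # None or (combiner_id, registered)
        self.inboxes = [queue.Queue() for _ in range(n_threads)]
        self.replies = [queue.Queue() for _ in range(n_threads)]

    def execute(self, me, fn, arg):
        while True:
            s = self.cell.read()                     # 1. check the combiner identity
            if s is None:
                if self.cell.cas(None, (me, 0)):     # no combiner: become it via CAS
                    return self._combine(me, fn, arg)
            elif s[1] < self.h:
                owner, count = s
                if self.cell.cas(s, (owner, count + 1)):   # register with the combiner
                    self.inboxes[owner].put((me, fn, arg))  # 2. send the task
                    return self.replies[me].get()           # 3. wait for completion
            else:
                time.sleep(0)                        # combiner is full: yield and retry

    def _combine(self, me, fn, arg):
        result = fn(arg)                             # run own critical section first
        served = 0
        while True:
            s = self.cell.read()
            if served < s[1]:                        # a registered task is on its way
                sender, f, a = self.inboxes[me].get()
                self.replies[sender].put(f(a))
                served += 1
            elif self.cell.cas(s, None):             # nothing outstanding: resign
                return result


def run_hybcomb_counter(n_threads=4, n_ops=50, h=3):
    state = {"value": 0}
    hc = HybComb(n_threads, h)

    def add(x):
        state["value"] += x                          # executed only by the current combiner
        return state["value"]

    def worker(me):
        for _ in range(n_ops):
            hc.execute(me, add, 1)

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return state["value"], hc.cell.read()
```

The design choice worth noting is the split the slides describe: the combiner identity lives in one shared word updated with CAS, while the tasks and results travel as messages, so a resigning combiner can never leave a registered task stranded in its inbox.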

19 Performance: Concurrent Counter
- Message passing implementations perform better
- MP-Server has the best throughput and latency
- Little difference between HybComb and the SAS implementations (RCL and CC-Synch) until 15+ application threads [1]

20 Performance [1]
- Much lower percentage of stalls with message passing
- Shorter critical section length

21 Performance: Stacks & Queues
- Message passing still offers a performance advantage with queues and stacks
- Best performance is achieved by limiting the number of locks in the queue [1]

22 Conclusion
- Message passing is a better alternative for thread synchronization
- MP-Server is the best candidate
  - Outperforms HybComb in every characteristic
  - Much lower algorithmic complexity
  - Only message passing HW needed

23 References
[1] D. Petrović, T. Ropars and A. Schiper, "Leveraging hardware message passing for efficient thread synchronization", ACM SIGPLAN Notices, vol. 49, no. 8, pp ,
[2] J. Lozi, F. David, G. Thomas, J. Lawal and G. Muller, "Remote Core Locking: Migrating Critical-Section Execution to Improve the Performance of Multithreaded Applications", Proceedings of the 2012 USENIX Annual Technical Conference,
[3] P. Fatourou and N. Kallimanis, "Revisiting the combining synchronization technique", ACM SIGPLAN Notices, vol. 47, no. 8, p. 257,
[4] P. Fatourou and N. Kallimanis, "Highly-Efficient Wait-Free Synchronization", Theory of Computing Systems, vol. 55, no. 3, pp , 2013.
