Lock Oscillation: Boosting the Performance of Concurrent Data Structures


Lock Oscillation: Boosting the Performance of Concurrent Data Structures. Panagiota Fatourou, FORTH ICS & University of Crete. Nikolaos D. Kallimanis, FORTH ICS.

The Multicore Era The dominance of multicore machines necessitates the development of efficient parallel software. Parallelism may be inefficient due to the synchronization cost of the parts that cannot be parallelized. Hence, there is a need for efficient synchronization mechanisms with low cost.

The Cost of Synchronization Synchronization requests (e.g. accesses to the same shared data) must be executed in mutual exclusion. The best time to execute m such requests is the time required by a single thread to execute them sequentially, sidestepping the synchronization protocol. Ideally, one thread undertakes the task of executing all m synchronization requests, while the rest of the threads execute only their local workload. In practice this is never the case: contention effects may have a drastic impact on performance.

The Basics of the Combining Technique [Figure: the list of synchronization requests, Tail -> req 1 -> req 2 -> req 3 -> req 4] The combining technique significantly enhances performance. Each thread announces its operation by appending a node to the list. A thread attempts to become the combiner and serve, in addition to its own request, the active requests of other threads. A thread that wants to perform a synchronization operation: 1. announces its request, and then 2. either tries to become the combiner (not always successfully), 3. or performs local spinning until the combiner performs its request. The combiner applies, in addition to its own operation, the other announced operations before releasing the lock.
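To make the pattern concrete, here is a minimal C11 sketch in the spirit of CC-Synch; the names (node_t, apply_op, comb_init, the bound H) are ours for illustration and this is not the paper's exact code:

#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define H 1024                             /* combining bound (example value) */

typedef struct node {
    void *(*req)(void *);                  /* announced request               */
    void *arg, *ret;
    _Atomic bool wait;                     /* owner spins while this is true  */
    bool completed;                        /* set once a combiner applied it  */
    struct node *_Atomic next;
} node_t;

static _Atomic(node_t *) Tail;             /* tail of the request list        */

void comb_init(node_t *dummy) {            /* install the initial dummy node  */
    dummy->req = NULL;
    dummy->completed = false;
    atomic_store(&dummy->wait, false);
    atomic_store(&dummy->next, NULL);
    atomic_store(&Tail, dummy);
}

/* mine points to the thread's spare node and is updated for recycling. */
void *apply_op(node_t **mine, void *(*req)(void *), void *arg) {
    node_t *next = *mine;
    atomic_store(&next->next, NULL);
    atomic_store(&next->wait, true);
    next->completed = false;

    node_t *cur = atomic_exchange(&Tail, next);   /* announce with one SWAP   */
    cur->req = req;
    cur->arg = arg;
    atomic_store(&cur->next, next);
    *mine = cur;                                  /* recycle cur on next call */

    while (atomic_load(&cur->wait))               /* local spinning           */
        ;
    if (cur->completed)                           /* a combiner served us     */
        return cur->ret;

    /* We hold the lock: combine up to H requests, then hand the lock over. */
    node_t *p = cur, *nx;
    int served = 0;
    while ((nx = atomic_load(&p->next)) != NULL && served < H) {
        p->ret = p->req(p->arg);                  /* apply sequentially       */
        p->completed = true;
        atomic_store(&p->wait, false);            /* wake the requester       */
        p = nx;
        served++;
    }
    atomic_store(&p->wait, false);   /* p's owner becomes the next combiner   */
    return cur->ret;
}

A thread calls apply_op with a pointer to its spare node, the sequential function implementing its operation, and the argument; the operation is executed exactly once, either by the thread itself acting as combiner or by another thread currently holding the combiner role.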

Related Work Combining Synchronization Protocols Blocking: Oyama Algorithm: Oyama, Taura, and Yonezawa, PDSIA '99. Flat-Combining: Hendler, Incze, Shavit, and Tzafrir, SPAA '10. CC-Synch: Fatourou and Kallimanis, PPoPP '12. Wait-Free: P-Sim: Fatourou and Kallimanis, SPAA '11. Other synchronization protocols have lower or similar performance to that of CC-Synch.

Why is performance so low compared to the ideal?

Why is performance so low compared to the ideal? For announcing requests: 1. At least one cache line is invalidated. For serving requests: 2. The combiner incurs a cache miss for reading each request and its arguments. 3. The combiner causes at least one cache line invalidation to wake up each requesting thread. 4. Requests are usually not placed at consecutive addresses, so the prefetcher does not help. [Figure: the list of synchronization requests, req 1 -> req 2 -> req 3 -> req 4]

Is it possible to further improve the performance?

Our Contribution I Osci batches, in a single node, the synchronization requests initiated by multiple threads running on the same core. Such a fat node contains more than one request and is appended to the list by performing just a single expensive atomic operation. 1. More requests are announced with fewer remote cache line invalidations. 2. With a single cache miss, the combiner efficiently applies more than one request. 3. More than one requesting thread is woken up with one cache line invalidation. 4. The processor's prefetcher handles the reading of announced requests more efficiently. When Osci is combined with cheap context switching (i.e. user-level threads), it performs extremely well, outperforming by far all previous state-of-the-art synchronization algorithms.

Our Contribution II We discuss PSimX, a simple variant of PSim with significantly improved performance. It ensures wait-freedom, and its performance is much closer to the ideal than that of PSim. Based on PSimX, it is straightforward to implement useful complex primitives (e.g. CAS on multiple words) in a wait-free manner, at a very low cost. We built concurrent queues based on Osci and PSimX which outperform all state-of-the-art concurrent queue implementations. We built concurrent stacks based on Osci and PSimX which outperform all state-of-the-art concurrent stack implementations.
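To illustrate the multi-word CAS claim: in a universal construction such as PSimX, an operation is just a sequential function applied to the simulated state, so a double-word CAS can be expressed as the plain sequential code below (mcas2_req_t and mcas2_seq are our illustrative names, not the paper's API; the construction, not this code, provides atomicity and wait-freedom):

typedef unsigned long word_t;

typedef struct {
    word_t *addr1, *addr2;      /* words inside the simulated state */
    word_t  exp1,  exp2;        /* expected values                  */
    word_t  new1,  new2;        /* new values                       */
} mcas2_req_t;

/* Executed as the sequential code of an operation; the universal
   construction guarantees it appears atomic, so no atomics are needed. */
int mcas2_seq(mcas2_req_t *r) {
    if (*r->addr1 != r->exp1 || *r->addr2 != r->exp2)
        return 0;               /* some expected value did not match */
    *r->addr1 = r->new1;
    *r->addr2 = r->new2;
    return 1;
}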

The Osci Synchronization Technique: General Idea [Figure: Tail -> fat node (threads @ core_x) -> fat node (threads @ core_y) -> fat node (threads @ core_z) -> NULL; each node has next and door fields] Osci maintains: a linked list of nodes that store synchronization requests, and the shared variables implementing the simulated state. Each node of the list contains the requests announced by multiple active threads running on the same core. This fat node is appended to the list by performing a single expensive synchronization primitive (i.e. SWAP). One of the threads that have announced requests in the head node of the list plays the role of the combiner.
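A rough C11 sketch of the data layout this slide suggests; every name here (rec_t, osci_node_t, the door constants, the per-core Announce array, the sizes) is our guess for illustration, not the paper's definition:

#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define THREADS_PER_CORE 8            /* example value */
#define NUM_CORES 64                  /* example value */

typedef struct {                      /* one record per core-mate        */
    void *(*req)(void *);             /* announced request, NULL = none  */
    void *arg, *ret;
    _Atomic bool completed;           /* set once the request is applied */
    _Atomic bool locked;              /* owner waits while this is true  */
} rec_t;

typedef enum { DOOR_CLOSED, DOOR_LOCKED, DOOR_OPEN } door_t;

typedef struct osci_node {            /* a fat node: one batch per core  */
    _Atomic door_t door;              /* LOCKED -> OPEN -> CLOSED        */
    int director;                     /* slot of the director thread     */
    rec_t recs[THREADS_PER_CORE];
    struct osci_node *_Atomic next;
} osci_node_t;

static _Atomic(osci_node_t *) Tail;                /* global request list */
static _Atomic(osci_node_t *) Announce[NUM_CORES]; /* open node per core  */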

The Osci Synchronization Technique: Requesters' Side [Figure: a fat node for threads @ core_x, reachable through Announce, with a door field taking the values CLOSED, LOCKED, OPEN] Each thread initially allocates two nodes. The first thread (the director) among those running on the same core that wants to apply a request successfully installs (i.e. with a successful CAS) a node to Announce. After the director has recorded its request: door: LOCKED -> OPEN, and it calls Yield to allow the other threads running on the same core to run. All other threads on the same core: 1. run their computation, 2. eventually announce their requests, and 3. call Yield. Whenever the director is rescheduled: door: OPEN -> CLOSED, and it announces the node to the list of requests. The director is the only thread that can later play the role of the combiner.
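Continuing the sketch above, the requester side could look roughly as follows, assuming cooperative user-level threads so that threads on the same core interleave only at yield() points; alloc_node is assumed to return a zeroed node with door == DOOR_LOCKED, node recycling is omitted, and memory-ordering fine points are glossed over:

void yield(void);                      /* user-level context switch      */
osci_node_t *alloc_node(void);         /* zeroed node, door=DOOR_LOCKED  */
void combine(osci_node_t *node);       /* combiner side, sketched below  */

void *osci_apply(int core, int slot, void *(*req)(void *), void *arg) {
    bool director = false;
    osci_node_t *node;
    for (;;) {
        node = atomic_load(&Announce[core]);
        if (node != NULL && atomic_load(&node->door) == DOOR_OPEN)
            break;                              /* join the open fat node */
        if (node == NULL) {
            osci_node_t *fresh = alloc_node();  /* try to become director */
            fresh->director = slot;
            osci_node_t *exp = NULL;
            if (atomic_compare_exchange_strong(&Announce[core], &exp, fresh)) {
                node = fresh;
                director = true;
                break;
            }
        }
        yield();                                /* door not open yet      */
    }
    rec_t *r = &node->recs[slot];               /* announce our request   */
    r->arg = arg;
    atomic_store(&r->completed, false);
    atomic_store(&r->locked, true);
    r->req = req;
    if (director) {
        atomic_store(&node->door, DOOR_OPEN);   /* door: LOCKED -> OPEN   */
        yield();                                /* core-mates announce    */
        atomic_store(&node->door, DOOR_CLOSED); /* door: OPEN -> CLOSED   */
        atomic_store(&Announce[core], NULL);
        osci_node_t *pred = atomic_exchange(&Tail, node);  /* one SWAP    */
        if (pred == NULL)
            combine(node);                      /* empty list: we combine */
        else
            atomic_store(&pred->next, node);    /* a combiner reaches us  */
    }
    while (atomic_load(&r->locked))
        yield();                                /* wait to be unlocked    */
    if (!atomic_load(&r->completed))
        combine(node);             /* released as the new combiner        */
    return r->ret;
}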

The Osci Synchronization Technique: Combiner's Side [Figure: Tail -> fat nodes for threads @ core_x, core_y, core_z -> NULL; each request record holds <request, ret, completed, locked>] A combiner serves the requests of the list. After applying the request of some thread, it unlocks that thread by setting completed = true and locked = false. Whenever the combiner thread gives up its role, it identifies the director of the next node (if any) to be the new combiner. If the list is non-empty, then there is exactly one combiner. If the list is empty, then no combiner exists.
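And the matching combiner side under the same assumptions; the detach step uses an MCS-style CAS on Tail, which is our simplification rather than the paper's exact handoff:

void combine(osci_node_t *node) {
    /* serve every announced request in this fat node */
    for (int i = 0; i < THREADS_PER_CORE; i++) {
        rec_t *r = &node->recs[i];
        if (r->req != NULL && !atomic_load(&r->completed)) {
            r->ret = r->req(r->arg);              /* apply sequentially */
            atomic_store(&r->completed, true);
            atomic_store(&r->locked, false);      /* wake the requester */
        }
    }
    /* hand over the combiner role, or detach if no node follows */
    osci_node_t *nxt = atomic_load(&node->next);
    if (nxt == NULL) {
        osci_node_t *exp = node;
        if (atomic_compare_exchange_strong(&Tail, &exp, NULL))
            return;                               /* list empty: done   */
        while ((nxt = atomic_load(&node->next)) == NULL)
            yield();                              /* appender in flight */
    }
    rec_t *d = &nxt->recs[nxt->director];
    atomic_store(&d->locked, false);  /* completed stays false, so nxt's
                                         director becomes the combiner  */
}

Releasing the next director with completed == false is what transfers the combiner role: back in osci_apply, a thread that is unlocked without its request having been applied calls combine on its own node.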

Performance Evaluation I Osci outperforms CC-Synch by a factor of up to 11. The performance advantages of Osci over all other algorithms are even higher. PSimX outperforms all algorithms other than Osci.

Performance Evaluation II Concurrent queues based on Osci and PSimX outperform: LCRQ (Morrison & Afek '13), CC-Queue (Fatourou & Kallimanis '12), SimQueue (Fatourou & Kallimanis '11), MS-Queue (Michael & Scott '96), and the two-lock queue (Michael & Scott '96). Concurrent stacks based on Osci and PSimX outperform: CC-Stack (Fatourou & Kallimanis '12), SimStack (Fatourou & Kallimanis '11), the CLH-Stack, and the lock-free stack (Treiber '86).

Performance Analysis

Algorithm     cache misses (all levels)   cycles in backend stalls   combining degree
Osci-x64      0.20                        247                        1404
PSim-x64      0.24                        2306                       1307
H-Synch-x32   0.47                        666                        32
CC-Synch      0.47                        4210                       1079
PSim          0.4                         14300                      22

Osci incurs the fewest cache misses per operation, and its CPU cycles spent in backend stalls per operation are also the lowest. Moreover, Osci achieves the highest combining degree. PSimX likewise incurs few cache misses per operation and achieves a high combining degree.

Thank You