Cache-Aware Lock-Free Queues for Multiple Producers/Consumers and Weak Memory Consistency

Cache-Aware Lock-Free Queues for Multiple Producers/Consumers and Weak Memory Consistency
Anders Gidenstam, Håkan Sundell (School of Business and Informatics, University of Borås)
Philippas Tsigas (Distributed Computing and Systems Group, Department of Computer Science and Engineering, Chalmers University of Technology)

Outline
- Introduction
- Lock-free synchronization
- The Problem & Related work
- The new lock-free queue algorithm
- Experiments
- Conclusions
Anders Gidenstam, University of Borås

Synchronization on a shared object
Lock-free synchronization:
- Concurrent operations without enforcing mutual exclusion
- Avoids: blocking (or busy waiting), convoy effects, priority inversion and the risk of deadlock
- Progress guarantee: at least one operation always makes progress
(Figure: processes P1-P4 operating on a shared object.)

Correctness of a concurrent object
Desired semantics of a shared data object: linearizability [Herlihy & Wing, 1990]
- For each operation invocation there must be one single time instant during its duration where the operation appears to take effect.
- The observed effects should be consistent with a sequential execution of the operations in that order.
(Figure: overlapping operations O1, O2, O3 and a consistent sequential order O1, O3, O2.)

System Model
- Processes can read/write single memory words
- Synchronization primitives built into the CPU and memory system: atomic read-modify-write (i.e. a critical section of one instruction)
- Examples: Compare-and-Swap, Load-Linked / Store-Conditional
(Figure: CPUs connected to a shared memory.)

System Model: Memory Consistency
Within a process:
- Reads/writes may reach memory out of order
- Reads of the process' own writes appear in program order
Atomic synchronization primitive/instruction (e.g. single-word Compare-and-Swap):
- Atomic, and acts as a memory barrier for the process' own reads and writes: all of its reads/writes before the barrier are done before it, and all after it are done after
- The affected cache block is held exclusively
(Figure: CPU cores, each with a store buffer and cache, over a shared memory or shared cache.)

The Problem
Concurrent FIFO queue shared data object with the basic operations enqueue and dequeue.
Desired properties:
- Linearizable and lock-free
- Dynamic size (maximum only limited by available memory)
- Bounded memory usage (in terms of live contents)
- Fast on real systems
(Figure: queue elements A-G between the Head and Tail pointers.)

Related Work: Lock-free Multi-P/C Queues
[Michael & Scott, 1996]
- Linked list, one element per node
- Global shared head and tail pointers
[Tsigas & Zhang, 2001]
- Static circular array of elements (indices 0 to N-1)
- Two different NULL values for distinguishing initially empty from dequeued elements
- Global shared head and tail indices, lazily updated
[Michael & Scott, 1996] + elimination [Moir, Nussbaum, Shalev & Shavit, 2005]
- Same as the above, plus elimination of concurrent pairs of enqueue and dequeue when the queue is near empty
[Hoffman, Shalev & Shavit, 2007], the baskets queue
- Linked list, one element per node
- Reduces contention between concurrent enqueues after a conflict
- Needs stronger memory management than M&S (SLFRC or Beware&Cleanup)

The Algorithm
Basic idea: cut and unroll the circular array queue.
- Primary synchronization is on the elements: Compare-and-Swap, where the life cycle NULL1 -> value -> NULL2 avoids the ABA problem
- Head and tail both move to the right
- Needs an infinite array of elements

The Algorithm
Basic idea: creating an infinite array of elements.
- Divide it into blocks of elements, and link the blocks together
- New empty blocks are added as needed
- Emptied blocks are marked deleted and eventually reclaimed
- Block fields: elements, next, (filled and emptied flags), deleted flag
- The result is a linked chain of dynamically allocated blocks
Lock-free memory management is needed for safe reclamation: Beware&Cleanup [Gidenstam, Papatriantafilou, Sundell & Tsigas, 2009]
(Figure: chain of blocks between globaltailblock and globalheadblock.)

The Algorithm
Thread-local storage keeps the last used head block/index for enqueue and tail block/index for dequeue.
- Reduces the need to read/update global shared variables
(Figure: thread B's headblock/head and tailblock/tail pointing into the chain of blocks.)

The Algorithm: Enqueue
- Find the right block (first via TLS, then TLS->next or globalheadblock)
- Search the block for the first empty element
(Figure: thread B scanning a block near globalheadblock.)

The Algorithm: Enqueue (continued)
- Find the right block (first via TLS, then TLS->next or globalheadblock)
- Search the block for the first empty element
- Update the element with CAS (also the linearization point)
(Figure: thread B adding an element with CAS.)

The Algorithm: Dequeue
- Find the right block (first via TLS, then TLS->next or globaltailblock)
- Search the block for the first valid element
(Figure: thread B scanning a block near globaltailblock.)

The Algorithm: Dequeue (continued)
- Find the right block (first via TLS, then TLS->next or globaltailblock)
- Search the block for the first valid element
- Remove it with CAS, replacing it with NULL2 (the linearization point)
(Figure: thread B removing an element with CAS.)

The Algorithm: Maintaining the chain of blocks
- A helping scheme is used when moving between blocks
Invariants to be maintained:
- globalheadblock points to the newest block or the block before it
- globaltailblock points to the oldest active block (not deleted) or the block before it

Maintaining the chain of blocks: updating globaltailblock, Case 1
- The leader thread finds the block empty
- If needed, it helps to ensure that globaltailblock points to tailblock (or a newer block)
- When the helping is done, it sets the delete mark on the block
- It then updates the globaltailblock pointer and moves its own tailblock pointer

Maintaining the chain of blocks: updating globaltailblock, Case 2 (way out of date)
- tailblock->next is marked deleted
- Restart from globaltailblock

Minding the Cache
- Blocks occupy one cache line each
- The cache lines used by enqueue vs. dequeue are disjoint (except when the queue is near empty)
- An enqueue/dequeue causes coherence traffic only for the affected block
- Scanning for the head/tail involves a single cache line

Scanning in a weak memory model?
Both scanning for an empty element (enqueue) and scanning for the first item to dequeue are safe. Key observations:
- Life cycle for element values: NULL1 -> value -> NULL2
- Elements are updated with CAS, which requires the old value to be the expected one
- Scanning only skips values that are later in the life cycle
- Reading an old (stale) value is safe: the subsequent CAS will simply fail

Experimental evaluation: micro benchmark
Threads execute enqueue and dequeue operations on a shared queue under high contention.
Test configurations:
1. Random 50% / 50%, initial size 0
2. Random 50% / 50%, initial size 1000
3. 1 producer / N-1 consumers
4. N-1 producers / 1 consumer
Measured: throughput in items/sec, i.e. the number of dequeues not returning EMPTY.

Experimental evaluation: algorithms and platform
Algorithms:
- [Michael & Scott, 1996]
- [Michael & Scott, 1996] + elimination [Moir, Nussbaum, Shalev & Shavit, 2005]
- [Hoffman, Shalev & Shavit, 2007]
- [Tsigas & Zhang, 2001]
- The new cache-aware queue [Gidenstam, Sundell & Tsigas, 2010]
PC platform:
- CPU: Intel Core i7 920 @ 2.67 GHz (4 cores, 2 hardware threads each)
- RAM: 6 GB DDR3 @ 1333 MHz
- Windows 7, 64-bit

Experimental evaluation (i)-(iv)
(Throughput graphs for the four test configurations; figures not included in the transcription.)

Conclusions
The cache-aware lock-free queue is the first lock-free queue algorithm for multiple producers/consumers with all of the following properties:
- Designed to be cache-friendly
- Designed for the weak memory consistency provided by contemporary hardware
- Disjoint-access parallel (except when near empty)
- Uses thread-local storage to reduce communication
- Uses a linked list of array blocks for efficient dynamic-size support

Thank you for listening! Questions?