A Skiplist-based Concurrent Priority Queue with Minimal Memory Contention

Jonatan Lindén and Bengt Jonsson
Uppsala University, Sweden
December 18, 2013

Contributions
Motivation: improve the performance of a concurrent Discrete Event Simulator.
Outcome: a new lock-free skiplist-based priority queue.
New representation for logically deleted nodes, which minimizes contention.
Improved performance over existing algorithms by 30-80% on multiprocessors.
Linearizable.

Outline
Contributions
Background
Priority queue
Skiplist
Standard solution to increase concurrency
The problem: contention
Our algorithm
Correctness
Evaluation

Priority Queues
Priority queue: a set of (key, value) pairs with two operations, Insert(key, value) and DeleteMin().
Applications: discrete event simulation, numerical algorithms.
Implementations: traditionally implemented on top of heaps or tree data structures. Skiplists [Pugh:1990] have been used for several parallel implementations.
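As a minimal sequential illustration of the two operations, the sketch below uses Python's standard heapq module as a stand-in for the traditional heap-based implementation. The function names insert and delete_min are chosen here for illustration and are not taken from the slides.

```python
import heapq

def insert(pq, key, value):
    """Add a (key, value) pair; a smaller key means higher priority."""
    heapq.heappush(pq, (key, value))

def delete_min(pq):
    """Remove and return the pair with the smallest key, or None if empty."""
    return heapq.heappop(pq) if pq else None

# Example: events keyed by timestamp, as in a discrete event simulation.
pq = []
for key, value in [(5, "e"), (1, "a"), (4, "d")]:
    insert(pq, key, value)
first = delete_min(pq)   # pair with the smallest key
second = delete_min(pq)  # pair with the next smallest key
```

Every delete_min call targets the same minimum element, which is why this operation becomes the point of contention once several threads share the queue.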

Skiplist
A layered linked list.
The lowest-level list is ordered and defines the logical state.
Higher-level lists are shortcuts.
Probabilistic guarantee of logarithmic search time.
The smallest element is at the beginning of the lowest-level list, which is the entry point for DeleteMin.
[Figure: skiplist H 1 2 5 7 T; Insert(4) yields H 1 2 4 5 7 T.]
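The structure described on this slide can be sketched sequentially as follows. This is only an illustration under stated assumptions: the class names, the height cap MAX_LEVEL, and the coin-flip probability of 0.5 are choices made here, and the sketch has no concurrency control.

```python
import random

MAX_LEVEL = 4  # assumed small cap on tower height, enough for a demo

class Node:
    def __init__(self, key, height):
        self.key = key
        self.next = [None] * height  # next[0] belongs to the lowest-level list

class SkipList:
    def __init__(self, seed=0):
        self.rng = random.Random(seed)
        self.tail = Node(float("inf"), MAX_LEVEL)    # sentinel T
        self.head = Node(float("-inf"), MAX_LEVEL)   # sentinel H
        self.head.next = [self.tail] * MAX_LEVEL

    def _predecessors(self, key):
        """For each level, find the last node whose key is smaller than key."""
        preds = [None] * MAX_LEVEL
        node = self.head
        for level in reversed(range(MAX_LEVEL)):     # top level first
            while node.next[level].key < key:
                node = node.next[level]              # follow the shortcut
            preds[level] = node
        return preds

    def insert(self, key):
        height = 1
        while height < MAX_LEVEL and self.rng.random() < 0.5:
            height += 1                              # random tower height
        new = Node(key, height)
        preds = self._predecessors(key)
        for level in range(height):                  # link in bottom-up
            new.next[level] = preds[level].next[level]
            preds[level].next[level] = new

    def keys(self):
        """Walk the lowest-level list, which defines the logical state."""
        out, node = [], self.head.next[0]
        while node is not self.tail:
            out.append(node.key)
            node = node.next[0]
        return out

sl = SkipList()
for k in (1, 2, 5, 7):
    sl.insert(k)
sl.insert(4)  # the Insert(4) from the figure
```

After these calls, sl.head.next[0] is the node with the smallest key, which is where a DeleteMin would start.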

Concurrent skiplist-based priority queues
Skiplists are easy to make concurrent.
Skiplists scale well when concurrent threads access different parts of the structure.
Bottleneck: concurrent DeleteMin operations in a priority queue all try to remove the same element.

Standard solution
Standard solution to increase concurrency:
Logical deletion by setting a delete flag.
Physical deletion unlinks the node after the logical deletion has succeeded.
Lock-free: perform all writes using Compare-and-Swap (CAS). Colocate the delete flag with each next pointer, e.g., in the lowest-order bit [Harris 2001], to make physical deletion safe.
[Figure: list H 1 2 4 5 7 T with a delete flag set on a node.]
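The two-step deletion above can be sketched as follows for a single-level list. Python has no hardware CAS or pointer tagging, so the class AtomicMarkedRef is an emulation introduced here: it keeps a (pointer, flag) pair and guards updates with a lock to stand in for the atomicity of a single-word CAS. All names are hypothetical, and this is an illustration of the general technique rather than the code of any of the cited algorithms.

```python
import threading

class AtomicMarkedRef:
    """Emulates one word holding (pointer, delete flag), updated only by CAS."""
    def __init__(self, ref, mark=False):
        self._pair = (ref, mark)
        self._lock = threading.Lock()  # stands in for hardware atomicity

    def get(self):
        return self._pair

    def cas(self, expected, new):
        with self._lock:
            ref, mark = self._pair
            if ref is expected[0] and mark == expected[1]:
                self._pair = new
                return True
            return False

class Node:
    def __init__(self, key, nxt=None):
        self.key = key
        self.next = AtomicMarkedRef(nxt)

def logical_delete(node):
    """Set the delete flag in the node's own next pointer."""
    while True:
        succ, marked = node.next.get()
        if marked:
            return False  # another thread deleted this node first
        if node.next.cas((succ, False), (succ, True)):
            return True

def physical_delete(pred, node):
    """Unlink a logically deleted node with one CAS on its predecessor."""
    succ, marked = node.next.get()
    assert marked, "physical deletion must follow logical deletion"
    return pred.next.cas((node, False), (succ, False))

def insert_after(pred, key):
    """Fails if pred is deleted: the flag shares a word with the pointer."""
    succ, marked = pred.next.get()
    if marked:
        return False
    return pred.next.cas((succ, False), (Node(key, succ), False))

# Build H -> 1 -> 2 -> T and delete node 1 in two steps.
tail = Node(float("inf"))
n2 = Node(2, tail)
n1 = Node(1, n2)
head = Node(float("-inf"), n1)

deleted = logical_delete(n1)
blocked = insert_after(n1, 1.5)       # rejected, n1's next pointer is flagged
unlinked = physical_delete(head, n1)
```

Because the flag and the pointer are compared together, an insert can never link a new node behind a node that is already logically deleted, which is what makes the later unlink safe.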

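Contention
Bottleneck: the CAS in DeleteMin. Several types of contention:
(i) Multiple CASes compete for the same memory location.
(ii) Updates must be propagated to reads, which serializes concurrent operations such as Insert(6).
[Figure: list H 1 2 4 5 7 T with pointers modified by CAS.]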

Our algorithm

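Our solution
Key idea: no physical deletion after each logical deletion. Instead, delete nodes in batches, by updating the pointers in the head node.
This requires that the logically deleted nodes form a prefix of the list.
[Figure: list H 2 4 5 7 T with a concurrent Insert(1). Linking 1 in at the front gives H 1 2 4 5 7 T, which does not work; the new node should be inserted after the deleted nodes.]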

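Our solution
To guarantee the prefix property, store the delete flag together with the predecessor's next pointer.
[Figure: list H 2 4 5 7 T with Insert(1). The new node is linked in after the deleted nodes, giving H 2 4 1 5 7 T. The deleted nodes are still a prefix.]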
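The effect of moving the flag to the predecessor can be illustrated with a sequential, single-level model. The sketch below is an assumption-laden simplification: the field name succ_deleted, the helper cas_next, and the traversal details are introduced here, and the helper is a plain compare-then-assign rather than a real atomic CAS.

```python
class Node:
    def __init__(self, key, nxt=None):
        self.key = key
        self.next = nxt
        # Delete flag for the SUCCESSOR; conceptually stored in the same
        # word as the next pointer.
        self.succ_deleted = False

def cas_next(node, exp_next, exp_flag, new_next, new_flag):
    """Sequential stand-in for a CAS on the (next, flag) word."""
    if node.next is exp_next and node.succ_deleted == exp_flag:
        node.next, node.succ_deleted = new_next, new_flag
        return True
    return False

def delete_min(head, tail):
    node = head
    while True:
        succ, flagged = node.next, node.succ_deleted
        if succ is tail:
            return None              # no live node left
        if flagged:
            node = succ              # succ is already deleted, keep walking
        elif cas_next(node, succ, False, succ, True):
            return succ.key          # this call deleted succ

def insert(head, key):
    pred = head
    # Skip the deleted prefix first, then live nodes with smaller keys.
    while pred.succ_deleted or pred.next.key < key:
        pred = pred.next
    new = Node(key, pred.next)
    return cas_next(pred, new.next, False, new, False)

def physical_keys(head, tail):
    out, node = [], head.next
    while node is not tail:
        out.append(node.key)
        node = node.next
    return out

# Build H 2 4 5 7 T, delete 2 and 4 logically, then run Insert(1).
tail = Node(float("inf"))
head = Node(float("-inf"), tail)
for k in (2, 4, 5, 7):
    insert(head, k)
removed = [delete_min(head, tail), delete_min(head, tail)]
insert(head, 1)
order = physical_keys(head, tail)
next_min = delete_min(head, tail)
```

The last deleted node's own next pointer is never flagged, so a new node can always be linked in directly behind the deleted prefix; here node 1 lands after node 4 and is still the next minimum returned.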

Our solution
Resolving conflicts between Insert and DeleteMin.
[Figure: animation of Insert(1) linking node 1 into the list H 2 4 5 7 T.]

Physical batch deletion
Physical batch deletion updates the pointers in the head node.
It is done by DeleteMin when the prefix of deleted nodes exceeds a threshold.
[Figure: list H 2 4 1 5 7 T.]
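One possible way to picture the batch step is the sequential, lowest-level-only model below. It is a sketch under assumptions, not the algorithm from the paper: the threshold name BOUND_OFFSET and its value, the field succ_deleted, and the choice to leave the most recently deleted node as the first node after the head are all made here for illustration, and the higher-level head pointers of a real skiplist are ignored.

```python
BOUND_OFFSET = 2  # assumed threshold on the length of the deleted prefix

class Node:
    def __init__(self, key, nxt=None):
        self.key = key
        self.next = nxt
        self.succ_deleted = False  # delete flag for the successor

def delete_min(head, tail):
    observed_first = head.next
    node, offset = head, 0
    while True:
        succ = node.next
        if succ is tail:
            return None
        offset += 1                    # nodes of the prefix seen so far
        if not node.succ_deleted:
            node.succ_deleted = True   # logical deletion of succ
            break
        node = succ
    # succ is the node deleted by this call.
    if offset > BOUND_OFFSET and head.next is observed_first:
        # A single write to the head unlinks every node before succ.
        # head.succ_deleted stays True because succ itself is deleted.
        head.next = succ
    return succ.key

def physical_keys(head, tail):
    out, node = [], head.next
    while node is not tail:
        out.append(node.key)
        node = node.next
    return out

# Build H 1 2 4 5 7 T back to front.
tail = Node(float("inf"))
first = tail
for k in (7, 5, 4, 2, 1):
    first = Node(k, first)
head = Node(float("-inf"), first)

results = [delete_min(head, tail) for _ in range(3)]
remaining = physical_keys(head, tail)
```

In this run the first two calls only set flags, and the third call finds a prefix longer than the threshold and swings the head pointer past nodes 1 and 2 in one step, so the head's next pointer is written once per batch instead of once per deletion.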

Correctness
Linearizable concurrent priority queue.
The correctness proof in the paper is based on assertional reasoning. It follows rather easily after establishing that the lowest-level list consists of a deleted prefix followed by a sorted list, which defines the logical state of the queue.
We have also modeled the algorithm in SPIN and performed extensive state-space exploration.

Evaluation

Comparison of maximal throughput
Compared against other lock-free skiplist-based priority queues:
Sundell & Tsigas (ST): only a single logically deleted node is allowed in the lowest-level list.
Herlihy & Shavit (HS): lock-free adaptation of [Lotan & Shavit]. Logically deleted nodes need not form a prefix.

Comparison of maximal throughput
Benchmark: 50% Insert, 50% DeleteMin, on a 4-socket Intel Sandy Bridge machine.
[Figure: throughput (M operations/s) versus number of threads for New, HS, and ST.]
30-80% improvement in comparison to HS.
1-8 cores: single socket (shared L3 cache).

Evaluation
Benchmark: DES workload, on a 4-socket AMD Bulldozer machine.
[Figure: throughput (M operations/s) versus number of threads for New, HS, and ST.]
30-80% improvement in comparison to HS.
1-2 cores: shared L2 cache.

More resources
http://user.it.uu.se/~jonli208/priorityqueue
BSD-licensed implementation, SPIN model, and extended technical report.
There is a performance bug in the version in the proceedings, in the Restructure algorithm, which updates the head pointers. Please see the extended technical report.

Conclusions
A new linearizable, lock-free, skiplist-based priority queue algorithm.
A new representation of logical deletion, which reduces the number of CASes to critical shared memory locations.
30-80% performance improvement over existing such algorithms.
Thank you.