PERFORMANCE ANALYSIS AND OPTIMIZATION OF SKIP LISTS FOR MODERN MULTI-CORE ARCHITECTURES

Similar documents
Read-Copy Update in a Garbage Collected Environment. By Harshal Sheth, Aashish Welling, and Nihar Sheth

Hierarchical PLABs, CLABs, TLABs in Hotspot

Optimising Multicore JVMs. Khaled Alnowaiser

Short-term Memory for Self-collecting Mutators. Martin Aigner, Andreas Haas, Christoph Kirsch, Ana Sokolova Universität Salzburg

A Skiplist-based Concurrent Priority Queue with Minimal Memory Contention

Linked Lists: Locking, Lock-Free, and Beyond. Companion slides for The Art of Multiprocessor Programming by Maurice Herlihy & Nir Shavit

Concurrent Skip Lists. Companion slides for The Art of Multiprocessor Programming by Maurice Herlihy & Nir Shavit

ECE/CS 757: Homework 1

The Google File System

The benefits and costs of writing a POSIX kernel in a high-level language

JVM Performance Study Comparing Oracle HotSpot and Azul Zing Using Apache Cassandra

A Scalable Lock Manager for Multicores

PebblesDB: Building Key-Value Stores using Fragmented Log Structured Merge Trees

Cache-Aware Lock-Free Queues for Multiple Producers/Consumers and Weak Memory Consistency

JVM Performance Study Comparing Java HotSpot to Azul Zing Using Red Hat JBoss Data Grid

NUMA in High-Level Languages. Patrick Siegler Non-Uniform Memory Architectures Hasso-Plattner-Institut

Cost of Concurrency in Hybrid Transactional Memory. Trevor Brown (University of Toronto) Srivatsan Ravi (Purdue University)

Schism: Fragmentation-Tolerant Real-Time Garbage Collection. Fil Pizlo *

Performance of Non-Moving Garbage Collectors. Hans-J. Boehm HP Labs

OS-caused Long JVM Pauses - Deep Dive and Solutions

Scalable Concurrent Hash Tables via Relativistic Programming

Myths and Realities: The Performance Impact of Garbage Collection

A Quest for Predictable Latency Adventures in Java Concurrency. Martin Thompson

Linked Lists: Locking, Lock- Free, and Beyond. Companion slides for The Art of Multiprocessor Programming by Maurice Herlihy & Nir Shavit

Exploiting the Behavior of Generational Garbage Collector

Optimization of thread affinity and memory affinity for remote core locking synchronization in multithreaded programs for multicore computer systems

CS4021/4521 INTRODUCTION

Questions answered in this lecture: CS 537 Lecture 19 Threads and Cooperation. What s in a process? Organizing a Process

Dynamic Concurrent Van Emde Boas Array

JVM Memory Model and GC

Concurrent Preliminaries

Go Deep: Fixing Architectural Overheads of the Go Scheduler

Concurrent Data Structures Concurrent Algorithms 2016

Transactional Memory: Architectural Support for Lock-Free Data Structures Maurice Herlihy and J. Eliot B. Moss ISCA 93

Operating System. Chapter 4. Threads. Lynn Choi School of Electrical Engineering

Garbage Collection. Hwansoo Han

The Z Garbage Collector Low Latency GC for OpenJDK

An Analysis of Linux Scalability to Many Cores

Real Time: Understanding the Trade-offs Between Determinism and Throughput

JAVA PERFORMANCE. PR SW2 S18 Dr. Prähofer DI Leopoldseder

Fiji VM Safety Critical Java

Towards Parallel, Scalable VM Services

The G1 GC in JDK 9. Erik Duveblad Senior Member of Technical Staf Oracle JVM GC Team October, 2017

Lecture 9: Multiprocessor OSs & Synchronization. CSC 469H1F Fall 2006 Angela Demke Brown

CSE 120 Principles of Operating Systems

Taking a Virtual Machine Towards Many-Cores. Rickard Green - Patrik Nyblom -

Java Without the Jitter

Shenandoah An ultra-low pause time Garbage Collector for OpenJDK. Christine H. Flood Roman Kennke

Scaling the OpenJDK. Claes Redestad Java SE Performance Team Oracle. Copyright 2017, Oracle and/or its afliates. All rights reserved.

Threads. Computer Systems. 5/12/2009 cse threads Perkins, DW Johnson and University of Washington 1

CS377P Programming for Performance Multicore Performance Synchronization

The Z Garbage Collector Scalable Low-Latency GC in JDK 11

A Comprehensive Complexity Analysis of User-level Memory Allocator Algorithms

Towards High Performance Processing in Modern Java-based Control Systems. Marek Misiowiec Wojciech Buczak, Mark Buttner CERN ICalepcs 2011

Threads. Raju Pandey Department of Computer Sciences University of California, Davis Spring 2011

Software transactional memory

MultiJav: A Distributed Shared Memory System Based on Multiple Java Virtual Machines. MultiJav: Introduction

Cascade Mapping: Optimizing Memory Efficiency for Flash-based Key-value Caching

Datenbanksysteme II: Modern Hardware. Stefan Sprenger November 23, 2016

Sustainable Memory Use Allocation & (Implicit) Deallocation (mostly in Java)

The C4 Collector. Or: the Application memory wall will remain until compaction is solved. Gil Tene Balaji Iyengar Michael Wolf

Chapter 4 Threads, SMP, and

Agenda. Designing Transactional Memory Systems. Why not obstruction-free? Why lock-based?

MULTIPROCESSORS AND THREAD-LEVEL. B649 Parallel Architectures and Programming

High Performance Managed Languages. Martin Thompson

RadixVM: Scalable address spaces for multithreaded applications (revised )

Concurrent Garbage Collection

MULTIPROCESSORS AND THREAD-LEVEL PARALLELISM. B649 Parallel Architectures and Programming

Concurrent Counting using Combining Tree

Linked Lists: The Role of Locking. Erez Petrank Technion

Assessing the Scalability of Garbage Collectors on Many Cores

Building Efficient Concurrent Graph Object through Composition of List-based Set

JVM and application bottlenecks troubleshooting

Call Paths for Pin Tools

COMP Parallel Computing. CC-NUMA (2) Memory Consistency

CS 261 Fall Mike Lam, Professor. Virtual Memory

What s in a process?

10/10/ Gribble, Lazowska, Levy, Zahorjan 2. 10/10/ Gribble, Lazowska, Levy, Zahorjan 4

The Z Garbage Collector An Introduction

Fast BVH Construction on GPUs

Elastic Data. Harvey Raja Principal Member Technical Staff Oracle Coherence

Reagent Based Lock Free Concurrent Link List Spring 2012 Ancsa Hannak and Mitesh Jain April 28, 2012

Online Course Evaluation. What we will do in the last week?

RiSE: Relaxed Systems Engineering? Christoph Kirsch University of Salzburg

Scheduling Transactions in Replicated Distributed Transactional Memory

Process. Heechul Yun. Disclaimer: some slides are adopted from the book authors slides with permission

Java On Steroids: Sun s High-Performance Java Implementation. History

The Multikernel: A new OS architecture for scalable multicore systems Baumann et al. Presentation: Mark Smith

VoltDB vs. Redis Benchmark

Computer Architecture

September 15th, Finagle + Java. A love story (

TrafficDB: HERE s High Performance Shared-Memory Data Store Ricardo Fernandes, Piotr Zaczkowski, Bernd Göttler, Conor Ettinoffe, and Anis Moussa

Advance Operating Systems (CS202) Locks Discussion

CPHash: A Cache-Partitioned Hash Table Zviad Metreveli, Nickolai Zeldovich, and M. Frans Kaashoek

Azul Systems, Inc.

Competition => Collaboration a runtime-centric view of parallel computation

Designing experiments Performing experiments in Java Intel s Manycore Testing Lab

Adaptive Lock. Madhav Iyengar < >, Nathaniel Jeffries < >

Cache Coherence (II) Instructor: Josep Torrellas CS533. Copyright Josep Torrellas

Enabling Java-based VoIP backend platforms through JVM performance tuning

Transcription:

PERFORMANCE ANALYSIS AND OPTIMIZATION OF SKIP LISTS FOR MODERN MULTI-CORE ARCHITECTURES Anish Athalye and Patrick Long Mentors: Austin Clements and Stephen Tu 3 rd annual MIT PRIMES Conference

Sequential Origins Research on data structures produced several fast sequential designs Not designed for concurrent access In the 90's, many extended to support concurrent operations Multi-core processors changing Exponential growth in # of cores New architectures

Parallelism is the future Source: http://www.gotw.ca/publications/concurrency-ddj.htm

Contributions Focused on the skip list Techniques to improve scalability Basic analytical model of scalability performance Analyses of implementations in Java and C++

MODERN MULTI-CORES

Cache Hierarchy Source: http://mechanical-sympathy.blogspot.co.uk/2013/02/cpu-cache-flushing-fallacy.html

Cache Coherence and Ownership Concurrent access and mutation presents a challenge Cache coherence protocol Cores must take ownership to mutate data Concurrent writes cause contention Cores must have up-to-date copy to read data Requires cache line transfer if data modified Cache contention creates bottlenecks

SKIP LISTS

Skip Lists Probabilistic data structure implementing an ordered set 3 operations: insert, delete, lookup Average case complexity O(log n) Stores a sorted list Hierarchy of linked lists connect increasingly sparse subsequences of elements Randomized with geometric distribution for O(log n) performance Auxiliary lists allow for efficient operation No global updates (rebalancing, etc.) due to probabilistic nature Can be efficiently parallelized

Skip Lists Diagram of a Skip List

Skip List Algorithm Based on the lock-based concurrent skip list in The Art of Multiprocessor Programming by Herlihy and Shavit Fine-grained locking for insertion and deletion Wait-free lookup Locks ensure skip list property is maintained Higher level lists are always contained in lower level lists

Skip List Algorithm Lookup Similar to binary search Insertion Find new node s predecessors and successors Lock predecessors Validate that links are correct Link new node to predecessors and successors

Preliminary Model Basic model to analyze scalability by estimating cache coherence traffic Ignored higher levels of linked lists Treated lookups as instantaneous operations Assumed threads inserted nodes simultaneously Measured expected cache line transfers as thread count and size varied Modeled as balls and bins problem

Preliminary Model k threads, n element skip list Find expected value of Explicit formula for expected cache line transfers t Verified formula using Monte Carlo simulation

IMPLEMENTATION AND OPTIMIZATION

Implementation and Optimization: C++ Custom read-copy update (RCU) garbage collector Avoids read-write conflicts Padded relevant data structures to a cache line Avoids false sharing

Implementation and Optimization: Java Simplified contains method implementation Avoided keeping track of predecessors and successors Avoiding generics/autoboxing Pre-allocated arrays internal to operations Tried a bunch of hacks to increase the scalability of the Java implementation Nothing improved scalability

PERFORMANCE

Experimental Setup 80-core machine 8 x 10-core Intel Xeon @ 2.40 GHz 256 GB of RAM Linux C++ with tcmalloc or jemalloc Java with HotSpot and OpenJDK VM

Procedure Varied thread count and size Fixed size skip list with uniformly distributed key space Read-only benchmark Threads concurrently search for random elements Read-write benchmark Threads concurrently add and then remove elements Size was maintained within a constant factor Measured total throughput Warmed up the JVM before performing tests

Read-only scalability Java scaled similarly

Read-write scalability

Read-write scalability Read-write benchmarks did not scale linearly Skip list reached maximal throughput between 10 and 40 cores, depending on the implementation Two reasons for drop-off in performance Lock contention between threads, especially in small ( 1000 element) skip lists As thread count increases, lock contention increases and relative performance decreases As thread count increases, relative memory allocator performance decreases

Java vs. C++ C++ implementation faster than Java implementation Memory deallocation Especially difficult in the multithreaded case Java has GC, C++ requires manual memory management Java s compacting GC speeds up memory allocation Keeps heap unfragmented Increment pointer, return old value Minimal synchronization in multithreaded case using TLABs C++ s default glibc memory allocator is bad

Java VMs

C++ Memory Allocators

Conclusion Implemented and analyzed performance of concurrent skip lists All implementations scale well for read-only C++ glibc allocator doesn t scale, alternatives scale better

Future Work Better model the performance of the skip list Take into account search time, asynchronous reads and writes, and hierarchy of linked lists More general model using queuing theory and Markov models Measure performance hit caused by lock contention Custom memory allocator Redesigned benchmark to avoid synchronization for memory allocation

Thanks Thanks to Austin Clements and Stephen Tu for excellent mentorship throughout the process Professor Kaashoek and Professor Zeldovich for the project idea and mentorship MIT PRIMES for providing us this great opportunity for research Our parents for all their support