Cost of Concurrency in Hybrid Transactional Memory. Trevor Brown (University of Toronto) Srivatsan Ravi (Purdue University)

Similar documents
Cost of Concurrency in Hybrid Transactional Memory

Cost of Concurrency in Hybrid Transactional Memory

6.852: Distributed Algorithms Fall, Class 20

arxiv: v3 [cs.dc] 17 Feb 2015

Reduced Hardware NOrec: A Safe and Scalable Hybrid Transactional Memory

Chí Cao Minh 28 May 2008

arxiv: v1 [cs.dc] 22 May 2014

Mutex Locking versus Hardware Transactional Memory: An Experimental Evaluation

Invyswell: A HyTM for Haswell RTM. Irina Calciu, Justin Gottschlich, Tatiana Shpeisman, Gilles Pokam, Maurice Herlihy

Inherent Limitations of Hybrid Transactional Memory

HTM in the wild. Konrad Lai June 2015

Lock vs. Lock-free Memory Project proposal

Lower bounds for Transactional memory

Transactional Memory

Reduced Hardware Transactions: A New Approach to Hybrid Transactional Memory

Hardware Transactional Memory on Haswell

Agenda. Designing Transactional Memory Systems. Why not obstruction-free? Why lock-based?

Reduced Hardware Lock Elision

INTRODUCTION. Hybrid Transactional Memory. Transactional Memory. Problems with Transactional Memory Problems

Improving the Practicality of Transactional Memory

Using Software Transactional Memory In Interrupt-Driven Systems

Fall 2012 Parallel Computer Architecture Lecture 16: Speculation II. Prof. Onur Mutlu Carnegie Mellon University 10/12/2012

Phased Transactional Memory

Accelerating Irregular Computations with Hardware Transactional Memory and Active Messages

SELF-TUNING HTM. Paolo Romano

Architectural Support for Software Transactional Memory

New Programming Abstractions for Concurrency in GCC 4.7. Torvald Riegel Red Hat 12/04/05

On Improving Transactional Memory: Optimistic Transactional Boosting, Remote Execution, and Hybrid Transactions

Speculative Synchronization

Main-Memory Databases 1 / 25

1 Publishable Summary

McRT-STM: A High Performance Software Transactional Memory System for a Multi- Core Runtime

Transactional Memory. How to do multiple things at once. Benjamin Engel Transactional Memory 1 / 28

Transactional Memory. Yaohua Li and Siming Chen. Yaohua Li and Siming Chen Transactional Memory 1 / 41

DHTM: Durable Hardware Transactional Memory

Hybrid NOrec: A Case Study in the Effectiveness of Best Effort Hardware Transactional Memory

Eliminating Global Interpreter Locks in Ruby through Hardware Transactional Memory

Lecture 12 Transactional Memory

Conflict Detection and Validation Strategies for Software Transactional Memory

Refined Transactional Lock Elision

CS4021/4521 INTRODUCTION

ABORTING CONFLICTING TRANSACTIONS IN AN STM

Transactional Memory: Architectural Support for Lock-Free Data Structures Maurice Herlihy and J. Eliot B. Moss ISCA 93

Relaxing Concurrency Control in Transactional Memory. Utku Aydonat

DBT Tool. DBT Framework

Reuse, Don t Recycle: Transforming Lock-Free Algorithms That Throw Away Descriptors

Managing Resource Limitation of Best-Effort HTM

Performance Evaluation of Intel Transactional Synchronization Extensions for High-Performance Computing

PHyTM: Persistent Hybrid Transactional Memory

TRANSACTION MEMORY. Presented by Hussain Sattuwala Ramya Somuri

Lecture 20: Transactional Memory. Parallel Computer Architecture and Programming CMU , Spring 2013

Evaluating Contention Management Using Discrete Event Simulation

Martin Kruliš, v

NON-BLOCKING DATA STRUCTURES AND TRANSACTIONAL MEMORY. Tim Harris, 14 November 2014

Scalable Software Transactional Memory for Chapel High-Productivity Language

PERFORMANCE ANALYSIS AND OPTIMIZATION OF SKIP LISTS FOR MODERN MULTI-CORE ARCHITECTURES

Software transactional memory

Understanding Hardware Transactional Memory

Optimizing Hybrid Transactional Memory: The Importance of Nonspeculative Operations

Optimizing Hybrid Transactional Memory: The Importance of Nonspeculative Operations

HydraVM: Mohamed M. Saad Mohamed Mohamedin, and Binoy Ravindran. Hot Topics in Parallelism (HotPar '12), Berkeley, CA

A common scenario... Most of us have probably been here. Where did my performance go? It disappeared into overheads...

Scheduling Transactions in Replicated Distributed Transactional Memory

Thread-Level Speculation on Off-the-Shelf Hardware Transactional Memory

LogTM: Log-Based Transactional Memory

PHyTM: Persistent Hybrid Transactional Memory

Making the Fast Case Common and the Uncommon Case Simple in Unbounded Transactional Memory

Foundations of the C++ Concurrency Memory Model

Flexible Architecture Research Machine (FARM)

Heckaton. SQL Server's Memory Optimized OLTP Engine

6 Transactional Memory. Robert Mullins

ByteSTM: Java Software Transactional Memory at the Virtual Machine Level

EN164: Design of Computing Systems Lecture 34: Misc Multi-cores and Multi-processors

Performance Evaluation of Adaptivity in STM. Mathias Payer and Thomas R. Gross Department of Computer Science, ETH Zürich

The Common Case Transactional Behavior of Multithreaded Programs

NON-BLOCKING DATA STRUCTURES AND TRANSACTIONAL MEMORY. Tim Harris, 3 Nov 2017

Reduced Hardware Transactions: A New Approach to Hybrid Transactional Memory

Enhancing Concurrency in Distributed Transactional Memory through Commutativity

Concurrent programming: From theory to practice. Concurrent Algorithms 2015 Vasileios Trigonakis

Enhancing efficiency of Hybrid Transactional Memory via Dynamic Data Partitioning Schemes

New Programming Abstractions for Concurrency. Torvald Riegel Red Hat 12/04/05

Multiprocessor Cache Coherence. Chapter 5. Memory System is Coherent If... From ILP to TLP. Enforcing Cache Coherence. Multiprocessor Types

TicToc: Time Traveling Optimistic Concurrency Control

Strata: A Cross Media File System. Youngjin Kwon, Henrique Fingler, Tyler Hunt, Simon Peter, Emmett Witchel, Thomas Anderson

A Dynamic Instrumentation Approach to Software Transactional Memory. Marek Olszewski

Computer Science 146. Computer Architecture

Early Results Using Hardware Transactional Memory for High-Performance Computing Applications

Tradeoffs in Transactional Memory Virtualization

bool Account::withdraw(int val) { atomic { if(balance > val) { balance = balance val; return true; } else return false; } }

Cache-Conscious Concurrency Control of Main-Memory Indexes on Shared-Memory Multiprocessor Systems

Transactional Memory. Companion slides for The Art of Multiprocessor Programming by Maurice Herlihy & Nir Shavit

Summary: Open Questions:

Accelerating Transactional Memory by Exploiting Platform Specificity

Extending Hardware Transactional Memory Capacity via Rollback-Only Transactions and Suspend/Resume

Indistinguishability: Friend and Foe of Concurrent Data Structures. Hagit Attiya CS, Technion

Remote Transaction Commit: Centralizing Software Transactional Memory Commits

An Update on Haskell H/STM 1

Portland State University ECE 588/688. Cray-1 and Cray T3E

CS433 Homework 6. Problem 1 [15 points] Assigned on 11/28/2017 Due in class on 12/12/2017

Speculative Synchronization: Applying Thread Level Speculation to Parallel Applications. University of Illinois

Transcription:

Cost of Concurrency in Hybrid Transactional Memory Trevor Brown (University of Toronto) Srivatsan Ravi (Purdue University) 1

Transactional Memory: a history Hardware TM Software TM Hybrid TM 1993 1995-today Today Software implementation by Shavit-Touitou ( 95): static HTM transactions support: Intel Haswell, IBM Power8,.. Original proposal by Herlihy-Moss ( 93) Dynamic Different STM memory by models et and al. ( 03) supported instructions Exploit Incremental cache-coherence Hardware limitations validation and spurious cost aborts Optimistic TL2 by Dice synchronization: et al. ( 06), NOrec buffer by Spear speculative et al. ( 06),... updates Fallback to software transactions Mitigate validation cost 2

Transactional Memory: a history Hardware TM Software TM Hybrid TM 1993 1995-today Today Cost of Concurrency in Hybrid TMs? 3

Hybrid Transactional Memory (HyTM) Model Transactions Fast-path Transactional operations HW primitives Data items Base objects Executed in hardware Exploit cache-coherence Spurious aborts Cache limitations 4

Hybrid Transactional Memory (HyTM) Model Transactions Slow-path Transactional operations HW primitives Data items Base objects Executed in software Reliable for large transactions Slower execution time Harder to verify 5

Hybrid Transactional Memory (HyTM) Model Transactional operations HW primitives Transactions Data items Base objects For fast-path transactions Operate on cached memory state Direct on Power8 For slow-path transactions Maintain TRACKING SET Operate directly on memory Shared/exclusive mode state Capacity limit Direct access Cached access 6

Hybrid Transactional Memory (HyTM) Model Tracking set aborts in fast-path transactions Automatic contention detection for cached accesses Fast-path T2 WRITE to base object B (EXCLUSIVE mode) A2 Tracking set is invalidated and T2 must abort Fast-path or slow-path T1 ACCESS base object B 7

Hybrid Transactional Memory (HyTM) Model Tracking set aborts in fast-path transactions Automatic contention detection for cached accesses Fast-path T2 READ to base object B (SHARED mode) A2 Tracking set is invalidated and T2 must abort Fast-path or slow-path T1 WRITE base object B 8

Hybrid Transactional Memory (HyTM) Model Committed fast-path transactions (only cached accesses) appear to execute atomically (single step) Aborted or incomplete Fast-path T2 Slow-path T1 Execution indistinguishable to T1 from an execution in which T2 does not participate 9

Instrumentation Fast-path transactions intuitively need code instrumentation Detect overlap contention with slow-path Global timestamp, ownership records, etc. Instrumentation affects performance Slow-path T0 Fast-path W(X W(Y R(X)? T1 Partial commit of X and Y T1 must access meta-information to detect contention with T0 10

Cost of concurrency in HyTM Sequential Progressive Minimal concurrency More concurrency O(1) fast-path instrumentation Cost on slow-path transactions? Ω(m) fast-path instrumentation 11

Linear validation cost in HyTM Progressive opaque HyTM+Invisible reads Linear time and space complexity for slow-path transactions Fast-path T m W(X m Commits Slow-path T R(X 0 1 ) 0 R(X m-1 ) 0 R(X m ) 1 (m-1) invisible reads of distinct data items X 1. X m-1 Read of X m must return the value 1 updated by T m 12

Linear validation cost in HyTM Progressive opaque HyTM+Invisible reads Linear time and space complexity for slow-path transactions Slow-path T R(X 0 1 ) 0 R(X m-1 ) 0 R(X m ) 1 T 0 does not observe T m until the access of data item X m Fast-path T m W(X m Commits Read of X m by T 0 must return the value 1 13

Linear validation cost in HyTM Progressive opaque HyTM+Invisible reads Linear time and space complexity for slow-path transactions Slow-path T R(X 0 1 ) 0 R(X m-1 ) 0 R(X m )? Fast-path Write new value to data item X 1 and commit T m W(X m T 1 W(X 1 Fast-path Cannot return value 1--cycle in serialization T 1 T m T 0 14

Linear validation cost in HyTM Progressive opaque HyTM+Invisible reads Linear time and space complexity for slow-path transactions T 0 is invisible to fast-path transactions T 1 T m T R(X 0 1 ) 0 R(X R(X m ) m-1 ) 0 Fast-path T m W(X m T 1 W(X 1 Each fast-path transaction writes new values to data objects X 1 to X m-1 W(X m-1 T m-1 15

Linear validation cost in HyTM Progressive opaque HyTM+Invisible reads Linear time and space complexity for slow-path transactions Slow-path T R(X 0 1 ) 0 R(X R(X m ) m-1 ) 0 Tracking set aborts: cannot contend on same memory location T m W(X m T 1 W(X 1 W(X m-1 T m-1 16

Validation cost in HyTM Progressive opaque HyTM+Invisible reads Linear time and space complexity for slow-path transactions Slow-path T R(X 0 1 ) 0 R(X R(X m ) m-1 ) 0 T m W(X m T 1 W(X 1 Read of X m must access m-1 distinct memory locations W(X m-1 T m-1 17

Validation cost in HyTM Progressive opaque HyTM+Invisible reads Linear time and space complexity for slow-path transactions Slow-path T R(X 0 1 ) 0 R(X R(X m ) m-1 ) 0 Transaction T 0 must take at least i (i=1 to m-1)=ω(m 2 ) memory steps T m W(X m T 1 W(X 1 W(X m-1 T m-1 18

Cost of concurrency in HyTM Sequential Progressive Minimal concurrency More concurrency O(1) fast-path instrumentation+ O(1) slow-path reads Fast-path transactions aborted by non-conflicting ones or linear fast-path instrumentation cost and linear slow-path steps Ω(m) fast-path instrumentation + Ω(m) slow-path steps per-read 19

Cost of concurrency in HyTM Sequential Progressive Minimal concurrency More concurrency O(1) fast-path instrumentation+ O(1) slow-path reads Progressive STMs like TL2 circumvent the cost of validation for better performance, but impossible in progressive HyTMs! Ω(m) fast-path instrumentation + Ω(m) slow-path steps per-read 20

Cost of Concurrency in HyTM Algorithm 1 Algorithm 2 Transactional Lock Elision Hybrid Norec Instrumentation in fast-path reads per-read constant constant constant Instrumentation in fast-path writes Validation in slow-path reads per-write per-write constant constant Ω( Rset ) Ω( Rset ) none Ω( Rset ) only if concurrency h/w-s/w concurrency prog prog for slow-path readers zero Not prog; small contention window Direct accesses inside fast-path? yes no no yes 21

Experimental setup Large-scale 2-socket Intel E7-4830 v3 with 12 cores per socket and 2 hyperthreads (HTs) per core, for a total of 48 threads Each core has a private 32KB L1 cache and 256KB L2 cache (which is shared between HTs on a core) All cores on a socket share a 30MB L3 cache Non-uniform memory architecture (NUMA) 128GB of RAM, and runs Ubuntu 14.04 LTS. All code was compiled with the GNU C++ compiler (G++) 4.8.4 with build target x86_64-linux-gnu and compilation options -std=c++0x -O3 -mx32 22

Experimental methodology Six timed trials for several thread counts n Prefilling: n concurrent threads perform 50% Insert and 50% Delete operations on keys drawn uniformly randomly from [0, 10 5 ) until the size of the tree converges to a steady state (containing approximately 10 5 /2 keys) Measuring: Each thread performs (U/2)% Insert, (U/2)% Delete and (100 U )% Search operations, on keys/values drawn uniformly from [0, 10 5 ); U=0,10,40 Plots for Binary Search Tree (BST) microbenchmark With (without) one (any) thread performing RangeIncrement operations Capacity aborts on fast-path 23

Cost of Concurrency in HyTM 0% updates: #threads vs. ops/microsec 24

Cost of Concurrency in HyTM 10% updates: #threads vs. ops/microsec 25

Cost of Concurrency in HyTM 40% updates: #threads vs. ops/microsec Read-only workloads: costs purely down to fast-path Algorithm 1 overhead due to linear instrumentation Update workloads with RangeIncrement TLE suffers due to global lock bottleneck NUMA effects on update heavy workloads From thread counts > 24 Hybrid norec performs poorly to Algorithm 2 for same reasons as TLE 26

Circumventing the impossibilities? Middle-path approach? Ongoing work Almost uninstrumented fast fast-path No concurrency with slow-path Concurrent with middle-path Ongoing experiments on Intel Haswell and IBM Power8 (STAMP and data structure microbenchmarks) Completely different memory models Power8 allows direct accesses inside hardware 27

Transactional memory is here to stay? HyTM: an efficient universal construction? Start with a base HyTM with minimal instrumentation overhead, maximal concurrency and little global metadata bottleneck Dynamic implementation choices depending on workloads Multi-path approach Formal methods and verification techniques Impact of cache hierarchy, cache-size and memory model on HyTM performance 28