The Common Case Transactional Behavior of Multithreaded Programs

Similar documents
System Challenges and Opportunities for Transactional Memory

The Common Case Transactional Behavior of Multithreaded Programs

Tradeoffs in Transactional Memory Virtualization

DBT Tool. DBT Framework

Parallelizing SPECjbb2000 with Transactional Memory

Lecture 20: Transactional Memory. Parallel Computer Architecture and Programming CMU , Spring 2013

Parallelizing SPECjbb2000 with Transactional Memory

A Scalable, Non-blocking Approach to Transactional Memory

Transactional Memory. Lecture 19: Parallel Computer Architecture and Programming CMU /15-618, Spring 2015

Transactional Memory Coherence and Consistency

Executing Java Programs with Transactional Memory

Improving the Practicality of Transactional Memory

Transactional Memory

Lock vs. Lock-free Memory Project proposal

Early Release: Friend or Foe?

Transactional Memory. Lecture 18: Parallel Computer Architecture and Programming CMU /15-618, Spring 2017

ATLAS: A Chip-Multiprocessor. with Transactional Memory Support

Making the Fast Case Common and the Uncommon Case Simple in Unbounded Transactional Memory

Chí Cao Minh 28 May 2008

Transactional Memory. Prof. Hsien-Hsin S. Lee School of Electrical and Computer Engineering Georgia Tech

LogTM: Log-Based Transactional Memory

Scheduling Transactions in Replicated Distributed Transactional Memory

Speculative Synchronization: Applying Thread Level Speculation to Parallel Applications. University of Illinois

Thread-Safe Dynamic Binary Translation using Transactional Memory

Hybrid Static-Dynamic Analysis for Statically Bounded Region Serializability

Relaxing Concurrency Control in Transactional Memory. Utku Aydonat

Characterization of TCC on Chip-Multiprocessors

Work Report: Lessons learned on RTM

Heuristics for Profile-driven Method- level Speculative Parallelization

Thread-level Parallelism for the Masses. Kunle Olukotun Computer Systems Lab Stanford University 2007

Architectural Semantics for Practical Transactional Memory

Architectural Support for Software Transactional Memory

Lecture 21: Transactional Memory. Topics: consistency model recap, introduction to transactional memory

Speculative Synchronization

SYSTEM CHALLENGES AND OPPORTUNITIES FOR TRANSACTIONAL MEMORY

Hardware Transactional Memory. Daniel Schwartz-Narbonne

Programming with Transactional Memory

McRT-STM: A High Performance Software Transactional Memory System for a Multi- Core Runtime

Deterministic Replay and Data Race Detection for Multithreaded Programs

Hardware Transactional Memory Architecture and Emulation

DESIGNING AN EFFECTIVE HYBRID TRANSACTIONAL MEMORY SYSTEM

Lecture 7: Transactional Memory Intro. Topics: introduction to transactional memory, lazy implementation

Nikos Anastopoulos, Konstantinos Nikas, Georgios Goumas and Nectarios Koziris

Transactional Memory. How to do multiple things at once. Benjamin Engel Transactional Memory 1 / 28

ReVive: Cost-Effective Architectural Support for Rollback Recovery in Shared-Memory Multiprocessors

Log-Based Transactional Memory

Transactional Memory Implementation Lecture 1. COS597C, Fall 2010 Princeton University Arun Raman

HydraVM: Mohamed M. Saad Mohamed Mohamedin, and Binoy Ravindran. Hot Topics in Parallelism (HotPar '12), Berkeley, CA

Flexible Architecture Research Machine (FARM)

Enhancing Concurrency in Distributed Transactional Memory through Commutativity

Speculative Lock Elision: Enabling Highly Concurrent Multithreaded Execution

ARCHITECTURES FOR TRANSACTIONAL MEMORY

Implementing and Evaluating Nested Parallel Transactions in STM. Woongki Baek, Nathan Bronson, Christos Kozyrakis, Kunle Olukotun Stanford University

Reminder from last time

Lecture 6: Lazy Transactional Memory. Topics: TM semantics and implementation details of lazy TM

Transactional Memory. Yaohua Li and Siming Chen. Yaohua Li and Siming Chen Transactional Memory 1 / 41

Potential violations of Serializability: Example 1

Conflict Detection and Validation Strategies for Software Transactional Memory

6.852: Distributed Algorithms Fall, Class 20

Cost of Concurrency in Hybrid Transactional Memory. Trevor Brown (University of Toronto) Srivatsan Ravi (Purdue University)

An Effective Hybrid Transactional Memory System with Strong Isolation Guarantees

Outline. Exploiting Program Parallelism. The Hydra Approach. Data Speculation Support for a Chip Multiprocessor (Hydra CMP) HYDRA

6 Transactional Memory. Robert Mullins

Analysis and Tracing of Applications Based on Software Transactional Memory on Multicore Architectures

Software Speculative Multithreading for Java

Shared Memory Multiprocessors. Symmetric Shared Memory Architecture (SMP) Cache Coherence. Cache Coherence Mechanism. Interconnection Network

ECE 669 Parallel Computer Architecture

The Stanford Hydra CMP. Lance Hammond, Ben Hubbert, Michael Siu, Manohar Prabhu, Mark Willey, Michael Chen, Maciek Kozyrczak*, and Kunle Olukotun

Light64: Ligh support for data ra. Darko Marinov, Josep Torrellas. a.cs.uiuc.edu

Speculative Optimizations for Parallel Programs on Multicores

Towards Pervasive Parallelism

Unbounded Transactional Memory

Vulcan: Hardware Support for Detecting Sequential Consistency Violations Dynamically

The Stanford Hydra CMP. Lance Hammond, Ben Hubbert, Michael Siu, Manohar Prabhu, Michael Chen, Maciek Kozyrczak*, and Kunle Olukotun

Foundations of the C++ Concurrency Memory Model

Software-Controlled Multithreading Using Informing Memory Operations

Lecture 12 Transactional Memory

Deconstructing Transactional Semantics: The Subtleties of Atomicity

Memory Consistency and Multiprocessor Performance

A Dynamic Instrumentation Approach to Software Transactional Memory. Marek Olszewski

Data Speculation Support for a Chip Multiprocessor Lance Hammond, Mark Willey, and Kunle Olukotun

The OpenTM Transactional Application Programming Interface

Practical Near-Data Processing for In-Memory Analytics Frameworks

POSIX Threads: a first step toward parallel programming. George Bosilca

Transaction Memory for Existing Programs Michael M. Swift Haris Volos, Andres Tack, Shan Lu, Adam Welc * University of Wisconsin-Madison, *Intel

Memory Consistency and Multiprocessor Performance. Adapted from UCB CS252 S01, Copyright 2001 USB

Atomic Transac1ons. Atomic Transactions. Q1: What if network fails before deposit? Q2: What if sequence is interrupted by another sequence?

Concurrent & Distributed Systems Supervision Exercises

Introduction to Parallel Computing

Lowering the Overhead of Nonblocking Software Transactional Memory

TDT Coarse-Grained Multithreading. Review on ILP. Multi-threaded execution. Contents. Fine-Grained Multithreading

CS377P Programming for Performance Multicore Performance Synchronization

ECE 259 / CPS 221 Advanced Computer Architecture II (Parallel Computer Architecture) Parallel Programming

EazyHTM: Eager-Lazy Hardware Transactional Memory

Fall 2012 Parallel Computer Architecture Lecture 16: Speculation II. Prof. Onur Mutlu Carnegie Mellon University 10/12/2012

Commit Algorithms for Scalable Hardware Transactional Memory. Abstract

Future Work. Build applications that use extensions to optimize performance. Interface design.

Lecture 21: Transactional Memory. Topics: Hardware TM basics, different implementations

Transactional Memory. review articles. Is TM the answer for improving parallel programming?

Lecture: Consistency Models, TM. Topics: consistency models, TM intro (Section 5.6)

Transcription:

The Common Case Transactional Behavior of Multithreaded Programs JaeWoong Chung Hassan Chafi,, Chi Cao Minh, Austen McDonald, Brian D. Carlstrom, Christos Kozyrakis, Kunle Olukotun Computer Systems Lab Stanford University http://tcc.stanford.edu tcc.stanford.edu 1

The Parallel Programming Problem CMPs are here but no parallel software to run on them Lock-based parallel programming is simply broken Coarse-grained locks: serialization Fine-grained locks: deadlocks, races, priority inversion, Poor composability,, not fault-tolerant, tolerant, Transactional Memory (TM): an promising alternative Transactions: atomic & isolated access to shared-memory Performance through optimistic concurrency Parallel programming with TM Coarse grain Non-blocking synchronization for parallel algorithms Speculative parallelization for sequential algorithms 2

The Design Space for TM A transactional memory system provides Basics: versioning, conflict resolution, commit, abort Desired: nesting for libraries, virtualization Several proposed designs Software-only: only: [DSTM], [OSTM], [ASTM], [SXM], [McRT[ McRT-STM] Hardware-assisted: assisted: [TLR], [TCC], [U/LTM], [VTM], [LogTM[ LogTM] Hybrids: [HyTM[ HyTM], [Hybrid-TM] Different tradeoffs in implementing basic/desired features Key questions Which is the common case to optimize for? 3

In Search of the Common Case Key metrics of transactional program Transaction length Cost of fixed overheads, time virtualization issues Read-/write /write-set size Buffer space requirements, buffer virtualization issues Write-set to length ratio Amortize commit/abort overheads Frequency of nesting & I/O in transactions Support for nesting, syscalls, Frequency of conflicts Scheduling and contention management policies The chicken & egg problem Programmers need efficient TM systems to support development Designers need TM applications to derive common case Can we break the deadlock? 4

Paper Summary Study the TM behavior of existing parallel programs Map existing parallel constructs to transactions 36 applications, multiple domains, 4 programming models Analyzed common case for Transaction length, read-/write /write-set size, write-set to length ratio, nesting & I/O For both non-blocking synchronization & spec. parallelization Implementation agnostic measurements Derived guidelines for TM system design Buffering requirements and virtualization approach Overhead amortization, nesting & I/O support, 5

Key assumption Methodology Overview The inherent parallelism & synchronization patterns are likely the same regardless of language primitives used Methodology 1. Trace parallel application on existing hardware 2. Map parallel constructs to transaction boundaries E.g. lock/unlock -> > transaction begin/end 3. Process trace to analyze metrics Measurements are agnostic to TM design Limitation: cannot measure violation behavior 6

Parallel Applications Transaction Usage Non-blocking Synchronization Speculative Parallelism Languages Java Pthread ANL OpenMP Applications MolDyn, MonteCarlo, RayTracer,, Crypt, LUFact, Series, SOR, SparseMatmult,, SPECjbb2000, PMD, HSQLDB Apache, Kingate,, Bp-vision, Localize, Ultra Tic Tac Toe, MPEG2, AOL Server Barnes, Mp3d, Ocean, Radix, FMM, Cholesky, Radiosity,, FFT, Volrend,, Water-N2, Water-Spatial APPLU, Equake,, Art, Swim, CG, BT, IS Different domains : scientific, enterprise, AI/robotics, multimedia Different qualities : highly optimized Vs. less optimized Java, Pthreads,, ANL studied for non-blocking synchronizations OpenMP studied for speculative parallelization 7

Non-Blocking Synchronization Transactions are used for critical sections in parallel algorithms Original primitives mapped to transactional boundaries E.g. Java synchronized block, pthread_mutex,, ANL LOCK macro mapped transaction boundaries Semantics issue To conserve the original program semantics, wait splits transaction This mapping is not always safe, but was fine in our study Original Threading Primitive Lock Unlock Wait Transaction Mapping BEGIN END END-BEGIN 8

Transaction Length Number of instructions executed in transaction Length in Instructions Application Avg 50th % 95th % Max Java average 5949 149 4256 13519488 Pthreads average 879 805 1056 22591 ANL average 256 114 772 16782 Up to 95% of transactions have less than 5000 instructions => Light-weight transactional primitives are required Some programs have rare but long transactions => Time virtualization is needed (transaction context-switching) 9

Time Virtualization for TM Interrupt and context-switch procedure Interrupt Other proposals Can wait? No Is there a young xaction? No Swap out xaction to VM; use its CPU for interrupt Yes Yes Wait for a short xaction to finish; use its CPU for interrupt Abort a young xaction; use its CPU for interrupt Rare case! Common case Common case Handle in VM software [Satya94][Chen97] 10

Read-/Write /Write-Set Size Bytes of data read/written by transaction Read Set Size in Kbytes 1000 100 10 1 0.1 50th ANL Java Pthreads 80th Write Set Size in Kbytes 1000 100 10 1 0.1 0.01 50th ANL Java Pthreads 80th Percentile of Transactions Percentile of Transactions 98% of transactions: <16KB read-set, <6KB write set => 32K L1 Cache will be enough for most transactions There are few very large transaction > 32K => space virtualization is needed but it s s better be cheap 11

Write-Set to Length Ratio Ratio of # unique addresses written to # instructions in transaction tion 100 Percentage of Transaction 80 60 40 20 ANL Java Pthreads 0 0% 20% < 25% in most transactions Write-Set Size to Transaction Length Ratio => Big challenge for SW TM because of high per-write overhead => Even HW TM needs sufficient bandwidth for versioning and commit 12

Transaction Nesting and I/O Nesting occurs only in java VM code 2.2 average depth => Limited support for nesting is sufficient for now I/O within transactions is rare 27 applications have less than 0.1% of transactions with I/O 8 applications have up to 1% of transactions with I/O No transactions include both input and output => Buffered I/O would not deadlock 13

Speculative parallelization Speculatively parallelize loops in sequential algorithms E.g. each loop iteration becomes a transactions This study 6 loop based applications Mapped outermost loop iteration to single transaction Original Threading Primitive Outermost Iteration Start Outermost Iteration End Transaction Mapping BEGIN END 14

Read-/Write /Write-Set Size Bytes of data read/written by transaction 10000 equake art is swim cg bt 10000 equake art is swim cg bt Read Set Size in Kbytes 1000 100 10 1 0.1 Write Set Size in Kbytes 1000 100 10 1 0.1 0.01 50th 55th 60th 65th 70th 75th 80th 85th 90th 95th 100t Transaction Percentile 0.01 50th 55th 60th 65th 70th 75th 80th 85th 90th 95th 100t Transaction Percentile The read-/write /write-sets get larger up to L2-sized buffers (~128K) => They doesn t t fit in L1 cache but still fits into L2-sized buffer => Inner loop parallelization might be better to reduce buffer requirement 15

Take-away away Points Transaction Usage Non-blocking Synchronization Observation Short-lived transactions Read-/write /write-sets < 16K High write-set to length ratio Few nested transactions TM Design Guidelines Light-weight TM primitives L1 cache for versioning Per write overhead is critical Challenge for STM Limited nesting support Few transactions with I/O Buffered I/O Speculative Parallelism Large read-/write /write-sets L2 cache for versioning 16

Summary Extensive study of transactional behavior of programs 36 parallel applications from multiple domains Map existing parallel constructs to transactions Covered both non-blocking synchronization & speculative parallelization Contributions Quantitative Observations on transactional characteristics Most transactions are short-lived, small, and not nested Design Guidelines for Transactional Memory systems Effective Guideline for TM Architects 17