Toward Efficient Strong Memory Model Support for the Java Platform via Hybrid Synchronization

Similar documents
Hybrid Static-Dynamic Analysis for Statically Bounded Region Serializability

Legato: Bounded Region Serializability Using Commodity Hardware Transactional Memory. Aritra Sengupta Man Cao Michael D. Bond and Milind Kulkarni

Toward Efficient Strong Memory Model Support for the Java Platform via Hybrid Synchronization

Lightweight Data Race Detection for Production Runs

Hybrid Static Dynamic Analysis for Statically Bounded Region Serializability

Legato: End-to-End Bounded Region Serializability Using Commodity Hardware Transactional Memory

Valor: Efficient, Software-Only Region Conflict Exceptions

Avoiding Consistency Exceptions Under Strong Memory Models

DoubleChecker: Efficient Sound and Precise Atomicity Checking

Towards Parallel, Scalable VM Services

11/19/2013. Imperative programs

Lightweight Data Race Detection for Production Runs

NDSeq: Runtime Checking for Nondeterministic Sequential Specs of Parallel Correctness

Lightweight Data Race Detection for Production Runs

Adaptive Multi-Level Compilation in a Trace-based Java JIT Compiler

Valor: Efficient, Software-Only Region Conflict Exceptions

PACER: Proportional Detection of Data Races

How s the Parallel Computing Revolution Going? Towards Parallel, Scalable VM Services

Probabilistic Calling Context

A Dynamic Evaluation of the Precision of Static Heap Abstractions

The Java Memory Model

EECS 570 Lecture 13. Directory & Optimizations. Winter 2018 Prof. Satish Narayanasamy

Hybrid Static Dynamic Analysis for Region Serializability

Trace-based JIT Compilation

High-level languages

Chí Cao Minh 28 May 2008

Atomicity for Reliable Concurrent Software. Race Conditions. Motivations for Atomicity. Race Conditions. Race Conditions. 1. Beyond Race Conditions

Dynamic Monitoring Tool based on Vector Clocks for Multithread Programs

Lu Fang, University of California, Irvine Liang Dou, East China Normal University Harry Xu, University of California, Irvine

FastTrack: Efficient and Precise Dynamic Race Detection (+ identifying destructive races)

Reference Object Processing in On-The-Fly Garbage Collection

Compiler Techniques For Memory Consistency Models

Java Memory Model. Jian Cao. Department of Electrical and Computer Engineering Rice University. Sep 22, 2016

CalFuzzer: An Extensible Active Testing Framework for Concurrent Programs Pallavi Joshi 1, Mayur Naik 2, Chang-Seo Park 1, and Koushik Sen 1

Atomicity via Source-to-Source Translation

Chronicler: Lightweight Recording to Reproduce Field Failures

MAMA: Mostly Automatic Management of Atomicity

Programming Language Memory Models: What do Shared Variables Mean?

A Case for System Support for Concurrency Exceptions

Taming release-acquire consistency

Nondeterminism is Unavoidable, but Data Races are Pure Evil

Atomicity. Experience with Calvin. Tools for Checking Atomicity. Tools for Checking Atomicity. Tools for Checking Atomicity

C. Flanagan Dynamic Analysis for Atomicity 1

Part 3: Beyond Reduction

A Trace-based Java JIT Compiler Retrofitted from a Method-based Compiler

Cooperative Concurrency for a Multicore World

Identifying the Sources of Cache Misses in Java Programs Without Relying on Hardware Counters. Hiroshi Inoue and Toshio Nakatani IBM Research - Tokyo

Motivation & examples Threads, shared memory, & synchronization

Static and Dynamic Program Analysis: Synergies and Applications

Semantic Atomicity for Multithreaded Programs!

Enforcing Textual Alignment of

Efficient Runtime Tracking of Allocation Sites in Java

A Parameterized Type System for Race-Free Java Programs

Foundations of the C++ Concurrency Memory Model

Continuous Object Access Profiling and Optimizations to Overcome the Memory Wall and Bloat

7/6/2015. Motivation & examples Threads, shared memory, & synchronization. Imperative programs

Towards Transactional Memory Semantics for C++

The Java Memory Model

The C/C++ Memory Model: Overview and Formalization

FastTrack: Efficient and Precise Dynamic Race Detection By Cormac Flanagan and Stephen N. Freund

IBM Research - Tokyo 数理 計算科学特論 C プログラミング言語処理系の最先端実装技術. Trace Compilation IBM Corporation

MultiJav: A Distributed Shared Memory System Based on Multiple Java Virtual Machines. MultiJav: Introduction

A Causality-Based Runtime Check for (Rollback) Atomicity

ANALYZING THREADS FOR SHARED MEMORY CONSISTENCY BY ZEHRA NOMAN SURA

Multicore Programming Java Memory Model

Coherence, Consistency, and Déjà vu: Memory Hierarchies in the Era of Specialization

Efficient Processor Support for DRFx, a Memory Model with Exceptions

Language- Level Memory Models

Phase-based Adaptive Recompilation in a JVM

Software Transactions: A Programming-Languages Perspective

Towards Garbage Collection Modeling

Lecture 7: Transactional Memory Intro. Topics: introduction to transactional memory, lazy implementation

Types for Precise Thread Interference

Efficient Context Sensitivity for Dynamic Analyses via Calling Context Uptrees and Customized Memory Management

Duet: Static Analysis for Unbounded Parallelism

An introduction to weak memory consistency and the out-of-thin-air problem

Maximal Causality Reduction for TSO and PSO. Shiyou Huang Jeff Huang Parasol Lab, Texas A&M University

Chimera: Hybrid Program Analysis for Determinism

Process Synchronization. Mehdi Kargahi School of ECE University of Tehran Spring 2008

NOW Handout Page 1. Memory Consistency Model. Background for Debate on Memory Consistency Models. Multiprogrammed Uniprocessor Mem.

Program logics for relaxed consistency

DRFx: A Simple and Efficient Memory Model for Concurrent Programming Languages

Program Calling Context. Motivation

Optimistic Shared Memory Dependence Tracing

Part IV: Exploiting Purity for Atomicity

Deterministic Parallel Programming

A Scalable Lock Manager for Multicores

Weak Memory Models: an Operational Theory

Deterministic Replay and Data Race Detection for Multithreaded Programs

Combining the Logical and the Probabilistic in Program Analysis. Xin Zhang Xujie Si Mayur Naik University of Pennsylvania

Phosphor: Illuminating Dynamic. Data Flow in Commodity JVMs

The New Java Technology Memory Model

Sound Thread Local Analysis for Lockset-Based Dynamic Data Race Detection

Potential violations of Serializability: Example 1

AtomChase: Directed Search Towards Atomicity Violations

Designing Memory Consistency Models for. Shared-Memory Multiprocessors. Sarita V. Adve

Dynamic Analysis of Inefficiently-Used Containers

Effective Performance Measurement and Analysis of Multithreaded Applications

Static Conflict Analysis for Multi-Threaded Object Oriented Programs

RELAXED CONSISTENCY 1

Transcription:

Toward Efficient Strong Memory Model Support for the Java Platform via Hybrid Synchronization Aritra Sengupta, Man Cao, Michael D. Bond and Milind Kulkarni 1 PPPJ 2015, Melbourne, Florida, USA

Programming Language Semantics? Data Races Java provides weak semantics 2

Weak Semantics T1 A a = null; boolean init = false; T2 a = new A(); init = true; if (init) a.field++; 3

Weak Semantics T1 a = new A(); init = true; No data dependence T2 if (init) a.field++; 4

Weak Semantics T1 A a = null; boolean init = false; T2 a = new A(); init = true; if (init) a.field++; 5

Weak Semantics T1 T2 init = true; if (init) a.field++; a = new A(); 6

Weak Semantics T1 T2 Null dereference init = true; if (init) a.field++; a = new A(); 7

Java Memory Model JMM (Manson et al., POPL, 2005) variant of DRF0 (Adve and Hill, ISCA, 1990) Atomicity of synchronization-free regions for datarace-free programs Data races: weak semantics 8

Need for Stronger Memory Models The inability to define reasonable semantics for programs with data races is not just a theoretical shortcoming, but a fundamental hole in the foundation of our languages and systems Give better semantics to programs with data races Stronger memory models Adve and Boehm, CACM, 2010 9

Run-time cost Memory Models: Run-time cost vs Strength Synchronization-free region serializability 1 SC Statically Bounded Region Serializability DRF0 Strength 1. Ouyang et al.... and region serializability for all. In HotPar, 2013. 10

Run-time cost Memory Models: Run-time cost vs Strength Synchronization-free region serializability SC Statically Bounded Region Serializability 2 DRF0 Strength 11 2. Sengupta et al. Hybrid Static-Dynamic Analysis for Statically Bounded Region Serializability. ASPLOS, 2015.

Statically Bounded Region Serializability (SBRS) Compiler demarcated regions execute atomically Execution is an interleaving of these regions Sengupta et al., ASPLOS, 2015 12

Statically Bounded Region Serializability (SBRS) acq(lock) Synchronization operations Method calls Loop backedges methodcall() rel(lock) 13

Statically Bounded Region Serializability (SBRS) acq(lock) methodcall() rel(lock) 14

Statically Bounded Region Serializability (SBRS) Statically and dynamically bounded Loop backedges 15

Overview Enforcement of SBRS with dynamic locks Precise dynamic locks: EnfoRSer-D (our prior work), low contention Enforcement of SBRS with static locks Imprecise static locks: EnfoRSer-S, low instrumentation overhead Hybridization of locks EnfoRSer-H: static and dynamic locks, right locks for right sites? Results EnfoRSer-H does at least as well as either. Some cases significant benefit 16

Enforcement of SBRS Prevent two concurrent accesses to the same memory location where one is a write 17

Enforcement of SBRS Prevent two Prevent concurrent regions accesses that to have the same memory location races where from one running is a write concurrently! 18

Enforcement of SBRS Acquire locks before each memory access Acquire locks at the start of the region 19

Enforcement of SBRS Acquire locks before each memory access Precise object locks: dynamic locks Acquire locks at the start of the region 20

Enforcement of SBRS Acquire locks before each memory access Acquire locks at the start of the region Region level locks: statically chosen for an access site 21

EnfoRSer-D Precise dynamic locks Per-access locks with retry mechanism Compiler Transformations: Speculative execution 22

SBRS with Dynamic Locks Y = = X Dynamic per-access locks Z = 23

SBRS with Dynamic Locks Y = Program access = X Z = 24

SBRS with Dynamic Locks Y = = X Ownership transferred Z = 25

SBRS with Dynamic Locks Y = Dynamic locks: precise location, hence no reasoning about races statically! = X Ownership transferred Z = 26

SBRS with Dynamic Locks Precise conflict detectionefficient for high-conflicting regions Y = High instrumentation cost at each access = X High overhead for lowconflicting programs Ownership transferred Z = 27

Experimental Methodology Benchmarks DaCapo 2006, 9.12-bach Fixed-workload versions of SPECjbb2000 and SPECjbb2005 Platform Intel Xeon system: 32 cores 28

Implementation and Evaluation Developed in Jikes RVM 3.1.3 Code publicly available on Jikes RVM Research Archive 29

%overhead over unmodified JVM Run-time Performance 115 EnfoRSer-D 95 75 55 35 15-5 hsqldb6 xalan6 avrora9 luindex9 lusearch6 lusearch9 sunflow9 xalan9 pjbb2000 pjbb2005 geomean 30

%overhead over unmodified JVM Run-time Performance 115 EnfoRSer-D 95 75 27% overhead on average 55 35 15-5 hsqldb6 xalan6 avrora9 luindex9 lusearch6 lusearch9 sunflow9 xalan9 pjbb2000 pjbb2005 geomean 31

%overhead over unmodified JVM Run-time Performance EnfoRSer-D 115 95 75 55 35 Precise conflict detection but high instrumentation overhead! Complex compiler transformations (additional code) Can we do better? 15-5 hsqldb6 xalan6 avrora9 luindex9 lusearch6 lusearch9 sunflow9 xalan9 pjbb2000 pjbb2005 geomean 32

EnfoRSer with Static Locks Reduce the instrumentation overhead of EnfoRSer-D Less complex code generation 33

EnfoRSer-S Static region level locks Racing sites acquire same lock Coarsened locks to reduce instrumentation overhead 34

SBRS with Locks on Static Sites L0 Static locks: racy sites protected by same lock L1 L2 Static Region Locks Y = = X Z = 35

SBRS with Locks on Static Sites L0 L1 L2 All locks acquired before access Y = = X Z = 36

SBRS with Locks on Static Sites L0 L1 L2 Y = Ownership transferred = X Z = 37

SBRS with Locks on Static Sites L0 L1 Imprecise and does not lower instrumentation overhead! L2 Y = Ownership transferred = X Z = 38

SBRS with Locks on Static Sites L012 Coarsened single lock Y = = X Z = 39

SBRS with Locks on Static Sites Low instrumentation L012 overhead Coarsened single lock Imprecise Y = conflict detection = X Z = 40

EnfoRSer-S R1 R2 R3 R4 s0 s2 s5 s7 s3 s8 s1 s4 s6 41

EnfoRSer-S R1 R2 R3 R4 s0 s2 s5 s7 s3 s8 s1 s4 s6 42

EnfoRSer-S R1 R2 R3 R4 s0 s2 s5 s7 s3 s8 s1 s4 s6 RACE 43

EnfoRSer-S R1 R2 R3 R4 s0 s2 s5 s7 s3 s8 s1 s4 s6 RACE Naik et al. s 2006 race detection algorithm, Chord 44

EnfoRSer-S R1 R2 R3 R4 L3 s0 s2 s5 s7 s3 L2 s8 s1 s4 s6 L4 L0 L1 45

EnfoRSer-S R1 R2 R3 R4 L0 L0 L1 L3 s0 s2 s5 s7 s3 s8 s1 s4 s6 46

EnfoRSer-S R1 R2 R3 R4 L0 L0 L1 L3 L1 L1 L2 L2 L4 s0 s2 s5 s7 s3 s8 s1 s4 s6 47

EnfoRSer-S R1 R2 R3 R4 L0 L0 L1 L3 L1 L1 L2 L2 L4 s0 s2 Does not reduce s3 instrumentation overhead! s5 s7 s8 s1 s4 s6 48

EnfoRSer-S R1 R2 R3 R4 s0 s2 s5 s7 s3 s8 s1 s4 s6 49

EnfoRSer-S R1 R2 R3 R4 s0 s2 s5 s7 s3 s8 s1 s4 s6 Same Region Static Locks (SRSL) RACE 50

EnfoRSer-S R1 R2 R3 R4 s0 s2 s5 s7 s3 s8 s1 s4 s6 Same Region Static Locks (SRSL) RACE L0 L1 51

EnfoRSer-S R1 R2 R3 R4 L0 L0 L0 L1 s0 s2 s5 s7 s3 s8 s1 s4 s6 52

EnfoRSer-S R1 R2 R3 R4 s0 L0 Reduces instrumentation L0 L0 L1 overhead s2 s5 s7 Increases contention s3 s8 s1 s4 s6 53

%overhead over unmodified JVM Run-time Performance 115 EnfoRSer-D EnfoRSer-S 11,000 3,500 12,000 33,000 10,000 4,300 3,400 660 2,600 95 75 55 35 15-5 hsqldb6 xalan6 avrora9 luindex9 lusearch6 lusearch9 sunflow9 xalan9 pjbb2000 pjbb2005 geomean 54

%overhead over unmodified JVM Run-time Performance 115 EnfoRSer-D EnfoRSer-S 11,000 3,500 12,000 33,000 10,000 4,300 3,400 660 2,600 95 75 2600% overhead on average 55 35 15-5 hsqldb6 xalan6 avrora9 luindex9 lusearch6 lusearch9 sunflow9 xalan9 pjbb2000 pjbb2005 geomean 55

%overhead over unmodified JVM Run-time Performance 115 EnfoRSer-D EnfoRSer-S 11,000 3,500 12,000 33,000 10,000 4,300 3,400 660 2,600 95 75 55 35 EnfoRSer-S performs better 2600% overhead on average 15-5 hsqldb6 xalan6 avrora9 luindex9 lusearch6 lusearch9 sunflow9 xalan9 pjbb2000 pjbb2005 geomean 56

Hybridizing Locks High contention sites: Precise dynamic locks (precise conflict detection) Low contention sites: Single static lock (low instrumentation overhead) 57

EnfoRSer-H Static locks to reduce instrumentation Dynamic locks for precise conflict detection Correctly and efficiently combine: best of both 58

%overhead over unmodified JVM Run-time Performance 115 EnfoRSer-D EnfoRSer-S EnfoRSer-H 11,000 3,500 12,000 33,000 10,000 4,300 3,400 660 2,600 95 75 26% overhead on average 55 35 15-5 hsqldb6 xalan6 avrora9 luindex9 lusearch6 lusearch9 sunflow9 xalan9 pjbb2000 pjbb2005 geomean 59

%overhead over unmodified JVM Run-time Performance 115 EnfoRSer-D EnfoRSer-S EnfoRSer-H 11,000 3,500 12,000 33,000 10,000 4,300 3,400 660 2,600 95 75 Provides nearly the best of either approaches! 55 35 63% reduction in overhead 15-5 hsqldb6 xalan6 avrora9 luindex9 lusearch6 lusearch9 sunflow9 xalan9 pjbb2000 pjbb2005 geomean 60

%overhead over unmodified JVM Run-time Performance 115 95 EnfoRSer-D EnfoRSer-S EnfoRSer-H 11,000 3,500 12,000 33,000 10,000 4,300 3,400 660 2,600 49% reduction in lock acquires and not increasing contention! 75 55 87% reduction in overhead 35 15-5 hsqldb6 xalan6 avrora9 luindex9 lusearch6 lusearch9 sunflow9 xalan9 pjbb2000 pjbb2005 geomean 61

%overhead over unmodified JVM Run-time Performance 115 EnfoRSer-D EnfoRSer-S EnfoRSer-H 11,000 3,500 12,000 33,000 10,000 4,300 3,400 660 2,600 95 49% reduction in lock acquires 75 55 Dramatically improves on both when hybridization helps! 87% reduction in overhead 35 15-5 hsqldb6 xalan6 avrora9 luindex9 lusearch6 lusearch9 sunflow9 xalan9 pjbb2000 pjbb2005 geomean 62

Hybridizing Locks Right synchronization for right program sites Combining different synchronization mechanisms Best of different synchronization mechanisms 63

EnfoRSer-H Costbenefit model Profiling Right locks for right sites Assignment algorithm Combine static and dynamic locks 64

Two-iteration Methodology Program execution First iteration Profiling (use RACE relation) Assignment Algorithm Between iterations Compute SRSL and estimate cost Program exectuion Second iteration Use hybridized instrumentation 65

Assignment Algorithm 66

Profiled data Compute initial cost Change a static lock to dynamic lock Change all its racing sites to dynamic locks Profiled data Recompute cost current cost < previous cost No Yes Retain state Revert to previous state 67

Assignment Algorithm R1 R2 R3 R4 s0 s2 s4 s7 s1 s3 s5 s8 s9 s6 estimatedcost = N i estimatecost(ri) ; 68

Assignment Algorithm R1 R2 R3 R4 s0 s2 s4 s7 s1 s3 s5 s8 s9 s6 estimatedcost = N i estimatecost(ri) ; 1. Conflicts on each site 2. Lock acquires on each site 69

Assignment Algorithm R1 R2 R3 R4 s0 s2 s4 s7 s1 s3 s5 s8 s9 s6 70

Assignment Algorithm R1 R2 R3 R4 s0 s2 s4 s7 s1 s3 s5 s8 s9 s6 71

Assignment Algorithm R1 R2 R3 R4 s0 s2 s4 s7 s1 s3 s5 s8 s9 s6 If (current cost < previous cost) // retain state 72

Assignment Algorithm R1 R2 R3 R4 s0 s2 s4 s7 s1 s3 s5 s8 s9 s6 Switch on next site 73

Assignment Algorithm R1 R2 R3 R4 s0 s2 s4 s7 s1 s3 s5 s8 s9 s6 74

Assignment Algorithm R1 R2 R3 R4 s0 s2 s4 s7 s1 s3 s5 s8 s9 s6 If (current cost < previous cost) // retain state 75

Assignment Algorithm R1 R2 R3 R4 s0 s2 s4 s7 s1 s3 s5 s8 s9 s6 Switch on next site 76

Assignment Algorithm R1 R2 R3 R4 s0 s2 s4 s7 s1 s3 s5 s8 s9 s6 estimatedcost = N i estimatecost(ri) ; current cost > previous cost // revert state 77

Assignment Algorithm R1 R2 R3 R4 s0 s2 s4 s7 s1 s3 s5 s8 s9 s6 78

Assignment Algorithm R1 R2 R3 R4 s0 s2 s4 s7 s1 s3 s5 s8 s9 s6 s1, s9 79

Assignment Algorithm R1 R2 R3 R4 s0 s2 s4 s7 s1 s3 s5 s8 s9 s6 Final State! 80

Lock Assignment R1 R2 R3 R4 s0 s2 s4 s7 s1 s3 s5 s8 s9 s6 L1 81

Lock Assignment R1 R2 R3 R4 L1 L1 L1 s0 s2 s4 s7 s1 s3 s5 s8 s9 s6 L1 82

Lock Assignment R1 R2 R3 R4 L1 L1 L1 s0 s2 s4 s7 s1 s3 s5 s8 s9 s6 L1 83

Lock Assignment R1 R2 R3 R4 L1 L1 L1 s0 s2 s4 s7 s1 s3 s5 s8 s9 s6 84

Lock Assignment R1 R2 R3 R4 L1 L1 L1 s0 s2 s4 s7 s1 s3 s5 s8 s9 s6 Per-object locks and transformations 85

Related Work Use of static locks Chimera, Lee et al., PLDI 2012 Use of static analysis Static Conflict Analysis for Multi-Threaded Object-Oriented Programs, Von Praun and Gross, PLDI, 2003. Goldilocks, Elmas et al., PLDI 2007. Red Card, Flanagan and Freund, ECOOP 2013 Hybridizing locks Hybrid Tracking, Cao et al., WODET 2014 86

Conclusion Combine synchronization mechanisms Best of different kinds of synchronization Efficient enforcement of SBRS Static locks Dynamic locks Precise conflict detection Low instrumentation overhead Select the best for a region 87