Lock Inference in the Presence of Large Libraries

Lock Inference in the Presence of Large Libraries. Khilan Gudka, Imperial College London*; Tim Harris, Microsoft Research Cambridge; Susan Eisenbach, Imperial College London. ECOOP 2012. This work was generously funded by Microsoft Research Cambridge. * Now at University of Cambridge Computer Laboratory.

Concurrency control. Status quo: we use locks. But there are problems with them: not composable, break modularity, deadlock, priority inversion, convoying, starvation, hard to change granularity (and hard to maintain in general). We want to eliminate the lock abstraction, but is there a better alternative?

Atomic secoons What programmers probably can do is tell which parts of their program should not involve interferences Atomic sec;ons DeclaraOve concurrency control Move responsibility for figuring out what to do to the compiler/runome atomic { x.f++; y.f++; } 3

Atomic secoons Simple semanocs (no interference allowed) Naïve implementaoon: one global lock But we soll want to allow parallelism without: Interference Deadlock OpOmisOc vs. PessimisOc implementaoons 4

Implementing Atomic Sections: Optimistic = transactional memory. Advantages: none of the problems associated with locks; more concurrency. Disadvantages: irreversible operations (I/O, system calls); runtime overhead. Much interest.

Implementing Atomic Sections: Pessimistic = lock inference. Statically infer and instrument the locks that are needed to protect shared accesses. atomic { x.f++; y.f++; } is compiled to lock(x); lock(y); x.f++; y.f++; unlock(y); unlock(x); Locks are acquired in two-phase order for atomicity. Can handle irreversible operations!

Motivation: A Simple I/O Example. atomic { System.out.println("Hello World!"); }

Motivation: A Simple I/O Example. [Call graph of the Hello World example.]

Motivation: A Simple I/O Example. We cannot find any lock inference analysis in the literature that can handle this! General goals/challenges of lock inference: maximise concurrency, minimise locking overhead, avoid deadlock, and achieve all of the above in the presence of libraries. Challenges that libraries introduce: scalability (many and long call chains) and imprecision (having to consider all library execution paths).

Our lock inference analysis: infer fine-grained locks. Infer path expressions at each program point. For Obj x = ...; Obj y = ...; atomic { x = y; x.f++; } the lock sets are { y } at the entry of the section, { x } between the two statements, and {} at the exit, which yields the instrumented code: Obj x = ...; Obj y = ...; lock(y); x = y; x.f++; unlock(y);
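
A minimal sketch of the idea behind this slide, under a heavily simplified statement language (the class, record, and method names here are invented for the illustration, not the paper's implementation): a backward pass over the atomic section's body that collects the objects to lock and rewrites them through copies.

    import java.util.*;

    // Minimal sketch (greatly simplified from the paper's analysis): walk an
    // atomic section's body backwards, collecting the objects that must be
    // locked and rewriting them through copies, so that "x.f++" after "x = y"
    // ends up requiring a lock on y at the entry of the section.
    class PathInferenceSketch {
        interface Stmt {}
        record Copy(String lhs, String rhs) implements Stmt {}   // x = y
        record Access(String target) implements Stmt {}          // x.f++

        static Set<String> locksAtEntry(List<Stmt> body) {
            Set<String> locks = new HashSet<>();                 // set at the exit: {}
            for (int i = body.size() - 1; i >= 0; i--) {
                Stmt s = body.get(i);
                if (s instanceof Access a) {
                    locks.add(a.target());                       // gen: lock the accessed object
                } else if (s instanceof Copy c && locks.remove(c.lhs())) {
                    locks.add(c.rhs());                          // kill x, gen y: rewrite x to y
                }
            }
            return locks;
        }

        public static void main(String[] args) {
            // atomic { x = y; x.f++; }  ->  lock y at entry
            System.out.println(locksAtEntry(List.of(new Copy("x", "y"), new Access("x"))));
        }
    }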

Scaling by computing summaries. void m(Obj p) { p.f = 1; } For a call m(a), applying m's summary gives f_m({}) = { a }, where f_m is m's summary function. Summaries can get large: the challenge is to find a representation of transfer functions that allows fast composition and meet operations.
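
A small sketch of what a summary could look like for the m on this slide, assuming a summary is simply a set of locks expressed over the callee's formal parameters (the class and method names are invented for the illustration): at a call site the formals are substituted by the actual arguments, so the callee's body does not have to be re-analysed per call.

    import java.util.*;

    // Minimal sketch (assumed simplification): the summary of
    //   void m(Obj p) { p.f = 1; }
    // is "lock formal p"; applying it at a call m(a) substitutes the actual a
    // for the formal p, giving { a } without re-analysing m's body.
    class SummarySketch {
        static final Set<String> SUMMARY_OF_M = Set.of("p");

        static Set<String> applyAtCallSite(Map<String, String> formalToActual) {
            Set<String> locks = new HashSet<>();
            for (String formal : SUMMARY_OF_M)
                locks.add(formalToActual.getOrDefault(formal, formal));
            return locks;
        }

        public static void main(String[] args) {
            System.out.println(applyAtCallSite(Map.of("p", "a"))); // prints [a]
        }
    }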

IDE Analyses. Use Sagiv et al.'s Interprocedural Distributive Environment framework. Advantage: an efficient graph representation of transfer functions that allows fast composition and meet. Example for x = y: Kill(IN) = { x }, Gen(IN) = { y | x in IN }, OUT = (IN \ Kill(IN)) U Gen(IN); with IN = { x, z } this gives OUT = { y, z }.

Transfer functions as graphs. Graphs are kept sparse by not explicitly representing trivial edges. For x = y, the transformer graph (kill x, gen an edge from x to y) maps the incoming set { x, z } to the outgoing set { y, z }. Transformer composition is simply transitive closure.
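
A sketch of the graph view of transfer functions, assuming a transformer is just a relation from facts flowing in to facts flowing out (the class name is invented, and unlike the slide's representation the trivial identity edges are stored explicitly to keep the code short): applying a transformer follows its edges, and composing two transformers is relational composition, the transitive step mentioned above.

    import java.util.*;

    // Minimal sketch: a transfer function as a graph (relation) from incoming
    // facts to outgoing facts. Identity edges are explicit here for brevity.
    class TransformerGraphSketch {
        final Map<String, Set<String>> edges = new HashMap<>();

        TransformerGraphSketch edge(String from, String to) {
            edges.computeIfAbsent(from, k -> new HashSet<>()).add(to);
            return this;
        }

        Set<String> apply(Set<String> in) {
            Set<String> out = new HashSet<>();
            for (String f : in) out.addAll(edges.getOrDefault(f, Set.of()));
            return out;
        }

        // Compose: run 'first', then this transformer (relational composition).
        TransformerGraphSketch compose(TransformerGraphSketch first) {
            TransformerGraphSketch r = new TransformerGraphSketch();
            for (var e : first.edges.entrySet())
                for (String mid : e.getValue())
                    for (String to : edges.getOrDefault(mid, Set.of()))
                        r.edge(e.getKey(), to);
            return r;
        }

        public static void main(String[] args) {
            // Backward transformer for "x = y": x is rewritten to y; y and z pass through.
            TransformerGraphSketch xEqY = new TransformerGraphSketch()
                    .edge("x", "y").edge("y", "y").edge("z", "z");
            System.out.println(xEqY.apply(Set.of("x", "z")));          // the set {y, z}
            // Transformer for "y = z", composed with the one above: x now maps to z.
            TransformerGraphSketch yEqZ = new TransformerGraphSketch()
                    .edge("y", "z").edge("z", "z").edge("x", "x");
            System.out.println(yEqZ.compose(xEqY).apply(Set.of("x"))); // the set {z}
        }
    }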

Transfer functions as graphs. Implicit edges should not have to be made explicit, as that would be expensive. For our analysis, most transformer functions perform rewrites, so determining whether an implicit edge exists is costly using Sagiv et al.'s graphs.

Transfer functions as graphs (ours). We represent kills in transformers with an explicit kill edge on the killed variable x, and transformer edges such as x to y also implicitly kill x. Result: whether an implicit edge exists is very easy to determine, which leads to fast transitive closure.

Transfer functions as graphs (ours). Example: for x = y, the incoming set { x, z } is mapped to { y, z }.

Implementation. We implemented our approach in the Soot framework. Evaluated using standard benchmarks for atomicity (that do not perform system calls).

Name     #Threads  #Atomics  #Client methods  #Lib methods  LOC (client)
sync     8         2         0                0             1177
pcmab    50        2         2                15            457
bank     8         8         6                7             269
traffic  2         24        4                63            2128
mtrt     2         6         67               1324          11312
hsqldb   20        240       2107             2955          301971

Analysis times. Experimental machine (a modern desktop): 8-core i7 3.4 GHz, 8 GB RAM, Ubuntu 11.04, Oracle Java 6. Java options: min & max heap 8 GB, stack 128 MB.

Name     Paths   Locks   Total
sync     0.05s   0.01s   2m 7s
pcmab    0.15s   0.02s   2m 7s
bank     0.15s   0.02s   2m 7s
traffic  0.37s   0.06s   2m 10s
mtrt     33.9s   1.89s   2m 49s
hsqldb   ?       ?       ?

Simple analysis not enough. Our analysis still wasn't efficient enough to analyse hsqldb, so we performed further optimisations to reduce space and time:
- Primitives for state: encode the analysis state as sets of longs for efficiency (all subsequent optimisations assume this).
- Parallel propagation: perform intra-procedural propagation in parallel for different methods, and inter-procedural propagation in parallel for different call sites.
- Summarising CFGs: merge CFG nodes to reduce the amount of storage space and propagation.
- Worklist ordering: order the worklist so that successor nodes are processed before predecessor nodes, which helps reduce redundant propagation.
- Deltas: only propagate new dataflow information, reducing the amount of redundant work.
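
One possible reading of the "primitives for state" optimisation, sketched below (the class and method names are invented): dataflow facts are numbered and each state is stored as a bit vector packed into a long[], so membership tests and the union used at meet points become cheap word operations.

    // Minimal sketch (assumed interpretation of "encode analysis state as sets
    // of longs"): facts are numbered 0..n-1 and a state is a packed bit vector,
    // so meet (set union) is a bitwise OR over a handful of machine words.
    class BitSetStateSketch {
        final long[] words;

        BitSetStateSketch(int numFacts) { words = new long[(numFacts + 63) / 64]; }

        void add(int fact)         { words[fact >> 6] |= 1L << (fact & 63); }
        boolean contains(int fact) { return (words[fact >> 6] & (1L << (fact & 63))) != 0; }

        // Union in place; returns true if this state changed (useful for worklists).
        boolean unionWith(BitSetStateSketch other) {
            boolean changed = false;
            for (int i = 0; i < words.length; i++) {
                long before = words[i];
                words[i] |= other.words[i];
                changed |= words[i] != before;
            }
            return changed;
        }
    }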

Evaluation of analysis optimisations: analysis running time on the Hello World program. [Chart: running time in minutes (0 to 17) against number of analysis threads (1 to 8), for the configurations None, Summarise CFGs, Worklist Ordering, Deltas, and All.]

Evaluation of analysis optimisations: analysis memory usage on the Hello World program.

Optimisation       Average MB  Peak MB
None               4923.92     8183.18
Summarise CFGs     2094.68     3470.65
Worklist Ordering  4804.73     8037.14
Deltas             3848.98     6538.27
All                1741.39     3122.84

Analysis times. Experimental machine for hsqldb: 256-core Xeon E7-8837 2.67 GHz, 3 TB RAM, SUSE Linux Enterprise Server, Oracle Java 6. Java options: min & max heap 70 GB, stack 128 MB, 8 analysis threads.

Name     Paths   Locks   Total
sync     0.05s   0.01s   2m 7s
pcmab    0.15s   0.02s   2m 7s
bank     0.15s   0.02s   2m 7s
traffic  0.37s   0.06s   2m 10s
mtrt     33.9s   1.89s   2m 49s
hsqldb   6h 6m   22m     6h 38m

What about deadlock? Lock inference inserts locks automatically, so it must ensure that deadlock doesn't happen. Static analysis is too conservative, and deadlock happens very infrequently. All locks are taken at the start of the atomic section, so we can just roll back the locks if deadlock occurs and try again!
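
A sketch of one way such a runtime scheme could look (an illustration under assumptions, not necessarily the talk's exact mechanism; the class and method names are invented): all inferred locks are acquired up front with tryLock, and if any acquisition fails, everything taken so far is released and the attempt is retried, so a thread never blocks on a lock while holding others.

    import java.util.List;
    import java.util.concurrent.locks.ReentrantLock;

    // Minimal sketch: acquire every inferred lock before running the atomic
    // section; on failure, roll back and retry, so the section never waits
    // for a lock while holding others and therefore cannot deadlock.
    class LockAcquisitionSketch {
        static void acquireAll(List<ReentrantLock> locks) {
            retry:
            while (true) {
                int taken = 0;
                for (ReentrantLock l : locks) {
                    if (l.tryLock()) { taken++; continue; }
                    for (int i = 0; i < taken; i++) locks.get(i).unlock(); // roll back
                    Thread.yield();                                        // back off briefly
                    continue retry;                                        // try again
                }
                return; // all locks held: run the atomic section body, then releaseAll
            }
        }

        static void releaseAll(List<ReentrantLock> locks) {
            for (int i = locks.size() - 1; i >= 0; i--) locks.get(i).unlock();
        }
    }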

What about runtime performance?

Benchmark  Manual  Global  Us       Us vs Manual
sync       69.14s  71.22s  81.59s   1.18x
pcmab      2.28s   3.15s   54.61s   23.95x
bank       20.89s  19.50s  76.88s   3.68x
traffic    2.56s   4.22s   20.77s   8.11x
mtrt       0.80s   0.82s   0.91s    1.14x
hsqldb     3.25s   3.12s   419s     129.03x

Improve runtime performance: avoid unnecessary locking. We avoid unnecessary locking to improve the performance of the resulting instrumented programs.

Lock optimisation             Type of analysis  Runtime slowdown vs. manual locking
Single-threaded lock elision  Dynamic           1.10x  16.13x
Thread-local                  Static            1.09x  14.84x
Instance-local                Static            1.13x  13.16x
Class-local                   Static            1.14x  15.32x
Method-local                  Static            1.14x  15.05x
Dominated                     Static            1.14x  15.47x
Read-only                     Static            1.14x  13.26x

Removing locks: all optimisations.

Benchmark  Manual  Global  Us (no opt.)  Us (all opt.)  Us vs Manual  Us vs Global
sync       69.14s  71.22s  81.59s        56.61s         0.82x         0.79x
pcmab      2.28s   3.15s   54.61s        2.47s          1.08x         0.78x
bank       20.89s  19.50s  76.88s        3.88s          0.19x         0.20x
traffic    2.56s   4.22s   20.77s        4.42s          1.73x         1.05x
mtrt       0.80s   0.82s   0.91s         0.85s          1.06x         1.04x
hsqldb     3.25s   3.12s   419s          11.39s         3.50x         3.65x

What about Hello World? Concurrent Hello World benchmark with 8 threads; each thread prints "Hello World! from thread X" 1000 times. Analysis results: analysis time 2m 30s; number of locks without lock optimisations 495; number of locks with all lock optimisations 25. Runtime performance: Manual 0.32s, Global 0.27s, Us (no opt.) 2.21s, Us (all opt.) 0.8s.

Conclusion. Existing lock inference approaches are unsound because they do not analyse library code (due to scalability, imprecision, etc.). Our approach does analyse it, and is thus correct by construction. With an extensive set of optimisations, we achieve a worst-case execution time of only 3.50x, and under 2x in the general case, versus perfect and well-tested manual locking, as well as some speed-ups. So programmers get the simplicity of atomic sections with almost the speed of manual locks.

Questions? "Out of this nettle, danger, we pluck this flower, safety." (William Shakespeare) If he were a programmer today: "Out of this nettle, concurrency, we pluck this flower, atomicity."