Do you have to reproduce the bug on the first replay attempt?
PRES: Probabilistic Replay with Execution Sketching on Multiprocessors
Soyeon Park, Yuanyuan Zhou (University of California, San Diego)
Weiwei Xiong, Zuoning Yin, Rini Kaushik, Kyu H. Lee, Shan Lu (University of Illinois at Urbana-Champaign)

Concurrency bugs are important
- Writing concurrent programs is difficult: programmers are used to sequential thinking, so concurrent programs are prone to bugs
- Concurrency bugs cause severe real-world problems (Therac-25, the Northeast blackout)
- The multi-core trend worsens the problem

Characteristics of Concurrency Bugs
A concurrency bug may need a special thread interleaving to manifest. Example (Apache):

  Thread 1:                                    Thread 2:
  if (buf_index + len < BUFFSIZE)
                                               buf_index += len;
  memcpy(buf[buf_index], log, len);  /* Crash! */

Two implications:
- Hard to expose a concurrency bug during testing
- Difficult to reproduce a concurrency bug for diagnosis
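The Apache fragment above is a check-then-act atomicity violation: the bounds check and the write are not atomic. A minimal Python sketch (hypothetical buffer sizes, and a single-threaded simulation of the two interleavings rather than real threads) shows why only one interleaving fails:

```python
# Hypothetical sizes; the real bug is in Apache's log-buffer code.
BUFFSIZE = 8

def run(interleaved):
    """Simulate Thread 1's check/act, with Thread 2's increment
    optionally slipping in between. Returns True if the write
    stays in bounds."""
    buf_index, length = 5, 2
    # Thread 1: the check
    if buf_index + length < BUFFSIZE:
        if interleaved:
            buf_index += length   # Thread 2 runs here
        # Thread 1: the act -- write `length` bytes at buf_index
        return buf_index + length <= BUFFSIZE
    return True

assert run(interleaved=False) is True    # serial order: safe
assert run(interleaved=True) is False    # bad interleaving: overflow
```

The same code is safe under almost every schedule, which is exactly why such bugs escape testing and resist reproduction.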

Deterministic Replay on a Uniprocessor
Record the non-deterministic factors, then re-execute:
- Inputs (keyboard, network, files, etc.)
- Thread scheduling
- Return values of system calls
The production run logs these factors; the replay run feeds the same inputs, schedule, and system-call results back to reproduce the bug.

Deterministic Replay for Multiprocessors
Much more difficult: threads execute simultaneously on different processors, adding an extra source of non-determinism, the interleaving of shared memory accesses. In the Apache example, the failing interleaving is:

  T2  S1: if (buf_index + len < BUFFSIZE)
  T3  S2: buf_index += len;
  T2  S3: memcpy(buf[buf_index], log, len);  /* Crash! */

State of the Art in Multiprocessor Replay
Hardware-assisted approaches
- Record all thread interactions with new hardware extensions
- e.g. Flight Data Recorder, BugNet, Strata, RTR, DMP, Rerun
- None of this hardware exists in reality!
Software-only approaches
- High production-run overhead (10-100x), due to capturing the global order of shared memory accesses; not practical
- e.g. InstantReplay, Strata/s
Recent work: SMP-ReVirt
- Uses the page-protection mechanism to optimize memory monitoring
- Still >10x production-run overhead on 2 or 4 processors
- False sharing and page contention limit scalability

Contrast between Common Practice and Existing Research Proposals
- Common practice: the production run has ~0% overhead, but the diagnosis phase needs >1000 replay attempts*
- Existing research proposals: the bug is reproduced on the 1st replay attempt, but at a 10-100x production-run slowdown; impractical
*: according to our experimental results

Observations
Plotting replay attempts against production-run recording overhead: current practice sits at >1000 attempts with near-zero overhead; existing software-only proposals sit at 1 attempt but an impractical 10-100x overhead; the ideal case is low on both axes.
1) Production-run performance is more critical than replay time
2) We do NOT need to reproduce a bug on the 1st replay attempt

Our Idea
Probabilistic Replay with Execution Sketching (PRES)
- Record only partial information during the production run: low recording overhead
- Push the complexity to diagnosis time: leverage feedback from unsuccessful replays

PRES Overview
Sketch recording during the production run captures partial information (sketches). In the diagnosis phase, Partial-Information based replay (PI-Replay) combines the sketches with feedback from detected off-sketch paths and errors, converging on complete information and reproducing the bug with 100% probability. Two key points:
- Record only partial information (a sketch) during the production run
- Reproduce the bug, not necessarily the original execution

Contents
- Introduction
- Our approach
  - Overview of PRES
  - Sketch recording
  - Bug reproduction: Partial-Information based replayer, Monitor, Feedback generator
- Evaluation
- Conclusion

Sketch Recording
A spectrum of sketches recorded during the production run, from lower to higher overhead:
- BASE: uni-processor deterministic replay (all inputs, thread scheduling, system calls; no global ordering)
- SYNC: BASE + global order of synchronization operations
- SYS: + global order of system calls
- FUNC: + global order of function calls
- BB-N: + global order of every Nth basic block (e.g. BB-2: every 2nd basic block)
- BB: + global order of all basic blocks
- RW: global order of every shared memory read/write operation = existing software-only deterministic replay for multiprocessors
Each level subsumes the ones above it. Example, with SYNC sketch points marked:

  Thread 1:                      Thread 2:
  worker() {                     worker() {
    lock(L);        /* sketch */   lock(L);        /* sketch */
    myid = gid;                    myid = gid;
    gid = myid + 1;                gid = myid + 1;
    unlock(L);      /* sketch */   unlock(L);      /* sketch */
    ...                            ...
    if (myid == 0)                 tmp = result;
      result = data;               printf("%d\n", tmp);  /* wrong output! */
  }                              }
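A SYNC-level sketch only needs a global sequence number per synchronization operation. The following Python sketch (an illustration under assumed names, `SketchLock` is not from PRES) wraps a lock so that every acquire/release appends a globally ordered sketch point:

```python
import itertools
import threading

class SketchLock:
    """SYNC-level sketch recording: log only the global order of
    synchronization operations, never the data accesses between them."""
    _order = itertools.count(1)   # global sequence numbers
    sketch = []                   # recorded (seq, thread, op) tuples

    def __init__(self, name):
        self.name = name
        self._lk = threading.Lock()

    def acquire(self, tid):
        self._lk.acquire()
        SketchLock.sketch.append(
            (next(SketchLock._order), tid, "lock " + self.name))

    def release(self, tid):
        SketchLock.sketch.append(
            (next(SketchLock._order), tid, "unlock " + self.name))
        self._lk.release()

L = SketchLock("L")
L.acquire("T1"); L.release("T1")
L.acquire("T2"); L.release("T2")
assert [seq for seq, _, _ in SketchLock.sketch] == [1, 2, 3, 4]
```

Because only sketch points are logged, the overhead stays low; the races between sketch points (like the unprotected `result` above) remain unrecorded, which is what the replayer must later search over.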

Contents
- Introduction
- Our approach
  - Overview of PRES
  - Sketch recording
  - Bug reproduction: Partial-Information based replayer (PI-Replayer), Monitor, Feedback generator
- Evaluation
- Conclusion

Partial Information-based Replay
Process of the bug reproduction phase: the sketch recorder's sketches feed the PI-replayer, which replays under a monitor; when a replay stops or aborts, the feedback generator extracts lessons on how to improve the replay and the replayer restarts. Over iterations the partial information converges toward complete information, reproducing the bug with 100% probability.
The monitor is used for:
- Detecting successful bug reproduction
- Detecting off-sketch paths (the replay deviates from the sketches)

PI-Replayer
The Partial-Information based replayer consults the execution sketch to enforce observed global orders: right before re-executing a sketch point, it makes sure all prior sketch points from other threads have been executed.

  Production run:    SYNC sketch:               Replay run:
  T1: lock(A)        T1: lock A, order 1        T2 reaches lock(B), order 2
  T2: lock(B)        T2: lock B, order 2        -> waits for T1 to execute lock(A) first
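The "wait until all earlier sketch points have run" rule can be demonstrated with a condition variable. This is an illustrative Python sketch (the `SketchEnforcer` name and interface are invented, not PRES's actual code): each thread blocks until its recorded point is the next one in the sketch.

```python
import threading

class SketchEnforcer:
    """Replay-time helper: block a thread at its sketch point until
    every earlier point in the recorded global order has executed."""
    def __init__(self, recorded_order):
        self.order = recorded_order     # sketch: global order of points
        self.next_idx = 0
        self.cv = threading.Condition()

    def at_sketch_point(self, point, action):
        with self.cv:
            # wait until this point is the next one in the recorded order
            while self.order[self.next_idx] != point:
                self.cv.wait()
            action()                    # re-execute the point in order
            self.next_idx += 1
            self.cv.notify_all()

executed = []
enf = SketchEnforcer(["T1:lock(A)", "T2:lock(B)"])

def worker(point):
    enf.at_sketch_point(point, lambda: executed.append(point))

# Start T2 first: it must still wait until T1's sketch point has run.
t2 = threading.Thread(target=worker, args=("T2:lock(B)",))
t1 = threading.Thread(target=worker, args=("T1:lock(A)",))
t2.start(); t1.start()
t1.join(); t2.join()
assert executed == ["T1:lock(A)", "T2:lock(B)"]
```

Even though the OS scheduler runs T2 first, the recorded order wins, which is exactly how the replay reimposes the production run's synchronization order.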

Monitor
Detect successful bug reproduction:
- Crash failure: PRES catches exceptions
- Deadlock: a periodic timer checks for progress
- Incorrect results: the programmer provides checking conditions; can leverage testing oracles and existing bug-detection tools
Detect unsuccessful replay:
- Compare against the execution sketch from the original execution
- Prevents wasting replay effort on a wrong path

What if a replay attempt fails? Replay it again!
- Restart from the beginning or the previous checkpoint
- Shall we do something different next time?
  - Random approach: just leave it to fate
  - Systematic approach: actively learn from previous mistakes

Feedback Generator (1/2)
Why couldn't previous replays reproduce the bug? Some un-recorded data races executed in different orders.

  Production run:                  1st replay attempt (FUNC sketches):
  Thread 1: if (myid == 0)         Thread 2: tmp = result;
              result = data;                 printf("%d\n", tmp);
  Thread 2: tmp = result;          Thread 1: if (myid == 0)
            printf("%d\n", tmp);               result = data;
                                   -> fails to reproduce the bug!

The original order of this race is not recorded in the sketch.

Feedback Generator (2/2)
Steps:
1. Identify dynamic race pairs: run a happens-before race detector over the replay recorder's R/W traces to get the initial candidates (unrecorded races)
2. Filter order-determined races: discard races whose order is already implied by the order of sketch points, leaving the suspect races
3. Select one suspect race: close-to-failure-first, depth-first
4. Start the next replay: deterministically execute up to the suspect race pair, then flip the race order
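Steps 1 and 2 rest on happens-before reasoning over vector clocks: two accesses are a race candidate only if neither is ordered before the other, and a candidate is filtered out once a recorded sketch point implies an order. A small Python sketch (vector clocks as plain dicts; the thread names and clock values are made up for illustration):

```python
def happens_before(a, b):
    """a, b are vector clocks (dicts: thread -> counter).
    a happens-before b iff a <= b pointwise and a != b."""
    keys = set(a) | set(b)
    le = all(a.get(k, 0) <= b.get(k, 0) for k in keys)
    return le and a != b

# Two writes to the same variable from different threads:
w1 = {"T1": 2, "T2": 0}   # Thread 1's clock at its write
w2 = {"T1": 0, "T2": 3}   # Thread 2's clock at its write
racy = not happens_before(w1, w2) and not happens_before(w2, w1)
assert racy               # concurrent accesses -> candidate race pair

# If a recorded sketch point orders them (T2's clock absorbed T1's):
w2_ordered = {"T1": 2, "T2": 3}
assert happens_before(w1, w2_ordered)   # order implied -> filter out
```

Each surviving suspect race then yields one new replay attempt with its order flipped, so the search space shrinks with every failed replay instead of staying random.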

Contents
- Introduction
- Our approach
  - Overview of PRES
  - Sketch recording
  - Bug reproduction: Partial-Information based replayer, Monitor, Feedback generator
- Evaluation
- Conclusion

Methodology
- PRES implemented using Pin; 8-core Xeon machine
- 11 evaluated applications:
  - 4 server applications (Apache, MySQL, Cherokee, OpenLDAP)
  - 3 desktop applications (Mozilla, PBzip2, Transmission)
  - 4 scientific computing applications (Barnes, Radiosity, FMM, LU)
- 13 real-world concurrency bugs:
  - 6 atomicity-violation bugs (single- and multi-variable)
  - 4 order-violation bugs
  - 3 deadlock bugs

Recording Overhead (1/2): Non-Server Applications

  Overhead (%)    PIN    SYNC   SYS    FUNC   BB-5    BB-2    BB      RW
  Desktop applications
  Mozilla         32.5   59.5   60.6   83.8   598.0   858.8   1213.9  3093.5
  PBZip2          17.4   18.0   18.0   18.4   595.4   1066.6  1977.9  27009.7
  Transmission    5.9    14.5   21.3   32.7   30.3    33.9    41.9    71.8
  Scientific applications
  Barnes          2.5    6.5    6.8    427.7  424.2   1122.3  2351.0  28702.2
  Radiosity       16.0   16.1   16.1   779.0  480.3   1181.7  2425.5  27209.6

- RW: up to around 280x slowdown
- SYNC, SYS: 6-60% overhead; good for performance-critical applications

Recording Overhead (2/2): Server Application
Normalized throughput on MySQL: SYNC and SYS sustain about 3x higher throughput than RW.

Number of Replay Attempts

  Application    Bug type   BASE  SYNC  SYS  FUNC  BB-5  BB-2  BB  RW
  Server applications
  Apache         Atom.      NO    96    28   7     8     7     1   1
  MySQL          Atom.      NO    3     3    2     3     2     1   1
  Cherokee       Atom.      NO    33    22   24    25    8     7   1
  OpenLDAP       Deadlock   NO    1     1    1     1     1     1   1
  Desktop applications
  Mozilla        Multi-v    NO    3     3    3     4     4     3   1
  PBZip2         Atom.      NO    3     3    2     4     3     3   1
  Transmission   Order      NO    2     2    2     2     2     2   1
  Scientific applications
  Barnes         Order      NO    10    10   1     74    19    1   1
  Radiosity      Order      NO    NO    NO   152   5     1     1   1

NO: not reproduced within 1000 tries
SYNC and SYS reproduce 12 out of the 13 bugs, mostly within 10 attempts; FUNC and above reproduce all 13 bugs, mostly within 5 attempts.

Benefit of Feedback Generation (replay attempts with and without feedback)

  Application          SYNC  SYS  FUNC  BB-5  BB-2  BB
  Apache        w/     96    28   7     8     7     1
                w/o    NO    NO   NO    NO    754   1
  PBZip2        w/     3     3    2     4     3     3
                w/o    NO    NO   NO    NO    NO    NO
  Barnes        w/     10    10   1     74    19    1
                w/o    NO    NO   NO    NO    NO    NO

NO: not reproduced within 1000 tries
Low-overhead sketches (SYNC and SYS) can reproduce concurrency bugs only because the PI-replayer leverages feedback; the random approach does NOT work.

Effects of Race Filtering (number of dynamic benign data races to be explored)

  Application    BASE   SYNC  SYS  FUNC  BB-5  BB-2  BB
  Apache         54390  1072  274  33    25    25    6
  MySQL          39983  2     2    2     2     1     0
  Cherokee       133    86    58   16    36    7     3
  Mozilla        36258  317   310  14    72    60    42
  PBZip2         667    326   318  1     7     4     4
  Transmission   225    240   172  6     6     6     4

Filtering out the races whose order can be inferred from the sketches significantly shrinks the unrecorded non-deterministic space to explore.

Bugs vs. Execution Reproduction (replays for bug reproduction (BR) vs. execution reproduction (ER))

  Application         SYNC  SYS  FUNC  BB-5  BB-2  BB
  Mozilla       BR    3     3    3     4     4     3
                ER    16    16   4     4     4     3
  PBZip2        BR    3     3    2     4     3     3
                ER    5     5    2     4     5     3

Reproducing the exact same execution path requires 1.6-5x more attempts with SYNC and SYS.

Scalability
Normalized elapsed time on Radiosity (lower is better) at 2, 4, and 8 processors shows SYNC and SYS staying near BASE while FUNC, BB-N, and especially RW grow (RW reaching 46x to 273x); normalized MySQL throughput (higher is better) shows the same trend.
2-core: SYNC 6.2% overhead, SYS 11.8% (vs. 770% for SMP-ReVirt)

Conclusions
- Software solutions can be practical for reproducing concurrency bugs on multiprocessors
- PRES: Probabilistic Replay with Execution Sketching
  - No need to reproduce the bug on the first attempt
  - Trades replay-time efficiency for lower recording overhead
- The PI-Replayer, by leveraging feedback, makes low-overhead sketches (SYNC, SYS) useful for bug reproduction