Chimera: Hybrid Program Analysis for Determinism


1 Chimera: Hybrid Program Analysis for Determinism
  Dongyoon Lee, Peter Chen, Jason Flinn, Satish Narayanasamy
  University of Michigan, Ann Arbor
  * Chimera image from

2 Deterministic Replay
  Goal: record and reproduce multithreaded execution
    Debugging concurrency bugs
    Offline heavyweight dynamic analysis
    Forensics and intrusion detection
    and many more uses
  Problem: multithreaded record-and-replay is too slow (>2x) or requires custom hardware

3 Multithreaded Record-and-Replay is Slow
  [Diagram: Thread 1, Thread 2, and Thread 3 issue Write, Write, and Read operations on shared memory]
  Checkpoint memory and register state
  Log non-deterministic program input (interrupts, I/O values, DMA, etc.)
  Log shared memory dependencies

4 Replay for Data-Race-Free Programs is Cheap
  [Diagram: threads T1, T2, and T3 access X, Y, and Z, synchronized by Lock(l)/Unlock(l) and Signal(c)/Wait(c); arrows mark the order of memory ops. and the order of sync. ops.]
  Data-race-free programs
    Shared memory accesses are well ordered by synchronization ops.
    Recording the happens-before order of sync. ops. is sufficient
  Problem: programs with data races


5 Our Contribution: A Hybrid Analysis
  Chimera: transforms a potentially racy program P into a data-race-free program
    Sound static data race analysis
    Add synchronizations for potential data races
  Problem: too many false positives
    Profiling non-concurrent code regions
    Symbolic bounds analysis

6 Roadmap
  Motivation
  Chimera Analysis
    1) Static data race analysis
    2) Profiling non-concurrent code regions
    3) Symbolic bounds analysis
  Weak-lock Design
  Evaluation
  Conclusion


8 Static Data Race Analysis
  Find potential data races using a sound static data race detector
    RELAY [Voung et al., FSE '07]
  Protect all potential data races using weak-locks
    A new time-out lock which may be preempted (discussed later)
  Record and replay the happens-before order of weak-locks

9 Protect Potential Races using Weak-locks
  void foo() {
    X = 0;                 /* potential racy pair with X = 1 in bar() */
    for (i = ...) {
      Y[tid][i] = 0;       /* potential racy pair with Y[tid][i] = 1 in bar() */
    }
  }
  void bar() {
    X = 1;
    for (i = ...) {
      Y[tid][i] = 1;
    }
    Z = 1;                 /* no race report */
  }
  Static analysis helps avoid instrumentation for the access to Z
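
As a rough illustration of instruction-level weak-locks, the sketch below wraps each reported access in a lock that also logs which thread acquired it, so the acquisition order could later be replayed. It is not Chimera's implementation: `logged_lock_t`, `weak_lock`, `weak_unlock`, the array sizes, and the loop bound `NELEMS` are hypothetical, a plain pthread mutex stands in for the time-out lock, and placing `Z = 1` in `bar()` is an assumption about the slide layout.

```c
#include <pthread.h>
#include <stddef.h>

#define LOG_CAPACITY 64   /* hypothetical log size */
#define NTHREADS 2        /* hypothetical thread count */
#define NELEMS 4          /* hypothetical loop bound; the slide elides it */

/* A plain mutex plus a log of thread ids in acquisition order. */
typedef struct {
    pthread_mutex_t mutex;
    int order[LOG_CAPACITY];
    size_t count;
} logged_lock_t;

static logged_lock_t lock_x = { PTHREAD_MUTEX_INITIALIZER, {0}, 0 };
static logged_lock_t lock_y = { PTHREAD_MUTEX_INITIALIZER, {0}, 0 };

static int X, Z;
static int Y[NTHREADS][NELEMS];

static void weak_lock(logged_lock_t *wl, int tid) {
    pthread_mutex_lock(&wl->mutex);
    /* Logged while holding the lock, so log order equals acquisition order. */
    if (wl->count < LOG_CAPACITY)
        wl->order[wl->count++] = tid;
}

static void weak_unlock(logged_lock_t *wl) {
    pthread_mutex_unlock(&wl->mutex);
}

void foo(int tid) {
    weak_lock(&lock_x, tid);
    X = 0;                      /* reported as racy with X = 1 in bar() */
    weak_unlock(&lock_x);
    for (int i = 0; i < NELEMS; i++) {
        weak_lock(&lock_y, tid);
        Y[tid][i] = 0;          /* reported as racy with bar()'s store */
        weak_unlock(&lock_y);
    }
}

void bar(int tid) {
    weak_lock(&lock_x, tid);
    X = 1;
    weak_unlock(&lock_x);
    for (int i = 0; i < NELEMS; i++) {
        weak_lock(&lock_y, tid);
        Y[tid][i] = 1;
        weak_unlock(&lock_y);
    }
    Z = 1;                      /* no race reported, so no instrumentation */
}
```

Calling `foo(0)` and then `bar(1)` leaves two entries in `lock_x.order` and one entry per loop iteration in `lock_y.order`; the per-iteration locking of `Y` is the cost that coarser-grained weak-locks aim to reduce.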

10 Sources of False Positives in RELAY
  Sound data-race detector reports too many false data races (53x overhead)
  Source 1: Non-mutex synchronizations are ignored
    Lockset-based analysis ignores fork-join, barrier, signal-wait, etc.
    May report a false data race between memory instructions that can never execute concurrently
  Source 2: Conservative pointer analysis
    Overestimates the variables accessed by a memory instruction
    May report a false data race between memory instructions that can never access the same location

11 Roadmap
  Motivation
  Chimera Analysis
    1) Static data race analysis
    2) Profiling non-concurrent code regions
    3) Symbolic bounds analysis
  Weak-lock Design
  Evaluation
  Conclusion

12 Profiling Non-concurrent Code Regions
  [Diagram: T1 runs foo() and T2 runs bar(), with a BARRIER in each thread]
  Problem: lockset-based analysis ignores non-mutex synchronization ops.
  Solution
    Profile non-concurrent code regions (e.g., functions)
    Increase the granularity of weak-locks to protect a larger code region instead of each potentially racy instruction
    Parallelism is preserved unless mis-profiled

13 Function-Level Weak-Locks
  void foo() {
    X = 0;
    for (i = ...) {
      Y[tid][i] = 0;
    }
  }
  void bar() {
    X = 1;
    for (i = ...) {
      Y[tid][i] = 1;
    }
    Z = 1;
  }
  Used if the profiler says foo() and bar() are not likely to run concurrently
  [Diagram: foo() and bar() separated by BARRIERs]

14 Roadmap
  Motivation
  Chimera Analysis
    1) Static data race analysis
    2) Profiling non-concurrent code regions
    3) Symbolic bounds analysis
  Weak-lock Design
  Evaluation
  Conclusion

15 Imprecision in Conservative Pointer Analysis
  [Diagram: T1 runs foo() and T2 runs bar() between BARRIERs; the two may run concurrently]

16 Imprecision in Conservative Pointer Analysis
  void foo() {
    for (i = 0 to N) {
      Y[tid][i] = 0;       /* potential racy pair with bar(): a false race */
    }
  }
  void bar() {
    for (i = 0 to N) {
      Y[tid][i] = 1;
    }
  }
  [Diagram: Thread 1 and Thread 2 access Y[][]]
  RELAY uses Steensgaard's and Andersen's pointer analyses
    Flow-insensitive and context-insensitive (FICI) analysis
    Naming of heap objects is conservative
    Overestimates the variables accessed by a memory instruction

17 Symbolic Bounds Analysis
  Our solution
    Derive the symbolic lower and upper bounds that a racy code region (e.g., a loop) may access [Rugina and Rinard, PLDI '00]
  void foo() {
    for (i = 0 to N) {
      Y[tid][i] = 0;
    }
  }
  Symbolic bounds analysis gives bounds: &Y[tid][0] to &Y[tid][N]
  Increase the granularity of weak-locks to protect a larger code region for a set of addresses specified by a symbolic expression
  Parallelism is preserved if the bounds are precise enough

18 Loop-level Weak-locks
  void foo() {
    X = 0;
    /* loop-level weak-lock on (&Y[tid][0], &Y[tid][N]) */
    for (i = 0 to N) {
      Y[tid][i] = 0;
    }
  }
  void bar() {
    X = 1;
    /* loop-level weak-lock on (&Y[tid][0], &Y[tid][N]) */
    for (i = 0 to N) {
      Y[tid][i] = 1;
    }
    Z = 1;
  }
  Symbolic bounds: &Y[tid][0] ~ &Y[tid][N]
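
One way to picture why range-based loop-level weak-locks keep parallelism: once the symbolic bounds (&Y[tid][0], &Y[tid][N]) are evaluated for concrete thread ids, two loops conflict only if their address ranges intersect. The sketch below is an assumption-laden illustration, not the paper's runtime; `addr_range_t`, `loop_bounds`, `ranges_overlap`, and the sizes `NTHREADS` and `N` are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

#define NTHREADS 2   /* hypothetical thread count */
#define N 8          /* hypothetical loop bound; the loop runs i = 0 to N */

static int Y[NTHREADS][N + 1];

/* Inclusive address range that a loop may touch. */
typedef struct {
    uintptr_t lo;
    uintptr_t hi;
} addr_range_t;

/* Evaluate the symbolic bounds (&Y[tid][0], &Y[tid][N]) for one thread. */
addr_range_t loop_bounds(int tid) {
    addr_range_t r = { (uintptr_t)&Y[tid][0], (uintptr_t)&Y[tid][N] };
    return r;
}

/* Two range locks conflict only if their inclusive ranges intersect. */
bool ranges_overlap(addr_range_t a, addr_range_t b) {
    return a.lo <= b.hi && b.lo <= a.hi;
}
```

With these definitions, `ranges_overlap(loop_bounds(0), loop_bounds(1))` is false because each thread touches only its own row of `Y`, so the two loops would not need to serialize; a range compared with itself does overlap.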

19 Imprecise Symbolic Bounds
  Sources
    Bounds depend on a value computed inside the code region
    Bounds depend on arithmetic operations not supported by the analysis (e.g., modulo operations, logical AND/OR)
  void qux() {
    for (i = 0 to N) {
      prev = Z[prev];
    }
  }
  Symbolic bounds analysis gives bounds: -INF to +INF
  Choosing the optimal granularity
    If bounds are too imprecise and the loop body is long enough, resort to instruction (basic-block) level weak-locks for parallelism
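
The granularity rule above can be read as a small decision function. The sketch below is only one possible reading: the slides give no concrete threshold, so `LONG_LOOP_BODY` is a made-up value, "too imprecise" is modeled simply as an unbounded side (-INF or +INF), and the names `symbolic_bounds_t` and `choose_granularity` are hypothetical.

```c
#include <stdbool.h>

#define LONG_LOOP_BODY 16   /* hypothetical threshold, not from the slides */

typedef enum { LOOP_LEVEL, INSTR_LEVEL } granularity_t;

typedef struct {
    bool lower_known;   /* false models a lower bound of -INF */
    bool upper_known;   /* false models an upper bound of +INF */
} symbolic_bounds_t;

/* Fall back to instruction (basic-block) level weak-locks only when the
 * bounds are imprecise AND the loop body is long; otherwise keep the
 * cheaper loop-level weak-lock. */
granularity_t choose_granularity(symbolic_bounds_t b, unsigned body_instrs) {
    bool imprecise = !b.lower_known || !b.upper_known;
    if (imprecise && body_instrs >= LONG_LOOP_BODY)
        return INSTR_LEVEL;
    return LOOP_LEVEL;
}
```

Under these assumptions, a loop like the one in qux() with bounds -INF to +INF would get instruction-level locks only if its body is long; a short body keeps a loop-level lock even though the lock then covers every address.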

20 Roadmap
  Motivation
  Chimera Analysis
  Weak-lock Design
  Evaluation
  Conclusion

21 Deadlock due to Weak-locks
  No deadlocks between weak-locks
    Ordering: function-level > loop-level > instruction-level
  Deadlock between weak-locks and original sync. ops. is possible
  [Diagram: T1 calls wait(cv) and T2 calls signal(cv); the weak-lock times out]

22 Weak-lock Time-out
  A weak-lock might time out
    Invoke a special system call to handle it
  [Diagram: T1 and T2, each marked as current owner at some point, with wait(cv), signal(cv), and a time-out; labeled "Logged order of weak-locks"]
  Weak-lock guarantee
    Only one thread holds a given weak-lock at any given time
    Mutual exclusion may be compromised, but this is sufficient for replay
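
To make the time-out behavior concrete, here is a minimal sketch of a lock whose acquire gives up after a deadline instead of blocking forever. It is not the Chimera runtime: the special system call that handles a time-out is not modeled (the caller just receives `WEAK_TIMED_OUT`), the type and function names are hypothetical, and the wall-clock `timespec_get` is used only for portability where a monotonic clock would be preferable.

```c
#include <stdatomic.h>
#include <time.h>

#define WEAK_LOCK_FREE (-1)

typedef struct {
    atomic_int owner;   /* thread id of the current owner, or WEAK_LOCK_FREE */
} timed_weak_lock_t;

typedef enum { WEAK_ACQUIRED, WEAK_TIMED_OUT } weak_result_t;

static long long now_ns(void) {
    struct timespec ts;
    timespec_get(&ts, TIME_UTC);   /* C11; wall-clock time */
    return (long long)ts.tv_sec * 1000000000LL + ts.tv_nsec;
}

void weak_lock_init(timed_weak_lock_t *wl) {
    atomic_init(&wl->owner, WEAK_LOCK_FREE);
}

/* Spin until the lock is free or the deadline passes. At most one thread
 * can win the compare-and-swap, so there is a single owner at any time. */
weak_result_t weak_lock_acquire(timed_weak_lock_t *wl, int tid,
                                long long timeout_ns) {
    long long deadline = now_ns() + timeout_ns;
    for (;;) {
        int expected = WEAK_LOCK_FREE;
        if (atomic_compare_exchange_strong(&wl->owner, &expected, tid))
            return WEAK_ACQUIRED;
        if (now_ns() >= deadline)
            return WEAK_TIMED_OUT;   /* caller decides how to handle it */
    }
}

/* Release only if the caller is the current owner. */
void weak_lock_release(timed_weak_lock_t *wl, int tid) {
    int expected = tid;
    atomic_compare_exchange_strong(&wl->owner, &expected, WEAK_LOCK_FREE);
}
```

A thread blocked in wait(cv) while holding such a lock no longer deadlocks the thread that must reach signal(cv): the second acquire returns `WEAK_TIMED_OUT`, and a real system would then invoke its time-out handler.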

23 Roadmap
  Motivation
  Chimera Analysis
  Weak-lock Design
  Evaluation
  Conclusion

24 Implementation
  Source-to-source instrumentation
    Implemented in OCaml using CIL as a front end
  Static analysis
    Data race detection: RELAY [Voung et al., FSE '07]
      Includes all library source code for soundness (uClibc's libc, libm, etc.)
    Symbolic bounds analysis: [Rugina and Rinard, PLDI '00]
      Intra-procedural analysis for racy loops only
  Runtime system
    Modified Linux kernel to record/replay program input
    Modified pthread library to record/replay the happens-before order of original synchronization operations and weak-locks

25 Evaluation Setup
  Test environment
    2.66 GHz 8-core Xeon processor with 4 GB of RAM
    Different sets of inputs for profiling and performance evaluation
    Average of five trials with 4 worker threads
    2, 4, and 8 threads for scalability results
  Benchmarks
    Desktop applications: aget, pfscan, and pbzip2
    Server programs: knot and apache
    SPLASH-2 suite: ocean, water-nsq, fft, and radix

26 Record and Replay Performance
  [Figure: normalized performance overhead of record and replay for aget, pfscan, pbzip2, knot, apache, ocean, water, fft, radix, and the average; chart callouts: 86% slowdown and 39% slowdown]
  Recording: 39% on average
  Replay: similar to recording; much lower for I/O-intensive programs

27-31 Effectiveness of Coarse-grained Weak-locks
  [Figure: normalized recording overhead for aget, pfscan, pbzip2, knot, apache, ocean, water, fft, radix, and the average, under five configurations: instr; instr + func; instr + loop; instr + loop + func; instr + bb + loop + func. Chart callouts: "> 53x" (slide 27) and "1.39x" (slide 31)]
  Coarse-grained weak-locks reduce the cost of instrumentation
  Exception: control-flow dependency (e.g., pfscan)

32-33 Breakdown of Recording Overhead
  [Figure: normalized recording overhead for aget, pfscan, pbzip2, knot, apache, ocean, water, fft, and radix, broken down into func wait, func log, loop wait, loop log, instr/bb wait, instr/bb log, and sync op & system log]
  Weak-lock overhead = contention (waiting) cost + logging cost
  Chart callouts: high loop-lock contention; high instr/bb-lock contention

34 Scalability
  [Figure: normalized recording overhead with 2, 4, and 8 threads (2p, 4p, 8p) for aget, pfscan, pbzip2, knot, apache, ocean, water, fft, radix, and the average]
  Scientific applications scale worse due to imprecise symbolic bounds analysis

35 Conclusion
  Goal: software-only deterministic multiprocessor replay systems
  Chimera Analysis
    Static data race analysis
      Find and protect potential data races with weak-locks
      Instruction/basic-block-level weak-locks
    Profiling non-concurrent code regions
      Address the inadequacy of the lockset-based algorithm
      Function-level weak-locks
    Symbolic bounds analysis
      Address the imprecision of conservative pointer analysis
      Loop-level weak-locks
  Low recording overhead
    39% recording overhead for 4 worker threads

36 Thank you
