Scalable Isolation of Failure- Inducing Changes via Version Comparison

Size: px

Start display at page:

Download "Scalable Isolation of Failure- Inducing Changes via Version Comparison"

August Hardy
5 years ago
Views:

1 Scalable Isolation of Failure- Inducing Changes via Version Comparison Mohammadreza Ghanavati, Artur Andrzejak, Zhen Dong Heidelberg University 1

2 Debugging in Large-Scale Software Root Cause Manual Tedious Time Consuming Expensive Bug Symptom Automated Debugging 2

Selected Works on Automated Debugging Automated Debugging Slicing-based Debugging Statistical Debugging Delta Debugging Spectrum-based Debugging Formula-based Debugging

3 Selected Works on Automated Debugging Automated Debugging Slicing-based Debugging Statistical Debugging Delta Debugging Spectrum-based Debugging Formula-based Debugging Static Slicing (81) Tarantula (01) DD (99,01) Renieris (03) Darwin (09) Dynamic slicing (88, 93) CBI (03) HDD (05) Wong (10) Error Invariants (11) Whyline (08) Holmes (08) 3

4 More Works on Automated Debugging Automated Debugging Slicing-based Debugging Statistical Debugging Delta Debugging Spectrum-based Debugging Machine Learningbased Methods Formula-based Debugging Data-Mining-based Methods Model-based Methods Static Slicing (81) Tarantula (01) DD (99,01) Renieris (03) Brun: for Latent Code Errors (04) Darwin (09) Cellier: FCA (08) DeMillo: modeling failure(97) Dynamic slicing (88, 93) CBI (03) HDD (05) Wong (10) Briand: Decision tree algorithm(07) BugAssist (11) Nessa: N-grams (09) Mateis: functional dependency model(2000) Whyline (08) Wong: crosstab analysis-based method (08) Wong: backpropagation neural network(09) Error Invariants (11) Watowa: dependency model(02) Holmes (08) Sources: 1. A. Orso: Automated Debugging: Are We There Yet?, Dagstuhl Seminar Feb W. E. Wong and V. Debroy A Survey of Software Fault Localization, Technical Report UTDCS-45-09, The University of Texas at Dallas, Nov

5 Challenges of Automated Debugging In most AD approaches, result is a (ranked) set of suspicious statements to be examined by the developer Usability problems reported in Parnin and Orso (ISSTA '11) 1. Specificity is reported as a relative fraction of total #LOC 2. No further help with bug understanding (no explanations) For (1): suspect set is typically ~ 0.1-1% of total #LOC OK for small projects: 1k LOC => 10 LOC Infeasible for large ones: 1 mio LOC => 10k LOC!!! Can we keep the set of suspects small and independent of project size? 5

$fraction of code changed between commits is roughly constant Assuming frequent & regular commits Ignore topic merges (i.e. a feature branch merged into main branch) Code version i Code changes Tests Code version i+1 Code changes Tests 6$

6 Exploiting SW Development Processes Development process in large SW projects: Incremental SW development with many intermediate versions (subjected to tests) Observation: even in large projects, fraction of code changed between commits is roughly constant Assuming frequent & regular commits Ignore topic merges (i.e. a feature branch merged into main branch) Code version i Code changes Tests Code version i+1 Code changes Tests 6

Exploiting SW Development Processes Idea: investigate only the

previous version provides prior knowledge source of additional

a practical technique for some bugs than an impractical one for all

7 Exploiting SW Development Processes Idea: investigate only the changes of the code Constant size of suspect set Behavior of previous version provides prior knowledge source of additional information Drawback: not all defects can be discovered But: better a practical technique for some bugs than an impractical one for all Code version i Code changes Tests Code version i+1 Code changes Tests Suspicious changes 7

Successful at: Memory Leak Isolation Identical (but arbitrary) workload (unit / integ. tests etc.) Software version i Software version i+1 new 1 allocated released new 1 allocated released Problem!

8 Successful at: Memory Leak Isolation Identical (but arbitrary) workload (unit / integ. tests etc.) Software version i Software version i+1 new 1 allocated released new 1 allocated released Problem! allocated allocated new k released new k released allocated allocated new n released new n released Felix Langner, Artur Andrzejak: Detection and Root Cause Analysis of Memory- Related Software Aging Defects by Automated Tests, MASCOTS

Intersect Compare coverage & filter suspects Artur Andrzejak Overview of the Approach Source Version i Source Version i+1 Difference set Instrument version i Execute & get code coverage Prg.

9 Intersect Compare coverage & filter suspects Artur Andrzejak Overview of the Approach Source Version i Source Version i+1 Difference set Instrument version i Execute & get code coverage Prg. source Failure stack trace Stack trace analysis Backward slices Instrument version i+1 1: Find the differences of versions 2: Retrieve failure site from the stack trace 3: Compute backward slices for each version 4: Compute the intersections of steps 2 & 3 5: Instrument the intersection 6: Execute failed test on each instrumented version obtained from step 5 to get code coverage profiles 7: Get list of suspects with filtering Execute & get code coverage Suspect Report 9

10 Assume 2 versions: Difference set Version i as passing version Version i+1 as failing version Using version history tools like SVN, GIT or diff command: Code version i Diff tool Difference Set Code version i+1 10

Stack Trace Analysis Stack trace: resulted from running test case on the failing version Failure site = topmost non-exceptionhandling entry of the stack We compute the backward slice for both

11 Stack Trace Analysis Stack trace: resulted from running test case on the failing version Failure site = topmost non-exceptionhandling entry of the stack We compute the backward slice for both failing and passing version (with failure site as seed statement) Backward Slicing S B (B) = {A A--> *B} A fragment of stack trace S B (B) = essentially, all statements which could have reached seed B 11

12 source code Artur Andrzejak Intersectinig & Instrumenting execution of test case T in version i deleted changes execution of test case T in version i+1 compute backward slice from failure site and intersect with changes backward slice added changes Instrument CHANGES failure site Intersected changes test result Instrument only: added intersected changes (+) in version i+1 deleted intersected changes (-) in i 12

Intersect Compare coverage & filter suspects Artur Andrzejak Getting Code Coverage Source Version i Source Version i+1 Difference set Instrument version i Execute & get code coverage Prg.

13 Intersect Compare coverage & filter suspects Artur Andrzejak Getting Code Coverage Source Version i Source Version i+1 Difference set Instrument version i Execute & get code coverage Prg. source Failure stack trace Stack trace analysis Backward slices Instrument version i+1 Execute & get code coverage Execute & get code coverage: re-execution of the test case T (failing for version i+1) on both of instrumented versions Get coverage profile for version i Get coverage profile for version i+1 Suspect Report 13

14 Comparing the Coverage Information The following statements are included in the set of suspects: All statements added to or changed in failing version i+1 which are in the failing coverage profile All statements deleted from passing version i which are in the passing coverage profile Final result: The resulting list of suspicious statements with the locations 14

15 Implementation & Evaluation Setup Implementation: Using WALA tool for static analysis Building call graph Computing backward slice using thin slicing* Using UNIX diff command for finding differences Using Shrike library for instrumenting (from WALA) Application: Apache Hadoop, release alpha Test cases: 4 real bugs from Hadoop bug repository *Manu Sridharan, Stephen J. Fink, Ratislav Bodik, Thin slicing, in PLDI

16 Real Test Cases From Apache Hadoop Bug issue Broken by Issue Failing Test HDFS-3856 HADOOP-8689 TestHDFSServerPorts HDFS-4887 HDFS-4840 TestNNThroughputBenchmark HDFS-4282 HADOOP-9103 TestEditLog.testFuzzSequences Yarn-960 Yarn-701 TestBinaryTokens, TestMRCredentials We use real bugs from Hadoop issue tracking system between 15 th August 2012 and 27 th July

17 Test Case Requirements No consideration of third-party libraries Well documented bugs for result validation Existence of a passing version for the failing test Note: test case Yarn-960 did not satisfy first requirement: the failure stack trace contains only third party libraries. We do not consider this test case (no fundamental limitation). 17

Complexity Evaluation: Code Size Bslice IS Cov Report (#LOC) 1000 900 800 700 600 500 400 300 200 100 0 Report=1 Report=0 9 2

1030 Dif (#LOC)= 1325 In 2 out of 3 cases no false positives (high specificity) All of the steps in our approach are needed to

18 Complexity Evaluation: Code Size Bslice IS Cov Report (#LOC) Report=1 Report= Passing Failing Passing Failing Passing Failing Report=1 HDFS-3856 HDFS-4887 HDFS-4282 Dif (#LOC)= Dif (#LOC)= 1030 Dif (#LOC)= 1325 In 2 out of 3 cases no false positives (high specificity) All of the steps in our approach are needed to reach high specificity Empty suspect list for one bug, but at least our approach can indicate the method related to the defect 18

19 Performance Evaluation (1): Overheads Comparison Issue HDFS-3856 HDFS-4887 HDFS-4282 Version Original version Run time (s) Size (MB) Full inst. Bslice inst. Our approach Passing / / / 1.00 Failing / / / 1.00 Passing / / / 1.00 Failing / / / 1.00 Passing / / / 1.02 Failing / / / 1.02 Our approach has negligible instrumentation overheads 19

Performance Evaluation (2): Comparison of Running Times Slicing Inst. Run Test time 80 70 60 50 40 30 20 10 0 Total / Test time = 3.

20 Performance Evaluation (2): Comparison of Running Times Slicing Inst. Run Test time Total / Test time = 3.3 Total / Test time = 2.4 Total / Test time = 2.4 HDFS-3856 HDFS-4887 HDFS-4282 The total time of our approach requires at most 3.3 times the duration of the failed test 20

21 Drawbacks of Our Approach Does not work if the actual bug has been introduced already in an older version (or before) but manifests only with recent changes (yielding version i+1) Evaluation on only four bugs: insufficient sample size to draw general conclusions 21

22 Future Work Extended evaluation. Evaluation of the approach on more bugs (also artificial defects) and other applications More runtime facts. Extension of dynamic analysis with other runtime information, like value of conditional expressions Adaptivity. Modification of the steps of our method depending on the sizes of intermediate results 22

23 Future Work Supporting failure understanding. Study of runtime data collection during the passing and failing to improve bug understanding Enhancing tools for continuous integration. Integration of our methods in wide-spread testing tools like Jenkins to support wide-spread adoption 23

Summary Objective: A scalable approach to isolating failure-inducing changes Idea: Exploiting version differences together with static and dynamic code analysis Strength of the approach: Applicable

24 Summary Objective: A scalable approach to isolating failure-inducing changes Idea: Exploiting version differences together with static and dynamic code analysis Strength of the approach: Applicable to very large projects Suspect set is proportional to the size of recent code changes Runtime overhead is in the order of executing a test triggering bug search Integration into existing testing processes feasible Evaluation: On a large-scale project (Apache Hadoop) and real bugs Promising result with high accuracy(near to zero false positives), low complexity, low performance degradation Limitations: Just crashing errors, no latent errors 24

25 References M. Renieris and S. P. Reiss, Fault Localization with Nearest Neighbor Queries, in ASE W. E. Wong, V. Debroy and B. Choi, A Family of Code Coverage-based Heuristics for Effective Fault Localization. Journal of Systems and Software, 83(2): , February M. Weiser. Program slicing. IEEE Trans. Softw. Eng., 10(4): , 1984 M. Sridharan, S. J. Fink, and R. Bodik. Thin slicing. In PLDI, D. Qi, A. Roychoudhury, Z. Liang, and K. Vaswani. Darwin: an approach for debugging evolving programs. In ESEC/FSE, G. Misherghi and Z. Su. HDD: hierarchical delta debugging. In ICSE,2006. B. Liblit, M. Naik, A. X. Zheng, A. Aiken, and M. I. Jordan. Scalable statistical bug isolation. In PLDI, 2005 B. Liblit. Cooperative Bug Isolation: Winning Thesis of the 2005 ACM Doctoral Dissertation Competition, volume 4440 of Lecture Notes in Computer Science. Springer, 2007 J. A. Jones and M. J. Harrold. Empirical evaluation of the tarantula automatic fault-localization technique. In ASE, T. M. Chilimbi, B. Liblit, K. Mehra, A. V. Nori, and K. Vaswani. Holmes: Effective statistical debugging via efficient path profiling. In ICSE, W. E. Wong, T. Wei, Y. Qi, and L. Zhao, A Crosstab-based Statistical Method for Effective Fault Localization, in Proceedings of the 1st International Conference on Software Testing, Verification and Validation, pp , Lillehammer, Norway, April 2008 F. Wotawa, On the relationship between model-based debugging and program slicing, Artificial Intelligence, 135(1-2): , February 2002 F. Wotawa, M. Stumptner, and W. Mayer, Model-based debugging or how to diagnose programs automatically, in Proceedings of the 15th International Conference on Industrial and Engineering Applications of Artificial Intelligence and Expert Systems: Developments in Applied Artificial Intelligence, pp , Cairns, Australia, June W. E. Wong and Y. Qi, BP Neural Network-based Effective Fault Localization, International Journal of Software Engineering and Knowledge Engineering 19(4): , June S. Nessa, M. Abedin, W. Eric Wong, L. Khan, and Y. Qi, Fault Localization Using N-gram Analysis, in Proceedings of the 3rd International Conference on Wireless Algorithms, Systems, and Applications, pp , Richardson, Texas, USA, April 2009 (Lecture Notes In Computer Science, Volume 5258) C. Mateis, M. Stumptner, and F. Wotawa, Modeling Java programs for diagnosis, in Proceedings of the 14th European Conference on Artificial Intelligence, pp , Berlin, Germany, August R. A. DeMillo, H. Pan, E. H. Spafford, Failure and fault analysis for software debugging ; in Proceedings of 21st International Computer Software and Applications Conference, pp , Washington DC, USA, August P. Cellier, S. Ducasse, S. Ferre, and O. Ridoux, Formal Concept Analysis Enhances Fault Localization in Software, in Proceedings of the 4th International Conference on Formal Concept Analysis, pp , Montréal, Canada, February

26 Thank you. QUESTIONS ARE WELCOME! 26

27 Additional Slides 27

28 Version Comparison: Main Idea The main idea is a differential approach: compare consecutive development versions Code version i Code version i+1 Find Differences List of suspicious changes 28

29 Detection of Failure-Inducing Changes Code version i Report Stack trace analysis Intersect Versions differences Instrument Execution Filtering Code version i+1 29

Issue HDFS-3856 HDFS-4887 HDFS-4282 Version Complexity Evaluation Dif (#LOC) Code Size Original version Bslice IS Cov Passing 922 544 189 109207 Failing 759 445 166 Passing 9 2 2 1030 Failing 9 2 2

30 Issue HDFS-3856 HDFS-4887 HDFS-4282 Version Complexity Evaluation Dif (#LOC) Code Size Original version Bslice IS Cov Passing Failing Passing Failing Passing Failing Report (#LOC) In 2 out of 3 cases no false positives (specificity). All of the steps in our approach is necessary to reach this level of specificity. Empty suspect list for one bug. But at least our approach can indicate failure-inducing change method. 30

Performance Evaluation (2) Comparison of Running Times Issue App. time for passing + failing version Slicing Inst. Run Total Test time Total / Test HDFS-3856 4 4 12 20 6 3.

31 Performance Evaluation (2) Comparison of Running Times Issue App. time for passing + failing version Slicing Inst. Run Total Test time Total / Test HDFS HDFS HDFS The total time of our approach requires at most 3.3 times the duration of the failed test, and the latter is just one of many tests executed within a test suite 31

co-collated with ISSRE 2017 October 23, 2017

Panel Presentation at the International Workshop on Program Debugging (IWPD 2017) co-collated with ISSRE 2017 October 23, 2017 Artur Andrzejak Heidelberg University http://pvs.ifi.uni-heidelberg.de/ Panel