Automatically Finding Patches Using Genetic Programming
  Westley Weimer, Claire Le Goues, ThanhVu Nguyen, Stephanie Forrest
Motivation
  Software quality remains a key problem
    Over one half of 1 percent of US GDP each year
    Programs ship with known bugs
    Vista shipped with thousands of them!
  Software repair via genetic programming
    Transform a program with a bug
    into a program without the bug
    by modifying relevant parts of the program
The Cunning Plan
  We can automatically and efficiently repair off-the-shelf, unannotated legacy programs.
  Basic idea: Randomly search through the space of all programs until you find a variant that repairs the problem.
  Key insights:
    Use existing regression tests to evaluate variants.
    Search by randomly perturbing parts of the program likely to contain the error.
  (SBST'09 Best Paper, ICSE'09 Best Paper, GECCO'09,...)
The Process
  Input:
    The program source code
    Some regression test cases passed by the program
    A test case failed by the program (= the bug)
  Work: (State Space Exploration)
    Create random variants of the program
    Run them on the test cases
    Repeat if necessary
  Output:
    New program source code that passes all tests
    or no solution found in time
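The input/work/output loop above can be sketched as a plain random search. This is only an illustration under stated assumptions: the names `repair`, `mutate`, `tests`, and `max_tries` are hypothetical, a "program" is any value the caller's `mutate` function knows how to perturb, and the population, crossover, and fitness ranking described later in the talk are deliberately left out.

```python
import random
from typing import Callable, Optional, Sequence, TypeVar

P = TypeVar("P")  # whatever representation the caller uses for a program


def repair(
    original: P,
    mutate: Callable[[P, random.Random], P],   # hypothetical: returns a random variant
    tests: Sequence[Callable[[P], bool]],      # regression tests plus the failing test
    max_tries: int = 10_000,                   # stand-in for "no solution found in time"
    rng: Optional[random.Random] = None,
) -> Optional[P]:
    """Return the first variant that passes every test, or None if the budget runs out."""
    if rng is None:
        rng = random.Random(0)
    for _ in range(max_tries):
        variant = mutate(original, rng)          # create a random variant
        if all(test(variant) for test in tests): # run it on the test cases
            return variant                       # output: passes all tests
    return None                                  # output: no solution found
```

Usage note: the caller supplies the previously passing tests and the previously failing test together in `tests`; a variant counts as a repair only when all of them pass.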
This Talk
  Genetic Programming
  Weighted Paths
  Example
  Repair Experiments
  Repair Quality Experiments
  Big Finish
Genetic Programming
  Genetic programming is the application of evolutionary or genetic algorithms to program source code.
  Genetic Algorithms:
    Population of variants
    Crossover and mutation
    Fitness function
What's In A Name?
  If you're wary of genetic programming, you can view this as search-based software engineering.
  We use the regression tests to guide the search.
The Secret Sauce
  In a large program, not every line is equally likely to contribute to the bug.
  Insight: since we have the test cases, run them and collect coverage information.
    The bug is more likely to be found on lines visited when running the failed test case.
    The bug is less likely to be found on lines visited when running the passed test cases.
  Also: Do not try to invent new code!
The Weighted Path
  We define a weighted path to be a list of <statement, weight> pairs.
  We use this weighted path:
    The statements are those visited during the failed test case.
    The weight for a statement S is
      1.0 if S is not visited on a passed test case
      0.1 if S is also visited on a passed test case
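A minimal sketch of the weighting rule above, assuming coverage has already been collected as sequences of statement identifiers. The function name `weighted_path` and the trace format are hypothetical; dropping repeated visits to the same statement is also an assumption, since the slide does not say how duplicates are handled.

```python
from typing import Hashable, Iterable, List, Sequence, Tuple


def weighted_path(
    negative_trace: Sequence[Hashable],            # statements visited on the failed test
    positive_traces: Iterable[Sequence[Hashable]], # statements visited on each passed test
) -> List[Tuple[Hashable, float]]:
    """Build a list of <statement, weight> pairs from coverage traces."""
    visited_on_pass = set()
    for trace in positive_traces:
        visited_on_pass.update(trace)

    path = []
    seen = set()
    for stmt in negative_trace:
        if stmt in seen:        # assumption: keep only the first visit
            continue
        seen.add(stmt)
        # 0.1 if a passed test also visits the statement, otherwise 1.0
        weight = 0.1 if stmt in visited_on_pass else 1.0
        path.append((stmt, weight))
    return path
```

Statements that only passed tests visit never enter the path at all, so under this scheme they are never chosen as mutation targets.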
Genetic Programming for Program Repair: Mutation
  Population of Variants:
    Each variant is an <AST, weighted path> pair
  Mutation:
    To mutate a variant V = <AST_V, wp_V>, randomly choose a statement S from wp_V biased by the weights
    Delete S, replace S with S1, or insert S2 after S
    Choose S1 and S2 from the entire AST
    Assumes program contains the seeds of its own repair (e.g., has another null check elsewhere).
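The mutation step can be illustrated on a simplified representation. This sketch assumes the program is a flat list of statements rather than an AST, that the weighted path holds (index, weight) pairs into that list, and that the three operations are equally likely; none of these choices come from the slides, and the name `mutate` is hypothetical.

```python
import random
from typing import List, Sequence, Tuple


def mutate(
    statements: Sequence[str],
    weighted_path: Sequence[Tuple[int, float]],  # (statement index, weight) pairs
    rng: random.Random,
) -> List[str]:
    """Return a new variant with one weighted-path statement deleted, replaced, or followed by an insert."""
    indices = [i for i, _ in weighted_path]
    weights = [w for _, w in weighted_path]
    # Pick the target statement S, biased by the path weights.
    target = rng.choices(indices, weights=weights, k=1)[0]

    variant = list(statements)               # never modify the parent in place
    op = rng.choice(["delete", "replace", "insert"])
    if op == "delete":
        del variant[target]
    elif op == "replace":
        # S1 is drawn from the whole program: no new code is invented.
        variant[target] = rng.choice(list(statements))
    else:
        # S2 is likewise drawn from the whole program and placed after S.
        variant.insert(target + 1, rng.choice(list(statements)))
    return variant
```

Because every inserted or replacing statement is copied from the program itself, the search space stays limited to rearrangements of existing code, which is the "seeds of its own repair" assumption.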
Genetic Programming for Program Repair: Fitness
  Compile a variant
    If it fails to compile, Fitness = 0
    Otherwise, run it on the test cases
    Fitness = number of test cases passed
    Weighted: passing the bug test case is worth more
  Selection
    Higher fitness variants are retained into the next generation
  Repeat until a solution is found
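A sketch of the fitness rule and a simple selection step. The default weights of 1 per normal test and 10 per bug test are the ones quoted on the Zune evolution slide; whether they apply to other programs is not stated. The function names are hypothetical, and keeping the top-scoring variants outright (truncation selection) is an assumption, since the slide only says that higher-fitness variants are retained.

```python
from typing import List, Sequence, TypeVar

V = TypeVar("V")


def fitness(
    compiles: bool,
    normal_results: Sequence[bool],  # one pass/fail flag per regression test
    bug_results: Sequence[bool],     # one pass/fail flag per bug-inducing test
    normal_weight: float = 1.0,
    bug_weight: float = 10.0,
) -> float:
    """Weighted count of passed tests; a variant that does not compile scores 0."""
    if not compiles:
        return 0.0
    return normal_weight * sum(normal_results) + bug_weight * sum(bug_results)


def select(population: Sequence[V], scores: Sequence[float], keep: int) -> List[V]:
    """Retain the `keep` highest-scoring variants (assumed truncation selection)."""
    order = sorted(range(len(population)), key=lambda i: scores[i], reverse=True)
    return [population[i] for i in order[:keep]]
```

With 5 normal tests and 2 bug tests, a variant passing everything scores 5 * 1 + 2 * 10 = 25, while the original program, which passes only the normal tests, scores 5.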
Example Source Code For Zune Bug Repair
  Millions of Microsoft Zune media players froze up on December 31st, 2008...
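The slide's source listing did not survive extraction. As a stand-in, here is a Python transliteration of the loop, reconstructed from the AST node labels in this deck and the widely circulated description of the Zune leap-year routine. The helper `is_leap_year`, the function name `zune_year`, and the `max_iterations` guard (which turns the freeze into a `None` return) are additions for illustration and are not part of the original code.

```python
from typing import Optional


def is_leap_year(year: int) -> bool:
    """Gregorian leap-year rule (helper added for this sketch)."""
    return year % 4 == 0 and (year % 100 != 0 or year % 400 == 0)


def zune_year(days: int, max_iterations: int = 1000) -> Optional[int]:
    """Convert a day count starting in 1980 to a year; None means the loop never finished."""
    year = 1980
    iterations = 0
    while days > 365:
        iterations += 1
        if iterations > max_iterations:
            return None          # stand-in for the device freezing
        if is_leap_year(year):
            if days > 366:
                days -= 366
                year += 1
            # else: nothing changes, so days == 366 in a leap year loops forever
        else:
            days -= 365
            year += 1
    return year
```

With the two test inputs named on the weighted path slides, `zune_year(1000)` returns 1982, while `zune_year(10593)` reaches year 2008 with exactly 366 days left and returns `None` once the guard trips.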
Abstract Syntax Tree For Zune Bug Repair
  [Figure: AST with nodes year=1980; while (days>365); printf(... year); if (isleapyear); if (days>366); days -= 365; year += 1; days -= 366; year += 1; (no children)]
Weighted Path For Zune Bug Repair (1/3)
  [Figure: same AST; legend: Visited on Negative Test (days=10593)]
Weighted Path For Zune Bug Repair (2/3)
  [Figure: same AST; legend: Also Visited on Positive Test (days=1000); Visited on Negative Test but not Positive Test]
Weighted Path For Zune Bug Repair (3/3)
  [Figure: same AST; legend: Weighted Path = Visited on Negative Test but not Positive Test]
Mutation For Zune Bug Repair (1/2)
  [Figure: same AST]
Mutation For Zune Bug Repair (2/2)
  [Figure: same AST with an added node: days -= 366]
Final Repair For Zune Bug
  [Figure: same AST with the added node: days -= 366]
Evolution of Zune Repair
  (5 normal test cases weighing 1 each, 2 buggy test cases weighing 10 each)
Minimize The Repair
  Repair patch is a diff between orig and variant
  Random mutations may add unneeded stmts (e.g., dead code, redundant computation)
  In essence: try removing each line in the diff and check if the result still passes all tests
  Delta Debugging finds a 1-minimal subset of the diff in O(n^2) time
    Removing any single line causes a test to fail
  We use a tree-structured diff algorithm (diffx)
    Avoids problems with balanced curly braces, etc.
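The "try removing each line" idea can be sketched as a simple one-at-a-time reduction. This is not the full delta debugging algorithm and it works on a flat list of changes rather than a tree-structured diff; it only illustrates how a 1-minimal patch can be reached with at most O(n^2) test runs. The names `minimize` and `passes_all_tests` are hypothetical, and the starting patch is assumed to pass all tests.

```python
from typing import Callable, List, Sequence, TypeVar

C = TypeVar("C")  # one change (for example, one line of the diff)


def minimize(
    changes: Sequence[C],
    passes_all_tests: Callable[[List[C]], bool],  # applies the changes and runs the tests
) -> List[C]:
    """Drop changes one at a time until removing any single remaining change breaks a test."""
    current = list(changes)
    progress = True
    while progress:
        progress = False
        for i in range(len(current)):
            candidate = current[:i] + current[i + 1:]
            if passes_all_tests(candidate):
                current = candidate   # this change was not needed
                progress = True
                break                 # rescan the smaller patch from the start
    return current
```

Each successful removal shrinks the patch by one change and each scan runs at most n test evaluations, which gives the quadratic bound. The result is 1-minimal but not necessarily the smallest possible patch.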
Experimental Results
  Program              LOC    Path  # Fitness  Bug
  wu-ftpd              35109   149         55  Format string vulnerability
  php string.c         26044    52          2  Integer overflow
  atris                21553    34         16  Local stack buffer overflow
  flex                 18775  3836        668  Segfault
  lighttpd fastcgi.c   13984   136         52  Remote heap buffer overflow
  indent                9906  1435       1367  Infinite loop
  openldap io.c         6519    25          8  Non-overflow denial of service
  nullhttpd             5575   768        219  Remote heap buffer overflow
  deroff                2236   251         22  Segfault
  Average repair time: 313 seconds. Average minimization time: 12 seconds.
  Total: 15 repaired programs, over 140,000 lines of code.
Scalability
Repair Quality
  Repairs are typically not what a human would have done
    Example: our technique adds bounds checks to one particular network read, rather than refactoring to use a safe abstract string class in multiple places
  Recall: any proposed repair must pass all regression test cases
    When POST test is omitted from nullhttpd, the generated repair eliminates POST functionality
    Tests ensure we do not sacrifice functionality
    Minimization prevents gratuitous deletions
    Adding more tests helps rather than hurting
Repair Quality, Self-Healing
  In an ecommerce/security setting, a high quality repair is one that blocks a security vulnerability without reducing transactional throughput
  Integrate with an anomaly detection system (ADS)
    When ADS flags a request, treat it as the buggy test case and initiate repair
  Danger Will Robinson: this can be done without humans in the loop!
Experimental Setup
  Obtain indicative workloads
  Apply the workload to a vanilla server
    Speed up workload until server drops additional requests
  Send known attack packet: ADS flags it
    Take server down during repair, apply repair, restart server
  Measure throughput after applying repair
Experimental Results
  HTTP workload: 138k requests from 12k IPs over 14 hours; PHP workload similar
  Success = correct output delivered to client before client starts next request in workload
Technique Limitations
  Can only handle deterministic faults
    No multithreaded code or race conditions, etc.
    Long term: put scheduler constraints into the variant representation.
  Assumes bug test case visits different lines than normal test cases
  Assumes existing statements can form repair
    Current work: repair templates
    Hand-crafted and mined from CVS repositories
Conclusions
  We can automatically and efficiently repair off-the-shelf legacy programs.
    Around 15 programs totaling 140 kLOC in about 6 minutes each, on average
  We use regression tests to encode desired behavior.
    Normal tests encode required behavior
  The genetic programming search focuses attention on parts of the program visited during the bug but not visited during passed test cases.
Questions
  I encourage difficult questions.
Bonus Slide: Test Cases