Automatically Finding Patches Using Genetic Programming. Westley Weimer, Claire Le Goues, ThanVu Nguyen, Stephanie Forrest

Similar documents
Automatically Finding Patches Using Genetic Programming

Using Execution Paths to Evolve Software Patches

Fixing software bugs in 10 minutes or less using evolutionary computation

AUTOMATIC PROGRAM REPAIR USING GENETIC PROGRAMMING

Automated Program Repair

A Systematic Study of Automated Program Repair: Fixing 55 out of 105 Bugs for $8 Each

Automatically Finding Patches Using Genetic Programming

CAREER: Scalable and Trustworthy Automatic Program Repair A fundamental challenge for computer scientists over the next decade is to produce and

Automatically Finding Patches Using Genetic Programming

Automa'c, Efficient, and General Repair of So8ware Defects using Lightweight Program Analyses

Introduction to Scientific Modeling CS 365, Fall Semester, 2011 Genetic Algorithms

Trusted Software Repair for System Resiliency. Westley Weimer, Stephanie Forrest, Miryung Kim, Claire Le Goues, Patrick Hurley

SemFix: Program Repair via Semantic Analysis. Ye Wang, PhD student Department of Computer Science Virginia Tech

Representations and Operators for Improving Evolutionary Software Repair

A Systematic Study of Automated Program Repair: Fixing 55 out of 105 bugs for $8 Each

LEVERAGING LIGHTWEIGHT ANALYSES TO AID SOFTWARE MAINTENANCE ZACHARY P. FRY PHD PROPOSAL

Repair & Refactoring

Automatically Repairing Concurrency Bugs with ARC MUSEPAT 2013 Saint Petersburg, Russia

Automated Program Repair through the Evolution of Assembly Code

DynaMoth: Dynamic Code Synthesis for Automatic Program Repair

Program Synthesis. SWE 795, Spring 2017 Software Engineering Environments

Verification & Validation of Open Source

Software Security IV: Fuzzing

Genetic Programming for Julia: fast performance and parallel island model implementation

arxiv: v1 [cs.se] 25 Mar 2014

Leveraging Program Equivalence for Adaptive Program Repair: Models and First Results. Westley Weimer, UVA Zachary P. Fry, UVA Stephanie Forrest, UNM

CSCE150A. Introduction. While Loop. Compound Assignment. For Loop. Loop Design. Nested Loops. Do-While Loop. Programming Tips CSCE150A.

Combining Bug Detection and Test Case Generation

Computer Science & Engineering 150A Problem Solving Using Computers. Chapter 5. Repetition in Programs. Notes. Notes. Notes. Lecture 05 - Loops

Differential program verification

WHY TEST SOFTWARE?...

The Evolution of System-call Monitoring

An Unsystematic Review of Genetic Improvement. David R. White University of Glasgow UCL Crest Open Workshop, Jan 2016

CSE 565 Computer Security Fall 2018

Fuzzing. compass-security.com 1

Genetic-Algorithm-Based Construction of Load-Balanced CDSs in Wireless Sensor Networks

Motivation. Overview. Scalable Dynamic Analysis for Automated Fault Location and Avoidance. Rajiv Gupta. Program Execution

Root Cause Analysis for HTML Presentation Failures using Search-Based Techniques

Software Vulnerability

REPAIRING PROGRAMS WITH SEMANTIC CODE SEARCH. Yalin Ke Kathryn T. Stolee Claire Le Goues Yuriy Brun

Survey of Cyber Moving Targets. Presented By Sharani Sankaran

Overview AEG Conclusion CS 6V Automatic Exploit Generation (AEG) Matthew Stephen. Department of Computer Science University of Texas at Dallas

Fault Isolation for Device Drivers

CSC 405 Introduction to Computer Security Fuzzing

ASTOR: A Program Repair Library for Java

CS2141 Software Development using C/C++ Debugging

Cyber Moving Targets. Yashar Dehkan Asl

Overview. Concepts this lecture String constants Null-terminated array representation String library <strlib.h> String initializers Arrays of strings

Secure Software Development: Theory and Practice

"Secure" Coding Practices Nicholas Weaver

EasyChair Preprint. A Study on the Use of IDE Features for Debugging

Hugbúnaðarverkefni 2 - Static Analysis

Static Analysis of C++ Projects with CodeSonar

CSE 565 Computer Security Fall 2018

SoK: Eternal War in Memory

Genetic Programming Prof. Thomas Bäck Nat Evur ol al ut ic o om nar put y Aling go rg it roup hms Genetic Programming 1

ADVANCED DIGITAL IC DESIGN. Digital Verification Basic Concepts

Outline. Classic races: files in /tmp. Race conditions. TOCTTOU example. TOCTTOU gaps. Vulnerabilities in OS interaction

Efficient Search for Inputs Causing High Floating-point Errors

Neutral Networks of Real-World Programs and their Application to Automated Software Evolution

Automatically Repairing Broken Workflows for Evolving GUI Applications

My other computer is YOURS!

Automatic Repair of Real Bugs in Java: A Large-Scale Experiment on the Defects4J Dataset

Automating Test Driven Development with Grammatical Evolution

THE ROAD NOT TAKEN. Estimating Path Execution Frequency Statically. ICSE 2009 Vancouver, BC. Ray Buse Wes Weimer

IntFlow: Integer Error Handling With Information Flow Tracking

University of Oxford / Automatic Heap Layout Manipulation - Sean Heelan 1

Static Analysis in Practice

arxiv: v1 [cs.se] 22 Feb 2018

Mutations for Permutations

Betriebssysteme und Sicherheit Sicherheit. Buffer Overflows

Other array problems. Integer overflow. Outline. Integer overflow example. Signed and unsigned

Genetic Improvement Programming

Undefined Behaviour in C

Cooperative Bug Isolation

Software Security II: Memory Errors - Attacks & Defenses

Constructing an Optimisation Phase Using Grammatical Evolution. Brad Alexander and Michael Gratton

Memory Safety (cont d) Software Security

Collaborative Intrusion Detection System : A Framework for Accurate and Efficient IDS. Outline

Lecture 4 September Required reading materials for this class

N-Variant SystemsA Secretless Framework for Security through. Diversity Cox et al.

EECS 481 Software Engineering Exam #1. You have 1 hour and 20 minutes to work on the exam.

CSE 565 Computer Security Fall 2018

KLEE Workshop Feeding the Fuzzers. with KLEE. Marek Zmysłowski MOBILE SECURITY TEAM R&D INSTITUTE POLAND

Bi-Objective Optimization for Scheduling in Heterogeneous Computing Systems

Lecture 1: Buffer Overflows

CSE 374 Programming Concepts & Tools. Hal Perkins Fall 2015 Lecture 15 Testing

Homework # 7 Distributed Computing due Saturday, December 13th, 2:00 PM

An Analysis of Patch Plausibility and Correctness for Generate-And-Validate Patch Generation Systems

Scaling up: How we made millions of domains happier. Tom Arnfeld, DNS Engineer Pavel Odintsov, DNS Engineer

SourcererCC -- Scaling Code Clone Detection to Big-Code

A program execution is memory safe so long as memory access errors never occur:

Evaluating Bug Finders

On-Demand Proactive Defense against Memory Vulnerabilities

Buffer overflow prevention, and other attacks

Statically Detecting Likely Buffer Overflow Vulnerabilities

Simple Overflow. #include <stdio.h> int main(void){ unsigned int num = 0xffffffff;

We will focus on Buffer overflow attacks SQL injections. See book for other examples

Outline. Computer programming. Debugging. What is it. Debugging. Hints. Debugging

Last week. Data on the stack is allocated automatically when we do a function call, and removed when we return

Transcription:

Automatically Finding Patches Using Genetic Programming Westley Weimer, Claire Le Goues, ThanVu Nguyen, Stephanie Forrest

Motivation Software Quality remains a key problem Over one half of 1 percent of US GDP each year Programs ship with known bugs Vista shipped with thousands of them! Software Repair via Genetic Programming Transform a program with a bug Into a program without the bug By modifying relevant parts of the program 2

The Cunning Plan We can automatically and efficiently repair off-the-shelf, unannotated legacy programs. Basic idea: Randomly search through the space of all programs until you find a variant that repairs the problem. Key insights: Use existing regression tests to evaluate variants. Search by randomly perturbing parts of the program likely to contain the error. (SBST'09 Best Paper, ICSE'09 Best Paper, GECCO'09,...) 3

Input: The Process The program source code Some regression test cases passed by the program A test case failed by the program (= the bug) Work: (State Space Exploration) Create random variants of the program Run them on the test cases Repeat if necessary Output: New program source code that passes all tests or no solution found in time 4

This Talk Genetic Programming Weighted Paths Example Repair Experiments Repair Quality Experiments Big Finish 5

Genetic Programming Genetic programming is the application of evolutionary or genetic algorithms to program source code. Genetic Algorithms: Population of variants Crossover and mutation Fitness Function 6

What's In A Name? If you're wary of genetic programming, you can view this as search-based software engineering. We use the regression tests to guide the search. 7

The Secret Sauce In a large program, not every line is equally likely to contribute to the bug. Insight: since we have the test cases, run them and collect coverage information. The bug is more likely to be found on lines visited when running the failed test case. The bug is less likely to be found on lines visited when running the passed test cases. Also: Do not try to invent new code! 8

The Weighted Path We define a weighted path to be a list of <statement, weight> pairs. We use this weighted path: The statements are those visited during the failed test case. The weight for a statement S is 1.0 if S is not visited on a passed test case 0.1 if S is also visited on a passed test case 9

Genetic Programming for Program Repair: Mutation Population of Variants: Each variant is an <AST, weighted path> pair Mutation: To mutate a variant V = <AST V, wp V >, randomly choose a statement S from wp V biased by the weights Delete S, replace S with S1, or insert S2 after S Choose S1 and S2 from the entire AST Assumes program contains the seeds of its own repair (e.g., has another null check elsewhere). 10

Genetic Programming for Program Repair: Fitness Compile a variant If it fails to compile, Fitness = 0 Otherwise, run it on the test cases Fitness = number of test cases passed Weighted: passing the bug test case is worth more Selection Higher fitness variants are retained into the next generation Repeat until a solution is found 11

Example Source Code For Zune Bug Repair Millions of Microsoft Zune media players froze up on December 31st, 2008... 12

year=1980 while (days>365) printf(... year) if (isleapyear) Abstract Syntax Tree For Zune Bug Repair if (days>366) days -= 365 year += 1 days -= 366 year += 1 (no children) 13

year=1980 while (days>365) printf(... year) if (isleapyear) Weighted Path For Zune Bug Repair (1/3) if (days>366) days -= 365 year += 1 days -= 366 year += 1 Visited on Negative Test (days=10593) year += 1 14

year=1980 while (days>365) printf(... year) if (isleapyear) Weighted Path For Zune Bug Repair (2/3) if (days>366) days -= 365 year += 1 days -= 366 year += 1 Also Visited on Positive Test (days=1000) year += 1 Visited on Negative Test but not Positive Test year += 1 15

year=1980 while (days>365) printf(... year) if (isleapyear) Weighted Path For Zune Bug Repair (3/3) if (days>366) days -= 365 year += 1 days -= 366 year += 1 Weighted Path = Visited on Negative Test but not Positive Test year += 1 16

year=1980 while (days>365) printf(... year) if (isleapyear) Mutation For Zune Bug Repair (1/2) if (days>366) days -= 365 year += 1 days -= 366 year += 1 17

year=1980 while (days>365) printf(... year) if (isleapyear) Mutation For Zune Bug Repair (2/2) if (days>366) days -= 365 year += 1 days -= 366 year += 1 days -= 366 18

year=1980 while (days>365) printf(... year) if (isleapyear) Final Repair For Zune Bug if (days>366) days -= 365 year += 1 days -= 366 year += 1 days -= 366 19

Evolution of Zune Repair (5 normal test cases weighing 1 each, 2 buggy test cases weighing 10 each) 20

Minimize The Repair Repair Patch is a diff between orig and variant Random mutations may add unneeded stmts (e.g., dead code, redundant computation) In essence: try removing each line in the diff and check if the result still passes all tests Delta Debugging finds a 1-minimal subset of the diff in O(n 2 ) time Removing any single line causes a test to fail We use a tree-structured diff algorithm (diffx) Avoids problems with balanced curly braces, etc. 21

Experimental Results Program LOC Path # Fitness Bug wu-ftpd 35109 149 55 Format string vulnerability php string.c 26044 52 2 Integer overflow atris 21553 34 16 local stack buffer overflow flex 18775 3836 668 Segfault lighttpd fastcgi.c 13984 136 52 Remote heap buffer overflow indent 9906 1435 1367 Infinite loop openldap io.c 6519 25 8 Non-overflow denial of service nullhttpd 5575 768 219 Remote heap buffer overflow deroff 2236 251 22 Segfault Average repair time: 313 seconds. Average minimization time: 12 seconds. Total: 15 repaired programs, over 140,000 lines of code. 22

Scalability 23

Repair Quality Repairs are typically not what a human would have done Example: our technique adds bounds checks to one particular network read, rather than refactoring to use a safe abstract string class in multiple places Recall: any proposed repair must pass all regression test cases When POST test is omitted from nullhttpd, the generated repair eliminates POST functionality Tests ensure we do not sacrifice functionality Minimization prevents gratuitous deletions Adding more tests helps rather than hurting 24

Repair Quality, Self-Healing In an ecommerce/security setting, a high quality repair is one that blocks a security vulnerability without reducing transactional throughput Integrate with an anomaly detection system When ADS flags a request, treat it as the buggy test case and initiate repair Danger Will Robinson: this can be done without humans in the loop! 25

Experimental Setup Obtain indicative workloads Apply the workload to a vanilla server Speed up workload until server drops additional requests Send known attack packet: ADS flags it Take server down during repair, apply repair, restart server Measure throughput after applying repair 26

Experimental Results HTTP workload: 138k requests from 12k IPs over 14 hours; PHP workload similar Success = correct output delivered to client before client starts next request in workload 27

Experimental Results HTTP workload: 138k requests from 12k IPs over 14 hours; PHP workload similar Success = correct output delivered to client before client starts next request in workload 28

Technique Limitations Can only handle deterministic faults No multithreaded code or race conditions, etc. Long term: put scheduler constraints into the variant representation. Assumes bug test case visits different lines than normal test cases Assumes existing statements can form repair Current work: repair templates Hand-crafted and mined from CVS repositories 29

Conclusions We can automatically and efficiently repair off-the-shelf legacy programs. Around 15 programs totaling 140kloc in about 6 minutes each, on average We use regression tests to encode desired behavior. Normal tests encode required behavior The genetic programming search focuses attention on parts of the program visited during the bug but not visited during passed test cases. 30

Questions I encourage difficult questions. 31

Bonus Slide: Test Cases 32