TypeChef: Towards Correct Variability Analysis of Unpreprocessed C Code for Software Product Lines

Similar documents
Variability-Aware Parsing in the Presence of Lexical Macros and Conditional Compilation

Enabling Variability-Aware Software Tools. Paul Gazzillo New York University

University of Passau Department of Informatics and Mathematics Chair for Programming. Bachelor Thesis. Andreas Janker

Conditional Compilation

Partial Preprocessing C Code for Variability Analysis

Small Formulas for Large Programs: On-line Constraint Simplification In Scalable Static Analysis

CS 326 Operating Systems C Programming. Greg Benson Department of Computer Science University of San Francisco

Bachelor Thesis. Configuration Lifting for C Software Product Lines

Boolean Representations and Combinatorial Equivalence

SOFTWARE PRODUCT LINES: CONCEPTS AND IMPLEMENTATION QUALITY ASSURANCE: SAMPLING

University of Marburg Department of Mathematics & Computer Science. Bachelor Thesis. Variability-Aware Interpretation. Author: Jonas Pusch

SAT-CNF Is N P-complete

A Taxonomy of Software Product Line Reengineering

Propositional Calculus: Boolean Algebra and Simplification. CS 270: Mathematical Foundations of Computer Science Jeremy Johnson

PAPER 1 ABSTRACT. Categories and Subject Descriptors. General Terms. Keywords

The Compilation Process

CS415 Compilers. Intermediate Represeation & Code Generation

Parallel and Distributed Computing

Mining Kbuild to Detect Variability Anomalies in Linux

Configuration Coverage in the Analysis of Large-Scale System Software

Variability-Aware Static Analysis at Scale: An Empirical Study

SAT Solver. CS 680 Formal Methods Jeremy Johnson

Parsing All of C by Taming the Preprocessor. Robert Grimm, New York University Joint Work with Paul Gazzillo

CSE341: Programming Languages Lecture 17 Implementing Languages Including Closures. Dan Grossman Autumn 2018

Implementation of Fork-Merge Parsing in OpenRefactory/C. Kavyashree Krishnappa

Where do Configuration Constraints Stem From? An Extraction Approach and an Empirical Study

PROPOSITIONAL LOGIC (2)

Efficient Circuit to CNF Conversion

A Tree Kernel Based Approach for Clone Detection

8.1 Polynomial-Time Reductions

COMP 202 Recursion. CONTENTS: Recursion. COMP Recursion 1

Normal Forms for Boolean Expressions

Motivation. CS389L: Automated Logical Reasoning. Lecture 5: Binary Decision Diagrams. Historical Context. Binary Decision Trees

Integrating a SAT Solver with Isabelle/HOL

CS 267: Automated Verification. Lecture 13: Bounded Model Checking. Instructor: Tevfik Bultan

P.G.TRB - COMPUTER SCIENCE. c) data processing language d) none of the above

Decision Procedures. An Algorithmic Point of View. Decision Procedures for Propositional Logic. D. Kroening O. Strichman.

Comparing Multiple Source Code Trees, version 3.1

CSE 413 Languages & Implementation. Hal Perkins Winter 2019 Structs, Implementing Languages (credits: Dan Grossman, CSE 341)

The C Pre Processor ECE2893. Lecture 18. ECE2893 The C Pre Processor Spring / 10

Formal Verification. Lecture 7: Introduction to Binary Decision Diagrams (BDDs)

DRAFT. Department Informatik. An Algorithm for Quantifying the Program Variability Induced by Conditional Compilation. Technical Report CS

Boolean Functions (Formulas) and Propositional Logic

ABC basics (compilation from different articles)

Boolean Satisfiability: The Central Problem of Computation

PYTHON FOR MEDICAL PHYSICISTS. Radiation Oncology Medical Physics Cancer Care Services, Royal Brisbane & Women s Hospital

Static Program Checking

Where Can We Draw The Line?

Deductive Methods, Bounded Model Checking

COMP 410 Lecture 1. Kyle Dewey

Presence-Condition Simplification in Highly Configurable Systems

Polynomial SAT-Solver Algorithm Explanation

Computability Theory

Program Translation. text. text. binary. binary. C program (p1.c) Compiler (gcc -S) Asm code (p1.s) Assembler (gcc or as) Object code (p1.

Intermediate Representation (IR)

1.1 The hand written header file

Program verification. Generalities about software Verification Model Checking. September 20, 2016

Chapter 8. NP and Computational Intractability. Slides by Kevin Wayne. Copyright 2005 Pearson-Addison Wesley. All rights reserved.

CSc 520 Principles of Programming Languages

Large-Scale Variability-Aware Type Checking and Dataflow Analysis

4.1 Review - the DPLL procedure

NP and computational intractability. Kleinberg and Tardos, chapter 8

Information Science. No. For each question, choose one correct answer and write its symbol (A E) in the box.

Software Similarity Analysis. c April 28, 2011 Christian Collberg

Kmax: Finding All Configurations of Kbuild Makefiles Statically Paul Gazzillo Stevens Institute. ESEC/FSE 2017 Paderborn, Germany

Circuit versus CNF Reasoning for Equivalence Checking

Slide Set 18. for ENCM 339 Fall Steve Norman, PhD, PEng. Electrical & Computer Engineering Schulich School of Engineering University of Calgary

Compiler, Assembler, and Linker

4. An interpreter is a program that

Toolchain-Independent Variant Management with the Leviathan Filesystem

1.4 Normal Forms. We define conjunctions of formulas as follows: and analogously disjunctions: Literals and Clauses

Chapter 7: Preprocessing Directives

Performance Measurement of C Software Product Lines

Morpheus: Variability-Aware Refactoring in the Wild

On Reasoning about Finite Sets in Software Checking

1 Definition of Reduction

Mixed Integer Linear Programming

Model Checking I Binary Decision Diagrams

SOFTWARE ARCHITECTURE 5. COMPILER

Symbolic Model Checking

Seminar decision procedures: Certification of SAT and unsat proofs

Concepts of programming languages

CSE 374 Programming Concepts & Tools

NP-Completeness. Algorithms

Scalable Program Analysis Using Boolean Satisfiability: The Saturn Project

MODULE 5: Pointers, Preprocessor Directives and Data Structures

Finite Model Generation for Isabelle/HOL Using a SAT Solver

Understanding Linux Feature Distribution

Linear Time Unit Propagation, Horn-SAT and 2-SAT

TDDE18 & 726G77. Functions

Compiler construction

Local Two-Level And-Inverter Graph Minimization without Blowup

Macros and Preprocessor. CGS 3460, Lecture 39 Apr 17, 2006 Hen-I Yang

Gabriel Hugh Elkaim Spring CMPE 013/L: C Programming. CMPE 013/L: C Programming

Binary Decision Diagrams

Behavior models and verification Lecture 6

CSE413: Programming Languages and Implementation Racket structs Implementing languages with interpreters Implementing closures

Syntax-Directed Translation. Lecture 14

Racket: Macros. Advanced Functional Programming. Jean-Noël Monette. November 2013

! Greed. O(n log n) interval scheduling. ! Divide-and-conquer. O(n log n) FFT. ! Dynamic programming. O(n 2 ) edit distance.

Transcription:

TypeChef: Towards Correct Variability Analysis of Unpreprocessed C Code for Software Product Lines Paolo G. Giarrusso 04 March 2011

Software product lines (SPLs) Feature selection SPL = 1 software project 1 variant of a program, out of many possible ones. Examples of features: Which data representation to use? Support end-user feature so-and-so? Fast or real-time version?

(Static) correctness checking Aim: to support developers, check if all variants are correct Syntactic correctness Type-correctness Bug finding Static analysis Model checking (freedom from deadlock, liveness)...

Exponential number of variants 33 optional, independent features a unique variant for each person on the planet Slide credits: Christian Kästner

Exponential number of variants 320 optional, independent features # variants > # estimated atoms in the universe Slide credits: Christian Kästner

Example SPLs NASA flight control system: 275 features Vim (text editor): 779 features HP Owen printer firmware: 2000 features Linux kernel: > 6500 features

Approach Analyse the whole SPL at once! Parsing: build a conditional AST, which stores the presence conditions (boolean formulas) of code elements SPL-aware type checking: if A refers to B, B must be present whenever A is: pc A pc B. If conflicting definitions are present, they must not be active at the same time: pc A xor pc B. Done for other languages (e.g., Java)

Rely on SAT-solvers We need therefore to check formula validity. NP-complete problem! Exponential time again! For many classes of problems, available SAT-solvers are efficient. Our problem is one of those!

Conditional compilations for SPLs Use a lexical preprocessor (like the C preprocessor, CPP) to implement SPLs. Example: 1 #if FEATURE_REAL_TIME 2 void sort(int array[], int length) { 3 //Use heap sort, always O(n log n) 4 } 5 #else 6 void sort(int array[], int length) { 7 //Use quick sort, usually but not always faster. 8 } 9 #endif Conditional compilation is available in other languages as well.

Analysis of unpreprocessed code C compilers first preprocess code, then parse it. Instead, we need to parse C code before preprocessing. But it is hard! CPP mixes variability with other stuff.

Examples for parsing CPP Macro expansion required for parsing! Alternative definitions Undisciplined annotations (Around 16% in a study of 40 Open Source projects) Slide credits: Christian Kästner

From the Linux kernel: Slide credits: Christian Kästner

Requirements The output must: Be simple to further process (esp. parse) Contain only variability, remove unrelated constructs Avoid #define... use only #if...#endif Avoid #define Use only #if...#endif and #define

Correctness of partial preprocessing Ideally, our correctness requirement would be: cpp(σ, ppc(prog)) = cpp(σ, prog) The actual specification is more complex and has quite a few restrictions, which are OK for our application scenario.

Conditional compilation 1 #if C_1 2 body 1 3 #elif C_2 4 body 2 5 #else 6 body else 7 #endif becomes: 1 #if C_1 2 body 1 3 #endif 4 #if!c_1 && C_2 5 body 2 6 #endif 7 #if!c_1 &&!C_2 8 body else 9 #endif

Macro expansion Given: 1 #if C_1 2 #define A (expansion_1) 3 #elif C_2 4 #define A (expansion_2) 5 #endif a reference to A becomes: 1 #if C_1 2 (expansion_1) 3 #endif 4 #if!c_1 && C_2 5 (expansion_2) 6 #endif 7 #if!c_1 &&!C_2 8 A 9 #endif

Include guards Typical header structure, for foo.h: 1 #ifndef FOO_H 2 #define FOO_H 3 /* Header body */ 4 #endif This way, multiple or even (indirect) recursive inclusions of foo.h are tolerated. Therefore, when FOO_H is tested, we need to check if it is satisfiable again, use SAT!

Real-world example: Slide credits: Christian Kästner

The need for simplification 1 #if FEAT1 && FEAT2 2 #define A BODY1 3 #else 4 #define A BODY2 5 #endif Define B as: 1 #if FEAT2 2 #define B A 3 #endif Without any simplification, the expansion of B would become: 4 #if FEAT2 && FEAT1 && FEAT2 5 BODY1 6 #endif 7 #if FEAT2 &&!(FEAT1 && FEAT2) 8 BODY2 9 #endif

Simplified result 1 #if FEAT2 && FEAT1 2 BODY1 3 #endif 4 #if FEAT2 &&!FEAT1 5 BODY2 6 #endif Less duplicated literals (or none)! Even more important in complex, real-world examples!

Scalability requirements Potentially huge codebases (Linux kernel) File inclusion: a file can include thousands of lines of extra code. During development, naive algorithm implementation lead to: Filling up the disk (>9G of output for one file) Filling up the heap (2-3G of RAM) Non-termination Most of this happened during formula manipulation. All state-of-the-art algorithms (including the alternative to SAT-solvers, i.e. BDD) have exponential worst-case complexity.

Formula representation I 1st idea: Represent formula by an unordered node-labeled tree, similar to AST; nodes represent And, Or and Not operations on the nodes. 2nd idea: Hash-consing: each formula is represented exactly once; after a formula is built, it is looked up in a canonicalization map to find an existing copy, which is used if available. Formula comparison becomes O(1). Formulas are represented by DAGs, not trees, because subtrees can be shared.

Formula representation II Simplification during construction: simplification rules remove some redundant terms. And and Or nodes contain sets of nodes. This removes duplicates and speeds up membership testing, which becomes O(1). Negation normal form (NNF): negation is pushed down to literals, using DeMorgan laws. This is done during formula construction: quite tricky to make it non-exponential. Simplification rules require O(1) negation.

Some simplification rules e False False e e e... e (e e ) e e e ( e o) False e (e o) e e ( e o) e o Remove duplicates (see e) (at least nearby ones)! The dual of each rule is also present.

Exponential replication Below, size of T i can be polynomial or exponential in i, depending on how size is measured: T 1 = a (1) T i+1 = T i (b (T i c)) (2) The difference is in the expansion! T i appears twice in T i+1 ; T 1 appears (indirectly) 2 n times in T n+1. Represented as a tree, the number of nodes is exponential in n; represented as a DAG, the number of node is linear in n. Express such a formula in the output without #define fully expand references. We need to preserve sharing!

Formula renaming Formula renaming: If ϕ appears twice in ψ, replace it by a variable A, and impose A ϕ. This avoids replication, and produces an equisatisfiable formula. A ϕ can be further optimized to reduce the output size, we omit here the details. This technique is also crucial for non-exponential transformation of formulas into CNF for SAT-solving.

Conclusion We discussed variability analysis for C; within this context, our focus was on efficient algorithms for boolean formula manipulation. Thanks to these algorithms, we believe we will be able to partially preprocess the whole Linux kernel and Vim, while considering the whole feature model. These techniques might also be useful for source manipulation tools for C (e.g., refactorings).

Part of this work was published as: Christian Kästner, Paolo G. Giarrusso, and Klaus Ostermann. Partial preprocessing of C code for variability analysis. In Proc. Int l Workshop on Variability Modelling of Software-intensive Systems (VaMoS), pages 137 140, New York, 2011. ACM Press.