Verification Modulo Versions: Towards Usable Verification


Francesco Logozzo, Shuvendu K. Lahiri, Manuel Fähndrich (Microsoft Research) and Sam Blackshear (University of Colorado Boulder)

Abstract

We introduce Verification Modulo Versions (VMV), a new static analysis technique for reducing the number of alarms reported by static verifiers while providing sound semantic guarantees. First, VMV extracts semantic environment conditions from a base program P. Environment conditions can either be sufficient conditions (implying the safety of P) or necessary conditions (implied by the safety of P). Then, VMV instruments a new version of the program, P′, with the inferred conditions. We prove that we can use (i) sufficient conditions to identify abstract regressions of P′ w.r.t. P; and (ii) necessary conditions to prove the relative correctness of P′ w.r.t. P. We show that the extraction of environment conditions can be performed at a hierarchy of abstraction levels (history, state, or call conditions), with each subsequent level requiring a less sophisticated matching of the syntactic changes between P and P′. Call conditions are particularly useful because they only require the syntactic matching of entry points and callee names across program versions. We have implemented VMV in a widely used static analysis and verification tool. We report our experience on two large code bases and demonstrate a substantial reduction in alarms while additionally providing relative correctness guarantees.

Categories and Subject Descriptors D.2.4 [Software Engineering]: Software/Program Verification; F.3.1 [Logics and Meanings of Programs]: Specifying and Verifying and Reasoning about Programs; F.3.2 [Logics and Meanings of Programs]: Semantics of Programming Languages

General Terms Experimentation, Languages, Reliability, Verification.

Keywords Abstract interpretation, Specification inference, Static analysis.

PLDI '14, June 9-11, 2014, Edinburgh, UK. Copyright © 2014 ACM [to be supplied]... $15.00

1. Introduction

Program verification is traditionally version-unaware: it focuses on the verification of a particular version of a program, independent of past versions of the same code base. Consider the typical verification scenario. The user has a program P′ that she wants to verify. She runs the static verification tool, which produces a long list of alarms. Ideally, she will go over the list, addressing all the issues: fixing these alarms could involve repairing the program or providing additional annotations or contracts. In practice, such a process is too expensive to be realistic, easily requiring several weeks of work on medium-scale projects. Furthermore, she is mainly interested in fixing the new defects, i.e., the ones introduced since the last release P of the same project. In fact, her confidence in the old code is much higher (because of extensive tests, deployments, code reviews, and reuse) than her confidence in the new code. Moreover, even if a bug is found in P, she may choose not to fix it, to avoid compatibility issues [4]. Most industrial-strength static analysis tools provide a syntactic baselining feature to tackle this issue, e.g., the Polyspace Verifier [31], Coverity SAVE [14], GrammaTech CodeSonar [19], and the CodeContracts static checker [17]. Syntax-based baselining can be roughly sketched as follows.
The tool first creates a database with all unresolved warnings for a base program P. In the database, warnings are indexed by source position, type, message, and some other syntactic information. The resulting database is the analysis baseline. When the source program P is changed to create program P′, each warning generated for P′ is compared against the warnings in the analysis baseline. If the warning is matched, it is automatically hidden; if not, it is reported as a new warning.

By nature, syntactic matching is unreliable and unsound: it can both re-report old warnings and suppress warnings that are genuinely new. For example, let us consider a simple line-based suppression strategy that suppresses all warnings about an assertion failure at line i. Say a warning based on an assertion a_0 at line i was suppressed in the original program. Assume the user inserts a new and potentially failing assertion a_1 at line i in the new program, shifting a_0 down to line i + 1. Now, the analyzer will (wrongly) suppress the warning about the failure of a_1, and (wrongly) re-report the warning about the failure of a_0. This is exactly the opposite of the desired behavior! Though this example represents the worst-case behavior of an overly simplistic syntactic baselining scheme, our point is that syntactic approaches to baselining will always fall short of correct behavior in some cases, because the baseline lacks semantic information about why a warning should be suppressed. More importantly, such approaches provide no guarantees in cases where they suppress an old alarm or show a new alarm.

Motivated by these deficiencies in verification tools (too many warnings for version-unaware verification vs. an unreliable and unsound baseline), we seek to provide a new verification technique which delivers most of the benefits of a baseline while additionally providing semantic correctness guarantees.

Contributions.
We introduce Verification Modulo Versions (VMV), a technique for automatically inferring and maintaining semantic information across program versions. In VMV, the database corresponding to the analysis baseline is populated with semantic environment conditions E extracted from the base program P. When

Sufficient():
0: y = g()
1: if *
2:   assert y > 0

(a) An example with no necessary condition other than true and a sufficient condition y > 0. The expression * denotes non-deterministic choice.

Sufficient′():
0: y = g()   // assume y > 0, read from baseline
1: y = y - 2
2: if *
3:   assert y > 0

(b) An evolved version of the program from Fig. 1a. Though the assertion is unchanged syntactically, VMV(S) will report a regression because the baseline condition y > 0 is no longer sufficient to ensure the safety of the assertion.

Sufficient′′():
0: y = g()   // assume y > 0, read from baseline
1: z = y - 2
2: if *
3:   assert z > -12

(c) A different evolved version of the program from Fig. 1a. Though the assertion has changed syntactically and semantically, VMV(S) proves it by using the sufficient condition y > 0 from the baseline.

Figure 1: VMV(S): Verification Modulo Versions with sufficient conditions.

the static verifier analyzes P′, it reads the semantic information E from the baseline, instruments P′ with E, and analyzes the instrumented program P′_E. Since P′_E makes more assumptions than P′, a monotonic analyzer will report fewer warnings for P′_E than for P′. In order to make VMV practical, we must address three main questions: (i) What semantic information E should be extracted from P and saved in the baseline? (ii) Where do we insert E in P′, that is, how do we obtain P′_E? (iii) Which semantic guarantees do we have on P′_E? We answer these questions by making the following contributions:

We formally characterize the optimal extracted environment information E as an abstraction of the trace semantics of P. We consider abstractions such that the extracted information is (i) sufficient for the absence of errors, or (ii) necessary for the presence of good runs.
We consider observing and injecting the extracted information at different levels of abstraction, namely: (i) the history level, as a sequence of states; (ii) the program-point level, as sets of states reaching a program point; or (iii) the call level, as sets of states after method calls.

We show that if E is sufficient to remove the errors in P and P′_E has an error, then P′ introduced a regression w.r.t. P, i.e., under the same environment assumptions ensuring P was correct, P′ leads to an error. Otherwise stated, P′ requires more assumptions from the external environment than P did.

We show that if E is necessary for the presence of good runs in P and P′_E has no errors, then P′ is correct relative to P, i.e., under the same environment assumptions holding in all good runs of P, P′ is correct.

We show how the concepts above interplay with the abstractions necessary to make the approach practical: (i) the underlying static analyzer or program verifier; (ii) the inference of the sufficient/necessary conditions; and (iii) the mapping of environment conditions to new versions.

We have implemented VMV with necessary conditions and call matching in the cccheck tool and validate its effectiveness on a sizable suite of real-world C# projects. For the three largest projects in the open source ActorFx code base, VMV reduces the number of alarms to investigate by 72.6% while providing relative correctness guarantees. We only needed to add a few contracts to bring the number of false alarms to 0. We applied the technique to a larger production-quality code base and found similar results.

2. Overview of the Solution

The problem Suppose we have two versions of a program: the original program P and an updated version P′. Assume also that we have some static analyzer or deductive verifier whose internals are opaque to us.

Necessary():
0: y = g()
1: if *
2:   assert y > 0
3: else
4:   assert false

(a) An example with a necessary condition (y > 0), but no sufficient condition other than false.

Necessary′():
0: y = g()   // assume y > 0, read from the baseline
1: if (y % 2 == 1)
2:   assert y > 0
3: else
4:   assert y > 1

(b) An evolved version of the program from Fig. 2a where the necessary condition (y > 0) from the previous program implies the safety of both assertions in the new program.

Figure 2: VMV(N): Verification Modulo Versions with necessary conditions.

The goal of traditional alarm masking (baselining) is to emit the set of alarms for P′ not in P, using some syntactic context matching to identify alarms in both P and P′. The goal of VMV is to provide a robust verification technique which exploits semantic information present in a prior version P to produce a semantically sound set of alarms for P′. By robust, we mean that the technique should require minimal syntactic matching among program versions. By semantically sound, we mean that VMV provides actual guarantees on its output.

VMV has three main phases: (i) extraction of semantic environment conditions E from the base program P; (ii) installation of E in P′ to obtain an instrumented program P′_E; and (iii) verification of P′_E. VMV is only useful if the set of alarms for P′_E is significantly smaller than that of P′. In this paper we systematically study the (orthogonal) choices for extraction and installation of correctness conditions and the semantic guarantees provided in the concrete and in the abstract, justifying the particular point of the design space that we have adopted in the tool cccheck.

Extraction VMV extracts semantic conditions from P, i.e., constraints on the environment that ensure good executions. We must be careful to differentiate two kinds of assumptions, sufficient vs. necessary. A sufficient condition guarantees that the program always reaches a good state. A necessary condition holds whenever the program reaches a good state.
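The distinction can be checked mechanically. The following sketch (not from the paper; it uses a small finite range as a stand-in for the full value domain) enumerates all runs of the two example programs and tests whether the condition y > 0 is sufficient and/or necessary for each:

```python
from itertools import product

# Small stand-in domain for the value returned by g(), and the two
# resolutions of the non-deterministic `if *`.
V = range(-3, 4)
STAR = (True, False)

def fig1a_good(y, star):
    # Fig. 1a: if * then assert y > 0; a run is good unless the assert fails
    return (not star) or y > 0

def fig2a_good(y, star):
    # Fig. 2a: if * assert y > 0 else assert false
    return star and y > 0

def is_sufficient(cond, good):
    # cond is sufficient if it guarantees a good run for every choice of *
    return all(good(y, s) for y, s in product(V, STAR) if cond(y))

def is_necessary(cond, good):
    # cond is necessary if it holds on every good run
    return all(cond(y) for y, s in product(V, STAR) if good(y, s))

pos = lambda y: y > 0
# For Fig. 1a, pos is sufficient but not necessary; for Fig. 2a, the converse.
```

Running this confirms the claims above: y > 0 is sufficient but not necessary for Fig. 1a, and necessary but not sufficient for Fig. 2a.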
Otherwise stated, if a sufficient condition does not hold, the program may or may not reach a bad state. If a necessary condition does not hold, then the program will definitely reach a bad state. We will see that installing sufficient or

necessary conditions will provide different semantic guarantees on the instrumented program P′_E. The difference between sufficient and necessary conditions is exemplified in Fig. 1a and Fig. 2a. In the first case, the condition y > 0 (at program point 0) is sufficient for the correctness of the program: when y is positive, the program will always reach a good state, no matter which non-deterministic choice is made. However, the condition is not necessary, because the program may still reach a good state even if y is not positive. In the second case (Fig. 2a), the condition y > 0 is necessary for the correctness of the program, but not sufficient. If y is not positive, then the assertion at line 2 will always fail; and since there is no condition that we can impose on y to ensure that the second assertion will not fail, the program has no sufficient correctness condition other than false. In general, finding a sufficient condition requires an under-approximation, while finding a necessary one requires an over-approximation.

The diagram in Fig. 3 summarizes our framework for extracting correctness conditions from a program P. At the lowest level is the trace semantics τ⃗+_P, which captures all finite executions of P (Sec. 3). By abstracting away all traces in τ⃗+_P that lead to an error state, via the abstraction function α_G, we obtain the success semantics G⃗_P (Sec. 3). Necessary and sufficient conditions for a program are suitable abstractions of the success semantics. As our diagram illustrates, we can extract correctness conditions at increasing levels of abstraction. One option (Sec. 4) is to extract correctness conditions for sequences of function calls (history conditions S, N). More abstract options (Sec.
5) are: (i) to extract correctness conditions for each program point where a function call appears, i.e., keeping different occurrences of the same function invocation separate (state conditions S_Σ, N_Σ); or (ii) to merge together calls to the same function (call conditions S̄, N̄). The intuition behind this hierarchy of conditions is that the more abstract the condition, the easier it is to inject into P′. In the following we let VMV(S) (resp. VMV(N)) denote VMV instantiated with sufficient (resp. necessary) conditions.

Insertion In order to build the instrumented program P′_E, we need to inject the conditions E into P′ (Sec. 6). The insertion step depends on the abstraction level of the extracted conditions. For instance, history conditions require the exact matching of program points, whereas call conditions require only the matching of entry points and callees among program versions. In our theoretical treatment we assume that we are given a function δ mapping each correctness condition from its program point in P to the corresponding program point(s) in P′. Our theoretical results are parametric w.r.t. δ. In practice, we cannot ask the user to provide such a map; therefore we pick a function δ requiring a minimal syntactic matching that is easy to compute. Our pragmatic choice is to use call conditions (S̄, N̄), where δ only needs to match entry points and callee functions among program versions. Our hypothesis is that such a matching is much more reliable than mapping arbitrary program points.

Concrete semantic guarantees The next question is: what semantic guarantees do we have on P′_E? We will see that if the extracted correctness conditions E are sufficient and P′_E has a bad run, then P′ introduced a δ-regression, i.e., P′ is not correct when the same hypotheses under which P was correct are mapped to P′ via δ (Th. 1). Intuitively, this means that the correctness of P′ relies on stronger assumptions than the correctness of P.
Conversely, if the extracted correctness conditions are necessary and there are no errors in P′_E, then P′ is δ-correct with respect to the same environment assumptions as P (Th. 2). Otherwise stated, sufficient conditions are useful for finding new bugs, while necessary conditions are useful for proving relative correctness (the absence of new bugs).

Figure 3: The hierarchy of environment condition extractions. The success semantics abstracts the concrete semantics. The correctness conditions (sufficient, ensuring the program is always correct; or necessary, required for all the correct executions) abstract the success semantics. Correctness conditions can be observed at increasing levels of abstraction (history, program point, and calls).

As an example of bug finding, let us consider the code in Fig. 1b. VMV(S) reads the (sufficient) condition y > 0 from the analysis baseline and installs it as an assumption about the return value of g(). It emits a new alarm because the sufficient condition y > 0 from Sufficient is no longer strong enough to prove the correctness of the assertion in the new version of the program: the correctness of Sufficient′ requires more environment assumptions than the correctness of Sufficient. Note that a syntactic baselining technique matching asserted expressions might wrongly suppress the alarm, since the asserted expression is y > 0 in both versions of the program. In Fig. 1c, a syntactic baselining technique would likely also fail to match the alarm from Sufficient with the alarm in Sufficient′′, because the line numbers have shifted and the expression being asserted has changed. Under the carried-over assumption y > 0, the assertion at line (3) holds.
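Both outcomes above can be replayed by brute force. The sketch below (our own illustration, using a small finite range as a stand-in for the full value domain) models the injected assume as pruning of environments and re-checks the assertions of Fig. 1b and Fig. 1c:

```python
# Re-check the evolved programs under the carried-over baseline
# assumption y > 0, by exhaustive enumeration over a stand-in domain.
def safe_under_assumption(body, values, choices=(True, False)):
    for y0 in values:
        if not (y0 > 0):            # the injected `assume y > 0` prunes this env
            continue
        for star in choices:
            if not body(y0, star):
                return False        # a failing run survives: report a regression
    return True

def fig1b(y, star):                 # Sufficient'(): y = y - 2 ; if * assert y > 0
    y = y - 2
    return (not star) or y > 0

def fig1c(y, star):                 # Sufficient''(): z = y - 2 ; if * assert z > -12
    z = y - 2
    return (not star) or z > -12

V = range(-20, 21)
# fig1b has a failing run (e.g., g() returns 1), fig1c has none.
```

Here fig1b fails under the carried-over assumption (the regression of Fig. 1b), while fig1c verifies (the relative proof of Fig. 1c).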
Note that any under-approximation of y > 0 (e.g., false) would allow us to make the same guarantee, whereas an over-approximation would be incorrect. For instance, using true would cause VMV(S) to (wrongly) report a regression for Sufficient′′. Finally, note that the condition read from the baseline is stronger than (i.e., is an under-approximation of) the condition we can extract from Sufficient′′ itself: ret > -10.

As an example of relative verification, consider the evolved program Necessary′ in Fig. 2b. VMV(N) proves that under the necessary correctness assumption y > 0 inherited from Necessary, Necessary′ is correct. Furthermore, we deduce that Necessary′ fixes Necessary, since the condition that was only necessary but not sufficient for the correctness of Necessary is now sufficient for the correctness of Necessary′.

Abstraction In practice, we need to instantiate our VMV framework with a real static analysis/verification tool and an effective method for semantic condition inference. We require approximation to make program analysis and condition inference tractable (Sec. 7). The analysis tool should over-approximate the concrete trace semantics (i.e., τ⃗+_P in Fig. 3), whilst the inference tool should under-approximate the sufficient conditions (S♯ ⊆ S) and over-approximate the necessary conditions (N ⊆ N♯). It is well known

that the abstraction may lose some of the concrete guarantees, so it is natural to ask what happens to VMV in the abstract. For sufficient conditions (i.e., VMV(S)), if the analysis is complete [18], then the alarms for P′_S♯ are still concrete regressions (e.g., as in Fig. 1b). However, if the tool is incomplete, i.e., it may fail to prove some facts that are true, then the alarms are abstract regressions. An abstract regression denotes an environment s ∈ S♯ for which the tool can prove P correct, but cannot prove the correctness of P′. Orthogonally, if S♯ ⊂ S then P′_S♯ may suppress real alarms, i.e., some regression may be missed; think for instance of a naive extraction algorithm always suggesting false. In general, no guarantee can be given on the assertions not shown in the list of raised alarms.

For necessary conditions (i.e., VMV(N)), if the analysis is incomplete or N ⊂ N♯, then some of the reported alarms may be false alarms: the analysis failed to prove relative correctness either because the inferred condition is too weak (at worst true) or because the tool is too imprecise. Nevertheless, all the assertions not shown as alarms are correct or relatively correct. Motivated by the observations above, by the fact that all practical verification and analysis tools are incomplete, by the fact that computing good sufficient conditions is harder than computing good necessary conditions, and by the fact that (relative) verification is more interesting than bug finding, we instantiate VMV with state-based necessary precondition inference (Sec. 8).

3. Concrete Semantics and Background

We formalize VMV using Abstract Interpretation [8], a theory of semantic approximation which is very well suited to describing under- and over-approximations.
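Before developing the formal machinery, the incompleteness phenomenon discussed above can be made concrete. The toy sketch below (ours, not the paper's analyzer) replays Necessary′ (Fig. 2b) with a minimal interval domain: intervals cannot represent parity, so the analyzer raises one false alarm even though the program is relatively correct.

```python
import math

# A toy interval abstract value: a lower and an upper bound.
class Interval:
    def __init__(self, lo, hi):
        self.lo, self.hi = lo, hi
    def proves_gt(self, k):
        # the abstract value discharges `assert v > k` only when lo > k
        return self.lo > k

# After the injected `assume y > 0` (over the integers), y is [1, +inf).
y = Interval(1, math.inf)

alarms = []
if not y.proves_gt(0):          # then-branch (y odd): assert y > 0
    alarms.append("assert y > 0")
if not y.proves_gt(1):          # else-branch (y even): assert y > 1
    alarms.append("assert y > 1")
# One false alarm remains: the interval forgot that the even branch
# actually implies y >= 2.
```

The then-branch assertion is discharged, but the else-branch assertion is reported as a (false) alarm, illustrating why an incomplete analyzer can fail to prove relative correctness.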
In formalizing VMV, we will need to consider four orthogonal axes of approximation: (i) the level of the extracted semantic conditions (history, state, or call); (ii) whether they are sufficient or necessary; (iii) the approximation induced by the inference tool (under-approximation of sufficient conditions, over-approximation of necessary ones); and (iv) the over-approximation induced by the static analysis/verification tool.

Syntax We consider a simple imperative block language with holes. Blocks are sequences of basic statements. Basic statements are assignments, assertions, assumptions, and function calls. For our theoretical treatment: (i) holes are input values and calls to unknown external functions; (ii) non-deterministic choices (*) can only appear in assignments (i.e., assert * and assume * are disallowed); (iii) all contracts (if any) for function calls are explicitly expanded. A program is a directed graph whose nodes are blocks. We assume that a program has a single exit block. The special program points entry and exit denote the function entry and exit points. We use the set BeforeCall (resp. AfterCall) to denote the set of program points before (resp. after) some method call.

Concrete states A program state maps variables (x, y, ...) to values in V. We assume some reserved variable names: arg_0, arg_1, ... are the actual arguments for function calls, and ret is the value returned by a function call. Program states s, s_0, ... belong to a given set of states Σ. The function π : Σ → PC maps a state to the corresponding program point, and the function π_s : Σ → Stm maps states to the corresponding statements. The function ⊨ : (Σ × Exp) → {true, false} returns the Boolean value of an expression in a given state.

Small-step operational semantics The non-deterministic transition relation τ_P ⊆ Σ × Σ associates a state with its possible successors in the program P. When the program P is clear from the context, we write τ instead of τ_P. We often write τ(s, s′) instead of ⟨s, s′⟩ ∈ τ.
The definition of τ for our block language is mostly straightforward, hence we omit it here. Function calls are the only interesting case for our exposition: we use them to model unknowns from the environment and imprecision in the static analysis. To simplify the theoretical development, we assume that function calls are (i) black boxes, i.e., we cannot inspect their bodies, so they can return any value, and (ii) side-effect free; otherwise one can proceed as in [13]. If s ∈ Σ is a state corresponding to a function call (that is, π_s(s) = (ret = f(arg_0, ..., arg_{n-1}))), then the transition relation is such that ∀s′ ∈ Σ. τ(s, s′) ⟺ ∃v ∈ V. s′ = s[ret ↦ v]. In general, side-effect free functions need not be deterministic: our model of functions accounts for the possibility that a function is invoked with the same parameters and returns a different value.

The set of initial states is I ⊆ Σ. The final (or blocking) states have no successors: F = {s ∈ Σ | ∀s′. ¬τ(s, s′)}. The bad states B ⊆ F are the final states corresponding to assertion failures, i.e., B = {s ∈ F | π_s(s) = assert e ∧ s ⊭ e}. The good states G ⊆ F are the non-failing final states: G = F \ B.

Maximal trace semantics The concrete semantics of a program P is a set of execution traces. Various choices are possible for the trace semantics; here, for simplicity, we consider only finite maximal traces, i.e., we ignore non-termination. Traces are sequences of states in Σ. The empty trace ε has length 0; a trace s⃗ = s_0 s_1 ... s_{n-1} has length n. We often write s_i to refer to the i-th state of the trace s⃗. The set of traces of length n ≥ 0 is Σ^n, and the set of finite traces is Σ* = {ε} ∪ ⋃_{n>0} Σ^n. The good traces are G⃗ = {s⃗ ∈ Σ* | s_{n-1} ∈ G}; similarly, the bad traces are B⃗ = {s⃗ ∈ Σ* | s_{n-1} ∈ B}. We define trace concatenation as sequence juxtaposition, and we extend trace concatenation and composition to sets of traces in the usual way. The execution prefixes define the partial execution trace semantics.
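The definitions above can be made concrete on a tiny instance. The sketch below (our illustration; a small finite range stands in for the full value domain V) enumerates the maximal traces of the Fig. 1a program, where a state is a (program point, y) pair, and classifies them as good or bad:

```python
from itertools import product

# g() may return any value in V; `if *` is resolved non-deterministically,
# so each (y, star) pair yields one maximal trace.
V = range(-2, 3)

def maximal_trace(y, star):
    trace = [("entry", None), ("after g()", y)]
    # both branches end in a blocking state: the assert statement or exit
    trace.append(("assert y > 0", y) if star else ("exit", y))
    return tuple(trace)

def is_bad(trace):
    pc, y = trace[-1]              # a bad state is a failing assertion
    return pc == "assert y > 0" and not (y > 0)

traces = {maximal_trace(y, s) for y, s in product(V, (True, False))}
bad = {t for t in traces if is_bad(t)}
good = traces - bad                # the good runs: the success semantics keeps these
```

With V = {-2, ..., 2} there are 10 maximal traces, of which 3 are bad (the assert branch with y ≤ 0); the success semantics of the next paragraphs retains exactly the remaining 7.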
The partial runs of P of length n are the partial executions τ⃗^n_P = {s⃗ ∈ Σ^n | ∀i ∈ [0, n-1). τ_P(s_i, s_{i+1})}. The complete, or maximal, runs of P are the partial runs starting from an initial state and ending in a blocking state: τ⃗+_P = ⋃_{n>0} {s⃗ ∈ τ⃗^n_P | s_0 ∈ I ∧ s_{n-1} ∈ F}. Please note that while each trace in τ⃗+_P is finite, the set τ⃗+_P is in general infinite. This means that while we do not capture infinite executions, we do capture unbounded non-determinism from data sources like input variables and values returned by function calls. Finally, we let τ⃗+_P(I) denote the set of maximal traces starting with a state in I.

Recall: Abstract Interpretation In this paper, we use Abstract Interpretation to approximate the concrete maximal trace semantics defined above over the concrete domain given by the complete Boolean lattice ⟨℘(Σ*), ⊆, ∅, Σ*, ∪, ∩, ¬⟩. The concrete semantics can be either over-approximated or under-approximated. Let ⟨C, ≤⟩ and ⟨A, ⊑⟩ be two partial orders. If there exist an abstraction function α : C → A and a concretization function γ : A → C such that ∀c ∈ C, a ∈ A. α(c) ⊑ a ⟺ c ≤ γ(a), we say that ⟨α, γ⟩ is a Galois connection between ⟨C, ≤⟩ and ⟨A, ⊑⟩. The composition of Galois connections is a Galois connection; composition enables the step-wise construction of abstract semantics, and it is the theoretical justification for our construction. Sometimes we need a relaxed form of the Abstract Interpretation framework in which we only require the abstraction function α to be monotonic [9].

Success Semantics The good execution (or success) semantics of P considers only the traces in the maximal execution traces of P that terminate in a non-error state. To formalize this, we first define a Galois connection between ⟨℘(Σ*), ⊆⟩ and ⟨℘(Σ*), ⊆⟩ with the abstraction α_G = λT. T ∩ G⃗ and the concretization

γ_G = λT. T ∪ B⃗. Then we define the good execution trace semantics as G⃗_P ≜ α_G(τ⃗+_P).

4. History Environment Conditions

We want to extract from G⃗_P the sufficient and necessary conditions for correct executions. Roughly, a sufficient condition ensures that all the executions exhibiting the condition are good executions; a necessary condition holds for all good executions. In this section we unify and generalize concepts from [7, 12] to define the optimal sufficient and necessary history conditions as suitable abstractions of the success semantics.

First, let us define the Galois connection between ⟨℘(Σ*), ⊆⟩ and ⟨℘(Σ*), ⊆⟩ whose abstraction α⃗_n is

α⃗_n = λT. {s_0 · e⃗ | s⃗ ∈ T ∧ e⃗ = α⃗(s⃗)}

using the auxiliary function α⃗ defined as

α⃗(s⃗) = ε                           if s⃗ = ε
α⃗(s⃗) = s_1 · α⃗(s_2 ... s_{n-1})    if π_s(s_0) = (ret = f(...))
α⃗(s⃗) = α⃗(s_1 ... s_{n-1})          otherwise.

Intuitively, α⃗_n captures the sequences of environment choices of a set of traces T: the abstraction records the entry state of the function and the state after each method call.

Weakest sufficient environment conditions The weakest sufficient environment conditions capture the largest set of sequences of environment choices (initial states and return states) that guarantee that the program execution always reaches a good state. We define them via a parameterized Galois connection [7] between ⟨℘(Σ*), ⊆⟩ and ⟨℘(Σ*), ⊆⟩. Informally, given a base set of traces S, the abstraction α_s[S] first selects the traces in T that terminate in a good state (s⃗ ∈ α_G(T)) and whose environment choices (α⃗_n({s⃗})) are different from the choices of all the traces in S that may lead to a bad state (s⃗_B ∈ S ∩ B⃗). Then, out of the selected traces, the abstraction retains only the environment choices. Formally, the parameterized abstraction is

α_s[S] = λT. {α⃗_n({s⃗}) | s⃗ ∈ α_G(T) ∧ ∀s⃗_B ∈ S ∩ B⃗. α⃗_n({s⃗}) ≠ α⃗_n({s⃗_B})}

and the (parameterized) concretization is

γ_s[S] = λT. {s⃗ | s⃗ ∈ γ_G(T) ∨ ∃s⃗_B ∈ S. α⃗_n({s⃗}) = α⃗_n({s⃗_B})}.

Note that our definition of α_s[S] generalizes the trace-based wlp of [12].
The largest set of environment choices guaranteeing that the program P is always correct is S ≜ α_s[τ⃗+_P](G⃗_P). As an example, let us consider Fig. 1a. The weakest sufficient condition is

S = {⟨y ↦ y_0⟩ · ⟨y ↦ y_1⟩ | y_0 ∈ V, 0 < y_1},

i.e., when g returns a positive number, the program is correct: it will never reach a bad state. The program is correct even if g ensures a stronger property, e.g., that the return value is 42, or a positive even number.

Strongest necessary environment conditions We want the set N of sequences of environment choices characterizing the good executions. The intuition is that if we have a sequence of environment choices not in N, then an execution compatible with those choices will definitely lead to a bad state. In particular, we want the smallest such set N, which we obtain by abstracting G⃗_P. Therefore N ≜ α⃗_n(G⃗_P) is the strongest environment property satisfied by the good executions of P. If an environment does not make one of the choices admitted by N, then the program P will fail for sure. If an environment does make a choice allowed by N, then we know (by construction) that there is at least one concrete execution of P that terminates in a good state. However, there may also be concrete

TwoCallSites(y, z):
0: if *
1:   ret = f(y)
2:   x = ret + 1
3: else
4:   ret = f(z)
5:   x = ret - 1
6: assert x > 100

(a) Two distinct invocations of the same function. The sufficient and necessary state conditions coincide, but the call conditions are different.

TwoCallSites′(w):
0: ret = f(w)
1: x = ret
2: assert x > 100

(b) Evolved version where exact syntactic matching of program points is ambiguous, but the matching of function calls is well-defined.

Figure 4: An example where the sufficient and necessary call conditions are different (resp. ret > 101 and ret > 99).

executions that terminate in a bad state. In the example of Fig. 2a:

N = {⟨y ↦ y_0⟩ · ⟨y ↦ y_1⟩ | y_0 ∈ V, y_1 > 0}.

That is, in all the good executions, the value returned by g is positive.
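Both example conditions above can be recovered by exhaustive enumeration. In the sketch below (our illustration; a small finite range stands in for V, and the single environment choice in both programs is the value returned by g()), the weakest sufficient set and the strongest necessary set are computed directly from their definitions:

```python
V = range(-3, 4)                       # stand-in for the environment's choices
STAR = (True, False)                   # the program's own non-determinism

def fig1a_good(y, star):
    return (not star) or y > 0         # if * then assert y > 0

def fig2a_good(y, star):
    return star and y > 0              # if * assert y > 0 else assert false

def weakest_sufficient(good):
    # env choices under which EVERY resolution of * reaches a good state
    return {y for y in V if all(good(y, s) for s in STAR)}

def strongest_necessary(good):
    # env choices exhibited by AT LEAST ONE good run
    return {y for y in V if any(good(y, s) for s in STAR)}

S1, N1 = weakest_sufficient(fig1a_good), strongest_necessary(fig1a_good)
S2, N2 = weakest_sufficient(fig2a_good), strongest_necessary(fig2a_good)
```

On this finite domain, S1 is the positive values and N1 is everything (no necessary condition other than true), while S2 is empty (no sufficient condition other than false) and N2 is the positive values, matching the two examples; in both cases S ⊆ N.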
If the value returned by g is not positive, then the program will fail for sure; otherwise, the program may or may not fail, depending on the non-deterministic choice.

Relation between the two concepts It follows from the definitions above that S ⊆ N. In general, it is sound to under-approximate S, but not to over-approximate it: S characterizes the largest set of concrete executions that always terminate in a good state, and any under-approximation of S yields a smaller set of concrete executions that always terminate in a good state. Dually, it is sound to over-approximate N, but not to under-approximate it. Though sufficient and necessary conditions are not new concepts, we take the novel step of using both kinds of conditions to obtain different guarantees about the warnings reported by our VMV technique, as we will see in Sec. 6.

5. State-Based Environment Conditions

The trace-based environment conditions capture the sequences of environment choices that are necessary or sufficient for program correctness. However, in some cases we may be interested in the set of environments that are possible at a given program point, regardless of the sequence of environment choices made previously. In such cases, we can abstract the trace properties of the section above to state properties, by collecting all the states observed at each program point and discarding the environment choices made along the way.

State conditions Given a set of traces T, the reachable-states abstraction α_Σ collects the set of states reaching each program point. Formally, we have the Galois connection between ⟨℘(Σ*), ⊆⟩ and ⟨PC → ℘(Σ), ⊆̇⟩, where the abstraction function is

α_Σ = λT. λpc. {s_i | ∃s⃗ ∈ T. s⃗ = s_0 ... s_i ... s_{n-1} ∧ π(s_i) = pc}

and the concretization γ_Σ is the expected one. The weakest state-based sufficient conditions on the environment are the sets of states S_Σ ≜ α_Σ(S). The strongest state-based necessary conditions on the environment are the sets of states N_Σ ≜ α_Σ(N).
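The state conditions of the TwoCallSites example (Fig. 4a) can be computed by enumeration. The sketch below is our illustration, with a small finite range standing in for the return values of f; from each after-call point the rest of the execution is deterministic, which is why the sufficient and necessary state conditions coincide here:

```python
# Brute-force state conditions at the two after-call points of
# TwoCallSites (Fig. 4a): pc 1 (then-branch) and pc 4 (else-branch).
V = range(90, 111)

def safe_from(pc, ret):
    # execute the rest of the program from the given after-call point
    x = ret + 1 if pc == 1 else ret - 1
    return x > 100                      # assert x > 100

# Sufficient and necessary conditions coincide at each program point,
# since no non-determinism remains after the call.
S_sigma = {pc: {r for r in V if safe_from(pc, r)} for pc in (1, 4)}
```

On this domain the computed sets are exactly the conditions ret > 99 at point 1 and ret > 101 at point 4.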
If entry denotes the entry point of P, then (i) S_Σ(entry) is the weakest sufficient precondition of [15], and (ii) N_Σ(entry) is the strongest necessary precondition of [10]. It follows from Sec. 4 that S_Σ ⊆̇ N_Σ, where ⊆̇ is the pointwise functional extension of ⊆. In the example of Fig. 4a, the sufficient and necessary state-based conditions coincide: S_Σ(1) = N_Σ(1) = { s | s(ret) > 99 } and S_Σ(4) = N_Σ(4) = { s | s(ret) > 101 }.

Method call conditions The state-based environment conditions collect the possible environments at each program point. In some

cases, we are only interested in the environment conditions at certain program points. For example, we may only be interested in the environment conditions following a method call. We call this abstraction the method call conditions. Let Callees(f) denote the set of return points of the invocations of f. Then the weakest sufficient condition on function calls (i.e., the largest set of states that guarantees the program always terminates in a good state) is the intersection of all the sufficient states at return points. Formally, we need the monotonic abstraction function:

  α_S = λr. λf. ⋂ { r(pc) | pc ∈ Callees(f) }

Therefore, the call conditions S ≜ α_S(S_Σ) are the weakest sufficient conditions on callees. They are an under-approximation of S. In the example of Fig. 4a, the weakest sufficient condition on f is S(f) = { s | s(ret) > 101 }. No matter which branch is taken, if f returns a value larger than 101, the assertion holds.

The necessary call conditions are given by the Galois connection ⟨PC → ℘(Σ), ⊆̇⟩ ⇄ ⟨Callees → ℘(Σ), ⊆̇⟩ via (α_N, γ_N). The abstraction function α_N merges the states after all the invocations of a given callee f. Formally, the abstraction function is

  α_N = λr. λf. ⋃ { r(pc) | pc ∈ Callees(f) }

and the concretization function is

  γ_N = λn. λpc.  n(f)  if pc is an after-call point with pc ∈ Callees(f)
                  Σ     otherwise

Therefore, the call conditions N ≜ α_N(N_Σ) are the strongest conditions on the callees which are necessary for the program to be correct. They are an over-approximation of N. In the example of Fig. 4a, the strongest necessary condition on f is N(f) = { s | s(ret) > 99 }, i.e., if the assertion holds then f returns a value larger than 99.

6. Semantic Conditions Injection

Our goal now is to apply the extracted environmental conditions to a new version of the program, P′. The idea is that the previous conditions provide a semantic baseline that allows us to report only those errors which are new in the modified program P′, or to prove its correctness under the same environmental assumptions made in the good runs of P.

Syntactic differences We call P the base program and P′ the new program.
We do not impose any constraint on how much the control flow graphs of P and P′ differ. For the moment, and for the theoretical treatment, we assume we are given a function δ ∈ PC(P) → PC(P′) ∪ {⊥} that captures the syntactic changes between P and P′. The δ function maps a program point of the base program to its corresponding point in the new version, or to ⊥ if the program point has been removed. For simplicity, we assume: (i) P and P′ have the same variables, otherwise a further projection function mapping the variables of P into those of P′ should be introduced; and (ii) each program point of P corresponds to at most one program point of P′, otherwise we would need to lift the co-domain of δ to sets of program points, and add a pairwise-disjunction hypothesis on its image. We define the image of a trace w.r.t. δ:

  α_δ = λs̄.  ε              if s̄ = ε
             s · α_δ(s̄′)    if s̄ = s · s̄′ and δ(π(s)) ≠ ⊥
             α_δ(s̄′)        if s̄ = s · s̄′ and δ(π(s)) = ⊥

that is, α_δ(s̄) abstracts away all the states in s̄ which do not refer to a program point of P′.

Semantic filtering Intuitively, applying the trace environmental conditions E to P′ means restricting the traces in the concrete semantics of P′ to those which are compatible with E. Formally, the δ-filtered maximal trace semantics of P′ is

  τ⁺⟦P′⟧ ↓δ E ≜ { s̄ ∈ τ⁺⟦P′⟧ | ∃t̄ ∈ E. α_δ(t̄) = α(s̄) }

i.e., we keep only those traces of τ⁺⟦P′⟧ such that the sequence of choices made by the environment is compatible with the trace condition E. It follows trivially from the definition that this filtering yields a subset of the concrete traces, i.e., τ⁺⟦P′⟧ ↓δ E ⊆ τ⁺⟦P′⟧.

VMV(S): VMV with sufficient conditions Say that we choose to inject the sufficient conditions S extracted from P into P′. What can we deduce about P′ ↓ S? Intuitively, if P′ ↓ S has a bad trace, then a new error has been introduced. We know by construction that, if id denotes the identity function, (τ⁺⟦P⟧ ↓id S) ∩ B = ∅ — the history conditions S were sufficient to eliminate all the bad runs of P. We say that a program P′ has a δ-regression w.r.t.
a mapping function δ if there exists a sufficient condition s̄ ∈ S for P which is no longer sufficient to eliminate all the bad runs. More formally, there exists a trace s̄ ∈ S such that (τ⁺⟦P′⟧ ↓δ {s̄}) ∩ B ≠ ∅. From the previous definitions and basic set theory it immediately follows:

THEOREM 1. If S₀ ⊆ S and (τ⁺⟦P′⟧ ↓δ S₀) ∩ B ≠ ∅, then P′ has a δ-regression.

An immediate consequence of the theorem is that we can use, e.g., the call-based sufficient conditions S (which under-approximate S) to prove δ-regressions.

VMV(N): Semantic baselining with necessary conditions Now, let us inject the extracted necessary conditions N into P′, and let us suppose that we find no bad execution — formally, (τ⁺⟦P′⟧ ↓δ N) ∩ B = ∅. Intuitively, we can conclude that, under the same environment conditions that hold for all the good runs (and maybe some of the bad runs) of P, opportunely injected via δ, P′ has no bad runs. As a consequence: (i) P′ does not require stronger environmental hypotheses than P to avoid bad runs; and (ii) if S ⊊ N, i.e., N is strictly not sufficient for the absence of bad runs in P, then P′ fixed the errors of P for which N was insufficient. We call the first property relative δ-correctness (or simply relative correctness) and the latter δ-fixing. What we said is summarized by:

THEOREM 2. Let N ⊆ N₀ and (τ⁺⟦P′⟧ ↓δ N₀) ∩ B = ∅. Then P′ is relatively δ-correct with respect to P. Furthermore, if also (τ⁺⟦P⟧ ↓id N₀) ∩ B ≠ ∅, then P′ δ-fixed a bug in P.

An immediate consequence is that we can use the call-based necessary conditions N, which over-approximate N, to prove relative correctness. Note that in general the strongest necessary conditions for P may no longer be necessary conditions for P′. For instance, if P′ removes an assertion of P, the necessary conditions for P may preclude some good runs of P′.

Combining VMV(S) and VMV(N) The results above inspire an immediate algorithm combining VMV(S) and VMV(N) to prioritize the reported alarms. First, extract S and N from P.
Since the extraction routines are completely independent, this can be done in parallel. Second, analyze P′ ↓ S and P′ ↓ N (in parallel) and collect the reported alarms (say, respectively, A_S and A_N). Third, report the alarms A_S, as we know they are δ-regressions. Fourth, report the set of alarms A_N \ A_S, which may include real bugs masked by S. Fixing all the alarms accomplishes full (relative) verification.

7. Abstraction(s)

The semantic filtering and the two theorems above give us the best theoretical solution to the problem of cross-version semantic condition injection and, correspondingly, to the problem of verification

modulo versions. Unfortunately, such a theoretical, concrete solution is not practical, for several reasons. First, the exact computation of the success semantics and of the history conditions is infeasible — the exact inference of S and N is impossible in general. Second, the theoretical results are parameterized by the cross-version mapping function δ — we certainly cannot require a user to provide δ. Third, every static analyzer, deductive verifier, model checker, or type system performs some over-approximation of τ⁺: do the results above still hold when an abstract semantics is used? In this section we address these orthogonal topics.

Abstraction of sufficient conditions Our use of the weakest sufficient conditions S is sound and complete for detecting δ-regressions: it guarantees that we capture all the regressions of P′ w.r.t. P for the changes encoded by δ. If we consider some under-approximation S₀ ⊆ S, then all the regressions we capture are definite δ-regressions w.r.t. P, but we may miss some. For instance, consider again the example in Fig. 1a, but now suppose that the tool extracted the sufficient condition ret > 10, under-approximating ret > 0. When such a condition is injected in Sufficient, the analyzer proves the assertion, and no alarm is raised. Therefore, it fails to spot the δ-regression. Overall, when analyzing P′ ↓ S, we may suppress too many warnings, both because of the nature of sufficient conditions and because of approximation. We judge that one should use VMV(S) if the goal is to provide an advanced bug-finding tool.

Abstraction of necessary conditions Our use of the strongest necessary conditions is sound and complete for proving relative δ-correctness. However, if we consider an over-approximation N₀ ⊇ N, we may fail to prove relative correctness. For instance, consider the example in Fig. 2a. Suppose we extract the condition ret > -10 from the base program. Then we cannot prove that Necessary′ fixed Necessary.
As a consequence, when analyzing P′ ↓ N we may not suppress all the warnings of P. In general, necessary conditions are more practical, as it is easier to craft useful over-approximations than useful under-approximations, though our framework can use either approach. Overall, we judge that VMV(N) is useful when the goal is verification. The assumptions carried over from the previous version of the program significantly reduce the number of alarms to consider, while providing (relative) correctness guarantees. We will see in Sec. 8 that, in our setting, for 77% of the alarms the necessary conditions are also sufficient. The programmer can then focus her efforts on fixing or suppressing the remaining alarms until their number is reduced to zero. The resulting warning-free program can be used as the baseline for future versions of the program. Any warning then reported will represent either a definite regression or a new warning.

Computing δ in practice Thus far, we have assumed that we are given the syntactic change map δ. In general, automatically computing δ in a meaningful way is impossible, because several different δ's can characterize the transformation from P to P′. For example, consider the program change shown in Fig. 4. The base program contains two invocations of f: one at program point 1, the other at program point 4. The new program contains only one invocation of f. Which (if either) of the two calls in the base program corresponds to the call in the new program? In other words, which is the correct δ: {1 ↦ 0, 4 ↦ ⊥}, {1 ↦ ⊥, 4 ↦ 0}, or {1 ↦ ⊥, 4 ↦ ⊥}? For this program, it is ambiguous which δ is the correct choice. However, this arbitrary choice has semantic ramifications, because each state corresponding to a call to f in the base program has a different correctness condition associated with it. That is, each (equally valid) choice of δ results in a different correctness condition being ported to P′. This is undesirable.
Fortunately, we found that we can avoid many of the problems associated with computing δ by using the call condition semantics S, N. Unlike the state condition semantics, or any scheme based on syntactically matching assertions — both of which require computing a δ that matches all program points — using call conditions only requires syntactically matching callee names across program versions. As long as methods are not renamed, this approach is quite reliable. For example, in Fig. 4, the method f is called in both versions of the program. Since we merge the call conditions over all the invocations of a function, changing the number of calls to a particular function across program versions presents no difficulty. Contrast this with the ambiguity introduced by changing the number of calls to f during our (previous) attempt to construct a δ that ports the state conditions across versions. Therefore, for each function f that is called in both P and P′, we simply assume the semantic condition after all the calls to f in P′. Essentially, we treat the correctness conditions as if they were postconditions of f that f's body does not need to prove. We should be careful about how to merge the semantic conditions of multiple occurrences of f in P. The theory developed in the previous sections guides us. In the example of Fig. 4a, suppose ⟨S₁, N₁⟩ and ⟨S₄, N₄⟩ are the inferred sufficient and necessary conditions at program points 1 and 4. Then, after line 0 in Fig. 4b, we will inject an under-approximation of S₁ ∩ S₄ and/or an over-approximation of N₁ ∪ N₄. The injected program P′ ↓ E can be readily analyzed, and the results of this analysis can be reported to the user.

Approximation introduced by the tool All sound automatic verifiers and static analyzers over-approximate the concrete semantics τ⁺. When the over-approximation adds execution traces that end in an error state, the analyzer reports false alarms. Therefore, the analysis is bound to be incomplete — it cannot prove every fact that holds.
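The call-condition merge described above can be sketched with explicit state sets over a finite slice of values (a toy model, not cccheck's representation): the sufficient condition on f is the intersection over its call sites, the necessary condition is the union.

```python
# Per-call-site conditions on f in the base program of Fig. 4a, modeled
# as explicit sets of allowed values for ret over a finite slice.
SLICE = range(90, 110)
S1 = N1 = {r for r in SLICE if r > 99}    # conditions at program point 1
S4 = N4 = {r for r in SLICE if r > 101}   # conditions at program point 4

# Merge over all call sites of f:
# sufficient = intersection (safe no matter which site was taken),
# necessary  = union (some good run passes through some site).
S_f = S1 & S4
N_f = N1 | N4

assert S_f == {r for r in SLICE if r > 101}   # weakest sufficient: ret > 101
assert N_f == {r for r in SLICE if r > 99}    # strongest necessary: ret > 99
```

An under-approximation of S_f and/or an over-approximation of N_f is what gets injected after the single call in Fig. 4b.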
What guarantees does VMV provide when using a tool T that over-approximates τ⁺? The answer depends on whether we are considering sufficient or necessary conditions.

Sufficient conditions. Suppose we use an inference tool generating a sufficient condition S. In this case, we require that the inference of sufficient conditions generates a condition S that is provable under T. The condition false can always be used as a default in the absence of non-trivial conditions. We say that there is an abstract δ-regression if T does not prove P′ ↓ S. An abstract regression is spurious if (τ⁺⟦P′⟧ ↓δ γ(S)) ∩ B = ∅, for an appropriate γ, but T cannot prove it. For instance, suppose that in Fig. 1 we instantiate T with the sign abstract domain [8]. In Fig. 1a, the condition y = positive is a sufficient one — γ(positive) = { n ∈ N | n > 0 }. When using this abstraction on the evolved programs of Fig. 1b and Fig. 1c, at line 1 the abstract evaluation over positive is inconclusive, therefore T reports an abstract δ-regression in both cases. For Fig. 1c, however, it is a spurious abstract δ-regression. The spurious abstract regression can be eliminated if: (i) we use the most precise transfer functions for signs [18]; or (ii) we use a more precise numerical domain, e.g., intervals [8]. We believe abstract δ-regressions (even the spurious ones) may be of interest to the user, as they occur not only because of analysis imprecision, but rather because of a change that can make the existing analysis more imprecise. We plan to investigate the concept of (spurious) abstract regressions in the future.

Necessary conditions. Now suppose our inference tool generates a necessary condition N. In this case, there is no need to use T to prove that N is necessary for P. When using T on the instrumented program P′ ↓ N, all the proven assertions are δ-correct, but because of incompleteness T may fail to prove some δ-correct assertion. If N is a good approximation and T is precise enough, then the alarms generated for P′ ↓ N should be far fewer than those

generated for P′ alone. Therefore VMV(N) is suitable for relative verification — proving the assertions of P′ under the same environmental assumptions that hold in the good runs of P.

Non-Monotonicity. Orthogonal to what has been said above, a subtle problem can arise from the (often forgotten) fact that practical analysis tools are not monotonic. That is, it is possible for an analyzer to compute a less precise result given more precise inputs (e.g., more assumptions). This can happen for many reasons: the use of widening, timeouts, non-monotonic state-space finitization, etc. In the context of VMV, this can result in the odd case where analyzing the new program P′ yields fewer warnings than analyzing the refined program P′ ↓ E. An easy way to solve this (very obscure) problem is to consider the intersection of the warnings generated by the analyses of P′ and P′ ↓ E. Note that computing this intersection poses no problem with respect to syntactic matching, because the only syntactic difference between P′ and P′ ↓ E is the addition of our correctness conditions E.

8. Experience

Tool We implemented VMV on top of the industrial-strength static analyzer cccheck [17]. cccheck is a modular static analyzer and contract verifier with inter-method inference: it builds an approximation of the call graph, visits it bottom-up, implements assume/guarantee reasoning for function calls, and infers contracts to reduce the annotation burden. Intra-procedurally, cccheck uses abstract interpretation [8] to infer invariants at each program point. It includes abstract domains to model the heap, array contents [11], nullness, integer values [26, 29], floating points [30], and enums, among others. It leverages the inferred invariants to prove the assertions in the code. Assertions can be either explicit (contracts) or implicit (null-pointer dereferences, buffer overruns, division by zero, etc.). cccheck uses the CPPA algorithm from [12] to infer necessary preconditions.
The main source of imprecision (and of false warnings) in cccheck is the handling of external code: when no contract is available to describe an API's behavior, cccheck soundly assumes the worst case. As a consequence, it may report many alarms that appear dumb to a programmer who has internalized implicit assumptions about the API's behavior. A starting point for the development of VMV was to automatically identify these assumptions and inject them into future versions of the code base to suppress such dumb warnings. We picked one particular point in the design space enabled by VMV (Fig. 3): VMV(N) with call-condition insertion for δ and CPPA for the (over-)approximation of N. There are several reasons for this choice: (i) we are interested in understanding the capabilities of cccheck as a relative verifier (that is, for proving the correctness of P′ w.r.t. P); (ii) the CPPA implementation in cccheck is very robust, whereas the tool lacks algorithms to infer sufficient conditions; (iii) designing a sufficient-condition inference algorithm good enough to be useful in practice is difficult [10]. We have modified cccheck's CPPA implementation to compute necessary conditions on method entry (entry-assumptions) and at called-method return points (callee-assumptions) — i.e., for each function call f, we infer a necessary condition N_f. We extended the cccheck caching mechanism to store the extracted conditions in a SQL database for retrieval on subsequent analysis runs. We build the instrumented program P′ ↓ N by reading those conditions from the database and injecting them as opportune assume statements in P′. Entry-assumptions are extra assumptions at the method entry point. Callee-assumptions are extra assumptions after method calls — i.e., we add an assume N_f statement after each call to f.

Code under analysis We consider two code bases: the open-source ActorFx framework and the non-public CB framework.
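The callee-assumption injection described above ("add an assume N_f statement after each call to f") can be sketched on a toy straight-line IR (hypothetical statement encoding and condition store, not cccheck's actual implementation):

```python
# Toy instrumentation pass: given straight-line statements and a store of
# necessary conditions per callee, insert an assume after each call.
def inject_callee_assumptions(stmts, conditions):
    out = []
    for stmt in stmts:
        out.append(stmt)
        if stmt[0] == "call":            # ("call", result_var, callee_name)
            _, result, callee = stmt
            if callee in conditions:
                # The stored condition is a template over the call's result.
                out.append(("assume", conditions[callee].format(ret=result)))
    return out

# The evolved program of Fig. 4b, with the necessary call condition on f.
program = [
    ("call", "ret", "f"),
    ("assign", "x", "ret"),
    ("assert", "x > 100"),
]
instrumented = inject_callee_assumptions(program, {"f": "{ret} > 99"})
assert instrumented == [
    ("call", "ret", "f"),
    ("assume", "ret > 99"),     # the injected callee-assumption
    ("assign", "x", "ret"),
    ("assert", "x > 100"),
]
```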
We chose ActorFx to enable reproducibility of our results, and CB to test VMV on complex code bases. ActorFx provides a language-independent model of dynamic and distributed objects for Windows Azure. It is available as an open-source project at actorfx.codeplex.com. The base changeset is (randomly picked) and the new one is (the latest available at the time of this write-up, roughly three months later). CB is a large, complex, and high-quality production code base at Microsoft, made up of 57 C# projects. It includes (among other things) a parser, a compiler, a tracer, a distributed file system, and a distributed runtime. The code base interacts with the file system, the network, the underlying operating system, etc., and it heavily uses the .NET serialization/deserialization mechanism. As a consequence, the correctness of many assertions in the code base relies on implicit assumptions about API behavior which are unknown to cccheck. The base version is snapshot of the code (randomly picked) and the new version is snapshot (three weeks later). Both projects have a few CodeContracts [16] annotations. Contracts range from simple non-null assertions or arithmetic relationships between variables to complex algebraic properties relating the entry and exit states of a method.

ActorFx Results We report in Fig. 5 the results of applying VMV(N) to the three largest projects of the ActorFx framework. The first column is the name of the project; the second column reports the fraction of external libraries over the total number of libraries referenced by the project. We define a library as external if it is part neither of .NET nor of ActorFx. External libraries are unknown to cccheck, which soundly assumes the worst case for them. Columns 3-6 report, respectively, the total number of methods analyzed, proof obligations, raised alarms, and inferred necessary conditions for the base program P. The raised alarms (warnings) are the assertions that the tool could not prove safe.
Columns 7-9 contain the total number of methods analyzed, checks, and warnings emitted by cccheck on the new program P′. The next two columns indicate the total number of warnings raised on P′ ↓ N and the number of relatively verified assertions. Finally, the last three columns report our experience of annotating P′ ↓ N to reduce the number of warnings to zero. The first observation is that VMV(N) dramatically reduces the number of alarms to inspect in P′ ↓ N, while still providing semantic guarantees on the proven assertions. For ActorLib, cccheck proves 2099 assertions absolutely correct and 316 assertions relatively correct when using VMV(N), leaving only 137 warnings to inspect — roughly a 70% reduction in the number of warnings. For ActorCore, 1146 assertions are proven independently of the previous version and 70 more by carrying over the hypotheses on the good runs of ActorCore, so that the user is left with a mere 37 warnings to look at. The best results are on System.Cloud, where VMV(N) enables the relative proof of 214 more assertions on top of the 1178 cccheck proved on P′ on its own, lowering the number of warnings by more than 80%. Overall, VMV(N) proved the relative correctness of 72.6% of the warnings in P′. Next, we focused on the alarms in P′ ↓ N, evaluating how much effort is needed to get to zero warnings. It turns out that, as expected, VMV(N) eliminated the vast majority of the warnings originating from the use of external APIs, allowing the developer to focus only on the interesting warnings. We added several contracts (mainly object invariants) to the source code, and we found some bugs in the code that we reported to the developers. It is worth observing that: (i) often one contract was enough to prove more than one assertion at a time, i.e., we added fewer annotations than there were warnings; (ii) the total number of checks in the annotated program significantly increased with respect to P′.
Overall, the annotation process for the three libraries took us just a couple of hours.

CB Results In the next experiment, we continued our investigation into VMV(N) by quantifying how far from sufficient condi-


More information

Programming Languages and Compilers Qualifying Examination. Answer 4 of 6 questions.

Programming Languages and Compilers Qualifying Examination. Answer 4 of 6 questions. Programming Languages and Compilers Qualifying Examination Fall 2017 Answer 4 of 6 questions. GENERAL INSTRUCTIONS 1. Answer each question in a separate book. 2. Indicate on the cover of each book the

More information

Hierarchical Pointer Analysis for Distributed Programs

Hierarchical Pointer Analysis for Distributed Programs Hierarchical Pointer Analysis for Distributed Programs Amir Kamil Computer Science Division, University of California, Berkeley kamil@cs.berkeley.edu April 14, 2006 1 Introduction Many distributed, parallel

More information

Reasoning about programs

Reasoning about programs Reasoning about programs Last time Coming up This Thursday, Nov 30: 4 th in-class exercise sign up for group on moodle bring laptop to class Final projects: final project presentations: Tue Dec 12, in

More information

Last time. Reasoning about programs. Coming up. Project Final Presentations. This Thursday, Nov 30: 4 th in-class exercise

Last time. Reasoning about programs. Coming up. Project Final Presentations. This Thursday, Nov 30: 4 th in-class exercise Last time Reasoning about programs Coming up This Thursday, Nov 30: 4 th in-class exercise sign up for group on moodle bring laptop to class Final projects: final project presentations: Tue Dec 12, in

More information

Lecture 6. Abstract Interpretation

Lecture 6. Abstract Interpretation Lecture 6. Abstract Interpretation Wei Le 2014.10 Outline Motivation History What it is: an intuitive understanding An example Steps of abstract interpretation Galois connection Narrowing and Widening

More information

A Correctness Proof for a Practical Byzantine-Fault-Tolerant Replication Algorithm

A Correctness Proof for a Practical Byzantine-Fault-Tolerant Replication Algorithm Appears as Technical Memo MIT/LCS/TM-590, MIT Laboratory for Computer Science, June 1999 A Correctness Proof for a Practical Byzantine-Fault-Tolerant Replication Algorithm Miguel Castro and Barbara Liskov

More information

The Encoding Complexity of Network Coding

The Encoding Complexity of Network Coding The Encoding Complexity of Network Coding Michael Langberg Alexander Sprintson Jehoshua Bruck California Institute of Technology Email: mikel,spalex,bruck @caltech.edu Abstract In the multicast network

More information

COS 320. Compiling Techniques

COS 320. Compiling Techniques Topic 5: Types COS 320 Compiling Techniques Princeton University Spring 2016 Lennart Beringer 1 Types: potential benefits (I) 2 For programmers: help to eliminate common programming mistakes, particularly

More information

Induction and Semantics in Dafny

Induction and Semantics in Dafny 15-414 Lecture 11 1 Instructor: Matt Fredrikson Induction and Semantics in Dafny TA: Ryan Wagner Encoding the syntax of Imp Recall the abstract syntax of Imp: a AExp ::= n Z x Var a 1 + a 2 b BExp ::=

More information

Critical Analysis of Computer Science Methodology: Theory

Critical Analysis of Computer Science Methodology: Theory Critical Analysis of Computer Science Methodology: Theory Björn Lisper Dept. of Computer Science and Engineering Mälardalen University bjorn.lisper@mdh.se http://www.idt.mdh.se/ blr/ March 3, 2004 Critical

More information

Symbolic Execution and Proof of Properties

Symbolic Execution and Proof of Properties Chapter 7 Symbolic Execution and Proof of Properties Symbolic execution builds predicates that characterize the conditions under which execution paths can be taken and the effect of the execution on program

More information

6. Relational Algebra (Part II)

6. Relational Algebra (Part II) 6. Relational Algebra (Part II) 6.1. Introduction In the previous chapter, we introduced relational algebra as a fundamental model of relational database manipulation. In particular, we defined and discussed

More information

Violations of the contract are exceptions, and are usually handled by special language constructs. Design by contract

Violations of the contract are exceptions, and are usually handled by special language constructs. Design by contract Specification and validation [L&G Ch. 9] Design patterns are a useful way to describe program structure. They provide a guide as to how a program fits together. Another dimension is the responsibilities

More information

Lecture Notes on Program Equivalence

Lecture Notes on Program Equivalence Lecture Notes on Program Equivalence 15-312: Foundations of Programming Languages Frank Pfenning Lecture 24 November 30, 2004 When are two programs equal? Without much reflection one might say that two

More information

We show that the composite function h, h(x) = g(f(x)) is a reduction h: A m C.

We show that the composite function h, h(x) = g(f(x)) is a reduction h: A m C. 219 Lemma J For all languages A, B, C the following hold i. A m A, (reflexive) ii. if A m B and B m C, then A m C, (transitive) iii. if A m B and B is Turing-recognizable, then so is A, and iv. if A m

More information

CS 6110 S11 Lecture 25 Typed λ-calculus 6 April 2011

CS 6110 S11 Lecture 25 Typed λ-calculus 6 April 2011 CS 6110 S11 Lecture 25 Typed λ-calculus 6 April 2011 1 Introduction Type checking is a lightweight technique for proving simple properties of programs. Unlike theorem-proving techniques based on axiomatic

More information

Leveraging Transitive Relations for Crowdsourced Joins*

Leveraging Transitive Relations for Crowdsourced Joins* Leveraging Transitive Relations for Crowdsourced Joins* Jiannan Wang #, Guoliang Li #, Tim Kraska, Michael J. Franklin, Jianhua Feng # # Department of Computer Science, Tsinghua University, Brown University,

More information

Program Static Analysis. Overview

Program Static Analysis. Overview Program Static Analysis Overview Program static analysis Abstract interpretation Data flow analysis Intra-procedural Inter-procedural 2 1 What is static analysis? The analysis to understand computer software

More information

Static Analysis: Overview, Syntactic Analysis and Abstract Interpretation TDDC90: Software Security

Static Analysis: Overview, Syntactic Analysis and Abstract Interpretation TDDC90: Software Security Static Analysis: Overview, Syntactic Analysis and Abstract Interpretation TDDC90: Software Security Ahmed Rezine IDA, Linköpings Universitet Hösttermin 2014 Outline Overview Syntactic Analysis Abstract

More information

Static Analysis. Systems and Internet Infrastructure Security

Static Analysis. Systems and Internet Infrastructure Security Systems and Internet Infrastructure Security Network and Security Research Center Department of Computer Science and Engineering Pennsylvania State University, University Park PA Static Analysis Trent

More information

Chapter 3. Semantics. Topics. Introduction. Introduction. Introduction. Introduction

Chapter 3. Semantics. Topics. Introduction. Introduction. Introduction. Introduction Topics Chapter 3 Semantics Introduction Static Semantics Attribute Grammars Dynamic Semantics Operational Semantics Axiomatic Semantics Denotational Semantics 2 Introduction Introduction Language implementors

More information

Lecture notes on the simplex method September We will present an algorithm to solve linear programs of the form. maximize.

Lecture notes on the simplex method September We will present an algorithm to solve linear programs of the form. maximize. Cornell University, Fall 2017 CS 6820: Algorithms Lecture notes on the simplex method September 2017 1 The Simplex Method We will present an algorithm to solve linear programs of the form maximize subject

More information

Lecture Notes on Ints

Lecture Notes on Ints Lecture Notes on Ints 15-122: Principles of Imperative Computation Frank Pfenning Lecture 2 August 26, 2010 1 Introduction Two fundamental types in almost any programming language are booleans and integers.

More information

Principles of Program Analysis: A Sampler of Approaches

Principles of Program Analysis: A Sampler of Approaches Principles of Program Analysis: A Sampler of Approaches Transparencies based on Chapter 1 of the book: Flemming Nielson, Hanne Riis Nielson and Chris Hankin: Principles of Program Analysis Springer Verlag

More information

Hiding local state in direct style: a higher-order anti-frame rule

Hiding local state in direct style: a higher-order anti-frame rule 1 / 65 Hiding local state in direct style: a higher-order anti-frame rule François Pottier January 28th, 2008 2 / 65 Contents Introduction Basics of the type system A higher-order anti-frame rule Applications

More information

Type Checking and Type Equality

Type Checking and Type Equality Type Checking and Type Equality Type systems are the biggest point of variation across programming languages. Even languages that look similar are often greatly different when it comes to their type systems.

More information

Lecture Notes on Arrays

Lecture Notes on Arrays Lecture Notes on Arrays 15-122: Principles of Imperative Computation July 2, 2013 1 Introduction So far we have seen how to process primitive data like integers in imperative programs. That is useful,

More information

Verification Overview Testing Theory and Principles Testing in Practice. Verification. Miaoqing Huang University of Arkansas 1 / 80

Verification Overview Testing Theory and Principles Testing in Practice. Verification. Miaoqing Huang University of Arkansas 1 / 80 1 / 80 Verification Miaoqing Huang University of Arkansas Outline 1 Verification Overview 2 Testing Theory and Principles Theoretical Foundations of Testing Empirical Testing Principles 3 Testing in Practice

More information

VS 3 : SMT Solvers for Program Verification

VS 3 : SMT Solvers for Program Verification VS 3 : SMT Solvers for Program Verification Saurabh Srivastava 1,, Sumit Gulwani 2, and Jeffrey S. Foster 1 1 University of Maryland, College Park, {saurabhs,jfoster}@cs.umd.edu 2 Microsoft Research, Redmond,

More information

Incompatibility Dimensions and Integration of Atomic Commit Protocols

Incompatibility Dimensions and Integration of Atomic Commit Protocols Preprint Incompatibility Dimensions and Integration of Atomic Protocols, Yousef J. Al-Houmaily, International Arab Journal of Information Technology, Vol. 5, No. 4, pp. 381-392, October 2008. Incompatibility

More information

PCP and Hardness of Approximation

PCP and Hardness of Approximation PCP and Hardness of Approximation January 30, 2009 Our goal herein is to define and prove basic concepts regarding hardness of approximation. We will state but obviously not prove a PCP theorem as a starting

More information

3.4 Deduction and Evaluation: Tools Conditional-Equational Logic

3.4 Deduction and Evaluation: Tools Conditional-Equational Logic 3.4 Deduction and Evaluation: Tools 3.4.1 Conditional-Equational Logic The general definition of a formal specification from above was based on the existence of a precisely defined semantics for the syntax

More information

Program Verification (6EC version only)

Program Verification (6EC version only) Program Verification (6EC version only) Erik Poll Digital Security Radboud University Nijmegen Overview Program Verification using Verification Condition Generators JML a formal specification language

More information

E-Companion: On Styles in Product Design: An Analysis of US. Design Patents

E-Companion: On Styles in Product Design: An Analysis of US. Design Patents E-Companion: On Styles in Product Design: An Analysis of US Design Patents 1 PART A: FORMALIZING THE DEFINITION OF STYLES A.1 Styles as categories of designs of similar form Our task involves categorizing

More information

Guiding Dynamic Symbolic Execution toward Unverified Program Executions

Guiding Dynamic Symbolic Execution toward Unverified Program Executions Guiding Dynamic Symbolic Execution toward Unverified Program Executions Maria Christakis Peter Müller Valentin Wüstholz Department of Computer Science, ETH Zurich, Switzerland {maria.christakis, peter.mueller,

More information

Static Program Analysis Part 1 the TIP language

Static Program Analysis Part 1 the TIP language Static Program Analysis Part 1 the TIP language http://cs.au.dk/~amoeller/spa/ Anders Møller & Michael I. Schwartzbach Computer Science, Aarhus University Questions about programs Does the program terminate

More information

Module 11. Directed Graphs. Contents

Module 11. Directed Graphs. Contents Module 11 Directed Graphs Contents 11.1 Basic concepts......................... 256 Underlying graph of a digraph................ 257 Out-degrees and in-degrees.................. 258 Isomorphism..........................

More information

Program verification. Generalities about software Verification Model Checking. September 20, 2016

Program verification. Generalities about software Verification Model Checking. September 20, 2016 Program verification Generalities about software Verification Model Checking Laure Gonnord David Monniaux September 20, 2016 1 / 43 The teaching staff Laure Gonnord, associate professor, LIP laboratory,

More information

On the Relationships between Zero Forcing Numbers and Certain Graph Coverings

On the Relationships between Zero Forcing Numbers and Certain Graph Coverings On the Relationships between Zero Forcing Numbers and Certain Graph Coverings Fatemeh Alinaghipour Taklimi, Shaun Fallat 1,, Karen Meagher 2 Department of Mathematics and Statistics, University of Regina,

More information

Program Syntax; Operational Semantics

Program Syntax; Operational Semantics 9/5 Solved Program Syntax; Operational Semantics CS 536: Science of Programming, Fall 2018 A. Why Our simple programming language is a model for the kind of constructs seen in actual languages. Step-by-step

More information

Formal semantics of loosely typed languages. Joep Verkoelen Vincent Driessen

Formal semantics of loosely typed languages. Joep Verkoelen Vincent Driessen Formal semantics of loosely typed languages Joep Verkoelen Vincent Driessen June, 2004 ii Contents 1 Introduction 3 2 Syntax 5 2.1 Formalities.............................. 5 2.2 Example language LooselyWhile.................

More information

Applications of Program analysis in Model-Based Design

Applications of Program analysis in Model-Based Design Applications of Program analysis in Model-Based Design Prahlad Sampath (Prahlad.Sampath@mathworks.com) 2018 by The MathWorks, Inc., MATLAB, Simulink, Stateflow, are registered trademarks of The MathWorks,

More information

Byzantine Consensus in Directed Graphs

Byzantine Consensus in Directed Graphs Byzantine Consensus in Directed Graphs Lewis Tseng 1,3, and Nitin Vaidya 2,3 1 Department of Computer Science, 2 Department of Electrical and Computer Engineering, and 3 Coordinated Science Laboratory

More information

From Types to Sets in Isabelle/HOL

From Types to Sets in Isabelle/HOL From Types to Sets in Isabelle/HOL Extented Abstract Ondřej Kunčar 1 and Andrei Popescu 1,2 1 Fakultät für Informatik, Technische Universität München, Germany 2 Institute of Mathematics Simion Stoilow

More information

A Reduction of Conway s Thrackle Conjecture

A Reduction of Conway s Thrackle Conjecture A Reduction of Conway s Thrackle Conjecture Wei Li, Karen Daniels, and Konstantin Rybnikov Department of Computer Science and Department of Mathematical Sciences University of Massachusetts, Lowell 01854

More information

Lecture 5: Duality Theory

Lecture 5: Duality Theory Lecture 5: Duality Theory Rajat Mittal IIT Kanpur The objective of this lecture note will be to learn duality theory of linear programming. We are planning to answer following questions. What are hyperplane

More information

Table : IEEE Single Format ± a a 2 a 3 :::a 8 b b 2 b 3 :::b 23 If exponent bitstring a :::a 8 is Then numerical value represented is ( ) 2 = (

Table : IEEE Single Format ± a a 2 a 3 :::a 8 b b 2 b 3 :::b 23 If exponent bitstring a :::a 8 is Then numerical value represented is ( ) 2 = ( Floating Point Numbers in Java by Michael L. Overton Virtually all modern computers follow the IEEE 2 floating point standard in their representation of floating point numbers. The Java programming language

More information

MA651 Topology. Lecture 4. Topological spaces 2

MA651 Topology. Lecture 4. Topological spaces 2 MA651 Topology. Lecture 4. Topological spaces 2 This text is based on the following books: Linear Algebra and Analysis by Marc Zamansky Topology by James Dugundgji Fundamental concepts of topology by Peter

More information

CS152: Programming Languages. Lecture 11 STLC Extensions and Related Topics. Dan Grossman Spring 2011

CS152: Programming Languages. Lecture 11 STLC Extensions and Related Topics. Dan Grossman Spring 2011 CS152: Programming Languages Lecture 11 STLC Extensions and Related Topics Dan Grossman Spring 2011 Review e ::= λx. e x e e c v ::= λx. e c τ ::= int τ τ Γ ::= Γ, x : τ (λx. e) v e[v/x] e 1 e 1 e 1 e

More information

M301: Software Systems & their Development. Unit 4: Inheritance, Composition and Polymorphism

M301: Software Systems & their Development. Unit 4: Inheritance, Composition and Polymorphism Block 1: Introduction to Java Unit 4: Inheritance, Composition and Polymorphism Aims of the unit: Study and use the Java mechanisms that support reuse, in particular, inheritance and composition; Analyze

More information

Principles of AI Planning. Principles of AI Planning. 7.1 How to obtain a heuristic. 7.2 Relaxed planning tasks. 7.1 How to obtain a heuristic

Principles of AI Planning. Principles of AI Planning. 7.1 How to obtain a heuristic. 7.2 Relaxed planning tasks. 7.1 How to obtain a heuristic Principles of AI Planning June 8th, 2010 7. Planning as search: relaxed planning tasks Principles of AI Planning 7. Planning as search: relaxed planning tasks Malte Helmert and Bernhard Nebel 7.1 How to

More information

4 Integer Linear Programming (ILP)

4 Integer Linear Programming (ILP) TDA6/DIT37 DISCRETE OPTIMIZATION 17 PERIOD 3 WEEK III 4 Integer Linear Programg (ILP) 14 An integer linear program, ILP for short, has the same form as a linear program (LP). The only difference is that

More information

An Introduction to Heap Analysis. Pietro Ferrara. Chair of Programming Methodology ETH Zurich, Switzerland

An Introduction to Heap Analysis. Pietro Ferrara. Chair of Programming Methodology ETH Zurich, Switzerland An Introduction to Heap Analysis Pietro Ferrara Chair of Programming Methodology ETH Zurich, Switzerland Analisi e Verifica di Programmi Universita Ca Foscari, Venice, Italy Outline 1. Recall of numerical

More information

Conditional Elimination through Code Duplication

Conditional Elimination through Code Duplication Conditional Elimination through Code Duplication Joachim Breitner May 27, 2011 We propose an optimizing transformation which reduces program runtime at the expense of program size by eliminating conditional

More information

SFWR ENG 3S03: Software Testing

SFWR ENG 3S03: Software Testing (Slide 1 of 52) Dr. Ridha Khedri Department of Computing and Software, McMaster University Canada L8S 4L7, Hamilton, Ontario Acknowledgments: Material based on [?] Techniques (Slide 2 of 52) 1 2 3 4 Empirical

More information

Programming Languages and Compilers Qualifying Examination. Answer 4 of 6 questions.1

Programming Languages and Compilers Qualifying Examination. Answer 4 of 6 questions.1 Programming Languages and Compilers Qualifying Examination Monday, September 19, 2016 Answer 4 of 6 questions.1 GENERAL INSTRUCTIONS 1. Answer each question in a separate book. 2. Indicate on the cover

More information

Fast algorithms for max independent set

Fast algorithms for max independent set Fast algorithms for max independent set N. Bourgeois 1 B. Escoffier 1 V. Th. Paschos 1 J.M.M. van Rooij 2 1 LAMSADE, CNRS and Université Paris-Dauphine, France {bourgeois,escoffier,paschos}@lamsade.dauphine.fr

More information

A proof-producing CSP solver: A proof supplement

A proof-producing CSP solver: A proof supplement A proof-producing CSP solver: A proof supplement Report IE/IS-2010-02 Michael Veksler Ofer Strichman mveksler@tx.technion.ac.il ofers@ie.technion.ac.il Technion Institute of Technology April 12, 2010 Abstract

More information

1 Linear programming relaxation

1 Linear programming relaxation Cornell University, Fall 2010 CS 6820: Algorithms Lecture notes: Primal-dual min-cost bipartite matching August 27 30 1 Linear programming relaxation Recall that in the bipartite minimum-cost perfect matching

More information

Directed Model Checking for PROMELA with Relaxation-Based Distance Functions

Directed Model Checking for PROMELA with Relaxation-Based Distance Functions Directed Model Checking for PROMELA with Relaxation-Based Distance Functions Ahmad Siyar Andisha and Martin Wehrle 2 and Bernd Westphal Albert-Ludwigs-Universität Freiburg, Germany {andishaa,westphal}@informatik.uni-freiburg.de

More information

Checks and Balances - Constraint Solving without Surprises in Object-Constraint Programming Languages: Full Formal Development

Checks and Balances - Constraint Solving without Surprises in Object-Constraint Programming Languages: Full Formal Development Checks and Balances - Constraint Solving without Surprises in Object-Constraint Programming Languages: Full Formal Development Tim Felgentreff, Todd Millstein, Alan Borning and Robert Hirschfeld Viewpoints

More information

On Meaning Preservation of a Calculus of Records

On Meaning Preservation of a Calculus of Records On Meaning Preservation of a Calculus of Records Emily Christiansen and Elena Machkasova Computer Science Discipline University of Minnesota, Morris Morris, MN 56267 chri1101, elenam@morris.umn.edu Abstract

More information

CITS5501 Software Testing and Quality Assurance Formal methods

CITS5501 Software Testing and Quality Assurance Formal methods CITS5501 Software Testing and Quality Assurance Formal methods Unit coordinator: Arran Stewart May 1, 2018 1 / 49 Sources Pressman, R., Software Engineering: A Practitioner s Approach, McGraw-Hill, 2005

More information

Programming Languages Lecture 15: Recursive Types & Subtyping

Programming Languages Lecture 15: Recursive Types & Subtyping CSE 230: Winter 2008 Principles of Programming Languages Lecture 15: Recursive Types & Subtyping Ranjit Jhala UC San Diego News? Formalize first-order type systems Simple types (integers and booleans)

More information

Hoare logic. A proof system for separation logic. Introduction. Separation logic

Hoare logic. A proof system for separation logic. Introduction. Separation logic Introduction Hoare logic Lecture 6: Examples in separation logic In the previous lecture, we saw how reasoning about pointers in Hoare logic was problematic, which motivated introducing separation logic.

More information

Subtyping. Lecture 13 CS 565 3/27/06

Subtyping. Lecture 13 CS 565 3/27/06 Subtyping Lecture 13 CS 565 3/27/06 Polymorphism Different varieties of polymorphism: Parametric (ML) type variables are abstract, and used to encode the fact that the same term can be used in many different

More information