Program Analysis Course Notes

Ashok Sreenivas, 2008

1. Background / overview

1.1 Course overview

- Introduction: what and why of program analysis
- Background and program analysis techniques
  - Lattice theory
  - Data flow analysis
  - Abstract interpretation
  - Non-standard type inference
  - Inter-procedural analysis
- Analysis I: Identifying equivalent expressions
  - Different approaches
  - Relative merits, demerits
- Analysis II: Pointer analysis
  - Theoretical complexities
  - Families of algorithms

1.2 Program analysis: what and why

What is program analysis?
- Infer properties of a given program.
- Analogies to other kinds of analysis (Shakespeare's poetry or an airplane).
- What do we mean by properties?
  - Syntactic properties: analogous to physical properties. Not of interest in this course.
  - Semantic properties: properties that hold when the program runs. Similar to a flying plane or to understanding Shakespeare. Much more interesting and relevant.
  - Of course, we want to find properties without running the program. (Why?)

Why should we study program analysis?
- Program verification
  - Ensure that a program meets its specifications.
  - Discover (all) invariants about the program.
- Property verification ("static debugging")
  - Relatively more modest aim of checking whether a given property holds for the program.
  - Works against partial specifications.
  - Examples: safety of file operations, array index overflows, etc.

- Program optimization
  - Finding properties which ensure that a given transformation on a program does not change its behaviour [e.g., eliminating constant computations].
  - Preferably, it should also make the program run faster!
- Translation validation
  - Does the object code of your program faithfully reflect the source?
  - Requires identifying and comparing properties across two languages!
- Program understanding, re-engineering etc.
  - Software engineering applications: very little completely new code is ever written.
  - Want to understand the program's behaviour, perhaps in pieces, perhaps under specific conditions.
  - Useful to maintain legacy systems and re-target them.
  - Analysis results primarily intended for human consumption (unlike in the other cases).

What do we mean by a program? Or, more precisely, what kind of programs are we talking about?
- Program analysis techniques analyze a class of programs, not one program (unlike a compiler).
- The class of programs is defined by a language or semantic model.
  - Languages that are syntactically different but (almost) similar semantically can be treated similarly.
  - Obviously, the difficulty of analysis is proportional to the complexity of the language / semantic model.
- For the purposes of this course, the semantic model broadly includes all imperative, and maybe object-oriented, programs.
  - In particular, it includes variables, assignments, control flow (sequence, if-then-else, loops).
  - Also includes pointers and procedures / functions (with parameters).
  - Aggregate types (structures, arrays etc.) are also considered.
  - It does not include higher-order functions or functions as first-class values (though it may include function pointers).

Example analyses: sign analysis, interval analysis.
- Sign analysis: useful if some value should never become negative (say, a temperature or pressure).
- Interval analysis: similarly, for critical values such as temperatures, pressures etc. Also very relevant to array index analysis, and therefore to security violations in web applications.

Sample program, with the solutions for both analyses at each program point (sign facts on the left, intervals on the right):

    <<x: unknown, y: unknown>>    <<x: [], y: []>>
    x = -10;
    <<x: -ve, y: unknown>>        <<x: [-10, -10], y: []>>
    y = 1;
    <<x: any, y: +ve>>            <<x: [-10, inf], y: [1, inf]>>
    while (x <= 100)
    <<x: any, y: +ve>>            <<x: [-10, inf], y: [1, inf]>>

    {
        <<x: any, y: +ve>>        <<x: [-10, 100], y: [1, inf]>>
        x = x + y;
        <<x: any, y: +ve>>        <<x: [-10, inf], y: [1, inf]>>
        y = y * 2;
        <<x: any, y: +ve>>        <<x: [-10, inf], y: [1, inf]>>
    }
    <<x: any, y: +ve>>            <<x: [101, inf], y: [1, inf]>>

Points to note:
- One has to design suitable abstractions for the analyses (a small sketch of one such abstraction appears at the end of this section).
- One needs information at each program point.
- The information should represent all possible executions.
- The idea of approximations: Why is x's sign <<any>> at the top of the while loop? Why is the interval of y [1, inf] at the top of the while loop? Can it be better?

1.3 Fundamentals

Underlying principles of program analysis
- The actual (or 'concrete') program works on concrete values giving concrete outputs, say a program with integer inputs and outputs.
- The questions you ask of the program, i.e. the properties of interest, are abstract. E.g., range of values, set of values (why a set?), signs etc.
- So the notion of abstraction is the key.
  - The concrete values are almost always abstracted (e.g. signs, intervals etc.), i.e. the domain on which the program operates is changed from concrete to abstract.
  - The program itself may also be abstracted to simplify the analysis.

Approximation
- Finding exact information about the program is often impossible.
  - Why? The halting problem is a program analysis problem! Many others are undecidable too.
  - Even if theoretically possible, it may be extremely hard, computationally intractable.
- Therefore, we will have to settle for approximate answers many times.
- Approximations introduce the notions of soundness and completeness: how correct is the approximation, and how good is it?

Soundness
- Everything inferred by the analysis is also true of the program: true in analysis => true in program.
  - Though everything true at run-time may not be inferred.
- Basically determines the direction of approximation.
- For interval analysis, the natural direction would be that it is OK to predict wider intervals.
  - Say x really takes on the values 11, 16, 19.

  - If the analysis says x: [0, 25], i.e. it is never the case that x takes values outside of this range, it is OK.
  - But if the analysis says x: [12, 18] or even x: [12, 28], it is unsound, because it says x never takes the value 11, which is wrong.
- But the notion of soundness depends on what you want to use the analysis information for.
  - Example (variable initialization):
    - If the application is to detect bugs arising from uninitialized variables, you want to report a superset of all actual uninitialized variables. That is, you want to catch all 'real' bugs and perhaps also some spurious 'bugs' that are not really so.
    - If the application is to pre-initialize uninitialized variables (rather than initializing when first encountered; assume this is more efficient or easy), then you want to report a subset of all uninitialized variables. That is, it is OK to miss a few uninitialized variables, because this is likely to save effort at run-time (if we caught a superset of uninitialized variables, some variables that are deemed uninitialized but are actually initialized would get initialized twice, which may not be desirable).
  - But most often, the direction of approximation for soundness is obvious.
    - Example: If the analysis predicts no invalid memory accesses or overflowing computations, the program is really free of such errors. If it points out potential invalid memory accesses or overflowing computations, these may or may not be so.

Completeness
- The converse, i.e. everything true in the program is also predicted by the analysis: true in program => true in analysis.
- The other direction of approximation.
  - Not everything predicted by the analysis may be true of the program!
  - If a complete analysis predicts an invalid memory access, then it really is an invalid memory access.
- In almost all situations, an analysis must be sound. Preferably, it should also be complete, but as we have seen this may be extremely hard or even impossible.
- Being sound and incomplete is also called being conservative or 'safe'.
- Being just sound (and horribly incomplete!) is always very easy: just pick the extreme solution!
  - Infinite intervals for interval analysis, 'any' (+/-) for sign analysis, every variable for the uninitialized-variables analysis etc.
  - But these are also completely useless solutions. Hence precision is important.

Precision
- Try to get as close to complete as possible without losing soundness.
  - Tighter intervals, fewer false alarms with uninitialized variables etc.
- Often there is a trade-off between precision and effort. Sometimes (very rarely) there may be a trade-off between soundness and effort.
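To make the 'suitable abstractions' point above concrete, here is a minimal sketch of the sign abstraction used in the earlier example: the abstract values, the join used where control-flow paths merge, and an abstract addition transfer function. The sketch is written in Python purely for illustration; the function names are mine, not from the notes, and a real analysis would of course cover all operators.

    # Minimal sketch of a sign abstraction (hypothetical helpers, not from the notes).
    # Abstract values: 'bot' (no info), '-', '0', '+', 'any'.

    def sign_of(n):
        """Abstraction of a single concrete integer."""
        return '0' if n == 0 else ('+' if n > 0 else '-')

    def join(a, b):
        """Least upper bound in the sign lattice: combines info from two paths."""
        if a == 'bot': return b
        if b == 'bot': return a
        return a if a == b else 'any'

    def abs_add(a, b):
        """Abstract counterpart of '+': sound but approximate."""
        if 'bot' in (a, b): return 'bot'
        if a == '0': return b
        if b == '0': return a
        if a == b and a in ('+', '-'): return a   # (+)+(+) = +, (-)+(-) = -
        return 'any'                              # mixed signs: result unknown

    # E.g. adding a negative and a positive value gives no sign information:
    assert abs_add(sign_of(-10), sign_of(1)) == 'any'
    # and merging '-' with '+' at a join point also yields 'any':
    assert join('-', '+') == 'any'

Note how abs_add loses information on mixed signs: exactly the kind of sound over-approximation discussed above.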

2. Program analysis techniques

2.1 Lattice theory

- The mathematical underpinning of the ideas of approximation, soundness etc. Useful / relevant in many analyses.
- A partial order, with the ordering relation representing the notion of approximation.
- Notions of joins, meets, product lattices, functions over lattices, monotonicity and fixed points. Details in lattice.pdf and lattice-others.pdf.

Example lattices (a small sketch of lattice operations in code follows this list):
- Consider the set S = {1, 2, 3}. (2^S, \subseteq, \cup, \cap, \emptyset, S) is the lattice of subsets of S ordered by the subset relation (which is a partial order).
  - The LUB (join) operator is union, as it gives the smallest element containing both of any two given elements. Similarly, the GLB (meet) operator is set intersection.
  - The bottom element is the empty set and the top element is the entire set S.
- Consider the set S = {1, 2, 3, 4, 6, 8, 12, 24} and the 'divides' relation (|). | is a partial order, since:
  - x | x \forall x (reflexivity)
  - x | y \wedge x \neq y => y does not divide x (antisymmetry)
  - x | y \wedge y | z => x | z (transitivity)
  - (S, |, LCM, GCD, 1, 24) is a complete lattice. LCM is the join/LUB operator, as it gives the least element 'larger' than (i.e. a multiple of) any two given elements under the chosen ordering. Similarly, GCD is the meet/GLB operator; the least (bottom) element is 1, which divides everything, and the greatest (top) element is 24, which everything divides.
  - Both these lattices can be extended to all natural numbers (> 0), but that would result in lattices of infinite height and infinite 'width'.
- Lattice of signs: {Bottom, +, 0, -, Any}
  - Bottom <= x \forall x; Any >= x \forall x; +, -, 0 unrelated to each other.
  - Note: one can have other sign lattices too, with elements such as non-negative, non-positive and non-zero to represent other classes of numbers.
  - A finite lattice (with obviously finite chains).
- Lattice of intervals
  - Elements of the form [x, y].
  - [x, y] <= [a, b] iff x >= a \wedge y <= b, i.e. a 'tighter' interval is lower than a looser one.
  - The bottom element is the empty interval (a special case for the <= relation defined above).
  - The top element is the complete interval [-inf, +inf].
  - Infinite height and width.
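A minimal sketch of the interval lattice just described, again in Python with a representation of my own choosing (pairs for intervals, None for the empty/bottom interval). It shows the ordering, join and meet defined above.

    from math import inf

    # An interval is a pair (lo, hi) with lo <= hi; None stands for the empty interval (bottom).
    BOT = None
    TOP = (-inf, inf)

    def leq(i1, i2):
        """i1 <= i2 in the interval lattice: i1 is a tighter (or equal) interval."""
        if i1 is BOT: return True
        if i2 is BOT: return False
        return i1[0] >= i2[0] and i1[1] <= i2[1]

    def join(i1, i2):
        """Least upper bound: the smallest interval containing both."""
        if i1 is BOT: return i2
        if i2 is BOT: return i1
        return (min(i1[0], i2[0]), max(i1[1], i2[1]))

    def meet(i1, i2):
        """Greatest lower bound: the intersection, or bottom if disjoint."""
        if i1 is BOT or i2 is BOT: return BOT
        lo, hi = max(i1[0], i2[0]), min(i1[1], i2[1])
        return (lo, hi) if lo <= hi else BOT

    # Merging the facts about a variable coming from two branches, say:
    assert join((-10, -10), (1, 5)) == (-10, 5)
    assert leq((1, 5), TOP)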

2.2 Analysis techniques

- Initially, focus only on single-procedure programs. Inter-procedural analysis is introduced later.
- Three approaches to program analysis:
  - Data flow analysis
  - Abstract interpretation
  - Non-standard type inference
- The three approaches are not independent, just different ways of looking at the problem. Sometimes ideas from multiple approaches work best.
- Running example program across all techniques:

    S0: ENTRY
    S1: x = 10
    S2: while (x < 100) do
    S3:   if (x > 0)
    S4:     x = x - 3
          else
    S5:     x = x + 2
          fi
    S6:   x = x * 4
        od
    S7: EXIT

- Example analysis: sign analysis. Five elements (bottom, 0, +, -, top/any/+-) ordered in the usual way.

2.3 Data flow analysis

- Developed primarily in the context of program optimization. References: Kildall 73, Hecht 77, ASU 86.
- Uses two abstractions:
  - The program is always abstracted to a control flow graph (see below).
  - The information abstraction depends on the analysis. This defines the desired lattice (called L).
- A control flow graph (CFG) abstracts a program. Description of CFG with example:
  - A node for every basic statement (can be extended to basic blocks).
  - An edge for every possible transfer of control.
  - Cycles in the presence of loops.
  - Not every path in the CFG may be a path in P (even if every branch in the CFG has an equivalent). (Why?)
- Develop (monotone) functions corresponding to the basic constructs of the program.
  - Each node type corresponds to a 'basic construct' of the language.

- Flow functions describe what happens to the property of interest when the program executes a given construct.
  - A set of monotone, abstract flow functions is defined, one for each node type: F.
- <L, F> defines a data flow problem.
- Given a program P and an <L, F> pair:
  - Build a CFG for P.
  - Instantiate F for each node in P.
  - Assume some information (usually 'no information') at program entry.
  - This results in a set of (mutually recursive) equations for the information at each program point. The information can be at the 'in' of a node, the 'out' of a node, or just one of them.
- The mutually recursive equation set can have multiple solutions!
  - For example, in interval analysis, all variables having the [-inf, +inf] interval is a solution, as is the computed (tighter) interval.
  - Lattice properties ensure the existence of a solution. See p 27 of lattice.pdf.
  - The least (greatest, in the data flow literature) fixed point gives us the best solution for the chosen abstraction.
- Two questions:
  - How to solve the mutually recursive system of equations?
  - What is the relationship between this solution and the desired analysis property?
- The solution can be computed through iterative or elimination approaches. Guaranteed to terminate if chains are finite. (Example worked through with the iterative approach; a sketch of an iterative worklist solver appears at the end of this subsection.)
- The concept of the meet-over-all-paths (MOP) solution:
  - Assume all paths are executable.
  - The desired solution at a point is the meet of the information reaching that point along all paths to it.
    - Meet is used to combine information.
    - Information along a path is just the composition of the individual flow functions.
  - The discovered fixed point is equal to the MOP solution if the analysis framework (i.e. all flow functions) is distributive.
    - Distributive framework: f(x \meet y) = (f x) \meet (f y)
  - Even if the framework is not distributive, the solution is conservative. That is, the discovered solution is a safe approximation of the MOP solution, i.e. <= the MOP solution in the upside-down data flow lattice.
- The classical 'separable' or 'bit-vector' problems:
  - Available expressions, reaching definitions, live variables, very busy expressions.
  - Each 'element' (expression, definition, variable, expression respectively) can be dealt with independently. That is, whether an expression is available or not etc. does not depend on any other expression.
- Note: there is a terminology confusion between data-flow analysis and the other semantics / analysis literature.
  - The semantics / abstract interpretation literature typically orders lattices such that 'smaller' elements represent more precise information; therefore the desired solution is the least fixed point, computed by repeated application of the function starting from the bottom element of the lattice, and the best approximation of any two elements is the join (LUB).
  - The data flow analysis literature typically orders lattices such that 'larger' elements represent more precise information; therefore the desired solution is the maximal fixed point, computed by repeated application of the function starting from the maximal element of the lattice, and the best approximation of any two elements is the meet (GLB). Hence the term MOP solution.
  - In other words, the two sets of literature view the lattice 'upside-down' with respect to each other.
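To make the iterative approach concrete, here is a minimal sketch of a forward worklist solver in Python. The graph and flow-function representation and all names are my own, not taken from the notes or any particular paper; it follows the join convention of the abstract-interpretation literature (bottom = 'no information') and assumes monotone flow functions over a lattice with finite chains, so that it terminates.

    def solve_forward(nodes, preds, flow, join, bot, entry, init):
        """
        Iterative (worklist) solution of forward data-flow equations:
            in[n]  = join of out[p] over predecessors p of n (plus init at entry)
            out[n] = flow[n](in[n])
        nodes : iterable of CFG node ids
        preds : dict node -> list of predecessor nodes
        flow  : dict node -> monotone transfer function on lattice values
        join  : least upper bound on the lattice; bot : bottom element
        init  : value assumed at the entry node
        """
        out = {n: bot for n in nodes}
        worklist = list(nodes)
        while worklist:
            n = worklist.pop()
            inp = init if n == entry else bot
            for p in preds[n]:
                inp = join(inp, out[p])
            new = flow[n](inp)
            if new != out[n]:
                # value grew in the lattice: record it and re-examine successors
                # (successors found by scanning preds; fine for a sketch)
                out[n] = new
                worklist.extend(s for s in nodes if n in preds[s])
        return out

For the sign analysis of the running example, flow[S1] would, for instance, map any incoming environment to one where x is bound to '+'.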

2.4 Abstract interpretation

(Use the running example.)

- Formally defines multiple levels of semantics. References: CoCo 77, CoCo 79, the Nielson-Nielson-Hankin (PPA) book.
- The lowest level of semantics describes actual program execution:
  - Semantic state-transformer functions are defined for each construct.
  - The overall program semantics is defined as a fixed point of (the composition of) these functions.
  - These functions operate over a concrete domain of values: integers, Booleans, physical memory locations etc.
  - Each domain is described by a (natural) lattice.
  - This is the concrete interpretation: the base abstract interpretation.
- Each analysis is now described as an abstract interpretation:
  - A (semi- or complete) lattice of abstract values (intervals, signs etc.) with its own ordering.
  - An abstract semantics, i.e. a semantic function for each construct that operates on the abstract values.
  - The analysis itself is now the semantics derived from these abstract functions.
  - A fixed-point computation determines the analysis solution. Existence of the fixed point, its computability etc. follow from lattice theory.
- The semantics can be described at any 'level' and in any 'form':
  - E.g. the CoCo 77 paper works on a 'trace-like' semantics, i.e. the semantics of flow-chart-like programs.
  - But it can also be a denotational semantics (or big-step semantics) etc.
  - Also see absint.pdf.
- Consistent abstract interpretations: to prove correctness of an analysis.
  - Define a pair of functions: an abstraction function (alpha) from the concrete to the abstract domain, and a concretization function (gamma) in the other direction.
  - These functions may introduce loss of information: gamma . alpha >= id; alpha . gamma <= id.
  - Alpha and gamma may form a Galois connection: the best abstraction.
  - Use these functions to show correctness, by showing that the information loss in alpha/gamma is consistent (commuting diagram on p 242 of the CoCo 77 paper), i.e. that the abstract results are safe approximations. (A small sketch of alpha/gamma for the interval domain follows.)
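As an illustration of the abstraction/concretization pair (not taken from the papers), here is a small Python sketch of alpha and gamma for the interval domain, reusing the interval representation from the earlier lattice sketch. For the sketch, gamma only enumerates bounded intervals; the asserts show gamma . alpha over-approximating a set of concrete values, while alpha . gamma loses nothing on the abstract side.

    from math import inf

    BOT = None  # the empty interval (bottom); an interval is otherwise a pair (lo, hi)

    def alpha(values):
        """Abstraction: the tightest interval containing a set of concrete integers."""
        return BOT if not values else (min(values), max(values))

    def gamma(interval):
        """Concretization: the set of integers an interval stands for.
        (For the sketch, only bounded intervals are enumerated.)"""
        if interval is BOT:
            return set()
        lo, hi = interval
        assert lo != -inf and hi != inf, "unbounded intervals denote infinite sets"
        return set(range(int(lo), int(hi) + 1))

    # gamma . alpha over-approximates the identity on concrete sets (gamma . alpha >= id):
    s = {11, 16, 19}
    assert s <= gamma(alpha(s))          # {11, ..., 19} is a superset of {11, 16, 19}
    # alpha . gamma does not lose precision on the intervals themselves:
    assert alpha(gamma((11, 19))) == (11, 19)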

- Often fixed points are impossible or expensive to compute: extremely long chains. Example: interval analysis.
- Means of approximating fixed points:
  - Widening operators (the name comes from interval widening, i.e. approximating) ensure that a safe approximation of the least fixed point is reached.
  - Narrowing operators get you back towards the least fixed point while always remaining a safe approximation.
  - Examples:
    - Widening operator (pp 246-247 of CoCo 77). The widening operator is not commutative, unlike the join operator!
    - Narrowing operator (pp 248-249 of CoCo 77, and example).
    - Diagrams explaining widening / narrowing (p 249).
- Note: the CFG of data flow analysis is itself an abstract interpretation (albeit a very low-level abstraction). It basically throws away some control flow information but keeps all the data flow information.

2.5 Non-standard type inference

(Use the running example.)

- A logic-based approach to analysis.
- Consider the problem of program analysis as a type inference problem, where the information we require corresponds to the types to be discovered (and the corresponding program constructs are the identifiers or variables to which we need to assign types).
- Such types should form an inclusion or subtype hierarchy: this corresponds to the information lattice.
- Similar to the flow functions of data flow analysis and the abstract semantic functions of abstract interpretation, you define typing rules.
  - Typing rules specify when a construct is well-typed, i.e. what is the best type assignment to the constituents of a construct that would make them mutually consistent.
  - Example: y + z is the construct, and (say) + is known to be an operator Arith x Arith -> Arith. Then the typing rule for the addition construct says the typing is consistent if y is Arith, z is Arith, and the expression itself is also Arith. For y < z, the last part changes to Bool.
- Since we often want information at each point in the program, there may be different types associated with each program point.
  - This in turn means that the type of a statement is really a type transformer, i.e. it modifies one type assignment into another, just as a flow function or semantic function does.
- Defining the analysis: similar to defining the lattice / flow functions / semantic functions.

  - Deciding on the set of types, their inclusion relationship, and the typing rules.
- Performing the analysis:
  - Finding the most general types (equivalent to the best approximations) under the given typing rules for a given program.
  - Done similarly to type inference algorithms such as Milner's.
  - Could be expensive, depending on the typing hierarchy.
  - Can have special subsumption rules to help approximate faster: similar to widening.
- Also see TypeSyst.pdf.

2.6 Inter-procedural analysis

- Analysis of programs with multiple procedures / functions raises issues different from the analysis of single-procedure programs.
- Two papers to go through:
  - one introducing the general techniques of inter-procedural analysis;
  - two, better algorithms for a specific class of analyses.

2.7 Two approaches to inter-procedural analysis

- The classic Sharir-Pnueli paper (1981) laying out the general techniques, in a data-flow analysis context.
- Consider an inter-procedural control flow graph (ICFG), with calls connected to procedure entries and returns connected to the node(s) after the call. Example on p 198, Fig 7-1 of the paper.
- Many inter-procedural paths in the ICFG are obviously wrong, since a return must go back to the corresponding call.
  - The set of inter-procedurally valid paths (IVPs) in an ICFG is a subset of the set of paths in the ICFG.
  - Considering all paths in the ICFG is sound but highly imprecise.
  - Formally, the set of paths in a CFG is generated by a regular language, while the set of IVPs in an ICFG is generated by a context-free language, because of the need to simulate a call stack.
  - There is also a need to handle scopes, lifetimes of variables, parameter passing etc. These are ignored for now, as they are easy to handle.
- Therefore, we need techniques that consider only IVPs rather than all paths. Broadly, two approaches address the problem:
  - the functional approach
  - the call strings approach
- Functional approach
  - Define functional equations for the information at a point in terms of the information at the entry of its procedure, along IVPs (see the paper).
  - Solve the functional equations to obtain solutions that are themselves functions.
  - Existence of the solution depends on the height of the function lattice. Approximation techniques can be used.

  - Having obtained functions for each point along IVPs from its procedure entry, the actual information at each point can be found using another set of recursive (non-functional) equations (see the paper), with an example in the paper.
  - Can show that this approach yields the MOP solution over IVPs if the functions are distributive, and that it yields a sound approximation if the functions are non-distributive. Proof in the paper.
  - Practical problem: representing the computed functions efficiently.
  - A purely iterative algorithm to implement the functional approach is also possible (algorithm and example in the paper).
    - Does not explicitly represent functions, but directly applies the functions (only) to the values occurring in the analysis.
    - But this does not necessarily make it cheap. In fact, it may not even converge, e.g. if the chains are infinitely high. It is, however, guaranteed to yield a correct result if it converges.
- The call strings approach
  - Resembles the iterative data flow approach.
  - Explicitly carry the call stack with the information. Propagate only the relevant information back along return edges, using the call strings at return nodes. (A small sketch of this call-string bookkeeping follows this list.)
  - Obvious problem when call strings are unbounded (due to recursion).
  - Formal definitions of call strings and their extensions for each edge in the CFG (p 212 of the paper).
  - Using these definitions, define an augmented data flow framework for the inter-procedural case: <L*, F*>.
    - L* consists of functions from call strings to L (equivalently, pairs of call strings and lattice values). This is the information at a point in the new framework, i.e. the information at a point is parametrized by its calling context; hence the name context-sensitive analysis.
    - F* consists of functions over L* and is derived from F and the properties of call strings. Functions for inter-procedural edges change only the call-string part; functions for intra-procedural edges change only the L part. F* should be closed under composition and meet, and contain the identity function.
    - Note: this framework depends on the ICFG and is not program-independent, unlike in the intra-procedural case.
  - Solving the resultant data flow equations yields the MOP solution over all IVPs.
  - After solving, the solutions for all the call strings are merged (through a join) to get the solution valid for all call sequences (eliminating the call strings in the process). See p 215, 7-12 of the paper. Proof given in the paper.
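A minimal sketch of the call-string bookkeeping referred to above, in Python and much simplified with respect to the paper: a call string is a tuple of call-site labels, a call edge pushes its site, and a return edge propagates information only when the top of the call string matches the corresponding call site. An optional bound k on the length (whose need is discussed next) is included.

    def extend_on_call(call_string, call_site, k=None):
        """Call edge: push the call site onto the call string.
        If a bound k is given, keep only the most recent k call sites (k-limiting)."""
        cs = call_string + (call_site,)
        return cs[-k:] if k is not None else cs

    def propagate_on_return(call_string, call_site):
        """Return edge: propagate only if this call string actually entered the
        procedure from 'call_site'; pop it off, otherwise drop the information."""
        if call_string and call_string[-1] == call_site:
            return call_string[:-1]
        return None   # not an inter-procedurally valid path for this return edge

    # Example: information tagged with call string ('c1', 'c2') returns through c2
    assert extend_on_call(('c1',), 'c2') == ('c1', 'c2')
    assert propagate_on_return(('c1', 'c2'), 'c2') == ('c1',)
    assert propagate_on_return(('c1', 'c2'), 'c3') is None

With a bound k the return treatment is actually subtler than shown here, since truncated call strings can correspond to several callers; the paper's definitions handle this precisely.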

  - The solution may not converge for recursive programs even if L is finite. But convergence is possible by choosing an appropriately finite subset of call strings.
    - A finite, prefix-closed subset of all call strings.
    - Basically, choose them long enough to allow convergence even if longer paths are possible: height of lattice * longest cycle in the call graph, or for simplicity, size of lattice * number of calls.
    - But the size of the call string set may still be too huge.
    - Not so, however, in the case of the so-called separable problems, where the effective height of the lattice is 1! (Example in the paper.)

2.8 Special case of inter-procedural analysis

- The Reps-Horwitz-Sagiv paper of 1995.
- For a special sub-class of problems, one can have efficient algorithms for precise inter-procedural analysis.
  - The class is defined as the set of problems which have distributive transfer functions and finite data-flow facts, i.e. the transfer functions are from (finite) sets of facts to (finite) sets of facts, and the meet operator is either union or intersection.
  - Inter-procedural, finite, distributive, subset (IFDS) problems. Transfer (flow) functions are associated with edges.
  - This includes the classical bit-vector or separable problems, and also problems such as copy-constant propagation, possibly-uninitialized variables etc.
- Precise inter-procedural analysis for IFDS problems is reduced to an equivalent problem of graph reachability over IVPs (or equivalent). Adapts the functional approach of Sharir-Pnueli.
- The inter-procedural control flow graph ('supergraph' in the paper) has 4 kinds of edges:
  - normal intra-procedural edges
  - edges from call nodes to entry nodes
  - edges from exit nodes to return nodes
  - edges from call nodes to return nodes
  - Example: Fig 1, p 3 of the paper.
- Each flow function f : 2^D -> 2^D is mapped to a binary relation R_f over (D \cup {0}), where 0 represents the empty set. R_f has at most (|D|+1)^2 elements. R_f is defined as follows:
  - (0, 0) \in R_f
  - \forall y \in f(\emptyset): (0, y) \in R_f
  - \forall y \in f({x}): (x, y) \in R_f if y \not\in f(\emptyset)
  - Basically, the bottom element maps to itself; the bottom element also maps to all those elements that are generated (i.e. obtained by applying f to the empty set, in other words, independent of the input); and if the singleton {x} maps to y, then (x, y) is also part of the relation.
  - Some examples on p 4, Sec 3 of the paper. (A small sketch of this construction follows.)
- Mapping from a representation relation back to a function is also easily possible:
  - [R](X) = ({y | \exists x \in X. (x, y) \in R} \cup {y | (0, y) \in R}) \ {0}
  - It is easy to see that [R_f] = f.
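A small Python sketch of the two mappings just defined, with data-flow facts represented as strings and 0 as the special element. The example flow function is only an assumed illustration in the style of the possibly-uninitialized-variables problem; for distributive functions the round trip [R_f] = f holds, as the assert checks on one input.

    ZERO = 0   # the special element standing for the empty set of facts

    def repr_relation(f, D):
        """Representation relation R_f of a distributive flow function f : 2^D -> 2^D."""
        R = {(ZERO, ZERO)}
        gen = f(frozenset())                      # facts generated independent of the input
        R |= {(ZERO, y) for y in gen}
        for x in D:
            for y in f(frozenset({x})):
                if y not in gen:                  # only input-dependent edges
                    R.add((x, y))
        return R

    def apply_relation(R, X):
        """[R](X): recover the function from its representation relation."""
        out = {y for (x, y) in R if x in X} | {y for (z, y) in R if z == ZERO}
        return out - {ZERO}

    # Example in the style of 'possibly uninitialized': kills fact 'x', generates fact 'g'
    D = {'x', 'y', 'g'}
    f = lambda S: (set(S) - {'x'}) | {'g'}
    R = repr_relation(f, D)
    assert apply_relation(R, {'x', 'y'}) == f({'x', 'y'}) == {'y', 'g'}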

- Composition of two flow functions also maps to composition of the corresponding relations:
  - R_f ; R_g = { (x, y) | \exists z. (x, z) \in R_f \wedge (z, y) \in R_g }
  - Therefore path functions are compositions of relations, i.e. [R_f ; R_g] = g \circ f.
  - In other words, if the relation is expressed as a graph, composing path functions is equivalent to tracing a path in the graph!
- Translating the IFDS problem to a graph reachability problem:
  - Associated with every flow graph node are |D|+1 points, corresponding to the elements of D and 0.
  - The 'exploded supergraph' is the graph whose nodes are pairs consisting of an original ICFG node and a data-flow fact.
  - Corresponding to the flow function f of every edge, connect the corresponding points as defined by R_f.
  - Assuming no information is available at the entry of main, the solution to the IFDS problem is simply the set of points reachable from the point <entry of main, 0> along IVPs.
- But how to determine reachability along IVPs? Done by a worklist algorithm using path edges and summary edges, similar in spirit to the Sharir-Pnueli functional approach but without actually computing functions.
  - A path edge is an edge of the form <e_p, d1> to <n, d2>, where e_p is the entry of the procedure containing node n. It indicates that there is an IVP from <entry of main, 0> to <e_p, d1>, and a same-level IVP from <e_p, d1> to <n, d2>. In other words, the data-flow fact at the target of the path edge is part of the IFDS solution at that node.
  - Summary edges similarly capture the effect of a procedure, i.e. they are edges of the form <c, d1> to <r, d2>, where c is a call node and r its return node. A summary edge represents information that (may have) passed through the called procedure.
  - Algorithm for the computation of path and summary edges in Fig 3 (page 7) of the paper.
  - Detailed example with the possibly-uninitialized-variables problem in Fig 1 (page 3) and Fig 2 (page 5).
- Some special cases, such as h-sparse and separable problems, lead to more efficient algorithms. Complexities of all are given in Table 5.2 (page 9). The worst case is O(E D^3); this becomes O(E D) for separable problems.
- 'Real' analyses of course have to contend with many more issues / constructs!
  - Arrays
  - Records / structures
  - Polymorphism
  - Sub-typing / inheritance
  - Pointers
  - ...

2.9 Analysis problems to focus on

Equivalence between expressions
- General question: does the expression x + 2y have the same value at a program point as the expression c - 3b + d?
- (Obviously!) undecidable. Still useful to find approximate solutions and solve special cases, as there are many applications:
  - Software verification: does the assertion x + y = 3 * z hold at this point?
  - Constant propagation: does variable x have the constant value c? Or, does x - c = 0 hold? Replace the variable with the constant, if so.
  - Copy propagation: do two variables hold the same value? Or, does x - y = 0 hold? Can be used in, say, efficient register usage.
  - Common sub-expression elimination: do x+y and c+d have the same values, and has one of them already been computed? If so, use it in place of the other.

Alias / pointer analysis
- Alias analysis: do two names refer to the same location? E.g., <*p, *q> if p, q are pointers to the same type and can point to the same memory location (stack or heap) at that point in execution.
- Points-to analysis: what are the names to which a pointer may point? E.g., <p, x>, <q, x> if both p and q may point to x at a point in execution.

3. Herbrand equivalence analysis

3.1 Problem definition

- Most general problem statement: during the execution of the program, find the relationships between the values of expressions that hold at each point. E.g. at point p, x^3 - 2xy + 3yz - 23 <= 0.
  - Similar to discovering program invariants!
  - Obviously undecidable in its most general form. There are many simpler variants.
- One variant: finding relations (equality and inequality) among linear expressions. Essentially, restrict the kind of expressions being dealt with.
- Herbrand equivalence of expressions
  - Operators in the expressions are uninterpreted, so only structural equivalence is checked.
  - Herbrand equivalence => expression equivalence, but not the other way round, i.e. sound but incomplete, as desired.
  - Can be solved 'precisely', even though this is expensive.

- Expression x+y is equivalent to a+b only if both can be reduced to structurally equivalent expressions.
  - Say, at the point in question, x has value c and y has value d+e, while a has value c+d and b has value e.
  - Both reduce to c+d+e and hence are Herbrand equivalent.
  - Is this right? Not really, as it needs the knowledge that + is associative!
- Applications:
  - (Copy) constant propagation
  - Common sub-expression elimination
  - Invariant code motion
  - Detecting / verifying invariants

3.2 Cousot-Halbwachs (1978)

- Focuses on linear restraints (which include linear equalities and inequalities), so it is closer to invariant discovery, but restricted to linear expressions.
  - Tries to find relationships such as x + 3y - z <= 0.
- Uses the abstract interpretation approach.
- In a sense, completely orthogonal to Herbrand equivalences: here the operators are interpreted.
  - Subsumes Herbrand equivalences, and therefore constant propagation, available expressions etc., but only for linear expressions.
- Lattice of linear restraints with a geometric representation:
  - Each restraint (i.e. equality or inequality) is represented by (an approximation of) the set of points allowed by it.
  - Geometric interpretation (one 'dimension' per variable); 'continuous' domains. Example of the space for a set of restraints on p 86 of the paper.
  - Partial ordering by geometric inclusion: if the region of one restraint subsumes the region of another, it is more approximate; basically, it allows more values.
  - A lattice of infinite height (and width!), so widening is needed to converge.
  - Determining the region from a set of restraints (and vice versa) requires a lot of complex math: polyhedra, convex hulls, frames and so on. Usually approximations are used, as the exact intersection of two restraints may be impossible or very hard to find.
  - The intersection of two regions represents the merged (meet) information.
- The abstract semantic functions describe how the state of restraints is transformed by each statement.
  - The semantic function for assignment has to consider different possibilities, such as assignment of non-linear expressions, linear expressions etc.
  - The semantic functions are given on pp 90-93 of the paper, for flow-chart programs.

  - Assignment of a non-linear expression to x means we know nothing about the value of x, so the other relationships involving x have to be modified to eliminate x: cutting out one dimension!
  - Assignment of a linear expression does not eliminate a dimension, but requires complex geometric jugglery.
  - Similarly, linear and non-linear equality and inequality tests also change the set of restraints.
- Start with the initial restraints (relationships among the input parameters) and keep applying the semantic functions (with widening) until the result saturates.
  - Example from the paper on p 94 (Sec 5), with pictorial intuition of how it works, including widening.
- An expensive but comprehensive technique.
  - The example given in the paper (p 84) finds a whole bunch of inequalities for bubble sort.
  - Shows the power of program analysis, even if not practically feasible.

3.3 Kildall's paper (1973)

- The paper proposed the idea of a general data flow framework and the meet-over-all-paths solution.
- One of the problems addressed was common sub-expression elimination: if the equivalent of an expression has already been computed, do not compute it again.
- At each program point, compute a set of equivalent expressions.
- Abstract lattice:
  - A partitioning of the expressions (computed so far), each equivalence class containing equivalent expressions.
  - The ordering corresponds to partition refinement: P1 <= P2 if P1 is a 'coarser' partitioning, i.e. 'more expressions are equivalent' in P1.
  - Example: {{a,b,c,d}, {e,f}} <= {{a,b}, {c,d}, {e,f}}, but {{a,b,c}, {d,e,f}} and {{a,b}, {c,d}, {e,f}} are unrelated partitionings!
- The join is difficult (details on p 198 of the paper; a small sketch of this join appears at the end of this subsection):
  - Prune the equivalence classes to the relevant ones.
  - Identify the expressions common to the two partitionings.
  - For each common expression, intersect the corresponding partitions to get the partitions of the joined partitioning.
- The flow function works on the equivalence classes, depending on the computations inside a node (p 198 of the paper).
  - Assume a partitioning P at the entry to a node N.
  - For each (partial) computation exp at N, if exp is already in some partition of P, it is redundant.
  - Else, create a new partition for exp.

  - Also have to add other elements to exp's partition depending on the equivalences of exp's sub-expressions.
    - Infer new expressions to be added to the equivalence classes to make them complete ('structuring' the equivalence classes).
    - E.g., if exp is a+b, a is in a partition with c+d, and b is in a partition with e+f, then we have to add all of a+(e+f), (c+d)+b and (c+d)+(e+f) to the partition containing exp.
    - These operations make the analysis hard.
  - If the node has an assignment (say v = exp):
    - Remove all expressions containing v from their partitions.
    - For all expressions exp' that have exp as a sub-expression, create a new entry in the partition with exp replaced by v.
  - The flow function is also distributive!
- Can easily add constant propagation to this flow function:
  - Add constants to the equivalence classes as well.
  - If the operands of an expression are in equivalence classes with constants, then compute the expression itself.
  - If the computed constant has a class of its own, add this expression to that class; else add the constant to the class of the expression.
- Basic complexity is exponential, because of trying to deal with partitions and trying to complete partitions.
  - Basically, one has to look at all possible ways of combining operands based on the equivalence classes of the operands.
  - The meet is also expensive.
- Global value numbering
  - Number the partitions and represent expressions by operators operating on partition numbers.
  - Decreases the number of expressions in a partition and brings down the cost. Small examples of the decrease in partition size on p 203 of the paper.
  - The meet becomes more complex: we need to recover the 'hidden' information from the value-numbered expressions. Details on p 203 of the paper if required. And the complexity remains as bad.
- Example on p 235 (Fig 1) and p 236 (Fig 2) of the Ruthing-Knoop-Steffen (RKS) paper, and discussion there of the difficulties with Kildall's approach.
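A minimal sketch of the join (confluence) of two expression partitionings described above, in Python, with a partitioning represented simply as a list of frozensets of expression strings (my own encoding, not Kildall's). It keeps only the expressions common to both partitionings and intersects the classes of each common expression; the pruning and 'structuring' steps of the full analysis are ignored here.

    def join_partitions(p1, p2):
        """Join of two expression partitionings at a confluence point:
        an expression survives only if it occurs in both, and two expressions
        stay equivalent only if they are equivalent in both partitionings."""
        common = set().union(*p1) & set().union(*p2)
        classes = []
        for e in common:
            c1 = next(c for c in p1 if e in c)    # e's class on the first path
            c2 = next(c for c in p2 if e in c)    # e's class on the second path
            cls = c1 & c2
            if cls not in classes:
                classes.append(cls)
        return classes

    # On one path a, b and c+d are all equivalent; on the other only a and b are:
    p1 = [frozenset({'a', 'b', 'c+d'}), frozenset({'e'})]
    p2 = [frozenset({'a', 'b'}), frozenset({'c+d', 'e'})]
    expected = {frozenset({'a', 'b'}), frozenset({'c+d'}), frozenset({'e'})}
    assert set(join_partitions(p1, p2)) == expected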

3.4 Alpern-Wegman-Zadeck (1988)

- A simple, cheap algorithm to detect some, but not all, equivalences between variables (and expressions).
- A single global data structure suffices, unlike Kildall's data flow framework.
- Uses an SSA (static single assignment form) representation:
  - Replace each variable by many copies, so that there is only one assignment to each copy.
  - Introduce merge functions (non-deterministic phi functions) at join points.
  - Phi functions are subscripted by the node to which they belong (Fig 3 on p 3 of the paper). Otherwise we might detect two phi-expressions as equivalent just because they have equivalent operands, even if the conditions under which the branching (corresponding to the phi) occurred were different.
- Given the SSA form, a value graph is built for the program that describes how the SSA variables are related.
  - So, if x1 = y0 + z1 and z1 = x0 * 3, there would be a + node labelled with x1, with edges to the nodes labelled y0 and z1; the z1 node would have operator * and edges to the nodes labelled x0 and 3.
- Given the value graph representation, the algorithm does a partition refinement (along the lines of FSA minimization) to merge or collapse isomorphic parts of the graph. (A small sketch of this partitioning follows this list.)
  - Two nodes with the same operator and dependences are collapsed into one partition (and carry the labels of both nodes).
  - Partitioning algorithm in Fig 7 (p 5) of the paper:
    - Begin by putting all expressions with a common root operator in the same partition.
    - Then, for each partition of expressions of a particular operator: if the corresponding (m'th) operands are not in the same partition, split the original partition so that all operands of the operators in one partition come from the same partition.
    - Keep doing this with a worklist until no more partitions can be split.
- After partitioning, variables sharing a node are said to be congruent.
  - Congruence implies equivalence (not the other way around). That is, sound but not complete.
  - Slightly super-linear, so cheap to do.
- Fig 3, 4 (Sec 4) of the RKS paper give an example where this algorithm works; Fig 5, 6 (Sec 4, p 240) give an example where it does not.
  - Because phi functions are also treated as uninterpreted operators, right at the beginning (the most optimistic point) an expression rooted at a phi can never match an expression rooted at a proper operator, meaning the two can never later be merged, even if they are the same inside the phi.
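A minimal Python sketch of the optimistic congruence partitioning idea, with my own encoding of the value graph (the paper's algorithm uses a more careful splitting strategy along the lines of FSA minimization; this sketch just iterates a naive refinement until the partition stabilizes).

    def congruence_classes(value_graph):
        """
        value_graph: dict node -> (operator, (operand nodes...)); leaves have () operands.
        Phi operators would simply be operators like ('phi_N', ...), subscripted by node.
        Returns a partition of the nodes such that nodes in the same class have the same
        operator and congruent corresponding operands.
        """
        # Optimistic start: one class per (operator, arity)
        cls = {n: (op, len(args)) for n, (op, args) in value_graph.items()}
        changed = True
        while changed:
            changed = False
            # Refine: a node's new class key also records the classes of its operands
            new = {n: (cls[n], tuple(cls[a] for a in value_graph[n][1]))
                   for n in value_graph}
            if len(set(new.values())) != len(set(cls.values())):
                changed = True
            cls = new
        classes = {}
        for n, key in cls.items():
            classes.setdefault(key, set()).add(n)
        return list(classes.values())

    # x1 = a0 + b0 and y1 = a0 + b0  =>  x1 and y1 end up congruent
    g = {'a0': ('leaf_a', ()), 'b0': ('leaf_b', ()),
         'x1': ('+', ('a0', 'b0')), 'y1': ('+', ('a0', 'b0'))}
    assert any({'x1', 'y1'} <= c for c in congruence_classes(g))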

3.5 Ruthing, Knoop, Steffen (1999)

- An improvement on the AWZ paper.
- One problem with AWZ is that it misses many equivalences, since phi nodes are also left uninterpreted.
- Overcomes this lacuna of AWZ by interpreting phi nodes, i.e. distributing the phi over an operator.
  - Presents two simple graph rewrite rules to do this. Calls them normalization rules, as the aim is to convert the value graph to a normal form.
  - One rewrite rule just eliminates a phi node both of whose operands are the same; the variables associated with the phi are moved to the operand.
  - The other distributes the phi over an operator if both operands of the phi are rooted at the same operator.
  - Exact rules in Fig 7 (p 242) of the paper.
- The rewrite system consists of these two rules plus the graph partitioning exercise; the three together are repeatedly applied.
- The rewrite system is sound, because each of the three rules is.
- The rewrite system also has the nice desired properties of confluence and termination:
  - The rewriting process will terminate.
  - It does not matter in what order you apply the rules, or where in the graph. Given multiple choices for rule application, you can choose one randomly. This may of course affect how many steps it takes to converge!
  - Example in Fig 8 (p 243) for confluence.
- The rewrite system is also complete for acyclic programs. Proof in the paper.
  - Examples in Fig 9 and Fig 10 (p 244 of the paper): the failure cases for AWZ, which work here.
- Complexity: O(n^4 log n) in the worst case (repeated partitioning is the most expensive step); O(n^2 log n) expected in practice.
- Example where RKS is incomplete: can we detect that x and y have the same value after the first assignment inside the loop (even assuming we can perform computations involving constants)?

    x = 0;
    y = x + 1;
    while (C1) {
        x = x + 1;
        if (C2) {
            x = x + 1;
            y = y + 2;
        } else {
            x = x + 2;
            y = y + 3;
        }
    }

3.6 Probabilistic approaches (Gulwani-Necula)

- Both AWZ and RKS are incomplete but sound. Can we get completeness if we are ready to sacrifice soundness? That is what probabilistic approaches try to do.
- The probability of error should be very small: confidence in the results.

- Typically tuneable: the lower the error probability, the greater the cost. So you choose the cost-precision trade-off.

3.7 Discovering linear equalities

- Discovers relationships of the form a_1 x_1 + a_2 x_2 + ... + a_n x_n + c = 0, where the a_i's and c are constants.
  - So it only discovers a specific kind of relationship between variables: a generalization of constant propagation, not all Herbrand equivalences.
- The idea is extremely simple: just run the program on different sets of (randomly chosen) input values.
  - So, multiple parallel executions. Each execution results in a state at a point; the collection of states is called a sample.
  - At each branch point, ignore the condition and execute both branches!
  - At join points, combine the values obtained from the two branches using a freshly chosen random affine combination of weights. That is, values on one branch are given weight w and on the other are given weight (1 - w): an affine combination. One weight per state in the sample. (A small sketch of this affine join follows this list.)
  - Example in Fig 1, p 2 of the paper.
  - Multiple executions help decrease the probability of error, particularly for relationships of the form x = k.
- Geometric intuition behind the idea:
  - Each state is a point in n-dimensional space; each sample is a set of points in that space.
  - Merging two samples using affine combinations of weights is randomly choosing a point on the line connecting the two points representing the two states. So the merger of two samples picks a set of points on the lines joining corresponding pairs of states in the two samples.
  - Example in Fig 2, p 3 of the paper.
- Completeness: such affine combinations preserve any linear relationship (of the kind desired). And since the desired property is expected to be valid for all executions, it should be valid for the sample executions too! Lemma 1, p 4 of the paper, and the associated proof.
- 'Almost' soundness: there is a very low probability of satisfying any non-existent linear relationship. Lemma 2, p 4, and the associated proof; Schwartz's theorem (Theorem 1, p 3 of the second Gulwani-Necula paper).
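A minimal Python sketch of the random affine join of two states at a join point. The paper works over a large finite field; plain floats are used here only for illustration, and the names are mine. The assert checks that a linear relationship holding in both incoming states (here y = x + 1) is preserved by the combination, whatever weight is chosen.

    import random

    def affine_join(state1, state2):
        """Combine two states (dicts variable -> value) at a join point:
        pick one random weight w and take w*s1 + (1-w)*s2 component-wise."""
        w = random.random() * 1000 - 500        # a random weight (not restricted to [0,1])
        return {v: w * state1[v] + (1 - w) * state2[v] for v in state1}

    # y = x + 1 holds in both incoming states, so it holds in the joined state too:
    s1 = {'x': 2.0, 'y': 3.0}
    s2 = {'x': 7.0, 'y': 8.0}
    s = affine_join(s1, s2)
    assert abs(s['y'] - (s['x'] + 1)) < 1e-6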

- Relation to testing: testing is equivalent to choosing affine weights 0 and 1.
  - Example program in Fig 1, p 2 of the paper: 3 paths exhibit the relationship while only one doesn't. This is still easy to catch by this method, while testing will only find that the relationship does not hold if exactly that path is executed.
- Intersection of spaces: identifying relationships within branches (of linear conditions).
  - Example in Sec 5, p 4 of the paper, also in Fig 3.
  - Need to 'derive' a sample that satisfies the given condition from the current sample.
  - Geometric intuition: the points in the sample are 'projected' onto the hyperplane represented by the condition. But the projection is not orthogonal, so as to preserve linear relationships; instead, as given in Fig 4 (p 5):
    - Connect two samples with a (hyper)line.
    - Choose a point on this hyperline different from all other points and not on the hyperplane of the condition.
    - Draw hyperlines from this chosen point to all other points and take the intersections of these hyperlines with the hyperplane.
    - These points will all satisfy the linear relationships found so far and will also satisfy the chosen condition.
    - Note: in the process one point in the sample is 'sacrificed', as the original two samples both result in the same final point. Details in Fig 3, Fig 4.
  - But all this goes well beyond Herbrand equivalences, as you detect value-equivalent expressions and not just Herbrand-equivalent ones!
- Technical details regarding union of spaces, soundness, completeness and the fixed-point computation are available in the paper.

3.8 Global value numbering using random interpretation

- Extends the discovery of linear relationships to discovering Herbrand equivalences.
- Kildall's is exponential, AWZ is efficient but highly imprecise, Ruthing-Knoop-Steffen is in between. This one catches as much as Kildall in polynomial time, but is (probabilistically) unsound. The unsoundness probability is tuneable through parameters.
- The idea is to choose random interpretations for the operators and execute the program to discover relationships, rather than leaving the operators uninterpreted.
  - The previous paper chose random affine interpretations for the phi-operators and natural interpretations for the linear operators. Joins are still treated using affine combinations.
- An interpretation for each operator F can be made by choosing p parameters from a field L.
  - E.g. if p = 2 and F is binary, F(a,b) may be interpreted as p1 * a + p2 * b (where p1 and p2 are the two random parameters).
  - The interpretation should be linear, as it should distribute over the interpretation of the affine join (phi) combinations. Equation 5 on p 4 of the paper.
  - Unfortunately, this is not (probabilistically) sound, as two distinct functions can easily get the same interpretation (Fig 3, p 4 of the paper).
  - There is no point having more than 2 or 3 parameters for a binary operator with a linear interpretation, but these are too few to distinguish between the different leaves of a complex expression, as there could be more than 2 leaves in the expression.

- To overcome this, choose k parallel values for each variable (and therefore for each expression).
  - Uses 4k - 2 parameters for the interpretation, namely r_1..r_k, r'_1..r'_k, s_1..s_{k-1}, s'_1..s'_{k-1}.
  - The i'th (i between 1 and k) linear interpretation of F is defined by the following recurrence:
    - P(x, i) = x
    - P(F(e1,e2), 1) = r_1 * P(e1, 1) + r'_1 * P(e2, 1)
    - P(F(e1,e2), i) = r_i * P(e1, i) + r'_i * P(e2, i) + s_{i-1} * P(e1, i-1) + s'_{i-1} * P(e2, i-1)
  - The degree of P(e, i) is the same as the depth of e.
  - There is an implicit ordering among the parameters, i.e. P(e, i) does not contain r_{i+1}..r_k or s_i onwards (or their primed varieties).
  - It can be shown that this interpretation is sound, i.e. if P(e1, i) = P(e2, i) for i > j (where j is the log of the maximum number of leaves of e1 and e2), then e1 = e2 under Herbrand equivalence. Lemma 7, p 5 of the paper. Essentially, induction on the depth of the expression(s).
  - Work out the interpretations of the example in Fig 3 (p 4) of the paper to show that this interpretation does distinguish between the two non-Herbrand-equivalent expressions (if required).
  - Therefore, the maximum value of k should be the log of the depth of the largest expression in the program. This also appears very conservative, i.e. it can be smaller.
- This gives us a sound interpretation for each operator as a polynomial over the parameters, where soundness is defined as: if the interpretations of two expressions are equal, they are Herbrand equivalent. Essentially, it defines a non-standard semantics for Herbrand equivalence.
- The analysis proceeds by random interpretation, similar to the previous paper:
  - Run the program on a sample of size k.
  - Choose interpretations for each operator as shown earlier (by picking 4k - 2 parameters and an interpretation).
  - Compute the values of expressions, not necessarily by computing the polynomials first; the values can be computed directly. See function V on page 5 of the paper. (A small sketch of this computation follows this list.)
  - A sample S satisfies the Herbrand equivalence e1 = e2 if V(e1, k, S) = V(e2, k, S). Note: we compare the k'th values of the polynomials, as they have the greatest distinguishing power.
  - At join points, perform random affine joins.
- A fixed point (i.e. the same set of Herbrand equivalences being attained) is guaranteed, as the lattice has finite depth bounded by n, the number of program variables (page 7).
  - Intuition: all possible Herbrand equivalences can be represented by a pair (I, E), where I is a set of independent variables and E is a set of expressions of the form x = e, one for each non-independent variable, such that e contains only variables from I. So, if one set of Herbrand equivalences is 'less than' the other, then the 'lesser' one has fewer variables in I. So any chain length is bounded by the number of program variables. (Lemma 13 of the paper.)
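A small Python sketch of evaluating the k parallel random linear interpretations of a single uninterpreted binary operator F directly on values, following the recurrence above. The expression encoding (nested tuples), the parameter names r, rp, s, sp, and the use of plain integers instead of the paper's finite field are all my own simplifications.

    import random

    K = 3
    # 4K - 2 random parameters: r[1..K], rp[1..K], s[1..K-1], sp[1..K-1]
    rnd = lambda: random.randrange(1, 2**31 - 1)
    r  = {i: rnd() for i in range(1, K + 1)}
    rp = {i: rnd() for i in range(1, K + 1)}
    s  = {i: rnd() for i in range(1, K)}
    sp = {i: rnd() for i in range(1, K)}

    def V(e, i, sample):
        """i-th random linear interpretation of expression e.
        e is either a variable name (looked up in 'sample', which holds one value
        per variable per parallel run) or a tuple ('F', e1, e2) for the single
        uninterpreted binary operator F of this sketch."""
        if isinstance(e, str):
            return sample[e][i - 1]
        _, e1, e2 = e
        v = r[i] * V(e1, i, sample) + rp[i] * V(e2, i, sample)
        if i > 1:
            v += s[i - 1] * V(e1, i - 1, sample) + sp[i - 1] * V(e2, i - 1, sample)
        return v

    # F(x, y) and F(y, x) are not Herbrand equivalent; with very high probability
    # their k-th interpretations differ, while identical expressions always agree.
    sample = {'x': [rnd() for _ in range(K)], 'y': [rnd() for _ in range(K)]}
    assert V(('F', 'x', 'y'), K, sample) == V(('F', 'x', 'y'), K, sample)
    assert V(('F', 'x', 'y'), K, sample) != V(('F', 'y', 'x'), K, sample)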

- The error probability derives directly from Schwartz's theorem regarding the probability of random values being a root of a polynomial.
  - It turns out the error probability is d/L (for one union operation), where d is the degree of the polynomial and L is the size of the field from which random values are chosen. Obviously, the bigger L is, the better!
  - The probability of error of the whole analysis is <= (2n^2 + t) / L, if k >= (2n^2 + t), where n is the maximum of the number of variables, function applications and join points, and t is the maximum depth of expressions.
  - The probability can be decreased still further by performing the random interpretation m times, bringing the error probability down to ((2n^2 + t) / L)^m.

4. Alias / Pointer analysis

4.1 Problem definition

- Do two names refer to the same object, or does a pointer point to a given object?
- Relevant for most analysis problems in modern languages: the first analysis whose results are used by other analyses.
- Examples:
  - Array overflow analysis, for both subscripts and arrays: a[*p], p[*q + 3]
  - Escape analysis for security: which objects are leaked?
  - Constant propagation / common sub-expression elimination etc.
- Flavours of the problem
  - The exact definition will depend on the particular language's semantics. For example, C/C++ with explicit pointers that may point to the stack or heap, versus Java, which has only object references that point only to the heap.
- Soundness
  - Typically we are interested in a superset of the 'actual' aliases. That is, it is OK to say that a and b are aliased when they are not, but it is not OK to miss the alias pair (a, b) if there may be an execution on which a and b are aliased.
  - Because missing such aliases may result in unsound predictions. For example, one might conclude that an array index is never out of bounds because a possible alias pair, and hence a possible array index value, was missed.

4.2 Theoretical complexities

- Bill Landi / Barbara Ryder: complexities of aliasing problems in languages like C.
- <a, b> is an alias pair at a program point if a and b refer to the same location. Typically, each of a and b is something of the form *p, or x, or p->left, etc.
- A binary relation over the space of names. A reflexive and symmetric relation, but not transitive! Why? (<a, b> and <b, c> could hold along different paths, so <a, c> may never hold!)


More information

/633 Introduction to Algorithms Lecturer: Michael Dinitz Topic: Sorting lower bound and Linear-time sorting Date: 9/19/17

/633 Introduction to Algorithms Lecturer: Michael Dinitz Topic: Sorting lower bound and Linear-time sorting Date: 9/19/17 601.433/633 Introduction to Algorithms Lecturer: Michael Dinitz Topic: Sorting lower bound and Linear-time sorting Date: 9/19/17 5.1 Introduction You should all know a few ways of sorting in O(n log n)

More information

Chapter 4. Number Theory. 4.1 Factors and multiples

Chapter 4. Number Theory. 4.1 Factors and multiples Chapter 4 Number Theory We ve now covered most of the basic techniques for writing proofs. So we re going to start applying them to specific topics in mathematics, starting with number theory. Number theory

More information

Compiler Passes. Optimization. The Role of the Optimizer. Optimizations. The Optimizer (or Middle End) Traditional Three-pass Compiler

Compiler Passes. Optimization. The Role of the Optimizer. Optimizations. The Optimizer (or Middle End) Traditional Three-pass Compiler Compiler Passes Analysis of input program (front-end) character stream Lexical Analysis Synthesis of output program (back-end) Intermediate Code Generation Optimization Before and after generating machine

More information

Design and Analysis of Algorithms Prof. Madhavan Mukund Chennai Mathematical Institute. Week 02 Module 06 Lecture - 14 Merge Sort: Analysis

Design and Analysis of Algorithms Prof. Madhavan Mukund Chennai Mathematical Institute. Week 02 Module 06 Lecture - 14 Merge Sort: Analysis Design and Analysis of Algorithms Prof. Madhavan Mukund Chennai Mathematical Institute Week 02 Module 06 Lecture - 14 Merge Sort: Analysis So, we have seen how to use a divide and conquer strategy, we

More information

Sets 1. The things in a set are called the elements of it. If x is an element of the set S, we say

Sets 1. The things in a set are called the elements of it. If x is an element of the set S, we say Sets 1 Where does mathematics start? What are the ideas which come first, in a logical sense, and form the foundation for everything else? Can we get a very small number of basic ideas? Can we reduce it

More information

CSCI.6962/4962 Software Verification Fundamental Proof Methods in Computer Science (Arkoudas and Musser) Sections p.

CSCI.6962/4962 Software Verification Fundamental Proof Methods in Computer Science (Arkoudas and Musser) Sections p. CSCI.6962/4962 Software Verification Fundamental Proof Methods in Computer Science (Arkoudas and Musser) Sections 10.1-10.3 p. 1/106 CSCI.6962/4962 Software Verification Fundamental Proof Methods in Computer

More information

(Refer Slide Time: 4:00)

(Refer Slide Time: 4:00) Principles of Programming Languages Dr. S. Arun Kumar Department of Computer Science & Engineering Indian Institute of Technology, Delhi Lecture - 38 Meanings Let us look at abstracts namely functional

More information

Exact Algorithms Lecture 7: FPT Hardness and the ETH

Exact Algorithms Lecture 7: FPT Hardness and the ETH Exact Algorithms Lecture 7: FPT Hardness and the ETH February 12, 2016 Lecturer: Michael Lampis 1 Reminder: FPT algorithms Definition 1. A parameterized problem is a function from (χ, k) {0, 1} N to {0,

More information

Discrete Mathematics. Kruskal, order, sorting, induction

Discrete Mathematics.   Kruskal, order, sorting, induction Discrete Mathematics wwwmifvult/~algis Kruskal, order, sorting, induction Kruskal algorithm Kruskal s Algorithm for Minimal Spanning Trees The algorithm constructs a minimal spanning tree as follows: Starting

More information

Advanced Compiler Construction

Advanced Compiler Construction CS 526 Advanced Compiler Construction http://misailo.cs.illinois.edu/courses/cs526 INTERPROCEDURAL ANALYSIS The slides adapted from Vikram Adve So Far Control Flow Analysis Data Flow Analysis Dependence

More information

Cover Page. The handle holds various files of this Leiden University dissertation

Cover Page. The handle   holds various files of this Leiden University dissertation Cover Page The handle http://hdl.handle.net/1887/22891 holds various files of this Leiden University dissertation Author: Gouw, Stijn de Title: Combining monitoring with run-time assertion checking Issue

More information

The Encoding Complexity of Network Coding

The Encoding Complexity of Network Coding The Encoding Complexity of Network Coding Michael Langberg Alexander Sprintson Jehoshua Bruck California Institute of Technology Email: mikel,spalex,bruck @caltech.edu Abstract In the multicast network

More information

CS558 Programming Languages

CS558 Programming Languages CS558 Programming Languages Fall 2016 Lecture 3a Andrew Tolmach Portland State University 1994-2016 Formal Semantics Goal: rigorous and unambiguous definition in terms of a wellunderstood formalism (e.g.

More information

A Propagation Engine for GCC

A Propagation Engine for GCC A Propagation Engine for GCC Diego Novillo Red Hat Canada dnovillo@redhat.com May 1, 2005 Abstract Several analyses and transformations work by propagating known values and attributes throughout the program.

More information

A Note on Karr s Algorithm

A Note on Karr s Algorithm A Note on Karr s Algorithm Markus Müller-Olm ½ and Helmut Seidl ¾ ½ FernUniversität Hagen, FB Informatik, LG PI 5, Universitätsstr. 1, 58097 Hagen, Germany mmo@ls5.informatik.uni-dortmund.de ¾ TU München,

More information

Lattice Tutorial Version 1.0

Lattice Tutorial Version 1.0 Lattice Tutorial Version 1.0 Nenad Jovanovic Secure Systems Lab www.seclab.tuwien.ac.at enji@infosys.tuwien.ac.at November 3, 2005 1 Introduction This tutorial gives an introduction to a number of concepts

More information

Complexity Theory. Compiled By : Hari Prasad Pokhrel Page 1 of 20. ioenotes.edu.np

Complexity Theory. Compiled By : Hari Prasad Pokhrel Page 1 of 20. ioenotes.edu.np Chapter 1: Introduction Introduction Purpose of the Theory of Computation: Develop formal mathematical models of computation that reflect real-world computers. Nowadays, the Theory of Computation can be

More information

Types and Type Inference

Types and Type Inference Types and Type Inference Mooly Sagiv Slides by Kathleen Fisher and John Mitchell Reading: Concepts in Programming Languages, Revised Chapter 6 - handout on the course homepage Outline General discussion

More information

Calvin Lin The University of Texas at Austin

Calvin Lin The University of Texas at Austin Loop Invariant Code Motion Last Time SSA Today Loop invariant code motion Reuse optimization Next Time More reuse optimization Common subexpression elimination Partial redundancy elimination February 23,

More information

Lecture Notes on Contracts

Lecture Notes on Contracts Lecture Notes on Contracts 15-122: Principles of Imperative Computation Frank Pfenning Lecture 2 August 30, 2012 1 Introduction For an overview the course goals and the mechanics and schedule of the course,

More information

Lecture notes on the simplex method September We will present an algorithm to solve linear programs of the form. maximize.

Lecture notes on the simplex method September We will present an algorithm to solve linear programs of the form. maximize. Cornell University, Fall 2017 CS 6820: Algorithms Lecture notes on the simplex method September 2017 1 The Simplex Method We will present an algorithm to solve linear programs of the form maximize subject

More information

/633 Introduction to Algorithms Lecturer: Michael Dinitz Topic: Approximation algorithms Date: 11/27/18

/633 Introduction to Algorithms Lecturer: Michael Dinitz Topic: Approximation algorithms Date: 11/27/18 601.433/633 Introduction to Algorithms Lecturer: Michael Dinitz Topic: Approximation algorithms Date: 11/27/18 22.1 Introduction We spent the last two lectures proving that for certain problems, we can

More information

looking ahead to see the optimum

looking ahead to see the optimum ! Make choice based on immediate rewards rather than looking ahead to see the optimum! In many cases this is effective as the look ahead variation can require exponential time as the number of possible

More information

MA651 Topology. Lecture 4. Topological spaces 2

MA651 Topology. Lecture 4. Topological spaces 2 MA651 Topology. Lecture 4. Topological spaces 2 This text is based on the following books: Linear Algebra and Analysis by Marc Zamansky Topology by James Dugundgji Fundamental concepts of topology by Peter

More information

Lecture Notes on Alias Analysis

Lecture Notes on Alias Analysis Lecture Notes on Alias Analysis 15-411: Compiler Design André Platzer Lecture 26 1 Introduction So far we have seen how to implement and compile programs with pointers, but we have not seen how to optimize

More information

Abstract Interpretation

Abstract Interpretation Abstract Interpretation Ranjit Jhala, UC San Diego April 22, 2013 Fundamental Challenge of Program Analysis How to infer (loop) invariants? Fundamental Challenge of Program Analysis Key issue for any analysis

More information

Consistency and Set Intersection

Consistency and Set Intersection Consistency and Set Intersection Yuanlin Zhang and Roland H.C. Yap National University of Singapore 3 Science Drive 2, Singapore {zhangyl,ryap}@comp.nus.edu.sg Abstract We propose a new framework to study

More information

Trees. 3. (Minimally Connected) G is connected and deleting any of its edges gives rise to a disconnected graph.

Trees. 3. (Minimally Connected) G is connected and deleting any of its edges gives rise to a disconnected graph. Trees 1 Introduction Trees are very special kind of (undirected) graphs. Formally speaking, a tree is a connected graph that is acyclic. 1 This definition has some drawbacks: given a graph it is not trivial

More information

Discrete Optimization. Lecture Notes 2

Discrete Optimization. Lecture Notes 2 Discrete Optimization. Lecture Notes 2 Disjunctive Constraints Defining variables and formulating linear constraints can be straightforward or more sophisticated, depending on the problem structure. The

More information

Uncertain Data Models

Uncertain Data Models Uncertain Data Models Christoph Koch EPFL Dan Olteanu University of Oxford SYNOMYMS data models for incomplete information, probabilistic data models, representation systems DEFINITION An uncertain data

More information

Software Testing. 1. Testing is the process of demonstrating that errors are not present.

Software Testing. 1. Testing is the process of demonstrating that errors are not present. What is Testing? Software Testing Many people understand many definitions of testing :. Testing is the process of demonstrating that errors are not present.. The purpose of testing is to show that a program

More information

CS-XXX: Graduate Programming Languages. Lecture 9 Simply Typed Lambda Calculus. Dan Grossman 2012

CS-XXX: Graduate Programming Languages. Lecture 9 Simply Typed Lambda Calculus. Dan Grossman 2012 CS-XXX: Graduate Programming Languages Lecture 9 Simply Typed Lambda Calculus Dan Grossman 2012 Types Major new topic worthy of several lectures: Type systems Continue to use (CBV) Lambda Caluclus as our

More information

Symmetry in Type Theory

Symmetry in Type Theory Google May 29th, 2012 What is Symmetry? Definition Symmetry: Two or more things that initially look distinct, may actually be instances of a more general underlying principle. Why do we care? Simplicity.

More information

Hardware versus software

Hardware versus software Logic 1 Hardware versus software 2 In hardware such as chip design or architecture, designs are usually proven to be correct using proof tools In software, a program is very rarely proved correct Why?

More information

Programming Languages Third Edition

Programming Languages Third Edition Programming Languages Third Edition Chapter 12 Formal Semantics Objectives Become familiar with a sample small language for the purpose of semantic specification Understand operational semantics Understand

More information

Static Analysis: Overview, Syntactic Analysis and Abstract Interpretation TDDC90: Software Security

Static Analysis: Overview, Syntactic Analysis and Abstract Interpretation TDDC90: Software Security Static Analysis: Overview, Syntactic Analysis and Abstract Interpretation TDDC90: Software Security Ahmed Rezine IDA, Linköpings Universitet Hösttermin 2014 Outline Overview Syntactic Analysis Abstract

More information

Type Checking and Type Equality

Type Checking and Type Equality Type Checking and Type Equality Type systems are the biggest point of variation across programming languages. Even languages that look similar are often greatly different when it comes to their type systems.

More information

SORTING AND SELECTION

SORTING AND SELECTION 2 < > 1 4 8 6 = 9 CHAPTER 12 SORTING AND SELECTION ACKNOWLEDGEMENT: THESE SLIDES ARE ADAPTED FROM SLIDES PROVIDED WITH DATA STRUCTURES AND ALGORITHMS IN JAVA, GOODRICH, TAMASSIA AND GOLDWASSER (WILEY 2016)

More information

Static Program Analysis Part 9 pointer analysis. Anders Møller & Michael I. Schwartzbach Computer Science, Aarhus University

Static Program Analysis Part 9 pointer analysis. Anders Møller & Michael I. Schwartzbach Computer Science, Aarhus University Static Program Analysis Part 9 pointer analysis Anders Møller & Michael I. Schwartzbach Computer Science, Aarhus University Agenda Introduction to points-to analysis Andersen s analysis Steensgaards s

More information

Spring 2017 DD2457 Program Semantics and Analysis Lab Assignment 2: Abstract Interpretation

Spring 2017 DD2457 Program Semantics and Analysis Lab Assignment 2: Abstract Interpretation Spring 2017 DD2457 Program Semantics and Analysis Lab Assignment 2: Abstract Interpretation D. Gurov A. Lundblad KTH Royal Institute of Technology 1 Introduction In this lab assignment the abstract machine

More information

Formal Methods of Software Design, Eric Hehner, segment 24 page 1 out of 5

Formal Methods of Software Design, Eric Hehner, segment 24 page 1 out of 5 Formal Methods of Software Design, Eric Hehner, segment 24 page 1 out of 5 [talking head] This lecture we study theory design and implementation. Programmers have two roles to play here. In one role, they

More information

Matching Theory. Figure 1: Is this graph bipartite?

Matching Theory. Figure 1: Is this graph bipartite? Matching Theory 1 Introduction A matching M of a graph is a subset of E such that no two edges in M share a vertex; edges which have this property are called independent edges. A matching M is said to

More information

Lecture 3: Recursion; Structural Induction

Lecture 3: Recursion; Structural Induction 15-150 Lecture 3: Recursion; Structural Induction Lecture by Dan Licata January 24, 2012 Today, we are going to talk about one of the most important ideas in functional programming, structural recursion

More information

Unit 6 Chapter 15 EXAMPLES OF COMPLEXITY CALCULATION

Unit 6 Chapter 15 EXAMPLES OF COMPLEXITY CALCULATION DESIGN AND ANALYSIS OF ALGORITHMS Unit 6 Chapter 15 EXAMPLES OF COMPLEXITY CALCULATION http://milanvachhani.blogspot.in EXAMPLES FROM THE SORTING WORLD Sorting provides a good set of examples for analyzing

More information

Principles of Program Analysis: A Sampler of Approaches

Principles of Program Analysis: A Sampler of Approaches Principles of Program Analysis: A Sampler of Approaches Transparencies based on Chapter 1 of the book: Flemming Nielson, Hanne Riis Nielson and Chris Hankin: Principles of Program Analysis Springer Verlag

More information

2017 SOLUTIONS (PRELIMINARY VERSION)

2017 SOLUTIONS (PRELIMINARY VERSION) SIMON MARAIS MATHEMATICS COMPETITION 07 SOLUTIONS (PRELIMINARY VERSION) This document will be updated to include alternative solutions provided by contestants, after the competition has been mared. Problem

More information

On the Relationships between Zero Forcing Numbers and Certain Graph Coverings

On the Relationships between Zero Forcing Numbers and Certain Graph Coverings On the Relationships between Zero Forcing Numbers and Certain Graph Coverings Fatemeh Alinaghipour Taklimi, Shaun Fallat 1,, Karen Meagher 2 Department of Mathematics and Statistics, University of Regina,

More information

Analysis of Pointers and Structures

Analysis of Pointers and Structures RETROSPECTIVE: Analysis of Pointers and Structures David Chase, Mark Wegman, and Ken Zadeck chase@naturalbridge.com, zadeck@naturalbridge.com, wegman@us.ibm.com Historically our paper was important because

More information

Limitations of Algorithmic Solvability In this Chapter we investigate the power of algorithms to solve problems Some can be solved algorithmically and

Limitations of Algorithmic Solvability In this Chapter we investigate the power of algorithms to solve problems Some can be solved algorithmically and Computer Language Theory Chapter 4: Decidability 1 Limitations of Algorithmic Solvability In this Chapter we investigate the power of algorithms to solve problems Some can be solved algorithmically and

More information

Lecture Notes on Real-world SMT

Lecture Notes on Real-world SMT 15-414: Bug Catching: Automated Program Verification Lecture Notes on Real-world SMT Matt Fredrikson Ruben Martins Carnegie Mellon University Lecture 15 1 Introduction In the previous lecture we studied

More information

Subsumption. Principle of safe substitution

Subsumption. Principle of safe substitution Recap on Subtyping Subsumption Some types are better than others, in the sense that a value of one can always safely be used where a value of the other is expected. Which can be formalized as by introducing:

More information

General properties of staircase and convex dual feasible functions

General properties of staircase and convex dual feasible functions General properties of staircase and convex dual feasible functions JÜRGEN RIETZ, CLÁUDIO ALVES, J. M. VALÉRIO de CARVALHO Centro de Investigação Algoritmi da Universidade do Minho, Escola de Engenharia

More information

CS558 Programming Languages

CS558 Programming Languages CS558 Programming Languages Winter 2017 Lecture 4a Andrew Tolmach Portland State University 1994-2017 Semantics and Erroneous Programs Important part of language specification is distinguishing valid from

More information

Why Global Dataflow Analysis?

Why Global Dataflow Analysis? Why Global Dataflow Analysis? Answer key questions at compile-time about the flow of values and other program properties over control-flow paths Compiler fundamentals What defs. of x reach a given use

More information

CS 188: Artificial Intelligence Fall 2008

CS 188: Artificial Intelligence Fall 2008 CS 188: Artificial Intelligence Fall 2008 Lecture 4: CSPs 9/9/2008 Dan Klein UC Berkeley Many slides over the course adapted from either Stuart Russell or Andrew Moore 1 1 Announcements Grading questions:

More information

Announcements. CS 188: Artificial Intelligence Fall Large Scale: Problems with A* What is Search For? Example: N-Queens

Announcements. CS 188: Artificial Intelligence Fall Large Scale: Problems with A* What is Search For? Example: N-Queens CS 188: Artificial Intelligence Fall 2008 Announcements Grading questions: don t panic, talk to us Newsgroup: check it out Lecture 4: CSPs 9/9/2008 Dan Klein UC Berkeley Many slides over the course adapted

More information

Advanced Operations Research Techniques IE316. Quiz 1 Review. Dr. Ted Ralphs

Advanced Operations Research Techniques IE316. Quiz 1 Review. Dr. Ted Ralphs Advanced Operations Research Techniques IE316 Quiz 1 Review Dr. Ted Ralphs IE316 Quiz 1 Review 1 Reading for The Quiz Material covered in detail in lecture. 1.1, 1.4, 2.1-2.6, 3.1-3.3, 3.5 Background material

More information

Lecture Notes: Widening Operators and Collecting Semantics

Lecture Notes: Widening Operators and Collecting Semantics Lecture Notes: Widening Operators and Collecting Semantics 15-819O: Program Analysis (Spring 2016) Claire Le Goues clegoues@cs.cmu.edu 1 A Collecting Semantics for Reaching Definitions The approach to

More information

More Dataflow Analysis

More Dataflow Analysis More Dataflow Analysis Steps to building analysis Step 1: Choose lattice Step 2: Choose direction of dataflow (forward or backward) Step 3: Create transfer function Step 4: Choose confluence operator (i.e.,

More information

Lecture 5: Properties of convex sets

Lecture 5: Properties of convex sets Lecture 5: Properties of convex sets Rajat Mittal IIT Kanpur This week we will see properties of convex sets. These properties make convex sets special and are the reason why convex optimization problems

More information

Integer Programming Theory

Integer Programming Theory Integer Programming Theory Laura Galli October 24, 2016 In the following we assume all functions are linear, hence we often drop the term linear. In discrete optimization, we seek to find a solution x

More information

Dependent Object Types - A foundation for Scala s type system

Dependent Object Types - A foundation for Scala s type system Dependent Object Types - A foundation for Scala s type system Draft of September 9, 2012 Do Not Distrubute Martin Odersky, Geoffrey Alan Washburn EPFL Abstract. 1 Introduction This paper presents a proposal

More information

4/24/18. Overview. Program Static Analysis. Has anyone done static analysis? What is static analysis? Why static analysis?

4/24/18. Overview. Program Static Analysis. Has anyone done static analysis? What is static analysis? Why static analysis? Overview Program Static Analysis Program static analysis Abstract interpretation Static analysis techniques 2 What is static analysis? The analysis to understand computer software without executing programs

More information

Interprocedural Analysis. CS252r Fall 2015

Interprocedural Analysis. CS252r Fall 2015 Interprocedural Analysis CS252r Fall 2015 Procedures So far looked at intraprocedural analysis: analyzing a single procedure Interprocedural analysis uses calling relationships among procedures Enables

More information

Some Applications of Graph Bandwidth to Constraint Satisfaction Problems

Some Applications of Graph Bandwidth to Constraint Satisfaction Problems Some Applications of Graph Bandwidth to Constraint Satisfaction Problems Ramin Zabih Computer Science Department Stanford University Stanford, California 94305 Abstract Bandwidth is a fundamental concept

More information

9.5 Equivalence Relations

9.5 Equivalence Relations 9.5 Equivalence Relations You know from your early study of fractions that each fraction has many equivalent forms. For example, 2, 2 4, 3 6, 2, 3 6, 5 30,... are all different ways to represent the same

More information

15-451/651: Design & Analysis of Algorithms October 11, 2018 Lecture #13: Linear Programming I last changed: October 9, 2018

15-451/651: Design & Analysis of Algorithms October 11, 2018 Lecture #13: Linear Programming I last changed: October 9, 2018 15-451/651: Design & Analysis of Algorithms October 11, 2018 Lecture #13: Linear Programming I last changed: October 9, 2018 In this lecture, we describe a very general problem called linear programming

More information

V Advanced Data Structures

V Advanced Data Structures V Advanced Data Structures B-Trees Fibonacci Heaps 18 B-Trees B-trees are similar to RBTs, but they are better at minimizing disk I/O operations Many database systems use B-trees, or variants of them,

More information

Reuse Optimization. LLVM Compiler Infrastructure. Local Value Numbering. Local Value Numbering (cont)

Reuse Optimization. LLVM Compiler Infrastructure. Local Value Numbering. Local Value Numbering (cont) LLVM Compiler Infrastructure Source: LLVM: A Compilation Framework for Lifelong Program Analysis & Transformation by Lattner and Adve Reuse Optimization Eliminate redundant operations in the dynamic execution

More information

Chapter 3. Set Theory. 3.1 What is a Set?

Chapter 3. Set Theory. 3.1 What is a Set? Chapter 3 Set Theory 3.1 What is a Set? A set is a well-defined collection of objects called elements or members of the set. Here, well-defined means accurately and unambiguously stated or described. Any

More information