Thursday, December 23, The attack model: Static Program Analysis

How is static program analysis (SPA) done?
DFA - Data Flow Analysis
CFA - Control Flow Analysis
Proving invariance: theorem proving
Checking models: model checking
Giaco & Ranzato

DFA: The people
Gary Kildall, Ken Kennedy, Jeffrey D. Ullman

The source
Flemming Nielson, Hanne Riis Nielson, Chris Hankin: Principles of Program Analysis. Springer (corrected 2nd printing, 452 pages), 2005.
Alfred V. Aho, Ravi Sethi and Jeffrey D. Ullman: Compilers: Principles, Techniques, and Tools. Addison-Wesley, 2006.

DFA

Data Flow Analysis in history
[Figure: compiler pipeline (Scanner, Parser, Semantic analysis, Optimizer, Code generator), with CFA, DFA and Improvement stages.]
We start from a program representation: the control flow graph (CFG).
The semantics is given by recursive equations specifying the input/output behavior at each program point.

CFG

What is DFA
Wiki: Data-flow analysis is a technique for gathering information about the possible set of values calculated at various points in a computer program.
A better definition? Data-flow analysis is a technique for gathering information about how data flows at run time at various points in a computer program.

Example: Live Variable Analysis
Essential for register allocation: two variables that are live at the same time cannot be stored in the same register!
x and y cannot be stored in the same location if they are both in use!
Useful for SW watermarking (the QP algorithm).

Example
a and b are never in use at the same time: they can be substituted with x.

Live variables
x is live at the exit of C if x holds a value that will be used afterwards (it will be read, i.e. it appears on a right-hand side).
x is not live after C if it will be reassigned before its next use (x := exp with x ∉ exp).
If x is not live, it is dead!
Dead-code elimination: if x is dead after x := exp then we can erase x := exp.
Deciding exactly which code is dead is undecidable!

Live variables
The last use of b as an r-value is in 4.
b is used in 4, so it is live on the arc 3 → 4.
There is no assignment to b in 3: it is live on 2 → 3.
b is assigned in 2, so its earlier value is never used: b is not live on 1 → 2.
Live range of b: {2 → 3, 3 → 4}

Live variables
a is live on 4 → 5 and 5 → 2.
a is live on 1 → 2.
a is not live on 2 → 3 and 3 → 4: even though a holds a well-defined value at 3, this value will not be used before a is assigned a new value in 4.

Live variables
c is live on all arcs.
Liveness can be used to deduce that, if c is a local variable, then c is used without being initialized (warning!).

Live variables
Two registers are enough: a and b are never live together!

Live variables
a and b are never live along the same arcs!
We can optimize P: new register ab.
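The sharing condition can be checked mechanically. The following is a minimal sketch, assuming the live ranges listed on the previous slides for the running example (a on 1 → 2, 4 → 5, 5 → 2; b on 2 → 3, 3 → 4; c on every arc). The arc set and the helper name `can_share` are illustrative assumptions, not part of the original slides.

```python
# Arcs of the running example's CFG (assumed from the arcs named in the slides).
ARCS = {(1, 2), (2, 3), (3, 4), (4, 5), (5, 2), (5, 6)}

# Live ranges as sets of arcs, copied from the discussion above.
LIVE_RANGE = {
    "a": {(1, 2), (4, 5), (5, 2)},
    "b": {(2, 3), (3, 4)},
    "c": set(ARCS),  # c is live on all arcs
}

def can_share(x, y):
    """Hypothetical helper: two variables may share a register
    when no arc carries both of them live."""
    return LIVE_RANGE[x].isdisjoint(LIVE_RANGE[y])

print("a/b share:", can_share("a", "b"))
print("a/c share:", can_share("a", "c"))
print("b/c share:", can_share("b", "c"))
```

Only the pair a, b passes the check, which is why a single register ab plus one register for c suffices in this example.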

Basic notation
A CFG has out-edges and in-edges.
pre[n] and post[n] denote the predecessor nodes and successor nodes of n.
Example:
post[5] = {2,6} because 5 → 6 and 5 → 2
pre[2] = {1,5} because 5 → 2 and 1 → 2

Notation
A variable is defined when it is the l-value of an assignment: x := ...
A variable is used when it is an r-value in an expression: ... := ... x ...
def[n] are the variables defined in n.
use[n] are the variables used in n.
Example: def[3] = {c}, def[5] = ∅, use[3] = {b,c}, use[5] = {a}

Formalizing liveness
Definition: x is live on e → f if there exists an execution path C from e to n such that:
  e → f is the first arc in C;
  x ∈ use[n];
  for any n' ≠ e and n' ≠ n in C, x ∉ def[n'].
x is live-in at a node n if x is live on all in-edges of n.
x is live-out (or simply live) at a node n if it is live on at least one of the out-edges of n.
Example:
a is live on 1 → 2, 4 → 5 and 5 → 2
b is live on 2 → 3, 3 → 4
c is live on all arcs
a is live-in at 2, BUT it is not live-out at 2
a is live-out at 5

Computing Liveness
Liveness information (i.e., live-in and live-out for all nodes) can be over-approximated as follows:
1. If a variable x ∈ use[n], then x is live-in at n. Namely, if a node n uses x as an r-value then x is live on every arc entering n.
2. If a variable x is live-out at n and x ∉ def[n], then x is also live-in at n. Namely, if x is live on some arc leaving n and x is not defined in n, then x is live on all arcs entering n.
3. If a variable x is live-in at m, then x is live-out at all nodes n ∈ pre[m].
Correctness: if x is truly live-in (live-out) at n then the static analysis will find that x is live-in (live-out) at n.

Approximating Liveness
Liveness analysis is approximate: the assumption is that all paths in the CFG are possible!
The analysis determines that a is live-in at 5, and therefore a is live-out at 3.
BUT there is no true execution path from 3 to 5, and therefore a is not concretely live at the exit of 3!

Data-Flow equations
Define:
in[n]: the set of variables that are classified as live-in at the node n
out[n]: the set of variables that are classified as live-out at the node n
This can be expressed with 2 equations (or a system of equations):
1. in[n] = use[n] ∪ (out[n] - def[n])
2. out[n] = ∪ { in[m] | m ∈ post[n] }
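A minimal sketch of how the two equations can be iterated to a fixpoint on the running example. The def/use sets for nodes 3 and 5 and the successors of node 5 come from the notation slides; the entries for nodes 1, 2, 4 and 6 are assumptions chosen to agree with the live ranges discussed earlier. The function name `liveness` is illustrative.

```python
# Successors of each node in the running example's CFG.
POST = {1: [2], 2: [3], 3: [4], 4: [5], 5: [2, 6], 6: []}

# use/def sets; nodes 1, 2, 4 and 6 are assumptions consistent with
# the live ranges given in the slides.
USE = {1: set(), 2: {"a"}, 3: {"b", "c"}, 4: {"b"}, 5: {"a"}, 6: {"c"}}
DEF = {1: {"a"}, 2: {"b"}, 3: {"c"}, 4: {"a"}, 5: set(), 6: set()}

def liveness(post, use, defs):
    """Iterate both equations from empty sets until nothing changes."""
    live_in = {n: set() for n in post}
    live_out = {n: set() for n in post}
    changed = True
    while changed:
        changed = False
        # Visiting nodes in reverse order suits a backward analysis.
        for n in sorted(post, reverse=True):
            # out[n] = union of in[m] over the successors m of n
            new_out = set().union(*(live_in[m] for m in post[n]))
            # in[n] = use[n] ∪ (out[n] - def[n])
            new_in = use[n] | (new_out - defs[n])
            if new_out != live_out[n] or new_in != live_in[n]:
                live_out[n], live_in[n] = new_out, new_in
                changed = True
    return live_in, live_out

live_in, live_out = liveness(POST, USE, DEF)
for n in sorted(POST):
    print(n, "in:", sorted(live_in[n]), "out:", sorted(live_out[n]))
```

Starting from empty sets and only ever growing them yields the least solution of the system, which is the one the slides refer to.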

Least fixpoint
Least fixpoint of the system of equations: for all n ∈ nodes(cfg(P)):
  in[n] = use[n] ∪ (out[n] - def[n])
  out[n] = ∪ { in[m] | m ∈ post[n] }
Formally: let |Vars(P)| < ω and |nodes(cfg(P))| = N. Then
  live : (2^Vars(P) × 2^Vars(P))^N → (2^Vars(P) × 2^Vars(P))^N
where (2^Vars(P) × 2^Vars(P))^N is a finite complete lattice, and live is the monotone function
  live(⟨in_1, out_1, ..., in_N, out_N⟩) =
    ⟨use[1] ∪ (out_1 - def[1]), ∪ { in_m | m ∈ post[1] }, ..., use[N] ∪ (out_N - def[N]), ∪ { in_m | m ∈ post[N] }⟩

Correctness
Theorem: ∀n ∈ nodes(P): live-in[n] ⊆ in[n] and live-out[n] ⊆ out[n].
Proof idea: both in[n] and out[n] are computed over the CFG statically, i.e. following possibly non-real executions!

Approximation soundness
How can we read the answer of a static analysis?
If x is live at n along some program execution path, then x ∈ out[n].
If x is not live at n along any program execution path, it may still happen that x ∈ out[n].
For liveness, sound approximation means: we can erroneously derive that x is live, BUT we CANNOT erroneously derive that a variable is dead!
If x ∈ out[n] then x may be live at program point n.
If x ∉ out[n] then x is definitely dead at program point n.

The approximation is complete!
out[1] = {a,c}, out[2] = {b,c}, out[3] = {b,c}, out[4] = {a,c}, out[5] = {a,c}

Backward analysis
Live variable analysis is a backward analysis: information propagates backward, from out to in.
in[n] can be computed once out[n] is known; out[n] can be computed once in[m] is known for all successors m of n.

Backward analysis

Reaching definitions
Given a program point n, which definitions (assignments) are available and not overwritten when program execution reaches this point along some path? And which definitions are available after n?
A program point n may kill a definition: if the command in n is an assignment x := exp, it kills the definitions of x that are available on entry to n.
Assignments generate new definitions.
We are interested in the entry and exit reaching definitions for every program point in the CFG.
It is one of the simplest data-flow analyses in compilers!

Forward analysis

Formal definition
Definitions are pairs variable/program point: {(x,p) | x ∈ Vars, p is a program point} ∈ 2^(Vars × Points), where (x,p) means that x is assigned at point p.
The analysis computes the set of reaching definitions for each program point: definition chains.
If (x,p) is computed at point q then the assignment to x at point p is available in q.
? is a special symbol in Points, which is used for uninitialized variables.
The value ι = {(x,?) | x ∈ Vars} denotes that all variables are uninitialized.

Formal definition
The analysis is given by the following system of fixpoint equations, for any program point p in the CFG:
  RD_entry(p) = ι                                  if p is a program entry point
  RD_entry(p) = ∪ { RD_exit(q) | q ∈ pre[p] }      otherwise
  RD_exit(p) = (RD_entry(p) \ kill_RD[p]) ∪ gen_RD[p]
RD is a possible analysis: if x := a at program point q is really available at the entry of point p then (x,q) ∈ RD_entry(p) (the converse may not hold).

Formal definition
  kill_RD[p] = {(x,q) | q ∈ Points, x ∈ def[q]} ∪ {(x,?)}    if x ∈ def[p]
  kill_RD[p] = ∅                                              otherwise (def[p] = ∅)
  gen_RD[p] = {(x,p)}                                         if x ∈ def[p]
  gen_RD[p] = ∅                                               otherwise (def[p] = ∅)
As usual: def[p] = {x} if the instruction at program point p is x := exp; otherwise def[p] = ∅.
The analysis is forward with least fixpoint.

Example program:
  1: input n;
  2: m := 1;
  3: n > 1;
  4: m := m*n;
  5: n := n-1;
  6: output m;
Reaching definitions:
  RD_entry(1) = {(n,?),(m,?)}                                    RD_exit(1) = {(n,?),(m,?)}
  RD_entry(2) = {(n,?),(m,?)}                                    RD_exit(2) = {(n,?),(m,2)}
  RD_entry(3) = RD_exit(2) ∪ RD_exit(5) = {(n,?),(n,5),(m,2),(m,4)}    RD_exit(3) = {(n,?),(n,5),(m,2),(m,4)}
  RD_entry(4) = {(n,?),(n,5),(m,2),(m,4)}                        RD_exit(4) = {(n,?),(n,5),(m,4)}
  RD_entry(5) = {(n,?),(n,5),(m,4)}                              RD_exit(5) = {(n,5),(m,4)}
  RD_entry(6) = {(n,?),(n,5),(m,2),(m,4)}                        RD_exit(6) = {(n,?),(n,5),(m,2),(m,4)}
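The table above can be reproduced by iterating the RD equations. The sketch below assumes the CFG arcs 1 → 2, 2 → 3, 3 → 4, 4 → 5, 5 → 3 and 3 → 6 (inferred from RD_entry(3) = RD_exit(2) ∪ RD_exit(5)), and treats `input n` at point 1 as a non-assignment, since the slides keep (n,?) in RD_exit(1). All Python names are illustrative.

```python
# Predecessors in the assumed CFG of the example program.
PRE = {1: [], 2: [1], 3: [2, 5], 4: [3], 5: [4], 6: [3]}

# Variable assigned at each point (None for non-assignments).
# Point 1 (input n) is treated as a non-assignment, as in the table.
DEF = {1: None, 2: "m", 3: None, 4: "m", 5: "n", 6: None}

VARS = {"n", "m"}
IOTA = {(x, "?") for x in VARS}  # all variables uninitialized

def kill(p):
    """All definitions of the variable assigned at p, plus (x, ?)."""
    x = DEF[p]
    if x is None:
        return set()
    return {(x, q) for q in DEF if DEF[q] == x} | {(x, "?")}

def gen(p):
    """The definition created at p, if p is an assignment."""
    return set() if DEF[p] is None else {(DEF[p], p)}

def reaching_definitions(entry_point=1):
    rd_entry = {p: set() for p in PRE}
    rd_exit = {p: set() for p in PRE}
    changed = True
    while changed:
        changed = False
        for p in sorted(PRE):  # forward analysis: follow program order
            if p == entry_point:
                new_entry = set(IOTA)
            else:
                new_entry = set().union(*(rd_exit[q] for q in PRE[p]))
            new_exit = (new_entry - kill(p)) | gen(p)
            if new_entry != rd_entry[p] or new_exit != rd_exit[p]:
                rd_entry[p], rd_exit[p] = new_entry, new_exit
                changed = True
    return rd_entry, rd_exit

rd_entry, rd_exit = reaching_definitions()
for p in sorted(PRE):
    print(p, sorted(rd_entry[p], key=str), sorted(rd_exit[p], key=str))
```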

DFA training
On-line analyzer: http://pag.cs.uni-sb.de/
It implements standard DFA with an intuitive interface!

DFA Framework
Is there a common structure in DFA?
Having a framework allows the design of a common algorithm and a common specification (correctness proofs, complexity evaluation, etc.).

A common structure?

                                           Forward                  Backward
                                           in[n] → out[n]           out[n] → in[n]
                                           pre → post               post → pre
Possible analysis (Semantics ⊆ Analysis)   Reaching definitions     Live variables
Definite analysis (Analysis ⊆ Semantics)   Available expressions    Very busy expressions

A common pattern
  GA∘(p) = ι                                 if p ∈ E
  GA∘(p) = ⊔ { GA•(q) | (q,p) ∈ F }          otherwise
  GA•(p) = f_p(GA∘(p))
where:
  E are the initial/terminal points in the CFG
  ι is the initial/final information
  F are the arcs or the inverse arcs in the CFG
  ⊔ is either ∪ or ∩
  f_p is a transfer function associated with node p

Forward vs Backward
  GA∘(p) = ι                                 if p ∈ E
  GA∘(p) = ⊔ { GA•(q) | (q,p) ∈ F }          otherwise
  GA•(p) = f_p(GA∘(p))
In forward analysis E are the initial points, F = {(q,p) | q → p}, GA∘ is GA_entry and GA• is GA_exit.
In backward analysis E are the final points, F = {(q,p) | p → q}, GA∘ is GA_exit and GA• is GA_entry.

Possible vs Definite
  GA∘(p) = ι                                 if p ∈ E
  GA∘(p) = ⊔ { GA•(q) | (q,p) ∈ F }          otherwise
  GA•(p) = f_p(GA∘(p))
When ⊔ = ∩ we look for the largest set satisfying the equations on all possible computation paths entering (exiting) a node: this is a definite (or must) analysis!
When ⊔ = ∪ we look for the least set satisfying the equations on at least one possible computation path entering (exiting) a node: this is a possible (or may) analysis!
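A sketch of how the common pattern can be turned into one generic solver, parameterized by the flow F, the extremal points E, the extremal value ι, the transfer functions and the choice of join. The names `solve`, `circ` and `bullet` are assumptions made for this illustration; the instance at the bottom is live variable analysis (backward, possible) on the running example, with def/use sets for nodes 1, 2, 4 and 6 assumed as before.

```python
def solve(nodes, flow, extremal, iota, transfer, may=True, universe=frozenset()):
    """Iterate  circ(p) = iota if p in E else JOIN{bullet(q) | (q,p) in F}
                bullet(p) = f_p(circ(p))
    may=True : join is union, start from the empty set (least solution).
    may=False: join is intersection, start from `universe` (largest solution)."""
    start = frozenset() if may else frozenset(universe)
    circ = {p: start for p in nodes}
    bullet = {p: start for p in nodes}
    changed = True
    while changed:
        changed = False
        for p in nodes:
            if p in extremal:
                new_circ = frozenset(iota)
            else:
                new_circ = start
                for (q, r) in flow:
                    if r == p:
                        new_circ = new_circ | bullet[q] if may else new_circ & bullet[q]
            new_bullet = frozenset(transfer(p, new_circ))
            if new_circ != circ[p] or new_bullet != bullet[p]:
                circ[p], bullet[p] = new_circ, new_bullet
                changed = True
    return circ, bullet

# Instance: live variables on the running example.
EDGES = {(1, 2), (2, 3), (3, 4), (4, 5), (5, 2), (5, 6)}
USE = {1: set(), 2: {"a"}, 3: {"b", "c"}, 4: {"b"}, 5: {"a"}, 6: {"c"}}
DEF = {1: {"a"}, 2: {"b"}, 3: {"c"}, 4: {"a"}, 5: set(), 6: set()}

# Backward analysis: F holds the inverse arcs, E the final points.
backward_flow = {(q, p) for (p, q) in EDGES}
live_out, live_in = solve(
    nodes=[6, 5, 4, 3, 2, 1],
    flow=backward_flow,
    extremal={6},
    iota=frozenset(),
    transfer=lambda p, s: USE[p] | (s - DEF[p]),
)
for n in sorted(live_out):
    print(n, "in:", sorted(live_in[n]), "out:", sorted(live_out[n]))
```

For a backward analysis the first result plays the role of GA_exit and the second of GA_entry; a forward analysis would pass the CFG arcs unchanged and the initial points as `extremal`.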

Distributive Dataflow Analysis
Assume the transfer functions are monotone and ⊔ = ∪.
A dataflow analysis problem is distributive if all transfer functions are additive, namely for any f and any x,y ∈ C:
  f(x ∪ y) = f(x) ∪ f(y)
Note that by monotonicity of f: f(x ∪ y) ⊇ f(x) ∪ f(y)
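Additivity can be checked by brute force on a small powerset. The sketch below is an illustration only: the three-element universe, the gen/kill sets and the function names are assumptions. It compares a gen/kill transfer function, of the shape used by liveness and reaching definitions, with a monotone function that is not additive.

```python
from itertools import combinations

UNIVERSE = ["a", "b", "c"]  # assumed toy universe

def subsets(items):
    """Enumerate every subset of items as a frozenset."""
    for k in range(len(items) + 1):
        for combo in combinations(items, k):
            yield frozenset(combo)

def is_additive(f):
    """Check f(x ∪ y) == f(x) ∪ f(y) for all pairs of subsets."""
    return all(
        f(x | y) == f(x) | f(y)
        for x in subsets(UNIVERSE)
        for y in subsets(UNIVERSE)
    )

def gen_kill(x, gen=frozenset({"a"}), kill=frozenset({"b"})):
    """Transfer function of the form (x - kill) ∪ gen."""
    return (x - kill) | gen

def not_additive(x):
    """Monotone but not additive: adds c only once a and b are both present."""
    return x | {"c"} if {"a", "b"} <= x else x

print("gen/kill additive:", is_additive(gen_kill))
print("other additive:", is_additive(not_additive))
```

For the second function, f({a} ∪ {b}) = {a,b,c} strictly contains f({a}) ∪ f({b}) = {a,b}, which is the inequality that monotonicity alone guarantees.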

A distributive transfer function

A non-distributive transfer function

An example
[Figure: CFG in which two branches, with transfer functions f and g, join before h, which is followed by k.]
  k(h(f(0) ∪ g(0))) = k(h(f(0)) ∪ h(g(0))) = k(h(f(0))) ∪ k(h(g(0)))
The analysis is equivalent to combining the results of the analysis along all separate paths.

DFA of a distributive problem
If a problem is distributive then the minimal solution to its system of equations is equivalent to the combination of the separate analyses applied to all program execution paths (including infinite ones).
The join ∪ does not cause a loss of precision!

What problems are distributive?
Distributive problems are easy.
DFAs concerning the structure of code are typically distributive!
Example: live variables, available expressions, reaching definitions and very busy expressions are all distributive problems.
These are properties concerning HOW the program executes.

Non-distributive problems
Typical non-distributive problems concern WHAT programs compute.
Example: the output is a constant, is a positive value, belongs to an interval, is bounded, etc.
Example: Constant Propagation Analysis. For every program point p, determine whether a variable always has the same constant value whenever the execution reaches p.

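Constant Propagation Analysis
The domain of properties is (Var → Z^⊤)_⊥ where:
  Var is the set of variables in P
  Z^⊤ = Z ∪ {⊤} is the dual CPO to Z_⊥: the integers are pairwise incomparable and all lie below ⊤
[Figure: Hasse diagram with ⊤ above ..., -4, -3, -2, -1, 0, 1, 2, 3, 4, ...]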

Constant Propagation Analysis
Var → Z^⊤ are the states evaluating variables in Z^⊤, with ⊤ meaning "don't know".
Var → Z^⊤ is a CPO under the usual point-wise order ⊑:
  if σ,σ' ∈ Var → Z^⊤ then σ ⊑ σ' iff ∀x ∈ Var. σ(x) ∈ Z^⊤ ⇒ σ(x) ⊑ σ'(x) in Z^⊤.
⊥ is the bottom state (the totally undefined function) in (Var → Z^⊤)_⊥.

Example: sample elements of ({x,y} → Z^⊤)_⊥, from the top down
  ⊤ = {(x,⊤), (y,⊤)}
  {(x,⊤), (y,4)}
  {(x,1), (y,4)}    {(x,1), (y,2)}
  {(y,4)}    {(x,1)}    {(y,7)}
  ⊥

Analyzing expressions
In order to specify transfer functions we need to be able to evaluate (integer) expressions in Aexp in a state σ ∈ (Var → Z^⊤)_⊥:
  A : Aexp → ((Var → Z^⊤)_⊥ → Z^⊤_⊥)
  A[[x]]σ = ⊥ if σ = ⊥ or σ(x) = undef; σ(x) otherwise
  A[[n]]σ = ⊥ if σ = ⊥; n otherwise
  A[[a1 op a2]]σ = A[[a1]]σ ôp A[[a2]]σ
where ôp is the interpretation of op on Z^⊤_⊥, defined as follows. Let op_Z : Z² → Z be an arithmetic operation on Z:
  (A) if z1,z2 ∈ Z then z1 ôp z2 = z1 op_Z z2;
  (B) ⊥ ôp z = z ôp ⊥ = ⊥;
  (C) z1 ôp z2 = ⊤ otherwise.

The transfer functions
The transfer functions for constant propagation are
  f_p : (Var → Z^⊤)_⊥ → (Var → Z^⊤)_⊥
and are defined as follows:
  if p is a node containing an assignment [x:=a]^p then f_p(σ) = σ[x ↦ A[[a]]σ]
  if p is a node containing a non-assignment command then f_p(σ) = σ
This is a possible/forward analysis.

Example
Consider the program
  [x:=10]^1; [y:=x+10]^2; ([while x<y]^3 [y:=y-1]^4); [z:=x-1]^5
The minimal solution of Constant Propagation Analysis is:
  CP_entry(1) = ⊥
  CP_exit(1) = {(x ↦ 10)}
  CP_entry(2) = {(x ↦ 10)}
  CP_exit(2) = {(x ↦ 10), (y ↦ 20)}
  CP_entry(3) = CP_exit(3) = CP_entry(4) = CP_exit(4) = {(x ↦ 10), (y ↦ ⊤)}
  CP_entry(5) = {(x ↦ 10), (y ↦ ⊤)}
  CP_exit(5) = {(x ↦ 10), (y ↦ ⊤), (z ↦ 9)}
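The solution above can be reproduced with a small solver. This is a sketch under stated assumptions: states are Python dicts (a missing key means the variable is undefined), `None` stands for a program point not reached yet, the entry state of point 1 is taken to be the empty dict (the totally undefined state), and the CFG arcs are 1 → 2, 2 → 3, 3 → 4, 4 → 3, 3 → 5. The expression encoding and all names are illustrative.

```python
TOP = "TOP"   # "don't know"
BOT = None    # undefined value

def lift(op):
    """Lift an integer operation following rules (A), (B), (C)."""
    def lifted(z1, z2):
        if z1 is BOT or z2 is BOT:
            return BOT                 # (B)
        if z1 == TOP or z2 == TOP:
            return TOP                 # (C)
        return op(z1, z2)              # (A)
    return lifted

OPS = {"+": lift(lambda a, b: a + b), "-": lift(lambda a, b: a - b)}

def eval_aexp(a, state):
    """A[[a]]state; a is an int, a variable name, or (op, a1, a2)."""
    if isinstance(a, int):
        return a
    if isinstance(a, str):
        return state.get(a, BOT)
    op, a1, a2 = a
    return OPS[op](eval_aexp(a1, state), eval_aexp(a2, state))

def join(s1, s2):
    """Point-wise join; None (unreached) is the neutral element."""
    if s1 is None:
        return s2
    if s2 is None:
        return s1
    out = {}
    for x in set(s1) | set(s2):
        if x in s1 and x in s2:
            out[x] = s1[x] if s1[x] == s2[x] else TOP
        else:
            out[x] = s1[x] if x in s1 else s2[x]
    return out

def transfer(stmt, state):
    """f_p: update x for an assignment (x, a); identity otherwise."""
    if state is None or stmt is None:
        return state
    x, a = stmt
    value = eval_aexp(a, state)
    new_state = dict(state)
    if value is BOT:
        new_state.pop(x, None)
    else:
        new_state[x] = value
    return new_state

STMT = {1: ("x", 10), 2: ("y", ("+", "x", 10)), 3: None,
        4: ("y", ("-", "y", 1)), 5: ("z", ("-", "x", 1))}
PRE = {1: [], 2: [1], 3: [2, 4], 4: [3], 5: [3]}

def constant_propagation():
    entry = {p: None for p in PRE}
    exit_ = {p: None for p in PRE}
    changed = True
    while changed:
        changed = False
        for p in sorted(PRE):
            if p == 1:
                new_entry = {}          # assumed initial state
            else:
                new_entry = None
                for q in PRE[p]:
                    new_entry = join(new_entry, exit_[q])
            new_exit = transfer(STMT[p], new_entry)
            if new_entry != entry[p] or new_exit != exit_[p]:
                entry[p], exit_[p] = new_entry, new_exit
                changed = True
    return entry, exit_

entry, exit_ = constant_propagation()
for p in sorted(PRE):
    print(p, entry[p], exit_[p])
```

The value of y becomes TOP at point 3 because the loop joins y = 20 from point 2 with y = 19 from point 4, while x stays 10 on every path.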

Non-distributivity
Constant Propagation Analysis is not distributive: consider the transfer function for the command [y := x * x]^p.
We consider two states σ1 and σ2 such that σ1(x) = 1 and σ2(x) = -1.
In this case (σ1 ⊔ σ2)(x) = ⊤ and therefore f_p(σ1 ⊔ σ2)(y) = ⊤, while f_p(σ1)(y) = 1 = f_p(σ2)(y).
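The counterexample can be replayed directly. A minimal sketch, assuming states are dicts defined on the same variables and `TOP` is a sentinel for "don't know"; the helper names are illustrative.

```python
TOP = "TOP"  # "don't know"

def join_value(v1, v2):
    """Join in the flat lattice: equal constants stay, otherwise TOP."""
    return v1 if v1 == v2 else TOP

def join(s1, s2):
    """Point-wise join of two states defined on the same variables."""
    return {x: join_value(s1[x], s2[x]) for x in s1}

def times(z1, z2):
    """Multiplication lifted to TOP."""
    return TOP if TOP in (z1, z2) else z1 * z2

def f_p(state):
    """Transfer function of [y := x * x]^p."""
    new_state = dict(state)
    new_state["y"] = times(state["x"], state["x"])
    return new_state

sigma1 = {"x": 1}
sigma2 = {"x": -1}

joined_first = f_p(join(sigma1, sigma2))       # analyze the merged state
joined_last = join(f_p(sigma1), f_p(sigma2))   # analyze each path, then merge

print("f_p(s1 join s2)(y)      =", joined_first["y"])
print("(f_p(s1) join f_p(s2))(y) =", joined_last["y"])
```

Joining before applying f_p loses the fact that y is 1 on both paths, which is exactly the precision loss that distributive problems do not suffer.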

Abstract Interpretation

The people
Patrick Cousot, Radhia Cousot
Made in France

Applications
Developed in 1977 for generalizing DFA.
Successful model for: DFA, model checking, types, program transformation, etc.
Successfully used in concrete analysis systems since 2000: ~2M lines of safety-critical C code analyzed with no false alarms!