Multiplication of BDD-Based Integer Sets for Abstract Interpretation of Executables


Bachelor Thesis

Johannes Müller

Multiplication of BDD-Based Integer Sets for Abstract Interpretation of Executables

March 19, 2017

supervised by: Prof. Dr. Sibylle Schupp, Sven Mattsen

Hamburg University of Technology (TUHH)
Technische Universität Hamburg-Harburg
Institute for Software Systems
Hamburg


Statutory Declaration

I, Johannes Müller, hereby declare in lieu of oath that I have written this bachelor thesis independently and that I have used no aids other than those stated. The thesis has not been submitted in this or a similar form to any examination board.

Harsefeld, March 19, 2017
Johannes Müller


Contents

1. Introduction
2. Background
   2.1. Data-flow Analysis
   2.2. Abstract Interpretation
   2.3. BDD-Based Integer Sets
3. Multiplication of BDD-Based Integer Sets
   3.1. Multiplication of General Sets
   3.2. Singleton Multiplication
        Constructing BDD-Based Strided Intervals
        Correctness of Singleton Multiplication
4. Conversion of BDD-Based Sets to Strided Intervals
   Requirements for a Galois Insertion
   Algorithm for Finding Best-Fitting Strided Intervals
5. Implementation
6. Evaluation
7. Related Work
8. Conclusion
A. Appendix
   A.1. Implementation in Scala
   A.2. Evaluation Results


1. Introduction

The ubiquitous presence of computer programs in nearly all areas of modern life necessitates methods facilitating the extraction of properties from programs, such as correctness with respect to their specification, performance metrics, or the presence of security vulnerabilities. Especially in safety-critical areas, like avionics or the construction of medical devices, these safety properties are of utmost importance. During the history of software development and engineering, several methods that detect such properties with different trade-offs have been conceived. For example, the dynamic testing of programs, while relatively easy to employ, can only prove the presence of bugs, not their absence, whereas formal proofs of certain properties, which theoretically are the most precise and thorough method, are comparatively hard to formulate for non-trivial programs. Another approach, the static analysis of programs, which does not need to examine the program at runtime, promises to derive properties in an automatic fashion, making it attractive as an additional tool for software developers to check the quality of their product. These analyses can process different forms of a program: one option is to examine the source code, i.e., a higher-level representation of the program, and another is the inspection of machine code, i.e., the source code compiled to an executable. This thesis will focus on the latter representation of programs.

The need to analyze machine code instead of higher-level source code arises for several reasons. For one, the source code of a program of interest might not always be freely available, for example if the program is legacy software, where the source code has simply been lost over the years. Another example would be a vulnerability analysis of third-party software, where the original source is naturally not available. In addition, the executable might not always be a faithful translation of the original source code, for example due to compiler optimizations, as described by Balakrishnan [1]. Even for open-source software, which can be downloaded in executable form, a mere analysis of the source code for vulnerabilities or malicious code might not be sufficient, since there exists the possibility that a downloaded executable has code added on top of the original functionality. In summary, it can be concluded that the executable is the single source of truth for the behavior of a program and must thus be treated as an important analysis subject.

In order to analyze programs statically, one still needs information about the possible behavior of the program, which in turn requires knowledge of the possible program states at each program point that determine this behavior. Part of this program state are the current contents of memory locations, or, in higher-level languages, the values of variables. One particular static analysis that determines this information is the value analysis, which computes a set of possible values of each variable or memory cell, called the variation domain (VD). As an example, consider Listing 1.1. A value analysis for this program would compute the possible values of the variable i at each program point, for example the set {0, 1} at line 6. If the original program contains an operation on variables to compute some result, a value analysis needs a way to combine VDs with respect to this operator in order to compute all possible results.

Listing 1.1: Example program
1 #include <stdio.h>
2 int main(void) {
3   int a[2] = {4, 9};
4   int i;
5   for (i = 0; i < 2; i++)
6     printf("%d\n", a[i]);
7   return 0;
8 }

During an analysis, these computations are performed using transfer functions, which are the operators of the concrete program lifted to the world of variation domains. As an example, in order to analyze the example program we need a transfer function for the addition, which would compute all possible values of i after the incrementation.

For this thesis, we use BDDs to represent the variation domains. This representation allows efficient and precise transfer functions for bitwise operators, which are a common occurrence in machine code, and for addition, as defined by Mattsen et al. [6]. The overarching topic of this thesis is the development of transfer functions for multiplication, since there currently only exists a vastly over-approximating transfer function, and multiplication commonly occurs in executables even if no explicit multiplication was employed in the source program, making a more precise transfer function a worthwhile research subject. To see this peculiarity, consider the array access in line 6 of the example program. The corresponding instruction in machine code, load a + i * 4 in pseudo code, computes the address of the array element from the index i: it multiplies the index i by a constant factor of 4, since we assume a system working with 4-byte integers, and adds the resulting offset to the base address of the array. Figure 1.1 visualizes how the address of an array element is resolved: the start address of the array a is given by 0x3, and thus the start address of the i-th array element by 0x3 + i * 4.

Figure 1.1.: Layout of an array in memory (squares represent bytes in memory, the text inside represents their address; a[0] occupies 0x3 to 0x6 and a[1] occupies 0x7 to 0xA)

As part of this thesis, we will present an algorithm computing exact results for the special case of a singleton multiplication, i.e., a multiplication where one VD only contains a single element, as is the case for array accesses, where the singleton set in our example would hold the value 4.
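To make the role of such a transfer function concrete, the following sketch lifts multiplication to plain Scala sets of values. The names are hypothetical, and explicit sets stand in for the BDD-based variation domains used in the thesis.

// Naive transfer function for multiplication on explicit value sets (illustration only).
def multVD(a: Set[Long], b: Set[Long]): Set[Long] =
  for { x <- a; y <- b } yield x * y   // product of every combination of possible values

// Array access from Listing 1.1: index i in {0, 1} times the element size 4 gives the
// byte offsets {0, 4}, so the accessed addresses are 0x3 + {0, 4} = {0x3, 0x7}.
val offsets = multVD(Set(0L, 1L), Set(4L))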

We will also describe an algorithm for the general case of multiplication, which, however, only approximates, since a precise calculation would not be computationally feasible. As part of the singleton multiplication we will present an algorithm for the construction of BDD-based sets representing strided intervals, which arise when multiplying an interval with a constant. Having found a way to convert from the strided interval domain to the domain of BDD-based sets, we expand on that and devise an algorithm that, given a BDD-based set, finds an optimal strided interval, which is a superset of the original one, allowing us to conveniently convert between the abstract domain of BDD-based sets and strided intervals.

In short, the contributions of this thesis are

- a precise transfer function for the special case of singleton multiplication,
- an approximating transfer function for the general multiplication,
- abstraction and concretization functions between the domain of BDD-based sets and the strided interval domain, and
- an evaluation of the presented multiplication algorithms.

The thesis is structured into eight chapters. We will begin by explaining important fundamental concepts in Chapter 2, including data-flow analysis and the representation of integer sets as BDDs. Chapter 3 is dedicated to the explanation of the transfer functions for the multiplication that we designed. In Chapter 4 we will explain a conversion between the domain of strided intervals and our domain of BDD-based integer sets. Chapter 5 will serve as a description of the implementation of the developed algorithms. An evaluation of our algorithms will be presented in Chapter 6. We will discuss related work in Chapter 7 and conclude the thesis in Chapter 8.


2. Background

2.1. Data-flow Analysis

One technique of static analysis is data-flow analysis [2], which examines program properties that are dictated by the specific way data is propagated through the program, i.e., how the sequence of executed code, determined by the control flow, changes the state of the program. Data-flow analyses are used, for example, during optimization phases of program compilation. There exists a plethora of specific analyses that fall under the umbrella of data-flow analysis, such as the live variables or reaching definitions analysis. They all depend on the control flow of the program, typically represented by a control flow graph, and define a property space of data-flow information specific to their goals, which is derived during analysis by means of combination and modification of previously known information. In particular, for each node in the control flow graph a transfer function modeling the effects of the corresponding basic block is defined, which computes the exit state of the block, i.e., the properties known after execution of that block, from incoming information, i.e., properties of the program known before the basic block is executed. The incoming information is computed by combining the exit states of all predecessor nodes, using the join function of the analysis. For each node b of the control flow graph we have:

in_b = join_{p ∈ pred(b)}(out_p)
out_b = trans_b(in_b)

This gives rise to a system of equations for the program, depending on the analysis used. Starting at the entry node of the program and the initially known information, one can iteratively traverse the control flow graph, computing new exit states based on the combination of incoming information, and propagate this knowledge through the control flow graph. This is repeated until the properties known at each node do not change anymore, i.e., we have reached a fixpoint of the system of equations.

Binary Analysis

Analyzing executables using a data-flow analysis entails several challenges. To begin with, the extraction of the control flow graph of an executable is much harder compared to the case of control-flow extraction from higher-level languages, since the original control flow given by higher-level control structures is reduced to jump instructions in the executable. While static jumps to a known instruction are easily resolved, dynamic jumps to an address that is not statically known pose a significant obstacle, since the possible jump targets must be computed as part of the analysis. In order to formulate a sound analysis, the computed set of jump targets must be a superset of all possible jump targets in the original program, otherwise we would ignore certain valid sequences of executed instructions. However, the analysis needs to make sure that the derived set of jump targets is as precise as possible, since an over-approximation of jump targets leads to an over-approximation of the control flow graph, which might entail the analysis of instruction sequences that were not present in the original program. In practice, the reconstruction of the control flow graph and the data-flow analysis can be combined: during the analysis of an instruction, all successor instructions are derived and then visited, instead of knowing the successors prior to the examination of an instruction.

Figure 2.1.: Control flow graph for the code in Listing 1.1 (nodes S1: "int a[2] = {...}; int i = 0;", S2: "i < 2", S3: "printf(...); i = i + 1;", and EXIT; S2 branches to S3 on the true edge and to EXIT on the false edge)

Value Analysis

The specific data-flow analysis we are interested in is the value analysis mentioned in the introduction, which computes variation domains for the variables and memory cells in the program, i.e., it analyzes the possible values that these variables can hold. This information is used, for example, to check for out-of-bound accesses on arrays, which happen if a possible value for the index exceeds the bounds of the array, or for the computation of possible jump targets of dynamic jumps as required by the reconstruction of the control flow graph for executables.

As an example, in the following we perform a value analysis on the example program in Listing 1.1 from the introduction. The corresponding control flow graph is displayed in Figure 2.1, where the basic blocks of the program are represented by nodes with edges connecting basic blocks that are executed in sequence. If the execution depends on a conditional expression, the required value of this expression for a basic block to be executed is used as the label of the edge. We will now define our analysis by starting with the property space, i.e., the information that we want to track. Since we want to infer the possible values of the single variable i of the program, we use the power set of all integers P(Z) as our property space, i.e., as our flow data we always have a set of possible values. In practice, such a property space is not feasible, because we would have to be able to store infinite sets, but it is used here for a simplification of the example.

We define the join operator, which combines the information of different variation domains, as the set union, i.e., we have:

join(s_1, ..., s_n) = s_1 ∪ ... ∪ s_n

We will now define the transfer functions for the nodes in the control flow graph, which update a variation domain based on the code that would be executed, as follows:

- S1: this basic block initializes i to the value 0, which means that the possible values of i after this block are given by the singleton set {0}, i.e., trans_S1(in) = {0}.
- S2: the conditional does not modify i and its transfer function is thus the identity function, i.e., trans_S2(in) = in.
- S3: in the first statement of this basic block i is only read and not modified, while the second statement increments i by 1, which means that we form the new variation domain by adding 1 to all previously possible values, i.e., trans_S3(in) = {i + 1 | i ∈ in}.
- EXIT: the exit node does not modify the values of i and thus we arrive at trans_EXIT(in) = in.

We initialize in_S1 with Z, since we have no information about i at that point, i.e., i could be any integer. The initial outgoing properties out_n are initialized with the empty set, since this is the neutral element of the set union. The analysis proceeds as follows:

Step 1: We calculate out_S1 as out_S1 = trans_S1(Z) = {0}.
Step 2: in_S2 is computed by combining the outgoing properties of S1 and S3, i.e., in_S2 = join({0}, ∅) = {0}. From this we determine the new exit state as out_S2 = trans_S2({0}) = {0}.
Step 3: We compute the outgoing property of S3 from the information incoming from S2: out_S3 = trans_S3({0}) = {i + 1 | i ∈ {0}} = {1}. This tells us that the possible values of i after S3 are just the single number 1.
Step 4: Since the outgoing information of S3 changes, we have to recompute the incoming state for S2 as in_S2 = join({0}, {1}) = {0, 1}, which also constitutes the new outgoing information of S2.
Step 5: Due to the change of out_S2 we have to update out_S3, since we have a new incoming variation domain: out_S3 = trans_S3({0, 1}) = {i + 1 | i ∈ {0, 1}} = {1, 2}.
Step 6: We proceed by updating the incoming variation domain for S2 based on the new outgoing variation domain from S3 as in_S2 = join({0}, {1, 2}) = {0, 1, 2}.
Step 7: Since the condition i < 2 would now no longer be fulfilled for S2, we take the false edge to the exit node. Here we just propagate the incoming variation domain, i.e., the outgoing one from S2, and conclude that the possible values of i at program exit are out_EXIT = trans_EXIT({0, 1, 2}) = {0, 1, 2}.
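The following sketch replays this iteration in Scala on explicit sets (hypothetical names; the restriction of the true edge by the condition i < 2, which the text applies informally in Step 7, is made explicit here so that the loop terminates).

def transS1(in: Set[Int]): Set[Int] = Set(0)          // i = 0
def transS2(in: Set[Int]): Set[Int] = in              // the condition only reads i
def transS3(in: Set[Int]): Set[Int] = in.map(_ + 1)   // printf does not change i; i = i + 1
def join(ss: Set[Int]*): Set[Int]   = ss.flatten.toSet

var inS2  = Set.empty[Int]
var outS3 = Set.empty[Int]
val outS1 = transS1(Set.empty)                         // {0}
var changed = true
while (changed) {
  val next = join(outS1, outS3)
  changed = next != inS2
  inS2 = next
  // only values satisfying the loop condition i < 2 flow along the true edge into S3
  outS3 = transS3(transS2(inS2).filter(_ < 2))
}
// at the fixpoint: inS2 == Set(0, 1, 2), matching Steps 1 to 6 above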

2.2. Abstract Interpretation

The sound static analysis of a program encompasses the consideration of all possible execution behaviors, since all defects or properties that are present in the original program and of interest must be discovered as part of the analysis, i.e., no false negatives may exist. As a consequence, such an analysis must capture all possible execution traces, i.e., all sequences of states of the program, which for example includes the values of variables. However, the sheer size of the state space of non-trivial programs prevents the consideration of every single program state from being computationally feasible. To circumvent this unreasonable computation, one can abstract the concrete program to an abstract one, which reaches abstract states, and perform the analysis on the abstract version. The abstract program, consisting of an abstract state space and state transformers, must obviously be a faithful representation of the original program in order for the analysis to be correct and sensible.

Abstract interpretation [3] is a framework facilitating the conception of sound static analyses based on abstracting a concrete program with respect to its semantics. As the result of the process, such an analysis derives properties that describe the concrete values of the concrete program. For the case of a value analysis, the properties would describe the possible values of a concrete variable in the original program. Intuitively, an analysis is correct if the properties describe (at least) all possible program values, which in our case means that we derive a variation domain which is a superset of the actual one of the concrete program. To formalize the correctness of an analysis, one needs to define a correctness relation relating concrete program values to abstract properties, dependent on the particular analysis goals. If this correctness relation is preserved under computation, the analysis is shown to be correct, i.e., given that the relation holds between an initial value and initial property, it must also hold between the final value and property after the property has been transformed by a transfer function corresponding to the semantics of the concrete program.

To facilitate the construction of correct analyses, the properties are required to be organized in a lattice, i.e., a set with meet operator, join operator, and partial ordering, where the ordering represents a metric for the precision of a property or the amount of information known, i.e., if l_1 ⊑ l_2 then l_2 represents at least the information represented by l_1, which means that if l_1 correctly describes a concrete value then l_2 does so too. Based on the goals of the analysis to be performed, one can now choose a property space that can hold all information required but is still small enough to enable efficient computation. Generally, the more abstract and thus less precise a chosen abstract domain of properties and transfer functions is, the more efficient is the computation in that domain, which means that one has to choose a trade-off between precision and computability. As an example, if an analysis is to be designed that checks whether array indices are always within bounds, there is no need to track all possible values of the indices, complicating the computation; the smallest and largest possible value alone are sufficient.
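For instance, the following sketch (hypothetical names, not taken from the thesis) abstracts a variation domain to the interval of its smallest and largest element, which is all a bounds check needs; it also hints at the abstraction and concretization functions discussed next.

case class Interval(lo: Int, hi: Int)

// abstraction: keep only the bounds of a (non-empty) set of possible index values
def alpha(values: Set[Int]): Interval = Interval(values.min, values.max)
// concretization: the set of all values the interval describes
def gamma(iv: Interval): Set[Int] = (iv.lo to iv.hi).toSet

// gamma(alpha(s)) is a superset of s, and alpha(gamma(iv)) yields iv again
def indicesInBounds(indices: Set[Int], arrayLength: Int): Boolean = {
  val iv = alpha(indices)
  iv.lo >= 0 && iv.hi < arrayLength
}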
However, it is also possible to change the abstract domain used during an analysis itself without compromising the correctness of the analysis: if a computation is too expensive in a more concrete domain, one can approximate this domain by a more abstract one and perform the computation in this new domain, followed by conversion back to the more concrete domain. The basis of these safe abstractions are the Galois connections, which dictate how abstraction to and concretization from the abstract domain must behave. Let L and M be two lattices with L representing a more concrete property space, α : L → M the monotone abstraction function relating more concrete properties to more abstract ones, and γ : M → L the corresponding concretization function for the opposite direction. Together with the concrete property space L and the abstract one M, α and γ form a Galois connection if, for all c ∈ L and a ∈ M:

c ⊑ γ(α(c))    (2.1)
α(γ(a)) ⊑ a    (2.2)

The first condition expresses that no information is lost by going to the more abstract domain and back, whereas the second condition expresses the requirement that a concretization followed by an abstraction may not add information. A more strict version of a Galois connection is the Galois insertion, which eliminates superfluous elements in the more abstract domain M by requiring equality in the second equation, i.e., a concretization followed by an abstraction must yield the original element, and thus all elements in M uniquely describe a value in L.

2.3. BDD-Based Integer Sets

The representation of integer sets as the variation domain for our value analysis is based on the concept of Binary Decision Diagrams (BDDs). This section will explain the basics of BDDs and how they can be used to represent integer sets.

Binary Decision Diagrams

BDDs are directed, acyclic graphs which are commonly used to represent Boolean functions and are formed from two types of nodes: terminal and decision nodes. Terminal nodes, also called sinks, are nodes that are either labeled F, meaning False-terminal, or T, for a True-terminal, and are the leaves of a tree since they have no successors. For the extent of this thesis, they will be represented as rectangular boxes. Decision nodes are inner nodes of BDDs, labeled with a variable of the represented Boolean function, and have two outgoing edges, a 0-edge (dashed line) and a 1-edge (solid line), representing the value of the variable. We will depict them as circles. BDDs have a single root node. Figure 2.2b shows an exemplary BDD representing the Boolean xor-function given by the truth table in Figure 2.2a.

x_1  x_2 | x_1 xor x_2
 0    0  |     0
 0    1  |     1
 1    0  |     1
 1    1  |     0

Figure 2.2.: (a) Truth table for the xor-function; (b) BDD for the xor-function

In order to evaluate a BDD that represents an n-ary Boolean function f : {0, 1}^n → {0, 1} with variables x_i, we start at the root node and repeat the following steps:

- If the node is a sink, the result for the given input is the label of the node.

- If the node is a decision node with label x_i, we look up the input value corresponding to the variable x_i. If this value is 0, we need to follow the 0-edge to reach the next node, and otherwise follow the 1-edge.

As an example, we evaluate our example function for the input 01, i.e., x_1 = 0 and x_2 = 1. We start at the root node with label x_1 and follow the 0-edge to the right subtree since x_1 = 0. After reaching the (right) node with label x_2, in the next step we need to follow the 1-edge, reaching a True-terminal, and conclude that f(0, 1) = 1.

Reduced Ordered Binary Decision Diagrams (OBDDs)

In practice, however, regular BDDs are rarely used, since their lack of enforcement of ordering between nodes can lead to unwanted configurations and less efficient algorithms. For this reason, OBDDs were introduced, which augment normal BDDs with an ordering restriction, to which all paths in the BDD must adhere. In particular, a total order π on the set of variables x_1, ..., x_n needs to be defined and the following statement must hold:

∀x_i ∈ {x_1, ..., x_n} : ∀x_j ∈ Successors(x_i) : x_i <_π x_j

For each path in the graph we have that the variables on this path are ordered with respect to π. This also means that no variable can appear twice in a path to a terminal. These OBDDs still do not optimally represent Boolean functions, since they can still contain redundant information, such as a decision node that points to two equivalent sinks. This gives rise to reduced OBDDs, which remove this redundant information. To achieve this, several reduction rules that transform a non-reduced OBDD to a more reduced one, i.e., one with fewer nodes, are defined, and an OBDD is called reduced if there is not a single rule that is still applicable. The reduction rules for our purposes are the following:

Rule 1: If there exists a node in the OBDD that has both edges pointing to the same terminal, replace the node with that terminal and redirect the incoming edges.
Rule 2: If the OBDD contains two or more equivalent subgraphs, remove all but one and redirect the incoming edges of the removed ones to the remaining subgraph.
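A minimal sketch of these two rules on a simple node representation (hypothetical names, not the thesis implementation):

sealed trait BDD
case object True  extends BDD                          // True-terminal
case object False extends BDD                          // False-terminal
final case class Node(hi: BDD, lo: BDD) extends BDD    // decision node: 1-edge (hi) and 0-edge (lo)

// shared node table, so that structurally equal subgraphs are built only once (Rule 2)
val table = scala.collection.mutable.Map.empty[(BDD, BDD), BDD]

def mkNode(hi: BDD, lo: BDD): BDD =
  if (hi == lo && (hi == True || hi == False)) hi      // Rule 1: both edges point to the same terminal
  else table.getOrElseUpdate((hi, lo), Node(hi, lo))   // Rule 2: reuse an existing equivalent node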

Note that these rules differ from the ones commonly found in literature in that the first one only applies if the successors are terminals and not arbitrary subgraphs. An advantage of these ROBDDs is that two equivalent Boolean functions have the same representation, i.e., we have canonicity, which simplifies comparing for equality. In the following, for simplicity, we will refer to ROBDDs as BDDs and may use non-reduced BDDs for our examples, if this leads to an improvement in clarity.

Indicator Functions

An indicator function I_A, sometimes also called characteristic function, for a given set A is a function that indicates whether an element is a member of the set or not. To achieve this, the function must accept all values of which the set can be made up, and return a Boolean value indicating whether this value is an element of the set. This leads to the following definition:

Definition 1. Let U be a set of all possible values (universe) and A ⊆ U a set of interest; the indicator function I_A is defined as follows:

I_A : U → {0, 1}
I_A(x) = 1 if x ∈ A, and I_A(x) = 0 if x ∉ A

Expressed differently, we have:

x ∈ A ⟺ I_A(x) = 1

These indicator functions can be used to represent sets, which is especially useful if storing the indicator function is more size-efficient than storing every single value of the set. They also allow for natural definitions of indicator functions for combinations of sets. Let A ⊆ U and B ⊆ U be sets. We now have, for example:

I_{A ∪ B}(x) = I_A(x) ∨ I_B(x)
I_{A ∩ B}(x) = I_A(x) ∧ I_B(x)

If we use BDDs to represent sets using indicator functions, this allows us to quite easily compute the union and intersection of two sets by simply combining the BDDs using well-known algorithms such as the If-Then-Else algorithm [4].

BDD-Based Integer Sets

It should be obvious that a BDD representing the Boolean function f : {0, 1}^n → {0, 1} can be seen as a specialized indicator function, where the set of all possible values U is just the set of all n-ary bit vectors. If we now interpret these n-ary bit vectors as unsigned integers in binary, we have found a way to encode sets of n-bit integers as BDDs.

The ordering in our BDDs is defined as being from the most significant bit (MSB) to the least significant bit (LSB), which allows us to easily find the minimum and maximum element in a BDD-based set. As an example, consider a BDD that can store arbitrary unsigned 4-bit integer sets, displayed in Figure 2.3.

Figure 2.3.: BDD for 4-bit integers

Considering this visualization, we can make some observations:

- Combining the labels on the path from root to terminal simply gives us a binary number.
- Each decision node represents a specific bit in the binary number, dependent on its level in the tree, and the outgoing edges determine the value of that bit. The root node, for example, represents the first bit in the number, and all numbers in the right subtree have the first bit unset, while the MSB of the numbers in the left subtree is set.
- We can also derive that a subtree at depth m, and thus of height h = n - m, represents a set of binary numbers on its own. There are two ways to look at such a subtree: on its own it represents a set of h-bit binary numbers, which is a subset of the interval [0, 2^h - 1]. But if this subtree is viewed as part of the entire tree, it determines the membership of binary numbers with a certain m-bit prefix, namely the edge labels along the path from root node to the subtree. Consider the subtree reached by first taking the 1-edge and then the 0-edge. All numbers represented by this subtree have the prefix 10 and thus have the form 10--; those numbers are therefore from the interval [8, 8 + 3 = 11]. In general, the smallest possible number represented by a subtree is the binary number given by taking the edge labels on the path to the subtree with the remaining places set to 0, which we will call a. Then this subtree represents a set of numbers from the interval [a, a + 2^h - 1].
- If an entire subtree only contains terminals of one specific flavor, our reduction rules demand this subtree to be replaced by this terminal. So whenever there is a terminal not at the lowest level, it counts for an entire interval that is either a subset of the represented set or not.

As a specific example, consider the reduced BDD representing the set {0, 3, 4, 5, 6, 7} in Figure 2.4. Since none of the numbers in the interval [8, 15], i.e., numbers with the first bit set, is an element of the set, the subtree representing these numbers is simply set to the False-terminal. Similarly, all numbers of the interval [4, 7], which have the prefix 01, are members of the set and thus their corresponding subtree is set to the True-terminal.

Figure 2.4.: BDD representing the set {0, 3, 4, 5, 6, 7}

For the remaining numbers, 0 (0000) and 3 (0011), no common prefix can be found such that all numbers with that prefix are members of the set, preventing a reduction of the rightmost subtree of height 2. Using BDDs to represent integer sets, we can now efficiently store the variation domains of integer memory locations. One big advantage of using BDDs lies in the fact that we do not have the restriction of convexity, as for example in the interval domain, since we can store arbitrary sets. This ability is especially important for the analysis of binaries, since variation domains of certain types of memory locations, for example those holding jump targets for dynamic jumps, can contain elements of almost arbitrary value, which would cause convex domains to vastly over-approximate. For the abstract domain of BDD-based sets, there already exist a number of transfer functions for certain operators like the addition or bitwise operators.
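The following sketch (hypothetical names, repeated in a self-contained form) illustrates how the MSB-to-LSB ordering is exploited: membership of an n-bit value is decided by following its bits from the root, and the minimum element is found by preferring 0-edges and falling back to the weighted 1-edge.

sealed trait BDD
case object True  extends BDD
case object False extends BDD
final case class Node(hi: BDD, lo: BDD) extends BDD   // 1-edge (hi), 0-edge (lo)

// membership test for an n-bit unsigned value, following the bits from MSB to LSB
def contains(bdd: BDD, x: Long, n: Int): Boolean = bdd match {
  case True  => true                                   // a terminal above the lowest level covers a whole interval
  case False => false
  case Node(hi, lo) =>
    val bit = (x >> (n - 1)) & 1
    contains(if (bit == 1) hi else lo, x, n - 1)
}

// minimum element of the represented set, if the set is non-empty
def minElem(bdd: BDD, n: Int): Option[Long] = bdd match {
  case False => None
  case True  => Some(0L)                               // all remaining bits can be 0
  case Node(hi, lo) =>
    minElem(lo, n - 1).orElse(minElem(hi, n - 1).map(_ + (1L << (n - 1))))
}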


3. Multiplication of BDD-Based Integer Sets

We have seen in the example from the introduction that transfer functions for multiplication of variation domains are imperative for the analysis of executables. The design of algorithms for these transfer functions is the focus of this thesis and will be discussed in this chapter. In the following, whenever we speak of set multiplication we mean the cartesian product of the sets with each tuple mapped to the product of its components, i.e., A ⊗ B := {a · b | a ∈ A, b ∈ B}, where ⊗ is defined as the operator for set multiplication. This operator can be interpreted as an exact transfer function for multiplication, i.e., one without over-approximation.

We will start by describing an over-approximating algorithm for the multiplication of general sets, since computing the exact result for BDD-based sets would not be computationally feasible. Intuitively, the complexity of set multiplication can be comprehended when looking at the dependencies during binary multiplication between the input bits of the operands and the output bits of the result: whereas for bitwise operators the result of the bit at the i-th position only depends on both input bits at the i-th position, the result of the i-th bit during multiplication depends on all previous bits of both operands. This means that if we were to precisely compute the result of the product of BDD-based sets, at each decision node we would have to incorporate the information of both subtrees of the operands, compared to just looking at the outgoing edges in the case of bitwise operators.

In addition to this general multiplication, we will introduce and explain an algorithm for the precise computation of the special case of the singleton multiplication, where one of the operands only contains a single element, i.e., is a singleton set. It makes sense to put effort into designing such a precise algorithm for this special case, since the multiplication with a singleton is a frequent operation in machine code. As seen in the example from the introduction, whenever an index into something comparable to an array, i.e., a memory region of contiguous elements of a certain size, needs to be converted to an actual byte offset in this region, the index, whose possible values are captured in a variation domain, gets multiplied by the constant size of each element, giving rise to the aforementioned singleton multiplication.

3.1. Multiplication of General Sets

The basic idea behind an approximating multiplication of general sets is that while we cannot easily compute the result on BDDs directly, we can convert a BDD to an approximated intermediate representation, which allows for an efficient set multiplication, and then multiply the operands in this representation, followed by converting the result back to a BDD.

As intermediate representation we have chosen to approximate BDDs, i.e., arbitrary sets, by a set of intervals I = {i_1, ..., i_n}. To keep our analysis correct, an approximation of a BDD-based set A must of course not lose any information, i.e., integer elements:

∀a ∈ A : ∃i ∈ I : a ∈ i

Let A, B be BDD-based sets and I_A, I_B their approximations using intervals; a safe over-approximated result for the set multiplication can then be computed as follows:

mult_gen(A, B) = ⋃_{(i_a, i_b) ∈ I_A × I_B} i_a ⊗_int i_b,

where ⊗_int is the transfer function for interval multiplication using the interval bounds, i.e., [a, b] ⊗_int [c, d] = [a · c, b · d]. In short, we form each possible combination of intervals from both operand approximations, multiply those intervals, and finally form the union of the resulting intervals. The final result is converted back to a BDD to conclude our general set multiplication. In the context of abstract interpretation, this can be seen as abstracting from our original BDD domain to the domain of interval disjunctions, computing the result in that domain, and then concretizing the result back to our original domain.

As an example, consider the BDD-based sets A = {0, 1, 5, 6, 7} and B = {2, 3, 4}. A valid approximation for A would be I_A = {[0, 1], [5, 7]} and for B it would be I_B = {[2, 4]}, since every element in an original set is part of (at least) one interval in the approximation. Forming all possible combinations of intervals yields I_A × I_B = {([0, 1], [2, 4]), ([5, 7], [2, 4])}, combining the tuple components using interval multiplication gives us the set {[0, 4], [10, 28]}, and forming the union of all contained sets finally results in mult_gen(A, B) = {0, 1, 2, 3, 4, 10, 11, ..., 28}. The precise set multiplication has the result {0, 2, 3, 4, 10, 12, 14, 15, 18, 20, 21, 24, 28}, which is a subset of the approximated result as required. In this case, the approximated result contains |mult_gen(A, B)| = 24 elements, whereas the precise one only |A ⊗ B| = 13 elements, an increase of roughly 85%.

Since the interval multiplication over-approximates by a large margin, it is essential to find a good approximation to a set of intervals. We will now discuss how this approximation, i.e., converting a BDD-based set to a set of intervals, can be implemented. The underlying idea for this conversion is that each subtree in a BDD represents a set, and this set can be approximated either by a set of intervals, if we want more precision, or by a single interval, by taking its lower and upper bound. In principle, our conversion algorithm recursively traverses the tree and decides for each subtree, i.e., each decision node, whether to approximate the subtree by a single interval or by a set of intervals, which is done by recursively converting both subtrees to sets of intervals and forming the union of them.

If the decision is made to approximate by a single interval, the algorithm finds the smallest and greatest element in the subtree and uses those values as the bounds of the interval, a task that is made efficient by the ordering of MSB to LSB, since we only need to find the rightmost and leftmost True-terminal. Another possibility, which does not involve finding the minimum and maximum and is thus more efficient, would be to treat the whole subtree as if it was a single True-terminal, i.e., setting the lower bound of the interval to the smallest possible element in the subtree and the upper bound to the biggest possible one. As our base cases, we have that True-terminals result in a singleton set of just the interval which the terminal represents, and False-terminals in an empty set. Based on this, we can define Algorithm 1, which still depends on a predicate SUBTREE-TEST deciding the cut-off point in the tree, i.e., whether to approximate the subtree by a single interval or multiple ones. In the algorithm, n represents the bit-size of our integers, i.e., the maximum height of the BDD; the input start keeps track of the path from the root node to our subtree, i.e., the common prefix of all numbers.

Algorithm 1 Abstracting BDD-based integer sets to sets of intervals
1: function ApproxBDDSet(bdd, depth, start)
2:   if bdd is NFalse then return ∅
3:   else if bdd is NTrue then
4:     return {[start, start + 2^(n-depth) - 1]}
5:   else if SUBTREE-TEST(bdd) then
6:     s1 ← ApproxBDDSet(falseSucc(bdd), depth + 1, start)
7:     s2 ← ApproxBDDSet(trueSucc(bdd), depth + 1, start + 2^(n-depth-1))
8:     return s1 ∪ s2
9:   else
10:    min ← min(bdd)
11:    max ← max(bdd)
12:    return {[start + min, start + max]}
13:  end if
14: end function

In the following we will describe a few possible predicates that we have evaluated for our analysis. In each case, trade-offs between the cost of the computation of the predicate and its meaningfulness must be made. The first, very simple approach decides based on the level of the subtree in the entire tree, meaning the length of the path from the root to this subtree, and recurses deeper into the tree only if this value is smaller than a specified cut-off value. This means that we only ever go up to a certain depth into the tree and approximate by sets of intervals until then. Algorithm 2 implements this idea. The computation of this predicate is obviously very cheap.

Algorithm 2 Depth-based cut-off
1: function DepthCutoff_h(bdd)
2:   return depth(bdd) < h
3: end function
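The following sketch (hypothetical names, not the thesis code) combines Algorithm 1 with the depth-based cut-off of Algorithm 2 and the pairwise interval multiplication used by mult_gen. Below the cut-off, a subtree is approximated by the full range it could cover, i.e., the cheaper variant mentioned above that avoids computing the exact minimum and maximum; non-negative (unsigned) values are assumed.

sealed trait BDD
case object True  extends BDD
case object False extends BDD
final case class Node(hi: BDD, lo: BDD) extends BDD

case class Interval(lo: Long, hi: Long)

// abstract a BDD-based set to a set of intervals; n = bit-size, start = common prefix value
def approx(bdd: BDD, n: Int, depth: Int, start: Long, cutoff: Int): Set[Interval] = bdd match {
  case False => Set.empty
  case True  => Set(Interval(start, start + (1L << (n - depth)) - 1))
  case Node(hi, lo) if depth < cutoff =>                            // depth-based SUBTREE-TEST
    approx(lo, n, depth + 1, start, cutoff) ++
      approx(hi, n, depth + 1, start + (1L << (n - depth - 1)), cutoff)
  case _ => Set(Interval(start, start + (1L << (n - depth)) - 1))   // coarse: whole subtree range
}

// interval multiplication by the bounds (non-negative operands) and the general multiplication
def multInt(a: Interval, b: Interval): Interval = Interval(a.lo * b.lo, a.hi * b.hi)
def multGen(ia: Set[Interval], ib: Set[Interval]): Set[Interval] =
  for { x <- ia; y <- ib } yield multInt(x, y)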

Another strategy is to compute the proportion of the number of elements inside the subset represented by the subtree to the maximum possible number, and only recurse if this value is smaller than some parameter representing a precision, i.e., we decide to approximate a subtree by a single interval if it proportionally contains enough elements, causing us to not lose unreasonable amounts of information. An implementation of this approach is presented in Algorithm 3. This predicate can be made more precise by comparing the number of elements in the subtree to the number of elements in the approximating interval that would be created, because in that case we do not consider the border regions of the subtree that do not contain any elements. However, this would increase the cost of the computation of the predicate.

Algorithm 3 Precision-based cut-off
1: function PrecisionCutoff_p(bdd)
2:   count ← elemCount(bdd)
3:   h ← height(bdd)
4:   maxcount ← 2^h
5:   return count / maxcount < p
6: end function

Correctness

We will now demonstrate that the presented algorithm for general multiplication provides a correct result with respect to the framework of abstract interpretation. To achieve that, we must show that the computed result is a superset of the precise result, i.e., that we do not lose information. The precise result for two sets A and B is given by A ⊗ B := {a · b | a ∈ A, b ∈ B}, which means that we have to show that the product of each combination of values out of A and B is an element of our result. Let A, B be integer sets represented by BDDs and I_A, I_B their approximations using sets of intervals. We have that

∀a ∈ A, b ∈ B : ∃i_a ∈ I_A, i_b ∈ I_B : a ∈ i_a ∧ b ∈ i_b,

i.e., for each combination of a ∈ A and b ∈ B we can find two intervals in the approximations such that the values a and b are members of an interval. The product of these intervals is then computed as part of our algorithm and is a subset of the final result, since the union of all these interval products is formed. We now need to show that a · b ∈ i_a ⊗_int i_b, which is clearly the case, since a and b are bounded by the bounds of their respective interval and, for the non-negative integers considered here, the bounds of the resulting interval are just the original bounds multiplied.

3.2. Singleton Multiplication

In this section we will describe how to design an algorithm for the special case of singleton multiplication. Due to the prevalence of singleton multiplication in executables, we require our algorithm to compute an exact result, i.e., we do not want it to over-approximate.

Since the algorithm should still be able to compute the result efficiently, we need to find a way to directly operate on the structure of BDDs without using an intermediate representation as in the case of general multiplication. We will start by taking a closer look at how the binary representation of integers works and how this information can be made explicit in our BDD representation, which will later on help us visualize the inner workings of the algorithm for singleton multiplication.

An n-bit binary number x = ⟨b_n, ..., b_1⟩ with b_i ∈ {0, 1} is at its core just a bit vector. The corresponding number is a sum of powers of 2, which depends on the values of the bits. We have

x = ⟨b_n, ..., b_1⟩ = Σ_{i=1}^{n} b_i · 2^(i-1).

One way to interpret this is that each bit contributes a value to the final result: if the bit is 0, this value is also 0, but if it is 1, the value is a power of 2 that depends on the position of the bit in the bit vector. We will now augment the classic BDDs by adding this information explicitly. The decision nodes represent a certain bit in the bit vector and the outgoing edges the value of that bit, i.e., the b_i's in our sum. Previously the powers of 2 that each level in a BDD represents, i.e., the 2^(i-1)'s, were only implicit, leading us to explicitly add them in our augmented BDD by setting them as the weight of the edges. Figure 3.1 is an example of an augmented BDD for a height of 2.

Figure 3.1.: Augmented BDD with edge weights

Having such a representation allows us to simply sum up all the weights along a path from the root node to a terminal in order to find out what value this terminal represents. As an example, take a look at Figure 3.1. The sum along the path to the leftmost terminal is 2 + 1 = 3, as expected, since 11 is the binary representation of the decimal number 3. Terminals that are not on the lowest level represent intervals that start with the sum of the edge weights of the path to this terminal and have a certain size dependent on the position in the tree.

For the case of singleton multiplication we want to multiply many such binary numbers by the single element of the singleton operand, which we will call y. For one specific x = ⟨b_n, ..., b_1⟩, the product x · y is given by

x · y = (Σ_{i=1}^{n} b_i · 2^(i-1)) · y = Σ_{i=1}^{n} b_i · 2^(i-1) · y.

Interpreting this result, we can see that the product of a binary number with another number y can be computed by adding up powers of 2 multiplied by y, depending on the values of the bits of the first operand. Yet again, each bit in the input contributes a summand to the output, prompting us to adapt the augmented BDD by multiplying each edge weight by y, such that the edge weights reflect the summands. The sum of edge weights along a path from the root node to a terminal now represents the product of the original number in the BDD and y. The basic idea of the singleton algorithm consists of summing up the edge weights along the paths to each True-terminal and collecting the results. As an example, consider Figure 3.2, where the edge weights have been multiplied by y. We again look at the leftmost terminal representing 3; the sum along the path is 2·y + 1·y = 3·y, which is exactly what we would expect.

Figure 3.2.: Augmented BDD with edge weights for singleton multiplication

We now describe a recursive algorithm that takes as input a BDD and the singleton value y of the second operand and returns a BDD which represents the set of each integer in the input BDD multiplied by y. As a starting point, we will derive a recursive formula for the multiplication of a single binary number with another number, which can then be extended to work for sets of numbers represented by BDDs as first argument. We define a function that takes a bit vector representing an integer as first argument and a natural number y as second argument:

mult : {0, 1}^i × N_0 → N_0
mult(⟨b_i, ..., b_1⟩, y) = Σ_{j=1}^{i} b_j · 2^(j-1) · y

This is just the formula from above adapted to a function definition syntax. We can then extract the last summand (j = i) from the sum, giving us:

mult(⟨b_i, ..., b_1⟩, y) = b_i · 2^(i-1) · y + Σ_{j=1}^{i-1} b_j · 2^(j-1) · y.

The second summand of the equation is just mult(⟨b_{i-1}, ..., b_1⟩, y), giving rise to the recursive formula for the multiplication:

mult(⟨b_i, ..., b_1⟩, y) = b_i · 2^(i-1) · y + mult(⟨b_{i-1}, ..., b_1⟩, y).

As the base case we set mult(⟨⟩, y) = 0, since 0 is the neutral element of the addition.

In short, when we multiply a binary number by another number, we compute the recursive result, where we remove the first bit from the number, and add to it a power of 2, which depends on the length of the input vector, multiplied by y if the first bit is 1, and 0 otherwise.

We now adapt this recursive formula to BDD-based integer sets as the first argument. We will use the notation {...}# to represent BDD-based sets. The input BDD can either be a decision node or a terminal. We start by considering the case of a decision node. Such a decision node is the root of a subtree of a certain height, which represents a set of integers of the form x_j = ⟨b_i, ..., b_1⟩. The decision node itself represents the first bit b_i of all these integers, and the outgoing edges the value of the first bit. This means that for all integers in the right subtree of the decision node, where we have an incoming 0-edge, the first bit is 0, and for all numbers in the left one it is 1, respectively. The remaining bits b_{i-1} to b_1 of the integers are determined by the subtrees, which means that if we call the singleton algorithm recursively for a subtree, we get back a BDD-based set representing the integers that arise if we multiply the suffix ⟨b_{i-1}, ..., b_1⟩ of the original numbers x_j by y. Based on the recursive formula for multiplication of two single numbers, we now define a way to combine the recursive result and the information of the decision node to form the final result: let bdd be a decision node of the tree and ts and fs its true- and false-successor, respectively. The result for this node is computed by:

mult_sing(bdd, y) = (mult_sing(ts, y) + {2^(i-1) · y}#) ∪ mult_sing(fs, y),

where i is the level at which the subtree is placed in the original BDD and + is the transfer function of addition for BDDs, which already exists. Since all binary numbers that end in the left subtree have a leading 1, we add the edge weight at that level, 2^(i-1) · y, to the recursive result of the left subtree, which is a BDD representing the lower bits of that subset multiplied by y, and form the union with the recursive result of the right subtree, where the leading bit is 0 and thus no addition is necessary, giving us a new BDD where the edge weight was added to each number. For the base cases, we have that multiplication by the False-terminal, i.e., the empty set, results again in the empty set. For a True-terminal, for now we restrict ourselves to ones on the lowest level, we return a BDD representing the singleton set of 0, since 0 is again the neutral element of the addition. Since a True-terminal at a higher level would be equivalent to a fully expanded subtree with only True-terminals, we will later on derive what the expected result of the recursive call for such a subtree would be and from that define the result for such a terminal. In Algorithm 4 the idea is formulated as pseudo code.

As an example, consider the multiplication of the BDD-based set A = {1, 2, 3}# with the singleton set B = {5}#. The BDD representing the set A is displayed in Figure 3.3a, with the edge weights, i.e., the terms b_i · 2^(h-1) · y, already filled in. We start by computing the result for the base cases, which is the singleton set {0}# for each True-terminal, and the empty set for all False-terminals, which can be observed in Figure 3.3b. For the next step, we calculate the result for the decision nodes representing the last bit. The general procedure is to take the recursive result for the subtrees, add the edge weight to each number in that result, and form the union of those resulting sets; a compact sketch of this recursion in code is given below.
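The sketch uses plain Scala sets with hypothetical names; the thesis applies BDD addition and BDD union to BDD-based sets instead of enumerating elements, and the case of a True-terminal above the lowest level already anticipates the strided-interval result derived at the end of this section.

sealed trait BDD
case object True  extends BDD
case object False extends BDD
final case class Node(hi: BDD, lo: BDD) extends BDD    // 1-edge (hi), 0-edge (lo)

// height = number of bits decided by the subtree rooted at bdd
def multSing(bdd: BDD, y: Long, height: Int): Set[Long] = bdd match {
  case False               => Set.empty                 // the empty set stays empty
  case True if height == 0 => Set(0L)                   // neutral element of the addition
  case True                =>                           // True-terminal above the lowest level:
    (0L until (1L << height)).map(_ * y).toSet          // the strided interval y[0, (2^height - 1) * y]
  case Node(hi, lo) =>
    val weight = (1L << (height - 1)) * y               // edge weight 2^(height-1) * y of the 1-edge
    multSing(hi, y, height - 1).map(_ + weight) ++ multSing(lo, y, height - 1)
}

// example from the text: A = {1, 2, 3} as a 2-bit BDD and y = 5 gives Set(5, 10, 15)
val a: BDD = Node(Node(True, True), Node(True, False))
val result = multSing(a, 5L, 2)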
In our case, we have for the right decision node the recursive results ∅# for the right subtree, which has an incoming edge with weight 0, giving us the partial result {0}# + ∅# = ∅#, and {0}# for the left subtree with incoming edge weight 5, resulting in {0}# + {5}# = {5}#. This leads to the final result ∅# ∪ {5}# = {5}# for this decision node. This result makes sense, since the subtree spanned by this decision node represents the set of 1-bit integers {1}, which also intuitively gives the result {5} when each element is multiplied by 5. A similar computation is performed for the left decision node: we calculate {0}# + {0}# = {0}# for its right subtree and {0}# + {5}# = {5}# for its left subtree. The union of these two sets, {0}# ∪ {5}# = {0, 5}#, then determines the final result for this decision node. Intuitively this result makes sense, since the left subtree represented the set of 1-bit integers {0, 1}, which multiplied by 5 results in just {0, 5}. These results are shown in Figure 3.3c. To complete our computation for the entire subtree, we examine the root node: the recursive result for the right subtree was {5}#; adding the corresponding edge weight 0 to each element yields the set {5}# again. For the left subtree we add the edge weight 10 to each element of the recursive result {0, 5}#, giving us {10, 15}#. The union of these two sets, {5}# ∪ {10, 15}# = {5, 10, 15}#, constitutes the final result of the singleton multiplication, as expected and shown in Figure 3.3d. Worth highlighting is that each set written down in this example would be represented as a BDD during the computation of the algorithm, and specialized versions for set addition and union would be used.

Algorithm 4 Incomplete singleton multiplication
1: function MultSing(bdd, y, height)
2:   if bdd is NFalse then return ∅#
3:   else if bdd is NTrue ∧ height = 0 then
4:     return {0}#
5:   else if bdd is Node(trueSucc, falseSucc) then
6:     fr ← MultSing(falseSucc, y, height - 1)
7:     tr ← MultSing(trueSucc, y, height - 1)
8:     weight ← {2^(height-1) · y}#
9:     return fr ∪ (tr + weight)
10:  end if
11: end function

Let us now revisit the case of a True-terminal which is not placed at the lowest level, but at a certain depth d. If we expand this terminal, we will get a subtree of height h = n - d, where all paths lead to a True-terminal. On its own, such a BDD would represent all h-bit numbers, i.e., the interval [0, 2^h - 1]. If such a BDD were a subtree during our singleton multiplication, the recursive result would be each of the numbers in this interval multiplied by the singleton element y, i.e., [0, 2^h - 1] ⊗ {y} = {0, y, 2·y, ..., (2^h - 1)·y}. Such a set is called a strided interval, i.e., an interval with an additional parameter, the stride, that determines the distance between consecutive elements, written as s[a, b], where s is the stride. In our case the strided interval would be y[0, (2^h - 1)·y]. This means that whenever we come across a True-terminal at a higher level, the result of our multiplication must be a BDD representing such a strided interval. In order to achieve good performance in our singleton algorithm, we need a way to construct a BDD representing such a strided interval efficiently.
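As a plain-value illustration of this notation (hypothetical names; the thesis needs the same set represented as a BDD rather than enumerated), a strided interval s[a, b] and the result for a True-terminal of height h could be written as:

// strided interval s[a, b]: the set {a, a + s, a + 2s, ..., b}; assumes a positive stride
case class StridedInterval(stride: Long, lo: Long, hi: Long) {
  def elements: Set[Long] = (lo to hi by stride).toSet
}

// recursive result for a True-terminal of height h multiplied by y: y[0, (2^h - 1) * y]
def trueTerminalResult(h: Int, y: Long): StridedInterval =
  StridedInterval(stride = y, lo = 0L, hi = ((1L << h) - 1) * y)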


More information

Static Program Analysis CS701

Static Program Analysis CS701 Static Program Analysis CS701 Thomas Reps [Based on notes taken by Aditya Venkataraman on Oct 6th, 2015] Abstract This lecture introduces the area of static program analysis. We introduce the topics to

More information

4 Fractional Dimension of Posets from Trees

4 Fractional Dimension of Posets from Trees 57 4 Fractional Dimension of Posets from Trees In this last chapter, we switch gears a little bit, and fractionalize the dimension of posets We start with a few simple definitions to develop the language

More information

Advanced Programming Methods. Introduction in program analysis

Advanced Programming Methods. Introduction in program analysis Advanced Programming Methods Introduction in program analysis What is Program Analysis? Very broad topic, but generally speaking, automated analysis of program behavior Program analysis is about developing

More information

Chapter S:II. II. Search Space Representation

Chapter S:II. II. Search Space Representation Chapter S:II II. Search Space Representation Systematic Search Encoding of Problems State-Space Representation Problem-Reduction Representation Choosing a Representation S:II-1 Search Space Representation

More information

Crit-bit Trees. Adam Langley (Version )

Crit-bit Trees. Adam Langley (Version ) CRITBIT CWEB OUTPUT 1 Crit-bit Trees Adam Langley (agl@imperialviolet.org) (Version 20080926) 1. Introduction This code is taken from Dan Bernstein s qhasm and implements a binary crit-bit (alsa known

More information

FINALTERM EXAMINATION Fall 2009 CS301- Data Structures Question No: 1 ( Marks: 1 ) - Please choose one The data of the problem is of 2GB and the hard

FINALTERM EXAMINATION Fall 2009 CS301- Data Structures Question No: 1 ( Marks: 1 ) - Please choose one The data of the problem is of 2GB and the hard FINALTERM EXAMINATION Fall 2009 CS301- Data Structures Question No: 1 The data of the problem is of 2GB and the hard disk is of 1GB capacity, to solve this problem we should Use better data structures

More information

3.7 Denotational Semantics

3.7 Denotational Semantics 3.7 Denotational Semantics Denotational semantics, also known as fixed-point semantics, associates to each programming language construct a well-defined and rigorously understood mathematical object. These

More information

18.3 Deleting a key from a B-tree

18.3 Deleting a key from a B-tree 18.3 Deleting a key from a B-tree B-TREE-DELETE deletes the key from the subtree rooted at We design it to guarantee that whenever it calls itself recursively on a node, the number of keys in is at least

More information

COS 320. Compiling Techniques

COS 320. Compiling Techniques Topic 5: Types COS 320 Compiling Techniques Princeton University Spring 2016 Lennart Beringer 1 Types: potential benefits (I) 2 For programmers: help to eliminate common programming mistakes, particularly

More information

CS4215 Programming Language Implementation. Martin Henz

CS4215 Programming Language Implementation. Martin Henz CS4215 Programming Language Implementation Martin Henz Thursday 15 March, 2012 2 Chapter 11 impl: A Simple Imperative Language 11.1 Introduction So far, we considered only languages, in which an identifier

More information

2386 IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 52, NO. 6, JUNE 2006

2386 IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 52, NO. 6, JUNE 2006 2386 IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 52, NO. 6, JUNE 2006 The Encoding Complexity of Network Coding Michael Langberg, Member, IEEE, Alexander Sprintson, Member, IEEE, and Jehoshua Bruck,

More information

DLD VIDYA SAGAR P. potharajuvidyasagar.wordpress.com. Vignana Bharathi Institute of Technology UNIT 1 DLD P VIDYA SAGAR

DLD VIDYA SAGAR P. potharajuvidyasagar.wordpress.com. Vignana Bharathi Institute of Technology UNIT 1 DLD P VIDYA SAGAR UNIT I Digital Systems: Binary Numbers, Octal, Hexa Decimal and other base numbers, Number base conversions, complements, signed binary numbers, Floating point number representation, binary codes, error

More information

Unit 4: Formal Verification

Unit 4: Formal Verification Course contents Unit 4: Formal Verification Logic synthesis basics Binary-decision diagram (BDD) Verification Logic optimization Technology mapping Readings Chapter 11 Unit 4 1 Logic Synthesis & Verification

More information

What Every Programmer Should Know About Floating-Point Arithmetic

What Every Programmer Should Know About Floating-Point Arithmetic What Every Programmer Should Know About Floating-Point Arithmetic Last updated: October 15, 2015 Contents 1 Why don t my numbers add up? 3 2 Basic Answers 3 2.1 Why don t my numbers, like 0.1 + 0.2 add

More information

Solutions to Midterm 2 - Monday, July 11th, 2009

Solutions to Midterm 2 - Monday, July 11th, 2009 Solutions to Midterm - Monday, July 11th, 009 CPSC30, Summer009. Instructor: Dr. Lior Malka. (liorma@cs.ubc.ca) 1. Dynamic programming. Let A be a set of n integers A 1,..., A n such that 1 A i n for each

More information

9/24/ Hash functions

9/24/ Hash functions 11.3 Hash functions A good hash function satis es (approximately) the assumption of SUH: each key is equally likely to hash to any of the slots, independently of the other keys We typically have no way

More information

Compiler Design Prof. Y. N. Srikant Department of Computer Science and Automation Indian Institute of Science, Bangalore

Compiler Design Prof. Y. N. Srikant Department of Computer Science and Automation Indian Institute of Science, Bangalore Compiler Design Prof. Y. N. Srikant Department of Computer Science and Automation Indian Institute of Science, Bangalore Module No. # 10 Lecture No. # 16 Machine-Independent Optimizations Welcome to the

More information

Conditional Elimination through Code Duplication

Conditional Elimination through Code Duplication Conditional Elimination through Code Duplication Joachim Breitner May 27, 2011 We propose an optimizing transformation which reduces program runtime at the expense of program size by eliminating conditional

More information

Trees. 3. (Minimally Connected) G is connected and deleting any of its edges gives rise to a disconnected graph.

Trees. 3. (Minimally Connected) G is connected and deleting any of its edges gives rise to a disconnected graph. Trees 1 Introduction Trees are very special kind of (undirected) graphs. Formally speaking, a tree is a connected graph that is acyclic. 1 This definition has some drawbacks: given a graph it is not trivial

More information

Chapter S:V. V. Formal Properties of A*

Chapter S:V. V. Formal Properties of A* Chapter S:V V. Formal Properties of A* Properties of Search Space Graphs Auxiliary Concepts Roadmap Completeness of A* Admissibility of A* Efficiency of A* Monotone Heuristic Functions S:V-1 Formal Properties

More information

15.4 Longest common subsequence

15.4 Longest common subsequence 15.4 Longest common subsequence Biological applications often need to compare the DNA of two (or more) different organisms A strand of DNA consists of a string of molecules called bases, where the possible

More information

6. Relational Algebra (Part II)

6. Relational Algebra (Part II) 6. Relational Algebra (Part II) 6.1. Introduction In the previous chapter, we introduced relational algebra as a fundamental model of relational database manipulation. In particular, we defined and discussed

More information

Basic Properties The Definition of Catalan Numbers

Basic Properties The Definition of Catalan Numbers 1 Basic Properties 1.1. The Definition of Catalan Numbers There are many equivalent ways to define Catalan numbers. In fact, the main focus of this monograph is the myriad combinatorial interpretations

More information

Programming Languages and Compilers Qualifying Examination. Answer 4 of 6 questions.1

Programming Languages and Compilers Qualifying Examination. Answer 4 of 6 questions.1 Programming Languages and Compilers Qualifying Examination Monday, September 19, 2016 Answer 4 of 6 questions.1 GENERAL INSTRUCTIONS 1. Answer each question in a separate book. 2. Indicate on the cover

More information

Worst-case running time for RANDOMIZED-SELECT

Worst-case running time for RANDOMIZED-SELECT Worst-case running time for RANDOMIZED-SELECT is ), even to nd the minimum The algorithm has a linear expected running time, though, and because it is randomized, no particular input elicits the worst-case

More information

Lecture 2: Big-Step Semantics

Lecture 2: Big-Step Semantics Lecture 2: Big-Step Semantics 1 Representing Abstract Syntax These are examples of arithmetic expressions: 2 * 4 1 + 2 + 3 5 * 4 * 2 1 + 2 * 3 We all know how to evaluate these expressions in our heads.

More information

ALGORITHMS EXAMINATION Department of Computer Science New York University December 17, 2007

ALGORITHMS EXAMINATION Department of Computer Science New York University December 17, 2007 ALGORITHMS EXAMINATION Department of Computer Science New York University December 17, 2007 This examination is a three hour exam. All questions carry the same weight. Answer all of the following six questions.

More information

CPS122 Lecture: From Python to Java

CPS122 Lecture: From Python to Java Objectives: CPS122 Lecture: From Python to Java last revised January 7, 2013 1. To introduce the notion of a compiled language 2. To introduce the notions of data type and a statically typed language 3.

More information

Unit-II Programming and Problem Solving (BE1/4 CSE-2)

Unit-II Programming and Problem Solving (BE1/4 CSE-2) Unit-II Programming and Problem Solving (BE1/4 CSE-2) Problem Solving: Algorithm: It is a part of the plan for the computer program. An algorithm is an effective procedure for solving a problem in a finite

More information

FUTURE communication networks are expected to support

FUTURE communication networks are expected to support 1146 IEEE/ACM TRANSACTIONS ON NETWORKING, VOL 13, NO 5, OCTOBER 2005 A Scalable Approach to the Partition of QoS Requirements in Unicast and Multicast Ariel Orda, Senior Member, IEEE, and Alexander Sprintson,

More information

A Gentle Introduction to Program Analysis

A Gentle Introduction to Program Analysis A Gentle Introduction to Program Analysis Işıl Dillig University of Texas, Austin January 21, 2014 Programming Languages Mentoring Workshop 1 / 24 What is Program Analysis? Very broad topic, but generally

More information

CPS122 Lecture: From Python to Java last revised January 4, Objectives:

CPS122 Lecture: From Python to Java last revised January 4, Objectives: Objectives: CPS122 Lecture: From Python to Java last revised January 4, 2017 1. To introduce the notion of a compiled language 2. To introduce the notions of data type and a statically typed language 3.

More information

Incompatibility Dimensions and Integration of Atomic Commit Protocols

Incompatibility Dimensions and Integration of Atomic Commit Protocols The International Arab Journal of Information Technology, Vol. 5, No. 4, October 2008 381 Incompatibility Dimensions and Integration of Atomic Commit Protocols Yousef Al-Houmaily Department of Computer

More information

Principles of Program Analysis: A Sampler of Approaches

Principles of Program Analysis: A Sampler of Approaches Principles of Program Analysis: A Sampler of Approaches Transparencies based on Chapter 1 of the book: Flemming Nielson, Hanne Riis Nielson and Chris Hankin: Principles of Program Analysis Springer Verlag

More information

Lecture 5: Suffix Trees

Lecture 5: Suffix Trees Longest Common Substring Problem Lecture 5: Suffix Trees Given a text T = GGAGCTTAGAACT and a string P = ATTCGCTTAGCCTA, how do we find the longest common substring between them? Here the longest common

More information

Discrete Optimization. Lecture Notes 2

Discrete Optimization. Lecture Notes 2 Discrete Optimization. Lecture Notes 2 Disjunctive Constraints Defining variables and formulating linear constraints can be straightforward or more sophisticated, depending on the problem structure. The

More information

Principles of Program Analysis. Lecture 1 Harry Xu Spring 2013

Principles of Program Analysis. Lecture 1 Harry Xu Spring 2013 Principles of Program Analysis Lecture 1 Harry Xu Spring 2013 An Imperfect World Software has bugs The northeast blackout of 2003, affected 10 million people in Ontario and 45 million in eight U.S. states

More information

3 No-Wait Job Shops with Variable Processing Times

3 No-Wait Job Shops with Variable Processing Times 3 No-Wait Job Shops with Variable Processing Times In this chapter we assume that, on top of the classical no-wait job shop setting, we are given a set of processing times for each operation. We may select

More information

Problem with Scanning an Infix Expression

Problem with Scanning an Infix Expression Operator Notation Consider the infix expression (X Y) + (W U), with parentheses added to make the evaluation order perfectly obvious. This is an arithmetic expression written in standard form, called infix

More information

Job-shop scheduling with limited capacity buffers

Job-shop scheduling with limited capacity buffers Job-shop scheduling with limited capacity buffers Peter Brucker, Silvia Heitmann University of Osnabrück, Department of Mathematics/Informatics Albrechtstr. 28, D-49069 Osnabrück, Germany {peter,sheitman}@mathematik.uni-osnabrueck.de

More information

Compiler Construction 2016/2017 Loop Optimizations

Compiler Construction 2016/2017 Loop Optimizations Compiler Construction 2016/2017 Loop Optimizations Peter Thiemann January 16, 2017 Outline 1 Loops 2 Dominators 3 Loop-Invariant Computations 4 Induction Variables 5 Array-Bounds Checks 6 Loop Unrolling

More information

Crit-bit Trees. Adam Langley (Version )

Crit-bit Trees. Adam Langley (Version ) Crit-bit Trees Adam Langley (agl@imperialviolet.org) (Version 20080926) 1. Introduction This code is taken from Dan Bernstein s qhasm and implements a binary crit-bit (alsa known as PATRICA) tree for NUL

More information

III Data Structures. Dynamic sets

III Data Structures. Dynamic sets III Data Structures Elementary Data Structures Hash Tables Binary Search Trees Red-Black Trees Dynamic sets Sets are fundamental to computer science Algorithms may require several different types of operations

More information

DIGITAL ARITHMETIC: OPERATIONS AND CIRCUITS

DIGITAL ARITHMETIC: OPERATIONS AND CIRCUITS C H A P T E R 6 DIGITAL ARITHMETIC: OPERATIONS AND CIRCUITS OUTLINE 6- Binary Addition 6-2 Representing Signed Numbers 6-3 Addition in the 2 s- Complement System 6-4 Subtraction in the 2 s- Complement

More information

ABC basics (compilation from different articles)

ABC basics (compilation from different articles) 1. AIG construction 2. AIG optimization 3. Technology mapping ABC basics (compilation from different articles) 1. BACKGROUND An And-Inverter Graph (AIG) is a directed acyclic graph (DAG), in which a node

More information

Computing intersections in a set of line segments: the Bentley-Ottmann algorithm

Computing intersections in a set of line segments: the Bentley-Ottmann algorithm Computing intersections in a set of line segments: the Bentley-Ottmann algorithm Michiel Smid October 14, 2003 1 Introduction In these notes, we introduce a powerful technique for solving geometric problems.

More information

SDD Advanced-User Manual Version 1.1

SDD Advanced-User Manual Version 1.1 SDD Advanced-User Manual Version 1.1 Arthur Choi and Adnan Darwiche Automated Reasoning Group Computer Science Department University of California, Los Angeles Email: sdd@cs.ucla.edu Download: http://reasoning.cs.ucla.edu/sdd

More information

Number Systems CHAPTER Positional Number Systems

Number Systems CHAPTER Positional Number Systems CHAPTER 2 Number Systems Inside computers, information is encoded as patterns of bits because it is easy to construct electronic circuits that exhibit the two alternative states, 0 and 1. The meaning of

More information

Modal Logic: Implications for Design of a Language for Distributed Computation p.1/53

Modal Logic: Implications for Design of a Language for Distributed Computation p.1/53 Modal Logic: Implications for Design of a Language for Distributed Computation Jonathan Moody (with Frank Pfenning) Department of Computer Science Carnegie Mellon University Modal Logic: Implications for

More information

DOWNLOAD PDF BIG IDEAS MATH VERTICAL SHRINK OF A PARABOLA

DOWNLOAD PDF BIG IDEAS MATH VERTICAL SHRINK OF A PARABOLA Chapter 1 : BioMath: Transformation of Graphs Use the results in part (a) to identify the vertex of the parabola. c. Find a vertical line on your graph paper so that when you fold the paper, the left portion

More information

Lecture Notes: Widening Operators and Collecting Semantics

Lecture Notes: Widening Operators and Collecting Semantics Lecture Notes: Widening Operators and Collecting Semantics 15-819O: Program Analysis (Spring 2016) Claire Le Goues clegoues@cs.cmu.edu 1 A Collecting Semantics for Reaching Definitions The approach to

More information

Compiler Construction 2010/2011 Loop Optimizations

Compiler Construction 2010/2011 Loop Optimizations Compiler Construction 2010/2011 Loop Optimizations Peter Thiemann January 25, 2011 Outline 1 Loop Optimizations 2 Dominators 3 Loop-Invariant Computations 4 Induction Variables 5 Array-Bounds Checks 6

More information

Randomized Jumplists With Several Jump Pointers. Elisabeth Neumann

Randomized Jumplists With Several Jump Pointers. Elisabeth Neumann Randomized Jumplists With Several Jump Pointers Elisabeth Neumann Bachelor Thesis Randomized Jumplists With Several Jump Pointers submitted by Elisabeth Neumann 31. March 2015 TU Kaiserslautern Fachbereich

More information

Greedy Algorithms CHAPTER 16

Greedy Algorithms CHAPTER 16 CHAPTER 16 Greedy Algorithms In dynamic programming, the optimal solution is described in a recursive manner, and then is computed ``bottom up''. Dynamic programming is a powerful technique, but it often

More information

Intermediate Code Generation

Intermediate Code Generation Intermediate Code Generation In the analysis-synthesis model of a compiler, the front end analyzes a source program and creates an intermediate representation, from which the back end generates target

More information

Formal semantics of loosely typed languages. Joep Verkoelen Vincent Driessen

Formal semantics of loosely typed languages. Joep Verkoelen Vincent Driessen Formal semantics of loosely typed languages Joep Verkoelen Vincent Driessen June, 2004 ii Contents 1 Introduction 3 2 Syntax 5 2.1 Formalities.............................. 5 2.2 Example language LooselyWhile.................

More information

Lecture 24 Search in Graphs

Lecture 24 Search in Graphs Lecture 24 Search in Graphs 15-122: Principles of Imperative Computation (Fall 2018) Frank Pfenning, André Platzer, Rob Simmons, Penny Anderson, Iliano Cervesato In this lecture, we will discuss the question

More information

Table : IEEE Single Format ± a a 2 a 3 :::a 8 b b 2 b 3 :::b 23 If exponent bitstring a :::a 8 is Then numerical value represented is ( ) 2 = (

Table : IEEE Single Format ± a a 2 a 3 :::a 8 b b 2 b 3 :::b 23 If exponent bitstring a :::a 8 is Then numerical value represented is ( ) 2 = ( Floating Point Numbers in Java by Michael L. Overton Virtually all modern computers follow the IEEE 2 floating point standard in their representation of floating point numbers. The Java programming language

More information

Final Labs and Tutors

Final Labs and Tutors ICT106 Fundamentals of Computer Systems - Topic 2 REPRESENTATION AND STORAGE OF INFORMATION Reading: Linux Assembly Programming Language, Ch 2.4-2.9 and 3.6-3.8 Final Labs and Tutors Venue and time South

More information

CS 275 Automata and Formal Language Theory. First Problem of URMs. (a) Definition of the Turing Machine. III.3 (a) Definition of the Turing Machine

CS 275 Automata and Formal Language Theory. First Problem of URMs. (a) Definition of the Turing Machine. III.3 (a) Definition of the Turing Machine CS 275 Automata and Formal Language Theory Course Notes Part III: Limits of Computation Chapt. III.3: Turing Machines Anton Setzer http://www.cs.swan.ac.uk/ csetzer/lectures/ automataformallanguage/13/index.html

More information

Chapter 9. Software Testing

Chapter 9. Software Testing Chapter 9. Software Testing Table of Contents Objectives... 1 Introduction to software testing... 1 The testers... 2 The developers... 2 An independent testing team... 2 The customer... 2 Principles of

More information

Consistency and Set Intersection

Consistency and Set Intersection Consistency and Set Intersection Yuanlin Zhang and Roland H.C. Yap National University of Singapore 3 Science Drive 2, Singapore {zhangyl,ryap}@comp.nus.edu.sg Abstract We propose a new framework to study

More information

We will give examples for each of the following commonly used algorithm design techniques:

We will give examples for each of the following commonly used algorithm design techniques: Review This set of notes provides a quick review about what should have been learned in the prerequisite courses. The review is helpful to those who have come from a different background; or to those who

More information

γ 2 γ 3 γ 1 R 2 (b) a bounded Yin set (a) an unbounded Yin set

γ 2 γ 3 γ 1 R 2 (b) a bounded Yin set (a) an unbounded Yin set γ 1 γ 3 γ γ 3 γ γ 1 R (a) an unbounded Yin set (b) a bounded Yin set Fig..1: Jordan curve representation of a connected Yin set M R. A shaded region represents M and the dashed curves its boundary M that

More information

Chapter 3 Trees. Theorem A graph T is a tree if, and only if, every two distinct vertices of T are joined by a unique path.

Chapter 3 Trees. Theorem A graph T is a tree if, and only if, every two distinct vertices of T are joined by a unique path. Chapter 3 Trees Section 3. Fundamental Properties of Trees Suppose your city is planning to construct a rapid rail system. They want to construct the most economical system possible that will meet the

More information

EDAA40 At home exercises 1

EDAA40 At home exercises 1 EDAA40 At home exercises 1 1. Given, with as always the natural numbers starting at 1, let us define the following sets (with iff ): Give the number of elements in these sets as follows: 1. 23 2. 6 3.

More information

Excerpt from: Stephen H. Unger, The Essence of Logic Circuits, Second Ed., Wiley, 1997

Excerpt from: Stephen H. Unger, The Essence of Logic Circuits, Second Ed., Wiley, 1997 Excerpt from: Stephen H. Unger, The Essence of Logic Circuits, Second Ed., Wiley, 1997 APPENDIX A.1 Number systems and codes Since ten-fingered humans are addicted to the decimal system, and since computers

More information

Generalized Network Flow Programming

Generalized Network Flow Programming Appendix C Page Generalized Network Flow Programming This chapter adapts the bounded variable primal simplex method to the generalized minimum cost flow problem. Generalized networks are far more useful

More information

Computer Science Technical Report

Computer Science Technical Report Computer Science Technical Report Feasibility of Stepwise Addition of Multitolerance to High Atomicity Programs Ali Ebnenasir and Sandeep S. Kulkarni Michigan Technological University Computer Science

More information

Semantics via Syntax. f (4) = if define f (x) =2 x + 55.

Semantics via Syntax. f (4) = if define f (x) =2 x + 55. 1 Semantics via Syntax The specification of a programming language starts with its syntax. As every programmer knows, the syntax of a language comes in the shape of a variant of a BNF (Backus-Naur Form)

More information

Binary Decision Diagrams

Binary Decision Diagrams Logic and roof Hilary 2016 James Worrell Binary Decision Diagrams A propositional formula is determined up to logical equivalence by its truth table. If the formula has n variables then its truth table

More information

Introduction to Visual Basic and Visual C++ Arithmetic Expression. Arithmetic Expression. Using Arithmetic Expression. Lesson 4.

Introduction to Visual Basic and Visual C++ Arithmetic Expression. Arithmetic Expression. Using Arithmetic Expression. Lesson 4. Introduction to Visual Basic and Visual C++ Arithmetic Expression Lesson 4 Calculation I154-1-A A @ Peter Lo 2010 1 I154-1-A A @ Peter Lo 2010 2 Arithmetic Expression Using Arithmetic Expression Calculations

More information

II (Sorting and) Order Statistics

II (Sorting and) Order Statistics II (Sorting and) Order Statistics Heapsort Quicksort Sorting in Linear Time Medians and Order Statistics 8 Sorting in Linear Time The sorting algorithms introduced thus far are comparison sorts Any comparison

More information

Design and Analysis of Algorithms Prof. Madhavan Mukund Chennai Mathematical Institute. Week 02 Module 06 Lecture - 14 Merge Sort: Analysis

Design and Analysis of Algorithms Prof. Madhavan Mukund Chennai Mathematical Institute. Week 02 Module 06 Lecture - 14 Merge Sort: Analysis Design and Analysis of Algorithms Prof. Madhavan Mukund Chennai Mathematical Institute Week 02 Module 06 Lecture - 14 Merge Sort: Analysis So, we have seen how to use a divide and conquer strategy, we

More information

Model Checking I Binary Decision Diagrams

Model Checking I Binary Decision Diagrams /42 Model Checking I Binary Decision Diagrams Edmund M. Clarke, Jr. School of Computer Science Carnegie Mellon University Pittsburgh, PA 523 2/42 Binary Decision Diagrams Ordered binary decision diagrams

More information

CONSECUTIVE INTEGERS AND THE COLLATZ CONJECTURE. Marcus Elia Department of Mathematics, SUNY Geneseo, Geneseo, NY

CONSECUTIVE INTEGERS AND THE COLLATZ CONJECTURE. Marcus Elia Department of Mathematics, SUNY Geneseo, Geneseo, NY CONSECUTIVE INTEGERS AND THE COLLATZ CONJECTURE Marcus Elia Department of Mathematics, SUNY Geneseo, Geneseo, NY mse1@geneseo.edu Amanda Tucker Department of Mathematics, University of Rochester, Rochester,

More information