Running Programs Backwards: Instruction Inversion for Effective Search in Semantic Spaces

Bartosz Wieloch (bwieloch@cs.put.poznan.pl), Krzysztof Krawiec (kkrawiec@cs.put.poznan.pl)
Institute of Computing Science, Poznan University of Technology, Piotrowo 2, 60-965 Poznań, Poland

ABSTRACT

The instructions used for solving typical genetic programming tasks have strong mathematical properties. In this study, we leverage one such property: invertibility. A search operator is proposed that performs an approximate reverse execution of program fragments, trying to determine in this way the desired semantics (partial outcome) at intermediate stages of program execution. The desired semantics determined in this way guides the choice of a subprogram that replaces the old program fragment. An extensive computational experiment on 20 symbolic regression and Boolean domain problems leads to statistically significant evidence that the proposed Random Desired Operator outperforms all typical combinations of conventional mutation and crossover operators.

Categories and Subject Descriptors

I.2.2 [Artificial Intelligence]: Automatic Programming; I.2.8 [Artificial Intelligence]: Problem Solving, Control Methods, and Search: Heuristic methods

Keywords

genetic programming, program semantics, desired semantics, search operators, instruction inversion

1. INTRODUCTION

The conventional search operators used in genetic programming (GP) make few or no assumptions about the properties of instructions from the programming language under consideration. Essentially, the only attribute of an instruction that such operators have to be aware of is its arity (and the types of the instruction's inputs and outputs, if types are considered). A common argument for this attitude is generality: by abstracting from the internals of instructions, operators like tree-swapping crossover or subtree-replacement mutation can indeed be regarded as universal.

This argument is, however, strikingly inconsistent with GP practice, which to a great extent focuses on symbolic regression, Boolean function synthesis, and other programming domains with instruction sets strongly rooted in mathematical foundations. These foundations endow the instructions with formal properties that can potentially be exploited. Following this observation, this paper proposes a straightforward method for exploiting the property of (partial or complete) invertibility of instructions. This property, shared by most arithmetic and logic instructions, enables us to determine the desired semantics at any arbitrary location (locus) in a program. This ability is leveraged in a new GP search operator that is the main contribution of this paper, and which turns out to significantly outperform the conventional operators.

2. GENERAL IDEA

In this paper, for simplicity, we concentrate on the canonical tree-based GP as proposed by Koza [6]. We assume that subprograms (subtrees) in programs can be replaced by independently generated subtrees (procedures).
This assumption is not constraining, as it is also required by standard GP search operators. Procedures can be executed independently of the rest of a program. Moreover, the execution of a subprogram returns a certain result, in exactly the same manner as the execution of the entire program does. However, let us emphasize that our approach can likewise be adapted to other variants of GP, like Linear GP [1] or Cartesian GP [10]. Moreover, the proposed general idea may be used even in non-evolutionary metaheuristics. For example, it may be applied as an operator for generating a neighbor solution in local search.

In the following, we limit our considerations to GP tasks in which fitness calculation is based on fitness cases, meant as a list of pairs composed of input data and the corresponding desired (correct) output. In such a case, following [8], we may define the program semantics as the list of outputs that are actually produced by the program for all fitness cases. Additionally, by the target semantics we mean the semantics of an ideal solution (i.e., the target semantics equals the list of the desired outputs defined in the list of fitness cases). Obviously, the closer the semantics of a program is to the target semantics, the better the program. Therefore, the distance between the semantics of an individual and the target semantics can be treated as a minimized fitness value. In practice, most benchmark problems considered in the literature define the fitness value in the above way, even if the term program semantics is not used explicitly.
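To make these notions concrete, the following sketch (ours, not the authors' code; Python is used for illustration and all identifiers are hypothetical) evaluates a tuple-encoded program tree on a list of fitness cases, collects its semantics, and measures fitness as a Euclidean distance to the target semantics; the paper does not prescribe a particular distance, so Euclidean is assumed here for concreteness.

    import math

    # A program is a nested tuple: ('add'|'sub'|'mul', left, right), ('x',) or ('const', 1.0).
    def run(node, x):
        op = node[0]
        if op == 'x':
            return x
        if op == 'const':
            return node[1]
        a, b = run(node[1], x), run(node[2], x)
        return a + b if op == 'add' else a - b if op == 'sub' else a * b

    def semantics(program, inputs):
        # The semantics of a (sub)program: its outputs over all fitness cases.
        return [run(program, x) for x in inputs]

    def fitness(program, inputs, target):
        # Minimized fitness: distance between program semantics and target semantics.
        s = semantics(program, inputs)
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(s, target)))

    # Example: the target semantics of x^2 - x over fitness cases x = -1, 0, 1 is [2, 0, 0].
    inputs = [-1.0, 0.0, 1.0]
    target = [2.0, 0.0, 0.0]
    candidate = ('mul', ('x',), ('x',))          # x * x
    print(fitness(candidate, inputs, target))    # nonzero: semantics [1, 0, 1] != [2, 0, 0]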

Now, let us imagine that we have an oracle that can tell us what any given fragment (called a part in the following) of a program should return for a given fitness case to make the entire program yield the correct output. Because the oracle's verdicts do not depend on the actual part in question, only the rest of the entire program (called the context) is relevant. In the case of tree-based GP, we will identify a context with an incomplete tree that is missing a single branch (part). Thus, an entire program can be assembled by combining a context with a part (subtree). We will say that a context accepts a part if the resulting program returns the correct output. For a given context, the list of values obtained from the oracle for each fitness case constitutes the desired semantics of the context (desired semantics for short).

The rather obvious yet important observation for this study is that finding a part (subprogram) with semantics equal to the desired semantics of some context is equivalent to solving the entire GP task, as such a part combined with the context forms the ultimate, optimal solution. In this way, the ideal solution could be created in just a single step. Because it is not always possible (or technically feasible) to find a part with semantics equal to the desired semantics, we are interested here in minimizing this discrepancy. Our hypothesis is that decreasing the distance between the desired semantics and the semantics of a part can at least bring us closer to solving the problem.

Now, to apply this idea in practice, we need two things: (1) the oracle (i.e., a computationally feasible method to calculate the desired semantics), and (2) a source of parts which will be matched to a given context. In the next section, we present an algorithm for calculating the approximate desired semantics for two different problem domains: real-valued functions (symbolic regression problems) and the Boolean domain (synthesis of logic functions). In relation to the second requirement, there are many possible ways to provide the set of parts to match the context (called the library in the following). It may be a library of intentionally designed subprograms, a sample of random subprograms, or even an exhaustive set of all programs within certain constraints (e.g., maximal size [7]). In this paper we present an efficient and uncontroversial form of library, suitable for working with population-based approaches. Specifically, the library of parts is constructed from all parts of the individuals in the current population. Apart from computational efficiency, this approach avoids, among other things, human biases: the library contains only code fragments that have already evolved.

3. DESIRED SEMANTICS

In this section we describe the concept of desired semantics in more detail. Firstly, we show the possible types of the oracle's answers. These considerations dictate to some extent the representation of desired semantics used in this paper, which we describe subsequently. Finally, we present a simple method for calculating the desired semantics for so-called partially invertible instructions.
3.1 Possible Situations

In general, it is convenient for us to assume that the oracle, when queried with a specific context, returns a set of desired values. The reason for this choice is the fact that we have to consider the following four possible situations:

1. There exists exactly one value which causes the context to return the correct output.
2. There exists more than one such value (either a finite or an infinite number of them).
3. Any value fed into the context causes the semantics of the entire program to reach the target. In other words, the missing part in this context is an intron and does not have any influence on the final behavior of the program.
4. No matter what is fed into the context, the resulting program will not attain the target. In such a case, the context has to be changed to make reaching the target possible.

Figure 1: Examples of the four situations concerning the oracle's answer for a zero-valued target (i.e., t = 0): (a) one value, context 1 − (□ · 1); (b) multiple values (two, infinitely many), contexts 1 − □² and 1 − cos(□); (c) any value (value is insignificant), context 0 · sin(□); (d) none (inconsistent context), context 1 + exp(□). The subtitles report the number of accepted values; □ denotes the missing subtree of the context.

To illustrate the above situations, let us assume that the task is to evolve a real-valued mathematical expression which identically equals zero, i.e., its target semantics is t = 0. For this problem, we can easily design contexts that represent the above categories, shown in Figure 1. The first context encodes the expression 1 − (□ · 1), where □ is the missing subtree, and accepts exactly one semantics: only by replacing □ with a subtree returning 1 can we obtain an expression which equals 0. The next two contexts accept more than one value. However, the context 1 − □² accepts only two possible semantics, 1 and −1, whereas the context 1 − cos(□) accepts any subtree returning 2πn, n ∈ ℤ (thus, infinitely many values are accepted in this case). In the next example, where the context is 0 · sin(□), whatever subtree is pasted in place of □, the whole expression will return zero anyway. Finally, the last example, 1 + exp(□), shows a situation in which the resulting expression is always greater than 1, regardless of the semantics substituted in place of □. That context cannot be used to construct a solution to our toy problem.

Because the desired semantics contains the oracle's answers for each fitness case, every one of its elements should be able to express all four of the considered situations. However, the situation where a context accepts many values (in particular, infinitely many) is not convenient to encode.

Moreover, this may lead to an exponential growth of the memory demand and of the complexity of the computations required to calculate the desired semantics. To avoid such problems, from now on we adopt a technically feasible but formally incomplete solution, where the desired semantics stores, for each fitness case, at most one arbitrarily chosen value from the set returned by the oracle. The elements of such simplified desired semantics are allowed to express one of three situations (instead of four):

- Only one concrete value is acceptable (possibly an arbitrarily chosen one).
- Any value is acceptable (i.e., "don't care"); such elements will be called insignificant in the following.
- No value is acceptable; such an element is inconsistent.

The simplified desired semantics can be expressed as a list (similarly to conventional program semantics as introduced in Section 2), where each element of the list is a concrete value or one of two special values used to encode the undefined elements: insignificant ("don't care") and inconsistent. In the following we will identify desired semantics with this simplified version of it, which does not require encoding of alternative values.

It is important to notice that switching to the simplified desired semantics can introduce a certain bias and omit potentially good parts. To illustrate this, let us continue the example presented earlier, where the goal was to evolve an expression returning zero (see Figure 1b). For the context 1 − □², our simplified desired semantics will contain either 1 or −1 (the other value will be discarded). Similarly, for the context 1 − cos(□), the simplified desired semantics will contain a specific multiple of 2π (e.g., 0), and a part returning a different multiple of 2π (e.g., 2π) will be treated as committing some error.

3.2 Calculating the Desired Semantics

Let us assume for a while that a method calculating the desired input value for a single instruction is given. In such a case, it becomes straightforward to design an algorithm that calculates the desired semantics of the entire context. The algorithm starts from the root node and proceeds along the path to the missing subtree (□) of the context. In each step, the desired semantics of the subsequent node on the path is calculated. Thus, at the beginning, the desired argument for the instruction located at the root node is calculated (the given target semantics is simultaneously the desired semantics of the root). Then, recursively, all consecutive nodes on the path are processed. For each node, the semantics of all its subtrees (arguments) are known; therefore, the only unknown for each node is the sought desired semantics. The last calculated value is the desired semantics of the context.

Algorithm 1 Calculating the desired semantics of a context.
1: procedure DesiredSemantics(c, t)    ▷ for context c and target semantics t
2:   L ← list of nodes on the path from the root of c to the subtree missing in c (to the □ node)
3:   d ← t    ▷ desired semantics of an empty context
4:   for all n ∈ L do
5:     S ← semantics of all child subtrees x of node n such that x ∉ L
6:     d ← n⁻¹(d, S)    ▷ calculate the desired values using the inverse of instruction n
7:   return d
8: end procedure
Algorithm 1 presents the pseudocode of this procedure. In this process of semantic backpropagation, the special insignificant and inconsistent values of the calculated desired semantics are directly propagated.

To calculate the desired semantics of a context, all instructions on the abovementioned path should be invertible. For an invertible instruction, its inversion has the same properties as an inverse function in mathematics. This means that it must be possible to calculate the desired input for each of the used functions (instructions), given the values of all the other inputs (the remaining arguments of the function) and the expected output (the result of the function). For functions that are only partially invertible, some elements of the calculated desired semantics can be ambiguous or inconsistent (cf. the example in Fig. 1). It should also be noticed that the invertibility requirement means that the used functions cannot be black boxes: we must know how to calculate the desired argument to get an expected function value (i.e., we must know the inverse functions).

As an example of the calculations conducted by Algorithm 1, let us consider the symbolic regression task of evolving the expression x² − x (or, to be precise, of evolving an expression with semantics equivalent to that of x² − x). The only input variable is x, and there are three fitness cases, for which x assumes the values −1, 0, and 1, respectively. Thus, the semantics of the terminal node x equals [−1, 0, 1] and the target semantics is [2, 0, 0]. Figure 2 shows an exemplary context, (□ · x) − x, with the semantics of all subtrees denoted in plain text (here only terminals).

Figure 2: An exemplary context, (□ · x) − x, with the calculated desired semantics (in bold). The target semantics of the task is [2, 0, 0], the semantics of the independent variable is [−1, 0, 1], and the desired semantics of the context is [−1, ?, 1].

The desired semantics, computed for the consecutive nodes on the path from the root node to the missing part of the context, are marked in bold. Starting from the root node, [2, 0, 0] is both the desired semantics of the empty context (empty program) and simultaneously the target semantics of the problem. The desired semantics for the context □ − x is [1, 0, 1]. Finally, the desired semantics of the context (□ · x) − x is [−1, ?, 1], with the question mark denoting an insignificant value. It does not matter what the second element of the semantics of the missing subtree is, because any value of this element is multiplied by zero and always yields zero.
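The following sketch (ours, not the authors' code) illustrates this semantic backpropagation for a few invertible arithmetic instructions, using string markers for the insignificant and inconsistent cases; the names and the path encoding are hypothetical. Run on the example above, it reproduces the desired semantics [−1, ?, 1] of the context (□ · x) − x.

    # Markers for the two special values of the simplified desired semantics.
    DONT_CARE = 'insignificant'
    INCONSISTENT = 'inconsistent'

    def invert(op, desired, other, missing_is_left):
        # Desired output of the missing child of instruction op, given the desired
        # output of op itself and the value of the other child (for one fitness case).
        if desired in (DONT_CARE, INCONSISTENT):
            return desired                      # special values are propagated directly
        if op == 'add':                         # missing + other = desired
            return desired - other
        if op == 'sub':                         # missing - other = desired, or other - missing = desired
            return desired + other if missing_is_left else other - desired
        if op == 'mul':                         # missing * other = desired
            if other == 0:
                return DONT_CARE if desired == 0 else INCONSISTENT
            return desired / other
        raise ValueError(op)

    def desired_semantics(path, target):
        # path: (op, other_semantics, missing_is_left) tuples from the root down to the
        # missing node, where other_semantics lists the other child's outputs per fitness case.
        desired = list(target)                  # desired semantics of the empty context
        for op, other_sem, missing_is_left in path:
            desired = [invert(op, d, o, missing_is_left)
                       for d, o in zip(desired, other_sem)]
        return desired

    x_sem = [-1.0, 0.0, 1.0]                    # semantics of the terminal x
    path = [('sub', x_sem, True),               # root: (...) - x, missing branch inside the left child
            ('mul', x_sem, True)]               # next: (missing) * x
    print(desired_semantics(path, [2.0, 0.0, 0.0]))
    # -> [-1.0, 'insignificant', 1.0]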

4. RANDOM DESIRED OPERATOR

The previous sections presented the conceptual framework of desired semantics. Here, we embed it into the evolutionary context by designing a concrete search operator, called the Random Desired Operator (RDO) in the following. Desired semantics determines the preferred semantic properties at a specific location in a program, but is incapable of synthesizing a suitable part (subtree). Rather than synthesizing such a part, RDO relies on a dynamically changing repository of ready-to-use parts, which we call the library, containing subtrees extracted from the individuals in the current population. Technically, in every generation, all subtrees from all current individuals are first collected. Next, semantic duplicates are eliminated: if two or more subtrees have the same semantics, only the one with the smallest subtree depth¹ remains in the set, while the others are discarded. This reduction to minimal subtrees with unique behaviors drastically decreases the library size. There are two reasons for this. Firstly, the majority of program fragments in the whole evolved population exist in many copies. Secondly, different genotypes often map to equivalent phenotypes, i.e., syntactically different programs can have the same semantics.

¹ We use the subtree depth criterion because we apply the same type of constraint to the evolutionary process, i.e., we limit the maximal tree depth. Other measures, e.g., the maximal number of nodes, might be more appropriate as well.

To create new solutions, RDO combines a selected context, extracted from a single parent individual, with the best matching subtree from the library, which is built anew in every generation. RDO is thus somewhat similar to the standard subtree-replacing mutation operator [6]. Specifically, RDO removes a randomly chosen subtree from the parent, but instead of generating a new random subtree in place of the old one, it looks for a subtree in the library. From the parts available there, it chooses the one whose semantics is most similar to the desired semantics of the context arising from removing the old subtree (see Algorithm 2). The undefined (i.e., both insignificant and inconsistent) elements in the desired semantics are ignored when calculating the semantic distance. Note that, given the way in which the library is built, RDO is most likely to insert a subtree taken from other individuals. Therefore, RDO may be seen as a specialized crossover operator (discarding the second child) which performs a certain form of mate selection with respect to the semantic utility of the partner's subtrees.

Algorithm 2 Random Desired Operator (RDO)
1: procedure RDO(p)
2:   r ← random node in program p
3:   c ← Context(p, r)    ▷ extract the context by removing subtree r from p
4:   s ← DesiredSemantics(c, t)
5:   r′ ← SearchLibrary(s)    ▷ find a subtree that best matches semantics s
6:   return the tree obtained from p by replacing subtree r with r′
7: end procedure
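A compact, self-contained sketch of the two ingredients that are specific to RDO, the semantically deduplicated library and the nearest-match lookup, might look as follows (our Python illustration under stated assumptions, not the authors' implementation; the final step of splicing the chosen part back into the parent is omitted).

    DONT_CARE, INCONSISTENT = 'insignificant', 'inconsistent'

    def run(node, x):
        # Tiny interpreter for tuple-encoded trees: ('x',), ('const', v), ('add'|'sub'|'mul', l, r).
        op = node[0]
        if op == 'x':     return x
        if op == 'const': return node[1]
        a, b = run(node[1], x), run(node[2], x)
        return a + b if op == 'add' else a - b if op == 'sub' else a * b

    def semantics(tree, inputs):
        return [run(tree, x) for x in inputs]

    def subtrees(tree):
        yield tree
        if tree[0] not in ('x', 'const'):
            for child in tree[1:]:
                yield from subtrees(child)

    def depth(tree):
        return 1 if tree[0] in ('x', 'const') else 1 + max(depth(c) for c in tree[1:])

    def build_library(population, inputs):
        # Semantic deduplication: for each distinct semantics, keep only the shallowest subtree.
        best = {}
        for individual in population:
            for s in subtrees(individual):
                key = tuple(semantics(s, inputs))
                if key not in best or depth(s) < depth(best[key]):
                    best[key] = s
        return [(list(k), t) for k, t in best.items()]

    def semantic_distance(desired, sem):
        # Insignificant and inconsistent elements of the desired semantics are ignored.
        return sum((d - s) ** 2 for d, s in zip(desired, sem)
                   if d not in (DONT_CARE, INCONSISTENT)) ** 0.5

    def best_match(library, desired):
        # The core of RDO's library lookup: the part whose semantics is closest to the desired one.
        return min(library, key=lambda entry: semantic_distance(desired, entry[0]))[1]

    # Toy usage: a two-individual "population" and a desired semantics with a don't-care element.
    inputs = [-1.0, 0.0, 1.0]
    population = [('sub', ('mul', ('x',), ('x',)), ('x',)),   # x*x - x
                  ('add', ('x',), ('const', 1.0))]            # x + 1
    library = build_library(population, inputs)
    print(best_match(library, [-1.0, DONT_CARE, 1.0]))
    # -> ('x',), which indeed completes the example context (□ · x) − x into the target x² − x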
5. THE EXPERIMENT

5.1 The Benchmarks

In the following experiment, we aim at comparing RDO with standard GP search operators. To this aim, we test them on problems from two different domains: real-valued functions and logic functions. In each domain, we have ten problem instances (tasks).

Symbolic Regression Problems

The set of symbolic regression problems is presented in Table 1. Problems F03-F12 are taken from the paper by Nguyen et al. [12], half of which originate from [6, 3, 5, 4]. The table shows the hidden equation to discover (Target program), the number of independent variables (Variables), and the range from which they are chosen (Range).

Table 1: Symbolic regression tasks used in the experiment. The columns present the number of independent variables and their domains.

       Target program (expression)      Variables   Range
F03    x⁵ + x⁴ + x³ + x² + x            1           [−1; 1]
F04    x⁶ + x⁵ + x⁴ + x³ + x² + x       1           [−1; 1]
F05    sin(x²) cos(x) − 1               1           [−1; 1]
F06    sin(x) + sin(x + x²)             1           [−1; 1]
F07    log(x + 1) + log(x² + 1)         1           [0; 2]
F08    √x                               1           [0; 4]
F09    sin(x) + sin(y²)                 2           [0.01; 0.99]
F10    2 sin(x) cos(y)                  2           [0.01; 0.99]
F11    x^y                              2           [0.01; 0.99]
F12    x⁴ − x³ + y²/2 − y               2           [0.01; 0.99]

The number of fitness cases (points) for the univariate (F03-F08) and the bivariate (F09-F12) problems is 20 and 100, respectively. A program is considered an optimal solution if it returns correct (target) values for each fitness case within a 1.11 × 10⁻¹⁵ tolerance. This tolerance threshold is necessary to handle floating-point imprecision. Without it, even an expression mathematically equivalent to the target program could be found non-optimal (i.e., it would have a non-zero value of our minimized fitness function).

The fitness cases are evenly distributed in the variable domain(s). More precisely, the values are evenly spaced in the given closed interval from Table 1, with the extreme values placed on the interval boundaries. For univariate problems, this implies that the spans between any two consecutive points equal (b − a)/(k − 1) for k points in the range [a; b]. In the case of bivariate functions, the values of both variables in the fitness cases lie on an evenly spaced square grid. This, however, may cause problems. For instance, if the variable ranges were [0; 1] for F11, then a substantial number of fitness cases (nearly 40%) would fall on values that constitute special cases, i.e., 0^y, 1^y, x^0, or x^1. This may render evolution unable to escape from even very simple local optima. Therefore, we slightly narrowed the original [0; 1] interval to [0.01; 0.99]. The problem mentioned above does not exist in the original problem formulation with the [0; 1] interval as in paper [12], because Nguyen et al. (like most researchers) used randomly selected points uniformly distributed in this range. However, we have strong evidence that such a selection of fitness cases is not a good practice, because GP is highly sensitive to the choice of fitness cases. In other words, the precise values of fitness cases should always be considered as part of a GP task.
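As an illustration of this fitness-case layout (our sketch, not the authors' code), evenly spaced points with the extremes on the boundaries can be generated as follows; the bivariate grid is then the Cartesian product of two such 1-D sequences, presumably 10 × 10 = 100 points for F09-F12.

    def evenly_spaced(a, b, k):
        # k points in [a, b] with spacing (b - a) / (k - 1), extremes on the boundaries.
        step = (b - a) / (k - 1)
        return [a + i * step for i in range(k)]

    def square_grid(a, b, k):
        # k*k bivariate fitness cases on an evenly spaced square grid.
        pts = evenly_spaced(a, b, k)
        return [(x, y) for x in pts for y in pts]

    print(evenly_spaced(-1.0, 1.0, 20)[:3])   # first univariate cases for F03-F06
    print(len(square_grid(0.01, 0.99, 10)))   # 100 bivariate cases (assuming a 10x10 grid)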

For univariate problems, the terminal set contains two elements: x, the independent variable, and the constant 1.0. For bivariate problems there are two terminals, x and y, the independent variables, without the constant 1.0. Though the lack of constants for the bivariate problems may seem surprising, let us note that it has been shown many times that GP fares pretty well without any constants at all, as evolution can easily come up with the idea of filling in for them using subexpressions like x/x or x − x. The set of non-terminal instructions consists of eight functions: +, −, ·, / (protected), sin, cos, exp, and log (protected). The protected version of division returns 1.0 if the denominator equals zero, irrespective of the numerator. log(x) returns 0.0 if x = 0, and log(|x|) otherwise. Let us note that the provided set of instructions allows expressing all target functions presented in Table 1. In other words, for every benchmark problem, an optimal solution is present in the considered solution space.

Boolean Problems

In the Boolean domain, four different problems are studied here: even parity, multiplexer, majority, and comparator. The first three of them come from [6], and the last one is a simplified version of the digital comparator proposed by Walker and Miller [13]. In the following, we will use the terms argument and bit in the same meaning as independent variable was used in the symbolic regression context.

The objective of the even parity (PAR) problem is to synthesize a function which returns true if and only if an even number of its arguments are true. PAR can alternatively be seen as a generalization of the Not-Exclusive-Or function to more than two arguments. We consider instances with 4, 5, and 6 bits (i.e., with 4, 5, or 6 input arguments), denoted as PAR4, PAR5, and PAR6, respectively.

In the multiplexer problem (MUX), program arguments are divided into two blocks: address bits and data bits. The goal is to interpret the address bits as a binary number and use that number to index and return the value of the appropriate data bit. We consider two variants of this problem: 6-bit (MUX6) and 11-bit (MUX11). In the former we have 2 address bits and 4 data bits; in the latter, 3 and 8 bits, respectively.

The task in the majority (MAJ) problem is to create a function that returns true if more than half of its input arguments are true. Note that for an even number of arguments, the function should return false if exactly half (or less) of them are true. We consider three variants of this problem: with 5 bits (MAJ5), 6 bits (MAJ6), and 7 bits (MAJ7).

The last Boolean problem used in this paper is the comparator. The objective here is to interpret the input bits as two binary integers, and return true only if the first number is greater than the second one. We used the six-bit (CMP6) and eight-bit (CMP8) variants, which means that we compare 3-bit numbers (CMP6) or 4-bit numbers (CMP8).

In Table 2, all ten problems are shown together with the number of fitness cases on which solutions will be evaluated. A solution is considered ideal only if it returns the correct result for all fitness cases.

Table 2: Boolean tasks used in the experiment.

Problem        Instance   Bits   Fitness cases
even parity    PAR4       4      16
               PAR5       5      32
               PAR6       6      64
multiplexer    MUX6       6      64
               MUX11      11     2048
majority       MAJ5       5      32
               MAJ6       6      64
               MAJ7       7      128
comparator     CMP6       6      64
               CMP8       8      256

The set of terminals used in the Boolean domain experiments comprises one terminal for each input bit (D1...D11, depending on the number of bits). The non-terminal instructions are: AND, OR, NAND, and NOR. Similarly to the regression problems, for this domain too, solutions to all the aforementioned problems can be found in the assumed solution space.
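For reference, the four target functions can be written down directly from the definitions above; the sketch below (ours, in Python) enumerates the fitness cases of an n-bit instance as all 2ⁿ input combinations, matching the case counts in Table 2. The bit ordering (most significant address or operand bit first) is an assumption made for illustration.

    from itertools import product

    def parity(bits):
        # Even parity: true iff an even number of arguments are true.
        return sum(bits) % 2 == 0

    def multiplexer(bits, address_bits):
        # The first address_bits entries select which of the remaining data bits to return.
        addr = int(''.join(str(b) for b in bits[:address_bits]), 2)
        return bool(bits[address_bits + addr])

    def majority(bits):
        # True iff more than half of the arguments are true.
        return sum(bits) > len(bits) / 2

    def comparator(bits):
        # Interpret the two halves as binary integers; true iff the first is greater.
        half = len(bits) // 2
        first = int(''.join(str(b) for b in bits[:half]), 2)
        second = int(''.join(str(b) for b in bits[half:]), 2)
        return first > second

    def fitness_cases(n, target):
        # All 2^n input combinations with the corresponding desired outputs.
        return [(bits, target(bits)) for bits in product((0, 1), repeat=n)]

    print(len(fitness_cases(6, comparator)))                      # 64 cases for CMP6
    print(len(fitness_cases(11, lambda b: multiplexer(b, 3))))    # 2048 cases for MUX11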
5.2 The Setups

To verify the performance of RDO, we conduct a series of experiments involving RDO, standard crossover (X), and standard mutation (M) individually, and every combination of two of them. The setup that uses standard crossover and standard mutation serves as the control experiment. For each pair of operators we test different proportions of their probabilities, varying from 0.0 to 1.0 with step 0.1, which results in ten different setups with RDO and eleven different control setups. Setups are generally denoted as O₁+O₂ β, where O₁ and O₂ are the symbols of the used operators, and β is the probability of operator O₂ (i.e., Pr(O₁) = 1 − β). For example, in the setup X+RDO 0.2, crossover is applied with probability 0.8 and RDO with 0.2. Whenever β = 0 or β = 1, the notation simplifies to O₁ 1.0 or O₂ 1.0, respectively (e.g., X 1.0 or RDO 1.0). In total, we have 30 setups, with 19 of them involving RDO (2 · 9 + 1, i.e., X+RDO β and M+RDO β with β = 0.1, ..., 0.9, and RDO 1.0) and 11 control experiments (X+M β with β = 0.1, ..., 0.9, X 1.0, and M 1.0).

The parameters of the evolutionary algorithm used in our experiments, shown in Table 3, are based on Nguyen's work [11]. The parameters not mentioned there are taken from the ECJ package [9] and are based on the values originally used by Koza [6]. Every setup was tested on all 20 problems. To arrive at statistically significant results, every setup was run independently 200 times with different seeds of a pseudo-random number generator, which gives us 30 · 20 · 200 = 120,000 evolutionary runs in total.

6. RESULTS

Rather than presenting detailed results for every combination of method and benchmark, we provide a global, ranking-based perspective. For statistical validation, we perform the Friedman test comparing the success ratios of all setups tested on our benchmark problems. The null hypothesis of this nonparametric test says that the medians of all samples (corresponding to setups in our case) are equal, and the alternative hypothesis says that this is not true. Tables 4a-c present the computed ranks for the success rate performance measure on all 20 problems, and for the problems from the two domains separately. In each table, of all setups that use the same combination of operators, the one marked in bold is the best (however, not necessarily in a statistically significant manner).
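The paper does not state which software carried out the statistical analysis; as a sketch under that caveat, a Friedman test over a setups × problems matrix of success ratios, together with the average ranks reported in Table 4, could be obtained with SciPy as below (random data is used only to make the snippet runnable). A Holm correction for the post-hoc pairwise comparisons, as suggested by Derrac et al. [2], is available, e.g., in statsmodels.stats.multitest.multipletests(method='holm').

    import numpy as np
    from scipy.stats import friedmanchisquare, rankdata

    # success[i, j] = success ratio of setup i on problem j (30 setups x 20 problems);
    # random placeholder data here, only to make the sketch executable.
    rng = np.random.default_rng(0)
    success = rng.random((30, 20))

    # Friedman test: each argument is one setup's measurements across the 20 problems.
    stat, p_value = friedmanchisquare(*success)
    print(f"Friedman chi-square = {stat:.2f}, p = {p_value:.4f}")

    # Average Friedman ranks per setup (rank 1 = best success ratio on a given problem),
    # analogous to the values reported in Table 4.
    ranks = np.apply_along_axis(lambda col: rankdata(-col), 0, success)
    print(ranks.mean(axis=1))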

Table 3: Evolutionary parameters.

Parameter                Value
Generations              100
Population size          500
Initialization method    Ramped Half-and-Half algorithm
Initial minimal depth    2
Initial maximal depth    6
Duplicate retries        100 (before accepting a syntactically duplicated individual)
Selection method         Tournament
Tournament size          3
Operators probability    Varying from 0 to 1 with step 0.1
Maximal program depth    15
Node selection           Probability of terminal nodes: 10%; probability of non-terminal nodes: 90%
Mutation method          Subtree mutation
Subtree builder          Grow algorithm
Subtree depth            5
Crossover method         Subtree swapping
Instructions             Symbolic regression: x, either 1 or y, +, −, ·, / (protected), sin, cos, exp, log (protected)
                         Boolean domain: D1...D11 (inputs depend on a problem instance), AND, OR, NAND, NOR
Success                  Symbolic regression: error on each fitness case < 1.11 × 10⁻¹⁵
                         Boolean domain: perfect reproduction of all fitness cases
Number of runs           200

Table 4: Friedman ranks of the success ratio performance measure.

(a) on all 20 problems
M+RDO 0.7   8.63     X+RDO 0.2  11.70
M+RDO 0.3   8.78     RDO 1.0    13.40
X+RDO 0.5   8.90     X+RDO 0.1  14.28
M+RDO 0.5   9.15     M+RDO 0.1  14.58
X+RDO 0.4   9.20     X 1.0      20.55
X+RDO 0.8   9.23     X+M 0.1    21.30
M+RDO 0.4   9.25     X+M 0.2    22.53
X+RDO 0.6   9.75     X+M 0.3    23.10
M+RDO 0.6   9.88     X+M 0.4    23.55
X+RDO 0.3   9.95     X+M 0.5    23.85
X+RDO 0.7   9.95     X+M 0.6    24.53
M+RDO 0.8  10.08     X+M 0.7    25.73
M+RDO 0.2  10.65     X+M 0.8    25.85
X+RDO 0.9  11.15     M 1.0      27.18
M+RDO 0.9  11.20     X+M 0.9    27.18

(b) on 10 symbolic regression problems
X+RDO 0.4   7.60     X+RDO 0.9  13.70
X+RDO 0.5   8.00     M+RDO 0.9  14.25
M+RDO 0.3   8.20     X 1.0      17.10
M+RDO 0.7   8.80     M+RDO 0.1  17.60
X+RDO 0.8   9.20     RDO 1.0    17.60
X+RDO 0.3   9.25     X+M 0.1    19.50
M+RDO 0.4   9.30     X+M 0.2    21.05
M+RDO 0.5   9.30     X+M 0.3    22.40
X+RDO 0.7  10.45     X+M 0.4    23.20
X+RDO 0.2  10.60     X+M 0.5    23.40
X+RDO 0.6  10.60     X+M 0.6    23.95
M+RDO 0.2  10.85     X+M 0.8    25.05
M+RDO 0.6  11.00     X+M 0.7    26.10
X+RDO 0.1  11.35     M 1.0      26.70
M+RDO 0.8  11.70     X+M 0.9    27.20

(c) on 10 Boolean domain problems
M+RDO 0.9   8.15     X+RDO 0.4  10.80
M+RDO 0.7   8.45     M+RDO 0.1  11.55
M+RDO 0.8   8.45     X+RDO 0.2  12.80
X+RDO 0.9   8.60     X+RDO 0.1  17.20
M+RDO 0.6   8.75     X+M 0.1    23.10
X+RDO 0.6   8.90     X+M 0.3    23.80
M+RDO 0.5   9.00     X+M 0.4    23.90
M+RDO 0.4   9.20     X 1.0      24.00
RDO 1.0     9.20     X+M 0.2    24.00
X+RDO 0.8   9.25     X+M 0.5    24.30
M+RDO 0.3   9.35     X+M 0.6    25.10
X+RDO 0.7   9.45     X+M 0.7    25.35
X+RDO 0.5   9.80     X+M 0.8    26.65
M+RDO 0.2  10.45     X+M 0.9    27.15
X+RDO 0.3  10.65     M 1.0      27.65

Table 5: Friedman ranks of median error.

(a) on all 20 problems
M+RDO 0.7   8.83     X+RDO 0.2  12.45
M+RDO 0.6   8.98     X+RDO 0.1  12.63
M+RDO 0.5   9.00     RDO 1.0    13.18
M+RDO 0.4   9.35     M+RDO 0.1  13.25
M+RDO 0.8   9.38     X 1.0      20.70
X+RDO 0.7   9.75     X+M 0.1    20.70
M+RDO 0.3   9.78     X+M 0.2    20.85
X+RDO 0.8  10.05     X+M 0.3    22.25
X+RDO 0.6  10.18     X+M 0.4    22.53
X+RDO 0.5  10.33     X+M 0.5    23.45
X+RDO 0.4  10.35     X+M 0.6    23.93
X+RDO 0.3  10.53     X+M 0.7    25.90
M+RDO 0.9  11.08     X+M 0.8    25.95
M+RDO 0.2  11.50     X+M 0.9    26.85
X+RDO 0.9  11.73     M 1.0      29.63
(b) on 10 symbolic regression problems
M+RDO 0.7   6.80     X+RDO 0.2  13.20
M+RDO 0.6   7.10     X+RDO 0.1  13.35
M+RDO 0.5   7.15     M+RDO 0.1  14.70
M+RDO 0.4   7.85     RDO 1.0    15.50
M+RDO 0.8   7.90     X+M 0.1    19.85
X+RDO 0.7   8.65     X 1.0      20.15
M+RDO 0.3   8.70     X+M 0.2    20.75
X+RDO 0.8   9.25     X+M 0.3    23.20
X+RDO 0.6   9.50     X+M 0.4    23.20
X+RDO 0.5   9.80     X+M 0.5    24.80
X+RDO 0.4   9.85     X+M 0.6    25.20
X+RDO 0.3  10.20     X+M 0.8    27.40
M+RDO 0.9  11.30     X+M 0.7    27.70
M+RDO 0.2  12.15     X+M 0.9    27.90
X+RDO 0.9  12.60     M 1.0      29.30

(c) on 10 Boolean domain problems
M+RDO 0.2  10.85     X+RDO 0.9  10.85
M+RDO 0.3  10.85     X+RDO 0.2  11.70
M+RDO 0.4  10.85     M+RDO 0.1  11.80
M+RDO 0.5  10.85     X+RDO 0.1  11.90
M+RDO 0.6  10.85     X+M 0.2    20.95
M+RDO 0.7  10.85     X 1.0      21.25
M+RDO 0.8  10.85     X+M 0.3    21.30
M+RDO 0.9  10.85     X+M 0.1    21.55
RDO 1.0    10.85     X+M 0.4    21.85
X+RDO 0.3  10.85     X+M 0.5    22.10
X+RDO 0.4  10.85     X+M 0.6    22.65
X+RDO 0.5  10.85     X+M 0.7    24.10
X+RDO 0.6  10.85     X+M 0.8    24.50
X+RDO 0.7  10.85     X+M 0.9    25.80
X+RDO 0.8  10.85     M 1.0      29.95

These tables show that the M+RDO 0.7 setup fares best. Moreover, Holm's post-hoc analysis (as suggested by Derrac et al. in [2]) reveals that the best control setup (X 1.0) is statistically worse (p-value < 0.01) than the best 13 setups from this ranking; only X+RDO 0.2 and the setups with RDO probability 0.1, 0.9, and 1.0 are not statistically better than it. However, there is also a clear difference between the performance of the setups when considering the particular domains separately (Tables 4b-c). The best setups for symbolic regression are those that employ RDO with a lower probability than in the case of the Boolean problems, which may suggest that RDO is not as advantageous in the former domain. The differences are even more visible in the qualitative comparison: for the symbolic regression problems, the best control setup (also X 1.0) is not statistically worse than any other setup. The best setup, X+RDO 0.4, is statistically better than X+M 0.3 and the following setups (the worst 8 setups in the ranking). However, for the Boolean problems the best control setup (X+M 0.1) is statistically worse than the best 12 setups.

In Table 5 we show the rankings of the median error committed by each setup. When considering both the regression and the Boolean domain together, the best setup is the same as when comparing success ratios, M+RDO 0.7. For symbolic regression, it seems more beneficial to use a slightly higher probability of RDO than when maximizing the success ratio. For the Boolean problems little can be concluded, as they are almost always solved by RDO, so the median error is zero and does not differentiate the setups (16 setups rank equally well at the top).

Table 6 presents more detailed results for the globally best setup, M+RDO 0.7 (the best success rate and the lowest median error on all problems), and the best control setup (X 1.0). The table shows the achieved success rate, the mean generation in which the ideal solution was found (calculated only over the successful runs), the median error of the best-of-run individuals, and the time required by a single evolutionary process (averaged over 200 runs). The last column says how many successes are expected if an algorithm were allowed to run for one hour, starting a new evolutionary run every time the previous one has been completed (with success or not). We find this efficiency measure convenient, because a method that executes very fast does not necessarily score many successes. On the other hand, it can be run more times within a fixed computational budget. However, if a method is actually bad, the successive runs do not help much, and this performance indicator will show that. Last but not least, fixing the time budget allows us to take into account the overhead of searching for procedures in the library, which causes RDO to be substantially slower in absolute terms. To provide another reference point, in the rows named best we present the best value of each objective achieved by any setup. Therefore, each value in a best row may come from a different setup.

Table 6 confirms our earlier finding that RDO is very beneficial for all Boolean problems. For instance, only 1% of the runs for the quite challenging PAR6 problem fail. All other runs of M+RDO 0.7 succeeded, while standard GP failed to solve PAR6 even once. For this reason, the efficiency measured as successes per hour for the Boolean domain can be even several orders of magnitude higher.
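The last column of Table 6 appears consistent with computing the expected number of successes in a one-hour budget as (3600 s / mean run time) × success rate; a quick check (ours) against the F03 rows:

    def successes_per_hour(success_rate, mean_run_time_ms):
        # Expected successes when a fresh run is started as soon as the previous one ends.
        return 3600_000 / mean_run_time_ms * success_rate

    print(successes_per_hour(0.335, 4268.8))   # ~282.5, matching the M+RDO 0.7 row for F03
    print(successes_per_hour(0.680, 101.9))    # ~24023, close to the X 1.0 row for F03 (24014.6)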
For the symbolic regression problems, the results are not so firm. For seven out of ten problems (F05-F11), the success ratio achieved by M+RDO 0.7 is better. However, in terms of the number of successes per hour, this setup can have up to two orders of magnitude worse performance; only for 5 problems (F05, F08-F11) is RDO more efficient than the control setup. One of the possible reasons for the inferior performance of RDO in the regression domain is the way in which the errors committed by the inserted procedures, with respect to the desired semantics, propagate through the programs. In the Boolean domain, when a perfectly matching procedure cannot be found (which is most often the case), a mismatch on a single element of the desired semantics causes a unitary loss of fitness (one incorrect output bit). In the regression domain, however, even a relatively small discrepancy between the semantics of the inserted procedure and the desired semantics may translate into an arbitrarily large error of the entire program (consider, e.g., the context 1/□). Apart from that, we anticipate that the RDO performance for symbolic regression can be substantially improved by using a more efficient implementation of the algorithm that searches the library for the best matching subtree.

7. CONCLUSIONS

The speedup and the overall improvement in quality provided by the presented approach result from exploiting certain properties of a given problem, more specifically, the properties of the instructions that form the programming language under consideration. We demonstrated that taking into account such implicit, but often easily available, features can help to overcome some weaknesses of genetic programming, at least in certain domains and applications. As we already mentioned in the Introduction, it is surprising to see that such supplementary² and easily exploitable properties are rarely considered in genetic programming. The one-fits-all attitude prevails, particularly when it comes to operator design, which is puzzling in the light of the No Free Lunch theorem heritage. Therefore, we believe that this innovative approach is an important contribution to GP and that other methods developed in this spirit will lead to essential breakthroughs in the field of genetic programming.

There are several directions in which this research can develop. In the variant of RDO presented here, we used the desired semantics in a very strict way, searching the library for the part that exactly matched the desired values (i.e., the values that made the entire program return the correct output). However, to exploit the presented idea, an omniscient oracle is not necessary. It may be enough to only narrow the space of the considered parts in such a way that the genetic operators can choose the part to be inserted into the parent individual faster. Therefore, our approach can also be applied to problems where the instructions are not easily invertible, which we plan to investigate in future research.

Acknowledgment

Work supported by grant no. DEC-2011/01/B/ST6/07318.

² Supplementary in the sense that the invertibility of, e.g., the multiplication operator is not required for program execution.

Table 6: Detailed comparison of the best setup with RDO (M+RDO 0.7, shown as RDO) with the best control setup (X 1.0). The shown best values (in the best: rows) for each problem and each column may come from different setups.

(a) symbolic regression

Problem  Setup   Success rate  Success gen.  Median error  Time [ms]  Successes per hour
F03      RDO     0.335          3.9          0.001          4268.8       282.5
         X 1.0   0.680         19.0          0.000           101.9     24014.6
         best:   0.680          2.7          0.000           101.9     24014.6
F04      RDO     0.185          4.4          0.001          4631.4       143.8
         X 1.0   0.270         30.5          0.174           216.4      4492.3
         best:   0.400          4.0          0.001           216.4      4492.3
F05      RDO     0.480          4.7          0.000          3042.4       568.0
         X 1.0   0.015         18.3          0.096           278.6       193.8
         best:   0.580          2.4          0.000           278.6      1384.7
F06      RDO     0.615          3.5          0.000          2760.2       802.1
         X 1.0   0.460         24.2          0.054           122.2     13547.7
         best:   0.925          2.6          0.000           122.2     19369.3
F07      RDO     0.360          2.3          0.000          3133.4       413.6
         X 1.0   0.070         20.4          0.049           217.8      1157.0
         best:   0.405          1.9          0.000           217.8      1157.0
F08      RDO     0.440          8.0          0.000          2848.4       556.1
         X 1.0   0.000          -            0.216           287.4         0.0
         best:   0.535          3.4          0.000           287.4      1390.0
F09      RDO     0.995          2.8          0.000           209.1     17130.9
         X 1.0   0.250         32.2          1.967           409.9      2195.9
         best:   1.000          1.5          0.000            67.7     53207.3
F10      RDO     0.975          2.4          0.000           439.9      7979.3
         X 1.0   0.190         24.6          0.894           422.2      1620.0
         best:   0.990          1.6          0.000           310.6     11474.4
F11      RDO     1.000          1.4          0.000            76.5     47052.2
         X 1.0   0.170         13.4          2.578           509.6      1200.9
         best:   1.000          1.3          0.000            48.4     74443.4
F12      RDO     0.000          -            0.145         13193.4         0.0
         X 1.0   0.005         92.0          1.890           582.4        30.9
         best:   0.005          8.0          0.141           582.4        30.9

(b) Boolean domain

Problem  Setup   Success rate  Success gen.  Median error  Time [ms]  Successes per hour
CMP6     RDO     1.000          5.3          0.000            73.4     49060.3
         X 1.0   0.100         71.1          2.000           553.4       650.5
         best:   1.000          5.0          0.000            68.3     52717.4
CMP8     RDO     1.000         11.2          0.000           449.7      8005.6
         X 1.0   0.000          -           14.000           431.5         0.0
         best:   1.000         11.0          0.000           119.2      8428.4
MAJ5     RDO     1.000          3.1          0.000            34.2    105407.9
         X 1.0   0.870         33.9          0.000           206.1     15195.5
         best:   1.000          3.0          0.000            28.4    126854.7
MAJ6     RDO     1.000          4.7          0.000            97.7     36836.3
         X 1.0   0.360         69.1          1.000           608.9      2128.4
         best:   1.000          4.4          0.000            87.2     41294.6
MAJ7     RDO     1.000          7.6          0.000           309.7     11623.2
         X 1.0   0.000          -            5.000           768.2         0.0
         best:   1.000          6.8          0.000           190.1     12486.8
MUX11    RDO     1.000          9.8          0.000          1404.9      2562.5
         X 1.0   0.000          -          334.500           615.1         0.0
         best:   1.000          8.7          0.000           117.4      3147.2
MUX6     RDO     1.000          3.6          0.000            56.6     63639.3
         X 1.0   0.730         56.5          0.000           382.8      6865.2
         best:   1.000          3.4          0.000            47.9     75164.4
PAR4     RDO     1.000          4.1          0.000            44.1     81600.1
         X 1.0   0.365         55.0          1.000           557.5      2356.8
         best:   1.000          3.7          0.000            41.7     86394.4
PAR5     RDO     1.000          8.7          0.000           250.2     14391.3
         X 1.0   0.005         82.0          6.000           772.0        23.3
         best:   1.000          7.4          0.000           211.3     17037.0
PAR6     RDO     0.990         21.3          0.000          1576.8      2260.3
         X 1.0   0.000          -           18.000           682.5         0.0
         best:   1.000         15.9          0.000           224.8      3153.9

8. REFERENCES

[1] W. Banzhaf et al. Genetic Programming: An Introduction; On the Automatic Evolution of Computer Programs and its Applications. Morgan Kaufmann, San Francisco, CA, USA, January 1998.
[2] J. Derrac et al. A practical tutorial on the use of nonparametric statistical tests as a methodology for comparing evolutionary and swarm intelligence algorithms. Swarm and Evolutionary Computation, 2011.
[3] N. X. Hoai et al. Solving the symbolic regression problem with tree-adjunct grammar guided genetic programming: The comparative results. In D. B. Fogel et al., editors, Proceedings of the 2002 Congress on Evolutionary Computation CEC2002, pages 1326-1331. IEEE Press, 12-17 May 2002.
[4] C. Johnson. Genetic programming crossover: Does it cross over? In Proceedings of the 12th European Conference on Genetic Programming, EuroGP 2009, volume 5481 of LNCS, pages 97-108, Tuebingen, April 15-17, 2009. Springer.
[5] M. Keijzer. Improving symbolic regression with interval arithmetic and linear scaling. In C. Ryan et al., editors, Genetic Programming, Proceedings of EuroGP 2003, volume 2610 of LNCS, pages 70-82, Essex, 14-16 April 2003. Springer-Verlag.
[6] J. R. Koza. Genetic Programming: On the Programming of Computers by Means of Natural Selection. MIT Press, Cambridge, MA, USA, 1992.
[7] K. Krawiec and T. Pawlak. Approximating geometric crossover by semantic backpropagation. In C. Blum et al., editors, GECCO '13: Proceedings of the 15th Annual Conference on Genetic and Evolutionary Computation, Amsterdam, The Netherlands, 2013. ACM.
[8] K. Krawiec and B. Wieloch. Analysis of semantic modularity for genetic programming. Foundations of Computing and Decision Sciences, 34(4):265-285, 2009.
[9] S. Luke. The ECJ Owner's Manual: A User Manual for the ECJ Evolutionary Computation Library, zeroth edition, online version 0.2, October 2010.
[10] J. F. Miller. An empirical study of the efficiency of learning boolean functions using a cartesian genetic programming approach. In Proceedings of the Genetic and Evolutionary Computation Conference, volume 2, pages 1135-1142, Orlando, Florida, USA, 13-17 July 1999. Morgan Kaufmann.
[11] Q. U. Nguyen et al. Semantic aware crossover for genetic programming: The case for real-valued function regression. In Proceedings of the 12th European Conference on Genetic Programming, EuroGP 2009, volume 5481 of LNCS, pages 292-302, Tuebingen, April 15-17, 2009. Springer.
[12] Q. U. Nguyen et al. Self-adapting semantic sensitivities for semantic similarity based crossover. In 2010 IEEE World Congress on Computational Intelligence, pages 4034-4040, Barcelona, Spain, 18-23 July 2010. IEEE Computational Intelligence Society, IEEE Press.
[13] J. A. Walker and J. F. Miller. Investigating the performance of module acquisition in cartesian genetic programming. In H.-G. Beyer et al., editors, GECCO 2005: Proceedings of the 2005 Conference on Genetic and Evolutionary Computation, volume 2, pages 1649-1656, Washington DC, USA, 25-29 June 2005. ACM Press.