On the Use of Probabilistic Models and Coordinate Transforms in Real-Valued Evolutionary Algorithms


Petr Pošík

Czech Technical University in Prague
Faculty of Electrical Engineering
Department of Cybernetics
Technická 2, Prague 6, Czech Republic

January 2007


To my parents, Marcela and Vlastimil. Without you I would never have had the chance to start writing this thesis. To Evička and Vojtíšek. Without you I would never have finished it.


Acknowledgements

I would like to express my gratitude to the many people who supported me and helped me with the work on this thesis. First of all, I would like to thank Vojtěch Franc, whose expertise in kernel methods and machine learning allowed me to use them to such a great extent in this thesis. I would also like to thank Jiří Kubalík, who has supported me for many years on my quest to optimize evolutionary algorithms. And I would like to thank my supervisor, Jiří Lažanský, for giving me so much freedom in my research and for his generous guidance. I am also very grateful to the many people who have inspired me in recent years: Nikolaus Hansen, Marcus Gallagher, Jiří Očenášek, Martin Pelikan, Peter Bosman, Jörn Grahl, Bo Yuan, and many others. I would also like to thank my colleague Filip Železný, who read the manuscript, provided many useful comments, suggested improvements, and pointed out many typos I had made. Thanks also to my family and my friends. They were always ready to support me and to help me relax when I needed it.


Abstract

This thesis deals with black-box parametric optimization: it studies methods which allow us to find optimal (or near-optimal) solutions to optimization tasks which can be cast as the search for a minimum of a real function of real arguments. In recent decades, evolutionary algorithms (EAs) have gained much interest thanks to their unique features: they are less prone to getting stuck in local minima, they can find more than one optimum at a time, etc. However, they are not very efficient when solving problems which contain interactions among variables. Recent years have shown that a subclass of EAs, estimation of distribution algorithms (EDAs), allows us to account for interactions among solution components in a unified way. It was shown that they are able to reliably solve hard optimization problems, at least in the discrete domain; their efficiency in the real domain was questionable and there is still open space for research. In this thesis, it is argued that a direct application of methods successful in the discrete domain does not lead to successful algorithms in the real domain. The requirements for a successful real-valued EDA are presented. Instead of learning the structure of dependencies among variables (as in the case of discrete EDAs), it is suggested to use coordinate transforms to make the solution components as independent as possible. Various types of linear and non-linear coordinate transforms are presented and compared to other EDAs and GAs. A special chapter is dedicated to methods for preserving diversity in the population. The dependencies among variables in the real domain can take on many more forms than in the discrete domain. Although this thesis proposes a number of algorithms for various classes of objective functions, a control mechanism which would use the right solver remains a topic for future research.


Contents

1 Introduction to EAs
    Global Black-Box Optimization
    Local Search and the Neighborhood Structure
        Discrete vs. Continuous Spaces
        Local Search in the Neighborhood of Several Points
    Genetic and Evolutionary Algorithms
        Conventional EAs
        EA: A Simple Problem Solver?
        EA: A Big Puzzle!
        Effect of the Genotype-Phenotype Mapping in EAs
        The Roles of Mutation and Crossover
        Two Ways of Linkage Learning
    The Topic of this Thesis
        Targeting Efforts
        Requirements for a Successful EA
        The Goals
        The Roadmap
2 State of the Art
    Estimation of Distribution Algorithms
        Basic Principles
        EDAs in Discrete Spaces
        EDAs in Continuous Spaces
    Coordinate Transforms
3 Envisioning the Structure
    Univariate Marginal Probability Models
        Histogram Formalization
        One-dimensional Probability Density Functions
        Empirical Comparison
        Comparison with Other Algorithms
        Summary
    UMDA with Linear Population Transforms
        Principal Components Analysis
        Toy Examples on PCA
        Independent Components Analysis
        Toy Examples of ICA
        Example Test of Independence
        Empirical Comparison
        Comparison with Other Algorithms
        Summary

4 Focusing the Search
    EDA with Distribution Tree Model
        Growing the Distribution Tree
        Sampling from the Distribution Tree
        Empirical Comparison
        Summary
    Kernel PCA
        KPCA Definition
        The Pre-Image Problem
        KPCA Model Usage
        Toy Examples of KPCA
        Empirical Comparison
        Summary
5 Population Shift
    The Danger of Premature Convergence
    Using a Model of Successful Mutation Steps
    Adaptation of the Covariance Matrix
    Empirical Comparison: CMA-ES vs. Nelder-Mead
    Summary
6 Estimation of Contour Lines of the Fitness Function
    Principle and Methods
    Empirical Comparison
    Results and Discussion
    Summary
7 Conclusions and Future Work
    The Main Contributions
    Future Work
A Test Functions
    A.1 Sphere Function
    A.2 Ellipsoidal Function
    A.3 Two-Peaks Function
    A.4 Griewangk Function
    A.5 Rosenbrock Function
B Vestibulo-Ocular Reflex Analysis
    B.1 Introduction
    B.2 Problem Specification
    B.3 Minimizing the Loss Function
    B.4 Nature of the Loss Function

List of Tables

3.1 UMDA empirical comparison: factor settings
3.2 UMDA empirical comparison: number of bins
3.3 UMDA empirical comparison: results
3.4 PCA vs. ICA empirical comparison: factor settings
3.5 PCA vs. ICA: Griewangk function
3.6 PCA vs. ICA: Two Peaks function
4.1 DiT-EA empirical comparison: factor settings
4.2 DiT-EA empirical comparison: results
4.3 KPCA-EA empirical comparison: results
5.1 Settings for parameters of artificial VOR signal
5.2 Nelder-Mead vs. CMA-ES: success rates
6.1 CMA-ES and Separating Hyperellipsoid: population sizes


List of Figures

1.1 Example of binary and integer neighborhoods
1.2 The neighborhood of the Nelder-Mead simplex
1.3 Neighborhoods induced by the bit-flip mutation
1.4 Neighborhoods induced by the 1-point crossover
1.5 Neighborhoods induced by the 2-point crossover
3.1 Equi-width histogram model
3.2 Equi-height histogram model
3.3 Max-diff histogram model
3.4 Mixture of Gaussians model
3.5 Bin boundaries evolution for the Two Peaks function
3.6 Bin boundaries evolution for the Griewangk function
3.7 Two linear principal components of the 2D toy data set
3.8 PCA and marginal equi-height histogram model
3.9 PCA and ICA components comparison
3.10 Principal and independent components of 2D Griewangk
3.11 Observed and expected frequency tables before PCA
3.12 Observed and expected frequency tables after PCA
3.13 Monitored statistics layout
4.1 DiT search space partitioning
4.2 Efficiency of the DiT-EA on the 2D Two Peaks
4.3 Efficiency of the DiT-EA on the 2D Griewangk
4.4 Efficiency of the DiT-EA on the 2D Rosenbrock
4.5 Efficiency of the DiT-EA on the 20D Two Peaks
4.6 Efficiency of the DiT-EA on the 10D Griewangk
4.7 Efficiency of the DiT-EA on the 10D Rosenbrock
4.8 First six nonlinear components of the toy data set
4.9 KPCA crossover for clustering and curve-modeling
4.10 Efficiency of the KPCA-EA on the 2D Two Peaks
4.11 Efficiency of the KPCA-EA on the 2D Griewangk
4.12 Efficiency of the KPCA-EA on the 2D Rosenbrock
5.1 Coefficients d and c in relation to the selection proportion τ
5.2 One iteration of EMNA and CMA-ES
5.3 Vestibulo-ocular reflex signal
5.4 Aligned VOR signal segments
5.5 Nelder-Mead vs. CMA-ES: number of needed evaluations
5.6 Nelder-Mead vs. CMA-ES: typical progress
6.1 CMA-ES vs. Separating Hyperellipsoid: covariance matrix
6.2 CMA-ES vs. Separating Hyperellipsoid: average progress
6.3 Modified Perceptron vs. SVM: separating hyperellipsoid

A.1 Two Peaks function
A.2 Griewangk function
A.3 Rosenbrock function
B.1 Simulated VOR signal
B.2 Aligned VOR signal segments
B.3 The fitness landscape cuts for the VOR analysis I
B.4 The fitness landscape cuts for the VOR analysis II
B.5 The fitness landscape cuts for the VOR analysis III

List of Algorithms

1.1 Hill-Climbing
1.2 Steepest Descent
1.3 An iteration of the Nelder-Mead downhill simplex algorithm
1.4 Genetic Algorithm
2.1 Estimation of Distribution Algorithm
2.2 Coordinate transform inside EA
3.1 Generic EDA
3.2 EDA with PCA or ICA preprocessing
4.1 Function SplitNode
4.2 Function FindBestSplit
4.3 KPCA model building and sampling
6.1 Perceptron Algorithm
6.2 Modified Perceptron Algorithm
6.3 Gaussian distribution sampling
6.4 Evolutionary model for the proposed method


Chapter 1

Introduction to Evolutionary Algorithms

This chapter presents an introduction to the field of optimization. It informally mentions the relevant conventional techniques used to solve optimization tasks. It gives a brief survey of ordinary genetic and evolutionary algorithms (GEAs), presents them in a unifying framework with the conventional optimization techniques, and points out some issues in their design. These issues directly lead to the emergence of a class of evolutionary algorithms based on building and sampling probabilistic models, so-called Estimation of Distribution Algorithms (EDAs), which are described in the next chapter in more detail.

1.1 Global Black-Box Optimization

Optimization tasks can be found in almost all areas of human activity. A baker has to find the optimal composition of the dough to produce rolls which are popular among his customers. A company must create a manufacturing schedule which minimizes the cost of its production. An engineer has to find the settings of a controller which deliver the best response of the system under control. The optimization task is usually defined as

    x^* = \arg\min_{x \in S} f(x),    (1.1)

so that among the members x of the set of all possible objects S we have to find the optimum x^* which minimizes the objective (cost[1]) function f. The cost function usually tells us how good the individual objects are.[2] In the examples given above, the objects are the composition of the dough, the schedule, and the settings of the controller, respectively. Similarly, the objective functions are the satisfaction of the baker's customers, the cost of manufacturing, and e.g. the speed and precision of the system being controlled. There exist many algorithms which try to solve the optimization task.
According to Neumaier (2004), optimization methods can be classified as follows, based on the degree of rigor with which they approach the goal:

[1] In this thesis, the terms cost function, objective function and evaluation function are used rather interchangeably. If the distinction between them is important, it is pointed out.
[2] In many cases, it is sufficient to have a function which takes two objects and tells us which of them is better, so that we are able to produce a complete ordering of a set of objects.

An incomplete method uses clever heuristics to guide the search, but may get stuck in a local minimum.

An asymptotically complete method reaches the global optimum with probability one if allowed to run indefinitely long, but has no means to verify whether the global optimum has already been found.

A complete method reaches the global optimum with certainty, assuming an indefinitely long run time, and knows after a finite time that an approximate global minimum has been found (within prescribed tolerances).

With an algorithm from the last of the three[3] categories we can predict the amount of work needed to find the global minimum within some prescribed tolerances, but this estimate is usually very pessimistic (exponential in the problem size) and serves as a lower bound on the efficiency of the algorithm. We speak of black-box optimization if the form of the particular objective function is completely hidden from us. We do not know in advance whether the function is unimodal or multimodal, or whether it can be decomposed into subfunctions; we do not know its derivatives, etc. All we can do is sample some object from the space S and have it evaluated by the cost function. If the search space is infinite, then without any global information no algorithm is able to verify the global optimality of a solution in finite time. In such circumstances, the best algorithm we can construct for black-box optimization problems is some asymptotically complete method. In the rest of this thesis, an algorithm is said to perform a global search if the following condition holds:

    \lim_{t \to \infty} P(x^t_{BSF} = x^*) = 1,    (1.2)

where x^t_{BSF} is the best-so-far solution found by the algorithm from the start of the search to the time instant t (it is the current estimate the algorithm gives us for the global optimum) and x^* is the true global optimum.
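The global-search condition of Eq. 1.2 is already met by pure random search, provided the sampler assigns nonzero probability to every region of S. The following minimal sketch is illustrative and not taken from the thesis:

```python
import random

def random_search(f, sample, budget):
    """Pure random search: keep the best-so-far (BSF) solution.

    If the sampler gives every region of S nonzero probability, the
    method is asymptotically complete in the sense of Eq. 1.2.
    """
    best_x, best_f = None, float("inf")
    for _ in range(budget):
        x = sample()          # draw a candidate object from S
        fx = f(x)             # one black-box evaluation of the cost function
        if fx < best_f:       # update the best-so-far solution
            best_x, best_f = x, fx
    return best_x, best_f

# Toy example: minimize f(x) = x^2 over S = [-1, 1].
random.seed(0)
best_x, best_f = random_search(lambda x: x * x,
                               lambda: random.uniform(-1, 1), budget=1000)
```

Note that the method never knows whether the optimum has been found; it only guarantees convergence of the best-so-far solution in the limit.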
1.2 Local Search and the Neighborhood Structure

In this section, several incomplete local search methods are described. A plausible and broadly accepted assumption for any optimization task is that the objective function is a reasonable one, i.e. that it is able to guide the search and give some clues where to find an optimum. If this assumption does not hold, e.g. if we are presented with an objective function with a very low signal-to-noise ratio or if we face a needle-in-a-haystack type of problem, then we can expect that the best algorithm we can use is pure random search. The most often used optimization procedures are various forms of local search, i.e. incomplete methods. The feature all local search procedures have in common is that they repeatedly look for increasingly better objects in the neighborhood of the current object, thus incrementally improving the solution. They start with an initial object from the search space, generated randomly or created as an initial guess of a solution. The search is carried out until a termination criterion is met. The termination criterion can be e.g. a limit on the number of iterations or a sufficient quality of the current object, or the search can be stopped when all possible perturbations result in objects worse than the current one. In that case we have found a local optimum of the cost function.

[3] In the original paper, there were four categories, but the last one dealt with the presence of rounding errors, which is not relevant here and thus was omitted.

Individual local search strategies differ in the definition of the neighborhood, and in the way they search the neighborhood and accept better solutions. If the neighborhood is finite and precisely defined, it can be searched using strict rules. Very often, the neighbors are searched in random order; the local search algorithm can then accept the first improving neighbor, or the best improving neighbor, etc. In one particular variant of local search, so-called hill-climbing, each iteration consists of a slight perturbation of the current object; if the result is better, it becomes the current object, i.e. the algorithm accepts the first improving neighbor (see Alg. 1.1).

Algorithm 1.1: Hill-Climbing
begin
    x ← Initialize()
    while not TerminationCondition() do
        y ← Perturb(x)
        if BetterThan(y, x) then x ← y
end

The cost function f(x) (which is hidden in the function BetterThan in Alg. 1.1) takes the object x and evaluates it. In order to do that, it must understand the argument x, i.e. we have to choose the representation of the object. Many optimization tasks have obvious natural representations, e.g. amounts of individual ingredients, Gantt charts, vectors of real numbers, etc. However, the natural representation is often not the most suitable one: some optimization algorithms require a specific representation because they are not able to work with others, and sometimes the task at hand can simply be solved more easily using a representation other than the natural one (this feature is strongly related to the neighborhood structure of the search space, which is described below). In such cases we need a mapping function which transforms the object description from the search space representation (the representation used by the optimization algorithm when searching for the best solution) to the natural representation (suitable for evaluation of the objects). We say that a point in the search space represents the object.
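Alg. 1.1 translates almost directly into code. The following sketch is illustrative and not from the thesis; it instantiates Perturb with a bit-flip on a binary representation and BetterThan with plain cost comparison:

```python
import random

def hill_climbing(f, x, iterations):
    """Alg. 1.1 with a bit-flip Perturb: accept the first improving neighbor."""
    for _ in range(iterations):
        y = list(x)
        y[random.randrange(len(y))] ^= 1     # Perturb: flip one random bit
        if f(y) < f(x):                      # BetterThan (minimization)
            x = y
    return x

# Toy example: minimize the number of ones in a 6-bit string.
random.seed(1)
solution = hill_climbing(sum, [1, 0, 1, 1, 0, 1], iterations=500)
```

Any other representation only requires swapping the perturbation; the acceptance loop stays the same.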
In all local search procedures, the concept of the neighborhood plays a fundamental role. The most straightforward and intuitive definition of the neighborhood of an object is the following:

    N(x) = \{ y \mid \mathrm{dist}(x, y) < \epsilon \},    (1.3)

i.e. the neighborhood N(x) of an object x is the set of objects y whose distance from x is lower than some constant. The neighborhood and its structure, however, are very closely related to the chosen representation of an object and to the perturbation scheme used in the algorithm. From this point of view, a much better definition of the neighborhood is this:

    N(x) = \{ y \mid y = \mathrm{Perturb}(x) \},    (1.4)

i.e. the neighborhood is the set of objects y which are accessible by one application of the perturbation. In Fig. 1.1, we can see an example of a small search space represented by binary and integer numbers (the objective values of the individual points are not given there). Now, suppose that the perturbation for the binary representation flips one randomly chosen bit of the point 101_B (i.e. the neighborhood is the set of points with Hamming distance equal to 1), while the perturbation for the integer representation randomly adds or subtracts 1 to the point 5_D (i.e. the neighborhood is the set of points with Euclidean distance equal to 1). It is obvious that these two neighborhoods neither coincide, nor is one a subset of the other.

Figure 1.1: Example of binary and integer neighborhoods

Furthermore, imagine what happens if we enlarge the search space so that we need 4 or more bits to represent the points in the binary space. The number of binary neighbors increases while the neighborhood in the integer space remains the same. It is worth mentioning that a larger neighborhood of points results in a lower number of local optima, but also in a greater amount of time spent searching the neighborhood. If we chose a perturbation operator able to generate all points of the search space, we would have only one local optimum (which would also be the global one), but we would have to search the whole space of candidate solutions. For the success of local search procedures, it is crucial to find a good compromise in the size of the neighborhood, i.e. to choose the right perturbation operators.

1.2.1 Discrete vs. Continuous Spaces

Optimization tasks defined in discrete spaces are fundamentally different from tasks defined in continuous spaces. The number of neighbors in a discrete space is usually finite, and thus we are able e.g.
to test whether a local optimum was found: we can enumerate all neighbors of the current solution, and if none is better, we can finish. This usually cannot be done in a continuous space; the neighborhood is infinite in that case and enumeration simply does not make sense.[4] To ensure that we have found a local optimum in a continuous space, we would need to know the derivatives of the objective function. These features can be observed by considering another type of local search algorithm, called steepest descent (see Alg. 1.2). This algorithm examines all points in the vicinity of the current point and jumps to the best of them (i.e. the algorithm accepts the best improving neighbor). In the case of a discrete space, this can be done by enumeration. In continuous spaces, the neighborhood is given by a

[4] Of course, even in continuous spaces we can define the neighborhood to be finite, but in that case (without any regularity assumptions) there is a high probability of missing better solutions.

line going through the current point in the direction of the gradient. The points on this line are examined and the best one is selected as the new current point.[5]

Algorithm 1.2: Steepest Descent
begin
    x ← Initialize()
    while not TerminationCondition() do
        N_x ← { y | y = Perturb(x) }
        x ← BestOf(N_x)
end

It is interesting to note that the steepest descent algorithm is deterministic, while hill-climbing is not. If started from the same initial point over and over, the steepest descent algorithm always gives the same solution (in both discrete and continuous spaces). On the contrary, hill-climbing can give different results in each run, as it depends on the order of evaluation of the neighbors, which is usually random. Steepest descent also spends a lot of time on unnecessary evaluation of the whole neighborhood in the case of a discrete search space. We can expect that it will find a local optimum in fewer iterations, but very often with many more objective function evaluations than hill-climbing; it depends on the size of the neighborhood and on the complexity of the search problem.

1.2.2 Local Search in the Neighborhood of Several Points

The algorithms described above work with one current point and perform the search in its neighborhood. One possible generalization is to use a set of current points (instead of merely one) and search in their common neighborhood. Again, the neighborhood N(X) of a group of points X, X = {x_1, x_2, ..., x_N}, can be defined using a combination operator as

    N(X) = \{ y \mid y = \mathrm{Combine}(X) \},    (1.5)

i.e. as the set of all points y which can result from some combination of the points in X. Individual algorithms then differ in the definition of the Combine operation. This approach is not pursued very often in conventional optimization techniques.
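As an illustrative sketch (not from the thesis), one simple Combine operator for real vectors draws a random convex combination of the points in X; under Eq. 1.5, N(X) is then the convex hull of X:

```python
import random

def combine(X):
    """One application of a Combine operator (Eq. 1.5): a random convex
    combination of the points in X. The induced N(X) is the convex hull of X."""
    weights = [random.random() for _ in X]
    total = sum(weights)
    D = len(X[0])
    return [sum(w * p[d] for w, p in zip(weights, X)) / total for d in range(D)]

random.seed(4)
X = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
y = combine(X)   # a point inside the triangle spanned by X
```

Other Combine operators (averaging, recombining coordinates, etc.) induce different, possibly disconnected, neighborhoods of the same point set.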
However, one excellent and famous example of this technique can be mentioned here: the Nelder-Mead simplex search introduced in Nelder & Mead (1965). This algorithm is used to solve optimization problems in R^D. In D-dimensional space, the algorithm maintains a set of D + 1 points that define a simplex in that space (an interval in 1D, a triangle in 2D, a tetrahedron in 3D, etc.). Using this simplex, it creates a finite set of candidate points which constitute the neighborhood of the simplex (see Eqs. 1.7 to 1.11 and Fig. 1.2). This neighborhood is neither searched in random order as in hill-climbing, nor is the best point selected as in steepest descent. Instead, not all neighbors are necessarily evaluated; there is a predefined sequence of assessing the quality of the points, along with rules for how to incorporate them into the current set of points defining the simplex.

[5] The search for the best point on the line is usually done analytically. The gradient in the best point is perpendicular to the line.
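The candidate points of one iteration, given by Eqs. 1.6 to 1.11 below with the standard coefficient settings, can be computed directly. The following sketch is illustrative (the example simplex is hypothetical):

```python
def nm_points(x, rho=1.0, chi=2.0, gamma=0.5, sigma=0.5):
    """Candidate points of one Nelder-Mead iteration (Eqs. 1.6 to 1.11).

    x: list of D+1 points (each a list of D coordinates), sorted by cost
    so that f(x[0]) <= ... <= f(x[D]).
    """
    D = len(x) - 1
    worst = x[-1]
    xbar = [sum(p[d] for p in x[:-1]) / D for d in range(D)]            # Eq. 1.6
    y_r  = [xb + rho * (xb - w) for xb, w in zip(xbar, worst)]          # reflection
    y_e  = [xb + chi * (r - xb) for xb, r in zip(xbar, y_r)]            # expansion
    y_oc = [xb + gamma * (r - xb) for xb, r in zip(xbar, y_r)]          # outside contraction
    y_ic = [xb - gamma * (xb - w) for xb, w in zip(xbar, worst)]        # inside contraction
    y_s  = [[x[0][d] + sigma * (p[d] - x[0][d]) for d in range(D)]      # shrink points
            for p in x[1:]]
    return y_r, y_e, y_oc, y_ic, y_s

# 2D example: a triangle with vertices sorted by cost (values are illustrative).
y_r, y_e, y_oc, y_ic, y_s = nm_points([[0.0, 0.0], [2.0, 0.0], [1.0, 2.0]])
```

Note that all candidates lie on the line through the centroid and the worst point, except the shrink points, which pull the whole simplex toward the best vertex.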

Figure 1.2: Points forming the neighborhood in the Nelder-Mead simplex search. The defining simplex is plotted with a dashed line; the points in the neighborhood are marked with black dots.

Suppose now that the simplex is given by points x_1, x_2, ..., x_{D+1} ∈ R^D which are ordered in such a way that f(x_1) ≤ f(x_2) ≤ ... ≤ f(x_{D+1}), where f is the cost function. During one iteration, the neighborhood of the simplex consists of points which are computed as follows:

    \bar{x} = \frac{1}{D} \sum_{d=1}^{D} x_d,    (1.6)
    y_r = \bar{x} + \rho(\bar{x} - x_{D+1}),    (1.7)
    y_e = \bar{x} + \chi(y_r - \bar{x}),    (1.8)
    y_{oc} = \bar{x} + \gamma(y_r - \bar{x}),    (1.9)
    y_{ic} = \bar{x} - \gamma(\bar{x} - x_{D+1}),    (1.10)
    y_{s_i} = x_1 + \sigma(x_i - x_1), \quad i \in \{2, \ldots, D+1\},    (1.11)

where the coefficients should satisfy ρ > 0, χ > 1, χ > ρ, 0 < γ < 1, and 0 < σ < 1. The standard settings are ρ = 1, χ = 2, γ = 0.5, σ = 0.5. One iteration of the algorithm is presented as Alg. 1.3. Although this algorithm maintains a set of current points and the neighborhood is defined dynamically during the run, it still has the characteristics of an incomplete method. The Nelder-Mead search is applicable in black-box optimization (which does not hold for steepest descent). Hill-climbing is not able to adapt the size and shape of its neighborhood, which is the most important feature of the Nelder-Mead method. However, the author is not aware of any systematic empirical comparison of the above-mentioned methods.

1.3 Genetic and Evolutionary Algorithms

Genetic and evolutionary algorithms (GEAs) have been known for a few decades and have proved to be a powerful optimization and search tool in many research and application areas. They can be considered a stochastic generalization of the techniques described above which search in the neighborhood of several points. They maintain a set of potential solutions during the search and are able to search in many neighborhoods of various sizes and structures at the same time.
They are inspired by the processes that can be observed in nature, mainly by natural selection and variation. The terminology used to describe the GEA

Algorithm 1.3: An iteration of the Nelder-Mead downhill simplex algorithm
Input: x_1, x_2, ..., x_{D+1} so that f(x_1) ≤ f(x_2) ≤ ... ≤ f(x_{D+1})
begin
    /* Reflection */
    compute y_r using Eq. 1.7
    if f(x_1) ≤ f(y_r) < f(x_D) then Accept(y_r); exit
    if f(y_r) < f(x_1) then
        /* Expansion */
        compute y_e using Eq. 1.8
        if f(y_e) < f(y_r) then Accept(y_e); exit
        else Accept(y_r); exit
    if f(y_r) ≥ f(x_D) then
        /* Contraction */
        if f(y_r) < f(x_{D+1}) then
            /* Outside contraction */
            compute y_oc using Eq. 1.9
            if f(y_oc) ≤ f(y_r) then Accept(y_oc); exit
        else
            /* Inside contraction */
            compute y_ic using Eq. 1.10
            if f(y_ic) < f(x_{D+1}) then Accept(y_ic); exit
    /* Shrinking */
    compute the points y_si using Eq. 1.11
    MakeSimplex(x_1, y_s2, ..., y_s,D+1)
end

is also borrowed from biology and genetics. A potential solution, or a point in the search space, is called a chromosome or an individual. Each component of the chromosome (each variable of the solution) is called a locus, and a particular value at a certain locus is called an allele. The set of potential solutions maintained by the algorithm is called a population. The cost function (which is usually minimized, as the name suggests) is translated into the fitness function, which describes the ability of an individual to survive in the current environment (and should thus be maximized). The search performed by GEAs is called an evolution. The evolution consists of many iterations called generations, and in each generation the algorithm performs operations like selection, crossover, mutation, and replacement, which are described below. GEAs gained their popularity mainly for two reasons:

1. they are conceptually easy, i.e. it is easy to describe the processes taking place inside them and thus they are easy to implement, and

2. they have shown better global search abilities[6] than conventional methods, i.e.
for many problems, GEAs are able to find even those local optima which are hard to find for local optimizers, and furthermore the operators of GEAs are usually chosen in such a way that they ensure the asymptotic completeness of the algorithm.

[6] We cannot decide whether a certain algorithm performs local or global search until we know the representation of the solution and the definition of the neighborhood induced by the variation operators. E.g. the hill-climbing algorithm is usually regarded as a local search method because its perturbation operator is usually local. However, it can perform global search as well if the neighborhood contains all possible solutions, i.e. if the probability of generating any possible solution is greater than 0.

In subsequent sections, the above-mentioned topics are discussed. First, the well-established types of GEAs are described, and finally the general structure of an evolutionary algorithm is presented and explained.

1.3.1 Conventional EAs

EAs are sometimes classified by the evolutionary computation (EC) community into four groups:

Evolutionary programming (EP), which works with finite automata and numeric representations and has been developed mainly in the United States since the 1960s;

Evolution strategies (ES), which work with vectors of real (rational) numbers and were invented in Germany in the late 1960s;

Genetic algorithms (GAs), which work with binary representations and have been known since the 1970s; and

Genetic programming (GP), which works with programs represented in tree form and is the most recent member of the EA family.

There are other types of EAs which are not included in the above list, but these are quite recent and differ only in the operators used in them. In fact, there is no need to draw any hard boundaries between the four types of EAs mentioned above. They influence each other and cooperate by exchanging ideas and various techniques. What all EAs have in common is the evolutionary cycle, in which some sort of selection and variation must be present. A typical evolutionary scheme is presented as Alg. 1.4. Not all of the presented operations have to be included, nor must they follow this order.

Algorithm 1.4: Genetic Algorithm
begin
    X^(0) ← Initialize()
    f^(0) ← Evaluate(X^(0))
    g ← 0
    while not TerminationCondition() do
        X_Par ← Select(X^(g), f^(g))
        X_Offs ← Crossover(X_Par)
        X_Offs ← Mutate(X_Offs)
        f_Offs ← Evaluate(X_Offs)
        [X^(g+1), f^(g+1)] ← Replace(X^(g), X_Offs, f^(g), f_Offs)
        g ← g + 1
end

The initialization phase creates the first generation of individuals.
Usually, they are created randomly, but if the designer has some knowledge of the placement of good solutions in the search space, he can use it and initialize the population in a biased way. The population is then evaluated using the fitness function and the main generational loop starts.
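The cycle of Alg. 1.4 can be sketched with one concrete, hypothetical choice of operators: tournament selection, 1-point crossover, bit-flip mutation, and full generational replacement, applied to the OneMax fitness (maximize the number of ones). None of these choices comes from the thesis; they are illustrative only:

```python
import random

def genetic_algorithm(fitness, n_bits, pop_size=20, generations=50,
                      p_mut=0.05, tournament=2):
    """A minimal instance of Alg. 1.4 on binary strings (maximization)."""
    pop = [[random.randint(0, 1) for _ in range(n_bits)]
           for _ in range(pop_size)]                       # Initialize
    for _ in range(generations):
        offspring = []
        while len(offspring) < pop_size:
            # Select: tournament of two small random groups
            p1 = max(random.sample(pop, tournament), key=fitness)
            p2 = max(random.sample(pop, tournament), key=fitness)
            cut = random.randrange(1, n_bits)              # Crossover: 1-point
            child = p1[:cut] + p2[cut:]
            child = [b ^ 1 if random.random() < p_mut else b
                     for b in child]                       # Mutate: bit flips
            offspring.append(child)
        pop = offspring                                    # Replace: whole population
    return max(pop, key=fitness)

random.seed(2)
best = genetic_algorithm(sum, n_bits=16)   # OneMax: fitness = number of ones
```

Swapping any single operator (e.g. rank-based selection for the tournament) changes only one call-back, which is exactly the template character of Alg. 1.4.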

The selection phase designates the parents, i.e. it chooses some of the individuals in the current population to serve as a source of genetic material for the creation of new offspring. The selection should model the phenomenon that better individuals have higher odds of reproducing; thus it is driven by the fitness values. There are many selection schemes, based on the following three basic types:

proportional, which takes the raw fitness values and from them calculates the probability of including each individual in the parental set;

rank-based, which does not use the raw fitness values to compute the probabilities; instead, it sorts the individuals and (to simplify) the rank becomes the fitness value; and

tournament, which takes small random groups of individuals from the current population and designates the best of each group as a parent.

After the set of parents is determined, the variation operators, crossover and mutation, are carried out to create new, hopefully better, offspring individuals. The crossover performs the search in the neighborhood of several individuals, as it usually takes two or more parents, combines them, and creates one or more offspring. The particular process used to combine the parents to produce new individuals is strongly dependent on the chosen representation and will not be described here (for more information see e.g. Goldberg (1989), Michalewicz & Fogel (1999)). The mutation is equivalent to a local search strategy in the neighborhood of one point: it usually takes one individual, perturbs it a bit, and the resulting mutant joins the other offspring.

After the creation of the offspring population is completed, all of its members are evaluated. Now we have the old population and the population of offspring solutions. Among all individuals of these two populations, the competition for natural resources takes place. Not all the parents survive to the next generation.
Sometimes the principle of the survival of the fittest holds, at least in a probabilistic form, i.e. the fitter the individual, the higher the probability that it will survive; sometimes the old population is replaced by the new one as a whole. The implementation of this procedure is called the replacement. Although it is possible to use a dynamic population size, most often it is required to be constant, and the replacement discards many generated solutions.

EA: A Simple Problem Solver?

The EA metaheuristic is very simple. For a great majority of problems, once one compiles the fitness function and decides on the representation, one can simply use some of the standard operators suitable for the chosen representation. The EA will then run and will produce some results. Using such an EA, one can easily evolve e.g. matrices, graphs, nets, sets of rules, images, designs of industrial facilities, etc. Making EAs run without any special domain knowledge is thus a very easy job for a wide range of applications. This flexibility and wide applicability of EAs is their greatest strength and weakness at the same time. For all these reasons, it is very tempting to describe EAs as simple problem solvers. In fact, things are not that easy. If we apply this naive approach, our EA is very likely to suffer from several potential negative features described below.

EA Related Issues

Instantiation of EA. An EA constitutes a general framework, a template. To make it work, one must choose a representation of solutions and define some call-back functions (initialization, selection and replacement methods, crossover and mutation operators, parameters of the methods, etc.) which are then plugged into the template. After that, an instance of the EA is created and can be used. The task of defining these functions can be called the instantiation of the EA. For some types of EAs, the amount of things to be set up is really large, and this can be a source of frustration for many EA designers. Finding a really good instantiation is often a very hard problem in itself.

The problem of a good instantiation is closely related to the so-called No-Free-Lunch theorem (Wolpert & Macready (1997)), which states that all search procedures based on sampling the search space have, on average, the same efficiency across all possible problems. Something similar holds even for the representation and for the variational operators. When designing an EA with the aim of better efficiency, we have to apply domain-specific knowledge, which will help the algorithm search in the right direction, but will deteriorate the algorithm's behavior in other domains. Similarly, if our aim is broader applicability, we have to develop an algorithm that will e.g. learn the problem characteristics from its own evolution, at the expense of lower efficiency.[7] In that case we need very flexible variational operators that are able to adapt to changing living environments.

Linkage, Epistasis, Statistical Dependency. The basic versions of EAs usually treat the individual components of the solutions as statistically independent of each other. If this assumption does not hold, we speak of linkage, epistasis, or statistical dependency of the individual components. Linkage presents a severe issue for EAs. It is closely related to the issue of instantiation. Sometimes a problem at hand could be successfully solved using a basic version of an EA, but by choosing a bad representation we can accidentally introduce dependencies into the algorithm and greatly deteriorate the efficiency of the EA.
Most often, however, it is not easy (or it is impossible) to come up with a representation which reduces or removes the dependencies, because we know nothing or very little about the fitness function.

Premature Convergence. During the evolution it can easily happen that one sub-optimal individual takes over the whole population, and the search almost stops. This phenomenon is called premature convergence. It is not an issue in itself; rather, it is a symptom of a bad instantiation of the EA. Some types of EAs are more resistant to this phenomenon, namely the algorithms that rely more on the mutation type of variational operator than on the crossover type. One of the EA parameters has a significant effect on premature convergence: the population size. If it is small, there is a much bigger chance that premature convergence emerges. A too large population, on the other hand, wastes computational resources.

[7] I hypothesize that the No-Free-Lunch theorem holds even for algorithms that learn problem characteristics during their run, i.e. although they have the ability to adapt to one class of problems, they are misled in the complementary class of problems. All the work on creating algorithms able to learn is carried out in the hope that the class of problems solvable by these algorithms corresponds to a great extent with the class of problems people try to solve in their real lives.

EA: A Big Puzzle!

To make an EA run is an easy task. To make it run successfully is a very hard task. If we are interested in creating an EA instance that is tuned (or is able to tune itself) to the problem being solved, we have to solve a big puzzle and carefully assemble the individual pieces so that they cooperate and do not harm each other, in order to overcome at least some of the above-mentioned problems. This effort can be aimed at creating algorithms which have a lower number of parameters that must be set by the designer, and which are more robust, so that we can apply them to a broader range of problems without substantial changes. The effort can be aimed at automatic control of some EA parameters, so that the algorithm itself actively tries to fight premature convergence. The effort can be aimed at detecting the relations between variables, and at incorporating this extra knowledge into the evolutionary scheme. The possibilities to optimize the EA are countless; however, they can be classified into several groups.

Theory. It would be great if we had a kind of EA model which would tell us how a particular instantiation of an EA would behave. Such general and practically applicable models, however, do not exist at present. The interactions between the individual parts of an EA can be easily described, but hardly analyzed. In spite of that, there are some attempts to build theoretical foundations for EAs. Some attempts to build a theoretical apparatus for the optimal population sizing of GAs can be found in Goldberg (1989), Goldberg et al. (1992), or Harik, Cantú-Paz, Goldberg & Miller (1997), and for optimal operator probabilities in Schaffer & Morishima (1987), or Thierens & Goldberg (1991). Eiben et al. (1999) present a judgment of these theoretical models: "... the complexities of evolutionary processes and characteristics of interesting problems allow theoretical analysis only after significant simplification in either the algorithm or the problem model. ... the current EA theory is not seen as a useful basis for practitioners."

Deterministic Heuristics. The characteristics of the population change during the evolution, and thus a static setting of the EA parameters is not optimal. Deterministic control changes some parameters using a fixed time schedule. The works listed in the next paragraph show that for many real-world problems, even a non-optimal choice of schedule often leads to better results than a near-optimal choice of a static value. A deterministic schedule for changing the mutation probability can be found in Fogarty (1989). Janikow & Michalewicz (1991) present a mutation operator which searches the space uniformly in the initial stages, and very locally at later stages. In Joines & Houck (1994), the penalties of solutions to a constrained optimization problem are transformed by a deterministic schedule, resulting in a dynamic evaluation function.

Feedback Heuristics. With feedback heuristics, the changes of parameters are triggered by some event in the population. The typical feature of EAs using feedback heuristics is the monitoring of some population statistics (relative improvement of the population, diversity of the population, etc.).
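As an illustration of a deterministic schedule, the following sketch implements a non-uniform mutation in the spirit of Janikow & Michalewicz (1991): at generation t = 0 it perturbs a variable almost uniformly within its bounds, while near the final generation T the perturbation shrinks towards zero. The formula used here is the commonly cited form of the operator; the parameter b, controlling the speed of the shrinkage, is a tunable assumption:

```python
import random

def nonuniform_mutation(x, t, T, low, high, b=5.0):
    # Perturb x within [low, high]; the expected step shrinks as t approaches T.
    r = random.random()
    # delta(y) lies in [0, y] and tends to 0 as t -> T, since r**((1-t/T)**b) -> 1
    delta = lambda y: y * (1.0 - r ** ((1.0 - t / T) ** b))
    if random.random() < 0.5:
        return x + delta(high - x)   # step towards the upper bound
    return x - delta(x - low)        # step towards the lower bound
```

At t = T the operator returns x unchanged; early in the run it behaves almost like a uniform re-initialization of the variable.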

Julstrom (1995) presents an algorithm that uses one of the mutation or crossover operators more often than the other, based on their previous success. Davis (1991) used a similar principle for multiple crossover operators. Lis & Lis (1996) present a parallel model where each subpopulation uses different settings of the parameters. After a certain time period, the subpopulations are compared and the settings are shifted towards the values of the most successful subpopulation. Shaefer (1987) adapts the mapping of variables (resolution, range, position) based on the convergence and variance of the genes. A co-evolutionary approach was used to adapt the evaluation function e.g. by Paredis (1995).

Self-Adaptation. The roots of self-adaptation can be found in ES, where the mutation steps are part of the individuals. The parameters of the EA undergo the evolution together with the actual solution of the problem. Bäck (1992) self-adapts mutation rates in a GA. Fogel et al. (1995) use self-adaptation to control the relative probabilities of five mutation operators for the components of a finite state machine. Spears (1995) added one extra bit to the individuals to discriminate whether 2-point or uniform crossover should be used. Schaffer & Morishima (1987) present a self-adaptation of the number and locations of crossover points. Diploid chromosomes (each individual has one additional bit which determines whether one should use the chromosome itself or its inversion) are also a form of self-adaptation (see e.g. Goldberg & Smith (1987)). In the context of ES, Rudolph (2001) has proved that the self-adaptation of mutation steps can lead to premature convergence.

Meta Optimizers. To adapt the settings of an EA, we can use another optimizer. This procedure usually involves letting the EA run for some time with certain settings and measuring its performance.
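A minimal sketch of ES-style self-adaptation, assuming a single global step size per individual and the usual log-normal update; the learning rate tau = 1/sqrt(2n) is one common choice, not the only one:

```python
import math
import random

def self_adaptive_mutation(x, sigma, tau=None):
    # ES-style mutation: the step size sigma evolves together with the solution x.
    n = len(x)
    if tau is None:
        tau = 1.0 / math.sqrt(2.0 * n)
    # Mutate the strategy parameter first (log-normal, so sigma stays positive)...
    new_sigma = sigma * math.exp(tau * random.gauss(0.0, 1.0))
    # ...then use the new step size to perturb the object variables.
    new_x = [xi + new_sigma * random.gauss(0.0, 1.0) for xi in x]
    return new_x, new_sigma
```

The key design choice is the order of the two steps: the offspring is sampled with the already mutated sigma, so selection implicitly judges the quality of the step size through the quality of the solution it produced.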
Similarly to self-adaptation, the optimization process usually takes a longer time, since we search a much larger space (the object parameter space plus the EA parameter space). An example of this approach can be found e.g. in Grefenstette (1986).

New Types of EAs. Although many researchers have reported promising results using adaptation in EAs, it is not clear whether adaptation can solve our problems completely. Eiben et al. (1999) state that one of the main obstacles to optimizing the EA settings is formed by the interactions between these parameters. If we use some basic form of EA (which treats the variables independently of each other) to optimize the EA parameters (which usually interact with each other), we can hardly find the optimal settings. However, new types of EAs (or new types of variational operators and other EA components) can be of great help: algorithms that have fewer parameters to set (or optimize), and that address perhaps the most severe issue in EA design, the interactions among variables. Before we can describe methods that would allow us to construct such algorithms, or operators that account for dependencies among variables, we have to clarify the influence of the chosen representation on the operator-induced neighborhood structure, and the roles of crossover and mutation in EAs.

Effect of the Genotype-Phenotype Mapping in EAs

As stated earlier, optimization algorithms can work with two (or more) representations of possible solutions. The representation that can be directly interpreted as a solution and evaluated is called the phenotype; the other representation, on which the variational operators are applied, is called the genotype. This subsection shortly demonstrates the phenotype-level effect of this genotype-phenotype mapping when it is used with variational operators on the genotype level. The phenotype used in the examples below is a pair of integer numbers. Both are encoded as 5-bit binary numbers, i.e. the genotype is a bit string of length 10 and the mapping itself is of type B^10 -> I^2.

The first example in Fig. 1.3 shows all possible neighbors of a given individual induced by a single bit-flip mutation operator applied to the genotype. The neighbors are located on lines parallel to the coordinate axes with the origin in the given point. The number of neighbors is independent of the given point.

The second example in Fig. 1.4 shows the neighborhood structure of two parents induced by a 1-point crossover. With this kind of crossover, only one of the phenotype coordinates can differ from both parents. The number of neighbors varies depending on the pair of parents.

The last example in Fig. 1.5 shows the neighborhood structure of two parents induced by a 2-point crossover. In this case, the neighbors can differ from both parents in both phenotype coordinates, i.e. the neighborhood structure can be rather complex, and its size is variable as in the previous case.

These examples demonstrate that simple operations on the genotype level can induce rather complex patterns on the phenotype level. If some genotype-phenotype mapping is employed in the algorithm, then e.g. a local search procedure performed on the genotype level can hardly be described as a local search on the phenotype level; the genotype-level local search may seem very unstructured on the phenotype level. Using a genotype that induces a high number of neighbors under the mutation and crossover operators (e.g. a binary representation) enhances the global search abilities of the algorithm. However, the usefulness of the genotype-phenotype mapping (as used in the above examples) is questionable and more or less accidental, because the mapping is hard-wired into the algorithm, i.e. for some instances of optimization tasks it can be advantageous, but for other ones of the same type and complexity it may fail completely. As designers of EAs, we should take control of this aspect and use adaptive genotype-phenotype mappings rather than the obvious hard-wired solutions.

The Roles of Mutation and Crossover

One of the advantages of EAs, compared to other techniques, is the ability to escape from local optima and to use the population as a facility to learn the structure of the problem.[8] Is it the crossover, or the mutation, that ensures these features?[9]

The mutation generates offspring in the local neighborhood of one point, i.e. it has only one point as its input. As explained in the previous subsection, if some genotype-phenotype mapping is present, the mutation can generate new individuals that can hardly be described as neighbors in the sense of a phenotype distance. Without such a mapping, however, there is no chance for mutation

[8] To learn the structure of the problem: to create a variational operator which is able to modify itself based on the previous population members given as its input, with the aim of reaching a higher probability of creating good individuals compared to a non-adaptive operator.

[9] The ability to escape from local optima is also ensured by the stochastic selection process, i.e. by the ability to temporarily worsen the fitness of the population.
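The B^10 -> I^2 mapping described above, and the bit-flip neighborhood illustrated in Fig. 1.3, can be reproduced in a few lines (an illustrative sketch; the function names are mine):

```python
def decode(genotype):
    # Map a 10-bit string to a pair of integers (5 bits each): B^10 -> I^2.
    return int(genotype[:5], 2), int(genotype[5:], 2)

def bitflip_neighbors(genotype):
    # All phenotypes reachable from `genotype` by flipping a single bit.
    neighbors = []
    for i in range(len(genotype)):
        flipped = genotype[:i] + ('1' if genotype[i] == '0' else '0') + genotype[i + 1:]
        neighbors.append(decode(flipped))
    return neighbors
```

For example, the genotype "0101101100" decodes to the phenotype (11, 12); flipping its first bit yields (27, 12), a jump of 16 along one axis. Every one of the ten neighbors differs from (11, 12) in exactly one coordinate, by a power of two, which is why they lie on axis-parallel lines through the original point, with very uneven spacing.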


More information

Genetic Programming. Charles Chilaka. Department of Computational Science Memorial University of Newfoundland

Genetic Programming. Charles Chilaka. Department of Computational Science Memorial University of Newfoundland Genetic Programming Charles Chilaka Department of Computational Science Memorial University of Newfoundland Class Project for Bio 4241 March 27, 2014 Charles Chilaka (MUN) Genetic algorithms and programming

More information

METAHEURISTICS Genetic Algorithm

METAHEURISTICS Genetic Algorithm METAHEURISTICS Genetic Algorithm Jacques A. Ferland Department of Informatique and Recherche Opérationnelle Université de Montréal ferland@iro.umontreal.ca Genetic Algorithm (GA) Population based algorithm

More information

MAXIMUM LIKELIHOOD ESTIMATION USING ACCELERATED GENETIC ALGORITHMS

MAXIMUM LIKELIHOOD ESTIMATION USING ACCELERATED GENETIC ALGORITHMS In: Journal of Applied Statistical Science Volume 18, Number 3, pp. 1 7 ISSN: 1067-5817 c 2011 Nova Science Publishers, Inc. MAXIMUM LIKELIHOOD ESTIMATION USING ACCELERATED GENETIC ALGORITHMS Füsun Akman

More information

Meta- Heuristic based Optimization Algorithms: A Comparative Study of Genetic Algorithm and Particle Swarm Optimization

Meta- Heuristic based Optimization Algorithms: A Comparative Study of Genetic Algorithm and Particle Swarm Optimization 2017 2 nd International Electrical Engineering Conference (IEEC 2017) May. 19 th -20 th, 2017 at IEP Centre, Karachi, Pakistan Meta- Heuristic based Optimization Algorithms: A Comparative Study of Genetic

More information

CHAPTER 4 GENETIC ALGORITHM

CHAPTER 4 GENETIC ALGORITHM 69 CHAPTER 4 GENETIC ALGORITHM 4.1 INTRODUCTION Genetic Algorithms (GAs) were first proposed by John Holland (Holland 1975) whose ideas were applied and expanded on by Goldberg (Goldberg 1989). GAs is

More information

Fall 09, Homework 5

Fall 09, Homework 5 5-38 Fall 09, Homework 5 Due: Wednesday, November 8th, beginning of the class You can work in a group of up to two people. This group does not need to be the same group as for the other homeworks. You

More information

Genetic Algorithms Variations and Implementation Issues

Genetic Algorithms Variations and Implementation Issues Genetic Algorithms Variations and Implementation Issues CS 431 Advanced Topics in AI Classic Genetic Algorithms GAs as proposed by Holland had the following properties: Randomly generated population Binary

More information

Particle Swarm Optimization applied to Pattern Recognition

Particle Swarm Optimization applied to Pattern Recognition Particle Swarm Optimization applied to Pattern Recognition by Abel Mengistu Advisor: Dr. Raheel Ahmad CS Senior Research 2011 Manchester College May, 2011-1 - Table of Contents Introduction... - 3 - Objectives...

More information

GENETIC ALGORITHM with Hands-On exercise

GENETIC ALGORITHM with Hands-On exercise GENETIC ALGORITHM with Hands-On exercise Adopted From Lecture by Michael Negnevitsky, Electrical Engineering & Computer Science University of Tasmania 1 Objective To understand the processes ie. GAs Basic

More information

The Binary Genetic Algorithm. Universidad de los Andes-CODENSA

The Binary Genetic Algorithm. Universidad de los Andes-CODENSA The Binary Genetic Algorithm Universidad de los Andes-CODENSA 1. Genetic Algorithms: Natural Selection on a Computer Figure 1 shows the analogy between biological i l evolution and a binary GA. Both start

More information

Introduction to Design Optimization: Search Methods

Introduction to Design Optimization: Search Methods Introduction to Design Optimization: Search Methods 1-D Optimization The Search We don t know the curve. Given α, we can calculate f(α). By inspecting some points, we try to find the approximated shape

More information

Evolutionary Computation for Combinatorial Optimization

Evolutionary Computation for Combinatorial Optimization Evolutionary Computation for Combinatorial Optimization Günther Raidl Vienna University of Technology, Vienna, Austria raidl@ads.tuwien.ac.at EvoNet Summer School 2003, Parma, Italy August 25, 2003 Evolutionary

More information

Modern Methods of Data Analysis - WS 07/08

Modern Methods of Data Analysis - WS 07/08 Modern Methods of Data Analysis Lecture XV (04.02.08) Contents: Function Minimization (see E. Lohrmann & V. Blobel) Optimization Problem Set of n independent variables Sometimes in addition some constraints

More information

Suppose you have a problem You don t know how to solve it What can you do? Can you use a computer to somehow find a solution for you?

Suppose you have a problem You don t know how to solve it What can you do? Can you use a computer to somehow find a solution for you? Gurjit Randhawa Suppose you have a problem You don t know how to solve it What can you do? Can you use a computer to somehow find a solution for you? This would be nice! Can it be done? A blind generate

More information

Evolutionary Computation Algorithms for Cryptanalysis: A Study

Evolutionary Computation Algorithms for Cryptanalysis: A Study Evolutionary Computation Algorithms for Cryptanalysis: A Study Poonam Garg Information Technology and Management Dept. Institute of Management Technology Ghaziabad, India pgarg@imt.edu Abstract The cryptanalysis

More information

Today. Golden section, discussion of error Newton s method. Newton s method, steepest descent, conjugate gradient

Today. Golden section, discussion of error Newton s method. Newton s method, steepest descent, conjugate gradient Optimization Last time Root finding: definition, motivation Algorithms: Bisection, false position, secant, Newton-Raphson Convergence & tradeoffs Example applications of Newton s method Root finding in

More information

PATTERN CLASSIFICATION AND SCENE ANALYSIS

PATTERN CLASSIFICATION AND SCENE ANALYSIS PATTERN CLASSIFICATION AND SCENE ANALYSIS RICHARD O. DUDA PETER E. HART Stanford Research Institute, Menlo Park, California A WILEY-INTERSCIENCE PUBLICATION JOHN WILEY & SONS New York Chichester Brisbane

More information

Search Algorithms for Regression Test Suite Minimisation

Search Algorithms for Regression Test Suite Minimisation School of Physical Sciences and Engineering King s College London MSc in Advanced Software Engineering Search Algorithms for Regression Test Suite Minimisation By Benjamin Cook Supervised by Prof. Mark

More information

An Introduction to Evolutionary Algorithms

An Introduction to Evolutionary Algorithms An Introduction to Evolutionary Algorithms Karthik Sindhya, PhD Postdoctoral Researcher Industrial Optimization Group Department of Mathematical Information Technology Karthik.sindhya@jyu.fi http://users.jyu.fi/~kasindhy/

More information

Evolutionary Algorithms: Lecture 4. Department of Cybernetics, CTU Prague.

Evolutionary Algorithms: Lecture 4. Department of Cybernetics, CTU Prague. Evolutionary Algorithms: Lecture 4 Jiří Kubaĺık Department of Cybernetics, CTU Prague http://labe.felk.cvut.cz/~posik/xe33scp/ pmulti-objective Optimization :: Many real-world problems involve multiple

More information

OPTIMIZATION METHODS. For more information visit: or send an to:

OPTIMIZATION METHODS. For more information visit:  or send an  to: OPTIMIZATION METHODS modefrontier is a registered product of ESTECO srl Copyright ESTECO srl 1999-2007 For more information visit: www.esteco.com or send an e-mail to: modefrontier@esteco.com NEOS Optimization

More information

Solving Sudoku Puzzles with Node Based Coincidence Algorithm

Solving Sudoku Puzzles with Node Based Coincidence Algorithm Solving Sudoku Puzzles with Node Based Coincidence Algorithm Kiatsopon Waiyapara Department of Compute Engineering, Faculty of Engineering, Chulalongkorn University, Bangkok, Thailand kiatsopon.w@gmail.com

More information

Introduction to Genetic Algorithms

Introduction to Genetic Algorithms Advanced Topics in Image Analysis and Machine Learning Introduction to Genetic Algorithms Week 3 Faculty of Information Science and Engineering Ritsumeikan University Today s class outline Genetic Algorithms

More information

SIMULATED ANNEALING TECHNIQUES AND OVERVIEW. Daniel Kitchener Young Scholars Program Florida State University Tallahassee, Florida, USA

SIMULATED ANNEALING TECHNIQUES AND OVERVIEW. Daniel Kitchener Young Scholars Program Florida State University Tallahassee, Florida, USA SIMULATED ANNEALING TECHNIQUES AND OVERVIEW Daniel Kitchener Young Scholars Program Florida State University Tallahassee, Florida, USA 1. INTRODUCTION Simulated annealing is a global optimization algorithm

More information

Unidimensional Search for solving continuous high-dimensional optimization problems

Unidimensional Search for solving continuous high-dimensional optimization problems 2009 Ninth International Conference on Intelligent Systems Design and Applications Unidimensional Search for solving continuous high-dimensional optimization problems Vincent Gardeux, Rachid Chelouah,

More information

5. Computational Geometry, Benchmarks and Algorithms for Rectangular and Irregular Packing. 6. Meta-heuristic Algorithms and Rectangular Packing

5. Computational Geometry, Benchmarks and Algorithms for Rectangular and Irregular Packing. 6. Meta-heuristic Algorithms and Rectangular Packing 1. Introduction 2. Cutting and Packing Problems 3. Optimisation Techniques 4. Automated Packing Techniques 5. Computational Geometry, Benchmarks and Algorithms for Rectangular and Irregular Packing 6.

More information

arxiv:cs/ v1 [cs.ne] 15 Feb 2004

arxiv:cs/ v1 [cs.ne] 15 Feb 2004 Parameter-less Hierarchical BOA Martin Pelikan and Tz-Kai Lin arxiv:cs/0402031v1 [cs.ne] 15 Feb 2004 Dept. of Math. and Computer Science, 320 CCB University of Missouri at St. Louis 8001 Natural Bridge

More information

One-Point Geometric Crossover

One-Point Geometric Crossover One-Point Geometric Crossover Alberto Moraglio School of Computing and Center for Reasoning, University of Kent, Canterbury, UK A.Moraglio@kent.ac.uk Abstract. Uniform crossover for binary strings has

More information

Evolutionary Algorithms

Evolutionary Algorithms Evolutionary Algorithms Proposal for a programming project for INF431, Spring 2014 version 14-02-19+23:09 Benjamin Doerr, LIX, Ecole Polytechnique Difficulty * *** 1 Synopsis This project deals with the

More information

Biologically Inspired Algorithms (A4M33BIA) Optimization: Conventional and Unconventional Optimization Techniques

Biologically Inspired Algorithms (A4M33BIA) Optimization: Conventional and Unconventional Optimization Techniques Algorithms (A4M33BIA) : and Techniques Jiří Kubalík Czech Technical University in Prague Faculty of Electrical Engineering Department of Cybernetics Based on P. Pošík: Introduction to Soft Computing..

More information

Adaptive Crossover in Genetic Algorithms Using Statistics Mechanism

Adaptive Crossover in Genetic Algorithms Using Statistics Mechanism in Artificial Life VIII, Standish, Abbass, Bedau (eds)(mit Press) 2002. pp 182 185 1 Adaptive Crossover in Genetic Algorithms Using Statistics Mechanism Shengxiang Yang Department of Mathematics and Computer

More information

Global optimisation techniques in water resources management

Global optimisation techniques in water resources management XXVI General Assembly of European Geophysical Society Nice, France, 6-30 March 001 HSC/ Water Resources Engineering: Hydroinformatics Global optimisation techniques in water resources management Andreas

More information

Lecture 6: The Building Block Hypothesis. Genetic Algorithms and Genetic Programming Lecture 6. The Schema Theorem Reminder

Lecture 6: The Building Block Hypothesis. Genetic Algorithms and Genetic Programming Lecture 6. The Schema Theorem Reminder Lecture 6: The Building Block Hypothesis 1 Genetic Algorithms and Genetic Programming Lecture 6 Gillian Hayes 9th October 2007 The Building Block Hypothesis Experimental evidence for the BBH The Royal

More information

3.6.2 Generating admissible heuristics from relaxed problems

3.6.2 Generating admissible heuristics from relaxed problems 3.6.2 Generating admissible heuristics from relaxed problems To come up with heuristic functions one can study relaxed problems from which some restrictions of the original problem have been removed The

More information

Genetic Algorithms and Genetic Programming Lecture 7

Genetic Algorithms and Genetic Programming Lecture 7 Genetic Algorithms and Genetic Programming Lecture 7 Gillian Hayes 13th October 2006 Lecture 7: The Building Block Hypothesis The Building Block Hypothesis Experimental evidence for the BBH The Royal Road

More information

Automated Test Data Generation and Optimization Scheme Using Genetic Algorithm

Automated Test Data Generation and Optimization Scheme Using Genetic Algorithm 2011 International Conference on Software and Computer Applications IPCSIT vol.9 (2011) (2011) IACSIT Press, Singapore Automated Test Data Generation and Optimization Scheme Using Genetic Algorithm Roshni

More information

Short-Cut MCMC: An Alternative to Adaptation

Short-Cut MCMC: An Alternative to Adaptation Short-Cut MCMC: An Alternative to Adaptation Radford M. Neal Dept. of Statistics and Dept. of Computer Science University of Toronto http://www.cs.utoronto.ca/ radford/ Third Workshop on Monte Carlo Methods,

More information

Local Search. CS 486/686: Introduction to Artificial Intelligence Winter 2016

Local Search. CS 486/686: Introduction to Artificial Intelligence Winter 2016 Local Search CS 486/686: Introduction to Artificial Intelligence Winter 2016 1 Overview Uninformed Search Very general: assumes no knowledge about the problem BFS, DFS, IDS Informed Search Heuristics A*

More information

Introduction to Optimization

Introduction to Optimization Introduction to Optimization Randomized Search Heuristics + Introduction to Continuous Optimization I November 25, 2016 École Centrale Paris, Châtenay-Malabry, France Dimo Brockhoff INRIA Saclay Ile-de-France

More information

Geometric Semantic Genetic Programming ~ Theory & Practice ~

Geometric Semantic Genetic Programming ~ Theory & Practice ~ Geometric Semantic Genetic Programming ~ Theory & Practice ~ Alberto Moraglio University of Exeter 25 April 2017 Poznan, Poland 2 Contents Evolutionary Algorithms & Genetic Programming Geometric Genetic

More information

PARALLELIZATION OF THE NELDER-MEAD SIMPLEX ALGORITHM

PARALLELIZATION OF THE NELDER-MEAD SIMPLEX ALGORITHM PARALLELIZATION OF THE NELDER-MEAD SIMPLEX ALGORITHM Scott Wu Montgomery Blair High School Silver Spring, Maryland Paul Kienzle Center for Neutron Research, National Institute of Standards and Technology

More information

The k-means Algorithm and Genetic Algorithm

The k-means Algorithm and Genetic Algorithm The k-means Algorithm and Genetic Algorithm k-means algorithm Genetic algorithm Rough set approach Fuzzy set approaches Chapter 8 2 The K-Means Algorithm The K-Means algorithm is a simple yet effective

More information

Genetic Algorithms and Image Search Pavel Mrázek

Genetic Algorithms and Image Search Pavel Mrázek Genetic Algorithms and Image Search Pavel Mrázek Department of Computer Science, Faculty of Electrical Engineering, Czech Technical University (»VUT), Karlovo nám. 13, 12135 Praha 2, Czech Republic e-mail:

More information

Genetic Algorithms. Kang Zheng Karl Schober

Genetic Algorithms. Kang Zheng Karl Schober Genetic Algorithms Kang Zheng Karl Schober Genetic algorithm What is Genetic algorithm? A genetic algorithm (or GA) is a search technique used in computing to find true or approximate solutions to optimization

More information

10-701/15-781, Fall 2006, Final

10-701/15-781, Fall 2006, Final -7/-78, Fall 6, Final Dec, :pm-8:pm There are 9 questions in this exam ( pages including this cover sheet). If you need more room to work out your answer to a question, use the back of the page and clearly

More information

Evolving SQL Queries for Data Mining

Evolving SQL Queries for Data Mining Evolving SQL Queries for Data Mining Majid Salim and Xin Yao School of Computer Science, The University of Birmingham Edgbaston, Birmingham B15 2TT, UK {msc30mms,x.yao}@cs.bham.ac.uk Abstract. This paper

More information