In: Journal of Applied Statistical Science, Volume 18, Number 3, pp. 1-7. ISSN: 1067-5817. © 2011 Nova Science Publishers, Inc.

MAXIMUM LIKELIHOOD ESTIMATION USING ACCELERATED GENETIC ALGORITHMS

Füsun Akman (1), Olcay Akman (1) and Joshua W. Hallam (2)
(1) Department of Mathematics, Illinois State University, USA
(2) Department of Mathematics, Michigan State University, USA

Abstract

We implement genetic algorithms, accelerated by population reduction, in maximum likelihood estimation of parameters of statistical distributions.

Introduction

A genetic algorithm (GA) is an optimization technique inspired by biological evolution operating under natural selection. First popularized by John Holland [6] and extensively studied by Goldberg [3], this technique has been shown to be robust and capable of dealing with highly multimodal and discontinuous search landscapes where traditional optimization techniques fail. Traditional methods such as hill-climbing and derivative-based methods are often able to find optimal points in ordinary landscapes. However, in multimodal landscapes they may get stuck in local optima, whereas the structure of genetic algorithms helps avoid this problem. Specifically, minimization of the highly nonlinear artificial neural network error surfaces remains one of the hurdles in high-density multivariate modeling. For instance, producing a Kohonen map in a typical clustering problem using Self-Organizing Feature Maps (SOFM) requires a significantly long computing time due to the ruggedness of the high-dimensional error surface. Genetic algorithms are routinely employed in these types of optimization problems.

In a genetic algorithm, a group of possible solutions, i.e., a population of chromosomes, is substituted into a fitness function (the function being optimized) and hence assigned fitness values. The chromosomes with desirable fitness values are allowed to mate with other chromosomes, mutate, and move on to the next generation.
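The selection-mating-mutation cycle can be sketched as a minimal GA on binary chromosomes. This is only an illustrative sketch, not the authors' implementation: the fitness function (counting 1-bits), population size, generation count, and mutation rate below are all assumed values chosen for the toy example.

```python
import random

def run_ga(fitness, length, pop_size=40, generations=60, mutation_rate=0.02):
    """Minimal generational GA on binary chromosomes (lists of 0/1).

    Illustrative sketch: parameters are toy defaults, not the paper's settings.
    """
    # random initial population
    pop = [[random.randint(0, 1) for _ in range(length)] for _ in range(pop_size)]
    for _ in range(generations):
        # binary tournament: pick two at random (with replacement), keep the fitter
        pool = []
        while len(pool) < pop_size:
            a, b = random.choice(pop), random.choice(pop)
            pool.append(a if fitness(a) >= fitness(b) else b)
        # mate pairs from the pool: one-point crossover, then bitwise mutation
        next_pop = []
        for i in range(0, pop_size, 2):
            p1, p2 = pool[i], pool[i + 1]
            cut = random.randrange(1, length)
            for child in (p1[:cut] + p2[cut:], p2[:cut] + p1[cut:]):
                next_pop.append([1 - bit if random.random() < mutation_rate else bit
                                 for bit in child])
        pop = next_pop
    return max(pop, key=fitness)

# toy run: maximize the number of 1-bits ("onemax")
best = run_ga(sum, length=20)
```

The paper's own runs use three crossover points rather than the single cut shown here; the overall cycle of evaluation, tournament selection, recombination, and mutation is the same.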
This process is repeated until either a certain number of generations is reached or there is no change in the best solution found for many generations. At the end of the algorithm, the chromosome with the highest fitness value is considered to be the solution.

In order to take advantage of the analogy with natural biological processes, the chromosomes are encoded as binary strings. Let l denote the length of a string. Typically, if the fitness function has K independent variables, then l is an integer multiple of K. The binary string is broken into parts of equal length, each representing one of the K variables, and each part is converted into a real number based on the range of possible values for its variable. The fitness associated with the chromosome is calculated by evaluating the fitness function at the K real values obtained from the chromosome. More formally, if f : R^K -> R denotes the fitness function, and g : {0, 1}^l -> R^K denotes the transformation from binary strings to real values, then the fitness of a chromosome is
calculated as fitness = f(g(chrom)), and the chromosomes with optimal fitness are chosen for the next generation. However, this choice is not deterministic. Usually, two chromosomes are selected at random, and the one with the higher fitness is kept. This technique is called binary tournament. Chromosomes can be chosen with replacement in a tournament. The chosen chromosomes are then put in a mating pool. The process continues until the mating pool has the desired size of the population in the next generation.

To begin creating the next generation, two chromosomes are chosen from the pool and mated. Mating in a GA is analogous to genetic recombination, in which segments of the code are swapped between the two chromosomes. The number of crossing-over points is up to the user, but in our work we used three. After the mating occurs, the two new chromosomes are mutated. That is, with a certain small probability, each bit may be changed from 0 to 1 or from 1 to 0. The crossover and mutation create two new chromosomes, which will be put into the next generation's population. The steps are repeated until all pairs in the mating pool have mated. The process of selection, mating, and mutation continues from generation to generation, creating better solutions as time progresses.

In a typical GA, the population size remains constant throughout the entire algorithm. We believe that this is a poor allocation of resources. Instead, we propose a GA that reduces the population size at every time step, allowing for a larger initial population size. Starting with a larger population, the algorithm has a better chance of selecting parts of the optimal solution earlier. In short, we believe that our method enables the algorithm to find the optimal solution more efficiently.

Accelerated Genetic Algorithms

We have developed three different methods of population reduction for genetic algorithms.
The first is an adaptive measure, and the other two are based on a predetermined pattern. We describe these three methods in detail below.

1. Adaptive Population Reduction. Adaptive population sizing means continually changing the population size based on parameters within the algorithm; the changes in population size depend on changes in average fitness and genetic variance. This method contrasts with predetermined sizing methods, in which the population size at each generation is independent of the changes in the population and in the fitness values generated. Adaptive measures have been offered by several authors, and a review of current methods can be found in Lima and Lobo [7].

In this paper, we present a new method based on the change in best fitness. (For convenience, we will assume from now on that best fitness means the largest fitness value in the current pool.) A variant of our approach was used earlier by Eiben et al. [2]: their method was to increase the population size if the best fitness increased, decrease the size if there was a short-term lack of fitness increase, and increase the population size if no change occurred for a long period of time. This approach may have several problems associated with it. For example, for the population to be increased, new chromosomes must obviously be created. However, if new chromosomes are created simply by cloning the best existing ones, as in the Eiben et al. study [2], then there is no increase in genetic diversity. Hence it would be more beneficial, in theory, to generate additional random individuals to simulate natural gene flow. Another problem is that fitness typically increases fastest early in a genetic algorithm, which would imply that the population size should grow early in the algorithm. If the individuals are obtained
only by cloning, then the population will lose genetic diversity even faster because of the dominance of the numerous clones with large fitness. It seems that when the population size is likely to increase early in the algorithm, simply starting with a larger population would both contribute higher genetic diversity and use the same amount of computation.

Our approach takes a very different stand from [2]. We believe that as the best fitness increases, we may reduce the population size and still obtain reasonable results with less computation than would be employed in a typical genetic algorithm. A key point of this approach is that the population size is reduced only when the best fitness improves (the method never increases the population size). In justification, suppose that we wish to perform optimization in a multimodal fitness landscape. If we start with a large population, then we can hit more points on the rugged landscape. However, as time passes, the solutions will start aggregating around certain areas (hopefully near the solution), so we can reduce the population size without much loss of accuracy. In other words, the reduced complexity of the search allows a smaller population to optimize the problem with the same or better results than a larger one. Since the chromosomes with the best fitness will be allowed to mate often, the solutions will continue to concentrate around the desired solution. Thus, the change in best fitness is a good indicator of how well the algorithm is performing. The small population size, with the implementation of elitism, allows genetic drift to fine-tune the solution without losing the best solution in the process: suppose that the population has aggregated in a small part of the search space, so that there are only slight changes in fitness.
At this point, it is more economical to have a small population, because a chromosome whose fitness differs only slightly from that of the optimal solution has a better chance to be chosen to participate in a tournament. (Although the choice to participate in the tournament is random, with a smaller population every chromosome has a better chance to be chosen.) Thus, those with a slightly better fitness can participate and be chosen for the mating pool. At the same time, this part of the algorithm is merely choosing between solutions that differ only a little, and it is less important than the phase of the algorithm making large jumps in fitness.

We have developed a formula to quantify the amount of reduction. It is based on the idea that the population size should be reduced proportionally to the change in best fitness. Let N_t be the population size at generation t, and denote the relative change in best fitness at generation t by Δf_t^best = |(f_{t-1}^best - f_{t-2}^best) / f_{t-2}^best|. We use the absolute value to deal with fitness values that can be both positive and negative. We then choose a threshold Δf^best and set

    N_{t+1} = (1 - Δf_t^best) N_t,  if Δf_t^best <= Δf^best,
    N_{t+1} = (1 - Δf^best) N_t,    if Δf_t^best > Δf^best,             (1)
    N_{t+1} = MIN_POPSIZE,          if the value above would be less than MIN_POPSIZE.

When this type of decrease is used, we implement elitism, allowing the best chromosome to continue on to the next generation unchanged, so that the change in best fitness is always nonnegative. Clearly Δf^best < 1; based on empirical evidence, this value should be chosen in the interval [0.05, 0.2]. As a side note, the typical genetic algorithm is a special case of our method, with Δf^best = 0. The determination of the minimum population size is arbitrary; however, to avoid the negative effects of extremely small populations, we set MIN_POPSIZE equal to 20 based on work by Reeves [9].
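The adaptive rule can be written as a small update function. This is a sketch of formula (1) with MIN_POPSIZE = 20 as in the text; the function and parameter names are our own, and the default threshold of 0.1 is just one value from the recommended interval [0.05, 0.2].

```python
MIN_POPSIZE = 20  # minimum population size, following Reeves [9]

def next_pop_size(N_t, f_best_prev, f_best_prev2, threshold=0.1):
    """Population size at generation t+1 under the adaptive rule (1).

    f_best_prev, f_best_prev2 -- best fitness at generations t-1 and t-2
    threshold                 -- the cutoff Delta f^best, chosen in [0.05, 0.2]
    """
    # relative change in best fitness; absolute value handles negative fitness
    delta = abs((f_best_prev - f_best_prev2) / f_best_prev2)
    shrink = min(delta, threshold)  # cap the reduction at the threshold
    return max(int((1 - shrink) * N_t), MIN_POPSIZE)
```

With threshold = 0 the size never changes, recovering the standard constant-population GA as a special case.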
As can be seen from formula (1), the shape of the population curve is an exponential decrease, followed by a steady section, followed by another exponential decrease,
and this pattern continues.

2. Predetermined Exponential Decrease. Although the adaptive method produces a population curve with segments of exponential decrease, it requires computing Δf_t^best at every generation, as well as the determination of Δf^best. We now present a method which requires neither and reduces the population exponentially. The Schema Theorem [6] shows that, on average, the number of highly fit schemata increases exponentially. Based on this fact, we believe that we can reduce the population size exponentially and get results comparable to an algorithm with no reduction. To perform this reduction, the following formula is used:

    N(t) = N_0 e^{ct},  where c = ln(N_END / N_0) / (number of generations).    (2)

Here N_END denotes the population size at the end of the algorithm. It is set to 20, in agreement with the minimum population size used in the adaptive method.

3. Predetermined Linear Decrease. It is not possible to predict the shape of the exponential increase of schemata without direct and complicated calculations during the algorithm. Therefore, we have also developed a reduction method which is not exponential but instead decreases the population size linearly. This avoids decreasing the population too quickly, while still reducing the number of computations needed in a traditional genetic algorithm. The following formula determines the population size at each generation:

    N(t) = mt + N_0,  where m = (N_END - N_0) / (number of generations).    (3)

A discussion of the performance of the methods discussed above can be found in [1].

Diversifying

In addition to reducing population size over time, we have developed another method to increase the efficiency of a genetic algorithm. According to Fisher's Fundamental Theorem of Natural Selection [5], the increase in mean fitness of a population is equal to the variance in fitness.
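Both predetermined schedules can be tabulated before the run. A minimal sketch of formulas (2) and (3); the starting size, ending size, and generation count passed in below are illustrative, not the paper's experimental settings:

```python
import math

def exponential_schedule(N0, N_end, generations):
    """Population sizes N(t) = N0 * e^(c*t) for t = 0..generations, formula (2)."""
    c = math.log(N_end / N0) / generations
    return [round(N0 * math.exp(c * t)) for t in range(generations + 1)]

def linear_schedule(N0, N_end, generations):
    """Population sizes N(t) = m*t + N0 for t = 0..generations, formula (3)."""
    m = (N_end - N0) / generations  # negative slope when N0 > N_end
    return [round(m * t + N0) for t in range(generations + 1)]
```

Both schedules interpolate between N_0 and N_END, but the exponential curve sheds most of the population in the early generations, while the linear curve spreads the reduction evenly.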
In genetic algorithms, the easiest way to increase variance in fitness would be to allow every possible solution to be represented in the population. Of course, this is equivalent to an exhaustive search. We believe the next best procedure is to force the population to start with the highest variance at each position of the chromosome. Since each position is either 0 or 1, this implies that at each position there are ideally the same number of 0s and 1s across the entire population. To implement this procedure, half of the initial population is randomly generated. The other half is then generated by taking each of the chromosomes in the first half and changing each bit from 1 to 0 or from 0 to 1. We call this process diversifying. In addition to increasing variance at each position, the procedure guarantees that within one generation, recombination alone could generate the optimal solution. This does not imply that mutation is unnecessary, as selection acts on the entire string and not on individual positions. Since selection will reduce variance at each position, mutation is still required to maintain some variance.
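The diversifying initialization can be sketched directly from the description above; the function name is ours, and the population size and chromosome length in the usage comment are illustrative:

```python
import random

def diversified_population(pop_size, length):
    """Build an initial population with equal numbers of 0s and 1s at each position.

    Half the chromosomes are random; the other half are their bitwise complements,
    so every bit position starts at maximum variance. pop_size should be even.
    """
    half = [[random.randint(0, 1) for _ in range(length)]
            for _ in range(pop_size // 2)]
    complements = [[1 - bit for bit in chrom] for chrom in half]
    return half + complements

# e.g. diversified_population(50, 16): each of the 16 positions has exactly 25 ones
```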
Maximum Likelihood Estimation

Choosing the correct parameters for given data, which one believes come from a specific distribution, is a global optimization problem. In many cases, the search space of the parameters is multimodal, and finding the correct solution may be difficult for classic search/optimization techniques. Genetic algorithms provide a useful tool for exploring possible solutions.

Let f(x|θ) be a p.d.f. with x = {x_1, x_2, ..., x_n} and parameter vector θ = {θ_1, θ_2, ..., θ_k}. Let L(θ|x) be the likelihood function that we wish to maximize with respect to θ. As is commonplace, ln L is maximized instead of L. This transformation is helpful when using a GA, because the fitness values for L would lie in the interval [0, 1], which forces the fitness space to be flat and the convergence of the solution to be slow. By using the log likelihood, the fitness space expands to the interval (-∞, 0) and gives better contrast between possible solutions.

In order to use genetic algorithms for maximum likelihood estimation, the following procedure should be used. First, a bootstrap sample of size n from x is generated. Using this sample, we maximize the likelihood (or log likelihood) function using the genetic algorithm to obtain a parameter vector θ̂ that maximizes the likelihood given the bootstrap sample. We repeat this process m times to generate θ̂_1, θ̂_2, ..., θ̂_m. Using these m solutions, we construct a covariance matrix for the parameter vectors, and the overall best fit θ̂ is given by the average of the best fits found in the m runs of the genetic algorithm.

Table 1. Covariance Matrix for Airplane 7907.

         p             µ              λ
p     0.1386936      2.663881      20.01815
µ     2.6638812    304.957664    1189.76515
λ    20.0181523   1189.765153    9304.16663

Table 2. Covariance Matrix for Airplane 7909.
         p             µ              λ
p     0.09958262     2.705328       3.141576
µ     2.70532883   130.961146     181.918055
λ     3.14157588   181.918055     501.227123

An Application

To illustrate an application, we consider the aircraft data given by Proschan in [8]. The mixture inverse Gaussian distribution was fitted by Gupta and Akman in [4]. The p.d.f. of this mixture is given by

    f(x | µ, λ, p) = (λ / (2πx^3))^{1/2} exp( -λ(x - µ)^2 / (2µ^2 x) ) (1 - p + px/µ),    (4)
with parameters λ > 0, µ > 0, and 0 <= p <= 1. This produces the following log likelihood function, given a random sample X_1, X_2, ..., X_n:

    L = (n/2) ln(λ) - (n/2) ln(2π) - (3/2) Σ ln(x_i) - (λ/(2µ^2)) Σ x_i + λn/µ - (λ/2) Σ (1/x_i) + Σ ln(1 - p + p x_i/µ).    (5)

We used the data for airplane numbers 7907 and 7909 to find the optimal parameter vector (p, µ, λ). We followed the procedure outlined above, bootstrapping 100 times, and used a GA with adaptive reduction and diversification to obtain our results. The covariance matrix is given for Airplane 7907 in Table 1 and for Airplane 7909 in Table 2. Taking the average of the 100 runs gives the optimal parameter vectors for 7907 and 7909, shown in Table 3.

Table 3. Optimal values for p, µ, λ.

           p            µ             λ
7907    0.3113695   48.9713807   112.2475586
7909    0.5970175   56.6431133    71.147073

Conclusion

Using a GA in maximum likelihood estimation is a viable method of obtaining parameter estimates, especially considering the small variance attained as a result. We believe that using a GA becomes more meaningful when the likelihood surface is more rugged (and has more dimensions) than those of the usual distributions.

Acknowledgment

We would like to thank Dr. Nader Ebrahimi of Northern Illinois University for his suggestions and guidance.

References

[1] Akman, O., Hallam, J., and Akman, F. (2010). Genetic algorithms with shrinking population size. Computational Statistics, 25, 691-705.

[2] Eiben, A. E., Marchiori, E., and Valko, V. A. (2004). Evolutionary algorithms with on-the-fly population size adjustment. In Parallel Problem Solving from Nature, PPSN VIII, volume 3242 of Lecture Notes in Computer Science, 41-50. Springer.

[3] Goldberg, D. E. (1989). Genetic Algorithms in Search, Optimization, and Machine Learning. Reading, MA: Addison-Wesley.
[4] Gupta, R. and Akman, O. (1995). On the reliability studies of a weighted inverse Gaussian model. Journal of Statistical Planning and Inference, 69-83.

[5] Fisher, R. A. (1930). The Genetical Theory of Natural Selection. Clarendon Press, Oxford.

[6] Holland, J. (1975). Adaptation in Natural and Artificial Systems. The MIT Press.

[7] Lima, C. F. and Lobo, F. G. (2005). A review of adaptive population sizing schemes in genetic algorithms. In Proceedings of the 2005 Workshops on Genetic and Evolutionary Computation, 228-234. ACM.

[8] Proschan, F. (1963). Theoretical explanation of observed decreasing failure rate. Technometrics, 5, 375.

[9] Reeves, C. R. (1993). Using genetic algorithms with small populations. In Proceedings of the 5th International Conference on Genetic Algorithms, 92-97. Morgan Kaufmann.