DERIVATIVE-FREE OPTIMIZATION

Main bibliography
- J.-S. Jang, C.-T. Sun and E. Mizutani. Neuro-Fuzzy and Soft Computing: A Computational Approach to Learning and Machine Intelligence. Prentice Hall, New Jersey, 1997.
- Andries P. Engelbrecht. Computational Intelligence: An Introduction. John Wiley, Chichester, 2002.
- J. Kennedy, R. C. Eberhart and Y. Shi. Swarm Intelligence. Morgan Kaufmann Publishers, 2002.
- Michael Negnevitsky. Artificial Intelligence: A Guide to Intelligent Systems. Addison-Wesley, Pearson Education, 2002.
Optimization methods

Derivative-based optimization
- Explicit use of the derivative of the objective function
- Analytic solution possible
- Faster convergence

Derivative-free optimization
- Uses only (objective) function evaluations
- No need for extra information
- Slower convergence

Motivation
- Some problems are so complex that it is not possible to find an optimal solution.
- In these problems, it is still important to find a good feasible solution close to the optimal one.
- A heuristic method is a procedure for finding a very good feasible solution to the problem at hand.
- The procedure should be efficient enough to deal with very large problems, and is usually an iterative algorithm.
- Heuristic methods are usually tailored to a specific problem rather than to a variety of problems.
Metaheuristics
- Heuristics are specific to the problem being solved.
- Metaheuristics are general solution methods that provide a general structure and guidelines for developing a heuristic method for a particular type of problem.

Nature of metaheuristics
Example: maximize
  f(x) = 12x^5 - 975x^4 + 28000x^3 - 345000x^2 + 1800000x
subject to 0 <= x <= 31.
- The function has three local optima.
- The example is a nonconvex programming problem; f(x) is sufficiently complicated that it cannot be solved analytically.
- Simple heuristic method: conduct a local improvement procedure.
Example: objective function (figure)

Local improvement procedure
- Starts from an initial trial solution and uses a hill-climbing procedure, e.g. a gradient search procedure, the bisection method, etc.
- Converges to a local optimum and then stops, without reaching the global optimum (the outcome depends on the initialization); see the figure for a typical sequence of trial solutions.
- Drawback: the procedure converges to a local optimum, which is a global optimum only if the search begins in the neighborhood of that global optimum.
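The local improvement procedure above can be sketched as a simple hill climber on the polynomial example f(x) = 12x^5 - 975x^4 + 28000x^3 - 345000x^2 + 1800000x over [0, 31]. The step size and starting point are illustrative choices, not fixed by the slides.

```python
def f(x):
    # Example objective with three local optima on [0, 31]
    return 12*x**5 - 975*x**4 + 28000*x**3 - 345000*x**2 + 1800000*x

def hill_climb(x, step=0.01, lo=0.0, hi=31.0):
    """Move to the better neighbor until neither direction improves."""
    while True:
        candidates = [max(lo, x - step), x, min(hi, x + step)]
        best = max(candidates, key=f)
        if best == x:
            return x
        x = best
```

Starting near x = 1, the climber stops at the local optimum x = 5, although the global optimum of this function is at x = 20, illustrating the initialization dependence described above.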
Local improvement procedure (figure)

Nature of metaheuristics
- How can this drawback be overcome? What happens in large problems with many variables?
- Metaheuristic: a solution method that orchestrates the interaction between local improvement procedures and a process for escaping from local optima in a robust way.
- A trial solution generated after a local optimum may be inferior to that local optimum.
Solutions by metaheuristics (figure)

Metaheuristics
- Advantage: deals well with large, complicated problems.
- Disadvantage: no guarantee of finding an optimal solution, or even a nearly optimal one. When possible, an algorithm that can guarantee optimality should be used instead.
- Can be applied to nonlinear or integer programming, but is most commonly applied to combinatorial optimization.
Most common metaheuristics
- Simulated Annealing (SA)
- Random search
- Genetic Algorithms (GA)
- Ant Colony Optimization (ACO)
- Particle Swarm Optimization (PSO), etc.

Common characteristics
- Derivative freeness: the methods rely only on evaluations of the objective function; the search direction follows heuristic guidelines.
- Intuitive guidelines: the concepts are usually bio-inspired.
- Slowness: slower than derivative-based optimization for continuous optimization problems.
- Flexibility: any objective function is allowed (even the structure of a data-fitting model).
Common characteristics (cont.)
- Randomness: stochastic methods use random numbers to determine search directions; they may be global optimizers given enough computation time (an optimistic view).
- Analytic opacity: knowledge about these methods is based on empirical studies, due to their randomness and problem-specific nature.
- Iterative nature: stopping criteria are needed to determine when to terminate the optimization process.

Stopping criteria
Let k denote the iteration count and f_k the best objective function value obtained at count k:
- Computation time: a given amount of computation time, number of function evaluations and/or iteration count is reached.
- Optimization goal: f_k is less than a certain preset goal value.
- Minimal improvement: |f_k - f_{k-1}| is less than a preset value.
- Minimal relative improvement: |f_k - f_{k-1}| / |f_{k-1}| is less than a preset value.
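The four stopping criteria above can be combined into a single check. The threshold names and default values here are illustrative, not prescribed by the slides.

```python
def should_stop(k, f_k, f_prev, *, max_iters=1000, goal=1e-6,
                min_improve=1e-9, min_rel_improve=1e-7):
    """Return True if any of the four stopping criteria is met
    (for a minimization problem)."""
    if k >= max_iters:                      # computation budget reached
        return True
    if f_k < goal:                          # optimization goal met
        return True
    if abs(f_k - f_prev) < min_improve:     # minimal improvement
        return True
    if abs(f_k - f_prev) / abs(f_prev) < min_rel_improve:
        return True                         # minimal relative improvement
    return False
```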
GENETIC ALGORITHMS

Genetic Algorithms: motivation
- What has evolution brought us? Vision, hearing, smell, taste, touch, learning and reasoning.
- Can we emulate the evolutionary process with today's fast computers?
Genetic Algorithms
- Introduced by John Holland in 1975.
- Randomized search algorithms based on the mechanics of natural selection and genetics.
- Combine the principle of natural selection through survival of the fittest with randomized search.
- Search efficiently in large spaces and are robust with respect to the complexity of the search problem.
- Use a population of solutions instead of searching only one solution at a time, so the algorithms are easily parallelized.

Basic elements
- A candidate solution is encoded as a string of characters, in binary or real representation. The bit string is called a chromosome.
- The solution represented by a chromosome is an individual. A number of individuals form a population.
- The population is updated iteratively; each iteration is called a generation.
- The objective function is called the fitness function; the fitness value is maximized.
- Multiple solutions are evaluated in parallel.
Definitions
- Population: a collection of solutions for the studied (optimization) problem.
- Individual: a single solution in a GA.
- Chromosome: the (bit string) representation of a single solution.
- Gene: part of a chromosome, usually representing a variable that characterizes part of the solution.
- Encoding: conversion of a solution to its equivalent bit string representation (chromosome).
- Decoding: conversion of a chromosome to its equivalent solution.
- Fitness: a scalar value denoting the suitability of a solution.
GA terminology (generation t)

  chromosome   solution (x, y)   fitness
  1 0 0 0      (2, 0)            4
  0 1 0 1      (1, 1)            2
  0 0 1 1      (0, 3)            3
  0 1 1 0      (1, 2)            3
  0 1 0 1      (1, 1)            2

Each row is an individual; the whole set is the population; each bit position is a gene.

Basic genetic algorithm
- Initialization: start with an initial population of solutions, e.g. generated randomly, and evaluate the fitness of each individual.
- Iteration:
  1. Select some members of the population to become parents.
  2. Cross the genetic material of the parents in a crossover operation; mutation can occur in some genes.
  3. Take care of infeasible solutions by making them feasible.
  4. Evaluate the fitness of the new members, including the clones.
- Stopping rule: stop after a fixed number of iterations, a fixed number of iterations without improvement, etc.
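The initialization/iteration/stopping scheme above can be sketched as a minimal binary GA. The test problem (maximize f(x) = x^2 over 5-bit strings, i.e. x in 0..31) and the population size, rates and generation count are illustrative choices, not from the slides.

```python
import random

def fitness(bits):
    # Decode the 5-bit chromosome and evaluate f(x) = x^2
    return int(bits, 2) ** 2

def select(pop):
    """Roulette-wheel selection of one parent."""
    total = sum(fitness(c) for c in pop)
    pick = random.uniform(0, total)
    acc = 0.0
    for c in pop:
        acc += fitness(c)
        if acc >= pick:
            return c
    return pop[-1]

def crossover(a, b):
    p = random.randint(1, len(a) - 1)       # one-point crossover
    return a[:p] + b[p:], b[:p] + a[p:]

def mutate(c, pm=0.01):
    # Flip each bit independently with probability pm
    return "".join(b if random.random() > pm else ("1" if b == "0" else "0")
                   for b in c)

def ga(pop_size=20, generations=30):
    pop = ["".join(random.choice("01") for _ in range(5))
           for _ in range(pop_size)]
    for _ in range(generations):
        nxt = []
        while len(nxt) < pop_size:
            c1, c2 = crossover(select(pop), select(pop))
            nxt += [mutate(c1), mutate(c2)]
        pop = nxt[:pop_size]
    return max(pop, key=fitness)

random.seed(0)
best = ga()   # best chromosome of the final generation
```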
Genetic algorithm (block diagram): define the initial population and the fitness function; assess fitness; select parents; apply crossover and mutation to produce children; keep the best individuals; increment the generation and repeat.

Termination criteria
- Number of generations (restart the GA if the best solution is not satisfactory)
- Fitness of the best individual
- Average fitness of the population
- Difference of best fitness (across generations)
- Difference of average fitness (across generations)
Reproduction
- Three steps: selection, crossover, mutation.
- In GAs, the population size is often kept constant.
- The user is free to choose which methods to use for each of the three steps.

Roulette-wheel selection

  individual   fitness   probability
  01100        34        0.16
  10001        48        0.23
  11010        23        0.11
  00111        15        0.07
  11000        41        0.19
  10110        50        0.24

Sum of fitness values = 211. Cumulative probabilities: 0.16, 0.39, 0.50, 0.57, 0.76, 1.00.
Example selection result: 01100, 10001, 10001, 11000, 10110, 10110.
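The cumulative probabilities in the roulette-wheel example above (fitness values 34, 48, 23, 15, 41, 50, summing to 211) can be reproduced directly:

```python
fitness = [34, 48, 23, 15, 41, 50]
total = sum(fitness)                 # 211
cumulative = []
acc = 0.0
for fit in fitness:
    acc += fit / total               # selection probability of this individual
    cumulative.append(round(acc, 2))
print(cumulative)                    # cumulative wheel positions as on the slide
```

An individual is then selected by drawing a uniform random number in [0, 1] and taking the first individual whose cumulative probability is at least that number.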
Roulette-wheel selection (pie chart of the selection probabilities above)

Tournament selection
- Select pairs randomly; the fitter individual wins:
  - deterministic
  - probabilistic (constant probability of winning, or probability of winning depends on fitness)
- It is also possible to combine tournament selection with roulette-wheel selection.
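A deterministic binary tournament, the simplest variant described above, can be sketched as follows. The population and fitness function are illustrative.

```python
import random

def tournament(pop, fit):
    """Pick two distinct individuals at random; the fitter one wins."""
    a, b = random.sample(pop, 2)
    return a if fit(a) >= fit(b) else b

random.seed(1)
pop = [3, 7, 1, 9, 4]                       # toy population of integers
winner = tournament(pop, fit=lambda x: x)   # fitness = the value itself
```

The probabilistic variant would instead let the less fit individual win with some probability, either constant or dependent on the fitness difference.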
Crossover
- Exchange parts of the chromosomes with a crossover probability (p_c is about 0.8).
- Crossover points are selected randomly.

One-point crossover (crossover point after bit 7):
  parents:   0101111|1011    0111010|1110
  children:  0101111|1110    0111010|1011

N-point crossover
- Select N points and exchange multiple parts.

Two-point crossover (crossover points after bits 4 and 7):
  parents:   0101|111|1011    0111|010|1110
  children:  0101|010|1011    0111|111|1110
Uniform crossover
- Exchange bits using a randomly generated mask (bits are exchanged where the mask is 1):
  mask:      01010010011
  parents:   01011111011    01110101110
  children:  01011101010    01110111111

Mutation
- Crossover is used to search the solution space; mutation is needed to escape from local optima and introduces genetic diversity.
- Mutation is rare (p_m is about 0.01).

Uniform mutation (bit 7 mutated):
  before:    01011111110
  after:     01011101110
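The three crossover schemes and bit-flip mutation can be written as short string operations on the 11-bit parent chromosomes used in the examples above:

```python
import random

p1 = "01011111011"
p2 = "01110101110"

def one_point(a, b, point):
    """Swap the tails after the crossover point."""
    return a[:point] + b[point:], b[:point] + a[point:]

def two_point(a, b, i, j):
    """Swap the middle segment between the two crossover points."""
    return a[:i] + b[i:j] + a[j:], b[:i] + a[i:j] + b[j:]

def uniform(a, b, mask):
    """Exchange bits wherever the mask bit is '1'."""
    c1 = "".join(y if m == "1" else x for x, y, m in zip(a, b, mask))
    c2 = "".join(x if m == "1" else y for x, y, m in zip(a, b, mask))
    return c1, c2

def mutate(c, pm=0.01):
    """Flip each bit independently with probability pm."""
    return "".join(b if random.random() > pm else ("1" if b == "0" else "0")
                   for b in c)
```

With the crossover points and mask from the slide examples, these reproduce the offspring shown above.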
GA iteration
Current generation (10010110, 01100010, 10100100, 10011001, 01111101, ...) reproduces via selection, elitism, crossover and mutation into the next generation (10010110, 01100010, 10100100, 10011101, 01111001, ...).

Spaces in a GA iteration
- Gene space: the chromosomes of generation N (e.g. 01100, 10001, 11010, 00111, 11000, 10110).
- Problem space: the decoded solutions (12, 17, 26, 7, 24, 22), obtained by (de)coding.
- Fitness space: the fitness values (34, 48, 23, 15, 41, 50), obtained through the fitness function.
- The genetic operators act on the gene space to produce generation N+1 (01011, 10111, 11001, 00011, 11010, 01010).
Encoding and decoding
- Chromosomes represent solutions for a problem in a selected encoding.
- Solutions must be encoded into their chromosome representation, and chromosomes must be decoded to evaluate their fitness.
- The success of a GA can depend on the coding used; the coding may even change the nature of the problem.
- Common coding methods: simple binary coding, Gray coding (binary), real-valued coding (requires special genetic operators).

Handling constraints
- Explicit, via the fitness function: penalty function, barrier function, or setting the fitness of infeasible solutions to zero (the search may be very inefficient due to infeasible solutions).
- Implicit (preferred method): a special encoding so that the GA always searches among feasible solutions; this gives a smaller search space, but it is an ad hoc method and may be difficult to find.
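The penalty-function approach to explicit constraint handling can be sketched as follows; infeasible solutions keep a fitness value, but it is reduced in proportion to the constraint violation. The penalty weight is an illustrative choice.

```python
def penalized_fitness(f, constraints, x, weight=1000.0):
    """Fitness for maximization with constraints g(x) <= 0.

    Each g in `constraints` should be <= 0 when x is feasible;
    positive values are violations and are penalized.
    """
    violation = sum(max(0.0, g(x)) for g in constraints)
    return f(x) - weight * violation
```

For example, maximizing f(x) = x subject to x <= 5 uses g(x) = x - 5: feasible points are unaffected, while x = 6 is penalized by the weight times the violation of 1.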
Questions in genetic algorithms
1. What is the encoding scheme?
2. What should the population size be?
3. How should the individuals of the current population be selected to become parents?
4. How should the genes of the children be derived from the genes of the parents?
5. How should mutations occur in the genes of the children?
6. Which stopping rule should be used?

Example: maximization of the peaks function using GAs
  z = f(x, y) = 3(1 - x)^2 e^{-x^2 - (y+1)^2} - 10(x/5 - x^3 - y^5) e^{-x^2 - y^2} - (1/3) e^{-(x+1)^2 - y^2}
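The objective of the example is the standard MATLAB "peaks" test function, which can be evaluated directly:

```python
import math

def peaks(x, y):
    """The MATLAB peaks function used as the GA fitness landscape."""
    return (3 * (1 - x)**2 * math.exp(-x**2 - (y + 1)**2)
            - 10 * (x / 5 - x**3 - y**5) * math.exp(-x**2 - y**2)
            - (1 / 3) * math.exp(-(x + 1)**2 - y**2))
```

Over the search domain [-3, 3] x [-3, 3] the function has several local optima, with its global maximum of roughly 8.1 near (x, y) = (0, 1.58), which is what the GA in the example is expected to locate.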
Example: derivatives of the peaks function (figure)

Example settings
- Search domain: [-3, 3] x [-3, 3]
- 8-bit binary coding per variable; search space size = 2^8 x 2^8 = 65536
- Each generation has 20 individuals
- Fitness value = value of the peaks function minus the minimum function value across the population
- One-point crossover scheme with a crossover rate of 1.0
- Uniform mutation with a mutation rate of 0.01
- Elitism of the best 2 individuals across generations
- 30 generations
Example GA process (figures): initial population, 5th generation, 10th generation.

Example performance profile (figure): best, average and poorest fitness over the 30 generations.
Real-coded genetic algorithms
- Most problems concern the optimization of real variables.
- Binary encoding of real variables: interval discretization, or mantissa-exponent representation.
- Searching for real variables encoded in binary is not efficient.
- Real-coded GAs use special mutation and crossover operators.

Real-coded GAs (notation)
- Each chromosome consists of a string of real numbers.
- r ∈ [0,1] is a random number (uniform).
- t = 0, 1, ..., T is the generation number.
- s_v and s_w are the chromosomes selected for operation.
- k ∈ {1, 2, ..., N} is the position of an element in the chromosome.
- v_k^min and v_k^max are the lower and upper bounds of the parameter encoded by element k.
Crossover operators

Simple arithmetic crossover:
  s_v^{t+1} = (v_1, ..., v_k, w_{k+1}, ..., w_n)
  s_w^{t+1} = (w_1, ..., w_k, v_{k+1}, ..., v_n)

Whole arithmetic crossover:
  s_v^{t+1} = r s_v^t + (1 - r) s_w^t
  s_w^{t+1} = r s_w^t + (1 - r) s_v^t

Heuristic crossover:
  s_v^{t+1} = s_v^t + r (s_w^t - s_v^t)
  s_w^{t+1} = s_w^t + r (s_v^t - s_w^t)

Mutation operators
- Uniform mutation: a randomly selected element v_k is replaced by a random value in the range [v_k^min, v_k^max].
- Multiple uniform mutation: uniform mutation of n randomly selected elements, n ∈ {1, 2, ..., N}.
- Gaussian mutation: all elements are mutated,
  s_v^{t+1} = (v_1 + f_1, ..., v_k + f_k, ..., v_n + f_n)
  where each f_k, k = 1, 2, ..., N, is a random number drawn from a Gaussian distribution.
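The real-coded operators above translate directly into list operations. Here r is passed explicitly rather than drawn inside the functions, so the operators are deterministic and easy to test; the mutation spread sigma is an illustrative choice.

```python
import random

def simple_arithmetic(v, w, k):
    """Swap the tails of the two real-valued chromosomes after position k."""
    return v[:k] + w[k:], w[:k] + v[k:]

def whole_arithmetic(v, w, r):
    """Blend the two parents element-wise with weight r."""
    c1 = [r * a + (1 - r) * b for a, b in zip(v, w)]
    c2 = [r * b + (1 - r) * a for a, b in zip(v, w)]
    return c1, c2

def heuristic(v, w, r):
    """Move each parent a fraction r towards the other parent."""
    c1 = [a + r * (b - a) for a, b in zip(v, w)]
    c2 = [b + r * (a - b) for a, b in zip(v, w)]
    return c1, c2

def gaussian_mutation(v, sigma=0.1):
    """Perturb every element with zero-mean Gaussian noise."""
    return [a + random.gauss(0.0, sigma) for a in v]
```

In practice the offspring would also be clipped to the bounds [v_k^min, v_k^max] of each element.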
Issues for evolution
- Genetic diversity of the population
- Population size
- Selection strategy (policy)
- Evolutionary pressure
- Parameters of the operators
- Co-evolution

Example: GA for training models
- Real-coded GAs can optimize fuzzy and neural models.
- Example, optimization of fuzzy models: M. Setnes and H. Roubos. GA-fuzzy modeling and classification: complexity and performance. IEEE Transactions on Fuzzy Systems, 8(5):509-522, Oct. 2000.
- The real-coded GA is subjected to constraints that maintain the semantic properties of the rules.
- The technique is applied to a wine data classification problem.
Codification and parameters
- Triangular membership functions are used (see figure).
- A fuzzy model with M rules is represented in a chromosome.
- The population contains L individuals.

GA fitness and operators
- Performance is measured in terms of the mean squared error (MSE):
  J = (1/K) sum_{k=1}^{K} (y_k - ŷ_k)^2
- Roulette-wheel selection method with selection chance P_l / sum_{k=1}^{L} P_k, where P_l = 1/J_l^2, l ∈ {1, ..., L}.
- Chance of crossover: 95%. Chance of mutation: 5%.
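The MSE fitness and the selection chances can be sketched as follows. The inverse-squared-error weighting P_l = 1/J_l^2 follows the reconstruction of the garbled slide formula and should be treated as an assumption.

```python
def mse(y, y_hat):
    """Mean squared error J over K samples."""
    return sum((a - b) ** 2 for a, b in zip(y, y_hat)) / len(y)

def selection_chances(errors):
    """Roulette chances: assumed weighting P_l = 1/J_l^2, normalized."""
    weights = [1.0 / j**2 for j in errors]
    total = sum(weights)
    return [w / total for w in weights]
```

Lower-error individuals thus get a larger share of the roulette wheel, e.g. errors (1.0, 2.0) yield chances (0.8, 0.2).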
Genetic training algorithm
- Create and simulate the initial population:
  - create the initial chromosome from the initial rule base
  - compute the constraint vectors v_k^min and v_k^max
  - create the rest of the initial population
- Repeat the genetic optimization for t = 1, 2, ..., T:
  - evaluate the fitness of the chromosomes
  - select chromosomes for operation and deletion
  - create the next generation: operate on the selected chromosomes and substitute them by their offspring
- Select the best chromosome (solution) from the final generation.

Application: wine classification
- The wine data contains the chemical analysis of 178 wines derived from three different cultivars.
- 13 continuous attributes are available for classification: alcohol, malic acid, ash, alcalinity of ash, magnesium, total phenols, flavanoids, nonflavanoid phenols, proanthocyanins, color intensity, hue, OD280/OD315 of diluted wines, and proline (see figure).
- The data set has 59 class 1 instances, 71 class 2 instances and 48 class 3 instances.
Attributes (figure)

Optimized membership functions (figure)
Rules and GA convergence (figure)