Identificazione e Controllo Intelligente
Optimization Algorithms: Part A - General Aspects and Classical Algorithms
David Naso, A.A. 2006-2007

Search and optimization

Problem: let Θ be the domain of allowable values for a vector θ. Find the value of θ ∈ Θ that minimizes a scalar-valued loss function L(θ).

Such a problem can be encountered in all areas: engineering, the physical sciences, business, or medicine. Doing the most with the least is the essence of the optimization objective.
Local vs global optimization (1/2)

[Figure: two loss functions L(θ), each marking a local minimum θ_local and the global minimum θ_global]

Local vs global optimization (2/2)

Due to the limitations of almost all optimization algorithms, it is usually only possible to approach a local optimum. A local minimum may still be a fully acceptable solution given the resources (human time, money, computer time, ...) available to be spent on the optimization. Some algorithms are nevertheless sometimes able to find global solutions (e.g., stochastic approximation, simulated annealing, genetic algorithms).
Stochastic vs deterministic optimization

Optimization is stochastic when:
1. there is noise in the loss function measurements, and/or
2. a random choice is used to select the search direction.

These hypotheses contrast with standard deterministic optimization, which assumes perfect information about the loss function (steepest descent) and its derivatives (Newton-Raphson). In most practical problems such information is not available, and deterministic algorithms are inappropriate.

No free lunch theorems

Wolpert and Macready (1997): an algorithm that is effective on one class of problems is guaranteed to be ineffective on another class. The theorem applies to problems with a finite (but arbitrarily large) number of options. Just to get a feel for the theorem, consider the needle-in-a-haystack problem: no search algorithm can beat blind random search.
Direct search methods

Examples: random search, the Nelder-Mead algorithm. These methods:
- require only loss function measurements
- are versatile and broadly applicable
- are easy to implement
- have a long record of practical efficiency
- are useful when modest precision is required in the solution

Gradient-based stochastic algorithms

Examples: the stochastic gradient algorithm, back-propagation.
- the loss function is assumed to be noisy and nonlinear
- noisy measurements of the loss function gradient are needed
- countless applications in the last 50 years (neural network training, image restoration)
Many possible real-world applications, but the method needs gradient measurements, as in the sketch below.
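As a minimal sketch of the stochastic gradient update (the quadratic loss, noise level, step size, and iteration count are illustrative assumptions, not part of the course material):

    % Stochastic gradient descent on L(theta) = theta'*theta,
    % using noisy gradient measurements g(theta) + noise.
    theta = [2; -3];                         % initial guess
    a = 0.1;                                 % step size (gain)
    for k = 1:200
        g_noisy = 2*theta + 0.1*randn(2,1);  % noisy gradient measurement
        theta = theta - a*g_noisy;           % stochastic gradient update
    end
    disp(theta')                             % close to the minimizer [0 0]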
Gradient-free stochastic algorithms

Examples: stochastic approximation, the finite-difference algorithm.
- only noisy loss function measurements are available
- the gradient is not calculated but approximated from loss function measurements
- not efficient in high-dimensional problems
Simultaneous perturbation stochastic approximation (SPSA) reduces the number of loss function measurements; see the sketch after the next slide.

Global search algorithms

Examples: the SPSA method, annealing-type algorithms, genetic algorithms.
- capable of solving complex search problems
- performance depends on configuration parameters
In real-world optimization problems, the best trade-off among effectiveness, simplicity, speed of convergence, and noise immunity has to be pursued. Better a rough answer to the right question than an exact answer to the wrong one.
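The two gradient approximations just mentioned can be contrasted in a short sketch; the loss function, noise level, dimension p, and perturbation size c below are illustrative assumptions. The finite-difference estimate needs 2p noisy loss measurements per gradient, while SPSA needs only 2, regardless of p.

    % Gradient approximation from noisy loss measurements y = L(theta) + noise
    L = @(th) sum(th.^2);                    % illustrative loss
    y = @(th) L(th) + 0.01*randn;            % noisy measurement
    p = 5; theta = ones(p,1); c = 0.1;       % dimension, point, perturbation

    % Finite-difference approximation: 2*p loss measurements
    g_fd = zeros(p,1);
    for i = 1:p
        e = zeros(p,1); e(i) = 1;            % i-th coordinate direction
        g_fd(i) = (y(theta + c*e) - y(theta - c*e)) / (2*c);
    end

    % Simultaneous perturbation (SPSA): only 2 loss measurements
    delta = 2*(rand(p,1) > 0.5) - 1;         % random +/-1 perturbation
    g_sp = (y(theta + c*delta) - y(theta - c*delta)) ./ (2*c*delta);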
Classical gradient-based optimization

Classical optimization setting of interest: find θ that minimizes the differentiable loss L(θ), subject to θ satisfying the relevant constraints (θ ∈ Θ).

Standard nonlinear unconstrained optimization setting: find θ* such that
g(θ*) = ∂L/∂θ |θ=θ* = 0

[Figure: loss curve L(θ) with the minimizer θ* where ∂L/∂θ = 0]

Constrained vs. unconstrained

The ∂L/∂θ = 0 setting is usually associated with unconstrained optimization, but most real problems include constraints. Many constrained problems can nevertheless be converted to the ∂L/∂θ = 0 form: penalty functions, projection methods, ad hoc methods and common sense, etc. (a penalty-function sketch follows below).

Considerations for constraints: hard vs. soft, explicit vs. implicit.
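To illustrate the penalty-function conversion, the following sketch replaces an equality constraint with a quadratic penalty; the loss, the constraint, and the penalty weight r are assumptions made for the example (fminsearch is MATLAB's built-in downhill simplex routine, discussed later in these slides):

    % Quadratic penalty: minimize L(theta) subject to theta(1) + theta(2) = 1
    L    = @(th) th(1)^2 + 2*th(2)^2;        % illustrative loss
    pen  = @(th) (th(1) + th(2) - 1)^2;      % squared constraint violation
    r    = 100;                              % penalty weight
    Lpen = @(th) L(th) + r*pen(th);          % unconstrained surrogate
    theta = fminsearch(Lpen, [0; 0])         % near-feasible minimizer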
Gradients and Hessians

Gradients and Hessians are often used directly in deterministic methods, and indirectly in stochastic methods; exact gradients and Hessians are generally not available in stochastic optimization.

The gradient g(θ) of L(θ) is the vector of first partial derivatives:
g(θ) = ∂L/∂θ
The Hessian H(θ) of L(θ) is the matrix of second partial derivatives:
H(θ) = ∂²L/(∂θ ∂θᵀ)

The Hessian is useful for characterizing the shape of L and for providing the search direction of the (deterministic) Newton-Raphson algorithm.

Rationale behind the steepest descent update direction for the i-th element of θ
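Both update directions can also be illustrated numerically; here is a minimal sketch on a quadratic loss, where the matrix A, starting point, step size, and iteration count are assumptions for the example. Steepest descent moves along -g(θ); Newton-Raphson moves along -H(θ)⁻¹ g(θ).

    % Steepest descent vs. Newton-Raphson on L(theta) = 0.5*theta'*A*theta
    A = [4 0; 0 1];                          % Hessian of the quadratic loss
    g = @(th) A*th;                          % gradient
    theta_sd = [1; 1]; theta_nr = [1; 1];    % common starting point
    a = 0.1;                                 % fixed step size for steepest descent
    for k = 1:50
        theta_sd = theta_sd - a*g(theta_sd); % 1st-order update
        theta_nr = theta_nr - A\g(theta_nr); % 2nd-order update (exact in one step here)
    end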
1st-order (steepest descent) vs 2nd-order (Newton-Raphson) directions

[Figure: comparison of the steepest descent and Newton-Raphson search directions]

Variants

Derivative-based search algorithms differ in how the Hessian is used and in the technique with which the step size is determined once the search direction has been fixed, as in the sketch below. [Demo: derivative-based optimization]
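One common step-size technique is backtracking: start from a trial step along the chosen direction and halve it until the loss actually decreases. A minimal sketch, with an illustrative loss and starting point (not taken from the course demo):

    % Backtracking step-size selection along the steepest descent direction
    L = @(th) th(1)^2 + 10*th(2)^2;          % illustrative loss
    g = @(th) [2*th(1); 20*th(2)];           % its gradient
    theta = [1; 1];                          % current iterate
    d = -g(theta);                           % search direction
    a = 1;                                   % initial trial step
    while L(theta + a*d) >= L(theta)         % shrink until the loss improves
        a = a/2;
    end
    theta = theta + a*d;                     % accept the step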
Nelder and Mead: simplex search

A simplex is a set of n+1 points in n-dimensional space: a triangle in a 2D space, a tetrahedron in a 3D space.

Concept of downhill simplex search: repeatedly replace the highest point with a lower one. Consecutive successful replacements lead to the enlargement of the simplex; consecutive unsuccessful replacements lead to the shrinkage of the simplex.

Downhill simplex search flowchart

[Figure: flowchart of the downhill simplex search]
Steps of the nonlinear simplex algorithm

[Figure: the simplex operations drawn on triangles with vertices θ_max, θ_2max, θ_min and centroid θ_cent]
- Reflection: θ_max is reflected through θ_cent to give θ_refl.
- Expansion, when L(θ_refl) < L(θ_min): the reflection is extended to θ_exp.
- Outside contraction, when L(θ_refl) < L(θ_max): θ_cont is placed between θ_cent and θ_refl.
- Inside contraction, when L(θ_refl) ≥ L(θ_max): θ_cont is placed between θ_cent and θ_max.
- Shrink, after a failed contraction when L(θ_refl) < L(θ_max): the simplex contracts toward θ_min.

Downhill simplex search example: find the minimum of the peaks function
z = f(x, y) = 3*(1-x)^2*exp(-(x^2) - (y+1)^2) - 10*(x/5 - x^3 - y^5)*exp(-x^2 - y^2) - 1/3*exp(-(x+1)^2 - y^2)
MATLAB file: go_simp.m
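The demo go_simp.m is not reproduced here; a comparable run can be sketched with fminsearch, MATLAB's built-in implementation of the Nelder-Mead downhill simplex, where the starting point v0 is an assumption for illustration:

    % Downhill simplex (Nelder-Mead) on the peaks function via fminsearch
    f = @(v) 3*(1-v(1))^2*exp(-v(1)^2 - (v(2)+1)^2) ...
           - 10*(v(1)/5 - v(1)^3 - v(2)^5)*exp(-v(1)^2 - v(2)^2) ...
           - 1/3*exp(-(v(1)+1)^2 - v(2)^2);
    v0 = [0; -1];                            % illustrative starting point
    [vmin, fmin] = fminsearch(f, v0)         % local minimum of peaks near v0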
Random hillclimbing

Properties: intuitive and simple. Analogy: getting down to a valley blindfolded. Two heuristics are used: a reverse step and a bias direction.

Random hillclimbing flowchart:
1. Select a random step dx.
2. If f(x + b + dx) < f(x): set x = x + b + dx and b = 0.2*b + 0.4*dx.
3. Otherwise, if f(x + b - dx) < f(x): set x = x + b - dx and b = b - 0.4*dx.
4. Otherwise: set b = 0.5*b.
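The flowchart translates almost line-for-line into code; in the minimal sketch below, the quadratic loss, perturbation scale, and iteration count are illustrative assumptions:

    % Random hillclimbing with reverse-step and bias heuristics
    f = @(x) sum(x.^2);                      % illustrative loss
    x = [2; -3]; b = zeros(2,1);             % current point and bias vector
    for k = 1:500
        dx = 0.5*randn(2,1);                 % random step
        if f(x + b + dx) < f(x)              % forward step succeeds
            x = x + b + dx;  b = 0.2*b + 0.4*dx;   % reinforce bias toward dx
        elseif f(x + b - dx) < f(x)          % reverse step succeeds
            x = x + b - dx;  b = b - 0.4*dx;       % bias away from dx
        else
            b = 0.5*b;                       % both fail: shrink the bias
        end
    end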
Random search example: find the minimum of the peaks function
z = f(x, y) = 3*(1-x)^2*exp(-(x^2) - (y+1)^2) - 10*(x/5 - x^3 - y^5)*exp(-x^2 - y^2) - 1/3*exp(-(x+1)^2 - y^2)
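Reusing the hillclimbing loop above on the peaks function only requires swapping in the loss; the starting point, perturbation scale, and iteration count remain illustrative assumptions:

    % Random hillclimbing on the peaks function
    f = @(v) 3*(1-v(1))^2*exp(-v(1)^2 - (v(2)+1)^2) ...
           - 10*(v(1)/5 - v(1)^3 - v(2)^5)*exp(-v(1)^2 - v(2)^2) ...
           - 1/3*exp(-(v(1)+1)^2 - v(2)^2);
    x = [0; 0]; b = zeros(2,1);              % starting point and bias
    for k = 1:2000
        dx = 0.2*randn(2,1);                 % random step
        if f(x + b + dx) < f(x)
            x = x + b + dx;  b = 0.2*b + 0.4*dx;
        elseif f(x + b - dx) < f(x)
            x = x + b - dx;  b = b - 0.4*dx;
        else
            b = 0.5*b;
        end
    end
    disp(x')                                 % a (possibly local) minimum of peaks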