Numerical Optimization: Introduction and gradient-based methods

Numerical Optimization: Introduction and gradient-based methods
Master 2 Recherche LRI, Apprentissage Statistique et Optimisation
Anne Auger, Inria Saclay-Ile-de-France, November 2011
http://tao.lri.fr/tiki-index.php?page=courses

Outline

1 Numerical optimization
  A few examples: data fitting and regression, machine learning, black-box optimization
  Different notions of optimum
  Typical difficulties in optimization
  Deterministic vs stochastic, local versus global methods

2 Mathematical tools and optimality conditions
  First order differentiability and gradients
  Second order differentiability and Hessian
  Optimality conditions for unconstrained optimization

3 Gradient based optimization algorithms
  Root finding methods (1-D optimization)
  Relaxation algorithm
  Descent methods: gradient descent, Newton descent, BFGS
  Trust region methods

Numerical optimization

Numerical or continuous optimization, unconstrained case: optimize a function whose parameters are continuous (live in R^n),

  min_{x ∈ R^n} f(x)

n, the dimension of the problem, is the dimension of the Euclidean vector space R^n.

Maximization vs minimization: maximizing f is the same as minimizing −f.

Numerical optimization / A few examples

Analytical functions. Convex quadratic function:

  f(x) = (1/2) x^T A x + b^T x + c                    (1)
       = (1/2) (x − x_0)^T A (x − x_0) + c'           (2)

where A ∈ R^{n×n} is symmetric positive definite, b ∈ R^n, c ∈ R (and c' is another constant).

Exercise: Express x_0 in (2) as a function of A and b. Express the minimum of f. For n = 2, plot the level sets of a convex-quadratic function, where level sets are defined as the sets L_c = {x ∈ R^n : f(x) = c}.
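
A minimal sketch for the plotting part of the exercise, assuming Python with numpy and matplotlib; the particular A, b and c below are arbitrary illustrative values, not from the course.

```python
# Level sets of a convex-quadratic f(x) = 1/2 x^T A x + b^T x + c for n = 2.
import numpy as np
import matplotlib.pyplot as plt

A = np.array([[3.0, 1.0], [1.0, 2.0]])   # symmetric positive definite (illustrative)
b = np.array([1.0, -1.0])
c = 0.5

def f(x1, x2):
    # expand 1/2 x^T A x + b^T x + c component-wise so it works on a grid
    return 0.5 * (A[0, 0] * x1**2 + 2 * A[0, 1] * x1 * x2 + A[1, 1] * x2**2) \
           + b[0] * x1 + b[1] * x2 + c

x1, x2 = np.meshgrid(np.linspace(-3, 3, 200), np.linspace(-3, 3, 200))
plt.contour(x1, x2, f(x1, x2), levels=15)   # level sets L_c = {x : f(x) = c}
x0 = -np.linalg.solve(A, b)                 # the minimizer, x_0 = -A^{-1} b
plt.plot(x0[0], x0[1], 'r*')
plt.gca().set_aspect('equal')
plt.show()
```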

Numerical optimization / A few examples

Data fitting, data calibration. Objective: given a sequence of data points (x_i, y_i) ∈ R^p × R, i = 1, ..., N, find a model y = f(x) that explains the data (experimental measurements in biology, chemistry, ...).

In general, one chooses a parametric model or family of functions (f_θ)_{θ ∈ R^n}: expertise is used for choosing the model, or only simple models (linear, quadratic) are affordable. One then tries to find the parameter θ ∈ R^n that fits the data best.

Fitting best to the data: minimize the quadratic error

  min_{θ ∈ R^n} Σ_{i=1}^N |f_θ(x_i) − y_i|²

Numerical optimization / A few examples

Optimization and machine learning. (Simple) linear regression: given a set of data (examples) {y_i, x_i^1, ..., x_i^p}_{i=1...N}, with X_i = (x_i^1, ..., x_i^p)^T, solve

  min_{w ∈ R^p, β ∈ R} Σ_{i=1}^N |w^T X_i + β − y_i|²,   i.e. ‖Xw − y‖² in matrix form (absorbing the intercept β).

This is the same as data fitting with a linear model, i.e. f_{(w,β)}(x) = w^T x + β, θ = (w, β).
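
As a sketch of how this least-squares problem can be solved in practice (assuming Python with numpy; the synthetic data below is purely illustrative), the intercept β can be absorbed into the design matrix and the problem solved with lstsq:

```python
# min_{w, beta} sum_i |w^T X_i + beta - y_i|^2 solved via numpy's least squares.
import numpy as np

rng = np.random.default_rng(0)
N, p = 50, 3
X = rng.normal(size=(N, p))                       # synthetic examples
w_true, beta_true = np.array([1.0, -2.0, 0.5]), 0.3
y = X @ w_true + beta_true + 0.01 * rng.normal(size=N)

X1 = np.hstack([X, np.ones((N, 1))])              # absorb the intercept beta
theta, *_ = np.linalg.lstsq(X1, y, rcond=None)
w_hat, beta_hat = theta[:p], theta[p]
print(w_hat, beta_hat)                            # close to w_true, beta_true
```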

Numerical optimization / A few examples

Optimization and machine learning (cont.) Clustering: k-means. Given a set of observations (x_1, ..., x_p), where each observation is an n-dimensional real vector, k-means clustering aims to partition the p observations into k sets (k ≤ p), S = {S_1, S_2, ..., S_k}, so as to minimize the within-cluster sum of squares:

  minimize_{µ_1, ..., µ_k} Σ_{i=1}^k Σ_{x_j ∈ S_i} ‖x_j − µ_i‖²

Numerical optimization / A few examples

Black-box optimization: optimization of well placement (illustration).

Numerical optimization / Different notions of optimum

Different notions of optimum: local versus global minimum.
  local minimum x*: for all x in a neighborhood of x*, f(x) ≥ f(x*)
  global minimum x*: for all x, f(x) ≥ f(x*)
  essential infimum of a function: given a measure µ on R^n, the essential infimum of f is defined as
    ess inf f = sup{b ∈ R : µ({x : f(x) < b}) = 0}
  important to keep in mind in the context of stochastic optimization algorithms

Numerical optimization / Typical difficulties in optimization

What makes a function difficult to solve?
  ruggedness: non-smooth, discontinuous, multimodal, and/or noisy function
  dimensionality: (considerably) larger than three
  non-separability: dependencies between the objective variables
  ill-conditioning: a narrow ridge
(figure: cut from a 3-D example, solvable with an evolution strategy)

Numerical optimization / Typical difficulties in optimization

Curse of dimensionality. The term "curse of dimensionality" (Richard Bellman) refers to problems caused by the rapid increase in volume associated with adding extra dimensions to a (mathematical) space.

Example: consider placing 100 points onto a real interval, say [0, 1]. To get similar coverage, in terms of distance between adjacent points, of the 10-dimensional space [0, 1]^10 would require 100^10 = 10^20 points. The 100 points now appear as isolated points in a vast empty space.

Consequently, a search policy (e.g. exhaustive search) that is valuable in small dimensions might be useless in moderate or large dimensional search spaces.

Numerical optimization / Typical difficulties in optimization

Separable problems. Definition (separable problem): a function f is separable if

  argmin_{(x_1,...,x_n)} f(x_1, ..., x_n) = ( argmin_{x_1} f(x_1, ...), ..., argmin_{x_n} f(..., x_n) )

It follows that f can be optimized in a sequence of n independent 1-D optimization processes (see the sketch below).

Example: additively decomposable functions, f(x_1, ..., x_n) = Σ_{i=1}^n f_i(x_i).

(figure: level sets of the Rastrigin function)
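
A minimal sketch of this property, assuming Python with scipy; the three 1-D functions f_i below are arbitrary illustrative choices.

```python
# An additively decomposable (hence separable) function f(x) = sum_i f_i(x_i)
# can be minimized by n independent 1-D minimizations.
import numpy as np
from scipy.optimize import minimize_scalar

f_i = [lambda x: (x - 1.0)**2,
       lambda x: (x + 2.0)**2 + 0.1 * x**4,
       lambda x: np.cosh(x - 0.5)]

x_star = np.array([minimize_scalar(fi).x for fi in f_i])   # coordinate-wise minima

def f(x):
    return sum(fi(xi) for fi, xi in zip(f_i, x))

print(x_star, f(x_star))   # the coordinate-wise minimizer also minimizes f
```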

Numerical optimization / Typical difficulties in optimization

Non-separable problems. Building a non-separable problem from a separable one (1,2): rotating the coordinate system.
  f: x → f(x) separable
  f: x → f(Rx) non-separable, with R a rotation matrix
(figure: level sets of the separable function and of its rotated, non-separable counterpart)

1 Hansen, Ostermeier, Gawelczyk (1995). On the adaptation of arbitrary normal mutation distributions in evolution strategies: The generating set adaptation. Sixth ICGA, pp. 57-64, Morgan Kaufmann
2 Salomon (1996). Reevaluating Genetic Algorithm Performance under Coordinate Rotation of Benchmark Functions; A survey of some theoretical and practical aspects of genetic algorithms. BioSystems, 39(3):263-278

Numerical optimization / Typical difficulties in optimization

Ill-conditioned problems: curvature of level sets. Consider the convex-quadratic function

  f(x) = (1/2) (x − x*)^T H (x − x*),   which for x* = 0 reads (1/2) Σ_i h_{i,i} x_i² + (1/2) Σ_{i≠j} h_{i,j} x_i x_j

where H is the Hessian matrix of f, symmetric positive definite.
  gradient direction: −∇f(x)^T
  Newton direction: −H^{−1} ∇f(x)^T

Ill-conditioning means squeezed level sets (high curvature). The condition number equals nine in the illustration; condition numbers up to 10^10 are not unusual in real-world problems.

If H ≈ I (small condition number of H), first order information (e.g. the gradient) is sufficient. Otherwise second order information (an estimation of H^{−1}) is necessary.

Numerical optimization / Typical difficulties in optimization

What makes a function difficult to solve? ... and what can be done (the approach in ESs and continuous EDAs):

  The problem: ruggedness. The approach: non-local policy, large sampling width (step-size), as large as possible while preserving a reasonable convergence speed; stochastic, non-elitistic, population-based method; the recombination operator serves as a repair mechanism.

  The problem: dimensionality, non-separability. The approach: exploiting the problem structure; locality, neighborhood, encoding.

  The problem: ill-conditioning. The approach: a second order approach, which changes the neighborhood metric.

Numerical optimization / Deterministic vs stochastic, local versus global methods

Different optimization methods. Deterministic methods (local methods):
  convex optimization methods, gradient based methods: most often require gradients of the function; converge to local optima, fast if the function satisfies the right assumptions (smooth enough)
  deterministic methods that do not require convexity: simplex method, pattern search methods

Numerical optimization / Deterministic vs stochastic, local versus global methods

Different optimization methods (cont.) Stochastic methods (global methods): use randomness to be able to escape local optima.
  pure random search, simulated annealing (SA): not reasonable methods in practice (too slow)
  genetic algorithms: not powerful for numerical optimization, originally introduced for binary search spaces {0, 1}^n
  evolution strategies: powerful for numerical optimization

Remarks: it is impossible to be exhaustive, and the classification is not that binary; methods combining deterministic and stochastic elements do of course exist.

Mathematical tools and optimality conditions / First order differentiability and gradients

Differentiability: generalization of the derivative in 1-D. Given a complete normed vector space (E, ‖·‖) (a Banach space), consider f: U ⊂ E → R with U an open subset of E. f is differentiable at x ∈ U if there exists a continuous linear form Df_x such that

  f(x + h) = f(x) + Df_x(h) + o(‖h‖)                  (3)

Df_x is the differential of f at x.

Exercise: consider E = R^n with the scalar product ⟨x, y⟩ = x^T y. Let a ∈ R^n; show that f(x) = ⟨a, x⟩ is differentiable and compute its differential.

Mathematical tools and optimality conditions / First order differentiability and gradients

Gradient. If the norm ‖·‖ comes from a scalar product, i.e. ‖x‖ = sqrt(⟨x, x⟩) (the Banach space E is then called a Hilbert space), the gradient of f at x, denoted ∇f(x), is defined as the element of E such that

  Df_x(h) = ⟨∇f(x), h⟩                                 (4)

(Riesz representation theorem).

Taylor formula, order one: replacing the differential by (4) in (3) we obtain

  f(x + h) = f(x) + ⟨∇f(x), h⟩ + o(‖h‖)

Mathematical tools and optimality conditions / First order differentiability and gradients

Gradient (cont.)

Exercise: compute the gradient of the function f(x) = ⟨a, x⟩.

Exercise: level sets are the sets L_c = {x ∈ R^n : f(x) = c}. Let x_0 ∈ L_c. Show that ∇f(x_0) is orthogonal to the level set at x_0.

Mathematical tools and optimality conditions / First order differentiability and gradients

Gradient: examples.
  if f(x) = ⟨a, x⟩, then ∇f(x) = a
  if f(x) = x^T A x, then ∇f(x) = (A + A^T) x; in particular, if f(x) = ‖x‖², then ∇f(x) = 2x
  in R, ∇f(x) = f'(x)
  in (R^n, ‖·‖_2), where ‖x‖_2 = sqrt(⟨x, x⟩) is the Euclidean norm,

    ∇f(x) = ( ∂f/∂x_1(x), ..., ∂f/∂x_n(x) )^T

  i.e. the gradient defined through the scalar product (a "natural" gradient in the sense of Amari, 2000) coincides with the vector of partial derivatives (the "vanilla" gradient).

Attention: this equality will not hold for all norms. This has some importance in machine learning, see the natural gradient topic introduced by Amari. A numerical check of the (A + A^T)x formula is sketched below.
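
A quick numerical sanity check of the formula ∇f(x) = (A + A^T)x, using central finite differences; a sketch assuming Python with numpy, with an arbitrary A and x.

```python
# Finite-difference check of grad f(x) = (A + A^T) x for f(x) = x^T A x.
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(4, 4))            # not necessarily symmetric
x = rng.normal(size=4)

f = lambda z: z @ A @ z
grad_analytic = (A + A.T) @ x

eps = 1e-6
grad_fd = np.array([(f(x + eps * e) - f(x - eps * e)) / (2 * eps)
                    for e in np.eye(4)])
print(np.allclose(grad_analytic, grad_fd, atol=1e-5))   # expected: True
```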

Mathematical tools and optimality conditions / Second order differentiability and Hessian

Second order differentiability. The (first order) differential gives a linear local approximation; the second order differential gives a quadratic local approximation.

Definition (second order differentiability): f: U ⊂ E → R is twice differentiable at x ∈ U if it is differentiable in a neighborhood of x and if u ↦ Df_u is differentiable at x.

Mathematical tools and optimality conditions / Second order differentiability and Hessian

Second order differentiability (cont.) Other definition: f: U ⊂ E → R is twice differentiable at x ∈ U iff there exist a continuous linear application Df_x and a bilinear symmetric continuous application D²f_x such that

  f(x + h) = f(x) + Df_x(h) + (1/2) D²f_x(h, h) + o(‖h‖²)

In a Hilbert space (E, ⟨·,·⟩),

  D²f_x(h, h) = ⟨∇²f(x)(h), h⟩

where ∇²f(x): E → E is a symmetric continuous operator.

Mathematical tools and optimality conditions / Second order differentiability and Hessian

Hessian matrix. In (R^n, ⟨x, y⟩ = x^T y), ∇²f(x) is represented by a symmetric matrix called the Hessian matrix. It can be computed as

  ∇²f = [ ∂²f/∂x_1²       ∂²f/∂x_1∂x_2   ...   ∂²f/∂x_1∂x_n
          ∂²f/∂x_2∂x_1    ∂²f/∂x_2²      ...   ∂²f/∂x_2∂x_n
          ...
          ∂²f/∂x_n∂x_1    ∂²f/∂x_n∂x_2   ...   ∂²f/∂x_n²     ]

Exercise: in R^n, compute the Hessian matrix of f(x) = (1/2) x^T A x.

Mathematical tools and optimality conditions / Optimality conditions for unconstrained optimization

Optimality condition: first order necessary condition for 1-D optimization problems f: R → R. Assume f is differentiable. Then

  x* is a local optimum  ⇒  f'(x*) = 0

This is not a sufficient condition: consider f(x) = x³ at x = 0. The proof goes via the Taylor formula: f(x* + h) = f(x*) + f'(x*) h + o(|h|). Points y such that f'(y) = 0 are called critical or stationary points.

Generalization: if f: E → R, with (E, ⟨·,·⟩) a Hilbert space, is differentiable, then

  x* is a local minimum of f  ⇒  ∇f(x*) = 0

(proof via the Taylor formula).

Mathematical tools and optimality conditions / Optimality conditions for unconstrained optimization

Optimality conditions: second order necessary and sufficient conditions. If f is twice continuously differentiable:
  if x* is a local optimum, then ∇f(x*) = 0 and ∇²f(x*) is positive semi-definite
  if ∇f(x*) = 0 and ∇²f(x*) is positive definite, then there is an α > 0 such that, in a neighborhood of x*, f(x) ≥ f(x*) + α ‖x − x*‖²
A numerical check on a quadratic example is sketched below.
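
A small numerical illustration of these conditions on a convex-quadratic, assuming Python with numpy; the particular A and b are arbitrary.

```python
# Check the first- and second-order conditions at the critical point of
# f(x) = 1/2 x^T A x + b^T x, whose critical point is x* = -A^{-1} b and
# whose Hessian is A.
import numpy as np

A = np.array([[2.0, 0.5], [0.5, 1.0]])   # symmetric positive definite
b = np.array([1.0, -1.0])
x_star = -np.linalg.solve(A, b)

print(np.allclose(A @ x_star + b, 0.0))   # first order: gradient vanishes
print(np.linalg.eigvalsh(A))              # second order: all eigenvalues > 0
```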

Gradient based optimization algorithms

Optimization algorithms: iterative procedures that generate a sequence (x_n)_{n ∈ N} where, at each time step n, x_n is the estimate of the optimum.

Convergence: no exact result in general; however the algorithms aim at converging to optima (local or global), i.e. ‖x_n − x*‖ → 0 as n → ∞. The convergence speed of the algorithm is how fast ‖x_n − x*‖ goes to zero.

Equivalently, given a fixed precision ε, the algorithms aim at approaching the optimum x* with precision ε in finite time; the running time is how many iterations are needed to reach the precision ε.

Gradient based optimization algorithms / Root finding methods (1-D optimization)

Newton-Raphson method: a method to find a root of a function f: R → R.
  linear approximation of f: f(x + h) = f(x) + h f'(x) + o(|h|); at first order, f(x + h) = 0 leads to h = −f(x)/f'(x)
  algorithm: x_{n+1} = x_n − f(x_n)/f'(x_n)
  quadratic convergence: ‖x_{n+1} − x*‖ ≤ µ ‖x_n − x*‖², provided the initial point is close enough to the root x* and there is no other critical point in a neighborhood of x*
  secant method: replaces the derivative computation by its finite difference approximation

Application to the minimization of f: try to solve the equation f'(x) = 0; algorithm: x_{n+1} = x_n − f'(x_n)/f''(x_n). A minimal sketch of both variants follows.
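
A minimal sketch of both uses of the iteration, assuming Python; the test functions and starting points are arbitrary illustrative choices.

```python
# Newton-Raphson iteration x_{n+1} = x_n - f(x_n)/f'(x_n), and its use for
# 1-D minimization by applying it to f'.
def newton_root(f, fprime, x0, tol=1e-10, max_iter=50):
    x = x0
    for _ in range(max_iter):
        step = f(x) / fprime(x)
        x -= step
        if abs(step) < tol:
            break
    return x

# root finding: root of f(x) = x^2 - 2, i.e. sqrt(2)
print(newton_root(lambda x: x**2 - 2, lambda x: 2 * x, x0=1.0))

# minimization: apply Newton to g'(x) for g(x) = x^4 - 3 x^2 + x
gprime  = lambda x: 4 * x**3 - 6 * x + 1
gsecond = lambda x: 12 * x**2 - 6
print(newton_root(gprime, gsecond, x0=1.5))   # converges to a critical point of g
```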

Gradient based optimization algorithms / Root finding methods (1-D optimization)

Bisection method: another method to find the root of a 1-D function.
  bisects an interval and then selects the subinterval in which a root must lie
  the method is guaranteed to converge if f is continuous and the initial points bracket the solution, with error bound |x_n − x*| ≤ (b_1 − a_1)/2^n
(figure: bisection iterations, Wikipedia image)
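
A minimal sketch of the bisection iteration, assuming Python; it requires a continuous f and initial points a, b with f(a) and f(b) of opposite signs.

```python
# Bisection: halve the bracketing interval until it is shorter than tol.
def bisection(f, a, b, tol=1e-10):
    fa = f(a)
    while b - a > tol:
        m = 0.5 * (a + b)
        if fa * f(m) <= 0:
            b = m                # a root lies in [a, m]
        else:
            a, fa = m, f(m)      # a root lies in [m, b]
    return 0.5 * (a + b)

print(bisection(lambda x: x**2 - 2, 0.0, 2.0))   # ~1.41421356 (sqrt(2))
```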

Gradient based optimization algorithms / Relaxation algorithm

Relaxation algorithm: a naive approach for the optimization of f: R^n → R, by successive optimization with respect to each variable.

Algorithm:
  1 choose an initial point x^0, k = 1
  2 while not happy
      for i = 1, ..., n
        x_i^{k+1} = argmin_{x ∈ R} f(x_1^{k+1}, ..., x_{i-1}^{k+1}, x, x_{i+1}^k, ..., x_n^k)
        (use a 1-D minimization procedure)
      endfor
      k = k + 1

Criticism: very bad for non-separable problems; not invariant under a change of coordinate system. A sketch in code follows.
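
A minimal sketch of the relaxation algorithm, assuming Python with scipy's 1-D minimizer for the inner step; the stopping rule ("while not happy") is replaced here by a fixed number of sweeps and the test function is illustrative.

```python
# Relaxation (coordinate descent): successive 1-D minimization over each variable.
import numpy as np
from scipy.optimize import minimize_scalar

def relaxation(f, x0, n_sweeps=20):
    x = np.array(x0, dtype=float)
    for _ in range(n_sweeps):                  # crude stand-in for "while not happy"
        for i in range(len(x)):
            def f_i(t, i=i):                   # f as a function of coordinate i only
                z = x.copy(); z[i] = t
                return f(z)
            x[i] = minimize_scalar(f_i).x      # 1-D minimization procedure
    return x

f = lambda x: (x[0] - 1.0)**2 + 10.0 * (x[1] + 2.0)**2   # separable test function
print(relaxation(f, [0.0, 0.0]))                          # ~[1, -2]
```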

Gradient based optimization algorithms / Descent methods

Descent methods: general principle.
  1 choose an initial point x_0, k = 1
  2 while not happy
    2.1 choose a descent direction d_k ≠ 0
    2.2 choose a step-size σ_k > 0
    2.3 set x_{k+1} = x_k + σ_k d_k, k = k + 1

Remaining questions: how to choose d_k? how to choose σ_k?

Gradient based optimization algorithms / Descent methods

Gradient descent. Rationale: d_k = −∇f(x_k) is a descent direction. Indeed, for f differentiable,

  f(x − σ∇f(x)) = f(x) − σ‖∇f(x)‖² + o(σ) < f(x)   for σ small enough.

Step-size: the optimal step-size is

  σ_k = argmin_σ f(x_k − σ∇f(x_k))                     (5)

In general a precise minimization of (5) is expensive and unnecessary; instead, a line search algorithm executes a limited number of trial steps until a loose approximation of the minimum is found, as in the sketch below.
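
A minimal sketch of gradient descent with a backtracking (Armijo-type) line search in place of the exact minimization in (5), assuming Python with numpy; the constants 1e-4 and 0.5 and the test function are illustrative choices.

```python
# Gradient descent: x_{k+1} = x_k - sigma_k * grad f(x_k), with a backtracking
# line search that halves sigma until a sufficient decrease is obtained.
import numpy as np

def gradient_descent(f, grad, x0, n_iter=200):
    x = np.array(x0, dtype=float)
    for _ in range(n_iter):
        g = grad(x)
        sigma = 1.0
        while f(x - sigma * g) > f(x) - 1e-4 * sigma * (g @ g):
            sigma *= 0.5
        x = x - sigma * g
    return x

# ill-conditioned convex quadratic: convergence is slow, as the next slide predicts
A = np.diag([1.0, 100.0])
f    = lambda x: 0.5 * x @ A @ x
grad = lambda x: A @ x
print(gradient_descent(f, grad, [1.0, 1.0]))   # approaches [0, 0]
```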

Gradient based optimization algorithms / Descent methods

Gradient descent: convergence rate on a convex-quadratic function. On f(x) = (1/2) x^T A x − b^T x + c, with A a symmetric positive definite matrix and Q = cond(A), the gradient descent algorithm with optimal step-size satisfies

  f(x_k) − min f ≤ ((Q − 1)/(Q + 1))^{2k} [f(x_0) − min f]

The convergence rate depends on the starting point; however, ((Q − 1)/(Q + 1))^{2k} gives the correct description of the convergence rate for almost all starting points. Convergence is very slow for ill-conditioned problems.

Gradient based optimization algorithms / Descent methods

Newton algorithm, BFGS.

Newton method: descent direction −[∇²f(x_k)]^{−1} ∇f(x_k). The Newton direction points towards the optimum on a convex-quadratic function f(x) = x^T A x + b^T x + c. However, the Hessian matrix is expensive to compute in general, and its inversion is also not immediate.

Broyden-Fletcher-Goldfarb-Shanno (BFGS) method: constructs iteratively, using solely the gradient, an approximation of the Newton direction −[∇²f(x_k)]^{−1} ∇f(x_k). A usage sketch follows.
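
As a usage sketch (not the course's own code), scipy's implementation of BFGS can be called as follows; the Rosenbrock test function shipped with scipy is used for illustration.

```python
# Quasi-Newton (BFGS) via scipy: the inverse-Hessian approximation is built
# internally from gradient evaluations only.
import numpy as np
from scipy.optimize import minimize, rosen, rosen_der

x0 = np.array([-1.2, 1.0])
res = minimize(rosen, x0, jac=rosen_der, method="BFGS")
print(res.x, res.nit)   # expected: x close to [1, 1]
```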

Gradient based optimization algorithms / Trust region methods

Trust region methods: general idea.
  instead of choosing a descent direction and minimizing along this direction, trust region algorithms construct a model m_k of the objective function (usually a quadratic model)
  because the model may not be a good approximation of f far away from the current iterate x_k, the search for a minimizer of the model is restricted to a region (the trust region) around the current iterate
  the trust region is usually a ball, whose radius is adapted (shrunk if no better solution is found)

Gradient based optimization algorithms

Constrained optimization (was not handled in this class, but you might want to have a look at it):

  min_{x ∈ R^n} f(x)   subject to   c_i(x) = 0, i = 1, ..., p   (equality constraints)
                                    b_j(x) ≥ 0, j = 1, ..., k   (inequality constraints)
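
A small illustrative constrained problem in exactly this form, solved with scipy's SLSQP method (one of many possible solvers, not covered in the class); the objective and constraints below are arbitrary examples.

```python
# minimize f(x) = x1^2 + x2^2  subject to  c(x) = x1 + x2 - 1 = 0  (equality)
#                                      and b(x) = x1 >= 0          (inequality)
import numpy as np
from scipy.optimize import minimize

f = lambda x: x[0]**2 + x[1]**2
constraints = [{"type": "eq",   "fun": lambda x: x[0] + x[1] - 1.0},
               {"type": "ineq", "fun": lambda x: x[0]}]   # scipy convention: fun(x) >= 0
res = minimize(f, x0=np.array([2.0, 0.0]), method="SLSQP", constraints=constraints)
print(res.x)   # expected: ~[0.5, 0.5]
```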