Modern Methods of Data Analysis Lecture XV (04.02.08) Contents: Function Minimization (see E. Lohrmann & V. Blobel)
Optimization Problem
- Set of n independent variables x = (x_1, ..., x_n)
- Sometimes, in addition, some constraints
- A single measure of goodness => objective function F(x)
In physics data analysis the objective function is
- the (negative log of a) likelihood function in the MLH method, or
- the sum of squares in a (nonlinear) least-squares problem.
Constraints:
- equality constraints, expressing relations between parameters
- inequality constraints are limits on certain parameters, defining a restricted range of parameter values (e.g. m > 0)
Aim of Optimization
Find the global minimum of the objective function
- within the allowed range of parameter values,
- in a short time, even for a large number of parameters and a complicated objective function,
- even if there are local minima.
Most methods go immediately downhill as far as possible and therefore converge to the nearest minimum, which may be the global minimum or only a local one. The search for the global minimum requires a special effort.
One-dimensional Minimization
Search for the minimum of a function f(x) of a (scalar) argument x.
Important application in multidimensional minimization: robust minimization along a line ("line search").
Aim: robust, efficient and as fast as possible, because each function evaluation may require a large CPU time.
Standard method: iteration x_{k+1} = Phi(x_k), starting from x_0, with convergence to a fixed point x* with x* = Phi(x*).
Newton Iteration Method
Method for the determination of zeros of a function, f(x) = 0, based on derivatives:
  x_{k+1} = x_k - f(x_k) / f'(x_k)
Same method for min/max determination (derived from the Taylor expansion, solving f'(x) = 0):
  x_{k+1} = x_k - f'(x_k) / f''(x_k)
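The Newton minimization iteration above can be sketched in a few lines; the quartic test function is an illustrative assumption, not from the lecture:

```python
def newton_minimize(fp, fpp, x0, tol=1e-10, max_iter=50):
    """Newton iteration x_{k+1} = x_k - f'(x_k)/f''(x_k) for a 1-d minimum.

    fp, fpp: first and second derivative of the objective f.
    """
    x = x0
    for _ in range(max_iter):
        step = fp(x) / fpp(x)
        x -= step
        if abs(step) < tol:
            break
    return x

# Hypothetical objective: f(x) = x**4 - 3*x**2 + x
# f'(x) = 4x^3 - 6x + 1, f''(x) = 12x^2 - 6
xmin = newton_minimize(lambda x: 4*x**3 - 6*x + 1,
                       lambda x: 12*x**2 - 6, x0=1.0)
```

Started at x0 = 1.0 the iteration converges quadratically to the nearby local minimum; a bad start value (e.g. near f''(x) = 0) can make it diverge, as noted below.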
Convergence Behaviour (I)
An iterative method is called locally convergent of at least order p if, for all start values x_0 sufficiently close to the fixed point x*,
  |x_{k+1} - x*| <= c |x_k - x*|^p
is valid for all k (with c < 1 in the linear case p = 1).
Condition for order p:
  Phi'(x*) = Phi''(x*) = ... = Phi^{(p-1)}(x*) = 0 and Phi^{(p)}(x*) != 0,
and the iterative method is convergent of order p.
Convergence Behaviour (II)
The linear case (p = 1): |Phi'(x*)| < 1 is required. The sequence converges monotonically to x* for a positive value of Phi'(x*), and alternates around x* for a negative value.
Linear convergence can be very slow: the constant c is often very close to 1, and many hundreds of iterations may be necessary, with small progress per iteration - not recommended.
Quadratic convergence (p = 2): usually only a few iterations are required, very fast in the final phase - recommended, at least for the end game.
Convergence of the Newton Method
... for the determination of a minimum (or maximum):
  Phi(x) = x - f'(x) / f''(x),  with Phi'(x*) = 0 in general.
Newton's method is thus
- quadratically convergent (locally),
- requires the first and second derivative,
- may be divergent for a bad start value.
Search Without Derivatives
Required: a robust, convergent method for minimum determination without the need to calculate derivatives (which may be complicated or impossible).
Aim: determine a very short x-interval which contains the minimum of the function f(x).
Strategy of the search method, in two steps:
1. find an initial interval [a, b] which includes a unimodal minimum,
2. reduce the size of the interval (sufficiently).
Golden Section Strategy
Define a new point by the golden section tau = (sqrt(5) - 1)/2 ~ 0.618 of the current interval; the new interval is chosen depending on the function value at the new point.
The length of the interval is reduced by the factor tau ~ 0.618 per iteration, at the cost of the computation of one function value (linear convergence). For 10 iterations: reduction by the factor tau^10 ~ 0.008, independent of the behaviour of the function.
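A minimal sketch of the golden-section strategy; the test function and the start interval are illustrative assumptions:

```python
import math

def golden_section(f, a, b, tol=1e-8):
    """Golden-section search on [a, b], assuming a single (unimodal) minimum.

    Each iteration shrinks the interval by the factor tau = (sqrt(5)-1)/2
    ~ 0.618 at the cost of one new function evaluation.
    """
    tau = (math.sqrt(5.0) - 1.0) / 2.0
    x1 = b - tau * (b - a)
    x2 = a + tau * (b - a)
    f1, f2 = f(x1), f(x2)
    while b - a > tol:
        if f1 < f2:                 # minimum must lie in [a, x2]
            b, x2, f2 = x2, x1, f1  # old x1 becomes the new x2 (tau^2 = 1 - tau)
            x1 = b - tau * (b - a)
            f1 = f(x1)
        else:                       # minimum must lie in [x1, b]
            a, x1, f1 = x1, x2, f2
            x2 = a + tau * (b - a)
            f2 = f(x2)
    return 0.5 * (a + b)

xmin = golden_section(lambda x: (x - 2.0)**2 + 1.0, 0.0, 5.0)  # minimum at x = 2
```

The golden-ratio property tau^2 = 1 - tau is what lets each iteration reuse one of the two interior points, so only one new function value is needed.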
Parabola Method
More efficient for "normal" behaviour of functions: fit a parabola to the last three points and use the minimum of the parabola as the next point.
Note: many functions to be minimized are parabolic in good approximation => the minimum of the parabola is close to the function minimum. See Fig. 8.6 (Blobel/Lohrmann).
But: the method can get stuck with an unbalanced section of the interval (parabolic interpolation becomes unstable).
=> Combined method: use a mixture of the parabola and golden-section methods to avoid an unbalanced section of the interval.
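The parabola step can be sketched with the standard three-point interpolation formula; the test function is an illustrative assumption:

```python
def parabola_min(x1, f1, x2, f2, x3, f3):
    """Abscissa of the minimum of the parabola through three points.

    Standard interpolation formula; a caller must guard against a
    (near-)degenerate denominator, where the method becomes unstable.
    """
    num = (x2 - x1)**2 * (f2 - f3) - (x2 - x3)**2 * (f2 - f1)
    den = (x2 - x1) * (f2 - f3) - (x2 - x3) * (f2 - f1)
    return x2 - 0.5 * num / den

# For an exactly parabolic function the minimum is found in one step:
x = parabola_min(0.0, 4.0, 1.0, 1.0, 3.0, 1.0)   # points from f(x) = (x - 2)**2
```

For collinear or nearly collinear points the denominator vanishes, which is exactly the instability that motivates the combined method above.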
Search Methods in n Dimensions
Search methods in n dimensions do not require any derivatives, only function values. Examples:
- Line search in one variable, applied sequentially in all dimensions (usually rather inefficient).
- Simplex method by Nelder and Mead: simple, but makes use of earlier function evaluations in an efficient way ("learning").
- Monte Carlo search: random search in n dimensions, using the result as starting value for more efficient methods; meaningful if several local minima may exist.
In general, search methods are acceptable initially (far from the optimum), but are inefficient and slow in the end game.
Simplex Method
A simplex is formed by n+1 points in n-dimensional space (n = 2: triangle), sorted such that the function values are in the order
  f(x_1) <= f(x_2) <= ... <= f(x_{n+1}).
In addition: the mean of the best n points, c = (1/n) sum_{j=1..n} x_j, is the centre of gravity.
Method: a sequence of cycles, with a new point in each cycle replacing the worst point x_{n+1}, giving a new (updated) simplex in each cycle. At the start of each cycle a new test point is obtained by reflection of the worst point at the centre of gravity:
  x_r = c + alpha (c - x_{n+1}).
[Figure: a few steps of the simplex method, starting from the simplex with the centre of gravity c; x_r and x_e are test points.]
A Cycle in the Simplex Method
Depending on the value f(x_r):
- f(x_1) <= f(x_r) <= f(x_n): the test point x_r is a middle point; it is added and the previous worst point x_{n+1} is removed.
- f(x_r) < f(x_1): the test point is the best point so far; the search direction seems to be effective. A new point x_e = c + beta (x_r - c) (with beta > 1) is determined and the function value is evaluated. If f(x_e) < f(x_r) the extra step is successful and x_{n+1} is replaced by x_e, otherwise by x_r.
- f(x_r) > f(x_n): the simplex is too big, it has to be reduced. For f(x_r) < f(x_{n+1}) the test point replaces the worst point. A new test point is defined by x_c = c + gamma (x_{n+1} - c) with 0 < gamma < 1. If this point is an improvement, i.e. f(x_c) < f(x_{n+1}), then x_{n+1} is replaced by this point. Otherwise a new simplex is defined by replacing all points but x_1 by x_j -> x_1 + delta (x_j - x_1) for j = 2, ..., n+1 with 0 < delta < 1, which requires n function evaluations.
Typical values are alpha = 1, beta = 2, gamma = 0.5 and delta = 0.5.
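A minimal Python sketch of such a cycle (a slightly simplified variant: the contraction is always taken towards the worst point, and the initial-simplex construction is an arbitrary choice, not from the lecture):

```python
import numpy as np

def nelder_mead(f, x0, alpha=1.0, beta=2.0, gamma=0.5, delta=0.5,
                tol=1e-8, max_iter=500):
    """Simplified Nelder-Mead cycle with the typical parameter values."""
    n = len(x0)
    # initial simplex: x0 plus n points displaced along each coordinate axis
    simplex = [np.asarray(x0, float)]
    for i in range(n):
        p = np.array(x0, float)
        p[i] += 0.1 if p[i] == 0 else 0.05 * p[i]
        simplex.append(p)
    fvals = [f(p) for p in simplex]
    for _ in range(max_iter):
        order = np.argsort(fvals)                  # sort: best first, worst last
        simplex = [simplex[i] for i in order]
        fvals = [fvals[i] for i in order]
        if abs(fvals[-1] - fvals[0]) < tol:
            break
        c = np.mean(simplex[:-1], axis=0)          # centre of gravity of best n points
        xr = c + alpha * (c - simplex[-1])         # reflection of the worst point
        fr = f(xr)
        if fvals[0] <= fr < fvals[-2]:             # middle point: accept reflection
            simplex[-1], fvals[-1] = xr, fr
        elif fr < fvals[0]:                        # best so far: try expansion
            xe = c + beta * (xr - c)
            fe = f(xe)
            if fe < fr:
                simplex[-1], fvals[-1] = xe, fe
            else:
                simplex[-1], fvals[-1] = xr, fr
        else:                                      # simplex too big: contract
            xc = c + gamma * (simplex[-1] - c)
            fc = f(xc)
            if fc < fvals[-1]:
                simplex[-1], fvals[-1] = xc, fc
            else:                                  # shrink all points towards the best
                for j in range(1, n + 1):
                    simplex[j] = simplex[0] + delta * (simplex[j] - simplex[0])
                    fvals[j] = f(simplex[j])
    order = np.argsort(fvals)
    return simplex[order[0]], fvals[order[0]]

xmin, fmin = nelder_mead(lambda p: (p[0] - 1.0)**2 + (p[1] - 2.0)**2, [0.0, 0.0])
```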
Monte Carlo Search in n Dimensions
Search in a box: lower and upper boundaries a_i and b_i define a test point x_i = a_i + u_i (b_i - a_i), with u_i uniformly distributed in [0, 1]. Among several test points, keep the one with the smallest function value.
Search in a sphere: define a step-size vector d with components d_i, and search with x_new,i = x_i + d_i z_i, where the z_i are drawn from the standard normal distribution. If the new point has a smaller function value, use it as the next starting point.
Meaningful in higher dimensions, especially if the existence of many local minima is expected, as a method to get a good starting value.
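A sketch of the box search; the bounds, trial count and test function are illustrative assumptions:

```python
import random

def mc_box_search(f, low, high, n_trials=10000, seed=42):
    """Monte Carlo search in a box: x_i = a_i + u * (b_i - a_i), u uniform.

    Returns the trial point with the smallest function value; intended
    only to provide a starting value for a more efficient local method.
    """
    rng = random.Random(seed)
    best_x, best_f = None, float("inf")
    for _ in range(n_trials):
        x = [a + rng.random() * (b - a) for a, b in zip(low, high)]
        fx = f(x)
        if fx < best_f:
            best_x, best_f = x, fx
    return best_x, best_f

x0, f0 = mc_box_search(lambda x: (x[0] - 1.0)**2 + (x[1] + 2.0)**2,
                       low=[-5.0, -5.0], high=[5.0, 5.0])
```

The result x0 is then handed to a derivative-based method for the end game.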
n-Dimensional Minimization with Derivatives
Minimize F(x), x = (x_1, ..., x_n). Taylor expansion:
  F(x + Dx) = F(x) + g^T Dx + (1/2) Dx^T H Dx + ...
Function value and derivatives are evaluated at the point x:
  gradient g with g_i = dF/dx_i,  Hesse matrix H with H_ij = d^2 F / (dx_i dx_j).
Covariance Matrix
Note: if the objective function is
- a sum of squares of deviations, defined by the method of least squares, or
- a negative log-likelihood function, defined according to the maximum likelihood method,
then the inverse Hessian H^{-1} at the minimum is a good estimate of the covariance matrix V of the parameters x (V = H^{-1} for F = -ln L; V = 2 H^{-1} for a sum of squares F = chi^2).
The second derivatives need to be computed anyhow most of the time, at least at the last iteration step.
The Newton Step
The step Dx is determined from
  H Dx = -g  =>  Dx = -H^{-1} g.
For a quadratic function the Newton step is, in length and direction, a step to the minimum of the function.
Sometimes there is a large angle between the Newton direction and -g (the direction of steepest descent).
Calculation of the distance to the minimum (called EDM in MINUIT), if the Hessian is positive definite:
  d = g^T H^{-1} g.
For a quadratic function the distance to the minimum (in function value) is d/2.
General Iteration Scheme
1. Test for convergence: if the conditions for convergence are satisfied, the algorithm terminates with x_k as the solution. The difference F(x_{k-1}) - F(x_k) and the distance measure d are used in the test.
2. Compute a search vector: a vector Dx_k is computed as the new search vector. The Newton search vector is determined from H Dx_k = -g.
3. Line search: a one-dimensional minimization is done for the function f(alpha) = F(x_k + alpha Dx_k) and the minimum position alpha_k is determined. (This step is essential to get a stable method!)
4. Update: the new point is defined by x_{k+1} = x_k + alpha_k Dx_k, and k is increased by 1.
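The four steps above can be sketched as one loop; the backtracking line search and the quadratic test function are simplifying assumptions (a production method would use a more careful one-dimensional minimization):

```python
import numpy as np

def newton_with_line_search(f, grad, hesse, x0, tol=1e-8, max_iter=100):
    """General iteration scheme: Newton search vector from H dx = -g,
    followed by a simple backtracking line search along dx."""
    x = np.asarray(x0, float)
    for _ in range(max_iter):
        g = grad(x)
        H = hesse(x)
        d = float(g @ np.linalg.solve(H, g))   # distance measure (EDM-like)
        if d / 2.0 < tol:                      # 1. test for convergence
            break
        dx = np.linalg.solve(H, -g)            # 2. Newton search vector
        alpha, fx = 1.0, f(x)
        while f(x + alpha * dx) > fx and alpha > 1e-10:
            alpha *= 0.5                       # 3. backtracking line search
        x = x + alpha * dx                     # 4. update
    return x

# Hypothetical quadratic test function (Newton reaches its minimum in one step)
f = lambda x: (x[0] - 1.0)**2 + 2.0 * (x[1] + 3.0)**2 + x[0] * x[1]
grad = lambda x: np.array([2.0 * (x[0] - 1.0) + x[1], 4.0 * (x[1] + 3.0) + x[0]])
hesse = lambda x: np.array([[2.0, 1.0], [1.0, 4.0]])
xmin = newton_with_line_search(f, grad, hesse, [0.0, 0.0])
```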
Method of Steepest Descent
The search vector is equal to the negative gradient: Dx = -g.
- The step seems to be a natural choice; only the gradient is required (no Hesse matrix) - good.
- No step size is defined (in contrast to the Newton step) - bad.
- The rate of convergence is only linear, with
    c = (lambda_max - lambda_min) / (lambda_max + lambda_min),
  where lambda_max and lambda_min are the largest and smallest eigenvalues, and kappa = lambda_max / lambda_min is the condition number of the Hesse matrix H. For a large value of kappa, c is close to one and convergence is slow - very bad.
Optimal step size, if the Hessian is known: alpha = (g^T g) / (g^T H g).
Derivative Calculation
The (optimal) Newton method requires
- the n first derivatives of F(x): n computations,
- the n(n+1)/2 second derivatives of F(x): O(n^2) computations.
Analytical derivatives may be impossible or difficult to obtain. Numerical derivatives require a good step size delta for the difference quotient, e.g. for the numerical derivative of f(x) in one dimension:
  f'(x) ~ (f(x + delta) - f(x - delta)) / (2 delta).
Can the Newton (or quasi-Newton) method be used without explicit calculation of the complete Hessian?
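A sketch of the central difference quotient with an automatic step-size choice; the eps**(1/3) heuristic is a common rule of thumb, not from the lecture:

```python
import sys

def central_derivative(f, x, delta=None):
    """Central difference f'(x) ~ (f(x+d) - f(x-d)) / (2d).

    A reasonable default step is d ~ eps**(1/3) * max(|x|, 1), balancing
    the truncation error (~ d^2) against round-off error (~ eps/d),
    where eps is the machine precision.
    """
    if delta is None:
        eps = sys.float_info.epsilon
        delta = eps**(1.0 / 3.0) * max(abs(x), 1.0)
    return (f(x + delta) - f(x - delta)) / (2.0 * delta)

d = central_derivative(lambda t: t**3, 2.0)   # exact derivative: 12
```

A step that is too small amplifies round-off, one that is too large increases truncation error; the cube-root rule is the usual compromise for central differences.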
Minimization of a Least-Squares Objective Function
Objective function, gradient and Hessian for a sum of squares:
  F(x) = (1/2) sum_i f_i(x)^2,
  g = sum_i f_i(x) grad f_i(x),
  H = sum_i [ grad f_i(x) grad f_i(x)^T + f_i(x) (second derivatives of f_i) ].
Newton step: H Dx = -g.
Least-squares contributions: near the minimum the residuals f_i are small, and the second-derivative term in H can be dropped. Ignoring the second derivatives improves the Newton step!
Newton Steps in the Fit of an Exponential
[Figure: colour contours of the objective function (contour steps correspond to Delta chi^2 ~ 50); Newton steps with second derivatives ignored vs. second derivatives included.]
Ignoring the second derivatives improves the Newton step!
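The Gauss-Newton step for such an exponential fit can be sketched as follows; the model, the toy data and the step-halving safeguard are illustrative assumptions:

```python
import numpy as np

def gauss_newton_exp_fit(t, y, p0, n_iter=50):
    """Gauss-Newton fit of the toy model y ~ A * exp(-lam * t), p = [A, lam].

    The f_i * (second derivative) term of the Hessian is ignored, as
    described above; a step-halving safeguard keeps the sum of squares
    decreasing.
    """
    def ssq(p):
        return np.sum((y - p[0] * np.exp(-p[1] * t))**2)

    p = np.asarray(p0, float)
    for _ in range(n_iter):
        A, lam = p
        e = np.exp(-lam * t)
        r = y - A * e                          # residuals f_i
        J = np.column_stack([-e, A * t * e])   # Jacobian dr/dp
        g = J.T @ r                            # gradient of F = 0.5 * r.r
        H = J.T @ J                            # Hessian without 2nd-derivative term
        dp = np.linalg.solve(H, g)
        alpha = 1.0
        while ssq(p - alpha * dp) > ssq(p) and alpha > 1e-8:
            alpha *= 0.5                       # step-halving safeguard
        p = p - alpha * dp
    return p

t = np.linspace(0.0, 5.0, 20)
y = 2.0 * np.exp(-0.5 * t)                     # noise-free toy data: A=2, lam=0.5
p = gauss_newton_exp_fit(t, y, p0=[1.0, 1.0])
```

Since H = J^T J is positive (semi-)definite, the Gauss-Newton direction is always a descent direction, which is one reason dropping the second-derivative term improves the step.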
Variable Metric Method (I)
Calculation of the Hessian (with n(n+1)/2 different elements) from a sequence of first derivatives (gradients), by updating an estimate V of the inverse Hessian from the change of the gradient.
The step is calculated from Dx_k = -V_k g_k.
After a line search with minimum at x_{k+1} = x_k + alpha_k Dx_k with gradient g_{k+1}, the update of the matrix (with new value V_{k+1}) uses
  delta_k = x_{k+1} - x_k,  gamma_k = g_{k+1} - g_k,
and must satisfy the quasi-Newton condition V_{k+1} gamma_k = delta_k; V_{k+1} is not completely defined by those equations.
Note: an accurate line search is essential for the success.
Variable Metric Method (II)
Most effective update formula (Broyden/Fletcher/Goldfarb/Shanno, BFGS):
  V_{k+1} = V_k + (1 + gamma^T V_k gamma / (delta^T gamma)) (delta delta^T) / (delta^T gamma)
            - (delta gamma^T V_k + V_k gamma delta^T) / (delta^T gamma).
The initial matrix V_0 may be the unit matrix.
Properties: the method generates n independent search directions for a quadratic function, and the estimated Hessian converges to the true Hessian.
Potential problems: no real convergence for a bad starting point; the estimate can be destroyed by small, inaccurate steps (round-off errors).
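The BFGS update of the inverse-Hessian estimate can be sketched directly; the example vectors are arbitrary, and the check verifies the quasi-Newton condition V_{k+1} gamma = delta:

```python
import numpy as np

def bfgs_update(V, delta, gamma):
    """BFGS update of the inverse-Hessian estimate V.

    delta = x_{k+1} - x_k, gamma = g_{k+1} - g_k; requires delta.gamma > 0
    (curvature condition, guaranteed by an accurate line search).
    """
    dg = float(delta @ gamma)
    Vg = V @ gamma
    return (V
            + (1.0 + float(gamma @ Vg) / dg) * np.outer(delta, delta) / dg
            - (np.outer(delta, Vg) + np.outer(Vg, delta)) / dg)

V = np.eye(2)                       # initial matrix: unit matrix
delta = np.array([0.3, -0.1])       # arbitrary example step
gamma = np.array([0.5, 0.2])        # arbitrary example gradient change
Vn = bfgs_update(V, delta, gamma)
```

The updated matrix stays symmetric and satisfies V_{k+1} gamma = delta, which is the condition that left the update underdetermined on the previous slide.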
Minimization with MINUIT
Several options can be selected:
- Option MIGRAD: minimizes the objective function, calculates first derivatives numerically and uses the BFGS update formula for the Hessian - fast.
- Option HESSE: calculates the Hesse matrix numerically - recommended after minimization.
- Option MINIMIZE: minimization by MIGRAD and HESSE calculation, with checks.