Data Mining Chapter 8: Search and Optimization Methods Fall 2011 Ming Li Department of Computer Science and Technology Nanjing University
Search & Optimization Search and optimization methods deal with how to find the best model(s), i.e., those for which the score function attains its minimum (or maximum) value. Search for the best model structure from a set of candidates; optimize the model parameters within a given model structure. Major problem: both the number of possible model structures and the parameter space can be very large! How can we conduct search and optimization efficiently?
Simple search strategy for models Exhaustive search: for every model in the candidate set, find the best parameters w.r.t. the score function, then compare the scores of all the models to find the best. Pros: guaranteed to find the best model w.r.t. the score function; can be implemented in a parallel fashion. Cons: must re-optimize the parameters for each new model structure; faces a potential combinatorial explosion. Highly inefficient, and most of the time infeasible!
Being smart with compromise (I) Make use of a decomposable score function: the score of a new structure is then an additive function of the score of the previous structure plus a term accounting for the change in structure. Pros: easy to obtain the score of the current model from that of the previous model. Cons: limited to particular score functions.
Being smart with compromise (II) Use an approximation for the best parameters (incremental): leave the existing parameters fixed at their previous values and only optimize the parameters newly added to the model. Pros: reduces the number of parameters to be estimated when the model structure changes slightly; saves time, allowing more candidate model structures to be searched. Cons: provably suboptimal; suffers from error accumulation.
Being smart with compromise (III) Heuristic search: apply heuristics to narrow down the search space of model structures. Pros: efficient under combinatorial explosion; intuitive and easy to implement. Cons: lacks mathematical validity; may be ineffective in certain unforeseen situations.
State-space formulation for model search The model search problem can be viewed as one of moving through a discrete set of states. State-space representation: each state corresponds to a particular model in the candidate set, and can be represented as a vertex in a graph. Search operators: correspond to legal moves in our search space, and can be represented as edges between the state vertices in the graph.
Simple greedy search algorithm 1. Initialize: choose an initial state M0, corresponding to a particular model structure. 2. Iterate: from the current state Mk, evaluate the score function at all adjacent states (as defined by the operators) and move to the best one. 3. Stopping criterion: repeat step 2 until no further improvement of the score function can be attained locally. 4. Multiple restarts: repeat steps 1 through 3 from different initial starting points and choose the best solution found. Suboptimal.
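The four steps above can be sketched in a few lines of Python. The bitstring state space and the number-of-ones score are toy choices for illustration, not from the lecture.

```python
# Toy sketch of greedy model search. States are bit tuples; the score
# function (number of 1-bits, to be maximized) is an illustrative choice.

def neighbors(state):
    """Search operators: all states reachable by flipping a single bit."""
    return [state[:i] + (1 - state[i],) + state[i + 1:] for i in range(len(state))]

def greedy_search(initial, score):
    current = initial
    while True:
        best = max(neighbors(current), key=score)
        if score(best) <= score(current):  # stop: no adjacent state improves
            return current
        current = best

def multi_restart(starts, score):
    """Step 4: restart from several initial states and keep the best result."""
    return max((greedy_search(s, score) for s in starts), key=score)

best = greedy_search((0, 1, 0, 0), sum)  # climbs to (1, 1, 1, 1)
```

Greedy moves only uphill, so it can stop at a local optimum; the multiple restarts in step 4 are the standard mitigation.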
Systematic search Search tree: instead of following a single best-at-every-step path, keep track of multiple models simultaneously by traversing a search tree. Blind search: breadth-first search (consumes a huge amount of memory); depth-first search (memory-efficient).
Systematic search Traversing the search tree with heuristics. Beam search: keep track of the b best models at any point in the search; suboptimal, a trade-off for efficiency. Branch-and-bound: keep track of the best model structure so far; analytically calculate a lower bound on the best possible score attainable from a particular branch of the search tree; if the bound is greater than the best score so far, prune that branch. It is difficult to find a tight bound, so scalability is limited.
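A minimal beam-search sketch over a discrete state space, assuming a score function to maximize; the `expand` successor function and the integer states in the usage example are illustrative, not from the lecture.

```python
import heapq

def beam_search(initial, expand, score, b=3, steps=5):
    """Keep only the b highest-scoring states at each level of the tree."""
    beam = [initial]
    best = initial
    for _ in range(steps):
        # expand every state in the beam and pool the successors
        candidates = {s for state in beam for s in expand(state)}
        if not candidates:
            break
        beam = heapq.nlargest(b, candidates, key=score)   # prune to width b
        best = max(best, max(beam, key=score), key=score) # track best so far
    return best

# Toy usage: states are integers, expand(n) -> {n+1, n+2}, and the score
# peaks at n = 7; a beam of width 3 finds the peak in a few steps.
peak = beam_search(0, lambda n: [n + 1, n + 2], lambda n: -(n - 7) ** 2)
```

Because the beam discards everything outside the top b candidates, the optimum can be pruned away; this is exactly the efficiency trade-off the slide mentions.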
Parameter optimization For a given model structure, we can link the parameters of the model to the score function and optimize over the score function directly. Optimization can be done by calculating the minimum (or maximum) value directly. Closed-form solution: set the gradient to zero, i.e., let dS/dθ = 0, and solve the resulting d equations (linear in the simplest cases). Otherwise: iterative optimization.
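A toy closed-form example (my own, not from the slides): for the least-squares slope of a line through the origin, S(θ) = Σ(yᵢ − θxᵢ)², setting dS/dθ = 0 gives θ = Σxᵢyᵢ / Σxᵢ².

```python
# Closed-form parameter optimization: least-squares line through the origin.
# Setting dS/dtheta = 0 for S(theta) = sum((y_i - theta * x_i)**2) yields
# theta = sum(x_i * y_i) / sum(x_i ** 2).

def fit_slope(xs, ys):
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)

theta = fit_slope([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])  # data lie exactly on y = 2x
```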
Greedy method for optimizing smooth functions 1. Initialize: (randomly) choose an initial value for the parameter vector. 2. Iterate: update the parameter vector; both the direction and how much to change the value are determined by local information. 3. Convergence: repeat step 2 until S appears to have attained a local minimum. 4. Multiple restarts: repeat steps 1 through 3 from different initial starting points and choose the best solution found.
Univariate Optimization (I) The Newton-Raphson method: based on a second-order Taylor series expansion of S around the current point; setting the derivative of the approximation to zero gives the update rule θ(k+1) = θ(k) − S′(θ(k)) / S″(θ(k)). A second-order method. Pros: the convergence rate is quadratic once close to the solution. Cons: may not converge at all if started far from the solution.
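The update rule θ(k+1) = θ(k) − S′(θ(k))/S″(θ(k)) can be sketched directly; the quadratic test function is an illustrative choice, on which Newton-Raphson lands on the minimum in a single step.

```python
def newton_raphson(s1, s2, x0, tol=1e-10, max_iter=50):
    """Find a stationary point of S given its first (s1) and second (s2)
    derivatives, via x <- x - S'(x) / S''(x)."""
    x = x0
    for _ in range(max_iter):
        step = s1(x) / s2(x)
        x -= step
        if abs(step) < tol:   # converged: the Newton step is negligible
            break
    return x

# Minimize S(x) = (x - 3)^2, so S'(x) = 2(x - 3) and S''(x) = 2.
x_min = newton_raphson(lambda x: 2 * (x - 3), lambda x: 2.0, x0=0.0)
```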
Univariate Optimization (II) The gradient descent method: the update rule is θ(k+1) = θ(k) − λ S′(θ(k)), where λ is the learning rate. Momentum-based methods: accelerate the convergence of gradient descent by adding a momentum term that takes the history of the search path into account, which speeds up progress in low-curvature regions. Bracketing methods: find a bracket that provably contains the extremum of the function, then repeatedly shrink it.
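A minimal sketch of gradient descent with a momentum term, assuming the common "heavy-ball" form v ← βv − λS′(θ), θ ← θ + v; the learning rate, momentum coefficient, and quadratic test function are illustrative choices.

```python
def gd_momentum(grad, x0, lr=0.1, beta=0.9, iters=200):
    """Univariate gradient descent with momentum:
    v <- beta * v - lr * S'(x);  x <- x + v."""
    x, v = x0, 0.0
    for _ in range(iters):
        v = beta * v - lr * grad(x)  # momentum term remembers past steps
        x = x + v
    return x

# Minimize S(x) = (x - 2)^2, whose gradient is S'(x) = 2(x - 2).
x_min = gd_momentum(lambda x: 2 * (x - 2), x0=10.0)
```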
From 1-D to n-D (I) Two key questions: in which direction should we move from θ(k), and how far should we step in that direction? Multivariate methods: multivariate gradient descent, θ(k+1) = θ(k) − λ∇S(θ(k)). If the learning rate λ is sufficiently small, gradient descent is guaranteed to converge to a local minimum of S.
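The multivariate rule θ ← θ − λ∇S(θ) is a componentwise version of the 1-D rule; a dependency-free sketch (the two-variable quadratic is an illustrative choice):

```python
def grad_descent(grad, theta0, lr=0.1, iters=500):
    """Multivariate gradient descent: theta <- theta - lr * grad S(theta)."""
    theta = list(theta0)
    for _ in range(iters):
        g = grad(theta)
        theta = [t - lr * gi for t, gi in zip(theta, g)]  # componentwise step
    return theta

# Minimize S(x, y) = (x - 1)^2 + (y + 2)^2; minimum at (1, -2).
grad_s = lambda th: [2 * (th[0] - 1), 2 * (th[1] + 2)]
theta_min = grad_descent(grad_s, [0.0, 0.0])
```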
From 1-D to n-D (II) Multivariate methods: Newton's method. If S is quadratic, the step taken by Newton's method points directly to the minimum of S; near the minimum point, S can be regarded as locally quadratic. The method involves inverting a matrix (the Hessian), which may be time-consuming.
From 1-D to n-D (III) Other multivariate methods. Coordinate descent: iteratively conduct univariate descent along each axis of the original space in turn. Conjugate directions: use the principal axes to transform the space, then search along directions in the transformed space. Simplex search: use simplex reflections to choose the search direction, with the size of the simplex acting as a kind of step size.
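Coordinate descent is easy to sketch: cycle through the axes, taking a univariate gradient step along one axis while the others stay fixed. The partial-derivative function and test quadratic are illustrative choices.

```python
def coordinate_descent(grad_i, theta0, lr=0.2, sweeps=100):
    """Cycle through the axes; on each axis take a univariate gradient step
    while all other coordinates stay fixed at their current values."""
    theta = list(theta0)
    for _ in range(sweeps):
        for i in range(len(theta)):
            theta[i] -= lr * grad_i(theta, i)  # partial derivative along axis i
    return theta

# S(x, y) = (x - 3)^2 + 2 * (y - 1)^2; minimum at (3, 1).
def grad_i(th, i):
    return 2 * (th[0] - 3) if i == 0 else 4 * (th[1] - 1)

sol = coordinate_descent(grad_i, [0.0, 0.0])
```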
Constrained optimization What is constrained optimization? The parameters take values in a feasible region defined by constraints, instead of the whole space: minimize S(θ) subject to equality and/or inequality constraints. How to solve it? Introduce Lagrange multipliers, form the Lagrangian, and set its gradient to zero. The optimum is characterized by the KKT conditions.
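A small worked example (my own, not from the slides): minimize a quadratic subject to a linear equality constraint using a single Lagrange multiplier.

```latex
\begin{aligned}
&\min_{\theta}\; S(\theta) = \theta_1^2 + \theta_2^2
 \quad \text{s.t.} \quad \theta_1 + \theta_2 = 1,\\
&\mathcal{L}(\theta, \lambda) = \theta_1^2 + \theta_2^2
 + \lambda\,(1 - \theta_1 - \theta_2),\\
&\frac{\partial \mathcal{L}}{\partial \theta_1} = 2\theta_1 - \lambda = 0,\quad
 \frac{\partial \mathcal{L}}{\partial \theta_2} = 2\theta_2 - \lambda = 0,\quad
 \frac{\partial \mathcal{L}}{\partial \lambda} = 1 - \theta_1 - \theta_2 = 0,\\
&\Rightarrow\; \theta_1 = \theta_2 = \tfrac{1}{2},\quad \lambda = 1.
\end{aligned}
```

For inequality constraints, the same construction plus complementary-slackness requirements gives the KKT conditions mentioned above.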
Maximizing likelihood with missing data Problem setting. Given: a data set, with hidden variables associated with each data point. Goal: optimize the parameters θ w.r.t. the log-likelihood function. Challenge: both the parameters and the hidden variables are unknown! Idea: fix one and determine the other, then fix the other and determine the first, alternating between the two.
The EM Algorithm The Expectation-Maximization (EM) algorithm is used for finding maximum-likelihood estimates of parameters in probabilistic models that depend on unobserved latent variables. It is an iterative procedure alternating between an Expectation step and a Maximization step. E step: compute the expectation of the log-likelihood w.r.t. the conditional distribution of the hidden variables under the current parameter estimate. M step: compute the parameters that maximize the expected log-likelihood found in the E step.
Why does EM work? The idea behind EM: find a lower bound of the log-likelihood function; at each step, optimize the lower bound w.r.t. one set of unknowns while fixing the other set at its current values; iterate until the process converges. Maximizing the lower bound yields parameters that increase the log-likelihood function.
The basic idea of EM, with some mathematics The lower bound: Jensen's inequality; let's check it out in detail. Iterative optimization: from the k-th round to the (k+1)-th round, alternately maximize the bound F over the hidden-variable distribution and over the parameters. Convergence: the value of F increases on every successive round, and F is bounded from above by the log-likelihood L, which has a maximum value.
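The formulas missing from this slide can be reconstructed in the standard way (this is the textbook derivation, not necessarily the slide's exact notation): for any distribution q over the hidden variables h,

```latex
\begin{aligned}
L(\theta) = \log p(x \mid \theta)
          &= \log \sum_{h} q(h)\, \frac{p(x, h \mid \theta)}{q(h)}\\
          &\ge \sum_{h} q(h) \log \frac{p(x, h \mid \theta)}{q(h)}
           \;=:\; F(q, \theta),
\end{aligned}
```

where the inequality is Jensen's, since log is concave. The E step maximizes F over q with θ fixed, giving q(h) = p(h | x, θ(k)) and making the bound tight; the M step maximizes F over θ with q fixed. Hence F, and therefore L, never decreases across rounds.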
The EM cookbook To use EM, the following must be determined: What is the log-likelihood function? (EM can be applied to any likelihood function.) What are the model parameters, and what are the hidden variables? What is the expectation in the E step, and how is it computed? What is to be maximized based on that expectation, and how is it computed?
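As a concrete instance of the cookbook (a common textbook example, not the lecture's notation): EM for a 1-D mixture of two unit-variance Gaussians. The hidden variable is each point's component label; the E step computes each point's expected responsibility, and the M step re-estimates the means and the mixing weight.

```python
import math

def em_mixture(data, mu=(-1.0, 1.0), pi=0.5, iters=50):
    """EM for a two-component, unit-variance 1-D Gaussian mixture."""
    mu0, mu1 = mu
    for _ in range(iters):
        # E step: expected responsibility of component 1 for each point
        r = []
        for x in data:
            p0 = (1 - pi) * math.exp(-0.5 * (x - mu0) ** 2)
            p1 = pi * math.exp(-0.5 * (x - mu1) ** 2)
            r.append(p1 / (p0 + p1))
        # M step: parameters maximizing the expected log-likelihood
        n1 = sum(r)
        n0 = len(data) - n1
        mu0 = sum((1 - ri) * x for ri, x in zip(r, data)) / n0
        mu1 = sum(ri * x for ri, x in zip(r, data)) / n1
        pi = n1 / len(data)
    return mu0, mu1, pi

# Two well-separated clusters around -2 and +2.
data = [-2.2, -1.9, -2.1, 1.8, 2.0, 2.2, 1.9]
mu0, mu1, pi = em_mixture(data)
```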
Optimizing parameters with a single scan In many online applications, data arrive in a stream, there is limited space for storing training data, and a quick response is needed for each input: we can only receive one data point, make use of it, and then discard it. Stochastic approximation can be applied: reformulate the batch score function as an instantaneous score function that depends only on the current example, and optimize the instantaneous score function (e.g., take a gradient descent step). The average instantaneous score function should asymptotically approach the batch score function to guarantee that the approximation is good.
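A single-scan sketch, with illustrative choices throughout: estimate the mean of a stream by taking one gradient step on the instantaneous score Sᵢ(θ) = (xᵢ − θ)², then discarding the point. (In practice a decreasing learning rate is used to guarantee stochastic-approximation convergence; a constant rate keeps the sketch short.)

```python
def online_mean(stream, lr=0.1):
    """One pass over the stream; each point is used once and discarded."""
    theta = 0.0
    for x in stream:
        grad = -2 * (x - theta)  # gradient of the instantaneous score
        theta -= lr * grad       # single stochastic gradient step
    return theta

est = online_mean([5.0] * 100)   # converges toward the stream mean
```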
Heuristic search and optimization (I) Genetic search: genetic algorithms are a general set of heuristic search techniques based on ideas from evolutionary biology. A GA framework: represent models as chromosomes (binary strings); evolve a population of such chromosomes by selectively pairing them (according to their fitness, defined by a score function) and mutating chromosomes to create offspring. Essential ideas of GAs: maintain a set of candidate models instead of one, allowing simultaneous exploration of the state space; create new states by combining current states, allowing jumps to different parts of the state space to avoid getting stuck in local minima.
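A toy GA in the framework above: bitstring chromosomes, fitness = number of 1-bits ("onemax"), tournament selection, single-point crossover, and per-bit mutation. All parameters are illustrative choices.

```python
import random

def ga(n_bits=20, pop_size=30, generations=60, p_mut=0.05, seed=0):
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(n_bits)] for _ in range(pop_size)]
    fitness = sum  # score function: number of 1-bits (maximize)

    for _ in range(generations):
        new_pop = []
        for _ in range(pop_size):
            # selection: fitter of two random chromosomes (tournament of 2)
            a, b = rng.sample(pop, 2)
            p1 = max(a, b, key=fitness)
            a, b = rng.sample(pop, 2)
            p2 = max(a, b, key=fitness)
            # crossover: a single cut point combines the two parents
            cut = rng.randrange(1, n_bits)
            child = p1[:cut] + p2[cut:]
            # mutation: flip each bit with small probability
            child = [bit ^ 1 if rng.random() < p_mut else bit for bit in child]
            new_pop.append(child)
        pop = new_pop
    return max(pop, key=fitness)

best = ga()  # a near-all-ones chromosome
```

Because the whole population evolves together, many regions of the state space are explored at once, and crossover can jump to parts of the space no single-path search would reach.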
Heuristic search and optimization (II) Simulated annealing: a heuristic search technique based on ideas from physics. The framework: allow moves in the state space that decrease the score function to be minimized, but also allow some moves (with some probability) that increase it, controlled by a temperature that gradually decreases. Key idea: a higher temperature enables large moves that explore many parts of the parameter space at the beginning, in the hope that these large moves lead to the deepest basin; a lower temperature reduces the chance of such moves, so that the search stabilizes into a local search and avoids escaping from the deep basin.
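A minimal simulated-annealing sketch minimizing a 1-D score with several local minima; the temperature schedule, proposal width, and test function are all illustrative choices.

```python
import math
import random

def score(x):
    return x * x + 10 * math.sin(x)  # global minimum near x ≈ -1.3

def simulated_annealing(x0=8.0, temp=10.0, cooling=0.99, steps=3000, seed=1):
    rng = random.Random(seed)
    x, best = x0, x0
    for _ in range(steps):
        cand = x + rng.uniform(-1, 1)           # propose a move
        delta = score(cand) - score(x)
        # accept improvements always; worse moves with prob e^{-delta / T}
        if delta < 0 or rng.random() < math.exp(-delta / temp):
            x = cand
        if score(x) < score(best):              # track the best state visited
            best = x
        temp *= cooling                         # gradually lower temperature
    return best

best_x = simulated_annealing()
```

Early on, the high temperature lets the search climb out of shallow basins; as the temperature drops, the acceptance of uphill moves vanishes and the search settles into a local descent.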
Let's move on to Chapter 9.