CS281 Section 3: Practical Optimization
David Duvenaud and Dougal Maclaurin

Most parameter estimation problems in machine learning cannot be solved in closed form, so we often have to resort to numerical optimization. In this section we'll describe some of the common optimization techniques used in machine learning, when to use them, and common pitfalls.

1 Gradient-free optimization

Gradient-free optimization is always a slog, and usually a bad idea. But we sometimes find ourselves doing it anyway. There are a few standard fallbacks that will sometimes sort of work, at least for small problems:

1.1 Grid search

This just means choosing a set of values for each dimension, and exhaustively trying all combinations. The number of function evaluations scales exponentially in the number of dimensions you're optimizing over.

1.2 Random search

Random search just means trying completely random points, with no adaptation. This is usually better than grid search. The reason is that adding irrelevant dimensions to your problem doesn't hurt you at all. Often, one doesn't know which dimensions of the problem are important. With grid search, adding an irrelevant dimension at least doubles the total time taken.

1.3 Bayesian optimization

Can't we do something smarter? How would a person go about optimizing a function? Bayesian optimization is a nice way to optimize functions when they're expensive enough that it's worth thinking about where to evaluate next. BayesOpt has been independently rediscovered many times, because it's a very natural way to approach the problem: write down a huge set of functions that you might be optimizing (given what you've seen so far), and then ask which point is most likely to be better than the best one you've seen so far. The downsides are that it's slow and requires dedicated software (github.com/hips/spearmint).
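
To make the comparison between grid search (Section 1.1) and random search (Section 1.2) concrete, here is a minimal Python sketch. The objective, the unit-square search range, and the budget of 25 evaluations are all hypothetical choices made for illustration; the second input is deliberately irrelevant.

```python
import itertools
import random

def objective(x, y):
    # Hypothetical test function: only x matters, y is an irrelevant dimension.
    return (x - 0.3) ** 2

def grid_points(points_per_dim):
    # Evenly spaced grid over the unit square [0, 1] x [0, 1].
    axis = [i / (points_per_dim - 1) for i in range(points_per_dim)]
    return list(itertools.product(axis, axis))

def random_points(n_points, seed=0):
    # Points drawn uniformly at random from the unit square.
    rng = random.Random(seed)
    return [(rng.random(), rng.random()) for _ in range(n_points)]

grid = grid_points(5)       # 5 * 5 = 25 evaluations
rand = random_points(25)    # same budget of 25 evaluations

grid_best = min(objective(x, y) for x, y in grid)
rand_best = min(objective(x, y) for x, y in rand)

# How many different values of the one dimension that matters did we try?
grid_distinct_x = len({x for x, _ in grid})
rand_distinct_x = len({x for x, _ in rand})

print("grid:  ", grid_distinct_x, "distinct x values, best =", grid_best)
print("random:", rand_distinct_x, "distinct x values, best =", rand_best)
```

With the same budget, the grid only ever tries 5 distinct values of the dimension that matters, while random search tries 25. Which method finds the lower value on a particular run still depends on the random draw.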

1.4 Are you sure there aren't gradients?

Having gradients is just so much better than not. If you can't use gradients, chances are you'll never be able to optimize more than, say, 10 dimensions. If you do have gradients, the sky's the limit. 1,000,000 parameters? No problem! The more parameters you have, the more information the gradient gives you about how to optimize them.

Maybe you can find a way to get gradients into the picture somehow. Can you find a continuous relaxation of your problem? How are you computing your function? Why not just differentiate that?

2 Gradient Descent

How would we minimize some function $f(x)$ given access to queries of both $f(x)$ itself and its gradient $\nabla f(x)$? If we start at some $x = x_0$, an obvious strategy is just to go downhill:

$$x_{i+1} = x_i - \alpha \nabla f(x_i) \tag{1}$$

This method is simple and effective, but has a couple of problems. First, the parameter $\alpha$, known as the learning rate or step size, has to be set to roughly the right size, but it's hard to know ahead of time how big it should be. People sometimes also change $\alpha$ at each iteration, so that the initial steps are large and become smaller over time. An alternative is not to use a fixed $\alpha$ at all, but to perform an explicit line search along the negative gradient direction $-\nabla f(x)$.

2.1 The problem with gradient descent: Ravines and saddle points

Gradient descent becomes difficult when the function being optimized looks locally like a ravine or a saddle. (Show animations from http://imgur.com/a/hqolp)

3 Why not use the Hessian? Second-order methods

The matrix of second derivatives is known as the Hessian, $A$, which is a matrix of size $D$ by $D$:

$$A_{ij} = \frac{\partial^2}{\partial x_i \partial x_j} f(x) \tag{2}$$

In high dimensions, gradient descent is slow when the local Hessian is ill-conditioned. The condition number of a symmetric matrix such as the Hessian is the ratio of its largest to its smallest eigenvalue (in absolute value). Matrices with a large condition number are known as ill-conditioned.
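
A small sketch of why ill-conditioning hurts gradient descent, assuming a two-dimensional quadratic whose Hessian is diagonal with eigenvalues 1 and kappa (so its condition number is kappa). The function name, starting point, tolerance, and the step size alpha = 1/kappa are illustrative assumptions, not part of the notes.

```python
def gradient_descent_steps(eigenvalues, alpha, tol=1e-6, max_steps=100000):
    # Minimize f(x) = 0.5 * sum_d eigenvalues[d] * x[d]**2,
    # starting from x = (1, ..., 1), with a fixed step size alpha.
    # Returns the number of steps taken before every gradient entry is below tol.
    x = [1.0] * len(eigenvalues)
    for step in range(max_steps):
        grad = [lam * xd for lam, xd in zip(eigenvalues, x)]
        if max(abs(g) for g in grad) < tol:
            return step
        x = [xd - alpha * g for xd, g in zip(x, grad)]
    return max_steps

steps_by_kappa = {}
for kappa in [1.0, 10.0, 100.0]:
    eigenvalues = [1.0, kappa]   # Hessian eigenvalues; condition number = kappa
    alpha = 1.0 / kappa          # step size limited by the steepest direction
    steps_by_kappa[kappa] = gradient_descent_steps(eigenvalues, alpha)
    print("condition number", kappa, "->", steps_by_kappa[kappa], "steps")
```

The step count grows roughly in proportion to kappa: the step size has to be small enough for the steep direction, which makes progress along the shallow direction slow.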

4 Quasi-Newton methods: Conjugate Gradients and L-BFGS

Imagine minimizing a poorly conditioned quadratic function:

$$f(x) = \frac{1}{2} x^T A x - b^T x \tag{3}$$

Imagine what would happen if you rescaled things. If you squished the space so that the elliptical contours became circles, the Hessian would be the identity, and things would become very easy. Similarly, Newton's method, which would take us right to the optimum of a quadratic in one step, is:

$$x_{i+1} = x_i - A^{-1} \nabla f(x_i) \tag{4}$$

Even if we're not optimizing a quadratic, $-A^{-1} \nabla f(x_i)$ is often a much better direction to move in than $-\nabla f(x_i)$. The problem is that if $D$ is large, it's too expensive to compute and store $A$ (it takes $O(D^2)$ time and space), let alone invert it ($O(D^3)$ time). BFGS and CG work by implicitly building an estimate of the inverse Hessian as they go.

4.1 Quasi-Newton is easy to use

One of the main advantages of Quasi-Newton methods is that they usually work out of the box, without the need to tune the learning rate or any of the other tolerances.

4.2 What if it's expensive to compute gradients?

For many optimization problems in machine learning, people opt not to use Quasi-Newton (second-order) methods, even though they're far easier to use. The reason is usually that computing the exact gradient requires summing over all datapoints in a large training set, or computing an intractable integral. In this case, just evaluating the exact gradient can take minutes or hours, and it has to be completely recomputed after each step.

If our gradient is just a sum or average over datapoints, could we take a shortcut and estimate the gradient using a tiny subset of the data? We could even use a different random subset every time, so that any bias would be averaged out. This is known as using minibatches. Using minibatches (usually 100 or 200 datapoints) speeds up the gradient computation massively, especially if your dataset is large. Unfortunately, because of the variance in the gradients introduced by subsampling, Quasi-Newton methods can become unstable or get stuck.
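
The minibatch idea can be sketched on a toy problem. Everything here is an assumption for illustration: a synthetic scalar dataset, the loss f(w) = mean over n of 0.5 (w - d_n)^2, and a batch size of 100. The point is only that each minibatch gradient is a cheap, noisy estimate whose average is close to the full gradient.

```python
import random

rng = random.Random(0)

# Hypothetical dataset: 10,000 scalar observations centred near 2.
data = [rng.gauss(2.0, 1.0) for _ in range(10000)]

def full_gradient(w, data):
    # Exact gradient of f(w) = mean over n of 0.5 * (w - data[n])**2.
    return sum(w - d for d in data) / len(data)

def minibatch_gradient(w, data, batch_size, rng):
    # Same gradient, estimated from a random subset of the data.
    batch = rng.sample(data, batch_size)
    return sum(w - d for d in batch) / batch_size

w = 0.0
exact = full_gradient(w, data)   # touches all 10,000 datapoints

# Each estimate touches only 100 datapoints (100x cheaper than the exact gradient).
estimates = [minibatch_gradient(w, data, 100, rng) for _ in range(1000)]
mean_estimate = sum(estimates) / len(estimates)
spread = max(estimates) - min(estimates)

print("exact gradient:           ", exact)
print("mean of minibatch grads:  ", mean_estimate)
print("spread of minibatch grads:", spread)
```

The nonzero spread is the gradient noise that the surrounding text blames for destabilizing Quasi-Newton methods.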
Sometimes people use Quasi-Newton methods anyway (with relatively large batches), but usually people switch to stochastic gradient descent. (By the way, coming up with a Quasi-Newton method that's robust to noisy gradients is an active research area, and would be a huge contribution if it could be made to work robustly!)

4.3 Stochastic Gradient Descent (SGD)

SGD is a workhorse of machine learning. The basic recipe is the same as standard gradient descent, just using a noisy approximation to the true gradient. A popular variant is SGD with momentum. (Show animations from http://imgur.com/SmDARzn) Coming up with variants of SGD is an active research area.

5 Computing Gradients

So, how do we compute gradients of our function? The answer is: by using the chain rule, the product rule, and all the standard identities from calculus. The good news is, this process is entirely automatable (ignoring numerical issues). If you can write down your function as a series of operations, chances are an automatic differentiation library will be able to take the derivative for you.

5.1 Automatic differentiation

Most popular languages have a few automatic differentiation libraries. The most popular one for Python is called Theano. Its best feature is that it can run on the CPU or GPU. Since running days-long optimizations is the bottleneck for a lot of modern machine learning, using GPUs is almost a necessity for some areas of research. The drawback of Theano is that it requires you to learn another mini-language in which to express your computation. An autodiff library that works on plain Python and Numpy code is autograd: github.com/HIPS/autograd

5.2 Reverse-mode differentiation

A quick aside: there are multiple ways to compute derivatives of multivariate functions. Which one is fastest depends on how many inputs and how many outputs your function has. When optimizing, we usually have many inputs and a single output. In this case, reverse-mode differentiation is the only practical option. Computing the gradient takes about as long as computing the original function (within a small constant factor). In the neural network literature, the word backpropagation means exactly reverse-mode differentiation.

5.3 Checking Gradients

Gradients are nice to work with, because they're easy to verify numerically. We can do so by using the definition of the derivative:

$$\frac{\partial f(x)}{\partial x} = \lim_{h \to 0} \frac{f(x + h) - f(x)}{h} \tag{5}$$

In practice, evaluate the right-hand side with a small but finite $h$ for each coordinate in turn, and compare the result with the corresponding entry of your gradient.

6 Constrained Optimization

There is a large literature on constrained optimization. Constrained optimization just means that your parameter space isn't $\mathbb{R}^D$, but some subset of it, so you have to do extra work to stay in the allowable region. For example, if we want to optimize the variance parameter $\sigma^2$ of a Gaussian, it can only be positive. You could use a constrained optimizer, but you can also simply optimize the log, $y = \log \sigma^2$.
Since any value of $y$ is valid, you've turned constrained optimization into unconstrained optimization! I've never not been able to turn a constrained optimization problem into an unconstrained one by using some trick along these lines.

7 Optimization sanity checks

- Check the gradient numerically.
- Check that random restarts converge to similar final values (the same value, if the problem is convex).
- Start from a known optimum, and check that the optimizer doesn't move.

8 Takeaways

- If the gradient is deterministic (batch optimization), use BFGS or CG.
- If the problem is convex, you can use fancier methods, but BFGS or CG is still usually OK.
- If the gradient is stochastic (minibatches), use SGD with momentum (or a variant).
- If there are no gradients, use Bayesian optimization (spearmint) or random search. No genetic algorithms!
- Always check your gradients!

9 Time permitting - examples

At a terminal, run pip install autograd and run the examples: github.com/hips/autograd/examples

Show that BFGS doesn't really work for stochastic gradients.
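
A minimal sketch of the numerical gradient check recommended above, in plain Python. It swaps the one-sided difference from equation (5) for a central difference, (f(x + h) - f(x - h)) / 2h, which is usually more accurate for the same h. The test function, the step h = 1e-5, and the tolerance are hypothetical choices.

```python
import math

def f(x):
    # Hypothetical test function of a vector x (a list of three floats).
    return math.sin(x[0]) * x[1] ** 2 + math.exp(x[2])

def grad_f(x):
    # Hand-derived gradient of f; this is the code we want to verify.
    return [math.cos(x[0]) * x[1] ** 2,
            2.0 * math.sin(x[0]) * x[1],
            math.exp(x[2])]

def numerical_gradient(func, x, h=1e-5):
    # Central difference in each coordinate d:
    # (f(x + h e_d) - f(x - h e_d)) / (2 h).
    grad = []
    for d in range(len(x)):
        x_plus = list(x)
        x_plus[d] += h
        x_minus = list(x)
        x_minus[d] -= h
        grad.append((func(x_plus) - func(x_minus)) / (2.0 * h))
    return grad

def check_gradient(func, grad, x, tol=1e-6):
    # Compare analytic and numerical gradients entry by entry.
    # Returns (passed, largest absolute difference).
    analytic = grad(x)
    numeric = numerical_gradient(func, x)
    max_error = max(abs(a - n) for a, n in zip(analytic, numeric))
    return max_error < tol, max_error

x0 = [0.5, -1.2, 0.3]
ok, err = check_gradient(f, grad_f, x0)
print("gradient check passed:", ok, "max abs error:", err)
```

It is worth running a check like this at a few random points, since a buggy gradient can happen to agree at one special input.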