CS281 Section 3: Practical Optimization

David Duvenaud and Dougal Maclaurin

Most parameter estimation problems in machine learning cannot be solved in closed form, so we often have to resort to numerical optimization. In this section we'll describe some of the common optimization techniques used in machine learning, when to use them, and common pitfalls.

1 Gradient-free optimization

Gradient-free optimization is always a slog, and usually a bad idea, but we sometimes find ourselves doing it anyway. There are a few standard fallbacks that will sometimes sort of work, at least for small problems:

1.1 Grid search

This just means choosing a set of values for each dimension and exhaustively trying all combinations. The number of function evaluations scales exponentially in the number of dimensions you're optimizing over.

1.2 Random search

Random search just means trying completely random points, with no adaptation. This is usually actually better than grid search. The reason is that adding irrelevant dimensions doesn't hurt random search at all, and often one doesn't know in advance which dimensions of the problem are important. With grid search, by contrast, adding an irrelevant dimension at least doubles the total time taken. (Both are sketched in code below.)

1.3 Bayesian optimization

Can't we do something smarter? How would a person go about optimizing a function? Bayesian optimization is a nice way to optimize functions when they're expensive enough that it's worth thinking about where to evaluate next. BayesOpt has been independently re-discovered many times, because it's a very natural way to approach the problem: write down a huge set of functions that you might be optimizing (given what you've seen so far), and then ask which point is most likely to be better than the best one you've seen so far. The downsides are that it's slow and requires dedicated software (github.com/hips/spearmint).
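
To make Sections 1.1 and 1.2 concrete, here is a minimal sketch of grid search versus random search under the same evaluation budget; the two-dimensional objective and the budget are made up for illustration:

    import numpy as np

    def f(x, y):
        # Hypothetical expensive black-box objective; only y matters,
        # so x is an "irrelevant" dimension in the sense of Section 1.2.
        return (y - 0.3)**2

    budget = 25  # total number of function evaluations we can afford

    # Grid search: 5 values per dimension = 5^2 = 25 evaluations.
    grid = np.linspace(0.0, 1.0, 5)
    grid_best = min(f(x, y) for x in grid for y in grid)

    # Random search: the same budget, but every evaluation tries a new y.
    rng = np.random.RandomState(0)
    points = rng.uniform(0.0, 1.0, size=(budget, 2))
    random_best = min(f(x, y) for x, y in points)

    print("best found by grid search:  ", grid_best)
    print("best found by random search:", random_best)

Because the grid only ever tries five distinct values of the one coordinate that matters, while random search tries a fresh value on every evaluation, random search usually ends up closer to the optimum; that is the point of Section 1.2.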

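And here is a stripped-down sketch of the Bayesian optimization loop itself, standing in for dedicated software like Spearmint; the one-dimensional objective, the use of scikit-learn's Gaussian process, and the expected-improvement acquisition over random candidates are all assumptions made to keep the example short:

    import numpy as np
    from scipy.stats import norm
    from sklearn.gaussian_process import GaussianProcessRegressor
    from sklearn.gaussian_process.kernels import Matern

    def objective(x):
        # Hypothetical expensive function of one variable, to be minimized.
        return (x - 0.7)**2 + 0.05 * np.sin(20 * x)

    rng = np.random.RandomState(0)
    X = rng.uniform(0, 1, size=(3, 1))        # a few initial evaluations
    y = objective(X).ravel()

    for _ in range(10):
        gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True)
        gp.fit(X, y)

        # Score a cloud of random candidates by expected improvement
        # over the best value seen so far.
        candidates = rng.uniform(0, 1, size=(1000, 1))
        mu, sigma = gp.predict(candidates, return_std=True)
        best = y.min()
        z = (best - mu) / np.maximum(sigma, 1e-9)
        ei = (best - mu) * norm.cdf(z) + sigma * norm.pdf(z)

        # Evaluate the most promising candidate and add it to the dataset.
        x_next = candidates[np.argmax(ei)]
        X = np.vstack([X, x_next])
        y = np.append(y, objective(x_next[0]))

    print("best x found:", X[np.argmin(y)].item(), "value:", y.min())

Real packages are much more careful about the GP hyperparameters and the search over the acquisition function, but the loop has this shape: fit a model to what you've seen, then evaluate wherever the model thinks improvement is most likely.
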
1.4 Are you sure there aren't gradients?

Having gradients is just so much better than not. If you can't use gradients, chances are you'll never be able to optimize more than, say, 10 dimensions. If you do have gradients, the sky's the limit: 1,000,000 parameters? No problem! The more parameters you have, the more information you have about how to optimize them. Maybe you can find a way to get gradients into the picture somehow. Can you find a continuous relaxation of your problem? How are you computing your function? Why not just differentiate that?

2 Gradient Descent

How would we minimize some function f(x) given access to queries of both f(x) itself and its gradient ∇f(x)? If we start at some x = x_0, an obvious strategy is just to go downhill:

    x_{i+1} = x_i - α ∇f(x_i)    (1)

This method is simple and effective, but has a couple of problems. First, the parameter α, known as the learning rate or step size, has to be set to roughly the right size, and it's hard to know how big it should be ahead of time. People sometimes also change α at each iteration, so that the initial steps are large and later ones become smaller. An alternative is not to use a fixed α at all, but to perform an explicit line search in the direction of the gradient ∇f(x).

2.1 The problem with gradient descent: ravines and saddle points

Gradient descent becomes difficult when the function being optimized looks locally like a ravine, or a saddle. (Show animations from http://imgur.com/a/hqolp)

3 Why not use the Hessian? Second-order methods

The matrix of second derivatives is known as the Hessian, A, which is of size D by D:

    A_{ij} = ∂²f(x) / ∂x_i ∂x_j    (2)

In high dimensions, gradient descent is slow when the local Hessian is ill-conditioned. The condition number of a matrix is the ratio of its largest eigenvalue to its smallest, and matrices with a large condition number are known as ill-conditioned.
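
To see the conditioning problem concretely, here is a small sketch of plain gradient descent (equation (1)) on a two-dimensional quadratic whose Hessian has eigenvalues 1 and 100; the matrix, the step size, and the iteration count are all made-up choices for illustration:

    import numpy as np

    # A poorly conditioned 2-D quadratic: f(x) = 0.5 x^T A x - b^T x,
    # with condition number 100 / 1 = 100.
    A = np.diag([1.0, 100.0])
    b = np.array([1.0, 1.0])

    def grad(x):
        return A @ x - b

    x = np.zeros(2)
    alpha = 0.018          # must stay below 2/100 = 0.02 to avoid divergence
    for i in range(100):
        x = x - alpha * grad(x)

    print("after 100 steps: ", x)
    print("exact minimizer: ", np.linalg.solve(A, b))

The step size is capped by the largest eigenvalue, so progress along the shallow direction is painfully slow; rescaling by the inverse Hessian, which is the subject of the next section, removes exactly this problem.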

4 Quasi-Newton methods: Conjugate Gradients and L-BFGS

Imagine minimizing a poorly conditioned quadratic function:

    f(x) = (1/2) xᵀAx - bᵀx    (3)

Imagine what would happen if you rescaled things: if you squished the space so that the elliptical contours became circles, the Hessian would be the identity, and things would become very easy. Newton's method does exactly this rescaling, and would take us right to the optimum in one step:

    x_{i+1} = x_i - A⁻¹ ∇f(x_i)    (4)

Even if we're not optimizing a quadratic, A⁻¹ ∇f(x_i) is often a much better direction to move in than ∇f(x_i). The problem is that if D is large, it's too expensive to compute and store A (O(D²) time and space), let alone invert it (O(D³) time). BFGS and CG work by implicitly building an estimate of the inverse Hessian as they go.

4.1 Quasi-Newton is easy to use

One of the main advantages of quasi-Newton methods is that they usually work out of the box, without the need to tune a learning rate or any other tolerances.

4.2 What if it's expensive to compute gradients?

For many optimization problems in machine learning, people opt not to use quasi-Newton (second-order) methods, even though they're far easier to use. The reason is usually that computing the exact gradient requires summing over all datapoints in a large training set, or computing an intractable integral. In that case, just evaluating the exact gradient can take minutes or hours, and it has to be completely recomputed after each step. If our gradient is just a sum or average over datapoints, could we take a shortcut and estimate the gradient using only a tiny subset of the data? We could even use a different random subset every time, so that any bias would be averaged out. This is known as using minibatches. Using minibatches (usually 100 or 200 datapoints) speeds up the gradient computation massively, especially if your dataset is large. Unfortunately, because of the variance that subsampling introduces into the gradients, quasi-Newton methods can become unstable or get stuck. Sometimes people use quasi-Newton methods anyway (with relatively large batches), but usually people switch to stochastic gradient descent. (By the way, coming up with a quasi-Newton method that's robust to noisy gradients is an active research area, and it would be a huge contribution if it could be made to work robustly!)

4.3 Stochastic Gradient Descent (SGD)

SGD is a workhorse of machine learning. The basic recipe is the same as standard gradient descent, just using a noisy approximation to the true gradient. A popular variant is SGD with momentum (sketched below). (Show animations from http://imgur.com/SmDARzn) Coming up with variants of SGD is an active research area.
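
Here is a minimal sketch of minibatch SGD with momentum on a toy least-squares problem; the synthetic data, batch size, learning rate, and momentum value are all assumptions chosen for illustration:

    import numpy as np

    # Synthetic linear-regression data: y = X w_true + noise.
    rng = np.random.RandomState(0)
    N, D = 10000, 5
    X = rng.randn(N, D)
    w_true = rng.randn(D)
    y = X @ w_true + 0.1 * rng.randn(N)

    def minibatch_grad(w, batch_size=100):
        # Gradient of the mean squared error, estimated on a random minibatch.
        idx = rng.randint(N, size=batch_size)
        Xb, yb = X[idx], y[idx]
        return 2.0 / batch_size * Xb.T @ (Xb @ w - yb)

    w = np.zeros(D)
    velocity = np.zeros(D)
    alpha, momentum = 0.01, 0.9
    for step in range(2000):
        velocity = momentum * velocity - alpha * minibatch_grad(w)
        w = w + velocity

    print("distance from w_true:", np.linalg.norm(w - w_true))

With a decaying learning rate the iterates would settle down instead of hovering around the optimum because of the gradient noise.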

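For contrast with Section 4.1: in the full-batch setting you can usually hand the exact objective and gradient to an off-the-shelf quasi-Newton routine and tune nothing. A minimal sketch using scipy's L-BFGS-B on the same kind of synthetic regression problem (the data and settings are again made up):

    import numpy as np
    from scipy.optimize import minimize

    # The same kind of synthetic regression problem as in the SGD sketch above.
    rng = np.random.RandomState(0)
    N, D = 10000, 5
    X = rng.randn(N, D)
    w_true = rng.randn(D)
    y = X @ w_true + 0.1 * rng.randn(N)

    def loss_and_grad(w):
        # Exact (full-batch) mean squared error and its gradient.
        residual = X @ w - y
        return np.mean(residual**2), 2.0 / N * X.T @ residual

    result = minimize(loss_and_grad, np.zeros(D), jac=True, method="L-BFGS-B")
    print("converged:", result.success, "after", result.nit, "iterations")
    print("distance from w_true:", np.linalg.norm(result.x - w_true))

No step sizes, no momentum; the only things supplied were the function and its gradient.
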
5 Computing Gradients

So, how do we compute gradients of our function? The answer is: by using the chain rule, the product rule, and all the standard identities from calculus. The good news is that this process is entirely automatable (ignoring numerical issues). If you can write down your function as a series of operations, chances are an automatic differentiation library will be able to take the derivative for you.

5.1 Automatic differentiation

Most popular languages have a few automatic differentiation libraries. The most popular one for Python is called Theano; its best feature is that it can run on the CPU or GPU. Since days-long optimizations are the bottleneck for a lot of modern machine learning, using GPUs is almost a necessity in some areas of research. The drawback of Theano is that it requires you to learn another mini-language in which to express your computation. An autodiff library that works on plain Python and Numpy code is autograd: github.com/HIPS/autograd

5.2 Reverse-mode differentiation

A quick aside - there are multiple ways to compute derivatives of multivariate functions, and which one is fastest depends on how many inputs and how many outputs your function has. When optimizing, we usually have many inputs and a single output. In this case, reverse-mode differentiation is the only practical option: it takes about as long to compute the gradient as it does to compute the original function. In the neural network literature, the word "backpropagation" means exactly reverse-mode differentiation.

5.3 Checking Gradients

Gradients are nice to work with because they're easy to verify numerically (a check is sketched in code below). We can do so by using the definition of the derivative:

    ∂f(x)/∂x = lim_{h→0} [f(x + h) - f(x)] / h    (5)

6 Constrained Optimization

There is a lot of literature on constrained optimization. Constrained optimization just means that your parameter space isn't R^D but some subset of it, so you have to do extra work to stay in the allowable region. For example: if we want to optimize the variance parameter σ² of a Gaussian, it can only be positive. You could use constrained optimization, but you can also simply optimize the log, y = log σ². Since any value of y is valid, you've turned constrained optimization into unconstrained optimization! I've never not been able to turn a constrained optimization problem into an unconstrained one by using some trick along these lines.
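
To make the reparameterization trick in Section 6 concrete, here is a minimal sketch that fits the variance of a Gaussian by optimizing y = log σ² with an ordinary unconstrained optimizer; the synthetic data and the use of scipy's default method are assumptions for illustration:

    import numpy as np
    from scipy.optimize import minimize

    rng = np.random.RandomState(0)
    data = rng.randn(500) * 2.0      # samples from a zero-mean Gaussian, sigma = 2

    def negative_log_likelihood(y):
        # Parameterize by y = log(sigma^2); any real y gives a valid variance.
        sigma2 = np.exp(y)
        return 0.5 * np.sum(np.log(2 * np.pi * sigma2) + data**2 / sigma2)

    result = minimize(negative_log_likelihood, x0=np.array([0.0]))
    print("estimated sigma^2:", np.exp(result.x[0]))
    print("sample variance:  ", np.mean(data**2))

The optimizer never needs to know that σ² must be positive; the exp takes care of it.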

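And here is a finite-difference check in the spirit of equation (5) and the first sanity check below; the function, its hand-coded gradient, and the step size h are hypothetical stand-ins for whatever you are actually optimizing:

    import numpy as np

    def f(x):
        return np.sum(np.sin(x)) + 0.5 * np.dot(x, x)

    def grad_f(x):
        # Hand-derived gradient that we want to verify.
        return np.cos(x) + x

    def numerical_grad(f, x, h=1e-6):
        # One-sided finite differences, one coordinate at a time.
        g = np.zeros_like(x)
        for i in range(len(x)):
            dx = np.zeros_like(x)
            dx[i] = h
            g[i] = (f(x + dx) - f(x)) / h
        return g

    x = np.random.RandomState(0).randn(5)
    print("max abs difference:",
          np.max(np.abs(grad_f(x) - numerical_grad(f, x))))

Central differences, [f(x + h) - f(x - h)] / 2h, cost twice as many evaluations but are noticeably more accurate.
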
7 Optimization sanity checks

- Check your gradient numerically.
- Check that random restarts converge to similar final values (the same value, if the problem is convex).
- Start from a known optimum, and check that the optimizer doesn't move.

8 Takeaways

- If the gradient is deterministic (batch optimization): use BFGS or CG.
- If the problem is convex: you can use something fancier, but BFGS or CG is still usually OK.
- If the gradient is stochastic (minibatches): use SGD with momentum (or a variant).
- If there are no gradients: use Bayesian optimization (Spearmint) or random search. No genetic algorithms!
- Always check your gradients!

9 Time permitting - examples

- At a terminal, run pip install autograd and try the examples at github.com/hips/autograd/examples
- Show that BFGS doesn't really work for stochastic gradients.
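
If you want to try the last exercise before digging into the repo, here is one small self-contained way to do it; this is a sketch of my own rather than one of the packaged examples, and the quadratic, the noise level, and the use of scipy's BFGS are all assumptions. It computes the exact gradient with autograd, then hands BFGS either that gradient or a deliberately noisy version of it, and compares where each run ends up:

    import numpy
    import autograd.numpy as np          # thinly wrapped numpy, so grad() can trace it
    from autograd import grad
    from scipy.optimize import minimize

    rng = numpy.random.RandomState(0)
    D = 20
    A = np.diag(np.linspace(1.0, 100.0, D))   # an ill-conditioned quadratic in 20-D
    b = rng.randn(D)

    def f(x):
        return 0.5 * np.dot(x, np.dot(A, x)) - np.dot(b, x)

    exact_grad = grad(f)                  # autograd derives the gradient for us

    def noisy_grad(x):
        # Stand-in for a minibatch gradient: the exact gradient plus added noise.
        return exact_grad(x) + rng.randn(D)

    x_star = numpy.linalg.solve(A, b)     # true optimum, for reference
    for name, g in [("exact gradient", exact_grad), ("noisy gradient", noisy_grad)]:
        result = minimize(f, np.zeros(D), jac=g, method="BFGS")
        print(name, "-> distance from optimum:",
              numpy.linalg.norm(result.x - x_star))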