Gradient Methods for Machine Learning

Nic Schraudolph

Course Overview
1. Mon: Classical Gradient Methods
   Direct (gradient-free), Steepest Descent, Newton, Levenberg-Marquardt, BFGS, Conjugate Gradient
2. Tue: Stochastic Approximation (SA)
   Why necessary, why difficult. Step size adaptation.
3. Thu: Stochastic Meta-Descent (SMD)
   Advanced stochastic step size adaptation method.
4. Fri: Algorithmic Differentiation (AD)
   Forward/reverse mode. Fast Hessian-vector products.

Classical Gradient Methods
- Note the simultaneous course at the AMSI (math) summer school: Nonlinear Optimization Methods (see http://wwwmaths.anu.edu.au/events/amsiss05/)
- Recommended textbook (Springer Verlag, 1999): Nocedal & Wright, Numerical Optimization
- Here: just a quick overview, unconstrained problems only
- But we will consider large, nonlinear problems

Function Optimization
- Goal: given a (differentiable) function f, find a minimizer w* = argmin_w f(w)
- Gradient methods find only a local minimum
- For convex functions, a local minimum is the global minimum
- In machine learning, f is defined over the data X, e.g. f(w) = sum over x in X of l(w, x) + r(w); it may comprise loss and regularization terms
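To make "a function defined over data" concrete, here is a minimal Python sketch of such an objective and its gradient. The ridge-regression form of the loss, the synthetic data, and the names (f, grad_f, lam) are illustrative assumptions, not taken from the slides.

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 3))                  # data set X
    y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=200)
    lam = 0.01                                     # regularization strength

    def f(w):
        """Loss summed over the data, plus a regularization term."""
        r = X @ w - y
        return 0.5 * r @ r + 0.5 * lam * w @ w

    def grad_f(w):
        """Gradient of f: the quantity the first- and second-order methods below rely on."""
        return X.T @ (X @ w - y) + lam * w

    print(f(np.zeros(3)), grad_f(np.zeros(3)))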

Methods by Gradient Order
- 0th order (direct, gradient-free) methods use only the function values themselves
- 1st order gradient methods additionally use the function's gradient
- 2nd order gradient methods also use the function's Hessian

Direct (Gradient-Free) Methods
- Many distinct algorithms: simulated annealing, Monte Carlo optimization, perturbation methods, SPSA, tabu search, genetic algorithms, evolutionary strategies, ant colony optimization, ...
- They differ in many implementation details but all share a common approach.

Prototypical Direct Method (a runnable sketch follows after this group)
- Randomly initialize a pool W of candidates
- Repeat until converged:
  - pick parent(s) w_i from W
  - generate child(ren) w_i' = perturb(w_i)
  - compare child to parent (or the entire pool): Δ = f(w_i') - f(w_i)
  - if Δ < 0, accept w_i' into W (may replace w_i)
  - else, if doing global optimization, accept w_i' into W with probability P(e^-Δ)

Direct Methods: Advantages
- No need to derive or compute gradients
- Can solve discrete/combinatorial problems
- Can address even non-formalized problems
- Can find the global optimum of non-convex functions
- Highly and easily parallelizable
- Very fast iteration when perturbation and evaluation are both incremental, i.e. O(1)
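A minimal Python sketch of the prototypical direct method above. The Gaussian perturbation, pool size, temperature, and the Metropolis-style acceptance probability exp(-Δ/T) are illustrative assumptions filling in details the slide leaves open.

    import numpy as np

    rng = np.random.default_rng(0)

    def direct_search(f, dim, pool_size=20, iters=2000, sigma=0.3, temperature=1.0):
        W = [rng.normal(size=dim) for _ in range(pool_size)]      # random initial pool of candidates
        for _ in range(iters):
            i = rng.integers(pool_size)                           # pick a parent
            child = W[i] + sigma * rng.normal(size=dim)           # generate a child by perturbation
            delta = f(child) - f(W[i])                            # compare child to parent
            if delta < 0 or rng.random() < np.exp(-delta / temperature):
                W[i] = child                                      # accept: always if better, else with
                                                                  # probability P(e^-delta)
        return min(W, key=f)

    rosenbrock = lambda w: 100.0 * (w[1] - w[0] ** 2) ** 2 + (1.0 - w[0]) ** 2
    print(direct_search(rosenbrock, dim=2))   # makes progress toward the optimum at [1, 1] without gradients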

Direct Methods: Disadvantages
- No sense of the appropriate direction or size of step to take (the perturbation is random)
  - some algorithms try to fix this heuristically
- Takes too many iterations to converge
- Global optimization requires accepting inferior candidates, which is slower still
- No strong mathematical underpinnings: a jungle of ad-hoc heuristics

Gradient Descent
- Perturbs the parameter vector in the steepest downhill direction (= negative gradient):
      w_{t+1} = w_t - η_t g_t,   where g_t = ∇f(w_t)
- The step size η_t can be set
  - to a small positive constant: simple gradient descent
  - by line minimization: steepest descent
  - adaptively (more on this later)
- Advantage: cheap to compute; an iteration typically costs just O(n)

Gradient Descent: Disadvantages
- Line minimization may be expensive
- Convergence is slow for ill-conditioned problems: the number of iterations grows with the condition number of the Hessian

Newton's Method
- Local quadratic model around w_t:
      f(w_t + s) ≈ f(w_t) + g_t^T s + ½ s^T H_t s
  has gradient g_t + H_t s; setting it to zero gives the Newton step
      w_{t+1} = w_t - H_t^{-1} g_t
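The following sketch illustrates the conditioning argument above on a 2-d quadratic f(w) = ½ w^T H w with condition number 100. The test problem and step counts are illustrative choices.

    import numpy as np

    H = np.diag([1.0, 100.0])              # ill-conditioned quadratic bowl
    grad = lambda w: H @ w

    # simple gradient descent: step size limited by the largest eigenvalue,
    # progress limited by the smallest
    w = np.array([1.0, 1.0])
    eta = 0.01                             # must satisfy eta < 2 / lambda_max for stability
    for t in range(100):
        w = w - eta * grad(w)
    print("gradient descent:", w)          # the low-curvature coordinate converges very slowly

    # Newton's method: jumps to the minimum of the quadratic bowl in one step
    w = np.array([1.0, 1.0])
    w = w - np.linalg.solve(H, grad(w))
    print("Newton:", w)                    # exactly [0, 0], regardless of conditioning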

Newton's Method
- Big advantage: jumps directly to the minimum of a quadratic bowl (regardless of ill-conditioning)
- Disadvantages:
  - Hessian is expensive to invert: nearly O(n^3)
  - Hessian must be positive definite
  - may make huge, uncontrolled steps

Gauss-Newton Approximation
- Let f(w) = l(m(w)), with m = model (Jacobian J = ∂m/∂w) and l = loss. Then
      H_f = J^T H_l J + Σ_i (∂l/∂m_i) ∇² m_i
- Gauss-Newton drops the second term:
      G_f = J^T H_l J,   so H_l ≥ 0 implies G_f ≥ 0
- At a minimum (where ∂l/∂m = 0), G_f = H_f
- For sum-squared loss, H_l = I, so G_f = J^T J and the Gauss-Newton step amounts to applying the pseudo-inverse of the Jacobian

Levenberg-Marquardt
      w_{t+1} = w_t - (G_t + λ I)^{-1} g_t
- G_t is the Gauss-Newton approximation to H_t (guaranteed positive semi-definite)
- λ ≥ 0 is adaptively controlled; it limits the step to an elliptical model-trust region
- Fixes Newton's stability issues, but still O(n^3)

Quasi-Newton: BFGS
- Iteratively updates an estimate B of H^{-1}
- Guarantees B^T = B and B positive definite (B > 0)
- Reduces complexity to O(n^2) per iteration
- Requires line minimization along the search direction d_t = -B_t g_t
- Update formula:
      B_{t+1} = B_t + (1 + y_t^T B_t y_t / s_t^T y_t) (s_t s_t^T / s_t^T y_t) - (B_t y_t s_t^T + s_t y_t^T B_t) / s_t^T y_t
  where s_t = w_{t+1} - w_t and y_t = g_{t+1} - g_t
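A compact Python sketch of BFGS with the inverse-Hessian update above. A simple Armijo backtracking rule stands in for the exact line minimization; the Rosenbrock test function, tolerances, and iteration count are illustrative assumptions.

    import numpy as np

    def rosenbrock(w):
        return 100.0 * (w[1] - w[0] ** 2) ** 2 + (1.0 - w[0]) ** 2

    def rosenbrock_grad(w):
        return np.array([-400.0 * w[0] * (w[1] - w[0] ** 2) - 2.0 * (1.0 - w[0]),
                          200.0 * (w[1] - w[0] ** 2)])

    def bfgs(f, grad, w, iters=100):
        B = np.eye(len(w))                  # estimate of the inverse Hessian
        g = grad(w)
        for _ in range(iters):
            d = -B @ g                      # quasi-Newton search direction
            eta, fw = 1.0, f(w)             # backtracking line search (stand-in for line minimization)
            while f(w + eta * d) > fw + 1e-4 * eta * (g @ d):
                eta *= 0.5
            s = eta * d
            w_new = w + s
            g_new = grad(w_new)
            y = g_new - g
            sy = s @ y
            if sy > 1e-12:                  # curvature condition keeps B positive definite
                By = B @ y
                B = (B + np.outer(s, s) * (1.0 + y @ By / sy) / sy
                       - (np.outer(By, s) + np.outer(s, By)) / sy)
            w, g = w_new, g_new
        return w

    print(bfgs(rosenbrock, rosenbrock_grad, np.array([-1.2, 1.0])))  # should end up close to [1, 1]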

Conjugate Gradient
- Search direction: d_t = -g_t + β_t d_{t-1}, with the step length set by line minimization
- Hestenes-Stiefel (1952) formula (a.k.a. Beale-Sørenson):
      β_t = g_t^T (g_t - g_{t-1}) / d_{t-1}^T (g_t - g_{t-1})
- NB: other formulae for β (Polak-Ribière, Fletcher-Reeves) are equivalent for quadratic functions but inferior for nonlinear ones!

Conjugate Gradient: Properties
- Search directions are conjugate (= orthogonal in the local Mahalanobis metric)
- No matrices: each iteration costs only O(n)
- Minimizes a quadratic function exactly in n iterations
- Restart every n iterations for nonlinear functions
- Optimal progress after k < n iterations
- An incremental 2nd-order method! Revolutionary: it drives nearly all large-scale optimization today.

Stochastic Approximation
- Modern ML problems are increasingly data-rich (cheap sensors & storage, ubiquitous networking)
- The classical formulation of the optimization problem is inefficient for large X
- Often want answers online, as the data arrives
  - can't wait for all the data (never-ending stream)
  - need to track non-stationary data (moving target)

Stochastic Approximation (SA)
- Solution: estimate the function, gradient, etc. from a small, current subsample S ⊆ X of the data
- S may be just the current data point (fully online)
- Optimization algorithms should resample S at each iteration and converge to the true minimum as t → ∞
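A minimal sketch of stochastic approximation in this sense: the gradient is estimated from a freshly resampled subsample S at every iteration. The synthetic regression data, batch size, and constant step size are illustrative assumptions.

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(10_000, 5))                    # data set X (synthetic)
    y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + 0.1 * rng.normal(size=10_000)

    def minibatch_grad(w, batch_size=10):
        """Estimate the gradient of the sum-squared loss from a random subsample S of X."""
        S = rng.integers(0, len(X), size=batch_size)    # resample S at every iteration
        Xs, ys = X[S], y[S]
        return Xs.T @ (Xs @ w - ys) / batch_size

    w = np.zeros(5)
    for t in range(5000):
        w -= 0.01 * minibatch_grad(w)                   # simple stochastic gradient descent
    print(w)                                            # close to the true weights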

Houston, we have a problem
- The best classical methods can't handle SA:
  - conjugate gradients break down with noise
  - line minimizations (BFGS, CG) become incorrect
  - Newton and Levenberg-Marquardt are too expensive per iteration for large-scale online operation

Extended Kalman Filters
- Designed for online operation; use an adaptive gain matrix
- Widely used in signal processing, but O(n^2) per iteration: don't scale to large n
- Require an explicit model of the stochasticity
  - assumes Gaussianity, i.i.d. sampling, etc.
  - assumes its parameters are known

Back to Square One
- Simple gradient descent works with SA; convergence is proven (Robbins & Monro) if the step size η_t obeys
      Σ_t η_t = ∞   and   Σ_t η_t² < ∞
- But convergence is too slow to be useful: try to accelerate it such that SA still works

Local Step Sizes
- Give each parameter its own step size:
      w_{t+1} = w_t - p_t ⊙ g_t     (⊙ = Hadamard, i.e. component-wise, product)
- Still O(n) per iteration
- p_i > 0 guarantees a descent direction
- The step sizes act as a diagonal conditioner
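The sketch below contrasts the two ideas just introduced: an annealed Robbins-Monro schedule versus per-parameter (Hadamard) step sizes. The 1/(1 + t/τ) schedule, the noisy quadratic test gradient, and all constants are illustrative assumptions.

    import numpy as np

    def sgd_robbins_monro(grad, w, eta0=0.5, tau=100.0, iters=10_000):
        """SGD with eta_t = eta0 / (1 + t/tau), which satisfies the Robbins-Monro
        conditions: sum of eta_t diverges, sum of eta_t^2 converges."""
        for t in range(iters):
            w = w - eta0 / (1.0 + t / tau) * grad(w)
        return w

    def sgd_local_steps(grad, w, p, iters=10_000):
        """Gradient descent with one step size per parameter: w <- w - p (*) g."""
        for _ in range(iters):
            w = w - p * grad(w)          # numpy '*' is component-wise, i.e. the Hadamard product
        return w

    rng = np.random.default_rng(1)
    noisy_grad = lambda w: (w - np.array([2.0, -1.0])) + 0.3 * rng.normal(size=2)  # noisy quadratic gradient
    print(sgd_robbins_monro(noisy_grad, np.zeros(2)))                   # converges near [2, -1]
    print(sgd_local_steps(noisy_grad, np.zeros(2), p=np.array([0.05, 0.01])))
    # hovers near [2, -1]: constant local step sizes do not anneal the noise away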

Adapting Local Step Sizes
- Key idea: perform simultaneous gradient descent in the step sizes ("meta-descent") with a meta-step size μ:
      p_{t+1} = p_t - μ ∂f(w_{t+1})/∂p_t = p_t + μ g_{t+1} ⊙ g_t
- Doesn't work.

Problems, Problems
- p_i can go negative and has a small dynamic range → use a multiplicative update
- The autocorrelation of the stochastic gradient is very noisy → replace g_t with a running average ḡ_t
- The step size update is extremely ill-conditioned (the condition number κ squares at the meta-level!) → normalize the gradient autocorrelation
- A radical form of normalization: use only its sign

Sign-Based Methods
- We now have a multiplicative update driven by the sign of the (smoothed) gradient autocorrelation.
- With some variations, this is known as
  - Delta-bar-delta (Jacobs 1988)
  - Adaptive BP (Silva & Almeida 1990)
  - SuperSAB (Tollenaere 1990)
  - RPROP (Riedmiller 1993)
- These don't work online.

Sign-Based Methods: Problem
- Consider a 2-way classification task with 10% positives, where the classifier only learns the bias (= d.c. component). Let e_t = error at time t.
- At the optimum (output = 0.1): E(e_t) = 0.1·(1 - 0.1) + 0.9·(0 - 0.1) = 0
- Assume i.i.d. sampling and step size zero. Then E(e_{t+1} e_t) = E(e_{t+1}) E(e_t) = 0
- But: E(sign(e_{t+1} e_t)) = 0.01 + 0.81 - 2·0.09 = 0.64
- → a sign-based method will keep increasing the step size and will never anneal it to converge on the solution
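The slide's numbers can be checked by simulation: with i.i.d. 10% positives and the output stuck at the optimum 0.1, the error autocorrelation averages to zero but its sign does not. The simulation length and seed below are arbitrary choices.

    import numpy as np

    rng = np.random.default_rng(0)
    T = 1_000_000
    targets = (rng.random(T) < 0.1).astype(float)   # 10% positives, i.i.d.
    e = targets - 0.1                               # error of a classifier stuck at output 0.1

    prod = e[1:] * e[:-1]
    print(np.mean(prod))            # ~ 0: the gradient autocorrelation vanishes at the optimum
    print(np.mean(np.sign(prod)))   # ~ 0.64: the sign-based signal stays positive, so step sizes keep growing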

Need for Linearity
- Problem: the sign function is nonlinear; this conflicts with the goal of SA, since the expected update over samples no longer matches the update on the expected (true) gradient
- Linear normalization (Almeida et al. 1999): divide the gradient autocorrelation by a running average of the squared gradient
- Normalization factor: since p ⊙ g approximates H^{-1} g, the product p ⊙ g ⊙ g is affine invariant, i.e. self-normalizing
- Works, but there is a better way to do this...

Exponentiated Gradient
- Change to log-space step sizes: perform the meta-descent on ln p rather than p
- Exponentiate and re-linearize the multiplicative update:
      p_{t+1} = p_t ⊙ exp(μ a_t)  ≈  p_t ⊙ max(ρ, 1 + μ a_t)
  where a_t is the (normalized) gradient autocorrelation and the floor 0 < ρ < 1 guards against negative values

Multi-Step Approach
- Problem: p_t affects not just w_{t+1} but all future w
- The single-step model used so far captures only the immediate effect of p_t on w_{t+1}
- The actual effect is multi-step: w depends on p through the entire optimization trajectory

Stochastic Meta-Descent
- Local step sizes are adapted via
      p_t = p_{t-1} ⊙ max(ρ, 1 - μ g_t ⊙ v_t),
  where v_t ≈ ∂w_t/∂(ln p) captures the long-term dependence of w on p, accumulated with a decay factor λ (0 ≤ λ ≤ 1): λ = 0 recovers the smoothed single-step model, λ = 1 the full multi-step model
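Before the full SMD recursion (a sketch follows after the next group), here is the exponentiate-and-relinearize step-size update on its own, with a normalization by the running average of squared gradients in the spirit of Almeida et al.'s linear normalization described above. The constants and the particular running-average scheme are illustrative assumptions, not the exact published rule.

    import numpy as np

    def exp_grad_step_sizes(p, a, mu=0.05, rho=0.5):
        """Multiplicative step-size update: exponentiate, then re-linearize.
        a is the (normalized) adaptation signal, e.g. a smoothed gradient autocorrelation."""
        return p * np.maximum(rho, 1.0 + mu * a)    # ~ p * exp(mu * a), kept positive by the floor rho

    # example with a normalized gradient autocorrelation as the signal:
    p = np.full(3, 0.01)
    g_avg = np.array([0.10, -0.20, 0.00])           # running average of past gradients
    g_sq = np.array([0.02, 0.05, 0.01])             # running average of squared gradients (normalizer)
    g_new = np.array([0.12, -0.15, 0.05])           # latest stochastic gradient
    a = g_new * g_avg / (g_sq + 1e-12)
    print(exp_grad_step_sizes(p, a))                # grows where successive gradients agree in sign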

SMD: The Tricky Bit
- Differentiating the entire weight trajectory with respect to the step sizes looks intractable; carried out iteratively, however, the algebra collapses into a single recursive update for v... whew!

SMD: The v Update
- We obtain a simple iterative update for v:
      v_{t+1} = λ v_t - p_t ⊙ (g_t + λ H_t v_t)
- Closely related to TD(λ) reinforcement learning (Sutton)
- H_t v_t can be computed as efficiently as two gradient evaluations (typically O(n), more later)
- Predecessors (IDBD, K1, ELK1) diagonalized H; here we get the full Hessian at no extra cost!

SMD: Fixpoint of v
- The fixpoint of the v update is a Levenberg-Marquardt style gradient step:
      v* = -(λ H_t + (1-λ) diag(p_t)^{-1})^{-1} g_t
- v ⊙ g is affine invariant at the fixpoint (normalization)
- v is too noisy for direct use (Orr & Leen); SMD is stable due to double integration

SMD: Gauss-Newton
- SMD uses the Gauss-Newton approximation of the Hessian for improved stability
- Fast G·v product (even a bit faster than H·v)
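A compact sketch of the SMD loop as reconstructed above, on a synthetic minibatch linear-regression problem. The hyperparameter values, the sign convention for v, and the problem itself are illustrative assumptions; the minibatch Gauss-Newton product Xs^T(Xs v)/|S| stands in for the fast curvature-matrix-vector product discussed in the AD part of the course.

    import numpy as np

    rng = np.random.default_rng(0)
    n = 5
    w_true = rng.normal(size=n)
    X = rng.normal(size=(20_000, n))
    y = X @ w_true + 0.1 * rng.normal(size=20_000)

    mu, lam, rho = 0.05, 0.99, 0.5          # SMD hyperparameters (illustrative, not tuned)
    w = np.zeros(n)
    p = np.full(n, 0.05)                    # local step sizes
    v = np.zeros(n)                         # long-term dependence of w on ln(p)

    for t in range(3000):
        S = rng.integers(0, len(X), size=10)             # resample the minibatch S at every step
        Xs, ys = X[S], y[S]
        g = Xs.T @ (Xs @ w - ys) / len(S)                # stochastic gradient on S
        Gv = Xs.T @ (Xs @ v) / len(S)                    # Gauss-Newton curvature-vector product on S
        p = p * np.maximum(rho, 1.0 - mu * g * v)        # multiplicative step-size update
        w = w - p * g                                    # gradient step with local step sizes
        v = lam * v - p * (g + lam * Gv)                 # iterative update of v

    print(np.linalg.norm(w - w_true))                    # small: SMD tracks the solution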

Four Regions Benchmark
(figure: the four-regions classification task)

Benchmark: Convergence
- Compares simple stochastic gradient (SGD), conventional step size adaptation (ALAP), stochastic meta-descent (SMD), and global extended Kalman filtering (GEKF)
(figure: convergence curves)

Benchmark: Cost Comparison / CPU Time

  Algorithm | storage/weight | flops/update | CPU ms/pattern
  SGD       |        1       |       6      |      0.5
  SMD       |        3       |      18      |      1.0
  ALAP      |        4       |      18      |      1.0
  GEKF      |      >90       |    >1500     |       40

Benchmark: Autocorrelated Data
(figure: convergence under different sampling schemes: i.i.d., uniform, Sobol, Brownian)

Application: Turbulent Flow
(with M. Milano, Inst. of Computational Science, ETH Zürich)
- Original simulation (75,000 d.o.f.)
- Linear PCA (160 principal components)
- Neural network (160 nonlinear principal components)
- 15 neural nets, each with about 180,000 weights; the generic model has over 20,000,000 weights!
- Here SMD outperformed the Matlab toolbox and was able to train the generic model

Application II: Hand Tracking
(with M. Bray & L. van Gool, Computer Vision Lab, ETH Zürich)
- Detailed hand model (10k vertices, 23 d.o.f.)
- Randomly sample a few points on the model surface
- Project them to the image
- Compare with the camera image at these points
- Use the resulting stochastic gradient to adjust the model via SMD

Hand Tracking: Results
(figure: tracking results; timings 3 s, 232 s, 114 s, 3 s)
- SMD: 40-fold speed-up over the state of the art
- The speed-up is due to stochastic approximation
- Stochasticity helps escape local minima → better tracking performance
- Work continues at ETH, Oxford, and NICTA:
  - multiple, ordinary video cameras; occlusions
  - real-time tracking of hands, face, body, ...

SA, SMD: Summary
- Data-rich ML problems need SA for efficiency
- Classical gradient methods don't work with SA
- Like CG, SMD combines
  - extreme scalability: cheap O(n) iterations
  - efficiency: rapid (superlinear) convergence
- Unlike CG, SMD is designed to work with SA
- 2nd order without the cost (fast Hv product)

ANGie Project
- The ANGie project will explore SMD at NICTA:
  - mathematical analysis (stability, convergence)
  - further development of the core algorithm
  - development of AD tools & techniques
  - use of SMD in different ML settings (kernels, graphical models, RL, control, ...)
  - reference applications (computer vision, ...)
- Jobs available (postdoc, Ph.D.)

Algorithmic Differentiation (AD)
- a.k.a. automatic differentiation (www.autodiff.org)
- Given (code for) a differentiable function, produces (code for) its derivative function(s) automatically
- Solves a major software engineering problem by ensuring the correctness of derivative code
- Textbook (SIAM 2000): Griewank, Evaluating Derivatives: Principles and Techniques of Algorithmic Differentiation

Differentiation Strategies
- Symbolic: e.g. (sin x)' = cos x,  d(M^{-1}) = -M^{-1} (dM) M^{-1}
  - transformation of symbolic algebraic expressions
  - knowledge-intensive; the goal is mathematical insight
- Numeric: finite differences, e.g. ∂f/∂x_i ≈ [f(x + h e_i) - f(x)] / h
  - knowledge-free; the goal is just a numerical result
  - approximate; the choice of step h is problematic
  - inefficient for calculating high-dimensional gradients (approximates only the forward mode of AD)
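The sketch below illustrates both numeric-differentiation problems just mentioned: the cost of n+1 function evaluations per gradient, and the awkward choice of the step h. The test function and step sizes are illustrative choices.

    import numpy as np

    def finite_diff_grad(f, x, h=1e-6):
        """Forward-difference gradient: n+1 function evaluations for an n-dimensional x."""
        fx = f(x)
        g = np.zeros_like(x)
        for i in range(len(x)):          # one extra evaluation per coordinate: costly in high dimensions
            e = np.zeros_like(x)
            e[i] = h
            g[i] = (f(x + e) - fx) / h
        return g

    f = lambda x: np.sin(x).sum()
    x = np.linspace(0.0, 1.0, 4)
    exact = np.cos(x)
    for h in (1e-2, 1e-6, 1e-12):        # too large: truncation error; too small: round-off error
        print(h, np.max(np.abs(finite_diff_grad(f, x, h) - exact)))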

Forward Mode: Implementation Straightforward since control flow is unchanged. Many ways to do it - Source transformation: a*=b; Byte code transformation Augmented byte code interpreter Overloaded C++ class (available on request) Or just use complex number library a*=b; da*=b; da+=a*db; Forward Mode: Implementation Complex arithmetic can perform forward AD! Consider complex (x, ε dx), where ε = 10-150 : Sums: (a, ε da) + (b, ε db) = (a+b, ε (da+db)) Products: (a, ε da)(b, ε db) = (ab - ε 2 da db, ε (a db + da b)) 10-300 0 Very fast and usually accurate enough - but can t use this trick for complex numbers 53 54 Calculating Gradients Gradient of scalar fn. = transposed Jacobian: Can calculate individual elements by forward AD: n iterations for gradient inefficient for high-dim. systems (large n) Reverse mode AD obtains gradient efficiently (single iteration) but is harder to implement Reverse Mode Propagates gradients back through the adjoint system: Gradient of scalar fn.: For neural nets, known as backprop(agation) Requires reversal of fn. s dataflow. Need to Memorize dataflow & intermediate results Unroll loops, overwrites, etc. Hard to do well! 55 56 2005 Nic Schraudolph. All rights reserved. 14

Reverse Mode: Rules
- Similar to forward mode, but propagating adjoints (denoted ā, b̄, c̄) backwards:
  - sums:       c = a + b    →  ā += c̄;  b̄ += c̄
  - forks:      b = a; c = a →  ā = b̄ + c̄
  - products:   c = a·b      →  ā += b·c̄;  b̄ += a·c̄
  - chain rule: c = f(a)     →  ā += f'(a)·c̄
- Higher-level rules are essential for efficiency
- Ongoing research, e.g. reverse AD of fixed-point iterations without unrolling (Pearlmutter 2004)

Visual Dataflow Programming
- Make the dataflow explicit: focus the programmer on constructs for which reverse AD is efficient

Fast Hessian-Vector Product
- Applying forward mode to gradient code gives the product of H_f with an arbitrary vector v
- As fast as 2-3 gradient evaluations; usually O(n), even though H_f is an n×n matrix!
- A similar trick yields the Gauss-Newton product G_f·v: a forward-mode pass through the model followed by a reverse-mode pass through the loss
- (see the sketch after the course summary)

Course Summary
1. Classical Gradient Methods
   Direct (gradient-free), Steepest Descent, Newton, Levenberg-Marquardt, BFGS, Conjugate Gradient
2. Stochastic Approximation (SA)
   Why necessary, why difficult. Step size adaptation.
3. Stochastic Meta-Descent (SMD)
   Advanced stochastic step size adaptation method.
4. Algorithmic Differentiation (AD)
   Forward/reverse mode. Fast Hessian-vector products.
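The fast Hessian-vector product can be demonstrated by applying forward mode (here again the complex-step trick standing in for a forward-mode AD tool) to hand-written gradient code. The test function, its hand-coded gradient, and the explicit 2x2 check are illustrative assumptions.

    import numpy as np

    eps = 1e-150

    def grad_f(w):
        # hand-coded gradient of f(w) = sin(w0)*w1 + w1^3
        return np.array([np.cos(w[0]) * w[1],
                         np.sin(w[0]) + 3.0 * w[1] ** 2])

    def hessian_vector(w, v):
        """H·v = directional derivative of the gradient in direction v: one forward pass, O(n)."""
        return grad_f(w + 1j * eps * v).imag / eps

    w = np.array([0.7, 1.3])
    v = np.array([1.0, -2.0])
    print(hessian_vector(w, v))

    # check against the explicit 2x2 Hessian (affordable only because n is tiny here)
    H = np.array([[-np.sin(w[0]) * w[1], np.cos(w[0])],
                  [ np.cos(w[0]),        6.0 * w[1]]])
    print(H @ v)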