Analyzing Stochastic Gradient Descent for Some Non-Convex Problems


Analyzing Stochastic Gradient Descent for Some Non-Convex Problems. Christopher De Sa (soon at Cornell University), cdesa@stanford.edu, stanford.edu/~cdesa; Kunle Olukotun; Christopher Ré. Stanford University.

Same motivation as the plenary talk: Large Scale Machine Learning. Machine learning problems are big: larger datasets and more sophisticated models → better ML services. Today, all production machine learning algorithms run in parallel, because modern architectures are parallel. Most work is done with stochastic iterative algorithms, because of the scale of the datasets and models.

Background: Learning. Perhaps the biggest class of machine learning applications. Goal: learn the model that best accomplishes a task. Formally: find the model $x$ that minimizes some loss function over the training examples,
$$\min_x \sum_{i=1}^{N} f(x; y_i).$$
Applications range from linear regression to deep learning. The de facto algorithm used at scale: stochastic gradient descent (SGD).

What is stochastic gradient descent? Repeat: pick a training example $y_{i_t}$ uniformly at random, then update the model $x_t$ using a gradient estimate,
$$x_{t+1} = x_t - \alpha_t \nabla f(x_t; y_{i_t}).$$
This is not generally guaranteed to converge to the global optimum for non-convex problems.
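To make the update concrete, here is a minimal runnable sketch (my own illustration, not code from the talk) of this loop for a least-squares loss $f(x; (a_i, b_i)) = (a_i^T x - b_i)^2$; the data, step size, and iteration count are placeholders.

```python
import numpy as np

def sgd_least_squares(A, b, steps=10000, alpha=0.01, seed=0):
    """Minimal SGD sketch for f(x; (a_i, b_i)) = (a_i^T x - b_i)^2."""
    rng = np.random.default_rng(seed)
    n, d = A.shape
    x = np.zeros(d)
    for _ in range(steps):
        i = rng.integers(n)                    # pick a training example uniformly at random
        grad = 2.0 * (A[i] @ x - b[i]) * A[i]  # gradient estimate from that single example
        x = x - alpha * grad                   # x_{t+1} = x_t - alpha * grad
    return x

# Usage (illustrative): recover a planted model from noisy linear measurements.
# A = np.random.randn(500, 10); x_true = np.random.randn(10)
# b = A @ x_true + 0.01 * np.random.randn(500)
# x_hat = sgd_least_squares(A, b)   # x_hat should be close to x_true
```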

No General Guarantees, because of local minima. Example: for this function, even gradient descent won't get to the global optimum from everywhere. (The figure shows a one-dimensional function with a non-optimal local minimum.)

How to proceed?
- Show only convergence to a critical point. Straightforward to do.

Showing we get close to critical points. For any $x$, let $g$ be the gradient estimate at $x$. Then
$$\mathbb{E}[f(x - \alpha g)] = \mathbb{E}\!\left[f(x) - \alpha\, g^T \nabla f(x) + \tfrac{\alpha^2}{2}\, g^T \nabla^2 f(z)\, g\right] \le f(x) - \alpha \|\nabla f(x)\|^2 + \tfrac{\alpha^2}{2} M.$$
Summing up across iterations, for square-summable but not summable step sizes,
$$\sum_{t=0}^{T-1} \alpha_t\, \mathbb{E}\|\nabla f(x_t)\|^2 \le \mathbb{E}[f(x_0) - f(x_T)] + \sum_{t=0}^{T-1} \tfrac{\alpha_t^2}{2} M < \infty.$$

How to proceed?
- Show only convergence to a critical point. Straightforward to do, but the results are weak: usually convergence to a critical point is not good enough.
- Show only local convergence, given a close-enough initialization. Also straightforward to do.

Showing local convergence. Assume that the second derivative matrix is nonsingular at the optimum, $\nabla^2 f(x^*) \succ 0$. Then there exists a ball around the optimum on which $f$ is convex, in fact strongly convex. If we initialize far enough into the interior of this ball, any standard analysis of convex SGD will work, with minor tweaks to make sure we don't escape the ball.

How to proceed?
- Show only convergence to a critical point. Straightforward to do, but the results are weak: often convergence to a critical point is not good enough.
- Show only local convergence, given a close-enough initialization. Also straightforward to do; the hard bit is finding an explicit initialization algorithm that gets close enough.
- Show global convergence from almost all initializations, for a restricted class of problems.

Guarantees for Non-Convex SGD. Main question: when can we show that SGD converges globally (from almost all initializations) for a non-convex problem, with an explicit rate of convergence that is robust to additional sources of noise (such as asynchrony)? Problem for this talk: matrix completion. Global Convergence of SGD for Some Nonconvex Matrix Problems [De Sa et al., ICML 2015]. Taming the Wild: A Unified Analysis of Hogwild!-Style Algorithms [De Sa et al., NIPS 2015].

Outline:
1. A nonconvex problem: matrix completion
2. SGD for matrix completion
3. Convergence guarantees
4. Robustness to additional noise

Outline:
1. A nonconvex problem: matrix completion
2. SGD for matrix completion
3. Convergence guarantees
4. Robustness to additional noise

Matrix Completion Problem. The goal is to fit a low-rank matrix $X$ to samples:
$$\text{minimize } \mathbb{E}\left[\|\tilde{A} - X\|_F^2\right] \quad \text{subject to } X \in \mathbb{R}^{n \times n},\ \operatorname{rank}(X) \le 1,\ X \succeq 0.$$
Apply the quadratic substitution $X = yy^T$ (Burer-Monteiro):
$$\text{minimize } \mathbb{E}\left[\|\tilde{A} - yy^T\|_F^2\right] \quad \text{subject to } y \in \mathbb{R}^n.$$

Many Applications: standard matrix completion, matrix sensing, subspace tracking, principal component analysis. How to represent many applications? Same optimization problem; different application → different noise model.

Key Insight: Use a Weak Noise Model. Using only weak assumptions about the sample distribution lets us handle many application cases with a unified analysis. How much can we prove by assuming only that the estimates are unbiased and that their variance is bounded?
$$\mathbb{E}\big[\tilde{A}\big] = A, \qquad \forall y, z:\ \mathbb{E}\left[\big(y^T \tilde{A} z\big)^2\right] \le \sigma^2 \|y\|^2 \|z\|^2.$$

Simple Gradient Flow of the 2D Case. (The figure shows the phase portrait of the flow.)
$$\dot{x} = 4x - (x^2 + y^2)\,x, \qquad \dot{y} = y - (x^2 + y^2)\,y.$$
Consequences of non-convexity: we get pushed in multiple directions, and there are multiple unstable fixed points. What's nice about this problem: symmetry, and no non-optimal local minima.
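For readers who want to reproduce this phase portrait, here is a small sketch (my own, not from the talk) that integrates the flow above with forward Euler; writing $v = (x, y)$ and $A = \mathrm{diag}(4, 1)$, the displayed equations are exactly $\dot{v} = (A - \|v\|^2 I)\,v$, so generic initializations approach the optima $(\pm 2, 0)$ while the origin and $(0, \pm 1)$ are unstable fixed points. Step size and iteration counts are placeholder choices.

```python
import numpy as np

def gradient_flow(v0, dt=0.01, steps=3000):
    """Forward-Euler integration of xdot = 4x - (x^2 + y^2) x, ydot = y - (x^2 + y^2) y."""
    A = np.diag([4.0, 1.0])
    v = np.array(v0, dtype=float)
    for _ in range(steps):
        v = v + dt * (A @ v - (v @ v) * v)  # the same flow written as dv/dt = (A - ||v||^2 I) v
    return v

# Generic starting points flow to the optima at (+2, 0) or (-2, 0):
print(gradient_flow([0.5, 0.5]))    # approximately [ 2.  0.]
print(gradient_flow([-0.3, 1.0]))   # approximately [-2.  0.]
```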

Bad Trajectory. (The figure highlights a bad trajectory in the phase portrait.) If we initialize on a bad trajectory, it could take forever to escape. Using the weak noise model, we can't show convergence from everywhere in reasonable time: we can't show convergence from initial points near the bad trajectory, because the algorithm can always jump onto the bad trajectory and then stay there for an arbitrarily long time.

Outline:
1. A nonconvex problem: matrix completion
2. SGD for matrix completion
3. Convergence guarantees
4. Robustness to additional noise

SGD for Matrix Completion. Standard SGD gives the update step
$$y_{k+1} = y_k - 4\alpha_k\left(y_k y_k^T y_k - \tilde{A}_k y_k\right).$$
By using a variable step size scheme (SGD on a manifold),
$$y_{k+1} = \big(I + \eta \tilde{A}_k\big)\, y_k \left(1 + \eta \|y_k\|^2\right)^{-1}.$$
To just get the direction,
$$y_{k+1} = \big(I + \eta \tilde{A}_k\big)\, y_k.$$

Many Related Algorithms. Many algorithms use this or a similar update rule, $y_{k+1} = \big(I + \eta \tilde{A}_k\big)\, y_k$: Alecton [De Sa et al., ICML 2015], Oja's algorithm [Oja, Journal of Mathematical Biology, 1982], and stochastic power iteration. We need to periodically normalize, $y := y / \|y\|$, which doesn't affect the direction.
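A minimal sketch of this family of updates (my own illustration; the entrywise sampling scheme, step size, and normalization period below are placeholder choices, not the exact estimators from the cited papers):

```python
import numpy as np

def stochastic_power_iteration(A, eta=1e-3, steps=200000, seed=0):
    """Sketch of y <- (I + eta * A_tilde) y with unbiased entrywise samples A_tilde of A,
    renormalizing periodically (which does not affect the direction)."""
    rng = np.random.default_rng(seed)
    n = A.shape[0]
    y = rng.standard_normal(n)
    for k in range(steps):
        i, j = rng.integers(n), rng.integers(n)   # sample one entry of A uniformly at random
        step = np.zeros(n)
        step[i] = (n * n) * A[i, j] * y[j]        # A_tilde y, with A_tilde = n^2 A_ij e_i e_j^T so E[A_tilde] = A
        y = y + eta * step                        # y_{k+1} = (I + eta * A_tilde_k) y_k
        if (k + 1) % 1000 == 0:
            y = y / np.linalg.norm(y)             # periodic normalization
    return y / np.linalg.norm(y)

# Usage (illustrative): for a PSD matrix A, the result should roughly align with the dominant
# eigenvector of A when eta is small enough and enough samples are taken.
```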

Outline:
1. A nonconvex problem: matrix completion
2. SGD for matrix completion
3. Convergence guarantees
4. Robustness to additional noise

Analyze Non-Convex SGD using Martingales. Using a standard Lyapunov-function approach won't work: that approach shows convergence from everywhere, which we've shown doesn't happen. Martingale approach (a martingale satisfies $\mathbb{E}[x_{t+1} \mid x_t] = x_t$): handles processes that can fail with some probability, and bounds the probability of failure. Bonus: it lets us combine noise from different sources within a unified analysis.

Measuring Convergence. Success condition: angular closeness to the global optimum,
$$\big(u_1^T y_k\big)^2 \ge (1 - \epsilon)\, \|y_k\|^2,$$
where $u_1$ is the unit-norm dominant eigenvector. We let $F_t$, the failure event, denote the event that the algorithm hasn't succeeded after $t$ iterations.
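In code, this success criterion is just an angle check between the iterate and the dominant eigenvector (a tiny illustrative helper with names of my own choosing, not from the papers):

```python
import numpy as np

def has_succeeded(y, u1, eps):
    """Success condition: (u1^T y)^2 >= (1 - eps) * ||y||^2, with u1 a unit-norm dominant eigenvector."""
    return (u1 @ y) ** 2 >= (1.0 - eps) * (y @ y)
```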

Theorem. For any $\delta > 0$, if we run for $t$ iterations with $t$ at least on the order of
$$\frac{n \log n}{\delta^2 \epsilon^2}\, G(\cdot),$$
then the probability of failure is bounded by $P(F_t) \le \delta$, where $G$ is some fixed function of the problem parameters.

Technique. Analyze the quantity
$$\tau_t = \frac{\big(u_1^T y_t\big)^2}{y_t^T \big(\epsilon I + (1 - \epsilon)\, u_1 u_1^T\big)\, y_t}.$$
We can prove that if success hasn't occurred, then for some $R$,
$$\mathbb{E}[\tau_{t+1} \mid F_t] \ge \tau_t \big(1 + R\,(1 - \tau_t)\big).$$
A careful martingale analysis of this recurrence proves the theorem.

Outline:
1. A nonconvex problem: matrix completion
2. SGD for matrix completion
3. Convergence guarantees
4. Robustness to additional noise

Techniques for Fast Parallel SGD.
- Asynchronous execution (HOGWILD!): run multiple threads of SGD without locks; exposes multi-core parallelism; may have race conditions, but the theory says that's okay.
- Low-precision computation (BUCKWILD!): use low-precision fixed-point arithmetic (e.g. 8-bit); exposes SIMD parallelism.

Low-Cost Parallelism with HOGWILD!: multiple parallel workers make asynchronous updates (no locks) to a single shared model $x_t$.
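As a sketch of what HOGWILD!-style access looks like (my own illustration, not the talk's implementation): every worker reads and writes one shared model vector with no locking. In CPython the GIL serializes most bytecode, so this mainly illustrates the access pattern rather than real speedup; thread and step counts are placeholders.

```python
import threading
import numpy as np

def hogwild_sgd(A, b, n_threads=4, steps_per_thread=5000, alpha=0.01):
    """Lock-free SGD sketch: all workers update the same shared model x without synchronization."""
    n, d = A.shape
    x = np.zeros(d)  # shared model

    def worker(seed):
        rng = np.random.default_rng(seed)
        for _ in range(steps_per_thread):
            i = rng.integers(n)
            grad = 2.0 * (A[i] @ x - b[i]) * A[i]   # may read a model another thread is mid-update on
            np.subtract(x, alpha * grad, out=x)     # racy in-place write to the shared model (no locks)

    threads = [threading.Thread(target=worker, args=(s,)) for s in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return x
```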

Recent Hardware Trend: SIMD. A single data-parallel instruction can process multiple values at once; for example, one vmulpd instruction computes the products $x_1 y_1$, $x_2 y_2$, $x_3 y_3$, and $x_4 y_4$ simultaneously. This source of parallelism is independent of threading.

Low Precision and SIMD Parallelism. A major benefit of low precision: use SIMD instructions to get more parallelism on the CPU.
SIMD precision → SIMD parallelism:
- 64-bit float vector: 4 multiplies/cycle (vmulpd instruction)
- 32-bit float vector: 8 multiplies/cycle (vmulps instruction)
- 16-bit int vector: 16 multiplies/cycle (vpmaddwd instruction)
- 8-bit int vector: 32 multiplies/cycle (vpmaddubsw instruction)
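To make the low-precision arithmetic concrete, here is a sketch (my own; the symmetric scaling scheme is a naive placeholder) of an 8-bit dot product with a wide accumulator, which is roughly the pattern that packed multiply-add instructions like those above exploit in hardware:

```python
import numpy as np

def quantize_int8(v, scale):
    """Map floats to int8 with a simple symmetric scale (a naive placeholder scheme)."""
    return np.clip(np.round(v / scale), -127, 127).astype(np.int8)

def int8_dot(a, b, scale_a, scale_b):
    """Low-precision dot product: multiply quantized values, accumulate in int32, rescale at the end."""
    qa, qb = quantize_int8(a, scale_a), quantize_int8(b, scale_b)
    acc = int(np.dot(qa.astype(np.int32), qb.astype(np.int32)))  # wide accumulator avoids overflow
    return acc * scale_a * scale_b

# Usage (illustrative):
# a, b = np.random.randn(1024), np.random.randn(1024)
# approx = int8_dot(a, b, np.abs(a).max() / 127, np.abs(b).max() / 127)
# exact = float(a @ b)   # approx should agree with exact up to quantization error
```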

Low Precision and Memory. A major benefit of low precision: it puts less pressure on memory and caches.
Precision in DRAM → memory throughput (assuming ~40 GB/sec memory bandwidth):
- 64-bit float vector: 5 numbers/ns
- 32-bit float vector: 10 numbers/ns
- 16-bit int vector: 20 numbers/ns
- 8-bit int vector: 40 numbers/ns
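As a quick sanity check on these figures (my own arithmetic, using the stated ~40 GB/sec assumption): 40 GB/s divided by 8 bytes per 64-bit value is about 5 numbers/ns, and each halving of the word size doubles the count, giving the 10, 20, and 40 numbers/ns above.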

Taming the Wild. Can we also prove convergence results that give rates for non-convex SGD using these techniques? Taming the Wild [De Sa et al., NIPS 2015] shows a principled way to obtain a rate for a HOGWILD! version of an algorithm, given a martingale-based rate for its sequential version. This applies to the non-convex matrix completion results, giving the first convergence rate for asynchronous SGD on a non-convex problem. The same results also apply to low-precision methods!

Result Outline. The martingale proof uses a martingale $W$ such that $W_t(x_t, x_{t-1}, \ldots, x_0) \ge t$ if the algorithm hasn't succeeded. We just need simple continuity bounds:
- the martingale $W$ is Lipschitz continuous with parameter $H$
- the gradients are Lipschitz continuous with parameter $R$
- the expected magnitude of an update is bounded by $\xi$
- the expected staleness due to asynchrony is bounded by $\tau$
Then
$$P(\text{failure, HOGWILD!}) \le \frac{1}{1 - HR\xi\tau}\, P(\text{failure, sequential}).$$

Intuition. If the algorithm is continuous, noisy perturbations (due to race conditions, round-off error, etc.) just propagate and sum up to roughly Gaussian noise at the output. We can use Lipschitz continuity to formalize this intuition. Practical upshot: this works for almost all types of extra noise, as long as the original proof uses a weak enough noise model, and even for non-convex problems like matrix completion.

Follow-Up Work. Systems analysis of low-precision asynchronous SGD: Understanding and Optimizing Asynchronous Low-Precision Stochastic Gradient Descent [De Sa et al., ISCA 2017]; how can we change the hardware to better support these algorithms? Accelerating the convergence of SGD for PCA/matrix completion: a general analysis of momentum plus stochastic power iteration, using a weak noise model; there is great previous work here (especially for the PCA application).

Conclusion. Main question: when can we show that non-convex stochastic gradient descent converges? Approach in this talk: study matrix completion/PCA, where SGD converges globally, using very weak noise models. The theory makes it easy to analyze additional techniques: asynchronous execution and low-precision computation. Thank you! cdesa@stanford.edu stanford.edu/~cdesa