Gradient Descent. Michail Michailidis & Patrick Maiden

2 Outline
- Motivation
- Gradient Descent Algorithm
- Issues & Alternatives
- Stochastic Gradient Descent
- Parallel Gradient Descent
- HOGWILD!

3 Motivation
- It is good for finding global minima/maxima if the function is convex
- It is good for finding local minima/maxima if the function is not convex
- It is used for optimizing many models in machine learning
- It is used in conjunction with: Neural Networks, Linear Regression, Logistic Regression, the back-propagation algorithm, Support Vector Machines

4 Function Example

5 Quickest ever review of multivariate calculus
- Derivative
- Partial Derivative
- Gradient Vector

6 Derivative
$f(x) = x^2$
Slope of the tangent line: $f'(x) = \frac{df}{dx} = 2x$
$f''(x) = \frac{d^2 f}{dx^2} = 2$
Easy when a function is univariate

7 Partial Derivative: Multivariate Functions
For multivariate functions (e.g. two variables) we need partial derivatives, one per dimension.
Examples of multivariate functions:
- $f(x,y) = x^2 + y^2$ (convex!)
- $f(x,y) = -x^2 - y^2$ (concave!)
- $f(x,y) = \cos^2(x) + y^2$
- $f(x,y) = \cos^2(x) + \cos^2(y)$

8 Partial Derivative Cont'd
To visualize the partial derivative for each of the dimensions x and y, we can imagine a plane that cuts our surface along each of the two dimensions, and once again we get the slope of the tangent line.
surface: $f(x,y) = 9 - x^2 - y^2$
plane: $y = 1$
cut: $f(x,1) = 8 - x^2$
slope / derivative of cut: $f'(x) = -2x$

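9 Partial Derivative Cont'd 2
If we partially differentiate a function with respect to x, we pretend y is constant (and vice versa for y):
$f(x,y) = 9 - x^2 - y^2$
$f(x,c) = 9 - x^2 - c^2$, so $f_x = \frac{\partial f}{\partial x} = -2x$
$f(c,y) = 9 - c^2 - y^2$, so $f_y = \frac{\partial f}{\partial y} = -2y$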

10 Partial Derivative Cont'd 3
The two tangent lines that pass through a point define the tangent plane at that point

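11 Gradient Vector
The gradient is the vector whose coordinates are the partial derivatives of the function:
$f(x,y) = 9 - x^2 - y^2$
$\frac{\partial f}{\partial x} = -2x$, $\frac{\partial f}{\partial y} = -2y$
$\nabla f = \frac{\partial f}{\partial x}\,\mathbf{i} + \frac{\partial f}{\partial y}\,\mathbf{j} = \left(\frac{\partial f}{\partial x}, \frac{\partial f}{\partial y}\right) = (-2x, -2y)$
Note: the gradient vector is not parallel to the tangent surface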
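The gradient of the example surface can be checked numerically. The following is a minimal sketch, not part of the original slides: the helper `numeric_grad` and its step `h` are assumptions introduced for illustration, and it compares a central-difference estimate against the analytic gradient (-2x, -2y).

```python
def f(x, y):
    """Surface from the slides: f(x, y) = 9 - x^2 - y^2."""
    return 9 - x**2 - y**2


def grad_f(x, y):
    """Analytic gradient (df/dx, df/dy) = (-2x, -2y)."""
    return (-2 * x, -2 * y)


def numeric_grad(func, x, y, h=1e-6):
    """Central-difference estimate of the gradient (hypothetical helper)."""
    # Vary x while holding y constant, then vary y while holding x constant.
    dfdx = (func(x + h, y) - func(x - h, y)) / (2 * h)
    dfdy = (func(x, y + h) - func(x, y - h)) / (2 * h)
    return (dfdx, dfdy)


if __name__ == "__main__":
    print("analytic:", grad_f(1.0, 2.0))
    print("numeric: ", numeric_grad(f, 1.0, 2.0))
```

Each partial derivative is estimated by perturbing one coordinate and holding the other fixed, which mirrors the "pretend y is constant" rule.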

12 Gradient Descent Algorithm & Walkthrough
Idea:
- Start somewhere
- Take steps based on the gradient vector at the current position until convergence
- Convergence: happens when the change between two steps < ε

13 Gradient Descent Code (Python)
$f(x) = x^4 - 3x^3 + 2$
$f'(x) = 4x^3 - 9x^2$
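The code from this slide did not survive extraction. The following is a minimal sketch of gradient descent on this function, not the authors' original listing; the starting point `x0`, the step size, the tolerance `eps`, and the iteration cap are assumed values chosen for illustration.

```python
def f_prime(x):
    """Derivative of f(x) = x^4 - 3x^3 + 2."""
    return 4 * x**3 - 9 * x**2


def gradient_descent(x0=6.0, step_size=0.01, eps=1e-5, max_iters=10000):
    """Step against the derivative until two consecutive iterates differ by < eps."""
    x = x0
    for _ in range(max_iters):
        x_new = x - step_size * f_prime(x)
        if abs(x_new - x) < eps:  # convergence test from the walkthrough slide
            return x_new
        x = x_new
    return x


if __name__ == "__main__":
    print("approximate minimizer:", gradient_descent())
```

Since f'(x) = x^2 (4x - 9), the minimizer is x = 9/4, so the returned value should land close to 2.25 with these settings.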

14 Gradient Descent Algorithm & Walkthrough

15 Potential issues of gradient descent: Convexity
We need a convex function, so that there is a global minimum: $f(x,y) = x^2 + y^2$

16 Potential issues of gradient descent: Convexity (2)

17 Potential issues of gradient descent: Step Size
As we saw before, one parameter that needs to be set is the step size
Bigger steps lead to faster convergence, right?
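The answer to that question is "not always". As a small sketch under assumed settings (the one-dimensional function f(x) = x^2, the start point, and the three step sizes are illustrative choices, not from the slides), each update multiplies x by (1 - 2 * step_size), so the iterate shrinks only when that factor has magnitude below 1.

```python
def run_gd(step_size, x0=1.0, iters=20):
    """Gradient descent on f(x) = x^2, whose derivative is 2x."""
    x = x0
    for _ in range(iters):
        x = x - step_size * 2 * x  # same as x *= (1 - 2 * step_size)
    return x


if __name__ == "__main__":
    for alpha in (0.1, 0.5, 1.1):
        print(f"step size {alpha}: x after 20 steps = {run_gd(alpha)}")
```

With a step size of 0.1 the iterate decays slowly toward 0, 0.5 jumps straight to the minimum, and 1.1 overshoots further on every step and diverges.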

18 Alternative algorithms
Newton's Method
- Approximates the function locally by a quadratic polynomial and jumps to the minimum of that approximation
- Needs the Hessian
BFGS
- More complicated algorithm
- Commonly used in actual optimization packages
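For a one-dimensional function the Hessian is just the second derivative, so Newton's method can be sketched in a few lines. This is an illustrative sketch on f(x) = x^4 - 3x^3 + 2 from the earlier code slide; the start point, tolerance, and iteration cap are assumptions, and the update only heads to a minimum while the second derivative stays positive.

```python
def f_prime(x):
    """First derivative of f(x) = x^4 - 3x^3 + 2."""
    return 4 * x**3 - 9 * x**2


def f_second(x):
    """Second derivative (the 1-D Hessian)."""
    return 12 * x**2 - 18 * x


def newton_minimize(x0=6.0, eps=1e-10, max_iters=50):
    """Jump to the minimum of the local quadratic model at each step."""
    x = x0
    for _ in range(max_iters):
        curvature = f_second(x)
        if curvature == 0:  # quadratic model is flat, no Newton step defined
            break
        x_new = x - f_prime(x) / curvature
        if abs(x_new - x) < eps:
            return x_new
        x = x_new
    return x


if __name__ == "__main__":
    print("Newton estimate of the minimizer:", newton_minimize())
```

No step size is needed here because the curvature sets the length of each jump, which is the trade-off for having to compute second derivatives.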

19 Stochastic Gradient Descent: Motivation
One way to think of gradient descent is as a minimization of a sum of functions:
$w = w - \alpha \nabla L(w) = w - \alpha \sum_i \nabla L_i(w)$
($L_i$ is the loss function evaluated on the i-th element of the dataset)
On large datasets, it may be computationally expensive to iterate over the whole dataset, so pulling a subset of the data may perform better
Additionally, sampling the data adds noise that can help avoid getting stuck in shallow local minima. This is good for optimizing non-convex functions. (Murphy)

20 Stochastic Gradient Descent
Online learning algorithm
Instead of going through the entire dataset on each iteration, randomly sample and update the model

    Initialize w and α
    Until convergence do:
        Sample one example i from dataset  // stochastic portion
        w = w - α ∇L_i(w)
    return w
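The pseudocode above can be sketched for a concrete model. This example is an assumption-laden illustration, not from the slides: it fits a line y = a*x + b to noiseless toy data with a squared-error loss, and it runs a fixed number of steps instead of testing for convergence.

```python
import random


def sgd_linear(data, alpha=0.1, steps=5000, seed=0):
    """SGD for y ~ a*x + b with per-example loss 0.5 * (a*x + b - y)^2."""
    rng = random.Random(seed)
    a, b = 0.0, 0.0  # initialize w = (a, b)
    for _ in range(steps):
        x, y = rng.choice(data)  # stochastic portion: one random example
        err = (a * x + b) - y
        # Gradient of the per-example loss with respect to (a, b).
        a -= alpha * err * x
        b -= alpha * err
    return a, b


# Toy dataset (hypothetical): points on the line y = 2x + 1.
data = [(i / 10, 2.0 * (i / 10) + 1.0) for i in range(11)]
a, b = sgd_linear(data)

if __name__ == "__main__":
    print(f"a = {a:.4f}, b = {b:.4f}")
```

Because the toy data has no noise, the estimates should approach a = 2 and b = 1; on noisy data a fixed step size would leave the iterates jittering around the optimum.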

21 Stochastic Gradient Descent (2)
Checking for convergence after each data example can be slow
One can simulate stochasticity by reshuffling the dataset on each pass:

    Initialize w and α
    Until convergence do:
        shuffle dataset of n elements  // simulating stochasticity
        For each example i in n:
            w = w - α ∇L_i(w)
    return w

This is generally faster than the classic iterative approach ("noise")
However, you are still passing over the entire dataset each time
An approach in the middle is to sample batches, subsets of the entire dataset
This can be parallelized!
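The "approach in the middle" can be sketched by combining the per-pass shuffle with small batches. The model, toy data, batch size, step size, and epoch count below are assumptions for illustration; the batch gradient is averaged so the step size does not depend on the batch size.

```python
import random


def minibatch_sgd(data, alpha=0.2, epochs=1000, batch_size=4, seed=0):
    """Mini-batch SGD for y ~ a*x + b, reshuffling the data on every pass."""
    rng = random.Random(seed)
    data = list(data)  # copy so the caller's list is not reordered
    a, b = 0.0, 0.0
    for _ in range(epochs):
        rng.shuffle(data)  # simulating stochasticity
        for start in range(0, len(data), batch_size):
            batch = data[start:start + batch_size]
            # Average the per-example gradients over the batch.
            grad_a = sum(((a * x + b) - y) * x for x, y in batch) / len(batch)
            grad_b = sum(((a * x + b) - y) for x, y in batch) / len(batch)
            a -= alpha * grad_a
            b -= alpha * grad_b
    return a, b


# Toy dataset (hypothetical): points on the line y = 2x + 1.
data = [(i / 10, 2.0 * (i / 10) + 1.0) for i in range(11)]
a, b = minibatch_sgd(data)

if __name__ == "__main__":
    print(f"a = {a:.4f}, b = {b:.4f}")
```

The gradient sums inside each batch are independent of one another, which is the part that can be handed to separate workers.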

22 Parallel Gradient Descent
Training data is chunked into batches and distributed

    Initialize w and α
    Loop until convergence:
        generate randomly sampled chunk of data m
        on each worker machine v:
            ∇L_v(w) = sum(∇L_i(w))  // compute gradient on batch
        w = w - α sum(∇L_v(w))  // update global w model
    return w
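A single-machine sketch of this loop, with threads standing in for worker machines, might look as follows. Everything concrete here is an assumption for illustration: the linear model, the toy data, the chunk size `m`, and the use of `ThreadPoolExecutor`. Python threads do not give real CPU parallelism for this workload, so the sketch shows the synchronization pattern and not a speedup.

```python
import random
from concurrent.futures import ThreadPoolExecutor


def chunk_gradient(w, chunk):
    """Sum of per-example gradients of 0.5 * (a*x + b - y)^2 over one chunk."""
    a, b = w
    ga = sum(((a * x + b) - y) * x for x, y in chunk)
    gb = sum(((a * x + b) - y) for x, y in chunk)
    return ga, gb


def parallel_gd(data, alpha=0.05, iters=1000, workers=3, m=4, seed=0):
    rng = random.Random(seed)
    w = (0.0, 0.0)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for _ in range(iters):
            # One randomly sampled chunk of m examples per worker.
            chunks = [rng.sample(data, m) for _ in range(workers)]
            # Every worker computes its batch gradient at the same w;
            # list(...) waits for all of them before the update.
            grads = list(pool.map(lambda c: chunk_gradient(w, c), chunks))
            ga = sum(g[0] for g in grads)
            gb = sum(g[1] for g in grads)
            w = (w[0] - alpha * ga, w[1] - alpha * gb)  # update global model
    return w


# Toy dataset (hypothetical): points on the line y = 2x + 1.
data = [(i / 10, 2.0 * (i / 10) + 1.0) for i in range(11)]
w = parallel_gd(data)

if __name__ == "__main__":
    print(f"a = {w[0]:.4f}, b = {w[1]:.4f}")
```

The wait before each update is the barrier that makes the scheme synchronous: no worker starts the next round until the slowest one has finished.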

23 HOGWILD! (Niu, et al. 2011)
Unclear why it is called this
Idea:
- In Parallel SGD, each batch needs to finish before starting the next pass
- In HOGWILD!, share the global model amongst all machines and update on-the-fly
- No need to wait for all worker machines to finish before starting the next epoch
- Assumption: component-wise addition is atomic and does not require locking

24 HOGWILD! - Pseudocode

    Initialize global model w
    On each worker machine:
        loop until convergence:
            draw a sample e from complete dataset E
            get current global state w and compute ∇L_e(w)
            for each component v in e:
                w_v = w_v - α b_v^T ∇L_e(w)  // b_v is the v-th standard basis vector; update global w
    return w
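The lock-free update pattern can be imitated with threads that share one list. This is a toy sketch under stated assumptions, not the implementation from the paper: the sparse least-squares problem, `true_w`, the index pairs, and all hyperparameters are made up for illustration, each example touches only two components of w, and Python's global interpreter lock means the threads interleave instead of running truly in parallel.

```python
import random
import threading


def hogwild_worker(w, examples, alpha, steps, seed):
    """Repeatedly sample an example and update shared w without any lock."""
    rng = random.Random(seed)
    for _ in range(steps):
        i, j, y = rng.choice(examples)  # draw a sample e
        err = (w[i] + w[j]) - y  # read current global state, no lock
        # Loss is 0.5 * err^2, so the gradient is err on components i and j
        # and zero everywhere else: only those components are written.
        for v in (i, j):
            w[v] -= alpha * err


def hogwild(examples, dim, alpha=0.1, steps=2000, n_workers=4):
    w = [0.0] * dim  # global model shared by all workers
    threads = [
        threading.Thread(target=hogwild_worker, args=(w, examples, alpha, steps, k))
        for k in range(n_workers)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return w


# Hypothetical sparse problem: each example observes the sum of two weights.
true_w = [1.0, 2.0, 3.0, 4.0]
pairs = [(0, 1), (1, 2), (2, 3), (3, 0), (0, 2)]
examples = [(i, j, true_w[i] + true_w[j]) for i, j in pairs]
w = hogwild(examples, dim=4)

if __name__ == "__main__":
    print([round(x, 4) for x in w])
```

Workers may occasionally overwrite each other's writes or read slightly stale values; the scheme tolerates this when updates are sparse, because two workers rarely touch the same component at the same moment.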

25 Comparison
[Figure: Parallel SGD vs. HOGWILD!, showing model states W_0, W_x, W_1, W_2 and per-worker gradients G_A, G_B, G_C]

26 Comparison
RR (Round Robin)
- Each machine updates x as it comes in
- Wait for all before starting the next pass
AIG
- Like HOGWILD! but does fine-grained locking of the variables that are going to be used

27 Comparison (2)
[Figure panels: SVM, Graph Cuts, Matrix Completion]

28 Moral of the story
Having an idea of how gradient descent works informs your use of others' implementations
There are very good implementations of the algorithm and other approaches to optimization in many languages
Packages:
- Python: NumPy/SciPy
- Matlab: Matlab Optimization Toolbox, Pmtk3
- R: general-purpose optimization optim(), R Optimization Infrastructure (ROI)
- TupleWare: coming soon

29 Resources
Partial Derivatives:
- http://msemac.redwoods.edu/~darnold/math50c/matlab/pderiv/index.xhtml
- http://mathinsight.org/nondifferentiable_discontinuous_partial_derivatives
- http://
Gradients Vector Field Interactive Visualization:
- http://dlippman.imathas.com/g1/grapher.html from https:// 1
- http://simmakers.com/wp-content/uploads/sof/gradient.gif
Gradient Descent:
- http://en.wikipedia.org/wiki/gradient_descent
- http:// (Stanford ML Lecture 2)
- http://en.wikipedia.org/wiki/stochastic_gradient_descent
- Murphy, Machine Learning: A Probabilistic Perspective, 2012, MIT Press
Hogwild paper:
- http://pages.cs.wisc.edu/~brecht/papers/hogwildtr.pdf
