Machine Learning Basics: Stochastic Gradient Descent. Sargur N. Srihari

Machine Learning Basics: Stochastic Gradient Descent. Sargur N. Srihari (srihari@cedar.buffalo.edu)

Topics
1. Learning Algorithms
2. Capacity, Overfitting and Underfitting
3. Hyperparameters and Validation Sets
4. Estimators, Bias and Variance
5. Maximum Likelihood Estimation
6. Bayesian Statistics
7. Supervised Learning Algorithms
8. Unsupervised Learning Algorithms
9. Stochastic Gradient Descent
10. Building a Machine Learning Algorithm
11. Challenges Motivating Deep Learning

Stochastic Gradient Descent (SGD). Nearly all of deep learning is powered by one very important algorithm: SGD. SGD is an extension of the gradient descent algorithm. A recurring problem in machine learning is that large training sets are necessary for good generalization, but large training sets are also computationally expensive.

Method of Gradient Descent. The criterion f(x) is minimized by moving from the current solution in the direction of the negative of the gradient. Steepest descent proposes a new point

x′ = x − ε ∇_x f(x)

where ε is the learning rate, a positive scalar, set to a small constant.
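A minimal gradient-descent sketch, purely illustrative and not from the slides: it minimizes a simple quadratic f(x) = (x − 3)² with the update x′ = x − ε f′(x); the function names and the step count are assumptions made for the example.

```python
# Illustrative sketch of steepest descent: x' = x - eps * grad_f(x).
def gradient_descent(grad_f, x0, eps=0.1, n_steps=100):
    """Repeatedly step in the direction of the negative gradient."""
    x = x0
    for _ in range(n_steps):
        x = x - eps * grad_f(x)          # x' = x - eps * grad f(x)
    return x

# f(x) = (x - 3)^2, so grad f(x) = 2 * (x - 3); the iterates approach x = 3.
x_min = gradient_descent(lambda x: 2.0 * (x - 3.0), x0=0.0)
print(x_min)
```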

Cost function is a sum over samples. The criterion in machine learning is a cost function. The cost function often decomposes as a sum of per-sample loss functions. E.g., the negative conditional log-likelihood of the training data is

J(θ) = E_{x,y∼p̂_data} L(x, y, θ) = (1/m) Σ_{i=1}^{m} L(x^(i), y^(i), θ)

where m is the number of samples and L is the per-example loss L(x, y, θ) = −log p(y | x; θ).
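A hypothetical sketch of this decomposition, assuming a binary logistic-regression model p(y = 1 | x; θ) = sigmoid(θ·x) (the slides do not fix a particular model): the cost J(θ) is the average of the per-example loss L(x, y, θ) = −log p(y | x; θ).

```python
import numpy as np

def per_example_nll(x, y, theta):
    """L(x, y, theta) = -log p(y | x; theta) for binary logistic regression."""
    p = 1.0 / (1.0 + np.exp(-x @ theta))              # p(y = 1 | x; theta)
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

def cost(X, Y, theta):
    """J(theta) = (1/m) * sum_i L(x^(i), y^(i), theta)."""
    m = X.shape[0]
    return sum(per_example_nll(X[i], Y[i], theta) for i in range(m)) / m
```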

Gradient is also a sum over samples. For these additive cost functions, gradient descent requires computing

∇_θ J(θ) = (1/m) Σ_{i=1}^{m} ∇_θ L(x^(i), y^(i), θ)

The computational cost of this operation is O(m). As the training set size grows to billions, the time taken for a single gradient step becomes prohibitively long.
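A sketch of the full-batch gradient for the same hypothetical logistic-regression loss: one per-example gradient is computed for every training point, so each descent step costs O(m). The per-example gradient (p − y)·x is a standard result for this loss, not derived on the slides.

```python
import numpy as np

def batch_gradient(X, Y, theta):
    """grad_theta J(theta) = (1/m) * sum_i grad_theta L(x^(i), y^(i), theta)."""
    m = X.shape[0]
    grad = np.zeros_like(theta)
    for i in range(m):                               # O(m): touches every example
        p = 1.0 / (1.0 + np.exp(-X[i] @ theta))      # p(y = 1 | x^(i); theta)
        grad += (p - Y[i]) * X[i]
    return grad / m
```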

Insight of SGD. Insight: the gradient used by gradient descent is an expectation. The expectation may be approximated using a small set of samples. In each step of SGD we can sample a minibatch of examples B = {x^(1), …, x^(m′)} drawn uniformly from the training set. The minibatch size m′ is typically chosen to be small: from 1 to a hundred. Crucially, m′ is held fixed even if the sample set is in the billions. We may fit a training set with billions of examples using updates computed on only a hundred examples.

SGD estimate on a minibatch. The estimate of the gradient is formed as

g = (1/m′) Σ_{i=1}^{m′} ∇_θ L(x^(i), y^(i), θ)

using the examples from minibatch B. SGD then follows the estimated gradient downhill:

θ ← θ − ε g

where ε is the learning rate.
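A minimal SGD sketch for the same hypothetical logistic-regression loss: each step draws a minibatch of m′ examples uniformly from the training set, forms the gradient estimate g on that minibatch, and moves downhill by εg. The default values of the learning rate, batch size, and step count are assumptions for illustration.

```python
import numpy as np

def sgd(X, Y, theta0, eps=0.01, batch_size=100, n_steps=1000, seed=0):
    """Stochastic gradient descent with uniformly sampled minibatches."""
    rng = np.random.default_rng(seed)
    theta = theta0.copy()
    m = X.shape[0]
    for _ in range(n_steps):
        idx = rng.choice(m, size=batch_size, replace=False)  # minibatch B
        Xb, Yb = X[idx], Y[idx]
        p = 1.0 / (1.0 + np.exp(-Xb @ theta))                # predictions on B
        g = Xb.T @ (p - Yb) / batch_size                     # g = (1/m') sum_i grad L
        theta = theta - eps * g                              # theta <- theta - eps * g
    return theta
```

Note that the work per update depends only on batch_size (m′), not on the training set size m, which is the point made on the next slide.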

How good is SGD? In the past, gradient descent was regarded as slow and unreliable, and the application of gradient descent to non-convex optimization problems was regarded as unprincipled. SGD is not guaranteed to arrive at even a local minimum in reasonable time, but it often finds a very low value of the cost function quickly enough to be useful.

SGD and training set size. SGD is the main way to train large linear models on very large data sets. For a fixed model size, the cost per SGD update does not depend on the training set size m. As m → ∞, the model will eventually converge to its best possible test error before SGD has sampled every example in the training set. The asymptotic cost of training a model with SGD is O(1) as a function of m.

Deep Learning vs SVM. Prior to the advent of deep learning, the main way to learn nonlinear models was to use the kernel trick in combination with a linear model. Many kernel learning algorithms require constructing an m × m matrix G_{i,j} = k(x^(i), x^(j)). Constructing this matrix is O(m²).

Growth of interest in deep learning. In the beginning, interest grew because deep learning performed better on medium-sized data sets (thousands of examples). Additional interest in industry arose because it provided a scalable way of training nonlinear models on large datasets.
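A sketch of the O(m²) Gram-matrix construction mentioned above, using an RBF kernel k(x, x′) = exp(−γ‖x − x′‖²) purely as an illustration (the slides do not fix a particular kernel, and γ is an assumed parameter).

```python
import numpy as np

def gram_matrix(X, gamma=1.0):
    """Build G[i, j] = k(x^(i), x^(j)): m * m kernel evaluations, O(m^2) cost."""
    m = X.shape[0]
    G = np.zeros((m, m))
    for i in range(m):
        for j in range(m):                        # every pair of examples
            d2 = np.sum((X[i] - X[j]) ** 2)
            G[i, j] = np.exp(-gamma * d2)         # G[i, j] = k(x^(i), x^(j))
    return G
```

Both the runtime and the memory of this construction grow quadratically with the number of training examples, which is what limits kernel methods on very large datasets, whereas the per-update cost of SGD stays fixed.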