Machine Learning Basics. Sargur N. Srihari

Machine Learning Basics
Sargur N. Srihari, srihari@cedar.buffalo.edu

Overview
- Deep learning is a specific type of ML.
- It is necessary to have a solid understanding of the basic principles of ML.

Topics
- Stochastic Gradient Descent
- Building a Machine Learning Algorithm
- Challenges Motivating Deep Learning

Stochastic Gradient Descent
- Nearly all of deep learning is powered by one very important algorithm: SGD.
- SGD is an extension of the gradient descent algorithm.
- A recurring problem in ML: large training sets are necessary for good generalization, but large training sets are also computationally expensive.

Method of Gradient Descent
- Steepest descent proposes a new point
  $x' = x - \epsilon \nabla_x f(x)$
  where $\epsilon$ is the learning rate, a positive scalar set to a small constant.
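Below is a minimal NumPy sketch of this update on a toy quadratic $f(x) = x^{\top}x$; the function, starting point, and learning rate are illustrative choices, not anything specified on the slide.

```python
# Steepest descent on f(x) = x^T x, whose gradient is 2x.
# The function, start point, and learning rate are illustrative.
import numpy as np

def grad_f(x):
    return 2.0 * x          # gradient of f(x) = x^T x

x = np.array([3.0, -2.0])   # arbitrary starting point
eps = 0.1                   # learning rate: small, positive, constant
for _ in range(100):
    x = x - eps * grad_f(x) # update: x' = x - eps * grad_x f(x)

print(x)                    # approaches the minimizer [0, 0]
```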

Cost Function Is a Sum over Samples
- The cost function often decomposes as a sum of per-sample loss functions.
- E.g., the negative conditional log-likelihood of the training data is
  $J(\theta) = \mathbb{E}_{x,y \sim \hat{p}_{\text{data}}} L(x,y,\theta) = \frac{1}{m} \sum_{i=1}^{m} L(x^{(i)}, y^{(i)}, \theta)$
  where $m$ is the number of samples and $L$ is the per-example loss $L(x,y,\theta) = -\log p(y \mid x; \theta)$.
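As a concrete (hypothetical) example of this decomposition, the sketch below computes $J(\theta)$ as the average of per-example negative log-likelihoods for a logistic model on synthetic data; the model, data, and helper names are assumptions made only for illustration.

```python
# Cost as an average of per-example losses L(x, y, theta) = -log p(y | x; theta),
# shown here for a logistic (Bernoulli) model on synthetic data.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                           # m = 100 examples, 3 features
y = (X @ np.array([1.0, -2.0, 0.5]) > 0).astype(float)  # synthetic binary labels
theta = np.zeros(3)

def per_example_loss(x_i, y_i, theta):
    p = 1.0 / (1.0 + np.exp(-x_i @ theta))              # p(y = 1 | x; theta)
    return -(y_i * np.log(p) + (1.0 - y_i) * np.log(1.0 - p))

J = np.mean([per_example_loss(X[i], y[i], theta) for i in range(len(y))])
print(J)   # equals log(2) ~ 0.693 when theta = 0
```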

Gradient Is Also a Sum over Samples
- Gradient descent requires computing
  $\nabla_\theta J(\theta) = \frac{1}{m} \sum_{i=1}^{m} \nabla_\theta L(x^{(i)}, y^{(i)}, \theta)$
- The computational cost of this operation is $O(m)$.
- As the training set size grows to billions, the time taken for a single gradient step becomes prohibitively long.

Insight of SGD
- Gradient descent uses an expectation of the gradient.
- That expectation may be approximated using a small set of samples.
- In each step of SGD we sample a minibatch of examples $B = \{x^{(1)}, \ldots, x^{(m')}\}$ drawn uniformly from the training set.
- The minibatch size $m'$ is typically chosen to be small: from 1 to a hundred.
- Crucially, $m'$ is held fixed even if the training set contains billions of examples.
- We may fit a training set with billions of examples using updates computed on only a hundred examples.

SGD Estimate on a Minibatch
- The estimate of the gradient is formed as
  $g = \frac{1}{m'} \sum_{i=1}^{m'} \nabla_\theta L(x^{(i)}, y^{(i)}, \theta)$
  using the examples from minibatch $B$.
- SGD then follows the estimated gradient downhill:
  $\theta \leftarrow \theta - \epsilon g$
  where $\epsilon$ is the learning rate.
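The following sketch puts the two formulas together for a linear model with squared loss on synthetic data; the minibatch size, learning rate, and data are illustrative assumptions, with m_prime and eps standing in for $m'$ and $\epsilon$.

```python
# Minibatch SGD for linear regression with squared loss on synthetic data.
import numpy as np

rng = np.random.default_rng(0)
m, d = 10_000, 5
X = rng.normal(size=(m, d))
true_w = rng.normal(size=d)
y = X @ true_w + 0.01 * rng.normal(size=m)

theta = np.zeros(d)
eps = 0.01          # learning rate epsilon
m_prime = 64        # minibatch size m', held fixed regardless of m

for step in range(2000):
    idx = rng.integers(0, m, size=m_prime)       # sample minibatch B uniformly
    Xb, yb = X[idx], y[idx]
    # g = (1/m') * sum_i grad_theta L(x_i, y_i, theta) for L = (x^T theta - y)^2
    g = (2.0 / m_prime) * Xb.T @ (Xb @ theta - yb)
    theta = theta - eps * g                      # theta <- theta - eps * g

print(np.max(np.abs(theta - true_w)))            # small: the estimate is close to true_w
```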

How Good Is SGD?
- In the past, gradient descent was regarded as slow and unreliable.
- Applying gradient descent to non-convex optimization problems was regarded as unprincipled.
- SGD is not guaranteed to arrive at even a local minimum in reasonable time.
- But it often finds a very low value of the cost function quickly enough to be useful.

SGD and Training Set Size
- SGD is the main way to train large linear models on very large datasets.
- For a fixed model size, the cost per SGD update does not depend on the training set size $m$.
- As $m \to \infty$, the model will eventually converge to its best possible test error before SGD has sampled every example in the training set.
- The asymptotic cost of training a model with SGD is $O(1)$ as a function of $m$.

Deep Learning vs. SVM
- Prior to the advent of DL, the main way to learn nonlinear models was to use the kernel trick in combination with a linear model.
- Many kernel learning algorithms require constructing an $m \times m$ matrix $G_{i,j} = k(x^{(i)}, x^{(j)})$.
- Constructing this matrix is $O(m^2)$ (see the sketch after this slide).
Growth of Interest in Deep Learning
- In the beginning, because it performed better on medium-sized datasets (thousands of examples).
- Additional interest in industry came because it provided a scalable way of training nonlinear models on large datasets.
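Purely as an illustration of the quadratic cost, the sketch below builds a Gram matrix with an RBF kernel on synthetic data; the particular kernel and sizes are assumptions, not anything specified on the slide.

```python
# Building the m x m Gram matrix G[i, j] = k(x_i, x_j) takes O(m^2) time and memory;
# the RBF kernel used here is just one common choice.
import numpy as np

rng = np.random.default_rng(0)
m, d = 1_000, 10
X = rng.normal(size=(m, d))

sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)   # (m, m) pairwise squared distances
G = np.exp(-0.5 * sq_dists)                                        # RBF kernel Gram matrix
print(G.shape)   # (1000, 1000): storage alone grows quadratically with m
```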

Building an ML Algorithm
- All ML algorithms are instances of a simple recipe:
  1. A specification of a dataset
  2. A cost function
  3. An optimization procedure
  4. A model
- An example of building a linear regression algorithm is shown next.

Building a Linear Regression Algorithm
1. Dataset: $X$ and $y$
2. Cost function: $J(w,b) = -\mathbb{E}_{x,y \sim \hat{p}_{\text{data}}} \log p_{\text{model}}(y \mid x)$
3. Model specification: $p_{\text{model}}(y \mid x) = \mathcal{N}(y; x^{\top} w + b, 1)$
4. Optimization algorithm: solve for where the gradient of the cost is zero, using the normal equations.
- We can replace any of these components largely independently of the others and obtain a variety of algorithms.
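A minimal sketch of this recipe on synthetic data follows; absorbing the bias $b$ into the weight vector via a column of ones is an implementation choice, not something stated on the slide.

```python
# Linear regression by the normal equations: dataset (X, y), Gaussian likelihood,
# negative log-likelihood cost, and a closed-form optimizer.
import numpy as np

rng = np.random.default_rng(0)
m, d = 200, 3
X = rng.normal(size=(m, d))
w_true, b_true = np.array([2.0, -1.0, 0.5]), 0.3
y = X @ w_true + b_true + 0.1 * rng.normal(size=m)

Xb = np.hstack([X, np.ones((m, 1))])             # absorb the bias b into w

# Normal equations: set the gradient of the cost to zero and solve the linear system.
w_hat = np.linalg.solve(Xb.T @ Xb, Xb.T @ y)
print(w_hat)   # approximately [2.0, -1.0, 0.5, 0.3]
```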

About the Cost Function
- The cost function includes at least one term that causes the learning process to perform statistical estimation.
- The most common cost is the negative log-likelihood; minimizing the cost maximizes the likelihood.
- The cost function may include additional terms. E.g., we can add weight decay to get
  $J(w,b) = \lambda \|w\|_2^2 - \mathbb{E}_{x,y \sim \hat{p}_{\text{data}}} \log p_{\text{model}}(y \mid x)$
  which still allows closed-form optimization (sketched below).
- If we change the model to be nonlinear, most cost functions can no longer be optimized in closed form. This requires numerical optimization, such as gradient descent.
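A short sketch of the closed-form solution with weight decay (i.e., ridge regression) on synthetic data; the exact scale of the regularizer depends on the constant in front of the squared-error term, which is folded into lam here, and the names are illustrative.

```python
# Weight decay keeps linear regression solvable in closed form:
# the normal equations gain a regularization term on the diagonal (ridge regression).
import numpy as np

rng = np.random.default_rng(0)
m, d = 200, 3
X = rng.normal(size=(m, d))
y = X @ np.array([2.0, -1.0, 0.5]) + 0.1 * rng.normal(size=m)

lam = 0.1   # weight decay strength lambda (illustrative)
w_ridge = np.linalg.solve(X.T @ X + lam * m * np.eye(d), X.T @ y)
print(w_ridge)   # shrunk slightly toward zero relative to the unregularized solution
```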

Challenges Motivating DL
- Simple ML algorithms work very well on a wide variety of important problems.
- However, they have not succeeded in solving the central problems of AI, such as recognizing speech or recognizing objects.