Linear Regression & Gradient Descent

Linear Regression & Gradient Descent These slides were assembled by Byron Boots, with grateful acknowledgement to Eric Eaton and the many others who made their course materials freely available online. Feel free to reuse or adapt these slides for your own academic purposes, provided that you include proper attribution. Robot Image Credit: Viktoriya Sukhanova 123RF.com

Regression
Given: data $X = \{x^{(1)}, \dots, x^{(n)}\}$ where $x^{(i)} \in \mathbb{R}^d$, and corresponding labels $y = \{y^{(1)}, \dots, y^{(n)}\}$ where $y^{(i)} \in \mathbb{R}$.
[Figure: September Arctic Sea Ice Extent (1,000,000 sq km) vs. Year, 1970–2020, with linear and quadratic regression fits. Data from G. Witt, Journal of Statistics Education, Volume 21, Number 1 (2013).]

Linear Regression
Hypothesis: $y = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \dots + \theta_d x_d = \sum_{j=0}^{d} \theta_j x_j$, where we assume $x_0 = 1$.
Fit the model by minimizing the sum of squared errors.
Figures are courtesy of Greg Shakhnarovich.

Least Squares Linear Regression
Cost function: $J(\theta) = \frac{1}{2n} \sum_{i=1}^{n} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2$
Fit by solving $\min_{\theta} J(\theta)$.
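As a concrete illustration, here is a minimal NumPy sketch of this cost function (the names `predict` and `cost` and the shape conventions are illustrative, not from the slides; `X` is assumed to be an n x (d+1) matrix whose first column is all ones so that $x_0 = 1$):

```python
import numpy as np

def predict(X, theta):
    # h_theta(x) = sum_j theta_j * x_j for every example at once;
    # X has shape (n, d+1) with a leading column of ones (x_0 = 1).
    return X @ theta

def cost(X, y, theta):
    # J(theta) = (1 / 2n) * sum_i (h_theta(x^(i)) - y^(i))^2
    n = len(y)
    residuals = predict(X, theta) - y
    return residuals @ residuals / (2 * n)
```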

Intuition Behind Cost Function
[Sequence of figures from slides by Andrew Ng: on the left, $h_\theta(x)$ for fixed $\theta$ (a function of $x$); on the right, $J(\theta)$ as a function of the parameters $\theta$.]

Basic Search Procedure
Choose an initial value for $\theta$. Until we reach a minimum, choose a new value for $\theta$ that reduces $J(\theta)$.
[Figure by Andrew Ng: surface plot of $J(\theta_0, \theta_1)$ over $\theta_0$ and $\theta_1$.]
Since the least squares objective function is convex, we don't need to worry about local minima in linear regression.

Gradient Descent
Initialize $\theta$. Repeat until convergence: $\theta_j \leftarrow \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta)$ (simultaneous update for $j = 0 \dots d$).
$\alpha$ is the learning rate (small), e.g., $\alpha = 0.05$.
[Figure: $J(\theta)$ plotted against $\theta$, illustrating the descent steps.]

Gradient Descent
Initialize $\theta$. Repeat until convergence: $\theta_j \leftarrow \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta)$ (simultaneous update for $j = 0 \dots d$).
For linear regression:
$\frac{\partial}{\partial \theta_j} J(\theta) = \frac{\partial}{\partial \theta_j} \frac{1}{2n} \sum_{i=1}^{n} \left( \sum_{k=0}^{d} \theta_k x_k^{(i)} - y^{(i)} \right)^2$
$= \frac{1}{n} \sum_{i=1}^{n} \left( \sum_{k=0}^{d} \theta_k x_k^{(i)} - y^{(i)} \right) \frac{\partial}{\partial \theta_j} \left( \sum_{k=0}^{d} \theta_k x_k^{(i)} - y^{(i)} \right)$
$= \frac{1}{n} \sum_{i=1}^{n} \left( \sum_{k=0}^{d} \theta_k x_k^{(i)} - y^{(i)} \right) x_j^{(i)} = \frac{1}{n} \sum_{i=1}^{n} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}$

Gradient Descent for Linear Regression
Initialize $\theta$. Repeat until convergence: $\theta_j \leftarrow \theta_j - \alpha \frac{1}{n} \sum_{i=1}^{n} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}$ (simultaneous update for $j = 0 \dots d$).
To achieve a simultaneous update, compute $h_\theta(x^{(i)})$ at the start of each gradient descent iteration and use this stored value in the update loop.
Assume convergence when $\|\theta_{\text{new}} - \theta_{\text{old}}\|_2 < \epsilon$, where the $L_2$ norm is $\|v\|_2 = \sqrt{\sum_i v_i^2} = \sqrt{v_1^2 + v_2^2 + \dots + v_{|v|}^2}$.
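A small sketch of this procedure in NumPy, reusing the hypothetical `predict` helper above; the learning rate, tolerance, and iteration cap are illustrative defaults, not values prescribed by the slides:

```python
def gradient_descent(X, y, alpha=0.05, tol=1e-6, max_iters=10000):
    n, d = X.shape
    theta = np.zeros(d)
    for _ in range(max_iters):
        # Compute h_theta(x^(i)) once per iteration, then update every theta_j
        # simultaneously from these stored predictions.
        errors = predict(X, theta) - y        # shape (n,)
        grad = X.T @ errors / n               # dJ/dtheta_j for all j at once
        theta_new = theta - alpha * grad
        # Declare convergence when the L2 norm of the parameter change is small.
        if np.linalg.norm(theta_new - theta) < tol:
            return theta_new
        theta = theta_new
    return theta
```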

Gradient Descent
[Sequence of figures from slides by Andrew Ng showing successive gradient descent iterations: on the left, $h_\theta(x)$ for fixed $\theta$ (a function of $x$), starting from $h(x) = -900 - 0.1x$; on the right, $J(\theta)$ as a function of the parameters $\theta$, with each step moving toward the minimum.]

Choosing α
If α is too small: slow convergence.
If α is too large: $J(\theta)$ may increase, gradient descent may overshoot the minimum, may fail to converge, or may even diverge.
To see if gradient descent is working, print out $J(\theta)$ each iteration. The value should decrease at each iteration; if it doesn't, adjust α.
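One practical way to do this check, sketched below with the hypothetical helpers from earlier; the candidate learning rates are arbitrary:

```python
def diagnose_learning_rates(X, y, alphas=(0.001, 0.01, 0.05, 0.5), iters=100):
    n, d = X.shape
    for alpha in alphas:
        theta = np.zeros(d)
        history = []
        for _ in range(iters):
            errors = predict(X, theta) - y
            theta = theta - alpha * X.T @ errors / n
            history.append(cost(X, y, theta))
        # J(theta) should fall every iteration; if it grows, alpha is too large.
        ok = all(b <= a for a, b in zip(history, history[1:]))
        print(f"alpha={alpha}: final J={history[-1]:.4g}, monotone decrease: {ok}")
```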

Extending Linear Regression to More Complex Models
The inputs X for linear regression can be:
- Original quantitative inputs
- Transformations of quantitative inputs, e.g., log, exp, square root, square
- Polynomial transformations, e.g., $y = \beta_0 + \beta_1 x + \beta_2 x^2 + \beta_3 x^3$
- Basis expansions
- Dummy coding of categorical inputs
- Interactions between variables, e.g., $x_3 = x_1 \cdot x_2$
This allows linear regression techniques to be used to fit non-linear datasets.
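For example, a small sketch of assembling such transformed inputs into a design matrix (the particular columns chosen here are illustrative):

```python
def expand_features(x1, x2):
    # Columns: bias, original inputs, polynomial terms of x1, and an
    # interaction term x3 = x1 * x2, all usable by ordinary linear regression.
    ones = np.ones_like(x1)
    return np.column_stack([ones, x1, x2, x1**2, x1**3, x1 * x2])
```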

Linear Basis Function Models
Generally, $h_\theta(x) = \sum_{j=0}^{d} \theta_j \phi_j(x)$, where $\phi_j(x)$ is a basis function.
Typically $\phi_0(x) = 1$, so that $\theta_0$ acts as a bias.
In the simplest case, we use linear basis functions: $\phi_j(x) = x_j$.
Based on slide by Christopher Bishop (PRML)

Linear Basis Function Models
Polynomial basis functions: $\phi_j(x) = x^j$. These are global; a small change in $x$ affects all basis functions.
Gaussian basis functions: $\phi_j(x) = \exp\left(-\frac{(x - \mu_j)^2}{2s^2}\right)$. These are local; a small change in $x$ only affects nearby basis functions. $\mu_j$ and $s$ control location and scale (width).
Based on slide by Christopher Bishop (PRML)

Linear Basis Function Models
Sigmoidal basis functions: $\phi_j(x) = \sigma\left(\frac{x - \mu_j}{s}\right)$, where $\sigma(a) = \frac{1}{1 + \exp(-a)}$.
These are also local; a small change in $x$ only affects nearby basis functions. $\mu_j$ and $s$ control location and scale (slope).
Based on slide by Christopher Bishop (PRML)
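A minimal sketch of these basis expansions, assuming the standard Gaussian and sigmoidal forms given above; the centers $\mu_j$ and scale $s$ are free choices:

```python
def gaussian_basis(x, centers, s):
    # phi_j(x) = exp(-(x - mu_j)^2 / (2 s^2)): local, bump-shaped features.
    return np.exp(-(x[:, None] - centers[None, :]) ** 2 / (2 * s ** 2))

def sigmoidal_basis(x, centers, s):
    # phi_j(x) = sigma((x - mu_j) / s) with sigma(a) = 1 / (1 + exp(-a)).
    return 1.0 / (1.0 + np.exp(-(x[:, None] - centers[None, :]) / s))
```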

Example of Fitting a Polynomial Curve with a Linear Model
$y = \theta_0 + \theta_1 x + \theta_2 x^2 + \dots + \theta_p x^p = \sum_{j=0}^{p} \theta_j x^j$
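As a worked sketch of this idea, one can build the polynomial design matrix and fit θ by ordinary least squares (a closed-form solve is used here instead of gradient descent; the degree p is a free parameter):

```python
def fit_polynomial(x, y, p):
    # Design matrix with columns x^0, x^1, ..., x^p, so y ≈ sum_j theta_j x^j.
    Phi = np.column_stack([x ** j for j in range(p + 1)])
    theta, *_ = np.linalg.lstsq(Phi, y, rcond=None)
    return theta
```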

Quality of Fit
[Three figures of Price vs. Size: underfitting (high bias), correct fit, and overfitting (high variance).]
Overfitting: the learned hypothesis may fit the training set very well ($J(\theta) \approx 0$) but fail to generalize to new examples.
Based on example by Andrew Ng

Regularization
A method for automatically controlling the complexity of the learned hypothesis.
Idea: penalize large values of $\theta_j$; this penalty can be incorporated into the cost function.
Works well when we have many features, each of which contributes a bit to predicting the label.
Overfitting can also be addressed by eliminating features (either manually or via model selection).

Regularization
Linear regression objective function: $J(\theta) = \underbrace{\frac{1}{2n} \sum_{i=1}^{n} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2}_{\text{model fit to data}} + \underbrace{\frac{\lambda}{2} \sum_{j=1}^{d} \theta_j^2}_{\text{regularization}}$
$\lambda$ is the regularization parameter ($\lambda \geq 0$). No regularization on $\theta_0$!

Understanding Regularization
$J(\theta) = \frac{1}{2n} \sum_{i=1}^{n} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2 + \frac{\lambda}{2} \sum_{j=1}^{d} \theta_j^2$
Note that $\sum_{j=1}^{d} \theta_j^2 = \|\theta_{1:d}\|_2^2$; this is the squared magnitude of the feature coefficient vector.
We can also think of this as $\sum_{j=1}^{d} (\theta_j - 0)^2 = \|\theta_{1:d} - \vec{0}\|_2^2$.
$L_2$ regularization pulls the coefficients toward 0.

Understanding Regularization
$J(\theta) = \frac{1}{2n} \sum_{i=1}^{n} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2 + \frac{\lambda}{2} \sum_{j=1}^{d} \theta_j^2$
What happens if we set $\lambda$ to be huge (e.g., $10^{10}$)? [Figure: Price vs. Size — the penalty drives $\theta_1, \dots, \theta_d$ toward 0, leaving a nearly flat fit.]
Based on example by Andrew Ng

Regularized Linear Regression
Cost function: $J(\theta) = \frac{1}{2n} \sum_{i=1}^{n} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2 + \frac{\lambda}{2} \sum_{j=1}^{d} \theta_j^2$
Fit by solving $\min_{\theta} J(\theta)$.
Gradient update:
$\theta_0 \leftarrow \theta_0 - \alpha \frac{1}{n} \sum_{i=1}^{n} \left( h_\theta(x^{(i)}) - y^{(i)} \right)$
$\theta_j \leftarrow \theta_j - \alpha \left[ \frac{1}{n} \sum_{i=1}^{n} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)} + \lambda \theta_j \right]$ (the $\lambda \theta_j$ term comes from the regularization)

Regularized Linear Regression
$J(\theta) = \frac{1}{2n} \sum_{i=1}^{n} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2 + \frac{\lambda}{2} \sum_{j=1}^{d} \theta_j^2$
$\theta_0 \leftarrow \theta_0 - \alpha \frac{1}{n} \sum_{i=1}^{n} \left( h_\theta(x^{(i)}) - y^{(i)} \right)$
$\theta_j \leftarrow \theta_j - \alpha \left[ \frac{1}{n} \sum_{i=1}^{n} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)} + \lambda \theta_j \right]$
We can rewrite the gradient step for $j \geq 1$ as: $\theta_j \leftarrow \theta_j (1 - \alpha \lambda) - \alpha \frac{1}{n} \sum_{i=1}^{n} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}$
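A hedged sketch of one such regularized update step, leaving $\theta_0$ out of the shrinkage as the slides specify (function and parameter names are illustrative):

```python
def ridge_gradient_step(X, y, theta, alpha, lam):
    # One gradient step on the L2-regularized least squares objective.
    n = len(y)
    errors = X @ theta - y
    theta_new = theta - alpha * X.T @ errors / n
    # Equivalently, theta_j <- theta_j * (1 - alpha * lam) - alpha * grad_j
    # for j >= 1; the bias term theta_0 is not regularized.
    theta_new[1:] -= alpha * lam * theta[1:]
    return theta_new
```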