CSE446: Linear Regression Spring 2017 Ali Farhadi Slides adapted from Carlos Guestrin and Luke Zettlemoyer
Prediction of continuous variables Billionaire says: Wait, that's not what I meant! You say: Chill out, dude. He says: I want to predict a continuous variable from continuous inputs: I want to predict salaries from GPA. You say: I can regress that!
Linear Regression [Figure: scatter plots of observed data with fitted regression lines; the vertical axis is labeled "Prediction"]
Ordinary Least Squares (OLS) [Figure: a fitted line through the data; the vertical gap between each observation and its prediction is the error, or residual]
The regression problem Instances: <xj, tj> Learn: a mapping from x to t(x) Hypothesis space: given basis functions {h1, …, hk}, find coefficients w = {w1, …, wk} Why is this usually called linear regression? The model is linear in the parameters. Can we estimate functions that are not lines?
Linear Basis: 1D input Need a bias term: {h1(x) = x, h2(x) = 1} [Figure: a line fit to 1D data, y vs. x]
Parabola: {h1(x) = x^2, h2(x) = x, h3(x) = 1} [Figure: a parabola fit to 1D data, y vs. x] 2D: {h1(x) = x1^2, h2(x) = x2^2, h3(x) = x1 x2, …} Can define any basis functions hi(x) for n-dimensional input x = <x1, …, xn>
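To make the basis-function idea concrete, here is a minimal sketch (my own code, not from the slides) that builds the design matrix for the parabola basis above; the toy inputs and the names x and H are assumptions.

import numpy as np

# Parabola basis on a 1D input: h1(x) = x^2, h2(x) = x, h3(x) = 1
x = np.linspace(-1.0, 1.0, 5)                     # 5 example inputs
H = np.column_stack([x**2, x, np.ones_like(x)])   # N x 3 design matrix, one column per basis function
print(H.shape)                                    # (5, 3)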
The regression problem (continued) Instances: <xj, tj> Learn: a mapping from x to t(x) Hypothesis space: given basis functions {h1, …, hk}, find coefficients w = {w1, …, wk} The model is linear in the parameters, but it can still fit functions that are not lines. Precisely, minimize the residual squared error: $w^* = \arg\min_w \sum_{j=1}^{N} \Big( t(x_j) - \sum_{i=1}^{k} w_i h_i(x_j) \Big)^2$
Regression: matrix notation In matrix form the residual squared error is $(Hw - t)^T (Hw - t)$, where H is the N × k matrix of basis-function values hi(xj) (N data points, k basis functions), w is the k × 1 vector of weights, and t is the N × 1 vector of observed outputs (measurements)
Regression: closed form solution Setting the gradient to zero, $\nabla_w (Hw - t)^T (Hw - t) = 2 H^T (Hw - t) = 0$, gives the normal equations $H^T H\, w = H^T t$
Regression solution: simple matrix math $w^* = (H^T H)^{-1} H^T t$, where $H^T H$ is a k × k matrix (for k basis functions) and $H^T t$ is a k × 1 vector
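A minimal NumPy sketch of this closed-form solution, solving the normal equations rather than forming an explicit inverse; the function name fit_ols and the toy data are my own, not from the slides.

import numpy as np

def fit_ols(H, t):
    # H: N x k matrix of basis-function values hi(xj); t: length-N vector of targets.
    # Solve the normal equations (H^T H) w = H^T t.
    return np.linalg.solve(H.T @ H, H.T @ t)

# Example: fit a line with basis {h1(x) = x, h2(x) = 1}
x = np.array([0.0, 1.0, 2.0, 3.0])
t = np.array([1.0, 3.1, 4.9, 7.2])
H = np.column_stack([x, np.ones_like(x)])   # N x 2 design matrix
w = fit_ols(H, t)
print(w)                                    # roughly [2.04, 0.99] (slope, bias)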
But, why? Billionaire (again) says: Why sum squared error??? You say: Gaussians, Dr. Gateson, Gaussians Model: the prediction is a linear function plus Gaussian noise: $t(x) = \sum_i w_i h_i(x) + \epsilon$, with $\epsilon \sim N(0, \sigma^2)$ Learn w using MLE:
Maximizing log-likelihood Maximize wrt w: Least-squares Linear Regression is MLE for Gaussians!!!
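The step this slide asserts can be spelled out in a couple of lines; a sketch of the derivation, using the same model $t(x) = \sum_i w_i h_i(x) + \epsilon$ with $\epsilon \sim N(0, \sigma^2)$ and data $(x_j, t_j)$:

\begin{align*}
\ln p(D \mid w)
  &= \sum_{j=1}^{N} \ln \mathcal{N}\!\Big(t_j \;\Big|\; \textstyle\sum_{i=1}^{k} w_i h_i(x_j),\ \sigma^2\Big) \\
  &= -\frac{N}{2}\ln(2\pi\sigma^2)
     - \frac{1}{2\sigma^2}\sum_{j=1}^{N}\Big(t_j - \sum_{i=1}^{k} w_i h_i(x_j)\Big)^2 .
\end{align*}

The first term does not depend on w, so maximizing the log-likelihood over w is exactly minimizing the sum of squared residuals.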
Regularization in Linear Regression One sign of overfitting: large parameter values! Regularized or penalized regression modifies the learning objective to penalize large parameters
Ridge Regression Introduce a new objective function: prefer low error, but also add a squared penalty on large weights λ is a hyperparameter that balances the tradeoff Explicitly write out the bias feature (essentially h0 = 1), which is not penalized
Ridge Regression in matrix notation $w_{ridge} = \arg\min_w \sum_{j=1}^{N} \Big( t(x_j) - \big(w_0 + \sum_{i=1}^{k} w_i h_i(x_j)\big) \Big)^2 + \lambda \sum_{i=1}^{k} w_i^2 = (Hw - t)^T (Hw - t) + \lambda\, w^T I_{0+k}\, w$ Here H is the N × (k+1) design matrix whose columns are the bias h0 = 1 and the k basis functions h1 … hk evaluated at the N data points, w is the (k+1) × 1 weight vector, t is the N × 1 vector of observed outputs, and $I_{0+k}$ is the (k+1) × (k+1) identity matrix but with a 0 in the upper-left entry, so the bias w0 is not penalized
Ridge Regression: closed form solution Minimizing the ridge objective gives $w_{ridge} = (H^T H + \lambda I_{0+k})^{-1} H^T t$ Compare to un-regularized regression: $w^* = (H^T H)^{-1} H^T t$
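A small NumPy sketch of this ridge closed form with the unpenalized bias; fit_ridge and the variable names are mine, and H is assumed to carry h0 = 1 as its first column.

import numpy as np

def fit_ridge(H, t, lam):
    # H: N x (k+1) design matrix whose first column is the bias h0 = 1; t: length-N targets.
    k_plus_1 = H.shape[1]
    I0 = np.eye(k_plus_1)
    I0[0, 0] = 0.0                                      # do not penalize the bias weight w0
    return np.linalg.solve(H.T @ H + lam * I0, H.T @ t)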
Ridge Regression How does varying lambda change w? Larger λ? Smaller λ? As λ → 0? Becomes the same as MLE, unregularized As λ → ∞? All (penalized) weights go to 0!
Ridge Coefficient Path [Figure from the Kevin Murphy textbook: feature weights plotted against λ, from larger λ to smaller λ] Typical approach: select λ using cross validation; more on this later in the quarter
How to pick lambda? Experimentation cycle: select a hypothesis f to best match the training set, then tune hyperparameters (try many different values of lambda and pick the best one) on a held-out set Or, do k-fold cross validation: no held-out set; divide the training set into k subsets, repeatedly train on k−1 of them and test on the remaining one, then average the results [Figure: data split into Training Parts 1 … K, Held-Out (Development) Data, and Test Data]
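As a rough illustration of the k-fold procedure above, here is a sketch that scores a single λ by cross validation; it reuses the hypothetical fit_ridge from the earlier sketch, and the grid of λ values is arbitrary.

import numpy as np

def cv_error(H, t, lam, k=5, seed=0):
    # Average validation error of ridge regression with penalty lam, over k folds.
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(t)), k)
    errs = []
    for f in range(k):
        val = folds[f]
        train = np.concatenate([folds[g] for g in range(k) if g != f])
        w = fit_ridge(H[train], t[train], lam)             # train on k-1 folds
        errs.append(np.mean((H[val] @ w - t[val]) ** 2))   # test on the held-out fold
    return np.mean(errs)

# Try several lambdas and keep the one with the lowest cross-validation error:
# lambdas = [0.01, 0.1, 1.0, 10.0]
# best_lam = min(lambdas, key=lambda lam: cv_error(H, t, lam))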
Why squared regularization? Ridge: penalty $\lambda \sum_i w_i^2$ LASSO: penalty $\lambda \sum_i |w_i|$ The linear penalty pushes more weights to exactly zero Allows for a type of feature selection But: not differentiable, and no closed form solution.
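Because the L1 penalty has no closed-form solution, LASSO is fit with an iterative solver in practice; a sketch using scikit-learn (an assumption, not the course's tooling), where sklearn's alpha plays the role of λ up to a rescaling of the objective.

import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
H = rng.normal(size=(50, 10))                                   # 50 points, 10 basis features
t = 3.0 * H[:, 0] - 2.0 * H[:, 1] + 0.1 * rng.normal(size=50)   # only two features actually matter

model = Lasso(alpha=0.1)    # alpha plays the role of lambda
model.fit(H, t)
print(model.coef_)          # most coefficients are driven exactly to zero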
Geometric intuition for Lasso and Ridge regression [Figure from Rob Tibshirani's slides: contours of the squared error around the MLE $\hat{w}$ in the (w1, w2) plane, with the diamond-shaped Lasso constraint region and the circular Ridge constraint region]
LASSO Coefficient Path [Figure from the Kevin Murphy textbook: feature weights plotted against λ, from larger λ to smaller λ]
Bias-Variance tradeoff Intuition Model too simple: does not fit the data well A biased solution Model too complex: small changes to the data, solution changes a lot A high-variance solution
Bias-Variance Tradeoff Choice of hypothesis class introduces learning bias More complex class → less bias More complex class → more variance
Training set error Given a dataset (training data), choose a loss function, e.g., squared error (L2) for regression Training error: for a particular set of parameters, the loss function on the training data: $error_{train}(w) = \frac{1}{N_{train}} \sum_{j=1}^{N_{train}} \Big( t(x_j) - \sum_i w_i h_i(x_j) \Big)^2$
Training error as a function of model complexity
Prediction error Training set error can be a poor measure of the quality of the solution Prediction error (true error): we really care about the error over all possible inputs: $error_{true}(w) = E_x\Big[ \big( t(x) - \sum_i w_i h_i(x) \big)^2 \Big] = \int_x \big( t(x) - \sum_i w_i h_i(x) \big)^2 p(x)\, dx$
Prediction error as a function of model complexity
Computing prediction error Computing the prediction error exactly requires a hard integral! May not know t(x) for every x, and may not know p(x) Monte Carlo integration (sampling approximation): sample a set of i.i.d. points {x1, …, xM} from p(x) and approximate the integral with the sample average: $error_{true}(w) \approx \frac{1}{M} \sum_{m=1}^{M} \big( t(x_m) - \sum_i w_i h_i(x_m) \big)^2$
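A small sketch of that sampling approximation, where sample_x, true_t, and predict are hypothetical stand-ins for draws from p(x), the true function t(x), and the learned model.

import numpy as np

def mc_prediction_error(sample_x, true_t, predict, M=10_000):
    # Monte Carlo estimate of E_x[(t(x) - prediction(x))^2].
    xs = np.array([sample_x() for _ in range(M)])    # x_1, ..., x_M drawn i.i.d. from p(x)
    return np.mean((true_t(xs) - predict(xs)) ** 2)  # sample average of the squared error

# Example with a known p(x) and t(x) and a deliberately imperfect model:
est = mc_prediction_error(
    sample_x=lambda: np.random.normal(),   # p(x) = N(0, 1)
    true_t=lambda x: 2.0 * x,              # true function t(x) = 2x
    predict=lambda x: 1.5 * x,             # learned model
)
print(est)                                 # roughly E[(0.5 x)^2] = 0.25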
Why training set error doesn't approximate prediction error Sampling approximation of prediction error: $\frac{1}{M} \sum_{m=1}^{M} \big( t(x_m) - \sum_i w_i h_i(x_m) \big)^2$ Training error: $\frac{1}{N_{train}} \sum_{j=1}^{N_{train}} \big( t(x_j) - \sum_i w_i h_i(x_j) \big)^2$ Very similar equations!!! So why is the training set a bad measure of prediction error???
Why training set error doesn't approximate prediction error: because you cheated!!! Training error is a good estimate for a single, fixed w, but you optimized w with respect to the training error and found a w that is good for this particular set of samples Training error is an (optimistically) biased estimate of prediction error
Test set error Given a dataset, randomly split it into two parts: training data {x1, …, xNtrain} and test data {x1, …, xNtest} Use the training data to optimize the parameters w Test set error: for the final solution w*, evaluate the error using the test points only: $error_{test}(w^*) = \frac{1}{N_{test}} \sum_{j=1}^{N_{test}} \big( t(x_j) - \sum_i w_i^* h_i(x_j) \big)^2$
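A sketch of the split-and-evaluate protocol, reusing the hypothetical fit_ols from the earlier sketch; the 80/20 split and the variable names are arbitrary choices.

import numpy as np

# H: N x k design matrix, t: length-N targets (as in the earlier sketches)
rng = np.random.default_rng(0)
idx = rng.permutation(len(t))
n_train = int(0.8 * len(t))
train, test = idx[:n_train], idx[n_train:]

w_star = fit_ols(H[train], t[train])                      # optimize w on the training split only
test_error = np.mean((H[test] @ w_star - t[test]) ** 2)   # evaluate the final w* on the test split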
Test set error as a function of model complexity
Overfitting: this slide is so important we are looking at it again! Assume: data generated from a distribution D(X, Y), and a hypothesis space H Define: errors for a hypothesis h ∈ H Training error: errortrain(h) Data (true) error: errortrue(h) We say h overfits the training data if there exists an h′ ∈ H such that: errortrain(h) < errortrain(h′) and errortrue(h) > errortrue(h′)
Summary: error estimators Gold standard: the prediction (true) error Training error: optimistically biased Test error: our final measure
Error as a function of the number of training examples, for a fixed model complexity [Figure: error curves from little data to infinite data; the gap that remains with infinite data is the bias]
Error as a function of the regularization parameter, for a fixed model complexity [Figure: error vs. λ, from λ = ∞ down to λ = 0]
Summary: error estimators Gold standard: the prediction (true) error Training error: optimistically biased Test error: our final measure Be careful!!! The test set is only unbiased if you never, never, ever, ever do any learning on the test data For example, if you use the test set to select the degree of the polynomial, it is no longer unbiased!!! (We will address this problem later in the quarter)
What you need to know Regression Basis function = features Optimizing sum squared error Relationship between regression and Gaussians Regularization Ridge regression math LASSO Formulation How to set lambda Bias-Variance trade-off