CSE446: Linear Regression. Spring 2017

1 CSE446: Linear Regression. Spring 2017. Ali Farhadi. Slides adapted from Carlos Guestrin and Luke Zettlemoyer.

2 Prediction of continuous variables. Billionaire says: "Wait, that's not what I meant!" You say: "Chill out, dude." He says: "I want to predict a continuous variable for continuous inputs: I want to predict salaries from GPA." You say: "I can regress that."

3 Linear Regression. [Figure; recoverable labels: Prediction, Prediction]

4 Ordinary Least Squares (OLS). [Figure; recoverable labels: Observation, Prediction, Error or residual]

5 The regression problem. Instances: $\langle x_j, t_j \rangle$. Learn: a mapping from $x$ to $t(x)$. Hypothesis space: given basis functions $\{h_1, \dots, h_k\}$, find coefficients $w = \{w_1, \dots, w_k\}$ so that $t(x) \approx \sum_i w_i h_i(x)$. Why is this usually called linear regression? Because the model is linear in the parameters. Can we estimate functions that are not lines?

6 Linear basis: 1D input. Need a bias term: $\{h_1(x) = x,\ h_2(x) = 1\}$. [Figure; axes: $x$, $y$]

7 Parabola: $\{h_1(x) = x^2,\ h_2(x) = x,\ h_3(x) = 1\}$. [Figure; axes: $x$, $y$] 2D: $\{h_1(x) = x_1^2,\ h_2(x) = x_2^2,\ h_3(x) = x_1 x_2,\ \dots\}$. Can define any basis functions $h_i(x)$ for $n$-dimensional input $x = \langle x_1, \dots, x_n \rangle$.
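
As a minimal sketch of how basis functions turn raw inputs into features, the snippet below evaluates the parabola basis from the slide at a few points. The helper name `design_matrix` and the sample inputs are illustrative assumptions, not part of the lecture.

```python
import numpy as np

def design_matrix(x, basis_functions):
    """Stack h_i(x_j) into an N x k matrix: one row per data point, one column per basis function."""
    return np.column_stack([h(x) for h in basis_functions])

# Parabola basis from the slide: {h1(x) = x^2, h2(x) = x, h3(x) = 1}
parabola_basis = [
    lambda x: x ** 2,
    lambda x: x,
    lambda x: np.ones_like(x),
]

x = np.array([0.0, 1.0, 2.0, 3.0])   # hypothetical 1D inputs
H = design_matrix(x, parabola_basis)
print(H)
```

The model stays linear in the weights even though each column is a nonlinear function of the input.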

8 The regression problem. Instances: $\langle x_j, t_j \rangle$. Learn: a mapping from $x$ to $t(x)$. Hypothesis space: given basis functions $\{h_1, \dots, h_k\}$, find coefficients $w = \{w_1, \dots, w_k\}$. Why is this usually called linear regression? Because the model is linear in the parameters. Can we estimate functions that are not lines? Precisely, minimize the residual squared error: $w^* = \arg\min_w \sum_j \left( t(x_j) - \sum_i w_i h_i(x_j) \right)^2$.

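9 Regression: matrix notation. $w^* = \arg\min_w (Hw - t)^T (Hw - t)$, where $H$ is the $N \times k$ matrix of the $k$ basis functions evaluated at the $N$ data points, $w$ is the vector of $k$ weights, and $t$ is the vector of $N$ observed outputs (measurements).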

10 Regression: closed form solution
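
The step from the matrix objective to the closed form can be written out as follows. This is a sketch that assumes $H$ denotes the $N \times k$ matrix with entries $H_{ji} = h_i(x_j)$, that $t$ is the vector of observed outputs, and that $H^T H$ is invertible.

```latex
\begin{align*}
F(w) &= (Hw - t)^T (Hw - t) \\
\nabla_w F(w) &= 2 H^T (Hw - t) = 0 \\
H^T H \, w &= H^T t \\
w^* &= (H^T H)^{-1} H^T t
\end{align*}
```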

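11 Regression solution: simple matrix math. $w^* = (H^T H)^{-1} H^T t = A^{-1} b$, with $H$ the $N \times k$ matrix of basis-function values and $t$ the vector of observed outputs, where $A = H^T H$ is a $k \times k$ matrix for $k$ basis functions and $b = H^T t$ is a $k \times 1$ vector.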
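
A minimal NumPy sketch of this matrix math is shown below. The synthetic data (a noisy line) and the variable names are assumptions made for illustration. The sketch solves the $k \times k$ system instead of forming an explicit inverse, and cross-checks against `np.linalg.lstsq`.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (an assumption for illustration): t = 2x + 1 plus small Gaussian noise
x = rng.uniform(-1.0, 1.0, size=50)
t = 2.0 * x + 1.0 + 0.1 * rng.normal(size=50)

# Linear basis with a bias term: {h1(x) = x, h2(x) = 1}
H = np.column_stack([x, np.ones_like(x)])

A = H.T @ H                      # k x k matrix
b = H.T @ t                      # k x 1 vector
w_star = np.linalg.solve(A, b)   # solves A w = b without forming the inverse

# Cross-check against NumPy's least-squares routine
w_lstsq, *_ = np.linalg.lstsq(H, t, rcond=None)
print(w_star, w_lstsq)
```

For ill-conditioned bases, a least-squares routine is usually the safer choice than the normal equations, which square the condition number.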

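12 But, why? Billionaire (again) says: "Why sum squared error???" You say: "Gaussians, Dr. Gateson, Gaussians." Model: prediction is a linear function plus Gaussian noise, $t(x) = \sum_i w_i h_i(x) + \epsilon$ with $\epsilon \sim N(0, \sigma^2)$, so that $P(t \mid x, w, \sigma) = \frac{1}{\sigma \sqrt{2\pi}} \exp\left( -\frac{\left[ t - \sum_i w_i h_i(x) \right]^2}{2 \sigma^2} \right)$. Learn $w$ using MLE.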

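13 Maximizing log-likelihood. Under Gaussian noise with variance $\sigma^2$, the log-likelihood of the data is $\ln P(D \mid w, \sigma) = N \ln \frac{1}{\sigma \sqrt{2\pi}} - \sum_j \frac{\left[ t(x_j) - \sum_i w_i h_i(x_j) \right]^2}{2 \sigma^2}$. Maximize with respect to $w$: the first term does not depend on $w$, so $\arg\max_w \ln P(D \mid w, \sigma) = \arg\min_w \sum_j \left[ t(x_j) - \sum_i w_i h_i(x_j) \right]^2$. Least-squares linear regression is MLE for Gaussians!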

14 Regularization in Linear Regression. One sign of overfitting: large parameter values! Regularized (or penalized) regression modifies the learning objective to penalize large parameters.

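15 Ridge Regression. Introduce a new objective function: $\hat{w}_{ridge} = \arg\min_w \sum_{j=1}^{N} \left( t(x_j) - \left( w_0 + \sum_{i=1}^{k} w_i h_i(x_j) \right) \right)^2 + \lambda \sum_{i=1}^{k} w_i^2$. Prefer low error, but also add a squared penalty for large weights. $\lambda$ is a hyperparameter that balances the tradeoff. The bias feature (essentially $h_0 = 1$) is written out explicitly and is not penalized.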

16 Ridge Regression in Matrix Notation. $\hat{w}_{ridge} = \arg\min_w \sum_{j=1}^{N} \left( t(x_j) - \left( w_0 + \sum_{i=1}^{k} w_i h_i(x_j) \right) \right)^2 + \lambda \sum_{i=1}^{k} w_i^2 = \arg\min_w (Hw - t)^T (Hw - t) + \lambda w^T I_{0+k} w$, where $H$ is the $N \times (k+1)$ matrix holding the bias column and the $k$ basis functions $h_0, \dots, h_k$ evaluated at the $N$ data points, $w$ is the vector of $k+1$ weights, $t$ is the vector of $N$ observed outputs (measurements), and $I_{0+k}$ is the $(k+1) \times (k+1)$ identity matrix, but with 0 in the upper left.

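17 Minimizing the Ridge Regression Objective. Ridge Regression: closed form solution. $\hat{w}_{ridge} = \arg\min_w \sum_{j=1}^{N} \left( t(x_j) - \left( w_0 + \sum_{i=1}^{k} w_i h_i(x_j) \right) \right)^2 + \lambda \sum_{i=1}^{k} w_i^2 = \arg\min_w (Hw - t)^T (Hw - t) + \lambda w^T I_{0+k} w$, where $H$ is the $N \times (k+1)$ matrix of basis-function values $h_0, \dots, h_k$ at the $N$ data points, $w$ holds the $k+1$ weights, $t$ holds the $N$ observations, and $I_{0+k}$ is the identity matrix with 0 in the upper left. Setting the gradient with respect to $w$ to zero gives $2 H^T (Hw - t) + 2 \lambda I_{0+k} w = 0$.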

18 Ridge Regression in Matrix Notation. Regression solution: simple matrix math. For the objective $\hat{w}_{ridge} = \arg\min_w (Hw - t)^T (Hw - t) + \lambda w^T I_{0+k} w$, with $H$ the $N \times (k+1)$ matrix of basis-function values, $t$ the $N$ observations, and $I_{0+k}$ the identity matrix with 0 in the upper left, the solution is $\hat{w}_{ridge} = (H^T H + \lambda I_{0+k})^{-1} H^T t$. Compare to un-regularized regression: $w^* = (H^T H)^{-1} H^T t$.

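19 Ridge Regression. How does varying lambda change $w$? Larger $\lambda$? Smaller $\lambda$? As $\lambda \to 0$? Becomes the same as MLE (unregularized). As $\lambda \to \infty$? All penalized weights will be 0! (The bias $w_0$ is not penalized, so it does not shrink.)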
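
The ridge closed form and its behavior at the two extremes of $\lambda$ can be sketched as below. The function name `ridge_fit`, the polynomial basis, and the synthetic sine data are assumptions for illustration. Column 0 of the matrix is taken to be the unpenalized bias feature.

```python
import numpy as np

def ridge_fit(H, t, lam):
    """Closed-form ridge weights; column 0 of H is assumed to be the bias feature h0 = 1."""
    penalty = np.eye(H.shape[1])
    penalty[0, 0] = 0.0   # identity with 0 in the upper left: the bias is not penalized
    return np.linalg.solve(H.T @ H + lam * penalty, H.T @ t)

rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=30)
t = np.sin(3.0 * x) + 0.1 * rng.normal(size=30)   # synthetic targets, for illustration only

# Bias column (x^0) plus polynomial basis functions x, x^2, ..., x^5
H = np.column_stack([x ** d for d in range(6)])

lambdas = [0.0, 0.1, 10.0, 1e6]
weights = [ridge_fit(H, t, lam) for lam in lambdas]
for lam, w in zip(lambdas, weights):
    print(lam, np.round(w, 3))
```

With $\lambda = 0$ the result should match ordinary least squares, and with a very large $\lambda$ the penalized weights shrink toward zero while the bias settles near the mean of the targets.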

20 Ridge Coefficient Path. [Figure from the Kevin Murphy textbook; recoverable labels: Feature Weight, Larger λ, Smaller λ] Typical approach: select λ using cross-validation; more on this later in the quarter.

21 How to pick lambda? Experimentation cycle: select a hypothesis f to best match the training set; tune hyperparameters on a held-out set; try many different values of lambda and pick the best one. Or, can do k-fold cross-validation: no held-out set; divide the training set into k subsets; repeatedly train on k-1 and test on the remaining one; average the results. [Figure; recoverable labels: Training Data, Held-Out (Development) Data, Test Data; Training Part 1, Training Part 2, ..., Training Part K, Test Data]
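
The k-fold procedure above can be sketched as follows. This is an illustrative sketch, not the course's reference code: the helper names, the degree-9 polynomial basis, the candidate grid of λ values, and the synthetic data are all assumptions.

```python
import numpy as np

def ridge_fit(H, t, lam):
    """Closed-form ridge weights; column 0 of H is assumed to be the unpenalized bias."""
    penalty = np.eye(H.shape[1])
    penalty[0, 0] = 0.0
    return np.linalg.solve(H.T @ H + lam * penalty, H.T @ t)

def kfold_cv_error(H, t, lam, k=5, seed=0):
    """Average held-out mean squared error of ridge regression over k folds."""
    order = np.random.default_rng(seed).permutation(len(t))
    folds = np.array_split(order, k)          # divide the training set into k subsets
    errors = []
    for i in range(k):
        val_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        w = ridge_fit(H[train_idx], t[train_idx], lam)   # train on k-1 subsets
        residual = t[val_idx] - H[val_idx] @ w           # test on the remaining one
        errors.append(np.mean(residual ** 2))
    return float(np.mean(errors))                        # average the results

rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=40)
t = np.sin(3.0 * x) + 0.2 * rng.normal(size=40)   # synthetic data, for illustration only
H = np.column_stack([x ** d for d in range(10)])  # bias plus degree-9 polynomial basis

lambdas = [1e-4, 1e-3, 1e-2, 1e-1, 1.0, 10.0]     # hypothetical candidate grid
cv_errors = [kfold_cv_error(H, t, lam) for lam in lambdas]
best_lam = lambdas[int(np.argmin(cv_errors))]
print(best_lam, cv_errors)
```

The same fold assignment is reused for every λ (fixed `seed`), so the candidates are compared on identical splits.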

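22 Why squared regularization? Ridge: $\hat{w} = \arg\min_w \sum_{j=1}^{N} \left( t(x_j) - \left( w_0 + \sum_{i=1}^{k} w_i h_i(x_j) \right) \right)^2 + \lambda \sum_{i=1}^{k} w_i^2$. LASSO: $\hat{w} = \arg\min_w \sum_{j=1}^{N} \left( t(x_j) - \left( w_0 + \sum_{i=1}^{k} w_i h_i(x_j) \right) \right)^2 + \lambda \sum_{i=1}^{k} |w_i|$. The linear penalty pushes more weights to exactly zero, which allows for a type of feature selection. But the objective is not differentiable at zero and has no closed form solution.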
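
Because LASSO has no closed form, it is solved iteratively. The sketch below uses proximal gradient descent (ISTA) with soft-thresholding, which is one common choice and not necessarily the method used in the course. For brevity it omits the bias term, and the synthetic data with a sparse true weight vector is an assumption for illustration.

```python
import numpy as np

def soft_threshold(v, thresh):
    """Shrink each entry toward zero by thresh, setting small entries exactly to zero."""
    return np.sign(v) * np.maximum(np.abs(v) - thresh, 0.0)

def lasso_ista(H, t, lam, n_iters=2000):
    """Minimize ||Hw - t||^2 + lam * sum_i |w_i| by proximal gradient descent (ISTA).

    No bias term is handled here; that simplification is an assumption of this sketch.
    """
    w = np.zeros(H.shape[1])
    # Step size 1/L, where L = 2 * sigma_max(H)^2 is the Lipschitz constant of the gradient
    step = 1.0 / (2.0 * np.linalg.norm(H, 2) ** 2)
    for _ in range(n_iters):
        grad = 2.0 * H.T @ (H @ w - t)                    # gradient of the squared-error term
        w = soft_threshold(w - step * grad, step * lam)   # proximal step for the L1 penalty
    return w

rng = np.random.default_rng(0)
H = rng.normal(size=(200, 5))
true_w = np.array([3.0, 0.0, 0.0, -2.0, 0.0])             # sparse ground truth (illustrative)
t = H @ true_w + 0.1 * rng.normal(size=200)

w_lasso = lasso_ista(H, t, lam=50.0)
w_unreg = lasso_ista(H, t, lam=0.0)                        # reduces to least squares
print(np.round(w_lasso, 3), np.round(w_unreg, 3))
```

The soft-threshold step is what produces exact zeros, in contrast to ridge, which only shrinks weights toward zero.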

23 Geometric intuition: picture of Lasso and Ridge regression. [Figure from Rob Tibshirani's slides; two panels, Lasso and Ridge Regression, each drawn in the $(w_1, w_2)$ plane and marking the MLE $\hat{w}$]

24 LASSO Coefficient Path. [Figure from the Kevin Murphy textbook; recoverable labels: Feature Weight, Larger λ, Smaller λ]

25 Bias-Variance tradeoff: intuition. Model too simple: does not fit the data well; a biased solution. Model too complex: small changes to the data change the solution a lot; a high-variance solution.

26 Bias-Variance Tradeoff. Choice of hypothesis class introduces learning bias. More complex class → less bias. More complex class → more variance.
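
The tradeoff can be illustrated with a small simulation: refit a simple and a more complex polynomial on many noisy resamples of the same underlying function, then measure squared bias and variance of the predictions. The target function, noise level, degrees, and sample sizes below are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def true_fn(x):
    return np.sin(2.0 * np.pi * x)     # hypothetical ground-truth function

x_train = np.linspace(0.0, 1.0, 30)    # fixed inputs; only the noise is resampled
x_test = np.linspace(0.05, 0.95, 50)
n_datasets = 300
noise_std = 0.3

def bias_variance(degree):
    """Estimate squared bias and variance of a degree-`degree` polynomial fit."""
    preds = np.empty((n_datasets, x_test.size))
    for d in range(n_datasets):
        t_train = true_fn(x_train) + noise_std * rng.normal(size=x_train.size)
        coeffs = np.polyfit(x_train, t_train, degree)   # least-squares polynomial fit
        preds[d] = np.polyval(coeffs, x_test)
    mean_pred = preds.mean(axis=0)
    bias_sq = float(np.mean((mean_pred - true_fn(x_test)) ** 2))
    variance = float(np.mean(preds.var(axis=0)))
    return bias_sq, variance

results = {degree: bias_variance(degree) for degree in (1, 5)}
for degree, (bias_sq, variance) in results.items():
    print(degree, bias_sq, variance)
```

In this setup the line should show the larger squared bias and the degree-5 polynomial the larger variance.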

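27 Training set error. Given a dataset (training data), choose a loss function, e.g., squared error (L2) for regression. Training error: for a particular set of parameters, the loss function on the training data: $error_{train}(w) = \frac{1}{N_{train}} \sum_{j=1}^{N_{train}} \left( t(x_j) - \sum_i w_i h_i(x_j) \right)^2$.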

28 Training error as a function of model complexity

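29 Prediction error. Training set error can be a poor measure of the quality of a solution. Prediction error (true error): we really care about error over all possibilities: $error_{true}(w) = E_x\left[ \left( t(x) - \sum_i w_i h_i(x) \right)^2 \right] = \int_x \left( t(x) - \sum_i w_i h_i(x) \right)^2 p(x)\, dx$.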

30 Prediction error as a function of model complexity

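31 Computing prediction error. To compute the prediction error exactly we would need $\int_x \left( t(x) - \sum_i w_i h_i(x) \right)^2 p(x)\, dx$. Hard integral! May not know $t(x)$ for every $x$, and may not know $p(x)$. Monte Carlo integration (sampling approximation): sample a set of i.i.d. points $\{x_1, \dots, x_m\}$ from $p(x)$ and approximate the integral with the sample average, $error_{true}(w) \approx \frac{1}{m} \sum_{j=1}^{m} \left( t(x_j) - \sum_i w_i h_i(x_j) \right)^2$.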
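
A minimal sketch of the sampling approximation, using a toy case where the integral is known exactly so the estimate can be checked. The target $t(x) = x^2$, the fixed model $f(x) = x$ (that is, $w_0 = 0$, $w_1 = 1$), and the uniform $p(x)$ on $[0, 1]$ are assumptions for illustration; the exact error there is $\int_0^1 (x^2 - x)^2 dx = 1/30$.

```python
import numpy as np

def monte_carlo_error(target_fn, predict_fn, sampler, m, rng):
    """Approximate E_x[(t(x) - f(x))^2] by averaging over m i.i.d. samples from p(x)."""
    x = sampler(rng, m)
    return float(np.mean((target_fn(x) - predict_fn(x)) ** 2))

rng = np.random.default_rng(0)

def target_fn(x):
    return x ** 2                           # hypothetical true function t(x)

def predict_fn(x):
    return x                                # fixed model with w0 = 0, w1 = 1

def sampler(rng, m):
    return rng.uniform(0.0, 1.0, size=m)    # p(x) = Uniform[0, 1]

estimate = monte_carlo_error(target_fn, predict_fn, sampler, m=200_000, rng=rng)
exact = 1.0 / 30.0
print(estimate, exact)
```

The key requirement is that the weights are fixed before the sample points are drawn; the estimate is only unbiased for a model that did not see those points.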

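32 Why doesn't training set error approximate prediction error? Sampling approximation of prediction error: $error_{true}(w) \approx \frac{1}{m} \sum_{j=1}^{m} \left( t(x_j) - \sum_i w_i h_i(x_j) \right)^2$. Training error: $error_{train}(w) = \frac{1}{N_{train}} \sum_{j=1}^{N_{train}} \left( t(x_j) - \sum_i w_i h_i(x_j) \right)^2$. Very similar equations! Why is training set error a bad measure of prediction error?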

33 Why doesn't training set error approximate prediction error? The sampling approximation of prediction error and the training error are very similar equations: both average the squared error over a set of points. So why is training set error a bad measure of prediction error? Because you cheated! Training error is a good estimate for a single $w$, but you optimized $w$ with respect to the training error and found a $w$ that is good for this particular set of samples. Training error is an (optimistically) biased estimate of prediction error.

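34 Test set error. Given a dataset, randomly split it into two parts: training data $\{x_1, \dots, x_{N_{train}}\}$ and test data $\{x_1, \dots, x_{N_{test}}\}$. Use the training data to optimize parameters $w$. Test set error: for the final solution $w^*$, evaluate the error using $error_{test}(w^*) = \frac{1}{N_{test}} \sum_{j=1}^{N_{test}} \left( t(x_j) - \sum_i w^*_i h_i(x_j) \right)^2$, where the sum runs over the test data only.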

35 Test set error as a function of model complexity
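
A small sketch of how training and test error behave as model complexity grows, here using polynomial degree as the complexity knob. The synthetic data, the 40/20 split, and the range of degrees are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 60
x = rng.uniform(-1.0, 1.0, size=n)
t = np.sin(3.0 * x) + 0.2 * rng.normal(size=n)   # synthetic data, for illustration only

# Randomly split the dataset into training and test parts
perm = rng.permutation(n)
train_idx, test_idx = perm[:40], perm[40:]

def poly_features(x, degree):
    """Basis functions 1, x, ..., x^degree evaluated at each input."""
    return np.column_stack([x ** d for d in range(degree + 1)])

def mean_squared_error(H, t, w):
    return float(np.mean((t - H @ w) ** 2))

train_errors, test_errors = [], []
for degree in range(1, 9):
    H_train = poly_features(x[train_idx], degree)
    H_test = poly_features(x[test_idx], degree)
    # Optimize w on the training data only
    w_star, *_ = np.linalg.lstsq(H_train, t[train_idx], rcond=None)
    train_errors.append(mean_squared_error(H_train, t[train_idx], w_star))
    test_errors.append(mean_squared_error(H_test, t[test_idx], w_star))
    print(degree, train_errors[-1], test_errors[-1])
```

Because each polynomial class contains the previous one, the training error cannot increase with degree; the test error carries no such guarantee.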

36 Overfitting: this slide is so important we are looking at it again! Assume: data generated from distribution $D(X, Y)$ and a hypothesis space $H$. Define errors for hypothesis $h \in H$: training error $error_{train}(h)$ and data (true) error $error_{true}(h)$. We say $h$ overfits the training data if there exists an $h' \in H$ such that $error_{train}(h) < error_{train}(h')$ and $error_{true}(h) > error_{true}(h')$.

37 Summary: error estimators. Gold standard: the true (prediction) error. Training error: optimistically biased. Test error: our final measure.

38 Error as a function of number of training examples, for a fixed model complexity. [Figure; recoverable labels: bias, little data, infinite data]

39 Error as a function of regularization parameter, for a fixed model complexity. [Figure; recoverable labels: λ=∞, λ=0]

40 Summary: error estimators. Gold standard: the true (prediction) error. Training error: optimistically biased. Test error: our final measure. Be careful! The test set is only unbiased if you never, never, ever, ever do any, any, any, any learning on the test data. For example, if you use the test set to select the degree of the polynomial, it is no longer unbiased! (We will address this problem later in the quarter.)

41 What you need to know. Regression: basis functions = features; optimizing sum squared error; relationship between regression and Gaussians. Regularization: ridge regression math; LASSO formulation; how to set lambda. Bias-variance trade-off.
