EE 511 Linear Regression Instructor: Hanna Hajishirzi hannaneh@washington.edu Slides adapted from Ali Farhadi, Mari Ostendorf, Pedro Domingos, Carlos Guestrin, and Luke Zettlemoyer.
Announcements: HW1 is due in one week, before class. Always check the discussion board.
MAP Recap Learning is: Collect some data (e.g., coin flips). Choose a hypothesis class or model (e.g., Binomial) and a prior based on expert knowledge. Choose a loss function (e.g., the parameter posterior likelihood). Choose an optimization procedure (e.g., set derivatives to zero to obtain the MAP estimate). Justify the accuracy of the estimate (e.g., if the model is correct, you are doing the best possible).
Prediction of continuous variables Billionaire says: Wait, that's not what I meant! You say: Chill out, dude. He says: I want to predict a continuous variable from continuous inputs: I want to predict salaries from GPA. You say: I can regress that!
Stock market
Weather prediction revisited Temperature 72 F
Linear Regression [Figure: scatter of observations with a fitted prediction line]
Ordinary Least Squares (OLS) [Figure: fitted line with an observation, its prediction, and the error (residual) between them]
The regression problem Instances: <x_j, t_j>. Learn: a mapping from x to t(x). Hypothesis space: given basis functions {h_1, ..., h_k}, find coefficients w = {w_1, ..., w_k}. Why is this usually called linear regression? The model is linear in the parameters. Can we estimate functions that are not lines?
Transformed data
Linear Basis: 1D input. Need a bias term: {h_1(x) = x, h_2(x) = 1}. [Figure: y vs. x with a fitted line]
Parabola: {h_1(x) = x^2, h_2(x) = x, h_3(x) = 1}. 2D: {h_1(x) = x_1^2, h_2(x) = x_2^2, h_3(x) = x_1 x_2, ...}. Can define any basis functions h_i(x) for n-dimensional input x = <x_1, ..., x_n>. [Figure: y vs. x with a fitted parabola]
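The basis-function idea above can be sketched in a few lines of NumPy. The helper name `poly_basis` is illustrative, not from the slides: it builds a design matrix whose columns are 1, x, x^2, ..., so a model linear in w can still fit a parabola.

```python
import numpy as np

# Illustrative basis expansion for 1-D input: each column of H is one
# basis function evaluated at every data point, H[j, i] = h_i(x_j).
def poly_basis(x, degree):
    """Return the N x (degree+1) matrix with columns x**0, x**1, ..., x**degree."""
    x = np.asarray(x, dtype=float)
    return np.vander(x, degree + 1, increasing=True)

x = np.array([0.0, 1.0, 2.0])
H = poly_basis(x, 2)   # columns: 1, x, x^2
```

Any other basis (radial functions, interactions like x_1*x_2) would work the same way: only the columns of H change, not the linear-in-w model.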
The regression problem Instances: <x_j, t_j>. Learn: a mapping from x to t(x). Hypothesis space: given basis functions {h_1, ..., h_k}, find coefficients w = {w_1, ..., w_k}. Why is this usually called linear regression? The model is linear in the parameters. Can we estimate functions that are not lines? Precisely, minimize the residual squared error: w* = argmin_w Σ_j (t(x_j) − Σ_i w_i h_i(x_j))^2
Regression: matrix notation. Stack the data as t = Hw: t is the N×1 vector of observed outputs (measurements), H is the N×K matrix with entry H[j, i] = h_i(x_j) (N data points, K basis functions), and w is the K×1 vector of weights.
Regression: closed form solution w* = (H^T H)^{-1} H^T t
Regression solution: simple matrix math: w* = (H^T H)^{-1} (H^T t), where H^T H is a k×k matrix for k basis functions and H^T t is a k×1 vector.
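The closed-form solution above is a one-liner in NumPy. A minimal sketch with synthetic data (the data-generating line 3x + 1 is made up for illustration); solving the k×k linear system is preferred over forming the inverse explicitly:

```python
import numpy as np

# Closed-form OLS: w* = (H^T H)^{-1} H^T t, computed via np.linalg.solve
# on the k x k system rather than an explicit matrix inverse.
rng = np.random.default_rng(0)
N = 50
x = rng.uniform(-1, 1, N)
H = np.column_stack([x, np.ones(N)])            # basis {h_1(x)=x, h_2(x)=1}
t = 3.0 * x + 1.0 + 0.01 * rng.normal(size=N)   # noisy line (illustrative)

w = np.linalg.solve(H.T @ H, H.T @ t)           # recovers roughly [3, 1]
```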
But, why? Why sum squared error??? Model: the prediction is a linear function plus Gaussian noise: t(x) = Σ_i w_i h_i(x) + ε, with ε ~ N(0, σ^2). Learn w using MLE:
Maximizing log-likelihood Maximize wrt w: ln P(D | w) = ln Π_j (1/(σ√(2π))) exp(−(t_j − Σ_i w_i h_i(x_j))^2 / (2σ^2)). Dropping terms constant in w, this is equivalent to minimizing Σ_j (t_j − Σ_i w_i h_i(x_j))^2. Least-squares linear regression is MLE for Gaussians!!!
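The equivalence can be checked numerically. This is a sketch with made-up data and an assumed known σ (both illustrative): the least-squares solution is also the minimizer of the Gaussian negative log-likelihood, since the two differ only by constants in w.

```python
import numpy as np

# Numerical check: for Gaussian noise, the negative log-likelihood equals
# the sum of squared errors scaled by 1/(2*sigma^2) plus a constant, so the
# least-squares w also minimizes the negative log-likelihood.
rng = np.random.default_rng(6)
x = rng.uniform(-1, 1, 30)
H = np.column_stack([np.ones_like(x), x])
t = H @ np.array([1.0, 2.0]) + 0.5 * rng.normal(size=30)
sigma = 0.5                                    # assumed known noise scale

def sse(w):
    return np.sum((t - H @ w) ** 2)

def neg_log_lik(w):
    return sse(w) / (2 * sigma**2) + len(t) * np.log(sigma * np.sqrt(2 * np.pi))

w_ls = np.linalg.solve(H.T @ H, H.T @ t)       # least-squares minimizer
```

Perturbing `w_ls` in any direction increases both `sse` and `neg_log_lik`, confirming they share the same minimizer.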
Regularization in Linear Regression One sign of overfitting: large parameter values! Regularized or penalized regression modifies the learning objective to penalize large parameters.
Ridge Regression Introduce a new objective function: prefer low error, but also add a squared penalty on large weights: w*_ridge = argmin_w Σ_j (t(x_j) − (w_0 + Σ_{i=1}^k w_i h_i(x_j)))^2 + λ Σ_{i=1}^k w_i^2. λ is a hyperparameter that balances the tradeoff. Explicitly writing out the bias feature (essentially h_0 = 1), which is not penalized.
Ridge Regression in Matrix Notation w*_ridge = argmin_w (Hw − t)^T (Hw − t) + λ w^T I_{0+k} w, where H is the N×(K+1) measurement matrix (a bias column of 1s plus k basis functions, N data points), t is the N×1 vector of observed outputs, w is the (K+1)×1 weight vector, and I_{0+k} is the (k+1)×(k+1) identity matrix with a 0 in the upper-left entry, so the bias is not penalized.
Ridge Regression: closed form solution w*_ridge = (H^T H + λ I_{0+k})^{-1} H^T t
Ridge regression solution: simple matrix math: w*_ridge = (H^T H + λ I_{0+k})^{-1} H^T t. Compare to un-regularized regression: w* = (H^T H)^{-1} H^T t.
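The ridge closed form differs from OLS only by the λ·I_{0+k} term. A minimal sketch with synthetic data (the true weights are made up for illustration), zeroing the upper-left entry of the identity so the bias column is not shrunk:

```python
import numpy as np

# Ridge closed form: w*_ridge = (H^T H + lam * I0k)^{-1} H^T t, where I0k
# is the (k+1)x(k+1) identity with a 0 in the upper-left so the bias term
# (the column of 1s) is not penalized.
rng = np.random.default_rng(1)
N, k = 40, 3
X = rng.normal(size=(N, k))
H = np.column_stack([np.ones(N), X])            # bias column h_0 = 1, then k features
t = H @ np.array([2.0, 1.0, -1.0, 0.5]) + 0.1 * rng.normal(size=N)

lam = 5.0
I0k = np.eye(k + 1)
I0k[0, 0] = 0.0                                 # do not shrink the bias
w_ridge = np.linalg.solve(H.T @ H + lam * I0k, H.T @ t)
w_ols = np.linalg.solve(H.T @ H, H.T @ t)       # un-regularized comparison
```

The penalized weights `w_ridge[1:]` come out smaller in norm than their OLS counterparts, which is exactly the shrinkage the penalty buys.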
Ridge Regression How does varying lambda change w? Larger λ? Smaller λ? As λ → 0: becomes the same as the MLE, unregularized solution. As λ → ∞: all (penalized) weights go to 0!
Ridge Coefficient Path [Figure from the Kevin Murphy textbook: feature weights vs. regularization strength, shrinking toward 0 from smaller λ to larger λ]. Typical approach: select λ using cross validation; more on this later in the quarter.
How to pick lambda? Experimentation cycle: select a hypothesis f to best match the training set, then tune hyperparameters on a held-out (development) set: try many different values of lambda and pick the best one. Or, do k-fold cross validation: no held-out set; divide the training set into k subsets, repeatedly train on k−1 of them and test on the remaining one, then average the results. [Figure: data split into Training Data / Held-Out (Development) Data / Test Data, and a k-fold split into Training Part 1 ... Training Part K plus Test Data]
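The k-fold procedure above can be sketched directly. The helper names `ridge_fit` and `cv_error` and the candidate λ grid are illustrative, not from the slides:

```python
import numpy as np

# k-fold cross validation for choosing lambda in ridge regression:
# split the training set into k folds, train on k-1 folds, measure
# squared error on the held-out fold, and average over folds.
def ridge_fit(H, t, lam):
    I0 = np.eye(H.shape[1])
    I0[0, 0] = 0.0                      # bias column is not penalized
    return np.linalg.solve(H.T @ H + lam * I0, H.T @ t)

def cv_error(H, t, lam, k=5):
    N = H.shape[0]
    errs = []
    for fold in np.array_split(np.arange(N), k):
        train = np.setdiff1d(np.arange(N), fold)
        w = ridge_fit(H[train], t[train], lam)
        errs.append(np.mean((H[fold] @ w - t[fold]) ** 2))
    return float(np.mean(errs))

rng = np.random.default_rng(2)
x = rng.uniform(-1, 1, 60)
H = np.column_stack([np.ones_like(x), x, x**2])
t = 1.0 + 2.0 * x + 0.2 * rng.normal(size=60)

lams = [0.0, 0.1, 1.0, 10.0]            # illustrative candidate grid
best_lam = min(lams, key=lambda lam: cv_error(H, t, lam))
```

In practice the folds are usually shuffled first; the contiguous split here keeps the sketch short.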
Why squared regularization? Ridge penalty: λ Σ_i w_i^2. LASSO penalty: λ Σ_i |w_i|. The linear (L1) penalty pushes more weights to exactly zero, which allows a type of feature selection. But it is not differentiable, and there is no closed-form solution.
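Since LASSO has no closed form, it is typically solved iteratively. One common approach (not covered on the slides, shown here only as an illustration) is cyclic coordinate descent with soft-thresholding, which is where the exact zeros come from:

```python
import numpy as np

# Illustrative LASSO solver via cyclic coordinate descent: each coordinate
# update soft-thresholds the correlation of feature j with the residual,
# setting w_j exactly to 0 when that correlation is below lam.
def soft(a, thr):
    return np.sign(a) * max(abs(a) - thr, 0.0)

def lasso_cd(H, t, lam, iters=200):
    """Minimize 0.5*||t - Hw||^2 + lam*||w||_1 by coordinate descent."""
    w = np.zeros(H.shape[1])
    for _ in range(iters):
        for j in range(H.shape[1]):
            r = t - H @ w + H[:, j] * w[j]      # residual excluding feature j
            rho = H[:, j] @ r
            w[j] = soft(rho, lam) / (H[:, j] @ H[:, j])
    return w

rng = np.random.default_rng(3)
H = rng.normal(size=(100, 3))
t = 2.0 * H[:, 0] + 0.01 * rng.normal(size=100)  # only feature 0 matters
w = lasso_cd(H, t, lam=1.0)
```

With the irrelevant features' correlations below the threshold, their weights land on exactly 0.0, not merely near it; ridge would only shrink them toward zero.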
Geometric Intuition [Figure from Rob Tibshirani's slides: contours of the squared-error loss around the MLE ŵ in (w_1, w_2) space, with the circular ridge constraint region next to the diamond-shaped LASSO constraint region; the diamond's corners on the axes make solutions with exact zeros likely]
LASSO Coefficient Path [Figure from the Kevin Murphy textbook: feature weights vs. regularization strength; weights hit exactly zero as λ grows]
Bias-Variance tradeoff Intuition: Model too simple: does not fit the data well; a biased solution. Model too complex: small changes to the data change the solution a lot; a high-variance solution.
Bias-Variance Tradeoff Choice of hypothesis class introduces learning bias: more complex class → less bias; more complex class → more variance.
Training set error Given a dataset (training data), choose a loss function, e.g., squared error (L_2) for regression. Training error: for a particular set of parameters, the loss function evaluated on the training data: error_train(w) = (1/N_train) Σ_j (t(x_j) − Σ_i w_i h_i(x_j))^2
Training error as a function of model complexity
Prediction error Training set error can be a poor measure of the quality of the solution. Prediction error (true error): we really care about the error over all possible inputs: error_true(w) = E_{x~p(x)}[(t(x) − Σ_i w_i h_i(x))^2]
Prediction error as a function of model complexity
Computing prediction error To compute the prediction error exactly: a hard integral! We may not know t(x) for every x, and may not know p(x). Monte Carlo integration (sampling approximation): sample a set of i.i.d. points {x_1, ..., x_M} from p(x) and approximate the integral with the sample average: error_true(w) ≈ (1/M) Σ_{j=1}^M (t(x_j) − Σ_i w_i h_i(x_j))^2
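The Monte Carlo approximation is easy to sketch. Here t(x) = sin(x) and p(x) = Uniform(−1, 1) are made-up choices for illustration, with a deliberately crude model that just predicts y = x:

```python
import numpy as np

# Monte Carlo estimate of the prediction (true) error: sample x_j ~ p(x)
# i.i.d. and average the squared residual instead of computing the integral.
rng = np.random.default_rng(4)

def t_true(x):
    return np.sin(x)                    # assumed "true" target, illustrative

w = np.array([0.0, 1.0])                # model w_0 + w_1*x, i.e. predict y = x

M = 100_000
xs = rng.uniform(-1, 1, M)              # samples from p(x) = Uniform(-1, 1)
preds = w[0] + w[1] * xs
mc_error = np.mean((t_true(xs) - preds) ** 2)
```

Crucially, these samples were never used to fit w, so the average is an unbiased estimate of the true error; the next slides show why the same average over the training points is not.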
Why doesn't training set error approximate prediction error? Sampling approximation of prediction error: error_true(w) ≈ (1/M) Σ_j (t(x_j) − Σ_i w_i h_i(x_j))^2, for x_j ~ p(x). Training error: error_train(w) = (1/N_train) Σ_j (t(x_j) − Σ_i w_i h_i(x_j))^2. Very similar equations!!! Why is the training set a bad measure of prediction error???
Why doesn't training set error approximate prediction error? Because you cheated!!! Training error is a good estimate of prediction error for a single, fixed w. But you optimized w with respect to the training error, and found a w that is good for this particular set of samples. Training error is therefore an (optimistically) biased estimate of prediction error.
Test set error Given a dataset, randomly split it into two parts: training data {x_1, ..., x_{N_train}} and test data {x_1, ..., x_{N_test}}. Use the training data to optimize the parameters w. Test set error: for the final solution w*, evaluate the error using: error_test(w*) = (1/N_test) Σ_{j over test points} (t(x_j) − Σ_i w*_i h_i(x_j))^2
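The split-fit-evaluate recipe above, sketched with synthetic data (the 80/20 split ratio and the data-generating line are illustrative):

```python
import numpy as np

# Random train/test split: fit w on the training part only, then report
# the mean squared error of the final w* on the held-out test part.
rng = np.random.default_rng(5)
N = 200
x = rng.uniform(-2, 2, N)
t = 1.5 * x - 0.5 + 0.3 * rng.normal(size=N)    # noisy line (illustrative)

perm = rng.permutation(N)                        # random split, 80% / 20%
train, test = perm[: int(0.8 * N)], perm[int(0.8 * N):]

H = np.column_stack([np.ones(N), x])
w_star = np.linalg.solve(H[train].T @ H[train], H[train].T @ t[train])
test_error = np.mean((H[test] @ w_star - t[test]) ** 2)
```

Because the test points played no role in choosing w*, `test_error` is an unbiased estimate of the prediction error (here it lands near the noise variance of 0.09).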
Test set error as a function of model complexity
Overfitting (this slide is so important we are looking at it again!) Assume: data generated from a distribution D(X,Y), and a hypothesis space H. Define errors for a hypothesis h ∈ H: training error error_train(h) and data (true) error error_true(h). We say h overfits the training data if there exists an h' ∈ H such that: error_train(h) < error_train(h') and error_true(h) > error_true(h').
Summary: error estimators Gold standard: the prediction (true) error. Training error: optimistically biased. Test error: our final measure.
Error as a function of the number of training examples, for a fixed model complexity [Figure: error curves from little data to infinite data, converging toward the bias]
Error as a function of the regularization parameter, for a fixed model complexity [Figure: error vs. λ, from λ → ∞ down to λ = 0]
Summary: error estimators Gold standard: the prediction (true) error. Training error: optimistically biased. Test error: our final measure. Be careful!!! The test set is only unbiased if you never, never, ever, ever do any learning on the test data. For example, if you use the test set to select the degree of the polynomial, it is no longer unbiased!!! (We will address this problem later in the quarter.)
What you need to know Regression: basis functions = features; optimizing sum squared error; the relationship between regression and Gaussians. Regularization: ridge regression math; LASSO formulation; how to set lambda. Bias-variance trade-off.
Regression Recap Learning is: Collect some data (e.g., sensor data and temperature). Randomly split the dataset into TRAIN and TEST (e.g., an 80%/20% random split). Choose a hypothesis class or model (e.g., linear). Choose a loss function (e.g., least squares error). Choose an optimization procedure (e.g., set derivatives to zero to obtain the estimator). Justify the accuracy of the estimate (e.g., report the test error).