EE 511 Linear Regression


EE 511 Linear Regression. Instructor: Hanna Hajishirzi (hannaneh@washington.edu). Slides adapted from Ali Farhadi, Mari Ostendorf, Pedro Domingos, Carlos Guestrin, and Luke Zettlemoyer.

Announcements: HW1 is due in a week, before class. Always check the discussion board.

MAP Recap. Learning is: Collect some data (e.g., coin flips). Choose a hypothesis class or model (e.g., Binomial, with a prior based on expert knowledge). Choose a loss function (e.g., the parameter's posterior likelihood). Choose an optimization procedure (e.g., set derivatives to zero to obtain the MAP estimate). Justify the accuracy of the estimate (e.g., if the model is correct, you are doing the best possible).

Prediction of continuous variables. Billionaire says: Wait, that's not what I meant! You say: Chill out, dude. He says: I want to predict a continuous variable from continuous inputs: I want to predict salaries from GPA. You say: I can regress that!

Stock market

Weather prediction revisited: Temperature = 72 °F.

Linear Regression [Figure: scatter plot of data with a fitted line; the line's value at each input is the prediction.]

Ordinary Least Squares (OLS) [Figure: fitted line through data; the vertical distance between an observation and its prediction is the error, or residual.]

The regression problem. Instances: <x_j, t_j>. Learn: a mapping from x to t(x). Hypothesis space: given basis functions {h_1, ..., h_k}, find coefficients w = {w_1, ..., w_k}. Why is this usually called linear regression? The model is linear in the parameters. Can we estimate functions that are not lines?

Transformed data

Linear Basis: 1D input. Need a bias term: {h_1(x) = x, h_2(x) = 1}. [Figure: line fit, y vs. x.]

Parabola: {h_1(x) = x^2, h_2(x) = x, h_3(x) = 1}. [Figure: parabola fit, y vs. x.] 2D: {h_1(x) = x_1^2, h_2(x) = x_2^2, h_3(x) = x_1 x_2, ...}. We can define any basis functions h_i(x) for n-dimensional input x = <x_1, ..., x_n>.
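
To make the basis-function idea concrete, here is a minimal NumPy sketch (not from the slides; names are illustrative) that builds the matrix of basis-function values for the 1D parabola basis above:

    import numpy as np

    def design_matrix(x, basis_funcs):
        # H[j, i] = h_i(x_j): one row per data point, one column per basis function
        return np.column_stack([h(x) for h in basis_funcs])

    # Parabola basis for 1D input: h1(x) = x^2, h2(x) = x, h3(x) = 1
    basis = [lambda x: x**2, lambda x: x, lambda x: np.ones_like(x)]
    x = np.array([0.0, 1.0, 2.0, 3.0])
    H = design_matrix(x, basis)  # shape (4, 3): 4 data points, 3 basis functions

The same helper works for any basis, including the 2D cross-terms above, as long as each h_i maps the input array to one column.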

The regression problem. Instances: <x_j, t_j>. Learn: a mapping from x to t(x). Hypothesis space: given basis functions {h_1, ..., h_k}, find coefficients w = {w_1, ..., w_k}. The model is linear in the parameters, but with nonlinear basis functions we can estimate functions that are not lines. Precisely, minimize the residual squared error: w* = argmin_w Σ_j (t_j − Σ_i w_i h_i(x_j))^2.

Regression: matrix notation. Stack the N observed outputs (measurements) into a vector t, the K weights into a vector w, and the basis-function values into an N × K matrix H with H_ji = h_i(x_j) (N data points, K basis functions). The model is then t ≈ H w.

Regression: closed form solution

Regression solution: simple matrix math. w* = (H^T H)^{-1} H^T t, where H^T H is a k × k matrix (for k basis functions) and H^T t is a k × 1 vector.
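
As a sketch of this closed form in NumPy (toy data; np.linalg.solve avoids forming an explicit inverse):

    import numpy as np

    # Toy design matrix: parabola basis evaluated at 4 points (one row per point)
    x = np.array([0.0, 1.0, 2.0, 3.0])
    H = np.column_stack([x**2, x, np.ones_like(x)])
    t = np.array([1.0, 0.5, 2.0, 5.5])  # observed outputs

    # Closed form: solve (H^T H) w = H^T t, i.e., w* = (H^T H)^{-1} H^T t
    w_star = np.linalg.solve(H.T @ H, H.T @ t)
    t_hat = H @ w_star  # predictions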

But, why? Why the sum of squared errors? Model: the prediction is a linear function plus Gaussian noise: t(x) = Σ_i w_i h_i(x) + ε, with ε ~ N(0, σ^2). Learn w using MLE:

Maximizing the log-likelihood. Maximizing with respect to w drops the terms that do not depend on w, leaving the negated sum of squared errors. Least-squares linear regression is MLE for Gaussians!!!
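
A brief reconstruction of the step the slide alludes to (the standard derivation under the noise model above, not the slide's exact algebra):

    \ln p(D \mid \mathbf{w}, \sigma)
      = \sum_{j=1}^{N} \ln \mathcal{N}\Big(t_j;\ \textstyle\sum_i w_i h_i(\mathbf{x}_j),\ \sigma^2\Big)
      = -\frac{N}{2}\ln(2\pi\sigma^2)
        - \frac{1}{2\sigma^2}\sum_{j=1}^{N}\Big(t_j - \sum_i w_i h_i(\mathbf{x}_j)\Big)^{2}

    \arg\max_{\mathbf{w}} \ln p(D \mid \mathbf{w}, \sigma)
      = \arg\min_{\mathbf{w}} \sum_{j=1}^{N}\Big(t_j - \sum_i w_i h_i(\mathbf{x}_j)\Big)^{2}

since the first term and the 1/(2σ^2) factor do not depend on w.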

Regularization in Linear Regression. One sign of overfitting: large parameter values! Regularized or penalized regression modifies the learning objective to penalize large parameters.

Ridge Regression. Introduce a new objective function: prefer low error, but also add a squared penalty on large weights: w*_ridge = argmin_w Σ_j (t_j − (w_0 + Σ_{i=1}^k w_i h_i(x_j)))^2 + λ Σ_{i=1}^k w_i^2. Here λ is a hyperparameter that balances the tradeoff. The bias feature (essentially h_0 = 1) is written out explicitly and is not penalized.

Ridge Regression in Matrix Notation. With a bias column plus k basis functions (K+1 columns total) over N data points, weights w, and observations t: w*_ridge = argmin_w (H w − t)^T (H w − t) + λ w^T I_{0+k} w, where I_{0+k} is the (k+1) × (k+1) identity matrix with a 0 in the upper-left entry, so the bias weight is not penalized.

Ridge Regression: closed form solution. w*_ridge = (H^T H + λ I_{0+k})^{-1} H^T t (N data points, K+1 basis functions including the bias).

Ridge Regression in Matrix Notation. Regression solution: simple matrix math. w*_ridge = (H^T H + λ I_{0+k})^{-1} H^T t. Compare to un-regularized regression: w* = (H^T H)^{-1} H^T t; ridge simply adds λ I_{0+k} inside the inverse.
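
A minimal NumPy sketch of this closed form, assuming the first column of H is the bias (the upper-left 0 in I_{0+k} leaves it unpenalized):

    import numpy as np

    def ridge_fit(H, t, lam):
        # w* = (H^T H + lam * I_0k)^{-1} H^T t, with the bias (column 0) unpenalized
        I_0k = np.eye(H.shape[1])
        I_0k[0, 0] = 0.0  # zero out the penalty on the bias weight
        return np.linalg.solve(H.T @ H + lam * I_0k, H.T @ t)

    x = np.array([0.0, 1.0, 2.0, 3.0])
    H = np.column_stack([np.ones_like(x), x, x**2])  # bias column first
    t = np.array([1.0, 0.5, 2.0, 5.5])
    w_ridge = ridge_fit(H, t, lam=1.0)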

Ridge Regression. How does varying lambda change w? Larger λ? Smaller λ? As λ → 0, the solution becomes the same as the unregularized MLE. As λ → ∞, all penalized weights go to 0!

Ridge Coefficient Path [Figure from the Kevin Murphy textbook: each feature's weight plotted from larger λ to smaller λ; weights shrink smoothly toward zero as λ grows.] Typical approach: select λ using cross validation; more on this later in the quarter.

How to pick lambda? Experimentation cycle: select a hypothesis f to best match the training set, then tune hyperparameters on a held-out (development) set: try many different values of lambda and pick the best one. Or, do k-fold cross validation: no held-out set; divide the training set into k subsets, repeatedly train on k−1 of them and test on the remaining one, then average the results. [Diagram: data split into Training Data / Held-Out (Development) Data / Test Data; for k-fold CV, Training Parts 1 through K plus Test Data.]
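
A sketch of the k-fold procedure for picking λ in plain NumPy (the grid, k, and names are illustrative; the bias column is assumed first, as above):

    import numpy as np

    def kfold_cv_lambda(H, t, lambdas, k=5, seed=0):
        # Return the lambda with the lowest average held-out squared error.
        n = len(t)
        folds = np.array_split(np.random.default_rng(seed).permutation(n), k)

        def ridge(H_tr, t_tr, lam):
            I = np.eye(H_tr.shape[1])
            I[0, 0] = 0.0  # bias unpenalized
            return np.linalg.solve(H_tr.T @ H_tr + lam * I, H_tr.T @ t_tr)

        avg_err = []
        for lam in lambdas:
            errs = []
            for fold in folds:
                train = np.setdiff1d(np.arange(n), fold)  # train on k-1 folds
                w = ridge(H[train], t[train], lam)
                errs.append(np.mean((H[fold] @ w - t[fold]) ** 2))  # held-out error
            avg_err.append(np.mean(errs))
        return lambdas[int(np.argmin(avg_err))]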

Why squared regularization? Ridge penalizes λ Σ_i w_i^2; LASSO penalizes λ Σ_i |w_i|. The linear (L1) penalty pushes more weights to exactly zero, which allows a type of feature selection. But it is not differentiable at zero, and there is no closed form solution.
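
Since there is no closed form, LASSO is fit iteratively (e.g., by coordinate descent). A hedged sketch using scikit-learn, assuming it is available (its alpha parameter plays the role of λ; both models fit an unpenalized intercept by default):

    import numpy as np
    from sklearn.linear_model import Lasso, Ridge

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 10))
    t = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(size=100)  # only 2 real features

    lasso = Lasso(alpha=0.5).fit(X, t)  # L1: many coefficients exactly 0
    ridge = Ridge(alpha=0.5).fit(X, t)  # L2: coefficients shrink, rarely exactly 0
    print(np.sum(lasso.coef_ == 0.0), np.sum(ridge.coef_ == 0.0))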

Geometric Intuition [Figure from Rob Tibshirani's slides: contours of the squared error around the MLE solution ŵ_MLE; ridge regression constrains w to a disk (w_1^2 + w_2^2 ≤ b), which shrinks the solution, while the lasso constrains w to a diamond (|w_1| + |w_2| ≤ b), whose corners make solutions with some weights exactly zero likely.]

LASSO Coefficient Path [Figure from the Kevin Murphy textbook: each feature's weight plotted from larger λ to smaller λ; unlike ridge, weights hit exactly zero one by one as λ grows.]

Bias-Variance tradeoff: Intuition. Model too simple: it does not fit the data well; a biased solution. Model too complex: small changes to the data change the solution a lot; a high-variance solution.

Bias-Variance Tradeoff. The choice of hypothesis class introduces a learning bias: a more complex class means less bias, but a more complex class also means more variance.

Training set error. Given a dataset (training data), choose a loss function, e.g., squared error (L2) for regression. Training error: for a particular set of parameters w, the loss function on the training data: error_train(w) = (1/N_train) Σ_j (t_j − Σ_i w_i h_i(x_j))^2.

Training error as a function of model complexity

Prediction error. The training set error can be a poor measure of the quality of the solution. Prediction error (true error): we really care about the error over all possibilities: error_true(w) = E_{x ~ p(x)}[(t(x) − Σ_i w_i h_i(x))^2].

Prediction error as a function of model complexity

Computing prediction error. Correctly computing the prediction error requires a hard integral! We may not know t(x) for every x, and may not know p(x). Monte Carlo integration (sampling approximation): sample a set of i.i.d. points {x_1, ..., x_M} from p(x) and approximate the integral with the sample average: error_true(w) ≈ (1/M) Σ_m (t(x_m) − Σ_i w_i h_i(x_m))^2.
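
A sketch of that sampling approximation (a thought experiment in practice, since it assumes we can draw fresh inputs from p(x) and query the true t(x); all names are illustrative):

    import numpy as np

    def mc_prediction_error(predict, true_t, sample_px, M=10_000, seed=0):
        # Approximate E_x[(t(x) - f(x))^2] by an average over M i.i.d. draws from p(x)
        xs = sample_px(np.random.default_rng(seed), M)
        return np.mean((true_t(xs) - predict(xs)) ** 2)

    # Illustrative usage: true function t(x) = 2x, learned predictor f(x) = 1.9x + 0.1
    err = mc_prediction_error(
        predict=lambda x: 1.9 * x + 0.1,
        true_t=lambda x: 2.0 * x,
        sample_px=lambda rng, M: rng.uniform(0.0, 1.0, size=M),
    )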

Why doesn't training set error approximate prediction error? Sampling approximation of prediction error: (1/M) Σ_m (t(x_m) − Σ_i w_i h_i(x_m))^2 over fresh samples from p(x). Training error: (1/N_train) Σ_j (t_j − Σ_i w_i h_i(x_j))^2 over the training samples. Very similar equations!!! So why is the training set a bad measure of prediction error?

Why doesn't training set error approximate prediction error? Because you cheated!!! The training error is a good estimate for a single, fixed w. But you optimized w with respect to the training error, and found the w that is good for this particular set of samples. The training error is therefore an (optimistically) biased estimate of the prediction error.

Test set error. Given a dataset, randomly split it into two parts: training data {x_1, ..., x_{N_train}} and test data {x_1, ..., x_{N_test}}. Use the training data to optimize the parameters w. Test set error: for the final solution w*, evaluate error_test(w*) = (1/N_test) Σ_{j in test} (t_j − Σ_i w*_i h_i(x_j))^2.
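
A minimal sketch of the random split and the test-set error (the 80/20 split follows the recap at the end of this lecture; names are illustrative):

    import numpy as np

    def train_test_split(H, t, frac_train=0.8, seed=0):
        # Randomly permute indices, then cut into a train part and a test part
        perm = np.random.default_rng(seed).permutation(len(t))
        n_train = int(frac_train * len(t))
        train, test = perm[:n_train], perm[n_train:]
        return H[train], t[train], H[test], t[test]

    # After fitting w_star on the training part only, report the test-set error:
    #   test_error = np.mean((H_test @ w_star - t_test) ** 2)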

Test set error as a function of model complexity

Overfitting: this slide is so important we are looking at it again! Assume: data generated from a distribution D(X, Y), and a hypothesis space H. Define the errors for a hypothesis h ∈ H: training error error_train(h) and data (true) error error_true(h). We say h overfits the training data if there exists an h' ∈ H such that: error_train(h) < error_train(h') and error_true(h) > error_true(h').

Summary: error estimators. Gold Standard: the true prediction error. Training: optimistically biased. Test: our final measure.

Error as a function of the number of training examples, for a fixed model complexity [Figure: with little data the gap between training and true error is large; with infinite data both curves approach the model's bias.]

Error as a function of the regularization parameter, for a fixed model complexity [Figure: error plotted from λ = ∞ (heavily regularized) to λ = 0 (unregularized).]

Summary: error estimators. Gold Standard: the true prediction error. Training: optimistically biased. Test: our final measure. Be careful!!! The test set is only unbiased if you never never ever ever do any any any any learning on the test data. For example, if you use the test set to select the degree of the polynomial, it is no longer unbiased!!! (We will address this problem later in the quarter.)

What you need to know. Regression: basis functions = features; optimizing the sum of squared errors; the relationship between regression and Gaussians. Regularization: ridge regression math; LASSO formulation; how to set lambda. The bias-variance trade-off.

Regression Recap. Learning is: Collect some data (e.g., sensor data and temperature). Randomly split the dataset into TRAIN and TEST (e.g., an 80%/20% random split). Choose a hypothesis class or model (e.g., linear). Choose a loss function (e.g., least squares error). Choose an optimization procedure (e.g., set derivatives to zero to obtain the estimator). Justify the accuracy of the estimate (e.g., report the test error).