CSE446: Linear Regression. Spring 2017

Similar documents
CSE446: Linear Regression. Spring 2017

EE 511 Linear Regression

Overfitting. Machine Learning CSE546 Carlos Guestrin University of Washington. October 2, Bias-Variance Tradeoff

CSE 446 Bias-Variance & Naïve Bayes

Simple Model Selection Cross Validation Regularization Neural Networks

Boosting Simple Model Selection Cross Validation Regularization

10601 Machine Learning. Model and feature selection

Boosting Simple Model Selection Cross Validation Regularization. October 3 rd, 2007 Carlos Guestrin [Schapire, 1989]

CSC 411: Lecture 02: Linear Regression

Introduction to Automated Text Analysis. bit.ly/poir599

Robust Regression. Robust Data Mining Techniques By Boonyakorn Jantaranuson

DS Machine Learning and Data Mining I. Alina Oprea Associate Professor, CCIS Northeastern University

Using Machine Learning to Optimize Storage Systems

Machine Learning. Topic 4: Linear Regression Models

Model Complexity and Generalization

Lecture on Modeling Tools for Clustering & Regression

Nonparametric Methods Recap

Lasso Regression: Regularization for feature selection

Model Assessment and Selection. Reference: The Elements of Statistical Learning, by T. Hastie, R. Tibshirani, J. Friedman, Springer

Hyperparameters and Validation Sets. Sargur N. Srihari

CSE Data Mining Concepts and Techniques STATISTICAL METHODS (REGRESSION) Professor- Anita Wasilewska. Team 13

CSE 417T: Introduction to Machine Learning. Lecture 6: Bias-Variance Trade-off. Henry Chai 09/13/18

FMA901F: Machine Learning Lecture 3: Linear Models for Regression. Cristian Sminchisescu

Assignment No: 2. Assessment as per Schedule. Specifications Readability Assignments

CS6375: Machine Learning Gautam Kunapuli. Mid-Term Review

Machine Learning: An Applied Econometric Approach Online Appendix

Performance Estimation and Regularization. Kasthuri Kannan, PhD. Machine Learning, Spring 2018

Linear Model Selection and Regularization. especially usefull in high dimensions p>>100.

Machine Learning / Jan 27, 2010

Network Traffic Measurements and Analysis

Regularization and model selection

Bias-Variance Decomposition Error Estimators

Instance-based Learning

Last time... Coryn Bailer-Jones. check and if appropriate remove outliers, errors etc. linear regression

Topics in Machine Learning-EE 5359 Model Assessment and Selection

CPSC 340: Machine Learning and Data Mining. More Regularization Fall 2017

2017 ITRON EFG Meeting. Abdul Razack. Specialist, Load Forecasting NV Energy

Machine Learning: Think Big and Parallel

Linear Regression & Gradient Descent

Chapter 7: Numerical Prediction

Leveling Up as a Data Scientist. ds/2014/10/level-up-ds.jpg

INTRODUCTION TO MACHINE LEARNING. Measuring model performance or error

CPSC 340: Machine Learning and Data Mining

CPSC 340: Machine Learning and Data Mining. Probabilistic Classification Fall 2017

Mini-project 2 CMPSCI 689 Spring 2015 Due: Tuesday, April 07, in class

model order p weights The solution to this optimization problem is obtained by solving the linear system

Bias-Variance Decomposition Error Estimators Cross-Validation

Lecture 26: Missing data

CPSC 340: Machine Learning and Data Mining

Model selection and validation 1: Cross-validation

Moving Beyond Linearity

DS Machine Learning and Data Mining I. Alina Oprea Associate Professor, CCIS Northeastern University

Clustering K-means. Machine Learning CSEP546 Carlos Guestrin University of Washington February 18, Carlos Guestrin

Instance-Based Learning: Nearest neighbor and kernel regression and classificiation

Neural Networks. Single-layer neural network. CSE 446: Machine Learning Emily Fox University of Washington March 10, /10/2017

The exam is closed book, closed notes except your one-page cheat sheet.

Gradient Descent Optimization Algorithms for Deep Learning Batch gradient descent Stochastic gradient descent Mini-batch gradient descent

Logistic Regression and Gradient Ascent

Last time... Bias-Variance decomposition. This week

Lecture 13: Model selection and regularization

CS535 Big Data Fall 2017 Colorado State University 10/10/2017 Sangmi Lee Pallickara Week 8- A.

What is machine learning?

CS 229 Midterm Review

Instance-Based Learning: Nearest neighbor and kernel regression and classificiation

CPSC 340: Machine Learning and Data Mining. Kernel Trick Fall 2017

Linear Methods for Regression and Shrinkage Methods

CPSC 340: Machine Learning and Data Mining. Feature Selection Fall 2017

Big Data Methods. Chapter 5: Machine learning. Big Data Methods, Chapter 5, Slide 1

MATH 829: Introduction to Data Mining and Analysis Model selection

Programming Exercise 5: Regularized Linear Regression and Bias v.s. Variance

Fitting. Fitting. Slides S. Lazebnik Harris Corners Pkwy, Charlotte, NC

More on Neural Networks. Read Chapter 5 in the text by Bishop, except omit Sections 5.3.3, 5.3.4, 5.4, 5.5.4, 5.5.5, 5.5.6, 5.5.7, and 5.

Support Vector Machines

Nearest Neighbor with KD Trees

Machine Learning and Computational Statistics, Spring 2015 Homework 1: Ridge Regression and SGD

Non-linear models. Basis expansion. Overfitting. Regularization.

Classification and Regression Trees

Clustering K-means. Machine Learning CSEP546 Carlos Guestrin University of Washington February 18, Carlos Guestrin

COMPUTATIONAL INTELLIGENCE SEW (INTRODUCTION TO MACHINE LEARNING) SS18. Lecture 6: k-nn Cross-validation Regularization

In the real world, light sources emit light particles, which travel in space, reflect at objects or scatter in volumetric media (potentially multiple

1 StatLearn Practical exercise 5

Instance-based Learning

CPSC 340: Machine Learning and Data Mining. Regularization Fall 2016

Perceptron as a graph

Lecture 16: High-dimensional regression, non-linear regression

Model validation T , , Heli Hiisilä

CSC 411 Lecture 4: Ensembles I

Problems 1 and 5 were graded by Amin Sorkhei, Problems 2 and 3 by Johannes Verwijnen and Problem 4 by Jyrki Kivinen. Entropy(D) = Gini(D) = 1

Introduction to Machine Learning Spring 2018 Note Sparsity and LASSO. 1.1 Sparsity for SVMs

Cross-validation. Cross-validation is a resampling method.

LECTURE 12: LINEAR MODEL SELECTION PT. 3. October 23, 2017 SDS 293: Machine Learning

Simulation studies. Patrick Breheny. September 8. Monte Carlo simulation Example: Ridge vs. Lasso vs. Subset

Unsupervised Learning: Clustering

Model Fitting, RANSAC. Jana Kosecka

STA 4273H: Sta-s-cal Machine Learning

6 Model selection and kernels

Logistic Regression. Abstract

STA 4273H: Statistical Machine Learning

CPSC 340: Machine Learning and Data Mining. Principal Component Analysis Fall 2016

Multivariate Analysis Multivariate Calibration part 2

Transcription:

CSE446: Linear Regression Spring 2017 Ali Farhadi Slides adapted from Carlos Guestrin and Luke Zettlemoyer

Prediction of continuous variables Billionaire says: Wait, that s not what I meant! You say: Chill out, dude. He says: I want to predict a continuous variable for continuous inputs: I want to predict salaries from GPA. You say: I can regress that

Linear Regression Prediction Prediction 40 26 24 22 20 20 30 20 0 0 10 20 0 0 10 20 30 40

Ordinary Least Squares (OLS) Error or residual Observation Prediction 0 0 20

The regression problem Instances: <xj, tj> Learn: Mapping from x to t(x) Hypothesis space: Given, basis functions {h1,,hk} Find coeffs w={w1,,wk} Why is this usually called linear regression? model is linear in the parameters Can we estimate functions that are not lines???

Linear Basis: 1D input Need a bias term: {h1(x) = x, h2(x)=1} y x

Parabola: {h1(x) = x2, h2(x)=x, h3(x)=1} y x 2D: {h1(x) = x12, h2(x)= x22, h3(x)=x1x2, } Can define any basis functions hi(x) for ndimensional input x=<x1,,xn>

The regression problem Instances: <xj, tj> Learn: Mapping from x to t(x) Hypothesis space: Given, basis functions {h1,,hk} Find coeffs w={w1,,wk} Why is this usually called linear regression? model is linear in the parameters Can we estimate functions that are not lines??? Precisely, minimize the residual squared error:

Regression: matrix notation weights N observed outputs K basis func N data points K basis functions measuremen ts

Regression: closed form solution

Regression solution: simple matrix math where k k matrix for k basis functions k 1 vector

But, why? Billionaire (again) says: Why sum squared error??? You say: Gaussians, Dr. Gateson, Gaussians Model: prediction is linear function plus Gaussian noise t(x) = i wi hi(x) + Learn w using MLE:

Maximizing log-likelihood Maximize wrt w: Least-squares Linear Regression is MLE for Gaussians!!!

Regularization in Linear Regression One sign of overfitting: large parameter values! Regularized or penalized regressions modified learning object to penalize large parameters

Ridge Regression Introduce a new objective function: Prefer low error but also add a squared penalize for large weights λ is hyperparameter that balances tradeoff Explicitly writing out bias feature (essentially h0=1), which is not penalized

Ridge Regression in Matrix Notation Ridge Regression: matrix notation w r idge = arg m in w XN t(x j ) (w 0 + j= 1! Xk w ih i(x j )) i= 1 2 + λ Xk w i2 i= 1 + λ wt I 0+k w 2005-2013 Carlos Guestrin bias column and k basis functions weights weights observations 21 N observed outputs K+1 basis functions N data points N data points 1 k basis K+1 basis functs plusfunc bias h0...hk N data points 1... measuremen ts k+1 0 0 0 1 0 0 0 0 1 k+1 k+1 x k+1 identity matrix, but with 0 in upper left

w r idge = a r g m in w XN t(x j ) (w 0 + j= 1 Xk! 2 w i h i (x j )) + λ Xk i= 1 Ridge Regression: closed form solution w i2 i= 1 + λ wt I 0+k w h0...hk weights N data points K+1 basis func N data points K+1 basis functions observations 21 2005-2013 Carlos Guestrin Minimizing the Ridge Regression Objective XN Xk! 2 Xk

Ridge Regression in Matrix Notation Regression solution: simple matrix math w r idge = arg m in w XN t(x j ) (w 0 + j= 1 Xk! w ih i(x j )) i= 1 2 + λ Xk w i2 i= 1 + λ wt I 0+k w h0...hk N data points K+1 basis func N data points Compare to un-regularized regression: K+1 basis functions 2005-2013 Carlos Guestrin weights observations 21

Ridge Regression How does varying lambda change w? Larger λ? Smaller λ? As λ 0? Becomes same a MLE, unregularized As λ? All weights will be 0!

Ridge Coefficient Path Feature Weight Ridge Coefficent Path Larger λ From Kevin Murphy textbook Smaller λ Typical approach: select λ using cross validation, more on this later in the quarter

How to pick lambda? Experimentation cycle Select a hypothesis f to best match training set Tune hyperparameters on held-out set Try many different values of lambda, pick best one Or, can do k-fold cross validation No held-out set Divide training set into k subsets Repeatedly train on k-1 and test on remaining one Average the results Training Part 1 Training Data Training Part 2 Held-Out (Development) Data Training Part K Test Data Test Data

Why squared regularization? Ridge: LASSO: Linear penalty pushes more weights to zero Allows for a type of feature selection But, not differentiable and no closed form solution.

GeometricP i cintuition t u r e o f L a sso a n d R id g e r eg r essio n w 2 ^ w. w2 MLE 2005-2014 Carlos Guestrin. L a sso MLE w1 w1 R id g e R eg r essio n ^ w an d R id g e From Rob Tibshirani R id g e R slides 10

LASSO Coefficent Path Feature Weight From Kevin Murphy textbook Larger λ Smaller λ

Bias-Variance tradeoff Intuition Model too simple: does not fit the data well A biased solution Model too complex: small changes to the data, solution changes a lot A high-variance solution

Bias-Variance Tradeoff Choice of hypothesis class introduces learning bias More complex class less bias More complex class more variance

Training set error Given a dataset (Training data) Choose a loss function e.g., squared error (L2) for regression Training error: For a particular set of parameters, loss function on training data:

Training error as a function of model complexity

Prediction error Training set error can be poor measure of quality of solution Prediction error (true error): We really care about error over all possibilities:

Prediction error as a function of model complexity

Computing prediction error To correctly predict error Hard integral! May not know t(x) for every x, may not know p(x) Monte Carlo integration (sampling approximation) Sample a set of i.i.d. points {x1,,xm} from p(x) Approximate integral with sample average

Why training set error doesn t approximate prediction error? Sampling approximation of prediction error: Training error : Very similar equations!!! Why is training set a bad measure of prediction error???

Why training set error doesn t approximate prediction error? Sampling approximation of prediction error: Because you cheated!!! Training error good estimate for a single w, But you optimized w with respect to the training error, and found w that is good for this set of samples Training error : Training error is a (optimistically) biased estimate of prediction error Very similar equations!!! Why is training set a bad measure of prediction error???

Test set error Given a dataset, randomly split it into two parts: Training data {x1,, xntrain} Test data {x1,, xntest} Use training data to optimize parameters w Test set error: For the final solution w*, evaluate the error using:

Test set error as a function of model complexity

Overfitting: this slide is so important we are looking at it again! Assume: Data generated from distribution D(X,Y) A hypothesis space H Define: errors for hypothesis h H Training error: errortrain(h) Data (true) error: errortrue(h) We say h overfits the training data if there exists an h H such that: errortrain(h) < errortrain(h ) and errortrue(h) > errortrue(h )

Summary: error estimators Gold Standard: Training: optimistically biased Test: our final measure

bias Error as a function of number of training examples for a fixed model complexity little data infinite data

Error as function of regularization parameter, fixed model complexity λ= λ=0

Summary: error estimators Gold Standard: Be careful!!! Test set only unbiased if you never never ever ever Training: dooptimistically any any anybiased any learning on the test data For example, if you use the test set to select the degree of the polynomial no longer unbiased!!! our final measure this problem later in the quarter) Test:(We will address

What you need to know Regression Basis function = features Optimizing sum squared error Relationship between regression and Gaussians Regularization Ridge regression math LASSO Formulation How to set lambda Bias-Variance trade-off