Lasso Regression: Regularization for feature selection


CSE 416: Machine Learning, Emily Fox, University of Washington, April 12, 2018

Symptom of overfitting: often, overfitting is associated with very large estimated parameters ŵ.

[Figure: price ($) y plotted against square feet (sq.ft.) x, with a fitted function f_ŵ labeled OVERFIT.]

[Block diagram: Training Data → Feature extraction h(x) → ML model → ŷ, with the ML algorithm and quality metric producing the estimate ŵ.]

Consider a specific total cost. We want to balance:
i. How well the function fits the data (measure of fit: RSS(w))
ii. The magnitude of the coefficients (measure of magnitude: ||w||_2^2)

Total cost = measure of fit + measure of magnitude of coefficients

Consider the resulting objective: what if ŵ is selected to minimize

RSS(w) + λ ||w||_2^2

where the tuning parameter λ balances fit and magnitude? This is ridge regression (a.k.a. L2 regularization); a brief code sketch appears below.

Measure of magnitude of regression coefficients: what summary number is indicative of the size of the regression coefficients?
- Sum?
- Sum of absolute values?
- Sum of squares (L2 norm)
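To make the ridge objective concrete, here is a minimal sketch (my own illustration, not the lecture's code) using scikit-learn's Ridge on synthetic data; the data, feature count, and alpha values are illustrative assumptions.

```python
# Minimal sketch (not the lecture's code): ridge regression on synthetic data.
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 8))                    # 100 observations, 8 features
w_true = np.array([5.0, -3.0, 0, 0, 2.0, 0, 0, 0])
y = X @ w_true + rng.normal(scale=0.5, size=100)

for alpha in [0.01, 1.0, 100.0]:                 # alpha plays the role of λ
    model = Ridge(alpha=alpha).fit(X, y)
    # Larger λ shrinks the coefficients toward 0, but they rarely become exactly 0.
    print(alpha, np.round(model.coef_, 3))
```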

Feature selection task: why might you want to perform feature selection?

Efficiency:
- If size(w) = 100B, each prediction is expensive
- If ŵ is sparse (many zeros), computation only depends on the number of non-zeros (a small sketch follows below):

ŷ_i = Σ_{j: ŵ_j ≠ 0} ŵ_j h_j(x_i)

Interpretability:
- Which features are relevant for prediction?
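As a hypothetical illustration of the efficiency point above (not from the lecture), a prediction with a sparse ŵ only needs to touch the non-zero coefficients; `predict_sparse` and the example arrays below are made up for this sketch.

```python
# Hypothetical illustration of the efficiency argument: with a sparse ŵ, a
# prediction only needs the non-zero coefficients instead of all D features.
import numpy as np

def predict_sparse(w_hat, h_x):
    """Compute ŷ = Σ_{j: ŵ_j != 0} ŵ_j h_j(x) using only the non-zero entries."""
    nz = np.flatnonzero(w_hat)                   # indices j with ŵ_j != 0
    return w_hat[nz] @ h_x[nz]

w_hat = np.array([0.0, 3.2, 0.0, 0.0, -1.5, 0.0])   # mostly zeros
h_x = np.array([1.0, 2.0, 0.5, 4.0, 1.0, 3.0])      # features h(x) for one example
print(predict_sparse(w_hat, h_x))                # only 2 of the 6 terms are used
```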

Sparsity: Housing application. Predict house price ($?) from many possible features: Lot size, Single Family, Year built, Last sold price, Last sale price/sqft, Finished sqft, Unfinished sqft, Finished basement sqft, # floors, Flooring types, Parking type, Parking amount, Cooling, Heating, Exterior materials, Roof type, Structure style, Dishwasher, Garbage disposal, Microwave, Range / Oven, Refrigerator, Washer, Dryer, Laundry location, Heating type, Jetted Tub, Deck, Fenced Yard, Lawn, Garden, Sprinkler System.

Sparsity: Reading your mind. Activity in which brain regions can predict happiness (ranging from very sad to very happy)?

Option 1: All subsets or greedy variants

[Figure sequence: the all-subsets search steps through model sizes 0, 1, 2, ..., 8, finding the best model of each size.]

Note: the best models of different sizes are not necessarily nested!

Choosing model complexity?
Option 1: Assess on a validation set
Option 2: Cross validation
Option 3+: Other metrics for penalizing model complexity, like BIC

Complexity of all subsets: how many models were evaluated? Each model is indexed by the features included:
y_i = ε_i                                            [0 0 0 0 0 0]
y_i = w_0 h_0(x_i) + ε_i                             [1 0 0 0 0 0]
y_i = w_1 h_1(x_i) + ε_i                             [0 1 0 0 0 0]
y_i = w_0 h_0(x_i) + w_1 h_1(x_i) + ε_i              [1 1 0 0 0 0]
...
y_i = w_0 h_0(x_i) + w_1 h_1(x_i) + ... + w_D h_D(x_i) + ε_i    [1 1 1 1 1 1]

That is 2^D models in total:
2^8 = 256
2^30 = 1,073,741,824
2^1000 = 1.071509 x 10^301
2^100B = HUGE!!!!!!
Typically, computationally infeasible.
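The following rough sketch (my own, on assumed synthetic data) enumerates all 2^D subsets for a tiny D = 6 and records the best training RSS at each size; the exponential count above is exactly why this only works for very small D.

```python
# Rough sketch (assumed synthetic data): brute-force all-subsets selection for a
# tiny D. With 2^D candidate models this is only feasible when D is very small.
from itertools import combinations
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
D, N = 6, 200
X = rng.normal(size=(N, D))
y = 4.0 * X[:, 0] - 2.0 * X[:, 2] + rng.normal(scale=0.5, size=N)

best = {}
for size in range(D + 1):
    for subset in combinations(range(D), size):  # every model of this size
        cols = list(subset)
        if size == 0:
            rss = np.sum((y - y.mean()) ** 2)    # model with no features
        else:
            pred = LinearRegression().fit(X[:, cols], y).predict(X[:, cols])
            rss = np.sum((y - pred) ** 2)
        if size not in best or rss < best[size][0]:
            best[size] = (rss, cols)

for size, (rss, cols) in best.items():
    print(f"best model of size {size}: features {cols}, training RSS {rss:.1f}")
```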

Greedy algorithms
Forward stepwise: start from a simple model and iteratively add the feature most useful to the fit (a sketch follows below).
Backward stepwise: start with the full model and iteratively remove the feature least useful to the fit.
Combining forward and backward steps: in the forward algorithm, insert steps to remove features that are no longer as important.
Lots of other variants, too.
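Here is a rough sketch of forward stepwise selection under a standard greedy formulation (my own code, not the course's): at each step, add the single feature that most reduces training RSS. In practice a validation set or cross validation would decide when to stop.

```python
# Rough sketch of forward stepwise selection: at each step, add the single
# feature that most reduces training RSS.
import numpy as np
from sklearn.linear_model import LinearRegression

def forward_stepwise(X, y, max_features):
    selected, remaining = [], list(range(X.shape[1]))
    for _ in range(max_features):
        rss_by_candidate = {}
        for j in remaining:                      # try adding each remaining feature
            cols = selected + [j]
            pred = LinearRegression().fit(X[:, cols], y).predict(X[:, cols])
            rss_by_candidate[j] = np.sum((y - pred) ** 2)
        best_j = min(rss_by_candidate, key=rss_by_candidate.get)
        selected.append(best_j)
        remaining.remove(best_j)
    return selected

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 10))
y = 3.0 * X[:, 1] - 2.0 * X[:, 7] + rng.normal(scale=0.5, size=200)
print(forward_stepwise(X, y, max_features=3))    # expect features 1 and 7 to enter early
```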

Option 2: Regularize

Ridge regression: L2 regularized regression
Total cost = measure of fit + λ · measure of magnitude of coefficients
           = RSS(w) + λ ||w||_2^2,  where ||w||_2^2 = w_0^2 + ... + w_D^2
Encourages small weights, but not exactly 0.

[Figure: ridge coefficient path — coefficients ŵ_j plotted against λ for bedrooms, bathrooms, sqft_living, sqft_lot, floors, yr_built, yr_renovated, waterfront; the coefficients shrink toward 0 as λ grows but do not become exactly 0.]

Using regularization for feature selection
Instead of searching over a discrete set of solutions, can we use regularization?
- Start with the full model (all possible features)
- Shrink some coefficients exactly to 0, i.e., knock out certain features
- Non-zero coefficients indicate the selected features

Thresholding ridge coefficients? Why don't we just set small ridge coefficients to 0?

[Figure, repeated across several slides: bar chart of ridge coefficients for # bedrooms, # bathrooms, sq.ft. living, sq.ft. lot, floors, year built, year renovated, last sales price, cost per sq.ft., heating, # showers, waterfront, with a threshold marking the selected features.]

Consider the features selected for a given threshold value, and look at two related features (# bathrooms and # showers): nothing measuring bathrooms was included! If only one of these features had been included in the model, bathrooms would have appeared in the selected model. Thresholding ridge coefficients is therefore unreliable when features are correlated (see the sketch below). Can regularization lead directly to sparsity?
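To see the thresholding problem concretely, here is an illustrative sketch (my own example, with made-up features, not the lecture's data) in which two nearly duplicate features split the weight under ridge, so both coefficients look small, while lasso tends to keep only one of them.

```python
# Illustrative example: two nearly duplicate features split the weight under
# ridge, so both coefficients look "small" and a threshold would drop both;
# lasso tends to keep one of them.
import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.default_rng(3)
n = 500
bathrooms = rng.integers(1, 4, size=n).astype(float)
showers = bathrooms + rng.normal(scale=0.01, size=n)   # nearly identical feature
sqft = rng.normal(2000.0, 500.0, size=n)
X = np.column_stack([bathrooms, showers, sqft])
y = 50.0 * bathrooms + 0.1 * sqft + rng.normal(scale=5.0, size=n)

print("ridge:", np.round(Ridge(alpha=10.0).fit(X, y).coef_, 2))
print("lasso:", np.round(Lasso(alpha=1.0, max_iter=10_000).fit(X, y).coef_, 2))
```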

Try this cost instead of ridge:
Total cost = measure of fit + λ · measure of magnitude of coefficients
           = RSS(w) + λ ||w||_1,  where ||w||_1 = |w_0| + ... + |w_D|
This is lasso regression (a.k.a. L1 regularized regression). It leads to sparse solutions!

Lasso regression: L1 regularized regression
Just like ridge regression, the solution is governed by a continuous tuning parameter λ that balances fit and sparsity in RSS(w) + λ ||w||_1:
If λ = 0: ŵ is the unregularized least squares solution ŵ^LS
If λ = ∞: ŵ = 0
If λ is in between: 0 ≤ ||ŵ_λ||_1 ≤ ||ŵ^LS||_1
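A minimal sketch (assumed synthetic setup, not the lecture's demo) contrasting ridge and lasso fits: the L1 penalty drives some coefficients exactly to 0.

```python
# Minimal sketch: the L1 penalty drives some coefficients exactly to 0,
# while the L2 penalty only shrinks them.
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(4)
X = rng.normal(size=(300, 10))
w_true = np.zeros(10)
w_true[[0, 3, 7]] = [4.0, -2.0, 1.5]             # only 3 features actually matter
y = X @ w_true + rng.normal(scale=0.5, size=300)

print("ridge:", np.round(Ridge(alpha=1.0).fit(X, y).coef_, 3))   # small but non-zero
print("lasso:", np.round(Lasso(alpha=0.1).fit(X, y).coef_, 3))   # exact zeros appear
```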

[Figure: coefficient path for ridge — coefficients ŵ_j vs. λ for bedrooms, bathrooms, sqft_living, sqft_lot, floors, yr_built, yr_renovated, waterfront; coefficients shrink gradually and remain non-zero.]

[Figure: coefficient path for lasso — same features and λ range; coefficients hit exactly 0 one by one as λ increases, so the set of selected features shrinks.]
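A coefficient path like the lasso figure above can be computed with scikit-learn's `lasso_path`; the data below is an assumed synthetic stand-in for the housing features.

```python
# Sketch of computing a lasso coefficient path with scikit-learn's lasso_path.
import numpy as np
from sklearn.linear_model import lasso_path

rng = np.random.default_rng(5)
X = rng.normal(size=(200, 8))
w_true = np.array([3.0, 0, 0, -2.0, 0, 1.0, 0, 0])
y = X @ w_true + rng.normal(scale=0.5, size=200)

alphas, coefs, _ = lasso_path(X, y, n_alphas=50)
# coefs has shape (n_features, n_alphas): one column of coefficients per λ value.
for a, c in list(zip(alphas, coefs.T))[::10]:
    print(f"λ = {a:.3f}   non-zero coefficients = {np.count_nonzero(c)}")
```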

Revisit the polynomial fit demo: what happens if we refit our high-order polynomial, but now using lasso regression? We will consider a few settings of λ.

How to choose λ: cross validation

If we have a sufficient amount of data, split it into a training set, a validation set, and a test set:
- Training set: fit ŵ_λ
- Validation set: test the performance of ŵ_λ to select λ*
- Test set: assess the generalization error of ŵ_λ*

But suppose we start with a smallish dataset containing all the data.

Still form a test set and hold it out; call what remains the rest of the data. How do we use this other data? Use it for both training and validation, but not so naively.

Recall the naïve approach: split the rest of the data into a training set and a small validation set. Is this validation set enough to compare the performance of ŵ_λ across λ values? No.

Choosing the validation set: we didn't have to use the last data points tabulated as the validation set; we can use any subset of the data.

Which subset should I use? ALL of them! Average the performance over all choices of validation set.

K-fold cross validation: divide the rest of the data into K blocks, each containing N/K observations.
Preprocessing: randomly assign the data to the K groups (use the same split of the data for all other steps).

K-fold cross validation (the slides step through the blocks k = 1, ..., 5, each taking a turn as the validation set):
For k = 1, ..., K:
1. Estimate ŵ_λ^(k) on the training blocks (all blocks except block k)
2. Compute the error on the validation block: error_k(λ)

Compute the average error: CV(λ) = (1/K) Σ_{k=1}^K error_k(λ)

Repeat the procedure for each choice of λ, and choose λ* to minimize CV(λ).
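The procedure above maps directly to code; this sketch (with an assumed dataset and λ grid) runs 5-fold cross validation over a grid of λ values for lasso and picks λ* minimizing CV(λ).

```python
# Sketch of K-fold cross validation for choosing the lasso tuning parameter λ,
# mirroring the procedure above with K = 5.
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import KFold

rng = np.random.default_rng(6)
X = rng.normal(size=(200, 10))
w_true = np.array([3.0, 0, 0, -2.0, 0, 0, 1.0, 0, 0, 0])
y = X @ w_true + rng.normal(scale=0.5, size=200)

lambdas = np.logspace(-3, 1, 20)
kfold = KFold(n_splits=5, shuffle=True, random_state=0)  # same split reused for every λ

cv_error = []
for lam in lambdas:
    fold_errors = []
    for train_idx, valid_idx in kfold.split(X):
        model = Lasso(alpha=lam, max_iter=10_000).fit(X[train_idx], y[train_idx])  # ŵ_λ^(k)
        resid = y[valid_idx] - model.predict(X[valid_idx])
        fold_errors.append(np.mean(resid ** 2))           # error_k(λ)
    cv_error.append(np.mean(fold_errors))                 # CV(λ)

best_lambda = lambdas[int(np.argmin(cv_error))]
print("λ* =", best_lambda)
```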

What value of K? Formally, the best approximation occurs for validation sets of size 1 (K = N), i.e., leave-one-out cross validation. This is computationally intensive: it requires computing N fits of the model per λ. Typically, K = 5 or 10 is used (5-fold or 10-fold CV).

Choosing λ via cross validation for lasso: cross validation chooses the λ that provides the best predictive accuracy. It tends to favor less sparse solutions, and thus smaller λ, than the optimal choice for feature selection (cf. Machine Learning: A Probabilistic Perspective, Murphy, 2012, for further discussion).

Practical concerns with lasso

Debiasing lasso: lasso shrinks coefficients relative to the least squares (LS) solution → more bias, less variance. We can reduce the bias as follows:
1. Run lasso to select features
2. Run least squares regression with only the selected features
The relevant features are then no longer shrunk relative to the LS fit of the same reduced model.

[Figure (used with permission of Mario Figueiredo; captions modified to fit the course): True coefficients (D = 4096, non-zero = 160); L1 reconstruction (non-zero = 1024, MSE = 0.0072); Debiased (non-zero = 1024, MSE = 3.26e-005); Least squares (non-zero = 0, MSE = 1.568).]
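A short sketch of the debiasing recipe (assumed data; the variable names are mine): lasso is used only to choose the support, and ordinary least squares is refit on the selected features.

```python
# Sketch of the debiasing recipe: lasso picks the support, then plain least
# squares is refit on those features.
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression

rng = np.random.default_rng(7)
X = rng.normal(size=(300, 50))
w_true = np.zeros(50)
w_true[:5] = [5.0, -4.0, 3.0, -2.0, 1.0]
y = X @ w_true + rng.normal(scale=0.5, size=300)

lasso = Lasso(alpha=0.1).fit(X, y)
selected = np.flatnonzero(lasso.coef_)                 # step 1: features chosen by lasso
refit = LinearRegression().fit(X[:, selected], y)      # step 2: LS on the selected features

print("lasso coefficients (shrunk):", np.round(lasso.coef_[selected], 2))
print("debiased LS coefficients:   ", np.round(refit.coef_, 2))
```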

Issues with the standard lasso objective
1. With a group of highly correlated features, lasso tends to select among them arbitrarily; often we would prefer to select them all together.
2. Empirically, ridge often has better predictive performance than lasso, but lasso leads to a sparser solution.
The elastic net aims to address these issues: it is a hybrid between lasso and ridge regression that uses both L1 and L2 penalties (a brief sketch follows below). See Zou & Hastie '05 for further discussion.
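A minimal sketch of the elastic net idea (my own example, not from the lecture): with three nearly identical copies of a feature, lasso typically concentrates the weight on one copy, while the elastic net's added L2 penalty tends to spread it across the group.

```python
# Minimal sketch of the elastic net: three nearly identical copies of one signal
# plus three unrelated features. Lasso typically keeps one copy; the elastic
# net's L2 term tends to spread weight across the whole group.
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso

rng = np.random.default_rng(8)
z = rng.normal(size=300)
copies = np.column_stack([z + 0.01 * rng.normal(size=300) for _ in range(3)])
X = np.column_stack([copies, rng.normal(size=(300, 3))])
y = 2.0 * z + rng.normal(scale=0.5, size=300)

# l1_ratio mixes the penalties: 1.0 is pure lasso, 0.0 is pure ridge.
print("lasso:      ", np.round(Lasso(alpha=0.05, max_iter=10_000).fit(X, y).coef_, 2))
print("elastic net:", np.round(ElasticNet(alpha=0.05, l1_ratio=0.5, max_iter=10_000).fit(X, y).coef_, 2))
```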

Summary for feature selection and lasso regression

Impact of feature selection and lasso: lasso has changed machine learning, statistics, & electrical engineering. But for feature selection in general, be careful about interpreting the selected features:
- the selection only considers the features included
- it is sensitive to correlations between features
- the result depends on the algorithm used
- there are theoretical guarantees for lasso under certain conditions

What you can do now:
- Describe all subsets and greedy variants for feature selection
- Analyze the computational costs of these algorithms
- Formulate the lasso objective
- Describe what happens to estimated lasso coefficients as the tuning parameter λ is varied
- Interpret a lasso coefficient path plot
- Contrast ridge and lasso regression
- Implement K-fold cross validation to select the lasso tuning parameter λ