Lecture 13: Model selection and regularization
Reading: Sections
STATS 202: Data mining and analysis
October 23, 2017
What do we know so far
- In linear regression, adding predictors always decreases the training error, or RSS.
- However, adding predictors does not necessarily improve the test error.
- When n < p, there is no least squares solution $\hat{\beta} = (X^T X)^{-1} X^T y$, because the matrix $X^T X$ is singular.
- So, we must find a way to select fewer predictors if we want to use least squares linear regression.
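To see the singularity concretely, here is a minimal numpy check (not from the slides; the dimensions are made up): the Gram matrix $X^T X$ has rank at most n, so it cannot be inverted when n < p.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 10, 20                       # fewer observations than predictors
X = rng.standard_normal((n, p))

# X^T X is p-by-p but has rank at most n < p, so it is singular and the
# least squares formula (X^T X)^{-1} X^T y cannot be evaluated.
print(np.linalg.matrix_rank(X.T @ X))   # prints 10, not 20
```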
Best subset selection
- Simple idea: let's compare all models with k predictors.
- There are $\binom{p}{k} = p!/[k!(p-k)!]$ possible models.
- Choose the model with the smallest RSS. Do this for every possible k.
[Figure: RSS and $R^2$ plotted against the number of predictors]
- Naturally, the RSS and $R^2$ improve as we increase k.
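The slides contain no code, but the search is easy to sketch. Here is a minimal Python version (the function name and the use of numpy's lstsq are my own choices) that finds the best size-k subset by exhaustive enumeration:

```python
import itertools
import numpy as np

def best_subset(X, y, k):
    """Best size-k subset by exhaustive search over all C(p, k) models."""
    n, p = X.shape
    best_rss, best_vars = np.inf, None
    for subset in itertools.combinations(range(p), k):
        Xk = np.column_stack([np.ones(n), X[:, list(subset)]])  # intercept + subset
        beta, *_ = np.linalg.lstsq(Xk, y, rcond=None)
        rss = np.sum((y - Xk @ beta) ** 2)
        if rss < best_rss:
            best_rss, best_vars = rss, subset
    return best_vars, best_rss

# One winner per model size, as in the RSS-versus-k plot:
# winners = [best_subset(X, y, k) for k in range(1, X.shape[1] + 1)]
```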
Best subset selection
To optimize k, we want to minimize the test error, not the training error. We could use cross-validation, or alternative estimates of the test error:

1. Akaike Information Criterion (AIC) or $C_p$: $\frac{1}{n}(\mathrm{RSS} + 2k\hat{\sigma}^2)$, where $\hat{\sigma}^2$ is an estimate of the irreducible error, typically $\hat{\sigma}^2 = \frac{1}{n-p-1}\,\times$ (RSS of the full model with all p predictors).
2. Bayesian Information Criterion (BIC): $\frac{1}{n}(\mathrm{RSS} + \log(n)\,k\hat{\sigma}^2)$.
3. Adjusted $R^2$: $R^2_{\mathrm{adj}} = 1 - \frac{\mathrm{RSS}/(n-k-1)}{\mathrm{TSS}/(n-1)}$, where $\mathrm{TSS} = \sum_i (y_i - \bar{y})^2$.

How do they compare to cross-validation?
- They are much less expensive to compute.
- They are motivated by asymptotic arguments and rely on model assumptions (e.g. normality of the errors).
- There are equivalent concepts for other models (e.g. logistic regression).
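A small helper that evaluates the three criteria above for a fitted model (a sketch; the function and argument names are mine, and it assumes the RSS, TSS, and $\hat{\sigma}^2$ have already been computed):

```python
import numpy as np

def selection_criteria(rss, tss, n, k, sigma2_hat):
    """C_p/AIC, BIC, and adjusted R^2 for a least squares fit with k predictors.

    sigma2_hat: estimate of the irreducible error variance, e.g. the full
    model's RSS divided by (n - p - 1), as in the definition above.
    """
    cp = (rss + 2 * k * sigma2_hat) / n            # AIC / C_p
    bic = (rss + np.log(n) * k * sigma2_hat) / n
    r2_adj = 1 - (rss / (n - k - 1)) / (tss / (n - 1))
    return cp, bic, r2_adj
```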
Example
Best subset selection for the Credit dataset.
[Figure: $C_p$, BIC, and adjusted $R^2$ plotted against the number of predictors]
Example
Cross-validation vs. the BIC.
[Figure: square root of BIC, validation set error, and cross-validation error plotted against the number of predictors]
Recall: In k-fold cross-validation, we can estimate a standard error for our test error estimate and apply the one-standard-error rule.
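A sketch of the one-standard-error rule (the layout is my own: a matrix of per-fold CV errors with one column per model size, columns ordered by increasing k):

```python
import numpy as np

def one_standard_error_rule(cv_errors):
    """cv_errors: array of shape (n_folds, n_models), columns ordered by
    model size. Pick the simplest model within one standard error of the
    minimum mean CV error."""
    mean = cv_errors.mean(axis=0)
    se = cv_errors.std(axis=0, ddof=1) / np.sqrt(cv_errors.shape[0])
    best = np.argmin(mean)
    threshold = mean[best] + se[best]
    return int(np.argmax(mean <= threshold))  # first (smallest) model under it
```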
Stepwise selection methods
Best subset selection has 2 problems:
1. It is often very expensive computationally. We have to fit $2^p$ models!
2. For a fixed k, if there are too many possibilities, we increase our chances of overfitting (just like in multiple testing!). The model selected has high variance.
In order to mitigate these problems, we can restrict our search space for the best model. This reduces the variance of the selected model at the expense of an increase in bias.
Forward selection
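The details of this slide were not captured in the transcription. As a sketch of the algorithm (the helper names are mine), forward selection starts from the empty model and greedily adds the single predictor that most reduces the RSS:

```python
import numpy as np

def rss(X, y, cols):
    """RSS of the least squares fit of y on an intercept plus X[:, cols]."""
    Xs = np.column_stack([np.ones(len(y))] + [X[:, j] for j in cols])
    beta, *_ = np.linalg.lstsq(Xs, y, rcond=None)
    return np.sum((y - Xs @ beta) ** 2)

def forward_selection(X, y):
    """Greedily add, at each step, the predictor that most reduces the RSS."""
    p = X.shape[1]
    selected, remaining, path = [], list(range(p)), []
    while remaining:
        j_best = min(remaining, key=lambda j: rss(X, y, selected + [j]))
        selected.append(j_best)
        remaining.remove(j_best)
        path.append(list(selected))      # nested models of size 1, ..., p
    return path
```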
Forward selection vs. best subset
Backward selection
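Backward selection runs in the opposite direction, starting from the full model; a sketch reusing the rss helper defined above:

```python
def backward_selection(X, y):
    """Start from the full model; greedily drop the predictor whose removal
    increases the RSS the least. Requires n > p to fit the full model."""
    selected = list(range(X.shape[1]))
    path = [list(selected)]
    while len(selected) > 1:
        j_drop = min(selected,
                     key=lambda j: rss(X, y, [c for c in selected if c != j]))
        selected.remove(j_drop)
        path.append(list(selected))      # nested models of size p, ..., 1
    return path
```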
Forward vs. backward selection
- You cannot apply backward selection when p > n.
- Although it seems like they should, they need not produce the same sequence of models.
- Example: let $X_1, X_2 \sim N(0, \sigma)$ be independent, let $X_3 = X_1 + 3X_2$, and let $Y = X_1 + 2X_2 + \epsilon$. Regress Y onto $X_1, X_2, X_3$.
  - Forward: $\{X_3\} \to \{X_3, X_2\} \to \{X_3, X_2, X_1\}$
  - Backward: $\{X_1, X_2, X_3\} \to \{X_1, X_2\} \to \{X_2\}$
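We can check the first forward step numerically, using the forward_selection sketch above (the sample size and noise level here are my own arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1000
x1, x2 = rng.standard_normal(n), rng.standard_normal(n)
x3 = x1 + 3 * x2                    # X3 is an exact combination of X1 and X2
y = x1 + 2 * x2 + rng.standard_normal(n)
X = np.column_stack([x1, x2, x3])

# X3 alone explains more of Y than X1 or X2 alone, so forward selection
# should add it first, even though the true model involves only X1 and X2.
print(forward_selection(X, y)[0])   # [2], the index of X3
```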
Other stepwise selection strategies
- Mixed stepwise selection: Do forward selection, but at every step, remove any variables that are no longer necessary.
- Forward stagewise selection: Like forward selection, but when introducing a new variable, do not adjust the coefficients of the previously added variables. (A sketch follows below.)
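A minimal version of the stagewise idea (my own sketch, not the slides' formulation; it assumes the columns of X are standardized, so the inner product with the residual ranks predictors by correlation, and it handles the intercept by centering y):

```python
import numpy as np

def forward_stagewise(X, y):
    """Add one predictor at a time; each new coefficient is fit by a simple
    regression of the current residual on that predictor, and earlier
    coefficients are never readjusted."""
    n, p = X.shape
    beta = np.zeros(p)
    residual = y - y.mean()
    remaining = set(range(p))
    while remaining:
        # choose the remaining predictor most aligned with the residual
        j = max(remaining, key=lambda c: abs(X[:, c] @ residual))
        remaining.remove(j)
        beta[j] = (X[:, j] @ residual) / (X[:, j] @ X[:, j])
        residual = residual - beta[j] * X[:, j]
    return beta
```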
Shrinkage methods
- The idea is to perform a linear regression while regularizing, or shrinking, the coefficients $\hat{\beta}$ toward 0.
- Why would shrunk coefficients be better? This introduces bias, but may significantly decrease the variance of the estimates. If the latter effect is larger, this would decrease the test error.
- There are also Bayesian motivations to do this: the prior tends to shrink the parameters.
Ridge regression
Ridge regression solves the following optimization:

$$\min_{\beta} \sum_{i=1}^n \left( y_i - \beta_0 - \sum_{j=1}^p \beta_j x_{i,j} \right)^2 + \lambda \sum_{j=1}^p \beta_j^2$$

- The first term (in blue on the slide) is the RSS of the model.
- The second term (in red on the slide) is the squared $\ell_2$ norm of $\beta$, or $\|\beta\|_2^2$.
- The parameter $\lambda$ is a tuning parameter. It modulates the importance of fit vs. shrinkage.
- We find an estimate $\hat{\beta}^R_\lambda$ for many values of $\lambda$ and then choose $\lambda$ by cross-validation.
- Fortunately, this is no more expensive than running a single least squares regression.
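A sketch of the tuning loop in scikit-learn, which calls the tuning parameter alpha (here X and y are placeholders for a numeric design matrix and response, and the grid is an arbitrary choice):

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

lambdas = np.logspace(-2, 4, 50)         # grid of candidate penalties
cv_mse = []
for lam in lambdas:
    scores = cross_val_score(Ridge(alpha=lam), X, y, cv=10,
                             scoring="neg_mean_squared_error")
    cv_mse.append(-scores.mean())        # average CV mean squared error

best_lambda = lambdas[np.argmin(cv_mse)]
```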
Ridge regression
- In least squares linear regression, scaling the variables has no effect on the fit of the model $Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_p X_p$: multiplying $X_1$ by c can be compensated by dividing $\hat{\beta}_1$ by c, i.e. after doing this we have the same RSS.
- In ridge regression, this is not true.
- In practice, what do we do? Scale each variable so that it has sample variance 1 before running the regression. This prevents penalizing some coefficients more than others.
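A quick numerical illustration of this point (a sketch with synthetic data; in practice the standardization step could be done with sklearn.preprocessing.StandardScaler):

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 3))
y = X @ np.array([1.0, 2.0, -1.0]) + rng.standard_normal(100)

X_resc = X.copy()
X_resc[:, 0] *= 1000                 # change the units of one predictor

for model in (LinearRegression(), Ridge(alpha=10.0)):
    f_orig = model.fit(X, y).predict(X)
    f_resc = model.fit(X_resc, y).predict(X_resc)
    print(type(model).__name__, np.allclose(f_orig, f_resc))
# LinearRegression True   (OLS fitted values are unchanged by rescaling)
# Ridge False             (the penalty treats the rescaled column differently)
```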
Example: ridge regression
Ridge regression of default in the Credit dataset.
[Figure: standardized coefficients for Income, Limit, Rating, and Student, plotted against $\lambda$ and against $\|\hat{\beta}^R_\lambda\|_2 / \|\hat{\beta}\|_2$]
Bias-variance tradeoff
In a simulation study, we compute the bias, variance, and test error as a function of $\lambda$.
[Figure: squared bias, variance, and mean squared error plotted against $\lambda$ and against $\|\hat{\beta}^R_\lambda\|_2 / \|\hat{\beta}\|_2$]
Cross-validation would yield an estimate of the test error.
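A simulation of this kind is easy to reproduce (a sketch with made-up dimensions and noise variance 1; the slides' simulation may differ): repeatedly draw a fresh training set, fit ridge for each $\lambda$, and decompose the error of the prediction at a fixed test point.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
n, p, reps = 50, 30, 200
beta_true = rng.standard_normal(p)
x0 = rng.standard_normal(p)              # a fixed test point
f0 = x0 @ beta_true                      # true mean response at x0

for lam in [0.01, 1.0, 100.0]:
    preds = []
    for _ in range(reps):                # fresh training set every repetition
        X = rng.standard_normal((n, p))
        y = X @ beta_true + rng.standard_normal(n)
        preds.append(Ridge(alpha=lam).fit(X, y).predict(x0.reshape(1, -1))[0])
    preds = np.array(preds)
    bias2, var = (preds.mean() - f0) ** 2, preds.var()
    # the noise variance is 1, so the expected test MSE is bias^2 + var + 1
    print(f"lambda={lam:6.2f}  bias^2={bias2:.3f}  var={var:.3f}  "
          f"mse={bias2 + var + 1.0:.3f}")
```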