CS 189 Introduction to Machine Learning
Fall 2017  Note 3

1 Regression and hyperparameters

Recall the supervised regression setting, in which we attempt to learn a mapping $f : \mathbb{R}^d \to \mathbb{R}$ from labeled examples $\{(\mathbf{x}_i, y_i)\}_{i=1}^{n}$, where $y_i \approx f(\mathbf{x}_i)$. The typical strategy in regression is to pick some parametric model and then fit the parameters of the model to the data. Here we consider a model of the form

$$f_{\boldsymbol\alpha}(\mathbf{x}) = \sum_{j=1}^{p} \alpha_j \, \phi_j(\mathbf{x})$$

where $p$ is the model order, the $\alpha_j$ are the weights, and the $\phi_j$ compute features of the input $\mathbf{x}$. The functions $\phi_j$ are sometimes called basis functions because our model $f_{\boldsymbol\alpha}$ is a linear combination of them. In the simplest case, we could just use the components of $\mathbf{x}$ as features (i.e. $\phi_j(\mathbf{x}) = x_j$), but in future lectures it will sometimes be helpful to disambiguate the features of an example from the example itself, so we encourage this way of thinking now.

Once we've fixed a model, we can find the optimal value of $\boldsymbol\alpha$ by minimizing some cost function $L$ that measures how poorly the model fits the data:

$$\min_{\boldsymbol\alpha} L(\{\mathbf{x}_i, y_i\}_{i=1}^{n}, \boldsymbol\alpha) \;\overset{\text{e.g.}}{=}\; \min_{\boldsymbol\alpha} \sum_{i=1}^{n} \big(f_{\boldsymbol\alpha}(\mathbf{x}_i) - y_i\big)^2$$

Observe that the model order $p$ is not one of the decision variables being optimized in the minimization above. For this reason $p$ is called a hyperparameter. We might say more specifically that it is a model hyperparameter, since it determines the structure of the model.

Let's consider another example. Ridge regression is like ordinary least squares except that we add an $\ell_2$ penalty on the parameters $\boldsymbol\alpha$:

$$\min_{\boldsymbol\alpha} \|A\boldsymbol\alpha - \mathbf{b}\|^2 + \lambda \|\boldsymbol\alpha\|_2^2$$

The solution to this optimization problem is obtained by solving the linear system

$$(A^\top A + \lambda I)\,\boldsymbol\alpha = A^\top \mathbf{b}$$

The regularization weight $\lambda$ is also a hyperparameter, as it is fixed during the minimization above. However $\lambda$, unlike the previously discussed hyperparameter $p$, is not a part of the model. Rather, it is an aspect of the optimization procedure used to fit the model, so we say it is an optimization hyperparameter. Hyperparameters tend to fall into one of these two categories.
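As a concrete illustration of the ridge regression solution above, here is a minimal numpy sketch. The polynomial feature map, the toy data, and all function names are illustrative assumptions, not part of the note.

```python
import numpy as np

def fit_ridge(A, b, lam):
    """Solve the linear system (A^T A + lambda*I) alpha = A^T b for the weights."""
    d = A.shape[1]
    return np.linalg.solve(A.T @ A + lam * np.eye(d), A.T @ b)

def poly_features(x, p):
    """Toy feature map phi_j(x) = x^j for j = 1, ..., p on a 1-D input."""
    return np.vstack([x ** j for j in range(1, p + 1)]).T

# Toy data: noisy samples of a smooth function (purely illustrative).
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=50)
y = np.sin(3 * x) + 0.1 * rng.standard_normal(50)

A = poly_features(x, p=5)        # design matrix whose rows are feature vectors
alpha = fit_ridge(A, y, lam=1e-2)
```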

2 Kinds of error

Let us define the mean-squared training error (or just training error) as

$$\mathrm{MSE}_{\text{train}}(\boldsymbol\alpha) = \frac{1}{n} \sum_{i=1}^{n} \big(f_{\boldsymbol\alpha}(\mathbf{x}_i) - y_i\big)^2$$

This is a measure of how well the model fits the data.

Note that as we add more features, training error can only decrease. This is because the optimizer can set $\alpha_j = 0$ if feature $j$ cannot be used to reduce training error.

But training error is not exactly the kind of error we care about. In particular, training error only reflects the data in the training set, whereas we really want the model to perform well on new points sampled from the same data distribution as the training data. With this motivation we define the true error as

$$\mathrm{MSE}(\boldsymbol\alpha) = \mathbb{E}\big[(f_{\boldsymbol\alpha}(\mathbf{x}) - y)^2\big]$$

where the expectation is taken over all pairs $(\mathbf{x}, y)$ from the data distribution. We generally cannot compute the true error because we do not have access to the data distribution, but we will see later that we can estimate it.

Adding more features tends to reduce true error as long as the additional features are useful predictors of the output. However, if we keep adding features, these begin to fit noise in the training data instead of the true signal, causing true error to actually increase. This phenomenon is known as overfitting.

[Figure: true error as a function of the number of features]
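To make the claim that training error can only decrease concrete, here is a small continuation of the numpy sketch above; the polynomial feature map and toy data are the same illustrative assumptions, not part of the note.

```python
def train_mse(A, y, alpha):
    """Mean-squared training error: (1/n) * sum_i (f_alpha(x_i) - y_i)^2."""
    return np.mean((A @ alpha - y) ** 2)

# Training error can only decrease (or stay flat) as we add features,
# since the optimizer may always set the new weights to zero.
for p in [1, 2, 5, 10, 15]:
    A_p = poly_features(x, p)
    # A tiny lambda keeps the system well conditioned; this is essentially OLS.
    alpha_p = fit_ridge(A_p, y, lam=1e-8)
    print(f"p = {p:2d}  training MSE = {train_mse(A_p, y, alpha_p):.4f}")
```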

The regularization hyperparameter $\lambda$ has a somewhat different effect on training error. Observe that if $\lambda = 0$, we recover the exact OLS problem, which directly minimizes the training error. As $\lambda$ increases, the optimizer places less emphasis on the training error and more emphasis on reducing the magnitude of the parameters. This leads to a degradation in training error as $\lambda$ grows:

[Figure: training error as a function of the regularization weight]

3 Validation

We've given some examples of hyperparameters and mentioned that they are not determined by the data-fitting optimization procedure. How, then, should we choose the values of $p$ and $\lambda$?

3.1 Validation Error

The key idea is to set aside some portion (say 30%) of the training data, called the validation set, which is not allowed to be used when fitting the model:

[Figure: the data split into training, validation, and test sets]

We can use this validation set to estimate the true error, as in

$$\mathrm{MSE}_{\text{val}}(\boldsymbol\alpha) = \frac{1}{m} \sum_{k=1}^{m} \big(f_{\lambda}(\mathbf{x}_{v,k}) - y_{v,k}\big)^2 \;\approx\; \mathbb{E}\big[(f_{\lambda}(\mathbf{x}) - y)^2\big]$$

where $f_\lambda$ is the model obtained by solving the regularized problem with value $\lambda$, and $m$ is the number of validation points.
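Below is a minimal sketch of the held-out split and the validation-error estimate, continuing the illustrative setup above; the 30% split fraction mirrors the note, while the helper names and the grid of $\lambda$ values are assumptions. It also previews the hyperparameter-selection rule described next.

```python
def split_train_val(x, y, val_frac=0.3, seed=0):
    """Randomly hold out a fraction of the data as a validation set."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(x))
    m = int(val_frac * len(x))
    return idx[m:], idx[:m]          # training indices, validation indices

train_idx, val_idx = split_train_val(x, y)

# Fit on the training portion only, then estimate the true error on the
# held-out validation portion; keep the lambda with the lowest validation MSE.
best_lam, best_mse = None, np.inf
for lam in [1e-4, 1e-3, 1e-2, 1e-1, 1.0]:
    A_tr = poly_features(x[train_idx], p=5)
    A_val = poly_features(x[val_idx], p=5)
    alpha = fit_ridge(A_tr, y[train_idx], lam)
    val_mse = np.mean((A_val @ alpha - y[val_idx]) ** 2)
    if val_mse < best_mse:
        best_lam, best_mse = lam, val_mse
```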

The validation error tracks the true error reasonably well as long as $m$ is large:

[Figure: average validation error]

Note that this approximation only works because the validation set is disjoint from the training set. If we trained and evaluated the model using the same data points, the estimate would be very biased, since the model has been constructed specifically to perform well on those points.

Now that we can estimate the true error, we have a simple method for choosing hyperparameter values: simply try a bunch of configurations of the hyperparameters and choose the one that yields the lowest validation error.

3.2 Cross-validation

Setting aside a validation set works well, but comes at a cost, since we cannot use the validation data for training. Since having more data generally improves the quality of the trained model, we may prefer not to let that data go to waste, especially if we have little data to begin with and/or collecting more data is expensive. Cross-validation is an alternative to having a dedicated validation set. k-fold cross-validation works as follows:

1. Shuffle the data and partition it into k equally-sized (or as equal as possible) blocks.

2. For i = 1, ..., k:
   - Train the model on all the data except block i.
   - Evaluate the model (i.e. compute the validation error) using block i.

[Figure: the data partitioned into blocks 1, 2, ..., k]

3. Average the k validation errors; this is our final estimate of the true error.

Note that this process (except for the shuffling and partitioning) must be repeated for every hyperparameter configuration we wish to test. This is the principal drawback of k-fold cross-validation as compared to using a held-out validation set: there is roughly k times as much computation required. This is not a big deal for the relatively small linear models that we've seen so far, but it can be prohibitively expensive when the model takes a long time to train, as is the case in the Big Data regime or when using neural networks.
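Here is a minimal sketch of the k-fold procedure above, again continuing the illustrative polynomial/ridge setup; the helper names and the hyperparameter grid are assumptions, not part of the note.

```python
def k_fold_cv_mse(x, y, lam, p, k=5, seed=0):
    """Estimate the true error of one (lambda, p) configuration by k-fold CV."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(x))            # 1. shuffle the data ...
    folds = np.array_split(idx, k)           #    ... and partition into k blocks
    errors = []
    for i in range(k):                       # 2. for each block i:
        val = folds[i]
        tr = np.concatenate([folds[j] for j in range(k) if j != i])
        A_tr, A_val = poly_features(x[tr], p), poly_features(x[val], p)
        alpha = fit_ridge(A_tr, y[tr], lam)  #    train on all data except block i,
        errors.append(np.mean((A_val @ alpha - y[val]) ** 2))  # evaluate on block i
    return np.mean(errors)                   # 3. average the k validation errors

# The whole procedure is repeated for every hyperparameter configuration we test.
scores = {(lam, p): k_fold_cv_mse(x, y, lam, p)
          for lam in [1e-3, 1e-2, 1e-1]
          for p in [3, 5, 10]}
best_lam, best_p = min(scores, key=scores.get)
```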