Non-linear models. Basis expansion. Overfitting. Regularization.


Non-linear models. Basis expansion. Overfitting. Regularization.
Petr Pošík, Czech Technical University in Prague, Faculty of Electrical Engineering, Dept. of Cybernetics

Contents:
- Non-linear models: Basis expansion, Two spaces, Remarks
- How to evaluate a predictive model? Model evaluation, Training and testing error, Overfitting, Bias vs Variance, Crossvalidation, How to determine a suitable model flexibility, How to prevent overfitting?
- Regularization: Ridge, Lasso
- Summary: Competencies

When a linear model is not enough...

Basis expansion, a.k.a. feature space straightening.

Why? A linear decision boundary (or a linear regression model) may not be flexible enough to perform accurate classification (regression). The algorithms for fitting linear models can then be used to fit (certain types of) non-linear models!

How? Let's define a new multidimensional image space F. Feature vectors are transformed into this image space F (new features are derived) using a mapping Φ: z = Φ(x), where x = (x_1, x_2, ..., x_D) and z = (Φ_1(x), Φ_2(x), ..., Φ_G(x)), while usually D << G. In the image space, a linear model is trained:

f_G(z) = w_1 z_1 + w_2 z_2 + ... + w_G z_G + w_0

However, this is equivalent to training a non-linear model in the original space:

f(x) = f_G(Φ(x)) = w_1 Φ_1(x) + w_2 Φ_2(x) + ... + w_G Φ_G(x) + w_0

Two coordinate systems

Transformation into a high-dimensional image space: the feature space x = (x_1, x_2, ..., x_D) is mapped to the image space z = (z_1, z_2, ..., z_G), e.g. z_1 = log x_1, z_2 = x_1^3, z_3 = e^{x_2}, ...

Training a linear model in the image space, f_G(z) = w_1 z_1 + w_2 z_2 + w_3 z_3 + ... + w_0, then amounts to a non-linear model in the feature space: f(x) = w_1 log x_1 + w_2 x_1^3 + w_3 e^{x_2} + ... + w_0.
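To make the recipe concrete, here is a minimal NumPy sketch using the example mapping above, z = (log x_1, x_1^3, e^{x_2}); the data-generating process and its coefficients are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

# Original 2-D feature space: x = (x1, x2), with x1 > 0 so that log x1 is defined.
X = rng.uniform(0.1, 3.0, size=(100, 2))
y = 2.0 * np.log(X[:, 0]) - 0.5 * X[:, 0] ** 3 + np.exp(X[:, 1]) + rng.normal(0, 0.1, 100)

def phi(X):
    """Map x = (x1, x2) into the image space z = (log x1, x1^3, e^{x2})."""
    return np.column_stack([np.log(X[:, 0]), X[:, 0] ** 3, np.exp(X[:, 1])])

# Train a *linear* model in the image space (a column of ones provides w0).
Z = np.column_stack([phi(X), np.ones(len(X))])
w, *_ = np.linalg.lstsq(Z, y, rcond=None)

# The same weights define a *non-linear* model f(x) = f_G(Phi(x)) in the feature space.
print("weights (w1, w2, w3, w0):", np.round(w, 2))
```

The fitted weights should come out close to (2.0, -0.5, 1.0, 0.0), the coefficients used to generate the data.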

Two coordinate systems: a simple graphical example

[Figure: a low-dimensional feature space is transformed into a higher-dimensional image space; training a linear model in the image space yields a non-linear model in the feature space.]

Basis expansion: remarks

Advantages:
- A universal, generally usable method.

Disadvantages:
- We must define which new features shall form the high-dimensional space F.
- The examples must really be transformed into the high-dimensional space F.
- When too many derived features are used, the resulting models are prone to overfitting (see the next slides).

For certain types of algorithms, there is a method to perform the basis expansion without actually carrying out the mapping! (See the next lecture.)

How to evaluate a predictive model?

Model evaluation

Fundamental question: What is a good measure of model quality from the machine-learning standpoint?

We have various measures of model error:
- For regression tasks: MSE, MAE, ...
- For classification tasks: misclassification rate, measures based on the confusion matrix, ...
- Some of them can be regarded as finite approximations of the Bayes risk.

Are these functions good approximations when measured on the data the models were trained on?

[Figure: linear and cubic models fitted to two training sets. Using MSE on the training data only, both models are equivalent in one case, and the cubic model appears better than the linear one in the other.]

A basic method of evaluation is model validation on a different, independent data set from the same source, i.e. on testing data.

Validation on testing data

Example: polynomial regression with varying degree.

[Figure: training data drawn as X ∼ U(0, 3) with Gaussian noise on Y; panels show polynomial fits of degrees 0, 1, 2, 3, 5 and 9, each annotated with its training and testing error.]
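A minimal sketch of this experiment, assuming an illustrative quadratic ground truth with Gaussian noise (the exact data-generating parameters are an assumption); the degrees mirror the panels above:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.uniform(0, 3, 30)                     # small training set
y = X ** 2 + rng.normal(0, 1, 30)             # assumed ground truth + noise
X_tst = rng.uniform(0, 3, 1000)               # large testing set
y_tst = X_tst ** 2 + rng.normal(0, 1, 1000)

for degree in (0, 1, 2, 3, 5, 9):
    coefs = np.polyfit(X, y, degree)           # least-squares polynomial fit
    tr_err = np.mean((np.polyval(coefs, X) - y) ** 2)
    te_err = np.mean((np.polyval(coefs, X_tst) - y_tst) ** 2)
    print(f"degree {degree}: training MSE {tr_err:.2f}, testing MSE {te_err:.2f}")
```

The training MSE can only decrease as the degree grows (the polynomial bases are nested), while the testing MSE is typically smallest near the true degree and grows again beyond it.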

Training and testing error

[Figure: MSE vs. polynomial degree, with a training error curve and a testing error curve.]

- The training error decreases with increasing model flexibility.
- The testing error is minimal for a certain degree of model flexibility.

Overfitting

Definition of overfitting: Let H be a hypothesis space. Let h_1 ∈ H and h_2 ∈ H be different hypotheses from this space. Let Err_Tr(h) be the error of the hypothesis h measured on the training dataset (training error), and Err_Tst(h) the error of h measured on the testing dataset (testing error). We say that h_1 is overfitted if there is another h_2 for which

Err_Tr(h_1) < Err_Tr(h_2) and Err_Tst(h_1) > Err_Tst(h_2).

[Figure: model error vs. model flexibility; past a certain flexibility, the error on the testing data grows while the training error keeps decreasing.]

When overfitted, the model works well for the training data, but fails for new (testing) data. Overfitting is a general phenomenon affecting all kinds of inductive learning of models with tunable flexibility.

We want models and learning algorithms with a good generalization ability, i.e. we want models that encode only the relationships valid in the whole domain, not the specifics of the training data; in other words, we want algorithms able to find only the relationships valid in the whole domain and to ignore the specifics of the training data.
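The definition can be checked numerically. In the polynomial setting above, a degree-9 hypothesis h_1 fitted to a small sample typically beats a degree-2 hypothesis h_2 on the training data while losing to it on testing data; a minimal sketch (the data are again an illustrative assumption):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.uniform(0, 3, 15)
y = X ** 2 + rng.normal(0, 1, 15)
X_tst = rng.uniform(0, 3, 1000)
y_tst = X_tst ** 2 + rng.normal(0, 1, 1000)

def mse(coefs, X, y):
    return np.mean((np.polyval(coefs, X) - y) ** 2)

h1 = np.polyfit(X, y, 9)   # very flexible hypothesis
h2 = np.polyfit(X, y, 2)   # hypothesis matching the true degree
# For a typical run, h1 is overfitted:
# Err_Tr(h1) < Err_Tr(h2), but Err_Tst(h1) > Err_Tst(h2).
print(f"Err_Tr:  h1 = {mse(h1, X, y):.3f},  h2 = {mse(h2, X, y):.3f}")
print(f"Err_Tst: h1 = {mse(h1, X_tst, y_tst):.3f},  h2 = {mse(h2, X_tst, y_tst):.3f}")
```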

Bias vs Variance

[Figure: three polynomial fits with their training and testing errors: an inflexible model, a well-chosen model, and a degree-9 model.]

- High bias: the model is not flexible enough (underfit).
- Just right (good fit).
- High variance: the model flexibility is too high (overfit).

High bias problem:
- Err_Tr(h) is high,
- Err_Tst(h) ≈ Err_Tr(h).

High variance problem:
- Err_Tr(h) is low,
- Err_Tst(h) >> Err_Tr(h).

[Figure: model error vs. model flexibility, with the high-bias regime on the left and the high-variance regime on the right of the testing-error minimum.]

Crossvalidation

How to estimate the true error of a model on new, unseen data?

Simple crossvalidation:
- Split the data into training and testing subsets.
- Train the model on the training data.
- Evaluate the model error on the testing data.

K-fold crossvalidation:
- Split the data into k folds (k is usually 5 or 10).
- In each iteration, use k-1 folds to train the model and the remaining 1 fold to test it, i.e. to measure the error:
  Iter. 1: Training | Training | ... | Testing
  Iter. 2: Training | ... | Testing | Training
  ...
  Iter. k: Testing | Training | ... | Training
- Aggregate (average) the k error measurements to get the final error estimate.
- Train the model on the whole data set.

Leave-one-out (LOO) crossvalidation:
- k = T, i.e. the number of folds is equal to the training set size.
- Time consuming for large T.
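A minimal hand-rolled sketch of k-fold crossvalidation for the polynomial example (NumPy only; the data-generating process and the candidate degrees are illustrative assumptions):

```python
import numpy as np

def kfold_mse(X, y, degree, k=5, seed=0):
    """Average testing MSE of a degree-`degree` polynomial over k folds."""
    idx = np.random.default_rng(seed).permutation(len(X))
    errors = []
    for fold in np.array_split(idx, k):
        train = np.setdiff1d(idx, fold)               # the remaining k-1 folds
        coefs = np.polyfit(X[train], y[train], degree)
        errors.append(np.mean((np.polyval(coefs, X[fold]) - y[fold]) ** 2))
    return np.mean(errors)                             # aggregate the k measurements

rng = np.random.default_rng(3)
X = rng.uniform(0, 3, 50)
y = X ** 2 + rng.normal(0, 1, 50)

for degree in (1, 2, 3, 5, 9):
    print(f"degree {degree}: 5-fold CV MSE {kfold_mse(X, y, degree):.2f}")
# Leave-one-out crossvalidation is the special case k = len(X).
```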

How to determine a suitable model flexibility

Simply test models of varying complexity and choose the one with the best testing error, right? Not quite:
- The testing data are then used to tune a meta-parameter of the model.
- The testing data are thus used to train (a part of) the model, and essentially become part of the training data.
- The error on the testing data is then no longer an unbiased estimate of the model error; it underestimates it.
- A new, separate data set is needed to estimate the model error.

Using simple crossvalidation (see the code sketch below):
1. Training data: use cca 50 % of the data for model building.
2. Validation data: use cca 25 % of the data to search for the suitable model flexibility.
3. Train the suitable model on the training + validation data.
4. Testing data: use cca 25 % of the data for the final estimate of the model error.

Using k-fold crossvalidation:
1. Training data: use cca 75 % of the data to find and train a suitable model using crossvalidation.
2. Testing data: use cca 25 % of the data for the final estimate of the model error.

The ratios are not set in stone; there are other possibilities.

How to prevent overfitting?

1. Feature selection: reduce the number of features.
   - Manually select which features to keep.
   - Try to identify a suitable subset of features during the learning phase (many feature selection methods exist; none is perfect).
2. Regularization: keep all the features, but reduce the magnitude of their weights w.
   - Works well if we have a lot of features, each of which contributes a bit to predicting y.
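A sketch of the simple-crossvalidation recipe above, with the 50/25/25 split and the illustrative polynomial setting used earlier:

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.uniform(0, 3, 200)
y = X ** 2 + rng.normal(0, 1, 200)

idx = rng.permutation(200)
tr, va, te = idx[:100], idx[100:150], idx[150:]    # 50 % / 25 % / 25 %

def mse(coefs, part):
    return np.mean((np.polyval(coefs, X[part]) - y[part]) ** 2)

# 1) + 2) Fit every candidate flexibility on the training data,
#         choose the one with the smallest validation error.
degrees = (1, 2, 3, 5, 9)
best = min(degrees, key=lambda d: mse(np.polyfit(X[tr], y[tr], d), va))

# 3) Retrain the chosen model on training + validation data.
tr_va = np.concatenate([tr, va])
final = np.polyfit(X[tr_va], y[tr_va], best)

# 4) The held-out testing data give the final estimate of the model error.
print(f"chosen degree: {best}, testing MSE: {mse(final, te):.2f}")
```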

Regularization

Ridge regularization (a.k.a. Tikhonov regularization)

Ridge regularization penalizes the size of the model coefficients. Modification of the optimization criterion:

J(w) = (1/T) Σ_{i=1..T} (y^(i) - h_w(x^(i)))^2 + α Σ_{d=1..D} w_d^2

The solution is given by a modified normal equation:

w = (X^T X + αI)^(-1) X^T y

- As α → 0, w_ridge → w_OLS.
- As α → ∞, w_ridge → 0.
- OLS: ordinary least squares, i.e. just a simple multiple linear regression.

[Figure: training and testing errors, and the values of the coefficients (weights w), as functions of the regularization factor α.]

Lasso regularization

Lasso regularization penalizes the size of the model coefficients. Modification of the optimization criterion:

J(w) = (1/T) Σ_{i=1..T} (y^(i) - h_w(x^(i)))^2 + α Σ_{d=1..D} |w_d|

As α grows, lasso regularization decreases the number of non-zero coefficients, effectively also performing feature selection and creating sparse models.

[Figure: training and testing errors, and the values of the coefficients, as functions of the regularization factor α.]
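A sketch of ridge regression via the modified normal equation above, on illustrative synthetic data. Lasso is omitted here because the |w_d| penalty is non-smooth, so there is no closed form and an iterative solver (e.g., coordinate descent) is needed.

```python
import numpy as np

rng = np.random.default_rng(5)
n, d = 50, 10
X = rng.normal(size=(n, d))
w_true = np.zeros(d)
w_true[:3] = (2.0, -1.0, 0.5)                  # only 3 informative features
y = X @ w_true + rng.normal(0, 0.1, n)

def ridge(X, y, alpha):
    """Solve the modified normal equation (X^T X + alpha*I) w = X^T y."""
    return np.linalg.solve(X.T @ X + alpha * np.eye(X.shape[1]), X.T @ y)

for alpha in (0.0, 1.0, 100.0):
    w = ridge(X, y, alpha)
    print(f"alpha = {alpha:6.1f}   ||w|| = {np.linalg.norm(w):.3f}")
# As alpha -> 0 the solution approaches the OLS weights;
# as alpha grows, the weights shrink towards zero.
```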

Summary

Competencies

After this lecture, a student shall be able to:
- explain the reason for doing basis expansion (feature space straightening), and describe its principle;
- show the effect of basis expansion with a linear model on a simple example for both classification and regression settings;
- implement user-defined basis expansions in a certain programming language;
- list advantages and disadvantages of basis expansion;
- explain why the error measured on the training data is not a good estimate of the expected error of the model for new data, and whether it under- or overestimates the true error;
- explain basic methods to get an unbiased estimate of the true model error (testing data, k-fold crossvalidation, LOO crossvalidation);
- describe the general form of the dependency of the model training and testing errors on the model complexity/flexibility/capacity;
- define overfitting;
- discuss the high bias and high variance problems of models;
- explain how to proceed if a suitable model complexity must be chosen as part of the training process;
- list basic methods of overfitting prevention;
- describe the principles of ridge (Tikhonov) and lasso regularization and their effects on the model parameters.