Non-linear models. Basis expansion. Overfitting. Regularization.


Non-linear models. Basis expansion. Overfitting. Regularization.
Petr Pošík, Czech Technical University in Prague, Faculty of Electrical Engineering, Dept. of Cybernetics

Contents:
- Non-linear models: Basis expansion, Two spaces, Remarks
- How to evaluate a predictive model? Model evaluation, Training and testing error, Overfitting, Bias vs Variance, Crossvalidation, How to determine a suitable model flexibility, How to prevent overfitting?
- Regularization: Ridge, Lasso
- Summary: Competencies

When a linear model is not enough...

Basis expansion, a.k.a. feature space straightening.

Why? A linear decision boundary (or a linear regression model) may not be flexible enough to perform accurate classification (regression). The algorithms for fitting linear models can then be used to fit (certain types of) non-linear models!

How? Let's define a new multidimensional image space F. Feature vectors are transformed into this image space F (new features are derived) using a mapping Φ: z = Φ(x), where x = (x_1, x_2, ..., x_D) and z = (Φ_1(x), Φ_2(x), ..., Φ_G(x)), while usually D << G. In the image space, a linear model is trained:

f_G(z) = w_1 z_1 + w_2 z_2 + ... + w_G z_G + w_0

However, this is equivalent to training a non-linear model in the original space:

f(x) = f_G(Φ(x)) = w_1 Φ_1(x) + w_2 Φ_2(x) + ... + w_G Φ_G(x) + w_0

Two coordinate systems

Transformation into a high-dimensional image space: the feature space x = (x_1, x_2, ..., x_D) is mapped to the image space z = (z_1, z_2, ..., z_G), e.g. z_1 = log x_1, z_2 = x_1^3, z_3 = e^{x_2}, ...

Training a linear model in the image space, f_G(z) = w_1 z_1 + w_2 z_2 + w_3 z_3 + ... + w_0, then amounts to a non-linear model in the feature space: f(x) = w_1 log x_1 + w_2 x_1^3 + w_3 e^{x_2} + ... + w_0.
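To make the recipe concrete, here is a minimal NumPy sketch using the example mapping above, z = (log x_1, x_1^3, e^{x_2}); the data-generating process and its coefficients are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

# Original 2-D feature space: x = (x1, x2), with x1 > 0 so that log x1 is defined.
X = rng.uniform(0.1, 3.0, size=(100, 2))
y = 2.0 * np.log(X[:, 0]) - 0.5 * X[:, 0] ** 3 + np.exp(X[:, 1]) + rng.normal(0, 0.1, 100)

def phi(X):
    """Map x = (x1, x2) into the image space z = (log x1, x1^3, e^{x2})."""
    return np.column_stack([np.log(X[:, 0]), X[:, 0] ** 3, np.exp(X[:, 1])])

# Train a *linear* model in the image space (a column of ones provides w0).
Z = np.column_stack([phi(X), np.ones(len(X))])
w, *_ = np.linalg.lstsq(Z, y, rcond=None)

# The same weights define a *non-linear* model f(x) = f_G(Phi(x)) in the feature space.
print("weights (w1, w2, w3, w0):", np.round(w, 2))
```

The fitted weights should come out close to (2.0, -0.5, 1.0, 0.0), the coefficients used to generate the data.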

Two coordinate systems: a simple graphical example

[Figure: a low-dimensional feature space is transformed into a higher-dimensional image space; training a linear model in the image space yields a non-linear model in the feature space.]

Basis expansion: remarks

Advantages:
- A universal, generally usable method.

Disadvantages:
- We must define which new features shall form the high-dimensional space F.
- The examples must really be transformed into the high-dimensional space F.
- When too many derived features are used, the resulting models are prone to overfitting (see the next slides).

For certain types of algorithms, there is a method to perform the basis expansion without actually carrying out the mapping! (See the next lecture.)

How to evaluate a predictive model?

Model evaluation

Fundamental question: What is a good measure of model quality from the machine-learning standpoint?

We have various measures of model error:
- For regression tasks: MSE, MAE, ...
- For classification tasks: misclassification rate, measures based on the confusion matrix, ...
- Some of them can be regarded as finite approximations of the Bayes risk.

Are these functions good approximations when measured on the data the models were trained on?

[Figure: linear and cubic models fitted to two training sets. Using MSE on the training data only, both models are equivalent in one case, and the cubic model appears better than the linear one in the other.]

A basic method of evaluation is model validation on a different, independent data set from the same source, i.e. on testing data.

Validation on testing data

Example: polynomial regression with varying degree.

[Figure: training data drawn as X ∼ U(0, 3) with Gaussian noise on Y; panels show polynomial fits of degrees 0, 1, 2, 3, 5 and 9, each annotated with its training and testing error.]
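A minimal sketch of this experiment, assuming an illustrative quadratic ground truth with Gaussian noise (the exact data-generating parameters are an assumption); the degrees mirror the panels above:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.uniform(0, 3, 30)                     # small training set
y = X ** 2 + rng.normal(0, 1, 30)             # assumed ground truth + noise
X_tst = rng.uniform(0, 3, 1000)               # large testing set
y_tst = X_tst ** 2 + rng.normal(0, 1, 1000)

for degree in (0, 1, 2, 3, 5, 9):
    coefs = np.polyfit(X, y, degree)           # least-squares polynomial fit
    tr_err = np.mean((np.polyval(coefs, X) - y) ** 2)
    te_err = np.mean((np.polyval(coefs, X_tst) - y_tst) ** 2)
    print(f"degree {degree}: training MSE {tr_err:.2f}, testing MSE {te_err:.2f}")
```

The training MSE can only decrease as the degree grows (the polynomial bases are nested), while the testing MSE is typically smallest near the true degree and grows again beyond it.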

Training and testing error

[Figure: MSE vs. polynomial degree, with a training error curve and a testing error curve.]

- The training error decreases with increasing model flexibility.
- The testing error is minimal for a certain degree of model flexibility.

Overfitting

Definition of overfitting: Let H be a hypothesis space. Let h_1 ∈ H and h_2 ∈ H be different hypotheses from this space. Let Err_Tr(h) be the error of the hypothesis h measured on the training dataset (training error), and Err_Tst(h) the error of h measured on the testing dataset (testing error). We say that h_1 is overfitted if there is another h_2 for which

Err_Tr(h_1) < Err_Tr(h_2) and Err_Tst(h_1) > Err_Tst(h_2).

[Figure: model error vs. model flexibility; past a certain flexibility, the error on the testing data grows while the training error keeps decreasing.]

When overfitted, the model works well for the training data, but fails for new (testing) data. Overfitting is a general phenomenon affecting all kinds of inductive learning of models with tunable flexibility.

We want models and learning algorithms with a good generalization ability, i.e. we want models that encode only the relationships valid in the whole domain, not the specifics of the training data; in other words, we want algorithms able to find only the relationships valid in the whole domain and to ignore the specifics of the training data.
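The definition can be checked numerically. In the polynomial setting above, a degree-9 hypothesis h_1 fitted to a small sample typically beats a degree-2 hypothesis h_2 on the training data while losing to it on testing data; a minimal sketch (the data are again an illustrative assumption):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.uniform(0, 3, 15)
y = X ** 2 + rng.normal(0, 1, 15)
X_tst = rng.uniform(0, 3, 1000)
y_tst = X_tst ** 2 + rng.normal(0, 1, 1000)

def mse(coefs, X, y):
    return np.mean((np.polyval(coefs, X) - y) ** 2)

h1 = np.polyfit(X, y, 9)   # very flexible hypothesis
h2 = np.polyfit(X, y, 2)   # hypothesis matching the true degree
# For a typical run, h1 is overfitted:
# Err_Tr(h1) < Err_Tr(h2), but Err_Tst(h1) > Err_Tst(h2).
print(f"Err_Tr:  h1 = {mse(h1, X, y):.3f},  h2 = {mse(h2, X, y):.3f}")
print(f"Err_Tst: h1 = {mse(h1, X_tst, y_tst):.3f},  h2 = {mse(h2, X_tst, y_tst):.3f}")
```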

Bias vs Variance

[Figure: three polynomial fits with their training and testing errors: an inflexible model, a well-chosen model, and a degree-9 model.]

- High bias: the model is not flexible enough (underfit).
- Just right (good fit).
- High variance: the model flexibility is too high (overfit).

High bias problem:
- Err_Tr(h) is high,
- Err_Tst(h) ≈ Err_Tr(h).

High variance problem:
- Err_Tr(h) is low,
- Err_Tst(h) >> Err_Tr(h).

[Figure: model error vs. model flexibility, with the high-bias regime on the left and the high-variance regime on the right of the testing-error minimum.]

Crossvalidation

How to estimate the true error of a model on new, unseen data?

Simple crossvalidation:
- Split the data into training and testing subsets.
- Train the model on the training data.
- Evaluate the model error on the testing data.

K-fold crossvalidation:
- Split the data into k folds (k is usually 5 or 10).
- In each iteration, use k-1 folds to train the model and the remaining 1 fold to test it, i.e. to measure the error:
  Iter. 1: Training | Training | ... | Testing
  Iter. 2: Training | ... | Testing | Training
  ...
  Iter. k: Testing | Training | ... | Training
- Aggregate (average) the k error measurements to get the final error estimate.
- Train the model on the whole data set.

Leave-one-out (LOO) crossvalidation:
- k = T, i.e. the number of folds is equal to the training set size.
- Time consuming for large T.
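A minimal hand-rolled sketch of k-fold crossvalidation for the polynomial example (NumPy only; the data-generating process and the candidate degrees are illustrative assumptions):

```python
import numpy as np

def kfold_mse(X, y, degree, k=5, seed=0):
    """Average testing MSE of a degree-`degree` polynomial over k folds."""
    idx = np.random.default_rng(seed).permutation(len(X))
    errors = []
    for fold in np.array_split(idx, k):
        train = np.setdiff1d(idx, fold)               # the remaining k-1 folds
        coefs = np.polyfit(X[train], y[train], degree)
        errors.append(np.mean((np.polyval(coefs, X[fold]) - y[fold]) ** 2))
    return np.mean(errors)                             # aggregate the k measurements

rng = np.random.default_rng(3)
X = rng.uniform(0, 3, 50)
y = X ** 2 + rng.normal(0, 1, 50)

for degree in (1, 2, 3, 5, 9):
    print(f"degree {degree}: 5-fold CV MSE {kfold_mse(X, y, degree):.2f}")
# Leave-one-out crossvalidation is the special case k = len(X).
```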

How to determine a suitable model flexibility

Simply test models of varying complexity and choose the one with the best testing error, right? Not quite:
- The testing data are then used to tune a meta-parameter of the model.
- The testing data are thus used to train (a part of) the model, and essentially become part of the training data.
- The error on the testing data is then no longer an unbiased estimate of the model error; it underestimates it.
- A new, separate data set is needed to estimate the model error.

Using simple crossvalidation (see the code sketch below):
1. Training data: use cca 50 % of the data for model building.
2. Validation data: use cca 25 % of the data to search for the suitable model flexibility.
3. Train the suitable model on the training + validation data.
4. Testing data: use cca 25 % of the data for the final estimate of the model error.

Using k-fold crossvalidation:
1. Training data: use cca 75 % of the data to find and train a suitable model using crossvalidation.
2. Testing data: use cca 25 % of the data for the final estimate of the model error.

The ratios are not set in stone; there are other possibilities.

How to prevent overfitting?

1. Feature selection: reduce the number of features.
   - Manually select which features to keep.
   - Try to identify a suitable subset of features during the learning phase (many feature selection methods exist; none is perfect).
2. Regularization: keep all the features, but reduce the magnitude of their weights w.
   - Works well if we have a lot of features, each of which contributes a bit to predicting y.
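A sketch of the simple-crossvalidation recipe above, with the 50/25/25 split and the illustrative polynomial setting used earlier:

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.uniform(0, 3, 200)
y = X ** 2 + rng.normal(0, 1, 200)

idx = rng.permutation(200)
tr, va, te = idx[:100], idx[100:150], idx[150:]    # 50 % / 25 % / 25 %

def mse(coefs, part):
    return np.mean((np.polyval(coefs, X[part]) - y[part]) ** 2)

# 1) + 2) Fit every candidate flexibility on the training data,
#         choose the one with the smallest validation error.
degrees = (1, 2, 3, 5, 9)
best = min(degrees, key=lambda d: mse(np.polyfit(X[tr], y[tr], d), va))

# 3) Retrain the chosen model on training + validation data.
tr_va = np.concatenate([tr, va])
final = np.polyfit(X[tr_va], y[tr_va], best)

# 4) The held-out testing data give the final estimate of the model error.
print(f"chosen degree: {best}, testing MSE: {mse(final, te):.2f}")
```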

Regularization

Ridge regularization (a.k.a. Tikhonov regularization)

Ridge regularization penalizes the size of the model coefficients. Modification of the optimization criterion:

J(w) = (1/T) Σ_{i=1..T} (y^(i) - h_w(x^(i)))^2 + α Σ_{d=1..D} w_d^2

The solution is given by a modified normal equation:

w = (X^T X + αI)^(-1) X^T y

- As α → 0, w_ridge → w_OLS.
- As α → ∞, w_ridge → 0.
- OLS: ordinary least squares, i.e. just a simple multiple linear regression.

[Figure: training and testing errors, and the values of the coefficients (weights w), as functions of the regularization factor α.]

Lasso regularization

Lasso regularization penalizes the size of the model coefficients. Modification of the optimization criterion:

J(w) = (1/T) Σ_{i=1..T} (y^(i) - h_w(x^(i)))^2 + α Σ_{d=1..D} |w_d|

As α grows, lasso regularization decreases the number of non-zero coefficients, effectively also performing feature selection and creating sparse models.

[Figure: training and testing errors, and the values of the coefficients, as functions of the regularization factor α.]
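A sketch of ridge regression via the modified normal equation above, on illustrative synthetic data. Lasso is omitted here because the |w_d| penalty is non-smooth, so there is no closed form and an iterative solver (e.g., coordinate descent) is needed.

```python
import numpy as np

rng = np.random.default_rng(5)
n, d = 50, 10
X = rng.normal(size=(n, d))
w_true = np.zeros(d)
w_true[:3] = (2.0, -1.0, 0.5)                  # only 3 informative features
y = X @ w_true + rng.normal(0, 0.1, n)

def ridge(X, y, alpha):
    """Solve the modified normal equation (X^T X + alpha*I) w = X^T y."""
    return np.linalg.solve(X.T @ X + alpha * np.eye(X.shape[1]), X.T @ y)

for alpha in (0.0, 1.0, 100.0):
    w = ridge(X, y, alpha)
    print(f"alpha = {alpha:6.1f}   ||w|| = {np.linalg.norm(w):.3f}")
# As alpha -> 0 the solution approaches the OLS weights;
# as alpha grows, the weights shrink towards zero.
```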

Summary

Competencies

After this lecture, a student shall be able to:
- explain the reason for doing basis expansion (feature space straightening), and describe its principle;
- show the effect of basis expansion with a linear model on a simple example for both classification and regression settings;
- implement user-defined basis expansions in a certain programming language;
- list advantages and disadvantages of basis expansion;
- explain why the error measured on the training data is not a good estimate of the expected error of the model for new data, and whether it under- or overestimates the true error;
- explain basic methods to get an unbiased estimate of the true model error (testing data, k-fold crossvalidation, LOO crossvalidation);
- describe the general form of the dependency of the model training and testing errors on the model complexity/flexibility/capacity;
- define overfitting;
- discuss the high bias and high variance problems of models;
- explain how to proceed if a suitable model complexity must be chosen as part of the training process;
- list basic methods of overfitting prevention;
- describe the principles of ridge (Tikhonov) and lasso regularization and their effects on the model parameters.