CS4619: Artificial Intelligence II
Overfitting and Underfitting
Derek Bridge
School of Computer Science and Information Technology
University College Cork

Initialization

In [1]:
%load_ext autoreload
%autoreload 2
%matplotlib inline

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [3]:
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.model_selection import validation_curve
from sklearn.model_selection import learning_curve
from sklearn.model_selection import cross_validate

Acknowledgements

The book was helpful again: A. Géron: Hands-On Machine Learning with Scikit-Learn & TensorFlow, O'Reilly, 2017.
I based some of this notebook on some resources from Jake VanderPlas.

Introduction

You are building an estimator but its performance is not good enough:
- In the case of a regressor, its validation error is too high
- In the case of a classifier, its validation accuracy is too low
What should you do? The options include:
- Gather more training examples (see also Data Augmentation, later)
- Remove noise in the training examples
- Add more features or remove features
- Change model: move to a more complex model or maybe to a less complex model
- Stick with your existing model but add constraints to it to reduce its complexity, or remove constraints to increase its complexity
Surprisingly, gathering more training examples may not help; adding more features may in some cases worsen the performance; changing to a more complex model may in some cases worsen the performance. It all depends on what is causing the poor performance (underfitting or overfitting).
This lecture shows you how to diagnose the problem and choose remedies that suit the diagnosis.

Defining Underfitting and Overfitting

To illustrate the concepts, we will use an artificial dataset.
So that we can plot things in 2D, the dataset will have just one feature: a numeric-valued feature whose values range from 0 to 1.
The target will also be numeric-valued and will be a non-linear function of the feature.
But we'll add a bit of noise to the dataset too.

In [4]:
# Functions for creating the dataset
def make_dataset(m, func, error):
    X = np.random.random(m)
    y = func(X, error)
    return X.reshape(m, 1), y

def f(x, error=1.0):
    y = 10 - 1 / (x + 0.1)
    if error > 0:
        y = np.random.normal(y, error)
    return y

In [5]:
# Call the functions to create a training set
X_train, y_train = make_dataset(50, f, 1.0)
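For reference, a minimal extra cell (assumed here, not part of the original notes) plots the noise-free target function, so you can see the curve that the noisy training examples are sampled from:

# Hypothetical extra cell: plot the noise-free target function f
xs = np.linspace(0, 1, 200)
plt.xlabel("feature")
plt.ylabel("y")
plt.ylim(-4, 14)
plt.plot(xs, f(xs, error=0.0), color="black")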

In [6]:
# Plot it so you can see what it looks like
plt.xlabel("feature")
plt.ylabel("y")
plt.ylim(-4, 14)
plt.scatter(X_train, y_train, color="green")

In [7]:
vals = np.linspace(-0.1, 1.1, 500).reshape(500, 1)

Fitting a Linear Model to the Data

We'll use OLS Linear Regression to fit a linear model.
And we'll plot the model that it learns.

In [8]:
# Fit the model
linear_estimator = LinearRegression()
linear_estimator.fit(X_train, y_train)

Out[8]:
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

In [9]:
# Get its predictions
y_predicted = linear_estimator.predict(vals)

In [10]:
# Plot the training set and also the predictions made by the model
plt.xlabel("feature")
plt.ylabel("y")
plt.ylim(-4, 14)
plt.scatter(X_train, y_train, color="green")
plt.plot(vals, y_predicted, color="blue")

It's easy to see that a linear model is a poor choice.
It underfits the data: the model is not complex enough; it is too simple to capture the underlying structure of the data.

Fitting a Quadratic Model to the Data

What happens if we try to fit a more complex model such as a quadratic function?

In [11]:
# Fit the model
quadratic_estimator = Pipeline([
    ("polyfeatures", PolynomialFeatures(degree=2, include_bias=False)),
    ("estimator", LinearRegression())
])
quadratic_estimator.fit(X_train, y_train)

Out[11]:
Pipeline(memory=None,
         steps=[('polyfeatures', PolynomialFeatures(degree=2, include_bias=False, interaction_only=False)),
                ('estimator', LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False))])

In [12]:
# Get its predictions
y_predicted = quadratic_estimator.predict(vals)
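As an aside (this cell is not in the original notes), you can see what PolynomialFeatures is doing by transforming a single feature value by hand; with degree=2 and include_bias=False it simply appends the square of the feature:

# Hypothetical illustration of PolynomialFeatures
pf = PolynomialFeatures(degree=2, include_bias=False)
print(pf.fit_transform(np.array([[0.5]])))   # [[0.5  0.25]]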

In [13]:
# Plot the training set and also the predictions made by the model on the validation set
plt.xlabel("feature")
plt.ylabel("y")
plt.ylim(-4, 14)
plt.scatter(X_train, y_train, color="green")
plt.plot(vals, y_predicted, color="blue")

This fits the training data much better: it underfits less.
We could now try a cubic model.
But let's skip all that and try something much more complex.

Fitting a Much Higher Degree Polynomial to the Data

So what happens if we fit a polynomial of degree 30?

In [14]:
poly_estimator = Pipeline([
    ("polyfeatures", PolynomialFeatures(degree=30, include_bias=False)),
    ("estimator", LinearRegression())
])
poly_estimator.fit(X_train, y_train)

Out[14]:
Pipeline(memory=None,
         steps=[('polyfeatures', PolynomialFeatures(degree=30, include_bias=False, interaction_only=False)),
                ('estimator', LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False))])

In [15]:
# Get its predictions
y_predicted = poly_estimator.predict(vals)

In [16]:
# Plot the training set and also the predictions made by the model on the validation set
plt.xlabel("feature")
plt.ylabel("y")
plt.ylim(-4, 14)
plt.scatter(X_train, y_train, color="green")
plt.plot(vals, y_predicted, color="blue")

While a model of this complexity fits the training set really well, it seems clear that this model is a poor choice.
It is not capturing the target function; it is fitting to the noise in the training set.
It overfits the data: the model is too complex relative to the amount of training data and the noisiness of the training data.

Fitting Models of Different Complexities to the Data

We can plot complexity along the x-axis and error on the y-axis.
In our case, we plot the degree of the polynomial along the x-axis.
In fact, we'll plot two lines: training error and validation error.

In [17]:
# I'll make a larger dataset than the one I used above because I want to split this one
# into training and validation sets
X, y = make_dataset(200, f, 1.0)

In [18]:
degrees = np.arange(1, 30)
estimator = Pipeline([
    ("polyfeatures", PolynomialFeatures(include_bias=False)),
    ("estimator", LinearRegression())
])
mses_train, mses_val = validation_curve(
    estimator, X, y, "polyfeatures__degree", degrees, cv=10,
    scoring="neg_mean_squared_error")
mean_mses_train = np.mean(np.abs(mses_train), axis=1)
mean_mses_val = np.mean(np.abs(mses_val), axis=1)
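If you want a single 'best' degree out of this, a minimal sketch (assumed here, not part of the original notes) reads off the degree whose mean validation MSE is lowest, even before we plot the curves:

# Hypothetical follow-up: the degree with the lowest mean validation MSE
best_degree = degrees[np.argmin(mean_mses_val)]
print(best_degree)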

In [19]:
plt.xlabel("degree")
plt.ylabel("mse")
plt.ylim(0, 2)
plt.plot(degrees, mean_mses_train, label="training error", color="red")
plt.plot(degrees, mean_mses_val, label="validation error", color="gold")
plt.legend()

We might get different results each time we run the code, but typically:
- Training error starts high and gets ever lower: the more complex models can wiggle their way through the noise in the data
- Validation error starts high, gets lower, and then grows again (but somewhat erratically)
The simpler models to the left underfit, so validation error (and training error) are high.
The more complex models to the right overfit:
- The training error is low: the more complex models can wiggle their way through the noise in the data
- The validation error is high (but variable): the models don't generalise from the training data to the validation data
Between the two, the complexity is 'just right'.

In [20]:
quartic_estimator = Pipeline([
    ("polyfeatures", PolynomialFeatures(degree=4, include_bias=False)),
    ("estimator", LinearRegression())
])

In summary,
- A model underfits the training set if there is a more complex model with lower validation error
- A model overfits the training set if there is a less complex model with lower validation error

Diagnosis

How do you tell whether a particular model is underfitting or overfitting? We'll look at two methods:
- Compare training error and validation error
- Plot a learning curve

Compare training error and validation error

The simplest method is to compute the training error and validation error:
- If a model has high training error and high validation error, then it is underfitting
- If a model has low training error but high validation error, then it is overfitting

In [21]:
# Underfitting
scores = cross_validate(linear_estimator, X, y, cv=10,
                        scoring="neg_mean_squared_error", return_train_score=True)
print("Training error: ", np.mean(np.abs(scores["train_score"])))
print("Validation error: ", np.mean(np.abs(scores["test_score"])))

Training error:
Validation error:

In [22]:
# Overfitting
scores = cross_validate(poly_estimator, X, y, cv=10,
                        scoring="neg_mean_squared_error", return_train_score=True)
print("Training error: ", np.mean(np.abs(scores["train_score"])))
print("Validation error: ", np.mean(np.abs(scores["test_score"])))

Training error:
Validation error:

In [23]:
# Just right
scores = cross_validate(quartic_estimator, X, y, cv=10,
                        scoring="neg_mean_squared_error", return_train_score=True)
print("Training error: ", np.mean(np.abs(scores["train_score"])))
print("Validation error: ", np.mean(np.abs(scores["test_score"])))

Training error:
Validation error:

Plot a learning curve

Learning curves plot training error and validation error against the number of examples in the training set.
But they are expensive to produce.

In [24]:
train_set_sizes = np.linspace(0.1, 1.0, 10)

In [25]:
# Underfitting
train_sizes, mses_train, mses_val = learning_curve(
    linear_estimator, X, y, train_sizes=train_set_sizes, cv=10,
    scoring="neg_mean_squared_error")
mean_mses_train = np.mean(np.abs(mses_train), axis=1)
mean_mses_val = np.mean(np.abs(mses_val), axis=1)

In [26]:
plt.xlabel("num. training examples")
plt.ylabel("mse")
plt.ylim(0, 3)
plt.plot(train_sizes, mean_mses_train, label="training error", color="purple")
plt.plot(train_sizes, mean_mses_val, label="validation error", color="orange")
plt.legend()

Training error:
- When there are just a few training examples, the model can fit them near perfectly, which is why the curve starts low
- As more examples are used for training, it becomes impossible for the model to fit the data, both because of the noise and because the model isn't complex enough
- The curve goes up and eventually plateaus
Validation error:
- When there are few training examples, the model cannot generalize well, so validation error is high
- As more examples are used for training, the model is better, so validation error comes down
- But, since the model isn't complex enough, eventually validation error plateaus, very close to the training error

In [27]:
# Overfitting
train_sizes, mses_train, mses_val = learning_curve(
    poly_estimator, X, y, train_sizes=train_set_sizes, cv=10,
    scoring="neg_mean_squared_error")
mean_mses_train = np.mean(np.abs(mses_train), axis=1)
mean_mses_val = np.mean(np.abs(mses_val), axis=1)

In [28]:
plt.xlabel("num. training examples")
plt.ylabel("mse")
plt.ylim(0, 3)
plt.plot(train_sizes, mean_mses_train, label="training error", color="purple")
plt.plot(train_sizes, mean_mses_val, label="validation error", color="orange")
plt.legend()

These curves have a similar shape to the case of underfitting, except:
- Training error: this is much lower, because the model can wiggle its way through the noise
- Validation error: there remains a big gap between training error and validation error (although they may get closer if we had even more training examples)

In [29]:
# Just right
train_sizes, mses_train, mses_val = learning_curve(
    quartic_estimator, X, y, train_sizes=train_set_sizes, cv=10,
    scoring="neg_mean_squared_error")
mean_mses_train = np.mean(np.abs(mses_train), axis=1)
mean_mses_val = np.mean(np.abs(mses_val), axis=1)

In [30]:
plt.xlabel("num. training examples")
plt.ylabel("mse")
plt.ylim(0, 3)
plt.plot(train_sizes, mean_mses_train, label="training error", color="purple")
plt.plot(train_sizes, mean_mses_val, label="validation error", color="orange")
plt.legend()

The same kind of shape again.
But, this time, the gap narrows and they should converge.

Solutions

After the diagnosis come the solutions!
If your model underfits:
- Gathering more training examples will not help
- Your main options are:
  - Change model: move to a more complex model
  - Add better features (feature engineering)
  - Stick with your existing model but remove constraints (if you can) to increase its complexity
If your model overfits, your main options are:
- Gather more training data (or use Data Augmentation)
- Remove noise in the training examples
- Change model: move to a less complex model
- Simplify by reducing the number of features
- Stick with your existing model but add constraints (if you can) to reduce its complexity
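For the last option, 'adding constraints' usually means regularization. A minimal sketch, assuming ridge regression as the constraint (this cell is not from the notes; Ridge comes from sklearn.linear_model), reuses the degree-30 features but penalises large coefficients:

# Hypothetical example: constrain the degree-30 model with ridge regularisation
from sklearn.linear_model import Ridge

ridge_estimator = Pipeline([
    ("polyfeatures", PolynomialFeatures(degree=30, include_bias=False)),
    ("estimator", Ridge(alpha=1.0))
])

# Compare its cross-validated errors with those of poly_estimator above
scores = cross_validate(ridge_estimator, X, y, cv=10,
                        scoring="neg_mean_squared_error", return_train_score=True)
print("Training error: ", np.mean(np.abs(scores["train_score"])))
print("Validation error: ", np.mean(np.abs(scores["test_score"])))

Whether this beats the plain degree-4 model depends on the choice of alpha, which you would normally tune on the validation set.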
