CS4618: Artificial Intelligence I
Gradient Descent
Derek Bridge
School of Computer Science and Information Technology
University College Cork

Initialization

In [1]: %load_ext autoreload
        %autoreload 2
        %matplotlib inline

In [2]: import pandas as pd
        import numpy as np
        import matplotlib.pyplot as plt

In [45]: from sklearn.preprocessing import StandardScaler
         from sklearn.preprocessing import add_dummy_feature
         from sklearn.linear_model import SGDRegressor

Gradient Descent for OLS Regression

We saw the basic idea; now, the details.
In fact, there are three variants:
- Batch Gradient Descent
- Stochastic Gradient Descent
- Mini-batch Gradient Descent

Partial Derivatives

We need the gradient of the loss function with respect to each parameter β_j — in other words, how much the loss will change if we change β_j a little. With respect to a particular β_j, this is called the partial derivative.

Without doing the calculus, the partial derivative of J(X, y, β) with respect to β_j is:

$\frac{\partial J(X, y, \beta)}{\partial \beta_j} = \frac{1}{m} \sum_{i=1}^{m} (x^{(i)} \beta - y^{(i)}) x_j^{(i)}$

The gradient vector, $\nabla_{\beta} J(X, y, \beta)$, is a vector of each partial derivative:

$\nabla_{\beta} J(X, y, \beta) = \begin{bmatrix} \frac{\partial J(X, y, \beta)}{\partial \beta_0} \\ \frac{\partial J(X, y, \beta)}{\partial \beta_1} \\ \vdots \\ \frac{\partial J(X, y, \beta)}{\partial \beta_{n-1}} \end{bmatrix}$

And there is a vectorized way to compute it:

$\nabla_{\beta} J(X, y, \beta) = \frac{1}{m} X^T (X\beta - y)$

Gradient Descent, Again

Recap:
- It starts with an initial guess for the values of the parameters
- Then, repeatedly, it updates the parameter values, hopefully to reduce the loss

But now we know how to update the parameter values to reduce the loss:
- Compute the gradient vector
- But this points 'uphill' and we want to go 'downhill'
- And we want to make 'baby steps', so we use the learning rate, α, which is between 0 and 1
- So subtract α times the gradient vector from β:

$\beta \gets \beta - \alpha \nabla_{\beta} J(X, y, \beta)$

or

$\beta \gets \beta - \frac{\alpha}{m} X^T (X\beta - y)$

(BTW, this is vectorized. Naive loop implementations are wrong: they lose the simultaneous update of the β_j — see the sketch after the pseudocode below.)

Batch Gradient Descent

Pseudocode:

    initialize β randomly
    repeat until convergence:
        β ← β − (α/m) X^T (Xβ − y)

Why is it called Batch Gradient Descent? The update involves a calculation over the entire training set X on every iteration. This can be slow for large training sets.
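To make the remark about naive loop implementations concrete, here is a small sketch (not from the original notes) contrasting a sequential, element-by-element update of β with the correct simultaneous update; the function names are illustrative only.

import numpy as np

def wrong_update(X, y, beta, alpha):
    # WRONG: updates beta[j] in place, so the partial derivatives for later j
    # are computed with some components of beta already changed
    m = X.shape[0]
    for j in range(len(beta)):
        partial = X[:, j].dot(X.dot(beta) - y) / m   # uses the half-updated beta
        beta[j] -= alpha * partial
    return beta

def right_update(X, y, beta, alpha):
    # RIGHT: compute the whole gradient vector first, then update every beta[j] at once
    m = X.shape[0]
    gradient = X.T.dot(X.dot(beta) - y) / m
    return beta - alpha * gradient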

Batch Gradient Descent in numpy

For the hell of it, let's implement it ourselves
(We'll be naughty: we'll train on the whole dataset)

In [4]: # Loss function for OLS regression (assumes X contains all 1s in its first column)
        def J(X, y, beta):
            return np.mean((X.dot(beta) - y) ** 2) / 2.0

In [49]: def batch_gradient_descent_for_ols_linear_regression(X, y, alpha, num_iterations):
             m, n = X.shape
             beta = np.random.randn(n)
             Jvals = np.zeros(num_iterations)
             for iter in range(num_iterations):
                 beta -= (1.0 * alpha / m) * X.T.dot(X.dot(beta) - y)
                 Jvals[iter] = J(X, y, beta)
             return beta, Jvals

In [50]: # Use pandas to read the CSV file
         df = pd.read_csv("datasets/dataset_corka.csv")

         # Get the feature-values and the target values
         X_without_dummy_unscaled = df[["flarea", "bdrms", "bthrms"]].values
         y = df["price"].values

         # Scale it
         scaler = StandardScaler()
         X_without_dummy = scaler.fit_transform(X_without_dummy_unscaled)

         # Add the extra column to X
         X = add_dummy_feature(X_without_dummy)

         # Run the Batch Gradient Descent
         beta, Jvals = batch_gradient_descent_for_ols_linear_regression(X, y, alpha=0.03, num_iterations=1000)

         # Display beta
         beta

Out[50]: array([ , , , ])

Bear in mind that the coefficients it finds are on the scaled data.

It's a good idea to plot the values of the loss function against the number of iterations. If its value ever increases, then either:
- the code might be incorrect (I think it's OK!), or
- the value of α is too big and is causing divergence
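As an optional sanity check (not in the original notes), we might compare the coefficients found by Batch Gradient Descent with the closed-form least-squares solution, assuming X and y are still the scaled design matrix and target vector from the cell above:

# Closed-form OLS solution for comparison; beta_closed should be close to beta
beta_closed, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta)         # coefficients from Batch Gradient Descent
print(beta_closed)  # coefficients from the closed-form solution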

In [51]: fig = plt.figure(figsize=(8,6))
         plt.title("$J$ during learning")
         plt.xlabel("Number of iterations")
         plt.xlim(1, Jvals.size)
         plt.ylabel("$J$")
         plt.ylim(3500, 50000)
         xvals = np.linspace(1, Jvals.size, Jvals.size)
         plt.scatter(xvals, Jvals)
         plt.show()

[Plot of $J$ against the number of iterations appears here.]

The algorithm gives us the problem of choosing the number of iterations.

An alternative is to use a very large number of iterations but exit early when the gradient vector becomes tiny: when its norm becomes smaller than a tolerance, η (a sketch of this appears after the next experiment).

Try it without scaling:

In [52]: # Get the feature-values and the target values
         X_without_dummy = df[["flarea", "bdrms", "bthrms"]].values
         y = df["price"].values

         # Add the extra column to X
         X = add_dummy_feature(X_without_dummy)

         # Run the Batch Gradient Descent
         beta, Jvals = batch_gradient_descent_for_ols_linear_regression(X, y, alpha=0.03, num_iterations=4000)

         # Display beta
         beta

C:\Anaconda3\lib\site-packages\ipykernel\__main__.py:3: RuntimeWarning: overflow encountered in square
  app.launch_new_instance()
C:\Anaconda3\lib\site-packages\ipykernel\__main__.py:8: RuntimeWarning: invalid value encountered in subtract

Out[52]: array([ nan, nan, nan, nan])
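Here, as promised, is a sketch (not from the original notes) of the tolerance-based exit: a variant of the batch gradient descent function above that stops as soon as the norm of the gradient vector falls below the tolerance η.

def batch_gradient_descent_with_tolerance(X, y, alpha, max_iterations, tolerance=1e-6):
    m, n = X.shape
    beta = np.random.randn(n)
    for _ in range(max_iterations):
        gradient = X.T.dot(X.dot(beta) - y) / m
        if np.linalg.norm(gradient) < tolerance:
            break                       # the gradient is tiny, so stop early
        beta -= alpha * gradient
    return beta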

How can you get the unscaled version to work?

Some people suggest a variant of Batch Gradient Descent in which the value of α is decreased over time, i.e. its value in later iterations is smaller. Why do they suggest this? And why isn't it necessary? (But, we'll revisit this idea in Stochastic Gradient Descent)

Stochastic Gradient Descent

As we saw, Batch Gradient Descent can be slow on large training sets.

Stochastic Gradient Descent (SGD): on each iteration, it picks just one training example x — one example chosen at random — and computes the gradients on just that example:

$\beta \gets \beta - \alpha x^T (x\beta - y)$

- This gives a huge speed-up.
- It enables us to train on huge training sets, since only one example needs to be in memory in each iteration.
- But, because it is stochastic (the randomness), the loss will not necessarily decrease on each iteration: on average, the loss decreases, but in any one iteration, the loss may go up or down.
- Eventually, it will get close to the minimum, but it will continue to go up and down a bit.
- So, once you stop it, the β will be close to the best, but not necessarily optimal.
- Ironically, if you have a local minimum (which, with OLS regression, we don't), SGD might even escape the local minimum, and might even get to the global minimum.

Simulated Annealing

As we discussed, SGD does not settle at the minimum. One solution is to gradually reduce the learning rate:
- Updates start out 'large', so you make progress and can escape local minima.
- But, over time, updates get smaller, allowing SGD to settle at the global minimum.

The function that determines how to reduce the learning rate is called the learning schedule:
- Reduce it too quickly and you may get stuck in a local minimum, or end up stuck en route to the global minimum.
- Reduce it too slowly and you may bounce around a lot and, if stopped after too few iterations, may end up with a suboptimal solution.

SGD in scikit-learn

The fit method of scikit-learn's SGDRegressor class does what we have described:
- You must scale the features, but it inserts the extra column of 1s for you.
- You can supply a learning_rate and lots of other things (in the code below, we'll just use the defaults).

(In the code below, we'll be naughty: we'll train on the whole dataset)

In [53]: # Use pandas to read the CSV file
         df = pd.read_csv("datasets/dataset_corka.csv")

         # Get the feature-values and the target values
         X_unscaled = df[["flarea", "bdrms", "bthrms"]].values
         y = df["price"].values

         # Scale it
         scaler = StandardScaler()
         X = scaler.fit_transform(X_unscaled)

         # Create the SGDRegressor and fit the model
         sgd = SGDRegressor()
         sgd.fit(X, y)

Out[53]: SGDRegressor(alpha=0.0001, average=False, epsilon=0.1, eta0=0.01,
                      fit_intercept=True, l1_ratio=0.15, learning_rate='invscaling',
                      loss='squared_loss', n_iter=5, penalty='l2', power_t=0.25,
                      random_state=None, shuffle=True, verbose=0, warm_start=False)

SGD in numpy

For the hell of it, let's implement a simple version ourselves
(Again, we'll be naughty: we'll train on the whole dataset)

In [56]: def stochastic_gradient_descent_for_ols_linear_regression(X, y, alpha, num_epochs):
             m, n = X.shape
             beta = np.random.randn(n)
             Jvals = np.zeros(num_epochs * m)
             for epoch in range(num_epochs):
                 for i in range(m):
                     rand_idx = np.random.randint(m)
                     xi = X[rand_idx:rand_idx + 1]
                     yi = y[rand_idx:rand_idx + 1]
                     beta -= alpha * xi.T.dot(xi.dot(beta) - yi)
                     Jvals[epoch * m + i] = J(X, y, beta)
             return beta, Jvals

In [57]: # Use pandas to read the CSV file
         df = pd.read_csv("datasets/dataset_corka.csv")

         # Get the feature-values and the target values
         X_without_dummy_unscaled = df[["flarea", "bdrms", "bthrms"]].values
         y = df["price"].values

         # Scale it
         scaler = StandardScaler()
         X_without_dummy = scaler.fit_transform(X_without_dummy_unscaled)

         # Add the extra column to X
         X = add_dummy_feature(X_without_dummy)

         # Run the Stochastic Gradient Descent
         beta, Jvals = stochastic_gradient_descent_for_ols_linear_regression(X, y, alpha=0.03, num_epochs=50)

         # Display beta
         beta

Out[57]: array([ , , , ])

In [58]: fig = plt.figure(figsize=(8,6))
         plt.title("$J$ during learning")
         plt.xlabel("Number of iterations")
         plt.xlim(1, Jvals.size)
         plt.ylabel("$J$")
         plt.ylim(3500, 50000)
         xvals = np.linspace(1, Jvals.size, Jvals.size)
         plt.scatter(xvals, Jvals)
         plt.show()

[Plot of $J$ against the number of iterations appears here.]

Quite a bumpy ride! So, let's try simulated annealing.

In [59]: def learning_schedule(t):
             return 5 / (t + 50)

         def stochastic_gradient_descent_for_ols_linear_regression(X, y, num_epochs):
             m, n = X.shape
             beta = np.random.randn(n)
             Jvals = np.zeros(num_epochs * m)
             for epoch in range(num_epochs):
                 for i in range(m):
                     rand_idx = np.random.randint(m)
                     xi = X[rand_idx:rand_idx + 1]
                     yi = y[rand_idx:rand_idx + 1]
                     alpha = learning_schedule(epoch * m + i)
                     beta -= alpha * xi.T.dot(xi.dot(beta) - yi)
                     Jvals[epoch * m + i] = J(X, y, beta)
             return beta, Jvals

In [60]: # Use pandas to read the CSV file
         df = pd.read_csv("datasets/dataset_corka.csv")

         # Get the feature-values and the target values
         X_without_dummy_unscaled = df[["flarea", "bdrms", "bthrms"]].values
         y = df["price"].values

         # Scale it
         scaler = StandardScaler()
         X_without_dummy = scaler.fit_transform(X_without_dummy_unscaled)

         # Add the extra column to X
         X = add_dummy_feature(X_without_dummy)

         # Run the Stochastic Gradient Descent
         beta, Jvals = stochastic_gradient_descent_for_ols_linear_regression(X, y, num_epochs=50)

         # Display beta
         beta

Out[60]: array([ e+02, e+02, e+01, e-01])
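As an aside (not in the original notes), we can visualize how this schedule decays the learning rate over the course of the run; a minimal sketch, assuming the learning_schedule function and the Jvals array from the cells above:

ts = np.arange(Jvals.size)                       # one value of t per update
alphas = np.array([learning_schedule(t) for t in ts])

fig = plt.figure(figsize=(8, 6))
plt.title("Learning rate during learning")
plt.xlabel("Number of iterations")
plt.ylabel(r"$\alpha$")
plt.plot(ts, alphas)
plt.show()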

In [61]: fig = plt.figure(figsize=(8,6))
         plt.title("$J$ during learning")
         plt.xlabel("Number of iterations")
         plt.xlim(1, Jvals.size)
         plt.ylabel("$J$")
         plt.ylim(3500, 50000)
         xvals = np.linspace(1, Jvals.size, Jvals.size)
         plt.scatter(xvals, Jvals)
         plt.show()

[Plot of $J$ against the number of iterations appears here.]

Mini-Batch Gradient Descent

Batch Gradient Descent computed gradients from the full training set. Stochastic Gradient Descent computed gradients from just one example. Mini-Batch Gradient Descent lies between the two: it computes gradients from a small randomly-selected subset of the training set, called a mini-batch (see the sketch below).

Since it lies between the two:
- It may bounce less and get closer to the global minimum than SGD, although both of them can reach the global minimum with a good learning schedule.
- But it may be harder to escape local minima, if you have them (which, for OLS, we don't).
- And its time and memory costs lie between the two.
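The original notes do not include an implementation, but here is a minimal sketch of Mini-Batch Gradient Descent in the same style as the functions above, assuming the loss function J and the imports from earlier cells; the function name and the batch_size parameter are illustrative only:

def mini_batch_gradient_descent_for_ols_linear_regression(X, y, alpha, num_epochs, batch_size=32):
    m, n = X.shape
    beta = np.random.randn(n)
    Jvals = []
    for epoch in range(num_epochs):
        shuffled = np.random.permutation(m)            # visit the examples in a random order
        for start in range(0, m, batch_size):
            idx = shuffled[start:start + batch_size]
            Xb, yb = X[idx], y[idx]
            # gradient computed on the mini-batch only
            beta -= (alpha / len(idx)) * Xb.T.dot(Xb.dot(beta) - yb)
            Jvals.append(J(X, y, beta))
    return beta, np.array(Jvals)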

The Normal Equation versus Gradient Descent

Efficiency/scaling-up:
- Normal Equation: it is linear in m, so it can handle large training sets efficiently if they fit into main memory; but it has to compute the inverse (or pseudo-inverse) of an n × n matrix, which takes time between quadratic and cubic in n, and so is only feasible for smallish n (up to a few thousand).
- Gradient Descent: SGD scales really well to huge m, and all three Gradient Descent methods can handle huge n (even 100s of 1000s).

Finding the global minimum for OLS regression:
- Normal Equation: guaranteed to find the global minimum.
- Gradient Descent: all a bit dependent on the number of iterations, the learning rate and the learning schedule.

Feature scaling:
- Normal Equation: scaling is not needed. (In fact, I find that scikit-learn's LinearRegression class produces weird results if I do any scaling. I don't know why. So don't do it!)
- Gradient Descent: scaling is needed.

Finally, Gradient Descent is a general method, whereas the Normal Equation is only for OLS regression.

Logistic Regression

So what about classification using logistic regression?
- We have a different loss function (cross-entropy)
- Happily, it is convex
- But there is no equivalent to the Normal Equation, so we must use Gradient Descent

Not that it matters, but here is the partial derivative of its loss function with respect to β_j (binary classification), where σ is the sigmoid function:

$\frac{\partial J}{\partial \beta_j} = \frac{1}{m} \sum_{i=1}^{m} (\sigma(x^{(i)} \beta) - y^{(i)}) x_j^{(i)}$

scikit-learn has the class LogisticRegression, but also SGDClassifier if you want more control (a small sketch appears at the end of these notes).

After Christmas

Here endeth CS4618. What will we do in CS4619?
- We will study some more complex models (i.e. non-linear ones)
- We will study underfitting and overfitting, and solutions to these
- This will lead into Neural Networks
- From there, we will study so-called Deep Learning for regression and classification, including for images
- We will generalize to problems such as sequence-to-vector, vector-to-sequence and sequence-to-sequence, such as machine translation, speech recognition, ...
- We will revisit Reinforcement Learning
- We will consider knowledge representation and reasoning

It'll be tough but brilliant.

In [ ]:
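As a closing aside (not part of the original notes), here is a minimal sketch of the logistic-regression option mentioned above, using scikit-learn's SGDClassifier; the toy feature matrix and labels are made up for illustration, and depending on your scikit-learn version the loss may be spelled "log" or "log_loss":

from sklearn.linear_model import SGDClassifier
from sklearn.preprocessing import StandardScaler
import numpy as np

# Made-up toy data: 6 examples, 2 features, binary labels
X_toy = np.array([[1.0, 2.0], [2.0, 1.0], [2.5, 3.0], [6.0, 5.0], [7.0, 8.0], [8.0, 6.5]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

# As with SGDRegressor, the features should be scaled
X_toy_scaled = StandardScaler().fit_transform(X_toy)

# loss="log" selects logistic regression trained by stochastic gradient descent
clf = SGDClassifier(loss="log")
clf.fit(X_toy_scaled, y_toy)
print(clf.predict(X_toy_scaled))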
