EE104, Spring 2017-2018
S. Boyd & S. Lall
Homework 3

1. Predicting power demand. Power utilities need to predict the power demanded by consumers. A possible predictor is an auto-regressive (AR) model, which predicts power demand based on a history of power demand data, p_t for t = 1, ..., T. Here, p_t is the power usage in kilowatt hours (kWh) during time interval t. The AR predictor uses standardized power data z_t to predict future power demands as an affine function of historical power data. In particular, we set

    x_i = (1, z_{i+(h-1)}, ..., z_i),    y_i = z_{h+i},    i = 1, ..., T - h,

and predict future standardized power demands as ẑ_{h+i} = ŷ_i = θ^T x_i. You will select the parameter vector θ ∈ R^{h+1} with ridge regression.

The data in power_demand_data.json contains p, the hourly electric power demands for California between July 2015 and December 2017 (the times corresponding to the power demands can be found in dates). Carry out ridge regression for the AR problem on this data by standardizing the data, computing the data records x_i and targets y_i, and then solving the ridge regression problem. Although more sophisticated validation techniques are possible, use the first 17,000 data points (approximately two years' worth of data) to set the model parameters, and validate the regularization weight λ using the remaining records, with h = 336 (corresponding to two weeks of data). After selecting a reasonable value of λ, compute your final θ by ridge regression on the entire dataset. Provide a plot of validation RMSE versus λ, and a plot of the components of your optimal θ. Comment briefly on the results.

Solution.

(a) The following code imports the data and standardizes the power vector p, called U.

include("readclassjson.jl")

data = readclassjson("power_demand_data.json")
U = data["p"]
U -= mean(U)
U /= norm(U)/sqrt(length(U))

(b) With M = h = 336, the data matrix can be written row-by-row as

    x_i = (1, z_{i+M-1}, z_{i+M-2}, ..., z_i),    i = 1, ..., T - M,

with targets y_i = z_{i+M}. Therefore n = T - M.
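For reference, here is why the ridge solution can be computed with a single backslash solve, as in the fitting code below. Leaving the constant coefficient θ_1 unregularized, the ridge problem

    minimize  ||Xθ − y||_2^2 + λ ||θ_{2:M+1}||_2^2

is equivalent to the ordinary least squares problem with stacked data,

    minimize  || [X; [0 √λ I]] θ − [y; 0] ||_2^2,

where I is the M × M identity and 0 denotes an M-vector of zeros prepended as a column, so that θ_1 is not penalized. The code in part (c) forms exactly this stacked system and solves it with \.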
(c) The plot of the train and test errors is found below.

[Figure: train and test RMSE versus λ, with λ on a logarithmic axis from 10^{-3} to 10^4; RMSE ranges from roughly 0.05 to 0.225.]

We found that λ ≈ 0.38 yielded the lowest test error. The code for fitting this follows.

using PyPlot

include("readclassjson.jl")

data = readclassjson("power_demand_data.json")
U = data["p"]
U -= mean(U)
U /= norm(U)/sqrt(length(U))

M = 336
n = size(U, 1) - M
X = zeros(n, M+1)
y = zeros(n)

n_test = 365*24
n_train = n - n_test
for i = 1:n
    X[i,:] = [1; U[i:i+M-1]]
    y[i] = U[i+M]
end
println("finished constructing matrix")

X_train = X[1:n_train, :]
y_train = y[1:n_train]
X_test = X[n_train+1:end, :]
y_test = y[n_train+1:end]

n_lambda = 20
poss_lambda = logspace(-3, 4, n_lambda)

train_err = zeros(n_lambda)
test_err = zeros(n_lambda)

best_test_rmse = Inf
best_test_theta = nothing

for idx = 1:n_lambda
    # Ridge regression as stacked least squares; the constant is not regularized
    theta = [X_train; zeros(M) sqrt(poss_lambda[idx])*eye(M)] \ [y_train; zeros(M)]
    train_err[idx] = norm(X_train * theta - y_train)/sqrt(n_train)
    test_err[idx] = norm(X_test * theta - y_test)/sqrt(n_test)
    println("train $(train_err[idx]) and test $(test_err[idx]) for lambda = $(poss_lambda[idx])")
    if test_err[idx] < best_test_rmse
        best_test_rmse = test_err[idx]
        best_test_theta = theta
    end
end

semilogx(poss_lambda, train_err, label="train error")
semilogx(poss_lambda, test_err, label="test error")
xlabel("lambda")
ylabel("rmse")
legend()
savefig("predict_demand_rmse.pdf") close() figure() plot(1:m+1, reverse(best_test_theta)) xlabel("hours") ylabel("theta") savefig("predict_demand_theta.pdf") show() close() (d) Plotting the parameter θ gives the figure below 1.50 1.25 1.00 0.75 theta 0.50 0.25 0.00 0.25 0.50 0 50 100 150 200 250 300 350 hours We can see the top 10 variables (without the constant) as 1 2 24 26 25 168 170 23 167 169 4
These lags make some sense. The two most important predictors are the previous two hours, followed by the energy usage at the same time of day 24 ± 1 hours earlier. Immediately after, we can see that the next most important features are those corresponding to the usage over the same period 168/24 = 7 days ago (i.e., a week ago). The rest of the features are explained similarly.

2. Least squares gradient descent. Gradient descent is an iterative method that can be used to minimize the average loss of a parameter vector on a dataset. In this problem, we consider the least squares linear regression problem, where the empirical risk is the mean-square error,

    L(θ) = (1/n) ∑_{i=1}^n (θ^T x_i − y_i)^2,

with parameter vector θ ∈ R^d.

(a) Find ∇L(θ) and implement gradient(X, y, theta), which returns ∇L(θ) from an input data matrix X, target vector y, and current value of the parameter vector theta.

(b) Using the function you wrote in part (a), implement gradient descent for the least squares linear regression problem. Generate data with the code below and experiment with various choices of the parameters θ^1, h^1, ε, and k^max.

srand(1234)
n, d = 200, 10
X = randn(n, d)
y = X * randn(d) + .1*randn(n)

On the data above, report θ and a plot of L(θ^k) versus iteration.

(c) Using the data in part (b), compute θ using X\y. How does it compare to your iterative solution?

Solution.

(a) Writing the ERM loss in vectorized form,

    L(θ) = (1/n) ∑_{i=1}^n (θ^T x_i − y_i)^2 = (1/n) ||Xθ − y||_2^2.

We know the derivative of this from EE103; it is

    ∇L(θ) = (2/n) X^T (Xθ − y).
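In code, the gradient function requested in part (a) is a one-liner; as a quick sanity check (added here as a sketch, not part of the original solution), it can be compared against a central finite-difference approximation of L.

# gradient(X, y, theta) returns (2/n) X'(Xθ - y), the gradient of the
# mean-square error; the finite-difference comparison below is a sketch.
srand(1234)
n, d = 200, 10
X = randn(n, d)
y = X * randn(d) + .1*randn(n)

L(theta) = norm(X*theta - y)^2/n
gradient(X, y, theta) = 2 * X' * (X*theta - y) / n

theta0 = randn(d)
g = gradient(X, y, theta0)
g_fd = zeros(d)
delta = 1e-6
for j = 1:d
    e = zeros(d); e[j] = 1.0
    g_fd[j] = (L(theta0 + delta*e) - L(theta0 - delta*e))/(2*delta)
end
println("max gradient error: $(maximum(abs.(g - g_fd)))")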
(b) The following function implements the gradient method in the quadratic loss case.

function quadratic_grad_method(X, y, theta_init)
    n = size(X, 1)
    curr_theta = theta_init
    k_max = 300
    eps_min = 1e-5
    step_size = 1.0
    for _ = 1:k_max
        val_f = norm(X * curr_theta - y)^2/n
        grad_f = 2 * X' * (X * curr_theta - y) / n
        # Stop if the gradient is small
        if norm(grad_f) < eps_min
            break
        end
        new_theta = curr_theta - step_size*grad_f
        # Accept the step and increase the step size if the loss does not
        # increase; otherwise reject it and halve the step size
        if norm(X * new_theta - y)^2/n <= val_f
            step_size *= 1.2
            curr_theta = new_theta
        else
            step_size /= 2
        end
    end
    return curr_theta
end

(c) In this case, we find that our quadratic_grad_method implementation terminates in around 14 steps, and the difference between the loss computed with the gradient method and the one computed with the usual least squares approach is around 4 × 10^{-11}. The complete code for this problem is

srand(1234)
n, d = 200, 10
X = randn(n, d)
y = X * randn(d) + .1*randn(n)

function quadratic_grad_method(X, y, theta_init)
    n = size(X, 1)
    curr_theta = theta_init
    k_max = 300
    eps_min = 1e-5
    step_size = 1.0
    for _ = 1:k_max
        val_f = norm(X * curr_theta - y)^2/n
        println("current val : $(val_f)")
        grad_f = 2 * X' * (X * curr_theta - y) / n
        # Stop if the gradient is small
        if norm(grad_f) < eps_min
            break
        end
        new_theta = curr_theta - step_size*grad_f
        if norm(X * new_theta - y)^2/n <= val_f
            step_size *= 1.2
            curr_theta = new_theta
        else
            step_size /= 2
        end
    end
    return curr_theta
end

ls_theta = X\y
gm_theta = quadratic_grad_method(X, y, zeros(d))

loss_ls = norm(X*ls_theta - y)^2/n
loss_gm = norm(X*gm_theta - y)^2/n

println("difference of losses is $(loss_gm - loss_ls)")
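The problem also asks for a plot of L(θ^k) versus iteration, which the code above does not produce. Below is a minimal sketch of one way to generate it, using a hypothetical variant quadratic_grad_method_hist (not part of the original solution) that additionally records the loss at each iteration.

using PyPlot

# Sketch: same gradient method as above, but returning the sequence of
# per-iteration losses so they can be plotted against the iteration count.
function quadratic_grad_method_hist(X, y, theta_init)
    n = size(X, 1)
    curr_theta = theta_init
    step_size = 1.0
    losses = Float64[]
    for _ = 1:300
        val_f = norm(X * curr_theta - y)^2/n
        push!(losses, val_f)
        grad_f = 2 * X' * (X * curr_theta - y) / n
        if norm(grad_f) < 1e-5
            break
        end
        new_theta = curr_theta - step_size*grad_f
        if norm(X * new_theta - y)^2/n <= val_f
            step_size *= 1.2
            curr_theta = new_theta
        else
            step_size /= 2
        end
    end
    return curr_theta, losses
end

theta, losses = quadratic_grad_method_hist(X, y, zeros(d))
semilogy(1:length(losses), losses)
xlabel("iteration")
ylabel("L(theta)")
savefig("grad_method_loss.pdf")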