
EE104, Spring 2017-2018    S. Boyd & S. Lall

Homework 3

1. Predicting power demand. Power utilities need to predict the power demanded by consumers. A possible predictor is an auto-regressive (AR) model, which predicts power demand based on a history of power demand data, $p_t$ for $t = 1, \dots, T$. Here, $p_t$ is the power usage in kilowatt hours (kWh) during time interval $t$. The AR predictor uses standardized power data $z_t$ to predict future power demands as an affine function of historical power data. In particular, we set
$$x_i = (1, z_{i+(h-1)}, \dots, z_i), \qquad y_i = z_{h+i}, \qquad i = 1, \dots, T-h,$$
and predict future standardized power demands $\hat z_{h+i} = \hat y_i = \theta^T x_i$. You will select the parameter vector $\theta \in \mathbf{R}^{h+1}$ with ridge regression.

The data in power_demand_data.json contains p, the hourly electric power demand for California between July 2015 and December 2017 (the times corresponding to the power demands can be found in dates). Carry out ridge regression for the AR problem on this data by standardizing the data, computing the data records $x_i$ and targets $y_i$, and then solving the ridge regression problem. Although more sophisticated validation techniques are possible, use the first 17,000 data points (approximately two years' worth of data) to set the model parameters, and validate the regularization weight λ using the remaining records, with h = 336 (corresponding to two weeks of data). After selecting a reasonable value of λ, compute your final θ by ridge regression on the entire dataset. Provide a plot of validation RMSE versus λ, and a plot of the components of your optimal θ. Comment briefly on the results.

Solution.

(a) The following code imports the data and standardizes the power demand vector, called U in the code:

include("readclassjson.jl")

data = readclassjson("power_demand_data.json")
U = data["p"]
U -= mean(U)                     # zero mean
U /= norm(U)/sqrt(length(U))     # unit RMS value

(b) The data matrix can be written row-by-row as
$$x_i = (z_{i+M}, z_{i+M-1}, z_{i+M-2}, \dots, z_i), \qquad i = 1, \dots, T-M-1,$$
with $y_i = z_{i+M+1}$. Therefore $n = T - M - 1$.
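For reference, the ridge regression step in part (c) below is carried out as a single least-squares solve with a stacked coefficient matrix; this standard identity is recorded here because the code relies on it (the intercept $\theta_1$ is left unregularized, matching the code):
$$\|X\theta - y\|_2^2 + \lambda \sum_{j=2}^{M+1} \theta_j^2 \;=\; \left\| \begin{bmatrix} X \\ \sqrt{\lambda}\,\begin{bmatrix} 0 & I_M \end{bmatrix} \end{bmatrix} \theta - \begin{bmatrix} y \\ 0 \end{bmatrix} \right\|_2^2,$$
so minimizing the ridge objective over θ is the same as solving the stacked least-squares problem, which is exactly what the expression [X_train; zeros(M) sqrt(poss_lambda[idx])*eye(M)] \ [y_train; zeros(M)] in the code computes.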

(c) The plot of the train and test errors is found below.

[Figure: train and test RMSE versus λ, plotted on a logarithmic λ axis from 10^-3 to 10^4.]

We found that λ ≈ 0.38 yielded the lowest test error. The code for fitting this follows.

using PyPlot
include("readclassjson.jl")

data = readclassjson("power_demand_data.json")
U = data["p"]
U -= mean(U)
U /= norm(U)/sqrt(length(U))

M = 336
n = size(U,1) - M
X = zeros(n, M+1)
y = zeros(n)

n_test = 365*24
n_train = n - n_test

# Build the data matrix and target vector
for i=1:n
    X[i,:] = [1; U[i:i+M-1]]
    y[i] = U[i+M]
end
println("finished constructing matrix")

X_train = X[1:n_train,:]
y_train = y[1:n_train]
X_test = X[n_train+1:end,:]
y_test = y[n_train+1:end]

n_lambda = 20
poss_lambda = logspace(-3, 4, 20)
train_err = zeros(n_lambda)
test_err = zeros(n_lambda)

best_test_rmse = Inf
best_test_theta = nothing

for idx=1:n_lambda
    # Ridge regression as a stacked least-squares problem (the intercept is not regularized)
    theta = [X_train; zeros(M) sqrt(poss_lambda[idx])*eye(M)] \ [y_train; zeros(M)]
    train_err[idx] = norm(X_train * theta - y_train)/sqrt(n_train)
    test_err[idx] = norm(X_test * theta - y_test)/sqrt(n_test)
    println("train $(train_err[idx]) and test $(test_err[idx]) for lambda = $(poss_lambda[idx])")
    if test_err[idx] < best_test_rmse
        best_test_rmse = test_err[idx]
        best_test_theta = theta
    end
end

semilogx(poss_lambda, train_err, label="train error")
semilogx(poss_lambda, test_err, label="test error")
xlabel("lambda")
ylabel("RMSE")
legend()
savefig("predict_demand_rmse.pdf")
close()

figure()
plot(1:M+1, reverse(best_test_theta))
xlabel("hours")
ylabel("theta")
savefig("predict_demand_theta.pdf")
show()
close()

(d) Plotting the parameter θ gives the figure below.

[Figure: components of the optimal θ plotted against the lag in hours (0 to 350).]

The top 10 variables (without the constant) are

1, 2, 24, 26, 25, 168, 170, 23, 167, 169.

These make some sense. The two most important predictors are the previous two hours, followed by the energy usage at the same time of day in the previous 24 hours ±1 hour. Immediately after, we see that the next most important features are those corresponding to the usage over the same period 168/24 = 7 days ago (i.e., a week ago). The rest of the features are explained similarly.

2. Least squares gradient descent. Gradient descent is an iterative method that can be used to minimize the average loss of a parameter vector on a dataset. In this problem, we consider the least squares linear regression problem, where the empirical risk is the mean-square error,
$$L(\theta) = \frac{1}{n}\sum_{i=1}^{n} (\theta^T x_i - y_i)^2,$$
with parameter vector $\theta \in \mathbf{R}^d$.

(a) Find $\nabla L(\theta)$ and implement gradient(X, y, theta), which returns $\nabla L(\theta)$ from an input data matrix X, target vector y, and current value of the parameter vector theta.

(b) Using the function you wrote in part (a), implement gradient descent for the least squares linear regression problem. Generate data with the code below and experiment with various choices of the parameters $\theta^1$, $h^1$, $\epsilon$, and $k^{\max}$.

srand(1234)
n, d = 200, 10
X = randn(n, d)
y = X * randn(d) + .1*randn(n)

On the data above, report θ and a plot of $L(\theta^k)$ versus iteration.

(c) Using the data in part (b), compute θ using X\y. How does it compare to your iterative solution?

Solution.

(a) Writing the ERM loss in vectorized form,
$$L(\theta) = \frac{1}{n}\sum_{i=1}^{n} (\theta^T x_i - y_i)^2 = \frac{1}{n}\|X\theta - y\|_2^2.$$
We know the derivative of this from EE103, which is just
$$\nabla L(\theta) = \frac{2}{n} X^T(X\theta - y).$$
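Part (a) asks for a routine gradient(X, y, theta); the listings below inline this computation, but a standalone version follows directly from the formula above (a minimal sketch):

# Returns the gradient of L(theta) = ||X*theta - y||^2 / n.
function gradient(X, y, theta)
    n = size(X, 1)
    return 2*X'*(X*theta - y)/n
end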

(b) The following function implements the gradient method in the quadratic loss case.

function quadratic_grad_method(X, y, theta_init)
    curr_theta = theta_init
    k_max = 300
    eps_min = .01
    step_size = 1
    for _ = 1:k_max
        val_f = norm(X * curr_theta - y)^2/n
        grad_f = step_size*(2*X'*(X*curr_theta - y)/n)   # gradient of L, scaled by the current step size
        # Stop if small gradient
        if norm(grad_f) < 1e-5
            break
        end
        new_theta = curr_theta - step_size*grad_f
        # Accept the step and grow the step size if the loss did not increase; otherwise shrink it
        if norm(X * new_theta - y)^2/n <= val_f
            step_size *= 1.2
            curr_theta = new_theta
        else
            step_size /= 2
        end
    end
    return curr_theta
end

(c) In this case, we find that our quadratic_grad_method implementation terminates in around 14 steps, and the difference between the loss computed with the gradient method and the one computed with the usual least squares approach is around $4 \times 10^{-11}$. The complete code for this problem is

srand(1234)
n, d = 200, 10
X = randn(n, d)
y = X * randn(d) + .1*randn(n)

function quadratic_grad_method(X, y, theta_init)
    curr_theta = theta_init
    k_max = 300
    eps_min = .01
    step_size = 1
    for _ = 1:k_max
        val_f = norm(X * curr_theta - y)^2/n
        println("current val : $(val_f)")
        grad_f = step_size*(2*X'*(X*curr_theta - y)/n)
        # Stop if small gradient
        if norm(grad_f) < 1e-5
            break
        end
        new_theta = curr_theta - step_size*grad_f
        if norm(X * new_theta - y)^2/n <= val_f
            step_size *= 1.2
            curr_theta = new_theta
        else
            step_size /= 2
        end
    end
    return curr_theta
end

ls_theta = X\y
gm_theta = quadratic_grad_method(X, y, zeros(d))

loss_ls = norm(X*ls_theta - y)^2/n
loss_gm = norm(X*gm_theta - y)^2/n
println("difference of losses is $(loss_gm - loss_ls)")
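Part (b) also asks for a plot of $L(\theta^k)$ versus iteration. The function above only prints the loss at each step; a minimal sketch of a variant that records the losses and plots them is given below (the name quadratic_grad_method_losses and the extra return value are additions for illustration; it otherwise mirrors the code above and relies on the globals n, X, y, d defined there):

using PyPlot

# Same gradient method as above, but records L(theta^k) at every iteration.
function quadratic_grad_method_losses(X, y, theta_init)
    curr_theta = theta_init
    k_max = 300
    step_size = 1
    losses = Float64[]
    for _ = 1:k_max
        val_f = norm(X * curr_theta - y)^2/n
        push!(losses, val_f)
        grad_f = step_size*(2*X'*(X*curr_theta - y)/n)
        if norm(grad_f) < 1e-5
            break
        end
        new_theta = curr_theta - step_size*grad_f
        if norm(X * new_theta - y)^2/n <= val_f
            step_size *= 1.2
            curr_theta = new_theta
        else
            step_size /= 2
        end
    end
    return curr_theta, losses
end

gm_theta, losses = quadratic_grad_method_losses(X, y, zeros(d))
semilogy(1:length(losses), losses)
xlabel("iteration")
ylabel("loss")
savefig("grad_method_loss.pdf")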