Computational Intelligence (CS) SS15 Homework 1 Linear Regression and Logistic Regression

Size: px

Start display at page:

Download "Computational Intelligence (CS) SS15 Homework 1 Linear Regression and Logistic Regression"

Joleen Carroll
5 years ago
Views:

1 Computational Intelligence (CS) SS15 Homework 1 Linear Regression and Logistic Regression Anand Subramoney Points to achieve: 17 pts Extra points: 5* pts Info hour: :00-15:00, HS i12 (ICK1130H) Deadline: :00 Hand-in mode: Hand in a hard-copy of your results at the IGI hand-in boxes (black/white) (Inffeldgasse 16b, 1st floor) on between 9:00 and 17:00. Use the cover sheet from the website. Submit your matlab files and a colored version of your results via TeachCenter file upload (MyFiles). Detailed hand-in instructions: Newsgroup: tu-graz.lv.ew Contents 1 Linear Regression Linear Regression with polynomial features [4 points] Linear Regression with radial basis functions [5 points] Optional: Derivation of Regularized Linear Regression [3* points] Logistic Regression Logistic Regression training with gradient descent and fminunc [6 points + 2* points] Derivation of Gradient [2 points] General remarks Your submission will be graded based on... The correctness of your results (Is your code doing what it should be doing? Are your plots consistent with what algorithm XY should produce for the given task? Is your derivation of formula XY correct?) The depth and correctness of your interpretations (Keep your interpretations as short as possible, but as long as necessary to convey your ideas) The quality of your plots (Is everything important clearly visible in the print-out, are axes labeled,...?) Matlab hints... Your submission should be runnable at least on Matlab R2014b (available on the TU Graz matlab server) Use the Matlab implementation of pseudo-inverse (pinv) 1

2 2 1 Linear Regression 1.1 Linear Regression with polynomial features [4 points] Download hw1_linreg.zip from the course website and unzip. data.mat contains training, validation and test sets of a regression problem with a single input dimension. Load the data file and plot data.y_train vs. data.x_train to see the non-linear structure of the data. Your task is to fit polynomes of different degrees to the training data, and to identify the degree which gives the best predictions on the validation set (performing simple model selection). The features that should be used for linear regression are φ 0 = 1, φ 1 = x, φ 2 = x 2,..., φ n = x n, (1) where x corresponds to the scalar input, and n is the polynome degree. The following should be included in your report: Plot your results for each of the following degrees: 1, 2, 5, 20. Use the provided script poly_plot_results.m to create these plots. Obtain training/validation/test errors for each n {1, 2,..., 29, 30}. Make a plot with the polynome degree on the horizontal axis and the training/validation/test error on the vertical axis. For each of the three errors plot a separate line (in a different color). See poly_main_model_selection.m. Report which degree {1,..., 30} gives the lowest training error. Plot the results for that degree. Report the test error. Report which degree {1,..., 30} gives the lowest validation error. Plot the results for that degree. Report the test error. Briefly describe and discuss your findings in your own words. The scripts poly_*.m contain a skeleton of functionality for carrying out this task. Find the TODOs and implement the missing functionality as required. 1.2 Linear Regression with radial basis functions [5 points] Using the same data set, your task is to fit hypotheses with different numbers of radial basis functions to the training data, and to perform model selection. Let l be the number of RBFs. For a given l the RBF centers should be chosen to uniformly span the whole input range [ 1, 1], i.e. cs = linspace(-1,1,l). The width of the basis functions should be set to σ = 2/l, i.e. with a higher l RBFs should be narrower (implementing a finer resolution). The features used for linear regression should be φ 0 = 1, φ 1 = exp( (x c 1 ) 2 /(2σ 2 )),..., φ l = exp( (x c l ) 2 /(2σ 2 )) (2) where x corresponds to the scalar input and c j is the center of RBF j. The following should be included in your report: Plot your results for each of the following numbers of RBFs: 2, 3, 5, 30. Use rbf_plot_results.m to create these plots. Obtain training/validation/test errors for each l {1, 2,..., 39, 40}. Make a plot with the number of RBFs l on the horizontal axis and the training/validation/test error on the vertical axis. For each of the three errors plot a separate line in a different color. Report which number of RBFs {1,..., 40} gives the lowest training error. Plot the results for that degree. Report the test error.

3 1.3 Optional: Derivation of Regularized Linear Regression [3* points] 3 Report which number of RBFs {1,..., 40} gives the lowest validation error. Plot the results for that degree. Report the test error. Briefly describe and discuss your findings in your own words. The scripts rbf_*.m provide a basic structure for carrying out this task. Find the TODOs and implement the missing functionality as required. 1.3 Optional: Derivation of Regularized Linear Regression [3* points] Given the design matrix X (m rows, n + 1 columns), output vector y (m rows) and parameter vector θ (n + 1 rows), the mean squared error cost function for linear regression can be written compactly as, J(θ) = 1 m Xθ y 2. (3) The notation refers to the Euclidean norm. The optimal parameters which minimize this cost function are given by, θ = (X T X) 1 X T y (4) The derivation of these optimal parameters using matrix calculus is sketched below: J(θ) θ = 2 m (Xθ y)t (Xθ y) θ (5) = 2 m (Xθ y)t X! = 0 T (6) X T (Xθ y) = 0 (7) X T Xθ X T y = 0 (8) X T Xθ = X T y (9) θ = (X T X) 1 X T y (10) The bold-face 0 denotes a column vector of n + 1 zeros. Consider the following slightly extended, regularized cost function: J(θ) = 1 m Xθ y 2 + λ m θ 2 (11) The additional term penalizes parameters of large magnitudes. The constant λ 0 controls the influence of this penalization (low λ weak regularization, high λ strong regularization). Such regularized cost functions are often used for linear regression (and other learning models) instead of (3) because regularization drastically reduces overfitting effects that occur when the number of features is large. Show that the optimal parameters which minimize the regularized linear regression cost (11) are given by, θ = (X T X + λi) 1 X T y, (12) where I denotes the identity matrix of dimension (n + 1) (n + 1). If you get stuck: the wikipedia page on matrix calculus has a useful summary of rules regarding matrix/vector derivatives (numerator layout!).

4 4 2 Logistic Regression 2.1 Logistic Regression training with gradient descent and fminunc [6 points + 2* points] Download hw1_logreg.zip from the course website and unzip. data_logreg.mat contains training and test set of a binary classification problem with two input dimensions. Load the data file and plot gscatter(data.x1_train, data.x2_train, data.y_train) to see the structure of the data. Your task is to fit a logistic regression classifier with polynomial features of varying degrees to the data, and to compare different optimization techniques for this task. The optimization techniques that should be compared are gradient descent with a constant learning rate, gradient descent with an adaptive learning rate (optional), and the advanced optimization method implemented by the MATLAB function fminunc. The features that should be used for logistic regression are all monomials of the two inputs x 1 and x 2 up to some degree l (monomials are polynomials with only one term), i.e. all products (x 1 ) a (x 2 ) b with a, b Z and a + b l. For example, for degree l = 3 the features should be, Your report should include the following: Results for Gradient Descent (GD): φ 0 = 1, (13) φ 1 = (x 1 ) 1, φ 2 = (x 2 ) 1, (14) φ 3 = (x 1 ) 2, φ 4 = (x 1 )(x 2 ), φ 5 = (x 2 ) 2, (15) φ 6 = (x 1 ) 3, φ 7 = (x 1 ) 2 (x 2 ) 1, φ 8 = (x 1 ) 1 (x 2 ) 2, φ 9 = (x 2 ) 3 (16) For degree l = 1 run GD for 10, 100 and 1000 iterations (learning rate η = 1, all three parameters initialized at zero). Report training and test errors after the three different iteration counts (verify your implementation of GD: the training cost should be decreasing with the number of iterations... ). Plot the hypothesis obtained after 1000 iterations over training/test data using logreg_plot.m. Repeat for degrees l {2, 5, 15}. Optional: Results for Gradient Descent with adaptive learning rate (GDad): Implement the adaptive learning rate as follows: after every update check whether the cost increased or decreased. If the cost increased, reject the update (go back to the previous parameter setting) and multiply the learning rate by 0.7. If the cost decreased, accept the update and multiply the learning rate by The iteration count should be increased after every iteration even if the update was rejected. Run GDad for varying degrees l {1, 2, 5, 15} (with zero initialization of parameters, 1000 iterations, and initial learning rate η = 1). Report training and test errors, and plot the hypothesis obtained in each case with logreg_plot.m. Compare the results of GD/GDad and briefly discuss. Results for MATLAB optimizer fminunc: Run fminunc for degrees l {1, 2, 5, 15} (with zero initialization of parameters, MaxIter set to 1000 iterations, function tolerance TolFun and parameter tolerance TolX both set to 1e-8). Report training and test errors, and plot the hypothesis obtained in each case with logreg_plot.m. Compare the obtained results with those above (GD/GDad) and briefly discuss and interpret your observations. The scripts in hw1_logreg.zip contain a skeleton of functionality for carrying out this task. Find the TODOs and implement the missing functionality as required.

5 2.2 Derivation of Gradient [2 points] Derivation of Gradient [2 points] The logistic regression hypothesis function is given by, h θ (x) = σ(x T θ) = σ(θ 0 x 0 + θ 1 x θ n x n ) (17) with the sigmoid function σ(z) = (1 + exp( z)) 1, parameters θ = (θ 0, θ 1,..., θ n ) T, and input vector x = (x 0, x 1,..., x n ) T. By convention the constant feature x 0 is fixed to x 0 = 1. The logistic regression cost function can be written as, J(θ) = 1 m m i=1 ( ) y (i) log(h θ (x (i) )) + (1 y (i) ) log(1 h θ (x (i) )) (18) with x (i) = (x (i) 0, x(i) 1,..., x(i) n ) T and x (i) 0 = 1. The notation log( ) refers to the natural logarithm. Derive the gradient of the cost function, i.e. show that the partial derivative of the cost function with respect to θ j equals J(θ) θ j =... (19) =... = 1 m m i=1 ( h θ (x (i) ) y (i)) x (i) j. (20) Hint: σ(z) z = σ(z) (1 σ(z))

HMC CS 158, Fall 2017 Problem Set 3 Programming: Regularized Polynomial Regression

HMC CS 158, Fall 2017 Problem Set 3 Programming: Regularized Polynomial Regression Goals: To open up the black-box of scikit-learn and implement regression models. To investigate how adding polynomial