HW3: Multiple Linear Regression


STAT Intro Stat Data Sci, UW, Spring Quarter 2017
Néhémy Lim

Programming assignment

Directions. Comment all functions to receive full credit. Provide a single Python file with the format name_391_hw3.py, where name is your full name. Send your file to nehemyl@uw.edu and zhangkh@uw.edu. Please specify your name in the heading of the email. This assignment is due on Thursday, April 20th at 11:59 pm PST.

The goal of this assignment is to implement a set of functions related to multiple linear regression. For each question, you are allowed to use functions that have been implemented in previous assignments or questions.

(a) In machine learning, it is important to normalize the predictors. A common normalization technique is known as standardization: a predictor is standardized by subtracting its mean and dividing the difference by its standard deviation. Write a function standardize that takes a two-dimensional Numpy array X of n rows and p columns, and returns 3 outputs:
- a two-dimensional Numpy array X_std of n rows and p columns, where the j-th column of X_std is the standardized version of the j-th column of X,
- a one-dimensional Numpy array bar_x of p numerical values that contains the mean of each column,
- a one-dimensional Numpy array std_x of p numerical values that contains the empirical standard deviation of each column.
You may use the Numpy functions mean and std. Your code should not contain any for or while loop.

(b) Write a function add_ones that takes a two-dimensional Numpy array M of n rows and p columns, and returns a two-dimensional Numpy array M1 of n rows and p + 1 columns, where the first column of M1 is filled with ones and the remaining p columns are identical to M. Your code should not contain any for or while loop.
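A minimal sketch of how (a) and (b) could be written with vectorized Numpy operations follows; the function names come from the assignment, but the bodies are one possible solution, not the official one.

import numpy as np

def standardize(X):
    # Column-wise means and (population) standard deviations.
    bar_x = np.mean(X, axis=0)
    std_x = np.std(X, axis=0)
    # Broadcasting standardizes every column without any explicit loop.
    X_std = (X - bar_x) / std_x
    return X_std, bar_x, std_x

def add_ones(M):
    # Prepend a column of ones for the intercept term.
    n = M.shape[0]
    return np.hstack((np.ones((n, 1)), M))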

(c) Write a function standardize_design that takes a two-dimensional Numpy array design_mat of n observations and p predictors, and returns a two-dimensional Numpy array design_mat_std of n rows and p + 1 columns, where the first column of design_mat_std is filled with ones and the remaining p columns correspond to the standardized version of the matrix design_mat. Your code should not contain any for or while loop.

(d) Write a function compute_lsq_estimates that takes two input arguments: a two-dimensional Numpy array design_mat of n observations and p predictors, and a one-dimensional Numpy array y (responses) of n numerical values, and returns a one-dimensional Numpy array hat_beta of p + 1 numerical values, the least squares estimates of β_0, β_1, ..., β_p. You may use the Numpy function linalg.inv. Note: do not forget to standardize the predictors!

(e) Write a function predict_lsq that takes 4 input arguments:
- a two-dimensional Numpy array X_test of nt test observations and p predictors,
- a one-dimensional Numpy array bar_x of p numerical values that contains the mean of each predictor, computed from a training set,
- a one-dimensional Numpy array std_x of p numerical values that contains the empirical standard deviation of each predictor, computed from a training set,
- a one-dimensional Numpy array hat_beta of p + 1 numerical values, the least squares estimates of β_0, β_1, ..., β_p, computed from a standardized design matrix,
and returns a one-dimensional Numpy array hat_y (the prediction for each test observation) of nt numerical values. Your code should not contain any for or while loop.

(f) Write a function compute_std_err_lsq that takes 3 input arguments: a two-dimensional Numpy array design_mat of n observations and p predictors, a one-dimensional Numpy array y (responses) of n numerical values, and a one-dimensional Numpy array hat_y of associated predictions, and returns two outputs: rse, the residual standard error, and a one-dimensional Numpy array se_hat_beta of p + 1 numerical values, where the j-th element of se_hat_beta is the standard error of ˆβ_j, given by SE(ˆβ_j) = RSE √(Ω_jj), where Ω_jj is the (j, j)-th coefficient of the matrix (X^T X)^(-1) and X is the standardized design matrix. Your code should not contain any for or while loop.
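Building on the sketch above, here is a hedged sketch of (c) through (f); it assumes the standardize and add_ones functions from the previous block, and the residual standard error uses n - p - 1 degrees of freedom.

import numpy as np

def standardize_design(design_mat):
    # Standardize the predictors, then prepend the intercept column.
    X_std, bar_x, std_x = standardize(design_mat)
    return add_ones(X_std)

def compute_lsq_estimates(design_mat, y):
    # Least squares via the normal equations: hat_beta = (X^T X)^(-1) X^T y.
    X = standardize_design(design_mat)
    return np.linalg.inv(X.T @ X) @ X.T @ y

def predict_lsq(X_test, bar_x, std_x, hat_beta):
    # Standardize the test predictors with the training means and standard deviations.
    X = add_ones((X_test - bar_x) / std_x)
    return X @ hat_beta

def compute_std_err_lsq(design_mat, y, hat_y):
    n, p = design_mat.shape
    X = standardize_design(design_mat)
    rss = np.sum((y - hat_y) ** 2)
    rse = np.sqrt(rss / (n - p - 1))                # residual standard error
    omega = np.linalg.inv(X.T @ X)
    se_hat_beta = rse * np.sqrt(np.diag(omega))     # SE(beta_j) = RSE * sqrt(Omega_jj)
    return rse, se_hat_beta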

(g) Write a function compute_r2_lsq that takes 2 input arguments: a one-dimensional Numpy array y (responses) of n numerical values, and a one-dimensional Numpy array hat_y of associated predictions, and returns the R² statistic. Your code should not contain any for or while loop.

(h) Write a function compute_f_stat_lsq that takes 3 input arguments: a one-dimensional Numpy array y (responses) of n numerical values, a one-dimensional Numpy array hat_y of associated predictions, and the number of predictors p, and returns the F-statistic. Your code should not contain any for or while loop.
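A hedged sketch of (g) and (h), using the standard definitions R² = 1 - RSS/TSS and F = [(TSS - RSS)/p] / [RSS/(n - p - 1)]:

import numpy as np

def compute_r2_lsq(y, hat_y):
    rss = np.sum((y - hat_y) ** 2)            # residual sum of squares
    tss = np.sum((y - np.mean(y)) ** 2)       # total sum of squares
    return 1 - rss / tss

def compute_f_stat_lsq(y, hat_y, p):
    n = y.shape[0]
    rss = np.sum((y - hat_y) ** 2)
    tss = np.sum((y - np.mean(y)) ** 2)
    return ((tss - rss) / p) / (rss / (n - p - 1))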

Directions. Show and explain all work to receive full credit. Homework is due on Thursday, April 20th at the beginning of class, by 12:00 pm.

Problem. In this study, we are interested in the deaths due to heart attacks among men between the ages of 55 and 59 in 22 countries. The data set hw3_data.csv contains the following variables:
- number of phones per 1,000 inhabitants,
- proportion of saturated fat for men between the ages of 55 and 59,
- proportion of animal fat for men between the ages of 55 and 59,
- death rate due to heart attacks, calculated as 100 × {ln(number of deaths due to heart attacks per 100,000 men between the ages of 55 and 59) - 2}.

Download the file hw3_data.csv that contains the data into your working directory. You can read the data using the following commands:

import numpy as np
heart_data = np.loadtxt("hw3_data.csv", delimiter=",", skiprows=1)
phones = heart_data[:, 0]     # number of phones per 1000 inhabitants
saturated = heart_data[:, 1]  # proportion of saturated fat
animal = heart_data[:, 2]     # proportion of animal fat
deaths = heart_data[:, 3]     # death rate due to heart disease

For this problem, use the Python functions implemented in the programming assignment. Show your computations by displaying the function calls; a sketch of the kind of calls expected is given after question (i) below.

(a) Standardize the predictors using the function standardize_design.

(b) We first study the simple linear regression model for the heart attack death rate on the basis of the number of phones alone. Determine whether the number of phones is significantly associated with the heart attack death rate. A table of the quantiles of the t-distribution can be found here: tables.pdf, see page 3.

(c) Write the multiple linear regression model for the heart attack death rate on the basis of the number of phones and the proportion of saturated fat. Compute the associated least squares coefficient estimates.

(d) Test whether at least one of the predictors, number of phones or proportion of saturated fat, is useful in predicting the heart attack death rate. A table of the quantiles of the F-distribution can be found in the same pdf file, see pages

(e) Compute the R² statistic and the residual standard error for the models in questions (b) and (c). Would you say that adding the proportion of saturated fat to the model significantly improves the accuracy?

(f) Write the multiple linear regression model for the heart attack death rate on the basis of the number of phones, the proportion of saturated fat, and the proportion of animal fat. Compute the associated least squares coefficient estimates.

(g) A country has the following features: 108 phones per 1,000 inhabitants, 33% of saturated fat for men between the ages of 55 and 59, and 7% of animal fat for men between the ages of 55 and 59. Predict the heart attack death rate for men between the ages of 55 and 59 in that country.

(h) Which coefficient estimates are significantly non-zero?

(i) Consider the model in question (f), with an additional interaction term (proportion of saturated fat × proportion of animal fat). Compute the associated least squares coefficient estimates. Elaborate on the significance of the interaction effect.
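The following hedged sketch illustrates the kind of function calls expected, using the arrays loaded above and the functions from the programming assignment; the variable names introduced here (design_b, t_stat, new_country, and so on) are illustrative only, and the resulting statistics still have to be compared with the tabulated t and F quantiles.

import numpy as np

# Question (b): simple linear regression of deaths on phones.
design_b = phones.reshape(-1, 1)
hat_beta_b = compute_lsq_estimates(design_b, deaths)
_, bar_b, std_b = standardize(design_b)
hat_y_b = predict_lsq(design_b, bar_b, std_b, hat_beta_b)
rse_b, se_b = compute_std_err_lsq(design_b, deaths, hat_y_b)
t_stat = hat_beta_b[1] / se_b[1]          # compare |t_stat| with a t(n - 2) quantile

# Questions (f) and (g): three-predictor model and prediction for the new country.
design_f = np.column_stack((phones, saturated, animal))
hat_beta_f = compute_lsq_estimates(design_f, deaths)
_, bar_f, std_f = standardize(design_f)
new_country = np.array([[108.0, 33.0, 7.0]])   # phones, % saturated fat, % animal fat
pred = predict_lsq(new_country, bar_f, std_f, hat_beta_f)

# Question (i): add the saturated fat × animal fat interaction as a fourth predictor.
design_i = np.column_stack((phones, saturated, animal, saturated * animal))
hat_beta_i = compute_lsq_estimates(design_i, deaths)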
