Machine Learning: Practice Midterm, Spring 2018

Shanon Briggs

Name:

Instructions

You may use the following resources on the midterm:
1. your filled-in "methods" table with the accompanying notation page,
2. a single page of notes (8.5 by 11 in),
3. a calculator.

In this exam, you will use the methods of (statistical) machine learning to solve two prediction problems. The first problem is to predict the energy consumption of appliances in a certain house, and the second is to predict whether or not a particular office is occupied. Both datasets are from the UCI Machine Learning Repository.

Please note that this exam is longer than I expect you to be able to do in 50 minutes. The actual midterm will have a subset of the questions that are on this practice exam, possibly applied to different datasets.

Prediction Problem 1: Energy Consumption

Problem Description: The goal is to predict the energy consumption (in watt-hours, Wh) of appliances in a certain house. Here is a description of the dataset by the authors:

Data was collected every 10 min for about 4.5 months. The house temperature and humidity conditions were monitored with a ZigBee wireless sensor network. Each wireless node transmitted the temperature and humidity conditions around 3.3 min. Then, the wireless data was averaged for 10 minutes periods. The energy data was logged every 10 minutes with m-bus energy meters. Weather from the nearest airport weather station (Chievres Airport, Belgium) was downloaded from a public data set from Reliable Prognosis (rp5.ru), and merged together with the experimental data sets using the date and time column.

Here is a list of the variable names in the dataset:

date time, year-month-day hour:minute:second
Appliances, energy use in Wh
T1, Temperature in kitchen area, in Celsius
RH_1, Humidity in kitchen area, in %
T2, Temperature in living room area, in Celsius
RH_2, Humidity in living room area, in %
T3, Temperature in laundry room area
RH_3, Humidity in laundry room area, in %
T4, Temperature in office room, in Celsius
RH_4, Humidity in office room, in %
T5, Temperature in bathroom, in Celsius
RH_5, Humidity in bathroom, in %
T6, Temperature outside the building (north side), in Celsius
RH_6, Humidity outside the building (north side), in %
T7, Temperature in ironing room, in Celsius
RH_7, Humidity in ironing room, in %
T8, Temperature in teenager room 2, in Celsius
RH_8, Humidity in teenager room 2, in %
T9, Temperature in parents room, in Celsius
RH_9, Humidity in parents room, in %
To, Temperature outside (from Chievres weather station), in Celsius
Pressure (from Chievres weather station), in mm Hg
RH_out, Humidity outside (from Chievres weather station), in %
Wind speed (from Chievres weather station), in m/s
Visibility (from Chievres weather station), in km
Tdewpoint (from Chievres weather station), in Celsius

Questions:

Use the R output below to answer questions 1-4.

str(dd)
## 'data.frame': obs. of 26 variables:
##  $ date       :Classes 'chron', 'dates', 'times' atomic [1:19735]
##   ..- attr(*, "format")= Named chr [1:2] "m/d/y" "h:m:s"
##   .. ..- attr(*, "names")= chr [1:2] "dates" "times"
##   ..- attr(*, "origin")= Named num [1:3]
##   .. ..- attr(*, "names")= chr [1:3] "month" "day" "year"
##  $ Appliances : int
##  $ T1         : num
##  $ RH_1       : num
##  $ T2         : num
##  $ RH_2       : num
##  $ T3         : num
##  $ RH_3       : num
##  $ T4         : num
##  $ RH_4       : num
##  $ T5         : num
##  $ RH_5       : num
##  $ T6         : num
##  $ RH_6       : num
##  $ T7         : num
##  $ RH_7       : num
##  $ T8         : num
##  $ RH_8       : num
##  $ T9         : num
##  $ RH_9       : num
##  $ T_out      : num
##  $ Press_mm_hg: num
##  $ RH_out     : num
##  $ Windspeed  : num
##  $ Visibility : num
##  $ Tdewpoint  : num

1. How many variables are in the dataset?

2. How many observations are there of each variable?

3. Which type of variable is energy consumption of appliances, quantitative or binary?

4. If our goal is to predict the energy consumption of appliances from the sensor data, what methods from Math 407 are appropriate to try?


Review the following graphs of the response variable Appliances and a few possible predictors, then answer question 5.

5. From what you can see about the dataset in the above graphs, do you believe the assumptions of kNN are satisfied? (Hint: this is a trick question!)

6. Briefly explain how you could use a simulation to estimate the variance of the irreducible error in predicting the energy use from the available predictor variables. (You don't need to write any code; just explain in two or three sentences.)
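As a rough illustration of what "irreducible error" means in question 6, here is a minimal sketch on synthetic data, not on the energy dataset. It assumes the true regression function is known, which is only possible in a simulation; the function f, the noise level sigma and all numbers are made up for the example.

```r
# Synthetic illustration: every name and number here is hypothetical.
set.seed(1)
n <- 5000
x <- runif(n, 0, 10)
f <- function(x) 50 + 3 * x        # assumed "true" regression function
sigma <- 4                         # true SD of the irreducible error
y <- f(x) + rnorm(n, sd = sigma)

# With f known, the best possible prediction of y is f(x);
# whatever remains is the irreducible error.
irreducible <- y - f(x)
var_hat <- var(irreducible)
var_hat                            # should be close to sigma^2 = 16
```

No model, however flexible, can achieve a test MSE below this variance on data generated this way, which is what makes it a useful benchmark in a simulation.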

7. The code snippet below shows kNN being trained and tested for k = 1, 2, ..., 15. Use the resulting output to choose the best value of k to use. I choose k=

# choose a test set (10% of the rows)
set.seed(11)
samp <- sample(1:nrow(dd), round(.1*nrow(dd)), replace=TRUE)

# use the following 25 variables as predictors in kNN
pnames <- c("T1", "RH_1", "T2", "RH_2", "T3", "RH_3", "T4", "RH_4", "T5", "RH_5",
            "T6", "RH_6", "T7", "RH_7", "T8", "RH_8", "T9", "RH_9", "T_out",
            "Press_mm_hg", "RH_out", "Windspeed", "Visibility", "Tdewpoint", "hour")

# storage for mean absolute error of kNN for 15 different k's
MAEk <- numeric(15)

# train kNN on the remaining rows and predict the test rows
for(k in 1:15) {
  knnk <- FNN::knn.reg(train=dd[-samp,pnames], test=dd[samp,pnames],
                       y=dd[-samp,"Appliances"], k=k)
  y <- dd[samp,"Appliances"]
  errork <- (y-knnk$pred)
  MAEk[k] <- mean(abs(errork))
}

# plot test-set MAE against k
plot(1:15, MAEk, main="Mean Absolute Error of kNN on test set", xlab="k",
     ylab="Mean Absolute Error", type='l', xlim=c(1,15))

8. What (if anything) could we try to improve the performance of kNN?
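The selection step in question 7 can be sketched end to end on synthetic data. This is only an illustration under stated assumptions: knn_predict is a hypothetical base-R helper written for this example (it is not FNN::knn.reg), and the data, sample size and noise level are made up.

```r
# Base-R k-nearest-neighbour regression; knn_predict is a hypothetical helper.
knn_predict <- function(train_x, train_y, test_x, k) {
  train_x <- as.matrix(train_x)
  test_x <- as.matrix(test_x)
  apply(test_x, 1, function(row) {
    d <- sqrt(colSums((t(train_x) - row)^2))   # Euclidean distance to each training row
    mean(train_y[order(d)[seq_len(k)]])        # average response of the k nearest rows
  })
}

# Made-up data with two predictors
set.seed(11)
n <- 300
x <- data.frame(x1 = runif(n), x2 = runif(n))
y <- sin(2 * pi * x$x1) + x$x2 + rnorm(n, sd = 0.3)

test_idx <- sample(seq_len(n), round(0.1 * n))  # hold out 10% as a test set
ks <- 1:15
mae <- sapply(ks, function(k) {
  pred <- knn_predict(x[-test_idx, ], y[-test_idx], x[test_idx, ], k)
  mean(abs(y[test_idx] - pred))
})
best_k <- ks[which.min(mae)]                    # k with the smallest test MAE
```

Because kNN relies on distances, its results depend on the units of the predictors; standardising the columns (for example with scale()) before computing distances is one commonly tried adjustment.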

The R code below shows a linear model being fit with several predictors, including the temperature and relative humidity of 9 locations around the house. The location codes are

1=kitchen, 2=living room, 3=laundry room, 4=office, 5=bathroom, 6=outside (north side), 7=ironing room, 8=teenager's room, 9=parents' room

so, for example, the variable T1 corresponds to the temperature in the kitchen, and RH_2 to the relative humidity in the living room. Use this output to answer questions 9 and 10 on the next page.

# train linear regression model
mylr <- lm(Appliances~T1+RH_1+T2+RH_2+T3+RH_3+T4+RH_4+T5+RH_5+
           T6+RH_6+T7+RH_7+T8+RH_8+T9+RH_9+Hour+Month,
           data=as.data.frame(dd[-samp,]))

# view model coefficients
mylr
## Call:
## lm(formula = Appliances ~ T1 + RH_1 + T2 + RH_2 + T3 + RH_3 +
##     T4 + RH_4 + T5 + RH_5 + T6 + RH_6 + T7 + RH_7 + T8 + RH_8 +
##     T9 + RH_9 + Hour + Month, data = as.data.frame(dd[-samp,
##     ]))
##
## Coefficients:
## (Intercept)           T1         RH_1           T2         RH_2
##
##          T3         RH_3           T4         RH_4           T5
##
##        RH_5           T6         RH_6           T7         RH_7
##
##          T8         RH_8           T9         RH_9        Hour1
##
##       Hour2        Hour3        Hour4        Hour5        Hour6
##
##       Hour7        Hour8        Hour9       Hour10       Hour11
##
##      Hour12       Hour13       Hour14       Hour15       Hour16
##
##      Hour17       Hour18       Hour19       Hour20       Hour21
##
##      Hour22       Hour23     MonthFeb     MonthMar     MonthApr
##
##    MonthMay
##

The linear regression model shown on the previous page has an MAE of 52.75, which is worse than the best kNN model. One advantage of linear regression over kNN is that the model is more interpretable.

9. Use the table of model coefficients in the R output and the location codes to determine:

a) which location has the largest predicted increase in energy use of the appliances for an increase of one degree Celsius (T), if all other variables in the model are held constant?
Location:

b) which location has the largest reduction in predicted energy use of the appliances for an increase of 1% in relative humidity (RH), if all other variables in the model are held constant?
Location:

10. The month each measurement was recorded was included in the linear regression model as a qualitative variable with categories "Jan", "Feb", "Mar", "Apr" and "May". The category "Jan" was used as the baseline, and the model coefficients under "MonthFeb", "MonthMar", "MonthApr" and "MonthMay" correspond to the difference between the baseline "Jan" and the months of "Feb", "Mar", "Apr" and "May". Thus the predicted energy use of appliances in February may be found by adding the intercept ("Intercept") to the coefficient of "MonthFeb". Which month has the lowest predicted energy use by appliances, if all other variables in the model are held constant?
Month:

11. Besides adding more predictor variables, what (if anything) could you try changing about the linear regression model to improve its performance?

12. The linear regression model had an estimated test MSE of [value missing]. Do you expect that the MSE of the training set for the model is larger, smaller or exactly the same as the MSE of the test set?
Circle one: larger / smaller / exactly the same

13. The test MSE of the linear regression model was higher than the test MSE of the best kNN model. Do you expect that the higher MSE is a result of more bias, a higher variance in model fits or a higher variance of the irreducible error?
Circle one: bias / variance in model fits / variance of the irreducible error
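The baseline coding described in question 10 can be seen on a small synthetic example. This is a sketch only: the months, effects and the single numeric predictor temp are invented for illustration and have nothing to do with the exam's fitted model.

```r
# Toy data; month effects and all numbers are made up for illustration.
set.seed(2)
month <- factor(rep(c("Jan", "Feb", "Mar"), each = 50),
                levels = c("Jan", "Feb", "Mar"))    # first level is the baseline
temp <- rnorm(150, mean = 20, sd = 2)
effect <- c(Jan = 0, Feb = -5, Mar = 10)
y <- 100 + 2 * temp + effect[as.character(month)] + rnorm(150, sd = 1)

fit <- lm(y ~ temp + month)
coef(fit)    # has monthFeb and monthMar, but no coefficient for the baseline Jan

# Predicted difference between Feb and Jan at the same temperature
new <- data.frame(temp = 20,
                  month = factor(c("Jan", "Feb"), levels = levels(month)))
feb_minus_jan <- diff(predict(fit, new))
```

Here feb_minus_jan equals the monthFeb coefficient: each month coefficient is the predicted shift relative to the baseline month when the other predictors are held fixed.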

Prediction Problem 2: Office Occupancy

Is a particular office occupied or not? To answer this question, the following variables were collected every minute in the office for about two weeks:

date time, year-month-day hour:minute:second
Temperature, in Celsius
Relative Humidity, in %
Light, in Lux
CO2, in ppm
Humidity Ratio, derived quantity from temperature and relative humidity, in kg water-vapor/kg air
Occupancy, "Empty" or "Occupied"

The dataset was broken into three parts: a training set and two test sets.

1. Which methods from class are appropriate to try when predicting a binary response variable such as "Occupancy" using four predictor variables such as "time of day" (hour), "light" (Lux), "CO2" (ppm) and "humidity ratio" (kg water-vapor/kg air)?

2. Based on the following histograms of three predictor variables by occupancy, do you expect that either LDA or QDA will work well to predict whether or not the office is empty? Briefly explain.

3. Which of the assumptions of logistic regression is most likely NOT satisfied by this dataset?

4. Given the following table of overall misclassification rates on test set 1, choose one of LDA, QDA, logistic regression or kNN as the best method.

Table: Percentage of test points that were misclassified

Method:   kNN (k=25)   Logistic   LDA     QDA
          2.1%         2.78%      2.14%   2.29%

5. Suppose that the logistic regression model had a misclassification rate of 3.07% for empty offices on test set 1 (used in question 4) and a misclassification rate of 0.71% for empty offices on test set 2 (which until now was untouched). Which estimate of the true misclassification rate of empty offices do you trust to be closer to the truth?
Circle one: 3.07% from test set 1 or 0.71% from test set 2

6. Compute a 95% confidence interval for the true misclassification rate of empty offices by the logistic regression model, using the fact that the rate was estimated to be 0.71% on a test set of size 9752, which had 7703 empty offices and 2049 occupied offices.

7. What test set size would be needed to estimate the true misclassification rate of empty offices to within 2%?
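For reference, a normal-approximation (Wald) interval for a proportion and the matching sample-size calculation can be sketched as follows. The helper names wald_ci and n_for_margin are hypothetical, and the error rate, test-set size and margin used in the calls are made-up values rather than the numbers from questions 6 and 7.

```r
# Wald (normal-approximation) interval for a proportion.
# wald_ci and n_for_margin are hypothetical helper names.
wald_ci <- function(p_hat, n, level = 0.95) {
  z <- qnorm(1 - (1 - level) / 2)          # about 1.96 for a 95% interval
  se <- sqrt(p_hat * (1 - p_hat) / n)      # standard error of the estimated rate
  c(lower = p_hat - z * se, upper = p_hat + z * se)
}

# Smallest n whose interval half-width is at most `margin`.
# p = 0.5 is the most conservative (largest-variance) choice.
n_for_margin <- function(margin, p = 0.5, level = 0.95) {
  z <- qnorm(1 - (1 - level) / 2)
  ceiling(z^2 * p * (1 - p) / margin^2)
}

ci <- wald_ci(p_hat = 0.05, n = 1000)      # made-up error rate and test-set size
n_needed <- n_for_margin(margin = 0.05)    # made-up margin
```

In wald_ci, n should be the number of test points the rate was actually computed over, which may be smaller than the full test set when the rate refers to one class only.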

8. Write down the prediction model from the trained logistic regression model shown in the following R output:

model2
##
## Call:  glm(formula = Occupancy ~ Light + CO2 + HumidityRatio + hour +
##     I(hour^2), family = "binomial", data = rr)
##
## Coefficients:
##   (Intercept)          Light            CO2  HumidityRatio           hour
##
##     I(hour^2)
##
##
## Degrees of Freedom: 8142 Total (i.e. Null);  8137 Residual
## Null Deviance:     8420
## Residual Deviance: 1045     AIC:

9. Use the above R output from logistic regression to compute the odds ratio of the office being occupied for a one Lux increase in the amount of light detected.

10. Briefly describe how 5-fold cross validation could have been used to select the best model instead of a single training and test set. You don't need to write code; just explain the idea in a sentence or two.
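To make the idea in question 10 concrete, here is a minimal sketch of 5-fold cross validation used to compare two logistic regression models. It runs on synthetic data; the variable names light, co2 and occupied, the coefficients used to generate the data, and the helper cv_error are all assumptions made for this example.

```r
# Synthetic binary-response data; variable names and coefficients are hypothetical.
set.seed(3)
n <- 500
d <- data.frame(light = runif(n, 0, 600), co2 = runif(n, 400, 1200))
p <- plogis(-6 + 0.01 * d$light + 0.002 * d$co2)
d$occupied <- rbinom(n, 1, p)

k <- 5
fold <- sample(rep(1:k, length.out = n))   # random fold label for every row

# Average misclassification rate over the k held-out folds
cv_error <- function(formula) {
  errs <- sapply(1:k, function(j) {
    fit <- glm(formula, family = binomial, data = d[fold != j, ])
    prob <- predict(fit, newdata = d[fold == j, ], type = "response")
    mean((prob > 0.5) != d$occupied[fold == j])
  })
  mean(errs)
}

err_small <- cv_error(occupied ~ light)
err_full <- cv_error(occupied ~ light + co2)
c(light_only = err_small, light_and_co2 = err_full)
```

Each row is held out exactly once, so every candidate model is scored on all of the data while never being evaluated on rows it was trained on; the model with the lower cross-validated error would be preferred.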
