Machine Learning: Practice Midterm, Spring 2018

Name:

Instructions

You may use the following resources on the midterm:
1. your filled-in "methods" table with the accompanying notation page,
2. a single page of notes (8.5 by 11 in),
3. a calculator.

In this exam, you will use the methods of (statistical) machine learning to solve two prediction problems. The first problem is to predict the energy consumption of appliances in a certain house, and the second is to predict whether or not a particular office is occupied. Both datasets are from the UCI Machine Learning Repository.

Please note that this exam is longer than I expect you to be able to complete in 50 minutes. The actual midterm will have a subset of the questions on this practice exam, possibly applied to different datasets.

Prediction Problem 1: Energy Consumption

Problem Description: The goal is to predict the energy consumption (in watt-hours, Wh) of appliances in a certain house. Here is a description of the dataset by the authors:

Data was collected every 10 min for about 4.5 months. The house temperature and humidity conditions were monitored with a ZigBee wireless sensor network. Each wireless node transmitted the temperature and humidity conditions around 3.3 min. Then, the wireless data was averaged for 10 minutes periods. The energy data was logged every 10 minutes with m-bus energy meters. Weather from the nearest airport weather station (Chievres Airport, Belgium) was downloaded from a public data set from Reliable Prognosis (rp5.ru), and merged together with the experimental data sets using the date and time column.

Here is a list of the variable names in the dataset:

date time, year-month-day hour:minute:second
Appliances, energy use in Wh
T1, Temperature in kitchen area, in Celsius
RH_1, Humidity in kitchen area, in %
T2, Temperature in living room area, in Celsius
RH_2, Humidity in living room area, in %
T3, Temperature in laundry room area
RH_3, Humidity in laundry room area, in %
T4, Temperature in office room, in Celsius
RH_4, Humidity in office room, in %
T5, Temperature in bathroom, in Celsius
RH_5, Humidity in bathroom, in %
T6, Temperature outside the building (north side), in Celsius
RH_6, Humidity outside the building (north side), in %
T7, Temperature in ironing room, in Celsius
RH_7, Humidity in ironing room, in %
T8, Temperature in teenager room 2, in Celsius
RH_8, Humidity in teenager room 2, in %
T9, Temperature in parents room, in Celsius
RH_9, Humidity in parents room, in %
To, Temperature outside (from Chievres weather station), in Celsius
Pressure (from Chievres weather station), in mm Hg
RH_out, Humidity outside (from Chievres weather station), in %
Wind speed (from Chievres weather station), in m/s
Visibility (from Chievres weather station), in km
Tdewpoint (from Chievres weather station), in Celsius

Questions: Use the R output below to answer questions 1-4.

str(dd)
## 'data.frame': obs. of 26 variables:
## $ date       :Classes 'chron', 'dates', 'times' atomic [1:19735]
##   ..- attr(*, "format")= Named chr [1:2] "m/d/y" "h:m:s"
##   .. ..- attr(*, "names")= chr [1:2] "dates" "times"
##   ..- attr(*, "origin")= Named num [1:3]
##   .. ..- attr(*, "names")= chr [1:3] "month" "day" "year"
## $ Appliances : int
## $ T1         : num
## $ RH_1       : num
## $ T2         : num
## $ RH_2       : num
## $ T3         : num
## $ RH_3       : num
## $ T4         : num
## $ RH_4       : num
## $ T5         : num
## $ RH_5       : num
## $ T6         : num
## $ RH_6       : num
## $ T7         : num
## $ RH_7       : num
## $ T8         : num
## $ RH_8       : num
## $ T9         : num
## $ RH_9       : num
## $ T_out      : num
## $ Press_mm_hg: num
## $ RH_out     : num
## $ Windspeed  : num
## $ Visibility : num
## $ Tdewpoint  : num

1. How many variables are in the dataset?

2. How many observations are there of each variable?

3. Which type of variable is energy consumption of appliances, quantitative or binary?

4. If our goal is to predict the energy consumption of appliances from the sensor data, what methods from Math 407 are appropriate to try?

Review the following graphs of the response variable Appliances and a few possible predictors, then answer question 5.

5. From what you can see about the dataset in the above graphs, do you believe the assumptions of knns are satisfied? (Hint: this is a trick question!)

6. Briefly explain how you could use a simulation to estimate the variance of the irreducible error in predicting the energy use from the available predictor variables. (You don't need to write any code; just explain in two or three sentences.)

7. The code snippet below shows knn being trained and tested for k=1,2,...,15. Use the resulting output to choose the best value of k to use. I choose k=

# choose a test set (10% of the rows)
set.seed(11)
samp <- sample(1:nrow(dd), round(.1*nrow(dd)), replace=TRUE)

# use the following 25 variables as predictors in knn
pnames <- c("T1", "RH_1", "T2", "RH_2", "T3", "RH_3", "T4", "RH_4", "T5", "RH_5",
            "T6", "RH_6", "T7", "RH_7", "T8", "RH_8", "T9", "RH_9", "T_out",
            "Press_mm_hg", "RH_out", "Windspeed", "Visibility", "Tdewpoint", "hour")

# storage for mean absolute error of knn for 15 different k's
MAEk <- numeric(15)

# train knn on the rows outside the test set and predict the test rows
for(k in 1:15) {
  knnk <- FNN::knn.reg(train=dd[-samp,pnames], test=dd[samp,pnames],
                       y=dd[-samp,"Appliances"], k=k)
  y <- dd[samp,"Appliances"]
  errork <- (y-knnk$pred)
  MAEk[k] <- mean(abs(errork))
}

# plot test-set MAE against k
plot(1:15, MAEk, main="Mean Absolute Error of knn on test set",
     xlab="k", ylab="Mean Absolute Error", type='l', xlim=c(1,15))

8. What (if anything) could we try to improve the performance of knns?
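The selection step in question 7 amounts to picking the k with the smallest test-set MAE. The following is a minimal sketch of that idea on simulated data (not the exam's dataset). The helper `knn_predict` is a hypothetical base R stand-in for the `FNN::knn.reg` call, and the simulated predictors and response are assumptions made only for illustration.

```r
# Simulated data: two predictors and a noisy nonlinear response (illustrative only)
set.seed(11)
n <- 300
x <- matrix(runif(n * 2), ncol = 2)
y <- sin(2 * pi * x[, 1]) + x[, 2] + rnorm(n, sd = 0.3)

# Hold out 10% of the rows as a test set
test_idx <- sample(seq_len(n), round(0.1 * n))
x_train <- x[-test_idx, , drop = FALSE]
y_train <- y[-test_idx]
x_test  <- x[test_idx, , drop = FALSE]
y_test  <- y[test_idx]

# Hypothetical helper: knn regression with Euclidean distance in base R
knn_predict <- function(x_train, y_train, x_test, k) {
  apply(x_test, 1, function(pt) {
    d <- sqrt(colSums((t(x_train) - pt)^2))  # distance to every training point
    mean(y_train[order(d)[seq_len(k)]])      # average the k nearest responses
  })
}

# Test-set mean absolute error for k = 1, ..., 15
MAEk <- sapply(1:15, function(k) {
  mean(abs(y_test - knn_predict(x_train, y_train, x_test, k)))
})

# The k with the smallest test MAE
best_k <- which.min(MAEk)
```

With the exam's own `MAEk` vector, `which.min(MAEk)` would return the same choice that is read off the plot.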

The R code below shows a linear model being fit with several predictors, including the temperature and relative humidity of 9 locations around the house. The location codes are 1=kitchen, 2=living room, 3=laundry room, 4=office, 5=bathroom, 6=outside (north side), 7=ironing room, 8=teenager's room, 9=parents' room. So, for example, the variable T1 corresponds to the temperature in the kitchen, and RH_2 to the relative humidity in the living room. Use this output to answer questions 9 and 10 on the next page.

# train linear regression model on the rows outside the test set
mylr <- lm(Appliances~T1+RH_1+T2+RH_2+T3+RH_3+T4+RH_4+T5+RH_5+
           T6+RH_6+T7+RH_7+T8+RH_8+T9+RH_9+Hour+Month,
           data=as.data.frame(dd[-samp,]))

# view model coefficients
mylr
##
## Call:
## lm(formula = Appliances ~ T1 + RH_1 + T2 + RH_2 + T3 + RH_3 +
##     T4 + RH_4 + T5 + RH_5 + T6 + RH_6 + T7 + RH_7 + T8 + RH_8 +
##     T9 + RH_9 + Hour + Month, data = as.data.frame(dd[-samp,
##     ]))
##
## Coefficients:
## (Intercept)           T1         RH_1           T2         RH_2
##
##          T3         RH_3           T4         RH_4           T5
##
##        RH_5           T6         RH_6           T7         RH_7
##
##          T8         RH_8           T9         RH_9        Hour1
##
##       Hour2        Hour3        Hour4        Hour5        Hour6
##
##       Hour7        Hour8        Hour9       Hour10       Hour11
##
##      Hour12       Hour13       Hour14       Hour15       Hour16
##
##      Hour17       Hour18       Hour19       Hour20       Hour21
##
##      Hour22       Hour23     MonthFeb     MonthMar     MonthApr
##
##    MonthMay
##

The linear regression model shown on the previous page has an MAE of 52.75, which is worse than the best knn model. One advantage of linear regression over knn is that the model is more interpretable.

9. Use the table of model coefficients in the R output and the location codes to determine:

a) Which location has the largest predicted increase in energy use of the appliances for an increase of one degree Celsius (T), if all other variables in the model are held constant?
Location:

b) Which location has the largest reduction in predicted energy use of the appliances for an increase of 1% in relative humidity (RH), if all other variables in the model are held constant?
Location:

10. The month each measurement was recorded was included in the linear regression model as a qualitative variable with categories "Jan", "Feb", "Mar", "Apr" and "May". The category "Jan" was used as the baseline, and the model coefficients under "MonthFeb", "MonthMar", "MonthApr" and "MonthMay" correspond to the difference between the baseline "Jan" and the months of "Feb", "Mar", "Apr" and "May". Thus the predicted energy use of appliances in February may be found by adding the intercept ("Intercept") to the coefficient of "MonthFeb". Which month has the lowest predicted energy use by appliances, if all other variables in the model are held constant?
Month:

11. Besides adding more predictor variables, what (if anything) could you try changing about the linear regression model to improve its performance?

12. The linear regression model had an estimated test MSE of ____. Do you expect that the MSE of the training set for the model is larger, smaller or exactly the same as the MSE of the test set?
Circle one: larger    smaller    exactly the same

13. The test MSE of the linear regression model was higher than the test MSE of the best knn model. Do you expect that the higher MSE is a result of more bias, a higher variance in model fits or a higher variance of the irreducible error?
Circle one: bias    variance in model fits    variance of the irreducible error
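The baseline coding described in question 10 can be checked on a small example. The sketch below uses simulated data (the month effects, the single `temp` predictor, and all numbers are assumptions, not values from the exam) to show that R reports the non-baseline months as differences from "Jan", and that a prediction for February equals the intercept plus the "MonthFeb" coefficient plus the remaining terms.

```r
# Simulated data with a month factor whose first level ("Jan") is the baseline
set.seed(2)
month_levels <- c("Jan", "Feb", "Mar", "Apr", "May")
months <- factor(rep(month_levels, each = 40), levels = month_levels)
temp <- rnorm(length(months), mean = 20, sd = 2)

# Assumed "true" month effects, used only to generate the data
effect <- c(Jan = 0, Feb = -5, Mar = 3, Apr = -8, May = 1)
energy <- 100 + 2 * temp + unname(effect[as.character(months)]) +
  rnorm(length(months), sd = 4)
sim <- data.frame(energy = energy, temp = temp, Month = months)

fit <- lm(energy ~ temp + Month, data = sim)
b <- coef(fit)  # names: (Intercept), temp, MonthFeb, MonthMar, MonthApr, MonthMay

# Prediction for February at temp = 20, via predict() and by hand
new_row <- data.frame(temp = 20, Month = factor("Feb", levels = month_levels))
pred_feb <- predict(fit, newdata = new_row)
by_hand <- b["(Intercept)"] + b["temp"] * 20 + b["MonthFeb"]

# Month effects relative to the baseline; Jan is 0 by construction
month_effects <- c(Jan = 0, b[grep("^Month", names(b))])
lowest <- names(which.min(month_effects))
```

Because every month is measured against the same baseline, comparing the month coefficients (with 0 for "Jan") is enough to rank the months when the other predictors are held constant.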

Prediction Problem 2: Office Occupancy

Is a particular office occupied or not? To answer this question, the following variables were collected every minute in the office for about two weeks:

date time, year-month-day hour:minute:second
Temperature, in Celsius
Relative Humidity, in %
Light, in Lux
CO2, in ppm
Humidity Ratio, derived quantity from temperature and relative humidity, in kg water-vapor/kg air
Occupancy, "Empty" or "Occupied"

The dataset was broken into three parts: a training set and two test sets.

1. Which methods from class are appropriate to try when predicting a binary response variable such as "Occupancy" using four predictor variables such as "time of day" (hour), "light" (Lux), "CO2" (ppm) and "humidity ratio" (kg water-vapor/kg air)?

2. Based on the following histograms of three predictor variables by occupancy, do you expect that either LDA or QDA will work well to predict whether or not the office is empty? Briefly explain.

3. Which of the assumptions of logistic regression is most likely NOT satisfied by this dataset?

4. Given the following table of overall misclassification rates on test set 1, choose one of LDA, QDA, logistic regression or knns as the best method.

Table: Percentage of test points that were misclassified

Method:   knns (k=25)   Logistic   LDA     QDA
          2.1%          2.78%      2.14%   2.29%

5. Suppose that the logistic regression model had a misclassification rate of 3.07% for empty offices on test set 1 (used in question 4) and a misclassification rate of 0.71% for empty offices on test set 2 (which until now was untouched). Which estimate of the true misclassification rate of empty offices do you trust to be closer to the truth?
Circle one: 3.07% from test set 1    or    0.71% from test set 2

6. Compute a 95% confidence interval for the true misclassification rate of empty offices by the logistic regression model, using the fact that the rate was estimated to be 0.71% on a test set of size 9752, which had 7703 empty offices and 2049 occupied offices.

7. What test set size would be needed to estimate the true misclassification rate of empty offices to within 2%?
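For questions 6 and 7, one common approach is the normal-approximation (Wald) interval for a proportion. The sketch below is written under two assumptions that the exam does not state: that the relevant sample size is the 7703 empty offices, since the rate is for empty offices only, and that "within 2%" means a margin of error of 0.02 at the 95% level. The function `n_needed` is a hypothetical helper.

```r
# Estimated misclassification rate of empty offices on test set 2
p_hat   <- 0.0071
n_empty <- 7703   # assumption: only the empty offices enter this rate

# Wald 95% confidence interval: p_hat +/- z * sqrt(p_hat * (1 - p_hat) / n)
z  <- qnorm(0.975)
se <- sqrt(p_hat * (1 - p_hat) / n_empty)
ci <- c(lower = p_hat - z * se, upper = p_hat + z * se)

# Hypothetical helper: smallest n with z * sqrt(p * (1 - p) / n) <= margin
n_needed <- function(margin, p = 0.5, level = 0.95) {
  z <- qnorm(1 - (1 - level) / 2)
  ceiling(z^2 * p * (1 - p) / margin^2)
}

n_conservative <- n_needed(0.02)             # worst case, p = 0.5
n_plugin       <- n_needed(0.02, p = p_hat)  # plugging in the estimated rate
```

The worst-case choice p = 0.5 needs no prior estimate of the rate. Plugging in the estimated rate gives a much smaller size, but relies on that estimate being close to the truth.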

8. Write down the prediction model from the trained logistic regression model shown in the following R output:

model2
##
## Call:  glm(formula = Occupancy ~ Light + CO2 + HumidityRatio + hour +
##     I(hour^2), family = "binomial", data = rr)
##
## Coefficients:
##   (Intercept)          Light            CO2  HumidityRatio           hour
##
##     I(hour^2)
##
##
## Degrees of Freedom: 8142 Total (i.e. Null);  8137 Residual
## Null Deviance:      8420
## Residual Deviance: 1045    AIC:

9. Use the above R output from the logistic regression to compute the odds ratio of the office being occupied for a one Lux increase in the amount of light detected.

10. Briefly describe how 5-fold cross validation could have been used to select the best model instead of a single training and test set. You don't need to write code; just explain the idea in a sentence or two.
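The ideas behind questions 9 and 10 can be sketched together on simulated data. Everything in this example is an assumption for illustration: the simulated `Light` and `CO2` values, the coefficients used to generate occupancy, and the 0.5 classification cutoff. It shows the odds ratio for a one-unit increase in a predictor as the exponential of its logistic regression coefficient, and a 5-fold cross-validated misclassification rate.

```r
# Simulated occupancy data (not the exam's dataset)
set.seed(1)
n <- 500
sim <- data.frame(Light = runif(n, 0, 800), CO2 = runif(n, 400, 1500))
p_true <- plogis(-6 + 0.012 * sim$Light + 0.001 * sim$CO2)
sim$Occupancy <- rbinom(n, 1, p_true)

# Odds ratio for a one-unit increase in Light: exp(coefficient)
full_fit <- glm(Occupancy ~ Light + CO2, family = binomial, data = sim)
odds_ratio_light <- exp(coef(full_fit)[["Light"]])

# 5-fold cross validation: each row is assigned to exactly one fold
folds <- sample(rep(1:5, length.out = n))
fold_err <- numeric(5)
for (f in 1:5) {
  train <- sim[folds != f, ]
  held  <- sim[folds == f, ]
  fit   <- glm(Occupancy ~ Light + CO2, family = binomial, data = train)
  p_hat <- predict(fit, newdata = held, type = "response")
  # misclassification rate on the held-out fold, using a 0.5 cutoff
  fold_err[f] <- mean((p_hat > 0.5) != held$Occupancy)
}

# Average over the five held-out folds
cv_error <- mean(fold_err)
```

To compare candidate models, the same fold assignment would be reused for each model and the one with the smallest `cv_error` preferred.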


More information

Notes and Announcements

Notes and Announcements Notes and Announcements Midterm exam: Oct 20, Wednesday, In Class Late Homeworks Turn in hardcopies to Michelle. DO NOT ask Michelle for extensions. Note down the date and time of submission. If submitting

More information

Work through the sheet in any order you like. Skip the starred (*) bits in the first instance, unless you re fairly confident.

Work through the sheet in any order you like. Skip the starred (*) bits in the first instance, unless you re fairly confident. CDT R Review Sheet Work through the sheet in any order you like. Skip the starred (*) bits in the first instance, unless you re fairly confident. 1. Vectors (a) Generate 100 standard normal random variables,

More information

Lab #13 - Resampling Methods Econ 224 October 23rd, 2018

Lab #13 - Resampling Methods Econ 224 October 23rd, 2018 Lab #13 - Resampling Methods Econ 224 October 23rd, 2018 Introduction In this lab you will work through Section 5.3 of ISL and record your code and results in an RMarkdown document. I have added section

More information

The Data. Math 158, Spring 2016 Jo Hardin Shrinkage Methods R code Ridge Regression & LASSO

The Data. Math 158, Spring 2016 Jo Hardin Shrinkage Methods R code Ridge Regression & LASSO Math 158, Spring 2016 Jo Hardin Shrinkage Methods R code Ridge Regression & LASSO The Data The following dataset is from Hastie, Tibshirani and Friedman (2009), from a studyby Stamey et al. (1989) of prostate

More information

Statistics & Analysis. Fitting Generalized Additive Models with the GAM Procedure in SAS 9.2

Statistics & Analysis. Fitting Generalized Additive Models with the GAM Procedure in SAS 9.2 Fitting Generalized Additive Models with the GAM Procedure in SAS 9.2 Weijie Cai, SAS Institute Inc., Cary NC July 1, 2008 ABSTRACT Generalized additive models are useful in finding predictor-response

More information

The Power and Sample Size Application

The Power and Sample Size Application Chapter 72 The Power and Sample Size Application Contents Overview: PSS Application.................................. 6148 SAS Power and Sample Size............................... 6148 Getting Started:

More information

( ) = Y ˆ. Calibration Definition A model is calibrated if its predictions are right on average: ave(response Predicted value) = Predicted value.

( ) = Y ˆ. Calibration Definition A model is calibrated if its predictions are right on average: ave(response Predicted value) = Predicted value. Calibration OVERVIEW... 2 INTRODUCTION... 2 CALIBRATION... 3 ANOTHER REASON FOR CALIBRATION... 4 CHECKING THE CALIBRATION OF A REGRESSION... 5 CALIBRATION IN SIMPLE REGRESSION (DISPLAY.JMP)... 5 TESTING

More information

Tree-based methods for classification and regression

Tree-based methods for classification and regression Tree-based methods for classification and regression Ryan Tibshirani Data Mining: 36-462/36-662 April 11 2013 Optional reading: ISL 8.1, ESL 9.2 1 Tree-based methods Tree-based based methods for predicting

More information

1 RefresheR. Figure 1.1: Soy ice cream flavor preferences

1 RefresheR. Figure 1.1: Soy ice cream flavor preferences 1 RefresheR Figure 1.1: Soy ice cream flavor preferences 2 The Shape of Data Figure 2.1: Frequency distribution of number of carburetors in mtcars dataset Figure 2.2: Daily temperature measurements from

More information

CSE 158. Web Mining and Recommender Systems. Midterm recap

CSE 158. Web Mining and Recommender Systems. Midterm recap CSE 158 Web Mining and Recommender Systems Midterm recap Midterm on Wednesday! 5:10 pm 6:10 pm Closed book but I ll provide a similar level of basic info as in the last page of previous midterms CSE 158

More information

CSE 332 Spring 2014: Midterm Exam (closed book, closed notes, no calculators)

CSE 332 Spring 2014: Midterm Exam (closed book, closed notes, no calculators) Name: Email address: Quiz Section: CSE 332 Spring 2014: Midterm Exam (closed book, closed notes, no calculators) Instructions: Read the directions for each question carefully before answering. We will

More information

Dr. Barbara Morgan Quantitative Methods

Dr. Barbara Morgan Quantitative Methods Dr. Barbara Morgan Quantitative Methods 195.650 Basic Stata This is a brief guide to using the most basic operations in Stata. Stata also has an on-line tutorial. At the initial prompt type tutorial. In

More information

THIS IS NOT REPRESNTATIVE OF CURRENT CLASS MATERIAL. STOR 455 Midterm 1 September 28, 2010

THIS IS NOT REPRESNTATIVE OF CURRENT CLASS MATERIAL. STOR 455 Midterm 1 September 28, 2010 THIS IS NOT REPRESNTATIVE OF CURRENT CLASS MATERIAL STOR 455 Midterm September 8, INSTRUCTIONS: BOTH THE EXAM AND THE BUBBLE SHEET WILL BE COLLECTED. YOU MUST PRINT YOUR NAME AND SIGN THE HONOR PLEDGE

More information

Section 3.4: Diagnostics and Transformations. Jared S. Murray The University of Texas at Austin McCombs School of Business

Section 3.4: Diagnostics and Transformations. Jared S. Murray The University of Texas at Austin McCombs School of Business Section 3.4: Diagnostics and Transformations Jared S. Murray The University of Texas at Austin McCombs School of Business 1 Regression Model Assumptions Y i = β 0 + β 1 X i + ɛ Recall the key assumptions

More information

Non-Parametric Modeling

Non-Parametric Modeling Non-Parametric Modeling CE-725: Statistical Pattern Recognition Sharif University of Technology Spring 2013 Soleymani Outline Introduction Non-Parametric Density Estimation Parzen Windows Kn-Nearest Neighbor

More information

Lecture 16: High-dimensional regression, non-linear regression

Lecture 16: High-dimensional regression, non-linear regression Lecture 16: High-dimensional regression, non-linear regression Reading: Sections 6.4, 7.1 STATS 202: Data mining and analysis November 3, 2017 1 / 17 High-dimensional regression Most of the methods we

More information

Bias-variance trade-off and cross validation Computer exercises

Bias-variance trade-off and cross validation Computer exercises Bias-variance trade-off and cross validation Computer exercises 6.1 Cross validation in k-nn In this exercise we will return to the Biopsy data set also used in Exercise 4.1 (Lesson 4). We will try to

More information

R (2) Data analysis case study using R for readily available data set using any one machine learning algorithm.

R (2) Data analysis case study using R for readily available data set using any one machine learning algorithm. Assignment No. 4 Title: SD Module- Data Science with R Program R (2) C (4) V (2) T (2) Total (10) Dated Sign Data analysis case study using R for readily available data set using any one machine learning

More information

My dear students, Believe in yourselves. Believe in your abilities. You have got this! -Dr. M

My dear students, Believe in yourselves. Believe in your abilities. You have got this! -Dr. M 1/20 2/10 3/7 4/18 5/10 6/6 7/17 8/12 Total/100 Please do not write in the spaces above. Directions: You have 50 minutes in which to complete this exam. You must show all work, or risk losing credit. Be

More information

Section 9: One Variable Statistics

Section 9: One Variable Statistics The following Mathematics Florida Standards will be covered in this section: MAFS.912.S-ID.1.1 MAFS.912.S-ID.1.2 MAFS.912.S-ID.1.3 Represent data with plots on the real number line (dot plots, histograms,

More information

Data Mining and Knowledge Discovery: Practice Notes

Data Mining and Knowledge Discovery: Practice Notes Data Mining and Knowledge Discovery: Practice Notes Petra Kralj Novak Petra.Kralj.Novak@ijs.si 2016/01/12 1 Keywords Data Attribute, example, attribute-value data, target variable, class, discretization

More information

STA 4273H: Statistical Machine Learning

STA 4273H: Statistical Machine Learning STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Statistics! rsalakhu@utstat.toronto.edu! http://www.utstat.utoronto.ca/~rsalakhu/ Sidney Smith Hall, Room 6002 Lecture 12 Combining

More information

Weibull Reliability Analyses

Weibull Reliability Analyses Visual-XSel Software-Guide for Weibull The Weibull analysis shows the failure frequencies or the unreliability of parts and components in the Weibull-net and interprets them. Basics and more details can

More information

STATS PAD USER MANUAL

STATS PAD USER MANUAL STATS PAD USER MANUAL For Version 2.0 Manual Version 2.0 1 Table of Contents Basic Navigation! 3 Settings! 7 Entering Data! 7 Sharing Data! 8 Managing Files! 10 Running Tests! 11 Interpreting Output! 11

More information

22s:152 Applied Linear Regression

22s:152 Applied Linear Regression 22s:152 Applied Linear Regression Chapter 22: Model Selection In model selection, the idea is to find the smallest set of variables which provides an adequate description of the data. We will consider

More information

Solution to Bonus Questions

Solution to Bonus Questions Solution to Bonus Questions Q2: (a) The histogram of 1000 sample means and sample variances are plotted below. Both histogram are symmetrically centered around the true lambda value 20. But the sample

More information

This is called a linear basis expansion, and h m is the mth basis function For example if X is one-dimensional: f (X) = β 0 + β 1 X + β 2 X 2, or

This is called a linear basis expansion, and h m is the mth basis function For example if X is one-dimensional: f (X) = β 0 + β 1 X + β 2 X 2, or STA 450/4000 S: February 2 2005 Flexible modelling using basis expansions (Chapter 5) Linear regression: y = Xβ + ɛ, ɛ (0, σ 2 ) Smooth regression: y = f (X) + ɛ: f (X) = E(Y X) to be specified Flexible

More information

Box-Cox Transformation for Simple Linear Regression

Box-Cox Transformation for Simple Linear Regression Chapter 192 Box-Cox Transformation for Simple Linear Regression Introduction This procedure finds the appropriate Box-Cox power transformation (1964) for a dataset containing a pair of variables that are

More information

The linear mixed model: modeling hierarchical and longitudinal data

The linear mixed model: modeling hierarchical and longitudinal data The linear mixed model: modeling hierarchical and longitudinal data Analysis of Experimental Data AED The linear mixed model: modeling hierarchical and longitudinal data 1 of 44 Contents 1 Modeling Hierarchical

More information

Chapter 6: Linear Model Selection and Regularization

Chapter 6: Linear Model Selection and Regularization Chapter 6: Linear Model Selection and Regularization As p (the number of predictors) comes close to or exceeds n (the sample size) standard linear regression is faced with problems. The variance of the

More information

EE 511 Linear Regression

EE 511 Linear Regression EE 511 Linear Regression Instructor: Hanna Hajishirzi hannaneh@washington.edu Slides adapted from Ali Farhadi, Mari Ostendorf, Pedro Domingos, Carlos Guestrin, and Luke Zettelmoyer, Announcements Hw1 due

More information

DATA ANALYSIS I. Types of Attributes Sparse, Incomplete, Inaccurate Data

DATA ANALYSIS I. Types of Attributes Sparse, Incomplete, Inaccurate Data DATA ANALYSIS I Types of Attributes Sparse, Incomplete, Inaccurate Data Sources Bramer, M. (2013). Principles of data mining. Springer. [12-21] Witten, I. H., Frank, E. (2011). Data Mining: Practical machine

More information

CSC411/2515 Tutorial: K-NN and Decision Tree

CSC411/2515 Tutorial: K-NN and Decision Tree CSC411/2515 Tutorial: K-NN and Decision Tree Mengye Ren csc{411,2515}ta@cs.toronto.edu September 25, 2016 Cross-validation K-nearest-neighbours Decision Trees Review: Motivation for Validation Framework:

More information

Applied Regression Modeling: A Business Approach

Applied Regression Modeling: A Business Approach i Applied Regression Modeling: A Business Approach Computer software help: SPSS SPSS (originally Statistical Package for the Social Sciences ) is a commercial statistical software package with an easy-to-use

More information

Math 112 Spring 2016 Midterm 2 Review Problems Page 1

Math 112 Spring 2016 Midterm 2 Review Problems Page 1 Math Spring Midterm Review Problems Page. Solve the inequality. The solution is: x x,,,,,, (E) None of these. Which one of these equations represents y as a function of x? x y xy x y x y (E) y x 7 Math

More information