Machine Learning: Practice Midterm, Spring 2018
- Shanon Briggs
Name:

Instructions

You may use the following resources on the midterm:
1. your filled-in "methods" table with the accompanying notation page,
2. a single page of notes (8.5 by 11 in),
3. a calculator.

In this exam, you will use the methods of (statistical) machine learning to solve two prediction problems. The first problem is to predict the energy consumption of appliances in a certain house and the second is to predict whether or not a particular office is occupied. Both datasets are from the UCI Machine Learning Repository.

Please note that this exam is longer than I expect you to be able to complete in 50 minutes. The actual midterm will have a subset of the questions that are on this practice exam, possibly applied to different datasets.
Prediction Problem 1: Energy Consumption

Problem Description: The goal is to predict the energy consumption (in watt-hours, Wh) of appliances in a certain house. Here is a description of the dataset by the authors:

Data was collected every 10 min for about 4.5 months. The house temperature and humidity conditions were monitored with a ZigBee wireless sensor network. Each wireless node transmitted the temperature and humidity conditions around 3.3 min. Then, the wireless data was averaged for 10 minutes periods. The energy data was logged every 10 minutes with m-bus energy meters. Weather from the nearest airport weather station (Chievres Airport, Belgium) was downloaded from a public data set from Reliable Prognosis (rp5.ru), and merged together with the experimental data sets using the date and time column.

Here is a list of the variable names in the dataset:

date time, year-month-day hour:minute:second
Appliances, energy use in Wh
T1, Temperature in kitchen area, in Celsius
RH_1, Humidity in kitchen area, in %
T2, Temperature in living room area, in Celsius
RH_2, Humidity in living room area, in %
T3, Temperature in laundry room area
RH_3, Humidity in laundry room area, in %
T4, Temperature in office room, in Celsius
RH_4, Humidity in office room, in %
T5, Temperature in bathroom, in Celsius
RH_5, Humidity in bathroom, in %
T6, Temperature outside the building (north side), in Celsius
RH_6, Humidity outside the building (north side), in %
T7, Temperature in ironing room, in Celsius
RH_7, Humidity in ironing room, in %
T8, Temperature in teenager room 2, in Celsius
RH_8, Humidity in teenager room 2, in %
T9, Temperature in parents room, in Celsius
RH_9, Humidity in parents room, in %
To, Temperature outside (from Chievres weather station), in Celsius
Pressure (from Chievres weather station), in mm Hg
RH_out, Humidity outside (from Chievres weather station), in %
Wind speed (from Chievres weather station), in m/s
Visibility (from Chievres weather station), in km
Tdewpoint (from Chievres weather station), in Celsius
Questions: Use the R output below to answer questions 1-4.

    str(dd)
    ## 'data.frame': obs. of 26 variables:
    ## $ date :Classes 'chron', 'dates', 'times' atomic [1:19735]
    ##  .. ..- attr(*, "format")= Named chr [1:2] "m/d/y" "h:m:s"
    ##  .. .. ..- attr(*, "names")= chr [1:2] "dates" "times"
    ##  .. ..- attr(*, "origin")= Named num [1:3]
    ##  .. .. ..- attr(*, "names")= chr [1:3] "month" "day" "year"
    ## $ Appliances : int
    ## $ T1 : num
    ## $ RH_1 : num
    ## $ T2 : num
    ## $ RH_2 : num
    ## $ T3 : num
    ## $ RH_3 : num
    ## $ T4 : num
    ## $ RH_4 : num
    ## $ T5 : num
    ## $ RH_5 : num
    ## $ T6 : num
    ## $ RH_6 : num
    ## $ T7 : num
    ## $ RH_7 : num
    ## $ T8 : num
    ## $ RH_8 : num
    ## $ T9 : num
    ## $ RH_9 : num
    ## $ T_out : num
    ## $ Press_mm_hg: num
    ## $ RH_out : num
    ## $ Windspeed : num
    ## $ Visibility : num
    ## $ Tdewpoint : num

1. How many variables are in the dataset?
2. How many observations are there of each variable?
3. Which type of variable is energy consumption of appliances, quantitative or binary?
4. If our goal is to predict the energy consumption of appliances from the sensor data, what methods from Math 407 are appropriate to try?
Review the following graphs of the response variable Appliances and a few possible predictors, then answer question 5.

5. From what you can see about the dataset in the above graphs, do you believe the assumptions of knns are satisfied? (Hint: this is a trick question!)
6. Briefly explain how you could use a simulation to estimate the variance of the irreducible error in predicting the energy use from the available predictor variables. (You don't need to write any code, just explain in two or three sentences.)
7. The code snippet below shows knn being trained and tested for k=1,2,...,15. Use the resulting output to choose the best value of k to use. I choose k=

    # choose a test set
    set.seed(11)
    samp<-sample(1:nrow(dd), round(.1*nrow(dd)), replace=TRUE)
    # use the following 25 variables as predictors in knn
    pnames<-c("T1", "RH_1","T2", "RH_2", "T3", "RH_3", "T4", "RH_4","T5", "RH_5",
              "T6","RH_6","T7","RH_7","T8", "RH_8","T9","RH_9","T_out",
              "Press_mm_hg", "RH_out", "Windspeed", "Visibility", "Tdewpoint","hour")
    # storage for mean absolute error of knn for 15 different k's
    MAEk<-numeric(15)
    # train knn
    for(k in 1:15) {
      knnk<-FNN::knn.reg(train=dd[-samp,pnames], test=dd[samp,pnames],
                         y=dd[-samp,"Appliances"], k=k)
      y<-dd[samp,"Appliances"]
      errork<-(y-knnk$pred)
      MAEk[k]<-mean(abs(errork))
    }
    plot(1:15, MAEk, main="Mean Absolute Error of knn on test set",
         xlab="k", ylab="Mean Absolute Error", type='l', xlim=c(1,15))

8. What (if anything) could we try to improve the performance of knns?
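The selection loop in question 7 can be tried without the exam dataset. The sketch below is only an illustration under stated assumptions: the data are synthetic, the helper `knn_predict` is a hypothetical hand-written stand-in for `FNN::knn.reg` (so that only base R is needed), and the test set is drawn without replacement, which differs from the exam's code.

```r
# Hypothetical synthetic data standing in for dd: two predictors, one noisy response.
set.seed(1)
n <- 300
x <- data.frame(x1 = runif(n), x2 = runif(n))
y <- sin(2 * pi * x$x1) + x$x2 + rnorm(n, sd = 0.3)

# Hold out 10% of the rows as a test set (sampled without replacement here).
test_idx <- sample(seq_len(n), round(0.1 * n))
x_train <- as.matrix(x[-test_idx, ])
y_train <- y[-test_idx]
x_test  <- as.matrix(x[test_idx, ])
y_test  <- y[test_idx]

# Plain knn regression (hypothetical helper, not part of any package):
# predict each test point by averaging the responses of its k nearest training points.
knn_predict <- function(x_train, y_train, x_test, k) {
  apply(x_test, 1, function(pt) {
    d <- sqrt(rowSums(sweep(x_train, 2, pt)^2))  # Euclidean distance to every training row
    mean(y_train[order(d)[seq_len(k)]])
  })
}

# Test-set mean absolute error for k = 1, ..., 15, and the k that minimizes it.
ks <- 1:15
mae <- sapply(ks, function(k) {
  mean(abs(y_test - knn_predict(x_train, y_train, x_test, k)))
})
best_k <- ks[which.min(mae)]
```

Plotting `mae` against `ks` gives the same kind of curve the exam's `plot()` call produces; `best_k` is simply the position of its minimum.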
The R code below shows a linear model being fit with several predictors, including the temperature and relative humidity of 9 locations around the house. The location codes are 1=kitchen, 2=living room, 3=laundry room, 4=office, 5=bathroom, 6=outside (north side), 7=ironing room, 8=teenager's room, 9=parents' room, so, for example, the variable T1 corresponds to the temperature in the kitchen and RH_2 to the relative humidity in the living room. Use this output to answer questions 9 and 10 on the next page.

    # train linear regression model
    mylr<-lm(Appliances~T1+RH_1+T2+RH_2+T3+RH_3+T4+RH_4+T5+RH_5+
             T6+RH_6+T7+RH_7+T8+RH_8+T9+RH_9+Hour+Month,
             data=as.data.frame(dd[-samp,]))
    # view model coefficients
    mylr
    ## Call:
    ## lm(formula = Appliances ~ T1 + RH_1 + T2 + RH_2 + T3 + RH_3 +
    ##     T4 + RH_4 + T5 + RH_5 + T6 + RH_6 + T7 + RH_7 + T8 + RH_8 +
    ##     T9 + RH_9 + Hour + Month, data = as.data.frame(dd[-samp,
    ##     ]))
    ##
    ## Coefficients:
    ## (Intercept) T1 RH_1 T2 RH_2
    ##
    ## T3 RH_3 T4 RH_4 T5
    ##
    ## RH_5 T6 RH_6 T7 RH_7
    ##
    ## T8 RH_8 T9 RH_9 Hour1
    ##
    ## Hour2 Hour3 Hour4 Hour5 Hour6
    ##
    ## Hour7 Hour8 Hour9 Hour10 Hour11
    ##
    ## Hour12 Hour13 Hour14 Hour15 Hour16
    ##
    ## Hour17 Hour18 Hour19 Hour20 Hour21
    ##
    ## Hour22 Hour23 MonthFeb MonthMar MonthApr
    ##
    ## MonthMay
    ##
The linear regression model shown on the previous page has an MAE of 52.75, which is worse than the best knn model. One advantage of linear regression over knn is that the model is more interpretable.

9. Use the table of model coefficients in the R output and the location codes to determine:
a) which location has the largest predicted increase in energy use of the appliances for an increase of one degree Celsius (T), if all other variables in the model are held constant? Location:
b) which location has the largest reduction in predicted energy use of the appliances for an increase of 1% in relative humidity (RH), if all other variables in the model are held constant? Location:

10. The month each measurement was recorded was included in the linear regression model as a qualitative variable with categories "Jan", "Feb", "Mar", "Apr" and "May". The category "Jan" was used as the baseline, and the model coefficients under "MonthFeb", "MonthMar", "MonthApr" and "MonthMay" correspond to the difference between the baseline "Jan" and the months of "Feb", "Mar", "Apr" and "May". Thus the predicted energy use of appliances in February may be found by adding the intercept ("Intercept") to the coefficient of "MonthFeb". Which month has the lowest predicted energy use by appliances, if all other variables in the model are held constant? Month:

11. Besides adding more predictor variables, what (if anything) could you try changing about the linear regression model to improve its performance?

12. The linear regression model had an estimated test MSE of [...]. Do you expect that the MSE of the training set for the model is larger, smaller or exactly the same as the MSE of the test set?
Circle one: larger smaller exactly the same

13. The test MSE of the linear regression model was higher than the test MSE of the best knn model. Do you expect that the higher MSE is a result of more bias, a higher variance in model fits or a higher variance of the irreducible error?
Circle one: bias variance in model fits variance of the irreducible error
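The baseline coding described in question 10 can be seen on a small synthetic example. This is a minimal sketch, assuming made-up data: the data frame `toy`, its columns `energy`, `temp` and `month`, and the month effects are all hypothetical and unrelated to the exam's fitted model. R's default treatment contrasts use the first factor level as the baseline, so each month coefficient is a shift relative to that level.

```r
# Hypothetical data: three months, one numeric predictor.
set.seed(2)
month <- factor(rep(c("Jan", "Feb", "Mar"), each = 50),
                levels = c("Jan", "Feb", "Mar"))  # first level (Jan) is the baseline
temp <- rnorm(150, mean = 20, sd = 2)
# Assumed true month shifts relative to Jan: Feb = -10, Mar = +5.
month_shift <- c(0, -10, 5)[as.integer(month)]
energy <- 100 + 3 * temp + month_shift + rnorm(150, sd = 1)
toy <- data.frame(energy, temp, month)

fit <- lm(energy ~ temp + month, data = toy)
coef(fit)  # names: (Intercept), temp, monthFeb, monthMar; there is no monthJan

# With temp held fixed, the Feb and Jan predictions differ by exactly monthFeb.
new_rows <- data.frame(temp = c(20, 20),
                       month = factor(c("Jan", "Feb"), levels = levels(toy$month)))
pred <- predict(fit, newdata = new_rows)
feb_minus_jan <- unname(pred[2] - pred[1])

# The baseline has an implicit coefficient of 0 when months are compared.
month_effects <- c(Jan = 0, coef(fit)[c("monthFeb", "monthMar")])
lowest <- names(which.min(month_effects))
```

Including the baseline as an explicit 0 in `month_effects` is a design choice: it keeps the baseline month in the comparison even though it has no coefficient of its own.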
Prediction Problem 2: Office Occupancy

Is a particular office occupied or not? To answer this question, the following variables were collected every minute in the office for about two weeks:

date time, year-month-day hour:minute:second
Temperature, in Celsius
Relative Humidity, in %
Light, in Lux
CO2, in ppm
Humidity Ratio, derived quantity from temperature and relative humidity, in kgwater-vapor/kg-air
Occupancy, ("Empty" or "Occupied")

The dataset was broken into three parts: a training set and two test sets.

1. Which methods from class are appropriate to try when predicting a binary response variable such as "Occupancy" using four predictor variables such as "time of day" (hour), "light" (Lux), "CO2" (ppm) and "humidity ratio" (kgwater-vapor/kg-air)?
2. Based on the following histograms of three predictor variables by occupancy, do you expect that either LDA or QDA will work well to predict whether or not the office is empty? Briefly explain.
3. Which of the assumptions of logistic regression is most likely NOT satisfied by this dataset?

4. Given the following table of overall misclassification rates on test set 1, choose one of LDA, QDA, logistic regression or knns as the best method.

Table: Percentage of test points that were misclassified
Method:                  knns (k=25)   Logistic   LDA     QDA
Misclassification rate:  2.1%          2.78%      2.14%   2.29%

5. Suppose that the logistic regression model had a misclassification rate of 3.07% for empty offices on test set 1 (used in question 4) and a misclassification rate of 0.71% for empty offices on test set 2 (which until now was untouched). Which estimate of the true misclassification rate of empty offices do you trust to be closer to the truth?
Circle one: 3.07% from test set 1 or 0.71% from test set 2

6. Compute a 95% confidence interval for the true misclassification rate of empty offices by the logistic regression model using the fact that the rate was estimated to be 0.71% on a test set of size 9752, which had 7703 empty offices and 2049 occupied offices.

7. What test set size would be needed to estimate the true misclassification rate of empty offices to within 2%?
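Questions 6 and 7 rest on the normal-approximation interval for a proportion, $\hat{p} \pm z_{0.975}\sqrt{\hat{p}(1-\hat{p})/n}$, and on solving the margin of error for $n$. The sketch below works through both with hypothetical numbers (an error rate of 5% on 1000 test points and a target margin of 1%), not the exam's values; taking $p = 0.5$ as the worst case in the sample-size step is one common convention and is an assumption here.

```r
# Hypothetical inputs, chosen only for illustration.
p_hat <- 0.05   # observed misclassification rate
n     <- 1000   # number of test points the rate was computed on
z     <- qnorm(0.975)  # about 1.96 for a 95% interval

# Normal-approximation (Wald) interval for a proportion.
se <- sqrt(p_hat * (1 - p_hat) / n)
ci <- c(lower = p_hat - z * se, upper = p_hat + z * se)

# Sample size so that the margin of error z * sqrt(p(1-p)/n) is at most m.
# Using p = 0.5 maximizes p(1-p), so this n is sufficient for any true rate.
m <- 0.01
n_needed <- ceiling((z / m)^2 * 0.25)
n_needed  # 9604
```

If a rough value of the true rate is known in advance, replacing 0.25 with `p * (1 - p)` gives a smaller required sample size.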
8. Write down the prediction model from the trained logistic regression model shown in the following R output:

    model2
    ##
    ## Call: glm(formula = Occupancy ~ Light + CO2 + HumidityRatio + hour +
    ##     I(hour^2), family = "binomial", data = rr)
    ##
    ## Coefficients:
    ## (Intercept) Light CO2 HumidityRatio hour
    ##
    ## I(hour^2)
    ##
    ##
    ## Degrees of Freedom: 8142 Total (i.e. Null); 8137 Residual
    ## Null Deviance: 8420
    ## Residual Deviance: 1045 AIC:

9. Use the above R output from logistic regression to compute the odds ratio of the office being occupied for a one Lux increase in the amount of light detected.

10. Briefly describe how 5-fold cross validation could have been used to select the best model instead of a single training and test set. You don't need to write code, just explain the idea in a sentence or two.
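Two ideas from this page can be illustrated on synthetic data: in a logistic regression the odds ratio for a one-unit increase in a predictor is the exponential of its coefficient, and 5-fold cross validation averages the test error over five held-out folds. This is a rough sketch under stated assumptions: the data frame `toy`, its columns `occupied` and `light`, and the generating coefficients are hypothetical and do not come from the exam's `model2`.

```r
# Hypothetical data: occupancy becomes more likely as light increases.
set.seed(3)
n <- 500
light <- runif(n, 0, 600)
occupied <- rbinom(n, 1, plogis(-5 + 0.02 * light))  # assumed true log-odds
toy <- data.frame(occupied, light)

# Logistic regression; exp(coefficient) is the multiplicative change in the
# odds of occupancy for a one-unit increase in light.
fit <- glm(occupied ~ light, family = binomial, data = toy)
odds_ratio <- exp(coef(fit)["light"])

# 5-fold cross validation: assign each row to one of five folds at random,
# train on four folds, and measure misclassification on the fold left out.
folds <- sample(rep(1:5, length.out = n))
cv_err <- sapply(1:5, function(f) {
  fit_f <- glm(occupied ~ light, family = binomial, data = toy[folds != f, ])
  p <- predict(fit_f, newdata = toy[folds == f, ], type = "response")
  mean((p > 0.5) != toy$occupied[folds == f])  # classify as occupied when p > 0.5
})
cv_estimate <- mean(cv_err)
```

To compare candidate models, the same `folds` vector would be reused for each one so that every model is scored on identical splits, and the model with the lowest `cv_estimate` would be preferred.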
Section 3.4: Diagnostics and Transformations Jared S. Murray The University of Texas at Austin McCombs School of Business 1 Regression Model Assumptions Y i = β 0 + β 1 X i + ɛ Recall the key assumptions
More informationNon-Parametric Modeling
Non-Parametric Modeling CE-725: Statistical Pattern Recognition Sharif University of Technology Spring 2013 Soleymani Outline Introduction Non-Parametric Density Estimation Parzen Windows Kn-Nearest Neighbor
More informationLecture 16: High-dimensional regression, non-linear regression
Lecture 16: High-dimensional regression, non-linear regression Reading: Sections 6.4, 7.1 STATS 202: Data mining and analysis November 3, 2017 1 / 17 High-dimensional regression Most of the methods we
More informationBias-variance trade-off and cross validation Computer exercises
Bias-variance trade-off and cross validation Computer exercises 6.1 Cross validation in k-nn In this exercise we will return to the Biopsy data set also used in Exercise 4.1 (Lesson 4). We will try to
More informationR (2) Data analysis case study using R for readily available data set using any one machine learning algorithm.
Assignment No. 4 Title: SD Module- Data Science with R Program R (2) C (4) V (2) T (2) Total (10) Dated Sign Data analysis case study using R for readily available data set using any one machine learning
More informationMy dear students, Believe in yourselves. Believe in your abilities. You have got this! -Dr. M
1/20 2/10 3/7 4/18 5/10 6/6 7/17 8/12 Total/100 Please do not write in the spaces above. Directions: You have 50 minutes in which to complete this exam. You must show all work, or risk losing credit. Be
More informationSection 9: One Variable Statistics
The following Mathematics Florida Standards will be covered in this section: MAFS.912.S-ID.1.1 MAFS.912.S-ID.1.2 MAFS.912.S-ID.1.3 Represent data with plots on the real number line (dot plots, histograms,
More informationData Mining and Knowledge Discovery: Practice Notes
Data Mining and Knowledge Discovery: Practice Notes Petra Kralj Novak Petra.Kralj.Novak@ijs.si 2016/01/12 1 Keywords Data Attribute, example, attribute-value data, target variable, class, discretization
More informationSTA 4273H: Statistical Machine Learning
STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Statistics! rsalakhu@utstat.toronto.edu! http://www.utstat.utoronto.ca/~rsalakhu/ Sidney Smith Hall, Room 6002 Lecture 12 Combining
More informationWeibull Reliability Analyses
Visual-XSel Software-Guide for Weibull The Weibull analysis shows the failure frequencies or the unreliability of parts and components in the Weibull-net and interprets them. Basics and more details can
More informationSTATS PAD USER MANUAL
STATS PAD USER MANUAL For Version 2.0 Manual Version 2.0 1 Table of Contents Basic Navigation! 3 Settings! 7 Entering Data! 7 Sharing Data! 8 Managing Files! 10 Running Tests! 11 Interpreting Output! 11
More information22s:152 Applied Linear Regression
22s:152 Applied Linear Regression Chapter 22: Model Selection In model selection, the idea is to find the smallest set of variables which provides an adequate description of the data. We will consider
More informationSolution to Bonus Questions
Solution to Bonus Questions Q2: (a) The histogram of 1000 sample means and sample variances are plotted below. Both histogram are symmetrically centered around the true lambda value 20. But the sample
More informationThis is called a linear basis expansion, and h m is the mth basis function For example if X is one-dimensional: f (X) = β 0 + β 1 X + β 2 X 2, or
STA 450/4000 S: February 2 2005 Flexible modelling using basis expansions (Chapter 5) Linear regression: y = Xβ + ɛ, ɛ (0, σ 2 ) Smooth regression: y = f (X) + ɛ: f (X) = E(Y X) to be specified Flexible
More informationBox-Cox Transformation for Simple Linear Regression
Chapter 192 Box-Cox Transformation for Simple Linear Regression Introduction This procedure finds the appropriate Box-Cox power transformation (1964) for a dataset containing a pair of variables that are
More informationThe linear mixed model: modeling hierarchical and longitudinal data
The linear mixed model: modeling hierarchical and longitudinal data Analysis of Experimental Data AED The linear mixed model: modeling hierarchical and longitudinal data 1 of 44 Contents 1 Modeling Hierarchical
More informationChapter 6: Linear Model Selection and Regularization
Chapter 6: Linear Model Selection and Regularization As p (the number of predictors) comes close to or exceeds n (the sample size) standard linear regression is faced with problems. The variance of the
More informationEE 511 Linear Regression
EE 511 Linear Regression Instructor: Hanna Hajishirzi hannaneh@washington.edu Slides adapted from Ali Farhadi, Mari Ostendorf, Pedro Domingos, Carlos Guestrin, and Luke Zettelmoyer, Announcements Hw1 due
More informationDATA ANALYSIS I. Types of Attributes Sparse, Incomplete, Inaccurate Data
DATA ANALYSIS I Types of Attributes Sparse, Incomplete, Inaccurate Data Sources Bramer, M. (2013). Principles of data mining. Springer. [12-21] Witten, I. H., Frank, E. (2011). Data Mining: Practical machine
More informationCSC411/2515 Tutorial: K-NN and Decision Tree
CSC411/2515 Tutorial: K-NN and Decision Tree Mengye Ren csc{411,2515}ta@cs.toronto.edu September 25, 2016 Cross-validation K-nearest-neighbours Decision Trees Review: Motivation for Validation Framework:
More informationApplied Regression Modeling: A Business Approach
i Applied Regression Modeling: A Business Approach Computer software help: SPSS SPSS (originally Statistical Package for the Social Sciences ) is a commercial statistical software package with an easy-to-use
More informationMath 112 Spring 2016 Midterm 2 Review Problems Page 1
Math Spring Midterm Review Problems Page. Solve the inequality. The solution is: x x,,,,,, (E) None of these. Which one of these equations represents y as a function of x? x y xy x y x y (E) y x 7 Math
More information