Chapter 9 Building the Regression Model I: Model Selection and Validation
Chapter 9 Building the Regression Model I: Model Selection and Validation
許湘伶, Applied Linear Regression Models (Kutner, Nachtsheim, Neter, Li)
hsuhl (NUK), LR Chap 9
9.1 Polynomial Regression Models

From previous chapters we know how to fit models and how to make inferences from them. We now turn to the selection of predictor variables for exploratory observational studies.
Overview of Model-Building Process

Strategy for building a regression model:
1. Data collection and preparation
2. Reduction of the number of explanatory variables
3. Model refinement and selection (tentative)
4. Model validation
9.2 Surgical Unit Example

A study of a particular type of liver operation. Variables:
- X1: blood clotting score
- X2: prognostic index
- X3: enzyme function test score
- X4: liver function test score
- X5: age, in years
- X6: indicator variable for gender (0 = male; 1 = female)
- X7 and X8: indicator variables for history of alcohol use
- Y: response, survival time
- Original data: 108 patients
- Preliminary study: the first 54 patients with the first four variables
Basic plots and the correlation matrix diagnose several cases as outlying; the influence of these cases should be examined.
Residual plots flag several cases as outlying and show curvature, nonconstant error variance, and some departure from normality.
To make the distribution of the error terms more nearly normal, and to see whether the same transformation also reduces the apparent curvature, the response is transformed: Y' = ln Y.
Figure: JMP scatter plot matrix and correlation matrix when the response variable is Y' (Surgical Unit Example).
All predictors are linearly associated with Y', with X3 and X4 showing the highest degree of association and X1 the lowest. There are intercorrelations among the potential predictor variables (e.g., X4 with X1, X2, X3). Basis of the analyses: enter the predictor variables in linear terms and do not include any interaction terms. Model selection then examines whether all of the potential variables are needed or whether a subset of them is adequate.
9.3 Criteria for Model Selection

From any set of P - 1 potential predictors, 2^(P-1) alternative models can be constructed; this count includes the regression model with no X variables.
Model selection procedures have been developed to identify a small group of regression models that are good according to a specified criterion. A detailed examination can then be made of a limited number of the more promising candidate models, leading to the selection of the final regression model to be employed. We focus on six criteria: R^2_p, R^2_{a,p}, C_p, AIC_p, SBC_p, and PRESS_p (SBC_p is also known as BIC_p).
Notation for model selection:
- P - 1: the number of potential X variables in the pool
- All regression models contain an intercept term β_0 (the design matrix X contains a column of 1s), so the full model has P parameters
- p - 1: the number of X variables in a given subset (p parameters in the regression function for that subset)
- n > P and 1 ≤ p ≤ P
R^2_p or SSE_p Criterion

The R^2_p criterion identifies subsets for which R^2_p is high, or equivalently for which the error sum of squares SSE_p is small:

    R^2_p = 1 - SSE_p / SSTO

SSTO is constant for all possible regression models, so R^2_p varies inversely with SSE_p, and R^2_p can never decrease as the number of X variables increases. The R^2_p criterion is therefore not intended to identify the subset that maximizes the criterion (the maximum occurs when all P - 1 potential X variables are included); rather, the intent is to find the point where adding more X variables is not worthwhile because it leads to only a very small increase in R^2_p.
R^2_{a,p} or MSE_p Criterion

To take account of the number of parameters, use R^2_{a,p}, the adjusted coefficient of multiple determination:

    R^2_{a,p} = 1 - ((n - 1)/(n - p)) * SSE_p/SSTO = 1 - MSE_p / (SSTO/(n - 1))

For a given set of Y observations, R^2_{a,p} increases if and only if MSE_p decreases. max(R^2_{a,p}) can decrease as p increases: when the increase in max R^2_p becomes small, it is not sufficient to offset the loss of an additional degree of freedom. The R^2_{a,p} criterion seeks a few subsets for which R^2_{a,p} is at or near the maximum.
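The two criteria above can be computed directly from SSE and SSTO. The deck's code is in R; as an illustration only, here is a minimal Python sketch on simulated data (not the surgical unit data), showing that R^2_p never decreases when a pure-noise predictor is added, while R^2_{a,p} adjusts for the lost degree of freedom:

```python
import numpy as np

def r2_stats(X, y):
    """Fit OLS with an intercept; return (R^2_p, R^2_{a,p})."""
    n = len(y)
    Xd = np.column_stack([np.ones(n), X])      # design matrix, p columns
    p = Xd.shape[1]                            # number of parameters
    beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)
    sse = float(np.sum((y - Xd @ beta) ** 2))
    ssto = float(np.sum((y - y.mean()) ** 2))
    r2 = 1 - sse / ssto
    r2a = 1 - (n - 1) / (n - p) * sse / ssto   # adjusted for lost df
    return r2, r2a

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 3))                   # third column is pure noise
y = 2 + X[:, 0] - 0.5 * X[:, 1] + rng.normal(size=30)

r2_sub, r2a_sub = r2_stats(X[:, :2], y)        # subset without the noise column
r2_full, r2a_full = r2_stats(X, y)             # all three predictors
print(r2_full >= r2_sub)                       # nested models: always True
```

The comparison at the end illustrates why R^2_p alone cannot choose a subset size: it is maximized by the full pool.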
Mallows' C_p Criterion

This criterion is concerned with the total mean squared error of the n fitted values. The error in Ŷ_i decomposes into a bias component and a random error component:

    Ŷ_i - μ_i = (E{Ŷ_i} - μ_i) + (Ŷ_i - E{Ŷ_i})

so the mean squared error for Ŷ_i is

    E{Ŷ_i - μ_i}^2 = (E{Ŷ_i} - μ_i)^2 + σ^2{Ŷ_i}

The total mean squared error for all n fitted values Ŷ_i is

    Σ_{i=1}^n [(E{Ŷ_i} - μ_i)^2 + σ^2{Ŷ_i}] = Σ_{i=1}^n (E{Ŷ_i} - μ_i)^2 + Σ_{i=1}^n σ^2{Ŷ_i}
Mallows' C_p Criterion (cont.)

Γ_p, the criterion measure, is the total mean squared error divided by σ^2:

    Γ_p = (1/σ^2) [Σ_{i=1}^n (E{Ŷ_i} - μ_i)^2 + Σ_{i=1}^n σ^2{Ŷ_i}]

C_p, the estimator of Γ_p:

    C_p = SSE_p / MSE(X_1, ..., X_{P-1}) - (n - 2p)
Why C_p is an estimator of Γ_p:

    Σ_{i=1}^n σ^2{Ŷ_i} = trace(Cov(Xb)) = p σ^2

    E{SSE_p} = Σ_{i=1}^n (E{Ŷ_i} - μ_i)^2 + (n - p) σ^2

so that

    Γ_p = (1/σ^2) [Σ_{i=1}^n (E{Ŷ_i} - μ_i)^2 + Σ_{i=1}^n σ^2{Ŷ_i}] = E{SSE_p}/σ^2 - (n - 2p)

Replacing E{SSE_p} by SSE_p and σ^2 by MSE(X_1, ..., X_{P-1}) gives

    C_p = SSE_p / MSE(X_1, ..., X_{P-1}) - (n - 2p)
When E{Ŷ_i} = μ_i (no bias with the p - 1 X variables), E{C_p} ≈ p. For the regression model containing all P - 1 X variables,

    C_P = SSE(X_1, ..., X_{P-1}) / [SSE(X_1, ..., X_{P-1})/(n - P)] - (n - 2P) = P

The C_p measure assumes that MSE(X_1, ..., X_{P-1}) is an unbiased estimator of σ^2. We would like to find models for which:
1. C_p is small
2. C_p ≈ p
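The identity C_P = P for the full model follows directly from the definition. As an illustration (Python sketch on simulated data, not the book's R code or the surgical unit data), C_p can be computed for any subset from its SSE and the full-model MSE; note how dropping a truly relevant predictor inflates C_p through the bias term:

```python
import numpy as np

def sse(X, y):
    """SSE of an OLS fit with intercept."""
    n = len(y)
    Xd = np.column_stack([np.ones(n), X])
    beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)
    return float(np.sum((y - Xd @ beta) ** 2))

rng = np.random.default_rng(1)
n, Pm1 = 40, 4                        # n cases, P-1 potential predictors
X = rng.normal(size=(n, Pm1))
y = 3 + X[:, 0] + 2 * X[:, 2] + rng.normal(size=n)   # X2 (index 2) matters

P = Pm1 + 1
mse_full = sse(X, y) / (n - P)        # MSE(X1,...,X_{P-1}) estimates sigma^2

def cp(cols):
    """Mallows' C_p for the subset model using columns `cols`."""
    p = len(cols) + 1                 # parameters in the subset model
    return sse(X[:, cols], y) / mse_full - (n - 2 * p)

print(round(cp([0, 1, 2, 3]), 6))    # full model: C_P = P = 5 exactly
```

A subset that omits the strong predictor in column 2 has a large bias component, so its C_p is far above p.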
AIC_p and SBC_p Criteria

These criteria provide penalties for adding predictors:
- Akaike's information criterion: AIC_p
- Schwarz' Bayesian criterion: SBC_p (also known as the Bayesian information criterion, BIC)

    AIC_p = n ln(SSE_p) - n ln(n) + 2p        (2p is the penalty)
    SBC_p = n ln(SSE_p) - n ln(n) + (ln n) p  ((ln n) p is the penalty)

Search for models that have small values of AIC_p or SBC_p. The SBC_p criterion tends to favor more parsimonious models, since its penalty is the larger of the two whenever n ≥ 8.
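Both criteria are simple functions of SSE_p, n, and p. A minimal sketch (the numbers below are hypothetical, not from Table 9.2):

```python
import math

def aic(sse, n, p):
    # AIC_p = n ln(SSE_p) - n ln(n) + 2p
    return n * math.log(sse) - n * math.log(n) + 2 * p

def sbc(sse, n, p):
    # SBC_p (= BIC_p) = n ln(SSE_p) - n ln(n) + (ln n) p
    return n * math.log(sse) - n * math.log(n) + math.log(n) * p

# hypothetical subset fit: n = 54 cases, p = 4 parameters, SSE_p = 3.1
print(round(aic(3.1, 54, 4), 3))
print(round(sbc(3.1, 54, 4), 3))
```

Since ln(54) > 2, each extra parameter costs more under SBC_p than under AIC_p, which is why SBC_p favors smaller models.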
PRESS_p Criterion

PRESS_p, the prediction sum of squares:

    PRESS_p = Σ_{i=1}^n (Y_i - Ŷ_{i(i)})^2

It measures how well the fitted values from a subset model can predict the observed responses Y_i. Here Ŷ_{i(i)} is the fitted value for case i from a model in which case i was left out when the regression function was fitted (leave-one-out). Models with small prediction errors Y_i - Ŷ_{i(i)}, and hence small PRESS_p, are preferred. PRESS_p can be calculated without requiring n separate regression runs, each time deleting one of the n cases (Chap. 10).
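The shortcut alluded to above (developed in Chap. 10) is the leverage identity e_{i(i)} = e_i / (1 - h_ii), where h_ii is the i-th diagonal of the hat matrix. An illustrative Python sketch on simulated data, checking the shortcut against brute-force leave-one-out refits:

```python
import numpy as np

def press(X, y):
    """PRESS_p via the hat-matrix shortcut e_i / (1 - h_ii): no refits."""
    n = len(y)
    Xd = np.column_stack([np.ones(n), X])
    H = Xd @ np.linalg.inv(Xd.T @ Xd) @ Xd.T    # hat matrix
    e = y - H @ y                               # ordinary residuals
    h = np.diag(H)                              # leverages h_ii
    return float(np.sum((e / (1 - h)) ** 2))

def press_brute(X, y):
    """n separate leave-one-out refits, for checking the shortcut."""
    n = len(y)
    total = 0.0
    for i in range(n):
        keep = [j for j in range(n) if j != i]
        Xd = np.column_stack([np.ones(n - 1), X[keep]])
        beta, *_ = np.linalg.lstsq(Xd, y[keep], rcond=None)
        yhat_i = np.concatenate([[1.0], X[i]]) @ beta
        total += (y[i] - yhat_i) ** 2
    return total

rng = np.random.default_rng(2)
X = rng.normal(size=(25, 2))
y = 1 + X[:, 0] + rng.normal(size=25)
print(abs(press(X, y) - press_brute(X, y)) < 1e-6)  # the two agree
```

This is the same computation the R line with `lm.influence(...)$hat` performs in the Table 9.2 code later in the section.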
Example for the Different Criteria

Surgical unit example. Key R functions and arguments:
- leaps(): performs an exhaustive search for the best subsets of the variables in x
- method = c("Cp", "adjr2", "r2"): calculate C_p, adjusted R^2_{a,p}, or R^2_p
- int = TRUE: add an intercept to the model
- nbest: number of subsets of each size to report
- a for-loop procedure (for Table 9.2)
Example for the Different Criteria (cont.)

library(leaps)
ex <- read.table("ch09ta01.txt", header = FALSE)
colnames(ex) <- c("x1","x2","x3","x4","x5","x6","x7","x8","y","ypr")
attach(ex)
ex.r2 <- leaps(x = cbind(x1, x2, x3, x4, x5, x6, x7, x8), y = ypr,
               method = "r2", nbest = 6)
p <- seq(min(ex.r2$size), max(ex.r2$size))
ind <- as.data.frame(ex.r2[c(3:4)])
ind <- ind[with(ind, order(size, r2)), ]
plot(ind[, c(1:2)], ylab = expression(R^2), xlab = "p", col = "red", pch = 16)
Rp2 <- by(data = ex.r2[4], INDICES = factor(ex.r2$size), FUN = max)
lines(Rp2 ~ p, col = "blue")
R codes for Table 9.2

#### Using a for loop
ex <- read.table("ch09ta01.txt", header = FALSE)
colnames(ex) <- c("x1","x2","x3","x4","x5","x6","x7","x8","y","ypr")
attach(ex)
nn <- length(ex[, 1])
vars <- c("ypr", "x1", "x2", "x3", "x4")
N <- list(0, 1, 2, 3, 4)   # subset sizes
COMB <- sapply(N, function(m) combn(x = vars[2:5], m))
COMB2 <- list()
k <- 0
exlmf <- lm(ypr ~ x1 + x2 + x3 + x4)
SSEF <- deviance(aov(exlmf))
MSEF <- SSEF / (nn - 5)
res.table <- NULL
res.table0 <- NULL
for (i in seq(COMB)) {
  tmp <- COMB[[i]]
  for (j in seq(ncol(tmp))) {
R codes for Table 9.2 (cont.)

    k <- k + 1
    if (length(tmp) == 0) COMB2[[k]] <- formula("ypr ~ 1")
    if (length(tmp) > 0)
      COMB2[[k]] <- formula(paste("ypr", "~", paste(tmp[, j], collapse = " + ")))
    exlm0 <- lm(COMB2[[k]], data = ex)
    exlm <- summary(exlm0)
    np <- length(tmp[, 1]) + 1
    R2p <- exlm$r.squared
    R2ap <- exlm$adj.r.squared
    SSEp <- deviance(aov(exlm0))
    Cp <- SSEp / MSEF - (nn - 2 * np)
    AICp <- nn * log(SSEp) - nn * log(nn) + 2 * np
    BICp <- nn * log(SSEp) - nn * log(nn) + log(nn) * np
    PRESSp <- sum((resid(exlm0) / (1 - lm.influence(exlm0)$hat))^2)
    res.table <- rbind(res.table,
      c(noquote(paste(tmp[, j], collapse = " + ")),
        format(round(c(np, R2p, R2ap, Cp, AICp, BICp, PRESSp), 3), nsmall = 3)))
    res.table0 <- rbind(res.table0, c(np, R2p, R2ap, Cp, AICp, BICp, PRESSp))
  }
}
library(xtable)
res.table
R codes for Table 9.2 (cont.)

Table 9.2 layout (the numeric entries were lost in transcription):

No.  Variables            p   R^2_p  R^2_{a,p}  C_p  AIC_p  SBC_p  PRESS_p
 1   None                 1
 2   X1                   2
 3   X2                   2
 4   X3                   2
 5   X4                   2
 6   X1 + X2              3
 7   X1 + X3              3
 8   X1 + X4              3
 9   X2 + X3              3
10   X2 + X4              3
11   X3 + X4              3
12   X1 + X2 + X3         4
13   X1 + X2 + X4         4
14   X1 + X3 + X4         4
15   X2 + X3 + X4         4
16   X1 + X2 + X3 + X4    5
9.4 Automatic Search Procedures for Model Selection

The number of possible models, 2^(P-1), grows rapidly with the number of predictors, so evaluating all of the possible alternatives can be a daunting endeavor. Automatic computer-search procedures:
- "best" subsets algorithms
- stepwise regression
"Best" Subset Algorithms

Time-saving algorithms have been developed that identify the best subsets according to a specified criterion (e.g., C_p) without requiring the fitting of all possible subset regression models; only a small fraction of all possible models needs to be calculated. Several regression models can be identified as "good" for final consideration, depending on which criterion is used.
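For a small pool the exhaustive version of this search is easy to write down. An illustrative Python sketch on simulated data (not the leaps algorithm itself, which prunes the search tree; this brute-force loop mirrors what leaps computes): score every subset by Mallows' C_p and keep the best:

```python
import numpy as np
from itertools import combinations

def sse(cols, X, y):
    """SSE of the OLS fit with intercept on the given columns."""
    n = len(y)
    Xd = np.column_stack([np.ones(n)] + [X[:, j] for j in cols])
    beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)
    return float(np.sum((y - Xd @ beta) ** 2))

rng = np.random.default_rng(3)
n, k = 50, 4
X = rng.normal(size=(n, k))
y = 2 + 1.5 * X[:, 0] - X[:, 2] + rng.normal(size=n)  # columns 0 and 2 matter
mse_full = sse(range(k), X, y) / (n - k - 1)          # full-model MSE

# score all 2^k subsets by C_p, smallest first
scores = []
for m in range(k + 1):
    for cols in combinations(range(k), m):
        p = len(cols) + 1
        cp = sse(cols, X, y) / mse_full - (n - 2 * p)
        scores.append((cp, cols))
scores.sort()
print(scores[0][1])   # subset with the smallest C_p
```

With k = 4 this enumerates 16 models; the exponential growth in 2^(P-1) is exactly why pruning algorithms such as leaps matter for larger pools.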
"Best" Subset Algorithms (cont.)

The surgical unit example:
- R^2_{a,p} = 0.823: a seven- or eight-parameter model
- min(C_7), min(AIC_7), min(SBC_5), min(PRESS_5) (numeric values not recovered in transcription)
- The aim is not to identify a single best model, but to identify a small set of promising models for further study.
Selection Methods

The forward stepwise regression procedure is probably the most widely used of the automatic search methods. It develops a sequence of regression models, at each step adding or deleting an X variable (based on t or F statistics). The stepwise search procedure ends with the identification of a single regression model as "best". Different variants:
- Forward Selection; Forward Stepwise Selection
- Backward Elimination; Backward Stepwise Elimination
Forward Stepwise Selection

The forward stepwise regression search algorithm, based on t statistics and their associated P-values:
1. Fit a simple linear regression model for each of the P - 1 X variables considered for inclusion. For each, compute the t statistic for testing whether or not the slope is zero:

    t_k = b_k / s{b_k}

2. Pick the largest of the P - 1 t_k values and include the corresponding X variable in the regression model if t_k exceeds some threshold (e.g., its P-value is below the limit α_in = 0.10). If no t_k qualifies, stop the procedure.
Forward Stepwise Selection (cont.)

3. Add the second variable using partial t statistics (e.g., comparing the P-value with the limit α_in = 0.10).
4. Test whether any variable already in the model should be dropped, using partial t statistics (e.g., comparing the P-value with the limit α_out = 0.15).
5. Go to step 3 and continue the procedure until it stops.

Note that an X variable brought into the model at an early stage may be dropped subsequently if it is no longer helpful. Models chosen with small α_in are often underspecified; requiring α_in < α_out avoids cycling.
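The steps above can be sketched compactly. This is an illustrative Python implementation on simulated data, with one simplification: it uses fixed F-to-enter/F-to-remove thresholds (the classical form of stepwise regression, F = t^2) instead of P-value limits, to avoid needing an F-distribution routine. All names and thresholds here are this sketch's, not the book's:

```python
import numpy as np

def sse(cols, X, y):
    """SSE of the OLS fit with intercept on the given columns."""
    n = len(y)
    Xd = np.column_stack([np.ones(n)] + [X[:, j] for j in sorted(cols)])
    beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)
    return float(np.sum((y - Xd @ beta) ** 2))

def partial_F(cols, k, X, y):
    """Partial F (= t^2) for X_k given the other variables in cols | {k}."""
    full = set(cols) | {k}
    n, p = len(y), len(full) + 1
    sse_full = sse(full, X, y)
    return (sse(full - {k}, X, y) - sse_full) / (sse_full / (n - p))

def forward_stepwise(X, y, F_in=4.0, F_out=2.9):
    # F_in > F_out plays the role of alpha_in < alpha_out: avoids cycling
    model, pool = set(), set(range(X.shape[1]))
    while True:
        # add step: candidate with the largest partial F, if it clears F_in
        cand = max(pool - model,
                   key=lambda k: partial_F(model, k, X, y), default=None)
        if cand is None or partial_F(model, cand, X, y) < F_in:
            return sorted(model)
        model.add(cand)
        # drop step: remove the weakest variable if it falls below F_out
        weakest = min(model, key=lambda k: partial_F(model, k, X, y))
        if partial_F(model, weakest, X, y) < F_out:
            model.discard(weakest)

rng = np.random.default_rng(4)
X = rng.normal(size=(60, 5))
y = 1 + 2 * X[:, 1] + 1.5 * X[:, 3] + rng.normal(size=60)
print(forward_stepwise(X, y))   # should retain the informative columns 1 and 3
```

Note how the drop step can remove a variable admitted earlier, which is the feature that distinguishes forward stepwise selection from plain forward selection.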
Forward Stepwise Selection (cont.)

Related procedures: forward selection (no deletion step); Backward Elimination, which starts with all P - 1 X variables and deletes one at a time; and Backward Stepwise Elimination. R functions:
- step(): select a formula-based model by AIC
- fastbw(): fast backward variable selection (library(rms))
- add1() or drop1(): add or drop all possible single terms to/from a model
Model Validation

Model validation is the checking of a candidate model against independent data. Three basic ways of validating a regression model:
1. Collection of new data to check the model and its predictive ability
2. Comparison of results with theoretical expectations, earlier empirical results, or simulation results
3. Use of a holdout sample to check the model and its predictive ability
Model Validation (cont.)

Collection of New Data to Check the Model

A means of measuring the actual predictive capability of the selected regression model is to use it to predict each case in the new data set and then calculate the mean of the squared prediction errors, denoted MSPR (mean squared prediction error):

    MSPR = Σ_{i=1}^n (Y_i - Ŷ_i)^2 / n

where the sum runs over the n cases in the new data set. A practical difficulty: replicating a study is often hard.
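An illustrative Python sketch on simulated data (not the surgical unit study): fit on the model-building data, then compute MSPR on freshly generated cases and compare it with the in-sample MSE, which is typically optimistic:

```python
import numpy as np

rng = np.random.default_rng(5)

def make_data(n):
    """Hypothetical data-generating process standing in for a new study."""
    X = rng.normal(size=(n, 2))
    y = 4 + 2 * X[:, 0] - X[:, 1] + rng.normal(size=n)
    return X, y

X_build, y_build = make_data(54)   # model-building data
X_new, y_new = make_data(30)       # newly collected validation data

Xd = np.column_stack([np.ones(len(y_build)), X_build])
beta, *_ = np.linalg.lstsq(Xd, y_build, rcond=None)

# MSPR: mean squared prediction error on the new cases only
pred = np.column_stack([np.ones(len(y_new)), X_new]) @ beta
mspr = float(np.mean((y_new - pred) ** 2))

# in-sample MSE for comparison (3 parameters in the fit)
mse = float(np.sum((y_build - Xd @ beta) ** 2) / (54 - 3))
print(mspr, mse)
```

If MSPR is fairly close to the in-sample MSE, the model's apparent fit was not badly overstated; a much larger MSPR signals overfitting.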
Model Validation (cont.)

Comparison with Theory, Empirical Evidence, or Simulation Results

Unfortunately, there is often little theory that can be used to validate regression models.
Model Validation (cont.)

Data Splitting

Often called cross-validation: when the data set is large enough, split the data into two sets:
- training sample: the model-building set
- prediction set: used to evaluate the reasonableness and predictive ability of the selected model

If the entire data set is not large enough for an equal split, the validation set will need to be smaller than the model-building set. Splits of the data can be made at random. Drawback: the variances of the estimated regression coefficients developed from the model-building data set will usually be larger than those that would have been obtained from a fit to the entire data set.
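A minimal Python sketch of data splitting on simulated data (the 108/54 sizes echo the surgical unit study, but the data here are artificial): split at random, validate on the holdout half, then refit on the full data for the final model, as the next slide recommends:

```python
import numpy as np

rng = np.random.default_rng(6)
X = rng.normal(size=(108, 3))
y = 5 + X[:, 0] + 0.5 * X[:, 2] + rng.normal(size=108)

# random split into a model-building half and a validation (holdout) half
idx = rng.permutation(108)
train, valid = idx[:54], idx[54:]

def fit(Xm, ym):
    """OLS with intercept; returns the coefficient vector."""
    Xd = np.column_stack([np.ones(len(ym)), Xm])
    beta, *_ = np.linalg.lstsq(Xd, ym, rcond=None)
    return beta

beta = fit(X[train], y[train])
pred = np.column_stack([np.ones(len(valid)), X[valid]]) @ beta
mspr = float(np.mean((y[valid] - pred) ** 2))   # holdout prediction error

# once validated, refit on the entire data set for the final model
beta_final = fit(X, y)
print(round(mspr, 3))
```

The final refit addresses the drawback noted above: coefficients estimated from only the model-building half have larger variances than those from the full data.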
Model Validation (cont.)

In any case, once the model has been validated, it is customary practice to use the entire data set for estimating the final regression model.

Comments:
- double cross-validation procedure
- K-fold cross-validation
More informationBIOL 458 BIOMETRY Lab 10 - Multiple Regression
BIOL 458 BIOMETRY Lab 0 - Multiple Regression Many problems in biology science involve the analysis of multivariate data sets. For data sets in which there is a single continuous dependent variable, but
More informationLab 07: Multiple Linear Regression: Variable Selection
Lab 07: Multiple Linear Regression: Variable Selection OBJECTIVES 1.Use PROC REG to fit multiple regression models. 2.Learn how to find the best reduced model. 3.Variable diagnostics and influential statistics
More informationNonparametric Approaches to Regression
Nonparametric Approaches to Regression In traditional nonparametric regression, we assume very little about the functional form of the mean response function. In particular, we assume the model where m(xi)
More informationUsing Machine Learning to Optimize Storage Systems
Using Machine Learning to Optimize Storage Systems Dr. Kiran Gunnam 1 Outline 1. Overview 2. Building Flash Models using Logistic Regression. 3. Storage Object classification 4. Storage Allocation recommendation
More informationExploring Econometric Model Selection Using Sensitivity Analysis
Exploring Econometric Model Selection Using Sensitivity Analysis William Becker Paolo Paruolo Andrea Saltelli Nice, 2 nd July 2013 Outline What is the problem we are addressing? Past approaches Hoover
More informationSAS Workshop. Introduction to SAS Programming. Iowa State University DAY 3 SESSION I
SAS Workshop Introduction to SAS Programming DAY 3 SESSION I Iowa State University May 10, 2016 Sample Data: Prostate Data Set Example C8 further illustrates the use of all-subset selection options in
More informationInstruction on JMP IN of Chapter 19
Instruction on JMP IN of Chapter 19 Example 19.2 (1). Download the dataset xm19-02.jmp from the website for this course and open it. (2). Go to the Analyze menu and select Fit Model. Click on "REVENUE"
More informationMultiple Linear Regression
Multiple Linear Regression Rebecca C. Steorts, Duke University STA 325, Chapter 3 ISL 1 / 49 Agenda How to extend beyond a SLR Multiple Linear Regression (MLR) Relationship Between the Response and Predictors
More informationSTAT 2607 REVIEW PROBLEMS Word problems must be answered in words of the problem.
STAT 2607 REVIEW PROBLEMS 1 REMINDER: On the final exam 1. Word problems must be answered in words of the problem. 2. "Test" means that you must carry out a formal hypothesis testing procedure with H0,
More informationCPSC 340: Machine Learning and Data Mining
CPSC 340: Machine Learning and Data Mining Feature Selection Original version of these slides by Mark Schmidt, with modifications by Mike Gelbart. Admin Assignment 3: Due Friday Midterm: Feb 14 in class
More informationModel Assessment and Selection. Reference: The Elements of Statistical Learning, by T. Hastie, R. Tibshirani, J. Friedman, Springer
Model Assessment and Selection Reference: The Elements of Statistical Learning, by T. Hastie, R. Tibshirani, J. Friedman, Springer 1 Model Training data Testing data Model Testing error rate Training error
More informationThe problem we have now is called variable selection or perhaps model selection. There are several objectives.
STAT-UB.0103 NOTES for Wednesday 01.APR.04 One of the clues on the library data comes through the VIF values. These VIFs tell you to what extent a predictor is linearly dependent on other predictors. We
More informationThe Truth behind PGA Tour Player Scores
The Truth behind PGA Tour Player Scores Sukhyun Sean Park, Dong Kyun Kim, Ilsung Lee May 7, 2016 Abstract The main aim of this project is to analyze the variation in a dataset that is obtained from the
More informationDATA MINING AND MACHINE LEARNING. Lecture 6: Data preprocessing and model selection Lecturer: Simone Scardapane
DATA MINING AND MACHINE LEARNING Lecture 6: Data preprocessing and model selection Lecturer: Simone Scardapane Academic Year 2016/2017 Table of contents Data preprocessing Feature normalization Missing
More informationLeveling Up as a Data Scientist. ds/2014/10/level-up-ds.jpg
Model Optimization Leveling Up as a Data Scientist http://shorelinechurch.org/wp-content/uploa ds/2014/10/level-up-ds.jpg Bias and Variance Error = (expected loss of accuracy) 2 + flexibility of model
More informationTHIS IS NOT REPRESNTATIVE OF CURRENT CLASS MATERIAL. STOR 455 Midterm 1 September 28, 2010
THIS IS NOT REPRESNTATIVE OF CURRENT CLASS MATERIAL STOR 455 Midterm September 8, INSTRUCTIONS: BOTH THE EXAM AND THE BUBBLE SHEET WILL BE COLLECTED. YOU MUST PRINT YOUR NAME AND SIGN THE HONOR PLEDGE
More informationFurther Maths Notes. Common Mistakes. Read the bold words in the exam! Always check data entry. Write equations in terms of variables
Further Maths Notes Common Mistakes Read the bold words in the exam! Always check data entry Remember to interpret data with the multipliers specified (e.g. in thousands) Write equations in terms of variables
More informationOn Choosing the Right Coordinate Transformation Method
On Choosing the Right Coordinate Transformation Method Yaron A. Felus 1 and Moshe Felus 1 The Survey of Israel, Tel-Aviv Surveying Engineering Department, MI Halperin Felus Surveying and Photogrammetry
More informationPart I. Hierarchical clustering. Hierarchical Clustering. Hierarchical clustering. Produces a set of nested clusters organized as a
Week 9 Based in part on slides from textbook, slides of Susan Holmes Part I December 2, 2012 Hierarchical Clustering 1 / 1 Produces a set of nested clusters organized as a Hierarchical hierarchical clustering
More informationBootstrapping Methods
Bootstrapping Methods example of a Monte Carlo method these are one Monte Carlo statistical method some Bayesian statistical methods are Monte Carlo we can also simulate models using Monte Carlo methods
More informationUsing Multivariate Adaptive Regression Splines (MARS ) to enhance Generalised Linear Models. Inna Kolyshkina PriceWaterhouseCoopers
Using Multivariate Adaptive Regression Splines (MARS ) to enhance Generalised Linear Models. Inna Kolyshkina PriceWaterhouseCoopers Why enhance GLM? Shortcomings of the linear modelling approach. GLM being
More informationNetwork Management System Dimensioning with Performance Data. Kaisa Tuisku
Network Management System Dimensioning with Performance Data Kaisa Tuisku University of Tampere School of Information Sciences Computer Science M. Sc. thesis Supervisor: Jorma Laurikkala June 2016 i University
More informationData-Splitting Models for O3 Data
Data-Splitting Models for O3 Data Q. Yu, S. N. MacEachern and M. Peruggia Abstract Daily measurements of ozone concentration and eight covariates were recorded in 1976 in the Los Angeles basin (Breiman
More informationIntroduction to hypothesis testing
Introduction to hypothesis testing Mark Johnson Macquarie University Sydney, Australia February 27, 2017 1 / 38 Outline Introduction Hypothesis tests and confidence intervals Classical hypothesis tests
More informationLasso Regression: Regularization for feature selection
Lasso Regression: Regularization for feature selection CSE 416: Machine Learning Emily Fox University of Washington April 12, 2018 Symptom of overfitting 2 Often, overfitting associated with very large
More informationHyperparameters and Validation Sets. Sargur N. Srihari
Hyperparameters and Validation Sets Sargur N. srihari@cedar.buffalo.edu 1 Topics in Machine Learning Basics 1. Learning Algorithms 2. Capacity, Overfitting and Underfitting 3. Hyperparameters and Validation
More informationTopics in Machine Learning
Topics in Machine Learning Gilad Lerman School of Mathematics University of Minnesota Text/slides stolen from G. James, D. Witten, T. Hastie, R. Tibshirani and A. Ng Machine Learning - Motivation Arthur
More informationNetwork Traffic Measurements and Analysis
DEIB - Politecnico di Milano Fall, 2017 Sources Hastie, Tibshirani, Friedman: The Elements of Statistical Learning James, Witten, Hastie, Tibshirani: An Introduction to Statistical Learning Andrew Ng:
More informationCLASSIFICATION WITH RADIAL BASIS AND PROBABILISTIC NEURAL NETWORKS
CLASSIFICATION WITH RADIAL BASIS AND PROBABILISTIC NEURAL NETWORKS CHAPTER 4 CLASSIFICATION WITH RADIAL BASIS AND PROBABILISTIC NEURAL NETWORKS 4.1 Introduction Optical character recognition is one of
More informationSYS 6021 Linear Statistical Models
SYS 6021 Linear Statistical Models Project 2 Spam Filters Jinghe Zhang Summary The spambase data and time indexed counts of spams and hams are studied to develop accurate spam filters. Static models are
More informationChapter 7: Dual Modeling in the Presence of Constant Variance
Chapter 7: Dual Modeling in the Presence of Constant Variance 7.A Introduction An underlying premise of regression analysis is that a given response variable changes systematically and smoothly due to
More informationNonparametric Importance Sampling for Big Data
Nonparametric Importance Sampling for Big Data Abigael C. Nachtsheim Research Training Group Spring 2018 Advisor: Dr. Stufken SCHOOL OF MATHEMATICAL AND STATISTICAL SCIENCES Motivation Goal: build a model
More informationST512. Fall Quarter, Exam 1. Directions: Answer questions as directed. Please show work. For true/false questions, circle either true or false.
ST512 Fall Quarter, 2005 Exam 1 Name: Directions: Answer questions as directed. Please show work. For true/false questions, circle either true or false. 1. (42 points) A random sample of n = 30 NBA basketball
More informationWeek 4: Simple Linear Regression III
Week 4: Simple Linear Regression III Marcelo Coca Perraillon University of Colorado Anschutz Medical Campus Health Services Research Methods I HSMP 7607 2017 c 2017 PERRAILLON ARR 1 Outline Goodness of
More informationRobust Linear Regression (Passing- Bablok Median-Slope)
Chapter 314 Robust Linear Regression (Passing- Bablok Median-Slope) Introduction This procedure performs robust linear regression estimation using the Passing-Bablok (1988) median-slope algorithm. Their
More informationChapter 7: Numerical Prediction
Ludwig-Maximilians-Universität München Institut für Informatik Lehr- und Forschungseinheit für Datenbanksysteme Knowledge Discovery in Databases SS 2016 Chapter 7: Numerical Prediction Lecture: Prof. Dr.
More informationIQR = number. summary: largest. = 2. Upper half: Q3 =
Step by step box plot Height in centimeters of players on the 003 Women s Worldd Cup soccer team. 157 1611 163 163 164 165 165 165 168 168 168 170 170 170 171 173 173 175 180 180 Determine the 5 number
More informationPackage GWRM. R topics documented: July 31, Type Package
Type Package Package GWRM July 31, 2017 Title Generalized Waring Regression Model for Count Data Version 2.1.0.3 Date 2017-07-18 Maintainer Antonio Jose Saez-Castillo Depends R (>= 3.0.0)
More informationExample 1 of panel data : Data for 6 airlines (groups) over 15 years (time periods) Example 1
Panel data set Consists of n entities or subjects (e.g., firms and states), each of which includes T observations measured at 1 through t time period. total number of observations : nt Panel data have
More information
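The model-building strategy above — in particular, step 2 (reducing the pool of explanatory variables) — can be illustrated with an all-possible-subsets search over the four preliminary predictors. This is a hedged sketch, not the textbook's code: the data below are synthetic stand-ins for the surgical unit variables X1–X4, and adjusted R² is only one of the selection criteria (others include Mallows' Cp, AIC, and PRESS).

```python
# Sketch: all-possible-subsets selection, ranking candidate models by
# adjusted R^2. Synthetic data; assumes the "true" model uses only
# columns 0 and 2 (stand-ins for X1 and X3).
import itertools
import numpy as np

rng = np.random.default_rng(0)
n = 54                       # preliminary-study size from the example
X = rng.normal(size=(n, 4))  # synthetic stand-ins for X1..X4
y = 2.0 + 1.5 * X[:, 0] + 0.8 * X[:, 2] + rng.normal(scale=0.5, size=n)

def adj_r2(cols):
    """Fit OLS on the chosen columns (plus intercept); return adjusted R^2."""
    Z = np.column_stack([np.ones(n), X[:, list(cols)]])
    beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
    resid = y - Z @ beta
    sse = resid @ resid
    sst = ((y - y.mean()) ** 2).sum()
    p = Z.shape[1]            # number of estimated parameters
    return 1 - (sse / (n - p)) / (sst / (n - 1))

# Enumerate all 2^4 - 1 nonempty subsets and keep the best one.
subsets = [c for r in range(1, 5) for c in itertools.combinations(range(4), r)]
best = max(subsets, key=adj_r2)
print(best)
```

With only four candidate predictors the full enumeration (15 subsets) is cheap; for larger pools, Chapter 9's stepwise procedures trade exhaustiveness for speed. Note that adjusted R² can still admit a weak extra variable, which is why the tentative model from step 3 is checked against fresh data in step 4 (model validation).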