Statistical Modelling for Social Scientists
Manchester University, January 20, 21 and 24
Exploratory regression and model selection
Statistical Modelling for Social Scientists
Manchester University, January 20, 21 and 24, 2011
Graeme Hutcheson, University of Manchester

Exploratory regression and model selection

The lecture notes, exercises and data sets associated with this course are available for download from:

A number of management research projects aim to derive predictive and/or explanatory models from a large number of variables, data that is usually collected from questionnaires. The aim of such analyses is typically to identify those variables that may influence certain response variables (e.g., regression models) or to identify the fit of the data to a proposed theoretical structure (e.g., structural equation models). Such research often requires a final model to be selected and inferences made about the population from this model. This session deals with the selection of a final model (i.e., the model that is used as the basis for interpreting the relationships in the population) from the viewpoint of a researcher using regression models. It should be noted, however, that many of these issues also apply to the selection of models using other techniques, and particularly apply to structural equation models.

Building Regression Models

"All models are wrong. Some are useful." (George Box, quoted in Gill)

The aim of any modelling procedure should be to obtain a model that represents the relationships in the population rather than a specific sample. To this end we want to try and capture the underlying trends and relationships in the data that can reveal processes in the population. If a model contains too few variables, it will not have the information necessary for it to adequately describe what is going on in the population; the model will be under-fitted. On the other hand, if a model contains too many variables, it will contain more information than is necessary to describe what is going on
in the population; the model will have too many parameters and will be over-fitted. The task for the analyst is to select the optimum number of variables needed to describe the population.

Figure 1 shows a graphical illustration of under- and over-fitted models. We may consider panel A as representing the relationship in the population whilst panels B to D represent the models that have been estimated from a number of samples. Panel B shows models that have been based on too few explanatory variables - in this case the models are too simplistic to capture the form of the underlying relationship; this model is under-fitted. Panel D shows models that have been estimated using more explanatory variables than are necessary - in this case the models are too complex to adequately represent the underlying form of the relationship; this model is over-fitted. Panel C shows a more appropriate model, where the level of complexity in the population is better represented by the estimated models.

[Figure 1: four panels. A: the actual relationship between Y and x. B: 1st-order models, Y = α + βx. C: 2nd-order models, Y = α + β₁x + β₂x². D: 5th-order models, Y = α + β₁x + ... + β₅x⁵.]

Figure 1: This graphic is adapted from Burnham and Anderson (2002, pg. 34). It shows a relationship between Y and x which has been modelled using a number of Monte Carlo simulations. The simple 1st-order polynomial model clearly misidentifies the basic structure of the actual relationship, is under-fitted and unsatisfactory. The 5th-order polynomial model, on the other hand, has too many parameters, an unnecessarily large variance, and will have poor predictive qualities because it is unstable (over-fitted). For this relationship a 2nd-order polynomial seems to be quite a good approximating model.

In general, the best approximating model is achieved by properly balancing the errors of under-fitting and over-fitting. In other words, the model should be parsimonious. An example of the problems associated with over-fitting can be seen in Table 1, where the addition of the variable Gender merely adds variance. Although Model 1 has a larger R² value, which we would expect as it contains a greater number of parameters, the F statistics show that the smaller model actually provides a more significant linear prediction of Quality. The inclusion of Gender in the model does not improve the prediction of Quality and it can therefore be omitted without any significant loss of power. Ideally, only those variables which contribute significantly to the prediction of the response variable should be retained. The removal of unimportant variables results in a simpler model which helps in interpretation and often provides a clearer insight into the way the response variable varies as a function of changes in the explanatory variables. In general, a good model should enable an accurate prediction to be made of the response variable, but only contain those explanatory variables which play a significant role.

Table 1: Model Selection

              Coefficient   s.e.   t   P
Model 1
  Delay
  Gender
  (constant)
Model 2
  Delay
  (constant)

Model 1: Quality = (Delay) 1.108(Gender); F2,67 = 3.358, P = 0.041, R² =
Model 2: Quality = (Delay); F1,68 = 6.611, P = 0.012, R² =

In addition to selecting the correct variables for the model, it is also important to appropriately interpret the model parameters (i.e., exactly what is the relationship between each explanatory variable and the response?), particularly for models containing multiple explanatory variables. In order to understand some of these difficulties, it is necessary to have a thorough knowledge of multicollinearity.
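The comparison in Table 1 rests on the partial-F test for nested models. The sketch below shows the calculation with plain NumPy on simulated data of the same shape (n = 70, so the two models have 67 and 68 residual degrees of freedom); Quality, Delay and Gender here are invented stand-ins, not the course data, and Gender genuinely plays no role.

```python
import numpy as np

def rss(X, y):
    """Residual sum of squares from an OLS fit of y on X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return float(np.sum((y - X @ beta) ** 2))

def partial_f(rss_small, rss_big, df_small, df_big):
    """Partial-F statistic for a bigger model against a nested smaller one;
    df_small and df_big are residual degrees of freedom."""
    return ((rss_small - rss_big) / (df_small - df_big)) / (rss_big / df_big)

rng = np.random.default_rng(7)
n = 70
delay = rng.normal(size=n)
gender = rng.integers(0, 2, size=n).astype(float)
quality = 1.0 - 0.4 * delay + rng.normal(scale=1.0, size=n)

X_small = np.column_stack([np.ones(n), delay])           # Quality ~ Delay
X_big = np.column_stack([np.ones(n), delay, gender])     # Quality ~ Delay + Gender
f = partial_f(rss(X_small, quality), rss(X_big, quality), n - 2, n - 3)
print(f)   # an F(1, 67) value; typically small, since Gender adds nothing real
```

A non-significant partial-F like this is exactly the justification for preferring the smaller model.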
Multicollinearity

Multicollinearity describes a situation where an explanatory variable is related to one or more of the other explanatory variables in the model. If these relationships are perfect or very strong, the calculation of the regression model and the appropriate interpretation of the results can be affected. In the case where one explanatory variable can be precisely predicted from one or more of the other explanatory variables (perfect multicollinearity), the analysis fails as a regression equation cannot even be formulated. When a relationship is strong, but not perfect (high multicollinearity), the regression equation can be formulated, but the parameters may be unreliable. Parameters which are unreliable can change dramatically as a result of relatively minor changes in the data set, with the addition or deletion of a small number of observations exerting a large influence on the regression equation and, subsequently, on the interpretation of the results.
The consequences of multicollinearity depend, to some degree, on the objectives of the analysis. If the goal is prediction, then multicollinearity need not present much of a problem, as it primarily affects the calculated importance of the explanatory variables. However, if the goal is explanation (that is, the aim is to identify the strength of relationships between individual explanatory variables and the response variable), the presence of a high degree of multicollinearity poses a serious problem for the correct interpretation of the results. When conducting a multiple regression, one has to identify when multicollinearity is likely to present a problem and decide upon a strategy to deal with it.

Perfect Multicollinearity

Perfect multicollinearity occurs when an explanatory variable can be precisely predicted from other explanatory variables in the model. When this happens, the variable contributes no unique information to the model and is therefore redundant. The inclusion of one or more redundant explanatory variables in a regression model is problematic as it is not possible to determine the parameters associated with these variables and, consequently, a regression equation cannot even be formulated. This problem can be demonstrated by looking at a three-variable relationship, which can be represented algebraically as a plane using the three variables y, x and z:

y = α + β₁x + β₂z

If one of these explanatory variables is redundant (say, x = 5z) then y can be described in terms of a single variable (x or z), which is represented algebraically as a line:

y = α + β₁(5z) + β₂z

or, equivalently,

y = α + (5β₁ + β₂)z

The regression procedure attempts to calculate parameters for x and z, but since there is only information about one of them, only one coefficient can be computed and the regression technique breaks down (for a detailed discussion of this see Berry and Feldman, 1993, and Maddala, 1992).
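The x = 5z case can also be seen numerically: the design matrix has three columns but only two dimensions of information, so the normal equations are singular. A small NumPy sketch (all values invented for illustration):

```python
import numpy as np

z = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
x = 5.0 * z                            # x is exactly predictable from z
y = 2.0 + 3.0 * x + 1.0 * z            # any response will do for the demonstration

# Design matrix: an intercept column plus x and z
X = np.column_stack([np.ones_like(z), x, z])

print(np.linalg.matrix_rank(X))        # 2: three columns, only two dimensions
print(np.linalg.det(X.T @ X))          # effectively zero: X'X cannot be inverted
```

Because X'X is singular, no unique coefficient vector exists, which is exactly why the procedure "breaks down" rather than returning a poor answer.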
In essence, when there is perfect multicollinearity the regression parameters cannot be estimated, as the procedure attempts to fit an equation which has more dimensions than are present in the data. In practice, perfect multicollinearity is not usually a problem as it is quite rare and can be readily detected. In fact, many statistical analysis packages automatically alert the user to its presence. A more serious problem for the analyst is the presence of high multicollinearity, where a regression model can be formulated but the parameters associated with some of the explanatory variables may be unreliable.

High Multicollinearity

When explanatory variables are highly (but not perfectly) related to one or more of the other explanatory variables in the model, it becomes difficult to disentangle the separate effect of each variable. As a variable which shows a high degree of multicollinearity provides little unique information, the regression coefficient associated with it is also based on limited information and therefore
tends to have a large standard error (for detailed discussions of this refer to Afifi and Clark, 1996, and Edwards, 1985). In such cases the regression parameters are unlikely to accurately reflect the impact that xᵢ has on y in the population. The problems associated with high multicollinearity can be demonstrated using hypothetical data which show the relationship between the number of college places offered to students and marks obtained in two compulsory subjects, English and Mathematics (see Table 2)¹.

Table 2: Exam Marks and Offers of College Places

Number of Colleges    English    Mathematics
Offering Places       (%)        (%)

One would expect there to be a strong relationship between the number of college places a student is offered and the student's marks in English and Mathematics, as the decision to offer a place at a college is based largely on the student's academic performance. One would also expect a student's mark in one subject to be strongly related to their mark in the other subject, as good students tend to score relatively highly in both. This three-variable relationship is shown in Figure 2 and the associated regression model in Equation 1 and Table 3. The model appears to provide a good prediction of the number of college places offered to a student, as indicated by the F and R² statistics (F2,8 = , P < , R² = 0.890), which correspond to area A + B + C in Figure 2. From Figure 2 it can be seen that the unique contributions made by each of the explanatory variables to the number of college places offered are relatively small. When controlling for marks in Mathematics, marks in English only contribute a small amount to the model fit (area C). Similarly, when controlling for marks in English, marks in Mathematics only contribute a small amount to the model fit (area A).
The results in Table 3 confirm this and show that the unique contribution of each of the explanatory variables when they are both entered into the model (Model 1) is not significant, as shown by the t statistics. It appears clear from the F and R² statistics that Model 1 provides a good fit, even though neither of the explanatory variables is significant. This, perhaps unexpected, result is due to the high degree of multicollinearity between the explanatory variables. Logically we might expect marks in both English and Mathematics to be strongly related to the number of college places offered, as places are offered mainly on the basis of academic performance. This is what we find when simple regression models are calculated using single subjects to predict college places (Models 2 and 3). The resulting models fit almost as well as the model which uses both variables, but the regression parameters for the explanatory variables are now highly significant. We can see that the presence of multicollinearity has not really affected the predictive power of the model, but it has serious implications for the interpretation of the importance of the explanatory variables.

¹ For this example we will assume that all students applied to the same 10 colleges.
Figure 2: The relationship between exam marks and the number of college places offered

College places offered = α + β₁(English) + β₂(Mathematics)    (1)

Table 3: Modelling the Number of College Places Offered to Students

              Coefficient   s.e.   t   P
Model 1
  English
  Mathematics
  (constant)
Model 2
  English
  (constant)
Model 3
  Mathematics
  (constant)

Model 1: Places offered = α + β₁(English) + β₂(Mathematics); F2,8 = 32.52, P < , R² =
Model 2: Places offered = α + β(English); F1,9 = , P < , R² =
Model 3: Places offered = α + β(Mathematics); F1,9 = , P < , R² =
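The pattern in Table 3 (a joint model that fits well overall while the individual coefficients lose significance) is easy to reproduce. The sketch below simulates two highly related subject marks with plain NumPy; the numbers and seed are invented for illustration and are not the course data.

```python
import numpy as np

def fit_ols(X, y):
    """OLS fit returning (R^2, standard errors of the coefficients)."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    df = X.shape[0] - X.shape[1]
    sigma2 = float(resid @ resid) / df
    se = np.sqrt(np.diag(sigma2 * np.linalg.inv(X.T @ X)))
    r2 = 1.0 - float(resid @ resid) / float(np.sum((y - y.mean()) ** 2))
    return r2, se

rng = np.random.default_rng(3)
n = 30
maths = rng.normal(60, 10, size=n)
english = 0.95 * maths + rng.normal(0, 2, size=n)    # highly collinear with maths
places = english + maths + rng.normal(0, 5, size=n)  # both subjects really matter

ones = np.ones(n)
r2_both, se_both = fit_ols(np.column_stack([ones, english, maths]), places)
r2_eng, se_eng = fit_ols(np.column_stack([ones, english]), places)

# The joint model barely improves on the single-subject model...
print(round(r2_both, 3), round(r2_eng, 3))
# ...but the standard error of the English coefficient is far larger
# when Mathematics is also in the model:
print(round(se_both[1] / se_eng[1], 1))
```

The inflated standard error is what drives the t statistics in Model 1 towards non-significance, even though the overall fit is excellent.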
Identifying Instances of Multicollinearity

Some instances of multicollinearity can be identified by inspecting pair-wise correlation coefficients. Relationships between explanatory variables which are of the order of about 0.8 or larger indicate a level of multicollinearity that may prove to be problematic². In the above example, the correlation between English and Mathematics scores indicates that multicollinearity may be a problem for these data. Whilst this approach is quite adept at identifying problematic relationships between pairs of explanatory variables, it cannot always identify those instances where a combination of more than one variable predicts another. These relationships can, however, be determined using R² values to show the degree to which each explanatory variable can be explained using the other explanatory variables in the model. As with pair-wise correlations, we cannot say with any certainty how high the value of R² must be before multicollinearity is viewed as a cause for concern but, typically, values of about 0.8 or higher are taken as being indicative of a degree of multicollinearity which may be problematic. In the example above, if we predict a student's Mathematics score using their English score we obtain a regression model with an R² value of 0.804, which indicates that a problematic level of multicollinearity may be present.

Calculating individual R² values for each explanatory variable in the model is a useful method of identifying instances of multicollinearity, but it can be quite a lengthy process if there are a number of variables. It is, however, not necessary to compute these R² values manually, as a number of analysis packages provide equivalent information through the tolerance and variance inflation factor (VIF) statistics shown in Equations 2 and 3 (the reason that this statistic is called the variance inflation factor is clearly explained in Weisberg, 1985, page 198).
Tolerance(βᵢ) = 1 − R²ᵢ    (2)

where βᵢ is the regression coefficient for variable i, and R²ᵢ is the squared multiple correlation coefficient between xᵢ and the other explanatory variables.

VIF(βᵢ) = 1 / Tolerance(βᵢ)    (3)

We can see from Equation 3 that for a simple regression model, an R² value of 0.8 will result in a VIF value of 5 and a tolerance value of 0.2. Any explanatory variables which have a VIF value of 5 or more, or a tolerance of 0.2 or less, are therefore of interest as they show a degree of multicollinearity which could be problematic. Table 4 shows the regression analysis of the example data set with VIF and tolerance values of a high enough level to be of concern. It should be noted that, as there are only two explanatory variables, the statistics for both variables are the same.

The tolerance and VIF statistics are based on the R² measure and therefore assume that the data are continuous. It is, however, possible to use these statistics on discontinuous data provided that the variables have been coded appropriately. Fox and Monette (1992) generalized the notion of variance inflation to related sets of regressors (e.g., different categories of one variable) and proposed the generalized variance inflation factor (GVIF). The interpretation of GVIF^(1/(2p)) is the decrease in precision of the estimation due to collinearity. For example, if GVIF (or VIF) = 4, the square root of this is 2, which indicates that the confidence intervals for these predictors are twice what they would have been for uncorrelated predictors. The use of the GVIF greatly increases the usefulness of these techniques as it enables problematic relationships between all types of variables to be identified.

² 0.8 is an arbitrary figure and is used here because it is commonly quoted in a number of texts. It should be noted, however, that correlations smaller than 0.8 can also cause problems for the regression procedure.
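Equations 2 and 3 can be transcribed directly, which makes it easy to check the figures quoted in the text (the function names below are mine, not from the notes):

```python
def tolerance(r_squared_i):
    """Equation 2: tolerance of predictor i, where r_squared_i is the R^2
    from regressing x_i on the other explanatory variables."""
    return 1.0 - r_squared_i

def vif(r_squared_i):
    """Equation 3: the variance inflation factor is the reciprocal tolerance."""
    return 1.0 / tolerance(r_squared_i)

print(tolerance(0.8), vif(0.8))   # the 0.2 / 5 thresholds used in the text
print(vif(0.804))                 # the English-Mathematics example: just over 5
print(4 ** 0.5)                   # GVIF (or VIF) of 4: intervals twice as wide
```
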
Table 4: VIF and Tolerance Statistics

              Coefficient   s.e.   t   P   Tolerance   VIF
English
Mathematics

Places offered = α + β₁(English) + β₂(Mathematics); F2,8 = 32.52, P < , R² =

It should be noted, however, that these statistics can only give a rough indication of which relationships may be problematic: they do not provide any proof that multicollinearity will be a problem, nor do they identify all instances of problematic relationships. The tolerance and VIF statistics merely provide a convenient method for identifying at least some of the relationships of concern. The R commands to obtain collinearity statistics for an OLS regression model are as follows:

Computing VIF values

Data set (available for download from RGSweb): colleges.txt

Rcmdr: commands

Statistics → Fit models → Linear models...
  Linear Model
  Variables (double click to formula): colleges, english and maths
  Model formula: colleges ~ english + maths
  OK
Models → Numerical diagnostics → Variance Inflation Factors...

Rcmdr: output

> vif(LinearModel.1)
ENGLISH   MATHS

Dealing with Multicollinearity

There are a number of ways in which multicollinearity can be reduced in a data set. These methods include:

1. Collect more data

As multicollinearity is a problem which results from insufficient information in the sample, one solution is to increase the amount of information by collecting more data. As more data is
collected and the sample size increases, the standard error tends to decrease, which reduces the effect of multicollinearity (see Berry and Feldman, 1993). Although increasing the amount of data is an attractive option and one of the best methods of reducing multicollinearity (at least when the data set is relatively small), it is, in many instances, not practical or possible, so other less attractive methods need to be considered.

2. Collapse variables

One option for reducing the level of multicollinearity is to combine two or more explanatory variables which are highly correlated into a single composite variable. This approach is, however, only reasonable when the explanatory variables are indicators of the same underlying concept. For example, using the data in Table 2, it makes theoretical sense to combine the two explanatory variables (marks in English and Mathematics) into a single index of academic performance. This single index could simply be the sum of the two scores, or the average score for the two subjects. The use of a composite variable in the regression model enables one to assess the contribution made by academic performance to the number of college places offered, without the problem of the high degree of multicollinearity which existed between the English and Mathematics scores.

The process of combining variables into latent variables (or factors, as they are sometimes called) is not always as straightforward as in the example shown above, where two variables were related in quite an obvious way and could be combined easily into a single index. If there are a number of variables which are inter-related, it might be appropriate to first identify any latent variables in the sample using factor analysis and then enter these into the regression model.
The technique of factor analysis is discussed in detail in a separate chapter.

3. Remove variables from the model

When it is not possible to collapse highly related explanatory variables into a composite variable, one may delete one or more variables to remove the effect of multicollinearity. This option, whilst being one of the easiest to accomplish practically, can be problematic if the variable measures some distinct theoretical concept which cannot be easily dismissed from the model. It should be noted that the removal of a relevant explanatory variable from a model can cause more serious problems than the presence of high multicollinearity (the removal of important variables may result in a model which is mis-specified; see Berry and Feldman, 1993). It is, therefore, generally unwise to remove explanatory variables from a regression equation merely on the grounds that they show a high degree of multicollinearity.

In general, the most reasonable method of dealing with multicollinearity is to collect more data and, where possible, collapse a number of variables into composite or latent variables, provided that they make theoretical sense. If no more data can be collected, the variables cannot be incorporated into a composite variable, and the highly related variables are deemed to be a necessary part of the model (and therefore cannot be removed), then one might just have to recognize its presence and live with its consequences (the consequence being that it is not possible to obtain reliable regression coefficients for all of the variables in the model).

Statistics for describing model-fit

There are a number of statistics that may be used to describe models, and these can also be used in model selection. The most important statistics, and the ones that we use here, are based around measures of deviance (RSS and −2LL), which are assessed for significance using the F and partial-F tests and χ²; for a full explanation of these, please see Hutcheson and Moutinho, 2008.
A model cannot, however, be adequately described merely using the deviance value, as model complexity is also an important consideration. Generally speaking, smaller models are considered
preferable to larger models (we want parsimonious models). Complexity can be taken into account when building a model by using an information criterion, which penalises models that have more parameters. The most common information criterion statistics used are Akaike's Information Criterion (AIC) and Schwarz's Bayesian Criterion (BIC), which are constructed from two terms: the deviance and the model complexity. The AIC is calculated according to the formula

AIC = −2 log-likelihood + 2p,

where −2 log-likelihood represents the deviance and p represents the number of parameters in the fitted model. Although widely used, it is recognised that the AIC tends to over-estimate the number of parameters required. An amendment to the AIC which penalises those models with more parameters more heavily is the BIC, which is calculated according to the formula

BIC = −2 log-likelihood + log(n)·p,

where −2 log-likelihood represents the deviance, p represents the number of parameters in the fitted model, and n is the number of observations. A lower value of AIC or BIC indicates a preferable model. The AIC and BIC statistics can therefore be used to compare models and evaluate the effect of single and multiple variables on a particular model. The use of these statistics in model selection is demonstrated below.

Model Selection Procedures

There are a number of methods for selecting a final model. It is useful, however, to describe two common approaches to variable selection: stepwise and optimal subset methods. Stepwise methods seek good subsets of predictors by adding or subtracting terms one at a time, while optimal subset methods locate the subset of predictors of a given size that maximises some measure of fit to the data.

stepwise selection

Stepwise selection aims to derive a regression model by sequentially adding or removing terms from a model. For example, a forward selection method builds up a model by sequentially adding variables.
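The two formulas are easy to state in code. In the sketch below the log-likelihoods and parameter counts are invented purely to show the comparison; note that BIC's log(n) penalty exceeds AIC's penalty of 2 per parameter whenever n is at least 8.

```python
import math

def aic(log_lik, p):
    """AIC = -2 log-likelihood + 2p."""
    return -2.0 * log_lik + 2.0 * p

def bic(log_lik, p, n):
    """BIC = -2 log-likelihood + log(n) * p."""
    return -2.0 * log_lik + math.log(n) * p

# Two hypothetical nested models fitted to the same n = 100 observations:
# the larger one buys a slightly better log-likelihood with three extra parameters.
aic_small, aic_big = aic(-250.0, 3), aic(-248.5, 6)
bic_small, bic_big = bic(-250.0, 3, 100), bic(-248.5, 6, 100)

print(aic_small, aic_big)   # 506.0 509.0 -> the smaller model wins on AIC
print(bic_small < bic_big)  # True: BIC penalises the extra parameters even harder
```
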
For a model of Y when there are 16 potential explanatory variables (X₁ to X₁₆), the procedure can be described as follows:

First step: individually add each of the explanatory variables to the model Y = α:

Y = α + β variable 1
Y = α + β variable 2
...
Y = α + β variable 15
Y = α + β variable 16
The explanatory variable that has the most effect (according to the change in R², the change in deviance (F, partial-F, −2LL), or statistics such as AIC and BIC) is selected to enter the model. Suppose that, in this case, variable 9 has the greatest effect on Y. The model would be Y = α + β variable 9.

Second step: start with the model Y = α + β variable 9 and then individually add each of the remaining explanatory variables to the model:

Y = α + β₁ variable 9 + β₂ variable 1
Y = α + β₁ variable 9 + β₂ variable 2
...
Y = α + β₁ variable 9 + β₂ variable 15
Y = α + β₁ variable 9 + β₂ variable 16

For this example we might find that the inclusion of variable 2 has the greatest effect and, provided that it reaches the criterion level, we would include this variable in the model. The model would now be Y = α + β₁ variable 9 + β₂ variable 2.

Further steps: this process continues, sequentially adding variables to the model until no variable with a significant effect (as determined by the entry criterion) can be added, at which point the process stops, giving the final model. For example:

Y = α + β₁ variable 9 + β₂ variable 2 + β₃ variable 14 + β₄ variable 6

A model can also be selected by starting with a full model and then sequentially removing variables. This is commonly known as backward deletion and is shown in the example below.

An example of step-wise model selection using AIC

The example below shows the stepwise procedure used in R, which uses backward elimination based on the AIC statistic, as this is the default option. Information about how to run a forward selection procedure and how to use the BIC criterion can be found in the on-line R documentation or in Fox, 2002. For this example, we will use data that is distributed as part of the S-Plus example files (cars.txt) and also made available on RGSweb. The data set contains information about a number of different cars.
Variable        Description
Price           retail price of car
Mileage         average distance travelled for a set amount of fuel
Weight          weight of the car
Displacement    engine size
hp              horse power of car
type            1 = small, 2 = sporty, 3 = compact, 4 = medium, 5 = large, 6 = van
We may model the continuous variable Mileage using all the other information that has been collected. This model of Mileage can be obtained using OLS regression:

Mileage = α + β₁Price + β₂Weight + β₃Displacement + β₄Horse Power + β₅Type

The model above is likely to include a number of explanatory variables that are strongly related and may show problematic levels of multicollinearity. This is confirmed in the output, as a number of GVIF values are relatively high (all are above 3). The presence of multicollinearity may affect the significance levels of the explanatory variables, as a number of variables we might expect to have a strong relationship with mileage, such as HP (at least when assessed on its own), are not significant in this model. This model can be made more parsimonious using the stepwise procedure in R. In order to do this, we run the full model and then ask for the stepwise selection procedure to be used. As stepwise regression has not been implemented in Rcmdr, we shall call the function directly from the R console. In order to run this, copy the commands into the R console, making sure that you have first loaded the data set under the name cars. We can see in the output the AIC value for the full model, and that the lowest AIC occurs when the variable HP is removed. As this value is less than that for the full model, the stepwise procedure removes HP and then recalculates the AIC values for the reduced model. This process continues until only PRICE, DISP and TYPE are left in the model. As none of these variables can be removed without increasing the AIC value, the stepwise procedure stops. The stepwise procedure is applicable to a broad range of models, including the GLM models, and also has the advantage of treating the individual parameters from categorical variables as single units.
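What step() is doing here can be sketched in a few lines. The following is a minimal illustration of AIC-based backward elimination, not the course code: the data are simulated, the names x1, x2 and noise are invented, and the AIC is computed from the RSS using the Gaussian OLS form n·log(RSS/n) + 2p (equivalent up to an additive constant).

```python
import numpy as np

def rss(X, y):
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return float(np.sum((y - X @ beta) ** 2))

def aic(X, y):
    # Gaussian OLS AIC up to an additive constant
    n, p = X.shape
    return n * np.log(rss(X, y) / n) + 2 * p

def backward_eliminate(y, cols):
    """cols: dict of name -> column. Repeatedly drop the variable whose
    removal gives the lowest AIC, stopping when no removal improves it."""
    n = len(y)
    current = dict(cols)

    def design(d):
        return np.column_stack([np.ones(n)] + list(d.values()))

    best = aic(design(current), y)
    while current:
        # AIC of each model that omits one of the remaining variables
        trials = {name: aic(design({k: v for k, v in current.items() if k != name}), y)
                  for name in current}
        candidate = min(trials, key=trials.get)
        if trials[candidate] >= best:   # removing anything would raise the AIC
            break
        best = trials[candidate]
        del current[candidate]
    return sorted(current)

rng = np.random.default_rng(0)
n = 200
x1, x2, noise = rng.normal(size=(3, n))
y = 1.0 + 2.0 * x1 + 0.5 * x2 + rng.normal(scale=0.5, size=n)
kept = backward_eliminate(y, {"x1": x1, "x2": x2, "noise": noise})
print(kept)   # the informative variables survive; "noise" is usually dropped
```

The stopping rule mirrors the R output: elimination halts as soon as every candidate deletion would increase the AIC.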
Procedures for computing different types of step-wise model-selection procedures (eg., forward selection) using a variety of statistics (AIC and BIC) are described in detail in Fox (2002). It should be noted that none of these variable selection procedures is best in any absolute sense 3 ; they merely identify subsets of variables that, for the sample, are good predictors of the response variable. Stepwise selection and missing data (more advanced, but important) The stepwise procedure works by comparing a number of models at each stage, each model omits a different variable and the model-fit statistics for the models are compared. In order for these comparisons to be made, the models must be constructed using the same amount of data. This means that if one of the variables has missing data, when it is removed from the model, the other models will contain more data. In this case the model-fit statistics cannot be compared and the stepwise procedure fails. In order for the stepwise procedure to work, the variables it is used on must not contain missing data. If there are any missing data, these need to be removed list-wise before the procedure is started (this is what SPSS does, but does not make it obviou that this is the case). The problem with this procedure is that for data sets which have many variables and missing data, the proportion of data lost to the analysis can be substantial. It should be realised that the 3 It can be argued that automatic selection procedures should not be relied upon to produce the best model. 12
13 An OLS regression model of Mileage Data set (available for download from RGSweb): cars.txt Rcmdr: commands Statistics Fit models Linear models... Linear Model Variables (double click to formula): mileage, disp, HP, price, type and weight Model formula: mileage disp + HP + price + type + weight OK Models Numerical diagnostics Variance Inflation Factors... Rcmdr: output OLS Regression model Coefficients: Estimate Std. Error t value Pr(> t ) (Intercept) e-10 *** DISP ** HP PRICE TYPE[T.Large] TYPE[T.Medium] TYPE[T.Small] ** TYPE[T.Sporty] * TYPE[T.Van] * WEIGHT Signif. codes: 0 *** ** 0.01 * Residual standard error: on 50 degrees of freedom Multiple R-squared: 0.813,Adjusted R-squared: F-statistic: on 9 and 50 DF, p-value: 2.832e-15 Variance inflation factors GVIF Df GVIF^(1/2Df) DISP HP PRICE TYPE WEIGHT final model from a step-wise procedure may not be based on all the available data. Indeed, it is often the case that running the final regression model on the original data gives different results. For example, using a stepwise procedure on the data set TLRPsample.txt gives the following model: MHEdisp3 ~ ASgradeCont + Course + Language + Gender + EMA + unifam 13
14 A stepwise OLS regression model of Mileage Data set (available for download from RGSweb): cars.txt R console: commands LinearModel.1 <- lm(mileage ~ DISP + HP + PRICE + TYPE + WEIGHT, data=cars) step(linearmodel.1) R console: output > step(linearmodel.1) Start: AIC= MILEAGE ~ DISP + HP + PRICE + TYPE + WEIGHT Df Sum of Sq RSS AIC - HP WEIGHT PRICE <none> DISP TYPE Step: AIC= MILEAGE ~ DISP + PRICE + TYPE + WEIGHT Df Sum of Sq RSS AIC - WEIGHT PRICE <none> DISP TYPE Step: AIC= MILEAGE ~ DISP + PRICE + TYPE Df Sum of Sq RSS AIC <none> PRICE DISP TYPE Coefficients: Estimate Std. Error t value Pr(> t ) (Intercept) ASgradeCont e-07 *** Course[T.UoM] Language[T.ENGLISH] *** Language[T.OTHER] Gender[T.male] EMA[T.yes] unifam[t.parents] unifam[t.siblings] * --- Signif. codes: 0 *** ** 0.01 * Residual standard error: on 267 degrees of freedom 14
Multiple R-squared: , Adjusted R-squared:
F-statistic: on 8 and 267 DF, p-value: 1.417e-12

whereas running the same model on the original data set gives the following model:

lm(formula = MHEdisp3 ~ ASgradeCont + Course + Language + Gender + EMA + unifam, data = TLRPsample)

Coefficients:
                     Estimate Std. Error t value Pr(>|t|)
(Intercept)                                           ***
ASgradeCont                                      e-13 ***
Course[T.UoM]
Language[T.ENGLISH]                                   ***
Language[T.OTHER]
Gender[T.male]                                          *
EMA[T.yes]
unifam[T.parents]
unifam[T.siblings]                                      *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: on 480 degrees of freedom
Multiple R-squared: , Adjusted R-squared:
F-statistic: on 8 and 480 DF, p-value: < 2.2e-16

The use of the stepwise procedure has meant the removal of a substantial proportion of the data from the analysis: it has removed 213 cases (480 - 267). There is a work-around for this in R, but you need to re-start the stepwise procedure each time it stops due to missing data. For example, run the model:

model01 <- lm(MHEdisp3 ~ ASgradeCont + AveragePed + Course + Ethnicity + Language + Gender + EMA + HEFCE_.social_group + LPN + unifam, data=TLRPsample)
step(model01)

The stepwise procedure stops with the message:

Error in step(model01) : number of rows in use has changed: remove missing values?

From this output we can see that the variables Ethnicity and AveragePed can be removed from the model, as their removal produces the smallest AIC scores. These should be removed from the regression command and the model re-run:

model01 <- lm(MHEdisp3 ~ ASgradeCont + Course + Language + Gender + EMA + HEFCE_.social_group + LPN + unifam, data=TLRPsample)
step(model01)
This model will also produce an error. Remove the variable with the lowest AIC and restart the procedure, continuing until there are no more errors. This procedure is time-consuming (though preferable to the automated procedure in SPSS) and only of limited use, as a step-wise procedure may not be the best way to construct the model. Maybe there isn't a simple automatic way to select the best subset of variables.

Optimal subset selection

Selecting ONE best model may be a flawed tactic, as there are often many subsets of explanatory variables that explain the response variable almost as well as, if not better than, the subset chosen by a stepwise procedure. A different method of model selection is to compare different selections of variables without building them up sequentially. The optimal subsets procedure is designed to do this by locating the subset of predictors of a given size that maximises some measure of fit to the data. This can be achieved in R using the regsubsets command in the leaps library and the subsets command in the car library. To compute an optimal subsets regression for the cars dataset, the following commands can be used:

Selecting a model using optimal subsets selection

Data set (available for download from RGSweb): cars.txt

R console: commands

library(car)
library(leaps)
subset.1 <- regsubsets(MILEAGE ~ DISP + HP + PRICE + TYPE + WEIGHT, nbest=10, data=cars)
subsets(subset.1)

R console: output

The output is shown in the form of a graphic (see below).

By default the BIC statistic is plotted against the number of predictors in the model. The regsubsets command obtains the subset of the best 10 models for each number of parameters. The resulting graph is shown below in Figure 3. From Figure 3 we can see that there are a large number of models that are roughly equally effective. The graph can be made more interpretable by defining limits.
For example, to show just the 3 best models and to graph only those solutions with a certain number of parameters (say 5 or 6), the following code can be used:

Figure 3: competing models

Selecting a model using optimal subsets selection: restricted sets

Data set (available for download from RGSweb): cars.txt

R console: commands

library(car)
library(leaps)
subset.2 <- regsubsets(MILEAGE ~ DISP + HP + PRICE + TYPE + WEIGHT, nbest=3, data=cars)
subsets(subset.2, min.size=5, max.size=6)

R console: output

The output is shown in the form of a graphic (see below).

Although the analysis of categorical variables is quite difficult at this moment in R (as the procedure views each explanatory variable as separate), it certainly looks as though the variables displacement, type and price are all members of the subset that would appear to be the best predictors of the variable mileage (interpret this model in conjunction with the stepwise models computed earlier).

Conclusion

Automated selection procedures can be used to make decisions about whether terms are included in or excluded from a regression model on statistical grounds, according to how much the variables contribute to predicting the response variable. Ideally, such decisions should be based on theoretical
Figure 4: competing models, a clearer graph

as well as statistical grounds; however, it is sometimes convenient to use automated procedures. Whilst such a technique of model-building is relatively quick and efficient at deriving a model which provides a good prediction of the response variable, it does not always provide a model which is adequate for explanatory purposes. Agresti makes the point that...

    Computerized variable selection procedures should be used with caution. When one considers a large number of terms for potential inclusion in a model, one or two of them that are not really important may look impressive simply due to chance. For instance, when all the true effects are weak, the largest sample effect may substantially overestimate its true effect. In addition, it often makes sense to include certain variables of special interest in a model and report their estimated effects even if they are not statistically significant at some level.
    (Agresti, 1996; p. )

Sanford Weisberg, in his excellent book Applied Linear Regression, makes the following point in the chapter that discusses multicollinearity and variable selection:

    The single most important tool in selecting a subset of variables for use in a model is the analyst's knowledge of the substantive area under study and of each of the variables, including expected sign and magnitude of coefficient.
    (Weisberg, S., 1985; p. )

What should a modelling procedure look like?

- Define which factors are likely to be important (definition of a theoretical model).
- How are these factors to be represented in the model? (single variables, composite variables or factors)
- Evaluate the theoretical model proposed above (maybe utilising a structural equation model if the model is adequately defined).
- Test to see if the resulting model is one of the best-fitting using an all-subsets methodology (see Weisberg, 1985; Fox, 2002).
- Analyse diagnostics for evidence that model assumptions have been violated (amend the models if appropriate).
- Describe and interpret the model-fit statistics and parameters.
- Illustrate the model using predictions obtained for clusters that occur in the data.
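The all-subsets check recommended above can be sketched as follows. This is an illustrative pure-Python enumeration on synthetic data (the variable names are invented); in practice regsubsets in the leaps library does this search far more efficiently, and also handles factors, which this sketch ignores.

```python
# Illustrative sketch of an all-subsets search: fit OLS to every subset of
# predictors and rank the subsets by BIC = n*log(RSS/n) + k*log(n).
import itertools
import math
import random

def ols_rss(rows, y):
    """RSS from an OLS fit via normal equations (Gaussian elimination)."""
    n, p = len(y), len(rows[0])
    A = [[sum(r[j] * r[k] for r in rows) for k in range(p)] for j in range(p)]
    b = [sum(r[j] * yi for r, yi in zip(rows, y)) for j in range(p)]
    for col in range(p):
        piv = max(range(col, p), key=lambda i: abs(A[i][col]))
        A[col], A[piv] = A[piv], A[col]
        b[col], b[piv] = b[piv], b[col]
        for i in range(col + 1, p):
            f = A[i][col] / A[col][col]
            for k in range(col, p):
                A[i][k] -= f * A[col][k]
            b[i] -= f * b[col]
    beta = [0.0] * p
    for i in range(p - 1, -1, -1):
        beta[i] = (b[i] - sum(A[i][k] * beta[k] for k in range(i + 1, p))) / A[i][i]
    return sum((yi - sum(r[j] * beta[j] for j in range(p))) ** 2
               for r, yi in zip(rows, y))

def bic(rss, n, k):
    return n * math.log(rss / n) + k * math.log(n)

random.seed(1)
n = 60
names = ["disp", "hp", "price", "weight"]        # hypothetical predictors
data = {v: [random.gauss(0, 1) for _ in range(n)] for v in names}
# The response depends only on disp and price; hp and weight are pure noise.
y = [-2.0 * data["disp"][i] + 1.5 * data["price"][i] + random.gauss(0, 0.5)
     for i in range(n)]

results = []
for size in range(1, len(names) + 1):
    for subset in itertools.combinations(names, size):
        rows = [[1.0] + [data[v][i] for v in subset] for i in range(n)]
        results.append((bic(ols_rss(rows, y), n, len(subset) + 1), subset))
results.sort()

for score, subset in results[:3]:                # the three best subsets
    print(round(score, 1), subset)
```

Because every subset is fitted, the output makes it easy to see when several competing subsets score almost equally well, which is exactly the pattern visible in the subsets() plots above.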
More informationModel Diagnostic tests
Model Diagnostic tests 1. Multicollinearity a) Pairwise correlation test Quick/Group stats/ correlations b) VIF Step 1. Open the EViews workfile named Fish8.wk1. (FROM DATA FILES- TSIME) Step 2. Select
More information3 Graphical Displays of Data
3 Graphical Displays of Data Reading: SW Chapter 2, Sections 1-6 Summarizing and Displaying Qualitative Data The data below are from a study of thyroid cancer, using NMTR data. The investigators looked
More informationQuality Checking an fmri Group Result (art_groupcheck)
Quality Checking an fmri Group Result (art_groupcheck) Paul Mazaika, Feb. 24, 2009 A statistical parameter map of fmri group analyses relies on the assumptions of the General Linear Model (GLM). The assumptions
More informationChapter 4: Analyzing Bivariate Data with Fathom
Chapter 4: Analyzing Bivariate Data with Fathom Summary: Building from ideas introduced in Chapter 3, teachers continue to analyze automobile data using Fathom to look for relationships between two quantitative
More information