Statistical Modelling for Social Scientists
Manchester University, January 20, 21 and 24
Exploratory regression and model selection
Statistical Modelling for Social Scientists
Manchester University, January 20, 21 and 24, 2011
Graeme Hutcheson, University of Manchester

Exploratory regression and model selection

The lecture notes, exercises and data sets associated with this course are available for download from:

A number of management research projects aim to derive predictive and/or explanatory models from a large number of variables, data that is usually collected from questionnaires. The aim of such analyses is typically to identify those variables that may influence certain response variables (e.g., regression models) or to identify the fit of the data to a proposed theoretical structure (e.g., structural equation models). Such research often requires a final model to be selected and inferences made about the population from this model. This session deals with the selection of a final model (i.e., the model that is used as the basis for interpreting the relationships in the population) from the viewpoint of a researcher using regression models. It should be noted, however, that many of these issues also apply to the selection of models using other techniques, and particularly apply to structural equation models.

Building Regression Models

"All models are wrong. Some are useful." (George Box, quoted in Gill)

The aim of any modelling procedure should be to obtain a model that represents the relationships in the population rather than a specific sample. To this end we want to try and capture the underlying trends and relationships in the data that can reveal processes in the population. If a model contains too few variables, it will not have the information necessary for it to adequately describe what is going on in the population; the model will be under-fitted. On the other hand, if a model contains too many variables, it will contain more information than is necessary to describe what is going on
in the population; the model will have too many parameters and will be over-fitted. The task for the analyst is to select the optimum number of variables needed to describe the population.

Figure 1 shows a graphical illustration of under- and over-fitted models. We may consider panel A as representing the relationship in the population whilst panels B to D represent the models that have been estimated from a number of samples. Panel B shows models that have been based on too few explanatory variables - in this case the models are too simplistic to capture the form of the underlying relationship; this model is under-fitted. Panel D shows models that have been estimated using more explanatory variables than are necessary - in this case the models are too complex to adequately represent the underlying form of the relationship; this model is over-fitted. Panel C shows a more appropriate model, where the level of complexity in the population is better represented by the estimated models.

[Figure 1: four panels. A: the actual relationship between Y and x. B: 1st-order models, Y = α + βx. C: 2nd-order models, Y = α + β₁x + β₂x². D: 5th-order models, Y = α + β₁x + ... + β₅x⁵.]

Figure 1: This graphic is adapted from Burnham and Anderson (2002, pg. 34). It shows a relationship between Y and x which has been modelled using a number of Monte Carlo simulations. The simple 1st-order polynomial model clearly misidentifies the basic structure of the actual relationship, is under-fitted and unsatisfactory. The 5th-order polynomial model, on the other hand, has too many parameters, an unnecessarily large variance, and will have poor predictive qualities because it is unstable (over-fitted). For this relationship a 2nd-order polynomial seems to be quite a good approximating model.

In general, the best approximating model is achieved by properly balancing the errors of under-fitting and over-fitting. In other words, the model should be parsimonious. An example of the problems associated with over-fitting can be seen in Table 1, where the addition of the variable Gender merely adds variance. Although Model 1 has a larger R² value, which we would expect as it contains a greater number of parameters, the F statistics show that the smaller model actually provides a more significant linear prediction of Quality. The inclusion of Gender in the model does not improve the prediction of Quality and it can therefore be omitted without any significant loss of power. Ideally, only those variables which contribute significantly to the prediction of the response variable should be retained. The removal of unimportant variables results in a simpler model which helps in interpretation and often provides a clearer insight into the way the response variable varies as a function of changes in the explanatory variables. In general, a good model should enable an accurate prediction to be made of the response variable, but only contain those explanatory variables which play a significant role.

Table 1: Model Selection

              Coefficient   s.e.   t   P
Model 1
  Delay
  Gender
  (constant)
Model 2
  Delay
  (constant)

Model 1: Quality = (Delay) 1.108(Gender); F2,67 = 3.358, P = 0.041, R² =
Model 2: Quality = (Delay); F1,68 = 6.611, P = 0.012, R² =

In addition to selecting the correct variables for the model, it is also important to appropriately interpret the model parameters (i.e., exactly what is the relationship between each explanatory variable and the response?), particularly for models containing multiple explanatory variables. In order to understand some of these difficulties, it is necessary to have a thorough knowledge of multicollinearity.
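The comparison in Table 1 rests on the partial-F test for nested models. The sketch below shows the calculation with plain NumPy on simulated data of the same shape (n = 70, so the two models have 67 and 68 residual degrees of freedom); Quality, Delay and Gender here are invented stand-ins, not the course data, and Gender genuinely plays no role.

```python
import numpy as np

def rss(X, y):
    """Residual sum of squares from an OLS fit of y on X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return float(np.sum((y - X @ beta) ** 2))

def partial_f(rss_small, rss_big, df_small, df_big):
    """Partial-F statistic for a bigger model against a nested smaller one;
    df_small and df_big are residual degrees of freedom."""
    return ((rss_small - rss_big) / (df_small - df_big)) / (rss_big / df_big)

rng = np.random.default_rng(7)
n = 70
delay = rng.normal(size=n)
gender = rng.integers(0, 2, size=n).astype(float)
quality = 1.0 - 0.4 * delay + rng.normal(scale=1.0, size=n)

X_small = np.column_stack([np.ones(n), delay])           # Quality ~ Delay
X_big = np.column_stack([np.ones(n), delay, gender])     # Quality ~ Delay + Gender
f = partial_f(rss(X_small, quality), rss(X_big, quality), n - 2, n - 3)
print(f)   # an F(1, 67) value; typically small, since Gender adds nothing real
```

A non-significant partial-F like this is exactly the justification for preferring the smaller model.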
Multicollinearity

Multicollinearity describes a situation where an explanatory variable is related to one or more of the other explanatory variables in the model. If these relationships are perfect or very strong, the calculation of the regression model and the appropriate interpretation of the results can be affected. In the case where one explanatory variable can be precisely predicted from one or more of the other explanatory variables (perfect multicollinearity), the analysis fails as a regression equation cannot even be formulated. When a relationship is strong, but not perfect (high multicollinearity), the regression equation can be formulated, but the parameters may be unreliable. Parameters which are unreliable can change dramatically as a result of relatively minor changes in the data set, with the addition or deletion of a small number of observations exerting a large influence on the regression equation and, subsequently, on the interpretation of the results.
The consequences of multicollinearity depend, to some degree, on the objectives of the analysis. If the goal is prediction, then multicollinearity need not present much of a problem, as it primarily affects the calculated importance of the explanatory variables. However, if the goal is explanation (that is, the aim is to identify the strength of relationships between individual explanatory variables and the response variable), the presence of a high degree of multicollinearity poses a serious problem for the correct interpretation of the results. When conducting a multiple regression, one has to identify when multicollinearity is likely to present a problem and decide upon a strategy to deal with it.

Perfect Multicollinearity

Perfect multicollinearity occurs when an explanatory variable can be precisely predicted from other explanatory variables in the model. When this happens, the variable contributes no unique information to the model and is therefore redundant. The inclusion of one or more redundant explanatory variables in a regression model is problematic as it is not possible to determine the parameters associated with these variables and, consequently, a regression equation cannot even be formulated. This problem can be demonstrated by looking at a three-variable relationship, which can be represented algebraically as a plane using the three variables y, x and z:

y = α + β₁x + β₂z

If one of these explanatory variables is redundant (say, x = 5z) then y can be described in terms of a single variable (x or z), which is represented algebraically as a line:

y = α + β₁(5z) + β₂z

or, equivalently,

y = α + (5β₁ + β₂)z

The regression procedure attempts to calculate parameters for x and z, but since there is only information about one of them, only one coefficient can be computed and the regression technique breaks down (for a detailed discussion of this see Berry and Feldman, 1993, and Maddala, 1992).
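The x = 5z case can also be seen numerically: the design matrix has three columns but only two dimensions of information, so the normal equations are singular. A small NumPy sketch (all values invented for illustration):

```python
import numpy as np

z = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
x = 5.0 * z                            # x is exactly predictable from z
y = 2.0 + 3.0 * x + 1.0 * z            # any response will do for the demonstration

# Design matrix: an intercept column plus x and z
X = np.column_stack([np.ones_like(z), x, z])

print(np.linalg.matrix_rank(X))        # 2: three columns, only two dimensions
print(np.linalg.det(X.T @ X))          # effectively zero: X'X cannot be inverted
```

Because X'X is singular, no unique coefficient vector exists, which is exactly why the procedure "breaks down" rather than returning a poor answer.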
In essence, when there is perfect multicollinearity the regression parameters cannot be estimated, as the procedure attempts to fit an equation which has more dimensions than are present in the data. In practice, perfect multicollinearity is not usually a problem as it is quite rare and can be readily detected. In fact, many statistical analysis packages automatically alert the user to its presence. A more serious problem for the analyst is the presence of high multicollinearity, where a regression model can be formulated but the parameters associated with some of the explanatory variables may be unreliable.

High Multicollinearity

When explanatory variables are highly (but not perfectly) related to one or more of the other explanatory variables in the model, it becomes difficult to disentangle the separate effect of each variable. As a variable which shows a high degree of multicollinearity provides little unique information, the regression coefficient associated with it is also based on limited information and therefore
tends to have a large standard error (for detailed discussions of this refer to Afifi and Clark, 1996, and Edwards, 1985). In such cases the regression parameters are unlikely to accurately reflect the impact that xᵢ has on y in the population. The problems associated with high multicollinearity can be demonstrated using hypothetical data which show the relationship between the number of college places offered to students and marks obtained in two compulsory subjects, English and Mathematics (see Table 2)¹.

Table 2: Exam Marks and Offers of College Places

Number of Colleges    English    Mathematics
Offering Places       (%)        (%)

One would expect there to be a strong relationship between the number of college places a student is offered and the student's marks in English and Mathematics, as the decision to offer a place at a college is based largely on the student's academic performance. One would also expect a student's mark in one subject to be strongly related to their mark in the other subject, as good students tend to score relatively highly in both. This three-variable relationship is shown in Figure 2 and the associated regression model in Equation 1 and Table 3. The model appears to provide a good prediction of the number of college places offered to a student, as indicated by the F and R² statistics (F2,8 = , P < , R² = 0.890), which correspond to area A + B + C in Figure 2. From Figure 2 it can be seen that the unique contributions made by each of the explanatory variables to the number of college places offered are relatively small. When controlling for marks in Mathematics, marks in English only contribute a small amount to the model fit (area C). Similarly, when controlling for marks in English, marks in Mathematics only contribute a small amount to the model fit (area A).
The results in Table 3 confirm this and show that the unique contribution of each of the explanatory variables when they are both entered into the model (Model 1) is not significant, as shown by the t statistics. It appears clear from the F and R² statistics that Model 1 provides a good fit, even though neither of the explanatory variables is significant. This, perhaps unexpected, result is due to the high degree of multicollinearity between the explanatory variables. Logically we might expect marks in both English and Mathematics to be strongly related to the number of college places offered, as places are offered mainly on the basis of academic performance. This is what we find when simple regression models are calculated using single subjects to predict college places (Models 2 and 3). The resulting models fit almost as well as the model which uses both variables, but the regression parameters for the explanatory variables are now highly significant. We can see that the presence of multicollinearity has not really affected the predictive power of the model, but it has serious implications for the interpretation of the importance of the explanatory variables.

¹ For this example we will assume that all students applied to the same 10 colleges.
Figure 2: The relationship between exam marks and the number of college places offered

College places offered = α + β₁(English) + β₂(Mathematics)    (1)

Table 3: Modelling the Number of College Places Offered to Students

              Coefficient   s.e.   t   P
Model 1
  English
  Mathematics
  (constant)
Model 2
  English
  (constant)
Model 3
  Mathematics
  (constant)

Model 1: Places offered = α + β₁(English) + β₂(Mathematics); F2,8 = 32.52, P < , R² =
Model 2: Places offered = α + β(English); F1,9 = , P < , R² =
Model 3: Places offered = α + β(Mathematics); F1,9 = , P < , R² =
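The pattern in Table 3 (a joint model that fits well overall while the individual coefficients lose significance) is easy to reproduce. The sketch below simulates two highly related subject marks with plain NumPy; the numbers and seed are invented for illustration and are not the course data.

```python
import numpy as np

def fit_ols(X, y):
    """OLS fit returning (R^2, standard errors of the coefficients)."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    df = X.shape[0] - X.shape[1]
    sigma2 = float(resid @ resid) / df
    se = np.sqrt(np.diag(sigma2 * np.linalg.inv(X.T @ X)))
    r2 = 1.0 - float(resid @ resid) / float(np.sum((y - y.mean()) ** 2))
    return r2, se

rng = np.random.default_rng(3)
n = 30
maths = rng.normal(60, 10, size=n)
english = 0.95 * maths + rng.normal(0, 2, size=n)    # highly collinear with maths
places = english + maths + rng.normal(0, 5, size=n)  # both subjects really matter

ones = np.ones(n)
r2_both, se_both = fit_ols(np.column_stack([ones, english, maths]), places)
r2_eng, se_eng = fit_ols(np.column_stack([ones, english]), places)

# The joint model barely improves on the single-subject model...
print(round(r2_both, 3), round(r2_eng, 3))
# ...but the standard error of the English coefficient is far larger
# when Mathematics is also in the model:
print(round(se_both[1] / se_eng[1], 1))
```

The inflated standard error is what drives the t statistics in Model 1 towards non-significance, even though the overall fit is excellent.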
Identifying Instances of Multicollinearity

Some instances of multicollinearity can be identified by inspecting pair-wise correlation coefficients. Relationships between explanatory variables which are of the order of about 0.8 or larger indicate a level of multicollinearity that may prove to be problematic². In the above example, the correlation between English and Mathematics scores indicates that multicollinearity may be a problem for these data. Whilst this approach is quite adept at identifying problematic relationships between pairs of explanatory variables, it cannot always identify those instances where a combination of more than one variable predicts another. These relationships can, however, be determined using R² values to show the degree to which each explanatory variable can be explained using the other explanatory variables in the model. As with pair-wise correlations, we cannot say with any certainty how high the value of R² must be before multicollinearity is viewed as a cause for concern but, typically, values of about 0.8 or higher are taken as being indicative of a degree of multicollinearity which may be problematic. In the example above, if we predict a student's Mathematics score using their English score we obtain a regression model with an R² value of 0.804, which indicates that a problematic level of multicollinearity may be present.

Calculating individual R² values for each explanatory variable in the model is a useful method of identifying instances of multicollinearity, but it can be quite a lengthy process if there are a number of variables. It is, however, not necessary to compute these R² values manually, as a number of analysis packages provide equivalent information through the tolerance and variance inflation factor (VIF) statistics shown in Equations 2 and 3 (the reason that this statistic is called the variance inflation factor is clearly explained in Weisberg, 1985, page 198).
Tolerance(βᵢ) = 1 − R²ᵢ    (2)

where βᵢ is the regression coefficient for variable i, and R²ᵢ is the squared multiple correlation coefficient between xᵢ and the other explanatory variables.

VIF(βᵢ) = 1 / Tolerance(βᵢ)    (3)

We can see from Equation 3 that for a simple regression model, an R² value of 0.8 will result in a VIF value of 5 and a tolerance value of 0.2. Any explanatory variables which have a VIF value of 5 or more, or a tolerance of 0.2 or less, are therefore of interest as they show a degree of multicollinearity which could be problematic. Table 4 shows the regression analysis of the example data set with VIF and tolerance values of a high enough level to be of concern. It should be noted that, as there are only two explanatory variables, the statistics for both variables are the same.

The tolerance and VIF statistics are based on the R² measure and therefore assume that the data are continuous. It is, however, possible to use these statistics on discontinuous data provided that the variables have been coded appropriately. Fox and Monette (1992) generalized the notion of variance inflation to related sets of regressors (e.g., different categories of one variable) and proposed the generalized variance inflation factor (GVIF). The interpretation of GVIF^(1/(2p)) is the decrease in precision of the estimation due to collinearity. For example, if GVIF (or VIF) = 4, the square root of this is 2, which indicates that the confidence intervals for these predictors are twice what they would have been for uncorrelated predictors. The use of the GVIF greatly increases the usefulness of these techniques as it enables problematic relationships between all types of variables to be identified.

² 0.8 is an arbitrary figure and is used here because it is commonly quoted in a number of texts. It should be noted, however, that correlations smaller than 0.8 can also cause problems for the regression procedure.
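Equations 2 and 3 can be transcribed directly, which makes it easy to check the figures quoted in the text (the function names below are mine, not from the notes):

```python
def tolerance(r_squared_i):
    """Equation 2: tolerance of predictor i, where r_squared_i is the R^2
    from regressing x_i on the other explanatory variables."""
    return 1.0 - r_squared_i

def vif(r_squared_i):
    """Equation 3: the variance inflation factor is the reciprocal tolerance."""
    return 1.0 / tolerance(r_squared_i)

print(tolerance(0.8), vif(0.8))   # the 0.2 / 5 thresholds used in the text
print(vif(0.804))                 # the English-Mathematics example: just over 5
print(4 ** 0.5)                   # GVIF (or VIF) of 4: intervals twice as wide
```
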
Table 4: VIF and Tolerance Statistics

              Coefficient   s.e.   t   P   Tolerance   VIF
English
Mathematics

Places offered = α + β₁(English) + β₂(Mathematics); F2,8 = 32.52, P < , R² =

It should be noted, however, that these statistics can only give a rough indication of which relationships may be problematic: they do not provide any proof that multicollinearity will be a problem, nor do they identify all instances of problematic relationships. The tolerance and VIF statistics merely provide a convenient method for identifying at least some of the relationships of concern. The R commands to obtain collinearity statistics for an OLS regression model are as follows:

Computing VIF values

Data set (available for download from RGSweb): colleges.txt

Rcmdr: commands

Statistics → Fit models → Linear models...
  Linear Model
  Variables (double click to formula): colleges, english and maths
  Model formula: colleges ~ english + maths
  OK
Models → Numerical diagnostics → Variance Inflation Factors...

Rcmdr: output

> vif(LinearModel.1)
ENGLISH   MATHS

Dealing with Multicollinearity

There are a number of ways in which multicollinearity can be reduced in a data set. These methods include:

1. Collect more data

As multicollinearity is a problem which results from insufficient information in the sample, one solution is to increase the amount of information by collecting more data. As more data is
collected and the sample size increases, the standard error tends to decrease, which reduces the effect of multicollinearity (see Berry and Feldman, 1993). Although increasing the amount of data is an attractive option and one of the best methods of reducing multicollinearity (at least when the data set is relatively small), it is, in many instances, not practical or possible, so other less attractive methods need to be considered.

2. Collapse variables

One option for reducing the level of multicollinearity is to combine two or more explanatory variables which are highly correlated into a single composite variable. This approach is, however, only reasonable when the explanatory variables are indicators of the same underlying concept. For example, using the data in Table 2, it makes theoretical sense to combine the two explanatory variables (marks in English and Mathematics) into a single index of academic performance. This single index could simply be the sum of the two scores, or the average score for the two subjects. The use of a composite variable in the regression model enables one to assess the contribution made by academic performance to the number of college places offered, without the problem of the high degree of multicollinearity which existed between the English and Mathematics scores.

The process of combining variables into latent variables (or factors, as they are sometimes called) is not always as straightforward as in the example shown above, where two variables were related in quite an obvious way and could be combined easily into a single index. If there are a number of variables which are inter-related, it might be appropriate to first identify any latent variables in the sample using factor analysis and then enter these into the regression model.
The technique of factor analysis is discussed in detail in a separate chapter.

3. Remove variables from the model

When it is not possible to collapse highly related explanatory variables into a composite variable, one may delete one or more variables to remove the effect of multicollinearity. This option, whilst being one of the easiest to accomplish practically, can be problematic if the variable measures some distinct theoretical concept which cannot be easily dismissed from the model. It should be noted that the removal of a relevant explanatory variable from a model can cause more serious problems than the presence of high multicollinearity (the removal of important variables may result in a model which is mis-specified; see Berry and Feldman, 1993). It is, therefore, generally unwise to remove explanatory variables from a regression equation merely on the grounds that they show a high degree of multicollinearity.

In general, the most reasonable method of dealing with multicollinearity is to collect more data and, where possible, collapse a number of variables into composite or latent variables, provided that they make theoretical sense. If no more data can be collected, the variables cannot be incorporated into a composite variable, and the highly related variables are deemed to be a necessary part of the model (and therefore cannot be removed), then one might just have to recognize its presence and live with its consequences (the consequence being that it is not possible to obtain reliable regression coefficients for all of the variables in the model).

Statistics for describing model-fit

There are a number of statistics that may be used to describe models, and these can also be used in model selection. The most important statistics, and the ones that we use here, are based around measures of deviance (RSS and −2LL), which are assessed for significance using the F and partial-F tests and χ²; for a full explanation of these, please see Hutcheson and Moutinho, 2008.
A model cannot, however, be adequately described merely using the deviance value, as model complexity is also an important consideration. Generally speaking, smaller models are considered
preferable to larger models (we want parsimonious models). Complexity can be taken into account when building a model by using an information criterion, which penalises models that have more parameters. The most common information criterion statistics used are Akaike's Information Criterion (AIC) and Schwarz's Bayesian Criterion (BIC), which are constructed from two terms: the deviance and the model complexity. The AIC is calculated according to the formula

AIC = −2 log-likelihood + 2p,

where −2 log-likelihood represents the deviance and p represents the number of parameters in the fitted model. Although widely used, it is recognised that the AIC tends to over-estimate the number of parameters required. An amendment to the AIC which penalises those models with more parameters more heavily is the BIC, which is calculated according to the formula

BIC = −2 log-likelihood + log(n)·p,

where −2 log-likelihood represents the deviance, p represents the number of parameters in the fitted model, and n is the number of observations. A lower value of AIC or BIC indicates a preferable model. The AIC and BIC statistics can therefore be used to compare models and evaluate the effect of single and multiple variables on a particular model. The use of these statistics in model selection is demonstrated below.

Model Selection Procedures

There are a number of methods for selecting a final model. It is useful, however, to describe two common approaches to variable selection: stepwise and optimal subset methods. Stepwise methods seek good subsets of predictors by adding or subtracting terms one at a time, while optimal subset methods locate the subset of predictors of a given size that maximises some measure of fit to the data.

stepwise selection

Stepwise selection aims to derive a regression model by sequentially adding or removing terms from a model. For example, a forward selection method builds up a model by sequentially adding variables.
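The two formulas are easy to state in code. In the sketch below the log-likelihoods and parameter counts are invented purely to show the comparison; note that BIC's log(n) penalty exceeds AIC's penalty of 2 per parameter whenever n is at least 8.

```python
import math

def aic(log_lik, p):
    """AIC = -2 log-likelihood + 2p."""
    return -2.0 * log_lik + 2.0 * p

def bic(log_lik, p, n):
    """BIC = -2 log-likelihood + log(n) * p."""
    return -2.0 * log_lik + math.log(n) * p

# Two hypothetical nested models fitted to the same n = 100 observations:
# the larger one buys a slightly better log-likelihood with three extra parameters.
aic_small, aic_big = aic(-250.0, 3), aic(-248.5, 6)
bic_small, bic_big = bic(-250.0, 3, 100), bic(-248.5, 6, 100)

print(aic_small, aic_big)   # 506.0 509.0 -> the smaller model wins on AIC
print(bic_small < bic_big)  # True: BIC penalises the extra parameters even harder
```
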
For a model of Y when there are 16 potential explanatory variables (X₁ to X₁₆), the procedure can be described as follows:

First step: individually add each of the explanatory variables to the model Y = α:

Y = α + β variable 1
Y = α + β variable 2
...
Y = α + β variable 15
Y = α + β variable 16
The explanatory variable that has the most effect (according to the change in R², the change in deviance (F, partial-F, −2LL), or statistics such as AIC and BIC) is selected to enter the model. Suppose that, in this case, variable 9 has the greatest effect on Y. The model would be Y = α + β variable 9.

Second step: start with the model Y = α + β variable 9 and then individually add each of the remaining explanatory variables to the model:

Y = α + β₁ variable 9 + β₂ variable 1
Y = α + β₁ variable 9 + β₂ variable 2
...
Y = α + β₁ variable 9 + β₂ variable 15
Y = α + β₁ variable 9 + β₂ variable 16

For this example we might find that the inclusion of variable 2 has the greatest effect and, provided that it reaches the criterion level, we would include this variable in the model. The model would now be Y = α + β₁ variable 9 + β₂ variable 2.

Further steps: this process continues, sequentially adding variables to the model until no variable with a significant effect (as determined by the entry criterion) can be added, at which point the process stops, giving the final model. For example:

Y = α + β₁ variable 9 + β₂ variable 2 + β₃ variable 14 + β₄ variable 6

A model can also be selected by starting with a full model and then sequentially removing variables. This is commonly known as backward deletion and is shown in the example below.

An example of step-wise model selection using AIC

The example below shows the stepwise procedure used in R, which uses backward elimination based on the AIC statistic, as this is the default option. Information about how to run a forward selection procedure and how to use the BIC criterion can be found in the on-line R documentation or in Fox, 2002. For this example, we will use data that is distributed as part of the S-Plus example files (cars.txt) and also made available on RGSweb. The data set contains information about a number of different cars.
Variable        Description
Price           retail price of car
Mileage         average distance travelled for a set amount of fuel
Weight          weight of the car
Displacement    engine size
hp              horse power of car
type            1 = small, 2 = sporty, 3 = compact, 4 = medium, 5 = large, 6 = van
We may model the continuous variable Mileage using all the other information that has been collected. This model of Mileage can be obtained using OLS regression:

Mileage = α + β₁Price + β₂Weight + β₃Displacement + β₄Horse Power + β₅Type

The model above is likely to include a number of explanatory variables that are strongly related and may show problematic levels of multicollinearity. This is confirmed in the output, as a number of GVIF values are relatively high (all are above 3). The presence of multicollinearity may affect the significance levels of the explanatory variables, as a number of variables we might expect to have a strong relationship with mileage, such as HP (at least when assessed on its own), are not significant in this model. This model can be made more parsimonious using the stepwise procedure in R. In order to do this, we run the full model and then ask for the stepwise selection procedure to be used. As stepwise regression has not been implemented in Rcmdr, we shall call the function directly from the R console. In order to run this, copy the commands into the R console, making sure that you have first loaded the data set under the name cars. We can see in the output the AIC value for the full model, and that the lowest AIC occurs when the variable HP is removed. As this value is less than that for the full model, the stepwise procedure removes HP and then recalculates the AIC values for the reduced model. This process continues until only PRICE, DISP and TYPE are left in the model. As none of these variables can be removed without increasing the AIC value, the stepwise procedure stops. The stepwise procedure is applicable to a broad range of models, including the GLM models, and also has the advantage of treating the individual parameters from categorical variables as single units.
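What step() is doing here can be sketched in a few lines. The following is a minimal illustration of AIC-based backward elimination, not the course code: the data are simulated, the names x1, x2 and noise are invented, and the AIC is computed from the RSS using the Gaussian OLS form n·log(RSS/n) + 2p (equivalent up to an additive constant).

```python
import numpy as np

def rss(X, y):
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return float(np.sum((y - X @ beta) ** 2))

def aic(X, y):
    # Gaussian OLS AIC up to an additive constant
    n, p = X.shape
    return n * np.log(rss(X, y) / n) + 2 * p

def backward_eliminate(y, cols):
    """cols: dict of name -> column. Repeatedly drop the variable whose
    removal gives the lowest AIC, stopping when no removal improves it."""
    n = len(y)
    current = dict(cols)

    def design(d):
        return np.column_stack([np.ones(n)] + list(d.values()))

    best = aic(design(current), y)
    while current:
        # AIC of each model that omits one of the remaining variables
        trials = {name: aic(design({k: v for k, v in current.items() if k != name}), y)
                  for name in current}
        candidate = min(trials, key=trials.get)
        if trials[candidate] >= best:   # removing anything would raise the AIC
            break
        best = trials[candidate]
        del current[candidate]
    return sorted(current)

rng = np.random.default_rng(0)
n = 200
x1, x2, noise = rng.normal(size=(3, n))
y = 1.0 + 2.0 * x1 + 0.5 * x2 + rng.normal(scale=0.5, size=n)
kept = backward_eliminate(y, {"x1": x1, "x2": x2, "noise": noise})
print(kept)   # the informative variables survive; "noise" is usually dropped
```

The stopping rule mirrors the R output: elimination halts as soon as every candidate deletion would increase the AIC.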
Procedures for computing different types of step-wise model-selection procedures (eg., forward selection) using a variety of statistics (AIC and BIC) are described in detail in Fox (2002). It should be noted that none of these variable selection procedures is best in any absolute sense 3 ; they merely identify subsets of variables that, for the sample, are good predictors of the response variable. Stepwise selection and missing data (more advanced, but important) The stepwise procedure works by comparing a number of models at each stage, each model omits a different variable and the model-fit statistics for the models are compared. In order for these comparisons to be made, the models must be constructed using the same amount of data. This means that if one of the variables has missing data, when it is removed from the model, the other models will contain more data. In this case the model-fit statistics cannot be compared and the stepwise procedure fails. In order for the stepwise procedure to work, the variables it is used on must not contain missing data. If there are any missing data, these need to be removed list-wise before the procedure is started (this is what SPSS does, but does not make it obviou that this is the case). The problem with this procedure is that for data sets which have many variables and missing data, the proportion of data lost to the analysis can be substantial. It should be realised that the 3 It can be argued that automatic selection procedures should not be relied upon to produce the best model. 12
13 An OLS regression model of Mileage Data set (available for download from RGSweb): cars.txt Rcmdr: commands Statistics Fit models Linear models... Linear Model Variables (double click to formula): mileage, disp, HP, price, type and weight Model formula: mileage disp + HP + price + type + weight OK Models Numerical diagnostics Variance Inflation Factors... Rcmdr: output OLS Regression model Coefficients: Estimate Std. Error t value Pr(> t ) (Intercept) e-10 *** DISP ** HP PRICE TYPE[T.Large] TYPE[T.Medium] TYPE[T.Small] ** TYPE[T.Sporty] * TYPE[T.Van] * WEIGHT Signif. codes: 0 *** ** 0.01 * Residual standard error: on 50 degrees of freedom Multiple R-squared: 0.813,Adjusted R-squared: F-statistic: on 9 and 50 DF, p-value: 2.832e-15 Variance inflation factors GVIF Df GVIF^(1/2Df) DISP HP PRICE TYPE WEIGHT final model from a step-wise procedure may not be based on all the available data. Indeed, it is often the case that running the final regression model on the original data gives different results. For example, using a stepwise procedure on the data set TLRPsample.txt gives the following model: MHEdisp3 ~ ASgradeCont + Course + Language + Gender + EMA + unifam 13
14 A stepwise OLS regression model of Mileage Data set (available for download from RGSweb): cars.txt R console: commands LinearModel.1 <- lm(mileage ~ DISP + HP + PRICE + TYPE + WEIGHT, data=cars) step(linearmodel.1) R console: output > step(linearmodel.1) Start: AIC= MILEAGE ~ DISP + HP + PRICE + TYPE + WEIGHT Df Sum of Sq RSS AIC - HP WEIGHT PRICE <none> DISP TYPE Step: AIC= MILEAGE ~ DISP + PRICE + TYPE + WEIGHT Df Sum of Sq RSS AIC - WEIGHT PRICE <none> DISP TYPE Step: AIC= MILEAGE ~ DISP + PRICE + TYPE Df Sum of Sq RSS AIC <none> PRICE DISP TYPE Coefficients: Estimate Std. Error t value Pr(> t ) (Intercept) ASgradeCont e-07 *** Course[T.UoM] Language[T.ENGLISH] *** Language[T.OTHER] Gender[T.male] EMA[T.yes] unifam[t.parents] unifam[t.siblings] * --- Signif. codes: 0 *** ** 0.01 * Residual standard error: on 267 degrees of freedom 14
Multiple R-squared: , Adjusted R-squared:
F-statistic: on 8 and 267 DF, p-value: 1.417e-12

whereas running the same model on the original data set gives the following model:

lm(formula = MHEdisp3 ~ ASgradeCont + Course + Language + Gender + EMA + unifam, data = TLRPsample)

Coefficients:
                     Estimate Std. Error t value Pr(>|t|)
(Intercept)                                           ***
ASgradeCont                                      e-13 ***
Course[T.UoM]
Language[T.ENGLISH]                                   ***
Language[T.OTHER]
Gender[T.male]                                          *
EMA[T.yes]
unifam[T.parents]
unifam[T.siblings]                                      *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: on 480 degrees of freedom
Multiple R-squared: , Adjusted R-squared:
F-statistic: on 8 and 480 DF, p-value: < 2.2e-16

The use of the stepwise procedure has meant the removal of a substantial proportion of the data from the analysis: it has removed 213 cases (480 - 267). There is a work-around for this in R, but you need to re-start the stepwise procedure each time it stops due to missing data. For example, run the model:

model01 <- lm(MHEdisp3 ~ ASgradeCont + AveragePed + Course + Ethnicity + Language + Gender + EMA + HEFCE_.social_group + LPN + unifam, data=TLRPsample)
step(model01)

The stepwise procedure stops with the message:

Error in step(model01) : number of rows in use has changed: remove missing values?

From this output we can see that the variables Ethnicity and AveragePed can be removed from the model, as their removal produces the smallest AIC scores. These should be removed from the regression command and the model re-run:

model01 <- lm(MHEdisp3 ~ ASgradeCont + Course + Language + Gender + EMA + HEFCE_.social_group + LPN + unifam, data=TLRPsample)
step(model01)
This model will also produce an error. Remove the variable with the lowest AIC and restart the procedure, continuing until there are no more errors. This procedure is time-consuming (though preferable to the automated procedure in SPSS) and only of limited use, as a step-wise procedure may not be the best way to construct the model. Maybe there isn't a simple automatic way to select the best subset of variables.

Optimal subset selection

Selecting ONE best model may be a flawed tactic, as there are often many subsets of explanatory variables that explain the response variable almost as well as, if not better than, the subset chosen by a stepwise procedure. A different method of model selection is to compare different selections of variables without building them up sequentially. The optimal subsets procedure is designed to do this by locating the subset of predictors of a given size that maximises some measure of fit to the data. This can be achieved in R using the regsubsets command in the leaps library and the subsets command in the car library. To compute an optimal subsets regression for the cars dataset, the following commands can be used:

Selecting a model using optimal subsets selection

Data set (available for download from RGSweb): cars.txt

R console: commands

library(car)
library(leaps)
subset.1 <- regsubsets(MILEAGE ~ DISP + HP + PRICE + TYPE + WEIGHT, nbest=10, data=cars)
subsets(subset.1)

R console: output

The output is shown in the form of a graphic (see below).

By default the BIC statistic is plotted against the number of predictors in the model. The regsubsets command obtains the subset of the best 10 models for each number of parameters. The resulting graph is shown below in Figure 3. From Figure 3 we can see that there are a large number of models that are roughly equally effective. The graph can be made more interpretable by defining limits.
For example, to show just the 3 best models and to graph only those solutions with a certain number of parameters (say 5 or 6), the following code can be used:

Figure 3: competing models

Selecting a model using optimal subsets selection: restricted sets

Data set (available for download from RGSweb): cars.txt

R console: commands

library(car)
library(leaps)
subset.2 <- regsubsets(MILEAGE ~ DISP + HP + PRICE + TYPE + WEIGHT, nbest=3, data=cars)
subsets(subset.2, min.size=5, max.size=6)

R console: output

The output is shown in the form of a graphic (see below).

Although the analysis of categorical variables is quite difficult at this moment in R (as the procedure views each explanatory variable as separate), it certainly looks as though the variables displacement, type and price are all members of the subset that would appear to be the best predictors of the variable mileage (interpret this model in conjunction with the stepwise models computed earlier).

Conclusion

Automated selection procedures can be used to make decisions about whether terms are included in or excluded from a regression model on statistical grounds, according to how much the variables contribute to predicting the response variable. Ideally, such decisions should be based on theoretical
Figure 4: competing models, a clearer graph

as well as statistical grounds; however, it is sometimes convenient to use automated procedures. Whilst such a technique of model-building is relatively quick and efficient at deriving a model which provides a good prediction of the response variable, it does not always provide a model which is adequate for explanatory purposes. Agresti makes the point that...

    Computerized variable selection procedures should be used with caution. When one considers a large number of terms for potential inclusion in a model, one or two of them that are not really important may look impressive simply due to chance. For instance, when all the true effects are weak, the largest sample effect may substantially overestimate its true effect. In addition, it often makes sense to include certain variables of special interest in a model and report their estimated effects even if they are not statistically significant at some level.
    (Agresti, 1996; p. )

Sanford Weisberg, in his excellent book Applied Linear Regression, makes the following point in the chapter that discusses multicollinearity and variable selection:

    The single most important tool in selecting a subset of variables for use in a model is the analyst's knowledge of the substantive area under study and of each of the variables, including expected sign and magnitude of coefficient.
    (Weisberg, S., 1985; p. )

What should a modelling procedure look like?

- Define which factors are likely to be important (definition of a theoretical model).
- How are these factors to be represented in the model? (single variables, composite variables or factors)
- Evaluate the theoretical model proposed above (maybe utilising a structural equation model if the model is adequately defined).
- Test to see if the resulting model is one of the best-fitting using an all-subsets methodology (see Weisberg, 1985; Fox, 2002).
- Analyse diagnostics for evidence that model assumptions have been violated (amend the models if appropriate).
- Describe and interpret the model-fit statistics and parameters.
- Illustrate the model using predictions obtained for clusters that occur in the data.
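The all-subsets check recommended above can be sketched as follows. This is an illustrative pure-Python enumeration on synthetic data (the variable names are invented); in practice regsubsets in the leaps library does this search far more efficiently, and also handles factors, which this sketch ignores.

```python
# Illustrative sketch of an all-subsets search: fit OLS to every subset of
# predictors and rank the subsets by BIC = n*log(RSS/n) + k*log(n).
import itertools
import math
import random

def ols_rss(rows, y):
    """RSS from an OLS fit via normal equations (Gaussian elimination)."""
    n, p = len(y), len(rows[0])
    A = [[sum(r[j] * r[k] for r in rows) for k in range(p)] for j in range(p)]
    b = [sum(r[j] * yi for r, yi in zip(rows, y)) for j in range(p)]
    for col in range(p):
        piv = max(range(col, p), key=lambda i: abs(A[i][col]))
        A[col], A[piv] = A[piv], A[col]
        b[col], b[piv] = b[piv], b[col]
        for i in range(col + 1, p):
            f = A[i][col] / A[col][col]
            for k in range(col, p):
                A[i][k] -= f * A[col][k]
            b[i] -= f * b[col]
    beta = [0.0] * p
    for i in range(p - 1, -1, -1):
        beta[i] = (b[i] - sum(A[i][k] * beta[k] for k in range(i + 1, p))) / A[i][i]
    return sum((yi - sum(r[j] * beta[j] for j in range(p))) ** 2
               for r, yi in zip(rows, y))

def bic(rss, n, k):
    return n * math.log(rss / n) + k * math.log(n)

random.seed(1)
n = 60
names = ["disp", "hp", "price", "weight"]        # hypothetical predictors
data = {v: [random.gauss(0, 1) for _ in range(n)] for v in names}
# The response depends only on disp and price; hp and weight are pure noise.
y = [-2.0 * data["disp"][i] + 1.5 * data["price"][i] + random.gauss(0, 0.5)
     for i in range(n)]

results = []
for size in range(1, len(names) + 1):
    for subset in itertools.combinations(names, size):
        rows = [[1.0] + [data[v][i] for v in subset] for i in range(n)]
        results.append((bic(ols_rss(rows, y), n, len(subset) + 1), subset))
results.sort()

for score, subset in results[:3]:                # the three best subsets
    print(round(score, 1), subset)
```

Because every subset is fitted, the output makes it easy to see when several competing subsets score almost equally well, which is exactly the pattern visible in the subsets() plots above.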
More informationModel Diagnostic tests
Model Diagnostic tests 1. Multicollinearity a) Pairwise correlation test Quick/Group stats/ correlations b) VIF Step 1. Open the EViews workfile named Fish8.wk1. (FROM DATA FILES- TSIME) Step 2. Select
More information3 Graphical Displays of Data
3 Graphical Displays of Data Reading: SW Chapter 2, Sections 1-6 Summarizing and Displaying Qualitative Data The data below are from a study of thyroid cancer, using NMTR data. The investigators looked
More informationQuality Checking an fmri Group Result (art_groupcheck)
Quality Checking an fmri Group Result (art_groupcheck) Paul Mazaika, Feb. 24, 2009 A statistical parameter map of fmri group analyses relies on the assumptions of the General Linear Model (GLM). The assumptions
More informationChapter 4: Analyzing Bivariate Data with Fathom
Chapter 4: Analyzing Bivariate Data with Fathom Summary: Building from ideas introduced in Chapter 3, teachers continue to analyze automobile data using Fathom to look for relationships between two quantitative
More information