Quantitative Methods in Management


MBA, Glasgow University, March 20-23, 2009
Luiz Moutinho, University of Glasgow
Graeme Hutcheson, University of Manchester

Exploratory Regression

The lecture notes, exercises and data sets associated with this workshop are available for download from RGSweb.

Multicollinearity

GLMs allow more than one explanatory variable to be entered into a model. However, there are some considerations concerning which variables may be entered. Perhaps the most important of these is multicollinearity, a term used here to describe a situation where an explanatory variable is related to one or more of the other explanatory variables in the model. If these relationships are perfect or very strong, the calculation of the regression model and the appropriate interpretation of the results can be affected. Where one explanatory variable can be precisely predicted from one or more of the other explanatory variables (perfect multicollinearity), the analysis fails because a regression equation cannot even be formulated. When a relationship is strong but not perfect (high multicollinearity), the regression equation can be formulated, but the parameters may be unreliable. Unreliable parameters can change dramatically as a result of relatively minor changes in the data set, with the addition or deletion of a small number of observations exerting a large influence on the regression equation and, subsequently, on the interpretation of the results.

The consequences of multicollinearity depend, to some degree, on the objectives of the analysis. If the goal is prediction, multicollinearity need not present much of a problem, as it primarily affects the calculated importance of the explanatory variables.

However, if the goal is explanation (that is, the aim is to identify the strength of relationships between individual explanatory variables and the response variable), the presence of a high degree of multicollinearity poses a serious problem for the correct interpretation of the results. When conducting a multiple regression, one has to identify when multicollinearity is likely to present a problem and decide upon a strategy for dealing with it.

Perfect Multicollinearity

Perfect multicollinearity occurs when an explanatory variable can be precisely predicted from other explanatory variables in the model. When this happens, the variable contributes no unique information to the model and is therefore redundant. The inclusion of one or more redundant explanatory variables in a regression model is problematic as it is not possible to determine the parameters associated with these variables and, consequently, a regression equation cannot even be formulated. This problem can be demonstrated by looking at a three-variable relationship, which can be represented algebraically as a plane using the three variables y, x and z:

y = α + β1 x + β2 z

If one of these explanatory variables is redundant (say, x = 5z) then y can be described in terms of a single variable (x or z), which is represented algebraically as a line:

y = α + β1 (5z) + β2 z

or, equivalently,

y = α + (5 β1 + β2) z

The regression procedure attempts to calculate parameters for x and z but, since the data contain information about only one of them, only one coefficient can be computed and the regression technique breaks down (for a detailed discussion of this see Berry and Feldman, 1993, and Maddala, 1992). In essence, when there is perfect multicollinearity the regression parameters cannot be estimated, as the procedure attempts to fit an equation which has more dimensions than are present in the data. In practice, perfect multicollinearity is not usually a problem as it is quite rare and can be readily detected. In fact, many statistical analysis packages automatically alert the user to its presence. A more serious problem for the analyst is the presence of high multicollinearity, where a regression model can be formulated but the parameters associated with some of the explanatory variables may be unreliable.

High Multicollinearity

When explanatory variables are highly (but not perfectly) related to one or more of the other explanatory variables in the model, it becomes difficult to disentangle the separate effect of each variable. As a variable which shows a high degree of multicollinearity provides little unique information, the regression coefficient associated with it is also based on limited information and therefore tends to have a large standard error (for detailed discussions refer to Afifi and Clark, 1996, and Edwards, 1985). In such cases the regression parameters are unlikely to accurately reflect the impact that xi has on y in the population.
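Both situations can be seen in a small simulation. This sketch is purely illustrative and uses made-up data rather than any of the workshop data sets: with perfect multicollinearity R cannot estimate one of the coefficients, whilst with high (but not perfect) multicollinearity both coefficients are estimated but with greatly inflated standard errors.

# Illustrative simulation (hypothetical data, not from the workshop files)
set.seed(42)
z <- rnorm(50)
y <- 2 + 3 * z + rnorm(50)

x.perfect <- 5 * z                      # x is exactly 5z: perfect multicollinearity
coef(lm(y ~ x.perfect + z))             # one coefficient is reported as NA: it cannot be estimated

x.high <- 5 * z + rnorm(50, sd = 0.1)   # x is almost, but not exactly, 5z: high multicollinearity
summary(lm(y ~ x.high + z))             # both coefficients estimated, but with very large standard errors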

The problems associated with high multicollinearity can be demonstrated using hypothetical data which show the relationship between the number of college places offered to students and the marks obtained in two compulsory subjects, English and Mathematics (see Table 1; for this example we will assume that all students applied to the same 10 colleges).

Table 1: Exam Marks and Offers of College Places
Number of Colleges Offering Places    English (%)    Mathematics (%)
(the data are provided in the file colleges.txt, available from RGSweb)

One would expect there to be a strong relationship between the number of college places a student is offered and the student's marks in English and Mathematics, as the decision to offer a place at a college is based largely on the student's academic performance. One would also expect a student's mark in one subject to be strongly related to their mark in the other subject, as good students tend to score relatively highly in both. This three-variable relationship is shown in Figure 1 and the associated regression model in Equation 1 and Table 2. The model appears to provide a good prediction of the number of college places offered to a student, as indicated by the F and R2 statistics (F2,8 = 32.52, P < 0.001, R2 = 0.890), which correspond to area A + B + C in Figure 1.

From Figure 1 it can be seen that the unique contribution made by each of the explanatory variables to the number of college places offered is relatively small. When controlling for marks in Mathematics, marks in English only contribute a small amount to the model fit (area C). Similarly, when controlling for marks in English, marks in Mathematics only contribute a small amount to the model fit (area A). The results in Table 2 confirm this and show that the unique contribution of each of the explanatory variables when they are both entered into the model (Model 1) is not significant, as shown by the t statistics.

It appears clear from the F and R2 statistics that Model 1 provides a good fit, even though neither of the explanatory variables is significant. This, perhaps unexpected, result is due to the high degree of multicollinearity between the explanatory variables. Logically we might expect marks in both English and Mathematics to be strongly related to the number of college places offered, as places are offered mainly on the basis of academic performance. This is what we find when simple regression models are calculated using single subjects to predict college places (Models 2 and 3). The resulting models fit almost as well as the model which uses both variables, but the regression parameters for the explanatory variables are now highly significant. We can see that the presence of multicollinearity has not really affected the predictive power of the model, but it has serious implications for the interpretation of the importance of the explanatory variables.

Figure 1: The relationship between exam marks and the number of college places offered (the shared and unique variance areas A, B and C referred to in the text)

College places offered = α + β1 English + β2 Mathematics    (1)

Table 2: Modelling the Number of College Places Offered to Students

              Coefficient    s.e.    t    P
Model 1
  English
  Mathematics
  (constant)
Model 2
  English
  (constant)
Model 3
  Mathematics
  (constant)

Model 1: Places offered = α + β1(English) + β2(Mathematics);  F2,8 = 32.52, P < 0.001, R2 = 0.890
Model 2: Places offered = α + β(English);  F1,9
Model 3: Places offered = α + β(Mathematics);  F1,9
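The three models in Table 2 can also be fitted directly from the R console. This is a hedged sketch: it assumes that the colleges.txt data have been loaded into a data frame called colleges, with columns colleges, english and maths, as in the Rcmdr example later in these notes.

# Fit the three models from Table 2 (assumes the 'colleges' data frame from colleges.txt)
model.both    <- lm(colleges ~ english + maths, data = colleges)
model.english <- lm(colleges ~ english, data = colleges)
model.maths   <- lm(colleges ~ maths, data = colleges)
summary(model.both)      # good overall F and R-squared, but non-significant t statistics
summary(model.english)   # English alone: similar fit, coefficient now highly significant
summary(model.maths)     # Mathematics alone: similar fit, coefficient now highly significant
cor(colleges$english, colleges$maths)   # the pair-wise correlation between the two marks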

Identifying Instances of Multicollinearity

Some instances of multicollinearity can be identified by inspecting pair-wise correlation coefficients. Relationships between explanatory variables which are of the order of about 0.8 or larger indicate a level of multicollinearity that may prove to be problematic (0.8 is an arbitrary figure, used here because it is commonly quoted in a number of texts; correlations smaller than 0.8 can also cause problems for the regression procedure). In the example above, the correlation between English and Mathematics scores is about 0.9 (the square root of the R2 value of 0.804 reported below), indicating that multicollinearity may be a problem for these data.

Whilst this approach is quite adept at identifying problem relationships between pairs of explanatory variables, it cannot always identify those instances where a combination of more than one variable predicts another. These relationships can, however, be detected using R2 values which show the degree to which each explanatory variable can be explained using the other explanatory variables in the model. As with pair-wise correlations, we cannot say with any certainty how high the value of R2 must be before multicollinearity is viewed as a cause for concern but, typically, values of about 0.8 or higher are taken as being indicative of a degree of multicollinearity which may be problematic. In the example above, if we predict a student's Mathematics score using their English score we obtain a regression model with an R2 value of 0.804, which indicates that a problematic level of multicollinearity may be present.

Calculating individual R2 values for each explanatory variable in the model is a useful method of identifying instances of multicollinearity, but it can be quite a lengthy process if there are a number of variables. It is, however, not necessary to compute these R2 values manually, as a number of analysis packages provide equivalent information through the tolerance and variance inflation factor (VIF) statistics shown in Equations 2 and 3 (the reason that this statistic is called the variance inflation factor is clearly explained in Weisberg, 1985, page 198).

Tolerance(βi) = 1 − Ri²    (2)

where βi is the regression coefficient for variable i, and Ri² is the squared multiple correlation coefficient between xi and the other explanatory variables.

VIF(βi) = 1 / Tolerance(βi)    (3)

We can see from Equation 3 that, for a simple regression model, an R2 value of 0.8 will result in a VIF value of 5 and a tolerance value of 0.2. Any explanatory variable which has a VIF value of 5 or more, or a tolerance of 0.2 or less, is therefore of interest as it shows a degree of multicollinearity which could be problematic. Table 3 shows the regression analysis of the example data set with VIF and tolerance values of a high enough level to be of concern. It should be noted that, as there are only two explanatory variables, the statistics for both variables are the same.

Table 3: VIF and Tolerance Statistics

              Coefficient    s.e.    t    P    Tolerance    VIF
English
Mathematics

Places offered = α + β1(English) + β2(Mathematics);  F2,8 = 32.52, P < 0.001, R2 = 0.890

The tolerance and VIF statistics are based on the R2 measure and therefore assume that the data are continuous. It is, however, possible to use these statistics on discontinuous data provided that the variables have been coded appropriately.
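Equations 2 and 3 can be reproduced from first principles in the console. A brief sketch, again assuming the colleges data frame described earlier (the object names r2, tolerance and vif.value are illustrative, not part of the workshop code):

# Tolerance and VIF for the English/Mathematics example, computed by hand
r2 <- summary(lm(maths ~ english, data = colleges))$r.squared   # R-squared of maths on english (about 0.804)
tolerance <- 1 - r2                                             # Equation 2
vif.value <- 1 / tolerance                                      # Equation 3 (about 5 when R-squared is 0.8)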

Fox and Monette (1992) generalized the notion of variance inflation to related sets of regressors (e.g., the different categories of one variable) and proposed the generalized variance inflation factor (GVIF). The interpretation of GVIF^(1/(2p)), where p is the number of parameters in the related set (the Df column in the R output shown later), is the decrease in precision of estimation due to collinearity. For example, if GVIF (or VIF) = 4 for a single-parameter term, the square root of this is 2, which indicates that the confidence interval for that predictor is twice as wide as it would have been for uncorrelated predictors. The use of GVIF greatly increases the usefulness of these techniques as it enables problematic relationships between all types of variables to be identified. It should be noted, however, that these statistics can only give a rough indication of which relationships may be problematic: they do not provide any proof that multicollinearity will be a problem, nor do they identify all instances of problematic relationships. The tolerance and VIF statistics merely provide a convenient method for identifying at least some of the relationships of concern. The R commands to obtain collinearity statistics for an OLS regression model are as follows:

Computing VIF values

Data set (available for download from RGSweb): colleges.txt

Rcmdr: commands
Statistics → Fit models → Linear models...
Linear Model
  Variables (double click to add to formula): colleges, english and maths
  Model formula: colleges ~ english + maths
  OK
Models → Numerical diagnostics → Variance Inflation Factors...

Rcmdr: output
> vif(LinearModel.1)
ENGLISH   MATHS

Dealing with Multicollinearity

There are a number of ways in which multicollinearity can be reduced in a data set. These methods include:

1. Collect more data
As multicollinearity is a problem which results from insufficient information in the sample, one solution is to increase the amount of information by collecting more data. As more data are collected and the sample size increases, the standard error tends to decrease, which reduces the effect of multicollinearity (see Berry and Feldman, 1993).

Although increasing the amount of data is an attractive option and one of the best methods to reduce multicollinearity (at least when the data set is relatively small), it is, in many instances, not practical or possible, so other less attractive methods need to be considered.

2. Collapse variables
One option for reducing the level of multicollinearity is to combine two or more highly correlated explanatory variables into a single composite variable. This approach is, however, only reasonable when the explanatory variables are indicators of the same underlying concept. For example, using the data in Table 1, it makes theoretical sense to combine the two explanatory variables (marks in English and Mathematics) into a single index of academic performance. This single index could simply be the sum of the two scores, or the average score for the two subjects (a short sketch of this approach is given at the end of this section). The use of a composite variable in the regression model enables one to assess the contribution made by academic performance to the number of college places offered, without the problem of the high degree of multicollinearity which existed between the English and Mathematics scores.

The process of combining variables into latent variables (or factors, as they are sometimes called) is not always as straightforward as in the example shown above, where two variables were related in quite an obvious way and could easily be combined into a single index. If there are a number of inter-related variables, it might be appropriate first to identify any latent variables in the sample using factor analysis and then enter these into the regression model. The technique of factor analysis is discussed in detail in a later chapter.

3. Remove variables from the model
When it is not possible to collapse highly related explanatory variables into a composite variable, one may delete one or more variables to remove the effect of multicollinearity. This option, whilst being one of the easiest to accomplish in practice, can be problematic if the variable measures some distinct theoretical concept which cannot easily be dismissed from the model. It should be noted that the removal of a relevant explanatory variable from a model can cause more serious problems than the presence of high multicollinearity (the removal of important variables may result in a model which is mis-specified; see Berry and Feldman, 1993). It is, therefore, generally unwise to remove explanatory variables from a regression equation merely on the grounds that they show a high degree of multicollinearity.

In general, the most reasonable way of dealing with multicollinearity is to collect more data and, where possible, to collapse a number of variables into composite or latent variables, provided that they make theoretical sense. If no more data can be collected, the variables cannot be incorporated into a composite variable, and the highly related variables are deemed to be a necessary part of the model (and therefore cannot be removed), then one might just have to recognise the presence of multicollinearity and live with its consequences (the consequence being that it is not possible to obtain reliable regression coefficients for all of the variables in the model).
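A minimal sketch of the composite-variable approach from point 2, again assuming the colleges data frame used earlier (the column name academic is an illustrative choice, not part of the original data):

# Combine the two highly correlated marks into a single index of academic performance
# (here the average of the two scores) and use it in place of english and maths.
# Note that 'colleges' is both the data frame and the response column, as in the Rcmdr example above.
colleges$academic <- (colleges$english + colleges$maths) / 2
summary(lm(colleges ~ academic, data = colleges))   # no multicollinearity problem remains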

Building Regression Models

All models are wrong. Some are useful. George Box, quoted in Gill.

It is useful to distinguish between two different types of models - predictive and explanatory. A predictive model attempts to provide the best possible prediction of the response variable regardless of the actual variables used; the important thing is that the variables enable a prediction to be made of the response variable. An explanatory model also aims to provide a good prediction of the response variable, but aims to include those variables that are useful for explanation. For example, although colour and style of hair may be highly predictive of truancy (rebellious pupils may adopt certain extreme hair styles in a range of unusual colours), one may not wish to use these variables in a model explaining truancy. Hair colour may be a useful variable for predicting truancy behaviour, but it is not necessarily a variable that explains the behaviour. The decision to retain or remove a variable from a model may therefore be related to the function of the model (predictive or explanatory) as well as to the significance of the variable.

When building multiple regression models it is useful to remove those variables which do not significantly contribute to the prediction of the response variable. The removal of such unimportant variables results in a simpler model, which helps interpretation and often provides a clearer insight into the way the response variable varies as a function of changes in the explanatory variables. Removing irrelevant variables is also useful because such variables may not significantly increase the predictive power of the model but may increase the standard errors (and consequently widen the confidence intervals). A good model enables accurate predictions to be made but should contain only those variables which play an important role (with respect to prediction and/or explanation). In other words, the model should be parsimonious.

Consider the two models presented in Table 4, which have been calculated using the child witness data. Model 1 uses the variables Delay and Gender to model Quality, whilst the nested Model 2 only uses Delay. Although Model 1 has a larger R2 value, which we would expect as it contains a greater number of parameters, the F statistics show that the smaller model actually provides a more significant linear prediction of Quality. The inclusion of Gender in the model does not improve the prediction of Quality, and Gender can therefore be omitted without any significant loss of predictive power.

Table 4: Model Selection

              Coefficient    s.e.    t    P
Model 1
  Delay
  Gender
  (constant)
Model 2
  Delay
  (constant)

Model 1: Quality = α + β1(Delay) + β2(Gender);  F2,67 = 3.358, P = 0.041
Model 2: Quality = α + β(Delay);  F1,68 = 6.611, P = 0.012

Ideally, only those variables which contribute significantly to the prediction of the response variable should be retained. The aim of regression is to derive a model which contains only those variables that are important for predicting the response variable. Whilst it is best to use a combination of theoretical, practical and statistical considerations to select the best model, there are some procedures that can be used to select models on purely statistical criteria.
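The comparison summarised in Table 4 can be carried out with a partial-F test for the nested models. This is a hedged sketch: the data frame name witness is an assumption made for illustration, as the file name for the child witness data is not given here.

# Compare the nested models from Table 4 with a partial-F test.
# 'witness' is an assumed data frame containing Quality, Delay and Gender.
model1 <- lm(Quality ~ Delay + Gender, data = witness)
model2 <- lm(Quality ~ Delay, data = witness)
anova(model2, model1)   # partial-F test for the unique contribution of Gender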

These procedures are popular, but they have to be used with great caution, particularly with the messy data sets we often have to deal with in the social sciences. These techniques are worth studying as they also provide a methodology for selecting regression models manually.

Model Selection Criteria

There are a number of statistics that may be used to describe models and these can also be used in model selection. The most important statistics, and the ones that we use here, are based around measures of deviance (RSS and −2LL), which are assessed for significance using the F test, the partial-F test and the χ2 test (for a full explanation of these, please see Hutcheson and Moutinho, 2008). A model cannot, however, be adequately described merely using the deviance value, as model complexity is also an important consideration. Generally speaking, smaller models are considered preferable to larger models (we want parsimonious models). Complexity can be taken into account when building a model by using an information criterion, which penalises models that have more parameters. The most common information criterion statistics are Akaike's Information Criterion (AIC) and Schwarz's Bayesian Criterion (BIC), which are constructed from two terms: the deviance and the model complexity. The AIC is calculated according to the formula

AIC = −2 log-likelihood + 2p

where −2 log-likelihood represents the deviance and p represents the number of parameters in the fitted model. Although widely used, it is recognised that the AIC tends to over-estimate the number of parameters. An amendment to the AIC which penalises models with more parameters more heavily is the BIC, which is calculated according to the formula

BIC = −2 log-likelihood + log(n) p

where −2 log-likelihood represents the deviance, p represents the number of parameters in the fitted model and n is the number of observations. A lower value of AIC or BIC indicates a preferable model. The AIC and BIC statistics can therefore be used to compare models and evaluate the effect of single and multiple variables on a particular model. The use of the AIC in automated model selection is demonstrated below.
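In R these statistics are available directly for fitted models. A brief sketch, reusing the two child-witness models from the earlier sketch (the witness data frame is again an assumption):

# AIC and BIC for the two nested models compared in Table 4 (lower values are preferable)
AIC(model1, model2)   # Akaike's Information Criterion
BIC(model1, model2)   # Schwarz's Bayesian Criterion: a heavier penalty for extra parameters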

Model Selection Procedures

There are two general approaches to variable selection: stepwise and optimal subset methods. Stepwise methods seek good subsets of predictors by adding or subtracting terms one at a time, whilst optimal subset methods locate the subset of predictors of a given size that maximises some measure of fit to the data.

stepwise selection

Stepwise selection aims to derive a regression model by sequentially adding or removing terms from a model. For example, a forward selection method builds up a model by sequentially adding variables. For a model of Y when there are 16 potential explanatory variables (X1 to X16), the procedure can be described as follows:

First step: individually add each of the explanatory variables to the model Y = α:

Y = α + β variable 1
Y = α + β variable 2
...
Y = α + β variable 15
Y = α + β variable 16

The explanatory variable that has the most effect (according to the change in R2, the change in deviance (F, partial-F, −2LL) or statistics such as AIC and BIC) is selected to enter the model. Suppose that, in this case, variable 9 has the greatest effect on Y. The model would be Y = α + β variable 9.

Stage 2: start with the model Y = α + β variable 9 and then individually add each of the remaining explanatory variables to the model:

Y = α + β1 variable 9 + β2 variable 1
Y = α + β1 variable 9 + β2 variable 2
...
Y = α + β1 variable 9 + β2 variable 15
Y = α + β1 variable 9 + β2 variable 16

For this example we might find that the inclusion of variable 2 has the greatest effect and, provided that it reaches the criterion level, we would include this variable in the model. The model would now be Y = α + β1 variable 9 + β2 variable 2.

Further stages: this process continues, sequentially adding variables to the model until no remaining variable has a significant effect (as determined by the entry criterion), at which point the process stops, giving the final model. For example, Y = α + β1 variable 9 + β2 variable 2 + β3 variable 14 + β4 variable 6.

A model can also be selected by starting with a full model and then sequentially removing variables. This is commonly known as backward deletion and is shown in the example below.

An example of step-wise model selection using AIC

The example below shows the stepwise procedure used in R, which uses backward elimination based on the AIC statistic, as this is the default option. Information about how to run a forward selection procedure and how to use the BIC criterion can be found in the on-line R documentation or in Fox (2002); a brief sketch is also given below.
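As a hedged illustration of those options (the commands below use standard arguments to R's step() function; the cars data frame is the one described in the next section and has to be loaded first):

# A sketch of forward selection using the BIC penalty, with the cars data
# described below (MILEAGE, DISP, HP, PRICE, TYPE, WEIGHT).
null.model <- lm(MILEAGE ~ 1, data = cars)                # start from the intercept-only model
step(null.model,
     scope = ~ DISP + HP + PRICE + TYPE + WEIGHT,         # candidate terms that may be added
     direction = "forward",
     k = log(nrow(cars)))                                 # k = log(n) gives the BIC penalty rather than AIC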

For this example we will use data that are distributed as part of the S-Plus example files (cars.txt) and also made available on RGSweb. The data set contains information about a number of different cars.

Variable        Description
Price           retail price of the car
Mileage         average distance travelled for a set amount of fuel
Weight          weight of the car
Displacement    engine size
hp              horse power of the car
type            1 = small, 2 = sporty, 3 = compact, 4 = medium, 5 = large, 6 = van

We may model the continuous variable Mileage using all of the other information that has been collected. This model of Mileage can be obtained using OLS regression:

Mileage = α + β1 Price + β2 Weight + β3 Displacement + β4 Horse Power + β5 Type

The model above is likely to include a number of explanatory variables that are strongly related and may show problematic levels of multicollinearity. This is confirmed in the output, as a number of the GVIF values are relatively high (all are above 3). The presence of multicollinearity may affect the significance levels of the explanatory variables: a number of variables that we might expect to have a strong relationship with mileage, such as HP (at least when assessed on its own), are not significant in this model.

This model can be made more parsimonious using the stepwise procedure in R. In order to do this, we run the full model and then ask for the stepwise selection procedure to be used. As stepwise regression has not been implemented in Rcmdr, we shall call the function directly from the R console. To run this, copy the commands into the R console, making sure that you have first loaded the data set under the name cars.

The output shows the AIC value for the full model, and we can see that for this model the lowest AIC occurs when the variable HP is removed. As this value is less than that for the full model, the stepwise procedure removes HP and then recalculates the AIC values for the reduced model. This process continues until only PRICE, DISP and TYPE are left in the model. As none of these variables can be removed without increasing the AIC value, the stepwise procedure stops.

The stepwise procedure is applicable to a broad range of models, including GLMs, and also has the advantage of treating the individual parameters from categorical variables as single units. Procedures for computing different types of step-wise model-selection procedure (e.g., forward selection) using a variety of statistics (AIC and BIC) are described in detail in Fox (2002). It should be noted that none of these variable selection procedures is best in any absolute sense (indeed, it can be argued that automatic selection procedures should not be relied upon to produce the best model); they merely identify subsets of variables that, for the sample, are good predictors of the response variable.

An OLS regression model of Mileage

Data set (available for download from RGSweb): cars.txt

Rcmdr: commands
Statistics → Fit models → Linear models...
Linear Model
  Variables (double click to add to formula): MILEAGE, DISP, HP, PRICE, TYPE and WEIGHT
  Model formula: MILEAGE ~ DISP + HP + PRICE + TYPE + WEIGHT
  OK
Models → Numerical diagnostics → Variance Inflation Factors...

Rcmdr: output (OLS regression model)

Coefficients:
                 Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)                                         e-10  ***
DISP                                                      **
HP
PRICE
TYPE[T.Large]
TYPE[T.Medium]
TYPE[T.Small]                                             **
TYPE[T.Sporty]                                            *
TYPE[T.Van]                                               *
WEIGHT
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error:        on 50 degrees of freedom
Multiple R-squared: 0.813,  Adjusted R-squared:
F-statistic:        on 9 and 50 DF,  p-value: 2.832e-15

Variance inflation factors
         GVIF   Df   GVIF^(1/2Df)
DISP
HP
PRICE
TYPE
WEIGHT

Stepwise selection and missing data (more advanced, but important)

The stepwise procedure works by comparing a number of models at each stage: each model omits a different variable and the model-fit statistics for the models are compared. In order for these comparisons to be made, the models must be constructed using the same amount of data.

A stepwise OLS regression model of Mileage

Data set (available for download from RGSweb): cars.txt

R console: commands
LinearModel.1 <- lm(MILEAGE ~ DISP + HP + PRICE + TYPE + WEIGHT, data=cars)
step(LinearModel.1)

R console: output
> step(LinearModel.1)
Start:  AIC=
MILEAGE ~ DISP + HP + PRICE + TYPE + WEIGHT

           Df  Sum of Sq  RSS  AIC
- HP
- WEIGHT
- PRICE
<none>
- DISP
- TYPE

Step:  AIC=
MILEAGE ~ DISP + PRICE + TYPE + WEIGHT

           Df  Sum of Sq  RSS  AIC
- WEIGHT
- PRICE
<none>
- DISP
- TYPE

Step:  AIC=
MILEAGE ~ DISP + PRICE + TYPE

           Df  Sum of Sq  RSS  AIC
<none>
- PRICE
- DISP
- TYPE

This means that if one of the variables has missing data, then when that variable is removed from the model the other models will contain more data. In this case the model-fit statistics cannot be compared and the stepwise procedure fails. In order for the stepwise procedure to work, the variables it is used on must not contain missing data. If there are any missing data, these need to be removed list-wise before the procedure is started (this is what SPSS does, although it does not make it obvious that this is the case). The problem with this procedure is that, for data sets which have many variables and missing data, the proportion of data lost to the analysis can be substantial.

It should be realised that the final model from a step-wise procedure may not be based on all the available data. Indeed, it is often the case that running the final regression model on the original data gives different results. For example, using a stepwise procedure on the data set TLRPsample.txt gives the following model:

MHEdisp3 ~ ASgradeCont + Course + Language + Gender + EMA + unifam

Coefficients:
                       Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)
ASgradeCont                                              e-07  ***
Course[T.UoM]
Language[T.ENGLISH]                                            ***
Language[T.OTHER]
Gender[T.male]
EMA[T.yes]
unifam[T.parents]
unifam[T.siblings]                                             *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error:        on 267 degrees of freedom
Multiple R-squared:       ,  Adjusted R-squared:
F-statistic:        on 8 and 267 DF,  p-value: 1.417e-12

whereas running the same model on the original data set gives the following model:

lm(formula = MHEdisp3 ~ ASgradeCont + Course + Language + Gender + EMA + unifam, data = TLRPsample)

Coefficients:
                       Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)                                                    ***
ASgradeCont                                              e-13  ***
Course[T.UoM]
Language[T.ENGLISH]                                            ***
Language[T.OTHER]
Gender[T.male]                                                 *
EMA[T.yes]
unifam[T.parents]
unifam[T.siblings]                                             *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error:        on 480 degrees of freedom
Multiple R-squared:       ,  Adjusted R-squared:
F-statistic:        on 8 and 480 DF,  p-value: < 2.2e-16

The use of the stepwise procedure has meant the removal of a substantial proportion of the data from the analysis: it has removed 213 cases (489 − 276 observations, as can be recovered from the residual degrees of freedom of the two models). There is a work-around for this in R, but you need to re-start the stepwise procedure each time it stops due to missing data. For example, run the model:

model01 <- lm(MHEdisp3 ~ ASgradeCont + AveragePed + Course + Ethnicity + Language + Gender + EMA + HEFCE_.social_group + LPN + unifam, data=TLRPsample)
step(model01)

The stepwise procedure stops with the message:

Error in step(model01) : number of rows in use has changed: remove missing values?

From this output we can see that the variables Ethnicity and AveragePed can be removed from the model, as their removal produces the smallest AIC values. These should be removed from the regression command and the model re-run:

model01 <- lm(MHEdisp3 ~ ASgradeCont + Course + Language + Gender + EMA + HEFCE_.social_group + LPN + unifam, data=TLRPsample)
step(model01)

This model will also produce an error. Remove the variable whose removal gives the lowest AIC and restart the procedure. Continue until there are no more errors. This procedure is time consuming (although preferable to the automated procedure in SPSS) and is only of limited use, as the step-wise procedure may not be the best way in which to construct the model. Maybe there isn't a simple automatic way to select the best subset of variables.

optimal subset selection

Selecting ONE best model might be a flawed tactic, as there are often many subsets of explanatory variables that can explain the response variable almost as well as (if not better than) the subset chosen by a stepwise procedure. A different method of model selection is to compare different selections of variables without building them up sequentially. The optimal subsets procedure is designed to do this by locating the subset of predictors of a given size that maximises some measure of fit to the data. This can be achieved in R using the regsubsets command in the leaps library and the subsets command in the car library. To compute an optimal subsets regression for the cars data set, the following commands can be used:

Selecting a model using optimal subsets selection

Data set (available for download from RGSweb): cars.txt

R console: commands
library(car)
library(leaps)
subset.1 <- regsubsets(MILEAGE ~ DISP + HP + PRICE + TYPE + WEIGHT, nbest=10, data=cars)
subsets(subset.1)

R console: output
The output is shown in the form of a graphic (see Figure 2 below).

Figure 2: competing models

By default the BIC statistic is plotted against the number of predictors in the model. The regsubsets command obtains the best 10 models for each number of parameters. The resulting graph is shown in Figure 2, from which we can see that there are a large number of models that are roughly equally effective. The graph can be made more interpretable by defining limits: for example, by showing only the 3 best models and graphing only those solutions with a certain number of parameters (say 5 or 6). The code and the resulting analysis are shown below:

Selecting a model using optimal subsets selection: restricted sets

Data set (available for download from RGSweb): cars.txt

R console: commands
library(car)
library(leaps)
subset.2 <- regsubsets(MILEAGE ~ DISP + HP + PRICE + TYPE + WEIGHT, nbest=3, data=cars)
subsets(subset.2, min.size=5, max.size=6)

R console: output
The output is shown in the form of a graphic (see Figure 3 below).

Figure 3: competing models, a clearer graph

Although the analysis of categorical variables is quite difficult at the moment in R (as the procedure treats the individual parameters of a categorical variable as separate terms), it certainly looks as though the variables displacement, type and price are all members of a subset that would appear to be the best predictors of the variable mileage (interpret this model in conjunction with the stepwise models computed earlier).

Conclusion

Automated selection procedures can be used to make decisions about whether terms are included in or excluded from a regression model on statistical grounds, according to how much the variables contribute to predicting the response variable. Ideally, such decisions should be based on theoretical as well as statistical grounds; however, it is sometimes convenient to use automated procedures. Whilst such a technique of model-building is relatively quick and efficient at deriving a model which provides a good prediction of the response variable, it does not always provide a model which is adequate for explanatory purposes. Agresti makes the point that...

Computerized variable selection procedures should be used with caution. When one considers a large number of terms for potential inclusion in a model, one or two of them that are not really important may look impressive simply due to chance. For instance, when all the true effects are weak, the largest sample effect may substantially overestimate its true effect. In addition, it often makes sense to include certain variables of special interest in a model and report their estimated effects even if they are not statistically significant at some level. (Agresti, 1996)

Sanford Weisberg, in his excellent book Applied Linear Regression, makes the following point in the chapter that discusses multicollinearity and variable selection:

The single most important tool in selecting a subset of variables for use in a model is the analyst's knowledge of the substantive area under study and of each of the variables, including the expected sign and magnitude of its coefficient. (Weisberg, 1985)

What should a modelling procedure look like?

- Define which factors are likely to be important (definition of a theoretical model).
- Decide how these factors are to be represented in the model (single variables, composite variables or factors).
- Evaluate the theoretical model proposed above (possibly utilising a structural equation model if the model is adequately defined).
- Try to improve this model by adding other variables (singly and in combination).
- Test whether the resulting model is one of the best-fitting using an all-subsets methodology (see Weisberg, 1985; Fox, 2002).
- Analyse diagnostics for evidence that model assumptions have been violated (amend the models if appropriate).
- Describe and interpret the model-fit statistics and parameters.
- Illustrate the model using predictions obtained for clusters that occur in the data.
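As a hedged illustration of the last three points, using the cars example from earlier (the values in newdata are invented purely for illustration, and TYPE is assumed to be a factor with levels such as Small and Medium, as in the output above):

# Fit the subset model suggested by the stepwise and optimal-subsets analyses,
# check the standard diagnostics and illustrate the model with predictions.
final.model <- lm(MILEAGE ~ DISP + PRICE + TYPE, data = cars)
summary(final.model)            # model-fit statistics and parameters
par(mfrow = c(2, 2))
plot(final.model)               # residual, Q-Q, scale-location and leverage plots

# Hypothetical new cases (DISP and PRICE values are made up for illustration)
newdata <- data.frame(DISP  = c(110, 200),
                      PRICE = c(9000, 18000),
                      TYPE  = c("Small", "Medium"))
predict(final.model, newdata = newdata, interval = "confidence")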


More information

Multivariate Analysis Multivariate Calibration part 2

Multivariate Analysis Multivariate Calibration part 2 Multivariate Analysis Multivariate Calibration part 2 Prof. Dr. Anselmo E de Oliveira anselmo.quimica.ufg.br anselmo.disciplinas@gmail.com Linear Latent Variables An essential concept in multivariate data

More information

Introduction to mixed-effects regression for (psycho)linguists

Introduction to mixed-effects regression for (psycho)linguists Introduction to mixed-effects regression for (psycho)linguists Martijn Wieling Department of Humanities Computing, University of Groningen Groningen, April 21, 2015 1 Martijn Wieling Introduction to mixed-effects

More information

Chapters 5-6: Statistical Inference Methods

Chapters 5-6: Statistical Inference Methods Chapters 5-6: Statistical Inference Methods Chapter 5: Estimation (of population parameters) Ex. Based on GSS data, we re 95% confident that the population mean of the variable LONELY (no. of days in past

More information

CREATING THE ANALYSIS

CREATING THE ANALYSIS Chapter 14 Multiple Regression Chapter Table of Contents CREATING THE ANALYSIS...214 ModelInformation...217 SummaryofFit...217 AnalysisofVariance...217 TypeIIITests...218 ParameterEstimates...218 Residuals-by-PredictedPlot...219

More information

SYS 6021 Linear Statistical Models

SYS 6021 Linear Statistical Models SYS 6021 Linear Statistical Models Project 2 Spam Filters Jinghe Zhang Summary The spambase data and time indexed counts of spams and hams are studied to develop accurate spam filters. Static models are

More information

Using Excel for Graphical Analysis of Data

Using Excel for Graphical Analysis of Data Using Excel for Graphical Analysis of Data Introduction In several upcoming labs, a primary goal will be to determine the mathematical relationship between two variable physical parameters. Graphs are

More information

The Truth behind PGA Tour Player Scores

The Truth behind PGA Tour Player Scores The Truth behind PGA Tour Player Scores Sukhyun Sean Park, Dong Kyun Kim, Ilsung Lee May 7, 2016 Abstract The main aim of this project is to analyze the variation in a dataset that is obtained from the

More information

Assignment 6 - Model Building

Assignment 6 - Model Building Assignment 6 - Model Building your name goes here Due: Wednesday, March 7, 2018, noon, to Sakai Summary Primarily from the topics in Chapter 9 of your text, this homework assignment gives you practice

More information

THIS IS NOT REPRESNTATIVE OF CURRENT CLASS MATERIAL. STOR 455 Midterm 1 September 28, 2010

THIS IS NOT REPRESNTATIVE OF CURRENT CLASS MATERIAL. STOR 455 Midterm 1 September 28, 2010 THIS IS NOT REPRESNTATIVE OF CURRENT CLASS MATERIAL STOR 455 Midterm September 8, INSTRUCTIONS: BOTH THE EXAM AND THE BUBBLE SHEET WILL BE COLLECTED. YOU MUST PRINT YOUR NAME AND SIGN THE HONOR PLEDGE

More information

Annotated multitree output

Annotated multitree output Annotated multitree output A simplified version of the two high-threshold (2HT) model, applied to two experimental conditions, is used as an example to illustrate the output provided by multitree (version

More information

Chapter 6: Linear Model Selection and Regularization

Chapter 6: Linear Model Selection and Regularization Chapter 6: Linear Model Selection and Regularization As p (the number of predictors) comes close to or exceeds n (the sample size) standard linear regression is faced with problems. The variance of the

More information

Problem set for Week 7 Linear models: Linear regression, multiple linear regression, ANOVA, ANCOVA

Problem set for Week 7 Linear models: Linear regression, multiple linear regression, ANOVA, ANCOVA ECL 290 Statistical Models in Ecology using R Problem set for Week 7 Linear models: Linear regression, multiple linear regression, ANOVA, ANCOVA Datasets in this problem set adapted from those provided

More information

Sandeep Kharidhi and WenSui Liu ChoicePoint Precision Marketing

Sandeep Kharidhi and WenSui Liu ChoicePoint Precision Marketing Generalized Additive Model and Applications in Direct Marketing Sandeep Kharidhi and WenSui Liu ChoicePoint Precision Marketing Abstract Logistic regression 1 has been widely used in direct marketing applications

More information

Applying Supervised Learning

Applying Supervised Learning Applying Supervised Learning When to Consider Supervised Learning A supervised learning algorithm takes a known set of input data (the training set) and known responses to the data (output), and trains

More information

Performance Estimation and Regularization. Kasthuri Kannan, PhD. Machine Learning, Spring 2018

Performance Estimation and Regularization. Kasthuri Kannan, PhD. Machine Learning, Spring 2018 Performance Estimation and Regularization Kasthuri Kannan, PhD. Machine Learning, Spring 2018 Bias- Variance Tradeoff Fundamental to machine learning approaches Bias- Variance Tradeoff Error due to Bias:

More information

Quality Checking an fmri Group Result (art_groupcheck)

Quality Checking an fmri Group Result (art_groupcheck) Quality Checking an fmri Group Result (art_groupcheck) Paul Mazaika, Feb. 24, 2009 A statistical parameter map of fmri group analyses relies on the assumptions of the General Linear Model (GLM). The assumptions

More information

MODEL DEVELOPMENT: VARIABLE SELECTION

MODEL DEVELOPMENT: VARIABLE SELECTION 7 MODEL DEVELOPMENT: VARIABLE SELECTION The discussion of least squares regression thus far has presumed that the model was known with respect to which variables were to be included and the form these

More information

Psychology 282 Lecture #21 Outline Categorical IVs in MLR: Effects Coding and Contrast Coding

Psychology 282 Lecture #21 Outline Categorical IVs in MLR: Effects Coding and Contrast Coding Psychology 282 Lecture #21 Outline Categorical IVs in MLR: Effects Coding and Contrast Coding In the previous lecture we learned how to incorporate a categorical research factor into a MLR model by using

More information

Chapter 2 Basic Structure of High-Dimensional Spaces

Chapter 2 Basic Structure of High-Dimensional Spaces Chapter 2 Basic Structure of High-Dimensional Spaces Data is naturally represented geometrically by associating each record with a point in the space spanned by the attributes. This idea, although simple,

More information

BIOL 458 BIOMETRY Lab 10 - Multiple Regression

BIOL 458 BIOMETRY Lab 10 - Multiple Regression BIOL 458 BIOMETRY Lab 0 - Multiple Regression Many problems in biology science involve the analysis of multivariate data sets. For data sets in which there is a single continuous dependent variable, but

More information

Tutorial #1: Using Latent GOLD choice to Estimate Discrete Choice Models

Tutorial #1: Using Latent GOLD choice to Estimate Discrete Choice Models Tutorial #1: Using Latent GOLD choice to Estimate Discrete Choice Models In this tutorial, we analyze data from a simple choice-based conjoint (CBC) experiment designed to estimate market shares (choice

More information

Lecture on Modeling Tools for Clustering & Regression

Lecture on Modeling Tools for Clustering & Regression Lecture on Modeling Tools for Clustering & Regression CS 590.21 Analysis and Modeling of Brain Networks Department of Computer Science University of Crete Data Clustering Overview Organizing data into

More information

LISA: Explore JMP Capabilities in Design of Experiments. Liaosa Xu June 21, 2012

LISA: Explore JMP Capabilities in Design of Experiments. Liaosa Xu June 21, 2012 LISA: Explore JMP Capabilities in Design of Experiments Liaosa Xu June 21, 2012 Course Outline Why We Need Custom Design The General Approach JMP Examples Potential Collinearity Issues Prior Design Evaluations

More information

An introduction to SPSS

An introduction to SPSS An introduction to SPSS To open the SPSS software using U of Iowa Virtual Desktop... Go to https://virtualdesktop.uiowa.edu and choose SPSS 24. Contents NOTE: Save data files in a drive that is accessible

More information

Cross-validation and the Bootstrap

Cross-validation and the Bootstrap Cross-validation and the Bootstrap In the section we discuss two resampling methods: cross-validation and the bootstrap. These methods refit a model of interest to samples formed from the training set,

More information

Salary 9 mo : 9 month salary for faculty member for 2004

Salary 9 mo : 9 month salary for faculty member for 2004 22s:52 Applied Linear Regression DeCook Fall 2008 Lab 3 Friday October 3. The data Set In 2004, a study was done to examine if gender, after controlling for other variables, was a significant predictor

More information

1 Homophily and assortative mixing

1 Homophily and assortative mixing 1 Homophily and assortative mixing Networks, and particularly social networks, often exhibit a property called homophily or assortative mixing, which simply means that the attributes of vertices correlate

More information

Statistics Lab #7 ANOVA Part 2 & ANCOVA

Statistics Lab #7 ANOVA Part 2 & ANCOVA Statistics Lab #7 ANOVA Part 2 & ANCOVA PSYCH 710 7 Initialize R Initialize R by entering the following commands at the prompt. You must type the commands exactly as shown. options(contrasts=c("contr.sum","contr.poly")

More information

STAT 2607 REVIEW PROBLEMS Word problems must be answered in words of the problem.

STAT 2607 REVIEW PROBLEMS Word problems must be answered in words of the problem. STAT 2607 REVIEW PROBLEMS 1 REMINDER: On the final exam 1. Word problems must be answered in words of the problem. 2. "Test" means that you must carry out a formal hypothesis testing procedure with H0,

More information

Study Guide. Module 1. Key Terms

Study Guide. Module 1. Key Terms Study Guide Module 1 Key Terms general linear model dummy variable multiple regression model ANOVA model ANCOVA model confounding variable squared multiple correlation adjusted squared multiple correlation

More information

CS 229 Final Project - Using machine learning to enhance a collaborative filtering recommendation system for Yelp

CS 229 Final Project - Using machine learning to enhance a collaborative filtering recommendation system for Yelp CS 229 Final Project - Using machine learning to enhance a collaborative filtering recommendation system for Yelp Chris Guthrie Abstract In this paper I present my investigation of machine learning as

More information

Averages and Variation

Averages and Variation Averages and Variation 3 Copyright Cengage Learning. All rights reserved. 3.1-1 Section 3.1 Measures of Central Tendency: Mode, Median, and Mean Copyright Cengage Learning. All rights reserved. 3.1-2 Focus

More information

NCSS Statistical Software

NCSS Statistical Software Chapter 327 Geometric Regression Introduction Geometric regression is a special case of negative binomial regression in which the dispersion parameter is set to one. It is similar to regular multiple regression

More information

Elemental Set Methods. David Banks Duke University

Elemental Set Methods. David Banks Duke University Elemental Set Methods David Banks Duke University 1 1. Introduction Data mining deals with complex, high-dimensional data. This means that datasets often combine different kinds of structure. For example:

More information

Module 1 Lecture Notes 2. Optimization Problem and Model Formulation

Module 1 Lecture Notes 2. Optimization Problem and Model Formulation Optimization Methods: Introduction and Basic concepts 1 Module 1 Lecture Notes 2 Optimization Problem and Model Formulation Introduction In the previous lecture we studied the evolution of optimization

More information

6. Relational Algebra (Part II)

6. Relational Algebra (Part II) 6. Relational Algebra (Part II) 6.1. Introduction In the previous chapter, we introduced relational algebra as a fundamental model of relational database manipulation. In particular, we defined and discussed

More information

Chapter 3. Set Theory. 3.1 What is a Set?

Chapter 3. Set Theory. 3.1 What is a Set? Chapter 3 Set Theory 3.1 What is a Set? A set is a well-defined collection of objects called elements or members of the set. Here, well-defined means accurately and unambiguously stated or described. Any

More information

Show how the LG-Syntax can be generated from a GUI model. Modify the LG-Equations to specify a different LC regression model

Show how the LG-Syntax can be generated from a GUI model. Modify the LG-Equations to specify a different LC regression model Tutorial #S1: Getting Started with LG-Syntax DemoData = 'conjoint.sav' This tutorial introduces the use of the LG-Syntax module, an add-on to the Advanced version of Latent GOLD. In this tutorial we utilize

More information