SOC6078 Advanced Statistics: 9. Generalized Additive Models

Robert Andersen
Department of Sociology, University of Toronto

Goals of the Lecture
Introduce Additive Models
Explain how they extend from simple nonparametric regression (i.e., smoothing splines and local polynomial regression)
Discuss estimation using backfitting
Explain how to interpret their results
Conclude with some examples of Additive Models applied to real social science data

Limitations of the Multiple Nonparametric Models
The general nonparametric model (both the lowess smooth and the smoothing spline) takes the following form:
$$Y_i = f(x_{i1}, x_{i2}, \ldots, x_{ik}) + \varepsilon_i$$
As we see here, the multiple nonparametric model allows all possible interactions between the independent variables in their effects on Y: we specify a jointly conditional functional form
This model is ideal under the following circumstances:
1. There are no more than two predictors
2. The pattern of nonlinearity is complicated and thus cannot be easily modelled with a simple transformation, polynomial regression or cubic spline
3. The sample size is sufficiently large

Limitations of the Multiple Nonparametric Models (2)
The general nonparametric model becomes impossible to interpret and unstable as we add more explanatory variables, however
1. For example, in the lowess case, as the number of variables increases, the window span must become wider in order to ensure that each local regression has enough cases (the general idea is the same for smoothing splines). This process can create significant bias (the curve becomes too smooth)
2. It is impossible to interpret general nonparametric regression when there are more than two variables: there are no coefficients, and we cannot graph effects in more than three dimensions
These limitations lead us to Additive Models
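The widening-window problem can be illustrated with a back-of-the-envelope calculation. The sketch below assumes, purely for illustration, that the predictors are spread uniformly over a unit cube (an assumption not made in the lecture) and asks how wide a cubic window must be along each axis to capture a fixed share of the cases. The function name `window_side` is hypothetical.

```python
def window_side(span, n_predictors):
    """Side length of a cubic window that captures a share `span` of the
    cases when the predictors are uniform on the unit cube (an
    illustrative assumption, not part of the lecture)."""
    if not 0 < span <= 1:
        raise ValueError("span must be in (0, 1]")
    if n_predictors < 1:
        raise ValueError("need at least one predictor")
    return span ** (1.0 / n_predictors)


if __name__ == "__main__":
    # How much of each predictor's range a 10% span has to cover
    for p in (1, 2, 3, 5, 10):
        side = window_side(0.10, p)
        print(f"{p:2d} predictors: window covers {side:.2f} of each range")
```

Under this assumption a 10% span covers roughly a third of each predictor's range with two predictors and close to four fifths with ten, so the "local" fit is no longer very local.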

Additive Regression Models
Additive regression models essentially apply local regression to low-dimensional projections of the data
That is, they estimate the regression surface by a combination of a collection of one-dimensional functions
The nonparametric additive regression model is
$$Y_i = \alpha + f_1(x_{i1}) + f_2(x_{i2}) + \cdots + f_k(x_{ik}) + \varepsilon_i$$
where the f_j are arbitrary functions estimated from the data; the errors ε are assumed to have constant variance and a mean of 0
The estimated functions f_j are the analogues of the coefficients in linear regression

Additive Regression Models (2)
The assumption that the contribution of each covariate is additive is analogous to the assumption in linear regression that each component is estimated separately
Recall that the linear regression model is
$$Y_i = \alpha + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_k x_{ik} + \varepsilon_i$$
where the β_j represent linear effects
For the additive model we model Y as an additive combination of arbitrary functions of the Xs
The f_j represent arbitrary trends that can be estimated by lowess or smoothing splines

Additive Regression Models (3)
Now comes the question: How do we find these arbitrary trends?
If the Xs were completely independent (which will not be the case) we could simply estimate each functional form using a nonparametric regression of Y on each of the Xs separately
Similarly, in linear regression, when the Xs are completely uncorrelated the partial regression slopes are identical to the marginal regression slopes
Since the Xs are related, however, we need to proceed in another way, in effect removing the effects of the other predictors, which are unknown before we begin
We use a procedure called backfitting to find each curve, controlling for the effects of the others

Estimation and Backfitting
Suppose that we have a two-predictor additive model:
$$Y_i = \alpha + f_1(x_{i1}) + f_2(x_{i2}) + \varepsilon_i$$
If we (unrealistically) knew the partial regression function f_2 but not f_1, we could rearrange the equation in order to solve for f_1:
$$Y_i - f_2(x_{i2}) = \alpha + f_1(x_{i1}) + \varepsilon_i$$
In other words, smoothing Y_i - f_2(x_i2) against x_i1 produces an estimate of α + f_1(x_i1)
Simply put, knowing one function allows us to find the other. In the real world, however, we don't know either, so we must proceed initially with preliminary estimates

Estimation and Backfitting (2)
1. Start by expressing the variables in mean deviation form so that the partial regressions sum to zero, thus eliminating the individual intercepts
2. Take preliminary estimates of each function from a least-squares regression of Y on the Xs
3. The preliminary estimates are used as step (0) in an iterative estimation process
4. Find the partial residuals for X_1. Recall that partial residuals remove Y from its linear relationship to X_2 while retaining the relationship between Y and X_1

Estimation and Backfitting (3)
The partial residuals for X_1 are then
$$E_{i[1]}^{(1)} = Y_i - \bar{Y} - B_2 (x_{i2} - \bar{x}_2)$$
where B_2 is the preliminary least-squares slope for X_2
5. The same procedure in step 4 is done for X_2
6. Next we smooth these partial residuals against their respective Xs, providing a new estimate of f
$$\hat{f}_j^{(1)} = S_j E_{[j]}^{(1)}$$
where S_j is the (n × n) smoother transformation matrix for X_j that depends only on the configuration of X_ij for the jth predictor

Estimation and Backfitting (4)
Either loess or smoothing splines can be used to find the regression curves
If local polynomial regression is used, a decision must be made about the span that is used
If a smoothing spline is used, the degrees of freedom can be specified ahead of time or chosen using cross-validation, with the goal of minimizing the penalized residual sum of squares
$$SS^*(\lambda) = \sum_{i=1}^{n} [Y_i - f(x_i)]^2 + \lambda \int [f''(x)]^2 \, dx$$
Recall that the first term measures the closeness to the data; the second term penalizes curvature in the function

Estimation and Backfitting (5)
This process of finding new estimates of the functions by smoothing the partial residuals is iterated until the partial functions converge
That is, when the estimates of the smooth functions stabilize from one iteration to the next we stop
When this process is done, we obtain estimates of s_j(X_ij) for every value of X_j
More importantly, we will have reduced a multiple regression to a series of two-dimensional partial regression problems, making interpretation easy:
Since each partial regression is only two-dimensional, the functional forms can be plotted on two-dimensional plots showing the partial effects of each X_j on Y
In other words, perspective plots are no longer necessary unless we include an interaction between two smoother terms
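The backfitting loop described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the lecture's or mgcv's implementation: it uses a Gaussian kernel smoother as a stand-in for lowess or a smoothing spline, starts from zero functions rather than least-squares preliminary estimates, and all names (`kernel_smooth`, `backfit`, `bandwidth`) and the simulated data are hypothetical.

```python
import math
import random


def kernel_smooth(x, r, bandwidth):
    """Gaussian-kernel (Nadaraya-Watson) smooth of r against x.
    A simple stand-in for lowess or a smoothing spline."""
    fitted = []
    for x0 in x:
        w = [math.exp(-0.5 * ((xi - x0) / bandwidth) ** 2) for xi in x]
        fitted.append(sum(wi * ri for wi, ri in zip(w, r)) / sum(w))
    return fitted


def center(v):
    """Subtract the mean so the partial function sums to zero."""
    m = sum(v) / len(v)
    return [vi - m for vi in v]


def backfit(y, xs, bandwidth=0.3, tol=1e-6, max_iter=100):
    """Backfitting for y = alpha + sum_j f_j(x_j) + error.
    Returns (alpha, fitted partial functions, iterations used)."""
    n = len(y)
    alpha = sum(y) / n                     # overall constant
    f = [[0.0] * n for _ in xs]            # assumption: start from zero
    for it in range(1, max_iter + 1):
        change = 0.0
        for j, xj in enumerate(xs):
            # Partial residuals: remove alpha and every other f_k from y
            partial = [
                y[i] - alpha - sum(f[k][i] for k in range(len(xs)) if k != j)
                for i in range(n)
            ]
            new_fj = center(kernel_smooth(xj, partial, bandwidth))
            change = max(change,
                         max(abs(a - b) for a, b in zip(new_fj, f[j])))
            f[j] = new_fj
        if change < tol:                   # partial functions have stabilized
            break
    return alpha, f, it


if __name__ == "__main__":
    rng = random.Random(1)                 # simulated, hypothetical data
    n = 150
    x1 = [rng.uniform(-2, 2) for _ in range(n)]
    x2 = [rng.uniform(-2, 2) for _ in range(n)]
    y = [10 + math.sin(2 * a) + b ** 2 + rng.gauss(0, 0.2)
         for a, b in zip(x1, x2)]
    alpha, f, iters = backfit(y, [x1, x2])
    print(f"alpha = {alpha:.2f}, converged after {iters} iterations")
```

Each pass smooths the partial residuals for one predictor while holding the current estimates of the others fixed, which is the "controlling for the effects of the others" step; the centering keeps each partial function summing to zero so that the constant is carried by alpha alone.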

GAMs in R
There are two excellent packages for fitting generalized additive models in R
The gam (for generalized additive model) function in the mgcv (multiple smoothing parameter estimation with generalized cross validation) package fits generalized additive models using smoothing splines
The smoothing parameter can be chosen automatically using cross-validation or manually by specifying the degrees of freedom
The gam function in the gam package allows either lowess (lo(x)) or smoothing splines (s(x)) to be specified
The anova function can be used for both functions, allowing different models to be easily compared

Additive Regression Models in R: Example: Canadian prestige data
Here we use the Canadian Prestige data to fit an additive model for prestige regressed on income and education
For this example I use the gam function in the mgcv package
The formula takes the same form as for the glm function, except now we have the option of having parametric terms and smoothed estimates
The R-script specifying a smooth trend for both income and education is as follows:
[R script not included in the transcription]

Additive Regression Models in R: Example: Canadian prestige data (2)
The summary function returns tests for each smooth, the degrees of freedom for each smooth, and an adjusted R-square for the model. The deviance can be obtained from the deviance(model) command

Additive Regression Models in R: Example: Canadian prestige data (3)
Again, as with other nonparametric models, we have no slope parameters to investigate (we do have an intercept, however)
A plot of the regression surface is necessary

Additive Regression Models in R: Example: Canadian prestige data (4)
Additive Model: We can see the nonlinear relationship for both education and income with prestige, but there is no interaction between them, i.e., the slope for income is the same at every value of education
We can compare this model to the general nonparametric regression model
[Figure: perspective plot of the additive model; axes: Prestige, Income, Education]

Additive Regression Models in R: Example: Canadian prestige data (5)
General Nonparametric Model: This model is quite similar to the additive model, but there are some nuances, particularly in the midrange of income, that are not picked up by the additive model because the Xs do not interact
[Figure: perspective plot of the general nonparametric model; axes: Prestige, Income, Education]

Additive Regression Models in R: Example: Canadian prestige data (6)
Perspective plots can also be made automatically using the persp.gam function. These graphs include a 95% confidence region
[Figure: perspective plot with confidence region; axes: income, education]

Additive Regression Models in R: Example: Canadian prestige data (7)
Since the slices of the additive regression in the direction of one predictor (holding the other constant) are parallel, we can graph each partial-regression function separately
This is the benefit of the additive model
We can graph as many plots as there are variables, allowing us to easily visualize the relationships
In other words, a multidimensional regression has been reduced to a series of two-dimensional partial-regression plots
To get these in R: [R script not included in the transcription] (red/green are +/-2 se)

Additive Regression Models in R: Example: Canadian prestige data (8)
[Figure: partial-regression plots of s(income,3.12) against income and s(education,3.18) against education]

Interpreting the Effects
A plot of X_j versus s_j(X_j) shows the relationship between X_j and Y holding constant the other variables in the model
Since Y is expressed in mean deviation form, the smooth term s_j(X_j) is also centered, and thus each plot represents how Y changes relative to its mean with changes in X
Interpreting the scale of the graphs then becomes easy:
The value of 0 on the Y-axis is the mean of Y
As the line moves away from 0 in a negative direction we subtract the distance from the mean when determining the fitted value; as it moves away in a positive direction we add it. For example, if the mean is 45, and for a particular X-value (say x=15) the curve is at s_j(X_j)=4, this means the fitted value of Y controlling for all other explanatory variables is 45+4=49
If there are several nonparametric relationships, we can add together the effects on the two graphs for any particular observation to find its fitted value of Y

Interpreting the Effects (2)
R-script for previous slide: [R script not included in the transcription]
[Figure: partial-regression plots of s(income,3.08) against income and s(education,3) against education, with the points (10 000, 6) and (10, -5) marked for Income=10000 and Education=10]
The mean of prestige = 47.3. Therefore the fitted value for an occupation with average income of $10000/year and 10 years of education on average is 47.3 + 6 - 5 = 48.3
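The arithmetic for reading a fitted value off the partial-regression plots can be written as a one-line function. This is only a sketch: the function name `fitted_value` is hypothetical, and the inputs are the values quoted in the slides (a mean of 47.3 with curve heights of 6 and -5, and the earlier example with a mean of 45 and a height of 4).

```python
def fitted_value(mean_y, partial_effects):
    """Fitted value from an additive model: the mean of Y plus the height
    of each centered partial-regression curve at the chosen X values."""
    return mean_y + sum(partial_effects)


if __name__ == "__main__":
    # Prestige example: income = 10000 gives +6, education = 10 gives -5
    print(round(fitted_value(47.3, [6, -5]), 1))
    # Single-smooth example from the slides: mean 45, curve at +4
    print(fitted_value(45, [4]))
```

Because every smooth is centered at zero, the same addition works for any number of smooth terms.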

Residual Sum of Squares
As was the case for smoothing splines and lowess smooths, statistical inference and hypothesis testing are based on the residual sum of squares (or deviance in the case of generalized additive models) and the degrees of freedom
The RSS for an additive model is easily defined in the usual manner:
$$RSS = \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2$$
The approximate degrees of freedom, however, need to be adjusted from the regular nonparametric case because we are no longer specifying a jointly conditional functional form

Degrees of Freedom
Recall that for nonparametric regression, the approximate degrees of freedom are equal to the trace of the smoother matrix (the matrix that projects Y onto Y-hat)
We extend this to the additive model:
$$df_j = \mathrm{trace}(S_j) - 1$$
1 is subtracted from each df, reflecting the constraint that each partial regression function sums to zero (the individual intercepts have been removed)
Parametric terms entered in the model each occupy a single degree of freedom, as in the linear regression case
The individual degrees of freedom are then combined for a single measure:
$$df_{mod} = \sum_{j=1}^{k} df_j + 1$$
1 is added to the final degrees of freedom to account for the overall constant in the model

Specifying Degrees of Freedom
As was the case for smoothing splines, the degrees of freedom or, alternatively, the smoothing parameter λ can be specified by the researcher
Also like smoothing splines, however, generalized cross-validation can be used to specify the degrees of freedom
Recall that this finds the smoothing parameter that gives the lowest average mean squared error from the cross-validation samples
Cross-validation is implemented using the mgcv package in R

Cautions about Statistical Tests when the λ are chosen using GCV
If the smoothing parameters λ (or equivalently, the degrees of freedom) are chosen using generalized cross-validation (GCV), caution must be used when using an analysis of deviance
If a variable is added or removed from the model, the smoothing parameter λ that yields the smallest mean squared error will also change
By implication, the degrees of freedom also change, implying that the equivalent number of parameters used for the model is different
In other words, the test will only be approximate because the otherwise nested models have different degrees of freedom associated with λ
As a result, it is advisable to fix the degrees of freedom when comparing models
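The degrees-of-freedom bookkeeping (trace of each smoother matrix, minus one per smooth, plus one for the overall constant) can be sketched as follows. The running-mean smoother is an illustrative choice made here because its smoother matrix is easy to write down; it is not the smoother used in the lecture, and all function names are hypothetical.

```python
def running_mean_matrix(n, k):
    """n x n smoother matrix for a running mean over up to k neighbours on
    each side (cases assumed sorted by x). Each row sums to 1."""
    S = []
    for i in range(n):
        lo, hi = max(0, i - k), min(n - 1, i + k)
        width = hi - lo + 1
        S.append([1.0 / width if lo <= j <= hi else 0.0 for j in range(n)])
    return S


def trace(S):
    """Sum of the diagonal elements of a square matrix."""
    return sum(S[i][i] for i in range(len(S)))


def smooth_df(S):
    """Approximate df for one smooth term: trace(S_j) - 1."""
    return trace(S) - 1.0


def model_df(smooth_dfs, n_parametric=0):
    """Total df: smooth terms, one df per parametric term, plus 1 for the
    overall constant."""
    return sum(smooth_dfs) + n_parametric + 1.0


if __name__ == "__main__":
    S1 = running_mean_matrix(20, 2)   # hypothetical smoother for X1
    S2 = running_mean_matrix(20, 4)   # a smoother (wider) window for X2
    dfs = [smooth_df(S1), smooth_df(S2)]
    print("df per smooth:", [round(d, 2) for d in dfs])
    print("model df:", round(model_df(dfs), 2))
```

A wider window puts less weight on each case's own value, so the trace, and with it the degrees of freedom used by that term, goes down.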

Testing for Linearity
We can compare the linear model of prestige regressed on income and education with the additive model by carrying out an incremental F-test
[R output not included in the transcription]
The difference between the models is highly statistically significant: the additive model describes the relationship between prestige and education and income much better

Diagnostic Plots
The gam.check function returns four diagnostic plots:
1. A quantile-comparison plot of the residuals allows us to look for outliers and heavy tails
2. Residuals versus linear predictors (simply observed y for continuous variables) helps detect nonconstant error variance
3. A histogram of the residuals is good for detecting nonnormality
4. Response versus fitted values

Diagnostic Plots (2)
[Figure: gam.check output with four panels: Normal Q-Q Plot (Sample Quantiles against Theoretical Quantiles), Resids vs. linear pred. (residuals against linear predictor), Histogram of residuals (Frequency against Residuals), Response vs. Fitted Values (Response against Fitted Values)]

Interactions between Smoothed Terms
The gam function in the mgcv package allows you to specify an interaction term between two or more terms
In the case of an interaction between two terms, and when no other variables are included in the model, we essentially have a multiple nonparametric regression
Once again we need to graph the relationship in a perspective plot
While it is possible to fit a higher order interaction, once we get past the two-way interaction the graph can no longer be interpreted
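The incremental F-test used to compare the linear and additive fits can be sketched as below. This assumes the usual form of the statistic for nested models, with the approximate degrees of freedom of the additive model standing in for a parameter count. The function name `incremental_f` is hypothetical and the numbers in the demo are made up for illustration; they are not the prestige results.

```python
def incremental_f(rss_null, df_null, rss_alt, df_alt, n):
    """Incremental F statistic for nested models.

    rss_null, df_null: residual sum of squares and model df (constant
    included) of the simpler model, e.g. the linear fit.
    rss_alt, df_alt: the same for the more general model, e.g. the
    additive fit, using its approximate degrees of freedom.
    Returns (F, numerator df, denominator df)."""
    if df_alt <= df_null:
        raise ValueError("the alternative model must use more df")
    if n <= df_alt:
        raise ValueError("need more cases than model df")
    num_df = df_alt - df_null
    den_df = n - df_alt
    f_stat = ((rss_null - rss_alt) / num_df) / (rss_alt / den_df)
    return f_stat, num_df, den_df


if __name__ == "__main__":
    # Hypothetical values: a linear model with 3 df and an additive
    # model with about 7.3 df, fitted to 100 cases
    f_stat, num_df, den_df = incremental_f(8000.0, 3.0, 6500.0, 7.3, 100)
    print(f"F = {f_stat:.2f} on {num_df:.1f} and {den_df:.1f} df")
```

The statistic is then referred to an F distribution with the returned numerator and denominator degrees of freedom; as the lecture cautions, the test is only approximate when the smoothing parameters were chosen by GCV.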

Semi-Parametric Models
The generality of the additive model makes it very attractive when there is complicated nonlinearity in the multivariate case
Nonetheless, the flexibility of the smooth fit comes at the expense of precision and statistical power
As a result, if a linear trend can be fit, it should be preferred
This leads us to the semi-parametric model, which allows a mixture of linear and nonparametric components
The same backfitting procedure that is used for the general additive model is used in fitting this semi-parametric model

Semi-Parametric Models (2)
Semi-parametric models also make it possible to add categorical variables. They enter the model in exactly the same way as for linear regression, as a set of dummy regressors
As said earlier, the gam function in the mgcv package also allows you to specify interaction terms
Any interactions that can be done in a linear model can also be included in an additive model
[Model equations not included in the transcription]
The last of these specifies an interaction between a categorical variable X_2 (represented by two dummy regressors, D_1 and D_2) and a quantitative variable X_1 for which a smooth trend is specified
This fits a separate curve for each category

Interaction for GAMs in R
[R script not included in the transcription]

Interaction for GAMs in R (2)
[Figure: partial-regression plots of s(income,1) against income, one panel each for Blue Collar, Professional and White Collar]

Concurvity
Concurvity is the generalized additive model analogue to collinearity in linear models
Two possible problems can arise:
1. A point or group of points that are common outliers in two or more Xs could cause wild tail behavior
2. If two Xs are too highly correlated, backfitting may be unable to find a unique curve. In these cases, the initial linear coefficient will be all that is returned
The graph on the previous page is a good example
Here type and income are too closely related (i.e., professional jobs are high paying, blue collar jobs pay less), and thus we find only linear fits where the lines cross
As is the case with collinearity, there is no solution to concurvity other than reformulating the research question

Example 2: Inequality data revisited
Recall that earlier in the course we saw that there was an apparent interaction between gini and democracy in their effects on attitudes towards pay inequality
Thus far we haven't given much effort towards trying to determine the functional form of these effects
We now do so using a semi-parametric model that fits a smooth term for gini that interacts with a dummy regressor for democracy
In other words we fit two separate curves: one for democracies and another for non-democracies

Example 2: Inequality data revisited (2)
[R script not included in the transcription]

Example 2: Inequality data revisited (3)
[Figure: partial-regression plots against gini for Democracies and Non-democracies; y-axis labels s(gini,5.05) and s(gini,8.77)]

Example 2: Inequality data revisited (4)
We now proceed to test whether the additive model does significantly better than the linear model
[R output not included in the transcription]
We conclude that the additive model is no more informative

Cautions about Interpretation
When we fit a linear model we don't believe the model to be correct, but rather that it is a good approximation
The same goes for additive models, but we hope that they are better approximations
Having said that, all the same pitfalls possible for linear models are magnified with additive models
Most importantly, we must be careful not to overinterpret fitted curves
An examination of standard error bands, an analysis of deviance, and residual plots can help determine whether fitted curves are important
We can also select and delete variables in a stepwise manner (taking out and then re-adding insignificant terms) to ensure that only important terms remain in the final model
We do not want unimportant terms influencing otherwise important effects

Generalized Additive Mixed Models
The gamm function in the mgcv package calls on the lme function in the nlme package to fit generalized additive mixed models
Recall that earlier we used the British Election Study data to explore how income affected left-right attitudes
Since observations were clustered within constituencies, we used a mixed model to take this clustering into account, specifying a random intercept
Assume now that we have reason to believe that income had a nonlinear effect on attitudes. We could test this hypothesis by specifying a smooth for the income effect using the gamm function, and comparing this model with another that specifies a simpler linear trend

Generalized Additive Mixed Models (2)
[R script not included in the transcription]

Generalized Additive Mixed Models (3)
The plot of the income effect below indicates that a linear specification is probably best, so we proceed to a formal test using an analysis of deviance
[Figure: partial-regression plot of s(income2,1) against INCOME2]

Generalized Additive Mixed Models (4)
We can also test for linearity using the anova function to compare the fit of a model specifying a smooth term to a model specifying a linear trend
[R output not included in the transcription]
We see here that the difference between the models is not statistically significant, suggesting that the linear specification is best

Missing Data
Missing data can be a problem for any type of model
It is only seriously problematic, however, if the missing cases have a systematic relationship to the response or the Xs
If the data are missing at random (i.e., the pattern of missingness is not a function of the response), we are less worried about them. However, if they are not, the problem is even more serious for generalized additive models than for linear regression
The backfitting algorithm omits all missing observations, and thus their fitted values are set to 0 when the partial residuals are smoothed against the predictor
Since the fitted curves have a mean of 0, this amounts to assigning the average fitted value to the missing observations
In other words, it is the same as using mean imputation in linear models, thus resulting in biased estimates

Problem of Mean Imputation (1)
The following example randomly generates 100 observations from the normal distribution, N(20,2), so that x and y are perfectly correlated
The example shows what happens if 20% of the data are randomly removed (and thus are missing completely at random) and mean imputation is used in a regression model

Problem of Mean Imputation (2): Missing on x
I now remove values of x for 20 observations, replacing them with the mean of x

Problem of Mean Imputation (3): Missing on x
[Figure: Density of x, comparing All cases with Mean imputation; N = 100]

Problem of Mean Imputation (4): Missing on x
The mean imputation does not affect the slope, but it has pulled the intercept downwards
More importantly, because there is less variation in x, the standard errors will be larger
[Figure: scatterplot titled "Mean imputation (x)"; axes: yall against xmean]

Problem of Mean Imputation (4): Missing on y
We now randomly replace 20 y-values with the mean of y but retain all values of x
The mean imputation affects the slope and the intercept, resulting in biased estimates
[Figure: scatterplot titled "Mean imputation (y)"; axes: ymean against xall]
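The mean-imputation experiment can be re-created in outline as follows. The original R script is not available, so this sketch rests on several assumptions that are not confirmed by the slides: N(20,2) is read as mean 20 and standard deviation 2, "perfectly correlated" is taken to mean y = x, missing values are replaced with the mean of the observed cases, and the same 20 cases are used for both the x and the y experiment. The function names are hypothetical.

```python
import random


def ols(x, y):
    """Least-squares intercept and slope for a simple regression of y on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    return my - slope * mx, slope


def mean_impute(values, missing):
    """Replace the entries indexed by `missing` with the mean of the rest."""
    observed = [v for i, v in enumerate(values) if i not in missing]
    m = sum(observed) / len(observed)
    return [m if i in missing else v for i, v in enumerate(values)]


def simulate(seed=1, n=100, share=0.2):
    """Fit the regression on complete data, with x imputed, and with y imputed."""
    rng = random.Random(seed)
    x = [rng.gauss(20, 2) for _ in range(n)]   # assumption: sd = 2
    y = list(x)                                # assumption: y = x exactly
    missing = set(rng.sample(range(n), round(share * n)))
    return {
        "complete": ols(x, y),
        "impute_x": ols(mean_impute(x, missing), y),
        "impute_y": ols(x, mean_impute(y, missing)),
    }


if __name__ == "__main__":
    for label, (intercept, slope) in simulate().items():
        print(f"{label:9s} intercept = {intercept:7.3f}, slope = {slope:.3f}")
```

Under these assumptions the slope is unchanged when x is imputed, while imputing y shrinks the slope towards zero, which matches the pattern described on the slides.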

Summary and Conclusions
Additive models give a compromise between the ease of interpretation of the linear model and the flexibility of the general nonparametric model
Complicated nonlinearity problems can be easily accommodated, even for models with many independent variables
The effects in the model represent partial effects in the same way as coefficients in linear models
These models should be seen as important models on their own, but they also can play an important role in diagnosing nonlinearity even if the final model chosen is a regular linear model

Summary and Conclusions (2)
These models are extremely flexible in that both nonparametric and parametric trends can be specified
Moreover, even interactions between explanatory variables are possible
Caution: Since GAMs effectively use mean imputation for missing data (rather than list-wise deletion as in linear models), we must be especially careful to deal appropriately with missing data before fitting the model
Mean imputation can result in biased estimates
Finally, as we shall see tomorrow, these models can be extended to accommodate limited dependent variables in the same way that generalized linear models extend the general linear model

More information

Graphics before and after model fitting. Nicholas J. Cox University of Durham.

Graphics before and after model fitting. Nicholas J. Cox University of Durham. Graphics before and after model fitting Nicholas J. Cox University of Durham n.j.cox@durham.ac.uk 1 It is commonplace to compute various flavours of residual and predicted values after fitting many different

More information

CHAPTER 7 EXAMPLES: MIXTURE MODELING WITH CROSS- SECTIONAL DATA

CHAPTER 7 EXAMPLES: MIXTURE MODELING WITH CROSS- SECTIONAL DATA Examples: Mixture Modeling With Cross-Sectional Data CHAPTER 7 EXAMPLES: MIXTURE MODELING WITH CROSS- SECTIONAL DATA Mixture modeling refers to modeling with categorical latent variables that represent

More information

Big Data Methods. Chapter 5: Machine learning. Big Data Methods, Chapter 5, Slide 1

Big Data Methods. Chapter 5: Machine learning. Big Data Methods, Chapter 5, Slide 1 Big Data Methods Chapter 5: Machine learning Big Data Methods, Chapter 5, Slide 1 5.1 Introduction to machine learning What is machine learning? Concerned with the study and development of algorithms that

More information

Building Better Parametric Cost Models

Building Better Parametric Cost Models Building Better Parametric Cost Models Based on the PMI PMBOK Guide Fourth Edition 37 IPDI has been reviewed and approved as a provider of project management training by the Project Management Institute

More information

Applied Regression Modeling: A Business Approach

Applied Regression Modeling: A Business Approach i Applied Regression Modeling: A Business Approach Computer software help: SPSS SPSS (originally Statistical Package for the Social Sciences ) is a commercial statistical software package with an easy-to-use

More information

Opening Windows into the Black Box

Opening Windows into the Black Box Opening Windows into the Black Box Yu-Sung Su, Andrew Gelman, Jennifer Hill and Masanao Yajima Columbia University, Columbia University, New York University and University of California at Los Angels July

More information

Example 1 of panel data : Data for 6 airlines (groups) over 15 years (time periods) Example 1

Example 1 of panel data : Data for 6 airlines (groups) over 15 years (time periods) Example 1 Panel data set Consists of n entities or subjects (e.g., firms and states), each of which includes T observations measured at 1 through t time period. total number of observations : nt Panel data have

More information

Minitab 18 Feature List

Minitab 18 Feature List Minitab 18 Feature List * New or Improved Assistant Measurement systems analysis * Capability analysis Graphical analysis Hypothesis tests Regression DOE Control charts * Graphics Scatterplots, matrix

More information

An introduction to SPSS

An introduction to SPSS An introduction to SPSS To open the SPSS software using U of Iowa Virtual Desktop... Go to https://virtualdesktop.uiowa.edu and choose SPSS 24. Contents NOTE: Save data files in a drive that is accessible

More information

Variable selection is intended to select the best subset of predictors. But why bother?

Variable selection is intended to select the best subset of predictors. But why bother? Chapter 10 Variable Selection Variable selection is intended to select the best subset of predictors. But why bother? 1. We want to explain the data in the simplest way redundant predictors should be removed.

More information

SPSS INSTRUCTION CHAPTER 9

SPSS INSTRUCTION CHAPTER 9 SPSS INSTRUCTION CHAPTER 9 Chapter 9 does no more than introduce the repeated-measures ANOVA, the MANOVA, and the ANCOVA, and discriminant analysis. But, you can likely envision how complicated it can

More information

Using the DATAMINE Program

Using the DATAMINE Program 6 Using the DATAMINE Program 304 Using the DATAMINE Program This chapter serves as a user s manual for the DATAMINE program, which demonstrates the algorithms presented in this book. Each menu selection

More information

One way ANOVA when the data are not normally distributed (The Kruskal-Wallis test).

One way ANOVA when the data are not normally distributed (The Kruskal-Wallis test). One way ANOVA when the data are not normally distributed (The Kruskal-Wallis test). Suppose you have a one way design, and want to do an ANOVA, but discover that your data are seriously not normal? Just

More information

LISA: Explore JMP Capabilities in Design of Experiments. Liaosa Xu June 21, 2012

LISA: Explore JMP Capabilities in Design of Experiments. Liaosa Xu June 21, 2012 LISA: Explore JMP Capabilities in Design of Experiments Liaosa Xu June 21, 2012 Course Outline Why We Need Custom Design The General Approach JMP Examples Potential Collinearity Issues Prior Design Evaluations

More information

8. MINITAB COMMANDS WEEK-BY-WEEK

8. MINITAB COMMANDS WEEK-BY-WEEK 8. MINITAB COMMANDS WEEK-BY-WEEK In this section of the Study Guide, we give brief information about the Minitab commands that are needed to apply the statistical methods in each week s study. They are

More information

Generalized least squares (GLS) estimates of the level-2 coefficients,

Generalized least squares (GLS) estimates of the level-2 coefficients, Contents 1 Conceptual and Statistical Background for Two-Level Models...7 1.1 The general two-level model... 7 1.1.1 Level-1 model... 8 1.1.2 Level-2 model... 8 1.2 Parameter estimation... 9 1.3 Empirical

More information

The linear mixed model: modeling hierarchical and longitudinal data

The linear mixed model: modeling hierarchical and longitudinal data The linear mixed model: modeling hierarchical and longitudinal data Analysis of Experimental Data AED The linear mixed model: modeling hierarchical and longitudinal data 1 of 44 Contents 1 Modeling Hierarchical

More information

Fitting to a set of data. Lecture on fitting

Fitting to a set of data. Lecture on fitting Fitting to a set of data Lecture on fitting Linear regression Linear regression Residual is the amount difference between a real data point and a modeled data point Fitting a polynomial to data Could use

More information

Model selection and validation 1: Cross-validation

Model selection and validation 1: Cross-validation Model selection and validation 1: Cross-validation Ryan Tibshirani Data Mining: 36-462/36-662 March 26 2013 Optional reading: ISL 2.2, 5.1, ESL 7.4, 7.10 1 Reminder: modern regression techniques Over the

More information

Introduction to machine learning, pattern recognition and statistical data modelling Coryn Bailer-Jones

Introduction to machine learning, pattern recognition and statistical data modelling Coryn Bailer-Jones Introduction to machine learning, pattern recognition and statistical data modelling Coryn Bailer-Jones What is machine learning? Data interpretation describing relationship between predictors and responses

More information

Frequently Asked Questions Updated 2006 (TRIM version 3.51) PREPARING DATA & RUNNING TRIM

Frequently Asked Questions Updated 2006 (TRIM version 3.51) PREPARING DATA & RUNNING TRIM Frequently Asked Questions Updated 2006 (TRIM version 3.51) PREPARING DATA & RUNNING TRIM * Which directories are used for input files and output files? See menu-item "Options" and page 22 in the manual.

More information

Introduction to mixed-effects regression for (psycho)linguists

Introduction to mixed-effects regression for (psycho)linguists Introduction to mixed-effects regression for (psycho)linguists Martijn Wieling Department of Humanities Computing, University of Groningen Groningen, April 21, 2015 1 Martijn Wieling Introduction to mixed-effects

More information

ANNOUNCING THE RELEASE OF LISREL VERSION BACKGROUND 2 COMBINING LISREL AND PRELIS FUNCTIONALITY 2 FIML FOR ORDINAL AND CONTINUOUS VARIABLES 3

ANNOUNCING THE RELEASE OF LISREL VERSION BACKGROUND 2 COMBINING LISREL AND PRELIS FUNCTIONALITY 2 FIML FOR ORDINAL AND CONTINUOUS VARIABLES 3 ANNOUNCING THE RELEASE OF LISREL VERSION 9.1 2 BACKGROUND 2 COMBINING LISREL AND PRELIS FUNCTIONALITY 2 FIML FOR ORDINAL AND CONTINUOUS VARIABLES 3 THREE-LEVEL MULTILEVEL GENERALIZED LINEAR MODELS 3 FOUR

More information

Multiple Imputation with Mplus

Multiple Imputation with Mplus Multiple Imputation with Mplus Tihomir Asparouhov and Bengt Muthén Version 2 September 29, 2010 1 1 Introduction Conducting multiple imputation (MI) can sometimes be quite intricate. In this note we provide

More information

Grade 9 Math Terminology

Grade 9 Math Terminology Unit 1 Basic Skills Review BEDMAS a way of remembering order of operations: Brackets, Exponents, Division, Multiplication, Addition, Subtraction Collect like terms gather all like terms and simplify as

More information

Frequency Distributions

Frequency Distributions Displaying Data Frequency Distributions After collecting data, the first task for a researcher is to organize and summarize the data so that it is possible to get a general overview of the results. Remember,

More information

MS&E 226: Small Data

MS&E 226: Small Data MS&E 226: Small Data Lecture 13: The bootstrap (v3) Ramesh Johari ramesh.johari@stanford.edu 1 / 30 Resampling 2 / 30 Sampling distribution of a statistic For this lecture: There is a population model

More information

Multivariate Analysis

Multivariate Analysis Multivariate Analysis Project 1 Jeremy Morris February 20, 2006 1 Generating bivariate normal data Definition 2.2 from our text states that we can transform a sample from a standard normal random variable

More information

More advanced use of mgcv. Simon Wood Mathematical Sciences, University of Bath, U.K.

More advanced use of mgcv. Simon Wood Mathematical Sciences, University of Bath, U.K. More advanced use of mgcv Simon Wood Mathematical Sciences, University of Bath, U.K. Fine control of smoothness: gamma Suppose that we fit a model but a component is too wiggly. For GCV/AIC we can increase

More information

Bluman & Mayer, Elementary Statistics, A Step by Step Approach, Canadian Edition

Bluman & Mayer, Elementary Statistics, A Step by Step Approach, Canadian Edition Bluman & Mayer, Elementary Statistics, A Step by Step Approach, Canadian Edition Online Learning Centre Technology Step-by-Step - Minitab Minitab is a statistical software application originally created

More information

Programming Exercise 5: Regularized Linear Regression and Bias v.s. Variance

Programming Exercise 5: Regularized Linear Regression and Bias v.s. Variance Programming Exercise 5: Regularized Linear Regression and Bias v.s. Variance Machine Learning May 13, 212 Introduction In this exercise, you will implement regularized linear regression and use it to study

More information

CS 521 Data Mining Techniques Instructor: Abdullah Mueen

CS 521 Data Mining Techniques Instructor: Abdullah Mueen CS 521 Data Mining Techniques Instructor: Abdullah Mueen LECTURE 2: DATA TRANSFORMATION AND DIMENSIONALITY REDUCTION Chapter 3: Data Preprocessing Data Preprocessing: An Overview Data Quality Major Tasks

More information

4.5 The smoothed bootstrap

4.5 The smoothed bootstrap 4.5. THE SMOOTHED BOOTSTRAP 47 F X i X Figure 4.1: Smoothing the empirical distribution function. 4.5 The smoothed bootstrap In the simple nonparametric bootstrap we have assumed that the empirical distribution

More information

University of Florida CISE department Gator Engineering. Clustering Part 2

University of Florida CISE department Gator Engineering. Clustering Part 2 Clustering Part 2 Dr. Sanjay Ranka Professor Computer and Information Science and Engineering University of Florida, Gainesville Partitional Clustering Original Points A Partitional Clustering Hierarchical

More information

IQR = number. summary: largest. = 2. Upper half: Q3 =

IQR = number. summary: largest. = 2. Upper half: Q3 = Step by step box plot Height in centimeters of players on the 003 Women s Worldd Cup soccer team. 157 1611 163 163 164 165 165 165 168 168 168 170 170 170 171 173 173 175 180 180 Determine the 5 number

More information

Error Analysis, Statistics and Graphing

Error Analysis, Statistics and Graphing Error Analysis, Statistics and Graphing This semester, most of labs we require us to calculate a numerical answer based on the data we obtain. A hard question to answer in most cases is how good is your

More information

Weka ( )

Weka (  ) Weka ( http://www.cs.waikato.ac.nz/ml/weka/ ) The phases in which classifier s design can be divided are reflected in WEKA s Explorer structure: Data pre-processing (filtering) and representation Supervised

More information

Assignment 4 (Sol.) Introduction to Data Analytics Prof. Nandan Sudarsanam & Prof. B. Ravindran

Assignment 4 (Sol.) Introduction to Data Analytics Prof. Nandan Sudarsanam & Prof. B. Ravindran Assignment 4 (Sol.) Introduction to Data Analytics Prof. andan Sudarsanam & Prof. B. Ravindran 1. Which among the following techniques can be used to aid decision making when those decisions depend upon

More information

Spatial Interpolation & Geostatistics

Spatial Interpolation & Geostatistics (Z i Z j ) 2 / 2 Spatial Interpolation & Geostatistics Lag Lag Mean Distance between pairs of points 1 Tobler s Law All places are related, but nearby places are related more than distant places Corollary:

More information

Graphical Analysis of Data using Microsoft Excel [2016 Version]

Graphical Analysis of Data using Microsoft Excel [2016 Version] Graphical Analysis of Data using Microsoft Excel [2016 Version] Introduction In several upcoming labs, a primary goal will be to determine the mathematical relationship between two variable physical parameters.

More information

Chapter 4: Analyzing Bivariate Data with Fathom

Chapter 4: Analyzing Bivariate Data with Fathom Chapter 4: Analyzing Bivariate Data with Fathom Summary: Building from ideas introduced in Chapter 3, teachers continue to analyze automobile data using Fathom to look for relationships between two quantitative

More information

Estimation of Item Response Models

Estimation of Item Response Models Estimation of Item Response Models Lecture #5 ICPSR Item Response Theory Workshop Lecture #5: 1of 39 The Big Picture of Estimation ESTIMATOR = Maximum Likelihood; Mplus Any questions? answers Lecture #5:

More information

Doubly Cyclic Smoothing Splines and Analysis of Seasonal Daily Pattern of CO2 Concentration in Antarctica

Doubly Cyclic Smoothing Splines and Analysis of Seasonal Daily Pattern of CO2 Concentration in Antarctica Boston-Keio Workshop 2016. Doubly Cyclic Smoothing Splines and Analysis of Seasonal Daily Pattern of CO2 Concentration in Antarctica... Mihoko Minami Keio University, Japan August 15, 2016 Joint work with

More information

CPSC 340: Machine Learning and Data Mining. More Regularization Fall 2017

CPSC 340: Machine Learning and Data Mining. More Regularization Fall 2017 CPSC 340: Machine Learning and Data Mining More Regularization Fall 2017 Assignment 3: Admin Out soon, due Friday of next week. Midterm: You can view your exam during instructor office hours or after class

More information

Section 4: Analyzing Bivariate Data with Fathom

Section 4: Analyzing Bivariate Data with Fathom Section 4: Analyzing Bivariate Data with Fathom Summary: Building from ideas introduced in Section 3, teachers continue to analyze automobile data using Fathom to look for relationships between two quantitative

More information

Clustering and Visualisation of Data

Clustering and Visualisation of Data Clustering and Visualisation of Data Hiroshi Shimodaira January-March 28 Cluster analysis aims to partition a data set into meaningful or useful groups, based on distances between data points. In some

More information

EXAMINATIONS OF THE ROYAL STATISTICAL SOCIETY

EXAMINATIONS OF THE ROYAL STATISTICAL SOCIETY EXAMINATIONS OF THE ROYAL STATISTICAL SOCIETY GRADUATE DIPLOMA, 2015 MODULE 4 : Modelling experimental data Time allowed: Three hours Candidates should answer FIVE questions. All questions carry equal

More information

How to use FSBforecast Excel add in for regression analysis

How to use FSBforecast Excel add in for regression analysis How to use FSBforecast Excel add in for regression analysis FSBforecast is an Excel add in for data analysis and regression that was developed here at the Fuqua School of Business over the last 3 years

More information

predict and Friends: Common Methods for Predictive Models in R , Spring 2015 Handout No. 1, 25 January 2015

predict and Friends: Common Methods for Predictive Models in R , Spring 2015 Handout No. 1, 25 January 2015 predict and Friends: Common Methods for Predictive Models in R 36-402, Spring 2015 Handout No. 1, 25 January 2015 R has lots of functions for working with different sort of predictive models. This handout

More information

BIOL 458 BIOMETRY Lab 10 - Multiple Regression

BIOL 458 BIOMETRY Lab 10 - Multiple Regression BIOL 458 BIOMETRY Lab 10 - Multiple Regression Many problems in science involve the analysis of multi-variable data sets. For data sets in which there is a single continuous dependent variable, but several

More information

Nonparametric Testing

Nonparametric Testing Nonparametric Testing in Excel By Mark Harmon Copyright 2011 Mark Harmon No part of this publication may be reproduced or distributed without the express permission of the author. mark@excelmasterseries.com

More information

Robust Linear Regression (Passing- Bablok Median-Slope)

Robust Linear Regression (Passing- Bablok Median-Slope) Chapter 314 Robust Linear Regression (Passing- Bablok Median-Slope) Introduction This procedure performs robust linear regression estimation using the Passing-Bablok (1988) median-slope algorithm. Their

More information

The Truth behind PGA Tour Player Scores

The Truth behind PGA Tour Player Scores The Truth behind PGA Tour Player Scores Sukhyun Sean Park, Dong Kyun Kim, Ilsung Lee May 7, 2016 Abstract The main aim of this project is to analyze the variation in a dataset that is obtained from the

More information

Data Analysis Multiple Regression

Data Analysis Multiple Regression Introduction Visual-XSel 14.0 is both, a powerful software to create a DoE (Design of Experiment) as well as to evaluate the results, or historical data. After starting the software, the main guide shows

More information

[1] CURVE FITTING WITH EXCEL

[1] CURVE FITTING WITH EXCEL 1 Lecture 04 February 9, 2010 Tuesday Today is our third Excel lecture. Our two central themes are: (1) curve-fitting, and (2) linear algebra (matrices). We will have a 4 th lecture on Excel to further

More information

Chapter 2 Modeling Distributions of Data

Chapter 2 Modeling Distributions of Data Chapter 2 Modeling Distributions of Data Section 2.1 Describing Location in a Distribution Describing Location in a Distribution Learning Objectives After this section, you should be able to: FIND and

More information

GAMs, GAMMs and other penalized GLMs using mgcv in R. Simon Wood Mathematical Sciences, University of Bath, U.K.

GAMs, GAMMs and other penalized GLMs using mgcv in R. Simon Wood Mathematical Sciences, University of Bath, U.K. GAMs, GAMMs and other penalied GLMs using mgcv in R Simon Wood Mathematical Sciences, University of Bath, U.K. Simple eample Consider a very simple dataset relating the timber volume of cherry trees to

More information

Linear and Quadratic Least Squares

Linear and Quadratic Least Squares Linear and Quadratic Least Squares Prepared by Stephanie Quintal, graduate student Dept. of Mathematical Sciences, UMass Lowell in collaboration with Marvin Stick Dept. of Mathematical Sciences, UMass

More information

Linear Regression and Regression Trees. Avinash Kak Purdue University. May 12, :41am. An RVL Tutorial Presentation Presented on April 29, 2016

Linear Regression and Regression Trees. Avinash Kak Purdue University. May 12, :41am. An RVL Tutorial Presentation Presented on April 29, 2016 Linear Regression and Regression Trees Avinash Kak Purdue University May 12, 2016 10:41am An RVL Tutorial Presentation Presented on April 29, 2016 c 2016 Avinash Kak, Purdue University 1 CONTENTS Page

More information

NCSS Statistical Software. Robust Regression

NCSS Statistical Software. Robust Regression Chapter 308 Introduction Multiple regression analysis is documented in Chapter 305 Multiple Regression, so that information will not be repeated here. Refer to that chapter for in depth coverage of multiple

More information

NCSS Statistical Software

NCSS Statistical Software Chapter 327 Geometric Regression Introduction Geometric regression is a special case of negative binomial regression in which the dispersion parameter is set to one. It is similar to regular multiple regression

More information

An introduction to plotting data

An introduction to plotting data An introduction to plotting data Eric D. Black California Institute of Technology February 25, 2014 1 Introduction Plotting data is one of the essential skills every scientist must have. We use it on a

More information

Using HLM for Presenting Meta Analysis Results. R, C, Gardner Department of Psychology

Using HLM for Presenting Meta Analysis Results. R, C, Gardner Department of Psychology Data_Analysis.calm: dacmeta Using HLM for Presenting Meta Analysis Results R, C, Gardner Department of Psychology The primary purpose of meta analysis is to summarize the effect size results from a number

More information

Concept of Curve Fitting Difference with Interpolation

Concept of Curve Fitting Difference with Interpolation Curve Fitting Content Concept of Curve Fitting Difference with Interpolation Estimation of Linear Parameters by Least Squares Curve Fitting by Polynomial Least Squares Estimation of Non-linear Parameters

More information

Econometric Tools 1: Non-Parametric Methods

Econometric Tools 1: Non-Parametric Methods University of California, Santa Cruz Department of Economics ECON 294A (Fall 2014) - Stata Lab Instructor: Manuel Barron 1 Econometric Tools 1: Non-Parametric Methods 1 Introduction This lecture introduces

More information

Time Series Analysis DM 2 / A.A

Time Series Analysis DM 2 / A.A DM 2 / A.A. 2010-2011 Time Series Analysis Several slides are borrowed from: Han and Kamber, Data Mining: Concepts and Techniques Mining time-series data Lei Chen, Similarity Search Over Time-Series Data

More information

Package bgeva. May 19, 2017

Package bgeva. May 19, 2017 Version 0.3-1 Package bgeva May 19, 2017 Author Giampiero Marra, Raffaella Calabrese and Silvia Angela Osmetti Maintainer Giampiero Marra Title Binary Generalized Extreme Value

More information