Regression Dr. G. Bharadwaja Kumar VIT Chennai
Introduction Statistical models normally specify how one set of variables, called dependent variables, functionally depends on another set of variables, called independent variables.
Terminology
A variable may influence the dependent or independent variables without itself being the focus of the experiment. Such a variable is kept constant or monitored to minimize its effect on the experiment, and is called a "control variable" or "extraneous variable".
Regression In statistics, regression analysis is a statistical process for estimating the relationships among variables. It includes many techniques for modeling and analyzing several variables, when the focus is on the relationship between a dependent variable and one or more independent variables.
Regression The terms "dependent" and "independent" here have no direct relation to statistical dependence of variables or events. The term "(in)dependent" reflects only the functional relationship between variables within a model.
Regression Regression is the attempt to explain the variation in a dependent variable using the variation in independent variables. Regression is thus an explanation of causation. If the independent variable(s) sufficiently explain the variation in the dependent variable, the model can be used for prediction.
Simple Linear Regression In simple linear regression, there is one dependent variable, which you are trying to explain using one independent variable. The relationship can be expressed as a linear equation, such as: y = a + bx
yi = a + b·xi
- y is the dependent variable
- x is the independent variable
- a is a constant (the intercept)
- b is the slope of the line
For every increase of 1 in x, y changes by an amount equal to b. Some relationships are perfectly linear and fit this equation exactly.
Simple Linear Regression
Table: Age and systolic blood pressure (SBP, mmHg) among 33 adult women

Age SBP   Age SBP   Age SBP
22  131   41  139   52  128
23  128   41  171   54  105
24  116   46  137   56  145
27  106   47  111   57  141
28  114   48  115   58  153
29  123   49  133   59  157
30  117   49  128   63  155
32  122   50  183   67  176
33   99   51  130   71  172
35  121   51  133   77  178
40  147   51  144   81  217
Simple Linear Regression
The most common method for fitting a regression line is the method of least squares. This method calculates the best-fitting line for the observed data by minimizing the sum of the squares of the vertical deviations from each data point to the line.
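A minimal sketch of this closed-form least-squares fit, using NumPy and a subset of the (age, SBP) pairs from the table above; the variable names are illustrative:

```python
import numpy as np

# A few (age, SBP) pairs taken from the table above
age = np.array([22, 30, 41, 49, 56, 63, 71, 81])
sbp = np.array([131, 117, 139, 128, 145, 155, 172, 217])

# Closed-form least-squares estimates:
# b1 = sum((x - x_bar)(y - y_bar)) / sum((x - x_bar)^2),  b0 = y_bar - b1 * x_bar
x_bar, y_bar = age.mean(), sbp.mean()
b1 = np.sum((age - x_bar) * (sbp - y_bar)) / np.sum((age - x_bar) ** 2)
b0 = y_bar - b1 * x_bar

print(f"fitted line: SBP = {b0:.2f} + {b1:.2f} * age")
```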
Simple Linear Regression
[Figure: scatter plot of the dependent variable (y) against the independent variable (x), with the fitted line y = b0 + b1x ± ε, where b0 is the y-intercept and b1 = slope = Δy/Δx]
The output of a regression is a function that predicts the dependent variable based upon values of the independent variables. Simple regression fits a straight line to the data.
Simple Linear Regression
[Figure: scatter plot showing, for one data point, the observation y and the prediction ŷ on the fitted line]
The function will make a prediction for each observed data point. The observation is denoted by y and the prediction is denoted by ŷ.
Regression
[Figure: fitted line with the vertical prediction errors from each data point]
A least squares regression selects the line with the lowest total sum of squared prediction errors. This value is called the Sum of Squares of Error, or SSE.
Regression Formulas
For the fitted line ŷ = b0 + b1x, the least-squares estimates are:
b1 = Σ(xi − x̄)(yi − ȳ) / Σ(xi − x̄)²
b0 = ȳ − b1·x̄
where x̄ and ȳ are the sample means of x and y.
The Coefficient of Determination
R² = 1 − SSE/SST, where SST = Σ(yi − ȳ)² is the total sum of squares. R² is the proportion of the variation in the dependent variable explained by the regression, and ranges from 0 to 1.
Standard Error of Regression
s = √(SSE / (n − 2)) estimates the typical size of a prediction error; the divisor n − 2 reflects the two estimated parameters, b0 and b1.
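A self-contained sketch computing SSE, R², and the standard error for the age/SBP fit above (np.polyfit is used here as a shortcut for the least-squares estimates):

```python
import numpy as np

age = np.array([22, 30, 41, 49, 56, 63, 71, 81])
sbp = np.array([131, 117, 139, 128, 145, 155, 172, 217])

b1, b0 = np.polyfit(age, sbp, 1)       # least-squares slope and intercept
pred = b0 + b1 * age                   # predictions y-hat
sse = np.sum((sbp - pred) ** 2)        # Sum of Squares of Error (SSE)
sst = np.sum((sbp - sbp.mean()) ** 2)  # total sum of squares (SST)
r2 = 1 - sse / sst                     # coefficient of determination R^2
s = np.sqrt(sse / (len(age) - 2))      # standard error of regression

print(f"R^2 = {r2:.3f}, standard error = {s:.2f} mmHg")
```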
Assumptions
Weak exogeneity: the predictor variables x can be treated as fixed values rather than random variables. This means, for example, that the predictor variables are assumed to be error-free.
Constant variance (aka homoscedasticity): the errors of the response variable have the same variance, regardless of the values of the predictor variables.
Assumptions
Independence of errors: the errors of the response variables are uncorrelated with each other.
Lack of multicollinearity in the predictors: for standard least squares estimation methods, the design matrix X must have full column rank p; otherwise, we have a condition known as multicollinearity in the predictor variables.
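A quick way to check for exact collinearity is to compare the rank of the design matrix with its number of columns; a minimal sketch with NumPy, using a made-up matrix whose third column is an exact multiple of the second:

```python
import numpy as np

# Hypothetical design matrix: column 3 is exactly 2x column 2,
# so X does not have full column rank (exact multicollinearity).
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
X = np.column_stack([np.ones(5), x1, 2 * x1])

rank = np.linalg.matrix_rank(X)
p = X.shape[1]
print(f"rank = {rank}, columns = {p}, full column rank: {rank == p}")
```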
Collinearity
Collinearity (or multicollinearity) occurs when two or more predictor variables are strongly linearly related, so that any one of them can be predicted almost exactly from the others.
Problem with multicollinearity
The least squares estimates will have big standard errors; this is the main problem with multicollinearity. We are trying to estimate the marginal effect of an independent variable holding the other independent variables constant, but the strong linear relationship among the independent variables makes this difficult: we always see them move together. That is, there is very little information in the data about the thing we are trying to estimate. Consequently, we cannot estimate it very precisely: the standard errors are large.
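A small simulation illustrating the inflated standard errors (synthetic data; statsmodels assumed available):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
noise = rng.normal(size=n)

for corr, label in [(0.0, "uncorrelated"), (0.99, "highly correlated")]:
    # x2 is constructed to have the chosen correlation with x1
    x2 = corr * x1 + np.sqrt(1 - corr**2) * rng.normal(size=n)
    y = 1.0 + 2.0 * x1 + 2.0 * x2 + noise
    X = sm.add_constant(np.column_stack([x1, x2]))
    res = sm.OLS(y, X).fit()
    print(f"{label}: std errors of the slopes = {res.bse[1:].round(3)}")
```

With the same noise level, the slope standard errors in the highly correlated case come out several times larger than in the uncorrelated case.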
Anscombe's quartet
Anscombe's quartet is four datasets with nearly identical summary statistics (means, variances, correlation, and fitted regression line) that nevertheless look completely different when plotted. It illustrates why data should be visualized before fitting a regression model.
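A sketch verifying this for two of the four datasets, using Anscombe's published values (set II is the one that is actually a smooth curve):

```python
import numpy as np

x = np.array([10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5], dtype=float)
y1 = np.array([8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68])
y2 = np.array([9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74])

for name, y in [("I", y1), ("II", y2)]:
    b1, b0 = np.polyfit(x, y, 1)          # least-squares slope and intercept
    r = np.corrcoef(x, y)[0, 1]           # correlation coefficient
    print(f"set {name}: mean(y) = {y.mean():.2f}, r = {r:.3f}, "
          f"line: y = {b0:.2f} + {b1:.3f}x")
```

Both sets give mean(y) ≈ 7.50, r ≈ 0.816, and the same fitted line y = 3.00 + 0.50x, despite having very different shapes.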
Multiple Linear Regression
Multiple linear regression simultaneously considers the influence of multiple explanatory variables on a response variable y:
y = α + β1x1 + β2x2 + ... + βixi
Partial regression coefficient βi: the amount by which y changes on average when xi changes by one unit and all the other x's remain constant; it measures the association between xi and y adjusted for all other x's.
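A minimal sketch of a multiple regression fit with two explanatory variables (synthetic data; the variable names are made up for illustration, and statsmodels is assumed available):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 100
x1 = rng.uniform(0, 10, n)                        # first explanatory variable
x2 = rng.uniform(0, 5, n)                         # second explanatory variable
y = 100 + 1.5 * x1 + 3.0 * x2 + rng.normal(0, 2, n)

X = sm.add_constant(np.column_stack([x1, x2]))    # design matrix with intercept
res = sm.OLS(y, X).fit()
print("alpha, beta1, beta2 =", res.params.round(2))
```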
Why use logistic regression?
There are many important research topics for which the dependent variable is "limited": for example, whether or not a person smokes, drinks, skips class, or takes advanced mathematics. For these, the outcome is not continuous or normally distributed. Example: are mothers who have a high school education less likely to have children with IEPs (individualized education plans, indicating cognitive or emotional disabilities)? Binary logistic regression is a type of regression analysis where the dependent variable is a dummy variable: coded 0 (did not smoke) or 1 (did smoke).
Logistic Regression Logistic regression analysis requires that the dependent variable be dichotomous (such as presence/absence or success/failure) Logistic regression analysis requires that the independent variables be metric or dichotomous.
Logistic Regression A variable is metric if we can measure the size of the difference between any two of its values. A dichotomous variable can take the value 1 with probability of success q, or the value 0 with probability of failure 1 − q. This type of variable is called a Bernoulli (or binary) variable.
Logistic regression does not make any assumptions of normality, linearity, or homogeneity of variance for the independent variables. Because it does not impose these requirements, it is preferred to discriminant analysis when the data do not satisfy these assumptions.
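A minimal sketch of a binary logistic regression fit (synthetic smoking data; the coefficients and variable names are made up, and statsmodels is assumed available):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
n = 500
years_edu = rng.uniform(8, 20, n)        # metric independent variable
log_odds = -0.5 * (years_edu - 14)       # hypothetical true log-odds
p = 1 / (1 + np.exp(-log_odds))
smokes = rng.binomial(1, p)              # dichotomous outcome: 0 or 1

X = sm.add_constant(years_edu)
res = sm.Logit(smokes, X).fit(disp=False)
print("intercept, slope =", res.params.round(3))
```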
Maximum likelihood (ML) is a way of finding the smallest possible deviance between the observed and predicted values (kind of like finding the best-fitting line) using calculus (derivatives, specifically). With ML, the computer uses different "iterations" in which it tries different solutions until it gets the smallest possible deviance, or best fit. Once it has found the best solution, it provides a final value for the deviance, which is usually referred to as "negative two log likelihood" (−2LL).
The deviance statistic is called −2LL by Cohen et al., and it can be thought of as a chi-square value. We compare the deviance of the intercept-only model (−2LLnull, the −2LL of the constant-only model) to the deviance when the new predictor or predictors have been added (−2LLk, the −2LL of the model that has k predictors). The difference between these two deviance values is often referred to as G, for goodness of fit.
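A sketch of this likelihood-ratio comparison on synthetic data (statsmodels and scipy assumed available; llf is the model log-likelihood, so −2·llf is the deviance in the sense used above):

```python
import numpy as np
import statsmodels.api as sm
from scipy.stats import chi2

rng = np.random.default_rng(3)
n = 500
x = rng.normal(size=n)
y = rng.binomial(1, 1 / (1 + np.exp(-0.8 * x)))

null = sm.Logit(y, np.ones((n, 1))).fit(disp=False)      # constant-only model
full = sm.Logit(y, sm.add_constant(x)).fit(disp=False)   # model with k = 1 predictor

G = (-2 * null.llf) - (-2 * full.llf)   # difference of the two deviances
p_value = chi2.sf(G, df=1)              # compare G to chi-square with k = 1 df
print(f"G = {G:.2f}, p = {p_value:.4f}")
```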
Count Data In statistics, count data is a data type in which the observations can take only nonnegative integer values, and these integers arise from counting rather than ranking.
The statistical treatment of count data is distinct from that of binary data, in which the observations can take only two values, and from ordinal data, which may also consist of integers but where the individual values fall on an arbitrary scale and only the relative ranking is important.
An individual piece of count data is often termed a count variable. When such a variable is treated as a random variable, the Poisson, binomial and negative binomial distributions are commonly used to represent its distribution.
Regression Models with Count Data
- Poisson Regression
- Negative Binomial Regression
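A minimal sketch of a Poisson regression fit (synthetic count data; statsmodels assumed available):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(4)
n = 300
x = rng.uniform(0, 2, n)
mu = np.exp(0.5 + 1.0 * x)     # log link: log(mu) = b0 + b1*x
counts = rng.poisson(mu)       # nonnegative integer outcomes

X = sm.add_constant(x)
res = sm.GLM(counts, X, family=sm.families.Poisson()).fit()
print("b0, b1 =", res.params.round(2))
```

Negative binomial regression follows the same pattern with family=sm.families.NegativeBinomial(), and is typically used when the counts are overdispersed relative to the Poisson.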