Regression. Dr. G. Bharadwaja Kumar, VIT Chennai


Introduction

Statistical models normally specify how one set of variables, called dependent variables, functionally depends on another set of variables, called independent variables.

Terminology

A variable that may alter the dependent or independent variables, but is not itself the focus of the experiment, is kept constant or monitored to minimize its effect on the experiment. Such a variable is called a "control variable" or "extraneous variable".

Regression

In statistics, regression analysis is a statistical process for estimating the relationships among variables. It includes many techniques for modeling and analyzing several variables when the focus is on the relationship between a dependent variable and one or more independent variables.

Regression

The terms "dependent" and "independent" here have no direct relation to the statistical dependence of variables or events. The term "(in)dependent" reflects only the functional relationship between variables within a model.

Regression

Regression is the attempt to explain the variation in a dependent variable using the variation in the independent variables. Regression thus aims at explanation, though a fitted model alone does not establish causation. If the independent variable(s) sufficiently explain the variation in the dependent variable, the model can be used for prediction.

Simple Linear Regression

In simple linear regression there is one dependent variable, the one you are trying to explain, and one independent variable. You can express the relationship as a linear equation:

y = a + bx

yi = a + b xi

y is the dependent variable
x is the independent variable
a is a constant (the intercept)
b is the slope of the line

For every increase of 1 in x, y changes by an amount equal to b. Some relationships are perfectly linear and fit this equation exactly.

Simple Linear Regression

Table: Age and systolic blood pressure (SBP) among 33 adult women

Age  SBP   Age  SBP   Age  SBP
22   131   41   139   52   128
23   128   41   171   54   105
24   116   46   137   56   145
27   106   47   111   57   141
28   114   48   115   58   153
29   123   49   133   59   157
30   117   49   128   63   155
32   122   50   183   67   176
33    99   51   130   71   172
35   121   51   133   77   178
40   147   51   144   81   217

Simple Linear Regression

The most common method for fitting a regression line is the method of least squares. This method calculates the best-fitting line for the observed data by minimizing the sum of the squares of the vertical deviations from each data point to the line.

Simple Linear Regression

[Figure: scatter of dependent variable (y) against independent variable (x) with the fitted line y = b0 + b1x ± ε; b0 is the y-intercept and b1 = Δy/Δx is the slope.]

The output of a regression is a function that predicts the dependent variable based upon values of the independent variables. Simple regression fits a straight line to the data.

Simple Linear Regression

[Figure: observed points y and their predictions ŷ along the fitted line.]

The function will make a prediction for each observed data point. The observation is denoted by y and the prediction by ŷ.


Regression

[Figure: fitted line with the squared prediction errors marked.]

A least squares regression selects the line with the lowest total sum of squared prediction errors. This value is called the Sum of Squares of Error, or SSE.
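As a minimal sketch (not from the slides), the least-squares line for the age/SBP data in the table above can be computed with NumPy; np.polyfit is one of several equivalent routines.

```python
import numpy as np

# Age and systolic blood pressure for the 33 women in the table above
age = np.array([22, 23, 24, 27, 28, 29, 30, 32, 33, 35, 40,
                41, 41, 46, 47, 48, 49, 49, 50, 51, 51, 51,
                52, 54, 56, 57, 58, 59, 63, 67, 71, 77, 81])
sbp = np.array([131, 128, 116, 106, 114, 123, 117, 122, 99, 121, 147,
                139, 171, 137, 111, 115, 133, 128, 183, 130, 133, 144,
                128, 105, 145, 141, 153, 157, 155, 176, 172, 178, 217])

# Fit y = b0 + b1*x by least squares (degree-1 polynomial fit);
# polyfit returns coefficients highest power first
b1, b0 = np.polyfit(age, sbp, 1)

# SSE: total squared vertical deviation of the data from the line
pred = b0 + b1 * age
sse = np.sum((sbp - pred) ** 2)

print(f"intercept b0 = {b0:.2f}, slope b1 = {b1:.2f}, SSE = {sse:.1f}")
```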

Regression Formulas
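The formulas on this slide did not survive transcription; the standard least-squares estimates for y = a + bx, which the surrounding slides assume, are:

$$ b = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n}(x_i - \bar{x})^2}, \qquad a = \bar{y} - b\,\bar{x} $$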

The Coefficient of Determination
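This slide's content is also missing from the transcript; the usual definition, consistent with the SSE introduced above, is:

$$ R^2 = 1 - \frac{SSE}{SST} = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2} $$

R^2 is the proportion of the variation in y explained by the regression.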

Standard Error of Regression
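Again the slide's formula is missing; for the simple linear model (two estimated parameters, hence n − 2 degrees of freedom) the standard error of the regression is usually defined as:

$$ s = \sqrt{\frac{SSE}{n-2}} $$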

Assumptions

Weak exogeneity: the predictor variables x can be treated as fixed values rather than random variables. This means, for example, that the predictor variables are assumed to be error-free.

Constant variance (aka homoscedasticity): different response variables have the same variance in their errors, regardless of the values of the predictor variables.

Assumptions

Independence of errors: the errors of the response variables are uncorrelated with each other.

Lack of multicollinearity in the predictors: for standard least squares estimation methods, the design matrix X must have full column rank p; otherwise we have a condition known as multicollinearity in the predictor variables.

Collinearity

Problem with multicollinearity

The least squares estimates will have big standard errors; this is the main problem with multicollinearity. We're trying to estimate the marginal effect of an independent variable holding the other independent variables constant, but a strong linear relationship among the independent variables makes this difficult: we always see them move together. That is, there is very little information in the data about the thing we're trying to estimate. Consequently, we can't estimate it very precisely, and the standard errors are large.
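A common diagnostic, not mentioned on the slide, is the variance inflation factor: regress each predictor on the others and compute VIF = 1/(1 − R²). A sketch with NumPy, using made-up predictor columns:

```python
import numpy as np

def vif(X):
    """Variance inflation factor for each column of predictor matrix X.

    VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing column j
    on all the other columns. Values far above ~10 are a common
    rule-of-thumb warning sign of multicollinearity.
    """
    n, p = X.shape
    out = []
    for j in range(p):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        A = np.column_stack([np.ones(n), others])     # add intercept
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)  # least squares
        resid = y - A @ coef
        r2 = 1 - resid @ resid / np.sum((y - y.mean()) ** 2)
        out.append(1.0 / (1.0 - r2))
    return np.array(out)

# Hypothetical example: x2 is nearly a linear function of x1
rng = np.random.default_rng(0)
x1 = rng.normal(size=100)
x2 = 2 * x1 + rng.normal(scale=0.05, size=100)   # almost collinear with x1
x3 = rng.normal(size=100)                        # independent
print(vif(np.column_stack([x1, x2, x3])))        # large VIFs for x1, x2
```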

Anscombe's quartet

Anscombe's quartet is four datasets with nearly identical summary statistics (means, variances, correlation, and fitted regression line) yet very different scatter plots, a classic warning to plot the data before trusting a regression.

Multiple Linear Regression

Multiple linear regression simultaneously considers the influence of multiple explanatory variables on a response variable y:

y = α + β1x1 + β2x2 + ... + βixi

Partial regression coefficient βi: the amount by which y changes on average when xi changes by one unit and all the other x's remain constant. It measures the association between xi and y adjusted for all the other x's.
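A minimal sketch of fitting such a model by least squares with NumPy; the data here are made up for illustration:

```python
import numpy as np

# Hypothetical data: y depends on two explanatory variables
rng = np.random.default_rng(1)
n = 50
x1 = rng.uniform(0, 10, n)
x2 = rng.uniform(0, 5, n)
y = 3.0 + 1.5 * x1 - 2.0 * x2 + rng.normal(scale=1.0, size=n)

# Design matrix with an intercept column: y = alpha + b1*x1 + b2*x2
X = np.column_stack([np.ones(n), x1, x2])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
alpha, b1, b2 = coef
print(f"alpha = {alpha:.2f}, beta1 = {b1:.2f}, beta2 = {b2:.2f}")
```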


Why use logistic regression? There are many important research topics for which the dependent variable is "limited": for example, whether or not a person smokes, drinks, skips class, or takes advanced mathematics. For these, the outcome is not continuous or normally distributed. Example: are mothers who have a high school education less likely to have children with IEPs (individualized education plans, indicating cognitive or emotional disabilities)? Binary logistic regression is a type of regression analysis where the dependent variable is a dummy variable: coded 0 (did not smoke) or 1 (did smoke).

Logistic Regression

Logistic regression analysis requires that the dependent variable be dichotomous (such as presence/absence or success/failure), and that the independent variables be metric or dichotomous.

Logistic Regression

A variable is metric if we can measure the size of the difference between any two of its values. A dichotomous variable can take the value 1 with probability of success q, or the value 0 with probability of failure 1 − q. This type of variable is called a Bernoulli (or binary) variable.

Logistic regression does not make any assumptions of normality, linearity, or homogeneity of variance for the independent variables. Because it does not impose these requirements, it is preferred to discriminant analysis when the data do not satisfy these assumptions.
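A minimal sketch with scikit-learn, assuming a made-up smoking outcome and two hypothetical predictors:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical data: binary outcome (1 = smokes) and two predictors
rng = np.random.default_rng(2)
n = 200
age = rng.uniform(18, 70, n)
education = rng.integers(8, 20, n)           # years of schooling
logit = -1.0 + 0.03 * age - 0.15 * education
p = 1 / (1 + np.exp(-logit))                 # true probabilities
smokes = rng.binomial(1, p)                  # 0/1 outcome

X = np.column_stack([age, education])
model = LogisticRegression().fit(X, smokes)
print(model.intercept_, model.coef_)         # fitted coefficients (log-odds scale)
print(model.predict_proba(X[:3]))            # predicted probabilities
```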


ML (maximum likelihood) is a way of finding the smallest possible deviance between the observed and predicted values (kind of like finding the best fitting line), using calculus (derivatives, specifically). With ML, the computer runs different "iterations," trying different solutions until it gets the smallest possible deviance, or best fit. Once it has found the best solution, it reports a final value for the deviance, usually referred to as "negative two log likelihood" (−2LL).

The deviance statistic is called −2LL by Cohen et al., and it can be thought of as a chi-square value. We compare the deviance with just the intercept (−2LL_null, the −2LL of the constant-only model) to the deviance when the new predictor or predictors have been added (−2LL_k, the −2LL of the model that has k predictors). The difference between these two deviance values is often referred to as G, for goodness of fit.
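A sketch of this likelihood-ratio comparison with statsmodels, on hypothetical data: G = (−2LL_null) − (−2LL_k) is referred to a chi-square distribution with k degrees of freedom.

```python
import numpy as np
import statsmodels.api as sm
from scipy.stats import chi2

# Hypothetical binary outcome with one predictor
rng = np.random.default_rng(3)
x = rng.normal(size=300)
y = rng.binomial(1, 1 / (1 + np.exp(-(0.5 + 1.2 * x))))

null = sm.Logit(y, np.ones_like(x)).fit(disp=0)     # constant-only model
full = sm.Logit(y, sm.add_constant(x)).fit(disp=0)  # k = 1 predictor

G = (-2 * null.llf) - (-2 * full.llf)   # drop in deviance
p_value = chi2.sf(G, df=1)              # chi-square test, df = k
print(f"G = {G:.2f}, p = {p_value:.4g}")
```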


Count Data

In statistics, count data is a type of data in which the observations can take only nonnegative integer values, and these integers arise from counting rather than ranking.

The statistical treatment of count data is distinct from that of binary data, in which the observations can take only two values, and from ordinal data, which may also consist of integers but where the individual values fall on an arbitrary scale and only the relative ranking is important.

An individual piece of count data is often termed a count variable. When such a variable is treated as a random variable, the Poisson, binomial and negative binomial distributions are commonly used to represent its distribution.

Regression Models with Count Data

Poisson Regression
Negative Binomial Regression
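A minimal Poisson regression sketch with statsmodels, on made-up count data; the covariate and coefficients are hypothetical:

```python
import numpy as np
import statsmodels.api as sm

# Hypothetical count outcome: events per subject, driven by one covariate
rng = np.random.default_rng(4)
x = rng.uniform(0, 2, 200)
counts = rng.poisson(np.exp(0.3 + 0.8 * x))   # log link: E[y] = exp(b0 + b1*x)

X = sm.add_constant(x)
model = sm.GLM(counts, X, family=sm.families.Poisson()).fit()
print(model.summary())

# For overdispersed counts, a negative binomial family is the usual
# alternative: sm.GLM(counts, X, family=sm.families.NegativeBinomial())
```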