# Resources for statistical assistance. Quantitative covariates and regression analysis. Methods for predicting continuous outcomes.

Size: px
Start display at page:

Download "Resources for statistical assistance. Quantitative covariates and regression analysis. Methods for predicting continuous outcomes."

Transcription

1 Resources for statistical assistance Quantitative covariates and regression analysis Carolyn Taylor Applied Statistics and Data Science Group (ASDa) Department of Statistics, UBC January 24, 2017 Department of Statistics at UBC: SOS Program - An hour of free consulting to UBC graduate students. Funded by the Provost and VP Research Office. STAT Stat grad students taking this course offer free statistical advice. Fall semester every academic year. Short Term Consulting Service - Advice from Stat grad students. Fee-for-service on small projects (less than 15 hours). Hourly Projects - ASDa professional staff. Fee-for-service consulting. Outline Methods for predicting continuous outcomes The language of statistics is not as standardized as you might like! Correlation analysis Simple linear regression Different terms can be used for essentially the same model Statisticians consider regression as the general approach Multiple linear regression ANOVA and ANCOVA

2 Correlation Measures the direction and strength of relationship between numeric variables ρ = 0 ρ = 0.4 ρ = 0.8 Data format Rectangular data, fits in a rectangle where each row represents a sampled unit and each column is a characteristic observed on that unit Example: Each row represents a different car model and the columns are various measured features ρ = 0.9 ρ = 0.95 ρ = 1 ## mpg cyl wt am gear ## Mazda RX ## Mazda RX4 Wag ## Datsun ## Hornet 4 Drive ## Hornet Sportabout ## Valiant ## Duster ## Merc 240D ## Merc ## Merc Association between car weight and mileage 1974 Motor Trends car data (32 different models of cars) Mileage (mpg) Summary statistics from the data ## Mean Variance SD Correlation ## mpg ## wt We can test the correlation between mpg and weight ## Estimate t value p value ## Correlation e-10 ## 95% confidence interval: ## Squared Correlation = Weight (in 1000lbs)

3 Data (2 numeric variables) Estimating the parameters from the sample From a random sample of n independent units from a population we measure 2 different variables (data) Each variable has a mean and a variance and the two variables have a correlation Let s call our variables X and Y {X, Y } = {x i, y i } for i = 1,..., n Correlation between x i and x j is 0 Correlation between y i and y j is 0 Correlation between x i and y i is ρ The population parameters: The means are denoted by µ x and µ y The variances are denoted by σ 2 x and σ 2 y The covariance of the two variables is denoted by σ xy The correlation is ρ = σxy σ xσ y is a number between -1 and 1 Formulas for the sample estimates of the population parameters: ˆµ x = x = n 1 x i /n ˆσ 2 x = s 2 x = n 1 (x i x) 2 /(n 1) ˆσ xy = s xy = n 1 (x i x)(y i ȳ)/(n 1) ˆρ xy = r xy = s xy /(s x s y ) Correlation analysis By correlation we usually mean the Pearson product-moment correlation coefficient. It measures the linear relationship between X and Y. ρ is estimated by r = sxy s xs y Hypothesis of interest is usually H 0 : ρ = 0 versus H 1 : ρ 0 and we use the test statistic r n 2 1 r 2 t n 2 We can also use Fisher s Transformation (F(r) = r )) to test if ρ equals any value including 0 and to construct confidence intervals for ρ ln( 1+r Neither method is overly sensitive to the normality assumption. While 1 r 1, < F(r) < Spearman s rank correlation coefficient Is a rank version of Pearson s correlation computed by converting the data for each variable into a rank before computing the correlation. It uses same test statistics as Pearson s correlation. Is closely related to the Pearson s correlation coefficient except the relationship need not be linear in nature Is robust to outliers If the relationship is linear without any extreme points, Spearman s and Pearson s will be similar

4 Two prediction scenarios The prediction line Mileage is random, Weight is not Weight is random, Mileage is not Mileage is random, Weight is not Weight is random, Mileage is not Mileage (mpg) Weight (in 1000lbs) Mileage (mpg) Weight (in 1000lbs) y Weight (in 1000lbs) Mileage (mpg) Weight (in 1000lbs) Mileage (mpg) Determining the prediction line Minimize the error (residuals), the vertical distance between the observed values and the predicted line Fitted values, residuals and mean squared error We use the estimated slope and intercept with each x i to compute a fitted value for y i ŷ i = b 0 + b 1 x i We compute the residual by taking the difference between the observed data and the fitted value ˆε i = y i ŷ i We estimate the variance of the residuals Minimize the sum of the squared errors, method called Least Squares Two parameters determine the line: slope and intercept x ˆσ 2 ε = s 2 ε = ˆε 2 /(n 2) Note: The average of the residuals is 0. We use n 2 because we estimated 2 parameters (b 0 and b 1 ).

5 Least Squares estimates for the slope and intercept The slope is primarily determined by the correlation between Y and X. It s magnitude is limited by the ratio of the standard deviation of Y and X. s slope: b 1 = r y xy s x = sxy sx 2 We estimate the intercept by picking a point on the line then using our estimate of the slope, solve for the intercept. { x, ȳ} is always on the line. intercept: b 0 = ȳ b 1 x With some math it can be shown that both b 0 and b 1 can be computed as a weighted average of Y. This means both will follow a normal distribution if n is large enough by the central limit theorem of statistics (CLT). Example revisited Summary statistics from the data ## Mean SD Correlation ## mpg ## wt Regression coefficients predicting mpg by weight ## (Intercept) wt ## ## * / = ## * = Estimated line to predict mpg from weight t value Pr(> t ) ## (Intercept) e-19 ## wt e-10 ## R squared = Estimated line to predict weight from mgp t value Pr(> t ) ## (Intercept) e-18 ## mpg e-10 Compare the 3 results ## Estimate t value p value ## Correlation e-10 t value Pr(> t ) ## wt e-10 t value Pr(> t ) ## mpg e-10 All three have the same t value and p value ## R squared = 0.753

6 Fitted values and residuals ŷ i = b 0 + b 1 x i Fit i = ( 5.34wt i ) ˆε i = y i ŷ i Res i = wt i Fit i ## wt mpg Fit Res ## Mazda RX ## Mazda RX4 Wag ## Datsun ## Hornet 4 Drive ## Hornet Sportabout ## MSE = The coefficient of determination Better prediction with regression line compared to red mean line? Mileage (mpg) Weight (in 1000lbs) Variation line doesn t explain (sum square of errors to line) = 278 Total variation in y (sum square of errors to ȳ) = 1126 % variation NOT explained by the line = 278 / 1126 = % variation explained by the line = = The coefficient of determination Tells how much better at predicting Y the regression model is compared to just using ȳ proportion of the variance of Y explained by the model depends on the size of the errors (residuals) For simple linear regression R 2 = ρ 2 xy The R 2 alone doesn t tell the whole story: The more variables in the model the higher the R 2 BUT also the higher the variability in the predictions made from the model (due to having to estimate more coefficients for the model) With large sample sizes, the R 2 value could still be low even with highly significant model coefficients With small sample sizes, the R 2 value could still be high even with insignificant model coefficients The error of the fitted value Decreases as x i moves closer to x fitted values at the endpoints depend greatly on the slope of the line fitted values at the middle are relatively insensitive to the slope Decreases as the variance in x increases since the slope is determined by the endpoints The distribution of the fitted value can be approximated by a normal distribution because of the CLT. This means we can compute accurate confidence bounds for the fitted line.

7 95% Confidence Interval (green) for the fitted line y y Prediction Interval High Intercept High Slope Low Slope The prediction estimate is the same as the fitted value The uncertainty (variability) in the prediction includes the uncertainty in the fitted line (model) plus the variability in the y data There is a 95% chance that the interval at a given x i covers the true mean for y i x 95% Prediction Interval (blue dashed) Confidence Interval for mean yi Prediction Interval for yi There is a 95% chance that the interval at a given x i covers the true y i x Simple Linear Regression A line that summarizes the relationship between two quantitative variables and is used to make predictions Regression assumes that Y is random but X is not Model the mean of Y as a function of X µ yi = β 0 + β 1 x i Each observation is described as being some distance (error ε i ) from the estimated mean where ε i N(0, σ 2 ) Assumptions: normality is not critical homoscedasticity (constant variance) is important independent errors is critical

8 Diagnostics for Regression Example Revisited Before fitting the model plot y i versus x i Does the spread of Y depend on X? What does the relationship look like? Are there any extreme X or Y values? Mileage (mpg) Raw Data Residuals Chrysler Imperial Residuals vs Fitted Fiat 128 Toyota Corolla Plot the residuals versus the quantiles of a normal distribution Weight (in 1000lbs) Fitted values Plot residuals (ε i ) versus fitted values (ŷ i ). You should see an uncorrelated oval of data. If the data can be ordered over time or space, check the residuals for indications of serial correlation Examine the influence of each observation using Cook s distance Standardized residuals Normal Q Q Theoretical Quantiles Fiat 128 Chrysler Toyota Imperial Corolla Cook's distance Cook's distance Chrysler Imperial Toyota Corolla Fiat Obs. number Heteroscedasticity Serial Correlation

9 Effect of Outliers or high leverage points y y y y Regession with 2 predictors Motor Trends car data (32 different models of cars) mpg x x x x hp wt Regression predicting mpg by hp and weight (wt) Summary statistics from the data ## Mean Variance SD ## mpg ## hp ## wt ## correlations between the variables ## mpg hp wt ## mpg ## hp ## wt t value Pr(> t ) ## (Intercept) e-20 ## wt e-06 ## hp e-03 ## R squared = ## wt hp mpg Fit Res ## Mazda RX ## Mazda RX4 Wag ## Datsun ## MSE = 6.726

10 Model Diagnostics Multiple Linear Regression (2 predictors) Residuals Chrysler Imperial Residuals vs Fitted Fiat Toyota 128Corolla Standardized residuals Normal Q Q Toyota Fiat 128 Chrysler Corolla Imperial The assumptions are the same but the equation (model) contains more parameters: µ Y = β 0 + β 1 X 1 + β 2 X 2 computing the estimates of the coefficients is more complicated but can be easily expressed in matrix form (not presented here) Cook's distance Fitted values Cook's distance Chrysler Imperial Maserati Bora Toyota Corolla 0 5 Obs. number Theoretical Quantiles The key concern is how much correlation exists between X 1 and X 2 since severe multicollinearity can increase the variance of the coefficient estimates can complicate or prevent the identification of an optimal set of explanatory variables for a statistical model assess using the variance inflation factor (VIF), 1/(1 ρ 2 x 1,x 2 ) if VIF >= 5, coefficients are poorly estimated and one should be wary of their p-values doesn t affect how well the model fits, a model with severe multicollinearity can produce great predictions Example 1 Random data Y, X 1, X 2 Example 2 Random data Y, X 1, X 2 True model: Y = X 1 with ρ x1,x 2 = 0 ## X ## X ## X ## X SE multiplier = 1.0 True model: Y = X 1 with ρ x1,x 2 = 0.5 ## X ## X ## X ## X SE multiplier = / = 1.15

11 Example 3 Random data Y, X 1, X 2 Example 4 Random Data Y, X 1, X 2 True model: Y = X 1 with ρ x1,x 2 = 0.9 ## X ## X ## X ## X SE multiplier = / = 2.28 True model: Y = X 1 /2 + X 2 /2 with ρ x1,x 2 = 0 ## X ## X ## X ## X Example 5 Random data Y, X 1, X 2 Example 6 Random Data Y, X 1, X 2 True model: Y = X 1 /2 + X 2 /2 with ρ x1,x 2 = 0.5 ## X ## X ## X ## X True model: Y = X 1 /2 + X 2 /2 with ρ x1,x 2 = 0.9 ## X ## X ## X ## X

12 Dealing with multicollinearity What makes a regession linear? Possible solutions: standardize the predictors (subtract the mean) remove highly correlated predictors linearly combine predictors (add them together) do nothing if the p-values aren t important If you can live with less precise coefficient estimates, or a model that has a high R-squared but few significant predictors, doing nothing can be the correct decision because it won t impact the fit Linear regression refers to the coefficients in the model. It has nothing to do with the way X is included in the model or the relationship between X and Y. It is the form of β: Linear regression models: µ y = β 0 + β 1 X + β 2 X 2 µ y = β 0 + β 1 X 1 + β 2 X 2 + β 3 X 1 X 2 + β 4 X β 5X 2 2 Not a linear regression model because of β 2 1 : µ y = β 0 + β 1 X 1 + β 2 1 X 2 This is a (linear) Regression The observed versus the model fit µ y = 2 + x/2 + x 2 /5 Linear Regression Fit Linear Regression Fit y Observed Fitted x

13 Regression with k predictors Like the two predictor case, we can compute all the parameter estimates easily using matrix algebra. The model includes more terms but otherwise the issues are similar to the two predictor case. We need to look at the correlation matrix of all the predictors to determine if colinearity is a problem. While a predictor may not be highly correlated to any one other predictor, it may be highly correlated with a set of other predictors. If some variables are excluded, all the other predictors in the model will cooperate to act as surrogates for the excluded variables depending on the strength of the correlation between the covariates If more than two predictors are included in the model then the VIF is unique for each predictor and is computed by regressing each predictor on the other predictors in the model and looking at the coefficient of determination ANOVA as a regression model (categorical predictors) In order to fit an ANOVA model as a regression model, the p level factor variable is converted into p 1 indicator variables Example: The 3 level cylinder predictor is converted into 2 indicator variables, one for 6 cylinders and one for 8 cylinders ## mpg cyl ## Mazda RX ## Mazda RX4 Wag ## Datsun ## Hornet 4 Drive ## Hornet Sportabout ## mpg cyl6 cyl8 ## Mazda RX ## Mazda RX4 Wag ## Datsun ## Hornet 4 Drive ## Hornet Sportabout Example mpg ## cyl N Mean SD ## ## ## t value Pr(> t ) ## (Intercept) e-22 ## cyl e-04 ## cyl e # of cylinders

14 What about the constant variance assumption? Residuals Y Y ANCOVA continuous and categorical predictor Residuals vs Fitted Toyota Corolla Fiat 128 Volvo 142E Standardized residuals Volvo 142E Normal Q Q Toyota Corolla Fiat 128 This is the common names for a model that contains both categorical and continuous predictors. We include categorical predictors by converting them into indicator variables then proceed with a multiple regression. Only difference is this time the model also includes a continuous predictor X Fitted values Theoretical Quantiles Example: Treatment versus placebo with a covariate Cook's distance Cook's distance Toyota Corolla Fiat 128 Volvo 142E 0 5 Obs. number µ Y = β 0 + β 1 I(trt) + β 2 X b 0 = intercept for the placebo group b 1 = change in intercept from placebo to treatment b 2 = common slope for the covariate This model fits parallel lines µ y = 5 + 5I(trt) + X Regression estimates ## Placebo ## TRT-PLB ## Intercept ## Slope ## Placebo ## TRT-PLB ## Slope Placebo Treatment X

15 Model with different slopes µ y = 5 + 5I(treat) + X XI(treat) The slopes of our continuous predictor need not be the same for each level of our categorical predictor. We can model this with an interaction term. µ Y = β 0 + β 1 I(trt) + β 2 X + β 3 XI(trt) b 0 = intercept for the placebo group b 1 = change in intercept from placebo to treatment b 2 = slope for the placebo group b 3 = change in slope from placebo to treatment Lines are no longer parallel meaning the difference between the groups changes with the value of X Y X Regression estimates ## Intercept ## Slope ## Placebo ## TRT-PLB ## Slope ## Int(Placebo) ## Int(TRT-PLB) ## Slp(Placebo) ## Slp(TRT-PLB) Sample Size and Power calculations α (or Significance) is the chance of accepting what is false. It does not depend on n, the sample size, but is chosen. Want it to be small and is commonly set at β (or Power) is the chance of accepting what is true. It depends on n and increases as n increases. To calculate Power, you need to specify each parameter in the model (based on previous studies, educated guesses or some other source). The more complicated the model the more parameters you need to specify. Java Applets to calculate sample size and power G*Power for Mac or Windows

16 Questions? Department of Statistics at UBC: SOS Program - An hour of free consulting to UBC graduate students. Funded by the Provost and VP Research Office. STAT Stat grad students taking this course offer free statistical advice. Fall semester every academic year. Short Term Consulting Service - Advice from Stat grad students. Fee-for-service on small projects (less than 15 hours). Hourly Projects - ASDa professional staff. Fee-for-service consulting.

### Regression Models Course Project Vincent MARIN 28 juillet 2016

Regression Models Course Project Vincent MARIN 28 juillet 2016 Executive Summary "Is an automatic or manual transmission better for MPG" "Quantify the MPG difference between automatic and manual transmissions"

### Mixed Effects Models. Biljana Jonoska Stojkova Applied Statistics and Data Science Group (ASDa) Department of Statistics, UBC.

Mixed Effects Models Biljana Jonoska Stojkova Applied Statistics and Data Science Group (ASDa) Department of Statistics, UBC March 6, 2018 Resources for statistical assistance Department of Statistics

### Modelling Proportions and Count Data

Modelling Proportions and Count Data Rick White May 4, 2016 Outline Analysis of Count Data Binary Data Analysis Categorical Data Analysis Generalized Linear Models Questions Types of Data Continuous data:

### Modelling Proportions and Count Data

Modelling Proportions and Count Data Rick White May 5, 2015 Outline Analysis of Count Data Binary Data Analysis Categorical Data Analysis Generalized Linear Models Questions Types of Data Continuous data:

### Regression Analysis and Linear Regression Models

Regression Analysis and Linear Regression Models University of Trento - FBK 2 March, 2015 (UNITN-FBK) Regression Analysis and Linear Regression Models 2 March, 2015 1 / 33 Relationship between numerical

### Introduction for heatmap3 package

Introduction for heatmap3 package Shilin Zhao April 6, 2015 Contents 1 Example 1 2 Highlights 4 3 Usage 5 1 Example Simulate a gene expression data set with 40 probes and 25 samples. These samples are

### Chapter 4: Analyzing Bivariate Data with Fathom

Chapter 4: Analyzing Bivariate Data with Fathom Summary: Building from ideas introduced in Chapter 3, teachers continue to analyze automobile data using Fathom to look for relationships between two quantitative

### Will Landau. January 24, 2013

Iowa State University January 24, 2013 Iowa State University January 24, 2013 1 / 30 Outline Iowa State University January 24, 2013 2 / 30 statistics: the use of plots and numerical summaries to describe

### THIS IS NOT REPRESNTATIVE OF CURRENT CLASS MATERIAL. STOR 455 Midterm 1 September 28, 2010

THIS IS NOT REPRESNTATIVE OF CURRENT CLASS MATERIAL STOR 455 Midterm September 8, INSTRUCTIONS: BOTH THE EXAM AND THE BUBBLE SHEET WILL BE COLLECTED. YOU MUST PRINT YOUR NAME AND SIGN THE HONOR PLEDGE

### Lecture 7: Linear Regression (continued)

Lecture 7: Linear Regression (continued) Reading: Chapter 3 STATS 2: Data mining and analysis Jonathan Taylor, 10/8 Slide credits: Sergio Bacallado 1 / 14 Potential issues in linear regression 1. Interactions

### Quick Guide for pairheatmap Package

Quick Guide for pairheatmap Package Xiaoyong Sun February 7, 01 Contents McDermott Center for Human Growth & Development The University of Texas Southwestern Medical Center Dallas, TX 75390, USA 1 Introduction

### Basic R QMMA. Emanuele Taufer. 2/19/2018 Basic R (1)

Basic R QMMA Emanuele Taufer file:///c:/users/emanuele.taufer/google%20drive/2%20corsi/5%20qmma%20-%20mim/0%20classes/1-3_basic_r.html#(1) 1/21 Preliminary R is case sensitive: a is not the same as A.

### Introduction to R: Day 2 September 20, 2017

Introduction to R: Day 2 September 20, 2017 Outline RStudio projects Base R graphics plotting one or two continuous variables customizable elements of plots saving plots to a file Create a new project

### Applied Regression Modeling: A Business Approach

i Applied Regression Modeling: A Business Approach Computer software help: SAS SAS (originally Statistical Analysis Software ) is a commercial statistical software package based on a powerful programming

### CDAA No. 4 - Part Two - Multiple Regression - Initial Data Screening

CDAA No. 4 - Part Two - Multiple Regression - Initial Data Screening Variables Entered/Removed b Variables Entered GPA in other high school, test, Math test, GPA, High school math GPA a Variables Removed

### STAT 2607 REVIEW PROBLEMS Word problems must be answered in words of the problem.

STAT 2607 REVIEW PROBLEMS 1 REMINDER: On the final exam 1. Word problems must be answered in words of the problem. 2. "Test" means that you must carry out a formal hypothesis testing procedure with H0,

### Multiple Regression White paper

+44 (0) 333 666 7366 Multiple Regression White paper A tool to determine the impact in analysing the effectiveness of advertising spend. Multiple Regression In order to establish if the advertising mechanisms

### Minitab 17 commands Prepared by Jeffrey S. Simonoff

Minitab 17 commands Prepared by Jeffrey S. Simonoff Data entry and manipulation To enter data by hand, click on the Worksheet window, and enter the values in as you would in any spreadsheet. To then save

### Study Guide. Module 1. Key Terms

Study Guide Module 1 Key Terms general linear model dummy variable multiple regression model ANOVA model ANCOVA model confounding variable squared multiple correlation adjusted squared multiple correlation

### ST512. Fall Quarter, Exam 1. Directions: Answer questions as directed. Please show work. For true/false questions, circle either true or false.

ST512 Fall Quarter, 2005 Exam 1 Name: Directions: Answer questions as directed. Please show work. For true/false questions, circle either true or false. 1. (42 points) A random sample of n = 30 NBA basketball

### Learner Expectations UNIT 1: GRAPICAL AND NUMERIC REPRESENTATIONS OF DATA. Sept. Fathom Lab: Distributions and Best Methods of Display

CURRICULUM MAP TEMPLATE Priority Standards = Approximately 70% Supporting Standards = Approximately 20% Additional Standards = Approximately 10% HONORS PROBABILITY AND STATISTICS Essential Questions &

### Data Mining. ❷Chapter 2 Basic Statistics. Asso.Prof.Dr. Xiao-dong Zhu. Business School, University of Shanghai for Science & Technology

❷Chapter 2 Basic Statistics Business School, University of Shanghai for Science & Technology 2016-2017 2nd Semester, Spring2017 Contents of chapter 1 1 recording data using computers 2 3 4 5 6 some famous

### THE L.L. THURSTONE PSYCHOMETRIC LABORATORY UNIVERSITY OF NORTH CAROLINA. Forrest W. Young & Carla M. Bann

Forrest W. Young & Carla M. Bann THE L.L. THURSTONE PSYCHOMETRIC LABORATORY UNIVERSITY OF NORTH CAROLINA CB 3270 DAVIE HALL, CHAPEL HILL N.C., USA 27599-3270 VISUAL STATISTICS PROJECT WWW.VISUALSTATS.ORG

### Regression. Dr. G. Bharadwaja Kumar VIT Chennai

Regression Dr. G. Bharadwaja Kumar VIT Chennai Introduction Statistical models normally specify how one set of variables, called dependent variables, functionally depend on another set of variables, called

### Descriptive Statistics, Standard Deviation and Standard Error

AP Biology Calculations: Descriptive Statistics, Standard Deviation and Standard Error SBI4UP The Scientific Method & Experimental Design Scientific method is used to explore observations and answer questions.

### Product Catalog. AcaStat. Software

Product Catalog AcaStat Software AcaStat AcaStat is an inexpensive and easy-to-use data analysis tool. Easily create data files or import data from spreadsheets or delimited text files. Run crosstabulations,

### One Factor Experiments

One Factor Experiments 20-1 Overview Computation of Effects Estimating Experimental Errors Allocation of Variation ANOVA Table and F-Test Visual Diagnostic Tests Confidence Intervals For Effects Unequal

### RSM Split-Plot Designs & Diagnostics Solve Real-World Problems

RSM Split-Plot Designs & Diagnostics Solve Real-World Problems Shari Kraber Pat Whitcomb Martin Bezener Stat-Ease, Inc. Stat-Ease, Inc. Stat-Ease, Inc. 221 E. Hennepin Ave. 221 E. Hennepin Ave. 221 E.

### Applied Regression Modeling: A Business Approach

i Applied Regression Modeling: A Business Approach Computer software help: SPSS SPSS (originally Statistical Package for the Social Sciences ) is a commercial statistical software package with an easy-to-use

### SLStats.notebook. January 12, Statistics:

Statistics: 1 2 3 Ways to display data: 4 generic arithmetic mean sample 14A: Opener, #3,4 (Vocabulary, histograms, frequency tables, stem and leaf) 14B.1: #3,5,8,9,11,12,14,15,16 (Mean, median, mode,

### Statistical Analysis of Metabolomics Data. Xiuxia Du Department of Bioinformatics & Genomics University of North Carolina at Charlotte

Statistical Analysis of Metabolomics Data Xiuxia Du Department of Bioinformatics & Genomics University of North Carolina at Charlotte Outline Introduction Data pre-treatment 1. Normalization 2. Centering,

### Week 4: Simple Linear Regression III

Week 4: Simple Linear Regression III Marcelo Coca Perraillon University of Colorado Anschutz Medical Campus Health Services Research Methods I HSMP 7607 2017 c 2017 PERRAILLON ARR 1 Outline Goodness of

### Cross-validation and the Bootstrap

Cross-validation and the Bootstrap In the section we discuss two resampling methods: cross-validation and the bootstrap. These methods refit a model of interest to samples formed from the training set,

### THE UNIVERSITY OF BRITISH COLUMBIA FORESTRY 430 and 533. Time: 50 minutes 40 Marks FRST Marks FRST 533 (extra questions)

THE UNIVERSITY OF BRITISH COLUMBIA FORESTRY 430 and 533 MIDTERM EXAMINATION: October 14, 2005 Instructor: Val LeMay Time: 50 minutes 40 Marks FRST 430 50 Marks FRST 533 (extra questions) This examination

### Two-Stage Least Squares

Chapter 316 Two-Stage Least Squares Introduction This procedure calculates the two-stage least squares (2SLS) estimate. This method is used fit models that include instrumental variables. 2SLS includes

### Introductory Applied Statistics: A Variable Approach TI Manual

Introductory Applied Statistics: A Variable Approach TI Manual John Gabrosek and Paul Stephenson Department of Statistics Grand Valley State University Allendale, MI USA Version 1.1 August 2014 2 Copyright

### Problem set for Week 7 Linear models: Linear regression, multiple linear regression, ANOVA, ANCOVA

ECL 290 Statistical Models in Ecology using R Problem set for Week 7 Linear models: Linear regression, multiple linear regression, ANOVA, ANCOVA Datasets in this problem set adapted from those provided

### Correlation and Regression

How are X and Y Related Correlation and Regression Causation (regression) p(y X=) Correlation p(y=, X=) http://kcd.com/552/ Correlation Can be Induced b Man Mechanisms # of Pups 1 2 3 4 5 6 7 8 Inbreeding

### Further Maths Notes. Common Mistakes. Read the bold words in the exam! Always check data entry. Write equations in terms of variables

Further Maths Notes Common Mistakes Read the bold words in the exam! Always check data entry Remember to interpret data with the multipliers specified (e.g. in thousands) Write equations in terms of variables

### Bluman & Mayer, Elementary Statistics, A Step by Step Approach, Canadian Edition

Bluman & Mayer, Elementary Statistics, A Step by Step Approach, Canadian Edition Online Learning Centre Technology Step-by-Step - Minitab Minitab is a statistical software application originally created

### STA 570 Spring Lecture 5 Tuesday, Feb 1

STA 570 Spring 2011 Lecture 5 Tuesday, Feb 1 Descriptive Statistics Summarizing Univariate Data o Standard Deviation, Empirical Rule, IQR o Boxplots Summarizing Bivariate Data o Contingency Tables o Row

### Robust Linear Regression (Passing- Bablok Median-Slope)

Chapter 314 Robust Linear Regression (Passing- Bablok Median-Slope) Introduction This procedure performs robust linear regression estimation using the Passing-Bablok (1988) median-slope algorithm. Their

### CHAPTER 3 AN OVERVIEW OF DESIGN OF EXPERIMENTS AND RESPONSE SURFACE METHODOLOGY

23 CHAPTER 3 AN OVERVIEW OF DESIGN OF EXPERIMENTS AND RESPONSE SURFACE METHODOLOGY 3.1 DESIGN OF EXPERIMENTS Design of experiments is a systematic approach for investigation of a system or process. A series

### Goals of the Lecture. SOC6078 Advanced Statistics: 9. Generalized Additive Models. Limitations of the Multiple Nonparametric Models (2)

SOC6078 Advanced Statistics: 9. Generalized Additive Models Robert Andersen Department of Sociology University of Toronto Goals of the Lecture Introduce Additive Models Explain how they extend from simple

### Outline. Topic 16 - Other Remedies. Ridge Regression. Ridge Regression. Ridge Regression. Robust Regression. Regression Trees. Piecewise Linear Model

Topic 16 - Other Remedies Ridge Regression Robust Regression Regression Trees Outline - Fall 2013 Piecewise Linear Model Bootstrapping Topic 16 2 Ridge Regression Modification of least squares that addresses

### Secondary 1 Vocabulary Cards and Word Walls Revised: June 27, 2012

Secondary 1 Vocabulary Cards and Word Walls Revised: June 27, 2012 Important Notes for Teachers: The vocabulary cards in this file match the Common Core, the math curriculum adopted by the Utah State Board

### CCSSM Curriculum Analysis Project Tool 1 Interpreting Functions in Grades 9-12

Tool 1: Standards for Mathematical ent: Interpreting Functions CCSSM Curriculum Analysis Project Tool 1 Interpreting Functions in Grades 9-12 Name of Reviewer School/District Date Name of Curriculum Materials:

### Reference

Leaning diary: research methodology 30.11.2017 Name: Juriaan Zandvliet Student number: 291380 (1) a short description of each topic of the course, (2) desciption of possible examples or exercises done

### For our example, we will look at the following factors and factor levels.

In order to review the calculations that are used to generate the Analysis of Variance, we will use the statapult example. By adjusting various settings on the statapult, you are able to throw the ball

### IQR = number. summary: largest. = 2. Upper half: Q3 =

Step by step box plot Height in centimeters of players on the 003 Women s Worldd Cup soccer team. 157 1611 163 163 164 165 165 165 168 168 168 170 170 170 171 173 173 175 180 180 Determine the 5 number

### Chapter 5snow year.notebook March 15, 2018

Chapter 5: Statistical Reasoning Section 5.1 Exploring Data Measures of central tendency (Mean, Median and Mode) attempt to describe a set of data by identifying the central position within a set of data

### Psychology 282 Lecture #21 Outline Categorical IVs in MLR: Effects Coding and Contrast Coding

Psychology 282 Lecture #21 Outline Categorical IVs in MLR: Effects Coding and Contrast Coding In the previous lecture we learned how to incorporate a categorical research factor into a MLR model by using

### 8.NS.1 8.NS.2. 8.EE.7.a 8.EE.4 8.EE.5 8.EE.6

Standard 8.NS.1 8.NS.2 8.EE.1 8.EE.2 8.EE.3 8.EE.4 8.EE.5 8.EE.6 8.EE.7 8.EE.7.a Jackson County Core Curriculum Collaborative (JC4) 8th Grade Math Learning Targets in Student Friendly Language I can identify

### BIOL 458 BIOMETRY Lab 10 - Multiple Regression

BIOL 458 BIOMETRY Lab 10 - Multiple Regression Many problems in science involve the analysis of multi-variable data sets. For data sets in which there is a single continuous dependent variable, but several

### Frequencies, Unequal Variance Weights, and Sampling Weights: Similarities and Differences in SAS

ABSTRACT Paper 1938-2018 Frequencies, Unequal Variance Weights, and Sampling Weights: Similarities and Differences in SAS Robert M. Lucas, Robert M. Lucas Consulting, Fort Collins, CO, USA There is confusion

### Data Management - 50%

Exam 1: SAS Big Data Preparation, Statistics, and Visual Exploration Data Management - 50% Navigate within the Data Management Studio Interface Register a new QKB Create and connect to a repository Define

### Regression on SAT Scores of 374 High Schools and K-means on Clustering Schools

Regression on SAT Scores of 374 High Schools and K-means on Clustering Schools Abstract In this project, we study 374 public high schools in New York City. The project seeks to use regression techniques

### Fathom Dynamic Data TM Version 2 Specifications

Data Sources Fathom Dynamic Data TM Version 2 Specifications Use data from one of the many sample documents that come with Fathom. Enter your own data by typing into a case table. Paste data from other

### Bivariate Linear Regression James M. Murray, Ph.D. University of Wisconsin - La Crosse Updated: October 04, 2017

Bivariate Linear Regression James M. Murray, Ph.D. University of Wisconsin - La Crosse Updated: October 4, 217 PDF file location: http://www.murraylax.org/rtutorials/regression_intro.pdf HTML file location:

### Section 2.3: Simple Linear Regression: Predictions and Inference

Section 2.3: Simple Linear Regression: Predictions and Inference Jared S. Murray The University of Texas at Austin McCombs School of Business Suggested reading: OpenIntro Statistics, Chapter 7.4 1 Simple

### This is a simple example of how the lasso regression model works.

1 of 29 5/25/2016 11:26 AM This is a simple example of how the lasso regression model works. save.image("backup.rdata") rm(list=ls()) library("glmnet") ## Loading required package: Matrix ## ## Attaching

### Missing Data Analysis for the Employee Dataset

Missing Data Analysis for the Employee Dataset 67% of the observations have missing values! Modeling Setup For our analysis goals we would like to do: Y X N (X, 2 I) and then interpret the coefficients

### STATISTICS (STAT) Statistics (STAT) 1

Statistics (STAT) 1 STATISTICS (STAT) STAT 2013 Elementary Statistics (A) Prerequisites: MATH 1483 or MATH 1513, each with a grade of "C" or better; or an acceptable placement score (see placement.okstate.edu).

### Using Multivariate Adaptive Regression Splines (MARS ) to enhance Generalised Linear Models. Inna Kolyshkina PriceWaterhouseCoopers

Using Multivariate Adaptive Regression Splines (MARS ) to enhance Generalised Linear Models. Inna Kolyshkina PriceWaterhouseCoopers Why enhance GLM? Shortcomings of the linear modelling approach. GLM being

### Analysis of variance - ANOVA

Analysis of variance - ANOVA Based on a book by Julian J. Faraway University of Iceland (UI) Estimation 1 / 50 Anova In ANOVAs all predictors are categorical/qualitative. The original thinking was to try

### Bivariate (Simple) Regression Analysis

Revised July 2018 Bivariate (Simple) Regression Analysis This set of notes shows how to use Stata to estimate a simple (two-variable) regression equation. It assumes that you have set Stata up on your

### IAT 355 Visual Analytics. Data and Statistical Models. Lyn Bartram

IAT 355 Visual Analytics Data and Statistical Models Lyn Bartram Exploring data Example: US Census People # of people in group Year # 1850 2000 (every decade) Age # 0 90+ Sex (Gender) # Male, female Marital

### An introduction to SPSS

An introduction to SPSS To open the SPSS software using U of Iowa Virtual Desktop... Go to https://virtualdesktop.uiowa.edu and choose SPSS 24. Contents NOTE: Save data files in a drive that is accessible

### Lecture 1: Statistical Reasoning 2. Lecture 1. Simple Regression, An Overview, and Simple Linear Regression

Lecture Simple Regression, An Overview, and Simple Linear Regression Learning Objectives In this set of lectures we will develop a framework for simple linear, logistic, and Cox Proportional Hazards Regression

### CREATING THE ANALYSIS

Chapter 14 Multiple Regression Chapter Table of Contents CREATING THE ANALYSIS...214 ModelInformation...217 SummaryofFit...217 AnalysisofVariance...217 TypeIIITests...218 ParameterEstimates...218 Residuals-by-PredictedPlot...219

### Data Analysis and Solver Plugins for KSpread USER S MANUAL. Tomasz Maliszewski

Data Analysis and Solver Plugins for KSpread USER S MANUAL Tomasz Maliszewski tmaliszewski@wp.pl Table of Content CHAPTER 1: INTRODUCTION... 3 1.1. ABOUT DATA ANALYSIS PLUGIN... 3 1.3. ABOUT SOLVER PLUGIN...

### SPSS INSTRUCTION CHAPTER 9

SPSS INSTRUCTION CHAPTER 9 Chapter 9 does no more than introduce the repeated-measures ANOVA, the MANOVA, and the ANCOVA, and discriminant analysis. But, you can likely envision how complicated it can

### Section 3.2: Multiple Linear Regression II. Jared S. Murray The University of Texas at Austin McCombs School of Business

Section 3.2: Multiple Linear Regression II Jared S. Murray The University of Texas at Austin McCombs School of Business 1 Multiple Linear Regression: Inference and Understanding We can answer new questions

### Heteroskedasticity and Homoskedasticity, and Homoskedasticity-Only Standard Errors

Heteroskedasticity and Homoskedasticity, and Homoskedasticity-Only Standard Errors (Section 5.4) What? Consequences of homoskedasticity Implication for computing standard errors What do these two terms

### Week 5: Multiple Linear Regression II

Week 5: Multiple Linear Regression II Marcelo Coca Perraillon University of Colorado Anschutz Medical Campus Health Services Research Methods I HSMP 7607 2017 c 2017 PERRAILLON ARR 1 Outline Adjusted R

### Lecture 20: Outliers and Influential Points

Lecture 20: Outliers and Influential Points An outlier is a point with a large residual. An influential point is a point that has a large impact on the regression. Surprisingly, these are not the same

### Section 2.1: Intro to Simple Linear Regression & Least Squares

Section 2.1: Intro to Simple Linear Regression & Least Squares Jared S. Murray The University of Texas at Austin McCombs School of Business Suggested reading: OpenIntro Statistics, Chapter 7.1, 7.2 1 Regression:

### Chapters 5-6: Statistical Inference Methods

Chapters 5-6: Statistical Inference Methods Chapter 5: Estimation (of population parameters) Ex. Based on GSS data, we re 95% confident that the population mean of the variable LONELY (no. of days in past

### Correlation. January 12, 2019

Correlation January 12, 2019 Contents Correlations The Scattterplot The Pearson correlation The computational raw-score formula Survey data Fun facts about r Sensitivity to outliers Spearman rank-order

### Lecture 25: Review I

Lecture 25: Review I Reading: Up to chapter 5 in ISLR. STATS 202: Data mining and analysis Jonathan Taylor 1 / 18 Unsupervised learning In unsupervised learning, all the variables are on equal standing,

### 14.2 The Regression Equation

14.2 The Regression Equation Tom Lewis Fall Term 2009 Tom Lewis () 14.2 The Regression Equation Fall Term 2009 1 / 12 Outline 1 Exact and inexact linear relationships 2 Fitting lines to data 3 Formulas

### The Truth behind PGA Tour Player Scores

The Truth behind PGA Tour Player Scores Sukhyun Sean Park, Dong Kyun Kim, Ilsung Lee May 7, 2016 Abstract The main aim of this project is to analyze the variation in a dataset that is obtained from the

### SASEG 9B Regression Assumptions

SASEG 9B Regression Assumptions (Fall 2015) Sources (adapted with permission)- T. P. Cronan, Jeff Mullins, Ron Freeze, and David E. Douglas Course and Classroom Notes Enterprise Systems, Sam M. Walton

### Middle School Math Course 2

Middle School Math Course 2 Correlation of the ALEKS course Middle School Math Course 2 to the Indiana Academic Standards for Mathematics Grade 7 (2014) 1: NUMBER SENSE = ALEKS course topic that addresses

### Stat 241 Review Problems

1 Even when things are running smoothly, 5% of the parts produced by a certain manufacturing process are defective. a) If you select parts at random, what is the probability that none of them are defective?

### CHAPTER 2: SAMPLING AND DATA

CHAPTER 2: SAMPLING AND DATA This presentation is based on material and graphs from Open Stax and is copyrighted by Open Stax and Georgia Highlands College. OUTLINE 2.1 Stem-and-Leaf Graphs (Stemplots),

### Exploring and Understanding Data Using R.

Exploring and Understanding Data Using R. Loading the data into an R data frame: variable

### Week 4: Simple Linear Regression II

Week 4: Simple Linear Regression II Marcelo Coca Perraillon University of Colorado Anschutz Medical Campus Health Services Research Methods I HSMP 7607 2017 c 2017 PERRAILLON ARR 1 Outline Algebraic properties

### Lab 5 - Risk Analysis, Robustness, and Power

Type equation here.biology 458 Biometry Lab 5 - Risk Analysis, Robustness, and Power I. Risk Analysis The process of statistical hypothesis testing involves estimating the probability of making errors

### Measures of Dispersion

Lesson 7.6 Objectives Find the variance of a set of data. Calculate standard deviation for a set of data. Read data from a normal curve. Estimate the area under a curve. Variance Measures of Dispersion

### AP Statistics Summer Assignment:

AP Statistics Summer Assignment: Read the following and use the information to help answer your summer assignment questions. You will be responsible for knowing all of the information contained in this

### Nuts and Bolts Research Methods Symposium

Organizing Your Data Jenny Holcombe, PhD UT College of Medicine Nuts & Bolts Conference August 16, 3013 Topics to Discuss: Types of Variables Constructing a Variable Code Book Developing Excel Spreadsheets

### Error Analysis, Statistics and Graphing

Error Analysis, Statistics and Graphing This semester, most of labs we require us to calculate a numerical answer based on the data we obtain. A hard question to answer in most cases is how good is your

### Objects, Class and Attributes

Objects, Class and Attributes Introduction to objects classes and attributes Practically speaking everything you encounter in R is an object. R has a few different classes of objects. I will talk mainly

### Recall the expression for the minimum significant difference (w) used in the Tukey fixed-range method for means separation:

Topic 11. Unbalanced Designs [ST&D section 9.6, page 219; chapter 18] 11.1 Definition of missing data Accidents often result in loss of data. Crops are destroyed in some plots, plants and animals die,

### Lab #7 - More on Regression in R Econ 224 September 18th, 2018

Lab #7 - More on Regression in R Econ 224 September 18th, 2018 Robust Standard Errors Your reading assignment from Chapter 3 of ISL briefly discussed two ways that the standard regression inference formulas