Simulating Multivariate Normal Data

You have a population correlation matrix and wish to simulate a set of data randomly sampled from a population with that structure. I shall present here code and examples for doing this with SAS and with R.

SAS

The code below will simulate data for a matrix of correlations between the variables Y1, Y2, Y3, X1, X2, and X3. The user enters the number of subjects (NS), the population correlation matrix, the number of X variables, the number of Y variables, and the number of XY correlations. The code was obtained from a document authored by Ali A. Al-Subaihi. I made two minor modifications: one to correct a malformed do loop and one to print out the raw data.

OPTIONS ls=100 ps=60 nodate nonumber;
proc iml;
/********* The Parameters **************/
NS=20;                                 /* No. of subjects */
PopCor={ 1  .5  .5  .7  .7  .1,
        .5   1  .5  .7  .7  .1,
        .5  .5   1  .7  .7  .1,
        .7  .7  .7   1  .2  .2,
        .7  .7  .7  .2   1  .2,
        .1  .1  .1  .2  .2   1};
%Let NY=3;                             /* No. of the y's */
%Let NX=3;                             /* No. of the x's */
%Let NPC=9;                            /* No. of yx correlations = NY*NX */
/***************************************************/
NV=&NY+&NX;                            /* No. of variables */
CorY= PopCor[1:&NY,1:&NY];             /* Corr. among the y's */
CorX= PopCor[&NY+1:NV,&NY+1:NV];       /* Corr. among the x's */
CorYX= PopCor[&NY+1:NV,1:&NY];         /* Corr. betw. the y's & the x's */
do i=1 to ncol(coryx);                 /* Corr. betw. the y's & the x's as a column */
   CorYXs=CorYXs//CorYX[,i];
end;
%macro loop(npc);
   %Do i=1 %to &NPC;                   /* Bi's correlation matrices */
      Cryx&i=I(2);
      Cryx&i[1,2]=CorYXs[&i,1];
      Cryx&i[2,1]=CorYXs[&i,1];
   %end;
%mend loop;
%loop(&npc);
X=Rannor(Repeat(0,NS,&NX))*root(CorX); /* The X data matrix */
Y=Rannor(Repeat(0,NS,&NY))*root(CorY); /* The Y data matrix */
DaXs=0*j(ns,&NX);
%macro loop2(NY);
   %Let k=0;
   %do j=1 %to &NY;
      %do i=1 %to &NX;
         %Let c=%eval(&i+&k);
         %put c=&c;
         dat=(Y[,&j]||X[,&i])*(root(CrYX&c));
         dat2=dat2||dat[,2];
      %end;
      %Let k=&c;
      daxs=daxs+dat2;
      free dat2;
   %end;
%mend loop2;
%loop2(&NY);
daxs=daxs*(1/&ny);
data=Y||daxs;                          /* The final data matrix */
eg=eigval(corr(daxs));
CXs=(eg[<>,1]-1)/(&NX-1);              /* The average correlation among all x's */
eg=eigval(corr(Y));
CYs=(eg[<>,1]-1)/(&NY-1);              /* The average correlation among all y's */
Call=corr(data);                       /* Correlations among all data */
ca=call[1:&ny,(&ny+1):(&ny+&nx)];
print 'The Correlations between Xs and Ys', ca,
      'The average Correlations among all Xs = ' CXs,
      'The average Correlations among all Ys = ' CYs,
      'The total correlation matrix of the data', call;
Print Y DaXs;
quit;

Here is the output:

The SAS System

The Correlations between Xs and Ys

ca
            X1          X2          X3
Y1   0.0922853   0.4218243    -0.09542
Y2   0.3152653   0.4900725   0.1158662
Y3   0.2920519   0.3488916    -0.06382

                                               CXs
The average Correlations among all Xs =   0.0283237

                                               CYs
The average Correlations among all Ys =   0.5879789

The total correlation matrix of the data

Call
            Y1          Y2          Y3          X1          X2          X3
Y1           1   0.6281394   0.6452408   0.0922853   0.4218243    -0.09542
Y2   0.6281394           1   0.4864695   0.3152653   0.4900725   0.1158662
Y3   0.6452408   0.4864695           1   0.2920519   0.3488916    -0.06382
X1   0.0922853   0.3152653   0.2920519           1   0.0401596   0.0426824
X2   0.4218243   0.4900725   0.3488916   0.0401596           1   -0.003992
X3    -0.09542   0.1158662    -0.06382   0.0426824   -0.003992           1
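The heart of the method is the post-multiplication by root(): if Z is an NS x k matrix of independent standard normal deviates and U = root(R) is the Cholesky factor of a correlation matrix R, then Z*U has population correlation matrix U'U = R. Here is a minimal sketch of just that step, which I have added by way of illustration (it is not part of Al-Subaihi's code). It uses the 3 x 3 block of correlations among the X variables and a large NS so that the sample correlations come out close to the population values:

proc iml;
   CorX = { 1  .2  .2,
           .2   1  .2,
           .2  .2   1 };                     /* population correlations among the x's */
   NS = 100000;                              /* large NS: sample correlations should be near CorX */
   Z = rannor(repeat(0, NS, ncol(CorX)));    /* NS x 3 independent standard normal deviates */
   X = Z * root(CorX);                       /* impose the correlation structure */
   print (corr(X))[label="Sample correlation matrix"];
quit;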

One can bring this correlation matrix into SAS and then conduct whatever analysis is desired. You should add to the input correlation matrix the Ns, the means, and the standard deviations. See Type=Corr Data Sets in SAS (a sketch of such a data set appears after the PROC CORR output below).

Here are the simulated raw scores for the 20 cases; the first three columns are the matrix y and the last three are DaXs. The sample correlation matrix would be closer to the population correlation matrix had we set NS to a larger value. These scores can simply be copied and pasted into a plain text file for input into SAS or another stat pack later.

        Y1          Y2          Y3          X1          X2          X3
  0.294887   -0.582958    0.297439   -0.441789   0.1445486   -0.346909
 0.0372935   -0.157417   2.1881686   0.4037079   0.7267558   0.1168349
 -0.837347   -0.121842   -0.750433   -1.153303   0.3795585   0.7463251
 -0.209449   0.0010962   -0.107466   0.0959561     -0.0917   0.7647455
 0.7049894   0.5853205   0.4627145   -0.494718   0.4118964   -2.320013
 -0.101087   -0.805897   -0.603725   -0.856924   0.4942442    0.735404
 -1.009013   -0.673193   -0.968747   0.1681752   -1.262204   1.5133175
 0.0292499   -0.665348   -0.040404   -0.151344   -0.430755   -0.802164
 -2.493784   -2.283503   -0.980428   0.1513918   0.5909697   0.5800627
 1.5472928   0.6538862   1.4124367   0.4481895   1.0405913    -0.63712
 0.2970573   -2.210995   0.4902151   -1.370868   -0.131075   0.6516813
 -1.040603    -0.78466   0.9676666   -0.227843   -1.014993   0.3039052
 -0.490062   -1.231672   -0.002651   0.4146054     -0.7227   -2.003181
 -0.743901   -0.243996   -0.312494   -0.451639   0.7575557   -1.822701
 -1.410863    -2.32878   -1.204068   -0.697067   -1.432772   -0.458144
 0.4164414   1.9757161   1.2147687    -0.29125   0.7458188   0.8515013
 0.0843899   0.8979836   0.4082412   0.4260293   -0.216101   1.6424124
 -0.421096   0.0238457   -0.876751   -0.012722   0.5551283   0.3100667
 -0.461053   -0.502053   -0.000088   -0.729896   0.3490318   0.1207564
 -1.247173   -1.315123   -0.203691   -0.667222   -0.747485   -1.776219

Here I illustrate doing an analysis with these simulated data.

data duh;
   input y1 y2 y3 x1 x2 x3;
   cards;
 0.294887 -0.582958  0.297439 -0.441789  0.1445486 -0.346909
 0.0372935 -0.157417  2.1881686  0.4037079  0.7267558  0.1168349
-0.837347 -0.121842 -0.750433 -1.153303  0.3795585  0.7463251
-0.209449  0.0010962 -0.107466  0.0959561 -0.0917  0.7647455
 0.7049894  0.5853205  0.4627145 -0.494718  0.4118964 -2.320013
-0.101087 -0.805897 -0.603725 -0.856924  0.4942442  0.735404
-1.009013 -0.673193 -0.968747  0.1681752 -1.262204  1.5133175
 0.0292499 -0.665348 -0.040404 -0.151344 -0.430755 -0.802164
-2.493784 -2.283503 -0.980428  0.1513918  0.5909697  0.5800627
 1.5472928  0.6538862  1.4124367  0.4481895  1.0405913 -0.63712
 0.2970573 -2.210995  0.4902151 -1.370868 -0.131075  0.6516813
-1.040603 -0.78466  0.9676666 -0.227843 -1.014993  0.3039052
-0.490062 -1.231672 -0.002651  0.4146054 -0.7227 -2.003181
-0.743901 -0.243996 -0.312494 -0.451639  0.7575557 -1.822701
-1.410863 -2.32878 -1.204068 -0.697067 -1.432772 -0.458144
 0.4164414  1.9757161  1.2147687 -0.29125  0.7458188  0.8515013
 0.0843899  0.8979836  0.4082412  0.4260293 -0.216101  1.6424124
-0.421096  0.0238457 -0.876751 -0.012722  0.5551283  0.3100667
-0.461053 -0.502053 -0.000088 -0.729896  0.3490318  0.1207564
-1.247173 -1.315123 -0.203691 -0.667222 -0.747485 -1.776219
;
proc corr; run;

Here is the output:

The SAS System

The CORR Procedure

6 Variables: y1 y2 y3 x1 x2 x3

Simple Statistics
Variable    N       Mean    Std Dev        Sum    Minimum    Maximum
y1         20   -0.35269    0.87557   -7.05383   -2.49378    1.54729
y2         20   -0.48848    1.08417   -9.76959   -2.32878    1.97572
y3         20    0.06954    0.88870    1.39070   -1.20407    2.18817
x1         20   -0.27193    0.53852   -5.43853   -1.37087    0.44819
x2         20    0.00732    0.72957    0.14631   -1.43277    1.04059
x3         20   -0.09147    1.15754   -1.82944   -2.32001    1.64241

Pearson Correlation Coefficients, N = 20
Prob > |r| under H0: Rho=0

          y1        y2        y3        x1        x2        x3
y1   1.00000   0.62814   0.64524   0.09229   0.42182  -0.09542
               0.0030    0.0021    0.6988    0.0639    0.6890
y2   0.62814   1.00000   0.48647   0.31527   0.49007   0.11587
     0.0030              0.0296    0.1758    0.0283    0.6266
y3   0.64524   0.48647   1.00000   0.29205   0.34889  -0.06382
     0.0021    0.0296              0.2115    0.1316    0.7892
x1   0.09229   0.31527   0.29205   1.00000   0.04016   0.04268
     0.6988    0.1758    0.2115              0.8665    0.8582
x2   0.42182   0.49007   0.34889   0.04016   1.00000  -0.00399
     0.0639    0.0283    0.1316    0.8665              0.9867
x3  -0.09542   0.11587  -0.06382   0.04268  -0.00399   1.00000
     0.6890    0.6266    0.7892    0.8582    0.9867
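As noted above, rather than entering raw scores you could enter the correlation matrix itself as a TYPE=CORR data set, supplying the Ns, the means, and the standard deviations along with the correlations. Here is a minimal sketch of what that might look like for these simulated data (my illustration, not part of the original code; the values are taken from the PROC CORR output above, rounded to five decimals):

data simcorr(type=corr);
   input _TYPE_ $ _NAME_ $ y1 y2 y3 x1 x2 x3;
   datalines;
N    .    20        20        20        20        20        20
MEAN .    -0.35269  -0.48848   0.06954  -0.27193   0.00732  -0.09147
STD  .     0.87557   1.08417   0.88870   0.53852   0.72957   1.15754
CORR y1    1.00000   0.62814   0.64524   0.09229   0.42182  -0.09542
CORR y2    0.62814   1.00000   0.48647   0.31527   0.49007   0.11587
CORR y3    0.64524   0.48647   1.00000   0.29205   0.34889  -0.06382
CORR x1    0.09229   0.31527   0.29205   1.00000   0.04016   0.04268
CORR x2    0.42182   0.49007   0.34889   0.04016   1.00000  -0.00399
CORR x3   -0.09542   0.11587  -0.06382   0.04268  -0.00399   1.00000
;

* Procedures that accept TYPE=CORR input, such as PROC REG, can then analyze it directly;
proc reg data=simcorr;
   model y1 = x1 x2 x3;
run;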

R

Generating Multivariate Random Associated Data shows how to generate random data from a specified correlation matrix. I made minor modifications in the code, including code to write the raw data to a csv file. The user provides the population correlation matrix, the number of rows in that matrix, and the number of observations to be generated.

R <- matrix(cbind(  1, .80, .2,
                  .80,   1, .7,
                   .2,  .7,  1), nrow=3);
U <- t(chol(R));
nvars <- dim(U)[1];
numobs <- 100;
set.seed(1);
random.normal <- matrix(rnorm(nvars*numobs, 0, 1), nrow=nvars, ncol=numobs);
X <- U %*% random.normal;
newX <- t(X);
raw <- as.data.frame(newX);
orig.raw <- as.data.frame(t(random.normal));
names(raw) <- c("response", "predictor1", "predictor2");
cor(raw);
write.csv(raw, file = "priapus.csv")

The sample correlation matrix will be output, like this:

            response predictor1 predictor2
response   1.0000000  0.8463254  0.1882828
predictor1 0.8463254  1.0000000  0.6392855
predictor2 0.1882828  0.6392855  1.0000000

This sample correlation matrix can be input into SAS, or you can use the raw data that were written to the csv file. When you open the csv file you will see that write.csv added a leftmost column containing the row numbers. Before importing the file into a stat pack, you should either delete that column or name it ID, Case, Subject, or another appropriate term. You may also wish to save it as an xlsx file rather than a csv file.

Karl L. Wuensch, 9-November-2015

Return to Wuensch's Stats Lessons
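The import itself is not shown in the handout. Here is a minimal sketch of one way to do it with PROC IMPORT (my addition; it assumes priapus.csv sits in the SAS working directory and that the default variable names are kept):

proc import datafile="priapus.csv" out=raw dbms=csv replace;   /* "raw" is just a name chosen here */
   getnames=yes;
run;

proc corr data=raw;
   var response predictor1 predictor2;
run;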

Here I imported the raw data into SAS and played with it a bit:

Simple Statistics
Variable      N      Mean   Std Dev        Sum   Minimum   Maximum  Label
response    100   0.08437   1.09608    8.43699  -2.88892   2.64917  response
predictor1  100   0.13888   0.99858   13.88845  -2.69543   2.33466  predictor1
predictor2  100   0.08422   0.87835    8.42208  -1.94101   2.04920  predictor2

Pearson Correlation Coefficients, N = 100
Prob > |r| under H0: Rho=0

                         response  predictor1  predictor2
response    response      1.00000     0.84633     0.18828
                                       <.0001      0.0607
predictor1  predictor1    0.84633     1.00000     0.63929
                           <.0001                  <.0001
predictor2  predictor2    0.18828     0.63929     1.00000
                           0.0607      <.0001
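The regression output below appears to have been produced by PROC REG with standardized estimates requested. The handout does not show the code, but a minimal sketch along these lines would do it (my reconstruction, not the original code; "raw" is the hypothetical name of the imported data set, and the STB option requests the standardized estimates):

proc reg data=raw;
   model response = predictor1 predictor2 / stb;
run;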

The SAS System

The REG Procedure
Model: MODEL1
Dependent Variable: response response

Number of Observations Read  100
Number of Observations Used  100

Analysis of Variance
Source            DF   Sum of Squares   Mean Square   F Value   Pr > F
Model              2        110.22089      55.11044    613.29   <.0001
Error             97          8.71649       0.08986
Corrected Total   99        118.93738

Root MSE            0.29977   R-Square   0.9267
Dependent Mean      0.08437   Adj R-Sq   0.9252
Coeff Var         355.30178

Parameter Estimates
Variable    Label       DF   Parameter Estimate   Standard Error   t Value   Pr > |t|   Standardized Estimate
Intercept   Intercept    1             -0.04009          0.03027     -1.32     0.1885                       0
predictor1  predictor1   1              1.34758          0.03924     34.35     <.0001                 1.22770
predictor2  predictor2   1             -0.74445          0.04461    -16.69     <.0001                -0.59657

Do notice that predictor2 is suppressing irrelevant variance in predictor1 (a sketch for verifying this appears at the end of this document). As written, the R code provided above will produce the same sample correlation matrix every time you run it. To get a different matrix randomly obtained from the same population matrix, all you need do is change the seed number. Here is the output when I changed the seed from 1 to 27858:

            response predictor1 predictor2
response   1.0000000  0.8030892  0.2065089
predictor1 0.8030892  1.0000000  0.6918264
predictor2 0.2065089  0.6918264  1.0000000
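One way to check the suppression claim for yourself (this sketch is my addition, not part of the original handout) is to fit the model with and without predictor2, requesting standardized estimates with the STB option, and compare the standardized weight for predictor1 with its zero-order correlation with the response (about .85 above). The data set name "raw" is hypothetical; use whatever name you gave the imported csv data.

proc reg data=raw;
   model response = predictor1 / stb;               /* zero-order model */
   model response = predictor1 predictor2 / stb;    /* adding the suppressor */
run;

With predictor2 in the model, the standardized weight for predictor1 (about 1.23 in the output above) is considerably larger than its zero-order correlation with the response, which is the classic signature of suppression.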