Simulating Multivariate Normal Data

You have a population correlation matrix and wish to simulate a set of data randomly sampled from a population with that structure. I shall present here code and examples for doing this with SAS and with R.

SAS

The code below will simulate data for a matrix of correlations between variables Y1, Y2, Y3, X1, X2, and X3. The user enters the number of subjects (NS), the population correlation matrix, the number of X variables, the number of Y variables, and the number of XY correlations. The code was obtained from a document authored by Ali A. Al-Subaihi. I made two minor modifications: one to correct a malformed do loop and one to print out the raw data.

    OPTIONS ls=100 ps=60 nodate nonumber;
    proc iml;
    /********* The Parameters **************/
    NS=20;                              /* No. of subjects */
    PopCor={ 1 .5 .5 .7 .7 .1,
            .5  1 .5 .7 .7 .1,
            .5 .5  1 .7 .7 .1,
            .7 .7 .7  1 .2 .2,
            .7 .7 .7 .2  1 .2,
            .1 .1 .1 .2 .2  1};
    %Let NY=3;                          /* No. of the y's */
    %Let NX=3;                          /* No. of the x's */
    %Let NPC=9;                         /* No. of yx correlations = NY*NX */
    /***************************************************/
    NV=&NY+&NX;                         /* No. of Variables */
    CorY= PopCor[1:&NY,1:&NY];          /* Corr. among the y's */
    CorX= PopCor[&NY+1:NV,&NY+1:NV];    /* Corr. among the x's */
    CorYX= PopCor[&NY+1:NV,1:&NY];      /* Corr. betw. the y's & the x's */
    do i=1 to ncol(coryx);              /* Corr. betw. the y's & the x's as a column */
       CorYXs=CorYXs//CorYX[,i];
    end;
    %macro loop(npc);
       %Do i=1 %to &NPC;                /* Bi's Correlation matrices */
          Cryx&i=I(2);
          Cryx&i[1,2]=CorYXs[&i,1];
          Cryx&i[2,1]=CorYXs[&i,1];
       %end;
    %mend loop;
    %loop (&npc);
    X=Rannor(Repeat(0,NS,&NX))*root(CorX);   /* The X data matrix */
    y=rannor(repeat(0,ns,&ny))*root(cory);   /* The Y data matrix */
    DaXs=0*j(ns,&NX);
    %macro loop2 (NY);
       %Let k=0;
       %do j= 1 %to &NY;
          %do i=1 %to &NX;
             %Let c=%eval(&i+&k);
             %put c=&c;
             dat=(y[,&j]||X[,&i])*(root(CrYX&c));   /* || is horizontal concatenation */
             dat2=dat2||dat[,2];
          %end;
          %Let k=&c;
          daxs=daxs+dat2;
          free dat2;
       %end;
    %mend loop2;
    %loop2 (&NY );
    daxs=daxs*(1/&ny);
    data=y||daxs;                       /* The final data matrix */
    eg=eigval(corr(daxs));
    CXs=(eg[<>,1]-1)/(&NX-1);           /* The average Correlations among all x's */
    eg=eigval(corr(y));
    CYs=(eg[<>,1]-1)/(&NY-1);           /* The average Correlations among all y's */
    Call=corr(data);                    /* Correlations among all data */
    ca=call[1:&ny,(&ny+1):(&ny+&nx)];
    print 'The Correlations between Xs and Ys',ca,
          'The average Correlations among all Xs = ' CXs,
          'The average Correlations among all Xs = 'CYs,
          'The total correlation matrix of the data', call;
    Print Y DaXs;
    quit;

Here is the output:

    The SAS System

    The Correlations between Xs and Ys

    ca           X1         X2         X3
    Y1    0.0922853  0.4218243   -0.09542
    Y2    0.3152653  0.4900725  0.1158662
    Y3    0.2920519  0.3488916   -0.06382

    CXs
    The average Correlations among all Xs = 0.0283237

    CYs
    The average Correlations among all Xs = 0.5879789

    The total correlation matrix of the data

    Call         Y1         Y2         Y3         X1         X2         X3
    Y1            1  0.6281394  0.6452408  0.0922853  0.4218243   -0.09542
    Y2    0.6281394          1  0.4864695  0.3152653  0.4900725  0.1158662
    Y3    0.6452408  0.4864695          1  0.2920519  0.3488916   -0.06382
    X1    0.0922853  0.3152653  0.2920519          1  0.0401596  0.0426824
    X2    0.4218243  0.4900725  0.3488916  0.0401596          1  -0.003992
    X3     -0.09542  0.1158662   -0.06382  0.0426824  -0.003992          1

Both averages are labeled "among all Xs" because the print statement repeats that label; CYs is the average correlation among the Ys.
One can bring this correlation matrix into SAS and then conduct whatever analysis is desired. You should add to the input correlation matrix the Ns, the means, and the standard deviations. See Type=Corr Data Sets in SAS.

Here are the simulated raw scores for the 20 cases. The sample correlation matrix would be closer to the population correlation matrix were we to have set NS to a larger value. These scores can be simply copied and pasted into a plain text file to input into SAS or another stat pack later.

                    y                                DaXs
             Y1         Y2         Y3         X1         X2         X3
       0.294887  -0.582958   0.297439  -0.441789  0.1445486  -0.346909
      0.0372935  -0.157417  2.1881686  0.4037079  0.7267558  0.1168349
      -0.837347  -0.121842  -0.750433  -1.153303  0.3795585  0.7463251
      -0.209449  0.0010962  -0.107466  0.0959561    -0.0917  0.7647455
      0.7049894  0.5853205  0.4627145  -0.494718  0.4118964  -2.320013
      -0.101087  -0.805897  -0.603725  -0.856924  0.4942442   0.735404
      -1.009013  -0.673193  -0.968747  0.1681752  -1.262204  1.5133175
      0.0292499  -0.665348  -0.040404  -0.151344  -0.430755  -0.802164
      -2.493784  -2.283503  -0.980428  0.1513918  0.5909697  0.5800627
      1.5472928  0.6538862  1.4124367  0.4481895  1.0405913   -0.63712
      0.2970573  -2.210995  0.4902151  -1.370868  -0.131075  0.6516813
      -1.040603   -0.78466  0.9676666  -0.227843  -1.014993  0.3039052
      -0.490062  -1.231672  -0.002651  0.4146054    -0.7227  -2.003181
      -0.743901  -0.243996  -0.312494  -0.451639  0.7575557  -1.822701
      -1.410863   -2.32878  -1.204068  -0.697067  -1.432772  -0.458144
      0.4164414  1.9757161  1.2147687   -0.29125  0.7458188  0.8515013
      0.0843899  0.8979836  0.4082412  0.4260293  -0.216101  1.6424124
      -0.421096  0.0238457  -0.876751  -0.012722  0.5551283  0.3100667
      -0.461053  -0.502053  -0.000088  -0.729896  0.3490318  0.1207564
      -1.247173  -1.315123  -0.203691  -0.667222  -0.747485  -1.776219

Here I illustrate doing an analysis with these simulated data.
    data duh;
    input y1 y2 y3 x1 x2 x3;
    cards;
    0.294887 -0.582958 0.297439 -0.441789 0.1445486 -0.346909
    0.0372935 -0.157417 2.1881686 0.4037079 0.7267558 0.1168349
    -0.837347 -0.121842 -0.750433 -1.153303 0.3795585 0.7463251
    -0.209449 0.0010962 -0.107466 0.0959561 -0.0917 0.7647455
    0.7049894 0.5853205 0.4627145 -0.494718 0.4118964 -2.320013
    -0.101087 -0.805897 -0.603725 -0.856924 0.4942442 0.735404
    -1.009013 -0.673193 -0.968747 0.1681752 -1.262204 1.5133175
    0.0292499 -0.665348 -0.040404 -0.151344 -0.430755 -0.802164
    -2.493784 -2.283503 -0.980428 0.1513918 0.5909697 0.5800627
    1.5472928 0.6538862 1.4124367 0.4481895 1.0405913 -0.63712
    0.2970573 -2.210995 0.4902151 -1.370868 -0.131075 0.6516813
    -1.040603 -0.78466 0.9676666 -0.227843 -1.014993 0.3039052
    -0.490062 -1.231672 -0.002651 0.4146054 -0.7227 -2.003181
    -0.743901 -0.243996 -0.312494 -0.451639 0.7575557 -1.822701
    -1.410863 -2.32878 -1.204068 -0.697067 -1.432772 -0.458144
    0.4164414 1.9757161 1.2147687 -0.29125 0.7458188 0.8515013
    0.0843899 0.8979836 0.4082412 0.4260293 -0.216101 1.6424124
    -0.421096 0.0238457 -0.876751 -0.012722 0.5551283 0.3100667
    -0.461053 -0.502053 -0.000088 -0.729896 0.3490318 0.1207564
    -1.247173 -1.315123 -0.203691 -0.667222 -0.747485 -1.776219
    ;
    proc corr; run;

Here is the output:

    The SAS System

    The CORR Procedure

    6 Variables: y1 y2 y3 x1 x2 x3

    Simple Statistics

    Variable    N      Mean   Std Dev       Sum   Minimum   Maximum
    y1         20  -0.35269   0.87557  -7.05383  -2.49378   1.54729
    y2         20  -0.48848   1.08417  -9.76959  -2.32878   1.97572
    y3         20   0.06954   0.88870   1.39070  -1.20407   2.18817
    x1         20  -0.27193   0.53852  -5.43853  -1.37087   0.44819
    x2         20   0.00732   0.72957   0.14631  -1.43277   1.04059
    x3         20  -0.09147   1.15754  -1.82944  -2.32001   1.64241

    Pearson Correlation Coefficients, N = 20
    Prob > |r| under H0: Rho=0

                y1        y2        y3        x1        x2        x3
    y1     1.00000   0.62814   0.64524   0.09229   0.42182  -0.09542
                      0.0030    0.0021    0.6988    0.0639    0.6890
    y2     0.62814   1.00000   0.48647   0.31527   0.49007   0.11587
            0.0030              0.0296    0.1758    0.0283    0.6266
    y3     0.64524   0.48647   1.00000   0.29205   0.34889  -0.06382
            0.0021    0.0296              0.2115    0.1316    0.7892
    x1     0.09229   0.31527   0.29205   1.00000   0.04016   0.04268
            0.6988    0.1758    0.2115              0.8665    0.8582
    x2     0.42182   0.49007   0.34889   0.04016   1.00000  -0.00399
            0.0639    0.0283    0.1316    0.8665              0.9867
    x3    -0.09542   0.11587  -0.06382   0.04268  -0.00399   1.00000
            0.6890    0.6266    0.7892    0.8582    0.9867
R

The page "Generating Multivariate Random Associated Data" shows how to generate random data from a specified correlation matrix. I made minor modifications in the code, including code to write the raw data to a csv file. The user provides the population correlation matrix, the number of rows in that matrix, and the number of observations to be generated.

    # Population correlation matrix (R is case-sensitive, so R, U, and X
    # must be spelled the same way wherever they are used)
    R <- matrix(cbind(1, .80, .2,
                      .80, 1, .7,
                      .2, .7, 1), nrow=3)
    U <- t(chol(R))                 # lower-triangular Cholesky factor of R
    nvars <- dim(U)[1]              # number of variables
    numobs <- 100                   # number of observations to generate
    set.seed(1)                     # fixes the random draws
    random.normal <- matrix(rnorm(nvars*numobs,0,1), nrow=nvars, ncol=numobs)
    X <- U %*% random.normal        # impose the correlation structure
    newX <- t(X)                    # one row per observation
    raw <- as.data.frame(newX)
    orig.raw <- as.data.frame(t(random.normal))
    names(raw) <- c("response","predictor1","predictor2")
    cor(raw)                        # sample correlation matrix
    write.csv(raw, file = "priapus.csv")

The sample correlation matrix will be output, like this:

                response predictor1 predictor2
    response   1.0000000  0.8463254  0.1882828
    predictor1 0.8463254  1.0000000  0.6392855
    predictor2 0.1882828  0.6392855  1.0000000

This sample correlation matrix can be input into SAS, or you can use the raw data that were written to the csv file. When you open the csv file, the leftmost column will contain the row numbers, which R writes by default, under a blank heading. Before importing the file into a stat pack, you should either delete that leftmost column or name it ID, Case, Subject, or another appropriate term. You may also wish to save it as an xlsx file rather than a csv file.

Karl L. Wuensch, 9-November-2015
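The Cholesky approach used in the R code can be checked with a short sketch. The function name simulate_mvn and the particular sample sizes are my own illustrative choices, not part of the original code. The sketch confirms that the factor U reproduces the population matrix and then reports how far the sample correlations fall from the population values as the number of observations grows.

```r
# Hypothetical helper (not from the original code): draw `numobs` cases from
# a standardized multivariate normal population with correlation matrix R.
simulate_mvn <- function(R, numobs, seed = NULL) {
  if (!is.null(seed)) set.seed(seed)
  U <- t(chol(R))                       # lower-triangular factor of R
  Z <- matrix(rnorm(nrow(R) * numobs), nrow = nrow(R), ncol = numobs)
  t(U %*% Z)                            # numobs rows, one column per variable
}

R <- matrix(c( 1, .8, .2,
              .8,  1, .7,
              .2, .7,  1), nrow = 3)

# U %*% t(U) should equal R apart from rounding error.
U <- t(chol(R))
recon_error <- max(abs(U %*% t(U) - R))

# Largest absolute gap between sample and population correlations
# at several (arbitrarily chosen) sample sizes.
sizes <- c(20, 100, 1000, 100000)
max_dev <- sapply(sizes, function(n) {
  max(abs(cor(simulate_mvn(R, n, seed = 1)) - R))
})
print(recon_error)
print(data.frame(numobs = sizes, max_dev = round(max_dev, 4)))
```

Note that chol() stops with an error when the matrix supplied is not positive definite, so an impossible set of correlations is caught before any data are generated.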
Here I imported the raw data into SAS and played with it a bit:

    Simple Statistics

    Variable      N     Mean  Std Dev       Sum   Minimum  Maximum  Label
    response    100  0.08437  1.09608   8.43699  -2.88892  2.64917  response
    predictor1  100  0.13888  0.99858  13.88845  -2.69543  2.33466  predictor1
    predictor2  100  0.08422  0.87835   8.42208  -1.94101  2.04920  predictor2

    Pearson Correlation Coefficients, N = 100
    Prob > |r| under H0: Rho=0

                  response  predictor1  predictor2
    response       1.00000     0.84633     0.18828
    response                    <.0001      0.0607
    predictor1     0.84633     1.00000     0.63929
    predictor1      <.0001                  <.0001
    predictor2     0.18828     0.63929     1.00000
    predictor2      0.0607      <.0001
    The SAS System

    The REG Procedure
    Model: MODEL1
    Dependent Variable: response response

    Number of Observations Read  100
    Number of Observations Used  100

    Analysis of Variance

    Source            DF  Sum of Squares  Mean Square  F Value  Pr > F
    Model              2       110.22089     55.11044   613.29  <.0001
    Error             97         8.71649      0.08986
    Corrected Total   99       118.93738

    Root MSE          0.29977    R-Square  0.9267
    Dependent Mean    0.08437    Adj R-Sq  0.9252
    Coeff Var       355.30178

    Parameter Estimates

                                 Parameter  Standard                    Standardized
    Variable    Label       DF    Estimate     Error  t Value  Pr > |t|     Estimate
    Intercept   Intercept    1    -0.04009   0.03027    -1.32    0.1885            0
    predictor1  predictor1   1     1.34758   0.03924    34.35    <.0001      1.22770
    predictor2  predictor2   1    -0.74445   0.04461   -16.69    <.0001     -0.59657

Do notice that predictor2 is suppressing irrelevant variance in predictor1.

As written, the R code provided above will produce the same sample correlation matrix every time you run it. To get a different matrix randomly obtained from the same population matrix, all you need do is change the seed number. Here is the output when I changed the seed from 1 to 27858:

                response predictor1 predictor2
    response   1.0000000  0.8030892  0.2065089
    predictor1 0.8030892  1.0000000  0.6918264
    predictor2 0.2065089  0.6918264  1.0000000
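The suppressor effect and the seed-to-seed variation can both be traced back to the population matrix. The following is a minimal sketch, assuming the same population correlations used in the R code above; the helper sample_betas and the third seed (2015) are my own illustrative additions. It computes the population standardized weights as the inverse of the predictor correlation matrix times the predictor-response correlations, then estimates the same weights from samples drawn with different seeds.

```r
vars <- c("response", "predictor1", "predictor2")
R <- matrix(c( 1, .8, .2,
              .8,  1, .7,
              .2, .7,  1), nrow = 3, dimnames = list(vars, vars))

# Population standardized weights: beta = solve(Rxx) %*% rxy
Rxx <- R[2:3, 2:3]                  # correlations among the predictors
rxy <- R[2:3, 1]                    # predictor-response correlations
beta_pop <- solve(Rxx, rxy)         # about 1.294 and -0.706
R2_pop <- sum(beta_pop * rxy)       # about 0.894

# Hypothetical helper: standardized weights estimated from one sample.
sample_betas <- function(seed, numobs = 100) {
  set.seed(seed)
  U <- t(chol(R))
  Z <- matrix(rnorm(3 * numobs), nrow = 3, ncol = numobs)
  raw <- as.data.frame(scale(t(U %*% Z)))   # z scores, so slopes are betas
  names(raw) <- vars
  coef(lm(response ~ predictor1 + predictor2, data = raw))[-1]
}

seeds <- c(1, 27858, 2015)
estimates <- t(sapply(seeds, sample_betas))
rownames(estimates) <- paste0("seed_", seeds)

print(round(beta_pop, 3))
print(round(R2_pop, 3))
print(round(estimates, 3))
```

In the population, predictor2 correlates only .2 with the response yet receives a negative weight of about -0.71, and the weight for predictor1 exceeds its zero-order correlation of .8. That is the suppression pattern noted above, and the sample estimates scatter around these population values as the seed changes.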