DATA REDUCTION: AN INTRODUCTION TO PRINCIPAL COMPONENTS ANALYSIS

Jennifer L. Caveny, M.S.
James F. Murray, Ph.D.
U.S. Quality Algorithms, Inc.

This paper is an introduction to the method of Principal Components (PC) Analysis and the SAS procedure PRINCOMP. First, we give a quick overview of the method. The second section introduces the SAS procedure and outlines the minimum required coding. In the third section, we present an example. Finally, we use the example to demonstrate some code which can be used to graph the principal components.

Section I. Introduction to Principal Components Analysis

PC Analysis has been around for nearly a hundred years. The method was introduced by Pearson around the turn of the century and further developed by Hotelling. It is one of many empirical approaches to data reduction. The main goal of PC Analysis is to reduce the variables in a data set to a minimum number of measurable dimensions while retaining the maximum information. These minimum dimensions are the principal components.

The idea of PC Analysis is to manipulate the original variables (by applying weights) into a new set of "composites" which contain as much of the information from the original data as possible. In order to retain the original relationships, these composites are linear combinations of the original variables. Each composite explains a unique proportion of the variability in the original variables, so that the set of all composites explains all of the original variability.

PC Analysis is commonly used in the field of Psychology to create scoring algorithms for psychometric instruments. For example, patients respond to the questions on a standardized personality inventory. PC Analysis will collapse the original questions to a smaller number of composite scores (say 5 or 10) representing major personality characteristics. These scores are produced by applying a set of weights to the original responses.
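The idea of a composite as a weighted sum can be written out as a short formula. The notation below is an assumption introduced for illustration and does not appear in the paper: x_1, ..., x_p stand for the original variables (for example, the responses to the inventory questions), w_{jk} for the weight given to variable k in composite j, and m for the number of composites retained.

```latex
% Assumed notation: x_1..x_p are the original variables, w_{jk} is the
% weight of variable k in composite j, and m <= p composites are kept.
\[
  \mathrm{PC}_j \;=\; w_{j1}\,x_1 + w_{j2}\,x_2 + \cdots + w_{jp}\,x_p ,
  \qquad j = 1, \dots, m, \quad m \le p .
\]
```

Each composite score is therefore a single number per respondent, and a different set of weights yields a different composite.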
For any one characteristic, those questions which relate similar information are given large weights, and those questions which do not contribute any information are weighted less for that characteristic. PC Analysis is also beneficial with survey data, where it is used to produce scoring algorithms. The original variables can be replaced with the principal components in any analysis or summarization which was appropriate for the original data but not feasible because of the large number of variables.

We would like to choose each composite so that it accounts for as much of the original variability as possible. In other words, we want each composite to be highly correlated with the original variables. The degree of correlation between a composite and the original variables is directly related to the amount of variability contained within that single composite. Therefore, to maximize the variability of a composite is to maximize the correlation between that composite and the original variables. Thus, PC Analysis searches for those sets of weights which produce linear composites explaining the maximum possible variance.

Consider the set of example data plotted as x and y in Figure 1 below.

[Figure 1: scatterplot of the example data, y versus x]

We see that the data seem to fall in a roughly elliptical pattern. Imagine a rotating diameter line through the center point of the ellipse and consider the variability in the data at each angle. Figure 2 shows that the dimension with the most variability falls along the major axis of the ellipse (line a). Assume this line explains 85% of the variance. We can account for the remaining 15% of the variability by constructing line b. These two lines could be considered the two principal components of the original data.

[Figure 2: the same scatterplot of y versus x with line a drawn along the major axis of the ellipse and line b along the minor axis]

Now we can see that with this set of data there are two possible linear combinations and therefore two sets of weights to be applied to the original variables. But how do we come up with these weights? Through the work of many famous (and not so famous) statisticians and researchers, it has been shown that the eigenvectors of the covariance matrix make a good choice for these weights. Because this paper is intended to be an introduction to the method through SAS, we will not go into the details of the theoretical derivation. We will only go so far as to state that the covariance matrix can be decomposed into latent vectors which can be used as the sets of weights to produce the principal components. In addition, associated with each latent vector is a latent root (or eigenvalue) which measures the amount of variability in the set of original variables that is explained by this linear combination; divided by the sum of all the eigenvalues, it gives the proportion of variability explained.

Principal components are extracted in the order of the amount of variability explained. Therefore, the first component identified is that which explains the largest amount of variability. As the components are identified, each will explain the largest proportion of the remaining variance but less variance than the previous components.
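The relationship between the weights, the eigenvalues, and the proportion of variability explained can be summarized in a few lines. This is a sketch in assumed notation, not the paper's derivation: S stands for the p x p covariance (or correlation) matrix of the original variables, w_j for the j-th eigenvector, and lambda_j for its eigenvalue.

```latex
% Assumed notation: S is the covariance (or correlation) matrix,
% w_j the j-th eigenvector (weights), lambda_j the j-th eigenvalue.
\[
  S\,w_j = \lambda_j\,w_j, \qquad w_j^{\top} w_j = 1, \qquad
  \lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_p \ge 0 ,
\]
\[
  \operatorname{Var}(\mathrm{PC}_j) = \lambda_j , \qquad
  \text{proportion explained by } \mathrm{PC}_j
  \;=\; \frac{\lambda_j}{\sum_{k=1}^{p} \lambda_k} .
\]
```

In the two-variable illustration, line a explaining 85% of the variance corresponds to \(\lambda_a / (\lambda_a + \lambda_b) = 0.85\), which leaves 0.15 for line b.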
Also, because each component explains a unique piece of the original variability, all components will be uncorrelated with the previous components. Principal components can be produced until 100% of the variability in the original data is explained. The number of principal components extracted may be as large as the number of original variables. However, the first few composites usually explain a sufficient proportion of the variability from the original data.

Determining how many and which components to retain is the most difficult part of PC Analysis. The SAS System will compute as many principal components as required to explain 100% of the variability in the original variables. It is up to the user to determine how many of those components are practically useful and what "construct" each represents. The idea is to choose the smallest number of principal components which account for the largest amount of variability. A rule of thumb is to choose enough components to explain at least 80% of the variability. Scatterplots of the principal components are sometimes helpful in identifying the dimension each component represents.

Another aspect which should be considered in PC Analysis is the scaling of the original variables. When principal components are extracted from the covariance matrix, those variables with the largest variance will be given the highest weights. Remember that variance is not unit-free: the variance of a variable is directly affected by the unit of measure. When the magnitudes of the variances are related to the units of measure, the principal component weights will also be a function of the units of measure. The common remedy is to extract principal components from the correlation matrix rather than the covariance matrix. Because correlations have no units and take on values in the range of -1 to 1, the effect of units has been removed. Another option is to transform all original variables to the same scale.
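The two remedies are closely related, which a short identity makes explicit. The symbols here are assumptions for illustration: \(\bar{x}_j\) and \(s_j\) are the sample mean and standard deviation of variable \(x_j\), \(s_{jk}\) is the sample covariance of \(x_j\) and \(x_k\), and \(r_{jk}\) is their correlation.

```latex
% Assumed notation: z_j is the standardized version of x_j.
\[
  z_j = \frac{x_j - \bar{x}_j}{s_j}, \qquad
  r_{jk} \;=\; \frac{s_{jk}}{s_j\,s_k} \;=\; \operatorname{Cov}(z_j, z_k) .
\]
```

In other words, extracting components from the correlation matrix of the original variables gives the same result as extracting them from the covariance matrix of the standardized variables.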
For example, all variables could be transformed to standard normal variates.

The major rationale for performing a PC analysis is to reduce the complexity of dealing with all of the original data. A key result of the PC analysis is an understanding of the important constructs represented within the original data.

Section II. SAS PROC PRINCOMP

The SAS System for Statistical Analysis provides several approaches to Principal Components Analysis. The following table identifies the available procedures along with the type of data needed and the analyses performed.

Procedure   Analysis        Type of Data
PRINCOMP    PC              continuous
FACTOR      Common Factor   continuous
            PC              continuous
PRINQUAL    PC              categorical
CORRESP     Weighted PC     categorical

PROC PRINCOMP provides a straightforward approach to PC Analysis and is the topic of this paper. There are six main statements for PROC PRINCOMP: the PROC call, the VAR, BY, FREQ, PARTIAL, and WEIGHT statements. Each of these statements will be discussed in greater detail.

The PROC call has three options that are used to specify data sets. First, the common DATA= option tells the system which data set holds the set of original variables. As with other procedures, if the DATA= option is not specified, the most recently created data set will be used. The input data set can be a regular SAS data set or output from many other SAS procedures, such as a correlation matrix, a covariance matrix, or a sums of squares and crossproducts matrix. There are two available options for requesting output data sets. The OUT= option produces a data set holding all of the original observations and variables plus the new principal components. The OUTSTAT= option creates a data set containing summary statistics. Both of these data sets will be described in more detail later.

The user can tailor his or her analysis with several available options in the PROC call. First, the N= option specifies the exact number of principal components to be computed. Remember, SAS will by default produce as many principal components as needed to explain all of the variability in the original variables. It is the researcher's responsibility to decide how many components are actually useful. It should be noted that the weightings and values of the component scores will not be affected by this option. Only the number of components output and the amount of variability explained will be limited.
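The data set options and the N= option described above can be sketched as follows. This is a minimal illustration under stated assumptions, not code from the paper: the data set DEMO, its variables X1-X4, the seed, and the output names PCOUT and PCSTATS are all hypothetical, and the RAND and STREAMINIT functions assume a more recent SAS release than the one the paper was written for.

```sas
/* Hypothetical demonstration data: 50 observations, 4 numeric variables. */
data demo;
   call streaminit(2024);
   do id = 1 to 50;
      x1 = rand('normal');
      x2 = x1 + 0.5 * rand('normal');   /* x2 correlated with x1 */
      x3 = rand('normal');
      x4 = x3 + 0.5 * rand('normal');   /* x4 correlated with x3 */
      output;
   end;
run;

/* DATA=    names the input data set.                               */
/* OUT=     saves the original data plus the component scores.      */
/* OUTSTAT= saves summary statistics, eigenvalues and eigenvectors. */
/* N=3      limits the output to the first three components.        */
proc princomp data=demo out=pcout outstat=pcstats n=3;
   var x1-x4;
run;
```

With N=3, the OUT= data set should hold PRIN1 through PRIN3 alongside the original variables, and the weights for those three components are the same as they would be without the option.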
If an output data set containing the principal component scores is to be saved and used in future analyses, the user may want to control the names of the variables holding the component scores. This can be accomplished through the PREFIX= option. The components will be assigned names beginning with the value of PREFIX and ending with the number of the component. As with all SAS variables, the name (PREFIX + number) cannot exceed eight characters. For example, if the user expects nine or fewer principal components, then the value of PREFIX can hold up to seven characters, with the last character reserved for the component number. If more than nine components are expected, then at least two characters should be reserved for the component number.

Another useful option is NOINT, which forces SAS to use the correlation (or covariance) matrix without correction for the mean. This means that no intercept term is included in the model. The NOINT option is only used when prior work has convinced the analyst that the intercept should equal zero. A similar option is VARDEF=, which allows the user to specify the denominator to be used in variance and standard deviation calculations. The possible values are DF, N, WEIGHT (or WGT), and WDF. The default value is DF, which indicates that the error degrees of freedom are to be used. The value N requests that the number of observations be used as the denominator. When relative weights are given to individual observations (through the WEIGHT statement), the divisor can be either the sum of the weights (VARDEF=WGT) or the sum of the weights minus one (VARDEF=WDF).

The COV (or COVARIANCE) option requests that the principal components be extracted from the covariance matrix (rather than the default correlation matrix).
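A short sketch of the PREFIX=, VARDEF= and COV options follows. The data set DEMO, its variables X1-X4, the prefix QUAL, and the output name PCSCORE are hypothetical names chosen for illustration; the variables are generated on a common scale so that the covariance matrix is a reasonable choice here.

```sas
/* Hypothetical demonstration data: four variables on the same scale. */
data demo;
   call streaminit(2024);
   do id = 1 to 50;
      x1 = rand('normal');
      x2 = x1 + 0.5 * rand('normal');
      x3 = rand('normal');
      x4 = x3 + 0.5 * rand('normal');
      output;
   end;
run;

/* PREFIX=QUAL names the score variables QUAL1, QUAL2, ...            */
/* COV         extracts the components from the covariance matrix.    */
/* VARDEF=N    uses the number of observations as the variance divisor. */
proc princomp data=demo out=pcscore prefix=QUAL cov vardef=n;
   var x1-x4;
run;
```

The prefix QUAL uses four characters, which leaves room for the component number within the eight-character limit discussed above.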
Recall, variables with large variances will be weighted more heavily when the covariance matrix is used. Thus, this option should not be used unless the original variables are all measured in like units or have equal variance.

The STD (or STANDARD) option requests that the resulting principal components be standardized to have unit variance. The decomposition method ensures that the eigenvectors of the correlation matrix will have unit length. This option causes the eigenvectors to be divided by the square roots of the eigenvalues to produce principal component scores which have unit variance.

Finally, the NOPRINT option can be used to suppress printed output. This option is useful when many PC analyses are being produced. It is usually only used when either the OUT= or OUTSTAT= option is specified as an alternative means of receiving the results.

The next most common statement in PROC PRINCOMP is the VAR statement, which identifies the set of original variables. Because we are expecting SAS to perform numeric calculations, the variables listed in this statement must be numeric. Character or other nonnumeric types of variables will cause an error message. SAS does not limit the number of variables which can be specified; however, the number of observations in the data set will limit the number of variables that can be considered. The number of observations should always exceed the number of variables. While no hard rule exists, it is prudent to follow the guide of at least ten observations per variable. If the VAR statement is omitted, SAS will perform the PC Analysis on all numeric variables in the input data set.

As with all SAS procedures, a BY statement can be included to request a separate analysis for each unique value of the BY variables. The input data should be sorted in order of the BY variables. The output data sets will also include the BY variables.

The WEIGHT statement can be used to introduce relative weights for individual observations in the original set of variables. In some cases, the researcher may have prior knowledge that the reliability of certain variables differs greatly across observations and may wish to weight the observations differentially. For example, if the observations in the data set are estimated means, then weighting each observation by the reciprocal of its variance would provide optimal results. Another situation where weights may be useful is in the case of missing values. If missing values are replaced with mean or median values, those observations with a substitute value may be weighted less than observations with actual values.

Sometimes, the input data may be previously summarized. In other words, each observation in the input data set may represent multiple occurrences of that unique combination of values of the original variables. In this case, the FREQ statement can be used to identify the variable which holds the frequencies. Theoretically, this frequency variable should be an integer.
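The VAR, BY, WEIGHT and FREQ statements can be combined in one call, as in the sketch below. Everything named here is hypothetical and chosen only for illustration: the data set DEMO, the BY variable REGION, the weight variable RELWT, and the frequency variable COUNT. The data are generated already ordered by REGION, so no sort step is needed.

```sas
/* Hypothetical demonstration data: two regions, 40 rows each. */
data demo;
   call streaminit(2024);
   do region = 'A', 'B';
      do id = 1 to 40;
         x1 = rand('normal');
         x2 = x1 + 0.5 * rand('normal');
         x3 = rand('normal');
         relwt = 1 + rand('uniform');            /* relative weight       */
         count = 1 + floor(3 * rand('uniform')); /* integer frequency 1-3 */
         output;
      end;
   end;
run;

proc princomp data=demo out=pcout;
   by region;      /* separate analysis for each region           */
   var x1-x3;      /* numeric variables to analyze                */
   weight relwt;   /* relative weight of each observation         */
   freq count;     /* each row stands for COUNT identical records */
run;
```

Because RELWT and COUNT are numeric, naming X1-X3 explicitly in the VAR statement keeps them out of the analysis.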
If noninteger values are included, SAS will only use the integer portion of the value. Likewise, SAS will exclude observations with missing or zero values of the FREQ variable.

The user can request that the effect of certain variables be removed from the correlation matrix prior to the extraction of principal components. Those variables can be identified with the PARTIAL statement. As with the VAR statement, all variables named should be numeric. PRINCOMP uses the PARTIAL variables to predict the VAR variables and then computes residuals. In output data sets, these residuals are named with the characters "R_" prefixed to the first six characters of the VAR variable names. The principal components are extracted from the correlation matrix of these residuals.

Now that we have covered all of the statements and options, let's go back to the output data sets. First, the OUT= data set contains all of the original variables plus the new variables holding the principal component scores. These new variables will be named PRIN1, PRIN2, etc. if the PREFIX= option is not used to customize the names. If the N= option is specified, only that number of component score variables will be included in this data set. The number of observations in this data set will equal that of the input data set. It should be noted that an OUT= data set cannot be created if the input data set is a summary or statistic type of data set (i.e., TYPE = CORR, COV, EST, SSCP, etc.). If the PARTIAL statement is used, then this data set also contains the residuals in variables named by the convention discussed previously.

The OUTSTAT= data set contains summary statistics for each of the variables listed in the VAR statement. This data set contains variables with the same names as those used in the PC analysis (i.e., those listed in the VAR statement). However, in this data set those variables hold the values of the summary statistics, not the raw data.
Since the statistics are contained as observations, they can be identified with the _TYPE_ variable. The following table lists the available statistics.

Statistic                     _TYPE_
means                         MEAN
standard deviations           STD
number of observations        N
sum of weights                SUMWGT
correlations                  CORR
covariances                   COV
eigenvalues                   EIGENVAL
weights (eigenvectors)        SCORE
uncorrected std deviations    USTD
uncorrected correlations      UCORR
uncorrected covariances       UCOV
uncorrected weights           USCORE

There will be one _TYPE_ = MEAN observation for each value of the BY variables, or one observation when there is no BY statement. These observations will be omitted if the PARTIAL statement is used.

The _TYPE_ = STD observations contain the standard deviations of the original variables. The OUTSTAT= data set will contain one of these observations for each value of the BY variables. When the PARTIAL statement is used, these observations hold the square root of the mean square error from the prediction of the original variables by the PARTIAL variables. These observations are excluded when the COV option is used.

The observations with _TYPE_ = N will contain the same value across all variables. Again, there will be one observation per BY value. When the PARTIAL statement and the default value of VARDEF (DF) are used, the degrees of freedom for the PARTIAL variables are removed from the number of observations.

The _TYPE_ = SUMWGT records will only be output when the sum of the weights differs from the number of observations. The values contained in these observations will be equal across all variables. If the PARTIAL statement is used along with the VARDEF=WDF option, then this value is decremented by the degrees of freedom of the PARTIAL variables.

There will be as many _TYPE_ = CORR observations in the OUTSTAT= data set as there are variables being analyzed. These observations contain the correlations between pairs of the original variables. The _NAME_ variable identifies the second variable of the pair. When the PARTIAL statement is used, the partial correlations are output instead of the raw correlations. These observations are excluded when the COV option is used.

Similarly, the _TYPE_ = COV observations are only included when the COV option is specified. The number of these observations is equal to the number of original variables. The partial covariances will be output when the PARTIAL statement is used.

The observations containing the eigenvalues are identified by _TYPE_ = EIGENVAL. There will be one for each value of the BY variables. The eigenvalues are found in the variables named after the original variables. However, it should be noted that this does not mean that an individual eigenvalue is directly related to any individual input variable.
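Reading the eigenvalues back out of the OUTSTAT= data set might look like the sketch below. The names DEMO, X1-X4, PCSTATS, EIGEN, PROP1-PROP4 and TOTAL are hypothetical, and the proportion calculation is an added illustration rather than part of the paper's code.

```sas
/* Hypothetical demonstration data. */
data demo;
   call streaminit(2024);
   do id = 1 to 50;
      x1 = rand('normal');
      x2 = x1 + 0.5 * rand('normal');
      x3 = rand('normal');
      x4 = x3 + 0.5 * rand('normal');
      output;
   end;
run;

proc princomp data=demo outstat=pcstats noprint;
   var x1-x4;
run;

/* Keep only the eigenvalue observation. The eigenvalues sit in the */
/* variables named after the VAR variables: X1 holds the first      */
/* eigenvalue, X2 the second, and so on.                            */
data eigen;
   set pcstats;
   where _type_ = 'EIGENVAL';
   array ev{*} x1-x4;
   array prop{4};                 /* creates PROP1-PROP4 */
   total = sum(of ev{*});
   do j = 1 to dim(ev);
      prop{j} = ev{j} / total;    /* proportion of variance explained */
   end;
   drop j;
run;

proc print data=eigen;
run;
```

Because the default analysis uses the correlation matrix, the eigenvalues in this sketch should sum to the number of variables, so TOTAL is expected to be 4.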
The eigenvalues will be assigned to these variables based on the ordering of the variables in the VAR statement. The first eigenvalue can be found in the first variable listed, and so on. When the N= option is used to limit the number of principal components, only that number of eigenvalues will be output and the remaining variables will hold missing values.

There are as many observations with _TYPE_ = SCORE as there are principal components. Thus, if a certain number of components is requested, only that number of SCORE observations will be included in the OUTSTAT= data set. The corresponding principal component can be identified with the _NAME_ variable, which will hold the names assigned with the PREFIX= option or the default names.

The OUTSTAT= data set will have different TYPE values depending upon various options. The default value is CORR. When the components are extracted from the covariance matrix (i.e., the COV option is used), the resulting data set will have TYPE=COV. When the NOINT option is used, the output data set will be unadjusted, so the value of TYPE will be UCORR or UCOV depending upon the use of the COV option.

Armed with this knowledge of the options and statements available in PROC PRINCOMP, we will now proceed to an example.

Section III. Example

For illustration purposes let's consider a contrived example. Suppose a leading health care plan develops a compensation system which will encourage providers to improve quality of care. Each provider will be measured on twelve variables.

1. hospital cost per member per month
2. specialist physician cost per member per month
3. emergency room cost per member per month
4. primary care physician encounters per member per year
5. laboratory tests per member per year
6. radiology encounters per member per year
7. proportion of members transferring to other primary care providers during the year

average scores from survey questions where members rate their physicians on the following qualities:

8. "Ability to make appointments for checkups"
9. "Ability to make appointments for illness"
10. "Ability to contact doctor when office is closed"
11. "Ability to obtain referrals to specialists after evaluation by primary doctor"
12. "Response to an emergency call within 3 minutes"

It is suspected that these twelve measures will actually represent two or three constructs which will capture the quality of care these providers are furnishing. We want to identify these constructs and develop composite scores for each. Thus, we'll perform a PC analysis.

For each measure, the providers will be ranked and assigned scores within a fixed range, where high numbers indicate better performance. The input data set will contain one observation per provider and twelve variables measuring the provider's relative performance in the designated areas. For simplicity's sake, we will refer to these scores as MEAS1-MEAS12, where the number corresponds to the items listed above. The table below displays ten example observations.

[Table: ten example observations with columns ID and MEAS1 through MEAS12; values not recoverable]

First, we will consider missing values. All twelve of the measures should contribute to a provider's scores on the final constructs; therefore, missing values will be replaced with the average value for that measure.

We want to perform a very simple PC analysis with no weights or partialling. The components will be extracted from the correlation matrix with an intercept term, and no standardization of scores will be requested. The following code produces this simple analysis.

   PROC PRINCOMP DATA = DOCS OUT = PCOUT;
      VAR MEAS1-MEAS12;
      TITLE 'PC Analysis on Twelve Quality Measures';
   RUN;

An example of SAS output which would be produced from this code is shown on the following pages. The printed output includes summary statistics (namely the mean, standard deviation and number of observations) for each of the twelve original variables.

The next section of the printed output is a display of the correlation matrix. All elements along the diagonal are equal to one. Each off-diagonal element tells the degree of the linear relationship between those two original variables. In our contrived example, MEAS1 and MEAS2 have a strong positive linear relationship with a correlation of .813. On the other hand, MEAS1 and MEAS12 do not appear to have a linear relationship, since their correlation is close to zero.

The third section lists the eigenvalues associated with each principal component in order of the proportion of variability explained.
We see that SAS produced twelve principal components in our example. The first component explains about 41% of the variability in the set of original variables. The second and third components add an additional 21% and 20% respectively. Thus the first three principal components account for approximately 82% of the variance of the original variables.

The final section of the printed output displays the eigenvectors of the correlation matrix. These vectors of constants are used as weights to create the component scores. For example, let's consider the first component. From the first eigenvector (listed under the column labeled PRIN1 on the output), we can compute a value for the first principal component for each observation in the original data set through the following equation, in which w1 through w12 are the twelve elements of the first eigenvector (the numeric values are not recoverable here).

   PRIN1 = w1 * MEAS1 + w2 * MEAS2 + ... + w12 * MEAS12

However, because we did request an OUT= data set, these scores were computed for each principal component and saved in the PCOUT data set as PRIN1-PRIN12.
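When the scores are not saved with OUT=, one way to apply the eigenvector weights afterwards is PROC SCORE, which reads the _TYPE_ = SCORE observations of an OUTSTAT= data set. The sketch below uses hypothetical names (DEMO, X1-X4, PCSTATS, SCORED) rather than the twelve measures of the example, and it is offered as an illustration, not as the authors' method.

```sas
/* Hypothetical demonstration data. */
data demo;
   call streaminit(2024);
   do id = 1 to 50;
      x1 = rand('normal');
      x2 = x1 + 0.5 * rand('normal');
      x3 = rand('normal');
      x4 = x3 + 0.5 * rand('normal');
      output;
   end;
run;

/* Save the weights (eigenvectors) without printing or scoring. */
proc princomp data=demo outstat=pcstats noprint;
   var x1-x4;
run;

/* Apply the saved weights to the raw data. Because PCSTATS also  */
/* carries the means and standard deviations, PROC SCORE can      */
/* standardize the variables before applying the weights.         */
proc score data=demo score=pcstats out=scored;
   var x1-x4;
run;
```

The SCORED data set should then contain one new variable per component, named from the _NAME_ values in PCSTATS (PRIN1, PRIN2, and so on by default).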

From these weights, we see that the original variables MEAS7 through MEAS12 contribute the most to this component, while all of the other original variables have lower weights. Consider what these six original variables are measuring: the rate of member transfers and the five survey responses. This first principal component appears to be a construct of member satisfaction.

Let's consider the second principal component. From the second eigenvector (column labeled PRIN2) we see that MEAS1, MEAS2 and MEAS3 all have weights around .5, which are the heaviest weights for this component. These three variables measure the per member per month costs which a provider incurred during the year for hospital, specialist and emergency room use. This principal component might be representing resource utilization.

Now consider the third component and its weightings. As we've seen before, three of the original variables (MEAS3, MEAS4 and MEAS5) have the highest weightings for this component. These three variables measure a physician's use of network services. We consider this component to be the dimension of access to available services.

None of the nine remaining principal components uniquely explains any significant amount of the variability in the original data. Recall, the goal of PC Analysis is to reduce the number of original variables while retaining as much of the original information as possible. We have accounted for more than 80% of the variance in the twelve measures with three principal components.

[Modified SAS output, Principal Component Analysis, 12 variables: simple statistics (mean and standard deviation) for MEAS1 through MEAS12, followed by the correlation matrix; numeric values not recoverable]

[Modified SAS output, continued: remainder of the correlation matrix; eigenvalues of the correlation matrix (columns Eigenvalue, Difference, Proportion, Cumulative) for PRIN1 through PRIN12; eigenvectors PRIN1 through PRIN12 for MEAS1 through MEAS12; numeric values not recoverable]

Section IV. Graphing Principal Components

Now suppose we would like to evaluate the fictitious providers from the contrived example based on their scores on the three principal components. Scatterplots of the components help to visualize how a provider is doing in all three areas. The following code computes confidence intervals around the components and creates a scatterplot of the first two components.

   PROC UNIVARIATE DATA = PCOUT NOPRINT;
      VAR PRIN1 PRIN2 PRIN3;
      OUTPUT OUT = STATS MEAN = MEAN1 MEAN2 MEAN3
                         STD  = STD1 STD2 STD3;
   RUN;

   DATA PCOUT;
      IF _N_ EQ 1 THEN SET STATS;
      SET PCOUT;
      * set up variables for plot annotation *;
      LENGTH FUNCTION STYLE $8 TEXT $4 POSITION XSYS YSYS HSYS $1;
      RETAIN MEAN1 MEAN2 MEAN3 STD1 STD2 STD3;
      RETAIN SIZE 2 FUNCTION 'label' POSITION '5'
             XSYS '2' YSYS '2' HSYS '2' STYLE 'zapf';
      * label coordinates, assumed because they are not legible in the source *;
      X = PRIN1;
      Y = PRIN2;
      * choose two providers to highlight *;
      IF ID IN ('1367', '1394') THEN TEXT = ID;
      ELSE TEXT = '*';
      * compute confidence intervals *;
      * the 1.96 multiplier is assumed from the 95 percent bounds stated in the text *;
      ARRAY L {*} L1 L2 L3;
      ARRAY U {*} U1 U2 U3;
      ARRAY M {*} MEAN1 MEAN2 MEAN3;
      ARRAY S {*} STD1 STD2 STD3;
      DO I = 1 TO 3;
         L{I} = M{I} - 1.96 * S{I};
         U{I} = M{I} + 1.96 * S{I};
      END;
   RUN;

   GOPTIONS RESET = ALL DEVICE = HPLJ3SI;

   PROC GPLOT DATA = PCOUT;
      AXIS1 WIDTH = 3 LABEL = (FONT=ZAPF 'COMPONENT 1');
      AXIS2 WIDTH = 3 LABEL = (FONT=ZAPF 'COMPONENT 2');
      SYMBOL1 VALUE = NONE;
      SYMBOL2 VALUE = POINT I = JOIN;
      SYMBOL3 VALUE = POINT I = JOIN;
      SYMBOL4 VALUE = POINT I = JOIN;
      SYMBOL5 VALUE = POINT I = JOIN;
      PLOT PRIN2*PRIN1 PRIN2*L1 PRIN2*U1 L2*PRIN1 U2*PRIN1 /
           ANNOTATE=PCOUT OVERLAY FRAME HAXIS = AXIS1 VAXIS = AXIS2;
      TITLE FONT = ZAPF JUSTIFY=CENTER 'FIRST TWO PRINCIPAL COMPONENTS';
   RUN;
   QUIT;

The graphs are shown on the following page. The four lines on each graph represent the 95% confidence bounds for the two components being graphed. This is a simplistic approach, but it does give us an idea of relative performance. Recall, the original variables were all measured on a common scale where higher numbers represent better performance.
Therefore, the components will be interpreted in the same manner. The numbers on the graphs identify fictitious providers who have been highlighted solely for illustration purposes. In a real-life situation, points would be identified based on some predefined criteria. For example, an analyst may want to identify all subjects falling outside a certain confidence band.

First, consider the first highlighted provider. From the first two graphs we see that this provider's score falls outside of the lower confidence bound on the first principal component. In fact, he has the lowest score for this component. However, from the third graph we see that his scores on the second and third principal components are not significantly different from the overall mean scores. From this analysis, we learn that this provider could stand to improve upon member satisfaction.

Now consider the second highlighted provider. From the graphs we see that this provider performs well in all three areas. In addition, he seems to be outstanding in the area of utilization.

These three components will be useful in evaluating the quality of care each physician provides. In this example, PC Analysis proved effective in reducing the number of measures from twelve to three.
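The idea of identifying all subjects that fall outside a confidence band can be sketched without any graphics. The code below is an illustration under stated assumptions, not the authors' procedure: DEMO, X1-X4, FLAGGED, LOWER1, UPPER1 and OUTSIDE1 are hypothetical names, and the 1.96 multiplier is assumed for a 95% band around the mean of the first component.

```sas
/* Hypothetical demonstration data and component scores. */
data demo;
   call streaminit(2024);
   do id = 1 to 50;
      x1 = rand('normal');
      x2 = x1 + 0.5 * rand('normal');
      x3 = rand('normal');
      x4 = x3 + 0.5 * rand('normal');
      output;
   end;
run;

proc princomp data=demo out=pcout noprint;
   var x1-x4;
run;

/* Mean and standard deviation of the first component. */
proc means data=pcout noprint;
   var prin1;
   output out=stats mean=mean1 std=std1;
run;

/* Attach the statistics to every row and flag rows outside the band. */
data flagged;
   if _n_ = 1 then set stats(keep=mean1 std1);
   set pcout;
   lower1 = mean1 - 1.96 * std1;   /* assumed 95 percent lower bound */
   upper1 = mean1 + 1.96 * std1;   /* assumed 95 percent upper bound */
   outside1 = (prin1 < lower1 or prin1 > upper1);   /* 1 = outside */
run;

proc print data=flagged;
   where outside1 = 1;
   var id prin1 lower1 upper1;
run;
```

The same pattern extends to the second and third components by adding them to the VAR and OUTPUT statements and repeating the bound calculation.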

[Figures: three scatterplots titled "First Two Principal Components", "Second and Third Principal Components", and "First and Third Principal Components"; plotted points not recoverable]

SAS is a registered trademark of SAS Institute, Inc., Cary, NC.
