Statistical Models for Management. Instituto Superior de Ciências do Trabalho e da Empresa (ISCTE) Lisbon. February 24-26, 2010

Graeme Hutcheson, University of Manchester

Principal Component and Factor Analysis

The lecture notes, exercises and data sets associated with this course are available for download from:

Factor Analysis (FA) and Principal Components Analysis (PCA) are very similar techniques in that they both attempt to analyse the structure in a data set and define a small number of components or factors that capture most of the variation in the data frame. With a large number of variables it may be easier to consider a small number of combinations of the original data rather than the entire data frame. The two techniques differ in that PCA identifies components that capture the variation in the data frame without attempting to interpret the meaning of these components. Factor analysis identifies the structure in the data frame (often using the PCA technique), but also tries to explain what that structure means.

In this session we will use the word component to identify the structure determined by PCA and factor to indicate the structure that has been determined by FA. As we are primarily concerned with explaining the meaning of the structure in our data, most of the discussion will refer to factors, which can be interpreted as meaningful components.

Introduction to Factor Analysis

Factor analysis assumes that relationships between variables are due to the effects of underlying factors and that observed correlations are the result of variables sharing common factors. Consider the hypothetical correlation matrix in table 1, which shows student performance in a number of different academic disciplines:

Table 1: Correlation Matrix

            Maths   Physics   Computing   Art   Drama   English
Maths       1.00
Physics
Computing
Art
Drama
English

A visual inspection of table 1 suggests that the six disciplines might usefully be divided into two groups. Maths, Physics and Computing appear to be closely related and constitute one group, whilst Art, Drama and English, which also appear to be closely related, constitute the other group. For these data, a factor analysis should clearly indicate the presence of two underlying factors which could be interpreted as representing the different types of skills required to succeed in the disciplines. Maths, Physics and Computing could be related as they all require an ability to think logically, whereas English, Art and Drama might require a more abstract style of thought. The way in which the disciplines have grouped together could therefore have been determined by the underlying factors of artistic and logical aptitude.

Describing a data set in terms of factors (or latent variables, as they are sometimes called) can be useful in the identification of underlying processes which determine correlations among the variables. In the example above, the marks obtained by the children might be better understood as a function of whether the discipline requires logical or creative ability rather than skills which are specific to each individual subject. A description of the children's performance given in terms of two separate factors, as opposed to six related variables, has resulted in a simpler interpretation.

General Principles of Factor Analysis

Each variable in factor analysis is expressed as a linear combination of factors which are not actually observed. For example, a person's result in an examination might be influenced by a number of factors, such as the person's aptitude in that particular subject, his or her experience with taking examinations, IQ and writing ability. The score a person gets on a test will be a reflection of a number of different abilities (factors) which affect the test score. A person's test score can be predicted by taking account of these abilities, as shown in Equation 1.
\[ \text{Test Score} = a(\text{factor 1}) + b(\text{factor 2}) + c(\text{factor 3}) + U_{\text{test score}} \qquad (1) \]

where a, b and c indicate the extent to which the different factors influence the test score and U represents an unknown component of the test score. Applying this equation to the example above we get:

\[ \text{Test Score} = a(\text{IQ}) + b(\text{Experience}) + c(\text{Writing ability}) + U_{\text{test score}} \]

This equation is similar to a multiple regression equation except that IQ, Experience and Writing ability are not single independent variables but are labels for the underlying factors. IQ, Experience and Writing ability are called Common Factors, since all variables are expressed as functions of them. The U in the equation is called a Unique Factor, since it represents the part of the test score that cannot be explained by the common factors; $U_{\text{test score}}$ is unique to the test score variable. It should be noted that we do not know what these factors are in advance, as their meaning can only be determined by interpreting the results of the analysis.
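As a rough illustration of Equation 1, the sketch below simulates six marks from two unobserved common factors plus a unique part for each subject, and then inspects the correlations. The sample size, the weight of 0.8, the noise level and all object names are assumptions made for illustration; none of them come from the course data.

```r
# Simulated illustration of Equation 1 (all numbers are assumed).
set.seed(1)
n <- 500
logical_aptitude  <- rnorm(n)   # unobserved common factor 1
artistic_aptitude <- rnorm(n)   # unobserved common factor 2
unique_part <- function() rnorm(n, sd = 0.6)   # the U term, drawn afresh per subject

marks <- data.frame(
  Maths     = 0.8 * logical_aptitude  + unique_part(),
  Physics   = 0.8 * logical_aptitude  + unique_part(),
  Computing = 0.8 * logical_aptitude  + unique_part(),
  Art       = 0.8 * artistic_aptitude + unique_part(),
  Drama     = 0.8 * artistic_aptitude + unique_part(),
  English   = 0.8 * artistic_aptitude + unique_part()
)

# Subjects that share a common factor correlate; subjects that do not are
# close to uncorrelated, giving the two-block pattern described for table 1.
R <- cor(marks)
print(round(R, 2))
```

Because each mark depends on only one common factor here, the simulated matrix is an idealised case; real data would normally show some cross-correlation between the groups.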

Equation 1 showed that a particular variable can be expressed in terms of unobserved factors. It is also possible to define an unobserved factor in terms of the observed variables. Each factor is identified from the correlations between the variables in the analysis, and Equation 2 defines a factor in these terms.

\[ \text{Factor X} = \beta_1 \text{Var}_1 + \beta_2 \text{Var}_2 + \beta_3 \text{Var}_3 + \dots + \beta_k \text{Var}_k \qquad (2) \]

where $\text{Var}_1, \text{Var}_2, \dots, \text{Var}_k$ are variables and $\beta_1, \beta_2, \dots, \beta_k$ are standardised regression coefficients. Applying this equation to the example above we get:

\[ \text{Factor X} = \beta_1 \text{Mathematics} + \beta_2 \text{Physics} + \beta_3 \text{Computing} + \beta_4 \text{Art} + \beta_5 \text{Drama} + \beta_6 \text{English} \]

where Mathematics, Physics, Computing, Art, Drama and English are variables and $\beta_1, \beta_2, \dots, \beta_6$ are standardised regression coefficients.

Equation 2 calculates one of the factors which might underlie the data set. Additional factors can be calculated to explain the remaining variance in the data. For example, if the first factor to be calculated accounted for 40% of the variability in the data, there would remain 60% of the variance which is unaccounted for. The next factor to be computed would account for as much of the remaining variance as possible. Factor 1 accounts for the biggest portion of the variance in the data, factor 2 accounts for the next biggest portion, factor 3 accounts for the third biggest portion, and so on. Successively smaller amounts of variance are accounted for by calculating further factors until all of the variance is accounted for.

Example Data set

The technique of Factor analysis will be demonstrated using a real data set which shows children's performance on a number of tests. These tests were designed to assess a range of different abilities and skills. Table 2 shows the 17 variables included in the analysis.
Table 2: Variables in data file

Label     Description
active    How active
art       Articulation
atten     Attention
comp      Comprehension
coord     Coordination
draw      Drawing
lexp      Expressive language
mat       Mathematical ability
motsk     Motor skills
newsit    Capability in new situations
saint     Social interaction 1
sencom    Sentence completion
sint      Social interaction 2
temp      Temperament
under     Understanding of language
vocab     Vocabulary
writ      Writing
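Equation 2 can be made concrete with a small sketch. It uses principal-component weights as the $\beta$ coefficients, which is an assumption about how the factor is estimated, and it runs on simulated variables (v1 to v4 are hypothetical, not the variables of Table 2). The point is only that a component score is a weighted sum of the standardised variables.

```r
# Equation 2 by hand: a component score as a weighted sum of standardised
# variables. The data are simulated and the variable names are hypothetical.
set.seed(2)
n <- 300
f <- rnorm(n)
X <- cbind(v1 = f + rnorm(n), v2 = f + rnorm(n),
           v3 = f + rnorm(n), v4 = rnorm(n))

pc <- princomp(X, cor = TRUE)

# Weights (beta_1 ... beta_k) for the first component.
beta <- unclass(loadings(pc))[, 1]

# Standardise with the centre and scale that princomp itself used.
Z <- scale(X, center = pc$center, scale = pc$scale)

# Weighted sum of the standardised variables, compared with princomp's scores.
by_hand <- as.vector(Z %*% beta)
same <- isTRUE(all.equal(by_hand, as.vector(pc$scores[, 1])))
print(same)
```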

Level of measurement

FA is based on correlations, so continuous data are required. However, this requirement is often relaxed so that ordered data can be used (see Hutcheson and Sofroniou, 1999, for a full discussion of this issue).

Measures of Sampling Adequacy

A useful method for determining the appropriateness of running a factor analysis is to compute a measure of sampling adequacy. Such measures have been proposed by Kaiser (1970) and are based on an index which compares correlation and partial correlation coefficients (these measures of sampling adequacy are also known as Kaiser-Meyer-Olkin, or KMO, statistics). KMO statistics can be calculated for individual and multiple variables. As these measures are not required for the understanding of the factor analysis technique, they will not be covered in detail here. Full explanations are, however, provided in Hutcheson and Sofroniou, 1999. An algorithm for computing the KMO statistics in R can be obtained from Graeme Hutcheson on request.

Principal Components Analysis (PCA)

Once the variables that are to be used in the factor analysis have been selected (based on theoretical considerations, the KMO statistics and the level of measurement of the data), the individual components that define the structure in the data can be determined using PCA. PCA identifies linear combinations of the observed variables, with the first principal-component, PC(1), being the linear combination of variables that accounts for the largest amount of variance in the sample. The second principal-component, PC(2), is the linear combination of the variables which is uncorrelated with PC(1) and accounts for the maximum amount of the remaining variation in the data. Successive components explain progressively smaller portions of the total sample variance, and are all uncorrelated with each other. Essentially, principal-component analysis transforms a set of correlated variables into a set of uncorrelated components.
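The two properties just described (uncorrelated components, progressively smaller variances) can be checked on simulated data. In this sketch the five variables x1 to x5 and their shared source g are hypothetical; only princomp and cor are taken from R itself.

```r
# Correlated variables in, uncorrelated components out (simulated data).
set.seed(3)
n <- 400
g <- rnorm(n)   # a shared source that makes the variables correlate
X <- sapply(1:5, function(j) 0.7 * g + rnorm(n, sd = 0.7))
colnames(X) <- paste0("x", 1:5)

pc <- princomp(X, cor = TRUE)

# The original variables are clearly correlated ...
var_cor <- cor(X)
# ... whereas the component scores are uncorrelated with one another.
score_cor <- cor(pc$scores)
max_off_diag <- max(abs(score_cor[upper.tri(score_cor)]))

# Component variances, largest first.
variances <- pc$sdev^2
print(round(variances, 3))
print(max_off_diag)
```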
Principal Components Analysis
Data set: factor.txt (available for download from RGSweb)

Rcmdr: commands
Statistics -> Dimensional analysis -> Principal-components analysis...
Principal Components Analysis
    Variables (pick two or more): select all variables
    Analyze correlation matrix: select
    OK

Rcmdr: output
> .PC <- princomp(~ACTIVE+ART+ATTEN+COMP+COORD+DRAW+LEXP+MAT+MOTSK+NEWSIT+SAINT+SENCOM+SINT+TEMP+UNDER+VOCAB+WRIT, cor=TRUE, data=Dataset)
> unclass(loadings(.PC)) # component loadings
[Component loadings: a 17 x 17 matrix, rows ACTIVE to WRIT, columns Comp.1 to Comp.17]
> .PC$sd^2 # component variances
[Component variances for Comp.1 to Comp.17]

The principal-components analysis in Rcmdr has transformed the 17 correlated variables (ACTIVE to WRIT) into 17 uncorrelated components (Comp.1 to Comp.17). The component loadings show the correlations between each of the variables and the new components. In this analysis, as we have 17 variables represented by 17 components, all of the variation in each variable is accounted for (we have not lost any information by transforming the variables into components). The squared loadings for a variable will sum to 1.0. In this form, the data have just been rearranged. The task now is to see if the data can be represented appropriately using fewer components.

Selecting the number of components

The principal components analysis provides information about the amount of variance explained by each of the components. In the Rcmdr output the component variances show the Eigenvalues for each of the components. The Eigenvalues just indicate the amount of variance in all the data that is accounted for by the component. As we have 17 variables, the component variances also add up to 17 (add them up). We can see that the first component accounts for the largest amount of variance (8.28 out of 17), followed by the second (2 out of 17) and followed by successively smaller components. If we were to just consider the first two components, these would account for about 60% of the variation in the data (10.28 out of 17).

An Eigenvalue of 1.0 indicates the same amount of variance as is explained by a single variable. Although many packages have a default to extract only those components that have Eigenvalues of 1 or more, in practice a useful solution might be obtained using fewer or more components than this. The current analysis suggests that a solution of around 4 may be appropriate.

An easy way to view the Eigenvalues is to use a scree plot (look this up on the web for a description of why it is called a scree plot and also how it might be best interpreted). The commands to obtain a screeplot in Rcmdr are shown below.

Principal Components Analysis: drawing a screeplot
Data set: factor.txt (available for download from RGSweb)

Rcmdr: commands
Statistics -> Dimensional analysis -> Principal-components analysis...
Principal Components Analysis
    Variables (pick two or more): select all variables
    Screeplot: select
    OK

The analysis above suggests that nearly 70% of the variation in the data (the 17 correlated variables) can be represented by just 3 principal components. This figure is the sum of the first three component variances, divided by 17 and multiplied by 100.
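The arithmetic behind these statements (Eigenvalues summing to the number of variables, proportions of variance, the Eigenvalue-of-1 rule and the scree plot) can be sketched as follows. The eight simulated variables are hypothetical stand-ins, so the numbers will not match the 17-variable course data.

```r
# Eigenvalues, proportion of variance and a scree plot (simulated data).
set.seed(4)
n <- 400
f1 <- rnorm(n)
f2 <- rnorm(n)
X <- cbind(
  a1 = f1 + rnorm(n, sd = 0.7), a2 = f1 + rnorm(n, sd = 0.7),
  a3 = f1 + rnorm(n, sd = 0.7),
  b1 = f2 + rnorm(n, sd = 0.7), b2 = f2 + rnorm(n, sd = 0.7),
  b3 = f2 + rnorm(n, sd = 0.7),
  c1 = rnorm(n), c2 = rnorm(n)
)

pc <- princomp(X, cor = TRUE)

eigenvalues <- pc$sdev^2          # component variances
total <- sum(eigenvalues)         # equals the number of variables (8 here)
proportion <- eigenvalues / total
cumulative <- cumsum(proportion)
n_kaiser <- sum(eigenvalues >= 1) # components with an Eigenvalue of 1 or more

print(round(rbind(eigenvalues, proportion, cumulative), 3))
print(n_kaiser)

# Scree plot of the component variances.
screeplot(pc, type = "lines", main = "Scree plot (simulated data)")
```

The Eigenvalue-of-1 count is only a starting point; as noted above, a useful solution may use fewer or more components.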

Figure 1: A screeplot

Interpreting the components

This section is included here to illustrate the point that the components, whilst accounting for a large proportion of the variance, may not have easily interpretable meanings assigned to them. In practice, the following statistics would not be computed as a matter of course, as a factor analysis would probably be used directly. This section therefore provides a demonstration and is not part of the normal analytic procedure.

The main question here is what these three components look like and how they relate to the original variables. In order to answer this question, we can save the 3 principal-components to the data set. This can be achieved very simply in Rcmdr using the commands given below. These commands run the principal-components analysis using the correlation matrix and then save the first three components to the data set as variables PC1, PC2 and PC3.

Although of limited use for general analysis, it is useful for the purposes of this lecture to see the relationship between the original variables and the 3 components. This will tell us how much of each variable is explained by the 3 components. We can do this in Rcmdr by correlating the components with the variables using a matrix correlation.

From the correlation analysis output (only the important results have been reported here) we can read off how much of each variable (for example ACTIVE or ART) is accounted for by PC1 (principal-component 1). The amount of the variable ACTIVE that is accounted for by all three principal-components is the sum of its three squared loadings, which equals 0.6908. We can therefore say that 69.08% of the variance in the variable ACTIVE is accounted for by the three principal-components. These statistics are commonly provided in software and are also known as the communalities.
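The communality calculation described here (correlate each variable with the retained components, square, and sum) can be sketched on simulated data. The six variables are hypothetical. One detail worth knowing, stated here as a property of princomp rather than something taken from the notes: the values returned by loadings() are unit-length coefficients, and multiplying each column by the component's standard deviation gives the variable-component correlations.

```r
# Communalities from the first k components (simulated data).
set.seed(5)
n <- 400
f1 <- rnorm(n)
f2 <- rnorm(n)
X <- cbind(
  v1 = f1 + rnorm(n, sd = 0.7), v2 = f1 + rnorm(n, sd = 0.7),
  v3 = f1 + rnorm(n, sd = 0.7),
  v4 = f2 + rnorm(n, sd = 0.7), v5 = f2 + rnorm(n, sd = 0.7),
  v6 = f2 + rnorm(n, sd = 0.7)
)

pc <- princomp(X, cor = TRUE)
k <- 3

# Correlations between each variable and the first k component scores.
r <- cor(X, pc$scores[, 1:k])

# Communality: the sum of squared correlations across the k components.
communality <- rowSums(r^2)
print(round(cbind(r, communality), 3))

# The same correlations obtained from the loadings and component sds.
scaled_loadings <- sweep(unclass(loadings(pc))[, 1:k], 2, pc$sdev[1:k], "*")
```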

Principal Components Analysis: saving the components
Data set: factor.txt (available for download from RGSweb)

Rcmdr: commands
Statistics -> Dimensional analysis -> Principal-components analysis...
Principal Components Analysis
    Variables (pick two or more): select all variables
    Analyze correlation matrix: select
    Add principal components to data set: select
    OK
Number of Components
    Number of components to retain: select 3
    OK

Rcmdr: output
> Dataset$PC1 <- .PC$scores[,1]
> Dataset$PC2 <- .PC$scores[,2]
> Dataset$PC3 <- .PC$scores[,3]

Principal Components Analysis: correlating variables and main components
Data set: factor.txt (available for download from RGSweb)

Rcmdr: commands
Statistics -> Summaries -> Correlation matrix...
Correlation Matrix
    Variables (pick two or more): select all variables
    Types of correlation: Pearson product-moment: select
    OK

Rcmdr: output
[Correlations of the variables ACTIVE to WRIT with PC1, PC2 and PC3]

What we can note from the output is that all of the variables load most highly on principal-component 1. The variables are therefore related most highly to this component. This creates a problem if we wish to assign a meaning to the component, as at the moment PC1 seems to represent every variable. Although we have identified a number of components in the data, we cannot assign any meaning to these components. This is where the technique of Factor Analysis is of help.

Factor Analysis

Factor Analysis is probably most easily understood as a technique that redistributes the loadings of the components (see above) so that they can be interpreted. We saw that the variables above all loaded highly on component 1. Factor analysis attempts to redistribute these loadings so that they load on a number of different factors. The hope is that those variables that share similar underlying causes will load together on a single factor. The technique used to redistribute the loadings is called rotation. After a rotation has been applied to the data, the components are called factors.

The Rotation Phase

We can see from the analysis above that the principal-components are not always easy to interpret as they are often correlated with many variables. In the example above PC1 shows the highest loading for all variables apart from TEMP. Using this matrix it is not easy to assign any description to the factors. In such cases the technique of rotation can be used, which transforms the factors to make them more easily interpretable.

Orthogonal and Oblique Rotation

There are two general types of rotation which can be carried out, orthogonal and oblique. Orthogonal rotation refers to the procedure where the computed factors are uncorrelated with one another; in a two factor model this can be graphically represented by the axes remaining at right angles. The table shown in Figure 2 gives the factor loadings for four variables for a two factor solution before and after an orthogonal rotation. This information is also shown graphically. It can be seen that after rotation the two axes are still at right angles (and hence uncorrelated); this graph therefore represents an orthogonal rotation of components resulting in orthogonal factors. In this case the rotation has resulted in an easy interpretation for the factors.

Rcmdr uses the varimax method, which attempts to minimise the number of variables which have a high loading on a factor. This enhances the interpretability of the factors. Although other orthogonal rotation methods are available, we shall just deal with varimax.
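An orthogonal rotation can be sketched with R's varimax function applied to the first two components of a simulated data set. The data, the choice of two components and the object names are assumptions for illustration. The sketch checks two things the text relies on: the rotation matrix keeps the axes at right angles, and the rotation only redistributes each variable's loadings without changing its communality.

```r
# Varimax rotation of two principal components (simulated data).
set.seed(6)
n <- 500
f1 <- rnorm(n)
f2 <- 0.5 * f1 + sqrt(0.75) * rnorm(n)   # related skills, so PC1 is general
noise <- function() rnorm(n, sd = 0.6)
X <- cbind(
  v1 = 0.8 * f1 + noise(), v2 = 0.8 * f1 + noise(), v3 = 0.8 * f1 + noise(),
  v4 = 0.8 * f2 + noise(), v5 = 0.8 * f2 + noise(), v6 = 0.8 * f2 + noise()
)

pc <- princomp(X, cor = TRUE)

# Unrotated loadings expressed as variable-component correlations.
initial <- sweep(unclass(loadings(pc))[, 1:2], 2, pc$sdev[1:2], "*")

rot <- varimax(initial)
rotated <- unclass(rot$loadings)
colnames(rotated) <- c("Factor1", "Factor2")

print(round(cbind(initial, rotated), 2))

# Orthogonal rotation: t(T) %*% T is the identity matrix.
orthogonality_error <- max(abs(crossprod(rot$rotmat) - diag(2)))
print(orthogonality_error)
```

Before rotation every variable loads on both components; after rotation each variable should load mainly on one factor.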
If we allow for some correlation between the factors, the factor matrix can sometimes be simplified (assuming also that it is theoretically justified to have correlated factors). For example, in figure 3, if the axes went through the dotted lines, a simpler pattern matrix would result than with orthogonal rotation (keeping the axes at right angles). A rotation which allows for some correlation between the factors is termed oblique. Oblique rotation has come into favour recently for several reasons. It is unlikely that influences in nature are uncorrelated. Even if they are uncorrelated in the population, they need not be so in the sample. Thus, oblique rotations have often been found to yield substantively meaningful factors. The method Rcmdr uses for oblique rotation is called promax.

The table in Figure 3 shows rotated and unrotated two factor solutions for six variables. This information is also shown graphically. It can be seen that the rotated factors are easier to interpret as the variables load highly on only one factor.

A factor analysis in R using an orthogonal rotation technique is shown in the example below.

Figure 2: Orthogonal rotation of Components (plot axes: Factor One, Factor Two; accompanying table of loadings for four variables on Component 1 and Component 2 before rotation, and on Factor 1 and Factor 2 after rotation)

Figure 3: Oblique Rotation of Components (plot axes: Factor One, Factor Two; accompanying table of loadings for six variables on Component 1 and Component 2 before rotation, and on Factor 1 and Factor 2 after rotation)
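An oblique rotation of the kind pictured in Figure 3 can be sketched with R's promax function. The data are simulated with two deliberately correlated factors, and the variable names are hypothetical. The factor correlation matrix is derived from the rotation matrix on the assumption that it follows the same convention factanal uses when it prints factor correlations.

```r
# Promax (oblique) rotation of a two-factor solution (simulated data).
set.seed(7)
n <- 500
f1 <- rnorm(n)
f2 <- 0.5 * f1 + sqrt(0.75) * rnorm(n)   # factors built to correlate at 0.5
noise <- function() rnorm(n, sd = 0.6)
X <- cbind(
  v1 = 0.8 * f1 + noise(), v2 = 0.8 * f1 + noise(), v3 = 0.8 * f1 + noise(),
  v4 = 0.8 * f2 + noise(), v5 = 0.8 * f2 + noise(), v6 = 0.8 * f2 + noise()
)

# Unrotated maximum likelihood solution, then an oblique rotation.
fa <- factanal(X, factors = 2, rotation = "none")
oblique <- promax(unclass(loadings(fa)))
pattern <- unclass(oblique$loadings)
print(round(pattern, 2))

# Correlations between the rotated factors implied by the rotation matrix
# (assumed convention: solve(rotmat) %*% t(solve(rotmat))).
T_inv <- solve(oblique$rotmat)
phi <- T_inv %*% t(T_inv)
print(round(phi, 2))
```

A clearly non-zero off-diagonal value in phi indicates that the axes are no longer at right angles, which is what distinguishes this solution from a varimax one.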

Factor Analysis
Data set: factor.txt (available for download from RGSweb)

Rcmdr: commands
Statistics -> Dimensional analysis -> Factor analysis...
Factor Analysis
    Variables (pick three or more): select all variables
    Factor Rotation: Varimax: select
    Factor Scores: None: select
    OK
Number of Factors
    Number of factors to extract: select 3
    OK

Rcmdr: output
Call:
factanal(x = ~ACTIVE + ART + ATTEN + COMP + COORD + DRAW + LEXP + MAT + MOTSK + NEWSIT + SAINT + SENCOM + SINT + TEMP + UNDER + VOCAB + WRIT, factors = 3, data = Dataset, scores = "none", rotation = "varimax")

Uniquenesses:
[One value for each variable, ACTIVE to WRIT]

Loadings:
[Loadings of the variables ACTIVE to WRIT on Factor1, Factor2 and Factor3]

[SS loadings, Proportion Var and Cumulative Var for Factor1, Factor2 and Factor3]

Test of the hypothesis that 3 factors are sufficient.
The chi square statistic is on 88 degrees of freedom.
The p-value is 5.11e-25

There are a number of things to note from the factor analysis output shown above:

The uniqueness gives an indication of the uniqueness of the variable (the variation in the variable that cannot be attributed to any factor). This is the $U_{\text{test score}}$ element in the equation discussed earlier:

\[ \text{Test Score} = a(\text{IQ}) + b(\text{Experience}) + c(\text{Writing ability}) + U_{\text{test score}} \]

From this score, we can see that some variables have very small values (e.g., SENCOM, VOCAB and COORD), indicating that the factors represent the variation in the variables well, whereas other variables have much higher values (e.g., TEMP and MAT), indicating that a smaller amount of the variation in these variables is represented by the factors.

From the output, we can also see that the 3 factors account for 63.8% of the variance in the data (this is shown in the Cumulative Var row). The proportion of variance accounted for by each factor is shown in the Proportion Var row. In this case factor 1 accounts for 26.3%. You may note that these statistics are slightly different from those obtained for the PCA model above, as the default method for FA in R is factanal, which performs a maximum likelihood factor analysis rather than a principal components analysis. The interpretation and basic theory, however, remain unchanged.

The test of the hypothesis that 3 factors are sufficient gives a highly significant chi-square value. This indicates that 3 factors are not sufficient to represent the data. On the basis of this evidence we would certainly wish to look at solutions with more than 3 factors.

From the Loadings, we can see which variables load on which factor. This information enables us to define the factors and give them labels. We can see that the variables now load highly on particular factors. Using the components (i.e., before any rotation was applied), the variables all loaded onto PC1. Once a rotation has been applied, the variables load on different factors. The table below shows these loadings arranged into order.
Table 3: Factor loadings

Factor 1
    VOCAB     .875    Vocabulary
    SENCOM    .865    Sentence completion
    ART       .863    Articulation
    LEXP      .809    Expressive language
    UNDER     .708    Understanding of language
    COMP      .667    Comprehension
Factor 2
    SAINT     .847    Social interaction 1
    ACTIVE    .811    How active
    SINT      .795    Social interaction 2
Factor 3
    COORD     .827    Coordination
    WRIT      .786    Writing
    DRAW      .775    Drawing
    MOTSK     .539    Motor skills
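The steps used to read the factanal output (uniquenesses, the sufficiency test, and loadings arranged by factor) can be sketched on simulated data. The six variables w1 to s3 and the data frame name are hypothetical, and a two-factor model is fitted only because the simulated data were built from two factors.

```r
# Reading a varimax-rotated factanal solution (simulated data).
set.seed(8)
n <- 500
verbal <- rnorm(n)
social <- rnorm(n)
noise <- function() rnorm(n, sd = 0.6)
Dataset_sim <- data.frame(
  w1 = 0.8 * verbal + noise(), w2 = 0.8 * verbal + noise(),
  w3 = 0.8 * verbal + noise(),
  s1 = 0.8 * social + noise(), s2 = 0.8 * social + noise(),
  s3 = 0.8 * social + noise()
)

fa <- factanal(Dataset_sim, factors = 2, rotation = "varimax")
L <- unclass(loadings(fa))

# For each variable: the factor it loads on most highly, that loading,
# and the uniqueness (variation not attributed to any factor).
dominant <- apply(abs(L), 1, which.max)
summary_table <- data.frame(
  factor     = dominant,
  loading    = round(L[cbind(seq_len(nrow(L)), dominant)], 3),
  uniqueness = round(fa$uniquenesses, 3)
)
ordered_rows <- order(summary_table$factor, -abs(summary_table$loading))
print(summary_table[ordered_rows, ])

# Built-in alternative: sorted loadings with small values suppressed.
print(loadings(fa), cutoff = 0.3, sort = TRUE)

# Test of the hypothesis that this number of factors is sufficient.
print(c(statistic = unname(fa$STATISTIC), dof = fa$dof,
        p.value = unname(fa$PVAL)))
```

A small p-value here would suggest, as in the notes, that more factors are needed; with data simulated from exactly two factors a large p-value is the more likely outcome.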

Interpreting Factors

The rotated factor matrix provides a much clearer interpretation of the results, as can be seen in table 3. The unrotated component matrix did not allow an easy interpretation of the components; once a rotation has been completed, the interpretation of the factors becomes clearer. Factor 1 relates to linguistic skills, Factor 2 to social skills and Factor 3 to practical skills. The data set can now be described in terms of three underlying factors instead of 17 variables. Scores on the factors can be saved as new variables and entered into other analyses such as regression.

A graphical illustration of factor analysis

Using similar data to that used above, three factors were extracted but couldn't be easily interpreted, as Table 4 shows [1]:

Table 4: Factor Loadings for the 3-Factor Model

Variable               Factor 1   Factor 2   Factor 3
Articulation
Attention
Comprehension
Coordination
Drawing
Memory
Motor Skill
Sentence Completion
Temperament
Writing

The factors can be interpreted by identifying the variables they are highly related to. For example, if factor 1 is strongly related to the variables motor skill, drawing and coordination, this could be interpreted as representing physical dexterity. Such a clear-cut identification of the three factors in Table 4 is not possible, as they are related to many variables. The difficulty with providing interpretations for the factors in such circumstances is demonstrated graphically in Figure 4, where the first two factors from the 3-factor model identified above are shown in a simple two-dimensional scatter plot [2]. This graph suggests that there are two distinct factors in the data, which are represented clearly as two clusters of points. The factors are represented as the axes of the graph, and those points falling on or close to an axis indicate a strong relationship with that factor.
We can see from the graph that the variables fall midway between the axes and are therefore related to both factors. The presence of the two factors shown in Figure 4 is not obvious from the factor loadings of the initial factors (see Table 4), as the variables are not exclusively related to one factor or the other. Attempting to interpret the factor loading scores obtained in a principal components analysis directly is therefore not an ideal method to identify distinct factors.

Figure 4: A graphical representation of the principal components

Figure 5: A graphical representation of factor rotation

It can be seen from Figure 5 that the rotated axes in the graph pass much closer to the clusters of points than do the principal component axes. The precise degree of rotation is determined using one of a number of algorithms available in most common statistical software packages. Popular methods include minimizing the number of variables which have high loadings, to enhance the interpretability of the factors, and minimizing the number of factors, which provides simpler interpretations of the variables (refer to Kim and Mueller, 1994, for a discussion of rotation techniques). Although there are many rotation techniques available, in practice the different techniques tend to produce similar results when there is a large sample and the factors are relatively well defined (Fava and Velicer, 1992).

The factor loadings for two rotation techniques, orthogonal and oblique, are shown in Table 5. In this example, the factors can be interpreted directly from the rotated factor loadings, as variables tend to load highly on only one factor. For example, the variable attention, which correlated with all three of the principal components more or less evenly, only correlates highly with a single rotated factor (factor 3). A similar pattern can be seen for the other variables in Table 5.

Table 5: A Component Matrix showing Orthogonal and Oblique Rotation of Factors

Columns: Initial Components; Rotated Factors (Orthogonal); Rotated Factors (Oblique)

Variable
Articulation
Attention
Comprehension
Coordination
Drawing
Memory
Motor Skills
Sent. Comp
Temperament
Writing

[1] See Hutcheson and Sofroniou, 1999, for all references.
[2] In order to show this information in two dimensions, only the first two factors and those variables which form part of a two-factor solution are shown. The variables attention and temperament, which form a third factor, are omitted from this demonstration.
