Statistical Models for Management
Instituto Superior de Ciências do Trabalho e da Empresa (ISCTE), Lisbon
February 24-26, 2010
Graeme Hutcheson, University of Manchester

Principal Component and Factor Analysis

The lecture notes, exercises and data sets associated with this course are available for download from: www.research-training.net/iscte

Factor Analysis (FA) and Principal Components Analysis (PCA) are very similar techniques in that they both analyse the structure in a data set and define a small number of components or factors that capture most of the variation in the data frame. With a large number of variables it may be easier to consider a small number of combinations of the original data rather than the entire data frame. The two techniques differ in that PCA identifies components that capture the variation in the data frame without attempting to interpret the meaning of these components, whereas factor analysis identifies the structure in the data frame (often using the PCA technique) but also tries to explain what that structure means. In this session we will use the word "component" to identify the structure determined by PCA and "factor" to indicate the structure determined by FA. As we are primarily concerned with providing explanations of the meaning of the structure in our data, most of the discussion will refer to factors, which can be interpreted as meaningful components.

Introduction to Factor Analysis

Factor analysis assumes that relationships between variables are due to the effects of underlying factors and that observed correlations are the result of variables sharing common factors. Consider the hypothetical correlation matrix in Table 1, which shows student performance in a number of different academic disciplines.

A visual inspection of Table 1 suggests that the six disciplines might usefully be divided into two groups. Maths, Physics and Computing appear to be closely related and constitute one group, whilst Art, Drama and English, which also appear to be closely related, constitute the other group.

Table 1: Correlation Matrix

               Maths  Physics  Computing   Art  Drama  English
    Maths       1.00
    Physics      .80     1.00
    Computing    .78      .73       1.00
    Art          .12      .14        .15  1.00
    Drama        .04      .21        .13   .68   1.00
    English      .24      .15        .07   .91    .79     1.00

For these data, a factor analysis should clearly indicate the presence of two underlying factors which could be interpreted as representing the different types of skills required to succeed in the disciplines. Maths, Physics and Computing could be related as they all require an ability to think logically, whereas English, Art and Drama might require a more abstract style of thought. The way in which the disciplines have grouped together could therefore have been determined by the underlying factors of artistic and logical aptitude.

Describing a data set in terms of factors (or latent variables, as they are sometimes called) can be useful in the identification of underlying processes which determine correlations among the variables. In the example above, the marks obtained by the children might be better understood as a function of whether the discipline requires logical or creative ability rather than skills which are specific to each individual subject. A description of the children's performance given in terms of two separate factors, as opposed to six related variables, results in a simpler interpretation.

General Principles of Factor Analysis

Each variable in factor analysis is expressed as a linear combination of factors which are not actually observed. For example, a person's result in an examination might be influenced by a number of factors, such as the person's aptitude in that particular subject, his or her experience with taking examinations, IQ and writing ability. The score a person gets on a test will be a reflection of a number of different abilities (factors) which affect the test score. A person's test score can be predicted by taking account of these abilities, as shown in Equation 1:

    Test Score = a(Factor 1) + b(Factor 2) + c(Factor 3) + U(test score)    (1)

where a, b and c indicate the extent to which the different factors influence the test score and U represents an unknown component of the test score. Applying this equation to the example above we get:

    Test Score = a(IQ) + b(Experience) + c(Writing ability) + U(test score)

This equation is similar to a multiple regression equation except that IQ, Experience and Writing ability are not single independent variables but are labels for the underlying factors. IQ, Experience and Writing ability are called common factors, since all variables are expressed as functions of them. The U in the equation is called a unique factor, since it represents the part of the test score that cannot be explained by the common factors; U(test score) is unique to the test score variable. It should be noted that we do not know what these factors are in advance, as their meaning can only be determined by interpreting the results of the analysis.
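The common-factor idea in Equation 1 can be illustrated with a short simulation. The sketch below is not part of the original notes: two uncorrelated latent factors generate four observed scores, and the resulting correlation matrix shows the same block structure as Table 1 (the loadings, error variances and variable names are all illustrative).

    # Two latent factors generate the observed variables; correlations
    # between the observed scores arise only through the shared factors.
    set.seed(1)
    n        <- 500
    logical  <- rnorm(n)                              # latent factor 1
    artistic <- rnorm(n)                              # latent factor 2
    maths    <- 0.90 * logical  + rnorm(n, sd = 0.4)  # rnorm() = unique part
    physics  <- 0.85 * logical  + rnorm(n, sd = 0.4)
    art      <- 0.90 * artistic + rnorm(n, sd = 0.4)
    drama    <- 0.80 * artistic + rnorm(n, sd = 0.4)
    round(cor(cbind(maths, physics, art, drama)), 2)
    # maths/physics and art/drama correlate highly;
    # the cross-correlations are close to zero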

Equation 1 showed that a particular variable can be expressed in terms of unobserved factors. It is also possible to define an unobserved factor in terms of the observed variables. Each factor is identified from the correlations between the variables in the analysis, and Equation 2 defines a factor in these terms:

    Factor X = β1(Var 1) + β2(Var 2) + β3(Var 3) + ... + βk(Var k)    (2)

where Var 1, Var 2, ..., Var k are variables and β1, β2, ..., βk are standardised regression coefficients. Applying this equation to the example above we get:

    Factor X = β1(Mathematics) + β2(Physics) + β3(Computing) + β4(Art) + β5(Drama) + β6(English)

where Mathematics, Physics, Computing, Art, Drama and English are variables and β1, β2, ..., β6 are standardised regression coefficients.

Equation 2 calculates one of the factors which might underlie the data set. Additional factors can be calculated to explain the remaining variance in the data. For example, if the first factor to be calculated accounted for 40% of the variability in the data, there would remain 60% of the variance unaccounted for. The next factor to be computed would account for as much of the remaining variance as possible. Factor 1 accounts for the biggest portion of the variance in the data, factor 2 accounts for the next biggest portion, factor 3 for the third biggest, and so on. Successively smaller amounts of variance are accounted for by extracting further factors until all of the variance is accounted for.

Example Data set

The technique of factor analysis will be demonstrated using a real data set which shows children's performance on a number of tests. These tests were designed to assess a range of different abilities and skills. Table 2 shows the 17 variables included in the analysis.

Table 2: Variables in data file

    Label    Description
    active   How active
    art      Articulation
    atten    Attention
    comp     Comprehension
    coord    Coordination
    draw     Drawing
    lexp     Expressive language
    mat      Mathematical ability
    motsk    Motor skills
    newsit   Capability in new situations
    saint    Social interaction 1
    sencom   Sentence completion
    sint     Social interaction 2
    temp     Temperament
    under    Understanding of language
    vocab    Vocabulary
    writ     Writing
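Before running the analyses below, the data need to be read into R. A minimal sketch, not part of the original notes, assuming factor.txt is a plain text file with a header row containing the 17 variable names:

    # Read the example data set; the file name comes from the Rcmdr boxes below.
    Dataset <- read.table("factor.txt", header = TRUE)
    str(Dataset)   # should show 17 numeric variables, ACTIVE ... WRIT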

Level of measurement

FA is based on correlations, so continuous data are required. However, this requirement is often relaxed so that ordered data can be used (see Hutcheson and Sofroniou, 1999, for a full discussion of this issue).

Measures of Sampling Adequacy

A useful method for determining the appropriateness of running a factor analysis is to compute a measure of sampling adequacy. Such measures have been proposed by Kaiser (1970) and are based on an index which compares correlation and partial correlation coefficients (these measures of sampling adequacy are also known as Kaiser-Meyer-Olkin, or KMO, statistics). KMO statistics can be calculated for individual variables and for groups of variables. As these measures are not required for an understanding of the factor analysis technique, they will not be covered in detail here; full explanations are, however, provided in Hutcheson and Sofroniou, 1999. An algorithm for computing the KMO statistics in R can be obtained from Graeme Hutcheson on request.

Principal Components Analysis (PCA)

Once the variables that are to be used in the factor analysis have been selected (based on theoretical considerations, the KMO statistics and the level of measurement of the data), the individual components that define the structure in the data can be determined using PCA. PCA identifies linear combinations of the observed variables, with the first principal component, PC(1), being the linear combination of variables that accounts for the largest amount of variance in the sample. The second principal component, PC(2), is the linear combination of the variables which is uncorrelated with PC(1) and accounts for the maximum amount of the remaining variation in the data. Successive components explain progressively smaller portions of the total sample variance, and are all uncorrelated with each other. Essentially, principal components analysis transforms a set of correlated variables into a set of uncorrelated components.

The principal components analysis in Rcmdr has transformed the 17 correlated variables (ACTIVE to WRIT) into 17 uncorrelated components (Comp.1 to Comp.17). The component loadings show the correlations between each of the variables and the new components. As we have 17 variables represented by 17 components, all of the variation in each variable is accounted for (we have not lost any information by transforming the variables into components), and the squared loadings for a variable sum to 1.0. In this form, the data have just been rearranged. The task now is to see if the data can be represented appropriately using fewer components.

Selecting the number of components

The principal components analysis provides information about the amount of variance explained by each of the components. In the Rcmdr output, the component variances are the eigenvalues for each of the components. An eigenvalue simply indicates the amount of variance in all the data that is accounted for by a component. As we have 17 variables, the component variances also add up to 17 (add them up). We can see that the first component accounts for the largest amount of variance (8.28 out of 17), followed by the second (2 out of 17) and then by successively smaller components. If we were to consider just the first two components, these would account for about 60% of the variation in the data (10.28 out of 17). An eigenvalue of 1.0 indicates the same amount of variance as is explained by a single variable.
Although many packages extract, by default, only those components that have eigenvalues of 1 or more, in practice a useful solution might be obtained using fewer or more components than this.
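Returning briefly to the measures of sampling adequacy mentioned above: the notes offer an R algorithm on request, but a readily available implementation is KMO() in the add-on psych package. A minimal sketch, not part of the original notes:

    # Overall and per-variable KMO (measure of sampling adequacy).
    library(psych)
    KMO(cor(Dataset))   # values close to 1 suggest the data suit a factor analysis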

Principal Components Analysis

Data set: factor.txt (available for download from RGSweb)

Rcmdr: commands
    Statistics → Dimensional analysis → Principal-components analysis...
        Variables (pick two or more):   select all variables
        Analyze correlation matrix:     select
        OK

Rcmdr: output

> .PC <- princomp(~ACTIVE+ART+ATTEN+COMP+COORD+DRAW+LEXP+MAT+MOTSK+NEWSIT+SAINT+SENCOM+SINT+
    TEMP+UNDER+VOCAB+WRIT, cor=TRUE, data=Dataset)
> unclass(loadings(.PC)) # component loadings

            Comp.1      Comp.2      Comp.3      Comp.4       Comp.5      Comp.6      Comp.7
ACTIVE  -0.2314002 -0.27320244  0.26059408  0.35367956  0.067183513 -0.02503823 -0.13272014
ART     -0.2546264  0.35476397  0.14750807 -0.09904303  0.073718012  0.07143314 -0.04407942
ATTEN   -0.2066590 -0.24471373  0.22124731 -0.48279116 -0.207170278 -0.21436250  0.08207964
COMP    -0.2879066  0.17329181 -0.04394613  0.04221465 -0.135856223 -0.02457099  0.03354750
COORD   -0.2544259 -0.15365049 -0.36661435 -0.02395982  0.175580832 -0.21399516  0.02630530
DRAW    -0.2433396 -0.14921169 -0.37139636 -0.02391467  0.158074730 -0.24819962 -0.16437896
LEXP    -0.2743231  0.27592777  0.11233644 -0.08686783 -0.088884245 -0.04247925 -0.17316543
MAT     -0.1779620 -0.04595127 -0.31892384  0.22245766 -0.830248211  0.16231000  0.17649001
MOTSK   -0.1930976 -0.08909091 -0.33260479 -0.19441160  0.228957095  0.81271186 -0.07539956
NEWSIT  -0.2543743 -0.05873016  0.08328651  0.12821363  0.280840416  0.03719423  0.86423625
SAINT   -0.2427095 -0.24730893  0.26540774  0.37306628  0.026573513  0.10940896 -0.18696783
SENCOM  -0.2780857  0.31419137  0.05627190 -0.10311497 -0.004583613  0.05724753 -0.12571330
SINT    -0.2312356 -0.31173141  0.26065515  0.20331763 -0.008228937  0.13190280 -0.18028500
TEMP    -0.1498802 -0.37468358  0.20907527 -0.57008211 -0.130915637  0.06277715  0.02112767
UNDER   -0.2814661  0.21327706  0.04790684  0.03125934 -0.055728716 -0.13057932  0.13572667
VOCAB   -0.2760968  0.32809763  0.10739953 -0.01748627  0.082005469 -0.00796318 -0.09818551
WRIT    -0.2356517 -0.15331041 -0.39512473  0.01980193  0.128801343 -0.31077983 -0.13918008

             Comp.8       Comp.9     Comp.10      Comp.11       Comp.12     Comp.13
ACTIVE  -0.21636162 -0.120548173  0.49148869 -0.234121243  0.0002900892 -0.25809182
ART     -0.26216877 -0.176809085 -0.05571234 -0.057897218 -0.4369001020 -0.16582571
ATTEN   -0.08348760 -0.622375559 -0.02111103  0.076848598  0.0524038155  0.03986371
COMP     0.54891614  0.126477206  0.27117104 -0.034013978  0.1072129961  0.23512098
COORD    0.05848256 -0.104683183 -0.41783420  0.127172793  0.4205495908 -0.23463097
DRAW    -0.06719568  0.172885494  0.29523290  0.583004154 -0.3458936776 -0.16768021
LEXP    -0.12276513  0.322205258 -0.15201288 -0.230097886  0.2735495450 -0.55265923
MAT     -0.21239672  0.011845470 -0.03626324  0.036197718 -0.0722483921 -0.01524971
MOTSK    0.06578261 -0.202806047  0.10015895 -0.085910953  0.0395852076 -0.06695616
NEWSIT  -0.16473797  0.153709074 -0.03871963  0.002407261 -0.0300759290  0.02956457
SAINT   -0.04809022 -0.049338486  0.01786821  0.220948308  0.4062981276  0.20618371
SENCOM  -0.25232067  0.094501898 -0.06770028  0.177727746  0.1356460963  0.34708447
SINT     0.26344831  0.100659837 -0.58238207  0.016578636 -0.4582144831  0.03154343
TEMP     0.03694200  0.523947744  0.14676985 -0.065313035  0.0149620307  0.05162330
UNDER    0.55600815 -0.212526204  0.13101224 -0.071515295 -0.0837080387 -0.19734863
VOCAB   -0.11535432  0.001836192 -0.01289503  0.127878054  0.0153139949  0.34322106
WRIT    -0.13187578  0.010406348 -0.02665323 -0.640820374 -0.1222634079  0.36255020

            Comp.14     Comp.15     Comp.16     Comp.17
ACTIVE   0.070192548 -0.19025605  0.42627907  0.09430234
ART      0.405521971 -0.32192774 -0.35277283 -0.21885107
ATTEN    0.052743291  0.32485944  0.03445444  0.07246324
COMP     0.600480387  0.18798325  0.09082752 -0.01561056
COORD    0.201370650 -0.40570894  0.19958321 -0.13164417
DRAW    -0.063714920  0.20930320 -0.08458628  0.04648792
LEXP    -0.071559935  0.45010796 -0.06453266  0.04085990
MAT     -0.055911722 -0.07213068  0.02197661 -0.07791369
MOTSK   -0.074430808  0.11170966  0.02176186  0.01828805
NEWSIT  -0.005350489  0.15540130 -0.02319934  0.05044739
SAINT   -0.048770316  0.02315307 -0.57975454 -0.14890603
SENCOM  -0.042467228 -0.21918502  0.10642553  0.69443829
SINT     0.008932030  0.11192918  0.18449358  0.11628957
TEMP    -0.063852264 -0.33496884 -0.05801538 -0.15062147
UNDER   -0.538082562 -0.28000660 -0.17614040  0.10923628
VOCAB   -0.318182065  0.08228285  0.42102985 -0.59557760
WRIT    -0.098638573  0.08929187 -0.19651020  0.01824879

> .PC$sd^2 # component variances

   Comp.1    Comp.2    Comp.3    Comp.4    Comp.5    Comp.6    Comp.7    Comp.8    Comp.9
8.2806840 2.0060064 1.5772454 0.9957742 0.6834144 0.6044778 0.4700830 0.3965184 0.3405472
  Comp.10   Comp.11   Comp.12   Comp.13   Comp.14   Comp.15   Comp.16   Comp.17
0.3053812 0.2972447 0.2102269 0.1971972 0.1849951 0.1689181 0.1575378 0.1237483

The current analysis suggests that a solution of around 4 components may be appropriate. An easy way to view the eigenvalues is to use a scree plot (look this up on the web for a description of why it is called a scree plot and also how it might best be interpreted). The commands to obtain a scree plot in Rcmdr are shown below.

Principal Components Analysis: drawing a screeplot

Data set: factor.txt (available for download from RGSweb)

Rcmdr: commands
    Statistics → Dimensional analysis → Principal-components analysis...
        Variables (pick two or more):   select all variables
        Screeplot:                      select
        OK

The analysis above suggests that nearly 70% of the variation in the data (the 17 correlated variables) can be represented by just 3 principal components ((8.28 + 2.01 + 1.577) / 17 × 100 ≈ 70%).
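The same eigenvalue summary and scree plot can also be produced directly from the princomp object. A minimal sketch, not part of the original notes:

    # Eigenvalues, cumulative proportion of variance, and a scree plot.
    eig <- .PC$sdev^2                  # component variances (eigenvalues)
    round(cumsum(eig) / sum(eig), 3)   # reaches about 0.70 by Comp.3
    screeplot(.PC, type = "lines")     # a scree plot like the one in Figure 1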

Figure 1: A screeplot

Interpreting the components

This section is included here to illustrate the point that the components, whilst accounting for a large proportion of the variance, may not have easily interpretable meanings assigned to them. In practice, the following statistics would not be computed as a matter of course, as a factor analysis would probably be used directly. This section therefore provides a demonstration and is not part of the normal analytic procedure.

The main question here is what these three components look like and how they relate to the original variables. In order to answer this question, we can save the 3 principal components to the data set. This can be achieved very simply in Rcmdr using the commands given below. These commands run the principal components analysis using the correlation matrix and then save the first three components to the data set as the variables PC1, PC2 and PC3.

Although of limited use for general analysis, it is useful for the purposes of this lecture to see the relationship between the original variables and the 3 components. This will tell us how much of each variable is explained by the 3 components. We can do this in Rcmdr by correlating the components with the variables using a matrix correlation.

From the correlation analysis output (only the important results are reported here), we can see that PC1 (principal component 1) accounts for 0.6588 of the variable ACTIVE and 0.7327 of the variable ART. The amount of the variable ACTIVE that is accounted for by all three principal components is the sum of the squared loadings (0.6588² + 0.3869² + 0.3273²), which equals 0.6908. We can therefore say that 69.08% of the variance in the variable ACTIVE is accounted for by the three principal components. These statistics are commonly provided in software and are also known as the communalities.
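The communalities can also be computed directly. A minimal sketch, not part of the original notes, assuming PC1 to PC3 have been saved to the data set as in the box below:

    # Sum of squared variable-component correlations = communality.
    r <- cor(Dataset[c("ACTIVE", "ART")], Dataset[c("PC1", "PC2", "PC3")])
    rowSums(r^2)   # ACTIVE should come out at about 0.6908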

Principal Components Analysis: saving the components

Data set: factor.txt (available for download from RGSweb)

Rcmdr: commands
    Statistics → Dimensional analysis → Principal-components analysis...
        Variables (pick two or more):           select all variables
        Analyze correlation matrix:             select
        Add principal components to data set:   select
        OK
    Number of Components
        Number of components to retain:         select 3
        OK

Rcmdr: output

> Dataset$PC1 <- .PC$scores[,1]
> Dataset$PC2 <- .PC$scores[,2]
> Dataset$PC3 <- .PC$scores[,3]

What we can note from the output is that all of the variables load most highly on principal component 1; the variables are therefore related most highly to this component. This creates a problem if we wish to assign a meaning to the component, as at the moment PC1 seems to represent every variable. Although we have identified a number of components in the data, we cannot assign any meaning to these components. This is where the technique of factor analysis is of help.

Factor Analysis

Factor analysis is probably most easily understood as a technique that redistributes the loadings of the components (see above) so that they can be interpreted. We saw that the variables above all loaded highly on component 1. Factor analysis attempts to redistribute these loadings so that they load on a number of different factors. The hope is that those variables that share similar underlying causes will load together on a single factor. The technique used to redistribute the loadings is called rotation. After a rotation has been applied to the data, the components are called factors.

The Rotation Phase

We can see from the analysis above that the principal components are not always easy to interpret, as they are often correlated with many variables.

Principal Components Analysis: correlating variables and main components

Data set: factor.txt (available for download from RGSweb)

Rcmdr: commands
    Statistics → Summaries → Correlation matrix...
        Variables (pick two or more):                   select all variables
        Types of correlation: Pearson product-moment    select
        OK

Rcmdr: output

                PC1          PC2          PC3
ACTIVE  -0.6658813   -0.3869463    0.3272760
ART     -0.7327175    0.5024648    0.1852531
ATTEN   -0.5946855   -0.3465968    0.2778610
COMP    -0.8284851    0.2454394   -0.05519125
COORD   -0.7321406   -0.2176207   -0.4604252
DRAW    -0.7002383   -0.2113338   -0.4664309
LEXP    -0.7893970    0.3908063    0.1410816
MAT     -0.5121066   -0.06508242  -0.4005314
MOTSK   -0.5556611   -0.1261826   -0.4177131
NEWSIT  -0.7319921   -0.08318162   0.1045982
SAINT   -0.6984253   -0.3502724    0.3333214
SENCOM  -0.8002244    0.4450004    0.07067100
SINT    -0.6654077   -0.4415163    0.3273527
TEMP    -0.4312979   -0.5306777    0.2625744
UNDER   -0.8099518    0.3020719    0.06016545
VOCAB   -0.7945010    0.4646963    0.1348814
WRIT    -0.6781156   -0.2171390   -0.4962309

In the example above, PC1 shows the highest loading for all variables apart from TEMP. Using this matrix it is not easy to assign any description to the factors. In such cases the technique of rotation can be used, which transforms the factors to make them more easily interpretable.

Orthogonal and Oblique Rotation

There are two general types of rotation which can be carried out: orthogonal and oblique. Orthogonal rotation refers to the procedure where the computed factors are uncorrelated with one another; in a two-factor model this can be represented graphically by the axes remaining at right angles. The table shown in Figure 2 shows the factor loadings of four variables for a two-factor solution before and after an orthogonal rotation. This information is also shown graphically. It can be seen that after rotation the two axes are still at right angles (and hence uncorrelated); the graph therefore represents an orthogonal rotation of components resulting in orthogonal factors. In this case the rotation has resulted in an easy interpretation of the factors. Rcmdr uses the varimax method, which attempts to minimise the number of variables that have a high loading on a factor; this enhances the interpretability of the factors. Although other orthogonal rotation methods are available, we shall deal only with varimax.

If we allow for some correlation between the factors, the factor matrix can sometimes be simplified further (assuming that it is theoretically justified to have correlated factors).

For example, in Figure 3, if the axes went through the dotted lines, a simpler pattern matrix would result than with orthogonal rotation (which keeps the axes at right angles). A rotation which allows for some correlation between the factors is termed oblique. Oblique rotation has come into favour recently for several reasons. It is unlikely that influences in nature are uncorrelated, and even if they are uncorrelated in the population, they need not be so in the sample. Oblique rotations have therefore often been found to yield substantively meaningful factors. The method Rcmdr uses for oblique rotation is called promax.

The table in Figure 3 shows rotated and unrotated two-factor solutions for six variables. This information is also shown graphically. It can be seen that the rotated factors are easier to interpret, as the variables load highly on only one factor. A factor analysis in R using an orthogonal rotation technique is shown in the example following the figures.

Figure 2: Orthogonal rotation of components [scatter plot of the loadings below, with Factor One on the horizontal axis and Factor Two on the vertical axis, both running from -1 to 1]

         Initial Components           Rotated Factors
         Component 1  Component 2     Factor 1  Factor 2
    v1      .50000       .50000        .70684   -.01938
    v2      .50000      -.40000        .05324   -.63809
    v3      .70000       .70000        .98958   -.02713
    v4     -.60000       .60000        .02325    .84821

Figure 3: Oblique rotation of components [scatter plot of the loadings below, with Factor One on the horizontal axis and Factor Two on the vertical axis, both running from -1 to 1]

         Initial Components           Rotated Factors
         Component 1  Component 2     Factor 1  Factor 2
    v1      .76558      -.23212        .80000    .00000
    v2      .66989      -.20311        .70000    .00000
    v3      .57419      -.17409        .60000    .00000
    v4      .45410       .53272        .00000    .70000
    v5      .38932       .45662        .00000    .60000
    v6      .32436       .38051        .00000    .50000
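What the rotation algorithms do numerically can be seen by applying R's built-in varimax() and promax() functions to the initial components of Figure 3. A minimal sketch, not part of the original notes (the rotated values may differ slightly from the figure, which was produced with other software):

    # Orthogonal (varimax) and oblique (promax) rotation of a loading matrix.
    A <- matrix(c(.76558, -.23212,
                  .66989, -.20311,
                  .57419, -.17409,
                  .45410,  .53272,
                  .38932,  .45662,
                  .32436,  .38051),
                ncol = 2, byrow = TRUE,
                dimnames = list(paste0("v", 1:6), c("Comp1", "Comp2")))
    varimax(A)$loadings   # factors kept at right angles
    promax(A)$loadings    # factors allowed to correlate; cleaner pattern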

Factor Analysis

Data set: factor.txt (available for download from RGSweb)

Rcmdr: commands
    Statistics → Dimensional analysis → Factor analysis...
        Variables (pick three or more):   select all variables
        Factor Rotation: Varimax          select
        Factor Scores: None               select
        OK
    Number of Factors
        Number of factors to extract:     select 3
        OK

Rcmdr: output

Call:
factanal(x = ~ACTIVE + ART + ATTEN + COMP + COORD + DRAW + LEXP + MAT +
    MOTSK + NEWSIT + SAINT + SENCOM + SINT + TEMP + UNDER + VOCAB + WRIT,
    factors = 3, data = Dataset, scores = "none", rotation = "varimax")

Uniquenesses:
ACTIVE    ART  ATTEN   COMP  COORD   DRAW   LEXP    MAT  MOTSK
 0.264  0.213  0.659  0.309  0.196  0.288  0.239  0.715  0.634
NEWSIT  SAINT SENCOM   SINT   TEMP  UNDER  VOCAB   WRIT
 0.490  0.188  0.155  0.289  0.754  0.314  0.147  0.291

Loadings:
        Factor1 Factor2 Factor3
ACTIVE  0.211   0.184   0.811
ART     0.863   0.136   0.155
ATTEN   0.254   0.258   0.458
COMP    0.667   0.411   0.278
COORD   0.224   0.827   0.262
DRAW    0.223   0.775   0.248
LEXP    0.809   0.232   0.228
MAT     0.208   0.456   0.184
MOTSK   0.229   0.539   0.153
NEWSIT  0.416   0.339   0.471
SAINT   0.256   0.171   0.847
SENCOM  0.865   0.263   0.165
SINT    0.188   0.210   0.795
TEMP            0.211   0.446
UNDER   0.708   0.327   0.280
VOCAB   0.875   0.211   0.205
WRIT    0.191   0.786   0.233

               Factor1 Factor2 Factor3
SS loadings      4.479   3.196   3.179
Proportion Var   0.263   0.188   0.187
Cumulative Var   0.263   0.451   0.638

Test of the hypothesis that 3 factors are sufficient.
The chi square statistic is 300.53 on 88 degrees of freedom.
The p-value is 5.11e-25

There are a number of things to note from the factor analysis output shown above:

The uniqueness gives an indication of the uniqueness of each variable (the variation in the variable that cannot be attributed to any factor): the U(test score) element in the equation discussed earlier,

    Test Score = a(IQ) + b(Experience) + c(Writing ability) + U(test score)

From these scores, we can see that some variables have very small values (e.g., SENCOM, VOCAB and COORD), indicating that the factors represent the variation in these variables well, whereas other variables have much higher values (e.g., TEMP and MAT), indicating that a smaller amount of the variation in these variables is represented by the factors.

From the output, we can also see that the 3 factors account for 63.8% of the variance in the data (shown in the Cumulative Var row). The proportion of variance accounted for by each factor is shown in the Proportion Var row; in this case factor 1 accounts for 26.3%. You may note that these statistics are slightly different from those obtained for the PCA model above, as the default method for FA in R is factanal, which performs a maximum likelihood factor analysis rather than principal components. The interpretation and basic theory, however, remain unchanged.

The test of the hypothesis that 3 factors are sufficient gives a highly significant chi-square value. This indicates that 3 factors are not sufficient to represent the data; on the basis of this evidence we would certainly wish to look at solutions with more than 3 factors.

From the loadings, we can see which variables load on which factor. This information enables us to define the factors and give them labels. We can see that the variables now load highly on particular factors. Using the components (i.e., before any rotation was applied), the variables all loaded onto PC1; once a rotation has been applied, the variables load on different factors. Table 3 shows these loadings arranged in order.

Table 3: Factor loadings arranged by factor

    Factor 1
    VOCAB    .875   Vocabulary
    SENCOM   .865   Sentence completion
    ART      .863   Articulation
    LEXP     .809   Expressive language
    UNDER    .708   Understanding of language
    COMP     .667   Comprehension

    Factor 2
    SAINT    .847   Social interaction 1
    ACTIVE   .811   How active
    SINT     .795   Social interaction 2

    Factor 3
    COORD    .827   Coordination
    WRIT     .786   Writing
    DRAW     .775   Drawing
    MOTSK    .539   Motor skills
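The significant chi-square invites fitting models with more factors. A minimal sketch, not part of the original notes, comparing the fit of 3- to 6-factor solutions:

    # Refit the maximum likelihood FA with increasing numbers of factors
    # and report the p-value of the "k factors are sufficient" test.
    f <- ~ ACTIVE + ART + ATTEN + COMP + COORD + DRAW + LEXP + MAT + MOTSK +
           NEWSIT + SAINT + SENCOM + SINT + TEMP + UNDER + VOCAB + WRIT
    for (k in 3:6) {
      fit <- factanal(f, factors = k, data = Dataset, rotation = "varimax")
      cat(k, "factors: p =", format.pval(fit$PVAL), "\n")
    }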

Interpreting Factors

The rotated factor matrix provides a much clearer interpretation of the results, as can be seen in Table 3. The unrotated component matrix did not enable an easy interpretation of the components; once a rotation has been applied, the interpretation of the factors becomes clearer. Factor 1 relates to linguistic skills, Factor 2 to social skills and Factor 3 to practical skills. The data set can now be described in terms of three underlying factors instead of 17 variables. The loadings for the factors can be saved as new variables and entered into other analyses such as regression.

A graphical illustration of factor analysis

Using similar data to that used above, three factors were extracted but could not be easily interpreted, as Table 4 shows [1]:

Table 4: Factor Loadings for the 3-Factor Model

    Variable              Factor 1  Factor 2  Factor 3
    Articulation            0.641     0.379     0.432
    Attention               0.615     0.451     0.520
    Comprehension           0.766     0.253     0.371
    Coordination            0.824     0.391     0.123
    Drawing                 0.815     0.384     0.115
    Memory                  0.769     0.188     0.340
    Motor Skill             0.673     0.431     0.076
    Sentence Completion     0.741     0.194     0.181
    Temperament             0.540     0.444     0.619
    Writing                 0.796     0.399     0.071

The factors can be interpreted by identifying the variables they are highly related to. For example, if factor 1 were strongly related to the variables motor skill, drawing and coordination, it could be interpreted as representing physical dexterity. Such a clear-cut identification of the three factors in Table 4 is not possible, as they are related to many variables. The difficulty of providing interpretations for the factors in such circumstances is demonstrated graphically in Figure 4, where the first two factors from the 3-factor model identified above are shown in a simple two-dimensional scatter plot [2]. This graph suggests that there are two distinct factors in the data, represented clearly as two clusters of points. The factors are represented as the axes of the graph, and points falling on or close to an axis indicate a strong relationship with that factor. We can see from the graph that the variables fall midway between the axes and are therefore related to both factors. The presence of the two factors shown in Figure 4 is not obvious from the factor loadings of the initial factors (see Table 4), as the variables are not exclusively related to one factor or the other. Attempting to interpret the factor loadings obtained in a principal components analysis directly is therefore not an ideal method for identifying distinct factors.

It can be seen from Figure 5 that the rotated axes in the graph pass much closer to the clusters of points than do the principal component axes. The precise degree of rotation is determined using one of a number of algorithms available in most common statistical software packages. Popular methods include minimising the number of variables which have high loadings, which enhances the interpretability of the factors, and minimising the number of factors, which provides simpler interpretations of the variables (refer to Kim and Mueller, 1994, for a discussion of rotation techniques).

[1] See Hutcheson and Sofroniou, 1999, for all references.
[2] In order to show this information in two dimensions, only the first two factors and those variables which form part of a two-factor solution are shown. The variables attention and temperament, which form a third factor, are omitted from this demonstration.

Figure 4: A graphical representation of the principal components

Figure 5: A graphical representation of factor rotation

Although there are many rotation techniques available, in practice the different techniques tend to produce similar results when there is a large sample and the factors are relatively well defined (Fava and Velicer, 1992). The factor loadings for two rotation techniques, orthogonal and oblique, are shown in Table 5. In this example, the factors can be interpreted directly from the rotated factor loadings, as variables tend to load highly on only one factor. For example, the variable attention, which correlated with all three of the principal components more or less evenly, correlates highly with only a single rotated factor (factor 3). A similar pattern can be seen for the other variables in Table 5.

Table 5: A Component Matrix showing Orthogonal and Oblique Rotation of Factors

                    Initial Components    Orthogonal Factors    Oblique Factors
    Variable          1     2     3         1     2     3         1     2     3
    Articulation    .641  .379  .432      .843  .089  .149      .936  .154  .011
    Attention       .615  .451  .520      .231  .164  .879      .073  .002  .890
    Comprehension   .766  .253  .371      .826  .274  .178      .857  .062  .007
    Coordination    .824  .391  .123      .259  .855  .221      .018  .890  .052
    Drawing         .815  .384  .115      .261  .843  .215      .025  .876  .047
    Memory          .769  .188  .340      .779  .327  .167      .789  .138  .019
    Motor Skills    .673  .431  .076      .179  .776  .102      .032  .836  .050
    Sent. Comp.     .741  .194  .181      .660  .328  .276      .631  .158  .123
    Temperament     .540  .444  .619      .117  .134  .917      .058  .007  .959
    Writing         .796  .399  .071      .273  .834  .167      .049  .869  .006
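A final note on what rotation does and does not change: an orthogonal rotation redistributes the loadings, but it leaves each variable's communality (the sum of its squared loadings) unchanged. A minimal sketch, not part of the original notes, checking this for the Articulation row of Table 5:

    # Communality before and after orthogonal rotation.
    initial    <- c(.641, .379, .432)
    orthogonal <- c(.843, .089, .149)
    sum(initial^2)      # about 0.74
    sum(orthogonal^2)   # about 0.74 as well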