PCOMP

The PCOMP function computes the principal components of an m-column, n-row array, where m is the number of variables and n is the number of observations or samples. The principal components of a multivariate data set may be used to restate the data in terms of derived variables, or to reduce the dimensionality of the data by reducing the number of variables (columns).

This routine is written in the IDL language. Its source code can be found in the file pcomp.pro in the lib subdirectory of the IDL distribution.

Syntax

Result = PCOMP( A [, COEFFICIENTS=variable] [, /COVARIANCE] [, /DOUBLE] [, EIGENVALUES=variable] [, NVARIABLES=value] [, /STANDARDIZE] [, VARIANCES=variable] )

Return Value

The result is an nvariables-column (nvariables ≤ m), n-row array of derived variables.

Arguments

A

An m-column, n-row, single- or double-precision floating-point array.

Keywords

COEFFICIENTS

Use this keyword to specify a named variable that will contain the principal components used to compute the derived variables. The principal components are the coefficients of the derived variables and are returned in an m-column, m-row array. The rows of this array correspond to the coefficients of the derived variables. The coefficients are scaled so that the sums of their squares are equal to the eigenvalue from which they are computed.

COVARIANCE

Set this keyword to compute the principal components using the covariances of the original data. The default is to use the correlations of the original data to compute the principal components.

DOUBLE

Set this keyword to use double-precision for computations and to return a double-precision result. Set DOUBLE=0 to use single-precision for computations and to return a single-precision result. The default is /DOUBLE if A is double precision; otherwise the default is DOUBLE=0.

EIGENVALUES

Use this keyword to specify a named variable that will contain a one-column, m-row array of eigenvalues that correspond to the principal components. The eigenvalues are listed in descending order.

NVARIABLES

Use this keyword to specify the number of derived variables. A value of zero, negative values, and values in excess of the input array's column dimension all result in a complete set (m columns, n rows) of derived variables.

STANDARDIZE

Set this keyword to convert the variables (the columns) of the input array to standardized variables (variables with a mean of zero and a variance of one).

VARIANCES

Use this keyword to specify a named variable that will contain a one-column, m-row array of variances. The variances correspond to the percentage of the total variance for each derived variable.

Examples

PRO ex_pcomp
   ; Define an array with 4 variables and 20 observations.
   array = [[19.5, 43.1, 29.1, 11.9], $
            [24.7, 49.8, 28.2, 22.8], $
            [30.7, 51.9, 37.0, 18.7], $
            [29.8, 54.3, 31.1, 20.1], $
            [19.1, 42.2, 30.9, 12.9], $
            [25.6, 53.9, 23.7, 21.7], $
            [31.4, 58.5, 27.6, 27.1], $
            [27.9, 52.1, 30.6, 25.4], $
            [22.1, 49.9, 23.2, 21.3], $
            [25.5, 53.5, 24.8, 19.3], $
            [31.1, 56.6, 30.0, 25.4], $
            [30.4, 56.7, 28.3, 27.2], $
            [18.7, 46.5, 23.0, 11.7], $
            [19.7, 44.2, 28.6, 17.8], $
            [14.6, 42.7, 21.3, 12.8], $
            [29.5, 54.4, 30.1, 23.9], $
            [27.7, 55.3, 25.7, 22.6], $
            [30.2, 58.6, 24.6, 25.4], $
            [22.7, 48.2, 27.1, 14.8], $
            [25.2, 51.0, 27.5, 21.1]]

   ; Remove the mean from each variable.
   m = 4    ; number of variables
   n = 20   ; number of observations
   means = TOTAL(array, 2)/n
   array = array - REBIN(means, m, n)

   ; Compute derived variables based upon the principal components.
   result = PCOMP(array, COEFFICIENTS=coefficients, $
      EIGENVALUES=eigenvalues, VARIANCES=variances, /COVARIANCE)

   PRINT, 'Result: '
   PRINT, result, FORMAT='(4(F8.2))'

   PRINT, 'Coefficients: '
   FOR mode=0,3 DO PRINT, $
      mode+1, coefficients[*,mode], $
      FORMAT='("Mode#",I1,4(F10.4))'

   ; Rescale the coefficients to obtain the eigenvectors.
   eigenvectors = coefficients/REBIN(eigenvalues, m, m)
   PRINT, 'Eigenvectors: '
   FOR mode=0,3 DO PRINT, $
      mode+1, eigenvectors[*,mode], $
      FORMAT='("Mode#",I1,4(F10.4))'

   ; Reconstruct the original data from the derived variables.
   array_reconstruct = result ## eigenvectors
   PRINT, 'Reconstruction error: ', $
      TOTAL((array_reconstruct - array)^2)
   PRINT, 'Energy conservation: ', TOTAL(array^2), $
      TOTAL(eigenvalues)*(n-1)

   PRINT, ' Mode   Eigenvalue   PercentVariance'
   FOR mode=0,3 DO PRINT, $
      mode+1, eigenvalues[mode], variances[mode]*100
END

When the above program is compiled and executed, it prints the 4-column array of derived variables, the coefficients and eigenvectors of each of the four modes, a reconstruction error on the order of 1e-10, the energy-conservation check comparing TOTAL(array^2) with TOTAL(eigenvalues)*(n-1), and a table listing each mode's eigenvalue and percentage of the total variance. [The numeric output did not survive extraction.] The first two derived variables account for 96% of the total variance of the original data.

Version History

5.0    Introduced

See Also

CORRELATE, EIGENQL
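Since VARIANCES returns each mode's fraction of the total variance in descending order, a natural follow-up is to keep only as many modes as are needed to reach a target fraction, and to pass that count back through NVARIABLES. The following is a minimal sketch, not part of the original example, reusing its array and variances variables; the 95% threshold is an arbitrary choice for illustration.

; Cumulative fraction of the total variance, mode by mode.
cumvar = TOTAL(variances, /CUMULATIVE)

; Smallest number of modes that reaches 95% of the variance.
k = MIN(WHERE(cumvar GE 0.95)) + 1
PRINT, 'Modes retained: ', k

; Recompute, keeping only those k derived variables.
reduced = PCOMP(array, NVARIABLES=k, /COVARIANCE)
HELP, reduced   ; expect a k-column, 20-row array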

Multivariate Analysis

Samples 0 and 7 contain identical data and are assigned to cluster #1. Samples 1, 2, 5, and 8 contain identical data and are assigned to cluster #3. Samples 3 and 6 contain identical data and are assigned to cluster #0. Sample 4 is unique and is assigned to cluster #2. If this example is run several times, each time computing new cluster weights, it is possible that the cluster number assigned to each grouping of samples may change.

Principal Components Analysis

Principal components analysis is a mathematical technique that describes a multivariate set of data using derived variables. The derived variables are formulated using specific linear combinations of the original variables. The derived variables are uncorrelated and are computed in decreasing order of importance; the first variable accounts for as much as possible of the variation in the original data, the second variable accounts for the second-largest portion of the variation, and so on. Principal components analysis attempts to construct a small set of derived variables that summarize the original data, thereby reducing the dimensionality of the original data.

The principal components of a multivariate set of data are computed from the eigenvalues and eigenvectors of either the sample correlation or the sample covariance matrix. If the variables of the multivariate data are measured in widely differing units (large variations in magnitude), it is usually best to use the sample correlation matrix in computing the principal components; this is the default method used in IDL's PCOMP function. Another alternative is to standardize the variables of the multivariate data prior to computing principal components. Standardizing the variables essentially makes them all equally important by creating new variables that each have a mean of zero and a variance of one. Proceeding in this way allows the principal components to be computed from the sample covariance matrix. IDL's PCOMP function includes COVARIANCE and STANDARDIZE keywords to provide this functionality; a sketch contrasting these choices appears at the end of this chapter.

For example, suppose that we wish to restate the following data using its principal components. There are three variables, each consisting of five samples.

Table 7-1: Data for Principal Component Analysis [three variables, Var 1 through Var 3, by five samples; the numeric values did not survive extraction]

We compute the principal components (the coefficients of the derived variables) to 2-decimal accuracy and store them by row in a 3 x 3 array [the numeric coefficients did not survive extraction]. The derived variables {z1, z2, z3} are then computed as linear combinations of the original variables, as shown below.
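Because the coefficient values are not reproduced here, only the general form can be shown. Writing $c_{ij}$ for the entry in row $i$, column $j$ of the coefficient array (placeholder symbols, not values from the source), each derived variable is the linear combination

$$ z_i = c_{i1} x_1 + c_{i2} x_2 + c_{i3} x_3, \qquad i = 1, 2, 3, $$

where $x_1$, $x_2$, $x_3$ are the original variables Var 1 through Var 3.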

In this example, analysis shows that the derived variable z1 accounts for 57.3% of the total variance of the original data, the derived variable z2 accounts for 28.2%, and the derived variable z3 accounts for 14.5%.

Example of Derived Variables from Principal Components

The following example constructs an appropriate set of derived variables, based upon the principal components of the original data, which may be used to reduce the dimensionality of the data. The data consist of four variables, each containing twenty samples.

; Define an array with 4 variables and 20 samples:
data = [[19.5, 43.1, 29.1, 11.9], $
        [24.7, 49.8, 28.2, 22.8], $
        [30.7, 51.9, 37.0, 18.7], $
        [29.8, 54.3, 31.1, 20.1], $
        [19.1, 42.2, 30.9, 12.9], $
        [25.6, 53.9, 23.7, 21.7], $
        [31.4, 58.5, 27.6, 27.1], $
        [27.9, 52.1, 30.6, 25.4], $
        [22.1, 49.9, 23.2, 21.3], $
        [25.5, 53.5, 24.8, 19.3], $
        [31.1, 56.6, 30.0, 25.4], $
        [30.4, 56.7, 28.3, 27.2], $
        [18.7, 46.5, 23.0, 11.7], $
        [19.7, 44.2, 28.6, 17.8], $
        [14.6, 42.7, 21.3, 12.8], $
        [29.5, 54.4, 30.1, 23.9], $
        [27.7, 55.3, 25.7, 22.6], $
        [30.2, 58.6, 24.6, 25.4], $
        [22.7, 48.2, 27.1, 14.8], $
        [25.2, 51.0, 27.5, 21.1]]

The variables that will contain the values returned by the COEFFICIENTS, EIGENVALUES, and VARIANCES keywords to the PCOMP routine must be initialized as nonzero values prior to calling PCOMP.

coef = 1 & eval = 1 & var = 1

; Compute the derived variables based upon
; the principal components.
result = PCOMP(data, COEFFICIENTS = coef, $
   EIGENVALUES = eval, VARIANCES = var)

; Display the array of derived variables:
PRINT, result, FORMAT = '(4(f5.1, 2x))'

IDL prints the 4-column array of derived variables [the numeric output did not survive extraction].

; Display the percentage of total variance
; for each derived variable:
PRINT, var

; Display the percentage of variance for the first two derived
; variables, i.e. the first two columns of the resulting array above:
PRINT, TOTAL(var[0:1])

This indicates that the first two derived variables (the first two columns of the resulting array) account for 96.3% of the total variance of the original data, and thus could be used to summarize the original data.

Routines for Multivariate Analysis

See Multivariate Analysis (in the functional category "Mathematics" (IDL Reference Guide)) for a brief description of IDL routines for multivariate analysis. Detailed information is available in the IDL Reference Guide.
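To make this chapter's correlation/covariance/standardization discussion concrete, here is a minimal sketch, not from the original text, of the three ways of choosing the underlying matrix with PCOMP, reusing the data array defined above. The comments restate the chapter's guidance rather than verified output.

; Default: principal components from the sample correlation matrix,
; the usual choice when the variables are measured in widely
; differing units.
r_corr = PCOMP(data, VARIANCES=v_corr)

; Covariances of the raw data: reasonable when the variables are
; already on comparable scales.
r_cov = PCOMP(data, /COVARIANCE, VARIANCES=v_cov)

; Standardize first (mean of zero, variance of one per column),
; then use covariances, as described in the text.
r_std = PCOMP(data, /COVARIANCE, /STANDARDIZE, VARIANCES=v_std)

; Compare how the variance is distributed across the modes
; under each choice.
PRINT, v_corr & PRINT, v_cov & PRINT, v_std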
