Discriminant Analysis-Based Dimension Reduction for Hyperspectral Image Classification


A survey of the most recent advances and an experimental comparison of different techniques

WEI LI, FUBIAO FENG, HENGCHAO LI, AND QIAN DU

Date of publication: 30 March 2018

Hyperspectral imagery contains hundreds of contiguous bands with a wealth of spectral signatures, making it possible to distinguish materials through subtle spectral discrepancies. Because these spectral bands are highly correlated, dimensionality reduction, as the name suggests, seeks to reduce data dimensionality without losing desirable information. This article reviews discriminant analysis-based dimensionality-reduction approaches for hyperspectral imagery, including typical linear discriminant analysis (LDA), state-of-the-art sparse graph-based discriminant analysis (SGDA), and their extensions. We categorize the related techniques into the following areas:
1) linear subspace learning, which finds an explicit linear projective mapping
2) locality-preserving dimensionality reduction, which exploits local relationships among neighbors in a feature space

3) graph-embedding discriminant analysis, which preserves similarities of pixel pairs and characterizes data geometry properties
4) semisupervised discriminant analysis (SDA)
5) discriminant analysis in a kernel space.

Experimental results using real hyperspectral data show the comparative performance of each approach. We also discuss some open issues and ongoing investigations in this field. We hope our survey will offer insights for further research.

PROGRESS TO DATE

Hyperspectral imagery, capturing reflectance values over a wide range of the electromagnetic spectrum, provides rich spectral information that can help discriminate among spectrally similar object pixels. However, the curse of dimensionality [1]-[5], caused by a great number of dimensions with only a few samples, makes hyperspectral image classification challenging. Furthermore, adjacent bands may be heavily redundant. Figure 1 illustrates the absolute values of the cross-band correlation coefficients of the popular Airborne Visible/Infrared Imaging Spectrometer (AVIRIS) Indian Pines data. Table 1 further demonstrates that more than 34% of the correlation coefficients are greater than 0.9, which indicates that much redundant information exists among the spectral bands. Thus, dimensionality reduction is a critical preprocessing step for hyperspectral image analysis.

FIGURE 1. The absolute spectral correlation coefficients of the Indian Pines data.

TABLE 1. THE PROPORTION OF SPECTRAL-BAND CORRELATION FOR THE INDIAN PINES DATA (correlation-coefficient levels from 0 to 1.0 in intervals of 0.1 and the proportion, in %, of band pairs falling in each level).

In past decades, dimensionality-reduction methods have leapt forward, coming to play a vital role in the analysis of high-dimensional data sets [6]-[11]. There are two main categories of dimensionality reduction in hyperspectral imagery, i.e., band selection and feature extraction. Band selection obtains an appropriate subset of the original bands by minimizing spectral redundancy [12]-[17], while feature extraction preserves important features through mathematical transformations. Typical feature-extraction methods, such as principal component analysis (PCA) [18]-[22] and Fisher's LDA (FLDA) [23]-[27], project the original data into a low-dimensional subspace. We can view PCA and LDA, which are subspace learning methods, as the most popular dimensionality-reduction techniques for hyperspectral image classification. PCA, an unsupervised approach, aims to find projections by maximizing the data variance in the projected subspace [28]. This method is suboptimal for classification tasks because it may abandon some distinctive information. Several PCA extensions have been proposed, such as robust PCA (RPCA) [29], probabilistic PCA [30], sparse PCA [31], structured sparse PCA [32], and mixture-of-Gaussians RPCA [33]. Comparatively, supervised LDA [34] seeks a transform that maximizes Fisher's ratio, a Rayleigh quotient, in the projected subspace to enhance class separability. However, under small-sample-size (SSS) situations, LDA may fail because of ill-conditioned statistical estimates. Thus, researchers have presented numerous LDA extensions, such as subspace LDA (SLDA) [35], modified FLDA (MFLDA) [24], regularized LDA (RLDA) [25], normalized discriminant analysis [36], and noise-adjusted SLDA (NA-SLDA) [37]. This review focuses on supervised LDA-related research.
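To make the band-redundancy statistics behind Figure 1 and Table 1 concrete, the short sketch below computes the absolute cross-band correlation matrix of a hyperspectral cube and the fraction of band pairs whose coefficient exceeds 0.9. It is an illustrative sketch only; the `cube` array (rows x columns x bands) and the synthetic data are hypothetical stand-ins, not the article's actual preprocessing code.

```python
import numpy as np

def band_correlation_stats(cube, threshold=0.9):
    """Absolute cross-band correlation for a hyperspectral cube of shape (rows, cols, bands)."""
    n_bands = cube.shape[-1]
    # Treat each band as one variable observed at every pixel.
    pixels = cube.reshape(-1, n_bands).astype(np.float64)
    corr = np.abs(np.corrcoef(pixels, rowvar=False))      # n_bands x n_bands
    off_diag = corr[~np.eye(n_bands, dtype=bool)]         # ignore the trivial diagonal
    return corr, float(np.mean(off_diag > threshold))

# Synthetic stand-in for an AVIRIS-sized cube (145 x 145 pixels, 220 bands).
rng = np.random.default_rng(0)
cube = rng.normal(size=(145, 145, 220))
corr, frac = band_correlation_stats(cube)
print(f"{100 * frac:.1f}% of band pairs have |correlation| > 0.9")
```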
Traditional subspace learning methods assume that the class-conditional distributions are Gaussian [38], [39]. However, hyperspectral data can be affected by illumination conditions [40], atmospheric effects, and geometric distortion [41]. Thus, real class-conditional distributions usually possess a complicated multimodal structure. As a consequence, traditional LDA is most likely unable to capture the data manifold structure reflecting the geometric information (e.g., multiple clusters and subspace structure). Manifold learning techniques [42]-[47], which depend significantly on the data characteristics, are supposed to remedy this defect. Locality-preserving projections (LPPs) and supervised LPP [48], originally developed for facial recognition, can be counted as linear manifold learning methods that maintain the local geometric structure of neighboring samples in the original data space. The work in [49]-[53] also developed

the extensions of these methods. The authors of [54] and [55] employed local Fisher's discriminant analysis (LFDA) to reduce the dimensionality of hyperspectral data. Unlike LDA, LFDA was designed to handle multimodal non-Gaussian class distributions, so the local structure can be preserved in a low-dimensional subspace. Subsequently, researchers proposed additional manifold learning methods for hyperspectral image analysis, such as semisupervised LDA [56], local patch discriminative metric learning [57], and modified LPP [58]. One study [59] applied a modified stochastic neighbor embedding for multifeature dimension reduction.

If there are insufficient training samples, the covariance matrix of each class may not be accurately estimated. In this case, the generalization capability on the testing samples cannot be guaranteed. A possible solution to deal with too few training (labeled) samples could be learning on both labeled and unlabeled data. To use unlabeled samples, Cai et al. developed SDA [60] in addition to the aforementioned methods.

A recent work developed a graph-embedding framework [61] for dimensionality reduction. Here, the authors constructed an undirected weighted graph to exploit the geometrical characteristics of the input data, and they estimated a projection matrix by solving a generalized eigenvalue problem. The most important step of graph-embedding-based dimensionality reduction is to choose an intrinsic graph. The study in [62] proposed SGDA; the research team built the graph with a sparse representation solved by $\ell_1$-minimization. The same investigators delved into collaborative graph-based discriminant analysis (CGDA) [63] by replacing the $\ell_1$-minimization with $\ell_2$-minimization. For the graph construction, all of the samples collaborated, as it were, on the representation of a single one, and each sample had an equal opportunity to participate. Other researchers developed several extensions of graph-embedding-based dimensionality reduction, such as simultaneous sparse graph embedding [64], semisupervised sparse manifold discriminative analysis [65], weighted SGDA (WSGDA) [66], sparse and low-rank graph for discriminant analysis (SLGDA) [67], tensor sparse and low-rank graph-based discriminant analysis [68], Laplacian regularized collaborative graph for discriminant analysis (LapCGDA) [69], and graph-based discriminant analysis with spectral similarity [70].

Most of these techniques are linear. When data distributions are so complex that the resulting decision boundaries are highly nonlinear, we require kernel methods to solve the issue [71]-[74]. The essence here is to map the input data into a much higher-dimensional feature space, where the complex nonlinear decision boundary becomes linear in the kernel-induced space. Kernel-projection techniques have been studied over the past decades. The works in [75]-[78] and [79]-[81], for instance, presented kernel discriminant analysis (KDA) and kernel PCA (KPCA), respectively. The authors of [82] demonstrated that kernel nonparametric weighted feature extraction outperforms KDA and KPCA. Li et al. [83], [84] employed kernel LFDA (KLFDA) to preserve the local structure in a kernel-induced space. In addition, investigators have proposed other kernel-based methods for dimensionality reduction, such as composite kernel discriminant analysis [85], kernel CGDA (KCGDA) [69], locality-preserving composite kernel (LPCK) [86], and discriminative multiple kernel learning (DMKL) [87].
DISCRIMINANT ANALYSIS-BASED DIMENSION REDUCTION

Consider a data set with training samples $X = \{x_i\}_{i=1}^{n}$ in $\mathbb{R}^d$ (d-dimensional space) and class labels $\mathrm{Label}_i \in \{1, 2, \ldots, C\}$, where C is the number of classes and n is the total number of training samples. Let $n_j$ be the number of available training samples for the jth class, so that $\sum_{j=1}^{C} n_j = n$. We obtain a supervised transform-based dimensionality-reduction matrix $P \in \mathbb{R}^{d \times d'}$ ($d'$ is the dimensionality of the reduced-dimensional subspace) by minimizing an objective function. In general, a low-dimensional subspace Y can be obtained as

$$Y = P^{\top} X, \quad (1)$$

where each sample $x_i$ is represented by $y_i = P^{\top} x_i$, with the reduced dimensionality $d'$.

LINEAR DISCRIMINANT ANALYSIS AND REGULARIZED LINEAR DISCRIMINANT ANALYSIS

The traditional FLDA [34] seeks a linear transform such that the within-class scatter is minimized and the between-class scatter is maximized. The objective is to find a projection that maximizes Fisher's ratio, i.e., the ratio of the determinant of $S^{(b)}$ to the determinant of $S^{(w)}$, by solving the following optimization problem:

$$P_{\mathrm{LDA}} = \arg\max_{P_{\mathrm{LDA}}} \operatorname{tr}\!\left[\left(P_{\mathrm{LDA}}^{\top} S^{(w)} P_{\mathrm{LDA}}\right)^{-1} P_{\mathrm{LDA}}^{\top} S^{(b)} P_{\mathrm{LDA}}\right], \quad (2)$$

where $S^{(b)}$ is the between-class scatter matrix and $S^{(w)}$ is the within-class scatter matrix. Note that the total scatter matrix $S^{(t)}$ is the sum of $S^{(b)}$ and $S^{(w)}$. This equation can be expressed equivalently as

$$P_{\mathrm{LDA}} = \arg\max_{P_{\mathrm{LDA}}} \operatorname{tr}\!\left[\left(P_{\mathrm{LDA}}^{\top} S^{(t)} P_{\mathrm{LDA}}\right)^{-1} P_{\mathrm{LDA}}^{\top} S^{(b)} P_{\mathrm{LDA}}\right]. \quad (3)$$

However, the SSS problem often occurs because of high data dimensionality and limited labeled samples. Thus, $S^{(w)}$ and $S^{(t)}$ may be singular, which results in instability and overfitting issues [88]. RLDA [25] modifies LDA by adding a regularization term, i.e., $\tilde{S}^{(t)} = S^{(t)} + \lambda I$, where $\lambda$ is a regularization parameter and $I$ is an identity matrix. Because $S^{(t)}$ is positive semidefinite, $\tilde{S}^{(t)}$ is positive definite and therefore nonsingular. Thus, the transformation of RLDA is calculated by solving the following optimization problem:

$$P_{\mathrm{RLDA}} = \arg\max_{P_{\mathrm{RLDA}}} \operatorname{tr}\!\left[\left(P_{\mathrm{RLDA}}^{\top} \tilde{S}^{(t)} P_{\mathrm{RLDA}}\right)^{-1} P_{\mathrm{RLDA}}^{\top} S^{(b)} P_{\mathrm{RLDA}}\right]. \quad (4)$$
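As a minimal sketch of (2)-(4), the code below accumulates the between-, within-, and total-scatter matrices from labeled samples and solves the generalized eigenvalue problem with SciPy; setting the regularizer `lam` (corresponding to $\lambda$ in (4)) to a positive value gives the RLDA variant. The function and argument names are ours for illustration and are not the authors' implementation.

```python
import numpy as np
from scipy.linalg import eigh

def lda_projection(X, labels, d_reduced, lam=0.0):
    """X: d x n data matrix; labels: length-n class labels.
    Returns a d x d_reduced projection maximizing tr[(P^T S_t P)^-1 P^T S_b P];
    lam > 0 regularizes the total scatter as in RLDA."""
    d, _ = X.shape
    mean_all = X.mean(axis=1, keepdims=True)
    S_b = np.zeros((d, d))
    S_w = np.zeros((d, d))
    for c in np.unique(labels):
        Xc = X[:, labels == c]
        mean_c = Xc.mean(axis=1, keepdims=True)
        S_b += Xc.shape[1] * (mean_c - mean_all) @ (mean_c - mean_all).T
        S_w += (Xc - mean_c) @ (Xc - mean_c).T
    S_t = S_b + S_w + lam * np.eye(d)   # regularized total scatter (RLDA when lam > 0)
    # Generalized eigenproblem S_b p = eig * S_t p; keep the leading eigenvectors.
    eigval, eigvec = eigh(S_b, S_t)
    order = np.argsort(eigval)[::-1]
    return eigvec[:, order[:d_reduced]]

# The reduced features follow (1): Y = P.T @ X.
```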

IMPROVEMENTS OF LINEAR DISCRIMINANT ANALYSIS

Combining LDA and PCA is a popular approach in which PCA-based dimensionality reduction is employed as preprocessing for LDA; i.e., LDA is implemented in a PCA subspace (denoted as subspace LDA [35]). In doing so, the PCA discards the null subspace of rank-deficient scatter matrices so that LDA can perform under better conditions. The researchers in [37] used a signal-to-noise-ratio metric to replace the original maximum-variance criterion in the PCA. As a result, [89] proposed NA-SLDA for noise-robust dimensionality reduction of hyperspectral data. In [90], a linear-constrained, distance-based discriminant analysis imposed the constraint that all class centers must be aligned in predetermined directions. Yuan et al. [91] investigated a spectral-spatial LDA by employing a local scatter matrix from a small neighborhood as a regularizer in the original objective function of LDA. The work in [92]-[94] developed sparse LDA and sparse tensor discriminant analysis to better preserve useful structural information in the input data. The authors of [95] presented a modified sparse discriminant analysis based on $\ell_{2,1}$-norm regularization for jointly sparse feature extraction. Furthermore, the study in [88] proposed nonparametric discriminant analysis (NDA) to address the fundamental limitation originating from the parametric nature of scatter matrices. After that, other studies developed several NDA extensions [96], [97] from a multiview perspective. To avoid the requirement of sufficient training samples and complete class knowledge, Du proposed MFLDA [24], which needs only the desired class signatures. Two other investigators [9] transformed the feature vector of each hyperspectral pixel into a feature matrix and presented two-dimensional (2-D) LDA for dimensionality reduction. Then, Han and Clemmensen [98] developed regularized generalized eigendecomposition to solve generalized eigenvalue problems and obtain sparse solutions. The research in [99] proposed an interesting angular discriminant analysis, which attempts to find a subspace that separates classes in an angular sense.

LOCALITY-PRESERVING DIMENSIONALITY REDUCTION

Traditional linear subspace learning methods provide satisfactory performance only if the class-conditional distributions are homoscedastic Gaussian. To accommodate practical hyperspectral data, dimensionality-reduction methods need to handle multimodal distributions and accurately capture the statistics in a reduced-dimensional subspace. Consequently, investigators have developed locality-preserving dimensionality-reduction methods, such as LFDA [54].

LOCAL FISHER'S DISCRIMINANT ANALYSIS

LFDA can be viewed as an extension of LDA and LPP [100]. By invoking the concept exploited in LPP, LFDA obtains between-class separation in the projection while simultaneously preserving the within-class local structure.
Define $A_{ij} \in [0, 1]$ as the affinity between $x_i$ and $x_j$,

$$A_{ij} = \exp\!\left(-\frac{\|x_i - x_j\|^2}{\sigma_i \sigma_j}\right), \quad (5)$$

where $\sigma_i = \|x_i - x_i^{(knn)}\|$ represents the local scaling of data samples in the neighborhood of $x_i$, and $x_i^{(knn)}$ is the knn-th nearest neighbor of $x_i$. $A$ is then an affinity matrix of size $n \times n$ that measures the distance relationship among samples. In LFDA, the local between-class scatter matrix $S^{(lb)}$ and within-class scatter matrix $S^{(lw)}$ are further defined as

$$S^{(lb)} = \frac{1}{2} \sum_{i,j=1}^{n} W_{ij}^{(lb)} (x_i - x_j)(x_i - x_j)^{\top}, \quad (6)$$

$$S^{(lw)} = \frac{1}{2} \sum_{i,j=1}^{n} W_{ij}^{(lw)} (x_i - x_j)(x_i - x_j)^{\top}, \quad (7)$$

where $W^{(lb)}$ and $W^{(lw)}$ are $n \times n$ matrices expressed as

$$W_{ij}^{(lb)} = \begin{cases} A_{ij}\,(1/n - 1/n_l), & \text{if } y_i = y_j = l, \\ 1/n, & \text{if } y_i \neq y_j, \end{cases} \quad (8)$$

$$W_{ij}^{(lw)} = \begin{cases} A_{ij}/n_l, & \text{if } y_i = y_j = l, \\ 0, & \text{if } y_i \neq y_j, \end{cases} \quad (9)$$

where $y_i$ is the label of $x_i$. Maximizing Fisher's ratio as defined using the local scatter matrices, the objective function can be written as

$$P_{\mathrm{LFDA}} = \arg\max_{P_{\mathrm{LFDA}}} \operatorname{tr}\!\left[\left(P_{\mathrm{LFDA}}^{\top} S^{(lw)} P_{\mathrm{LFDA}}\right)^{-1} P_{\mathrm{LFDA}}^{\top} S^{(lb)} P_{\mathrm{LFDA}}\right], \quad (10)$$

which is equivalent to solving $S^{(lb)} P_{\mathrm{LFDA}} = \Lambda S^{(lw)} P_{\mathrm{LFDA}}$, where $\Lambda$ is the diagonal eigenvalue matrix and $P_{\mathrm{LFDA}} \in \mathbb{R}^{d \times d'}$ is the transformation matrix. Because of the local constraint, the global between- and within-class scatter matrices in the original expression for Fisher's ratio are replaced by their local versions, defined in (6) and (7). Thus, the method can be viewed as a localized variant of LDA, which preserves neighborhood relationships and forces neighboring points in the input space to remain close in the projected subspace.
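The following sketch implements (5)-(10) directly: kNN-based local scaling, the affinity matrix, the local scatter matrices, and the resulting generalized eigenproblem. It is a compact illustration under the notation above (a small ridge term is added for numerical stability), not a drop-in replacement for the LFDA implementations cited in [54] and [55].

```python
import numpy as np
from scipy.linalg import eigh
from scipy.spatial.distance import cdist

def lfda_projection(X, labels, d_reduced, knn=7):
    """Sketch of LFDA per (5)-(10). X: d x n data matrix; labels: length-n class labels."""
    d, n = X.shape
    dist = cdist(X.T, X.T)                              # pairwise Euclidean distances
    sigma = np.sort(dist, axis=1)[:, knn]               # local scaling: distance to knn-th neighbor, (5)
    A = np.exp(-dist**2 / np.outer(sigma, sigma))
    same = labels[:, None] == labels[None, :]
    n_class = np.array([np.sum(labels == labels[i]) for i in range(n)])
    W_lb = np.where(same, A * (1.0 / n - 1.0 / n_class[:, None]), 1.0 / n)   # (8)
    W_lw = np.where(same, A / n_class[:, None], 0.0)                         # (9)

    def local_scatter(W):
        # For symmetric W, 0.5 * sum_ij W_ij (x_i - x_j)(x_i - x_j)^T = X (D - W) X^T.
        D = np.diag(W.sum(axis=1))
        return X @ (D - W) @ X.T

    S_lb, S_lw = local_scatter(W_lb), local_scatter(W_lw)
    eigval, eigvec = eigh(S_lb, S_lw + 1e-6 * np.eye(d))    # (10), with a small ridge
    order = np.argsort(eigval)[::-1]
    return eigvec[:, order[:d_reduced]]
```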

Examples of dimensionality reduction for a 2-D two-class multimodal synthetic data set using LDA and LFDA are illustrated in Figure 2. Figure 2(a) depicts an original two-class multimodal classification problem, along with the projecting directions learned using LDA and LFDA. We can clearly observe that LDA distorts the information contained in the multimodal distributions, because there is a large overlap between the two classes in its projected subspace, while LFDA is able to preserve the multimodal structure of the data in the projected subspace.

FIGURE 2. (a) A synthetic 2-D multimodal data plot and the directions of LDA and LFDA. (b) The data distribution after projection into a one-dimensional subspace: (i) by LDA and (ii) by LFDA.

OTHER ALGORITHMS USING A LOCALITY-PRESERVING STRATEGY

Locality-preserving dimensionality reduction, proven to be effective for hyperspectral data processing, has recently attracted increasing attention involving two different types. The first type uses the locality-preserving strategy with only spectral information. For example, the study in [101] discussed a supervised Laplacian eigenmaps method, which made full use of class-label information to guide the procedure of nonlinear dimensionality reduction by adopting the large-margin concept. The authors of [102] studied a local-scaling-cut criterion that could handle heteroscedastic and multimodal data. Similarly, some researchers designed local discriminant analysis [56], [103] to preserve local neighborhood information while simultaneously maximizing the class discrimination; yet others discussed local tensor discriminant analysis [104]. Later, Yang et al. investigated a semisupervised dual-geometric subspace projection approach [105], where they defined a local consistency-constrained geometric matrix to reveal the geometric structure among the data. The research in [106] developed sparse discriminant embedding (SDE); compared to sparsity-preserving projections (SPPs), SDE maintains the merits of both the intermanifold structure and the sparsity property.

The other type of locality-preserving method uses spectral-spatial information. A genetic algorithm-based LFDA (GA-LFDA) [107] used spectral-spatial information for dimensionality reduction. The authors of [108] further employed GA-LFDA to exploit the high correlation between successive spectral bands. Additionally, researchers developed an improved locally linear embedding method [109] based on robust spatial information; they presented spatial neighbor sorting and filtering to ensure the robustness of the spectral-spatial distance. Zhou et al. [110] proposed a spatial- and spectral-regularized local discriminant embedding method for dimensionality reduction, where they calculated the discriminative projection by minimizing a local spatial-spectral scatter and maximizing a modified total data scatter.

GRAPH-EMBEDDING DISCRIMINANT ANALYSIS

SGDA and CGDA are recently developed dimensionality-reduction techniques [61]-[63] that belong to the graph-embedding framework. An undirected weighted graph, representing the desired statistical or geometrical properties of the data, reflects the similarities of vertex pairs. The

projection matrix can be derived as the eigenvectors corresponding to the smallest nonzero eigenvalues of the graph Laplacian matrix, with certain constraints.

GRAPH-EMBEDDING FRAMEWORK

The general graph-embedding framework [111]-[114] seeks the projection matrix P that preserves the relationship of data points in the original space. The objective function can be mathematically formed as

$$\tilde{P} = \arg\min_{P^{\top} X L_p X^{\top} P = I} \sum_{i \neq j} \left\|P^{\top} x_i - P^{\top} x_j\right\|^2 W_{ij} = \arg\min_{P^{\top} X L_p X^{\top} P = I} \operatorname{tr}\!\left(P^{\top} X L X^{\top} P\right), \quad (11)$$

where $L$ is the Laplacian matrix of graph $G$, $L = D - W$, $W$ is the graph weight matrix, $D$ is a diagonal matrix with the ith diagonal element being $D_{ii} = \sum_{j=1}^{n} W_{ij}$, and $L_p$ may be the Laplacian matrix of the penalty graph $G_p$ or a simple scale-normalization constraint [61]. The optimal projection matrix P can be obtained as

$$\tilde{P} = \arg\min_{P} \frac{P^{\top} X L X^{\top} P}{P^{\top} X L_p X^{\top} P}, \quad (12)$$

which can be solved as a generalized eigenvalue decomposition problem,

$$X L X^{\top} P = \Lambda X L_p X^{\top} P. \quad (13)$$

The $d \times d'$ projection matrix P is constructed from the $d'$ eigenvectors corresponding to the $d'$ smallest nonzero eigenvalues. Note that the performance of graph-embedding-based dimensionality-reduction algorithms mainly depends on the choice of affinity matrix. We summarize the affinities used by several traditional and state-of-the-art dimensionality-reduction methods in Table 2.

TABLE 2. THE AFFINITY FOR SEVERAL DIMENSIONALITY-REDUCTION METHODS UNDER A GRAPH-EMBEDDING FRAMEWORK.
ALGORITHM   AFFINITY                                          CONSTRAINT
PCA         $W_{ij} = 1/n,\; j \neq i$
LDA         $W_{ij}^{(\mathrm{LDA})}$
LFDA        $W_{ij}^{(lb)},\; W_{ij}^{(lw)}$
SGDA        $W = \arg\min_{w_i} \|w_i\|_1$                    $X w_i = x_i$
CGDA        $W = \arg\min_{w_i} \|w_i\|_2$                    $X w_i = x_i$

Meanwhile, under the graph-embedding framework, the affinity matrix in the typical LDA can be represented as

$$W_{ij}^{(\mathrm{LDA})} = \begin{cases} 1/n_j, & \text{if } x_i, x_j \text{ both belong to the } j\text{th class}, \\ 0, & \text{otherwise}, \end{cases} \quad (14)$$

where $n_j$ denotes the number of samples in the jth class. Obviously, LDA can be viewed as a linear extension of the graph-embedding framework with $D^{(\mathrm{LDA})} = I$, where $I$ is the identity matrix.

SPARSE GRAPH-BASED DISCRIMINANT ANALYSIS AND COLLABORATIVE GRAPH-BASED DISCRIMINANT ANALYSIS

In SGDA [62], for each pixel $x_i$ with the dictionary $X$, we calculate the sparse representation vector by solving the $\ell_1$-norm optimization problem [115],

$$\arg\min_{w_i} \|w_i\|_1 \quad \text{s.t.} \quad X w_i = x_i, \quad (15)$$

where $w_i = [w_{i,1}, w_{i,2}, \ldots, w_{i,n}]^{\top}$ is a vector of size $n \times 1$ and $\|\cdot\|_1$ denotes the $\ell_1$-norm, which sums the absolute values of all of the entries. In contrast to SGDA, CGDA [63] obtains the weights between the pixels by collaborative representation. Thus, it benefits from within-class sample collaboration and computational efficiency. The collaborative representation is decided by the $\ell_2$-norm optimization problem

$$\arg\min_{w_i} \|w_i\|_2 \quad \text{s.t.} \quad X w_i = x_i, \quad (16)$$

where $w_i = [w_{i,1}, w_{i,2}, \ldots, w_{i,n}]^{\top}$ is a vector of size $n \times 1$ and $\|\cdot\|_2$ denotes the $\ell_2$-norm. The graph weight matrix $W = [w_1, w_2, \ldots, w_n]$ of size $n \times n$ is further constructed such that the ith column $w_i$ is the representation vector corresponding to $x_i$. Note that the diagonal elements of $W$ are set to zero. Ly et al. [62], [63] compute the matrix $W$ using the within-class samples. Thus, $W$ has a block-diagonal structure,

$$W = \begin{bmatrix} W^{(1)} & & 0 \\ & \ddots & \\ 0 & & W^{(C)} \end{bmatrix}, \quad (17)$$

where $\{W^{(l)}\}_{l=1}^{C}$ is the weight matrix of size $n_l \times n_l$ obtained using the labeled samples in only the lth class.
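A compact sketch of the class-wise graph construction in (15)-(17) follows. The exact equality-constrained problems are relaxed, as is common in practice: the $\ell_2$ (CGDA-style) weights use a closed-form ridge solution, and the $\ell_1$ (SGDA-style) weights use scikit-learn's Lasso as a stand-in $\ell_1$ solver. Function and parameter names are ours; the normalization and symmetrization details of the published methods are omitted.

```python
import numpy as np
from sklearn.linear_model import Lasso

def block_graph_weights(X, labels, mode="cgda", lam=0.01):
    """Block-diagonal graph weight matrix in the spirit of (17).
    X: d x n training dictionary; labels: length-n class labels;
    mode: 'cgda' (l2, closed form) or 'sgda' (l1, Lasso relaxation of (15))."""
    n = X.shape[1]
    W = np.zeros((n, n))
    for c in np.unique(labels):
        idx = np.flatnonzero(labels == c)
        for j in idx:
            others = idx[idx != j]                      # dictionary excludes the pixel itself
            D = X[:, others]
            if mode == "cgda":
                # Collaborative (ridge-regularized) representation, cf. (16).
                w = np.linalg.solve(D.T @ D + lam * np.eye(len(others)), D.T @ X[:, j])
            else:
                # Sparse representation via an l1-penalized least-squares relaxation of (15).
                w = Lasso(alpha=lam, fit_intercept=False, max_iter=5000).fit(D, X[:, j]).coef_
            W[others, j] = np.abs(w)                    # nonnegative weights; diagonal stays zero
    return W

# Plugging W (with L = D - W) into (13) yields the SGDA or CGDA projection matrix.
```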

Figure 3 illustrates the two graph weights using a three-class synthetic data set, from which the SGDA graph is obviously sparser than the CGDA one.

FIGURE 3. A visualization of two graph weights using three-class synthetic data: (a) an SGDA graph and (b) a CGDA graph.

EXTENSION OF GRAPH-BASED DISCRIMINANT ANALYSIS

The drawback of SGDA is that the method may be ineffective in capturing the global structure of the data. As a consequence, the authors of [67] developed SLGDA, for which they constructed a more informative graph by combining both sparsity and low rankness to preserve the global and local structure simultaneously. Similarly, Li and Du discussed the shortcoming of CGDA [69], i.e., that the method does not consider the data manifold structure. Thus, they developed LapCGDA, which not only offers collaborative representation but also exploits the intrinsic geometric information. The work in [66] further presented WSGDA to enhance both locality and sparsity during the learning of data features. Then, to overcome the difficulty of not having enough training samples, other researchers proposed a semisupervised block-sparse graph [116]. Yet others [117] constructed an optimal neighborhood graph by employing the marginal likelihood as an objective function. Finally, the researchers in [118] proposed probabilistic class-structure-regularized sparse representation, which exploits class structure information during the process of learning a discriminative graph.

SEMISUPERVISED DISCRIMINANT ANALYSIS

Because it is too expensive and difficult to obtain sufficient labeled data, researchers have developed semisupervised learning-based discriminant analysis to solve this problem [60], [65], [119]. Such methods can efficiently utilize few labeled samples and many unlabeled samples to improve class separability.

SEMISUPERVISED LEARNING-BASED DISCRIMINANT ANALYSIS

In [60], SDA aims to seek a projection that respects the discriminant structure inferred from the labeled data points, as well as the intrinsic geometrical structure inferred from both labeled and unlabeled data points. Given a labeled set $\{x_i\}_{i=1}^{m}$ belonging to C classes and an unlabeled set $\{x_i\}_{i=m+1}^{n}$ (m labeled samples and n - m unlabeled), the SDA optimization problem can be represented as

$$\max_{P} \frac{P^{\top} S^{(b)} P}{P^{\top} S^{(t)} P + \alpha J(P)}, \quad (18)$$

where $J(P)$ controls the learning complexity of the hypothesis family, and the coefficient $\alpha$ balances the model complexity and the empirical loss. This objective function can be viewed as an extension of traditional LDA. In SDA, the regularizer is defined as

$$J(P) = \sum_{i,j} \left(P^{\top} x_i - P^{\top} x_j\right)^2 W_{ij}, \quad (19)$$

where $W_{ij}$ denotes the relationship between nodes i and j, i.e., whether $x_i$ and $x_j$ are close (among the k-nearest neighbors of each other) in the whole sample set, including labeled and unlabeled samples. After some simple algebraic operations, the previous formulation can be described as

$$J(P) = \sum_{i,j} \left(P^{\top} x_i - P^{\top} x_j\right)^2 W_{ij} = 2\sum_{i} P^{\top} x_i D_{ii} x_i^{\top} P - 2\sum_{i,j} P^{\top} x_i W_{ij} x_j^{\top} P = 2 P^{\top} X (D - W) X^{\top} P = 2 P^{\top} X L X^{\top} P, \quad (20)$$

where $D$ is a diagonal matrix with the ith diagonal element being $D_{ii} = \sum_{j=1}^{n} W_{ij}$, and $L = D - W$ is the Laplacian matrix. The corresponding weight matrix W is defined by

$$W_{ij} = \begin{cases} 1, & \text{if } x_i \in N_k(x_j) \text{ or } x_j \in N_k(x_i), \\ 0, & \text{otherwise}, \end{cases} \quad (21)$$

where $N_k(x_i)$ denotes the set of k-nearest neighbors of $x_i$.
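The sketch below assembles the pieces of (18)-(21): a symmetrized kNN graph over labeled and unlabeled samples, the Laplacian regularizer J(P), and the resulting generalized eigenproblem. It is only an illustration of the formulation; names such as `sda_projection` and the small ridge constant are our assumptions, not the implementation of [60].

```python
import numpy as np
from scipy.linalg import eigh
from sklearn.neighbors import kneighbors_graph

def sda_projection(X, labels, d_reduced, alpha=0.1, k=5):
    """Sketch of SDA, (18)-(21). X: d x n with the m labeled samples first; labels: length m."""
    d, n = X.shape
    m = len(labels)
    # kNN graph over all n samples, symmetrized as in (21).
    W = kneighbors_graph(X.T, n_neighbors=k, mode="connectivity").toarray()
    W = np.maximum(W, W.T)
    L = np.diag(W.sum(axis=1)) - W                      # graph Laplacian, L = D - W
    # Between- and total-scatter matrices from the labeled subset only.
    Xl = X[:, :m]
    mean_all = Xl.mean(axis=1, keepdims=True)
    S_t = (Xl - mean_all) @ (Xl - mean_all).T
    S_b = np.zeros((d, d))
    for c in np.unique(labels):
        Xc = Xl[:, labels == c]
        mc = Xc.mean(axis=1, keepdims=True)
        S_b += Xc.shape[1] * (mc - mean_all) @ (mc - mean_all).T
    # Objective (18): maximize P^T S_b P against P^T (S_t + alpha * X L X^T) P.
    eigval, eigvec = eigh(S_b, S_t + alpha * (X @ L @ X.T) + 1e-6 * np.eye(d))
    order = np.argsort(eigval)[::-1]
    return eigvec[:, order[:d_reduced]]
```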
The study in [120] developed semisupervised LFDA (SELF) from the previously introduced LFDA [100], aiming to solve the following generalized eigenvalue problem:

$$S^{(rlb)} \varphi = \lambda S^{(rlw)} \varphi, \quad (22)$$

where $\varphi$ is the matrix of eigenvectors, and $S^{(rlb)}$ and $S^{(rlw)}$ are the regularized local between-class and within-class scatter matrices, defined by

$$S^{(rlb)} := (1 - \beta) S^{(lb)} + \beta S^{(t)}, \quad (23)$$

$$S^{(rlw)} := (1 - \beta) S^{(lw)} + \beta I_d, \quad (24)$$

respectively, with $\beta \in [0, 1]$ being a parameter that balances the two terms. It is obvious that SELF reduces to LFDA when $\beta = 0$ and to PCA when $\beta = 1$. In other words, SELF with $0 < \beta < 1$ inherits the characteristics of both supervised LFDA and unsupervised PCA. The matrix $S^{(rlb)}$ can be represented in a pairwise form as

$$S^{(rlb)} := \frac{1}{2} \sum_{i,j=1}^{n} W_{ij}^{(rlb)} (x_i - x_j)(x_i - x_j)^{\top}, \quad (25)$$

where $W^{(rlb)}$ is the $n \times n$ weight matrix

$$W_{ij}^{(rlb)} := \begin{cases} (1 - \beta) A_{ij} (1/n - 1/n_{y_i}) + \beta/n, & \text{if } y_i = y_j, \\ (1 - \beta)/n + \beta/n, & \text{if } y_i \neq y_j, \\ \beta/n, & \text{otherwise}, \end{cases} \quad (26)$$

where $y_i$ is the label of $x_i$.
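Given local scatter matrices computed from the labeled samples (for instance, with the LFDA sketch above), SELF only requires the mixing step of (22)-(24). The sketch below shows that step; it assumes the local scatter matrices are supplied by the caller, and the function name is ours.

```python
import numpy as np
from scipy.linalg import eigh

def self_projection(S_lb, S_lw, X_all, d_reduced, beta=0.5):
    """Sketch of SELF, (22)-(24). S_lb, S_lw: local scatters from the labeled data;
    X_all: d x n matrix of labeled plus unlabeled samples."""
    d = X_all.shape[0]
    mean_all = X_all.mean(axis=1, keepdims=True)
    S_t = (X_all - mean_all) @ (X_all - mean_all).T     # total scatter over all samples
    S_rlb = (1.0 - beta) * S_lb + beta * S_t            # (23)
    S_rlw = (1.0 - beta) * S_lw + beta * np.eye(d)      # (24)
    eigval, eigvec = eigh(S_rlb, S_rlw)                 # generalized eigenproblem (22)
    order = np.argsort(eigval)[::-1]
    return eigvec[:, order[:d_reduced]]

# beta = 0 recovers LFDA; beta = 1 behaves like unsupervised PCA on the pooled data.
```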

IMPROVEMENTS OF SEMISUPERVISED DISCRIMINANT ANALYSIS

In [121], semisupervised orthogonal discriminant analysis (SODA) via label propagation is an improved semisupervised LDA method. SODA propagates the label information from labeled samples to unlabeled data through a specially designed label propagation and calculates the projection matrix by maximizing the objective function of orthogonal discriminant analysis. Coping with the problem of effectively combining unlabeled data with labeled samples to find the embedding transformation, subspace semisupervised Fisher's discriminant analysis [122] aims to find an embedding transformation that respects the discriminant structure inferred from labeled samples and the intrinsic geometrical structure inferred from both labeled and unlabeled data. To address the trace-ratio (TR) problem for semisupervised dimensionality reduction, Huang et al. proposed TR-based flexible SDA [123] to better handle data sampled from a certain type of nonlinear manifold that is somewhat close to a linear subspace. To improve SELF, Shao and Zhang proposed a sparse dimensionality-reduction method based on SELF [124] by utilizing the advantageous complementarities between SELF and SPP. Motivated by the fact that statistically uncorrelated and parameter-free characteristics are both desirable and promising for dimension reduction, the authors developed enhanced SELF [125] to preserve the manifold structure of labeled and unlabeled samples, in addition to separating labeled samples of different classes from one another. An alternative semisupervised framework is flexible manifold embedding, an improved method of local and global consistency [126]. The method seeks to maximize label fitness and manifold smoothness, which can effectively utilize label information from labeled data as well as the manifold structure of the data (including labeled and unlabeled data). In [127], adaptive semisupervised dimensionality reduction with sparse representation seeks to obtain an optimized low-dimensional representation of the original data by adaptively adjusting the weights of the pairwise constraints and simultaneously optimizing the graph construction using the $\ell_1$ graph of sparse representation. The work in [128] developed semisupervised double sparse graphs, fully considering both the positive and negative structure relationships of data points.

DISCRIMINANT ANALYSIS IN KERNEL SPACE

In practice, hyperspectral pixels of different classes in the original space are not always distinctive; this is known as an inseparability issue. Thus, because of the nonlinearity of hyperspectral data, kernel methods may have the potential to solve such a nonlinear problem. The idea is that, through kernel mapping, input data are projected into a much higher-dimensional feature space, where complex nonlinear decision boundaries become linear in the kernel-induced space. Several relevant techniques (e.g., KLFDA [83] and KCGDA [69]) have been successfully applied in hyperspectral image analysis.

KERNEL LOCAL FISHER'S DISCRIMINANT ANALYSIS AND KERNEL COLLABORATIVE GRAPH-BASED DISCRIMINANT ANALYSIS

Before introducing KLFDA [83], [84], we first briefly review KDA [75], [76], which seeks to find a projection in a kernel-induced (higher-dimensional) space such that Fisher's ratio can be more easily maximized.
For a given nonlinear mapping function $\Phi$, the KDA projection can be solved by

$$P_{\mathrm{KDA}} = \arg\max_{P_{\mathrm{KDA}}} \operatorname{tr}\!\left[\left(P_{\mathrm{KDA}}^{\top} S_{\Phi}^{(w)} P_{\mathrm{KDA}}\right)^{-1} P_{\mathrm{KDA}}^{\top} S_{\Phi}^{(b)} P_{\mathrm{KDA}}\right], \quad (27)$$

where $S_{\Phi}^{(b)}$ is the between-class scatter matrix and $S_{\Phi}^{(w)}$ is the within-class scatter matrix in the space induced by the mapping function $\Phi$.

KLFDA is a kernel extension of LFDA via the kernel trick [129]. The kernel function employed is the radial basis function [72], which can be expressed as

$$k(x_i, x_j) = \exp\!\left(-\frac{\|x_i - x_j\|^2}{2\sigma^2}\right), \quad (28)$$

where $\sigma$ is a user-defined parameter of the kernel. For KLFDA, the local within- and between-class scatter matrices are defined in the kernel-induced space. The projection matrix $P_{\mathrm{KLFDA}}$ in the kernel space that maximizes the modified Fisher's ratio is given by the solution of the generalized eigenvalue problem

$$K L^{(lb)} K P_{\mathrm{KLFDA}} = \tilde{\Lambda} \left(K L^{(lw)} K + \epsilon I_n\right) P_{\mathrm{KLFDA}}, \quad (29)$$

where $\tilde{\Lambda}$ is the diagonal eigenvalue matrix; $\epsilon$ is a small constant; $P_{\mathrm{KLFDA}}$ is the eigenvector matrix; $K$ represents the Gram matrix, with $K_{ij} = k(x_i, x_j)$ defined by (28); $L^{(lw)} = D^{(lw)} - W^{(lw)}$, where $D^{(lw)}$ is a diagonal matrix with the ith diagonal element being $D_{ii}^{(lw)} = \sum_{j=1}^{n} W_{ij}^{(lw)}$; and $L^{(lb)} = L^{(m)} - L^{(lw)}$, where $L^{(m)}$ is the local mixture matrix defined as $L^{(m)} = D^{(m)} - W^{(m)}$, with $D^{(m)}$ a diagonal matrix whose ith diagonal element is $D_{ii}^{(m)} = \sum_{j=1}^{n} W_{ij}^{(m)}$. Here, $W^{(lb)}$ and $W^{(lw)}$ are calculated using (8) and (9), respectively, in the kernel feature space. In KLFDA, we employ the affinity matrix to weight the within-class scatter matrix in the kernel-induced space, such that the local neighborhood relationship is preserved.
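As an illustration of (28) and (29), the sketch below builds the RBF Gram matrix, forms the local weight matrices from kernel-induced distances, and solves the kernelized generalized eigenproblem. The mixture weights $W^{(m)}$ follow the standard LFDA formulation ($A_{ij}/n$ within a class and $1/n$ otherwise), which is an assumption on our part since the article does not spell them out; all function and parameter names are ours.

```python
import numpy as np
from scipy.linalg import eigh
from scipy.spatial.distance import cdist

def klfda_projection(X, labels, d_reduced, sigma=1.0, knn=7, eps=1e-6):
    """Sketch of KLFDA, (28)-(29). Returns an n x d_reduced coefficient matrix;
    a test sample x is embedded as P^T [k(x_1, x), ..., k(x_n, x)]^T."""
    n = X.shape[1]
    K = np.exp(-cdist(X.T, X.T, "sqeuclidean") / (2.0 * sigma**2))   # RBF Gram matrix, (28)
    # Kernel-induced squared distances and local scaling for the affinity of (5).
    dist2 = np.maximum(np.diag(K)[:, None] + np.diag(K)[None, :] - 2.0 * K, 0.0)
    s = np.sqrt(np.sort(dist2, axis=1)[:, knn])
    A = np.exp(-dist2 / np.outer(s, s))
    same = labels[:, None] == labels[None, :]
    n_class = np.array([np.sum(labels == labels[i]) for i in range(n)])
    W_lw = np.where(same, A / n_class[:, None], 0.0)                 # (9) in the kernel space
    W_m = np.where(same, A / n, 1.0 / n)                             # assumed local mixture weights
    L_lw = np.diag(W_lw.sum(axis=1)) - W_lw
    L_m = np.diag(W_m.sum(axis=1)) - W_m
    L_lb = L_m - L_lw
    eigval, eigvec = eigh(K @ L_lb @ K, K @ L_lw @ K + eps * np.eye(n))   # (29)
    order = np.argsort(eigval)[::-1]
    return eigvec[:, order[:d_reduced]]
```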

In KCGDA [69], the projection $P^{(k)}$ in the kernel space is given by the solution of the generalized eigenvalue problem

$$K L^{(k)} K P^{(k)} = \Lambda K K P^{(k)}, \quad (30)$$

where $K = \Phi^{\top} \Phi \in \mathbb{R}^{n \times n}$ represents the Gram matrix, with $K_{ij} = k(x_i, x_j)$, and $L^{(k)}$ is the Laplacian matrix calculated according to the weight matrix $W^{(k)}$ in the kernel space. Here, the objective function becomes

$$\arg\min_{w_i^{*}} \left\|\Phi(x_i) - \Phi_i w_i^{*}\right\|_2^2 + \lambda \left\|w_i^{*}\right\|_2^2, \quad (31)$$

where $\Phi_i = [\Phi(x_1), \Phi(x_2), \ldots, \Phi(x_n)]$, excluding $\Phi(x_i)$, has n - 1 columns. The weight vector $w_i^{*}$, with a size of $(n-1) \times 1$, can be recovered in a closed-form solution,

$$w_i^{*} = \left(\Phi_i^{\top} \Phi_i + \lambda I\right)^{-1} \Phi_i^{\top} \Phi(x_i) = \left(K_i + \lambda I\right)^{-1} k(\cdot, x_i), \quad (32)$$

where $k(\cdot, x_i) = [k(x_1, x_i), k(x_2, x_i), \ldots, k(x_n, x_i)]^{\top} \in \mathbb{R}^{(n-1) \times 1}$, excluding $k(x_i, x_i)$, and $K_i = \Phi_i^{\top} \Phi_i \in \mathbb{R}^{(n-1) \times (n-1)}$. Then, the weight matrix $W^{(k)}$ in the kernel space can be constructed similarly to (17). The authors of [69] demonstrated KCGDA to be superior to its linear version, i.e., CGDA.

OTHER TECHNIQUES WITH KERNEL STRATEGY

Because of the satisfactory results obtained when handling nonlinear data with complex distributions, increasing attention has been paid to kernel-based dimensionality-reduction methods as extensions of traditional KDA. One study investigated composite kernel discriminant analysis (CKDA) [85], constructing spectral and spatial information extracted by a Gaussian weighted local mean operator into a composite kernel suitable for the SSS problem. Similarly, Zhang and Prasad proposed LPCK [86] for multisource remote-sensing data classification. The work in [87] used DMKL to find an optimal kernel combination via a time-consuming search, achieving the maximum separability. The authors of [80] discussed a multiclassifier and decision-fusion framework based on KDA. Furthermore, they developed pairwise KDA and KLFDA, along with a one-against-one strategy, to overcome the limitation of a single global transformation for the multiclass task. For graph-embedding techniques, in addition to KCGDA, researchers developed kernel SGDA (KSGDA) and kernel SLGDA in [130] and [131]. In [130], one team investigated a graph representation under sparse and low-rank constraints in kernel Hilbert spaces for clustering and semisupervised classification. In [131], another group developed a low-complexity Nyström-based approximate kernel method, with the benefits of improving class separability and avoiding the manipulation of complicated kernel operations required by traditional kernel methods.

EXPERIMENTAL RESULTS AND ANALYSIS

In this section, we demonstrate the performance of the aforementioned dimensionality-reduction methods (LDA, LFDA, SGDA, and CGDA) and their kernel extensions, i.e., KDA, KLFDA, KSGDA, and KCGDA.

We acquired our first batch of experimental data [136] through NASA's AVIRIS sensor in northwestern Indiana. There are 220 spectral channels in the 0.4-2.5-μm region of the visible and infrared spectrum, with a spatial resolution of 20 m. There are 16 different land-cover classes in the original ground truth, but we used only eight classes in this study to avoid those classes with very few training samples [4]. In this data set, we randomly selected 10% of the labeled samples in the ground-truth map for training and the rest for testing.

We collected the second experimental data set from the Reflective Optics System Imaging Spectrometer (ROSIS) sensor over the city of Pavia in northern Italy. The scene we used was the Pavia University scene, which has a spatial coverage of 610 x 340 pixels. The data set has 103 spectral bands after water-band removal, with a spectral coverage from 0.43 to 0.86 μm and a spatial resolution of 1.3 m. Approximately 42,776 labeled pixels with nine classes are in the ground-truth map.
In this data set, we randomly selected 8% of the labeled samples for training and the rest for testing.

We acquired the third set of data, also from the AVIRIS sensor, over Salinas Valley, California. The image contains 512 x 217 pixels, with a spatial resolution of 3.7 m and 204 bands after 20 water-absorption bands were removed. The scene mainly contains bare soil, vegetable fields, and vineyards. There are also 16 classes, and we randomly selected 5% of the labeled samples for training and the rest for testing.

For discriminant analysis methods, the most important parameter is the reduced dimensionality, which significantly affects the final classification performance. Figures 4-6 illustrate the classification accuracy versus the reduced dimensionality K in a range of 1-41, with an interval of two. For most of the methods, the accuracy tends to be stable once the reduced dimensionality reaches a particular value; e.g., 15 is the best for LFDA, SGDA, and CGDA for two of the experimental data sets, and 40 may be optimal for KLFDA, KSGDA, and KCGDA.

In Tables 3-5, we used the standardized McNemar's test to verify the statistical significance of the improvements among these algorithms. McNemar's test Z values larger than 1.96 and 2.58 mean that two classification results are statistically different at the 95% and 99% confidence levels, respectively. It is obvious that a kernel version is generally better than its linear version, because the Z value is usually greater than 2.58 (except for SGDA and CGDA on the University of Pavia data set and CGDA on the Salinas data set). SGDA and CGDA can perform better than LFDA and LDA in general.

In Tables 6-8, we summarize the accuracy of each class, the overall accuracy (OA), the average accuracy (AA), and the Kappa coefficient. On the one hand, LFDA utilizes local information, which causes its accuracy to be higher than that of LDA. SGDA and CGDA employ the coefficients obtained with $\ell_1$-norm and $\ell_2$-norm minimization, respectively, to construct the affinity matrix, and their performance is always better than LFDA and LDA.
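The Z values reported in Tables 3-5 can be reproduced from paired per-sample outcomes on a common test set; a minimal sketch of the standardized McNemar statistic follows, with names of our own choosing.

```python
import numpy as np

def mcnemar_z(correct_a, correct_b):
    """Standardized McNemar's test for two classifiers evaluated on the same test samples.
    correct_a, correct_b: boolean arrays, True where the respective classifier is correct."""
    correct_a = np.asarray(correct_a, dtype=bool)
    correct_b = np.asarray(correct_b, dtype=bool)
    f12 = np.sum(correct_a & ~correct_b)    # samples only the first classifier gets right
    f21 = np.sum(~correct_a & correct_b)    # samples only the second classifier gets right
    return (f12 - f21) / np.sqrt(f12 + f21)

# |Z| > 1.96 and |Z| > 2.58 indicate statistically different results at the
# 95% and 99% confidence levels, respectively.
```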

FIGURE 4. The classification accuracy versus the reduced dimensionality K for (a) linear methods (LDA, LFDA, SGDA, and CGDA) and (b) kernel methods (KDA, KLFDA, KSGDA, and KCGDA) using the Indian Pines data.

FIGURE 5. The classification accuracy versus the reduced dimensionality K for (a) linear methods and (b) kernel methods using the University of Pavia data.

FIGURE 6. The classification accuracy versus the reduced dimensionality K for (a) linear methods and (b) kernel methods using the Salinas data.

TABLE 3. THE STATISTICAL SIGNIFICANCE (99% CONFIDENCE LEVEL) FROM THE STANDARDIZED McNEMAR'S TEST FOR THE INDIAN PINES DATA SET.
Z/SIGNIFICANT?  LFDA      SGDA      CGDA       KDA       KLFDA      KSGDA      KCGDA
LDA             3.95/yes  4.70/yes  10.33/yes  8.56/yes  17.87/yes  17.59/yes  18.23/yes
LFDA                      0.75/no   6.41/yes   4.63/yes  14.05/yes  13.76/yes  14.41/yes
SGDA                                5.66/yes   3.88/yes  13.32/yes  13.03/yes  13.68/yes
CGDA                                           1.79/no   7.74/yes   7.45/yes   8.11/yes
KDA                                                      9.51/yes   9.22/yes   9.87/yes
KLFDA                                                               0.30/no    0.37/no
KSGDA                                                                          0.67/no

TABLE 4. THE STATISTICAL SIGNIFICANCE (99% CONFIDENCE LEVEL) FROM THE STANDARDIZED McNEMAR'S TEST FOR THE UNIVERSITY OF PAVIA DATA SET.
Z/SIGNIFICANT?  LFDA       SGDA       CGDA       KDA        KLFDA      KSGDA      KCGDA
LDA             17.51/yes  24.71/yes  22.95/yes  7.38/yes   20.09/yes  24.14/yes  24.92/yes
LFDA                       7.39/yes   5.57/yes   10.19/yes  2.63/yes   6.80/yes   7.61/yes
SGDA                                  1.83/no    17.5/yes   4.76/yes   0.59/no    0.22/no
CGDA                                             15.71/yes  2.94/yes   1.24/no    2.04/yes
KDA                                                         12.8/yes   16.92/yes  17.71/yes
KLFDA                                                                  4.17/yes   4.98/yes
KSGDA                                                                             0.81/no

TABLE 5. THE STATISTICAL SIGNIFICANCE (99% CONFIDENCE LEVEL) FROM THE STANDARDIZED McNEMAR'S TEST FOR THE SALINAS DATA SET.
Z/SIGNIFICANT?  LFDA    SGDA      CGDA       KDA        KLFDA      KSGDA      KCGDA
LDA             13/yes  8.29/yes  13.13/yes  13.53/yes  17.83/yes  12.48/yes  14.78/yes
LFDA                    0.12/no   4.73/yes   0.54/no    4.88/yes   0.52/no    1.79/no
SGDA                              4.86/yes   5.27/yes   9.61/yes   4.21/yes   6.52/yes
CGDA                                         0.41/no    4.76/yes   0.65/no    1.66/no
KDA                                                     4.35/yes   1.06/no    1.25/no
KLFDA                                                              5.41/yes   3.10/yes
KSGDA                                                                         2.32/yes

TABLE 6. THE SUPPORT VECTOR MACHINE (SVM) CLASS-SPECIFIC ACCURACY (%) AND OA OF DIFFERENT TECHNIQUES FOR THE INDIAN PINES DATA (class-specific accuracies, OA, AA, and Kappa for LDA, LFDA, SGDA, CGDA, KDA, KLFDA, KSGDA, and KCGDA).

TABLE 7. THE SVM CLASS-SPECIFIC ACCURACY (%) AND OA OF DIFFERENT TECHNIQUES FOR THE UNIVERSITY OF PAVIA DATA (class-specific accuracies, OA, AA, and Kappa for LDA, LFDA, SGDA, CGDA, KDA, KLFDA, KSGDA, and KCGDA).

TABLE 8. THE SVM CLASS-SPECIFIC ACCURACY (%) AND OA OF DIFFERENT TECHNIQUES FOR THE SALINAS DATA (class-specific accuracies, OA, AA, and Kappa for LDA, LFDA, SGDA, CGDA, KDA, KLFDA, KSGDA, and KCGDA).

On the other hand, because the distribution of hyperspectral data is usually complex, the kernel versions of these algorithms are superior to their linear counterparts. Take the Indian Pines data, for example. The accuracy of KLFDA is 84.75%, which is around 5% higher than that of LFDA; for both KSGDA and KCGDA, there is an improvement of approximately 3.5%. In Figures 7-9, we show the classification maps of the different dimensionality-reduction techniques, which are consistent with the results in Tables 6-8. For example, we observe that kernel-based methods produce smoother maps than linear methods in classifying the sixth class when using the Indian Pines data.

We also conducted an experiment to show the sensitivity to changes in the training-set size. Figures 10-12 illustrate the OA as a function of the ratio of training samples to the total labeled samples. For the Salinas data, the training size is changed from 0.01 to 0.05 (note that 0.01 is the ratio of the number of training samples to the total labeled data). To avoid any bias, we randomly chose the training samples for each sample-size ratio, repeated the experiment ten times, and reported the mean accuracy. For the Indian Pines data, the ratio range is …; for the University of Pavia data, it is ….

FIGURE 7. The thematic maps resulting from the dimensionality-reduction methods for the Indian Pines data set, with eight classes (corn no till, corn minimum till, grass/pasture, hay-windrowed, soybean no till, soybean minimum till, soybean clean till, and woods): (a) LDA: 74.74%. (b) LFDA: 80.97%. (c) SGDA: 81.18%. (d) CGDA: 84.35%. (e) KDA: 81.74%. (f) KLFDA: 86.65%. (g) KSGDA: 85.84%. (h) KCGDA: 86.50%.

FIGURE 8. The thematic maps resulting from the dimensionality-reduction methods for the University of Pavia data set, with nine classes (asphalt, meadows, gravel, trees, metal sheets, bare soil, bitumen, bricks, and shadows): (a) LDA: 88.74%. (b) LFDA: 92.43%. (c) SGDA: 93.92%. (d) CGDA: 93.06%. (e) KDA: 90.36%. (f) KLFDA: 92.59%. (g) KSGDA: 93.83%. (h) KCGDA: 93.62%.

FIGURE 9. The thematic maps resulting from the dimensionality-reduction methods for the Salinas data set, with 16 classes (weeds 1, weeds 2, fallow, fallow rough plow, fallow smooth, stubble, celery, grapes, soil, corn, lettuce at four, five, six, and seven weeks, vineyard untrained, and vineyard trellis): (a) LDA: 90.85%. (b) LFDA: 92.65%. (c) SGDA: 91.82%. (d) CGDA: 92.78%. (e) KDA: 92.81%. (f) KLFDA: 93.26%. (g) KSGDA: 92.22%. (h) KCGDA: 93.15%.

FIGURE 10. The classification accuracy versus different training-sample sizes using the Indian Pines data with (a) linear methods and (b) kernel methods.

FIGURE 11. The classification accuracy versus different training-sample sizes using the University of Pavia data set with (a) linear methods and (b) kernel methods.

It is obvious that the accuracy increases as the training-size ratio becomes larger. There is no doubt that the typical LDA performs the worst and that the kernel methods are still always better than the linear methods. When the ratio is 0.1 for the Indian Pines data, the improvement gap between KLFDA and LFDA is 8%, the one between KSGDA and SGDA is 6%, and that between KCGDA and CGDA is 4%. As for the standard deviation, when the ratio of training samples is larger, the deviation tends to be smaller and more stable.

FIGURE 12. The classification accuracy versus different training-sample sizes using the Salinas data with (a) linear methods and (b) kernel methods.

Based on the aforementioned experiments, we can summarize some important conclusions. In general, algorithms using a locality-preserving strategy, such as LFDA, are better than those without local information, such as LDA, because the locality-preserving technique can provide much information about geometric structure and manifold subspaces to inform the discriminant analysis. In addition, graph-based discriminant analysis algorithms (e.g., CGDA) perform better than other traditional dimensionality-reduction approaches.

Table 9 summarizes the computational cost of the previously described algorithms. We carried out all experiments using MATLAB on an Intel Core i-series central processing unit machine with 16 GB of random access memory. Obviously, SGDA and CGDA need more computational time than LDA and LFDA to obtain their higher classification accuracy. Furthermore, the classification accuracies of the kernel versions (e.g., KLFDA) are generally higher than those of the linear versions (e.g., LFDA). However, the kernel versions are always more costly than the corresponding linear versions because of the additional nonlinear mapping process.

TABLE 9. THE EXECUTION TIME (s) OF DIFFERENT METHODS ON THE THREE DATA SETS (rows: LDA, LFDA, SGDA, CGDA, KDA, KLFDA, KSGDA, and KCGDA; columns: Indian Pines, University of Pavia, and Salinas).

CONCLUSIONS

The rich spectral information in hyperspectral imagery is a double-edged sword: on the one hand, it provides potentially discriminative features for accurate object recognition or classification; on the other hand, highly correlated spectral bands may increase computational complexity and deteriorate classification performance when training samples are limited. During the past few years, researchers have proposed many dimensionality-reduction algorithms for hyperspectral image analysis. In this article, we have reviewed discriminant analysis-based dimensionality-reduction approaches, including linear subspace learning (e.g., LDA), locality-preserving dimensionality reduction (e.g., LFDA), graph-embedding discriminant analysis (e.g., SGDA and CGDA), SDA (e.g., SELF), and discriminant analysis in kernel space.

One interesting open problem for future work is how to construct a more representative graph when we consider dimensionality-reduction tasks within the graph-embedding framework. From the experimental results, we observe that the state-of-the-art CGDA or SGDA is generally superior to LDA or even LFDA. We construct the graph of CGDA or SGDA …


Time Series Clustering Ensemble Algorithm Based on Locality Preserving Projection Based on Locality Preserving Projection 2 Information & Technology College, Hebei University of Economics & Business, 05006 Shijiazhuang, China E-mail: 92475577@qq.com Xiaoqing Weng Information & Technology

More information

PoS(CENet2017)005. The Classification of Hyperspectral Images Based on Band-Grouping and Convolutional Neural Network. Speaker.

PoS(CENet2017)005. The Classification of Hyperspectral Images Based on Band-Grouping and Convolutional Neural Network. Speaker. The Classification of Hyperspectral Images Based on Band-Grouping and Convolutional Neural Network 1 Xi an Hi-Tech Institute Xi an 710025, China E-mail: dr-f@21cnl.c Hongyang Gu Xi an Hi-Tech Institute

More information

Data Mining Chapter 3: Visualizing and Exploring Data Fall 2011 Ming Li Department of Computer Science and Technology Nanjing University

Data Mining Chapter 3: Visualizing and Exploring Data Fall 2011 Ming Li Department of Computer Science and Technology Nanjing University Data Mining Chapter 3: Visualizing and Exploring Data Fall 2011 Ming Li Department of Computer Science and Technology Nanjing University Exploratory data analysis tasks Examine the data, in search of structures

More information

Remote Sensing Data Classification Using Combined Spectral and Spatial Local Linear Embedding (CSSLE)

Remote Sensing Data Classification Using Combined Spectral and Spatial Local Linear Embedding (CSSLE) 2016 International Conference on Artificial Intelligence and Computer Science (AICS 2016) ISBN: 978-1-60595-411-0 Remote Sensing Data Classification Using Combined Spectral and Spatial Local Linear Embedding

More information

Facial Expression Recognition Using Non-negative Matrix Factorization

Facial Expression Recognition Using Non-negative Matrix Factorization Facial Expression Recognition Using Non-negative Matrix Factorization Symeon Nikitidis, Anastasios Tefas and Ioannis Pitas Artificial Intelligence & Information Analysis Lab Department of Informatics Aristotle,

More information

Learning based face hallucination techniques: A survey

Learning based face hallucination techniques: A survey Vol. 3 (2014-15) pp. 37-45. : A survey Premitha Premnath K Department of Computer Science & Engineering Vidya Academy of Science & Technology Thrissur - 680501, Kerala, India (email: premithakpnath@gmail.com)

More information

CSE 6242 A / CS 4803 DVA. Feb 12, Dimension Reduction. Guest Lecturer: Jaegul Choo

CSE 6242 A / CS 4803 DVA. Feb 12, Dimension Reduction. Guest Lecturer: Jaegul Choo CSE 6242 A / CS 4803 DVA Feb 12, 2013 Dimension Reduction Guest Lecturer: Jaegul Choo CSE 6242 A / CS 4803 DVA Feb 12, 2013 Dimension Reduction Guest Lecturer: Jaegul Choo Data is Too Big To Do Something..

More information

Detecting Burnscar from Hyperspectral Imagery via Sparse Representation with Low-Rank Interference

Detecting Burnscar from Hyperspectral Imagery via Sparse Representation with Low-Rank Interference Detecting Burnscar from Hyperspectral Imagery via Sparse Representation with Low-Rank Interference Minh Dao 1, Xiang Xiang 1, Bulent Ayhan 2, Chiman Kwan 2, Trac D. Tran 1 Johns Hopkins Univeristy, 3400

More information

Technical Report. Title: Manifold learning and Random Projections for multi-view object recognition

Technical Report. Title: Manifold learning and Random Projections for multi-view object recognition Technical Report Title: Manifold learning and Random Projections for multi-view object recognition Authors: Grigorios Tsagkatakis 1 and Andreas Savakis 2 1 Center for Imaging Science, Rochester Institute

More information

Hyperspectral and Multispectral Image Fusion Using Local Spatial-Spectral Dictionary Pair

Hyperspectral and Multispectral Image Fusion Using Local Spatial-Spectral Dictionary Pair Hyperspectral and Multispectral Image Fusion Using Local Spatial-Spectral Dictionary Pair Yifan Zhang, Tuo Zhao, and Mingyi He School of Electronics and Information International Center for Information

More information

Hyperspectral Image Classification Using Gradient Local Auto-Correlations

Hyperspectral Image Classification Using Gradient Local Auto-Correlations Hyperspectral Image Classification Using Gradient Local Auto-Correlations Chen Chen 1, Junjun Jiang 2, Baochang Zhang 3, Wankou Yang 4, Jianzhong Guo 5 1. epartment of Electrical Engineering, University

More information

Hyperspectral Image Classification via Kernel Sparse Representation

Hyperspectral Image Classification via Kernel Sparse Representation 1 Hyperspectral Image Classification via Kernel Sparse Representation Yi Chen 1, Nasser M. Nasrabadi 2, Fellow, IEEE, and Trac D. Tran 1, Senior Member, IEEE 1 Department of Electrical and Computer Engineering,

More information

The Analysis of Parameters t and k of LPP on Several Famous Face Databases

The Analysis of Parameters t and k of LPP on Several Famous Face Databases The Analysis of Parameters t and k of LPP on Several Famous Face Databases Sujing Wang, Na Zhang, Mingfang Sun, and Chunguang Zhou College of Computer Science and Technology, Jilin University, Changchun

More information

Dimension reduction for hyperspectral imaging using laplacian eigenmaps and randomized principal component analysis

Dimension reduction for hyperspectral imaging using laplacian eigenmaps and randomized principal component analysis Dimension reduction for hyperspectral imaging using laplacian eigenmaps and randomized principal component analysis Yiran Li yl534@math.umd.edu Advisor: Wojtek Czaja wojtek@math.umd.edu 10/17/2014 Abstract

More information

Generalized trace ratio optimization and applications

Generalized trace ratio optimization and applications Generalized trace ratio optimization and applications Mohammed Bellalij, Saïd Hanafi, Rita Macedo and Raca Todosijevic University of Valenciennes, France PGMO Days, 2-4 October 2013 ENSTA ParisTech PGMO

More information

I How does the formulation (5) serve the purpose of the composite parameterization

I How does the formulation (5) serve the purpose of the composite parameterization Supplemental Material to Identifying Alzheimer s Disease-Related Brain Regions from Multi-Modality Neuroimaging Data using Sparse Composite Linear Discrimination Analysis I How does the formulation (5)

More information

Fuzzy Entropy based feature selection for classification of hyperspectral data

Fuzzy Entropy based feature selection for classification of hyperspectral data Fuzzy Entropy based feature selection for classification of hyperspectral data Mahesh Pal Department of Civil Engineering NIT Kurukshetra, 136119 mpce_pal@yahoo.co.uk Abstract: This paper proposes to use

More information

Face Recognition Using Wavelet Based Kernel Locally Discriminating Projection

Face Recognition Using Wavelet Based Kernel Locally Discriminating Projection Face Recognition Using Wavelet Based Kernel Locally Discriminating Projection Venkatrama Phani Kumar S 1, KVK Kishore 2 and K Hemantha Kumar 3 Abstract Locality Preserving Projection(LPP) aims to preserve

More information

Graph Laplacian Kernels for Object Classification from a Single Example

Graph Laplacian Kernels for Object Classification from a Single Example Graph Laplacian Kernels for Object Classification from a Single Example Hong Chang & Dit-Yan Yeung Department of Computer Science, Hong Kong University of Science and Technology {hongch,dyyeung}@cs.ust.hk

More information

Clustering: Classic Methods and Modern Views

Clustering: Classic Methods and Modern Views Clustering: Classic Methods and Modern Views Marina Meilă University of Washington mmp@stat.washington.edu June 22, 2015 Lorentz Center Workshop on Clusters, Games and Axioms Outline Paradigms for clustering

More information

Remote Sensed Image Classification based on Spatial and Spectral Features using SVM

Remote Sensed Image Classification based on Spatial and Spectral Features using SVM RESEARCH ARTICLE OPEN ACCESS Remote Sensed Image Classification based on Spatial and Spectral Features using SVM Mary Jasmine. E PG Scholar Department of Computer Science and Engineering, University College

More information

Fuzzy Bidirectional Weighted Sum for Face Recognition

Fuzzy Bidirectional Weighted Sum for Face Recognition Send Orders for Reprints to reprints@benthamscience.ae The Open Automation and Control Systems Journal, 2014, 6, 447-452 447 Fuzzy Bidirectional Weighted Sum for Face Recognition Open Access Pengli Lu

More information

A Distance-Based Classifier Using Dissimilarity Based on Class Conditional Probability and Within-Class Variation. Kwanyong Lee 1 and Hyeyoung Park 2

A Distance-Based Classifier Using Dissimilarity Based on Class Conditional Probability and Within-Class Variation. Kwanyong Lee 1 and Hyeyoung Park 2 A Distance-Based Classifier Using Dissimilarity Based on Class Conditional Probability and Within-Class Variation Kwanyong Lee 1 and Hyeyoung Park 2 1. Department of Computer Science, Korea National Open

More information

MULTIVARIATE TEXTURE DISCRIMINATION USING A PRINCIPAL GEODESIC CLASSIFIER

MULTIVARIATE TEXTURE DISCRIMINATION USING A PRINCIPAL GEODESIC CLASSIFIER MULTIVARIATE TEXTURE DISCRIMINATION USING A PRINCIPAL GEODESIC CLASSIFIER A.Shabbir 1, 2 and G.Verdoolaege 1, 3 1 Department of Applied Physics, Ghent University, B-9000 Ghent, Belgium 2 Max Planck Institute

More information

CHAPTER 2 TEXTURE CLASSIFICATION METHODS GRAY LEVEL CO-OCCURRENCE MATRIX AND TEXTURE UNIT

CHAPTER 2 TEXTURE CLASSIFICATION METHODS GRAY LEVEL CO-OCCURRENCE MATRIX AND TEXTURE UNIT CHAPTER 2 TEXTURE CLASSIFICATION METHODS GRAY LEVEL CO-OCCURRENCE MATRIX AND TEXTURE UNIT 2.1 BRIEF OUTLINE The classification of digital imagery is to extract useful thematic information which is one

More information

Robust Face Recognition via Sparse Representation Authors: John Wright, Allen Y. Yang, Arvind Ganesh, S. Shankar Sastry, and Yi Ma

Robust Face Recognition via Sparse Representation Authors: John Wright, Allen Y. Yang, Arvind Ganesh, S. Shankar Sastry, and Yi Ma Robust Face Recognition via Sparse Representation Authors: John Wright, Allen Y. Yang, Arvind Ganesh, S. Shankar Sastry, and Yi Ma Presented by Hu Han Jan. 30 2014 For CSE 902 by Prof. Anil K. Jain: Selected

More information

Supervised Classification in High Dimensional Space: Geometrical, Statistical and Asymptotical Properties of Multivariate Data 1

Supervised Classification in High Dimensional Space: Geometrical, Statistical and Asymptotical Properties of Multivariate Data 1 Supervised Classification in High Dimensional Space: Geometrical, Statistical and Asymptotical Properties of Multivariate Data 1 Luis Jimenez and David Landgrebe 2 Dept. Of ECE, PO Box 5000 School of Elect.

More information

Linear Discriminant Analysis for 3D Face Recognition System

Linear Discriminant Analysis for 3D Face Recognition System Linear Discriminant Analysis for 3D Face Recognition System 3.1 Introduction Face recognition and verification have been at the top of the research agenda of the computer vision community in recent times.

More information

School of Computer and Communication, Lanzhou University of Technology, Gansu, Lanzhou,730050,P.R. China

School of Computer and Communication, Lanzhou University of Technology, Gansu, Lanzhou,730050,P.R. China Send Orders for Reprints to reprints@benthamscienceae The Open Automation and Control Systems Journal, 2015, 7, 253-258 253 Open Access An Adaptive Neighborhood Choosing of the Local Sensitive Discriminant

More information

A KERNEL MACHINE BASED APPROACH FOR MULTI- VIEW FACE RECOGNITION

A KERNEL MACHINE BASED APPROACH FOR MULTI- VIEW FACE RECOGNITION A KERNEL MACHINE BASED APPROACH FOR MULI- VIEW FACE RECOGNIION Mohammad Alwawi 025384 erm project report submitted in partial fulfillment of the requirement for the course of estimation and detection Supervised

More information

Lab 9. Julia Janicki. Introduction

Lab 9. Julia Janicki. Introduction Lab 9 Julia Janicki Introduction My goal for this project is to map a general land cover in the area of Alexandria in Egypt using supervised classification, specifically the Maximum Likelihood and Support

More information

Random projection for non-gaussian mixture models

Random projection for non-gaussian mixture models Random projection for non-gaussian mixture models Győző Gidófalvi Department of Computer Science and Engineering University of California, San Diego La Jolla, CA 92037 gyozo@cs.ucsd.edu Abstract Recently,

More information

Face Recognition using Laplacianfaces

Face Recognition using Laplacianfaces Journal homepage: www.mjret.in ISSN:2348-6953 Kunal kawale Face Recognition using Laplacianfaces Chinmay Gadgil Mohanish Khunte Ajinkya Bhuruk Prof. Ranjana M.Kedar Abstract Security of a system is an

More information

DUe to the rapid development and proliferation of hyperspectral. Hyperspectral Image Classification in the Presence of Noisy Labels

DUe to the rapid development and proliferation of hyperspectral. Hyperspectral Image Classification in the Presence of Noisy Labels Hyperspectral Image Classification in the Presence of Noisy Labels Junjun Jiang, Jiayi Ma, Zheng Wang, Chen Chen, and Xianming Liu arxiv:89.422v [cs.cv] 2 Sep 28 Abstract Label information plays an important

More information

An efficient face recognition algorithm based on multi-kernel regularization learning

An efficient face recognition algorithm based on multi-kernel regularization learning Acta Technica 61, No. 4A/2016, 75 84 c 2017 Institute of Thermomechanics CAS, v.v.i. An efficient face recognition algorithm based on multi-kernel regularization learning Bi Rongrong 1 Abstract. A novel

More information

Preface to the Second Edition. Preface to the First Edition. 1 Introduction 1

Preface to the Second Edition. Preface to the First Edition. 1 Introduction 1 Preface to the Second Edition Preface to the First Edition vii xi 1 Introduction 1 2 Overview of Supervised Learning 9 2.1 Introduction... 9 2.2 Variable Types and Terminology... 9 2.3 Two Simple Approaches

More information

Fast Sample Generation with Variational Bayesian for Limited Data Hyperspectral Image Classification

Fast Sample Generation with Variational Bayesian for Limited Data Hyperspectral Image Classification Fast Sample Generation with Variational Bayesian for Limited Data Hyperspectral Image Classification July 26, 2018 AmirAbbas Davari, Hasan Can Özkan, Andreas Maier, Christian Riess Pattern Recognition

More information

CHAPTER 6 IDENTIFICATION OF CLUSTERS USING VISUAL VALIDATION VAT ALGORITHM

CHAPTER 6 IDENTIFICATION OF CLUSTERS USING VISUAL VALIDATION VAT ALGORITHM 96 CHAPTER 6 IDENTIFICATION OF CLUSTERS USING VISUAL VALIDATION VAT ALGORITHM Clustering is the process of combining a set of relevant information in the same group. In this process KM algorithm plays

More information

STRATIFIED SAMPLING METHOD BASED TRAINING PIXELS SELECTION FOR HYPER SPECTRAL REMOTE SENSING IMAGE CLASSIFICATION

STRATIFIED SAMPLING METHOD BASED TRAINING PIXELS SELECTION FOR HYPER SPECTRAL REMOTE SENSING IMAGE CLASSIFICATION Volume 117 No. 17 2017, 121-126 ISSN: 1311-8080 (printed version); ISSN: 1314-3395 (on-line version) url: http://www.ijpam.eu ijpam.eu STRATIFIED SAMPLING METHOD BASED TRAINING PIXELS SELECTION FOR HYPER

More information

Introduction to digital image classification

Introduction to digital image classification Introduction to digital image classification Dr. Norman Kerle, Wan Bakx MSc a.o. INTERNATIONAL INSTITUTE FOR GEO-INFORMATION SCIENCE AND EARTH OBSERVATION Purpose of lecture Main lecture topics Review

More information

Low-dimensional Representations of Hyperspectral Data for Use in CRF-based Classification

Low-dimensional Representations of Hyperspectral Data for Use in CRF-based Classification Rochester Institute of Technology RIT Scholar Works Presentations and other scholarship 8-31-2015 Low-dimensional Representations of Hyperspectral Data for Use in CRF-based Classification Yang Hu Nathan

More information

Exploring Structural Consistency in Graph Regularized Joint Spectral-Spatial Sparse Coding for Hyperspectral Image Classification

Exploring Structural Consistency in Graph Regularized Joint Spectral-Spatial Sparse Coding for Hyperspectral Image Classification 1 Exploring Structural Consistency in Graph Regularized Joint Spectral-Spatial Sparse Coding for Hyperspectral Image Classification Changhong Liu, Jun Zhou, Senior Member, IEEE, Jie Liang, Yuntao Qian,

More information

Stepwise Nearest Neighbor Discriminant Analysis

Stepwise Nearest Neighbor Discriminant Analysis Stepwise Nearest Neighbor Discriminant Analysis Xipeng Qiu and Lide Wu Media Computing & Web Intelligence Lab Department of Computer Science and Engineering Fudan University, Shanghai, China xpqiu,ldwu@fudan.edu.cn

More information

Emotion Classification

Emotion Classification Emotion Classification Shai Savir 038052395 Gil Sadeh 026511469 1. Abstract Automated facial expression recognition has received increased attention over the past two decades. Facial expressions convey

More information

Selecting Models from Videos for Appearance-Based Face Recognition

Selecting Models from Videos for Appearance-Based Face Recognition Selecting Models from Videos for Appearance-Based Face Recognition Abdenour Hadid and Matti Pietikäinen Machine Vision Group Infotech Oulu and Department of Electrical and Information Engineering P.O.

More information

Locality Preserving Projections (LPP) Abstract

Locality Preserving Projections (LPP) Abstract Locality Preserving Projections (LPP) Xiaofei He Partha Niyogi Computer Science Department Computer Science Department The University of Chicago The University of Chicago Chicago, IL 60615 Chicago, IL

More information

INF 4300 Classification III Anne Solberg The agenda today:

INF 4300 Classification III Anne Solberg The agenda today: INF 4300 Classification III Anne Solberg 28.10.15 The agenda today: More on estimating classifier accuracy Curse of dimensionality and simple feature selection knn-classification K-means clustering 28.10.15

More information

Classifying Images with Visual/Textual Cues. By Steven Kappes and Yan Cao

Classifying Images with Visual/Textual Cues. By Steven Kappes and Yan Cao Classifying Images with Visual/Textual Cues By Steven Kappes and Yan Cao Motivation Image search Building large sets of classified images Robotics Background Object recognition is unsolved Deformable shaped

More information

A New Orthogonalization of Locality Preserving Projection and Applications

A New Orthogonalization of Locality Preserving Projection and Applications A New Orthogonalization of Locality Preserving Projection and Applications Gitam Shikkenawis 1,, Suman K. Mitra, and Ajit Rajwade 2 1 Dhirubhai Ambani Institute of Information and Communication Technology,

More information

THE EFFECT OF TOPOGRAPHIC FACTOR IN ATMOSPHERIC CORRECTION FOR HYPERSPECTRAL DATA

THE EFFECT OF TOPOGRAPHIC FACTOR IN ATMOSPHERIC CORRECTION FOR HYPERSPECTRAL DATA THE EFFECT OF TOPOGRAPHIC FACTOR IN ATMOSPHERIC CORRECTION FOR HYPERSPECTRAL DATA Tzu-Min Hong 1, Kun-Jen Wu 2, Chi-Kuei Wang 3* 1 Graduate student, Department of Geomatics, National Cheng-Kung University

More information

Linear Methods for Regression and Shrinkage Methods

Linear Methods for Regression and Shrinkage Methods Linear Methods for Regression and Shrinkage Methods Reference: The Elements of Statistical Learning, by T. Hastie, R. Tibshirani, J. Friedman, Springer 1 Linear Regression Models Least Squares Input vectors

More information

Semi-Supervised Normalized Embeddings for Fusion and Land-Use Classification of Multiple View Data

Semi-Supervised Normalized Embeddings for Fusion and Land-Use Classification of Multiple View Data Rochester Institute of Technology RIT Scholar Works Theses Thesis/Dissertation Collections 12-2018 Semi-Supervised Normalized Embeddings for Fusion and Land-Use Classification of Multiple View Data Poppy

More information

Data: a collection of numbers or facts that require further processing before they are meaningful

Data: a collection of numbers or facts that require further processing before they are meaningful Digital Image Classification Data vs. Information Data: a collection of numbers or facts that require further processing before they are meaningful Information: Derived knowledge from raw data. Something

More information

Feature Selection for fmri Classification

Feature Selection for fmri Classification Feature Selection for fmri Classification Chuang Wu Program of Computational Biology Carnegie Mellon University Pittsburgh, PA 15213 chuangw@andrew.cmu.edu Abstract The functional Magnetic Resonance Imaging

More information

Clustering and Visualisation of Data

Clustering and Visualisation of Data Clustering and Visualisation of Data Hiroshi Shimodaira January-March 28 Cluster analysis aims to partition a data set into meaningful or useful groups, based on distances between data points. In some

More information

Hyperspectral Data Classification via Sparse Representation in Homotopy

Hyperspectral Data Classification via Sparse Representation in Homotopy Hyperspectral Data Classification via Sparse Representation in Homotopy Qazi Sami ul Haq,Lixin Shi,Linmi Tao,Shiqiang Yang Key Laboratory of Pervasive Computing, Ministry of Education Department of Computer

More information

A new Graph constructor for Semi-supervised Discriminant Analysis via Group Sparsity

A new Graph constructor for Semi-supervised Discriminant Analysis via Group Sparsity 2011 Sixth International Conference on Image and Graphics A new Graph constructor for Semi-supervised Discriminant Analysis via Group Sparsity Haoyuan Gao, Liansheng Zhuang, Nenghai Yu MOE-MS Key Laboratory

More information

Cluster Analysis. Mu-Chun Su. Department of Computer Science and Information Engineering National Central University 2003/3/11 1

Cluster Analysis. Mu-Chun Su. Department of Computer Science and Information Engineering National Central University 2003/3/11 1 Cluster Analysis Mu-Chun Su Department of Computer Science and Information Engineering National Central University 2003/3/11 1 Introduction Cluster analysis is the formal study of algorithms and methods

More information

A Supervised Non-linear Dimensionality Reduction Approach for Manifold Learning

A Supervised Non-linear Dimensionality Reduction Approach for Manifold Learning A Supervised Non-linear Dimensionality Reduction Approach for Manifold Learning B. Raducanu 1 and F. Dornaika 2,3 1 Computer Vision Center, Barcelona, SPAIN 2 Department of Computer Science and Artificial

More information

SYDE Winter 2011 Introduction to Pattern Recognition. Clustering

SYDE Winter 2011 Introduction to Pattern Recognition. Clustering SYDE 372 - Winter 2011 Introduction to Pattern Recognition Clustering Alexander Wong Department of Systems Design Engineering University of Waterloo Outline 1 2 3 4 5 All the approaches we have learned

More information

( ) =cov X Y = W PRINCIPAL COMPONENT ANALYSIS. Eigenvectors of the covariance matrix are the principal components

( ) =cov X Y = W PRINCIPAL COMPONENT ANALYSIS. Eigenvectors of the covariance matrix are the principal components Review Lecture 14 ! PRINCIPAL COMPONENT ANALYSIS Eigenvectors of the covariance matrix are the principal components 1. =cov X Top K principal components are the eigenvectors with K largest eigenvalues

More information

CLASSIFICATION WITH RADIAL BASIS AND PROBABILISTIC NEURAL NETWORKS

CLASSIFICATION WITH RADIAL BASIS AND PROBABILISTIC NEURAL NETWORKS CLASSIFICATION WITH RADIAL BASIS AND PROBABILISTIC NEURAL NETWORKS CHAPTER 4 CLASSIFICATION WITH RADIAL BASIS AND PROBABILISTIC NEURAL NETWORKS 4.1 Introduction Optical character recognition is one of

More information

Fast Anomaly Detection Algorithms For Hyperspectral Images

Fast Anomaly Detection Algorithms For Hyperspectral Images Vol. Issue 9, September - 05 Fast Anomaly Detection Algorithms For Hyperspectral Images J. Zhou Google, Inc. ountain View, California, USA C. Kwan Signal Processing, Inc. Rockville, aryland, USA chiman.kwan@signalpro.net

More information

Title: A Deep Network Architecture for Super-resolution aided Hyperspectral Image Classification with Class-wise Loss

Title: A Deep Network Architecture for Super-resolution aided Hyperspectral Image Classification with Class-wise Loss 2018 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising

More information

Learning and Inferring Depth from Monocular Images. Jiyan Pan April 1, 2009

Learning and Inferring Depth from Monocular Images. Jiyan Pan April 1, 2009 Learning and Inferring Depth from Monocular Images Jiyan Pan April 1, 2009 Traditional ways of inferring depth Binocular disparity Structure from motion Defocus Given a single monocular image, how to infer

More information

Textural Features for Hyperspectral Pixel Classification

Textural Features for Hyperspectral Pixel Classification Textural Features for Hyperspectral Pixel Classification Olga Rajadell, Pedro García-Sevilla, and Filiberto Pla Depto. Lenguajes y Sistemas Informáticos Jaume I University, Campus Riu Sec s/n 12071 Castellón,

More information

The Comparative Study of Machine Learning Algorithms in Text Data Classification*

The Comparative Study of Machine Learning Algorithms in Text Data Classification* The Comparative Study of Machine Learning Algorithms in Text Data Classification* Wang Xin School of Science, Beijing Information Science and Technology University Beijing, China Abstract Classification

More information

IEEE TRANSACTIONS ON GEOSCIENCE AND REMOTE SENSING, VOL. 54, NO. 8, AUGUST

IEEE TRANSACTIONS ON GEOSCIENCE AND REMOTE SENSING, VOL. 54, NO. 8, AUGUST IEEE TRANSACTIONS ON GEOSCIENCE AND REMOTE SENSING, VOL. 54, NO. 8, AUGUST 2016 4461 Efficient Multiple Feature Fusion With Hashing for Hyperspectral Imagery Classification: A Comparative Study Zisha Zhong,

More information

Large-Scale Face Manifold Learning

Large-Scale Face Manifold Learning Large-Scale Face Manifold Learning Sanjiv Kumar Google Research New York, NY * Joint work with A. Talwalkar, H. Rowley and M. Mohri 1 Face Manifold Learning 50 x 50 pixel faces R 2500 50 x 50 pixel random

More information

EVALUATION OF CONVENTIONAL DIGITAL CAMERA SCENES FOR THEMATIC INFORMATION EXTRACTION ABSTRACT

EVALUATION OF CONVENTIONAL DIGITAL CAMERA SCENES FOR THEMATIC INFORMATION EXTRACTION ABSTRACT EVALUATION OF CONVENTIONAL DIGITAL CAMERA SCENES FOR THEMATIC INFORMATION EXTRACTION H. S. Lim, M. Z. MatJafri and K. Abdullah School of Physics Universiti Sains Malaysia, 11800 Penang ABSTRACT A study

More information