Efficient Multiple Feature Fusion With Hashing for Hyperspectral Imagery Classification: A Comparative Study

Zisha Zhong, Bin Fan, Member, IEEE, Kun Ding, Haichang Li, Shiming Xiang, and Chunhong Pan, Member, IEEE

Abstract: Due to the complementary properties of different features, multiple feature fusion has a large potential for hyperspectral imagery classification. At the same time, hashing is promising for representing a high-dimensional float-type feature with an extremely small number of binary bits while maintaining performance. In this paper, we study the possibility of using hashing to fuse multiple features for hyperspectral imagery classification. For this purpose, we propose a multiple feature fusion framework to evaluate the performance of different hashing methods. For comparison and completeness, we also provide an extensive comparison with five subspace-based dimension reduction methods and six fusion-based methods, which are popular solutions for dealing with multiple features in hyperspectral image classification. Experimental results on four benchmark hyperspectral data sets demonstrate that fusing multiple features with hashing can achieve performance comparable to or better than that of the traditional subspace-based dimension reduction methods and fusion-based methods. Moreover, the binary features obtained by hashing need much less storage, and their distances can be computed faster with the help of machine instructions.

Index Terms: Binary codes, classification, feature fusion, hashing, hyperspectral images.

I. INTRODUCTION

HYPERSPECTRAL imaging technology provides spectral signatures with a large number of bands [1], [2], which significantly characterize the inherent physical and chemical properties of imaged objects [1], [3]. However, hyperspectral image processing turns out to be very challenging because of the high dimensionality of pixels resulting from the increased spectral resolution [4]. Consequently, it has attracted more and more interest in remote sensing and in other research communities such as machine learning and computer vision [5]. As a vital application of hyperspectral images, land-cover classification, which aims at classifying image pixels into multiple categories, remains an active research area [6], [7].

Recent studies [2], [4], [8] have demonstrated that the joint exploitation of both spectral and spatial information can significantly improve the classification accuracy. In summary, there are mainly three kinds of spectral-spatial algorithms for hyperspectral imagery classification [4]: 1) spectral-spatial feature extraction; 2) spatial-spectral segmentation [9]-[12]; and 3) other methods (e.g., kernel-based [13]-[15] and Markov random field methods [16]).
In terms of spectral-spatial feature extraction, considerable effort has been devoted to developing effective and efficient feature extraction algorithms to improve hyperspectral imagery classification. Representative techniques include filtering-based methods (e.g., Gabor filtering [17] and edge-preserving filtering [18]), morphology-based methods (e.g., extended morphological profiles (EMPs) [19] and extended attribute profiles (EAPs) [20], [21]), and spatial statistical feature extraction (e.g., gray-level co-occurrence matrices [22], [23]). These methods can achieve significant improvements in classification performance. Since a single kind of feature only carries certain characteristics of the object, some researchers have designed feature fusion approaches that integrate multiple types of features to further improve the classification performance.

Recent advances in multiple feature fusion can be divided into four categories: multiple kernel learning methods, subspace-based feature extraction methods, feature-selection-based methods, and ensemble methods. Most multiple kernel learning methods have focused on the effective or efficient composition of different kernels [5], [24]-[27]. Li et al. [15] have developed a framework to integrate multiple types of features extracted from both linear and nonlinear transformations without the need to learn the weights of the considered features. Gu et al. [28] have proposed a representative multiple kernel learning approach for efficiently determining the optimal kernels in hyperspectral image classification. Considering the subspace-based methods, Zhang et al. [29] have proposed a multiple feature combining approach that encodes different features into a low-dimensional representation based on manifold learning and the patch alignment framework [30]. Additionally, Zhang et al. [31] have introduced a modified stochastic neighbor embedding algorithm for multiple feature dimension reduction under a probability-preserving projection framework. Unlike the feature extraction methods, feature selection does not create new feature representations and keeps the physical meanings of features, thus attracting great interest from researchers.

A very recent study on multiple feature selection is [32], in which a discriminative sparse multimodal learning method is developed for multiple feature selection. By introducing a structured regularization technique, the authors have extended the original discriminative least square regression framework [33] to exploit both the intrinsic structures in data and the correlations among different features. Different from the aforementioned methods, a support vector machine (SVM) ensemble fusion method has been proposed in [34], which constructs an SVM ensemble to combine multiple spectral and spatial features at both pixel and object levels.

The existing methods for multiple feature fusion mainly focus on improving classification accuracy without considering the computational and storage costs. However, with the increasing demand for Earth observation and the development of hyperspectral imaging technology, we can obtain more and more high-quality hyperspectral images with ever higher spectral resolution. Thus, we are facing the challenge of processing a huge amount of data. As a result, efficient solutions for hyperspectral imagery classification are required, both in processing time and in memory footprint. Unfortunately, few studies have dealt with this problem. As a powerful technique for obtaining compact features and fast nearest neighbor search, hashing was not introduced into remote sensing processing until very recently [35], where it is adopted for large-scale remote sensing image retrieval. To the best of our knowledge, it has not been used in hyperspectral image classification.

In this paper, we introduce the hashing technique to extract compact binary features for hyperspectral image classification in the proposed multiple feature fusion with hashing (MFH) framework. To show the effectiveness of hashing in this task, we give a comparative study on applying different hashing techniques to generate compact binary codes. Based on extensive experimental evaluations on four standard hyperspectral data sets, we discuss the advantages and disadvantages of using hashing for fusing multiple features. The main contributions of our work are summarized as follows.

1) We propose an MFH framework that applies hashing to fuse multiple features for hyperspectral image classification and show encouraging results.

2) We conduct an extensive performance evaluation of different hashing methods for fusing multiple features on four popular hyperspectral data sets. Based on the evaluation results, we provide an in-depth discussion on the advantages, disadvantages, and applicability of different hashing methods in this task.

3) We conduct comparative experiments with five classical subspace-based dimension reduction methods and six different multiple feature fusion methods. Experiments show that, when equipped with a proper hashing learning strategy, the proposed MFH method can achieve comparable or even competitive performance. Meanwhile, the obtained binary features require much less storage and classification time.

The rest of this paper is organized as follows. Section II introduces the proposed MFH framework. Section III gives a brief introduction to the six representative hashing algorithms adopted in MFH, including three unsupervised ones and three supervised ones. Section IV describes the four data sets used and elaborates our experimental setups as well as the evaluation protocol.
Section V presents the experimental results and analysis. Then, we give an overview and guidelines for potential users based on our experimental observations in Section VI. Finally, Section VII concludes this paper with some possible future works.

II. MFH FRAMEWORK

The proposed MFH framework can be divided into three steps: 1) perform feature extraction in the hyperspectral image via efficient approaches, and concatenate the multiple features into a long feature vector for each pixel; 2) perform hashing learning on these feature vectors with or without class label information, and map the original float-type feature vectors into compact binary codes; and 3) perform classification with the obtained binary codes, and output the final classification results. The flowchart of MFH is shown in Fig. 1. In the following, we describe these three steps in detail.

A. Multiple Feature Extraction

Feature extraction plays a very important role in pattern recognition. In the remote sensing community, a lot of research effort has been devoted to this topic, and many efficient feature extraction methods have been developed. Owing to these works, different types of features can be extracted from a hyperspectral image. Without loss of generality, suppose that $N$ kinds of features are extracted for each pixel $i$. Then, we stack these features into a long vector $\mathbf{x}_i = [\mathbf{x}_{i1}, \ldots, \mathbf{x}_{iN}] \in \mathbb{R}^{1 \times D}$, where $\mathbf{x}_{ik} \in \mathbb{R}^{1 \times d_k}$ is the $k$th single feature descriptor for pixel $i$ and $d_k$ is the dimension of that feature. $D = \sum_{k=1}^{N} d_k$ is the length of the fused feature descriptor.

B. Hashing Learning

Hashing aims at mapping the original data into compact binary codes, or equivalently a sequence of bits, while preserving similarity in the original data space [36]-[38]. Due to the binary representation, distances can be computed extremely fast with the Hamming metric on modern computers, which consequently facilitates fast nearest neighbor search. Meanwhile, storing binary codes requires a much smaller memory footprint than storing float vectors.

Specifically, we are given $n$ training samples $\{\mathbf{x}_i, y_i\}_{i=1}^{n}$, where $\mathbf{x}_i \in \mathbb{R}^{1 \times D}$ is the $i$th training point with $D$ float-type features, $y_i \in \{1, \ldots, C\}$ is the class label of the $i$th training point, and $C$ is the number of labeled classes. Hashing learning aims to learn a set of hashing functions $\{h_b(\mathbf{x})\}_{b=1}^{B}$ that map the original high-dimensional float-type feature $\mathbf{x}_i \in \mathbb{R}^{1 \times D}$ $(i = 1, \ldots, n)$ into a low-dimensional binary code $\mathbf{z}_i \in \{-1, 1\}^{1 \times B}$, where $B \ll D$ is the number of bits used and the $b$th bit of $\mathbf{z}_i$ is the output of $h_b(\mathbf{x}_i)$. Once the hashing functions are learned, the fused float-type multiple feature vectors extracted from a hyperspectral image can be encoded into binary codes and given as inputs to the subsequent classification procedure.
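To make steps 1)-3) concrete, the following minimal sketch (Python/NumPy; all function names are hypothetical, and a plain random sign-of-projection hash stands in for the learned hashing functions of Section III) wires feature concatenation, binary encoding, and 1-NN classification under the Hamming metric together:

```python
import numpy as np

def concatenate_features(feature_list):
    """Step 1: stack N per-pixel feature matrices (n x d_k each) into R^{n x D}."""
    return np.hstack(feature_list)

def encode(X, W):
    """Step 2 (stand-in): a generic sign-of-projection hash, one column of W per bit."""
    return X @ W >= 0                                 # boolean codes, shape (n, B)

def predict_1nn_hamming(Z_train, y_train, Z_test):
    """Step 3: 1-NN classification under the Hamming metric."""
    d = (Z_test[:, None, :] != Z_train[None, :, :]).sum(axis=2)
    return y_train[d.argmin(axis=1)]

rng = np.random.default_rng(0)
X_train = concatenate_features([rng.random((100, 200)), rng.random((100, 45))])
X_test = concatenate_features([rng.random((20, 200)), rng.random((20, 45))])
y_train = rng.integers(0, 5, 100)
W = rng.standard_normal((X_train.shape[1], 32))       # 32-b codes
y_pred = predict_1nn_hamming(encode(X_train, W), y_train, encode(X_test, W))
```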

Fig. 1. Flowchart of the proposed MFH framework.

C. Classification With Binary Codes

Finally, these compact binary codes, together with the class labels of the training samples, are given as inputs to a nearest neighbor (1-NN) classifier with the Hamming metric to predict the class labels and output the classification results.

III. FEATURE HASHING

In the proposed MFH framework, the classification performance largely depends on the quality of the binary codes obtained by the hashing methods. As a consequence, a key problem is to design good schemes for finding good binary codes. Fortunately, there are many theoretical and practical efforts in the literature on solving this problem from various viewpoints [36]-[38]. According to whether the label information is used or not, we roughly divide the existing hashing methods into two categories: unsupervised and supervised ones. Unsupervised hashing methods try to preserve the distance-based similarity in the original feature space, while the supervised ones are developed to preserve the label-based similarity. In this section, we briefly introduce three representative hashing methods for each category. In the following sections, we adopt them in the proposed MFH framework and give a detailed evaluation of the classification performance on four popular hyperspectral data sets.

A. LSH

Locality-sensitive hashing (LSH) is one of the most basic but most popular hashing methods. Given a data point $\mathbf{x} \in \mathbb{R}^D$, for the $b$th bit of the binary code, a random vector $\mathbf{w}_b \in \mathbb{R}^D$ is drawn from a zero-mean $D$-dimensional normal distribution $\mathcal{N}(\mathbf{0}, \mathbf{I}_D)$, where $\mathbf{I}_D$ is an identity matrix, and the hashing function is defined as

$$h_b(\mathbf{x}) = \begin{cases} 1, & \text{if } \mathbf{w}_b^T \mathbf{x} \ge 0 \\ -1, & \text{if } \mathbf{w}_b^T \mathbf{x} < 0. \end{cases}$$

Both [39] and [40] have proved that random projections can preserve similarity as the number of hash bits increases, and meanwhile, they have also observed that the number of required hash bits may be large for high-dimensional data. LSH is totally probabilistic and does not take the data distribution into account, so its performance is limited. However, owing to its simplicity, LSH is popular and usually serves as the baseline for performance comparison in hashing.
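The sketch below instantiates this LSH definition directly (Python/NumPy; the bit-packing step is an implementation convenience for compact storage, not part of the definition):

```python
import numpy as np

def lsh_fit(D, B, rng):
    """Draw one random projection w_b ~ N(0, I_D) per bit."""
    return rng.standard_normal((D, B))

def lsh_encode_packed(X, W):
    bits = X @ W >= 0                               # h_b(x): +1 if w_b^T x >= 0, else -1
    return np.packbits(bits, axis=1)                # pack 8 bits per byte for storage

def hamming_packed(a, b):
    """XOR the packed codes and count set bits (a POPCNT on real hardware)."""
    return int(np.unpackbits(np.bitwise_xor(a, b)).sum())

rng = np.random.default_rng(1)
W = lsh_fit(465, 64, rng)                           # e.g., a 465-D fused feature, 64 b
codes = lsh_encode_packed(rng.random((10, 465)) - 0.5, W)   # 8 B per sample
print(hamming_packed(codes[0], codes[1]))
```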

B. KLSH

The kernelized locality-sensitive hashing (KLSH) proposed by Kulis et al. [41] generalizes LSH by using the kernel technique, which makes it possible to embed high-dimensional features or complex distance functions into a low-dimensional Hamming space. The main idea of KLSH is to approximate the Gaussian-based random projections in the kernel space by a weighted combination of kernel-mapped anchors selected from the input space, based on the central limit theorem [42]. Specifically, given an arbitrary kernel function $\kappa(\mathbf{x}_i, \mathbf{x}_j)$, the main steps are as follows. 1) Select $p$ data points from the input space, and form a kernel matrix $\mathbf{K}$ upon them. 2) Center the kernel matrix $\mathbf{K}$. 3) For each hash function $h_b(\phi(\mathbf{x}))$, form an indicator vector $\mathbf{e}_S$ by randomly selecting $t$ indices from $[1, \ldots, p]$, then form $\mathbf{w}_b = \mathbf{K}^{-1/2} \mathbf{e}_S$, and generate a bit according to $h_b(\phi(\mathbf{x})) = \operatorname{sign}(\sum_{i=1}^{p} \mathbf{w}_b(i)\, \kappa(\mathbf{x}, \mathbf{x}_i))$, where $\operatorname{sign}(\cdot)$ is the sign function.

As stated in [41], KLSH is simple, has general applicability, and is usually preferable in cases where the computation of the hash functions depends on the kernel embeddings. However, the computation of $\mathbf{K}^{-1/2}$ has $O(p^3)$ time complexity, the computation of $\mathbf{w}_b$ has $O(p^2)$ time complexity, and, with the obtained $\mathbf{w}_b$, evaluating $h_b(\phi(\mathbf{x}))$ needs $O(p)$ time. As a consequence, it is suggested that $p$ be much smaller than $n$ in order to maintain efficiency [41].

C. SH

Weiss et al. [43] have formalized the problem of finding hashing functions as graph partitioning and developed a solution based on spectral relaxation, where the hash bits are calculated by thresholding a subset of eigenvectors of the Laplacian of the similarity graph. Specifically, the optimization criterion is to minimize the average Hamming distance between similar items, which is formulated as

$$\min \operatorname{trace}(\mathbf{Z}^T \mathbf{L} \mathbf{Z}) \quad \text{s.t.} \quad \mathbf{Z} \in \{-1, +1\}^{n \times B}, \ \mathbf{Z}^T \mathbf{1}_n = \mathbf{0}, \ \mathbf{Z}^T \mathbf{Z} = \mathbf{I}_B$$

where $\operatorname{trace}(\cdot)$ is the trace operator, $\mathbf{Z}$ contains the binary codes of the $n$ training samples, and $\mathbf{L}$ is the graph Laplacian in the original space. The constraint $\mathbf{Z}^T \mathbf{1}_n = \mathbf{0}$, with $\mathbf{1}_n \in \mathbb{R}^n$ an all-one vector, ensures that each bit has an equal number of $-1$ and $+1$ entries, while the constraint $\mathbf{Z}^T \mathbf{Z} = \mathbf{I}_B$, with $\mathbf{I}_B$ the $B \times B$ identity matrix, requires the hashing bits to be uncorrelated with each other. The optimization is solved by relaxing $\mathbf{Z}$ to continuous values, and the final binary codes are obtained by thresholding $\mathbf{Z}$. By utilizing recent results on the convergence of graph Laplacian eigenvectors to the Laplace-Beltrami eigenfunctions of manifolds, the authors have also shown how to efficiently calculate the binary code of a testing sample. As reported in [43], spectral hashing (SH) has two limitations. The first is the assumption of a multidimensional uniform distribution of the data, which usually does not hold in real cases. The second is that the eigenvalues from the outer-product eigenfunctions are discarded to maintain the uncorrelatedness of bits, which breaks the uncorrelated property when there is more than one eigenfunction along a single PCA direction and consequently results in deteriorated performance [37].
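For illustration, a minimal sketch of the relaxed training-set solution described above, assuming a dense Gaussian similarity graph on a small training set; the out-of-sample extension via Laplace-Beltrami eigenfunctions is omitted:

```python
import numpy as np

def spectral_codes(X, B, sigma):
    """Relaxed SH codes for a training set: threshold graph Laplacian eigenvectors."""
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    W = np.exp(-sq / (2.0 * sigma ** 2))            # similarity graph
    L = np.diag(W.sum(axis=1)) - W                  # graph Laplacian
    vals, vecs = np.linalg.eigh(L)                  # eigenvalues in ascending order
    Y = vecs[:, 1:B + 1]                            # drop the trivial constant eigenvector,
                                                    # which approximately enforces Z^T 1 = 0
    return np.where(Y >= 0, 1, -1)                  # relax to continuous, then threshold

Z = spectral_codes(np.random.default_rng(2).random((150, 60)), B=16, sigma=1.0)
```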
D. KSH

Liu et al. [44] have proposed a kernel-based supervised hashing (KSH) model in which the supervised information comes from similar and dissimilar data pairs. The idea is to map the original data to compact binary codes whose Hamming distances are minimized on similar pairs and simultaneously maximized on dissimilar pairs. By utilizing the algebraic equivalence between the Hamming distance and the inner product, the authors designed an efficient greedy algorithm to learn the hashing functions one by one. To deal with linearly inseparable data, kernel-based hash functions are adopted, which are defined as [41]

$$h_b(\mathbf{x}) = \operatorname{sign}(f(\mathbf{x})) = \operatorname{sign}\left(\sum_{j=1}^{m} \kappa(\mathbf{x}_{(j)}, \mathbf{x})\, a_j - c\right), \quad c = \frac{1}{n} \sum_{i=1}^{n} \sum_{j=1}^{m} \kappa(\mathbf{x}_{(j)}, \mathbf{x}_i)\, a_j$$

where $m$ is the number of uniformly selected anchor samples for kernel computation, $a_j \in \mathbb{R}$ is a coefficient, and $c$ is the bias. After simplifying the notation, we obtain $f(\mathbf{x}) = \mathbf{a}^T \bar{\mathbf{k}}(\mathbf{x})$, where $\mathbf{a} = [a_1, \ldots, a_m]^T$ and

$$\bar{\mathbf{k}}(\mathbf{x}) = \left[\kappa(\mathbf{x}_{(1)}, \mathbf{x}) - \mu_1, \ldots, \kappa(\mathbf{x}_{(m)}, \mathbf{x}) - \mu_m\right]^T$$

with $\mu_j = (1/n) \sum_{i=1}^{n} \kappa(\mathbf{x}_{(j)}, \mathbf{x}_i)$. Different from KLSH, where the coefficient vector is a random projection constructed from a subset of data samples, here the vector $\mathbf{a}$ is optimized by leveraging supervised information, which results in more discriminative hashing functions. Specifically, considering $B$ hash bits, the problem is to find $B$ coefficient vectors $\mathbf{a}_1, \ldots, \mathbf{a}_B$ to construct $B$ hashing functions $\{h_b(\mathbf{x}) = \operatorname{sign}(\mathbf{a}_b^T \bar{\mathbf{k}}(\mathbf{x}))\}_{b=1}^{B}$.

Since directly optimizing the Hamming distance is nontrivial, the authors deduced an equivalence between the inner product and the Hamming distance: $\operatorname{code}(\mathbf{x}_i) \circ \operatorname{code}(\mathbf{x}_j) = B - 2 D_h(\mathbf{x}_i, \mathbf{x}_j)$, where $D_h(\mathbf{x}_i, \mathbf{x}_j)$ is the Hamming distance between the binary codes of $\mathbf{x}_i$ and $\mathbf{x}_j$. Based on this relationship, the optimization problem is modeled as

$$\min_{\mathbf{H}_n \in \{-1,+1\}^{n \times B}} Q = \left\| \frac{1}{B} \mathbf{H}_n \mathbf{H}_n^T - \mathbf{S} \right\|_F^2 \tag{1}$$

where $\|\cdot\|_F$ is the Frobenius norm, $\mathbf{H}_n$ denotes the binary code matrix whose rows are $\operatorname{code}(\mathbf{x}_1), \ldots, \operatorname{code}(\mathbf{x}_n)$ for the $n$ training samples, $\operatorname{code}(\mathbf{x})$ is the binary code of $\mathbf{x}$, and the label matrix $\mathbf{S}$ records the pairwise relationships

$$S_{ij} = \begin{cases} 1, & (\mathbf{x}_i, \mathbf{x}_j) \in \mathcal{M} \\ -1, & (\mathbf{x}_i, \mathbf{x}_j) \in \mathcal{C} \\ 0, & \text{otherwise} \end{cases}$$

where $\mathcal{M}$ is the set of similar pairs and $\mathcal{C}$ is the set of dissimilar pairs. Then, with some simple manipulations, the code matrix can be rewritten as $\mathbf{H}_n = \operatorname{sign}(\bar{\mathbf{K}}_n \mathbf{A})$, where $\bar{\mathbf{K}}_n = [\bar{\mathbf{k}}(\mathbf{x}_1), \ldots, \bar{\mathbf{k}}(\mathbf{x}_n)]^T \in \mathbb{R}^{n \times m}$ and $\mathbf{A} = [\mathbf{a}_1, \ldots, \mathbf{a}_B] \in \mathbb{R}^{m \times B}$. Substituting $\mathbf{H}_n$ into (1), the final objective is to minimize

$$Q(\mathbf{A}) = \left\| \frac{1}{B} \operatorname{sign}(\bar{\mathbf{K}}_n \mathbf{A}) \left( \operatorname{sign}(\bar{\mathbf{K}}_n \mathbf{A}) \right)^T - \mathbf{S} \right\|_F^2.$$

Based on the separable property of the code inner products, a greedy optimization is proposed to obtain the hashing functions bit by bit. Finally, with the obtained coefficient vectors $\mathbf{a}_1, \ldots, \mathbf{a}_B$, the binary code of a testing sample $\mathbf{x}_i$ is computed as $[\operatorname{sign}(\bar{\mathbf{k}}^T(\mathbf{x}_i)\mathbf{a}_1), \ldots, \operatorname{sign}(\bar{\mathbf{k}}^T(\mathbf{x}_i)\mathbf{a}_B)]$. Owing to the class-label-based similarity-preserving objective, KSH can yield short yet discriminative codes.
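The algebraic identity that KSH builds on is easy to verify numerically; the following snippet checks $\operatorname{code}(\mathbf{x}_i) \circ \operatorname{code}(\mathbf{x}_j) = B - 2D_h(\mathbf{x}_i, \mathbf{x}_j)$ on random codes:

```python
import numpy as np

rng = np.random.default_rng(3)
B = 48
zi = rng.choice([-1, 1], B)
zj = rng.choice([-1, 1], B)
inner = int(zi @ zj)                      # code inner product
d_h = int((zi != zj).sum())               # Hamming distance
assert inner == B - 2 * d_h               # matching bits give +1, mismatches give -1
```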

E. CCA-Based ITQ

Iterative quantization (ITQ) [45] finds a rotation of the data that minimizes the quantization error between the PCA-reduced data and the binary codes. Originally, ITQ was developed to learn binary codes in an unsupervised manner. It consists of two steps: 1) apply PCA to the original data $\mathbf{X} \in \mathbb{R}^{n \times D}$: $\mathbf{V} = \mathbf{X}\mathbf{P} \in \mathbb{R}^{n \times B}$, where $\mathbf{P} \in \mathbb{R}^{D \times B}$ $(B \le D)$ is the PCA projection matrix, and 2) find the binary codes $\mathbf{Z} \in \{-1, +1\}^{n \times B}$ along with an optimal rotation matrix $\mathbf{R}$ based on the PCA-reduced data $\mathbf{V}$ by minimizing the quantization loss $Q(\mathbf{Z}, \mathbf{R}) = \|\mathbf{Z} - \mathbf{V}\mathbf{R}\|_F^2$. The problem is solved via two-step alternating optimization. The first step updates $\mathbf{Z}$ by $\mathbf{Z} = \operatorname{sign}(\mathbf{V}\mathbf{R})$ with $\mathbf{R}$ fixed. The second step updates $\mathbf{R}$ with $\mathbf{Z}$ fixed by solving a standard orthogonal Procrustes problem. For testing samples, the obtained $\mathbf{R}$ is also used to transform their PCA-reduced features into binary codes. When label information is available, the PCA can be replaced with canonical correlation analysis (CCA) [46], resulting in the CCA-based ITQ algorithm (CCA-ITQ). As shown in [45], CCA-ITQ is very effective and can significantly improve the performance of image retrieval.
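A compact sketch of this alternating optimization is given below (Python/NumPy); it assumes the input V is already PCA- or CCA-reduced and initializes R with a random rotation:

```python
import numpy as np

def itq(V, iters=50, seed=0):
    """Alternating minimization of ||Z - V R||_F^2 over codes Z in {-1,+1}^{n x B}
    and a B x B rotation R, following the two-step scheme described above."""
    rng = np.random.default_rng(seed)
    R, _ = np.linalg.qr(rng.standard_normal((V.shape[1], V.shape[1])))  # random rotation init
    for _ in range(iters):
        Z = np.where(V @ R >= 0, 1.0, -1.0)      # fix R, update the codes
        U, _, Wt = np.linalg.svd(V.T @ Z)        # fix Z: orthogonal Procrustes solution
        R = U @ Wt
    return np.where(V @ R >= 0, 1, -1), R

# V would be the PCA-reduced (unsupervised) or CCA-reduced (supervised) training data.
Z, R = itq(np.random.default_rng(4).random((300, 32)) - 0.5, iters=30)
```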
F. FastHash

Lin et al. [47], [48] have proposed a flexible yet simple supervised hashing framework that is able to accommodate different types of loss functions and hash functions. Their approach consists of two steps: binary code inference and hashing function learning. The first step is formulated as a binary quadratic program that is proved to be block-submodular, so the graph cut technique can be used for an efficient solution. In the second step, boosted decision trees are adopted as supervised hashing functions, which are very fast to train and evaluate on large-scale high-dimensional data. More specifically, the authors first formulated the hashing learning problem as

$$\min_{\Phi(\cdot)} \sum_{i=1}^{n} \sum_{j=1}^{n} \delta(y_{ij} \neq 0)\, L(\Phi(\mathbf{x}_i), \Phi(\mathbf{x}_j); y_{ij}) \tag{2}$$

where $\Phi(\mathbf{x}) = [h_1(\mathbf{x}), \ldots, h_B(\mathbf{x})] \in \{-1, +1\}^B$ is the binary code for data point $\mathbf{x}$, $\delta(\cdot)$ is the indicator function, and $\delta(y_{ij} \neq 0) \in \{0, 1\}$ indicates whether the relation between two data points is defined, with $y_{ij} = 1$ indicating that $\mathbf{x}_i$ and $\mathbf{x}_j$ are similar and $y_{ij} = -1$ that they are dissimilar. $L(\cdot)$ is a loss function that measures how well the binary codes match the ground truth similarity $y_{ij}$. By introducing auxiliary variables $z_{i,b} = h_b(\mathbf{x}_i) \in \{-1, +1\}$ as the output of the $b$th hash function on $\mathbf{x}_i$, problem (2) is decomposed into two subproblems

$$\min_{\mathbf{Z} \in \{-1,+1\}^{n \times B}} \sum_{i=1}^{n} \sum_{j=1}^{n} \delta(y_{ij} \neq 0)\, L(\mathbf{z}_i, \mathbf{z}_j; y_{ij})$$

$$\min_{\Phi(\cdot)} \sum_{b=1}^{B} \sum_{i=1}^{n} \delta\left(z_{i,b} \neq h_b(\mathbf{x}_i)\right)$$

where $\mathbf{z}_i = [z_{i,1}, \ldots, z_{i,B}]$ is the binary code for $\mathbf{x}_i$. The $B$ hash functions are solved one by one, and for each hash function, the two subproblems are solved alternately. After solving for bit $b$, the binary codes are updated by applying the learned hash functions, so the learned hash function can influence the solution for the following bits. For the binary code inference of the $b$th hash bit, let the binary codes of the previous $(b-1)$ bits be fixed; the problem is then simplified as

$$\min_{\mathbf{z}_{(b)}} \sum_{i=1}^{n} \sum_{j=1}^{n} \delta(y_{ij} \neq 0)\, l_b(z_{i,b}, z_{j,b}; y_{ij}) \tag{3}$$

where $\mathbf{z}_{(b)} = [z_{1,b}, z_{2,b}, \ldots, z_{n,b}] \in \{-1, +1\}^n$ collects the $b$th bit values of the binary codes of the $n$ training samples, and $l_b$ is the loss for the $b$th bit, conditioned on the previous $(b-1)$ bits

$$l_b(z_{i,b}, z_{j,b}; y_{ij}) = L\left(z_{i,b}, z_{j,b}; \mathbf{z}_i^{(b-1)}, \mathbf{z}_j^{(b-1)}, y_{ij}\right)$$

where $\mathbf{z}_i^{(b-1)}$ denotes the binary code of $\mathbf{x}_i$ in all previous $(b-1)$ bits. It has been proved that problem (3) can be rewritten as a standard binary quadratic optimization problem for any Hamming affinity or distance-based loss function $L(\cdot)$, and several different loss functions were discussed in [48]. A graph-cut-based block search method is then developed for efficient solutions of large-scale problems.

For the second step, hashing function learning amounts to solving a binary classification problem for each hash function, with the binary codes obtained in the first step used as the classification labels. Any binary classifier can be directly applied. To effectively deal with high-dimensional nonlinear data, boosted decision trees are adopted. Finally, the hash function is defined as a linear combination of decision trees

$$h_b(\mathbf{x}) = \operatorname{sign}\left(\sum_{q=1}^{Q} w_q T_q(\mathbf{x})\right) \tag{4}$$

where $T_q(\cdot) \in \{-1, +1\}$ denotes the $q$th decision tree with binary output, $Q$ is the number of decision trees, and $w_q$ is a weighting coefficient, which can be solved for by AdaBoost. As shown in [48], since FastHash adopts boosted decision trees as the hash functions, which can effectively and efficiently deal with high-dimensional nonlinear data, it can generate highly descriptive binary codes.
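As a rough illustration of the second step only, the sketch below fits one boosted-tree classifier per bit to reproduce a given code matrix; scikit-learn's GradientBoostingClassifier is used as a stand-in for the AdaBoost-weighted trees of [48], and the random target bits merely stand in for the graph-cut-inferred codes of the first step:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

def fit_tree_hash(X, target_bits, n_trees=20, depth=3):
    """Step 2 only: per bit, fit a boosted-tree classifier to reproduce the codes
    inferred in step 1 (the graph-cut inference itself is omitted here)."""
    models = []
    for b in range(target_bits.shape[1]):
        clf = GradientBoostingClassifier(n_estimators=n_trees, max_depth=depth)
        clf.fit(X, target_bits[:, b])      # one binary classification problem per bit
        models.append(clf)
    return models

def tree_hash_encode(models, X):
    return np.stack([m.predict(X) for m in models], axis=1)   # codes in {0, 1}^{n x B}

rng = np.random.default_rng(5)
X = rng.random((200, 465))
target_bits = (X @ rng.standard_normal((465, 8)) >= 0).astype(int)  # stand-in codes
codes = tree_hash_encode(fit_tree_hash(X, target_bits), X[:5])
```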

IV. EXPERIMENTAL SETUPS

In this section, we present an experimental evaluation of the proposed MFH framework for hyperspectral imagery classification. Section IV-A describes the four standard benchmark hyperspectral data sets. Section IV-B briefly introduces the four well-known feature extraction methods that are used to generate the multiple feature vectors for fusion. Section IV-C describes the evaluated methods and their parameter settings. Finally, Section IV-D defines the evaluation criteria for the experimental analysis.

A. Data Sets

Indian Pines: This data set was captured by the Airborne Visible/Infrared Imaging Spectrometer (AVIRIS) sensor over a mixed agricultural/forested region in Northwest Indiana, on June 12, 1992. It has a spatial size of 145 x 145 pixels and 220 spectral bands with a spatial resolution of 20 m/pixel. It contains 16 land-cover classes, whose numbers of labeled samples disproportionately range from 20 to 2468 pixels. In the experiments, we remove 20 noisy bands (104-108, 150-163, and 220) due to water absorption and use the remaining 200 bands. The false color and ground truth images are shown in Fig. 2.

Fig. 2. (a) False color image of the Indian Pines data set, (b) its ground truth image, and (c) its labeled classes.

University of Pavia: This data set was acquired by the Reflective Optics System Imaging Spectrometer (ROSIS), covering an urban area of the University of Pavia, Italy. Originally, the ROSIS sensor provided 115 bands from 0.43 to 0.86 μm. After removing the 12 noisiest bands, the remaining 103 bands are used in the experiments. The spatial size of this data set is 610 x 340 pixels, and its spatial resolution is 1.3 m/pixel. There are 9 classes, with labeled sample sizes starting from 1026 pixels. The false color and ground truth images are shown in Fig. 3.

Fig. 3. (a) False color image of the University of Pavia data set, (b) its ground truth image, (c) the given training samples, and (d) its labeled classes.

Salinas: This data set was captured by the AVIRIS sensor over Salinas Valley, CA, USA, with a spatial resolution of 3.7 m/pixel. It has a spatial size of 512 x 217 pixels with 224 spectral bands. In our experiments, 20 water absorption bands (108-112, 154-167, and 224) are discarded. This data set has 16 classes, whose sample sizes start from 916 pixels. The false color and ground truth images are shown in Fig. 4.

Fig. 4. (a) False color image of the Salinas data set, (b) its ground truth image, and (c) its labeled classes.

Houston: This data set was initially distributed in the 2013 IEEE Geoscience and Remote Sensing Data Fusion Contest [49], [50]. It includes an urban hyperspectral data set and a light detection and ranging (LiDAR) derived digital surface model, both geographically referenced and at the same spatial resolution (2.5 m). The hyperspectral data set has 144 bands in the 380-1050-nm spectral region. There are 15 classes of interest selected by the organizers. Fig. 5 shows the false color image of the hyperspectral data, the colored LiDAR image, and the given training and testing samples.

For these four data sets, the number of training and testing samples for each class is described in the respective experimental section.

B. Multiple Feature Extraction

Four kinds of commonly used features in hyperspectral image processing are extracted for each pixel: 1) the original spectral feature (denoted as Spectral); 2) the EMP feature (denoted as EMP) [19]; 3) the EAP feature (denoted as EAP) [20], [21]; and 4) the Gabor filtering feature (denoted as Gabor). We briefly introduce each of them as follows.

1) Spectral: The spectral feature of a pixel is its spectral signature, represented as $\mathbf{x}_{spe} \in \mathbb{R}^{d_1}$, where $d_1$ is the number of bands of the hyperspectral image.

2) EMP: For each top principal component image of a hyperspectral image, the morphological profiles [19] are generated by applying morphological operations with several hand-designed structuring elements. The EMP feature of a pixel is then constructed by concatenating the morphological profiles, denoted as $\mathbf{x}_{emp} \in \mathbb{R}^{d_2}$, where $d_2$ is the number of morphological elements times the number of considered principal components of the hyperspectral image.
3) EAP: The EAP feature [20], [21] is also based on the top principal component images of a hyperspectral image. First, a max-tree is constructed for each principal component image, and several attribute filters are applied to the constructed max-tree. Finally, the EAP feature of each pixel is obtained by concatenating all attribute profiles extracted from these principal component images. We denote it as $\mathbf{x}_{eap} \in \mathbb{R}^{d_3}$, where $d_3$ is the number of attribute filters times the number of principal components of the hyperspectral image.

4) Gabor: The top principal component images of a hyperspectral image are convolved with several Gabor filters of different orientations and scales, and the filtering coefficients are then extracted as features for each pixel, represented as $\mathbf{x}_{gabor} \in \mathbb{R}^{d_4}$, where $d_4$ is the number of Gabor filters times the number of principal components of the hyperspectral image.
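As an illustration of item 4), the following sketch extracts Gabor features from the first principal component (Python with scikit-image; the geometric frequency schedule is an assumption for the demo, not the setting used in the experiments):

```python
import numpy as np
from skimage.filters import gabor

def top_pc_image(cube):
    """First principal component image of an (H, W, bands) hyperspectral cube."""
    H, W, d = cube.shape
    X = cube.reshape(-1, d)
    X = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    return (X @ Vt[0]).reshape(H, W)

def gabor_features(pc, n_scales=5, n_orient=8):
    """Per-pixel Gabor magnitudes: d_4 = n_scales * n_orient coefficients."""
    feats = []
    for s in range(n_scales):
        for o in range(n_orient):
            real, imag = gabor(pc, frequency=0.35 / 1.5 ** s,   # assumed schedule
                               theta=np.pi * o / n_orient)
            feats.append(np.hypot(real, imag))   # magnitude of the complex response
    return np.stack(feats, axis=-1)              # (H, W, d_4), here d_4 = 40

cube = np.random.default_rng(6).random((64, 64, 103))
x_gabor = gabor_features(top_pc_image(cube))     # (64, 64, 40)
```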

Fig. 5. (a) False color image of the Houston data set. (b) Its coregistered LiDAR image. (c) Training samples. (d) Testing samples.

With these features extracted, we simply concatenate them into a fused high-dimensional multiple feature vector $\mathbf{x}_{multi} = [\mathbf{x}_{spe}, \mathbf{x}_{emp}, \mathbf{x}_{eap}, \mathbf{x}_{gabor}] \in \mathbb{R}^{1 \times D}$ $(D = \sum_{k=1}^{4} d_k)$ and feed it into the subsequent hashing learning procedure to obtain compact binary feature representations.

C. Evaluated Methods

In this paper, we evaluate six representative hashing methods in our MFH framework, which are introduced in Section III, including three unsupervised ones (LSH [40], KLSH [41], and SH [43]) and three supervised ones (KSH [44], CCA-ITQ [45], and FastHash [48]). Since MFH is based on the concatenated multiple feature vectors and does not explicitly consider the structure information in multiple features as in [29], [31], [32], and [51], we select the conventional subspace-based dimension reduction methods for a fair and reasonable comparison. These subspace-based methods include three linear dimension reduction methods (PCA [52], LDA [53], and NWFE [54]) and two kernel-based ones (KPCA [55], [56] and KLDA [56]). In addition, from the viewpoint of multiple feature fusion, we compare two kernel-based fusion methods (multiple feature learning, MFL [15], and simple multiple kernel learning, MKL [25], [57]), two SVM-ensemble-based feature fusion methods (certainty voting fusion, C-Fusion [34], and probability fusion, P-Fusion [34]), and two feature-selection-based fusion methods (minimum redundancy maximum mutual information, mRMR [58], and discriminative sparse multimodal learning, DSML-FS [32], [33]). As an apparent baseline, the concatenated vector itself is also evaluated (denoted as MultiFeature).

In our evaluations, for the hashing-based methods, the obtained binary codes are given as inputs to 1-NN classifiers with the Hamming distance for classification. We select the number of hashing bits from the range [8, 16, 32, 48, 64]. For the dimension reduction methods, we select the reduced dimensions from the range [2, 4, ..., 14, 16, 20, 30, ..., 100]. The reduced feature vectors are then given as inputs to 1-NN classifiers with the Euclidean distance to conduct classification. For the subspace-based dimension reduction methods, we select the hyperparameters by grid search on the training set of each data set. KPCA and KLDA both have a bandwidth parameter for the Gaussian kernel. We decide their optimal values with a strategy similar to that described in [59] and [60]. More specifically, for each data set, we first randomly select 5000 samples as anchors. For each anchor $\mathbf{x}_i$ $(i = 1, \ldots, 5000)$, we compute its nearest neighbor distance $d_i^{NN}$ to the remaining anchors. These nearest neighbor distances are averaged to guide the search of the bandwidth, i.e., we select $\sigma$ from the candidates (using MATLAB expressions) $[1{:}9,\ 10{:}10{:}200,\ 300{:}100{:}1000,\ 2000{:}1000{:}10000] \cdot \mu$, where $\mu = (1/5000) \sum_{i=1}^{5000} d_i^{NN}$. In addition, a linear scaling normalization is applied to the reduced features before feeding them into the classifier. Note that all features are linearly normalized into [0, 1] based on the training set before they are processed. Similar strategies are adopted for the other methods with hyperparameters to be tuned.
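The bandwidth search heuristic just described can be sketched as follows (Python with scikit-learn; the anchor count and the candidate grid mirror the description above):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def bandwidth_candidates(X, n_anchors=5000, seed=0):
    """Candidate Gaussian bandwidths scaled by the mean nearest-neighbor distance."""
    rng = np.random.default_rng(seed)
    anchors = X[rng.choice(len(X), size=min(n_anchors, len(X)), replace=False)]
    dists, _ = NearestNeighbors(n_neighbors=2).fit(anchors).kneighbors(anchors)
    mu = dists[:, 1].mean()                # column 0 is each point's distance to itself
    grid = np.concatenate([np.arange(1, 10),
                           np.arange(10, 201, 10),
                           np.arange(300, 1001, 100),
                           np.arange(2000, 10001, 1000)])
    return grid * mu

sigmas = bandwidth_candidates(np.random.default_rng(7).random((1000, 278)))
```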
For MFL, with the four kinds of features described previously, we consider all four linear features and their nonlinear counterparts based on the Gaussian RBF kernel, i.e., $\{h_{spe}, K_{spe}, h_{emp}, K_{emp}, h_{eap}, K_{eap}, h_{gabor}, K_{gabor}\}$. Taking the spectral feature as an example, $h_{spe}$ is the spectral feature itself, and the $K_{spe}$ feature is obtained by applying the Gaussian RBF kernel to the spectral feature. A similar strategy is adopted for the other three types of features. For MKL, we take Gaussian kernels with several bandwidths to compute multiple kernel matrices based on the concatenated feature vectors. For C-Fusion and P-Fusion, we implement them ourselves and select the optimal spatial bandwidth and range bandwidth of the mean-shift segmentation in the ranges [1, 2, ..., 10] and [10, 11, ..., 20], respectively. For DSML-FS, we select its parameters as suggested in [32] and [33].

D. Evaluation Criterion

As evaluation metrics of classification performance, we report 1) the overall accuracy (OA), which is the number of well-classified samples divided by the number of test samples; 2) the kappa statistic (κ), defined as the percentage of agreement corrected by the amount of agreement that can be expected due to chance alone; and 3) the per-class classification accuracies. Note that, in the experiments using randomly selected training samples, the average accuracies over ten trials are reported along with their respective standard deviations.
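For reference, OA and κ can be computed from the confusion matrix as in the following sketch:

```python
import numpy as np

def oa_and_kappa(y_true, y_pred, n_classes):
    """OA and kappa from the confusion matrix."""
    cm = np.zeros((n_classes, n_classes))
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    n = cm.sum()
    p_o = np.trace(cm) / n                                   # observed agreement (equals OA)
    p_e = (cm.sum(axis=0) * cm.sum(axis=1)).sum() / n ** 2   # chance agreement
    return p_o, (p_o - p_e) / (1 - p_e)

oa, kappa = oa_and_kappa([0, 1, 2, 2, 1], [0, 1, 2, 1, 1], n_classes=3)
```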

TABLE I. PERFORMANCE OF DIFFERENT HASHING METHODS ON THE INDIAN PINES DATA SET. FOR REFERENCE, WE ALSO INCLUDE THE RESULTS OF DIRECTLY USING THE ORIGINAL CONCATENATED FEATURES. THE NUMBER IN EACH BRACKET IS THE NUMBER OF BYTES USED.

Besides classification accuracy, we also report the running times. For each method, we record three types of computational time: 1) the training time for learning projection matrices (dimension reduction) or hashing functions (feature hashing), measured in seconds; 2) the time for extracting float-type low-dimensional features or binary codes from the concatenated multiple features, measured in microseconds (the time used for computing each of the single features is not recorded since it is identical for all methods); and 3) the average time for distance computation between any two features (float-type vectors or binary codes), measured in nanoseconds and denoted by Time(ns). For MFL and MKL, which directly take the multiple features to conduct classification with multiple kernels and thus do not involve 1-NN classification, we record the processing time under 3) and set 2) to 0. Since C-Fusion and P-Fusion involve multiple intermediate steps and their fusion mainly operates on the classification results of multiple SVM classifiers, we do not present their distance computation times. For mRMR and DSML-FS, we only summarize the distance computation times based on the selected features. All experiments are conducted using MATLAB 8.10 (R2013a, glnxa64) on an Intel CPU at 3.40 GHz with 16-GB RAM running Ubuntu (64 b).

V. RESULTS AND ANALYSIS

A. Experiment 1: Indian Pines Data Set

We first report the results obtained on the Indian Pines data set. As shown in Fig. 2, this data set has 16 classes, mainly consisting of agricultural land-cover classes. For each class, we randomly select 50 samples as the training set and use the remainder as the testing set. For classes with fewer than 50 samples, we randomly select 15 samples as the training set. This procedure is repeated ten times, and the averaged results and their standard deviations are recorded. For multiple feature extraction, the four aforementioned features are extracted. Specifically, for Spectral, we have a spectral vector $\mathbf{x}_{spe} \in \mathbb{R}^{200}$. For EMP and EAP, we take the top five principal components to extract the morphological or attribute profiles. For EMP, nine MP features are computed for each component with disk-shaped structuring elements whose radius increases from 1 with a step size of 2; thus, the EMP feature vector of each pixel is $\mathbf{x}_{emp} \in \mathbb{R}^{45}$. For EAP, four attributes are computed for each principal component with the same parameters as in [20], which leads to an EAP feature vector $\mathbf{x}_{eap} \in \mathbb{R}^{180}$. For Gabor, we first generate Gabor filters with five spatial scales and eight frequency orientations under the wavelet framework and then convolve the first principal component with these filters; consequently, the Gabor feature vector of each pixel is $\mathbf{x}_{gabor} \in \mathbb{R}^{40}$. As a result, the size of the concatenated multiple feature vector (i.e., the MultiFeature) is 465.

Comparison of Classification Accuracy: Table I gives the classification performance of the different hashing methods used in the proposed MFH framework. For comparison, we also report the results of the classical float-vector subspace-based methods, which are listed in Table II. From Table I, we can see that the supervised hashing methods (KSH, CCA-ITQ, and FastHash) improve the classification accuracy significantly compared to the original MultiFeature, and FastHash achieves the best result on this data set.
Comparing with Table II, we can find that FastHash is even better than most float-vector-based methods (PCA, KPCA, and NWFE), which are more time-consuming and memory expensive. Note that the results of FastHash are comparable to those of LDA and KLDA, but FastHash only requires about 1/8 of the memory to store the obtained features. It can also be found from Table I that LSH and SH perform worse than the MultiFeature, due to the random hash functions used in LSH and the intrinsic assumption of a uniform distribution in SH. However, since KLSH operates in the kernelized feature space and thereby considers the relations among samples, it can generate more descriptive binary codes even though it uses a random projection scheme similar to that of LSH. Table III shows the classification accuracies of the different fusion methods. FastHash achieves performance competitive with the best fusion method (i.e., MFL), but the storage of its features amounts to only about 1/230 of the latter.

TABLE II. PERFORMANCE OF VARIOUS FLOAT-VECTOR METHODS ON THE INDIAN PINES DATA SET. THE NUMBER IN EACH BRACKET IS THE NUMBER OF BYTES REQUIRED TO STORE ONE FEATURE.

TABLE III. PERFORMANCE OF VARIOUS MULTIPLE FEATURE FUSION METHODS ON THE INDIAN PINES DATA SET. THE NUMBER IN EACH BRACKET IS THE NUMBER OF BYTES REQUIRED TO STORE ONE FEATURE.

Comparison of Classification Map: Fig. 6 illustrates the classification maps obtained by the different methods. These maps are generated from one of the ten runs. From these figures, we have the following observations. First, for the float-vector-based methods, the classification maps obtained by the supervised dimension reduction methods (LDA, KLDA, and NWFE) show better visual effects with smoother classification predictions than those obtained by PCA or KPCA. Both PCA and KPCA generate only slightly better classification maps than the MultiFeature method, with no significant visual improvements. Second, for the feature hashing methods, FastHash achieves a very satisfactory classification map that is very similar to the ones obtained by LDA and KLDA. This indicates that the binary codes can preserve the semantic similarities among the original float-type representations while maintaining compactness. In addition, the supervised hashing methods achieve smoother classification maps than the unsupervised ones; evidently, by taking the class label information into account, the supervised methods learn more discriminative binary codes. The superiority of FastHash over KSH and CCA-ITQ mainly lies in its method for learning hash functions, in which the boosted decision trees have the capability of feature selection. Third, we also observe that, in the heavily mixed crop areas, i.e., Corn-notill and Soybean-mintill, no method is obviously better than the others. The main reason is that those pixels are mixtures of different types of land-cover crops in their growth stages; moreover, the pixel resolution is 20 m, which means that each pixel integrates the remote sensing signal over a large area of actual land cover. Fourth, in terms of the other fusion methods, the classification maps of the SVM ensemble fusion methods (C-Fusion and P-Fusion) are obviously smoother than those of the other methods. The main reason is that both methods conduct object-level fusion based on the objects obtained by adaptive mean-shift segmentation [61] and the pixelwise classification results of multiple SVM classifications on multiple features.

Fig. 6. Classification maps of different methods on the Indian Pines data set. (a) Ground truth image with a size of 145 x 145 pixels and classification maps obtained by (b) MultiFeature, (c) PCA, (d) LDA, (e) NWFE, (f) KPCA, (g) KLDA, (h) MFL, (i) MKL, (j) C-Fusion, (k) P-Fusion, (l) mRMR, (m) DSML-FS, (n) LSH, (o) KLSH, (p) SH, (q) KSH, (r) CCA-ITQ, and (s) FastHash.

Fig. 7. Classification accuracies of different methods with a variable number of dimensions or hashing bits on the Indian Pines data set. For MultiFeature, LDA, and KLDA, we just show one point in the figure for clearer illustration. (a) Float-vector-based methods. (b) Hashing-based methods.

On the other hand, for the methods without this kind of object-level fusion, the classification maps are very similar to those of the subspace-based or hashing-based methods. In particular, the results of MFL, DSML-FS, and FastHash are very comparable. However, FastHash only uses 8 B of binary codes, which is much smaller than MFL or DSML-FS.

Influence of Reduced Dimensions: To further study how the dimension of the fused feature influences the classification accuracy, we plot the accuracies at different dimensions for the float-vector-based methods and the hashing methods in Fig. 7. As shown in Fig. 7(a), the accuracy saturates as the dimension becomes larger. This means that adding more dimensions cannot significantly improve the classification performance, showing the redundancy in the original feature representation. Fig. 7(b) shows the classification performance of the hashing methods with different numbers of hashing bits. It is worth noting that, with as few as 16 b, both FastHash and CCA-ITQ perform rather well. For the other hashing methods, at most 64 b is enough for a satisfactory performance. Accordingly, we choose the best dimension for each evaluated method for a fair comparison, which is listed in the brackets of Tables I and II.

Time Results: The timing results for training and feature extraction of the different methods are shown in Fig. 8. We can learn from Fig. 8(a) that five methods (MFL, MKL, NWFE, KSH, and FastHash) are significantly more computationally expensive than the other methods in the training stage. For MFL and MKL, the computational time is mainly spent on the computation of kernels, which is very time-consuming. NWFE uses weighted means for the within-class and between-class scatters, resulting in more computational time than LDA and the other subspace-based methods. KSH and FastHash both have to learn the hash functions one by one, and learning each hash function involves a time-consuming iterative optimization. As far as the testing time is concerned, we split it into two parts: 1) the time for feature extraction on one testing sample, which is shown in Fig. 8(b), and 2) the time for distance computation on one pair of testing samples in 1-NN classification, which is shown in Tables I and II.

Fig. 8. Computational times of different methods on the Indian Pines data set. For a clearer visual effect, we have log-scaled the y-axis. Note that the number on top of each bar corresponds to the real time, accurate to two decimal places. (a) Training time. (b) Feature extraction time.

It can be learned from Fig. 8(b) that the linear dimension reduction methods (e.g., LDA) and linear hashing methods (e.g., CCA-ITQ) often spend less testing time on feature extraction than the other methods. This is because they mainly involve matrix-vector multiplications, which can be computed very fast. The kernel-based nonlinear methods, including conventional dimension reduction and hashing methods such as KLDA and KLSH, are usually slower than the linear ones due to the computation of kernels. In addition, FastHash is slow in encoding, which is attributed to the traversal of many binary decision trees. However, when considering the time for distance computation between one testing sample and one stored training sample, as shown in Tables I and II, the hashing-based methods are more efficient due to the binary nature of the extracted features: the Hamming distance between binary features can be computed extremely fast on modern computers. Another advantage lies in storage. The memory footprint of the binary features extracted by hashing-based methods is much smaller than that of the float-type features extracted by float-vector-based methods. For example, FastHash only requires 8 B to store a feature, whereas even LDA, the float-vector-based method with the smallest memory requirement, needs 60 B. This efficiency in computation and storage can facilitate a number of big data applications.
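The storage argument is simple arithmetic; the sketch below reproduces the numbers quoted above (assuming float32 storage for the reduced features) and the single XOR-plus-popcount that a Hamming distance costs:

```python
import numpy as np

bits = 64                      # one binary code
print(bits // 8)               # -> 8 B per sample (e.g., FastHash above)
print(15 * 4)                  # -> 60 B for 15 reduced dims, assuming float32 storage

# One Hamming distance on a 64-b code is a single XOR plus a bit count
# (the POPCNT machine instruction); a squared Euclidean distance on a
# 15-D feature costs ~15 multiply-adds plus memory traffic.
a, b = np.uint64(0x0F0F0F0F0F0F0F0F), np.uint64(0xFF00FF00FF00FF00)
print(bin(int(a ^ b)).count("1"))              # Hamming distance = 32 here
```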
B. Experiment 2: University of Pavia Data Set With Given Training Samples

In this experiment, we evaluate the performance of the different methods on the University of Pavia data set with the given training samples shown in Fig. 3. As the figure shows, the given training samples are mainly located in the left part of the scene and thus only reflect partial distribution information, given the variability of certain land-cover classes. The numbers of training and testing samples are listed in Table IV.

TABLE IV. NUMBER OF TRAINING AND TESTING SAMPLES ON THE UNIVERSITY OF PAVIA DATA SET.

For multiple feature extraction, we extract the four types of features. The parameters for single feature extraction differ slightly from those in the first experiment. Specifically, here we use the top three principal components to extract the EMP and EAP since [20] has shown a better performance with this setting. The other parameters are the same. As there are 103 bands in this data set, we finally obtain a 278-dimensional concatenated vector used for extracting the low-dimensional float-type features as well as the binary codes.

Comparison of Classification Accuracy: Due to the space limit, we only list the OA and κ in Table V; see the supplemental material for the classification accuracies of the individual classes. From this table, we can see that NWFE and DSML-FS have very similar classification performance, and both are higher than the other methods. The supervised subspace-based methods (LDA, NWFE, and KLDA) generally perform better than the unsupervised ones (PCA and KPCA). In terms of the hashing-based methods, CCA-ITQ and FastHash achieve comparable accuracies, slightly lower than that of NWFE. Among the other hashing methods, both KLSH and KSH have inferior performance, which might be because of an improper selection of anchors for the kernel construction: since the training set only reflects a partial distribution of the whole data set and the anchors are selected from these training samples, the descriptive power of the learned hashing functions degrades. SH also has very low accuracy due to its unrealistic assumption on the data distribution and cannot obtain a desirable classification performance. Among the compared fusion methods, DSML-FS, MFL, C-Fusion, and P-Fusion achieve very high accuracies. In particular, DSML-FS only takes 240 B per feature and has the best accuracy. Compared to these methods, the results of both CCA-ITQ and FastHash are very comparable while requiring much less storage owing to the binary codes.

Influence of Reduced Dimensions: Fig. 9 shows the trend of the classification accuracy on this data set with a variable number of reduced dimensions or hashing bits.

TABLE V. PERFORMANCE OF DIFFERENT METHODS ON THE UNIVERSITY OF PAVIA DATA SET. THE NUMBER IN EACH BRACKET IS THE NUMBER OF BYTES USED.

Fig. 9. Classification accuracies of different methods with a variable number of dimensions or hashing bits on the University of Pavia data set. (a) Float-vector-based methods. (b) Hashing-based methods.

Fig. 10. (a) Ground truth image of the University of Pavia data set and classification maps obtained by (b) LDA, (c) MFL, (d) P-Fusion, and (e) FastHash.

From Fig. 9(a), the performances of PCA, KPCA, and NWFE are not as steady as in the first experiment. For the hashing-based methods, the trends shown in Fig. 9(b) are similar to those in the first experiment. This indicates that the hashing-based methods generalize better across data sets than the subspace-based ones. Among the hashing methods, both CCA-ITQ and FastHash perform very well with steady classification accuracies.

Comparison of Classification Map: Fig. 10 shows the classification maps of four methods (LDA, MFL, P-Fusion, and FastHash); the maps of the other methods are given in the supplemental material. From these maps, we can see that the visual results are very similar. FastHash obtains smooth visual effects in the labeled regions, which shows that the obtained binary codes still preserve the similarity of the features. However, in some wrongly classified areas, the results of all methods are also similar, indicating that the fused features are not very descriptive in these areas.

Time Results: The computational times of the different methods are illustrated in Fig. 11. As shown in Fig. 11(a), the training times of MFL, NWFE, and KSH are much higher than those of the other methods, as analyzed before. The other methods require little training time because their learning procedures only involve eigendecomposition or matrix multiplication, which are very fast for relatively low-dimensional features. From Fig. 11(b) and Table V, we can see that the hashing-based methods spend less time on the distance computation than the subspace-based methods, owing to the Hamming metric on the binary feature representation. Although the dimension is reduced by the subspace-based methods (LDA and KLDA) to as few as 8, the distance computation still takes more time than with binary codes. In addition, because of the heavy computational cost of multiple kernel feature extraction, MFL and MKL still take relatively more testing time than most hashing-based methods.

C. Experiment 3: Salinas Data Set

Our third experiment is conducted on the Salinas data set. We randomly select 30 samples from each labeled class as the training set and use the remainder as the testing set. The parameter settings for single feature extraction are the same as those used in the second experiment. Table VI reports the classification accuracies, from which conclusions similar to those of the previous experiments can be derived.

Fig. 11. Running times of different methods on the University of Pavia data set. (a) Training time. (b) Feature extraction time.

TABLE VI. PERFORMANCE OF DIFFERENT METHODS ON THE SALINAS DATA SET. THE NUMBER IN EACH BRACKET IS THE NUMBER OF BYTES USED.

Fig. 12. (a) Ground truth image of the Salinas data set and classification maps obtained by (b) MultiFeature, (c) MFL, (d) DSML-FS, and (e) FastHash.

The classification performance of the hashing-based methods is comparable to or better than that of the subspace-based methods. In particular, on this data set, FastHash achieves the highest accuracy, followed by CCA-ITQ and KLDA. The supervised hashing methods are generally better than the unsupervised ones. As the training samples are uniformly selected, both KLSH and SH achieve performance comparable to the MultiFeature method while using much less memory to store the obtained features.

Fig. 12 shows the classification maps of MultiFeature, MFL, DSML-FS, and FastHash. As we can see in this figure, FastHash obtains the best classification map. The results of MultiFeature and FastHash are very similar, and the latter is smoother in the green Grapes region and the Vinyard_untrained region. This indicates that the binary codes obtained by FastHash have more descriptive power in these regions.

The sensitivity to the reduced dimension for the different methods is illustrated in Fig. 13. It is clear that three subspace-based methods (PCA, KPCA, and NWFE) do not perform very stably as the dimension increases. For the hashing-based methods, CCA-ITQ and FastHash perform the best for most bit lengths. The performance of the other methods becomes steady when the number of bits is larger than 16.

With regard to the computational time, as shown in Fig. 14 and Table VI, five methods (NWFE, MFL, MKL, KSH, and FastHash) spend a lot of time on training the projection matrices or hashing functions due to their intrinsically time-consuming learning schemes. The other methods cost much less training time owing to efficient eigendecomposition or matrix multiplication. In terms of the testing time, the hashing-based methods need less time for the distance computation owing to the high efficiency of binary codes.

D. Experiment 4: Houston Data Set With Given Training Samples

In this experiment, we adopt the Houston data set, which contains 15 classes, for evaluation. As shown in Fig. 5, a large cloud shadow is present in the right part of the hyperspectral image, and no training samples are selected in this region. However, a large number of testing samples were collected there, which makes the classification very challenging. For each class, the numbers of given training and testing samples are shown in Table VII.

Fig. 13. Classification accuracies of different methods with a variable number of reduced dimensions or hashing bits on the Salinas data set. (a) Float-vector-based methods. (b) Hashing-based methods.

Fig. 14. Timing results of different methods on the Salinas data set. (a) Training time. (b) Feature extraction time.

TABLE VII. NUMBER OF TRAINING AND TESTING SAMPLES ON THE HOUSTON DATA SET.

In this experiment, as in the previous ones, we take three principal components of the original hyperspectral image to extract the EMP and EAP features and one principal component to extract the Gabor features. Since this data set also contains LiDAR data, we use it to extract EMP, EAP, and Gabor features with the same parameters. Finally, all of these features are concatenated into a long vector that is given as input to the different fusion methods. Similarly, for MFL, the total features contain two parts $\{h_{hyp}, h_{lidar}\}$, which are extracted from the hyperspectral and LiDAR images, respectively.

The classification accuracies are shown in Table VIII. We can see that KLDA achieves the best classification accuracy among the subspace-based methods. Considering the hashing-based methods, FastHash achieves an accuracy of 89.96%, which is slightly higher than the best result obtained by KLDA. Comparing the results of the subspace-based methods with those of the hashing-based methods, we argue that the extracted spectral-spatial features are less effective due to the existence of the cloud shadow. How to extract useful features for this kind of degraded image is still an open problem. On the other hand, the classification maps of NWFE and FastHash are shown in Fig. 15. As we can see in this figure, FastHash obtains visual effects similar to those of NWFE in the nonshadow regions. Moreover, in the shadow region, FastHash seems to give better results than NWFE. This is mainly because the low-dimensional float-type feature representation obtained by NWFE might overfit for 1-NN classification, while the binary codes generated by the hash functions of FastHash are more powerful in describing the shadow region. Finally, according to Figs. 16 and 17, we can derive the same conclusions as in the aforementioned experiments. For the results of the other methods, please refer to the supplemental material.

VI. DISCUSSIONS

Throughout the experiments presented previously, we conclude that MFH is effective compared to the traditional subspace-based methods and the state-of-the-art multiple kernel learning methods. This demonstrates that the binary codes generated by the hashing methods can preserve the similarities in the original data space. Meanwhile, the compact binary codes also facilitate faster subsequent processing and economical storage. As a result, for the classification task in hyperspectral images, the feature hashing methods can be used to extract very compact features, one or two orders of magnitude smaller than those of the traditional float-vector-based methods, while maintaining a comparable or even better classification accuracy.

TABLE VIII PERFORMANCE OF DIFFERENT METHODS ON THE HOUSTON DATA SET. THE NUMBER IN EACH BRACKET IS THE NUMBER OF BYTES USED

Fig. 15. Classification maps on the Houston data set. (a) NWFE. (b) MKL. (c) mRMR. (d) FastHash. (e) Labeled classes.

Fig. 16. Classification accuracy comparison of different methods with a variable number of reduced dimensions or hashing bits on the Houston data set. (a) Float-vector-based methods. (b) Hashing-based methods.

Fig. 17. Timing results of different methods on the Houston data set. (a) Training time. (b) Feature extraction time.
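Returning to the feature pipeline used in this experiment, the following is a minimal sketch of the PCA-plus-stacking step described above, assuming NumPy; emp, eap, and gabor are hypothetical stand-ins for the actual EMP, EAP, and Gabor extractors:

```python
import numpy as np

def top_principal_components(cube, k):
    """Project an (H, W, B) image cube onto its top-k principal components."""
    h, w, b = cube.shape
    x = cube.reshape(-1, b).astype(np.float64)
    x -= x.mean(axis=0)
    # Eigenvectors of the band covariance matrix, largest eigenvalues first.
    _, vecs = np.linalg.eigh(x.T @ x / (len(x) - 1))
    return (x @ vecs[:, ::-1][:, :k]).reshape(h, w, k)

def stack_features(hsi_cube, lidar_band, emp, eap, gabor):
    """Concatenate per-pixel EMP/EAP/Gabor features from HSI PCs and LiDAR.

    emp, eap, and gabor each map an (H, W, C) array to an (H, W, D) feature
    array; they are placeholders for the actual morphological/attribute
    profile and Gabor filter operators.
    """
    pcs3 = top_principal_components(hsi_cube, 3)   # three PCs for EMP and EAP
    pc1 = top_principal_components(hsi_cube, 1)    # one PC for Gabor
    lidar = lidar_band[..., None]                  # treat LiDAR as one band
    parts = [emp(pcs3), eap(pcs3), gabor(pc1),
             emp(lidar), eap(lidar), gabor(lidar)]
    return np.concatenate(parts, axis=-1)          # one long vector per pixel
```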

From the viewpoint of dimension reduction, the five classical subspace-based methods (PCA, LDA, NWFE, KPCA, and KLDA) can effectively project high-dimensional data into low-dimensional representations that reduce redundancy. The linear methods (PCA and LDA) learn their transformation matrices quickly when the feature dimension is relatively small, but they may be unsuitable when the data have a complex structure. The kernel-based methods (KPCA and KLDA) implicitly map the features into a high-dimensional space in which the data may be linearly separable, and they usually achieve good performance when their hyperparameters are selected carefully; however, the kernel computations increase both training and testing time. NWFE spends much of its time computing weighted local means, which limits its applicability to large-scale problems with huge numbers of samples or features. Moreover, all of these methods output float-type features, which are less attractive than binary codes in terms of both memory footprint and the computational cost of subsequent processing.

In contrast, some of the feature hashing methods handle these problems favorably. In this paper, we have studied three unsupervised hashing methods (LSH, KLSH, and SH) and three supervised ones (KSH, CCA-ITQ, and FastHash). The unsupervised methods show a significant advantage in training time, but they usually do not achieve good classification performance. Compared with the conventional dimension reduction methods, FastHash does not involve costly eigenvalue decomposition, enabling it to deal with large-scale training data; in fact, as validated in [48], FastHash can efficiently handle data sets with very large numbers of training samples and features. By contrast, it is hard for the subspace-based or multiple-kernel-learning-based methods to handle data sets of this scale.

When considering test (classification) speed, the linear and kernel-based hashing methods are usually fast in computing binary codes for low-dimensional data, and FastHash is fast at converting high-dimensional data into binary codes because it involves only simple comparison operations. By contrast, the large matrix-vector multiplications or kernel computations required by traditional dimension reduction methods, such as PCA and KPCA, can be very slow. Moreover, the hashing methods achieve classification performance comparable, and sometimes superior, to that of the traditional dimension reduction methods with a much lower storage demand, typically one-tenth that of the latter. As a result, hashing is very useful for compressing massive data.

To sum up, given their different characteristics, the choice of hashing method is task oriented and depends on the tradeoff among accuracy, time complexity, and storage. As multisource, multitemporal, and multiresolution remote sensing data are collected day by day, it is increasingly urgent to develop efficient methods to store, retrieve, and classify such huge volumes of data. According to our study, the hashing technique provides a promising way to do so with faster processing and economical storage.
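To illustrate why the linear hashing methods are cheap at encoding time, here is a minimal LSH-style sketch in which each bit is the sign of one random projection; this is the generic random-hyperplane scheme, not the specific hash functions compared in this study:

```python
import numpy as np

def lsh_encode(features, n_bits=64, seed=0):
    """Encode float feature vectors as packed binary codes.

    features: (n, d) float array. Returns an (n, n_bits // 8) uint8 array
    that can be fed directly to a Hamming-distance 1-NN classifier.
    """
    rng = np.random.default_rng(seed)
    # One random hyperplane per bit; the sign of the projection is the bit.
    w = rng.standard_normal((features.shape[1], n_bits))
    bits = (features @ w) > 0
    return np.packbits(bits, axis=1)  # pack 8 bits per byte
```

Encoding is a single matrix multiplication followed by comparisons, and the packed output plugs directly into the XOR-and-popcount distance sketch given earlier.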
VII. CONCLUSION AND FUTURE WORK

In this paper, we have proposed an MFH framework and have given a comparative evaluation of several existing hashing methods for hyperspectral imagery classification. The main characteristics of this work lie in the following aspects. First, the hashing technique has been introduced into multiple feature fusion to generate compact binary feature representations. Second, classification experiments conducted on four real hyperspectral data sets have demonstrated that the obtained compact binary codes not only preserve similarity in the original data space but also allow more economical subsequent processing, while achieving comparable or better performance. Finally, combined with the powerful features extracted from hyperspectral images, feature hashing for multiple feature fusion is as effective and efficient as expected.

As future work, the MFH framework can be explored mainly in two aspects: theory and application. From the perspective of theory, one improvement is to design more flexible fusion schemes that take advantage of the complementary but vital information carried by multiple types of features; another is to develop more efficient hashing methods that yield more compact and discriminative binary codes. From the viewpoint of application, with the continued development of imaging technologies, ever larger volumes of remote sensing data are being captured and stored, and how to explore such large-scale data effectively and efficiently urgently needs to be studied. One possible application is remote sensing data compression: with efficient MFH, large-scale data can be compressed into binary codes without significant loss of information, greatly reducing storage requirements. Another possible direction is fast retrieval of near-duplicate spectra or similar objects: owing to the compact binary codes, exact or approximate nearest neighbor search becomes very efficient, largely decreasing the time complexity.

ACKNOWLEDGMENT

The authors would like to thank Dr. L. Johnson and Dr. J. A. Gualtieri for making the AVIRIS Salinas data set publicly available, Dr. P. Gamba of the University of Pavia, Pavia, Italy, for providing the community with the University of Pavia data set, the Image Analysis and Data Fusion Technical Committee of the IEEE Geoscience and Remote Sensing Society as well as Dr. F. Pacifici of DigitalGlobe, USA, for providing the Houston data set [49], and, finally, the editor and anonymous reviewers for their careful reading and helpful comments.

REFERENCES

[1] C.-I. Chang, Hyperspectral Data Exploitation: Theory and Applications. Hoboken, NJ, USA: Wiley.
[2] A. Plaza et al., Recent advances in techniques for hyperspectral image processing, Remote Sens. Environ., vol. 113, no. S1, pp. S110-S122, Sep.
[3] D. Landgrebe, Hyperspectral image data analysis, IEEE Signal Process. Mag., vol. 19, no. 1, Jan.
[4] G. Camps-Valls, D. Tuia, L. Bruzzone, and J. A. Benediktsson, Advances in hyperspectral image classification: Earth monitoring with statistical learning methods, IEEE Signal Process. Mag., vol. 31, no. 1, Jan.

[5] D. Tuia, E. Merenyi, X. Jia, and M. Grana-Romay, Foreword to the special issue on machine learning for remote sensing data processing, IEEE J. Sel. Topics Appl. Earth Observ. Remote Sens., vol. 7, no. 4, Apr.
[6] J. Bioucas-Dias et al., Hyperspectral remote sensing data analysis and future challenges, IEEE Geosci. Remote Sens. Mag., vol. 1, no. 2, pp. 6-36, Jun.
[7] X. Jia, B.-C. Kuo, and M. M. Crawford, Feature mining for hyperspectral image classification, Proc. IEEE, vol. 101, no. 3, Mar.
[8] M. Fauvel, Y. Tarabalka, J. A. Benediktsson, J. Chanussot, and J. C. Tilton, Advances in spectral-spatial classification of hyperspectral images, Proc. IEEE, vol. 101, no. 3, Mar.
[9] Y. Tarabalka, J. Benediktsson, and J. Chanussot, Spectral-spatial classification of hyperspectral imagery based on partitional clustering techniques, IEEE Trans. Geosci. Remote Sens., vol. 47, no. 8, Aug.
[10] Y. Tarabalka, J. Chanussot, and J. Benediktsson, Segmentation and classification of hyperspectral images using minimum spanning forest grown from automatically selected markers, IEEE Trans. Syst., Man, Cybern. B, Cybern., vol. 40, no. 5, Oct.
[11] J. Li, J. Bioucas-Dias, and A. Plaza, Hyperspectral image segmentation using a new Bayesian approach with active learning, IEEE Trans. Geosci. Remote Sens., vol. 49, no. 10, Oct.
[12] J. Bai, S. Xiang, and C. Pan, A graph-based classification method for hyperspectral images, IEEE Trans. Geosci. Remote Sens., vol. 51, no. 2, Feb.
[13] G. Camps-Valls, L. Gomez-Chova, J. Muñoz-Marí, J. Vila-Francés, and J. Calpe-Maravilla, Composite kernels for hyperspectral image classification, IEEE Geosci. Remote Sens. Lett., vol. 3, no. 1, Jan.
[14] J. Li, P. Reddy Marpu, A. Plaza, J. Bioucas-Dias, and J. Atli Benediktsson, Generalized composite kernel framework for hyperspectral image classification, IEEE Trans. Geosci. Remote Sens., vol. 51, no. 9, Sep.
[15] J. Li et al., Multiple feature learning for hyperspectral image classification, IEEE Trans. Geosci. Remote Sens., vol. 53, no. 3, Mar.
[16] J. Li, J. M. Bioucas-Dias, and A. Plaza, Spectral-spatial hyperspectral image segmentation using subspace multinomial logistic regression and Markov random fields, IEEE Trans. Geosci. Remote Sens., vol. 50, no. 3, Mar.
[17] O. Rajadell, P. Garcia-Sevilla, and F. Pla, Spectral-spatial pixel characterization using Gabor filters for hyperspectral image classification, IEEE Geosci. Remote Sens. Lett., vol. 10, no. 4, Jul.
[18] X. Kang, S. Li, and J. A. Benediktsson, Spectral-spatial hyperspectral image classification with edge-preserving filtering, IEEE Trans. Geosci. Remote Sens., vol. 52, no. 5, May.
[19] J. A. Benediktsson, J. A. Palmason, and J. R. Sveinsson, Classification of hyperspectral data from urban areas based on extended morphological profiles, IEEE Trans. Geosci. Remote Sens., vol. 43, no. 3, Mar.
[20] M. Dalla Mura, J. Atli Benediktsson, B. Waske, and L. Bruzzone, Extended profiles with morphological attribute filters for the analysis of hyperspectral data, Int. J. Remote Sens., vol. 31, no. 22, Dec.
[21] M. Dalla Mura, J. A. Benediktsson, B. Waske, and L. Bruzzone, Morphological attribute profiles for the analysis of very high resolution images, IEEE Trans. Geosci. Remote Sens., vol. 48, no. 10, Mar.
[22] F. Tsai and J.-S. Lai, Feature extraction of hyperspectral image cubes using three-dimensional gray-level cooccurrence, IEEE Trans. Geosci. Remote Sens., vol. 51, no. 6, Jun.
[23] X. Huang, X. Liu, and L. Zhang, A multichannel gray level cooccurrence matrix for multi/hyperspectral image texture representation, Remote Sens., vol. 6, no. 9, Sep.
[24] D. Tuia, G. Camps-Valls, G. Matasci, and M. Kanevski, Learning relevant image features with multiple-kernel classification, IEEE Trans. Geosci. Remote Sens., vol. 48, no. 10, Oct.
[25] X. Huang, Q. Lu, and L. Zhang, A multi-index learning approach for classification of high-resolution remotely sensed images over urban areas, ISPRS J. Photogramm. Remote Sens., vol. 90, Apr.
[26] Y. Gu, Q. Wang, X. Jia, and J. Benediktsson, A novel MKL model of integrating LiDAR data and MSI for urban area classification, IEEE Trans. Geosci. Remote Sens., vol. 53, no. 10, Oct.
[27] Y. Gu, Q. Wang, H. Wang, D. You, and Y. Zhang, Multiple kernel learning via low-rank nonnegative matrix factorization for classification of hyperspectral imagery, IEEE J. Sel. Topics Appl. Earth Observ. Remote Sens., vol. 8, no. 6, Jun.
[28] Y. Gu et al., Representative multiple kernel learning for classification in hyperspectral imagery, IEEE Trans. Geosci. Remote Sens., vol. 50, no. 7, Jul.
[29] L. Zhang, L. Zhang, D. Tao, and X. Huang, On combining multiple features for hyperspectral remote sensing image classification, IEEE Trans. Geosci. Remote Sens., vol. 50, no. 3, Mar.
[30] T. Zhang, D. Tao, X. Li, and J. Yang, Patch alignment for dimensionality reduction, IEEE Trans. Knowl. Data Eng., vol. 21, no. 9, Sep.
[31] L. Zhang, L. Zhang, D. Tao, and X. Huang, A modified stochastic neighbor embedding for multi-feature dimension reduction of remote sensing images, ISPRS J. Photogramm. Remote Sens., vol. 83, Sep.
[32] Q. Zhang, Y. Tian, Y. Yang, and C. Pan, Automatic spatial-spectral feature selection for hyperspectral image via discriminative sparse multimodal learning, IEEE Trans. Geosci. Remote Sens., vol. 53, no. 1, Jan.
[33] S. Xiang, F. Nie, G. Meng, C. Pan, and C. Zhang, Discriminative least squares regression for multiclass classification and feature selection, IEEE Trans. Neural Netw. Learn. Syst., vol. 23, no. 11, Nov.
[34] X. Huang and L. Zhang, An SVM ensemble approach combining spectral, structural, and semantic features for the classification of high-resolution remotely sensed imagery, IEEE Trans. Geosci. Remote Sens., vol. 51, no. 1, Jan.
[35] B. Demir and L. Bruzzone, Hashing-based scalable remote sensing image search and retrieval in large archives, IEEE Trans. Geosci. Remote Sens., vol. 54, no. 2, Feb.
[36] S. Bondugula, Survey of hashing techniques for compact bit representations of images, Ph.D. dissertation, Dept. Comput. Sci., Univ. Maryland, College Park, MD, USA.
[37] J. Wang, H. T. Shen, J. Song, and J. Ji, Hashing for similarity search: A survey, arXiv preprint.
[38] J. Wang, W. Liu, S. Kumar, and S.-F. Chang, Learning to hash for indexing big data - A survey, Proc. IEEE, vol. 104, no. 1, Jan. 2016.
[39] P. Indyk and R. Motwani, Approximate nearest neighbors: Towards removing the curse of dimensionality, in Proc. ACM Symp. Theory Comput., 1998.
[40] M. S. Charikar, Similarity estimation techniques from rounding algorithms, in Proc. ACM Symp. Theory Comput., 2002.
[41] B. Kulis and K. Grauman, Kernelized locality-sensitive hashing, IEEE Trans. Pattern Anal. Mach. Intell., vol. 34, no. 6, Jun.
[42] K. Jiang, Q. Que, and B. Kulis, Revisiting kernelized locality-sensitive hashing for improved large-scale image retrieval, arXiv preprint.
[43] Y. Weiss, A. Torralba, and R. Fergus, Spectral hashing, in Proc. Adv. Neural Inf. Process. Syst., 2009.
[44] W. Liu, J. Wang, R. Ji, Y.-G. Jiang, and S.-F. Chang, Supervised hashing with kernels, in Proc. IEEE Conf. Comput. Vis. Pattern Recog., 2012.
[45] Y. Gong and S. Lazebnik, Iterative quantization: A Procrustean approach to learning binary codes, in Proc. IEEE Conf. Comput. Vis. Pattern Recog., 2011.
[46] H. Hotelling, Relations between two sets of variates, Biometrika, vol. 28, no. 3/4, Dec.
[47] G. Lin, C. Shen, Q. Shi, A. van den Hengel, and D. Suter, Fast supervised hashing with decision trees for high-dimensional data, in Proc. IEEE Conf. Comput. Vis. Pattern Recog., 2014.
[48] G. Lin, C. Shen, and A. van den Hengel, Supervised hashing using graph cuts and boosted decision trees, IEEE Trans. Pattern Anal. Mach. Intell., vol. 37, no. 11, Nov.
[49] F. Pacifici, Q. Du, and S. Prasad, Report on the 2013 IEEE GRSS Data Fusion Contest: Fusion of hyperspectral and LiDAR data [technical committees], IEEE Geosci. Remote Sens. Mag., vol. 1, no. 3, Sep.
[50] C. Debes et al., Hyperspectral and LiDAR data fusion: Outcome of the 2013 GRSS Data Fusion Contest, IEEE J. Sel. Topics Appl. Earth Observ. Remote Sens., vol. 7, no. 6, Jun.
[51] H. Li, S. Xiang, Z. Zhong, K. Ding, and C. Pan, Multi-cluster spatial-spectral unsupervised feature selection for hyperspectral image classification, IEEE Geosci. Remote Sens. Lett., vol. 12, no. 8, Aug.
[52] I. Jolliffe, Principal Component Analysis. Hoboken, NJ, USA: Wiley, 2005.

[53] S. Mika, G. Rätsch, J. Weston, B. Schölkopf, and K. Müllers, Fisher discriminant analysis with kernels, in Proc. IEEE Signal Process. Soc. Workshop, Madison, WI, USA, Aug. 1999.
[54] B.-C. Kuo and D. A. Landgrebe, Nonparametric weighted feature extraction for classification, IEEE Trans. Geosci. Remote Sens., vol. 42, no. 5, May.
[55] B. Schölkopf, A. Smola, and K.-R. Müller, Nonlinear component analysis as a kernel eigenvalue problem, Neural Comput., vol. 10, no. 5, Jul.
[56] D. Cai, X. He, and J. Han, Efficient kernel discriminant analysis via spectral regression, in Proc. IEEE Int. Conf. Data Mining, 2007.
[57] A. Rakotomamonjy, F. Bach, S. Canu, and Y. Grandvalet, SimpleMKL, J. Mach. Learn. Res., vol. 9.
[58] H. Peng, F. Long, and C. Ding, Feature selection based on mutual information criteria of max-dependency, max-relevance, and min-redundancy, IEEE Trans. Pattern Anal. Mach. Intell., vol. 27, no. 8, Aug.
[59] M. Fauvel, J. Chanussot, and J. Benediktsson, Kernel principal component analysis for the classification of hyperspectral remote sensing data over urban areas, EURASIP J. Adv. Signal Process., vol. 2009, no. 1, Mar. 2009.
[60] Q. Wang, Kernel principal component analysis and its applications in face recognition and active shape models, Rensselaer Polytechnic Inst., Troy, NY, unpublished paper. [Online]. Available: abs/
[61] X. Huang and L. Zhang, An adaptive mean-shift analysis approach for object extraction and classification from urban hyperspectral imagery, IEEE Trans. Geosci. Remote Sens., vol. 46, no. 12, Dec.

Zisha Zhong received the B.S. degree in automation from Central South University, Changsha, China. He is currently working toward the Ph.D. degree in pattern recognition and intelligent systems in the National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences, Beijing, China. His research interests include remote sensing image processing, pattern recognition, and machine learning.

Kun Ding received the B.S. degree in automatic control from the Tianjin University of Science and Technology, Tianjin, China, in 2011 and the M.S. degree from the Institute of Automation, Chinese Academy of Sciences, Beijing, China, in 2014, where he is currently working toward the Ph.D. degree. His research interests include computer vision, information retrieval, deep learning, and remote sensing.

Haichang Li is currently working toward the Ph.D. degree in the Institute of Automation, Chinese Academy of Sciences, Beijing, China. His research interests are in semantic image segmentation and understanding of remote sensing images.

Shiming Xiang received the B.S. degree in mathematics and the M.S. degree from Chongqing Normal University, Chongqing, China, in 1993 and 1996, respectively, and the Ph.D. degree from the Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China. From 1996 to 2001, he was a Lecturer with the Huazhong University of Science and Technology, Wuhan, China. He was a Postdoctoral Candidate with the Department of Automation, Tsinghua University, Beijing, beginning in 2004. He is currently a Professor with the Institute of Automation, Chinese Academy of Sciences, Beijing. His interests include pattern recognition and machine learning.

Bin Fan (M'10) received the B.S. degree in automation from the Beijing University of Chemical Technology, Beijing, China, in 2006 and the Ph.D. degree in pattern recognition and intelligent systems from the National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences, Beijing.
He is currently an Associate Professor with the National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences. He has wide research interests in remote sensing image processing, computer vision, and pattern recognition. Dr. Fan is an Associate Editor of Neurocomputing and has served as an Area Chair of WACV'16.

Chunhong Pan (M'14) received the B.S. degree in automatic control from Tsinghua University, Beijing, China, in 1987, the M.S. degree from the Shanghai Institute of Optics and Fine Mechanics, Chinese Academy of Sciences, in 1990, and the Ph.D. degree in pattern recognition and intelligent systems from the Institute of Automation, Chinese Academy of Sciences, Beijing. He is currently a Professor with the National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences. His research interests include computer vision, image processing, computer graphics, and remote sensing.
