Hyperspectral Image Segmentation with Markov Random Fields and a Convolutional Neural Network


Xiangyong Cao, Feng Zhou, Lin Xu, Deyu Meng, Member, IEEE, Zongben Xu, John Paisley

arXiv: v1 [cs.CV] 1 May 2017

Abstract

This paper presents a new supervised segmentation algorithm for hyperspectral image (HSI) data which integrates both spectral and spatial information in a probabilistic framework. A convolutional neural network (CNN) is first used to learn the posterior class distributions using a patch-wise training strategy to better utilize the spatial information. Then, the spatial information is further considered by using a Markov random field prior, which encourages the neighboring pixels to have the same labels. Finally, a maximum a posteriori segmentation model is efficiently computed by the α-expansion min-cut-based optimization algorithm. The proposed segmentation approach achieves state-of-the-art performance on one synthetic dataset and two benchmark HSI datasets in a number of experimental settings.

Index Terms: Hyperspectral image segmentation, hyperspectral image classification, convolutional neural network, Markov random field

I. INTRODUCTION

Hyperspectral remote sensors capture digital images in hundreds of continuous narrow spectral bands to produce a high-dimensional hyperspectral image (HSI). HSI has been used for many applications, such as land-use mapping, land-cover mapping, forest inventory, and urban-area monitoring [21], because it provides detailed information on spectral and spatial distributions of distinct materials. Such applications can often be converted into a classification task, where the aim is to classify each hyperspectral pixel vector into one of multiple categories. Thus effective methods for HSI classification are required for these applications. In the last few decades, many methods have been proposed for HSI classification.
These methods can be roughly divided into two categories: spectral-based methods and spectral-spatial-based methods. We first briefly review these approaches to HSI classification, and then discuss the contributions of our proposed method.

Author note: Xiangyong Cao, Deyu Meng, and Zongben Xu are with the School of Mathematics and Statistics, Xi'an Jiaotong University, Xi'an, China (caoxiangyong45@gmail.com, dymeng@mail.xjtu.edu.cn, zbxu@mail.xjtu.edu.cn). Feng Zhou is with the National Laboratory of Radar Signal Processing, Xidian University, Xi'an, China (fzhou@mail.xidian.edu.cn). Lin Xu is with the NYU Multimedia and Visual Computing Lab, New York University Abu Dhabi, UAE (xulinshadow@gmail.com). John Paisley is with the Department of Electrical Engineering & Data Science Institute, Columbia University, New York, NY, USA (jpaisley@columbia.edu). This research was supported in part by the National Natural Science Foundation of China under Grants , and . Xiangyong Cao performed this work at Columbia University on a grant from the China Scholarship Council.

A. Related work: spectral vs spectral-spatial based methods

Many classical HSI classification approaches are based only on spectral information [2], [23], [32]. In these methods, spectral features are first extracted using a feature extraction method such as principal component analysis [23], independent component analysis [32] or linear discriminant analysis [2], in order to mitigate the Hughes phenomenon. The extracted features are then used to learn a classifier. These approaches consider only the spectral information and ignore the correlations among distinct pixels in the image, which tends to decrease their classification performance relative to methods that consider both. In this paper we instead focus on designing a spectral-spatial based classification method. Spectral-spatial based methods can help improve the classification performance since they incorporate additional spatial information from the HSI.
It has been pointed out in [25] that spatial information is often more crucial than spectral information in the HSI classification task. Therefore, many spectral-spatial methods have been proposed that additionally consider spatial correlation information. Specifically, there are two approaches.

One is to extract the spatial dependence in advance using a spatial feature extraction method before learning a classifier. The patch-based feature extraction method is an example of this approach [9], [28], [29]: it first groups neighboring pixels using square windows and then extracts features from each local window using a subspace learning technique, such as low-rank matrix factorization [34], dictionary learning [8] or subspace clustering [17]. Compared with the original spectral vector, the features extracted by patch-based methods have less intra-class variability and higher spatial smoothness with reduced noise. Aside from patch-based feature extraction methods, many other approaches have been proposed, such as the 3-dimensional discrete wavelet transform [5], 3-dimensional Gabor wavelets [27], morphological profiles [3] and attribute profiles [11].

Another way to incorporate spatial information is to use a Markov random field (MRF) to post-process the classification map after the classification step has been performed. The MRF is a powerful tool for image segmentation, which can encourage neighboring pixels to have the same class label. It has been shown to greatly improve the HSI classification accuracy [22], [31]. Our approach falls in this class of algorithms.

A drawback of several of the previous approaches above is that the extracted features are usually hand-crafted, which makes them highly dependent on prior knowledge of the specific domain and often sub-optimal [39]. In the last few years, deep learning has been a very powerful machine learning technique for learning data-dependent and hierarchical feature representations from raw data. Features learned by deep learning are usually more discriminative than hand-crafted ones. Such methods have been especially successful for image processing and computer vision problems, such as image classification [20], image segmentation [6], action recognition [16] and object detection [30].

1) Deep learning approaches: A few deep learning methods have been introduced for the HSI classification task. For example, unsupervised feature learning methods, such as stacked autoencoders [7] and deep belief networks [10], have been proposed. Although these two unsupervised learning models can extract deep hierarchical features, the 3-dimensional training patch samples must first be flattened into 1-dimensional vectors in order to satisfy the input requirement. These flattened training samples lose some of the spatial information that the original 3-dimensional patches have. A supervised autoencoder [24] method was proposed that uses label information during feature learning, and later similar methods have been proposed based on the convolutional neural network (CNN). For example, the five-layer CNN of [15] trains on a 1-dimensional spectral vector without using spatial information. Very recently, a modified CNN [36] modeled spatial information by using 3-dimensional training patch samples as the input, and a dual-channel convolutional neural network (DC-CNN) [39] was proposed, which first extracts the spectral feature and spatial feature by using a 1D-CNN and a 2D-CNN separately, then flattens each feature and finally stacks the two features into a 1-dimensional vector. Although the DC-CNN performs well, the extracted feature is somewhat redundant since the feature extracted by the 2D-CNN already contains much of the spectral information, and so the Hughes phenomenon can be made worse by the overlong stacked 1D feature.
In this paper, we compare with three recent state-of-the-art deep learning methods: SSD-CNN [38], SSDL [37] and DC-CNN [39].

B. Contributions of our proposed approach

In this paper, we propose a new supervised HSI segmentation algorithm based on deep learning and the Markov random field. We first formulate the HSI segmentation problem from a probabilistic perspective. We then use the convolutional neural network (CNN) to model the posterior class distribution of each training sample patch and apply the Markov random field (MRF) to the spatial labels. To our knowledge, this is the first approach that integrates the CNN and MRF into a unified framework for the HSI classification problem. Our specific contributions are twofold:

1) We design a compact convolutional neural network (CNN) to extract the spectral-spatial features by using the 3D patch as the training sample. The structure of the proposed CNN is different from previous CNN structures in the HSI classification field.
2) We use a Markov random field (MRF) to further exploit the spatial information after the HSI classification step, which assumes that adjacent pixels are more likely to belong to the same class.

Our experimental results on synthetic and real data show that the proposed method outperforms other state-of-the-art methods, specifically SSD-CNN [38], SSDL [37] and DC-CNN [39].

The rest of the paper is organized as follows: In Section 2, we formulate the HSI segmentation problem. In Section 3, we describe our proposed approach. In Section 4, we perform experiments on one synthetic dataset and two benchmark real HSI datasets and compare with state-of-the-art methods. We conclude in Section 5. Throughout the paper, we denote scalars, vectors, matrices and tensors as the non-bold, bold lower case, bold upper case and curlicue letters, respectively.

II. PROBLEM FORMULATION

Before formulating the HSI segmentation problem, we define the problem-related notation used throughout this paper.
1) Problem notation: Let the HSI dataset be $\mathcal{H} \in \mathbb{R}^{h \times w \times d}$, where $h$ and $w$ are the height and width of the spatial dimensions, respectively, and $d$ is the number of spectral bands. The set of class labels is defined as $\mathcal{K} = \{1, 2, \ldots, K\}$, where $K$ is the number of classes in the given HSI dataset. The set of all patches extracted from the HSI data $\mathcal{H}$ is denoted as $\mathcal{X} = \{\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_n\}$, where $\mathbf{x}_i \in \mathbb{R}^{k \times k \times d}$, $k$ is the patch size in the spatial dimension, and $n \le hw$ is the total number of extracted patches (accounting for boundary cases). The corresponding labels of the patches in $\mathcal{X}$ are defined as $\mathbf{y} = \{y_1, y_2, \ldots, y_n\}$, where $y_i \in \mathcal{K}$ is the label of the spectral vector at the center of $\mathbf{x}_i$. We define the training set for class $k$ as $\mathcal{D}^{(k)}_{l^{(k)}} = \{(\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_{l^{(k)}}, y_{l^{(k)}})\}$, and thus the entire training set can be denoted as $\mathcal{D}_l = \{\mathcal{D}^{(1)}_{l^{(1)}}, \mathcal{D}^{(2)}_{l^{(2)}}, \ldots, \mathcal{D}^{(K)}_{l^{(K)}}\}$, where $l = \sum_{k=1}^{K} l^{(k)} \le n$ is the total number of training patches.

2) Problem setup: With the definitions above, the goal of HSI classification is to assign a label $y_i$ to each patch $\mathbf{x}_i$, $i = 1, 2, \ldots, n$. The goal of HSI segmentation can be defined in a similar way. In this paper we use the term classification when the Markov random field is not applied in the post-processing step. We use the term segmentation when the MRF is used in the post-processing step to spatially smooth the labels in a way more suitable to segmentation.

In the discriminative classification framework, the estimation of labels $\mathbf{y}$ for observations $\mathcal{X}$ can be obtained by defining a distribution $P(\mathbf{y} \mid \mathcal{X})$. In our framework, we seek to combine the spatial modeling power of a Markov random field with the discriminative power of deep learning. Therefore, we let this distribution be of the form

$$ P(\mathbf{y} \mid \mathcal{X}) = \sum_{\tilde{\mathbf{y}} \in \mathcal{K}^n} P(\mathbf{y} \mid \tilde{\mathbf{y}}) \prod_{i=1}^{n} P(\tilde{y}_i \mid \mathbf{x}_i). \tag{1} $$

That is, we let the distribution of the labels of a dataset $\mathcal{X}$ be represented as the marginal of a discriminative classifier $P(\tilde{y}_i \mid \mathbf{x}_i)$, which provides an initial prediction independently for each data point, with a second classifier $P(\mathbf{y} \mid \tilde{\mathbf{y}})$ that takes this collection of predictions and outputs the final classification. Predictions can be made as $\hat{\mathbf{y}} = \arg\max_{\mathbf{y}} P(\mathbf{y} \mid \mathcal{X})$.

We simplify this in two ways for computational convenience. First, since a sum over all $\tilde{\mathbf{y}}$ is computationally intensive, we instead learn a point estimate of the pairs $(\mathbf{y}, \tilde{\mathbf{y}})$. A goal can then be to solve

$$ \hat{\mathbf{y}} = \arg\max_{\mathbf{y} \in \mathcal{K}^n} \max_{\tilde{\mathbf{y}}} \Big\{ \log P(\mathbf{y} \mid \tilde{\mathbf{y}}) + \sum_{i=1}^{n} \log P(\tilde{y}_i \mid \mathbf{x}_i) \Big\}. \tag{2} $$

Our second simplification takes the form of breaking this into two problems: first, making predictions based on the discriminative classifier $P(\tilde{y}_i \mid \mathbf{x}_i)$, and then, given the predicted values $\tilde{\mathbf{y}} = \{\tilde{y}_1, \tilde{y}_2, \ldots, \tilde{y}_n\}$, obtaining the final predictions according to

$$ \hat{\mathbf{y}} = \arg\max_{\mathbf{y} \in \mathcal{K}^n} \log P(\mathbf{y} \mid \tilde{\mathbf{y}}). $$

In this paper, we design a new approach to simultaneously consider both spatial and spectral information of HSIs based on the above classification framework. Specifically, in our approach we learn the term $P(\tilde{y}_i \mid \mathbf{x}_i)$ (i.e. the class-conditional probabilities of each patch) using a convolutional neural network. This will provide an accurate initial prediction for the label. Then, to enforce spatial consistency, we define $P(\mathbf{y} \mid \tilde{\mathbf{y}})$ to be a Markov random field (MRF), which encourages adjacent pixels to have the same label with high probability. Finally, the MAP segmentation result $\hat{\mathbf{y}}$ is computed by the efficient α-expansion min-cut optimization algorithm [4]. Detailed descriptions of each of these parts are given in the next section.

III. PROPOSED APPROACH

In this section, we first propose a convolutional neural network (CNN) based classifier for hyperspectral images. Second, the spatial information is further considered by placing an MRF prior on the labels $\mathbf{y}$ given the initial predictions $\tilde{\mathbf{y}}$ from the CNN. Finally, we introduce an efficient min-cut optimization technique to compute the segmentation result.

A. Discriminative CNN-based classifier

Convolutional neural networks (CNNs) play a dominant role in visual processing problems. They consist of various combinations of convolutional layers, max pooling layers, and fully connected layers.
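Each input to such a network is one patch x_i cut from the cube H, as defined in Section II. The following is a minimal sketch of that extraction step in plain Python, not the authors' code. The function name `extract_patches`, the nested-list layout `cube[row][col][band]`, and the choice to skip border pixels that lack a full window are all assumptions made for illustration; the paper only states that n ≤ hw "accounting for boundary cases".

```python
def extract_patches(cube, k):
    """Cut every full k x k x d patch out of an h x w x d cube.

    `cube` is a nested list indexed as cube[row][col][band]; k must be odd
    so that each patch has a well-defined centre pixel.
    Returns (patches, centres), where centres[i] is the (row, col) of the
    pixel whose label y_i belongs to patches[i].
    """
    if k % 2 == 0:
        raise ValueError("patch size k must be odd")
    h, w = len(cube), len(cube[0])
    r = k // 2
    patches, centres = [], []
    # ASSUMPTION: pixels closer than r to the border are skipped, so that
    # n = (h - k + 1) * (w - k + 1) <= h * w. Padding is another option.
    for i in range(r, h - r):
        for j in range(r, w - r):
            patch = [row[j - r:j + r + 1] for row in cube[i - r:i + r + 1]]
            patches.append(patch)
            centres.append((i, j))
    return patches, centres


# Tiny 5 x 6 cube with d = 2 "bands"; each pixel simply stores [row, col].
cube = [[[i, j] for j in range(6)] for i in range(5)]
patches, centres = extract_patches(cube, k=3)
```

With h = 5, w = 6 and k = 3 this yields (5 - 2)(6 - 2) = 12 patches, each of shape 3 × 3 × 2, and the centre pixel of the first patch is the pixel at row 1, column 1.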
Compared with the standard fully connected feed-forward neural network (also called the multilayer perceptron), a CNN exploits spatially local correlation by enforcing a local connectivity pattern between neurons of adjacent layers, and achieves better performance in many image processing problems, such as image denoising [33], image super-resolution [12], image deraining [13] and image classification [20].

In this paper, we adopt the CNN structure shown in Figure 1. Specifically, this network contains one input layer, two pairs of convolution and max pooling layers, two fully connected layers and one output layer. The detailed parameter settings in each layer are also displayed in Figure 1. For the hyperspectral image classification task, each sample is a 3D patch of size $k \times k \times d$. Next, we describe how each sample patch $\mathbf{x}_i$ is processed at each layer of our CNN structure. We note that there is no need to flatten the sample patch $\mathbf{x}_i$ into a 1-dimensional vector before it is fed into the input layer. Therefore, the size of the first layer is $k \times k \times d$.

[Fig. 1. The network structure of the CNN used as our discriminative classifier. Layers shown, in order: HSI patch ($k \times k \times d$); convolutional layer (3 × 3, 300 kernels); max pooling layer (2 × 2); convolutional layer (3 × 3, 200 kernels); max pooling layer (2 × 2); three fully connected layers; output of size $K$.]

First, the sample patch is input into the first convolutional layer followed by a max pooling operation. The first convolutional layer filters the $k \times k \times d$ sample patch using 300 kernels of size $3 \times 3 \times d$. After this convolution, we have 300 feature maps, each of which is $n_1 \times n_1$, where $n_1 = k - 2$. We then perform max pooling on the combined set of features of size $n_1 \times n_1 \times 300$. The kernel size of the max pooling layer is $2 \times 2$ and thus the pooled feature maps have size $n_2 \times n_2 \times 300$, where $n_2 = n_1 / 2$. This set of pooled feature maps is then passed through a second pair of convolutional and max pooling layers.
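The size bookkeeping for the two convolution and pooling pairs of Figure 1 can be traced with a few lines of arithmetic. The sketch below is illustrative only. Two points are assumptions, not statements from the paper: the pooled size n/2 is rounded up for odd n (with k = 9 rounding down would leave an empty map after the second pooling), and the helper names are hypothetical. The fully connected widths 200 and 100 and the values k = 9, d = 220, K = 16 are the ones the paper quotes for the Indian Pines experiment.

```python
import math


def conv_valid(size, kernel=3):
    """Spatial size after a stride-1 convolution without padding (n - 2 for 3x3)."""
    return size - kernel + 1


def pool(size, kernel=2, round_up=True):
    """Spatial size after non-overlapping pooling.

    ASSUMPTION: round_up=True (ceiling) is a guess at how odd sizes are
    handled; the text only gives n / 2.
    """
    return math.ceil(size / kernel) if round_up else size // kernel


def trace_shapes(k, d, n_classes, round_up=True):
    """Return (layer name, output shape) pairs for the network of Figure 1."""
    n1 = conv_valid(k)
    n2 = pool(n1, round_up=round_up)
    n3 = conv_valid(n2)
    n4 = pool(n3, round_up=round_up)
    return [
        ("input", (k, k, d)),
        ("conv1", (n1, n1, 300)),
        ("pool1", (n2, n2, 300)),
        ("conv2", (n3, n3, 200)),
        ("pool2", (n4, n4, 200)),
        ("flatten", (n4 * n4 * 200,)),
        ("fc5", (200,)),
        ("fc6", (100,)),
        ("fc7", (n_classes,)),
    ]


shapes = trace_shapes(k=9, d=220, n_classes=16)
for name, shape in shapes:
    print(f"{name:8s} {shape}")
```

For k = 9 the spatial sizes run 9, 7, 4, 2, 1 under the rounding-up assumption, so the flattened vector fed to the first fully connected layer has 200 entries.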
The second convolutional layer contains 200 kernels of size $3 \times 3 \times 300$ (matching the 300 pooled feature maps it receives), which produce new feature maps of size $n_3 \times n_3 \times 200$, where $n_3 = n_2 - 2$. Again these new feature maps are input into the second max pooling layer with kernel size $2 \times 2$ and turned into a second set of pooled feature maps of size $n_4 \times n_4 \times 200$, where $n_4 = n_3 / 2$. Finally, the second pooled feature maps are flattened into a 1-dimensional vector $\mathbf{x}_{pool2}$ and input to the fully connected layers. The computation of the next three fully connected layers is as follows:

$$ f^{(5)}(\mathbf{x}_{pool2}) = \sigma(\mathbf{W}^{(5)} \mathbf{x}_{pool2} + \mathbf{b}^{(5)}), \tag{3} $$
$$ f^{(6)}(\mathbf{x}_{pool2}) = \sigma(\mathbf{W}^{(6)} f^{(5)}(\mathbf{x}_{pool2}) + \mathbf{b}^{(6)}), \tag{4} $$
$$ f^{(7)}(\mathbf{x}_{pool2}) = \mathbf{W}^{(7)} f^{(6)}(\mathbf{x}_{pool2}) + \mathbf{b}^{(7)}, \tag{5} $$

where $\mathbf{W}^{(5)}$, $\mathbf{W}^{(6)}$ and $\mathbf{W}^{(7)}$ are weight matrices, $\mathbf{b}^{(5)}$, $\mathbf{b}^{(6)}$ and $\mathbf{b}^{(7)}$ are the biases of the nodes, and $\sigma(\cdot)$ is the non-linear activation function. In our network, the activation functions in the convolutional layers and fully connected layers are all selected to be the rectified linear unit (ReLU) function. In order to simplify the notation, we denote $\mathbf{W} = \{\mathbf{W}^{(1)}, \mathbf{W}^{(3)}, \mathbf{W}^{(5)}, \mathbf{W}^{(6)}, \mathbf{W}^{(7)}\}$ and $\mathbf{b} = \{\mathbf{b}^{(1)}, \mathbf{b}^{(3)}, \mathbf{b}^{(5)}, \mathbf{b}^{(6)}, \mathbf{b}^{(7)}\}$, where $\{\mathbf{W}^{(1)}, \mathbf{W}^{(3)}\}$ and $\{\mathbf{b}^{(1)}, \mathbf{b}^{(3)}\}$ are the weight matrices and biases in the convolutional layers, respectively.

The final vector $f^{(7)} \in \mathbb{R}^{K}$ is then passed through the softmax (normalized exponential) function, which gives a distribution on the label. Let this softmax distribution on labels be $\tilde{\mathbf{y}}_i \in \mathbb{R}^{K}$ for a given sample $\mathbf{x}_i$. Then the predicted classification label is $\tilde{y}_i = \arg\max_k \tilde{y}_{ik}$. Therefore, given the training set $\mathcal{D}_l$, where we represent each scalar label $y_i \in \mathcal{K}$ of a training sample as a one-hot vector $\mathbf{y}_i \in \mathbb{R}^{K}$, the cross-entropy (CE) loss function can be computed as

$$ E(\mathbf{W}, \mathbf{b}) = \frac{1}{l} \sum_{i=1}^{l} \mathrm{CE}(\tilde{\mathbf{y}}_i, \mathbf{y}_i) = -\frac{1}{l} \sum_{i=1}^{l} \sum_{k=1}^{K} \big[ y_{ik} \log \tilde{y}_{ik} + (1 - y_{ik}) \log (1 - \tilde{y}_{ik}) \big], \tag{6} $$

where $\mathbf{W}$ and $\mathbf{b}$ are the parameter sets defined above. The loss function in Eq. (6) is optimized using the stochastic gradient descent (SGD) algorithm. In the $t$-th iteration, the weights and biases are updated by

$$ \mathbf{W}_{t+1} = \mathbf{W}_t - \alpha \left. \frac{\partial E(\mathbf{W}, \mathbf{b})}{\partial \mathbf{W}} \right|_{\mathbf{W}_t}, \tag{7} $$
$$ \mathbf{b}_{t+1} = \mathbf{b}_t - \alpha \left. \frac{\partial E(\mathbf{W}, \mathbf{b})}{\partial \mathbf{b}} \right|_{\mathbf{b}_t}, \tag{8} $$

where the gradients with respect to $\mathbf{W}$ and $\mathbf{b}$ are calculated using the back-propagation algorithm [26] and $\alpha$ is the learning rate, which is set as in our experiments.

B. Markov random field spatial prior

In image segmentation tasks, adjacent pixels are very likely to have the same label. Exploiting this simple prior information can often dramatically improve the segmentation performance. In this paper, we enforce a spatially smooth label prior by treating the labels in $\mathbf{y}$ as a Markov random field (MRF). This prior encourages neighboring pixels to belong to the same class. Specifically, the MRF prior distribution on the labels $\mathbf{y}$ is defined as

$$ P(\mathbf{y}) = \frac{1}{Z} \exp\Big( -\sum_{c \in \mathcal{C}} V_c(\mathbf{y}) \Big), \tag{9} $$

where $Z$ is a normalization constant for the distribution and $V_c(\mathbf{y})$ are the potentials for the set of cliques $\mathcal{C}$ over the image, defined by

$$ -V_c(\mathbf{y}) = \begin{cases} v_{y_i}, & \text{if } |c| = 1 \text{ (single clique)} \\ \mu_c, & \text{if } |c| > 1 \text{ and } \forall i, j \in c,\ y_i = y_j \\ -\mu_c, & \text{if } |c| > 1 \text{ and } \exists i, j \in c,\ y_i \neq y_j \end{cases} $$

where $v_{y_i}$ and $\mu_c > 0$ are two parameters. By changing the set of cliques $\mathcal{C}$ and the two parameters $v_{y_i}$ and $\mu_c$, the MRF prior can exhibit a great deal of flexibility. In this paper, we set $v_{y_i} = 1$ and $\mu_c = \mu$. Therefore, up to constant terms that are absorbed into $Z$ and a rescaling of $\mu$, Eq. (9) can be simplified into

$$ P(\mathbf{y}) = \frac{1}{Z} \exp\Big( \mu \sum_{(i,j) \in \mathcal{C}} \delta(y_i - y_j) \Big), \tag{10} $$

where $\delta(\cdot)$ is the unit impulse function (footnote 1). It should be noted that the pairwise interaction terms $\delta(y_i - y_j)$ assign higher probability when neighboring labels are equal than when they are not. In this way, the MRF prior encourages piecewise smooth segmentations.
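To make the effect of the prior in Eq. (10) concrete, the sketch below counts the pairwise agreement terms δ(y_i − y_j) on a small label grid and compares two labelings. It is an illustration under stated assumptions: the clique set is taken to be the 4-connected neighbor pairs (the text does not specify it), and the function names are hypothetical. Because Z is the same for both labelings, it cancels in the log ratio.

```python
def count_agreements(labels):
    """Sum of delta(y_i - y_j) over horizontally and vertically adjacent pairs.

    ASSUMPTION: the clique set is the 4-connected neighbourhood.
    """
    h, w = len(labels), len(labels[0])
    count = 0
    for r in range(h):
        for c in range(w):
            if c + 1 < w and labels[r][c] == labels[r][c + 1]:
                count += 1  # right neighbour has the same label
            if r + 1 < h and labels[r][c] == labels[r + 1][c]:
                count += 1  # lower neighbour has the same label
    return count


def log_prior_ratio(labels_a, labels_b, mu):
    """log P(a) - log P(b) under Eq. (10); the normaliser Z cancels."""
    return mu * (count_agreements(labels_a) - count_agreements(labels_b))


smooth = [[1, 1, 1],
          [1, 1, 1],
          [1, 1, 1]]
noisy = [[1, 1, 1],
         [1, 2, 1],
         [1, 1, 1]]

# mu = 20 is the smoothness value quoted for the Indian Pines experiment.
ratio = log_prior_ratio(smooth, noisy, mu=20)
```

The 3 × 3 grid has 12 neighbor pairs; flipping the center label breaks 4 of them, so the smooth labeling is favored by a log ratio of 20 × (12 − 8) = 80.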
The level of smoothness is controlled by the parameter $\mu$.

Footnote 1: The definition of the $\delta(\cdot)$ function is: $\delta(0) = 1$ and $\delta(y) = 0$ for $y \neq 0$.

C. A supervised CNN-MRF segmentation algorithm

After the class probability of each pixel vector is estimated using the CNN classifier, which yields a probability map $\mathbf{P} \in \mathbb{R}^{n \times K}$, and the labels are modeled using the MRF prior, our proposed final segmentation model is given by

$$ \hat{\mathbf{y}} = \arg\max_{\mathbf{y} \in \mathcal{K}^n} \Big\{ \sum_{i=1}^{n} \log P(y_i \mid \mathbf{x}_i, \mathbf{W}, \mathbf{b}) + \mu \sum_{(i,j) \in \mathcal{C}} \delta(y_i - y_j) \Big\}, \tag{11} $$

where $\mathbf{W}$ and $\mathbf{b}$ are the learned parameters of the CNN. This objective contains many pairwise interaction terms and is a combinatorial optimization problem that is difficult to solve. Many algorithms have been proposed to solve this model, such as the graph cut method [4], [19], belief propagation [35] and message passing [18]. In this paper, we adopt the α-expansion min-cut method [4] because of its good approximation performance and fast computational efficiency. To clarify the details of our entire segmentation algorithm, we depict the flowchart of the proposed method in Figure 2 and summarize the procedure in Algorithm 1.

[Fig. 2. The flowchart of the proposed CNN-MRF algorithm. Elements shown: training set $\mathcal{D}_l$; HSI patch set $\mathcal{X}$; learn CNN ($\mathbf{W}$, $\mathbf{b}$); probability map $\mathbf{P}$; MRF prior; α-expansion; label $\hat{\mathbf{y}}$.]

Algorithm 1 CNN-MRF hyperspectral image segmentation
Input: HSI patches $\mathcal{X}$, training data $\mathcal{D}_l$, learning rate $\alpha$, smoothness parameter $\mu$, batch size $n_{batch}$.
Output: Labels $\hat{\mathbf{y}}$.
1: Train the CNN classifier using $\mathcal{D}_l$;
2: Compute the probability map $\mathbf{P} \in \mathbb{R}^{n \times K}$ on $\mathcal{X}$ using the trained CNN;
3: Compute the segmentation labels $\hat{\mathbf{y}}$ = α-expansion($\mathbf{P}$, $\mu$).

IV. EXPERIMENTS

To validate the effectiveness of our proposed CNN-MRF segmentation algorithm in different scenarios, we conduct experiments on one synthetic dataset and the two real-world Indian Pines and Pavia University benchmark datasets.
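Step 3 of Algorithm 1 maximizes the objective of Eq. (11) given a probability map. The sketch below evaluates that objective and improves it with iterated conditional modes (ICM), a much simpler local search that is substituted here purely for illustration; it is not the α-expansion min-cut method used in the paper and generally finds weaker optima. The 4-connected neighborhood, the 0-based class indices and all function names are assumptions.

```python
import math


def objective(labels, log_prob, mu):
    """Eq. (11): sum_i log P(y_i | x_i) + mu * sum over neighbour pairs of delta."""
    h, w = len(labels), len(labels[0])
    total = 0.0
    for r in range(h):
        for c in range(w):
            y = labels[r][c]
            total += log_prob[r][c][y]
            if c + 1 < w and labels[r][c + 1] == y:
                total += mu
            if r + 1 < h and labels[r + 1][c] == y:
                total += mu
    return total


def argmax_labels(log_prob):
    """Per-pixel classification: the CNN-only prediction without the MRF."""
    return [[max(range(len(p)), key=lambda k: p[k]) for p in row] for row in log_prob]


def icm(log_prob, mu, sweeps=10):
    """Greedy per-pixel updates; each accepted change strictly raises Eq. (11)."""
    h, w, n_classes = len(log_prob), len(log_prob[0]), len(log_prob[0][0])
    labels = argmax_labels(log_prob)
    for _ in range(sweeps):
        changed = False
        for r in range(h):
            for c in range(w):
                nbrs = [labels[rr][cc]
                        for rr, cc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1))
                        if 0 <= rr < h and 0 <= cc < w]

                def local(k):
                    # All terms of Eq. (11) that involve pixel (r, c).
                    return log_prob[r][c][k] + mu * sum(1 for n in nbrs if n == k)

                best = max(range(n_classes), key=local)
                if local(best) > local(labels[r][c]):
                    labels[r][c] = best
                    changed = True
        if not changed:
            break
    return labels


# Toy 3 x 3 probability map with K = 2: every pixel favours class 0 except
# the centre, which weakly favours class 1 (made-up numbers).
border = [math.log(0.9), math.log(0.1)]
centre = [math.log(0.4), math.log(0.6)]
log_prob = [[border, border, border],
            [border, centre, border],
            [border, border, border]]

cnn_only = argmax_labels(log_prob)
smoothed = icm(log_prob, mu=1.0)
```

In this toy case the isolated center label is overturned by its four neighbors once µ is large enough, which mirrors the smoothing role the MRF plays after the CNN step; with µ = 0 the result equals the per-pixel prediction.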
For the purpose of comparison, we consider several state-of-the-art HSI segmentation methods, including the support vector machine graph-cut method (SVM-GC), subspace multinomial logistic regression with multilevel logistic prior (MLRsubMLL) (footnote 2) [22] and the support vector machine based on the 3-dimensional discrete wavelet transform (SVM-3DG) (footnote 3) [5]. We also compare with the deep learning methods SSD-CNN [38], SSDL [37] and DC-CNN [39]. For our proposed CNN-MRF approach (footnote 4), all the experiments are implemented in Python with the TensorFlow [1] library on a server with Nvidia GeForce GTX 1080 and Tesla Kc. All non-deep algorithms are run in Matlab R2014b on a PC with Intel Core i5 CPU 4460, 8GB RAM. For our comparisons with other deep learning models in Section IV-C, we use the best reported results for these algorithms.

Footnote 2: Code is available at: jun/demos.html
Footnote 3: Code is available at:
Footnote 4: Accompanying code will be released (currently in preparation).

All methods are compared numerically using the following three criteria [5]: overall accuracy (OA), average accuracy (AA) and the kappa coefficient (κ). Specifically, OA is the number of correctly classified samples divided by the total number of test samples, AA is the average of the individual class accuracies, and the kappa coefficient (κ) involves both omission and commission errors and accounts for the overall effective performance of the classifier. For all three criteria, a larger value indicates better segmentation performance.

TABLE I
OVERALL ACCURACIES (%), AVERAGE ACCURACIES (%), KAPPA STATISTICS AND RUNNING TIME OF ALL COMPETING METHODS ON THE SYNTHETIC DATASET.
Columns: SVM-GC, MLRsubMLL, SVM-3DG, CNN-MRF. Rows: OA, AA, Kappa, Time (s).

A. Synthetic HSI data

In this section, we validate the effectiveness of our proposed method on synthetic HSI data. For the synthetic HSI, the five endmembers are randomly extracted from a real scene with 162 bands in ranges nm, and 000 vectors are generated as a sum of Gaussian fields. This gives a dataset of size of . Specifically, this data is generated using a Generalized Bilinear Mixing Model (GBM) []:

$$ \mathbf{z} = \sum_{i=1}^{K} a_i \mathbf{e}_i + \sum_{i=1}^{K-1} \sum_{j=i+1}^{K} \gamma_{ij} a_i a_j \, \mathbf{e}_i \odot \mathbf{e}_j + \mathbf{n}, \tag{12} $$

with class number $K = 5$. Here $\mathbf{z}$ is the simulated pixel vector, $\gamma_{ij}$ are selected uniformly from $[0, 1]$, $\mathbf{e}_i$, $i = 1, \ldots, 5$, are the five endmembers, $\odot$ denotes the element-wise product, $\mathbf{n}$ is Gaussian noise with an SNR of 30 dB, $a_i \ge 0$ and $\sum_{i=1}^{K} a_i = 1$. For this experiment, we randomly select 1% of the samples from each class as training data and use the remainder for testing. We show the final segmentation results in Table I and Figure 3.
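The three criteria reported in Table I can all be read off a confusion matrix. The sketch below is a plain-Python illustration rather than the evaluation code behind the paper's numbers. OA and AA follow the definitions given in this section; for κ the standard Cohen's kappa formula is assumed, since the text describes the coefficient only in words. Class indices are 0-based and the function names are hypothetical.

```python
def confusion_matrix(y_true, y_pred, n_classes):
    """cm[t][p] counts test samples of true class t predicted as class p."""
    cm = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        cm[t][p] += 1
    return cm


def oa_aa_kappa(cm):
    """Return (overall accuracy, average accuracy, kappa) as fractions in [0, 1]."""
    n_classes = len(cm)
    n = sum(sum(row) for row in cm)
    row_tot = [sum(cm[k]) for k in range(n_classes)]
    col_tot = [sum(cm[r][k] for r in range(n_classes)) for k in range(n_classes)]

    # OA: correctly classified samples over all test samples.
    oa = sum(cm[k][k] for k in range(n_classes)) / n

    # AA: mean of per-class accuracies (classes absent from the test set are skipped).
    per_class = [cm[k][k] / row_tot[k] for k in range(n_classes) if row_tot[k] > 0]
    aa = sum(per_class) / len(per_class)

    # Kappa (ASSUMED Cohen's form): agreement beyond what the row and column
    # totals would produce by chance. Undefined when chance agreement equals 1.
    pe = sum(row_tot[k] * col_tot[k] for k in range(n_classes)) / (n * n)
    kappa = (oa - pe) / (1 - pe)
    return oa, aa, kappa


# Made-up toy labels for two classes.
y_true = [0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 1, 1, 1]
cm = confusion_matrix(y_true, y_pred, n_classes=2)
oa, aa, kappa = oa_aa_kappa(cm)
```

For the toy labels above, 5 of 6 samples are correct (OA ≈ 0.833), the per-class accuracies are 3/4 and 1 (AA = 0.875), and chance agreement is 0.5, giving κ ≈ 0.667. AA exceeding OA here shows why both are reported when class sizes are unbalanced.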
From Table I, it can be seen that our proposed CNN-MRF method achieves higher performance with respect to OA, AA and Kappa than the other competing methods, although it takes more time to converge. For these results, we emphasize that all the methods in this experiment use the MRF to post-process the classification map, and the only difference is the method for extracting features. Thus the superiority of our method in this experiment is strictly due to using the CNN. From Figure 3, we can see that the segmentation maps obtained by our method are visually closer to the ground truth map than those of the other methods, which is consistent with the quantitative metric OA. We conduct this synthetic experiment only to evaluate the effectiveness of our method, and to briefly introduce and analyze the experimental results. The parameter settings and the performance evaluation of our method are studied in more detail in the following sections.

[Fig. 3. Segmentation maps obtained by all competing methods on the synthetic dataset (overall accuracies are reported in parentheses). Panels: Ground-truth; SVM-GC (96.44%); SVM-3DG (91.09%); MLRsubMLL (97.92%); CNN-MRF (98.86%).]

B. Real HSI Data

We next test our method on two real HSI benchmark datasets, the Indian Pines data and Pavia University data, in a variety of experimental settings.

1) AVIRIS Indian Pines Data: This data set was gathered by the Airborne Visible/Infrared Imaging Spectrometer (AVIRIS) sensor over the Indian Pines test site in North-western Indiana in June . The original dataset contains 220 spectral reflectance bands in the wavelength range µm, of which 20 bands cover the region of water absorption. Unlike other methods which remove the 20 polluted bands, we keep all 220 bands in our experiments. This HSI has a spectral resolution of 10 nm and a spatial resolution of 20 m per pixel, and the spatial dimension is . The ground truth contains 16 land cover classes.
This dataset poses a challenging problem because of the significant presence of mixed pixels in all available classes and also because of the unbalanced number of available labeled pixels per class. A sample band of this data and the related ground truth data are shown in Fig. 4.

First, to evaluate the validity of our proposed CNN-MRF method with limited training samples, we randomly choose 10% of the available labeled samples for each class from the reference data, and the remaining samples in each class are used for testing. The available training and testing sets are summarized in Table II. This experiment is repeated 20 times for each method and the average performance is reported.

TABLE II
STATISTICS OF THE INDIAN PINES DATA SET, INCLUDING THE NAME, THE NUMBER OF TRAINING, TEST AND TOTAL SAMPLES FOR EACH CLASS.

No   Name                            Train   Test   Total
1    Alfalfa
2    Corn-no till
3    Corn-min till
4    Corn
5    Grass-pasture
6    Grass-trees
7    Grass-pasture-mowed             3       25     28
8    Hay-windrowed                   48      430    478
9    Oat                             2       18     20
10   Soybean-no till                 98      847    972
11   Soybean-min till                246     2209   2455
12   Soybean-clean                   60      533    593
13   Wheat                           21      184    205
14   Woods                           127     1138
15   Buildings-Grass-Trees-Drives
16   Stone-Steel-Towers
     Total

[Fig. 4. Indian Pines image and related ground truth categorization information. (a) The original HSI. (b) The ground truth categorization map. (c) The training map. (d) The test map.]

Additionally, some parameters need to be set in advance in the experiments. For the compared methods, these parameters are set as their papers suggest. For the SVM-based methods, the RBF kernel parameter γ and the penalty parameter C are tuned through 5-fold cross validation ($\gamma = 2^{-8}, 2^{-7}, \ldots, 2^{8}$, $C = 2^{-8}, 2^{-7}, \ldots, 2^{8}$). For SVM-GC and SVM-3DG, the spatial smoothness parameter β is set to 0.75 as advised by [5], [31]. For MLRsubMLL, the smoothness parameter µ and the threshold parameter τ are set following [22]. For our proposed CNN-MRF method, several parameters also need to be tuned, such as the sample patch size k, the network depth, the network width (the number of kernels in each convolutional layer), the kernel size in each convolutional layer, the kernel size in each max pooling layer and the number of nodes in the fully connected layers. In this experiment, after testing several values, we fix these parameters as follows (also shown in Figure 1):

1) The network depth is 7 (not including the input layer): two pairs of convolutional and max pooling layers and three fully connected layers.
2) The kernel size in both convolutional layers is set to 3 and the numbers of kernels are 300 and 200, respectively.
3) The kernel size in the max pooling layers is 2.
4) The three fully connected layers have 200, 100 and 16 nodes, respectively.
5) For the training patches, the patch size k is set to 9.
6) The smoothness parameter µ is set to 20.

In a later section, we will study the impact of these parameters on the performance of our method.

For the above experimental settings, in order to measure the performance improvement due to including spatial contextual information with the MRF, we also report the classification results of each method without using the MRF prior, and call these methods SVM, MLRsub, SVM-3D and CNN, respectively.
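The per-class sampling used in this experiment (10% of the labeled samples of each class for training, the rest for testing) can be sketched as follows. This is an illustration, not the authors' sampling code. One detail is an assumption: the per-class training count is rounded up, which is consistent with the Table II rows whose values survive here (for example 2 of 20 for Oat and 3 of 28 for Grass-pasture-mowed). The function name and the `seed` argument are hypothetical.

```python
import random


def stratified_split(labels, percent, seed=0):
    """Split sample indices per class into train (about percent %) and test.

    ASSUMPTION: the training count per class is ceil(percent % of the class
    size), computed with integer arithmetic to avoid floating-point rounding.
    """
    rng = random.Random(seed)
    by_class = {}
    for idx, y in enumerate(labels):
        by_class.setdefault(y, []).append(idx)

    train, test = [], []
    for y in sorted(by_class):
        idxs = by_class[y][:]
        rng.shuffle(idxs)
        n_train = (len(idxs) * percent + 99) // 100  # integer ceiling
        train.extend(idxs[:n_train])
        test.extend(idxs[n_train:])
    return sorted(train), sorted(test)


# Two of the smallest Indian Pines classes, using the totals from Table II.
labels = ["Oat"] * 20 + ["Grass-pasture-mowed"] * 28
train, test = stratified_split(labels, percent=10)
n_oat_train = sum(1 for i in train if labels[i] == "Oat")
```

Repeating the split with different `seed` values gives the kind of independent random draws that the 20 repetitions of the experiment call for.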
Classification and segmentation maps for all the competing methods on the Indian Pines dataset are shown in Fig. 5, and the accuracies (i.e. the classification accuracy of each class, OA, AA and kappa coefficient κ) are reported and compared in Table III. From Table III, we highlight two main results. First, CNN-MRF and CNN, which are both deep learning methods, achieve the first and second best performance in terms of the three criteria (OA, AA and κ). Specifically, they attain about 5% improvement in terms of OA in this scenario, with SVM-3DG and SVM-3D trailing slightly behind (89.99% and 85.88% OA, respectively). This means that using a CNN for segmentation without the MRF performs better than an MRF-based model using another classifier such as the SVM, and so deep learning is significantly helping here. Moreover, as depicted in Fig. 5, the classification maps of the CNN-based methods are noticeably closer to the ground truth map. Finally, directly comparing the MRF and non-MRF based methods, we can conclude that using an MRF prior significantly improves the classification accuracy of a particular classifier because it further embeds the spatial smoothness information into the segmentation stage. Therefore, the superior performance of our proposed CNN-MRF method can be explained by using the CNN and MRF strategies simultaneously to fully exploit the spectral and spatial information in an HSI. We also show the runtime of all the methods in Table III, from which we can conclude that the time cost of our proposed method, while acceptable, is a potential drawback that could be addressed with high-performance computing resources.

We also validate our segmentation result (OA) by the Wilcoxon test [14]. This is a non-parametric statistical hypothesis test, which bases the hypothesis decision on the p-value, and determines whether the differences of the classification results between two methods are statistically significant.
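As an illustration of how such a p-value can be obtained, the sketch below implements an exact two-sided Wilcoxon signed-rank test in plain Python. Several things are assumptions: that the paired (signed-rank) form of the test is the one meant, that the paired quantities are per-run results of two methods, and that there are no tied absolute differences (the sketch raises an error otherwise). The input numbers are made up. Library routines such as SciPy's `scipy.stats.wilcoxon` cover ties and larger samples.

```python
def wilcoxon_signed_rank(x, y):
    """Exact two-sided Wilcoxon signed-rank test for paired samples.

    Returns (statistic, p_value), where statistic = min(W+, W-).
    ASSUMPTION: zero differences are dropped and ties are not supported.
    """
    d = [a - b for a, b in zip(x, y) if a != b]
    n = len(d)
    order = sorted(range(n), key=lambda i: abs(d[i]))
    magnitudes = [abs(d[i]) for i in order]
    if any(magnitudes[i] == magnitudes[i + 1] for i in range(n - 1)):
        raise ValueError("tied absolute differences are not handled in this sketch")

    # W+ is the sum of the ranks (1 = smallest |d|) of the positive differences.
    w_plus = sum(rank for rank, i in enumerate(order, start=1) if d[i] > 0)
    total = n * (n + 1) // 2
    stat = min(w_plus, total - w_plus)

    # counts[s] = number of subsets of ranks {1..n} that sum to s, i.e. the
    # exact null distribution of W+ when every sign pattern is equally likely.
    counts = [1] + [0] * total
    for r in range(1, n + 1):
        for s in range(total, r - 1, -1):
            counts[s] += counts[s - r]

    tail = sum(counts[:stat + 1])
    p_value = min(1.0, 2.0 * tail / 2 ** n)
    return stat, p_value


# Made-up numbers of correctly classified test pixels over six paired runs.
correct_a = [941, 948, 939, 952, 945, 950]
correct_b = [897, 905, 894, 910, 899, 903]
stat, p_value = wilcoxon_signed_rank(correct_a, correct_b)
```

With six pairs that all favor the first method, the statistic is 0 and the exact two-sided p-value is 2/64 = 0.03125, the smallest value attainable with six pairs. With five pairs the minimum is 0.0625, so a 0.05 threshold can only be met with at least six paired runs.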
A p-value smaller than 0.05 indicates that the difference between classification accuracies is statistically significant at the 5% significance level. The results of the Wilcoxon test are shown in Table IV, including comparisons between CNN-MRF and the other methods. The statistical difference (p-value < 0.05) in Table IV indicates that the improved performance of CNN-MRF is statistically significant.

2) ROSIS Pavia University Data: The second dataset we consider was acquired by the Reflective Optics System Imaging Spectrometer (ROSIS) over the urban area of the University of Pavia, northern Italy, on July 8, . The original dataset consists of 115 spectral bands ranging from 0.43 to 0.86 µm, of which 12 noisy bands are removed and only 103 bands are retained in our experiments. This scene has a spatial resolution of 1.3 m per pixel, and the spatial dimension is . There are 9 land cover classes in this scene and the number of samples in each class is displayed in Table VI. We show a sample band of this image and the corresponding ground truth map in Fig. 7.

To evaluate the performance of our proposed CNN-MRF method using only a small number of labeled training samples, we randomly chose samples for each class from the ground truth data for training, and use the remainder for testing. The related statistics are summarized in Table VI. As previously, we repeat this experiment 20 times. In this experiment, all parameters involved in the compared methods are tuned in the same way as for the previous Indian Pines experiment.

TABLE III. Overall, average and individual class accuracies (%) and kappa statistics of all methods on the Indian Pines image test set. Columns: classification algorithms (SVM, MLRsub, SVM-3D, CNN) and segmentation algorithms (SVM-GC, MLRsubMLL, SVM-3DG, CNN-MRF). Rows: individual classes, OA, AA, Kappa, Time (s).

Fig. 5. Classification/segmentation maps obtained by all methods on the Indian Pines dataset (overall accuracies are reported in parentheses): SVM (77.29%), MLRsub (63.03%), SVM-3D (85.88%), CNN (93.23%), ground truth, SVM-GC (85.93%), MLRsubMLL (70.88%), SVM-3DG (89.99%), CNN-MRF (94.22%).

TABLE IV. p-values obtained through Wilcoxon tests between CNN-MRF and other methods (SVM, MLRsub, SVM-3D, CNN, SVM-GC, MLRsubMLL, SVM-3DG) on the Indian Pines dataset.

The network structure settings of the CNN are also the same. Classification and segmentation maps obtained by all the compared methods on this dataset are illustrated in Fig. 6, and the accuracies (i.e., the classification accuracy of each class, OA, AA and kappa coefficient κ) are summarized in Table V. From Table V, we can conclude that, for this dataset, our CNN-MRF approach achieves the best performance in terms of the three criteria OA, AA and kappa coefficient κ. For example, it gives the highest segmentation OA (95.68%). It is also worth noting that the CNN without MRF again performs second best with respect to OA and kappa coefficient (κ), which again means that the CNN plays a dominant role in improving the classification accuracy. By using the MRF prior, CNN-MRF obtains about 1% improvement in terms of the OA compared with the CNN. However, for other segmentation algorithms that perform more poorly, we observe that the MRF can greatly boost classification accuracy. For example, MLRsubMLL has about 25% improvement with respect to the OA compared to MLRsub. SVM-3DG has the third best segmentation OA (94.40%, only 1.28% lower than CNN-MRF) due to the 3D discrete wavelet transform (3DDWT), as has been previously studied in [5]. Meanwhile, it can be seen from Fig. 6 that CNN-MRF obtains a much smoother segmentation map than the other methods, which is consistent with the results shown in Table V. Consequently, the improvement of our CNN-MRF approach can be explained by the use of the CNN and MRF models simultaneously. We again show the runtime in Table V, again noting the reasonable runtime. We again validate the segmentation result (OA) by the Wilcoxon test, shown in Table VII, and can draw similar conclusions as previously.

TABLE V. Overall, average and individual class accuracies (%) and kappa statistics of all competing methods on the Pavia University image test set. Columns: classification algorithms (SVM, MLRsub, SVM-3D, CNN) and segmentation algorithms (SVM-GC, MLRsubMLL, SVM-3DG, CNN-MRF). Rows: individual classes, OA, AA, Kappa, Time (s).

Fig. 6. Classification and segmentation maps obtained by all competing methods on the Pavia University dataset (overall accuracies are reported in parentheses): SVM (73.41%), MLRsub (66.52%), SVM-3D (90.50%), CNN (94.80%), ground truth, SVM-GC (90.36%), MLRsubMLL (91.13%), SVM-3DG (94.40%), CNN-MRF (95.68%).

TABLE VI. Statistics of the Pavia University data set, including the name and the number of training, test and total samples for each class. Classes: Asphalt, Meadows, Gravel, Trees, Painted metal sheets, Bare Soil, Bitumen, Self-Blocking Bricks, Shadows. Total training samples: 360.

Fig. 7. Pavia University image and related ground truth categorization information. (a) The original HSI. (b) The ground truth categorization map. (c) The training map. (d) The test map.

In order to analyze the sensitivity of the proposed method to different training sets consisting of a limited number of training samples, we conduct additional experiments in which 0.1%, 0.2%, 0.3%, 0.4% and 0.5% of each class are randomly selected from the Pavia University data as training samples and the remaining samples are used for testing. The OAs of all methods are displayed in Fig. 8. From Fig. 8(a), it can be observed that the classification results of the CNN method outperform the other methods for each training set size. Additionally, when the spatial prior is considered using the MRF, the segmentation results in Fig. 8(b) significantly improve the corresponding classification results in Fig. 8(a), again indicating that the MRF is an important factor for improving the classification accuracy.

Fig. 8. Overall accuracy (%) obtained by all competing methods with different proportions of training samples on the Pavia University dataset. (a) Classification results. (b) Segmentation results.

TABLE VII. p-values obtained through Wilcoxon tests between CNN-MRF and other competing methods (SVM, MLRsub, SVM-3D, CNN, SVM-GC, MLRsubMLL, SVM-3DG) on the Pavia University dataset.

C. Comparison with other deep learning methods

In order to further evaluate the performance of our proposed CNN-MRF method, we compare with three recent deep learning HSI classification algorithms: SSD-CNN [38], SSDL [37] and DC-CNN [39].
We use the two previous real datasets for comparison. For the Indian Pines dataset, we choose 10% of the samples from each class as the training set, and for the Pavia University dataset, we choose 5% of the samples from each class as the training set. The results for each algorithm are displayed in Table VIII and Table IX; the results for SSD-CNN, SSDL and DC-CNN are taken from [39]. Results for Indian Pines are shown in Table VIII, where it can be seen that our proposed CNN-MRF method achieves improved performance compared with the other competing methods in terms of OA, AA and kappa coefficient (κ). We show results for the Pavia University data in Table IX. We again observe that our method achieves competitive performance, but is outperformed by DC-CNN on this data. However, we note that our method runs much faster than DC-CNN on both datasets.

TABLE VIII. Overall accuracy (%) of neural network methods on the Indian Pines dataset. Columns: SSD-CNN, SSDL, DC-CNN, CNN-MRF. Rows: OA, AA, kappa (κ), Time.

TABLE IX. Overall accuracy (%) of neural network methods on the Pavia University dataset. Columns: SSD-CNN, SSDL, DC-CNN, CNN-MRF. Rows: OA, AA, kappa (κ), Time.

D. Impact of parameter settings

In this section, we evaluate the sensitivity of performance to different parameter settings. In the following experiments, we use the Indian Pines dataset and randomly choose 5% of the samples from each class as training data. The remaining data is used for testing.
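The per-class random split used in these experiments can be sketched as follows. This is a minimal illustration rather than the authors' code: it assumes the ground truth is available as a flat list of integer labels, that label 0 marks unlabeled pixels, and that at least one training sample is kept per class. The function name `stratified_split` and its parameters are hypothetical.

```python
import random
from collections import defaultdict


def stratified_split(labels, fraction, seed=0, ignore_label=0):
    """Pick `fraction` of the labeled pixels of every class for training.

    labels       : flat sequence of integer class labels, one per pixel
    fraction     : share of each class used for training (e.g. 0.05)
    ignore_label : label of unlabeled pixels (assumed to be 0 here)
    Returns two sorted lists of pixel indices: (train, test).
    """
    rng = random.Random(seed)

    # Group pixel indices by class, skipping unlabeled pixels.
    by_class = defaultdict(list)
    for idx, lab in enumerate(labels):
        if lab != ignore_label:
            by_class[lab].append(idx)

    train, test = [], []
    for lab in sorted(by_class):
        idxs = by_class[lab][:]
        rng.shuffle(idxs)
        # Assumption: keep at least one training sample per class.
        n_train = max(1, round(fraction * len(idxs)))
        train.extend(idxs[:n_train])
        test.extend(idxs[n_train:])
    return sorted(train), sorted(test)


# Toy label map: 10 unlabeled pixels, 100 pixels of class 1, 40 of class 2.
toy_labels = [0] * 10 + [1] * 100 + [2] * 40
train_idx, test_idx = stratified_split(toy_labels, fraction=0.05, seed=1)
print(len(train_idx), len(test_idx))
```

Repeating the experiment, as done 20 times in the earlier sections, would amount to calling the function with a different `seed` for each run.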

1) Kernel Size: First, we test the impact of different kernel sizes. The default kernel size for both convolutional layers is 3. We fix the kernel size of the second layer as 3 and evaluate the performance by varying the kernel size in the first layer. Table X shows these results. From this table we can conclude that larger kernel sizes can obtain better results. This is because more structure and texture can be captured using a larger kernel size. Nevertheless, in our experiments we set the kernel size of the first layer to 3, because it achieves almost the same result as the best-performing case, where the kernel size is set to 5.

TABLE X. Overall accuracy (%) with different kernel sizes. Columns: Kernel Size, OA.

2) Network Width: Next, we evaluate the impact of network width in the convolutional layer on the segmentation results. The default settings of network width for the two convolutional layers are 300 and 200, respectively. Fixing the network width of the first convolutional layer to 300, we test the performance by changing the width of the second convolutional layer. These results are displayed in Table XI. It can be observed that the results are not very sensitive to the network width, and thus in our experiments we select 200 as the default setting of the network width in the second convolutional layer.

TABLE XI. Overall accuracy (%) with different network widths. Columns: Network Width, OA.

3) Network Depth: We also conduct experiments to test the impact of different network depths by reducing or adding nonlinear layers. We train and test on 5 networks with depths 5, 7, 9, 11 and 13. The experimental results are summarized in Table XII. From this table it can be seen that increasing the network depth does not always generate better results, as a result of the vanishing gradient problem encountered by deeper neural networks. For this experiment, the best performance is achieved by setting the network depth to 7, which we use as the default setting for the network depth in our experiments.

TABLE XII. Overall accuracy (%) with different network depths. Columns: Network Depth, OA.

4) Patch Size: We also investigate the performance of our proposed method with respect to different data patch sizes k = {1, 3, 5, 9, 13}. These results are shown in Table XIII. As can be seen, a larger patch size generates better results. This is because more spatial information is incorporated into the training process. However, it also takes more computation time to train the network with an increasing patch size. Thus we choose 9 as the default setting of patch size as a tradeoff between performance and running time.

TABLE XIII. Overall accuracy (%) with different patch sizes. Columns: Patch Size, OA.

5) Smoothness parameter: Finally, we test the impact of different values of the smoothness parameter µ on the obtained segmentation results. Table XIV shows these results, where it can be observed that setting µ = 20 produces the best segmentation results. Thus we use this value in our experiments.

TABLE XIV. Overall accuracy (%) with different smoothness parameters. Columns: Smooth Parameter, OA.

For other parameters, such as the kernel size in the max pooling layer which is set as 2, the number of nodes in the two fully connected layers, which are set as 200 and 100 respectively, the learning rate α which is set as and the batch size n_batch which is set as 100, we observe consistent performance under variations. Therefore, we do not report their impact on the performance in this section.

V. CONCLUSIONS

In this paper, we proposed a novel technique for HSI segmentation that incorporates both spectral and spatial information in a unified framework. Specifically, we use a convolutional neural network (CNN) in combination with a Markov random field to classify HSI pixel vectors in a way that fully takes spatial and spectral information into account.
We then efficiently learn the segmentation result using an optimized α-expansion graph-cut algorithm. Experimental results on one synthetic HSI dataset and two real benchmark HSI datasets show that our method outperforms state-of-the-art methods, including deep and non-deep models.

REFERENCES

[1] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, et al. TensorFlow: Large-scale machine learning on heterogeneous distributed systems. arXiv preprint.
[2] T. V. Bandos, L. Bruzzone, and G. Camps-Valls. Classification of hyperspectral images with regularized linear discriminant analysis. IEEE Transactions on Geoscience and Remote Sensing, 47(3).
[3] J. A. Benediktsson, J. A. Palmason, and J. R. Sveinsson. Classification of hyperspectral data from urban areas based on extended morphological profiles. IEEE Transactions on Geoscience and Remote Sensing, 43(3).
[4] Y. Boykov, O. Veksler, and R. Zabih. Fast approximate energy minimization via graph cuts. IEEE Transactions on Pattern Analysis and Machine Intelligence, 23(11).
[5] X. Cao, L. Xu, D. Meng, Q. Zhao, and Z. Xu. Integration of 3-dimensional discrete wavelet transform and Markov random field for hyperspectral image classification. Neurocomputing, 226:90–100.
[6] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille. Semantic image segmentation with deep convolutional nets and fully connected CRFs. arXiv preprint, 2014.
[7] Y. Chen, Z. Lin, X. Zhao, G. Wang, and Y. Gu. Deep learning-based classification of hyperspectral data. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 7(6).
[8] Y. Chen, N. M. Nasrabadi, and T. D. Tran. Hyperspectral image classification using dictionary-based sparse representation. IEEE Transactions on Geoscience and Remote Sensing, 49(10).
[9] Y. Chen, N. M. Nasrabadi, and T. D. Tran. Hyperspectral image classification via kernel sparse representation. IEEE Transactions on Geoscience and Remote Sensing, 51(1).
[10] Y. Chen, X. Zhao, and X. Jia. Spectral-spatial classification of hyperspectral data based on deep belief network. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 8(6).
[11] M. Dalla Mura, A. Villa, J. A. Benediktsson, J. Chanussot, and L. Bruzzone. Classification of hyperspectral images by using extended morphological attribute profiles and independent component analysis. IEEE Geoscience and Remote Sensing Letters, 8(3).
[12] C. Dong, C. C. Loy, K. He, and X. Tang. Image super-resolution using deep convolutional networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 38(2).
[13] X. Fu, J. Huang, X. Ding, Y. Liao, and J. Paisley. Clearing the skies: A deep network architecture for single-image rain removal. arXiv preprint.
[14] J. D. Gibbons and S. Chakraborti. Nonparametric statistical inference. Springer.
[15] W. Hu, Y. Huang, L. Wei, F. Zhang, and H. Li. Deep convolutional neural networks for hyperspectral image classification. Journal of Sensors, 2015.
[16] S. Ji, W. Xu, M. Yang, and K. Yu. 3D convolutional neural networks for human action recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(1).
[17] S. Jia, X. Zhang, and Q. Li. Spectral-spatial hyperspectral image classification using regularized low-rank representation and sparse representation-based graph cuts. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 8(6).
[18] V. Kolmogorov. Convergent tree-reweighted message passing for energy minimization. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(10).
[19] V. Kolmogorov and R. Zabih. What energy functions can be minimized via graph cuts? IEEE Transactions on Pattern Analysis and Machine Intelligence, 26(2).
[20] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems.
[21] D. A. Landgrebe. Signal theory methods in multispectral remote sensing, volume 29. John Wiley & Sons.
[22] J. Li, J. M. Bioucas-Dias, and A. Plaza. Spectral-spatial hyperspectral image segmentation using subspace multinomial logistic regression and Markov random fields. IEEE Transactions on Geoscience and Remote Sensing, 50(3).
[23] G. Licciardi, P. R. Marpu, J. Chanussot, and J. A. Benediktsson. Linear versus nonlinear PCA for the classification of hyperspectral data based on the extended morphological profiles. IEEE Geoscience and Remote Sensing Letters, 9(3).
[24] Z. Lin, Y. Chen, X. Zhao, and G. Wang. Spectral-spatial classification of hyperspectral image using autoencoders. In Information, Communications and Signal Processing (ICICS), International Conference on, pages 1–5. IEEE.
[25] Y. Qian, M. Ye, and J. Zhou. Hyperspectral image classification based on structured sparse logistic regression and three-dimensional wavelet texture features. IEEE Transactions on Geoscience and Remote Sensing, 51(4).
[26] D. E. Rumelhart, G. E. Hinton, and R. J. Williams. Learning internal representations by error propagation. Technical report, DTIC Document.
[27] L. Shen and S. Jia. Three-dimensional Gabor wavelets for pixel-based hyperspectral imagery classification. IEEE Transactions on Geoscience and Remote Sensing, 49(12).
[28] A. Soltani-Farani, H. R. Rabiee, and S. A. Hosseini. Spatial-aware dictionary learning for hyperspectral image classification. IEEE Transactions on Geoscience and Remote Sensing, 53(1).
[29] X. Sun, Q. Qu, N. M. Nasrabadi, and T. D. Tran. Structured priors for sparse-representation-based hyperspectral image classification. IEEE Geoscience and Remote Sensing Letters, 11(7).
[30] C. Szegedy, A. Toshev, and D. Erhan. Deep neural networks for object detection. In Advances in Neural Information Processing Systems.
[31] Y. Tarabalka and A. Rana. Graph-cut-based model for spectral-spatial classification of hyperspectral images. In 2014 IEEE International Geoscience and Remote Sensing Symposium (IGARSS). IEEE.
[32] A. Villa, J. A. Benediktsson, J. Chanussot, and C. Jutten. Hyperspectral image classification with independent component discriminant analysis. IEEE Transactions on Geoscience and Remote Sensing, 49(12).
[33] J. Xie, L. Xu, and E. Chen. Image denoising and inpainting with deep neural networks. In Advances in Neural Information Processing Systems.
[34] Y. Xu, Z. Wu, and Z. Wei. Spectral-spatial classification of hyperspectral image based on low-rank decomposition. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 8(6).
[35] J. S. Yedidia, W. T. Freeman, and Y. Weiss. Constructing free-energy approximations and generalized belief propagation algorithms. IEEE Transactions on Information Theory, 51(7).
[36] S. Yu, S. Jia, and C. Xu. Convolutional neural networks for hyperspectral image classification. Neurocomputing, 219:88–98.
[37] J. Yue, S. Mao, and M. Li. A deep learning framework for hyperspectral image classification using spatial pyramid pooling. Remote Sensing Letters, 7(9).
[38] J. Yue, W. Zhao, S. Mao, and H. Liu. Spectral-spatial classification of hyperspectral images using deep convolutional neural networks. Remote Sensing Letters, 6(6).
[39] H. Zhang, Y. Li, Y. Zhang, and Q. Shen. Spectral-spatial classification of hyperspectral imagery using a dual-channel convolutional neural network. Remote Sensing Letters, 8(5).
[40] W. Zhu, V. Chayes, A. Tiard, S. Sanchez, D. Dahlberg, D. Kuang, A. L. Bertozzi, S. Osher, and D. Zosso. Unsupervised classification in hyperspectral imagery with nonlocal total variation and primal-dual hybrid gradient algorithm. arXiv preprint, 2016.
