SSDH: Semi-supervised Deep Hashing for Large Scale Image Retrieval


SSDH: Semi-supervised Deep Hashing for Large Scale Image Retrieval
Jian Zhang, Yuxin Peng, and Junchao Zhang
arXiv preprint [cs.CV], 28 Jul 2016

This work was supported by the National Hi-Tech Research and Development Program of China (863 Program) under Grant 2014AA015102, and by the National Natural Science Foundation of China. The authors are with the Institute of Computer Science and Technology, Peking University, Beijing 100871, China. Corresponding author: Yuxin Peng (e-mail: pengyuxin@pku.edu.cn).

Abstract—The hashing methods have been widely used for efficient similarity retrieval on large scale image datasets. The traditional hashing methods learn hash functions to generate binary codes from hand-crafted features, which achieve limited accuracy since the hand-crafted features cannot optimally represent the image content and preserve the semantic similarity. Recently, several deep hashing methods have shown better performance because the deep architectures generate more discriminative feature representations. However, these deep hashing methods are mainly designed for supervised scenarios, which only exploit the semantic similarity information but ignore the underlying data structures. In this paper, we propose the semi-supervised deep hashing (SSDH) method to perform more effective hash learning by simultaneously preserving the semantic similarity and the underlying data structures. Our proposed approach can be divided into two phases. First, a deep network is designed to extensively exploit both the labeled and unlabeled data, in which we construct the similarity graph online in a mini-batch with the deep feature representations. To the best of our knowledge, our proposed deep network is the first deep hashing method that can perform hash code learning and feature learning simultaneously in a semi-supervised fashion. Second, we propose a loss function suitable for the semi-supervised scenario by jointly minimizing the empirical error on the labeled data as well as the embedding error on both the labeled and unlabeled data, which can preserve the semantic similarity as well as capture the meaningful neighbors on the underlying data structures for effective hashing. Experiment results on 4 widely used datasets show that the proposed approach outperforms state-of-the-art hashing methods.

Index Terms—Semi-supervised deep hashing, large scale image retrieval, underlying data structures, online graph construction.

I. INTRODUCTION

RECENT years have witnessed the rapid growth of image data on the web, and content-based image retrieval (CBIR) on large scale datasets has attracted much attention, which tries to return similar images that match the visual query from a large image dataset. Nearest Neighbor (NN) search is the most basic scheme for content-based image retrieval. However, retrieving the exact nearest neighbors is computationally impracticable when the reference dataset becomes very large, since NN search usually exhaustively compares the query with each sample in the reference dataset. Therefore, many efforts have been focused on approximate nearest neighbor (ANN) search, which is sufficient and efficient enough for many practical applications. Instead of finding the exact nearest neighbors,
ANN search [1] relaxes the goal of retrieval by finding an approximate nearest neighbor p of the query q such that d(p, q) ≤ (1 + ǫ) d(p*, q), where ǫ > 0 and p* denotes the exact nearest neighbor of the query q; this results in a minor loss in accuracy but provides considerable computational efficiency. Many approaches have been designed to perform efficient ANN search, including tree-based methods and hash-based methods. The tree-based methods, such as the KD tree [2], [3], usually organize data samples in efficient data structures to achieve fast search. However, the tree-based methods suffer from the curse of dimensionality, that is to say, the search efficiency greatly degrades with high-dimensional data. The hash-based methods, representatively Locality Sensitive Hashing (LSH) [4], aim to map the high-dimensional data into low-dimensional binary code representations. On the one hand, hash codes are usually organized into a hash table which only takes constant query time; on the other hand, hash-based methods lead to substantially reduced storage as they only need to store the binary codes. Thus the hash-based methods support more efficient ANN search and have attracted more attention in recent years.

Generally speaking, the hashing methods learn compact binary codes for each data point, so that similar data examples are mapped into similar binary codes. Traditional hashing methods [4]-[7] take pre-extracted image features as input, then learn the hash functions by exploiting the data structures or applying similarity preserving regularizations. These methods can be divided into two categories, namely unsupervised and supervised methods. Recently, inspired by the remarkable progress of deep networks, some deep hashing methods have also been proposed [5], [8]-[12], which take advantage of deep networks to perform hash learning.

The unsupervised methods design the hash functions by using the unlabeled data to generate the binary codes. Locality Sensitive Hashing (LSH) [4] is a typical unsupervised method, which uses random linear projections to map data into binary codes. Instead of using randomly generated hash functions, some data-dependent methods [7], [13]-[19] have been proposed to capture the data properties, such as data distributions and manifold structures. For example, Spectral Hashing (SH) [7] designs the hash functions to be balanced and uncorrelated. Anchor Graph Hashing (AGH) [15] and Locally Linear Hashing (LLH) [14] utilize the manifold structures of data to learn compact hash codes, of which the former uses anchor graphs to discover the neighborhood structures, and the latter uses locality sensitive sparse coding to capture the local linear structures and then recovers these structures

in Hamming space. Iterative Quantization (ITQ) [13] attempts to generate the hash codes by minimizing the quantization error of mapping data to the vertices of a binary hypercube. Topology Preserving Hashing (TPH) [16], [18] performs hash learning by preserving consistent neighborhood rankings of data points in Hamming space.

While the unsupervised methods are promising for retrieving neighbors under some kind of distance metric (such as the L2 distance), the neighbors in the feature space cannot optimally reflect the semantic similarity. Therefore, supervised hashing methods have been proposed, which utilize semantic information such as image labels to generate effective hash codes. For example, Binary Reconstructive Embedding (BRE) [20] learns the hash codes by minimizing the reconstruction error between the original distances and the reconstructed distances in the Hamming space. Supervised Hashing with Kernels (KSH) [21] learns the hash codes by preserving the pairwise relations between data samples provided by labels. Semi-supervised Hashing (SSH) [6] learns the hash codes by minimizing the empirical error over the labeled data while maximizing the information entropy of the generated hash codes over the labeled and unlabeled data. In [22], Norouzi et al. propose to learn hash functions based on a triplet ranking loss that can preserve relative semantic similarity. Ranking Preserving Hashing (RPH) [23] and Order Preserving Hashing (OPH) [24] learn the hash codes by preserving the ranking information, which is obtained based on the number of shared semantic labels between data examples. Supervised Discrete Hashing (SDH) [25] leverages label information to obtain hash codes by integrating hash code generation and classifier training.

For most of the unsupervised and supervised hashing methods, the input images are represented by hand-crafted features (e.g., GIST [26]), which cannot optimally represent the semantic information of images. Inspired by the fast progress of deep networks on image classification [27], some deep hashing methods have been proposed [5], [8]-[12] to take advantage of the superior feature representation power of deep networks. Convolutional Neural Network Hashing (CNNH) [8] is a supervised hashing method based on convolutional networks. CNNH is a two-stage framework, which learns fixed hash codes in the first stage, and learns the hash functions and image representations in the second stage. Although the learned hash codes can guide the feature learning, the learned image representations cannot provide feedback for learning better hash codes. To overcome the shortcomings of the two-stage learning scheme, some approaches have been proposed to perform feature learning and hash code learning simultaneously. Network in Network Hashing (NINH) [5] designs a triplet ranking loss to capture the relative similarities of images. NINH is a one-stage supervised method, thus the image representation learning and hash code learning can benefit each other in the deep architecture. Some similar ranking-based deep hashing methods [9], [10] have also been proposed recently, which also design the hash codes to preserve the ranking information obtained from labels. However, the existing deep hashing methods are mainly supervised methods, which heavily rely on labeled images, and the deep networks often require a large amount of labeled data to achieve better performance.
But labeling images consumes large amounts of time and human resources, which is not practical in real applications. Thus it is necessary to make full use of the unlabeled images to improve the deep hashing performance. In this paper, we propose the semi-supervised deep hashing (SSDH) method, which learns the hash codes by preserving the semantic similarity and the underlying data structures simultaneously. To the best of our knowledge, our proposed SSDH is the first deep hashing method that can perform hash code learning and feature learning simultaneously in a semi-supervised fashion. The main contributions of this paper can be summarized as follows:

Existing deep hashing methods only design the loss functions to preserve the semantic similarity but ignore the underlying data structures, which are essential to capture the semantic neighborhoods and return meaningful nearest neighbors for effective hashing [15]. To address this problem, we propose a semi-supervised loss function to exploit the underlying structures of the unlabeled data for more effective hashing. By jointly minimizing the empirical error on the labeled data as well as the embedding error on both the labeled and unlabeled data, the learned hash codes can not only preserve the semantic similarity, but also capture the meaningful neighbors on the underlying data structures. Thus our proposed SSDH method can benefit greatly from both the labeled and unlabeled data.

A deep network structure is designed to make full use of both the labeled and unlabeled data to learn hash codes and image features simultaneously. In our deep network, the representation learning layers extract discriminative deep features from the input images. Then the hash code learning layer and the classification layer are connected, of which the former maps the image features into q-dimensional hash vectors, and the latter predicts the pseudo labels of the unlabeled data. The last representation learning layer, the hash code learning layer and the classification layer are connected to the loss layer, and the proposed semi-supervised loss function is designed to model the semantic similarity constraints while preserving the underlying data structures of the image features. The whole network is trained in an end-to-end way, thus our proposed method can perform hash code learning and feature learning simultaneously.

The traditional graph construction methods are time consuming due to their O(n^2) complexity, which is intractable for large scale data. To address this issue, we propose an online graph construction strategy. Rather than constructing the similarity graph over all the data in advance, our online graph construction strategy constructs the similarity graph within a mini-batch during the training procedure. On the one hand, the online graph construction strategy only needs to consider a much smaller mini-batch of data, which is efficient and suitable for the batch-based training of deep networks. On the other hand, the online graph construction can benefit from the effective feature

representations extracted from deep networks.

Experiments on 4 widely used datasets show that our proposed SSDH method achieves the best search accuracy compared with state-of-the-art methods. The rest of this paper is organized as follows. Section II presents a brief review of related deep hashing methods. Section III introduces our proposed SSDH method. Section IV shows our experiments on the 4 widely used image datasets, and Section V concludes this paper.

II. RELATED WORK

The hashing methods have become one of the most popular methods for similarity retrieval on large scale image datasets. In recent years, some hashing methods have been proposed to perform hash learning with deep networks and have made considerable progress. In this section, we briefly review some representative deep hashing methods.

A. Convolutional Neural Network Hashing

Convolutional Neural Network Hashing (CNNH) [8] decomposes the hash learning into two stages: a hash code learning stage and a hash function learning stage. Given the training images I = {I_1, I_2, ..., I_n}, in the hash code learning stage (Stage 1), CNNH learns an approximate hash code for each training image by optimizing the following loss function:

\min_H \| S - \frac{1}{q} H H^T \|_F^2    (1)

where \|\cdot\|_F denotes the Frobenius norm; S denotes the semantic similarity of the image pairs in I, in which S_{ij} = 1 when images I_i and I_j are semantically similar, and otherwise S_{ij} = -1; H ∈ {-1, 1}^{n×q} denotes the approximate hash code matrix, each row of which is a q-dimensional hash code. Following KSH [21], which utilizes the equivalence between optimizing the code inner products and the Hamming distances, CNNH also tries to approximately reconstruct the similarity matrix S by the hash code inner product \frac{1}{q} H H^T, thus optimizing equation (1) means minimizing the reconstruction error. CNNH first relaxes the integer constraints on H and randomly initializes H ∈ [-1, 1]^{n×q}, then optimizes equation (1) using a coordinate descent algorithm, which is equivalent to optimizing the following equation:

\min_{H_{\cdot j}} \| H_{\cdot j} H_{\cdot j}^T - (q S - \sum_{c \neq j} H_{\cdot c} H_{\cdot c}^T) \|_F^2    (2)

where H_{\cdot j} and H_{\cdot c} denote the j-th and the c-th columns of H respectively. In the hash function learning stage (Stage 2), CNNH utilizes deep networks to learn the image feature representations and hash functions. Specifically, CNNH adopts the deep framework in [28] as its basic network, and designs an output layer with softmax activation to generate q-dimensional hash codes. CNNH trains the designed deep network in a supervised way, in which the approximate hash codes learned in Stage 1 are used as the ground truth. In addition, if the discrete class labels of the training images are available, CNNH also incorporates these image labels to learn the hash functions. Based on the deep network, CNNH learns deep features and hash functions at the same time. However, CNNH is a two-stage framework, where the learned deep features in Stage 2 cannot help to improve the approximate hash code learning in Stage 1, which limits the performance of hash learning.

B. Network in Network Hashing

Different from the two-stage framework of CNNH [8], Network in Network Hashing (NINH) [5] integrates the image representation learning and hash code learning into a one-stage framework.
NINH constructs the deep framework for hash learning based on the Network in Network architecture [29], with a shared sub-network composed of several stacked convolutional layers to extract the image features, as well as a divide-and-encode module encouraged by a sigmoid activation function and a piece-wise threshold function to output the binary hash codes. During the learning process, without generating approximate hash codes in advance, NINH designs a triplet ranking loss function to exploit the relative similarity of the training images to directly guide the hash learning:

\ell_{triplet}(I, I^+, I^-) = \max(0, 1 - (\|b(I) - b(I^-)\|_H - \|b(I) - b(I^+)\|_H))
s.t. b(I), b(I^+), b(I^-) ∈ {0, 1}^q    (3)

where I, I^+ and I^- specify the triplet constraint that image I is more similar to image I^+ than to image I^- based on the image labels; b(\cdot) denotes the binary hash code, and \|\cdot\|_H denotes the Hamming distance. For easy optimization of equation (3), NINH applies two relaxation tricks: relaxing the integer constraint on the binary hash codes and replacing the Hamming distance with the Euclidean distance.

C. Bit-scalable Deep Hashing

The Bit-scalable Deep Hashing (BS-DRSCH) method [9] is also a one-stage deep hashing method that constructs an end-to-end architecture for hash learning. In addition, BS-DRSCH is able to flexibly manipulate the hash code length by weighing each bit of the hash codes. BS-DRSCH defines a weighted Hamming affinity to measure the dissimilarity between two hash codes:

H(b(I_i), b(I_j)) = \sum_{k=1}^{q} w_k^2 \, b_k(I_i) b_k(I_j)    (4)

where I_i and I_j are training images; b(\cdot) denotes the q-dimensional binary hash code, and b_k(\cdot) denotes the k-th hash bit in b(\cdot); w_k is the weight value attached to b_k(\cdot), which is to be learned and is expected to represent the significance of b_k(\cdot). Similar to NINH [5], BS-DRSCH organizes the training images into triplet tuples, then formulates the loss function as follows:

loss = \sum_{I, I^+, I^-} \max(H(b(I), b(I^+)) - H(b(I), b(I^-)), C) + \frac{1}{2} \sum_{i,j} H(b(I_i), b(I_j)) S_{ij}    (5)

where S_{ij} denotes the similarity between images I_i and I_j based on labels. The defined loss is the sum of two terms: the first term is a max-margin term, which is designed to preserve the ranking orders; the second term is a regularization term, which is defined to keep the pairwise adjacency relations. BS-DRSCH builds a deep convolutional neural network to perform hash learning, in which the convolutional layers are used to extract deep features, while the fully-connected layer followed by a tanh-like layer is used to generate the hash codes. For learning the weight values {w_k}, k = 1, ..., q, BS-DRSCH designs an element-wise layer on the top of the deep architecture, which is able to weigh each bit of the hash code. Both NINH [5] and BS-DRSCH [9] are supervised deep hashing methods, and they only design the loss functions to preserve the semantic similarity based on the labeled training data, but ignore the underlying data structures of the unlabeled data, which are essential to capture the semantic neighborhoods for effective hashing.

Fig. 1. Overview of our proposed deep network architecture. The training set is composed of both the labeled and unlabeled images (the notations y_1, y_2 and y_3 denote different image labels). The representation learning layers extract discriminative deep features for the input images, which are also used for online graph construction. Then the hash code learning layer and the classification layer are connected, of which the former maps the image features into q-dimensional hash vectors, and the latter predicts the pseudo labels of the unlabeled data. The last representation learning layer, the hash code learning layer and the classification layer are all connected to the semi-supervised loss layer. In addition, a softmax loss layer is also used to train the classification layer.

III. SEMI-SUPERVISED DEEP HASHING

Given a set of n images X composed of the labeled images (X_L, Y_L) = {(x_1, y_1), (x_2, y_2), ..., (x_l, y_l)} and the unlabeled images X_U = {x_{l+1}, x_{l+2}, ..., x_n}, where x_i ∈ R^D, i = 1, ..., n and y_i ∈ {1, 2, ..., C}, i = 1, ..., l, Y_L is the label information. The goal of hashing methods is to learn a mapping function H: X → {-1, 1}^q, which encodes an image x ∈ X into a q-dimensional binary code H(x) in the Hamming space, while preserving the semantic similarity of images. Existing deep hashing methods [5], [8], [9] are designed for supervised scenarios, which only use the labeled data (X_L, Y_L) to preserve the semantic similarity but ignore the underlying data structures. In this paper, we propose a novel deep hashing method called semi-supervised deep hashing (SSDH) to learn better hash codes by preserving the semantic similarity and the underlying data structures simultaneously. As shown in Fig. 1, the deep architecture of SSDH includes three main components: 1) A deep convolutional neural network is designed to learn discriminative deep features for images, which takes both the labeled and unlabeled image data as input. 2) A hash code learning layer and a classification layer are connected, of which the former is used to map the image features into q-dimensional hash codes, and the latter is used to predict the pseudo labels of the unlabeled data. 3) A semi-supervised loss is designed to preserve the semantic similarity and meanwhile capture the neighborhood structures for effective hashing, which is formulated as minimizing the empirical error on the labeled data as well as the embedding error on both the labeled and unlabeled data.
These three components are integrated into a unified framework, thus our proposed SSDH can perform image representation learning and hash code learning simultaneously. In the following parts of this section, we first introduce the proposed deep network architecture, and then we introduce the proposed semi-supervised loss function as well as the training of the proposed network.

A. The Proposed Deep Network

The deep architecture of the proposed SSDH is shown in Fig. 1. We adopt the Fast CNN architecture (denoted as VGG-F for simplicity) from [30] as the base network. The first seven layers are configured the same as those in VGG-F [30], which build the representation learning layers to extract the discriminative image features. We mainly focus on designing the hash code learning layer and the semi-supervised loss

function. The eighth layer is the hash code learning layer, which is constructed by a fully-connected layer followed by a sigmoid activation function, and its outputs are hash codes defined as:

h(x) = sigmoid(W^T f(x) + v)    (6)

where f(x) is the deep feature extracted from the last representation learning layer, i.e., the output of the fully-connected layer fc7, W denotes the weights in the hash code learning layer, and v is the bias parameter. Through the hash code learning layer, the image features f(x) are mapped into [0, 1]^q. Since the hash codes h(x) ∈ [0, 1]^q are continuous real values, we apply a thresholding function g(x) = sgn(x - 0.5) to obtain binary codes:

b_k(x) = g(h_k(x)) = sgn(h_k(x) - 0.5),  k = 1, 2, ..., q    (7)

The ninth layer is the classification layer, which predicts the pseudo labels of the unlabeled data; we use the softmax loss function to train the classification layer. The outputs of the last representation learning layer (fc7), the hash code learning layer (fc8) and the classification layer (fc9) are connected with the semi-supervised loss function, so that the proposed loss function can make full use of the underlying data structures of the extracted image features and the pseudo labels to improve the hash learning performance. The whole network is an end-to-end structure, which can perform hash code learning and feature learning simultaneously.

TABLE I
DETAILED NETWORK CONFIGURATIONS OF OUR PROPOSED SSDH MODEL

Layers | Configuration
conv1  | filter 64x11x11, st. 4x4, pad 0, LRN, pool 2x2
conv2  | filter 256x5x5, st. 1x1, pad 2, LRN, pool 2x2
conv3  | filter 256x3x3, st. 1x1, pad 1
conv4  | filter 256x3x3, st. 1x1, pad 1
conv5  | filter 256x3x3, st. 1x1, pad 1, pool 2x2
fc6    | 4096, dropout
fc7    | 4096, dropout
fc8    | q-dimensional, sigmoid
fc9    | c-dimensional

The detailed configurations of our network structure are shown in Table I. For the five convolutional layers: the filter parameter specifies the number and the receptive field size of the convolutional kernels as num x size x size; the st. and pad parameters specify the convolution stride and spatial padding respectively; LRN indicates whether Local Response Normalization (LRN) [27] is used; and pool indicates whether max-pooling downsampling is applied, and specifies the pooling window size as size x size. For the fully-connected layers, we specify their dimensionality, of which the first two are 4096, the hash layer is the same as the hash code length q, and the classification layer is the same as the class number c. The dropout entry indicates whether the fully-connected layer is regularized by dropout [27]. The hash code learning layer takes sigmoid as its activation function, while the other weighted layers take the Rectified Linear Unit (ReLU) [27] as activation functions. Note that although we adopt VGG-F [30] as our base network, it is practicable to replace the base network with other deep convolutional network structures, such as the Network in Network (NIN) [29] model or the AlexNet [27] model.

B. Semi-supervised Deep Hashing Learning

Most of the existing deep hashing methods are designed for the supervised scenario, and construct their loss functions to preserve the pairwise similarity [8] or ranking similarity [5], [9], [10] based on the label annotations. The deep networks often require a large amount of labeled data to achieve better performance; however, labeling images consumes large amounts of time and human resources, which is not practical in real-world application scenarios.
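Before turning to the loss design, a minimal NumPy sketch may help make the hash mapping of equations (6) and (7) concrete: a fully-connected projection with sigmoid activation followed by thresholding at 0.5. It is an illustration only; in SSDH the weights W and bias v are learned end-to-end inside the network, and the shapes and random stand-in parameters below are assumptions.

```python
import numpy as np

def hash_layer(f_x, W, v):
    """Hash code learning layer of equation (6): h(x) = sigmoid(W^T f(x) + v)."""
    z = W.T @ f_x + v
    return 1.0 / (1.0 + np.exp(-z))        # relaxed codes h(x) in [0, 1]^q

def binarize(h):
    """Thresholding of equation (7): b_k(x) = sgn(h_k(x) - 0.5), giving codes in {-1, +1}^q."""
    return np.where(h >= 0.5, 1, -1)

# Illustrative usage with random stand-ins for the learned parameters.
d, q = 4096, 48                             # fc7 feature size and hash code length
rng = np.random.default_rng(0)
f_x = rng.normal(size=d)                    # fc7 feature of one image (assumed given)
W, v = 0.01 * rng.normal(size=(d, q)), np.zeros(q)
b = binarize(hash_layer(f_x, W, v))         # 48-bit binary code
```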
To address the problem of scarce labeled data, we propose a semi-supervised loss function, which can jointly preserve the semantic similarity and capture the neighborhood constraints on the underlying data structures. Our proposed semi-supervised loss function is composed of three terms: a supervised ranking term applied on the labeled data, a semi-supervised embedding term applied on both the labeled and unlabeled data, and a pseudo-label term applied on the unlabeled data. The first term preserves the semantic similarity by minimizing the empirical error, the second term captures the underlying data structures by minimizing the embedding error, and the third term encourages the decision boundary to lie in low density regions, which is complementary to the semi-supervised embedding term. Thus the proposed SSDH model can fully benefit from both the labeled and unlabeled data. We introduce these three terms separately in the following parts of this subsection.

1) Supervised Ranking Term: Inspired by [5], [9], we formulate the supervised ranking term as a triplet ranking regularization to preserve the relative semantic similarity provided by the label annotations. For the labeled images (X_L, Y_L), we sample a set of triplet tuples depending on the labels, T = {(x_i^L, x_i^{L+}, x_i^{L-})}_{i=1}^{t}, in which x_i^L and x_i^{L+} are two similar images with the same labels, while x_i^L and x_i^{L-} are two dissimilar images with different labels, and t is the number of sampled triplet tuples. For the triplet tuple (x_i^L, x_i^{L+}, x_i^{L-}), i = 1, ..., t, the supervised ranking term is defined as:

J_L(x_i^L, x_i^{L+}, x_i^{L-}) = \max(0, m_t + \|b(x_i^L) - b(x_i^{L+})\|_H - \|b(x_i^L) - b(x_i^{L-})\|_H)    (8)

where \|\cdot\|_H denotes the Hamming distance, and the constant parameter m_t defines the margin between the relative similarities of the two pairs (x_i^L, x_i^{L+}) and (x_i^L, x_i^{L-}); that is to say, we expect the Hamming distance of the dissimilar pair (x_i^L, x_i^{L-}) to be larger than that of the similar pair (x_i^L, x_i^{L+}) by at least m_t. Thus minimizing J_L reaches our goal of preserving the semantic ranking information provided by labels. In equation (8), the binary hash codes and the Hamming distance make it hard to optimize directly. Similar to NINH [5], we first replace the Hamming distance with the Euclidean distance, then

we relax the binary hash codes b(x) to the continuous real-valued hash codes h(x). Equation (8) can then be rewritten as:

J_L(x_i^L, x_i^{L+}, x_i^{L-}) = \max(0, m_t + \|h(x_i^L) - h(x_i^{L+})\|^2 - \|h(x_i^L) - h(x_i^{L-})\|^2)    (9)

2) Semi-supervised Embedding Term: Although we can train the network solely based on the supervised ranking term, deep models with multiple layers often require a large amount of labeled data to achieve better performance. Usually it is hard to collect abundant labeled images, since labeling images consumes large amounts of time and human resources. Thus it is necessary to improve the search accuracy by utilizing the unlabeled data, which are more convenient to obtain. The manifold assumption [31] states that data points within the same manifold are likely to have the same labels. Based on this assumption, graph-based embedding algorithms are often used to capture the underlying manifold structures of data. Inspired by the semi-supervised embedding algorithms in deep networks [32], we add an embedding term which captures the underlying structures of both the labeled and unlabeled data.

Modeling the neighborhood structures of data often requires building a neighborhood graph. A neighborhood graph is defined as G = (V, E), where the vertices V represent the data points, and the edges E, which are often represented by an n x n adjacency matrix A, indicate the adjacency relations among the data points in a specified space (usually the feature space). But for large scale data, building the neighborhood graph on all the data is time and memory consuming, since the time complexity and memory cost of building the k-NN graph on all the data are both O(n^2), which is intractable for large n. Instead of constructing the graph offline on all the data, we propose to construct the graph online within a mini-batch by using the features extracted from the proposed deep network. More specifically, given the mini-batch of r images, B = {x_i}_{i=1}^{r}, and the deep features {f(x_i) : x_i ∈ B}, we use the deep features to calculate the k nearest neighbors NN_k(i) for each image x_i, and then construct the adjacency matrix as:

A(i, j) = 1,  if I(i, j) = 0 and x_j ∈ NN_k(i)
A(i, j) = 0,  if I(i, j) = 0 and x_j ∉ NN_k(i)
A(i, j) = -1, if I(i, j) = 1    (10)

where I(i, j) is an indicator function: I(i, j) = 1 if x_i and x_j are both labeled images; otherwise I(i, j) = 0. A(i, j) = -1 means that we do not build the neighborhood graph between two labeled image points, since we can use the supervised ranking term to model their similarity relations. Because the size of a mini-batch is much smaller than the size of the whole training data, i.e., r << n, the graph is efficient to compute online. Furthermore, building the neighborhood graph online can benefit from the representative deep features, since the training procedure of the deep network generates more and more discriminative feature representations, which leads to capturing the semantic neighbors more accurately. Based on the constructed adjacency matrix, we expect the neighbor image pairs to have small distances, while the non-neighbor image pairs have large distances. We apply a pairwise contrastive regularization [33] to achieve this goal.
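Before writing out the embedding term, a minimal sketch of the online graph construction of equation (10) may help: it builds the adjacency matrix from the fc7 features of one mini-batch (a NumPy sketch under assumed inputs; the function and variable names are illustrative, not taken from the original implementation).

```python
import numpy as np

def online_adjacency(feats, is_labeled, k):
    """Mini-batch adjacency matrix of equation (10).

    feats      : (r, d) fc7 features of the r images in the mini-batch
    is_labeled : (r,) boolean mask, True for labeled images
    k          : number of nearest neighbors used to define NN_k(i)
    Returns A with A[i, j] in {1, 0, -1}:
      1  -> x_j is among the k nearest neighbors of x_i (and not both labeled)
      0  -> not a neighbor (and not both labeled)
      -1 -> both images are labeled, so the supervised ranking term handles them
    """
    r = feats.shape[0]
    # Pairwise squared Euclidean distances between deep features.
    sq = np.sum(feats ** 2, axis=1)
    dist = sq[:, None] + sq[None, :] - 2.0 * feats @ feats.T
    np.fill_diagonal(dist, np.inf)          # exclude self-matches

    A = np.zeros((r, r), dtype=int)
    for i in range(r):
        nn = np.argsort(dist[i])[:k]        # NN_k(i) within the mini-batch
        A[i, nn] = 1
    both_labeled = np.outer(is_labeled, is_labeled)
    A[both_labeled] = -1                    # I(i, j) = 1 case
    return A
```

Since only r images are involved and r << n, the O(r^2) distance computation here is negligible compared with the forward and backward passes of the network.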
For one image pair (x_i, x_j), our semi-supervised embedding term is then formulated as:

J_U(x_i, x_j) = \|b(x_i) - b(x_j)\|_H,                 if A(i, j) = 1
J_U(x_i, x_j) = \max(0, m_p - \|b(x_i) - b(x_j)\|_H),  if A(i, j) = 0    (11)

where J_U(x_i, x_j) is the contrastive loss between the two images x_i and x_j based on the neighborhood graph, \|\cdot\|_H represents the Hamming distance, and the constant m_p is a margin parameter for the distance metric of non-neighbor image pairs. With the neighborhood graphs constructed on the mini-batches, the semi-supervised embedding term J_U captures the embedding error on both the labeled and unlabeled data. We minimize J_U to preserve the underlying neighborhood structures. Similar to equation (8), we adopt the same relaxation strategy on equation (11), and rewrite the embedding term as:

J_U(x_i, x_j) = \|h(x_i) - h(x_j)\|^2,                 if A(i, j) = 1
J_U(x_i, x_j) = \max(0, m_p - \|h(x_i) - h(x_j)\|)^2,  if A(i, j) = 0    (12)

3) Pseudo-label Term: Apart from our graph approach, D.-H. Lee [34] proposes the pseudo-label method for the image classification task, which can also make use of the unlabeled data. The pseudo-label method assigns to each unlabeled data point the class with the maximum predicted probability as its true label. The maximum a posteriori estimation encourages this approach to place the decision boundary in low density regions, which improves the generalization of the classification model. Different from our proposed graph approach, which encourages neighbor image pairs to have small distances and non-neighbor image pairs to have large distances, the pseudo-label method encourages each image to have a 1-of-C prediction; these two approaches are complementary to each other. Thus we integrate the pseudo-label method into our framework. More specifically, in our network structure, the top layers have two branches: one is the hash code learning layer, and the other is a classification layer, which predicts the pseudo labels of the unlabeled data. We also take the class with the maximum predicted probability as the true label. Then we use the contrastive loss function to model the semantic similarity between the pseudo labels. Similar to equation (12), the pseudo-label term is defined as follows:

J_P(x_i, x_j) = \|h(x_i) - h(x_j)\|^2,                 if PL_i = PL_j
J_P(x_i, x_j) = \max(0, m_p - \|h(x_i) - h(x_j)\|)^2,  if PL_i ≠ PL_j    (13)

where PL_i and PL_j are the predicted pseudo labels.

4) The Semi-supervised Loss: Combining the supervised ranking term, the semi-supervised embedding term and the

pseudo-label term, we define the semi-supervised loss function as:

J = \sum_{i=1}^{t} J_L(x_i^L, x_i^{L+}, x_i^{L-}) + λ \sum_{i,j=1}^{n} J_U(x_i, x_j) + µ \sum_{i,j=1}^{n} J_P(x_i, x_j)
s.t. x_i^L, x_i^{L+}, x_i^{L-} ∈ X_L,  x_i, x_j ∈ X    (14)

where t is the number of triplet tuples sampled from the labeled images, and λ and µ are the balance parameters. In order to keep the neighbor and non-neighbor image pairs balanced, we only use a small part of the image pairs obtained from the neighborhood graph to calculate the loss. More specifically, for each image x_i, we randomly select a neighbor pair (x_i, x_i^+) and a non-neighbor pair (x_i, x_i^-) for the semi-supervised embedding term, and we randomly select a positive pair (x_i^L, x_{p_i}^{L+}) and a negative pair (x_i^L, x_{p_i}^{L-}) for the pseudo-label term; then we rewrite the loss function as:

J = \sum_{i=1}^{t} J_L(x_i^L, x_i^{L+}, x_i^{L-}) + λ \sum_{i=1}^{n} (J_U(x_i, x_i^+) + J_U(x_i, x_i^-)) + µ \sum_{i=1}^{n} (J_P(x_i^L, x_{p_i}^{L+}) + J_P(x_i^L, x_{p_i}^{L-}))    (15)

In our semi-supervised loss function, the supervised ranking term is applied on the labeled data to preserve the semantic similarity provided by image labels, while the semi-supervised embedding term and the pseudo-label term are applied on both the labeled and unlabeled data to preserve the underlying data structures. Thus the proposed loss function fits the semi-supervised scenario, and makes full use of both the labeled and unlabeled data to improve the hash learning performance.

C. Network Training

The proposed SSDH network is trained in a mini-batch way, that is to say, we only use a mini-batch of training images to update the network's parameters in one iteration. For each iteration, the input image batch B ⊂ X is composed of some labeled images (X_L^B, Y_L^B) = {(x_1, y_1), (x_2, y_2), ..., (x_m, y_m)} and some unlabeled images X_U^B = {x_{m+1}, x_{m+2}, ..., x_r}, where m denotes the number of labeled images in the mini-batch, and r denotes the total number of images in the mini-batch. The proposed network structure has two branches on the top, i.e., the hash code learning layer and the classification layer; we train these two layers simultaneously. For the classification layer, we use the softmax loss function to model the prediction errors on the labeled images; minimizing the softmax loss makes the predicted pseudo labels more accurate. For the hash code learning layer, with equations (9), (12), (13) and (15), we adopt stochastic gradient descent (SGD) to train the deep network. For each triplet tuple and image pair, we use the back-propagation algorithm (BP) to update the parameters of the network. More specifically, according to equation (9), when the hinge loss is positive, we can compute the gradients for each triplet tuple (x_i^L, x_i^{L+}, x_i^{L-}) with regard to h(x_i^L), h(x_i^{L+}) and h(x_i^{L-}) respectively as:

∂J_L / ∂h(x_i^L) = 2(h(x_i^{L-}) - h(x_i^{L+}))
∂J_L / ∂h(x_i^{L+}) = 2(h(x_i^{L+}) - h(x_i^L))
∂J_L / ∂h(x_i^{L-}) = 2(h(x_i^L) - h(x_i^{L-}))    (16)

Since the semi-supervised embedding term and the pseudo-label term have the same form, according to equations (12) and (13), for the neighbor pair (x_i, x_i^+) (or the positive pseudo-label pair (x_i^L, x_{p_i}^{L+})), we compute the gradients w.r.t. h(x_i) and h(x_i^+) respectively as:

∂J_U / ∂h(x_i) = 2(h(x_i) - h(x_i^+))
∂J_U / ∂h(x_i^+) = 2(h(x_i^+) - h(x_i))    (17)

and for the non-neighbor pair (x_i, x_i^-) (or the negative pseudo-label pair (x_i^L, x_{p_i}^{L-})), we compute the gradients w.r.t.
h(x_i) and h(x_i^-) respectively as:

∂J_U / ∂h(x_i) = -2(m_p - \|h(x_i) - h(x_i^-)\|) \cdot sgn(h(x_i) - h(x_i^-))
∂J_U / ∂h(x_i^-) = -2(m_p - \|h(x_i) - h(x_i^-)\|) \cdot sgn(h(x_i^-) - h(x_i))    (18)

which hold when the margin in equation (12) is violated; otherwise the gradients are zero. These derivative values can be fed into the network via the back-propagation algorithm to update the parameters of each layer in the deep network. The training procedure ends when the loss converges or a predefined maximal iteration number is reached. After the deep model is trained, we can generate a q-bit binary code for any input image. For an input image x, it is first encoded into a hash code h(x) ∈ [0, 1]^q by the trained model; then we can compute the binary code for hashing by equation (7). In this way, the deep network can be trained end-to-end in a semi-supervised fashion, newly input images can be encoded into binary hash codes by the deep model, and we can perform efficient image retrieval via fast Hamming distance calculation between the binary hash codes.

IV. EXPERIMENTS

In this section, we introduce the experiments conducted on 4 widely used datasets (MNIST, CIFAR-10, NUS-WIDE and MIR Flickr) with 7 state-of-the-art methods, including the unsupervised methods LSH [4], SH [7] and ITQ [13], and the supervised methods SDH [25], CNNH [8], NINH [5] and DRSCH [9]. LSH, SH, ITQ and SDH are traditional hashing methods, which take hand-crafted image features as input to learn hash functions, while CNNH, NINH, DRSCH and the proposed SSDH approach are deep hashing methods, which take raw image pixels as input to conduct hash learning. In the following of this section, we first introduce the datasets and experiment settings, and then we compare the proposed approach with the state-of-the-art methods. To further verify each term in

the proposed semi-supervised loss function, we also present the results of only using the first term (the supervised ranking term), of using the first and second terms (the supervised ranking term and the semi-supervised embedding term), and of using all three terms (our SSDH).

Fig. 2. The comparison results on the MNIST dataset: (a) precision curves within top 500 retrieved samples w.r.t. different lengths of hash codes; (b) precision curves with 48-bit hash codes w.r.t. different numbers of top retrieved samples; (c) precision-recall curves of Hamming ranking with 48-bit hash codes.

A. Datasets

We conduct experiments on 4 widely used image retrieval datasets:

The MNIST dataset contains 70,000 gray scale handwritten digit images from 0 to 9 with a resolution of 28x28. Following [8], 1,000 images are randomly selected as the query set (100 images per class). For the unsupervised methods, all the remaining images are used as the training set. For the supervised methods, 5,000 images (500 images per class) are randomly selected from the remaining images as the training set.

The CIFAR-10 dataset consists of 60,000 color images with a resolution of 32x32 from 10 classes, each of which has 6,000 images. For a fair comparison, following [5], [8], 1,000 images are randomly selected as the query set (100 images per class). For the unsupervised methods, all the remaining images are used as the training set. For the supervised methods, 5,000 images (500 images per class) are further randomly selected from the remaining images to form the training set.

The NUS-WIDE [35] dataset contains nearly 270,000 images, and each image is associated with one or multiple labels from 81 semantic concepts. Following [5], [8], only the 21 most frequent concepts are used, where each concept has at least 5,000 images. Following [5], [8], 2,100 images are randomly selected as the query set (100 images per concept). For the unsupervised methods, all the remaining images are used as the training set. For the supervised methods, 500 images from each of the 21 concepts are randomly selected to form the training set.

The MIR Flickr [36] dataset consists of 25,000 images collected from Flickr, and each image is associated with one or multiple labels from 38 semantic concepts. 1,000 images are randomly selected as the query set. For the unsupervised methods, all the remaining images are used as the training set. For the supervised methods, 5,000 images are randomly selected from the remaining images to form the training set.

B. Experiment Settings

We implement the proposed SSDH method based on the open-source framework Caffe [37]. The parameters of our network are initialized with the VGG-F network [30], which is pre-trained on the ImageNet dataset [38]; a similar initialization strategy has been used in other deep hashing methods [10], [11]. In all experiments, our networks are trained with an initial learning rate of 0.001, and we divide the learning rate by 10 every 20,000 steps. The size of the mini-batch is 128, and weight decay is applied. For the four parameters in the proposed loss function, we simply set m_t = 1, m_p = 2, λ = 0.1 and µ = 0.1 in all the experiments.

For the proposed SSDH method, CNNH, NINH and DRSCH, we use the raw image pixels as input. The implementations of CNNH [8] and DRSCH [9] are provided by their authors, while NINH [5] is our own implementation.
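To make the role of these parameters concrete, the following minimal sketch assembles the mini-batch loss of equation (15) from the relaxed terms of equations (9), (12) and (13) with the settings above (m_t = 1, m_p = 2, λ = 0.1, µ = 0.1). It is a NumPy sketch under the assumption that the triplets and pairs have already been sampled as described in Section III-B; it is not the Caffe implementation used in the experiments.

```python
import numpy as np

m_t, m_p, lam, mu = 1.0, 2.0, 0.1, 0.1   # margins and balance weights used in all experiments

def ranking_term(h, h_pos, h_neg):
    """Relaxed triplet ranking term of equation (9) for one labeled triplet."""
    return max(0.0, m_t + np.sum((h - h_pos) ** 2) - np.sum((h - h_neg) ** 2))

def contrastive_term(h_i, h_j, is_neighbor):
    """Relaxed contrastive term of equations (12)/(13) for one image pair."""
    d = np.linalg.norm(h_i - h_j)
    return d ** 2 if is_neighbor else max(0.0, m_p - d) ** 2

def batch_loss(triplets, embed_pairs, pseudo_pairs):
    """Semi-supervised loss of equation (15) over one mini-batch.

    triplets     : list of (h, h_pos, h_neg) relaxed codes of labeled triplets
    embed_pairs  : list of (h_i, h_j, is_neighbor) pairs sampled from the online graph
    pseudo_pairs : list of (h_i, h_j, same_pseudo_label) pairs from the fc9 predictions
    """
    j_l = sum(ranking_term(*t) for t in triplets)
    j_u = sum(contrastive_term(hi, hj, nb) for hi, hj, nb in embed_pairs)
    j_p = sum(contrastive_term(hi, hj, same) for hi, hj, same in pseudo_pairs)
    return j_l + lam * j_u + mu * j_p
```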
Since the network structures of these four deep hashing methods are different from each other, for a fair comparison we use the same VGG-F network as the base structure for the four deep methods, and the network parameters of the deep hashing methods are initialized with the same pre-trained VGG-F model; thus we can perform a fair comparison among these deep hashing methods. The results of CNNH [8], NINH [5] and DRSCH [9] under this setting are referred to as CNNH*, NINH* and DRSCH* respectively. For the other compared methods without deep networks, i.e., the traditional hashing methods, we represent each image by hand-crafted features and deep features respectively. For

hand-crafted features, we represent images in CIFAR-10 and MIR Flickr by 512-dimensional GIST features, images in MNIST by 784-dimensional gray scale values, and images in NUS-WIDE by 500-dimensional bag-of-words features. For a fair comparison between the traditional methods and the deep hashing methods, we also conduct experiments on the traditional methods with features extracted from deep networks, where we extract a 4096-dimensional deep feature for each image from the activation of the second fully-connected layer (fc7) of the pre-trained VGG-F network. We denote the results of the traditional methods using hand-crafted features by LSH, SH, ITQ and SDH, and the results of the traditional methods using deep features by LSH-VGGF, SH-VGGF, ITQ-VGGF and SDH-VGGF. The results of SDH, SH and ITQ are obtained from the implementations provided by the authors of the original papers, while the results of LSH are from our own implementation.

Fig. 3. The comparison results on the CIFAR-10 dataset: (a) precision curves within top 500 retrieved samples w.r.t. different lengths of hash codes; (b) precision curves with 48-bit hash codes w.r.t. different numbers of top retrieved samples; (c) precision-recall curves of Hamming ranking with 48-bit hash codes.

TABLE II
MAP SCORES WITH DIFFERENT LENGTHS OF HASH CODES (12, 24, 32 AND 48 BITS) ON THE MNIST DATASET FOR SSDH, DRSCH* [9], NINH* [5], CNNH* [8], SDH-VGGF, ITQ-VGGF, SH-VGGF, LSH-VGGF, SDH [25], ITQ [13], SH [7] AND LSH [4].

C. Experiment Results and Analysis

To evaluate the retrieval accuracy of the proposed approach and the compared methods, we use four evaluation metrics: Mean Average Precision (MAP), precision-recall curves, precision curves w.r.t. different numbers of top retrieved samples, and precision curves within top 500 retrieved samples w.r.t. different hash code lengths, which can objectively and comprehensively evaluate the accuracy of our approach and the compared methods.

TABLE III
MAP SCORES WITH DIFFERENT LENGTHS OF HASH CODES (12, 24, 32 AND 48 BITS) ON THE CIFAR-10 DATASET FOR THE SAME SET OF METHODS.

1) Experiment results on MNIST: Table II shows the MAP scores within all the retrieved results on the MNIST dataset for different lengths of hash codes. Figs. 2(a), 2(b) and 2(c) show the precision curves within top 500 retrieved samples for different code lengths, the precision curves with 48-bit hash codes for different numbers of top retrieved samples, and the precision-recall curves of 48-bit hash codes on MNIST, respectively. We can clearly see that: (1) The proposed SSDH approach outperforms all compared state-of-the-art methods for all hash code lengths. Note that MNIST is a relatively simple dataset compared to the other 3 datasets. Although the deep method DRSCH* has achieved a high average MAP score of 95.6%, our method still achieves the highest average MAP score of 97.0%. (2) From these 3 figures, although the compared deep hashing methods achieve a high precision on the three metrics, our proposed SSDH still achieves the highest precision at all recall levels, the highest precision for all numbers of retrieved samples, as well as the highest precision within top 500 retrieved samples for all lengths of hash codes. (3) Our proposed SSDH method outperforms the other 3 supervised deep hashing methods, which demonstrates the effectiveness

of exploiting the underlying structures of the unlabeled data. (4) Comparing the traditional methods using deep features with the identical methods using hand-crafted features, we can observe that deep features can boost the performance of the traditional methods.

2) Experiment results on CIFAR-10: Table III shows the MAP scores within all the retrieved results on the CIFAR-10 dataset for different hash code lengths. We can observe from the result table that: (1) Compared to the state-of-the-art supervised deep hashing methods, the proposed SSDH approach improves the average MAP from 73.2% (DRSCH*), 67.2% (NINH*) and 56.0% (CNNH*) to 81.0%. This is because the proposed SSDH designs a semi-supervised loss function, which not only preserves the semantic similarity provided by labels, but also captures the meaningful neighborhood structures, thus improving the search accuracy. (2) SSDH, DRSCH* and NINH* significantly outperform the two-stage method CNNH*, because they learn the hash codes and image features simultaneously, which can benefit from each other, thus leading to better performance compared to CNNH*. (3) The traditional hashing methods using deep features significantly outperform the identical methods using hand-crafted features. For example, compared to SDH, SDH-VGGF improves the average MAP score from 32.2% to 49.0%, which indicates that deep features can also boost the performance of traditional hashing methods.

Fig. 3(a) shows the precision curves within top 500 retrieved samples for different lengths of hash codes; we can observe that the proposed SSDH approach achieves the best search accuracy for all hash code lengths (over 80% search precision). Fig. 3(b) shows the precision curves with 48-bit hash codes for different numbers of top retrieved samples; the proposed SSDH consistently achieves over 80% search precision as the number of retrieved results increases. Fig. 3(c) shows the precision-recall curves of Hamming ranking with 48-bit hash codes, and the proposed SSDH method still achieves the best results.

TABLE IV
MAP SCORES WITH DIFFERENT LENGTHS OF HASH CODES (12, 24, 32 AND 48 BITS) ON THE NUS-WIDE DATASET FOR THE SAME SET OF METHODS.

3) Experiment results on NUS-WIDE: Table IV shows the MAP scores within all the retrieved results on NUS-WIDE. The precision curves within top 500 retrieved samples for different lengths of hash codes are shown in Fig. 4(a). The precision curves with 48-bit hash codes for different numbers of top retrieved samples are illustrated in Fig. 4(b). The precision-recall curves of Hamming ranking with 48-bit hash codes are shown in Fig. 4(c). Similar results can be observed on NUS-WIDE. Our proposed SSDH achieves the best results on all four evaluation metrics: (1) SSDH achieves an average MAP score of 72.5%, which outperforms all the compared state-of-the-art supervised deep hashing methods DRSCH* (64.5%), NINH* (63.1%) and CNNH* (53.1%). This demonstrates the effectiveness of jointly preserving the semantic similarity and the underlying data structures. (2) The proposed SSDH achieves over 80.0% precision within the top 500 retrieved results for all code lengths, and outperforms the other state-of-the-art methods. (3) SSDH consistently achieves over 80.0% search accuracy as the number of retrieved samples increases with 48-bit hash codes, which shows the effectiveness of the proposed method on Hamming ranking. (4) SSDH achieves the best precision at all recall levels, which further demonstrates the effectiveness of the proposed approach.
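As a reference for how the reported numbers can be obtained, the following minimal sketch computes Hamming-ranking based MAP and precision@k for one query (NumPy; binary codes are assumed to lie in {-1, +1}^q as produced by equation (7), and `relevant` is assumed to mark the database items sharing a label with the query).

```python
import numpy as np

def hamming_rank(query_code, db_codes):
    """Rank database items by Hamming distance to the query (codes in {-1, +1}^q)."""
    dist = 0.5 * (db_codes.shape[1] - db_codes @ query_code)  # Hamming distance
    return np.argsort(dist)

def average_precision(rank, relevant):
    """Average precision of one query; `relevant` is a boolean array over the database."""
    hits = relevant[rank]
    if hits.sum() == 0:
        return 0.0
    precision_at_hit = np.cumsum(hits) / (np.arange(len(hits)) + 1)
    return np.mean(precision_at_hit[hits])

def precision_at_k(rank, relevant, k=500):
    """Precision within the top-k retrieved samples."""
    return relevant[rank[:k]].mean()
```

MAP is then the mean of `average_precision` over all queries, and the precision curves correspond to `precision_at_k` evaluated at varying k or code lengths.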
TABLE V
MAP SCORES WITH DIFFERENT LENGTHS OF HASH CODES (12, 24, 32 AND 48 BITS) ON THE MIR FLICKR DATASET FOR THE SAME SET OF METHODS.

4) Experiment results on MIR Flickr: Table V shows the MAP scores within all the retrieved results on MIR Flickr. Similar results can be observed: our SSDH method achieves an average MAP score of 77.7%, which outperforms the other supervised deep hashing methods DRSCH* (73.6%), NINH* (71%) and CNNH* (65.9%). Fig. 5(a) shows the precision curves within top 500 retrieved samples for different hash code lengths; SSDH achieves over 85.0% precision for all code lengths, which shows a clear advantage over the state-of-the-art methods. Fig. 5(b) illustrates the precision curves with 48-bit hash codes for different numbers of top retrieved samples; it can be observed that SSDH steadily achieves over 85% search accuracy as the number of retrieved results increases, again a clear advantage over the state-of-the-art methods. Fig. 5(c) shows the precision-recall curves of Hamming ranking with 48-bit hash codes, and SSDH still achieves the best results. On all four evaluation metrics, it can be observed that our proposed SSDH outperforms the other state-of-the-art supervised hashing methods, which shows the benefit of jointly preserving the semantic similarity and the neighborhood structures.

In all metrics on the 4 datasets, the proposed SSDH method shows a clear advantage over the state-of-the-art hashing methods, which demonstrates the effectiveness of the proposed SSDH method, and SSDH outperforms the supervised deep hashing methods.


More information

MoonRiver: Deep Neural Network in C++

MoonRiver: Deep Neural Network in C++ MoonRiver: Deep Neural Network in C++ Chung-Yi Weng Computer Science & Engineering University of Washington chungyi@cs.washington.edu Abstract Artificial intelligence resurges with its dramatic improvement

More information

Facial Expression Classification with Random Filters Feature Extraction

Facial Expression Classification with Random Filters Feature Extraction Facial Expression Classification with Random Filters Feature Extraction Mengye Ren Facial Monkey mren@cs.toronto.edu Zhi Hao Luo It s Me lzh@cs.toronto.edu I. ABSTRACT In our work, we attempted to tackle

More information

Machine Learning 13. week

Machine Learning 13. week Machine Learning 13. week Deep Learning Convolutional Neural Network Recurrent Neural Network 1 Why Deep Learning is so Popular? 1. Increase in the amount of data Thanks to the Internet, huge amount of

More information

Rank Preserving Hashing for Rapid Image Search

Rank Preserving Hashing for Rapid Image Search 205 Data Compression Conference Rank Preserving Hashing for Rapid Image Search Dongjin Song Wei Liu David A. Meyer Dacheng Tao Rongrong Ji Department of ECE, UC San Diego, La Jolla, CA, USA dosong@ucsd.edu

More information

Convolution Neural Networks for Chinese Handwriting Recognition

Convolution Neural Networks for Chinese Handwriting Recognition Convolution Neural Networks for Chinese Handwriting Recognition Xu Chen Stanford University 450 Serra Mall, Stanford, CA 94305 xchen91@stanford.edu Abstract Convolutional neural networks have been proven

More information

Akarsh Pokkunuru EECS Department Contractive Auto-Encoders: Explicit Invariance During Feature Extraction

Akarsh Pokkunuru EECS Department Contractive Auto-Encoders: Explicit Invariance During Feature Extraction Akarsh Pokkunuru EECS Department 03-16-2017 Contractive Auto-Encoders: Explicit Invariance During Feature Extraction 1 AGENDA Introduction to Auto-encoders Types of Auto-encoders Analysis of different

More information

Deep Learning. Vladimir Golkov Technical University of Munich Computer Vision Group

Deep Learning. Vladimir Golkov Technical University of Munich Computer Vision Group Deep Learning Vladimir Golkov Technical University of Munich Computer Vision Group 1D Input, 1D Output target input 2 2D Input, 1D Output: Data Distribution Complexity Imagine many dimensions (data occupies

More information

HASHING has been a hot research topic on image processing. Bilinear Supervised Hashing Based on 2D Image Features

HASHING has been a hot research topic on image processing. Bilinear Supervised Hashing Based on 2D Image Features 1 Bilinear Supervised Hashing Based on 2D Image Features Yujuan Ding, Wai Kueng Wong, Zhihui Lai and Zheng Zhang arxiv:1901.01474v1 [cs.cv] 5 Jan 2019 Abstract Hashing has been recognized as an efficient

More information

Deep Learning for Embedded Security Evaluation

Deep Learning for Embedded Security Evaluation Deep Learning for Embedded Security Evaluation Emmanuel Prouff 1 1 Laboratoire de Sécurité des Composants, ANSSI, France April 2018, CISCO April 2018, CISCO E. Prouff 1/22 Contents 1. Context and Motivation

More information

Bilevel Sparse Coding

Bilevel Sparse Coding Adobe Research 345 Park Ave, San Jose, CA Mar 15, 2013 Outline 1 2 The learning model The learning algorithm 3 4 Sparse Modeling Many types of sensory data, e.g., images and audio, are in high-dimensional

More information

AdaDepth: Unsupervised Content Congruent Adaptation for Depth Estimation

AdaDepth: Unsupervised Content Congruent Adaptation for Depth Estimation AdaDepth: Unsupervised Content Congruent Adaptation for Depth Estimation Introduction Supplementary material In the supplementary material, we present additional qualitative results of the proposed AdaDepth

More information

Machine Learning. Nonparametric methods for Classification. Eric Xing , Fall Lecture 2, September 12, 2016

Machine Learning. Nonparametric methods for Classification. Eric Xing , Fall Lecture 2, September 12, 2016 Machine Learning 10-701, Fall 2016 Nonparametric methods for Classification Eric Xing Lecture 2, September 12, 2016 Reading: 1 Classification Representing data: Hypothesis (classifier) 2 Clustering 3 Supervised

More information

arxiv: v1 [cond-mat.dis-nn] 30 Dec 2018

arxiv: v1 [cond-mat.dis-nn] 30 Dec 2018 A General Deep Learning Framework for Structure and Dynamics Reconstruction from Time Series Data arxiv:1812.11482v1 [cond-mat.dis-nn] 30 Dec 2018 Zhang Zhang, Jing Liu, Shuo Wang, Ruyue Xin, Jiang Zhang

More information

Kaggle Data Science Bowl 2017 Technical Report

Kaggle Data Science Bowl 2017 Technical Report Kaggle Data Science Bowl 2017 Technical Report qfpxfd Team May 11, 2017 1 Team Members Table 1: Team members Name E-Mail University Jia Ding dingjia@pku.edu.cn Peking University, Beijing, China Aoxue Li

More information

Deep Face Recognition. Nathan Sun

Deep Face Recognition. Nathan Sun Deep Face Recognition Nathan Sun Why Facial Recognition? Picture ID or video tracking Higher Security for Facial Recognition Software Immensely useful to police in tracking suspects Your face will be an

More information

Learning Discrete Representations via Information Maximizing Self-Augmented Training

Learning Discrete Representations via Information Maximizing Self-Augmented Training A. Relation to Denoising and Contractive Auto-encoders Our method is related to denoising auto-encoders (Vincent et al., 2008). Auto-encoders maximize a lower bound of mutual information (Cover & Thomas,

More information

Topics in AI (CPSC 532L): Multimodal Learning with Vision, Language and Sound. Lecture 12: Deep Reinforcement Learning

Topics in AI (CPSC 532L): Multimodal Learning with Vision, Language and Sound. Lecture 12: Deep Reinforcement Learning Topics in AI (CPSC 532L): Multimodal Learning with Vision, Language and Sound Lecture 12: Deep Reinforcement Learning Types of Learning Supervised training Learning from the teacher Training data includes

More information

Approximate Nearest Neighbor Search. Deng Cai Zhejiang University

Approximate Nearest Neighbor Search. Deng Cai Zhejiang University Approximate Nearest Neighbor Search Deng Cai Zhejiang University The Era of Big Data How to Find Things Quickly? Web 1.0 Text Search Sparse feature Inverted Index How to Find Things Quickly? Web 2.0, 3.0

More information

Adaptive Hash Retrieval with Kernel Based Similarity

Adaptive Hash Retrieval with Kernel Based Similarity Adaptive Hash Retrieval with Kernel Based Similarity Xiao Bai a, Cheng Yan a, Haichuan Yang a, Lu Bai b, Jun Zhou c, Edwin Robert Hancock d a School of Computer Science and Engineering, Beihang University,

More information

arxiv: v3 [cs.cv] 3 Dec 2017

arxiv: v3 [cs.cv] 3 Dec 2017 Structured Deep Hashing with Convolutional Neural Networks for Fast Person Re-identification Lin Wu, Yang Wang ISSR, ITEE, The University of Queensland, Brisbane, QLD, 4072, Australia The University of

More information

Dynamic Routing Between Capsules

Dynamic Routing Between Capsules Report Explainable Machine Learning Dynamic Routing Between Capsules Author: Michael Dorkenwald Supervisor: Dr. Ullrich Köthe 28. Juni 2018 Inhaltsverzeichnis 1 Introduction 2 2 Motivation 2 3 CapusleNet

More information

Machine Learning Classifiers and Boosting

Machine Learning Classifiers and Boosting Machine Learning Classifiers and Boosting Reading Ch 18.6-18.12, 20.1-20.3.2 Outline Different types of learning problems Different types of learning algorithms Supervised learning Decision trees Naïve

More information

Advanced Video Analysis & Imaging

Advanced Video Analysis & Imaging Advanced Video Analysis & Imaging (5LSH0), Module 09B Machine Learning with Convolutional Neural Networks (CNNs) - Workout Farhad G. Zanjani, Clint Sebastian, Egor Bondarev, Peter H.N. de With ( p.h.n.de.with@tue.nl

More information

A Sparse and Locally Shift Invariant Feature Extractor Applied to Document Images

A Sparse and Locally Shift Invariant Feature Extractor Applied to Document Images A Sparse and Locally Shift Invariant Feature Extractor Applied to Document Images Marc Aurelio Ranzato Yann LeCun Courant Institute of Mathematical Sciences New York University - New York, NY 10003 Abstract

More information

Automatic detection of books based on Faster R-CNN

Automatic detection of books based on Faster R-CNN Automatic detection of books based on Faster R-CNN Beibei Zhu, Xiaoyu Wu, Lei Yang, Yinghua Shen School of Information Engineering, Communication University of China Beijing, China e-mail: zhubeibei@cuc.edu.cn,

More information

Semantic Segmentation. Zhongang Qi

Semantic Segmentation. Zhongang Qi Semantic Segmentation Zhongang Qi qiz@oregonstate.edu Semantic Segmentation "Two men riding on a bike in front of a building on the road. And there is a car." Idea: recognizing, understanding what's in

More information

Intro to Deep Learning. Slides Credit: Andrej Karapathy, Derek Hoiem, Marc Aurelio, Yann LeCunn

Intro to Deep Learning. Slides Credit: Andrej Karapathy, Derek Hoiem, Marc Aurelio, Yann LeCunn Intro to Deep Learning Slides Credit: Andrej Karapathy, Derek Hoiem, Marc Aurelio, Yann LeCunn Why this class? Deep Features Have been able to harness the big data in the most efficient and effective

More information

Learning Label Preserving Binary Codes for Multimedia Retrieval: A General Approach

Learning Label Preserving Binary Codes for Multimedia Retrieval: A General Approach 1 Learning Label Preserving Binary Codes for Multimedia Retrieval: A General Approach KAI LI, GUO-JUN QI, and KIEN A. HUA, University of Central Florida Learning-based hashing has received a great deal

More information

arxiv: v1 [cs.lg] 22 Feb 2016

arxiv: v1 [cs.lg] 22 Feb 2016 Noname manuscript No. (will be inserted by the editor) Structured Learning of Binary Codes with Column Generation Guosheng Lin Fayao Liu Chunhua Shen Jianxin Wu Heng Tao Shen arxiv:1602.06654v1 [cs.lg]

More information

Deep Semantic-Preserving and Ranking-Based Hashing for Image Retrieval

Deep Semantic-Preserving and Ranking-Based Hashing for Image Retrieval Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence (IJCAI-16) Deep Semantic-Preserving and Ranking-Based Hashing for Image Retrieval Ting Yao, Fuchen Long, Tao Mei,

More information

Lab meeting (Paper review session) Stacked Generative Adversarial Networks

Lab meeting (Paper review session) Stacked Generative Adversarial Networks Lab meeting (Paper review session) Stacked Generative Adversarial Networks 2017. 02. 01. Saehoon Kim (Ph. D. candidate) Machine Learning Group Papers to be covered Stacked Generative Adversarial Networks

More information

CENG 783. Special topics in. Deep Learning. AlchemyAPI. Week 11. Sinan Kalkan

CENG 783. Special topics in. Deep Learning. AlchemyAPI. Week 11. Sinan Kalkan CENG 783 Special topics in Deep Learning AlchemyAPI Week 11 Sinan Kalkan TRAINING A CNN Fig: http://www.robots.ox.ac.uk/~vgg/practicals/cnn/ Feed-forward pass Note that this is written in terms of the

More information

Transfer Learning. Style Transfer in Deep Learning

Transfer Learning. Style Transfer in Deep Learning Transfer Learning & Style Transfer in Deep Learning 4-DEC-2016 Gal Barzilai, Ram Machlev Deep Learning Seminar School of Electrical Engineering Tel Aviv University Part 1: Transfer Learning in Deep Learning

More information

Supplementary A. Overview. C. Time and Space Complexity. B. Shape Retrieval. D. Permutation Invariant SOM. B.1. Dataset

Supplementary A. Overview. C. Time and Space Complexity. B. Shape Retrieval. D. Permutation Invariant SOM. B.1. Dataset Supplementary A. Overview This supplementary document provides more technical details and experimental results to the main paper. Shape retrieval experiments are demonstrated with ShapeNet Core55 dataset

More information

A novel supervised learning algorithm and its use for Spam Detection in Social Bookmarking Systems

A novel supervised learning algorithm and its use for Spam Detection in Social Bookmarking Systems A novel supervised learning algorithm and its use for Spam Detection in Social Bookmarking Systems Anestis Gkanogiannis and Theodore Kalamboukis Department of Informatics Athens University of Economics

More information

The exam is closed book, closed notes except your one-page (two-sided) cheat sheet.

The exam is closed book, closed notes except your one-page (two-sided) cheat sheet. CS 189 Spring 2015 Introduction to Machine Learning Final You have 2 hours 50 minutes for the exam. The exam is closed book, closed notes except your one-page (two-sided) cheat sheet. No calculators or

More information

Chapter 7. Learning Hypersurfaces and Quantisation Thresholds

Chapter 7. Learning Hypersurfaces and Quantisation Thresholds Chapter 7 Learning Hypersurfaces and Quantisation Thresholds The research presented in this Chapter has been previously published in Moran (2016). 7.1 Introduction In Chapter 1 I motivated this thesis

More information

Similarity-Preserving Binary Hashing for Image Retrieval in large databases

Similarity-Preserving Binary Hashing for Image Retrieval in large databases Universidad Politécnica de Valencia Master s Final Project Similarity-Preserving Binary Hashing for Image Retrieval in large databases Author: Guillermo García Franco Supervisor: Dr. Roberto Paredes Palacios

More information

Detecting Burnscar from Hyperspectral Imagery via Sparse Representation with Low-Rank Interference

Detecting Burnscar from Hyperspectral Imagery via Sparse Representation with Low-Rank Interference Detecting Burnscar from Hyperspectral Imagery via Sparse Representation with Low-Rank Interference Minh Dao 1, Xiang Xiang 1, Bulent Ayhan 2, Chiman Kwan 2, Trac D. Tran 1 Johns Hopkins Univeristy, 3400

More information

Deep Learning For Video Classification. Presented by Natalie Carlebach & Gil Sharon

Deep Learning For Video Classification. Presented by Natalie Carlebach & Gil Sharon Deep Learning For Video Classification Presented by Natalie Carlebach & Gil Sharon Overview Of Presentation Motivation Challenges of video classification Common datasets 4 different methods presented in

More information

Inception and Residual Networks. Hantao Zhang. Deep Learning with Python.

Inception and Residual Networks. Hantao Zhang. Deep Learning with Python. Inception and Residual Networks Hantao Zhang Deep Learning with Python https://en.wikipedia.org/wiki/residual_neural_network Deep Neural Network Progress from Large Scale Visual Recognition Challenge (ILSVRC)

More information

NEarest neighbor search plays an important role in

NEarest neighbor search plays an important role in 1 EFANNA : An Extremely Fast Approximate Nearest Neighbor Search Algorithm Based on knn Graph Cong Fu, Deng Cai arxiv:1609.07228v3 [cs.cv] 3 Dec 2016 Abstract Approximate nearest neighbor (ANN) search

More information

A Sparse and Locally Shift Invariant Feature Extractor Applied to Document Images

A Sparse and Locally Shift Invariant Feature Extractor Applied to Document Images A Sparse and Locally Shift Invariant Feature Extractor Applied to Document Images Marc Aurelio Ranzato Yann LeCun Courant Institute of Mathematical Sciences New York University - New York, NY 10003 Abstract

More information

Learning Deep Binary Descriptor with Multi-Quantization

Learning Deep Binary Descriptor with Multi-Quantization Learning Deep Binary Descriptor with Multi-Quantization Yueqi Duan, Jiwen Lu, Senior Member, IEEE, Ziwei Wang, Jianjiang Feng, Member, IEEE, and Jie Zhou, Senior Member, IEEE Abstract In this paper, we

More information

Learning to Match. Jun Xu, Zhengdong Lu, Tianqi Chen, Hang Li

Learning to Match. Jun Xu, Zhengdong Lu, Tianqi Chen, Hang Li Learning to Match Jun Xu, Zhengdong Lu, Tianqi Chen, Hang Li 1. Introduction The main tasks in many applications can be formalized as matching between heterogeneous objects, including search, recommendation,

More information

Residual Networks And Attention Models. cs273b Recitation 11/11/2016. Anna Shcherbina

Residual Networks And Attention Models. cs273b Recitation 11/11/2016. Anna Shcherbina Residual Networks And Attention Models cs273b Recitation 11/11/2016 Anna Shcherbina Introduction to ResNets Introduced in 2015 by Microsoft Research Deep Residual Learning for Image Recognition (He, Zhang,

More information

FastText. Jon Koss, Abhishek Jindal

FastText. Jon Koss, Abhishek Jindal FastText Jon Koss, Abhishek Jindal FastText FastText is on par with state-of-the-art deep learning classifiers in terms of accuracy But it is way faster: FastText can train on more than one billion words

More information

JOINT INTENT DETECTION AND SLOT FILLING USING CONVOLUTIONAL NEURAL NETWORKS. Puyang Xu, Ruhi Sarikaya. Microsoft Corporation

JOINT INTENT DETECTION AND SLOT FILLING USING CONVOLUTIONAL NEURAL NETWORKS. Puyang Xu, Ruhi Sarikaya. Microsoft Corporation JOINT INTENT DETECTION AND SLOT FILLING USING CONVOLUTIONAL NEURAL NETWORKS Puyang Xu, Ruhi Sarikaya Microsoft Corporation ABSTRACT We describe a joint model for intent detection and slot filling based

More information

Machine Learning. MGS Lecture 3: Deep Learning

Machine Learning. MGS Lecture 3: Deep Learning Dr Michel F. Valstar http://cs.nott.ac.uk/~mfv/ Machine Learning MGS Lecture 3: Deep Learning Dr Michel F. Valstar http://cs.nott.ac.uk/~mfv/ WHAT IS DEEP LEARNING? Shallow network: Only one hidden layer

More information

arxiv: v1 [cs.cv] 19 Oct 2017 Abstract

arxiv: v1 [cs.cv] 19 Oct 2017 Abstract Improved Search in Hamming Space using Deep Multi-Index Hashing Hanjiang Lai and Yan Pan School of Data and Computer Science, Sun Yan-Sen University, China arxiv:70.06993v [cs.cv] 9 Oct 207 Abstract Similarity-preserving

More information

NEarest neighbor search plays an important role in

NEarest neighbor search plays an important role in 1 EFANNA : An Extremely Fast Approximate Nearest Neighbor Search Algorithm Based on knn Graph Cong Fu, Deng Cai arxiv:1609.07228v2 [cs.cv] 18 Nov 2016 Abstract Approximate nearest neighbor (ANN) search

More information

A Unified Approach to Learning Task-Specific Bit Vector Representations for Fast Nearest Neighbor Search

A Unified Approach to Learning Task-Specific Bit Vector Representations for Fast Nearest Neighbor Search A Unified Approach to Learning Task-Specific Bit Vector Representations for Fast Nearest Neighbor Search Vinod Nair Yahoo! Labs Bangalore vnair@yahoo-inc.com Dhruv Mahajan Yahoo! Labs Bangalore dkm@yahoo-inc.com

More information

on learned visual embedding patrick pérez Allegro Workshop Inria Rhônes-Alpes 22 July 2015

on learned visual embedding patrick pérez Allegro Workshop Inria Rhônes-Alpes 22 July 2015 on learned visual embedding patrick pérez Allegro Workshop Inria Rhônes-Alpes 22 July 2015 Vector visual representation Fixed-size image representation High-dim (100 100,000) Generic, unsupervised: BoW,

More information

HENet: A Highly Efficient Convolutional Neural. Networks Optimized for Accuracy, Speed and Storage

HENet: A Highly Efficient Convolutional Neural. Networks Optimized for Accuracy, Speed and Storage HENet: A Highly Efficient Convolutional Neural Networks Optimized for Accuracy, Speed and Storage Qiuyu Zhu Shanghai University zhuqiuyu@staff.shu.edu.cn Ruixin Zhang Shanghai University chriszhang96@shu.edu.cn

More information

1 Training/Validation/Testing

1 Training/Validation/Testing CPSC 340 Final (Fall 2015) Name: Student Number: Please enter your information above, turn off cellphones, space yourselves out throughout the room, and wait until the official start of the exam to begin.

More information

Como funciona o Deep Learning

Como funciona o Deep Learning Como funciona o Deep Learning Moacir Ponti (com ajuda de Gabriel Paranhos da Costa) ICMC, Universidade de São Paulo Contact: www.icmc.usp.br/~moacir moacir@icmc.usp.br Uberlandia-MG/Brazil October, 2017

More information

Structured Prediction using Convolutional Neural Networks

Structured Prediction using Convolutional Neural Networks Overview Structured Prediction using Convolutional Neural Networks Bohyung Han bhhan@postech.ac.kr Computer Vision Lab. Convolutional Neural Networks (CNNs) Structured predictions for low level computer

More information

COMPUTER vision is moving from predicting discrete,

COMPUTER vision is moving from predicting discrete, 1 Learning Two-Branch Neural Networks for Image-Text Matching Tasks Liwei Wang, Yin Li, Jing Huang, Svetlana Lazebnik arxiv:1704.03470v3 [cs.cv] 28 Dec 2017 Abstract Image-language matching tasks have

More information

COMP 551 Applied Machine Learning Lecture 16: Deep Learning

COMP 551 Applied Machine Learning Lecture 16: Deep Learning COMP 551 Applied Machine Learning Lecture 16: Deep Learning Instructor: Ryan Lowe (ryan.lowe@cs.mcgill.ca) Slides mostly by: Class web page: www.cs.mcgill.ca/~hvanho2/comp551 Unless otherwise noted, all

More information

Improving Image Segmentation Quality Via Graph Theory

Improving Image Segmentation Quality Via Graph Theory International Symposium on Computers & Informatics (ISCI 05) Improving Image Segmentation Quality Via Graph Theory Xiangxiang Li, Songhao Zhu School of Automatic, Nanjing University of Post and Telecommunications,

More information

Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks

Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun Presented by Tushar Bansal Objective 1. Get bounding box for all objects

More information

Research on Pruning Convolutional Neural Network, Autoencoder and Capsule Network

Research on Pruning Convolutional Neural Network, Autoencoder and Capsule Network Research on Pruning Convolutional Neural Network, Autoencoder and Capsule Network Tianyu Wang Australia National University, Colledge of Engineering and Computer Science u@anu.edu.au Abstract. Some tasks,

More information

Column Sampling based Discrete Supervised Hashing

Column Sampling based Discrete Supervised Hashing Column Sampling based Discrete Supervised Hashing Wang-Cheng Kang, Wu-Jun Li and Zhi-Hua Zhou National Key Laboratory for Novel Software Technology Collaborative Innovation Center of Novel Software Technology

More information

arxiv: v1 [cs.cv] 11 Apr 2018

arxiv: v1 [cs.cv] 11 Apr 2018 Unsupervised Segmentation of 3D Medical Images Based on Clustering and Deep Representation Learning Takayasu Moriya a, Holger R. Roth a, Shota Nakamura b, Hirohisa Oda c, Kai Nagara c, Masahiro Oda a,

More information

Alternatives to Direct Supervision

Alternatives to Direct Supervision CreativeAI: Deep Learning for Graphics Alternatives to Direct Supervision Niloy Mitra Iasonas Kokkinos Paul Guerrero Nils Thuerey Tobias Ritschel UCL UCL UCL TUM UCL Timetable Theory and Basics State of

More information

Deep Generative Models Variational Autoencoders

Deep Generative Models Variational Autoencoders Deep Generative Models Variational Autoencoders Sudeshna Sarkar 5 April 2017 Generative Nets Generative models that represent probability distributions over multiple variables in some way. Directed Generative

More information

Semi-Supervised Clustering with Partial Background Information

Semi-Supervised Clustering with Partial Background Information Semi-Supervised Clustering with Partial Background Information Jing Gao Pang-Ning Tan Haibin Cheng Abstract Incorporating background knowledge into unsupervised clustering algorithms has been the subject

More information

FINE-GRAINED image classification aims to recognize. Fast Fine-grained Image Classification via Weakly Supervised Discriminative Localization

FINE-GRAINED image classification aims to recognize. Fast Fine-grained Image Classification via Weakly Supervised Discriminative Localization 1 Fast Fine-grained Image Classification via Weakly Supervised Discriminative Localization Xiangteng He, Yuxin Peng and Junjie Zhao arxiv:1710.01168v1 [cs.cv] 30 Sep 2017 Abstract Fine-grained image classification

More information

The Boundary Graph Supervised Learning Algorithm for Regression and Classification

The Boundary Graph Supervised Learning Algorithm for Regression and Classification The Boundary Graph Supervised Learning Algorithm for Regression and Classification! Jonathan Yedidia! Disney Research!! Outline Motivation Illustration using a toy classification problem Some simple refinements

More information

Autoencoders. Stephen Scott. Introduction. Basic Idea. Stacked AE. Denoising AE. Sparse AE. Contractive AE. Variational AE GAN.

Autoencoders. Stephen Scott. Introduction. Basic Idea. Stacked AE. Denoising AE. Sparse AE. Contractive AE. Variational AE GAN. Stacked Denoising Sparse Variational (Adapted from Paul Quint and Ian Goodfellow) Stacked Denoising Sparse Variational Autoencoding is training a network to replicate its input to its output Applications:

More information

CHAPTER 6 IDENTIFICATION OF CLUSTERS USING VISUAL VALIDATION VAT ALGORITHM

CHAPTER 6 IDENTIFICATION OF CLUSTERS USING VISUAL VALIDATION VAT ALGORITHM 96 CHAPTER 6 IDENTIFICATION OF CLUSTERS USING VISUAL VALIDATION VAT ALGORITHM Clustering is the process of combining a set of relevant information in the same group. In this process KM algorithm plays

More information

Machine Learning. Topic 5: Linear Discriminants. Bryan Pardo, EECS 349 Machine Learning, 2013

Machine Learning. Topic 5: Linear Discriminants. Bryan Pardo, EECS 349 Machine Learning, 2013 Machine Learning Topic 5: Linear Discriminants Bryan Pardo, EECS 349 Machine Learning, 2013 Thanks to Mark Cartwright for his extensive contributions to these slides Thanks to Alpaydin, Bishop, and Duda/Hart/Stork

More information

Probabilistic Siamese Network for Learning Representations. Chen Liu

Probabilistic Siamese Network for Learning Representations. Chen Liu Probabilistic Siamese Network for Learning Representations by Chen Liu A thesis submitted in conformity with the requirements for the degree of Master of Applied Science Graduate Department of Electrical

More information