Deep Semantic Multimodal Hashing Network for Scalable Multimedia Retrieval

Lu Jin, Jinhui Tang, Senior Member, IEEE, Zechao Li, Guo-Jun Qi, and Fu Xiao

Abstract — Hashing has been widely applied to multimodal retrieval on large-scale multimedia data due to its efficiency in computation and storage. In particular, deep hashing has received unprecedented research attention in recent years, owing to its excellent retrieval performance. However, most existing deep hashing methods learn binary hash codes by preserving the similarity relationship without exploiting the semantic labels, which results in suboptimal binary codes. In this work, we propose a novel Deep Semantic Multimodal Hashing Network (DSMHN) for scalable multimodal retrieval. In DSMHN, two sets of modality-specific hash functions are jointly learned by explicitly preserving both the inter-modality similarities and the intra-modality semantic labels. Specifically, with the assumption that the learned hash codes should be optimal for task-specific classification, two stream networks are jointly trained to learn the hash functions by embedding the semantic labels into the resultant hash codes. Different from previous deep hashing methods, which are tied to particular forms of loss functions, our deep hashing framework can be flexibly integrated with different types of loss functions. In addition, the bit balance property is investigated to generate binary codes in which each bit has a 50% probability of being +1 or -1. Moreover, a unified deep multimodal hashing framework is proposed to learn compact and high-quality hash codes by simultaneously exploiting feature representation learning, inter-modality similarity-preserving learning, semantic label-preserving learning, and hash function learning with a bit-balance constraint. We conduct extensive experiments on both unimodal and cross-modal retrieval tasks over three widely used multimodal retrieval datasets. The experimental results demonstrate that DSMHN significantly outperforms state-of-the-art methods.

Index Terms — Multimedia data retrieval, deep hashing, intra-modality similarity, semantic label information, binary hash codes, convolutional neural network.

I. INTRODUCTION

With the rapid advancement of social networking sites, recent years have witnessed the explosive growth of social media, including images, user-provided textual captions, videos and so on. For instance, as one of the most popular online social media platforms, Facebook has more than 2 billion users sharing photos, videos, and textual content daily. When a Facebook user searches for topics of interest, it is desirable to return relevant results such as images, textual content, and videos. As there is a huge semantic gap among different modalities, it is especially challenging to search for relevant multimodal content among a great amount of heterogeneous multimedia data.

(L. Jin, J. Tang and Z. Li are with the School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing 210094, China (e-mail: lujin505@gmail.com, jinhuitang@njust.edu.cn, zechao.li@njust.edu.cn). G.-J. Qi is with the Department of Computer Science, University of Central Florida, Orlando, FL 32816, USA (e-mail: guojun.qi@ucf.edu). F. Xiao is with the School of Computer, Nanjing University of Posts and Telecommunications, Nanjing 210023, China (e-mail: xiaof@njupt.edu.cn). Corresponding author: Jinhui Tang.)
Consequently, it is imperative to develop efficient similarity search for large-scale multimedia data. Hashing [1], [2], [3], [4], [5] has recently received a great deal of attention in the fields of computer vision [6], [7], multimedia retrieval [8], [9] and person re-identification [10], [11]. The core idea of hashing is to project high-dimensional data into compact binary hash codes such that similar samples are encoded to nearby hash codes in the Hamming space. Compared with real-valued features, binary hash codes greatly reduce the storage cost and allow fast Hamming distance computation with the XOR operation, thus enabling efficient similarity search over large-scale databases. Existing multimodal hashing methods [12], [13], [14] can be roughly divided into unimodal hashing methods and cross-modal hashing methods. Given a query (e.g., a query image), unimodal hashing methods focus on searching for relevant content within a unimodal retrieval database. Well-known unimodal hashing methods include Locality-Sensitive Hashing (LSH) [15], Iterative Quantization (ITQ) [16], Supervised Hashing with Kernels (KSH) [3], Deep Supervised Hashing (DSH) [17] and Supervised Semantics-preserving Deep Hashing (SSDH) [18]. Multimodal data is usually represented by distinct feature representations in heterogeneous feature spaces, which makes searching across multimodal data (e.g., image, text and video) particularly challenging. Most cross-modal hashing methods [19], [20], [21] attempt to map heterogeneous features into a common space so that the correlation structures of different modalities can be preserved. However, very few efforts have been made to address both unimodal and cross-modal retrieval, which limits real-world multimedia retrieval applications.

Owing to their great potential in learning powerful feature representations, deep hashing methods [22], [23], [24], [25], [26] have shown superior performance over traditional hashing methods that adopt hand-crafted features to learn the hash functions. By jointly learning the feature representation and the hash functions, the feature representation is made optimally compatible with hashing, leading to discriminative and high-quality binary hash codes. For most deep unimodal hashing methods [27], [28], [29], the foremost concern is to approximate the hash functions by preserving the intra-modality similarity, which is insufficient to build the correlation structure across modalities. Meanwhile, although recent deep cross-modal hashing methods [30], [31], [32] can well preserve the inter-modality correlation structure of multiple modalities, they are less effective for unimodal retrieval because the intra-modality semantic label information is not preserved during hash learning.
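As a concrete illustration of the XOR-based distance computation mentioned above, here is a minimal Python sketch (the function name and the toy codes are illustrative, not code from the paper) comparing two packed binary codes:

```python
import numpy as np

def hamming_distance(a: int, b: int) -> int:
    """Hamming distance between two packed binary codes via XOR + popcount."""
    x = np.uint64(a) ^ np.uint64(b)   # differing bits are set to 1
    return bin(int(x)).count("1")     # population count of the XOR result

# Toy example: two 8-bit codes differing in exactly three bit positions.
code_q  = 0b1011_0010
code_db = 0b1010_0111
print(hamming_distance(code_q, code_db))  # -> 3
```

Because the XOR and popcount run in a handful of machine instructions, ranking millions of database codes against a query is far cheaper than computing distances between real-valued feature vectors.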

Figure 1: An overview of the proposed Deep Semantic Multimodal Hashing Network (DSMHN) for multimodal retrieval. The network learns the binary hash codes by imposing four constraints on the hash layer (fch) of each subnetwork. First, the inter-modality similarity-preserving loss is minimized to make the hash codes of different modalities consistent. Second, the cross-entropy loss is used to minimize the prediction errors of the learned hash codes. Third, the bit-balance constraint makes the hash bits evenly distributed. Finally, the quantization loss forces the outputs of the hash layer to be close to +1 and -1. Best viewed in color.

To address these limitations, we propose a novel Deep Semantic Multimodal Hashing Network (DSMHN) that performs scalable unimodal and cross-modal retrieval for multimedia data under a generalized deep hashing framework. The flowchart of the proposed method is illustrated in Figure 1. To learn efficient modality-specific hash functions, the inter-modality similarity relationships and the intra-modality semantic information are jointly exploited to guide the learning of the hash functions. In addition to learning inter-modality similarity-preserving hash codes, the learned hash codes are expected to be optimally compatible with label prediction, so as to effectively capture both the inter-modality correlation and the intra-modality semantic information. Based on this, the objective of DSMHN is to simultaneously minimize the pairwise similarity-preserving loss and the prediction errors of the learned hash codes. Different from previous deep hashing methods, which are built around one particular loss function, DSMHN can be flexibly trained with different types of loss functions under a generalized deep hashing framework. Moreover, to obtain balanced hash codes, we also investigate the bit balance property by making each bit have a 50% probability of being +1 or -1. Overall, we highlight three main contributions of the proposed method as follows.

1) This paper proposes a generalized deep multimodal hashing framework which simultaneously exploits feature representation learning, inter-modality similarity-preserving learning, intra-modality semantic label-preserving learning and hash function learning with a bit-balance constraint for scalable unimodal and multimodal retrieval.

2) Our network learns the binary hash codes by directly embedding the semantic labels into the similarity-preserving hash codes, and can be flexibly trained with different types of pairwise similarity-preserving loss functions.

3) Extensive experiments on three widely used multimodal retrieval datasets are conducted to evaluate the proposed method on both unimodal and cross-modal retrieval tasks. The experimental results demonstrate the superiority and effectiveness of DSMHN against state-of-the-art multimodal hashing methods on both tasks.

The rest of this paper is organized as follows.
We briefly review related work in Section II. The proposed method and its optimization algorithm are presented in detail in Section III. In Section IV, we conduct experiments on several multimedia retrieval datasets and discuss the experimental results in detail. Section V concludes the paper.

II. RELATED WORK

As mentioned above, existing multimodal hashing methods can be roughly divided into two categories: unimodal hashing methods and cross-modal hashing methods. We briefly review these two kinds of methods in the following.

Unimodal hashing methods learn hash functions by preserving the intra-modality similarity, which is defined using either the feature representation in the original feature space or the semantic label information.

Representative works of the former type, such as Spectral Hashing (SH) [2], Iterative Quantization (ITQ) [16], Binary Reconstructive Embedding (BRE) [33], Anchor Graph Hashing (AGH) [34] and Neighborhood Discriminant Hashing (NDH) [1], share the common assumption that the underlying data structure in the original feature space can be exploited to learn the hash functions. Another successful line of work leverages supervised label information to improve the quality of the binary hash codes. Supervised Hashing with Kernels (KSH) [3] employs kernel-based hash functions to learn similarity-preserving binary hash codes. Supervised Discrete Hashing (SDH) [35] aims to directly learn label-preserving binary hash codes under the discrete constraint. By generating list ranking orders from word embeddings of semantic labels, Discrete Semantic Ranking Hashing (DSeRH) [36] encodes the semantic rank orders to learn binary hash codes with a triplet ranking loss. Furthermore, Asymmetric Multi-Valued Hashing (AMVH) [37] and Asymmetric Binary Coding (ASH) [38] utilize asymmetric similarity metrics to learn two different sets of hash functions for image retrieval.

Cross-modal hashing methods learn modality-specific hash functions by projecting heterogeneous feature representations into a common space. Cross-View Hashing (CVH) [39], Co-Regularized Hashing (CRH) [40] and Inter-Media Hashing (IMH) [41] utilize unlabeled training data to generate binary hash codes. As supervised information is not exploited, these methods cannot produce desirable hash codes. More recent work utilizes labeled as well as unlabeled training data to learn more effective hash functions. Cross-Modality Similarity-Sensitive Hashing (CMSSH) [42] adopts a boosting algorithm to efficiently solve the cross-modal similarity learning problem by projecting the input data from two different spaces into the Hamming space. Collective Matrix Factorization Hashing (CMFH) [13] utilizes collective matrix factorization to learn cross-view hash functions by exploiting the latent structure of different modalities. Similarly, Latent Semantic Sparse Hashing (LSSH) [43] employs matrix factorization and sparse coding to learn the latent concepts and salient structures from the text and image modalities, respectively. Semantic Topic Multimodal Hashing (STMH) [44] learns a common subspace by capturing multiple semantic topics from multimedia data, and then generates binary hash codes by checking the presence of the semantic topics. Semantic Correlation Maximization (SCM) [12] extends canonical correlation analysis in a supervised manner. Semantics-Preserving Hashing (SePH) [45] first learns joint binary hash codes by minimizing the Kullback-Leibler divergence between the hash codes and the semantic affinities, and then learns the cross-view hash functions with kernel logistic regression. Linear Subspace Ranking Hashing (LSRH) [46] learns ranking-based hash functions by exploiting the ranking structure of the feature space.

Recently, Convolutional Neural Networks (CNNs) [47], [48], [49] have shown impressive learning potential for many vision tasks such as image classification [50], [51], [52], [53], image segmentation [54], [55], [56], image captioning [57], [58], [59], multimedia retrieval [60], [61], [62], [63] and video classification [64], [65], [66]. The success of CNNs on these vision tasks indicates that the learned hierarchical representations can well preserve the underlying semantic structure of the input data from different modalities. CNNH [67] is a two-step hashing method which learns hash codes and deep hash functions separately for image retrieval.
In [29], [68], [69], a triplet ranking loss is designed to learn hash functions with a CNN model by capturing triplet-based relative similarity relationships among images. Multimodal Similarity-Preserving Hashing [70] employs a siamese network to learn cross-view hash functions by jointly preserving the inter- and intra-modality similarities of different modalities. Correlation Hashing Network (CHN) [71] and Correlation Autoencoder Hashing (CAH) [72] exploit the multimodal correlation structure of multimedia data under deep network architectures. Deep Cross-Modal Hashing (DCMH) [73] directly learns unified discrete hash codes with deep neural networks. Different from the above deep hashing methods, we propose a generalized deep hashing method for both unimodal and cross-modal retrieval tasks by integrating feature representation learning, inter-modality similarity learning, intra-modality semantic label learning and hash learning with a bit-balance constraint into one unified framework.

III. DEEP SEMANTIC MULTIMODAL HASHING NETWORK

A. Mathematical Notations and Problem Definition

Throughout this paper, uppercase bold characters denote matrices and lowercase bold characters denote vectors. Let $\mathcal{T}_x = \{x_i\}_{i=1}^{N} \in \mathbb{R}^{N \times d_x}$ denote a set of $d_x$-dimensional data points from the image modality. Similarly, we use $\mathcal{T}_y = \{y_i\}_{i=1}^{N} \in \mathbb{R}^{N \times d_y}$ to denote a set of $d_y$-dimensional data points from the text modality. We assume each instance (i.e., $x_i$ or $y_i$) belongs to one or more categories, and its label is denoted as $g \in \{0, 1\}^C$, where $C$ is the number of categories and the non-zero elements in $g$ indicate the categories with which the instance is associated. We aim to learn $L$-bit binary codes $B_* \in \{-1, 1\}^{L \times N}$ for each modality, as well as two sets of modality-specific hash functions $H_* = [h_1^*, \ldots, h_L^*]$ with deep networks, where $* \in \{x, y\}$ is a modality placeholder and $h_l^*$ generates the $l$-th binary code $b_l \in \{-1, 1\}$. Once the modality-specific hash functions are obtained, newly arriving samples from different modalities can easily be projected into the common Hamming space.

The proposed hashing framework aims to jointly learn modality-specific hash functions and hierarchical representations in an end-to-end manner by simultaneously exploring the inter-modality semantic correlation and the intra-modality semantic information. In addition, each bit of the binary hash codes is expected to be zero-mean over the training data. In the following, we describe the proposed hashing framework in detail.

B. Deep Modality-specific Hash Functions

In recent years, deep learning techniques have shown superiority in many computer vision tasks such as classification, segmentation and image retrieval. Convolutional Neural Networks in particular have achieved great success in various visual applications due to their effectiveness in capturing high-level semantic structure from raw data. In contrast to hand-crafted features, CNNs can automatically learn hierarchical feature representations that are invariant to irrelevant variations such as translation and rotation.

In this paper, we propose to learn the modality-specific hash functions and feature representations synchronously for multimodal data by employing deep networks. Assume the deep network has $M_x$ and $M_y$ layers for the image modality and the text modality, respectively. For the $m$-th layer of the deep network, the outputs are computed as

$$Z^m = \psi(W^m Z^{m-1} + c^m), \quad m = 1, \ldots, M, \qquad (1)$$

where $\psi(\cdot)$ is a nonlinear activation function, $W^m \in \mathbb{R}^{d_m \times d_{m-1}}$ is the weight matrix of the $m$-th layer, $c^m \in \mathbb{R}^{d_m}$ is the bias vector, and $Z^{m-1} \in \mathbb{R}^{d_{m-1} \times N}$ is the output of the $(m-1)$-th layer. To learn the binary hash codes, the hash layer (i.e., the $(M-1)$-th layer) of the deep network is used to construct the modality-specific hash functions. Specifically, the modality-specific binary codes are computed by

$$B = \mathrm{sign}(Z^{M-1}), \qquad (2)$$

where $\mathrm{sign}(a) = 1$ if $a \geq 0$ and $-1$ otherwise, and $Z^{M-1}$ is the output of the $(M-1)$-th layer.

C. Inter-modality Similarity Preserving Binary Codes

To capture the correlation between different modalities, the inter-modality similarity is preserved during hash function learning. First, we construct the inter-modality similarity $s_{ij}$ as

$$s_{ij} = \begin{cases} 1, & \text{if } g_i^T g_j > 0 \\ -1, & \text{otherwise}, \end{cases} \qquad (3)$$

where $g_i$ and $g_j$ are the label vectors of the $i$-th and $j$-th instances. For each cross-modal pair $(x_i, y_j, s_{ij})$, if $s_{ij} = 1$, the corresponding binary codes $b_i^x$ and $b_j^y$ should be similar to each other; otherwise they should be dissimilar. In other words, if $s_{ij} = 1$, the similarity between $b_i^x$ and $b_j^y$ should be close to $1$, and otherwise close to $-1$. Following [3], the widely used code inner product is adopted to measure the similarity of two data points, so the cross-modal similarity between $b_i^x$ and $b_j^y$ is computed by

$$c_{ij} = (b_i^x)^T b_j^y / L, \qquad (4)$$

where $L$ is the length of the hash codes. In general, the inter-modality similarity-preserving learning problem can be formulated as

$$\min_{\{W_x^m, c_x^m\}, \{W_y^m, c_y^m\}} \sum_{s_{ij} \in S} l(c_{ij}, s_{ij}), \qquad (5)$$

where $l(\cdot, \cdot)$ is a loss function defined to enforce $c_{ij}$ and $s_{ij}$ to be close, and $\{W_*^m, c_*^m\}$ are the parameters of the $m$-th layer of the corresponding network. Any proper loss function can be used in Eq. (5); several widely used choices are discussed in the following.

L1 Loss. The L1 loss minimizes the absolute difference between the estimated value $c_{ij}$ and the target value $s_{ij}$:

$$l^{L1}(c_{ij}, s_{ij}) = |c_{ij} - s_{ij}|. \qquad (6)$$

Euclidean Loss. The Euclidean loss is the squared L2-norm of the error between the estimated and target values:

$$l^{L2}(c_{ij}, s_{ij}) = \frac{1}{2}(c_{ij} - s_{ij})^2. \qquad (7)$$

Hinge Loss. The hinge loss is widely used in SVMs for maximum-margin classification. Specifically, we define the hinge loss as

$$l^{hg}(c_{ij}, s_{ij}) = \begin{cases} \max(0, \delta_{hg} - \varphi(c_{ij})), & \text{if } s_{ij} = 1 \\ \varphi(c_{ij}), & \text{if } s_{ij} = -1, \end{cases} \qquad (8)$$

where $\varphi(a) = (a+1)/2$ transforms values in $[-1, 1]$ to $[0, 1]$, and $\delta_{hg}$ is a margin parameter, fixed to $1$ in this paper.

Contrastive Loss. The contrastive loss is defined as

$$l^{ct}(c_{ij}, s_{ij}) = \begin{cases} d_{ij}^2, & \text{if } s_{ij} = 1 \\ \max(0, \delta_{ct} - d_{ij})^2, & \text{if } s_{ij} = -1, \end{cases} \qquad (9)$$

where $d_{ij} = \|b_i^x - b_j^y\|$ with $d_{ij}^2 = 2L(1 - c_{ij})$ denotes the distance between $b_i^x$ and $b_j^y$, and $\delta_{ct}$ is a margin parameter.
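To make Eqs. (3)-(9) concrete, the following NumPy sketch builds the similarity label of Eq. (3) and evaluates the four candidate losses on one pair of (relaxed) codes. The margin value for the contrastive loss and all identifiers are illustrative assumptions, not code from the paper:

```python
import numpy as np

L = 32  # code length

def similarity_label(g_i, g_j):
    """Eq. (3): +1 if the two label vectors share any category, else -1."""
    return 1 if np.dot(g_i, g_j) > 0 else -1

def code_similarity(b_i, b_j):
    """Eq. (4): inner-product similarity, in [-1, 1] for {-1,+1} codes."""
    return float(np.dot(b_i, b_j)) / L

def l1_loss(c, s):                       # Eq. (6)
    return abs(c - s)

def l2_loss(c, s):                       # Eq. (7)
    return 0.5 * (c - s) ** 2

def hinge_loss(c, s, delta_hg=1.0):      # Eq. (8); phi maps [-1,1] to [0,1]
    phi = (c + 1.0) / 2.0
    return max(0.0, delta_hg - phi) if s == 1 else phi

def contrastive_loss(c, s, delta_ct=8.0):  # Eq. (9); margin is illustrative
    d = np.sqrt(2.0 * L * (1.0 - c))     # distance implied by Eq. (4)
    return d ** 2 if s == 1 else max(0.0, delta_ct - d) ** 2

rng = np.random.default_rng(0)
b_i, b_j = np.sign(rng.standard_normal(L)), np.sign(rng.standard_normal(L))
g_i, g_j = np.array([1, 0, 1]), np.array([0, 0, 1])   # toy multi-hot labels
s = similarity_label(g_i, g_j)           # -> 1 (the third category is shared)
c = code_similarity(b_i, b_j)
print([f(c, s) for f in (l1_loss, l2_loss, hinge_loss, contrastive_loss)])
```

All four losses depend on the pair only through the scalar $c_{ij}$ (and, for the contrastive loss, the distance derived from it), which is why they can be swapped freely inside the same framework.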
D. Label Preserving Binary Codes

Although the inter-modality similarity is leveraged to explore the correlation between different modalities, the intra-modality label information is not yet fully exploited. Recent work on unimodal retrieval has indicated that high-quality hash functions can be learned when the semantic information is exploited during the hash function learning procedure. However, most existing works encode the semantic information with a two-stream deep network, where one stream learns the hash functions while the other is used for the classification task. To make full use of the semantic label information, we instead use the learned binary hash codes directly for label prediction. That is, the binary hash codes are expected to be optimal for the jointly learned classifier. Hence, in addition to exploring the correlations between modalities, we take advantage of the semantic labels to learn the binary hash codes for the image and text modalities, respectively. In particular, the last layer (named the fcc layer) of each modality-specific network is trained for the classification task. In this work, we adopt the widely used cross-entropy loss for multi-label classification, which can be written as

$$L_c(B; W^M, c^M) = \frac{1}{N} \sum_{i=1}^{N} l_c(b_i; W^M, c^M) = -\frac{1}{N} \sum_{i=1}^{N} \sum_{c=1}^{C} \big[ g_{ic} \log \hat{g}_{ic} + (1 - g_{ic}) \log(1 - \hat{g}_{ic}) \big], \qquad (10)$$

where $W^M$ and $c^M$ are the weight matrix and bias vector of the fcc layer, $g_{ic}$ is the $c$-th element of $g_i$, $\hat{g}_{ic} = \sigma(w_c^M b_i + c_c^M)$ estimates the probability that the $i$-th instance belongs to the $c$-th category, $w_c^M$ is the $c$-th row of $W^M$, $c_c^M$ is the $c$-th element of $c^M$, and $\sigma(a) = (1 + \exp(-a))^{-1}$ is the sigmoid function.
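A minimal NumPy sketch of Eq. (10), assuming the codes are stored column-wise; the small epsilon for numerical safety and all names are our own additions, not part of Eq. (10):

```python
import numpy as np

def label_preserving_loss(B, G, W, c):
    """
    Multi-label cross-entropy of Eq. (10): the L-bit codes B (L x N) are fed
    directly into a linear classifier (W: C x L, c: C) and compared against
    the {0,1} label matrix G (C x N).
    """
    logits = W @ B + c[:, None]               # C x N
    G_hat = 1.0 / (1.0 + np.exp(-logits))     # sigmoid estimates per category
    eps = 1e-12                               # numerical safety only
    ce = -(G * np.log(G_hat + eps) + (1 - G) * np.log(1 - G_hat + eps))
    return ce.sum(axis=0).mean()              # average over the N instances

# Toy check with random codes/labels (L=16 bits, C=4 categories, N=8 samples).
rng = np.random.default_rng(0)
B = np.sign(rng.standard_normal((16, 8)))
G = (rng.random((4, 8)) > 0.5).astype(float)
W, c = rng.standard_normal((4, 16)) * 0.1, np.zeros(4)
print(label_preserving_loss(B, G, W, c))
```

Feeding the codes themselves into the classifier, rather than an intermediate feature layer, is what forces the label information into the bits.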

E. Overall Objective and Optimization

As mentioned above, we propose to learn the hash functions with deep neural networks by simultaneously exploring the intra-modality semantic structure and the inter-modality correlations between modalities. Combining Eq. (5) and Eq. (10), the proposed deep hashing framework can be formulated as

$$\min_{\Phi_x, \Phi_y} O = \sum_{s_{ij} \in S} l_{ij} + \alpha (L_c^x + L_c^y) \quad \text{s.t. } B_x, B_y \in \{-1, 1\}^{L \times N}, \qquad (11)$$

where $l_{ij}$ stands for $l(c_{ij}, s_{ij})$, $L_c^x$ and $L_c^y$ are short for $L_c(B_x; W_x^M, c_x^M)$ and $L_c(B_y; W_y^M, c_y^M)$, respectively, $\Phi_x = \{W_x^m, c_x^m\}_{m=1}^{M_x}$ is the parameter set of the image network, $\Phi_y = \{W_y^m, c_y^m\}_{m=1}^{M_y}$ is the parameter set of the text network, and $\alpha > 0$ is a hyper-parameter.

To make the hash codes balanced, most hashing methods enforce each bit of the binary codes to be zero-mean over the training data [18]. Similarly, we impose the bit-balance constraints under the proposed deep hashing framework, so that problem (11) becomes

$$\min_{\Phi_x, \Phi_y} O = \sum_{s_{ij} \in S} l_{ij} + \alpha (L_c^x + L_c^y) \quad \text{s.t. } B_x, B_y \in \{-1, 1\}^{L \times N}, \; B_x \mathbf{1} = 0, \; B_y \mathbf{1} = 0, \qquad (12)$$

where $\mathbf{1} \in \mathbb{R}^N$ is the all-ones vector. The optimization problem in Eq. (12) is NP-hard because of the discrete constraints on $B_x$ and $B_y$. To make it tractable, we remove the discrete constraint and relax $B_x$ and $B_y$ to continuous values:

$$B = Z^{M-1}, \qquad (13)$$

where $Z^{M-1}$ is the output of the hash layer (fch layer). The optimization problem in (12) then becomes

$$\min_{\Phi_x, \Phi_y} O = \sum_{s_{ij} \in S} l_{ij} + \alpha (L_c^x + L_c^y) \quad \text{s.t. } Z_x^{M-1} \mathbf{1} = 0, \; Z_y^{M-1} \mathbf{1} = 0. \qquad (14)$$

To push the entries of the relaxed codes toward either $1$ or $-1$, we introduce a quantization error term

$$L_q = \frac{1}{2N} \left( \big\| |Z_x^{M-1}| - \mathbf{1}_{L \times N} \big\|_F^2 + \big\| |Z_y^{M-1}| - \mathbf{1}_{L \times N} \big\|_F^2 \right). \qquad (15)$$

Minimizing $L_q$ enforces the entries of $Z_x^{M-1}$ and $Z_y^{M-1}$ to be close to $1$ or $-1$. As a result, the optimization problem in (14) can be reformulated as

$$\min_{\Phi_x, \Phi_y} O = \sum_{s_{ij} \in S} l_{ij} + \alpha (L_c^x + L_c^y) + \beta L_q + \gamma L_b, \qquad (16)$$

where $L_b = \frac{1}{2N} \big( \| Z_x^{M-1} \mathbf{1}_N \|_2^2 + \| Z_y^{M-1} \mathbf{1}_N \|_2^2 \big)$ is the bit-balance penalty term, and $\beta, \gamma > 0$ are two hyper-parameters. Note that with a sufficiently large $\gamma$, the binary hash codes are balanced as much as possible.

To solve the above problem, we adopt the widely used mini-batch Stochastic Gradient Descent (SGD) to train the whole deep networks, employing the back-propagation (BP) strategy to update the parameters of each layer. In general, the joint optimization problem in (16) is non-convex with respect to $\Phi_x$ and $\Phi_y$. In this work, we alternately learn $\Phi_x$ and $\Phi_y$, optimizing one parameter set while keeping the other fixed. The optimization algorithm is described in detail below.

Update $\Phi_x$. When $\Phi_y$ is fixed, for the $m$-th layer of the image network, the gradients of the overall objective $O$ with respect to the parameters $W_x^m$ and $c_x^m$ are derived as

$$\frac{\partial O}{\partial W_x^m} = \xi_x^m (z_{x,i}^{m-1})^T, \quad \frac{\partial O}{\partial c_x^m} = \xi_x^m, \qquad (17)$$

where $z_{x,i}^{m-1}$ is the $i$-th column of $Z_x^{m-1}$. For the hidden layers (i.e., $m = 1, \ldots, M-2$), $\xi_x^m$ is derived as

$$\xi_x^m = \big( (W_x^{m+1})^T \xi_x^{m+1} \big) \odot \psi'(W_x^m z_{x,i}^{m-1} + c_x^m), \qquad (18)$$

where $\odot$ denotes element-wise multiplication and $\psi'(\cdot)$ is the gradient of $\psi(\cdot)$. For the classification layer (i.e., $m = M$), $\xi_x^M$ is computed by

$$\xi_x^M = \frac{\alpha}{N} (\hat{g}_i - g_i) \odot \psi'(W_x^M z_{x,i}^{M-1} + c_x^M), \qquad (19)$$

where $\hat{g}_i = \sigma(W_x^M z_{x,i}^{M-1} + c_x^M)$. For the hash layer (i.e., $m = M-1$), $\xi_x^{M-1}$ is calculated by

$$\xi_x^{M-1} = \left( \frac{\partial l_{ij}}{\partial W_x^{M-1}} + \beta \frac{\partial L_q}{\partial W_x^{M-1}} + \gamma \frac{\partial L_b}{\partial W_x^{M-1}} \right) \odot \psi'(W_x^{M-1} z_{x,i}^{M-2} + c_x^{M-1}), \qquad (20)$$

where

$$\frac{\partial L_q}{\partial W_x^{M-1}} = \frac{1}{N} \big( |z_{x,i}^{M-1}| - \mathbf{1} \big) \odot \mathrm{sign}(z_{x,i}^{M-1}); \qquad (21)$$

$$\frac{\partial L_b}{\partial W_x^{M-1}} = \frac{1}{N} Z_x^{M-1} \mathbf{1}_N. \qquad (22)$$

Since different loss functions can be used in the proposed deep hashing framework, the subgradients of the different losses are

$$\frac{\partial l^{L1}}{\partial W_x^{M-1}} = \mathrm{sign}(c_{ij} - s_{ij}) \, z_{y,j}^{M-1} / L,$$
$$\frac{\partial l^{L2}}{\partial W_x^{M-1}} = (c_{ij} - s_{ij}) \, z_{y,j}^{M-1} / L,$$
$$\frac{\partial l^{hg}}{\partial W_x^{M-1}} = \frac{1}{2L} \big( -\tilde{s}_{ij} \, \mathbb{I}(\mu_{hg} > 0) + (1 - \tilde{s}_{ij}) \big) z_{y,j}^{M-1},$$
$$\frac{\partial l^{ct}}{\partial W_x^{M-1}} = 4 \big( \tilde{s}_{ij} d_{ij} - (1 - \tilde{s}_{ij}) \, \mathbb{I}(\mu_{ct} > 0)(\delta_{ct} - d_{ij}) \big) (z_{x,i}^{M-1} - z_{y,j}^{M-1}), \qquad (23)$$

where $\tilde{s}_{ij} = \frac{1}{2}(s_{ij} + 1)$, $\mu_{hg} = \delta_{hg} - \varphi(c_{ij})$, $\mu_{ct} = \delta_{ct} - d_{ij}$, and $\mathbb{I}(\cdot)$ is the indicator function, which equals $1$ if its condition is true and $0$ otherwise.
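To summarize Section III-E in runnable form, here is a hedged PyTorch-style sketch of the mini-batch objective of Eq. (16). In a modern framework, automatic differentiation replaces the hand-derived gradients of Eqs. (17)-(23); the L2 pairwise loss is used for concreteness, and every identifier is illustrative:

```python
import torch

def dsmhn_objective(zx, zy, s, gx_hat, gx, gy_hat, gy,
                    alpha=1.0, beta=1.0, gamma=1.0):
    """
    Overall loss of Eq. (16) for one mini-batch (terms scaled per batch).
      zx, zy : (N, L) relaxed hash-layer outputs (tanh) of the two networks
      s      : (N,)  float inter-modality similarity labels in {-1, +1}
      g*_hat : (N, C) sigmoid label predictions from the fcc layers
      g*     : (N, C) ground-truth multi-hot labels as floats
    """
    N, L = zx.shape
    c = (zx * zy).sum(dim=1) / L                        # Eq. (4), relaxed
    l_sim = 0.5 * ((c - s) ** 2).mean()                 # Eq. (7) pairwise loss
    bce = torch.nn.functional.binary_cross_entropy
    l_cls = bce(gx_hat, gx) + bce(gy_hat, gy)           # Eq. (10), both streams
    # Quantization term of Eq. (15), up to constant scaling.
    l_q = ((zx.abs() - 1) ** 2).mean() + ((zy.abs() - 1) ** 2).mean()
    # Bit-balance term L_b: each bit should be zero-mean over the batch.
    l_b = (zx.mean(dim=0) ** 2).sum() + (zy.mean(dim=0) ** 2).sum()
    return l_sim + alpha * l_cls + beta * l_q + gamma * l_b

# One SGD step (the two parameter sets can also be updated alternately):
# loss = dsmhn_objective(...); loss.backward(); optimizer.step()
```

The four terms act on the same hash-layer activations, so a single backward pass propagates all of them, mirroring the combined gradient of Eq. (20).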

Update $\Phi_y$. With $\Phi_x$ fixed, the gradients of the overall objective $O$ with respect to the parameters $W_y^m$ and $c_y^m$ of the $m$-th layer of the text network are calculated by

$$\frac{\partial O}{\partial W_y^m} = \xi_y^m (z_{y,i}^{m-1})^T, \quad \frac{\partial O}{\partial c_y^m} = \xi_y^m, \qquad (24)$$

where $z_{y,i}^{m-1}$ is the $i$-th column of $Z_y^{m-1}$. For $m = 1, \ldots, M-2$, we obtain

$$\xi_y^m = \big( (W_y^{m+1})^T \xi_y^{m+1} \big) \odot \psi'(W_y^m z_{y,i}^{m-1} + c_y^m). \qquad (25)$$

Otherwise, for the classification layer, $\xi_y^M$ is computed by

$$\xi_y^M = \frac{\alpha}{N} (\hat{g}_j - g_j) \odot \psi'(W_y^M z_{y,j}^{M-1} + c_y^M), \qquad (26)$$

where $\hat{g}_j = \sigma(W_y^M z_{y,j}^{M-1} + c_y^M)$. Similarly, for the hash layer, $\xi_y^{M-1}$ is computed by

$$\xi_y^{M-1} = \left( \frac{\partial l_{ij}}{\partial W_y^{M-1}} + \beta \frac{\partial L_q}{\partial W_y^{M-1}} + \gamma \frac{\partial L_b}{\partial W_y^{M-1}} \right) \odot \psi'(W_y^{M-1} z_{y,j}^{M-2} + c_y^{M-1}), \qquad (27)$$

where

$$\frac{\partial L_q}{\partial W_y^{M-1}} = \frac{1}{N} \big( |z_{y,j}^{M-1}| - \mathbf{1} \big) \odot \mathrm{sign}(z_{y,j}^{M-1}), \qquad (28)$$

$$\frac{\partial L_b}{\partial W_y^{M-1}} = \frac{1}{N} Z_y^{M-1} \mathbf{1}_N, \qquad (29)$$

and the subgradients $\partial l / \partial W_y^{M-1}$ of the different loss functions are defined as

$$\frac{\partial l^{L1}}{\partial W_y^{M-1}} = \mathrm{sign}(c_{ij} - s_{ij}) \, z_{x,i}^{M-1} / L,$$
$$\frac{\partial l^{L2}}{\partial W_y^{M-1}} = (c_{ij} - s_{ij}) \, z_{x,i}^{M-1} / L,$$
$$\frac{\partial l^{hg}}{\partial W_y^{M-1}} = \frac{1}{2L} \big( -\tilde{s}_{ij} \, \mathbb{I}(\mu_{hg} > 0) + (1 - \tilde{s}_{ij}) \big) z_{x,i}^{M-1},$$
$$\frac{\partial l^{ct}}{\partial W_y^{M-1}} = 4 \big( \tilde{s}_{ij} d_{ij} - (1 - \tilde{s}_{ij}) \, \mathbb{I}(\mu_{ct} > 0)(\delta_{ct} - d_{ij}) \big) (z_{y,j}^{M-1} - z_{x,i}^{M-1}). \qquad (30)$$

Algorithm 1 summarizes the learning procedure of the proposed method.

Algorithm 1: The learning algorithm for DSMHN
  Input: Training sets $\mathcal{T}_x$ and $\mathcal{T}_y$.
  Output: Network parameter sets $\Phi_x$ and $\Phi_y$.
  Initialization: Initialize the network parameter sets $\Phi_x$ and $\Phi_y$, the iteration number $N_{iter}$ and the mini-batch size $N_{batch}$.
  repeat
    for $iter = 1, \ldots, N_{iter}$ do
      Randomly select $N_{batch}$ cross-modal pairs from the training sets $\mathcal{T}_x$ and $\mathcal{T}_y$.
      for $m = 1, \ldots, M$ do
        For each selected cross-modal pair, compute $z^m$ by forward propagation according to Eq. (1).
      end for
      for $m = M, \ldots, 1$ do
        Calculate the derivatives $\partial O / \partial W^m$ and $\partial O / \partial c^m$ according to Eqs. (17)-(30).
        Update the parameters $W^m$ and $c^m$ using the BP algorithm.
      end for
    end for
  until a fixed number of iterations is reached.

IV. EXPERIMENTS

In this section, we conduct extensive experiments on three widely used multimodal datasets to validate the effectiveness of the proposed deep hashing method.

A. Datasets

Extensive experiments are conducted on three widely used multimedia benchmarks: Wiki [74], MIRFlickr25k [75] and NUS-WIDE [76]. The details of these multimodal datasets are introduced in the following.

Wiki. The Wiki dataset includes 2,866 image-text pairs crawled from Wikipedia. Each image-text pair in this dataset is annotated with one of 10 provided labels. We randomly select 20% of the image-text pairs as the query set, and the remaining pairs are used to form the training set and the retrieval database.

MIRFlickr25k. The MIRFlickr25k dataset consists of 25,000 images crawled from Flickr, where each image is associated with corresponding textual tags. Each image-text pair in this dataset is annotated with one or more of 24 provided labels. We keep the tags that appear at least 20 times, and remove the image-text pairs without any annotations or tags. This yields 18,591 image-text pairs and 1,075 textual tags for the experiment. 1,908 image-text pairs are randomly selected to form the query set, and the remaining pairs form the database, from which 5,000 image-text pairs are selected as the training set.

NUS-WIDE. The NUS-WIDE dataset contains 269,648 images collected from Flickr, where each image is associated with textual tags. There are 81 ground-truth semantic concepts in this dataset. We keep the 21 most frequent concepts, giving 195,834 image-text pairs. We randomly select 100 pairs per class as queries, and the remaining pairs are used as the database, from which 500 pairs per class are selected to form the training set.
B. Compared Baselines and Evaluation Metrics

To evaluate the performance of the proposed deep hashing method, we compare it with several state-of-the-art methods, including six non-deep hashing methods (CMFH [13], LSSH [43], STMH [44], SCM [12], SePH [45] and LSRH [46]) and three deep hashing methods (multimodal similarity-preserving hashing [70], CHN [71] and DCMH [73]). These hashing methods have been introduced in Section II in detail. The source codes of most of these methods are available online; the remaining ones are carefully implemented by ourselves. The parameters of these methods are chosen according to the suggestions in the corresponding papers.

Table I: mAP comparison with the non-deep hashing methods on Wiki, MIRFlickr25k and NUS-WIDE (8, 16, 32 and 48 bits) for the image-query-text, text-query-image and image-query-image tasks.

Table II: mAP comparison with the deep hashing methods on Wiki, MIRFlickr25k and NUS-WIDE (8, 16, 32 and 48 bits) for the image-query-text, text-query-image and image-query-image tasks.

For the non-deep hashing methods, the images are represented by the 4096-dimensional CNN features of the last fully-connected layer of the AlexNet [47] network. For the deep hashing methods, the AlexNet network is used for a fair comparison. Furthermore, except for Wiki, in which the texts are represented by 1000-D tf-idf features, the texts in the other datasets are represented by binary tag vectors: 1075-D binary vectors for MIRFlickr25k and 1000-D binary vectors for NUS-WIDE.

In this paper, we verify the proposed method on three multimodal retrieval tasks: image-query-text, text-query-image and image-query-image. We adopt the following evaluation metrics to measure the performance of the methods: mean Average Precision (mAP), Top-K Precision (P@K) and Precision-Recall (PR).
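The mAP metric can be made concrete with a short NumPy sketch; the function name and the shared-label relevance criterion follow the common multi-label retrieval protocol and are our assumptions rather than code from the paper:

```python
import numpy as np

def mean_average_precision(q_codes, db_codes, q_labels, db_labels):
    """
    mAP over Hamming-ranked retrieval lists. Codes are {-1,+1} arrays
    (n x L); a database item is relevant if it shares at least one label
    with the query.
    """
    aps = []
    for q, ql in zip(q_codes, q_labels):
        dist = 0.5 * (q_codes.shape[1] - db_codes @ q)   # Hamming distance
        order = np.argsort(dist, kind="stable")          # rank the database
        rel = (db_labels[order] @ ql) > 0                # shared label -> relevant
        if rel.sum() == 0:
            continue
        prec = np.cumsum(rel) / (np.arange(len(rel)) + 1)
        aps.append((prec * rel).sum() / rel.sum())       # average precision
    return float(np.mean(aps))
```

The identity `d = (L - b1·b2) / 2` for ±1 codes lets the whole ranking be computed with one matrix product, which is why mAP over large databases remains cheap.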

Figure 2: P@100 curves with respect to different code lengths for the different methods on the three datasets: (a)-(c) Wiki, (d)-(f) MIRFlickr25k and (g)-(i) NUS-WIDE, for the image-query-text, text-query-image and image-query-image tasks.

C. Experimental Setting

Our network contains two parts: an image subnetwork and a text subnetwork. For the image subnetwork, we employ the widely used AlexNet [47] model for feature representation learning. AlexNet has five convolutional layers (conv1 to conv5) and two fully-connected layers (fc6 and fc7). For hash function learning, we add two fully-connected layers on top of the fc7 layer, as illustrated in Figure 1. The first new fully-connected layer (named the fch layer) transforms the deep feature representation into compact hash codes; the number of its outputs is set to the code length of the hash codes. To encourage the activations of the fch layer to be binary, the tanh activation function is utilized to bound the activations between -1 and 1. To make the hash codes discriminative, the learned hash codes are expected to predict the class labels well. Specifically, the outputs of the fch layer are fed into the second new fully-connected layer (named the fcc layer) for the classification task. The fcc layer has the same number of outputs as the number of class labels. The text subnetwork comprises four fully-connected layers, where the first two layers are for feature representation learning and the last two layers are for hash function learning. The number of outputs of the first two layers is set to 4096, and the configuration of the last two layers is the same as in the image subnetwork.

The proposed method is implemented using the open-source Caffe [77] framework on an NVIDIA K20 GPU server. The whole network is initialized with Xavier initialization, except for the layers from conv1 to fc7 of the image subnetwork, which are copied from the pre-trained AlexNet model. The network is trained using mini-batch Stochastic Gradient Descent with the back-propagation algorithm. We set the learning rate and the batch size to 1e-5 and 128, respectively. Besides, the learning rates of the fch and fcc layers are set 1000 and 100 times larger than those of the other layers. In DSMHN, we set α = 1 and β = γ = 1 by using a validation strategy on all the datasets.
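For concreteness, here is a minimal PyTorch-style sketch of the two subnetworks and the layer-wise learning rates described above. The paper's implementation is in Caffe, so torchvision's pre-trained AlexNet, the momentum value, and all identifiers are assumptions:

```python
import torch
import torch.nn as nn
from torchvision import models

class ImageSubnet(nn.Module):
    """AlexNet conv1-fc7 backbone + hash layer fch (tanh) + classifier fcc."""
    def __init__(self, code_len, num_classes):
        super().__init__()
        alexnet = models.alexnet(weights=models.AlexNet_Weights.DEFAULT)
        self.features, self.avgpool = alexnet.features, alexnet.avgpool
        self.fc67 = nn.Sequential(*list(alexnet.classifier.children())[:-1])
        self.fch = nn.Linear(4096, code_len)        # outputs = code length
        self.fcc = nn.Linear(code_len, num_classes)

    def forward(self, x):
        h = self.fc67(torch.flatten(self.avgpool(self.features(x)), 1))
        z = torch.tanh(self.fch(h))                 # relaxed codes in (-1, 1)
        return z, torch.sigmoid(self.fcc(z))        # codes, label predictions

class TextSubnet(nn.Module):
    """Four fully-connected layers: two for features, then fch and fcc."""
    def __init__(self, in_dim, code_len, num_classes):
        super().__init__()
        self.feat = nn.Sequential(nn.Linear(in_dim, 4096), nn.ReLU(),
                                  nn.Linear(4096, 4096), nn.ReLU())
        self.fch = nn.Linear(4096, code_len)
        self.fcc = nn.Linear(code_len, num_classes)

    def forward(self, y):
        z = torch.tanh(self.fch(self.feat(y)))
        return z, torch.sigmoid(self.fcc(z))

def make_optimizer(net, base_lr=1e-5):
    """Mirror the Caffe lr_mult setting: fch/fcc train 1000x/100x faster."""
    backbone = [p for n, p in net.named_parameters()
                if not n.startswith(("fch", "fcc"))]
    return torch.optim.SGD(
        [{"params": backbone, "lr": base_lr},
         {"params": net.fch.parameters(), "lr": base_lr * 1000},
         {"params": net.fcc.parameters(), "lr": base_lr * 100}],
        momentum=0.9)  # momentum value is an assumption
```

The much larger learning rates on the new layers are a practical choice: the randomly initialized fch/fcc layers must adapt quickly, while the pre-trained backbone needs only gentle fine-tuning.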

Figure 3: Precision curves of 48-bit hash codes with respect to different numbers of top returned samples on the three datasets: (a)-(c) Wiki, (d)-(f) MIRFlickr25k and (g)-(i) NUS-WIDE, for the image-query-text, text-query-image and image-query-image tasks.

D. Experimental Results

We evaluate the retrieval performance on three multimodal retrieval tasks: image-query-text, text-query-image and image-query-image. In this experiment, we adopt the contrastive loss for the Wiki and NUS-WIDE datasets and the Euclidean loss for the MIRFlickr25k dataset. We first report the mAP results of the non-deep and deep hashing methods on the Wiki, MIRFlickr25k and NUS-WIDE datasets in Table I and Table II, respectively. As the mAP value is calculated over all returned samples, it evaluates the overall performance of a hashing method. From Table I and Table II, we observe that the proposed method consistently outperforms all compared methods across code lengths. For example, as shown in Table I, our method achieves average performance gains of 2.85%/5.64%/4.98%, 6.38%/7.56%/9.39% and 3.86%/7.35%/2.62% for image-query-text/text-query-image/image-query-image on the Wiki, MIRFlickr25k and NUS-WIDE datasets, respectively, compared with the best results (underlined) of the non-deep methods. Similarly, compared with the best results of the deep baselines, DSMHN gains average performance increases of 5.3%/7.65%/6.06%, 7.4%/6.47%/1.34% and 2.62%/7%/1.66% on the Wiki, MIRFlickr25k and NUS-WIDE datasets, respectively. We find that the performance gap between the different multimodal retrieval tasks is very small on the MIRFlickr25k and NUS-WIDE datasets, but not on the Wiki dataset. This is mainly caused by the large semantic gap between the image and text modalities of the Wiki dataset: as the texts describe the semantic concepts much better than the images in this dataset, much better performance is achieved when texts are used to query the image retrieval database. Another interesting observation is that the mAP results of the proposed method on the image-query-image task are much better than those of the compared baselines on all datasets. In DSMHN, the intra-modality semantic class labels are explicitly exploited in hash learning, so the intra-modality correlation is maximally preserved. This makes the learned hash codes discriminative enough to achieve superior performance on the unimodal retrieval task.

In addition to the mAP, we also evaluate the retrieval performance in terms of the top-K precision and precision-recall curves, where K is set to 100.
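A companion sketch of these two metrics, again with illustrative names, reusing the Hamming-ranked boolean relevance list computed exactly as in the mAP sketch of Section IV-B:

```python
import numpy as np

def precision_at_k(rel, k=100):
    """Top-K precision for one query's ranked relevance list."""
    return float(rel[:k].mean())

def precision_recall_curve(rel):
    """Precision and recall after each returned sample, for PR curves."""
    hits = np.cumsum(rel)
    precision = hits / (np.arange(len(rel)) + 1)
    recall = hits / max(rel.sum(), 1)
    return precision, recall

# Toy ranked relevance list: 1 = relevant item at that rank.
rel = np.array([1, 0, 1, 1, 0, 0, 1], dtype=float)
print(precision_at_k(rel, k=5))          # -> 0.6
print(precision_recall_curve(rel))
```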

Figure 4: Precision-recall curves with 48-bit hash codes for the different methods on the three datasets: (a)-(c) Wiki, (d)-(f) MIRFlickr25k and (g)-(i) NUS-WIDE, for the image-query-text, text-query-image and image-query-image tasks.

Figure 2 illustrates the P@100 performance with different code lengths on all datasets. Figure 3 shows the precision curves with respect to different numbers of top returned samples for 48-bit hash codes on all datasets. From Figure 2 and Figure 3, we observe that DSMHN outperforms all the baselines on all multimodal retrieval tasks, which is consistent with the mAP results. We also plot the precision-recall curves with 48 hash bits for all baselines on all datasets in Figure 4. Note that a larger area under the precision-recall curve indicates better overall performance. As shown in Figure 4, DSMHN is competitive with or outperforms all the baselines across the different datasets.

Overall, the proposed DSMHN is competitive with or substantially outperforms the baseline methods under different evaluation metrics on both unimodal and cross-modal retrieval tasks. This good performance indicates the superiority of the proposed method, which we attribute to the following advantages. First, DSMHN explicitly preserves both the inter-modality similarity and the semantic class labels during hash learning; therefore, the intra-modality and inter-modality correlations of the different modalities are maximally preserved to generate high-quality hash codes. Second, DSMHN takes the bit-balance constraint into account to make the hash bits evenly distributed.

E. Discussion of Different Loss Functions

The proposed deep hashing method can be integrated with different types of loss functions. In this section, we evaluate the performance of the proposed DSMHN in terms of mAP and P@100 using four different loss functions: the L1 loss, L2 loss, hinge loss and contrastive loss. The details of these loss functions are given in Section III-C. We report the mAP and P@100 results with 16-bit hash codes for the four loss functions in Figure 5.

Figure 5: mAP and P@100 comparison using different types of loss functions (L1, L2, hinge and contrastive) with 16-bit hash codes on the three datasets: (a)-(b) Wiki, (c)-(d) MIRFlickr25k and (e)-(f) NUS-WIDE.

We observe that the performance of the different loss functions is very close. Such a small performance difference demonstrates the robustness of the proposed method to the choice of loss function. Additionally, the capability of integrating different loss functions into the unified deep hashing framework demonstrates the scalability and flexibility of the proposed method.

V. CONCLUSION

In this paper, we propose a novel Deep Semantic Multimodal Hashing Network (DSMHN) for scalable multimodal retrieval, which explores the inter-modality correlation structure and the intra-modality semantic label information with deep neural networks. Specifically, DSMHN integrates feature representation learning, inter-modality similarity-preserving learning, intra-modality semantic label learning and hash learning with a bit-balance constraint into an end-to-end framework. Moreover, DSMHN can accommodate different types of loss functions with minimal modification to the hash layer of the network, which demonstrates the scalability and flexibility of the proposed framework. Extensive experiments on three multimodal datasets show the superiority of the proposed method on both unimodal and cross-modal retrieval tasks.

REFERENCES

[1] J. Tang, Z. Li, M. Wang, and R. Zhao, Neighborhood discriminant hashing for large-scale image retrieval, IEEE Trans. Image Process., vol. 24, no. 9, 2015.
[2] Y. Weiss, A. Torralba, and R. Fergus, Spectral hashing, in Proc. Int. Conf. Neural Inf. Process. Syst., 2008.
[3] W. Liu, J. Wang, R. Ji, Y.-G. Jiang, and S.-F. Chang, Supervised hashing with kernels, in Proc. Comput. Vis. Pattern Recognit., 2012.
[4] R. Salakhutdinov and G. Hinton, Semantic hashing, Int. J. Approx. Reasoning, vol. 50, no. 7, 2009.
[5] L. Jin, K. Li, H. Hu, G.-J. Qi, and J. Tang, Semantic neighbor graph hashing for multimodal retrieval, IEEE Trans. Image Process., vol. 27, no. 3, 2018.
[6] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna, Rethinking the inception architecture for computer vision, in Proc. Comput. Vis. Pattern Recognit., 2016.
[7] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, Going deeper with convolutions, in Proc. Comput. Vis. Pattern Recognit., 2015.
[8] J. C. Pereira, E. Coviello, G. Doyle, N. Rasiwasia, G. R. Lanckriet, R. Levy, and N. Vasconcelos, On the role of correlation and abstraction in cross-modal multimedia retrieval, IEEE Trans. Pattern Anal. Mach. Intell., vol. 36, no. 3, 2014.
[9] C. Kang, S. Xiang, S. Liao, C. Xu, and C. Pan, Learning consistent feature representation for cross-modal multimedia retrieval, IEEE Trans. Multimedia, vol. 17, no. 3, 2015.
[10] R. Zhang, L. Lin, R. Zhang, W. Zuo, and L. Zhang, Bit-scalable deep hashing with regularized similarity learning for image retrieval and person re-identification, IEEE Trans. Image Process., vol. 24, no. 12, 2015.
[11] F. Zhu, X. Kong, L. Zheng, H. Fu, and Q. Tian, Part-based deep hashing for large-scale person re-identification, IEEE Trans. Image Process., vol. 26, no. 10, 2017.
[12] D. Zhang and W.-J. Li, Large-scale supervised multimodal hashing with semantic correlation maximization, in Proc. AAAI Conf. Artif. Intell., 2014.
[13] G. Ding, Y. Guo, and J. Zhou, Collective matrix factorization hashing for multimodal data, in Proc. Comput. Vis. Pattern Recognit., 2014.
[14] J. Tang and Z. Li, Weakly-supervised multimodal hashing for scalable social image retrieval, IEEE Trans. Circuits Syst. Video Technol., vol. 28, no. 10, 2018.
[15] P. Indyk and R. Motwani, Approximate nearest neighbors: towards removing the curse of dimensionality, in Proc. ACM Symp. Theory Comput., 1998.
[16] Y. Gong and S. Lazebnik, Iterative quantization: A procrustean approach to learning binary codes, in Proc. Comput. Vis. Pattern Recognit., 2011.
[17] H. Liu, R. Wang, S. Shan, and X. Chen, Deep supervised hashing for fast image retrieval, in Proc. Comput. Vis. Pattern Recognit., 2016.
[18] H.-F. Yang, K. Lin, and C.-S. Chen, Supervised learning of semantics-preserving hash via deep convolutional neural networks, IEEE Trans. Pattern Anal. Mach. Intell., vol. 40, no. 2, 2018.
[19] X. Zhu, Z. Huang, H. T. Shen, and X. Zhao, Linear cross-modal hashing for efficient multimedia search, in Proc. ACM Int. Conf. Multimedia, 2013.
[20] G. Irie, H. Arai, and Y. Taniguchi, Alternating co-quantization for cross-modal hashing, in Proc. IEEE Int. Conf. Comput. Vis., 2015.
[21] J. Tang, K. Wang, and L. Shao, Supervised matrix factorization hashing for cross-modal retrieval, IEEE Trans. Image Process., vol. 25, no. 7, 2016.
[22] J. Tang, Z. Li, and X. Zhu, Supervised deep hashing for scalable face image retrieval, Pattern Recognit., vol. 75, 2018.
[23] V. Erin Liong, J. Lu, G. Wang, P. Moulin, and J. Zhou, Deep hashing for compact binary codes learning, in Proc. Comput. Vis. Pattern Recognit., 2015.
[24] W.-J. Li, S. Wang, and W.-C. Kang, Feature learning based deep supervised hashing with pairwise labels, in Proc. Int. Joint Conf. Artif. Intell., 2016.
[25] F. Shen, Y. Xu, L. Liu, Y. Yang, Z. Huang, and H. T. Shen, Unsupervised deep hashing with similarity-adaptive and discrete optimization, IEEE Trans. Pattern Anal. Mach. Intell., 2018.

[26] T.-T. Do, A.-D. Doan, and N.-M. Cheung, Learning to hash with binary deep neural network, in Proc. Eur. Conf. Comput. Vis., 2016.
[27] J. Tang, J. Lin, Z. Li, and J. Yang, Discriminative deep quantization hashing for face image retrieval, IEEE Trans. Neural Netw. Learn. Syst., vol. 29, no. 12, 2018.
[28] Q. Li, Z. Sun, R. He, and T. Tan, Deep supervised discrete hashing, in Proc. Int. Conf. Neural Inf. Process. Syst., 2017.
[29] F. Zhao, Y. Huang, L. Wang, and T. Tan, Deep semantic ranking based hashing for multi-label image retrieval, in Proc. Comput. Vis. Pattern Recognit., 2015.
[30] Y. Cao, M. Long, J. Wang, Q. Yang, and P. S. Yu, Deep visual-semantic hashing for cross-modal retrieval, in Proc. ACM Int. Conf. Knowl. Discovery Data Mining, 2016.
[31] Y. Wei, Y. Zhao, C. Lu, S. Wei, L. Liu, Z. Zhu, and S. Yan, Cross-modal retrieval with CNN visual features: A new baseline, IEEE Trans. Cybernetics, vol. 47, no. 2, 2017.
[32] L. Jin, K. Li, Z. Li, F. Xiao, G.-J. Qi, and J. Tang, Deep semantic-preserving ordinal hashing for cross-modal similarity search, IEEE Trans. Neural Netw. Learn. Syst., to be published.
[33] B. Kulis and T. Darrell, Learning to hash with binary reconstructive embeddings, in Proc. Int. Conf. Neural Inf. Process. Syst., 2009.
[34] W. Liu, J. Wang, S. Kumar, and S.-F. Chang, Hashing with graphs, in Proc. Int. Conf. Mach. Learn., 2011.
[35] F. Shen, C. Shen, W. Liu, and H. Tao Shen, Supervised discrete hashing, in Proc. Comput. Vis. Pattern Recognit., 2015.
[36] L. Liu, L. Shao, F. Shen, and M. Yu, Discretely coding semantic rank orders for supervised image hashing, in Proc. Comput. Vis. Pattern Recognit., 2017.
[37] C. Da, S. Xu, K. Ding, G. Meng, S. Xiang, and C. Pan, AMVH: Asymmetric multi-valued hashing, in Proc. Comput. Vis. Pattern Recognit., 2017.
[38] F. Shen, Y. Yang, L. Liu, W. Liu, D. Tao, and H. T. Shen, Asymmetric binary coding for image search, IEEE Trans. Multimedia, vol. 19, no. 9, 2017.
[39] S. Kumar and R. Udupa, Learning hash functions for cross-view similarity search, in Proc. Int. Joint Conf. Artif. Intell., 2011.
[40] Y. Zhen and D.-Y. Yeung, Co-regularized hashing for multimodal data, in Proc. Int. Conf. Neural Inf. Process. Syst., 2012.
[41] J. Song, Y. Yang, Y. Yang, Z. Huang, and H. T. Shen, Inter-media hashing for large-scale retrieval from heterogeneous data sources, in Proc. ACM Int. Conf. Manage. Data, 2013.
[42] M. M. Bronstein, A. M. Bronstein, F. Michel, and N. Paragios, Data fusion through cross-modality metric learning using similarity-sensitive hashing, in Proc. Comput. Vis. Pattern Recognit., 2010.
[43] J. Zhou, G. Ding, and Y. Guo, Latent semantic sparse hashing for cross-modal similarity search, in Proc. Int. ACM SIGIR Conf. Res. Dev. Inf. Retrieval, 2014.
[44] D. Wang, X. Gao, X. Wang, and L. He, Semantic topic multimodal hashing for cross-media retrieval, in Proc. Int. Joint Conf. Artif. Intell., 2015.
[45] Z. Lin, G. Ding, M. Hu, and J. Wang, Semantics-preserving hashing for cross-view retrieval, in Proc. Comput. Vis. Pattern Recognit., 2015.
[46] K. Li, G.-J. Qi, J. Ye, and K. A. Hua, Linear subspace ranking hashing for cross-modal retrieval, IEEE Trans. Pattern Anal. Mach. Intell., vol. 39, no. 9, 2017.
[47] A. Krizhevsky, I. Sutskever, and G. E. Hinton, Imagenet classification with deep convolutional neural networks, in Proc. Int. Conf. Neural Inf. Process. Syst., 2012.
[48] Y. LeCun, Y. Bengio, and G. Hinton, Deep learning, Nature, vol. 521, no. 7553, 2015.
[49] A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and L. Fei-Fei, Large-scale video classification with convolutional neural networks, in Proc. Comput. Vis. Pattern Recognit., 2014.
[50] T. Xiao, Y. Xu, K. Yang, J. Zhang, Y. Peng, and Z. Zhang, The application of two-level attention models in deep convolutional neural network for fine-grained image classification, in Proc. Comput. Vis. Pattern Recognit., 2015.
[51] J. Wang, Y. Yang, J. Mao, Z. Huang, C. Huang, and W. Xu, CNN-RNN: A unified framework for multi-label image classification, in Proc. Comput. Vis. Pattern Recognit., 2016.
[52] R. Girshick, Fast R-CNN, in Proc. IEEE Int. Conf. Comput. Vis., 2015.
[53] A. Sharif Razavian, H. Azizpour, J. Sullivan, and S. Carlsson, CNN features off-the-shelf: An astounding baseline for recognition, in Proc. Comput. Vis. Pattern Recognit. Workshops, 2014.
[54] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille, DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs, IEEE Trans. Pattern Anal. Mach. Intell., vol. 40, no. 4, 2018.
[55] K. He, G. Gkioxari, P. Dollár, and R. Girshick, Mask R-CNN, in Proc. IEEE Int. Conf. Comput. Vis., 2017.
[56] H. Noh, S. Hong, and B. Han, Learning deconvolution network for semantic segmentation, in Proc. IEEE Int. Conf. Comput. Vis., 2015.
[57] O. Vinyals, A. Toshev, S. Bengio, and D. Erhan, Show and tell: A neural image caption generator, in Proc. Comput. Vis. Pattern Recognit., 2015.
[58] K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhutdinov, R. Zemel, and Y. Bengio, Show, attend and tell: Neural image caption generation with visual attention, in Proc. Int. Conf. Mach. Learn., 2015.
[59] L. Chen, H. Zhang, J. Xiao, L. Nie, J. Shao, W. Liu, and T.-S. Chua, SCA-CNN: Spatial and channel-wise attention in convolutional networks for image captioning, in Proc. Comput. Vis. Pattern Recognit., 2017.
[60] J. Wan, D. Wang, S. C. H. Hoi, P. Wu, J. Zhu, Y. Zhang, and J. Li, Deep learning for content-based image retrieval: A comprehensive study, in Proc. ACM Int. Conf. Multimedia, 2014.
[61] L. Zheng, Y. Yang, and Q. Tian, SIFT meets CNN: A decade survey of instance retrieval, IEEE Trans. Pattern Anal. Mach. Intell., vol. 40, no. 5, 2018.
[62] L. Jin, X. Shu, K. Li, Z. Li, G.-J. Qi, and J. Tang, Deep ordinal hashing with spatial attention, IEEE Trans. Image Process., to be published.
[63] E. Yang, C. Deng, W. Liu, X. Liu, D. Tao, and X. Gao, Pairwise relationship guided deep hashing for cross-modal retrieval, in Proc. AAAI Conf. Artif. Intell., 2017.
[64] J. Yue-Hei Ng, M. Hausknecht, S. Vijayanarasimhan, O. Vinyals, R. Monga, and G. Toderici, Beyond short snippets: Deep networks for video classification, in Proc. Comput. Vis. Pattern Recognit., 2015.
[65] B. Fernando and S. Gould, Learning end-to-end video classification with rank-pooling, in Proc. Int. Conf. Mach. Learn., 2016.
[66] K. Kang, W. Ouyang, H. Li, and X. Wang, Object detection from video tubelets with convolutional neural networks, in Proc. Comput. Vis. Pattern Recognit., 2016.
[67] R. Xia, Y. Pan, H. Lai, C. Liu, and S. Yan, Supervised hashing for image retrieval via image representation learning, in Proc. AAAI Conf. Artif. Intell., 2014.
[68] T. Yao, F. Long, T. Mei, and Y. Rui, Deep semantic-preserving and ranking-based hashing for image retrieval, in Proc. Int. Joint Conf. Artif. Intell., 2016.
[69] H. Lai, Y. Pan, Y. Liu, and S. Yan, Simultaneous feature learning and hash coding with deep neural networks, in Proc. Comput. Vis. Pattern Recognit., 2015.
[70] J. Masci, M. M. Bronstein, A. M. Bronstein, and J. Schmidhuber, Multimodal similarity-preserving hashing, IEEE Trans. Pattern Anal. Mach. Intell., vol. 36, no. 4, 2014.
[71] Y. Cao, M. Long, J. Wang, and P. S. Yu, Correlation hashing network for efficient cross-modal retrieval, in Proc. Brit. Mach. Vis. Conf., 2017.
[72] Y. Cao, M. Long, J. Wang, and H. Zhu, Correlation autoencoder hashing for supervised cross-modal search, in Proc. ACM Int. Conf. Multimedia Retrieval, 2016.
[73] Q.-Y. Jiang and W.-J. Li, Deep cross-modal hashing, in Proc. Comput. Vis. Pattern Recognit., 2017.
[74] N. Rasiwasia, J. Costa Pereira, E. Coviello, G. Doyle, G. R. Lanckriet, R. Levy, and N. Vasconcelos, A new approach to cross-modal multimedia retrieval, in Proc. ACM Int. Conf. Multimedia, 2010.
[75] M. J. Huiskes and M. S. Lew, The MIR Flickr retrieval evaluation, in Proc. ACM Int. Conf. Multimedia Inf. Retrieval, 2008.
[76] T.-S. Chua, J. Tang, R. Hong, H. Li, Z. Luo, and Y. Zheng, NUS-WIDE: A real-world web image database from National University of Singapore, in Proc. ACM Int. Conf. Image Video Retrieval, 2009.

[77] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell, Caffe: Convolutional architecture for fast feature embedding, in Proc. ACM Int. Conf. Multimedia, 2014.

Lu Jin received the B.E. degree in Measuring and Control Technology and Instrumentation from Northeastern University at Qinhuangdao, Hebei, China, in 2010. She is currently a Ph.D. candidate in Computer Science and Technology at Nanjing University of Science and Technology. From 2015 to 2017, she was a visiting scholar with the Department of Computer Science, University of Central Florida. Her research interests include multimedia computing, deep learning and multimedia retrieval. She received the Best Student Paper Award at ICIMCS 2018.

Fu Xiao received the Ph.D. degree in Computer Science and Technology from Nanjing University of Science and Technology, Nanjing, China. He is currently a Professor and Ph.D. supervisor with the School of Computer, Nanjing University of Posts and Telecommunications, Nanjing, China. He has published over 30 papers in related international conferences and journals, including INFOCOM, ICC, IPCCC, IEEE JSAC, IEEE/ACM TON, IEEE TMC, ACM TECS and IEEE TVT. Dr. Xiao is a member of the IEEE Computer Society and the Association for Computing Machinery.

Jinhui Tang is a Professor in the School of Computer Science and Engineering, Nanjing University of Science and Technology, China. He received his B.E. and Ph.D. degrees in July 2003 and July 2008, respectively, both from the University of Science and Technology of China. From 2008 to 2010, he was a research fellow in the School of Computing, National University of Singapore. His current research interests include large-scale multimedia search. He has authored over 150 journal and conference papers in these areas. Prof. Tang is a co-recipient of the Best Paper Awards at ACM MM 2007, PCM 2011 and ICIMCS 2011, and the Best Student Paper Award at MMM 2016.

Zechao Li is currently a Professor at Nanjing University of Science and Technology. He received the Ph.D. degree from the National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences, in 2013, and the B.E. degree from the University of Science and Technology of China in 2008. His research interests include intelligent media analysis and computer vision. He received the Young Talent Program award of the China Association for Science and Technology, the Excellent Doctoral Dissertation award of the Chinese Academy of Sciences, and the Excellent Doctoral Theses award of the China Computer Federation.

Guo-Jun Qi received the Ph.D. degree from the University of Illinois at Urbana-Champaign in December 2013. His research interests include pattern recognition, machine learning, computer vision, multimedia, and data mining. He twice received the IBM Ph.D. Fellowship, as well as a Microsoft Fellowship. He is the recipient of the Best Paper Award at the 15th ACM International Conference on Multimedia, Augsburg, Germany, 2007. He is currently a faculty member with the Department of Computer Science at the University of Central Florida, and has served as a program committee member and reviewer for many academic conferences and journals in the fields of pattern recognition, machine learning, data mining, computer vision, and multimedia.
