Deep Semantic Multimodal Hashing Network for Scalable Multimedia Retrieval

Lu Jin, Jinhui Tang, Senior Member, IEEE, Zechao Li, Guo-Jun Qi, and Fu Xiao

Abstract — Hashing has been widely applied to multimodal retrieval on large-scale multimedia data due to its efficiency in computation and storage. In particular, deep hashing has received unprecedented research attention in recent years, owing to its excellent retrieval performance. However, most existing deep hashing methods learn binary hash codes by preserving the similarity relationship without exploiting the semantic labels, which results in suboptimal binary codes. In this work, we propose a novel Deep Semantic Multimodal Hashing Network (DSMHN) for scalable multimodal retrieval. In DSMHN, two sets of modality-specific hash functions are jointly learned by explicitly preserving both the inter-modality similarities and the intra-modality semantic labels. Specifically, with the assumption that the learned hash codes should be optimal for task-specific classification, two stream networks are jointly trained to learn the hash functions by embedding the semantic labels into the resultant hash codes. Different from previous deep hashing methods, which are tied to particular forms of loss functions, our deep hashing framework can be flexibly integrated with different types of loss functions. In addition, the bit balance property is investigated to generate binary codes in which each bit has a 50% probability of being +1 or -1. Moreover, a unified deep multimodal hashing framework is proposed to learn compact and high-quality hash codes by simultaneously exploiting feature representation learning, inter-modality similarity-preserving learning, semantic label-preserving learning, and hash function learning with a bit-balance constraint. We conduct extensive experiments on both unimodal and cross-modal retrieval tasks over three widely used multimodal retrieval datasets. The experimental results demonstrate that DSMHN significantly outperforms state-of-the-art methods.

Index Terms — Multimedia data retrieval, deep hashing, intra-modality similarity, semantic label information, binary hash codes, convolutional neural network.

I. INTRODUCTION

With the rapid advancement of social networking sites, recent years have witnessed the explosive growth of social media, including images, user-provided textual captions, videos and so on. For instance, as one of the most popular online social media platforms, Facebook has more than 2 billion users sharing photos, videos, and textual content daily. When a Facebook user searches for topics of interest, it is desirable to return relevant results such as images, textual content, and videos. As there is a huge semantic gap among different modalities, it is especially challenging to search for relevant multimodal content among a great amount of heterogeneous multimedia data.

(L. Jin, J. Tang and Z. Li are with the School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing 210094, China (e-mail: lujin505@gmail.com, jinhuitang@njust.edu.cn, zechao.li@njust.edu.cn). G.-J. Qi is with the Department of Computer Science, University of Central Florida, Orlando, FL 32816, USA (e-mail: guojun.qi@ucf.edu). F. Xiao is with the School of Computer, Nanjing University of Posts and Telecommunications, Nanjing 210023, China (e-mail: xiaof@njupt.edu.cn). Corresponding author: Jinhui Tang.)
Consequently, it is imperative to develop efficient similarity search for large-scale multimedia data. Hashing [1], [2], [3], [4], [5] has recently received a great deal of attention in the fields of computer vision [6], [7], multimedia retrieval [8], [9] and person re-identification [10], [11]. The core idea of hashing is to project high-dimensional data into compact binary hash codes such that similar samples are encoded to nearby hash codes in the Hamming space. Compared with real-valued features, binary hash codes greatly reduce the storage cost and allow fast Hamming distance computation with the XOR operation, thus enabling efficient similarity search over large-scale databases. Existing multimodal hashing methods [12], [13], [14] can be roughly divided into unimodal hashing methods and cross-modal hashing methods. Given a query (e.g., a query image), unimodal hashing methods focus on searching for relevant content within a unimodal retrieval database. Well-known unimodal hashing methods include Locality-Sensitive Hashing (LSH) [15], Iterative Quantization (ITQ) [16], Supervised Hashing with Kernels (KSH) [3], Deep Supervised Hashing (DSH) [17] and Supervised Semantics-preserving Deep Hashing (SSDH) [18]. Multimodal data is usually represented by distinct feature representations in heterogeneous feature spaces, which makes searching across multimodal data (e.g., image, text and video) particularly challenging. Most cross-modal hashing methods [19], [20], [21] attempt to map heterogeneous features into a common space so that the correlation structures of different modalities can be preserved. However, very few efforts have been made to address both unimodal and cross-modal retrieval, which limits real-world multimedia retrieval applications.

Owing to their great potential in learning powerful feature representations, deep hashing methods [22], [23], [24], [25], [26] have shown superior performance over traditional hashing methods that adopt hand-crafted features to learn the hash functions. By jointly learning the feature representation and the hash functions, the feature representation is made optimally compatible with hashing, leading to discriminative and high-quality binary hash codes. For most deep unimodal hashing methods [27], [28], [29], the foremost concern is to approximate the hash functions by preserving the intra-modality similarity, which is insufficient to build the correlation structure across modalities. Meanwhile, although recent deep cross-modal hashing methods [30], [31], [32] can well preserve the inter-modality correlation structure of multiple modalities, they are less effective for unimodal retrieval because the intra-modality semantic label information is not preserved during hash learning.
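As a concrete illustration of the XOR-based distance computation mentioned above, here is a minimal Python sketch (the function name and the toy codes are illustrative, not code from the paper) comparing two packed binary codes:

```python
import numpy as np

def hamming_distance(a: int, b: int) -> int:
    """Hamming distance between two packed binary codes via XOR + popcount."""
    x = np.uint64(a) ^ np.uint64(b)   # differing bits are set to 1
    return bin(int(x)).count("1")     # population count of the XOR result

# Toy example: two 8-bit codes differing in exactly three bit positions.
code_q  = 0b1011_0010
code_db = 0b1010_0111
print(hamming_distance(code_q, code_db))  # -> 3
```

Because the XOR and popcount run in a handful of machine instructions, ranking millions of database codes against a query is far cheaper than computing distances between real-valued feature vectors.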

Figure 1: An overview of the proposed Deep Semantic Multimodal Hashing Network (DSMHN) for multimodal retrieval. The network learns the binary hash codes by imposing four constraints on the hash layer (fch) of each subnetwork. First, the inter-modality similarity-preserving loss is minimized to make the hash codes of different modalities consistent. Second, the cross-entropy loss is used to minimize the prediction errors of the learned hash codes. Third, the bit-balance constraint makes the hash bits evenly distributed. Finally, the quantization loss forces the outputs of the hash layer to be close to +1 and -1. Best viewed in color.

To address these limitations, we propose a novel Deep Semantic Multimodal Hashing Network (DSMHN) that performs scalable unimodal and cross-modal retrieval for multimedia data under a generalized deep hashing framework. The flowchart of the proposed method is illustrated in Figure 1. To learn efficient modality-specific hash functions, the inter-modality similarity relationships and the intra-modality semantic information are jointly exploited to guide the learning of the hash functions. In addition to learning inter-modality similarity-preserving hash codes, the learned hash codes are expected to be optimally compatible with label prediction, so as to effectively capture both the inter-modality correlation and the intra-modality semantic information. Based on this, the objective of DSMHN is to simultaneously minimize the pairwise similarity-preserving loss and the prediction errors of the learned hash codes. Different from previous deep hashing methods, which are built around one particular loss function, DSMHN can be flexibly trained with different types of loss functions under a generalized deep hashing framework. Moreover, to obtain balanced hash codes, we also investigate the bit balance property by making each bit have a 50% probability of being +1 or -1. Overall, we highlight three main contributions of the proposed method as follows.

1) This paper proposes a generalized deep multimodal hashing framework which simultaneously exploits feature representation learning, inter-modality similarity-preserving learning, intra-modality semantic label-preserving learning and hash function learning with a bit-balance constraint for scalable unimodal and multimodal retrieval.

2) Our network learns the binary hash codes by directly embedding the semantic labels into the similarity-preserving hash codes, and can be flexibly trained with different types of pairwise similarity-preserving loss functions.

3) Extensive experiments on three widely used multimodal retrieval datasets are conducted to evaluate the proposed method on both unimodal and cross-modal retrieval tasks. The experimental results demonstrate the superiority and effectiveness of DSMHN against state-of-the-art multimodal hashing methods on both tasks.

The rest of this paper is organized as follows.
We briefly review related work in Section II. The proposed method and its optimization algorithm are presented in detail in Section III. In Section IV, we conduct experiments on several multimedia retrieval datasets and discuss the experimental results in detail. Section V concludes the paper.

II. RELATED WORK

As mentioned above, existing multimodal hashing methods can be roughly divided into two categories: unimodal hashing methods and cross-modal hashing methods. We briefly review these two kinds of methods in the following.

Unimodal hashing methods learn hash functions by preserving the intra-modality similarity, which is defined using either the feature representation in the original feature space or the semantic label information.

Representative works of the former type, such as Spectral Hashing (SH) [2], Iterative Quantization (ITQ) [16], Binary Reconstructive Embedding (BRE) [33], Anchor Graph Hashing (AGH) [34] and Neighborhood Discriminant Hashing (NDH) [1], share the common assumption that the underlying data structure in the original feature space can be exploited to learn the hash functions. Another successful line of work leverages supervised label information to improve the quality of the binary hash codes. Supervised Hashing with Kernels (KSH) [3] employs kernel-based hash functions to learn similarity-preserving binary hash codes. Supervised Discrete Hashing (SDH) [35] aims to directly learn label-preserving binary hash codes under the discrete constraint. By generating list ranking orders from word embeddings of semantic labels, Discrete Semantic Ranking Hashing (DSeRH) [36] encodes the semantic rank orders to learn binary hash codes with a triplet ranking loss. Furthermore, Asymmetric Multi-Valued Hashing (AMVH) [37] and Asymmetric Binary Coding (ASH) [38] utilize asymmetric similarity metrics to learn two different sets of hash functions for image retrieval.

Cross-modal hashing methods learn modality-specific hash functions by projecting heterogeneous feature representations into a common space. Cross-View Hashing (CVH) [39], Co-Regularized Hashing (CRH) [40] and Inter-Media Hashing (IMH) [41] utilize unlabeled training data to generate binary hash codes. As supervised information is not exploited, these methods cannot produce desirable hash codes. More recent work utilizes labeled as well as unlabeled training data to learn more effective hash functions. Cross-Modality Similarity-Sensitive Hashing (CMSSH) [42] adopts a boosting algorithm to efficiently solve the cross-modal similarity learning problem by projecting the input data from two different spaces into the Hamming space. Collective Matrix Factorization Hashing (CMFH) [13] utilizes collective matrix factorization to learn cross-view hash functions by exploiting the latent structure of different modalities. Similarly, Latent Semantic Sparse Hashing (LSSH) [43] employs matrix factorization and sparse coding to learn the latent concepts and salient structures from the text and image modalities, respectively. Semantic Topic Multimodal Hashing (STMH) [44] learns a common subspace by capturing multiple semantic topics from multimedia data, and then generates binary hash codes by checking the presence of the semantic topics. Semantic Correlation Maximization (SCM) [12] extends canonical correlation analysis in a supervised manner. Semantics-Preserving Hashing (SePH) [45] first learns joint binary hash codes by minimizing the Kullback-Leibler divergence between the hash codes and the semantic affinities, and then learns the cross-view hash functions with kernel logistic regression. Linear Subspace Ranking Hashing (LSRH) [46] learns ranking-based hash functions by exploiting the ranking structure of the feature space.

Recently, Convolutional Neural Networks (CNNs) [47], [48], [49] have shown impressive learning potential for many vision tasks such as image classification [50], [51], [52], [53], image segmentation [54], [55], [56], image captioning [57], [58], [59], multimedia retrieval [60], [61], [62], [63] and video classification [64], [65], [66]. The success of CNNs on these vision tasks indicates that the learned hierarchical representations can well preserve the underlying semantic structure of the input data from different modalities. CNNH [67] is a two-step hashing method which learns hash codes and deep hash functions separately for image retrieval.
In [29], [68], [69], a triplet ranking loss is designed to learn hash functions with a CNN model by capturing triplet-based relative similarity relationships among images. Multimodal Similarity-Preserving Hashing [70] employs a siamese network to learn cross-view hash functions by jointly preserving the inter- and intra-modality similarities of different modalities. Correlation Hashing Network (CHN) [71] and Correlation Autoencoder Hashing (CAH) [72] exploit the multimodal correlation structure of multimedia data under deep network architectures. Deep Cross-Modal Hashing (DCMH) [73] directly learns unified discrete hash codes with deep neural networks. Different from the above deep hashing methods, we propose a generalized deep hashing method for both unimodal and cross-modal retrieval tasks by integrating feature representation learning, inter-modality similarity learning, intra-modality semantic label learning and hash learning with a bit-balance constraint into one unified framework.

III. DEEP SEMANTIC MULTIMODAL HASHING NETWORK

A. Mathematical Notations and Problem Definition

Throughout this paper, uppercase bold characters denote matrices and lowercase bold characters denote vectors. Let $\mathcal{T}_x = \{x_i\}_{i=1}^{N} \in \mathbb{R}^{N \times d_x}$ denote a set of $d_x$-dimensional data points from the image modality. Similarly, we use $\mathcal{T}_y = \{y_i\}_{i=1}^{N} \in \mathbb{R}^{N \times d_y}$ to denote a set of $d_y$-dimensional data points from the text modality. We assume each instance (i.e., $x_i$ or $y_i$) belongs to one or more categories, and its label is denoted as $g \in \{0, 1\}^C$, where $C$ is the number of categories and the non-zero elements in $g$ indicate the categories with which the instance is associated. We aim to learn $L$-bit binary codes $B_* \in \{-1, 1\}^{L \times N}$ for each modality, as well as two sets of modality-specific hash functions $H_* = [h_1^*, \ldots, h_L^*]$ with deep networks, where $* \in \{x, y\}$ is a modality placeholder and $h_l^*$ generates the $l$-th binary code $b_l \in \{-1, 1\}$. Once the modality-specific hash functions are obtained, newly arriving samples from different modalities can easily be projected into the common Hamming space.

The proposed hashing framework aims to jointly learn modality-specific hash functions and hierarchical representations in an end-to-end manner by simultaneously exploring the inter-modality semantic correlation and the intra-modality semantic information. In addition, each bit of the binary hash codes is expected to be zero-mean over the training data. In the following, we describe the proposed hashing framework in detail.

B. Deep Modality-specific Hash Functions

In recent years, deep learning techniques have shown superiority in many computer vision tasks such as classification, segmentation and image retrieval. Convolutional Neural Networks in particular have achieved great success in various visual applications due to their effectiveness in capturing high-level semantic structure from raw data. In contrast to hand-crafted features, CNNs can automatically learn hierarchical feature representations that are invariant to irrelevant variations such as translation and rotation.

In this paper, we propose to learn the modality-specific hash functions and feature representations synchronously for multimodal data by employing deep networks. Assume the deep network has $M_x$ and $M_y$ layers for the image modality and the text modality, respectively. For the $m$-th layer of the deep network, the outputs are computed as

$$Z^m = \psi(W^m Z^{m-1} + c^m), \quad m = 1, \ldots, M, \qquad (1)$$

where $\psi(\cdot)$ is a nonlinear activation function, $W^m \in \mathbb{R}^{d_m \times d_{m-1}}$ is the weight matrix of the $m$-th layer, $c^m \in \mathbb{R}^{d_m}$ is the bias vector, and $Z^{m-1} \in \mathbb{R}^{d_{m-1} \times N}$ is the output of the $(m-1)$-th layer. To learn the binary hash codes, the hash layer (i.e., the $(M-1)$-th layer) of the deep network is used to construct the modality-specific hash functions. Specifically, the modality-specific binary codes are computed by

$$B = \mathrm{sign}(Z^{M-1}), \qquad (2)$$

where $\mathrm{sign}(a) = 1$ if $a \geq 0$ and $-1$ otherwise, and $Z^{M-1}$ is the output of the $(M-1)$-th layer.

C. Inter-modality Similarity Preserving Binary Codes

To capture the correlation between different modalities, the inter-modality similarity is preserved during hash function learning. First, we construct the inter-modality similarity $s_{ij}$ as

$$s_{ij} = \begin{cases} 1, & \text{if } g_i^T g_j > 0 \\ -1, & \text{otherwise}, \end{cases} \qquad (3)$$

where $g_i$ and $g_j$ are the label vectors of the $i$-th and $j$-th instances. For each cross-modal pair $(x_i, y_j, s_{ij})$, if $s_{ij} = 1$, the corresponding binary codes $b_i^x$ and $b_j^y$ should be similar to each other; otherwise they should be dissimilar. In other words, if $s_{ij} = 1$, the similarity between $b_i^x$ and $b_j^y$ should be close to $1$, and otherwise close to $-1$. Following [3], the widely used code inner product is adopted to measure the similarity of two data points, so the cross-modal similarity between $b_i^x$ and $b_j^y$ is computed by

$$c_{ij} = (b_i^x)^T b_j^y / L, \qquad (4)$$

where $L$ is the length of the hash codes. In general, the inter-modality similarity-preserving learning problem can be formulated as

$$\min_{\{W_x^m, c_x^m\}, \{W_y^m, c_y^m\}} \sum_{s_{ij} \in S} l(c_{ij}, s_{ij}), \qquad (5)$$

where $l(\cdot, \cdot)$ is a loss function defined to enforce $c_{ij}$ and $s_{ij}$ to be close, and $\{W_*^m, c_*^m\}$ are the parameters of the $m$-th layer of the corresponding network. Any proper loss function can be used in Eq. (5); several widely used choices are discussed in the following.

L1 Loss. The L1 loss minimizes the absolute difference between the estimated value $c_{ij}$ and the target value $s_{ij}$:

$$l^{L1}(c_{ij}, s_{ij}) = |c_{ij} - s_{ij}|. \qquad (6)$$

Euclidean Loss. The Euclidean loss is the squared L2-norm of the error between the estimated and target values:

$$l^{L2}(c_{ij}, s_{ij}) = \frac{1}{2}(c_{ij} - s_{ij})^2. \qquad (7)$$

Hinge Loss. The hinge loss is widely used in SVMs for maximum-margin classification. Specifically, we define the hinge loss as

$$l^{hg}(c_{ij}, s_{ij}) = \begin{cases} \max(0, \delta_{hg} - \varphi(c_{ij})), & \text{if } s_{ij} = 1 \\ \varphi(c_{ij}), & \text{if } s_{ij} = -1, \end{cases} \qquad (8)$$

where $\varphi(a) = (a+1)/2$ transforms values in $[-1, 1]$ to $[0, 1]$, and $\delta_{hg}$ is a margin parameter, fixed to $1$ in this paper.

Contrastive Loss. The contrastive loss is defined as

$$l^{ct}(c_{ij}, s_{ij}) = \begin{cases} d_{ij}^2, & \text{if } s_{ij} = 1 \\ \max(0, \delta_{ct} - d_{ij})^2, & \text{if } s_{ij} = -1, \end{cases} \qquad (9)$$

where $d_{ij} = \|b_i^x - b_j^y\|$ with $d_{ij}^2 = 2L(1 - c_{ij})$ denotes the distance between $b_i^x$ and $b_j^y$, and $\delta_{ct}$ is a margin parameter.
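To make Eqs. (3)-(9) concrete, the following NumPy sketch builds the similarity label of Eq. (3) and evaluates the four candidate losses on one pair of (relaxed) codes. The margin value for the contrastive loss and all identifiers are illustrative assumptions, not code from the paper:

```python
import numpy as np

L = 32  # code length

def similarity_label(g_i, g_j):
    """Eq. (3): +1 if the two label vectors share any category, else -1."""
    return 1 if np.dot(g_i, g_j) > 0 else -1

def code_similarity(b_i, b_j):
    """Eq. (4): inner-product similarity, in [-1, 1] for {-1,+1} codes."""
    return float(np.dot(b_i, b_j)) / L

def l1_loss(c, s):                       # Eq. (6)
    return abs(c - s)

def l2_loss(c, s):                       # Eq. (7)
    return 0.5 * (c - s) ** 2

def hinge_loss(c, s, delta_hg=1.0):      # Eq. (8); phi maps [-1,1] to [0,1]
    phi = (c + 1.0) / 2.0
    return max(0.0, delta_hg - phi) if s == 1 else phi

def contrastive_loss(c, s, delta_ct=8.0):  # Eq. (9); margin is illustrative
    d = np.sqrt(2.0 * L * (1.0 - c))     # distance implied by Eq. (4)
    return d ** 2 if s == 1 else max(0.0, delta_ct - d) ** 2

rng = np.random.default_rng(0)
b_i, b_j = np.sign(rng.standard_normal(L)), np.sign(rng.standard_normal(L))
g_i, g_j = np.array([1, 0, 1]), np.array([0, 0, 1])   # toy multi-hot labels
s = similarity_label(g_i, g_j)           # -> 1 (the third category is shared)
c = code_similarity(b_i, b_j)
print([f(c, s) for f in (l1_loss, l2_loss, hinge_loss, contrastive_loss)])
```

All four losses depend on the pair only through the scalar $c_{ij}$ (and, for the contrastive loss, the distance derived from it), which is why they can be swapped freely inside the same framework.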
D. Label Preserving Binary Codes

Although the inter-modality similarity is leveraged to explore the correlation between different modalities, the intra-modality label information is not yet fully exploited. Recent work on unimodal retrieval has indicated that high-quality hash functions can be learned when the semantic information is exploited during the hash function learning procedure. However, most existing works encode the semantic information with a two-stream deep network, where one stream learns the hash functions while the other is used for the classification task. To make full use of the semantic label information, we instead use the learned binary hash codes directly for label prediction. That is, the binary hash codes are expected to be optimal for the jointly learned classifier. Hence, in addition to exploring the correlations between modalities, we take advantage of the semantic labels to learn the binary hash codes for the image and text modalities, respectively. In particular, the last layer (named the fcc layer) of each modality-specific network is trained for the classification task. In this work, we adopt the widely used cross-entropy loss for multi-label classification, which can be written as

$$L_c(B; W^M, c^M) = \frac{1}{N} \sum_{i=1}^{N} l_c(b_i; W^M, c^M) = -\frac{1}{N} \sum_{i=1}^{N} \sum_{c=1}^{C} \big[ g_{ic} \log \hat{g}_{ic} + (1 - g_{ic}) \log(1 - \hat{g}_{ic}) \big], \qquad (10)$$

where $W^M$ and $c^M$ are the weight matrix and bias vector of the fcc layer, $g_{ic}$ is the $c$-th element of $g_i$, $\hat{g}_{ic} = \sigma(w_c^M b_i + c_c^M)$ estimates the probability that the $i$-th instance belongs to the $c$-th category, $w_c^M$ is the $c$-th row of $W^M$, $c_c^M$ is the $c$-th element of $c^M$, and $\sigma(a) = (1 + \exp(-a))^{-1}$ is the sigmoid function.
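A minimal NumPy sketch of Eq. (10), assuming the codes are stored column-wise; the small epsilon for numerical safety and all names are our own additions, not part of Eq. (10):

```python
import numpy as np

def label_preserving_loss(B, G, W, c):
    """
    Multi-label cross-entropy of Eq. (10): the L-bit codes B (L x N) are fed
    directly into a linear classifier (W: C x L, c: C) and compared against
    the {0,1} label matrix G (C x N).
    """
    logits = W @ B + c[:, None]               # C x N
    G_hat = 1.0 / (1.0 + np.exp(-logits))     # sigmoid estimates per category
    eps = 1e-12                               # numerical safety only
    ce = -(G * np.log(G_hat + eps) + (1 - G) * np.log(1 - G_hat + eps))
    return ce.sum(axis=0).mean()              # average over the N instances

# Toy check with random codes/labels (L=16 bits, C=4 categories, N=8 samples).
rng = np.random.default_rng(0)
B = np.sign(rng.standard_normal((16, 8)))
G = (rng.random((4, 8)) > 0.5).astype(float)
W, c = rng.standard_normal((4, 16)) * 0.1, np.zeros(4)
print(label_preserving_loss(B, G, W, c))
```

Feeding the codes themselves into the classifier, rather than an intermediate feature layer, is what forces the label information into the bits.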

E. Overall Objective and Optimization

As mentioned above, we propose to learn the hash functions with deep neural networks by simultaneously exploring the intra-modality semantic structure and the inter-modality correlations between modalities. Combining Eq. (5) and Eq. (10), the proposed deep hashing framework can be formulated as

$$\min_{\Phi_x, \Phi_y} O = \sum_{s_{ij} \in S} l_{ij} + \alpha (L_c^x + L_c^y) \quad \text{s.t. } B_x, B_y \in \{-1, 1\}^{L \times N}, \qquad (11)$$

where $l_{ij}$ stands for $l(c_{ij}, s_{ij})$, $L_c^x$ and $L_c^y$ are short for $L_c(B_x; W_x^M, c_x^M)$ and $L_c(B_y; W_y^M, c_y^M)$, respectively, $\Phi_x = \{W_x^m, c_x^m\}_{m=1}^{M_x}$ is the parameter set of the image network, $\Phi_y = \{W_y^m, c_y^m\}_{m=1}^{M_y}$ is the parameter set of the text network, and $\alpha > 0$ is a hyper-parameter.

To make the hash codes balanced, most hashing methods enforce each bit of the binary codes to be zero-mean over the training data [18]. Similarly, we impose the bit-balance constraints under the proposed deep hashing framework, so that problem (11) becomes

$$\min_{\Phi_x, \Phi_y} O = \sum_{s_{ij} \in S} l_{ij} + \alpha (L_c^x + L_c^y) \quad \text{s.t. } B_x, B_y \in \{-1, 1\}^{L \times N}, \; B_x \mathbf{1} = 0, \; B_y \mathbf{1} = 0, \qquad (12)$$

where $\mathbf{1} \in \mathbb{R}^N$ is the all-ones vector. The optimization problem in Eq. (12) is NP-hard because of the discrete constraints on $B_x$ and $B_y$. To make it tractable, we remove the discrete constraint and relax $B_x$ and $B_y$ to continuous values:

$$B = Z^{M-1}, \qquad (13)$$

where $Z^{M-1}$ is the output of the hash layer (fch layer). The optimization problem in (12) then becomes

$$\min_{\Phi_x, \Phi_y} O = \sum_{s_{ij} \in S} l_{ij} + \alpha (L_c^x + L_c^y) \quad \text{s.t. } Z_x^{M-1} \mathbf{1} = 0, \; Z_y^{M-1} \mathbf{1} = 0. \qquad (14)$$

To push the entries of the relaxed codes toward either $1$ or $-1$, we introduce a quantization error term

$$L_q = \frac{1}{2N} \left( \big\| |Z_x^{M-1}| - \mathbf{1}_{L \times N} \big\|_F^2 + \big\| |Z_y^{M-1}| - \mathbf{1}_{L \times N} \big\|_F^2 \right). \qquad (15)$$

Minimizing $L_q$ enforces the entries of $Z_x^{M-1}$ and $Z_y^{M-1}$ to be close to $1$ or $-1$. As a result, the optimization problem in (14) can be reformulated as

$$\min_{\Phi_x, \Phi_y} O = \sum_{s_{ij} \in S} l_{ij} + \alpha (L_c^x + L_c^y) + \beta L_q + \gamma L_b, \qquad (16)$$

where $L_b = \frac{1}{2N} \big( \| Z_x^{M-1} \mathbf{1}_N \|_2^2 + \| Z_y^{M-1} \mathbf{1}_N \|_2^2 \big)$ is the bit-balance penalty term, and $\beta, \gamma > 0$ are two hyper-parameters. Note that with a sufficiently large $\gamma$, the binary hash codes are balanced as much as possible.

To solve the above problem, we adopt the widely used mini-batch Stochastic Gradient Descent (SGD) to train the whole deep networks, employing the back-propagation (BP) strategy to update the parameters of each layer. In general, the joint optimization problem in (16) is non-convex with respect to $\Phi_x$ and $\Phi_y$. In this work, we alternately learn $\Phi_x$ and $\Phi_y$, optimizing one parameter set while keeping the other fixed. The optimization algorithm is described in detail below.

Update $\Phi_x$. When $\Phi_y$ is fixed, for the $m$-th layer of the image network, the gradients of the overall objective $O$ with respect to the parameters $W_x^m$ and $c_x^m$ are derived as

$$\frac{\partial O}{\partial W_x^m} = \xi_x^m (z_{x,i}^{m-1})^T, \quad \frac{\partial O}{\partial c_x^m} = \xi_x^m, \qquad (17)$$

where $z_{x,i}^{m-1}$ is the $i$-th column of $Z_x^{m-1}$. For the hidden layers (i.e., $m = 1, \ldots, M-2$), $\xi_x^m$ is derived as

$$\xi_x^m = \big( (W_x^{m+1})^T \xi_x^{m+1} \big) \odot \psi'(W_x^m z_{x,i}^{m-1} + c_x^m), \qquad (18)$$

where $\odot$ denotes element-wise multiplication and $\psi'(\cdot)$ is the gradient of $\psi(\cdot)$. For the classification layer (i.e., $m = M$), $\xi_x^M$ is computed by

$$\xi_x^M = \frac{\alpha}{N} (\hat{g}_i - g_i) \odot \psi'(W_x^M z_{x,i}^{M-1} + c_x^M), \qquad (19)$$

where $\hat{g}_i = \sigma(W_x^M z_{x,i}^{M-1} + c_x^M)$. For the hash layer (i.e., $m = M-1$), $\xi_x^{M-1}$ is calculated by

$$\xi_x^{M-1} = \left( \frac{\partial l_{ij}}{\partial W_x^{M-1}} + \beta \frac{\partial L_q}{\partial W_x^{M-1}} + \gamma \frac{\partial L_b}{\partial W_x^{M-1}} \right) \odot \psi'(W_x^{M-1} z_{x,i}^{M-2} + c_x^{M-1}), \qquad (20)$$

where

$$\frac{\partial L_q}{\partial W_x^{M-1}} = \frac{1}{N} \big( |z_{x,i}^{M-1}| - \mathbf{1} \big) \odot \mathrm{sign}(z_{x,i}^{M-1}); \qquad (21)$$

$$\frac{\partial L_b}{\partial W_x^{M-1}} = \frac{1}{N} Z_x^{M-1} \mathbf{1}_N. \qquad (22)$$

Since different loss functions can be used in the proposed deep hashing framework, the subgradients of the different losses are

$$\frac{\partial l^{L1}}{\partial W_x^{M-1}} = \mathrm{sign}(c_{ij} - s_{ij}) \, z_{y,j}^{M-1} / L,$$
$$\frac{\partial l^{L2}}{\partial W_x^{M-1}} = (c_{ij} - s_{ij}) \, z_{y,j}^{M-1} / L,$$
$$\frac{\partial l^{hg}}{\partial W_x^{M-1}} = \frac{1}{2L} \big( -\tilde{s}_{ij} \, \mathbb{I}(\mu_{hg} > 0) + (1 - \tilde{s}_{ij}) \big) z_{y,j}^{M-1},$$
$$\frac{\partial l^{ct}}{\partial W_x^{M-1}} = 4 \big( \tilde{s}_{ij} d_{ij} - (1 - \tilde{s}_{ij}) \, \mathbb{I}(\mu_{ct} > 0)(\delta_{ct} - d_{ij}) \big) (z_{x,i}^{M-1} - z_{y,j}^{M-1}), \qquad (23)$$

where $\tilde{s}_{ij} = \frac{1}{2}(s_{ij} + 1)$, $\mu_{hg} = \delta_{hg} - \varphi(c_{ij})$, $\mu_{ct} = \delta_{ct} - d_{ij}$, and $\mathbb{I}(\cdot)$ is the indicator function, which equals $1$ if its condition is true and $0$ otherwise.
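To summarize Section III-E in runnable form, here is a hedged PyTorch-style sketch of the mini-batch objective of Eq. (16). In a modern framework, automatic differentiation replaces the hand-derived gradients of Eqs. (17)-(23); the L2 pairwise loss is used for concreteness, and every identifier is illustrative:

```python
import torch

def dsmhn_objective(zx, zy, s, gx_hat, gx, gy_hat, gy,
                    alpha=1.0, beta=1.0, gamma=1.0):
    """
    Overall loss of Eq. (16) for one mini-batch (terms scaled per batch).
      zx, zy : (N, L) relaxed hash-layer outputs (tanh) of the two networks
      s      : (N,)  float inter-modality similarity labels in {-1, +1}
      g*_hat : (N, C) sigmoid label predictions from the fcc layers
      g*     : (N, C) ground-truth multi-hot labels as floats
    """
    N, L = zx.shape
    c = (zx * zy).sum(dim=1) / L                        # Eq. (4), relaxed
    l_sim = 0.5 * ((c - s) ** 2).mean()                 # Eq. (7) pairwise loss
    bce = torch.nn.functional.binary_cross_entropy
    l_cls = bce(gx_hat, gx) + bce(gy_hat, gy)           # Eq. (10), both streams
    # Quantization term of Eq. (15), up to constant scaling.
    l_q = ((zx.abs() - 1) ** 2).mean() + ((zy.abs() - 1) ** 2).mean()
    # Bit-balance term L_b: each bit should be zero-mean over the batch.
    l_b = (zx.mean(dim=0) ** 2).sum() + (zy.mean(dim=0) ** 2).sum()
    return l_sim + alpha * l_cls + beta * l_q + gamma * l_b

# One SGD step (the two parameter sets can also be updated alternately):
# loss = dsmhn_objective(...); loss.backward(); optimizer.step()
```

The four terms act on the same hash-layer activations, so a single backward pass propagates all of them, mirroring the combined gradient of Eq. (20).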

Update $\Phi_y$. With $\Phi_x$ fixed, the gradients of the overall objective $O$ with respect to the parameters $W_y^m$ and $c_y^m$ of the $m$-th layer of the text network are calculated by

$$\frac{\partial O}{\partial W_y^m} = \xi_y^m (z_{y,i}^{m-1})^T, \quad \frac{\partial O}{\partial c_y^m} = \xi_y^m, \qquad (24)$$

where $z_{y,i}^{m-1}$ is the $i$-th column of $Z_y^{m-1}$. For $m = 1, \ldots, M-2$, we obtain

$$\xi_y^m = \big( (W_y^{m+1})^T \xi_y^{m+1} \big) \odot \psi'(W_y^m z_{y,i}^{m-1} + c_y^m). \qquad (25)$$

Otherwise, for the classification layer, $\xi_y^M$ is computed by

$$\xi_y^M = \frac{\alpha}{N} (\hat{g}_j - g_j) \odot \psi'(W_y^M z_{y,j}^{M-1} + c_y^M), \qquad (26)$$

where $\hat{g}_j = \sigma(W_y^M z_{y,j}^{M-1} + c_y^M)$. Similarly, for the hash layer, $\xi_y^{M-1}$ is computed by

$$\xi_y^{M-1} = \left( \frac{\partial l_{ij}}{\partial W_y^{M-1}} + \beta \frac{\partial L_q}{\partial W_y^{M-1}} + \gamma \frac{\partial L_b}{\partial W_y^{M-1}} \right) \odot \psi'(W_y^{M-1} z_{y,j}^{M-2} + c_y^{M-1}), \qquad (27)$$

where

$$\frac{\partial L_q}{\partial W_y^{M-1}} = \frac{1}{N} \big( |z_{y,j}^{M-1}| - \mathbf{1} \big) \odot \mathrm{sign}(z_{y,j}^{M-1}), \qquad (28)$$

$$\frac{\partial L_b}{\partial W_y^{M-1}} = \frac{1}{N} Z_y^{M-1} \mathbf{1}_N, \qquad (29)$$

and the subgradients $\partial l / \partial W_y^{M-1}$ of the different loss functions are defined as

$$\frac{\partial l^{L1}}{\partial W_y^{M-1}} = \mathrm{sign}(c_{ij} - s_{ij}) \, z_{x,i}^{M-1} / L,$$
$$\frac{\partial l^{L2}}{\partial W_y^{M-1}} = (c_{ij} - s_{ij}) \, z_{x,i}^{M-1} / L,$$
$$\frac{\partial l^{hg}}{\partial W_y^{M-1}} = \frac{1}{2L} \big( -\tilde{s}_{ij} \, \mathbb{I}(\mu_{hg} > 0) + (1 - \tilde{s}_{ij}) \big) z_{x,i}^{M-1},$$
$$\frac{\partial l^{ct}}{\partial W_y^{M-1}} = 4 \big( \tilde{s}_{ij} d_{ij} - (1 - \tilde{s}_{ij}) \, \mathbb{I}(\mu_{ct} > 0)(\delta_{ct} - d_{ij}) \big) (z_{y,j}^{M-1} - z_{x,i}^{M-1}). \qquad (30)$$

Algorithm 1 summarizes the learning procedure of the proposed method.

Algorithm 1: The learning algorithm for DSMHN
  Input: Training sets $\mathcal{T}_x$ and $\mathcal{T}_y$.
  Output: Network parameter sets $\Phi_x$ and $\Phi_y$.
  Initialization: Initialize the network parameter sets $\Phi_x$ and $\Phi_y$, the iteration number $N_{iter}$ and the mini-batch size $N_{batch}$.
  repeat
    for $iter = 1, \ldots, N_{iter}$ do
      Randomly select $N_{batch}$ cross-modal pairs from the training sets $\mathcal{T}_x$ and $\mathcal{T}_y$.
      for $m = 1, \ldots, M$ do
        For each selected cross-modal pair, compute $z^m$ by forward propagation according to Eq. (1).
      end for
      for $m = M, \ldots, 1$ do
        Calculate the derivatives $\partial O / \partial W^m$ and $\partial O / \partial c^m$ according to Eqs. (17)-(30).
        Update the parameters $W^m$ and $c^m$ using the BP algorithm.
      end for
    end for
  until a fixed number of iterations is reached.

IV. EXPERIMENTS

In this section, we conduct extensive experiments on three widely used multimodal datasets to validate the effectiveness of the proposed deep hashing method.

A. Datasets

Extensive experiments are conducted on three widely used multimedia benchmarks: Wiki [74], MIRFlickr25k [75] and NUS-WIDE [76]. The details of these multimodal datasets are introduced in the following.

Wiki. The Wiki dataset includes 2,866 image-text pairs crawled from Wikipedia. Each image-text pair in this dataset is annotated with one of 10 provided labels. We randomly select 20% of the image-text pairs as the query set, and the remaining pairs are used to form the training set and the retrieval database.

MIRFlickr25k. The MIRFlickr25k dataset consists of 25,000 images crawled from Flickr, where each image is associated with corresponding textual tags. Each image-text pair in this dataset is annotated with one or more of 24 provided labels. We keep the tags that appear at least 20 times, and remove the image-text pairs without any annotations or tags. This yields 18,591 image-text pairs and 1,075 textual tags for the experiment. 1,908 image-text pairs are randomly selected to form the query set, and the remaining pairs form the database, from which 5,000 image-text pairs are selected as the training set.

NUS-WIDE. The NUS-WIDE dataset contains 269,648 images collected from Flickr, where each image is associated with textual tags. There are 81 ground-truth semantic concepts in this dataset. We keep the 21 most frequent concepts, giving 195,834 image-text pairs. We randomly select 100 pairs per class as queries, and the remaining pairs are used as the database, from which 500 pairs per class are selected to form the training set.
B. Compared Baselines and Evaluation Metrics

To evaluate the performance of the proposed deep hashing method, we compare it with several state-of-the-art methods, including six non-deep hashing methods (CMFH [13], LSSH [43], STMH [44], SCM [12], SePH [45] and LSRH [46]) and three deep hashing methods (multimodal similarity-preserving hashing [70], CHN [71] and DCMH [73]). These hashing methods have been introduced in Section II in detail. The source codes of most of these methods are available online; the remaining ones are carefully implemented by ourselves. The parameters of these methods are chosen according to the suggestions in the corresponding papers.

Table I: mAP comparison with the non-deep hashing methods on Wiki, MIRFlickr25k and NUS-WIDE (8, 16, 32 and 48 bits) for the image-query-text, text-query-image and image-query-image tasks.

Table II: mAP comparison with the deep hashing methods on Wiki, MIRFlickr25k and NUS-WIDE (8, 16, 32 and 48 bits) for the image-query-text, text-query-image and image-query-image tasks.

For the non-deep hashing methods, the images are represented by the 4096-dimensional CNN features of the last fully-connected layer of the AlexNet [47] network. For the deep hashing methods, the AlexNet network is used for a fair comparison. Furthermore, except for Wiki, in which the texts are represented by 1000-D tf-idf features, the texts in the other datasets are represented by binary tag vectors: 1075-D binary vectors for MIRFlickr25k and 1000-D binary vectors for NUS-WIDE.

In this paper, we verify the proposed method on three multimodal retrieval tasks: image-query-text, text-query-image and image-query-image. We adopt the following evaluation metrics to measure the performance of the methods: mean Average Precision (mAP), Top-K Precision (P@K) and Precision-Recall (PR).
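The mAP metric can be made concrete with a short NumPy sketch; the function name and the shared-label relevance criterion follow the common multi-label retrieval protocol and are our assumptions rather than code from the paper:

```python
import numpy as np

def mean_average_precision(q_codes, db_codes, q_labels, db_labels):
    """
    mAP over Hamming-ranked retrieval lists. Codes are {-1,+1} arrays
    (n x L); a database item is relevant if it shares at least one label
    with the query.
    """
    aps = []
    for q, ql in zip(q_codes, q_labels):
        dist = 0.5 * (q_codes.shape[1] - db_codes @ q)   # Hamming distance
        order = np.argsort(dist, kind="stable")          # rank the database
        rel = (db_labels[order] @ ql) > 0                # shared label -> relevant
        if rel.sum() == 0:
            continue
        prec = np.cumsum(rel) / (np.arange(len(rel)) + 1)
        aps.append((prec * rel).sum() / rel.sum())       # average precision
    return float(np.mean(aps))
```

The identity `d = (L - b1·b2) / 2` for ±1 codes lets the whole ranking be computed with one matrix product, which is why mAP over large databases remains cheap.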

Figure 2: P@100 curves with respect to different code lengths for the different methods on the three datasets: (a)-(c) Wiki, (d)-(f) MIRFlickr25k and (g)-(i) NUS-WIDE, for the image-query-text, text-query-image and image-query-image tasks.

C. Experimental Setting

Our network contains two parts: an image subnetwork and a text subnetwork. For the image subnetwork, we employ the widely used AlexNet [47] model for feature representation learning. AlexNet has five convolutional layers (conv1 to conv5) and two fully-connected layers (fc6 and fc7). For hash function learning, we add two fully-connected layers on top of the fc7 layer, as illustrated in Figure 1. The first new fully-connected layer (named the fch layer) transforms the deep feature representation into compact hash codes; the number of its outputs is set to the code length of the hash codes. To encourage the activations of the fch layer to be binary, the tanh activation function is utilized to bound the activations between -1 and 1. To make the hash codes discriminative, the learned hash codes are expected to predict the class labels well. Specifically, the outputs of the fch layer are fed into the second new fully-connected layer (named the fcc layer) for the classification task. The fcc layer has the same number of outputs as the number of class labels. The text subnetwork comprises four fully-connected layers, where the first two layers are for feature representation learning and the last two layers are for hash function learning. The number of outputs of the first two layers is set to 4096, and the configuration of the last two layers is the same as in the image subnetwork.

The proposed method is implemented using the open-source Caffe [77] framework on an NVIDIA K20 GPU server. The whole network is initialized with Xavier initialization, except for the layers from conv1 to fc7 of the image subnetwork, which are copied from the pre-trained AlexNet model. The network is trained using mini-batch Stochastic Gradient Descent with the back-propagation algorithm. We set the learning rate and the batch size to 1e-5 and 128, respectively. Besides, the learning rates of the fch and fcc layers are set 1000 and 100 times larger than those of the other layers. In DSMHN, we set α = 1 and β = γ = 1 by using a validation strategy on all the datasets.
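For concreteness, here is a minimal PyTorch-style sketch of the two subnetworks and the layer-wise learning rates described above. The paper's implementation is in Caffe, so torchvision's pre-trained AlexNet, the momentum value, and all identifiers are assumptions:

```python
import torch
import torch.nn as nn
from torchvision import models

class ImageSubnet(nn.Module):
    """AlexNet conv1-fc7 backbone + hash layer fch (tanh) + classifier fcc."""
    def __init__(self, code_len, num_classes):
        super().__init__()
        alexnet = models.alexnet(weights=models.AlexNet_Weights.DEFAULT)
        self.features, self.avgpool = alexnet.features, alexnet.avgpool
        self.fc67 = nn.Sequential(*list(alexnet.classifier.children())[:-1])
        self.fch = nn.Linear(4096, code_len)        # outputs = code length
        self.fcc = nn.Linear(code_len, num_classes)

    def forward(self, x):
        h = self.fc67(torch.flatten(self.avgpool(self.features(x)), 1))
        z = torch.tanh(self.fch(h))                 # relaxed codes in (-1, 1)
        return z, torch.sigmoid(self.fcc(z))        # codes, label predictions

class TextSubnet(nn.Module):
    """Four fully-connected layers: two for features, then fch and fcc."""
    def __init__(self, in_dim, code_len, num_classes):
        super().__init__()
        self.feat = nn.Sequential(nn.Linear(in_dim, 4096), nn.ReLU(),
                                  nn.Linear(4096, 4096), nn.ReLU())
        self.fch = nn.Linear(4096, code_len)
        self.fcc = nn.Linear(code_len, num_classes)

    def forward(self, y):
        z = torch.tanh(self.fch(self.feat(y)))
        return z, torch.sigmoid(self.fcc(z))

def make_optimizer(net, base_lr=1e-5):
    """Mirror the Caffe lr_mult setting: fch/fcc train 1000x/100x faster."""
    backbone = [p for n, p in net.named_parameters()
                if not n.startswith(("fch", "fcc"))]
    return torch.optim.SGD(
        [{"params": backbone, "lr": base_lr},
         {"params": net.fch.parameters(), "lr": base_lr * 1000},
         {"params": net.fcc.parameters(), "lr": base_lr * 100}],
        momentum=0.9)  # momentum value is an assumption
```

The much larger learning rates on the new layers are a practical choice: the randomly initialized fch/fcc layers must adapt quickly, while the pre-trained backbone needs only gentle fine-tuning.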

Figure 3: Precision curves of 48-bit hash codes with respect to different numbers of top returned samples on the three datasets: (a)-(c) Wiki, (d)-(f) MIRFlickr25k and (g)-(i) NUS-WIDE, for the image-query-text, text-query-image and image-query-image tasks.

D. Experimental Results

We evaluate the retrieval performance on three multimodal retrieval tasks: image-query-text, text-query-image and image-query-image. In this experiment, we adopt the contrastive loss for the Wiki and NUS-WIDE datasets and the Euclidean loss for the MIRFlickr25k dataset. We first report the mAP results of the non-deep and deep hashing methods on the Wiki, MIRFlickr25k and NUS-WIDE datasets in Table I and Table II, respectively. As the mAP value is calculated over all returned samples, it evaluates the overall performance of a hashing method. From Table I and Table II, we observe that the proposed method consistently outperforms all compared methods across code lengths. For example, as shown in Table I, our method achieves average performance gains of 2.85%/5.64%/4.98%, 6.38%/7.56%/9.39% and 3.86%/7.35%/2.62% for image-query-text/text-query-image/image-query-image on the Wiki, MIRFlickr25k and NUS-WIDE datasets, respectively, compared with the best results (underlined) of the non-deep methods. Similarly, compared with the best results of the deep baselines, DSMHN gains average performance increases of 5.3%/7.65%/6.06%, 7.4%/6.47%/1.34% and 2.62%/7%/1.66% on the Wiki, MIRFlickr25k and NUS-WIDE datasets, respectively. We find that the performance gap between the different multimodal retrieval tasks is very small on the MIRFlickr25k and NUS-WIDE datasets, but not on the Wiki dataset. This is mainly caused by the large semantic gap between the image and text modalities of the Wiki dataset: as the texts describe the semantic concepts much better than the images in this dataset, much better performance is achieved when texts are used to query the image retrieval database. Another interesting observation is that the mAP results of the proposed method on the image-query-image task are much better than those of the compared baselines on all datasets. In DSMHN, the intra-modality semantic class labels are explicitly exploited in hash learning, so the intra-modality correlation is maximally preserved. This makes the learned hash codes discriminative enough to achieve superior performance on the unimodal retrieval task.

In addition to the mAP, we also evaluate the retrieval performance in terms of the top-K precision and precision-recall curves, where K is set to 100.
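A companion sketch of these two metrics, again with illustrative names, reusing the Hamming-ranked boolean relevance list computed exactly as in the mAP sketch of Section IV-B:

```python
import numpy as np

def precision_at_k(rel, k=100):
    """Top-K precision for one query's ranked relevance list."""
    return float(rel[:k].mean())

def precision_recall_curve(rel):
    """Precision and recall after each returned sample, for PR curves."""
    hits = np.cumsum(rel)
    precision = hits / (np.arange(len(rel)) + 1)
    recall = hits / max(rel.sum(), 1)
    return precision, recall

# Toy ranked relevance list: 1 = relevant item at that rank.
rel = np.array([1, 0, 1, 1, 0, 0, 1], dtype=float)
print(precision_at_k(rel, k=5))          # -> 0.6
print(precision_recall_curve(rel))
```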

Figure 4: Precision-recall curves with 48-bit hash codes for the different methods on the three datasets: (a)-(c) Wiki, (d)-(f) MIRFlickr25k and (g)-(i) NUS-WIDE, for the image-query-text, text-query-image and image-query-image tasks.

Figure 2 illustrates the P@100 performance with different code lengths on all datasets. Figure 3 shows the precision curves with respect to different numbers of top returned samples for 48-bit hash codes on all datasets. From Figure 2 and Figure 3, we observe that DSMHN outperforms all the baselines on all multimodal retrieval tasks, which is consistent with the mAP results. We also plot the precision-recall curves with 48 hash bits for all baselines on all datasets in Figure 4. Note that a larger area under the precision-recall curve indicates better overall performance. As shown in Figure 4, DSMHN is competitive with or outperforms all the baselines across the different datasets.

Overall, the proposed DSMHN is competitive with or substantially outperforms the baseline methods under different evaluation metrics on both unimodal and cross-modal retrieval tasks. This good performance indicates the superiority of the proposed method, which we attribute to the following advantages. First, DSMHN explicitly preserves both the inter-modality similarity and the semantic class labels during hash learning; therefore, the intra-modality and inter-modality correlations of the different modalities are maximally preserved to generate high-quality hash codes. Second, DSMHN takes the bit-balance constraint into account to make the hash bits evenly distributed.

E. Discussion of Different Loss Functions

The proposed deep hashing method can be integrated with different types of loss functions. In this section, we evaluate the performance of the proposed DSMHN in terms of mAP and P@100 using four different loss functions: the L1 loss, L2 loss, hinge loss and contrastive loss. The details of these loss functions are given in Section III-C. We report the mAP and P@100 results with 16-bit hash codes for the four loss functions in Figure 5.

Figure 5: mAP and P@100 comparison using different types of loss functions (L1, L2, hinge and contrastive) with 16-bit hash codes on the three datasets: (a)-(b) Wiki, (c)-(d) MIRFlickr25k and (e)-(f) NUS-WIDE.

We observe that the performance of the different loss functions is very close. Such a small performance difference demonstrates the robustness of the proposed method to the choice of loss function. Additionally, the capability of integrating different loss functions into the unified deep hashing framework demonstrates the scalability and flexibility of the proposed method.

V. CONCLUSION

In this paper, we propose a novel Deep Semantic Multimodal Hashing Network (DSMHN) for scalable multimodal retrieval, which explores the inter-modality correlation structure and the intra-modality semantic label information with deep neural networks. Specifically, DSMHN integrates feature representation learning, inter-modality similarity-preserving learning, intra-modality semantic label learning and hash learning with a bit-balance constraint into an end-to-end framework. Moreover, DSMHN can accommodate different types of loss functions with minimal modification to the hash layer of the network, which demonstrates the scalability and flexibility of the proposed framework. Extensive experiments on three multimodal datasets show the superiority of the proposed method on both unimodal and cross-modal retrieval tasks.

REFERENCES

[1] J. Tang, Z. Li, M. Wang, and R. Zhao, Neighborhood discriminant hashing for large-scale image retrieval, IEEE Trans. Image Process., vol. 24, no. 9, 2015.
[2] Y. Weiss, A. Torralba, and R. Fergus, Spectral hashing, in Proc. Int. Conf. Neural Inf. Process. Syst., 2008.
[3] W. Liu, J. Wang, R. Ji, Y.-G. Jiang, and S.-F. Chang, Supervised hashing with kernels, in Proc. Comput. Vis. Pattern Recognit., 2012.
[4] R. Salakhutdinov and G. Hinton, Semantic hashing, Int. J. Approx. Reasoning, vol. 50, no. 7, 2009.
[5] L. Jin, K. Li, H. Hu, G.-J. Qi, and J. Tang, Semantic neighbor graph hashing for multimodal retrieval, IEEE Trans. Image Process., vol. 27, no. 3, 2018.
[6] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna, Rethinking the inception architecture for computer vision, in Proc. Comput. Vis. Pattern Recognit., 2016.
[7] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, Going deeper with convolutions, in Proc. Comput. Vis. Pattern Recognit., 2015.
[8] J. C. Pereira, E. Coviello, G. Doyle, N. Rasiwasia, G. R. Lanckriet, R. Levy, and N. Vasconcelos, On the role of correlation and abstraction in cross-modal multimedia retrieval, IEEE Trans. Pattern Anal. Mach. Intell., vol. 36, no. 3, 2014.
[9] C. Kang, S. Xiang, S. Liao, C. Xu, and C. Pan, Learning consistent feature representation for cross-modal multimedia retrieval, IEEE Trans. Multimedia, vol. 17, no. 3, 2015.
[10] R. Zhang, L. Lin, R. Zhang, W. Zuo, and L. Zhang, Bit-scalable deep hashing with regularized similarity learning for image retrieval and person re-identification, IEEE Trans. Image Process., vol. 24, no. 12, 2015.
[11] F. Zhu, X. Kong, L. Zheng, H. Fu, and Q. Tian, Part-based deep hashing for large-scale person re-identification, IEEE Trans. Image Process., vol. 26, no. 10, 2017.
[12] D. Zhang and W.-J. Li, Large-scale supervised multimodal hashing with semantic correlation maximization, in Proc. AAAI Conf. Artif. Intell., 2014.
[13] G. Ding, Y. Guo, and J. Zhou, Collective matrix factorization hashing for multimodal data, in Proc. Comput. Vis. Pattern Recognit., 2014.
[14] J. Tang and Z. Li, Weakly-supervised multimodal hashing for scalable social image retrieval, IEEE Trans. Circuits Syst. Video Technol., vol. 28, no. 10, 2018.
[15] P. Indyk and R. Motwani, Approximate nearest neighbors: towards removing the curse of dimensionality, in Proc. ACM Symp. Theory Comput., 1998.
[16] Y. Gong and S. Lazebnik, Iterative quantization: A procrustean approach to learning binary codes, in Proc. Comput. Vis. Pattern Recognit., 2011.
[17] H. Liu, R. Wang, S. Shan, and X. Chen, Deep supervised hashing for fast image retrieval, in Proc. Comput. Vis. Pattern Recognit., 2016.
[18] H.-F. Yang, K. Lin, and C.-S. Chen, Supervised learning of semantics-preserving hash via deep convolutional neural networks, IEEE Trans. Pattern Anal. Mach. Intell., vol. 40, no. 2, 2018.
[19] X. Zhu, Z. Huang, H. T. Shen, and X. Zhao, Linear cross-modal hashing for efficient multimedia search, in Proc. ACM Int. Conf. Multimedia, 2013.
[20] G. Irie, H. Arai, and Y. Taniguchi, Alternating co-quantization for cross-modal hashing, in Proc. IEEE Int. Conf. Comput. Vis., 2015.
[21] J. Tang, K. Wang, and L. Shao, Supervised matrix factorization hashing for cross-modal retrieval, IEEE Trans. Image Process., vol. 25, no. 7, 2016.
[22] J. Tang, Z. Li, and X. Zhu, Supervised deep hashing for scalable face image retrieval, Pattern Recognit., vol. 75, 2018.
[23] V. Erin Liong, J. Lu, G. Wang, P. Moulin, and J. Zhou, Deep hashing for compact binary codes learning, in Proc. Comput. Vis. Pattern Recognit., 2015.
[24] W.-J. Li, S. Wang, and W.-C. Kang, Feature learning based deep supervised hashing with pairwise labels, in Proc. Int. Joint Conf. Artif. Intell., 2016.
[25] F. Shen, Y. Xu, L. Liu, Y. Yang, Z. Huang, and H. T. Shen, Unsupervised deep hashing with similarity-adaptive and discrete optimization, IEEE Trans. Pattern Anal. Mach. Intell., 2018.

[26] T.-T. Do, A.-D. Doan, and N.-M. Cheung, Learning to hash with binary deep neural network, in Proc. Eur. Conf. Comput. Vis., 2016.
[27] J. Tang, J. Lin, Z. Li, and J. Yang, Discriminative deep quantization hashing for face image retrieval, IEEE Trans. Neural Netw. Learn. Syst., vol. 29, no. 12, 2018.
[28] Q. Li, Z. Sun, R. He, and T. Tan, Deep supervised discrete hashing, in Proc. Int. Conf. Neural Inf. Process. Syst., 2017.
[29] F. Zhao, Y. Huang, L. Wang, and T. Tan, Deep semantic ranking based hashing for multi-label image retrieval, in Proc. Comput. Vis. Pattern Recognit., 2015.
[30] Y. Cao, M. Long, J. Wang, Q. Yang, and P. S. Yu, Deep visual-semantic hashing for cross-modal retrieval, in Proc. ACM Int. Conf. Knowl. Discovery Data Mining, 2016.
[31] Y. Wei, Y. Zhao, C. Lu, S. Wei, L. Liu, Z. Zhu, and S. Yan, Cross-modal retrieval with CNN visual features: A new baseline, IEEE Trans. Cybernetics, vol. 47, no. 2, 2017.
[32] L. Jin, K. Li, Z. Li, F. Xiao, G.-J. Qi, and J. Tang, Deep semantic-preserving ordinal hashing for cross-modal similarity search, IEEE Trans. Neural Netw. Learn. Syst., to be published.
[33] B. Kulis and T. Darrell, Learning to hash with binary reconstructive embeddings, in Proc. Int. Conf. Neural Inf. Process. Syst., 2009.
[34] W. Liu, J. Wang, S. Kumar, and S.-F. Chang, Hashing with graphs, in Proc. Int. Conf. Mach. Learn., 2011.
[35] F. Shen, C. Shen, W. Liu, and H. Tao Shen, Supervised discrete hashing, in Proc. Comput. Vis. Pattern Recognit., 2015.
[36] L. Liu, L. Shao, F. Shen, and M. Yu, Discretely coding semantic rank orders for supervised image hashing, in Proc. Comput. Vis. Pattern Recognit., 2017.
[37] C. Da, S. Xu, K. Ding, G. Meng, S. Xiang, and C. Pan, AMVH: Asymmetric multi-valued hashing, in Proc. Comput. Vis. Pattern Recognit., 2017.
[38] F. Shen, Y. Yang, L. Liu, W. Liu, D. Tao, and H. T. Shen, Asymmetric binary coding for image search, IEEE Trans. Multimedia, vol. 19, no. 9, 2017.
[39] S. Kumar and R. Udupa, Learning hash functions for cross-view similarity search, in Proc. Int. Joint Conf. Artif. Intell., 2011.
[40] Y. Zhen and D.-Y. Yeung, Co-regularized hashing for multimodal data, in Proc. Int. Conf. Neural Inf. Process. Syst., 2012.
[41] J. Song, Y. Yang, Y. Yang, Z. Huang, and H. T. Shen, Inter-media hashing for large-scale retrieval from heterogeneous data sources, in Proc. ACM Int. Conf. Manage. Data, 2013.
[42] M. M. Bronstein, A. M. Bronstein, F. Michel, and N. Paragios, Data fusion through cross-modality metric learning using similarity-sensitive hashing, in Proc. Comput. Vis. Pattern Recognit., 2010.
[43] J. Zhou, G. Ding, and Y. Guo, Latent semantic sparse hashing for cross-modal similarity search, in Proc. Int. ACM SIGIR Conf. Res. Dev. Inf. Retrieval, 2014.
[44] D. Wang, X. Gao, X. Wang, and L. He, Semantic topic multimodal hashing for cross-media retrieval, in Proc. Int. Joint Conf. Artif. Intell., 2015.
[45] Z. Lin, G. Ding, M. Hu, and J. Wang, Semantics-preserving hashing for cross-view retrieval, in Proc. Comput. Vis. Pattern Recognit., 2015.
[46] K. Li, G.-J. Qi, J. Ye, and K. A. Hua, Linear subspace ranking hashing for cross-modal retrieval, IEEE Trans. Pattern Anal. Mach. Intell., vol. 39, no. 9, 2017.
[47] A. Krizhevsky, I. Sutskever, and G. E. Hinton, Imagenet classification with deep convolutional neural networks, in Proc. Int. Conf. Neural Inf. Process. Syst., 2012.
[48] Y. LeCun, Y. Bengio, and G. Hinton, Deep learning, Nature, vol. 521, no. 7553, 2015.
[49] A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and L. Fei-Fei, Large-scale video classification with convolutional neural networks, in Proc. Comput. Vis. Pattern Recognit., 2014.
[50] T. Xiao, Y. Xu, K. Yang, J. Zhang, Y. Peng, and Z. Zhang, The application of two-level attention models in deep convolutional neural network for fine-grained image classification, in Proc. Comput. Vis. Pattern Recognit., 2015.
[51] J. Wang, Y. Yang, J. Mao, Z. Huang, C. Huang, and W. Xu, CNN-RNN: A unified framework for multi-label image classification, in Proc. Comput. Vis. Pattern Recognit., 2016.
[52] R. Girshick, Fast R-CNN, in Proc. IEEE Int. Conf. Comput. Vis., 2015.
[53] A. Sharif Razavian, H. Azizpour, J. Sullivan, and S. Carlsson, CNN features off-the-shelf: An astounding baseline for recognition, in Proc. Comput. Vis. Pattern Recognit. Workshops, 2014.
[54] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille, DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs, IEEE Trans. Pattern Anal. Mach. Intell., vol. 40, no. 4, 2018.
[55] K. He, G. Gkioxari, P. Dollár, and R. Girshick, Mask R-CNN, in Proc. IEEE Int. Conf. Comput. Vis., 2017.
[56] H. Noh, S. Hong, and B. Han, Learning deconvolution network for semantic segmentation, in Proc. IEEE Int. Conf. Comput. Vis., 2015.
[57] O. Vinyals, A. Toshev, S. Bengio, and D. Erhan, Show and tell: A neural image caption generator, in Proc. Comput. Vis. Pattern Recognit., 2015.
[58] K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhutdinov, R. Zemel, and Y. Bengio, Show, attend and tell: Neural image caption generation with visual attention, in Proc. Int. Conf. Mach. Learn., 2015.
[59] L. Chen, H. Zhang, J. Xiao, L. Nie, J. Shao, W. Liu, and T.-S. Chua, SCA-CNN: Spatial and channel-wise attention in convolutional networks for image captioning, in Proc. Comput. Vis. Pattern Recognit., 2017.
[60] J. Wan, D. Wang, S. C. H. Hoi, P. Wu, J. Zhu, Y. Zhang, and J. Li, Deep learning for content-based image retrieval: A comprehensive study, in Proc. ACM Int. Conf. Multimedia, 2014.
[61] L. Zheng, Y. Yang, and Q. Tian, SIFT meets CNN: A decade survey of instance retrieval, IEEE Trans. Pattern Anal. Mach. Intell., vol. 40, no. 5, 2018.
[62] L. Jin, X. Shu, K. Li, Z. Li, G.-J. Qi, and J. Tang, Deep ordinal hashing with spatial attention, IEEE Trans. Image Process., to be published.
[63] E. Yang, C. Deng, W. Liu, X. Liu, D. Tao, and X. Gao, Pairwise relationship guided deep hashing for cross-modal retrieval, in Proc. AAAI Conf. Artif. Intell., 2017.
[64] J. Yue-Hei Ng, M. Hausknecht, S. Vijayanarasimhan, O. Vinyals, R. Monga, and G. Toderici, Beyond short snippets: Deep networks for video classification, in Proc. Comput. Vis. Pattern Recognit., 2015.
[65] B. Fernando and S. Gould, Learning end-to-end video classification with rank-pooling, in Proc. Int. Conf. Mach. Learn., 2016.
[66] K. Kang, W. Ouyang, H. Li, and X. Wang, Object detection from video tubelets with convolutional neural networks, in Proc. Comput. Vis. Pattern Recognit., 2016.
[67] R. Xia, Y. Pan, H. Lai, C. Liu, and S. Yan, Supervised hashing for image retrieval via image representation learning, in Proc. AAAI Conf. Artif. Intell., 2014.
[68] T. Yao, F. Long, T. Mei, and Y. Rui, Deep semantic-preserving and ranking-based hashing for image retrieval, in Proc. Int. Joint Conf. Artif. Intell., 2016.
[69] H. Lai, Y. Pan, Y. Liu, and S. Yan, Simultaneous feature learning and hash coding with deep neural networks, in Proc. Comput. Vis. Pattern Recognit., 2015.
[70] J. Masci, M. M. Bronstein, A. M. Bronstein, and J. Schmidhuber, Multimodal similarity-preserving hashing, IEEE Trans. Pattern Anal. Mach. Intell., vol. 36, no. 4, 2014.
[71] Y. Cao, M. Long, J. Wang, and P. S. Yu, Correlation hashing network for efficient cross-modal retrieval, in Proc. Brit. Mach. Vis. Conf., 2017.
[72] Y. Cao, M. Long, J. Wang, and H. Zhu, Correlation autoencoder hashing for supervised cross-modal search, in Proc. ACM Int. Conf. Multimedia Retrieval, 2016.
[73] Q.-Y. Jiang and W.-J. Li, Deep cross-modal hashing, in Proc. Comput. Vis. Pattern Recognit., 2017.
[74] N. Rasiwasia, J. Costa Pereira, E. Coviello, G. Doyle, G. R. Lanckriet, R. Levy, and N. Vasconcelos, A new approach to cross-modal multimedia retrieval, in Proc. ACM Int. Conf. Multimedia, 2010.
[75] M. J. Huiskes and M. S. Lew, The MIR Flickr retrieval evaluation, in Proc. ACM Int. Conf. Multimedia Inf. Retrieval, 2008.
[76] T.-S. Chua, J. Tang, R. Hong, H. Li, Z. Luo, and Y. Zheng, NUS-WIDE: A real-world web image database from National University of Singapore, in Proc. ACM Int. Conf. Image Video Retrieval, 2009.

[77] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell, Caffe: Convolutional architecture for fast feature embedding, in Proc. ACM Int. Conf. Multimedia, 2014.

Lu Jin received the B.E. degree in Measuring and Control Technology and Instrumentation from Northeastern University at Qinhuangdao, Hebei, China, in 2010. She is currently a Ph.D. candidate in Computer Science and Technology at Nanjing University of Science and Technology. From 2015 to 2017, she was a visiting scholar with the Department of Computer Science, University of Central Florida. Her research interests include multimedia computing, deep learning and multimedia retrieval. She received the Best Student Paper Award at ICIMCS 2018.

Fu Xiao received the Ph.D. degree in Computer Science and Technology from Nanjing University of Science and Technology, Nanjing, China. He is currently a Professor and Ph.D. supervisor with the School of Computer, Nanjing University of Posts and Telecommunications, Nanjing, China. He has published over 30 papers in related international conferences and journals, including INFOCOM, ICC, IPCCC, IEEE JSAC, IEEE/ACM TON, IEEE TMC, ACM TECS and IEEE TVT. Dr. Xiao is a member of the IEEE Computer Society and the Association for Computing Machinery.

Jinhui Tang is a Professor in the School of Computer Science and Engineering, Nanjing University of Science and Technology, China. He received his B.E. and Ph.D. degrees in July 2003 and July 2008, respectively, both from the University of Science and Technology of China. From 2008 to 2010, he was a research fellow in the School of Computing, National University of Singapore. His current research interests include large-scale multimedia search. He has authored over 150 journal and conference papers in these areas. Prof. Tang is a co-recipient of the Best Paper Awards at ACM MM 2007, PCM 2011 and ICIMCS 2011, and the Best Student Paper Award at MMM 2016.

Zechao Li is currently a Professor at Nanjing University of Science and Technology. He received the Ph.D. degree from the National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences, in 2013, and the B.E. degree from the University of Science and Technology of China in 2008. His research interests include intelligent media analysis and computer vision. He received the Young Talent Program award of the China Association for Science and Technology, the Excellent Doctoral Dissertation award of the Chinese Academy of Sciences, and the Excellent Doctoral Theses award of the China Computer Federation.

Guo-Jun Qi received the Ph.D. degree from the University of Illinois at Urbana-Champaign in December 2013. His research interests include pattern recognition, machine learning, computer vision, multimedia, and data mining. He twice received the IBM Ph.D. Fellowship, as well as a Microsoft Fellowship. He is the recipient of the Best Paper Award at the 15th ACM International Conference on Multimedia, Augsburg, Germany, 2007. He is currently a faculty member with the Department of Computer Science at the University of Central Florida, and has served as a program committee member and reviewer for many academic conferences and journals in the fields of pattern recognition, machine learning, data mining, computer vision, and multimedia.
