IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS

Reversed Spectral Hashing

Qingshan Liu, Senior Member, IEEE, Guangcan Liu, Member, IEEE, Lai Li, Xiao-Tong Yuan, Member, IEEE, Meng Wang, and Wei Liu, Member, IEEE

Abstract: Hashing is emerging as a powerful tool for building highly efficient indices in large-scale search systems. In this paper, we study spectral hashing (SH), a classical method of unsupervised hashing. In general, SH solves for the hash codes by minimizing an objective function that tries to preserve the similarity structure of the given data. Although computationally simple, SH often performs unsatisfactorily and lags distinctly behind the state-of-the-art methods. We observe that the inferior performance of SH is mainly due to its imperfect formulation: optimizing the minimization problem in SH cannot actually ensure that the similarity structure of the high-dimensional data is preserved in the low-dimensional hash code space. In this paper, we therefore introduce reversed SH (ReSH), which is SH with its input and output interchanged. Unlike SH, which estimates the similarity structure from the given high-dimensional data, ReSH defines the similarities between data points according to the unknown low-dimensional hash codes. Equipped with such a reversal mechanism, ReSH can seamlessly overcome the drawback of SH. More precisely, the minimization problem in ReSH is optimized if and only if similar data points are mapped to adjacent hash codes and, most importantly, dissimilar data points are considerably separated from each other in the code space. Finally, we solve the minimization problem in ReSH by multilayer neural networks and obtain state-of-the-art retrieval results on three benchmark data sets.

Index Terms: Neural networks, spectral methods, unsupervised hashing.

Manuscript received August 24, 2016; revised March 14, 2017 and April 14, 2017; accepted April 17, 2017. The work of Q. Liu was supported by the National Natural Science Foundation of China (NSFC). The work of G. Liu was supported in part by NSFC and in part by the Natural Science Foundation of Jiangsu Province of China (NSFJPC). The work of X.-T. Yuan was supported in part by NSFC and in part by NSFJPC. The work of M. Wang was supported by NSFC. (Corresponding author: Guangcan Liu.)

Q. Liu is with B-DAT, School of Information and Control, Nanjing University of Information Science and Technology, Nanjing, China (e-mail: qsliu@nuist.edu.cn).

G. Liu and X.-T. Yuan are with B-DAT and CICAEET, School of Information and Control, Nanjing University of Information Science and Technology, Nanjing, China (e-mail: gcliu@nuist.edu.cn; xtyuan@nuist.edu.cn).

L. Li is with the School of Information and Control, Nanjing University of Information Science and Technology, Nanjing, China (e-mail: nuist_lilai@foxmail.com).

M. Wang is with the School of Computer Science and Information Engineering, Hefei University of Technology, Hefei, China (e-mail: eric.mengwang@gmail.com).

W. Liu is with the Computer Vision Group, Tencent AI Lab, Shenzhen, China (e-mail: wliu@ee.columbia.edu).

I. INTRODUCTION

In recent years, the volume of feature-rich data (e.g., photographs and videos) has been growing rapidly, and thus it is crucial to develop techniques for fast
k-nearest neighbor (KNN) search in gigantic data sets, particularly techniques for building similarity indices of high retrieval efficiency. Among the various approaches for building efficient indices and performing fast KNN search, hashing is widely regarded as a promising one due to its advantages of fast query speed and small storage expense [1]-[6]. According to how much label information is used, existing hashing methods can be divided into three categories: unsupervised [7], [8], supervised [9], [10], and semisupervised [11]. Benefiting from the advancement of deep neural networks, the performance of supervised hashing methods has been boosted considerably in recent years [12], [13]. However, supervised methods are inapplicable in environments where label information is lacking. In this paper, we shall therefore study unsupervised hashing, aiming to establish a method that is useful for building indices of high retrieval efficiency in large-scale search systems.

A. Related Work

1) Spectral Hashing: We are particularly interested in spectral hashing (SH) [14]-[16], a classical method of unsupervised hashing. Given a real-valued data matrix X = [x_1, x_2, ..., x_n] ∈ R^{d×n}, each column of which is a d-dimensional data point x_i ∈ R^d, SH tries to learn a matrix of hash codes, denoted Y = [y_1, y_2, ..., y_n], by minimizing

    min_Y Σ_{i,j=1}^n a_ij ‖y_i − y_j‖²   s.t.  Y ∈ {−1, 1}^{k×n},  YY^⊤ = nI,  Y1 = 0,   (1)

where ‖·‖ denotes the ℓ2 norm of a vector, y_i ∈ {−1, +1}^k, the ith column of Y, is the k-dimensional hash code one wants to learn for the ith data point x_i ∈ R^d, 1 denotes a vector of all ones, and A = [a_ij] ∈ R^{n×n} is a predefined affinity matrix that encodes the similarity structure of the data. The last two constraints, YY^⊤ = nI and Y1 = 0, are imposed to achieve uncorrelated and balanced hash bits, respectively. Usually, the affinity matrix A is computed by

    a_ij = exp(−‖x_i − x_j‖² / (2σ²)),   (2)

where σ > 0 is a thresholding parameter. By discarding the binary constraint Y ∈ {−1, 1}^{k×n}, the minimization problem in (1) can be easily solved by performing an eigenvalue decomposition on the Laplacian matrix of A. To obtain strictly binary hash codes, after the code matrix Y is learned, SH applies a simple quantization operator h(·) to the elements of the y_i's, where h(ν) takes value 1 if ν ≥ 0 and 0 otherwise.
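To make the spectral relaxation of (1) concrete, the following NumPy sketch (our own illustration with hypothetical names, not the authors' code; the full SH method of [15] additionally uses analytic eigenfunctions for out-of-sample extension) builds the affinity in (2), forms the graph Laplacian, and quantizes the bottom nontrivial eigenvectors:

    import numpy as np

    def spectral_hashing_relaxed(X, k, sigma):
        # X: d x n data matrix; returns a k x n matrix of binary codes in {0, 1}.
        sq = (X ** 2).sum(axis=0)
        d2 = sq[:, None] + sq[None, :] - 2.0 * X.T @ X   # pairwise squared distances
        A = np.exp(-d2 / (2.0 * sigma ** 2))             # Gaussian affinity, Eq. (2)
        L = np.diag(A.sum(axis=1)) - A                   # unnormalized graph Laplacian
        _, vecs = np.linalg.eigh(L)                      # eigenvalues in ascending order
        Y = vecs[:, 1:k + 1].T                           # skip the trivial constant eigenvector
        return (Y >= 0).astype(np.uint8)                 # quantize: h(v) = 1 iff v >= 0

The skipped eigenvector hints at the drawback discussed next: the constant vector attains objective value zero, so without the balance and decorrelation constraints, the relaxed problem prefers to give all points the same code.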

Although computationally simple, SH often performs unsatisfactorily and lags distinctly behind recently established state-of-the-art methods, e.g., iterative quantization (ITQ) [17]. The main reason is that the formulation of SH is imperfect: minimizing Σ_{i,j=1}^n a_ij ‖y_i − y_j‖² in general tends to assign similar hash codes to all data points, including the dissimilar ones, resulting in inferior retrieval performance. (Indeed, assigning every point the same code drives this objective to zero.) The issue is particularly striking when the binary constraint Y ∈ {−1, 1}^{k×n} is discarded during the spectral relaxation procedure.

2) Discrete Graph Hashing: To alleviate the drawback of SH, Liu et al. [18] and Shen et al. [19] show that it is helpful to keep the binary constraint Y ∈ {−1, 1}^{k×n}, establishing a method called discrete graph hashing (DGH). However, DGH may not fully remedy the defect of mapping dissimilar data points to similar hash codes, which is essentially caused by the imperfect objective function in (1). Moreover, the discrete optimization problem in DGH is NP-hard in nature and hard to solve.

B. This Work

In order to fix the defect of mapping dissimilar points to adjacent hash codes, we argue that one just needs to interchange the input x_i's and the output y_i's in (1), resulting in a novel formulation termed reversed SH (ReSH). Different from SH, which estimates the affinity matrix A from the given high-dimensional data points x_i, our ReSH defines the similarities between data points according to the unknown low-dimensional hash codes y_i. These code-dependent similarities take the responsibility of evaluating the quality of the produced hash codes and therefore dominate the whole training process. Interestingly, unlike SH, which tends to assign similar hash codes to dissimilar data points, the minimization problem in ReSH is optimized if and only if similar data points are mapped to adjacent hash codes and, even more, dissimilar data points are considerably separated from each other in the hash code space.

In ReSH, the hash codes y_i and the similarities a_ij are mutually dependent and both entirely unknown. As a consequence, not surprisingly, the optimization problem arising from ReSH is complex and not easy to solve. Fortunately, we find that it can be solved approximately by making use of artificial neural networks (ANNs). Namely, we first simulate the underlying hash mapping by a multilayer feedforward neural network and then use the resilient backpropagation algorithm [20] to train the network, which approximately solves the optimization problem in ReSH. Experiments on three benchmark data sets, CIFAR-10 [21], 22K-LabelMe [22], and ALOI [23], show the superior performance of the proposed ReSH method.

In summary, the contributions of this paper include the following.

1) We propose a novel unsupervised hashing method termed ReSH, which can seamlessly overcome the drawback of the classical SH method. Compared with DGH, our ReSH is more effective and more convenient to work with.

2) The formulation of ReSH results in a difficult minimization problem. We devise a neural-network-based algorithm to approximately solve the optimization problem and obtain superior performance in our experiments.

3) The idea of ReSH is probably useful for improving many other graph-based learning methods that share a similar drawback with SH. In that sense, the significance of this paper is not limited to hashing.
The rest of this paper is organized as follows. Section II presents the details of ReSH and the optimization algorithm. Section III implements different methods to experimentally verify the effectiveness of the proposed ReSH. Section IV concludes this paper.

II. SPECTRAL HASHING WITH INPUT AND OUTPUT INTERCHANGED

In general, the proposed ReSH method is the reversed form of SH. Different from SH, ReSH defines the similarities between data points according to the unknown low-dimensional hash codes rather than the given high-dimensional data. These code-dependent similarities take charge of evaluating the quality of the produced hash codes. Due to this key difference, ReSH can seamlessly eliminate the drawback of assigning close hash codes to distant data points.

A. Problem

Suppose we are given a real-valued data matrix X = [x_1, x_2, ..., x_n] ∈ R^{d×n}, each column of which is a d-dimensional data point x_i ∈ R^d. Without loss of generality, we assume that the data are normalized such that the elements of the x_i's are within the range from −1 to 1. Our goal is to learn a hash mapping, denoted f, that converts any data point x into a binary string y ∈ {0, 1}^k of length k:

    y = f(x) : R^d → {0, 1}^k.   (3)

The similarity structure of the original data is required to be preserved. Namely, similar points in the original data space R^d should be hashed into adjacent binary codes and, most importantly, dissimilar points must be considerably separated from each other in the hash code space. Also, the learned hash mapping f is required to generalize well to fresh data.

B. Model

To remedy the defect of mapping dissimilar points to similar hash codes, we argue that one just needs to interchange the input x_i's and the output y_i's in SH (1). More precisely, in this paper, we suggest the following model:

    min_{f∈F} (1/n²) Σ_{i,j=1}^n a_ij ‖x_i − x_j‖^p
    s.t.  y_i = f(x_i),  y_i ∈ {0, 1}^k,  i = 1, ..., n,
          a_ij = 1 if D(y_i, y_j) < σ, and a_ij = 0 otherwise,   (4)

where F generally denotes the family of continuous mappings, each member of which converts high-dimensional data to low-dimensional codes,¹ D(·,·) denotes the Hamming distance between hash codes, p > 0 is a parameter responsible for rescaling the distances between the data points, and σ > 0 is a thresholding parameter. Notice that the continuity of the mapping f enforces similar data points to have adjacent hash codes. Also, most importantly, the objective function in (4) can be minimized only if distant data points are considerably separated from each other in the low-dimensional code space. So, the problem in (4) is optimized if and only if the produced hash codes strictly preserve the similarity structure of the given high-dimensional data; that is, dissimilar data points are considerably separated from each other in the code space, while similar data points are mapped to adjacent hash codes.

¹ Since a hash mapping with discrete outputs cannot meanwhile be continuous, it is assumed in (4) that the output space of f is R^k rather than {0, 1}^k. This is not inconsistent with the binary constraint f(x_i) ∈ {0, 1}^k, because {x_i}_{i=1}^n is a discrete set.

C. Relaxation

The optimization problem in (4) is not easy to solve, due to the discrete nature of the hash codes and, even more, the mutual dependence between the hash codes y_i and the similarities a_ij. In particular, the dependence between the y_i's and the a_ij's is determined by a hard thresholding function, the output of which is discrete too. For the ease of optimization, we first relax the binary constraint y_i ∈ {0, 1}^k by using an ANN with S-shaped activation functions (e.g., the sigmoid) to simulate the hash mapping f; i.e., we assume that

    F = {ANNs with the sigmoid activation function}.   (5)

In the experiments of this paper, the sigmoid function used is simply s(t) = 1/(1 + exp(−t)). It is worth noting that the binary constraint y_i ∈ {0, 1}^k is indeed well treated in our method, because the use of the sigmoid (or any S-shaped) activation function makes the output of f nearly binary, as will be shown later. Second, to smooth the mutual dependence between the codes y_i and the similarities a_ij, we replace the hard thresholding operator, D(y_i, y_j) < σ, with a soft one. To be more precise, the affinity matrix A in (4) is defined as

    a_ij = exp(−‖y_i − y_j‖² / (2σ²)),   (6)

where y_i is the real-valued, nearly binary code of the ith data point x_i, ‖·‖ denotes the ℓ2 norm of a vector, and σ is the thresholding parameter. In this way, the discrete nature of the problem in (4) is removed, and we thereby reach the following optimization problem:

    min_{f∈F} (1/n²) Σ_{i,j=1}^n a_ij ‖x_i − x_j‖^p
    s.t.  y_i = f(x_i),  i = 1, ..., n,
          a_ij = exp(−‖y_i − y_j‖² / (2σ²)).   (7)

In general, this problem can be solved by using backpropagation to train an ANN, as will be shown in Section II-D.

1) Connections to Existing Methods: The objective function in (7) is somewhat similar to the second term of elastic embedding (EE) [24], whose formulation involves two terms: the first term puts similar points near each other, and the second term puts dissimilar points far apart. The idea of EE has been used in several hashing papers [25]. Different from EE-like methods, our proposed method utilizes the continuity of the hash mapping to achieve local smoothness and therefore involves only one term in the formulation. Moreover, note that (7) is just a relaxation of our original formulation (4), which is essentially different from the second term of EE.
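To make the relaxed problem concrete, the following NumPy sketch (our own illustration with hypothetical names, not the authors' code) evaluates the objective in (7) on a batch of near-binary codes, together with its gradient with respect to each code, matching the expression derived in Section II-D:

    import numpy as np

    def resh_objective_and_grad(X, Y, sigma, p):
        # X: d x n data matrix; Y: k x n matrix of near-binary codes y_i = f(x_i).
        n = X.shape[1]
        dx = np.linalg.norm(X[:, :, None] - X[:, None, :], axis=0) ** p  # ||x_i - x_j||^p
        dy2 = ((Y[:, :, None] - Y[:, None, :]) ** 2).sum(axis=0)         # ||y_i - y_j||^2
        A = np.exp(-dy2 / (2.0 * sigma ** 2))                            # soft affinities, Eq. (6)
        E = (A * dx).sum() / n ** 2                                      # objective, Eq. (7)
        W = A * dx                                                       # symmetric n x n weights
        G = -(2.0 / (n ** 2 * sigma ** 2)) * (Y * W.sum(axis=1) - Y @ W) # column i of G is dE/dy_i
        return E, G

Backpropagation then carries the columns of G through the network to the weight gradients; a finite-difference check on a few entries of G is a cheap way to validate the implementation.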
D. Optimization

To solve the problem in (7), there are many applicable techniques for neural networks, such as expectation maximization and deep learning. For the sake of computational efficiency, we use a standard backpropagation algorithm proposed in [20] to train a flat feedforward neural network that solves (7) in an approximate fashion. We first denote the objective function in (7) as E:

    E = (1/n²) Σ_{i,j=1}^n a_ij ‖x_i − x_j‖^p
      = (1/n²) Σ_{i,j=1}^n exp(−‖y_i − y_j‖² / (2σ²)) ‖x_i − x_j‖^p.   (8)

With this notation, it can be calculated that the gradient of the objective value E with respect to the output variable y_i is given by

    ∂E/∂y_i = −(2/n²) Σ_{j=1}^n ‖x_i − x_j‖^p exp(−‖y_i − y_j‖² / (2σ²)) (y_i − y_j)/σ².   (9)

Given ∂E/∂y_i, the gradients with respect to the network weights, denoted ∂E/∂w, can be computed by backpropagation. So, the problem in (7) can be solved by the standard backpropagation algorithm. To obtain more reliable solutions, we use instead the resilient backpropagation algorithm [20], whose pseudocode is shown in Algorithm 1. The parameters of the resilient backpropagation algorithm do not need extensive adjustment and can be set as follows: Δ_0 = 0.07 (initial weight change), Δ_max = 50 (maximum weight change), Δ_min = 10⁻⁶ (minimum weight change), η⁺ = 1.2 (increment factor), and η⁻ = 0.5 (decrement factor). Empirically, the resilient backpropagation algorithm converges within 70 epochs, as exemplified in Fig. 1 (left).
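For reference, here is a compact array-form sketch of one resilient-backpropagation epoch (a generic rendering of [20] and of Algorithm 1 under the parameters above, written by us with hypothetical names; it follows the variant that zeroes the stored gradient after a sign change):

    import numpy as np

    def rprop_step(w, g, g_prev, delta,
                   dmax=50.0, dmin=1e-6, eta_plus=1.2, eta_minus=0.5):
        # w, g (current gradient), g_prev, delta: arrays of the same shape.
        # Initialize delta = 0.07 * ones_like(w) and g_prev = zeros_like(w).
        s = g * g_prev
        delta = np.where(s > 0, np.minimum(delta * eta_plus, dmax), delta)   # same sign: speed up
        delta = np.where(s < 0, np.maximum(delta * eta_minus, dmin), delta)  # sign flip: slow down
        g = np.where(s < 0, 0.0, g)          # skip the update right after a sign flip
        w = w - np.sign(g) * delta           # step against the gradient; magnitude = delta
        return w, g, delta                   # g is stored as g_prev for the next epoch

Because only the sign of the gradient is used, the update is insensitive to the scale of E, which is convenient here since the objective in (8) shrinks with n.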

Fig. 1. Left: convergence behavior of the optimization algorithm presented in Section II-D. Right: distribution of the output values of the learned hash mapping f. These statistics are collected from 1.4 million output values obtained in our experiments. The label 0.05 refers to the interval (0, 0.1], and similarly for 0.15, 0.25, ..., 0.95.

Fig. 2. Precision-recall and recall curves on CIFAR-10. (a) Precision-recall curve at 16 b. (b) Precision-recall curve at 24 b. (c) Precision-recall curve at 32 b. (d) Recall curve at 16 b. (e) Recall curve at 24 b. (f) Recall curve at 32 b.

1) Quantization: The hash mapping f learned by solving the problem in (7) does not convert a data point into a strictly binary hash code. Thus, we need a quantization scheme to postprocess the output of f. As can be seen from Fig. 1 (right), the output of f is very close to binary. Hence, we apply a simple operator h(·) to quantize the output of f, where h(ν) takes value 1 if ν ≥ 0.5 and 0 otherwise.

2) Reducing the Training Time: Since the gradient for a single output point is a sum over all points [see (9)], the computational cost of learning f from n data points is O(n²). This is expensive on big data sets where n is very large. To speed up the training process, we utilize the idea of stochastic gradient descent. In each epoch, we randomly choose r integers from 1 to n, denoted T = {i_1, i_2, ..., i_r}, and then approximate the objective function E in (8) by

    E_r = (1/(nr)) Σ_{i∈T} Σ_{j=1}^n exp(−‖y_i − y_j‖² / (2σ²)) ‖x_i − x_j‖^p,   (10)

where r is taken as a parameter and referred to as the sampling rate. In this way, the computational cost is reduced to O(nr) with r ≪ n. According to our experimental investigations, ReSH works equally well when r ≥ 0.05n.
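A sketch of the sampled objective (10), continuing the earlier illustration (names are ours):

    import numpy as np

    def resh_objective_sampled(X, Y, sigma, p, r, seed=0):
        # Approximate E by summing only over a random subset T of r anchor points.
        n = X.shape[1]
        T = np.random.default_rng(seed).choice(n, size=r, replace=False)
        dx = np.linalg.norm(X[:, T, None] - X[:, None, :], axis=0) ** p   # r x n block of ||x_i - x_j||^p
        dy2 = ((Y[:, T, None] - Y[:, None, :]) ** 2).sum(axis=0)          # r x n block of ||y_i - y_j||^2
        return (np.exp(-dy2 / (2.0 * sigma ** 2)) * dx).sum() / (n * r)   # Eq. (10)

Per epoch, the objective and the corresponding gradients now touch only r·n pairs instead of n², which is where the O(nr) cost comes from.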

Fig. 3. Precision-recall and recall curves on 22K-LabelMe. (a) Precision-recall curve at 16 b. (b) Precision-recall curve at 24 b. (c) Precision-recall curve at 32 b. (d) Recall curve at 16 b. (e) Recall curve at 24 b. (f) Recall curve at 32 b.

III. EXPERIMENTS

A. Experimental Data

We evaluate our method on four benchmark data sets, CIFAR-10 [21], 22K-LabelMe [22], ALOI [23], and ANN-GIST1M [26], which are widely used as test beds in large-scale search.

1) CIFAR-10: The first data set, CIFAR-10, contains 60K RGB images of size 32 × 32 that have been manually labeled into ten classes, each consisting of 6K images. For each image, we extract a 128-D GIST [27] feature vector, resulting in a collection of 60K 128-D data points for the experiments. We randomly choose 59K images as training data for learning the underlying hash mapping f and use the rest for testing.

2) 22K-LabelMe: The second data set we experiment with, 22K-LabelMe, consists of 22 019 images randomly sampled from the whole LabelMe database. We extract a 128-D GIST feature vector for each image, resulting in a collection of 128-D data points for the experiments. Again, the data set is randomly divided into two parts: 20K data points for training and the remaining 2019 points for testing.

3) ALOI: The third data set, ALOI, contains 108K 128-D data points. We randomly choose 100K points as training data and use the rest for testing.

4) ANN-GIST1M: The fourth data set, ANN-GIST1M [26], contains a training set of one million 960-D GIST feature points and a testing set of 1K feature points. For convenience, the dimension of the feature points is reduced to 128 by principal component analysis.
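For completeness, a minimal PCA reduction of the kind applied to ANN-GIST1M might look as follows (our sketch; names are hypothetical):

    import numpy as np

    def pca_reduce(X, out_dim=128):
        # Project the columns of X (d x n) onto the top out_dim principal components.
        Xc = X - X.mean(axis=1, keepdims=True)   # center each feature dimension
        C = (Xc @ Xc.T) / Xc.shape[1]            # d x d covariance matrix
        _, V = np.linalg.eigh(C)                 # eigenvalues in ascending order
        return V[:, ::-1][:, :out_dim].T @ Xc    # out_dim x n reduced data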

Fig. 4. Precision-recall and recall curves on ALOI. (a) Precision-recall curve at 16 b. (b) Precision-recall curve at 24 b. (c) Precision-recall curve at 32 b. (d) Recall curve at 16 b. (e) Recall curve at 24 b. (f) Recall curve at 32 b.

B. Baselines

Since the proposed ReSH is closely related to SH, we adopt SH as a benchmark baseline for comparison. To verify the position of ReSH among the state-of-the-art (unsupervised) hashing methods, besides SH, we also include for comparison eight prevalent methods: DGH [18], semantic hashing [28], ITQ [17], isotropic hashing [29], sparse projection [30], locality-sensitive hashing [2], unsupervised sequential projection learning for hashing [31], and spherical hashing [32]. In addition, we also include for comparison unsupervised kernel hashing, which is an unsupervised version of [9].

C. Evaluation Metrics

For each query point (i.e., test point), its ground-truth neighbors are defined as the points that lie within the top 0.01%-1% of points closest to the query in terms of the original Euclidean distance. Then precision-recall, recall, and mean average precision (MAP) are used as the metrics for evaluating the performance of the hashing methods.
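As an illustration of this protocol, the sketch below (ours, with hypothetical names; it scores a single Hamming-radius operating point rather than the full curves reported in Figs. 2-5) builds the Euclidean ground truth and evaluates a Hamming-based retrieval:

    import numpy as np

    def precision_recall_at_radius(B_db, B_q, X_db, X_q, top_frac=0.01, radius=2):
        # B_db: k x n and B_q: k x q binary codes; X_db: d x n and X_q: d x q raw features.
        n = X_db.shape[1]
        m = max(1, int(top_frac * n))                                      # ground-truth neighborhood size
        euc = np.linalg.norm(X_q[:, :, None] - X_db[:, None, :], axis=0)   # q x n Euclidean distances
        gt = np.zeros_like(euc, dtype=bool)
        np.put_along_axis(gt, np.argsort(euc, axis=1)[:, :m], True, axis=1)
        ham = (B_q[:, :, None] != B_db[:, None, :]).sum(axis=0)            # q x n Hamming distances
        hits = gt & (ham <= radius)                                        # retrieved true neighbors
        retrieved = np.maximum((ham <= radius).sum(axis=1), 1)             # avoid division by zero
        return (hits.sum(axis=1) / retrieved).mean(), (hits.sum(axis=1) / m).mean()

Sweeping the radius (or ranking by Hamming distance) yields the precision-recall and recall curves.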

D. Parameter Configuration

In the proposed ReSH method, the parameter p takes the role of rescaling the distances between the data points so as to refine the similarity structure of the given data. In our experiments, p is chosen as 8. Obviously, the thresholding parameter σ should increase with the number of hash bits k. Since the ℓ2 norm of a nearly binary k-bit code is O(√k), the dependence of σ on k can be described as

    σ = α√k,   (11)

where α is set to 0.2 in our experiments. Since the desired hash mapping f is required to be continuous, the neural network for simulating f needs only one hidden layer. In all the experiments shown in this paper, the neural network is configured with a single hidden layer and k output units, where k is the number of hash bits we want to generate.

E. Results

Using a workstation with an Intel Core i CPU @ 3.60 GHz and 8 GB of memory, the training time of our algorithm on the two small data sets (CIFAR-10 and 22K-LabelMe) is about 10 min (r = 0.05n). For training the network using one million data points, the runtime is a little longer than 1 h (r = 0.05n).

The comparison results are shown in Figs. 2-5 and Table I. It can be seen that our ReSH is distinctly better than the benchmark baseline, SH. More precisely, in terms of MAP, Table I shows that ReSH is at least 33% and at most 132% better than SH. This verifies that it is very beneficial to interchange the input and output in SH, which arguably fixes the defect of assigning similar hash codes to dissimilar data points. Compared with the state-of-the-art methods, the performance of ReSH is also superior. For example, on ALOI, the 32-b hash codes produced by ReSH achieve an MAP of 0.56, which is 7.5% higher than the closest baseline, ITQ. These results confirm the effectiveness of the proposed ReSH method.

Fig. 5. Precision-recall and recall curves on ANN-GIST1M. (a) Precision-recall curve at 16 b. (b) Precision-recall curve at 24 b. (c) Precision-recall curve at 32 b. (d) Recall curve at 16 b. (e) Recall curve at 24 b. (f) Recall curve at 32 b.

TABLE I. MAP on CIFAR-10, 22K-LabelMe, and ALOI. For each method, we run ten times and use the average performance for comparison.

Influences of the Parameters: In ReSH, the thresholding parameter σ can be neither too large nor too small, and similarly for the rescaling parameter p. Regarding the sampling rate r, there is a tradeoff between efficiency and effectiveness: with smaller r, the training process is faster, but the retrieval results might be less accurate. The results in Fig. 6 (middle) show that it is adequate to set the thresholding parameter as σ = α√k, with α around 0.2. The rescaling parameter p can be chosen around 8, as can be seen from Fig. 6 (left). Empirically, as shown in Fig. 6 (right), ReSH performs almost equally well whenever r ≥ 0.05n.

Fig. 6. Influences of the parameters. Left: p varies and α = 0.2 (i.e., σ = 0.2√k). Middle: σ varies and p = 8. Right: r varies while α = 0.2 and p = 8. These experiments are conducted on 22K-LabelMe. In these experiments, we consistently use r = 0.05n.

Algorithm 1: Resilient Backpropagation

  Initialize t = 1, Δ_ij(t) = Δ_0, and Δw_ij(t − 1) = 0 for all i, j, where t is the epoch number, i and j denote two connected units, w_ij is the weight of the edge that connects them, and Δ_ij is the corresponding weight change.
  repeat
    Compute the gradients ∂E/∂w_ij(t) for all i, j using backpropagation.
    for all weights w_ij
      if ∂E/∂w_ij(t − 1) · ∂E/∂w_ij(t) > 0 then
        Δ_ij(t) = min(Δ_ij(t − 1) · η⁺, Δ_max)
        Δw_ij(t) = −sign(∂E/∂w_ij(t)) · Δ_ij(t)
        w_ij(t + 1) = w_ij(t) + Δw_ij(t)
        ∂E/∂w_ij(t − 1) = ∂E/∂w_ij(t)
      else if ∂E/∂w_ij(t − 1) · ∂E/∂w_ij(t) < 0 then
        Δ_ij(t) = max(Δ_ij(t − 1) · η⁻, Δ_min)
        ∂E/∂w_ij(t − 1) = 0
      else if ∂E/∂w_ij(t − 1) · ∂E/∂w_ij(t) = 0 then
        Δ_ij(t) = Δ_ij(t − 1)
        Δw_ij(t) = −sign(∂E/∂w_ij(t)) · Δ_ij(t)
        w_ij(t + 1) = w_ij(t) + Δw_ij(t)
        ∂E/∂w_ij(t − 1) = ∂E/∂w_ij(t)
      end if
    end for
    t = t + 1
  until stopped

IV. CONCLUSION

In this paper, we studied hashing, a key technique for addressing the efficiency issues that arise in dealing with high-dimensional, massive data. In particular, we were interested in SH, a classical method of unsupervised hashing. While computationally simple, SH has the defect of mapping dissimilar data points to adjacent hash codes. To overcome this drawback, we proposed a novel method termed ReSH. We showed that the minimization problem in ReSH is optimized if and only if the similarity structure of the high-dimensional data is strictly preserved in the low-dimensional hash code space. Experimental results verify that ReSH is much better than SH and achieves state-of-the-art performance. However, it is entirely possible to improve ReSH by using more effective learning techniques (e.g., deep learning methods) to solve the optimization problem in (7). We leave this as future work.

REFERENCES

[1] A. Andoni and P. Indyk, "Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions," Commun. ACM, vol. 51, no. 1, Jan.
[2] A. Gionis, P. Indyk, and R. Motwani, "Similarity search in high dimensions via hashing," in Proc. Int. Conf. Very Large Data Bases, 1999.
[3] B. Kulis, P. Jain, and K. Grauman, "Fast similarity search for learned metrics," IEEE Trans. Pattern Anal. Mach. Intell., vol. 31, no. 12, Dec.
[4] B. Neyshabur, N. Srebro, R. Salakhutdinov, Y. Makarychev, and P. Yadollahpour, "The power of asymmetry in binary hashing," in Proc. Adv. Neural Inf. Process. Syst., 2013.
[5] J. Sánchez and F. Perronnin, "High-dimensional signature compression for large-scale image classification," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Apr. 2011.
[6] G. Shakhnarovich, "Learning task-specific similarity," Ph.D. dissertation, Dept. Elect. Eng. Comput. Sci., Massachusetts Inst. Technol., Cambridge, MA, USA.
[7] B. Kulis and T. Darrell, "Learning to hash with binary reconstructive embeddings," in Proc. Adv. Neural Inf. Process. Syst., 2009.
[8] V. E. Liong, J. Lu, G. Wang, P. Moulin, and J. Zhou, "Deep hashing for compact binary codes learning," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2015.
[9] W. Liu, J. Wang, R. Ji, Y.-G. Jiang, and S.-F. Chang, "Supervised hashing with kernels," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2012.
[10] H.-F. Yang, K. Lin, and C.-S. Chen, "Supervised learning of semantics-preserving hash via deep convolutional neural networks," IEEE Trans. Pattern Anal. Mach. Intell., to be published.
[11] J. Wang, S. Kumar, and S.-F. Chang, "Semi-supervised hashing for large-scale search," IEEE Trans. Pattern Anal. Mach. Intell., vol. 34, no. 12, Dec.
[12] H. Lai, Y. Pan, Y. Liu, and S. Yan, "Simultaneous feature learning and hash coding with deep neural networks," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2015.
[13] F. Zhao, Y. Huang, L. Wang, and T. Tan, "Deep semantic ranking based hashing for multi-label image retrieval," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2015.
[14] Y. Weiss, R. Fergus, and A. Torralba, "Multidimensional spectral hashing," in Proc. Eur. Conf. Comput. Vis., 2012.
[15] Y. Weiss, A. Torralba, and R. Fergus, "Spectral hashing," in Proc. Adv. Neural Inf. Process. Syst., 2008.
[16] F. Shen, C. Shen, Q. Shi, A. van den Hengel, Z. Tang, and H. T. Shen, "Hashing on nonlinear manifolds," IEEE Trans. Image Process., vol. 24, no. 6, Jun.
[17] Y. Gong, S. Lazebnik, A. Gordo, and F. Perronnin, "Iterative quantization: A Procrustean approach to learning binary codes for large-scale image retrieval," IEEE Trans. Pattern Anal. Mach. Intell., vol. 35, no. 12, Dec.
[18] W. Liu, C. Mu, S. Kumar, and S.-F. Chang, "Discrete graph hashing," in Proc. Adv. Neural Inf. Process. Syst., Apr. 2014.
[19] F. Shen, X. Zhou, Y. Yang, J. Song, H. T. Shen, and D. Tao, "A fast optimization method for general binary code learning," IEEE Trans. Image Process., vol. 25, no. 12, Dec.
[20] M. Riedmiller and H. Braun, "A direct adaptive method for faster backpropagation learning: The Rprop algorithm," in Proc. IEEE Int. Conf. Neural Netw., 1993.
[21] A. Krizhevsky, "Learning multiple layers of features from tiny images," Dept. Comput. Sci., Univ. Toronto, Toronto, ON, Canada, Tech. Rep.
[22] A. Torralba, R. Fergus, and Y. Weiss, "Small codes and large image databases for recognition," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2008.
[23] A. Rocha and S. K. Goldenstein, "Multiclass from binary: Expanding one-versus-all, one-versus-one and ECOC-based approaches," IEEE Trans. Neural Netw. Learn. Syst., vol. 25, no. 2, Feb.
[24] M. Á. Carreira-Perpiñán, "The elastic embedding algorithm for dimensionality reduction," in Proc. Int. Conf. Mach. Learn., 2010.
[25] F. Shen, C. Shen, Q. Shi, A. van den Hengel, and Z. Tang, "Inductive hashing on manifolds," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2013.
[26] H. Jégou, M. Douze, and C. Schmid, "Product quantization for nearest neighbor search," IEEE Trans. Pattern Anal. Mach. Intell., vol. 33, no. 1, Jan.
[27] A. Oliva and A. Torralba, "Modeling the shape of the scene: A holistic representation of the spatial envelope," Int. J. Comput. Vis., vol. 42, no. 3.
[28] R. Salakhutdinov and G. Hinton, "Semantic hashing," Int. J. Approx. Reasoning, vol. 50, no. 7, Jul.
[29] W. Kong and W. Li, "Isotropic hashing," in Proc. Adv. Neural Inf. Process. Syst., 2012.
[30] Y. Xia, K. He, P. Kohli, and J. Sun, "Sparse projections for high-dimensional binary codes," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2015.

[31] J. Wang, S. Kumar, and S.-F. Chang, "Sequential projection learning for hashing with compact codes," in Proc. Int. Conf. Mach. Learn., 2010.
[32] J.-P. Heo, Y. Lee, J. He, S.-F. Chang, and S.-E. Yoon, "Spherical hashing: Binary code embedding with hyperspheres," IEEE Trans. Pattern Anal. Mach. Intell., vol. 37, no. 11, Nov.

Qingshan Liu (M'05-SM'08) received the Ph.D. degree from the National Laboratory of Pattern Recognition, Chinese Academy of Sciences, Beijing, China, in 2003, and the M.S. degree from Southeast University, Nanjing, China. He was an Assistant Research Professor with the Computational Biomedicine Imaging and Modeling Center, Department of Computer Science, Rutgers University, New Brunswick, NJ, USA. From 2010 to 2011, he was an Associate Professor with the National Laboratory of Pattern Recognition, Chinese Academy of Sciences. He is currently a Professor with the School of Information and Control Engineering, Nanjing University of Information Science and Technology, Nanjing. His current research interests include image and vision analysis, including face image analysis, graph- and hypergraph-based image and video understanding, medical image analysis, and event-based video analysis. Dr. Liu was a recipient of the President Scholarship of the Chinese Academy of Sciences.

Guangcan Liu (M'11) received the bachelor's degree in mathematics and the Ph.D. degree in computer science and engineering from Shanghai Jiao Tong University, Shanghai, China, in 2004 and 2010, respectively. He was a Post-Doctoral Researcher with the National University of Singapore, Singapore, from 2011 to 2012; the University of Illinois at Urbana-Champaign, Champaign, IL, USA, from 2012 to 2013; Cornell University, Ithaca, NY, USA, from 2013 to 2014; and Rutgers University, New Brunswick, NJ, USA. Since 2014, he has been a Professor with the School of Information and Control, Nanjing University of Information Science and Technology, Nanjing, China. His current research interests include machine learning, computer vision, and image processing.

Xiao-Tong Yuan (M'14) received the B.A. degree in computer science from the Nanjing University of Posts and Telecommunications, Nanjing, China, in 2002, the M.E. degree in electrical engineering from Shanghai Jiao Tong University, Shanghai, China, in 2005, and the Ph.D. degree in pattern recognition from the Chinese Academy of Sciences, Beijing, China. He was a Post-Doctoral Research Associate with the Department of Electrical and Computer Engineering, National University of Singapore, Singapore; the Department of Statistics and Biostatistics, Rutgers University, New Brunswick, NJ, USA; and the Department of Statistical Science, Cornell University, Ithaca, NY, USA. In 2013, he joined the Nanjing University of Information Science and Technology, Nanjing, where he is currently a Professor of computer science. His current research interests include machine learning, data mining, and computer vision.

Meng Wang received the B.E. and Ph.D. degrees from the Special Class for the Gifted Young and the Department of Electronic Engineering and Information Science, University of Science and Technology of China, Hefei, China, in 2003 and 2008, respectively. He is currently a Professor with the Hefei University of Technology, Hefei.
His current research interests include multimedia content analysis, computer vision, and pattern recognition. He has authored over 200 book chapters, journal papers, and conference papers in these areas.

Wei Liu (M'14) received the M.Phil. and Ph.D. degrees in EECS from Columbia University, New York, NY, USA. He has been a Research Scientist with the IBM T. J. Watson Research Center, Yorktown Heights, NY, USA. He is currently a Computer Vision Director with the Tencent AI Lab, Shenzhen, China. He has been involved in the research and development of computer vision, machine learning, data mining, and information retrieval. He has authored over 100 peer-reviewed journal and conference papers.

Lai Li received the bachelor's degree in electronic information and electrical engineering from the Nanjing University of Information Science and Technology, Nanjing, China, in 2014, where he is currently pursuing the master's degree with the School of Information and Control. His current research interests include machine learning and computer vision.


More information

arxiv: v1 [cs.cv] 19 Oct 2017 Abstract

arxiv: v1 [cs.cv] 19 Oct 2017 Abstract Improved Search in Hamming Space using Deep Multi-Index Hashing Hanjiang Lai and Yan Pan School of Data and Computer Science, Sun Yan-Sen University, China arxiv:70.06993v [cs.cv] 9 Oct 207 Abstract Similarity-preserving

More information

Deep Supervised Hashing for Fast Image Retrieval

Deep Supervised Hashing for Fast Image Retrieval Deep Supervised Hashing for Fast Image Retrieval Haomiao Liu 1,, Ruiping Wang 1, Shiguang Shan 1, Xilin Chen 1 1 Key Laboratory of Intelligent Information Processing of Chinese Academy of Sciences (CAS),

More information

Planar pattern for automatic camera calibration

Planar pattern for automatic camera calibration Planar pattern for automatic camera calibration Beiwei Zhang Y. F. Li City University of Hong Kong Department of Manufacturing Engineering and Engineering Management Kowloon, Hong Kong Fu-Chao Wu Institute

More information

Performance Degradation Assessment and Fault Diagnosis of Bearing Based on EMD and PCA-SOM

Performance Degradation Assessment and Fault Diagnosis of Bearing Based on EMD and PCA-SOM Performance Degradation Assessment and Fault Diagnosis of Bearing Based on EMD and PCA-SOM Lu Chen and Yuan Hang PERFORMANCE DEGRADATION ASSESSMENT AND FAULT DIAGNOSIS OF BEARING BASED ON EMD AND PCA-SOM.

More information

ESSENTIALLY, system modeling is the task of building

ESSENTIALLY, system modeling is the task of building IEEE TRANSACTIONS ON INDUSTRIAL ELECTRONICS, VOL. 53, NO. 4, AUGUST 2006 1269 An Algorithm for Extracting Fuzzy Rules Based on RBF Neural Network Wen Li and Yoichi Hori, Fellow, IEEE Abstract A four-layer

More information

On combining chase-2 and sum-product algorithms for LDPC codes

On combining chase-2 and sum-product algorithms for LDPC codes University of Wollongong Research Online Faculty of Engineering and Information Sciences - Papers: Part A Faculty of Engineering and Information Sciences 2012 On combining chase-2 and sum-product algorithms

More information

SELECTION OF THE OPTIMAL PARAMETER VALUE FOR THE LOCALLY LINEAR EMBEDDING ALGORITHM. Olga Kouropteva, Oleg Okun and Matti Pietikäinen

SELECTION OF THE OPTIMAL PARAMETER VALUE FOR THE LOCALLY LINEAR EMBEDDING ALGORITHM. Olga Kouropteva, Oleg Okun and Matti Pietikäinen SELECTION OF THE OPTIMAL PARAMETER VALUE FOR THE LOCALLY LINEAR EMBEDDING ALGORITHM Olga Kouropteva, Oleg Okun and Matti Pietikäinen Machine Vision Group, Infotech Oulu and Department of Electrical and

More information

An efficient face recognition algorithm based on multi-kernel regularization learning

An efficient face recognition algorithm based on multi-kernel regularization learning Acta Technica 61, No. 4A/2016, 75 84 c 2017 Institute of Thermomechanics CAS, v.v.i. An efficient face recognition algorithm based on multi-kernel regularization learning Bi Rongrong 1 Abstract. A novel

More information

Learning independent, diverse binary hash functions: pruning and locality

Learning independent, diverse binary hash functions: pruning and locality Learning independent, diverse binary hash functions: pruning and locality Ramin Raziperchikolaei and Miguel Á. Carreira-Perpiñán Electrical Engineering and Computer Science University of California, Merced

More information

Learning to Hash with Binary Reconstructive Embeddings

Learning to Hash with Binary Reconstructive Embeddings Learning to Hash with Binary Reconstructive Embeddings Brian Kulis Trevor Darrell Electrical Engineering and Computer Sciences University of California at Berkeley Technical Report No. UCB/EECS-2009-0

More information

Cluster Analysis. Mu-Chun Su. Department of Computer Science and Information Engineering National Central University 2003/3/11 1

Cluster Analysis. Mu-Chun Su. Department of Computer Science and Information Engineering National Central University 2003/3/11 1 Cluster Analysis Mu-Chun Su Department of Computer Science and Information Engineering National Central University 2003/3/11 1 Introduction Cluster analysis is the formal study of algorithms and methods

More information

RECENT years have witnessed the rapid growth of image. SSDH: Semi-supervised Deep Hashing for Large Scale Image Retrieval

RECENT years have witnessed the rapid growth of image. SSDH: Semi-supervised Deep Hashing for Large Scale Image Retrieval SSDH: Semi-supervised Deep Hashing for Large Scale Image Retrieval Jian Zhang, Yuxin Peng, and Junchao Zhang arxiv:607.08477v [cs.cv] 28 Jul 206 Abstract The hashing methods have been widely used for efficient

More information

NUMERICAL COMPUTATION METHOD OF THE GENERAL DISTANCE TRANSFORM

NUMERICAL COMPUTATION METHOD OF THE GENERAL DISTANCE TRANSFORM STUDIA UNIV. BABEŞ BOLYAI, INFORMATICA, Volume LVI, Number 2, 2011 NUMERICAL COMPUTATION METHOD OF THE GENERAL DISTANCE TRANSFORM SZIDÓNIA LEFKOVITS(1) Abstract. The distance transform is a mathematical

More information

A fuzzy k-modes algorithm for clustering categorical data. Citation IEEE Transactions on Fuzzy Systems, 1999, v. 7 n. 4, p.

A fuzzy k-modes algorithm for clustering categorical data. Citation IEEE Transactions on Fuzzy Systems, 1999, v. 7 n. 4, p. Title A fuzzy k-modes algorithm for clustering categorical data Author(s) Huang, Z; Ng, MKP Citation IEEE Transactions on Fuzzy Systems, 1999, v. 7 n. 4, p. 446-452 Issued Date 1999 URL http://hdl.handle.net/10722/42992

More information

An Optimized Pixel-Wise Weighting Approach For Patch-Based Image Denoising

An Optimized Pixel-Wise Weighting Approach For Patch-Based Image Denoising An Optimized Pixel-Wise Weighting Approach For Patch-Based Image Denoising Dr. B. R.VIKRAM M.E.,Ph.D.,MIEEE.,LMISTE, Principal of Vijay Rural Engineering College, NIZAMABAD ( Dt.) G. Chaitanya M.Tech,

More information

Aggregating Descriptors with Local Gaussian Metrics

Aggregating Descriptors with Local Gaussian Metrics Aggregating Descriptors with Local Gaussian Metrics Hideki Nakayama Grad. School of Information Science and Technology The University of Tokyo Tokyo, JAPAN nakayama@ci.i.u-tokyo.ac.jp Abstract Recently,

More information

Feature Selection for Multi-Class Imbalanced Data Sets Based on Genetic Algorithm

Feature Selection for Multi-Class Imbalanced Data Sets Based on Genetic Algorithm Ann. Data. Sci. (2015) 2(3):293 300 DOI 10.1007/s40745-015-0060-x Feature Selection for Multi-Class Imbalanced Data Sets Based on Genetic Algorithm Li-min Du 1,2 Yang Xu 1 Hua Zhu 1 Received: 30 November

More information

HYBRID CENTER-SYMMETRIC LOCAL PATTERN FOR DYNAMIC BACKGROUND SUBTRACTION. Gengjian Xue, Li Song, Jun Sun, Meng Wu

HYBRID CENTER-SYMMETRIC LOCAL PATTERN FOR DYNAMIC BACKGROUND SUBTRACTION. Gengjian Xue, Li Song, Jun Sun, Meng Wu HYBRID CENTER-SYMMETRIC LOCAL PATTERN FOR DYNAMIC BACKGROUND SUBTRACTION Gengjian Xue, Li Song, Jun Sun, Meng Wu Institute of Image Communication and Information Processing, Shanghai Jiao Tong University,

More information

Large scale object/scene recognition

Large scale object/scene recognition Large scale object/scene recognition Image dataset: > 1 million images query Image search system ranked image list Each image described by approximately 2000 descriptors 2 10 9 descriptors to index! Database

More information

TEXTURE CLASSIFICATION METHODS: A REVIEW

TEXTURE CLASSIFICATION METHODS: A REVIEW TEXTURE CLASSIFICATION METHODS: A REVIEW Ms. Sonal B. Bhandare Prof. Dr. S. M. Kamalapur M.E. Student Associate Professor Deparment of Computer Engineering, Deparment of Computer Engineering, K. K. Wagh

More information

Order Preserving Hashing for Approximate Nearest Neighbor Search

Order Preserving Hashing for Approximate Nearest Neighbor Search Order Preserving Hashing for Approximate Nearest Neighbor Search Jianfeng Wang Jingdong Wang Nenghai Yu Shipeng Li University of Science and Technology of China, P. R. China Microsoft Research Asia, P.

More information

Convex combination of adaptive filters for a variable tap-length LMS algorithm

Convex combination of adaptive filters for a variable tap-length LMS algorithm Loughborough University Institutional Repository Convex combination of adaptive filters for a variable tap-length LMS algorithm This item was submitted to Loughborough University's Institutional Repository

More information

A Novel Image Super-resolution Reconstruction Algorithm based on Modified Sparse Representation

A Novel Image Super-resolution Reconstruction Algorithm based on Modified Sparse Representation , pp.162-167 http://dx.doi.org/10.14257/astl.2016.138.33 A Novel Image Super-resolution Reconstruction Algorithm based on Modified Sparse Representation Liqiang Hu, Chaofeng He Shijiazhuang Tiedao University,

More information

Content Based Image Retrieval Using Color Quantizes, EDBTC and LBP Features

Content Based Image Retrieval Using Color Quantizes, EDBTC and LBP Features Content Based Image Retrieval Using Color Quantizes, EDBTC and LBP Features 1 Kum Sharanamma, 2 Krishnapriya Sharma 1,2 SIR MVIT Abstract- To describe the image features the Local binary pattern (LBP)

More information

Efficient Second-Order Iterative Methods for IR Drop Analysis in Power Grid

Efficient Second-Order Iterative Methods for IR Drop Analysis in Power Grid Efficient Second-Order Iterative Methods for IR Drop Analysis in Power Grid Yu Zhong Martin D. F. Wong Dept. of Electrical and Computer Engineering Dept. of Electrical and Computer Engineering Univ. of

More information

Adaptive Boundary Effect Processing For Empirical Mode Decomposition Using Template Matching

Adaptive Boundary Effect Processing For Empirical Mode Decomposition Using Template Matching Appl. Math. Inf. Sci. 7, No. 1L, 61-66 (2013) 61 Applied Mathematics & Information Sciences An International Journal Adaptive Boundary Effect Processing For Empirical Mode Decomposition Using Template

More information

Fast Learning for Big Data Using Dynamic Function

Fast Learning for Big Data Using Dynamic Function IOP Conference Series: Materials Science and Engineering PAPER OPEN ACCESS Fast Learning for Big Data Using Dynamic Function To cite this article: T Alwajeeh et al 2017 IOP Conf. Ser.: Mater. Sci. Eng.

More information

Scalable Graph Hashing with Feature Transformation

Scalable Graph Hashing with Feature Transformation Proceedings of the Twenty-ourth International Joint Conference on Artificial Intelligence (IJCAI 2015) Scalable Graph Hashing with eature Transformation Qing-Yuan Jiang and Wu-Jun Li National Key Laboratory

More information

Image Enhancement Techniques for Fingerprint Identification

Image Enhancement Techniques for Fingerprint Identification March 2013 1 Image Enhancement Techniques for Fingerprint Identification Pankaj Deshmukh, Siraj Pathan, Riyaz Pathan Abstract The aim of this paper is to propose a new method in fingerprint enhancement

More information

Large-scale visual recognition Efficient matching

Large-scale visual recognition Efficient matching Large-scale visual recognition Efficient matching Florent Perronnin, XRCE Hervé Jégou, INRIA CVPR tutorial June 16, 2012 Outline!! Preliminary!! Locality Sensitive Hashing: the two modes!! Hashing!! Embedding!!

More information

Fast Nearest Neighbor Search in the Hamming Space

Fast Nearest Neighbor Search in the Hamming Space Fast Nearest Neighbor Search in the Hamming Space Zhansheng Jiang 1(B), Lingxi Xie 2, Xiaotie Deng 1,WeiweiXu 3, and Jingdong Wang 4 1 Shanghai Jiao Tong University, Shanghai, People s Republic of China

More information