IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY

SSDH: Semi-supervised Deep Hashing for Large Scale Image Retrieval


Jian Zhang and Yuxin Peng

arXiv v2 [cs.CV] 8 Jun 2017

Abstract—Hashing methods have been widely used for efficient similarity retrieval on large scale image databases. Traditional hashing methods learn hash functions to generate binary codes from hand-crafted features, which achieve limited accuracy since the hand-crafted features cannot optimally represent the image content and preserve the semantic similarity. Recently, several deep hashing methods have shown better performance because their deep architectures generate more discriminative feature representations. However, these deep hashing methods are mainly designed for supervised scenarios, which only exploit the semantic similarity information but ignore the underlying data structures. In this paper, we propose the semi-supervised deep hashing (SSDH) approach, which performs more effective hash function learning by simultaneously preserving the semantic similarity and the underlying data structures. The main contributions are as follows: (1) We propose a semi-supervised loss to jointly minimize the empirical error on labeled data as well as the embedding error on both labeled and unlabeled data, which can preserve the semantic similarity and capture the meaningful neighbors on the underlying data structures for effective hashing. (2) A semi-supervised deep hashing network is designed to extensively exploit both labeled and unlabeled data, in which we propose an online graph construction method that benefits from the evolving deep features during training to better capture semantic neighbors. To the best of our knowledge, the proposed deep network is the first deep hashing method that can perform hash code learning and feature learning simultaneously in a semi-supervised fashion. Experimental results on 5 widely-used datasets show that our proposed approach outperforms the state-of-the-art hashing methods.

Index Terms—Semi-supervised deep hashing, online graph construction, underlying data structures, large scale image retrieval.

I. INTRODUCTION

MULTIMEDIA data including images and videos have been growing explosively on the Internet in recent years, and retrieving similar data from a large scale database has become an urgent need. Hashing methods map similar data to similar binary codes with small Hamming distance. Due to their low storage cost and fast retrieval speed, hashing methods have received broad attention in many applications, such as image retrieval [1]–[10] and video retrieval [11]–[13].

This work was supported by the National Natural Science Foundation of China. The authors are with the Institute of Computer Science and Technology, Peking University, Beijing 100871, China. Corresponding author: Yuxin Peng (e-mail: pengyuxin@pku.edu.cn).

Traditional hashing methods [5], [7], [14], [15] take pre-extracted image features as input, and then learn hash functions by exploiting the data structures or applying similarity preserving regularizations. These methods can be divided into two categories: unsupervised and supervised methods. Recently, inspired by the remarkable progress of deep networks, some deep hashing methods have also been proposed [7]–[10], [16], [17].

Unsupervised methods design hash functions that use unlabeled data to generate binary codes. Locality Sensitive Hashing (LSH) [14] is a representative unsupervised method, which uses random linear projections to map data into binary codes. Instead of using randomly generated hash functions, some data-dependent methods [1], [2], [4]–[6], [18] have been proposed to capture data properties, such as data distributions and manifold structures.
For example, Weiss et al. [5] propose Spectral Hashing (SH), which tries to keep hash functions balanced and uncorrelated. Liu et al. [4] propose Anchor Graph Hashing (AGH) to preserve the neighborhood structures by anchor graphs, and Locally Linear Hashing (LLH) [2] uses locality sensitive sparse coding to capture the local linear structures and then recovers these structures in Hamming space. Gong et al. [1] propose Iterative Quantization (ITQ) to generate hash codes by minimizing the quantization error of mapping data to the vertices of a binary hypercube. Topology Preserving Hashing (TPH) [18] performs hash function learning by preserving consistent neighborhood rankings of data points in Hamming space. While unsupervised methods are promising for retrieving neighbors under certain distance metrics (such as the L2 distance), neighbors in the feature space cannot optimally reflect the semantic similarity. Therefore, supervised hashing methods [15], [19]–[29] have been proposed to utilize semantic information such as image labels to generate effective hash codes. For example, Kulis et al. [19] propose Binary Reconstructive Embedding (BRE) to learn hash functions by minimizing the reconstruction error between the original distances and the reconstructed distances in Hamming space. Liu et al. [22] propose Supervised Hashing with Kernels (KSH) to learn hash functions by preserving the pairwise relations between data samples provided by labels. Wang et al. [15] propose Semi-supervised Hashing (SSH) to learn hash functions by minimizing the empirical error over labeled data while maximizing the information entropy of the generated hash codes over both labeled and unlabeled data. Norouzi et al. [23] propose to learn hash functions based on a triplet ranking loss that preserves relative semantic similarity. Ranking Preserving Hashing (RPH) [28] and Order Preserving Hashing (OPH) [25] learn hash functions by preserving ranking information, which is obtained from the number of shared semantic labels between data examples. Shen et al. [29] propose Supervised Discrete Hashing (SDH), which leverages label information to obtain hash codes by integrating hash code generation and classifier training.

For most of the unsupervised and supervised hashing methods, input images are represented by hand-crafted features (e.g., GIST [30]), which cannot optimally represent the semantic information of images. Inspired by the fast progress of deep networks on image classification [31], some deep hashing methods [7]–[10], [16], [17] have been proposed to take advantage of the superior feature representation power of deep networks. Convolutional Neural Network Hashing (CNNH) [8] is a two-stage framework based on convolutional networks, which learns fixed hash codes in the first stage, and learns hash functions and image representations in the second stage. Although the learned hash codes can guide feature learning, the learned image features cannot provide feedback for learning better hash codes. To overcome the shortcomings of the two-stage learning scheme, some approaches have been proposed to perform feature learning and hash code learning simultaneously. Network in Network Hashing (NINH) [7] designs a triplet ranking loss to capture the relative similarities of images. NINH is a one-stage supervised method, so image representation learning and hash code learning can benefit from each other within the deep architecture. Some similar ranking-based deep hashing methods [9], [16] have also been proposed recently, which likewise design hash functions to preserve the ranking information obtained from labels.

However, existing deep hashing methods [7]–[10], [16], [17] are mainly supervised methods. On one hand, they heavily rely on labeled images and require a large amount of labeled data to achieve good performance, but labeling images consumes large amounts of time and human labor, which is not practical in real world applications. On the other hand, they only consider semantic information while ignoring the underlying data structures of unlabeled data. It is thus necessary to make full use of unlabeled images to improve deep hashing performance. In this paper, we propose the semi-supervised deep hashing (SSDH) approach, which learns hash functions by preserving the semantic similarity and the underlying data structures simultaneously. To the best of our knowledge, the proposed SSDH is the first deep hashing method that can perform hash code learning and feature learning simultaneously in a semi-supervised fashion. The main contributions of this paper are as follows:

Semi-supervised loss. Existing deep hashing methods only design loss functions to preserve the semantic similarity but ignore the underlying data structures, which are essential for capturing semantic neighborhoods and returning meaningful nearest neighbors for effective hashing [4]. To address this problem, we propose a semi-supervised loss function that exploits the underlying structures of unlabeled data for more effective hashing. By jointly minimizing the empirical error on labeled data as well as the embedding error on both labeled and unlabeled data, the learned hash functions not only preserve the semantic similarity, but also capture the meaningful neighbors on the underlying data structures. Thus the proposed SSDH method can benefit greatly from both labeled and unlabeled data.

Semi-supervised deep hashing network. A deep network structure is designed to learn hash functions and image representations simultaneously in a semi-supervised fashion.
The proposed deep network consists of three parts: the representation learning layers extract discriminative deep features from the input images, the hash code learning layer maps the image features into binary hash codes, and the classification layer predicts the pseudo labels of unlabeled data. The proposed semi-supervised loss function then models the semantic similarity constraints while preserving the underlying data structures of the image features. The whole network is trained in an end-to-end way, so our proposed method can perform hash code learning and feature learning simultaneously.

Online graph construction. Traditional offline graph construction methods are time consuming due to their O(n^2) complexity, which is intractable for large scale data. To address this issue, we propose an online graph construction strategy. Rather than constructing the neighborhood graph over all the data in advance, our online graph construction strategy constructs the neighborhood graph within a mini-batch during the training procedure. On one hand, online graph construction only needs to consider a much smaller mini-batch of data, which is efficient and suitable for the batch-based training of deep networks. On the other hand, online graph construction can benefit from the evolving feature representations extracted from the deep network.

Experiments on 5 widely-used datasets show that our proposed SSDH approach achieves the best search accuracy compared with 7 state-of-the-art methods. The rest of this paper is organized as follows. Section II presents a brief review of related deep hashing methods. Section III introduces our proposed SSDH approach. Section IV shows the experiments on the 5 widely-used image datasets. Finally, Section V concludes this paper.

II. RELATED WORK

Hashing methods have become one of the most popular approaches for similarity retrieval on large scale image datasets. In recent years, several hashing methods have been proposed to perform hash function learning with deep networks and have made considerable progress. In this section, we briefly review some representative deep hashing methods.

A. Convolutional Neural Network Hashing

Convolutional Neural Network Hashing (CNNH) [8] is a two-stage framework, including a hash code learning stage and a hash function learning stage. Given the training images $I = \{I_1, I_2, \ldots, I_n\}$, in the hash code learning stage (Stage 1), CNNH learns an approximate hash code for each training image by optimizing the following loss function:

$$\min_{H} \left\| S - \frac{1}{q} H H^T \right\|_F^2 \qquad (1)$$

where $\|\cdot\|_F$ denotes the Frobenius norm, and $S$ denotes the semantic similarity of the image pairs in $I$, in which $S_{ij} = 1$ when images $I_i$ and $I_j$ are semantically similar, and otherwise $S_{ij} = -1$. $H \in \{-1, 1\}^{n \times q}$ denotes the approximate hash codes; for the training images, $H$ encodes the approximate hash codes that preserve the pairwise similarities in $S$. Since it is difficult to directly optimize equation (1), CNNH first relaxes the integer constraints on $H$ and randomly initializes $H \in [-1, 1]^{n \times q}$, then optimizes equation (1) with a coordinate descent algorithm, which sequentially chooses one column of $H$ to update while keeping the other columns fixed. This is equivalent to optimizing the following equation:

$$\min_{H_{\cdot j}} \left\| H_{\cdot j} H_{\cdot j}^T - \Big( qS - \sum_{c \neq j} H_{\cdot c} H_{\cdot c}^T \Big) \right\|_F^2 \qquad (2)$$

where $H_{\cdot j}$ and $H_{\cdot c}$ denote the $j$-th and the $c$-th columns of $H$ respectively.

In the hash function learning stage (Stage 2), CNNH utilizes deep networks to learn the image feature representation and the hash functions. Specifically, CNNH adopts the deep framework in [32] as its basic network, and designs an output layer with sigmoid activation to generate q-bit hash codes. CNNH trains the designed deep network in a supervised way, in which the approximate hash codes learned in Stage 1 are used as the ground truth. In addition, if discrete class labels of the training images are available, CNNH also incorporates these image labels into hash function learning. Based on the deep network, CNNH learns deep features and hash functions at the same time. However, CNNH is a two-stage framework, where the deep features learned in Stage 2 cannot help to improve the approximate hash code learning in Stage 1, which limits the performance of hash function learning.
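To make the Stage-1 optimization concrete, the following minimal NumPy sketch mimics equations (1)-(2) on a toy similarity matrix. It is not the authors' released implementation: the column update below is a simple projected gradient step on equation (2) (CNNH's actual coordinate descent updates entries in closed form), and the function name, step size and iteration count are assumed values.

```python
import numpy as np

def cnnh_stage1(S, q, iters=50, step=1e-3, seed=0):
    """Hypothetical sketch of CNNH Stage 1: learn relaxed codes
    H in [-1, 1]^{n x q} so that (1/q) H H^T approximates S."""
    rng = np.random.default_rng(seed)
    n = S.shape[0]
    H = rng.uniform(-1.0, 1.0, size=(n, q))      # relaxed random initialization
    for _ in range(iters):
        for j in range(q):                        # sweep over columns of H
            # residual target of equation (2): qS minus the other columns' part
            R = q * S - H @ H.T + np.outer(H[:, j], H[:, j])
            # gradient of || H_j H_j^T - R ||_F^2 w.r.t. the j-th column
            g = 4.0 * ((np.outer(H[:, j], H[:, j]) - R) @ H[:, j])
            H[:, j] = np.clip(H[:, j] - step * g, -1.0, 1.0)
    return np.where(H >= 0, 1, -1)                # binarize after optimization

# toy usage: a block-diagonal similarity matrix for two classes of 3 images
S = np.kron(np.eye(2), np.ones((3, 3))) * 2 - 1
codes = cnnh_stage1(S, q=8)
```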
B. Network in Network Hashing

Different from the two-stage framework of CNNH [8], Network in Network Hashing (NINH) [7] integrates image representation learning and hash code learning into a one-stage framework. NINH constructs its deep framework for hash function learning based on the Network in Network architecture [33], with a shared sub-network composed of several stacked convolutional layers to extract image features, as well as a divide-and-encode module followed by a sigmoid activation function and a piece-wise threshold function to output binary hash codes. During the learning process, without generating approximate hash codes in advance, NINH designs a triplet ranking loss function that exploits the relative similarity of the training images to directly guide hash function learning:

$$\ell_{triplet}(I, I^+, I^-) = \max\big(0,\ 1 - (\|b(I) - b(I^-)\|_H - \|b(I) - b(I^+)\|_H)\big)$$
$$s.t.\ \ b(I), b(I^+), b(I^-) \in \{-1, 1\}^q \qquad (3)$$

where $I$, $I^+$ and $I^-$ specify the triplet constraint that image $I$ is more similar to image $I^+$ than to image $I^-$ based on the image labels, $b(\cdot)$ denotes the binary hash code, and $\|\cdot\|_H$ denotes the Hamming distance. To simplify the optimization of equation (3), NINH applies two relaxation tricks: relaxing the integer constraint on the binary hash codes, and replacing the Hamming distance with the Euclidean distance.

C. Bit-scalable Deep Hashing

The Bit-scalable Deep Hashing (BS-DRSCH) method [16] is also a one-stage deep hashing method that constructs an end-to-end architecture for hash function learning. In addition, BS-DRSCH is able to flexibly manipulate the hash code length by weighing each bit of the hash codes. BS-DRSCH defines a weighted Hamming affinity to measure the dissimilarity between two hash codes:

$$H(b(I_i), b(I_j)) = \sum_{k=1}^{q} w_k^2\, b_k(I_i)\, b_k(I_j) \qquad (4)$$

where $I_i$ and $I_j$ are training images, $b(\cdot)$ denotes the q-bit binary hash code, $b_k(\cdot)$ denotes the $k$-th bit of $b(\cdot)$, and $w_k$ is the weight attached to $b_k(\cdot)$, which is to be learned and is expected to represent the significance of $b_k(\cdot)$. Similar to NINH [7], BS-DRSCH organizes the training images into triplet tuples and formulates the loss function as follows:

$$loss = \sum_{(I, I^+, I^-)} \max\big(H(b(I), b(I^+)) - H(b(I), b(I^-)),\ C\big) + \frac{1}{2} \sum_{i,j} H(b(I_i), b(I_j))\, S_{ij} \qquad (5)$$

where $S_{ij}$ denotes the similarity between images $I_i$ and $I_j$ based on labels. The defined loss is the sum of two terms: the first is a max-margin term designed to preserve the ranking orders, and the second is a regularization term defined to keep the pairwise adjacency relations. BS-DRSCH builds a deep convolutional neural network to perform hash function learning, in which the convolutional layers extract deep features, while a fully-connected layer followed by a tanh-like layer generates the hash codes. To learn the weights $\{w_k\}, k = 1, \ldots, q$, BS-DRSCH designs an element-wise layer on top of the deep architecture, which is able to weigh each bit of the hash code.

Both NINH [7] and BS-DRSCH [16] are supervised deep hashing methods: they only preserve the semantic similarity based on labeled training data, but ignore the underlying data structures of unlabeled data, which are essential for capturing the semantic neighborhoods for effective hashing.

III. SEMI-SUPERVISED DEEP HASHING

A. Overview of the proposed hashing framework

Given a set of $n$ images $X$ composed of labeled images $(X_L, Y_L) = \{(x_1, y_1), (x_2, y_2), \ldots, (x_l, y_l)\}$ and unlabeled images $X_U = \{x_{l+1}, x_{l+2}, \ldots, x_n\}$, where $Y_L$ is the label information, the goal of hashing methods is to learn a mapping function $B : X \rightarrow \{-1, 1\}^q$, which encodes an image $x \in X$ into a q-bit binary code $B(x)$ in the Hamming space while preserving the semantic similarity of images. In this paper, we propose a novel deep hashing method called semi-supervised deep hashing (SSDH) to learn better hash codes by preserving the semantic similarity and the underlying data structures simultaneously. As shown in Fig. 1, the deep architecture of SSDH includes three main components:

Fig. 1. Overview of the proposed SSDH framework, which consists of three parts: representation learning layers, a hash code learning layer and a classification layer. A semi-supervised loss is proposed to guide the network training.

1) A deep convolutional neural network is designed to learn and extract discriminative deep features for images, which takes both labeled and unlabeled image data as input. 2) A hash code learning layer is designed to map the image features into q-bit hash codes. 3) A semi-supervised loss is designed to preserve the semantic similarity as well as the neighborhood structures for effective hashing, which is formulated as minimizing the empirical error on the labeled data as well as the embedding error on both the labeled and unlabeled data. These three components are coupled into a unified framework, so our proposed SSDH can perform image representation learning and hash code learning simultaneously. In the rest of this section, we first introduce the proposed deep network architecture, and then present the proposed semi-supervised loss function as well as the training of the proposed network.

B. The Proposed Deep Network

We adopt the Fast CNN architecture (denoted VGG-F for short) from [34] as the base network. The first seven layers are configured the same as those in VGG-F [34]; they build the representation learning layers that extract discriminative image features. The eighth layer is the hash code learning layer, which is constructed by a fully-connected layer followed by a sigmoid activation function, and its outputs are the hash codes, defined as:

$$h(x) = sigmoid(W^T f(x) + v) \qquad (6)$$

where $f(x)$ is the deep feature extracted from the last representation learning layer, i.e., the output of the fully-connected layer fc7, $W$ denotes the weights of the hash code learning layer, and $v$ is the bias parameter. Through the hash code learning layer, the image features $f(x)$ are mapped into $[0, 1]^q$. Since the hash codes $h(x) \in [0, 1]^q$ are continuous real values, we apply a thresholding function to obtain binary codes:

$$b_k(x) = sgn(h_k(x) - 0.5), \quad k = 1, 2, \ldots, q \qquad (7)$$

However, binary codes are hard to optimize directly, so we relax the binary codes $b(x)$ to the continuous real valued hash codes $h(x)$ in the rest of this paper. The ninth layer is the classification layer, which predicts the pseudo labels of unlabeled data; we use the cross-entropy loss function to train the classification layer. The outputs of the last representation learning layer (fc7), the hash code learning layer (fc8) and the classification layer (fc9) are connected with the semi-supervised loss function, so that the proposed loss function can make full use of the underlying data structures of the extracted image features and the pseudo labels to improve hash function learning. The whole network is an end-to-end structure, which can perform hash function learning and feature learning simultaneously.
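As an illustration of equations (6) and (7), the following NumPy sketch maps fc7-style features to relaxed codes and then binarizes them; the feature dimension, code length and random parameters are hypothetical, not taken from the trained model.

```python
import numpy as np

def hash_layer(f_x, W, v):
    """Hash code learning layer of equation (6):
    relaxed codes h(x) = sigmoid(W^T f(x) + v) in [0, 1]^q."""
    return 1.0 / (1.0 + np.exp(-(f_x @ W + v)))

def binarize(h):
    """Thresholding of equation (7): b_k(x) = sgn(h_k(x) - 0.5)."""
    return np.where(h >= 0.5, 1, -1)

# toy usage with assumed sizes: 4096-d fc7 features, 48-bit codes
rng = np.random.default_rng(0)
f_x = rng.standard_normal((2, 4096))                   # two images' fc7 features
W = rng.standard_normal((4096, 48)) * 0.01             # hypothetical layer weights
v = np.zeros(48)                                       # hypothetical bias
codes = binarize(hash_layer(f_x, W, v))                # codes in {-1, 1}^{2 x 48}
```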
The detailed configuration of our network structure is shown in Table I.

TABLE I
DETAILED NETWORK CONFIGURATION OF OUR PROPOSED SSDH APPROACH

Layers  Configuration
conv1   filter 64x11x11, st. 4x4, pad 0, LRN, pool 2x2
conv2   filter 256x5x5, st. 1x1, pad 2, LRN, pool 2x2
conv3   filter 256x3x3, st. 1x1, pad 1
conv4   filter 256x3x3, st. 1x1, pad 1
conv5   filter 256x3x3, st. 1x1, pad 1, pool 2x2
fc6     4096, dropout
fc7     4096, dropout
fc8     q-dimensional, sigmoid
fc9     c-dimensional

For the five convolutional layers: the filter parameter specifies the number and the receptive field size of the convolutional kernels as num x size x size; the st. and pad parameters specify the convolution stride and spatial padding respectively; LRN indicates that Local Response Normalization (LRN) [31] is used; and pool indicates that max-pooling downsampling is applied, with the pooling window given as size x size. For the fully-connected layers, we specify their dimensions: the first two are 4096, the hash layer matches the hash code length q, and the classification layer matches the class number c. dropout indicates that the fully-connected layer is regularized by dropout [31]. The hash code learning layer takes sigmoid as its activation function, while the other weighted layers take the Rectified Linear Unit (ReLU) [31] as activation functions. Note that although we adopt VGG-F [34] as our base network, it is practicable to replace it with other deep convolutional network structures, such as the Network in Network (NIN) [33] model or the AlexNet [31] model.

C. Semi-supervised Deep Hashing Learning

The proposed semi-supervised loss function is composed of three terms, namely a supervised ranking term, a semi-supervised embedding term, and a pseudo-label term. The first term preserves the semantic similarity by minimizing the empirical error. The second term captures the underlying data structures by minimizing the embedding error, which measures whether the hash functions preserve the local structure of the image data when embedding images into Hamming space, namely that neighbor points should have similar hash codes while non-neighbor points should have dissimilar hash codes. The third term encourages the decision boundary to lie in low density regions, which is complementary to the semi-supervised embedding term. Thus the proposed SSDH can fully benefit from both labeled and unlabeled data. We introduce these three terms separately in the following parts of this subsection.

1) Supervised Ranking Term: Encouraged by [7], we formulate the supervised ranking term as a triplet ranking regularization to preserve the relative semantic similarity provided by the label annotations. For the labeled images $(X_L, Y_L)$, we randomly sample a set of triplet tuples depending on the labels, $T = \{(x_i^L, x_i^{L+}, x_i^{L-})\}_{i=1}^{t}$, in which $x_i^L$ and $x_i^{L+}$ are two similar images with the same labels, while $x_i^L$ and $x_i^{L-}$ are two dissimilar images with different labels, and $t$ is the number of sampled triplet tuples. For the randomly sampled triplet tuples $(x_i^L, x_i^{L+}, x_i^{L-})$, $i = 1, \ldots, t$, the supervised ranking term is defined as:

$$J_L(x_i^L, x_i^{L+}, x_i^{L-}) = \max\big(0,\ m_t + \|h(x_i^L) - h(x_i^{L+})\|_2^2 - \|h(x_i^L) - h(x_i^{L-})\|_2^2\big) \qquad (8)$$

where $\|\cdot\|_2$ denotes the Euclidean distance, and the constant parameter $m_t$ defines the margin between the relative similarities of the two pairs $(x_i^L, x_i^{L+})$ and $(x_i^L, x_i^{L-})$; that is, we expect the distance of the dissimilar pair $(x_i^L, x_i^{L-})$ to be larger than the distance of the similar pair $(x_i^L, x_i^{L+})$ by at least $m_t$. Minimizing $J_L$ thus preserves the semantic ranking information provided by the labels.
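A minimal sketch of the supervised ranking term of equation (8) on relaxed codes; the margin value m_t = 1.0 is an assumption for illustration, not a value stated here in the paper.

```python
import numpy as np

def ranking_term(h_a, h_p, h_n, m_t=1.0):
    """Supervised ranking term of equation (8): hinge on the squared
    distances of the (anchor, positive) pair vs. the (anchor, negative)
    pair, with an assumed margin m_t."""
    d_pos = np.sum((h_a - h_p) ** 2)   # ||h(x^L) - h(x^{L+})||_2^2
    d_neg = np.sum((h_a - h_n) ** 2)   # ||h(x^L) - h(x^{L-})||_2^2
    return max(0.0, m_t + d_pos - d_neg)
```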
2) Semi-supervised Embedding Term: Although we could train the network solely with the supervised ranking term, deep models with multiple layers often require a large amount of labeled data to achieve good performance. It is usually hard to collect abundant labeled images, since labeling images consumes large amounts of time and human labor. It is thus necessary to improve search accuracy by utilizing unlabeled data, which is much easier to obtain. The manifold assumption [35] states that data points within the same manifold are likely to have the same labels. Based on this assumption, graph-based embedding algorithms are often used to capture the underlying manifold structures of data. We therefore add an embedding term to capture the underlying structures of both labeled and unlabeled data.

Modeling the neighborhood structures usually requires building a neighborhood graph. A neighborhood graph is defined as $G = (V, E)$, where the vertices $V$ represent the data points, and the edges $E$, often represented by an $n \times n$ adjacency matrix $A$, indicate the adjacency relations among the data points in a specified space (usually the feature space). But for large scale data, building a neighborhood graph on all the data is time and memory consuming, since the time complexity and memory cost of building the k-NN graph on all the data are both $O(n^2)$, which is intractable for large $n$. Instead of constructing the graph offline, we propose to construct the graph online within a mini-batch, using the features extracted from the proposed deep network. More specifically, given a mini-batch of $r$ images $X_B = \{x_i\}_{i=1}^{r}$ and the deep features $\{f(x_i) : x_i \in X_B\}$, we first use the deep features to compute the k nearest neighbors $NN_k(i)$ of each image $x_i$, and then construct the adjacency matrix as:

$$A(i, j) = \begin{cases} 1 & : I(i, j) = 0 \ \text{and}\ x_j \in NN_k(i) \\ 0 & : I(i, j) = 0 \ \text{and}\ x_j \notin NN_k(i) \\ -1 & : I(i, j) = 1 \end{cases} \qquad (9)$$

where $I(i, j)$ is an indicator function: $I(i, j) = 1$ if $x_i$ and $x_j$ are both labeled images, otherwise $I(i, j) = 0$. $A(i, j) = -1$ means that we do not build the neighborhood graph between two labeled images, since the supervised ranking term already models their similarity relations. But we do construct the neighborhood graph between labeled and unlabeled images, so that unlabeled neighbors of labeled images are pulled close to them, improving the retrieval accuracy. The embedding term is therefore semi-supervised. Because the size of a mini-batch is much smaller than the size of the whole training set, i.e., $r \ll n$, the graph is efficient to compute online. Furthermore, building the neighborhood graph online benefits from the representative deep features: as training proceeds, the deep network generates more and more discriminative feature representations, which capture the semantic neighbors more accurately.
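The following NumPy sketch illustrates the online construction of the adjacency matrix of equation (9) within one mini-batch; the neighborhood size k = 5 and the function name are assumptions for illustration.

```python
import numpy as np

def online_adjacency(F, labeled_mask, k=5):
    """Online graph construction of equation (9) inside one mini-batch.
    F: (r, d) deep features of the batch; labeled_mask: (r,) booleans.
    A[i, j] = 1 for a k-NN pair, 0 for a non-neighbor pair, and -1 when
    both images are labeled (their relation is left to the ranking term)."""
    r = F.shape[0]
    # pairwise squared Euclidean distances within the mini-batch
    D = np.sum((F[:, None, :] - F[None, :, :]) ** 2, axis=-1)
    np.fill_diagonal(D, np.inf)                     # exclude self-neighbors
    A = np.zeros((r, r), dtype=int)
    for i in range(r):
        A[i, np.argsort(D[i])[:k]] = 1              # mark the k nearest neighbors
    both_labeled = np.outer(labeled_mask, labeled_mask)   # I(i, j) = 1 cases
    A[both_labeled] = -1
    np.fill_diagonal(A, 0)
    return A
```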

Based on the constructed adjacency matrix, we expect neighbor image pairs to have small distances and non-neighbor image pairs to have large distances, and we apply a pairwise contrastive regularization [36] to achieve this goal. For an image pair $(x_i, x_j)$, the semi-supervised embedding term is formulated as:

$$J_U(x_i, x_j) = \begin{cases} \|h(x_i) - h(x_j)\|_2^2, & A(i, j) = 1 \\ \max\big(0,\ m_p - \|h(x_i) - h(x_j)\|_2\big)^2, & A(i, j) = 0 \end{cases} \qquad (10)$$

where $J_U(x_i, x_j)$ is the contrastive loss between the two images $x_i$ and $x_j$ based on the neighborhood graph, $\|\cdot\|_2$ represents the Euclidean distance, and the constant $m_p$ is a margin parameter for the distance metric of non-neighbor image pairs. With the neighborhood graphs constructed on the mini-batches, the semi-supervised embedding term $J_U$ captures the embedding error on both labeled and unlabeled data. We minimize $J_U$ to preserve the underlying neighborhood structures of the image data.

3) Pseudo-label Term: Apart from the proposed graph approach, Lee et al. [37] propose the pseudo-label method for the image classification task, which can also make use of unlabeled data. The pseudo-label method labels each unlabeled sample with the class that has the maximum predicted probability. The maximum a posteriori estimation encourages this approach to place the decision boundary in low density regions, which improves the generalization of the classification model. Different from our proposed graph approach, which encourages neighbor image pairs to have small distances and non-neighbor image pairs to have large distances, the pseudo-label method encourages each image to have a 1-of-C prediction; the two approaches are complementary to each other. Thus we integrate the pseudo-label method into our framework. More specifically, in our network structure the top layers have two branches: one is the hash code learning layer, and the other is a classification layer, which uses the softmax function to compute the label probability of each unlabeled image:

$$p_j(i) = \frac{e^{z_j^i}}{\sum_{l=1}^{c} e^{z_l^i}} \qquad (11)$$

where $z^i$ is the output of the fc9 layer for image $x_i$, and $c$ is the class number. We take the class with the maximum predicted probability as the pseudo label, and again use a contrastive loss function to model the semantic similarity between the pseudo labels. Similar to equation (10), the pseudo-label term is defined as follows:

$$J_P(x_i, x_j) = \begin{cases} \|h(x_i) - h(x_j)\|_2^2, & PL_i = PL_j \\ \max\big(0,\ m_p - \|h(x_i) - h(x_j)\|_2\big)^2, & PL_i \neq PL_j \end{cases} \qquad (12)$$

where $PL_i$ and $PL_j$ are the predicted pseudo labels.

4) The Semi-supervised Loss: Combining the supervised ranking term, the semi-supervised embedding term and the pseudo-label term, we define the semi-supervised loss function as:

$$J = \sum_{i=1}^{t} J_L(x_i^L, x_i^{L+}, x_i^{L-}) + \lambda \sum_{i,j=1}^{n} J_U(x_i, x_j) + \mu \sum_{i,j=1}^{n} J_P(x_i, x_j)$$
$$s.t.\ \ x_i^L, x_i^{L+}, x_i^{L-} \in X_L, \quad x_i, x_j \in X \qquad (13)$$

where $t$ is the number of triplet tuples sampled from the labeled images, which equals the number of labeled images in the mini-batch; that is, we randomly generate one triplet tuple for each labeled image. $\lambda$ and $\mu$ are balance parameters. To keep the neighbor and non-neighbor image pairs balanced, we only use a small subset of the image pairs given by the neighborhood graph to compute the loss. More specifically, for each image $x_i$ we randomly select a neighbor pair $(x_i, x_i^+)$ and a non-neighbor pair $(x_i, x_i^-)$ for the semi-supervised embedding term, and we randomly select a positive pair $(x_i^L, x_{ip}^{L+})$ and a negative pair $(x_i^L, x_{ip}^{L-})$ for the pseudo-label term. We then rewrite the loss function as:

$$J = \sum_{i=1}^{t} J_L(x_i^L, x_i^{L+}, x_i^{L-}) + \lambda \sum_{i=1}^{n} \big(J_U(x_i, x_i^+) + J_U(x_i, x_i^-)\big) + \mu \sum_{i=1}^{n} \big(J_P(x_i^L, x_{ip}^{L+}) + J_P(x_i^L, x_{ip}^{L-})\big) \qquad (14)$$
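Putting the pieces together, the sketch below assembles the rewritten loss of equation (14) from the sampled triplets and pairs, reusing the ranking_term sketch given after equation (8). λ = µ = 0.1 follow the experimental setting reported below, while the margin m_p = 1.0 is an assumed value.

```python
import numpy as np

def contrastive(h_i, h_j, same, m_p=1.0):
    """Contrastive form shared by equations (10) and (12): pull neighbor /
    same-pseudo-label pairs together, push other pairs beyond margin m_p."""
    d = np.linalg.norm(h_i - h_j)
    return d ** 2 if same else max(0.0, m_p - d) ** 2

def ssdh_loss(triplets, embed_pairs, pseudo_pairs, lam=0.1, mu=0.1, m_t=1.0):
    """Semi-supervised loss of equation (14). triplets: list of
    (h_anchor, h_pos, h_neg); embed_pairs / pseudo_pairs: lists of
    (h_i, h_j, same) where same encodes A(i, j) = 1 or PL_i = PL_j."""
    J = sum(ranking_term(a, p, n, m_t) for a, p, n in triplets)
    J += lam * sum(contrastive(h_i, h_j, s) for h_i, h_j, s in embed_pairs)
    J += mu * sum(contrastive(h_i, h_j, s) for h_i, h_j, s in pseudo_pairs)
    return J
```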
D. Network Training

The proposed SSDH network is trained in a mini-batch way; that is, we use only a mini-batch of training images to update the network's parameters in each iteration. For each iteration, the input image batch $X_B \subset X$ is composed of labeled images $(X_L^B, Y_L^B) = \{(x_1, y_1), (x_2, y_2), \ldots, (x_m, y_m)\}$ and unlabeled images $X_U^B = \{x_{m+1}, x_{m+2}, \ldots, x_r\}$, where $m$ denotes the number of labeled images in the mini-batch and $r$ denotes the total number of images in the mini-batch. The proposed network structure has two branches on top, i.e., the hash code learning layer and the classification layer, and we use a joint training scheme to learn them simultaneously. For the classification layer, we use the cross-entropy loss function to model the prediction errors on labeled images; minimizing the cross-entropy loss makes the predicted pseudo labels more accurate. For the hash code learning layer, with equations (8), (10), (12) and (14), we adopt stochastic gradient descent (SGD) to train the deep network; for each triplet tuple and image pair, we use the back-propagation (BP) algorithm to update the parameters of the network. More specifically, according to equation (8), we can compute the subgradients for each triplet tuple $(x_i^L, x_i^{L+}, x_i^{L-})$ with respect to $h(x_i^L)$, $h(x_i^{L+})$ and $h(x_i^{L-})$ respectively as:

$$\frac{\partial J_L}{\partial h(x_i^L)} = 2\big(h(x_i^{L-}) - h(x_i^{L+})\big) \cdot I_c$$
$$\frac{\partial J_L}{\partial h(x_i^{L+})} = -2\big(h(x_i^L) - h(x_i^{L+})\big) \cdot I_c$$
$$\frac{\partial J_L}{\partial h(x_i^{L-})} = 2\big(h(x_i^L) - h(x_i^{L-})\big) \cdot I_c$$
$$I_c = I_{\,m_t + \|h(x_i^L) - h(x_i^{L+})\|_2^2 - \|h(x_i^L) - h(x_i^{L-})\|_2^2 > 0} \qquad (15)$$

where $I_c$ is an indicator function: $I_c = 1$ if the condition $c$ is true, otherwise $I_c = 0$.
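A small sketch of the subgradients of equation (15), convenient for checking the gating indicator I_c; it assumes the squared-distance form of equation (8) and a margin m_t = 1.0.

```python
import numpy as np

def ranking_subgradients(h_a, h_p, h_n, m_t=1.0):
    """Subgradients of the ranking term (equation (15)) w.r.t. the three
    relaxed codes; I_c gates the update to margin-violating triplets."""
    I_c = float(m_t + np.sum((h_a - h_p) ** 2)
                    - np.sum((h_a - h_n) ** 2) > 0)
    g_a = 2.0 * (h_n - h_p) * I_c      # d J_L / d h(x^L)
    g_p = -2.0 * (h_a - h_p) * I_c     # d J_L / d h(x^{L+})
    g_n = 2.0 * (h_a - h_n) * I_c      # d J_L / d h(x^{L-})
    return g_a, g_p, g_n
```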

For the semi-supervised embedding term and the pseudo-label term, according to equations (10) and (12), when $A(i, j) = 1$ or $PL_i = PL_j$, we compute the subgradients with respect to $h(x_i)$ and $h(x_j)$ as:

$$\frac{\partial J_U}{\partial h(x_i)} = 2\big(h(x_i) - h(x_j)\big), \qquad \frac{\partial J_U}{\partial h(x_j)} = 2\big(h(x_j) - h(x_i)\big) \qquad (16)$$

and when $A(i, j) = 0$ or $PL_i \neq PL_j$, we compute the subgradients with respect to $h(x_i)$ and $h(x_j)$ as:

$$\frac{\partial J_U}{\partial h(x_i)} = 2\big(\|h(x_i) - h(x_j)\|_2 - m_p\big) \cdot sgn\big(h(x_i) - h(x_j)\big) \cdot I_c$$
$$\frac{\partial J_U}{\partial h(x_j)} = 2\big(\|h(x_i) - h(x_j)\|_2 - m_p\big) \cdot sgn\big(h(x_j) - h(x_i)\big) \cdot I_c$$
$$I_c = I_{\,m_p - \|h(x_i) - h(x_j)\|_2 > 0} \qquad (17)$$

where $I_c$ is the indicator function described above. These derivatives are fed into the network via the back-propagation algorithm to update the parameters of each layer of the deep network. The training procedure ends when the loss converges or a predefined maximal iteration number is reached.

After the deep model is trained, we can generate q-bit binary codes for any input image. An input image $x$ is first encoded into a hash code $h(x) \in [0, 1]^q$ by the trained model, and the binary code is then computed by equation (7). In this way, the deep network is trained end-to-end in a semi-supervised fashion; newly input images are encoded into binary hash codes by the deep model, and efficient image retrieval can be performed via fast Hamming distance ranking between the binary hash codes.

IV. EXPERIMENTS

In this section, we first introduce the experiments conducted on 4 widely-used datasets (MNIST, CIFAR-10, NUS-WIDE and MIRFLICKR) with 7 state-of-the-art methods, including the unsupervised methods LSH [14], SH [5] and ITQ [1], and the supervised methods SDH [29], CNNH [8], NINH [7] and DRSCH [16]. LSH, SH, ITQ and SDH are traditional hashing methods, which take hand-crafted image features as input to learn hash functions, while CNNH, NINH, DRSCH and our proposed SSDH approach are deep hashing methods, which take raw image pixels as input to conduct hash function learning. To further analyze the proposed SSDH approach, we conduct several baseline experiments to demonstrate the impact of each term in the proposed loss function, as well as the impact of the parameters λ and µ and of online graph construction. We further conduct experiments on a more challenging transfer type testing protocol to comprehensively evaluate the proposed approach. Finally, we evaluate the proposed approach on the large scale ImageNet dataset.

A. Datasets and evaluation metrics

We conduct experiments on 4 widely-used image retrieval datasets. Each dataset is split into a query set, a database and a training set; Table II summarizes the split of each dataset.

TABLE II
SPLIT OF EACH DATASET

          MNIST    CIFAR10  NUS-WIDE  MIRFLICKR
Query     1,000    1,000    2,100     1,000
Database  64,000   54,000   153,447   19,000
Training  5,000    5,000    10,500    5,000

The detailed descriptions and settings of each dataset are as follows: The CIFAR-10 dataset consists of 60,000 color images with a resolution of 32x32 from 10 categories, each of which has 6,000 images. For fair comparison, following [7], [8], 1,000 images are randomly selected as the query set (100 images per class), and the remaining images are used as the retrieval database. For the unsupervised methods, all the remaining images are used as the training set. For the supervised methods, 5,000 images (500 images per class) are further randomly selected to form the training set. The MNIST dataset contains 70,000 gray scale images of handwritten digits from 0 to 9 with a resolution of 28x28. Following [8], 1,000 images are randomly selected as the query set (100 images per class), and the remaining images are used as the retrieval database. For the unsupervised methods, all the remaining images are used as the training set. For the supervised methods, 5,000 images (500 images per class) are randomly selected from the database as the training set.
The NUS-WIDE [38] dataset contains nearly 270,000 images, and each image is associated with one or multiple labels from 81 semantic concepts. Following [4], [7], only the 21 most frequent concepts are used, where each concept has at least 5,000 images, resulting in a total of 166,047 images. 2,100 images are randomly selected as the query set (100 images per concept), and the remaining images are used as the retrieval database. For the unsupervised methods, all the remaining images are used as the training set. For the supervised methods, 500 images from each of the 21 concepts are randomly selected to form the training set. The MIRFLICKR [39] dataset consists of 25,000 images collected from Flickr, and each image is associated with one or multiple labels from 38 semantic concepts. 1,000 images are randomly selected as the query set, and the remaining images are used as the retrieval database. For the unsupervised methods, all the remaining images are used as the training set. For the supervised methods, 5,000 images are randomly selected from the remaining images to form the training set.

To objectively and comprehensively evaluate the retrieval accuracy of the proposed approach and all compared methods, we use four evaluation metrics: Mean Average Precision (MAP), Precision-Recall curves, Precision@topk and Precision@top500.

Fig. 2. The comparison results on the four datasets. The first row shows the Precision-Recall curves with 48-bit hash codes. The second row shows the Precision@topk curves with 48-bit hash codes. The third row shows the Precision@top500 w.r.t. different lengths of hash codes. Compared methods: SSDH(Ours), DRSCH*, NINH*, CNNH*, SDH-VGGF, ITQ-VGGF, SH-VGGF, LSH-VGGF, SDH, ITQ, SH, LSH.

These evaluation metrics are defined as follows: Mean Average Precision (MAP): MAP gives an overall measurement of the retrieval performance. The MAP of a set of queries is the mean of the average precision (AP) over the queries, where AP is defined as:

$$AP = \frac{1}{R} \sum_{k=1}^{n} \frac{R_k}{k} \times rel_k \qquad (18)$$

where $n$ is the size of the database set, $R$ is the number of relevant images in the database set, $R_k$ is the number of relevant images in the top $k$ returns, and $rel_k = 1$ if the image ranked at the $k$-th position is relevant and 0 otherwise. Precision-Recall curves: the precision at each level of recall; we calculate PR curves over all returned results. Precision@topk: the average precision of the top $k$ returned images for each query. Precision@top500: the average precision of the top 500 returned images for each query with different lengths of hash codes.
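For reference, a short NumPy sketch of the AP of equation (18) over one ranked list; MAP is simply the mean of this value over all queries.

```python
import numpy as np

def average_precision(relevance):
    """AP of equation (18): relevance[k-1] = 1 if the k-th returned
    image is relevant to the query, 0 otherwise."""
    relevance = np.asarray(relevance, dtype=float)
    R = relevance.sum()                 # number of relevant images
    if R == 0:
        return 0.0
    ranks = np.arange(1, len(relevance) + 1)
    precision_at_k = np.cumsum(relevance) / ranks    # R_k / k
    return float((precision_at_k * relevance).sum() / R)

# MAP is the mean AP over all queries, e.g.:
# map_score = np.mean([average_precision(r) for r in all_query_relevance])
```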

B. Implementation details

We implement the proposed SSDH method based on the open-source framework Caffe [40]. The parameters of our network are initialized with the VGG-F network [34], which is pre-trained on ImageNet [41]; a similar initialization strategy has been used in other deep hashing methods [9], [10]. In all experiments, our networks are trained with an initial learning rate of 0.001, and we divide the learning rate by 10 every 20,000 steps. The mini-batch size is 128, and the mini-batches are generated by randomly shuffling the training data. The weight decay parameter is fixed across all experiments. For the parameters of the proposed loss function, we set λ = 0.1 and µ = 0.1 in all experiments; we further conduct experiments on these parameters below, showing that the retrieval accuracy is not sensitive to their settings.

For the proposed SSDH method, CNNH, NINH and DRSCH, we use raw image pixels as input. The implementations of CNNH [8] and DRSCH [16] are provided by their authors, while NINH [7] is our own implementation. Since the network structures of these four deep hashing methods differ from each other, for a fair comparison we use the same VGG-F network as the base structure for all four deep methods, and the network parameters of the deep hashing methods are initialized with the same pre-trained VGG-F model, so that a fair comparison among these deep hashing methods can be performed. The results of CNNH [8], NINH [7] and DRSCH [16] are denoted CNNH*, NINH* and DRSCH* respectively. For the compared methods without deep networks, i.e., the traditional hashing methods, we represent each image by hand-crafted features and by deep features respectively. For hand-crafted features, we represent images in CIFAR-10 and MIRFLICKR by 512-dimensional GIST features, images in MNIST by 784-dimensional gray scale values, and images in NUS-WIDE by 500-dimensional bag-of-words features. For a fair comparison between traditional methods and deep hashing methods, we also conduct experiments on the traditional methods with features extracted from deep networks, where we extract the 4096-dimensional deep feature from the activation of the second fully-connected layer (fc7) of the pre-trained VGG-F network. We denote the results of traditional methods using hand-crafted features by LSH, SH, ITQ and SDH, and the results using deep features by LSH-VGGF, SH-VGGF, ITQ-VGGF and SDH-VGGF. The results of SDH, SH and ITQ are obtained from the implementations provided by the authors of the original papers, while the results of LSH are from our own implementation.

C. Experiment Results and Analysis

Table III shows the MAP scores of all retrieved results on the CIFAR10, MNIST, NUS-WIDE and MIRFLICKR datasets with different lengths of hash codes.

TABLE III
MAP SCORES WITH DIFFERENT LENGTHS OF HASH CODES (12, 24, 32 AND 48 BITS) ON THE CIFAR10, MNIST, NUS-WIDE AND MIRFLICKR DATASETS, FOR SSDH(OURS), DRSCH*, NINH*, CNNH*, SDH-VGGF, ITQ-VGGF, SH-VGGF, LSH-VGGF, SDH, ITQ, SH AND LSH

We can observe that: (1) The proposed SSDH approach outperforms the compared state-of-the-art supervised deep hashing methods. For example, on the CIFAR10 dataset the proposed SSDH approach improves the average MAP from 73.2% (DRSCH*), 67.2% (NINH*) and 56.0% (CNNH*) to 81.0%. Similar results can be observed on the MNIST, NUS-WIDE and MIRFLICKR datasets. This is because the proposed SSDH approach designs a semi-supervised loss function, which not only preserves the semantic similarity provided by labels, but also captures the meaningful neighborhood structure, thus improving the search accuracy. (2) SSDH, DRSCH* and NINH* significantly outperform the two-stage method CNNH*, because they learn hash codes and image features simultaneously, which can benefit from each other and thus leads to better performance. (3) The traditional hashing methods using deep features significantly outperform the identical methods using hand-crafted features. For example, on CIFAR10, SDH-VGGF improves the average MAP score of SDH from 32.2% to 49.0%, which indicates that deep features also boost the performance of traditional hashing methods.

The first row of Fig. 2 shows the Precision-Recall curves with 48-bit hash codes on the four datasets. We can clearly see that the proposed SSDH approach outperforms the other state-of-the-art hashing methods, which demonstrates the effectiveness of exploiting the underlying structures of unlabeled data. The second row of Fig. 2 shows the Precision@topk curves with 48-bit hash codes; SSDH steadily achieves the best search accuracy as the number of retrieved results increases, a clear advantage over the state-of-the-art methods. The third row of Fig. 2 shows the Precision@top500 for different code lengths on the four datasets. The proposed SSDH approach outperforms all compared state-of-the-art methods at all hash code lengths; for example, on the CIFAR10 dataset it achieves over 80% search accuracy at all hash code lengths. From all these figures, comparing the traditional methods using deep features with the identical methods using hand-crafted features, we can also observe that deep features boost the performance of traditional methods.

On all metrics on the 4 widely-used datasets, the proposed SSDH approach shows a clear advantage over the state-of-the-art hashing methods, which demonstrates its effectiveness. SSDH outperforms the supervised deep hashing methods NINH*, DRSCH* and CNNH* in all evaluation metrics, which indicates that the proposed SSDH method can preserve the semantic similarity and the underlying data structures simultaneously, leading to better retrieval accuracy.

D. Comparison of Computation Time

Besides the retrieval performance, we also compare the computation time of the proposed SSDH approach with the compared methods. All experiments are carried out on the same PC with an NVIDIA Titan X GPU, an Intel Core 3.50GHz CPU and 64 GB memory. Hashing based image retrieval generally consists of three parts: feature extraction, hash code generation and image search over the database. Note that our proposed SSDH approach and the other deep hashing methods are end-to-end frameworks whose inputs are raw images and whose outputs are hash codes. For a fair comparison between deep hashing methods and traditional methods, we compare against the traditional methods with deep feature inputs. We record the feature extraction time, hash code generation time and image search time of each method; the average computation times are shown in Table IV. We can observe that among the deep hashing methods, DRSCH* is relatively slow, because it uses a weighted Hamming distance to perform image retrieval, which is slower than the original Hamming distance. The computation times of our proposed SSDH approach and the other 6 compared methods are comparable with each other: once the hash functions are learned, hash code generation is only a matrix multiplication, which is very fast, and all of them use the Hamming distance, which can be efficiently implemented by bit-wise XOR operations.

TABLE IV
COMPARISON OF THE AVERAGE COMPUTATION TIME (MILLISECONDS PER IMAGE) ON THE FOUR BENCHMARK DATASETS WITH THE CODE LENGTH FIXED TO 48, FOR SSDH(OURS), DRSCH*, NINH*, CNNH*, SDH-VGGF, ITQ-VGGF, SH-VGGF AND LSH-VGGF ON CIFAR10, MNIST, NUSWIDE AND MIRFLICKR

E. Baseline Experiments and Analysis

1) Evaluation of the Loss Function: To further verify the effectiveness of the proposed semi-supervised loss function, we also report results using the supervised ranking term only (NINH*), and results using the supervised ranking term together with the semi-supervised embedding term (SSDH_base). By comparing SSDH_base with NINH*, we can verify the effectiveness of the semi-supervised embedding term; by comparing SSDH with SSDH_base, we can verify the complementarity of the semi-supervised embedding term and the pseudo-label term. The MAP scores are shown in Table V. SSDH_base achieves better results than NINH* on all 4 datasets at all hash code lengths, which demonstrates the effectiveness of the semi-supervised embedding term. And SSDH outperforms SSDH_base on all 4 datasets at all hash code lengths, which verifies the complementarity of the semi-supervised embedding term and the pseudo-label term. Fig. 4 shows the Precision@topk curves with 48-bit hash codes on the 4 datasets, Fig. 5 shows the Precision-Recall curves of Hamming ranking with 48-bit hash codes on the 4 datasets, and Fig. 6 shows the Precision@top500 w.r.t. different lengths of hash codes on the 4 datasets. From these three figures we can clearly observe that SSDH outperforms SSDH_base and SSDH_base outperforms NINH*, which further demonstrates the effectiveness of each term of our proposed loss function. Fig. 3 shows the top 10 retrieval results on NUS-WIDE and CIFAR10 using Hamming ranking on 48-bit hash codes.

2) Evaluation of Parameters: The proposed SSDH approach has two parameters λ and µ, which control the impact of the semi-supervised embedding term and the pseudo-label term. We set them to different values ranging from 0.1 to 1, and compute the MAP scores on the CIFAR10 dataset with 48-bit code length. The results are shown in Fig. 7; the experimental results are not sensitive to these parameter settings.

3) Evaluation of Online Graph Construction: To further demonstrate the effectiveness of our proposed online graph construction strategy, we conduct a baseline experiment against an offline graph construction strategy. More specifically, we first construct a neighborhood graph offline based on all the data.
The offline graph is constructed from deep features extracted from the fc7 layer of the pre-trained VGG-F network, and we then use the proposed SSDH approach to train a deep hashing network based on this offline neighborhood graph. The only difference between the offline and online approaches lies in equation (14): the offline approach randomly selects neighbor pairs and non-neighbor pairs based on the fixed offline graph, while the online approach uses the evolving graph constructed online. We conduct this baseline experiment on the CIFAR10 dataset; the results are shown in Table VI. We can observe that online graph construction achieves better retrieval results than offline graph construction at all hash code lengths, which shows that online graph construction can benefit from the evolving deep features and capture more meaningful neighbors. We can also observe that offline graph construction still improves the retrieval accuracy compared with the state-of-the-art methods in Table III, which again demonstrates the effectiveness of simultaneously preserving the semantic similarity and the underlying data structures in our proposed SSDH approach.

To further demonstrate the representative power of the neighborhood graph and the pseudo labels, we also conduct experiments to analyze the accuracy of the online constructed graph and the predicted pseudo labels. More specifically, during the training stage we calculate the accuracy of the neighbors in the constructed graph and the accuracy of the predicted pseudo labels of unlabeled data. We also calculate the MAP score of image retrieval using the model at the corresponding iterations, so that the relations between them can be clearly seen. The results are shown in Fig. 8. We can observe that both the accuracy of the neighborhood graph and that of the predicted pseudo labels increase quickly during training and reach over 80% within 2,000 iterations, which indicates that the online graph and the predicted pseudo labels become more and more accurate thanks to the evolving deep features. From Fig. 8 we can also observe that the accuracy of image retrieval increases along with that of the neighborhood graph and the pseudo labels, which indicates that the proposed SSDH captures more and more meaningful semantic neighbors and improves the search accuracy.

Fig. 3. Some retrieval results on NUS-WIDE and CIFAR10 using Hamming ranking on 48-bit hash codes (top 10 returned samples per query for SSDH, SSDH_base and NINH*). The blue rectangles denote the query images; the red rectangles indicate wrong retrieval results. We can observe that SSDH achieves the best results, and SSDH_base achieves better results than NINH*.

TABLE V
EVALUATION OF THE THREE TERMS OF THE PROPOSED LOSS FUNCTION ON THE 4 DATASETS (MNIST, CIFAR10, NUS-WIDE, MIRFLICKR; 12, 24, 32 AND 48 BITS), COMPARING SSDH (ALL THREE TERMS), SSDH_base (FIRST AND SECOND TERMS) AND NINH* (FIRST TERM ONLY)

Fig. 4. Precision@topk curves with 48-bit hash codes on the 4 datasets. (a) MNIST; (b) MIRFLICKR; (c) NUS-WIDE; (d) CIFAR10.

TABLE VI
MAP SCORES OF THE ONLINE AND OFFLINE GRAPH CONSTRUCTION STRATEGIES ON THE CIFAR10 DATASET (12, 24, 32 AND 48 BITS), COMPARING SSDH WITH ONLINE GRAPH AND SSDH WITH OFFLINE GRAPH

F. Experiment Results on the Transfer Type Testing Protocol

Besides the traditional testing protocols used by most hashing works, we also conduct experiments under the transfer type testing protocol proposed in [42], where only 75% of the categories are known during training, and the remaining 25% of the categories are used for evaluation. We follow exactly the same setting, and randomly split each dataset into 4 sets, namely train75, test75, train25 and test25, where train75 and test75 are the images of 75% of the categories, while train25 and test25 are the images of the remaining 25% of the categories. The training set is formed by train75, the query set is formed by test25, and the database set is formed by train25 and test75. Note that test75 is a distraction set in the database, since it contains no ground truth images for the queries. The detailed split of each dataset under the transfer type protocol is shown in Table VII, and the experiment results are shown in Table VIII. For fair comparison, we report the traditional methods using deep features only. From Table VIII we can observe that the overall retrieval accuracy of each method decreases under the transfer type testing protocol, because of the information loss of the unseen categories. Moreover, the retrieval accuracy gap between supervised and unsupervised methods becomes smaller than under the traditional settings; for example, ITQ even achieves better retrieval accuracy than SDH and CNNH on the NUS-WIDE dataset. This is because unsupervised methods do not depend on the semantic information and are thus more robust under the transfer type testing protocol, while supervised methods are likely to overfit to the known categories. Under the transfer type testing protocol, our proposed SSDH approach still achieves the best results: since it preserves the semantic similarity and the underlying data structures of unlabeled data simultaneously, it is more robust to the unseen categories.

Fig. 5. Precision-Recall curves of Hamming ranking with 48-bit hash codes on the 4 datasets. (a) MNIST; (b) MIRFLICKR; (c) NUS-WIDE; (d) CIFAR10.

Fig. 6. Precision@top500 w.r.t. different lengths of hash codes on the 4 datasets. (a) MNIST; (b) MIRFLICKR; (c) NUS-WIDE; (d) CIFAR10.

Fig. 7. MAP scores with respect to parameter variations on the CIFAR10 dataset with 48-bit code length.

Fig. 8. The accuracy of the neighborhood graph, the predicted pseudo labels, and the MAP scores of image retrieval at each iteration on the CIFAR10 dataset with 48-bit code length.

TABLE VII
SPLIT OF EACH DATASET UNDER THE TRANSFER TYPE TESTING PROTOCOL (QUERY, DATABASE AND TRAINING SIZES FOR CIFAR10, MNIST, NUS-WIDE AND MIRFLICKR)

G. Experiment Results on the Large Scale ImageNet Dataset

The ImageNet dataset (ILSVRC 2012) [41] contains 1,000 categories with 1.2 million images. ImageNet is a large scale dataset that can comprehensively evaluate the effectiveness of the proposed approach and the compared state-of-the-art methods. In order to fairly evaluate the retrieval accuracy on the ImageNet dataset, we no longer use the pre-trained VGG-F model to initialize the parameters of the network structure, since the pre-trained model has learned information from

13 IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY 3 TABLE VIII MAP SCORES WITH DIFFERENT LENGTH OF HASH CODES ON CIFAR0, MNIST, NUS-WIDE AND MIRFLICKR DATASET UNDER TRANSFER TYPE TESTING PROTOCOL Methods CIFAR0 MNIST NUS-WIDE MIRFLICKR 2bt 24bt 32bt 48bt 2bt 24bt 32bt 48bt 2bt 24bt 32bt 48bt 2bt 24bt 32bt 48bt SSDH(Ours) DRSCH NINH CNNH SDH-VGGF ITQ-VGGF SH-VGGF LSH-VGGF all the.2 mllon tranng mages. Thus we also use the transfer type testng protocols on the ImageNet dataset. More specfcally, we randomly splt ImageNet dataset nto 4 sets, namely tran75, test75, tran25 and test25, where tran75 and test75 are the mages of 750 categores, whle tran25 and test25 are the mages of the remanng 250 categores. Smlar to the above settng, we use tran75 as tranng set, test25 as query set, and the unon of test75 and tran25 as database set. Thus we construct a tranng set of 499,07 mages, query set of 66,53 mages and database set of 665,583 mages. For a far comparson between deep hashng methods and tradtonal methods, we frst use the tranng set to tran the VGG-F model from scratch, then the traned model s used to ntalze the parameters of network structure for deep hashng methods, and to extract 4096 dmensonal features for tradtonal hashng methods. For the tranng of each hashng methods, snce the number of mages n tranng set s too large to tran the compared hashng methods, we further randomly select 75,000 mages (00 mages per category) to tran hashng methods. For ths large-scale dataset, we only report the precsons wthn top 500 retreved samples wth dfferent hash code length due to hgh computaton cost of MAP evaluaton. The results are shown n table IX. Snce ths dataset s very large and we test on more challengng transfer type testng protocols, the precson values are relatvely low. From table IX we can observe that our proposed SSDH approach stll acheves the best results on ths large scale ImageNet dataset and challengng testng protocol, whch demonstrates the effectveness of the proposed SSDH approach. TABLE IX THE RETRIEVAL PRECISIONS WITHIN TOP 500 RETRIEVED SAMPLES W.R.T. DIFFERENT HASH CODE LENGTH ON IMAGENET DATASET Methods ImageNet 2bt 24bt 32bt 48bt SSDH(ours) DRSCH NINH CNNH SDH-VGGF ITQ-VGGF SH-VGGF LSH-VGGF V. CONCLUSION In ths paper, we have proposed a novel sem-supervsed deep hashng (SSDH) method. Frstly, we desgn a deep network wth an onlne graph constructon strategy to make full use of unlabeled data effcently, whch can perform hash code learnng and mage feature learnng smultaneously n a sem-supervsed fashon. Secondly, we propose a semsupervsed loss functon wth a supervsed rankng term to mnmze the emprcal error on labeled data, as well as a sem-supervsed embeddng term and a pseudo-label term to mnmze the embeddng error on both labeled and unlabeled data, whch can capture the semantc smlarty nformaton and the underlyng structures of data. Expermental results show the effectveness of our method compared wth 7 stateof-the-art methods. For the future work, we wll explore dfferent semsupervsed embeddng algorthms that can make better use of unlabeled data, and more advanced graph constructon strateges can be utlzed. We also plan to extend ths framework to an unsupervsed scenaro, n whch we ntend to use clusterng methods to obtan vrtual labels of mages. REFERENCES [] Y. Gong and S. Lazebnk, Iteratve quantzaton: A procrustean approach to learnng bnary codes, n IEEE Conference on Computer Vson and Pattern Recognton (CVPR), 20, pp [2] G. Ire, Z. L, X.-M. Wu, and S.-F. 
V. CONCLUSION

In this paper, we have proposed a novel semi-supervised deep hashing (SSDH) method. Firstly, we design a deep network with an online graph construction strategy to make full use of unlabeled data efficiently, which can perform hash code learning and image feature learning simultaneously in a semi-supervised fashion. Secondly, we propose a semi-supervised loss function with a supervised ranking term to minimize the empirical error on labeled data, as well as a semi-supervised embedding term and a pseudo-label term to minimize the embedding error on both labeled and unlabeled data, which can capture the semantic similarity information and the underlying structures of the data. Experimental results show the effectiveness of our method compared with 7 state-of-the-art methods.

For future work, we will explore different semi-supervised embedding algorithms that can make better use of unlabeled data, as well as more advanced graph construction strategies. We also plan to extend this framework to an unsupervised scenario, in which we intend to use clustering methods to obtain virtual labels of images.

REFERENCES

[1] Y. Gong and S. Lazebnik, "Iterative quantization: A procrustean approach to learning binary codes," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2011.
[2] G. Irie, Z. Li, X.-M. Wu, and S.-F. Chang, "Locally linear hashing for extracting non-linear manifolds," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014.
[3] M. Kan, D. Xu, S. Shan, and X. Chen, "Semi-supervised hashing via kernel hyperplane learning for scalable image search," IEEE Transactions on Circuits and Systems for Video Technology (TCSVT), vol. 24, no. 4, 2014.
[4] W. Liu, J. Wang, S. Kumar, and S.-F. Chang, "Hashing with graphs," in International Conference on Machine Learning (ICML), 2011.
[5] Y. Weiss, A. Torralba, and R. Fergus, "Spectral hashing," in Advances in Neural Information Processing Systems (NIPS), 2009.
[6] Y. Wang, X. Lin, L. Wu, W. Zhang, and Q. Zhang, "LBMCH: Learning bridging mapping for cross-modal hashing," in ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR), 2015.
[7] H. Lai, Y. Pan, Y. Liu, and S. Yan, "Simultaneous feature learning and hash coding with deep neural networks," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
[8] R. Xia, Y. Pan, H. Lai, C. Liu, and S. Yan, "Supervised hashing for image retrieval via image representation learning," in AAAI Conference on Artificial Intelligence (AAAI), 2014.
[9] F. Zhao, Y. Huang, L. Wang, and T. Tan, "Deep semantic ranking based hashing for multi-label image retrieval," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
[10] H. Zhu, M. Long, J. Wang, and Y. Cao, "Deep hashing network for efficient similarity retrieval," in AAAI Conference on Artificial Intelligence (AAAI), 2016.

[11] S. Zhang, J. Li, M. Jiang, and B. Zhang, "Scalable discrete supervised multimedia hash learning with clustering," IEEE Transactions on Circuits and Systems for Video Technology (TCSVT), 2017.
[12] Z. Chen, J. Lu, J. Feng, and J. Zhou, "Nonlinear structural hashing for scalable video search," IEEE Transactions on Circuits and Systems for Video Technology (TCSVT), 2017.
[13] A. Araujo and B. Girod, "Large-scale video retrieval using image queries," IEEE Transactions on Circuits and Systems for Video Technology (TCSVT), 2017.
[14] A. Gionis, P. Indyk, R. Motwani et al., "Similarity search in high dimensions via hashing," in International Conference on Very Large Data Bases (VLDB), 1999.
[15] J. Wang, S. Kumar, and S.-F. Chang, "Sequential projection learning for hashing with compact codes," in International Conference on Machine Learning (ICML), 2010.
[16] R. Zhang, L. Lin, R. Zhang, W. Zuo, and L. Zhang, "Bit-scalable deep hashing with regularized similarity learning for image retrieval and person re-identification," IEEE Transactions on Image Processing (TIP), vol. 24, no. 12, 2015.
[17] H. Liu, R. Wang, S. Shan, and X. Chen, "Deep supervised hashing for fast image retrieval," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
[18] L. Zhang, Y. Zhang, J. Tang, X. Gu, J. Li, and Q. Tian, "Topology preserving hashing for similarity search," in ACM International Conference on Multimedia (ACM-MM), 2013.
[19] B. Kulis and T. Darrell, "Learning to hash with binary reconstructive embeddings," in Advances in Neural Information Processing Systems (NIPS), 2009.
[20] B. Kulis, P. Jain, and K. Grauman, "Fast similarity search for learned metrics," IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), vol. 31, no. 12, 2009.
[21] M. Norouzi and D. M. Blei, "Minimal loss hashing for compact binary codes," in International Conference on Machine Learning (ICML), 2011.
[22] W. Liu, J. Wang, R. Ji, Y.-G. Jiang, and S.-F. Chang, "Supervised hashing with kernels," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2012.
[23] M. Norouzi, D. J. Fleet, and R. R. Salakhutdinov, "Hamming distance metric learning," in Advances in Neural Information Processing Systems (NIPS), 2012.
[24] J. Wang, W. Liu, A. X. Sun, and Y.-G. Jiang, "Learning hash codes with listwise supervision," in IEEE International Conference on Computer Vision (ICCV), 2013.
[25] J. Wang, J. Wang, N. Yu, and S. Li, "Order preserving hashing for approximate nearest neighbor search," in ACM International Conference on Multimedia (ACM-MM), 2013.
[26] X. Li, G. Lin, C. Shen, A. van den Hengel, and A. R. Dick, "Learning hash functions using column generation," in International Conference on Machine Learning (ICML), 2013.
[27] G. Lin, C. Shen, Q. Shi, A. van den Hengel, and D. Suter, "Fast supervised hashing with decision trees for high-dimensional data," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014.
[28] Q. Wang, Z. Zhang, and L. Si, "Ranking preserving hashing for fast similarity search," in International Joint Conference on Artificial Intelligence (IJCAI), 2015.
[29] F. Shen, C. Shen, W. Liu, and H. Tao Shen, "Supervised discrete hashing," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
[30] A. Oliva and A. Torralba, "Modeling the shape of the scene: A holistic representation of the spatial envelope," International Journal of Computer Vision (IJCV), vol. 42, no. 3, 2001.
[31] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in Advances in Neural Information Processing Systems (NIPS), 2012.
[32] G. E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, and R. R. Salakhutdinov, "Improving neural networks by preventing co-adaptation of feature detectors," arXiv preprint arXiv:1207.0580, 2012.
[33] M. Lin, Q. Chen, and S. Yan, "Network in network," arXiv preprint arXiv:1312.4400, 2013.
[34] K. Chatfield, K. Simonyan, A. Vedaldi, and A. Zisserman, "Return of the devil in the details: Delving deep into convolutional nets," arXiv preprint arXiv:1405.3531, 2014.
[35] O. Chapelle, B. Schölkopf, A. Zien et al., Semi-Supervised Learning. Cambridge, MA: MIT Press, 2006.
[36] R. Hadsell, S. Chopra, and Y. LeCun, "Dimensionality reduction by learning an invariant mapping," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2006.
[37] D.-H. Lee, "Pseudo-label: The simple and efficient semi-supervised learning method for deep neural networks," in Workshop on Challenges in Representation Learning, ICML, 2013.
[38] T.-S. Chua, J. Tang, R. Hong, H. Li, Z. Luo, and Y. Zheng, "NUS-WIDE: A real-world web image database from National University of Singapore," in ACM International Conference on Image and Video Retrieval (CIVR), 2009.
[39] M. J. Huiskes and M. S. Lew, "The MIR Flickr retrieval evaluation," in ACM International Conference on Multimedia Information Retrieval (MIR), 2008.
[40] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell, "Caffe: Convolutional architecture for fast feature embedding," in ACM International Conference on Multimedia (ACM-MM), 2014.
[41] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein et al., "ImageNet large scale visual recognition challenge," International Journal of Computer Vision (IJCV), vol. 115, no. 3, 2015.
[42] A. Sablayrolles, M. Douze, H. Jégou, and N. Usunier, "How should we evaluate supervised hashing?" arXiv preprint arXiv:1609.06753, 2016.

Jian Zhang received the B.S. degree in computer science and technology from Peking University. He is currently pursuing the Ph.D. degree at the Institute of Computer Science and Technology (ICST), Peking University. His research interests include multimedia retrieval and machine learning.

Yuxin Peng is the professor and director of the Multimedia Information Processing Lab (MIPL) in the Institute of Computer Science and Technology (ICST), Peking University, and the chief scientist of the 863 Program (National Hi-Tech Research and Development Program of China). He received the Ph.D. degree in computer application technology from the School of Electronics Engineering and Computer Science (EECS), Peking University, in July 2003. After that, he worked as an assistant professor in ICST, Peking University. From Aug. 2003 to Nov. 2004, he was a visiting scholar with the Department of Computer Science, City University of Hong Kong. He was promoted to associate professor and professor at Peking University in Aug. 2005 and Aug. 2010, respectively. In 2006, he was supported by the Program for New Star in Science and Technology of Beijing and the Program for New Century Excellent Talents in University (NCET). He has published over 100 papers in refereed international journals and conference proceedings, including IJCV, TIP, TCSVT, TMM, PR, ACM MM, ICCV, CVPR, IJCAI, AAAI, etc. He has led his team to participate in TRECVID (TREC Video Retrieval Evaluation) many times. In TRECVID 2009, his team won four first places on 4 sub-tasks in the High-Level Feature Extraction (HLFE) task and the Search task. In TRECVID 2012, his team gained four first places on all 4 sub-tasks in the Instance Search (INS) task and the Known-Item Search (KIS) task.
In TRECVID 2014, his team gained first place in the Interactive Instance Search task. His team also gained both first places in the INS task (both the Automatic Instance Search and the Interactive Instance Search) in TRECVID 2015 and TRECVID 2016. Besides, he won the first prize of the Beijing Science and Technology Award for Technological Invention in 2016 (ranking first). He has applied for 34 patents and obtained 5 of them. His current research interests mainly include cross-media analysis and reasoning, image and video understanding and retrieval, and computer vision.
