Linear Cross-Modal Hashing for Efficient Multimedia Search


Xiaofeng Zhu, Zi Huang, Heng Tao Shen, Xin Zhao
College of CSIT, Guangxi Normal University, Guangxi, 544, P.R. China
School of ITEE, The University of Queensland, QLD 4072, Australia
{zhux,huang,shenht}@itee.uq.edu.au, x.zhao@uq.edu.au

ABSTRACT

Most existing cross-modal hashing methods suffer from a scalability issue in the training phase. In this paper, we propose a novel cross-modal hashing approach whose training time is linear in the training data size, to enable scalable indexing for multimedia search across multiple modals. Taking both the intra-similarity within each modal and the inter-similarity across different modals into consideration, the proposed approach aims to effectively learn hash functions from large-scale training datasets. More specifically, for each modal we first partition the training data into k clusters and then represent each training data point by its distances to the k cluster centroids. Interestingly, such a k-dimensional data representation reduces the time complexity of the training phase from the traditional O(n^2) or higher to O(n), where n is the training data size, leading to practical learning on large-scale datasets. We further prove that this new representation preserves the intra-similarity within each modal. To preserve the inter-similarity among data points across different modals, we transform the derived data representations into a common binary subspace in which binary codes from all the modals are consistent and comparable. The transformation simultaneously outputs the hash functions for all modals, which are used to convert unseen data into binary codes. Given a query of one modal, it is first mapped into binary codes using that modal's hash functions, followed by matching against the database binary codes of any other modal. Experimental results on two benchmark datasets confirm the scalability and the effectiveness of the proposed approach in comparison with the state of the art.

Categories and Subject Descriptors

H.3.1
[Content Analysis and Indexing]: Indexing Methods; H.3.3 [Information Search and Retrieval]: Search Process

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org. MM'13, October 21-25, 2013, Barcelona, Spain. Copyright 2013 ACM /13/ $15.00.

Keywords

Cross-modal, hashing, index, multimedia search.

1. INTRODUCTION

Hashing is increasingly popular to support approximate nearest neighbor (ANN) search over multimedia data. The idea of hashing for ANN search is to learn hash functions that convert high-dimensional data into short binary codes while preserving the neighborhood relationships of the original data as much as possible [3, 5, 2, 3]. It has been shown that hash function learning (HFL) is the key process for effective hashing [3, 2]. Existing hashing methods on single-modal data (referred to as uni-modal hashing methods in this paper) can be categorized into LSH-like hashing (e.g., locality sensitive hashing (LSH) [7, 8], KLSH [5], and SKLSH [2]), which randomly selects linear functions as hash functions; PCA-like hashing (e.g., SH [33], PCAH [3], and ITQ []), which uses the principal components of the training data to learn hash functions; and manifold-like hashing (e.g., MFH [26] and [34]), which employs manifold learning techniques to learn hash functions. More recently, some hashing methods have been proposed to index data represented by multiple modals (referred to as multi-modal hashing in this paper) [26, 36], which can be used to facilitate retrieval for data described by multiple modals in many real-life applications, such as near-duplicate image retrieval.
Considering an image database where each image is described by multiple modals, such as SIFT descriptors, color histograms, bags of words, etc., multi-modal hashing learns hash functions from all the modals to support effective image retrieval, where the similarities from all the modals are considered in ranking the final results with respect to a multi-modal query. Cross-modal hashing also constructs hash functions from all the modals by analyzing their correlations. However, it serves a different purpose, i.e., supporting cross-modal retrieval, where a query of one modal can search for relevant results of another modal [2, 6, 22, 37, 38]. For example, given a query described by a SIFT descriptor, relevant results described by other modals such as color histograms and bags of words can also be found and returned².

¹ Modal, feature and view are often used with subtle differences in multimedia research. In this paper, we consistently use the term modal.
² In this sense, cross-modal retrieval is defined more generally than traditional cross-media retrieval [35], where queries and results can be of different media types, such as text documents, images, video, and audio.

Area Chair: Roelof van Zwol

Figure 1: Flowchart of the proposed linear cross-modal hashing (LCMH).

While few attempts have been made towards effective cross-modal hashing, most existing cross-modal hashing methods [6, 22, 27, 37, 38] suffer from high time complexity in the training phase (i.e., O(n^2) or higher, where n is the training data size) and thus fail to learn from large-scale training datasets in a practical amount of time. Such a high complexity keeps these methods from applications dealing with large-scale datasets. For example, multi-modal latent binary embedding (MLBE) [38] is a generative model, such that only a small training dataset (e.g., 3 out of 8, data points) can be used in the training phase. Although cross-modal similarity sensitive hashing (CMSSH) [2] is able to learn from large-scale training datasets, it requires prior knowledge (i.e., positive pairs and negative pairs among training data points) to be predefined and known, which is not practical in most real-life applications. To enable cross-modal retrieval, inter-media hashing (IMH) [27] explores the correlations among multiple modals from different data sources and achieves better hashing performance, but the training process of IMH, with time complexity O(n^3), is too expensive for large-scale cross-modal hashing.

In this paper, we propose a novel hashing method, named linear cross-modal hashing (LCMH), to address the scalability issue without using any prior knowledge. LCMH achieves a training time linear in the training data size, enabling effective learning from large-scale datasets. The key idea is to first partition the training data of each modal into k clusters by applying a linear-time clustering method, and then represent each training data point by its distances to the k cluster centroids. That is, we approximate each data point with a k-dimensional representation. Interestingly, such a representation leads to a time complexity of O(kn) for the training phase. Given a really large-scale training dataset, it is expected that k << n. Since k is a constant, the overall time complexity of the training phase becomes linear in the training data size, i.e., O(n).

To achieve high-quality hash functions, LCMH also preserves both the intra-similarity among data points within each modal and the inter-similarity among data points across different modals. The learnt hash functions ensure that all the data points described by different modals in the common binary subspace are consistent (i.e., relevant data of different modals should have similar binary codes) and comparable (i.e., binary codes of different modals can be directly compared).

Fig. 1 illustrates the whole flowchart of the proposed LCMH. The training phase of LCMH is an offline process and includes five key steps. In the first step, for each modal we partition its data into k clusters. In the second step, we represent each training data point by its distances to the k cluster centroids. In the third step, hash functions are learnt efficiently, with a linear time complexity, and effectively, with both the intra- and inter-similarity preserved. In the fourth step, all the data points in the database are approximated with k-dimensional representations, which are then mapped into binary codes with the learnt hash functions in the fifth step. In the online search process, a query of one modal is first approximated with its k-dimensional representation in this modal, which is then mapped into the query binary codes with the hash functions for this modal, followed by matching the database binary codes to find relevant results of any other modal. Extensive experimental results on two benchmark datasets confirm the scalability and the effectiveness of the proposed approach in comparison with the state of the art.

The rest of the paper is organized as follows. Related work is reviewed in Section 2. The proposed LCMH and its analysis are presented in Section 3. Section 4 reports the results and the paper is concluded in Section 5.
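The five offline steps and the online lookup can be sketched end to end as follows. This is a toy illustration under our own assumptions, not the paper's implementation: random sampling stands in for the scalable k-means clustering, the sizes are arbitrary, the eigenvalue step follows the relaxed objective of Section 3.3, and all names are ours.

```python
import numpy as np

rng = np.random.default_rng(0)

# toy bimodal training set: 50 items with an 8-dim "visual" and a
# 6-dim "textual" description (sizes are illustrative only)
X1, X2 = rng.random((50, 8)), rng.random((50, 6))
k, s, c = 4, 2, 3        # centroids per modal, kept neighbors, code length

def represent(X, centroids, s):
    """Steps 1-2: distances to the k centroids, softmax-style weights,
    then keep only each point's s nearest centroids and renormalize."""
    z = np.linalg.norm(X[:, None] - centroids[None], axis=2)
    p = np.exp(-z)
    p /= p.sum(1, keepdims=True)
    p[np.arange(len(p))[:, None], np.argsort(p, 1)[:, :-s]] = 0.0
    return p / p.sum(1, keepdims=True)

# step 1: "cluster" each modal (sampling stands in for scalable k-means)
C1 = X1[rng.choice(50, k, replace=False)]
C2 = X2[rng.choice(50, k, replace=False)]
Z1, Z2 = represent(X1, C1, s), represent(X2, C2, s)

# step 3: hash functions from an eigenvalue problem that minimizes
# ||Z1 W1 - Z2 W2||_F^2 under orthogonality of W
Zb = np.block([[Z1.T @ Z1, -Z1.T @ Z2], [-Z2.T @ Z1, Z2.T @ Z2]])
W = np.linalg.eigh(Zb)[1][:, :c]           # c smallest eigenpairs
W1, W2 = W[:k], W[k:]

# steps 4-5: database (textual-modal) codes via per-bit median thresholds
u1 = np.median(Z1 @ W1, axis=0)
B2 = np.where(Z2 @ W2 >= np.median(Z2 @ W2, axis=0), 1, -1)

# online phase: encode a visual query, rank the textual database codes
q = np.where(represent(X1[:1], C1, s) @ W1 >= u1, 1, -1)
hamming = (q != B2).sum(axis=1)            # smaller = more similar
```

Note that the bits are ±1 here to match the balance constraint used later in Eq. (3); with 0/1 bits the Hamming distance would be an XOR-popcount instead.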

2. RELATED WORK

In this section we review existing hashing methods in three major categories: uni-modal hashing, multi-modal hashing, and cross-modal hashing.

In uni-modal hashing, early work such as the LSH-like hashing methods [7, 8, 5, 2] constructs hash functions based on random projections and is typically unsupervised. Although they have some asymptotic theoretical properties, LSH-like hashing methods often require long binary codes and multiple hash tables to achieve reasonable retrieval accuracy [2]. This leads to long query time and high storage cost. Recently, machine learning techniques have been applied to improve hashing performance. For example, PCA-like hashing [, 3, 33] learns hash functions by preserving the maximal covariance of the original data and has been shown to outperform LSH-like hashing in [4, 7, 29]. Manifold-like hashing [8, 26] employs manifold learning techniques to learn hash functions. Besides, some hashing methods conduct hash function learning by making the best use of prior knowledge about the data. For example, supervised hashing methods [4, 7, 9, 24, 28] improve hashing performance using pre-provided pairs of data, under the assumption that similar or dissimilar pairs exist in the datasets. There are also semi-supervised hashing methods [3, 34] in which a supervised term is used to minimize the empirical error on the labeled data while an unsupervised term is used to maximize desirable properties, such as variance and independence of the individual bits in the binary codes.

Multi-modal hashing is designed to conduct hash function learning for encoding multi-modal data. To this end, the method in [36] first uses an iterative method to preserve the semantic similarities among training examples, and then keeps the consistency between the hash codes and the corresponding hash functions designed for the multiple modals. Multiple feature hashing (MFH) [26] preserves the local structure information of each modal and also globally considers the alignments of all the modals to learn a group of hash functions for real-time large-scale near-duplicate web video retrieval.
Cross-modal hashing also encodes multi-modal data. However, it focuses more on discovering the correlations among different modals to enable cross-modal retrieval. Cross-modal similarity sensitive hashing (CMSSH) [2] is the first cross-modal hashing method for cross-modal retrieval. However, CMSSH only considers the inter-similarity and ignores the intra-similarity. Cross-view hashing (CVH) [6] extends spectral hashing [33] to the multi-modal case, aiming at minimizing the Hamming distances for similar points and maximizing those for dissimilar points. However, it needs to construct the similarity matrix for all the data points, which leads to a time complexity quadratic in the training data size. Rasiwasia et al. [22] employ canonical correlation analysis (CCA) to conduct hash function learning, which is a special case of CVH. Recently, multi-modal latent binary embedding (MLBE) [38] uses a probabilistic latent factor model to learn hash functions. Similar to CVH, it also has a quadratic time complexity for constructing the similarity matrix. Moreover, it uses a sampling method to address the issue of out-of-sample extension. Co-regularized hashing (CRH) [37] is a boosted co-regularization framework which learns a group of hash functions for each bit of the binary codes in every modal. However, its objective function is non-convex. Inter-media hashing (IMH) [27] aims to discover a common Hamming space for learning hash functions. IMH preserves the intra-similarity of each individual modal by enforcing that data with similar semantics should have similar hash codes, and preserves the inter-similarity among different modals by preserving the local structural information embedded in each modal.

3. LINEAR CROSS-MODAL HASHING

In this section we describe the details of the proposed method. To explain our basic idea, we first focus on hash function learning for bimodal data from Section 3.1 to Section 3.5, and then extend it to the general setting of multi-modal data in Section 3.6. In this paper, we use boldface uppercase letters, boldface lowercase letters and plain letters to denote matrices, vectors and scalars, respectively.
Besides, the transpose of X is denoted as X^T, the inverse of X as X^{-1}, and the trace of a matrix X as tr(X).

3.1 Problem formulation

Assume we have two modals, X^(i) = {x_1^(i), ..., x_n^(i)}, i = 1, 2, describing the same data points, where n is the number of data points. For example, X^(1) is the SIFT visual feature extracted from the content of images, and X^(2) is the bag-of-words feature extracted from the text surrounding the images. In general, the feature dimensionalities of different modals are different. Under the same assumption as in [4, ] that there is an invariant common space among the multiple modals, the objective of LCMH is to effectively and efficiently learn hash functions for different modals to support cross-modal retrieval. To this end, LCMH needs to generate the hash functions

f^(i): x^(i) -> b^(i) ∈ {-1, 1}^c, i = 1, 2,

where c is the code length. Note that all the modals have the same code length. Moreover, LCMH needs to ensure that the neighborhood relationships within each individual modal and across different modals are preserved in the produced common Hamming space. To do this, LCMH is devised to preserve both the intra-similarity and the inter-similarity of the original feature spaces in the Hamming space.

The main idea of learning the hash functions goes as follows. Data of each individual modal are first converted into new representations, denoted as Z^(i), that preserve the intra-similarity (see Section 3.2). Data of all modals, represented by Z, are then mapped into a common space where the inter-similarity is preserved, generating the hash functions (see Section 3.3). Finally, the values generated by the hash functions are binarized into the Hamming space (see Section 3.4). With the learnt hash functions, queries and database data can be mapped into the Hamming space to facilitate fast search by efficient binary code matching.

3.2 Intra-similarity preservation

Intra-similarity preservation is designed to maintain the neighborhood relationships among training data points in each individual modal after they are mapped into the new space spanned by their new representations.
To achieve this, manifold-like hashing [26, 27, 36, 39] constructs a similarity matrix in which each entry represents the distance between two data points. In such a matrix, each data point can be regarded as an n-dimensional representation indicating its distances to the n data points. Typically, the neighborhood of a data point is described by its few nearest neighbors. To

preserve the neighborhood of each data point, only the few dimensions corresponding to its nearest neighbors in the n-dimensional representation are non-zero. In other words, the n-dimensional representation is highly sparse. However, building such a sparse matrix needs quadratic time complexity, i.e., O(n^2), which is impractical for large-scale datasets.

As observed from the sparse n-dimensional representation, only a few data points are used to describe the neighborhood of a data point. This motivates us to derive a smaller k-dimensional representation (i.e., k << n) to approximate each training data point, aiming at reducing the time complexity of building the neighborhood structures. The idea is to select the k most representative data points from the training dataset and approximate each training data point by its distances to these k representative data points. To do this, in this paper we use a scalable k-means clustering method [5] to generate k centroids, which are taken as the k most representative data points in the training dataset. It has been shown that k centroids have a strong representation power to adequately cover large-scale datasets [5].

More specifically, given the training dataset of the first modal X^(1), instead of mapping each training data point x^(1) into the n-dimensional representation, which leads to quadratic time complexity, we convert it into the k-dimensional representation z^(1) using the obtained k centroids, denoted by m_j^(1), j = 1, 2, ..., k. For z^(1), its j-th dimension carries the distance from x^(1) to the j-th centroid m_j^(1), denoted as z_j^(1). To obtain the value of z_j^(1), we first calculate the Euclidean distance between x^(1) and m_j^(1), i.e.,

z_j^(1) = ||x^(1) - m_j^(1)||_2,   (1)

where ||.||_2 stands for the Euclidean norm. As in [9], the value of z_j^(1) can be further redefined as a function of the Euclidean distance to better fit the Gaussian distribution in real applications. Denoting the redefined value of z_j^(1) as p_j^(1), we have:

p_j^(1) = exp(-z_j^(1)/σ) / Σ_{l=1}^{k} exp(-z_l^(1)/σ),   (2)

where σ is a tuning parameter controlling the decay rate of z_j^(1).
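Eqs. (1)-(2), together with the s-nearest-centroid sparsification and renormalization described in this subsection, can be sketched as follows. This is a minimal illustration rather than the paper's implementation; the centroids are assumed to come from a clustering step such as the scalable k-means the paper cites, and the function name is ours.

```python
import numpy as np

def k_dim_representation(X, centroids, s=3, sigma=1.0):
    """Approximate each point by its (sparsified) distances to k centroids.

    X: (n, d) data of one modality; centroids: (k, d); s: number of
    nearest centroids kept per point (s << k); sigma: decay parameter.
    Implements Eqs. (1)-(2): z_j = ||x - m_j||_2, then the softmax-style
    transform p_j = exp(-z_j / sigma) / sum_l exp(-z_l / sigma).
    """
    # Eq. (1): Euclidean distances to the k centroids
    z = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    # Eq. (2): Gaussian-kernel-style reweighting
    p = np.exp(-z / sigma)
    p /= p.sum(axis=1, keepdims=True)
    # keep only the s nearest centroids (largest p), zero out the rest
    idx = np.argsort(p, axis=1)[:, :-s]      # indices of the k-s smallest
    np.put_along_axis(p, idx, 0.0, axis=1)
    # renormalize the surviving entries to obtain the final representation
    return p / p.sum(axis=1, keepdims=True)
```

Each row of the result is a sparse distribution over centroids, so the n x n neighborhood matrix of manifold-like hashing is replaced by an n x k matrix, which is what yields the O(kn) cost discussed below.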
For simplicity, we set σ = 1 in this paper, while an adaptive setting of σ can lead to better results. Let p^(1) = [p_1^(1); ...; p_j^(1); ...; p_k^(1)]; p^(1) forms the new representation of x^(1). It can be seen that the rationale of defining p^(1) is similar to that of kernel density estimation with a Gaussian kernel, i.e., if x^(1) is near to the j-th centroid, p_j^(1) will be relatively high; otherwise, p_j^(1) will decay.

To preserve the neighborhood of each training data point in the new k-dimensional space, we also represent each training data point using only several (say s, with s << k) nearest centroids, so that the new representation p^(1) of x^(1) is sparse. Therefore, in the implementation, for each training data point we only keep the values for its s nearest centroids in p^(1) and set the rest to 0. After this, we normalize the derived values to generate the final value of z_j^(1). From the perspective of geometric reconstruction in the literature [23, 25, 32], it can easily be shown that the intra-similarity is well preserved in the derived k-dimensional representation, i.e., it is invariant to rotations, rescalings, and translations.

According to Eqs. (1)-(2), we convert the training data X^(i) into their k-dimensional representations Z^(i), i = 1, 2. That is, we can use a k x n matrix to approximate the original n x n similarity matrix with the intra-similarity preserved. The advantage is a reduction of the complexity from O(n^2) to O(kn). Note that one can select a different number of centroids for each modal. For simplicity, in this paper we select the same number of centroids in our experiments. The next problem is to preserve the inter-similarity between Z^(1) and Z^(2) by seeking a common latent space between them.

3.3 Inter-similarity preservation

It is well known that multimedia data with the same semantics can exist in different types of modals. For example, a text document and an image can describe exactly the same topic. Research has shown that if data described in different modal spaces are related to the same event or topic, they are expected to share some common latent space [6, 38].
This suggests that multi-modal data with the same semantics should share some common space in which relevant data are close to each other. Such a property is understood as inter-similarity preservation when multi-modal data are mapped into the common space. In our problem setting, multi-modal data are eventually represented by binary codes in the common Hamming space. To this end, we first learn a "semantic bridge" for each modal Z^(i) in its k-dimensional space to map Z^(i) into the common Hamming space. To ensure inter-similarity preservation, in the Hamming space, data describing the same object in different modals should have the same or similar binary codes. For example, in Fig. 2, we map both an image's visual modal and its textual modal via the learnt semantic bridges (i.e., the arrows in Fig. 2) into the Hamming space (i.e., the circle in Fig. 2), in which the two modals of an image are represented by the same or similar binary codes. That is, consistency across different modals is achieved.

Figure 2: An illustration of inter-similarity preservation.

More formally, given Z^(1) ∈ R^{n x k} and Z^(2) ∈ R^{n x k}, where n is the sample size and k is the number of centroids, we learn the transformation matrices (i.e., "semantic bridges") W^(1) ∈ R^{k x c} and W^(2) ∈ R^{k x c} for converting Z^(1) and Z^(2) into the new representations B^(1) ∈ {-1, 1}^{n x c} and B^(2) ∈ {-1, 1}^{n x c}

in a common Hamming space, in which each sample pair (describing the same object, i.e., B_i^(1) and B_i^(2) describing the i-th object in different modals) has the minimal Hamming distance, i.e., the maximal consistency. This leads to the following objective function:

min_{B^(1), B^(2)} ||B^(1) - B^(2)||_F^2
s.t. B^(i)T e = 0, b^(i) ∈ {-1, 1}, B^(i)T B^(i) = I_c, i = 1, 2,   (3)

where ||.||_F denotes the Frobenius norm, e is an n x 1 vector whose every entry is 1, and I_c is a c x c identity matrix. The constraint B^(i)T e = 0 requires each bit to have an equal chance to be 1 or -1, the constraint B^(i)T B^(i) = I_c requires the bits to be obtained independently, and the loss term ||B^(1) - B^(2)||_F^2 enforces the minimal difference (i.e., the maximal consistency) between the two representations of an object.

The optimization problem in Eq. (3) is equivalent to balanced graph partitioning and is NP-hard. Following the literature [6, 33], we first denote by Y^(i) the real-valued relaxation of B^(i) and solve the derived objective function on Y^(i) in this subsection, and then binarize Y^(i) into binary codes with the median threshold method in Section 3.4. To map Z^(i) into Y^(i) ∈ R^{n x c} via the transformation matrix W^(i), we let Y^(i) = Z^(i) W^(i). According to Eq. (3), we obtain the objective function

min_{W^(1), W^(2)} ||Z^(1) W^(1) - Z^(2) W^(2)||_F^2
s.t. W^(1)T W^(1) = I, W^(2)T W^(2) = I,   (4)

where the orthogonality constraints are set to avoid trivial solutions. To optimize the objective function in Eq. (4), we first convert its loss term into

||Z^(1) W^(1) - Z^(2) W^(2)||_F^2
  = tr(W^(1)T Z^(1)T Z^(1) W^(1) + W^(2)T Z^(2)T Z^(2) W^(2)
      - W^(1)T Z^(1)T Z^(2) W^(2) - W^(2)T Z^(2)T Z^(1) W^(1))
  = tr(W^T Z W),   (5)

where W = [W^(1)T; W^(2)T]^T ∈ R^{2k x c} and

Z = (  Z^(1)T Z^(1)   -Z^(1)T Z^(2)
      -Z^(2)T Z^(1)    Z^(2)T Z^(2) ) ∈ R^{2k x 2k}.

Then the objective function in Eq. (4) becomes:

min_W tr(W^T Z W)  s.t. W^T W = I.   (6)

Eq. (6) is an eigenvalue problem: the optimal W consists of the eigenvectors of Z associated with its c smallest eigenvalues. W represents the hash functions and generates Y as follows:

Y^(i) = Z^(i) W^(i),   (7)

where W^(1) = W(1:k, :) and W^(2) = W(k+1:2k, :).

3.4 Binarization

After obtaining all Y^(i), we compute the median vector of Y^(i),

u^(i) = median(Y^(i)) ∈ R^c,   (8)

and then binarize Y^(i) as follows:

b_jl^(i) = 1 if y_jl^(i) ≥ u_l^(i);  b_jl^(i) = -1 if y_jl^(i) < u_l^(i),   (9)

where Y^(i) = [y_1^(i), ..., y_n^(i)]^T, i = 1, 2; j = 1, ..., n; and l = 1, ..., c. Eq. (9) generates the final binary codes B for the training data X, in which the median value of each dimension is used as the threshold for binarization. The learnt hash functions and the binarization step are used to map unseen data (e.g., database items and queries) into the Hamming space.

In the online search phase, given a query x^(i) from the i-th modal, we first approximate it by its distances to the k centroids, i.e., z^(i), using Eqs. (1)-(2); we then compute its y^(i) using Eq. (7), followed by binarization on y^(i) to generate its binary codes b^(i). Finally, the Hamming distances between b^(i) and the database binary codes are computed to find the neighbors of x^(i) in any other modal.

3.5 Summary and analysis

We summarize the proposed LCMH in Algorithm 1 (training phase) and Algorithm 2 (search phase).

Algorithm 1: Pseudo code of the training phase
  Input: X, c, k
  Output: u^(i) ∈ R^c; W^(i) ∈ R^{k x c}, i = 1, 2
  1 Perform scalable k-means on X^(i) to obtain m^(i);
  2 Compute Z^(i) by Eqs. (1)-(2);
  3 Generate W^(i) by Eq. (6);
  4 Generate u^(i) by Eq. (8);

Algorithm 2: Pseudo code of the search phase
  Input: x^(i), u^(i), W^(i)
  Output: Nearest neighbors of x^(i) in another modal
  1 Compute z^(i) by Eqs. (1)-(2);
  2 Compute y^(i) by Eq. (7);
  3 Generate b^(i) by Eq. (9);
  4 Match b^(i) with the database binary codes of another modal;

In the training phase of LCMH, the time cost mainly comes from the clustering process, the generation of the new representations, and the eigenvalue decomposition that produces the hash functions. With a scalable clustering method such as [5], the clusters can be generated in time linear in the training data size n. Generating the k-dimensional representations Z takes O(kn). The time complexity of generating W is O(min{nk^2, k^3}). Since k << n for large-scale training datasets, O(k^3) is the complexity to generate

hash functions. Therefore, the overall time complexity is O(max{kn, k^3}). Given that k << n, we expect that k^2 < n or that the two are of similar scale. This leads to an approximate overall time complexity of O(kn) for the training phase. With k a constant, the final time complexity is linear in the training data size. In the search phase, the time complexity is constant.

3.6 Extension

We present an extension of Algorithm 1 and Algorithm 2 to the case of more than two modals, which allows us to use the information available in all possible modals to achieve better learning results. To do this, we first generate the new representations of each modal according to Section 3.2 to preserve the intra-similarity, and then transform the new representations of all the modals into a common latent space to preserve the inter-similarity across every pair of modals. The objective function for preserving the inter-similarity is defined as:

min_{B^(i), i=1,...,p} Σ_{i=1}^{p-1} Σ_{j=i+1}^{p} ||B^(i) - B^(j)||_F^2
s.t. B^(i)T e = 0, b^(i) ∈ {-1, 1}, B^(i)T B^(i) = I_c, i = 1, ..., p,   (10)

where e is an n x 1 vector, p is the number of different modals, I_c is a c x c identity matrix, the constraint B^(i)T e = 0 requires each bit to have an equal chance to be 1 or -1, and the constraint B^(i)T B^(i) = I_c requires the bits of each modal to be obtained independently. To solve Eq. (10), we first relax it to:

min_{W^(i), i=1,...,p} Σ_{i=1}^{p-1} Σ_{j=i+1}^{p} ||Z^(i) W^(i) - Z^(j) W^(j)||_F^2
s.t. W^(i)T W^(i) = I, i = 1, ..., p.   (11)

We then obtain

min_W tr(W^T Z W)  s.t. W^T W = I,   (12)

where W = [W^(1)T; ...; W^(p)T]^T ∈ R^{pk x c}, and Z ∈ R^{pk x pk} is the block matrix whose i-th diagonal block is (p-1) Z^(i)T Z^(i) and whose (i, j)-th off-diagonal block is -Z^(i)T Z^(j), obtained by expanding the pairwise loss of Eq. (11) as in Eq. (5). After solving the eigenvalue problem in Eq. (12), we obtain the hash functions of the multiple modals (similarly to Eqs. (7)-(9) in Section 3.4). With the hash functions and the median thresholds, we can transform database data and queries into the Hamming space, to support cross-modal retrieval via efficient binary code comparisons.
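The eigenvalue step of Eqs. (11)-(12) and the binarization of Eq. (9) can be sketched for an arbitrary number of modalities as follows. This is a minimal sketch under the relaxation above, with our own function names; the block signs follow the pairwise expansion of the loss, and ties in the median threshold are resolved toward +1.

```python
import numpy as np

def learn_lcmh_projections(Zs, c):
    """Solve the relaxed multi-modal objective (Eqs. 11-12) by an
    eigen-decomposition, then derive per-modality median thresholds.

    Zs: list of p matrices, each (n, k_i): the cluster-distance
    representations of the modalities. Returns (Ws, us): per-modality
    projections W_i of shape (k_i, c) and thresholds u_i of shape (c,).
    """
    p = len(Zs)
    # block matrix of the quadratic form: summing ||Z_i W_i - Z_j W_j||_F^2
    # over pairs gives (p-1) Z_i^T Z_i on the diagonal, -Z_i^T Z_j off it
    Zb = np.block([[(p - 1) * Zi.T @ Zi if i == j else -Zi.T @ Zj
                    for j, Zj in enumerate(Zs)]
                   for i, Zi in enumerate(Zs)])
    vals, vecs = np.linalg.eigh(Zb)      # eigenvalues in ascending order
    W = vecs[:, :c]                      # minimizers: c smallest eigenpairs
    Ws = np.split(W, np.cumsum([Z.shape[1] for Z in Zs])[:-1])
    us = [np.median(Z @ Wi, axis=0) for Z, Wi in zip(Zs, Ws)]
    return Ws, us

def binarize(Z, W, u):
    """Eq. (9): +1 where the projection reaches the median, else -1."""
    return np.where(Z @ W >= u, 1, -1)
```

With p = 2 this reduces to the bimodal case of Eqs. (5)-(6); the median thresholds give each bit a balanced split over the training data, approximating the constraint B^(i)T e = 0.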
4. EXPERIMENTAL ANALYSIS

We conduct our experiments on two benchmark datasets, i.e., Wiki [22] and NUS-WIDE [6], so far the largest publicly available multi-modal datasets that are fully paired and labeled [38]. The two datasets are bimodal, with both visual and textual modals in different representations. In our experiments, each dataset is partitioned into a query set and a database set, the latter being used for training.

4.1 Comparison algorithms

The comparison algorithms include a baseline algorithm B and state-of-the-art algorithms, including CVH [6], CMSSH [2] and MLBE [38]. B is our LCMH without intra-similarity preservation, used to test the effect of intra-similarity preservation in our method. We compare LCMH with these algorithms on two cross-modal retrieval tasks. Specifically, one task uses a text query in the textual modal to search for relevant images in the visual modal (shortened to "Text query vs. Image data"), and the other uses an image query in the visual modal to search for relevant texts in the textual modal (shortened to "Image query vs. Text data").

4.2 Evaluation metrics

We use mean Average Precision (mAP) [38] as one of the performance measures. Given a query and a list of R retrieved results, its Average Precision is defined as

AP = (1/l) Σ_{r=1}^{R} P(r) δ(r),   (13)

where l is the number of true neighbors in the ground truth, P(r) denotes the precision of the top r retrieved results, and δ(r) = 1 if the r-th retrieved result is a true neighbor of the query and δ(r) = 0 otherwise. mAP is the mean of the Average Precision values of all queries. Clearly, the larger the mAP, the better the performance. In our experiments, we set R as the number of training data points whose Hamming distances to the query are not larger than 2. We also report results on two other types of measures: recall curves with different numbers of retrieved results, and the time cost of generating hash functions and searching the database binary codes. mAP and the recall curves reflect the retrieval effectiveness, while the time cost evaluates the efficiency.
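Eq. (13) can be computed as in the sketch below, where each query's retrieved list is given as 0/1 relevance flags in rank order; the function names are ours, and l defaults to the number of relevant results in the list when the full ground-truth count is not supplied.

```python
def average_precision(relevance, l=None):
    """Eq. (13): AP = (1/l) * sum_r P(r) * delta(r), where P(r) is the
    precision of the top r results and delta(r) marks a true neighbor.

    relevance: 0/1 flags for the ranked retrieved results;
    l: number of true neighbors in the ground truth (defaults to the
    number of hits in the list, i.e., assumes all were retrieved).
    """
    if l is None:
        l = sum(relevance)
    if l == 0:
        return 0.0
    hits, ap = 0, 0.0
    for r, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            ap += hits / r        # P(r) evaluated at each hit
    return ap / l

def mean_average_precision(all_relevance):
    """mAP: the mean of the per-query Average Precision values."""
    return sum(average_precision(rel) for rel in all_relevance) / len(all_relevance)
```

For example, a ranked list with hits at ranks 1 and 3 and two true neighbors gives AP = (1/2)(1/1 + 2/3) = 5/6.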
4.3 Parameter settings

By default, we set the parameter k = 3 for the Wiki dataset and k = 6 for the NUS-WIDE dataset. Among the k centroids, we set s = 3, representing each training data point by its s nearest centroids. In our experiments, we vary the length of the hash codes (i.e., the number of hash bits) in the range [8, 16, 24] for Wiki and [8, 16, 32] for NUS-WIDE. Moreover, for calculating the recall curves, we set the number of retrieved results in the range [25, 5, 75, , 25, 5, 7, 2] for Wiki and [, 2, 5, 8, , 2, 5] for NUS-WIDE. For all the comparison algorithms, the codes are provided by their authors. We tune the parameters according to the corresponding literature. All the experiments are conducted on a computer with Intel Xeon(R) 2.9 GHz 2 processors, 92 GB RAM and the 64-bit Windows 7 operating system.

4.4 Results on the Wiki dataset

The Wiki dataset [22] is generated from a group of 2,866 Wikipedia documents. In Wiki, each object is an image-text pair and is labeled with exactly one of 10 semantic

classes. The images are represented by 128-dimensional SIFT feature vectors. The text articles are represented by the probability distributions over topics derived from a latent Dirichlet allocation (LDA) model []. Following the setting in the literature [22], 273 data points form the database set and the remaining 693 data points form the query set. Since the dataset is fully annotated, the semantic neighbors of a query, determined by the associated labels, are regarded as the ground truth.

The mAP results of all the algorithms with different code lengths are reported in Fig. 3(a-b). The recall curves for the two query tasks with different code lengths are plotted in Fig. 4. According to the experimental results, LCMH consistently performs best. For example, the maximal difference between LCMH and the second best method is about 4% in Fig. 3(a) and about 8% in Fig. 3(b) when the code length is 24. Moreover, both CVH and MLBE are better than CMSSH, which matches the conclusion in [38]. Besides, we make three observations based on our experimental results. First, the methods that preserve both similarities outperform B and CMSSH, which only consider the inter-similarity across modals and ignore the intra-similarity within a modal. We can therefore conclude that it is useful to consider both the intra-similarity and the inter-similarity together when building cross-modal hashing. Second, although both CMSSH and B consider the inter-similarity, CMSSH improves over B slightly, since CMSSH employs prior knowledge, such as the predefined similar pairs and dissimilar pairs [2]. Third, according to the experimental results on mAP and the recall curves, all algorithms achieve their best performance when the number of hash bits is 16 for the Wiki dataset. After reaching their peak, the performance of all algorithms degrades. A possible reason is that a longer binary code representation may lead to fewer retrieved results given the fixed Hamming distance threshold, which affects precision and recall. Such a phenomenon has also been discussed in [8, 38].
Table : Runnng tme for all algorthms whle fxng code length as 6 for dataset Wk and dataset NUS-WIDE. Both tranng tme and search tme are recorded n second. Wk NUS-WIDE Task Methods Image uery vs. Text data Text uery vs. Image data tran search tran search B B Table shows the tme cost of the tranng phase and the search phase of all the algorthms. We can see that s most tme consumng snce t s a generatve model, followed by, and. Snce does not consder the ntra-smlarty, t s faster than. However, has unsatsfactory performance n search ualty as shown n Fg Results on NUS-WIDE dataset The dataset NUS-WIDE orgnally contans 269, 648 mages assocated wth 8 ground truth concept tags. Followng the lterature [8, 3], we prune orgnal NUS-WIDE to form a new dataset NUS-WIDE consstng of 95, 969 mage-tag pars by keepng the pars that belong to one of the 2 most freuent tags, such as anmal, buldngs, person, etc. In our NUS-WIDE, each par s annotated by at least one of 2 labels. The mages are represented by 5-dmensonal SIFT feature vectors and the texts are represented by -dmensonal feature vectors obtaned by performng PCA on the orgnal tag occurrence features. Followng the settng n the lteratures [8, 3], we unformly sample mages from each of the selected 2 tags to form auerysetof2, mages and the left 93, 869 magetag pars servng as the database set. The ground truth s defned based on whether two mages share at least one common tag n our experments. As shown n n Fg.3.(c-d), Fg.5, and Table, one can see that the rankng of all the algorthms on dataset NUS- WIDE s largely consstent wth that on dataset Wk. The maxmal dfference between and the second best one (.e., ) s about 6% n Fg.3.(c) and about 5% n Fg.3.(d) whle the code length s 6. Table 2: Runnng tme wth dfferent number of centrods whle fxng code length as 6 for dataset Wk and dataset NUS-WIDE. Both tranng tme and search tme are recorded n second. Task Centrods Wk NUS-WIDE tran search tran search Image uery k = vs. 
Parameter sensitivity
In this section, we test the sensitivity of the parameters. First, we look at the effect of k (i.e., the number of clusters in the training phase). We set different values of k and report part of the results in Fig. 6. From Fig. 6, we can see that a larger k value leads to better results, since the k-dimensional representation then captures the original data distribution in the training dataset more accurately. Nonetheless, a larger k value incurs more training cost, as shown in Table 2. Our results show that a relatively small k value (e.g., k = 3 and 6 for Wiki and NUS-WIDE) can achieve reasonably good results. Due to the space limit, we do not report the results on different s values. Generally, a good choice of s is a small value of at least 3.

5. CONCLUSION
In this paper we have proposed a novel and effective cross-modal hashing approach, namely linear cross-modal hashing (LCMH). The main idea is to represent each training data point with a smaller k-dimensional approximation, which preserves the intra-similarity and reduces the time and space complexity of learning hash functions. We then map the

Figure 3: mAP comparison with different code lengths for the Wiki dataset (a-b) and the NUS-WIDE dataset (c-d). Panels (a) and (c) show the task of image query vs. text data; panels (b) and (d) show text query vs. image data.

Figure 4: Recall curves with different code lengths for the Wiki dataset. The upper row (a-c) is the task of image query vs. text data; the bottom row (d-f) is the task of text query vs. image data.

new representations of the training data from all modalities to a common latent space in which the inter-similarity is preserved and the hash functions of each modality are obtained. Given a query, it is first transformed into its k-dimensional representation, which is then mapped into the Hamming space with the learnt hash functions to match against the database binary codes. Since binary codes from different modalities are comparable in the Hamming space, cross-modal retrieval can be efficiently and effectively supported by LCMH. The experimental results on two benchmark datasets demonstrate that LCMH significantly outperforms the state of the art with practical time cost.

6. ACKNOWLEDGEMENTS
This work was supported by the Australian Research Council (ARC) under research Grant DP94678 and the Natural Science Foundation (NSF) of China under grants

Figure 5: Recall curves with different code lengths for the NUS-WIDE dataset. The upper row (a-c) is the task of image query vs. text data; the bottom row (d-f) is the task of text query vs. image data.

Figure 6: Recall curves with different numbers of centroids, with the code length fixed at 16, for the Wiki dataset (a-b) and the NUS-WIDE dataset (c-d).

7. REFERENCES
[1] D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent Dirichlet allocation. J. Mach. Learn. Res., 3:993-1022, 2003.
[2] M. M. Bronstein, A. M. Bronstein, F. Michel, and N. Paragios. Data fusion through cross-modality metric learning using similarity-sensitive hashing. In CVPR, 2010.
[3] R. Chaudhry and Y. Ivanov. Fast approximate nearest neighbor methods for non-Euclidean manifolds with applications to human activity analysis in videos. In ECCV, 2010.
[4] M. Chen, K. Q. Weinberger, and J. C. Blitzer. Co-training for domain adaptation. In NIPS, 2011.
[5] X. Chen and D. Cai. Large scale spectral clustering with landmark-based representation. In AAAI, pages 313-318, 2011.
[6] T.-S. Chua, J. Tang, R. Hong, H. Li, Z. Luo, and Y. Zheng. NUS-WIDE: a real-world web image database

from National University of Singapore. In CIVR, 2009.
[7] M. Datar, N. Immorlica, P. Indyk, and V. S. Mirrokni. Locality-sensitive hashing scheme based on p-stable distributions. In SOCG, 2004.
[8] A. Gionis, P. Indyk, and R. Motwani. Similarity search in high dimensions via hashing. In VLDB, 1999.
[9] J. Goldberger, S. T. Roweis, G. E. Hinton, and R. Salakhutdinov. Neighbourhood components analysis. In NIPS, 2004.
[10] Y. Gong, S. Lazebnik, A. Gordo, and F. Perronnin. Iterative quantization: A procrustean approach to learning binary codes for large-scale image retrieval. IEEE Trans. Pattern Anal. Mach. Intell., accepted, 2012.
[11] R. Gopalan, R. Li, and R. Chellappa. Domain adaptation for object recognition: An unsupervised approach. In ICCV, pages 999-1006, 2011.
[12] P. Jain, B. Kulis, and K. Grauman. Fast image search for learned metrics. In CVPR, pages 1-8, 2008.
[13] H. Jégou, M. Douze, and C. Schmid. Product quantization for nearest neighbor search. IEEE Trans. Pattern Anal. Mach. Intell., 33(1):117-128, 2011.
[14] B. Kulis and T. Darrell. Learning to hash with binary reconstructive embeddings. In NIPS, pages 1042-1050, 2009.
[15] B. Kulis and K. Grauman. Kernelized locality-sensitive hashing for scalable image search. In ICCV, 2009.
[16] S. Kumar and R. Udupa. Learning hash functions for cross-view similarity search. In IJCAI, 2011.
[17] W. Liu, J. Wang, R. Ji, Y.-G. Jiang, and S.-F. Chang. Supervised hashing with kernels. In CVPR, 2012.
[18] W. Liu, J. Wang, S. Kumar, and S.-F. Chang. Hashing with graphs. In ICML, pages 1-8, 2011.
[19] M. Norouzi and D. J. Fleet. Minimal loss hashing for compact binary codes. In ICML, 2011.
[20] M. Norouzi, A. Punjani, and D. J. Fleet. Fast search in Hamming space with multi-index hashing. In CVPR, pages 3108-3115, 2012.
[21] M. Raginsky and S. Lazebnik. Locality-sensitive binary codes from shift-invariant kernels. In NIPS, pages 1509-1517, 2009.
[22] N. Rasiwasia, J. C. Pereira, E. Coviello, and G. Doyle. A new approach to cross-modal multimedia retrieval. In ACM MM, pages 251-260, 2010.
[23] S. Roweis and L. Saul. Nonlinear dimensionality reduction by locally linear embedding.
Science, 290(5500):2323-2326, 2000.
[24] R. Salakhutdinov and G. E. Hinton. Semantic hashing. Int. J. Approx. Reasoning, 50(7):969-978, 2009.
[25] L. K. Saul and S. T. Roweis. Think globally, fit locally: Unsupervised learning of low dimensional manifolds. J. Mach. Learn. Res., 4:119-155, 2003.
[26] J. Song, Y. Yang, Z. Huang, H. T. Shen, and R. Hong. Multiple feature hashing for real-time large scale near-duplicate video retrieval. In ACM MM, 2011.
[27] J. Song, Y. Yang, Y. Yang, Z. Huang, and H. T. Shen. Inter-media hashing for large-scale retrieval from heterogeneous data sources. In SIGMOD, 2013.
[28] C. Strecha, A. A. Bronstein, M. M. Bronstein, and P. Fua. LDAHash: Improved matching with smaller descriptors. IEEE Trans. Pattern Anal. Mach. Intell., 34(1):66-78, 2012.
[29] A. Torralba, R. Fergus, and Y. Weiss. Small codes and large image databases for recognition. In CVPR, pages 1-8, 2008.
[30] J. Wang, O. Kumar, and S.-F. Chang. Semi-supervised hashing for scalable image retrieval. In CVPR, 2010.
[31] J. Wang, S. Kumar, and S.-F. Chang. Sequential projection learning for hashing with compact codes. In ICML, pages 1127-1134, 2010.
[32] K. Q. Weinberger, B. D. Packer, and L. K. Saul. Nonlinear dimensionality reduction by semidefinite programming and kernel matrix factorization. In AISTATS, 2005.
[33] Y. Weiss, A. Torralba, and R. Fergus. Spectral hashing. In NIPS, 2008.
[34] C. Wu, J. Zhu, D. Cai, C. Chen, and J. Bu. Semi-supervised nonlinear hashing using bootstrap sequential projection learning. IEEE Trans. Knowl. Data Eng., 2012.
[35] Y. Yang, D. Xu, F. Nie, J. Luo, and Y. Zhuang. Ranking with local regression and global alignment for cross media retrieval. In ACM MM, pages 175-184, 2009.
[36] D. Zhang, F. Wang, and L. Si. Composite hashing with multiple information sources. In SIGIR, 2011.
[37] Y. Zhen and D.-Y. Yeung. Co-regularized hashing for multimodal data. In NIPS, 2012.
[38] Y. Zhen and D.-Y. Yeung. A probabilistic model for multimodal hash function learning. In SIGKDD, 2012.
[39] X. Zhu, Z. Huang, H. Cheng, J. Cui, and H. T. Shen.
Sparse hashing for fast multimedia search. ACM Trans. Inf. Syst., 31(2), 2013.

IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY 1. SSDH: Semi-supervised Deep Hashing for Large Scale Image Retrieval

IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY 1. SSDH: Semi-supervised Deep Hashing for Large Scale Image Retrieval IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY SSDH: Sem-supervsed Deep Hashng for Large Scale Image Retreval Jan Zhang, and Yuxn Peng arxv:607.08477v2 [cs.cv] 8 Jun 207 Abstract Hashng

More information

Subspace clustering. Clustering. Fundamental to all clustering techniques is the choice of distance measure between data points;

Subspace clustering. Clustering. Fundamental to all clustering techniques is the choice of distance measure between data points; Subspace clusterng Clusterng Fundamental to all clusterng technques s the choce of dstance measure between data ponts; D q ( ) ( ) 2 x x = x x, j k = 1 k jk Squared Eucldean dstance Assumpton: All features

More information

LECTURE : MANIFOLD LEARNING

LECTURE : MANIFOLD LEARNING LECTURE : MANIFOLD LEARNING Rta Osadchy Some sldes are due to L.Saul, V. C. Raykar, N. Verma Topcs PCA MDS IsoMap LLE EgenMaps Done! Dmensonalty Reducton Data representaton Inputs are real-valued vectors

More information

A Fast Content-Based Multimedia Retrieval Technique Using Compressed Data

A Fast Content-Based Multimedia Retrieval Technique Using Compressed Data A Fast Content-Based Multmeda Retreval Technque Usng Compressed Data Borko Furht and Pornvt Saksobhavvat NSF Multmeda Laboratory Florda Atlantc Unversty, Boca Raton, Florda 3343 ABSTRACT In ths paper,

More information

MULTI-VIEW ANCHOR GRAPH HASHING

MULTI-VIEW ANCHOR GRAPH HASHING MULTI-VIEW ANCHOR GRAPH HASHING Saehoon Km 1 and Seungjn Cho 1,2 1 Department of Computer Scence and Engneerng, POSTECH, Korea 2 Dvson of IT Convergence Engneerng, POSTECH, Korea {kshkawa, seungjn}@postech.ac.kr

More information

Learning the Kernel Parameters in Kernel Minimum Distance Classifier

Learning the Kernel Parameters in Kernel Minimum Distance Classifier Learnng the Kernel Parameters n Kernel Mnmum Dstance Classfer Daoqang Zhang 1,, Songcan Chen and Zh-Hua Zhou 1* 1 Natonal Laboratory for Novel Software Technology Nanjng Unversty, Nanjng 193, Chna Department

More information

Feature Reduction and Selection

Feature Reduction and Selection Feature Reducton and Selecton Dr. Shuang LIANG School of Software Engneerng TongJ Unversty Fall, 2012 Today s Topcs Introducton Problems of Dmensonalty Feature Reducton Statstc methods Prncpal Components

More information

Data-dependent Hashing Based on p-stable Distribution

Data-dependent Hashing Based on p-stable Distribution Data-depent Hashng Based on p-stable Dstrbuton Author Ba, Xao, Yang, Hachuan, Zhou, Jun, Ren, Peng, Cheng, Jan Publshed 24 Journal Ttle IEEE Transactons on Image Processng DOI https://do.org/.9/tip.24.2352458

More information

Collaboratively Regularized Nearest Points for Set Based Recognition

Collaboratively Regularized Nearest Points for Set Based Recognition Academc Center for Computng and Meda Studes, Kyoto Unversty Collaboratvely Regularzed Nearest Ponts for Set Based Recognton Yang Wu, Mchhko Mnoh, Masayuk Mukunok Kyoto Unversty 9/1/013 BMVC 013 @ Brstol,

More information

12/2/2009. Announcements. Parametric / Non-parametric. Case-Based Reasoning. Nearest-Neighbor on Images. Nearest-Neighbor Classification

12/2/2009. Announcements. Parametric / Non-parametric. Case-Based Reasoning. Nearest-Neighbor on Images. Nearest-Neighbor Classification Introducton to Artfcal Intellgence V22.0472-001 Fall 2009 Lecture 24: Nearest-Neghbors & Support Vector Machnes Rob Fergus Dept of Computer Scence, Courant Insttute, NYU Sldes from Danel Yeung, John DeNero

More information

Face Recognition University at Buffalo CSE666 Lecture Slides Resources:

Face Recognition University at Buffalo CSE666 Lecture Slides Resources: Face Recognton Unversty at Buffalo CSE666 Lecture Sldes Resources: http://www.face-rec.org/algorthms/ Overvew of face recognton algorthms Correlaton - Pxel based correspondence between two face mages Structural

More information

Problem Definitions and Evaluation Criteria for Computational Expensive Optimization

Problem Definitions and Evaluation Criteria for Computational Expensive Optimization Problem efntons and Evaluaton Crtera for Computatonal Expensve Optmzaton B. Lu 1, Q. Chen and Q. Zhang 3, J. J. Lang 4, P. N. Suganthan, B. Y. Qu 6 1 epartment of Computng, Glyndwr Unversty, UK Faclty

More information

Outline. Discriminative classifiers for image recognition. Where in the World? A nearest neighbor recognition example 4/14/2011. CS 376 Lecture 22 1

Outline. Discriminative classifiers for image recognition. Where in the World? A nearest neighbor recognition example 4/14/2011. CS 376 Lecture 22 1 4/14/011 Outlne Dscrmnatve classfers for mage recognton Wednesday, Aprl 13 Krsten Grauman UT-Austn Last tme: wndow-based generc obect detecton basc ppelne face detecton wth boostng as case study Today:

More information

Cluster Analysis of Electrical Behavior

Cluster Analysis of Electrical Behavior Journal of Computer and Communcatons, 205, 3, 88-93 Publshed Onlne May 205 n ScRes. http://www.scrp.org/ournal/cc http://dx.do.org/0.4236/cc.205.350 Cluster Analyss of Electrcal Behavor Ln Lu Ln Lu, School

More information

AS THE major component of big data, multi-modal data

AS THE major component of big data, multi-modal data 1220 IEEE TRANSACTIONS ON MULTIMEDIA, VOL. 19, NO. 6, JUNE 2017 Cross-Modal Retreval Usng Multordered Dscrmnatve Structured Subspace Learnng Lang Zhang, Bngpeng Ma, Guorong L, Qngmng Huang, and Q Tan Abstract

More information

A Binarization Algorithm specialized on Document Images and Photos

A Binarization Algorithm specialized on Document Images and Photos A Bnarzaton Algorthm specalzed on Document mages and Photos Ergna Kavalleratou Dept. of nformaton and Communcaton Systems Engneerng Unversty of the Aegean kavalleratou@aegean.gr Abstract n ths paper, a

More information

Content Based Image Retrieval Using 2-D Discrete Wavelet with Texture Feature with Different Classifiers

Content Based Image Retrieval Using 2-D Discrete Wavelet with Texture Feature with Different Classifiers IOSR Journal of Electroncs and Communcaton Engneerng (IOSR-JECE) e-issn: 78-834,p- ISSN: 78-8735.Volume 9, Issue, Ver. IV (Mar - Apr. 04), PP 0-07 Content Based Image Retreval Usng -D Dscrete Wavelet wth

More information

MULTISPECTRAL IMAGES CLASSIFICATION BASED ON KLT AND ATR AUTOMATIC TARGET RECOGNITION

MULTISPECTRAL IMAGES CLASSIFICATION BASED ON KLT AND ATR AUTOMATIC TARGET RECOGNITION MULTISPECTRAL IMAGES CLASSIFICATION BASED ON KLT AND ATR AUTOMATIC TARGET RECOGNITION Paulo Quntlano 1 & Antono Santa-Rosa 1 Federal Polce Department, Brasla, Brazl. E-mals: quntlano.pqs@dpf.gov.br and

More information

Classifier Selection Based on Data Complexity Measures *

Classifier Selection Based on Data Complexity Measures * Classfer Selecton Based on Data Complexty Measures * Edth Hernández-Reyes, J.A. Carrasco-Ochoa, and J.Fco. Martínez-Trndad Natonal Insttute for Astrophyscs, Optcs and Electroncs, Lus Enrque Erro No.1 Sta.

More information

Discriminative Dictionary Learning with Pairwise Constraints

Discriminative Dictionary Learning with Pairwise Constraints Dscrmnatve Dctonary Learnng wth Parwse Constrants Humn Guo Zhuoln Jang LARRY S. DAVIS UNIVERSITY OF MARYLAND Nov. 6 th, Outlne Introducton/motvaton Dctonary Learnng Dscrmnatve Dctonary Learnng wth Parwse

More information

Image Alignment CSC 767

Image Alignment CSC 767 Image Algnment CSC 767 Image algnment Image from http://graphcs.cs.cmu.edu/courses/15-463/2010_fall/ Image algnment: Applcatons Panorama sttchng Image algnment: Applcatons Recognton of object nstances

More information

CS434a/541a: Pattern Recognition Prof. Olga Veksler. Lecture 15

CS434a/541a: Pattern Recognition Prof. Olga Veksler. Lecture 15 CS434a/541a: Pattern Recognton Prof. Olga Veksler Lecture 15 Today New Topc: Unsupervsed Learnng Supervsed vs. unsupervsed learnng Unsupervsed learnng Net Tme: parametrc unsupervsed learnng Today: nonparametrc

More information

Unsupervised Learning and Clustering

Unsupervised Learning and Clustering Unsupervsed Learnng and Clusterng Supervsed vs. Unsupervsed Learnng Up to now we consdered supervsed learnng scenaro, where we are gven 1. samples 1,, n 2. class labels for all samples 1,, n Ths s also

More information

Support Vector Machines

Support Vector Machines Support Vector Machnes Decson surface s a hyperplane (lne n 2D) n feature space (smlar to the Perceptron) Arguably, the most mportant recent dscovery n machne learnng In a nutshell: map the data to a predetermned

More information

Laplacian Eigenmap for Image Retrieval

Laplacian Eigenmap for Image Retrieval Laplacan Egenmap for Image Retreval Xaofe He Partha Nyog Department of Computer Scence The Unversty of Chcago, 1100 E 58 th Street, Chcago, IL 60637 ABSTRACT Dmensonalty reducton has been receved much

More information

Unsupervised Learning and Clustering

Unsupervised Learning and Clustering Unsupervsed Learnng and Clusterng Why consder unlabeled samples?. Collectng and labelng large set of samples s costly Gettng recorded speech s free, labelng s tme consumng 2. Classfer could be desgned

More information

Tsinghua University at TAC 2009: Summarizing Multi-documents by Information Distance

Tsinghua University at TAC 2009: Summarizing Multi-documents by Information Distance Tsnghua Unversty at TAC 2009: Summarzng Mult-documents by Informaton Dstance Chong Long, Mnle Huang, Xaoyan Zhu State Key Laboratory of Intellgent Technology and Systems, Tsnghua Natonal Laboratory for

More information

Parallelism for Nested Loops with Non-uniform and Flow Dependences

Parallelism for Nested Loops with Non-uniform and Flow Dependences Parallelsm for Nested Loops wth Non-unform and Flow Dependences Sam-Jn Jeong Dept. of Informaton & Communcaton Engneerng, Cheonan Unversty, 5, Anseo-dong, Cheonan, Chungnam, 330-80, Korea. seong@cheonan.ac.kr

More information

Outline. Type of Machine Learning. Examples of Application. Unsupervised Learning

Outline. Type of Machine Learning. Examples of Application. Unsupervised Learning Outlne Artfcal Intellgence and ts applcatons Lecture 8 Unsupervsed Learnng Professor Danel Yeung danyeung@eee.org Dr. Patrck Chan patrckchan@eee.org South Chna Unversty of Technology, Chna Introducton

More information

What Is the Most Efficient Way to Select Nearest Neighbor Candidates for Fast Approximate Nearest Neighbor Search?

What Is the Most Efficient Way to Select Nearest Neighbor Candidates for Fast Approximate Nearest Neighbor Search? IEEE Internatonal Conference on Computer Vson What Is the Most Effcent Way to Select Nearest Neghbor Canddates for Fast Approxmate Nearest Neghbor Search? Masakazu Iwamura, Tomokazu Sato and Koch Kse Graduate

More information

Machine Learning: Algorithms and Applications

Machine Learning: Algorithms and Applications 14/05/1 Machne Learnng: Algorthms and Applcatons Florano Zn Free Unversty of Bozen-Bolzano Faculty of Computer Scence Academc Year 011-01 Lecture 10: 14 May 01 Unsupervsed Learnng cont Sldes courtesy of

More information

Semantic Image Retrieval Using Region Based Inverted File

Semantic Image Retrieval Using Region Based Inverted File Semantc Image Retreval Usng Regon Based Inverted Fle Dengsheng Zhang, Md Monrul Islam, Guoun Lu and Jn Hou 2 Gppsland School of Informaton Technology, Monash Unversty Churchll, VIC 3842, Australa E-mal:

More information

BAYESIAN MULTI-SOURCE DOMAIN ADAPTATION

BAYESIAN MULTI-SOURCE DOMAIN ADAPTATION BAYESIAN MULTI-SOURCE DOMAIN ADAPTATION SHI-LIANG SUN, HONG-LEI SHI Department of Computer Scence and Technology, East Chna Normal Unversty 500 Dongchuan Road, Shangha 200241, P. R. Chna E-MAIL: slsun@cs.ecnu.edu.cn,

More information

Private Information Retrieval (PIR)

Private Information Retrieval (PIR) 2 Levente Buttyán Problem formulaton Alce wants to obtan nformaton from a database, but she does not want the database to learn whch nformaton she wanted e.g., Alce s an nvestor queryng a stock-market

More information

Learning an Image Manifold for Retrieval

Learning an Image Manifold for Retrieval Learnng an Image Manfold for Retreval Xaofe He*, We-Yng Ma, and Hong-Jang Zhang Mcrosoft Research Asa Bejng, Chna, 100080 {wyma,hjzhang}@mcrosoft.com *Department of Computer Scence, The Unversty of Chcago

More information

Support Vector Machines

Support Vector Machines /9/207 MIST.6060 Busness Intellgence and Data Mnng What are Support Vector Machnes? Support Vector Machnes Support Vector Machnes (SVMs) are supervsed learnng technques that analyze data and recognze patterns.

More information

Multi-stable Perception. Necker Cube

Multi-stable Perception. Necker Cube Mult-stable Percepton Necker Cube Spnnng dancer lluson, Nobuuk Kaahara Fttng and Algnment Computer Vson Szelsk 6.1 James Has Acknowledgment: Man sldes from Derek Hoem, Lana Lazebnk, and Grauman&Lebe 2008

More information

An Optimal Algorithm for Prufer Codes *

An Optimal Algorithm for Prufer Codes * J. Software Engneerng & Applcatons, 2009, 2: 111-115 do:10.4236/jsea.2009.22016 Publshed Onlne July 2009 (www.scrp.org/journal/jsea) An Optmal Algorthm for Prufer Codes * Xaodong Wang 1, 2, Le Wang 3,

More information

Unsupervised Learning

Unsupervised Learning Pattern Recognton Lecture 8 Outlne Introducton Unsupervsed Learnng Parametrc VS Non-Parametrc Approach Mxture of Denstes Maxmum-Lkelhood Estmates Clusterng Prof. Danel Yeung School of Computer Scence and

More information

An Application of the Dulmage-Mendelsohn Decomposition to Sparse Null Space Bases of Full Row Rank Matrices

An Application of the Dulmage-Mendelsohn Decomposition to Sparse Null Space Bases of Full Row Rank Matrices Internatonal Mathematcal Forum, Vol 7, 2012, no 52, 2549-2554 An Applcaton of the Dulmage-Mendelsohn Decomposton to Sparse Null Space Bases of Full Row Rank Matrces Mostafa Khorramzadeh Department of Mathematcal

More information

Optimizing Document Scoring for Query Retrieval

Optimizing Document Scoring for Query Retrieval Optmzng Document Scorng for Query Retreval Brent Ellwen baellwe@cs.stanford.edu Abstract The goal of ths project was to automate the process of tunng a document query engne. Specfcally, I used machne learnng

More information

Multimodal Learning of Geometry-Preserving Binary Codes for Semantic Image Retrieval

Multimodal Learning of Geometry-Preserving Binary Codes for Semantic Image Retrieval 600 IEICE TRANS. INF. & SYST., VOL.E100 D, NO.4 APRIL 2017 INVITED PAPER Specal Secton on Award-wnnng Papers Multmodal Learnng of Geometry-Preservng Bnary Codes for Semantc Image Retreval Go IRIE a), Hroyuk

More information

Determining the Optimal Bandwidth Based on Multi-criterion Fusion

Determining the Optimal Bandwidth Based on Multi-criterion Fusion Proceedngs of 01 4th Internatonal Conference on Machne Learnng and Computng IPCSIT vol. 5 (01) (01) IACSIT Press, Sngapore Determnng the Optmal Bandwdth Based on Mult-crteron Fuson Ha-L Lang 1+, Xan-Mn

More information

Hierarchical Image Retrieval by Multi-Feature Fusion

Hierarchical Image Retrieval by Multi-Feature Fusion Preprnts (www.preprnts.org) NOT PEER-REVIEWED Posted: 26 Aprl 207 do:0.20944/preprnts20704.074.v Artcle Herarchcal Image Retreval by Mult- Fuson Xaojun Lu, Jaojuan Wang,Yngq Hou, Me Yang, Q Wang* and Xangde

More information

The Research of Support Vector Machine in Agricultural Data Classification

The Research of Support Vector Machine in Agricultural Data Classification The Research of Support Vector Machne n Agrcultural Data Classfcaton Le Sh, Qguo Duan, Xnmng Ma, Me Weng College of Informaton and Management Scence, HeNan Agrcultural Unversty, Zhengzhou 45000 Chna Zhengzhou

More information

Lecture 4: Principal components

Lecture 4: Principal components /3/6 Lecture 4: Prncpal components 3..6 Multvarate lnear regresson MLR s optmal for the estmaton data...but poor for handlng collnear data Covarance matrx s not nvertble (large condton number) Robustness

More information

Improving Web Image Search using Meta Re-rankers

Improving Web Image Search using Meta Re-rankers VOLUME-1, ISSUE-V (Aug-Sep 2013) IS NOW AVAILABLE AT: www.dcst.com Improvng Web Image Search usng Meta Re-rankers B.Kavtha 1, N. Suata 2 1 Department of Computer Scence and Engneerng, Chtanya Bharath Insttute

More information

Human Face Recognition Using Generalized. Kernel Fisher Discriminant

Human Face Recognition Using Generalized. Kernel Fisher Discriminant Human Face Recognton Usng Generalzed Kernel Fsher Dscrmnant ng-yu Sun,2 De-Shuang Huang Ln Guo. Insttute of Intellgent Machnes, Chnese Academy of Scences, P.O.ox 30, Hefe, Anhu, Chna. 2. Department of

More information

Range images. Range image registration. Examples of sampling patterns. Range images and range surfaces

Range images. Range image registration. Examples of sampling patterns. Range images and range surfaces Range mages For many structured lght scanners, the range data forms a hghly regular pattern known as a range mage. he samplng pattern s determned by the specfc scanner. Range mage regstraton 1 Examples

More information

Recognizing Faces. Outline

Recognizing Faces. Outline Recognzng Faces Drk Colbry Outlne Introducton and Motvaton Defnng a feature vector Prncpal Component Analyss Lnear Dscrmnate Analyss !"" #$""% http://www.nfotech.oulu.f/annual/2004 + &'()*) '+)* 2 ! &

More information

IEEE TRANSACTIONS ON IMAGE PROCESSING, VOL. *, NO. *, Dictionary Pair Learning on Grassmann Manifolds for Image Denoising

IEEE TRANSACTIONS ON IMAGE PROCESSING, VOL. *, NO. *, Dictionary Pair Learning on Grassmann Manifolds for Image Denoising IEEE TRANSACTIONS ON IMAGE PROCESSING, VOL. *, NO. *, 2015 1 Dctonary Par Learnng on Grassmann Manfolds for Image Denosng Xanhua Zeng, We Ban, We Lu, Jale Shen, Dacheng Tao, Fellow, IEEE Abstract Image

More information

The Greedy Method. Outline and Reading. Change Money Problem. Greedy Algorithms. Applications of the Greedy Strategy. The Greedy Method Technique

The Greedy Method. Outline and Reading. Change Money Problem. Greedy Algorithms. Applications of the Greedy Strategy. The Greedy Method Technique //00 :0 AM Outlne and Readng The Greedy Method The Greedy Method Technque (secton.) Fractonal Knapsack Problem (secton..) Task Schedulng (secton..) Mnmum Spannng Trees (secton.) Change Money Problem Greedy

More information

A Fast Visual Tracking Algorithm Based on Circle Pixels Matching

A Fast Visual Tracking Algorithm Based on Circle Pixels Matching A Fast Vsual Trackng Algorthm Based on Crcle Pxels Matchng Zhqang Hou hou_zhq@sohu.com Chongzhao Han czhan@mal.xjtu.edu.cn Ln Zheng Abstract: A fast vsual trackng algorthm based on crcle pxels matchng

More information

Lobachevsky State University of Nizhni Novgorod. Polyhedron. Quick Start Guide

Lobachevsky State University of Nizhni Novgorod. Polyhedron. Quick Start Guide Lobachevsky State Unversty of Nzhn Novgorod Polyhedron Quck Start Gude Nzhn Novgorod 2016 Contents Specfcaton of Polyhedron software... 3 Theoretcal background... 4 1. Interface of Polyhedron... 6 1.1.

More information

Edge Detection in Noisy Images Using the Support Vector Machines

Edge Detection in Noisy Images Using the Support Vector Machines Edge Detecton n Nosy Images Usng the Support Vector Machnes Hlaro Gómez-Moreno, Saturnno Maldonado-Bascón, Francsco López-Ferreras Sgnal Theory and Communcatons Department. Unversty of Alcalá Crta. Madrd-Barcelona

More information

TPL-Aware Displacement-driven Detailed Placement Refinement with Coloring Constraints

TPL-Aware Displacement-driven Detailed Placement Refinement with Coloring Constraints TPL-ware Dsplacement-drven Detaled Placement Refnement wth Colorng Constrants Tao Ln Iowa State Unversty tln@astate.edu Chrs Chu Iowa State Unversty cnchu@astate.edu BSTRCT To mnmze the effect of process

More information

A Posteriori Multi-Probe Locality Sensitive Hashing

A Posteriori Multi-Probe Locality Sensitive Hashing A Posteror Mult-Probe Localty Senstve Hashng Alexs Joly INRIA Rocquencourt Le Chesnay, 78153, France alexs.joly@nra.fr Olver Busson INA, France Bry-sur-Marne, 94360 obusson@na.fr ABSTRACT Effcent hgh-dmensonal

More information

Local Quaternary Patterns and Feature Local Quaternary Patterns

Local Quaternary Patterns and Feature Local Quaternary Patterns Local Quaternary Patterns and Feature Local Quaternary Patterns Jayu Gu and Chengjun Lu The Department of Computer Scence, New Jersey Insttute of Technology, Newark, NJ 0102, USA Abstract - Ths paper presents

More information

Semi-Supervised Discriminant Analysis Based On Data Structure

Semi-Supervised Discriminant Analysis Based On Data Structure IOSR Journal of Computer Engneerng (IOSR-JCE) e-issn: 2278-0661,p-ISSN: 2278-8727, Volume 17, Issue 3, Ver. VII (May Jun. 2015), PP 39-46 www.osrournals.org Sem-Supervsed Dscrmnant Analyss Based On Data

More information

Reducing Frame Rate for Object Tracking

Reducing Frame Rate for Object Tracking Reducng Frame Rate for Object Trackng Pavel Korshunov 1 and We Tsang Oo 2 1 Natonal Unversty of Sngapore, Sngapore 11977, pavelkor@comp.nus.edu.sg 2 Natonal Unversty of Sngapore, Sngapore 11977, oowt@comp.nus.edu.sg

More information

SLAM Summer School 2006 Practical 2: SLAM using Monocular Vision

SLAM Summer School 2006 Practical 2: SLAM using Monocular Vision SLAM Summer School 2006 Practcal 2: SLAM usng Monocular Vson Javer Cvera, Unversty of Zaragoza Andrew J. Davson, Imperal College London J.M.M Montel, Unversty of Zaragoza. josemar@unzar.es, jcvera@unzar.es,

More information

1. Introduction. Abstract

1. Introduction. Abstract Image Retreval Usng a Herarchy of Clusters Danela Stan & Ishwar K. Seth Intellgent Informaton Engneerng Laboratory, Department of Computer Scence & Engneerng, Oaland Unversty, Rochester, Mchgan 48309-4478

More information

Problem Set 3 Solutions

Problem Set 3 Solutions Introducton to Algorthms October 4, 2002 Massachusetts Insttute of Technology 6046J/18410J Professors Erk Demane and Shaf Goldwasser Handout 14 Problem Set 3 Solutons (Exercses were not to be turned n,

More information

Kent State University CS 4/ Design and Analysis of Algorithms. Dept. of Math & Computer Science LECT-16. Dynamic Programming

Kent State University CS 4/ Design and Analysis of Algorithms. Dept. of Math & Computer Science LECT-16. Dynamic Programming CS 4/560 Desgn and Analyss of Algorthms Kent State Unversty Dept. of Math & Computer Scence LECT-6 Dynamc Programmng 2 Dynamc Programmng Dynamc Programmng, lke the dvde-and-conquer method, solves problems

More information

Large-scale Web Video Event Classification by use of Fisher Vectors

Large-scale Web Video Event Classification by use of Fisher Vectors Large-scale Web Vdeo Event Classfcaton by use of Fsher Vectors Chen Sun and Ram Nevata Unversty of Southern Calforna, Insttute for Robotcs and Intellgent Systems Los Angeles, CA 90089, USA {chensun nevata}@usc.org

More information

Towards Semantic Knowledge Propagation from Text to Web Images

Towards Semantic Knowledge Propagation from Text to Web Images Guoun Q (Unversty of Illnos at Urbana-Champagn) Charu C. Aggarwal (IBM T. J. Watson Research Center) Thomas Huang (Unversty of Illnos at Urbana-Champagn) Towards Semantc Knowledge Propagaton from Text

More information

A Robust Method for Estimating the Fundamental Matrix

Proc. VIIth Digital Image Computing: Techniques and Applications, Sun C., Talbot H., Ourselin S. and Adriaansen T. (Eds.), 10-12 Dec. 2003, Sydney. A Robust Method for Estimating the Fundamental Matrix, C.L. Feng and Y.S.

A Novel Adaptive Descriptor Algorithm for Ternary Pattern Textures

Fahuan Hu 1,2, Guoping Liu 1*, Zengwen Dong 1. 1. School of Mechanical & Electrical Engineering, Nanchang University, Nanchang, 330031, China; 2. School

Learning-Based Top-N Selection Query Evaluation over Relational Databases

Liang Zhu*, Weiyi Meng**. * School of Mathematics and Computer Science, Hebei University, Baoding, Hebei 071002, China, zhu@mail.hbu.edu.cn **

Outline. Self-Organizing Maps (SOM) US Hebbian Learning, Cntd. The learning rule is Hebbian like:

Outline. Self-Organizing Maps (SOM) US Hebbian Learning, Cntd. The learning rule is Hebbian like: Self-Organzng Maps (SOM) Turgay İBRİKÇİ, PhD. Outlne Introducton Structures of SOM SOM Archtecture Neghborhoods SOM Algorthm Examples Summary 1 2 Unsupervsed Hebban Learnng US Hebban Learnng, Cntd 3 A

More information

Quality Improvement Algorithm for Tetrahedral Mesh Based on Optimal Delaunay Triangulation

Quality Improvement Algorithm for Tetrahedral Mesh Based on Optimal Delaunay Triangulation Intellgent Informaton Management, 013, 5, 191-195 Publshed Onlne November 013 (http://www.scrp.org/journal/m) http://dx.do.org/10.36/m.013.5601 Qualty Improvement Algorthm for Tetrahedral Mesh Based on

More information

Research Article A High-Order CFS Algorithm for Clustering Big Data

Research Article A High-Order CFS Algorithm for Clustering Big Data Moble Informaton Systems Volume 26, Artcle ID 435627, 8 pages http://dx.do.org/.55/26/435627 Research Artcle A Hgh-Order Algorthm for Clusterng Bg Data Fanyu Bu,,2 Zhku Chen, Peng L, Tong Tang, 3 andyngzhang

More information

A Bilinear Model for Sparse Coding

A Bilinear Model for Sparse Coding A Blnear Model for Sparse Codng Davd B. Grmes and Rajesh P. N. Rao Department of Computer Scence and Engneerng Unversty of Washngton Seattle, WA 98195-2350, U.S.A. grmes,rao @cs.washngton.edu Abstract

More information

A Unified Framework for Semantics and Feature Based Relevance Feedback in Image Retrieval Systems

A Unified Framework for Semantics and Feature Based Relevance Feedback in Image Retrieval Systems A Unfed Framework for Semantcs and Feature Based Relevance Feedback n Image Retreval Systems Ye Lu *, Chunhu Hu 2, Xngquan Zhu 3*, HongJang Zhang 2, Qang Yang * School of Computng Scence Smon Fraser Unversty

More information

Sum of Linear and Fractional Multiobjective Programming Problem under Fuzzy Rules Constraints

Sum of Linear and Fractional Multiobjective Programming Problem under Fuzzy Rules Constraints Australan Journal of Basc and Appled Scences, 2(4): 1204-1208, 2008 ISSN 1991-8178 Sum of Lnear and Fractonal Multobjectve Programmng Problem under Fuzzy Rules Constrants 1 2 Sanjay Jan and Kalash Lachhwan

More information

KIDS Lab at ImageCLEF 2012 Personal Photo Retrieval

KIDS Lab at ImageCLEF 2012 Personal Photo Retrieval KD Lab at mageclef 2012 Personal Photo Retreval Cha-We Ku, Been-Chan Chen, Guan-Bn Chen, L-J Gaou, Rong-ng Huang, and ao-en Wang Knowledge, nformaton, and Database ystem Laboratory Department of Computer

More information

Fitting & Matching. Lecture 4 Prof. Bregler. Slides from: S. Lazebnik, S. Seitz, M. Pollefeys, A. Effros.

Fitting & Matching. Lecture 4 Prof. Bregler. Slides from: S. Lazebnik, S. Seitz, M. Pollefeys, A. Effros. Fttng & Matchng Lecture 4 Prof. Bregler Sldes from: S. Lazebnk, S. Setz, M. Pollefeys, A. Effros. How do we buld panorama? We need to match (algn) mages Matchng wth Features Detect feature ponts n both

More information

Machine Learning. Support Vector Machines. (contains material adapted from talks by Constantin F. Aliferis & Ioannis Tsamardinos, and Martin Law)

Machine Learning. Support Vector Machines. (contains material adapted from talks by Constantin F. Aliferis & Ioannis Tsamardinos, and Martin Law) Machne Learnng Support Vector Machnes (contans materal adapted from talks by Constantn F. Alfers & Ioanns Tsamardnos, and Martn Law) Bryan Pardo, Machne Learnng: EECS 349 Fall 2014 Support Vector Machnes

More information

An Image Fusion Approach Based on Segmentation Region

An Image Fusion Approach Based on Segmentation Region Rong Wang, L-Qun Gao, Shu Yang, Yu-Hua Cha, and Yan-Chun Lu An Image Fuson Approach Based On Segmentaton Regon An Image Fuson Approach Based on Segmentaton Regon Rong Wang, L-Qun Gao, Shu Yang 3, Yu-Hua

More information

Positive Semi-definite Programming Localization in Wireless Sensor Networks

Positive Semi-definite Programming Localization in Wireless Sensor Networks Postve Sem-defnte Programmng Localzaton n Wreless Sensor etworks Shengdong Xe 1,, Jn Wang, Aqun Hu 1, Yunl Gu, Jang Xu, 1 School of Informaton Scence and Engneerng, Southeast Unversty, 10096, anjng Computer

More information

The Discriminate Analysis and Dimension Reduction Methods of High Dimension

The Discriminate Analysis and Dimension Reduction Methods of High Dimension Open Journal of Socal Scences, 015, 3, 7-13 Publshed Onlne March 015 n ScRes. http://www.scrp.org/journal/jss http://dx.do.org/10.436/jss.015.3300 The Dscrmnate Analyss and Dmenson Reducton Methods of

More information

TN348: Openlab Module - Colocalization

TN348: Openlab Module - Colocalization TN348: Openlab Module - Colocalzaton Topc The Colocalzaton module provdes the faclty to vsualze and quantfy colocalzaton between pars of mages. The Colocalzaton wndow contans a prevew of the two mages

More information

Learning a Class-Specific Dictionary for Facial Expression Recognition

Learning a Class-Specific Dictionary for Facial Expression Recognition BULGARIAN ACADEMY OF SCIENCES CYBERNETICS AND INFORMATION TECHNOLOGIES Volume 16, No 4 Sofa 016 Prnt ISSN: 1311-970; Onlne ISSN: 1314-4081 DOI: 10.1515/cat-016-0067 Learnng a Class-Specfc Dctonary for

More information

User Authentication Based On Behavioral Mouse Dynamics Biometrics

User Authentication Based On Behavioral Mouse Dynamics Biometrics User Authentcaton Based On Behavoral Mouse Dynamcs Bometrcs Chee-Hyung Yoon Danel Donghyun Km Department of Computer Scence Department of Computer Scence Stanford Unversty Stanford Unversty Stanford, CA

More information

Multi-View Surveillance Video Summarization via Joint Embedding and Sparse Optimization

Multi-View Surveillance Video Summarization via Joint Embedding and Sparse Optimization IEEE TRANSACTIONS ON MULTIMEDIA, VOL. XX, NO. XX, DECEMBER 20XX 1 Mult-Vew Survellance Vdeo Summarzaton va Jont Embeddng and Sparse Optmzaton Rameswar Panda and Amt K. Roy-Chowdhury, Senor Member, IEEE

More information

Smoothing Spline ANOVA for variable screening

Smoothing Spline ANOVA for variable screening Smoothng Splne ANOVA for varable screenng a useful tool for metamodels tranng and mult-objectve optmzaton L. Rcco, E. Rgon, A. Turco Outlne RSM Introducton Possble couplng Test case MOO MOO wth Game Theory

More information

Journal of Process Control

Journal of Process Control Journal of Process Control (0) 738 750 Contents lsts avalable at ScVerse ScenceDrect Journal of Process Control j ourna l ho me pag e: wwwelsevercom/locate/jprocont Decentralzed fault detecton and dagnoss

More information

Two-Dimensional Supervised Discriminant Projection Method For Feature Extraction

Two-Dimensional Supervised Discriminant Projection Method For Feature Extraction Appl. Math. Inf. c. 6 No. pp. 8-85 (0) Appled Mathematcs & Informaton cences An Internatonal Journal @ 0 NP Natural cences Publshng Cor. wo-dmensonal upervsed Dscrmnant Proecton Method For Feature Extracton

More information

Steps for Computing the Dissimilarity, Entropy, Herfindahl-Hirschman and Accessibility (Gravity with Competition) Indices

Steps for Computing the Dissimilarity, Entropy, Herfindahl-Hirschman and. Accessibility (Gravity with Competition) Indices Steps for Computng the Dssmlarty, Entropy, Herfndahl-Hrschman and Accessblty (Gravty wth Competton) Indces I. Dssmlarty Index Measurement: The followng formula can be used to measure the evenness between

More information

Hermite Splines in Lie Groups as Products of Geodesics

Hermite Splines in Lie Groups as Products of Geodesics Hermte Splnes n Le Groups as Products of Geodescs Ethan Eade Updated May 28, 2017 1 Introducton 1.1 Goal Ths document defnes a curve n the Le group G parametrzed by tme and by structural parameters n the

More information

Improvement of Spatial Resolution Using BlockMatching Based Motion Estimation and Frame Integration

Improvement of Spatial Resolution Using BlockMatching Based Motion Estimation and Frame. Integration Improvement of Spatal Resoluton Usng BlockMatchng Based Moton Estmaton and Frame Integraton Danya Suga and Takayuk Hamamoto Graduate School of Engneerng, Tokyo Unversty of Scence, 6-3-1, Nuku, Katsuska-ku,

More information

Hierarchical clustering for gene expression data analysis

Hierarchical clustering for gene expression data analysis Herarchcal clusterng for gene expresson data analyss Gorgo Valentn e-mal: valentn@ds.unm.t Clusterng of Mcroarray Data. Clusterng of gene expresson profles (rows) => dscovery of co-regulated and functonally

More information

Unsupervised Co-segmentation of 3D Shapes via Functional Maps

Unsupervised Co-segmentation of 3D Shapes via Functional Maps Unsupervsed Co-segmentaton of 3D Shapes va Functonal aps Jun Yang School of Electronc and Informaton Engneerng, Lanzhou Jaotong Unversty, Lanzhou 730070, P. R. Chna yangj@mal.lzjtu.cn Zhenhua Tan School

More information

A Deflected Grid-based Algorithm for Clustering Analysis

A Deflected Grid-based Algorithm for Clustering Analysis A Deflected Grd-based Algorthm for Clusterng Analyss NANCY P. LIN, CHUNG-I CHANG, HAO-EN CHUEH, HUNG-JEN CHEN, WEI-HUA HAO Department of Computer Scence and Informaton Engneerng Tamkang Unversty 5 Yng-chuan

More information

Feature Extractions for Iris Recognition

Feature Extractions for Iris Recognition Feature Extractons for Irs Recognton Jnwook Go, Jan Jang, Yllbyung Lee, and Chulhee Lee Department of Electrcal and Electronc Engneerng, Yonse Unversty 134 Shnchon-Dong, Seodaemoon-Gu, Seoul, KOREA Emal:

More information

Fuzzy C-Means Initialized by Fixed Threshold Clustering for Improving Image Retrieval

Fuzzy C-Means Initialized by Fixed Threshold Clustering for Improving Image Retrieval Fuzzy -Means Intalzed by Fxed Threshold lusterng for Improvng Image Retreval NAWARA HANSIRI, SIRIPORN SUPRATID,HOM KIMPAN 3 Faculty of Informaton Technology Rangst Unversty Muang-Ake, Paholyotn Road, Patumtan,

More information

Deep Classification in Large-scale Text Hierarchies

Deep Classification in Large-scale Text Hierarchies Deep Classfcaton n Large-scale Text Herarches Gu-Rong Xue Dkan Xng Qang Yang 2 Yong Yu Dept. of Computer Scence and Engneerng Shangha Jao-Tong Unversty {grxue, dkxng, yyu}@apex.sjtu.edu.cn 2 Hong Kong

More information

Term Weighting Classification System Using the Chi-square Statistic for the Classification Subtask at NTCIR-6 Patent Retrieval Task

Term Weighting Classification System Using the Chi-square Statistic for the Classification Subtask at NTCIR-6 Patent Retrieval Task Proceedngs of NTCIR-6 Workshop Meetng, May 15-18, 2007, Tokyo, Japan Term Weghtng Classfcaton System Usng the Ch-square Statstc for the Classfcaton Subtask at NTCIR-6 Patent Retreval Task Kotaro Hashmoto

More information

Face Recognition using 3D Directional Corner Points

Face Recognition using 3D Directional Corner Points 2014 22nd Internatonal Conference on Pattern Recognton Face Recognton usng 3D Drectonal Corner Ponts Xun Yu, Yongsheng Gao School of Engneerng Grffth Unversty Nathan, QLD, Australa xun.yu@grffthun.edu.au,

More information

Transductive Regression Piloted by Inter-Manifold Relations

Transductive Regression Piloted by Inter-Manifold Relations Huan Wang IE, The Chnese Unversty of Hong Kong, Hong Kong Shucheng Yan Thomas Huang ECE, Unversty of Illnos at Urbana Champagn, USA Janzhuang Lu Xaoou Tang IE, The Chnese Unversty of Hong Kong, Hong Kong

More information