
IEEE TRANSACTIONS ON IMAGE PROCESSING, VOL. 28, NO. 2, FEBRUARY 2019

Zero-Shot Learning via Category-Specific Visual-Semantic Mapping and Label Refinement

Li Niu, Jianfei Cai, Senior Member, IEEE, Ashok Veeraraghavan, and Liqing Zhang

Abstract—Zero-shot learning (ZSL) aims to classify a test instance from an unseen category based on the training instances from seen categories, in which the gap between seen and unseen categories is generally bridged via a visual-semantic mapping between the low-level visual feature space and the intermediate semantic space. However, the visual-semantic mapping (i.e., projection) learnt from seen categories may not generalize well to unseen categories, which is known as the projection domain shift in ZSL. To address this projection domain shift issue, we propose a method named Adaptive Embedding ZSL (AEZSL) to learn an adaptive visual-semantic mapping for each unseen category, followed by progressive label refinement. Moreover, to avoid learning a visual-semantic mapping for each unseen category in the large-scale classification task, we additionally propose a deep adaptive embedding model named Deep AEZSL (DAEZSL), which shares the same idea (i.e., the visual-semantic mapping should be category-specific and related to the semantic space) with AEZSL but only needs to be trained once and can be applied to an arbitrary number of unseen categories. Extensive experiments demonstrate that our proposed methods achieve state-of-the-art results for image classification on three small-scale benchmark datasets and one large-scale benchmark dataset.

Index Terms—Zero-shot learning (ZSL), domain adaptation.

I. INTRODUCTION

CONVENTIONAL classification tasks require the training and test categories to be identical, but collecting annotated data for all categories is time-consuming and expensive in large-scale or fine-grained classification tasks.
Therefore, Zero-Shot Learning (ZSL) [1]–[3], which aims to recognize test instances from categories previously unseen in the training stage, has become increasingly popular in the field of computer vision.

Manuscript received December 13, 2017; revised July 4, 2018; accepted September 13, 2018. Date of publication September 8, 2018; date of current version October 3, 2018. This work was supported in part by the Key Basic Research Program of Shanghai under Grant 15JC and in part by the National Basic Research Program of China under Grant 015CB. The associate editor coordinating the review of this manuscript and approving it for publication was Prof. Aydin Alatan. (Corresponding author: Li Niu.) L. Niu was with the Electrical and Computer Engineering Department, Rice University, Houston, TX, USA. He is now with the Computer Science and Engineering Department, Shanghai Jiao Tong University, Shanghai 200240, China (e-mail: ustcnewly@sjtu.edu.cn). J. Cai is with the School of Computer Engineering, Nanyang Technological University, Singapore (e-mail: asjfcai@ntu.edu.sg). A. Veeraraghavan is with the Electrical and Computer Engineering Department, Rice University, Houston, TX, USA (e-mail: vashok@rice.edu). L. Zhang is with the Computer Science and Engineering Department, Shanghai Jiao Tong University, Shanghai 200240, China (e-mail: zhang-lq@cs.sjtu.edu.cn). Color versions of one or more of the figures in this paper are available online. Digital Object Identifier /TIP

In ZSL, the gap between seen (i.e., training) categories and unseen (i.e., testing) categories is generally bridged via intermediate semantic representations. The semantic representation of a category is its representation in the semantic embedding space. One popular semantic representation is the manually designed attribute vector [1], [4], which is a high-level description of each category, such as the shape (e.g., cylindrical), material (e.g., cloth), and color (e.g., white).
Besides, there also exist automatically generated semantic representations [5]–[8], including textual features extracted from an online corpus corresponding to a category, or the word vector of the category name. Recently, based on the intermediate semantic representation (i.e., semantic embedding), many ZSL approaches have been proposed, which can be roughly categorized into Semantic Relatedness (SR) and Semantic Embedding (SE) methods according to [9] and [10]. In particular, the first category of SR methods [11]–[13] tends to learn the visual classifiers for unseen categories using the similarities between unseen and seen categories, while the second category of SE methods, which attracts more research interest [1], [5], [6], [10], [14]–[27], aims to build the semantic link between visual features and semantic representations using mapping functions. More details of SR and SE methods can be found in Section II.

For Semantic Embedding (SE) methods, a mapping function needs to be learnt from the seen categories in the training stage and then applied to the unseen categories in the testing stage. However, the visual-semantic mappings of the seen categories and the unseen categories might be considerably different. In this sense, the learnt mapping (i.e., projection) may not generalize well to the test set and thus results in poor performance, leading to the recently highlighted projection domain shift, which was first mentioned in [20] and then theoretically proved in [4]. To cope with the projection domain shift, some semi-supervised or transductive ZSL approaches (the two terminologies are often used interchangeably in the field of ZSL) have been proposed, which utilize unlabeled test instances from unseen categories in the training stage. In general, these methods either learn a common mapping function for both seen and unseen categories, or adapt the mapping function learnt on the seen categories to the unseen categories.
Nevertheless, this may be insufficient to tackle the projection domain shift, because the mapping function of each individual category varies significantly.

© IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.

For example, as shown in Fig. 3, the categories "badminton court" and "bedchamber" share the same "cloth" attribute, but in fact the visual appearances of "cloth" in these two categories vary greatly. Another example is that the categories "recycling plant" and "landfill" both have the attribute "cluttered space", but their manifestations are also considerably different (see Section V for more details). To this end, we focus on the projection domain shift and address this problem by learning a category-specific mapping function for each unseen category, which has never been studied before. Since we do not have labeled instances for the unseen categories, the mapping functions of unseen categories can only be learnt by transferring from those of seen categories. However, it is hard to tell which seen categories are semantically closer to a given unseen category w.r.t. a certain entry in the semantic representation. Thus, we make a simple yet effective assumption: when the semantic representations of two categories are similar, the common non-zero entries of these two semantic representations should be semantically similar, and thus the mapping functions of these two categories should also be similar. Specifically, for each unseen category, we calculate the similarities between this unseen category and all the seen categories based on their semantic representations, and assign higher weights to the classification tasks corresponding to the more similar seen categories. This idea can be unified with many existing ZSL works with various forms of classification losses. As an instantiation, we build our method upon ESZSL [4], considering its simplicity and effectiveness, and name the result Adaptive Embedding ZSL (AEZSL). Note that for AEZSL, we only utilize the semantic representations of unseen categories to learn the mapping functions.
In order to further utilize the unlabeled instances from unseen categories, like previous semi-supervised or transductive ZSL works [10], [15], [18]–[20], we propose a progressive label refinement strategy following AEZSL.

One problem of AEZSL is its inefficiency for the large-scale classification task with a large number of unseen categories, because one visual-semantic mapping needs to be learnt for each unseen category. Thus, we aim to design a model which only needs to be trained once on seen categories but can be applied to any unseen category without retraining. With this aim, we develop a Deep Adaptive Embedding ZSL (DAEZSL) model, which only utilizes seen categories in the training stage but generalizes to an arbitrary number of unseen categories. Specifically, instead of learning one mapping for each unseen category based on its semantic representation as in AEZSL, we target learning a projection function from the semantic representation to the visual-semantic mapping. In this sense, given a new unseen category, its visual-semantic mapping can be easily generated based on its semantic representation. In the testing stage, given a set of unseen categories associated with semantic representations, our model is able to generate category-specific feature weights for each unseen category, which can better fit the classification tasks for unseen categories.

Our contributions are threefold: 1) this is the first work to address the projection domain shift by learning category-specific visual-semantic mappings, with the idea that higher weights should be assigned to the classification tasks of more relevant seen categories for each unseen category.
As an instantiation, we build our AEZSL method upon ESZSL [4], followed by a progressive label refinement strategy; 2) we additionally propose a deep adaptive embedding model DAEZSL for large-scale ZSL, which needs to be trained only once on seen categories but can be applied to an arbitrary unseen category; 3) comprehensive experiments are conducted on three small-scale datasets and one large-scale dataset to demonstrate the effectiveness of our approaches.

II. RELATED WORK

As discussed in Section I, the existing Zero-Shot Learning (ZSL) methods can be categorized into Semantic Relatedness (SR) approaches and Semantic Embedding (SE) approaches. From another perspective, ZSL methods can be categorized into standard ZSL and transductive/semi-supervised ZSL, based on whether the unlabeled test instances from unseen categories are used in the training stage. Moreover, several deep learning models have been proposed for ZSL. Additionally, since the terminology "domain shift" is commonly used in the field of domain adaptation, we briefly describe the difference between this terminology as used in ZSL and in domain adaptation.

A. Semantic Relatedness (SR) and Semantic Embedding (SE) ZSL

For SR approaches, the methods in [11]–[13] construct the visual classifiers for unseen categories from those for seen categories based on the semantic similarities between seen and unseen categories. Then, an extension was made in [8], which assumes that the visual classifiers for both seen and unseen categories can be represented by a set of base classifiers. The above SR approaches do not exhibit an obvious projection domain shift problem, but they cannot take full advantage of the semantic representations. Moreover, the above works did not discuss how to utilize the unlabeled test instances in the training stage. In contrast, our methods can fully exploit the semantic representations and leverage the unlabeled test instances during the training procedure.
SE approaches can be further divided into three groups based on different strategies for building the semantic link between the visual feature space and the semantic representation space. The first group of methods [1], [14]–[16] projects the visual features to semantic representations based on the learnt attribute classifiers. The second group of methods [18], [19], [29], [30] targets projecting the semantic representations to visual features based on a learnt dictionary. It is worth mentioning that pseudo training samples are generated for unseen categories in [30] based on the learnt mapping. The third group of methods [5], [6], [10], [20]–[27] tends to map the visual feature space and the semantic representation space into a common space, or to learn a mapping function which measures the compatibility between visual features and semantic representations. Among the above SE approaches, the approaches in [1], [5], [14], [22], [23],

and [26] do not address the projection domain shift, while the remaining transductive or semi-supervised approaches can somehow account for the projection domain shift problem, as detailed next.

B. Transductive or Semi-Supervised ZSL

Some transductive or semi-supervised Semantic Embedding (SE) methods [10], [15], [18]–[21], [25], [31] can alleviate the projection domain shift by utilizing the unlabeled test instances from unseen categories in the training phase. Specifically, the projection domain shift is rectified in [20] by projecting visual features and multiple semantic embeddings of test instances into a common subspace via Canonical Correlation Analysis (CCA). Label propagation for zero-shot learning is used in both [20] and [31]. The approach in [18] learns dictionaries for the seen categories and the unseen categories separately while enforcing these two dictionaries to be close. The methods in [10], [19], [21], and [25] learn the mapping function and simultaneously infer the test labels. In [15] and [21], a Laplacian regularizer is employed based on the semantic representations or the label representations of test instances. The above transductive or semi-supervised ZSL approaches are able to account for the projection domain shift problem. However, the approach in [20] requires multi-view semantic embeddings, which are often unavailable. Besides, the methods in [10], [15], [18], [19], [21], and [25] either learn a common mapping function for all the categories or adapt the mapping function learnt from the seen categories to the unseen categories, which is likely to be improper for real-world applications since the mapping function of each category may vary significantly. In contrast, our methods learn a category-specific mapping for each unseen category, which can capture the diverse semantic meanings of the same attribute for different categories.

C.
Deep ZSL

Compared with traditional ZSL methods based on extracted deep learning features, there are fewer end-to-end deep ZSL models [6], [7], [32]–[34]. These works learn the mapping from visual features to semantic representations [7], map the visual feature space and the semantic space to a common space [6], [32], [34], or directly learn classifiers for different categories based on their semantic representations [33]. However, none of the above deep ZSL methods mentions the projection domain shift issue. In contrast, we focus on tackling the projection domain shift in this paper and propose a deep adaptive embedding model DAEZSL, which can adapt the visual-semantic mapping to different categories implicitly by learning category-specific feature weights.

D. Domain Adaptation

Domain adaptation methods aim to address domain shift issues, i.e., to reduce the distribution mismatch between the source domain (i.e., training set) and the target domain (i.e., test set). Regarding the domain shift problem, domain adaptation methods focus on the difference in marginal probability or class-conditional probability between the source domain and the target domain, while ZSL methods focus on the difference in visual-semantic mappings (i.e., projections) between seen categories and unseen categories, which is referred to as the projection domain shift. Thus, the terminology "domain shift" has quite different meanings in these two fields. In this paper, we focus on the projection domain shift problem in ZSL.

III. BACKGROUND

In this paper, for ease of presentation, a vector/matrix is denoted by a lowercase/uppercase letter in boldface. The transpose of a vector/matrix is denoted using the superscript $\top$. We use $A \cdot B$ to denote the dot product of two matrices. Moreover, we use $I$ to denote the identity matrix and $A^{-1}$ to denote the inverse matrix of $A$. We use the superscript $s$ (resp., $t$) to indicate seen (resp., unseen) categories, and omit the superscript when not being specific about seen or unseen categories.

A. Problem Definition

Assume we have $C_s$ seen categories and $C_t$ unseen categories.
Let us denote the training (resp., test) data from seen (resp., unseen) categories as $X^s \in \mathbb{R}^{d \times n_s}$ (resp., $X^t \in \mathbb{R}^{d \times n_t}$), where $d$ is the dimensionality of visual features and $n_s$ (resp., $n_t$) is the number of training (resp., test) instances from seen (resp., unseen) categories. We assume each category is associated with an $a$-dim semantic representation, and thus the semantic representations of seen (resp., unseen) categories can be stacked as $A^s \in \mathbb{R}^{a \times C_s}$ (resp., $A^t \in \mathbb{R}^{a \times C_t}$). In order to bridge the gap between visual features and semantic representations and simultaneously exploit the discriminative capacity of semantic representations, inspired by Romera-Paredes and Torr [4], Guo et al. [5], and Qiao et al. [35], we use the mapping matrix $W \in \mathbb{R}^{d \times a}$ to match visual features with semantic representations, i.e., $X^{s\top} W A^s$, which measures the compatibility between instances and categories. Ideally, the most compatible semantic representation of each training instance should be from its ground-truth category. In the testing stage, a test instance $x^t$ is assigned to the category corresponding to the maximum value in the $C_t$-dim vector $x^{t\top} W A^t$, in which $W A^t$ is essentially the stacked visual classifiers for unseen categories.

B. Embarrassingly Simple Zero-Shot Learning (ESZSL)

Before introducing our method, we briefly review Embarrassingly Simple Zero-Shot Learning (ESZSL) [4], upon which we build our own method considering its simplicity and effectiveness. ESZSL is formulated as

$\min_W \|X^{s\top} W A^s - Y^s\|_F^2 + \gamma \|W A^s\|_F^2 + \lambda \|X^{s\top} W\|_F^2 + \beta \|W\|_F^2$, (1)

in which $\gamma$, $\lambda$, and $\beta$ are trade-off parameters, and $Y^s \in \mathbb{R}^{n_s \times C_s}$ is a binary label matrix with the $i$-th row being the label vector of the $i$-th training instance. Note that in (1), $\|X^{s\top} W A^s - Y^s\|_F^2$ can fully exploit the discriminability of semantic representations by assigning each instance to the category with the most compatible semantic representation.
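As a concrete illustration of this compatibility-based prediction rule, here is a minimal NumPy sketch with toy dimensions; the random matrices stand in for a learnt mapping $W$ and the unseen-category semantic matrix $A^t$ (all names and sizes are illustrative, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
d, a, C_t = 6, 4, 3                   # toy sizes: feature dim, attribute dim, #unseen classes

W = rng.standard_normal((d, a))       # stand-in for the learnt visual-semantic mapping
A_t = rng.standard_normal((a, C_t))   # columns: semantic representations of unseen classes
x_t = rng.standard_normal(d)          # one test instance's visual feature

# W @ A_t stacks the visual classifiers of the unseen categories;
# the instance is assigned to the class with maximum compatibility x^T W A^t.
scores = x_t @ W @ A_t                # C_t-dimensional compatibility vector
pred = int(np.argmax(scores))
```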

Specifically, the multi-class classification error is minimized by both learning how to predict attributes via $X^{s\top} W$ and also considering the importance of each attribute in the final classification. $\|W A^s\|_F^2$ controls the complexity of the projection of semantic embeddings onto the visual feature space, which allows fair comparisons of semantic embeddings from different categories. $\|X^{s\top} W\|_F^2$ bounds the variance of the projection of visual features onto the semantic embedding space, which makes the approach invariant to diverse feature distributions. $\|W\|_F^2$ is a standard weight penalty to control the complexity of $W$. By setting the derivative of (1) w.r.t. $W$ to zero, we obtain

$(X^s X^{s\top} + \gamma I) W A^s A^{s\top} + \lambda (X^s X^{s\top} + \frac{\beta}{\lambda} I) W = X^s Y^s A^{s\top}$. (2)

When setting $\beta = \gamma\lambda$, the problem in (2) has a closed-form solution:

$W = (X^s X^{s\top} + \gamma I)^{-1} X^s Y^s A^{s\top} (A^s A^{s\top} + \lambda I)^{-1}$. (3)

IV. OUR METHODS

In this section, we first build our AEZSL method upon ESZSL to learn category-specific mapping matrices in Section IV-A1, followed by a label refinement strategy in Section IV-A2. Moreover, we propose a deep adaptive embedding model named DAEZSL for large-scale ZSL in Section IV-B, which shares the same idea with AEZSL.

A. Adaptive Embedding Zero-Shot Learning

In this section, we first introduce how to learn the category-specific visual-semantic mapping, and then describe our label refinement strategy.

1) Category-Specific Visual-Semantic Mapping: Since the visual-semantic mappings of different categories could be largely different, we aim to learn a category-specific mapping matrix for each unseen category (i.e., $W_c$ for the $c$-th unseen category) to address the projection domain shift. Because no labeled instances are provided for the unseen categories, we can only transfer from the mapping matrices of seen categories. However, it is a challenging task to determine which seen categories are more semantically similar to a given unseen category w.r.t. a certain entry in the semantic representation.
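With $\beta = \gamma\lambda$, the closed-form solution (3) can be sanity-checked numerically: the gradient of objective (1) should vanish at the recovered $W$. A small NumPy sketch on toy random data (sizes and values are illustrative only):

```python
import numpy as np

rng = np.random.default_rng(1)
d, a, n_s, C_s = 8, 5, 20, 4
gamma, lam = 0.1, 0.1                      # trade-off parameters; beta = gamma * lam

X = rng.standard_normal((d, n_s))          # X^s: columns are training instances
A = rng.standard_normal((a, C_s))          # A^s: columns are seen-class semantics
Y = np.eye(C_s)[rng.integers(0, C_s, n_s)] # binary label matrix, n_s x C_s

# Closed-form solution (3):
W = (np.linalg.inv(X @ X.T + gamma * np.eye(d)) @ X @ Y @ A.T
     @ np.linalg.inv(A @ A.T + lam * np.eye(a)))

# Stationarity check: gradient of objective (1) at W, with beta = gamma * lam.
grad = (X @ (X.T @ W @ A - Y) @ A.T + gamma * W @ A @ A.T
        + lam * X @ X.T @ W + gamma * lam * W)
```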
To facilitate the transfer, we assume that when the semantic representations of two categories are similar, the common non-zero entries shared by these two semantic representations should be semantically similar, and thus the mapping matrices of these two categories should also be similar. Specifically, recall that in (1), each column in $X^{s\top} W A^s - Y^s$ corresponds to the classification task for one seen category, and the tasks for different seen categories are dependent on one another through the common $W$. To avoid ambiguity, in the remainder of this paper, we use $c$ to denote the index of an unseen category and $\tilde{c}$ to denote the index of a seen category. Given the $c$-th unseen category, in order to ensure that $W_c$ is close to the mapping matrices of similar seen categories, we assign higher weights to the classification tasks of the more similar seen categories. In particular, we formulate this idea as $(X^{s\top} W_c A^s - Y^s) S_c$, where $S_c \in \mathbb{R}^{C_s \times C_s}$ is a diagonal matrix with the $\tilde{c}$-th diagonal element being the cosine similarity $s_{\tilde{c}}^c = \frac{a_c^{t\top} a_{\tilde{c}}^s}{\|a_c^t\| \|a_{\tilde{c}}^s\|}$, which measures the similarity between the semantic representation $a_c^t$ of the $c$-th unseen category and the semantic representation $a_{\tilde{c}}^s$ of the $\tilde{c}$-th seen category. Note that multiplying by $S_c$ is equivalent to multiplying the $\tilde{c}$-th column of $X^{s\top} W_c A^s - Y^s$ by $s_{\tilde{c}}^c$. For better explanation, assuming that the $c$-th unseen category is similar to the $\tilde{c}$-th seen category and their semantic representations share the common non-zero entry indices $\{j_1, j_2, \ldots, j_l\}$, then a higher weight $s_{\tilde{c}}^c$ is assigned to the classification task $X^{s\top} W_c a_{\tilde{c}}^s - y_{\tilde{c}}^s$, with $y_{\tilde{c}}^s$ being the $\tilde{c}$-th column of $Y^s$. In this way, the learnt $W_c$ should be closer to the mapping matrix of the $\tilde{c}$-th seen category w.r.t. the $\{j_1, j_2, \ldots, j_l\}$-th columns.
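The diagonal weight matrix $S_c$ can be built directly from cosine similarities. A short sketch on hypothetical toy data (`similarity_weights` is an illustrative helper, not part of the paper):

```python
import numpy as np

rng = np.random.default_rng(2)
a, C_s, C_t = 5, 4, 3
A_s = rng.random((a, C_s))        # seen-class semantic representations (columns)
A_t = rng.random((a, C_t))        # unseen-class semantic representations (columns)

def similarity_weights(a_t, A_s):
    """Diagonal matrix S_c: cosine similarity between one unseen class's
    semantic vector and every seen class's semantic vector."""
    sims = (A_s.T @ a_t) / (np.linalg.norm(A_s, axis=0) * np.linalg.norm(a_t))
    return np.diag(sims)

# One S_c per unseen category; right-multiplying (X^T W_c A^s - Y^s) by S_c
# up-weights the columns (tasks) of the most similar seen categories.
S_list = [similarity_weights(A_t[:, c], A_s) for c in range(C_t)]
```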
To this end, we solve for the $W_c$'s of all unseen categories simultaneously using the following formulation:

$\min_{\{W_c\}} \sum_{c=1}^{C_t} \|(X^{s\top} W_c A^s - Y^s) S_c\|_F^2 + \lambda_1 \sum_{c=1}^{C_t} \|X^{s\top} W_c\|_F^2 + \lambda_2 \sum_{c=1}^{C_t} \|W_c\|_F^2 + \lambda_3 \sum_{c < \tilde{c}} \|W_c - W_{\tilde{c}}\|_F^2$, (4)

where $\lambda_1$, $\lambda_2$, and $\lambda_3$ are trade-off parameters, which can be obtained using cross-validation, and $\sum_{c < \tilde{c}} \|W_c - W_{\tilde{c}}\|_F^2$ is a co-regularizer which encourages the $W_c$'s of different unseen categories to share some common parts. Note that we omit the regularizer $\|W_c A^s\|_F^2$ from (1) for ease of optimization. When $\lambda_3$ approaches infinity, the mappings of all categories are enforced to be the same $W$. In this case, the problem in (4) reduces to

$\min_W \|(X^{s\top} W A^s - Y^s) \bar{S}\|_F^2 + \lambda_1 \|X^{s\top} W\|_F^2 + \lambda_2 \|W\|_F^2$, (5)

in which $\bar{S}$ is a diagonal matrix with the $\tilde{c}$-th diagonal element being $\sqrt{\frac{1}{C_t}\sum_{c=1}^{C_t} (s_{\tilde{c}}^c)^2}$. Compared with (1), we thus assign higher weights to the classification tasks corresponding to the seen categories which are closer to the unseen categories overall, based on $\bar{S}$. The problem in (4) can be solved in an alternating fashion by updating one $W_c$ while fixing all the other $W_c$'s. We leave the detailed solution to Appendix A. After learning the $W_c$'s, we can obtain the visual classifier for the $c$-th unseen category as $p_c = W_c a_c^t$. In the testing stage, given a test instance $x^t$ from the unseen categories, its decision value for the $c$-th unseen category is $p_c^\top x^t$.

Discussion: The projection domain shift problem has been theoretically proved in [4], in which the seen (resp., unseen) categories are referred to as the source (resp., target) domain. A source (resp., target) domain sample can be represented as $x_{i,\tilde{c}}^s = \mathrm{vec}(x_i^s a_{\tilde{c}}^{s\top})$ (resp., $x_{i,c}^t = \mathrm{vec}(x_i^t a_c^{t\top})$). By assuming that the difference in visual feature distribution between the two domains (the $x_i^s$'s and $x_i^t$'s) is negligible, the domain shift mainly depends on the difference in semantic representations between the two domains (the $a_{\tilde{c}}^s$'s and $a_c^t$'s). Then, two extreme

cases are presented in [4], namely when the semantic representations of the two domains are identical or orthogonal. Specifically, when the semantic representations of the two domains are identical (i.e., $a_c^t = a_{\tilde{c}}^s$), the error bound of the learnt classifier is approximately that of a standard classifier without projection domain shift. In this case, we have $s_{\tilde{c}}^c = 1$, which allows the maximum transfer. When the semantic representations are orthogonal (i.e., $a_c^{t\top} a_{\tilde{c}}^s = 0$), the error bound is vacuous and no transfer can be done. In this case, we have $s_{\tilde{c}}^c = 0$, which means no transfer. So the analysis for our problem accords with that in [4], which verifies that it is reasonable to control how much to transfer from the seen categories based on the cosine similarities.

In this paper, we build our AEZSL method upon ESZSL due to its simplicity and effectiveness. However, it is worth mentioning that the idea of AEZSL, i.e., assigning higher weights to the classification tasks of the more similar seen categories for each unseen category, can be incorporated into many existing ZSL frameworks with slight modification based on their learning paradigms and losses.

2) Progressive Label Refinement: Note that in Section IV-A1, we only utilize the semantic representations of unseen categories to learn the adaptive mapping matrices. In order to further adapt the mapping matrices to the unseen categories by utilizing the unlabeled test instances, we propose a progressive approach that updates the visual classifiers and refines the predicted test labels alternately, similar to some progressive semi-supervised learning approaches like [36]. Unlike traditional semi-supervised learning, which requires the labeled and unlabeled instances to be from the same set of categories, zero-shot learning does not have any labeled instances from the unseen categories. So we divide the test set into a confident set $L$ and an unconfident set $U$, in which the labels in $L$ (i.e., $Y^l$) are expected to be relatively more accurate than those in $U$ (i.e., $Y^u$).
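The selection step of one refinement iteration, i.e., picking the $k$ most confident test instances using a softmax over decision values as the text defines, can be sketched as follows; all data here are random stand-ins with illustrative sizes:

```python
import numpy as np

rng = np.random.default_rng(3)
d, C_t, n_u, k = 6, 4, 12, 3
P = rng.standard_normal((d, C_t))     # stacked unseen-class classifiers p_c
X_u = rng.standard_normal((d, n_u))   # unconfident test instances (columns)

scores = X_u.T @ P                                     # n_u x C_t decision values
shifted = scores - scores.max(axis=1, keepdims=True)   # numerically stable softmax
probs = np.exp(shifted) / np.exp(shifted).sum(axis=1, keepdims=True)
labels = scores.argmax(axis=1)                         # assigned label c(x^t)
conf = probs[np.arange(n_u), labels]                   # softmax confidence score

# Move the k most confident instances from U into the confident set L.
confident_idx = np.argsort(-conf)[:k]
```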
Note that $Y^l$ and $Y^u$ are both binary label matrices, similar to $Y^s$ in (4). Initially, $L$ is an empty set and $U$ is the entire test set, with initial labels $Y^u$ predicted by the initial classifiers $p_c$'s, which are obtained based on the $W_c$'s from Section IV-A1. Then, we move the $k$ most confident instances from $U$ into $L$, and update the visual classifiers based on the new confident and unconfident sets. We repeat this process iteratively until $U$ becomes empty, and output $Y^l$ as the final predicted test labels. The details of each iteration are described as follows, in which we stack the $p_c$'s for all unseen categories as $P$, and split $X^t$ into $X^l$ and $X^u$ corresponding to $L$ and $U$.

In each iteration, we first use the visual classifiers $P$ from the previous iteration to select the $k$ most confident instances from $U$ based on the confidence score, which is defined as a softmax function $\mathrm{conf}(x^t) = \frac{\exp(p_{c(x^t)}^\top x^t)}{\sum_{\hat{c}} \exp(p_{\hat{c}}^\top x^t)}$, with $c(x^t)$ being the assigned label of $x^t$, corresponding to the classifier $p_{c(x^t)}$ which achieves the highest prediction score on $x^t$. Note that the labels of the selected $k$ instances are updated to $c(x^t)$, while the labels of the remaining instances in $U$ stay unchanged. After moving the $k$ most confident instances from $U$ into $L$, we update $P$ by changing some predicted labels in $U$ (i.e., $X^{u\top} P$) while using $L$ as weak supervision. In particular, we update $P$ with the aim to flip the labels of the set $U$ predicted by $P$ (i.e., $X^{u\top} P$) based on a group-lasso regularizer $\|X^{u\top} P - Y^u\|_{2,1}$, which allows some rows in $X^{u\top} P$ to be inconsistent with those in $Y^u$. This means that the group-lasso regularizer encourages the predicted labels of some unconfident instances (the rows in $X^{u\top} P - Y^u$ with non-zero entries) to be flipped. Moreover, we rely on the similarities among unconfident instances and the similarities among unseen categories to regulate the label flipping, as explained next.
1) For the similarities among unconfident instances, we employ a standard Laplacian regularizer based on the smoothness assumption that the predicted label vectors of two unconfident instances should be close to each other when their visual features are similar. 2) For the similarities among unseen categories, we construct a transition matrix $\hat{S}$ to characterize the probabilities that one category label is flipped to another, in which $\hat{S}_{i,j}$ is the cosine similarity $\frac{a_i^{t\top} a_j^t}{\|a_i^t\| \|a_j^t\|}$, similar to that in Section IV-A1. With the transition matrix $\hat{S}$, we employ a coherent regularizer $\mathrm{tr}(Y^u \hat{S} P^\top X^u)$, which enforces the predicted labels $X^{u\top} P$ to be coherent with the expected transited labels $Y^u \hat{S}$ based on the transition probabilities. To be more specific, denoting $Y_1 = X^{u\top} P$ and $Y_2 = Y^u \hat{S}$, maximizing $\mathrm{tr}(Y_2 Y_1^\top)$ enforces each row of $Y_1$ and $Y_2$ (i.e., the label vector of each sample) to be coherent. Note that we set the diagonal elements of $\hat{S}$ to zero to encourage the labels to be flipped. To this end, the formulation for updating $P$ in each iteration can be written as

$\min_P \frac{1}{2}\|X^{l\top} P - Y^l\|_F^2 + \gamma_1 \|X^{u\top} P - Y^u\|_{2,1} - \gamma_2 \mathrm{tr}(Y^u \hat{S} P^\top X^u) + \gamma_3 \mathrm{tr}(P^\top X^u H^u X^{u\top} P)$, (6)

where $\gamma_1$, $\gamma_2$, and $\gamma_3$ are trade-off parameters, which can be obtained using cross-validation, and $H^u$ is the Laplacian matrix constructed based on $X^u$ following [1]. Specifically, we construct the similarity matrix $\tilde{H}^u$ with each entry $\tilde{H}_{i,j}^u$ being the inverse Euclidean distance between $x_i^u$ and $x_j^u$, and then produce the Laplacian matrix $H^u = \mathrm{diag}(\tilde{H}^u \mathbf{1}) - \tilde{H}^u$. The problem in (6) is not easy to solve due to the regularizer $\|X^{u\top} P - Y^u\|_{2,1}$; we leave its detailed solution to Appendix B. We refer to AEZSL with Label Refinement as AEZSL_LR.

Discussion: Since $p_c = W_c a_c^t$ (see Section IV-A1) and the $a_c^t$'s are fixed, updating the $p_c$'s implies updating the $W_c$'s. In other words, by updating the $p_c$'s based on the similarities among test instances and among unseen categories, we actually adapt the mapping matrices $W_c$'s to the test set implicitly.

B.
Deep Adaptive Embedding Zero-Shot Learning

One of the important applications of zero-shot learning is large-scale classification, since it is difficult to obtain sufficient well-labeled training data exhaustively for all categories.

However, our AEZSL method is not very efficient in this scenario, because we need to learn one visual-semantic mapping for each unseen category (e.g., over 20,000 test categories in the ImageNet 2011 21K dataset). This concern motivates us to design a model which is more suitable for large-scale ZSL. In order to mitigate the burden induced by a large set of unseen categories, we design a deep adaptive embedding model named Deep Adaptive Embedding ZSL (DAEZSL), which only needs to be trained once but can generalize to an arbitrary number of unseen categories.

Similar to Section IV-A1, we assume the visual-semantic mappings should be category-specific and related to the semantic representation of each category. To avoid learning one mapping for each unseen category as in Section IV-A1, one possible approach is learning a projection $f(\cdot)$ from the semantic representation $a_c$ to the visual-semantic mapping $W_c$, i.e., $f(a_c) = W_c$, so that given a new unseen category with semantic representation $a_c$, we can easily obtain its visual-semantic mapping $W_c$ as $W_c = f(a_c)$. However, the size of $W_c$ is usually very large (i.e., $d \times a$), so we adopt an alternative approach for the sake of computational efficiency: learning a projection $g(\cdot)$ from the semantic representation $a_c$ to a feature weight $m_c$, i.e., $g(a_c) = m_c$. The feature weight $m_c$ is applied to the visual feature via elementwise product, and thus has the same dimension $d$ as the visual feature. Since the size of $m_c$ (i.e., $d$) is much smaller than that of $W_c$ (i.e., $d \times a$), it is much more efficient to learn the projection $g(\cdot)$ instead of $f(\cdot)$. In fact, applying the feature weights of different categories can be treated as implicitly adapting the visual-semantic mapping to different categories, as explained next.
Assume we have a common visual-semantic mapping $W$. Given the $i$-th training instance with visual feature $x_i^s$, its decision values after applying the feature weight $m_{\tilde{c}}^s$ are

$(x_i^s \odot m_{\tilde{c}}^s)^\top W A^s = x_i^{s\top} (\bar{M}_{\tilde{c}}^s \odot W) A^s$, (7)

in which $\odot$ denotes the elementwise product and $\bar{M}_{\tilde{c}}^s$ is formed by horizontally stacking $a$ copies of $m_{\tilde{c}}^s$. Then, we define the implicit visual-semantic mapping $\bar{W}_{\tilde{c}}^s$ as

$\bar{W}_{\tilde{c}}^s = \bar{M}_{\tilde{c}}^s \odot W$, (8)

from which we can see that learning the feature weight $m_{\tilde{c}}^s$ is equivalent to adapting $W$ to $\bar{W}_{\tilde{c}}^s$ via the weight $\bar{M}_{\tilde{c}}^s$. Note that we simplify the task of adapting $W$ by using the weight $\bar{M}_{\tilde{c}}^s$ with all columns being the same $m_{\tilde{c}}^s$, so that the number of variables (i.e., the size of $m_{\tilde{c}}^s$) generated by $g(\cdot)$ is greatly reduced compared with the size of the visual-semantic mapping. In practice, we only need to learn $W$ and $g(\cdot)$ without explicitly producing the $\bar{W}_{\tilde{c}}^s$'s; in the remainder of this section, we use the implicit $\bar{W}_{\tilde{c}}^s$'s merely for the purpose of better explanation. Specifically, we expect the implicit $\bar{W}_{\tilde{c}}^s$'s to satisfy two properties, generalizability and specificity, analogous to AEZSL in (4) (category-specific $W_c$'s for specificity and the co-regularizer for generalizability).

1) Generalizability: On one hand, we expect any $\bar{W}_{\tilde{c}}^s$ to correctly classify the training instances from any category, which can be achieved by minimizing the square loss $\sum_{\tilde{c}=1}^{C_s} \|X^{s\top} \bar{W}_{\tilde{c}}^s A^s - Y^s\|_F^2$, with $X^s$ and $Y^s$ being the same as defined in (4). After defining the one-hot label vector of $x_i^s$ as $y_i^s$ with the $c(i)$-th entry being 1, and $M^s \in \mathbb{R}^{d \times C_s}$ as the aggregated feature weights over all $C_s$ categories with the $\tilde{c}$-th column being $m_{\tilde{c}}^s$, we have

$\sum_{\tilde{c}=1}^{C_s} \|X^{s\top} \bar{W}_{\tilde{c}}^s A^s - Y^s\|_F^2 = \sum_{\tilde{c}=1}^{C_s} \sum_{i=1}^{n_s} \|x_i^{s\top} \bar{W}_{\tilde{c}}^s A^s - y_i^{s\top}\|^2 = \sum_{i=1}^{n_s} \sum_{\tilde{c}=1}^{C_s} \|x_i^{s\top} (\bar{M}_{\tilde{c}}^s \odot W) A^s - y_i^{s\top}\|^2 = \sum_{i=1}^{n_s} \sum_{\tilde{c}=1}^{C_s} \|(x_i^s \odot m_{\tilde{c}}^s)^\top W A^s - y_i^{s\top}\|^2 = \sum_{i=1}^{n_s} \|(\tilde{X}_i^s \odot M^s)^\top W A^s - \bar{Y}_i^s\|_F^2$, (9)

in which $\tilde{X}_i^s \in \mathbb{R}^{d \times C_s}$ is formed by horizontally stacking $C_s$ copies of $x_i^s$, and $\bar{Y}_i^s \in \mathbb{R}^{C_s \times C_s}$ is formed by vertically stacking $C_s$ copies of $y_i^{s\top}$. Hence, in our implementation, given the $i$-th training instance, we duplicate $x_i^s$ and $y_i^s$ to $C_s$ copies, and apply the aggregated feature weights $M^s$ over all $C_s$ categories to the duplicated visual features $\tilde{X}_i^s$.
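The identity in (7), i.e., weighting the feature is the same as elementwise-masking the mapping, is easy to verify numerically; the data below are random stand-ins with toy sizes:

```python
import numpy as np

rng = np.random.default_rng(4)
d, a, C_s = 6, 4, 3
x = rng.standard_normal(d)          # visual feature x_i^s
m = rng.standard_normal(d)          # feature weight m_c^s for one category
W = rng.standard_normal((d, a))     # common visual-semantic mapping
A = rng.standard_normal((a, C_s))   # semantic representations (columns)

lhs = (x * m) @ W @ A               # weight the feature, then map: (x ⊙ m)^T W A^s
M_bar = np.tile(m[:, None], (1, a)) # a horizontally stacked copies of m
rhs = x @ (M_bar * W) @ A           # equivalently, map with the adapted M_bar ⊙ W
```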
Then, the decision value matrix of the $i$-th training instance, $(\bar{X}_i^s \odot M^s) W A^{s\top}$, can be easily obtained.

2) Specificity: On the other hand, we expect that $W_c^s$ can better fit the classification task corresponding to the $c$-th category. For the $i$-th training instance, we use $J_i = (\bar{X}_i^s \odot M^s) W A^{s\top}$ to denote its decision value matrix, in which $J_i^{c_1,c_2}$ is the decision value of the $c_2$-th category obtained by using $W_{c_1}^s$. In the following, we focus on the decision values of its ground-truth category $c(i)$, i.e., the $c(i)$-th column of $J_i$. In this case, similar to (7), the decision value of the $c(i)$-th seen category obtained by using $W_c^s$ is

$$J_i^{c,c(i)} = (x_i^s \odot m_c^s) W a_{c(i)}^{s\top} = x_i^s W_c^s a_{c(i)}^{s\top}. \tag{10}$$

Then, we expect $J_i^{c(i),c(i)}$ obtained by $W_{c(i)}^s$ to be larger than $J_i^{c,c(i)}$ obtained by $W_c^s$ for $c \neq c(i)$. With this aim, we employ the hinge loss $R_i = \sum_{c \neq c(i)} \max(0, J_i^{c,c(i)} - J_i^{c(i),c(i)} + \rho)$ to push $J_i^{c(i),c(i)}$ to be larger than $J_i^{c,c(i)}$ by a margin $\rho$ for $c \neq c(i)$. In our experiments, we empirically set $\rho$ as 0.5, considering that $J_i$ is regressed to the binary label matrix. By taking both generalizability and specificity into consideration, and incorporating the CNN used to extract visual features into our model, the loss function of our end-to-end DAEZSL model is designed as

$$\min_{\theta} \sum_{i=1}^{n^s} \left( \|J_i - \bar{Y}_i^s\|_F^2 + R_i \right), \tag{11}$$

in which $\theta = \{\theta_{CNN}, \theta_{g(\cdot)}, W\}$ with $\theta_{CNN}$ (resp., $\theta_{g(\cdot)}$) being the parameters of the CNN (resp., $g(\cdot)$). Based on (11), we aim to have implicit $W_c^s$'s which can generalize to other categories by minimizing $\|J_i - \bar{Y}_i^s\|_F^2$ and simultaneously better fit the classification tasks corresponding to their own categories by minimizing $R_i$. Note that the semantic representations of unseen

categories and unlabeled test instances are not utilized in the training stage.

Fig. 1. Illustration of our Deep Adaptive Embedding Zero-Shot Learning (DAEZSL) model. In the top flow, the feature of each input image is duplicated to C copies. In the bottom flow, the semantic representations of all C categories pass through multi-layer perceptrons (MLP) and generate the feature weights for the C categories, which are applied on the duplicated features via elementwise product. Then, the weighted features are fed into a fully connected (fc) layer (i.e., the visual-semantic mapping), followed by dot product with the semantic representations, and finally output the decision value matrix, based on which we minimize the training loss in the training stage and predict test instances in the testing stage.

Our DAEZSL model is illustrated in Fig. 1, from which we can see that $g(\cdot)$ is modeled by multi-layer perceptrons (MLP) and $W$ is modeled by a fully connected (fc) layer. Specifically, during the training process, we input the training images and the semantic representations of all $C^s$ seen categories, i.e., $A^s \in \mathbb{R}^{C^s \times a}$. In the top flow, the $d$-dim feature of the $i$-th training image is duplicated to $C^s$ copies, i.e., $\bar{X}_i^s \in \mathbb{R}^{C^s \times d}$. In the bottom flow, $A^s$ passes through the MLP and generates the feature weights $M^s \in \mathbb{R}^{C^s \times d}$ for the $C^s$ categories. After applying the generated feature weights on the duplicated visual features via elementwise product, the weighted features $\bar{X}_i^s \odot M^s$ are fed into the fc layer (i.e., the common visual-semantic mapping matrix $W$), which outputs $(\bar{X}_i^s \odot M^s) W$. Finally, we perform dot product on $(\bar{X}_i^s \odot M^s) W$ and $A^s$, leading to the decision value matrix $J_i$, based on which we employ the loss function in (11). Note that we train an end-to-end system to minimize the loss function in (11), during which the parameters of the CNN, the MLP, and the fc layer in Fig. 1 are jointly optimized. More details about the network architecture and training process can be found in Section V-B.
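As a toy check of the specificity term $R_i$ defined in (10) and used in (11), the sketch below (our own illustration; the example matrix is invented) computes the hinge loss from a decision value matrix and the ground-truth category index:

```python
import numpy as np

def specificity_loss(J, c_true, rho=0.5):
    """Hinge loss R_i of the paper: push the diagonal entry J[c(i), c(i)]
    (the ground-truth category scored by its own implicit mapping) above
    every J[c, c(i)] produced by another category's mapping, by margin rho."""
    col = J[:, c_true]                        # true-category scores under each W_c
    margins = np.maximum(0.0, col - col[c_true] + rho)
    margins[c_true] = 0.0                     # exclude c = c(i) from the sum
    return float(margins.sum())

# Invented 2-category example: the diagonal already dominates by the margin,
# so the loss is zero.
J = np.array([[0.9, 0.1],
              [0.2, 0.8]])
assert specificity_loss(J, 0) == 0.0
```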
In the prediction stage, given an unseen category with semantic representation $a_c^t$, we can easily generate its feature weight $m_c^t$ via the projection function $g(\cdot)$, which implies an adaptive visual-semantic mapping $W_c^t$. Intuitively, if $a_c^t$ is similar to $a_c^s$ of the $c$-th seen category, their generated feature weights $m_c^t$ and $m_c^s$ should be similar. Furthermore, their visual-semantic mapping matrices $W_c^t$ and $W_c^s$ should be close to each other. Therefore, the visual-semantic mapping of a given unseen category is expected to be correlated with those of similar seen categories, analogous to learning $W_c$ based on the similarity matrix $S_c$ in (4). We further elaborate on the prediction procedure based on Fig. 1. In particular, given the $i$-th test image $x_i^t$ and the set of unseen categories with size $C^t$, we use this test image and the semantic representations of all $C^t$ unseen categories $A^t \in \mathbb{R}^{C^t \times a}$ as input. The test image passes through the top flow and generates $C^t$ duplicated copies of visual features, i.e., $\bar{X}_i^t \in \mathbb{R}^{C^t \times d}$, while $A^t$ passes through the bottom flow and generates the feature weights $M^t \in \mathbb{R}^{C^t \times d}$ for all unseen categories. After performing $(\bar{X}_i^t \odot M^t) W A^{t\top}$, we can obtain a decision value matrix $J_i \in \mathbb{R}^{C^t \times C^t}$ for the $i$-th test image. Finally, we take the diagonal of $J_i$ as the decision value vector and classify this test image as the category corresponding to the highest decision value. Note that only the diagonal of $J_i$ is used for prediction because we assume that, for the $c$-th unseen category, the decision value $J_i^{c,c} = x_i^t W_c^t a_c^{t\top}$ (see (10)) obtained based on $W_c^t$ can best fit the classification task for this category.

3) Relation to ESZSL: When fixing the feature weights for all categories, i.e., $M^s$, as an all-one matrix without learning the MLP in Fig. 1, our DAEZSL model approximately reduces to ESZSL, in which the visual-semantic mappings of all categories are the same.

4) Relation to AEZSL: Our DAEZSL model shares a similar idea with our AEZSL method, that is, the visual-semantic mapping should be category-specific and related to the semantic representation of each category.
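The test-time procedure described above (duplicate the feature, weight it per category, map once, then take the diagonal of $J_i$) can be sketched as follows; this is our own toy illustration with random stand-ins for the learned feature weights and mapping:

```python
import numpy as np

rng = np.random.default_rng(2)
d, a, Ct = 6, 4, 5
x_t = rng.standard_normal(d)           # visual feature of one test image
A_t = rng.standard_normal((Ct, a))     # semantic reps of the unseen categories
M_t = rng.random((Ct, d))              # feature weights g(a_c), one row per category
W = rng.standard_normal((d, a))        # shared visual-semantic mapping

X_bar = np.tile(x_t, (Ct, 1))          # duplicate the feature to Ct copies
J = (X_bar * M_t) @ W @ A_t.T          # (Ct, Ct) decision value matrix

# Only the diagonal is used: J[c, c] scores category c with its own implicit W_c.
scores = np.diag(J)
pred = int(np.argmax(scores))
```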
Besides, we design the training loss in (11) considering the generalizability and specificity of the visual-semantic mappings, in analogy to learning category-specific $W_c$'s with the co-regularizer in (4). Moreover, in our DAEZSL model, the visual-semantic mapping of a given unseen category is expected to be close to those of seen categories with similar semantic representations, resembling the first regularizer based on the similarity matrix $S_c$ in (4).

V. EXPERIMENTS

In this section, we conduct experiments for image classification on three small-scale datasets (i.e., CUB, SUN, and Dogs) and one large-scale dataset (i.e., ImageNet). On the small-scale datasets, we compare our AEZSL and AEZSL_LR

methods with their special cases as well as standard and semi-supervised ZSL baseline methods. Note that our DAEZSL method learns a mapping from semantic representation to feature weight, which demands adequate seen categories. With scarce seen categories in the training stage, the learnt DAEZSL model cannot generalize well to unseen categories. Thus, we omit the results of DAEZSL on the small-scale datasets. On the large-scale dataset, due to the inefficiency of AEZSL for a large number of unseen categories, we only evaluate our DAEZSL method, which is specifically designed for large-scale ZSL, and compare with recently reported state-of-the-art results.

A. Zero-Shot Learning on Small-Scale Datasets

1) Experimental Settings: We conduct experiments on the following three popular benchmark datasets which are commonly used for zero-shot learning tasks:

CUB [37]: Caltech-UCSD Birds (CUB) has in total 11,788 images distributed over 200 bird categories. Following the ZSL setting in [38], we use the standard train-test split with 150 (resp., 50) categories as seen (resp., unseen) categories. The CUB dataset contains a 312-dim binary human-specified attribute vector for each image, so we average the attribute vectors of the images within each category and use it as the semantic representation of that category.

SUN [39]: The Scene UNderstanding (SUN) dataset has 20 images in each scene category. Following the ZSL setting in [40], we use the provided list of 10 categories as unseen categories and the remaining 707 categories as seen categories. Similar to CUB, the averaged 102-dim attribute vector is used as the semantic representation of each category.

Dogs [41]: ZSL was first performed on the Stanford Dogs dataset in [5], which uses 19,501 images from 113 breeds of dogs. We follow the train-test split provided in [5], i.e., 85 (resp., 28) categories as seen (resp., unseen) categories.
Since there is no manually annotated attribute for the Dogs dataset, we combine two types of output embeddings learned from an online corpus (i.e., a 3,850-dim Bag-of-Words embedding and a 163-dim WordNet-derived similarity embedding) as the semantic representation of each category, which has demonstrated superior performance in [5]. For the CUB and SUN datasets, we extract the 4,096-dim output of the 6th layer of the pretrained VGG [45] as the visual feature of each image, following myriad previous ZSL works such as [13], [14], [23], and [44]. For the Dogs dataset, we use the 1,024-dim output of the top layer of the pretrained GoogleNet [46] following [5] and [22].

2) Baselines: We compare our AEZSL and AEZSL_LR methods with two sets of ZSL baselines: standard ZSL methods and semi-supervised/transductive ZSL methods. For the first set, we include the methods in [1], [5], [9], [11]-[14], [22], [23], [28], [42], and [43], which do not utilize unlabeled test instances in the training stage. For the second set, we compare with the transductive or semi-supervised ZSL methods in [10], [15], [18], [19], [21], [25], and [44], which utilize the unlabeled test instances and the semantic representations of unseen categories in the training phase. Note that our AEZSL belongs to the standard ZSL methods while AEZSL_LR belongs to the semi-supervised ZSL methods. Besides the two sets of baselines mentioned above, we consider two special cases of our AEZSL method: AEZSL_sim, which does not use the co-regularizer $\sum_{c<\tilde{c}} \|W_c - W_{\tilde{c}}\|_F^2$ (by setting $\lambda_3 = 0$), and AEZSL_inf, with $\lambda_3$ being very large ($\lambda_3 = \infty$). We also report the results of one special case of our AEZSL_LR method, namely AEZSL_LR (one-step), which is a non-progressive approach treating the entire test set as the unconfident set $U$ and using the method in (6) without the regularizer $\|X^l P - Y^l\|_F^2$. We use multi-class accuracy for performance evaluation.
For the baselines, if the experimental settings (i.e., train-test split, visual feature, semantic representation, evaluation metric, etc.) mentioned in their papers are exactly the same as ours, we directly copy their reported results. Otherwise, we run their methods using our experimental setting for fair comparison.

3) Parameters: Our AEZSL method has three trade-off parameters $\lambda_1$, $\lambda_2$, and $\lambda_3$ in (4). Besides, our label refinement strategy has three additional trade-off parameters $\gamma_1$, $\gamma_2$, and $\gamma_3$ (see (6)). We use a cross-validation strategy to determine the above trade-off parameters. Specifically, following [19], we choose the first $C^c$ categories based on the default category indices from the $C^s$ seen categories as validation categories, in which $C^c$ satisfies $\frac{C^c}{C^s} = \frac{C^t}{C^s + C^t}$. In our experiments, we first learn models based on the seen categories excluding the validation categories with $\lambda_1$, $\lambda_2$, and $\lambda_3$ set within the range $[10^{-3}, 10^{-2}, \ldots, 10^{3}]$, and then test the learnt models on the validation categories. After determining the optimal trade-off parameters through grid search, we learn the final model based on all seen categories. Note that we also have a hyperparameter $k$, which is the number of selected confident test instances in each iteration. We empirically fix $k$ as 100 for the CUB and Dogs datasets, and 10 for the SUN dataset, given that there are only 200 test instances in the SUN dataset. Based on our experimental observations, our methods are relatively robust when the parameters are set within a certain range. Taking $\lambda_3$ and $k$ as examples, we vary $\lambda_3$ (resp., $k$) in the range of $[10^{-1}, \ldots, 10^{3}]$ (resp., $[80, 90, \ldots, 120]$) and evaluate our AEZSL_LR method on the Dogs dataset. The performance variance is illustrated in the middle and right subfigures of Fig. 5, from which we can see that our method is insensitive to $\lambda_3$ and $k$ within a certain range. We have similar observations for the other parameters on the other datasets.

4) Experimental Results: The experimental results for our AEZSL and AEZSL_LR methods as well as all the baselines are reported in Table I.
It can be seen that ESZSL achieves competitive results compared with the other standard ZSL baselines, which demonstrates the effectiveness of ESZSL despite its simplicity. By comparing AEZSL_sim with AEZSL, we observe that AEZSL achieves better results after employing the co-regularizer, so it is beneficial to encourage the mappings of different categories to share some common parts. We also observe that our AEZSL method not only outperforms ESZSL and AEZSL_sim, but also outperforms the standard ZSL

baselines [1], [5], [9], [11]-[14], [22], [23], [28], [42], [43], which indicates the advantage of learning a visual-semantic mapping for each unseen category. From Table I, we also observe that the transductive or semi-supervised baselines [10], [15], [18], [19], [21], [25], [44] generally achieve better results than the standard ZSL baselines, indicating that it is helpful to utilize the unlabeled test instances in the training stage. Another observation is that our AEZSL_LR method outperforms AEZSL, which demonstrates the effectiveness of the progressive label refinement. Moreover, AEZSL_LR also performs better than AEZSL_LR (one-step). This is because the selected confident set with more accurate predicted labels can provide weak supervision. Finally, our AEZSL_LR method outperforms all the baselines and achieves the state-of-the-art results on the three datasets, which again shows the advantage of adapting the visual-semantic mapping to each unseen category.

TABLE I: Accuracies (%) of different baseline methods and our methods on three benchmark datasets. The best results are highlighted in boldface.

TABLE II: Accuracies (%) of different methods under the new setting proposed in [47] on the CUB and SUN datasets. The best results are highlighted in boldface.

5) New Experimental Setting on CUB and SUN Datasets: In [47], new experimental settings on the benchmark datasets were proposed for zero-shot learning. For fair comparison with the results reported in [47] and in recent papers following the setting of [47], we additionally conduct experiments using the train-test split as well as the ResNet visual features provided in [47]. The results of our methods on the CUB and SUN datasets under this setting are reported in Table II. We observe that our methods outperform the other baselines, which demonstrates the consistent superiority of our methods under different experimental settings.
6) Qualitative Analysis of the Attributes Learnt by AEZSL: In order to intuitively show the advantage of learning category-specific visual-semantic mappings, we take the SUN dataset as an example to investigate why our AEZSL can correctly classify the test instances which are misclassified by ESZSL.

Fig. 2. Illustration of two instance images from the category "flea market" and their corresponding mapped values for the attributes "cloth" and "cluttered space" obtained by using ESZSL and our AEZSL method.

As illustrated in Figure 2, the two images are correctly classified as "flea market" by our AEZSL method but are misclassified as "shoe shop" by ESZSL. Based on the semantic representations of "flea market" and "shoe shop", we find that among the five attributes with maximum values in the semantic representation of "flea market", only two attributes (i.e., "cloth" and "cluttered space") have larger values than those of "shoe shop". We simply treat these two attributes as the representative attributes to distinguish "flea market" from "shoe shop". Figure 2 shows the mapped values of the two confusing images corresponding to the two representative attributes obtained by using ESZSL and our AEZSL method, which are calculated by using $X^s W$ and $X^s W_c$ ($W_c$ is the category-specific mapping matrix for "flea market"), respectively. It can be seen that our AEZSL method obtains larger mapped values for the two representative attributes than ESZSL, which contributes to the correct classification of the two confusing images as "flea market". Based on the fact that AEZSL obtains larger mapped values corresponding to the two representative attributes, we conjecture that the learnt mapping using our method can better

capture the semantic meanings of "cloth" and "cluttered space". To verify this point, we first show some instance images from different categories with the attribute "cloth" (resp., "cluttered space") in the first (resp., second) row of Figure 3.

Fig. 3. The first (resp., second) row contains the instance images from different categories with the attribute "cloth" (resp., "cluttered space"). (a) Badminton court (indoor). (b) Bedchamber. (c) Recycling plant. (d) Landfill.

Together with the instance images in Figure 2, we can observe that for different categories, the visual appearances of "cloth" and "cluttered space" are considerably different, as discussed in Section I. Thus, a general mapping matrix learnt by ESZSL cannot tell the subtle discrepancy in the semantic meanings of the same attribute between different categories. In contrast, our AEZSL method learns the category-specific mapping matrix for "flea market" based on the similarities between this unseen category and each seen category, in which the major transfer comes from more similar seen categories.

8) Comparison Between Different Semantic Representations: In Table I, we use the continuous attribute vector as the semantic representation on the CUB and SUN datasets. Instead, we can also use a binary attribute vector or a word vector. Specifically, we binarize the category-level continuous attribute vector based on the threshold 0 to obtain the binary attribute vector, and use the 400-dim word vector [50] provided in [22] corresponding to each category name. Taking the CUB dataset as an example, we report the results in Table IV, in which the results obtained by using the continuous attribute vector are significantly better. However, in the other two cases, our AEZSL method is still better than ESZSL and is further improved by AEZSL_LR.
In Figure 4, we show the instance images from the four nearest neighboring seen categories of "flea market" with the attributes "cloth" and "cluttered space", from which it can be seen that, in terms of the visual appearances of "cloth" and "cluttered space", these neighboring categories resemble "flea market" much better than other categories such as those in Figure 3. Therefore, AEZSL can learn a better-fitting visual-semantic mapping for "flea market". We have similar observations for the other unseen categories and on the other datasets.

7) Performance Variation w.r.t. Iteration in AEZSL_LR: Since our proposed label refinement method is a progressive approach, we are interested in the variation of the accuracy of the predicted test labels (i.e., the union of $Y^l$ and $Y^u$) w.r.t. the number of iterations. Taking the Dogs dataset as an example, we plot the label accuracy w.r.t. the number of iterations in the left subfigure of Figure 5, from which we can observe that the accuracy of the predicted test labels increases from 44.6% to 50.6% steadily within 50 iterations. Recall that in each iteration, we move the top $k$ confident instances from the unconfident set with their refined predicted labels into the confident set. Thus, we can infer that in most iterations, the accuracy of the predicted labels of the selected $k$ confident instances is improved using the updated visual classifiers, which verifies the effectiveness of the progressive label refinement. We have similar observations on the other datasets.

B. Zero-Shot Learning on Large-Scale Dataset

In this section, we apply our deep adaptive embedding model DAEZSL on the ImageNet dataset and compare with the state-of-the-art reported results.

1) Experimental Settings: We strictly follow the experimental settings in [6] and [28]. Specifically, we use the 1000 categories in ImageNet ILSVRC 2012 1K [51] as seen categories, and perform evaluation on three test scenarios, which are chosen from the ImageNet 2011 21K dataset and built based on the ImageNet label hierarchy with increasing difficulty. The three test sets are listed as follows.

2-hop test set: 1,509 unseen categories within two tree hops of the 1000 seen categories according to the ImageNet label hierarchy, which are semantically and visually similar to the 1000 seen categories.
3-hop test set: 7,678 unseen categories within three tree hops of the 1000 seen categories, which is constructed in a similar way to the 2-hop test set.

all test set: all the 20,345 unseen categories in the ImageNet 2011 21K dataset which do not belong to the ILSVRC 2012 1K dataset.

Note that the 2-hop (resp., 3-hop) test set is a subset of the 3-hop (resp., "all") test set, and all the test sets have no overlap with the training set (i.e., ImageNet ILSVRC 2012 1K).

2) Semantic Representations: For both seen and unseen categories, we use the 500-dim word vector for each category provided in [28], which is obtained based on a skip-gram language model [50] trained on the latest Wikipedia corpus. Note that one ImageNet category may have more than one word according to its synset, so we simply average the word vectors of all the words appearing in its synset as the word vector for that category.

3) Evaluation Metrics: Evaluating ZSL methods on the large-scale ImageNet dataset is a nontrivial task considering the large number of unseen categories and the semantic overlap of different categories in the ImageNet label hierarchy. So we use different evaluation metrics compared with the multi-class accuracy used on the three small-scale datasets (i.e., CUB, SUN, and Dogs) in Section V-A. Following [6], we use two metrics,

Flat hit@K and Hierarchical precision@K, for performance evaluation.

Fig. 4. The instance images from the four nearest neighboring seen categories of the unseen category "flea market" with the attributes "cloth" and "cluttered space". (a) Bazaar (indoor). (b) Thrift shop. (c) Market (indoor). (d) General store (indoor).

Fig. 5. The left subfigure shows the accuracy variation of the predicted test labels w.r.t. the number of iterations in label refinement on the Dogs dataset. The middle (resp., right) subfigure shows the performance variance of our AEZSL_LR method w.r.t. the parameter k (resp., λ3) on the Dogs dataset, in which the dashed line indicates the parameter we use in Table I.

To be exact, Flat hit@K is defined as the percentage of test instances for which the ground-truth category is in the top K predictions. When K = 1, Flat hit@K is identical to the multi-class accuracy. Compared with Flat hit@K, Hierarchical precision@K takes the hierarchical structure of the categories into consideration. Given each test image, we generate a feasible set of K nearest categories of its ground-truth category in the ImageNet label hierarchy and calculate the overlap ratio between the feasible set and the top K predictions of this test image, that is, the precision of the top K predictions. When generating a feasible set of K nearest categories for a ground-truth category, we iteratively enlarge the search radius around the ground-truth category in the ImageNet label hierarchy and add the searched categories belonging to the test set to the feasible set. This procedure is repeated until the size of the feasible set exceeds K. For more details, please see the Appendix of [6] or the Supplementary of [28]. Note that when K = 1, Flat hit@K is equal to Hierarchical precision@K, so we omit Hierarchical precision@1 in Table III to avoid redundancy.

4) Network Structure: In terms of the CNN and MLP in Fig. 1, we use GoogleNet as the CNN structure in our experiments, which is initialized with the released model [46] pretrained on the ImageNet dataset.
The dimension of the output from the CNN is 1,024. The MLP we use has one hidden layer with its size empirically set as 750, which is approximately the average of the feature dimension ($d = 1024$) and the attribute dimension ($a = 500$), i.e., $\frac{d+a}{2}$. Besides, we add a dropout layer following the hidden layer in the MLP with a 50% ratio of dropped outputs. Additionally, we add a sigmoid layer following the MLP to normalize the feature weights to the range (0, 1). The network is implemented using TensorFlow and we use the AdaGrad optimizer for training, with the batch size set as 128 and a fixed learning rate.

5) Experimental Results: To the best of our knowledge, only a few ZSL papers [6], [12], [28], [43] report their performance on the ImageNet dataset under the 2-hop, 3-hop, and "all" test settings. We compare our DAEZSL method with the reported results of DeVISE [6], ConSE [12], EXEM [43], and SYNC [28] in Table III. We also include ESZSL as a baseline, for which we train the DAEZSL network using all-one feature weights without learning the MLP. From Table III, we can observe that DAEZSL achieves far better results than ESZSL, which demonstrates the advantage of category-specific feature weights. Our DAEZSL also outperforms all the baseline methods in most cases (5 out of 7), indicating that it is beneficial to learn a deep adaptive embedding model which can implicitly adapt the visual-semantic mapping to different categories by using category-specific feature weights. Additionally, in Table V, we report the results of our DAEZSL method without fine-tuning the CNN model parameters. We also report the results without using the ReLU activation in the hidden layer of the MLP, as well as those with various numbers of hidden layers (i.e., $l_h$) in the MLP or various numbers of nodes in each hidden layer (i.e., $n_h$). From Table V, we observe that DAEZSL ($l_h$ = 1 or 2, $n_h$ = 750) with ReLU in the MLP and CNN fine-tuning generally performs more favorably. Recall that we employ the MLP to learn a mapping from semantic representation to feature weight, which requires adequate seen categories in the training stage.
With scarce seen categories in the training stage, the learnt MLP may not generalize

well to unseen categories in the testing stage. To verify this point, we vary the number of seen categories and plot the performance curves of AEZSL and DAEZSL in Fig. 6. Based on Fig. 6, we observe that DAEZSL can achieve better results than AEZSL with adequate seen categories, while performing worse than AEZSL when the number of seen categories is very small.

TABLE III: Accuracies (%) of different baseline methods and our DAEZSL method on the ImageNet dataset. The best results are highlighted in boldface.

TABLE IV: Accuracies (%) of different methods using the word vector or the attribute vector (binary or continuous) on the CUB dataset. The best results are highlighted in boldface.

TABLE V: Accuracies (%) of our DAEZSL method with different configurations under the 2-hop test setting on the ImageNet dataset. We use $l_h$ to denote the number of hidden layers in the MLP and $n_h$ to denote the number of nodes in each hidden layer. Note that the default configuration of DAEZSL is $l_h$ = 1, $n_h$ = 750. The best results are highlighted in boldface.

Fig. 6. Accuracy (i.e., Flat hit@1) of our AEZSL and DAEZSL methods under the 2-hop test setting on the ImageNet dataset with different numbers of seen categories in the training stage.

C. Generalized Zero-Shot Learning

Most existing ZSL methods assume that in the testing stage the test instances only come from the unseen categories, which is actually an unrealistic setting, because instances from seen categories may also be encountered in the testing stage. So it is more useful to predict a test instance from either seen or unseen categories, instead of assuming that the test instances are only from unseen categories. However, when using the mixture of test instances from both seen and unseen categories for testing, the performance will be significantly degraded due to the bias of prediction scores between the seen category label space and the unseen category label space, as demonstrated in [7] and [25].
This more challenging test setting is referred to as generalized zero-shot learning (GZSL) in [25]. To validate the effectiveness of our AEZSL method under the more realistic GZSL setting, we additionally conduct experiments by mixing the test instances from both seen and unseen categories as the test set. In particular, we follow the setting in [25], that is, we move 20% of the instances from each seen category to the test set, so that the test set includes both seen categories and unseen categories.

TABLE VI: Accuracies (%) of different methods under the generalized ZSL setting on four benchmark datasets. The best results are highlighted in boldface.

When applying our AEZSL method to the generalized ZSL setting, we adopt the same strategy as in (4), and the only difference is that we learn $(C^s + C^t)$ instead of $C^t$ category-specific visual-semantic mappings. Particularly, we learn category-specific mappings for both unseen and seen categories by assigning different weights to the classification tasks of different seen categories, based on the similarity between each category and all the seen categories. For fair comparison with the state-of-the-art results under the generalized ZSL setting, we further employ the existing calibrated stacking strategy proposed in [25] on the results obtained by our AEZSL method. The idea of the calibrated stacking strategy is simple yet very effective, that is, to reduce the prediction scores in the seen category label space by a threshold, which is learnt based on a performance metric called Area Under Seen-Unseen accuracy Curve (AUSUC) on the validation set, according to the observation that the prediction scores in the seen category label space are often higher than those in the unseen category label space. Thus, after employing the calibrated stacking strategy, we expect to obtain unbiased prediction scores. In Table VI, we report the results obtained by ESZSL and our AEZSL method, as well as by our AEZSL after employing the Calibrated Stacking (CS) strategy, which is referred to as AEZSL(CS). We also compare with [25], which is specifically designed for generalized ZSL. From Table VI, we observe that our AEZSL method outperforms ESZSL on all datasets, which shows that our AEZSL method is also effective under the generalized ZSL setting.
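The calibrated stacking idea can be sketched in a few lines (our own toy illustration; the scores and the calibration factor are invented, and in practice the factor would be tuned on a validation set via AUSUC as described above):

```python
import numpy as np

def calibrated_predict(scores, seen_mask, gamma):
    """Calibrated stacking: subtract a calibration factor gamma from the
    scores of seen categories before taking the argmax, counteracting the
    bias toward the seen category label space."""
    adjusted = scores - gamma * seen_mask
    return int(np.argmax(adjusted))

scores = np.array([0.9, 0.8, 0.5])       # categories 0, 1 seen; category 2 unseen
seen_mask = np.array([1.0, 1.0, 0.0])
assert calibrated_predict(scores, seen_mask, 0.0) == 0   # biased toward seen
assert calibrated_predict(scores, seen_mask, 0.5) == 2   # calibration flips it
```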
Moreover, after employing the calibrated stacking strategy, the accuracy obtained by our AEZSL method is greatly improved, which demonstrates that our AEZSL method can be perfectly integrated with the existing calibrated stacking strategy. Finally, we observe that AEZSL(CS) achieves significantly better results than the other baselines on all datasets, which again indicates the effectiveness of our method under the generalized ZSL setting.

VI. CONCLUSION

In this paper, we have proposed our AEZSL method, which aims to learn a category-specific visual-semantic mapping for each unseen category based on the similarities between unseen categories and seen categories, followed by progressive label refinement. Moreover, we have additionally proposed a deep adaptive embedding model named DAEZSL for large-scale ZSL, which only needs to be trained once on the seen categories but can be applied to an arbitrary number of unseen categories. Comprehensive experimental results demonstrate the effectiveness of our proposed methods.

APPENDIX

A. Solution to (4)

We rewrite the problem in (4) as follows,

$$\min_{\{W_c\}} \sum_{c=1}^{C^t} \|(X^s W_c A^{s\top} - Y^s) S_c\|_F^2 + \lambda_1 \sum_{c=1}^{C^t} \|X^s W_c\|_F^2 + \lambda_2 \sum_{c=1}^{C^t} \|W_c\|_F^2 + \lambda_3 \sum_{c<\tilde{c}} \|W_c - W_{\tilde{c}}\|_F^2. \tag{12}$$

To solve the problem in (12), we update each $W_c$ while fixing all the other $W_{\tilde{c}}$'s for $\tilde{c} \neq c$ in an alternating fashion. Specifically, by setting the derivative of (12) w.r.t. $W_c$ to zeros, we obtain the following equation:

$$(X^{s\top} X^s) W_c (A^{s\top} S_c S_c^{\top} A^s + \lambda_1 I) + ((C^t - 1)\lambda_3 + \lambda_2) W_c = X^{s\top} Y^s S_c S_c^{\top} A^s + \lambda_3 \sum_{\tilde{c} \neq c} W_{\tilde{c}}. \tag{13}$$

By denoting $L = X^{s\top} X^s$, $T = A^{s\top} S_c S_c^{\top} A^s + \lambda_1 I$, $N = X^{s\top} Y^s S_c S_c^{\top} A^s + \lambda_3 \sum_{\tilde{c} \neq c} W_{\tilde{c}}$, and $\mu = (C^t - 1)\lambda_3 + \lambda_2$, the problem in (13) becomes a special case of the Sylvester equation w.r.t. $W_c$ as follows,

$$L W_c T + \mu W_c = N. \tag{14}$$

Since $L$ and $T$ are symmetric real matrices, the problem in (14) has an efficient solution. Inspired by Simoncini [53], we perform Singular Value Decomposition (SVD) on $L$ and $T$, i.e., $L = \hat{U} \Sigma_L \hat{U}^{\top}$ and $T = \hat{V} \Sigma_T \hat{V}^{\top}$, with $\Sigma_L$ (resp., $\Sigma_T$) being a diagonal matrix in which the $i$-th diagonal element is $\sigma_i^L$ (resp., $\sigma_i^T$).
Then, we can rewrite (14) as

$$\hat{U} \Sigma_L \hat{U}^{\top} W_c \hat{V} \Sigma_T \hat{V}^{\top} + \mu W_c = N. \tag{15}$$

After multiplying (15) by $\hat{U}^{\top}$ (resp., $\hat{V}$) on the left (resp., right) and considering the orthonormality of $\hat{U}$ (resp., $\hat{V}$), we have

$$\Sigma_L \hat{U}^{\top} W_c \hat{V} \Sigma_T + \mu \hat{U}^{\top} W_c \hat{V} = \hat{U}^{\top} N \hat{V}. \tag{16}$$

By denoting $\hat{W}_c = \hat{U}^{\top} W_c \hat{V}$ and $\hat{N} = \hat{U}^{\top} N \hat{V}$, we arrive at

$$\Sigma_L \hat{W}_c \Sigma_T + \mu \hat{W}_c = \hat{N}. \tag{17}$$

Based on (17), we can easily solve for each entry of $\hat{W}_c$ as $\hat{W}_c^{ij} = \frac{\hat{N}^{ij}}{\sigma_i^L \sigma_j^T + \mu}$. Note that $L$ is positive semi-definite, $T$ is positive definite, and $\mu > 0$, so $\sigma_i^L \sigma_j^T + \mu \neq 0, \forall i, j$, and the problem in (17) has a unique solution. Then, we can recover $W_c$ by using $W_c = \hat{U} \hat{W}_c \hat{V}^{\top}$. Because the SVDs of $L$ and $T$ can be precomputed, our algorithm is quite efficient. We update each $W_c$ alternately until the objective of (12) converges. We name this method Adaptive Embedding ZSL (AEZSL), and the algorithm to solve (12) is listed in Algorithm 1.
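The closed-form Sylvester solution (14)-(17) can be sketched as follows (our own illustration with random toy matrices; since $L$ and $T$ are symmetric, their SVDs coincide with eigendecompositions, so `np.linalg.eigh` plays the role of the SVD here):

```python
import numpy as np

def solve_sylvester(L, T, N, mu):
    """Solve L @ Wc @ T + mu * Wc = N for symmetric PSD L and PD T,
    following the appendix: diagonalize L and T, solve entrywise via
    Wc_hat[i, j] = N_hat[i, j] / (sL[i] * sT[j] + mu), then rotate back."""
    sL, U = np.linalg.eigh(L)          # L = U diag(sL) U^T
    sT, V = np.linalg.eigh(T)          # T = V diag(sT) V^T
    N_hat = U.T @ N @ V
    Wc_hat = N_hat / (np.outer(sL, sT) + mu)
    return U @ Wc_hat @ V.T

rng = np.random.default_rng(3)
d, a = 5, 3
X = rng.standard_normal((10, d))
L = X.T @ X                            # positive semi-definite, like X^sT X^s
B = rng.standard_normal((a, a))
T = B @ B.T + 0.1 * np.eye(a)          # positive definite, like A^sT Sc Sc^T A^s + lambda1 I
N = rng.standard_normal((d, a))
Wc = solve_sylvester(L, T, N, mu=0.5)
assert np.allclose(L @ Wc @ T + 0.5 * Wc, N)
```

As in the appendix, the decompositions of $L$ and $T$ can be computed once and reused across all $W_c$ updates.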

Algorithm 1: The Algorithm to Solve AEZSL (12)

Algorithm 2: The Algorithm of Progressive Label Refinement

B. Solution to (6)

We rewrite the problem in (6) as follows,

$$\min_{P} \frac{1}{2} \|X^l P - Y^l\|_F^2 + \gamma_1 \|X^u P - Y^u\|_{2,1} - \gamma_2 \,\mathrm{tr}(Y^u \hat{S} P^{\top} X^{u\top}) + \gamma_3 \,\mathrm{tr}(P^{\top} X^{u\top} H^u X^u P). \tag{18}$$

According to [54], solving (18) is equivalent to solving the following problem iteratively:

$$\min_{P} \frac{1}{2} \|X^l P - Y^l\|_F^2 + \gamma_1 \,\mathrm{tr}((P^{\top} X^{u\top} - Y^{u\top}) D (X^u P - Y^u)) - \gamma_2 \,\mathrm{tr}(Y^u \hat{S} P^{\top} X^{u\top}) + \gamma_3 \,\mathrm{tr}(P^{\top} X^{u\top} H^u X^u P), \tag{19}$$

where $D$ is a diagonal matrix with the $i$-th diagonal element being $\frac{1}{2\|q_i\|_2}$, in which $q_i$ is the $i$-th row of $Q$ with $Q = X^u P - Y^u$. In each iteration of solving $P$, we calculate $D$ based on the previous $P$ and then update $P$ by solving (19). As claimed in [54], the updated $P$ in each iteration will decrease the objective of (18). By setting the derivative of (19) w.r.t. $P$ to zeros, we can obtain the closed-form solution for $P$ as

$$P = (X^{l\top} X^l + \gamma_1 X^{u\top} D X^u + \gamma_3 X^{u\top} H^u X^u + \nu I)^{-1} (X^{l\top} Y^l + \gamma_1 X^{u\top} D Y^u + \gamma_2 X^{u\top} Y^u \hat{S}), \tag{20}$$

in which $\nu$ is a very small number (i.e., $\nu \to 0$). We update $P$ by solving (19) iteratively until the objective of (18) converges. According to [54], the output $P$ is the globally optimal solution to (18) because the objective of (18) is convex w.r.t. $P$. The whole algorithm of the proposed progressive label refinement is summarized in Algorithm 2.

REFERENCES

[1] C. H. Lampert, H. Nickisch, and S. Harmeling, "Attribute-based classification for zero-shot visual object categorization," IEEE Trans. Pattern Anal. Mach. Intell., vol. 36, no. 3, Mar. 2014.
[2] Y. Guo, G. Ding, J. Han, and Y. Gao, "Zero-shot learning with transferred samples," IEEE Trans. Image Process., vol. 26, no. 7, Jul. 2017.
[3] C. Luo, Z. Li, K. Huang, J. Feng, and M. Wang, "Zero-shot learning via attribute regression and class prototype rectification," IEEE Trans. Image Process., vol. 27, no. 2, Feb. 2018.
[4] A. Farhadi, I. Endres, D. Hoiem, and D. Forsyth, "Describing objects by their attributes," in Proc. IEEE Conf. CVPR, Miami, FL, USA, Jun. 2009.
[5] Z. Akata, S. Reed, and D.
Walter, Evaluaton of output embeddngs for fne-graned mage classfcaton, n Proc. Comput. Vs. Pattern Recognt., Boston, MA, USA, Jun. 015, pp [6] A. Frome, G. S. Corrado, and J. Shlens, Devse: A deep vsual-semantc embeddng model, n Proc. Adv. Neural Inf. Process. Syst., Lake Tahoe, NV, USA, Dec. 013, pp [7] R. Socher, M. Ganjoo, C. D. Mannng, and A. Ng, Zero-shot learnng through cross-modal transfer, n Proc. 7th Annu. Conf. Neural Inf. Process. Syst., Lake Tahoe, NV, USA, 013, pp [8] F. X. Yu, L. Cao, R. S. Fers, J. R. Smth, and S. F. Chang, Desgnng category-level attrbutes for dscrmnatve vsual recognton, n Proc. Comput. Vs. Pattern Recognt., Portland, OR, USA, Jun. 013, pp [9] Z. Fu, T. Xang, and E. Kodrov, Zero-shot object recognton by semantc manfold dstance, n Proc. Comput. Vs. Pattern Recognt., Boston, MA, USA, Jun. 015, pp [10] X. L and Y. Guo, Max-margn zero-shot learnng for mult-class classfcaton, n Proc. Int. Conf. Art. Intell. Stat., San Dego, CA, USA, May 015, pp [11] T. Mensnk, E. Gavves, and C. G. M. Snoek, COSTA: Co-occurrence statstcs for zero-shot classfcaton, n Proc. 7th IEEE Conf. Comput. Vs. Pattern Recognt., Columbus, OH, USA, Jun. 014, pp [1] M. Norouz et al., Zero-shot learnng by convex combnaton of semantc embeddngs, n Proc. Int. Conf. Learn. Represent., Banff, AB, Canada, Apr [13] Z. Zhang and V. Salgrama, Zero-shot learnng va semantc smlarty embeddng, n Proc. IEEE Int. Conf. Comput. Vs. (ICCV), Santago, Chle, Dec. 015, pp [14] M. Bucher, S. Herbn, and F. Jure, Improvng semantc embeddng consstency by metrc learnng for zero-shot classffcaton, n Proc. 14th Eur. Conf. Comput. Vs., Amsterdam, The Netherlands, Oct. 016, pp [15] X. Xu, T. Hospedales, and S. Gong, Transductve zero-shot acton recognton by word-vector embeddng, Int. J. Comput. Vs., vol. 13, no. 3, pp , 017. [16] D. Jayaraman, F. Sha, and K. Grauman, Decorrelatng semantc vsual attrbutes by resstng the urge to share, n Proc. IEEE Conf. Comput. Vs. Pattern Recognt. 
(CVPR), Columbus, OH, USA, Jun. 014, pp [17] E. Kodrov, T. Xang, and S. Gong, Semantc autoencoder for zeroshot learnng, n Proc. IEEE Int. Conf. Comput. Vs. Pattern Recognt., Honolulu, HI, USA, Jul. 017, pp [18] E. Kodrov, T. Xang, and Z. Fu, Unsupervsed doman adaptaton for zero-shot learnng, n Proc. Int. Conf. Comput. Vs., Santago, Chle, Dec. 015, pp [19] S. M. Shojaee and M. S. Baghshah. (016). Sem-supervsed zero-shot learnng by a clusterng-based approach. [Onlne]. Avalable: arxv.org/abs/ [0] Y. Fu, T. M. Hospedales, T. Xang, Z. Fu, and S. Gong, Transductve mult-vew embeddng for zero-shot recognton and annotaton, n Proc. 13th Eur. Conf. Comput. Vs., Zurch, Swtzerland, Sep. 014, pp [1] X. L, Y. Guo, and D. Schuurmans, Sem-supervsed zero-shot classfcaton wth label representaton learnng, n Proc. IEEE Int. Conf. Comput. Vs. (ICCV), Santago, Chle, Dec. 015, pp [] Y. Q. Xan, Z. Akata, and G. Sharma, Latent embeddngs for zero-shot classfcaton, n Proc. Comput. Vs. Pattern Recognt., Las Vegas, NV, USA, Jun. 016, pp [3] Z. Zhang and V. Salgrama, Zero-shot learnng va jont latent smlarty embeddng, n Proc. Comput. Vs. Pattern Recognt., Las Vegas, NV, USA, Jun. 016, pp [4] B. Romera-Paredes and P. H. S. Torr, An embarrassngly smple approach to zero-shot learnng, n Proc. Int. Conf. Mach. Learn., Llle, France, Jul. 015, pp

[25] Y. Guo, G. Ding, X. Jin, and J. Wang, "Transductive zero-shot recognition via shared model space learning," in Proc. 30th AAAI Conf. Artif. Intell., Phoenix, AZ, USA, Feb. 2016.
[26] Y. Long, L. Liu, and L. Shao, "Attribute embedding with visual-semantic ambiguity removal for zero-shot learning," in Proc. Brit. Mach. Vis. Conf., York, U.K., Sep. 2016.
[27] Z. Al-Halah, M. Tapaswi, and R. Stiefelhagen, "Recovering the missing link: Predicting class-attribute associations for unsupervised zero-shot learning," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Las Vegas, NV, USA, Jun. 2016.
[28] S. Changpinyo, W. L. Chao, and B. Gong, "Synthesized classifiers for zero-shot learning," in Proc. Comput. Vis. Pattern Recognit., Las Vegas, NV, USA, Jun. 2016.
[29] M. Ye and Y. Guo, "Zero-shot classification with discriminative semantic representation learning," in Proc. 30th IEEE Conf. Comput. Vis. Pattern Recognit., Honolulu, HI, USA, Jul. 2017.
[30] S. Changpinyo, W.-L. Chao, and F. Sha, "Predicting visual exemplars of unseen classes for zero-shot learning," in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Venice, Italy, Oct. 2017.
[31] M. Rohrbach, S. Ebert, and B. Schiele, "Transfer learning in a transductive setting," in Proc. Adv. Neural Inf. Process. Syst., Lake Tahoe, NV, USA, Dec. 2013.
[32] Y. Yang and T. M. Hospedales, "A unified perspective on multi-domain and multi-task learning," in Proc. Int. Conf. Learn. Represent., San Diego, CA, USA, May 2015.
[33] L. Ba, K. Swersky, and S. Fidler, "Predicting deep zero-shot convolutional neural networks using textual descriptions," in Proc. Int. Conf. Comput. Vis., Santiago, Chile, Dec. 2015.
[34] L. Zhang, T. Xiang, and S. Gong, "Learning a deep embedding model for zero-shot learning," in Proc. IEEE Int. Conf. Comput. Vis. Pattern Recognit., Honolulu, HI, USA, 2017.
[35] R. Qiao, L. Liu, and C. Shen, "Less is more: Zero-shot learning from online textual documents with noise suppression," in Proc. Comput. Vis. Pattern Recognit., Las Vegas, NV, USA, Jun. 2016.
[36] A. Blum and T. Mitchell, "Combining labeled and unlabeled data with co-training," in Proc. 11th Annu. Conf. Comput. Learn. Theory, New York, NY, USA, Jul. 1998.
[37] C. Wah, S. Branson, P. Welinder, P. Perona, and S. Belongie, "The Caltech-UCSD birds dataset," California Institute of Technology, Pasadena, CA, USA, Tech. Rep. CNS-TR-2011-001, 2011.
[38] Z. Akata, F. Perronnin, Z. Harchaoui, and C. Schmid, "Label-embedding for attribute-based classification," in Proc. 26th IEEE Conf. Comput. Vis. Pattern Recognit., Portland, OR, USA, Jun. 2013.
[39] J. Xiao, J. Hays, K. A. Ehinger, A. Oliva, and A. Torralba, "SUN database: Large-scale scene recognition from abbey to zoo," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., San Francisco, CA, USA, Jun. 2010.
[40] D. Jayaraman and K. Grauman, "Zero-shot recognition with unreliable attributes," in Proc. 28th Annu. Conf. Neural Inf. Process. Syst., Montreal, QC, Canada, Dec. 2014.
[41] A. Khosla, N. Jayadevaprakash, B. Yao, and L. Fei-Fei, "Novel dataset for fine-grained image categorization," in Proc. 24th IEEE Conf. Comput. Vis. Pattern Recognit. Workshop Fine-Grained Vis. Categorization (FGVC), Colorado Springs, CO, USA, Jun. 2011.
[42] D. Wang, Y. Li, Y. Lin, and Y. Zhuang, "Relational knowledge transfer for zero-shot learning," in Proc. 30th AAAI Conf. Artif. Intell., Phoenix, AZ, USA, Feb. 2016.
[43] S. Changpinyo, W.-L. Chao, and F. Sha, "Predicting visual exemplars of unseen classes for zero-shot learning," in Proc. 16th Int. Conf. Comput. Vis., Venice, Italy, Oct. 2017.
[44] Z. Zhang and V. Saligrama, "Zero-shot recognition via structured prediction," in Proc. 14th Eur. Conf. Comput. Vis., Amsterdam, The Netherlands, Oct. 2016.
[45] K. Simonyan and A. Zisserman. (2014). "Very deep convolutional networks for large-scale image recognition." [Online]. Available: arxiv.org/abs/
[46] C. Szegedy et al., "Going deeper with convolutions," in Proc. IEEE Conf. CVPR, Boston, MA, USA, Jun. 2015.
[47] Y. Xian, B. Schiele, and Z. Akata, "Zero-shot learning - the good, the bad and the ugly," in Proc. 30th IEEE Conf. Comput. Vis. Pattern Recognit., Jul. 2017.
[48] H. Zhang and P. Koniusz, "Zero-shot kernel learning," in Proc. 31st IEEE Conf. Comput. Vis. Pattern Recognit., Salt Lake City, UT, USA, Jun. 2018.
[49] V. K. Verma, G. Arora, A. Mishra, and P. Rai, "Generalized zero-shot learning via synthesized examples," in Proc. 31st IEEE Conf. Comput. Vis. Pattern Recognit., Salt Lake City, UT, USA, Jun. 2018.
[50] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean, "Distributed representations of words and phrases and their compositionality," in Proc. Adv. Neural Inf. Process. Syst., Lake Tahoe, NV, USA, Dec. 2013.
[51] O. Russakovsky et al., "ImageNet large scale visual recognition challenge," Int. J. Comput. Vis., vol. 115, no. 3, pp. 211-252, 2015.
[52] W.-L. Chao, S. Changpinyo, B. Gong, and F. Sha, "An empirical study and analysis of generalized zero-shot learning for object recognition in the wild," in Proc. 14th Eur. Conf. Comput. Vis., Amsterdam, The Netherlands, Oct. 2016.
[53] V. Simoncini, "Computational methods for linear matrix equations," SIAM Rev., vol. 58, no. 3, 2016.
[54] F. Nie, H. Huang, X. Cai, and C. H. Ding, "Efficient and robust feature selection via joint l2,1-norms minimization," in Proc. 24th Annu. Conf. Neural Inf. Process. Syst., Vancouver, BC, Canada, Dec. 2010.

Li Niu received the B.E. degree from the University of Science and Technology of China, Hefei, China, in 2011, and the Ph.D. degree from Nanyang Technological University, Singapore, in 2017. He was a Post-Doctoral Associate with Rice University, Houston, TX, USA. He is currently an Associate Professor with the Computer Science and Engineering Department, Shanghai Jiao Tong University, Shanghai, China. His current research interests include machine learning, deep learning, and computer vision.

Jianfei Cai (S'98-M'02-SM'07) received the Ph.D. degree from the University of Missouri-Columbia. He is currently a Professor and the Head of the Visual and Interactive Computing Division, School of Computer Science and Engineering, Nanyang Technological University, Singapore, where he is also the Head of the Computer Communication Division. He has authored over 200 technical papers in international journals and conferences. His major research interests include computer vision, multimedia, and machine learning. He is currently an AE of the IEEE TMM. He served as an AE for the IEEE TIP and TCSVT.

Ashok Veeraraghavan received the bachelor's degree in electrical engineering from IIT Madras in 2002 and the M.S. and Ph.D. degrees from the Department of Electrical and Computer Engineering, University of Maryland, College Park, in 2004 and 2008, respectively. He spent three wonderful and fun-filled years as a Research Scientist with Mitsubishi Electric Research Laboratories, Cambridge, MA, USA. He is currently an Associate Professor of electrical and computer engineering with Rice University, TX, USA, where he directs the Computational Imaging and Vision Lab. His research interests are broadly in the areas of computational imaging, computer vision, and robotics. His thesis received the Doctoral Dissertation Award from the Department of Electrical and Computer Engineering, University of Maryland. His work received numerous awards, including the NSF CAREER Award in 2017 and the Hershel M. Rich Invention Award in 2016 and 2017.

Liqing Zhang received the Ph.D. degree from Zhongshan University, Guangzhou, China. He was with the South China University of Technology, where he was promoted to a Full Professor. He was a Research Scientist with the RIKEN Brain Science Institute, Japan, from 1997 to 2002. He is currently a Professor with the Department of Computer Science and Engineering, Shanghai Jiao Tong University, Shanghai, China. He has authored over 50 papers in journals and international conferences. His current research interests cover computational theory for cortical networks, brain-computer interface, visual perception and image search, and statistical learning and inference.
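The iterative update in (19) and (20) of Section B is an iteratively reweighted least-squares loop: fix the diagonal weights D from the current residual, then refresh P by the closed form. The following NumPy sketch illustrates that loop; it is not the authors' code, the function and variable names are illustrative, and the way factors of two are absorbed into the hyperparameters may differ from the paper by a constant rescaling of the gammas.

```python
import numpy as np

def solve_projection(Xl, Yl, Xu, Yu, S_hat, Hu, g1, g2, g3,
                     nu=1e-6, n_iter=50, eps=1e-12):
    """Iteratively reweighted solver sketch for the l2,1-regularized
    objective (18): alternate between the diagonal reweighting matrix D
    (D_ii = 1 / (2 ||q_i||_2) with Q = Xu P - Yu) and the closed-form
    update of P in (20). Following Nie et al. [54], each update
    decreases the objective."""
    d = Xl.shape[1]
    # Warm start: ridge regression on the labeled data.
    P = np.linalg.solve(Xl.T @ Xl + nu * np.eye(d), Xl.T @ Yl)
    # Terms of (20) that do not depend on D, precomputed once.
    A0 = Xl.T @ Xl + g3 * Xu.T @ Hu @ Xu + nu * np.eye(d)
    B0 = Xl.T @ Yl + g2 * Xu.T @ Yu @ S_hat
    for _ in range(n_iter):
        Q = Xu @ P - Yu                                   # residual rows q_i
        w = 1.0 / (2.0 * np.maximum(np.linalg.norm(Q, axis=1), eps))
        XuTD = Xu.T * w                                   # X_u^T D, D diagonal
        P = np.linalg.solve(A0 + g1 * XuTD @ Xu, B0 + g1 * XuTD @ Yu)
    return P
```

Because D is diagonal, X_u^T D X_u is computed as a column-wise rescaling of X_u^T rather than by forming the n_u-by-n_u matrix D explicitly, which keeps each iteration at one d-by-d linear solve.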


More information

User Authentication Based On Behavioral Mouse Dynamics Biometrics

User Authentication Based On Behavioral Mouse Dynamics Biometrics User Authentcaton Based On Behavoral Mouse Dynamcs Bometrcs Chee-Hyung Yoon Danel Donghyun Km Department of Computer Scence Department of Computer Scence Stanford Unversty Stanford Unversty Stanford, CA

More information

Machine Learning 9. week

Machine Learning 9. week Machne Learnng 9. week Mappng Concept Radal Bass Functons (RBF) RBF Networks 1 Mappng It s probably the best scenaro for the classfcaton of two dataset s to separate them lnearly. As you see n the below

More information

Biostatistics 615/815

Biostatistics 615/815 The E-M Algorthm Bostatstcs 615/815 Lecture 17 Last Lecture: The Smplex Method General method for optmzaton Makes few assumptons about functon Crawls towards mnmum Some recommendatons Multple startng ponts

More information

y and the total sum of

y and the total sum of Lnear regresson Testng for non-lnearty In analytcal chemstry, lnear regresson s commonly used n the constructon of calbraton functons requred for analytcal technques such as gas chromatography, atomc absorpton

More information

An Application of the Dulmage-Mendelsohn Decomposition to Sparse Null Space Bases of Full Row Rank Matrices

An Application of the Dulmage-Mendelsohn Decomposition to Sparse Null Space Bases of Full Row Rank Matrices Internatonal Mathematcal Forum, Vol 7, 2012, no 52, 2549-2554 An Applcaton of the Dulmage-Mendelsohn Decomposton to Sparse Null Space Bases of Full Row Rank Matrces Mostafa Khorramzadeh Department of Mathematcal

More information

The Research of Support Vector Machine in Agricultural Data Classification

The Research of Support Vector Machine in Agricultural Data Classification The Research of Support Vector Machne n Agrcultural Data Classfcaton Le Sh, Qguo Duan, Xnmng Ma, Me Weng College of Informaton and Management Scence, HeNan Agrcultural Unversty, Zhengzhou 45000 Chna Zhengzhou

More information

Comparison of Heuristics for Scheduling Independent Tasks on Heterogeneous Distributed Environments

Comparison of Heuristics for Scheduling Independent Tasks on Heterogeneous Distributed Environments Comparson of Heurstcs for Schedulng Independent Tasks on Heterogeneous Dstrbuted Envronments Hesam Izakan¹, Ath Abraham², Senor Member, IEEE, Václav Snášel³ ¹ Islamc Azad Unversty, Ramsar Branch, Ramsar,

More information

Related-Mode Attacks on CTR Encryption Mode

Related-Mode Attacks on CTR Encryption Mode Internatonal Journal of Network Securty, Vol.4, No.3, PP.282 287, May 2007 282 Related-Mode Attacks on CTR Encrypton Mode Dayn Wang, Dongda Ln, and Wenlng Wu (Correspondng author: Dayn Wang) Key Laboratory

More information

High-Boost Mesh Filtering for 3-D Shape Enhancement

High-Boost Mesh Filtering for 3-D Shape Enhancement Hgh-Boost Mesh Flterng for 3-D Shape Enhancement Hrokazu Yagou Λ Alexander Belyaev y Damng We z Λ y z ; ; Shape Modelng Laboratory, Unversty of Azu, Azu-Wakamatsu 965-8580 Japan y Computer Graphcs Group,

More information

Backpropagation: In Search of Performance Parameters

Backpropagation: In Search of Performance Parameters Bacpropagaton: In Search of Performance Parameters ANIL KUMAR ENUMULAPALLY, LINGGUO BU, and KHOSROW KAIKHAH, Ph.D. Computer Scence Department Texas State Unversty-San Marcos San Marcos, TX-78666 USA ae049@txstate.edu,

More information

Outline. Type of Machine Learning. Examples of Application. Unsupervised Learning

Outline. Type of Machine Learning. Examples of Application. Unsupervised Learning Outlne Artfcal Intellgence and ts applcatons Lecture 8 Unsupervsed Learnng Professor Danel Yeung danyeung@eee.org Dr. Patrck Chan patrckchan@eee.org South Chna Unversty of Technology, Chna Introducton

More information

Fuzzy Modeling of the Complexity vs. Accuracy Trade-off in a Sequential Two-Stage Multi-Classifier System

Fuzzy Modeling of the Complexity vs. Accuracy Trade-off in a Sequential Two-Stage Multi-Classifier System Fuzzy Modelng of the Complexty vs. Accuracy Trade-off n a Sequental Two-Stage Mult-Classfer System MARK LAST 1 Department of Informaton Systems Engneerng Ben-Guron Unversty of the Negev Beer-Sheva 84105

More information

Unsupervised Learning

Unsupervised Learning Pattern Recognton Lecture 8 Outlne Introducton Unsupervsed Learnng Parametrc VS Non-Parametrc Approach Mxture of Denstes Maxmum-Lkelhood Estmates Clusterng Prof. Danel Yeung School of Computer Scence and

More information

Determining the Optimal Bandwidth Based on Multi-criterion Fusion

Determining the Optimal Bandwidth Based on Multi-criterion Fusion Proceedngs of 01 4th Internatonal Conference on Machne Learnng and Computng IPCSIT vol. 5 (01) (01) IACSIT Press, Sngapore Determnng the Optmal Bandwdth Based on Mult-crteron Fuson Ha-L Lang 1+, Xan-Mn

More information

Improving Web Image Search using Meta Re-rankers

Improving Web Image Search using Meta Re-rankers VOLUME-1, ISSUE-V (Aug-Sep 2013) IS NOW AVAILABLE AT: www.dcst.com Improvng Web Image Search usng Meta Re-rankers B.Kavtha 1, N. Suata 2 1 Department of Computer Scence and Engneerng, Chtanya Bharath Insttute

More information

Description of NTU Approach to NTCIR3 Multilingual Information Retrieval

Description of NTU Approach to NTCIR3 Multilingual Information Retrieval Proceedngs of the Thrd NTCIR Workshop Descrpton of NTU Approach to NTCIR3 Multlngual Informaton Retreval Wen-Cheng Ln and Hsn-Hs Chen Department of Computer Scence and Informaton Engneerng Natonal Tawan

More information

Helsinki University Of Technology, Systems Analysis Laboratory Mat Independent research projects in applied mathematics (3 cr)

Helsinki University Of Technology, Systems Analysis Laboratory Mat Independent research projects in applied mathematics (3 cr) Helsnk Unversty Of Technology, Systems Analyss Laboratory Mat-2.08 Independent research projects n appled mathematcs (3 cr) "! #$&% Antt Laukkanen 506 R ajlaukka@cc.hut.f 2 Introducton...3 2 Multattrbute

More information

Classification / Regression Support Vector Machines

Classification / Regression Support Vector Machines Classfcaton / Regresson Support Vector Machnes Jeff Howbert Introducton to Machne Learnng Wnter 04 Topcs SVM classfers for lnearly separable classes SVM classfers for non-lnearly separable classes SVM

More information

Optimal Workload-based Weighted Wavelet Synopses

Optimal Workload-based Weighted Wavelet Synopses Optmal Workload-based Weghted Wavelet Synopses Yoss Matas School of Computer Scence Tel Avv Unversty Tel Avv 69978, Israel matas@tau.ac.l Danel Urel School of Computer Scence Tel Avv Unversty Tel Avv 69978,

More information

Object-Based Techniques for Image Retrieval

Object-Based Techniques for Image Retrieval 54 Zhang, Gao, & Luo Chapter VII Object-Based Technques for Image Retreval Y. J. Zhang, Tsnghua Unversty, Chna Y. Y. Gao, Tsnghua Unversty, Chna Y. Luo, Tsnghua Unversty, Chna ABSTRACT To overcome the

More information

Lobachevsky State University of Nizhni Novgorod. Polyhedron. Quick Start Guide

Lobachevsky State University of Nizhni Novgorod. Polyhedron. Quick Start Guide Lobachevsky State Unversty of Nzhn Novgorod Polyhedron Quck Start Gude Nzhn Novgorod 2016 Contents Specfcaton of Polyhedron software... 3 Theoretcal background... 4 1. Interface of Polyhedron... 6 1.1.

More information

A Fast Content-Based Multimedia Retrieval Technique Using Compressed Data

A Fast Content-Based Multimedia Retrieval Technique Using Compressed Data A Fast Content-Based Multmeda Retreval Technque Usng Compressed Data Borko Furht and Pornvt Saksobhavvat NSF Multmeda Laboratory Florda Atlantc Unversty, Boca Raton, Florda 3343 ABSTRACT In ths paper,

More information

Generalized Team Draft Interleaving

Generalized Team Draft Interleaving Generalzed Team Draft Interleavng Eugene Khartonov,2, Crag Macdonald 2, Pavel Serdyukov, Iadh Ouns 2 Yandex, Russa 2 Unversty of Glasgow, UK {khartonov, pavser}@yandex-team.ru 2 {crag.macdonald, adh.ouns}@glasgow.ac.uk

More information

CS434a/541a: Pattern Recognition Prof. Olga Veksler. Lecture 15

CS434a/541a: Pattern Recognition Prof. Olga Veksler. Lecture 15 CS434a/541a: Pattern Recognton Prof. Olga Veksler Lecture 15 Today New Topc: Unsupervsed Learnng Supervsed vs. unsupervsed learnng Unsupervsed learnng Net Tme: parametrc unsupervsed learnng Today: nonparametrc

More information

CSCI 5417 Information Retrieval Systems Jim Martin!

CSCI 5417 Information Retrieval Systems Jim Martin! CSCI 5417 Informaton Retreval Systems Jm Martn! Lecture 11 9/29/2011 Today 9/29 Classfcaton Naïve Bayes classfcaton Ungram LM 1 Where we are... Bascs of ad hoc retreval Indexng Term weghtng/scorng Cosne

More information

TN348: Openlab Module - Colocalization

TN348: Openlab Module - Colocalization TN348: Openlab Module - Colocalzaton Topc The Colocalzaton module provdes the faclty to vsualze and quantfy colocalzaton between pars of mages. The Colocalzaton wndow contans a prevew of the two mages

More information

An Iterative Solution Approach to Process Plant Layout using Mixed Integer Optimisation

An Iterative Solution Approach to Process Plant Layout using Mixed Integer Optimisation 17 th European Symposum on Computer Aded Process Engneerng ESCAPE17 V. Plesu and P.S. Agach (Edtors) 2007 Elsever B.V. All rghts reserved. 1 An Iteratve Soluton Approach to Process Plant Layout usng Mxed

More information

LECTURE : MANIFOLD LEARNING

LECTURE : MANIFOLD LEARNING LECTURE : MANIFOLD LEARNING Rta Osadchy Some sldes are due to L.Saul, V. C. Raykar, N. Verma Topcs PCA MDS IsoMap LLE EgenMaps Done! Dmensonalty Reducton Data representaton Inputs are real-valued vectors

More information

Complex Numbers. Now we also saw that if a and b were both positive then ab = a b. For a second let s forget that restriction and do the following.

Complex Numbers. Now we also saw that if a and b were both positive then ab = a b. For a second let s forget that restriction and do the following. Complex Numbers The last topc n ths secton s not really related to most of what we ve done n ths chapter, although t s somewhat related to the radcals secton as we wll see. We also won t need the materal

More information

Learning a Class-Specific Dictionary for Facial Expression Recognition

Learning a Class-Specific Dictionary for Facial Expression Recognition BULGARIAN ACADEMY OF SCIENCES CYBERNETICS AND INFORMATION TECHNOLOGIES Volume 16, No 4 Sofa 016 Prnt ISSN: 1311-970; Onlne ISSN: 1314-4081 DOI: 10.1515/cat-016-0067 Learnng a Class-Specfc Dctonary for

More information

Data-dependent Hashing Based on p-stable Distribution

Data-dependent Hashing Based on p-stable Distribution Data-depent Hashng Based on p-stable Dstrbuton Author Ba, Xao, Yang, Hachuan, Zhou, Jun, Ren, Peng, Cheng, Jan Publshed 24 Journal Ttle IEEE Transactons on Image Processng DOI https://do.org/.9/tip.24.2352458

More information

Support Vector Machines. CS534 - Machine Learning

Support Vector Machines. CS534 - Machine Learning Support Vector Machnes CS534 - Machne Learnng Perceptron Revsted: Lnear Separators Bnar classfcaton can be veed as the task of separatng classes n feature space: b > 0 b 0 b < 0 f() sgn( b) Lnear Separators

More information

TPL-Aware Displacement-driven Detailed Placement Refinement with Coloring Constraints

TPL-Aware Displacement-driven Detailed Placement Refinement with Coloring Constraints TPL-ware Dsplacement-drven Detaled Placement Refnement wth Colorng Constrants Tao Ln Iowa State Unversty tln@astate.edu Chrs Chu Iowa State Unversty cnchu@astate.edu BSTRCT To mnmze the effect of process

More information

CHAPTER 3 SEQUENTIAL MINIMAL OPTIMIZATION TRAINED SUPPORT VECTOR CLASSIFIER FOR CANCER PREDICTION

CHAPTER 3 SEQUENTIAL MINIMAL OPTIMIZATION TRAINED SUPPORT VECTOR CLASSIFIER FOR CANCER PREDICTION 48 CHAPTER 3 SEQUENTIAL MINIMAL OPTIMIZATION TRAINED SUPPORT VECTOR CLASSIFIER FOR CANCER PREDICTION 3.1 INTRODUCTION The raw mcroarray data s bascally an mage wth dfferent colors ndcatng hybrdzaton (Xue

More information

Semi-Supervised Kernel Mean Shift Clustering

Semi-Supervised Kernel Mean Shift Clustering IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. XX, NO. XX, JANUARY XXXX 1 Sem-Supervsed Kernel Mean Shft Clusterng Saket Anand, Student Member, IEEE, Sushl Mttal, Member, IEEE, Oncel

More information

A Robust Method for Estimating the Fundamental Matrix

A Robust Method for Estimating the Fundamental Matrix Proc. VIIth Dgtal Image Computng: Technques and Applcatons, Sun C., Talbot H., Ourseln S. and Adraansen T. (Eds.), 0- Dec. 003, Sydney A Robust Method for Estmatng the Fundamental Matrx C.L. Feng and Y.S.

More information