IEEE TRANSACTIONS ON IMAGE PROCESSING, VOL. 25, NO. 4, APRIL 2016


Weakly Supervised Fine-Grained Categorization With Part-Based Image Representation

Yu Zhang, Xiu-Shen Wei, Jianxin Wu, Member, IEEE, Jianfei Cai, Senior Member, IEEE, Jiangbo Lu, Senior Member, IEEE, Viet-Anh Nguyen, Member, IEEE, and Minh N. Do, Fellow, IEEE

Abstract: In this paper, we propose a fine-grained image categorization system with easy deployment. We do not use any object/part annotation (weakly supervised) in the training or in the testing stage, but only class labels for training images. Fine-grained image categorization aims to classify objects with only subtle distinctions (e.g., two breeds of dogs that look alike). Most existing works heavily rely on object/part detectors to build the correspondence between object parts, which require accurate object or object part annotations at least for training images. The need for expensive object annotations prevents the wide usage of these methods. Instead, we propose to generate multi-scale part proposals from object proposals, select useful part proposals, and use them to compute a global image representation for categorization. This is specially designed for the weakly supervised fine-grained categorization task, because useful parts have been shown to play a critical role in existing annotation-dependent works, but accurate part detectors are hard to acquire. With the proposed image representation, we can further detect and visualize the key (most discriminative) parts in objects of different classes. In the experiments, the proposed weakly supervised method achieves comparable or better accuracy than the state-of-the-art weakly supervised methods and most existing annotation-dependent methods on three challenging datasets. Its success suggests that it is not always necessary to learn expensive object/part detectors in fine-grained image categorization.

Index Terms: Fine-grained categorization, weakly-supervised, part selection.
Manuscript received September 29, 2015; revised January 6, 2016 and February 6, 2016; accepted February 8, 2016. Date of publication February 18, 2016; date of current version March 1, 2016. Y. Zhang, J. Lu, V.-A. Nguyen, and M. N. Do are supported by the research grant for the Human-Centered Cyber-physical Systems Programme at the Advanced Digital Sciences Center from Singapore's Agency for Science, Technology and Research (A*STAR). J. Wu is supported in part by a grant from the National Natural Science Foundation of China. J. Cai is supported in part by Singapore MoE AcRF Tier-1 Grant RG138/14. M. N. Do is supported in part by the US National Science Foundation (NSF) grants CCF and IIS. The associate editor coordinating the review of this manuscript and approving it for publication was Prof. Christine Guillemot. (Corresponding author: Jianxin Wu.)

Y. Zhang is with the Bioinformatics Institute, A*STAR, Singapore (e-mail: zhangyu@bii.a-star.edu.sg). This work was mainly done when he was working in the Advanced Digital Sciences Center and Nanyang Technological University, Singapore.

X.-S. Wei and J. Wu are with the National Key Laboratory for Novel Software Technology, Nanjing University, Nanjing, China (e-mail: weixs@lamda.nju.edu.cn; wujx2001@nju.edu.cn).

J. Cai is with the School of Computer Engineering, Nanyang Technological University, Singapore (e-mail: asjfcai@ntu.edu.sg).

J. Lu and V.-A. Nguyen are with the Advanced Digital Sciences Center, Singapore (e-mail: jiangbo.lu@adsc.com.sg; vanguyeng@adsc.com.sg).

M. N. Do is with the University of Illinois at Urbana-Champaign, Urbana, IL USA (e-mail: minhdo@illinois.edu).

Color versions of one or more of the figures in this paper are available online.

Digital Object Identifier /TIP

I. INTRODUCTION

Fine-grained image categorization has been popular for the past few years.
Different from traditional general image recognition such as scene or object recognition, fine-grained categorization deals with images with subtle distinctions, which usually involves the classification of subclasses of objects belonging to the same class, like birds [1]-[4], dogs [5], planes [6], plants [7]-[9], etc. As shown in Fig. 1, fine-grained categorization needs to discriminate between objects that are visually similar to each other. In the red box of Fig. 1, Siberian Husky and Malamute are two different breeds of dogs that might be difficult to distinguish for humans who are not experts. General image categorization, however, is comparatively easier, e.g., most people can easily recognize that the red box in Fig. 1 contains dogs while the blue box contains a kangaroo. Image representations that used to be useful for general image categorization may fail in fine-grained image categorization, especially when the objects are not well aligned, e.g., the two dogs are in different poses and the backgrounds are cluttered. Therefore, fine-grained categorization requires methods that are more discriminative than those for general image classification.

Fine-grained categorization has wide applications in both industry and research societies. Different datasets have been constructed in different domains, e.g., birds [1], butterflies [10], cars [11], etc. These datasets can have significant social impacts, e.g., butterflies [10] are used to evaluate the forest ecosystem and climate change.

One important common feature of many existing fine-grained methods is that they explicitly use annotations of an object or even object parts to depict the object as precisely as possible. Bounding boxes of objects and/or object parts are the most commonly used annotations. Most of these methods heavily rely on object/part detectors to find the part correspondence among objects. For example, in [12] and [13], the poselet [14] is used to detect object parts. Then, each object is represented with a bag of poselets, and suitable matches among poselets (parts) can be found between two objects.
Instead of using poselets, [15] used the deformable part models (DPM) [16] for object part detection. In [15], DPM is learned from the annotated object parts in training objects, and is then applied on testing objects to detect parts. Some works, like [17] and [18], transfer the part annotations from objects in training images to those sharing similar shapes in testing images.

IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See for more information.

Instead of seeking precise part localization, [17] proposed an unsupervised object alignment technique, which roughly aligns


objects and divides them into corresponding parts along certain directions. It achieves better results than the label transfer method. Recently, [19] proposed to use object and part detectors with powerful CNN feature representations [20], which achieves state-of-the-art results on the Caltech-UCSD Birds (CUB) 200-2011 [1] dataset. The geometric relationship between an object and its parts is considered in [19]. Zhang et al. [21] also show that part-based models with CNN features are able to capture subtle distinctions among objects. Krause et al. [22] used object bounding boxes to cosegment objects and align the parts. Some other works, e.g., [23], [24], recognize fine-grained images with a human in the loop.

Fig. 1. Fine-grained categorization vs. general image categorization. Fine-grained categorization (red box) processes visually similar objects, e.g., to recognize Siberian Husky and Malamute. General image categorization usually distinguishes an object such as dogs (red box) from other objects that are visually very different (e.g., a kangaroo).

In this paper, a part refers to a subregion in an object. For example, the parts in a bird include head, body, legs, etc. To achieve accurate part detection, most existing fine-grained works require annotated bounding boxes for objects, in both training and testing stages. As pointed out in [19], such a requirement is not realistic for practical usage. Thus, a few works, such as [19] and [20], have looked into a more realistic setup, i.e., only utilizing the bounding box in the training stage but not in the testing stage. However, even with such a setup, wide deployment of these methods is still hard, since the accurate object annotations needed in the training stage are usually expensive to acquire, especially for large-scale image classification problems. Freeing fine-grained image categorization from its dependency on detailed manual annotations is therefore an interesting research problem. Xiao et al.
[25] have shown promising results without using detailed manual annotations. They try to detect accurate objects and parts with complex deep learning models for fine-grained recognition. In this paper, we also aim to categorize fine-grained images with only category labels and without any bounding box annotation in both the training and testing stages, while not degrading the categorization accuracy. Our setup is the same as that of [25].

Notice that in the existing annotation-dependent works, representative parts like the head and body in birds [19] have been shown to play the key role in capturing the subtle differences of fine-grained images. Different from general image recognition, which usually uses a holistic image representation, we also try to make use of part information. However, unlike state-of-the-art fine-grained categorization methods, we do not try to find accurate part detections, since the existing accurate part detectors (e.g., [19]) rely on bounding box annotations while we consider a weakly-supervised setup in this research. Our key idea is to generate part proposals from object proposals, then select useful part proposals, and encode the selected part proposals into a global image representation for fine-grained categorization.

Fig. 2. System overview. This figure is best viewed in color. Note that we do not use any bounding box or part annotation.

Fig. 2 gives a system overview, where there are three major steps: part proposal generation, useful part selection, and multi-scale image representation. In the first step, we extract object proposals, which are image patches that may contain an object. Part proposals are the sub-regions of the object proposals in each image, as illustrated in Fig. 2. We propose an efficient multi-max pooling (MMP) strategy to generate features for multi-scale part proposals by leveraging the internal structure of CNN.
Considering the fact that most part proposals generated in the first step are from background clutter (which is harmful to categorization), in the second step, we propose to select useful part proposals from each image by exploring useful information in part clusters (all part proposals are clustered). For each part cluster, we compute an importance score, indicating how important the cluster is for the fine-grained categorization task. Then, those part proposals assigned to the useful clusters (i.e., those with the largest importance scores) are selected as useful parts. Finally, the selected part proposals in each image are encoded into a global image representation. To highlight the subtle distinction among fine-grained objects, we encode the selected parts at different scales separately, which we name SCale Pyramid Matching (ScPM). ScPM provides better discrimination than encoding all parts in one image altogether, i.e., without using the proposed scale pyramid matching. Note that we propose to select many useful parts from multi-scale part proposals of objects in each image and


compute a global image representation for it, which is then used to learn a linear classifier for image categorization. We believe that selecting many useful part proposals is better than selecting only the best part proposal in the final global representation. This is because it is very difficult to determine the exact location of an object/part in the image in our weakly-supervised scenario. Multiple useful part proposals can complement each other to provide more useful information in characterizing the object. Experimental results show that the proposed method achieves comparable or better accuracy than the state-of-the-art weakly-supervised work [25] and even most of the existing annotation-dependent methods on three challenging benchmark datasets. Its success suggests that it is not always necessary to learn expensive object/part detectors in fine-grained image categorization.

In addition, utilizing the proposed weakly-supervised fine-grained image representation, we can detect the key (most discriminative) object parts for different classes, which coincide well with the rules used by human experts (e.g., the yellow-throated vireo and the black-capped vireo differ because the yellow-throated vireo has a yellow throat while the black-capped vireo has a black head, cf. Fig. 3).

Fig. 3. Black-capped Vireo and Yellow-throated Vireo. They have the most distinctive parts in multiple part proposals: black cap and yellow throat, respectively, which are specified in red boxes. On the right, we show the key parts detected using the proposed representation from the two species. More examples of detected discriminative parts can be found in Fig. 8. This figure is best viewed in color.

Overall, our main contribution lies in the explicit part proposal generation and selection, which, to the best of our knowledge, is proposed for the first time for fine-grained image categorization in a weakly-supervised setup.
Another major contribution is the proposed framework, which coherently integrates the three modules (part proposal generation, useful part selection, and multi-scale image representation) and achieves state-of-the-art results.

II. RELATED WORKS

In this section, we review several works from two aspects of fine-grained categorization: part based image representation and weakly supervised methods.

A. Part Based Methods

Part representation has been investigated in general image recognition. In [26], over-segmented regions in images are used as parts and LDA (linear discriminant analysis) is used to learn the most discriminative ones for scene recognition. In [27], discriminative parts/modes are selected through the mean shift method on local patches in images for each class. In [28], a set of representative parts are learned using an SVM (support vector machine) classifier with the group sparse constraint for each class in image recognition and segmentation. All these methods tried to evaluate each part, which may be very computationally expensive when the number of parts is very large.

Part based methods have also been used in fine-grained image categorization for a long time. Detailed part annotations are provided with some datasets like CUB [1], where each bird in the image has 15 part annotations. Some methods, for instance [17], [18], directly extract feature vectors from these annotated parts for recognition. Gavves et al. [17] also consider generating parts from aligned objects by dividing each object into several segments and assuming that each segment is a part in the object.

Some works consider a more practical setup in which part annotations are missing in the testing phase. They learn part detectors from annotated parts in the training images and apply them on testing images to detect parts. These part detectors include DPM or object classifiers learned for each object class. Zhang et al. [19] used selective search to generate object/part proposals from each image, and applied the learned part detectors on them to detect the head and body in the bird.
The proposal which yields the highest response to a certain part detector is used as the detected part in the object.

Convolutional neural networks (CNN) have been widely used in image recognition. The outputs from the inner convolutional (CONV) layers can be seen as the feature representations of sub-regions in the image. When CNN is used on an object proposal, the outputs from the inner convolutional layers can be seen as the part representations, e.g., [25] used CNN on detected objects, and used the outputs from CONV4 (in AlexNet) as the parts. Simon and Rodner [29] used the outputs from all layers in CNN and selected some important ones as parts.

Recently, CNN aided by region proposal methods has become popular in object recognition/detection, e.g., RCNN [30], Fast-RCNN [31], Faster-RCNN [32], and RCNN-minus-R [33]. All these four methods focus on supervised object detection, where object bounding boxes in training images are necessary to learn the object detectors. They cannot be directly used in our weakly-supervised fine-grained image categorization. These methods generate object level representations, while ours uses fine-grained part level representations. In RCNN, CNN is applied on each object proposal (a bounding box acquired by selective search on the input image) and the output from the fully connected layer is used as the feature vector, so CNN is applied multiple times on an image. In Fast-RCNN, CNN is only applied once on the whole image. The bounding boxes of object proposals are mapped to the final convolutional (CONV) layer to get the object feature. Similarly, RCNN-minus-R used sliding windows to map to the last CONV layer in CNN in order to get the object representation. In Faster-RCNN, instead of mapping object

proposals from input images, sliding windows are directly used on the last CONV layer to get the object feature.

Some existing works are related to the proposed method. The proposed MMP is an efficient way to generate multi-scale part proposals to characterize fine-grained objects. It can be easily applied on millions or even billions of object proposals in a dataset. Unlike [25], where the outputs of CONV4 in CNN are used as parts, MMP provides dense coverage on different scales from part level to object level for each object proposal. The large number of part proposals provides more opportunity to mine subtle useful information about objects.

Part selection can automatically explore those parts which are important for categorization by only using image-level labels. It is more efficient and practical than trying to learn explicit part detectors without groundtruth object/part annotations. Xiao et al. [25] also worked on fine-grained categorization without object/part annotations, but their method requires much more computation than ours. Xiao et al. [25] used two CNN models to detect interesting objects and further learned accurate part detectors from them. In contrast, we only need to select important parts from all part proposals, which are generated by applying one CNN model. More importantly, our method shows that without explicitly detecting the fine-grained objects/parts, the proposed image representation can acquire better discriminative power than [25] (cf. Table III).

ScPM is different from the Multi-scale Pyramid Pooling (MPP) method in [34], where MPP encodes local features from images resized on different scales into separate Fisher vectors (FV) [35], and aggregates all the FVs into one to represent an image. Such aggregation may not highlight the subtle differences of object parts on different scales, which is especially important in fine-grained objects with complex backgrounds.
In contrast, in ScPM, we automatically select different numbers of important part clusters on different scales using the proposed part selection method described in Sec. III-B. We will also use FV to encode the parts on each scale. The final FV representations from different scales are likely to have different lengths, and therefore cannot be simply aggregated as in MPP. We denote the strategy used in MPP as sum pooling, and compare it with the proposed ScPM in the experiment.

Spatial pyramid matching (SPM) [36] is also not suitable for fine-grained image categorization. This is because spatial correspondence does not necessarily exist among manually split regions in fine-grained images, which may cause spatial mismatching problems [37].

B. Weakly Supervised Fine-Grained Categorization

Most existing fine-grained works heavily rely on object/part annotations in categorization when the objects are in complex backgrounds. [25] is the first work which categorizes fine-grained images without using human annotations in any image (both training and testing), but with only image labels. In [25], a CNN that is pre-trained on ImageNet is first used as an object detector to detect the object in each image. Then, part features (outputs from CONV4 in CNN) are extracted from objects and clustered into several important ones by spectral clustering. For each part cluster, a part detector is learned to differentiate it from other clusters. Finally, these part detectors are used to detect useful parts in testing images.

In [25], each part is evaluated extensively by the learned part detectors and the detected ones are concatenated into the final image representation. In contrast, our method first encodes the large number of parts into a global image representation and then performs part selection on it, which saves much more computational effort than [25].

Simon and Rodner [29] also categorized fine-grained images in the same setup. They first generated a pool of parts by using the outputs from all layers in CNN. Then, they selected useful ones for categorization.
They consider two ways of selection: one is to randomly select some parts; the other is to select a compact set by considering the relationship among them. These parts are concatenated to represent the image. Jaderberg et al. [38] learned to detect and align objects in an end-to-end system. This system includes two parts: one is an object detector, which is followed by a spatial transformer. The spatial transformer is learned to align the detected objects automatically to make the parts match accurately.

This paper is different from [25], [29], and [38] in that we do not explicitly detect/align the object/part in the image, but propose an efficient part selection method to extract the most discriminative information for categorization.

III. FINE-GRAINED IMAGE REPRESENTATION WITHOUT USING OBJECT/PART ANNOTATIONS

The proposed part-based image representation includes three parts: part proposal generation, part selection, and multi-scale image representation, which are detailed in Sections III-A to III-C, respectively.

A. Part Proposal Generation

Regional information has been shown to improve image classification with hand-crafted methods like spatial pyramid matching [36] and receptive fields [39]. When a CNN model is applied on an image, features of local regions can be acquired automatically from its internal structure. Assume the output from a layer in CNN is of dimension $N \times N \times d$, which is the output of $d$ filters for $N \times N$ spatial cells. Each spatial cell is computed from a receptive field in the input image. The receptive fields of all the spatial cells in the input image can highly overlap with each other. The size of one receptive field can be computed layer by layer in CNN. In a convolution (pooling) layer, if the filter (pooling) size is $a \times a$ and the stride is $s$, then $T \times T$ cells in the output of this layer correspond to $[s(T-1)+a] \times [s(T-1)+a]$ cells in the input of this layer. For example, one cell in the CONV5 (the 5th convolutional) layer of the CNN model (imagenet-vgg-m) [40] corresponds to a receptive field in the input image (cf. Fig. 4).
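The layer-by-layer recursion above can be sketched in a few lines of Python. This is only an illustration: the `LAYERS` list is an assumed stack of (size, stride) pairs for a network with five convolutional layers and two pooling layers, not a configuration confirmed by the text, and the function name `receptive_field` is hypothetical.

```python
from typing import List, Tuple

# (filter or pooling size a, stride s) for each layer, ordered from input to output.
# ASSUMPTION: an illustrative stack, not the confirmed configuration of any model.
LAYERS: List[Tuple[int, int]] = [
    (7, 2),  # conv1
    (3, 2),  # pool1
    (5, 2),  # conv2
    (3, 2),  # pool2
    (3, 1),  # conv3
    (3, 1),  # conv4
    (3, 1),  # conv5
]


def receptive_field(t: int, layers: List[Tuple[int, int]]) -> int:
    """Side length, in input pixels, covered by a t x t block of output cells.

    Walks from the last layer back to the input, applying the rule
    size_in = s * (size_out - 1) + a at every layer.
    """
    size = t
    for a, s in reversed(layers):
        size = s * (size - 1) + a
    return size


if __name__ == "__main__":
    # Larger blocks of output cells cover larger regions of the input.
    for t in (1, 2, 4, 13):
        side = receptive_field(t, LAYERS)
        print(f"{t} x {t} output cells -> {side} x {side} input pixels")
```

Because every step is affine in the block size, growing the block by one output cell enlarges the covered input region by the product of all strides, which is why blocks of adjacent cells give a regular ladder of scales.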
We generate features of multi-scale receptive fields for an image by leveraging the internal outputs of CNN with little additional computational cost (cf. Fig. 5). Considering the outputs of one layer in CNN, we can pool the activation vectors of adjacent cells of different sizes, which correspond


to receptive fields with different sizes in the input image. Max-pooling is used here. In particular, given the $N \times N \times d$ output $X$ of one layer in CNN, we use max-pooling to combine information from all $M \times M$ adjacent cells, that is:

$$z^{M}_{i,j,k} = \max_{i \le p < i+M,\; j \le q < j+M} X_{p,q,k}, \quad \text{s.t. } 1 \le M \le N,\ 1 \le k \le d, \tag{1}$$

where $M$ ranges from 1 (single cell) to $N$ (all the cells). In Eq. 1, an $M \times M$ spatial neighborhood is represented by a $d$-dimensional feature mapping $z^{M}$. When $M$ is assigned different values, the corresponding cells can cover receptive fields of different sizes (scales) in the input image, thus providing more comprehensive information. We name this proposed part proposal generation strategy multi-max pooling (MMP) and apply it to the CONV5 layer (the last CONV layer in CNN). This is because the CONV5 layer can capture more meaningful object/part information than the shallow layers in CNN [41]. When a CNN model is applied on an object bounding box in an image, the receptive fields acquired from MMP can be seen as the part candidates for the object. Thus, we can acquire a multi-scale representation of parts in objects with MMP.

Fig. 4. Receptive fields computed using the CNN model (imagenet-vgg-m) [40]. One cell in the CONV5 layer corresponds to a receptive field in the input image. We only show the spatial sizes of the image and filters, where $a \times a$ is the filter (pooling) size, and "st" means the stride.

Fig. 5. Generating multi-scale part proposals. For an input object proposal, by applying CNN on it, spatial cells of different sizes on the CONV5 layer in CNN correspond to parts of different scales. This figure is best viewed in color.

To compute the part proposals, we first generate object proposals from each image. Object proposals are those regions inside an image that have high objectness, i.e., a higher chance of containing an object. Since no object/part annotations are utilized, we can only use unsupervised object detection methods. Selective search [42] is used in our framework given its high computational efficiency; it has also been used in [19] and [30] to generate initial object/part candidates for object detectors. After generating multiple object proposals, we apply the CNN model on each bounding box/object proposal, and use the proposed MMP to get a large number of part proposals from each object proposal.

B. Part Selection

We then propose to select useful (i.e., discriminative) part clusters, and form a global representation from these useful parts in each image. Among the object/part proposals, most are from background clutter, which is harmful for image recognition. For example, in the CUB [1] dataset, when we use the intersection over union criterion, only 10.4% of object proposals cover the foreground object. The part proposals from those unsuccessful object proposals will contribute little to the classification, or even be noisy and harmful. Thus, we need to find those useful part proposals (discriminative parts of the foreground object) for our final image representation.

Our basic idea is to select useful parts through mining the useful information in part clusters. We first cluster all part proposals in the training set into several groups. Then, we compute the importance of each cluster for image classification. Those part proposals assigned to the useful clusters (clusters with the highest importance values) are selected as the useful parts.

We compute the cluster importance with the aid of the Fisher vector (FV) [35] (see footnote 1). We first encode all the part proposals in each image into an FV with a GMM (Gaussian Mixture Model). The GMM is learned using part proposals extracted from training images. Each Gaussian corresponds to a part cluster. Then, for each dimension $x_{:i}$ in the FVs of all training images, we compute its importance using its mutual information (MI) with the class labels $y$ [45]. Zhang et al. [45] show that different dimensions in FV have weak correlations, which advocates processing each dimension separately.
The MI value of each dimension $x_{:i}$ in FV is computed as:

$$I(x_{:i}, y) = H(y) + H(x_{:i}) - H(x_{:i}, y), \tag{2}$$

Footnote 1: VLAD can also be used in our framework; it is used in [43] to encode CNN features of multiple spatial regions for general image classification. We choose FV because it has better discriminative power than VLAD [44].

where $H$ is the entropy of a random variable. Since $y$ remains unchanged for different $i$, we simply need to compute $H(x_{:i}) - H(x_{:i}, y)$. In order to compute the value distribution of $x_{:i}$ in Eq. 2, an efficient 1-bit quantization method [45] is used. A scalar $x$ in $x_{:i}$ is quantized according to

$$x \leftarrow \begin{cases} 1, & x \ge 0 \\ -1, & x < 0. \end{cases} \tag{3}$$

Finally, the cluster (Gaussian) importance is the summation of the MI values of all FV dimensions computed from this Gaussian. For a Gaussian $G$, its importance is computed as:

$$m(G) = \sum_{i \in G} I(x_{:i}, y). \tag{4}$$

We only keep those dimensions in FV from the most important Gaussians, i.e., those with the largest importance values. As will be shown in Sec. IV, this novel strategy greatly improves categorization accuracy, even when object or part annotations are not used at all.

C. Multi-Scale Image Representation

Considering that our part proposals are generated at different scales (with different $M$ in Eq. 1), aggregating all of them into a single image representation cannot highlight the subtle distinctions in fine-grained images. Thus, we propose to encode part proposals in an image on different scales separately, and we name this SCale Pyramid Matching (ScPM). The steps are as follows:

Generate parts on different scales. Given an image $I$, which contains a set of object proposals $I = \{o_1, \dots, o_{|I|}\}$, each object proposal $o_i$ contains a set of multi-scale part proposals $o_i = \{z_1, \dots, z_{|o_i|}\}$. For part proposals in $I$ on different scales $M \in \{1, \dots, N\}$, we compute separate FVs. In practice, the number of scales can be very large ($N = 13$ in the CNN setting), which may lead to a severe memory problem. Since the part proposals on neighboring scales are similar in size, we can divide all the scales into $m$ ($m \le N$) non-overlapping groups $\{g(j),\ j = 1, \dots, m,\ g(j) \subseteq \{1, \dots, N\}\}$.

Compute FV using selected parts on each scale.
For an image $I$, its part proposals belonging to the scale group $g(j)$ are used to compute one FV $\phi_j(I)$ as:

$$\phi_j(I) = \left[ f_{\mu_1^j}(I),\ f_{\sigma_1^j}(I),\ \dots,\ f_{\mu_i^j}(I),\ f_{\sigma_i^j}(I),\ \dots \right], \tag{5}$$

$$f_{\mu_i^j}(I) = \frac{1}{\sqrt{w_i^j}} \sum_{c(t) \in g(j)} \gamma_t^j(i) \left( \frac{z_t - \mu_i^j}{\sigma_i^j} \right), \tag{6}$$

$$f_{\sigma_i^j}(I) = \frac{1}{\sqrt{2 w_i^j}} \sum_{c(t) \in g(j)} \gamma_t^j(i) \left[ \frac{(z_t - \mu_i^j)^2}{(\sigma_i^j)^2} - 1 \right], \tag{7}$$

where $\{w_i^j, \mu_i^j, \sigma_i^j\}$ are the mixture weights, mean vectors, and standard deviation vectors of the $i$-th selected diagonal Gaussian in the $j$-th scale group $g(j)$, $j = 1, \dots, m$, respectively. $\{z_t\}$ are the selected part proposals in an image, $c(t)$ is the scale index of the $t$-th part, and $\gamma_t^j(i)$ is the weight of the $t$-th instance to the $i$-th Gaussian in the $j$-th scale group.

Fig. 6. The process of generating image representation using ScPM.

Image representation. Following [35], two parts corresponding to the mean and the standard deviation in each Gaussian of FV are used. Each of the $m$ FVs is power and $\ell_2$ normalized independently, and then concatenated to represent the whole image as $\phi(I)$:

$$\phi(I) = [\phi_1(I), \dots, \phi_m(I)]. \tag{8}$$

Feature normalization. Because of the $\ell_2$ normalization, each $\phi_j(I)$ satisfies $\|\phi_j(I)\|_2 = 1$. After part selection, however, this property ceases to hold. Because only a few parts are selected, we expect $\|\phi_j(I)\|_2 < 1$ for all $1 \le j \le m$. Data normalization has been shown to effectively improve the discriminative power of a representation [46]. For the image representation after part selection, we apply power normalization and $\ell_2$ normalization again. The whole process is illustrated in Fig. 6.

IV. EXPERIMENTS

In this section, we evaluate the proposed weakly-supervised method for fine-grained categorization. The selective search method [42] with default parameters is used to generate object proposals for each image. The pre-learned CNN models [40] from ImageNet are used to extract features from each object proposal as in [30], which has been shown to achieve state-of-the-art results. The CNN is fine-tuned with training images and their labels.
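As a rough illustration of Section III in miniature, the sketch below strings together three of its computations: the multi-max pooling of Eq. 1, the 1-bit mutual-information cluster importance of Eqs. 2 to 4, and the power plus $\ell_2$ re-normalization applied after part selection. It is not the authors' implementation. All function names are hypothetical, the pooling windows are assumed to slide with stride 1 and stay inside the feature map, the FV dimensions of each Gaussian are assumed to be stored contiguously, and the power-normalization exponent of 0.5 (signed square root) is an assumed common choice that the text does not specify. FV encoding itself (Eqs. 5 to 7) is taken as given.

```python
import math
from collections import Counter
from typing import List, Sequence


def multi_max_pooling(X: List[List[List[float]]], M: int) -> List[List[List[float]]]:
    """Eq. 1: channel-wise max over every M x M window of an N x N x d map X.

    ASSUMPTION: stride-1 windows that lie fully inside the map.
    """
    N, d = len(X), len(X[0][0])
    return [[[max(X[p][q][k] for p in range(i, i + M) for q in range(j, j + M))
              for k in range(d)]
             for j in range(N - M + 1)]
            for i in range(N - M + 1)]


def entropy(symbols: Sequence) -> float:
    """Empirical entropy (in bits) of a sequence of hashable symbols."""
    n = len(symbols)
    return -sum((c / n) * math.log2(c / n) for c in Counter(symbols).values())


def mutual_information_1bit(column: Sequence[float], labels: Sequence[int]) -> float:
    """Eq. 2 on one FV dimension after the 1-bit quantization of Eq. 3."""
    bits = [1 if v >= 0 else -1 for v in column]
    return entropy(labels) + entropy(bits) - entropy(list(zip(bits, labels)))


def cluster_importance(fvs: Sequence[Sequence[float]], labels: Sequence[int],
                       dims_per_gaussian: int) -> List[float]:
    """Eq. 4: sum the MI of all FV dimensions that belong to each Gaussian.

    ASSUMPTION: each Gaussian owns a contiguous run of dims_per_gaussian dimensions.
    """
    n_dims = len(fvs[0])
    mi = [mutual_information_1bit([fv[i] for fv in fvs], labels) for i in range(n_dims)]
    return [sum(mi[g:g + dims_per_gaussian]) for g in range(0, n_dims, dims_per_gaussian)]


def top_clusters(importance: Sequence[float], fraction: float) -> List[int]:
    """Indices of the most important Gaussians, keeping the given fraction."""
    n_keep = max(1, round(fraction * len(importance)))
    order = sorted(range(len(importance)), key=lambda g: importance[g], reverse=True)
    return sorted(order[:n_keep])


def select_and_normalize(fv: Sequence[float], keep: Sequence[int],
                         dims_per_gaussian: int, alpha: float = 0.5) -> List[float]:
    """Keep the dimensions of the selected Gaussians, then power and l2 normalize.

    ASSUMPTION: alpha = 0.5 (signed square root) for the power normalization.
    """
    kept = [v for g in keep for v in fv[g * dims_per_gaussian:(g + 1) * dims_per_gaussian]]
    powered = [math.copysign(abs(v) ** alpha, v) for v in kept]
    norm = math.sqrt(sum(v * v for v in powered))
    return [v / norm for v in powered] if norm > 0 else powered
```

In use, `multi_max_pooling` would be called once per scale $M$ on the last convolutional feature map of each object proposal, and `cluster_importance`, `top_clusters`, and `select_and_normalize` would be applied to the FV of each scale group separately before concatenating the groups as in Eq. 8. A practical implementation would use an array library rather than nested lists; plain Python is used here only to keep the sketch dependency-free.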
We would like to point out that we do not fine-tune the CNN using object proposals because many of them come from background clutter, which may deteriorate the CNN performance. We use the imagenet-vgg-m model [40], given that its efficiency and accuracy are both satisfactory. It has a similar structure (with 5 convolutional layers) to that of AlexNet [47]. The part proposals in each scale group are assigned into 128 clusters. Each part feature is reduced to 128 dimensions by PCA. All 13 part scales ($N = 13$ in the CNN model) are divided into 8 scale groups: the first 4 scales form the first 4 groups, the subsequent 6 scales form 3 groups with 2 scales in each group, and the last 3 scales form the last scale group. This arrangement makes the number of parts in each group roughly balanced. The dimension of the global image representation using FV becomes $128 \times 2 \times 128 \times 8 = 262{,}144$, from which different fractions of useful part clusters will be selected and evaluated.

We evaluate the proposed method on three benchmark fine-grained datasets:

• CUB [1]: The Caltech-UCSD Birds dataset contains 200 different bird classes. It includes 5994 training images and 5794 testing images.

• StanfordDogs [5]: This dataset contains 120 different types of dogs.

• VMMR-40 [11]: It contains 928 classes. Each class has at least 40 images. We use 20 images in each class for training and the rest for testing.

For all datasets, we only use the class labels of images in the training stage. We choose LIBLINEAR [48] to learn linear SVM classifiers for classification. All the experiments are run on a computer with an Intel CPU, 64G main memory, and an Nvidia Titan GPU.

TABLE I. EVALUATION OF DIFFERENT MODULES IN THE PROPOSED IMAGE REPRESENTATION ON CUB DATASET

TABLE II. CLASSIFICATION ACCURACY (%) OF THE PART BASED IMAGE REPRESENTATION WITH DIFFERENT NUMBERS OF GMM

A. Influences of Different Modules

We evaluate different modules in the proposed part based image representation (without part selection) on the CUB dataset in Table I:

• The effect of MMP in the proposed image representation. We compare the part proposals generated using the outputs of CONV5 and CONV5+MMP. All part proposals in each image are encoded into one FV (without part selection and ScPM). It can be seen that multi-scale part proposals (CONV5+MMP) improve the recognition accuracy over single-scale part proposals (CONV5) by about 10%. This is because MMP can provide very dense coverage of object parts at different scales. The part based image representation is also shown to be significantly better than the object based image representation.

• The influence of ScPM in the proposed image representation. Using the multi-scale part proposals generated by MMP, ScPM achieves a better accuracy (2.3% higher) than the method that encodes all part proposals together. This shows that it is beneficial to encode parts at different scales separately.

• Evaluation of the global image representation using CNN, indicated as "whole image" in Table I. The CNN model is applied on the whole image, which is represented using the output of FC7. It leads to a significantly worse accuracy than our part based method.

We evaluate the proposed multi-scale image representation with different numbers of GMM components in Table II. The classification accuracy increases when the number of components increases. After the number exceeds 128, the accuracy improvement becomes slower. As a tradeoff between accuracy and computational efficiency (including both memory footprint and computation time), we use 128 components in the following experiments as the default value.

We compare ScPM with the sum pooling method used on FV [34] in Table II. ScPM shows better classification results than sum pooling [34] for all tested numbers of GMM components in FV. This is because ScPM can highlight the differences of fine-grained objects on various scales.

We summarize the observations from the above evaluations. First, MMP+ScPM can compute an efficient multi-scale part representation. Second, ScPM is better than sum pooling when pooling multiple FVs into a global representation. Finally, we fix the number of Gaussian components in the GMM to 128 when computing FV in the following experiments. In the following section, we will show that the proposed part selection can further improve the accuracy.

TABLE III. CLASSIFICATION ACCURACY COMPARISONS ON CUB DATASET USING VGG-CNN-M MODEL

B. Part Selection

We show the classification accuracy using part selection on the proposed image representation (MMP+ScPM) for CUB in Table III.
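To make the selection step concrete, the following is a small sketch of the part (cluster) selection described by Eqs. (2)-(4): each FV dimension is quantized to 1 bit, its mutual information with the class label is estimated from counts, the values are summed per Gaussian, and the top fraction of Gaussians is kept. The function names are hypothetical, and the layout of the FV (all dimensions of one Gaussian stored contiguously) is an assumption made for illustration.

```python
import math
from collections import Counter

def entropy(counts, n):
    """Empirical entropy (in nats) from a Counter of event counts."""
    return -sum((c / n) * math.log(c / n) for c in counts.values() if c > 0)

def mutual_information(column, labels):
    """I(x, y) = H(x) + H(y) - H(x, y) after 1-bit quantization of x (Eq. 3)."""
    bits = [1 if v >= 0 else -1 for v in column]
    n = len(labels)
    h_x = entropy(Counter(bits), n)
    h_y = entropy(Counter(labels), n)
    h_xy = entropy(Counter(zip(bits, labels)), n)
    return h_x + h_y - h_xy

def select_gaussians(fvs, labels, dims_per_gaussian, fraction):
    """Rank Gaussians by summed MI (Eq. 4) and keep the top fraction."""
    n_gauss = len(fvs[0]) // dims_per_gaussian
    importance = []
    for g in range(n_gauss):
        dims = range(g * dims_per_gaussian, (g + 1) * dims_per_gaussian)
        importance.append(
            sum(mutual_information([fv[d] for fv in fvs], labels) for d in dims)
        )
    keep = max(1, round(fraction * n_gauss))
    ranked = sorted(range(n_gauss), key=lambda g: importance[g], reverse=True)
    return sorted(ranked[:keep]), importance
```

With `fraction=0.25`, the call returns the indices of the most important quarter of the Gaussians, whose FV dimensions would then be retained and renormalized.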

TABLE IV. CLASSIFICATION ACCURACY ON CUB WITH OBJECT BASED IMAGE REPRESENTATION

TABLE V. CLASSIFICATION ACCURACY (%) ON CUB DATASET USING THE METHOD IN [26] ON THE SAME PARTS IN TABLE III

TABLE VI. CLASSIFICATION ACCURACY (%) ON CUB DATASET USING VLAD. FV RESULTS ARE ALSO CITED FOR COMPARISON

It can be seen that part selection can greatly improve accuracy. We show the results corresponding to selecting different fractions of part clusters in the image representation. When selecting the most important quarter of the part clusters (fraction 25%), a peak is reached, and it is better than the one without part selection (fraction 100%) by 3.5%. Even when fewer part proposals are selected (fraction 12.5%), the accuracy is still better than the one without part selection by 2.4%. This shows that part selection can effectively reduce the noise introduced by part proposals from background clutter.

We also compare part selection with feature selection [45] on the same feature representation with the same selection fraction (25%). Feature selection (77.54%) is worse than part selection (78.92%). This is because part selection can keep more semantic information of parts.

As a comparison to our proposed part based image representation, we evaluated an object based image representation for fine-grained image categorization. We applied the CNN on each object proposal and extracted the output from the FC7 layer as the object feature (reduced to 128 dimensions by PCA). The objects in each image were encoded into an FV with 128 GMM components. We applied feature selection [45] on the FVs and computed their classification accuracies. The results are shown in Table IV. When the background noise is discarded with different selection fractions, the classification accuracy of the object based image representation reaches at most 62.89%. We also evaluated the object generation method using Fast R-CNN [31], i.e., mapping object proposals to the last CONV layer in the CNN to get the object features.
The object features are encoded into an FV and feature selection (25% fraction) is applied, which gives 63.41% accuracy. Although these object based representations can be computed faster, they have much lower accuracy than our part based image representation.

Our best accuracy (78.92%) significantly outperforms the state-of-the-art weakly-supervised methods [25], [29] by over 9% and 10%, respectively, when similar CNN models (vgg-cnn-m and AlexNet) are used. With a deeper and more powerful CNN model (vgg-verydeep), [25] reduces the gap to ours to 1% while [29] achieves higher accuracy. Note that, in addition to the high complexity of using the very deep CNN model, [29] is expensive because it needs to evaluate each part to select the best ones. In contrast, ours only selects the best part clusters, whose number is much smaller than that of parts. Jaderberg et al. [38] achieve much higher results than other works because they used a more powerful baseline CNN structure.

We also compared with the "blocks that shout" method [26] applied to our parts, in Table V. Useful parts are selected through learned part classifiers and then encoded into an FV for each image. The accuracy does not improve when more part classifiers are used, and it is also lower than ours in Table III.

We also show the accuracy of annotation-dependent methods that use object / part annotations in the training stage but not in the testing stage; these use the fewest annotations and are closest to our weakly-supervised setup. Most of these methods try to learn expensive part detectors to get accurate matching for recognition. However, the superior performance of our method shows that they are not always necessary, especially in weakly-supervised fine-grained categorization.

We would like to highlight that part selection is more important in fine-grained categorization than feature selection is in general image categorization. With part selection, the accuracy (78.92%) is 3.5% higher than that of the original image representation. In [45], feature selection is used to compress FV for general image recognition such as object recognition.
There, selection yields a much smaller improvement (around 1%) over the original FV, and is worse most of the time, which is significantly different from the improvement observed in Table III. This fact clearly shows the distinction between the two applications. In the weakly-supervised fine-grained tasks, selecting proper object parts is critical, while in general image recognition, the global image representation without selection is usually already good.

We also compare the proposed image representation (using FV) with one using VLAD [43]. The classification accuracy using VLAD is shown in Table VI. VLAD leads to inferior results compared to FV at all selection fractions. At each selection fraction, the accuracy of VLAD is about 2% worse than that of FV. In the following experiments, we will only use FV in the proposed image representations.

We further evaluate the proposed method with the very deep CNN model (VGG-verydeep-16) [51]. The classification results are shown in Table VII. The very deep CNN model has 13 convolutional layers. It has a much deeper structure than our previously used CNN model (the vgg-m model), which has only 5 convolutional layers. Thus, the very deep CNN model can provide more discrimination in image recognition tasks. We also use the outputs from the layer before the last convolutional layer in our method. We find that the very deep CNN model has better results (77.28%) than the shallow model when part selection is not used.

TABLE VII. CLASSIFICATION ACCURACY (%) ON CUB DATASET USING VGG-VERYDEEP-16 CNN MODEL

However, after part selection is used, the difference shrinks: the best classification accuracies of the two models are 79.34% vs. 78.92%. This shows that a weak (shallow) CNN model can benefit from part selection in the proposed image representation. Besides, the very deep CNN model introduces much more computation than the shallow model. Thus, in the following experiments, we will only use the shallow CNN model (imagenet-vgg-m) in the proposed method.

We evaluate the time cost of each module of the proposed method on CUB. The image representation generation time is 3.4 seconds per image, where the CNN costs 0.9 second, part generation 2.3 seconds, and FV 0.2 second. The cost of part selection is almost negligible. Learning 8 GMMs (for the 8 scale groups) costs about 1 hour (using 1/5 of the training images). Learning the part selection parameters costs 1500 seconds. SVM classifiers take 40 minutes during training and 5 minutes during testing for features without part selection. With part selection, the time is reduced in proportion to the selection fraction.

Overall, these results show that: 1) part selection is important in weakly-supervised fine-grained categorization; 2) it is not always necessary to learn expensive object/part detectors in fine-grained categorization; 3) a very deep CNN model is not necessary in extracting parts when part selection is used; and 4) FV is better than VLAD in generating the image representation.

C. Understand Subtle Visual Differences: With the Help of Key Part Detection

We want to detect and show the key (most discriminative) parts in fine-grained images of different classes to give a more insightful understanding of the critical properties of objects, which may help us in feature design for fine-grained images. We learn a binary SVM (support vector machine) classifier in each selected part cluster to compute the part score.
This classifier is used to propagate the image labels to parts. In the training phase, for each selected part cluster, we aggregate the part features in one image assigned to this cluster (similar to VLAD). The aggregated features of the training images are $\ell_2$ normalized and are then used to train a classifier with the image labels. In the testing phase, given a part, its score is computed as the dot-product between the classifier for the part cluster it falls in (only considering those parts in the selected part clusters) and its feature (the CNN activation vector). Note that in both the training and testing processes, the part features are centered (i.e., the cluster center is subtracted in each part cluster).

Fig. 7 shows parts that belong to two clusters. The parts are sorted according to their importance scores in descending order. We can see that parts in the same cluster are relatively coherent, corresponding mainly to the head region of the two species of birds.

Fig. 7. Part variations. Parts are from two different part clusters. They are shown according to their importance scores in the descending order within each part cluster.

Then, we show more examples of key part detection in Fig. 8. In each pair, we show one sample image and 20 detected key parts with the highest (smallest) scores from all testing images of the positive (negative) class. The bird names are given in the captions, which clearly indicate how humans characterize different birds. It can be seen that the detected parts capture well the key parts in these species, which are consistent with human-defined rules. We also find that the proposed method can capture some tiny distinctions that might not be easily discriminated by human eyes. For example, in the first pair, the key parts in the red-bellied woodpecker and red-headed woodpecker are both red, and the locations are very close. From the detected parts, we can find that the red color of the red-headed woodpecker is darker and the feather of the red-bellied woodpecker is finer.
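The test-time scoring described at the start of this subsection (center the part feature by its cluster center, then take the dot-product with that cluster's linear classifier) can be sketched as follows. This is only an illustrative sketch: the names `part_score` and `key_parts` and the dictionary-based data layout are hypothetical, and the per-cluster classifier weights are assumed to have been learned beforehand.

```python
def center(feature, cluster_center):
    """Subtract the cluster center from a part feature."""
    return [f - c for f, c in zip(feature, cluster_center)]

def part_score(feature, cluster_id, centers, classifiers):
    """Dot-product between the cluster's classifier and the centered part feature.

    Returns None when the part falls outside the selected part clusters.
    """
    if cluster_id not in classifiers:
        return None
    centered = center(feature, centers[cluster_id])
    return sum(w * x for w, x in zip(classifiers[cluster_id], centered))

def key_parts(parts, centers, classifiers, top_k=20):
    """Return the top_k highest-scoring parts as (score, part_index) pairs."""
    scored = []
    for idx, (feature, cluster_id) in enumerate(parts):
        s = part_score(feature, cluster_id, centers, classifiers)
        if s is not None:
            scored.append((s, idx))
    return sorted(scored, reverse=True)[:top_k]
```

Here `parts` is a list of `(feature, cluster_id)` pairs, while `centers` and `classifiers` map a cluster index to its center and weight vector. Taking the lowest-scoring parts instead (for the negative class) would only require sorting in ascending order.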
From the detected parts, we can also understand the necessity of selecting many useful parts in the proposed image representation. Using only the best part may cause a loss of useful information in characterizing an object. Multiple good parts can complement each other in different aspects such as location, view, and scale. This also explains why the proposed representation works better than [25], which only uses the detected best part for categorization.

Fig. 8. Key (most discriminative) parts visualization for pairwise classes. Key parts are detected from testing images using the classifier learned from training images. Top 20 key parts are shown for each class. The important parts found by the proposed method coincide well with the rules human experts use to distinguish these birds. This figure is best viewed in color. (a) Red-bellied Woodpecker vs. Red-headed Woodpecker. (b) Red-winged Blackbird vs. Yellow-headed Blackbird. (c) Blue Jay vs. Green Jay.

D. Classification Results on Stanford Dogs

We show the categorization accuracy for Stanford Dogs in Table VIII. The proposed method (either with or without part selection) shows much better accuracy than the existing annotation-dependent works. Part selection also plays an important role in the proposed image representation, leading to a 2.69% improvement over the original representation. Stanford Dogs is a subset of ImageNet. It is also evaluated in state-of-the-art weakly-supervised works [25], [29], whose results are significantly lower than ours.

E. Classification Results on VMMR-40

VMMR-40 is a recently released large-scale dataset for car recognition. The images are captured from different angles by different users and devices. The cars are not well aligned. Some images contain irrelevant backgrounds. We show the classification accuracy in Table IX. We first test the classification accuracy using the CNN FC7 feature extracted from the whole image. Then, we test our proposed part based image representation with different part selection fractions.


TABLE VIII. CLASSIFICATION ACCURACY ON STANFORDDOGS

Fig. 9. Key part detection of two models in VMMR-40 dataset [11]: acura-integra-1991 and acura-integra

TABLE IX. CLASSIFICATION ACCURACY ON VMMR-40

We can see that the part based image representation greatly outperforms the whole image representation. Part selection does not improve the accuracy as much as observed on the previous two datasets. This is because the backgrounds in VMMR-40 images are less complex than those in CUB and Stanford Dogs. The classification results validate the capability of the proposed method in characterizing unaligned fine-grained objects in complex backgrounds. We also show the detected key parts in Fig. 9. We can see that the main difference between the two models lies in the rear lights, which are accurately detected.

F. Discussions

The major argument of this paper is that part selection is a more natural and efficient choice than using part detectors in weakly-supervised fine-grained image categorization. Particularly, we find that:

• It is hard to learn accurate part detectors to align objects without object / part annotations in fine-grained image categorization (cf. Table III).

• Multi-scale part representation is important to characterize fine-grained objects on different scales (cf. Table I).

• Selecting multiple good parts is better than detecting one best part in fine-grained object recognition (cf. Table III and Fig. 8).

• Selected parts are discriminative for categorization by discarding the background noise in images (cf. Fig. 8).

We have provided the following methods for efficient representation of fine-grained objects in the weakly-supervised setup:

• Multi-max pooling (MMP) is an efficient way to generate multi-scale part proposals from the CNN outputs on object proposals.

• Part selection is necessary to reduce the background noise in images, and is more efficient than methods that try to learn accurate object/part detectors.

• Encoding useful part proposals on different scales separately (ScPM) can highlight the subtle distinctions among fine-grained objects.

In our experience, there is one issue with the proposed framework: the part proposal generation process may introduce heavy computation when the numbers of images and object proposals in the dataset are very large. Our part proposals are generated from the CNN, which is applied on each object proposal. It is important to investigate how to reduce the number of effective object proposals (so that the CNN is applied to fewer object proposals) or how to generate part proposals directly from the CNN computed on whole images.

V. CONCLUSIONS

In this paper, we have proposed to categorize fine-grained images without using any object/part annotation either in the training or in the testing stage. Our basic idea is to select multiple useful parts from multi-scale part proposals and use them to compute a global image representation for categorization. This is specially designed for fine-grained categorization in the weakly-supervised scenario, because parts have been shown to play an important role in the existing annotation-dependent works, and accurate part detectors are usually hard to acquire. Particularly, we propose an efficient multi-max pooling strategy to generate multi-scale part proposals by using the internal outputs of the CNN on object proposals in each image. Then, we select useful parts from those part clusters which are important for categorization. Finally, we encode the selected parts at different scales separately in a global image representation. With the proposed image/part representation technique, we detect the key parts of objects in different classes, whose visualization results are intuitive and coincide well with the rules used by human experts. In the experiments, on three challenging datasets, our proposed weakly-supervised method achieves comparable or better results than those of state-of-the-art weakly-supervised works [25], [29] and most existing annotation-dependent methods.
Future work would include utilizing the part information mined from the global image representation to help localize objects and further improve classification.

REFERENCES

[1] C. Wah, S. Branson, P. Welinder, P. Perona, and S. Belongie, "The Caltech-UCSD birds dataset," Dept. Comput. Neural Syst. Program, California Inst. Technol., Pasadena, CA, USA, Tech. Rep. CNS-TR.
[2] T. Berg, J. Liu, S. W. Lee, M. L. Alexander, D. W. Jacobs, and P. N. Belhumeur, "Birdsnap: Large-scale fine-grained visual categorization of birds," in Proc. IEEE Int. Conf. Comput. Vis. Pattern Recognit., Jun. 2014.
[3] A. Iscen, G. Tolias, P.-H. Gosselin, and H. Jegou, "A comparison of dense region detectors for image search and fine-grained classification," IEEE Trans. Image Process., vol. 24, no. 8, Aug.
[4] L. Xie, Q. Tian, M. Wang, and B. Zhang, "Spatial pooling of heterogeneous features for image classification," IEEE Trans. Image Process., vol. 23, no. 5, May.
[5] A. Khosla, N. Jayadevaprakash, B. Yao, and F.-F. Li, "Novel dataset for fine-grained image categorization," in Proc. 1st Workshop Fine-Grained Vis. Categorization (FGVC), 2011.
[6] A. Vedaldi et al., "Understanding objects in detail with fine-grained attributes," in Proc. IEEE Int. Conf. Comput. Vis. Pattern Recognit., Jun. 2014.
[7] M.-E. Nilsback and A. Zisserman, "Automated flower classification over a large number of classes," in Proc. Indian Conf. Comput. Vis., Graph. Image Process., 2008.
[8] A. R. Sfar, N. Boujemaa, and D. Geman, "Vantage feature frames for fine-grained categorization," in Proc. IEEE Int. Conf. Comput. Vis. Pattern Recognit., Jun. 2013.
[9] S. Gao, I. W.-H. Tsang, and Y. Ma, "Learning category-specific dictionary and shared dictionary for fine-grained image categorization," IEEE Trans. Image Process., vol. 23, no. 2, Feb.
[10] E. Rodner, M. Simon, G. Brehm, S. Pietsch, J. W. Wagele, and J. Denzler, "Fine-grained recognition datasets for biodiversity analysis," in Proc. 3rd Workshop Fine-Grained Vis. Categorization (FGVC), 2015.
[11] A. Ben Khalifa and H. Frigui, "A dataset for vehicle make and model recognition," in Proc. 3rd Workshop Fine-Grained Vis. Categorization (FGVC), 2015.
[12] R. Farrell, O. Oza, N. Zhang, V. I. Morariu, T. Darrell, and L. S. Davis, "Birdlets: Subordinate categorization using volumetric primitives and pose-normalized appearance," in Proc. IEEE Int. Conf. Comput. Vis., Nov. 2011.
[13] N. Zhang, R. Farrell, and T. Darrell, "Pose pooling kernels for subcategory recognition," in Proc. IEEE Int. Conf. Comput. Vis. Pattern Recognit., Jun. 2012.
[14] L. Bourdev, S. Maji, T. Brox, and J. Malik, "Detecting people using mutually consistent poselet activations," in Proc. 11th Eur. Conf. Comput. Vis., 2010.
[15] N. Zhang, R. Farrell, F. Iandola, and T. Darrell, "Deformable part descriptors for fine-grained recognition and attribute prediction," in Proc. IEEE Int. Conf. Comput. Vis. Pattern Recognit., Dec. 2013.
[16] P. F. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ramanan, "Object detection with discriminatively trained part-based models," IEEE Trans. Pattern Anal. Mach. Intell., vol. 32, no. 9, Sep.
[17] E. Gavves, B. Fernando, C. G. M. Snoek, A. W. M. Smeulders, and T. Tuytelaars, "Fine-grained categorization by alignments," in Proc. IEEE Int. Conf. Comput. Vis., Dec. 2013.
[18] C. Goring, E. Rodner, A. Freytag, and J. Denzler, "Nonparametric part transfer for fine-grained recognition," in Proc. IEEE Int. Conf. Comput. Vis. Pattern Recognit., Jun. 2014.
[19] N. Zhang, J. Donahue, R. Girshick, and T. Darrell, "Part-based R-CNNs for fine-grained category detection," in Proc. 13th Eur. Conf. Comput. Vis., 2014.
[20] J. Donahue et al., "DeCAF: A deep convolutional activation feature for generic visual recognition," in Proc. Int. Conf. Mach. Learn., 2014.
[21] N. Zhang, M. Paluri, M. Ranzato, T. Darrell, and L. Bourdev, "PANDA: Pose aligned networks for deep attribute modeling," in Proc. IEEE Int. Conf. Comput. Vis. Pattern Recognit., Jun. 2014.
[22] J. Krause, H. Jin, J. Yang, and F.-F. Li, "Fine-grained recognition without part annotations," in Proc. IEEE Int. Conf. Comput. Vis. Pattern Recognit., Jun. 2015.
[23] J. Deng, J. Krause, and F.-F. Li, "Fine-grained crowdsourcing for fine-grained recognition," in Proc. IEEE Int. Conf. Comput. Vis. Pattern Recognit., Jun. 2013.
[24] C. Wah, G. Van Horn, S. Branson, S. Maji, P. Perona, and S. Belongie, "Similarity comparisons for interactive fine-grained categorization," in Proc. IEEE Int. Conf. Comput. Vis. Pattern Recognit., Jun. 2014.
[25] T. Xiao, Y. Xu, K. Yang, J. Zhang, Y. Peng, and Z. Zhang, "The application of two-level attention models in deep convolutional neural network for fine-grained image classification," in Proc. IEEE Int. Conf. Comput. Vis. Pattern Recognit., Jun. 2015.
[26] M. Juneja, A. Vedaldi, C. V. Jawahar, and A. Zisserman, "Blocks that shout: Distinctive parts for scene classification," in Proc. IEEE Int. Conf. Comput. Vis. Pattern Recognit., Jun. 2013.
[27] C. Doersch, A. Gupta, and A. A. Efros, "Mid-level visual element discovery as discriminative mode seeking," in Proc. Adv. Neural Inf. Process. Syst., 2013.
[28] J. Sun and J. Ponce, "Learning discriminative part detectors for image classification and cosegmentation," in Proc. IEEE Int. Conf. Comput. Vis., Dec. 2013.
[29] M. Simon and E. Rodner, "Neural activation constellations: Unsupervised part model discovery with convolutional networks," in Proc. IEEE Int. Conf. Comput. Vis., Dec. 2015.
[30] R. Girshick, J. Donahue, T. Darrell, and J. Malik, "Rich feature hierarchies for accurate object detection and semantic segmentation," in Proc. IEEE Int. Conf. Comput. Vis. Pattern Recognit., Jun. 2014.
[31] R. Girshick, "Fast R-CNN," in Proc. IEEE Int. Conf. Comput. Vis., 2015.
[32] S. Ren, K. He, R. Girshick, and J. Sun, "Faster R-CNN: Towards real-time object detection with region proposal networks," in Proc. Adv. Neural Inf. Process. Syst., 2015.
[33] K. Lenc and A. Vedaldi, "R-CNN minus R," in Proc. Brit. Mach. Vis. Conf. (BMVC), 2015.
[34] D. Yoo, S. Park, J.-Y. Lee, and I. S. Kweon. (2014). "Fisher kernel for deep neural activations." [Online]. Available: abs/
[35] J. Sánchez, F. Perronnin, T. Mensink, and J. Verbeek, "Image classification with the Fisher vector: Theory and practice," Int. J. Comput. Vis., vol. 105, no. 3.
[36] S. Lazebnik, C. Schmid, and J. Ponce, "Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories," in Proc. IEEE Int. Conf. Comput. Vis. Pattern Recognit., Jun. 2006.
[37] Y. Zhang, J. Wu, J. Cai, and W. Lin, "Flexible image similarity computation using hyper-spatial matching," IEEE Trans. Image Process., vol. 23, no. 9, Sep.
[38] M. Jaderberg, K. Simonyan, A. Zisserman, and K. Kavukcuoglu, "Spatial transformer networks," in Proc. Adv. Neural Inf. Process. Syst., 2015.
[39] Y. Jia, C. Huang, and T. Darrell, "Beyond spatial pyramids: Receptive field learning for pooled image features," in Proc. IEEE Int. Conf. Comput. Vis. Pattern Recognit., Jun. 2012.
[40] K. Chatfield, K. Simonyan, A. Vedaldi, and A. Zisserman, "Return of the devil in the details: Delving deep into convolutional nets," in Proc. Brit. Mach. Vis. Conf., 2014.
[41] M. D. Zeiler and R. Fergus, "Visualizing and understanding convolutional networks," in Proc. 13th Eur. Conf. Comput. Vis., 2014.
[42] J. R. R. Uijlings, K. E. A. van de Sande, T. Gevers, and A. W. M. Smeulders, "Selective search for object recognition," Int. J. Comput. Vis., vol. 104, no. 2, Apr.
[43] Y. Gong, L. Wang, R. Guo, and S. Lazebnik, "Multi-scale orderless pooling of deep convolutional activation features," in Proc. 13th Eur. Conf. Comput. Vis., 2014.
[44] H. Jégou, F. Perronnin, M. Douze, J. Sánchez, P. Pérez, and C. Schmid, "Aggregating local image descriptors into compact codes," IEEE Trans. Pattern Anal. Mach. Intell., vol. 34, no. 9, Sep.
[45] Y. Zhang, J. Wu, and J. Cai, "Compact representation for image classification: To choose or to compress?" in Proc. IEEE Int. Conf. Comput. Vis. Pattern Recognit., Jun. 2014.
[46] J. Wu, Y. Zhang, and W. Lin, "Towards good practices for action video encoding," in Proc. IEEE Int. Conf. Comput. Vis. Pattern Recognit., Jun. 2014.
[47] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "Imagenet classification with deep convolutional neural networks," in Proc. Adv. Neural Inf. Process. Syst., 2012.


13 ZHANG et al.: WEAKLY SUPERVISED FINE-GRAINED CATEGORIZATION WITH PART-BASED IMAGE REPRESENTATION [48] R.-E. Fan, K.-W. Chang, C.-J. Hseh, X.-R. Wang, and C.-J. Ln, LIBLINEAR: A lbrary for large lnear classfcaton, J. Mach. Learn. Res., vol. 9, pp , Jun [49] L. Lu, C. Shen, and A. van den Hengel, The treasure beneath convolutonal layers: Cross-convolutonal-layer poolng for mage classfcaton, n Proc. IEEE Int. Conf. Comput. Vs. Pattern Recognt., Jun. 2015, pp [50] S. Branson, G. Van Horn, S. Belonge, and P. Perona, Improved brd speces reognton usng pose normalzed deep convoluton nets, n Proc. Brt. Mach. Vs. Conf., 2014, pp [51] K. Smonyan and A. Zsserman, Very deep convolutonal networks for large-scale mage recognton, n Proc. Int. Conf. Learn. Represent., 2015, pp [52] S. Yang, L. Bo, J. Wang, and L. Shapro, Unsupervsed template learnng for fne-graned object recognton, n Proc. Adv. Neural Inf. Process. Syst., 2012, pp [53] J. Pu, Y.-G. Jang, J. Wang, and X. Xue, Whch looks lke whch: Explorng nter-class relatonshps n fne-graned vsual categorzaton, n Proc. 13th Eur. Conf. Comput. Vs., 2014, pp Yu Zhang receved the B.S. and M.S. degrees n telecommuncatons engneerng from Xdan Unversty, Chna, and the Ph.D. degree n computer engneerng from Nanyang Technologcal Unversty, Sngapore. He s currently a Post-Doctoral Fellow wth the Bonformatcs Insttute, Agency for Scence, Technology and Research, Sngapore. Hs research nterest s computer vson. Xu-Shen We receved the B.S. degree n computer scence and technology n He s currently pursung the Ph.D. degree wth the Department of Computer Scence and Technology, Nanjng Unversty, Chna. Hs research nterests are computer vson and machne learnng. Janxn Wu (M 09) receved the B.S. and M.S. degrees n computer scence from Nanjng Unversty, Chna, and the Ph.D. degree n computer scence from the Georga Insttute of Technology. He was an Assstant Professor wth Nanyang Technologcal Unversty, Sngapore. 
He s currently a Professor wth the Department of Computer Scence and Technology, Nanjng Unversty, and s assocated wth the Natonal Key Laboratory for Novel Software Technology, Chna. Hs research nterests are computer vson and machne learnng. He has served as an Area Char for ICCV 2015 and a Senor PC Member for AAAI Janfe Ca (S 98 M 02 SM 07) receved the Ph.D. degree from the Unversty of Mssour Columba. He s currently an Assocate Professor and has served as the Head of the Vsual and Interactve Computng Dvson and the Computer Communcaton Dvson wth the School of Computer Engneerng, Nanyang Technologcal Unversty, Sngapore. He has authored more than 170 techncal papers n nternatonal journals and conferences. Hs major research nterests nclude computer vson, vsual computng, and multmeda networkng. He has been actvely partcpatng n program commttees of varous conferences. He has served as the Leadng Techncal Program Char of the IEEE Internatonal Conference on Multmeda and Expo 2012 and the Leadng General Char of the Pacfc-Rm Conference on Multmeda Snce 2013, he has served as an Assocate Edtor of the IEEE T RANSACTIONS ON I MAGE P ROCESSING. He was served as an Assocate Edtor of the IEEE T RANSACTIONS ON C IRCUITS AND S YSTEMS FOR V IDEO T ECHNOLOGY from 2006 to Jangbo Lu (M 09 SM 15) receved the B.S. and M.S. degrees n electrcal engneerng from Zhejang Unversty, Hangzhou, Chna, n 2000 and 2003, respectvely, and the Ph.D. degree n electrcal engneerng, Katholeke Unverstet Leuven, Leuven, Belgum, n From 2003 to 2004, he was wth VIA-S3 Graphcs, Shangha, Chna, as a Graphcs Processng Unt Archtecture Desgn Engneer. In 2002 and 2005, he conducted vstng research wth Mcrosoft Research Asa, Bejng, Chna. Snce 2004, he has been wth the Multmeda Group, Interunversty Mcroelectroncs Center, Leuven, Belgum, as a Ph.D. Researcher. 
Since 2009, he has been with the Advanced Digital Sciences Center, Singapore, which is a joint research center between the University of Illinois at Urbana-Champaign, Urbana, and the Agency for Science, Technology and Research, Singapore, where he is leading a few research projects as a Senior Research Scientist. His research interests include computer vision, visual computing, image processing, video communication, interactive multimedia applications and systems, and efficient algorithms for various architectures. He received the 2012 TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY Best Associate Editor Award. He was an Associate Editor of the IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY.

Viet-Anh Nguyen (M'10) received the B.S. and Ph.D. degrees in electrical and electronic engineering from Nanyang Technological University (NTU), Singapore, in 2004 and 2010, respectively. He was with the School of Electrical and Electronic Engineering, NTU, as a Research Fellow beginning in 2008. He is currently with the Advanced Digital Sciences Center, Singapore, which was jointly founded by the University of Illinois at Urbana-Champaign and the Agency for Science, Technology and Research, a Singapore government agency. His research interests include image and video processing, media compression and delivery, computer vision, and real-time multimedia systems.

Minh N. Do (M'01-SM'07-F'14) was born in Vietnam. He received the B.Eng. degree in computer engineering from the University of Canberra, Australia, in 1997, and the Dr.Sc. degree in communication systems from the Swiss Federal Institute of Technology Lausanne (EPFL), Switzerland. He was the Co-Founder and CTO of Personify Inc., a spin-off from UIUC to commercialize depth-based visual communication.
Since 2002, he has been a Faculty Member with the University of Illinois at Urbana-Champaign (UIUC), where he is currently a Professor with the Department of Electrical and Computer Engineering, and holds joint appointments with the Coordinated Science Laboratory, the Beckman Institute for Advanced Science and Technology, and the Department of Bioengineering. His research interests include signal processing, computational imaging, geometric vision, and data analytics. He received a Silver Medal from the 32nd International Mathematical Olympiad in 1991, a University Medal from the University of Canberra in 1997, a Doctorate Award from EPFL in 2001, a CAREER Award from the National Science Foundation in 2003, and a Young Author Best Paper Award from the IEEE. He was named a Beckman Fellow at the Center for Advanced Study, UIUC, in 2006, and received a Xerox Award for Faculty Research from the College of Engineering, UIUC. He was a member of the IEEE Signal Processing Theory and Methods Technical Committee and the Image, Video, and Multidimensional Signal Processing Technical Committee, and an Associate Editor of the IEEE TRANSACTIONS ON IMAGE PROCESSING.
